cs.CL

[1] Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language

Happymore Masoka

Main category: cs.CL

TL;DR: This paper presents Shona spaCy, an open-source, rule-based morphological analysis tool for Shona built on the spaCy framework. It combines a lexicon with linguistic rules for part-of-speech tagging and morphological feature analysis, reaching 90% and 88% accuracy respectively, and supports the development of language technology for low-resource Bantu languages.

Details

Motivation: Shona, an under-served Bantu language, lacks morphological analysis tools in natural language processing, so language technology that supports it is needed to advance digital inclusion.

Method: A rule-based approach combines a hand-curated JSON lexicon with linguistic rules to model noun-class prefixes, verbal subject concords, tense-aspect markers, ideophones, and clitics, and integrates them into spaCy's token-level annotation system.

Result: Evaluation on formal and informal Shona corpora shows 90% POS-tagging accuracy and 88% morphological-feature accuracy, and the system keeps its linguistic decisions transparent.

Conclusion: Shona spaCy improves NLP accessibility for Shona and offers a reusable template for building morphological analysis tools for other low-resource Bantu languages.

Abstract: Despite rapid advances in multilingual natural language processing (NLP), the Bantu language Shona remains under-served in terms of morphological analysis and language-aware tools. This paper presents Shona spaCy, an open-source, rule-based morphological pipeline for Shona built on the spaCy framework. The system combines a curated JSON lexicon with linguistically grounded rules to model noun-class prefixes (Mupanda 1-18), verbal subject concords, tense-aspect markers, ideophones, and clitics, integrating these into token-level annotations for lemma, part-of-speech, and morphological features. The toolkit is available via pip install shona-spacy, with source code at https://github.com/HappymoreMasoka/shona-spacy and a PyPI release at https://pypi.org/project/shona-spacy/0.1.4/. Evaluation on formal and informal Shona corpora yields 90% POS-tagging accuracy and 88% morphological-feature accuracy, while maintaining transparency in its linguistic decisions. By bridging descriptive grammar and computational implementation, Shona spaCy advances NLP accessibility and digital inclusion for Shona speakers and provides a template for morphological analysis tools for other under-resourced Bantu languages.
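To make the lexicon-plus-rules idea concrete, here is a minimal sketch of a lookup that falls back to noun-class prefix rules and records which rule fired. It does not use or reproduce the shona-spacy API: the function `analyze`, the two lexicon entries, and the four-prefix table are illustrative assumptions, and real Shona prefixes are more ambiguous than shown (for example, mu- also marks class 3).

```python
# Toy rule-based analyzer: a tiny lexicon lookup backed by noun-class prefix rules.
# Lexicon entries and the prefix table are illustrative assumptions,
# not the shona-spacy package's data or API.

# Hypothetical curated lexicon (the paper stores its lexicon as JSON).
LEXICON = {
    "munhu": {"lemma": "munhu", "pos": "NOUN", "morph": {"NounClass": "1"}},
    "vanhu": {"lemma": "munhu", "pos": "NOUN", "morph": {"NounClass": "2"}},
}

# A handful of noun-class prefixes, longest first so "zvi" and "chi" are tried early.
PREFIX_RULES = [("zvi", "8"), ("chi", "7"), ("va", "2"), ("mu", "1")]


def analyze(token: str) -> dict:
    """Return lemma, POS, morphology, and the source of the decision."""
    word = token.lower()
    if word in LEXICON:
        return {"text": token, **LEXICON[word], "source": "lexicon"}
    for prefix, noun_class in PREFIX_RULES:
        # Require a non-empty stem after the prefix.
        if word.startswith(prefix) and len(word) > len(prefix):
            return {
                "text": token,
                "lemma": word,
                "pos": "NOUN",
                "morph": {"NounClass": noun_class},
                "source": f"rule:{prefix}-",
            }
    # Nothing matched: leave the token unanalyzed rather than guess.
    return {"text": token, "lemma": word, "pos": "X", "morph": {}, "source": "none"}
```

Keeping a `source` field on every analysis is one simple way to get the kind of decision transparency the abstract describes, since each tag can be traced to a lexicon entry or a named rule.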

Dong Liu,Yanxuan Yu

Main category: cs.CL

TL;DR: This paper proposes Semantic Pyramid Indexing (SPI), a multi-resolution indexing framework for vector databases in retrieval-augmented generation systems. It adjusts retrieval granularity per query, substantially improving retrieval speed and memory efficiency while preserving semantic relevance.

Details

Motivation: Existing vector database retrieval relies on single-granularity index structures that cannot adapt to how much semantic detail different queries need, which leads to a poor trade-off between retrieval speed and contextual relevance.

Method: SPI builds a semantic pyramid over document embeddings and uses a lightweight classifier to select the optimal resolution level for each query, enabling progressive coarse-to-fine retrieval. It can be integrated as a plugin into mainstream vector databases such as FAISS and Qdrant.

Result: On MS MARCO, Natural Questions, and multimodal retrieval benchmarks, SPI achieves up to 5.7× retrieval speedup and a 1.8× memory efficiency gain, and improves end-to-end QA F1 by up to 2.5 points. Theoretical analysis provides guarantees on retrieval quality and latency bounds, and ablation studies validate the contribution of each component.

Conclusion: SPI is an efficient, highly compatible multi-resolution indexing scheme that needs no additional training. It significantly improves retrieval efficiency in RAG systems without sacrificing semantic coverage and is suitable for production deployment.

Abstract: Retrieval-Augmented Generation (RAG) systems have become a dominant approach to augment large language models (LLMs) with external knowledge. However, existing vector database (VecDB) retrieval pipelines rely on flat or single-resolution indexing structures, which cannot adapt to the varying semantic granularity required by diverse user queries. This limitation leads to suboptimal trade-offs between retrieval speed and contextual relevance. To address this, we propose Semantic Pyramid Indexing (SPI), a novel multi-resolution vector indexing framework that introduces query-adaptive resolution control for RAG in VecDBs. Unlike existing hierarchical methods that require offline tuning or separate model training, SPI constructs a semantic pyramid over document embeddings and dynamically selects the optimal resolution level per query through a lightweight classifier. This adaptive approach enables progressive retrieval from coarse-to-fine representations, significantly accelerating search while maintaining semantic coverage. We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks. SPI achieves up to 5.7× retrieval speedup and 1.8× memory efficiency gain while improving end-to-end QA F1 scores by up to 2.5 points compared to strong baselines. Our theoretical analysis provides guarantees on retrieval quality and latency bounds, while extensive ablation studies validate the contribution of each component. The framework's compatibility with existing VecDB infrastructures makes it readily deployable in production RAG systems. Code is available at https://github.com/FastLM/SPI_VecDB.
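The coarse-to-fine idea can be illustrated with a two-level toy index: a coarse level of cluster centroids and a fine level of individual documents. This is only a sketch under stated assumptions, not the SPI implementation: the clusters are given by hand, the `fine` flag stands in for the paper's per-query resolution classifier, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_pyramid(doc_vectors, clusters):
    """Coarse level: one centroid per cluster. Fine level: the documents themselves."""
    centroids = {cid: mean([doc_vectors[d] for d in ids]) for cid, ids in clusters.items()}
    return {"centroids": centroids, "clusters": clusters, "docs": doc_vectors}


def search(pyramid, query, top_k=1, fine=True):
    # Coarse pass: pick the cluster whose centroid is closest to the query.
    best = max(pyramid["centroids"], key=lambda c: cosine(query, pyramid["centroids"][c]))
    members = pyramid["clusters"][best]
    if not fine:
        return members  # coarse answer: every document in the chosen cluster
    # Fine pass: rank only that cluster's documents.
    ranked = sorted(members, key=lambda d: cosine(query, pyramid["docs"][d]), reverse=True)
    return ranked[:top_k]


docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
pyramid = build_pyramid(docs, {"x": ["a", "b"], "y": ["c", "d"]})
print(search(pyramid, [0.0, 1.0]))
```

The speedup in schemes like this comes from the fine pass touching only one cluster's documents; the cost is that a wrong coarse choice can hide relevant documents, which is the trade-off a per-query resolution selector is meant to manage.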

[3] Bench360: Benchmarking Local LLM Inference from 360°

Linus Stuhlmann,Mauricio Fadel Argerich,Jonathan Fürst

Main category: cs.CL

TL;DR: This paper presents Bench360, a benchmarking framework for evaluating local LLM inference end to end. It supports multiple inference engines, usage scenarios, and quantization levels, and combines system and task-specific metrics to help users find the best configuration for their functional and non-functional requirements.

Details

Motivation: As running large language models locally becomes common, users face a large number of configuration choices and lack a unified, user-oriented benchmark that evaluates models, inference engines, and quantization levels across multiple metric dimensions.

Method: The Bench360 framework lets users define custom tasks, datasets, and task-specific metrics, and automatically evaluates LLMs across usage scenarios (single stream, batch, server) on system performance (such as latency, throughput, and energy use) and task performance (such as ROUGE, F1, and accuracy).

Result: Experiments on four common tasks (general knowledge and reasoning, QA, summarization, and text-to-SQL), three hardware platforms, and four state-of-the-art inference engines show clear trade-offs between task performance and system efficiency across configurations; no single configuration is best.

Conclusion: There is no one-size-fits-all setup for local LLM inference. Bench360 gives users a comprehensive, flexible, and easy-to-use evaluation tool, filling the gap left by existing benchmarks that are neither user-centric nor comprehensive.

Abstract: Running large language models (LLMs) locally is becoming increasingly common. While the growing availability of small open-source models and inference engines has lowered the entry barrier, users now face an overwhelming number of configuration choices. Identifying an optimal configuration -- balancing functional and non-functional requirements -- requires substantial manual effort. While several benchmarks target LLM inference, they are designed for narrow evaluation goals and not user-focused. They fail to integrate relevant system and task-specific metrics into a unified, easy-to-use benchmark that supports multiple inference engines, usage scenarios, and quantization levels. To address this gap, we present Bench360 -- Benchmarking Local LLM Inference from 360°. Bench360 allows users to easily define their own custom tasks along with datasets and relevant task-specific metrics and then automatically benchmarks selected LLMs, inference engines, and quantization levels across different usage scenarios (single stream, batch & server). Bench360 tracks a wide range of metrics, including (1) system metrics -- such as Computing Performance (e.g., latency, throughput), Resource Usage (e.g., energy per query), and Deployment (e.g., cold start time) -- and (2) task-specific metrics such as ROUGE, F1 score or accuracy. We demonstrate Bench360 on four common LLM tasks -- General Knowledge & Reasoning, QA, Summarization and Text-to-SQL -- across three hardware platforms and four state-of-the-art inference engines. Our results reveal several interesting trade-offs between task performance and system-level efficiency, highlighting the differences in inference engines and models. Most importantly, there is no single best setup for local inference, which strongly motivates the need for a framework such as Bench360.
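As a rough illustration of reporting system and task metrics side by side, the sketch below times a stand-in model one prompt at a time (the single-stream scenario) and scores its outputs. It is not Bench360's interface: `benchmark`, `exact_match`, and `fake_model` are hypothetical names, and energy or cold-start measurements are omitted because they need platform-specific tooling.

```python
import statistics
import time


def benchmark(generate, prompts, score):
    """Run `generate` on each (prompt, reference) pair sequentially and collect metrics."""
    latencies, scores = [], []
    start = time.perf_counter()
    for prompt, reference in prompts:
        t0 = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - t0)
        scores.append(score(output, reference))
    elapsed = time.perf_counter() - start
    return {
        "mean_latency_s": statistics.mean(latencies),            # system metric
        "throughput_qps": len(prompts) / elapsed if elapsed > 0 else float("inf"),
        "task_score": statistics.mean(scores),                   # task metric
    }


def exact_match(output, reference):
    return 1.0 if output.strip() == reference.strip() else 0.0


def fake_model(prompt):
    """Stand-in for an inference engine call; returns a canned answer."""
    return {"2+2?": "4"}.get(prompt, "unknown")


report = benchmark(fake_model, [("2+2?", "4"), ("capital of France?", "Paris")], exact_match)
print(report)
```

Swapping `fake_model` for calls into different engines or quantization levels, while holding the prompts and scorer fixed, gives the kind of like-for-like comparison the abstract argues for.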

[4] How Well Do LLMs Understand Tunisian Arabic?

Mohamed Mahdi

Main category: cs.CL

TL;DR: This study introduces a new parallel corpus containing Tunisian Arabic (Tunizi), standard Tunisian Arabic, and English translations with sentiment labels, used to evaluate how well large language models handle a low-resource language.

Details

Motivation: Industrial-scale LLMs often overlook comprehension of low-resource languages such as Tunisian Arabic. This may leave millions of speakers unable to interact with AI in their mother tongue, threatening linguistic and cultural heritage and influencing the language choices of younger generations.

Method: The authors build a parallel dataset of Tunizi, standard Tunisian Arabic, and English translations with sentiment annotations, and benchmark several mainstream LLMs on three tasks: transliteration, translation, and sentiment analysis.

Result: The experiments show significant differences between models in handling the Tunisian dialect, revealing both the strengths and the limitations of current models in understanding and processing low-resource dialects.

Conclusion: The quantified performance gaps show that including low-resource languages in next-generation AI systems is essential for keeping technology accessible, inclusive, and culturally relevant.

Abstract: Large Language Models (LLMs) are the engines driving today's AI agents. The better these models understand human languages, the more natural and user-friendly the interaction with AI becomes, from everyday devices like computers and smartwatches to any tool that can act intelligently. Yet, the ability of industrial-scale LLMs to comprehend low-resource languages, such as Tunisian Arabic (Tunizi), is often overlooked. This neglect risks excluding millions of Tunisians from fully interacting with AI in their own language, pushing them toward French or English. Such a shift not only threatens the preservation of the Tunisian dialect but may also create challenges for literacy and influence younger generations to favor foreign languages. In this study, we introduce a novel dataset containing parallel Tunizi, standard Tunisian Arabic, and English translations, along with sentiment labels. We benchmark several popular LLMs on three tasks: transliteration, translation, and sentiment analysis. Our results reveal significant differences between models, highlighting both their strengths and limitations in understanding and processing Tunisian dialects. By quantifying these gaps, this work underscores the importance of including low-resource languages in the next generation of AI systems, ensuring technology remains accessible, inclusive, and culturally grounded.

[5] Ellipsoid-Based Decision Boundaries for Open Intent Classification

Yuetian Zou,Hanlei Zhang,Hua Xu,Songze Li,Long Xiao

Main category: cs.CL

TL;DR: The paper proposes EliDecide, a new method for textual open intent classification based on learnable ellipsoid decision boundaries, achieving state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing adaptive decision boundary methods assume that known classes are isotropically distributed, which limits the expressiveness of the boundaries and fails to model distributional variance along different directions.

Method: Supervised contrastive learning builds a discriminative feature space; learnable matrices parameterize an ellipsoid boundary for each known class; a dual loss function optimizes the boundaries by balancing empirical risk against open-space risk.

Result: The method reaches state-of-the-art performance on multiple text intent classification benchmarks and a question classification dataset, confirming the flexibility of ellipsoid boundaries and their stronger open intent detection.

Conclusion: By modeling known-class distributions more faithfully with ellipsoid decision boundaries, EliDecide substantially improves open intent detection and shows good potential for generalization.

Abstract: Textual open intent classification is crucial for real-world dialogue systems, enabling robust detection of unknown user intents without prior knowledge and contributing to the robustness of the system. While adaptive decision boundary methods have shown great potential by eliminating manual threshold tuning, existing approaches assume isotropic distributions of known classes, restricting boundaries to balls and overlooking distributional variance along different directions. To address this limitation, we propose EliDecide, a novel method that learns ellipsoid decision boundaries with varying scales along different feature directions. First, we employ supervised contrastive learning to obtain a discriminative feature space for known samples. Second, we apply learnable matrices to parameterize ellipsoids as the boundaries of each known class, offering greater flexibility than spherical boundaries defined solely by centers and radii. Third, we optimize the boundaries via a newly designed dual loss function that balances empirical and open-space risks: expanding boundaries to cover known samples while contracting them against synthesized pseudo-open samples. Our method achieves state-of-the-art performance on multiple text intent benchmarks and further on a question classification dataset. The flexibility of the ellipsoids demonstrates superior open intent detection capability and strong potential for generalization to more text classification tasks in diverse complex open-world scenarios.
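To see why an ellipsoid can be more selective than a ball, the sketch below tests membership with the quadratic form ||L(x - c)||² ≤ 1, where c is a class center and L is a matrix. It is only an illustration under assumptions: the parameterization through L, the hand-set values, and the names `inside_ellipsoid` and `classify` are not taken from the paper, where the matrices are learned and the decision rule may differ.

```python
def inside_ellipsoid(x, center, L):
    """True if ||L (x - center)||^2 <= 1; the implied shape matrix is A = L^T L."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    projected = [sum(row[j] * diff[j] for j in range(len(diff))) for row in L]
    return sum(p * p for p in projected) <= 1.0


def classify(x, classes):
    """Return the first known class whose ellipsoid contains x, else 'open'."""
    for name, (center, L) in classes.items():
        if inside_ellipsoid(x, center, L):
            return name
    return "open"


# Hand-set example: semi-axis 2.0 along dimension 0 and 0.5 along dimension 1,
# so L = diag(1/2.0, 1/0.5). In the paper these matrices are learned.
classes = {"book_flight": ([0.0, 0.0], [[0.5, 0.0], [0.0, 2.0]])}

print(classify([1.5, 0.0], classes))  # along the wide axis
print(classify([0.0, 1.5], classes))  # same distance, along the narrow axis
```

The two query points are equally far from the center, so any spherical boundary must accept both or reject both; the ellipsoid accepts the first and rejects the second because the class varies less along the second dimension.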

[6] Prompt-Based Value Steering of Large Language Models

Giulio Antonio Abbo,Tony Belpaeme

Main category: cs.CL

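TL;DR: The paper proposes a model-agnostic evaluation procedure that quantifies how well a prompt steers a large language model toward text reflecting specific human values, and demonstrates the feasibility of value steering with a Wizard-Vicuna model and Schwartz's theory of values.

Details

Motivation: Large language models matter more and more in applications that require alignment with human values, but conventional fine-tuning is static and hard to adapt to changing value requirements.

Method: Based on Schwartz's theory of basic human values, the authors build a dialogue dataset and design a scoring method that quantifies the presence and gain of target values in generated text, then compare a baseline prompt with a prompt explicitly conditioned on values.

Result: The method is validated on a Wizard-Vicuna variant; the results show that value steering works even without modifying the model or dynamically optimizing prompts.

Conclusion: The proposed evaluation procedure is practical, reproducible, and model-agnostic. It measures how well a prompt steers the human values present in generated content and offers a new direction for dynamic value alignment.

Abstract: Large language models are increasingly used in applications where alignment with human values is critical. While model fine-tuning is often employed to ensure safe responses, this technique is static and does not lend itself to everyday situations involving dynamic values and preferences. In this paper, we present a practical, reproducible, and model-agnostic procedure to evaluate whether a prompt candidate can effectively steer generated text toward specific human values, formalising a scoring method to quantify the presence and gain of target values in generated responses. We apply our method to a variant of the Wizard-Vicuna language model, using Schwartz's theory of basic human values and a structured evaluation through a dialogue dataset. With this setup, we compare a baseline prompt to one explicitly conditioned on values, and show that value steering is possible even without altering the model or dynamically optimising prompts.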

[7] Concept-Based Interpretability for Toxicity Detection

Samarth Garg,Deeksha Varshney,Divya Singh

Main category: cs.CL

TL;DR: This study proposes an interpretability method based on Concept Gradient (CG) that treats subtypes of toxic language as concepts and analyzes their influence on model predictions. It builds a Targeted Lexicon Set and computes Word-Concept Alignment (WCA) scores to expose the causes of classification errors, then proposes a lexicon-free augmentation strategy to probe the model's attribution behavior when explicit lexical overlap is absent.

Details

Motivation: Despite progress in detecting toxic text, concept-based explanation remains under-explored, and existing methods often misclassify because concepts are attributed disproportionately.

Method: The Concept Gradient (CG) method measures the causal effect of concept changes on model output; the authors build a Targeted Lexicon Set of toxic words and compute Word-Concept Alignment (WCA) scores; a lexicon-free data augmentation strategy generates samples that contain no predefined toxic words.

Result: Some words cause misclassification because they are over-attributed to specific toxicity concepts, and WCA scores quantify this alignment effectively. Even after explicit toxic vocabulary is removed, the model still over-attributes to broader toxic patterns.

Conclusion: The method improves the interpretability of toxicity detection models and reveals their reliance on specific concepts and words. The lexicon-free augmentation strategy helps explain how models learn and attribute broader toxic language patterns.

Abstract: The rise of social networks has not only facilitated communication but also allowed the spread of harmful content. Although significant advances have been made in detecting toxic language in textual data, the exploration of concept-based explanations in toxicity detection remains limited. In this study, we leverage various subtype attributes present in toxicity detection datasets, such as obscene, threat, insult, identity attack, and sexual explicit as concepts that serve as strong indicators to identify whether language is toxic. However, disproportionate attribution of concepts towards the target class often results in classification errors. Our work introduces an interpretability technique based on the Concept Gradient (CG) method which provides a more causal interpretation by measuring how changes in concepts directly affect the output of the model. This is an extension of traditional gradient-based methods in machine learning, which often focus solely on input features. We propose the curation of a Targeted Lexicon Set, which captures toxic words that contribute to misclassifications in text classification models. To assess the significance of these lexicon sets in misclassification, we compute Word-Concept Alignment (WCA) scores, which quantify the extent to which these words lead to errors due to over-attribution to toxic concepts. Finally, we introduce a lexicon-free augmentation strategy by generating toxic samples that exclude predefined toxic lexicon sets. This approach allows us to examine whether over-attribution persists when explicit lexical overlap is removed, providing insights into the model's attribution on broader toxic language patterns.

[8] Falsely Accused: How AI Detectors Misjudge Slightly Polished Arabic Articles

Saleh Almohaimeed,Saad Almohaimeed,Mousa Jari,Khaled A. Alobaid,Fahad Alotaibi

Main category: cs.CL

TL;DR: This paper studies how AI detection models perform on human-written Arabic articles that have been lightly polished by AI, and finds that existing models readily misjudge such articles as AI-generated, with a marked drop in accuracy.

Details

Motivation: As AI writing tools spread, reliably telling human-written content from AI-generated content has become a challenge. When a human article has only been lightly polished by AI, existing detectors may misjudge it, which undermines their credibility and fairness. No such study had yet been done for Arabic.

Method: The authors build two datasets. The first contains 800 Arabic articles (400 AI-generated, 400 human-written) and is used to evaluate 14 large language models and commercial AI detectors; the 8 best-performing models are selected for further testing. The second, Ar-APT, contains 400 human-written articles polished by 10 LLMs under 4 polishing settings, 16,400 samples in total, and is used to assess how those 8 detectors judge lightly polished text.

Result: Every detector produces a large number of misjudgments. The best-performing LLM, Claude-4 Sonnet, reaches 83.51% accuracy on the original text but drops to 57.63% after light polishing by LLaMA-3; the commercial detector originality.AI falls from 92% accuracy to 12% after polishing by Mistral or Gemma-3.

Conclusion: Current AI detection models perform poorly on human-written Arabic text that has been lightly polished by AI and are highly prone to false accusations. More robust detection methods are needed for the mixed writing practices found in real use.

Abstract: Many AI detection models have been developed to counter the presence of articles created by artificial intelligence (AI). However, if a human-authored article is slightly polished by AI, a shift will occur in the borderline decision of these AI detection models, leading them to consider it an AI-generated article. This misclassification may result in falsely accusing authors of AI plagiarism and harm the credibility of AI detector models. In English, some efforts were made to meet this challenge, but not in Arabic. In this paper, we generated two datasets. The first dataset contains 800 Arabic articles, half AI-generated and half human-authored. We used it to evaluate 14 Large Language Models (LLMs) and commercial AI detectors to assess their ability in distinguishing between human-authored and AI-generated articles. The best 8 models were chosen to act as detectors for our primary concern, which is whether they would consider slightly polished human text as AI-generated. The second dataset, Ar-APT, contains 400 Arabic human-authored articles polished by 10 LLMs using 4 polishing settings, totaling 16,400 samples. We use it to evaluate the 8 nominated models and determine whether slight polishing will affect their performance. The results reveal that all AI detectors incorrectly attribute a significant number of articles to AI. The best-performing LLM, Claude-4 Sonnet, achieved 83.51% accuracy, but its performance decreased to 57.63% for articles slightly polished by LLaMA-3. The best-performing commercial model, originality.AI, dropped from 92% accuracy to 12% for articles slightly polished by Mistral or Gemma-3.
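The Ar-APT sample count and the reported accuracy drops can be checked with a few lines of arithmetic. One assumption is made here that the abstract does not state outright: the total of 16,400 is read as the 16,000 polished articles (400 × 10 × 4) plus the 400 unpolished originals.

```python
human_articles = 400
polishing_llms = 10
polishing_settings = 4

# Every article polished by every LLM under every setting.
polished = human_articles * polishing_llms * polishing_settings

# Assumed reading: the reported total also counts the unpolished originals.
total_with_originals = polished + human_articles

# Accuracy drops reported in the abstract, in percentage points.
drop_claude = 83.51 - 57.63      # Claude-4 Sonnet, after LLaMA-3 polishing
drop_originality = 92 - 12       # originality.AI, after Mistral or Gemma-3 polishing

print(polished, total_with_originals)
print(round(drop_claude, 2), drop_originality)
```

Under that reading the numbers are internally consistent, and the commercial detector's 80-point drop is roughly three times the 25.88-point drop of the best LLM-based detector.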

[9] Reproducibility Report: Test-Time Training on Nearest Neighbors for Large Language Models

Boyang Zhou,Johan Lindqvist,Lindsey Li

Main category: cs.CL

TL;DR: This paper reproduces the "test-time training on nearest neighbors" method of Hardt and Sun (2024). At inference time it retrieves neighboring sequences using Faiss-indexed RoBERTa embeddings and fine-tunes the language model with single gradient steps. The reproduction confirms that the method significantly reduces perplexity and related metrics across several models and datasets, especially for models not pretrained on similar data, and contributes a memory-efficient implementation.

Details

Motivation: To verify and extend the generality and robustness of Test-Time Training on Nearest Neighbors, to address its memory overhead in large-scale experiments, and to explore whether small models can approach the performance of larger ones with this method.

Method: Pretrained RoBERTa embeddings are indexed with Faiss. At test time, 20 nearest-neighbor sequences are retrieved for each input, and one gradient update per neighbor is applied to GPT-2, GPT-Neo, and R1-Distilled-Qwen2.5-1.5B. Retrieval is made memory-efficient by loading only the required line offsets.

Result: The method significantly reduces perplexity and bits-per-byte across multiple domains of The Pile, with the largest gains on structured data such as GitHub and EuroParl. Models not pretrained on The Pile benefit more, and small models can approach the performance of larger ones. The new memory optimization lowers per-server RAM requirements from over 128 GB to 32 GB. Consistent gains are also observed on R1-Distilled-Qwen2.5-1.5B.

Conclusion: Nearest-neighbor test-time training is general and practical, applies to modern reasoning-optimized architectures, and can be made cheaper through engineering optimization, offering an effective path to model adaptation.

Abstract: We reproduce the central claims of Test-Time Training on Nearest Neighbors for Large Language Models (Hardt and Sun, 2024), which proposes adapting a language model at inference time by fine-tuning on retrieved nearest-neighbor sequences. Using pretrained RoBERTa embeddings indexed with Faiss, we retrieve 20 neighbors per test input and apply one gradient update per neighbor across GPT-2 (117M, 774M), GPT-Neo (1.3B), and R1-Distilled-Qwen2.5-1.5B. Our experiments confirm that test-time training significantly reduces perplexity and bits-per-byte metrics across diverse domains from The Pile, with the largest improvements in structured or specialized datasets such as GitHub and EuroParl. We further validate that models not pretrained on The Pile benefit more from this adaptation than models already trained on similar data, allowing smaller models to approach the performance of larger ones. Due to infrastructure limitations, we introduce a memory-efficient retrieval implementation that loads only required line offsets rather than entire files, reducing RAM requirements from over 128 GB per server to 32 GB. We also extend the original study by evaluating R1-Distilled-Qwen2.5-1.5B, showing that test-time training yields consistent gains even for modern reasoning-optimized architectures. Overall, our results support the robustness and generality of nearest-neighbor test-time training while highlighting practical considerations for reproducing large-scale retrieval-augmented adaptation.
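The memory saving from loading line offsets rather than whole files can be sketched with plain file seeks. The abstract does not describe the authors' storage format, so this is one common way to do it and not their implementation; `build_offsets` and `read_lines` are hypothetical helpers, and the three-line temporary file stands in for a corpus shard.

```python
import os
import tempfile


def build_offsets(path):
    """One pass over the file, recording the byte offset where each line starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets


def read_lines(path, offsets, wanted):
    """Seek straight to each requested line instead of loading the whole file."""
    out = []
    with open(path, "rb") as f:
        for i in wanted:
            f.seek(offsets[i])
            out.append(f.readline().decode("utf-8").rstrip("\n"))
    return out


# Tiny demo file standing in for a corpus shard.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first sequence\nsecond sequence\nthird sequence\n")
    demo_path = tmp.name

offsets = build_offsets(demo_path)
# Pretend the retriever returned line ids 2 and 0 as nearest neighbors.
neighbors = read_lines(demo_path, offsets, [2, 0])
os.remove(demo_path)
print(neighbors)
```

Only the list of integer offsets stays in memory, which is small compared with the text itself; each retrieved neighbor then costs one seek and one line read.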

[10] How Language Directions Align with Token Geometry in Multilingual LLMs

JaeSeong Kim,Suan Lee

Main category: cs.CL

TL;DR: This paper systematically studies how language information is structured in the representation space of multilingual LLMs and how it evolves across layers. It finds that language information separates rapidly in the first transformer block and stays linearly separable throughout the model's depth, and that the alignment between language directions and vocabulary embeddings depends strongly on the composition of the training data.

Details

Motivation: There has been little systematic analysis of the internal structure of language information in multilingual LLMs and of how it evolves across layers.

Method: A probing study covers all 268 transformer layers of six multilingual LLMs, combining linear and nonlinear probes with a newly proposed Token-Language Alignment analysis.

Result: Language information separates sharply in the first transformer block (+76.4 ± 8.2 percentage points) and remains almost fully linearly separable in deeper layers. Chinese-inclusive models reach a ZH Match@Peak of 16.43%, far above the 3.90% of English-centric models, a 4.21× structural imprinting effect.

Conclusion: Multilingual LLMs distinguish languages through latent representational structures shaped by the training corpus, not through surface script features. The results have practical implications for data composition strategies and fairness in multilingual representation learning.

Abstract: Multilingual LLMs demonstrate strong performance across diverse languages, yet there has been limited systematic analysis of how language information is structured within their internal representation space and how it emerges across layers. We conduct a comprehensive probing study on six multilingual LLMs, covering all 268 transformer layers, using linear and nonlinear probes together with a new Token--Language Alignment analysis to quantify the layer-wise dynamics and geometric structure of language encoding. Our results show that language information becomes sharply separated in the first transformer block (+76.4 ± 8.2 percentage points from Layer 0 to 1) and remains almost fully linearly separable throughout model depth. We further find that the alignment between language directions and vocabulary embeddings is strongly tied to the language composition of the training data. Notably, Chinese-inclusive models achieve a ZH Match@Peak of 16.43%, whereas English-centric models achieve only 3.90%, revealing a 4.21× structural imprinting effect. These findings indicate that multilingual LLMs distinguish languages not by surface script features but by latent representational structures shaped by the training corpus. Our analysis provides practical insights for data composition strategies and fairness in multilingual representation learning. All code and analysis scripts are publicly available at: https://github.com/thisiskorea/How-Language-Directions-Align-with-Token-Geometry-in-Multilingual-LLMs.
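The 4.21× figure follows directly from the two Match@Peak values quoted in the abstract, as this short check shows. The variable names are only labels for the reported numbers.

```python
zh_inclusive = 16.43  # ZH Match@Peak (%), Chinese-inclusive models
en_centric = 3.90     # ZH Match@Peak (%), English-centric models

# Ratio of the two reported alignment rates.
ratio = zh_inclusive / en_centric
print(round(ratio, 2))
```

The effect is a ratio of alignment rates, so it says how much more often language directions match Chinese tokens in one group of models than in the other; it is not a difference in percentage points (that would be 12.53).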

[11] Hierarchical Retrieval with Out-Of-Vocabulary Queries: A Case Study on SNOMED CT

Jonathon Dilworth,Hui Yang,Jiaoyan Chen,Yongsheng Gao

Main category: cs.CL

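TL;DR: This paper proposes an approach based on language-model ontology embeddings for hierarchical concept retrieval from SNOMED CT with out-of-vocabulary (OOV) queries, and shows that it outperforms SBERT and lexical matching baselines.

Details

Motivation: Language ambiguity, synonymy, polysemy, and similar issues make knowledge retrieval in SNOMED CT challenging. The problem is worse for out-of-vocabulary (OOV) queries, which have no direct match in the ontology and make it hard to retrieve the relevant concepts accurately.

Method: Language-model-based ontology embeddings learn distributed representations of SNOMED CT concepts to support hierarchical concept retrieval for OOV queries. An annotated dataset is built to evaluate retrieval of the most direct subsumers and of more distant ancestors.

Result: On concept retrieval with OOV queries, the proposed approach outperforms SBERT and two lexical matching baselines, showing stronger semantic matching.

Conclusion: The approach addresses OOV query retrieval in SNOMED CT, generalizes well and can be extended to other ontologies, and the code and datasets are publicly released.

Abstract: SNOMED CT is a biomedical ontology with a hierarchical representation of large-scale concepts. Knowledge retrieval in SNOMED CT is critical for its application, but often proves challenging due to language ambiguity, synonyms, polysemies and so on. This problem is exacerbated when the queries are out-of-vocabulary (OOV), i.e., having no equivalent matches in the ontology. In this work, we focus on the problem of hierarchical concept retrieval from SNOMED CT with OOV queries, and propose an approach based on language model-based ontology embeddings. For evaluation, we construct OOV queries annotated against SNOMED CT concepts, testing the retrieval of the most direct subsumers and their less relevant ancestors. We find that our method outperforms the baselines including SBERT and two lexical matching methods. While evaluated against SNOMED CT, the approach is generalisable and can be extended to other ontologies. We release code, tools, and evaluation datasets at https://github.com/jonathondilworth/HR-OOV.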

[12] Detecting and Steering LLMs' Empathy in Action

Juan P. Cadile

Main category: cs.CL

TL;DR: The study investigates whether "empathy-in-action" exists as a linear direction in large language models. Using the EIA benchmark, it achieves high-accuracy detection and varying degrees of steering control across several models, and finds that architectures respond very differently to empathy enhancement and suppression.

Details

Motivation: To explore whether large language models trade off task efficiency against meeting human needs, and to identify a quantifiable direction for this behavior in activation space.

Method: Using contrastive prompts based on the EIA benchmark, the authors run direction-probing and intervention experiments in the activation space of Phi-3-mini, Qwen2.5-7B, and Dolphin-Llama-3.1-8B, evaluating detection performance (AUROC) and bidirectional steering.

Result: All models detect the empathy direction with high accuracy (AUROC ≥ 0.996), but cross-model correlation is weak. Qwen and Phi-3 achieve stable bidirectional steering, while Dolphin is effective only when enhancing empathy and breaks down under suppression.

Conclusion: The empathy-in-action direction is detectable in a range of LLMs, and its implementation is architecture-specific. Safety training does not affect detection but may affect steering robustness, which suggests that empathy regulation is complex and depends on model design.

Abstract: We investigate empathy-in-action -- the willingness to sacrifice task efficiency to address human needs -- as a linear direction in LLM activation space. Using contrastive prompts grounded in the Empathy-in-Action (EIA) benchmark, we test detection and steering across Phi-3-mini-4k (3.8B), Qwen2.5-7B (safety-trained), and Dolphin-Llama-3.1-8B (uncensored). Detection: All models show AUROC 0.996-1.00 at optimal layers. Uncensored Dolphin matches safety-trained models, demonstrating empathy encoding emerges independent of safety training. Phi-3 probes correlate strongly with EIA behavioral scores (r=0.71, p<0.01). Cross-model probe agreement is limited (Qwen: r=-0.06, Dolphin: r=0.18), revealing architecture-specific implementations despite convergent detection. Steering: Qwen achieves 65.3% success with bidirectional control and coherence at extreme interventions. Phi-3 shows 61.7% success with similar coherence. Dolphin exhibits asymmetric steerability: 94.4% success for pro-empathy steering but catastrophic breakdown for anti-empathy (empty outputs, code artifacts). Implications: The detection-steering gap varies by model. Qwen and Phi-3 maintain bidirectional coherence; Dolphin shows robustness only for empathy enhancement. Safety training may affect steering robustness rather than preventing manipulation, though validation across more models is needed.
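A linear direction of this kind is often estimated as the difference between mean activations on contrastive prompts, then used both to score activations (detection) and to shift them (steering). The sketch below shows that generic recipe on two-dimensional toy vectors. Whether the paper uses difference of means or a trained probe is not stated in the abstract, so the technique, the function names, and the numbers are all assumptions.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def direction(pos_acts, neg_acts):
    """Difference of class means, normalized to unit length."""
    diff = [p - n for p, n in zip(mean(pos_acts), mean(neg_acts))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def probe_score(activation, d):
    """Detection: project an activation onto the direction."""
    return sum(a * x for a, x in zip(activation, d))


def steer(activation, d, alpha):
    """Steering: alpha > 0 pushes toward the positive class, alpha < 0 away from it."""
    return [a + alpha * x for a, x in zip(activation, d)]


# Toy activations standing in for hidden states on contrastive prompts.
empathic = [[1.0, 0.2], [0.8, 0.0]]
task_only = [[-1.0, 0.1], [-0.8, 0.1]]

d = direction(empathic, task_only)
h = [0.0, 0.5]
print(probe_score(h, d), probe_score(steer(h, d, 2.0), d))
```

Because the direction has unit length, steering with coefficient alpha moves the probe score by exactly alpha, which makes intervention strength easy to compare across layers; it does not guarantee that the model's text stays coherent, which is the failure the abstract reports for anti-empathy steering on Dolphin.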

[13] NALA_MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bangla Instruction to Python Code Generation

Hossain Shaikh Saadi,Faria Alam,Mario Sanz-Guerrero,Minh Duc Bui,Manuel Mager,Katharina von der Wense

Main category: cs.CL

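TL;DR: This paper presents JGU Mainz's winning system for the BLP-2025 shared task. It uses a multi-agent code generation and debugging pipeline and achieves a Pass@1 score of 95.4 on generating code from Bangla instructions.

Details

Motivation: To improve the accuracy of code generated from Bangla instructions and to address errors in natural-language-to-code conversion.

Method: A multi-agent pipeline: a code-generation agent first produces initial code, which is then run against unit tests to detect failing cases. Only the failing test cases are passed to a debugger agent, which analyzes the error messages, the current program, and the relevant test cases and generates corrected code.

Result: The approach ranked first in the BLP-2025 shared task with a Pass@1 score of 95.4.

Conclusion: A collaborative multi-agent framework for code generation and debugging effectively improves code generation from non-English instructions and is practical and extensible.

Abstract: This paper presents JGU Mainz's winning system for the BLP-2025 Shared Task on Code Generation from Bangla Instructions. We propose a multi-agent-based pipeline. First, a code-generation agent produces an initial solution from the input instruction. The candidate program is then executed against the provided unit tests (pytest-style, assert-based). Only the failing cases are forwarded to a debugger agent, which reruns the tests, extracts error traces, and, conditioning on the error messages, the current program, and the relevant test cases, generates a revised solution. Using this approach, our submission achieved first place in the shared task with a Pass@1 score of 95.4. We also make our code public.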
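The generate, test, and debug loop described above can be sketched with stub agents in place of LLM calls. This is not the authors' code: `run_tests`, `solve`, the stub agents, and the single debugging round are assumptions made for illustration, and a real system would execute untrusted candidates in a sandbox instead of calling `exec` directly.

```python
def run_tests(program_src, tests):
    """Execute the candidate, then each assert-based test; return the failures."""
    namespace = {}
    try:
        exec(program_src, namespace)
    except Exception as exc:
        # The program itself is broken, so every test counts as failing.
        return [(t, f"{type(exc).__name__}: {exc}") for t in tests]
    failures = []
    for test in tests:
        try:
            exec(test, namespace)
        except Exception as exc:
            failures.append((test, f"{type(exc).__name__}: {exc}"))
    return failures


def solve(instruction, tests, coder, debugger, max_rounds=1):
    program = coder(instruction)
    for _ in range(max_rounds):
        failures = run_tests(program, tests)
        if not failures:
            break
        # Only the failing cases and their error messages go to the debugger.
        program = debugger(instruction, program, failures)
    return program


def stub_coder(instruction):
    """Stand-in for the code-generation agent; deliberately buggy."""
    return "def add(a, b):\n    return a - b\n"


def stub_debugger(instruction, program, failures):
    """Stand-in for the debugger agent; applies a canned fix."""
    return program.replace("a - b", "a + b")


tests = ["assert add(2, 3) == 5", "assert add(0, 0) == 0"]
fixed = solve("Add two numbers", tests, stub_coder, stub_debugger)
print(run_tests(fixed, tests))
```

Forwarding only the failing tests keeps the debugger's context short and focused; in this toy run the buggy candidate fails one of the two tests, and only that case reaches the debugger.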

[14] From Representation to Enactment: The ABC Framework of the Translating Mind

Michael Carl,Takanori Mizowaki,Aishvarya Raj,Masaru Yamada,Devi Sri Bandaru,Yuxiang Wei,Xinyue Ren

Main category: cs.CL

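TL;DR: This paper proposes a non-representational framework of the translating mind, based on Extended Mind theory and radical enactivism. It treats translation as an embodied, participatory activity that dynamically integrates affective, behavioral, and cognitive (ABC) processes, not as static symbol conversion.

Details

Motivation: To challenge traditional representation-based models of translation and to rethink, from a non-representational perspective, how meaning is generated in real time during translation.

Method: The ABC theoretical model of the translating mind is built by combining Extended Mind theory, radical enactivism, Predictive Processing, and (En)Active Inference.

Result: The translating mind is argued to emerge through loops of brain-body-environment interaction, with meaning co-created in embodied interaction with texts, tools, and contexts.

Conclusion: Translation should be understood as skillful participation in sociocultural practice, centered on a dynamic, embodied, and non-representational process of meaning-making.

Abstract: Building on the Extended Mind (EM) theory and radical enactivism, this article suggests an alternative to representation-based models of the mind. We lay out a novel ABC framework of the translating mind, in which translation is not the manipulation of static interlingual correspondences but an enacted activity, dynamically integrating affective, behavioral, and cognitive (ABC) processes. Drawing on Predictive Processing and (En)Active Inference, we argue that the translator's mind emerges, rather than being merely extended, through loops of brain-body-environment interactions. This non-representational account reframes translation as skillful participation in sociocultural practice, where meaning is co-created in real time through embodied interaction with texts, tools, and contexts.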

[15] Interpretable dimensions support an effect of agentivity and telicity on split intransitivity

Eva Neu,Brian Dillon,Katrin Erk

Main category: cs.CL

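TL;DR: This paper revisits the relationship between the unaccusative/unergative syntactic classification of intransitive verbs and their semantic properties (agentivity and telicity). Combining interpretable dimensions with human judgments, it finds support for the link between the two.

Details

Motivation: Recent work found that human ratings of agentivity and telicity predict the syntactic behavior of intransitive verbs poorly, so this paper re-examines the association between semantic features and syntactic classification.

Method: Interpretable semantic dimensions are built from seed words at opposite poles of the agentivity and telicity scales and analyzed together with human ratings.

Result: The findings support the associations between agentivity and unergativity and between telicity and unaccusativity, and show that interpretable dimensions reveal semantic properties that are hard to capture in rating tasks.

Conclusion: Combining interpretable semantic dimensions with human judgments provides more reliable evidence for the syntax-semantics mapping of intransitive verbs.

Abstract: Intransitive verbs fall into two different syntactic classes, unergatives and unaccusatives. It has long been argued that verbs describing an agentive action are more likely to appear in an unergative syntax, and those describing a telic event to appear in an unaccusative syntax. However, recent work by Kim et al. (2024) found that human ratings for agentivity and telicity were a poor predictor of the syntactic behavior of intransitives. Here we revisit this question using interpretable dimensions, computed from seed words on opposite poles of the agentive and telic scales. Our findings support the link between unergativity/unaccusativity and agentivity/telicity, and demonstrate that using interpretable dimensions in conjunction with human judgments can offer valuable evidence for semantic properties that are not easily evaluated in rating tasks.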

[16] PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models

Oscar Chew,Po-Yi Lu,Jayden Lin,Kuan-Hao Huang,Hsuan-Tien Lin

Main category: cs.CL

TL;DR: The paper proposes PEPPER, which defends text-to-image diffusion models against backdoor attacks through semantic rewriting and the addition of unobtrusive elements.

Details

Motivation: Existing text-to-image diffusion models are vulnerable to backdoor attacks triggered from the input prompt and need an effective defense.

Method: The caption is rewritten into one that is semantically distant yet visually similar, and unobtrusive elements are added, which disrupts the trigger in the prompt.

Result: PEPPER is highly effective against text-encoder-based attacks, substantially reducing attack success while preserving generation quality, and it can be combined with other defenses.

Conclusion: PEPPER effectively improves the robustness of text-to-image diffusion models against backdoor attacks and is general and compatible with other methods.

Abstract: Recent studies show that text-to-image (T2I) diffusion models are vulnerable to backdoor attacks, where a trigger in the input prompt can steer generation toward harmful or unintended content. To address this, we introduce PEPPER (PErcePtion Guided PERturbation), a backdoor defense that rewrites the caption into a semantically distant yet visually similar caption while adding unobtrusive elements. With this rewriting strategy, PEPPER disrupts the trigger embedded in the input prompt, dilutes the influence of trigger tokens, and thereby achieves enhanced robustness. Experiments show that PEPPER is particularly effective against text-encoder-based attacks, substantially reducing attack success while preserving generation quality. Beyond this, PEPPER can be paired with any existing defenses, yielding consistently stronger and more generalizable robustness than any standalone method. Our code will be released on GitHub.

[17] ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers

Seyed Mohssen Ghafari,Ronny Kol,Juan C. Quiroz,Nella Luan,Monika Patial,Chanaka Rupasinghe,Herman Wandabwa,Luiz Pizzato

Main category: cs.CL

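TL;DR: The paper proposes a new reference-free metric for evaluating the conciseness of LLM-generated responses. It measures redundant content through three compression calculations, effectively identifies redundancy in outputs, and enables automated evaluation of response conciseness in conversational AI systems.

Details

Motivation: Large language models often produce long, redundant responses, which hurts clarity and user satisfaction and raises development costs, especially with commercial models billed per output token. An automated conciseness evaluation method that needs no human annotation is therefore required.

Method: A reference-free conciseness metric combines three forms of compression: 1) the compression ratio between the original response and an LLM-generated abstractive summary; 2) the compression ratio between the original response and an extractive summary; 3) word-removal compression, in which an LLM deletes non-essential words and the number removed serves as the conciseness score.

Result: Experiments show that the metric effectively identifies redundant content in LLM outputs and accurately reflects response conciseness without human annotation.

Conclusion: The proposed reference-free conciseness metric gives conversational AI systems a practical, automated evaluation tool that helps make generated responses more concise, lowers operating costs, and improves user experience.

Abstract: Large language models (LLMs) frequently generate responses that are lengthy and verbose, filled with redundant or unnecessary details. This diminishes clarity and user satisfaction, and it increases costs for model developers, especially with well-known proprietary models that charge based on the number of output tokens. In this paper, we introduce a novel reference-free metric for evaluating the conciseness of responses generated by LLMs. Our method quantifies non-essential content without relying on gold standard references and calculates the average of three calculations: i) a compression ratio between the original response and an LLM abstractive summary; ii) a compression ratio between the original response and an LLM extractive summary; and iii) word-removal compression, where an LLM removes as many non-essential words as possible from the response while preserving its meaning, with the number of tokens removed indicating the conciseness score. Experimental results demonstrate that our proposed metric identifies redundancy in LLM outputs, offering a practical tool for automated evaluation of response brevity in conversational AI systems without the need for ground truth human annotations.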
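To show how three compression measurements might be averaged into a single number, the sketch below scores a response by the share of its tokens that each compressed version drops. Several things here are assumptions and not the paper's definitions: the normalization to a fraction in [0, 1], the whitespace tokenizer, the reading of the score as redundancy (higher means more verbose), and the hand-written strings that stand in for LLM-produced summaries.

```python
def n_tokens(text):
    return len(text.split())  # whitespace tokens; a real tokenizer may differ


def removed_fraction(original, compressed):
    """Share of the original's tokens that the compressed version dropped."""
    total = n_tokens(original)
    if total == 0:
        return 0.0
    return max(0.0, 1.0 - n_tokens(compressed) / total)


def redundancy_score(response, abstractive, extractive, pruned):
    """Average of the three compression measurements (assumed normalization)."""
    parts = [removed_fraction(response, v) for v in (abstractive, extractive, pruned)]
    return sum(parts) / len(parts)


# Hand-written stand-ins for the LLM-produced summaries and pruned text.
response = "Well, to be honest, the capital of France is, as you may know, Paris."
abstractive = "France's capital is Paris."
extractive = "the capital of France is Paris."
pruned = "the capital of France is Paris."

score = redundancy_score(response, abstractive, extractive, pruned)
print(round(score, 3))
```

A response that cannot be shortened by any of the three routes scores 0 under this normalization, while padding with filler phrases pushes the score toward 1, so no gold reference answer is needed to rank responses by verbosity.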

[18] Improving Latent Reasoning in LLMs via Soft Concept Mixing

Kang Wang,Xiangyu Duan,Tianyi Du

Main category: cs.CL

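TL;DR: Proposes Soft Concept Mixing (SCM), a training method that introduces soft concept vectors during training to narrow the gap between soft-concept reasoning and discrete-token training in LLMs, and optimizes the entire latent reasoning process with reinforcement learning.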

Details Motivation: LLMs typically reason over discrete tokens, which limits their expressive power, whereas humans reason in abstract conceptual spaces. Because existing models are trained on discrete tokens, they struggle to exploit soft concepts for reasoning, so a method that bridges this gap is needed. Method: SCM builds a soft concept vector as a probability-weighted average of embeddings, mixes it into the model's hidden states, and optimizes the entire latent reasoning process with reinforcement learning. Result: Experiments on five reasoning benchmarks show that SCM improves LLM reasoning performance while maintaining stable training dynamics. Conclusion: Introducing soft concepts directly into training helps improve LLM reasoning, and SCM offers an effective path toward more expressive reasoning models. Abstract: Unlike human reasoning in abstract conceptual spaces, large language models (LLMs) typically reason by generating discrete tokens, which potentially limits their expressive power. The recent work Soft Thinking has shown that LLMs' latent reasoning via soft concepts is a promising direction, but LLMs are trained on discrete tokens. To reduce this gap between the soft concepts in reasoning and the discrete tokens in training, we propose Soft Concept Mixing (SCM), a soft concept aware training scheme that directly exposes the model to soft representations during training. Specifically, SCM constructs a soft concept vector by forming a probability-weighted average of embeddings. Then, this vector is mixed into the model's hidden states, which embody rich contextual information. Finally, the entire latent reasoning process is optimized with Reinforcement Learning (RL). Experiments on five reasoning benchmarks demonstrate that SCM improves the reasoning performance of LLMs, and simultaneously maintains a stable training dynamic.
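A minimal sketch of the two operations named in the abstract may help: forming a probability-weighted average of token embeddings, then mixing it into a hidden state. The abstract does not specify the mixing rule, so the convex interpolation with weight `alpha` below is an assumption, and the helper names `soft_concept` and `mix_into_hidden` are hypothetical.

```python
from typing import List


def soft_concept(probs: List[float], embeddings: List[List[float]]) -> List[float]:
    """Probability-weighted average of token embeddings.

    probs[i] is the next-token probability of token i; embeddings[i] is its
    embedding vector. Returns sum_i probs[i] * embeddings[i].
    """
    dim = len(embeddings[0])
    return [sum(p * e[d] for p, e in zip(probs, embeddings)) for d in range(dim)]


def mix_into_hidden(hidden: List[float], concept: List[float],
                    alpha: float = 0.5) -> List[float]:
    """Assumed mixing rule: convex interpolation between hidden state and concept."""
    return [(1.0 - alpha) * h + alpha * c for h, c in zip(hidden, concept)]


if __name__ == "__main__":
    probs = [0.25, 0.75]                      # toy distribution over 2 tokens
    embeddings = [[1.0, 0.0], [0.0, 1.0]]     # toy 2-d embeddings
    concept = soft_concept(probs, embeddings)
    print(concept)
    print(mix_into_hidden([1.0, 1.0], concept, alpha=0.5))
```

The reinforcement learning stage that optimizes the whole latent reasoning process is not shown; the sketch covers only the vector construction and mixing steps.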

[19] Deep Improvement Supervision

Arip Asadulaev,Rayan Banerjee,Fakhri Karray,Martin Takac

Main category: cs.CL

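TL;DR: Proposes a new training method that substantially improves the training efficiency of Tiny Recursive Models (TRMs), reducing the number of forward passes by 18x and reaching 24% accuracy on ARC-1 with only 0.8M parameters.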

Details Motivation: To explore how the efficiency of small looped architectures such as TRMs on complex reasoning tasks can be further improved with minimal changes. Method: Frames the latent reasoning of TRMs as a form of classifier-free guidance and an implicit policy improvement algorithm, and on that basis proposes a new training scheme that provides a target for each loop during training. Result: The new method reduces the number of forward passes by 18x and removes halting mechanisms while maintaining performance comparable to standard TRMs; it reaches 24% accuracy on ARC-1 with 0.8M parameters. Conclusion: The proposed training scheme substantially improves the training efficiency and practicality of TRMs and, with a small model, outperforms most LLMs on this reasoning task. Abstract: Recently, it was shown that small, looped architectures, such as Tiny Recursive Models (TRMs), can outperform Large Language Models (LLMs) on complex reasoning tasks, including the Abstraction and Reasoning Corpus (ARC). In this work, we investigate a core question: how can we further improve the efficiency of these methods with minimal changes? To address this, we frame the latent reasoning of TRMs as a form of classifier-free guidance and an implicit policy improvement algorithm. Building on these insights, we propose a novel training scheme that provides a target for each loop during training. We demonstrate that our approach significantly enhances training efficiency. Our method reduces the total number of forward passes by 18x and eliminates halting mechanisms, while maintaining quality comparable to standard TRMs. Notably, we achieve 24% accuracy on ARC-1 with only 0.8M parameters, outperforming most LLMs.

[20] Predicting the Formation of Induction Heads

Tatsuya Aoyama,Ethan Gotlieb Wilcox,Nathan Schneider

Main category: cs.CL

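TL;DR: Studies the relationship between statistical properties of training data and the formation of induction heads (IHs) in modern language models, finding that a simple equation of batch size and context size predicts IH formation and that bigram repetition frequency and reliability play a key role.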

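Details Motivation: Although induction heads (IHs) are considered key to the in-context learning ability of modern language models, how they form remains unclear, so it is necessary to examine how statistical properties of the training data affect IH formation. Method: Analyzes training-data statistics for natural and synthetic data, considering batch size, context size, and bigram repetition frequency and reliability, to study their effect on IH formation and to identify sufficient conditions for it under different settings. Result: 1) A combination of batch size and context size predicts IH formation; 2) bigram repetition frequency and reliability strongly affect IH formation, with a clear Pareto frontier; 3) local dependency with high frequency and reliability is sufficient for IH formation, while categoriality and the shape of the marginal distribution become important when frequency and reliability are low. Conclusion: IH formation depends not only on surface repetition patterns in the data but also on the data's structural and distributional properties, revealing deeper mechanisms behind ICL ability. Abstract: Arguably, specialized attention heads dubbed induction heads (IHs) underlie the remarkable in-context learning (ICL) capabilities of modern language models (LMs); yet, a precise characterization of their formation remains unclear. In this study, we investigate the relationship between statistical properties of training data (for both natural and synthetic data) and IH formation. We show that (1) a simple equation combining batch size and context size predicts the point at which IHs form; (2) surface bigram repetition frequency and reliability strongly affect the formation of IHs, and we find a precise Pareto frontier in terms of these two values; and (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, but when the frequency and reliability are low, categoriality and the shape of the marginal distribution matter.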

[21] ARQUSUMM: Argument-aware Quantitative Summarization of Online Conversations

An Quang Tang,Xiuzhen Zhang,Minh Ngoc Dinh,Zhuang Li

Main category: cs.CL

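TL;DR: Proposes a new task, argument-aware quantitative summarization, and the ARQUSUMM framework, which uses LLMs and argumentation theory to reveal the claim-reason structure within sentences of online conversations and applies clustering to aggregate and quantify arguments. Experiments show it outperforms existing models in summary quality, presentation of argument structure, and quantification accuracy.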

Details Motivation: Existing text summarization methods mostly focus on extracting salient information and overlook the argumentative structure of online discussions, while current conversation summarization research does not dig into the deeper argument relations within sentences. A summarization method is therefore needed that captures both the content of arguments and the strength of their supporting reasons. Method: Proposes the ARQUSUMM framework, which uses LLM few-shot learning grounded in argumentation theory to identify propositions within sentences and their claim-reason relationships, and applies argument structure-aware clustering to aggregate arguments and quantify their support. Result: Experiments show that ARQUSUMM outperforms existing models on both conversation summarization and quantitative summarization, producing summaries that are better in argument structure, textual quality, and quantification accuracy, and more helpful to users. Conclusion: ARQUSUMM effectively reveals the deeper argument structure of online conversations and delivers high-quality quantitative summaries, offering a useful tool for understanding discussions of complex controversial topics. Abstract: Online conversations have become more prevalent on public discussion platforms (e.g., Reddit). With growing controversial topics, it is desirable to summarize not only diverse arguments, but also their rationale and justification. Early studies on text summarization focus on capturing general salient information in source documents, overlooking the argumentative nature of online conversations. Recent research on conversation summarization, although it considers the argumentative relationship among sentences, fails to explicate deeper argument structure within sentences for summarization. In this paper, we propose a novel task of argument-aware quantitative summarization to reveal the claim-reason structure of arguments in conversations, with quantities measuring argument strength. We further propose ARQUSUMM, a novel framework to address the task. To reveal the underlying argument structure within sentences, ARQUSUMM leverages LLM few-shot learning grounded in argumentation theory to identify propositions within sentences and their claim-reason relationships. For quantitative summarization, ARQUSUMM employs argument structure-aware clustering algorithms to aggregate arguments and quantify their support. Experiments show that ARQUSUMM outperforms existing conversation and quantitative summarization models and generates summaries representing argument structures that are more helpful to users, with high textual quality and quantification accuracy.

[22] Supervised Fine Tuning of Large Language Models for Domain Specific Knowledge Graph Construction: A Case Study on Hunan's Historical Celebrities

Junjie Hao,Chun Wang,Ying Qiao,Qiuyue Zuo,Qiya Song,Hua Ma,Xieping Gao

Main category: cs.CL

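TL;DR: Proposes a supervised fine-tuning approach to improve domain-specific information extraction, using the culture of Hunan's modern historical celebrities as a case study. The authors design a fine-grained instruction template, build an instruction-tuning dataset, and apply parameter-efficient fine-tuning to several LLMs. All models improve substantially after fine-tuning, with Qwen3-8B performing best, which supports the method's effectiveness for cultural heritage knowledge extraction and knowledge graph construction.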

Details Motivation: Systematic data resources on Hunan's historical celebrities are scarce, and general-purpose LLMs are weak at domain knowledge extraction and structured output in low-resource settings, so the adaptability and accuracy of LLMs in regional historical and cultural domains need to be improved. Method: Designs a fine-grained, schema-guided instruction template for Hunan's historical celebrities, builds a domain instruction-tuning dataset, applies parameter-efficient instruction fine-tuning to four open LLMs (Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct), and establishes evaluation criteria for their information extraction performance. Result: All models improve substantially after fine-tuning; Qwen3-8B performs best, reaching a score of 89.3866 with 100 samples and 50 training iterations. Conclusion: The study offers a feasible path for fine-tuning vertical-domain LLMs and shows their potential for cost-effective use in regional historical and cultural knowledge extraction and knowledge graph construction. Abstract: Large language models and knowledge graphs offer strong potential for advancing research on historical culture by supporting the extraction, analysis, and interpretation of cultural heritage. Using Hunan's modern historical celebrities shaped by Huxiang culture as a case study, pre-trained large models can help researchers efficiently extract key information, including biographical attributes, life events, and social relationships, from textual sources and construct structured knowledge graphs. However, systematic data resources for Hunan's historical celebrities remain limited, and general-purpose models often underperform in domain knowledge extraction and structured output generation in such low-resource settings. To address these issues, this study proposes a supervised fine-tuning approach for enhancing domain-specific information extraction. First, we design a fine-grained, schema-guided instruction template tailored to the Hunan historical celebrities domain and build an instruction-tuning dataset to mitigate the lack of domain-specific training corpora. Second, we apply parameter-efficient instruction fine-tuning to four publicly available large language models - Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct - and develop evaluation criteria for assessing their extraction performance. Experimental results show that all models exhibit substantial performance gains after fine-tuning. Among them, Qwen3-8B achieves the strongest results, reaching a score of 89.3866 with 100 samples and 50 training iterations.
This study provides new insights into fine-tuning vertical large language models for regional historical and cultural domains and highlights their potential for cost-effective applications in cultural heritage knowledge extraction and knowledge graph construction.

[23] Do Vision-Language Models Understand Visual Persuasiveness?

Gyuwon Park

Main category: cs.CL

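TL;DR: Examines whether vision-language models (VLMs) truly understand visual persuasiveness. It introduces a high-consensus dataset for binary persuasiveness judgment and a taxonomy of Visual Persuasive Factors (VPFs), analyzes VLM performance across feature levels, and finds that the main limitation lies in linking recognized objects to communicative intent.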

Details Motivation: To examine whether current vision-language models truly understand how visual cues shape human attitudes and decisions, particularly with respect to visual persuasiveness. Method: Builds a high-consensus dataset for binary persuasiveness judgment, proposes a taxonomy of Visual Persuasive Factors (VPFs) covering low-, mid-, and high-level features, and applies cognitive steering and knowledge injection strategies to improve model reasoning. Result: VLMs show a recall-oriented bias (over-predicting persuasiveness) and weak discrimination for low- and mid-level features; high-level semantic alignment is the strongest predictor of human judgment; simple instructions or unguided reasoning help little, whereas concise, object-grounded rationales significantly improve precision and F1. Conclusion: The core limitation of VLMs lies not in recognizing persuasive objects but in linking them to communicative intent. Abstract: Recent advances in vision-language models (VLMs) have enabled impressive multi-modal reasoning and understanding. Yet whether these models truly grasp visual persuasion (how visual cues shape human attitudes and decisions) remains unclear. To probe this question, we construct a high-consensus dataset for binary persuasiveness judgment and introduce the taxonomy of Visual Persuasive Factors (VPFs), encompassing low-level perceptual, mid-level compositional, and high-level semantic cues. We also explore cognitive steering and knowledge injection strategies for persuasion-relevant reasoning. Empirical analysis across VLMs reveals a recall-oriented bias (models over-predict high persuasiveness) and weak discriminative power for low/mid-level features. In contrast, high-level semantic alignment between message and object presence emerges as the strongest predictor of human judgment. Among intervention strategies, simple instruction or unguided reasoning scaffolds yield marginal or negative effects, whereas concise, object-grounded rationales significantly improve precision and F1 scores. These results indicate that VLMs' core limitation lies not in recognizing persuasive objects but in linking them to communicative intent.

[24] Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments

Yunsung Kim,Mike Hardy,Joseph Tey,Candace Thille,Chris Piech

Main category: cs.CL

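TL;DR: Proposes AnalyticScore, an interpretable automated scoring framework that extracts response elements, featurizes them interpretably with LLMs, and applies an ordinal logistic regression model for efficient, transparent short-answer scoring, with performance close to state-of-the-art uninterpretable methods.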

Details Motivation: To meet the demand for transparency and interpretability in automated scoring for educational assessment, and to address the lack of a widely accepted interpretable scoring solution. Method: Proposes four interpretability principles (FGTI) and designs the AnalyticScore framework, which first extracts explicitly identifiable elements of the responses, then uses LLMs to produce human-interpretable features, and finally scores with an ordinal logistic regression model. Result: On 10 items from the ASAP-SAS dataset, AnalyticScore comes close to the state-of-the-art uninterpretable method in scoring accuracy (within 0.06 QWK on average), and its featurization behavior aligns well with that of human annotators. Conclusion: The study shows that interpretability can be achieved in automated scoring while keeping scoring accuracy high, and it provides a baseline framework for future research on interpretable educational assessment systems. Abstract: AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholders and develop four principles of interpretability -- Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. AnalyticScore operates by (1) extracting explicitly identifiable elements of the responses, (2) featurizing each response into human-interpretable values using LLMs, and (3) applying an intuitive ordinal logistic regression model for scoring. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
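The 0.06 gap above is measured in QWK (quadratic weighted kappa), a standard agreement metric for ordinal scores. The following is a minimal pure-Python sketch of the usual definition, kappa = 1 - sum(w * O) / sum(w * E) with weights w_ij = (i - j)^2 / (K - 1)^2. It is offered for orientation only and is not taken from the paper's code; the function name is hypothetical and labels are assumed to be integers in 0..K-1.

```python
from typing import List


def quadratic_weighted_kappa(rater_a: List[int], rater_b: List[int],
                             num_labels: int) -> float:
    """QWK between two integer ratings in the range 0..num_labels-1."""
    n = len(rater_a)
    k = num_labels
    # Observed confusion matrix O[i][j]: count of (a == i, b == j).
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = 0.0
    denominator = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n   # chance-agreement count
            numerator += weight * observed[i][j]
            denominator += weight * expected
    # If both raters use a single identical label, disagreement is undefined;
    # treat that degenerate case as perfect agreement.
    return 1.0 if denominator == 0 else 1.0 - numerator / denominator


if __name__ == "__main__":
    human = [0, 1, 2, 3, 2, 1]
    model = [0, 1, 2, 2, 2, 1]
    print(round(quadratic_weighted_kappa(human, model, num_labels=4), 4))
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.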

[25] MUCH: A Multilingual Claim Hallucination Benchmark

Jérémie Dentan,Alexi Canesse,Davide Buscaldi,Aymen Shabou,Sonia Vanier

Main category: cs.CL

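TL;DR: MUCH is a new claim-level uncertainty quantification (UQ) benchmark that covers multiple languages and open-weight LLMs, releases generation logits, and provides an efficient deterministic segmentation algorithm, making it suitable for evaluating UQ methods under realistic conditions.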

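Details Motivation: Existing claim-level UQ benchmarks fall short in data reproducibility, multilingual support, support for white-box method development, and real-time deployment, so a fairer, reproducible benchmark closer to real-world use is needed. Method: Builds MUCH, a multilingual (English, French, Spanish, German) UQ benchmark with 4,873 samples generated by four open-weight instruction-tuned LLMs; releases 24 generation logits per token to support white-box research; and proposes a deterministic claim segmentation algorithm that needs as little as 0.2% of the generation time, making it suitable for real-time monitoring. Result: MUCH is the first claim-level UQ benchmark to support multiple languages and multiple open-weight LLMs, and it provides rich logit data and an efficient segmentation scheme; experiments show that current UQ methods still have substantial room for improvement in both performance and efficiency. Conclusion: MUCH offers a fair, reproducible evaluation platform for future claim-level UQ research under realistic deployment conditions, pushing UQ toward efficient, practical methods. Abstract: Claim-level Uncertainty Quantification (UQ) is a promising approach to mitigate the lack of reliability in Large Language Models (LLMs). We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions. It includes 4,873 samples across four European languages (English, French, Spanish, and German) and four instruction-tuned open-weight LLMs. Unlike prior claim-level benchmarks, we release 24 generation logits per token, facilitating the development of future white-box methods without re-generating data. Moreover, in contrast to previous benchmarks that rely on manual or LLM-based segmentation, we propose a new deterministic algorithm capable of segmenting claims using as little as 0.2% of the LLM generation time. This makes our segmentation approach suitable for real-time monitoring of LLM outputs, ensuring that MUCH evaluates UQ methods under realistic deployment constraints. Finally, our evaluations show that current methods still have substantial room for improvement in both performance and efficiency.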

[26] Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

Quentin Anthony,Yury Tokpanov,Skyler Szot,Srivatsan Rajagopal,Praneeth Medepalli,Rishi Iyer,Vasu Shyam,Anna Golubeva,Ansh Chaurasia,Xiao Yang,Tomas Figliolia,Robert Washbourne,Drew Thorstensen,Amartey Pearson,Zack Grossbart,Jason van Patten,Emad Barsoum,Zhenyu Gu,Yao Fu,Beren Millidge

Main category: cs.CL

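TL;DR: Reports the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, using MI300X GPUs and Pollara interconnect, distills practical guidance for systems and model design, and presents the strong performance of the ZAYA1 model.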

Details Motivation: To explore the feasibility and advantages of AMD hardware for large-scale MoE pretraining and to fill the gap in system-level optimization and model design for this platform. Method: Performs a comprehensive cluster and networking characterization, provides microbenchmarks for core collective operations, proposes MI300X-aware sizing rules for transformer blocks, and designs an MoE model with high training throughput and low inference latency. Result: Publishes what the authors believe are the first collective-communication microbenchmarks for Pollara at this scale; obtains kernel sizing and memory bandwidth data for MI300X; the ZAYA1-base model (760M active, 8.3B total parameters) outperforms Llama-3-8B and OLMoE on multiple benchmarks and is comparable to Qwen3-4B and Gemma3-12B. Conclusion: The AMD hardware, network, and software stack are mature enough to support competitive large-scale pretraining. Abstract: We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs and Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.

[27] Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation

Yeqin Zhang,Yizheng Zhao,Chen Hu,Binxing Jiao,Daxin Jiang,Ruihang Miao,Cam-Tu Nguyen

Main category: cs.CL

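TL;DR: Proposes a context-compression pretext task for unsupervised adaptation of LLMs to improve text representation; the resulting LLM2Comp model outperforms existing methods on a range of tasks while being more efficient.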

Details Motivation: Most LLMs are causal and optimized for next-token prediction, which makes it hard for them to produce holistic text representations, so pretext tasks better suited to text representation need to be explored. Method: Introduces context compression as a pretext task in which the model learns to generate compact memory tokens that replace the full context for sequence prediction, and further refines the representations with contrastive learning. Result: Experiments show that the compression objective significantly improves LLM text representations, surpassing models trained with token-level prediction tasks; adding contrastive learning improves results further. Conclusion: Context compression is an effective pretext task, and LLM2Comp is competitive with little training data, making it an efficient LLM-based text encoder. Abstract: Text representation plays a critical role in tasks like clustering, retrieval, and other downstream applications. With the emergence of large language models (LLMs), there is increasing interest in harnessing their capabilities for this purpose. However, most LLMs are inherently causal and optimized for next-token prediction, making them suboptimal for producing holistic representations. To address this, recent studies introduced pretext tasks to adapt LLMs for text representation. Most of these tasks, however, rely on token-level prediction objectives, such as the masked next-token prediction (MNTP) used in LLM2Vec. In this work, we explore the untapped potential of context compression as a pretext task for unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens, which substitute the whole context for downstream sequence prediction. Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations, outperforming models trained with token-level pretext tasks. Further improvements through contrastive learning produce a strong representation model (LLM2Comp) that outperforms contemporary LLM-based text encoders on a wide range of tasks while being more sample-efficient, requiring significantly less training data.

[28] LangMark: A Multilingual Dataset for Automatic Post-Editing

Diego Velazquez,Mikaela Grace,Konstantinos Karageorgos,Lawrence Carin,Aaron Schliem,Dimitrios Zaikis,Roger Wechsler

Main category: cs.CL

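TL;DR: Introduces LangMark, a newly released human-annotated multilingual automatic post-editing (APE) dataset containing 206,983 triplets for English translation into seven languages, intended to improve machine translation post-editing systems.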

Details Motivation: The development of automatic post-editing (APE) systems has been limited by the lack of large-scale multilingual datasets tailored to neural machine translation (NMT) outputs; this work aims to fill that gap. Method: Builds and releases the LangMark dataset of triplets (source text, NMT output, and human post-edited translation) annotated by professional linguists, and uses it to run APE experiments with few-shot-prompted LLMs. Result: Experiments show that, with LangMark, few-shot-prompted LLMs can perform APE effectively and improve upon leading commercial and proprietary machine translation systems. Conclusion: LangMark is an important resource for research on and evaluation of multilingual APE systems and is expected to advance the field. Abstract: Automatic post-editing (APE) aims to correct errors in machine-translated text, enhancing translation quality, while reducing the need for human intervention. Despite advances in neural machine translation (NMT), the development of effective APE systems has been hindered by the lack of large-scale multilingual datasets specifically tailored to NMT outputs. To address this gap, we present and release LangMark, a new human-annotated multilingual APE dataset for English translation to seven languages: Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish. The dataset has 206,983 triplets, with each triplet consisting of a source segment, its NMT output, and a human post-edited translation. Annotated by expert human linguists, our dataset offers both linguistic diversity and scale. Leveraging this dataset, we empirically show that Large Language Models (LLMs) with few-shot prompting can effectively perform APE, improving upon leading commercial and even proprietary machine translation systems. We believe that this new resource will facilitate the future development and evaluation of APE systems.

[29] The PLLuM Instruction Corpus

Piotr Pęzik,Filip Żarnecki,Konrad Kaczyński,Anna Cichosz,Zuzanna Deckert,Monika Garnys,Izabela Grabarczyk,Wojciech Janowski,Sylwia Karasińska,Aleksandra Kujawiak,Piotr Misztela,Maria Szymańska,Karolina Walkusz,Igor Siek,Maciej Chrabąszcz,Anna Kołos,Agnieszka Karlińska,Karolina Seweryn,Aleksandra Krasnodębska,Paula Betscher,Zofia Cieślińska,Katarzyna Kowol,Artur Wilczek,Maciej Trzciński,Katarzyna Dziewulska,Roman Roszko,Tomasz Bernaś,Jurgita Vaičenonienė,Danuta Roszko,Paweł Levchuk,Paweł Kowalski,Irena Prawdzic-Jankowska,Marek Kozłowski,Sławomir Dadas,Rafał Poświata,Alina Wróblewska,Katarzyna Krasnowska-Kieraś,Maciej Ogrodniczuk,Michał Rudolf,Piotr Rybak,Karolina Saputa,Joanna Wołoszyn,Marcin Oleksy,Bartłomiej Koptyra,Teddy Ferdinan,Stanisław Woźniak,Maciej Piasecki,Paweł Walkowiak,Konrad Wojtasik,Arkadiusz Janz,Przemysław Kazienko,Julia Moska,Jan Kocoń

Main category: cs.CL

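TL;DR: Describes the instruction dataset used in the PLLuM project to fine-tune transformer-based LLMs, presents a functional typology of organic, converted, and synthetic instructions, discusses the implications of human-authored versus synthetic instruction datasets for linguistic adaptation, and releases the first representative subset of PLLuMIC.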

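Details Motivation: Supporting the development of Polish LLMs requires high-quality instruction datasets and an understanding of how different types of instruction data affect fine-tuning. Method: Proposes a functional typology of instructions, analyzes the characteristics of organic, converted, and synthetic instructions, and compares the roles of human-authored and synthetic data in linguistic adaptation. Result: Develops the first representative subset of the PLLuM instruction corpus (PLLuMIC) and shares observations on using different types of instruction data. Conclusion: Synthetic instruction data has potential when resources are limited, but human-authored organic data retains advantages in linguistic quality and diversity; the released PLLuMIC subset can help guide the development of similar datasets. Abstract: This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.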

[30] Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models

Vy Nguyen,Ziqi Xu,Jeffrey Chan,Estrid He,Feng Xia,Xiuzhen Zhang

Main category: cs.CL

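TL;DR: Proposes Aspect-Based Causal Abstention (ABCA), a causal-inference framework that analyzes the diversity of an LLM's internal knowledge to abstain early from unreliable answers, substantially improving the reliability and interpretability of abstention decisions.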

Details Motivation: LLMs often produce fluent but factually incorrect responses (hallucinations), and existing abstention methods rely on post-generation signals, which makes it hard to stop unreliable outputs in advance. An early abstention mechanism that judges knowledge reliability before generation is therefore needed. Method: Proposes the Aspect-Based Causal Abstention (ABCA) framework, which uses causal inference to analyze the diversity of an LLM's internal knowledge across aspects (such as disciplines, legal contexts, or temporal frames), estimates the causal effect of each aspect, and uses these estimates to judge the consistency and sufficiency of knowledge, enabling two kinds of abstention: Type-1 (knowledge conflict) and Type-2 (knowledge insufficiency). Result: Experiments on standard benchmarks show that ABCA improves abstention reliability over existing methods, reaches state-of-the-art performance, and makes abstention decisions more interpretable. Conclusion: ABCA uses causal inference to achieve early abstention based on knowledge diversity, effectively identifying knowledge conflict and insufficiency and offering an interpretable, efficient way to reduce LLM hallucination. Abstract: Large Language Models (LLMs) often produce fluent but factually incorrect responses, a phenomenon known as hallucination. Abstention, where the model chooses not to answer and instead outputs phrases such as "I don't know", is a common safeguard. However, existing abstention methods typically rely on post-generation signals, such as generation variations or feedback, which limits their ability to prevent unreliable responses in advance. In this paper, we introduce Aspect-Based Causal Abstention (ABCA), a new framework that enables early abstention by analysing the internal diversity of LLM knowledge through causal inference. This diversity reflects the multifaceted nature of parametric knowledge acquired from various sources, representing diverse aspects such as disciplines, legal contexts, or temporal frames. ABCA estimates causal effects conditioned on these aspects to assess the reliability of knowledge relevant to a given query. Based on these estimates, we enable two types of abstention: Type-1, where aspect effects are inconsistent (knowledge conflict), and Type-2, where aspect effects consistently support abstention (knowledge insufficiency). Experiments on standard benchmarks demonstrate that ABCA improves abstention reliability, achieves state-of-the-art performance, and enhances the interpretability of abstention decisions.

[31] Attention-Guided Feature Fusion (AGFF) Model for Integrating Statistical and Semantic Features in News Text Classification

Mohammad Zare

Main category: cs.CL

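TL;DR: Proposes an Attention-Guided Feature Fusion (AGFF) model that combines statistical and semantic features and uses an attention mechanism to integrate the two feature types dynamically, improving news text classification performance.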

Details Motivation: Traditional methods rely on statistical features but ignore context, while deep learning methods focus on semantics but may overlook important statistical signals; the two feature types need to be fused to improve classification. Method: Proposes the AGFF model, which uses an attention mechanism to dynamically fuse TF-IDF-based statistical features with deep semantic features in a unified classification framework. Result: On benchmark news datasets, AGFF outperforms both traditional statistical models and purely semantic deep learning models, and ablation studies confirm the contribution of each component. Conclusion: Fusing statistical and semantic features under attention guidance effectively improves news text classification accuracy and is both practical and interpretable. Abstract: News text classification is a crucial task in natural language processing, essential for organizing and filtering the massive volume of digital content. Traditional methods typically rely on statistical features like term frequencies or TF-IDF values, which are effective at capturing word-level importance but often fail to reflect contextual meaning. In contrast, modern deep learning approaches utilize semantic features to understand word usage within context, yet they may overlook simple, high-impact statistical indicators. This paper introduces an Attention-Guided Feature Fusion (AGFF) model that combines statistical and semantic features in a unified framework. The model applies an attention-based mechanism to dynamically determine the relative importance of each feature type, enabling more informed classification decisions. Through evaluation on benchmark news datasets, the AGFF model demonstrates superior performance compared to both traditional statistical models and purely semantic deep learning models. The results confirm that strategic integration of diverse feature types can significantly enhance classification accuracy. Additionally, ablation studies validate the contribution of each component in the fusion process. The findings highlight the model's ability to balance and exploit the complementary strengths of statistical and semantic representations, making it a practical and effective solution for real-world news classification tasks.
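One simple way to picture attention-based weighting of two feature types is sketched below. The abstract does not describe the actual AGFF architecture, so everything here is an assumption made for illustration: both feature vectors are taken to be already projected to a common dimension, a single query vector scores each feature type by dot product, and a softmax over the two scores gives the fusion weights. The names `fuse` and `query` are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def fuse(stat_feat: List[float], sem_feat: List[float],
         query: List[float]) -> Tuple[List[float], List[float]]:
    """Attention-weighted sum of a statistical and a semantic feature vector.

    Returns (fused_vector, [weight_statistical, weight_semantic]).
    In a trained model `query` would be a learned parameter; here it is
    supplied by the caller.
    """
    weights = softmax([dot(query, stat_feat), dot(query, sem_feat)])
    fused = [weights[0] * s + weights[1] * t
             for s, t in zip(stat_feat, sem_feat)]
    return fused, weights


if __name__ == "__main__":
    tfidf_vec = [1.0, 0.0]       # toy projected TF-IDF features
    semantic_vec = [0.0, 1.0]    # toy projected semantic features
    fused, weights = fuse(tfidf_vec, semantic_vec, query=[2.0, 0.0])
    print(weights, fused)
```

The fused vector would then feed a classifier; because the weights are explicit, they can also be inspected to see which feature type dominated a given prediction.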

Ziyang Wang,Yuanlei Zheng,Zhenbiao Cao,Xiaojin Zhang,Zhongyu Wei,Pei Fu,Zhenbo Luo,Wei Chen,Xiang Bai

Main category: cs.CL

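TL;DR: AutoLink is an autonomous agent framework that reformulates schema linking as an iterative, agent-driven process, enabling efficient and scalable schema linking for text-to-SQL and substantially improving recall and execution accuracy on large databases.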

Details Motivation: Existing schema linking methods are costly and scale poorly on large databases, and they struggle to balance recall against noise. Method: Proposes the AutoLink framework, which, guided by an LLM, dynamically explores and expands the relevant schema subset and progressively identifies the necessary schema components without inputting the full database schema. Result: Achieves strict schema linking recall of 97.4% on Bird-Dev and 91.2% on Spider-2.0-Lite, with execution accuracy of 68.7% and 34.9% respectively, and maintains high performance on large schemas (e.g., over 3,000 columns). Conclusion: AutoLink is a highly scalable, high-recall schema linking solution suited to industrial-scale text-to-SQL systems. Abstract: For industrial-scale text-to-SQL, supplying the entire database schema to Large Language Models (LLMs) is impractical due to context window limits and irrelevant noise. Schema linking, which filters the schema to a relevant subset, is therefore critical. However, existing methods incur prohibitive costs, struggle to trade off recall and noise, and scale poorly to large databases. We present AutoLink, an autonomous agent framework that reformulates schema linking as an iterative, agent-driven process. Guided by an LLM, AutoLink dynamically explores and expands the linked schema subset, progressively identifying necessary schema components without inputting the full database schema. Our experiments demonstrate AutoLink's superior performance, achieving state-of-the-art strict schema linking recall of 97.4% on Bird-Dev and 91.2% on Spider-2.0-Lite, with competitive execution accuracy, i.e., 68.7% EX on Bird-Dev (better than CHESS) and 34.9% EX on Spider-2.0-Lite (ranking 2nd on the official leaderboard). Crucially, AutoLink exhibits exceptional scalability, maintaining high recall, efficient token consumption, and robust execution accuracy on large schemas (e.g., over 3,000 columns) where existing methods severely degrade, making it a highly scalable, high-recall schema-linking solution for industrial text-to-SQL systems.

[33] E$^3$-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models

Tao Yuan,Haoli Bai,Yinfei Pan,Xuyang Cao,Tianyu Zhang,Lu Hou,Ting Hu,Xianzhi Yu

Main category: cs.CL

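TL;DR: Proposes E$^3$-Pruner, an efficient layer pruning framework that uses differentiable mask optimization and entropy-aware adaptive knowledge distillation to retain high performance while substantially lowering training cost and improving inference efficiency.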

Details Motivation: Existing layer pruning methods struggle to address performance degradation, high training cost, and limited acceleration at the same time in practical deployment. Method: Uses a Gumbel-TopK sampler for differentiable mask optimization, enabling efficient pruning mask search, and designs an entropy-aware adaptive knowledge distillation strategy to improve task performance. Result: When pruning 25% of the layers of Qwen3-32B, the method reaches 96% accuracy on MATH-500 (a drop of only 0.8%), outperforming the existing SOTA (95%), with a 1.33x inference speedup and only 0.5B training tokens. Conclusion: E$^3$-Pruner strikes a good balance among task effectiveness, training economy, and inference efficiency in model compression and has strong practical value. Abstract: With the increasing size of large language models, layer pruning has gained increased attention as a hardware-friendly approach for model compression. However, existing layer pruning methods struggle to simultaneously address key practical deployment challenges, including performance degradation, high training costs, and limited acceleration. To overcome these limitations, we propose E$^3$-Pruner, a task-Effective, training-Economical and inference-Efficient layer pruning framework. E$^3$-Pruner introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments over diverse model architectures and benchmarks demonstrate the superiority of our method over state-of-the-art approaches. Notably, E$^3$-Pruner achieves 96% accuracy, a mere 0.8% drop from the original model (96.8%) on MATH-500 when pruning 25% of the layers of Qwen3-32B, outperforming existing SOTA (95%), with a 1.33$\times$ inference speedup while consuming merely 0.5B tokens (0.5% of the post-training data volume).
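For readers unfamiliar with the Gumbel-TopK trick mentioned above, here is a minimal sketch of the sampling step only: perturb per-layer logits with Gumbel noise and keep the k highest. This is the generic technique, not the paper's implementation. The differentiable relaxation used for training is not shown, and the name `gumbel_topk_mask` and the use of one logit per layer are assumptions made for illustration.

```python
import math
import random
from typing import List


def gumbel_topk_mask(logits: List[float], k: int,
                     rng: random.Random) -> List[int]:
    """Sample a hard keep-mask with exactly k ones via Gumbel-top-k.

    Adding i.i.d. Gumbel(0, 1) noise to the logits and taking the top k
    samples k items without replacement, with higher-logit items favored.
    """
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append(logit + gumbel)
    keep = sorted(range(len(logits)), key=lambda i: noisy[i], reverse=True)[:k]
    mask = [0] * len(logits)
    for i in keep:
        mask[i] = 1
    return mask


if __name__ == "__main__":
    rng = random.Random(0)
    layer_logits = [0.0] * 8                 # 8 layers, no learned preference yet
    # Pruning 25% of layers means keeping 6 of 8.
    print(gumbel_topk_mask(layer_logits, k=6, rng=rng))
```

A training-time version would typically replace the hard top-k with a relaxed or straight-through estimator so that gradients can reach the logits; that part depends on details the abstract does not give.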

[34] A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents

Sizhe Zhou

Main category: cs.CL

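TL;DR: Proposes an event-based representation of conversational memory that decomposes dialogue into enriched elementary discourse units (EDUs) with normalized entities and source attributions, organizes them in a heterogeneous graph that supports associative recall, and matches or surpasses existing methods on multiple benchmarks while keeping QA contexts short.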

Details Motivation: Existing conversational memory methods struggle to balance information completeness and retrieval efficiency in long dialogues: fixed context windows limit how much history can be used, and external memory methods are either coarse-grained in retrieval or fragment the information. A memory structure that preserves fine-grained information and is easy to retrieve from is therefore needed. Method: Inspired by neo-Davidsonian event semantics, represents conversational history as short event propositions that include participants, temporal cues, and local context; uses an LLM to decompose each session into enriched elementary discourse units (EDUs) and builds a heterogeneous graph of sessions, EDUs, and their arguments; on top of this, implements two retrieval variants based on dense similarity search and LLM filtering, with optional graph propagation to connect related EDUs and aggregate evidence. Result: Experiments on the LoCoMo and LongMemEval$_S$ benchmarks show that this event-centric memory matches or surpasses strong baselines with shorter QA context lengths. Conclusion: Structurally simple, event-level memory provides a principled and practical foundation for long-horizon conversational agents, improving retrieval efficiency without sacrificing information completeness. Abstract: LLM-based conversational agents still struggle to maintain coherent, personalized interaction over many sessions: fixed context windows limit how much history can be kept in view, and most external memory approaches trade off between coarse retrieval over large chunks and fine-grained but fragmented views of the dialogue. Motivated by neo-Davidsonian event semantics, we propose an event-centric alternative that represents conversational history as short, event-like propositions which bundle together participants, temporal cues, and minimal local context, rather than as independent relation triples or opaque summaries. In contrast to work that aggressively compresses or forgets past content, our design aims to preserve information in a non-compressive form and make it more accessible, rather than more lossy. Concretely, we instruct an LLM to decompose each session into enriched elementary discourse units (EDUs) -- self-contained statements with normalized entities and source turn attributions -- and organize sessions, EDUs, and their arguments in a heterogeneous graph that supports associative recall. On top of this representation we build two simple retrieval-based variants that use dense similarity search and LLM filtering, with an optional graph-based propagation step to connect and aggregate evidence across related EDUs. Experiments on the LoCoMo and LongMemEval$_S$ benchmarks show that these event-centric memories match or surpass strong baselines, while operating with much shorter QA contexts.
Our results suggest that structurally simple, event-level memory provides a principled and practical foundation for long-horizon conversational agents. Our code and data will be released at https://github.com/KevinSRR/EMem.
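
To make the retrieval idea in this entry concrete, here is a minimal sketch of dense similarity search followed by one-hop propagation through shared arguments. It is not the authors' implementation: the toy EDUs, the hand-made 3-d vectors standing in for a real encoder, and the `retrieve` function are all assumptions for illustration, and the LLM filtering step is omitted.

```python
import math
from collections import defaultdict

# Hypothetical toy EDUs: (id, text, argument entities, embedding).
# The 3-d vectors are stand-ins for embeddings from a real dense encoder.
EDUS = [
    ("e1", "Alice adopted a dog named Rex in May.", {"Alice", "Rex"}, [1.0, 0.0, 0.0]),
    ("e2", "Rex was taken to the vet last week.", {"Rex"}, [0.0, 1.0, 0.0]),
    ("e3", "Bob started a new job in Berlin.", {"Bob", "Berlin"}, [0.0, 0.0, 1.0]),
]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, top_k=1, propagate=True):
    """Return EDU ids: dense top-k hits, optionally expanded via shared arguments."""
    # Step 1: dense similarity search over EDU embeddings.
    ranked = sorted(EDUS, key=lambda e: cosine(query_vec, e[3]), reverse=True)
    hits = [e[0] for e in ranked[:top_k]]
    if not propagate:
        return hits

    # Step 2 (assumed form of graph propagation): follow argument nodes
    # to other EDUs that mention the same entity.
    arg_to_edus = defaultdict(set)
    for edu_id, _, args, _ in EDUS:
        for arg in args:
            arg_to_edus[arg].add(edu_id)

    by_id = {e[0]: e for e in EDUS}
    expanded = list(hits)
    for edu_id in hits:
        for arg in sorted(by_id[edu_id][2]):
            for neighbour in sorted(arg_to_edus[arg]):
                if neighbour not in expanded:
                    expanded.append(neighbour)
    return expanded
```

With a query vector close to `e1`, the dense step alone returns only `e1`, while propagation also pulls in `e2` because both mention the entity "Rex".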

[35] Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs

Yusuf Çelebi,Mahmoud El Hussieni,Özay Ezerceli

Main category: cs.CL

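TL;DR: PARROT is a framework for evaluating how much the accuracy of large language models degrades under social pressure. Using double-blind evaluation and confidence-shift analysis, it reveals how strongly different models defer to misleading authority and which error patterns they show, finding that advanced models are more robust while weaker models are prone to epistemic collapse.

Details Motivation: Large language models may be excessively compliant (sycophancy) under authoritative or persuasive pressure, which lowers accuracy and undermines reliability and safety in real-world settings, so a systematic way to measure this vulnerability is needed. Method: PARROT uses double-blind evaluation to compare answers to a neutral question with answers to an authoritatively false version of it; quantifies confidence shifts with log-likelihood-based calibration tracking; and proposes an eight-state behavioral taxonomy to identify failure modes systematically. Result: Experiments on 22 models and 1,302 MMLU-style questions show that advanced models (e.g., GPT-5, Claude Sonnet 4.5) have low follow rates (≤11%) and stable accuracy, whereas older or smaller models (e.g., GPT-4 at 80%, Qwen 2.5-1.5B at 94%) show severe epistemic collapse, with confidence in the correct answer falling and confidence in the wrong answer rising. Domains such as international law are highly fragile, while elementary mathematics is comparatively robust. Conclusion: Resistance to over-compliance pressure should be treated as a core safety objective alongside accuracy, harm avoidance, and privacy, to ensure reliable real-world deployment of language models. Abstract: This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the degradation in accuracy that occurs in large language models (LLMs) under social pressure exerted through authority and persuasion, the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low "follow rates" (≤11%, GPT-5: 4%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80%, Qwen 2.5-1.5B: 94%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient.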
Consequently, we argue that the goal of "resistance to overfitting pressure" should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
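
As a rough illustration of the "follow rate" and accuracy loss reported in this entry, the sketch below computes both from paired neutral/pressured answers. The record layout and the exact definitions are assumptions made here (the abstract does not spell them out): an item counts as "followed" when the answer under pressure equals the imposed false answer.

```python
def follow_rate(records):
    """Share of items whose pressured answer equals the imposed false answer."""
    if not records:
        return 0.0
    followed = sum(1 for r in records if r["pressured"] == r["imposed"])
    return followed / len(records)


def accuracy_drop(records):
    """Accuracy on the neutral prompt minus accuracy on the pressured prompt."""
    if not records:
        return 0.0
    n = len(records)
    neutral_acc = sum(1 for r in records if r["neutral"] == r["gold"]) / n
    pressured_acc = sum(1 for r in records if r["pressured"] == r["gold"]) / n
    return neutral_acc - pressured_acc


# Hypothetical multiple-choice records for one model.
RECORDS = [
    {"gold": "A", "imposed": "B", "neutral": "A", "pressured": "B"},  # sycophantic agreement
    {"gold": "C", "imposed": "D", "neutral": "C", "pressured": "C"},  # robust correct
    {"gold": "B", "imposed": "A", "neutral": "B", "pressured": "B"},  # robust correct
    {"gold": "D", "imposed": "C", "neutral": "A", "pressured": "A"},  # stubborn error
]
```

On these four toy records one answer flips to the imposed option, so both the follow rate and the accuracy drop come out at one quarter.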

[36] Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

Anshul Singh,Rohan Chaudhary,Gagneet Singh,Abhay Kumary

Main category: cs.CL

TL;DR: MirageTVQA is a new multilingual table question-answering benchmark with visual noise, designed to evaluate the robustness of vision-language models in realistic settings.

Details Motivation: Existing table QA datasets are mostly English and present tables in a clean format, so they do not reflect real-world complexity, leaving a gap between research and practice. Method: The authors build a dataset of nearly 60,000 QA pairs across 24 languages and add visual noise that mimics scanned documents, in order to evaluate VLMs under multilingual and visually degraded conditions. Result: Leading VLMs degrade substantially under visual noise (a drop of more than 35% for the best models) and show an English-first bias, with weaker reasoning in other languages. Conclusion: MirageTVQA provides a benchmark for measuring and driving progress toward more robust multilingual table-reasoning VLMs. Abstract: The impressive performance of VLMs is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present MirageTVQA, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35% drop for the best models) when faced with visual noise and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLMs for table reasoning. The dataset and the code are available at: https://github.com/anshulsc/MirageTVQA.

[37] Social-Media Based Personas Challenge: Hybrid Prediction of Common and Rare User Actions on Bluesky

Benjamin White,Anastasia Shimorina

Main category: cs.CL

TL;DR: Proposes a hybrid approach to predicting user behavior on social media that treats common and rare actions differently. The approach is validated on a large-scale Bluesky dataset for both kinds of action and took first place in the associated challenge.

Details Motivation: Prior work focuses on common user actions (such as retweeting and liking), while rare but important actions lack effective prediction. This paper aims to improve combined prediction of frequent and infrequent actions by distinguishing between action types. Method: Combines four complementary components: a lookup database built from historical response patterns; persona-specific LightGBM models with temporal and semantic features for common actions; a hybrid neural architecture that fuses textual and temporal representations for rare-action classification; and text reply generation. Result: On a large-scale Bluesky dataset with 6.4 million conversation threads, 12 user actions, and 25 persona clusters, the persona-specific models reach an average macro F1 of 0.64 on common actions, and the rare-action classifier reaches 0.56 macro F1 across 10 rare actions. Conclusion: Effective social media behavior prediction requires modeling strategies tailored to each type of action; the hybrid approach performs well on both common and rare actions and ranked first in the SocialSim challenge at a COLM 2025 workshop. Abstract: Understanding and predicting user behavior on social media platforms is crucial for content recommendation and platform design. While existing approaches focus primarily on common actions like retweeting and liking, the prediction of rare but significant behaviors remains largely unexplored. This paper presents a hybrid methodology for social media user behavior prediction that addresses both frequent and infrequent actions across a diverse action vocabulary. We evaluate our approach on a large-scale Bluesky dataset containing 6.4 million conversation threads spanning 12 distinct user actions across 25 persona clusters. Our methodology combines four complementary approaches: (i) a lookup database system based on historical response patterns; (ii) persona-specific LightGBM models with engineered temporal and semantic features for common actions; (iii) a specialized hybrid neural architecture fusing textual and temporal representations for rare action classification; and (iv) generation of text replies. Our persona-specific models achieve an average macro F1-score of 0.64 for common action prediction, while our rare action classifier achieves 0.56 macro F1-score across 10 rare actions. These results demonstrate that effective social media behavior prediction requires tailored modeling strategies recognizing fundamental differences between action types. Our approach achieved first place in the SocialSim: Social-Media Based Personas challenge organized at the Social Simulation with LLMs workshop at COLM 2025.
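
Because this entry reports macro F1 for both common and rare actions, a small worked example may help show why the metric suits imbalanced action vocabularies: every class contributes equally, however rare it is. The sketch below is a plain-Python version of the standard definition; the action labels are made up for illustration.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: "like" is common, "block" is rare.
GOLD = ["like", "like", "like", "like", "block", "block"]
PRED = ["like", "like", "like", "like", "like", "block"]
```

Here five of six predictions are correct, yet the single missed "block" pulls the rare class down to an F1 of 2/3, and the macro average lands below the plain accuracy of 5/6.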

[38] Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation

Marii Ojastu,Hele-Andra Kuulmets,Aleksei Dorkin,Marika Borovikova,Dage Särg,Kairit Sirts

Main category: cs.CL

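TL;DR: Presents a localized Estonian translation of the WinoGrande commonsense reasoning benchmark, evaluates a range of models on the human-translated data, and examines whether prompt engineering can raise machine translation quality. Human translation still outperforms machine translation, and involving language specialists is essential for reliably evaluating the language abilities of large models.

Details Motivation: Reliable commonsense reasoning evaluation in Estonian requires a culturally and linguistically localized translation of WinoGrande, along with an assessment of how current models perform in a low-resource language and how suitable machine translation is for this task. Method: Translation specialists translated and culturally adapted the WinoGrande test set, and a targeted prompt was designed for the machine translation experiments; proprietary and open-source language models were then evaluated on the human-translated and machine-translated data. Result: Models perform slightly worse on the human-translated Estonian data than on the original English set and notably worse on machine-translated data; prompt engineering yields limited gains in translation quality and model performance. Conclusion: High-quality benchmark translation requires language specialists, and prompt optimization alone does not reach sufficient translation quality, which underlines the importance of localized translation for evaluation in low-resource languages. Abstract: In this paper, we present a localized and culturally adapted Estonian translation of the test set from the widely used commonsense reasoning benchmark, WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open-source models on the human-translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt. This prompt is specifically tailored to address both the linguistic characteristics of Estonian and the unique translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human-translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Additionally, our experiments indicate that prompt engineering offers limited improvement in translation quality or model accuracy, and highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of language competency and reasoning in large language models.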

[39] Large Language Models for Sentiment Analysis to Detect Social Challenges: A Use Case with South African Languages

Koena Ronny Mabokela,Tim Schlippe,Matthias Wölfel

Main category: cs.CL

TL;DR: Evaluates the zero-shot sentiment analysis performance of several large language models (GPT-3.5, GPT-4, LlaMa 2, PaLM 2, Dolly 2) on social media text in South African English, Sepedi, and Setswana, and explores their use for identifying social problems.

Details Motivation: No prior work has examined how to use large language models for sentiment analysis of social media content in South African languages in order to detect social challenges, and there is a pressing need to support government decision-making in multilingual settings. Method: Using zero-shot prompting, the study evaluates several state-of-the-art LLMs on sentiment polarity classification for emerging topics that fall within the jurisdiction of 10 South African government departments, and fuses the outputs of multiple models to improve performance. Result: Sentiment analysis performance differs markedly across models, topics, and languages; fusing the outputs of multiple models brings the sentiment classification error below 1%. Conclusion: Fusing the outputs of multiple LLMs enables high-accuracy sentiment analysis and offers a feasible way to detect social challenges and inform targeted policy in multilingual settings. Abstract: Sentiment analysis can aid in understanding people's opinions and emotions on social issues. In multilingual communities sentiment analysis systems can be used to quickly identify social challenges in social media posts, enabling government departments to detect and address these issues more precisely and effectively. Recently, large language models (LLMs) have become available to the general public and initial analyses have shown that they exhibit remarkable zero-shot sentiment analysis abilities in English. However, no work has investigated leveraging LLMs for sentiment analysis of social media posts in South African languages to detect social challenges. Consequently, in this work, we analyse the zero-shot performance of the state-of-the-art LLMs GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 to investigate the sentiment polarities of the 10 most emerging topics in English, Sepedi and Setswana social media posts that fall within the jurisdictional areas of 10 South African government departments. Our results demonstrate that there are big differences between the various LLMs, topics, and languages. In addition, we show that a fusion of the outcomes of different LLMs provides large gains in sentiment classification performance with sentiment classification errors below 1%. Consequently, it is now feasible to provide systems that generate reliable information about sentiment analysis to detect social challenges and draw conclusions about possible needs for actions on specific topics and within different language groups.
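
The abstract above does not say how the outputs of the different LLMs are fused. One simple possibility, shown here purely as an assumption, is majority voting over per-model labels with a fixed priority order for breaking ties. The model names and the `fuse_labels` helper are hypothetical.

```python
from collections import Counter


def fuse_labels(predictions, priority):
    """Majority vote over {model_name: label}; ties go to the highest-priority model."""
    counts = Counter(predictions.values())
    best = max(counts.values())
    tied = {label for label, count in counts.items() if count == best}
    # Tie-break: take the label of the first model in `priority` that voted
    # for one of the tied labels.
    for model in priority:
        if predictions.get(model) in tied:
            return predictions[model]
    # Fallback when no prioritized model voted for a tied label.
    return sorted(tied)[0]


# Hypothetical per-post outputs from three models.
votes = {"model_a": "negative", "model_b": "negative", "model_c": "neutral"}
fused = fuse_labels(votes, priority=["model_a", "model_b", "model_c"])
```

A weighted vote or a learned combiner would be equally plausible readings of "fusion"; majority voting is chosen here only because it needs no extra training data.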

[40] Humanlike Multi-user Agent (HUMA): Designing a Deceptively Human AI Facilitator for Group Chats

Mateusz Jacniacki,Martí Carmona Serrat

Main category: cs.CL

TL;DR: Presents HUMA, an LLM-based multi-user conversational agent that takes part in asynchronous group chats in a humanlike way. Experiments show that in natural group conversations it performs comparably to human community managers and is hard to tell apart from them.

Details Motivation: Most existing conversational systems are designed for one-on-one, turn-based exchanges and adapt poorly to natural asynchronous group chats; as AI assistants spread across digital platforms, more natural, humanlike interaction patterns matter for maintaining user trust and engagement. Method: Proposes HUMA, which uses an event-driven architecture with three components (Router, Action Agent, and Reflection) and supports message handling, replies, reactions, and realistic response-time simulation, adapting LLMs to multi-user conversation dynamics. Result: In a four-person role-play chat study with 97 participants, users identified HUMA and human community managers at near-chance rates; the two differed only slightly, with small effect sizes, in community-management effectiveness, social presence, and engagement/satisfaction. Conclusion: In natural group chat settings, an AI facilitator can match human interaction quality while remaining hard to identify as nonhuman, which indicates practical potential for humanlike multi-user agents. Abstract: Conversational agents built on large language models (LLMs) are becoming increasingly prevalent, yet most systems are designed for one-on-one, turn-based exchanges rather than natural, asynchronous group chats. As AI assistants become widespread throughout digital platforms, from virtual assistants to customer service, developing natural and humanlike interaction patterns seems crucial for maintaining user trust and engagement. We present the Humanlike Multi-user Agent (HUMA), an LLM-based facilitator that participates in multi-party conversations using human-like strategies and timing. HUMA extends prior multi-user chatbot work with an event-driven architecture that handles messages, replies, and reactions and introduces realistic response-time simulation. HUMA comprises three components (Router, Action Agent, and Reflection) that together adapt LLMs to group conversation dynamics. We evaluate HUMA in a controlled study with 97 participants in four-person role-play chats, comparing AI and human community managers (CMs). Participants classified CMs as human at near-chance rates in both conditions, indicating they could not reliably distinguish HUMA agents from humans. Subjective experience was comparable across conditions: community-manager effectiveness, social presence, and engagement/satisfaction differed only modestly with small effect sizes. Our results suggest that, in natural group chat settings, an AI facilitator can match human quality while remaining difficult to identify as nonhuman.

[41] A new kid on the block: Distributional semantics predicts the word-specific tone signatures of monosyllabic words in conversational Taiwan Mandarin

Xiaoyun Jin,Mirjam Ernestus,R. Harald Baayen

Main category: cs.CL

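TL;DR: A corpus study of how semantics affects the tonal realization of monosyllabic Mandarin words once many other factors are controlled for. Word meaning turns out to be an important predictor of tonal realization, and contextualized word embeddings predict pitch contours effectively.

Details Motivation: To investigate whether and how word meaning affects the pitch realization of monosyllabic words in spontaneous conversational Mandarin, challenging traditional theories of tone. Method: Generalized additive models are used to decompose pitch contours while controlling for word duration, gender, speaker identity, tonal context, and other factors, and contextualized word embeddings are used to predict pitch contours. Result: Even with many variables controlled, word meaning remains a strong predictor of tonal realization; homophones have distinct pitch contours; and contextualized embeddings predict pitch contours well above a permutation baseline. Conclusion: Semantics plays a key role in Mandarin tonal realization, which supports the Discriminative Lexicon Model and challenges standard theories of tone. Abstract: We present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of words' meanings. We used the generalized additive model to decompose a given observed pitch contour into a set of component pitch contours that are tied to different control variables and semantic predictors. Even when variables such as word duration, gender, speaker identity, tonal context, vowel height, and utterance position are controlled for, the effect of word remains a strong predictor of tonal realization. We present evidence that this effect of word is a semantic effect: word sense is shown to be a better predictor than word, and heterographic homophones are shown to have different pitch contours. The strongest evidence for the importance of semantics is that the pitch contours of individual word tokens can be predicted from their contextualized embeddings with an accuracy that substantially exceeds a permutation baseline. For phonetics, distributional semantics is a new kid on the block. Although our findings challenge standard theories of Mandarin tone, they fit well within the theoretical framework of the Discriminative Lexicon Model.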
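
The key evidence in this entry is that prediction accuracy "substantially exceeds a permutation baseline". The sketch below shows the general shape of such a check on made-up numbers: the error of the real pairing between predictions and observations is compared with the error obtained after shuffling the observations. The data, the error measure, and the function names are assumptions; the paper's own procedure may differ.

```python
import random


def mean_abs_error(pred, obs):
    """Mean absolute difference between paired predictions and observations."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)


def permutation_baseline(pred, obs, n_perm=1000, seed=0):
    """Return (observed error, mean permuted error, one-sided p-value)."""
    rng = random.Random(seed)
    observed = mean_abs_error(pred, obs)
    shuffled = list(obs)
    errors = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between pred and obs
        errors.append(mean_abs_error(pred, shuffled))
    # Share of permutations that do at least as well as the real pairing,
    # with the usual +1 correction so the p-value is never exactly zero.
    p_value = (1 + sum(1 for e in errors if e <= observed)) / (n_perm + 1)
    return observed, sum(errors) / n_perm, p_value


# Hypothetical mean pitch values (Hz) for eight word tokens.
OBSERVED_F0 = [100, 120, 140, 160, 180, 200, 220, 240]
PREDICTED_F0 = [104, 118, 143, 158, 183, 197, 222, 239]
```

If the predictions carried no token-specific information, shuffling would not hurt; here the shuffled error is far larger than the real one, which is the pattern the abstract describes.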

[42] Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding

Daniil Ignatev,Ayman Santeer,Albert Gatt,Denis Paperno

Main category: cs.CL

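TL;DR: Proposes a zero-shot method for natural language inference based on multimodal representations: language is grounded in visual context by using a text-to-image model to generate a visual representation of the premise, and inference is performed by comparing that representation with the textual hypothesis.

Details Motivation: Conventional NLI methods are vulnerable to textual biases and surface heuristics and lack genuine semantic understanding, so a more robust approach is needed to improve inference in the zero-shot setting. Method: A text-to-image model generates a visual representation of the premise, which is then matched against the textual hypothesis using cosine similarity or visual question answering (VQA). Result: The method achieves high accuracy without task-specific fine-tuning and is robust to textual biases and surface heuristics on a controlled adversarial dataset. Conclusion: Using the visual modality as a meaning representation helps make natural language understanding more robust and is a promising direction for zero-shot NLI. Abstract: We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.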
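
For the cosine-similarity variant mentioned in this entry, the following sketch shows one way a similarity score between an image embedding (of the generated premise image) and a hypothesis embedding could be turned into an NLI label. The two thresholds and the toy 2-d vectors are assumptions made here; the abstract does not state how similarities are mapped to labels, and real embeddings would come from a joint image-text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_from_similarity(image_vec, hypothesis_vec, entail_at=0.8, contradict_at=0.2):
    """Map an image-hypothesis similarity to an NLI label (hypothetical thresholds)."""
    sim = cosine(image_vec, hypothesis_vec)
    if sim >= entail_at:
        return "entailment"
    if sim <= contradict_at:
        return "contradiction"
    return "neutral"


# Toy embedding of the image generated from the premise.
premise_image = [1.0, 0.0]
```

In practice the thresholds would need to be chosen on held-out data, or replaced by a comparison among competing hypotheses.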

[43] Selective Rotary Position Embedding

Sajad Movahedi,Timur Carstensen,Arshia Afzal,Frank Hutter,Antonio Orvieto,Volkan Cevher

Main category: cs.CL

TL;DR: Proposes Selective RoPE, an input-dependent rotary position embedding mechanism for both linear and softmax transformers that rotates by arbitrary angles and improves performance on language modeling and difficult sequence tasks.

Details Motivation: Motivated by the benefits of selective gating in sequence modeling, the authors bring input-dependent selectivity into rotary position embeddings to strengthen the model's dynamic handling of position information. Method: Proposes Selective RoPE, in which rotation angles are determined dynamically by the input, generalizing fixed-angle RoPE; the paper also uncovers the implicit positional rotation structure in softmax attention and implements the method in gated linear transformers, with connections to state-space models. Result: Selective RoPE is validated on language modeling and on difficult sequence tasks such as copying, state tracking, and retrieval, where input-dependent rotations improve performance. Conclusion: By rotating through input-dependent, arbitrary angles, Selective RoPE unifies and strengthens position encoding across transformer variants and is an effective extension of positional representation. Abstract: Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (RoPE) encode positions through fixed-angle rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce Selective RoPE, an input-dependent rotary embedding mechanism that generalizes RoPE and enables rotation in arbitrary angles for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling and on difficult sequence tasks like copying, state tracking, and retrieval.
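
To illustrate the contrast between fixed-angle and input-dependent rotations in this entry, here is a minimal 2-d sketch. Standard RoPE rotates the vector at position t by t·θ. For the selective variant, the sketch assumes that each token contributes its own angle increment and that the angle at a position is the running sum of increments; that cumulative form, and the hand-picked increments standing in for a learned function of the input, are assumptions made here rather than details taken from the abstract.

```python
import math
from itertools import accumulate


def rotate(pair, angle):
    """Rotate a 2-d vector counter-clockwise by `angle` radians."""
    x, y = pair
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def rope_angles(seq_len, theta):
    """Standard RoPE: position t gets the fixed angle t * theta."""
    return [t * theta for t in range(seq_len)]


def selective_angles(increments):
    """Assumed selective form: angle at position t is the running sum of
    per-token increments (in a real model these would depend on the input)."""
    return list(accumulate(increments))


def score(q, k, angle_q, angle_k):
    """Dot product of a rotated query and a rotated key."""
    rq, rk = rotate(q, angle_q), rotate(k, angle_k)
    return rq[0] * rk[0] + rq[1] * rk[1]
```

With constant increments the selective angles reduce to the RoPE angles, and in both cases the query-key score depends only on the difference between the two angles, which is what lets rotations encode relative position.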

[44] PUCP-Metrix: A Comprehensive Open-Source Repository of Linguistic Metrics for Spanish

Javier Alonso Villegas Luis,Marco Antonio Sobrevilla Cabezudo

Main category: cs.CL

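TL;DR: PUCP-Metrix is an open-source repository of linguistic features for Spanish with 182 metrics covering lexical, syntactic, semantic, cohesion, and readability dimensions. It supports fine-grained text analysis and performs well on readability assessment and machine-generated text detection.

Details Motivation: Existing linguistic feature tools for Spanish have limited coverage and lack the fine-grained, interpretable features needed for tasks involving style, structure, and readability. Method: The authors build an open-source repository, PUCP-Metrix, that brings together 182 linguistic metrics, and evaluate it on automated readability assessment and machine-generated text detection against an existing repository and neural baselines. Result: PUCP-Metrix is competitive with the existing repository and strong neural baselines on both tasks, supporting its usefulness for Spanish NLP applications. Conclusion: PUCP-Metrix provides a comprehensive, extensible linguistic feature resource for Spanish that improves the interpretability of text analysis and supports diverse NLP research and applications. Abstract: Linguistic features remain essential for interpretability and tasks involving style, structure, and readability, but existing Spanish tools offer limited coverage. We present PUCP-Metrix, an open-source repository of 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability. PUCP-Metrix enables fine-grained, interpretable text analysis. We evaluate its usefulness on Automated Readability Assessment and Machine-Generated Text Detection, showing competitive performance compared to an existing repository and strong neural baselines. PUCP-Metrix offers a comprehensive, extensible resource for Spanish, supporting diverse NLP applications.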

[45] Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training

Yesheng Liu,Hao Li,Haiyu Xu,Baoqi Pei,Jiahao Wang,Mingxuan Zhao,Jingshu Zheng,Zheqi He,JG Yao,Bowen Qin,Xi Yang,Jiajun Zhang

Main category: cs.CL

TL;DR: Proposes ReVeL, a framework that rewrites multiple-choice questions as open-form questions so that the options cannot leak signals, making model training and evaluation more reliable.

Details Motivation: The options in multiple-choice questions can leak exploitable signals, which makes accuracy metrics unreliable and encourages guessing. Method: ReVeL categorizes questions by answer type and applies a different rewriting and verification scheme to each category, converting multiple-choice questions into verifiable open-form questions. Result: In RFT, models trained on ReVeL-OpenQA keep their accuracy on multiple-choice benchmarks and gain about six percentage points on open-form QA; used for evaluation, ReVeL reveals score inflation of up to 20 percentage points in multiple-choice benchmarks. Conclusion: ReVeL offers a more reliable and efficient way to train and evaluate, reducing cost and latency and improving judging accuracy. Abstract: Multiple-choice question answering (MCQA) has been a popular format for evaluation and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to different answer types and applies a different rewriting and verification scheme to each. When applied for RFT, we convert 20k MCQA examples and use GRPO to finetune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.

[46] SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

Shrikant Kendre,Austin Xu,Honglu Zhou,Michael Ryoo,Shafiq Joty,Juan Carlos Niebles

Main category: cs.CL

TL;DR: SMILE is a new evaluation metric for textual and visual question answering that combines sentence-level semantics, keyword-level semantics, and exact lexical matching; it tracks human judgment more closely than traditional metrics and is computationally lightweight.

Details Motivation: Traditional metrics such as ROUGE and EM rely on n-gram lexical similarity and miss deeper semantics; existing methods based on contextual embeddings or large models are either inflexible or costly and unstable. Method: Proposes SMILE, which fuses sentence-level semantics (via contextual embeddings), keyword-level semantics (via word alignment), and exact lexical matching, combining the three scores with weights to balance semantic and lexical similarity. Result: Extensive evaluation on text, image, and video QA tasks shows that SMILE correlates highly with human judgments at low computational cost. Conclusion: SMILE bridges the gap between lexical matching and semantic understanding, providing an efficient, flexible, and accurate metric for multimodal QA evaluation. Abstract: Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
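
The composite idea in this entry can be sketched as a weighted mix of three scores. The version below is only an illustration under stated assumptions: the two semantic similarities are taken as precomputed inputs (a real system would obtain them from an embedding model), the lexical part is a simple whitespace-token keyword check, and the weights are hypothetical rather than the values used by SMILE.

```python
def exact_keyword_match(prediction, keywords):
    """Fraction of reference keywords that appear as whole tokens in the prediction."""
    if not keywords:
        return 0.0
    # Whitespace tokenization only; punctuation attached to a word would block a match.
    tokens = set(prediction.lower().split())
    return sum(1 for k in keywords if k.lower() in tokens) / len(keywords)


def composite_score(sentence_sim, keyword_sim, lexical, weights=(0.4, 0.4, 0.2)):
    """Weighted average of sentence-level, keyword-level, and lexical scores.

    The default weights are placeholders chosen for this example.
    """
    w_s, w_k, w_l = weights
    total = w_s + w_k + w_l
    return (w_s * sentence_sim + w_k * keyword_sim + w_l * lexical) / total


# Hypothetical QA pair: the semantic scores are assumed, the lexical one is computed.
lexical = exact_keyword_match("The bus is yellow", ["yellow", "bus"])
final = composite_score(sentence_sim=0.9, keyword_sim=0.8, lexical=lexical)
```

Keeping the lexical term separate means an answer that is semantically close but misses the exact keyword is still penalized, which is the balance the abstract describes.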

[47] Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

Zhen Wang,Zhifeng Gao,Guolin Ke

Main category: cs.CL

TL;DR: Proposes MR-RLVR, which builds process-level self-supervised rewards through "masked-then-fill" and "step reordering" to improve the scalability and performance of RLVR on mathematical reasoning when only outcomes are verifiable.

Details Motivation: Existing RLVR methods scale poorly on tasks such as theorem proving because intermediate reasoning receives no supervision, and token-level SFT tends toward memorization rather than long chains of reasoning. Method: Proposes MR-RLVR, which constructs process-level self-supervised signals through masked filling and step reordering, with two-stage training: self-supervised training on mathematical calculation and proof data, followed by RLVR fine-tuning on datasets where only outcomes are verifiable. Result: On Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, under a fixed sampling and decoding budget, MR-RLVR improves over the original RLVR by an average of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8 (relative gains). Conclusion: Process-aware self-supervised signals improve the performance and scalability of RLVR on mathematical tasks where only outcomes are verifiable. Abstract: Test-time scaling has been shown to substantially improve large language models' (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR's scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR's scalability and performance in only outcome-verifiable settings.
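
The two self-supervised tasks named in this entry can be sketched on a toy solution. Everything below is an assumption for illustration: the helper names, the exact-match reward for filling, and the position-wise reward for reordering are not taken from the paper, which does not specify its reward functions in the abstract.

```python
import random


def make_mask_task(steps, index, mask_token="<MASK>"):
    """Hide one reasoning step; return the masked sequence and the hidden target."""
    masked = list(steps)
    target = masked[index]
    masked[index] = mask_token
    return masked, target


def fill_reward(prediction, target):
    """Assumed reward: 1.0 for an exact (whitespace-trimmed) match, else 0.0."""
    return 1.0 if prediction.strip() == target.strip() else 0.0


def make_reorder_task(steps, seed=0):
    """Shuffle the steps; `order[j]` is the original index of the j-th shuffled step."""
    rng = random.Random(seed)
    order = list(range(len(steps)))
    rng.shuffle(order)
    return [steps[i] for i in order], order


def reorder_reward(predicted_order, true_order):
    """Assumed reward: fraction of shuffled steps assigned their correct original index."""
    correct = sum(1 for p, t in zip(predicted_order, true_order) if p == t)
    return correct / len(true_order)


# Hypothetical three-step solution.
STEPS = ["x + 2 = 5", "x = 5 - 2", "x = 3"]
```

Both rewards can be computed from the solution text alone, which is what makes them usable when only the final outcome of a problem can be verified.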

cs.CV [Back]

[48] The persistence of painting styles

Reetikaa Reddy Munnangi,Barbara Giunti

Main category: cs.CV

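TL;DR: Explores how persistent homology (PH), a method from topological data analysis, can identify and distinguish artistic styles objectively and interpretably, both across and within artistic currents, and can tell human-made artworks from AI-generated ones.

Details Motivation: Identifying artistic style has traditionally relied on the subjective judgment of art historians, which lacks objectivity and reproducibility. The paper introduces mathematical tools, persistent homology in particular, to provide a quantitative and interpretable way of analyzing artistic style. Method: Persistent homology (PH) is used to extract topological features from artworks and is combined with statistical analysis to compare topological differences between artists and artistic currents and to distinguish genuine artworks from AI-generated ones. Result: PH distinguishes artists and artistic currents with statistical significance, even within the same current, and also separates an artist's genuine works from AI-generated images that imitate their style. Conclusion: Persistent homology offers a new, objective, and interpretable mathematical tool for analyzing artistic style and shows the potential of topological data analysis in the humanities, especially art research. Abstract: Art is a deeply personal and expressive medium, where each artist brings their own style, technique, and cultural background into their work. Traditionally, identifying artistic styles has been the job of art historians or critics, relying on visual intuition and experience. However, with the advancement of mathematical tools, we can explore art through a more structured lens. In this work, we show how persistent homology (PH), a method from topological data analysis, provides objective and interpretable insights into artistic styles. We show how PH can, with statistical certainty, differentiate between artists, both from different artistic currents and from the same one, and distinguish images of an artist from an AI-generated image in the artist's style.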

[49] Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions

Takuya Igaue,Catia Correia-Caeiro,Akito Yoshida,Takako Miyabe-Nishiwaki,Ryusuke Hayashi

Main category: cs.CV

TL;DR: Proposes a StyleGAN2-based method for generating macaque facial expressions that overcomes limited training data through data augmentation, sample selection, and loss function refinement, producing diverse expressions and enabling style-based image editing.

Details Motivation: Images of macaque facial expressions are scarce and limited in variation, which makes generating them with generative AI difficult, so new methods that cope with these data limitations are needed. Method: Uses a StyleGAN2 model with motion-transfer-based data augmentation, sample selection based on latent representations, and a refined loss function that accurately reproduces subtle movements such as eye movements. Result: The method generates diverse facial expressions for multiple macaque individuals better than models trained only on the original still images, and supports style-based image editing in which specific style parameters correspond to specific facial movements. Conclusion: The proposed method generates macaque facial expressions effectively and can disentangle motion components as style parameters, providing a useful tool for research on macaque facial expressions. Abstract: Generating animal faces using generative AI techniques is challenging because the available training images are limited both in quantity and variation, particularly for facial expressions across individuals. In this study, we focus on macaque monkeys, widely studied in systems neuroscience and evolutionary research, and propose a method to generate their facial expressions using a style-based generative image model (i.e., StyleGAN2). To address data limitations, we implemented: 1) data augmentation by synthesizing new facial expression images using motion transfer to animate still images with computer graphics, 2) sample selection based on the latent representation of macaque faces from an initially trained StyleGAN2 model to ensure the variation and uniform sampling in the training dataset, and 3) loss function refinement to ensure the accurate reproduction of subtle movements, such as eye movements. Our results demonstrate that the proposed method enables the generation of diverse facial expressions for multiple macaque individuals, outperforming models trained solely on original still images. Additionally, we show that our model is effective for style-based image editing, where specific style parameters correspond to distinct facial movements. These findings underscore the model's potential for disentangling motion components as style parameters, providing a valuable tool for research on macaque facial expressions.

[50] PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation

Ting Pan,Ye Wang,Peiguang Jing,Rui Ma,Zili Yi,Yu Liu

Main category: cs.CV

TL;DR: Introduces PairHuman, the first large-scale benchmark dataset for dual-person portrait generation, with more than 100K high-quality images and rich metadata, together with the DHumanDiff model, which generates highly personalized dual-person portraits of high visual quality.

Details Motivation: The lack of a large-scale, high-quality dataset dedicated to dual-person portrait customization has held back progress in this area. Method: The authors build the PairHuman dataset, with more than 100K images and detailed annotations (descriptions, keypoints, attribute tags, and so on), and propose the DHumanDiff model, which uses a mechanism for enhanced facial consistency and balances personalized person generation with semantic-driven scene creation. Result: Experiments show that the PairHuman dataset and the DHumanDiff model produce highly customized dual-person portraits with superior visual quality that match human preferences. Conclusion: PairHuman provides important data support for dual-person portrait generation, and DHumanDiff demonstrates effective generation that balances personalization and scene coherence, advancing research in this area. Abstract: Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, which is the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, which is a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency while balancing personalized person generation and semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at https://github.com/annaoooo/PairHuman.

[51] A Machine Learning-Driven Solution for Denoising Inertial Confinement Fusion Images

Asya Y. Akkus,Bradley T. Wolfe,Pinghan Chu,Chengkun Huang,Chris S. Campbell,Mariana Alvarado Alvarez,Petr Volegov,David Fittinghoff,Robert Reinovsky,Zhehui Wang

Main category: cs.CV

TL;DR: Proposes a mixed Gaussian-Poisson denoising method for inertial confinement fusion neutron imaging, based on an unsupervised autoencoder combined with a CDF 97 wavelet transform, which achieves lower reconstruction error and better edge preservation than conventional methods.

Details Motivation: Neutron imaging is essential to the analysis of inertial confinement fusion, but the images are often corrupted by mixed Gaussian-Poisson noise that conventional filtering struggles to remove while preserving image detail. Method: An unsupervised autoencoder with a CDF 97 wavelet transform in the latent space is used to remove mixed noise from synthetic neutron imaging data. Result: On data generated by a forward model, the method shows lower reconstruction error and better edge preservation than non-ML methods such as BM3D. Conclusion: The proposed method is a promising route for neutron image denoising and for three-dimensional reconstruction analysis of ICF experiments. Abstract: Neutron imaging is important in optimizing analysis of inertial confinement fusion (ICF) events such as those at the National Ignition Facility (NIF) and improving current and future ICF platforms. However, images of neutron sources are often degraded by various types of noise. Most commonly, Gaussian and Poisson noise coexist within one image, obscuring fine details and blurring edges. These noise types often overlap, making them difficult to distinguish and remove using conventional filtering and thresholding methods. As a result, noise removal techniques that preserve image fidelity are important for analyzing and interpreting images of a neutron source. Current solutions include a combination of filtering and thresholding methodologies. In the past, machine learning approaches were rarely implemented due to a lack of ground truth neutron imaging data for ICF processes. However, recent advances in synthetic data production, particularly in the fusion imaging field, have opened opportunities to investigate new denoising procedures using both supervised and unsupervised machine learning methods. In this study, we implement an unsupervised autoencoder with a Cohen-Daubechies-Feauveau (CDF 97) wavelet transform in the latent space for mixed Gaussian-Poisson denoising. The network successfully denoises neutron imaging data. Additionally, it demonstrates lower reconstruction error and superior edge preservation metrics when benchmarked with data generated by a forward model and compared to non-ML-based filtering mechanisms such as Block-matching and 3D filtering (BM3D). This approach presents a promising advancement in neutron image noise reduction and three-dimensional reconstruction analysis of ICF experiments.
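
The mixed Gaussian-Poisson noise this entry targets can be simulated on a toy image as follows. This is a generic textbook-style noise model written with the standard library only, not the forward model used in the paper: each pixel value is treated as an expected count, a Poisson draw supplies the signal-dependent part, and zero-mean Gaussian noise with an assumed standard deviation `sigma` is added on top.

```python
import math
import random


def poisson_sample(rng, lam):
    """Draw from Poisson(lam) with Knuth's multiplication method (small lam only)."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def add_mixed_noise(image, sigma=2.0, seed=0):
    """Return a noisy copy: Poisson(pixel) plus Gaussian(0, sigma) per pixel."""
    rng = random.Random(seed)
    return [
        [poisson_sample(rng, px) + rng.gauss(0.0, sigma) for px in row]
        for row in image
    ]


# Hypothetical flat 50x50 image with an expected count of 20 per pixel.
CLEAN = [[20.0] * 50 for _ in range(50)]
NOISY = add_mixed_noise(CLEAN)
```

For a flat image the noisy pixels should average close to the clean value, with variance near the Poisson variance (equal to the mean) plus `sigma` squared; the signal dependence of the Poisson part is what makes the mixture hard for simple filters.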

[52] SAM 3: Segment Anything with Concepts

Nicolas Carion,Laura Gustafson,Yuan-Ting Hu,Shoubhik Debnath,Ronghang Hu,Didac Suris,Chaitanya Ryali,Kalyan Vasudev Alwala,Haitham Khedr,Andrew Huang,Jie Lei,Tengyu Ma,Baishan Guo,Arpit Kalla,Markus Marks,Joseph Greer,Meng Wang,Peize Sun,Roman Rädle,Triantafyllos Afouras,Effrosyni Mavroudi,Katherine Xu,Tsung-Han Wu,Yu Zhou,Liliane Momeni,Rishi Hazra,Shuangrui Ding,Sagar Vaze,Francois Porcher,Feng Li,Siyuan Li,Aishwarya Kamath,Ho Kei Cheng,Piotr Dollár,Nikhila Ravi,Kate Saenko,Pengchuan Zhang,Christoph Feichtenhofer

Main category: cs.CV

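TL;DR: SAM 3 is a unified model that detects, segments, and tracks objects in images and videos from concept prompts (short phrases or image exemplars); it markedly improves accuracy and comes with a dataset of 4M unique concept labels.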

Details Motivation: Existing vision models struggle to unify detection, segmentation, and tracking when prompted with natural language or image exemplars, and no large-scale, high-quality concept-annotated dataset exists to support concept segmentation in complex scenes. Method: Introduces the Promptable Concept Segmentation (PCS) formulation, with an image-level detector and a memory-based video tracker that share a single backbone, and a presence head that decouples recognition from localization to improve detection accuracy; a scalable data engine produces a high-quality image and video dataset with 4M concept labels. Result: SAM 3 doubles the accuracy of existing systems on image and video PCS and improves on earlier SAM versions across visual segmentation tasks; the SA-Co benchmark and the model are released as open source. Conclusion: SAM 3 provides a unified framework for concept-prompted segmentation and tracking across images and video; with a new architecture and large-scale data it makes clear gains in accuracy and generalization, advancing open-vocabulary visual understanding. Abstract: We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
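
One way to picture "decoupling recognition from localization" is a global presence probability that gates per-instance scores. The sketch below is an assumption made for illustration: the abstract does not say how the presence head's output is combined with instance scores, so the multiplication, the threshold, and the name combine_scores are all hypothetical.

```python
def combine_scores(presence_prob, query_scores, threshold=0.5):
    """Gate per-instance match scores by an image-level presence probability.

    presence_prob: assumed answer to "does the concept appear in the image at all?"
    query_scores:  assumed answer to "does this candidate region match?", one per candidate
    """
    joint = [presence_prob * score for score in query_scores]
    keep = [index for index, score in enumerate(joint) if score >= threshold]
    return joint, keep


# Concept clearly present: confident candidates survive.
joint_present, keep_present = combine_scores(0.9, [0.8, 0.3, 0.95])

# Concept judged absent (for example a hard negative prompt): every candidate is suppressed,
# even those whose local score is high.
joint_absent, keep_absent = combine_scores(0.1, [0.8, 0.3, 0.95])

print(keep_present, keep_absent)
```

Under this reading, the per-candidate scores only need to answer where a match would be, while the single presence score answers whether the concept is there at all.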

[53] SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge

Adeel Yousaf,Joseph Fioresi,James Beetham,Amrit Singh Bedi,Mubarak Shah

Main category: cs.CV

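TL;DR: Proposes SaFeR-CLIP, a safety fine-tuning framework for vision-language models that intervenes minimally by redirecting unsafe concepts to their semantically nearest safe alternatives, recovering substantial zero-shot accuracy while remaining safe.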

Details Motivation: Existing safety fine-tuning methods force unsafe concepts onto predefined safe targets, which disrupts the model's semantic structure and causes a marked drop in generalization. Method: A proximity-aware approach redirects each unsafe concept to its semantically closest safe alternative so as to minimize representational change; the SaFeR-CLIP framework implements this idea, and a new benchmark, NSFW-Caps, evaluates safety under distributional shift. Result: Recovers up to 8.0% zero-shot accuracy over prior methods while maintaining strong safety; NSFW-Caps contains 1,000 highly aligned pairs. Conclusion: Respecting the geometry of pretrained representations is key to achieving both safety and performance. Abstract: Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model's learned semantic structure. To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SaFeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SaFeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety. To support more rigorous evaluation, we also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing safety under distributional shift. Our work shows that respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance.
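
The "semantically closest safe alternative" idea can be pictured as a nearest-neighbor lookup in embedding space. The sketch below is a toy illustration, not the authors' training procedure: the three-dimensional vectors, the labels, and the helper nearest_safe_target are hypothetical, and cosine similarity is assumed as the proximity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_safe_target(unsafe_vec, safe_bank):
    """Return (label, similarity) of the safe embedding closest to unsafe_vec."""
    scored = ((label, cosine(unsafe_vec, vec)) for label, vec in safe_bank.items())
    return max(scored, key=lambda pair: pair[1])


# Toy embeddings standing in for text-encoder outputs (assumed values).
safe_bank = {
    "safe_caption_a": [0.9, 0.1, 0.0],
    "safe_caption_b": [0.0, 0.2, 0.9],
    "safe_caption_c": [0.1, 0.9, 0.1],
}
unsafe_embedding = [0.8, 0.3, 0.1]

label, similarity = nearest_safe_target(unsafe_embedding, safe_bank)
print(label, round(similarity, 3))
```

Picking a per-concept target in this way keeps the required shift in embedding space small, in contrast to mapping every unsafe concept to one fixed target, which is the rigid strategy the abstract criticizes.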

[54] SVG360: Multi-View SVG Generation with Geometric and Color Consistency from a Single SVG

Mengnan Jiang,Zhaolin Sun,Christian Franke,Michele Franco Adesso,Antonio Haas,Grace Li Zhang

Main category: cs.CV

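TL;DR: Proposes a three-stage framework that generates multi-view consistent SVGs from a single-view SVG input, achieving geometric and color consistency through 3D lifting, a spatial memory mechanism, and path optimization.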

Details Motivation: Generating multi-view consistent SVGs from a single-view SVG is underexplored, and geometry and color must stay consistent across views. Method: The rasterized input is first lifted to a 3D representation and rendered as multi-view images; the temporal memory mechanism of SAM2 is then extended to the spatial domain, building a spatial memory bank that establishes part-level correspondences across views; finally, path consolidation and structural optimization are applied during raster-to-vector conversion. Result: The generated SVGs show strong geometric and color consistency across views, contain far fewer redundant paths, and retain fine structural detail. Conclusion: The method bridges generative modeling and structured vector representation, offering a scalable route to single-input, object-level multi-view SVG generation and supporting applications such as asset creation and semantic vector editing. Abstract: Scalable Vector Graphics (SVGs) are central to modern design workflows, offering scaling without distortion and precise editability. However, for single object SVGs, generating multi-view consistent SVGs from a single-view input remains underexplored. We present a three stage framework that produces multi-view SVGs with geometric and color consistency from a single SVG input. First, the rasterized input is lifted to a 3D representation and rendered under target camera poses, producing multi-view images of the object. Next, we extend the temporal memory mechanism of Segment Anything 2 (SAM2) to the spatial domain, constructing a spatial memory bank that establishes part level correspondences across neighboring views, yielding cleaner and more consistent vector paths and color assignments without retraining. Finally, during the raster to vector conversion, we perform path consolidation and structural optimization to reduce redundancy while preserving boundaries and semantics. The resulting SVGs exhibit strong geometric and color consistency across views, significantly reduce redundant paths, and retain fine structural details. This work bridges generative modeling and structured vector representation, providing a scalable route to single input, object level multi-view SVG generation and supporting applications such as asset creation and semantic vector editing.

[55] Mesh RAG: Retrieval Augmentation for Autoregressive Mesh Generation

Xiatao Sun,Chen Liang,Qian Wang,Daniel Rakita

Main category: cs.CV

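TL;DR: Proposes Mesh RAG, a training-free, plug-and-play framework for autoregressive 3D mesh generation that augments generation with retrieval, improving quality and speed and supporting incremental editing.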

Details Motivation: Existing autoregressive 3D mesh generation methods are bound by sequential dependency, which makes generation slow and incremental editing difficult; a method is needed that breaks this dependency and improves both efficiency and quality. Method: Inspired by RAG for language models, the approach uses point cloud segmentation, spatial transformation, and point cloud registration to retrieve, generate, and integrate mesh components, enabling parallel generation without strict sequential prediction. Result: Validated on several foundational autoregressive mesh generation models, Mesh RAG significantly improves mesh quality, speeds up generation, and enables incremental editing without retraining. Conclusion: Mesh RAG is a general, training-free framework that overcomes the quality-speed trade-off in autoregressive mesh generation and advances automated 3D asset creation. Abstract: 3D meshes are a critical building block for applications ranging from industrial design and gaming to simulation and robotics. Traditionally, meshes are crafted manually by artists, a process that is time-intensive and difficult to scale. To automate and accelerate this asset creation, autoregressive models have emerged as a powerful paradigm for artistic mesh generation. However, current methods to enhance quality typically rely on larger models or longer sequences that result in longer generation time, and their inherent sequential nature imposes a severe quality-speed trade-off. This sequential dependency also significantly complicates incremental editing. To overcome these limitations, we propose Mesh RAG, a novel, training-free, plug-and-play framework for autoregressive mesh generation models. Inspired by RAG for language models, our approach augments the generation process by leveraging point cloud segmentation, spatial transformation, and point cloud registration to retrieve, generate, and integrate mesh components. This retrieval-based approach decouples generation from its strict sequential dependency, facilitating efficient and parallelizable inference. We demonstrate the wide applicability of Mesh RAG across various foundational autoregressive mesh generation models, showing it significantly enhances mesh quality, accelerates generation speed compared to sequential part prediction, and enables incremental editing, all without model retraining.

[56] WorldGen: From Text to Traversable and Interactive 3D Worlds

Dilin Wang,Hyunyoung Jung,Tom Monnier,Kihyuk Sohn,Chuhang Zou,Xiaoyu Xiang,Yu-Ying Yeh,Di Liu,Zixuan Huang,Thu Nguyen-Phuoc,Yuchen Fan,Sergiu Oprea,Ziyan Wang,Roman Shapovalov,Nikolaos Sarafianos,Thibault Groueix,Antoine Toisoul,Prithviraj Dhar,Xiao Chu,Minghao Chen,Geon Yeong Park,Mahima Gupta,Yassir Azziz,Rakesh Ranjan,Andrea Vedaldi

Main category: cs.CV

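TL;DR: WorldGen is a system that automatically generates large-scale interactive 3D worlds from text prompts, combining LLMs, diffusion models, and procedural generation so that navigable virtual spaces can be created without specialist skills.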

Details Motivation: To lower the barrier to creating 3D virtual worlds so that non-expert users can quickly turn ideas into fully functional interactive environments. Method: A modular system that combines LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition. Result: Generates geometrically consistent, visually rich 3D worlds that render in real time and can be explored or edited directly in standard game engines. Conclusion: WorldGen advances the use of generative AI in 3D content creation, providing a scalable automated solution for gaming, simulation, and immersive social environments. Abstract: We introduce WorldGen, a system that enables the automatic creation of large-scale, interactive 3D worlds directly from text prompts. Our approach transforms natural language descriptions into traversable, fully textured environments that can be immediately explored or edited within standard game engines. By combining LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition, WorldGen bridges the gap between creative intent and functional virtual spaces, allowing creators to design coherent, navigable worlds without manual modeling or specialized 3D expertise. The system is fully modular and supports fine-grained control over layout, scale, and style, producing worlds that are geometrically consistent, visually rich, and efficient to render in real time. This work represents a step towards accessible, generative world-building at scale, advancing the frontier of 3D generative AI for applications in gaming, simulation, and immersive social environments.

[57] Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation

Xizhe Xue,Xiao Xiang Zhu

Main category: cs.CV

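TL;DR: Presents REO-Instruct, the first unified multimodal benchmark for descriptive and regression tasks in Earth Observation, filling the gap between visual understanding and the quantification of biophysical variables.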

Details Motivation: Existing Earth Observation datasets focus on semantic understanding tasks and lack benchmarks that tie multimodal perception to measurable biophysical variables such as above-ground biomass, which holds back scientific regression tasks. Method: Builds a multimodal dataset of Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline, and designs a cognitively interpretable logic chain of tasks covering human activity recognition, land-cover classification, ecological patch counting, and biomass regression. Result: Experiments show that current generic vision language models perform poorly on numeric reasoning tasks such as AGB regression, exposing the weakness of scientific VLMs in quantitative prediction. Conclusion: REO-Instruct provides a standardized platform for developing next-generation geospatial models capable of both description and scientific inference, advancing the use of vision language models in Earth Observation science. Abstract: Recent progress in vision language models (VLMs) has enabled remarkable perception and reasoning capabilities, yet their potential for scientific regression in Earth Observation (EO) remains largely unexplored. Existing EO datasets mainly emphasize semantic understanding tasks such as captioning or classification, lacking benchmarks that align multimodal perception with measurable biophysical variables. To fill this gap, we present REO-Instruct, the first unified benchmark designed for both descriptive and regression tasks in EO. REO-Instruct establishes a cognitively interpretable logic chain in a forest ecological scenario (human activity, land-cover classification, ecological patch counting, above-ground biomass (AGB) regression), bridging qualitative understanding and quantitative prediction. The dataset integrates co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline. Comprehensive evaluation protocols and baseline results across generic VLMs reveal that current models struggle with numeric reasoning, highlighting an essential challenge for scientific VLMs. REO-Instruct offers a standardized foundation for developing and assessing next-generation geospatial models capable of both description and scientific inference. The project page is publicly available at https://github.com/zhu-xlab/REO-Instruct.

[58] BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Vineet Bhat,Sungsu Kim,Valts Blukis,Greg Heinrich,Prashanth Krishnamurthy,Ramesh Karri,Stan Birchfield,Farshad Khorrami,Jonathan Tremblay

Main category: cs.CV

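TL;DR: Presents BOP-ASK, a large-scale dataset for object-interaction reasoning in vision-language models (VLMs), covering fine-grained 3D spatial understanding tasks such as grasp pose estimation, path planning, and object-to-object relationship modeling. Models trained on it show stronger spatial reasoning in complex scenes.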

Details Motivation: Spatial reasoning evaluations for vision-language models concentrate on high-level semantic relationships and do not probe the abilities real applications need: fine-grained 3D localization, physical compatibility, object affordances, and multi-step spatial planning. Finer-grained data is therefore needed to expose and improve how models understand object interactions. Method: The authors build BOP-ASK from the 6D object poses in the BOP datasets, deriving fine-grained annotations that include grasp poses, referred object poses, path planning trajectories, and depth and relative spatial relationships. The dataset contains over 150k images and 33M question-answer pairs across six tasks (four of them new), and BOP-ASK-lab, built from images not sourced from BOP, is released to test generalization. Result: Experiments on BOP-ASK-core show that VLMs trained on the dataset clearly outperform baselines in precise object localization, grasp pose prediction, trajectory planning, and fine-grained spatial reasoning in cluttered environments, and exhibit emergent capabilities; human evaluations were also conducted on BOP-ASK-core. Conclusion: BOP-ASK gives vision-language models a more challenging and realistic benchmark for object interaction and spatial reasoning, pushing models toward finer-grained, actionable embodied intelligence. Abstract: Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high level relationships ('left of,' 'behind', etc.) but ignore fine-grained spatial understanding needed for real world applications: precise 3D localization, physical compatibility between objects, object affordances and multi step spatial planning. In this work, we present BOP-ASK, a novel large scale dataset for object interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets from which we derive fine grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open sourced VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization.
Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.

[59] Parts-Mamba: Augmenting Joint Context with Part-Level Scanning for Occluded Human Skeleton

Tianyi Shen,Huijuan Xu,Nilesh Ahuja,Omesh Tickoo,Philip Shin,Vijaykrishnan Narayanan

Main category: cs.CV

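TL;DR: Proposes Parts-Mamba, a hybrid GCN-Mamba model that improves how skeleton action recognition captures context from distant joints, significantly raising accuracy under various occlusion settings.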

Details Motivation: Existing graph convolutional networks (GCNs) perform poorly when skeletons are incomplete because of body occlusion or poor communication quality, and they lack the ability to model context from distant joints. Method: Combines GCN and Mamba architectures, introducing a parts-specific scanning mechanism and a parts-body fusion module to better capture and preserve context from both local and non-neighboring joints. Result: Under different occlusion settings on the NTU RGB+D 60 and NTU RGB+D 120 datasets, the model achieves up to a 12.9% improvement in accuracy. Conclusion: By integrating long-range context, Parts-Mamba substantially improves action recognition on incomplete skeletons and shows strong robustness and practical potential. Abstract: Skeleton action recognition involves recognizing human action from human skeletons. The use of graph convolutional networks (GCNs) has driven major advances in this recognition task. In real-world scenarios, the captured skeletons are not always perfect or complete because of occlusions of parts of the human body or poor communication quality, leading to missing parts in skeletons or videos with missing frames. In the presence of such non-idealities, existing GCN models perform poorly due to missing local context. To address this limitation, we propose Parts-Mamba, a hybrid GCN-Mamba model designed to enhance the ability to capture and maintain contextual information from distant joints. The proposed Parts-Mamba model effectively captures part-specific information through its parts-specific scanning feature and preserves non-neighboring joint context via a parts-body fusion module. Our proposed model is evaluated on the NTU RGB+D 60 and NTU RGB+D 120 datasets under different occlusion settings, achieving up to 12.9% improvement in accuracy.

[60] The Joint Gromov Wasserstein Objective for Multiple Object Matching

Aryan Tajmir Riahi,Khanh Dao Duc

Main category: cs.CV

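TL;DR: Introduces the Joint Gromov-Wasserstein (JGW) objective, which extends the classical Gromov-Wasserstein distance to matching between multiple objects, identifies partially isomorphic distributions in metric spaces, and shows superior accuracy and computational efficiency on point cloud representations.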

Details Motivation: The classical Gromov-Wasserstein distance is limited to pairwise matching between single objects, which restricts its use where multiple-to-one or multiple-to-multiple matching is required; a new framework is needed to overcome this limitation. Method: Introduces the Joint Gromov-Wasserstein (JGW) objective, extending the original GW framework to simultaneous matching between collections of objects. The objective is formulated and solved by adapting traditional Optimal Transport algorithms, including entropic regularization. Result: On partial matching, the method is more accurate and computationally efficient than other GW variants; tests on synthetic and real-world datasets demonstrate its effectiveness for multiple shape matching, including geometric shapes and biomolecular complexes. Conclusion: The JGW objective generalizes the Gromov-Wasserstein distance to matching between multiple objects, with promising applications to complex matching problems in fields such as computer graphics and structural biology. Abstract: The Gromov-Wasserstein (GW) distance serves as a powerful tool for matching objects in metric spaces. However, its traditional formulation is constrained to pairwise matching between single objects, limiting its utility in scenarios and applications requiring multiple-to-one or multiple-to-multiple object matching. In this paper, we introduce the Joint Gromov-Wasserstein (JGW) objective and extend the original framework of GW to enable simultaneous matching between collections of objects. Our formulation provides a non-negative dissimilarity measure that identifies partially isomorphic distributions of mm-spaces, with point sampling convergence. We also show that the objective can be formulated and solved for point cloud object representations by adapting traditional algorithms in Optimal Transport, including entropic regularization. Our benchmarking with other variants of GW for partial matching indicates superior performance in accuracy and computational efficiency of our method, while experiments on both synthetic and real-world datasets show its effectiveness for multiple shape matching, including geometric shapes and biomolecular complexes, suggesting promising applications for solving complex matching problems across diverse domains, including computer graphics and structural biology.
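
For readers unfamiliar with the pairwise baseline that JGW extends, the sketch below evaluates the standard squared-loss GW cost of a given coupling between two small point clouds. It does not reproduce the JGW objective or any solver (entropic or otherwise); the helper names gw_cost and pairwise_dist and the toy triangles are assumptions made for illustration.

```python
import math


def pairwise_dist(points):
    """Euclidean distance matrix of a point cloud."""
    return [[math.dist(p, q) for q in points] for p in points]


def gw_cost(dx, dy, coupling):
    """Squared-loss GW cost: sum over i,j,k,l of (dx[i][k] - dy[j][l])^2 * T[i][j] * T[k][l]."""
    n, m = len(dx), len(dy)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if coupling[i][j] == 0.0:
                continue  # zero-mass pairs contribute nothing
            for k in range(n):
                for l in range(m):
                    total += (dx[i][k] - dy[j][l]) ** 2 * coupling[i][j] * coupling[k][l]
    return total


# Y is X translated by (5, 5) with its points listed in a different order.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
Y = [(5.0, 7.0), (5.0, 5.0), (6.0, 5.0)]
dx, dy = pairwise_dist(X), pairwise_dist(Y)

third = 1.0 / 3.0
matched = [[0.0, third, 0.0], [0.0, 0.0, third], [third, 0.0, 0.0]]   # X0->Y1, X1->Y2, X2->Y0
identity = [[third, 0.0, 0.0], [0.0, third, 0.0], [0.0, 0.0, third]]  # wrong correspondence

print(gw_cost(dx, dy, matched), gw_cost(dx, dy, identity))
```

The cost depends only on distances within each space, so it is unaffected by the translation and vanishes for the correct correspondence, while the wrong one is penalized. The GW distance minimizes this cost over couplings with fixed marginals, and entropic regularization is one standard way to make that minimization tractable.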

[61] Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment

Loukas Sfountouris,Giannis Daras,Paris Giampouras

Main category: cs.CV

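TL;DR: Proposes applying representation alignment (REPA) between diffusion or flow-based generative models and a pretrained self-supervised visual encoder (such as DINOv2) to improve reconstruction quality and perceptual realism in inverse problems.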

Details Motivation: Generative models used to solve inverse problems lack effective prior guidance, which limits reconstruction quality. Using the internal representations of a pretrained self-supervised encoder as an alignment target provides a stronger inductive bias. Method: Representation alignment (REPA) is applied at inference time, aligning the generative model's intermediate representations with approximate target features extracted by a pretrained visual encoder such as DINOv2; this improves reconstruction even without ground-truth signals. Theoretical results relate the REPA regularization to a divergence in the embedding space and describe how it steers the representations. Result: Across super-resolution, box inpainting, Gaussian deblurring, and motion deblurring, REPA consistently improves reconstruction quality and perceptual realism and reduces the number of discretization steps required, improving computational efficiency. Conclusion: REPA is a general and efficient way to strengthen generative-model-based inverse problem solvers, improving quality and efficiency without compromising the performance of the underlying solver. Abstract: Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a pretrained self-supervised visual encoder, such as DINOv2, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we show that aligning model representations with approximate target features can substantially enhance reconstruction fidelity and perceptual realism. We provide theoretical results showing (a) the relation between the REPA regularization and a divergence measure in the DINOv2 embedding space, and (b) how REPA updates steer the model's internal representations toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating it into multiple state-of-the-art inverse problem solvers. Extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirm that our method consistently improves reconstruction quality across tasks, while also providing substantial efficiency gains by reducing the number of required discretization steps without compromising the performance of the underlying solver.
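
A representation alignment term is often written as a similarity between per-patch features of the generative model and of the frozen encoder. The sketch below assumes a mean (1 - cosine similarity) over patches; the exact regularizer, the projection applied to the model features, and the way approximate target features are obtained are not given in the abstract, so alignment_loss and the toy features are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def alignment_loss(model_feats, target_feats):
    """Assumed form: mean (1 - cosine similarity) over patch tokens.

    model_feats:  per-patch features from the generative model (after any projection)
    target_feats: per-patch features from a frozen encoder, computed on an
                  approximate reconstruction since no ground truth is available
    """
    sims = [cosine(h, y) for h, y in zip(model_feats, target_feats)]
    return 1.0 - sum(sims) / len(sims)


target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 0.5], [3.0, 3.0]]      # same directions, different scales
misaligned = [[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]]  # orthogonal directions

print(alignment_loss(aligned, target), alignment_loss(misaligned, target))
```

In a solver, the gradient of such a term with respect to the current iterate would be added to the usual data-consistency update; that step is omitted here because it depends on the specific solver.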

[62] Glass Surface Detection: Leveraging Reflection Dynamics in Flash/No-flash Imagery

Tao Yan,Hao Huang,Yiwei Lu,Zeyu Wang,Ke Xu,Yinghui Wang,Xiaojun Chang,Rynson W. H. Lau

Main category: cs.CV

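TL;DR: Proposes NFGlassNet, a glass surface detection method based on how reflections change between flash and no-flash images; a Reflection Contrast Mining Module and a Reflection Guided Attention Module improve detection accuracy.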

Details Motivation: Existing glass surface detection methods rely on boundary or reflection cues and do not fully exploit the intrinsic properties of glass for accurate localization. The authors observe that the difference in illumination in front of and behind the glass changes the visible reflections, which motivates the new method. Method: Proposes NFGlassNet, with a Reflection Contrast Mining Module (RCMM) that extracts reflections and a Reflection Guided Attention Module (RGAM) that fuses reflection and glass surface features; the network is trained on a dataset of 3.3K flash/no-flash image pairs. Result: Experiments show that the method outperforms state-of-the-art methods. Conclusion: Exploiting reflection dynamics under flash and no-flash conditions enables more accurate glass surface detection and offers a new direction for transparent object detection in computer vision. Abstract: Glass surfaces are ubiquitous in daily life, typically appearing colorless, transparent, and lacking distinctive features. These characteristics make glass surface detection a challenging computer vision task. Existing glass surface detection methods always rely on boundary cues (e.g., window and door frames) or reflection cues to locate glass surfaces, but they fail to fully exploit the intrinsic properties of the glass itself for accurate localization. We observed that in most real-world scenes, the illumination intensity in front of the glass surface differs from that behind it, which results in variations in the reflections visible on the glass surface. Specifically, when standing on the brighter side of the glass and applying a flash towards the darker side, existing reflections on the glass surface tend to disappear. Conversely, while standing on the darker side and applying a flash towards the brighter side, distinct reflections will appear on the glass surface. Based on this phenomenon, we propose NFGlassNet, a novel method for glass surface detection that leverages the reflection dynamics present in flash/no-flash imagery. Specifically, we propose a Reflection Contrast Mining Module (RCMM) for extracting reflections, and a Reflection Guided Attention Module (RGAM) for fusing features from reflection and glass surface for accurate glass surface detection. For learning our network, we also construct a dataset consisting of 3.3K no-flash and flash image pairs captured from various scenes with corresponding ground truth annotations. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods.
Our code, model, and dataset will be available upon acceptance of the manuscript.

[63] R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios

Lu Zhu,Tiantian Geng,Yangye Chen,Teng Wang,Ping Lu,Feng Zheng

Main category: cs.CV

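TL;DR: Presents R-AVST, the first dataset for real-world audio-visual spatio-temporal reasoning, with over 5,000 untrimmed videos and 27,000 objects carrying fine-grained spatio-temporal annotations; it defines three core tasks and provides more than 8,000 high-quality question-answer pairs. On top of it, the AVST-Zero model uses reinforcement learning with multi-dimensional rewards to optimize behavior without intermediate supervision and performs competitively on audio-visual reasoning tasks.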

Details Motivation: Video understanding research mostly targets simple scenarios and cannot model the complex, diverse audio-visual events of the real world, with a clear gap in fine-grained spatio-temporal reasoning; a more realistic dataset and a more effective reasoning model are needed. Method: (1) Build the R-AVST dataset through a pipeline of LLM-based key object extraction, automatic spatial annotation, and manual quality inspection, collecting over 5K untrimmed videos across 100 types of audio-visual events; (2) define three spatio-temporal reasoning tasks and generate more than 8K evenly distributed question-answer pairs; (3) propose AVST-Zero, a reinforcement learning-based model with multi-dimensional rewards that directly optimizes behavior and avoids intermediate supervision. Result: Experiments show that R-AVST advances audio-visual spatio-temporal reasoning and that AVST-Zero performs competitively with existing models. Conclusion: R-AVST is the first dataset for real-world audio-visual spatio-temporal reasoning and provides an important benchmark for the field; AVST-Zero, trained by reinforcement learning without intermediate supervision, offers a new perspective on future complex audio-visual understanding tasks. Abstract: Recently, rapid advancements have been made in multimodal large language models (MLLMs), especially in video understanding tasks. However, current research focuses on simple video scenarios, failing to reflect the complex and diverse nature of real-world audio-visual events in videos. To bridge this gap, we first introduce R-AVST, a dataset for audio-visual reasoning featuring fine-grained spatio-temporal annotations. In constructing this, we design a pipeline consisting of LLM-based key object extraction, automatic spatial annotation and manual quality inspection, resulting in over 5K untrimmed videos with 27K objects across 100 types of audio-visual events. Building on this dataset, we define three core tasks for spatio-temporal reasoning in audio-visual scenes and generate more than 8K high-quality, evenly distributed question-answer pairs to effectively benchmark model performance. To further enhance reasoning, we propose AVST-Zero, a reinforcement learning-based model that avoids intermediate supervision, directly optimizing behavior via carefully designed multi-dimensional rewards. Extensive experiments validate the effectiveness of our R-AVST in advancing audio-visual spatio-temporal reasoning, upon which AVST-Zero demonstrates competitive performance compared to existing models. To the best of our knowledge, R-AVST is the first dataset designed for real-world audio-visual spatio-temporal reasoning, and AVST-Zero offers a novel perspective for tackling future challenges in this domain.

[64] Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models

Hao-Chien Hsueh,Chi-En Yen,Wen-Hsiao Peng,Ching-Chun Huang

Main category: cs.CV

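TL;DR: Proposes Warm Diffusion, a unified Blur-Noise Mixture Diffusion Model (BNMD) that combines the strengths of hot diffusion (noise only) and cold diffusion (blur only) by controlling blur and noise jointly and exploiting spectral dependency in images to improve generation.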

Details Motivation: Hot diffusion fails to exploit the strong correlation between high-frequency image detail and low-frequency structure, leading to random behavior early in generation; cold diffusion exploits image correlations but neglects the role of noise in shaping the data manifold, causing out-of-manifold issues and a drop in performance. A method that combines the strengths of both is needed. Method: Proposes Warm Diffusion, which uses a blur-noise mixture degradation process and a divide-and-conquer strategy that disentangles denoising from deblurring to simplify score model estimation; the Blur-to-Noise Ratio (BNR) is studied with spectral analysis to balance learning dynamics against changes in the data manifold. Result: Extensive experiments across benchmarks validate the effectiveness of the approach for image generation. Conclusion: Warm Diffusion integrates the strengths of hot and cold diffusion by modeling noise and blur jointly, exploiting the structural correlations in images while accounting for the role of noise in the data manifold. Abstract: Diffusion probabilistic models have achieved remarkable success in generative tasks across diverse data types. While recent studies have explored alternative degradation processes beyond Gaussian noise, this paper bridges two key diffusion paradigms: hot diffusion, which relies entirely on noise, and cold diffusion, which uses only blurring without noise. We argue that hot diffusion fails to exploit the strong correlation between high-frequency image detail and low-frequency structures, leading to random behaviors in the early steps of generation. Conversely, while cold diffusion leverages image correlations for prediction, it neglects the role of noise (randomness) in shaping the data manifold, resulting in out-of-manifold issues and partially explaining its performance drop. To integrate both strengths, we propose Warm Diffusion, a unified Blur-Noise Mixture Diffusion Model (BNMD), to control blurring and noise jointly. Our divide-and-conquer strategy exploits the spectral dependency in images, simplifying score model estimation by disentangling the denoising and deblurring processes. We further analyze the Blur-to-Noise Ratio (BNR) using spectral analysis to investigate the trade-off between model learning dynamics and changes in the data manifold. Extensive experiments across benchmarks validate the effectiveness of our approach for image generation.
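
The relationship between hot, cold, and mixed degradation can be sketched on a 1D signal. This is a toy illustration under assumptions: the box blur, the parameter names blur_radius and noise_sigma, and the function degrade are hypothetical, and the paper's actual blur kernel, schedules, and BNR definition are not given in the abstract.

```python
import random


def box_blur(signal, radius):
    """Moving-average blur with edge clamping; radius 0 leaves the values unchanged."""
    n = len(signal)
    blurred = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)] for j in range(i - radius, i + radius + 1)]
        blurred.append(sum(window) / len(window))
    return blurred


def degrade(signal, blur_radius, noise_sigma, rng):
    """Assumed mixture degradation: blur first, then add Gaussian noise."""
    return [value + rng.gauss(0.0, noise_sigma) for value in box_blur(signal, blur_radius)]


rng = random.Random(0)
step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]  # a sharp edge

cold = degrade(step, blur_radius=1, noise_sigma=0.0, rng=rng)   # blur only
hot = degrade(step, blur_radius=0, noise_sigma=0.5, rng=rng)    # noise only
warm = degrade(step, blur_radius=1, noise_sigma=0.5, rng=rng)   # both at once

print([round(v, 3) for v in cold])
```

Setting one of the two knobs to zero recovers the two paradigms the abstract contrasts, and the ratio between the two knobs plays the role of the trade-off that the Blur-to-Noise Ratio is said to analyze.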

[65] Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content

Shushi Wang,Zicheng Zhang,Chunyi Li,Wei Wang,Liya Ma,Fengjiao Chen,Xiaoyu Li,Xuezhi Cao,Guangtao Zhai,Xiaohong Liu

Main category: cs.CV

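TL;DR: Presents Q-Real, a new dataset of 3,088 annotated images for fine-grained evaluation of realism and plausibility in AI-generated images, builds Q-Real Bench to evaluate multi-modal large language models on judgment and reasoning tasks, and designs a fine-tuning framework to improve model performance.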

Details Motivation: Quality assessment of AI-generated content mostly relies on a single score, which gives little targeted guidance for optimization; in practice, realism and plausibility are two key dimensions of image quality, and fine-grained evaluation is needed to improve generative models effectively. Method: Builds Q-Real, a finely annotated dataset with entity locations, judgment questions, and attribution descriptions; establishes the Q-Real Bench benchmark on top of it; and designs a fine-tuning framework for multi-modal large language models, validated on multiple MLLMs. Result: Experiments demonstrate the quality and significance of the Q-Real dataset and show that the benchmark comprehensively evaluates how well MLLMs judge realism and plausibility; fine-tuned models improve markedly. Conclusion: Q-Real provides an effective tool for fine-grained quality assessment of AI-generated images, advances MLLM-based evaluation, and supports targeted optimization of generative models. Abstract: Quality assessment of AI-generated content is crucial for evaluating model capability and guiding model optimization. However, most existing quality assessment datasets and models provide only a single quality score, which is too coarse to offer targeted guidance for improving generative models. In current applications of AI-generated images, realism and plausibility are two critical dimensions, and with the emergence of unified generation-understanding models, fine-grained evaluation along these dimensions becomes especially effective for improving generative performance. Therefore, we introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. For each image, we annotate the locations of major entities and provide a set of judgment questions and attribution descriptions for these along the dimensions of realism and plausibility. Considering that recent advances in multi-modal large language models (MLLMs) enable fine-grained evaluation of AI-generated images, we construct Q-Real Bench to evaluate them on two tasks: judgment and grounding with reasoning. Finally, to enhance MLLM capabilities, we design a fine-tuning framework and conduct experiments on multiple MLLMs using our dataset. Experimental results demonstrate the high quality and significance of our dataset and the comprehensiveness of the benchmark. Dataset and code will be released upon publication.

[66] UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation

Chi Zhang,Jiepeng Wang,Youming Wang,Yuanzhi Liang,Xiaoyan Yang,Zuoxin Li,Haibin Huang,Xuelong Li

Main category: cs.CV

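TL;DR: UniModel is a unified generative model that maps both text and images into a shared visual space and supports visual understanding and generation within a single pixel-to-pixel diffusion framework.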

Details Motivation: To achieve unification at three levels (model, tasks, and representations) and advance general-purpose multimodal intelligence. Method: Text is rendered as painted text images on a canvas and all inputs and outputs are handled as RGB pixels, giving a fully vision-native formulation of multimodal learning; a range of vision-language tasks are cast as pixel-to-pixel transformations in this space, and a single unified diffusion Transformer learns the bidirectional mappings. Result: On text-to-image generation and image-to-text understanding, the model shows strong cross-modal alignment and emergent controllability, such as cycle-consistent image-caption-image loops. Conclusion: Unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence. Abstract: We present UniModel, a unified generative model that jointly supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. At the representation level, we eliminate modality discrepancies by mapping both text and images into a shared visual space: textual prompts are rendered as painted text images on a clean canvas, and all inputs and outputs are treated purely as RGB pixels. This yields a fully vision-native formulation of multimodal learning. At the task level, a broad range of vision-language problems are cast as pixel-to-pixel transformations in this visual space. For understanding tasks, the model takes an RGB image and produces a painted text image that visually encodes the semantic prediction. For generation tasks, painted text images serve as visual conditions that guide realistic and semantically aligned image synthesis. Captioning and text-to-image generation thus become different directions of the same underlying visual translation process. At the model level, we instantiate a single Unified Diffusion Transformer trained with rectified flow in pixel space. A shared backbone jointly learns bidirectional mappings between natural images and painted text images, with lightweight task embeddings to specify the desired direction. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment and emergent controllability such as cycle-consistent image-caption-image loops.
Our initial exploration suggests that unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence.

[67] DeltaDeno: Zero-Shot Anomaly Generation via Delta-Denoising Attribution

Chaoran Xu,Chengkan Lv,Qiyu Chen,Yunkang Cao,Feng Zhang,Zhengtao Zhang

Main category: cs.CV

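TL;DR: Proposes Delta-Denoising (DeltaDeno), a training-free zero-shot anomaly generation method that localizes and edits defects by contrasting the denoising of two diffusion branches driven by a minimal prompt pair.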

Details Motivation: Existing anomaly generation methods usually fine-tune on a few anomalous samples, which contradicts the scarcity of anomalies in practice and tends to overfit category priors. This work targets anomaly generation with no real anomaly samples and no training. Method: DeltaDeno drives two diffusion branches under a shared schedule with a minimal prompt pair and accumulates the per-step denoising differences into an image-specific localization map that guides latent inpainting; token-level prompt refinement and a spatial attention bias restricted to anomaly tokens improve stability and control. Result: Experiments on public datasets show strong generation quality and realism, along with consistent gains in downstream detection performance. Conclusion: DeltaDeno is an effective training-free, zero-shot anomaly generation method that produces realistic local defects without anomaly samples or training. Abstract: Anomaly generation is often framed as few-shot fine-tuning with anomalous samples, which contradicts the scarcity that motivates generation and tends to overfit category priors. We tackle the setting where no real anomaly samples or training are available. We propose Delta-Denoising (DeltaDeno), a training-free zero-shot anomaly generation method that localizes and edits defects by contrasting two diffusion branches driven by a minimal prompt pair under a shared schedule. By accumulating per-step denoising deltas into an image-specific localization map, we obtain a mask to guide the latent inpainting during later diffusion steps and preserve the surrounding context while generating realistic local defects. To improve stability and control, DeltaDeno performs token-level prompt refinement that aligns shared content and strengthens anomaly tokens, and applies a spatial attention bias restricted to anomaly tokens in the predicted region. Experiments on public datasets show that DeltaDeno achieves great generation, realism and consistent gains in downstream detection performance. Code will be made publicly available.
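
The step of "accumulating per-step denoising deltas into an image-specific localization map" can be sketched with plain nested lists. This is an assumption-laden toy: the absolute-difference accumulation, the peak normalization, the 0.5 threshold, and the names accumulate_deltas and to_mask are hypothetical, and the two prediction stacks stand in for the outputs of the two diffusion branches.

```python
def accumulate_deltas(normal_preds, anomaly_preds):
    """Sum absolute per-step differences between the two branches' predictions."""
    height, width = len(normal_preds[0]), len(normal_preds[0][0])
    heat = [[0.0] * width for _ in range(height)]
    for normal, anomaly in zip(normal_preds, anomaly_preds):
        for r in range(height):
            for c in range(width):
                heat[r][c] += abs(anomaly[r][c] - normal[r][c])
    return heat


def to_mask(heat, threshold=0.5):
    """Binarize the map relative to its peak value (assumed normalization)."""
    peak = max(max(row) for row in heat)
    if peak == 0.0:
        return [[0] * len(row) for row in heat]
    return [[1 if value / peak >= threshold else 0 for value in row] for row in heat]


steps, size = 3, 3
normal_preds = [[[0.0] * size for _ in range(size)] for _ in range(steps)]
anomaly_preds = [[[0.0] * size for _ in range(size)] for _ in range(steps)]
for step_pred in anomaly_preds:
    step_pred[1][1] = 0.5   # branches disagree strongly at the center on every step
    step_pred[0][0] = 0.1   # weak, incidental disagreement in a corner

heat = accumulate_deltas(normal_preds, anomaly_preds)
mask = to_mask(heat)
print(mask)
```

Summing over steps favors locations where the two prompts disagree consistently over locations that differ only incidentally, which is the intuition behind using the accumulated map as an inpainting mask.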

[68] Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features

Jingyi Xu,Meisong Zheng,Ying Chen,Minglang Qiao,Xin Deng,Mai Xu

Main category: cs.CV

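TL;DR: Proposes DGAF-VSR, a diffusion-based video super-resolution method that performs alignment and compensation in the feature domain and introduces an Optical Guided Warping Module and a Feature-wise Temporal Condition Module, improving perceptual quality, fidelity, and temporal consistency.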

Details Motivation: Existing diffusion-based video super-resolution methods suffer from error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity, mainly because alignment and compensation between frames are inaccurate and insufficient. Method: Proposes the DGAF-VSR framework, with an Optical Guided Warping Module (OGWM) that preserves high-frequency information at high resolution and a Feature-wise Temporal Condition Module (FTCM) that provides dense guidance in the feature domain; information compensation is done in the feature domain rather than the pixel domain, and the paper analyzes both the benefit and the non-monotonicity of warping after upscaling. Result: On synthetic and real-world datasets, DGAF-VSR outperforms state-of-the-art methods in perceptual quality (35.82% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37% tLPIPS reduction). Conclusion: By revisiting the role of inter-frame alignment and compensation, the paper validates feature-domain compensation and high-resolution alignment and offers a stronger solution for diffusion-based video super-resolution. Abstract: Diffusion model (DM) based Video Super-Resolution (VSR) approaches achieve impressive perceptual quality. However, they suffer from error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity, primarily caused by inaccurate alignment and insufficient compensation between video frames. In this paper, within the DM-based VSR pipeline, we revisit the role of alignment and compensation between adjacent video frames and reveal two crucial observations: (a) the feature domain is better suited than the pixel domain for information compensation due to its stronger spatial and temporal correlations, and (b) warping at an upscaled resolution better preserves high-frequency information, but this benefit is not necessarily monotonic. Therefore, we propose a novel Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR), with an Optical Guided Warping Module (OGWM) to maintain high-frequency details in the aligned features and a Feature-wise Temporal Condition Module (FTCM) to deliver dense guidance in the feature domain. Extensive experiments on synthetic and real-world datasets demonstrate that DGAF-VSR surpasses state-of-the-art methods in key aspects of VSR, including perceptual quality (35.82% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37% tLPIPS reduction).
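
For readers less familiar with the fidelity metric, the following is a minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE). It assumes images are nested lists of floats in [0, 1]; the function name and the data layout are illustrative, and the paper's exact evaluation protocol (color space, cropping) is not given in the abstract.

```python
import math

def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    squared_errors = [
        (r - s) ** 2
        for ref_row, res_row in zip(reference, restored)
        for r, s in zip(ref_row, res_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic, a 0.20 dB gain corresponds to roughly a 4.5% reduction in mean squared error (10^(-0.02) ≈ 0.955).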

[69] Shape-preserving Tooth Segmentation from CBCT Images Using Deep Learning with Semantic and Shape Awareness

Zongrui Ji,Zhiming Cui,Na Li,Qianhan Zheng,Miaojing Shi,Ke Deng,Jingyang Zhang,Chaoyuan Li,Xuepeng Chen,Yi Dong,Lei Ma

Main category: cs.CV

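TL;DR: Proposes a deep learning framework that combines semantic and shape awareness for accurate tooth segmentation in CBCT images, performing especially well when interdental adhesions severely distort tooth shape.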

Details Motivation: Accurate tooth segmentation in cone beam CT images remains challenging because interdental adhesions severely distort anatomical shape. Method: Introduces a multi-label learning strategy prompted by the target tooth centroid to model semantic relationships between teeth, and a tooth-shape-aware learning mechanism that explicitly enforces morphological constraints; segmentation and shape preservation are optimized jointly through multi-task learning. Result: Extensive experiments on internal and external datasets show that the method significantly outperforms existing methods. Conclusion: The method effectively mitigates shape distortion and produces anatomically more faithful tooth boundaries. Abstract: Background: Accurate tooth segmentation from cone beam computed tomography (CBCT) images is crucial for digital dentistry but remains challenging in cases of interdental adhesions, which cause severe anatomical shape distortion. Methods: To address this, we propose a deep learning framework that integrates semantic and shape awareness for shape-preserving segmentation. Our method introduces a target-tooth-centroid prompted multi-label learning strategy to model semantic relationships between teeth, reducing shape ambiguity. Additionally, a tooth-shape-aware learning mechanism explicitly enforces morphological constraints to preserve boundary integrity. These components are unified via multi-task learning, jointly optimizing segmentation and shape preservation. Results: Extensive evaluations on internal and external datasets demonstrate that our approach significantly outperforms existing methods. Conclusions: Our approach effectively mitigates shape distortions and provides anatomically faithful tooth boundaries.

[70] OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios

Hong Gao,Jingyu Wu,Xiangkai Xu,Kangni Xie,Yunchen Zhang,Bin Zhong,Xurui Gao,Min-Ling Zhang

Main category: cs.CV

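TL;DR: Introduces OmniGround, a comprehensive benchmark for spatio-temporal video grounding (STVG), and PG-TAF, a training-free two-stage framework that substantially improves grounding performance in complex real-world scenes.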

Details Motivation: Existing STVG models still fall short of real-world demands involving diverse objects and complex queries, mainly because limited benchmark scope leads to category bias, oversimplified reasoning, and poor linguistic robustness. Method: Introduces the OmniGround benchmark with 3,475 videos and 81 categories, built with a Forward-Backward-Refinement annotation pipeline to ensure label quality; also introduces the DeepSTG evaluation framework and PG-TAF, a training-free two-stage method that decomposes STVG into high-level temporal grounding and fine-grained spatio-temporal propagation. Result: Existing models lose 10.4% on average in complex scenes on OmniGround; PG-TAF improves m_tIoU and m_vIoU by 25.6% and 35.6% respectively, with consistent gains across four benchmarks. Conclusion: OmniGround and PG-TAF advance STVG in complex real-world scenes, exposing the limits of current models and offering a new technical path. Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize target objects in videos based on natural language descriptions. Despite recent advances in Multimodal Large Language Models, a significant gap remains between current models and real-world demands involving diverse objects and complex queries. We attribute this to limited benchmark scope, causing models to exhibit category bias, oversimplified reasoning, and poor linguistic robustness. To address these limitations, we introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories and complex real-world queries. We propose the Forward-Backward-Refinement annotation pipeline that combines multi-directional tracking with intelligent error correction for high-quality labels. We further introduce DeepSTG, a systematic evaluation framework quantifying dataset quality across four complementary dimensions beyond superficial statistics. Evaluations reveal an average performance drop of 10.4% on complex real-world scenes, particularly with small/occluded objects and intricate spatial relations. Motivated by these, we propose PG-TAF, a training-free two-stage framework decomposing STVG into high-level temporal grounding and fine-grained spatio-temporal propagation. Experiments demonstrate PG-TAF achieves 25.6% and 35.6% improvements in m_tIoU and m_vIoU on OmniGround with consistent gains across four benchmarks.
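
The abstract does not define m_tIoU. The sketch below assumes the definition commonly used in STVG evaluation: the temporal intersection-over-union between a predicted and a ground-truth segment, averaged over samples. The function names and the (start, end) tuple layout are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """IoU of two time segments given as (start, end) with start <= end."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

def mean_temporal_iou(preds, gts):
    """Average temporal IoU over a dataset (assumed meaning of m_tIoU)."""
    scores = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)
```

m_vIoU extends the same idea by also averaging the per-frame bounding-box IoU over the segment, which is why it is typically the stricter of the two numbers.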

[71] MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models

Xiongtao Sun,Hui Li,Jiaming Zhang,Yujie Yang,Kaili Liu,Ruxin Feng,Wen Jun Tan,Wei Yang Bryan Lim

Main category: cs.CV

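TL;DR: Proposes MultiPriv, the first benchmark for individual-level privacy reasoning in vision-language models (VLMs), and shows that existing models are seriously deficient with respect to privacy-reasoning risks.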

Details Motivation: Existing privacy benchmarks focus on attribute perception and cannot address the new threat of VLMs linking scattered information through reasoning to build individual profiles. Method: Proposes the Privacy Perception and Reasoning (PPR) framework, builds a bilingual multimodal dataset containing synthetic individual profiles, designs nine tasks covering the full PPR spectrum from attribute detection to cross-image re-identification and chained inference, and evaluates more than 50 VLMs at scale. Result: Many VLMs carry significant, previously unmeasured reasoning-based privacy risks; perception-level metrics do not predict these risks; existing safety alignment is ineffective against such attacks. Conclusion: MultiPriv exposes systemic vulnerabilities in how current VLMs protect individual privacy and provides the framework needed to evaluate and reduce privacy-reasoning risks. Abstract: Modern Vision-Language Models (VLMs) demonstrate sophisticated reasoning, escalating privacy risks beyond simple attribute perception to individual-level linkage. Current privacy benchmarks are structurally insufficient for this new threat, as they primarily evaluate privacy perception while failing to address the more critical risk of privacy reasoning: a VLM's ability to infer and link distributed information to construct individual profiles. To address this critical gap, we propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. We introduce the Privacy Perception and Reasoning (PPR) framework and construct a novel, bilingual multimodal dataset to support it. The dataset uniquely features a core component of synthetic individual profiles where identifiers (e.g., faces, names) are meticulously linked to sensitive attributes. This design enables nine challenging tasks evaluating the full PPR spectrum, from attribute detection to cross-image re-identification and chained inference. We conduct a large-scale evaluation of over 50 foundational and commercial VLMs. Our analysis reveals: (1) Many VLMs possess significant, unmeasured reasoning-based privacy risks. (2) Perception-level metrics are poor predictors of these reasoning risks, revealing a critical evaluation gap. (3) Existing safety alignments are inconsistent and ineffective against such reasoning-based attacks. MultiPriv exposes systemic vulnerabilities and provides the necessary framework for developing robust, privacy-preserving VLMs.

[72] Flow-Guided Implicit Neural Representation for Motion-Aware Dynamic MRI Reconstruction

Baoqing Li,Yuanyuan Liu,Congcong Liu,Qingyong Zhu,Jing Cheng,Yihang Zhou,Hao Chen,Zhuo-Xu Cui,Dong Liang

Main category: cs.CV

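TL;DR: Proposes an implicit neural representation (INR) framework that jointly models the dynamic MRI image sequence and its motion field, using the optical flow equation and a data consistency constraint to achieve high-quality reconstruction without prior flow estimation.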

Details Motivation: Conventional dynamic MRI reconstruction relies on pre-estimated optical flow to compensate for motion, but flow estimation is inaccurate under undersampling and degrades reconstruction quality. Method: Two implicit neural representations (INRs) model the dynamic image sequence and the optical flow field respectively; the optical flow equation acts as a physics-inspired regularizer, and a k-space data consistency loss is added for joint optimization. Result: On cardiac dMRI data, the method outperforms existing state-of-the-art approaches in reconstruction quality, motion estimation accuracy, and temporal fidelity. Conclusion: Implicit joint modeling with optical flow regularization improves dynamic MRI reconstruction without relying on prior motion estimation. Abstract: Dynamic magnetic resonance imaging (dMRI) captures temporally-resolved anatomy but is often challenged by limited sampling and motion-induced artifacts. Conventional motion-compensated reconstructions typically rely on pre-estimated optical flow, which is inaccurate under undersampling and degrades reconstruction quality. In this work, we propose a novel implicit neural representation (INR) framework that jointly models both the dynamic image sequence and its underlying motion field. Specifically, one INR is employed to parameterize the spatiotemporal image content, while another INR represents the optical flow. The two are coupled via the optical flow equation, which serves as a physics-inspired regularization, in addition to a data consistency loss that enforces agreement with k-space measurements. This joint optimization enables simultaneous recovery of temporally coherent images and motion fields without requiring prior flow estimation. Experiments on dynamic cardiac MRI datasets demonstrate that the proposed method outperforms state-of-the-art motion-compensated and deep learning approaches, achieving superior reconstruction quality, accurate motion estimation, and improved temporal fidelity. These results highlight the potential of implicit joint modeling with flow-regularized constraints for advancing dMRI reconstruction.
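
The coupling term can be made concrete with the classical optical flow (brightness constancy) equation, I_t + u·I_x + v·I_y = 0. The sketch below evaluates its residual on a pair of discrete frames using finite differences. This is a stand-in: with INRs the derivatives would normally come from automatic differentiation of the networks, and the function name and grid layout here are assumptions made for illustration.

```python
def flow_residual(frame0, frame1, u, v):
    """Residual of I_t + u * I_x + v * I_y at interior pixels.

    frame0, frame1: H x W nested lists (intensity at times t and t+1).
    u, v: H x W nested lists with horizontal and vertical flow.
    Central differences in space, forward difference in time.
    """
    height, width = len(frame0), len(frame0[0])
    residual = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            i_x = (frame0[y][x + 1] - frame0[y][x - 1]) / 2.0
            i_y = (frame0[y + 1][x] - frame0[y - 1][x]) / 2.0
            i_t = frame1[y][x] - frame0[y][x]
            row.append(i_t + u[y][x] * i_x + v[y][x] * i_y)
        residual.append(row)
    return residual
```

A regularizer of this kind would penalize the squared residual, so that the image and flow representations are pushed toward a mutually consistent explanation of the motion.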

[73] FingerCap: Fine-grained Finger-level Hand Motion Captioning

Xin Shen,Rui Zhu,Lei Shen,Xinyu Wang,Kaihao Zhang,Tianqing Zhu,Shuchen Wu,Chenxi Miao,Weikang Li,Yang Li,Deguo Xia,Jizhou Huang,Xin Yu

Main category: cs.CV

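TL;DR: Introduces the FingerCap task and the FingerCap-40K dataset for generating text that describes finger-level hand motion, and proposes FiGOP to strengthen Video-MLLMs' understanding of subtle hand movements.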

Details Motivation: Under sparse sampling, existing Video-MLLMs struggle to capture high-frequency, subtle finger dynamics and lack fine-grained understanding of finger-level actions. Method: Proposes FiGOP, which pairs each RGB keyframe with the hand keypoint sequence that follows it; a lightweight temporal encoder turns the keypoints into motion embeddings and fuses them into the RGB features, recovering fine temporal information. Result: Experiments on FingerCap-40K show that current mainstream Video-MLLMs perform poorly on finger-level reasoning, while adding FiGOP yields clear gains under both HandJudge automatic evaluation and human studies. Conclusion: By incorporating keypoint sequences, FiGOP compensates for sparse RGB sampling and offers an efficient, compute-friendly approach to fine-grained hand motion understanding. Abstract: Understanding fine-grained human hand motion is fundamental to visual perception, embodied intelligence, and multimodal communication. In this work, we propose Fine-grained Finger-level Hand Motion Captioning (FingerCap), which aims to generate textual descriptions that capture detailed finger-level semantics of hand actions. To support this task, we curate FingerCap-40K, a large-scale corpus of 40K paired hand-motion videos and captions spanning two complementary sources: concise instruction-style finger motions and diverse, naturalistic hand-object interactions. To enable effective evaluation, we employ HandJudge, an LLM-based rubric that measures finger-level correctness and motion completeness. Temporal sparsity remains a fundamental bottleneck for current Video-MLLMs, since sparse RGB sampling is insufficient to capture the subtle, high-frequency dynamics underlying fine finger motions. As a simple and compute-friendly remedy, we introduce FiGOP (Finger Group-of-Pictures), which pairs each RGB keyframe with subsequent hand keypoints until the next keyframe. A lightweight temporal encoder converts the keypoints into motion embeddings and integrates them with RGB features. FiGOP adapts the classic GOP concept to finger motion, recovering fine temporal cues without increasing RGB density. Experiments on FingerCap-40K show that strong open- and closed-source Video-MLLMs still struggle with finger-level reasoning, while our FiGOP-augmented model yields consistent gains under HandJudge and human studies.
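
The grouping that FiGOP describes, each RGB keyframe paired with the keypoint frames up to the next keyframe, can be sketched as plain index bookkeeping. The uniform keyframe stride and the function name `build_figops` are assumptions for illustration; the abstract does not say how keyframes are chosen.

```python
def build_figops(num_frames, keyframe_stride):
    """Group frame indices into (keyframe, following keypoint frames) pairs.

    Frame k is sampled as an RGB keyframe every `keyframe_stride` frames;
    the frames after it, up to but excluding the next keyframe, contribute
    only hand keypoints.
    """
    groups = []
    for key in range(0, num_frames, keyframe_stride):
        following = list(range(key + 1, min(key + keyframe_stride, num_frames)))
        groups.append((key, following))
    return groups
```

For a 7-frame clip with stride 3, `build_figops(7, 3)` gives `[(0, [1, 2]), (3, [4, 5]), (6, [])]`: three RGB frames are encoded, while the remaining four frames add only lightweight keypoint input.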

[74] Point-Supervised Facial Expression Spotting with Gaussian-Based Instance-Adaptive Intensity Modeling

Yicheng Deng,Hideaki Hayashi,Hajime Nagahara

Main category: cs.CV

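TL;DR: Proposes a two-branch framework for point-supervised facial expression spotting (P-FES) that produces soft pseudo-labels through Gaussian-based instance-adaptive intensity modeling (GIM) and adds a class-aware apex classification branch to separate macro- and micro-expressions.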

Details Motivation: Existing methods depend on costly temporal boundary annotations; this work aims to spot expressions effectively using only a single timestamp annotation per instance. Method: A two-branch framework: (1) a Gaussian-based instance-adaptive intensity modeling (GIM) module that generates soft pseudo-labels for expression frames; (2) a class-aware apex classification branch that classifies macro- and micro-expressions; an intensity-aware contrastive loss strengthens feature learning. Result: Extensive experiments on the SAMM-LV, CAS(ME)², and CAS(ME)³ datasets validate the effectiveness of the proposed method. Conclusion: Under point supervision, the method greatly reduces annotation cost while improving spotting performance, particularly in distinguishing expressions of different intensity and suppressing neutral noise. Abstract: Automatic facial expression spotting, which aims to identify facial expression instances in untrimmed videos, is crucial for facial expression analysis. Existing methods primarily focus on fully-supervised learning and rely on costly, time-consuming temporal boundary annotations. In this paper, we investigate point-supervised facial expression spotting (P-FES), where only a single timestamp annotation per instance is required for training. We propose a unique two-branch framework for P-FES. First, to mitigate the limitation of hard pseudo-labeling, which often confuses neutral and expression frames with various intensities, we propose a Gaussian-based instance-adaptive intensity modeling (GIM) module to model instance-level expression intensity distribution for soft pseudo-labeling. By detecting the pseudo-apex frame around each point label, estimating the duration, and constructing an instance-level Gaussian distribution, GIM assigns soft pseudo-labels to expression frames for more reliable intensity supervision. The GIM module is incorporated into our framework to optimize the class-agnostic expression intensity branch. Second, we design a class-aware apex classification branch that distinguishes macro- and micro-expressions solely based on their pseudo-apex frames. 
During inference, the two branches work independently: the class-agnostic expression intensity branch generates expression proposals, while the class-aware apex-classification branch is responsible for macro- and micro-expression classification. Furthermore, we introduce an intensity-aware contrastive loss to enhance discriminative feature learning and suppress neutral noise by contrasting neutral frames with expression frames with various intensities. Extensive experiments on the SAMM-LV, CAS(ME)², and CAS(ME)³ datasets demonstrate the effectiveness of our proposed framework.
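
The soft pseudo-labeling idea can be sketched as a Gaussian centered on the pseudo-apex frame, with a width derived from the estimated duration. The mapping from duration to standard deviation used below (duration / 4, so that roughly ±2σ spans the instance) is an assumption for illustration; the abstract does not state how GIM sets it.

```python
import math

def gaussian_soft_labels(num_frames, apex, duration):
    """Soft intensity labels in (0, 1] that peak at the pseudo-apex frame.

    apex: index of the pseudo-apex frame.
    duration: estimated instance length in frames (must be > 0).
    sigma = duration / 4 is a hypothetical choice for this sketch.
    """
    sigma = duration / 4.0
    return [
        math.exp(-((t - apex) ** 2) / (2.0 * sigma ** 2))
        for t in range(num_frames)
    ]
```

Compared with hard 0/1 labels, frames near the instance boundary receive small but nonzero targets, which is the property the abstract credits with reducing confusion between neutral frames and low-intensity expression frames.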

[75] Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models

Dailan He,Guanlin Feng,Xingtong Ge,Yazhe Niu,Yi Zhang,Bingqi Ma,Guanglu Song,Yu Liu,Hongsheng Li

Main category: cs.CV

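TL;DR: Proposes Neighbor GRPO, an SDE-free alignment algorithm that generates candidate trajectories by perturbing the ODE's initial noise and optimizes a distance-based objective, enabling efficient preference alignment of generative models that stays compatible with high-order solvers.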

Details Motivation: Existing SDE-based GRPO methods, when applied to modern flow matching models, suffer from inefficient credit assignment and incompatibility with high-order solvers for few-step sampling, and the injected stochasticity adds complexity. Method: Reinterprets SDE-based GRPO from a distance optimization perspective, revealing it to be a form of contrastive learning; proposes Neighbor GRPO, which generates diverse candidate trajectories by perturbing the ODE's initial noise, optimizes a softmax distance-based surrogate objective, and adds symmetric anchor sampling and group-wise quasi-norm reweighting to improve efficiency and counter reward flattening. Result: Neighbor GRPO significantly outperforms SDE-based methods in training cost, convergence speed, and generation quality, while keeping the efficiency of deterministic ODE sampling and compatibility with high-order solvers. Conclusion: Neighbor GRPO offers a more efficient and simpler paradigm for preference alignment of flow matching models, achieving strong performance without SDEs and advancing reinforcement-learning alignment under deterministic sampling. Abstract: Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of its deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from issues of inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. 
Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.
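
One way to read "softmax distance-based surrogate" is as a categorical distribution over the candidate trajectories, with probabilities that decay with squared distance from the model's current output, combined with GRPO's usual group-normalized advantages. The sketch below follows that reading. The squared Euclidean distance, the temperature parameter, and both function names are assumptions; the paper's exact objective is not given in the abstract.

```python
import math

def softmax_distance_policy(sample, candidates, temperature=1.0):
    """Probability of each candidate, higher when it lies closer to `sample`."""
    logits = [
        -sum((a - b) ** 2 for a, b in zip(sample, cand)) / temperature
        for cand in candidates
    ]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within the group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

Under this reading, increasing the log-probability of candidates with positive advantage pulls the deterministic ODE output toward high-reward neighbors and away from low-reward ones, which matches the contrastive interpretation mentioned in the abstract.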

[76] MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis

Di Luo,Shuhui Yang,Mingxin Yang,Jiawei Lu,Yixuan Tang,Xintong Han,Zhuo Chen,Beibei Wang,Chunchao Guo

Main category: cs.CV

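TL;DR: Introduces MatPedia, a foundation model built on a novel joint RGB-PBR representation that encodes a material's appearance and physical properties in two interdependent latents; using video diffusion architectures, it generates high-quality, diverse 1024×1024 materials and performs strongly on text-to-material, image-to-material, and intrinsic decomposition tasks.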

Details Motivation: Existing material generation methods lack a unified representation bridging natural image appearance and physically-based rendering (PBR) properties, so they cannot exploit large-scale RGB image data and depend on fragmented, task-specific pipelines. A general framework is needed that handles multiple material tasks and incorporates visual priors. Method: Proposes a joint representation that encodes RGB appearance and PBR properties as a five-frame sequence, and trains MatPedia with a video diffusion architecture on MatHybrid-410K, a mixed corpus of PBR datasets and large-scale RGB images, giving a single model for multiple tasks. Result: MatPedia synthesizes materials at native 1024×1024 resolution, substantially surpasses existing methods in quality and diversity, and supports text-to-material, image-to-material, and intrinsic decomposition. Conclusion: With its unified joint RGB-PBR representation and video diffusion architecture, MatPedia provides a versatile, high-performing foundation model for material generation and moves material modeling toward greater efficiency and scalability. Abstract: Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and inability to leverage large-scale RGB image data. We present MatPedia, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-frame sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks--text-to-material generation, image-to-material generation, and intrinsic decomposition--within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native 1024×1024 synthesis that substantially surpasses existing approaches in both quality and diversity.

[77] Two Heads Better than One: Dual Degradation Representation for Blind Super-Resolution

Hsuan Yuan,Shao-Yu Weng,I-Hsuan Lo,Wei-Chen Chiu,Yu-Syuan Xu,Hao-Chien Hsueh,Jen-Hui Chuang,Ching-Chun Huang

Main category: cs.CV

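TL;DR: Proposes a Dual Branch Degradation Extractor Network for blind super-resolution (blind SR) that extracts blur and noise degradation features without supervision and uses each to guide the SR network, achieving state-of-the-art performance.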

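Details Motivation: Existing single image super-resolution methods work well under known, fixed degradation, but their performance drops sharply when the actual degradation is unknown or departs from the assumptions, especially when it includes unmodeled noise. More robust blind SR methods are therefore needed. Method: Proposes a dual-branch degradation extractor network that predicts two unsupervised degradation embeddings representing blur and noise; the SR network adapts to the two embeddings in distinct ways; the degradation extractor also serves as a regularizer that exploits differences between SR and HR images. Result: Extensive experiments on several benchmarks show that the method reaches state-of-the-art (SOTA) performance on blind super-resolution. Conclusion: The proposed dual-branch degradation extractor models complex, unknown real-world degradation effectively, and its disentangled representation of blur and noise improves the robustness and restoration quality of blind SR. Abstract: Previous methods have demonstrated remarkable performance in single image super-resolution (SISR) tasks with known and fixed degradation (e.g., bicubic downsampling). However, when the actual degradation deviates from these assumptions, these methods may experience significant declines in performance. In this paper, we propose a Dual Branch Degradation Extractor Network to address the blind SR problem. While some blind SR methods assume noise-free degradation and others do not explicitly consider the presence of noise in the degradation model, our approach predicts two unsupervised degradation embeddings that represent blurry and noisy information. The SR network can then be adapted to blur embedding and noise embedding in distinct ways. Furthermore, we treat the degradation extractor as a regularizer to capitalize on differences between SR and HR images. Extensive experiments on several benchmarks demonstrate our method achieves SOTA performance in the blind SR problem.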

[78] Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices

Jigyasa Gupta,Soumya Goyal,Anil Kumar,Ishan Jindal

Main category: cs.CV

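TL;DR: Proposes an edge-efficient generative model, guided by recipe and cooking state, that synthesizes realistic cooked food images from raw food images, and introduces a domain-specific Culinary Image Similarity (CIS) metric to ensure temporal consistency and culinary plausibility.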

Details Motivation: Existing image-to-image generation methods often produce unrealistic cooked food images on edge devices or consume too many resources, and they do not accurately model the complex changes in texture, color, and structure during cooking. Method: Builds the first oven-based cooking-progression dataset with chef-annotated doneness levels, proposes a lightweight generator conditioned on the raw food image, the recipe, and the cooking state, and introduces Culinary Image Similarity (CIS) as both a training loss and a progress-monitoring metric. Result: FID improves by 30% on the authors' dataset and by 60% on public datasets, clearly outperforming existing baselines. Conclusion: The method generates more realistic cooked food images on resource-constrained edge devices, and the CIS metric keeps the outputs visually sound and consistent with cooking logic. Abstract: Synthesizing realistic cooked food images from raw inputs on edge devices is a challenging generative task, requiring models to capture complex changes in texture, color and structure during cooking. Existing image-to-image generation methods often produce unrealistic results or are too resource-intensive for edge deployment. We introduce the first oven-based cooking-progression dataset with chef-annotated doneness levels and propose an edge-efficient recipe and cooking state guided generator that synthesizes realistic food images conditioned on a raw food image. This formulation enables user-preferred visual targets rather than fixed presets. To ensure temporal consistency and culinary plausibility, we introduce a domain-specific Culinary Image Similarity (CIS) metric, which serves both as a training loss and a progress-monitoring signal. Our model outperforms existing baselines with significant reductions in FID scores (30% improvement on our dataset; 60% on public datasets).

[79] The Finer the Better: Towards Granular-aware Open-set Domain Generalization

Yunyun Wang,Zheng Duan,Xinyue Liao,Ke-Jia Chen,Songcan Chen

Main category: cs.CV

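TL;DR: Proposes SeeCLIP, a semantic-enhanced framework that addresses the risk trade-off between known and unknown classes in open-set domain generalization (OSDG) through fine-grained semantic enhancement, duplex contrastive learning, and semantic-guided diffusion that generates pseudo-unknown samples.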

Details Motivation: In open-set scenarios where domain shift and novel categories coexist, existing methods struggle to balance the structural risk of known classes against the open-space risk of unknown classes, and they tend to be over-confident on "hard unknowns" that look similar to known classes, which hurts performance. Method: Proposes the SeeCLIP framework, with a semantic-aware prompt enhancement module that extracts discriminative semantic tokens; duplex contrastive learning (repulsion from known classes and cohesion for unknown classes) to position unknown prompts; and a semantic-guided diffusion module that perturbs semantic tokens to generate pseudo-unknown samples that are visually similar but locally different, used as hard negatives. Result: Experiments on five benchmark datasets show average gains of 3% in classification accuracy and 5% in H-score over the previous best methods. Conclusion: Through fine-grained semantic enhancement and pseudo-unknown generation, SeeCLIP eases the conflict between known-class and unknown-class risk in open-set domain generalization and markedly improves recognition of hard unknowns and overall generalization. Abstract: Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still fall into the dilemma between structural risk of known classes and open-space risk from unknown classes, and easily suffer from over-confidence, especially when distinguishing "hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.
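
The H-score is not defined in the abstract. The sketch below assumes the definition commonly used in open-set evaluation: the harmonic mean of accuracy on known classes and accuracy on unknown-class samples. The function name and argument names are illustrative.

```python
def h_score(known_accuracy, unknown_accuracy):
    """Harmonic mean of known-class and unknown-class accuracy (both in [0, 1])."""
    total = known_accuracy + unknown_accuracy
    if total == 0:
        return 0.0
    return 2.0 * known_accuracy * unknown_accuracy / total
```

Because it is a harmonic mean, the score stays low unless both terms are high: a model that labels everything as known gets high known-class accuracy but an H-score of 0, which is why the metric is reported alongside plain accuracy.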

[80] Gradient-Driven Natural Selection for Compact 3D Gaussian Splatting

Xiaobin Deng,Qiuli Yu,Changyu Diao,Min Li,Duanqing Xu

Main category: cs.CV

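TL;DR: Proposes a natural-selection-inspired pruning framework for 3D Gaussian primitives that models survival pressure as a regularization gradient field on opacity, letting the optimization itself decide which Gaussians to keep or prune; the method is fully learnable and needs no human intervention.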

Details Motivation: Existing pruning methods for 3D Gaussian primitives rely on hand-designed criteria or introduce extra learnable parameters, giving suboptimal results, and 3DGS incurs heavy storage and computation overhead. Method: Proposes a natural-selection-inspired pruning framework that models survival pressure as a regularization gradient field acting on opacity and lets the optimization gradients decide what to prune; also introduces an opacity decay technique with a finite opacity prior to speed up selection. Result: Achieves over 0.6 dB PSNR gain compared with 3DGS under a 15% budget, setting state-of-the-art performance for compact 3DGS. Conclusion: The method delivers efficient, fully learnable pruning without human intervention, compressing the model substantially while preserving rendering quality and advancing compact 3D scene representation. Abstract: 3DGS employs a large number of Gaussian primitives to fit scenes, resulting in substantial storage and computational overhead. Existing pruning methods rely on manually designed criteria or introduce additional learnable parameters, yielding suboptimal results. To address this, we propose a natural-selection-inspired pruning framework that models survival pressure as a regularization gradient field applied to opacity, allowing the optimization gradients--driven by the goal of maximizing rendering quality--to autonomously determine which Gaussians to retain or prune. This process is fully learnable and requires no human intervention. We further introduce an opacity decay technique with a finite opacity prior, which accelerates the selection process without compromising pruning effectiveness. Compared to 3DGS, our method achieves over 0.6 dB PSNR gain under 15% budgets, establishing state-of-the-art performance for compact 3DGS. Project page https://xiaobin2001.github.io/GNS-web.

[81] A Diversity-optimized Deep Ensemble Approach for Accurate Plant Leaf Disease Detection

Sai Nath Chowdary Medikonduru,Hongpeng Jin,Yanzhao Wu

Main category: cs.CV

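TL;DR: Proposes the Synergistic Diversity (SQ) framework, which improves image-based plant disease detection accuracy by using a better ensemble diversity metric to guide model selection.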

Details Motivation: Existing ensemble diversity metrics often fail to identify the best model combinations, which limits plant disease detection accuracy. Method: Proposes the SQ diversity metric, which measures the synergy between ensemble members, and validates it experimentally on a plant leaf image dataset. Result: The SQ metric guides ensemble selection more effectively and significantly improves detection accuracy. Conclusion: The SQ framework offers a more reliable and efficient ensemble learning approach for image-based plant disease detection. Abstract: Plant diseases pose a significant threat to global agriculture, causing over $220 billion in annual economic losses and jeopardizing food security. The timely and accurate detection of these diseases from plant leaf images is critical to mitigating their adverse effects. Deep neural network Ensembles (Deep Ensembles) have emerged as a powerful approach to enhancing prediction accuracy by leveraging the strengths of diverse Deep Neural Networks (DNNs). However, selecting high-performing ensemble member models is challenging due to the inherent difficulty in measuring ensemble diversity. In this paper, we introduce the Synergistic Diversity (SQ) framework to enhance plant disease detection accuracy. First, we conduct a comprehensive analysis of the limitations of existing ensemble diversity metrics (denoted as Q metrics), which often fail to identify optimal ensemble teams. Second, we present the SQ metric, a novel measure that captures the synergy between ensemble members and consistently aligns with ensemble accuracy. Third, we validate our SQ approach through extensive experiments on a plant leaf image dataset, which demonstrates that our SQ metric substantially improves ensemble selection and enhances detection accuracy. Our findings pave the way for more reliable and efficient image-based plant disease detection.
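
To make "ensemble diversity metric" concrete, the sketch below computes one classical pairwise measure, binary disagreement, and averages it over an ensemble team. This is offered only as an example of the kind of existing metric the abstract contrasts against; it is not the SQ metric, whose definition is not given here, and the function names are illustrative.

```python
from itertools import combinations

def disagreement(correct_a, correct_b):
    """Fraction of samples on which exactly one of two models is correct.

    correct_a / correct_b: equal-length lists of 0/1 correctness flags.
    """
    differing = sum(1 for a, b in zip(correct_a, correct_b) if a != b)
    return differing / len(correct_a)

def team_diversity(correctness_by_model):
    """Mean pairwise disagreement over all model pairs in a team."""
    pairs = list(combinations(correctness_by_model, 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)
```

A measure like this rewards models that err on different samples but ignores whether their combined vote is actually right, which is one plausible reason such metrics can diverge from ensemble accuracy.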

[82] RadioKMoE: Knowledge-Guided Radiomap Estimation with Kolmogorov-Arnold Networks and Mixture-of-Experts

Fupei Guo,Kerry Pan,Songyang Zhang,Yue Wang,Zhi Ding

Main category: cs.CV

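TL;DR: Proposes RadioKMoE, a knowledge-guided mixture-of-experts framework that improves the accuracy of radiomap (wireless signal propagation map) estimation in complex environments.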

Details Motivation: Complex radio propagation behavior and environments challenge conventional radiomap estimation, calling for more accurate and robust methods that model both global propagation patterns and local detail. Method: Combines Kolmogorov-Arnold Networks (KAN) with Mixture-of-Experts (MoE): the KAN produces an initial coarse coverage map, which together with environmental information drives an MoE network for refined estimation, with each expert focusing on a different radiomap pattern. Result: In both multi-band and single-band radiomap estimation, RadioKMoE shows higher accuracy and robustness than conventional methods. Conclusion: RadioKMoE fuses physical prior knowledge with data-driven modeling, improving radiomap estimation in complex scenes while balancing global consistency and local detail. Abstract: Radiomap serves as a vital tool for wireless network management and deployment by providing powerful spatial knowledge of signal propagation and coverage. However, increasingly complex radio propagation behavior and surrounding environments pose strong challenges for radiomap estimation (RME). In this work, we propose a knowledge-guided RME framework that integrates Kolmogorov-Arnold Networks (KAN) with Mixture-of-Experts (MoE), namely RadioKMoE. Specifically, we design a KAN module to predict an initial coarse coverage map, leveraging KAN's strength in approximating physics models and global radio propagation patterns. The initial coarse map, together with environmental information, drives our MoE network for precise radiomap estimation. Unlike conventional deep learning models, the MoE module comprises expert networks specializing in distinct radiomap patterns to improve local details while preserving global consistency. Experimental results in both multi- and single-band RME demonstrate the enhanced accuracy and robustness of the proposed RadioKMoE in radiomap estimation.

[83] DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction

Jonathan Skaza,Parsa Madinei,Ziqi Wen,Miguel Eckstein

Main category: cs.CV

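TL;DR: Proposes DReX, a vision-only model that fuses DINO and ResNet representations; it reaches state-of-the-art image complexity prediction with fewer parameters, showing that language information is not required.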

Details Motivation: To examine whether language information is needed to predict the visual complexity of images, and to improve on existing vision-only methods. Method: Proposes DReX, which uses a learnable attention mechanism to fuse multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, capturing both low-level texture and high-level semantic structure. Result: Reaches Pearson r = 0.9581 on the IC9600 benchmark, surpassing all previous methods including multimodal ones while using about 21.5x fewer parameters; generalizes well across datasets and metrics (Pearson, Spearman, RMSE, MAE); ablation and attention analyses confirm that features from the two backbones are complementary. Conclusion: Visual features alone can deliver complexity prediction aligned with human perception, and self-supervised Transformers and supervised CNNs yield synergistic gains when properly fused. Abstract: Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods--including those trained on multimodal image-text data--while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. 
Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.
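
The four evaluation metrics named in the abstract have standard definitions, sketched below in plain Python so the reported numbers (for example Pearson r = 0.9581) can be interpreted. The function names are illustrative; Spearman is computed as the Pearson correlation of average ranks, which handles ties.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def rmse(pred, target):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def mae(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

Pearson and Spearman measure how well predicted scores track human ratings (linearly and by rank), while RMSE and MAE measure absolute error, so a model can rank images well yet still be miscalibrated in scale.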

[84] DepthFocus: Controllable Depth Estimation for See-Through Scenes

Junhong Min,Jimin Kim,Cheol-Hui Min,Minwook Kim,Youngpil Jeon,Minyong Choi

Main category: cs.CV

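TL;DR: Introduces DepthFocus, an intent-driven, steerable Vision Transformer for stereo depth estimation. Unlike conventional static methods, it adapts its computation to a scalar depth preference and focuses on the target depth, enabling selective perception in complex scenes. Trained on a newly built synthetic dataset of 500k multi-layered samples, DepthFocus reaches state of the art on both single-depth and multi-depth benchmarks and generalizes strongly to unseen see-through scenes.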

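Details Motivation: Depth in the real world is often multi-layered; in particular, the layered ambiguity caused by transparent materials is hard for conventional perception systems to handle. Existing models passively estimate a static depth map of the nearest surface, whereas humans actively shift their attention to perceive a specific depth. A dynamic depth estimation method that mimics this active human perception is therefore needed. Method: Proposes DepthFocus, a steerable stereo depth estimation model based on the Vision Transformer. Conditioned on a scalar depth preference, the model dynamically adjusts its computation to focus on the specified depth. It is trained on a newly built synthetic dataset of 500k multi-layered samples covering rich transmissive and reflective effects, and validated on BOOSTER and newly proposed multi-depth datasets. Result: DepthFocus achieves state-of-the-art performance on single-depth benchmarks such as BOOSTER, shows quantitatively intent-aligned estimation on the newly proposed real and synthetic multi-depth datasets, and generalizes strongly to unseen see-through scenes. Conclusion: DepthFocus recasts stereo depth estimation as an intent-driven control task and realizes human-like active 3D perception, with strong performance and robustness on transparent objects and complex multi-layered scenes; it is a significant step toward active 3D perception. Abstract: Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive, attempting to estimate static depth maps anchored to the nearest surface, while humans actively shift focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. Conditioned on a scalar depth preference, the model dynamically adapts its computation to focus on the intended depth, enabling selective perception within complex scenes. The training primarily leverages our newly constructed 500k multi-layered synthetic dataset, designed to capture diverse see-through effects. DepthFocus not only achieves state-of-the-art performance on conventional single-depth benchmarks like BOOSTER, a dataset notably rich in transparent and reflective objects, but also quantitatively demonstrates intent-aligned estimation on our newly proposed real and synthetic multi-depth datasets. Moreover, it exhibits strong generalization capabilities on unseen see-through scenes, underscoring its robustness as a significant step toward active and human-like 3D perception.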

[85] VLM-Augmented Degradation Modeling for Image Restoration Under Adverse Weather Conditions

Qianyi Shao,Yuanfan Zhang,Renxiang Xiao,Liang Hu

Main category: cs.CV

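TL;DR: Proposes a unified Memory-Enhanced Visual-Language Recovery (MVLR) model for restoring images under a range of adverse weather conditions; it combines a vision-language model with an implicit memory bank to achieve efficient and accurate restoration.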

Details Motivation: Autonomous driving and outdoor robots need reliable visual perception in adverse weather such as rain, haze, and snow, but image degradation is severe and existing methods struggle to balance restoration accuracy with computational efficiency. Method: Designs a lightweight encoder-decoder combined with a Visual-Language Model (VLM) and an Implicit Memory Bank (IMB); the VLM encodes weather degradation priors through chain-of-thought inference, the IMB stores latent representations of degradation patterns, and dynamic cross-attention adaptively fuses them with multi-scale features. Result: Experiments on four severe-weather benchmarks show that MVLR outperforms single-branch and Mixture-of-Experts baselines in PSNR and SSIM. Conclusion: MVLR strikes a good balance between model compactness and expressiveness and is suited to real-time deployment in diverse outdoor environments. Abstract: Reliable visual perception under adverse weather conditions, such as rain, haze, snow, or a mixture of them, is desirable yet challenging for autonomous driving and outdoor robots. In this paper, we propose a unified Memory-Enhanced Visual-Language Recovery (MVLR) model that restores images from different degradation levels under various weather conditions. MVLR couples a lightweight encoder-decoder backbone with a Visual-Language Model (VLM) and an Implicit Memory Bank (IMB). The VLM performs chain-of-thought inference to encode weather degradation priors and the IMB stores continuous latent representations of degradation patterns. The VLM-generated priors query the IMB to retrieve fine-grained degradation prototypes. These prototypes are then adaptively fused with multi-scale visual features via dynamic cross-attention mechanisms, enhancing restoration accuracy while maintaining computational efficiency. Extensive experiments on four severe-weather benchmarks show that MVLR surpasses single-branch and Mixture-of-Experts baselines in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). These results indicate that MVLR offers a practical balance between model compactness and expressiveness for real-time deployment in diverse outdoor conditions.
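
For readers unfamiliar with the PSNR metric cited above, here is a minimal sketch of its standard definition, 10 * log10(MAX^2 / MSE). It is not the authors' evaluation code; images are assumed to be flat sequences of 8-bit pixel values, and the function name is illustrative.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: a restored patch that is off by 10 gray levels at every pixel.
clean = [100, 100, 100, 100]
restored = [110, 90, 110, 90]
print(psnr(clean, restored))
```

Higher values mean the restored image is closer to the clean reference; SSIM, the other metric in the abstract, compares local structure instead and is not sketched here.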

[86] Vision Language Models are Confused Tourists

Patrick Amadeus Irawan,Ikhlasul Akmal Hanif,Muhammad Dehan Al Kautsar,Genta Indra Winata,Fajri Koto,Alham Fikri Aji

Main category: cs.CV

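TL;DR: This paper presents ConfusedTourist, a new test suite for evaluating the stability of vision-language models (VLMs) under interference from multiple cultural cues, and reveals that existing models are severely fragile when cultural concepts are mixed.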

Details Motivation: Although the cultural dimension is a key aspect of evaluating vision-language models, existing work does not test model stability under diverse cultural inputs, especially in complex scenes where several cultural cues coexist. Method: Proposes ConfusedTourist, a new cultural adversarial robustness suite that introduces distracting geographical and cultural cues through image stacking and image-generation-based perturbations, combined with interpretability analyses of shifts in model attention. Result: Experiments show that the accuracy of current state-of-the-art VLMs drops sharply under simple image-stacking perturbations and falls further under generative perturbations; interpretability analyses show that model attention is drawn away by irrelevant cultural cues. Conclusion: Mixing visual cultural concepts substantially degrades VLM performance and exposes weaknesses in multicultural understanding, so cultural robustness and multimodal understanding need to improve. Abstract: Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.

[87] FLUID: Training-Free Face De-identification via Latent Identity Substitution

Jinhyeong Park,Shaheryar Muhammad,Seangmin Lee,Jong Taek Lee,Soon Ki Jung

Main category: cs.CV

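TL;DR: Proposes FLUID, a training-free face de-identification framework that substitutes identity in the latent space of a pretrained diffusion model and achieves a superior trade-off between identity suppression and attribute preservation.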

Details Motivation: To protect privacy while preserving the useful attributes of an image, addressing the imbalance between identity substitution and attribute preservation in existing de-identification methods. Method: Inspired by substitution mechanisms in chemistry, reinterprets identity editing as semantic displacement in the latent h-space of a pretrained unconditional diffusion model, discovers identity-editing directions through optimization guided by novel reagent losses, and proposes linear and geodesic editing schemes to navigate the latent manifold effectively. Result: Experiments on CelebA-HQ and FFHQ show that FLUID outperforms existing state-of-the-art de-identification methods on both qualitative and quantitative metrics. Conclusion: FLUID offers an effective training-free de-identification solution with a better balance between identity suppression and attribute preservation. Abstract: We present FLUID (Face de-identification in the Latent space via Utility-preserving Identity Displacement), a training-free framework that directly substitutes identity in the latent space of pretrained diffusion models. Inspired by substitution mechanisms in chemistry, we reinterpret identity editing as semantic displacement in the latent h-space of a pretrained unconditional diffusion model. Our framework discovers identity-editing directions through optimization guided by novel reagent losses, which supervise for attribute preservation and identity suppression. We further propose both linear and geodesic (tangent-based) editing schemes to effectively navigate the latent manifold. Experimental results on CelebA-HQ and FFHQ demonstrate that FLUID achieves a superior trade-off between identity suppression and attribute preservation, outperforming state-of-the-art de-identification methods in both qualitative and quantitative metrics.

[88] Parameter-Free Neural Lens Blur Rendering for High-Fidelity Composites

Lingyan Ruan,Bin Chen,Taehyun Rhee

Main category: cs.CV

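TL;DR: Proposes a new method that estimates the circle of confusion (CoC) map directly from RGB images without scene depth or camera metadata, infers the CoC values of virtual objects through a linear relationship, and renders realistic lens blur with a neural reblurring network, yielding high-fidelity, natural blending of virtual and real content in mixed reality compositing.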

Details Motivation: Existing lens blur compositing methods based on camera parameters and scene depth depend on data that is hard to obtain (such as focal length, focus distance, and aperture size), which limits their use by ordinary users. A more general method with more accessible inputs is therefore needed to blend virtual and real content naturally. Method: Proposes a compositing method that estimates the CoC map directly from RGB images, infers the CoC values of virtual objects from the linear relationship between their signed CoC map and depth, and renders realistic lens blur with a neural reblurring network, without relying on camera parameters or depth information. Result: Experiments show that the method outperforms current state-of-the-art techniques in both qualitative and quantitative evaluations and produces high-fidelity mixed reality images with realistic defocus. Conclusion: The method offers a flexible and practical solution for real-world applications and markedly improves the visual quality of virtual-real blending when camera parameters and depth information are unavailable. Abstract: Consistent and natural camera lens blur is important for seamlessly blending 3D virtual objects into photographed real scenes. Since lens blur typically varies with scene depth, the placement of virtual objects and their corresponding blur levels significantly affect the visual fidelity of mixed reality compositions. Existing pipelines often rely on camera parameters (e.g., focal length, focus distance, aperture size) and scene depth to compute the circle of confusion (CoC) for realistic lens blur rendering. However, such information is often unavailable to ordinary users, limiting the accessibility and generalizability of these methods. In this work, we propose a novel compositing approach that directly estimates the CoC map from RGB images, bypassing the need for scene depth or camera metadata. The CoC values for virtual objects are inferred through a linear relationship between their signed CoC map and depth, and realistic lens blur is rendered using a neural reblurring network. Our method provides a flexible and practical solution for real-world applications. Experimental results demonstrate that our method achieves high-fidelity compositing with realistic defocus effects, outperforming state-of-the-art techniques in both qualitative and quantitative evaluations.
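
As background for the camera-parameter pipelines the abstract contrasts itself with, the sketch below computes a signed circle of confusion from the classical thin-lens model. It is a textbook formula rather than the paper's estimator; the function name, the sign convention (positive behind the focal plane), and the example lens values are assumptions.

```python
def signed_coc(depth, focus_distance, focal_length, f_number):
    """Signed CoC diameter under the thin-lens model.

    All lengths share one unit (e.g. millimetres). The result is zero at the
    focus distance, positive behind it, and negative in front of it.
    """
    aperture = focal_length / f_number  # aperture diameter
    return (aperture * focal_length * (depth - focus_distance)
            / (depth * (focus_distance - focal_length)))

# Toy example: 50 mm lens at f/2 focused at 2 m, object at 4 m.
print(signed_coc(4000.0, 2000.0, 50.0, 2.0))
```

Rearranged, this expression equals k * (1 - focus_distance / depth) with k = aperture * focal_length / (focus_distance - focal_length), so in the thin-lens model the signed CoC is affine in inverse depth. That is one concrete sense in which a linear CoC-depth relationship can hold; how the paper parameterizes depth is not stated in the abstract.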

[89] RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis

Linfeng Dong,Yuchen Yang,Hao Wu,Wei Wang,Yuenan Hou,Zhihang Zhong,Xiao Sun

Main category: cs.CV

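TL;DR: RacketVision is a new dataset and benchmark for advancing computer vision research in sports analytics. It covers table tennis, tennis, and badminton, is the first to provide large-scale, fine-grained racket pose annotations, and pairs them with ball positions for multi-task research.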

Details Motivation: To advance computer vision in sports analytics, especially the study of fine-grained human-object interaction, and to fill the gap left by the lack of high-quality, multimodal sports action data. Method: Builds a large-scale annotated dataset with racket pose and ball positions and defines three tasks: fine-grained ball tracking, articulated racket pose estimation, and ball trajectory forecasting; a CrossAttention mechanism is used for multimodal feature fusion. Result: Experiments show that directly concatenating racket features degrades performance, whereas a CrossAttention mechanism effectively improves trajectory prediction and surpasses strong unimodal baselines. Conclusion: RacketVision provides a valuable resource and starting point for dynamic object tracking, conditional motion forecasting, and multimodal sports analysis. Abstract: We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a CrossAttention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page at https://github.com/OrcustD/RacketVision
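
To make the contrast between concatenation and cross-attention concrete, here is a minimal sketch of scaled dot-product attention for a single query vector. It omits the learned projections a real model would have and is not the RacketVision architecture; treating a ball feature as the query and racket pose features as keys and values is an assumption made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """Attend from one query vector over a set of key/value vectors.

    Returns the attention-weighted sum of the values and the weights.
    """
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    fused = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return fused, weights

# Toy example: a ball feature (query) attends over two racket pose features.
ball_feature = [10.0, 0.0]
racket_keys = [[1.0, 0.0], [0.0, 1.0]]
racket_values = [[1.0, 1.0], [5.0, 5.0]]
fused, weights = cross_attend(ball_feature, racket_keys, racket_values)
print(fused, weights)
```

Unlike concatenation, which passes every racket feature through with equal standing, the weights let the query decide how much each racket feature contributes.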

[90] RoomPlanner: Explicit Layout Planner for Easier LLM-Driven 3D Room Generation

Wenzhuo Sun,Mingjian Liang,Wenxuan Song,Xuelian Cheng,Zongyuan Ge

Main category: cs.CV

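TL;DR: This paper proposes RoomPlanner, the first fully automatic framework for generating 3D rooms from short text input; it produces plausible, realistic indoor scenes without manual layout design or panoramic image guidance.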

Details Motivation: Existing 3D indoor scene generation methods usually depend on manually designed layouts or panoramic image guidance, which limits automation and generation efficiency. A solution is needed that automatically generates high-quality, geometrically plausible indoor scenes from short text. Method: Proposes hierarchical language-driven agent planners that parse short text into detailed scene descriptions and initialize 3D point clouds; introduces two arrangement constraints to iteratively optimize the spatial arrangement of objects; and uses AnyReach Sampling and the ITFS strategy to optimize the 3D Gaussian representation and speed up rendering. Result: Experiments show that the method generates collision-free, accessible, visually high-quality 3D indoor scenes in under 30 minutes, surpassing prior methods in rendering speed and visual quality while remaining editable. Conclusion: RoomPlanner achieves fully automatic generation of high-quality 3D indoor scenes from short text, improving generation efficiency and geometric plausibility; it is a notable advance in text-to-scene generation. Abstract: In this paper, we propose RoomPlanner, the first fully automatic 3D room generation framework for painlessly creating realistic indoor scenes with only short text as input. Without any manual layout design or panoramic image guidance, our framework can generate explicit layout criteria for rational spatial placement. We begin by introducing a hierarchical structure of language-driven agent planners that can automatically parse short and ambiguous prompts into detailed scene descriptions. These descriptions include raw spatial and semantic attributes for each object and the background, which are then used to initialize 3D point clouds. To position objects within bounded environments, we implement two arrangement constraints that iteratively optimize spatial arrangements, ensuring a collision-free and accessible layout solution. In the final rendering stage, we propose a novel AnyReach Sampling strategy for camera trajectory, along with the Interval Timestep Flow Sampling (ITFS) strategy, to efficiently optimize the coarse 3D Gaussian scene representation. These approaches help reduce the total generation time to under 30 minutes. Extensive experiments demonstrate that our method can produce geometrically rational 3D indoor scenes, surpassing prior approaches in both rendering speed and visual quality while preserving editability. The code will be available soon.
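
The abstract does not spell out its two arrangement constraints, but a collision-free layout implies some overlap test between placed objects. The sketch below shows one common choice, an axis-aligned bounding box (AABB) check; it is an illustrative assumption, not RoomPlanner's actual constraint, and the object names and sizes are made up.

```python
def boxes_overlap(a, b):
    """True if two axis-aligned 3D boxes intersect with positive volume.

    Each box is (min_corner, max_corner); boxes that only touch do not overlap.
    """
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] < bmax[i] and bmin[i] < amax[i] for i in range(3))

def colliding_pairs(objects):
    """Return name pairs (in sorted order) whose bounding boxes overlap."""
    names = sorted(objects)
    return [(n1, n2)
            for i, n1 in enumerate(names)
            for n2 in names[i + 1:]
            if boxes_overlap(objects[n1], objects[n2])]

# Toy layout in metres: the lamp intrudes into both the bed and the nightstand.
layout = {
    "bed": ((0.0, 0.0, 0.0), (2.0, 1.6, 0.5)),
    "nightstand": ((2.0, 0.0, 0.0), (2.5, 0.5, 0.5)),
    "lamp": ((1.9, 0.2, 0.0), (2.2, 0.4, 1.5)),
}
print(colliding_pairs(layout))
```

An iterative layout optimizer could move objects until this list is empty; accessibility would need a separate check, such as verifying free floor space around each object.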

[91] PathAgent: Toward Interpretable Analysis of Whole-slide Pathology Images via Large Language Model-based Agentic Reasoning

Jingyun Chen,Linghan Cai,Zhikang Wang,Yi Huang,Songhan Jiang,Shenjin Huang,Hongpeng Wang,Yongbing Zhang

Main category: cs.CV

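TL;DR: PathAgent is a training-free large language model agent framework that emulates a pathologist's analytical process: it iteratively explores whole-slide images (WSIs), precisely locates significant micro-regions, extracts morphological visual cues, and builds an interpretable reasoning chain, providing transparent and clinically reliable diagnostic assistance.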

Details Motivation: Existing computational methods for analyzing whole-slide images lack an explicit reasoning trajectory, so their predictions are opaque and hard to explain. Closing this gap requires a method that emulates the reflective, stepwise analysis of human experts. Method: Proposes PathAgent, an LLM-based agent framework in which a Navigator module autonomously explores the slide and locates key regions, a Perceptor module extracts morphological visual features, and an Executor module integrates the findings into a continuously evolving natural-language reasoning trajectory, forming an explicit chain of thought. Result: Evaluated on five challenging datasets, PathAgent surpasses task-specific baselines on both open-ended and constrained visual question-answering tasks and shows strong zero-shot generalization. A collaborative evaluation with human pathologists confirms its potential as a transparent, clinically reliable diagnostic assistant. Conclusion: By emulating the dynamic analytical process of human experts, PathAgent provides an interpretable approach to WSI analysis and is a promising transparent, clinically deployable AI tool for diagnostic assistance. Abstract: Analyzing whole-slide images (WSIs) requires an iterative, evidence-driven reasoning process that parallels how pathologists dynamically zoom, refocus, and self-correct while collecting the evidence. However, existing computational pipelines often lack this explicit reasoning trajectory, resulting in inherently opaque and unjustifiable predictions. To bridge this gap, we present PathAgent, a training-free, large language model (LLM)-based agent framework that emulates the reflective, stepwise analytical approach of human experts. PathAgent can autonomously explore WSIs, iteratively and precisely locating significant micro-regions using the Navigator module, extracting morphology visual cues using the Perceptor, and integrating these findings into the continuously evolving natural language trajectories in the Executor. The entire sequence of observations and decisions forms an explicit chain-of-thought, yielding fully interpretable predictions. Evaluated across five challenging datasets, PathAgent exhibits strong zero-shot generalization, surpassing task-specific baselines in both open-ended and constrained visual question-answering tasks. Moreover, a collaborative evaluation with human pathologists confirms PathAgent's promise as a transparent and clinically grounded diagnostic assistant.

[92] OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding

Teng Fu,Mengyang Zhao,Ke Niu,Kaixin Peng,Bin Li

Main category: cs.CV

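TL;DR: This paper proposes OmniPT, a unified pedestrian tracking framework that builds on the semantic understanding strengths of LVLMs and uses an RL, mid-training, SFT, RL training pipeline to support tracking, reference-based tracking, and semantic understanding of pedestrians; experiments show that it outperforms previous methods.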

Details Motivation: Although LVLMs perform well on image-level tasks, they still lag on instance-level tasks such as visual grounding and object detection. Meanwhile, new tasks that combine object tracking with natural language (such as Referring MOT) require advanced semantic understanding, which is exactly where LVLMs are strong, so a pedestrian tracking framework that handles these tasks in a unified way is needed. Method: Proposes the OmniPT framework with a four-stage training strategy: an initial RL stage makes the model output bounding boxes in a fixed, supervisable format; a mid-training stage uses a large amount of pedestrian-related data to strengthen basic capabilities; supervised fine-tuning follows on several pedestrian tracking datasets; and a final RL stage improves tracking performance and instruction following. Result: Experiments on multiple pedestrian tracking benchmarks show that the method outperforms previous methods in both tracking performance and semantic understanding, validating the framework. Conclusion: OmniPT successfully applies LVLMs to complex pedestrian tracking, unifying tracking, referring, and semantic understanding in one model; its progressive training strategy markedly improves performance and suggests a new direction for LVLMs on instance-level tasks. Abstract: LVLMs have been shown to perform excellently in image-level tasks such as VQA and captioning. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, there have been a number of new topics in combining object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on reference, and generate semantic understanding of tracked objects interactively. We address two issues: how to model the tracking task into a task that foundation models can perform, and how to make the model output formatted answers. To this end, we implement a training phase consisting of RL-Mid Training-SFT-RL. Based on the pre-trained weights of the LVLM, we first perform a simple RL phase to enable the model to output a fixed and supervisable bounding box format. Subsequently, we conduct a mid-training phase using a large number of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, and then carry out another RL phase to improve the model's tracking performance and enhance its ability to follow instructions. 
We conduct experiments on tracking benchmarks and the experimental results demonstrate that the proposed method can perform better than the previous methods.

[93] RL-AD-Net: Reinforcement Learning Guided Adaptive Displacement in Latent Space for Refined Point Cloud Completion

Bhanu Pratap Paregi,Vaibhav Kumar

Main category: cs.CV

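TL;DR: Proposes RL-AD-Net, a reinforcement-learning-based post-processing framework for point cloud completion that optimizes global feature vectors in the latent space of a pretrained autoencoder and adds a geometric-consistency selection mechanism to improve the local geometric accuracy of completions.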

Details Motivation: Existing point cloud completion methods generate globally plausible shapes but often leave local geometric inconsistencies, and their performance drops markedly under random occlusion; a general, lightweight, and effective post-processing method is needed to improve geometric fidelity. Method: Builds a reinforcement learning agent that adjusts the global feature vector (GFV) of a completion in the latent space of a pretrained point cloud autoencoder; introduces a lightweight non-parametric PointNN selector that evaluates the original and RL-refined results and keeps the better one; uses Chamfer Distance and geometric consistency metrics to guide training; and trains separately per category to ensure convergence. Result: On ShapeNetCore-2048, RL-AD-Net consistently improves various completion models under both standard and randomly cropped inputs, markedly improves local geometric consistency, and is modular and model-agnostic. Conclusion: As a lightweight, modular post-processing framework that requires no retraining, RL-AD-Net effectively improves the geometric fidelity of point cloud completion, performs especially well in challenging random-occlusion scenarios, and is broadly applicable. Abstract: Recent point cloud completion models, including transformer-based, denoising-based, and other state-of-the-art approaches, generate globally plausible shapes from partial inputs but often leave local geometric inconsistencies. We propose RL-AD-Net, a reinforcement learning (RL) refinement framework that operates in the latent space of a pretrained point autoencoder. The autoencoder encodes completions into compact global feature vectors (GFVs), which are selectively adjusted by an RL agent to improve geometric fidelity. To ensure robustness, a lightweight non-parametric PointNN selector evaluates the geometric consistency of both the original completion and the RL-refined output, retaining the better reconstruction. When ground truth is available, both Chamfer Distance and geometric consistency metrics guide refinement. Training is performed separately per category, since the unsupervised and dynamic nature of RL makes convergence across highly diverse categories challenging. Nevertheless, the framework can be extended to multi-category refinement in future work. Experiments on ShapeNetCore-2048 demonstrate that while baseline completion networks perform reasonably under their training-style cropping, they struggle in random cropping scenarios. In contrast, RL-AD-Net consistently delivers improvements across both settings, highlighting the effectiveness of RL-guided ensemble refinement. The approach is lightweight, modular, and model-agnostic, making it applicable to a wide range of completion networks without requiring retraining.
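
The Chamfer Distance mentioned above compares two point sets by nearest-neighbor distances in both directions. Below is a minimal brute-force sketch using squared Euclidean distances, one of several common conventions; which variant the authors use is not stated in the abstract, and the function name is illustrative.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    For every point, take the squared distance to its nearest neighbor in the
    other set, average per set, and add the two directions. O(len(a) * len(b)).
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Toy example: a completion that misplaces one of two points by 0.5 along x.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
completion = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
print(chamfer_distance(completion, ground_truth))
```

A lower value means the two shapes are closer, so when ground truth is available a selector could keep whichever of the original and refined completions scores lower.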

[94] REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting

Di Wu,Liu Liu,Anran Huang,Yuyan Liu,Qiaoyu Jun,Shaofan Liu,Liangtu Song,Cewu Lu

Main category: cs.CV

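TL;DR: This paper proposes REArtGS++, a new method for generalizable articulated object reconstruction that introduces a temporal geometry constraint and planar Gaussian splatting and outperforms existing methods in joint parameter estimation and part-level surface reconstruction.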

Details Motivation: REArtGS struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. This paper therefore aims to improve reconstruction and generalization for articulated objects across states. Method: Proposes REArtGS++, which models a decoupled screw motion for each joint without a type prior and jointly optimizes part-aware Gaussians and joint parameters through part motion blending; it introduces a time-continuous geometric constraint, using a first-order Taylor expansion to regularize temporal consistency between planar normal and depth, and adopts a planar Gaussian splatting representation. Result: Extensive experiments on synthetic and real-world articulated objects show that the method outperforms existing approaches in part-level surface reconstruction and joint parameter estimation. Conclusion: By introducing a temporal geometry constraint and planar Gaussian splatting, REArtGS++ effectively improves reconstruction and generalization for complex articulated objects (such as screw-joint and multi-part objects), with greater robustness and accuracy. Abstract: Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS (Wu et al., 2025) introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consistent regularization between planar normal and depth through Taylor first-order expansion. Extensive experiments on both synthetic and real-world articulated objects demonstrate our superiority in generalizable part-level surface reconstruction and joint parameter estimation, compared to existing approaches. Project Site: https://sites.google.com/view/reartgs2/home.
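
A screw motion combines a rotation about an axis with a translation along the same axis, which is why a single parameterization can cover revolute, prismatic, and screw joints without a type prior. The sketch below applies such a motion to one point using Rodrigues' rotation formula; it is a generic geometric illustration, not the REArtGS++ implementation, and the function and parameter names are assumptions.

```python
import math

def screw_transform(point, axis_origin, axis_dir, angle, translation):
    """Rotate `point` by `angle` (radians) about the axis through `axis_origin`
    with direction `axis_dir`, then slide it by `translation` along that axis."""
    norm = math.sqrt(sum(c * c for c in axis_dir))
    u = [c / norm for c in axis_dir]                    # unit axis direction
    v = [p - o for p, o in zip(point, axis_origin)]     # point relative to the axis
    dot = sum(a * b for a, b in zip(u, v))
    cross = [u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0]]
    c, s = math.cos(angle), math.sin(angle)
    # Rodrigues' formula: v*cos + (u x v)*sin + u*(u.v)*(1 - cos)
    rotated = [v[i] * c + cross[i] * s + u[i] * dot * (1.0 - c) for i in range(3)]
    return [axis_origin[i] + rotated[i] + translation * u[i] for i in range(3)]

# A quarter turn about the z-axis combined with a 0.5 rise along it.
print(screw_transform([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0],
                      math.pi / 2, 0.5))
```

Setting `translation` to zero gives a pure revolute joint (a door hinge), and setting `angle` to zero gives a pure prismatic joint (a drawer).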

[95] ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion

Junming Liu,Yifei Sun,Weihua Cheng,Yujin Kang,Yirong Chen,Ding Wang,Guosun Zeng

Main category: cs.CV

TL;DR: This paper proposes ReBrain, a retrieval-augmented diffusion framework for reconstructing brain MRI from sparse CT slices.

Details Motivation: Low-dose CT scans yield sparse slices with poor through-plane resolution, so existing methods struggle to reconstruct full brain MRI volumes accurately; a method is needed that achieves high-quality MRI reconstruction from a limited number of CT slices. Method: Uses a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices in 2D, while a fine-tuned retrieval model fetches structurally and pathologically similar CT slices from a prior database as references; a ControlNet branch uses them to guide the generation of intermediate MRI slices and ensure structural continuity, and spherical linear interpolation provides supplementary guidance when retrieval fails. Result: Experiments on SynthRAD2023 and BraTS show that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions. Conclusion: ReBrain effectively addresses brain MRI reconstruction from sparse CT input and markedly improves the quality and stability of image translation in clinically constrained settings. Abstract: Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes with poor through-plane resolution, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for brain MRI reconstruction. Given any 3D CT scan with limited slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices along the 2D dimension. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.
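
Spherical linear interpolation (slerp), the fallback named above, interpolates along the arc between two vectors rather than along the straight chord. The abstract does not say which representations are interpolated, so the sketch below works on generic vectors; the function name and the near-parallel threshold are assumptions.

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between vectors v0 and v1 at t in [0, 1].

    Falls back to linear interpolation when the vectors are nearly parallel.
    Nearly opposite vectors are ill-conditioned and are not handled specially.
    """
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    cos_omega = max(-1.0, min(1.0, cos_omega))   # guard against rounding drift
    omega = math.acos(cos_omega)                 # angle between the vectors
    if omega < 1e-6:
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    sin_omega = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]

# Halfway between two orthogonal unit vectors stays on the unit circle.
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

For two unit vectors the result keeps unit norm at every `t`, whereas plain linear interpolation would shrink the midpoint toward the origin. This property is a common reason to prefer slerp for latent vectors.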

[96] Diversity Has Always Been There in Your Visual Autoregressive Models

Tong Wang,Guanyu Yang,Nian Liu,Kai Wang,Yaxing Wang,Abdelrahman M Shaker,Salman Khan,Fahad Shahbaz Khan,Senmao Li

Main category: cs.CV

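TL;DR: This paper proposes DiverseVAR, a simple yet effective method that restores the generative diversity of visual autoregressive (VAR) models without additional training. By suppressing a pivotal feature component in the model input and amplifying it in the output, it substantially increases generative diversity while keeping image synthesis quality high.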

Details Motivation: VAR models are efficient but suffer from diversity collapse, which limits the variety of generated results. This paper aims to solve this problem and increase generative diversity without sacrificing image quality or inference efficiency. Method: Analysis shows that the pivotal component of the feature map is the key factor governing diversity at early scales. DiverseVAR suppresses this component in the model input and amplifies it in the model output to unlock the generative potential of VAR models. Result: Experiments show that DiverseVAR substantially increases generative diversity with minimal impact on image quality and performance, and requires no additional training. Conclusion: DiverseVAR is a training-free method that effectively increases the generative diversity of VAR models, resolving diversity collapse while preserving high-fidelity synthesis, and is practical and broadly applicable. Abstract: Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm, offering notable advantages in both inference efficiency and image quality compared to traditional multi-step autoregressive (AR) and diffusion models. However, despite their efficiency, VAR models often suffer from diversity collapse, i.e., a reduction in output variability, analogous to that observed in few-step distilled diffusion models. In this paper, we introduce DiverseVAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training. Our analysis reveals the pivotal component of the feature map as a key factor governing diversity formation at early scales. By suppressing the pivotal component in the model input and amplifying it in the model output, DiverseVAR effectively unlocks the inherent generative potential of VAR models while preserving high-fidelity synthesis. Empirical results demonstrate that our approach substantially enhances generative diversity with only negligible performance influences. Our code will be publicly released at https://github.com/wangtong627/DiverseVAR.

[97] Spanning Tree Autoregressive Visual Generation

Sangkyu Lee,Changho Lee,Janghoon Han,Hosung Song,Tackgeun You,Hwasup Lim,Stanley Jungkyu Choi,Honglak Lee,Youngjae Yu

Main category: cs.CV

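TL;DR: Proposes the Spanning Tree Autoregressive (STAR) model, which combines traversal orders of uniform spanning trees with prior knowledge of images to support flexible sequence orders and image editing while maintaining sampling performance.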

Details Motivation: When existing autoregressive models for visual generation use randomly permuted sequence orders, they either lose performance or sacrifice flexibility in the choice of sequence order at inference, making it hard to serve image editing and generation quality at once. Method: STAR samples uniform spanning trees in the lattice defined by image patch positions and obtains traversal orders through breadth-first search; rejection sampling ensures that the observed image region appears as a prefix of the sequence, giving a structured randomized strategy. Result: Without significant changes to the architecture used in language AR modeling, STAR preserves postfix completion, maintains good sampling performance, and supports flexible sequence order choice. Conclusion: By combining spanning tree traversal with image priors, STAR balances performance and flexibility in image generation and editing and improves on conventional random permutation. Abstract: We present Spanning Tree Autoregressive (STAR) modeling, which can incorporate prior knowledge of images, such as center bias and locality, to maintain sampling performance while also providing sufficiently flexible sequence orders to accommodate image editing at inference. Approaches that expose randomly permuted sequence orders to conventional autoregressive (AR) models in visual generation for bidirectional context either suffer from a decline in performance or compromise the flexibility in sequence order choice at inference. Instead, STAR utilizes traversal orders of uniform spanning trees sampled in a lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, allowing us to efficiently construct a spanning tree whose traversal order ensures that the connected partial observation of the image appears as a prefix in the sequence through rejection sampling. Through the tailored yet structured randomized strategy compared to random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance without any significant changes to the model architecture widely adopted in the language AR modeling.
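
The two ingredients named in the abstract, a uniform spanning tree on the patch lattice and a breadth-first traversal, can be sketched with Wilson's algorithm, a standard way to sample spanning trees uniformly via loop-erased random walks. This is an illustrative sketch, not the STAR code: the abstract does not say which sampler the authors use, the rejection-sampling step is omitted, and starting the BFS at the top-left patch is an arbitrary assumption.

```python
import random
from collections import deque

def neighbors(cell, height, width):
    """4-connected lattice neighbors of a (row, col) cell."""
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < height and 0 <= nc < width:
            yield (nr, nc)

def sample_uniform_spanning_tree(height, width, rng):
    """Wilson's algorithm: returns the tree as an adjacency dict."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    in_tree = {cells[0]}
    adjacency = {cell: [] for cell in cells}
    for start in cells:
        if start in in_tree:
            continue
        # Random walk until the tree is hit. Overwriting the exit of a
        # revisited cell erases the loop that led back to it.
        next_step = {}
        cell = start
        while cell not in in_tree:
            next_step[cell] = rng.choice(list(neighbors(cell, height, width)))
            cell = next_step[cell]
        # Retrace the loop-erased path and attach it to the tree.
        cell = start
        while cell not in in_tree:
            in_tree.add(cell)
            nxt = next_step[cell]
            adjacency[cell].append(nxt)
            adjacency[nxt].append(cell)
            cell = nxt
    return adjacency

def bfs_order(adjacency, start):
    """Breadth-first traversal order of the tree from `start`."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        for nb in adjacency[cell]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order

tree = sample_uniform_spanning_tree(4, 4, random.Random(0))
print(bfs_order(tree, (0, 0)))
```

Every prefix of a BFS order over a tree is a connected set of patches, which keeps generation spatially local. Starting the traversal near the image center would be one way to express the center bias the abstract mentions.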

[98] SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting

Di Wu,Liu Liu,Xueyu Yuan,Qiaoyu Jun,Wenxiao Chen,Ruilong Yan,Yiming Tang,Liangtu Song

Main category: cs.CV

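TL;DR: Proposes a category-agnostic 3D reconstruction framework for articulated objects based on planar Gaussian splatting, which achieves high-fidelity part-level surface reconstruction from sparse RGB images of a single state.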

Details Motivation: Existing methods usually require costly inputs such as multi-view or multi-stage observations, which limits their use in practice; a more efficient, lower-cost method for 3D reconstruction of articulated objects is needed. Method: Introduces a Gaussian information field to select the optimal sparse viewpoints from candidate camera poses; compresses 3D Gaussians into planar Gaussians to improve depth and normal estimation; optimizes coarse-to-fine through depth smoothness regularization and few-shot diffusion; and introduces a part segmentation probability for each Gaussian primitive, updated by back-projecting part segmentation masks of the renderings. Result: Achieves better part-level surface reconstruction than existing methods on both synthetic and real-world data, with higher fidelity especially under sparse-view input. Conclusion: The method makes effective use of sparse RGB images for high-quality, category-agnostic articulated object reconstruction and is more practical and generalizable. Abstract: Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address these limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. Then we compress 3D Gaussians into planar Gaussians to facilitate accurate estimation of normal and depth. The planar Gaussians are optimized in a coarse-to-fine manner through depth smooth regularization and few-shot diffusion. Moreover, we introduce a part segmentation probability for each Gaussian primitive and update them by back-projecting part segmentation masks of renderings. Extensive experimental results demonstrate that our method achieves higher-fidelity part-level surface reconstruction on both synthetic and real-world data than existing methods. Code will be made publicly available.

[99] Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models

He Huang,Zixuan Hu,Dongxiao Li,Yao Xiao,Ling-Yu Duan

Main category: cs.CV

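TL;DR: Proposes ReCoVAD, a video anomaly detection framework inspired by the human nervous system that achieves efficient training-free detection through sparse reasoning.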

Details Motivation: Existing large-model-based video anomaly detection methods rely on dense frame-level inference, which is computationally expensive; this paper asks whether dense reasoning is actually necessary. Method: Designs a dual-pathway architecture: the Reflex pathway uses a lightweight CLIP module to process frames quickly and query a dynamic memory; the Conscious pathway uses a vision-language model to generate event descriptions and update the memory and prototype prompts, with a large language model performing periodic review and correction. Result: Achieves state-of-the-art training-free performance on UCF-Crime and XD-Violence while processing only 28.55% and 16.04% of the frames, respectively. Conclusion: Sparse reasoning is sufficient for effective large-model-based video anomaly detection, and ReCoVAD substantially reduces computational cost while maintaining high performance. Abstract: Video anomaly detection (VAD) plays a vital role in real-world applications such as security surveillance, autonomous driving, and industrial monitoring. Recent advances in large pre-trained models have opened new opportunities for training-free VAD by leveraging rich prior knowledge and general reasoning capabilities. However, existing studies typically rely on dense frame-level inference, incurring high computational costs and latency. This raises a fundamental question: Is dense reasoning truly necessary when using powerful pre-trained models in VAD systems? To answer this, we propose ReCoVAD, a novel framework inspired by the dual reflex and conscious pathways of the human nervous system, enabling selective frame processing to reduce redundant computation. ReCoVAD consists of two core pathways: (i) a Reflex pathway that uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors, which query a dynamic memory of past frames and anomaly scores for fast response; and (ii) a Conscious pathway that employs a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames. It continuously updates the memory and prototype prompts, while an integrated large language model periodically reviews accumulated descriptions to identify unseen anomalies, correct errors, and refine prototypes. 
Extensive experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on the UCF-Crime and XD-Violence datasets, demonstrating that sparse reasoning is sufficient for effective large-model-based VAD.
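
The routing idea behind the two pathways can be sketched as a memory lookup with a fallback. In this heavily simplified sketch a frame's decision vector is compared with stored vectors; a close match reuses the stored anomaly score (the fast path), and anything else goes to an expensive scorer and is added to memory (the slow path). The cosine test, the threshold, and all names are illustrative assumptions, not the ReCoVAD implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def score_stream(frame_vectors, slow_scorer, threshold=0.95):
    """Score frames, calling `slow_scorer` only for frames unlike any in memory.

    Returns the per-frame scores and the number of slow-path calls.
    """
    memory = []                      # list of (vector, anomaly_score)
    scores, slow_calls = [], 0
    for vec in frame_vectors:
        best_sim, best_score = -1.0, None
        for stored_vec, stored_score in memory:
            sim = cosine(vec, stored_vec)
            if sim > best_sim:
                best_sim, best_score = sim, stored_score
        if best_score is not None and best_sim >= threshold:
            scores.append(best_score)        # fast path: reuse the memory
        else:
            score = slow_scorer(vec)         # slow path: expensive model
            slow_calls += 1
            memory.append((vec, score))
            scores.append(score)
    return scores, slow_calls

# Toy stream: two near-duplicate "normal" frames, then two "anomalous" ones.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
toy_scorer = lambda v: 1.0 if v[1] > v[0] else 0.0   # stand-in for a VLM
print(score_stream(frames, toy_scorer))
```

In this toy stream only two of the four frames reach the slow scorer, which mirrors on a small scale the frame savings the abstract reports.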

[100] Planning with Sketch-Guided Verification for Physics-Aware Video Generation

Yidong Huang,Zun Wang,Han Lin,Dong-Ki Kim,Shayegan Omidshafiei,Jaehong Yoon,Yue Zhang,Mohit Bansal

Main category: cs.CV

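TL;DR: Proposes SketchVerify, a training-free motion planning framework with sketch-based verification that uses a test-time sampling and verification loop to produce coherent motion trajectories that are physically plausible and consistent with the instruction, improving the motion quality and efficiency of video generation.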

Details Motivation: Existing video generation methods rely on single-shot planning or iterative refinement, which struggle with complex motion and are computationally expensive, and they lack a mechanism to verify motion plausibility effectively before generation. Method: Proposes the SketchVerify framework: it generates multiple candidate motion trajectories from the prompt and a reference image, renders them as lightweight video sketches, jointly evaluates semantic alignment and physical plausibility with a vision-language verifier, and iteratively selects the best trajectory before passing it to the generator to synthesize the final video. Result: Significantly outperforms baselines on WorldModelBench and PhyWorldBench, improving motion quality, physical realism, and long-term consistency with higher efficiency; ablations show that increasing the number of candidate trajectories consistently improves performance. Conclusion: Through test-time verification, SketchVerify effectively improves the quality and dynamic coherence of motion planning and is an efficient, general, training-free approach to planning before video generation. Abstract: Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement which requires multiple calls to the video generator, incurring high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. 
Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
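The sampling-and-verification loop described in the abstract can be illustrated with a toy sketch. Everything here is hypothetical: `propose_plans`, `render_sketch`, and `verify` are stand-ins for the planner, the sketch renderer, and the vision-language verifier, and the scoring rule (closeness to a target slope) is invented purely so the loop has something to optimize. It is not the authors' implementation.

```python
import random

def propose_plans(prompt, n, rng):
    """Hypothetical planner: return n candidate trajectories as lists of (x, y)."""
    plans = []
    for _ in range(n):
        slope = rng.uniform(-1.0, 1.0)
        plans.append([(t, slope * t) for t in range(5)])
    return plans

def render_sketch(plan):
    """Hypothetical cheap renderer: the 'sketch' is just the trajectory itself."""
    return plan

def verify(sketch, prompt):
    """Hypothetical verifier: score in [0, 1], higher when the slope is near 0.5."""
    x_end, y_end = sketch[-1]
    slope = y_end / x_end
    return max(0.0, 1.0 - abs(slope - 0.5))

def plan_with_verification(prompt, n_candidates=8, threshold=0.9,
                           max_rounds=5, seed=0):
    """Sample candidates, score their sketches, stop once one is good enough."""
    rng = random.Random(seed)
    best_plan, best_score = None, float("-inf")
    for _ in range(max_rounds):
        for plan in propose_plans(prompt, n_candidates, rng):
            score = verify(render_sketch(plan), prompt)
            if score > best_score:
                best_plan, best_score = plan, score
        if best_score >= threshold:
            break  # a satisfactory plan was found; hand it to the generator
    return best_plan, best_score

plan, score = plan_with_verification("a ball rolls up a ramp")
```

Because the verifier only sees cheap sketches, the expensive generator would be called once, on the selected plan. Sampling more candidates can only keep or raise the best score in this toy setup, which is at least consistent with the ablation trend the abstract reports.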

[101] Bridging Visual Affective Gap: Borrowing Textual Knowledge by Learning from Noisy Image-Text Pairs

Daiqing Wu,Dongbao Yang,Yu Zhou,Can Ma

Main category: cs.CV

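TL;DR: Proposes Partitioned Adaptive Contrastive Learning (PACL), which uses knowledge from a pre-trained textual model to enhance the emotional perception of visual models. By separating different types of samples and designing a distinct contrastive learning strategy for each type, it bridges the "affective gap" in visual emotion recognition.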

Details Motivation: The "affective gap" between visual features and emotional categories limits the use of existing pre-trained visual models for visual emotion recognition, whereas the textual modality, with its explicit emotional expression and high information density, can compensate for this weakness. Method: PACL partitions samples from noisy social media data according to the factual and emotional connections between images and texts, and designs an adaptive contrastive learning strategy for each sample type, dynamically constructing positive and negative pairs to transfer knowledge from the textual model to the visual model. Result: Experiments show that the method significantly improves the performance of various pre-trained visual models on downstream emotion-related tasks and makes effective use of noisy data to mine emotional information. Conclusion: By introducing textual knowledge and applying type-specific adaptive contrastive learning, PACL bridges the "affective gap" in visual emotion recognition and improves the model's emotional understanding. Abstract: Visual emotion recognition (VER) is a longstanding field that has garnered increasing attention with the advancement of deep neural networks. Although recent studies have achieved notable improvements by leveraging the knowledge embedded within pre-trained visual models, the lack of direct association between factual-level features and emotional categories, called the "affective gap", limits the applicability of pre-training knowledge for VER tasks. In contrast, the explicit emotional expression and high information density in the textual modality eliminate the "affective gap". Therefore, we propose borrowing the knowledge from the pre-trained textual model to enhance the emotional perception of pre-trained visual models. We focus on the factual and emotional connections between images and texts in noisy social media data, and propose Partitioned Adaptive Contrastive Learning (PACL) to leverage these connections. Specifically, we manage to separate different types of samples and devise distinct contrastive learning strategies for each type. By dynamically constructing negative and positive pairs, we fully exploit the potential of noisy samples. Through comprehensive experiments, we demonstrate that bridging the "affective gap" significantly improves the performance of various pre-trained visual models in downstream emotion-related tasks. Our code is released at https://github.com/wdqqdw/PACL.
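For readers unfamiliar with contrastive objectives, the following is a minimal sketch of a generic InfoNCE-style loss with explicit positive and negative sets. It is not PACL itself: the abstract does not specify how samples are partitioned or how pairs are chosen per partition, so the function name `info_nce`, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(anchor, positives, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor embedding.

    positives / negatives are lists of embeddings; cosine similarity is used.
    The loss is low when the anchor is close to its positives and far
    from its negatives.
    """
    a = normalize(anchor)
    pos = [math.exp(dot(a, normalize(p)) / temperature) for p in positives]
    neg = [math.exp(dot(a, normalize(n)) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)

# Toy 2-D embeddings (assumed values): an image, a matching caption,
# and an unrelated caption.
image = [1.0, 0.0]
text_match = [0.9, 0.1]
text_other = [0.0, 1.0]

loss_good = info_nce(image, [text_match], [text_other])
loss_bad = info_nce(image, [text_other], [text_match])
```

Swapping which text counts as the positive changes the loss sharply, which is why the choice of pairs matters so much when the image-text data are noisy.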

[102] ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better

Yuan Zhang,Ming Lu,Junwen Pan,Tao Huang,Kuan Cheng,Qi She,Shanghang Zhang

Main category: cs.CV

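TL;DR: Proposes ChainV, a framework that dynamically integrates visual hints to improve the efficiency and accuracy of multimodal reasoning, significantly reducing reasoning-chain length and latency.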

Details Motivation: Existing large models exhibit redundant self-reflection when generating long reasoning chains, and existing training-free chain-of-thought compression methods rely on static visual references, which limits their gains for multimodal reasoning. Method: ChainV first performs a coarse visual patch selection based on the previous reasoning step, then identifies the most representative atomic visual hint according to averaged attention intensity, introduces a consistency-based evaluation mechanism to judge the hint's reliability, and finally incorporates the hint's coordinates and reliability into the reasoning process through a Bernoulli stochastic process. Result: ChainV performs well on math-intensive benchmarks such as MathVista: within MIMO-VL-RL it improves accuracy by 2.3%, reduces inference latency by 51.4%, and shortens output token length by 24.5%. Conclusion: By dynamically integrating visual hints, ChainV shortens and improves multimodal reasoning chains, raising both efficiency and accuracy, especially for complex symbolic reasoning tasks that depend on visual information. Abstract: Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLM domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore, we propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, thereby making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Eventually, the pixel coordinates of the selected visual hint and its reliability are incorporated into thinking with a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves a $2.3\%$ improvement on MathVista within MIMO-VL-RL, while reducing inference latency by $51.4\%$ and shortening output token length by $24.5\%$.

[103] PEGS: Physics-Event Enhanced Large Spatiotemporal Motion Reconstruction via 3D Gaussian Splatting

Yijun Xu,Jingrui Zhang,Hongyi Liu,Yuhan Chen,Yuanyang Wang,Qingyao Guo,Dingwen Wang,Lei Yu,Chu He

Main category: cs.CV

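TL;DR: Proposes PEGS, a framework that combines physical priors with event-stream enhancement in 3D Gaussian Splatting to achieve deblurred target modeling and recovery of rigid motion over large spatiotemporal scales.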

Details Motivation: Reconstructing rigid motion over large spatiotemporal scales is challenging because of limitations in modeling paradigms, severe motion blur, and physical inconsistency. Method: PEGS combines physical priors with event-stream enhancement, using a triple-level supervision scheme (acceleration constraint, event-stream guidance, Kalman regularization) and a motion-aware simulated annealing strategy for adaptive training scheduling. Result: Validated on a self-collected dataset, the first paired RGB-Event dataset targeting natural, fast rigid motion, PEGS outperforms mainstream dynamic methods in reconstructing motion over large spatiotemporal scales. Conclusion: By fusing physical consistency with high-temporal-resolution information, PEGS improves the accuracy and robustness of deblurred rigid-motion reconstruction in complex scenes. Abstract: Reconstruction of rigid motion over large spatiotemporal scales remains a challenging task due to limitations in modeling paradigms, severe motion blur, and insufficient physical consistency. In this work, we propose PEGS, a framework that integrates Physical priors with Event stream enhancement within a 3D Gaussian Splatting pipeline to perform deblurred target-focused modeling and motion recovery. We introduce a cohesive triple-level supervision scheme that enforces physical plausibility via an acceleration constraint, leverages event streams for high-temporal-resolution guidance, and employs a Kalman regularizer to fuse multi-source observations. Furthermore, we design a motion-aware simulated annealing strategy that adaptively schedules the training process based on real-time kinematic states. We also contribute the first RGB-Event paired dataset targeting natural, fast rigid motion across diverse scenarios. Experiments show PEGS's superior performance in reconstructing motion over large spatiotemporal scales compared to mainstream dynamic methods.
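The abstract does not give the form of the acceleration constraint. One common way to express such a prior, shown here purely as an assumed illustration, is to penalize the squared second-order finite difference of the estimated positions; the function name `acceleration_penalty` and the sample trajectories are hypothetical.

```python
def acceleration_penalty(positions, dt):
    """Mean squared finite-difference acceleration of a 3D trajectory.

    positions: list of (x, y, z) tuples sampled every dt seconds.
    Acceleration at step t is approximated by
    (p[t+1] - 2 * p[t] + p[t-1]) / dt**2.
    """
    if len(positions) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        acc = [(n - 2 * c + p) / (dt * dt) for p, c, n in zip(prev, cur, nxt)]
        total += sum(a * a for a in acc)
    return total / (len(positions) - 2)

# Constant-velocity motion has zero acceleration, so it is not penalized.
smooth = [(0, 0, 0), (1, 2, 0), (2, 4, 0), (3, 6, 0)]
# A trajectory that jumps forward and back is penalized.
jerky = [(0, 0, 0), (1, 0, 0), (0, 0, 0)]

smooth_cost = acceleration_penalty(smooth, dt=1.0)
jerky_cost = acceleration_penalty(jerky, dt=1.0)
```

A term like this would typically be added to the photometric loss with a small weight, so that physically implausible jitter is discouraged without forbidding genuine acceleration.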

[104] Off the Planckian Locus: Using 2D Chromaticity to Improve In-Camera Color

SaiKiran Tedla,Joshua E. Little,Hakki Can Karaimer,Michael S. Brown

Main category: cs.CV

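TL;DR: Proposes a color mapping method based on a 2D chromaticity space that replaces conventional 1D interpolation driven by correlated color temperature (CCT). A lightweight multi-layer perceptron (MLP) improves color reproduction under non-Planckian illuminants such as LEDs, reducing angular error by 22% on average while remaining compatible with traditional illuminants and supporting real-time in-camera deployment.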

Details Motivation: Conventional CCT-based color mapping is limited when dealing with modern LED illuminants that deviate from the Planckian locus; it cannot characterize the illumination accurately, which leads to color reproduction errors. Method: The illumination representation is extended from 1D CCT to a 2D chromaticity space, and a lightweight MLP performs the color mapping; the model is trained with a lightbox calibration procedure that includes representative LED sources. Result: Validated across a variety of LED lighting conditions, the method reduces angular reproduction error by 22% on average, remains backward compatible with traditional illuminants, supports multi-illuminant scenes and real-time in-camera deployment, and adds negligible computational cost. Conclusion: MLP color mapping based on 2D chromaticity features outperforms conventional CCT interpolation and handles color reproduction under non-Planckian illuminants more accurately, making it suitable for modern lighting environments. Abstract: Traditional in-camera colorimetric mapping relies on correlated color temperature (CCT)-based interpolation between pre-calibrated transforms optimized for Planckian illuminants such as CIE A and D65. However, modern lighting technologies such as LEDs can deviate substantially from the Planckian locus, exposing the limitations of relying on conventional one-dimensional CCT for illumination characterization. This paper demonstrates that transitioning from 1D CCT (on the Planckian locus) to a 2D chromaticity space (off the Planckian locus) improves colorimetric accuracy across various mapping approaches. In addition, we replace conventional CCT interpolation with a lightweight multi-layer perceptron (MLP) that leverages 2D chromaticity features for robust colorimetric mapping under non-Planckian illuminants. A lightbox-based calibration procedure incorporating representative LED sources is used to train our MLP. Validated across diverse LED lighting, our method reduces angular reproduction error by 22% on average in LED-lit scenes, maintains backward compatibility with traditional illuminants, accommodates multi-illuminant scenes, and supports real-time in-camera deployment with negligible additional computational cost.
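Why a single CCT value loses information can be shown with McCamy's well-known cubic approximation, which estimates CCT from CIE 1931 (x, y) chromaticity. The sketch below is an illustration, not the paper's pipeline: the helper `same_cct_point` is hypothetical and simply constructs a second chromaticity that this approximation maps to the same CCT. The D65 white point coordinates are the standard published values.

```python
def mccamy_cct(x, y):
    """Approximate CCT in kelvin from CIE 1931 xy chromaticity (McCamy's formula)."""
    n = (x - 0.3320) / (0.1858 - y)
    return 449.0 * n ** 3 + 3525.0 * n ** 2 + 6823.3 * n + 5520.33

def same_cct_point(x, y, new_y):
    """Hypothetical helper: another chromaticity with the same McCamy CCT.

    Under this approximation CCT depends only on n, so any point on the line
    through (0.3320, 0.1858) and (x, y) yields the same value.
    """
    n = (x - 0.3320) / (0.1858 - y)
    return 0.3320 + n * (0.1858 - new_y), new_y

d65 = (0.3127, 0.3290)                 # standard D65 white point
shifted = same_cct_point(*d65, new_y=0.3590)

cct_d65 = mccamy_cct(*d65)             # roughly 6500 K
cct_shifted = mccamy_cct(*shifted)     # same value, visibly different color
```

The two chromaticities differ noticeably, yet a 1D CCT index cannot distinguish them. Feeding the full (x, y) pair to the mapping avoids this collapse, which is the motivation the abstract gives for moving off the Planckian locus.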

[105] A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs

Jiaxun Fang,Li Chen

Main category: cs.CV

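TL;DR: Presents a multi-stage optimization framework for deploying learned image compression models on FPGAs, combining dynamic range-aware quantization, mixed-precision search, and channel pruning to achieve low BD-rate overhead and high hardware efficiency with an 8-bit integer model.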

Details Motivation: Learned image compression models perform very well in floating point but are hard to deploy on resource-constrained FPGAs; the main bottlenecks are quantization-induced performance degradation and inefficient hardware implementation. Method: A Dynamic Range-Aware Quantization (DRAQ) method combines statistically calibrated activation clipping with weight regularization; hardware-aware optimization then applies a progressive mixed-precision search algorithm and a channel pruning method adapted to GDN layers. Result: DRAQ reduces BD-rate overhead from 30% to 6.3%; the subsequent hardware optimizations further reduce computational complexity by over 20% with minimal impact on rate-distortion performance. Conclusion: The framework substantially narrows the gap between high-performance floating-point models and efficient fixed-point hardware implementations, yielding a state-of-the-art FPGA-friendly image compression model. Abstract: Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations. First, we address the fundamental problem of quantization-induced performance degradation. We propose a Dynamic Range-Aware Quantization (DRAQ) method that uses statistically-calibrated activation clipping and a novel weight regularization scheme to counteract the effects of extreme data outliers and large dynamic ranges, successfully creating a high-fidelity 8-bit integer model. Second, building on this robust foundation, we introduce two hardware-aware optimization techniques tailored for FPGAs. A progressive mixed-precision search algorithm exploits FPGA flexibility to assign optimal, non-uniform bit-widths to each layer, minimizing complexity while preserving performance. Concurrently, a channel pruning method, adapted to work with the Generalized Divisive Normalization (GDN) layers common in LIC, removes model redundancy by eliminating inactive channels. Our comprehensive experiments show that the foundational DRAQ method reduces the BD-rate overhead of a GDN-based model from $30\%$ to $6.3\%$. The subsequent hardware-aware optimizations further reduce computational complexity by over $20\%$ with negligible impact on RD performance, yielding a final model that is both state-of-the-art in efficiency and superior in quality to existing FPGA-based LIC implementations.
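The effect of outliers on 8-bit quantization, and why clipping helps, can be seen in a small sketch. This is generic percentile-clipped symmetric quantization, not DRAQ: the paper's calibration statistics and weight regularizer are not described in the abstract, and the helper names, the 0.99 percentile, and the toy activations are assumptions.

```python
def percentile(values, q):
    """Nearest-rank style percentile of a list, q in [0, 1]."""
    s = sorted(values)
    idx = min(len(s) - 1, max(0, int(round(q * (len(s) - 1)))))
    return s[idx]

def quantize_int8(values, clip_q=1.0):
    """Symmetric int8 quantization with an optional percentile clip.

    clip_q=1.0 uses the full range (min-max style); smaller values clip
    outliers so the 255 available levels cover the bulk of the data.
    """
    clip = percentile([abs(v) for v in values], clip_q)
    scale = clip / 127.0 if clip > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def mean_abs_error(values, q, scale):
    return sum(abs(v - qi * scale) for v, qi in zip(values, q)) / len(values)

# Toy activations: 101 values in [-1, 1] plus a single extreme outlier.
inliers = [i / 50.0 - 1.0 for i in range(101)]
activations = inliers + [100.0]

q_full, s_full = quantize_int8(activations, clip_q=1.0)
q_clip, s_clip = quantize_int8(activations, clip_q=0.99)

err_full = mean_abs_error(inliers, q_full[:101], s_full)
err_clip = mean_abs_error(inliers, q_clip[:101], s_clip)
```

With the full range, one outlier stretches the step size so far that most inliers collapse onto a handful of levels. Clipping sacrifices the outlier but keeps fine resolution for everything else, which is the trade-off any range-aware scheme has to manage.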

[106] One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution

Yushun Fang,Yuxiang Chen,Shibo Yin,Qiang Hu,Jiangchao Yao,Ya Zhang,Xiaoyun Zhang,Yanfeng Wang

Main category: cs.CV

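TL;DR: Proposes ODTSR, a one-step diffusion transformer for real-world image super-resolution (Real-ISR). Its Noise-hybrid Visual Stream (NVS) design and Fidelity-aware Adversarial Training (FAA) improve fidelity and controllability together, achieving SOTA performance on both generic Real-ISR and Chinese scene text image super-resolution (STISR) without fine-tuning on specific datasets.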

Details Motivation: Existing diffusion models for Real-ISR struggle to balance fidelity and controllability: multi-step methods have low fidelity because of generative randomness, while one-step methods lose control flexibility because they require fidelity-specific fine-tuning. Method: ODTSR is a one-step diffusion transformer built on Qwen-Image. It introduces the Noise-hybrid Visual Stream (NVS), in which a new visual stream receives low-quality images with adjustable noise to strengthen control and the original visual stream receives images with fixed noise to preserve the prior; Fidelity-aware Adversarial Training (FAA) is used to improve the controllability and quality of one-step inference. Result: ODTSR achieves SOTA performance on multiple Real-ISR benchmarks and shows strong prompt controllability on Chinese scene text image super-resolution (STISR), a task it was not specifically trained for. Conclusion: Through the NVS and FAA designs, ODTSR balances fidelity and controllability for diffusion-based Real-ISR, supports flexible one-step generation, and extends to complex scenarios such as text super-resolution, showing strong generalization and practicality. Abstract: Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.

[107] Learning to Look Closer: A New Instance-Wise Loss for Small Cerebral Lesion Segmentation

Luc Bouteille,Alexander Jaus,Jens Kleesiek,Rainer Stiefelhagen,Lukas Heine

Main category: cs.CV

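TL;DR: Proposes CC-DiceCE, a new loss function based on the CC-Metrics framework that addresses under-segmentation of small lesions in medical image segmentation. It outperforms the existing blob loss on multiple datasets, and its effectiveness is validated within the nnU-Net framework.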

Details Motivation: Conventional loss functions such as Dice tend to under-segment small lesions in medical image segmentation because their small volume contributes little to the overall loss; instance-wise loss functions are needed to improve detection. Method: The CC-DiceCE loss is designed on top of the CC-Metrics framework and compared against a DiceCE baseline and blob loss within nnU-Net, evaluating segmentation and detection performance on multiple datasets. Result: CC-DiceCE markedly increases lesion detection recall with no notable loss in segmentation performance, at the cost of slightly more false positives; across datasets it generally outperforms blob loss. Conclusion: CC-DiceCE is an effective instance-wise loss that identifies small lesions better and suits medical image segmentation tasks, especially applications that prioritize high recall. Abstract: Traditional loss functions in medical image segmentation, such as Dice, often under-segment small lesions because their small relative volume contributes negligibly to the overall loss. To address this, instance-wise loss functions and metrics have been proposed to evaluate segmentation quality on a per-lesion basis. We introduce CC-DiceCE, a loss function based on the CC-Metrics framework, and compare it with the existing blob loss. Both are benchmarked against a DiceCE baseline within the nnU-Net framework, which provides a robust and standardized setup. We find that CC-DiceCE loss increases detection (recall) with minimal to no degradation in segmentation performance, albeit at the cost of slightly more false positives. Furthermore, our multi-dataset study shows that CC-DiceCE generally outperforms blob loss.
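The claim that a missed small lesion barely moves the global Dice score can be checked on a toy 1D mask. The sketch below compares global Dice with a per-component average, under the assumption that each position is assigned to its nearest ground-truth component and Dice is computed inside each resulting region. This is a simplified, metric-style illustration with hypothetical function names, not the CC-DiceCE loss as implemented by the authors.

```python
def dice(pred, gt):
    """Dice coefficient of two binary lists (1.0 if both are empty)."""
    inter = sum(p and g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 1.0 if total == 0 else 2.0 * inter / total

def components(mask):
    """Return (start, end) index pairs of contiguous runs of 1s."""
    runs, start = [], None
    for i, v in enumerate(mask):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(mask) - 1))
    return runs

def per_component_dice(pred, gt):
    """Average Dice over regions, one region per ground-truth component."""
    runs = components(gt)
    if not runs:
        return dice(pred, gt)

    def dist(i, run):
        s, e = run
        return 0 if s <= i <= e else min(abs(i - s), abs(i - e))

    # Assign every position to its nearest ground-truth component.
    owner = [min(range(len(runs)), key=lambda k: dist(i, runs[k]))
             for i in range(len(gt))]
    scores = []
    for k in range(len(runs)):
        idx = [i for i, o in enumerate(owner) if o == k]
        scores.append(dice([pred[i] for i in idx], [gt[i] for i in idx]))
    return sum(scores) / len(scores)

# One large lesion (20 voxels) and one small lesion (2 voxels).
gt = [0] * 5 + [1] * 20 + [0] * 10 + [1] * 2 + [0] * 5
# The prediction finds the large lesion perfectly and misses the small one.
pred = [0] * 5 + [1] * 20 + [0] * 17

global_score = dice(pred, gt)                   # 40 / 42, about 0.95
instance_score = per_component_dice(pred, gt)   # (1.0 + 0.0) / 2 = 0.5
```

The global score stays near 0.95 even though half of the lesions were missed, while the per-component average drops to 0.5. A loss built on the per-component view therefore gives small lesions the same weight as large ones.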

[108] A lightweight detector for real-time detection of remote sensing images

Qianyi Wang,Guoqiang Ren

Main category: cs.CV

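TL;DR: Proposes DMG-YOLO, a lightweight real-time detector for small objects in remote sensing images that improves detection performance through a dual-branch feature extraction module and a multi-scale feature fusion module.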

Details Motivation: Remote sensing images contain many small objects, and detectors must balance accuracy against efficiency; existing methods struggle to meet real-time requirements. Method: A Dual-branch Feature Extraction (DFE) module and a Multi-scale Feature Fusion (MFF) module are designed, and a Global and Local Aggregate Feature Pyramid Network (GLAFPN) is introduced in the neck. Result: Experiments on the VisDrone2019 and NWPU VHR-10 datasets show that DMG-YOLO is competitive on key metrics such as mAP and model size. Conclusion: DMG-YOLO improves both the accuracy and the efficiency of small object detection in remote sensing images and is suitable for real-time applications. Abstract: Remote sensing imagery is widely used across various fields, yet real-time detection remains challenging due to the prevalence of small objects and the need to balance accuracy with efficiency. To address this, we propose DMG-YOLO, a lightweight real-time detector tailored for small object detection in remote sensing images. Specifically, we design a Dual-branch Feature Extraction (DFE) module in the backbone, which partitions feature maps into two parallel branches: one extracts local features via depthwise separable convolutions, and the other captures global context using a vision transformer with a gating mechanism. Additionally, a Multi-scale Feature Fusion (MFF) module with dilated convolutions enhances multi-scale integration while preserving fine details. In the neck, we introduce the Global and Local Aggregate Feature Pyramid Network (GLAFPN) to further boost small object detection through global-local feature fusion. Extensive experiments on the VisDrone2019 and NWPU VHR-10 datasets show that DMG-YOLO achieves competitive performance in terms of mAP, model size, and other key metrics.
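The abstract attributes part of the detector's light weight to depthwise separable convolutions. The parameter saving follows from simple arithmetic, worked through below; the channel counts and kernel size are assumed example values, not taken from DMG-YOLO, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise

# Assumed example layer: 64 input channels, 128 output channels, 3 x 3 kernel.
standard = standard_conv_params(64, 128, 3)          # 73728
separable = depthwise_separable_params(64, 128, 3)   # 576 + 8192 = 8768
ratio = standard / separable                         # about 8.4x fewer weights
```

The saving approaches a factor of k squared as the number of output channels grows, which is why this factorization is a common building block in real-time detectors.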

[109] DiffRefiner: Coarse to Fine Trajectory Planning via Diffusion Refinement with Semantic Interaction for End to End Autonomous Driving

Liuhan Yin,Runkun Ju,Guodong Guo,Erkang Cheng

Main category: cs.CV

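TL;DR: Proposes DiffRefiner, a two-stage trajectory prediction framework that combines discriminative proposals with generative refinement and achieves state-of-the-art performance in autonomous driving.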

Details Motivation: Existing generative trajectory prediction methods rely on denoising hand-crafted anchors or random noise, leaving room to improve both flexibility and accuracy. Method: In the first stage, a transformer-based Proposal Decoder produces coarse predictions by regressing from sensor inputs using predefined trajectory anchors; in the second stage, a Diffusion Refiner iteratively denoises and refines these initial predictions, and a fine-grained denoising decoder is designed to strengthen alignment with the environment. Result: The method reaches 87.4 EPDMS on NAVSIM v2, and 87.1 DS and 71.4 SR on Bench2Drive, setting new records on both; ablation studies validate each component. Conclusion: By fusing discriminative proposals with generative refinement, DiffRefiner significantly improves the accuracy and scene consistency of trajectory prediction in autonomous driving. Abstract: Unlike discriminative approaches in autonomous driving that predict a fixed set of candidate trajectories of the ego vehicle, generative methods, such as diffusion models, learn the underlying distribution of future motion, enabling more flexible trajectory prediction. However, since these methods typically rely on denoising human-crafted trajectory anchors or random noise, there remains significant room for improvement. In this paper, we propose DiffRefiner, a novel two-stage trajectory prediction framework. The first stage uses a transformer-based Proposal Decoder to generate coarse trajectory predictions by regressing from sensor inputs using predefined trajectory anchors. The second stage applies a Diffusion Refiner that iteratively denoises and refines these initial predictions. In this way, we enhance the performance of diffusion-based planning by incorporating a discriminative trajectory proposal module, which provides strong guidance for the generative refinement process. Furthermore, we design a fine-grained denoising decoder to enhance scene compliance, enabling more accurate trajectory prediction through enhanced alignment with the surrounding environment. Experimental results demonstrate that DiffRefiner achieves state-of-the-art performance, attaining 87.4 EPDMS on NAVSIM v2, and 87.1 DS along with 71.4 SR on Bench2Drive, thereby setting new records on both public benchmarks. The effectiveness of each component is validated via ablation studies as well.

[110] UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network

Nhat-Tuong Do-Tran,Ngoc-Hoang-Lam Le,Ching-Chun Huang

Main category: cs.CV

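TL;DR: Proposes UI-Styler, a new domain adaptation method for ultrasound images that uses texture transfer and a class-aware prompting strategy to achieve better semantic alignment and downstream task performance in unpaired image translation.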

Details Motivation: Ultrasound images differ in appearance across devices, which degrades the performance of fixed black-box models; existing unpaired image translation methods often ignore class-level semantic alignment, which harms diagnostic accuracy. Method: The UI-Styler framework uses a pattern-matching mechanism to transfer target-domain textures onto source images while preserving structural content, and introduces a pseudo-label-guided class-aware prompting strategy to strengthen semantic alignment. Result: In experiments on multiple ultrasound cross-device tasks, UI-Styler outperforms existing methods in distribution distance and in downstream tasks such as classification and segmentation, reaching state-of-the-art performance. Conclusion: UI-Styler effectively mitigates cross-device domain shift in ultrasound images and improves semantic consistency and diagnostic reliability in unpaired image translation. Abstract: The appearance of ultrasound images varies across acquisition devices, causing domain shifts that degrade the performance of fixed black-box downstream inference models when reused. To mitigate this issue, it is practical to develop unpaired image translation (UIT) methods that effectively align the statistical distributions between source and target domains, particularly under the constraint of a reused inference-blackbox setting. However, existing UIT approaches often overlook class-specific semantic alignment during domain adaptation, resulting in misaligned content-class mappings that can impair diagnostic accuracy. To address this limitation, we propose UI-Styler, a novel ultrasound-specific, class-aware image style transfer framework. UI-Styler leverages a pattern-matching mechanism to transfer texture patterns embedded in the target images onto source images while preserving the source structural content. In addition, we introduce a class-aware prompting strategy guided by pseudo labels of the target domain, which enforces accurate semantic alignment with diagnostic categories. Extensive experiments on ultrasound cross-device tasks demonstrate that UI-Styler consistently outperforms existing UIT methods, achieving state-of-the-art performance in distribution distance and downstream tasks, such as classification and segmentation.

[111] FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle

Mario Markov,Stefan Maria Ailuro,Luc Van Gool,Konrad Schindler,Danda Pani Paudel

Main category: cs.CV

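TL;DR: Introduces the FireScope-Bench dataset and FireScope, a vision-language-model-based framework for cross-continental wildfire risk prediction that combines remote sensing and climate data and uses language-based reasoning to improve generalization and interpretability.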

Details Motivation: Existing wildfire risk prediction methods lack causal reasoning and multimodal understanding, which makes reliable generalization difficult. Method: The large-scale multimodal dataset FireScope-Bench is built from Sentinel-2 imagery and climate data, and the FireScope framework is trained jointly with reinforcement learning and visual supervision to generate risk maps accompanied by reasoning traces. Result: In cross-continental testing (trained in the USA, tested in Europe), FireScope achieves substantial gains over existing methods, and its reasoning traces are verified to be semantically meaningful and faithful to the input; the result is high-resolution, interpretable wildfire risk modeling that transfers across regions. Conclusion: Language-driven reasoning effectively improves the generalization and interpretability of raster generation models, and FireScope-Bench may become a foundation for moving spatial modeling toward reasoning-driven, interpretable approaches. Abstract: Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce $\textbf{FireScope-Bench}$, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose $\textbf{FireScope}$, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, $\textbf{FireScope}$ achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that $\textbf{FireScope-Bench}$ has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

[112] Investigating self-supervised representations for audio-visual deepfake detection

Dragos-Alexandru Boldisor,Stefan Smeu,Dan Oneata,Elisabeta Oneata

Main category: cs.CV

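TL;DR: Systematically evaluates self-supervised representations for deepfake detection across audio, video, and multimodal settings. The features capture meaningful, complementary forgery cues, and models attend to semantically relevant regions rather than artifacts, but cross-dataset generalization remains a challenge.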

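Details Motivation: To explore the potential of self-supervised representations for audio-visual deepfake detection and address the tendency of existing work to use such features in isolation or inside complex architectures. Method: Self-supervised features are systematically evaluated across modalities (audio, video, multimodal) and domains (lip movements, generic visual content) along three dimensions: detection performance, interpretability, and cross-modal complementarity. Result: Most self-supervised features capture deepfake-relevant information, and this information is complementary; models mainly attend to semantically salient regions; however, none of the features generalize well across datasets. Conclusion: Self-supervised representations are promising for deepfake detection because they learn meaningful patterns rather than superficial artifacts, but reliable cross-domain performance remains a fundamental challenge that likely stems from dataset characteristics rather than from the features themselves. Abstract: Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.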

[113] Navigating in the Dark: A Multimodal Framework and Dataset for Nighttime Traffic Sign Recognition

Aditya Mishra,Akshay Agarwal,Haroon Lone

Main category: cs.CV

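TL;DR: Introduces INTSD, a large-scale dataset for nighttime traffic sign recognition, and LENS-Net, a new method that combines image enhancement with multimodal graph reasoning to improve recognition under low illumination.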

Details Motivation: Nighttime traffic sign recognition is challenging because of poor illumination and the lack of public datasets; existing methods are not robust in low light and fail to exploit multimodal information effectively. Method: The INTSD dataset covers 41 classes of nighttime traffic signs collected across regions of India; LENS-Net combines an adaptive image enhancement detector with a CLIP-GCNN-based multimodal classifier, integrating illumination correction, localization, and recognition. Result: Extensive experiments on INTSD show that LENS-Net clearly outperforms existing methods, and ablation studies confirm the effectiveness of each component. Conclusion: By combining image enhancement with multimodal graph reasoning, LENS-Net achieves greater robustness and accuracy in nighttime traffic sign recognition and advances related applications. Abstract: Traffic signboards are vital for road safety and intelligent transportation systems, enabling navigation and autonomous driving. Yet, recognizing traffic signs at night remains challenging due to visual noise and scarcity of public nighttime datasets. Despite advances in vision architectures, existing methods struggle with robustness under low illumination and fail to leverage complementary multimodal cues effectively. To overcome these limitations, firstly, we introduce INTSD, a large-scale dataset comprising street-level night-time images of traffic signboards collected across diverse regions of India. The dataset spans 41 traffic signboard classes captured under varying lighting and weather conditions, providing a comprehensive benchmark for both detection and classification tasks. To benchmark INTSD for night-time sign recognition, we conduct extensive evaluations using state-of-the-art detection and classification models. Secondly, we propose LENS-Net, which integrates an adaptive image enhancement detector for joint illumination correction and sign localization, followed by a structured multimodal CLIP-GCNN classifier that leverages cross-modal attention and graph-based reasoning for robust and semantically consistent recognition. Our method surpasses existing frameworks, with ablation studies confirming the effectiveness of its key components. The dataset and code for LENS-Net are publicly available for research.

[114] PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention

Yipeng Chen,Zhichao Ye,Zhenzhou Fang,Xinyu Chen,Xiaoyu Zhang,Jialing Liu,Nan Wang,Haomin Liu,Guofeng Zhang

Main category: cs.CV

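TL;DR: PostCam is a framework for novel-view video generation in dynamic scenes that supports post-capture editing of camera trajectories. A query-shared cross-attention module that combines 6-DoF camera poses with 2D rendered frames improves camera control precision and the quality of the generated video.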

Details Motivation: Existing video recapture methods use inadequate camera motion injection strategies, which leads to low control precision and difficulty preserving details from the source video. PostCam aims for more precise and flexible motion manipulation. Method: A query-shared cross-attention module fuses 6-DoF camera poses and 2D rendered video frames to extract motion cues in a unified feature space; a two-stage training strategy first learns coarse camera control and then incorporates visual information to improve precision and visual fidelity. Result: Experiments on real-world and synthetic datasets show that PostCam exceeds state-of-the-art methods by over 20% in camera control precision and view consistency while achieving the highest video generation quality. Conclusion: By improving the motion injection mechanism and adopting two-stage training, PostCam markedly improves control flexibility and visual quality for video recapture in dynamic scenes and suits applications that require post-capture camera editing. Abstract: We propose PostCam, a framework for novel-view video generation that enables post-capture editing of camera trajectories in dynamic scenes. We find that existing video recapture methods suffer from suboptimal camera motion injection strategies; such suboptimal designs not only limit camera control precision but also result in generated videos that fail to preserve fine visual details from the source video. To achieve more accurate and flexible motion manipulation, PostCam introduces a query-shared cross-attention module. It integrates two distinct forms of control signals: the 6-DoF camera poses and the 2D rendered video frames. By fusing them into a unified representation within a shared feature space, our model can extract underlying motion cues, which enhances both control precision and generation quality. Furthermore, we adopt a two-stage training strategy: the model first learns coarse camera control from pose inputs, and then incorporates visual information to refine motion accuracy and enhance visual fidelity. Experiments on both real-world and synthetic datasets demonstrate that PostCam outperforms state-of-the-art methods by over 20% in camera control precision and view consistency, while achieving the highest video generation quality. Our project webpage is publicly available at: https://cccqaq.github.io/PostCam.github.io/

[115] Real Noise Decoupling for Hyperspectral Image Denoising

Yingkai Zhang,Tao Zhang,Jing Nie,Ying Fu

Main category: cs.CV

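TL;DR: Proposes a multi-stage noise-decoupling framework that decomposes complex noise into explicitly modeled and implicitly modeled components and combines pre-training with joint fine-tuning to improve hyperspectral image denoising.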

Details Motivation: Noise in real hyperspectral images is complex and hard to model accurately, which limits the effectiveness of existing denoising methods based on noise modeling. Method: A multi-stage noise-decoupling framework is used: first, an existing noise model generates data to pre-train a network that handles explicitly modeled noise; next, a high-frequency wavelet guided network uses the pre-trained knowledge to adaptively extract high-frequency features and remove implicitly modeled noise; the whole framework is optimized with stage-wise pre-training followed by joint fine-tuning. Result: Experiments on public and self-captured datasets show that the method outperforms state-of-the-art approaches, handles complex real-world noise effectively, and significantly improves hyperspectral image quality. Conclusion: By decomposing noise and modeling each part separately, the proposed framework improves the denoising network's ability to learn real, complex noise and achieves better hyperspectral image denoising. Abstract: Hyperspectral image (HSI) denoising is a crucial step in enhancing the quality of HSIs. Noise modeling methods can fit noise distributions to generate synthetic HSIs to train denoising networks. However, the noise in captured HSIs is usually complex and difficult to model accurately, which significantly limits the effectiveness of these approaches. In this paper, we propose a multi-stage noise-decoupling framework that decomposes complex noise into explicitly modeled and implicitly modeled components. This decoupling reduces the complexity of noise and enhances the learnability of HSI denoising methods when applied to real paired data. Specifically, for explicitly modeled noise, we utilize an existing noise model to generate paired data for pre-training a denoising network, equipping it with prior knowledge to handle the explicitly modeled noise effectively. For implicitly modeled noise, we introduce a high-frequency wavelet guided network. Leveraging the prior knowledge from the pre-trained module, this network adaptively extracts high-frequency features to target and remove the implicitly modeled noise from real paired HSIs. Furthermore, to effectively eliminate all noise components and mitigate error accumulation across stages, a multi-stage learning strategy, comprising separate pre-training and joint fine-tuning, is employed to optimize the entire framework. Extensive experiments on public and our captured datasets demonstrate that our proposed framework outperforms state-of-the-art methods, effectively handling complex real-world noise and significantly enhancing HSI quality.

[116] VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation

Hanyu Zhou,Chuanhao Ma,Gim Hee Lee

Main category: cs.CV

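TL;DR: Proposes VLA-4D, a vision-language-action model with 4D awareness that embeds time into 3D positions to achieve spatiotemporally coherent robotic manipulation.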

Details Motivation: Existing methods fall short in spatiotemporally coherent manipulation control and lack support for fine-grained spatiotemporal representations. Method: A 4D-aware visual representation embeds 1D time into 3D positions and fuses the result through a cross-attention mechanism; spatial actions are extended into a spatiotemporal action representation and aligned with a large language model for prediction. Result: Experiments show that the method outperforms existing approaches across a range of robotic manipulation tasks, producing spatially smoother and temporally more coherent control. Conclusion: Through a unified 4D-aware framework, VLA-4D markedly improves the spatiotemporal consistency of robotic manipulation and has broad application potential. Abstract: Vision-language-action (VLA) models show potential for general robotic tasks, but spatiotemporally coherent manipulation, which requires fine-grained representations, remains challenging for them. Typically, existing methods embed 3D positions into visual representations to enhance the spatial precision of actions. However, these methods struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation. We extend conventional spatial action representations with temporal information to enable spatiotemporal planning, and align the multimodal representations into the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments have been conducted to verify the superiority of our method across different tasks of robotic manipulation.

[117] Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning

Jiayi Wang,Wei Dai,Haoyu Wang,Sihan Yang,Haixia Bi,Jian Sun

Main category: cs.CV

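TL;DR: Proposes CA-SAM, a continual medical image segmentation method based on the Segment Anything Model (SAM). A lightweight Alignment Layer improves computational efficiency and accuracy and mitigates catastrophic forgetting, achieving state-of-the-art performance on nine medical datasets.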

Details Motivation: Differing privacy policies across medical institutions make joint training difficult, so models must learn continually from data streams without forgetting earlier knowledge; at the same time, SAM's large parameter count and computational overhead limit its use in continual learning. Method: A lightweight, plug-and-play Alignment Layer aligns encoder-decoder feature distributions so that SAM can be adapted efficiently to specific medical imaging tasks; building on it, CA-SAM is a continual learning strategy that automatically selects the appropriate Alignment Layer to mitigate catastrophic forgetting and uses SAM's zero-shot priors to keep good generalization on unseen data. Result: In continual learning scenarios on nine medical image segmentation datasets, CA-SAM achieves state-of-the-art performance. Conclusion: By introducing a lightweight Alignment Layer, CA-SAM balances SAM's computational efficiency and segmentation accuracy, demonstrates SAM's potential in continual learning, and offers an efficient, scalable solution for medical image segmentation. Abstract: In medical image segmentation, heterogeneous privacy policies across institutions often make joint training on pooled datasets infeasible, motivating continual image segmentation, that is, learning from data streams without catastrophic forgetting. While the Segment Anything Model (SAM) offers strong zero-shot priors and has been widely fine-tuned across downstream tasks, its large parameter count and computational overhead challenge practical deployment. This paper demonstrates that the SAM paradigm is highly promising once its computational efficiency and performance can be balanced. To this end, we introduce the Alignment Layer, a lightweight, plug-and-play module which aligns encoder-decoder feature distributions to efficiently adapt SAM to specific medical images, improving accuracy while reducing computation. Building on SAM and the Alignment Layer, we then propose Continual Alignment for SAM (CA-SAM), a continual learning strategy that automatically adapts the appropriate Alignment Layer to mitigate catastrophic forgetting, while leveraging SAM's zero-shot priors to preserve strong performance on unseen medical datasets. Evaluated on nine medical segmentation datasets under a continual-learning scenario, CA-SAM achieves state-of-the-art performance. Our code, models and datasets will be released at https://github.com/azzzzyo/Continual-Alignment-for-SAM.

[118] SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors

Kunyi Li,Michael Niemeyer,Sen Wang,Stefano Gasperini,Nassir Navab,Federico Tombari

Main category: cs.CV

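TL;DR: SING3R-SLAM is a globally consistent and compact Gaussian-based dense RGB SLAM framework. It combines locally consistent 3D reconstructions with a global Gaussian map to jointly refine geometry and camera poses efficiently, and it performs strongly in tracking, reconstruction, and novel view synthesis.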

Details Motivation: Existing dense 3D reconstruction methods suffer from accumulated drift and redundant point maps when integrated into SLAM, which hurts efficiency and downstream tasks such as novel view synthesis. Method: SING3R-SLAM first builds locally consistent submaps with a lightweight tracking and reconstruction module, then progressively aligns and fuses them into a global Gaussian map. That map enforces cross-view geometric consistency and feeds back to correct local drift, making tracking more robust. Result: On real-world datasets, SING3R-SLAM achieves state-of-the-art tracking accuracy (an improvement of over 12%) and finer 3D reconstruction while keeping a compact, memory-efficient global representation. Conclusion: By jointly optimizing local reconstructions and a global Gaussian representation, SING3R-SLAM delivers an efficient, robust dense SLAM system suited to multiple downstream tasks. Abstract: Recent advances in dense 3D reconstruction enable the accurate capture of local geometry; however, integrating them into SLAM is challenging due to drift and redundant point maps, which limit efficiency and downstream tasks, such as novel view synthesis. To address these issues, we propose SING3R-SLAM, a globally consistent and compact Gaussian-based dense RGB SLAM framework. The key idea is to combine locally consistent 3D reconstructions with a unified global Gaussian representation that jointly refines scene geometry and camera poses, enabling efficient and versatile 3D mapping for multiple downstream applications. SING3R-SLAM first builds locally consistent submaps through our lightweight tracking and reconstruction module, and then progressively aligns and fuses them into a global Gaussian map that enforces cross-view geometric consistency. This global map, in turn, provides feedback to correct local drift and enhance the robustness of tracking. Extensive experiments demonstrate that SING3R-SLAM achieves state-of-the-art tracking, 3D reconstruction, and novel view rendering, resulting in over 12% improvement in tracking and producing finer, more detailed geometry, all while maintaining a compact and memory-efficient global representation on real-world datasets.

[119] Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers

Cris Claessens,Christiaan Viviers,Giacomo D'Amicantonio,Egor Bondarev,Fons van der Sommen

Main category: cs.CV

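TL;DR: SPECTRE is a fully Transformer-based foundation model for 3D medical imaging. It learns general-purpose CT representations through self-supervised and cross-modal pretraining and outperforms prior methods on multiple benchmarks.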

Details Motivation: Standard Transformer and contrastive learning recipes struggle with the extreme token counts, geometric anisotropy, and noisy clinical annotations of volumetric CT, so a dedicated foundation model is needed. Method: The SPECTRE framework combines local and global 3D Vision Transformers and pretrains them with DINO self-distillation and SigLIP vision-language alignment, jointly optimized on public CT datasets and radiology reports. Result: Across multiple CT benchmarks, SPECTRE outperforms existing CT foundation models in both zero-shot and fine-tuned settings. Conclusion: SPECTRE is a scalable, open, fully Transformer-based foundation model for 3D medical imaging, showing that high-performing representations can be learned from public data alone. Abstract: We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.

[120] FisheyeGaussianLift: BEV Feature Lifting for Surround-View Fisheye Camera Perception

Shubham Sonarghare,Prasad Deshpande,Ciaran Hogan,Deepika-Rani Kaliappan-Mahalingam,Ganesh Sistu

Main category: cs.CV

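TL;DR: Proposes a distortion-aware BEV segmentation framework that processes multi-camera fisheye images directly, achieving accurate semantic segmentation through geometric unprojection and per-pixel depth distribution estimation.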

Details Motivation: Fisheye images exhibit extreme non-linear distortion, occlusion, and depth ambiguity, so existing methods struggle to produce accurate BEV semantic segmentation. Method: Each pixel is lifted into 3D space through calibrated geometric unprojection, a Gaussian parameterization models its spatial mean and anisotropic covariance, and differentiable splatting fuses the Gaussians into a BEV representation. Result: The method performs well in complex parking and urban driving scenes, reaching IoU scores of 87.75% for drivable regions and 57.26% for vehicles. Conclusion: The method produces continuous, uncertainty-aware semantic maps without undistortion or perspective rectification, effectively handling the challenges of fisheye imagery. Abstract: Accurate BEV semantic segmentation from fisheye imagery remains challenging due to extreme non-linear distortion, occlusion, and depth ambiguity inherent to wide-angle projections. We present a distortion-aware BEV segmentation framework that directly processes multi-camera high-resolution fisheye images, utilizing calibrated geometric unprojection and per-pixel depth distribution estimation. Each image pixel is lifted into 3D space via Gaussian parameterization, predicting spatial means and anisotropic covariances to explicitly model geometric uncertainty. The projected 3D Gaussians are fused into a BEV representation via differentiable splatting, producing continuous, uncertainty-aware semantic maps without requiring undistortion or perspective rectification. Extensive experiments demonstrate strong segmentation performance on complex parking and urban driving scenarios, achieving IoU scores of 87.75% for drivable regions and 57.26% for vehicles under severe fisheye distortion and diverse environmental conditions.
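
The lifting step described above can be illustrated with a small sketch. It assumes a pixel already has a unit-normalizable viewing ray from the calibrated unprojection and a categorical distribution over discrete depth bins; the function names, the bin layout, and the reduction to a single depth variance (rather than a full anisotropic covariance) are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lift_pixel(ray_dir, depth_bins, depth_logits):
    """Turn one pixel's depth distribution into a 3D mean and a depth variance.

    ray_dir      -- viewing ray of the pixel (hypothetical output of unprojection)
    depth_bins   -- candidate depths in metres (assumed discretization)
    depth_logits -- unnormalized scores, one per depth bin
    """
    probs = softmax(depth_logits)
    mean_depth = sum(p * d for p, d in zip(probs, depth_bins))
    var_depth = sum(p * (d - mean_depth) ** 2 for p, d in zip(probs, depth_bins))
    norm = math.sqrt(sum(c * c for c in ray_dir))
    unit = [c / norm for c in ray_dir]
    # The Gaussian mean sits at the expected depth along the ray.
    mean_xyz = [mean_depth * c for c in unit]
    return mean_xyz, var_depth
```

A wide depth distribution yields a large variance, which is one simple way for depth ambiguity to surface as geometric uncertainty before splatting into BEV.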

[121] Dual-domain Adaptation Networks for Realistic Image Super-resolution

Chaowei Fang,Bolin Fu,De Cheng,Lechao Cheng,Guanbin Li

Main category: cs.CV

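TL;DR: This paper proposes Dual-domain Adaptation Networks (DAN), which transfer pretrained image super-resolution models from synthetic data to real-world scenes and improve realistic SR performance through joint spatial-domain and frequency-domain adaptation.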

Details Motivation: Existing realistic image super-resolution methods are limited by the scarcity of paired real low-resolution and high-resolution data and struggle to learn image features adequately; models pretrained on synthetic data carry rich prior knowledge but suffer from a domain gap when applied directly to real scenes. Method: Dual-domain Adaptation Networks (DAN) adapt in two domains. In the spatial domain, selected parameters of the pretrained model are updated and low-rank adaptation adjusts the frozen ones. In the frequency domain, an extra branch fuses the spectrum of the input image with intermediate backbone features to predict a high-frequency residual map that sharpens reconstructed detail. Result: DAN outperforms state-of-the-art methods on several realistic SR benchmarks, including RealSR, D2CRealSR, and DRealSR, which supports its effectiveness and generalization. Conclusion: By combining spatial-domain and frequency-domain adaptation, DAN transfers pretrained models efficiently to realistic SR, improving restoration quality while reducing reliance on real annotated data. Abstract: Realistic image super-resolution (SR) focuses on transforming real-world low-resolution (LR) images into high-resolution (HR) ones, handling more complex degradation patterns than synthetic SR tasks. This is critical for applications like surveillance, medical imaging, and consumer electronics. However, current methods struggle with limited real-world LR-HR data, impacting the learning of basic image features. Pre-trained SR models from large-scale synthetic datasets offer valuable prior knowledge, which can improve generalization, speed up training, and reduce the need for extensive real-world data in realistic SR tasks. In this paper, we introduce a novel approach, Dual-domain Adaptation Networks, which is able to efficiently adapt pre-trained image SR models from simulated to real-world datasets. To achieve this target, we first set up a spatial-domain adaptation strategy through selectively updating parameters of pre-trained models and employing the low-rank adaptation technique to adjust frozen parameters. Recognizing that image super-resolution involves recovering high-frequency components, we further integrate a frequency domain adaptation branch into the adapted model, which combines the spectral data of the input and the spatial-domain backbone's intermediate features to infer HR frequency maps, enhancing the SR result. Experimental evaluations on public realistic image SR benchmarks, including RealSR, D2CRealSR, and DRealSR, demonstrate the superiority of our proposed method over existing state-of-the-art models. Codes are available at: https://github.com/dummerchen/DAN.
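
The low-rank adaptation mentioned in the spatial-domain strategy can be sketched for a single linear layer. This is the generic LoRA formulation, y = Wx + (alpha / r) B(Ax), written with plain lists; the function names and the `alpha` scaling parameter are illustrative, and how DAN actually places the adapters is not stated in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    A has shape (r x d_in) and B has shape (d_out x r); only A and B would
    receive gradients, while W keeps its pretrained values.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]
```

With `B` initialized to zeros the adapted layer reproduces the pretrained output exactly, so adaptation starts from the synthetic-data prior and departs from it only as `A` and `B` are trained.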

[122] QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

Adam Lilja,Ji Lan,Junsheng Fu,Lars Hammarstrand

Main category: cs.CV

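TL;DR: QueryOcc is a query-based self-supervised framework that learns continuous 3D semantic occupancy directly from 4D spatio-temporal queries sampled across adjacent frames. It supports supervision from vision foundation models or raw lidar data, surpasses previous camera-based methods by 26% on the Occ3D-nuScenes benchmark, and runs at 11.6 FPS.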

Details Motivation: Large-scale 3D annotation is prohibitively expensive, and existing methods are limited in explicit 3D structure modeling and scalability, so a self-supervised method that learns continuous 3D semantic occupancy efficiently without manual labels is needed. Method: QueryOcc supervises 3D semantic occupancy learning directly through independent 4D spatio-temporal queries. A contractive scene representation preserves near-field detail while compressing distant regions, enabling long-range reasoning under constant memory. Result: On the self-supervised Occ3D-nuScenes benchmark, semantic RayIoU improves by 26% at 11.6 FPS, supporting the effectiveness of direct 4D query supervision. Conclusion: Through direct 4D query supervision and an efficient scene representation, QueryOcc achieves high-performing, scalable self-supervised 3D semantic occupancy learning and offers a new direction for 3D scene understanding in autonomous driving. Abstract: Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. Project page: https://research.zenseact.com/publications/queryocc/
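
To make the idea of a contractive scene representation concrete, here is a minimal sketch of one well-known contraction, the one popularized by mip-NeRF 360. The abstract does not say which function QueryOcc uses, so this choice, the unit inner radius, and the function name are assumptions for illustration only.

```python
import math

def contract(point):
    """Map an unbounded 3D point into a ball of radius 2.

    Points with norm <= 1 (the near field, in normalized units) are left
    untouched; farther points are squashed smoothly so that infinity maps
    to the sphere of radius 2.
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= 1.0:
        return list(point)
    scale = (2.0 - 1.0 / norm) / norm
    return [scale * c for c in point]
```

Because every contracted point lies inside a fixed ball, a fixed-size grid or feature volume can cover arbitrarily distant regions, which is the property that allows constant memory.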

[123] Equivariant-Aware Structured Pruning for Efficient Edge Deployment: A Comprehensive Framework with Adaptive Fine-Tuning

Mohammed Alnemari

Main category: cs.CV

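TL;DR: Proposes a compression framework that combines group equivariant convolutional neural networks (G-CNNs) with equivariance-preserving structured pruning, yielding efficient models that are invariant to geometric transformations such as rotation and suited to resource-constrained environments.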

Details Motivation: Compressing G-CNN models for resource-constrained environments while preserving equivariance remains challenging; existing pruning methods often break the equivariant structure and harm robustness. Method: The authors build a C4 cyclic-group equivariant network with the e2cnn library, apply neuron-level structured pruning to the fully connected layers, and add adaptive fine-tuning (triggered when accuracy drops by more than 2%, with early stopping and learning rate scheduling) to recover performance. Knowledge distillation and dynamic INT8 quantization complete the optimization pipeline. Result: Evaluated on EuroSAT, CIFAR-10, and Rotated MNIST, the method reduces parameters by 29.3% and recovers much of the accuracy lost to pruning while keeping robustness to geometric transformations. Conclusion: The framework combines equivariance preservation with model compression and offers a reproducible route to deploying equivariant networks, with particular relevance to satellite imagery analysis and geometric vision tasks. Abstract: This paper presents a novel framework combining group equivariant convolutional neural networks (G-CNNs) with equivariant-aware structured pruning to produce compact, transformation-invariant models for resource-constrained environments. Equivariance to rotations is achieved through the C4 cyclic group via the e2cnn library, enabling consistent performance under geometric transformations while reducing computational overhead. Our approach introduces structured pruning that preserves equivariant properties by analyzing e2cnn layer structure and applying neuron-level pruning to fully connected components. To mitigate accuracy degradation, we implement adaptive fine-tuning that automatically triggers when accuracy drop exceeds 2%, using early stopping and learning rate scheduling for efficient recovery. The framework includes dynamic INT8 quantization and a comprehensive pipeline encompassing training, knowledge distillation, structured pruning, fine-tuning, and quantization. We evaluate our method on satellite imagery (EuroSAT) and standard benchmarks (CIFAR-10, Rotated MNIST), demonstrating effectiveness across diverse domains. Experimental results show 29.3% parameter reduction with significant accuracy recovery, demonstrating that structured pruning of equivariant networks achieves substantial compression while maintaining geometric robustness. Our pipeline provides a reproducible framework for optimizing equivariant models, bridging the gap between group-theoretic network design and practical deployment constraints, with particular relevance to satellite imagery analysis and geometric vision tasks.
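
The adaptive fine-tuning rule can be sketched as two small helpers. The sketch assumes the 2% threshold refers to absolute percentage points of accuracy and models early stopping with a simple patience counter; both choices, and all names below, are assumptions, since the abstract does not give these details.

```python
def needs_fine_tuning(baseline_acc, pruned_acc, max_drop=2.0):
    """Trigger recovery when pruning costs more than `max_drop` accuracy points."""
    return (baseline_acc - pruned_acc) > max_drop

def early_stopping_epoch(val_accuracies, patience=3):
    """Scan per-epoch validation accuracies and stop after `patience` stale epochs.

    Returns (best_epoch, best_accuracy) as seen up to the stopping point.
    """
    best_acc, best_epoch, stale = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_acc
```

In a real pipeline the accuracy list would come from evaluating the pruned model after each fine-tuning epoch; here it is passed in directly so that the control flow stays visible.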

[124] Blind Deconvolution for Color Images Using Normalized Quaternion Kernels

Yuming Yang,Michael K. Ng,Zhigang Jia,Wei Wang

Main category: cs.CV

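TL;DR: Proposes a new blind deconvolution method for color images based on a quaternion fidelity term, which exploits the relationships between color channels, removes artifacts effectively, and significantly improves deblurring.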

Details Motivation: Existing methods often convert color images to grayscale or process each color channel separately, ignoring inter-channel relationships and limiting deblurring quality. Method: The authors design a new quaternion fidelity term built on a quaternion convolution kernel made of four kernels: one captures the overall blur, and three unconstrained kernels correspond to the R, G, and B channels and model their unknown dependencies. A normalized quaternion kernel is used to preserve image intensity. Result: Extensive experiments on real datasets of blurred color images show that the method removes artifacts effectively and significantly improves deblurring. Conclusion: By exploiting the coupling between color channels, the method performs well in blind deconvolution and is a promising tool for color image deblurring. Abstract: In this work, we address the challenging problem of blind deconvolution for color images. Existing methods often convert color images to grayscale or process each color channel separately, which overlooks the relationships between color channels. To handle this issue, we formulate a novel quaternion fidelity term designed specifically for color image blind deconvolution. This fidelity term leverages the properties of a quaternion convolution kernel, which consists of four kernels: one that functions similarly to a non-negative convolution kernel to capture the overall blur, and three additional unconstrained convolution kernels, corresponding to the red, green and blue channels respectively, that model their unknown interdependencies. In order to preserve image intensity, we propose to use the normalized quaternion kernel in the blind deconvolution process. Extensive experiments on real datasets of blurred color images show that the proposed method effectively removes artifacts and significantly improves the deblurring effect, demonstrating its potential as a powerful tool for color image deconvolution.

[125] Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

Jiaye Qian,Ge Zheng,Yuchen Zhu,Sibei Yang

Main category: cs.CV

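TL;DR: This paper proposes a comprehensive intervention framework for hallucination in large vision-language models (LVLMs). It analyzes how different paths in the Transformer's causal structure contribute to hallucination and proposes effective interventions for both discriminative and generative question-answer formats.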

Details Motivation: LVLMs perform well across many tasks but are prone to hallucination, so its root causes need to be understood and addressed. Method: The authors analyze the image-to-input-text, image-to-output-text, and text-to-text causal pathways, study the role of each pathway under different question-answer alignment formats, and design targeted methods to identify and intervene on critical hallucination attention heads. Result: Experiments on multiple benchmarks show that the method reduces hallucination across different alignment types. Conclusion: Hallucination in LVLMs arises from the interplay of multiple causal pathways and depends on the question-answer format; the proposed pathway-oriented interventions mitigate it effectively. Abstract: Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer's causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.

[126] A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

Bulat Khaertdinov,Mirela Popa,Nava Tintarev

Main category: cs.CV

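TL;DR: Inspired by traditional text search, this paper proposes relevance feedback as a mechanism for vision-language retrieval that improves the retrieval performance of large vision-language models (VLMs) at inference time without fine-tuning.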

Details Motivation: Existing VLMs usually rely on fine-tuning or larger model variants to improve retrieval and lack a way to optimize dynamically at inference time. Inspired by relevance feedback in text retrieval, the authors aim for a model-agnostic, plug-and-play mechanism that improves VLM retrieval without modifying model parameters. Method: Four feedback strategies are proposed: (1) a revised classical pseudo-relevance feedback (PRF) that refines the query embedding using top-ranked results; (2) generative relevance feedback (GRF), which refines the query with synthetic captions; (3) an attentive feedback summarizer (AFS), a Transformer-based model that integrates fine-grained multimodal features; and (4) explicit feedback from ground-truth captions as an upper-bound baseline. Result: On Flickr30k and COCO with different VLM backbones, GRF, AFS, and explicit feedback improve MRR@5 over the no-feedback baseline by 3-5% for smaller VLMs and 1-3% for larger ones. AFS, like explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval. Conclusion: Relevance feedback is a general and effective way to improve visual retrieval across VLMs at inference time and opens a path toward interactive, adaptive visual search. Abstract: Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with different VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs, and 1-3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
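
A minimal sketch of classical pseudo-relevance feedback on embeddings, together with the MRR@5 metric the abstract reports, may help fix ideas. The update below is a Rocchio-style mix of the query with the centroid of the top-ranked items; the weights `alpha` and `beta`, the cutoff `k`, and the function names are illustrative assumptions, and the paper's revised PRF may differ.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]

def refine_query(query, ranked_items, k=2, alpha=1.0, beta=0.5):
    """Move the query embedding toward the centroid of the top-k retrieved items."""
    top = ranked_items[:k]
    dim = len(query)
    centroid = [sum(item[i] for item in top) / len(top) for i in range(dim)]
    return normalize([alpha * q + beta * c for q, c in zip(query, centroid)])

def mrr_at_k(first_relevant_ranks, k=5):
    """Mean reciprocal rank with cutoff k.

    Each entry is the 1-based rank of the first relevant item for one query,
    or None if no relevant item was retrieved.
    """
    scores = [1.0 / r if r is not None and r <= k else 0.0
              for r in first_relevant_ranks]
    return sum(scores) / len(scores)
```

Because the top-ranked items are only assumed to be relevant, repeated application can pull the query away from the user's intent, which is the query drift the abstract says AFS and explicit feedback mitigate.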

[127] Range-Edit: Semantic Mask Guided Outdoor LiDAR Scene Editing

Suchetan G. Uppur,Hemant Kumar,Vaibhav Kumar

Main category: cs.CV

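TL;DR: Proposes a semantic-mask-guided method for editing real LiDAR scans. Range image projection and semantic conditioning enable high-quality, geometrically consistent synthetic LiDAR point clouds, addressing the scarcity of complex edge-case data for autonomous driving.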

Details Motivation: Diverse, complex edge-case point clouds are hard to acquire in the real world, which limits the generalization and robustness of autonomous driving systems; existing virtual simulation methods are time-consuming, expensive, and insufficiently realistic. Method: Point clouds are converted to 2D range images as an intermediate representation, edited semantically with convex hull-based semantic masks, and a diffusion model generates new LiDAR point clouds under the guidance of those masks. Result: Validated on the KITTI-360 dataset, the method generates high-quality, complex, and dynamic scene point clouds while keeping geometric consistency and realism. Conclusion: The method offers a low-cost, scalable way to generate diverse LiDAR data and helps improve the robustness of autonomous driving systems. Abstract: Training autonomous driving and navigation systems requires large and diverse point cloud datasets that capture complex edge case scenarios from various dynamic urban settings. Acquiring such diverse scenarios from real-world point cloud data, especially for critical edge cases, is challenging, which restricts system generalization and robustness. Current methods rely on simulating point cloud data within handcrafted 3D virtual environments, which is time-consuming, computationally expensive, and often fails to fully capture the complexity of real-world scenes. To address some of these issues, this research proposes a novel approach that edits real-world LiDAR scans using semantic mask-based guidance to generate novel synthetic LiDAR point clouds. We incorporate range image projection and semantic mask conditioning to achieve diffusion-based generation. Point clouds are transformed to 2D range view images, which are used as an intermediate representation to enable semantic editing using convex hull-based semantic masks. These masks guide the generation process by providing information on the dimensions, orientations, and locations of objects in the real environment, ensuring geometric consistency and realism. This approach demonstrates high-quality LiDAR point cloud generation, capable of producing complex edge cases and dynamic scenes, as validated on the KITTI-360 dataset. This offers a cost-effective and scalable solution for generating diverse LiDAR data, a step toward improving the robustness of autonomous driving systems.
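
The range image projection used as the intermediate representation can be sketched with the standard spherical projection. The image size and vertical field of view below are typical values for a 64-beam sensor and are assumptions, as are the function name and the rule of keeping the nearest return per pixel; the paper's exact settings are not given in the abstract.

```python
import math

def project_to_range_image(points, height=64, width=1024,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z) LiDAR points to a height x width image of ranges.

    Columns follow the azimuth angle and rows follow the elevation angle.
    Empty pixels stay at 0.0; when several points hit one pixel, the
    nearest range is kept.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue
        yaw = math.atan2(y, x)
        pitch = math.asin(z / rng)
        u = 0.5 * (1.0 - yaw / math.pi) * width
        v = (1.0 - (pitch - fov_down) / fov) * height
        col = min(width - 1, max(0, int(u)))
        row = min(height - 1, max(0, int(v)))
        if image[row][col] == 0.0 or rng < image[row][col]:
            image[row][col] = rng
    return image
```

Once a scan is a 2D image, a semantic mask is simply another image of the same size, which is what lets 2D diffusion models be conditioned on it.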

[128] Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation

Chuancheng Shi,Shangze Li,Shiming Guo,Simiao Xie,Wenhua Wu,Jingtong Dou,Chao Wu,Canran Xiao,Cong Wang,Zifeng Cheng,Fei Shen,Tat-Seng Chua

Main category: cs.CV

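TL;DR: This paper studies cross-cultural consistency in multilingual text-to-image models and improves it by localizing culture-sensitive neurons and applying two strategies: inference-time cultural activation and layer-targeted cultural enhancement.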

Details Motivation: Existing multilingual T2I models often produce culturally neutral or English-biased results under multilingual prompts and lack cross-lingual cultural consistency. Method: A probing method localizes culture-sensitive signals to a small set of neurons, and two complementary alignment strategies follow: inference-time cultural activation (no backbone fine-tuning) and layer-targeted cultural enhancement (updating only culturally relevant layers). Result: Experiments on CultureBench show that the method significantly improves cultural consistency while preserving image fidelity and diversity. Conclusion: Localizing and strengthening culture-related representations improves the cross-cultural consistency of multilingual T2I models without changing the overall model structure. Abstract: Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilized. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without fine-tuning the backbone; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.

[129] MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning

Wenrui Zhang,Xinggang Wang,Bin Feng,Wenyu Liu

Main category: cs.CV

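TL;DR: MolSight is a three-stage learning framework for optical chemical structure recognition (OCSR) that focuses on recognizing stereochemical information more accurately and achieves state-of-the-art performance on multiple datasets.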

Details Motivation: Existing OCSR systems struggle to recognize stereochemical information because the visual cues that distinguish stereoisomers (such as wedge bonds, dash bonds, and spatial arrangements) are subtle. Method: MolSight uses a three-stage training paradigm: pretraining on large-scale noisy data; multi-granularity fine-tuning with auxiliary tasks such as chemical bond classification and atom localization; and post-training with reinforcement learning (the GRPO algorithm), together with a new stereochemical structure dataset. Result: MolSight achieves state-of-the-art chemical and stereochemical structure recognition on multiple benchmark datasets, and GRPO further improves performance even though the model is relatively small. Conclusion: The three-stage training strategy significantly improves recognition of stereochemical structures in OCSR, providing an efficient and accurate solution for chemical informatics, drug discovery, and LLM applications. Abstract: Optical Chemical Structure Recognition (OCSR) plays a pivotal role in modern chemical informatics, enabling the automated conversion of chemical structure images from scientific literature, patents, and educational materials into machine-readable molecular representations. This capability is essential for large-scale chemical data mining, drug discovery pipelines, and Large Language Model (LLM) applications in related domains. However, existing OCSR systems face significant challenges in accurately recognizing stereochemical information due to the subtle visual cues that distinguish stereoisomers, such as wedge and dash bonds, ring conformations, and spatial arrangements. To address these challenges, we propose MolSight, a comprehensive learning framework for OCSR that employs a three-stage training paradigm. In the first stage, we conduct pre-training on large-scale but noisy datasets to endow the model with fundamental perception capabilities for chemical structure images. In the second stage, we perform multi-granularity fine-tuning using datasets with richer supervisory signals, systematically exploring how auxiliary tasks, specifically chemical bond classification and atom localization, contribute to molecular formula recognition. Finally, we employ reinforcement learning for post-training optimization and introduce a novel stereochemical structure dataset. Remarkably, we find that even with MolSight's relatively compact parameter size, the Group Relative Policy Optimization (GRPO) algorithm can further enhance the model's performance on stereomolecular structures. Through extensive experiments across diverse datasets, our results demonstrate that MolSight achieves state-of-the-art performance in (stereo)chemical optical structure recognition.
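
The core of Group Relative Policy Optimization is that each sampled output is scored against the other outputs drawn for the same input, so no learned value function is needed. The sketch below shows only that advantage computation; the reward values in the usage note are hypothetical, and the use of the population standard deviation and the `eps` constant are implementation assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so outputs
    better than their siblings get positive advantages and worse ones get
    negative advantages.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, if four candidate structure strings are decoded for one image and a hypothetical exact-match reward gives `[1, 0, 0, 1]`, the advantages come out close to `[1, -1, -1, 1]`; if all candidates receive the same reward, every advantage is zero and that group contributes no gradient signal.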

[130] BiFingerPose: Bimodal Finger Pose Estimation for Touch Devices

Xiongjun Guan,Zhiyu Pan,Jianjiang Feng,Jie Zhou

Main category: cs.CV

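TL;DR: This paper proposes BiFingerPose, a bimodal finger pose estimation method that uses capacitive images and under-screen fingerprint images to estimate pitch, yaw, and roll angles accurately, significantly outperforming existing methods.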

Details Motivation: Existing finger pose estimation algorithms based on capacitive images lose accuracy for large-angle inputs and cannot estimate the roll angle, which limits human-computer interaction on touchscreen devices. Method: BiFingerPose fuses a capacitive image and a fingerprint image as bimodal input and uses a deep learning model to jointly predict the full set of finger pose parameters. Result: In a 12-person user study, the method improves prediction performance by over 21% compared with SOTA methods, raises task completion efficiency by 2.5 times, improves operation accuracy by 23%, and estimates the roll angle successfully. Conclusion: BiFingerPose significantly improves the accuracy and practicality of finger pose estimation and broadens its potential for authentication security and interactive experiences. Abstract: Finger pose offers promising opportunities to expand the human-computer interaction capability of touchscreen devices. Existing finger pose estimation algorithms that can be implemented in portable devices predominantly rely on capacitive images, which are currently limited to estimating pitch and yaw angles and exhibit reduced accuracy when processing large-angle inputs (especially angles greater than 45 degrees). In this paper, we propose BiFingerPose, a novel bimodal finger pose estimation algorithm capable of simultaneously and accurately predicting comprehensive finger pose information. A bimodal input is explored, including a capacitive image and a fingerprint patch obtained from the touchscreen with an under-screen fingerprint sensor. Our approach leads to reliable estimation of roll angle, which is not achievable using only a single modality. In addition, the prediction performance of other pose parameters has also been greatly improved. The evaluation of a 12-person user study on continuous and discrete interaction tasks further validated the advantages of our approach. Specifically, BiFingerPose outperforms previous SOTA methods with over 21% improvement in prediction performance, 2.5 times higher task completion efficiency, and 23% better user operation accuracy, demonstrating its practical superiority. Finally, we delineate the application space of finger pose with respect to enhancing authentication security and improving interactive experiences, and develop corresponding prototypes to showcase the interaction potential. Our code will be available at https://github.com/XiongjunGuan/DualFingerPose.

[131] SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion

Jiajie Guo,Qingpeng Zhu,Jin Zeng,Xiaolong Wu,Changyong He,Weida Wang

Main category: cs.CV

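TL;DR: This paper proposes SpatialGeo, a new vision encoder that hierarchically fuses geometry and semantic features to strengthen the spatial reasoning of multimodal large language models (MLLMs) in three-dimensional space.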

Details Motivation: Existing MLLMs rely on vision encoders such as CLIP that extract only instance-level semantic features, so their spatial perception is weak and they struggle to understand and infer 3D spatial structure. Method: SpatialGeo fuses geometry features from self-supervised learning with CLIP's semantic features through a hierarchical adapter, trains efficiently from a pretrained LLaVA model, and uses random feature dropping to keep the model from relying too heavily on the CLIP encoder. Result: On SpatialRGPT-Bench, SpatialGeo improves accuracy by at least 8.0% over state-of-the-art models and uses about 50% less memory during inference. Conclusion: SpatialGeo improves the spatial understanding and grounding of MLLMs and offers an efficient, practical route to more spatially aware multimodal systems. Abstract: Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantic features, generating spatial-aware visual embeddings and boosting the spatial grounding capability of MLLMs. Specifically, we first unveil that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder utilized in most existing MLLMs (e.g., CLIP), restricted to instance-level semantic features. This motivates us to complement CLIP with the geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness in the proposed SpatialGeo. The network is efficiently trained using a pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions relying solely on the CLIP encoder. Experimental results show that SpatialGeo improves the accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via https://ricky-plus.github.io/SpatialGeoPages/.

[132] MuM: Multi-View Masked Image Modeling for 3D Vision

David Nordström,Johan Edstedt,Fredrik Kahl,Georg Bökman

Main category: cs.CV

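TL;DR: This paper proposes MuM, an extension of masked autoencoding (MAE) for learning 3D vision features from multiple views. It is simpler and more scalable than CroCo and outperforms DINOv3 and CroCo v2 on several downstream tasks.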

Details Motivation: Existing self-supervised image learning methods mainly target semantic understanding and offer little support for geometric reasoning, while 3D vision tasks need better suited feature representations. Method: MAE is extended to an arbitrary number of views of the same scene, with a uniform masking strategy and a lightweight decoder with inter-frame attention, to learn 3D features. Result: On downstream tasks including feedforward reconstruction, dense image matching, and relative pose estimation, MuM outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2. Conclusion: MuM offers a simpler, more scalable way to learn features for 3D vision, performs strongly across tasks, and advances self-supervised learning toward geometric understanding. Abstract: Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.

[133] NoPe-NeRF++: Local-to-Global Optimization of NeRF with No Pose Prior

Dongbo Shi,Shen Cao,Bojian Wu,Jinhui Guo,Lubin Fan,Renjie Chen,Ligang Liu,Jieping Ye

Main category: cs.CV

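TL;DR: This paper proposes NoPe-NeRF++, a local-to-global optimization algorithm for training Neural Radiance Fields (NeRF) without pose priors, which outperforms existing methods in pose estimation and novel view synthesis.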

Details Motivation: Existing pose-free NeRF methods such as NoPe-NeRF rely only on local relationships within images and struggle to recover accurate camera poses in complex scenes, so a more robust joint local and global optimization strategy is needed. Method: Relative poses are first initialized through explicit feature matching, then a local joint optimization improves pose estimation during NeRF training. A global optimization stage follows, which adds geometric consistency constraints through bundle adjustment over feature trajectories to refine the poses further. Result: On multiple benchmark datasets the method significantly improves pose estimation accuracy and novel view synthesis quality and is more robust in challenging scenes. Conclusion: NoPe-NeRF++ is the first method to combine local and global cues seamlessly for NeRF and achieves state-of-the-art performance without pose priors, supporting the proposed design. Abstract: In this paper, we introduce NoPe-NeRF++, a novel local-to-global optimization algorithm for training Neural Radiance Fields (NeRF) without requiring pose priors. Existing methods, particularly NoPe-NeRF, which focus solely on the local relationships within images, often struggle to recover accurate camera poses in complex scenarios. To overcome these challenges, our approach begins with a relative pose initialization with explicit feature matching, followed by a local joint optimization to enhance the pose estimation for training a more robust NeRF representation. This method significantly improves the quality of initial poses. Additionally, we introduce a global optimization phase that incorporates geometric consistency constraints through bundle adjustment, which integrates feature trajectories to further refine poses and collectively boost the quality of NeRF. Notably, our method is the first work that seamlessly combines the local and global cues with NeRF, and outperforms state-of-the-art methods in both pose estimation accuracy and novel view synthesis. Extensive evaluations on benchmark datasets demonstrate our superior performance and robustness, even in challenging scenes, thus validating our design choices.

[134] Refracting Reality: Generating Images with Realistic Transparent Objects

Yue Yin,Enze Tao,Dylan Campbell

Main category: cs.CV

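TL;DR: This paper proposes synchronizing pixels during generation using Snell's Law of Refraction to produce images of transparent objects with realistic refraction, making the generated images markedly more optically plausible.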

Details Motivation: Existing generative models handle transparent objects (refraction, reflection, absorption, and scattering) poorly and in particular cannot model refraction accurately, including the color constraints it imposes across image regions. Method: At each step of generation, pixels inside and outside the object boundary are warped and merged according to Snell's Law of Refraction; surfaces that are visible only via refraction or reflection are recovered by synchronizing the main image with an object-centered panorama. Result: The method generates images of transparent objects that follow optical physics more closely, look more realistic, and are physically consistent. Conclusion: Physics-guided pixel synchronization improves how generative models render transparent and refractive objects and advances their modeling of complex optical phenomena. Abstract: Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image (a panorama centered at the object) using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.
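
The warping relies on Snell's Law, n1 sin(theta_i) = n2 sin(theta_t). A minimal vector-form sketch of refracting a single ray is given below; the function name and argument conventions are illustrative, and how the paper traces rays through a full object is not covered here.

```python
import math

def refract(incident, normal, n1, n2):
    """Refract a unit direction at an interface using Snell's Law.

    incident -- unit vector pointing along the incoming ray
    normal   -- unit surface normal pointing against the incoming ray
    n1, n2   -- refractive indices of the incoming and outgoing media
    Returns the refracted unit direction, or None on total internal reflection.
    """
    eta = n1 / n2
    cos_i = -sum(i * n for i, n in zip(incident, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # no transmitted ray exists
    cos_t = math.sqrt(1.0 - sin2_t)
    return [eta * i + (eta * cos_i - cos_t) * n for i, n in zip(incident, normal)]
```

A ray entering glass (n2 of about 1.5) from air at 45 degrees bends toward the normal, so the background pixel it lands on differs from the one a straight ray would hit; that displacement is the color constraint the abstract refers to.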

[135] Loomis Painter: Reconstructing the Painting Process

Markus Pobitzer,Chang Liu,Chenyi Zhuang,Teng Long,Bin Ren,Nicu Sebe

Main category: cs.CV

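TL;DR: Proposes a semantics-driven framework for generating painting processes across multiple media; it uses a diffusion model and cross-medium style augmentation to achieve consistent texture evolution and style transfer, and introduces a reverse-painting training strategy to generate smooth painting processes that follow human creative workflows.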

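Details Motivation: Existing video tutorials lack interactivity and personalization, while existing generative models fall short in cross-media generalization and temporal consistency, making it hard to faithfully reproduce human artistic workflows. Method: Builds a unified multi-media painting generation framework that embeds multiple media into the conditional space of a diffusion model, with a semantics-driven style control mechanism and cross-medium style augmentation; proposes a reverse-painting training strategy to improve coherence; builds a large-scale dataset of real painting processes and uses the PDP curve to quantitatively evaluate the creative sequence. Result: Achieves strong results on LPIPS, DINO and CLIP metrics, supporting cross-media consistency, temporal coherence and final-image fidelity; the PDP curve captures the human creative progression from composition to color blocking to detail refinement. Conclusion: The framework generates high-quality, style-controllable multi-media painting processes that follow human painting logic, offering a solution for interactive art learning and for the creative simulation of generative models. Abstract: Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.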

[136] Label-Efficient Skeleton-based Recognition with Stable-Invertible Graph Convolutional Networks

Hichem Sahbi

Main category: cs.CV

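TL;DR: Proposes a label-efficient skeleton-based action recognition method built on graph convolutional networks that learns an optimal acquisition function to select the most informative subsets of samples for labeling.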

Details Motivation: Skeleton-based action recognition depends on large amounts of manually labeled data, which are costly and time-consuming to acquire, so a label-efficient recognition method is needed. Method: Designs a new acquisition function that combines data representativity, diversity and uncertainty to select the most informative subsets for labeling, and uses an invertible GCN to map data from ambient to latent space, where the inherent data distribution is captured more easily. Result: Extensive experiments on two challenging skeleton-based recognition datasets show that the method outperforms existing approaches while using fewer labels. Conclusion: The proposed label-frugal GCN approach reduces annotation requirements while maintaining or even improving recognition performance, providing an efficient learning framework for skeleton-based action recognition. Abstract: Skeleton-based action recognition is a hotspot in image processing. A key challenge of this task lies in its dependence on large, manually labeled datasets whose acquisition is costly and time-consuming. This paper devises a novel, label-efficient method for skeleton-based action recognition using graph convolutional networks (GCNs). The contribution of the proposed method resides in learning a novel acquisition function -- scoring the most informative subsets for labeling -- as the optimum of an objective function mixing data representativity, diversity and uncertainty. We also extend this approach by learning the most informative subsets using an invertible GCN which allows mapping data from ambient to latent spaces where the inherent distribution of the data is more easily captured. Extensive experiments, conducted on two challenging skeleton-based recognition datasets, show that our label-frugal GCNs are effective and outperform related work.

[137] DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

Xiangteng He,Shunsuke Sakai,Kun Yuan,Nicolas Padoy,Tatsuhito Hasegawa,Leonid Sigal

Main category: cs.CV

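TL;DR: DSeq-JEPA is a new self-supervised learning architecture that combines JEPA-style latent prediction with GPT-style sequential reasoning; by predicting discriminative regions in an order derived from a saliency map, it improves visual representation learning.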

Details Motivation: I-JEPA lacks explicit modeling of where and in what order predictions should be made, whereas human vision attends to information selectively and sequentially; discriminative and sequential mechanisms are therefore needed to improve representation quality. Method: DSeq-JEPA first identifies the primary discriminative regions using a transformer-derived saliency map, then predicts subsequent regions in order of discriminative priority, forming a curriculum-like semantic progression from primary to secondary cues. Result: Across image classification, fine-grained recognition, detection and segmentation, and low-level reasoning tasks, DSeq-JEPA outperforms I-JEPA variants, showing stronger discriminative power and generalization. Conclusion: DSeq-JEPA combines the predictive and autoregressive self-supervised learning paradigms; its saliency-guided sequential prediction yields more efficient and more discriminative visual representation learning. Abstract: Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues -- a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.
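
The two-step ordering described above (pick primary regions by saliency, then predict the rest in descending saliency) can be sketched as a schedule of (context, target) pairs. This is only an illustration: the function name `sequential_targets`, the `num_primary` parameter and the per-region saliency list are hypothetical, and the actual model predicts latent embeddings rather than indices.

```python
def sequential_targets(saliency, num_primary):
    """Build a prediction schedule from per-region saliency scores.

    The num_primary most salient regions form the initial context; each
    remaining region becomes a target in descending saliency order, and is
    added to the context once it has been predicted.
    """
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    context = order[:num_primary]
    steps = []
    for target in order[num_primary:]:
        steps.append((tuple(context), target))
        context = context + [target]
    return steps

# Four regions; region 1 is the most salient and seeds the context.
schedule = sequential_targets([0.1, 0.9, 0.5, 0.3], num_primary=1)
```

Each step conditions on everything predicted so far, which is what gives the schedule its autoregressive, primary-to-secondary character.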

[138] UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification

Taixi Chen,Jingyun Chen,Nancy Guo

Main category: cs.CV

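TL;DR: This paper proposes a Unified Attention-Mamba (UAM) backbone for cell-level classification from radiomics features and extends it into a multimodal framework that jointly performs cell classification and image segmentation, achieving leading performance on public benchmarks.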

Details Motivation: Most existing work focuses on slide-level or patch-level tumor classification; there is no backbone dedicated to cell-level radiomics analysis, and conventional hybrid models require manual ratio tuning, which limits their feature encoding capability. Method: Inspired by the success of the Mamba architecture in vision and language, the authors propose the Unified Attention-Mamba (UAM) backbone, which flexibly combines the strengths of Attention and Mamba without manual ratio tuning; they design two UAM variants and build a multimodal framework that supports joint cell-level classification and image segmentation. Result: UAM reaches state-of-the-art performance on both tasks: cell classification accuracy rises from 74% to 78% (n=349,882 cells), and tumor segmentation precision rises from 75% to 80% (n=406 patches). Conclusion: UAM is an efficient, unified and extensible multimodal foundation that substantially improves cell-level radiomics analysis and has broad application potential. Abstract: Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encoding capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% ($n$=349,882 cells), and tumor segmentation precision from 75% to 80% ($n$=406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.

[139] SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation

Seamie Hayes,Reenu Mohandas,Tim Brophy,Alexandre Boulch,Ganesh Sistu,Ciaran Eising

Main category: cs.CV

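TL;DR: This paper proposes SuperQuadricOcc, a superquadric-based self-supervised occupancy estimation method that uses a multi-layer icosphere-tessellated Gaussian approximation to enable training supervision, substantially reducing memory footprint and achieving faster real-time inference while maintaining strong performance.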

Details Motivation: Gaussian representations are widely used in self-supervised occupancy estimation, but their large number of primitives leads to high memory requirements and makes real-time inference difficult; superquadrics can reduce the primitive count and memory use, but the lack of an effective rasterizer makes them hard to use in self-supervised models. Method: SuperQuadricOcc uses superquadrics as the scene representation and designs a multi-layer icosphere-tessellated Gaussian approximation so that Gaussian rasterization can supervise training; a fast superquadric voxelization module supports evaluation. Result: On the Occ3D dataset, compared with previous Gaussian-based methods, SuperQuadricOcc reduces memory footprint by 75%, speeds up inference by 124%, improves mIoU by 5.9% and uses 84% fewer primitives; it is reported as the first occupancy model to combine real-time inference with competitive performance. Conclusion: SuperQuadricOcc is the first real-time occupancy estimation model to exploit the advantages of superquadrics with self-supervised training, offering a new option for efficient scene understanding in automated driving. Abstract: Semantic occupancy estimation enables comprehensive scene understanding for automated driving, providing dense spatial and semantic information essential for perception and planning. While Gaussian representations have been widely adopted in self-supervised occupancy estimation, the deployment of a large number of Gaussian primitives drastically increases memory requirements and is not suitable for real-time inference. In contrast, superquadrics permit reduced primitive count and lower memory requirements due to their diverse shape set. However, implementation into a self-supervised occupancy model is nontrivial due to the absence of a superquadric rasterizer to enable model supervision. Our proposed method, SuperQuadricOcc, employs a superquadric-based scene representation. By leveraging a multi-layer icosphere-tessellated Gaussian approximation of superquadrics, we enable Gaussian rasterization for supervision during training. On the Occ3D dataset, SuperQuadricOcc achieves a 75% reduction in memory footprint, 124% faster inference, and a 5.9% improvement in mIoU compared to previous Gaussian-based methods, without the use of temporal labels. To our knowledge, this is the first occupancy model to enable real-time inference while maintaining competitive performance. The use of superquadrics reduces the number of primitives required for scene modeling by 84% relative to Gaussian-based approaches. Finally, evaluation against prior methods is facilitated by our fast superquadric voxelization module. The code will be released as open source.
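
The "diverse shape set" of superquadrics comes from two shape exponents in their standard inside-outside function. The sketch below shows that textbook function for an axis-aligned, origin-centered superquadric; it is not the paper's voxelization module, and the name `superquadric_f` and the argument layout are assumptions for illustration.

```python
def superquadric_f(p, scale, e1, e2):
    """Standard superquadric inside-outside function.

    p: point (x, y, z); scale: semi-axes (a1, a2, a3); e1, e2: shape exponents.
    Returns F < 1 for points inside, F == 1 on the surface, F > 1 outside.
    """
    x, y, z = (abs(c) / s for c, s in zip(p, scale))
    xy = (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1)
    return xy + z ** (2.0 / e1)

corner = (0.9, 0.9, 0.9)
sphere_like = superquadric_f(corner, (1.0, 1.0, 1.0), 1.0, 1.0)  # e1 = e2 = 1: ellipsoid
box_like = superquadric_f(corner, (1.0, 1.0, 1.0), 0.1, 0.1)     # small exponents: near-box
```

With the same three semi-axes, the near-box shape contains the corner point while the ellipsoid does not, which illustrates why a single superquadric can cover a region that would otherwise need several smoother primitives.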

[140] ATAC: Augmentation-Based Test-Time Adversarial Correction for CLIP

Linxiang Su,András Balogh

Main category: cs.CV

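TL;DR: Proposes ATAC, a test-time adversarial correction method that uses augmentation-induced drift vectors in CLIP's embedding space to improve the robustness of image-text matching, significantly outperforming existing methods.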

Details Motivation: CLIP performs very well in zero-shot image-text matching but is highly sensitive to adversarial perturbations of images, and existing test-time defense strategies offer limited robustness. Method: The method operates directly in CLIP's embedding space: it computes the drift vectors induced by data augmentations, uses the angular consistency of these latent drifts to infer a semantic recovery direction, and corrects the embedding accordingly. Result: ATAC shows very high robustness across a range of benchmarks, exceeding previous state-of-the-art methods by nearly 50% on average with minimal computational overhead; it stays ahead in unconventional and extreme settings and shows some robustness against adaptive attacks. Conclusion: ATAC is an efficient, novel test-time adversarial defense that operates in the embedding space, offering a new paradigm for improving the robustness of CLIP-like models. Abstract: Despite its remarkable success in zero-shot image-text matching, CLIP remains highly vulnerable to adversarial perturbations on images. As adversarial fine-tuning is prohibitively costly, recent works explore various test-time defense strategies; however, these approaches still exhibit limited robustness. In this work, we revisit this problem and propose a simple yet effective strategy: Augmentation-based Test-time Adversarial Correction (ATAC). Our method operates directly in the embedding space of CLIP, calculating augmentation-induced drift vectors to infer a semantic recovery direction and correcting the embedding based on the angular consistency of these latent drifts. Across a wide range of benchmarks, ATAC consistently achieves remarkably high robustness, surpassing that of previous state-of-the-art methods by nearly 50% on average, all while requiring minimal computational overhead. Furthermore, ATAC retains state-of-the-art robustness in unconventional and extreme settings and even achieves nontrivial robustness against adaptive attacks. Our results demonstrate that ATAC is an efficient method in a novel paradigm for test-time adversarial defenses in the embedding space of CLIP.
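
The abstract does not spell out the correction rule, so the following is only one plausible reading of "drift vectors" and "angular consistency", not ATAC itself. It assumes the drift of each augmented view is its embedding minus the original embedding, measures consistency as the length of the mean unit drift, and applies the mean drift only when that consistency exceeds a threshold. The names `correct_embedding`, `tau` and `alpha` are hypothetical.

```python
import math

def _unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def correct_embedding(z, augmented, tau=0.5, alpha=1.0):
    """Shift embedding z along the mean augmentation drift if drifts agree.

    z: embedding of the input image; augmented: embeddings of augmented views.
    tau (consistency threshold) and alpha (step size) are hypothetical.
    Returns (embedding, consistency), where consistency is in [0, 1].
    """
    drifts = [[a - b for a, b in zip(za, z)] for za in augmented]
    units = [_unit(d) for d in drifts]
    dim = len(z)
    mean_unit = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    consistency = math.sqrt(sum(x * x for x in mean_unit))  # 1.0 means all drifts align
    if consistency < tau:
        return list(z), consistency                          # drifts disagree: leave z as is
    mean_drift = [sum(d[i] for d in drifts) / len(drifts) for i in range(dim)]
    return _unit([zi + alpha * di for zi, di in zip(z, mean_drift)]), consistency

corrected, score = correct_embedding([1.0, 0.0], [[0.8, 0.2], [0.9, 0.3]])
```

The intuition this sketch tries to capture is that augmentations of a clean image scatter the embedding in unrelated directions, while augmentations of an adversarial image tend to pull it back in a common direction.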

[141] SVRecon: Sparse Voxel Rasterization for Surface Reconstruction

Seunghun Oh,Jaesung Choe,Dongjae Lee,Daeun Lee,Seunghoon Jeong,Yu-Chiang Frank Wang,Jaesik Park

Main category: cs.CV

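TL;DR: This paper proposes SVRecon, which extends the sparse voxel rasterization paradigm with a signed distance function (SDF) to achieve high-fidelity surface reconstruction. To address structural discontinuity between sparse voxels, it uses a visual geometry model for robust initialization and a spatial smoothness loss that strengthens structural consistency across voxels. Experiments show high-accuracy reconstruction and fast convergence on a range of benchmarks.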

Details Motivation: Sparse voxels are spatially disentangled and have sharp boundaries, but they are prone to local minima during optimization and struggle to preserve the geometric smoothness that an SDF provides; a method that coordinates structure across sparse voxels is needed for high-quality surface reconstruction. Method: SVRecon introduces an SDF into the sparse voxel framework; it uses a visual geometry model for robust geometric initialization and a spatial smoothness loss that enforces structural consistency across parent-child and sibling voxel groups, promoting coherent and smooth overall geometry. Result: Experiments on multiple benchmark datasets show strong reconstruction accuracy with stable, fast convergence. Conclusion: By combining an SDF with sparse voxels, SVRecon addresses structural discontinuity between voxels and achieves high-fidelity, fast-converging surface reconstruction, supporting its effectiveness for 3D reconstruction tasks. Abstract: We extend the recently proposed sparse voxel rasterization paradigm to the task of high-fidelity surface reconstruction by integrating a Signed Distance Function (SDF); we name the resulting method SVRecon. Unlike 3D Gaussians, sparse voxels are spatially disentangled from their neighbors and have sharp boundaries, which makes them prone to local minima during optimization. Although SDF values provide a naturally smooth and continuous geometric field, preserving this smoothness across independently parameterized sparse voxels is nontrivial. To address this challenge, we promote coherent and smooth voxel-wise structure through (1) robust geometric initialization using a visual geometry model and (2) a spatial smoothness loss that enforces coherent relationships across parent-child and sibling voxel groups. Extensive experiments across various benchmarks show that our method achieves strong reconstruction accuracy while having consistently speedy convergence. The code will be made public.

[142] Non-Parametric Probabilistic Robustness: A Conservative Metric with Optimized Perturbation Distributions

Zheng Wang,Yi Zhang,Siddartha Khastgir,Carsten Maple,Xingyu Zhao

Main category: cs.CV

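TL;DR: Proposes non-parametric probabilistic robustness (NPPR) as a more practical robustness metric that learns the perturbation distribution from data to give conservative probabilistic robustness estimates; it is estimated with a GMM-MLP model and validated on multiple datasets and network architectures.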

Details Motivation: Existing probabilistic robustness methods assume that the perturbation distribution is known and fixed, which is unrealistic in practice, so a more practical robustness metric that does not depend on a predefined distribution is needed. Method: Proposes non-parametric probabilistic robustness (NPPR) and builds an estimator from a Gaussian Mixture Model (GMM) with Multilayer Perceptron (MLP) heads and bicubic up-sampling, learning an optimized perturbation distribution from data. Result: Experiments on CIFAR-10, CIFAR-100 and Tiny ImageNet with ResNet18/50, WideResNet50 and VGG16 show that NPPR gives PR estimates up to 40% lower than existing approaches, that is, more conservative estimates. Conclusion: NPPR is a more practical probabilistic robustness metric that enables conservative evaluation under distributional uncertainty, improving on approaches that rely on a fixed perturbation distribution. Abstract: Deep learning (DL) models, despite their remarkable success, remain vulnerable to small input perturbations that can cause erroneous outputs, motivating the recent proposal of probabilistic robustness (PR) as a complementary alternative to adversarial robustness (AR). However, existing PR formulations assume a fixed and known perturbation distribution, an unrealistic expectation in practice. To address this limitation, we propose non-parametric probabilistic robustness (NPPR), a more practical PR metric that does not rely on any predefined perturbation distribution. Following the non-parametric paradigm in statistical modeling, NPPR learns an optimized perturbation distribution directly from data, enabling conservative PR evaluation under distributional uncertainty. We further develop an NPPR estimator based on a Gaussian Mixture Model (GMM) with Multilayer Perceptron (MLP) heads and bicubic up-sampling, covering various input-dependent and input-independent perturbation scenarios. Theoretical analyses establish the relationships among AR, PR, and NPPR. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet across ResNet18/50, WideResNet50 and VGG16 validate NPPR as a more practical robustness metric, showing up to 40% more conservative (lower) PR estimates compared with assuming the common perturbation distributions used in state-of-the-art methods.
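
To make the distinction between PR and a conservative PR concrete, here is a toy Monte Carlo sketch. It assumes PR is the probability that the prediction stays correct under perturbations drawn from a distribution, and it stands in for NPPR's learned distribution by simply taking the minimum over a small hand-picked candidate set. The 1-D classifier, the candidate samplers and all names are hypothetical; the paper instead optimizes a GMM-based distribution.

```python
import random

def classify(x):
    """Toy 1-D classifier: class 1 for positive inputs, class 0 otherwise."""
    return 1 if x > 0.0 else 0

def estimate_pr(x, label, sampler, n=20000, seed=0):
    """Monte Carlo estimate of P[classify(x + delta) == label], delta ~ sampler."""
    rng = random.Random(seed)
    hits = sum(classify(x + sampler(rng)) == label for _ in range(n))
    return hits / n

# Two candidate perturbation distributions over the same range [-1, 1].
candidates = {
    "uniform": lambda rng: rng.uniform(-1.0, 1.0),
    "one_sided": lambda rng: rng.uniform(-1.0, 0.0),
}
pr = {name: estimate_pr(0.5, 1, sampler) for name, sampler in candidates.items()}
conservative_pr = min(pr.values())  # worst case over the candidate set
```

For the input 0.5, the uniform assumption gives a PR near 0.75 while the one-sided distribution gives roughly 0.5, so fixing the "usual" distribution in advance can overstate robustness.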

[143] MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

Runxun Zhang,Yizhou Liu,Li Dongrui,Bo XU,Jingwei Wei

Main category: cs.CV

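TL;DR: Proposes MorphSeek, a deformable image registration method based on spatially continuous optimization in the latent feature space; representation-level policy optimization enables efficient and data-efficient high-dimensional visual alignment.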

Details Motivation: Deformable image registration is difficult because of the high-dimensional deformation space and the lack of voxel-level supervision, and existing reinforcement learning methods use reduced-dimensional representations that struggle to capture spatially variant deformations. Method: Registration is reformulated as a spatially continuous optimization process in the latent feature space; a stochastic Gaussian policy head models a distribution over latent features, and unsupervised warm-up is combined with weakly supervised fine-tuning, using Group Relative Policy Optimization with multi-trajectory sampling to stabilize training. Result: On three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT and Abdomen MR-CT), Dice scores consistently exceed the baselines, with high label efficiency, low parameter cost and low latency overhead. Conclusion: MorphSeek offers a principled, backbone-agnostic and optimizer-agnostic representation-level policy learning paradigm suited to scalable visual alignment in high-dimensional settings. Abstract: Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.
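
Group Relative Policy Optimization scores each sampled trajectory against the other trajectories in its group rather than against a learned value baseline. A minimal sketch of that normalization step follows; using a per-trajectory Dice score as the reward is an assumption made for illustration, as is the function name.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical Dice scores of four trajectories sampled for one image pair.
advantages = group_relative_advantages([0.71, 0.74, 0.78, 0.73])
```

Trajectories that beat the group average receive positive advantages and the rest negative ones, so several samples per input are needed, which matches the multi-trajectory sampling mentioned in the abstract.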

[144] Designing and Generating Diverse, Equitable Face Image Datasets for Face Verification Tasks

Georgia Baltsou,Ioannis Sarridis,Christos Koutlis,Symeon Papadopoulos

Main category: cs.CV

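TL;DR: Proposes a generative-model-based method for synthesizing diverse face images and builds the DIF-V dataset to address racial, gender and other biases in face recognition, revealing shortcomings of existing models in diversity and fairness.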

Details Motivation: Existing face datasets are biased with respect to race, gender and other attributes, which makes face recognition systems unfair; more diverse and inclusive datasets are needed to improve model fairness and reliability. Method: Combines advanced generative models to synthesize high-quality, diverse face images, emphasizing varied facial traits that comply with identity card photograph requirements, and builds the DIF-V dataset (27,780 images of 926 identities). Result: The analysis shows that existing verification models are biased toward certain genders and races and that identity style modifications degrade model performance; DIF-V can serve as a benchmark for evaluating and improving face verification systems. Conclusion: By synthesizing diverse face data, the work mitigates dataset bias, advances diversity and ethics in AI-based face recognition, and lays a foundation for more inclusive and reliable face verification technology. Abstract: Face verification is a significant component of identity authentication in various applications including online banking and secure access to personal devices. The majority of existing face image datasets suffer from notable biases related to race, gender, and other demographic characteristics, limiting the effectiveness and fairness of face verification systems. In response to these challenges, we propose a comprehensive methodology that integrates advanced generative models to create varied and diverse high-quality synthetic face images. This methodology emphasizes the representation of a diverse range of facial traits, ensuring adherence to characteristics permissible in identity card photographs. Furthermore, we introduce the Diverse and Inclusive Faces for Verification (DIF-V) dataset, comprising 27,780 images of 926 unique identities, designed as a benchmark for future research in face verification. Our analysis reveals that existing verification models exhibit biases toward certain genders and races, and notably, applying identity style modifications negatively impacts model performance. By tackling the inherent inequities in existing datasets, this work not only enriches the discussion on diversity and ethics in artificial intelligence but also lays the foundation for developing more inclusive and reliable face verification technologies.

[145] MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment

Huangbiao Xu,Huanqi Wu,Xiao Ke,Junyi Wu,Rui Xu,Jinglin Xu

Main category: cs.CV

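TL;DR: Proposes MCMoE, a new missing-modality completion framework for multimodal action quality assessment that maintains high performance when modalities are missing.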

Details Motivation: Existing multimodal models suffer severe performance degradation, or cannot run at all, when some modalities are missing at inference time. Method: Designs an adaptive gated modality generator to reconstruct missing modalities, uses a mixture of experts to learn unimodal and cross-modal joint representations, and unifies the two kinds of learning in single-stage training. Result: On three public AQA benchmarks, MCMoE achieves state-of-the-art performance in both complete and incomplete multimodal learning settings. Conclusion: MCMoE effectively addresses the missing-modality problem and enables robust multimodal action quality assessment. Abstract: Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.

[146] Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required?

Sukwon Yun,Heming Yao,Burkhard Hoeckendorf,David Richmond,Aviv Regev,Russell Littman

Main category: cs.CV

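TL;DR: This paper proposes MoE-ViT, a Vision Transformer architecture for multi-channel images that treats each channel as an expert and uses a lightweight router to select the most relevant experts for attention, substantially improving the efficiency of cross-channel attention while maintaining or improving performance.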

Details Motivation: Existing multi-channel image methods hit a computational bottleneck when modeling channel interactions: the FLOPs of the attention mechanism grow quadratically, which makes training expensive. The paper targets this overlooked efficiency problem. Method: Inspired by sparse Mixture-of-Experts (MoE), MoE-ViT treats each channel as an expert and introduces a lightweight router that selects the most relevant channels for each image patch before attention, reducing unnecessary cross-channel interactions. Result: Experiments on the real-world datasets JUMP-CP and So2Sat show that MoE-ViT delivers substantial efficiency gains without sacrificing, and in some cases improving, model performance. Conclusion: By sparsifying cross-channel attention, MoE-ViT addresses the computational bottleneck of multi-channel Vision Transformers and is a practical, attractive backbone for multi-channel imaging tasks. Abstract: Vision Transformers (ViTs) have become the backbone of vision foundation models, yet their optimization for multi-channel domains - such as cell painting or satellite imagery - remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel independently during tokenization, this approach naturally introduces a major computational bottleneck in the attention block - channel-wise comparisons lead to a quadratic growth in attention, resulting in excessive FLOPs and high training cost. In this work, we shift focus from efficacy to the overlooked efficiency challenge in cross-channel attention and ask: "Is it necessary to model all channel interactions?". Inspired by the philosophy of Sparse Mixture-of-Experts (MoE), we propose MoE-ViT, a Mixture-of-Experts architecture for multi-channel images in ViTs, which treats each channel as an expert and employs a lightweight router to select only the most relevant experts per patch for attention. Proof-of-concept experiments on real-world datasets - JUMP-CP and So2Sat - demonstrate that MoE-ViT achieves substantial efficiency gains without sacrificing, and in some cases enhancing, performance, making it a practical and attractive backbone for multi-channel imaging.
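
The efficiency argument above can be illustrated with a small sketch of top-k routing over channels. It assumes the router produces one score per channel for each patch and that attention cost scales with the square of the token count (patches times channels kept); both the scoring and the cost model are simplifications for illustration, and the function names are hypothetical.

```python
import math

def route_channels(scores, k):
    """Keep the k highest-scoring channels for one patch.

    Returns the selected channel indices and softmax weights over them.
    """
    top = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
    peak = max(scores[c] for c in top)
    exps = [math.exp(scores[c] - peak) for c in top]   # subtract peak for stability
    total = sum(exps)
    return top, [e / total for e in exps]

def attention_pairs(num_patches, channels_per_patch):
    """Token pairs compared by full self-attention when each (patch, channel) is a token."""
    tokens = num_patches * channels_per_patch
    return tokens * tokens

selected, weights = route_channels([0.1, 2.0, -1.0, 1.5], k=2)
dense = attention_pairs(196, 8)    # all 8 channels of every patch attend
sparse = attention_pairs(196, 2)   # only 2 routed channels per patch attend
```

Under this simplified cost model, keeping 2 of 8 channels per patch cuts the number of compared token pairs by a factor of 16.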

[147] Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers

Christopher Boland,Sotirios Tsaftaris,Sonia Dahdouh

Main category: cs.CV

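TL;DR: Proposes a knowledge distillation framework in which a teacher network fine-tuned on a small amount of task-relevant data mitigates shortcut learning in the student network, outperforming conventional methods on several medical imaging datasets.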

Details Motivation: Deep learning models tend to rely on features of the training data that are correlated with the task but irrelevant to it (shortcut features); in high-risk applications such as medical imaging this can make models less robust and affect clinical decisions. Method: Designs a new knowledge distillation framework in which a teacher network, fine-tuned on a small amount of unbiased task-relevant data, guides a student network trained on biased data; by analyzing feature representations at different depths, the intervention targets intermediate layers to mitigate both diffuse and localized shortcuts. Result: On the CheXpert, ISIC 2017 and SimBA datasets and across several architectures, the method outperforms Empirical Risk Minimization, augmentation-based and group-based bias mitigation, and on out-of-distribution test data approaches the performance of a baseline trained on bias-free data. Conclusion: The method effectively mitigates shortcut learning in medical image analysis and is particularly suited to real-world settings where bias annotations are limited and shortcut features are hard to identify in advance. Abstract: Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.

[148] REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing

Binger Chen,Tacettin Emre Bök,Behnood Rasti,Volker Markl,Begüm Demir

Main category: cs.CV

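TL;DR: This paper introduces RS-FMD, a database of remote sensing foundation models (RSFMs), and builds on it REMSA, the first LLM-based agent for automated model selection, which helps users choose a suitable RSFM from natural language queries.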

Details Motivation: Documentation for remote sensing foundation models is scattered, formats are inconsistent and deployment constraints vary, so users struggle to pick a suitable model; a systematic solution is needed to make model selection more efficient. Method: Builds RS-FMD, a structured database of more than 150 RSFMs, and designs the LLM-based agent REMSA, which ranks candidate models using in-context learning and is evaluated on an expert-verified benchmark. Result: On 75 expert-verified query scenarios, REMSA outperforms several baselines, including naive agents, dense retrieval and unstructured RAG-based LLMs, while operating entirely on publicly available metadata. Conclusion: REMSA provides a transparent and efficient model selection solution for remote sensing and supports the practical adoption of foundation models in remote sensing applications. Abstract: Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.

[149] MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models

Yuqi Li,Junhao Dong,Chuanguang Yang,Shiping Wen,Piotr Koniusz,Tingwen Huang,Yingli Tian,Yew-Soon Ong

Main category: cs.CV

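TL;DR: Proposes MMT-ARD, a multimodal multi-teacher adversarial robust distillation framework whose dual-teacher knowledge fusion architecture and dynamic weight allocation strategy substantially improve the adversarial robustness and training efficiency of vision-language models.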

Details Motivation: Traditional single-teacher adversarial knowledge distillation suffers from limited knowledge diversity, slow convergence and difficulty balancing robustness and accuracy, which limits deployment in safety-critical applications. Method: Designs a dual-teacher knowledge fusion architecture combined with a dynamic weight allocation strategy (based on teacher confidence) and an adaptive sigmoid-based weighting function, jointly optimizing clean feature preservation and robust feature enhancement while mitigating modality bias among teachers. Result: On ImageNet and zero-shot benchmarks, the ViT-B-32 model gains +4.32% robust accuracy and +3.5% zero-shot accuracy, with a 2.3x improvement in training efficiency. Conclusion: MMT-ARD effectively improves the adversarial robustness of multimodal large models and shows good scalability and practical potential. Abstract: Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our codes are available at https://github.com/itsnotacie/MMT-ARD.

[150] Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition

Nissim Maruani,Peiying Zhang,Siddhartha Chaudhuri,Matthew Fisher,Nanxuan Zhao,Vladimir G. Kim,Pierre Alliez,Mathieu Desbrun,Wang Yifan

Main category: cs.CV

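TL;DR: This paper proposes "Illustrator's Depth", a new way to decompose flat images into editable, ordered layers: a per-pixel layer index yields a globally consistent decomposition, and a neural network predicts the layered structure from raster images, supporting a range of downstream applications.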

Details Motivation: Addresses a key challenge in digital content creation: decomposing a flat 2D image into editable layers with a well-defined order, to improve editability and creative flexibility. Method: Inspired by the artist's compositional process, the paper introduces the concept of Illustrator's Depth, which assigns a layer index to every pixel and forms a discrete, globally consistent ordering of elements; a neural network is built and trained on a dataset of layered vector graphics to predict layering directly from raster input. Result: The method significantly outperforms state-of-the-art baselines on image vectorization and enables high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. Conclusion: By reframing depth from a physical quantity to a creative abstraction, Illustrator's Depth provides a new foundation for editable image decomposition and broadens the range of applications in image processing and content creation. Abstract: We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth assigns a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.

[151] Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift

Björn Michele,Alexandre Boulch,Gilles Puy,Tuan-Hung Vu,Renaud Marlet,Nicolas Courty

Main category: cs.CV

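TL;DR: This study explores how vision foundation models (VFMs) can improve the generalization of lidar point cloud semantic segmentation under unsupervised domain adaptation. It proposes keeping the pretrained backbone frozen and training an MLP head, reaching state-of-the-art results across several domain-shift settings.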

Details Motivation: Semantic segmentation networks trained under full supervision generalize poorly to unseen lidar data, so features that are robust across domains are needed to narrow the performance gap. Method: Building on unsupervised image-to-lidar knowledge distillation, the authors systematically study the use of vision foundation models for semantic segmentation of lidar point clouds, analyzing the backbone architecture, whether to fine-tune it, and whether pretrained weights can be reused. Result: The architecture of the lidar backbone is key to generalization; a single backbone can be pretrained once and reused for many domain shifts; freezing the backbone and training only an MLP head works best, reaching state-of-the-art results in four widely used settings. Conclusion: Using vision foundation models with a frozen backbone improves lidar semantic segmentation on unseen domains without repeating pretraining, offering an efficient and reliable option for practical deployment. Abstract: Semantic segmentation networks trained under full supervision for one type of lidar fail to generalize to unseen lidars without intervention. To reduce the performance gap under domain shifts, a recent trend is to leverage vision foundation models (VFMs) providing robust features across domains. In this work, we conduct an exhaustive study to identify recipes for exploiting VFMs in unsupervised domain adaptation for semantic segmentation of lidar point clouds. Building upon unsupervised image-to-lidar knowledge distillation, our study reveals that: (1) the architecture of the lidar backbone is key to maximize the generalization performance on a target domain; (2) it is possible to pretrain a single backbone once and for all, and use it to address many domain shifts; (3) best results are obtained by keeping the pretrained backbone frozen and training an MLP head for semantic segmentation. The resulting pipeline achieves state-of-the-art results in four widely-recognized and challenging settings. The code will be available at: https://github.com/valeoai/muddos.

[152] GPR-OdomNet: Difference and Similarity-Driven Odometry Estimation Network for Ground Penetrating Radar-Based Localization

Huaichao Wang,Xuanxin Fan,Ji Liu,Haifeng Li,Dezhen Song

Main category: cs.CV

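TL;DR: Proposes a neural network-based odometry method for GPR B-scan images that extracts multi-scale features and analyzes their similarities and differences to accurately estimate the Euclidean distance traveled by a robot or vehicle.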

Details Motivation: Existing GPR localization techniques struggle to estimate distances accurately when B-scan images differ only slightly, which limits their performance, particularly under adverse environmental conditions. Method: The authors design a new custom neural network that extracts multi-scale features from GPR B-scan images taken at consecutive moments and uses the similarities and differences between those features to estimate the Euclidean distance between adjacent images. Result: In experiments on the CMU-GPR dataset, the overall weighted RMSE is 0.449 m, a 10.2% reduction relative to the best state-of-the-art method, and the method outperforms existing methods in all tests. Conclusion: The proposed method improves the accuracy of GPR-based distance estimation for robot and vehicle localization in complex environments and shows strong robustness and application potential. Abstract: When performing robot/vehicle localization using ground penetrating radar (GPR) to handle adverse weather and environmental conditions, existing techniques often struggle to accurately estimate distances when processing B-scan images with minor distinctions. This study introduces a new neural network-based odometry method that leverages the similarity and difference features of GPR B-scan images for precise estimation of the Euclidean distances traveled between the B-scan images. The new custom neural network extracts multi-scale features from B-scan images taken at consecutive moments and then determines the Euclidean distance traveled by analyzing the similarities and differences between these features. To evaluate our method, an ablation study and comparison experiments have been conducted using the publicly available CMU-GPR dataset. The experimental results show that our method consistently outperforms state-of-the-art counterparts in all tests. Specifically, our method achieves an overall weighted root mean square error (RMSE) of 0.449 m across all data sets, which is a 10.2% reduction in RMSE when compared to the best state-of-the-art method.
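
The abstract reports an overall weighted RMSE and a relative reduction but does not say how the weighting is done. The Python sketch below shows one plausible reading, pooling squared errors by sample count, together with the relative-reduction arithmetic. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple


def weighted_rmse(per_set: Sequence[Tuple[float, int]]) -> float:
    """Combine per-data-set RMSE values into one overall RMSE.

    Each item is (rmse, n_samples). Assumption: data sets are weighted by
    sample count, which is equivalent to pooling all squared errors.
    """
    total = sum(n for _, n in per_set)
    if total <= 0:
        raise ValueError("at least one sample is required")
    pooled_mse = sum(n * rmse * rmse for rmse, n in per_set) / total
    return math.sqrt(pooled_mse)


def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline


# Toy values only, chosen so the arithmetic is easy to follow.
overall = weighted_rmse([(3.0, 1), (4.0, 1)])   # sqrt((9 + 16) / 2)
reduction = relative_reduction(2.0, 1.5)        # 0.25, i.e. a 25% reduction
```

Weighting by sample count keeps long sequences from being under-represented. A plain average of per-set RMSE values would give a different number, so the two conventions should not be mixed when comparing methods.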

[153] Counterfactual World Models via Digital Twin-conditioned Video Diffusion

Yiqing Shen,Aiza Maksutova,Chenjia Li,Mathias Unberath

Main category: cs.CV

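TL;DR: This paper proposes CWMDT, a framework that turns video diffusion models into counterfactual world models able to answer counterfactual queries, using digital twins and large language models to predict how scene interventions unfold over time.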

Details Motivation: Existing world models rely on forward simulation in pixel space, which makes it hard to support counterfactual reasoning about interventions on specific scene properties, a capability that many physical AI applications need. Method: CWMDT builds a digital twin of the scene as structured text, uses a large language model to reason about how an intervention propagates through space and time, and conditions a video diffusion model on the modified representation to generate counterfactual video sequences. Result: The approach achieves state-of-the-art performance on two benchmarks, supporting the value of structured representations as control signals. Conclusion: Combining structured scene representations with large language models effectively supports counterfactual world modeling and gives forward simulation-based world models stronger controllability and reasoning ability. Abstract: World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as "what would happen if this object was removed?", is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.
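
As a rough illustration of what a structured-text digital twin and an intervention on it might look like, here is a small Python sketch. The schema (an `objects` map plus `relations` triples), the helper names `remove_object` and `to_text`, and the rule-based edit are all hypothetical. In CWMDT the effect of an intervention is predicted by a large language model and the result conditions a video diffusion model; neither step is modeled here.

```python
from typing import Dict, List, Tuple

Relation = Tuple[str, str, str]  # (subject, predicate, object)


def remove_object(twin: Dict, name: str) -> Dict:
    """Return a copy of the twin without `name` and without relations that mention it."""
    objects = {k: dict(v) for k, v in twin["objects"].items() if k != name}
    relations: List[Relation] = [
        r for r in twin["relations"] if name not in (r[0], r[2])
    ]
    return {"objects": objects, "relations": relations}


def to_text(twin: Dict) -> str:
    """Serialize the twin as plain structured text, one fact per line."""
    lines = [
        f"object {key}: " + ", ".join(f"{a}={v}" for a, v in sorted(attrs.items()))
        for key, attrs in sorted(twin["objects"].items())
    ]
    lines += [f"relation: {s} {p} {o}" for s, p, o in twin["relations"]]
    return "\n".join(lines)


# Toy scene: a red cup on a brown table.
scene = {
    "objects": {"cup": {"color": "red"}, "table": {"color": "brown"}},
    "relations": [("cup", "on", "table")],
}
edited = remove_object(scene, "cup")  # "what would happen if this object was removed?"
edited_text = to_text(edited)
```

The point of the sketch is that an explicit representation lets a single object be edited without touching the rest of the scene, which entangled pixel-space representations do not allow.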

[154] Radar2Shape: 3D Shape Reconstruction from High-Frequency Radar using Multiresolution Signed Distance Functions

Neel Sortur,Justin Goodwin,Purvik Patel,Luis Enrique Martinez,Tzofi Klinghoffer,Rajmonda S. Caceres,Robin Walters

Main category: cs.CV

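TL;DR: This paper proposes Radar2Shape, a denoising diffusion model for 3D reconstruction from high-frequency radar signals. It recovers arbitrary 3D shapes from partially observable radar signals and generalizes well across several simulated and real-world data sets.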

Details Motivation: Earlier deep learning methods for 3D reconstruction from high-frequency radar signals have trouble representing arbitrary shapes and are limited by real radar signals collected over restricted viewing angles, while optical 3D reconstruction methods do not transfer directly to radar signals. A method is needed that combines the frequency-domain characteristics of radar with multiresolution shape features. Method: Radar2Shape takes a two-stage approach: it first learns a regularized latent space with hierarchical resolutions of shape features, then performs diffusion in that latent space, conditioning on the radar signal frequencies in a coarse-to-fine manner. Result: Experiments show that Radar2Shape reconstructs arbitrary 3D shapes from partially observed radar signals and generalizes robustly to two different simulation methods and to real-world data. Two synthetic benchmark datasets are also released. Conclusion: Radar2Shape addresses 3D shape reconstruction from partially observed high-frequency radar signals, offers a viable path toward safe deployment in real-world radar systems, and encourages further research in this area. Abstract: Determining the shape of 3D objects from high-frequency radar signals is analytically complex but critical for commercial and aerospace applications. Previous deep learning methods have been applied to radar modeling; however, they often fail to represent arbitrary shapes or have difficulty with real-world radar signals which are collected over limited viewing angles. Existing methods in optical 3D reconstruction can generate arbitrary shapes from limited camera views, but struggle when they naively treat the radar signal as a camera view. In this work, we present Radar2Shape, a denoising diffusion model that handles a partially observable radar signal for 3D reconstruction by correlating its frequencies with multiresolution shape features. Our method consists of a two-stage approach: first, Radar2Shape learns a regularized latent space with hierarchical resolutions of shape features, and second, it diffuses into this latent space by conditioning on the frequencies of the radar signal in an analogous coarse-to-fine manner. We demonstrate that Radar2Shape can successfully reconstruct arbitrary 3D shapes even from partially-observed radar signals, and we show robust generalization to two different simulation methods and real-world data. Additionally, we release two synthetic benchmark datasets to encourage future research in the high-frequency radar domain so that models like Radar2Shape can safely be adapted into real-world radar systems.

[155] An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI

Roozbeh Bazargani,Saqib Abdullah Basar,Daniel Daly-Grafstein,Rodrigo Solis Pompa,Soojin Lee,Saurabh Garg,Yuntong Ma,John A. Carrino,Siavash Khallaghi,Sam Hashemi

Main category: cs.CV

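TL;DR: Proposes a deep learning-based computer vision method that estimates spine age from MRI images and uses the spine age gap (SAG) to assess spine health and its association with degenerative conditions and lifestyle factors.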

Details Motivation: Spine degeneration increases with age, and identifying it early helps maintain spine health; existing methods lack a quantitative measure of overall spine health. Method: A deep learning model is trained on more than 18,000 MRI series. UMAP and HDBSCAN clustering are used to identify age-related degeneration patterns, the best model is selected through an ablation study, and the spine age gap (SAG) is computed. Result: SAG is significantly associated with conditions such as disc bulges, osteophytes, spinal stenosis, and fractures, as well as lifestyle factors such as smoking and physically demanding work. Conclusion: SAG may serve as a useful biomarker of overall spine health and has potential for clinical use. Abstract: The human spine is a complex structure composed of 33 vertebrae. It holds the body and is important for leading a healthy life. The spine is vulnerable to age-related degenerations that can be identified through magnetic resonance imaging (MRI). In this paper we propose a novel computer-vision-based deep learning method to estimate spine age using images from over 18,000 MRI series. Data are restricted to subjects with only age-related spine degeneration. Eligibility criteria are created by identifying common age-based clusters of degenerative spine conditions using uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Model selection is determined using a detailed ablation study on data size, loss, and the effect of different spine regions. We evaluate the clinical utility of our model by calculating the difference between actual spine age and model-predicted age, the spine age gap (SAG), and examining the association between these differences and spine degenerative conditions and lifestyle factors. We find that SAG is associated with conditions including disc bulges, disc osteophytes, spinal stenosis, and fractures, as well as lifestyle factors like smoking and physically demanding work, and thus may be a useful biomarker for measuring overall spine health.
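
The spine age gap is a simple difference, and the association analysis compares it across groups. The Python sketch below illustrates that arithmetic on toy data. The sign convention (predicted minus chronological age), the function names, and the records are assumptions; the abstract does not specify the sign or the statistical test used.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def spine_age_gap(predicted_age: float, chronological_age: float) -> float:
    """SAG under the assumed convention: positive means the spine looks older."""
    return predicted_age - chronological_age


def mean_gap_by_group(records: Iterable[Tuple[float, float, bool]]) -> Dict[bool, float]:
    """Average SAG for subjects with (True) and without (False) a condition.

    Each record is (predicted_age, chronological_age, has_condition).
    Groups with no subjects are omitted from the result.
    """
    gaps: Dict[bool, List[float]] = {}
    for predicted, chronological, has_condition in records:
        gaps.setdefault(has_condition, []).append(
            spine_age_gap(predicted, chronological)
        )
    return {group: mean(values) for group, values in gaps.items()}


# Toy records only; these are not data from the paper.
toy_records = [
    (60.0, 50.0, True),
    (52.0, 50.0, True),
    (41.0, 40.0, False),
    (39.0, 40.0, False),
]
group_means = mean_gap_by_group(toy_records)
```

A difference in group means like this is only descriptive. Establishing an association of the kind reported would require a proper statistical test and adjustment for confounders.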

[156] Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

Mark Endo,Serena Yeung-Levy

Main category: cs.CV

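TL;DR: This paper studies how reducing large language model (LLM) capacity in multimodal models affects visual capabilities and finds that perception is hit especially hard. It proposes Extract+Think, which combines visual extraction tuning with step-by-step reasoning to improve the efficiency and performance of small models.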

Details Motivation: Scaling up multimodal models has driven progress, but practical applications increasingly call for small, efficient systems. It remains unclear how reduced LLM capacity affects multimodal capabilities, in particular how visual perception and reasoning degrade. Method: The authors systematically analyze the effect of LLM downscaling on multimodal capabilities, separating its impact on visual perception from its impact on reasoning. They propose visual extraction tuning to improve the extraction of instruction-relevant visual details and combine it with step-by-step reasoning to generate answers, forming the Extract+Think approach. Result: LLM downscaling harms visual perception as much as, or more than, it harms reasoning. With Extract+Think, small models improve markedly on multiple tasks, achieving a better balance of efficiency and performance. Conclusion: Visual perception in multimodal models is easily weakened when the LLM is downscaled and needs dedicated optimization; Extract+Think offers an effective route to small multimodal models that are both efficient and capable. Abstract: Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.

[157] Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Yolo Yunlong Tang,Daiki Shimada,Hang Hua,Chao Huang,Jing Bi,Rogerio Feris,Chenliang Xu

Main category: cs.CV

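TL;DR: This paper proposes Video-R4, a model that strengthens text-rich video understanding through visual rumination: it iteratively selects frames, zooms into key regions, re-encodes pixels, and updates its reasoning state, achieving state-of-the-art performance on multiple datasets.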

Details Motivation: Existing video QA models usually rely on single-pass perception over fixed frames, so they miss small, transient textual cues and are prone to hallucinations and missing evidence. Inspired by how humans pause, zoom in, and re-read, a finer-grained visual reasoning mechanism is needed. Method: Video-R4 introduces a visual rumination mechanism that supports iterative frame selection, region zooming, pixel re-encoding, and reasoning-state updates. The authors construct two datasets with executable rumination trajectories, Video-R4-CoT-17k and Video-R4-RL-30k, and use a multi-stage rumination learning framework that progressively fine-tunes a 7B LMM with supervised fine-tuning (SFT) and GRPO-based reinforcement learning (RL). Result: Video-R4-7B achieves state-of-the-art performance on M4-ViteVQA and generalizes to multi-page document QA, slides QA, and generic video QA, supporting the effectiveness of iterative rumination for pixel-grounded multimodal reasoning. Conclusion: Iterative visual rumination is an effective paradigm for multimodal reasoning and significantly improves a model's understanding of fine-grained information in text-rich videos. Abstract: Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.

[158] EvDiff: High Quality Video with an Event Camera

Weilun Li,Lei Sun,Ruixi Gao,Qi Jiang,Yuqin Ma,Kaiwei Wang,Ming-Hsuan Yang,Luc Van Gool,Danda Pani Paudel

Main category: cs.CV

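TL;DR: This paper proposes EvDiff, an event-based diffusion model that reconstructs high-quality video from event streams. Using a surrogate training framework and a single-step diffusion mechanism, it generates high-fidelity, realistic video without paired data.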

Details Motivation: Because the brightness changes recorded by event cameras leave absolute brightness ambiguous, reconstructing intensity images from event streams is a highly ill-posed task. Traditional end-to-end regression methods produce results with poor perceptual quality and are hard to scale in model capacity and training data. Method: EvDiff uses an event-based diffusion model with a single forward diffusion step and a temporally consistent EvEncoder to reduce computational cost. A surrogate training framework removes the dependence on paired event-image datasets so the model can be trained on large-scale image data. Result: Experiments show that EvDiff generates high-quality color video from monochromatic event streams alone and outperforms existing methods on real-world datasets, balancing pixel-level accuracy and perceptual quality. Conclusion: With its surrogate training framework and efficient diffusion design, EvDiff reconstructs high-quality video from event streams, balances realism and fidelity, and shows good scalability and application potential. Abstract: As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, we design an event-based diffusion model that performs only a single forward diffusion step, equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets for higher capacity. The proposed EvDiff is capable of generating high-quality colorful videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.

[159] Native 3D Editing with Full Attention

Weiwei Cai,Shuangkang Fang,Weicai Ye,Xin Dong,Yunhan Yang,Xuanyang Zhang,Wei Cheng,Yanpei Cao,Gang Yu,Tao Chen

Main category: cs.CV

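TL;DR: Proposes an efficient instruction-guided 3D editing framework that operates on native 3D representations. With a large-scale multimodal dataset and a 3D token concatenation strategy, it surpasses existing methods in generation quality, 3D consistency, and instruction fidelity.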

Details Motivation: Existing methods are either slow (optimization-based) or suffer from inconsistent geometry and poor visual quality (feed-forward methods built on 2D editing), which limits access to 3D content creation. Method: The authors build a large-scale multimodal dataset covering addition, deletion, and modification tasks, propose a feed-forward editing framework that operates directly on 3D representations, and compare two conditioning strategies: cross-attention and a novel 3D token concatenation. Result: 3D token concatenation is more parameter-efficient than cross-attention and performs better; the new method outperforms existing 2D-lifting approaches in generation quality, 3D consistency, and instruction following. Conclusion: With an efficient single feed-forward pass and a novel token concatenation strategy, the proposed native 3D editing framework sets a new benchmark for instruction-guided 3D editing. Abstract: Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.