cs.CL

[1] Synthetic bootstrapped pretraining

Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang

Main category: cs.CL

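TL;DR: Proposes Synthetic Bootstrapped Pretraining (SBP), a language model pretraining method that models relations between documents and synthesizes a new corpus for joint training, improving model performance.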

Details

Motivation: Standard pretraining only captures token correlations within a single document and cannot efficiently exploit the rich relations between documents. SBP aims to model these learnable inter-document relations more efficiently to improve language model performance.

Method: First learn a model of relations between documents from the pretraining dataset, then use it to synthesize a large new corpus, and train the language model jointly on the synthetic and original data.

Result: In experiments with a 3B-parameter model and up to 1T tokens, SBP consistently outperforms a strong repetition baseline and recovers a significant fraction of the improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis shows that the synthesized documents are not mere paraphrases: they abstract a core concept and build a new narration on top of it.

Conclusion: Besides improving empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns the latent concepts shared between related documents.

Abstract: We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While the standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretrain a 3B-parameter model on up to 1T tokens from scratch. We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases -- SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.

[2] Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha

Tandin Wangchuk, Tad Gonsalves

Main category: cs.CL

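TL;DR: Evaluates three common tokenization algorithms (BPE, WordPiece, and SentencePiece) for the low-resource language Dzongkha. SentencePiece performs best across the reported metrics, laying groundwork for building Dzongkha large language models.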

Details

Motivation: Dzongkha is a low-resource, linguistically complex language that is under-studied in NLP, particularly in tokenization. Existing tokenizers are mostly designed for high-resource languages and handle Dzongkha poorly, so a tokenization approach tailored to the language is needed.

Method: Compare three mainstream tokenization algorithms, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece (Unigram), on Dzongkha text using Subword Fertility, Proportion of Continued Words, Normalized Sequence Length, and execution time.

Result: All three algorithms show potential, but SentencePiece outperforms BPE and WordPiece overall, particularly in sequence compression efficiency and in preserving linguistic structure, making it the most suitable method for Dzongkha among those tested.

Conclusion: Low-resource languages call for tailored tokenization strategies. SentencePiece is currently the best fit for Dzongkha, and the study provides a basis and a technical path for building Dzongkha large language models.

Abstract: Large Language Models (LLMs) are gaining popularity and improving rapidly. Tokenizers are crucial components of natural language processing, especially for LLMs. Tokenizers break down input text into tokens that models can easily process while ensuring the text is accurately represented, capturing its meaning and structure. Effective tokenizers enhance the capabilities of LLMs by improving a model's understanding of context and semantics, ultimately leading to better performance in various downstream tasks, such as translation, classification, sentiment analysis, and text generation. Most pre-trained tokenizers are suitable for high-resource languages like English but perform poorly for low-resource languages. Dzongkha, Bhutan's national language spoken by around seven hundred thousand people, is a low-resource language, and its linguistic complexity poses unique NLP challenges. Despite some progress, significant research in Dzongkha NLP is lacking, particularly in tokenization. This study evaluates the training and performance of three common tokenization algorithms in comparison to other popular methods. Specifically, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece (Unigram) were evaluated for their suitability for Dzongkha. Performance was assessed using metrics like Subword Fertility, Proportion of Continued Words, Normalized Sequence Length, and execution time. The results show that while all three algorithms demonstrate potential, SentencePiece is the most effective for Dzongkha tokenization, paving the way for further NLP advancements. This underscores the need for tailored approaches for low-resource languages and ongoing research. In this study, we presented three tokenization algorithms for Dzongkha, paving the way for building Dzongkha Large Language Models.
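Illustrative sketch (not from the paper): two of the metrics named above can be computed as follows, assuming the common definitions in which Subword Fertility is the average number of subword tokens per word and Proportion of Continued Words is the share of words split into two or more tokens. The function names, the `toy_tokenize` stand-in, and the assumption that the text is already segmented into words are all hypothetical; the paper's exact setup is not given in the abstract.

```python
from typing import Callable, List, Tuple


def fertility_and_continued(
    words: List[str], tokenize: Callable[[str], List[str]]
) -> Tuple[float, float]:
    """Return (subword fertility, proportion of continued words).

    fertility  = mean number of subword tokens per word
    continued  = fraction of words split into 2 or more tokens
    """
    if not words:
        raise ValueError("need at least one word")
    pieces_per_word = [len(tokenize(w)) for w in words]
    fertility = sum(pieces_per_word) / len(words)
    continued = sum(1 for n in pieces_per_word if n > 1) / len(words)
    return fertility, continued


def toy_tokenize(word: str, size: int = 3) -> List[str]:
    # Hypothetical stand-in for a trained tokenizer: fixed-width chunks.
    return [word[i:i + size] for i in range(0, len(word), size)]


if __name__ == "__main__":
    fert, cont = fertility_and_continued(["ab", "abcd", "abcdefg"], toy_tokenize)
    print(f"fertility={fert:.2f} continued={cont:.2f}")
```

Lower values of both metrics mean the tokenizer keeps more words intact. In practice `tokenize` would wrap a trained BPE, WordPiece, or Unigram model.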

[3] Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee

Main category: cs.CL

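TL;DR: Introduces SGToxicGuard, a new dataset and evaluation framework for assessing LLM safety in Singapore's multilingual setting, covering Singlish, Chinese, Malay, and Tamil. Red-teaming across three real-world scenarios systematically probes LLM vulnerabilities. Experiments reveal critical gaps in the safety guardrails of current multilingual LLMs and yield actionable insights for safer, more inclusive AI in linguistically diverse environments.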

Details

Motivation: The safety mechanisms of LLMs remain under-explored in low-resource, multilingual settings. Multicultural regions such as Singapore, with distinctive language mixing (e.g., Singlish), need dedicated safety evaluation.

Method: Propose the SGToxicGuard dataset and evaluation framework, which uses red-teaming to systematically test the safety vulnerabilities of multilingual LLMs in three real-world scenarios (conversation, question-answering, and content composition) across Singapore's main languages.

Result: Extensive experiments with state-of-the-art multilingual LLMs reveal notable deficiencies in toxicity recognition and cultural sensitivity, with worse results in non-English languages.

Conclusion: SGToxicGuard fills a gap in LLM safety evaluation for multilingual, low-resource settings, offers practical paths toward better model safety and cultural adaptation, and lays a foundation for safer, more inclusive AI systems.

Abstract: The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: conversation, question-answering, and content composition. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments. Link to the dataset: https://github.com/Social-AI-Studio/SGToxicGuard. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.

[4] PolBiX: Detecting LLMs' Political Bias in Fact-Checking through X-phemisms

Charlott Jakob, David Harbecke, Patrick Parschan, Pia Wenzel Neves, Vera Schmitt

Main category: cs.CL

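TL;DR: Investigates political bias of large language models (LLMs) in fact-checking by building factually equivalent minimal pairs from German claims, using euphemisms or dysphemisms with different political connotations. Judgmental wording influences truthfulness assessments more than political leaning does, and calling for objectivity in the prompt does not effectively mitigate the bias.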

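Details

Motivation: As LLMs are increasingly used in applications that require objective assessment, political bias could compromise the fairness of those assessments, so the bias and its downstream effects need systematic study.

Method: Replace words in German claims with euphemisms or dysphemisms that carry different political connotations, producing minimal pairs with identical factual content but different connotation, and test how consistently six LLMs judge the truthfulness of these claims.

Result: The presence of judgmental words influences truthfulness assessment more strongly than political leaning. A few models show tendencies of political bias, and explicitly asking for objectivity in the prompt does not mitigate it.

Conclusion: In tasks such as fact-checking, LLMs can give inconsistent judgments under the influence of judgmental language. Current prompt engineering does not remove this bias, and further improvements to model design are needed.

Abstract: Large Language Models are increasingly used in applications requiring objective assessment, which could be compromised by political bias. Many studies found preferences for left-leaning positions in LLMs, but downstream effects on tasks like fact-checking remain underexplored. In this study, we systematically investigate political bias through exchanging words with euphemisms or dysphemisms in German claims. We construct minimal pairs of factually equivalent claims that differ in political connotation, to assess the consistency of LLMs in classifying them as true or false. We evaluate six LLMs and find that, more than political leaning, the presence of judgmental words significantly influences truthfulness assessment. While a few models show tendencies of political bias, this is not mitigated by explicitly calling for objectivism in prompts.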

[5] Quantifying Self-Awareness of Knowledge in Large Language Models

Yeongbin Seo, Dongha Lee, Jinyoung Yeo

Main category: cs.CL

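TL;DR: Proposes SCAO, a method that reduces reliance on question-side shortcuts in hallucination prediction for large language models and strengthens the use of model-side signals, fostering genuine self-awareness.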

Details

Motivation: Reported hallucination-prediction performance may stem from superficial question-side patterns rather than the model's own introspection, so the two factors need to be disentangled.

Method: Propose the Approximate Question-side Effect (AQE) to quantify the contribution of question-awareness, and introduce SCAO (Semantic Compression by Answering in One word) to enhance the use of model-side signals.

Result: Across multiple datasets, SCAO achieves strong and consistent performance even when question-side cues are reduced.

Conclusion: SCAO improves LLM self-awareness when obvious question cues are absent, advancing more faithful and reliable hallucination prediction.

Abstract: Hallucination prediction in large language models (LLMs) is often interpreted as a sign of self-awareness. However, we argue that such performance can arise from question-side shortcuts rather than true model-side introspection. To disentangle these factors, we propose the Approximate Question-side Effect (AQE), which quantifies the contribution of question-awareness. Our analysis across multiple datasets reveals that much of the reported success stems from exploiting superficial patterns in questions. We further introduce SCAO (Semantic Compression by Answering in One word), a method that enhances the use of model-side signals. Experiments show that SCAO achieves strong and consistent performance, particularly in settings with reduced question-side cues, highlighting its effectiveness in fostering genuine self-awareness in LLMs.

[6] Real, Fake, or Manipulated? Detecting Machine-Influenced Text

Yitong Wang, Zhongping Zhang, Margherita Piana, Zheng Zhou, Peter Gerstoft, Bryan A. Plummer

Main category: cs.CL

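TL;DR: Proposes HERO, a hierarchical, length-robust detector of machine-influenced text that separates human-written, machine-generated, machine-polished, and machine-translated text. Subcategory Guidance improves fine-grained classification, and HERO outperforms the state of the art across multiple LLMs and domains.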

Details

Motivation: Existing machine-generated text detection mostly separates human from machine writing and ignores the fine-grained intent behind LLM use (such as polishing or translation), which makes it hard to address risks like misinformation.

Method: Propose HERO, which combines length-specialist models with a Subcategory Guidance mechanism and uses a hierarchical strategy to classify texts of varying lengths into four types: human-written, machine-generated, machine-polished, and machine-translated.

Result: Experiments across five LLMs and six domains show that HERO outperforms the state of the art by 2.5-3 mAP on average.

Conclusion: HERO effectively distinguishes several types of machine-influenced text, is robust to text length, and supports fine-grained classification, which helps in understanding the intent behind LLM use and guarding against misuse.

Abstract: Large Language Models (LLMs) can be used to write or modify documents, presenting a challenge for understanding the intent behind their use. For example, benign uses may involve using an LLM on a human-written document to improve its grammar or to translate it into another language. However, a document entirely produced by an LLM may be more likely to be used to spread misinformation than simple translation (e.g., from use by malicious actors or simply by hallucinating). Prior works in Machine Generated Text (MGT) detection mostly focus on simply identifying whether a document was human or machine written, ignoring these fine-grained uses. In this paper, we introduce a HiErarchical, length-RObust machine-influenced text detector (HERO), which learns to separate text samples of varying lengths from four primary types: human-written, machine-generated, machine-polished, and machine-translated. HERO accomplishes this by combining predictions from length-specialist models that have been trained with Subcategory Guidance. Specifically, for categories that are easily confused (e.g., different source languages), our Subcategory Guidance module encourages separation of the fine-grained categories, boosting performance. Extensive experiments across five LLMs and six domains demonstrate the benefits of our HERO, outperforming the state-of-the-art by 2.5-3 mAP on average.

[7] Beyond Spurious Signals: Debiasing Multimodal Large Language Models via Counterfactual Inference and Adaptive Expert Routing

Zichen Wu, Hsiu-Yuan Huang, Yunfang Wu

Main category: cs.CL

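TL;DR: Proposes a causal-mediation-based debiasing framework that uses counterfactual examples to separate core semantics from spurious context and combines this with a Mixture-of-Experts (MoE) architecture for dynamic debiasing in multimodal large language models.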

Details

Motivation: MLLMs often rely on spurious correlations in multimodal reasoning, which undermines robustness and generalization.

Method: Use counterfactual examples to separate core semantics from spurious textual and visual contexts for training-stage debiasing, and employ a Mixture-of-Experts architecture with dynamic routing that selectively activates modality-specific debiasing experts.

Result: On multimodal sarcasm detection and sentiment analysis, the framework significantly outperforms unimodal debiasing strategies and existing state-of-the-art models.

Conclusion: The causal-mediation-based debiasing framework mitigates superficial correlation bias in MLLMs and improves robustness and performance on complex multimodal tasks.

Abstract: Multimodal Large Language Models (MLLMs) have shown substantial capabilities in integrating visual and textual information, yet frequently rely on spurious correlations, undermining their robustness and generalization in complex multimodal reasoning tasks. This paper addresses the critical challenge of superficial correlation bias in MLLMs through a novel causal mediation-based debiasing framework. Specifically, we distinguish core semantics from spurious textual and visual contexts via counterfactual examples to activate training-stage debiasing and employ a Mixture-of-Experts (MoE) architecture with dynamic routing to selectively engage modality-specific debiasing experts. Empirical evaluation on multimodal sarcasm detection and sentiment analysis tasks demonstrates that our framework significantly surpasses unimodal debiasing strategies and existing state-of-the-art models.

[8] Speech Language Models for Under-Represented Languages: Insights from Wolof

Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina

Main category: cs.CL

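TL;DR: Describes training a speech language model for Wolof, an underrepresented language spoken in West Africa. The paper stresses the importance of collecting large-scale, spontaneous, high-quality speech data and shows that continued pretraining of HuBERT on this data gives superior automatic speech recognition (ASR) performance. The speech encoder is then integrated into a Wolof LLM to build the first Speech LLM for the language, extending it to tasks such as speech translation. The authors also explore training the model to perform multi-step Chain-of-Thought before transcribing or translating. The resulting Speech LLM performs well on both speech recognition and translation.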

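Details

Motivation: Wolof is a low-resource African language that lacks high-quality speech data and advanced speech models, which holds back tasks such as speech recognition and translation. The paper aims to close this gap by building a high-quality speech dataset and combining it with modern model architectures, advancing technology for low-resource languages.

Method: Collect large-scale, spontaneous, high-quality Wolof speech data; continue pretraining HuBERT on it and compare ASR performance against the base model and African-centric models; integrate the resulting speech encoder into a Wolof LLM to build a Speech LLM; add multi-step Chain-of-Thought training to strengthen reasoning; and evaluate on speech recognition and translation.

Result: The continually pretrained HuBERT outperforms both the base model and African-centric models on ASR. The proposed Speech LLM improves speech recognition and also performs well in speech translation. Multi-step Chain-of-Thought training helps on complex tasks.

Conclusion: The study builds the first Speech LLM for Wolof, confirms the importance of high-quality data and continued pretraining, and shows that a Speech LLM can handle multiple tasks (such as recognition and translation) in a low-resource language. The models and code will be openly shared to support follow-up research.

Abstract: We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. We first emphasize the importance of collecting large-scale, spontaneous, high-quality speech data, and show that continued pretraining HuBERT on this dataset outperforms both the base model and African-centric models on ASR. We then integrate this speech encoder into a Wolof LLM to train the first Speech LLM for this language, extending its capabilities to tasks such as speech translation. Furthermore, we explore training the Speech LLM to perform multi-step Chain-of-Thought before transcribing or translating. Our results show that the Speech LLM not only improves speech recognition but also performs well in speech translation. The models and the code will be openly shared.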

[9] Frustratingly Easy Data Augmentation for Low-Resource ASR

Katsumi Ibaraki, David Chiang

Main category: cs.CL

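TL;DR: Proposes three self-contained data augmentation methods for low-resource speech recognition, based on text generation followed by TTS, which yield significant gains across several languages.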

Details

Motivation: Low-resource languages lack sufficient annotated speech data, which limits ASR performance, so effective data augmentation methods are needed.

Method: Generate new text by gloss-based replacement, random replacement, or a large language model, synthesize speech from it with TTS, and fine-tune Wav2Vec2-XLSR-53 on the synthetic data combined with the original data.

Result: Significant gains on all four low-resource languages, including a 14.3% absolute WER reduction for Nashta. The methods also help for high-resource languages such as English.

Conclusion: The three augmentation methods need no additional annotated data, improve low-resource ASR, and are broadly applicable.

Abstract: This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text--using gloss-based replacement, random replacement, or an LLM-based approach--and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with extremely limited resources (Vatlongos, Nashta, Shinekhen Buryat, and Kakabe). Fine-tuning a pretrained Wav2Vec2-XLSR-53 model on a combination of the original audio and generated synthetic data yields significant performance gains, including a 14.3% absolute WER reduction for Nashta. The methods prove effective across all four low-resource languages and also show utility for high-resource languages like English, demonstrating their broad applicability.
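Illustrative sketch (not from the paper): one plausible reading of the random-replacement step is to take an existing transcript and swap one word for a word drawn from the vocabulary of the same transcripts, so that only the original annotated data is used. The abstract does not describe the actual procedure, so the function name, the one-word-per-sentence choice, and whitespace tokenization are all assumptions. The TTS step that would turn the new text into audio is omitted.

```python
import random
from typing import List


def random_replacement(sentences: List[str], n_new: int, seed: int = 0) -> List[str]:
    """Create n_new sentences by replacing one word in a sampled transcript
    with a word drawn uniformly from the transcripts' own vocabulary."""
    rng = random.Random(seed)  # seeded for reproducibility
    tokenized = [s.split() for s in sentences if s.split()]
    if not tokenized:
        raise ValueError("need at least one non-empty sentence")
    vocab = sorted({w for sent in tokenized for w in sent})
    augmented = []
    for _ in range(n_new):
        sent = list(rng.choice(tokenized))  # copy so the source stays intact
        pos = rng.randrange(len(sent))      # position to overwrite
        sent[pos] = rng.choice(vocab)       # may occasionally pick the same word
        augmented.append(" ".join(sent))
    return augmented


if __name__ == "__main__":
    # Placeholder English strings; real input would be the annotated transcripts.
    for line in random_replacement(["the dog runs", "a bird sings loudly"], n_new=3):
        print(line)
```

Because the replacement ignores grammar, many outputs will be ill-formed. The gloss-based and LLM-based variants named in the abstract would presumably constrain the substitution more tightly.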

[10] Quantifying Uncertainty in Natural Language Explanations of Large Language Models for Question Answering

Yangyi Li, Mengdi Huai

Main category: cs.CL

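TL;DR: Proposes a new post-hoc, model-agnostic uncertainty estimation framework for natural language explanations generated by large language models, together with a robust method that keeps uncertainty guarantees valid under noise.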

Details

Motivation: Existing natural language explanations come without valid uncertainty guarantees, yet knowing the confidence behind an explanation is critical, especially in high-stakes domains such as medicine.

Method: Propose a post-hoc, model-agnostic uncertainty estimation framework, plus a robust uncertainty estimation method that addresses the challenges of auto-regressive generation and input noise.

Result: Extensive experiments on question answering tasks show that the methods provide valid uncertainty guarantees and remain robust under noise.

Conclusion: The work fills a gap in uncertainty quantification for natural language explanations and offers a dependable tool for trustworthy explanations of black-box LLMs.

Abstract: Large language models (LLMs) have shown strong capabilities, enabling concise, context-aware answers in question answering (QA) tasks. The lack of transparency in complex LLMs has inspired extensive research aimed at developing methods to explain large language model behaviors. Among existing explanation methods, natural language explanations stand out due to their ability to explain LLMs in a self-explanatory manner and enable the understanding of model behaviors even when the models are closed-source. However, despite these promising advancements, there is no existing work studying how to provide valid uncertainty guarantees for these generated natural language explanations. Such uncertainty quantification is critical in understanding the confidence behind these explanations. Notably, generating valid uncertainty estimates for natural language explanations is particularly challenging due to the auto-regressive generation process of LLMs and the presence of noise in medical inquiries. To bridge this gap, in this work, we first propose a novel uncertainty estimation framework for these generated natural language explanations, which provides valid uncertainty guarantees in a post-hoc and model-agnostic manner. Additionally, we also design a novel robust uncertainty estimation method that maintains valid uncertainty guarantees even under noise. Extensive experiments on QA tasks demonstrate the desired performance of our methods.

[11] Deep learning and abstractive summarisation for radiological reports: an empirical study for adapting the PEGASUS models' family with scarce data

Claudio Benzoni, Martina Langhals, Martin Boeker, Luise Modersohn, Máté E. Maros

Main category: cs.CL

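TL;DR: Examines the challenges of fine-tuning non-domain-specific abstractive summarisation models (PEGASUS and PEGASUS-X) for radiological report summarisation, revealing overfitting, underfitting, and peak-drop-recovery behaviour in the training dynamics.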

Details

Motivation: Demand for automatic text summarisation in medicine is growing, but data sensitivity and scarcity make existing models hard to apply directly, so fine-tuning strategies suited to small medical datasets need study.

Method: Fine-tune PEGASUS and PEGASUS-X on a public, medium-sized radiological report dataset, evaluate different checkpoints and training-set sizes, and monitor training with lexical and semantic metrics.

Result: PEGASUS shows epoch-wise double-descent behaviour. For PEGASUS-X, a larger checkpoint degrades performance, indicating the risk of fine-tuning highly expressive models on small data.

Conclusion: Fine-tuning high-capacity models in data-scarce specialised domains requires care to avoid overfitting and unstable training dynamics. The study provides a basis for more robust fine-tuning methods.

Abstract: Despite the rapid development of artificial intelligence, abstractive summarisation is still challenging for sensitive and data-restrictive domains like medicine. With the increasing number of imaging examinations, automated tools for complex medical text summarisation are expected to become highly relevant. In this paper, we investigated the adaptation via fine-tuning process of a non-domain-specific abstractive summarisation encoder-decoder model family, and gave insights to practitioners on how to avoid over- and underfitting. We used PEGASUS and PEGASUS-X, on a medium-sized radiological reports public dataset. For each model, we comprehensively evaluated two different checkpoints with varying sizes of the same training data. We monitored the models' performances with lexical and semantic metrics during the training history on the fixed-size validation set. PEGASUS exhibited different phases, which can be related to epoch-wise double-descent, or peak-drop-recovery behaviour. For PEGASUS-X, we found that using a larger checkpoint led to a performance detriment. This work highlights the challenges and risks of fine-tuning models with high expressivity when dealing with scarce training data, and lays the groundwork for future investigations into more robust fine-tuning strategies for summarisation models in specialised domains.

[12] BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition

Liuyuan Jiang, Xiaodong Cui, Brian Kingsbury, Tianyi Chen, Lisha Chen

Main category: cs.CL

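TL;DR: Proposes BiRQ, a bilevel self-supervised learning framework that combines the efficiency of BEST-RQ with HuBERT-style label enhancement. Enhanced pseudo-labels are generated from the model's own intermediate representations, giving efficient, low-complexity, end-to-end speech representation learning.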

Details

Motivation: Existing speech self-supervised learning methods trade off how informative the pseudo-labels are against how cheaply they are produced: strong-label methods (such as HuBERT) rely on external encoders and multi-stage pipelines, while efficient methods (such as BEST-RQ) have weaker labels. A method that is both efficient and strongly labeled is needed.

Method: BiRQ uses a bilevel optimization structure. A random-projection quantizer discretizes the model's intermediate representations to produce enhanced pseudo-labels, while anchoring labels derived directly from the raw input stabilize training. Differentiable Gumbel-softmax selection enables end-to-end first-order bilevel optimization without an external label encoder.

Result: BiRQ consistently improves over BEST-RQ on LibriSpeech (960 hours), AMI meetings (150 hours), and YODAS (5,000 hours) while keeping complexity low and computation efficient.

Conclusion: BiRQ combines efficient pseudo-label generation with the benefits of strong labels by letting the model refine its own labels, giving efficient, scalable, end-to-end speech self-supervised learning.

Abstract: Speech is a rich signal, and labeled audio-text pairs are costly, making self-supervised learning essential for scalable representation learning. A core challenge in speech SSL is generating pseudo-labels that are both informative and efficient: strong labels, such as those used in HuBERT, improve downstream performance but rely on external encoders and multi-stage pipelines, while efficient methods like BEST-RQ achieve simplicity at the cost of weaker labels. We propose BiRQ, a bilevel SSL framework that combines the efficiency of BEST-RQ with the refinement benefits of HuBERT-style label enhancement. The key idea is to reuse part of the model itself as a pseudo-label generator: intermediate representations are discretized by a random-projection quantizer to produce enhanced labels, while anchoring labels derived directly from the raw input stabilize training and prevent collapse. Training is formulated as an efficient first-order bilevel optimization problem, solved end-to-end with differentiable Gumbel-softmax selection. This design eliminates the need for external label encoders, reduces memory cost, and enables iterative label refinement in an end-to-end fashion. BiRQ consistently improves over BEST-RQ while maintaining low complexity and computational efficiency. We validate our method on various datasets, including 960-hour LibriSpeech, 150-hour AMI meetings and 5,000-hour YODAS, demonstrating consistent gains over BEST-RQ.
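Illustrative sketch (not from the paper): a random-projection quantizer in the BEST-RQ style can be written as a frozen random matrix plus a frozen random codebook, with the pseudo-label being the index of the nearest codebook entry after L2 normalization. The class name, dimensions, and pure-Python implementation are assumptions for illustration. BiRQ's bilevel optimization and Gumbel-softmax selection are not shown.

```python
import math
import random
from typing import List


def _normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


class RandomProjectionQuantizer:
    """Frozen random projection + frozen random codebook (never trained)."""

    def __init__(self, in_dim: int, proj_dim: int, codebook_size: int, seed: int = 0):
        rng = random.Random(seed)
        self.proj = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                     for _ in range(proj_dim)]
        self.codebook = [_normalize([rng.gauss(0.0, 1.0) for _ in range(proj_dim)])
                         for _ in range(codebook_size)]

    def label(self, frame: List[float]) -> int:
        """Map one feature frame to a discrete pseudo-label."""
        z = _normalize([sum(w * x for w, x in zip(row, frame)) for row in self.proj])
        # For unit vectors, the nearest neighbour in L2 distance is the
        # codebook entry with the largest dot product.
        scores = [sum(a * b for a, b in zip(z, c)) for c in self.codebook]
        return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    quantizer = RandomProjectionQuantizer(in_dim=4, proj_dim=3, codebook_size=8)
    print(quantizer.label([0.5, -1.0, 0.25, 2.0]))
```

Under the abstract's description, the same kind of quantizer would be applied twice: to intermediate representations for the enhanced labels and to the raw input for the anchoring labels.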

[13] PILOT: Steering Synthetic Data Generation with Psychological & Linguistic Output Targeting

Caitlin Cisar, Emily Sheffield, Joshua Drake, Alden Harrell, Subramanian Chidambaram, Nikita Nangia, Vinayak Arannil, Alex Williams

Main category: cs.CL

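TL;DR: Proposes PILOT, a framework that steers large language model generation with structured psycholinguistic profiles. Compared with natural language persona descriptions, it improves output coherence and topic purity while balancing diversity and consistency.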

Details

Motivation: Steering generation with natural language user personas tends to make models draw unintended inferences and gives little precise control over output attributes.

Method: PILOT has two phases. Phase 1 converts a natural language persona into a normalized, multidimensional psycholinguistic profile. Phase 2 uses the profile to guide generation along measurable dimensions. Three steering conditions (natural language, schema-based, and hybrid) are compared on three state-of-the-art LLMs.

Result: Schema-Based Steering (SBS) significantly reduces artificial-sounding repetition and improves output coherence (silhouette score from 0.098 to 0.237, topic purity from 0.773 to 0.957) at the cost of some lexical diversity. Natural-language Persona Steering (NPS) is more diverse but less predictable. Hybrid Persona-Schema Steering (HPS) balances diversity and consistency. Expert evaluation finds no significant difference in response quality between the approaches.

Conclusion: Structured psycholinguistic profiles improve control over generated content, and PILOT offers a quantifiable, reproducible approach to controllable generation with LLMs.

Abstract: Generative AI applications commonly leverage user personas as a steering mechanism for synthetic data generation, but reliance on natural language representations forces models to make unintended inferences about which attributes to emphasize, limiting precise control over outputs. We introduce PILOT (Psychological and Linguistic Output Targeting), a two-phase framework for steering large language models with structured psycholinguistic profiles. In Phase 1, PILOT translates natural language persona descriptions into multidimensional profiles with normalized scores across linguistic and psychological dimensions. In Phase 2, these profiles guide generation along measurable axes of variation. We evaluate PILOT across three state-of-the-art LLMs (Mistral Large 2, Deepseek-R1, LLaMA 3.3 70B) using 25 synthetic personas under three conditions: Natural-language Persona Steering (NPS), Schema-Based Steering (SBS), and Hybrid Persona-Schema Steering (HPS). Results demonstrate that schema-based approaches significantly reduce artificial-sounding persona repetition while improving output coherence, with silhouette scores increasing from 0.098 to 0.237 and topic purity from 0.773 to 0.957. Our analysis reveals a fundamental trade-off: SBS produces more concise outputs with higher topical consistency, while NPS offers greater lexical diversity but reduced predictability. HPS achieves a balance between these extremes, maintaining output variety while preserving structural consistency. Expert linguistic evaluation confirms that PILOT maintains high response quality across all conditions, with no statistically significant differences between steering approaches.

[14] Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding

Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler

Main category: cs.CL

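TL;DR: Systematically evaluates large language models and multimodal LLMs on sarcasm detection in English and Chinese under zero-shot, few-shot, and LoRA fine-tuning settings, and integrates model feature representations through a collaborative gating fusion module. Audio and bimodal combinations such as text-audio perform best, highlighting the potential of multimodal LLMs for cross-lingual, multimodal sarcasm understanding.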

Details

Motivation: Sarcasm detection depends on subtle cross-modal cues spanning text, speech, and vision. Prior work mostly covers unimodal or bimodal (e.g., text-vision) settings, while joint audio-visual-textual understanding remains underexplored, especially from a cross-lingual perspective.

Method: Evaluate LLMs and multimodal LLMs on MUStARD++ (English) and MCSD 1.0 (Chinese) in zero-shot, few-shot, and LoRA fine-tuning settings. Beyond direct classification, use the models as feature encoders and fuse their multimodal representations with a collaborative gating fusion module.

Result: Audio-based models give the strongest unimodal performance, and text-audio and audio-vision combinations outperform both unimodal and trimodal models. Multimodal LLMs such as Qwen-Omni are competitive in zero-shot and fine-tuned settings.

Conclusion: Multimodal LLMs show strong potential for cross-lingual, multimodal sarcasm detection. Audio plays a key role in recognizing sarcasm, and well-chosen bimodal fusion beats more complex trimodal fusion.

Abstract: Sarcasm detection remains a challenge in natural language understanding, as sarcastic intent often relies on subtle cross-modal cues spanning text, speech, and vision. While prior work has primarily focused on textual or visual-textual sarcasm, comprehensive audio-visual-textual sarcasm understanding remains underexplored. In this paper, we systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on English (MUStARD++) and Chinese (MCSD 1.0) in zero-shot, few-shot, and LoRA fine-tuning settings. In addition to direct classification, we explore models as feature encoders, integrating their representations through a collaborative gating fusion module. Experimental results show that audio-based models achieve the strongest unimodal performance, while text-audio and audio-vision combinations outperform unimodal and trimodal models. Furthermore, MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned performance. Our findings highlight the potential of MLLMs for cross-lingual, audio-visual-textual sarcasm understanding.

[15] Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Madison Van Doren, Casey Ford, Emily Dix

Main category: cs.CL

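TL;DR: Evaluates the safety of four leading multimodal large language models under adversarial prompts and finds significant differences across models and input modalities. Pixtral 12B is the most prone to harmful outputs, Claude Sonnet 3.5 is the most resistant, and text-only prompts bypass safety mechanisms slightly more often than multimodal ones.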

Details

Motivation: As multimodal LLMs are deployed widely in real-world applications, their safety under adversarial conditions needs systematic evaluation to expose risks and drive the development of safety benchmarks.

Method: 26 red teamers wrote 726 adversarial prompts targeting illegal activity, disinformation, and unethical behaviour. The prompts were submitted to GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus, and 17 annotators rated the 2,904 outputs (726 prompts × 4 models) for harmfulness on a 5-point scale.

Result: Pixtral 12B has the highest rate of harmful responses (~62%) and Claude Sonnet 3.5 the lowest (~10%). Contrary to expectations, text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal ones. Statistical analysis shows that both model type and input modality are significant predictors of harmfulness.

Conclusion: Safety differs markedly across MLLMs and current safety mechanisms still have gaps in multimodal settings, so more comprehensive and robust multimodal safety benchmarks are urgently needed.

Abstract: Multimodal large language models (MLLMs) are increasingly used in real world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs (GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus) when exposed to adversarial prompts across text-only and multimodal formats. A team of 26 red teamers generated 726 prompts targeting three harm categories: illegal activity, disinformation, and unethical behaviour. These prompts were submitted to each model, and 17 annotators rated 2,904 model outputs for harmfulness using a 5-point scale. Results show significant differences in vulnerability across models and modalities. Pixtral 12B exhibited the highest rate of harmful responses (~62%), while Claude Sonnet 3.5 was the most resistant (~10%). Contrary to expectations, text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal ones. Statistical analysis confirmed that both model type and input modality were significant predictors of harmfulness. These findings underscore the urgent need for robust, multimodal safety benchmarks as MLLMs are deployed more widely.

[16] mucAI at BAREC Shared Task 2025: Towards Uncertainty Aware Arabic Readability Assessment

Ahmed Abdou

Main category: cs.CL

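TL;DR: Proposes a model-agnostic post-processing technique that combines conformal prediction with a weighted average over softmax-renormalized probabilities to improve fine-grained Arabic readability classification.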

Details

Motivation: In Arabic readability classification over 19 ordinal levels, reduce high-penalty misclassifications and make predictions more reliable and practical.

Method: Use conformal prediction to produce prediction sets with coverage guarantees, then compute a weighted average with softmax-renormalized probabilities over each set, giving uncertainty-aware decoding.

Result: QWK improves by 1-3 points across different base models. In the strict track, the submission reaches QWK of 84.9% (test) and 85.7% (blind test) at sentence level, and 73.3% at document level.

Conclusion: The method improves fine-grained Arabic readability classification while keeping statistical guarantees, and helps human reviewers in educational assessment work more efficiently.

Abstract: We present a simple, model-agnostic post-processing technique for fine-grained Arabic readability classification in the BAREC 2025 Shared Task (19 ordinal levels). Our method applies conformal prediction to generate prediction sets with coverage guarantees, then computes weighted averages using softmax-renormalized probabilities over the conformal sets. This uncertainty-aware decoding improves Quadratic Weighted Kappa (QWK) by reducing high-penalty misclassifications to nearer levels. Our approach shows consistent QWK improvements of 1-3 points across different base models. In the strict track, our submission achieves QWK scores of 84.9% (test) and 85.7% (blind test) for sentence level, and 73.3% for document level. For Arabic educational assessment, this enables human reviewers to focus on a handful of plausible levels, combining statistical guarantees with practical usability.
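Illustrative sketch (not from the paper): the decoding step can be approximated with standard split conformal prediction. The nonconformity score used here (one minus the probability of the true level) and the reading of "softmax-renormalized" as dividing each probability by the sum over the prediction set are assumptions, since the abstract does not specify either. Levels are 0-indexed in the code, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def conformal_threshold(
    cal_probs: Sequence[Sequence[float]], cal_labels: Sequence[int], alpha: float = 0.1
) -> float:
    """Split-conformal quantile of the scores 1 - p(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    if n == 0:
        raise ValueError("calibration set is empty")
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the set must contain every level.
    return 1.0 if k > n else scores[k - 1]


def uncertainty_aware_decode(probs: Sequence[float], qhat: float) -> Tuple[List[int], int]:
    """Return (prediction set, level from the renormalized weighted average)."""
    keep = [k for k, p in enumerate(probs) if 1.0 - p <= qhat]
    if not keep:  # fall back to the argmax so the set is never empty
        keep = [max(range(len(probs)), key=probs.__getitem__)]
    total = sum(probs[k] for k in keep)
    expected = sum(k * probs[k] for k in keep) / total
    return keep, round(expected)


if __name__ == "__main__":
    levels, decoded = uncertainty_aware_decode([0.05, 0.45, 0.30, 0.20], qhat=0.85)
    print(levels, decoded)
```

In the usage example the argmax level is 1, but the weighted average over the set {1, 2, 3} rounds to 2. This is the kind of shift toward a central level that lowers the penalty of a quadratic-weighted metric when the argmax is wrong.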

[17] LLM Cache Bandit Revisited: Addressing Query Heterogeneity for Cost-Effective LLM Inference

Hantao Yang, Hong Xie, Defu Lian, Enhong Chen

Main category: cs.CL

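TL;DR: Studies LLM cache selection under heterogeneous query sizes, models it as a knapsack problem, and proposes an accumulation-based strategy that reduces inference cost both in theory and in experiments.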

Details

Motivation: Prior work usually assumes uniform query sizes and cannot handle the combinatorial optimization that heterogeneous queries introduce in practice, so a more efficient cache selection mechanism is needed to cut LLM inference cost.

Method: Model optimal cache selection as a knapsack problem, use an accumulation-based strategy to balance computational overhead against cache updates, and analyze the regret bound, including a problem-dependent bound.

Result: The algorithm achieves an $O(\sqrt{MNT})$ regret bound, improving on the earlier $O(MN\sqrt{T})$. Experiments on real-world data reduce total cost by approximately 12%.

Conclusion: The accumulation-based strategy handles heterogeneous queries more efficiently, combines theoretical guarantees with practical performance, and lowers the overall cost of LLM inference.

Abstract: This paper revisits the LLM cache bandit problem, with a special focus on addressing the query heterogeneity for cost-effective LLM inference. Previous works often assume uniform query sizes. Heterogeneous query sizes introduce a combinatorial structure for cache selection, making the cache replacement process more computationally and statistically challenging. We treat optimal cache selection as a knapsack problem and employ an accumulation-based strategy to effectively balance computational overhead and cache updates. In theoretical analysis, we prove that the regret of our algorithm achieves an $O(\sqrt{MNT})$ bound, improving the coefficient of $\sqrt{MN}$ compared to the $O(MN\sqrt{T})$ result in Berkeley, where $N$ is the total number of queries and $M$ is the cache size. Additionally, we also provide a problem-dependent bound, which was absent in previous works. Experiments on real-world data show that our algorithm reduces the total cost by approximately 12%.
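Illustrative sketch (not from the paper): treating cache selection as a knapsack problem can be shown with the textbook 0/1 knapsack dynamic program, where each query has an integer size and an estimated saving from being cached, and the cache has capacity `capacity`. How the paper estimates the savings, and its accumulation-based update schedule, are not given in the abstract. The function name and the integer-size assumption are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_cache(
    sizes: Sequence[int], savings: Sequence[float], capacity: int
) -> Tuple[float, List[int]]:
    """0/1 knapsack: choose the queries to cache that maximize total estimated
    savings without exceeding the cache capacity.

    Returns (best total saving, sorted indices of the cached queries).
    """
    n = len(sizes)
    best = [0.0] * (capacity + 1)  # best[c] = max saving using capacity c
    took = [[False] * (capacity + 1) for _ in range(n)]
    for i in range(n):
        # Iterate capacity downwards so each query is used at most once.
        for c in range(capacity, sizes[i] - 1, -1):
            candidate = best[c - sizes[i]] + savings[i]
            if candidate > best[c]:
                best[c] = candidate
                took[i][c] = True
    # Walk back through the decisions to recover the chosen set.
    chosen, c = [], capacity
    for i in range(n - 1, -1, -1):
        if took[i][c]:
            chosen.append(i)
            c -= sizes[i]
    return best[capacity], sorted(chosen)


if __name__ == "__main__":
    total, cached = select_cache(sizes=[3, 4, 2], savings=[4.0, 5.0, 3.0], capacity=5)
    print(total, cached)
```

With uniform sizes the problem reduces to keeping the highest-value queries, which is why heterogeneous sizes are what introduce the combinatorial structure mentioned in the abstract. For scale, the ratio between the two stated regret bounds is $MN\sqrt{T} / \sqrt{MNT} = \sqrt{MN}$.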

[18] How do Language Models Generate Slang: A Systematic Comparison between Human and Machine-Generated Slang Usages

Siyang Wu, Zhewei Sun

Main category: cs.CL

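TL;DR: Systematically compares slang usages produced by humans and by large language models (such as GPT-4o and Llama-3). The models capture some of the creative aspects of slang but show significant biases in usage patterns, so they are not sufficiently aligned with human usage for extrapolative tasks such as linguistic analysis.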

Details

Motivation: Assess whether LLMs have captured structural knowledge about slang that matches attested human usage, in order to judge their reliability and generalizability on slang-related tasks.

Method: Compare human-attested slang usages from the Online Slang Dictionary (OSD) with slang generated by GPT-4o and Llama-3 along three core aspects: biases in usage characteristics, creativity reflected in lexical coinages and word reuse, and informativeness when used as gold-standard examples for model distillation.

Result: LLMs show systematic biases in how they perceive slang. They can generate creative slang, but their usage patterns diverge from human usage, particularly in semantic plausibility and contextual fit.

Conclusion: Current LLMs hold some knowledge of slang, but the slang they generate differs significantly from real human usage, which limits their reliability in tasks that require deep linguistic understanding.

Abstract: Slang is a commonly used type of informal language that poses a daunting challenge to NLP systems. Recent advances in large language models (LLMs), however, have made the problem more approachable. While LLM agents are becoming more widely applied to intermediary tasks such as slang detection and slang interpretation, their generalizability and reliability are heavily dependent on whether these models have captured structural knowledge about slang that aligns well with human attested slang usages. To answer this question, we contribute a systematic comparison between human and machine-generated slang usages. Our evaluative framework focuses on three core aspects: 1) Characteristics of the usages that reflect systematic biases in how machines perceive slang, 2) Creativity reflected by both lexical coinages and word reuses employed by the slang usages, and 3) Informativeness of the slang usages when used as gold-standard examples for model distillation. By comparing human-attested slang usages from the Online Slang Dictionary (OSD) and slang generated by GPT-4o and Llama-3, we find significant biases in how LLMs perceive slang. Our results suggest that while LLMs have captured significant knowledge about the creative aspects of slang, such knowledge does not align with humans sufficiently to enable LLMs for extrapolative tasks such as linguistic analyses.

[19] A method for improving multilingual quality and diversity of instruction fine-tuning datasets

Chunguang Zhao,Yilun Liu,Pufan Zeng,Yuanchang Luo,Shimin Tao,Minggui He,Weibin Meng,Song Xu,Ziang Chen,Chen Liu,Hongxia Ma,Li Zhang,Boxing Chen,Daimeng Wei

Main category: cs.CL

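TL;DR: This paper proposes M-DaQ, a new method for improving the quality and diversity of data used in multilingual instruction fine-tuning, thereby strengthening the generalization of large language models in multilingual settings.

Details Motivation: The scarcity of high-quality multilingual training data, together with the limitations of existing data selection methods in cross-lingual settings, holds back the multilingual performance of large language models. Method: Proposes M-DaQ, which selects high-quality and semantically diverse multilingual instruction fine-tuning samples, and conducts the first systematic investigation of the Superficial Alignment Hypothesis (SAH) in a multilingual setting. Result: Experiments on 18 languages show that models fine-tuned with M-DaQ achieve significant gains over baseline models, with a win rate above 60%; human evaluation also confirms improved cultural relevance in the responses. Conclusion: M-DaQ effectively improves LLM performance in multilingual instruction fine-tuning and supports broader linguistic and cultural adaptation; the code is released to support follow-up research. Abstract: Multilingual Instruction Fine-Tuning (IFT) is essential for enabling large language models (LLMs) to generalize effectively across diverse linguistic and cultural contexts. However, the scarcity of high-quality multilingual training data and of corresponding construction methods remains a critical bottleneck. While data selection has shown promise in English settings, existing methods often fail to generalize across languages due to reliance on simplistic heuristics or language-specific assumptions. In this work, we introduce Multilingual Data Quality and Diversity (M-DaQ), a novel method for improving LLMs' multilinguality by selecting high-quality and semantically diverse multilingual IFT samples. We further conduct the first systematic investigation of the Superficial Alignment Hypothesis (SAH) in a multilingual setting. Empirical results across 18 languages demonstrate that models fine-tuned with the M-DaQ method achieve significant performance gains over vanilla baselines, with a win rate above 60%. Human evaluations further validate these gains, highlighting an increase in cultural points in the responses. We release the M-DaQ code to support future research.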

[20] DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm

Xiaowei Zhu,Yubing Ren,Fang Fang,Qingfeng Tan,Shi Wang,Yanan Cao

Main category: cs.CL

TL;DR: Proposes DNA-DetectLLM, a DNA-inspired zero-shot method for detecting AI-generated text that repairs non-optimal tokens and quantifies the repair cost to separate AI text from human text, with strong detection performance and robustness.

Details Motivation: As large language models advance, AI-generated text is increasingly hard to tell apart from human-written text, raising problems such as misinformation, authorship ambiguity, and intellectual property concerns; reliable detection methods are urgently needed. Method: Inspired by DNA repair mechanisms, DNA-DetectLLM constructs an ideal AI-generated sequence for the input text, iteratively repairs the non-optimal tokens, and uses the cumulative repair cost as an interpretable detection signal, enabling zero-shot detection. Result: On multiple public benchmark datasets, DNA-DetectLLM achieves state-of-the-art performance, with relative improvements of 5.55% in AUROC and 2.08% in F1 score, and is strongly robust to various adversarial attacks and input lengths. Conclusion: Through an interpretable repair mechanism, DNA-DetectLLM captures the intrinsic differences between human and AI text and is an efficient, robust zero-shot detector of AI-generated text. Abstract: The rapid advancement of large language models (LLMs) has blurred the line between AI-generated and human-written text. This progress brings societal risks such as misinformation, authorship ambiguity, and intellectual property concerns, highlighting the urgent need for reliable AI-generated text detection methods. However, recent advances in generative language modeling have resulted in significant overlap between the feature distributions of human-written and AI-generated text, blurring classification boundaries and making accurate detection increasingly challenging. To address the above challenges, we propose a DNA-inspired perspective, leveraging a repair-based process to directly and interpretably capture the intrinsic differences between human-written and AI-generated text. Building on this perspective, we introduce DNA-DetectLLM, a zero-shot detection method for distinguishing AI-generated and human-written text. The method constructs an ideal AI-generated sequence for each input, iteratively repairs non-optimal tokens, and quantifies the cumulative repair effort as an interpretable detection signal. Empirical evaluations demonstrate that our method achieves state-of-the-art detection performance and exhibits strong robustness against various adversarial attacks and input lengths. Specifically, DNA-DetectLLM achieves relative improvements of 5.55% in AUROC and 2.08% in F1 score across multiple public benchmark datasets.

[21] Exploring Polyglot Harmony: On Multilingual Data Allocation for Large Language Models Pretraining

Ping Guo,Yubing Ren,Binbin Liu,Fengze Liu,Haobin Lin,Yifan Zhang,Bingni Zhang,Taifeng Wang,Yin Zheng

Main category: cs.CL

TL;DR: This paper proposes Climb, a new framework for optimizing multilingual training data allocation that accounts for cross-lingual interactions to improve the multilingual performance of large language models.

Details Motivation: Cross-lingual interactions are complex and sensitive to dataset scale, so determining optimal language proportions is very difficult, and existing methods struggle to balance data allocation in multilingual training. Method: Climb introduces a cross-lingual interaction-aware language ratio and applies a two-step optimization strategy: first equalize the marginal benefits across languages, then maximize the magnitude of the language allocation vector. Result: Experiments show that Climb accurately measures cross-lingual interactions across different multilingual settings. LLMs trained with Climb-derived proportions reach state-of-the-art multilingual performance and are even competitive with open-source LLMs trained on more tokens. Conclusion: Climb offers a systematic, scalable solution for data mixing in multilingual LLMs and improves both the efficiency and the performance of multilingual training. Abstract: Large language models (LLMs) have become integral to a wide range of applications worldwide, driving an unprecedented global demand for effective multilingual capabilities. Central to achieving robust multilingual performance is the strategic allocation of language proportions within training corpora. However, determining optimal language ratios is highly challenging due to intricate cross-lingual interactions and sensitivity to dataset scale. This paper introduces Climb (Cross-Lingual Interaction-aware Multilingual Balancing), a novel framework designed to systematically optimize multilingual data allocation. At its core, Climb introduces a cross-lingual interaction-aware language ratio, explicitly quantifying each language's effective allocation by capturing inter-language dependencies. Leveraging this ratio, Climb proposes a principled two-step optimization procedure--first equalizing marginal benefits across languages, then maximizing the magnitude of the resulting language allocation vectors--significantly simplifying the inherently complex multilingual optimization problem. Extensive experiments confirm that Climb can accurately measure cross-lingual interactions across various multilingual settings. LLMs trained with Climb-derived proportions consistently achieve state-of-the-art multilingual performance, even achieving competitive performance with open-sourced LLMs trained with more tokens.

[22] How important is language for human-like intelligence?

Gary Lupyan,Hunter Gentry,Martin Zettersten

Main category: cs.CL

TL;DR: This paper examines the key role of language in human cognition and in the development of artificial intelligence, arguing that language is not only a tool for expressing thought but also a core mechanism behind abstract thinking and general intelligence.

Details Motivation: To revisit whether language is merely the outward expression of thought or plays a fundamental role in shaping human thinking, and to consider, in light of AI progress, what this implies for general intelligence. Method: Analyzes two properties of language, its capacity for compact representation and the abstractions accumulated through cultural evolution, to argue how language gives rise to domain-general cognitive abilities. Result: Language provides a compressed model of the world, which lets biological or artificial learning systems reverse engineer the causal and conceptual structures behind human thought and thereby develop more general intelligence. Conclusion: Language is not only a carrier of thought but also a key factor in the emergence of higher cognitive abilities in both humans and AI. Abstract: We use language to communicate our thoughts. But is language merely the expression of thoughts, which are themselves produced by other, nonlinguistic parts of our minds? Or does language play a more transformative role in human cognition, allowing us to have thoughts that we otherwise could (or would) not have? Recent developments in artificial intelligence (AI) and cognitive science have reinvigorated this old question. We argue that language may hold the key to the emergence of both more general AI systems and central aspects of human intelligence. We highlight two related properties of language that make it such a powerful tool for developing domain-general abilities. First, language offers compact representations that make it easier to represent and reason about many abstract concepts (e.g., exact numerosity). Second, these compressed representations are the iterated output of collective minds. In learning a language, we learn a treasure trove of culturally evolved abstractions. Taken together, these properties mean that a sufficiently powerful learning system exposed to language--whether biological or artificial--learns a compressed model of the world, reverse engineering many of the conceptual and causal structures that support human (and human-like) thought.

[23] LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs

Junlong Jia,Xing Wu,Chaochen Gao,Ziyang Chen,Zijia Lin,Zhongzhi Li,Weinong Wang,Haotian Xu,Donghui Jin,Debing Zhang,Binghui Guo

Main category: cs.CL

TL;DR: LiteLong is a resource-efficient method for synthesizing long-context data that uses structured topic organization and multi-agent debate to generate high-quality, diverse long-context training data.

Details Motivation: Existing relevance-based approaches to long-context data synthesis are computationally inefficient, which makes it hard to produce high-quality long-context training data at reasonable cost. Method: Uses the BISAC book classification system to build a hierarchical topic structure, generates diverse topics through a debate among multiple LLM agents, retrieves relevant documents with lightweight BM25, and concatenates them into 128K-token training samples. Result: Achieves competitive long-context performance on the HELMET and Ruler benchmarks and integrates seamlessly with other long-dependency enhancement methods. Conclusion: By lowering computational and data engineering costs, LiteLong makes high-quality long-context data synthesis more accessible and supports further research on training long-context language models. Abstract: High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance-based aggregation face challenges of computational efficiency. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on HELMET and Ruler benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language training.
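The retrieve-and-concatenate step described for LiteLong can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the whitespace tokenizer, the `k1` and `b` values, and the helper names `bm25_scores` and `build_sample` are all assumptions, and the token budget is tiny compared with the 128K tokens the abstract mentions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (assumed k1, b)."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokens
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def build_sample(topic, docs, max_tokens):
    """Concatenate the best-matching documents without exceeding the budget."""
    scores = bm25_scores(topic, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    picked, used = [], 0
    for i in ranked:
        n = len(docs[i].split())
        # Skip documents that do not match or that would overflow the budget.
        if scores[i] <= 0 or used + n > max_tokens:
            continue
        picked.append(docs[i])
        used += n
    return "\n\n".join(picked)

docs = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "a cat and a dog played",
    "markets rallied",
]
sample = build_sample("cat", docs, max_tokens=12)
```

In a real setting the topic string would come from the debate stage and the budget would be counted with the model's own tokenizer; both are outside what the abstract specifies.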

[24] Relevance to Utility: Process-Supervised Rewrite for RAG

Jaeyoung Kim,Jongho Kim,Seung-won Hwang,Seoho Song,Young-In Song

Main category: cs.CL

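TL;DR: Proposes R2U, which uses process supervision to directly optimize the probability of generating a correct answer and distills supervision from large models efficiently, improving the performance of retrieval-augmented generation systems.

Details Motivation: Existing retrieval-augmented generation systems have a gap between optimizing retrieval relevance and optimizing generative utility: retrieved documents may be on topic yet lack the content needed for effective reasoning. Method: Proposes R2U, which directly optimizes the probability of generating a correct answer through process supervision, and designs an efficient distillation pipeline that uses supervision signals from large language models to train a smaller rewriter model. Result: Evaluated on multiple open-domain question-answering benchmarks, R2U consistently outperforms strong baselines. Conclusion: R2U aligns retrieved content with generation needs more effectively and improves the performance of retrieval-augmented generation systems. Abstract: Retrieval-Augmented Generation systems often suffer from a gap between optimizing retrieval relevance and generative utility: retrieved documents may be topically relevant but still lack the content needed for effective reasoning during generation. While existing "bridge" modules attempt to rewrite the retrieved text for better generation, we show how they fail to capture true document utility. In this work, we propose R2U, with a key distinction of directly optimizing to maximize the probability of generating a correct answer through process supervision. As such direct observation is expensive, we also propose approximating an efficient distillation pipeline by scaling the supervision from LLMs, which helps the smaller rewriter model generalize better. We evaluate our method across multiple open-domain question-answering benchmarks. The empirical results demonstrate consistent improvements over strong bridging baselines.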

[25] Chunk Based Speech Pre-training with High Resolution Finite Scalar Quantization

Yun Tang,Cindy Tseng

Main category: cs.CL

TL;DR: This paper proposes a chunk-based self-supervised learning method (Chunk SSL) as a unified solution for streaming and offline speech pre-training, using a masked prediction loss and a finite scalar quantization (FSQ) module to improve performance on speech-to-text tasks.

Details Motivation: Most existing self-supervised learning algorithms assume full utterances and handle the partial utterances common in streaming applications poorly, so a unified pre-training method suited to streaming is needed. Method: Proposes the Chunk SSL algorithm, which uses a chunk-based masked prediction loss to restore masked speech frames from unmasked ones; introduces a copy-and-append data augmentation strategy; discretizes speech features with a high-resolution finite scalar quantization (FSQ) module; and uses a group masked prediction loss to reduce computational overhead. Result: Experiments on the Librispeech and Must-C datasets show competitive results on speech recognition and speech translation in both streaming and offline modes. Conclusion: Chunk SSL provides an effective unified framework for streaming and offline speech pre-training, and is particularly suited to low-latency speech human-machine communication. Abstract: Low latency speech human-machine communication is becoming increasingly necessary as speech technology has advanced quickly over the last decade. One of the primary factors behind the advancement of speech technology is self-supervised learning. Most self-supervised learning algorithms are designed with a full-utterance assumption, and compromises have to be made if partial utterances are presented, which are common in streaming applications. In this work, we propose a chunk based self-supervised learning (Chunk SSL) algorithm as a unified solution for both streaming and offline speech pre-training. Chunk SSL is optimized with the masked prediction loss and an acoustic encoder is encouraged to restore indices of those masked speech frames with help from unmasked frames in the same chunk and preceding chunks. A copy and append data augmentation approach is proposed to conduct efficient chunk based pre-training. Chunk SSL utilizes a finite scalar quantization (FSQ) module to discretize input speech features and our study shows a high resolution FSQ codebook, i.e., a codebook with vocabulary size up to a few million, is beneficial to transfer knowledge from the pre-training task to the downstream tasks. A group masked prediction loss is employed during pre-training to alleviate the high memory and computation cost introduced by the large codebook. The proposed approach is examined in two speech to text tasks, i.e., speech recognition and speech translation. Experimental results on the Librispeech and Must-C datasets show that the proposed method could achieve very competitive results for speech to text tasks at both streaming and offline modes.
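To make the "high resolution FSQ codebook" concrete, here is a rough sketch of finite scalar quantization as it is commonly described: each feature dimension is bounded, rounded to a small integer grid, and the per-dimension codes are combined into one index. The level counts, the `tanh` bounding, and the function names are assumptions for illustration; the abstract does not give the paper's actual configuration.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each entry of z to one of levels[i] integer codes.

    Assumes every level count is odd so the grid is symmetric around zero.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2
        bounded = half * math.tanh(value)          # squash into (-half, half)
        codes.append(int(round(bounded) + half))   # shift to 0 .. n_levels - 1
    return codes

def codes_to_index(codes, levels):
    """Flatten per-dimension codes into a single codebook index (mixed radix)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        index = index * n_levels + code
    return index

def codebook_size(levels):
    """Implicit vocabulary size: the product of the per-dimension level counts."""
    return math.prod(levels)

levels = [5, 5, 5]                      # hypothetical small configuration
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
index = codes_to_index(codes, levels)
big = codebook_size([7] * 8)            # eight dimensions with seven levels each
```

The last line shows why the vocabulary grows so quickly: eight dimensions with seven levels each already give 7^8 = 5,764,801 codes, which is the "few million" range the abstract refers to and the reason a grouped prediction loss becomes attractive.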

[26] DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Tsz Ting Chung,Lemao Liu,Mo Yu,Dit-Yan Yeung

Main category: cs.CL

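TL;DR: This paper proposes DivLogicEval, a new classical logic reasoning benchmark made of diverse natural statements combined in counterintuitive ways, intended to evaluate the logical reasoning of large language models more accurately. It also introduces a new evaluation metric that reduces the influence of bias and randomness inherent in LLMs.

Details Motivation: Existing logical reasoning benchmarks have limited language diversity and skewed distributions, and they often entangle several reasoning skills, which makes the evaluation of logical reasoning inaccurate. A more reliable and representative benchmark is needed. Method: Builds a new benchmark, DivLogicEval, of natural language sentences composed of diverse statements in counterintuitive combinations, and designs a new evaluation metric that mitigates the effect of LLM bias and randomness on the results. Result: Experiments show that answering DivLogicEval questions does require strong logical reasoning, and they reveal differences among popular LLMs on logical reasoning tasks. The new metric reduces the interference caused by model bias and randomness. Conclusion: DivLogicEval evaluates the logical reasoning of LLMs more faithfully and, together with the new metric, yields more reliable and less biased results, supporting further research on logical reasoning. Abstract: Logic reasoning in natural language has been recognized as an important measure of human intelligence for Large Language Models (LLMs). Popular benchmarks may entangle multiple reasoning skills and thus provide unfaithful evaluations on the logic reasoning skill. Meanwhile, existing logic reasoning benchmarks are limited in language diversity and their distributions deviate from the distribution of an ideal logic reasoning benchmark, which may lead to biased evaluation results. This paper thereby proposes a new classical logic benchmark DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the performance of different popular LLMs in conducting logical reasoning.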

[27] SciEvent: Benchmarking Multi-domain Scientific Event Extraction

Bofu Dong,Pritesh Shah,Sumedh Sonawane,Tiyasha Banerjee,Erin Brady,Xinya Du,Ming Jiang

Main category: cs.CL

TL;DR: This paper presents SciEvent, a multi-domain benchmark of scientific abstracts built on a unified event extraction schema, aimed at structured and context-aware understanding of scientific content.

Details Motivation: Existing scientific information extraction methods are confined to narrow domains, cope poorly with interdisciplinary research, and often produce fragmented or conflicting information because they lack context. Method: Proposes a multi-stage event extraction pipeline: first segment each abstract into four core scientific activities (Background, Method, Result, and Conclusion), then extract the corresponding event triggers and fine-grained arguments. Builds a manually annotated dataset of 500 abstracts across five domains. Result: Experiments show that current models perform poorly in domains such as sociology and the humanities, with a significant performance gap. Conclusion: SciEvent provides a challenging multi-domain benchmark for scientific information extraction and advances generalizable scientific information extraction. Abstract: Scientific information extraction (SciIE) has primarily relied on entity-relation extraction in narrow domains, limiting its applicability to interdisciplinary research and struggling to capture the necessary context of scientific information, often resulting in fragmented or conflicting statements. In this paper, we introduce SciEvent, a novel multi-domain benchmark of scientific abstracts annotated via a unified event extraction (EE) schema designed to enable structured and context-aware understanding of scientific content. It includes 500 abstracts across five research domains, with manual annotations of event segments, triggers, and fine-grained arguments. We define SciIE as a multi-stage EE pipeline: (1) segmenting abstracts into core scientific activities--Background, Method, Result, and Conclusion; and (2) extracting the corresponding triggers and arguments. Experiments with fine-tuned EE models, large language models (LLMs), and human annotators reveal a performance gap, with current models struggling in domains such as sociology and humanities. SciEvent serves as a challenging benchmark and a step toward generalizable, multi-domain SciIE.

[28] Concept Unlearning in Large Language Models via Self-Constructed Knowledge Triplets

Tomoya Yamashita,Yuuki Yamanaka,Masanori Yamada,Takayuki Miura,Toshiki Shibahara,Tomoharu Iwata

Main category: cs.CL

TL;DR: This paper introduces Concept Unlearning (CU) as a new requirement for machine unlearning in large language models. It represents the model's internal knowledge as a knowledge graph and removes the target node and its associated edges, giving a more intuitive and effective form of concept-level unlearning.

Details Motivation: Existing machine unlearning methods can only remove specific sentences and cannot handle broader concepts such as persons or events, which limits their use for privacy and copyright issues. Method: Proposes a new method that prompts the LLM to generate knowledge triplets and explanatory sentences about the forgetting target, then applies unlearning to these representations under a knowledge-graph formulation, aligning the process with the model's internal knowledge structure. Result: Experiments on real-world and synthetic datasets show that the method achieves concept-level unlearning while largely preserving unrelated knowledge. Conclusion: The knowledge-graph-based, prompt-driven method offers an effective and intuitive solution for concept unlearning in LLMs and extends machine unlearning to more complex semantic levels. Abstract: Machine Unlearning (MU) has recently attracted considerable attention as a solution to privacy and copyright issues in large language models (LLMs). Existing MU methods aim to remove specific target sentences from an LLM while minimizing damage to unrelated knowledge. However, these approaches require explicit target sentences and do not support removing broader concepts, such as persons or events. To address this limitation, we introduce Concept Unlearning (CU) as a new requirement for LLM unlearning. We leverage knowledge graphs to represent the LLM's internal knowledge and define CU as removing the forgetting target nodes and associated edges. This graph-based formulation enables a more intuitive unlearning and facilitates the design of more effective methods. We propose a novel method that prompts the LLM to generate knowledge triplets and explanatory sentences about the forgetting target and applies the unlearning process to these representations. Our approach enables more precise and comprehensive concept removal by aligning the unlearning process with the LLM's internal knowledge representations. Experiments on real-world and synthetic datasets demonstrate that our method effectively achieves concept-level unlearning while preserving unrelated knowledge.

[29] Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models

Tomoya Yamashita,Akira Ito,Yuuki Yamanaka,Masanori Yamada,Takayuki Miura,Toshiki Shibahara

Main category: cs.CL

TL;DR: Proposes a new unlearning method for large language models that shifts the internal activations of a target entity toward those of unknown entities in a sparse autoencoder latent space, achieving genuine forgetting and avoiding the model collapse caused by conventional suppression methods.

Details Motivation: Most existing LLM unlearning methods suppress outputs, which does not remove the knowledge represented inside the model and easily leads to model collapse, so a method that achieves genuine forgetting is needed. Method: Defines forgetting as the state in which the internal activation of the target entity is indistinguishable from that of "unknown" entities, and introduces an unlearning objective in the sparse autoencoder latent space that moves the target's activation away from known entities and toward unknown ones. Result: Experiments show that the method effectively aligns the internal activation patterns of the forgotten target and significantly reduces the model's recall of target knowledge in question-answering tasks while maintaining performance on non-target knowledge. Conclusion: The proposed method achieves more thorough forgetting, avoids the over-suppression and model collapse of suppression-based methods, and offers a more reliable technical path for privacy and copyright protection in LLMs. Abstract: As large language models (LLMs) are increasingly deployed across various applications, privacy and copyright concerns have heightened the need for more effective LLM unlearning techniques. Many existing unlearning methods aim to suppress undesirable outputs through additional training (e.g., gradient ascent), which reduces the probability of generating such outputs. While such suppression-based approaches can control model outputs, they may not eliminate the underlying knowledge embedded in the model's internal activations; muting a response is not the same as forgetting it. Moreover, such suppression-based methods often suffer from model collapse. To address these issues, we propose a novel unlearning method that directly intervenes in the model's internal activations. In our formulation, forgetting is defined as a state in which the activation of a forgotten target is indistinguishable from that of "unknown" entities. Our method introduces an unlearning objective that modifies the activation of the target entity away from those of known entities and toward those of unknown entities in a sparse autoencoder latent space. By aligning the target's internal activation with those of unknown entities, we shift the model's recognition of the target entity from "known" to "unknown", achieving genuine forgetting while avoiding over-suppression and model collapse. Empirically, we show that our method effectively aligns the internal activations of the forgotten target, a result that the suppression-based approaches do not reliably achieve. Additionally, our method effectively reduces the model's recall of target knowledge in question-answering tasks without significant damage to the non-target knowledge.

[30] Multilingual LLM Prompting Strategies for Medical English-Vietnamese Machine Translation

Nhu Vo,Nu-Uyen-Phuong Le,Dung D. Le,Massimo Piccardi,Wray Buntine

Main category: cs.CL

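TL;DR: This paper systematically evaluates prompting strategies for six multilingual LLMs on medical English-Vietnamese translation. Model scale is the main driver of performance, while terminology-aware prompting and embedding-based example retrieval consistently improve domain-specific translation.

Details Motivation: Vietnamese is a low-resource language, and medical English-Vietnamese translation matters for healthcare communication yet is under-studied, so effective prompting strategies are needed to improve translation quality. Method: Evaluates six multilingual LLMs (0.5B-9B parameters) on the MedEV dataset, comparing zero-shot, few-shot, and dictionary-augmented prompting with the Meddict lexicon, together with terminology-aware cues and embedding-based example retrieval. Result: Larger models achieve better zero-shot results; few-shot prompting brings only limited gains, whereas terminology-aware cues and embedding-based retrieval clearly improve medical translation. Conclusion: Large models perform well zero-shot, but current multilingual LLMs still have limitations for medical English-Vietnamese translation; terminology augmentation helps in this specialized domain. Abstract: Medical English-Vietnamese machine translation (En-Vi MT) is essential for healthcare access and communication in Vietnam, yet Vietnamese remains a low-resource and under-studied language. We systematically evaluate prompting strategies for six multilingual LLMs (0.5B-9B parameters) on the MedEV dataset, comparing zero-shot, few-shot, and dictionary-augmented prompting with Meddict, an English-Vietnamese medical lexicon. Results show that model scale is the primary driver of performance: larger LLMs achieve strong zero-shot results, while few-shot prompting yields only marginal improvements. In contrast, terminology-aware cues and embedding-based example retrieval consistently improve domain-specific translation. These findings underscore both the promise and the current limitations of multilingual LLMs for medical En-Vi MT.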

[31] Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations

Linyang He,Qiaolin Wang,Xilin Jiang,Nima Mesgarani

Main category: cs.CL

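TL;DR: This study is the first to systematically evaluate how well a range of speech language models (SLMs) encode grammatical and semantic features, finding that they encode grammatical features more robustly than conceptual ones.

Details Motivation: Prior work has focused on how SLMs encode acoustic and phonetic features, while their ability to capture nuanced syntactic and semantic features remains unclear, so a systematic evaluation is needed. Method: Drawing on linguistic competence assessments for large language models, applies minimal pair designs and diagnostic feature analysis across 71 tasks, with layer-wise and time-resolved analysis of S3M, ASR, speech codec, and AudioLLM encoder models. Result: All speech encoders encode grammatical features more robustly than conceptual ones, and different models behave differently across linguistic levels. Conclusion: Speech language models handle grammatical structure comparatively well but remain weaker at capturing deeper semantic and conceptual information, which calls for further improvement. Abstract: Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. While existing research has examined how well SLMs encode shallow acoustic and phonetic features, the extent to which SLMs encode nuanced syntactic and conceptual features remains unclear. By drawing parallels with linguistic competence assessments for large language models, this study is the first to systematically evaluate the presence of contextual syntactic and semantic features across SLMs for self-supervised learning (S3M), automatic speech recognition (ASR), speech compression (codec), and as the encoder for auditory large language models (AudioLLMs). Through minimal pair designs and diagnostic feature analysis across 71 tasks spanning diverse linguistic levels, our layer-wise and time-resolved analysis uncovers that 1) all speech encode grammatical features more robustly than conceptual ones.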

[32] VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion

Dimitrios Damianos,Leon Voukoutis,Georgios Paraskevopoulos,Vassilis Katsouros

Main category: cs.CL

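TL;DR: Proposes a multimodal fusion framework that aligns speech with a large language model (LLM) by fusing Whisper's decoder states with the LLM in continuous text representation space, building a speech-enabled LLM and reaching state-of-the-art Greek speech recognition.

Details Motivation: To combine pre-trained language models with speech encoder-decoder architectures and build speech-enabled large language models, in particular to improve speech recognition for low-resource languages such as Greek. Method: Fuses the hidden decoder states of Whisper with those of an LLM through cross-modal attention in continuous text representation space, using an audio-conditioned intermediate text space instead of raw audio embeddings, and supports both offline and streaming modes. Result: Builds VoxKrikri, the first Greek speech LLM. Analysis shows the approach effectively aligns representations across modalities, with an average relative improvement of about 20% across benchmarks and state-of-the-art Greek speech recognition. Conclusion: Continuous space fusion is a promising path for multilingual and low-resource speech LLMs; the method improves speech recognition markedly while preserving generation ability. Abstract: We present a multimodal fusion framework that bridges pre-trained decoder-based large language models (LLMs) and acoustic encoder-decoder architectures such as Whisper, with the aim of building speech-enabled LLMs. Instead of directly using audio embeddings, we explore an intermediate audio-conditioned text space as a more effective mechanism for alignment. Our method operates fully in continuous text representation spaces, fusing Whisper's hidden decoder states with those of an LLM through cross-modal attention, and supports both offline and streaming modes. We introduce VoxKrikri, the first Greek speech LLM, and show through analysis that our approach effectively aligns representations across modalities. These results highlight continuous space fusion as a promising path for multilingual and low-resource speech LLMs, while achieving state-of-the-art results for Automatic Speech Recognition in Greek, providing an average ~20% relative improvement across benchmarks.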

[33] Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment

Ke Wang,Wenning Wei,Yan Deng,Lei He,Sheng Zhao

Main category: cs.CL

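TL;DR: This study investigates large multimodal models (LMMs) for automatic pronunciation assessment (APA). Fine-tuned models perform well at the word and sentence levels, but phoneme-level assessment remains challenging. The Pearson Correlation Coefficient (PCC) is high while Spearman's rank Correlation Coefficient (SCC) is lower, suggesting that SCC is the better measure of ordinal consistency.

Details Motivation: To explore the potential of large multimodal models for multi-granularity pronunciation assessment and to address the open question of how effective they are on fine-grained tasks. Method: Fine-tunes large multimodal models on the Speechocean762 dataset and a private corpus, and evaluates them at different levels of granularity. Result: The fine-tuned models significantly outperform zero-shot settings and perform well at the word and sentence levels; PCC reaches 0.9 while SCC stays around 0.6, and phoneme-level assessment remains challenging. Conclusion: LMMs show promise for APA, especially at coarser granularities, but further work is needed on fine-grained modeling and on more suitable rank-aware evaluation metrics. Abstract: Automatic Pronunciation Assessment (APA) is critical for Computer-Assisted Language Learning (CALL), requiring evaluation across multiple granularities and aspects. Large Multimodal Models (LMMs) present new opportunities for APA, but their effectiveness in fine-grained assessment remains uncertain. This work investigates fine-tuning LMMs for APA using the Speechocean762 dataset and a private corpus. Fine-tuning significantly outperforms zero-shot settings and achieves competitive results on single-granularity tasks compared to public and commercial systems. The model performs well at word and sentence levels, while phoneme-level assessment remains challenging. We also observe that the Pearson Correlation Coefficient (PCC) reaches 0.9, whereas Spearman's rank Correlation Coefficient (SCC) remains around 0.6, suggesting that SCC better reflects ordinal consistency. These findings highlight both the promise and limitations of LMMs for APA and point to future work on fine-grained modeling and rank-aware evaluation.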
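The gap between PCC and SCC can look surprising, so here is a small sketch of how it can arise. The scores below are invented for illustration and are not the paper's data: the model separates weak from strong speakers correctly, which keeps Pearson high, but scrambles the order inside each group, which pulls Spearman down.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical human and model scores for six utterances.
human = [1.0, 1.2, 1.4, 9.0, 9.2, 9.4]
model = [1.4, 1.2, 1.0, 9.4, 9.2, 9.0]

pcc = pearson(human, model)
scc = spearman(human, model)
```

On this toy data `pcc` comes out above 0.99 while `scc` is about 0.54, so a model can track the coarse score range closely and still order neighbouring items poorly.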

[34] Once Upon a Time: Interactive Learning for Storytelling with Small Language Models

Jonas Mayer Martins,Ali Hamza Bashir,Muhammad Rehan Khalid,Lisa Beinborn

Main category: cs.CL

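TL;DR: By adding high-level feedback that resembles human interaction, a language model improves its storytelling substantially with little data, showing higher data efficiency than conventional pre-training.

Details Motivation: Inspired by how children learn language through social interaction, the work asks whether language models can reduce their dependence on large-scale text by combining high-level, cognitively inspired feedback with interactive learning. Method: Uses a teacher-student framework in which the student model generates stories and the teacher model rates them on readability, narrative coherence, and creativity, and compares the effect of interactive learning after different amounts of pre-training. Result: With only 1 million words of interactive learning, storytelling ability improves as much as with 410 million words of next-word prediction pre-training, showing the data efficiency of high-level feedback. Conclusion: Feedback modeled on human interaction can greatly improve the learning efficiency of language models and offers a new route to reducing training data requirements. Abstract: Children efficiently acquire language not just by listening, but by interacting with others in their social environment. Conversely, large language models are typically trained with next-word prediction on massive amounts of text. Motivated by this contrast, we investigate whether language models can be trained with less data by learning not only from next-word prediction but also from high-level, cognitively inspired feedback. We train a student model to generate stories, which a teacher model rates on readability, narrative coherence, and creativity. By varying the amount of pretraining before the feedback loop, we assess the impact of this interactive learning on formal and functional linguistic competence. We find that the high-level feedback is highly data efficient: With just 1 M words of input in interactive learning, storytelling skills can improve as much as with 410 M words of next-word prediction.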

[35] REFER: Mitigating Bias in Opinion Summarisation via Frequency Framed Prompting

Nannan Huang,Haytham M. Fayek,Xiuzhen Zhang

Main category: cs.CL

TL;DR: This study proposes frequency framed prompting (REFER), which borrows frequency-based representations from cognitive science to improve the fairness of large language models in opinion summarisation, with the strongest effect in larger models and under stronger reasoning instructions.

Details Motivation: Existing methods rely on hyperparameter tuning or on supplying distributional information, both of which are limited in practice, so a user-friendly approach that needs no extra information is required to improve fairness in opinion summarisation. Method: Inspired by cognitive science, uses frequency framed prompting (REFER) in place of abstract probabilistic representations and runs systematic experiments across prompting frameworks to assess its effect on the fairness of LLM opinion summarisation. Result: Experiments show that REFER improves the fairness of LLMs in opinion summarisation, and the effect is stronger in larger models and with stronger reasoning instructions. Conclusion: Frequency framed prompting (REFER) is an effective and practical way to improve the fairness of LLM opinion summarisation without ground-truth distributional information. Abstract: Individuals express diverse opinions; a fair summary should represent these viewpoints comprehensively. Previous research on fairness in opinion summarisation using large language models (LLMs) relied on hyperparameter tuning or providing ground truth distributional information in prompts. However, these methods face practical limitations: end-users rarely modify default model parameters, and accurate distributional information is often unavailable. Building upon cognitive science research demonstrating that frequency-based representations reduce systematic biases in human statistical reasoning by making reference classes explicit and reducing cognitive load, this study investigates whether frequency framed prompting (REFER) can similarly enhance fairness in LLM opinion summarisation. Through systematic experimentation with different prompting frameworks, we adapted techniques known to improve human reasoning to elicit more effective information processing in language models compared to abstract probabilistic representations. Our results demonstrate that REFER enhances fairness in language models when summarising opinions. This effect is particularly pronounced in larger language models and when using stronger reasoning instructions.

[36] Can LLMs Judge Debates? Evaluating Non-Linear Reasoning via Argumentation Theory Semantics

Reza Sanayei,Srdjan Vesic,Eduardo Blanco,Mihai Surdeanu

Main category: cs.CL

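TL;DR: This paper studies how large language models (LLMs) handle non-linear structured reasoning, such as the argument graphs found in natural debates, using Quantitative Argumentation Debate (QuAD) semantics to evaluate how well LLMs rank arguments by acceptability. Advanced prompting strategies help reduce biases from input length and order, but performance drops on long inputs or when the discourse flow is disrupted, indicating that LLMs remain limited at modeling formal argumentation.

Details Motivation: LLMs excel at linear reasoning but are underexplored on non-linear structures such as the argument graphs in debates. The paper evaluates whether LLMs can approximate structured reasoning from Computational Argumentation Theory (CAT). Method: Uses QuAD semantics, which assigns acceptability scores from the attack and support relations between arguments. Models receive only dialogue-formatted debates, without the underlying graph, and several LLMs are tested on ranking arguments under advanced prompting strategies such as Chain-of-Thought and In-Context Learning. Result: LLMs show moderate agreement with QuAD rankings, but performance degrades with longer inputs or disrupted discourse structure; advanced prompting helps reduce biases related to argument length and position. Conclusion: LLMs show some potential for modeling formal argumentation semantics but are limited on complex non-linear structures, motivating future work on graph-aware reasoning. Abstract: Large Language Models (LLMs) excel at linear reasoning tasks but remain underexplored on non-linear structures such as those found in natural debates, which are best expressed as argument graphs. We evaluate whether LLMs can approximate structured reasoning from Computational Argumentation Theory (CAT). Specifically, we use Quantitative Argumentation Debate (QuAD) semantics, which assigns acceptability scores to arguments based on their attack and support relations. Given only dialogue-formatted debates from two NoDE datasets, models are prompted to rank arguments without access to the underlying graph. We test several LLMs under advanced instruction strategies, including Chain-of-Thought and In-Context Learning. While models show moderate alignment with QuAD rankings, performance degrades with longer inputs or disrupted discourse flow. Advanced prompting helps mitigate these effects by reducing biases related to argument length and position. Our findings highlight both the promise and limitations of LLMs in modeling formal argumentation semantics and motivate future work on graph-aware reasoning.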
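For readers unfamiliar with QuAD, the sketch below shows one published formulation of how acceptability scores can be derived from attack and support relations on an acyclic argument graph. It is an assumption that this is the exact variant used in the paper, since the abstract does not spell it out; the graph layout, the base scores, and the function names are hypothetical.

```python
def aggregate(base, child_scores, attack):
    """Fold child scores into the base score, one child at a time.

    An attacker with score s scales the value down by (1 - s);
    a supporter with score s closes a fraction s of the gap to 1.
    """
    value = base
    for s in child_scores:
        value = value * (1 - s) if attack else value + (1 - value) * s
    return value

def quad_score(arg, graph):
    """Recursively score `arg`; assumes the graph has no cycles."""
    node = graph[arg]
    att = [quad_score(a, graph) for a in node.get("attackers", [])]
    sup = [quad_score(s, graph) for s in node.get("supporters", [])]
    base = node["base"]
    if not att and not sup:
        return base
    if not sup:
        return aggregate(base, att, attack=True)
    if not att:
        return aggregate(base, sup, attack=False)
    # Both kinds of children present: average the two aggregated values.
    return (aggregate(base, att, True) + aggregate(base, sup, False)) / 2

# Hypothetical debate: one claim, one supporter, one attacker that is itself attacked.
graph = {
    "claim": {"base": 0.5, "attackers": ["a1"], "supporters": ["s1"]},
    "a1": {"base": 0.5, "attackers": ["a2"]},
    "a2": {"base": 0.8},
    "s1": {"base": 0.5},
}
scores = {name: quad_score(name, graph) for name in graph}
ranking = sorted(graph, key=lambda name: scores[name], reverse=True)
```

A ranking produced this way is the kind of reference ordering an LLM's output could be compared against, for example with a rank correlation; the paper's actual comparison protocol is not described in the abstract.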

[37] UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression

Chenlong Deng,Zhisong Zhang,Kelong Mao,Shuaiyi Li,Tianqing Fang,Hongming Zhang,Haitao Mi,Dong Yu,Zhicheng Dou

Main category: cs.CL

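TL;DR: UniGist is a sequence-level long-context compression framework that efficiently preserves context information by replacing raw tokens with compression tokens (gists) in a fine-grained manner, supporting flexible inference and real memory savings.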

Details Motivation: When LLMs process long-context inputs, the memory overhead of the KV cache becomes a major deployment bottleneck, and existing sequence-level compression methods can lose important contextual information. Method: Proposes the UniGist framework, which adopts a chunk-free training strategy, designs an efficient kernel with a gist shift trick to achieve fine-grained context compression, and supports actual removal of compressed tokens to save memory. Result: Experiments on multiple long-context tasks show that UniGist significantly improves compression quality, with especially strong performance on detail-recall tasks and long-range dependency modeling. Conclusion: UniGist balances memory efficiency and model performance, offering an efficient sequence-level compression solution for long-context processing in LLMs. Abstract: Large language models are increasingly capable of handling long-context inputs, but the memory overhead of key-value (KV) cache remains a major bottleneck for general-purpose deployment. While various compression strategies have been explored, sequence-level compression, which drops the full KV caches for certain tokens, is particularly challenging as it can lead to the loss of important contextual information. To address this, we introduce UniGist, a sequence-level long-context compression framework that efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner. We adopt a chunk-free training strategy and design an efficient kernel with a gist shift trick, enabling optimized GPU training. Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings. Experiments across multiple long-context tasks demonstrate that UniGist significantly improves compression quality, with especially strong performance in detail-recalling tasks and long-range dependency modeling.

[38] UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations

Qiuyang Lu,Fangjian Shen,Zhengkai Tang,Qiang Liu,Hexuan Cheng,Hui Liu,Wushao Wen

Main category: cs.CL

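TL;DR: Proposes an end-to-end solution for building a large-scale, reproducible multilingual parallel corpus from United Nations documents and introduces a new Graph-Aided Paragraph Alignment (GAPA) algorithm. The resulting corpus contains over 713 million English tokens and, to the authors' knowledge, is the largest publicly available human-translated parallel corpus.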

Details Motivation: Addresses the opaque processes, difficulty of reproduction, and limited scale of earlier multilingual corpora built from United Nations documents. Method: Acquires data via web scraping, aligns text with the newly proposed Graph-Aided Paragraph Alignment (GAPA) algorithm, and provides a fully reproducible end-to-end pipeline with optional distributed computing for scalability. Result: Builds a parallel corpus with over 713 million English tokens, more than double the scale of prior work; to the authors' knowledge, it is the largest publicly available multilingual parallel corpus of human-translated, non-AI-generated content. Conclusion: The method substantially increases the scale and reproducibility of multilingual datasets and provides a high-quality, open resource for machine translation research. Abstract: The quality and accessibility of multilingual datasets are crucial for advancing machine translation. However, previous corpora built from United Nations documents have suffered from issues such as opaque process, difficulty of reproduction, and limited scale. To address these challenges, we introduce a complete end-to-end solution, from data acquisition via web scraping to text alignment. The entire process is fully reproducible, with a minimalist single-machine example and optional distributed computing steps for scalability. At its core, we propose a new Graph-Aided Paragraph Alignment (GAPA) algorithm for efficient and flexible paragraph-level alignment. The resulting corpus contains over 713 million English tokens, more than doubling the scale of prior work. To the best of our knowledge, this represents the largest publicly available parallel corpus composed entirely of human-translated, non-AI-generated content. Our code and corpus are accessible under the MIT License.

[39] RAVE: Retrieval and Scoring Aware Verifiable Claim Detection

Yufeng Li,Arkaitz Zubiaga

Main category: cs.CL

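TL;DR: Proposes RAVE, a framework that combines evidence retrieval with structured signals of relevance and source credibility to detect verifiable claims, outperforming existing methods on multiple test sets.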

Details Motivation: Misinformation spreads rapidly on social media, creating a need for scalable fact-checking tools, while existing methods struggle with vague political discourse and diverse formats such as tweets. Method: Proposes the RAVE framework, which combines evidence retrieval with structured signals of relevance and source credibility to identify verifiable claims. Result: Experiments on the CT22-test and PoliClaim-test datasets show that RAVE outperforms text-only and retrieval-based baselines in both accuracy and F1. Conclusion: By fusing retrieval with structured signals, RAVE improves verifiable claim detection across diverse text formats and vague contexts. Abstract: The rapid spread of misinformation on social media underscores the need for scalable fact-checking tools. A key step is claim detection, which identifies statements that can be objectively verified. Prior approaches often rely on linguistic cues or claim check-worthiness, but these struggle with vague political discourse and diverse formats such as tweets. We present RAVE (Retrieval and Scoring Aware Verifiable Claim Detection), a framework that combines evidence retrieval with structured signals of relevance and source credibility. Experiments on CT22-test and PoliClaim-test show that RAVE consistently outperforms text-only and retrieval-based baselines in both accuracy and F1.

[40] Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning

Sara Rajaee,Rochelle Choenni,Ekaterina Shutova,Christof Monz

Main category: cs.CL

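TL;DR: Investigates how reasoning ability differs across languages in multilingual LLMs and whether languages complement each other, and shows that a cross-lingual reward model substantially improves mathematical reasoning performance.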

Details Motivation: To explore how reasoning ability differs across languages in multilingual LLMs and whether different languages are complementary. Method: Trains a cross-lingual reward model to rank responses generated in different languages and compares it with reward modeling within a single language. Result: The cross-lingual reward model substantially improves mathematical reasoning performance, and cross-lingual sampling particularly benefits English under low sampling budgets. Conclusion: Exploiting the complementary strengths of different languages can improve multilingual reasoning. Abstract: While the reasoning abilities of large language models (LLMs) continue to advance, it remains unclear how such ability varies across languages in multilingual LLMs and whether different languages produce reasoning paths that complement each other. To investigate this question, we train a reward model to rank generated responses for a given question across languages. Our results show that our cross-lingual reward model substantially improves mathematical reasoning performance compared to using reward modeling within a single language, benefiting even high-resource languages. While English often exhibits the highest performance in multilingual models, we find that cross-lingual sampling particularly benefits English under low sampling budgets. Our findings reveal new opportunities to improve multilingual reasoning by leveraging the complementary strengths of diverse languages.
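
The selection step described above can be sketched as a best-of-N choice over a candidate pool drawn from several languages. This is only an illustration of the general idea: the sampler and reward functions below are hypothetical stand-ins (the paper trains a real reward model), and the per-language budget split is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: a sampler returns up to n responses for a question,
# and a reward function scores one (question, response) pair.
Sampler = Callable[[str, int], List[str]]
Reward = Callable[[str, str], float]


def best_across_languages(
    question: str,
    samplers: Dict[str, Sampler],
    reward: Reward,
    budget_per_language: int,
) -> Tuple[str, str]:
    """Pool candidates from every language and return the top-scoring (language, response)."""
    candidates: List[Tuple[str, str]] = []
    for lang, sample in samplers.items():
        for response in sample(question, budget_per_language):
            candidates.append((lang, response))
    # A single reward model ranks all candidates, regardless of language.
    return max(candidates, key=lambda c: reward(question, c[1]))


# Toy stand-ins so the sketch runs end to end.
toy_samplers: Dict[str, Sampler] = {
    "en": lambda q, n: ["The answer is 41", "The answer is 40"][:n],
    "de": lambda q, n: ["Die Antwort ist 42", "Die Antwort ist 44"][:n],
}
toy_reward: Reward = lambda q, r: 1.0 if r.endswith("42") else 0.0

best = best_across_languages("What is 6 * 7?", toy_samplers, toy_reward, budget_per_language=2)
```

Here the English samples are both wrong, so pooling in German candidates lets the selector recover a correct answer. The abstract reports this kind of complementarity as most helpful under low sampling budgets.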

[41] The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders

Adrian Sauter,Willem Zuidema,Marianne de Heer Kloots

Main category: cs.CL

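TL;DR: The effect of visual information during training on language processing differs markedly between speech-based and text-based deep learning models. Visual grounding increases alignment between speech and text representations, but it mainly strengthens the encoding of word identity rather than meaning and does not improve semantic discriminability in speech models.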

Details Motivation: To investigate how visual information affects model-internal word representations in audio- and text-based language models, and to compare the role of visual grounding across modalities. Method: Uses global representational comparisons and targeted clustering analyses to assess how visual grounding affects word identity, phonetic discriminability, and semantic discriminability in speech and text models. Result: Visual grounding increases alignment between speech and text representations, but this is driven mainly by word-identity encoding; speech models remain phonetically dominated under visual grounding and show no gain in semantic discriminability, whereas text models show a different pattern. Conclusion: Visual grounding affects speech and text models through different mechanisms, and more effective methods are needed to bring visual semantics into speech models. Abstract: How does visual information included in training affect language processing in audio- and text-based deep learning models? We explore how such visual grounding affects model-internal representations of words, and find substantially different effects in speech- vs. text-based language encoders. Firstly, global representational comparisons reveal that visual grounding increases alignment between representations of spoken and written language, but this effect seems mainly driven by enhanced encoding of word identity rather than meaning. We then apply targeted clustering analyses to probe for phonetic vs. semantic discriminability in model representations. Speech-based representations remain phonetically dominated with visual grounding, but in contrast to text-based representations, visual grounding does not improve semantic discriminability. Our findings could usefully inform the development of more efficient methods to enrich speech-based models with visually-informed semantics.

[42] Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems

Zhongze Luo,Zhenshuai Yin,Yongxin Guo,Zhichao Wang,Jionghao Zhu,Xiaoying Tang

Main category: cs.CL

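TL;DR: Introduces Multi-Physics, a multimodal benchmark for Chinese physics reasoning with 1,412 image-associated multiple-choice questions covering 11 high-school physics subjects at 5 difficulty levels, for evaluating the fine-grained reasoning of multimodal LLMs in physics.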

Details Motivation: Existing benchmarks in scientific domains such as physics have incomplete subject coverage, neglect the step-by-step reasoning process, are predominantly English-centric, and do not systematically evaluate the role of visual information, which limits comprehensive assessment of multimodal LLM reasoning. Method: Builds the Multi-Physics benchmark of Chinese physics questions across multiple difficulty levels and subjects, and adopts a dual evaluation framework that scores both final-answer accuracy and the step-by-step integrity of the chain-of-thought; the effect of difficulty and visual information is analyzed by changing the input mode. Result: An evaluation of 20 MLLMs shows that performance varies considerably with difficulty level and with whether visual information is used, indicating that the benchmark exposes strengths and weaknesses in multimodal reasoning. Conclusion: Multi-Physics offers a fine-grained, systematic evaluation tool for Chinese physics reasoning that supports both evaluating multimodal models in scientific domains and analyzing their reasoning process; the dataset and code are open-sourced. Abstract: While multimodal LLMs (MLLMs) demonstrate remarkable reasoning progress, their application in specialized scientific domains like physics reveals significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric, failing to systematically evaluate the role of visual information. Therefore, we introduce Multi-Physics for Chinese physics reasoning, a comprehensive benchmark that includes 5 difficulty levels, featuring 1,412 image-associated, multiple-choice questions spanning 11 high-school physics subjects. We employ a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final answer accuracy and the step-by-step integrity of their chain-of-thought. Furthermore, we systematically study the impact of difficulty level and visual information by comparing the model performance before and after changing the input mode. Our work provides not only a fine-grained resource for the community but also offers a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs, and our dataset and code have been open-sourced: https://github.com/luozhongze/Multi-Physics.

[43] Distribution-Aligned Decoding for Efficient LLM Task Adaptation

Senkang Hu,Xudong Han,Jinqi Jiang,Yihang Tao,Zihan Fang,Sam Tak Wu Kwong,Yuguang Fang

Main category: cs.CL

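TL;DR: Proposes Steering Vector Decoding (SVD), a lightweight method that recasts task adaptation as output-distribution alignment and adjusts the model's output distribution directly during decoding with a steering vector. It is shown to be first-order equivalent to the gradient step of full fine-tuning and improves performance across several tasks.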

Details Motivation: Adapting large models remains costly even with parameter-efficient fine-tuning (PEFT), so a lighter yet effective way to improve downstream performance is needed. Method: Runs a short warm-start fine-tune, extracts a task-aware steering vector from the KL-divergence gradient between the output distributions of the warm-started and pre-trained models, and uses that vector to guide the output distribution during decoding. Result: Across three tasks and nine benchmarks, SVD paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 points and open-ended truthfulness by 2 points, with gains of 1-2 points on commonsense datasets, without adding trainable parameters beyond the PEFT adapter. Conclusion: SVD offers a lightweight, theoretically grounded, PEFT-compatible route to task adaptation that improves performance without extra parameters. Abstract: Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVD), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model's output distribution towards the task distribution. We theoretically prove that SVD is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVD paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 points and open-ended truthfulness by 2 points, with similar gains (1-2 points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVD thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.

[44] The Psychology of Falsehood: A Human-Centric Survey of Misinformation Detection

Arghodeep Nandi,Megha Sundriyal,Euna Mehnaz Khan,Jikai Sun,Emily Vraga,Jaideep Srivastava,Tanmoy Chakraborty

Main category: cs.CL

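TL;DR: This survey reviews the limitations of current automated fact-checking systems and argues that misinformation detection frameworks should incorporate human psychological and behavioral factors such as cognitive biases, social dynamics, and emotional responses.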

Details Motivation: Existing systems focus mainly on factual accuracy, but the impact of misinformation goes beyond truth versus falsehood and involves human cognitive and emotional responses, so more human-centered detection methods are needed. Method: Analyzes state-of-the-art misinformation detection systems through a psychological lens, examining the interplay between traditional fact-checking and cognitive biases, social dynamics, and emotional responses. Result: Identifies key limitations of current methods and proposes future research directions such as neuro-behavioural models that integrate technological factors with human cognition and social influence. Conclusion: Detection frameworks that account for human psychology and behavior are a promising way to identify and mitigate the societal harms of misinformation more effectively. Abstract: Misinformation remains one of the most significant issues in the digital age. While automated fact-checking has emerged as a viable solution, most current systems are limited to evaluating factual accuracy. However, the detrimental effect of misinformation transcends simple falsehoods; it takes advantage of how individuals perceive, interpret, and emotionally react to information. This underscores the need to move beyond factuality and adopt more human-centered detection frameworks. In this survey, we explore the evolving interplay between traditional fact-checking approaches and psychological concepts such as cognitive biases, social dynamics, and emotional responses. By analyzing state-of-the-art misinformation detection systems through the lens of human psychology and behavior, we reveal critical limitations of current methods and identify opportunities for improvement. Additionally, we outline future research directions aimed at creating more robust and adaptive frameworks, such as neuro-behavioural models that integrate technological factors with the complexities of human cognition and social influence. These approaches offer promising pathways to more effectively detect and mitigate the societal harms of misinformation.

[45] Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions

Frederic Kirstein,Sonu Kumar,Terry Ruas,Bela Gipp

Main category: cs.CL

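TL;DR: Proposes FRAME and SCOPE to improve the accuracy and personalization of meeting summaries, together with the P-MESA evaluation framework for more reliable reference-free summary evaluation.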

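Details Motivation: LLM-based meeting summarization is prone to hallucinations, omissions, and irrelevant content, and offers little adaptation to or control for the target reader. Method: Designs FRAME, a modular pipeline that treats summarization as a semantic enrichment task: it extracts and scores salient facts, organizes them thematically, and uses them to enrich an outline into an abstractive summary. Introduces SCOPE, a reasoning protocol that builds a reasoning trace from nine questions to personalize the summary, and proposes P-MESA, a multi-dimensional reference-free evaluation framework that assesses whether a summary fits the target reader. Result: On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), and SCOPE improves knowledge fit and goal alignment over prompt-only baselines; P-MESA reaches at least 89% balanced accuracy against human annotations and aligns strongly with human severity ratings (r >= 0.70). Conclusion: Rethinking the summarization task improves control, faithfulness, and personalization; FRAME, SCOPE, and P-MESA together move meeting summarization toward more reliable, reader-adapted output. Abstract: Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses these to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace by answering nine questions before content selection. For evaluation, we propose P-MESA, a multi-dimensional, reference-free evaluation framework to assess if a summary fits a target reader. P-MESA reliably identifies error instances, achieving >= 89% balanced accuracy against human annotations and strongly aligns with human severity ratings (r >= 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines. Our findings advocate for rethinking summarization to improve control, faithfulness, and personalization.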

[46] Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment

Ahmed Karim,Qiao Wang,Zheng Yuan

Main category: cs.CL

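TL;DR: To the authors' knowledge, this is the first study to combine conformal prediction with uncertainty-aware accuracy (UAcc) for Automated Essay Scoring (AES). Two open-source LLMs (Llama-3 8B and Qwen-2.5 3B) are fine-tuned on three diverse corpora and calibrated at a 90 percent risk level; they meet the coverage target while keeping prediction sets compact, suggesting that mid-sized open-source models can already support teacher-in-the-loop AES.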

Details Motivation: Existing AES systems approach human-level scoring agreement but provide no confidence measure or explanation, which limits their use in high-stakes examinations. Method: Applies conformal prediction, a distribution-free post-hoc wrapper that gives any classifier set-valued outputs with formal coverage guarantees, and fine-tunes Llama-3 8B and Qwen-2.5 3B on the ASAP, TOEFL11, and Cambridge-FCE datasets, calibrating at a 90 percent risk level. Result: The calibrated models consistently meet the target coverage while keeping prediction sets compact; evaluation with UAcc shows the models are both accurate and concise. Conclusion: Mid-sized open-source LLMs combined with conformal prediction can support trustworthy, interpretable teacher-in-the-loop AES systems and have potential for practical deployment. Abstract: Automated Essay Scoring (AES) systems now reach near human agreement on some public benchmarks, yet real-world adoption, especially in high-stakes examinations, remains limited. A principal obstacle is that most models output a single score without any accompanying measure of confidence or explanation. We address this gap with conformal prediction, a distribution-free wrapper that equips any classifier with set-valued outputs and formal coverage guarantees. Two open-source large language models (Llama-3 8B and Qwen-2.5 3B) are fine-tuned on three diverse corpora (ASAP, TOEFL11, Cambridge-FCE) and calibrated at a 90 percent risk level. Reliability is assessed with UAcc, an uncertainty-aware accuracy that rewards models for being both correct and concise. To our knowledge, this is the first work to combine conformal prediction and UAcc for essay scoring. The calibrated models consistently meet the coverage target while keeping prediction sets compact, indicating that open-source, mid-sized LLMs can already support teacher-in-the-loop AES; we discuss scaling and broader user studies as future work.
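
For readers unfamiliar with the wrapper, the following is a minimal sketch of standard split conformal prediction for classification, which is one common way to obtain set-valued outputs with a coverage guarantee. It is not taken from the paper: the nonconformity score (one minus the probability of the true label), the function names, and the toy calibration data are assumptions, and the sketch reads the abstract's "90 percent" as a 90% target coverage (alpha = 0.1).

```python
import math
from typing import List


def conformal_threshold(cal_true_probs: List[float], alpha: float = 0.1) -> float:
    """Calibrate a score threshold from held-out examples.

    cal_true_probs[i] is the probability the model gave to the TRUE score band
    of calibration essay i. The nonconformity score is 1 - that probability.
    """
    scores = sorted(1.0 - p for p in cal_true_probs)
    n = len(scores)
    # Finite-sample corrected rank; round() guards against floating-point noise.
    k = math.ceil(round((n + 1) * (1 - alpha), 9))
    if k > n:
        return float("inf")  # too few calibration points: every label is kept
    return scores[k - 1]


def prediction_set(probs: List[float], qhat: float) -> List[int]:
    """Return every score band whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= qhat]


# Toy calibration data (hypothetical): 9 essays, probability assigned to the true band.
cal_true_probs = [0.9, 0.8, 0.7, 0.6, 0.85, 0.75, 0.65, 0.95, 0.5]
qhat = conformal_threshold(cal_true_probs, alpha=0.1)

confident = prediction_set([0.6, 0.3, 0.1], qhat)   # one clear band
uncertain = prediction_set([0.5, 0.5, 0.0], qhat)   # two plausible bands
```

A confident essay yields a single band, while an ambiguous one yields a larger set that a teacher could review. Set size is the "conciseness" that a metric like UAcc is described as rewarding.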

[47] Localmax dynamics for attention in transformers and its asymptotic behavior

Henri Cimetière,Maria Teresa Chiri,Bahman Gharesifard

Main category: cs.CL

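TL;DR: Proposes localmax dynamics, a new discrete-time attention model that sits between classic softmax and hardmax; it introduces an alignment-sensitivity parameter that relaxes neighborhood interactions, and the paper analyzes its convergence behavior and invariant sets.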

Details Motivation: To build a more flexible model interpolating between softmax and hardmax attention, overcoming the rigid weight assignment of hardmax while retaining locality and the maximal-influence principle. Method: Designs the localmax dynamics on top of a discrete-time attention mechanism, with one parameter controlling neighbor influence and an alignment-sensitivity parameter; analyzes convergence with Lyapunov-based methods and introduces "quiescent sets" to describe the invariant behavior of tokens near vertices. Result: Proves that the convex hull of the token states still converges to a convex polytope, though its structure can no longer be described by a maximal alignment set alone; introduces quiescent sets to characterize the asymptotic behavior; shows that localmax does not exhibit finite-time convergence and recovers the limiting behavior of hardmax under time-varying alignment-sensitivity parameters. Conclusion: Localmax dynamics offers richer modeling capacity for attention, balancing the strengths and weaknesses of soft and hard attention and applying to time-varying parameters, but classical Lyapunov methods have analytical limitations under asymmetric interactions and call for further study. Abstract: We introduce a new discrete-time attention model, termed the localmax dynamics, which interpolates between the classic softmax dynamics and the hardmax dynamics, where only the tokens that maximize the influence toward a given token have a positive weight. As in hardmax, uniform weights are determined by a parameter controlling neighbor influence, but the key extension lies in relaxing neighborhood interactions through an alignment-sensitivity parameter, which allows controlled deviations from pure hardmax behavior. As we prove, while the convex hull of the token states still converges to a convex polytope, its structure can no longer be fully described by a maximal alignment set, prompting the introduction of quiescent sets to capture the invariant behavior of tokens near vertices. We show that these sets play a key role in understanding the asymptotic behavior of the system, even under time-varying alignment sensitivity parameters. We further show that localmax dynamics does not exhibit finite-time convergence and provide results for vanishing, nonzero, time-varying alignment-sensitivity parameters, recovering the limiting behavior of hardmax as a by-product. Finally, we adapt Lyapunov-based methods from classical opinion dynamics, highlighting their limitations in the asymmetric setting of localmax interactions and outlining directions for future research.

[48] BEFT: Bias-Efficient Fine-Tuning of Language Models

Baichuan Huang,Ananth Balashankar,Amir Aminifar

Main category: cs.CL

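TL;DR: Proposes bias-efficient fine-tuning (BEFT), which systematically selects which bias terms to fine-tune and outperforms other bias-selection approaches across a range of LLMs and downstream tasks.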

Details Motivation: Existing bias fine-tuning methods offer little guidance on which bias terms matter most, limiting their potential in low-data regimes. Method: Proposes an approach that analyzes the effect of different bias terms (such as those in the query, key, or value projections) to select the bias term to fine-tune. Result: The approach is validated on LLMs from 110M to 6.7B parameters across classification, multiple-choice, and generation tasks, outperforming existing bias-selection approaches. Conclusion: BEFT uses parameters more efficiently and improves fine-tuning results, providing a clear and effective strategy for choosing bias terms. Abstract: Fine-tuning all-bias-terms stands out among various parameter-efficient fine-tuning (PEFT) techniques, owing to its out-of-the-box usability and competitive performance, especially in low-data regimes. Bias-only fine-tuning has the potential for unprecedented parameter efficiency. However, the link between fine-tuning different bias terms (i.e., bias terms in the query, key, or value projections) and downstream performance remains unclear. The existing approaches, e.g., based on the magnitude of bias change or empirical Fisher information, provide limited guidance for selecting the particular bias term for effective fine-tuning. In this paper, we propose an approach for selecting the bias term to be fine-tuned, forming the foundation of our bias-efficient fine-tuning (BEFT). We extensively evaluate our bias-efficient approach against other bias-selection approaches, across a wide range of large language models (LLMs) spanning encoder-only and decoder-only architectures from 110M to 6.7B parameters. Our results demonstrate the effectiveness and superiority of our bias-efficient approach on diverse downstream tasks, including classification, multiple-choice, and generation tasks.

[49] Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning

Hong-Yun Lin,Jhen-Ke Lin,Chung-Chun Wang,Hao-Chien Lu,Berlin Chen

Main category: cs.CL

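TL;DR: Proposes a single-pass, session-level spoken language assessment method built on a multimodal foundation model. It combines multi-target learning with a frozen Whisper ASR speech prior to assess oral proficiency end to end without handcrafted features, outperforming the previous cascaded system on the Speak & Improve benchmark.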

Details Motivation: Existing spoken language assessment methods either rely on cascaded pipelines that are prone to error propagation or use end-to-end models over short audio windows that can miss discourse-level evidence, so a robust method that models the whole session is needed. Method: Proposes a multimodal foundation model approach that couples multi-target learning with a frozen Whisper ASR model as a speech prior for acoustic-aware calibration over an L2 learner's entire response session, jointly learning holistic and trait-level assessment objectives. Result: Experiments on the Speak & Improve benchmark show that the approach outperforms the previous state-of-the-art cascaded system, generalizes robustly across parts, and yields a compact, deployable grader suited to Computer Assisted Language Learning (CALL). Conclusion: By processing the full session in a single pass and incorporating an ASR prior, the approach improves the holistic quality and robustness of spoken language assessment and offers an efficient, accurate solution for CALL applications. Abstract: Spoken Language Assessment (SLA) estimates a learner's oral proficiency from spontaneous speech. The growing population of L2 English speakers has intensified the demand for reliable SLA, a critical component of Computer Assisted Language Learning (CALL). Existing efforts often rely on cascaded pipelines, which are prone to error propagation, or end-to-end models that often operate on a short audio window, which might miss discourse-level evidence. This paper introduces a novel multimodal foundation model approach that performs session-level evaluation in a single pass. Our approach couples multi-target learning with a frozen, Whisper ASR model-based speech prior for acoustic-aware calibration, allowing for jointly learning holistic and trait-level objectives of SLA without resorting to handcrafted features. By coherently processing the entire response session of an L2 speaker, the model excels at predicting holistic oral proficiency. Experiments conducted on the Speak & Improve benchmark demonstrate that our proposed approach outperforms the previous state-of-the-art cascaded system and exhibits robust cross-part generalization, producing a compact deployable grader that is tailored for CALL applications.

[50] Think, Verbalize, then Speak: Bridging Complex Thoughts and Comprehensible Speech

Sang Hoon Woo,Sehun Lee,Kang-wook Kim,Gunhee Kim

Main category: cs.CL

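TL;DR: Proposes the Think-Verbalize-Speak framework, which decouples reasoning from spoken delivery and uses the ReVerT verbalizer to efficiently produce natural, concise speech output while preserving the LLM's reasoning ability.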

Details Motivation: Applying LLMs directly in spoken dialogue systems gives poor results because of the mismatch between textual and spoken delivery, and the effect of existing adaptation methods on reasoning performance is unclear. Method: Proposes the Think-Verbalize-Speak (TVS) framework, which separates reasoning from spoken output; introduces an intermediate "verbalizing" step and designs ReVerT, a low-latency verbalizer based on incremental and asynchronous summarization. Result: Experiments on multiple benchmarks show the method improves speech naturalness and conciseness with minimal impact on reasoning. Conclusion: TVS balances spoken output quality against LLM reasoning performance, and ReVerT produces high-quality speech-ready text at low latency. Abstract: Spoken dialogue systems increasingly employ large language models (LLMs) to leverage their advanced reasoning capabilities. However, direct application of LLMs in spoken communication often yields suboptimal results due to mismatches between optimal textual and verbal delivery. While existing approaches adapt LLMs to produce speech-friendly outputs, their impact on reasoning performance remains underexplored. In this work, we propose Think-Verbalize-Speak, a framework that decouples reasoning from spoken delivery to preserve the full reasoning capacity of LLMs. Central to our method is verbalizing, an intermediate step that translates thoughts into natural, speech-ready text. We also introduce ReVerT, a latency-efficient verbalizer based on incremental and asynchronous summarization. Experiments across multiple benchmarks show that our method enhances speech naturalness and conciseness with minimal impact on reasoning. The project page with the dataset and the source code is available at https://yhytoto12.github.io/TVS-ReVerT

[51] Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses

Fangyi Yu,Nabeel Seedat,Dasha Herrmannova,Frank Schilder,Jonathan Richard Schwarz

Main category: cs.CL

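TL;DR: DeCE is a decomposed LLM evaluation framework that separates precision and recall to assess long-form answers in high-stakes domains such as law and medicine more accurately, correlating more strongly with expert judgments than traditional methods.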

Details Motivation: Existing metrics such as BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often collapse nuanced answer quality into a single score, lacking granularity and interpretability. Method: Proposes the DeCE framework, which decomposes evaluation into precision (factual accuracy and relevance) and recall (coverage of required concepts) and automatically extracts instance-specific criteria from gold answer requirements, with no predefined taxonomies or handcrafted rubrics. Result: On a multi-jurisdictional legal QA task, DeCE reaches a correlation of 0.78 with expert judgments, compared with 0.12 for traditional metrics, 0.35 for pointwise LLM scoring, and 0.48 for modern multidimensional evaluators; generalist models favor recall while specialized models favor precision; only 11.95% of LLM-generated criteria required expert revision, which points to scalability. Conclusion: DeCE provides an interpretable and actionable LLM evaluation framework for expert domains that demand high precision and transparency. Abstract: Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments ($r=0.78$), compared to traditional metrics ($r=0.12$), pointwise LLM scoring ($r=0.35$), and modern multidimensional evaluators ($r=0.48$). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE's scalability. DeCE offers an interpretable and actionable LLM evaluation framework in expert domains.
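
As a rough illustration of why separating the two scores is informative, the sketch below computes a precision-like and a recall-like score from per-item judgments. The abstract does not specify how DeCE aggregates judgments, so the simple fractions, the function name, and the boolean inputs (which in practice would come from an LLM judge checking claims and criteria) are assumptions.

```python
from typing import Dict, List


def decomposed_scores(
    claim_is_supported: List[bool],
    criterion_is_covered: List[bool],
) -> Dict[str, float]:
    """Precision: share of the answer's claims judged accurate and relevant.
    Recall: share of the instance-specific criteria the answer covers."""
    precision = (
        sum(claim_is_supported) / len(claim_is_supported) if claim_is_supported else 0.0
    )
    recall = (
        sum(criterion_is_covered) / len(criterion_is_covered) if criterion_is_covered else 0.0
    )
    return {"precision": precision, "recall": recall}


# Hypothetical judgments for one answer: 3 of 4 claims hold up,
# but only 1 of the 2 required concepts is addressed.
scores = decomposed_scores(
    claim_is_supported=[True, True, True, False],
    criterion_is_covered=[True, False],
)
```

A single pointwise score would hide that this answer is mostly accurate yet incomplete. Keeping the two numbers apart is what exposes trade-offs such as generalist models favoring recall.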

[52] DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning

Sikai Bai,Haoxi Li,Jie Zhang,Zicong Hong,Song Guo

Main category: cs.CL

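TL;DR: Proposes DiEP, a differentiable expert pruning method that prunes Mixture-of-Experts models non-uniformly, substantially reducing parameter count while retaining most of the original performance.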

Details Motivation: Existing MoE pruning methods apply a uniform sparsity across layers and ignore differences in expert redundancy between layers, which degrades performance, so a method that adapts the pruning rate per layer is needed. Method: Proposes DiEP, which transforms the global discrete search space into a continuous one and jointly learns inter-layer importance and expert pruning rates in a differentiable way, giving a non-uniform, gradient-based pruning strategy. Result: DiEP is validated on five advanced MoE models; on Mixtral 8×7B it retains around 92% of original performance with only half the experts and outperforms other pruning methods by up to 7.1% on MMLU. Conclusion: DiEP captures the varying redundancy across MoE layers and achieves better non-uniform pruning, preserving more performance under substantial compression than uniform pruning and other baselines. Abstract: Despite the significant breakthrough of Mixture-of-Experts (MoE), the increasing scale of these MoE models presents huge memory and storage challenges. Existing MoE pruning methods, which involve reducing parameter size with a uniform sparsity across all layers, often lead to suboptimal outcomes and performance degradation due to varying expert redundancy in different MoE layers. To address this, we propose a non-uniform pruning strategy, dubbed Differentiable Expert Pruning (DiEP), which adaptively adjusts pruning rates at the layer level while jointly learning inter-layer importance, effectively capturing the varying redundancy across different MoE layers. By transforming the global discrete search space into a continuous one, our method handles exponentially growing non-uniform expert combinations, enabling adaptive gradient-based pruning. Extensive experiments on five advanced MoE models demonstrate the efficacy of our method across various NLP tasks. Notably, DiEP retains around 92% of original performance on Mixtral 8×7B with only half the experts, outperforming other pruning methods by up to 7.1% on the challenging MMLU dataset.

[53] It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge

Lukas Ellinger,Georg Groh

Main category: cs.CL

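TL;DR: This study systematically investigates whether LLMs can use commonsense to resolve referential ambiguity in multi-turn conversations. Current models handle ambiguity poorly, tending to commit to a single interpretation or list every possibility instead of seeking clarification; requests for simplified language further weaken their commonsense reasoning, while fine-tuning with Direct Preference Optimization substantially improves performance.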

Details Motivation: To examine whether LLMs can, like humans, use shared context and commonsense knowledge to resolve referential ambiguity in conversation, and to analyze model behavior when ambiguity persists, particularly how it changes under requests for simplified language. Method: Builds a new multilingual evaluation dataset and tests DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B using LLM-as-Judge and human annotations, comparing behavior under regular and simplified-language prompts and evaluating fine-tuning with Direct Preference Optimization. Result: Current LLMs struggle to resolve referential ambiguity, often committing early to one interpretation or enumerating all possibilities instead of hedging or asking for clarification; simplification prompts sharply reduce the use of commonsense reasoning and diverse response strategies; Llama-3.1-8B fine-tuned with Direct Preference Optimization handles ambiguity markedly better across all request types. Conclusion: Current LLMs have clear limitations in handling referential ambiguity, especially where commonsense reasoning and flexible responses are required, and simplified-language requests weaken them further; advanced fine-tuning such as DPO can improve behavior, and more such training is needed for robustness in complex, ambiguous communication. Abstract: Ambiguous words or underspecified references require interlocutors to resolve them, often by relying on shared context and commonsense knowledge. Therefore, we systematically investigate whether Large Language Models (LLMs) can leverage commonsense to resolve referential ambiguity in multi-turn conversations and analyze their behavior when ambiguity persists. Further, we study how requests for simplified language affect this capacity. Using a novel multilingual evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate that current LLMs struggle to resolve ambiguity effectively: they tend to commit to a single interpretation or cover all possible references, rather than hedging or seeking clarification. This limitation becomes more pronounced under simplification prompts, which drastically reduce the use of commonsense reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct Preference Optimization substantially improves ambiguity resolution across all request types. These results underscore the need for advanced fine-tuning to improve LLMs' handling of ambiguity and to ensure robust performance across diverse communication styles.

[54] CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion

Sheng Zhang,Yifan Ding,Shuquan Lian,Shun Song,Hui Li

Main category: cs.CL

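TL;DR: Proposes CodeRAG, a framework for repository-level code completion that improves query construction, multi-path code retrieval, and preference-aligned reranking, significantly outperforming existing methods.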

Details Motivation: Existing repository-level code completion methods suffer from inappropriate query construction, single-path retrieval, and misalignment between the code retriever and the code LLM, which hurts completion quality. Method: Proposes the CodeRAG framework, comprising log-probability-guided query construction, multi-path code retrieval, and preference-aligned BestFit reranking. Result: Experiments on the ReccEval and CCEval benchmarks show that CodeRAG significantly and consistently outperforms state-of-the-art methods. Conclusion: CodeRAG improves repository-level code completion through more effective knowledge retrieval and alignment. Abstract: Repository-level code completion automatically predicts the unfinished code based on the broader information from the repository. Recent strides in Code Large Language Models (code LLMs) have spurred the development of repository-level code completion methods, yielding promising results. Nevertheless, they suffer from issues such as inappropriate query construction, single-path code retrieval, and misalignment between code retriever and code LLM. To address these problems, we introduce CodeRAG, a framework tailored to identify relevant and necessary knowledge for retrieval-augmented repository-level code completion. Its core components include log probability guided query construction, multi-path code retrieval, and preference-aligned BestFit reranking. Extensive experiments on benchmarks ReccEval and CCEval demonstrate that CodeRAG significantly and consistently outperforms state-of-the-art methods. The implementation of CodeRAG is available at https://github.com/KDEGroup/CodeRAG.

[55] CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs

Jinghao Zhang,Sihang Jiang,Shiwei Guo,Shisong Chen,Yanghua Xiao,Hongwei Feng,Jiaqing Liang,Minggui HE,Shimin Tao,Hongxia Ma

Main category: cs.CL

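TL;DR: Proposes CultureScope, a comprehensive evaluation framework based on the cultural iceberg theory for measuring the cultural understanding of LLMs; it spans 3 layers and 140 dimensions and supports automated construction of culture-specific knowledge bases and datasets.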

Details Motivation: Existing benchmarks for evaluating cultural understanding in LLMs lack comprehensiveness and scalability, and their reliance on expert manual annotation makes them hard to adapt across cultural contexts. Method: Designs a dimensional schema for cultural knowledge classification with 3 layers and 140 dimensions, based on the cultural iceberg theory, which guides the automated construction of culture-specific knowledge bases and evaluation datasets and enables scalable evaluation across languages and cultures. Result: Experiments show that the method can effectively evaluate cultural understanding in LLMs, that existing models lack comprehensive cultural competence, and that merely adding multilingual data does not necessarily improve cultural understanding. Conclusion: The authors present CultureScope as the most comprehensive cultural-understanding evaluation framework to date; it exposes the limits of current LLMs in cultural understanding and underlines the need for designs aimed specifically at cross-cultural alignment. Abstract: As large language models (LLMs) are increasingly deployed in diverse cultural environments, evaluating their cultural understanding capability has become essential for ensuring trustworthy and culturally aligned applications. However, most existing benchmarks lack comprehensiveness and are challenging to scale and adapt across different cultural contexts, because their frameworks often lack guidance from well-established cultural theories and tend to rely on expert-driven manual annotations. To address these issues, we propose CultureScope, the most comprehensive evaluation framework to date for assessing cultural understanding in LLMs. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification, comprising 3 layers and 140 dimensions, which guides the automated construction of culture-specific knowledge bases and corresponding evaluation datasets for any given languages and cultures. Experimental results demonstrate that our method can effectively evaluate cultural understanding. They also reveal that existing large language models lack comprehensive cultural competence, and merely incorporating multilingual data does not necessarily enhance cultural understanding. All code and data files are available at https://github.com/HoganZinger/Culture

[56] RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Jane Luo,Xin Zhang,Steven Liu,Jie Wu,Yiming Huang,Yangyu Huang,Chengyu Yin,Ying Xin,Jianfeng Liu,Yuefeng Zhan,Hao Sun,Qi Chen,Scarlett Li,Mao Yang

Main category: cs.CL

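TL;DR: This paper proposes the Repository Planning Graph (RPG) and ZeroRepo, a framework built on it, for generating complete code repositories from scratch. RPG unifies the planning and implementation stages in a graph structure that replaces ambiguous natural-language descriptions, enabling scalable and coherent long-horizon planning. Experiments show that ZeroRepo significantly outperforms existing methods on the authors' RepoCraft benchmark, generating repositories of about 36K LOC with substantially higher functional coverage and pass rates.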

Details Motivation: Large language models are good at function- and file-level code generation, but generating a complete repository still suffers from incoherent planning and the ambiguity of natural language, and there is no effective mechanism for modeling complex software structures. Method: Proposes the Repository Planning Graph (RPG) as a unified representation that encodes capabilities, file structures, data flows, and functions; builds the ZeroRepo framework on RPG with three stages (proposal-level planning, implementation-level refinement, and graph-guided generation with test validation); and constructs the RepoCraft benchmark of six real-world projects for evaluation. Result: On RepoCraft, ZeroRepo generates repositories averaging nearly 36K LOC, about 3.9 times the strongest baseline (Claude Code) and 64 times the other baselines. It reaches 81.5% functional coverage and a 69.7% pass rate, exceeding Claude Code by 27.3 and 35.8 percentage points, respectively. RPG also models complex dependencies and scales near-linearly. Conclusion: RPG provides an effective structured blueprint for repository-level code generation, and ZeroRepo offers a feasible path to generating large, high-quality repositories from scratch with markedly better functional completeness and correctness. Abstract: Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands coherent and reliable planning across proposal- and implementation-level stages, while natural language, due to its ambiguity and verbosity, is ill-suited for faithfully representing complex software structures. To address this, we introduce the Repository Planning Graph (RPG), a persistent representation that unifies proposal- and implementation-level planning by encoding capabilities, file structures, data flows, and functions in one graph. RPG replaces ambiguous natural language with an explicit blueprint, enabling long-horizon planning and scalable repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework for repository generation from scratch. It operates in three stages: proposal-level planning and implementation-level refinement to construct the graph, followed by graph-guided code generation with test validation. To evaluate this setting, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces repositories averaging nearly 36K LOC, roughly 3.9$\times$ the strongest baseline (Claude Code) and about 64$\times$ other baselines. It attains 81.5% functional coverage and a 69.7% pass rate, exceeding Claude Code by 27.3 and 35.8 percentage points, respectively. Further analysis shows that RPG models complex dependencies, enables progressively more sophisticated planning through near-linear scaling, and enhances LLM understanding of repositories, thereby accelerating agent localization.
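
To make the idea of a single planning graph concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical `Node` type with three illustrative kinds ("capability", "file", "function") and treats data-flow edges as dependencies, so that code can be generated in an order where every dependency comes first. The node names are invented for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Node:
    name: str
    kind: str  # hypothetical labels: "capability", "file", or "function"
    deps: set = field(default_factory=set)  # names of nodes this one depends on


def generation_order(nodes):
    """Return node names so that every dependency precedes its consumer."""
    graph = {n.name: set(n.deps) for n in nodes}  # node -> its predecessors
    return list(TopologicalSorter(graph).static_order())


# Toy graph: a capability is refined into a file, then into functions.
toy_graph = [
    Node("data_loading", "capability"),
    Node("io.py", "file", {"data_loading"}),
    Node("load_csv", "function", {"io.py"}),
    Node("train", "function", {"load_csv"}),
]
order = generation_order(toy_graph)
print(order)
```

Keeping capabilities, files, and functions in one graph means the same structure can drive both planning and the order in which code is written and tested, which is the property the abstract attributes to RPG.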

cs.CV [Back]

[57] Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays

Hanbin Ko,Gihun Cho,Inhyeok Baek,Donguk Kim,Joonbeom Koo,Changi Kim,Dongheon Lee,Chang Min Park

Main category: cs.CV

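TL;DR: Proposes LLM2VEC4CXR and LLM2CLIP4CXR, which use large language models to improve text understanding and image-text alignment for chest X-ray reports, and argues that robustness matters more than data scale alone.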

Details Motivation: The heterogeneity of clinical reports (abbreviations, stylistic differences, and so on) limits existing vision-language pretraining in radiology, and simply scaling up noisy data can hurt model performance. Method: Develops LLM2VEC4CXR, a domain-adapted LLM encoder, and builds LLM2CLIP4CXR, a dual-tower framework that pairs the encoder with a vision backbone for image-text alignment. Result: Trained on 1.6M chest X-ray studies, the models outperform baselines and existing medical CLIP variants on report-level metrics, retrieval accuracy, and cross-dataset generalization. Conclusion: Robust text representations do more for medical multimodal learning than data scale alone; the models are released to support follow-up research. Abstract: Vision-language pretraining has advanced image-text alignment, yet progress in radiology remains constrained by the heterogeneity of clinical reports, including abbreviations, impression-only notes, and stylistic variability. Unlike general-domain settings where more data often leads to better performance, naively scaling to large collections of noisy reports can plateau or even degrade model learning. We ask whether large language model (LLM) encoders can provide robust clinical representations that transfer across diverse styles and better guide image-text alignment. We introduce LLM2VEC4CXR, a domain-adapted LLM encoder for chest X-ray reports, and LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone. LLM2VEC4CXR improves clinical text understanding over BERT-based baselines, handles abbreviations and style variation, and achieves strong clinical alignment on report-level metrics. LLM2CLIP4CXR leverages these embeddings to boost retrieval accuracy and clinically oriented scores, with stronger cross-dataset generalization than prior medical CLIP variants. Trained on 1.6M CXR studies from public and private sources with heterogeneous and noisy reports, our models demonstrate that robustness -- not scale alone -- is the key to effective multimodal learning. We release models to support further research in medical image-text representation learning.

[58] ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang,Han Shu,Wenshuo Li,Yingjie Zhai,Xinghao Chen

Main category: cs.CV

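TL;DR: This paper proposes ViSpec, a new speculative decoding framework for vision-language models (VLMs). A lightweight vision adaptor compresses image tokens and strengthens multimodal consistency, yielding a substantial inference speedup on top of existing models.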

Details Motivation: Existing speculative decoding methods give only limited speedups on vision-language models (<1.5x), and multimodal capabilities are increasingly important, so a more effective acceleration method designed for VLMs is needed. Method: Proposes the ViSpec framework, which includes a lightweight vision adaptor module that compresses image tokens and integrates them into the draft model's attention mechanism; extracts a global image feature vector to improve the multimodal coherence of text tokens; and builds long-response training data by repurposing existing datasets, while preventing the draft model from learning shortcuts through the target model's hidden states. Result: Experiments show that ViSpec achieves the first substantial speedup in VLM speculative decoding, clearly outperforming previous methods. Conclusion: ViSpec addresses the limited speedup of speculative decoding in VLMs and offers a practical approach to efficient inference for large multimodal models. Abstract: Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding.
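
For readers unfamiliar with the underlying technique, the sketch below shows one step of generic greedy speculative decoding. It is not ViSpec itself: the vision adaptor and global image feature are omitted, and `draft_next` and `target_next` are hypothetical stand-ins for a small draft model and a large target model, each mapping a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Run one greedy speculative decoding step and return the extended sequence."""
    # The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # The target model checks each proposed position
    # (a real system does this in a single batched forward pass).
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return prefix + accepted
        accepted.append(tok)

    # Every proposal matched, so the target contributes one bonus token.
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted


# Toy models over digits: the target counts upward; the draft errs after token 3.
def target(seq):
    return (seq[-1] + 1) % 10


def draft(seq):
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


result = speculative_step([0], draft, target, k=4)
print(result)  # [0, 1, 2, 3, 4]
```

The output always equals what the target model would have produced greedily on its own; the speedup comes from how many draft tokens are accepted per target pass, which is why a draft model that handles image tokens poorly limits the gain.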

[59] M-PACE: Mother Child Framework for Multimodal Compliance

Shreyash Verma,Amit Kesari,Vinayak Trivedi,Anupam Purwar,Ratnesh Jamidar

Main category: cs.CV

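TL;DR: This paper proposes M-PACE, a multimodal compliance checking framework that uses multimodal large language models (MLLMs) to process image and text content together. In an advertisement compliance setting it evaluates more than 15 attributes in a single pass, and its "mother-child" MLLM architecture substantially reduces reliance on human review and cuts inference cost by more than 31 times. The paper also introduces a human-annotated benchmark covering complex real-world conditions.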

Details Motivation: Traditional multimodal compliance systems rely on separate multi-stage pipelines, which leads to fragmented architectures, high operational overhead, poor scalability, and difficulty adapting to changing rules. With the rise of multimodal large models, a unified and efficient compliance framework is needed to overcome these problems. Method: Proposes the M-PACE framework with a "mother-child" MLLM structure: a stronger mother MLLM supervises and evaluates the outputs of several lightweight child MLLMs, so that compliance attributes of vision-language inputs are judged in a single pass. The framework is validated on advertisement compliance, and a human-annotated benchmark is built with augmented hard samples such as visual obstructions and profanity injection. Result: With accuracy comparable to Gemini 2.5 Pro, M-PACE lowers inference cost from 0.0159 per image to 0.0005 per image (using Gemini 2.0 Flash as the child model), a reduction of more than 31 times; it also substantially reduces reliance on human review and improves automated quality control. Conclusion: M-PACE provides an efficient, scalable, unified framework for multimodal content compliance. Its mother-child MLLM design balances cost and quality in real deployments and shows the potential of MLLMs for industrial compliance applications. Abstract: Ensuring that multi-modal content adheres to brand, legal, or platform-specific compliance standards is an increasingly complex challenge across domains. Traditional compliance frameworks typically rely on disjointed, multi-stage pipelines that integrate separate modules for image classification, text extraction, audio transcription, hand-crafted checks, and rule-based merges. This architectural fragmentation increases operational overhead, hampers scalability, and hinders the ability to adapt to dynamic guidelines efficiently. With the emergence of Multimodal Large Language Models (MLLMs), there is growing potential to unify these workflows under a single, general-purpose framework capable of jointly processing visual and textual content. In light of this, we propose Multimodal Parameter Agnostic Compliance Engine (M-PACE), a framework designed for assessing attributes across vision-language inputs in a single pass. As a representative use case, we apply M-PACE to advertisement compliance, demonstrating its ability to evaluate over 15 compliance-related attributes. To support structured evaluation, we introduce a human-annotated benchmark enriched with augmented samples that simulate challenging real-world conditions, including visual obstructions and profanity injection. M-PACE employs a mother-child MLLM setup, demonstrating that a stronger parent MLLM evaluating the outputs of smaller child models can significantly reduce dependence on human reviewers, thereby automating quality control. Our analysis reveals that inference costs fall by over 31 times, with the most efficient models (Gemini 2.0 Flash as child MLLM selected by mother MLLM) operating at 0.0005 per image, compared to 0.0159 for Gemini 2.5 Pro with comparable accuracy, highlighting the trade-off between cost and output quality achieved in real time by M-PACE in real-life deployment over advertising data.
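
The "over 31 times" figure follows directly from the two per-image costs quoted in the abstract. The short check below reproduces it; the batch size of one million images is a hypothetical number chosen only to illustrate the scale of the difference, and the costs are left unitless because the abstract does not state a currency.

```python
# Per-image inference costs as quoted in the abstract (currency not stated there).
COST_GEMINI_25_PRO = 0.0159
COST_GEMINI_20_FLASH = 0.0005

ratio = COST_GEMINI_25_PRO / COST_GEMINI_20_FLASH
print(f"cost ratio: {ratio:.1f}x")


def batch_saving(n_images):
    """Cost difference between the two models for a batch of n_images."""
    return n_images * (COST_GEMINI_25_PRO - COST_GEMINI_20_FLASH)


# Hypothetical batch size, used only for illustration.
print(f"saving on 1,000,000 images: {batch_saving(1_000_000):,.0f}")
```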

[60] ProFusion: 3D Reconstruction of Protein Complex Structures from Multi-view AFM Images

Jaydeep Rade,Md Hasibul Hasan Hasib,Meric Ozturk,Baboucarr Faal,Sheng Yang,Dipali G. Sashital,Vincenzo Venditti,Baoyu Chen,Soumik Sarkar,Adarsh Krishnamurthy,Anwesha Sarkar

Main category: cs.CV

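TL;DR: Proposes ProFusion, a hybrid framework that combines deep learning with simulated multi-view atomic force microscopy (AFM) data for accurate, low-cost 3D structure prediction of protein complexes.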

Details Motivation: Existing AI methods are limited in predicting the structure of large protein complexes because they lack 3D spatial cues, while experimental techniques such as Cryo-EM are accurate but costly and slow. Method: Develops a virtual AFM framework that generates a dataset of about 542,000 proteins with multi-view synthetic images, uses a conditional diffusion model to synthesize novel views, and reconstructs 3D structures with an instance-specific NeRF model. Result: The reconstructed 3D structures have an average Chamfer Distance within the AFM imaging resolution, indicating high structural fidelity, and the method is validated on a range of real experimental AFM data. Conclusion: ProFusion offers an efficient, low-cost approach to protein complex structure prediction, with potential for rapid iterative validation against AFM experiments. Abstract: AI-based in silico methods have improved protein structure prediction but often struggle with large protein complexes (PCs) involving multiple interacting proteins due to missing 3D spatial cues. Experimental techniques like Cryo-EM are accurate but costly and time-consuming. We present ProFusion, a hybrid framework that integrates a deep learning model with Atomic Force Microscopy (AFM), which provides high-resolution height maps from random orientations, naturally yielding multi-view data for 3D reconstruction. However, generating a large-scale AFM imaging data set sufficient to train deep learning models is impractical. Therefore, we developed a virtual AFM framework that simulates the imaging process and generated a dataset of ~542,000 proteins with multi-view synthetic AFM images. We train a conditional diffusion model to synthesize novel views from unposed inputs and an instance-specific Neural Radiance Field (NeRF) model to reconstruct 3D structures. Our reconstructed 3D protein structures achieve an average Chamfer Distance within the AFM imaging resolution, reflecting high structural fidelity. Our method is extensively validated on experimental AFM images of various PCs, demonstrating strong potential for accurate, cost-effective protein complex structure prediction and rapid iterative validation using AFM experiments.
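
The reconstruction quality above is reported as a Chamfer Distance between point sets. The abstract does not say which variant the authors use, so the sketch below assumes one common definition: the mean nearest-neighbor Euclidean distance from each set to the other, summed over both directions. The point coordinates are toy values.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty sets of 3D points.

    Assumed variant: mean nearest-neighbor distance in each direction, summed.
    """
    def one_way(src, dst):
        # Average distance from each source point to its closest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Toy example: the second set is the first shifted by 1 unit along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(reference, shifted))  # 2.0
```

This brute-force version is quadratic in the number of points; it is meant to show what the metric measures, not how one would compute it for dense protein surfaces.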

[61] Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models

Muhammad Imran,Yugyung Lee

Main category: cs.CV

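TL;DR: This paper proposes MMEL, a multi-modal explainable learning framework that introduces a hierarchical semantic relationship module to improve the interpretability and performance of vision-language models.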

Details Motivation: Applying vision-language models in safety-critical settings is challenging because of complex relationships between objects, subtle visual cues, and high demands for transparency and reliability. Method: Building on a gradient-based explanation method for transformers (Grad-eclip), proposes a Hierarchical Semantic Relationship Module that combines multi-scale feature processing, adaptive attention weighting, and cross-modal alignment to improve interpretability. Result: Experiments show that MMEL produces more focused and context-aware visualizations that better reflect how the model understands complex scenes, improving explanation precision. Conclusion: MMEL improves interpretability while maintaining high performance and suits applications across domains that require high transparency and reliability. Abstract: Recent advances in vision-language models have significantly expanded the frontiers of automated image analysis. However, applying these models in safety-critical contexts remains challenging due to the complex relationships between objects, subtle visual cues, and the heightened demand for transparency and reliability. This paper presents the Multi-Modal Explainable Learning (MMEL) framework, designed to enhance the interpretability of vision-language models while maintaining high performance. Building upon prior work in gradient-based explanations for transformer architectures (Grad-eclip), MMEL introduces a novel Hierarchical Semantic Relationship Module that enhances model interpretability through multi-scale feature processing, adaptive attention weighting, and cross-modal alignment. Our approach processes features at multiple semantic levels to capture relationships between image regions at different granularities, applying learnable layer-specific weights to balance contributions across the model's depth. This results in more comprehensive visual explanations that highlight both primary objects and their contextual relationships with improved precision. Through extensive experiments on standard datasets, we demonstrate that by incorporating semantic relationship information into gradient-based attribution maps, MMEL produces more focused and contextually aware visualizations that better reflect how vision-language models process complex scenes. The MMEL framework generalizes across various domains, offering valuable insights into model decisions for applications requiring high interpretability and reliability.

[62] Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning

Wenda Qin,Andrea Burns,Bryan A. Plummer,Margrit Betke

Main category: cs.CV

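TL;DR: Proposes Navigation-Aware Pruning (NAP) for vision-and-language navigation. By separating foreground from background tokens and exploiting navigation-specific traits, it cuts computation by more than 50% while preserving higher task success rates.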

Details Motivation: Large models perform well on vision-and-language navigation (VLN) but are resource-hungry. Existing pruning methods ignore VLN-specific challenges: information lost to pruning can lengthen navigation, which erodes the efficiency gains. Method: Proposes Navigation-Aware Pruning (NAP), which uses navigation-related traits to pre-filter tokens into foreground and background, prunes background tokens first, and uses a large language model to extract navigation-relevant instructions; it also removes low-importance navigation nodes to reduce backtracking. Result: Experiments on standard VLN benchmarks show that NAP significantly outperforms prior methods, saving more than 50% of FLOPs while preserving higher success rates. Conclusion: By designing its pruning strategy around the characteristics of the navigation task, NAP balances efficiency and performance and offers a more practical lightweight solution for VLN. Abstract: Large models achieve strong performance on Vision-and-Language Navigation (VLN) tasks, but are costly to run in resource-limited environments. Token pruning offers appealing tradeoffs for efficiency with minimal performance loss by reducing model input size, but prior work overlooks VLN-specific challenges. For example, information loss from pruning can effectively increase computational cost due to longer walks. Thus, the inability to identify uninformative tokens undermines the supposed efficiency gains from pruning. To address this, we propose Navigation-Aware Pruning (NAP), which uses navigation-specific traits to simplify the pruning process by pre-filtering tokens into foreground and background. For example, image views are filtered based on whether the agent can navigate in that direction. We also extract navigation-relevant instructions using a Large Language Model. After filtering, we focus pruning on background tokens, minimizing information loss. To further help avoid increases in navigation length, we discourage backtracking by removing low-importance navigation nodes. Experiments on standard VLN benchmarks show NAP significantly outperforms prior work, preserving higher success rates while saving more than 50% FLOPs.
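
The foreground/background split described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm: each token is a hypothetical dict with a `navigable` flag (whether the agent can move in that view's direction) and a `score` field standing in for whatever importance measure the model provides, and `keep_ratio` is an invented parameter.

```python
def prune_background(tokens, keep_ratio=0.5):
    """Keep every foreground token and only the top-scoring background tokens.

    tokens: list of dicts with keys 'id', 'navigable' (bool), 'score' (float).
    """
    foreground = [t for t in tokens if t["navigable"]]
    background = [t for t in tokens if not t["navigable"]]

    # Pruning is applied to background tokens only.
    n_keep = int(len(background) * keep_ratio)
    kept_bg = sorted(background, key=lambda t: t["score"], reverse=True)[:n_keep]

    kept_ids = {t["id"] for t in foreground + kept_bg}
    return [t for t in tokens if t["id"] in kept_ids]  # preserve input order


views = [
    {"id": 0, "navigable": True, "score": 0.3},
    {"id": 1, "navigable": False, "score": 0.9},
    {"id": 2, "navigable": False, "score": 0.1},
    {"id": 3, "navigable": True, "score": 0.2},
    {"id": 4, "navigable": False, "score": 0.5},
    {"id": 5, "navigable": False, "score": 0.2},
]
kept = prune_background(views, keep_ratio=0.5)
print([t["id"] for t in kept])  # [0, 1, 3, 4]
```

Note that the low-scoring foreground tokens (ids 0 and 3) survive: protecting navigable views is what keeps pruning from lengthening the walk.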

[63] RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I Generation

Silpa Vadakkeeveetil Sreelatha,Sauradip Nag,Muhammad Awais,Serge Belongie,Anjan Dutta

Main category: cs.CV

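TL;DR: Proposes RespoDiff, a framework that applies a dual-module transformation to the intermediate bottleneck representations of diffusion models, improving the fairness and safety of text-to-image generation while preserving semantic consistency and image quality.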

Details Motivation: Existing methods tend to sacrifice semantic fidelity and image quality when improving the fairness and safety of generated content, so an approach that balances responsibility and generation quality is needed. Method: Introduces two learnable modules, one that captures and enforces responsible concepts such as fairness and safety, and one that maintains semantic alignment with neutral prompts; designs a new score-matching objective to coordinate the learning of the two modules. Result: Compared with state-of-the-art methods, responsible and semantically coherent generation improves by 20% across diverse, unseen prompts, and the method integrates seamlessly into large-scale models such as SDXL. Conclusion: RespoDiff balances responsibility requirements and semantic consistency in text-to-image generation without harming image fidelity, offering a practical approach to responsible generation for large-scale models. Abstract: The rapid advancement of diffusion models has enabled high-fidelity and semantically rich text-to-image generation; however, ensuring fairness and safety remains an open challenge. Existing methods typically improve fairness and safety at the expense of semantic fidelity and image quality. In this work, we propose RespoDiff, a novel framework for responsible text-to-image generation that incorporates a dual-module transformation on the intermediate bottleneck representations of diffusion models. Our approach introduces two distinct learnable modules: one focused on capturing and enforcing responsible concepts, such as fairness and safety, and the other dedicated to maintaining semantic alignment with neutral prompts. To facilitate the dual learning process, we introduce a novel score-matching objective that enables effective coordination between the modules. Our method outperforms state-of-the-art methods in responsible generation by ensuring semantic alignment while optimizing both objectives without compromising image fidelity. Our approach improves responsible and semantically coherent generation by 20% across diverse, unseen prompts. Moreover, it integrates seamlessly into large-scale models like SDXL, enhancing fairness and safety. Code will be released upon acceptance.

[64] Autoguided Online Data Curation for Diffusion Model Training

Valeria Pais,Luis Oala,Daniele Faccio,Marco Aversa

Main category: cs.CV

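TL;DR: This study examines whether autoguidance and online data selection can improve the training efficiency of generative diffusion models. Autoguidance consistently improves sample quality and diversity, while early joint example selection (AJEST) matches or slightly exceeds autoguidance in data efficiency, but its time overhead and complexity limit its practicality.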

Details Motivation: Because generative models are expensive to compute, efficient data curation methods are needed to improve training efficiency. This paper evaluates whether the emerging autoguidance and online data selection techniques can improve the time and sample efficiency of generative diffusion models. Method: Integrates joint example selection (JEST) and autoguidance into a unified code base; runs controlled experiments on 2-D synthetic data generation and (3x64x64)-D image generation; compares data curation strategies at equal wall-clock time and equal number of samples, explicitly accounting for the overhead of selection. Result: Autoguidance consistently improves sample quality and diversity across all experiments. Early AJEST matches or slightly exceeds autoguidance in data efficiency but has higher time overhead and implementation complexity. Overall, autoguidance or random selection is preferable. Conclusion: Although targeted online data selection can yield efficiency gains early in training, robust improvements in sample quality are driven mainly by autoguidance; autoguidance or uniform random selection is therefore recommended over complex selection mechanisms in most situations. Abstract: The costs of generative model compute rekindled promises and hopes for efficient data curation. In this work, we investigate whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of training generative diffusion models. We integrate joint example selection (JEST) and autoguidance into a unified code base for fast ablation and benchmarking. We evaluate combinations of data curation on a controlled 2-D synthetic data generation task as well as (3x64x64)-D image generation. Our comparisons are made at equal wall-clock time and equal number of samples, explicitly accounting for the overhead of selection. Across experiments, autoguidance consistently improves sample quality and diversity. Early AJEST (applying selection only at the beginning of training) can match or modestly exceed autoguidance alone in data efficiency on both tasks. However, its time overhead and added complexity make autoguidance or uniform random data selection preferable in most situations. These findings suggest that while targeted online selection can yield efficiency gains in early training, robust sample quality improvements are primarily driven by autoguidance. We discuss limitations and scope, and outline when data selection may be beneficial.

[65] PRISM: Phase-enhanced Radial-based Image Signature Mapping framework for fingerprinting AI-generated images

Emanuele Ricco,Elia Onofri,Lorenzo Cima,Stefano Cresci,Roberto Di Pietro

Main category: cs.CV

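TL;DR: This paper proposes PRISM, a scalable image fingerprinting framework for identifying which model produced an AI-generated image. It relies on amplitude and phase information in the frequency domain and achieves high accuracy in model attribution and real-vs-fake detection across several datasets.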

Details Motivation: As generative AI spreads, especially in commercial settings, users need assurance about where content comes from, so techniques that can trace the source of AI-generated content are needed. Method: PRISM applies a radial reduction to the discrete Fourier transform, combines amplitude and phase information to extract model-specific features, and clusters them with linear discriminant analysis to reliably attribute images generated by black-box models. Result: PRISM reaches 92.04% attribution accuracy on the authors' PRISM-36K dataset, an average accuracy of 81.60% on four benchmark datasets, an average accuracy of 88.41% on real-vs-fake detection, and up to 95.06% on GenImage. Conclusion: PRISM shows that frequency-domain fingerprinting can trace AI-generated images across architectures and datasets, offering a practical way to strengthen accountability and trust in generative AI systems. Abstract: A critical need has emerged for generative AI: attribution methods. That is, solutions that can identify the model originating AI-generated content. This feature, generally relevant in multimodal applications, is especially sensitive in commercial settings where users subscribe to paid proprietary services and expect guarantees about the source of the content they receive. To address these issues, we introduce PRISM, a scalable Phase-enhanced Radial-based Image Signature Mapping framework for fingerprinting AI-generated images. PRISM is based on a radial reduction of the discrete Fourier transform that leverages amplitude and phase information to capture model-specific signatures. The output of the above process is subsequently clustered via linear discriminant analysis to achieve reliable model attribution in diverse settings, even if the model's internal details are inaccessible. To support our work, we construct PRISM-36K, a novel dataset of 36,000 images generated by six text-to-image GAN- and diffusion-based models. On this dataset, PRISM achieves an attribution accuracy of 92.04%. We additionally evaluate our method on four benchmarks from the literature, reaching an average accuracy of 81.60%. Finally, we evaluate our methodology also in the binary task of detecting real vs fake images, achieving an average accuracy of 88.41%. We obtain our best result on GenImage with an accuracy of 95.06%, whereas the original benchmark achieved 82.20%. Our results demonstrate the effectiveness of frequency-domain fingerprinting for cross-architecture and cross-dataset model attribution, offering a viable solution for enforcing accountability and trust in generative AI systems.
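
One way to picture a "radial reduction of the discrete Fourier transform" is to average the spectrum over rings of equal distance from the zero-frequency (DC) term. The sketch below does this for both amplitude and phase on a tiny grayscale image. The abstract does not specify the exact reduction, so the integer-radius binning and plain per-ring averaging are assumptions made for illustration, and the naive DFT is only practical for very small inputs.

```python
import cmath
import math


def dft2(img):
    """Naive 2D discrete Fourier transform of a square list-of-lists image."""
    n = len(img)
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            out[u][v] = sum(
                img[x][y] * cmath.exp(-2j * math.pi * (u * x + v * y) / n)
                for x in range(n)
                for y in range(n)
            )
    return out


def radial_signature(img):
    """Mean amplitude per radius bin, followed by mean phase per radius bin."""
    n = len(img)
    spec = dft2(img)
    bins = {}  # radius -> [amplitude sum, phase sum, count]
    for u in range(n):
        for v in range(n):
            # Use signed frequencies so the radius is measured from DC.
            fu = u if u <= n // 2 else u - n
            fv = v if v <= n // 2 else v - n
            r = round(math.hypot(fu, fv))
            acc = bins.setdefault(r, [0.0, 0.0, 0])
            acc[0] += abs(spec[u][v])
            acc[1] += cmath.phase(spec[u][v])
            acc[2] += 1
    radii = sorted(bins)
    amplitude = [bins[r][0] / bins[r][2] for r in radii]
    phase = [bins[r][1] / bins[r][2] for r in radii]
    return amplitude + phase


# A constant 4x4 image has all of its energy in the DC bin.
flat = [[1.0] * 4 for _ in range(4)]
sig = radial_signature(flat)
print(len(sig), round(sig[0], 6))
```

A fixed-length vector like `sig` is the kind of feature that could then be passed to a classifier such as linear discriminant analysis, as the abstract describes.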

[66] Large Vision Models Can Solve Mental Rotation Problems

Sebastian Ray Mason,Anders Gjølbye,Phillip Chavarria Højbjerg,Lenka Tětková,Lars Kai Hansen

Main category: cs.CV

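TL;DR: This paper systematically evaluates vision transformers on mental rotation tasks. Self-supervised ViTs capture geometric structure better than supervised ViTs, intermediate layers outperform final layers, and task difficulty rises with rotation complexity and occlusion, reflecting representational constraints similar to those in human cognition.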

Details Motivation: To examine whether modern vision transformers can develop mental rotation abilities as humans do, and thereby understand the limits of their spatial reasoning. Method: Performs layer-by-layer representation analysis of ViT, CLIP, DINOv2, and DINOv3 on a range of mental rotation tasks, with stimuli ranging from simple block structures to text and real objects. Result: 1) Self-supervised ViTs outperform supervised ViTs; 2) intermediate layers perform better than final layers; 3) performance drops as rotation complexity and occlusion increase, matching the trend in human reaction times. Conclusion: Vision transformers show cognitive constraints similar to humans on mental rotation tasks, with self-supervised models performing better, which reveals both the potential and the limits of their spatial representations. Abstract: Mental rotation is a key test of spatial reasoning in humans and has been central to understanding how perception supports cognition. Despite the success of modern vision transformers, it is still unclear how well these models develop similar abilities. In this work, we present a systematic evaluation of ViT, CLIP, DINOv2, and DINOv3 across a range of mental-rotation tasks, from simple block structures similar to those used by Shepard and Metzler to study human cognition, to more complex block figures, three types of text, and photo-realistic objects. By probing model representations layer by layer, we examine where and how these networks succeed. We find that i) self-supervised ViTs capture geometric structure better than supervised ViTs; ii) intermediate layers perform better than final layers; iii) task difficulty increases with rotation complexity and occlusion, mirroring human reaction times and suggesting similar constraints in embedding space representations.

[67] Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks

Yannis Kaltampanidis,Alexandros Doumanoglou,Dimitrios Zarpalas

Main category: cs.CV

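TL;DR: This study systematically evaluates unmodified Vision Transformer (ViT) features on image classification and segmentation in both standard and few-shot settings, and analyzes the best choices across token types, task types, and pre-training objectives.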

Details Motivation: Existing self-supervised learning methods often add transformation layers to improve performance, but the intrinsic representational power of raw ViT features has not been comprehensively analyzed; this paper aims to fill that gap. Method: Using ViT models pre-trained with contrastive learning and masked image modeling, takes the keys, queries, and values of the final attention block, or the features after its feed-forward layer, directly, and applies hyperplane-based or cosine-similarity-based decision rules to classification and segmentation. Result: Identifies the best combinations of token type and decision rule for different tasks and contexts and reports detailed results on two widely used datasets, showing that unmodified ViT features are already interpretable and discriminative. Conclusion: The intrinsic representations of ViTs can be used effectively without additional feature transformations, which gives a simpler and more efficient route to downstream tasks and guides the choice of token type and decision rule. Abstract: Self-Supervised Learning (SSL) for Vision Transformers (ViTs) has recently demonstrated considerable potential as a pre-training strategy for a variety of computer vision tasks, including image classification and segmentation, both in standard and few-shot downstream contexts. Two pre-training objectives dominate the landscape of SSL techniques: Contrastive Learning and Masked Image Modeling. Features (or tokens) extracted from the final transformer attention block -- specifically, the keys, queries, and values -- as well as features obtained after the final block's feed-forward layer, have become a common foundation for addressing downstream tasks. However, in many existing approaches, these pre-trained ViT features are further processed through additional transformation layers, often involving lightweight heads or combined with distillation, to achieve superior task performance. Although such methods can improve task outcomes, to the best of our knowledge, a comprehensive analysis of the intrinsic representation capabilities of unaltered ViT features has yet to be conducted. This study aims to bridge this gap by systematically evaluating the use of these unmodified features across image classification and segmentation tasks, in both standard and few-shot contexts. The classification and segmentation rules that we use are either hyperplane based (as in logistic regression) or cosine-similarity based, both of which rely on the presence of interpretable directions in the ViT's latent space. Based on the previous rules and without the use of additional feature transformations, we conduct an analysis across token types, tasks, and pre-trained ViT models. This study provides insights into the optimal choice for token type and decision rule based on the task, context, and the pre-training objective, while reporting detailed findings on two widely-used datasets.
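
A cosine-similarity decision rule of the kind mentioned above can be sketched as nearest-prototype classification: average the support features of each class into one direction, then assign a query to the class whose direction it is most aligned with. This is one plausible reading, not necessarily the authors' exact rule; the 2-D vectors below are toy stand-ins for frozen ViT token features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def class_prototypes(features, labels):
    """Average the feature vectors of each class into one prototype direction."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, x in enumerate(feat):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict(feature, prototypes):
    """Pick the class whose prototype is most aligned with the feature."""
    return max(prototypes, key=lambda label: cosine(feature, prototypes[label]))


# Toy few-shot support set: two examples per class.
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = ["cat", "cat", "dog", "dog"]
protos = class_prototypes(support, support_labels)
print(predict([2.0, 0.3], protos), predict([0.2, 5.0], protos))  # cat dog
```

Because cosine similarity ignores vector length, the rule depends only on direction, which is why it works when class information lies along interpretable directions in the latent space.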

[68] How Good are Foundation Models in Step-by-Step Embodied Reasoning?

Dinura Dissanayake,Ahmed Heakl,Omkar Thawakar,Noor Ahsan,Ritesh Thawkar,Ketan More,Jean Lahoud,Rao Anwer,Hisham Cholakkal,Ivan Laptev,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

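TL;DR: This paper proposes FoMER, a benchmark for evaluating the step-by-step reasoning of large multimodal models (LMMs) in complex embodied decision-making scenarios. It covers 10 tasks and 8 embodiments with more than 1,100 samples, and analyzes the potential and limitations of current LMMs in embodied reasoning.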

Details Motivation: Although large multimodal models have advanced in visual understanding and language generation, their structured reasoning ability in real-world embodied tasks remains underexplored, so their decision-making with respect to safety, spatial coherence, and contextual grounding needs systematic evaluation. Method: Proposes the FoMER benchmark, which includes a large-scale curated suite of embodied reasoning tasks and a new evaluation framework that separates perceptual grounding from action reasoning, and conducts an empirical analysis of several leading LMMs. Result: FoMER contains over 1.1k samples covering 10 tasks, 8 embodiments, and 3 robot types. The experiments reveal both the effectiveness and the current limitations of LMMs in embodied reasoning, with notable weaknesses in reasoning about physical constraints and safety. Conclusion: LMMs show promise in embodied reasoning but still have significant limitations; FoMER sets out important challenges and directions for future research in robot intelligence. Abstract: Embodied agents operating in the physical world must make decisions that are not only effective but also safe, spatially coherent, and grounded in context. While recent advances in large multimodal models (LMMs) have shown promising capabilities in visual understanding and language generation, their ability to perform structured reasoning for real-world embodied tasks remains underexplored. In this work, we aim to understand how well foundation models can perform step-by-step reasoning in embodied environments. To this end, we propose the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the reasoning capabilities of LMMs in complex embodied decision-making scenarios. Our benchmark spans a diverse set of tasks that require agents to interpret multimodal observations, reason about physical constraints and safety, and generate valid next actions in natural language. We present (i) a large-scale, curated suite of embodied reasoning tasks, (ii) a novel evaluation framework that disentangles perceptual grounding from action reasoning, and (iii) empirical analysis of several leading LMMs under this setting. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments, covering three different robot types. Our results highlight both the potential and current limitations of LMMs in embodied reasoning, pointing towards key challenges and opportunities for future research in robot intelligence. Our data and code will be made publicly available.

[69] CoDoL: Conditional Domain Prompt Learning for Out-of-Distribution Generalization

Min Zhang,Bo Jiang,Jie Zhou,Yimeng Liu,Xin Lin

Main category: cs.CV

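TL;DR: This paper proposes Conditional Domain prompt Learning (CoDoL), which uses domain information to form prompts and a lightweight Domain Meta Network (DMN) to improve vision-language embedding alignment and OOD generalization.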

Details Motivation: Existing prompt-based CLIP methods suffer from inaccurate text descriptions and limited vision-language embedding alignment, which hurt their accuracy and robustness in out-of-distribution (OOD) settings. Method: Proposes Conditional Domain prompt Learning (CoDoL), which builds prompts from available domain information, and designs a lightweight Domain Meta Network (DMN) that generates input-conditional tokens to capture both instance-specific and domain-specific information. Result: Experiments on four OOD benchmarks (PACS, VLCS, OfficeHome, and DigitDG) show that CoDoL improves vision-language embedding alignment and out-of-distribution generalization. Conclusion: By introducing domain-aware prompt learning and dynamic token generation, CoDoL markedly improves the robustness and generalization of CLIP models in OOD settings. Abstract: Recent advances in pre-training vision-language models (VLMs), e.g., contrastive language-image pre-training (CLIP) methods, have shown great potential in learning out-of-distribution (OOD) representations. Despite showing competitive performance, the prompt-based CLIP methods still suffer from: i) inaccurate text descriptions, which lead to degraded accuracy and robustness and pose a challenge for zero-shot CLIP methods; and ii) limited vision-language embedding alignment, which significantly affects the generalization performance. To tackle the above issues, this paper proposes a novel Conditional Domain prompt Learning (CoDoL) method, which utilizes readily-available domain information to form prompts and improves the vision-language embedding alignment for improving OOD generalization. To capture both instance-specific and domain-specific information, we further propose a lightweight Domain Meta Network (DMN) to generate input-conditional tokens for images in each domain. Extensive experiments on four OOD benchmarks (PACS, VLCS, OfficeHome and DigitDG) validate the effectiveness of our proposed CoDoL in terms of improving the vision-language embedding alignment as well as the out-of-distribution generalization performance.

[70] Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception

Yulin Wang,Yang Yue,Yang Yue,Huanqian Wang,Haojun Jiang,Yizeng Han,Zanlin Ni,Yifan Pu,Minglei Shi,Rui Lu,Qisen Yang,Andrew Zhao,Zhuofan Xia,Shiji Song,Gao Huang

Main category: cs.CV

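TL;DR: Proposes AdaptiveNN, a framework that shifts vision models from "passive" to "active, adaptive" perception, using a coarse-to-fine decision process to achieve efficient, flexible, and interpretable computer vision.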

Details Motivation: Existing machine vision models passively process entire scenes, which consumes substantial resources and makes it hard to meet the demands of complex tasks and real-world applications. Method: Models visual perception as a coarse-to-fine sequential decision-making process and combines representation learning with self-rewarding reinforcement learning, enabling end-to-end training without additional supervision on fixation locations. Result: Validated on 17 benchmarks, AdaptiveNN reduces inference cost by up to 28x without losing accuracy, adapts to resource budgets, and exhibits human-like perceptual behavior. Conclusion: AdaptiveNN achieves efficient, flexible, and interpretable visual understanding and shows potential for modeling human visual cognition. Abstract: Human vision is highly adaptive, efficiently sampling intricate environments by sequentially fixating on task-relevant regions. In contrast, prevailing machine vision models passively process entire scenes at once, resulting in excessive resource demands scaling with spatial-temporal input resolution and model size, yielding critical limitations impeding both future advancements and real-world application. Here we introduce AdaptiveNN, a general framework aiming to drive a paradigm shift from 'passive' to 'active, adaptive' vision models. AdaptiveNN formulates visual perception as a coarse-to-fine sequential decision-making process, progressively identifying and attending to regions pertinent to the task, incrementally combining information across fixations, and actively concluding observation when sufficient. We establish a theory integrating representation learning with self-rewarding reinforcement learning, enabling end-to-end training of the non-differentiable AdaptiveNN without additional supervision on fixation locations. We assess AdaptiveNN on 17 benchmarks spanning 9 tasks, including large-scale visual recognition, fine-grained discrimination, visual search, processing images from real driving and medical scenarios, language-driven embodied AI, and side-by-side comparisons with humans. AdaptiveNN achieves up to 28x inference cost reduction without sacrificing accuracy, flexibly adapts to varying task demands and resource budgets without retraining, and provides enhanced interpretability via its fixation patterns, demonstrating a promising avenue toward efficient, flexible, and interpretable computer vision. Furthermore, AdaptiveNN exhibits closely human-like perceptual behaviors in many cases, revealing its potential as a valuable tool for investigating visual cognition. Code is available at https://github.com/LeapLabTHU/AdaptiveNN.

[71] LowDiff: Efficient Diffusion Sampling with Low-Resolution Condition

Jiuyi Xu,Qing Jin,Meida Chen,Andrew Feng,Yang Sui,Yangming Shi

Main category: cs.CV

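TL;DR: Proposes LowDiff, an efficient cascaded diffusion framework that generates images progressively from low to high resolution, reducing the number of high-resolution sampling steps while maintaining or even improving generation quality and substantially improving inference efficiency.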

Details Motivation: Diffusion models excel at image generation, but slow sampling limits their practical use. Prior work focuses mainly on compressing models or reducing the number of denoising steps and overlooks the potential of multi-resolution generation, so the authors explore multi-resolution generation as a new route to efficiency. Method: The LowDiff framework uses a unified model that generates progressively higher-resolution images in a cascade starting from low resolution, combined with specific architecture design and generation techniques that reduce the sampling steps needed at high resolution. The method applies to diffusion models in both pixel space and latent space. Result: Experiments on CIFAR-10, FFHQ, and ImageNet confirm the method's effectiveness and generality, with over 50% throughput improvement at comparable or better generation quality. For example, unconditional CIFAR-10 reaches an FID of 2.11 and an IS of 9.87, and ImageNet 256x256 reaches an FID of 4.00 and an IS of 195.06. Conclusion: By introducing a multi-resolution cascaded generation strategy, LowDiff substantially improves the generation efficiency of diffusion models while maintaining high-quality output, offering a general and effective solution for efficient image generation. Abstract: Diffusion models have achieved remarkable success in image generation but their practical application is often hindered by the slow sampling speed. Prior efforts to improve efficiency primarily focus on compressing models or reducing the total number of denoising steps, largely neglecting the possibility to leverage multiple input resolutions in the generation process. In this work, we propose LowDiff, a novel and efficient diffusion framework based on a cascaded approach by generating increasingly higher resolution outputs. Besides, LowDiff employs a unified model to progressively refine images from low resolution to the desired resolution. With the proposed architecture design and generation techniques, we achieve comparable or even superior performance with far fewer high-resolution sampling steps. LowDiff is applicable to diffusion models in both pixel space and latent space. Extensive experiments on both conditional and unconditional generation tasks across CIFAR-10, FFHQ and ImageNet demonstrate the effectiveness and generality of our method. Results show over 50% throughput improvement across all datasets and settings while maintaining comparable or better quality. On unconditional CIFAR-10, LowDiff achieves an FID of 2.11 and IS of 9.87, while on conditional CIFAR-10, an FID of 1.94 and IS of 10.03. On FFHQ 64x64, LowDiff achieves an FID of 2.43, and on ImageNet 256x256, LowDiff built on LightningDiT-B/1 produces high-quality samples with an FID of 4.00 and an IS of 195.06, together with substantial efficiency gains.

[72] MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

Yu Chang,Jiahao Chen,Anzhe Cheng,Paul Bogdan

Main category: cs.CV

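TL;DR: Proposes MaskAttn-SDXL, which applies region-level gating to the cross-attention logits of Stable Diffusion XL to improve compositional accuracy for multiple objects, attributes, and spatial relations in text-to-image generation.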

Details Motivation: Existing text-to-image diffusion models often suffer compositional failures on prompts containing multiple objects, attributes, and spatial relations, such as entity entanglement, attribute mixing, and violated spatial cues. Method: MaskAttn-SDXL introduces a learned binary mask per layer on the cross-attention logits of SDXL's UNet, sparsifying token-to-latent interactions before the softmax so that only semantically relevant connections remain. The method needs no positional encodings, auxiliary tokens, or external region masks, leaves the inference path unchanged, and adds negligible overhead. Result: The model markedly improves spatial compliance and attribute binding on multi-object prompts while preserving image quality and diversity. Conclusion: Logit-level masked cross-attention is a data-efficient primitive for compositional control, and MaskAttn-SDXL offers a practical extension for spatial control in text-to-image generation. Abstract: Text-to-image diffusion models achieve impressive realism but often suffer from compositional failures on prompts with multiple objects, attributes, and spatial relations, resulting in cross-token interference where entities entangle, attributes mix across objects, and spatial cues are violated. To address these failures, we propose MaskAttn-SDXL, a region-level gating mechanism applied to the cross-attention logits of the UNet in Stable Diffusion XL (SDXL). MaskAttn-SDXL learns a binary mask per layer, injecting it into each cross-attention logit map before softmax to sparsify token-to-latent interactions so that only semantically relevant connections remain active. The method requires no positional encodings, auxiliary tokens, or external region masks, and preserves the original inference path with negligible overhead. In practice, our model improves spatial compliance and attribute binding in multi-object prompts while preserving overall image quality and diversity. These findings demonstrate that logit-level masked cross-attention is a data-efficient primitive for enforcing compositional control, and our method thus serves as a practical extension for spatial control in text-to-image generation.
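
The gating step in this abstract can be illustrated with a small sketch. It assumes the binary mask is applied by setting gated logits to negative infinity before the softmax; the abstract does not state the exact injection rule, and the function name `masked_softmax` and the toy values are hypothetical.

```python
import math

def masked_softmax(logits, mask):
    """Softmax over one row of cross-attention logits with a binary mask.

    logits: one float per text token (toy values, not from the paper).
    mask:   0/1 flags; 0 removes that token-to-latent connection.
    Assumption: gating replaces masked logits with -inf before softmax.
    """
    gated = [x if m else float("-inf") for x, m in zip(logits, mask)]
    peak = max(gated)
    if peak == float("-inf"):
        raise ValueError("mask must keep at least one token")
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    exps = [math.exp(x - peak) for x in gated]
    total = sum(exps)
    return [e / total for e in exps]

# The second token is gated out, so its attention weight is exactly zero.
weights = masked_softmax([2.0, 1.0, 0.5], [1, 0, 1])
```

Masking at the logit level, rather than zeroing weights after the softmax, keeps the remaining weights normalized without a second renormalization pass.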

[73] RaceGAN: A Framework for Preserving Individuality while Converting Racial Information for Image-to-Image Translation

Mst Tasnim Pervin,George Bebis,Fang Jiang,Alireza Tavakkoli

Main category: cs.CV

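TL;DR: Proposes RaceGAN, a new multi-domain image-to-image translation framework that translates racial traits without a reference image while preserving individuality and high-level semantics.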

Details Motivation: Existing multi-domain image translation models such as StarGANv2 and StyleGAN enable style mapping but do not preserve individuality and require an extra reference image. The study aims to address these problems, particularly for racial attribute translation. Method: The proposed RaceGAN framework maps style codes across multiple domains while preserving individuality and high-level semantics, without a reference image. Its effectiveness is validated through experiments on the Chicago Face Dataset, with quantitative analysis using InceptionResNetv2-based classification. Result: On the Chicago Face Dataset, RaceGAN outperforms other models in translating Asian, White, and Black racial features. The study also examines how well the model partitions the latent space into face clusters for each ethnic group. Conclusion: RaceGAN achieves multi-domain image-to-image translation that preserves individuality and high-level semantics during racial attribute translation without a reference image, demonstrating its potential for practical applications. Abstract: Generative adversarial networks (GANs) have demonstrated significant progress in unpaired image-to-image translation in recent years for several applications. CycleGAN was the first to lead the way, although it was restricted to a pair of domains. StarGAN overcame this constraint by tackling image-to-image translation across various domains, although it was not able to map in-depth low-level style changes for these domains. Style mapping via reference-guided image synthesis has been made possible by the innovations of StarGANv2 and StyleGAN. However, these models do not maintain individuality and need an extra reference image in addition to the input. Our study aims to translate racial traits by means of multi-domain image-to-image translation. We present RaceGAN, a novel framework capable of mapping style codes over several domains during racial attribute translation while maintaining individuality and high-level semantics without relying on a reference image. RaceGAN outperforms other models in translating racial features (i.e., Asian, White, and Black) when tested on the Chicago Face Dataset. We also give quantitative findings utilizing InceptionResNetv2-based classification to demonstrate the effectiveness of our racial translation. Moreover, we investigate how well the model partitions the latent space into distinct clusters of faces for each ethnic group.

[74] Generating Part-Based Global Explanations Via Correspondence

Kunal Rathore,Prasad Tadepalli

Main category: cs.CV

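TL;DR: Proposes a method that takes user-defined part labels from a limited set of images and efficiently transfers them to a larger dataset to generate global symbolic explanations, providing human-understandable explanations of deep learning model decisions at scale.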

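Details Motivation: Deep learning models are typically opaque, and existing concept-based explanation methods require extensive annotations, which incurs high labeling cost. Method: User-defined part labels from a small set of images are transferred to a larger dataset, and part-based local explanations are aggregated into global symbolic explanations. Result: The approach yields human-understandable global explanations of model decisions on large-scale data, reducing annotation cost and making explanations more scalable. Conclusion: The method lowers annotation cost while generating interpretable global model explanations, suiting the transparency needs of large-scale deep learning models. Abstract: Deep learning models are notoriously opaque. Existing explanation methods often focus on localized visual explanations for individual images. Concept-based explanations, while offering global insights, require extensive annotations, incurring significant labeling cost. We propose an approach that leverages user-defined part labels from a limited set of images and efficiently transfers them to a larger dataset. This enables the generation of global symbolic explanations by aggregating part-based local explanations, ultimately providing human-understandable explanations for model decisions on a large scale.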

[75] Causal Fingerprints of AI Generative Models

Hui Xu,Chi Liu,Congcong Zhu,Minghao Wang,Youyang Qu,Longxiang Gao

Main category: cs.CV

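TL;DR: Proposes a causality-decoupling framework for extracting generative model fingerprints. By separating image content from style, it captures a model's causal fingerprint in a semantic-invariant latent space, improving cross-model generalization and fingerprint granularity; experiments show it outperforms existing methods in model attribution and anonymization.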

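Details Motivation: Existing generative model fingerprinting methods rely on model-specific cues or synthesis artifacts, generalize poorly across models, and leave the causal relationship between image provenance and model traces largely unexplored. Method: The paper introduces the concept of a causal fingerprint and builds a causality-decoupling framework that uses the reconstruction residual of a pre-trained diffusion model to construct a semantic-invariant latent space, separating content, style, and the causal fingerprint, and enriches fingerprint detail with diverse feature representations. Result: Attribution performance is validated on a range of GANs and diffusion models, achieving high-accuracy model attribution, and source anonymization is achieved with counterfactual examples; the fingerprints generalize well and are fine-grained. Conclusion: The proposed causal fingerprint framework captures the essential traces of generative models more effectively and substantially improves model attribution, with significance for forgery detection, model copyright tracing, and identity protection. Abstract: AI generative models leave implicit traces in their generated images, which are commonly referred to as model fingerprints and are exploited for source attribution. Prior methods rely on model-specific cues or synthesis artifacts, yielding limited fingerprints that may generalize poorly across different generative models. We argue that a complete model fingerprint should reflect the causality between image provenance and model traces, a direction largely unexplored. To this end, we conceptualize the causal fingerprint of generative models, and propose a causality-decoupling framework that disentangles it from image-specific content and style in a semantic-invariant latent space derived from pre-trained diffusion reconstruction residual. We further enhance fingerprint granularity with diverse feature representations. We validate causality by assessing attribution performance across representative GANs and diffusion models and by achieving source anonymization using counterfactual examples generated from causal fingerprints. Experiments show our approach outperforms existing methods in model attribution, indicating strong potential for forgery detection, model copyright tracing, and identity protection.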

[76] NeuroRAD-FM: A Foundation Model for Neuro-Oncology with Distributionally Robust Training

Moinak Bhattacharya,Angelica P. Kurtz,Fabio M. Iwamoto,Prateek Prasanna,Gagandeep Singh

Main category: cs.CV

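TL;DR: Develops a foundation model specific to neuro-oncology that incorporates distributionally robust optimization (DRO), improving the accuracy and generalization of molecular marker prediction and survival analysis on multi-center brain tumor MRI data.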

Details Motivation: Existing foundation models generalize poorly in neuro-oncology because of data heterogeneity and the difficulty of predicting rare molecular markers, so better cross-institution consistency and recognition of uncommon markers are needed. Method: Foundation models are pretrained with self-supervised learning methods (BYOL, DINO, MAE, MoCo) on multi-center brain tumor MRI, and distributionally robust optimization (DRO) is introduced to mitigate site bias and class imbalance. The models are applied to common and uncommon molecular classification, continuous marker estimation, and survival prediction. Result: Molecular prediction improved across centers. At CUIMC, mean balanced accuracy rose from 0.744 to 0.785 and AUC also improved, with the largest gains on uncommon markers such as CDKN2A/2B and ATRX. The survival c-index improved at every center, and Grad-CAM showed the model attending to tumor and peri-tumoral regions, supporting interpretability. Conclusion: Foundation models combined with DRO produce more stable cross-site representations and improve prediction of common and uncommon molecular markers as well as survival discrimination; prospective validation and integration of longitudinal and interventional signals are needed to advance precision neuro-oncology. Abstract: Neuro-oncology poses unique challenges for machine learning due to heterogeneous data and tumor complexity, limiting the ability of foundation models (FMs) to generalize across cohorts. Existing FMs also perform poorly in predicting uncommon molecular markers, which are essential for treatment response and risk stratification. To address these gaps, we developed a neuro-oncology specific FM with a distributionally robust loss function, enabling accurate estimation of tumor phenotypes while maintaining cross-institution generalization. We pretrained self-supervised backbones (BYOL, DINO, MAE, MoCo) on multi-institutional brain tumor MRI and applied distributionally robust optimization (DRO) to mitigate site and class imbalance. Downstream tasks included molecular classification of common markers (MGMT, IDH1, 1p/19q, EGFR), uncommon alterations (ATRX, TP53, CDKN2A/2B, TERT), continuous markers (Ki-67, TP53), and overall survival prediction in IDH1 wild-type glioblastoma at UCSF, UPenn, and CUIMC. Our method improved molecular prediction and reduced site-specific embedding differences. At CUIMC, mean balanced accuracy rose from 0.744 to 0.785 and AUC from 0.656 to 0.676, with the largest gains for underrepresented endpoints (CDKN2A/2B accuracy 0.86 to 0.92, AUC 0.73 to 0.92; ATRX AUC 0.69 to 0.82; Ki-67 accuracy 0.60 to 0.69). For survival, c-index improved at all sites: CUIMC 0.592 to 0.597, UPenn 0.647 to 0.672, UCSF 0.600 to 0.627. Grad-CAM highlighted tumor and peri-tumoral regions, confirming interpretability. Overall, coupling FMs with DRO yields more site-invariant representations, improves prediction of common and uncommon markers, and enhances survival discrimination, underscoring the need for prospective validation and integration of longitudinal and interventional signals to advance precision neuro-oncology.
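
As a rough illustration of what a distributionally robust objective does, the sketch below computes a worst-group loss, one common DRO formulation (group DRO). This is an assumption: the abstract does not say which DRO variant the authors use, and the group names and loss values are made-up toy inputs, not results from the paper.

```python
from collections import defaultdict

def worst_group_loss(losses, groups):
    """Return (worst_group, mean_loss_of_that_group).

    losses: per-sample loss values (toy numbers).
    groups: per-sample group label, e.g. an acquisition site or a class.
    Assumption: group DRO, where training minimizes the worst group's
    mean loss instead of the overall mean.
    """
    buckets = defaultdict(list)
    for loss, group in zip(losses, groups):
        buckets[group].append(loss)
    means = {g: sum(v) / len(v) for g, v in buckets.items()}
    worst = max(means, key=means.get)
    return worst, means[worst]

# Hypothetical per-sample losses from three sites.
losses = [0.2, 0.4, 0.9, 1.1, 0.3]
sites = ["site_a", "site_a", "site_b", "site_b", "site_c"]
worst_site, robust_loss = worst_group_loss(losses, sites)
average_loss = sum(losses) / len(losses)
```

The plain average hides the poorly served site, while the worst-group objective is driven entirely by it, which is why this family of losses is used against site and class imbalance.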

[77] ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models

Chung-En Johnny Yu,Hsuan-Chih Chen,Brian Jalaian,Nathaniel D. Bastian

Main category: cs.CV

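TL;DR: ORCA is an agentic reasoning framework that improves the factual accuracy and adversarial robustness of large vision-language models through test-time structured reasoning with a set of small vision models, without retraining or access to model internals.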

Details Motivation: Large vision-language models (LVLMs) are vulnerable in practice to hallucinations caused by intrinsic errors and to external adversarial attacks, which limits their reliability, so a method is needed to improve their robustness and accuracy. Method: ORCA runs an Observe-Reason-Critique-Act loop that uses multiple small vision models to answer evidential questions, detects cross-model inconsistencies, refines predictions iteratively, and records intermediate reasoning to support auditable decisions. Result: On the POPE hallucination benchmark, ORCA improves LVLM performance by 3.64% to 40.67%; under adversarial perturbations it raises average accuracy by 20.11%; combined with defense techniques on AMBER data, gains range from 1.20% to 48.00%. Conclusion: ORCA offers an effective path toward more reliable and robust multimodal systems, substantially improving LVLM resistance to hallucination and attack without relying on model internals or retraining. Abstract: Large Vision-Language Models (LVLMs) exhibit strong multimodal capabilities but remain vulnerable to hallucinations from intrinsic errors and adversarial attacks from external exploitations, limiting their reliability in real-world applications. We present ORCA, an agentic reasoning framework that improves the factual accuracy and adversarial robustness of pretrained LVLMs through test-time structured inference reasoning with a suite of small vision models (less than 3B parameters). ORCA operates via an Observe-Reason-Critique-Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. ORCA also stores intermediate reasoning traces, which supports auditable decision-making. Though designed primarily to mitigate object-level hallucinations, ORCA also exhibits emergent adversarial robustness without requiring adversarial training or defense mechanisms. We evaluate ORCA across three settings: (1) clean images on hallucination benchmarks, (2) adversarially perturbed images without defense, and (3) adversarially perturbed images with defense applied. On the POPE hallucination benchmark, ORCA improves standalone LVLM performance by +3.64% to +40.67% across different subsets. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20% to +48.00% across evaluation metrics. These results demonstrate that ORCA offers a promising path toward building more reliable and robust multimodal systems.

[78] Region-Aware Deformable Convolutions

Abolfazl Saheban Maleki,Maryam Imani

Main category: cs.CV

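TL;DR: Proposes RAD-Conv, a new convolutional operator that dynamically adjusts the shape and size of the receptive field to help networks adapt to complex image structures.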

Details Motivation: Traditional deformable convolutions are limited to fixed quadrilateral sampling areas and cannot flexibly adapt to structural variation in image content. Method: Four boundary offsets per kernel element define rectangular regions whose size and shape can be adjusted flexibly, decoupling the receptive field's shape from the kernel's structure. Result: The operator gives precise control over the receptive field's width and height and captures both local details and long-range dependencies even with small 1x1 kernels. Conclusion: RAD-Conv combines the flexibility of attention mechanisms with the efficiency of standard convolutions, offering a practical way to build more efficient and expressive vision models. Abstract: We introduce Region-Aware Deformable Convolution (RAD-Conv), a new convolutional operator that enhances neural networks' ability to adapt to complex image structures. Unlike traditional deformable convolutions, which are limited to fixed quadrilateral sampling areas, RAD-Conv uses four boundary offsets per kernel element to create flexible, rectangular regions that dynamically adjust their size and shape to match image content. This approach allows precise control over the receptive field's width and height, enabling the capture of both local details and long-range dependencies, even with small 1x1 kernels. By decoupling the receptive field's shape from the kernel's structure, RAD-Conv combines the adaptability of attention mechanisms with the efficiency of standard convolutions. This innovative design offers a practical solution for building more expressive and efficient vision models, bridging the gap between rigid convolutional architectures and computationally costly attention-based methods.

[79] CAGE: Continuity-Aware edGE Network Unlocks Robust Floorplan Reconstruction

Yiyi Liu,Chunyang Liu,Weiqin Jiao,Bojian Wu,Fashuai Li,Biao Xiong

Main category: cs.CV

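TL;DR: Proposes CAGE (Continuity-Aware edGE), a network that reconstructs vector floorplans directly from point-cloud density maps. Its edge-centric representation improves robustness to noise and incomplete observations and keeps room boundaries topologically consistent. A dual-query transformer decoder gives fast convergence and stable optimization, and the method reaches state-of-the-art performance on multiple datasets.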

Details Motivation: Traditional corner-based polygon representations are sensitive to noise and missing data and tend to produce fragmented or implausible layouts; existing line grouping methods are more robust but struggle to recover fine geometric detail. A more robust representation that preserves geometric continuity is therefore needed. Method: The paper proposes a native edge-centric formulation that represents each wall segment as a directed, geometrically continuous edge, and designs a dual-query transformer decoder that combines perturbed and latent queries for structure prediction within a denoising framework, enabling coherent floorplan inference. Result: The method achieves state-of-the-art results on Structured3D and SceneCAD, with F1 scores of 99.1% (rooms), 91.7% (corners), and 89.3% (angles), and shows strong cross-dataset generalization. Conclusion: With its edge-centric representation and dual-query transformer architecture, CAGE addresses the structural breaks and topological errors of traditional methods under noise and achieves accurate, topologically consistent floorplan reconstruction. Abstract: We present CAGE (Continuity-Aware edGE) network, a robust framework for reconstructing vector floorplans directly from point-cloud density maps. Traditional corner-based polygon representations are highly sensitive to noise and incomplete observations, often resulting in fragmented or implausible layouts. Recent line grouping methods leverage structural cues to improve robustness but still struggle to recover fine geometric details. To address these limitations, we propose a native edge-centric formulation, modeling each wall segment as a directed, geometrically continuous edge. This representation enables inference of coherent floorplan structures, ensuring watertight, topologically valid room boundaries while improving robustness and reducing artifacts. Towards this design, we develop a dual-query transformer decoder that integrates perturbed and latent queries within a denoising framework, which not only stabilizes optimization but also accelerates convergence. Extensive experiments on Structured3D and SceneCAD show that CAGE achieves state-of-the-art performance, with F1 scores of 99.1% (rooms), 91.7% (corners), and 89.3% (angles). The method also demonstrates strong cross-dataset generalization, underscoring the efficacy of our architectural innovations. Code and pretrained models will be released upon acceptance.

[80] Self-supervised learning of imaging and clinical signatures using a multimodal joint-embedding predictive architecture

Thomas Z. Li,Aravind R. Krishnan,Lianrui Zuo,John M. Still,Kim L. Sandler,Fabien Maldonado,Thomas A. Lasko,Bennett A. Landman

Main category: cs.CV

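TL;DR: Uses a self-supervised joint-embedding predictive architecture (JEPA) to pretrain on unlabeled longitudinal, multimodal medical data for pulmonary nodule diagnosis. The model outperforms existing models on an internal dataset but performs worse on an external dataset, revealing both its advantages and its limitations.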

Details Motivation: Progress on multimodal pulmonary nodule diagnosis models is limited by scarce labeled data and by the models' tendency to overfit the training distribution. Method: A joint embedding predictive architecture (JEPA) is pretrained in a self-supervised way on unlabeled CT scans and electronic health records from the authors' hospital, followed by supervised finetuning. Result: In the internal validation cohort the method reaches an AUC of 0.91, ahead of the multimodal model (0.88) and the imaging-only model (0.73), but it does worse than the imaging-only model in the external cohort (0.72 vs. 0.75). The study also builds a synthetic environment to analyze the settings in which JEPA may fail. Conclusion: The method shows the potential of unlabeled multimodal medical archives for improving predictive models, but also exposes weak generalization to data from other institutions. Abstract: The development of multimodal models for pulmonary nodule diagnosis is limited by the scarcity of labeled data and the tendency for these models to overfit on the training distribution. In this work, we leverage self-supervised learning from longitudinal and multimodal archives to address these challenges. We curate an unlabeled set of patients with CT scans and linked electronic health records from our home institution to power joint embedding predictive architecture (JEPA) pretraining. After supervised finetuning, we show that our approach outperforms an unregularized multimodal model and imaging-only model in an internal cohort (ours: 0.91, multimodal: 0.88, imaging-only: 0.73 AUC), but underperforms in an external cohort (ours: 0.72, imaging-only: 0.75 AUC). We develop a synthetic environment that characterizes the context in which JEPA may underperform. This work innovates an approach that leverages unlabeled multimodal medical archives to improve predictive models and demonstrates its advantages and limitations in pulmonary nodule diagnosis.

[81] Efficient Multimodal Dataset Distillation via Generative Models

Zhenghao Zhao,Haoxuan Wang,Junyi Wu,Yuzhang Shang,Gaowen Liu,Yan Yan

Main category: cs.CV

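TL;DR: Proposes EDGE, an efficient multimodal dataset distillation method that uses a generative model with a bi-directional contrastive loss and a diversity loss to address image-text correlation and sample diversity, improves text-to-image retrieval performance, and runs 18x faster than the existing state-of-the-art method.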

Details Motivation: Existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which is computationally expensive and slow, making it hard to process large-scale multimodal data efficiently. Method: EDGE adopts a generative distillation framework with a bi-directional contrastive loss and a diversity loss to strengthen image-text correlation and the diversity of generated samples, and adds a new caption synthesis strategy to improve text-to-image retrieval performance. Result: Experiments on Flickr30K, COCO, and CC3M show better efficiency and performance than existing methods, with distillation 18x faster. Conclusion: EDGE provides an efficient, scalable approach to multimodal dataset distillation that substantially reduces compute cost and time and suits the training needs of large language and vision models. Abstract: Dataset distillation aims to synthesize a small dataset from a large dataset, enabling the model trained on it to perform well on the original dataset. With the blooming of large language models and multimodal large language models, the importance of multimodal datasets, particularly image-text datasets, has grown significantly. However, existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which significantly increases the computing resource requirement, and takes days to process the distillation. In this work, we introduce EDGE, a generative distillation method for efficient multimodal dataset distillation. Specifically, we identify two key challenges of distilling multimodal datasets with generative models: 1) The lack of correlation between generated images and captions. 2) The lack of diversity among generated samples. To address the aforementioned issues, we propose a novel generative model training workflow with a bi-directional contrastive loss and a diversity loss. Furthermore, we propose a caption synthesis strategy to further improve text-to-image retrieval performance by introducing more text information. Our method is evaluated on Flickr30K, COCO, and CC3M datasets, demonstrating superior performance and efficiency compared to existing approaches. Notably, our method achieves results 18x faster than the state-of-the-art method.
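
To make the idea of a bi-directional contrastive loss concrete, here is a minimal sketch of a symmetric InfoNCE loss over an image-caption similarity matrix, in the style popularized by CLIP. This is an assumed stand-in: the abstract does not give EDGE's actual loss, and the function names, the temperature value, and the toy matrices are all hypothetical.

```python
import math

def cross_entropy_rows(sim, temperature):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        scaled = [s / temperature for s in row]
        peak = max(scaled)
        # Numerically stable log-sum-exp of the scaled row.
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_norm - scaled[i]
    return total / len(sim)

def bidirectional_contrastive_loss(sim, temperature=0.1):
    """sim[i][j] is the similarity of image i and caption j (square matrix).

    Averages the image-to-text loss (rows) and the text-to-image loss
    (columns), so matched pairs are pulled together in both directions.
    """
    transposed = [list(col) for col in zip(*sim)]
    image_to_text = cross_entropy_rows(sim, temperature)
    text_to_image = cross_entropy_rows(transposed, temperature)
    return 0.5 * (image_to_text + text_to_image)

# Toy similarities: matched pairs on the diagonal vs. mismatched pairs.
aligned = [[0.9, 0.1], [0.2, 0.8]]
shuffled = [[0.1, 0.9], [0.8, 0.2]]
loss_aligned = bidirectional_contrastive_loss(aligned)
loss_shuffled = bidirectional_contrastive_loss(shuffled)
```

A loss of this shape is low only when each generated image is most similar to its own caption and vice versa, which is the correlation property the abstract says generated pairs otherwise lack.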

[82] OpenViGA: Video Generation for Automotive Driving Scenes by Streamlining and Fine-Tuning Open Source Models with Public Data

Björn Möller,Zhengyang Li,Malte Stelzer,Thomas Graave,Fabian Bettels,Muaaz Ataya,Tim Fingscheidt

Main category: cs.CV

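TL;DR: Presents OpenViGA, an open video generation system for automotive driving scenes that builds on pre-trained models and public data (BDD100K) to deliver reproducible, low-latency, high-quality video prediction.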

Details Motivation: Existing video generation systems depend on large models, consume substantial resources, and lack transparency and public code and data, which limits reproduction and improvement. Method: The system is modular, with separate quantitative and qualitative evaluation of the image tokenizer, world model, and video decoder. It builds on open-source pre-trained models from several domains, fine-tunes them on BDD100K data using academic-scale GPU hardware, and streamlines the component interfaces to form a complete system. Result: At 256x256 resolution and 4 fps, the system generates realistic driving scene videos frame by frame with only one frame of algorithmic latency, and the code and models are fully open, making the work reproducible. Conclusion: By integrating open-source models and data, OpenViGA provides an efficient, transparent, and reproducible video generation framework that supports open research on driving simulation and video prediction. Abstract: Recent successful video generation systems that predict and create realistic automotive driving scenes from short video inputs assign tokenization, future state prediction (world model), and video decoding to dedicated models. These approaches often utilize large models that require significant training resources, offer limited insight into design choices, and lack publicly available code and datasets. In this work, we address these deficiencies and present OpenViGA, an open video generation system for automotive driving scenes. Our contributions are as follows. First, unlike several earlier works for video generation, such as GAIA-1, we provide a deep analysis of the three components of our system by separate quantitative and qualitative evaluation: image tokenizer, world model, and video decoder. Second, we purely build upon powerful pre-trained open source models from various domains, which we fine-tune on publicly available automotive data (BDD100K) on GPU hardware at academic scale. Third, we build a coherent video generation system by streamlining interfaces of our components. Fourth, due to public availability of the underlying models and data, we allow full reproducibility. Finally, we also publish our code and models on GitHub. For an image size of 256x256 at 4 fps we are able to predict realistic driving scene videos frame-by-frame with only one frame of algorithmic latency.

[83] Comparing Computational Pathology Foundation Models using Representational Similarity Analysis

Vaibhav Mishra,William Lotter

Main category: cs.CV

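TL;DR: Systematically analyzes the representational spaces of six computational pathology foundation models and finds that the models differ markedly in representational structure and that sharing a training paradigm does not guarantee similar representations. Model representations show high slide-dependence and low disease-dependence, stain normalization reduces slide-dependence, and vision-language models have more compact representations than vision-only models.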

Details Motivation: To understand the structure and variability of the representations learned by computational pathology foundation models, in order to guide model improvement, ensembling strategies, and deployment. Method: Representational similarity analysis, a technique from computational neuroscience, is applied to six foundation models (covering vision-language contrastive learning and self-distillation) using H&E image patches from TCGA, and the effect of stain normalization is evaluated. Result: UNI2 and Virchow2 have the most distinct representational structures, while Prov-GigaPath has the highest average similarity to the other models. Sharing a training paradigm (vision-only vs. vision-language) does not lead to higher representational similarity. All models show high slide-dependence and low disease-dependence; stain normalization reduces slide-dependence by 5.5% to 20.5%. Vision-language models have more compact representations, and vision-only models have more distributed ones. Conclusion: Training paradigm shapes representational structure but does not determine cross-model similarity; reducing slide-specific bias should improve robustness; and the analysis framework extends to other medical imaging domains to support foundation model development and validation. Abstract: Foundation models are increasingly developed in computational pathology (CPath) given their promise in facilitating many downstream tasks. While recent studies have evaluated task performance across models, less is known about the structure and variability of their learned representations. Here, we systematically analyze the representational spaces of six CPath foundation models using techniques popularized in computational neuroscience. The models analyzed span vision-language contrastive learning (CONCH, PLIP, KEEP) and self-distillation (UNI (v2), Virchow (v2), Prov-GigaPath) approaches. Through representational similarity analysis using H&E image patches from TCGA, we find that UNI2 and Virchow2 have the most distinct representational structures, whereas Prov-GigaPath has the highest average similarity across models. Having the same training paradigm (vision-only vs. vision-language) did not guarantee higher representational similarity. The representations of all models showed a high slide-dependence, but relatively low disease-dependence. Stain normalization decreased slide-dependence for all models by a range of 5.5% (CONCH) to 20.5% (PLIP). In terms of intrinsic dimensionality, vision-language models demonstrated relatively compact representations, compared to the more distributed representations of vision-only models. These findings highlight opportunities to improve robustness to slide-specific features, inform model ensembling strategies, and provide insights into how training paradigms shape model representations. Our framework is extendable across medical imaging domains, where probing the internal representations of foundation models can help ensure effective development and deployment.
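
For readers unfamiliar with representational similarity analysis, the sketch below shows the standard recipe in a generic form: build a representational dissimilarity matrix (RDM) per model from pairwise distances between stimulus embeddings, then correlate the RDMs of two models. The choices of 1 - Pearson r as the distance and Spearman correlation for the comparison are common conventions assumed here, not details confirmed by the abstract; the embeddings are toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rdm_upper_triangle(embeddings):
    """Pairwise dissimilarity (1 - Pearson r) between stimulus embeddings."""
    n = len(embeddings)
    return [1.0 - pearson(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        for k in order[start:end + 1]:
            result[k] = (start + end) / 2 + 1
        start = end + 1
    return result

def rsa_similarity(emb_a, emb_b):
    """Spearman correlation between two models' RDMs (same stimuli, same order)."""
    return pearson(ranks(rdm_upper_triangle(emb_a)),
                   ranks(rdm_upper_triangle(emb_b)))

# Toy embeddings of four image patches from three hypothetical models.
model_a = [[1.0, 2.0, 3.0], [1.0, 3.0, 2.0], [3.0, 1.0, 2.0], [2.0, 3.0, 1.0]]
model_b = [[2 * v + 1 for v in row] for row in model_a]  # same geometry, rescaled
model_c = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 3.0, 2.0], [2.0, 1.0, 3.0]]
same_geometry = rsa_similarity(model_a, model_b)
different_geometry = rsa_similarity(model_a, model_c)
```

Because the comparison runs on dissimilarity structure rather than raw features, models with different embedding dimensions or scales can be compared directly, as long as they are evaluated on the same stimuli.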

[84] SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters

Abdarahmane Traore,Éric Hervet,Andy Couturier

Main category: cs.CV

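TL;DR: Proposes SmolRGPT, a compact vision-language model that fuses RGB and depth information to achieve strong region-level spatial reasoning with only 600M parameters, matching or exceeding much larger models in resource-constrained settings such as warehouses.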

Details Motivation: Existing vision-language models are usually very large, with high compute and memory costs, which makes them hard to deploy in resource-constrained real-world settings such as warehousing and robotics; an efficient small model with strong spatial understanding is needed. Method: SmolRGPT uses a three-stage curriculum that progressively aligns visual and language features, builds spatial relationship understanding, and adapts to task-specific datasets, and it explicitly combines RGB and depth information for region-level spatial reasoning. Result: With only 600M parameters, SmolRGPT matches or exceeds much larger models on challenging warehouse spatial reasoning benchmarks. Conclusion: SmolRGPT shows that a small vision-language model can retain key spatial reasoning capabilities while remaining deployable, offering a practical option for multimodal intelligence in resource-constrained environments. Abstract: Recent advances in vision-language models (VLMs) have enabled powerful multimodal reasoning, but state-of-the-art approaches typically rely on extremely large models with prohibitive computational and memory requirements. This makes their deployment challenging in resource-constrained environments such as warehouses, robotics, and industrial applications, where both efficiency and robust spatial understanding are critical. In this work, we present SmolRGPT, a compact vision-language architecture that explicitly incorporates region-level spatial reasoning by integrating both RGB and depth cues. SmolRGPT employs a three-stage curriculum that progressively aligns visual and language features, enables spatial relationship understanding, and adapts to task-specific datasets. We demonstrate that with only 600M parameters, SmolRGPT achieves competitive results on challenging warehouse spatial reasoning benchmarks, matching or exceeding the performance of much larger alternatives. These findings highlight the potential for efficient, deployable multimodal intelligence in real-world settings without sacrificing core spatial reasoning capabilities. The code of the experimentation will be available at: https://github.com/abtraore/SmolRGPT

[85] Lynx: Towards High-Fidelity Personalized Video Generation

Shen Sang,Tiancheng Zhi,Tianpei Gu,Jing Liu,Linjie Luo

Main category: cs.CV

TL;DR: Lynx is a DiT-based personalized video synthesis model that achieves high-fidelity identity preservation through an ID-adapter and a Ref-adapter.

Details Motivation: In personalized video generation, preserving the identity in the input image while producing high-quality, temporally coherent video remains challenging. Method: Lynx adds two lightweight modules to a Diffusion Transformer: an ID-adapter that generates identity tokens from ArcFace embeddings, and a Ref-adapter that injects VAE features through a frozen reference pathway, achieving identity fidelity and detail preservation. Result: Evaluated on 800 test cases built from 40 subjects and 20 unbiased prompts, Lynx performs strongly on face resemblance, prompt following, and video quality. Conclusion: Lynx's dual-adapter design improves identity fidelity and visual quality in personalized video generation and advances the field. Abstract: We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter employs a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning, while the Ref-adapter integrates dense VAE features from a frozen reference pathway, injecting fine-grained details across all transformer layers through cross-attention. These modules collectively enable robust identity preservation while maintaining temporal coherence and visual realism. Through evaluation on a curated benchmark of 40 subjects and 20 unbiased prompts, which yielded 800 test cases, Lynx has demonstrated superior face resemblance, competitive prompt following, and strong video quality, thereby advancing the state of personalized video generation.

[86] Backdoor Mitigation via Invertible Pruning Masks

Kealan Dunnett,Reza Arablouei,Dimity Miller,Volkan Dedeoglu,Raja Jurdak

Main category: cs.CV

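TL;DR: Proposes a new pruning method based on a learnable selection mechanism and an invertible pruning mask that removes backdoors from deep models while preserving main-task performance.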

Details Motivation: Existing pruning methods struggle to identify and remove the specific parameters that induce backdoor behavior. Although fine-tuning defenses perform better, pruning offers better robustness and interpretability in low-data regimes and merits further study. Method: The proposed pruning method combines a learned selection mechanism with an invertible pruning mask and is formulated as a bi-level optimization problem: the inner problem uses the inverse mask to synthesize candidate triggers, and the outer problem refines the mask to suppress backdoor behavior while keeping main-task accuracy. Result: Experiments show the method outperforms existing pruning-based defenses, holds up when data is scarce, is competitive with state-of-the-art fine-tuning methods, and is particularly effective at restoring correct predictions for compromised samples. Conclusion: The invertible pruning framework improves the effectiveness of pruning for backdoor defense while retaining model simplicity and interpretability, making it a strong alternative to fine-tuning. Abstract: Model pruning has gained traction as a promising defense strategy against backdoor attacks in deep learning. However, existing pruning-based approaches often fall short in accurately identifying and removing the specific parameters responsible for inducing backdoor behaviors. Despite the dominance of fine-tuning-based defenses in recent literature, largely due to their superior performance, pruning remains a compelling alternative, offering greater interpretability and improved robustness in low-data regimes. In this paper, we propose a novel pruning approach featuring a learned selection mechanism to identify parameters critical to both main and backdoor tasks, along with an invertible pruning mask designed to simultaneously achieve two complementary goals: eliminating the backdoor task while preserving it through the inverse mask. We formulate this as a bi-level optimization problem that jointly learns selection variables, a sparse invertible mask, and sample-specific backdoor perturbations derived from clean data. The inner problem synthesizes candidate triggers using the inverse mask, while the outer problem refines the mask to suppress backdoor behavior without impairing clean-task accuracy. Extensive experiments demonstrate that our approach outperforms existing pruning-based backdoor mitigation approaches, maintains strong performance under limited data conditions, and achieves competitive results compared to state-of-the-art fine-tuning approaches. Notably, the proposed approach is particularly effective in restoring correct predictions for compromised samples after successful backdoor mitigation.

[87] MEC-Quant: Maximum Entropy Coding for Extremely Low Bit Quantization-Aware Training

Junbiao Pang,Tianyang Cai,Baochang Zhang

Main category: cs.CV

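TL;DR: Proposes MEC-Quant, a new quantization-aware training method that reduces quantization-induced bias by maximizing the entropy of the representation, reaching performance comparable to or better than full-precision models in extremely low-bit settings.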

Details Motivation: Existing quantization-aware training (QAT) methods introduce representation bias under extremely low-bit quantization, leaving performance below that of full-precision models; this paper aims to solve that problem. Method: Maximum Entropy Coding Quantization (MEC-Quant) uses the minimal coding length in lossy data coding as a computationally tractable surrogate for the entropy and derives a scalable objective based on Mixture Of Experts (MOE), enabling end-to-end training. Result: MEC-Quant is validated on a range of computer vision tasks; it achieves QAT with x-bit activation for the first time, with performance comparable to or better than full-precision models. Conclusion: By optimizing the structure of the representation, MEC-Quant reduces quantization bias, improves the generalization of low-bit quantized models, and sets a new state of the art for QAT. Abstract: Quantization-Aware Training (QAT) has attracted much attention for producing efficient neural networks. Current QAT still yields inferior performance compared with the Full Precision (FP) counterpart. In this work, we argue that quantization inevitably introduces biases into the learned representation, especially under the extremely low-bit setting. To cope with this issue, we propose Maximum Entropy Coding Quantization (MEC-Quant), a more principled objective that explicitly optimizes on the structure of the representation, so that the learned representation is less biased and thus generalizes better to unseen in-distribution samples. To make the objective end-to-end trainable, we propose to leverage the minimal coding length in lossy data coding as a computationally tractable surrogate for the entropy, and further derive a scalable reformulation of the objective based on Mixture Of Experts (MOE) that not only allows fast computation but also handles the long-tailed distribution for weights or activation values. Extensive experiments on various computer vision tasks demonstrate its superiority. With MEC-Quant, the limit of QAT is pushed to the x-bit activation for the first time and the accuracy of MEC-Quant is comparable to or even surpasses the FP counterpart. Without bells and whistles, MEC-Quant establishes a new state of the art for QAT.

[88] GUI-ARP: Enhancing Grounding with Adaptive Region Perception for GUI Agents

Xianhang Ye,Yiqing Li,Wei Dai,Miancan Liu,Ziyuan Chen,Zhangye Han,Hongbo Min,Jinkui Ren,Xiantao Zhang,Wen Yang,Zhi Jin

Main category: cs.CV

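TL;DR: Proposes the GUI-ARP framework, which improves fine-grained GUI grounding in high-resolution screenshots through adaptive multi-stage inference.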

Details Motivation: Existing GUI grounding methods struggle with fine-grained localization in high-resolution screenshots. Method: Proposes the GUI-ARP framework, comprising Adaptive Region Perception (ARP) and Adaptive Stage Controlling (ASC), trained in two phases that combine supervised fine-tuning with GRPO-based reinforcement fine-tuning. Result: Reaches 60.8% accuracy on ScreenSpot-Pro and 30.9% on UI-Vision with a 7B model, competitive with open-source 72B models and proprietary models. Conclusion: By dynamically adapting its inference strategy, GUI-ARP achieves state-of-the-art performance on challenging GUI grounding tasks. Abstract: Existing GUI grounding methods often struggle with fine-grained localization in high-resolution screenshots. To address this, we propose GUI-ARP, a novel framework that enables adaptive multi-stage inference. Equipped with the proposed Adaptive Region Perception (ARP) and Adaptive Stage Controlling (ASC), GUI-ARP dynamically exploits visual attention for cropping task-relevant regions and adapts its inference strategy, performing a single-stage inference for simple cases and a multi-stage analysis for more complex scenarios. This is achieved through a two-phase training pipeline that integrates supervised fine-tuning with reinforcement fine-tuning based on Group Relative Policy Optimization (GRPO). Extensive experiments demonstrate that the proposed GUI-ARP achieves state-of-the-art performance on challenging GUI grounding benchmarks, with a 7B model reaching 60.8% accuracy on ScreenSpot-Pro and 30.9% on the UI-Vision benchmark. Notably, GUI-ARP-7B demonstrates strong competitiveness against open-source 72B models (UI-TARS-72B at 38.1%) and proprietary models.

[89] SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models

Sen Wang,Jingyi Tian,Le Wang,Zhimin Liao,Jiayi Li,Huaiyi Dong,Kun Xia,Sanping Zhou,Wei Tang,Hua Gang

Main category: cs.CV

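TL;DR: Proposes SAMPO, a hybrid framework that combines visual autoregressive modeling with causal modeling to improve the spatiotemporal consistency and inference efficiency of world models in video prediction and control tasks.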

Details Motivation: Existing autoregressive world models fall short in visual coherence, decoding efficiency, and motion modeling, which makes high-quality long-horizon prediction and control difficult. Method: SAMPO uses scale-wise autoregression with a motion prompt mechanism, combining temporal causal decoding with bidirectional spatial attention to preserve spatial locality in intra-frame generation and to support parallel decoding; it adds an asymmetric multi-scale tokenizer and a trajectory-aware motion prompt module to improve dynamic scene understanding. Result: Experiments show that SAMPO performs strongly in action-conditioned video prediction and model-based control, with higher generation quality, 4.4× faster inference, and good zero-shot generalization and scalability. Conclusion: By fusing autoregressive and causal modeling, SAMPO keeps decoding efficient while markedly improving spatiotemporal consistency and physical realism, suggesting a new direction for world model design. Abstract: World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. 
We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.

[90] Beyond Words: Enhancing Desire, Emotion, and Sentiment Recognition with Non-Verbal Cues

Wei Chen,Tongguan Wang,Feiyue Xue,Junkai Li,Hui Liu,Ying Sha

Main category: cs.CV

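TL;DR: Proposes SyDES, a symmetrical bidirectional multimodal learning framework for joint desire, emotion, and sentiment recognition; mutual guidance between the text and image modalities captures intention-related representations, and the method outperforms existing approaches on the MSED dataset.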

Details Motivation: Existing sentiment analysis methods focus on verbal cues and overlook images as non-verbal cues, and multimodal methods that specifically target human desire understanding are lacking, so a framework is needed that fuses image and text information and strengthens fine-grained feature modeling. Method: Builds a symmetrical bidirectional multimodal learning framework in which low-resolution images are used for cross-modal alignment and high-resolution sub-images are modeled with masked image modeling to enhance local features; a text-guided image decoder and an image-guided text decoder provide deep cross-modal interaction, and a mixed-scale image strategy balances computation cost against perceptual gains. Result: On the MSED dataset, the method improves F1 by 1.1% on desire understanding, 0.6% on emotion recognition, and 0.9% on sentiment analysis over existing methods. Conclusion: Through effective cross-modal mutual guidance and fine-grained visual modeling, the proposed SyDES framework improves multimodal desire, emotion, and sentiment recognition and confirms that images are an important complement for intention understanding. Abstract: Desire, as an intention that drives human behavior, is closely related to both emotion and sentiment. Multimodal learning has advanced sentiment and emotion recognition, but multimodal approaches specifically targeting human desire understanding remain underexplored. Moreover, existing methods in sentiment analysis predominantly emphasize verbal cues and overlook images as complementary non-verbal cues. To address these gaps, we propose a Symmetrical Bidirectional Multimodal Learning Framework for Desire, Emotion, and Sentiment Recognition, which enforces mutual guidance between text and image modalities to effectively capture intention-related representations in the image. Specifically, low-resolution images are used to obtain global visual representations for cross-modal alignment, while high-resolution images are partitioned into sub-images and modeled with masked image modeling to enhance the ability to capture fine-grained local features. A text-guided image decoder and an image-guided text decoder are introduced to facilitate deep cross-modal interaction at both local and global representations of image information. Additionally, to balance perceptual gains with computation cost, a mixed-scale image strategy is adopted, where high-resolution images are cropped into sub-images for masked modeling. The proposed approach is evaluated on MSED, a multimodal dataset that includes a desire understanding benchmark, as well as emotion and sentiment recognition. 
Experimental results indicate consistent improvements over other state-of-the-art methods, validating the effectiveness of our proposed method. Specifically, our method outperforms existing approaches, achieving F1-score improvements of 1.1% in desire understanding, 0.6% in emotion recognition, and 0.9% in sentiment analysis. Our code is available at: https://github.com/especiallyW/SyDES.

[91] Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track

Ran Hong,Feng Lu,Leilei Cao,An Yan,Youhai Jiang,Fengjie Zhu

Main category: cs.CV

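TL;DR: Proposes a training-free framework that adds a Video-Language Checker and a Key-Frame Sampler, substantially improving an existing method on the Referential Video Object Segmentation (RVOS) task.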

Details Motivation: To reduce false positives of existing methods on the RVOS task and to better capture early object appearances and long-range temporal context in videos. Method: Proposes a training-free framework with two core components: a Video-Language Checker that verifies whether the subject and action in the query actually appear in the video, reducing false positives, and a Key-Frame Sampler that adaptively selects informative key frames to strengthen temporal modeling. Result: Achieves a J&F score of 64.14% on the MeViS test set, ranking 2nd in the RVOS track of the 7th LSVOS Challenge at ICCV 2025. Conclusion: The proposed training-free framework effectively improves Sa2VA on the RVOS task and shows the importance of explicit verification and key-frame selection for video-language understanding. Abstract: Referential Video Object Segmentation (RVOS) aims to segment all objects in a video that match a given natural language description, bridging the gap between vision and language understanding. Recent work, such as Sa2VA, combines Large Language Models (LLMs) with SAM 2, leveraging the strong video reasoning capability of LLMs to guide video segmentation. In this work, we present a training-free framework that substantially improves Sa2VA's performance on the RVOS task. Our method introduces two key components: (1) a Video-Language Checker that explicitly verifies whether the subject and action described in the query actually appear in the video, thereby reducing false positives; and (2) a Key-Frame Sampler that adaptively selects informative frames to better capture both early object appearances and long-range temporal context. Without any additional training, our approach achieves a J&F score of 64.14% on the MeViS test set, ranking 2nd in the RVOS track of the 7th LSVOS Challenge at ICCV 2025.

[92] MS-GS: Multi-Appearance Sparse-View 3D Gaussian Splatting in the Wild

Deming Li,Kaiwen Jiang,Yutao Tang,Ravi Ramamoorthi,Rama Chellappa,Cheng Peng

Main category: cs.CV

TL;DR: Proposes MS-GS, a new framework for high-quality scene reconstruction and novel view synthesis under sparse-view, multi-appearance conditions.

Details Motivation: Existing NeRF and 3D Gaussian Splatting methods overfit and oversmooth when handling real-world scenes with sparse images and varying appearance (for example, different times of day or seasons). Method: Building on geometric priors from monocular depth estimation, the method uses an SfM point-anchored algorithm to extract local semantic regions and adds geometry-guided supervision at virtual views, at both fine-grained and coarse levels, to strengthen 3D consistency. Result: MS-GS produces more photorealistic renderings than existing methods across several real-world datasets, particularly under sparse-view and multi-appearance conditions. Conclusion: MS-GS effectively mitigates the challenges posed by sparse inputs and multiple appearances, improving the robustness and visual quality of 3D scene reconstruction. Abstract: In-the-wild photo collections often contain limited volumes of imagery and exhibit multiple appearances, e.g., taken at different times of day or seasons, posing significant challenges to scene reconstruction and novel view synthesis. Although recent adaptations of Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) have improved in these areas, they tend to oversmooth and are prone to overfitting. In this paper, we present MS-GS, a novel framework designed with Multi-appearance capabilities in Sparse-view scenarios using 3DGS. To address the lack of support due to sparse initializations, our approach is built on the geometric priors elicited from monocular depth estimations. The key lies in extracting and utilizing local semantic regions with a Structure-from-Motion (SfM) point-anchored algorithm for reliable alignment and geometry cues. Then, to introduce multi-view constraints, we propose a series of geometry-guided supervision at virtual views in a fine-grained and coarse scheme to encourage 3D consistency and reduce overfitting. We also introduce a dataset and an in-the-wild experiment setting to set up more realistic benchmarks. We demonstrate that MS-GS achieves photorealistic renderings under various challenging sparse-view and multi-appearance conditions and outperforms existing approaches significantly across different datasets.

[93] Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification

Tian Lan,Yiming Zheng,Jianxin Yin

Main category: cs.CV

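TL;DR: Proposes Diff-Feat, a new framework that extracts intermediate features from pre-trained diffusion Transformer models and fuses them for multi-label classification, achieving state-of-the-art performance on image and text tasks.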

Details Motivation: Multi-label classification depends on powerful representations that capture interactions between labels, and existing methods leave room for improvement in feature extraction. Method: Extracts intermediate image and text features from pre-trained diffusion Transformers, selects the best "image-text" × "block-timestep" feature pair with a heuristic local-search algorithm, and fuses the selected features by linear projection followed by addition. Result: Reaches 98.6% mAP on MS-COCO-enhanced and 45.7% mAP on Visual Genome 500, surpassing CNN, graph, and Transformer baselines by a wide margin; t-SNE and clustering metrics show tighter semantic clusters. Conclusion: By finding effective combinations of intermediate features along the diffusion process, Diff-Feat offers an efficient and powerful feature extraction scheme for multimodal multi-label classification. Abstract: Multi-label classification has broad applications and depends on powerful representations capable of capturing multi-label interactions. We introduce Diff-Feat, a simple but powerful framework that extracts intermediate features from pre-trained diffusion-Transformer models for images and text, and fuses them for downstream tasks. We observe that for vision tasks, the most discriminative intermediate feature along the diffusion process occurs at the middle step and is located in the middle block of the Transformer. In contrast, for language tasks, the best feature occurs at the noise-free step and is located in the deepest block. In particular, we observe a striking phenomenon across varying datasets: a mysterious "Layer 12" consistently yields the best performance on various downstream classification tasks for images (under DiT-XL/2-256×256). We devise a heuristic local-search algorithm that pinpoints the locally optimal "image-text" × "block-timestep" pair among a few candidates, avoiding an exhaustive grid search. A simple fusion (linear projection followed by addition) of the selected representations yields state-of-the-art performance: 98.6% mAP on MS-COCO-enhanced and 45.7% mAP on Visual Genome 500, surpassing strong CNN, graph, and Transformer baselines by a wide margin. t-SNE and clustering metrics further reveal that Diff-Feat forms tighter semantic clusters than unimodal counterparts. The code is available at https://github.com/lt-0123/Diff-Feat.
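The abstract describes a heuristic local search over (block, timestep) pairs but gives no algorithmic detail. The sketch below shows one plausible reading, a greedy hill climb on a 2D grid that only scores the neighbours of the current pair. The function `local_search`, the 4-neighbour move set, the grid size, and the toy `score` function are all assumptions made for illustration; in practice `score` would be a downstream validation metric computed on features taken from that block and timestep.

```python
def local_search(score, start, n_blocks, n_steps):
    """Greedy hill climbing over (block, timestep) pairs.

    Returns the locally optimal pair, its score, and how many
    pairs were actually evaluated.
    """
    current = start
    best = score(*current)
    evaluated = {current: best}
    while True:
        b, t = current
        neighbours = [
            (b + db, t + dt)
            for db, dt in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= b + db < n_blocks and 0 <= t + dt < n_steps
        ]
        improved = False
        for cand in neighbours:
            if cand not in evaluated:
                evaluated[cand] = score(*cand)
            if evaluated[cand] > best:
                best, current, improved = evaluated[cand], cand, True
        if not improved:
            return current, best, len(evaluated)

# Toy, unimodal stand-in for a validation metric, peaking at block 12, step 5.
def toy_score(block, step):
    return -((block - 12) ** 2 + (step - 5) ** 2)

pair, value, n_evals = local_search(toy_score, start=(14, 7), n_blocks=28, n_steps=10)
print(pair, value, n_evals)
```

On this toy surface the search reaches the peak after scoring only a small fraction of the 280 grid cells, which is the saving over exhaustive grid search that the abstract points to. A real metric surface may have several local optima, so the result depends on the starting candidates.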

[94] From Development to Deployment of AI-assisted Telehealth and Screening for Vision- and Hearing-threatening diseases in resource-constrained settings: Field Observations, Challenges and Way Forward

Mahesh Shakya,Bijay Adhikari,Nirsara Shrestha,Bipin Koirala,Arun Adhikari,Prasanta Poudyal,Luna Mathema,Sarbagya Buddhacharya,Bijay Khatri,Bishesh Khanal

Main category: cs.CV

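TL;DR: Examines the challenges and solutions in developing and adopting scalable AI-assisted telehealth and screening in resource-constrained settings, emphasizing iterative, interdisciplinary collaboration for a smooth transition from paper-based to AI-ready workflows.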

Details Motivation: In resource-constrained settings, vision- and hearing-threatening diseases cause preventable disability because specialists are scarce and screening facilities are limited; effective AI-assisted screening is needed to improve early detection. Method: A field study using an iterative, interdisciplinary collaboration approach built on early prototyping, shadow deployment, and continuous feedback to explore the development and deployment of AI-assisted telehealth and screening systems. Result: Public datasets and AI models remain valuable even though they perform poorly under domain shift; automated AI-based image quality checks are needed to capture gradable images in high-volume screening; AI development and workflow digitization should be treated as an end-to-end co-design process. Conclusion: Successful deployment of AI-assisted telehealth depends on a systematic co-design approach grounded in field experience, and the practical challenges and lessons documented here help fill the gap in real-world knowledge about AI healthcare applications in resource-constrained settings. Abstract: Vision- and hearing-threatening diseases cause preventable disability, especially in resource-constrained settings (RCS) with few specialists and limited screening setups. Large-scale AI-assisted screening and telehealth have the potential to expand early detection, but practical deployment is challenging in paper-based workflows, and limited documented field experience exists to build upon. We provide insights on challenges and ways forward from development to adoption of scalable AI-assisted telehealth and screening in such settings. Specifically, we find that iterative, interdisciplinary collaboration through early prototyping, shadow deployment and continuous feedback is important to build shared understanding as well as reduce usability hurdles when transitioning from paper-based to AI-ready workflows. We find public datasets and AI models highly useful despite poor performance due to domain shift. In addition, we find the need for automated AI-based image quality checks to capture gradable images for robust screening in high-volume camps. Our field learnings stress the importance of treating AI development and workflow digitization as an end-to-end, iterative co-design process. By documenting these practical challenges and lessons learned, we aim to address the gap in contextual, actionable field knowledge for building real-world AI-assisted telehealth and mass-screening programs in RCS.

[95] DC-Mamba: Bi-temporal deformable alignment and scale-sparse enhancement for remote sensing change detection

Min Sun,Fenghui Guo

Main category: cs.CV

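TL;DR: Proposes DC-Mamba, an "align-then-enhance" framework that introduces Bi-Temporal Deformable Alignment (BTDA) and a Scale-Sparse Change Amplifier (SSCA) to improve remote sensing change detection.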

Details Motivation: Existing remote sensing change detection methods handle geometric misalignment poorly and struggle to separate subtle true changes from noise, in particular because they lack an explicit geometric alignment mechanism. Method: Builds the DC-Mamba framework on ChangeMamba, integrating a BTDA module for geometric alignment at the semantic feature level and an SSCA module that selectively amplifies high-confidence change signals while suppressing noise. Result: In experiments, the F1-score rises from 0.5730 to 0.5903 and IoU from 0.4015 to 0.4187 over the baseline. Conclusion: The proposed "align-then-enhance" strategy addresses both geometric and feature-level challenges in remote sensing change detection and is robust and easy to deploy. Abstract: Remote sensing change detection (RSCD) is vital for identifying land-cover changes, yet existing methods, including state-of-the-art State Space Models (SSMs), often lack explicit mechanisms to handle geometric misalignments and struggle to distinguish subtle, true changes from noise. To address this, we introduce DC-Mamba, an "align-then-enhance" framework built upon the ChangeMamba backbone. It integrates two lightweight, plug-and-play modules: (1) Bi-Temporal Deformable Alignment (BTDA), which explicitly introduces geometric awareness to correct spatial misalignments at the semantic feature level; and (2) a Scale-Sparse Change Amplifier (SSCA), which uses multi-source cues to selectively amplify high-confidence change signals while suppressing noise before the final classification. This synergistic design first establishes geometric consistency with BTDA to reduce pseudo-changes, then leverages SSCA to sharpen boundaries and enhance the visibility of small or subtle targets. Experiments show our method significantly improves performance over the strong ChangeMamba baseline, increasing the F1-score from 0.5730 to 0.5903 and IoU from 0.4015 to 0.4187. The results confirm the effectiveness of our "align-then-enhance" strategy, offering a robust and easily deployable solution that transparently addresses both geometric and feature-level challenges in RSCD.
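As a quick sanity check on the reported numbers, the sketch below uses the identity IoU = F1 / (2 − F1), which holds when both metrics are computed for a single class from the same true-positive, false-positive, and false-negative counts. Whether the paper computes them this way is an assumption on my part; the helper `iou_from_f1` and the `reported` dictionary are names made up for this example.

```python
def iou_from_f1(f1: float) -> float:
    """Convert F1 to IoU, assuming both come from the same TP/FP/FN counts."""
    return f1 / (2.0 - f1)

# (F1, IoU) pairs as reported in the abstract.
reported = {
    "ChangeMamba baseline": (0.5730, 0.4015),
    "DC-Mamba": (0.5903, 0.4187),
}

for name, (f1, iou) in reported.items():
    implied = iou_from_f1(f1)
    print(f"{name}: reported IoU={iou:.4f}, IoU implied by F1={implied:.4f}")

f1_gain = reported["DC-Mamba"][0] - reported["ChangeMamba baseline"][0]
iou_gain = reported["DC-Mamba"][1] - reported["ChangeMamba baseline"][1]
print(f"absolute gains: F1 +{f1_gain:.4f}, IoU +{iou_gain:.4f}")
```

Under that assumption the two reported IoU values agree with the reported F1-scores to four decimal places, and the absolute gains are roughly 0.017 on both metrics.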

Shaojie Zhang,Ruoceng Zhang,Pei Fu,Shaokang Wang,Jiahui Yang,Xin Du,Shiqi Cui,Bin Qin,Ying Huang,Zhenbo Luo,Jian Luan

Main category: cs.CV

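TL;DR: Proposes "Blink-Think-Link" (BTL), a brain-inspired framework for AI-driven GUI interaction automation that mimics human cognitive processes to make interaction with graphical interfaces more natural.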

Details Motivation: The interaction logic of existing AI-driven GUI interaction methods deviates from natural human interaction patterns and does not model the human cognitive process. Method: Decomposes interaction into three phases: Blink (rapid detection and attention to relevant screen areas), Think (higher-level reasoning and decision-making), and Link (generation of executable commands); also proposes a Blink data generation pipeline and a rule-based BTL Reward mechanism to support reinforcement learning. Result: BTL-UI, the model built on this framework, achieves state-of-the-art performance on both static GUI understanding and dynamic interaction tasks. Conclusion: The BTL framework improves the natural interaction ability and overall performance of GUI agents, supporting the value of mimicking human cognitive processes in interaction automation. Abstract: In the field of AI-driven human-GUI interaction automation, while rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, a fundamental challenge persists: their interaction logic significantly deviates from natural human-GUI communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interactions into three biologically plausible phases: (1) Blink - rapid detection and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward - the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates consistent state-of-the-art performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI Agents.

[97] Towards Size-invariant Salient Object Detection: A Generic Evaluation and Optimization Approach

Shilong Bao,Qianqian Xu,Feiran Li,Boyu Han,Zhiyong Yang,Xiaochun Cao,Qingming Huang

Main category: cs.CV

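TL;DR: Proposes a size-invariant evaluation framework (SIEva) and optimization method (SIOpt) that address the bias of existing Salient Object Detection (SOD) metrics toward large objects and improve detection of salient objects across scales.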

Details Motivation: When a single image contains several salient objects of different sizes, existing SOD metrics overweight the larger regions, so small but important objects are overlooked, which harms both the fairness of evaluation and practical performance. Method: A theoretical analysis exposes the size sensitivity of existing SOD metrics; the SIEva framework decomposes the evaluation into separable components and evaluates each individually, and the SIOpt optimization framework follows the same size-invariant principle and applies to a wide range of SOD models. Result: Experiments show that the approach mitigates the bias caused by size imbalance and improves detection of salient objects of different sizes across a variety of SOD backbones; a generalization analysis supports the validity of the new evaluation protocol. Conclusion: SIEva and SIOpt provide a fairer and more effective paradigm for evaluating and optimizing salient object detection, with good generality. Abstract: This paper investigates a fundamental yet underexplored issue in Salient Object Detection (SOD): the size-invariant property for evaluation protocols, particularly in scenarios where multiple salient objects of significantly different sizes appear within a single image. We first present a novel perspective to expose the inherent size sensitivity of existing widely used SOD metrics. Through careful theoretical derivations, we show that the evaluation outcome of an image under current SOD metrics can be essentially decomposed into a sum of several separable terms, with the contribution of each term being directly proportional to its corresponding region size. Consequently, the prediction errors would be dominated by the larger regions, while smaller yet potentially more semantically important objects are often overlooked, leading to biased performance assessments and practical degradation. To address this challenge, a generic Size-Invariant Evaluation (SIEva) framework is proposed. The core idea is to evaluate each separable component individually and then aggregate the results, thereby effectively mitigating the impact of size imbalance across objects. Building upon this, we further develop a dedicated optimization framework (SIOpt), which adheres to the size-invariant principle and significantly enhances the detection of salient objects across a broad range of sizes. Notably, SIOpt is model-agnostic and can be seamlessly integrated with a wide range of SOD backbones. Theoretically, we also present generalization analysis of SOD methods and provide evidence supporting the validity of our new evaluation protocols. 
Finally, comprehensive experiments speak to the efficacy of our proposed approach. The code is available at https://github.com/Ferry-Li/SI-SOD.
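The size-sensitivity argument can be illustrated with mean absolute error (MAE) on a toy, flattened image. The sketch assumes the image has already been partitioned into per-object regions, given here as hypothetical lists of pixel indices; the paper's actual partitioning and aggregation rules are not spelled out in the abstract, so `size_invariant_mae` is a simplified stand-in rather than the SIEva protocol itself.

```python
def region_mae(pred, gt, idx):
    """Mean absolute error restricted to the pixel indices in idx."""
    return sum(abs(pred[i] - gt[i]) for i in idx) / len(idx)

def standard_mae(pred, gt):
    """Ordinary image-level MAE: every pixel counts equally."""
    return region_mae(pred, gt, range(len(pred)))

def size_invariant_mae(pred, gt, regions):
    """Score each region separately, then average with equal weight."""
    return sum(region_mae(pred, gt, r) for r in regions) / len(regions)

# Toy 10-pixel image: a large salient object (8 pixels) and a small one (2 pixels).
gt = [1.0] * 10
pred = [1.0] * 8 + [0.0] * 2          # large object found, small object missed
regions = [list(range(0, 8)), [8, 9]]

# The standard metric is a size-weighted sum of the per-region errors.
weighted = sum(len(r) / len(gt) * region_mae(pred, gt, r) for r in regions)

print(standard_mae(pred, gt), weighted, size_invariant_mae(pred, gt, regions))
```

Missing the small object entirely costs only 0.2 under the standard MAE, because that error is weighted by the region's 20% share of the pixels, but 0.5 once each region is weighted equally.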

[98] Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion

Shanghong Li,Chiam Wen Qi Ruth,Hong Xu,Fang Liu

Main category: cs.CV

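TL;DR: Proposes HFN (Heterogeneous Fusion Net), a new multimodal framework for detecting fake news in short videos that combines video, audio, and text, and improves detection through dynamically adjusted modality weights and a weighted feature fusion module; it outperforms existing methods on FakeTT and the newly built VESV dataset.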

Details Motivation: Fake news spreads quickly and widely on short video platforms, and existing methods cope poorly with multimodal, dynamic content, so stronger detection techniques are needed. Method: Proposes the HFN model, which includes a Decision Network that dynamically adjusts the weight of each modality and a Weighted Multi-Modal Feature Fusion module; also builds VESV, a dataset designed for veracity detection on short videos. Result: Experiments on the FakeTT and VESV datasets show that HFN improves Macro F1 by 2.71% and 4.14%, respectively, over state-of-the-art methods. Conclusion: HFN offers a robust and effective solution for short video fake news detection that handles real-world challenges such as multimodal and incomplete data. Abstract: The rapid proliferation of short video platforms has necessitated advanced methods for detecting fake news. This need arises from the widespread influence and ease of sharing misinformation, which can lead to significant societal harm. Current methods often struggle with the dynamic and multimodal nature of short video content. This paper presents HFN, Heterogeneous Fusion Net, a novel multimodal framework that integrates video, audio, and text data to evaluate the authenticity of short video content. HFN introduces a Decision Network that dynamically adjusts modality weights during inference and a Weighted Multi-Modal Feature Fusion module to ensure robust performance even with incomplete data. Additionally, we contribute a comprehensive dataset VESV (VEracity on Short Videos) specifically designed for short video fake news detection. Experiments conducted on the FakeTT and newly collected VESV datasets demonstrate improvements of 2.71% and 4.14% in Macro F1 over state-of-the-art methods. This work establishes a robust solution capable of effectively identifying fake news in the complex landscape of short video platforms, paving the way for more reliable and comprehensive approaches in combating misinformation.
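One simple way to picture dynamic modality weighting that tolerates missing inputs is a softmax over per-modality scores, restricted to the modalities actually present. The sketch below is only that, a generic illustration: the abstract does not describe how HFN's Decision Network produces its weights, and the function `fuse`, the score values, and the two-dimensional features are all hypothetical.

```python
import math

def fuse(features, scores):
    """Weighted sum of per-modality feature vectors.

    features maps modality name to a feature list, or None if the
    modality is missing; scores maps modality name to an unnormalised
    relevance score. Missing modalities get no weight.
    """
    available = [m for m, f in features.items() if f is not None]
    top = max(scores[m] for m in available)
    exps = {m: math.exp(scores[m] - top) for m in available}  # stable softmax
    total = sum(exps.values())
    weights = {m: e / total for m, e in exps.items()}
    dim = len(features[available[0]])
    fused = [sum(weights[m] * features[m][i] for m in available) for i in range(dim)]
    return fused, weights

# A clip with no audio track: the audio score is ignored entirely.
features = {"video": [1.0, 0.0], "audio": None, "text": [0.0, 1.0]}
scores = {"video": 0.0, "audio": 5.0, "text": 0.0}
fused, weights = fuse(features, scores)
print(fused, weights)
```

Renormalising over the available modalities keeps the fused vector on the same scale whether or not a modality is missing, which is one way a fusion module can stay usable with incomplete data.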

[99] EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery

Gui Wang,Yang Wennuo,Xusen Ma,Zehao Zhong,Zhuoru Wu,Ende Wu,Rong Qu,Wooi Ping Cheah,Jianfeng Ren,Linlin Shen

Main category: cs.CV

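TL;DR: EyePCR is a large-scale benchmark for ophthalmic surgery analysis, grounded in structured clinical knowledge, that evaluates MLLMs on perception, comprehension, and reasoning; it exposes the limitations of existing models in surgical cognition and supports improving the clinical reliability of surgical video understanding models.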

Details Motivation: The performance of existing multimodal large language models in high-stakes, domain-specific scenarios such as surgery is under-explored, and there is no systematic evaluation of cognitive understanding in ophthalmic surgery. Method: Builds EyePCR, a large-scale benchmark with more than 210k visual question-answer pairs, 1048 fine-grained attributes, a medical knowledge graph of more than 25k triplets, and four clinical reasoning tasks, to evaluate perception, comprehension, and reasoning; also develops the domain-adapted EyePCR-MLLM model. Result: EyePCR-MLLM achieves the highest multiple-choice accuracy on perception among compared models, outperforms open-source models on comprehension and reasoning, and rivals commercial models such as GPT-4.1. Conclusion: EyePCR reveals the shortcomings of existing MLLMs on surgical cognition tasks and provides a foundation and evaluation standard for improving the clinical reliability of surgical video understanding models. Abstract: MLLMs (Multimodal Large Language Models) have showcased remarkable capabilities, but their performance in high-stakes, domain-specific scenarios, such as surgical settings, remains largely under-explored. To address this gap, we develop EyePCR, a large-scale benchmark for ophthalmic surgery analysis, grounded in structured clinical knowledge to evaluate cognition across Perception, Comprehension and Reasoning. EyePCR offers a richly annotated corpus with more than 210k VQAs, which cover 1048 fine-grained attributes for multi-view perception, a medical knowledge graph of more than 25k triplets for comprehension, and four clinically grounded reasoning tasks. The rich annotations facilitate in-depth cognitive analysis, simulating how surgeons perceive visual cues and combine them with domain knowledge to make decisions, thus greatly improving models' cognitive ability. In particular, EyePCR-MLLM, a domain-adapted variant of Qwen2.5-VL-7B, achieves the highest accuracy on MCQs for Perception among compared models and outperforms open-source models in Comprehension and Reasoning, rivalling commercial models like GPT-4.1. EyePCR reveals the limitations of existing MLLMs in surgical cognition and lays the foundation for benchmarking and enhancing clinical reliability of surgical video understanding models.

[100] TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

Zhongyuan Bao,Lejun Zhang

Main category: cs.CV

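TL;DR: TennisTV is a benchmark for evaluating multimodal large language models on tennis video understanding, covering 8 tasks and 2,500 human-verified questions; it reveals the shortcomings of existing models on high-frequency sports video.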

Details Motivation: Existing multimodal large language models perform poorly on fast-paced, information-dense tennis match video, and a systematic way to evaluate them is missing. Method: Builds the TennisTV benchmark, which models each rally as a temporally ordered sequence of stroke events, uses automated pipelines for data filtering and question generation, and evaluates 16 representative MLLMs. Result: The evaluation shows substantial shortcomings in current MLLMs on tennis video understanding and yields two key insights: frame-sampling density should be adapted to the task, and temporal grounding needs to improve. Conclusion: TennisTV provides the first comprehensive benchmark for high-frequency sports video understanding and points to temporal modeling and task balance as directions for improving MLLMs. Abstract: Multimodal large language models (MLLMs) excel at general video understanding but struggle with fast, high-frequency sports like tennis, where rally clips are short yet information-dense. To systematically evaluate MLLMs in this challenging domain, we present TennisTV, the first and most comprehensive benchmark for tennis video understanding. TennisTV models each rally as a temporally ordered sequence of consecutive stroke events, using automated pipelines for filtering and question generation. It covers 8 tasks at rally and stroke levels and includes 2,500 human-verified questions. Evaluating 16 representative MLLMs, we provide the first systematic assessment of tennis video understanding. Results reveal substantial shortcomings and yield two key insights: (i) frame-sampling density should be tailored and balanced across tasks, and (ii) improving temporal grounding is essential for stronger reasoning.

[101] Enhancing WSI-Based Survival Analysis with Report-Auxiliary Self-Distillation

Zheng Wang,Hong Liu,Zheng Wang,Danyi Li,Min Cen,Baptiste Magnier,Li Liang,Liansheng Wang

Main category: cs.CV

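TL;DR: Proposes Rasa, a new report-auxiliary self-distillation framework for survival analysis on whole slide images (WSIs); it combines text extracted from pathology reports by large language models with a risk-aware data augmentation strategy to improve cancer prognosis prediction.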

Details Motivation: Traditional WSI-based survival analysis is limited by noisy features and scarce data and does not exploit the rich information in pathology reports, so a method that effectively fuses text and image information is needed to improve prediction accuracy. Method: Uses large language models to extract fine-grained textual descriptions from pathology reports, designs a self-distillation framework in which the teacher model's textual knowledge guides the student model to filter out irrelevant WSI features, and adds a risk-aware mix-up strategy to increase the diversity and quantity of training data. Result: Experiments on an in-house CRC dataset and the public TCGA-BRCA dataset show that Rasa outperforms state-of-the-art methods on survival analysis, with stronger feature selection and generalization. Conclusion: Rasa effectively fuses pathology text with WSI image information and, through self-distillation and data augmentation, improves the accuracy and robustness of cancer prognosis prediction. Abstract: Survival analysis based on Whole Slide Images (WSIs) is crucial for evaluating cancer prognosis, as they offer detailed microscopic information essential for predicting patient outcomes. However, traditional WSI-based survival analysis usually faces noisy features and limited data accessibility, hindering its ability to capture critical prognostic features effectively. Although pathology reports provide rich patient-specific information that could assist analysis, their potential to enhance WSI-based survival analysis remains largely unexplored. To this end, this paper proposes a novel Report-auxiliary self-distillation (Rasa) framework for WSI-based survival analysis. First, advanced large language models (LLMs) are utilized to extract fine-grained, WSI-relevant textual descriptions from original noisy pathology reports via a carefully designed task prompt. Next, a self-distillation-based pipeline is designed to filter out irrelevant or redundant WSI features for the student model under the guidance of the teacher model's textual knowledge. Finally, a risk-aware mix-up strategy is incorporated during the training of the student model to enhance both the quantity and diversity of the training data. Extensive experiments carried out on our collected data (CRC) and public data (TCGA-BRCA) demonstrate the superior effectiveness of Rasa against state-of-the-art methods. Our code is available at https://github.com/zhengwang9/Rasa.

[102] PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning

Zhuoyao Liu,Yang Liu,Wentao Feng,Shudong Huang

Main category: cs.CV

TL;DR: Proposes PCSR, a sample refinement framework based on pseudo-label consistency that improves cross-modal retrieval under noisy supervision.

Details Motivation: Existing methods usually assume image-text pairs are perfectly aligned and ignore noisy correspondences in real data; they also rely on coarse-grained categorization and uniform training strategies, which leaves samples underused. Method: Separates clean from noisy samples by confidence estimation, refines the noisy samples using pseudo-label consistency, and proposes a Pseudo-label Consistency Score (PCS) to separate ambiguous samples from refinable ones; an Adaptive Pair Optimization (APO) strategy then trains each sample type differently. Result: Experiments on CC152K, MS-COCO, and Flickr30K show that the method improves the robustness of cross-modal retrieval under noise. Conclusion: Through fine-grained sample partitioning and adaptive optimization, PCSR strengthens learning and retrieval performance under noisy supervision. Abstract: Cross-modal retrieval aims to align different modalities via semantic similarity. However, existing methods often assume that image-text pairs are perfectly aligned, overlooking Noisy Correspondences in real data. These misaligned pairs misguide similarity learning and degrade retrieval performance. Previous methods often rely on coarse-grained categorizations that simply divide data into clean and noisy samples, overlooking the intrinsic diversity within noisy instances. Moreover, they typically apply uniform training strategies regardless of sample characteristics, resulting in suboptimal sample utilization for model optimization. To address the above challenges, we introduce a novel framework, called Pseudo-label Consistency-Guided Sample Refinement (PCSR), which enhances correspondence reliability by explicitly dividing samples based on pseudo-label consistency. Specifically, we first employ a confidence-based estimation to distinguish clean and noisy pairs, then refine the noisy pairs via pseudo-label consistency to uncover structurally distinct subsets. We further propose a Pseudo-label Consistency Score (PCS) to quantify prediction stability, enabling the separation of ambiguous and refinable samples within noisy pairs. Accordingly, we adopt Adaptive Pair Optimization (APO), where ambiguous samples are optimized with robust loss functions and refinable ones are enhanced via text replacement during training. Extensive experiments on CC152K, MS-COCO and Flickr30K validate the effectiveness of our method in improving retrieval robustness under noisy supervision.

[103] pFedSAM: Personalized Federated Learning of Segment Anything Model for Medical Image Segmentation

Tong Wang,Xingyue Zhao,Linghao Zhuang,Haoyu Zhao,Jiayi Yin,Yuyang He,Gang Yu,Bo Lin

Main category: cs.CV

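TL;DR: Proposes a personalized federated SAM framework for heterogeneous data in medical image segmentation, improving performance by aggregating only the global parameters and using a local knowledge distillation mechanism.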

Details Motivation: Existing federated learning methods rely on lightweight architectures that struggle with complex medical image data, while large models such as SAM are hard to use in federated settings. Method: Designs a personalized federated SAM framework with a strategy that aggregates only the global parameters and a decoupled global-local fine-tuning mechanism based on knowledge distillation, and introduces an L-MoE component to preserve domain-specific features. Result: Experiments on two public datasets show that the method significantly improves segmentation performance, achieves robust cross-domain adaptation, and reduces communication overhead. Conclusion: The proposed framework addresses data heterogeneity and privacy in federated medical image segmentation while balancing performance and efficiency. Abstract: Medical image segmentation is crucial for computer-aided diagnosis, yet privacy constraints hinder data sharing across institutions. Federated learning addresses this limitation, but existing approaches often rely on lightweight architectures that struggle with complex, heterogeneous data. Recently, the Segment Anything Model (SAM) has shown outstanding segmentation capabilities; however, its massive encoder poses significant challenges in federated settings. In this work, we present the first personalized federated SAM framework tailored for heterogeneous data scenarios in medical image segmentation. Our framework integrates two key innovations: (1) a personalized strategy that aggregates only the global parameters to capture cross-client commonalities while retaining the designed L-MoE (Localized Mixture-of-Experts) component to preserve domain-specific features; and (2) a decoupled global-local fine-tuning mechanism that leverages a teacher-student paradigm via knowledge distillation to bridge the gap between the global shared model and the personalized local models, thereby mitigating overgeneralization. Extensive experiments on two public datasets validate that our approach significantly improves segmentation performance, achieves robust cross-domain adaptation, and reduces communication overhead.
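The "aggregate only the global parameters" strategy can be sketched as a partial federated average: shared parameters are averaged across clients, and each client keeps its own local parameters untouched. The parameter names (`encoder.adapter`, `l_moe.expert`), the `is_local` predicate, and the unweighted mean are assumptions chosen for this example; the abstract does not say which tensors are shared or how clients are weighted.

```python
def aggregate_global(client_states, is_local):
    """Average shared parameters across clients; leave local ones as they are.

    client_states is a list of dicts mapping parameter name to a list of
    floats. is_local(name) returns True for parameters that stay on the client.
    """
    n = len(client_states)
    shared = [k for k in client_states[0] if not is_local(k)]
    averaged = {
        k: [sum(vals) / n for vals in zip(*(s[k] for s in client_states))]
        for k in shared
    }
    # Each client receives the averaged shared weights plus its own local ones.
    return [{**s, **{k: list(v) for k, v in averaged.items()}} for s in client_states]

clients = [
    {"encoder.adapter": [1.0, 3.0], "l_moe.expert": [0.1, 0.2]},
    {"encoder.adapter": [3.0, 5.0], "l_moe.expert": [0.9, 0.8]},
]
updated = aggregate_global(clients, is_local=lambda name: name.startswith("l_moe."))
print(updated)
```

Only the shared parameters need to be sent to the server, which is consistent with the reduced communication overhead the abstract reports. A production setup would usually weight the average by each client's sample count, as in standard FedAvg.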

[104] UNIV: Unified Foundation Model for Infrared and Visible Modalities

Fangyuan Mao,Shuo Wang,Jilin Mei,Chen Min,Shun Lu,Fuyang Liu,Yu Hu

Main category: cs.CV

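TL;DR: Proposes UNIV, a biologically inspired unified foundation model for infrared and visible modalities. Using patch-wise cross-modality contrastive learning and a dual-knowledge preservation mechanism, it performs strongly on multimodal perception tasks while remaining compatible with single-modality tasks.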

Details Motivation: Existing pre-trained models perform well on a single modality but fall short in RGB-infrared multimodal scenarios such as autonomous driving, lacking effective mechanisms for cross-modal alignment and knowledge preservation. Method: The UNIV model (1) uses attention-guided Patch-wise Cross-modality Contrastive Learning (PCCL), mimicking the lateral inhibition of retinal horizontal cells; (2) adds a dual-knowledge preservation mechanism based on LoRA adapters and synchronous distillation, mimicking bipolar cell signal routing, to prevent catastrophic forgetting; and (3) introduces MVIP, a large-scale paired visible-infrared dataset. Result: Gains of +1.7 mIoU on infrared semantic segmentation and +0.7 mAP on object detection, while retaining over 99% of baseline performance on visible-light tasks. Conclusion: Through its biologically inspired design, UNIV effectively addresses feature alignment and knowledge preservation in multimodal infrared-visible perception, significantly improves cross-modal performance, and shows good application prospects. Abstract: The demand for joint RGB-visible and infrared perception is growing rapidly, particularly to achieve robust performance under diverse weather conditions. Although pre-trained models for RGB-visible and infrared data excel in their respective domains, they often underperform in multimodal scenarios, such as autonomous vehicles equipped with both sensors. To address this challenge, we propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV), featuring two key innovations. First, we introduce Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition, which enables effective cross-modal feature alignment while remaining compatible with any transformer-based architecture. Second, our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing - combining LoRA adapters (2% added parameters) with synchronous distillation to prevent catastrophic forgetting, thereby replicating the retina's photopic (cone-driven) and scotopic (rod-driven) functionality. To support cross-modal learning, we introduce the MVIP dataset, the most comprehensive visible-infrared benchmark to date. It contains 98,992 precisely aligned image pairs spanning diverse scenarios. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU in semantic segmentation and +0.7 mAP in object detection) while maintaining 99%+ of the baseline performance on visible RGB tasks. Our code is available at https://github.com/fangyuanmao/UNIV.

[105] GS-Scale: Unlocking Large-Scale 3D Gaussian Splatting Training via Host Offloading

Donghyun Lee,Dawoon Jeong,Jae W. Lee,Hongil Yoon

Main category: cs.CV

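TL;DR: Proposes GS-Scale, a memory-efficient training system for 3D Gaussian Splatting that keeps Gaussian data in host memory and loads it to the GPU on demand, substantially reducing GPU memory use while maintaining fast training.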

Details Motivation: Storing parameters, gradients, and optimizer states requires so much memory that high-quality training of large-scale scenes runs into GPU memory limits. Method: GS-Scale keeps all Gaussian data in host memory and loads only a subset to the GPU on demand. It applies three system-level optimizations: selective offloading of geometric parameters to speed up frustum culling, parameter forwarding to pipeline CPU optimizer updates with GPU computation, and deferred optimizer updates to reduce memory accesses for Gaussians with zero gradients. Result: Experiments on large-scale datasets show that GS-Scale lowers GPU memory demands by 3.3-5.6x with training speed close to full-GPU training; on an RTX 4070 Mobile it scales the number of Gaussians from 4 million to 18 million, improving LPIPS by 23-35%. Conclusion: GS-Scale enables efficient training of large-scale 3D Gaussian Splatting, letting consumer-grade GPUs handle highly complex scenes and broadening access to the technique. Abstract: The advent of 3D Gaussian Splatting has revolutionized graphics rendering by delivering high visual quality and fast rendering speeds. However, training large-scale scenes at high quality remains challenging due to the substantial memory demands required to store parameters, gradients, and optimizer states, which can quickly overwhelm GPU memory. To address these limitations, we propose GS-Scale, a fast and memory-efficient training system for 3D Gaussian Splatting. GS-Scale stores all Gaussians in host memory, transferring only a subset to the GPU on demand for each forward and backward pass. While this dramatically reduces GPU memory usage, it requires frustum culling and optimizer updates to be executed on the CPU, introducing slowdowns due to the CPU's limited compute and memory bandwidth. To mitigate this, GS-Scale employs three system-level optimizations: (1) selective offloading of geometric parameters for fast frustum culling, (2) parameter forwarding to pipeline CPU optimizer updates with GPU computation, and (3) deferred optimizer update to minimize unnecessary memory accesses for Gaussians with zero gradients. Our extensive evaluations on large-scale datasets demonstrate that GS-Scale significantly lowers GPU memory demands by 3.3-5.6x, while achieving training speeds comparable to GPU without host offloading. This enables large-scale 3D Gaussian Splatting training on consumer-grade GPUs; for instance, GS-Scale can scale the number of Gaussians from 4 million to 18 million on an RTX 4070 Mobile GPU, leading to 23-35% LPIPS (learned perceptual image patch similarity) improvement.

[106] FingerSplat: Contactless Fingerprint 3D Reconstruction and Generation based on 3D Gaussian Splatting

Yuwei Jia,Yutang Lu,Zhe Cui,Fei Su

Main category: cs.CV

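TL;DR: Proposes a 3D Gaussian Splatting-based framework for contactless fingerprint 3D registration, reconstruction, and generation. It is the first to achieve high-quality 3D fingerprint reconstruction and synthesis from sparse images without camera parameters, improving contactless fingerprint recognition performance.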

Details Motivation: Contactless fingerprint recognition still lags behind contact-based methods because contactless fingerprint data with pose variations is scarce and implicit 3D fingerprint representations are underused. Method: Building on 3D Gaussian Splatting, a new framework for contactless fingerprint 3D registration, reconstruction, and generation aligns and reconstructs 3D fingerprint models from 2D images and then generates high-quality contactless fingerprints from them. Result: Experiments show strong results on 3D fingerprint registration, reconstruction, and generation: the method reconstructs 3D fingerprints accurately and generates high-quality contactless fingerprint images. Conclusion: The method offers a new paradigm for contactless fingerprint recognition. It is the first work to apply 3D Gaussian Splatting to fingerprint recognition and to combine 3D reconstruction and generation, and it improves recognition performance. Abstract: Researchers have conducted much pioneering research on contactless fingerprints, yet the performance of contactless fingerprint recognition still lags behind contact-based methods, primarily due to insufficient contactless fingerprint data with pose variations and the lack of use of implicit 3D fingerprint representations. In this paper, we introduce a novel contactless fingerprint 3D registration, reconstruction and generation framework by integrating 3D Gaussian Splatting, with the goal of offering a new paradigm for contactless fingerprint recognition that integrates 3D fingerprint reconstruction and generation. To our knowledge, this is the first work to apply 3D Gaussian Splatting to the field of fingerprint recognition, and the first to achieve effective 3D registration and complete reconstruction of contactless fingerprints with sparse input images and without requiring camera parameter information. Experiments on 3D fingerprint registration, reconstruction, and generation show that our method can accurately align and reconstruct 3D fingerprints from 2D images and subsequently generate high-quality contactless fingerprints from the 3D model, thus improving the performance of contactless fingerprint recognition.

[107] A PCA Based Model for Surface Reconstruction from Incomplete Point Clouds

Hao Liu

Main category: cs.CV

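TL;DR: Proposes a PCA-based surface reconstruction model for incomplete point clouds. Normals estimated with PCA serve as a regularizer that guides reconstruction in regions with missing data, and the model is solved efficiently with an operator-splitting method; experiments show it outperforms existing methods.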

Details Motivation: Scanned point clouds often have missing regions because of factors such as high light-absorption rates and occlusions, which makes surface reconstruction difficult, so the surface structure in the missing regions must be inferred. Method: Principal Component Analysis (PCA) estimates surface normal information from the available data, and this estimate enters the reconstruction model as a regularizer; an operator-splitting method solves the model efficiently. Result: Experiments show that the method accurately infers surface structure in data-missing regions, achieves high-quality surface reconstruction, and outperforms existing methods. Conclusion: Combining PCA-based normal estimation and regularization with an operator-splitting algorithm handles surface reconstruction from incomplete point clouds effectively, improving reconstruction accuracy and robustness. Abstract: Point cloud data represents a crucial category of information for mathematical modeling, and surface reconstruction from such data is an important task across various disciplines. However, during the scanning process, the collected point cloud data may fail to cover the entire surface due to factors such as high light-absorption rate and occlusions, resulting in incomplete datasets. Inferring surface structures in data-missing regions and successfully reconstructing the surface poses a challenge. In this paper, we present a Principal Component Analysis (PCA) based model for surface reconstruction from incomplete point cloud data. Initially, we employ PCA to estimate the normal information of the underlying surface from the available point cloud data. This estimated normal information serves as a regularizer in our model, guiding the reconstruction of the surface, particularly in areas with missing data. Additionally, we introduce an operator-splitting method to effectively solve the proposed model. Through systematic experimentation, we demonstrate that our model successfully infers surface structures in data-missing regions and reconstructs the underlying surfaces well, outperforming existing methodologies.
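
The PCA normal-estimation step mentioned in the abstract can be illustrated with a minimal sketch. This is the textbook local-PCA estimator, not necessarily the paper's exact procedure; the function name `estimate_normals`, the neighbourhood size `k`, and the brute-force neighbour search are assumptions made for illustration.

```python
import numpy as np

def estimate_normals(points, k=8):
    """Estimate a unit normal per point from the PCA of its k nearest neighbours.

    Hypothetical helper: brute-force neighbour search, suitable only for small clouds.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise squared distances between all points.
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)
    # Indices of the k nearest neighbours of each point (the point itself included).
    neighbours = np.argsort(dist2, axis=1)[:, :k]
    normals = np.empty_like(points)
    for i, idx in enumerate(neighbours):
        local = points[idx] - points[idx].mean(axis=0)
        cov = local.T @ local / len(idx)
        # eigh returns eigenvalues in ascending order, so column 0 is the
        # direction of least variance, taken here as the surface normal.
        _, eigvecs = np.linalg.eigh(cov)
        normals[i] = eigvecs[:, 0]
    return normals

# Noise-free samples of the plane z = 0; every normal should be (0, 0, +/-1).
rng = np.random.default_rng(0)
plane = np.column_stack([rng.uniform(-1, 1, 50), rng.uniform(-1, 1, 50), np.zeros(50)])
normals = estimate_normals(plane, k=8)
```

The sign of each normal is ambiguous under PCA, so a consistent orientation step would be needed before such normals could act as a regularizer.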

[108] Camera Splatting for Continuous View Optimization

Gahye Lee,Hyomin Kim,Gwangjin Ju,Jooeun Son,Hyejeong Yoon,Seungyong Lee

Main category: cs.CV

TL;DR: Proposes Camera Splatting, a new view optimization framework for novel view synthesis.

Details Motivation: To better capture complex view-dependent phenomena such as intense metallic reflections and intricate textures (for example, text). Method: Each camera is modeled as a 3D Gaussian (a camera splat), virtual cameras (point cameras) are placed at 3D points near the surface to observe the distribution of camera splats, and the splats are optimized continuously and differentiably. Result: Compared with the Farthest View Sampling (FVS) approach, the method produces better optimized views. Conclusion: Camera Splatting effectively improves how well novel view synthesis reproduces complex visual effects. Abstract: We propose Camera Splatting, a novel view optimization framework for novel view synthesis. Each camera is modeled as a 3D Gaussian, referred to as a camera splat, and virtual cameras, termed point cameras, are placed at 3D points sampled near the surface to observe the distribution of camera splats. View optimization is achieved by continuously and differentiably refining the camera splats so that desirable target distributions are observed from the point cameras, in a manner similar to the original 3D Gaussian splatting. Compared to the Farthest View Sampling (FVS) approach, our optimized views demonstrate superior performance in capturing complex view-dependent phenomena, including intense metallic reflections and intricate textures such as text.

[109] Layout Stroke Imitation: A Layout Guided Handwriting Stroke Generation for Style Imitation with Diffusion Model

Sidra Hanif,Longin Jan Latecki

Main category: cs.CV

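TL;DR: Proposes a conditional diffusion model, guided by multi-scale attention features and word layout, that generates handwriting strokes in a given calligraphic style and improves handwriting imitation.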

Details Motivation: Existing calligraphic style imitation methods do not explicitly model word spacing (word layout), so the generated word spacing is inconsistent and style consistency suffers. Method: Multi-scale attention features capture local and global calligraphic style, and word layout is incorporated into generation as an explicit feature; a conditional diffusion model generates strokes with temporal coordinate information rather than generating images directly. Result: Experiments show that the method outperforms the current state of the art in stroke generation and is competitive with recent image generation networks. Conclusion: By combining multi-scale style features with word layout information, the conditional diffusion model imitates calligraphic style more accurately and improves the quality and style consistency of generated handwriting strokes. Abstract: Handwriting stroke generation is crucial for improving the performance of tasks such as handwriting recognition and writers order recovery. In handwriting stroke generation, it is important to imitate the sample calligraphic style. Previous studies have suggested utilizing the calligraphic features of the handwriting. However, they did not consider word spacing (word layout) as an explicit handwriting feature, which results in inconsistent word spacing for style imitation. Firstly, this work proposes multi-scale attention features for calligraphic style imitation. These multi-scale feature embeddings highlight the local and global style features. Secondly, we propose to include the word layout, which facilitates word spacing for handwriting stroke generation. Moreover, we propose a conditional diffusion model to predict strokes in contrast to previous work, which directly generated style images. Stroke generation provides additional temporal coordinate information, which is lacking in image generation. Hence, our proposed conditional diffusion model for stroke generation is guided by calligraphic style and word layout for better handwriting imitation and stroke generation in a calligraphic style. Our experimentation shows that the proposed diffusion model outperforms the current state-of-the-art stroke generation and is competitive with recent image generation networks.

[110] Saccadic Vision for Fine-Grained Visual Classification

Johann Schmidt,Sebastian Stober,Joachim Denzler,Paul Bodesheim

Main category: cs.CV

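TL;DR: Proposes a two-stage fine-grained visual classification method inspired by human saccadic vision: peripheral features are extracted first, then fixation patches are sampled and encoded in parallel, with contextualized selective attention and non-maximum suppression reducing spatial redundancy and improving classification performance.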

Details Motivation: Existing part-based methods depend on complex localization networks, suffer from feature redundancy and difficulty in determining the number of key parts, and are prone to spatial collapse, which limits their effectiveness in fine-grained classification. Method: A two-stage strategy first extracts peripheral features to generate a sample map, then samples and encodes fixation patches in parallel; a weight-shared encoder and contextualized selective attention fuse peripheral and focus representations, and non-maximum suppression is applied during sampling to reduce spatial redundancy. Result: On CUB-200-2011, NABirds, Food-101, Stanford-Dogs, and several insect datasets (EU-Moths and others), the method outperforms the baseline model and reaches performance comparable to state-of-the-art approaches. Conclusion: The method mitigates part redundancy and spatial collapse, is robust across a variety of fine-grained classification tasks, and has good potential for compatibility with downstream tasks. Abstract: Fine-grained visual classification (FGVC) requires distinguishing between visually similar categories through subtle, localized features - a task that remains challenging due to high intra-class variability and limited inter-class differences. Existing part-based methods often rely on complex localization networks that learn mappings from pixel to sample space, requiring a deep understanding of image content while limiting feature utility for downstream tasks. In addition, sampled points frequently suffer from high spatial redundancy, making it difficult to quantify the optimal number of required parts. Inspired by human saccadic vision, we propose a two-stage process that first extracts peripheral features (coarse view) and generates a sample map, from which fixation patches are sampled and encoded in parallel using a weight-shared encoder. We employ contextualized selective attention to weigh the impact of each fixation patch before fusing peripheral and focus representations. To prevent spatial collapse - a common issue in part-based methods - we utilize non-maximum suppression during fixation sampling to eliminate redundancy. Comprehensive evaluation on standard FGVC benchmarks (CUB-200-2011, NABirds, Food-101 and Stanford-Dogs) and challenging insect datasets (EU-Moths, Ecuador-Moths and AMI-Moths) demonstrates that our method achieves comparable performance to state-of-the-art approaches while consistently outperforming our baseline encoder.
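
The role of non-maximum suppression in fixation sampling can be illustrated with a small sketch. It shows generic greedy NMS over a 2D score map, not the paper's implementation; `sample_fixations`, the `radius` parameter, and the toy `sample_map` are hypothetical.

```python
def sample_fixations(sample_map, num_fixations, radius):
    """Greedy NMS on a 2D score map: take the highest peaks and suppress
    any candidate that lies within `radius` of an already chosen fixation."""
    candidates = sorted(
        ((score, r, c) for r, row in enumerate(sample_map) for c, score in enumerate(row)),
        key=lambda t: -t[0],
    )
    chosen = []
    for score, r, c in candidates:
        if len(chosen) == num_fixations:
            break
        # Keep the candidate only if it is farther than `radius` from every chosen point.
        if all((r - pr) ** 2 + (c - pc) ** 2 > radius ** 2 for pr, pc in chosen):
            chosen.append((r, c))
    return chosen

# Toy sample map (made-up scores): a cluster of high values plus one isolated peak.
sample_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.0],
    [0.1, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.6],
]
fixations = sample_fixations(sample_map, num_fixations=2, radius=1.5)
print(fixations)  # [(1, 1), (3, 3)]
```

Without suppression, the two fixations would both land in the high-scoring cluster around (1, 1), which is the kind of spatial collapse the abstract describes.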

[111] SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions

Cristian Sbrolli,Matteo Matteucci

Main category: cs.CV

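TL;DR: SceneForge is a new framework that strengthens contrastive alignment between 3D point clouds and text through structured multi-object scene composition, easing the scarcity of large-scale 3D-text datasets and delivering significant gains across a range of tasks.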

Details Motivation: To address the lack of large-scale, complex, and diverse datasets in 3D-text contrastive learning and to improve models' understanding of multi-object scenes and their spatial reasoning. Method: Individual 3D shapes are composed into multi-object scenes with explicit spatial relations and paired with coherent multi-object text descriptions generated with a large language model; the model is then trained with contrastive learning. Result: Significant gains on zero-shot classification on ModelNet, ScanObjNN, and other datasets, few-shot part segmentation on ShapeNetPart, and 3D visual question answering on ScanQA, along with good generalization and spatial reasoning. Conclusion: Through its structured scene composition strategy, SceneForge strengthens the representations learned by 3D-text contrastive learning and serves as a general, scalable, model-agnostic data augmentation method. Abstract: The whole is greater than the sum of its parts, even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge's compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.

[112] ORIC: Benchmarking Object Recognition in Incongruous Context for Large Vision-Language Models

Zhaoyang Li,Zhan Ling,Yuchen Zhou,Hao Su

Main category: cs.CV

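TL;DR: Introduces ORIC, a new benchmark that evaluates how well large vision-language models (LVLMs) recognize objects when object-context relationships are incongruous, revealing significant recognition deficits in existing models.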

Details Motivation: LVLMs are prone to object misidentification and hallucination when the context is incongruous, and no systematic evaluation method exists, so a dedicated benchmark is needed to expose their limitations in complex real-world scenes. Method: The ORIC benchmark uses LLM-guided sampling to identify objects that are present but contextually incongruous, and CLIP-guided sampling to produce plausible yet nonexistent objects that are likely to be hallucinated, yielding challenging test scenarios. Result: Experiments on 18 LVLMs and two open-vocabulary detection models show that current models have serious recognition gaps in incongruous contexts, particularly with respect to object hallucination and misidentification. Conclusion: ORIC exposes the fragility of LVLMs in unconventional contexts, underscores the importance of context-aware object recognition, and points to directions for future research. Abstract: Large Vision-Language Models (LVLMs) have made significant strides in image captioning, visual question answering, and robotics by integrating visual and textual information. However, they remain prone to errors in incongruous contexts, where objects appear unexpectedly or are absent when contextually expected. This leads to two key recognition failures: object misidentification and hallucination. To systematically examine this issue, we introduce the Object Recognition in Incongruous Context Benchmark (ORIC), a novel benchmark that evaluates LVLMs in scenarios where object-context relationships deviate from expectations. ORIC employs two key strategies: (1) LLM-guided sampling, which identifies objects that are present but contextually incongruous, and (2) CLIP-guided sampling, which detects plausible yet nonexistent objects that are likely to be hallucinated, thereby creating an incongruous context. Evaluating 18 LVLMs and two open-vocabulary detection models, our results reveal significant recognition gaps, underscoring the challenges posed by contextual incongruity. This work provides critical insights into LVLMs' limitations and encourages further research on context-aware object recognition.

[113] Training-Free Pyramid Token Pruning for Efficient Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

Yuxuan Liang,Xu Li,Xiaolei Chen,Yi Zheng,Haotian Chen,Bin Li,Xiangyang Xue

Main category: cs.CV

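TL;DR: Proposes Pyramid Token Pruning (PTP), a training-free visual token pruning method that combines visual saliency with textual instructions to process high-resolution images efficiently, substantially reducing computational overhead.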

Details Motivation: Large vision-language models incur high computational overhead and inference latency on high-resolution images, and existing methods that partition the image cause the number of visual tokens to balloon. Method: A training-free Pyramid Token Pruning (PTP) strategy combines bottom-up visual saliency at the region and token levels with top-down, instruction-guided importance to selectively retain key tokens. Result: Experiments on 13 benchmarks show that the method substantially reduces computational overhead and inference latency while preserving model performance. Conclusion: PTP balances efficiency and performance in high-resolution image understanding and offers LVLMs an efficient way to accelerate inference. Abstract: Large Vision-Language Models (LVLMs) have significantly advanced multimodal understanding but still struggle with efficiently processing high-resolution images. Recent approaches partition high-resolution images into multiple sub-images, dramatically increasing the number of visual tokens and causing exponential computational overhead during inference. To address these limitations, we propose a training-free token pruning strategy, Pyramid Token Pruning (PTP), that integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided importance. Inspired by human visual attention mechanisms, PTP selectively retains more tokens from visually salient regions and further leverages textual instructions to pinpoint tokens most relevant to specific multimodal tasks. Extensive experiments across 13 diverse benchmarks demonstrate that our method substantially reduces computational overhead and inference latency with minimal performance loss.
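
A rough sketch of how region-level and token-level importance might be combined is given below. The abstract does not specify the scoring or budgeting rules, so everything here is an assumption for illustration: the proportional per-region budget, the linear `mix` between saliency and instruction relevance, the function name `pyramid_prune`, and the toy scores.

```python
def pyramid_prune(regions, keep_total, mix=0.5):
    """Two-level pruning sketch (hypothetical rules, not the paper's).

    regions: list of dicts with 'saliency' (float) and 'tokens', a list of
    (token_id, token_saliency, instruction_relevance) tuples.
    """
    total_saliency = sum(region["saliency"] for region in regions)
    kept = []
    for region in regions:
        # Region level: token budget proportional to region saliency (at least one).
        # Rounding means the budgets may not sum exactly to keep_total in general.
        budget = max(1, round(keep_total * region["saliency"] / total_saliency))
        # Token level: blend bottom-up saliency with top-down instruction relevance.
        ranked = sorted(
            region["tokens"],
            key=lambda t: mix * t[1] + (1 - mix) * t[2],
            reverse=True,
        )
        kept.extend(token_id for token_id, _, _ in ranked[:budget])
    return kept

# Made-up scores for two sub-image regions with four visual tokens each.
regions = [
    {"saliency": 0.75, "tokens": [(0, 0.9, 0.1), (1, 0.2, 0.9), (2, 0.1, 0.1), (3, 0.6, 0.6)]},
    {"saliency": 0.25, "tokens": [(4, 0.3, 0.2), (5, 0.8, 0.7), (6, 0.1, 0.0), (7, 0.2, 0.1)]},
]
kept = pyramid_prune(regions, keep_total=4)
print(kept)  # [3, 1, 0, 5]
```

In this toy case the more salient region keeps three of its four tokens and the less salient one keeps a single token, which mirrors the abstract's statement that more tokens are retained from visually salient regions.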

[114] SGMAGNet: A Baseline Model for 3D Cloud Phase Structure Reconstruction on a New Passive Active Satellite Benchmark

Chi Yang,Fu Wang,Xiaofei Yang,Hao Huang,Weijia Cao,Xiaowen Chu

Main category: cs.CV

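TL;DR: Presents a benchmark dataset and baseline framework for converting multimodal satellite observations into 3D cloud phase structures, with the aim of improving cloud microphysics parameterization in numerical weather prediction.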

Details Motivation: Cloud phase profiles are critical for numerical weather prediction because they directly affect radiative transfer and precipitation processes. Existing methods still fall short in complex cloud systems, so more accurate 3D cloud phase retrieval is needed. Method: High-spatiotemporal-resolution visible and thermal infrared imagery from geostationary satellites is combined with vertical cloud phase profiles from spaceborne lidar (CALIOP/CALIPSO) and radar (CPR/CloudSat) to build a dataset of synchronized image-profile pairs; SGMAGNet is trained with supervised learning to predict the 3D cloud phase structure and is compared with baselines such as UNet variants and SegNet. Result: SGMAGNet performs best in cloud phase reconstruction, especially in multi-layer clouds and phase transition regions, reaching a Precision of 0.922, Recall of 0.858, F1-score of 0.763, and IoU of 0.617, all significantly better than the baselines. Conclusion: SGMAGNet fuses multimodal satellite observations effectively to produce accurate 3D cloud phase structures and has potential for operational cloud phase retrieval and coupling with numerical weather prediction systems. Abstract: Cloud phase profiles are critical for numerical weather prediction (NWP), as they directly affect radiative transfer and precipitation processes. In this study, we present a benchmark dataset and a baseline framework for transforming multimodal satellite observations into detailed 3D cloud phase structures, aiming toward operational cloud phase profile retrieval and future integration with NWP systems to improve cloud microphysics parameterization. The multimodal observations consist of (1) high-spatiotemporal-resolution, multi-band visible (VIS) and thermal infrared (TIR) imagery from geostationary satellites, and (2) accurate vertical cloud phase profiles from spaceborne lidar (CALIOP/CALIPSO) and radar (CPR/CloudSat). The dataset consists of synchronized image-profile pairs across diverse cloud regimes, defining a supervised learning task: given VIS/TIR patches, predict the corresponding 3D cloud phase structure. We adopt SGMAGNet as the main model and compare it with several baseline architectures, including UNet variants and SegNet, all designed to capture multi-scale spatial patterns. Model performance is evaluated using standard classification metrics, including Precision, Recall, F1-score, and IoU. The results demonstrate that SGMAGNet achieves superior performance in cloud phase reconstruction, particularly in complex multi-layer and boundary transition regions. Quantitatively, SGMAGNet attains a Precision of 0.922, Recall of 0.858, F1-score of 0.763, and an IoU of 0.617, significantly outperforming all baselines across these key metrics.
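
For readers unfamiliar with the four metrics named in the abstract, the sketch below computes them per class and macro-averages the results. The averaging scheme is an assumption, since the abstract does not state how its figures are aggregated; the class labels and predictions are made-up toy data, not values from the paper.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, F1, and IoU for one class from label lists."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou

def macro_average(y_true, y_pred):
    """Unweighted mean of each metric over all classes that occur."""
    classes = sorted(set(y_true) | set(y_pred))
    rows = [per_class_metrics(y_true, y_pred, c) for c in classes]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

# Hypothetical per-voxel labels; the class names are illustrative only.
y_true = ["ice", "ice", "water", "water", "mixed", "clear"]
y_pred = ["ice", "water", "water", "water", "mixed", "clear"]
precision, recall, f1, iou = macro_average(y_true, y_pred)
```

Under macro-averaging the aggregate F1 is the mean of per-class F1 values, so it is generally not the harmonic mean of the aggregate precision and recall.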

[115] Toward Medical Deepfake Detection: A Comprehensive Dataset and Novel Method

Shuaibo Li,Zhaohu Xing,Hongqiu Wang,Pengfei Hao,Xingyu Li,Zekai Liu,Lei Zhu

Main category: cs.CV

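TL;DR: Introduces the MedForensics dataset and the DSKI detector for detecting AI-generated medical images, addressing the forgery risks that generative AI poses in medical imaging.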

Details Motivation: The rapid progress of generative AI in medical imaging brings the risk of forged images that could cause misdiagnosis, fraud, and misinformation, yet existing media forensics methods are ill-suited to medical images and dedicated datasets are lacking. Method: A large-scale medical forensics dataset, MedForensics, is built, covering six medical modalities and twelve state-of-the-art generative models. A Dual-Stage Knowledge Infusing detector (DSKI) is proposed, comprising a cross-domain fine-trace adapter (CDFA) and a medical forensic retrieval module (MFRM); it extracts forgery traces from the spatial and noise domains during training and improves performance through few-shot retrieval at test time. Result: Experiments show that DSKI significantly outperforms existing methods and human experts across multiple medical modalities, with higher detection accuracy. Conclusion: By combining a vision-language feature space with few-shot retrieval, DSKI improves detection of AI-generated medical images and provides a strong tool for verifying medical image authenticity. Abstract: The rapid advancement of generative AI in medical imaging has introduced both significant opportunities and serious challenges, especially the risk that fake medical images could undermine healthcare systems. These synthetic images pose serious risks, such as diagnostic deception, financial fraud, and misinformation. However, research on medical forensics to counter these threats remains limited, and there is a critical lack of comprehensive datasets specifically tailored for this field. Additionally, existing media forensic methods, which are primarily designed for natural or facial images, are inadequate for capturing the distinct characteristics and subtle artifacts of AI-generated medical images. To tackle these challenges, we introduce MedForensics, a large-scale medical forensics dataset encompassing six medical modalities and twelve state-of-the-art medical generative models. We also propose DSKI, a novel Dual-Stage Knowledge Infusing detector that constructs a vision-language feature space tailored for the detection of AI-generated medical images. DSKI comprises two core components: 1) a cross-domain fine-trace adapter (CDFA) for extracting subtle forgery clues from both spatial and noise domains during training, and 2) a medical forensic retrieval module (MFRM) that boosts detection accuracy through few-shot retrieval during testing. Experimental results demonstrate that DSKI significantly outperforms both existing methods and human experts, achieving superior accuracy across multiple medical modalities.

[116] TrueMoE: Dual-Routing Mixture of Discriminative Experts for Synthetic Image Detection

Laixin Zhang,Shuaibo Li,Wei Ma,Hongbin Zha

Main category: cs.CV

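TL;DR: Proposes TrueMoE, a dual-routing mixture-of-discriminative-experts framework that detects synthetic images through multiple specialized subspaces working together, improving generalization and robustness.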

Details Motivation: Existing methods build a single discriminative space that generalizes poorly to unseen generative patterns, so a more flexible and robust detection framework is needed. Method: A dual-routing mechanism (a granularity-aware sparse router and a manifold-aware dense router) is combined with a Discriminative Expert Array (DEA) so that detection is carried out collaboratively across multiple complementary, lightweight subspaces. Result: Experiments on a variety of generative models show that TrueMoE outperforms existing methods in cross-model generalization and robustness. Conclusion: Through its modular, collaborative discriminative structure, TrueMoE improves the performance and adaptability of synthetic image detection. Abstract: The rapid progress of generative models has made synthetic image detection an increasingly critical task. Most existing approaches attempt to construct a single, universal discriminative space to separate real from fake content. However, such unified spaces tend to be complex and brittle, often struggling to generalize to unseen generative patterns. In this work, we propose TrueMoE, a novel dual-routing Mixture-of-Discriminative-Experts framework that reformulates the detection task as a collaborative inference across multiple specialized and lightweight discriminative subspaces. At the core of TrueMoE is a Discriminative Expert Array (DEA) organized along complementary axes of manifold structure and perceptual granularity, enabling diverse forgery cues to be captured across subspaces. A dual-routing mechanism, comprising a granularity-aware sparse router and a manifold-aware dense router, adaptively assigns input images to the most relevant experts. Extensive experiments across a wide spectrum of generative models demonstrate that TrueMoE achieves superior generalization and robustness.

[117] Hybrid Lie semi-group and cascade structures for the generalized Gaussian derivative model for visual receptive fields

Tony Lindeberg

Main category: cs.CV

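TL;DR: Studies the relationships between spatial and spatio-temporal receptive field responses under different shape parameters. It develops a theoretical framework based on covariant receptive field families, using infinitesimal relationships and macroscopic cascade smoothing properties to better understand and compute multi-parameter receptive field responses, which supports efficient computation and the modeling of simple cells in biological vision.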

Details Motivation: Natural image transformations make real-world image structures highly variable, so receptive field responses in the early visual system can be strongly affected by geometric transformations; a theoretical framework that can handle this variability is therefore needed. Method: The paper derives infinitesimal relationships between different receptive field parameters (akin to notions from semi-groups and Lie groups) and macroscopic cascade smoothing properties, which describe how coarse-scale receptive field responses can be obtained by applying small-support incremental filters to fine-scale responses. Result: Theoretical relationships between multi-parameter receptive field responses are established, revealing structural properties of cross-scale computation that are related to Lie algebra structure but exhibit directional preferences. Conclusion: The framework helps in designing more efficient schemes for computing multi-parameter receptive field responses and can also be used to build idealized theoretical models of simple cells in biological vision. Abstract: Because of the variabilities of real-world image structures under the natural image transformations that arise when observing similar objects or spatio-temporal events under different viewing conditions, the receptive field responses computed in the earliest layers of the visual hierarchy may be strongly influenced by such geometric image transformations. One way of handling this variability is by basing the vision system on covariant receptive field families, which expand the receptive field shapes over the degrees of freedom in the image transformations. This paper addresses the problem of deriving relationships between spatial and spatio-temporal receptive field responses obtained for different values of the shape parameters in the resulting multi-parameter families of receptive fields. For this purpose, we derive both (i) infinitesimal relationships, roughly corresponding to a combination of notions from semi-groups and Lie groups, as well as (ii) macroscopic cascade smoothing properties, which describe how receptive field responses at coarser spatial and temporal scales can be computed by applying smaller support incremental filters to the output from corresponding receptive fields at finer spatial and temporal scales, structurally related to the notion of Lie algebras, although with directional preferences. The presented results provide (i) a deeper understanding of the relationships between spatial and spatio-temporal receptive field responses for different values of the filter parameters, which can be used for both (ii) designing more efficient schemes for computing receptive field responses over populations of multi-parameter families of receptive fields, as well as (iii) formulating idealized theoretical models of the computations of simple cells in biological vision.

[118] FloorSAM: SAM-Guided Floorplan Reconstruction with Semantic-Geometric Fusion

Han Ye,Haofu Wang,Yunchi Zhang,Jiangjian Xiao,Yuqiang Jin,Jinyuan Liu,Wen-An Zhang,Uladzislau Sychou,Alexander Tuzikov,Vladislav Sobolevskii,Valerii Zakharov,Boris Sokolov,Minglei Fu

Main category: cs.CV

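TL;DR: Proposes FloorSAM, a framework that combines point cloud density maps with the Segment Anything Model (SAM) to reconstruct floor plans accurately from LiDAR data, outperforming traditional methods in noisy and complex environments.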

Details Motivation: Traditional geometric algorithms and Mask R-CNN-based deep learning methods are sensitive to noise, generalize poorly, and lose geometric detail when processing point clouds, making high-accuracy floor plan reconstruction difficult. Method: Grid-based filtering, adaptive resolution projection, and image enhancement produce robust top-down density maps; SAM's zero-shot capability is used for room segmentation, with adaptive prompt points and multistage filtering generating room masks, and joint mask and point cloud analysis extracting and regularizing contours. Result: On the Giblayout and ISPRS datasets, FloorSAM outperforms traditional methods in accuracy, recall, and robustness, especially in complex and noisy settings, and recovers room topological relationships. Conclusion: By fusing point cloud density maps with SAM, FloorSAM achieves accurate and robust floor plan reconstruction across diverse building layouts, offering a practical solution for indoor navigation and BIM. Abstract: Reconstructing building floor plans from point cloud data is key for indoor navigation, BIM, and precise measurements. Traditional methods like geometric algorithms and Mask R-CNN-based deep learning often face issues with noise, limited generalization, and loss of geometric details. We propose FloorSAM, a framework that integrates point cloud density maps with the Segment Anything Model (SAM) for accurate floor plan reconstruction from LiDAR data. Using grid-based filtering, adaptive resolution projection, and image enhancement, we create robust top-down density maps. FloorSAM uses SAM's zero-shot learning for precise room segmentation, improving reconstruction across diverse layouts. Room masks are generated via adaptive prompt points and multistage filtering, followed by joint mask and point cloud analysis for contour extraction and regularization. This produces accurate floor plans and recovers room topological relationships. Tests on Giblayout and ISPRS datasets show better accuracy, recall, and robustness than traditional methods, especially in noisy and complex settings. Code and materials: github.com/Silentbarber/FloorSAM.

[119] Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks

Het Patel,Muzammil Allie,Qian Zhang,Jia Chen,Evangelos E. Papalexakis

Main category: cs.CV

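TL;DR: Proposes a lightweight tensor-decomposition-based defense that improves the robustness of vision-language models to adversarial attacks without retraining.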

Details Motivation: Existing defenses usually require costly retraining or major changes to the model architecture, and efficient plug-and-play defenses for pre-trained vision-language models are lacking. Method: Vision encoder representations are decomposed and reconstructed with tensor decomposition (Tensor Train), filtering adversarial noise while preserving semantic information; the approach applies to any pre-trained vision-language model. Result: Experiments with CLIP on COCO and Flickr30K show markedly better performance under adversarial attack: Recall@1 on Flickr30K rises from 7.5% to 19.8% (12.3 percentage points recovered) and on COCO from 3.8% to 11.9% (8.1 percentage points recovered); low rank (8-32) and low residual strength (α=0.1-0.2) work best. Conclusion: The method is an efficient, plug-and-play, lightweight defense that is practical for existing vision-language models and adds minimal computational overhead. Abstract: Vision language models (VLMs) excel in multimodal understanding but are prone to adversarial attacks. Existing defenses often demand costly retraining or significant architecture changes. We introduce a lightweight defense using tensor decomposition suitable for any pre-trained VLM, requiring no retraining. By decomposing and reconstructing vision encoder representations, it filters adversarial noise while preserving meaning. Experiments with CLIP on COCO and Flickr30K show improved robustness. On Flickr30K, it restores 12.3 percentage points of the performance lost to attacks, raising Recall@1 accuracy from 7.5% to 19.8%. On COCO, it recovers 8.1 percentage points, improving accuracy from 3.8% to 11.9%. Analysis shows Tensor Train decomposition with low rank (8-32) and low residual strength (α=0.1-0.2) is optimal. This method is a practical, plug-and-play solution with minimal overhead for existing VLMs.
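
The decompose-and-reconstruct idea can be sketched on a toy feature matrix. Two simplifications are assumptions of this sketch and not the paper's method: a truncated SVD stands in for the Tensor Train decomposition, and the residual strength is interpreted as the fraction `alpha` of the original features blended back into the low-rank reconstruction. The synthetic data and the name `low_rank_filter` are also hypothetical, and random Gaussian noise is only a crude proxy for an adversarial perturbation.

```python
import numpy as np

def low_rank_filter(features, rank, alpha):
    """Reconstruct `features` (tokens x dim) from its top-`rank` singular
    components, then blend a fraction `alpha` of the original back in."""
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    reconstruction = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return alpha * features + (1.0 - alpha) * reconstruction

rng = np.random.default_rng(0)
# Rank-4 matrix standing in for the semantic content of encoder features.
clean = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 128))
# Small dense perturbation added on top of the clean features.
perturbed = clean + 0.1 * rng.normal(size=clean.shape)
filtered = low_rank_filter(perturbed, rank=4, alpha=0.1)

# Distance to the clean features before and after filtering.
error_before = np.linalg.norm(perturbed - clean)
error_after = np.linalg.norm(filtered - clean)
```

In this toy setting the filtered features lie closer to the clean ones because most of the perturbation falls outside the dominant low-rank subspace; whether the same holds for real adversarial perturbations is what the paper's experiments address.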

[120] Simulated Cortical Magnification Supports Self-Supervised Object Learning

Zhengyang Yu,Arthur Aubret,Chen Yu,Jochen Triesch

Main category: cs.CV

TL;DR: By simulating properties of human foveated vision, this study improves how well self-supervised learning models learn semantic object representations.

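Details Motivation: Existing self-supervised learning models ignore the fact that human vision has high resolution at the center of the visual field and low resolution in the periphery, a property that may play an important role in toddlers' visual development. The authors therefore investigate how this varying resolution affects the development of object representations. Method: Two egocentric video datasets of humans interacting with objects are processed with computational models of human foveation and cortical magnification, so that visual content becomes progressively blurrier toward the periphery. The processed videos are then used to train two bio-inspired self-supervised models with a time-based contrastive learning objective. Result: Introducing foveated vision improves the quality of the learned object representations. Analysis suggests this is mainly because objects appear larger in the visual field and because central and peripheral visual information is better balanced. Conclusion: Modeling foveated vision makes self-supervised models of human visual representation learning more realistic and better performing, a step toward vision models that more closely match human learning. Abstract: Recent self-supervised learning models simulate the development of semantic object representations by training on visual experience similar to that of toddlers. However, these models ignore the foveated nature of human vision with high/low resolution in the center/periphery of the visual field. Here, we investigate the role of this varying resolution in the development of object representations. We leverage two datasets of egocentric videos that capture the visual experience of humans during interactions with objects. We apply models of human foveation and cortical magnification to modify these inputs, such that the visual content becomes less distinct towards the periphery. The resulting sequences are used to train two bio-inspired self-supervised learning models that implement a time-based learning objective. Our results show that modeling aspects of foveated vision improves the quality of the learned object representations in this setting. Our analysis suggests that this improvement comes from making objects appear bigger and inducing a better trade-off between central and peripheral visual information. Overall, this work takes a step towards making models of humans' learning of visual representations more realistic and performant.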

[121] MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Yanghao Li,Rui Qian,Bowen Pan,Haotian Zhang,Haoshuo Huang,Bowen Zhang,Jialing Tong,Haoxuan You,Xianzhi Du,Zhe Gan,Hyunjik Kim,Chao Jia,Zhenbang Wang,Yinfei Yang,Mingfei Gao,Zi-Yi Dou,Wenze Hu,Chang Gao,Dongxu Li,Philipp Dufter,Zirui Wang,Guoli Yin,Zhengdong Zhang,Chen Chen,Yang Zhao,Ruoming Pang,Zhifeng Chen

Main category: cs.CV

TL;DR: Manzano is a unified multimodal large language model framework that uses a hybrid image tokenizer and a unified training recipe to balance image-text understanding and generation, reaching state-of-the-art performance among unified models.

Details Motivation: Existing open-source multimodal large models trade off performance between image-text understanding and generation and struggle to do both well. Method: A shared vision encoder feeds two lightweight adapters that produce continuous embeddings for understanding and discrete image tokens for generation; an autoregressive LLM is combined with a diffusion decoder for unified modeling. Result: Manzano achieves state-of-the-art results among unified models and is especially strong on text-rich tasks, validating the hybrid tokenizer and joint training. Conclusion: Hybrid tokenization and unified training ease the conflict between understanding and generation in multimodal models and support scalable joint learning. Abstract: Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.

[122] MCOD: The First Challenging Benchmark for Multispectral Camouflaged Object Detection

Yang Li,Tingfa Xu,Shuyan Bai,Peifu Liu,Jianan Li

Main category: cs.CV

TL;DR: The paper presents MCOD, the first benchmark dataset for multispectral camouflaged object detection, with comprehensive challenge attributes, diverse real-world scenes, and high-quality pixel-level annotations, intended to advance multispectral camouflage detection.

Details Motivation: Existing camouflaged object detection datasets contain only RGB images and offer no support for multispectral methods, which has held the field back. Method: The authors build MCOD, a new dataset of multispectral images covering a range of challenging real-world conditions, and benchmark 11 representative methods on it to assess the benefit of multispectral modalities for detection performance. Result: Existing methods generally perform worse on MCOD, indicating a harder task; adding multispectral information substantially alleviates this degradation, confirming its value for detection robustness. Conclusion: MCOD provides an important foundation for multispectral camouflaged object detection and is expected to stimulate further research in the area. Abstract: Camouflaged Object Detection (COD) aims to identify objects that blend seamlessly into natural scenes. Although RGB-based methods have advanced, their performance remains limited under challenging conditions. Multispectral imagery, providing rich spectral information, offers a promising alternative for enhanced foreground-background discrimination. However, existing COD benchmark datasets are exclusively RGB-based, lacking essential support for multispectral approaches, which has impeded progress in this area. To address this gap, we introduce MCOD, the first challenging benchmark dataset specifically designed for multispectral camouflaged object detection. MCOD features three key advantages: (i) Comprehensive challenge attributes: It captures real-world difficulties such as small object sizes and extreme lighting conditions commonly encountered in COD tasks. (ii) Diverse real-world scenarios: The dataset spans a wide range of natural environments to better reflect practical applications. (iii) High-quality pixel-level annotations: Each image is manually annotated with precise object masks and corresponding challenge attribute labels. We benchmark eleven representative COD methods on MCOD, observing a consistent performance drop due to increased task difficulty. Notably, integrating multispectral modalities substantially alleviates this degradation, highlighting the value of spectral information in enhancing detection robustness. We anticipate MCOD will provide a strong foundation for future research in multispectral camouflaged object detection. The dataset is publicly accessible at https://github.com/yl2900260-bit/MCOD.

[123] Overview of PlantCLEF 2024: multi-species plant identification in vegetation plot images

Herve Goeau,Vincent Espitalier,Pierre Bonnet,Alexis Joly

Main category: cs.CV

TL;DR: The paper presents the PlantCLEF 2024 challenge, which aims to use AI to make species identification in vegetation plot images more efficient for ecological research; the task is weakly labeled multi-label classification trained on single-label data.

Details Motivation: Use AI to make ecologists more efficient at plant species identification and to extend the scope and coverage of ecological studies. Method: The challenge provides a large training set of 1.7 million plant images and state-of-the-art vision transformer models, and evaluates multi-label classification on thousands of expert-annotated multi-label test images. Result: The paper gives a detailed description of the data, the evaluation methodology, the models and methods used by participants, and the results achieved, advancing image-based plant species identification. Conclusion: PlantCLEF 2024 demonstrates the potential of AI for ecological monitoring, particularly for high-resolution plot images in which many species co-occur. Abstract: Plot images are essential for ecological studies, enabling standardized sampling, biodiversity assessment, long-term monitoring and remote, large-scale surveys. Plot images are typically fifty centimetres or one square meter in size, and botanists meticulously identify all the species found there. The integration of AI could significantly improve the efficiency of specialists, helping them to extend the scope and coverage of ecological studies. To evaluate advances in this regard, the PlantCLEF 2024 challenge leverages a new test set of thousands of multi-label images annotated by experts and covering over 800 species. In addition, it provides a large training set of 1.7 million individual plant images as well as state-of-the-art vision transformer models pre-trained on this data. The task is evaluated as a (weakly-labeled) multi-label classification task where the aim is to predict all the plant species present on a high-resolution plot image (using the single-label training data). In this paper, we provide a detailed description of the data, the evaluation methodology, the methods and models employed by the participants and the results achieved.

[124] Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation

Weimin Bai,Yubo Li,Weijian Luo,Wenzheng Chen,He Sun

Main category: cs.CV

TL;DR: Proposes VLM3D, a new framework that brings large vision-language models (VLMs) into the Score Distillation Sampling (SDS) pipeline to improve semantic alignment and 3D spatial consistency in text-to-3D generation.

Details Motivation: Existing SDS methods rely on CLIP-style text encoders, which yields coarse semantic alignment, and 2D diffusion priors lack explicit 3D spatial constraints, which causes geometric inconsistencies and incorrect relationships in multi-object scenes. Method: Large VLMs are integrated into the SDS pipeline as differentiable semantic and spatial priors; their language-grounded supervision enables fine-grained prompt alignment, and their inherent vision-language modeling strengthens 3D consistency and multi-object relational reasoning. VLM3D is instantiated on the open-source Qwen2.5-VL model and evaluated on the GPTeval3D benchmark. Result: Across diverse objects and complex scenes, VLM3D significantly outperforms prior SDS methods in semantic fidelity, geometric coherence, and spatial correctness. Conclusion: By using a VLM as a semantic and spatial prior, VLM3D addresses the limits of conventional SDS methods in fine-grained semantic alignment and 3D spatial consistency and substantially improves text-to-3D generation quality. Abstract: Score Distillation Sampling (SDS) enables high-quality text-to-3D generation by supervising 3D models through the denoising of multi-view 2D renderings, using a pretrained text-to-image diffusion model to align with the input prompt and ensure 3D consistency. However, existing SDS-based methods face two fundamental limitations: (1) their reliance on CLIP-style text encoders leads to coarse semantic alignment and struggles with fine-grained prompts; and (2) 2D diffusion priors lack explicit 3D spatial constraints, resulting in geometric inconsistencies and inaccurate object relationships in multi-object scenes. To address these challenges, we propose VLM3D, a novel text-to-3D generation framework that integrates large vision-language models (VLMs) into the SDS pipeline as differentiable semantic and spatial priors. Unlike standard text-to-image diffusion priors, VLMs leverage rich language-grounded supervision that enables fine-grained prompt alignment. Moreover, their inherent vision language modeling provides strong spatial understanding, which significantly enhances 3D consistency for single-object generation and improves relational reasoning in multi-object scenes. We instantiate VLM3D based on the open-source Qwen2.5-VL model and evaluate it on the GPTeval3D benchmark. Experiments across diverse objects and complex scenes show that VLM3D significantly outperforms prior SDS-based methods in semantic fidelity, geometric coherence, and spatial correctness.

[125] Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution

Chang Soo Lim,Joonyoung Moon,Donghyeon Cho

Main category: cs.CV

TL;DR: The paper proposes SCOPE, a video object segmentation framework that combines the strengths of Cutie and SAM2 and adds a motion prediction module to improve feature representation and temporal stability; it placed 3rd in the MOSEv2 challenge.

Details Motivation: Overcome the limitations of existing methods in feature capacity and temporal modeling to improve video object segmentation. Method: Cutie's encoder is replaced with the ViT encoder of SAM2, and a motion prediction module is added for temporal consistency; the final system ensembles Cutie, SAM2, and the proposed variant. Result: The method took 3rd place in the MOSEv2 track of the 7th LSVOS Challenge, confirming the benefit of enriched feature representation and motion prediction. Conclusion: An ensemble framework that combines strong feature representation with motion prediction improves the robustness and performance of video object segmentation. Abstract: Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query-based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. We refer to our final model as SCOPE (SAM2-CUTIE Object Prediction Ensemble). This demonstrates the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at https://github.com/2025-LSVOS-3rd-place/MOSEv2_3rd_place.

[126] Ideal Registration? Segmentation is All You Need

Xiang Chen,Fengting Zhang,Qinghao Liu,Min Liu,Kun Wu,Yaonan Wang,Hang Zhang

Main category: cs.CV

TL;DR: Proposes SegReg, a segmentation-driven registration framework that achieves anatomically adaptive regularization by exploiting region-specific deformation patterns, substantially improving medical image registration accuracy.

Details Motivation: Existing deep learning registration methods often apply globally uniform smoothness constraints, which cope poorly with the complex, regionally varying deformations of anatomical structures. Method: Segmentation first decomposes the moving and fixed images into anatomically coherent subregions; the same registration backbone then computes a partial deformation field for each local region, and these are fused into a global deformation field. Result: Across three clinical scenarios (cardiac, abdominal, and lung), SegReg outperforms existing methods by 2-12%; with ground-truth segmentation it reaches 98.23% Dice on critical anatomies, and registration accuracy depends near-linearly on segmentation quality. Conclusion: By modeling local deformations driven by segmentation, SegReg overcomes the limits of conventional methods on complex anatomical deformation and turns the registration problem into a segmentation problem. Abstract: Deep learning has revolutionized image registration by its ability to handle diverse tasks while achieving significant speed advantages over conventional approaches. Current approaches, however, often employ globally uniform smoothness constraints that fail to accommodate the complex, regionally varying deformations characteristic of anatomical motion. To address this limitation, we propose SegReg, a Segmentation-driven Registration framework that implements anatomically adaptive regularization by exploiting region-specific deformation patterns. Our SegReg first decomposes input moving and fixed images into anatomically coherent subregions through segmentation. These localized domains are then processed by the same registration backbone to compute optimized partial deformation fields, which are subsequently integrated into a global deformation field. SegReg achieves near-perfect structural alignment (98.23% Dice on critical anatomies) using ground-truth segmentation, and outperforms existing methods by 2-12% across three clinical registration scenarios (cardiac, abdominal, and lung images) even with automatic segmentation. Our SegReg demonstrates a near-linear dependence of registration accuracy on segmentation quality, transforming the registration challenge into a segmentation problem. The source code will be released upon manuscript acceptance.
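
The SegReg abstract says that per-region deformation fields are "integrated into a global deformation field" but does not say how. As a rough illustration only, the Python sketch below assumes the simplest possible rule: each pixel takes the displacement predicted for the region it belongs to. The function name `fuse_fields`, the 2D toy grid, and the hard mask-based selection are all assumptions for illustration, not the authors' method.

```python
def fuse_fields(labels, region_fields):
    """Fuse per-region displacement fields into one global field.

    labels: 2D list of region ids (the segmentation).
    region_fields: dict mapping region id -> 2D list of (dy, dx) tuples,
        one displacement per pixel, as predicted for that region alone.
    Assumption: hard selection by label; a real system might blend
    fields smoothly near region boundaries instead.
    """
    fused = []
    for r, row in enumerate(labels):
        fused.append([region_fields[lab][r][c] for c, lab in enumerate(row)])
    return fused


# Toy 2x2 image with two regions; region 0 moves down, region 1 moves right.
labels = [[0, 1],
          [1, 0]]
fields = {
    0: [[(1, 0), (1, 0)], [(1, 0), (1, 0)]],
    1: [[(0, 2), (0, 2)], [(0, 2), (0, 2)]],
}
global_field = fuse_fields(labels, fields)
```

Hard selection leaves the global field discontinuous across region boundaries, which fits the stated goal of avoiding a globally uniform smoothness constraint, though whether SegReg smooths the seams is not stated in the abstract.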

[127] CBPNet: A Continual Backpropagation Prompt Network for Alleviating Plasticity Loss on Edge Devices

Runjie Shao,Boyu Diao,Zijia An,Ruiqi Liu,Yongjun Xu

Main category: cs.CV

TL;DR: The paper proposes CBPNet, a continual learning framework that restores a model's learning vitality by adaptively reinitializing underutilized parameters. It addresses the plasticity loss that arises in frozen pretrained models from the limited capacity of prompt parameters, and delivers parameter-efficient gains on several benchmarks.

Details Motivation: Continual learning methods based on a frozen backbone and prompt tuning mitigate catastrophic forgetting but suffer from plasticity loss, which limits the model's ability to learn new knowledge. The paper aims to remove this bottleneck and improve continual learning on edge devices. Method: The Continual Backpropagation Prompt Network (CBPNet) introduces an Efficient CBP Block that identifies parameters underutilized during training and adaptively reinitializes them, restoring the model's update vitality and learning capacity. Result: On Split CIFAR-100, average accuracy improves by more than 1% over a strong baseline; on the more challenging Split ImageNet-R, it reaches a state-of-the-art accuracy of 69.41%, with additional parameters amounting to less than 0.2% of the backbone's size. Conclusion: By restoring the update vitality of model parameters, CBPNet alleviates plasticity loss in frozen models and improves continual learning performance while remaining parameter-efficient, which suits resource-constrained edge devices. Abstract: To meet the demands of applications like robotics and autonomous driving that require real-time responses to dynamic environments, efficient continual learning methods suitable for edge devices have attracted increasing attention. In this transition, using frozen pretrained models with prompts has become a mainstream strategy to combat catastrophic forgetting. However, this approach introduces a new critical bottleneck: plasticity loss, where the model's ability to learn new knowledge diminishes due to the frozen backbone and the limited capacity of prompt parameters. We argue that the reduction in plasticity stems from a lack of update vitality in underutilized parameters during the training process. To this end, we propose the Continual Backpropagation Prompt Network (CBPNet), an effective and parameter-efficient framework designed to restore the model's learning vitality. We innovatively integrate an Efficient CBP Block that counteracts plasticity decay by adaptively reinitializing these underutilized parameters. Experimental results on edge devices demonstrate CBPNet's effectiveness across multiple benchmarks. On Split CIFAR-100, it improves average accuracy by over 1% against a strong baseline, and on the more challenging Split ImageNet-R, it achieves a state-of-the-art accuracy of 69.41%. This is accomplished by training additional parameters that constitute less than 0.2% of the backbone's size, validating our approach.
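
To make the idea of "adaptively reinitializing underutilized parameters" concrete, here is a minimal Python sketch. It assumes each unit already has a scalar utility score and simply resets the lowest-scoring fraction to small random values. How CBPNet actually measures utility and schedules resets is not given in the abstract, so `reinit_low_utility`, the `fraction` and `scale` parameters, and the Gaussian reset are hypothetical.

```python
import random


def reinit_low_utility(weights, utility, fraction=0.1, rng=None, scale=0.01):
    """Return a copy of `weights` with the least-used units reinitialized.

    weights: list of per-unit weight vectors (lists of floats).
    utility: one score per unit; lower means less used (assumed given).
    fraction: share of units to reset each call (at least one unit).
    """
    rng = rng or random.Random(0)
    n_reset = max(1, int(len(weights) * fraction))
    # Indices of the n_reset units with the smallest utility.
    worst = sorted(range(len(weights)), key=lambda i: utility[i])[:n_reset]
    new_weights = [list(w) for w in weights]  # leave the input untouched
    for i in worst:
        new_weights[i] = [rng.gauss(0.0, scale) for _ in weights[i]]
    return new_weights, worst


weights = [[1.0, 1.0] for _ in range(4)]
utility = [0.9, 0.01, 0.5, 0.7]
new_weights, reset_units = reinit_low_utility(weights, utility, fraction=0.25)
```

Only the reset units change; all others keep their learned values, so what the model already knows is largely preserved while dormant capacity becomes trainable again.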

[128] FoBa: A Foreground-Background co-Guided Method and New Benchmark for Remote Sensing Semantic Change Detection

Haotian Zhang,Han Guo,Keyan Chen,Hao Chen,Zhengxia Zou,Zhenwei Shi

Main category: cs.CV

TL;DR: The paper presents LevirSCD, a new benchmark for remote sensing semantic change detection, and FoBa, a foreground-background co-guided method; fine-grained class definitions and an enhanced interaction fusion module substantially improve change detection performance.

Details Motivation: Existing remote sensing semantic change detection datasets have limited categories, too few change types, and no fine-grained definitions, and current methods underuse change information, which limits model performance. Method: The authors build LevirSCD, a dataset with 16 change categories and 210 specific change types, and propose FoBa, which guides the model jointly with foreground regions and context-rich backgrounds and adds a Gated Interaction Fusion module and a consistency loss. Result: On three datasets (SECOND, JL1, and LevirSCD), FoBa improves the SeK metric over current SOTA methods by 1.48%, 3.61%, and 2.81%, respectively. Conclusion: FoBa alleviates semantic ambiguity and strengthens detection of subtle changes, and LevirSCD offers a more comprehensive benchmark for remote sensing semantic change detection. Abstract: Despite the remarkable progress achieved in remote sensing semantic change detection (SCD), two major challenges remain. At the data level, existing SCD datasets suffer from limited change categories, insufficient change types, and a lack of fine-grained class definitions, making them inadequate to fully support practical applications. At the methodological level, most current approaches underutilize change information, typically treating it as a post-processing step to enhance spatial consistency, which constrains further improvements in model performance. To address these issues, we construct a new benchmark for remote sensing SCD, LevirSCD. Focused on the Beijing area, the dataset covers 16 change categories and 210 specific change types, with more fine-grained class definitions (e.g., roads are divided into unpaved and paved roads). Furthermore, we propose a foreground-background co-guided SCD (FoBa) method, which leverages foregrounds that focus on regions of interest and backgrounds enriched with contextual information to guide the model collaboratively, thereby alleviating semantic ambiguity while enhancing its ability to detect subtle changes. Considering the requirements of bi-temporal interaction and spatial consistency in SCD, we introduce a Gated Interaction Fusion (GIF) module along with a simple consistency loss to further enhance the model's detection performance.
Extensive experiments on three datasets (SECOND, JL1, and the proposed LevirSCD) demonstrate that FoBa achieves competitive results compared to current SOTA methods, with improvements of 1.48%, 3.61%, and 2.81% in the SeK metric, respectively. Our code and dataset are available at https://github.com/zmoka-zht/FoBa.

[129] Minimal Semantic Sufficiency Meets Unsupervised Domain Generalization

Tan Pan,Kaiyu Guo,Dongli Xu,Zhaorui Tan,Chen Jiang,Deshu Chen,Xin Guo,Brian C. Lovell,Limei Han,Yuan Cheng,Mahsa Baktashmotlagh

Main category: cs.CV

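TL;DR: The paper proposes MS-UDG, a new unsupervised domain generalization method that formalizes, from an information-theoretic view, the learning of a minimal sufficient semantic representation, preserving semantic information while removing irrelevant variation; it achieves state-of-the-art performance without category or domain labels.

Details Motivation: Existing unsupervised domain generalization methods rely on domain labels, which are often unavailable in real-world settings, and it is hard to separate semantics from variation without category labels. A method is needed that requires no labels and still extracts shared semantic information. Method: Unsupervised domain generalization is cast as learning a minimal sufficient semantic representation, grounded in information theory; an InfoNCE objective ensures sufficiency, while a new semantic-variation disentanglement loss and a reconstruction-based mechanism promote minimality, yielding the MS-UDG model. Result: MS-UDG reaches the state of the art on several popular unsupervised domain generalization benchmarks, consistently outperforming existing self-supervised learning and UDG methods, with no category or domain labels used during training. Conclusion: Optimizing for a minimal sufficient representation improves generalization in unsupervised domain generalization, and the MS-UDG framework offers a new approach to robust representation learning without labels. Abstract: The generalization ability of deep learning has been extensively studied in supervised settings, yet it remains less explored in unsupervised scenarios. Recently, the Unsupervised Domain Generalization (UDG) task has been proposed to enhance the generalization of models trained with prevalent unsupervised learning techniques, such as Self-Supervised Learning (SSL). UDG confronts the challenge of distinguishing semantics from variations without category labels. Although some recent methods have employed domain labels to tackle this issue, such domain labels are often unavailable in real-world contexts. In this paper, we address these limitations by formalizing UDG as the task of learning a Minimal Sufficient Semantic Representation: a representation that (i) preserves all semantic information shared across augmented views (sufficiency), and (ii) maximally removes information irrelevant to semantics (minimality). We theoretically ground these objectives from the perspective of information theory, demonstrating that optimizing representations to achieve sufficiency and minimality directly reduces out-of-distribution risk. Practically, we implement this optimization through Minimal-Sufficient UDG (MS-UDG), a learnable model by integrating (a) an InfoNCE-based objective to achieve sufficiency; (b) two complementary components to promote minimality: a novel semantic-variation disentanglement loss and a reconstruction-based mechanism for capturing adequate variation.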
Empirically, MS-UDG sets a new state-of-the-art on popular unsupervised domain-generalization benchmarks, consistently outperforming existing SSL and UDG methods, without category or domain labels during representation learning.
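
The sufficiency term above is described as "InfoNCE-based". For readers unfamiliar with that loss, the following is a minimal sketch of the generic InfoNCE objective over two augmented views, written in plain Python. It is not MS-UDG's implementation: the cosine similarity, the temperature value, and the name `info_nce` are assumptions, and the disentanglement and reconstruction terms are not shown.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss: anchors[i] should match positives[i].

    Every other positives[j] in the batch acts as a negative for anchor i.
    Vectors are L2-normalized, so the dot product is cosine similarity.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]  # cross-entropy with target index i
    return loss / len(a)


views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(views_a, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(views_a, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each anchor is most similar to its own positive and large when the pairing is wrong, which is what pushes the representation to keep the information shared across views.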

[130] TASAM: Terrain-and-Aware Segment Anything Model for Temporal-Scale Remote Sensing Segmentation

Tianyang Wang,Xi Xiao,Gaofei Chen,Hanzhang Chi,Qi Zhang,Guo Cheng,Yingrui Ji

Main category: cs.CV

TL;DR: The paper proposes TASAM, a terrain- and temporally-aware extension of SAM for high-resolution remote sensing image segmentation. By adding a terrain-aware adapter, a temporal prompt generator, and a multi-scale fusion strategy, it substantially improves segmentation on remote sensing data without retraining the SAM backbone.

Details Motivation: The Segment Anything Model (SAM) performs well on natural images but generalizes poorly to remote sensing data, with its complex terrain, multi-scale objects, and temporal dynamics, so a dedicated method is needed for these challenges. Method: TASAM adds three lightweight modules: a terrain-aware adapter that injects elevation priors, a temporal prompt generator that captures land-cover changes, and a multi-scale fusion strategy that sharpens fine-grained object delineation, all without retraining the SAM backbone. Result: On three remote sensing benchmarks (LoveDA, iSAID, and WHU-CD), TASAM clearly outperforms both zero-shot SAM and task-specific models with minimal computational overhead. Conclusion: Domain-adaptive augmentation substantially improves SAM's segmentation of remote sensing imagery and offers a practical path toward more robust, scalable geospatial segmentation systems. Abstract: Segment Anything Model (SAM) has demonstrated impressive zero-shot segmentation capabilities across natural image domains, but it struggles to generalize to the unique challenges of remote sensing data, such as complex terrain, multi-scale objects, and temporal dynamics. In this paper, we introduce TASAM, a terrain and temporally-aware extension of SAM designed specifically for high-resolution remote sensing image segmentation. TASAM integrates three lightweight yet effective modules: a terrain-aware adapter that injects elevation priors, a temporal prompt generator that captures land-cover changes over time, and a multi-scale fusion strategy that enhances fine-grained object delineation. Without retraining the SAM backbone, our approach achieves substantial performance gains across three remote sensing benchmarks (LoveDA, iSAID, and WHU-CD), outperforming both zero-shot SAM and task-specific models with minimal computational overhead. Our results highlight the value of domain-adaptive augmentation for foundation models and offer a scalable path toward more robust geospatial segmentation.

[131] ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding

Kehua Chen

Main category: cs.CV

TL;DR: The paper proposes ChronoForge-RL, a video understanding framework that combines Temporal Apex Distillation (TAD) and KeyFrame-aware Group Relative Policy Optimization (KF-GRPO). A differentiable keyframe selection mechanism identifies semantic inflection points, improving computational efficiency while preserving temporal information; the method clearly surpasses prior approaches on VideoMME and LVBench, and its 7B-parameter model performs comparably to 72B models.

Details Motivation: Existing video understanding methods face high computational cost on dense video and struggle to find key semantic frames with uniform sampling, so a more efficient framework that preserves temporal semantics is needed. Method: ChronoForge-RL has two core modules: TAD selects keyframes through variation scoring, inflection detection, and prioritized distillation; KF-GRPO uses contrastive learning with a saliency-enhanced reward mechanism that encourages the model to reason over both frame content and temporal relationships. Result: The method reaches 69.1% on VideoMME and 52.7% on LVBench, clearly above baseline methods, and the 7B-parameter model approaches the performance of 72B-parameter models. Conclusion: Through effective keyframe selection and temporal modeling, ChronoForge-RL achieves strong video understanding performance at much lower computational cost, offering a new paradigm for efficient video understanding. Abstract: Current state-of-the-art video understanding methods typically struggle with two critical challenges: (1) the computational infeasibility of processing every frame in dense video content and (2) the difficulty in identifying semantically significant frames through naive uniform sampling strategies. In this paper, we propose a novel video understanding framework, called ChronoForge-RL, which combines Temporal Apex Distillation (TAD) and KeyFrame-aware Group Relative Policy Optimization (KF-GRPO) to tackle these issues. Concretely, we introduce a differentiable keyframe selection mechanism that systematically identifies semantic inflection points through a three-stage process to enhance computational efficiency while preserving temporal information. Then, two particular modules are proposed to enable effective temporal reasoning: Firstly, TAD leverages variation scoring, inflection detection, and prioritized distillation to select the most informative frames. Secondly, we introduce KF-GRPO which implements a contrastive learning paradigm with a saliency-enhanced reward mechanism that explicitly incentivizes models to leverage both frame content and temporal relationships. Finally, our proposed ChronoForge-RL achieves 69.1% on VideoMME and 52.7% on LVBench compared to baseline methods, clearly surpassing previous approaches while enabling our 7B parameter model to achieve performance comparable to 72B parameter alternatives.
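
The three TAD stages (variation scoring, inflection detection, prioritized selection) can be illustrated with a deliberately simplified, non-differentiable Python sketch. It scores each frame by how much its feature vector differs from the previous frame, keeps local peaks of that score, and returns the strongest `k`. The scoring function, the peak rule, and the name `select_keyframes` are assumptions; the paper's mechanism is described as differentiable and is not reproduced here.

```python
def select_keyframes(features, k=2):
    """Pick up to k frame indices where the content changes most sharply.

    features: list of per-frame feature vectors (lists of floats).
    Stage 1 (variation scoring): L1 distance to the previous frame.
    Stage 2 (inflection detection): keep local maxima of the score.
    Stage 3 (prioritization): keep the k highest-scoring peaks.
    """
    scores = [0.0] + [
        sum(abs(a - b) for a, b in zip(features[t], features[t - 1]))
        for t in range(1, len(features))
    ]
    peaks = [
        t for t in range(1, len(scores) - 1)
        if scores[t] > scores[t - 1] and scores[t] >= scores[t + 1]
    ]
    top = sorted(peaks, key=lambda t: scores[t], reverse=True)[:k]
    return sorted(top)  # return in temporal order


# Seven one-dimensional "frames" with two abrupt content changes.
frames = [[0.0], [0.0], [5.0], [5.0], [5.0], [9.0], [9.0]]
keyframes = select_keyframes(frames, k=2)
```

Compared with uniform sampling, this spends the frame budget on moments where the scene changes rather than on redundant stretches, which is the efficiency argument the abstract makes.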

[132] CIDER: A Causal Cure for Brand-Obsessed Text-to-Image Models

Fangjian Shen,Zifeng Liang,Chao Wang,Wushao Wen

Main category: cs.CV

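TL;DR: Proposes CIDER, a framework that mitigates brand bias in text-to-image models through inference-time prompt refinement, preserving image quality while making generated content more original and equitable.

Details Motivation: Text-to-image models exhibit "brand bias", a tendency to generate content featuring dominant commercial brands, which poses ethical and legal risks and calls for an effective mitigation method. Method: CIDER uses a lightweight detector to identify branded content and a vision-language model (VLM) to generate stylistically divergent alternative prompts, giving model-agnostic debiasing at inference time; the Brand Neutrality Score (BNS) is introduced to quantify the bias. Result: Experiments on several leading T2I models show that CIDER significantly reduces both explicit and implicit brand bias while maintaining image quality and aesthetic appeal. Conclusion: CIDER is a practical and effective way to reduce brand bias in text-to-image models and contributes to trustworthy generative AI. Abstract: Text-to-image (T2I) models exhibit a significant yet under-explored "brand bias", a tendency to generate content featuring dominant commercial brands from generic prompts, posing ethical and legal risks. We propose CIDER, a novel, model-agnostic framework to mitigate bias at inference time through prompt refinement to avoid costly retraining. CIDER uses a lightweight detector to identify branded content and a Vision-Language Model (VLM) to generate stylistically divergent alternatives. We introduce the Brand Neutrality Score (BNS) to quantify this issue and perform extensive experiments on leading T2I models. Results show CIDER significantly reduces both explicit and implicit biases while maintaining image quality and aesthetic appeal. Our work offers a practical solution for more original and equitable content, contributing to the development of trustworthy generative AI.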

[133] Boosting Active Learning with Knowledge Transfer

Tianyang Wang,Xi Xiao,Gaofei Chen,Xiaoying Liao,Guo Cheng,Yingrui Ji

Main category: cs.CV

TL;DR: Proposes a new knowledge-transfer-based method for uncertainty estimation in active learning; a teacher-student structure improves the estimates, applies to a variety of tasks, and needs no complex training scheme.

Details Motivation: Existing methods estimate uncertainty for unlabeled data with complex auxiliary models and advanced training schemes, which makes them hard to apply to domain-specific tasks such as cryo-ET classification in computational biology. Method: A teacher-student framework is used in which the teacher is the task model and the student is a generic auxiliary model; both are trained simultaneously in each active learning cycle, the distance between their outputs measures uncertainty, and the analysis focuses on the upper bound of the task loss rather than its concrete value. Result: Extensive experiments on classical computer vision tasks and cryo-ET challenges validate the method's efficacy and efficiency. Conclusion: The method improves uncertainty estimation in active learning and is general and practical, particularly for complex or specialized domains. Abstract: Uncertainty estimation is at the core of Active Learning (AL). Most existing methods resort to complex auxiliary models and advanced training schemes to estimate uncertainty for unlabeled data. These models need special design and hence are difficult to train, especially for domain-specific tasks such as Cryo-Electron Tomography (cryo-ET) classification in computational biology. To address this challenge, we propose a novel method using knowledge transfer to boost uncertainty estimation in AL. Specifically, we exploit the teacher-student mode where the teacher is the task model in AL and the student is an auxiliary model that learns from the teacher. We train the two models simultaneously in each AL cycle and adopt a certain distance between the model outputs to measure uncertainty for unlabeled data. The student model is task-agnostic and does not rely on special training schemes (e.g. adversarial), making our method suitable for various tasks. More importantly, we demonstrate that data uncertainty is not tied to the concrete value of the task loss but is closely related to the upper bound of the task loss. We conduct extensive experiments to validate the proposed method on classical computer vision tasks and cryo-ET challenges. The results demonstrate its efficacy and efficiency.
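
The selection step, "a certain distance between the model outputs", can be sketched as follows. The abstract does not name the distance, so this Python example assumes an L1 distance between teacher and student class probabilities and picks the samples where the two models disagree most. `rank_by_disagreement` and the `budget` parameter are illustrative names, not the authors' API.

```python
def rank_by_disagreement(teacher_probs, student_probs, budget):
    """Return the indices of the `budget` most uncertain unlabeled samples.

    teacher_probs / student_probs: per-sample class probability lists
    from the task model and the auxiliary model, in the same order.
    Uncertainty is taken to be the L1 distance between the two outputs
    (an assumed choice; any output-space distance could be substituted).
    """
    distances = [
        sum(abs(t - s) for t, s in zip(tp, sp))
        for tp, sp in zip(teacher_probs, student_probs)
    ]
    order = sorted(range(len(distances)), key=lambda i: distances[i],
                   reverse=True)
    return order[:budget], distances


teacher = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
student = [[0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
to_label, distances = rank_by_disagreement(teacher, student, budget=2)
```

Samples on which the student has failed to imitate the teacher are sent for annotation first; samples where the two agree exactly, such as the third one here, are treated as low priority.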

[134] LC-SLab -- An Object-based Deep Learning Framework for Large-scale Land Cover Classification from Satellite Imagery and Sparse In-situ Labels

Johannes Leonhardt,Juergen Gall,Ribana Roscher

Main category: cs.CV

TL;DR: The paper presents LC-SLab, the first object-based deep learning framework for large-scale land cover classification under sparse supervision; with input-level and output-level aggregation, it reduces map fragmentation while keeping accuracy high.

Details Motivation: Existing deep learning methods tend to produce fragmented, noisy predictions when trained on sparse in-situ data. Object-based classification improves semantic coherence, but it remains underexplored for medium-resolution imagery and sparse supervision. Method: LC-SLab supports input-level aggregation with graph neural networks and output-level aggregation by postprocessing, and incorporates features from a large pre-trained network to improve performance on small datasets. Result: On Sentinel-2 imagery with LUCAS labels, object-based methods match or exceed the accuracy of pixel-wise models while substantially reducing fragmentation; input-level aggregation is more robust on small datasets, output-level aggregation performs best with more data, and several configurations outperform existing land cover products. Conclusion: LC-SLab shows that, under sparse supervision, object-based deep learning can produce more coherent and accurate land cover maps, which gives it practical value. Abstract: Large-scale land cover maps generated using deep learning play a critical role across a wide range of Earth science applications. Open in-situ datasets from principled land cover surveys offer a scalable alternative to manual annotation for training such models. However, their sparse spatial coverage often leads to fragmented and noisy predictions when used with existing deep learning-based land cover mapping approaches. A promising direction to address this issue is object-based classification, which assigns labels to semantically coherent image regions rather than individual pixels, thereby imposing a minimum mapping unit. Despite this potential, object-based methods remain underexplored in deep learning-based land cover mapping pipelines, especially in the context of medium-resolution imagery and sparse supervision. To address this gap, we propose LC-SLab, the first deep learning framework for systematically exploring object-based deep learning methods for large-scale land cover classification under sparse supervision. LC-SLab supports both input-level aggregation via graph neural networks, and output-level aggregation by postprocessing results from established semantic segmentation models. Additionally, we incorporate features from a large pre-trained network to improve performance on small datasets. We evaluate the framework on annual Sentinel-2 composites with sparse LUCAS labels, focusing on the tradeoff between accuracy and fragmentation, as well as sensitivity to dataset size.
Our results show that object-based methods can match or exceed the accuracy of common pixel-wise models while producing substantially more coherent maps. Input-level aggregation proves more robust on smaller datasets, whereas output-level aggregation performs best with more data. Several configurations of LC-SLab also outperform existing land cover products, highlighting the framework's practical utility.
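
Output-level aggregation, as described above, postprocesses a pixel-wise prediction so that every pixel in an image object receives one label. A minimal Python sketch under the assumption of a simple majority vote per segment is shown below; the abstract does not specify the aggregation rule, and `aggregate_by_object` and the toy label names are illustrative.

```python
from collections import Counter


def aggregate_by_object(pixel_labels, segment_ids):
    """Relabel each pixel with the majority class of its segment.

    pixel_labels: 2D list of per-pixel class predictions.
    segment_ids: 2D list of the same shape giving each pixel's object id.
    Assumption: plain majority vote; ties resolve to the class seen first.
    """
    votes = {}
    for label_row, segment_row in zip(pixel_labels, segment_ids):
        for label, segment in zip(label_row, segment_row):
            votes.setdefault(segment, Counter())[label] += 1
    winner = {seg: counts.most_common(1)[0][0] for seg, counts in votes.items()}
    return [[winner[seg] for seg in row] for row in segment_ids]


# One stray "crop" pixel inside a forest object is smoothed away.
pixels = [["forest", "forest", "crop"],
          ["forest", "crop", "crop"]]
segments = [[0, 0, 1],
            [0, 0, 1]]
coherent = aggregate_by_object(pixels, segments)
```

Because isolated misclassified pixels are outvoted by their neighbours in the same object, the resulting map is less fragmented, at the cost of losing any true detail smaller than a segment.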

[135] Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval

Liwei Liao,Xufeng Li,Xiaoyun Zheng,Boning Liu,Feng Gao,Ronggang Wang

Main category: cs.CV

TL;DR: Proposes GVR, a zero-shot 3D visual grounding framework that recasts 3D visual grounding as a 2D retrieval task and uses object-level view retrieval across multiple views to avoid per-scene training and large amounts of labeled data.

Details Motivation: Existing 3D visual grounding methods struggle with the implicit representation of spatial textures in 3D Gaussian Splatting and depend on large amounts of labeled data and per-scene training, which limits their use. Method: Grounding via View Retrieval (GVR) recasts 3D visual grounding as a 2D view retrieval task and collects grounding clues through object-level retrieval over multiple views, enabling zero-shot grounding without per-scene training. Result: Experiments show that GVR achieves state-of-the-art 3D visual grounding performance while avoiding per-scene training. Conclusion: GVR offers a new approach to zero-shot 3D visual grounding that addresses high annotation cost and low training efficiency. Abstract: 3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods encounter two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require large amounts of labeled data for effective training. To this end, we propose Grounding via View Retrieval (GVR), a novel zero-shot visual grounding framework for 3DGS that recasts 3DVG as a 2D retrieval task and leverages object-level view retrieval to collect grounding clues from multiple views, which not only avoids the costly process of 3D annotation, but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found at https://github.com/leviome/GVR_demos.

[136] ENSAM: an efficient foundation model for interactive segmentation of 3D medical images

Elias Stenhede,Agnar Martin Bjørnstad,Arian Ranjbar

Main category: cs.CV

TL;DR: ENSAM is a lightweight, promptable model for universal 3D medical image segmentation. It uses a U-Net-style architecture that combines a SegResNet-based encoder with a prompt encoder and mask decoder, together with latent cross-attention, relative positional encoding, normalized attention, and the Muon optimizer, and trains from scratch in 6 hours on a single 32 GB GPU with strong results.

Details Motivation: Achieve efficient, universal 3D medical image segmentation under limited data and compute, support multimodal imaging inputs, and advance models that need no pretrained weights. Method: A U-Net-style architecture combines a SegResNet encoder, a prompt encoder, and a mask decoder, with latent cross-attention, relative positional encoding, and normalized attention; training uses the Muon optimizer. Result: In the CVPR 2025 challenge, ENSAM obtained a DSC AUC of 2.404, an NSD AUC of 2.266, a final DSC of 0.627, and a final NSD of 0.597 on the hidden test set, outperforming VISTA3D and SAM-Med3D and matching SegVol; in the coreset track it ranked 5th and was the best among methods not using pretrained weights. Conclusion: ENSAM trains efficiently under low-resource conditions and performs well; relative positional encoding and the Muon optimizer substantially speed up convergence and improve segmentation quality, confirming the model's effectiveness for multimodal 3D medical image segmentation. Abstract: We present ENSAM (Equivariant, Normalized, Segment Anything Model), a lightweight and promptable model for universal 3D medical image segmentation. ENSAM combines a SegResNet-based encoder with a prompt encoder and mask decoder in a U-Net-style architecture, using latent cross-attention, relative positional encoding, normalized attention, and the Muon optimizer for training. ENSAM is designed to achieve good performance under limited data and computational budgets, and is trained from scratch on under 5,000 volumes from multiple modalities (CT, MRI, PET, ultrasound, microscopy) on a single 32 GB GPU in 6 hours. As part of the CVPR 2025 Foundation Models for Interactive 3D Biomedical Image Segmentation Challenge, ENSAM was evaluated on a hidden test set with multimodal 3D medical images, obtaining a DSC AUC of 2.404, NSD AUC of 2.266, final DSC of 0.627, and final NSD of 0.597, outperforming two previously published baseline models (VISTA3D, SAM-Med3D) and matching the third (SegVol), surpassing its performance in final DSC but trailing behind in the other three metrics. In the coreset track of the challenge, ENSAM ranks 5th of 10 overall and best among the approaches not utilizing pretrained weights. Ablation studies confirm that our use of relative positional encodings and the Muon optimizer each substantially speed up convergence and improve segmentation quality.
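
The "final DSC" figures above refer to the Dice similarity coefficient, the standard overlap measure between a predicted mask and the reference mask: twice the intersection divided by the sum of the two mask sizes. A minimal Python sketch on flattened binary masks follows; the function name `dice` and the convention of returning 1.0 when both masks are empty are assumptions, and the challenge's AUC and NSD metrics are not shown.

```python
def dice(pred, target):
    """Dice similarity coefficient for two flattened binary masks.

    DSC = 2 * |pred AND target| / (|pred| + |target|), in [0, 1].
    Convention assumed here: two empty masks count as a perfect match.
    """
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


partial = dice([1, 1, 0, 0], [1, 0, 0, 0])   # one shared voxel out of 2 + 1
perfect = dice([0, 1, 1], [0, 1, 1])
disjoint = dice([1, 0], [0, 1])
```

Here the partially overlapping pair scores 2/3, the identical pair 1.0, and the disjoint pair 0.0, which gives a sense of scale for a reported final DSC of 0.627.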

[137] Self-Supervised Cross-Modal Learning for Image-to-Point Cloud Registration

Xingmei Wang,Xiaoyu Hu,Chengkai Huang,Ziyan Zeng,Guohao Nie,Quan Z. Sheng,Lina Yao

Main category: cs.CV

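TL;DR: Proposes CrossI2P, a self-supervised end-to-end framework that performs efficient cross-modal registration between images and point clouds through dual-path contrastive learning and a coarse-to-fine registration paradigm.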

Details Motivation: To address the semantic-geometric gap between images and point clouds and the tendency of existing methods to get stuck in local optima, improving the robustness of 2D-3D perception. Method: Dual-path contrastive learning builds a geometric-semantic fused embedding space; superpoint-superpixel correspondences provide a global coarse registration, followed by a geometry-constrained point-level refinement for fine registration; a dynamic gradient normalization mechanism balances the multi-task losses. Result: Surpasses existing methods by 23.7% on KITTI Odometry and by 37.9% on nuScenes, substantially improving registration accuracy and robustness. Conclusion: CrossI2P effectively bridges 2D images and 3D point clouds, achieving more accurate and more robust cross-modal registration suitable for autonomous systems such as self-driving vehicles. Abstract: Bridging 2D and 3D sensor modalities is critical for robust perception in autonomous systems. However, image-to-point cloud (I2P) registration remains challenging due to the semantic-geometric gap between texture-rich but depth-ambiguous images and sparse yet metrically precise point clouds, as well as the tendency of existing methods to converge to local optima. To overcome these limitations, we introduce CrossI2P, a self-supervised framework that unifies cross-modal learning and two-stage registration in a single end-to-end pipeline. First, we learn a geometric-semantic fused embedding space via dual-path contrastive learning, enabling annotation-free, bidirectional alignment of 2D textures and 3D structures. Second, we adopt a coarse-to-fine registration paradigm: a global stage establishes superpoint-superpixel correspondences through joint intra-modal context and cross-modal interaction modeling, followed by a geometry-constrained point-level refinement for precise registration. Third, we employ a dynamic training mechanism with gradient normalization to balance losses for feature alignment, correspondence refinement, and pose estimation. Extensive experiments demonstrate that CrossI2P outperforms state-of-the-art methods by 23.7% on the KITTI Odometry benchmark and by 37.9% on nuScenes, significantly improving both accuracy and robustness.

[138] RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning

Xiaosheng Long,Hanyu Wang,Zhentao Song,Kun Luo,Hongde Liu

Main category: cs.CV

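TL;DR: Proposes RACap, a relation-aware retrieval-augmented image captioning model that mines structured relation semantics from retrieved captions and identifies heterogeneous objects in the image, improving semantic consistency and relational expressiveness.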

Details Motivation: Existing retrieval-augmented image captioning methods are limited in relation modeling: semantic prompt representations are too coarse, and image objects and their semantic relationships are not modeled explicitly. Method: RACap mines structured relation semantics from retrieved captions, identifies heterogeneous objects in the image, and uses structured relation features containing heterogeneous visual information to enhance caption generation. Result: With only 10.8M trainable parameters, RACap outperforms previous lightweight image captioning models. Conclusion: RACap effectively improves relation modeling in image captioning, strengthening semantic consistency and relational expressiveness while remaining lightweight. Abstract: Recent retrieval-augmented image captioning methods incorporate external knowledge to compensate for the limitations in comprehending complex scenes. However, current approaches face challenges in relation modeling: (1) the representation of semantic prompts is too coarse-grained to capture fine-grained relationships; (2) these methods lack explicit modeling of image objects and their semantic relationships. To address these limitations, we propose RACap, a relation-aware retrieval-augmented model for image captioning, which not only mines structured relation semantics from retrieval captions, but also identifies heterogeneous objects from the image. RACap effectively retrieves structured relation features that contain heterogeneous visual information to enhance the semantic consistency and relational expressiveness. Experimental results show that RACap, with only 10.8M trainable parameters, achieves superior performance compared to previous lightweight captioning models.

[139] RangeSAM: Leveraging Visual Foundation Models for Range-View represented LiDAR segmentation

Paul Julius Kühn,Duc Anh Nguyen,Arjan Kuijper,Holger Graf,Dieter Fellner,Saptarshi Neil Sinha

Main category: cs.CV

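TL;DR: Presents what the authors describe as the first range-view framework for LiDAR point cloud segmentation built on the visual foundation model SAM2. The encoder is modified to suit range-view features, achieving competitive performance on SemanticKITTI while retaining the efficiency and deployment simplicity of 2D pipelines.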

Details Motivation: Voxel- and point-based methods incur high computational cost and irregular memory access, whereas range-view methods can use mature 2D semantic segmentation techniques for fast and accurate prediction; the paper therefore asks whether the state-of-the-art visual foundation model SAM2 can serve as an effective backbone for range-view LiDAR point cloud segmentation. Method: A new range-view framework applies SAM2 to 3D point cloud segmentation, combining efficient 2D feature extraction with standard projection/back-projection, and makes three modifications to the SAM2 encoder: (1) a module that emphasizes horizontal spatial dependencies in LiDAR range images; (2) a configuration customized for the geometry of spherical projections; (3) a mechanism adapted to the distinctive spatial patterns and discontinuities of range-view pseudo-images. Result: Competitive performance on SemanticKITTI, together with the speed, scalability, and simple deployment of 2D methods. Conclusion: The work supports the viability of visual foundation models as general-purpose backbones for 3D perception, opens a direction toward LiDAR segmentation built on a unified foundation model, and indicates that VFM-based range-view methods are promising. Abstract: Point cloud segmentation is central to autonomous driving and 3D scene understanding. While voxel- and point-based methods dominate recent research due to their compatibility with deep architectures and ability to capture fine-grained geometry, they often incur high computational cost, irregular memory access, and limited real-time efficiency. In contrast, range-view methods, though relatively underexplored, can leverage mature 2D semantic segmentation techniques for fast and accurate predictions. Motivated by the rapid progress in Visual Foundation Models (VFMs) for captioning, zero-shot recognition, and multimodal tasks, we investigate whether SAM2, the current state-of-the-art VFM for segmentation tasks, can serve as a strong backbone for LiDAR point cloud segmentation in the range view. We present RangeSAM, to our knowledge the first range-view framework that adapts SAM2 to 3D segmentation, coupling efficient 2D feature extraction with standard projection/back-projection to operate on point clouds. To optimize SAM2 for range-view representations, we implement several architectural modifications to the encoder: (1) a novel module that emphasizes horizontal spatial dependencies inherent in LiDAR range images, (2) a customized configuration tailored to the geometric properties of spherical projections, and (3) an adapted mechanism in the encoder backbone specifically designed to capture the unique spatial patterns and discontinuities present in range-view pseudo-images. Our approach achieves competitive performance on SemanticKITTI while benefiting from the speed, scalability, and deployment simplicity of 2D-centric pipelines. This work highlights the viability of VFMs as general-purpose backbones for 3D perception and opens a path toward unified, foundation-model-driven LiDAR segmentation. Our results lead us to conclude that range-view segmentation methods using VFMs yield promising results.
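The "standard projection" mentioned in the abstract is commonly a spherical projection that maps each LiDAR point to a pixel of a range image. The following is a minimal sketch of that general technique, not the authors' code: the function name, the image size (64 x 2048), and the vertical field of view (+3 to -25 degrees, typical for a 64-beam sensor) are assumptions chosen for illustration.

```python
import math

def spherical_project(points, width=2048, height=64,
                      fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z) points onto a height x width range image.

    Returns a dict mapping (row, col) to range, keeping the nearest point
    when several points fall into the same pixel. All default parameters
    are illustrative assumptions.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down                      # total vertical field of view
    image = {}
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)     # range of the point
        if r == 0.0:
            continue                             # skip degenerate points
        yaw = math.atan2(y, x)                   # horizontal angle
        pitch = math.asin(z / r)                 # vertical angle
        u = 0.5 * (1.0 - yaw / math.pi) * width  # column coordinate
        v = (1.0 - (pitch - fov_down) / fov) * height  # row coordinate
        col = min(width - 1, max(0, int(u)))
        row = min(height - 1, max(0, int(v)))
        if (row, col) not in image or r < image[(row, col)]:
            image[(row, col)] = r
    return image
```

Back-projection simply reads the predicted label of a pixel back onto every point that was mapped to it, which is why a 2D backbone can be used unchanged on the resulting pseudo-image.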

[140] Global Regulation and Excitation via Attention Tuning for Stereo Matching

Jiahao Li,Xinhong Chen,Zhengmin Jiang,Qian Zhou,Yung-Hui Li,Jianping Wang

Main category: cs.CV

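TL;DR: Proposes GREAT, a framework that adds global context to existing iterative stereo matching methods through three attention modules (spatial, matching, and volume attention), markedly improving performance in challenging regions with occlusions, missing texture, or repetitive texture.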

Details Motivation: Existing iterative stereo matching methods perform poorly in ill-posed regions such as occluded, textureless, or repetitively textured areas, mainly because they lack the global context and geometric information needed for effective iterative refinement. Method: The GREAT framework comprises Spatial Attention (SA), Matching Attention (MA), and Volume Attention (VA), which incorporate global context and geometric detail from the spatial dimension, along epipolar lines, and within the cost volume, respectively, to strengthen the features used during iteration. Result: Ranks first on the Scene Flow, KITTI 2015, and ETH3D leaderboards and second on Middlebury; integrated into IGEV-Stereo as GREAT-IGEV, it achieves the best results on several benchmarks. Conclusion: GREAT is general and effective, substantially improving iterative stereo matching in complex scenes and, in particular, matching accuracy in ill-posed regions. Abstract: Stereo matching achieves significant progress with iterative algorithms like RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless areas, or repetitive patterns, due to a lack of global context and geometric information for effective iterative refinement. To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods and validate it through extensive experiments, collectively denoted as GREAT-Stereo. This framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, among all published methods, our GREAT-IGEV ranks first on the Scene Flow test set, KITTI 2015, and ETH3D leaderboards, and ranks second on the Middlebury benchmark. Code is available at https://github.com/JarvisLee0423/GREAT-Stereo.

[141] Deep Feedback Models

David Calhas,Arlindo L. Oliveira

Main category: cs.CV

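TL;DR: Proposes Deep Feedback Models (DFMs), which combine bottom-up input with high-level representations over time, solve the resulting differential equation with a recurrent neural network, and use exponential decay to stabilize convergence. Experiments show that DFMs outperform feedforward networks in noise robustness and generalization from limited data, particularly for image recognition and segmentation in low-data or high-noise settings, and that they adapt well to medical image analysis.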

Details Motivation: Conventional feedforward networks lack dynamic feedback and struggle to mimic the iterative refinement seen in biological decision making; the authors introduce feedback connections to improve robustness to noise and generalization from limited data. Method: The deep feedback model is formulated as a differential equation solved through a recurrent neural network, with exponential decay to ensure convergence; high-level representations continually correct the internal state, so the input is processed iteratively. Result: On object recognition and image segmentation, DFMs clearly outperform feedforward models under low-data and high-noise conditions, and in medical imaging applications they are robust to several types of noise. Conclusion: Feedback is essential for stable, robust, and generalizable learning, and DFMs offer an effective route to networks that more closely resemble biological neural systems. Abstract: Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom-up input with high-level representations over time. This feedback mechanism introduces dynamics into otherwise static architectures, enabling DFMs to iteratively refine their internal state and mimic aspects of biological decision making. We model this process as a differential equation solved through a recurrent neural network, stabilized via exponential decay to ensure convergence. To evaluate their effectiveness, we measure DFMs under two key conditions: robustness to noise and generalization with limited data. In both object recognition and segmentation tasks, DFMs consistently outperform their feedforward counterparts, particularly in low data or high noise regimes. In addition, DFMs translate to medical imaging settings, while being robust against various types of noise corruption. These findings highlight the importance of feedback in achieving stable, robust, and generalizable learning. Code is available at https://github.com/DCalhas/deep_feedback_models.
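The abstract does not state the equation itself, so the sketch below only illustrates the general idea of a feedback state governed by a decaying differential equation and integrated by a recurrent update. It assumes a scalar state h with dynamics dh/dt = -decay * h + tanh(w_in * x + w_fb * h) and a forward Euler step; the function name and every parameter are hypothetical, not taken from the paper.

```python
import math

def feedback_dynamics(x, steps=100, dt=0.5, decay=1.0, w_in=1.0, w_fb=0.5):
    """Integrate dh/dt = -decay * h + tanh(w_in * x + w_fb * h) with Euler steps.

    The -decay * h term pulls the state back toward zero, so with
    decay > |w_fb| each update is a contraction and h settles at a
    fixed point instead of oscillating or diverging.
    """
    h = 0.0
    trajectory = [h]
    for _ in range(steps):
        dh = -decay * h + math.tanh(w_in * x + w_fb * h)  # feedback enters via w_fb * h
        h = h + dt * dh                                   # one recurrent (Euler) step
        trajectory.append(h)
    return trajectory
```

Each loop iteration plays the role of one recurrent step: the same input x is revisited while the internal state is refined, which is the behavior the abstract contrasts with a single static feedforward pass.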

[142] Sparse Multiview Open-Vocabulary 3D Detection

Olivier Moliner,Viktor Larsson,Kalle Åström

Main category: cs.CV

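TL;DR: Proposes a training-free open-vocabulary 3D object detection method that uses pretrained 2D foundation models to perform efficient 3D detection from sparse views.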

Details Motivation: Traditional 3D object detection is restricted to a fixed set of categories and depends on large amounts of 3D data, which makes it hard to extend; the goal is open-vocabulary 3D detection from sparse-view input. Method: Pretrained 2D foundation models produce 2D detections, which are lifted to 3D, and the 3D proposals are then optimized directly for featuremetric consistency across views. Result: Strong results on standard benchmarks: on par with state-of-the-art methods in densely sampled scenes and clearly better than prior work in the sparse-view setting. Conclusion: The method needs no training, makes effective use of 2D priors, and establishes a strong baseline for open-vocabulary, sparse-view 3D detection. Abstract: The ability to interpret and comprehend a 3D scene is essential for many vision and robotics systems. In numerous applications, this involves 3D object detection, i.e., identifying the location and dimensions of objects belonging to a specific category, typically represented as bounding boxes. This has traditionally been solved by training to detect a fixed set of categories, which limits its use. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting, where only a limited number of posed RGB images are available as input. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion or requiring 3D-specific learning. By lifting 2D detections and directly optimizing 3D proposals for featuremetric consistency across views, we fully leverage the extensive training data available in 2D compared to 3D. Through standard benchmarks, we demonstrate that this simple pipeline establishes a powerful baseline, performing competitively with state-of-the-art techniques in densely sampled scenarios while significantly outperforming them in the sparse-view setting.

[143] PAN: Pillars-Attention-Based Network for 3D Object Detection

Ruan Bispo,Dane Mitrev,Letizia Mariotti,Clément Botty,Denver Humphrey,Anthony Scanlan,Ciarán Eising

Main category: cs.CV

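TL;DR: Proposes a new, efficient camera-radar fusion algorithm for 3D object detection in the bird's-eye view. It exploits the range and velocity information of radar point clouds, models dependencies between radar points with self-attention, and cuts inference time with a simplified convolutional layer, setting new accuracy and speed marks on nuScenes.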

Details Motivation: Few studies address camera-radar fusion, and fewer still design architectures that exploit the strengths of radar point clouds, such as accurate range and velocity information, so more efficient and robust 3D detection methods are needed. Method: A new bird's-eye-view camera-radar fusion algorithm introduces a backbone that maps radar pillar features into an embedding dimension, uses self-attention to model dependencies between radar points, and replaces the FPN-based structure with a simplified convolutional layer to reduce inference time. Result: On nuScenes, the method reaches an NDS of 58.2 with ResNet-50, a new state of the art for 3D object detection, while also setting a new inference-time benchmark within its category. Conclusion: The method draws out the advantages of the radar modality and, through its structural changes, improves efficiency markedly while keeping accuracy high, offering a strong option for real-time, low-cost autonomous driving perception. Abstract: Camera-radar fusion offers a robust and low-cost alternative to Camera-lidar fusion for the 3D object detection task in real-time under adverse weather and lighting conditions. However, few works in the literature focus on this modality and, most importantly, on developing new architectures that exploit the advantages of the radar point cloud, such as accurate distance estimation and speed information. Therefore, this work presents a novel and efficient 3D object detection algorithm using cameras and radars in the bird's-eye-view (BEV). Our algorithm exploits the advantages of radar before fusing the features into a detection head. A new backbone is introduced, which maps the radar pillar features into an embedded dimension. A self-attention mechanism allows the backbone to model the dependencies between the radar points. We use a simplified convolutional layer to replace the FPN-based convolutional layers used in the PointPillars-based architectures, with the main goal of reducing inference time. Our results show that with this modification, our approach achieves a new state of the art in 3D object detection, reaching an NDS of 58.2 with ResNet-50, while also setting a new benchmark for inference time on the nuScenes dataset for the same category.

[144] A multi-temporal multi-spectral attention-augmented deep convolution neural network with contrastive learning for crop yield prediction

Shalini Dangi,Surya Karthikeya Mullapudi,Chandravardhan Singh Raghaw,Shahid Shafi Dar,Mohammad Zia Ur Rehman,Nagendra Kumar

Main category: cs.CV

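TL;DR: Proposes MTMS-YieldNet, a multi-temporal multi-spectral yield prediction network that combines spectral and spatio-temporal information and uses contrastive learning to sharpen feature discrimination, outperforming existing methods for crop yield prediction on several remote sensing datasets.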

Details Motivation: Climate change affects weather, soil, and farm management, which complicates crop yield prediction, and existing methods struggle to use multi-spectral data effectively for accurate prediction. Method: MTMS-YieldNet fuses multi-temporal and multi-spectral remote sensing data and is pre-trained with contrastive learning to capture spatial-spectral patterns and spatio-temporal dependencies. Result: MAPE scores of 0.336 on Sentinel-1, 0.353 on Landsat-8, and 0.331 on Sentinel-2, clearly better than seven existing state-of-the-art methods. Conclusion: MTMS-YieldNet predicts crop yield more accurately, helping farmers make better decisions and supporting agricultural sustainability and food security. Abstract: Precise yield prediction is essential for agricultural sustainability and food security. However, climate change complicates accurate yield prediction by affecting major factors such as weather conditions, soil fertility, and farm management systems. Advances in technology have played an essential role in overcoming these challenges by leveraging satellite monitoring and data analysis for precise yield estimation. Current methods rely on spatio-temporal data for predicting crop yield, but they often struggle with multi-spectral data, which is crucial for evaluating crop health and growth patterns. To resolve this challenge, we propose a novel Multi-Temporal Multi-Spectral Yield Prediction Network, MTMS-YieldNet, that integrates spectral data with spatio-temporal information to effectively capture the correlations and dependencies between them. Unlike existing methods that rely on pre-trained models trained on general visual data, MTMS-YieldNet utilizes contrastive learning for feature discrimination during pre-training, focusing on capturing spatial-spectral patterns and spatio-temporal dependencies from remote sensing data. Both quantitative and qualitative assessments highlight the excellence of the proposed MTMS-YieldNet over seven existing state-of-the-art methods. MTMS-YieldNet achieves MAPE scores of 0.336 on Sentinel-1, 0.353 on Landsat-8, and an outstanding 0.331 on Sentinel-2, demonstrating effective yield prediction performance across diverse climatic and seasonal conditions. The outstanding performance of MTMS-YieldNet improves yield predictions and provides valuable insights that can assist farmers in making better decisions, potentially improving crop yields.
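For readers unfamiliar with the reported metric, MAPE (mean absolute percentage error) averages the absolute error relative to the true value. The sketch below uses the standard definition and returns a fraction rather than a percentage; whether the paper's scores are expressed in exactly this way is an assumption, and the function name is illustrative.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%).

    Assumes no true value is zero, since the error is divided by it.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return total / len(y_true)
```

For example, predicting 90 and 220 for true yields of 100 and 200 is off by 10% in both cases, so the MAPE is 0.1.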

[145] Shedding Light on Depth: Explainability Assessment in Monocular Depth Estimation

Lorenzo Cirillo,Claudio Schiavella,Lorenzo Papa,Paolo Russo,Irene Amerini

Main category: cs.CV

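TL;DR: Studies the explainability of monocular depth estimation (MDE) models, evaluating the feature attribution methods Saliency Maps, Integrated Gradients, and Attention Rollout on lightweight and deep MDE models, and proposes a new metric, Attribution Fidelity, to measure how reliable the visual explanations are.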

Details Motivation: Although MDE is widely deployed in practice, its explainability is understudied; there is a need to analyze how MDE models produce depth maps from input images and to assess how well existing explanation methods work. Method: Saliency Maps, Integrated Gradients, and Attention Rollout are applied to METER (a lightweight network) and PixelFormer (a deep network); explanation quality is analyzed by selectively perturbing the most and least relevant pixels, and the Attribution Fidelity metric is proposed to assess how consistent the attributions are with the predicted depth map. Result: Saliency Maps work well on the lightweight model and Integrated Gradients work better on the deep model; Attribution Fidelity identifies unreliable explanations that conventional metrics may misjudge. Conclusion: The work provides a systematic way to analyze the explainability of MDE models, and the new metric helps assess the trustworthiness of visual explanations more accurately. Abstract: Explainable artificial intelligence is increasingly employed to understand the decision-making process of deep learning models and create trustworthiness in their adoption. However, the explainability of Monocular Depth Estimation (MDE) remains largely unexplored despite its wide deployment in real-world applications. In this work, we study how to analyze the way MDE networks map the input image to the predicted depth map. In more detail, we investigate well-established feature attribution methods, Saliency Maps, Integrated Gradients, and Attention Rollout, on MDE models of different computational complexity: METER, a lightweight network, and PixelFormer, a deep network. We assess the quality of the generated visual explanations by selectively perturbing the most relevant and irrelevant pixels, as identified by the explainability methods, and analyzing the impact of these perturbations on the model's output. Moreover, since existing evaluation metrics can have some limitations in measuring the validity of visual explanations for MDE, we additionally introduce the Attribution Fidelity. This metric evaluates the reliability of the feature attribution by assessing their consistency with the predicted depth map. Experimental results demonstrate that Saliency Maps and Integrated Gradients have good performance in highlighting the most important input features for MDE lightweight and deep models, respectively. Furthermore, we show that Attribution Fidelity effectively identifies whether an explainability method fails to produce reliable visual maps, even in scenarios where conventional metrics might suggest satisfactory results.

[146] CoPAD: Multi-source Trajectory Fusion and Cooperative Trajectory Prediction with Anchor-oriented Decoder in V2X Scenarios

Kangyu Wu,Jiaqi Qiao,Ya Zhang

Main category: cs.CV

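TL;DR: Proposes CoPAD, a lightweight cooperative trajectory prediction framework that performs early fusion of multi-source trajectory data using the Hungarian algorithm and Kalman filtering, and adds a PTA module, a mode attention module, and an anchor-based decoder, substantially improving the completeness and accuracy of trajectory prediction for autonomous driving.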

Details Motivation: The instability of single-vehicle perception limits trajectory prediction, so vehicle-infrastructure cooperative information is needed to make prediction more robust and accurate. Method: The CoPAD framework fuses data with the Hungarian algorithm and Kalman filtering, uses a PTA module to capture interactions among historical trajectories, adds mode attention to diversify predictions, and generates complete trajectories with a sparse-anchor-based AoD decoder. Result: State-of-the-art performance on the DAIR-V2X-Seq dataset, validating the model for cooperative trajectory prediction in V2X scenarios. Conclusion: Through effective multi-source data fusion and attention mechanisms, CoPAD achieves trajectory prediction with high completeness and accuracy in vehicle-infrastructure cooperative settings and shows good application potential. Abstract: Recently, data-driven trajectory prediction methods have achieved remarkable results, significantly advancing the development of autonomous driving. However, the instability of single-vehicle perception introduces certain limitations to trajectory prediction. In this paper, a novel lightweight framework for cooperative trajectory prediction, CoPAD, is proposed. This framework incorporates a fusion module based on the Hungarian algorithm and Kalman filtering, along with the Past Time Attention (PTA) module, mode attention module and anchor-oriented decoder (AoD). It effectively performs early fusion on multi-source trajectory data from vehicles and road infrastructure, yielding trajectories with high completeness and accuracy. The PTA module can efficiently capture potential interaction information among historical trajectories, and the mode attention module is proposed to enrich the diversity of predictions. Additionally, the decoder based on sparse anchors is designed to generate the final complete trajectories. Extensive experiments show that CoPAD achieves state-of-the-art performance on the DAIR-V2X-Seq dataset, validating the effectiveness of the model in cooperative trajectory prediction in V2X scenarios.
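The fusion step pairs trajectories of the same agent seen by different sources and then merges the paired observations. The sketch below illustrates that idea under loud assumptions: the association is done by brute force over permutations, which returns the same optimal assignment the Hungarian algorithm would but only scales to a handful of tracks, and the merge is a scalar Kalman measurement update. Both function names and the use of Euclidean distance as the matching cost are hypothetical, not details from the paper.

```python
import itertools
import math

def associate(tracks_a, tracks_b):
    """Return the permutation p that minimizes total distance when
    tracks_a[i] is paired with tracks_b[p[i]].

    Brute-force stand-in for the Hungarian algorithm; assumes both lists
    have the same length and contain (x, y) positions.
    """
    best_cost, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(tracks_b))):
        cost = sum(math.dist(tracks_a[i], tracks_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm

def kalman_update(x, p, z, r):
    """Scalar Kalman measurement update.

    x, p: current state estimate and its variance.
    z, r: new measurement and its variance.
    Returns the fused estimate and its reduced variance.
    """
    k = p / (p + r)                      # Kalman gain: how much to trust z
    return x + k * (z - x), (1.0 - k) * p
```

A typical use would be to call `associate` on the vehicle-side and infrastructure-side positions at one time step, then run `kalman_update` on each matched pair so that the fused track weighs each source by its uncertainty.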

[147] Towards Sharper Object Boundaries in Self-Supervised Depth Estimation

Aurélien Cecille,Stefan Duffner,Franck Davoine,Rémi Agier,Thibault Neveu

Main category: cs.CV

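TL;DR: Proposes a self-supervised monocular depth estimation method that models per-pixel depth as a mixture distribution, markedly improving depth sharpness at object boundaries and point cloud quality.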

Details Motivation: Existing methods often blur depth at object boundaries, introducing spurious intermediate 3D points that reduce the accuracy of 3D scene understanding. Method: Each pixel's depth is modeled as a mixture distribution that captures several plausible depth values, moving uncertainty from direct regression to the mixture weights, combined with variance-aware loss functions and uncertainty propagation. Result: On KITTI and VKITTIv2, boundary sharpness improves by up to 35% and point cloud quality exceeds that of state-of-the-art methods. Conclusion: The method obtains crisp depth discontinuities using self-supervision alone, clearly improving boundary detail while preserving overall depth accuracy. Abstract: Accurate monocular depth estimation is crucial for 3D scene understanding, but existing methods often blur depth at object boundaries, introducing spurious intermediate 3D points. While achieving sharp edges usually requires very fine-grained supervision, our method produces crisp depth discontinuities using only self-supervision. Specifically, we model per-pixel depth as a mixture distribution, capturing multiple plausible depths and shifting uncertainty from direct regression to the mixture weights. This formulation integrates seamlessly into existing pipelines via variance-aware loss functions and uncertainty propagation. Extensive evaluations on KITTI and VKITTIv2 show that our method achieves up to 35% higher boundary sharpness and improves point cloud quality compared to state-of-the-art baselines.
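A small numeric sketch can show why a mixture helps at boundaries. It assumes, for illustration only, that a boundary pixel has two candidate depths (foreground and background) and that the final depth is read from the component with the largest weight; the abstract does not say how the authors read out the final depth, and the function name is hypothetical.

```python
def mixture_depth(weights, means):
    """Summarize a per-pixel depth mixture in two ways.

    Returns (expected_depth, most_likely_depth):
      expected_depth    - weighted average, which is what plain regression
                          tends toward and which can fall between surfaces
      most_likely_depth - mean of the highest-weight component, which always
                          lies on one of the candidate surfaces
    """
    total = sum(weights)
    norm = [w / total for w in weights]          # make the weights sum to 1
    expected = sum(w * m for w, m in zip(norm, means))
    best = max(range(len(norm)), key=lambda k: norm[k])
    return expected, means[best]
```

With weights 0.55 and 0.45 on depths of 2 m and 10 m, the weighted average is about 5.6 m, a point floating between the two surfaces, while the highest-weight component gives 2 m and keeps the edge sharp.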

[148] DAFTED: Decoupled Asymmetric Fusion of Tabular and Echocardiographic Data for Cardiac Hypertension Diagnosis

Jérémie Stym-Popper,Nathan Painchaud,Clément Rambour,Pierre-Yves Courand,Nicolas Thome,Olivier Bernard

Main category: cs.CV

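TL;DR: Proposes an asymmetric fusion strategy that disentangles shared and modality-specific information to improve multimodal medical diagnosis; validated on echocardiographic time series and tabular data from 239 patients, it achieves an AUC above 90%.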

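Details Motivation: To improve multimodal data fusion for medical diagnosis, addressing the failure of existing methods to separate shared information from modality-specific information during fusion. Method: An asymmetric fusion strategy starts from a primary modality and integrates secondary modalities, using a disentangling mechanism to separate shared and modality-specific information for more effective fusion. Result: On a dataset of 239 patients, the model achieves an AUC above 90%, outperforming existing methods. Conclusion: The asymmetric fusion strategy clearly improves diagnostic performance and reaches a key benchmark for clinical use, indicating high clinical value. Abstract: Multimodal data fusion is a key approach for enhancing diagnosis in medical applications. We propose an asymmetric fusion strategy starting from a primary modality and integrating secondary modalities by disentangling shared and modality-specific information. Validated on a dataset of 239 patients with echocardiographic time series and tabular records, our model outperforms existing methods, achieving an AUC over 90%. This improvement marks a crucial benchmark for clinical use.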

[149] Towards Robust Visual Continual Learning with Multi-Prototype Supervision

Xiwei Liu,Yulong Li,Yichen Li,Xinlin Zhuang,Haolin Yang,Huifa Li,Imran Razzak

Main category: cs.CV

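TL;DR: Proposes MuproCL, a framework that uses multiple context-aware prototypes and a lightweight LLM agent to address semantic ambiguity and intra-class visual diversity in language-guided visual continual learning.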

Details Motivation: Existing language-guided continual learning methods rely on a single semantic target, which causes semantic ambiguity and makes intra-class visual diversity hard to model. Method: A lightweight LLM agent performs category disambiguation and visual-modal expansion to generate multiple context-aware semantic prototypes, and a LogSumExp mechanism lets the vision model align adaptively with them. Result: Experiments across a variety of continual learning baselines show that MuproCL consistently improves performance and robustness. Conclusion: MuproCL offers a more effective approach to language-guided continual learning, overcoming the limitations of a single prototype. Abstract: Language-guided supervision, which utilizes a frozen semantic target from a Pretrained Language Model (PLM), has emerged as a promising paradigm for visual Continual Learning (CL). However, relying on a single target introduces two critical limitations: 1) semantic ambiguity, where a polysemous category name results in conflicting visual representations, and 2) intra-class visual diversity, where a single prototype fails to capture the rich variety of visual appearances within a class. To this end, we propose MuproCL, a novel framework that replaces the single target with multiple, context-aware prototypes. Specifically, we employ a lightweight LLM agent to perform category disambiguation and visual-modal expansion to generate a robust set of semantic prototypes. A LogSumExp aggregation mechanism allows the vision model to adaptively align with the most relevant prototype for a given image. Extensive experiments across various CL baselines demonstrate that MuproCL consistently enhances performance and robustness, establishing a more effective path for language-guided continual learning.
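LogSumExp acts as a smooth maximum, which is one way to read "adaptively align with the most relevant prototype". The sketch below assumes the image embedding has already been compared with each prototype of a class, giving one similarity per prototype, and aggregates them as (1/beta) * log(sum(exp(beta * s_k))). The function name, the sharpness parameter beta, and its default are illustrative assumptions rather than details from the paper.

```python
import math

def logsumexp_score(similarities, beta=10.0):
    """Smooth maximum of prototype similarities.

    Computes (1 / beta) * log(sum(exp(beta * s))) in a numerically stable
    way by factoring out the largest similarity. Larger beta makes the
    result closer to max(similarities); the result never falls below the
    maximum and never exceeds it by more than log(K) / beta for K prototypes.
    """
    m = max(similarities)
    return m + math.log(sum(math.exp(beta * (s - m)) for s in similarities)) / beta
```

Because the gradient of this score concentrates on the best-matching prototype, an image of one sense of a polysemous class is pulled mainly toward the prototype for that sense rather than toward an average of all of them.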

[150] DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching

Meng Yang,Fan Fan,Zizhuo Li,Songchu Deng,Yong Ma,Jiayi Ma

Main category: cs.CV

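TL;DR: Proposes DistillMatch, a multimodal image matching method based on knowledge distillation from vision foundation models. It extracts high-level semantic features, fuses modality-specific information, and uses V2I-GAN to improve generalization, outperforming existing methods on public datasets.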

Details Motivation: Existing deep learning methods perform poorly on multimodal image matching because high-quality annotated data is scarce, and they adapt badly to different scenarios. Vision Foundation Models (VFMs) generalize strongly across modalities but have not yet been used effectively for multimodal matching. Method: DistillMatch uses knowledge distillation to build a lightweight student model that extracts high-level semantic features from a VFM (such as DINOv2 and DINOv3), injects modality category information to preserve modality-specific cues, and introduces V2I-GAN to translate visible images into pseudo-infrared images for greater data diversity. Result: Experiments on several public datasets show that DistillMatch surpasses existing multimodal image matching algorithms in matching accuracy and generalization. Conclusion: Knowledge distillation from vision foundation models is an effective way to improve multimodal image matching, and combining it with modality information fusion and data augmentation clearly improves robustness and adaptability. Abstract: Multimodal image matching seeks pixel-level correspondences between images of different modalities, crucial for cross-modal perception, fusion and analysis. However, the significant appearance differences between modalities make this task challenging. Due to the scarcity of high-quality annotated datasets, existing deep learning methods that extract modality-common features for matching perform poorly and lack adaptability to diverse scenarios. Vision Foundation Model (VFM), trained on large-scale data, yields generalizable and robust feature representations adapted to data and tasks of various modalities, including multimodal matching. Thus, we propose DistillMatch, a multimodal image matching method using knowledge distillation from VFM. DistillMatch employs knowledge distillation to build a lightweight student model that extracts high-level semantic features from VFM (including DINOv2 and DINOv3) to assist matching across modalities. To retain modality-specific information, it extracts and injects modality category information into the other modality's features, which enhances the model's understanding of cross-modal correlations. Furthermore, we design V2I-GAN to boost the model's generalization by translating visible to pseudo-infrared images for data augmentation. Experiments show that DistillMatch outperforms existing algorithms on public datasets.

[151] Generalized Deep Multi-view Clustering via Causal Learning with Partially Aligned Cross-view Correspondence

Xihong Yang,Siwei Wang,Jiaqi Jin,Fangdi Wang,Tianrui Liu,Yueming Jin,Xinwang Liu,En Zhu,Kunlun He

Main category: cs.CV

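TL;DR: Proposes CauMVC, a causal-learning-based multi-view clustering network that is the first to cast multi-view clustering with partially aligned data as a causal intervention problem. It learns invariant features with a variational auto-encoder and captures sample correlations with a contrastive regularizer, showing strong generalization and effectiveness on both fully and partially aligned data.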

Details Motivation: Existing MVC methods assume that samples are fully aligned across views, but in practice often only part of the data is aligned, which degrades performance. The paper targets this performance drop when moving from fully to partially aligned data and proposes a more general multi-view clustering framework. Method: Using causal modeling, partially aligned data is treated as an intervention and multi-view clustering as post-intervention inference; a variational auto-encoder is designed for causal learning, with an encoder that estimates invariant features and a decoder that performs post-intervention inference, plus a contrastive regularizer that captures correlations between samples. Result: Experiments on fully and partially aligned datasets show that CauMVC generalizes better and clusters more effectively than existing methods, supporting its effectiveness and robustness. Conclusion: CauMVC is the first work to handle generalized multi-view clustering through causal learning; it copes with changes in the degree of data alignment and offers a new modeling paradigm for multi-view clustering. Abstract: Multi-view clustering (MVC) aims to explore the common clustering structure across multiple views. Many existing MVC methods heavily rely on the assumption of view consistency, where alignments for corresponding samples across different views are ordered in advance. However, real-world scenarios often present a challenge as only partial data is consistently aligned across different views, restricting the overall clustering performance. In this work, we consider the drop in model performance caused by data order shift (i.e., from fully to partially aligned) as a generalized multi-view clustering problem. To tackle this problem, we design a causal multi-view clustering network, termed CauMVC. We adopt a causal modeling approach to understand the multi-view clustering procedure. To be specific, we formulate the partially aligned data as an intervention and multi-view clustering with partially aligned data as a post-intervention inference. However, obtaining invariant features directly can be challenging. Thus, we design a Variational Auto-Encoder for causal learning by incorporating an encoder from existing information to estimate the invariant features. Moreover, a decoder is designed to perform the post-intervention inference. Lastly, we design a contrastive regularizer to capture sample correlations. To the best of our knowledge, this paper is the first work to deal with generalized multi-view clustering via causal learning. Empirical experiments on both fully and partially aligned data illustrate the strong generalization and effectiveness of CauMVC.

[152] GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition

Tianyue Wang,Shuang Yang,Shiguang Shan,Xilin Chen

Main category: cs.CV

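TL;DR: Proposes GLip, a global-local integrated progressive framework that improves the robustness of visual speech recognition (VSR) in complex real-world conditions.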

Details Motivation: Existing VSR methods pay too little attention to practical visual challenges such as illumination changes, occlusion, blur, and pose variation, so more robust models are needed. Method: GLip uses a dual-path feature extraction architecture that integrates global and local features within a two-stage progressive learning framework; the first stage uses audio-visual data to establish a coarse alignment, and the second stage refines the representations across space and time with a Contextual Enhancement Module (CEM). Result: GLip outperforms existing methods on the LRS2 and LRS3 benchmarks, and its effectiveness is further validated on a newly introduced, challenging Mandarin dataset. Conclusion: Through progressive learning and the use of discriminative local regions, GLip clearly improves VSR performance and robustness under difficult visual conditions. Abstract: Visual speech recognition (VSR), also known as lip reading, is the task of recognizing speech from silent video. Despite significant advancements in VSR over recent decades, most existing methods pay limited attention to real-world visual challenges such as illumination variations, occlusions, blurring, and pose changes. To address these challenges, we propose GLip, a Global-Local Integrated Progressive framework designed for robust VSR. GLip is built upon two key insights: (i) learning an initial coarse alignment between visual features across varying conditions and corresponding speech content facilitates the subsequent learning of precise visual-to-speech mappings in challenging environments; (ii) under adverse conditions, certain local regions (e.g., non-occluded areas) often exhibit more discriminative cues for lip reading than global features. To this end, GLip introduces a dual-path feature extraction architecture that integrates both global and local features within a two-stage progressive learning framework. In the first stage, the model learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data, establishing a coarse yet semantically robust foundation. In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context across both spatial and temporal dimensions, refining the coarse representations into precise visual-speech mappings. Our framework uniquely exploits discriminative local regions through a progressive learning strategy, demonstrating enhanced robustness against various visual challenges and consistently outperforming existing methods on the LRS2 and LRS3 benchmarks. We further validate its effectiveness on a newly introduced challenging Mandarin dataset.

[153] Graph-based Point Cloud Surface Reconstruction using B-Splines

Stuti Pathak,Rhys G. Evans,Gunther Steenackers,Rudi Penne

Main category: cs.CV

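TL;DR: Proposes a dictionary-guided graph convolutional network method that reconstructs smooth surfaces from noisy point cloud data without relying on normals, predicting both the locations and the number of control points.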

Details Motivation: Existing data-driven surface reconstruction methods depend heavily on ground truth normals or on normals estimated as an intermediate step, which makes them sensitive to noise and unreliable when ground truth training data is unavailable; traditional B-spline methods, meanwhile, use a fixed number of control points and struggle to fit complex surfaces. Method: A dictionary-guided graph convolutional network (GCN) learns to predict the number and locations of B-spline control points dynamically, giving robust surface reconstruction from noisy point clouds without any point normals. Result: Across several commonly used evaluation metrics, the method outperforms a range of classic and recent baselines both qualitatively and quantitatively. Conclusion: The method handles surface reconstruction from noisy point clouds effectively, producing high-quality, smooth B-spline surfaces through adaptive control point prediction without normal input, with greater robustness and expressive power. Abstract: Generating continuous surfaces from discrete point cloud data is a fundamental task in several 3D vision applications. Real-world point clouds are inherently noisy due to various technical and environmental factors. Existing data-driven surface reconstruction algorithms rely heavily on ground truth normals or compute approximate normals as an intermediate step. This dependency makes them extremely unreliable for noisy point cloud datasets, even if the availability of ground truth training data is ensured, which is not always the case. B-spline reconstruction techniques provide compact surface representations of point clouds and are especially known for their smoothing properties. However, the complexity of the surfaces approximated using B-splines is directly influenced by the number and location of the spline control points. Existing spline-based modeling methods predict the locations of a fixed number of control points for a given point cloud, which makes it very difficult to match the complexity of its underlying surface. In this work, we develop a Dictionary-Guided Graph Convolutional Network-based surface reconstruction strategy where we simultaneously predict both the location and the number of control points for noisy point cloud data to generate smooth surfaces without the use of any point normals. We compare our reconstruction method with several well-known as well as recent baselines by employing widely-used evaluation metrics, and demonstrate that our method outperforms all of them both qualitatively and quantitatively.

[154] Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model

Jihua Peng,Qianxiong Xu,Yichen Liu,Chenxi Liu,Cheng Long,Rui Zhao,Ziyue Li

Main category: cs.CV

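TL;DR: Proposes LIR-GAD, a language-instructed reasoning framework for group activity detection (GAD) built on a multimodal large language model (MLLM); it introduces activity-level and group-specific tokens and a multi-label classification loss, and combines visual features with language instructions to significantly improve detection performance.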

Details Motivation: Existing deep learning-based GAD methods rely on implicit pattern recognition from visual features and fall short in contextual reasoning and explainability, so external knowledge and language guidance are needed to strengthen semantic understanding and reasoning. Method: The LIR-GAD framework extends the MLLM vocabulary with newly introduced tokens and feeds them to the model together with language instructions and video frames; a multi-label classification loss strengthens semantic learning, and a Multimodal Dual-Alignment Fusion (MDAF) module fuses the MLLM hidden states with visual features. Result: In both quantitative and qualitative experiments, LIR-GAD shows superior performance on GAD, clearly surpassing existing methods and confirming the value of language-instructed reasoning. Conclusion: By bringing in language instructions and pretrained commonsense knowledge, LIR-GAD improves both the accuracy and the explainability of group activity detection, showing the potential of multimodal large language models for video understanding. Abstract: Group activity detection (GAD) aims to simultaneously identify group members and categorize their collective activities within video sequences. Existing deep learning-based methods develop specialized architectures (e.g., transformer networks) to model the dynamics of individual roles and semantic dependencies between individuals and groups. However, they rely solely on implicit pattern recognition from visual features and struggle with contextual reasoning and explainability. In this work, we propose LIR-GAD, a novel framework of language-instructed reasoning for GAD via Multimodal Large Language Model (MLLM). Our approach expands the original vocabulary of the MLLM by introducing an activity-level token and multiple cluster-specific tokens. We process video frames alongside the two specially designed kinds of tokens and language instructions, which are then integrated into the MLLM. The pretrained commonsense knowledge embedded in the MLLM enables the activity-level token and the cluster-specific tokens to effectively capture the semantic information of collective activities and learn distinct representational features of different groups, respectively. Also, we introduce a multi-label classification loss to further enhance the token's ability to learn discriminative semantic representations. Then, we design a Multimodal Dual-Alignment Fusion (MDAF) module that integrates the MLLM's hidden embeddings corresponding to the designed tokens with visual features, significantly enhancing the performance of GAD. Both quantitative and qualitative experiments demonstrate the superior performance of our proposed method in GAD tasks.

[155] See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model

Pengteng Li,Pinhao Song,Wuyang Li,Weiyu Guo,Huizai Yao,Yijie Xu,Dugang Liu,Hui Xiong

Main category: cs.CV

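TL;DR: Proposes SEE&TREK, the first training-free prompting framework that improves the spatial understanding of multimodal large language models under vision-only conditions by increasing visual diversity and reconstructing motion.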

Details Motivation: Existing methods mostly rely on extra modalities such as depth or point clouds to improve spatial reasoning, while vision-only spatial understanding remains underexplored. Method: Maximum Semantic Richness Sampling selects semantically rich keyframes, and simulated visual trajectories encode relative spatial positions so that spatial relations and temporal coherence are preserved. Result: Experiments on VSI-BENCH and STI-BENCH show that the method clearly improves a range of MLLMs on spatial reasoning tasks, with gains of up to 3.5%, and requires neither training nor GPU resources. Conclusion: SEE&TREK offers an efficient, plug-and-play route to stronger vision-only spatial intelligence in multimodal large models. Abstract: We introduce SEE&TREK, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. While prior efforts have incorporated modalities like depth or point clouds to improve spatial reasoning, purely visual spatial understanding remains underexplored. SEE&TREK addresses this gap by focusing on two core principles: increasing visual diversity and motion reconstruction. For visual diversity, we conduct Maximum Semantic Richness Sampling, which employs an off-the-shelf perception model to extract semantically rich keyframes that capture scene structure. For motion reconstruction, we simulate visual trajectories and encode relative spatial positions into keyframes to preserve both spatial relations and temporal coherence. Our method is training- and GPU-free, requiring only a single forward pass, and can be seamlessly integrated into existing MLLMs. Extensive experiments on VSI-BENCH and STI-BENCH show that SEE&TREK consistently boosts the performance of various MLLMs across diverse spatial reasoning tasks, with improvements of up to +3.5%, offering a promising path toward stronger spatial intelligence.

[156] Blind-Spot Guided Diffusion for Self-supervised Real-World Denoising

Shen Cheng,Haipeng Li,Haibin Huang,Xiaohong Liu,Shuaicheng Liu

Main category: cs.CV

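TL;DR: Proposes Blind-Spot Guided Diffusion, a self-supervised image denoising framework built on a dual-branch diffusion model that denoises real-world images effectively without paired data.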

Details Motivation: Addresses the limited local-detail preservation of blind-spot networks and the difficulty of adapting diffusion models to self-supervised denoising. Method: A dual-branch diffusion architecture in which one branch, based on a blind-spot network, generates semi-clean images while the other captures the noise distribution; the blind-spot branch guides the sampling process to preserve detail and model noise structure. Result: State-of-the-art performance on the SIDD and DND datasets. Conclusion: The method is an effective self-supervised solution for real-world image denoising that outperforms existing approaches. Abstract: In this work, we present Blind-Spot Guided Diffusion, a novel self-supervised framework for real-world image denoising. Our approach addresses two major challenges: the limitations of blind-spot networks (BSNs), which often sacrifice local detail and introduce pixel discontinuities due to spatial independence assumptions, and the difficulty of adapting diffusion models to self-supervised denoising. We propose a dual-branch diffusion framework that combines a BSN-based diffusion branch, generating semi-clean images, with a conventional diffusion branch that captures underlying noise distributions. To enable effective training without paired data, we use the BSN-based branch to guide the sampling process, capturing noise structure while preserving local details. Extensive experiments on the SIDD and DND datasets demonstrate state-of-the-art performance, establishing our method as a highly effective self-supervised solution for real-world denoising. Code and pre-trained models are released at: https://github.com/Sumching/BSGD.

[157] AdaSports-Traj: Role- and Domain-Aware Adaptation for Multi-Agent Trajectory Modeling in Sports

Yi Xu,Yun Fu

Main category: cs.CV

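TL;DR: Proposes AdaSports-Traj, a framework for adaptive trajectory prediction in multi-agent sports scenarios that uses a role- and domain-aware adapter and hierarchical contrastive learning.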

Details Motivation: Existing unified frameworks struggle to capture the structured distribution differences across agent roles and sports domains, which limits generalization. Method: A Role- and Domain-Aware Adapter dynamically adjusts latent representations, and a hierarchical contrastive learning objective separately supervises role-sensitive and domain-aware representations to achieve disentangled learning. Result: Experiments on three datasets, Basketball-U, Football-U, and Soccer-U, confirm the method's effectiveness, with strong performance in both unified and cross-domain trajectory prediction. Conclusion: AdaSports-Traj effectively models distribution differences across roles and domains, improving the generalization of trajectory prediction in complex sports scenarios. Abstract: Trajectory prediction in multi-agent sports scenarios is inherently challenging due to the structural heterogeneity across agent roles (e.g., players vs. ball) and dynamic distribution gaps across different sports domains. Existing unified frameworks often fail to capture these structured distributional shifts, resulting in suboptimal generalization across roles and domains. We propose AdaSports-Traj, an adaptive trajectory modeling framework that explicitly addresses both intra-domain and inter-domain distribution discrepancies in sports. At its core, AdaSports-Traj incorporates a Role- and Domain-Aware Adapter to conditionally adjust latent representations based on agent identity and domain context. Additionally, we introduce a Hierarchical Contrastive Learning objective, which separately supervises role-sensitive and domain-aware representations to encourage disentangled latent structures without introducing optimization conflict. Experiments on three diverse sports datasets, Basketball-U, Football-U, and Soccer-U, demonstrate the effectiveness of our adaptive design, achieving strong performance in both unified and cross-domain trajectory prediction settings.

[158] SegDINO3D: 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features

Jinyuan Qu,Hongyang Li,Xingyu Chen,Shilong Liu,Yukai Shi,Tianhe Ren,Ruitao Jing,Lei Zhang

Main category: cs.CV

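TL;DR: Presents SegDINO3D, a new Transformer encoder-decoder framework for 3D instance segmentation that strengthens 3D representations by fusing image-level and object-level features from a pretrained 2D detection model, achieving state-of-the-art performance on ScanNetV2 and ScanNet200.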

Details Motivation: 3D training data is usually scarcer than 2D images, so rich 2D priors need to be exploited to improve 3D instance segmentation. Method: SegDINO3D takes a point cloud and its associated 2D images as input. In the encoder stage it enriches each 3D point by retrieving 2D image features from the corresponding views and applies a 3D encoder for context fusion; in the decoder stage it represents 3D queries as 3D anchor boxes and performs cross-attention from the 3D queries to the 2D object queries produced by the 2D detection model, efficiently fusing 2D semantic information. Result: State of the art on the ScanNetV2 and ScanNet200 benchmarks, surpassing prior methods by 8.7 and 6.8 mAP on the ScanNet200 validation and hidden test sets, respectively. Conclusion: By effectively fusing object-level and image-level features from a pretrained 2D model, SegDINO3D significantly improves 3D instance segmentation, with particular advantages in complex scenes. Abstract: In this paper, we present SegDINO3D, a novel Transformer encoder-decoder framework for 3D instance segmentation. As 3D training data is generally less abundant than 2D training images, SegDINO3D is designed to fully leverage 2D representation from a pre-trained 2D detection model, including both image-level and object-level features, for improving 3D representation. SegDINO3D takes both a point cloud and its associated 2D images as input. In the encoder stage, it first enriches each 3D point by retrieving 2D image features from its corresponding image views and then leverages a 3D encoder for 3D context fusion. In the decoder stage, it formulates 3D object queries as 3D anchor boxes and performs cross-attention from 3D queries to 2D object queries obtained from 2D images using the 2D detection model. These 2D object queries serve as a compact object-level representation of 2D images, effectively avoiding the challenge of keeping thousands of image feature maps in memory while faithfully preserving the knowledge of the pre-trained 2D model. The introduction of 3D box queries also enables the model to modulate cross-attention using the predicted boxes for more precise querying. SegDINO3D achieves state-of-the-art performance on the ScanNetV2 and ScanNet200 3D instance segmentation benchmarks. Notably, on the challenging ScanNet200 dataset, SegDINO3D significantly outperforms prior methods by +8.7 and +6.8 mAP on the validation and hidden test sets, respectively, demonstrating its superiority.

[159] RadarGaussianDet3D: An Efficient and Effective Gaussian-based 3D Detector with 4D Automotive Radars

Weiyi Xiong,Bing Zhu,Tao Huang,Zewei Zheng

Main category: cs.CV

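TL;DR: Proposes RadarGaussianDet3D, an efficient Gaussian-based 3D detector for 4D automotive radar that improves detection accuracy and speed through a Point Gaussian Encoder and a Box Gaussian Loss.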

Details Motivation: Existing 4D radar-based 3D detectors rely on pillar encoding, which yields sparse features; they optimize bounding-box attributes independently, which hurts accuracy; and their inference speed struggles to meet real-time requirements on vehicle-mounted embedded devices. Method: A Point Gaussian Encoder (PGE) converts radar points into Gaussian primitives and uses 3D Gaussian Splatting (3DGS) for BEV rasterization to produce denser feature maps; a Box Gaussian Loss (BGL) converts bounding boxes into 3D Gaussian distributions for unified optimization. Result: State-of-the-art detection accuracy on the TJ4DRadSet and View-of-Delft datasets, with substantially faster inference. Conclusion: By using Gaussian representations, RadarGaussianDet3D improves both feature quality and the optimization scheme for 4D radar 3D detection, combining high accuracy with low latency and suiting real-time deployment in autonomous driving. Abstract: 4D automotive radars have gained increasing attention for autonomous driving due to their low cost, robustness, and inherent velocity measurement capability. However, existing 4D radar-based 3D detectors rely heavily on pillar encoders for BEV feature extraction, where each point contributes to only a single BEV grid, resulting in sparse feature maps and degraded representation quality. In addition, they optimize bounding box attributes independently, leading to sub-optimal detection accuracy. Moreover, their inference speed, while sufficient for high-end GPUs, may fail to meet the real-time requirement on vehicle-mounted embedded devices. To overcome these limitations, an efficient and effective Gaussian-based 3D detector, namely RadarGaussianDet3D, is introduced, leveraging Gaussian primitives and distributions as intermediate representations for radar points and bounding boxes. In RadarGaussianDet3D, a novel Point Gaussian Encoder (PGE) is designed to transform each point into a Gaussian primitive after feature aggregation and to employ the 3D Gaussian Splatting (3DGS) technique for BEV rasterization, yielding denser feature maps. PGE exhibits exceptionally low latency, owing to the optimized algorithm for point feature aggregation and fast rendering of 3DGS. In addition, a new Box Gaussian Loss (BGL) is proposed, which converts bounding boxes into 3D Gaussian distributions and measures their distance to enable more comprehensive and consistent optimization. Extensive experiments on TJ4DRadSet and View-of-Delft demonstrate that RadarGaussianDet3D achieves state-of-the-art detection accuracy while delivering substantially faster inference, highlighting its potential for real-time deployment in autonomous driving.
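
The idea of turning boxes into Gaussians and measuring a distance between them can be illustrated with a small sketch. The abstract does not give the Box Gaussian Loss formula, so everything below is an assumption made for illustration: boxes are axis-aligned (yaw is ignored), half extents are used as standard deviations, and the distance is the squared 2-Wasserstein distance between diagonal Gaussians. The names `box_to_gaussian` and `wasserstein2_squared` are hypothetical.

```python
def box_to_gaussian(box):
    """Map an axis-aligned box (cx, cy, cz, length, width, height) to a diagonal Gaussian."""
    cx, cy, cz, length, width, height = box
    mean = (cx, cy, cz)
    # Assumption: half extents act as per-axis standard deviations.
    std = (length / 2.0, width / 2.0, height / 2.0)
    return mean, std


def wasserstein2_squared(box_a, box_b):
    """Squared 2-Wasserstein distance between the two diagonal Gaussians."""
    mean_a, std_a = box_to_gaussian(box_a)
    mean_b, std_b = box_to_gaussian(box_b)
    center_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    shape_term = sum((a - b) ** 2 for a, b in zip(std_a, std_b))
    return center_term + shape_term


predicted = (1.0, 0.0, 0.0, 4.0, 2.0, 2.0)
target = (0.0, 0.0, 0.0, 4.0, 2.0, 2.0)
print(wasserstein2_squared(predicted, target))  # center offset only
```

A single scalar of this kind penalizes center and size errors together, which is one way to read the abstract's contrast with optimizing each box attribute independently.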

[160] BaseReward: A Strong Baseline for Multimodal Reward Model

Yi-Fan Zhang,Haihua Yang,Huanyu Zhang,Yang Shi,Zezhou Chen,Haochen Tian,Chaoyou Fu,Haotian Wang,Kai Wu,Bo Cui,Xu Wang,Jianfei Pan,Haotian Wang,Zhang Zhang,Liang Wang

Main category: cs.CV

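TL;DR: Systematically studies the key components of building multimodal reward models (MRMs) and proposes BaseReward, a simple yet efficient baseline that reaches SOTA on several benchmarks and proves effective in a real-world reinforcement learning setting.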

Details Motivation: With the rapid progress of multimodal large language models (MLLMs), aligning them with human preferences has become a key challenge. Reward models (RMs) are the core technology for this, but a systematic guide to building high-performance multimodal reward models (MRMs) is still missing. Method: Through comprehensive experiments, the paper studies each key component of the MRM development pipeline, including reward modeling paradigms, reward head architecture, training strategies, data curation, backbone model and model scale, and ensemble methods, and builds BaseReward on these findings. Result: BaseReward sets a new SOTA on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, and improves an MLLM on perception, reasoning, and conversational tasks when used in a real reinforcement learning pipeline. Conclusion: Beyond delivering a high-performance multimodal reward model, the work gives the community a clear, empirically backed guide for building robust reward models. Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear "recipe" for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering over ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods. Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture, built upon a Qwen2.5-VL backbone, featuring an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new SOTA on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM's performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically-backed guide for developing robust reward models for the next generation of MLLMs.
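
Reward models trained on preference pairs commonly use a Bradley-Terry style pairwise loss on the scalar scores produced by the reward head. The sketch below shows that loss in plain Python. The abstract does not state which loss BaseReward uses, so applying Bradley-Terry here is an assumption, and `pairwise_preference_loss` is a hypothetical name.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Equivalent to log(1 + exp(-margin)), written to stay stable for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Scalar rewards as a reward head might emit them (values are made up).
print(pairwise_preference_loss(2.0, 0.5))  # smaller loss: preferred answer scored higher
print(pairwise_preference_loss(0.5, 2.0))  # larger loss: the ranking is inverted
```

The loss depends only on the score difference, so the reward head is free to pick any offset; only the ordering of the chosen and rejected responses is supervised.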

[161] Recovering Parametric Scenes from Very Few Time-of-Flight Pixels

Carter Sifferman,Yiquan Li,Yiming Li,Fangzhou Mu,Michael Gleicher,Mohit Gupta,Yin Li

Main category: cs.CV

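TL;DR: Proposes a method for recovering the geometry of 3D parametric scenes from very few depth measurements taken by low-cost ToF sensors; combining feed-forward prediction with differentiable rendering, it estimates scene parameters such as the 6D pose of a known object from sparse data of only about 15 pixels.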

Details Motivation: ToF sensors with low spatial but high temporal resolution cannot directly produce a clear depth map, yet the time-of-flight data they record carries rich scene information, which motivates exploring how to recover the geometry of simple parametric scenes from extremely sparse measurements. Method: The method combines feed-forward prediction with differentiable rendering in an analysis-by-synthesis framework, inferring scene parameters from a small set of distributed measurements (e.g., 15 pixels) and then refining the estimate. Result: In simulation and in controlled real-world captures, the method recovers object pose given an untextured 3D model, shows initial promise for other parametric scenes, and is accompanied by experiments that probe the limits of the system. Conclusion: The work demonstrates that distributed sensors with very low spatial and high temporal resolution can recover the geometry of simple parametric scenes, offering a practical solution for low-cost sparse sensing. Abstract: We aim to recover the geometry of 3D parametric scenes using very few depth measurements from low-cost, commercially available time-of-flight sensors. These sensors offer very low spatial resolution (i.e., a single pixel), but image a wide field-of-view per pixel and capture detailed time-of-flight data in the form of time-resolved photon counts. This time-of-flight data encodes rich scene information and thus enables recovery of simple scenes from sparse measurements. We investigate the feasibility of using a distributed set of few measurements (e.g., as few as 15 pixels) to recover the geometry of simple parametric scenes with a strong prior, such as estimating the 6D pose of a known object. To achieve this, we design a method that utilizes both feed-forward prediction to infer scene parameters, and differentiable rendering within an analysis-by-synthesis framework to refine the scene parameter estimate. We develop hardware prototypes and demonstrate that our method effectively recovers object pose given an untextured 3D model in both simulations and controlled real-world captures, and show promising initial results for other parametric scenes. We additionally conduct experiments to explore the limits and capabilities of our imaging solution.

[162] AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

Vatsal Malaviya,Agneet Chatterjee,Maitreya Patel,Yezhou Yang,Chitta Baral

Main category: cs.CV

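TL;DR: Introduces AcT2I, a benchmark for evaluating text-to-image (T2I) models on action-centric prompts, and finds that existing models capture the implicit semantics and contextual details of actions poorly. The authors propose a training-free knowledge distillation method that uses large language models to enrich prompts; adding temporal detail in particular improves generation accuracy by up to 72%, indicating that injecting linguistic knowledge improves image generation for complex scenes.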

Details Motivation: Existing T2I models struggle to express the implicit semantics and contextual relations of actions when generating complex action-centric scenes, and systematic evaluation and improvement methods are lacking. Method: The AcT2I benchmark evaluates model performance, and a training-free knowledge distillation method uses large language models to enrich input prompts along three dimensions (temporal information in particular) to improve the accuracy and contextual consistency of generated images. Result: Leading T2I models perform poorly on AcT2I; enriched prompts, especially those with temporal details, raise image generation accuracy by up to 72%. Conclusion: Current T2I models are limited on action scenes that require complex reasoning, and systematically injecting the rich semantic knowledge of language models clearly improves the semantic accuracy and contextual completeness of generated images. Abstract: Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture nuanced and often implicit attributes inherent in action depiction, producing images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free, knowledge distillation technique utilizing Large Language Models to address this limitation. Specifically, we enhance prompts by incorporating dense information across three dimensions, observing that injecting prompts with temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning and demonstrate that integrating linguistic knowledge in a systematic way can notably advance the generation of nuanced and contextually accurate images.

[163] Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models

Renjie Pi,Kehao Miao,Li Peihang,Runtao Liu,Jiahui Gao,Jipeng Zhang,Xiaofang Zhou

Main category: cs.CV

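TL;DR: Finds that multimodal large language models (MLLMs) show pronounced visual sycophancy when processing image inputs, termed the "sycophantic modality gap". To mitigate it, the paper proposes Sycophantic Reflective Tuning (SRT), which lets the MLLM use reflective reasoning to judge whether a user instruction is misleading or corrective, reducing sycophancy without causing excessive stubbornness.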

Details Motivation: MLLMs are observed to exhibit visual sycophancy under image inputs more severely than text-only LLMs, so its causes need to be understood and the problem addressed. Method: Naive supervised fine-tuning is tried first but makes the model overly stubborn; Sycophantic Reflective Tuning (SRT) is then proposed, adding a reflective reasoning step that distinguishes misleading from corrective instructions. Result: With SRT, sycophancy toward misleading instructions drops significantly, and the model no longer over-resists corrective instructions. Conclusion: SRT mitigates visual sycophancy in MLLMs along with the stubbornness trade-off it brings, improving reliability and flexibility in multimodal conversation. Abstract: Multimodal large language models (MLLMs) have demonstrated extraordinary capabilities in conducting conversations based on image inputs. However, we observe that MLLMs exhibit a pronounced form of visual sycophantic behavior. While similar behavior has also been noted in text-based large language models (LLMs), it becomes significantly more prominent when MLLMs process image inputs. We refer to this phenomenon as the "sycophantic modality gap." To better understand this issue, we further analyze the factors that contribute to the exacerbation of this gap. To mitigate the visual sycophantic behavior, we first experiment with naive supervised fine-tuning to help the MLLM resist misleading instructions from the user. However, we find that this approach also makes the MLLM overly resistant to corrective instructions (i.e., stubborn even if it is wrong). To alleviate this trade-off, we propose Sycophantic Reflective Tuning (SRT), which enables the MLLM to engage in reflective reasoning, allowing it to determine whether a user's instruction is misleading or corrective before drawing a conclusion. After applying SRT, we observe a significant reduction in sycophantic behavior toward misleading instructions, without resulting in excessive stubbornness when receiving corrective instructions.

[164] UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation

Xiaoqi Zhao,Youwei Pang,Chenyang Yu,Lihe Zhang,Huchuan Lu,Shijian Lu,Georges El Fakhri,Xiaofeng Liu

Main category: cs.CV

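TL;DR: Proposes UniMRSeg, a unified modality-relax segmentation network that uses hierarchical self-supervised compensation to counter the performance drop caused by missing modalities in multi-modal image segmentation, clearly outperforming existing methods across missing-modality scenarios.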

Details Motivation: To handle modality mismatch between training and inference, existing methods train a dedicated model for each modality combination, which makes deployment costly and hard to scale, so a unified and efficient solution is needed. Method: UniMRSeg uses a hierarchical self-supervised compensation (HSSC) mechanism comprising modality reconstruction with hybrid shuffled-masking augmentation, modality-invariant contrastive learning, a lightweight reverse attention adapter, and fine-tuning under a hybrid consistency constraint, bridging the representation gap between complete and incomplete modalities at the input, feature, and output levels. Result: On MRI-based brain tumor segmentation, RGB-D semantic segmentation, and RGB-D/T salient object segmentation, UniMRSeg clearly outperforms state-of-the-art methods under different missing-modality scenarios, with stable performance and no need to deploy a separate model per modality combination. Conclusion: Through hierarchical self-supervised compensation, UniMRSeg achieves robust segmentation for arbitrary modality combinations and lowers deployment complexity, offering an efficient, unified solution for practical multi-modal segmentation. Abstract: Multi-modal image segmentation faces real-world deployment challenges from incomplete or corrupted modalities that degrade performance. While existing methods address training-inference modality gaps via specialized per-combination models, they introduce high deployment costs by requiring exhaustive model subsets and model-modality matching. In this work, we propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature and output levels. First, we adopt modality reconstruction with the hybrid shuffled-masking augmentation, encouraging the model to learn the intrinsic modality characteristics and generate meaningful representations for missing modalities through cross-modal fusion. Next, modality-invariant contrastive learning implicitly compensates the feature space distance among incomplete-complete modality pairs. Furthermore, the proposed lightweight reverse attention adapter explicitly compensates for the weak perceptual semantics in the frozen encoder. Last, UniMRSeg is fine-tuned under the hybrid consistency constraint to ensure stable prediction under all modality combinations without large performance fluctuations. Without bells and whistles, UniMRSeg significantly outperforms the state-of-the-art methods under diverse missing modality scenarios on MRI-based brain tumor segmentation, RGB-D semantic segmentation, and RGB-D/T salient object segmentation. The code will be released at https://github.com/Xiaoqi-Zhao-DLUT/UniMRSeg.

[165] Fast OTSU Thresholding Using Bisection Method

Sai Varun Kodathala

Main category: cs.CV

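TL;DR: Proposes a bisection-based optimization of Otsu thresholding that exploits the unimodality of the between-class variance function to reduce computational complexity from O(L) to O(log L), significantly cutting computation while preserving segmentation accuracy.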

Details Motivation: The conventional Otsu algorithm is computationally inefficient because it must scan every possible threshold, which becomes a bottleneck in large-scale image processing; this work aims to improve its efficiency. Method: Exploiting the unimodality of the between-class variance function, the method replaces exhaustive search with bisection for threshold selection, reducing the number of variance computations and iterations. Result: On 48 standard images, variance computations drop by 91.63% and iterations by 97.21%; the exact threshold is matched in 66.67% of cases, 95.83% of cases deviate by no more than 5 gray levels, and logarithmic convergence is guaranteed. Conclusion: The optimization significantly improves computational efficiency without sacrificing the theoretical foundations or segmentation quality of Otsu's method, making it suitable for real-time image processing systems. Abstract: The Otsu thresholding algorithm represents a fundamental technique in image segmentation, yet its computational efficiency is severely limited by exhaustive search requirements across all possible threshold values. This work presents an optimized implementation that leverages the bisection method to exploit the unimodal characteristics of the between-class variance function. Our approach reduces the computational complexity from O(L) to O(log L) evaluations while preserving segmentation accuracy. Experimental validation on 48 standard test images demonstrates a 91.63% reduction in variance computations and 97.21% reduction in algorithmic iterations compared to conventional exhaustive search. The bisection method achieves exact threshold matches in 66.67% of test cases, with 95.83% exhibiting deviations within 5 gray levels. The algorithm maintains universal convergence within theoretical logarithmic bounds while providing deterministic performance guarantees suitable for real-time applications. This optimization addresses critical computational bottlenecks in large-scale image processing systems without compromising the theoretical foundations or segmentation quality of the original Otsu method.
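
The abstract above describes replacing the exhaustive threshold scan with a bisection over a unimodal between-class variance. A minimal Python sketch of that idea follows; it is an illustration under stated assumptions, not the author's implementation. The function names are hypothetical, the bisection compares neighbouring thresholds to decide which half contains the peak, and it assumes the variance curve is strictly unimodal (flat stretches caused by empty histogram bins can mislead it).

```python
from itertools import accumulate


def between_class_variance(hist):
    """Build f(t): between-class variance when levels <= t form class 0."""
    total = sum(hist)
    cum_count = list(accumulate(hist))
    cum_level = list(accumulate(i * h for i, h in enumerate(hist)))
    mean_total = cum_level[-1] / total

    def f(t):
        if cum_count[t] == 0 or cum_count[t] == total:
            return 0.0  # one class is empty, so there is no separation
        w0 = cum_count[t] / total
        m0 = cum_level[t] / total  # equals w0 * mean of class 0
        return (mean_total * w0 - m0) ** 2 / (w0 * (1.0 - w0))

    return f


def otsu_exhaustive(hist):
    """Classic Otsu: evaluate every threshold, O(L) variance evaluations."""
    f = between_class_variance(hist)
    return max(range(len(hist)), key=f)


def otsu_bisection(hist):
    """Assumed bisection variant: O(log L) evaluations on a unimodal curve."""
    f = between_class_variance(hist)
    lo, hi = 0, len(hist) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if f(mid) < f(mid + 1):
            lo = mid + 1  # still climbing: the peak lies to the right
        else:
            hi = mid      # flat or descending: the peak is at mid or to the left
    return lo


hist = [4, 1, 1, 4]  # toy 4-level histogram; variance curve is about [1.5, 1.69, 1.5, 0.0]
print(otsu_exhaustive(hist), otsu_bisection(hist))
```

On a curve that is not unimodal the bisection still terminates at a local maximum, which is consistent with the abstract reporting exact matches in most but not all test cases. The prefix sums are still built in O(L); the saving is in the number of variance evaluations.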