cs.CL

[1] A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences

Marcelo A. Montemurro, Mirko Degli Esposti

Main category: cs.CL

TL;DR: This paper proposes a surrogate model that preserves both the empirical frequency distribution and the long-range correlation structure (quantified by the DFA exponent) of a symbolic sequence. It works by mapping fractional Gaussian noise onto the empirical histogram and applies to symbolic systems such as language and genomic DNA.

Details

Motivation: Existing surrogate models usually preserve either the symbol frequency distribution or the long-range correlations, but not both, which limits deeper analysis of the structural properties of symbolic systems.

Method: A frequency-preserving assignment maps fractional Gaussian noise (FGN) onto the empirical histogram, generating symbolic surrogates that match the first-order statistics of the original sequence and reproduce its long-range scaling.

Result: Validated on English and Latin texts and on genomic DNA: the surrogates accurately reproduce word frequencies or base composition and the DFA scaling exponent while randomising short-range dependencies.

Conclusion: The model offers a principled tool for disentangling structural features of symbolic systems and for testing hypotheses on the origin of scaling laws and memory effects in language and DNA.

Abstract: Symbolic sequences such as written language and genomic DNA display characteristic frequency distributions and long-range correlations extending over many symbols. In language, this takes the form of Zipf's law for word frequencies together with persistent correlations spanning hundreds or thousands of tokens, while in DNA it is reflected in nucleotide composition and long-memory walks under purine-pyrimidine mappings. Existing surrogate models usually preserve either the frequency distribution or the correlation properties, but not both simultaneously. We introduce a surrogate model that retains both constraints: it preserves the empirical symbol frequencies of the original sequence and reproduces its long-range correlation structure, quantified by the detrended fluctuation analysis (DFA) exponent. Our method generates surrogates of symbolic sequences by mapping fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment. The resulting surrogates match the original in first-order statistics and long-range scaling while randomising short-range dependencies. We validate the model on representative texts in English and Latin, and illustrate its broader applicability with genomic DNA, showing that base composition and DFA scaling are reproduced. This approach provides a principled tool for disentangling structural features of symbolic systems and for testing hypotheses on the origin of scaling laws and memory effects across language, DNA, and other symbolic domains.
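Illustrative sketch (not the authors' code): one way to map FGN onto an empirical histogram while keeping symbol counts exact. The abstract does not specify the assignment rule, so the rank-to-symbol ordering below (smallest noise values receive the most frequent symbols) is an assumption, as are the helper names `fgn_cholesky` and `surrogate`. The Cholesky sampler is cubic in sequence length and only suited to short demonstrations.

```python
import numpy as np
from collections import Counter

def fgn_cholesky(n, hurst, rng):
    """Sample n points of unit-variance fractional Gaussian noise."""
    k = np.arange(n)
    # Autocovariance of FGN at lag k for Hurst exponent `hurst`.
    gamma = 0.5 * (np.abs(k + 1) ** (2 * hurst)
                   - 2 * np.abs(k) ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    return np.linalg.cholesky(cov) @ rng.standard_normal(n)

def surrogate(symbols, hurst, seed=0):
    """Rank-order map FGN onto the symbol histogram (assumed assignment rule)."""
    rng = np.random.default_rng(seed)
    noise = fgn_cholesky(len(symbols), hurst, rng)
    counts = Counter(symbols)
    # Symbols from most to least frequent, each repeated by its count.
    pool = [s for s, c in counts.most_common() for _ in range(c)]
    out = [None] * len(symbols)
    # Positions sorted by noise value receive the pooled symbols in order,
    # so every symbol appears exactly as often as in the original.
    for pos, sym in zip(np.argsort(noise), pool):
        out[pos] = sym
    return out
```

Because the pool is a permutation of the original sequence, the histogram is preserved by construction; whether the DFA exponent of the output matches a target would still need to be checked empirically.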

[2] Universal Conceptual Structure in Neural Translation: Probing NLLB-200's Multilingual Geometry

Kyle Elliott Mathewson

Main category: cs.CL

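TL;DR: Through six experiments, this paper probes whether the multilingual translation model NLLB-200 learns language-independent conceptual representations. It finds that the embedding space implicitly encodes language genealogy, cross-lingual colexification patterns, a language-neutral conceptual store, and consistent semantic relations, supporting a cognition-like universal conceptual organization.

Details

Motivation: To determine whether neural machine translation models learn language-universal conceptual representations rather than relying only on surface similarity between languages.

Method: Using NLLB-200, the Swadesh core vocabulary list (135 languages), and the CLICS colexification database, the study runs six geometric probing experiments, including correlating embedding distance with phylogenetic distance, testing the similarity of colexified pairs, per-language mean-centering, and measuring the consistency of semantic offset vectors.

Result: Embedding distances correlate significantly with phylogenetic distances (ρ = 0.13, p = 0.020); colexified concept pairs have significantly higher embedding similarity (p = 1.33e-11, d = 0.96); mean-centering improves concept separability by a factor of 1.19; semantic offset vectors between fundamental concepts are highly consistent across languages (mean cosine = 0.84).

Conclusion: NLLB-200 implicitly models the genealogical structure of human languages, universal conceptual associations, a brain-like conceptual hub, and second-order semantic relations, indicating that its representations have a cognitively interpretable, language-universal conceptual organization.

Abstract: Do neural machine translation models learn language-universal conceptual representations, or do they merely cluster languages by surface similarity? We investigate this question by probing the representation geometry of Meta's NLLB-200, a 200-language encoder-decoder Transformer, through six experiments that bridge NLP interpretability with cognitive science theories of multilingual lexical organization. Using the Swadesh core vocabulary list embedded across 135 languages, we find that the model's embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program ($\rho = 0.13$, $p = 0.020$), demonstrating that NLLB-200 has implicitly learned the genealogical structure of human languages. We show that frequently colexified concept pairs from the CLICS database exhibit significantly higher embedding similarity than non-colexified pairs ($U = 42656$, $p = 1.33 \times 10^{-11}$, $d = 0.96$), indicating that the model has internalized universal conceptual associations. Per-language mean-centering of embeddings improves the between-concept to within-concept distance ratio by a factor of 1.19, providing geometric evidence for a language-neutral conceptual store analogous to the anterior temporal lobe hub identified in bilingual neuroimaging. Semantic offset vectors between fundamental concept pairs (e.g., man to woman, big to small) show high cross-lingual consistency (mean cosine = 0.84), suggesting that second-order relational structure is preserved across typologically diverse languages. We release InterpretCognates, an open-source interactive toolkit for exploring these phenomena, alongside a fully reproducible analysis pipeline.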
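Illustrative sketch (not taken from InterpretCognates): per-language mean-centering and a between-concept to within-concept distance ratio on an embedding array. The array layout `(languages, concepts, dim)`, the function names, and the exact way the two distances are averaged are assumptions; the paper's precise definition may differ.

```python
import numpy as np

def center_per_language(emb):
    """Subtract each language's mean vector; emb has shape (languages, concepts, dim)."""
    return emb - emb.mean(axis=1, keepdims=True)

def between_within_ratio(emb):
    """Mean distance between concept centroids over mean spread of a concept across languages."""
    centroids = emb.mean(axis=0)                                # (concepts, dim)
    within = np.linalg.norm(emb - centroids, axis=-1).mean()
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    n = len(centroids)
    between = dists.sum() / (n * (n - 1))                       # skip the zero diagonal
    return between / within

# Synthetic data: shared concept vectors plus a large language-specific offset.
rng = np.random.default_rng(0)
concepts = rng.normal(size=(1, 6, 8))
offsets = 5.0 * rng.normal(size=(5, 1, 8))
emb = concepts + offsets + 0.1 * rng.normal(size=(5, 6, 8))

ratio_raw = between_within_ratio(emb)
ratio_centered = between_within_ratio(center_per_language(emb))
```

In this toy setup the language offset dominates the within-concept spread, so removing it raises the ratio; real embeddings would show a much smaller effect, such as the factor of 1.19 reported above.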

[3] Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects

Xiaoyu Luo, Wenrui Yu, Qiongxiu Li, Johannes Bjerva

Main category: cs.CL

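TL;DR: This paper systematically studies memorization in diffusion language models (DLMs). It proposes a unified probabilistic extraction framework, proves that higher sampling resolution increases the probability of reproducing training data, and shows experimentally that DLMs leak substantially less PII than autoregressive models (ARMs).

Details

Motivation: DLMs are an emerging alternative to ARMs, but their memorization behavior has not been studied in depth, whereas ARMs are known to memorize and leak training data, raising privacy and copyright concerns.

Method: A generalized probabilistic extraction framework unifies prefix-conditioned decoding and diffusion-based generation; a monotonic relationship between sampling resolution and memorization probability is derived (Theorem 4.3); experiments across model scales and sampling strategies validate the theory, and PII leakage is evaluated under aligned prefix conditions.

Result: Theory: in diffusion-based generation, the probability of exact training data extraction strictly increases with sampling resolution, and autoregressive decoding is the limiting case of diffusion-based generation at maximal resolution. Experiments: DLMs show substantially lower memorization-based PII leakage than ARMs.

Conclusion: DLMs still memorize, but to a controllable degree that is lower than in ARMs, which makes them promising for privacy-sensitive settings.

Abstract: Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored due to fundamental differences in generation dynamics. To address this gap, we present a systematic theoretical and empirical characterization of memorization in DLMs. We propose a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization: increasing resolution strictly increases the probability of exact training data extraction, implying that autoregressive decoding corresponds to a limiting case of diffusion-based generation in which the sampling resolution is maximal. Extensive experiments across model scales and sampling strategies validate our theoretical predictions. Under aligned prefix-conditioned evaluations, we further demonstrate that DLMs exhibit substantially lower memorization-based leakage of personally identifiable information (PII) compared to ARMs.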

[4] Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs

Jiangang Hao

Main category: cs.CL

TL;DR: This chapter surveys the current state of detectors for AI-generated and AI-assisted essays and empirically evaluates how well a detector trained on one LLM generalizes to essays produced by other LLMs, using texts generated from GRE writing prompts.

Details

Motivation: As large language models (LLMs) become better at quickly producing high-quality essays, the authenticity of student-submitted work is seriously threatened, and reliable, generalizable detection of AI-generated content is needed.

Method: Empirical analysis of a multi-LLM essay dataset generated from public GRE writing prompts, evaluating cross-model generalization of detectors and providing practical guidance for developing and retraining them.

Result: Detectors trained on one LLM perform markedly worse on essays generated by other LLMs, indicating limited generalization; targeted updating and retraining are needed for practical use.

Conclusion: Current AI essay detectors generalize poorly across models and should be combined with responsible-use guidelines and a strategy of continual retraining to keep pace with evolving LLMs.

Abstract: Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.

[5] RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks

Alexandra Diaconu, Mădălina Vînaga, Bogdan Alexe

Main category: cs.CL

TL;DR: RO-N3WS is a diverse Romanian speech dataset intended to improve ASR generalization in low-resource and out-of-distribution settings; experiments show that fine-tuning on a small amount of its real speech substantially lowers WER.

Details

Motivation: To address the weak generalization of Romanian automatic speech recognition (ASR) under low-resource and out-of-distribution (OOD) conditions.

Method: Build the RO-N3WS dataset with more than 126 hours of speech from multiple sources (broadcast news, audiobooks, film dialogue, and others), evaluate models such as Whisper and Wav2Vec 2.0 in zero-shot and fine-tuned settings, and run controlled comparisons with synthetic data generated by expressive TTS.

Result: Fine-tuning on only a small amount of real speech from RO-N3WS clearly outperforms zero-shot baselines, with a marked drop in WER.

Conclusion: RO-N3WS improves the robustness and generalization of Romanian ASR in low-resource and OOD scenarios; all models, scripts, and data splits will be released to support reproducible research.

Abstract: We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.
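For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between hypothesis and reference divided by the reference length. The sketch below is a minimal reference implementation, assuming whitespace tokenization and a non-empty reference; the paper's text normalization (casing, punctuation, diacritics) is not described in the abstract and is not modelled here.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)
```

A hypothesis that substitutes one word and drops another against a six-word reference scores 2/6.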

[6] GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR

Pouya Mehralian, Melissa Farasyn, Anne Breitbarth, Anne-Sophie Ghyselen, Hugo Van hamme

Main category: cs.CL

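TL;DR: This paper proposes GLoRIA, a framework that uses metadata such as geographic location to modulate low-rank updates in a pre-trained encoder, addressing the strong regional variation and scarce labeled data of dialectal speech recognition. On the GCND corpus it achieves the best word error rates while updating under 10% of parameters, and it generalizes well and is geographically interpretable.

Details

Motivation: Dialectal ASR faces both strong regional variation and scarce labeled data, and existing methods fall short in parameter efficiency, generalization, and interpretability.

Method: GLoRIA is a parameter-efficient adaptation framework: low-rank matrices are injected into each feed-forward layer of the pre-trained encoder, and a gating MLP driven by location metadata (e.g., coordinates) sets the non-negative contribution of each LoRA rank-1 component.

Result: On the GCND corpus, GLoRIA outperforms geo-conditioned full fine-tuning, standard LoRA, and both dialect-specific and unified full fine-tuning, reaching state-of-the-art word error rates while updating under 10% of parameters. It generalizes well to unseen dialects, including extrapolation scenarios, and its adaptation patterns can be visualized geographically.

Conclusion: Metadata-gated low-rank adaptation is an efficient, interpretable, and well-generalizing solution for dialectal ASR.

Abstract: Automatic Speech Recognition (ASR) in dialect-heavy settings remains challenging due to strong regional variation and limited labeled data. We propose GLoRIA, a parameter-efficient adaptation framework that leverages metadata (e.g., coordinates) to modulate low-rank updates in a pre-trained encoder. GLoRIA injects low-rank matrices into each feed-forward layer, with a gating MLP determining the non-negative contribution of each LoRA rank-1 component based on location metadata. On the GCND corpus, GLoRIA outperforms geo-conditioned full fine-tuning, LoRA, and both dialect-specific and unified full fine-tuning, achieving state-of-the-art word error rates while updating under 10% of parameters. GLoRIA also generalizes well to unseen dialects, including in extrapolation scenarios, and enables interpretable adaptation patterns that can be visualized geospatially. These results show metadata-gated low-rank adaptation is an effective, interpretable, and efficient solution for dialectal ASR.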
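Illustrative sketch (not the authors' implementation): a single linear layer with a metadata-gated low-rank update, written as W + B diag(g(meta)) A so that each rank-1 component gets its own non-negative gate. The two-layer gating MLP, the softplus used to enforce non-negativity, and all shapes and names are assumptions made for the example.

```python
import numpy as np

def softplus(x):
    """Smooth non-negative activation, log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def gated_lora_forward(x, W, A, B, meta, gate_params):
    """Compute x W^T + sum_i g_i(meta) (x . a_i) b_i with g_i >= 0.

    x: (batch, d_in), W: (d_out, d_in), A: (rank, d_in), B: (d_out, rank),
    meta: (m,) metadata vector such as coordinates.
    """
    W1, b1, W2, b2 = gate_params
    hidden = np.tanh(meta @ W1 + b1)
    gates = softplus(hidden @ W2 + b2)        # (rank,), one gate per rank-1 component
    low_rank = ((x @ A.T) * gates) @ B.T      # gate each component before projecting back
    return x @ W.T + low_rank, gates
```

Returning the gate vector alongside the output is what would make the adaptation inspectable per location, in the spirit of the geospatial visualizations mentioned in the abstract.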

[7] CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think

Junzhe Shen, Jieru Zhao, Ziwei He, Zhouhan Lin

Main category: cs.CL

TL;DR: This paper studies why continuous diffusion language models (DLMs) lag behind discrete diffusion approaches and proposes CoDAR, a framework that keeps diffusion continuous in embedding space and adds a context-conditional autoregressive decoder to improve generation quality.

Details

Motivation: Continuous DLMs have appealing continuous generative dynamics yet still underperform discrete diffusion approaches; the paper seeks the root cause and a remedy.

Method: A controlled token-recovery study identifies token rounding as the main bottleneck. CoDAR then uses two stages: continuous diffusion in embedding space, followed by a context-conditional autoregressive Transformer decoder that cross-attends to the denoised embedding sequence and produces the tokens.

Result: On LM1B and OpenWebText, CoDAR substantially outperforms latent diffusion and is competitive with strong discrete DLMs; a decoder temperature setting trades off fluency against diversity.

Conclusion: Token rounding is a key factor limiting continuous DLMs. By separating continuous diffusion from context-aware discretization, CoDAR improves generation quality and offers a controllable trade-off.

Abstract: We study why continuous diffusion language models (DLMs) have lagged behind discrete diffusion approaches despite their appealing continuous generative dynamics. Under a controlled token-recovery study, we identify token rounding, the final projection from denoised embeddings to tokens, as a primary bottleneck. Building on these insights, we propose CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder), a two-stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context-conditional discretizer: an autoregressive Transformer decoder that cross-attends to the denoised embedding sequence and performs contextualized rounding to tokens. Experiments on LM1B and OpenWebText demonstrate that CoDAR substantially improves generation quality over latent diffusion and becomes competitive with strong discrete DLMs, while exposing a simple decoder-temperature knob to navigate the fluency-diversity trade-off.
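To make "token rounding" concrete, the sketch below shows the simplest context-free version: each denoised vector is mapped independently to its nearest token embedding. This is an assumed baseline for illustration, not CoDAR itself, which replaces this step with an autoregressive decoder that sees the whole denoised sequence; the function name and Euclidean metric are choices made for the example.

```python
import numpy as np

def round_to_tokens(denoised, embedding_table):
    """Return, for each row of `denoised`, the id of the nearest token embedding.

    denoised: (seq_len, dim), embedding_table: (vocab, dim).
    """
    # Squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (
        (denoised ** 2).sum(axis=1, keepdims=True)
        - 2.0 * denoised @ embedding_table.T
        + (embedding_table ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)
```

Each position is rounded in isolation here, so a slightly off embedding can snap to a wrong token with nothing in the neighbouring positions to correct it, which is the kind of failure a context-conditional discretizer is meant to address.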

[8] How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng

Main category: cs.CL

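TL;DR: This paper introduces SteerEval, a hierarchical benchmark for evaluating the controllability of large language models (LLMs) across language features, sentiment, and personality, with three specification levels (L1 to L3), and shows that current steering methods are limited at fine-grained control.

Details

Motivation: LLMs are increasingly deployed in socially sensitive domains, but their unpredictable behaviors (such as misaligned intent and inconsistent personality) pose significant risks, so a systematic and interpretable framework for evaluating controllability is needed.

Method: Build the hierarchical benchmark SteerEval over three domains (language features, sentiment, personality), each divided into three specification levels, L1 (what to express), L2 (how to express), and L3 (how to instantiate), and use it to systematically evaluate existing steering methods.

Result: Current mainstream steering methods lose control effectiveness at the finer-grained L2 and L3 levels, indicating that fine-grained controllability remains challenging.

Conclusion: SteerEval provides a clearly structured, principled, and interpretable basis for evaluating LLM controllability and supports the development of safe, reliable, and steerable models.

Abstract: Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.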

[9] ExpGuard: LLM Content Moderation in Specialized Domains

Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak, Juyoung Oh, Jaegul Choo, Jungmin Son

Main category: cs.CL

TL;DR: This paper presents ExpGuard, a robust, specialized guardrail model for the financial, medical, and legal domains, together with ExpGuardMix, a dataset of 58,928 labeled samples. Experiments show that it outperforms existing methods such as WildGuard under domain-specific adversarial attacks.

Details

Motivation: Existing guardrail models mainly target general human-LLM interaction and struggle with harmful and adversarial content in domains rich in technical terms and specialized concepts, such as finance, medicine, and law, so robust domain-specific guardrails are needed.

Method: Propose the specialized guardrail model ExpGuard and build the ExpGuardMix dataset (the ExpGuardTrain training set and the ExpGuardTest test set annotated by domain experts), then evaluate robustness to domain-specific adversarial attacks on multiple benchmarks.

Result: ExpGuard performs strongly on ExpGuardTest and eight public benchmarks, improving on WildGuard by up to 8.9% in prompt classification and 15.3% in response classification, and is robust to technical and domain-specific content.

Conclusion: ExpGuard is the first specialized guardrail model for high-risk professional domains; its code, data, and model are open-sourced, providing a foundation and an extensible framework for more robust domain-adapted guardrails.

Abstract: With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses, from these specific sectors. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.

[10] GPUTOK: GPU Accelerated Byte Level BPE Tokenization

Venu Gopal Kadamba, Kanishkha Jaisankar

Main category: cs.CL

TL;DR: This paper presents a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules and substantially speeds up tokenization for long contexts (up to 131k tokens). It is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer, with almost no change in output quality.

Details

Motivation: As LLM context windows grow toward millions of tokens, traditional CPU tokenizers become a bottleneck because they process text serially, leaving GPU resources idle.

Method: Design and implement two GPU byte-level BPE tokenizers: a basic BlockBPE-style kernel and an optimized version that uses a cuCollections static map, CUB reductions, and a pybind11 Python interface. Nsight profiling identifies memory allocation as the performance bottleneck.

Result: On WikiText103, for the longest inputs, the tokenizer is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer; its tokens match the CPU version; on generation tasks its similarity and overlap metrics differ from the baseline tokenizers by about one percentage point or less.

Conclusion: GPU tokenization can relieve the tokenization bottleneck in long-context inference and substantially improve efficiency without sacrificing output quality; memory pooling is the next optimization.

Abstract: As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules. It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python. On WikiText103 sequences up to 131k tokens, the optimized GPU tokenizer produces the same tokens as a CPU version and, for the longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer. Nsight profiling shows that 70-80% of CUDA API time goes to memory allocation, so adding memory pooling should give the biggest speed boost next. Tests on generation tasks using WikiText103 prompts show that our GPU tokenizer's outputs stay within about one percentage point of tiktoken and HuggingFace GPT-2 on similarity and overlap metrics, meaning it keeps output quality while making long-context inference more practical.
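For orientation, the sequential merge loop that a GPU kernel has to parallelize can be sketched on the CPU as follows. This is a simplified reference, not the paper's kernel: the merge table `ranks` is a hypothetical toy, and GPT-2's regex pre-tokenization and byte-to-unicode mapping are left out.

```python
def bpe_encode(data, ranks):
    """Greedy byte-level BPE: repeatedly merge the adjacent pair with the lowest rank.

    data: bytes to encode; ranks: dict mapping (left, right) byte-string pairs to merge priority.
    """
    parts = [bytes([b]) for b in data]
    while len(parts) > 1:
        # Locate the adjacent pair with the best (lowest) merge rank.
        best = min(
            range(len(parts) - 1),
            key=lambda i: ranks.get((parts[i], parts[i + 1]), float("inf")),
        )
        pair = (parts[best], parts[best + 1])
        if pair not in ranks:
            break  # nothing left to merge
        # Merge every non-overlapping occurrence of that pair, left to right.
        merged, i = [], 0
        while i < len(parts):
            if i + 1 < len(parts) and (parts[i], parts[i + 1]) == pair:
                merged.append(parts[i] + parts[i + 1])
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return parts
```

Each pass needs a global minimum over all adjacent pairs followed by a compaction of the sequence, which is why reductions and hash-map lookups are natural building blocks for a GPU version.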

[11] Think, But Don't Overthink: Reproducing Recursive Language Models

Daren Wang

Main category: cs.CL

TL;DR: This study reproduces and extends the Recursive Language Models (RLMs) framework of Zhang et al., focusing on how recursion depth affects performance. Deeper recursion improves accuracy on complex reasoning tasks but degrades simple retrieval tasks and sharply increases compute cost.

Details

Motivation: The original paper uses only the default recursion depth of 1 and leaves deeper recursion as future work; this study systematically examines the practical effect of scaling recursion depth.

Method: On the S-NIAH and OOLONG benchmarks, compare a pure LLM, RLM (depth=1), and RLM (depth=2) using two open-source agentic models, DeepSeek v3.2 and Kimi K2.

Result: Depth-1 RLMs effectively raise accuracy on complex reasoning tasks. At depth 2, models "overthink": performance on simple retrieval tasks drops, execution time jumps from 3.6s to 344.5s, and token costs grow exponentially.

Conclusion: Deeper recursion is not always better; depth should be chosen according to task complexity to avoid wasted compute and degraded performance.

Abstract: This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework by Zhang et al. (2026). This framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: deeper recursion causes models to "overthink". While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs. Code and data are available at: https://github.com/drbillwang/rlm-reproduction

[12] Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models

Shubhangi Upasani, Ravi Shanker Raju, Bo Li, Mengmeing Ji, John Long, Chen Wu, Urmish Thakker, Guangtao Wang

Main category: cs.CL

TL;DR: This paper proposes cross-family speculative prefill: small draft models from a different model family (e.g., Qwen, LLaMA, DeepSeek) perform training-free prompt compression for a large target model. Experiments show it retains 90-100% of baseline performance while substantially reducing time to first token (TTFT).

Details

Motivation: Existing speculative prefill relies on a draft model from the same family that shares the target's tokenizer, but practical agentic systems often use heterogeneous model stacks with no suitable small in-family draft model, so prompt compression that works across model families is needed.

Method: Reuse the existing speculative prefill mechanism, which estimates token importance from attention, and pair draft and target models across families (Qwen, LLaMA, DeepSeek) for training-free prompt compression and evaluation.

Result: Cross-family prompt compression consistently retains 90-100% of full-prompt baseline performance across tasks, slightly improves accuracy on some tasks through a denoising effect, and substantially reduces TTFT, indicating that attention-based importance estimation transfers across architectures and tokenizers.

Conclusion: Speculative prefill depends mainly on task priors and semantic structure rather than on model details, making it a general prompt compression primitive and a practical, necessary optimization for heterogeneous, long-context agentic systems.

Abstract: Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost. Recent work on speculative prefill demonstrates that attention-based token importance estimation can enable training-free prompt compression, but this assumes the existence of a draft model that shares the same tokenizer as the target model. In practice, however, agentic pipelines frequently employ models without any smaller in-family draft model. In this work, we study cross-family speculative prefill, where a lightweight draft model from one model family is used to perform prompt compression for a target model from a different family. Using the same speculative prefill mechanism as prior work, we evaluate a range of cross-family draft-target combinations, including Qwen, LLaMA, and DeepSeek models. Across a broad diversity of tasks, we find that attention-based token importance estimation transfers reliably across different model families despite differences in model architectures and tokenizers between draft and target models. Cross-model prompt compression largely retains 90-100% of full-prompt baseline performance and, in some cases, slightly improves accuracy due to denoising effects, while delivering substantial reductions in time to first token (TTFT). These results suggest that speculative prefill depends mainly on task priors and semantic structure, thus serving as a generalizable prompt compression primitive. We discuss the implications of our findings for agentic systems, where repeated long-context inference and heterogeneous model stacks make cross-model prompt compression both necessary and practical.
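Illustrative sketch (assumptions throughout): the selection step of importance-based prompt compression, keeping the highest-scoring fraction of draft-model tokens in their original order. How the scores are derived from draft-model attention, and how the kept text is re-tokenized for a target model with a different tokenizer, are not specified in the abstract; here `importance` is simply an input array and `compress_prompt` is a hypothetical helper.

```python
import numpy as np

def compress_prompt(tokens, importance, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens by importance, preserving order."""
    k = max(1, int(round(len(tokens) * keep_ratio)))
    # Indices of the k largest scores, re-sorted into reading order.
    keep = np.sort(np.argsort(importance)[-k:])
    return [tokens[i] for i in keep]
```

In a cross-family setting one plausible design is to decode the kept draft tokens back to text and let the target model's own tokenizer re-encode it, so the two vocabularies never need to align.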

[13] Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches

Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi, Graham Neubig, Florian Matthes, Tatsuya Ishigaki

Main category: cs.CL

TL;DR: This paper proposes two prompting-based decoding strategies (fixed-interval and dynamic-interval) for real-time video commentary generation, producing semantically relevant and well-timed commentary without fine-tuning, and releases a multilingual benchmark dataset and models.

Details

Motivation: Existing prompting-based approaches with multimodal large language models perform well at content generation but ignore timing, which is essential for real-time commentary.

Method: Two prompting-driven decoding strategies: a fixed-interval approach, and a dynamic-interval approach that adjusts the timing of the next prediction based on the estimated duration of the previous utterance. Neither requires fine-tuning.

Result: On Japanese and English datasets of racing and fighting games, the dynamic-interval approach aligns commentary with human utterance timing and content markedly better than the fixed-interval approach.

Conclusion: In-context prompting alone can support high-quality real-time video commentary generation; dynamic timing is key to timeliness and naturalness, and the multilingual benchmark supports follow-up research.

Abstract: Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
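Illustrative sketch of the dynamic-interval idea: after each utterance, the next query time is pushed forward by that utterance's estimated duration. The callables `generate` (standing in for an MLLM call that may return an empty string for a pause) and `estimate_duration`, and the `min_gap` floor that keeps the loop advancing during pauses, are hypothetical and not taken from the paper.

```python
def schedule_dynamic(video_length, generate, estimate_duration, min_gap=0.5):
    """Return (start_time, utterance) pairs with timing driven by utterance length.

    generate(t) -> str produces commentary for the video up to time t (seconds).
    estimate_duration(utterance) -> float is the estimated speaking time in seconds.
    """
    t, out = 0.0, []
    while t < video_length:
        utterance = generate(t)
        out.append((t, utterance))
        # Wait until the previous utterance would have finished being spoken.
        t += max(min_gap, estimate_duration(utterance))
    return out
```

A fixed-interval scheduler would replace the last line of the loop with a constant step, which is what makes it prone to querying the model while the previous utterance is still being spoken.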

[14] Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

Shunki Uebayashi, Kento Masui, Kyohei Atarashi, Han Bao, Hisashi Kashima, Naoto Inoue, Mayu Otani, Koh Takeuchi

Main category: cs.CL

TL;DR: This paper proposes M3IRT, a framework for evaluating the cross-modal reasoning ability of multimodal large language models (MLLMs). It decomposes model ability and item difficulty into image, text, and cross-modal components, selects high-quality cross-modal questions, and improves benchmark reliability and efficiency.

Details

Motivation: Existing MLLM benchmarks contain many shortcut questions that can be answered from a single modality, making evaluations unreliable and computationally expensive; a method that truly measures cross-modal integration is needed.

Method: A multimodal, multidimensional item response theory framework (M3IRT) models both model ability and item difficulty as three components (image-only, text-only, and cross-modal) and estimates the parameters from the responses of 24 vision-language models on three benchmarks.

Result: M3IRT identifies and prioritizes questions that genuinely require cross-modal reasoning, preserves model ranking fidelity even when 50% of items are artificially injected low-quality shortcut questions, and yields more compact, more reliable subsets that lower evaluation cost.

Conclusion: M3IRT offers a theoretically grounded and practical approach to evaluating MLLM cross-modal ability and helps build more rigorous and efficient multimodal benchmarks.

Abstract: Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, we can find the correct answer without either the image or the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.

[15] ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs

Wicaksono Leksono Muhamad, Joanito Agili Lopo, Tack Hwa Wong, Muhammad Ravi Shulthan Habibi, Samuel Cahyawijaya

Main category: cs.CL

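TL;DR: This paper proposes a method that reduces content effects in LLM multilingual reasoning through explicit structural abstraction: syllogisms are converted into canonical logical representations and validity is decided by deterministic parsing. It performs strongly on the SemEval-2026 multilingual benchmark.

Details

Motivation: Large language models exhibit content-effect biases in multilingual reasoning tasks, which impair their logical reasoning.

Method: An explicit structural abstraction step converts syllogisms into canonical logical representations, and deterministic parsing determines their validity.

Result: Top-5 rankings on all subtasks of the SemEval-2026 Task 11 multilingual benchmark, with substantially reduced content effects.

Conclusion: The method is a lightweight, effective, and interpretable alternative for mitigating content effects in LLMs, without complex fine-tuning or activation-level interventions.

Abstract: Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.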
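Illustrative sketch of a deterministic validity check once a syllogism has been normalized. The tuple encoding `(form, X, Y)` with forms A/E/I/O over the terms S, M, P is a hypothetical canonical representation, not the system's actual format, and the check assumes the modern (Boolean) reading in which universal statements carry no existential import. The normalization step itself, which maps free text to this form, is not shown.

```python
from itertools import product

# Eight region types: membership of an individual in (S, M, P).
REGIONS = list(product([False, True], repeat=3))
IDX = {"S": 0, "M": 1, "P": 2}

def holds(stmt, inhabited):
    """Evaluate a categorical statement in a model given by its inhabited regions."""
    form, x, y = stmt
    xs = [r for r in inhabited if r[IDX[x]]]
    if form == "A":  # All X are Y
        return all(r[IDX[y]] for r in xs)
    if form == "E":  # No X are Y
        return not any(r[IDX[y]] for r in xs)
    if form == "I":  # Some X are Y
        return any(r[IDX[y]] for r in xs)
    if form == "O":  # Some X are not Y
        return any(not r[IDX[y]] for r in xs)
    raise ValueError(f"unknown form: {form}")

def is_valid(premises, conclusion):
    """Valid iff no model satisfies every premise while falsifying the conclusion."""
    for mask in product([False, True], repeat=len(REGIONS)):
        inhabited = [r for r, on in zip(REGIONS, mask) if on]
        if all(holds(p, inhabited) for p in premises) and not holds(conclusion, inhabited):
            return False
    return True
```

Because only 256 models exist for three terms, exhaustive search is exact and independent of what the terms mean, which is the property that removes content effects from the validity decision.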

[16] HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse

Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya, Md. Shad Akhtar

Main category: cs.CL

TL;DR: This paper presents HateMirage, a dataset focused on subtle hate speech arising from fake or distorted narratives. Multi-dimensional annotation (target, intent, implication) improves explainability, and experiments suggest that explanation quality depends more on pretraining diversity and reasoning-oriented data than on model scale alone.

Details

Motivation: Existing hate speech datasets mainly cover overt toxicity and fail to reflect how misinformation incites or normalizes hate in subtle, indirect ways, leaving a research gap.

Method: Build the HateMirage dataset of 4,530 "Faux Hate" comments from YouTube discussions; identify widely debunked false claims from fact-checking sources and manually annotate each comment along three interpretable dimensions (Target, Intent, Implication); evaluate the explanation coherence of open-source language models with ROUGE-L F1 and Sentence-BERT similarity.

Result: Explanation quality appears to depend more on the diversity of pretraining data and on reasoning-oriented training data than on parameter count. HateMirage supports joint modeling of misinformation reasoning and harm attribution for explainable hate detection.

Conclusion: HateMirage provides a new benchmark for explainable detection of subtle hate speech and advances responsible AI research that combines fact-checking with attribution of social impact.

Abstract: Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity, underrepresenting the nuanced ways misinformation can incite or normalize hate. To address this gap, we present HateMirage, a novel dataset of Faux Hate comments designed to advance reasoning and explainability research on hate emerging from fake or distorted narratives. The dataset was constructed by identifying widely debunked misinformation claims from fact-checking sources and tracing related YouTube discussions, resulting in 4,530 user comments. Each comment is annotated along three interpretable dimensions: Target (who is affected), Intent (the underlying motivation or goal behind the comment), and Implication (its potential social impact). Unlike prior explainability datasets such as HateXplain and HARE, which offer token-level or single-dimensional reasoning, HateMirage introduces a multi-dimensional explanation framework that captures the interplay between misinformation, harm, and social consequence. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence. Results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data than on model scale alone. By coupling misinformation reasoning with harm attribution, HateMirage establishes a new benchmark for interpretable hate detection and responsible AI research.

[17] Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu, Yuchen He, Zhiyuan Ning, Chen Yijun, Wenge Que, Li Shi

Main category: cs.CL

TL;DR: This paper proposes Graph-GRPO, a framework that uses group relative policy optimization to address gradient variance and credit assignment when optimizing communication topology in LLM-based multi-agent systems, substantially improving training stability and performance.

Details

Motivation: Existing reinforcement-learning approaches to communication topology optimization rely on single-sample absolute rewards, which causes high gradient variance and difficult credit assignment, especially on easy tasks (saturated rewards) and hard tasks (no useful signal).

Method: Graph-GRPO samples a group of diverse communication graphs for each query and computes edge-level advantages from relative performance within the group; normalizing rewards across the group mitigates noise from task difficulty variance and enables fine-grained credit assignment.

Result: Significantly outperforms state-of-the-art baselines on reasoning and code generation benchmarks, trains more stably, and identifies critical communication pathways previously obscured by reward noise.

Conclusion: Group relative policy optimization is an effective paradigm for making communication topology learning in LLM multi-agent systems more robust and interpretable.

Abstract: Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
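Illustrative sketch of group-relative advantages for sampled communication graphs. Standardizing rewards within the group follows the usual GRPO recipe; the edge-level rule (averaging the advantage over the sampled graphs that contain an edge) is an assumption made for the example, since the abstract does not give the paper's exact formula. Graphs are represented as sets of hypothetical `(sender, receiver)` pairs.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (zero mean, unit scale)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def edge_advantages(graphs, rewards):
    """Average the group-relative advantage over the graphs that contain each edge."""
    adv = group_advantages(rewards)
    totals, counts = {}, {}
    for edges, a in zip(graphs, adv):
        for e in edges:
            totals[e] = totals.get(e, 0.0) + a
            counts[e] = counts.get(e, 0) + 1
    return {e: totals[e] / counts[e] for e in totals}
```

When every graph in the group earns the same reward, as on an easy query where all topologies succeed, the advantages are zero and the query contributes no gradient, instead of rewarding suboptimal structures.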

[18] Sensory-Aware Sequential Recommendation via Review-Distilled Representations

Yeo Chan Yoon

Main category: cs.CL

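TL;DR: This paper proposes ASEGR, a framework that uses a large language model to extract sensory attributes from product reviews and distills them into compact sensory embeddings that improve sequential recommendation.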

Details Motivation: Existing sequential recommenders rely mainly on user behavior data and do not model the sensory experience of items (e.g., color, scent), even though this information is implicit in user reviews and influences purchase decisions. Method: A two-stage approach: an LLM is first fine-tuned as a teacher to extract structured sensory attribute-value pairs from reviews; that knowledge is then distilled into a lightweight student Transformer that produces fixed-dimensional sensory embeddings, which are integrated into sequential recommenders such as SASRec and BERT4Rec. Result: On four Amazon domains, models augmented with sensory embeddings consistently outperform the baselines; qualitative analysis shows that the extracted attributes match human perception and improve interpretability. Conclusion: Sensory attribute distillation is an effective, scalable way to connect information extraction with sequential recommendation, improving both performance and interpretability through structured semantic representation learning. Abstract: We propose a novel framework for sensory-aware sequential recommendation that enriches item representations with linguistically extracted sensory attributes from product reviews. Our approach, ASEGR (Attribute-based Sensory Enhanced Generative Recommendation), introduces a two-stage pipeline in which a large language model is first fine-tuned as a teacher to extract structured sensory attribute-value pairs, such as "color: matte black" and "scent: vanilla", from unstructured review text. The extracted structures are then distilled into a compact student transformer that produces fixed-dimensional sensory embeddings for each item. These embeddings encode experiential semantics in a reusable form and are incorporated into standard sequential recommender architectures as additional item-level representations. We evaluate our method on four Amazon domains and integrate the learned sensory embeddings into representative sequential recommendation models, including SASRec, BERT4Rec, and BSARec. Across domains, sensory-enhanced models consistently outperform their identifier-based counterparts, indicating that linguistically grounded sensory representations provide complementary signals to behavioral interaction patterns. Qualitative analysis further shows that the extracted attributes align closely with human perceptions of products, enabling interpretable connections between natural language descriptions and recommendation behavior. 
Overall, this work demonstrates that sensory attribute distillation offers a principled and scalable way to bridge information extraction and sequential recommendation through structured semantic representation learning.

[19] Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration

Linhao Zhong,Linyu Wu,Wen Wang,Yuling Xi,Chenchen Jing,Jiaheng Zhang,Hao Chen,Chunhua Shen

Main category: cs.CL

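TL;DR: This paper proposes DiSE, a self-evaluation method for diffusion large language models (dLLMs) that quantifies confidence by the probability of regenerating tokens given the full context, and builds a flexible-length generation framework on top of it.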

Details Motivation: The non-sequential, bidirectionally masked generation of dLLMs makes quality assessment difficult, so an effective self-evaluation mechanism is needed. Method: DiSE quantifies confidence by computing the probability of regenerating each token of the entire output sequence given the full context; a flexible-length generation framework built on DiSE adaptively controls sequence length. Result: An analysis from the perspective of dLLM generalization supports the feasibility of DiSE; experiments show that DiSE is positively correlated with semantic coherence and answer accuracy and is effective for likelihood evaluation, uncertainty quantification, and flexible-length generation. Conclusion: DiSE is a simple and efficient method that improves quality assessment and generation controllability for dLLMs, offering a reliable self-evaluation approach for diffusion language modeling. Abstract: Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens in the entire generated sequence, given the full context. This method enables more efficient and reliable quality assessment by leveraging token regeneration probabilities, facilitating both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework, which adaptively controls the sequence length based on the model's self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE is positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of the proposed DiSE.
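As a rough illustration of scoring a sequence by token regeneration probabilities, the sketch below aggregates per-token probabilities into a single confidence value and uses it to choose among candidate lengths. The mean log-probability aggregate, the `pick_length` rule, and the probability values are assumptions for illustration; the abstract does not specify the exact score or length-control procedure, and a real system would obtain the probabilities from a dLLM forward pass.

```python
import math


def regeneration_confidence(token_probs):
    """Mean log-probability of regenerating each generated token."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def pick_length(candidates):
    """Return the candidate length whose sequence has the highest confidence."""
    return max(candidates, key=lambda n: regeneration_confidence(candidates[n]))


# Hypothetical per-token regeneration probabilities for three candidate lengths.
candidates = {
    4: [0.9, 0.8, 0.4, 0.3],
    6: [0.9, 0.9, 0.8, 0.8, 0.7, 0.9],
    8: [0.9, 0.9, 0.8, 0.8, 0.7, 0.9, 0.2, 0.1],
}
best_length = pick_length(candidates)
```

Averaging over tokens keeps the score comparable across lengths, which matters when the score is used to compare a short output against a long one.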

[20] From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

Weikang Shi,Houxing Ren,Junting Pan,Aojun Zhou,Ke Wang,Zimu Lu,Yunqiao Yang,Yuxuan Hu,Linda Wei,Mingjie Zhan,Hongsheng Li

Main category: cs.CL

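TL;DR: This paper introduces KMP-Bench, a comprehensive benchmark for K-8 mathematics tutoring with two modules, KMP-Dialogue (assessing six pedagogical principles) and KMP-Skills (fine-grained skill assessment). It finds that current LLMs do well on verifiable tasks but fall short in applying pedagogical principles, and it releases the KMP-Pile dataset, showing that fine-tuning on it improves tutoring ability.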

Details Motivation: Existing evaluations of LLM math tutoring are too simplistic or too narrow to measure multi-turn teaching effectiveness comprehensively. Method: The authors build the KMP-Bench benchmark (with the KMP-Dialogue and KMP-Skills modules) and the accompanying large-scale tutoring dialogue dataset KMP-Pile, then run evaluation and fine-tuning experiments on multiple LLMs. Result: Leading LLMs excel at tasks with definite answers but are clearly weaker in scenarios that require flexible application of pedagogical principles; models fine-tuned on KMP-Pile improve substantially on KMP-Bench. Conclusion: Pedagogically rich training data is essential for improving the teaching ability of AI math tutors, and KMP-Bench offers a more comprehensive and realistic standard for evaluating LLM tutoring. Abstract: Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.

[21] OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets

Jiyuan Shen,Peiyue Yuan,Atin Ghosh,Yifan Mai,Daniel Dahlmeier

Main category: cs.CL

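TL;DR: Through a large-scale benchmarking study, this paper evaluates a range of off-the-shelf multimodal large language models (MLLMs) on business-document information extraction and proposes an LLM-based automated hierarchical error analysis framework. The results suggest that OCR preprocessing may be unnecessary for strong MLLMs: image-only input can match OCR+MLLM performance, and carefully designed schemas, exemplars, and instructions improve results further.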

Details Motivation: To examine how well multimodal large language models (MLLMs) actually perform in document information extraction, and in particular whether an MLLM-only pipeline can match the traditional OCR+MLLM setup. Method: A large-scale benchmarking study evaluates multiple out-of-the-box MLLMs on business-document information extraction; an LLM-based automated hierarchical error analysis framework is proposed to diagnose error patterns systematically. Result: Strong MLLMs with image-only input reach performance comparable to OCR-enhanced approaches; carefully designed schemas, exemplars, and instructions further improve MLLM performance. Conclusion: OCR preprocessing may not be required for high-performing MLLMs; the study offers practical guidance and insight for document information extraction. Abstract: Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline, while simpler, can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve comparable performance to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schema, exemplars, and instructions can further enhance MLLMs performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.

[22] Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs

Prarthana Bhattacharyya,Joshua Mitton,Ralph Abboud,Simon Woodhead

Main category: cs.CL

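TL;DR: This paper compares knowledge tracing (KT) models with large language models (LLMs) on predicting student responses, in terms of predictive performance, deployment cost, and inference speed. KT models are clearly better on accuracy, speed, and cost, underscoring the importance of domain-specific models in education.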

Details Motivation: With the rise of large language models (LLMs), the authors want to assess their suitability for student response prediction, a key educational task, and compare them systematically with traditional knowledge tracing (KT) models. Method: An empirical comparison of multiple LLMs and KT models on predicting students' future responses, covering predictive performance (accuracy, F1 score), deployment cost, and inference speed. Result: KT models outperform LLMs on accuracy and F1 score; LLMs are orders of magnitude slower at inference than KT models and cost orders of magnitude more to deploy. Conclusion: For prediction tasks in education, domain-specific KT models retain clear advantages; current closed-source LLMs should not be treated as a universal solution. Abstract: Predicting future student responses to questions is particularly valuable for educational learning platforms where it enables effective interventions. One of the key approaches to do this has been through the use of knowledge tracing (KT) models. These are small, domain-specific, temporal models trained on student question-response data. KT models are optimised for high accuracy on specific educational domains and have fast inference and scalable deployments. The rise of Large Language Models (LLMs) motivates us to ask the following questions: (1) How well can LLMs perform at predicting students' future responses to questions? (2) Are LLMs scalable for this domain? (3) How do LLMs compare to KT models on this domain-specific task? In this paper, we compare multiple LLMs and KT models across predictive performance, deployment cost, and inference speed to answer the above questions. We show that KT models outperform LLMs with respect to accuracy and F1 scores on this domain-specific task. Further, we demonstrate that LLMs are orders of magnitude slower than KT models and cost orders of magnitude more to deploy. This highlights the importance of domain-specific models for education prediction tasks and the fact that current closed source LLMs should not be used as a universal solution for all tasks.

[23] A Browser-based Open Source Assistant for Multimodal Content Verification

Rosanna Milner,Michael Foster,Olesya Razuvayevskaya,Ian Roberts,Valentin Porcellini,Denis Teyssou,Kalina Bontcheva

Main category: cs.CL

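TL;DR: This paper presents the VERIFICATION ASSISTANT, a browser-based tool that helps journalists and fact-checkers quickly identify disinformation produced by generative AI. It integrates multiple NLP models behind a unified interface and delivers actionable credibility signals and estimates of AI-generated content.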

Details Motivation: Journalists and fact-checkers urgently need to verify digital media content efficiently, but existing NLP detection models are often hard for non-expert users to access and are not integrated into their daily workflows. Method: The authors design and implement the browser-based VERIFICATION ASSISTANT as a core component of the existing VERIFICATION PLUGIN; users submit URLs or media files, and the tool automatically extracts the content and routes it to multiple backend NLP classifiers for joint analysis. Result: The tool is deployed as part of a plugin with more than 140,000 users and outputs credibility signals such as the likelihood of AI generation, persuasion techniques, and subjectivity in a clear, easy-to-digest form. Conclusion: The VERIFICATION ASSISTANT bridges the gap between state-of-the-art NLP and the practical needs of working journalists, showing the practical value of modular, integrable, lightweight tools for countering AI-generated disinformation. Abstract: Disinformation and false content produced by generative AI pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media information. While there is an abundance of NLP models for detecting credibility signals such as persuasion techniques, subjectivity, or machine-generated text, such methods often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the VERIFICATION ASSISTANT, a browser-based tool designed to bridge this gap. The VERIFICATION ASSISTANT, a core component of the widely adopted VERIFICATION PLUGIN (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, delivering actionable credibility signals, estimating AI-generated content, and providing other verification guidance in a clear, easy-to-digest format. This paper showcases the tool architecture, its integration of multiple NLP services, and its real-world application to detecting disinformation.

[24] The Distribution of Phoneme Frequencies across the World's Languages: Macroscopic and Microscopic Information-Theoretic Models

Fermín Moscoso del Prado Martín,Suchir Salhan

Main category: cs.CL

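TL;DR: This paper explains cross-linguistic phoneme frequency distributions at two levels. Macroscopically, phoneme rank-frequency distributions follow the order statistics of a symmetric Dirichlet distribution; microscopically, a Maximum Entropy model with articulatory, phonotactic, and lexical constraints accurately predicts language-specific phoneme probabilities.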

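Details Motivation: To explain the general regularities of phoneme frequency distributions across languages and their causes. Method: At the macroscopic level, phoneme rank-frequency distributions are modeled with a symmetric Dirichlet distribution; at the microscopic level, a Maximum Entropy model is built that combines articulatory, phonotactic, and lexical-structure constraints. Result: The Dirichlet concentration parameter scales systematically with phonemic inventory size, revealing a compensation effect in which larger inventories have lower relative entropy; the Maximum Entropy model accurately predicts language-specific phoneme probabilities. Conclusion: Together, the macroscopic and microscopic models form a unified information-theoretic account of the systematic structure of phoneme frequencies. Abstract: We demonstrate that the frequency distribution of phonemes across languages can be explained at both macroscopic and microscopic levels. Macroscopically, phoneme rank-frequency distributions closely follow the order statistics of a symmetric Dirichlet distribution whose single concentration parameter scales systematically with phonemic inventory size, revealing a robust compensation effect whereby larger inventories exhibit lower relative entropy. Microscopically, a Maximum Entropy model incorporating constraints from articulatory, phonotactic, and lexical structure accurately predicts language-specific phoneme probabilities. Together, these findings provide a unified information-theoretic account of phoneme frequency structure.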
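The macroscopic claim can be made concrete with a small Monte Carlo sketch: draw probability vectors from a symmetric Dirichlet, sort each draw, and average by rank to approximate the expected rank-frequency curve. The inventory size, concentration value, and sample count are arbitrary illustrative choices, and reading "relative entropy" as entropy divided by its maximum `log(k)` is an assumption about the abstract's terminology.

```python
import math
import random


def sample_symmetric_dirichlet(k, alpha, rng):
    """Draw one probability vector from Dirichlet(alpha, ..., alpha) of size k."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def expected_rank_frequencies(k, alpha, n_samples=2000, seed=0):
    """Monte Carlo estimate of the expected order statistics, largest first."""
    rng = random.Random(seed)
    sums = [0.0] * k
    for _ in range(n_samples):
        probs = sorted(sample_symmetric_dirichlet(k, alpha, rng), reverse=True)
        for rank, p in enumerate(probs):
            sums[rank] += p
    return [s / n_samples for s in sums]


def normalized_entropy(probs):
    """Shannon entropy divided by log(k), so a uniform vector scores 1.0."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


# Hypothetical inventory of 20 phonemes with concentration parameter 1.0.
curve = expected_rank_frequencies(k=20, alpha=1.0)
relative_entropy = normalized_entropy(curve)
```

Fitting the single concentration parameter to an observed rank-frequency curve, for instance by a grid search over `alpha`, would be the natural next step, but that procedure is not described in the abstract and is left out here.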

[25] Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

Haruto Yoshida,Keito Kudo,Yoichi Aoki,Ryota Tanaka,Itsumi Saito,Keisuke Sakaguchi,Kentaro Inui

Main category: cs.CL

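TL;DR: This paper investigates why large vision-language models (LVLMs) struggle with relationships between nodes and directed edges in diagrams. It finds that edge information becomes linearly separable later than node information: edges are linearly encoded only in the text tokens of the language model, whereas node and global structural features are already linearly separable in the vision encoder.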

Details Motivation: LVLMs perform well on diagram understanding tasks yet remain clearly limited in understanding relationships between nodes and directed edges (such as arrows and lines); the paper aims to uncover the representational cause. Method: The authors construct a synthetic diagram dataset based on directed graphs and use probing experiments to analyze the linear separability of node, edge, and global structural information across LVLM modules, in particular the vision encoder and the language model. Result: Edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens of the language model; node and global structural features are already linearly separable in the hidden states of the vision encoder. Conclusion: Different kinds of visual information (nodes vs. edges) become linearly separable at different stages, and the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding such as interpreting edge direction. Abstract: Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.

[26] LaTeX Compilation: Challenges in the Era of LLMs

Tianyou Liu,Ziqiang Li,Yansong Li,Xurui Liu

Main category: cs.CL

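TL;DR: This paper analyzes the shortcomings of TeX in the LLM era in compilation efficiency, generated semantics, error localization, and tool ecosystem, and introduces Mogan STEM, a WYSIWYG structured editor. Mogan is reported to outperform TeX in compilation/rendering speed and LLM task performance, and its lower-entropy .tmu format is reported to be more efficient for fine-tuning; the authors call for larger-scale LLM training experiments on .tmu.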

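Details Motivation: As large language models (LLMs) increasingly assist scientific writing, fundamental shortcomings of TeX in compilation efficiency and user experience design (high token cost, hard error localization, weak generated semantics, an aging tool ecosystem) become more visible, motivating an alternative. Method: The paper systematically analyzes defects in TeX's compilation mechanism and interaction design, introduces Mogan STEM, a WYSIWYG structured editor built on an efficient data structure, fast rendering, and on-demand plugin loading, together with its native document format .tmu, and runs experiments comparing compilation/rendering time and performance on LLM tasks (e.g., generation and fine-tuning). Result: Mogan is clearly faster than TeX in compilation/rendering and performs better on LLM tasks; because of its lower information entropy, the .tmu format is better suited to LLM fine-tuning. Conclusion: TeX is a poor fit for scientific writing in the LLM era; Mogan STEM and its .tmu format offer a more efficient, friendlier, and more AI-ready alternative that merits larger-scale LLM training experiments. Abstract: As large language models (LLMs) increasingly assist scientific writing, limitations and the significant token cost of TeX become more and more visible. This paper analyzes TeX's fundamental defects in compilation and user experience design to illustrate its limitations on compilation efficiency, generated semantics, error localization, and tool ecosystem in the era of LLMs. As an alternative, Mogan STEM, a WYSIWYG structured editor, is introduced. Mogan outperforms TeX in the above aspects by its efficient data structure, fast rendering, and on-demand plugin loading. Extensive experiments are conducted to verify the benefits on compilation/rendering time and performance in LLM tasks. Moreover, we show that due to Mogan's lower information entropy, it is more efficient to use .tmu (the document format of Mogan) to fine-tune LLMs than TeX. Therefore, we launch an appeal for larger experiments on LLM training using the .tmu format.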

[27] Eval4Sim: An Evaluation Framework for Persona Simulation

Eliseo Bao,Anxo Perez,Xi Wang,Javier Parapar

Main category: cs.CL

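TL;DR: This paper proposes Eval4Sim, a framework that evaluates how closely LLM persona-simulated conversations match real human behavior along three dimensions (Adherence, Consistency, and Naturalness), using a human conversational corpus such as PersonaChat as the reference. It addresses the weak behavioral grounding and poor interpretability of existing LLM-as-a-judge approaches.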

Details Motivation: Evaluation of LLM persona simulation relies mainly on LLM-as-a-judge, which is weakly grounded in observed human behavior and produces opaque scalar scores that fail to expose real deficiencies in simulation quality. Method: Eval4Sim quantifies the alignment between simulated and human conversations along three dimensions: Adherence (dense retrieval with speaker-aware representations), Consistency (authorship verification), and Naturalness (distributions derived from dialogue-focused NLI). It uses speaker-annotated corpora such as PersonaChat as the reference baseline and penalizes deviations in both directions (insufficient or over-optimized). Result: Eval4Sim provides interpretable, behaviorally grounded, multi-dimensional evaluation that distinguishes insufficient persona encoding from unnatural, over-optimized behavior; it is demonstrated on PersonaChat and extends to other speaker-annotated conversational corpora. Conclusion: Eval4Sim offers a more reliable, transparent, and human-centered evaluation paradigm for LLM persona simulation, encouraging simulation systems to align with real human behavior. Abstract: Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural analysis. Ensuring that persona-grounded simulations faithfully reflect human conversational behaviour is therefore critical. However, current evaluation practices largely rely on LLM-as-a-judge approaches, offering limited grounding in observable human behavior and producing opaque scalar scores. We address this gap by proposing Eval4Sim, an evaluation framework that measures how closely simulated conversations align with human conversational patterns across three complementary dimensions. Adherence captures how effectively persona backgrounds are implicitly encoded in generated utterances, assessed via dense retrieval with speaker-aware representations. Consistency evaluates whether a persona maintains a distinguishable identity across conversations, computed through authorship verification. Naturalness reflects whether conversations exhibit human-like flow rather than overly rigid or optimized structure, quantified through distributions derived from dialogue-focused Natural Language Inference. Unlike absolute or optimization-oriented metrics, Eval4Sim uses a human conversational corpus (i.e., PersonaChat) as a reference baseline and penalizes deviations in both directions, distinguishing insufficient persona encoding from over-optimized, unnatural behaviour. 
Although demonstrated on PersonaChat, the applicability of Eval4Sim extends to any conversational corpus containing speaker-level annotations.

[28] Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction

Guangjun Zhang,Hu Zhang,Yazhou Han,Yue Fan,Yuhang Shao,Ru Li,Hongye Tan

Main category: cs.CL

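TL;DR: This paper proposes a multi-agent collaboration framework for zero-shot document-level event argument extraction (ZS-DEAE). It simulates the human collaborative process of "Propose-Evaluate-Revise" with a generation agent and an evaluation agent, and adds structure-constrained reinforcement learning, improving both synthetic data quality and extraction performance.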

Details Motivation: Existing zero-shot methods generate synthetic data from event-type-only prompts, which makes it hard to capture the contextual and structural relationships of unseen events, and they lack an effective mechanism for evaluating synthetic data quality. Method: The framework pairs a generation agent with an evaluation agent. The generation agent synthesizes data for unseen events from knowledge of seen events; the evaluation agent extracts arguments from that data and judges their semantic consistency with the context. The evaluation results become reward signals that incorporate event-structure constraints and drive iterative optimization of both agents through reinforcement learning. Result: In three zero-shot scenarios built from RAMS and WikiEvents, the method improves both synthetic data quality and event argument extraction performance, and the generated data also improves the zero-shot performance of other DEAE models. Conclusion: Multi-agent collaboration combined with structure-aware reinforcement learning is an effective paradigm for zero-shot document-level event argument extraction. Abstract: Document-level event argument extraction (DEAE) is essential for knowledge acquisition, aiming to extract participants of events from documents. In the zero-shot setting, existing methods employ LLMs to generate synthetic data to address the challenge posed by the scarcity of annotated data. However, relying solely on Event-type-only prompts makes it difficult for the generated content to accurately capture the contextual and structural relationships of unseen events. Moreover, ensuring the reliability and usability of synthetic data remains a significant challenge due to the absence of quality evaluation mechanisms. To this end, we introduce a multi-agent collaboration framework for zero-shot document-level event argument extraction (ZS-DEAE), which simulates the human collaborative cognitive process of "Propose-Evaluate-Revise." Specifically, the framework comprises a generation agent and an evaluation agent. The generation agent synthesizes data for unseen events by leveraging knowledge from seen events, while the evaluation agent extracts arguments from the synthetic data and assesses their semantic consistency with the context. The evaluation results are subsequently converted into reward signals, with event structure constraints incorporated into the reward design to enable iterative optimization of both agents via reinforcement learning. In three zero-shot scenarios constructed from the RAMS and WikiEvents datasets, our method achieves improvements both in data generation quality and argument extraction performance, while the generated data also effectively enhances the zero-shot performance of other DEAE models.

[29] ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

Bo Xu,Haotian Wu,Hehai Lin,Weiquan Huang,Beier Zhu,Yao Shu,Chengwei Qin

Main category: cs.CL

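TL;DR: This paper proposes ACE-Merging, a model-merging method that needs no data, no retraining, and no architectural changes. It implicitly estimates each task's input covariance from the parameter differences of the fine-tuned models, adaptively mitigating inter-task interference, and reaches state-of-the-art results among data-free methods on vision and language benchmarks.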

Details Motivation: In model merging, expert models trained on different objectives interfere with one another and degrade performance, while most existing methods depend on data, retraining, or architectural changes; an efficient, theoretically grounded data-free solution is missing. Method: ACE-Merging (Adaptive Covariance Estimation) builds on a theoretical analysis showing that each task's input covariance can be implicitly estimated from the parameter differences of its fine-tuned model, and uses that estimate in a closed-form, non-iterative merging solution. Result: Across seven tasks on GPT-2, the method achieves an average absolute improvement of 4% over previous methods; it reaches state-of-the-art results among data-free methods on multiple vision and language benchmarks at modest computational cost. Conclusion: ACE-Merging provides a theoretically grounded, efficient, and practical approach to data-free model merging that addresses the core challenge of cross-task interference. Abstract: Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.

[30] MaBERT: A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling

Jinwoong Kim,Sangjin Park

Main category: cs.CL

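TL;DR: MaBERT is a hybrid encoder that combines the global modeling ability of Transformers with the linear-time efficiency of Mamba. It interleaves the two layer types and adds padding-safe masking and mask-aware attention pooling, substantially improving training and inference efficiency on long inputs.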

Details Motivation: BERT-style self-attention models are expensive on long sequences (O(n²)), while linear state space models such as Mamba have trouble modeling global interactions and are vulnerable to state contamination from padding. Method: MaBERT interleaves Transformer layers (capturing global dependencies) with Mamba layers (linear-time state updates), introduces padding-safe masking to stop padded positions from contaminating the state, and adds mask-aware attention pooling that aggregates information only from valid tokens. Result: On GLUE, MaBERT achieves the best mean score on five of the eight tasks, with strong results on CoLA and sentence-pair inference; when the context is extended from 512 to 4,096 tokens, training time and inference latency are reduced by 2.36x and 2.43x, respectively, relative to the average of the encoder baselines. Conclusion: MaBERT is a practical long-context encoder that balances modeling capacity and computational efficiency, supporting the viability of hybrid architectures for long-text understanding. Abstract: Self-attention encoders such as Bidirectional Encoder Representations from Transformers (BERT) scale quadratically with sequence length, making long-context modeling expensive. Linear-time state space models, such as Mamba, are efficient; however, they show limitations in modeling global interactions and can suffer from padding-induced state contamination. We propose MaBERT, a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates. This design alternates global contextual integration with fast state accumulation, enabling efficient training and inference on long inputs. To stabilize variable-length batching, we introduce padding-safe masking, which blocks state propagation through padded positions, and mask-aware attention pooling, which aggregates information only from valid tokens. On GLUE, MaBERT achieves the best mean score on five of the eight tasks, with strong performance on the CoLA and sentence pair inference tasks. When extending the context from 512 to 4,096 tokens, MaBERT reduces training time and inference latency by 2.36x and 2.43x, respectively, relative to the average of encoder baselines, demonstrating a practical long-context efficient encoder.
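The two padding-related mechanisms can be illustrated with a toy sketch. The scalar recurrence `h = decay * h + x` is only a stand-in for a Mamba state update, and the function names, the `decay` value, and the softmax pooling form are assumptions for illustration; the abstract states only that state propagation through padded positions is blocked and that pooling uses valid tokens.

```python
import math


def padding_safe_scan(xs, mask, decay=0.9):
    """Toy linear recurrence that leaves the state untouched at padded positions."""
    h, states = 0.0, []
    for x, valid in zip(xs, mask):
        if valid:
            h = decay * h + x  # update the state only on real tokens
        states.append(h)       # padded positions simply carry the state forward
    return states


def mask_aware_attention_pool(values, scores, mask):
    """Softmax-weighted average over valid positions only (needs one valid token)."""
    valid = [(v, s) for v, s, m in zip(values, scores, mask) if m]
    top = max(s for _, s in valid)
    weights = [math.exp(s - top) for _, s in valid]
    return sum(w * v for w, (v, _) in zip(weights, valid)) / sum(weights)


# The same three-token sequence, unpadded and right-padded with junk values.
plain = padding_safe_scan([1.0, 2.0, 3.0], [1, 1, 1])
padded = padding_safe_scan([1.0, 2.0, 3.0, 99.0, 99.0], [1, 1, 1, 0, 0])
pooled = mask_aware_attention_pool([1.0, 2.0, 99.0], [0.0, 0.0, 5.0], [1, 1, 0])
```

The final state is identical with and without padding, and the high-scoring padded value never enters the pooled result, which is the batch-invariance property the abstract is after.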

[31] TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health

Zixin Xiong,Ziteng Wang,Haotian Fan,Xinjie Zhang,Wenxuan Wang

Main category: cs.CL

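TL;DR: This paper proposes TrustMH-Bench, a framework that systematically evaluates the trustworthiness of large language models in mental health across eight dimensions, including reliability, crisis identification, and safety. Existing models fall short on multiple dimensions, highlighting the urgency of improving their trustworthiness.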

Details Motivation: Evaluation paradigms for general-purpose LLMs do not capture the specific requirements of mental health, a high-stakes and safety-sensitive domain, so a dedicated trustworthiness evaluation is needed. Method: TrustMH-Bench maps mental-health domain norms to quantitative evaluation metrics and assesses models on eight core pillars (Reliability, Crisis Identification and Escalation, Safety, Fairness, Privacy, Robustness, Anti-sycophancy, and Ethics); experiments cover six general-purpose LLMs and six specialized mental health models. Result: The evaluated models show clear deficiencies across trustworthiness dimensions in mental health scenarios; even strong general-purpose models (e.g., GPT-5.1) do not maintain consistently high performance on all dimensions. Conclusion: Systematically improving the trustworthiness of LLMs in mental health is a critical and urgent task, and TrustMH-Bench provides a reproducible, extensible evaluation benchmark for it. Abstract: While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain's high-stakes and safety-sensitive nature. Existing evaluation paradigms for general-purpose LLMs fail to capture mental health-specific requirements, highlighting an urgent need to prioritize and enhance their trustworthiness. To address this, we propose TrustMH-Bench, a holistic framework designed to systematically quantify the trustworthiness of mental health LLMs. By establishing a deep mapping from domain-specific norms to quantitative evaluation metrics, TrustMH-Bench evaluates models across eight core pillars: Reliability, Crisis Identification and Escalation, Safety, Fairness, Privacy, Robustness, Anti-sycophancy, and Ethics. We conduct extensive experiments across six general-purpose LLMs and six specialized mental health models. Experimental results indicate that the evaluated models underperform across various trustworthiness dimensions in mental health scenarios, revealing significant deficiencies. Notably, even generally powerful models (e.g., GPT-5.1) fail to maintain consistently high performance across all dimensions. Consequently, systematically improving the trustworthiness of LLMs has become a critical task. Our data and code are released.

[32] PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems

Sudip Bhujel

Main category: cs.CL

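TL;DR: This paper proposes PrivMedChat, a framework that uses differentially private RLHF (DP-RLHF) for privacy-preserving model alignment in medical dialogue, limiting leakage of sensitive doctor-patient data while maintaining high generation quality and a low clinical hallucination rate.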

Details Motivation: Medical dialogue applications of large language models often train on doctor-patient conversations containing sensitive information, which creates privacy risks; conventional supervised fine-tuning and RLHF can lead to training-data memorization and exposure to membership inference attacks. Method: PrivMedChat is an end-to-end differentially private RLHF (DP-RLHF) framework that applies DP-SGD during supervised fine-tuning (SFT), reward model training, and PPO policy optimization, and introduces an annotation-free preference construction method that pairs physician responses with filtered non-expert generations. Result: At ε=7, PrivMedChat outperforms other differentially private models on ROUGE-L (0.156), clinical hallucination rate (1.4%), harmful advice rate (0.4%), and overall LLM-jury score (2.86); membership-inference AUC is near chance (0.510-0.555). Conclusion: PrivMedChat balances privacy protection and dialogue quality for medical LLMs, achieving high-quality, safe, and scalable clinical dialogue alignment while protecting patient data. Abstract: Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differentially Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. 
Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at https://github.com/sudip-bhujel/privmedchat.
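For readers unfamiliar with DP-SGD, the core update is per-example gradient clipping followed by Gaussian noise on the summed gradient. The sketch below shows that single step on plain Python lists; the function names, learning rate, clipping bound, and gradients are illustrative assumptions, it is not the authors' training code, and the privacy accounting that converts the noise level into a value such as ε=7 is omitted.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one example's gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example, sum, add Gaussian noise, average."""
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        for i, g in enumerate(clip_gradient(grad, max_norm)):
            summed[i] += g
    sigma = noise_multiplier * max_norm  # noise scale is tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [p - lr * g / batch_size for p, g in zip(params, noisy)]


# Two hypothetical per-example gradients; the first exceeds the clipping bound.
grads = [[3.0, 4.0], [0.3, 0.4]]
noiseless = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                        noise_multiplier=0.0, rng=random.Random(0))
private = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.1, rng=random.Random(0))
```

Clipping bounds how much any single dialogue can move the parameters, which is what lets the added noise mask an individual record's contribution. In practice a library with a privacy accountant would be used instead of a hand-written loop.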

[33] TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu,Jiaqi Li,Xiaotong Zhang,Hong Yu,Han Liu

Main category: cs.CL

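TL;DR: This paper proposes TAO-Attack, a new optimization-based jailbreak attack on LLMs. A two-stage loss suppresses refusals and pseudo-harmful outputs, and a direction-priority token optimization strategy improves efficiency, yielding higher attack success rates on multiple LLMs.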

Details Motivation: Existing optimization-based jailbreak attacks suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates, so a more efficient and robust attack method is needed. Method: TAO-Attack uses a two-stage loss (the first stage suppresses refusals; the second penalizes pseudo-harmful outputs and steers the model toward more harmful completions) and a direction-priority token optimization (DPTO) strategy that aligns candidates with the gradient direction before considering update magnitude. Result: Experiments on multiple LLMs show that TAO-Attack consistently outperforms state-of-the-art methods, with higher attack success rates that reach 100% in some scenarios. Conclusion: Through structured loss design and an efficient optimization strategy, TAO-Attack improves the effectiveness and robustness of jailbreak attacks, providing a tool and insights for evaluating and strengthening LLM safety. Abstract: Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.

[34] Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection

Sofiane Elguendouze,Erwan Hain,Elena Cabrio,Serena Villata

Main category: cs.CL

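TL;DR: This paper proposes an approach based on instruction-tuned large language models (LLMs) that reframes argumentative component detection (ACD) as a generation task, identifying argumentative components directly from raw text without pre-segmentation and outperforming state-of-the-art systems.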

Details Motivation: Argumentative component detection (ACD) is one of the most challenging subtasks of argument mining, yet most existing work simplifies it to sequence labeling or a two-stage pipeline, which limits modeling capacity; a more unified, end-to-end formulation is needed. Method: The approach uses instruction-tuned LLMs with compact instruction prompts and recasts ACD as a language generation task that directly outputs typed argumentative components, removing the dependence on pre-segmented input. Result: On standard benchmarks the approach outperforms current state-of-the-art systems; it is one of the first attempts to model ACD fully as a generative task. Conclusion: Instruction-tuned LLMs offer a new paradigm for complex argument mining tasks, and generative modeling shows clear potential for ACD. Abstract: Argumentative component detection (ACD) is a core subtask of Argument(ation) Mining (AM) and one of its most challenging aspects, as it requires jointly delimiting argumentative spans and classifying them into components such as claims and premises. While research on this subtask remains relatively limited compared to other AM tasks, most existing approaches formulate it as a simplified sequence labeling problem, component classification, or a pipeline of component segmentation followed by classification. In this paper, we propose a novel approach based on instruction-tuned Large Language Models (LLMs) using compact instruction-based prompts, and reframe ACD as a language generation task, enabling arguments to be identified directly from plain text without relying on pre-segmented components. Experiments on standard benchmarks show that our approach achieves higher performance compared to state-of-the-art systems. To the best of our knowledge, this is one of the first attempts to fully model ACD as a generative task, highlighting the potential of instruction tuning for complex AM problems.

[35] Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems

Raad Khraishi,Iman Zafar,Katie Myles,Greig A Cowan

Main category: cs.CL

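TL;DR: This paper studies the context mismatch caused by model switching (handoff) in multi-turn LLM systems, introduces a switch-matrix benchmark to quantify the resulting performance drift, and reveals significant, directional performance changes along with systematic compatibility patterns between models.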

Details Motivation: Deployed multi-turn LLM systems often switch models mid-interaction because of upgrades, cross-provider routing, or fallbacks, so the model generating later turns must reason over history produced by another model, which can cause silent performance drift. Method: Introduces a switch-matrix benchmark in which a prefix model generates the early turns and a suffix model generates the final turn, compared against a no-switch baseline; evaluates with paired episode-level bootstrap confidence intervals; runs experiments on CoQA and Multi-IF; and decomposes drift into two quantifiable terms, prefix influence and suffix susceptibility. Result: A single-turn handoff already produces prevalent, statistically significant, directional changes: -8 to +13 percentage points in Multi-IF strict success rate and ±4 absolute F1 on CoQA; models show systematic compatibility patterns (some suffix models almost always degrade, others almost always improve); the decomposition accounts for about 70% of the drift variance. Conclusion: Handoff robustness is a key operational reliability dimension that single-model evaluation overlooks, and multi-turn systems need explicit monitoring and handoff-aware mitigation. Abstract: Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent and statistically significant, directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and +/- 4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To enable compressed handoff risk monitoring, we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, accounting for ~70% of variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.
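
The paired episode-level bootstrap mentioned in the abstract can be illustrated with a short sketch. This is an assumption about the general technique, not the authors' implementation: it resamples per-episode score differences with replacement and reports a percentile interval. The function name `paired_bootstrap_ci`, the resample count, and the toy scores are all hypothetical.

```python
import random


def paired_bootstrap_ci(baseline, switched, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-episode difference (switched - baseline)."""
    if len(baseline) != len(switched):
        raise ValueError("scores must be paired per episode")
    # Pairing: each episode contributes one difference, so episode difficulty cancels out.
    diffs = [s - b for b, s in zip(baseline, switched)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample whole episodes with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Hypothetical per-episode success indicators (1 = success); not data from the paper.
baseline = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
switched = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
mean_diff, lo, hi = paired_bootstrap_ci(baseline, switched)
print(f"mean drift = {mean_diff:+.2f}, 95% CI = [{lo:+.2f}, {hi:+.2f}]")
```

Under this sketch, a handoff effect would be called directional when the interval excludes zero; with only ten toy episodes the interval is wide, which is why episode counts matter for this kind of monitoring.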

[36] UniSkill: A Dataset for Matching University Curricula to Professional Competencies

Nurlan Musazade,Joszef Mezei,Mike Zhang

Main category: cs.CL

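TL;DR: This paper builds a skill-annotated dataset pairing the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy with university courses and trains a BERT model on it for course-to-skill and skill-to-course matching, reaching an F1-score of 87%.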

Details Motivation: Publicly available datasets for skill extraction and recommendation systems are scarce, and skill annotations in particular are lacking. Method: Builds manually annotated and synthetic course-skill matching datasets (based on the Systems Analysts and Management and Organization Analyst ESCO occupation groups) at two granularities, course title with skill and course sentence with skill; publishes annotation guidelines; trains and evaluates language models such as BERT for course-to-skill and skill-to-course matching. Result: The BERT model achieves an 87% F1-score on the annotated test set, showing that course-skill matching is feasible. Conclusion: The released datasets and baseline models provide a foundation for course-skill mapping, educational pathway planning, and recruitment recommendation. Abstract: Skill extraction and recommendation systems have been studied from recruiter, applicant, and education perspectives. While AI applications in job advertisements have received broad attention, deficiencies in the instructed skills side remain a challenge. In this work, we address the scarcity of publicly available datasets by releasing both manually annotated and synthetic datasets of skills from the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy and university course pairs and publishing corresponding annotation guidelines. Specifically, we match graduate-level university courses with skills from the Systems Analysts and Management and Organization Analyst ESCO occupation groups at two granularities: course title with a skill, and course sentence with a skill. We train language models on this dataset to serve as a baseline for retrieval and recommendation systems for course-to-skill and skill-to-course matching. We evaluate the models on a portion of the annotated data. Our BERT model achieves 87% F1-score, showing that course and skill matching is a feasible task.

[37] APRES: An Agentic Paper Revision and Evaluation System

Bingchen Zhao,Jenny Zhang,Chenxi Whitehouse,Minqi Jiang,Michael Shvartsman,Abhishek Charnalia,Despoina Magka,Tatiana Shavrina,Derek Dunfield,Oisin Mac Aodha,Yoram Bachrach

Main category: cs.CL

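TL;DR: This paper proposes APRES, an LLM-based automated paper revision method that learns a highly predictive evaluation rubric to improve paper quality and impact without changing the core scientific content; experiments show a marked improvement in citation prediction accuracy and a 79% preference rate in human evaluation.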

Details Motivation: Peer review feedback is inconsistent, which hinders manuscript improvement and limits impact; authors need an objective, repeatable way to improve how a paper is communicated before submission. Method: Proposes APRES, which uses an LLM to revise paper text automatically according to a data-driven evaluation rubric that is highly predictive of future citation counts, while keeping the scientific content unchanged. Result: APRES reduces the mean absolute error of future citation prediction by 19.6%; human expert evaluators prefer the revised papers 79% of the time. Conclusion: LLMs can serve as an effective aid for authors to stress-test manuscripts and improve their communication and impact; the technique is meant to augment, not replace, human reviewers' judgment of scientific value. Abstract: Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The primary way scientists communicate their work and receive feedback from the community is through peer review. However, the current system often provides inconsistent feedback between reviewers, ultimately hindering the improvement of a manuscript and limiting its potential impact. In this paper, we introduce a novel method APRES powered by Large Language Models (LLMs) to update a scientific paper's text based on an evaluation rubric. Our automated method discovers a rubric that is highly predictive of future citation counts, and we integrate it with APRES in an automated system that revises papers to enhance their quality and impact. Crucially, this objective should be met without altering the core scientific content. We demonstrate the success of APRES, which improves future citation prediction by 19.6% in mean absolute error over the next best baseline, and show that our paper revision process yields papers that are preferred over the originals by human expert evaluators 79% of the time. Our findings provide strong empirical support for using LLMs as a tool to help authors stress-test their manuscripts before submission. Ultimately, our work seeks to augment, not replace, the essential role of human expert reviewers, for it should be humans who discern which discoveries truly matter, guiding science toward advancing knowledge and enriching lives.

[38] BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen,Fanzhe Meng,Jiale Zhao,Minghao Li,Daixuan Cheng,Huatong Song,Jie Chen,Yuzhi Lin,Hui Chen,Xin Zhao,Ruihua Song,Chang Liu,Cheng Chen,Kai Jia,Ji-Rong Wen

Main category: cs.CL

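TL;DR: This paper introduces the BeyondSWE benchmark for evaluating code agents in realistic settings such as cross-repository reasoning and domain-specialized problem solving, and designs the SearchSWE framework to study how external knowledge (such as search) affects code generation, finding that current models still face significant capability bottlenecks.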

Details Motivation: Existing code agent benchmarks are limited to narrow fixes within a single repository and do not reflect real development challenges such as cross-repository reasoning, dependency-driven migration, and full-repository generation, so a more comprehensive evaluation is needed. Method: Builds BeyondSWE, a benchmark of 500 real-world instances across four settings that extends evaluation along two axes, resolution scope and knowledge scope; also proposes SearchSWE, a framework that combines deep search with coding abilities to analyze systematically how external knowledge affects performance. Result: Frontier models stay below 45% success on BeyondSWE and no single model performs consistently across task types; SearchSWE experiments show that search augmentation yields inconsistent gains and sometimes degrades performance. Conclusion: Current code agents fall well short on realistic, complex tasks; new paradigms and evaluation standards closer to developer workflows (interleaving search and reasoning) are needed, and BeyondSWE and SearchSWE provide tools and empirical evidence toward that goal. Abstract: Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.

[39] Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Dadi Guo,Yuejin Xie,Qingyu Liu,Jiayu Liu,Zhiyuan Fan,Qihan Ren,Shuai Shao,Tianyi Zhou,Dongrui Liu,Yi R. Fung

Main category: cs.CL

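TL;DR: This paper proposes a multi-agent framework in which code agents autonomously evolve math problems, using a code execution environment to validate the solvability and increased difficulty of the generated problems; experiments show that the method can generate high-quality problems that are structurally novel and more difficult.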

Details Motivation: As LLMs advance toward IMO-level mathematical ability, challenging, high-quality problems for training and evaluation are scarce; meanwhile, code agents have shown strong agentic coding and reasoning skills, suggesting that code execution can serve as a scalable environment for mathematical experimentation. Method: Introduces a multi-agent framework in which code agents autonomously evolve existing math problems into more complex variants and validate solvability and increased difficulty through execution. Result: Given sufficient test-time exploration, code agents can synthesize new problems that are structurally distinct from the originals, more difficult, and still solvable. Conclusion: Code-driven agents can serve as a viable mechanism for generating high-difficulty mathematical reasoning problems in scalable computational environments. Abstract: As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.

[40] Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Aradhye Agarwal,Gurdit Siyan,Yash Pandya,Joykirat Singh,Akshay Nambi,Ahmed Awadallah

Main category: cs.CL

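TL;DR: This paper proposes MOSAIC, a framework that explicitly models safety reasoning and refusal actions and combines them with preference-based reinforcement learning to improve the safety of agentic language models in multi-step tool use without trajectory-level labels, substantially reducing harmful behavior and privacy leakage while preserving benign task performance.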

Details Motivation: Existing alignment methods are optimized for static generation and single-step tasks and break down in agentic settings that require long-horizon planning, tool calls, and sequential decisions, where risks include irreversible harm (such as accessing files or entering credentials by mistake) and adversarial tool feedback. Method: Proposes MOSAIC, a post-training framework that structures inference as a plan, check, then act or refuse loop with explicit safety reasoning and refusal as first-class actions; uses preference-based reinforcement learning with pairwise trajectory comparisons, which avoids the difficulty scalar rewards have in capturing safety distinctions. Result: In zero-shot evaluation on three models (Qwen2.5-7B, Qwen3-4B-Thinking, Phi-4), MOSAIC reduces harmful behavior by up to 50%, raises harmful-task refusal under injection attacks by over 20%, reduces privacy leakage, and preserves or improves benign task performance. Conclusion: MOSAIC achieves robust safety alignment across models, domains, and multi-step agentic settings, supporting the effectiveness of explicit safety decisions and preference learning for aligning agentic language models. Abstract: Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.

[41] Using Learning Progressions to Guide AI Feedback for Science Learning

Xin Xia,Nejla Yuruk,Yun Wang,Xiaoming Zhai

Main category: cs.CL

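TL;DR: This study examines whether rubrics generated automatically from learning progressions (LPs) can replace expert-authored, task-specific rubrics for producing high-quality AI feedback. The two approaches show no significant differences in Clarity, Relevance, Engagement and Motivation, or Reflectiveness, indicating that the LP-driven approach is feasible.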

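Details Motivation: Most AI-generated feedback relies on task-specific rubrics written by domain experts, which is time-consuming and hard to scale; learning progressions (LPs) provide a theoretically grounded representation of students' developing understanding and may offer an alternative. Method: Compares two AI feedback pipelines: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric derived automatically from a learning progression. Feedback was generated for written scientific explanations by 207 middle school students in a chemistry task, and two coders rated feedback quality with a multi-dimensional rubric (Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness; 10 sub-dimensions). Result: Paired t-tests show no statistically significant differences between the pipelines for Clarity, Relevance, Engagement and Motivation, or Reflectiveness (p > 0.05); no significance result is reported for Accuracy; inter-rater agreement was high (percent agreement 89% to 100%, Cohen's kappa = 0.66 to 0.88). Conclusion: AI feedback guided by rubrics derived automatically from learning progressions is comparable in quality to feedback guided by expert-authored rubrics, offering a feasible path to scalable, transferable AI feedback in education. Abstract: Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students' developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen's kappa for estimable dimensions ranging from .66 to .88. 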
Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.
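
The two inter-rater statistics quoted above, percent agreement and Cohen's kappa, can be computed as in the following minimal sketch. The coder labels are hypothetical and the helper names are illustrative; the study's own coding data and tooling are not described in the abstract. The explicit error for an expected agreement of 1 reflects one common reason a kappa value may not be estimable, which is an assumption about what "estimable dimensions" refers to.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two coders assigned the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each coder's marginal proportions, summed over codes.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both coders used a single identical code, so kappa has a zero denominator.
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary codes from two coders on ten feedback items.
coder_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
coder_b = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
agreement = percent_agreement(coder_a, coder_b)
kappa = cohens_kappa(coder_a, coder_b)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.3f}")
```

Kappa is lower than raw agreement here because both coders assign the majority code most of the time, so much of their agreement would be expected by chance.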

cs.CV [Back]

[42] CamDirector: Towards Long-Term Coherent Video Trajectory Editing

Zhihao Shi,Kejia Yin,Weilin Wan,Yuhongze Zhou,Yuanhao Yu,Xinxin Zuo,Qiang Sun,Juwei Lu

Main category: cs.CV

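TL;DR: This paper proposes a new video trajectory editing (VTE) framework that uses a hybrid warping scheme and a history-guided autoregressive diffusion model to achieve more precise camera control and long-term temporal consistency, reaching state-of-the-art performance on the newly built iPhone-PTZ benchmark.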

Details Motivation: Existing VTE methods fall short on precise camera control and long-range temporal consistency, mainly because they rely on limited-capacity embeddings or on single-frame warping with only implicit cross-frame aggregation. Method: 1) A hybrid warping scheme: static regions are fused into a "world cache" and rendered to the target poses, dynamic regions are warped directly, and their fusion produces globally consistent coarse frames that guide refinement; 2) a history-guided autoregressive video diffusion model processes video segments together with their history, while the world cache is updated incrementally to reinforce already inpainted content. Result: Achieves state-of-the-art performance with fewer parameters on the new iPhone-PTZ benchmark, which features diverse camera motions and large trajectory variations. Conclusion: The framework improves camera controllability and long-term consistency in VTE, supporting the value of explicit cross-frame aggregation and history-guided modeling. Abstract: Video (camera) trajectory editing aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework that 1) explicitly aggregates information across the entire source video via a hybrid warping scheme. Specifically, static regions are progressively fused into a world cache then rendered to target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement. 2) processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.

[43] Social-JEPA: Emergent Geometric Isomorphism

Haoran Zhang,Youjin Wang,Yi Duan,Rong Fu,Dianyu Zhao,Sicheng Fan,Shuaishuai Cao,Wentao Guo,Xiao Zhou

Main category: cs.CV

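TL;DR: This paper finds that independent agents learning world models of the same environment from different viewpoints, with no parameter sharing or coordination, end up with latent spaces related by an approximate linear isometry, enabling seamless cross-view alignment and transfer of representations.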

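Details Motivation: To explore how multiple agents in a decentralized setting, without coordination or parameter sharing, can spontaneously arrive at consistent representation structures through self-supervised predictive learning, enabling interoperability between vision systems. Method: Two independent agents each learn a world model (e.g., a JEPA-style architecture) from a different viewpoint (e.g., a different camera position), trained only with a future-prediction objective and with no explicit alignment constraint or parameter sharing. Result: After training, the two latent spaces are related by an approximate linear isometry; this geometric consistency survives large viewpoint shifts and scant pixel overlap; the alignment allows a classifier to be transferred with no additional gradient steps, accelerates later learning, and markedly reduces total compute. Conclusion: Predictive learning objectives themselves impose strong geometric regularities, suggesting a lightweight path to representation interoperability among decentralized vision systems. Abstract: World models compress rich sensory streams into compact latent codes that anticipate future observations. We let separate agents acquire such models from distinct viewpoints of the same environment without any parameter sharing or coordination. After training, their internal representations exhibit a striking emergent property: the two latent spaces are related by an approximate linear isometry, enabling transparent translation between them. This geometric consensus survives large viewpoint shifts and scant overlap in raw pixels. Leveraging the learned alignment, a classifier trained on one agent can be ported to the other with no additional gradient steps, while distillation-like migration accelerates later learning and markedly reduces total compute. The findings reveal that predictive learning objectives impose strong regularities on representation geometry, suggesting a lightweight path to interoperability among decentralized vision systems. The code is available at https://anonymous.4open.science/r/Social-JEPA-5C57.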

[44] From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification

Vasiliy Kudryavtsev,Kirill Borodin,German Berezin,Kirill Bubenchikov,Grach Mkrtchian,Alexander Ryzhkov

Main category: cs.CV

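TL;DR: This paper proposes a multimodal verification framework that fuses visual features with semantic identity priors extracted from synthetic textual descriptions, significantly improving large-scale pet re-identification.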

Details Motivation: Existing automated animal identification systems are limited by small datasets and reliance on unimodal visual cues, making practical tasks such as reuniting lost pets difficult. Method: Builds a large training corpus of 1.9 million photographs covering 695,091 unique animals; selects SigLIP2-Giant and E5-Small-v2 as the best vision and text encoders; compares several fusion strategies and adopts a gated fusion mechanism. Result: Achieves 84.28% Top-1 accuracy and an Equal Error Rate of 0.0422 on a comprehensive test protocol, an 11% improvement over leading unimodal baselines. Conclusion: Synthesized semantic descriptions significantly refine decision boundaries, and multimodal fusion, gated fusion in particular, plays a key role in large-scale pet re-identification. Abstract: Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.
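
The Equal Error Rate reported above is the operating point of a verification system at which the false accept rate and the false reject rate coincide. The sketch below shows one simple way to estimate it from similarity scores; the scores are hypothetical, the function name is illustrative, and the abstract does not describe the paper's exact evaluation protocol.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping thresholds over all observed scores.

    genuine:  similarity scores for same-animal pairs (should be accepted)
    impostor: similarity scores for different-animal pairs (should be rejected)
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accept rate
        frr = sum(s < t for s in genuine) / len(genuine)     # false reject rate
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical similarity scores, not values from the paper.
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

With finite score sets the two error rates rarely cross exactly, so this sketch averages them at the closest threshold; interpolating along the ROC curve is a common alternative.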

[45] Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection

Yaoteng Zhang,Zhou Qing,Junyu Gao,Qi Wang

Main category: cs.CV

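TL;DR: This paper proposes PDP, a prompt-decoupled framework for incremental object detection that uses dual-pool prompt decoupling and a prototypical pseudo-label generation module to mitigate prompt coupling and drift, achieving state-of-the-art performance on MS-COCO and PASCAL VOC.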

Details Motivation: Existing prompt-based incremental object detection methods suffer from prompt degradation caused by prompt coupling and prompt drift, which hurts continual learning. Method: Proposes PDP: 1) a dual-pool prompt decoupling paradigm (a shared pool captures task-general knowledge, a private pool learns task-specific features); 2) a Prototypical Pseudo-Label Generation (PPG) module that dynamically updates class prototypes and filters high-quality pseudo-labels to keep supervision consistent. Result: Improves AP by 9.2% on MS-COCO and by 3.3% on PASCAL VOC, clearly outperforming existing methods. Conclusion: PDP balances stability and plasticity and offers a new approach to replay-free, parameter-efficient incremental object detection. Abstract: Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt-based methods have gained popularity for their replay-free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. PDP innovatively designs a dual-pool prompt decoupling paradigm, which consists of a shared pool used to capture task-general knowledge for forward transfer, and a private pool used to learn task-specific discriminative features. This paradigm explicitly separates task-general and task-specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo-Label Generation (PPG) module. PPG can dynamically update the class prototype space during training and use the class prototypes to further filter valuable pseudo-labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state-of-the-art performance on MS-COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity. The code and dataset are released at: https://github.com/zyt95579/PDP_IOD/tree/main

[46] AutoFFS: Adversarial Deformations for Facial Feminization Surgery Planning

Paul Friedrich,Florentin Bieder,Florian M. Thieringer,Philippe C. Cattin

Main category: cs.CV

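TL;DR: This paper proposes AutoFFS, a framework that generates counterfactual skull morphologies through adversarial free-form deformations, providing a quantitative, reproducible basis for preoperative planning of facial feminization surgery for transgender patients.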

Details Motivation: Facial feminization surgery (FFS) currently relies on subjective clinical assessment and lacks quantitative, reproducible anatomical guidance. Method: Proposes AutoFFS, which performs a deformation-based targeted adversarial attack on an ensemble of pre-trained binary sex classifiers to generate counterfactual skull morphologies shifted toward the target sex. Result: Classifier-based evaluation and a human perceptual study confirm that the generated counterfactual skull morphologies exhibit target sex characteristics. Conclusion: AutoFFS provides a quantifiable foundation for preoperative planning in FFS and advances clinical support for transgender and gender diverse patients. Abstract: Facial feminization surgery (FFS) is a key component of gender affirmation for transgender and gender diverse patients, aiming to reshape craniofacial structures toward a female morphology. Current surgical planning procedures largely rely on subjective clinical assessment, lacking quantitative and reproducible anatomical guidance. We therefore propose AutoFFS, a novel data-driven framework that generates counterfactual skull morphologies through adversarial free-form deformations. Our method performs a deformation-based targeted adversarial attack on an ensemble of pre-trained binary sex classifiers that learned sexual dimorphism, effectively transforming individual skull shapes toward the target sex. The generated counterfactual skull morphologies provide a quantitative foundation for preoperative planning in FFS, driving advances in this largely overlooked patient group. We validate our approach through classifier-based evaluation and a human perceptual study, confirming that the generated morphologies exhibit target sex characteristics.

[47] HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

Lei Yao,Yong Chen,Yuejiao Su,Yi Wang,Moyun Liu,Lap-Pui Chau

Main category: cs.CV

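TL;DR: This paper proposes HAMMER, a framework that uses multimodal large language models (MLLMs) to localize 3D object affordance regions from interaction intention; a contact-aware embedding, hierarchical cross-modal integration, and a multi-granular geometry lifting module improve semantic understanding and 3D localization accuracy, and experiments on multiple datasets show its superiority and robustness.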

Details Motivation: Humans identify the affordance of 3D objects by observing interactions in images or videos and generalize this knowledge to novel objects; existing methods depend on explicit attribute descriptions or 2D segmenters and do not fully exploit interaction intention and contextual semantics. Method: Proposes HAMMER: 1) builds a contact-aware interaction intention embedding; 2) designs a hierarchical cross-modal integration mechanism to refine 3D representations; 3) introduces a multi-granular geometry lifting module that infuses spatial characteristics into the intention embedding for accurate 3D affordance localization. Result: On public datasets and a newly constructed corrupted benchmark, HAMMER outperforms existing approaches and is more robust. Conclusion: By modeling interaction intention and 3D geometry end to end, HAMMER offers a more natural and robust approach to 3D affordance grounding; code and weights are publicly available. Abstract: Humans commonly identify 3D object affordance through observed interactions in images or videos, and once formed, such knowledge can be generically generalized to novel objects. Inspired by this principle, we advocate for a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we alternatively aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches. All code and weights are publicly available.

[48] MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry

Leo Kaixuan Cheng,Abdus Shaikh,Ruofan Liang,Zhijie Wu,Yushi Guan,Nandita Vijaykumar

Main category: cs.CV

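TL;DR: MERG3R is a training-free divide-and-conquer framework that reorders and partitions images, reconstructs subsets locally, and merges them through global alignment, overcoming the GPU memory limits of neural geometry models and improving the accuracy, memory efficiency, and scalability of 3D reconstruction from large unordered image collections.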

Details Motivation: Neural visual geometry models based on full attention (such as VGGT and Pi3) are limited by GPU memory and cannot handle large unordered image collections. Method: MERG3R takes a training-free divide-and-conquer approach: it first reorders the unordered images in a geometry-aware way and partitions them into overlapping subsets, then reconstructs each subset independently, and finally merges the local results through global alignment and confidence-weighted bundle adjustment. Result: On large-scale datasets including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R improves reconstruction accuracy, memory efficiency, and scalability, supporting high-quality 3D reconstruction beyond GPU memory capacity. Conclusion: MERG3R is a model-agnostic framework that decouples the performance of neural geometry models from hardware memory limits, offering a practical approach to large-scale scene reconstruction. Abstract: Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets, including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.
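
The role of overlap in the partitioning step can be seen in a small sketch. It covers only the "divide" part and assumes the images have already been placed in some order; the geometry-aware reordering, the per-subset reconstruction, and the global alignment described in the abstract are omitted. The function name, window size, and overlap are hypothetical choices, not MERG3R's actual parameters.

```python
def overlapping_partitions(items, size, overlap):
    """Split an ordered list into consecutive windows that share `overlap` items.

    Shared items give neighbouring subsets common views, which a later merge
    step could use to align independently reconstructed pieces.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    parts = []
    for start in range(0, len(items), step):
        parts.append(items[start:start + size])
        if start + size >= len(items):
            break  # the last window already reaches the end of the list
    return parts


# Ten hypothetical image indices, windows of four images sharing one image.
image_ids = list(range(10))
parts = overlapping_partitions(image_ids, size=4, overlap=1)
print(parts)
```

Each window can then be processed within a fixed memory budget, since the cost of full attention depends on the window size rather than on the size of the whole collection.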

[49] Beyond Caption-Based Queries for Video Moment Retrieval

David Pujol-Perich,Albert Clapés,Dima Damen,Sergio Escalera,Michael Wray

Main category: cs.CV

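TL;DR: This paper studies the performance degradation of existing video moment retrieval (VMR) methods, DETR architectures in particular, when trained on caption-based queries but evaluated on search queries; it introduces three new benchmarks that expose a language gap and a multi-moment gap, and substantially improves retrieval on search queries by addressing decoder-query collapse.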

Details Motivation: Existing VMR methods perform well on caption-based queries but generalize poorly to more practical, terser search queries, which calls for a systematic analysis of the degradation and ways to fix it. Method: Builds three new benchmarks by modifying the textual queries of HD-EPIC, YouCook2, and ActivityNet-Captions; identifies the language gap and the multi-moment gap; discovers and mitigates an "active decoder-query collapse" through architectural modifications that increase the number of active decoder queries. Result: Improves mAP_m by up to 14.82% on search queries and by up to 21.83% on multi-moment search queries. Conclusion: Linguistic under-specification and multi-moment structure are the key challenges for generalizing VMR models to real search scenarios; decoder-query collapse is a core bottleneck, and the proposed architectural changes mitigate it and improve performance. Abstract: In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. For this, we introduce three benchmarks by modifying the textual queries in three public VMR datasets -- i.e., HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) A language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures -- an active decoder-query collapse -- as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and up to 21.83% mAP_m on multi-moment search queries. The code, models and data are available on the project webpage: https://davidpujol.github.io/beyond-vmr/

[50] Retrieving Patient-Specific Radiomic Feature Sets for Transparent Knee MRI Assessment

Yaxi Chen,Simin Ni,Jingjing Zhang,Shaheer U. Saeed,Yipei Wang,Aleksandra Ivanova,Rikin Hargunani,Chaozong Liu,Jie Huang,Yipeng Hu

Main category: cs.CV

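TL;DR: This paper proposes a patient-specific radiomic feature selection framework that replaces conventional marginal top-k selection with a two-stage retrieval strategy (randomly sampling candidate feature sets, then ranking them with a learned scoring function), improving feature complementarity and diagnostic performance while maintaining high interpretability.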

Details Motivation: Classical radiomics relies on predefined feature sets and is transparent but limited in performance; adaptive radiomics uses DL to predict feature weights and keeps the top-k, which tends to admit redundant features and overlook complementary relationships between features. Method: Proposes a patient-specific compact feature-set selection framework: stage one randomly samples diverse k-feature subsets; stage two ranks the candidates with a learned feature-set scoring function and selects the best subset; a separate classifier performs the final diagnosis. Result: On ACL tear detection and KL grading for osteoarthritis, the method outperforms the top-k baseline at the same k and is competitive with end-to-end deep learning models, while producing auditable feature sets that link predictions to specific anatomical regions and radiomic families. Conclusion: The two-stage retrieval strategy approximates exhaustive search and improves radiomic diagnostic performance while keeping high transparency and clinical interpretability. Abstract: Classical radiomic features are designed to quantify image appearance and intensity patterns. Compared with end-to-end deep learning (DL) models trained for disease classification, radiomics pipelines with low-dimensional parametric classifiers offer enhanced transparency and interpretability, yet often underperform because of the reliance on population-level predefined feature sets. Recent work on adaptive radiomics uses DL to predict feature weights over a radiomic pool, then thresholds these weights to retain the top-k features from a large radiomic pool F (often ~10^3). However, such marginal ranking can over-admit redundant descriptors and overlook complementary feature interactions. We propose a patient-specific feature-set selection framework that predicts a single compact feature set per subject, targeting complementary and diverse evidence rather than marginal top-k features. To overcome the intractable combinatorial search space of F choose k features, our method utilizes a 2-stage retrieval strategy: randomly sample diverse candidate feature sets, then rank these sets with a learned scoring function to select a high-performing feature set for the specific patient. The system consists of a feature-set scorer, and a classifier that performs the final diagnosis. We empirically show that the proposed two-stage retrieval approximates the original exhaustive all k-feature selection. 
Validated on tasks including ACL tear detection and KL grading for osteoarthritis, the method outperforms the top-k approach with the same k values and is competitive with end-to-end DL models while maintaining high transparency. The model generates auditable feature sets that link clinical outcomes to specific anatomical regions and radiomic families, allowing clinicians to inspect which anatomical structures and quantitative descriptors drive the prediction.
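
The two-stage retrieval idea (sample candidate feature sets at random, then rank them with a set-level scorer) can be sketched as follows. Everything here is a stand-in: in the paper the scorer is learned and patient-specific, whereas `toy_score` is a hand-written hypothetical function that rewards per-feature relevance plus diversity across made-up feature families, and the pool size, k, and candidate count are arbitrary.

```python
import random


def sample_candidate_sets(pool_size, k, n_candidates, seed=0):
    """Stage 1: draw random k-feature subsets from a pool of feature indices."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_candidates):
        seen.add(tuple(sorted(rng.sample(range(pool_size), k))))
    return sorted(seen)  # de-duplicated, in a deterministic order


def select_feature_set(candidates, score_fn):
    """Stage 2: rank the candidate sets with a set-level scorer and keep the best."""
    return max(candidates, key=score_fn)


# Hypothetical stand-in for a learned scorer: 40 features in 5 families of 8.
RELEVANCE = [((7 * i) % 10) / 10 for i in range(40)]


def toy_score(feature_set):
    families = {i // 8 for i in feature_set}
    # Scoring the whole set lets diversity count, which marginal top-k cannot do.
    return sum(RELEVANCE[i] for i in feature_set) + 0.5 * len(families)


candidates = sample_candidate_sets(pool_size=40, k=5, n_candidates=300)
best = select_feature_set(candidates, toy_score)
print(best, round(toy_score(best), 2))
```

The design point is that only a few hundred sets are scored instead of all F-choose-k combinations, trading exhaustiveness for tractability; how closely this approximates the exhaustive optimum depends on the scorer and the number of candidates.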

[51] Cultural Counterfactuals: Evaluating Cultural Biases in Large Vision-Language Models with Counterfactual Examples

Phillip Howard,Xin Su,Kathleen C. Fraser

Main category: cs.CV

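TL;DR: This paper introduces Cultural Counterfactuals, a dataset for detecting implicit biases of large vision-language models (LVLMs) along cultural dimensions such as religion, nationality, and socioeconomic status; high-quality counterfactual images produced by image editing allow precise quantification of the effect of cultural context.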

Details Motivation: Prior work focuses mostly on biases tied to visual traits (such as race and gender), while culture-related biases that cannot be read from appearance (such as religion and socioeconomic status) lack systematic evaluation methods and suitable datasets. Method: Builds a synthetic dataset, Cultural Counterfactuals, containing nearly 60k counterfactual images; uses an image-editing model to place people of different demographics into real cultural context images, producing sets that depict the same person in multiple cultural contexts. Result: The dataset is used to quantify cultural biases of several popular LVLMs with respect to religion, nationality, and socioeconomic status, demonstrating its utility. Conclusion: Cultural Counterfactuals fills a data gap in cultural bias evaluation and provides a reliable tool for more comprehensive and fair evaluation of LVLMs. Abstract: Large Vision-Language Models (LVLMs) have grown increasingly powerful in recent years, but can also exhibit harmful biases. Prior studies investigating such biases have primarily focused on demographic traits related to the visual characteristics of a person depicted in an image, such as their race or gender. This has left biases related to cultural differences (e.g., religion, socioeconomic status), which cannot be readily discerned from an individual's appearance alone, relatively understudied. A key challenge in measuring cultural biases is that determining which group an individual belongs to often depends upon cultural context cues in images, and datasets annotated with cultural context cues are lacking. To address this gap, we introduce Cultural Counterfactuals: a high-quality synthetic dataset containing nearly 60k counterfactual images for measuring cultural biases related to religion, nationality, and socioeconomic status. To ensure that cultural contexts are accurately depicted, we generate our dataset using an image-editing model to place people of different demographics into real cultural context images. This enables the construction of counterfactual image sets which depict the same person in multiple different contexts, allowing for precise measurement of the impact that cultural context differences have on LVLM outputs. We demonstrate the utility of Cultural Counterfactuals for quantifying cultural biases in popular LVLMs.

[52] Aligning Fetal Anatomy with Kinematic Tree Log-Euclidean PolyRigid Transforms

Yingcheng Liu,Athena Taymourtash,Yang Liu,Esra Abaci Turk,William M. Wells,Leo Joskowicz,P. Ellen Grant,Polina Golland

Main category: cs.CV

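TL;DR: This paper proposes a differentiable volumetric body model based on SMPL, driven by a new Kinematic Tree-based Log-Euclidean PolyRigid (KTPolyRigid) transform that addresses Lie algebra ambiguities and folding of the volumetric mapping under large articulated motions. Its effectiveness for registration and organ segmentation is validated on fetal MRI data.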

Details Motivation: Existing surface-based body models ignore internal volumetric structures, and their deformation methods lack anatomical consistency guarantees. Method: Proposes a differentiable volumetric body model based on SMPL and introduces the KTPolyRigid transform, which resolves Lie algebra ambiguities in large, non-local articulated motions and encourages smooth, bijective volumetric mappings. Result: On 53 fetal MRI volumes, KTPolyRigid significantly reduces folding artifacts in the deformation fields and supports robust groupwise image registration and label-efficient, template-based fetal organ segmentation. Conclusion: The framework offers a robust foundation for standardized volumetric analysis of articulated bodies in medical imaging. Abstract: Automated analysis of articulated bodies is crucial in medical imaging. Existing surface-based models often ignore internal volumetric structures and rely on deformation methods that lack anatomical consistency guarantees. To address this problem, we introduce a differentiable volumetric body model based on the Skinned Multi-Person Linear (SMPL) formulation, driven by a new Kinematic Tree-based Log-Euclidean PolyRigid (KTPolyRigid) transform. KTPolyRigid resolves Lie algebra ambiguities associated with large, non-local articulated motions, and encourages smooth, bijective volumetric mappings. Evaluated on 53 fetal MRI volumes, KTPolyRigid yields deformation fields with significantly fewer folding artifacts. Furthermore, our framework enables robust groupwise image registration and a label-efficient, template-based segmentation of fetal organs. It provides a robust foundation for standardized volumetric analysis of articulated bodies in medical imaging.

[53] Advancing Earth Observation Through Machine Learning: A TorchGeo Tutorial

Caleb Robinson,Nils Lehmann,Adam J. Stewart,Burak Ekim,Heng Fang,Isaac A. Corley,Mauricio Cordeiro

Main category: cs.CV

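TL;DR: This paper presents TorchGeo, a PyTorch-based machine learning library for Earth observation, through a tutorial that covers its core abstractions and an end-to-end multispectral water segmentation case study.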

Details Motivation: Earth observation machine learning pipelines differ from standard computer vision: they must handle large georeferenced imagery, labels in multiple formats, and spatially aware sampling strategies, which calls for dedicated tooling. Method: Presents the TorchGeo library, which provides datasets, samplers, transforms, and pre-trained models, and demonstrates it in two Jupyter notebook tutorials: one with code examples of the core functionality, and one end-to-end water semantic segmentation case study using Sentinel-2 imagery and the Earth Surface Water dataset. Result: Covers the full workflow of training a semantic segmentation model, running inference on a Sentinel-2 scene over Rio de Janeiro, and saving the predictions as a GeoTIFF for further geospatial analysis. Conclusion: TorchGeo simplifies the use of remote sensing data in machine learning and improves the efficiency and reproducibility of modeling and deployment for Earth observation tasks. Abstract: Earth observation machine learning pipelines differ fundamentally from standard computer vision workflows. Imagery is typically delivered as large, georeferenced scenes, labels may be raster masks or vector geometries in distinct coordinate reference systems, and both training and evaluation often require spatially aware sampling and splitting strategies. TorchGeo is a PyTorch-based domain library that provides datasets, samplers, transforms and pre-trained models with the goal of making it easy to use geospatial data in machine learning pipelines. In this paper, we introduce a tutorial that demonstrates (1) the core TorchGeo abstractions through code examples, and (2) an end-to-end case study on multispectral water segmentation from Sentinel-2 imagery using the Earth Surface Water dataset. This demonstrates how to train a semantic segmentation model using TorchGeo datasets, apply the model to a Sentinel-2 scene over Rio de Janeiro, Brazil, and save the resulting predictions as a GeoTIFF for further geospatial analysis. The tutorial code itself is distributed as two Python notebooks: https://torchgeo.readthedocs.io/en/stable/tutorials/torchgeo.html and https://torchgeo.readthedocs.io/en/stable/tutorials/earth_surface_water.html.

[54] OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments

Hymalai Bello,Lala Ray,Joanna Sorysz,Sungho Suh,Paul Lukowicz

Main category: cs.CV

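TL;DR: OpenMarcie is a large multimodal dataset for human action monitoring in manufacturing environments. It contains wearable and camera data from 36 participants and supports three benchmark tasks: activity classification, open-vocabulary captioning, and cross-modal alignment.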

Details Motivation: Improving worker activity recognition in smart factories, and with it production efficiency and worker safety, requires high-quality multimodal action datasets that reflect real manufacturing settings. Method: Builds the large-scale multimodal OpenMarcie dataset around two experimental settings: bicycle assembly and disassembly (12 participants, no fixed protocol) and 3D printer assembly (24 participants with valid data, including collaborative correction). It collects over 37 hours of egocentric and exocentric, multipositional data with eight data types and more than 200 channels. Result: OpenMarcie is, to the authors' knowledge, the largest multimodal action monitoring dataset for manufacturing, and is benchmarked on activity classification, open-vocabulary captioning, and cross-modal alignment. Conclusion: OpenMarcie fills the gap in high-quality, diverse, collaborative human action datasets for manufacturing and provides a basis for multimodal action understanding and human-machine collaboration research in smart factories. Abstract: Smart factories use advanced technologies to optimize production and increase efficiency. To this end, the recognition of worker activity allows for accurate quantification of performance metrics, improving efficiency holistically while contributing to worker safety. OpenMarcie is, to the best of our knowledge, the biggest multimodal dataset designed for human action monitoring in manufacturing environments. It includes data from wearable sensing modalities and cameras distributed in the surroundings. The dataset is structured around two experimental settings, involving a total of 36 participants. In the first setting, twelve participants perform a bicycle assembly and disassembly task under semi-realistic conditions without a fixed protocol, promoting divergent and goal-oriented problem-solving. The second experiment involves twenty-five volunteers (24 with valid data) engaged in a 3D printer assembly task, with the 3D printer manufacturer's instructions provided to guide the volunteers in acquiring procedural knowledge. This setting also includes sequential collaborative assembly, where participants assess and correct each other's progress, reflecting real-world manufacturing dynamics. OpenMarcie includes over 37 hours of egocentric and exocentric, multimodal, and multipositional data, featuring eight distinct data types and more than 200 independent information channels. The dataset is benchmarked across three human activity recognition tasks: activity classification, open vocabulary captioning, and cross-modal alignment.

[55] From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness

My H. Dinh,Aditya Sant,Akshay Malhotra,Keya Patani,Shahab Hamidi-Rad

Main category: cs.CV

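TL;DR: This paper proposes Quantization-aware Dataset Distillation (QuADD), a unified framework that jointly optimizes dataset compactness and precision under a fixed bit budget. A differentiable quantization module enables end-to-end co-optimization, both uniform and adaptive non-uniform quantization are supported, and accuracy per bit improves on image classification and 3GPP beam management tasks.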

Details Motivation: Existing dataset distillation methods focus on reducing the number of samples and give little attention to data precision and its effect on efficiency. Method: Proposes the QuADD framework, which embeds a differentiable quantization module in the distillation loop so that synthetic samples and quantization parameters are optimized jointly end to end; analyzes, from a rate-distortion perspective, how bit allocation between sample count and precision affects performance; supports uniform and adaptive non-uniform quantization. Result: On image classification and 3GPP beam management tasks, QuADD outperforms existing dataset distillation and post-quantized baselines in accuracy per bit. Conclusion: QuADD sets a new standard for information-efficient dataset distillation and shows the value of jointly optimizing data compactness and precision. Abstract: Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and its impact on efficiency. We propose Quantization-aware Dataset Distillation (QuADD), a unified framework that jointly optimizes dataset compactness and precision under fixed bit budgets. QuADD integrates a differentiable quantization module within the distillation loop, enabling end-to-end co-optimization of synthetic samples and quantization parameters. Guided by the rate-distortion perspective, we empirically analyze how bit allocation between sample count and precision influences learning performance. Our framework supports both uniform and adaptive non-uniform quantization, where the latter learns quantization levels from data to represent information-dense regions better. Experiments on image classification and 3GPP beam management tasks show that QuADD surpasses existing DD and post-quantized baselines in accuracy per bit, establishing a new standard for information-efficient dataset distillation.
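
The trade-off between sample count and precision under a fixed bit budget can be illustrated with a short sketch. It is not the QuADD implementation: the function names, the 32x32x3 image size, and the plain (non-differentiable) uniform quantizer are assumptions made only for illustration.

```python
def samples_within_budget(total_bits, values_per_sample, bits_per_value):
    """Number of whole synthetic samples that fit in a fixed bit budget."""
    return total_bits // (values_per_sample * bits_per_value)


def uniform_quantize(x, bits, lo=0.0, hi=1.0):
    """Snap x to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    levels = 2 ** bits - 1
    clipped = min(max(x, lo), hi)
    step = (hi - lo) / levels
    return lo + round((clipped - lo) / step) * step


# Hypothetical budget: ten 32x32x3 images stored at 32 bits per value.
VALUES_PER_IMAGE = 32 * 32 * 3
budget = 10 * VALUES_PER_IMAGE * 32

# Fewer bits per value leaves room for more samples in the same budget.
allocation = {
    bits: samples_within_budget(budget, VALUES_PER_IMAGE, bits)
    for bits in (32, 8, 4, 2)
}
print(allocation)
print(uniform_quantize(0.4, bits=2))
```

Halving the precision doubles the number of samples that fit, at the cost of coarser values; the abstract's rate-distortion analysis concerns where along this curve learning performance is best.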

[56] TruckDrive: Long-Range Autonomous Highway Driving Dataset

Filippo Ghilotti,Edoardo Palladin,Samuel Brucker,Adam Sigal,Mario Bijelic,Felix Heide

Main category: cs.CV

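TL;DR: This paper introduces TruckDrive, a long-range multimodal driving dataset built for highway autonomy of heavy trucks. It addresses the lack of long-range perception (beyond 100 meters) in existing datasets and shows that current mainstream models degrade sharply on perception tasks beyond 150 meters.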

Details Motivation: Safe highway autonomy for heavy trucks remains challenging: long braking distances require scene understanding over hundreds of meters for anticipatory planning and safe braking, while existing driving datasets mainly cover short-range urban scenes (up to 100 meters) and offer no long-range perception benchmark. Method: Builds the TruckDrive dataset with seven long-range FMCW LiDARs (range and radial velocity), three high-resolution short-range LiDARs, eleven 8MP surround cameras with varying focal lengths, and ten 4D FMCW radars; it collects 475 thousand samples, 165 thousand of them densely annotated, supporting 2D detection up to 1,000 meters and 3D detection up to 400 meters, as well as depth estimation, tracking, planning, and end-to-end driving over 20-second highway sequences. Result: Current state-of-the-art autonomous driving models drop by 31% to 99% on 3D perception tasks beyond 150 meters, exposing a systematic long-range gap that current architectures and training signals cannot close. Conclusion: TruckDrive fills the gap in long-range highway perception data, reveals the long-range generalization bottleneck of existing models, and provides a benchmark for future long-range perception algorithms and model design. Abstract: Safe highway autonomy for heavy trucks remains an open and unsolved challenge: due to long braking distances, scene understanding of hundreds of meters is required for anticipatory planning and to allow safe braking margins. However, existing driving datasets primarily cover urban scenes, with perception effectively limited to short ranges of only up to 100 meters. To address this gap, we introduce TruckDrive, a highway-scale multimodal driving dataset, captured with a sensor suite purpose-built for long range sensing: seven long-range FMCW LiDARs measuring range and radial velocity, three high-resolution short-range LiDARs, eleven 8MP surround cameras with varying focal lengths and ten 4D FMCW radars. The dataset offers 475 thousand samples with 165 thousand densely annotated frames for driving perception benchmarking up to 1,000 meters for 2D detection and 400 meters for 3D detection, depth estimation, tracking, planning and end-to-end driving over 20-second sequences at highway speeds. We find that state-of-the-art autonomous driving models do not generalize to ranges beyond 150 meters, with drops between 31% and 99% in 3D perception tasks, exposing a systematic long-range gap that current architectures and training signals cannot close.

[57] DINOv3 Visual Representations for Blueberry Perception Toward Robotic Harvesting

Rui-Feng Wang,Daniel Petti,Yue Chen,Changying Li

Main category: cs.CV

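TL;DR: This paper evaluates DINOv3 as a frozen backbone for vision tasks related to blueberry robotic harvesting (fruit and bruise segmentation, fruit and cluster detection). It performs well on segmentation, while detection, and cluster detection in particular, is limited by target scale variation, patch discretization, and localization compatibility.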

Details Motivation: Vision foundation models generalize strongly in visual perception, but their practical role and performance limits in agricultural settings, especially blueberry robotic harvesting, remain unclear. Method: Uses DINOv3 as a frozen backbone with lightweight decoders under a unified protocol and evaluates it on fruit and bruise segmentation and on fruit and cluster detection in blueberry images. Result: Segmentation benefits from stable patch-level representations and improves with backbone size; detection is constrained by target scale variation, patch discretization, and localization compatibility; the failure of cluster detection highlights limited ability to model relational targets defined by spatial aggregation. Conclusion: DINOv3 is better suited as a semantic backbone than as an end-to-end task model; its effectiveness depends on whether downstream spatial modeling matches fruit scale and aggregation structure, which gives practical guidance for blueberry robotic harvesting. Abstract: Vision Foundation Models trained via large-scale self-supervised learning have demonstrated strong generalization in visual perception; however, their practical role and performance limits in agricultural settings remain insufficiently understood. This work evaluates DINOv3 as a frozen backbone for blueberry robotic harvesting-related visual tasks, including fruit and bruise segmentation, as well as fruit and cluster detection. Under a unified protocol with lightweight decoders, segmentation benefits consistently from stable patch-level representations and scales with backbone size. In contrast, detection is constrained by target scale variation, patch discretization, and localization compatibility. The failure of cluster detection highlights limitations in modeling relational targets defined by spatial aggregation. Overall, DINOv3 is best viewed not as an end-to-end task model, but as a semantic backbone whose effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, providing guidance for blueberry robotic harvesting. Code and dataset will be available upon acceptance.

[58] MIRAGE: Knowledge Graph-Guided Cross-Cohort MRI Synthesis for Alzheimer's Disease Prediction

Guanchen Wu,Zhe Huang,Yuzhang Xie,Runze Yan,Akul Chopra,Deqiang Qiu,Xiao Hu,Fei Wang,Carl Yang

Main category: cs.CV

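TL;DR: MIRAGE is a new framework that uses anatomy-guided cross-modal latent distillation over electronic health records (EHR) and a biomedical knowledge graph to produce reliable Alzheimer's disease diagnostic representations when MRI is missing. It avoids reconstructing 3D images and improves the classification rate by 13%.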

Details Motivation: MRI scans are expensive and often missing, which makes multimodal AD diagnosis models hard to deploy; synthesizing 3D MRI directly from EHR is risky and technically difficult. Method: Proposes the MIRAGE framework: (1) a knowledge graph and graph attention networks map EHR variables into a unified embedding space; (2) a frozen pre-trained 3D U-Net decoder serves as an auxiliary regularizer, combined with a cohort-aggregated skip feature compensation strategy, to impose anatomical structure constraints on the 1D latent representation; (3) inference uses only the distilled "diagnostic-surrogate" representation. Result: In cohorts without real MRIs, the AD classification rate improves by 13% over unimodal baselines. Conclusion: MIRAGE mitigates the missing-MRI problem through knowledge-guided latent distillation, enabling cross-modal diagnosis without high-risk 3D image generation. Abstract: Reliable Alzheimer's disease (AD) diagnosis increasingly relies on multimodal assessments combining structural Magnetic Resonance Imaging (MRI) and Electronic Health Records (EHR). However, deploying these models is bottlenecked by modality missingness, as MRI scans are expensive and frequently unavailable in many patient cohorts. Furthermore, synthesizing de novo 3D anatomical scans from sparse, high-dimensional tabular records is technically challenging and poses severe clinical risks. To address this, we introduce MIRAGE, a novel framework that reframes the missing-MRI problem as an anatomy-guided cross-modal latent distillation task. First, MIRAGE leverages a Biomedical Knowledge Graph (KG) and Graph Attention Networks to map heterogeneous EHR variables into a unified embedding space that can be propagated from cohorts with real MRIs to cohorts without them. To bridge the semantic gap and enforce physical spatial awareness, we employ a frozen pre-trained 3D U-Net decoder strictly as an auxiliary regularization engine. Supported by a novel cohort-aggregated skip feature compensation strategy, this decoder acts as a rigorous structural penalty, forcing 1D latent representations to encode biologically plausible, macro-level pathological semantics. By exclusively utilizing this distilled "diagnostic-surrogate" representation during inference, MIRAGE completely bypasses computationally expensive 3D voxel reconstruction. Experiments demonstrate that our framework successfully bridges the missing-modality gap, improving the AD classification rate by 13% compared to unimodal baselines in cohorts without real MRIs.

[59] ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering

Aymen Lassoued,Mohamed Ali Souibgui,Yousri Kessentini

Main category: cs.CV

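TL;DR: ORCA is a multi-agent collaborative reasoning framework for Document Visual Question Answering (DocVQA). It improves complex reasoning through task decomposition, cooperation among modality-specific agents, debate-based verification, and a final format check.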

Details Motivation: Existing vision-language models struggle with complex reasoning and multi-step workflows in DocVQA, particularly in decomposing questions and applying specialized processing to heterogeneous document elements. Method: Proposes the ORCA framework: (1) a reasoning agent decomposes the question into logical steps; (2) a routing mechanism activates modality-specific agents (e.g., text, layout, image) that work together; (3) a debate mechanism with stress-testing is applied, with thesis-antithesis adjudication when agents disagree; (4) a sanity checker ensures the answer format is consistent. Result: ORCA significantly outperforms state-of-the-art methods on three DocVQA benchmarks, supporting the effectiveness of multi-agent collaborative reasoning for vision-language tasks. Conclusion: ORCA offers a multi-agent collaborative paradigm for DocVQA and moves vision-language reasoning systems toward modular, specialized designs. Abstract: Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning.

[60] Deep Learning Based Wildfire Detection for Peatland Fires Using Transfer Learning

Emadeldeen Hamdan,Ahmad Faiz Tharima,Mohd Zahirasri Mohd Tohir,Dayang Nur Sakinah Musa,Erdem Koyuncu,Adam J. Watts,Ahmet Enis Cetin

Main category: cs.CV

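TL;DR: This paper proposes a transfer learning approach to peatland fire detection: a model pretrained on general wildfire imagery is fine-tuned on a Malaysian peatland dataset, which markedly improves detection accuracy and robustness under challenging conditions such as low-contrast smoke, occlusion, and changing illumination.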

Details Motivation: Conventional deep learning wildfire detectors are trained on open-flame forest fires and struggle with peatland fires, which are characterized by smoldering combustion, low flames, persistent smoke, and subsurface burning; labeled peatland fire data are also scarce. Method: Applies transfer learning: the network is initialized with pretrained weights from a general wildfire detection model and then fine-tuned on a dataset of Malaysian peatland images and videos. Result: Compared with training from scratch, the approach significantly improves detection accuracy and robustness, especially under low-contrast smoke, partial occlusion, and variable illumination. Conclusion: The approach is a practical, scalable solution for early peatland fire detection and could support real-time monitoring systems for fire prevention and environmental protection. Abstract: Machine learning (ML)-based wildfire detection methods have been developed in recent years, primarily using deep learning (DL) models trained on large collections of wildfire images and videos. However, peatland fires exhibit distinct visual and physical characteristics -- such as smoldering combustion, low flame intensity, persistent smoke, and subsurface burning -- that limit the effectiveness of conventional wildfire detectors trained on open-flame forest fires. In this work, we present a transfer learning-based approach for peatland fire detection that leverages knowledge learned from general wildfire imagery and adapts it to the peatland fire domain. We initialize a DL-based peatland fire detector using pretrained weights from a conventional wildfire detection model and subsequently fine-tune the network using a dataset composed of Malaysian peatland images and videos. This strategy enables effective learning despite the limited availability of labeled peatland fire data. Experimental results demonstrate that transfer learning significantly improves detection accuracy and robustness compared to training from scratch, particularly under challenging conditions such as low-contrast smoke, partial occlusions, and variable illumination. The proposed approach provides a practical and scalable solution for early peatland fire detection and has the potential to support real-time monitoring systems for fire prevention and environmental protection.

[61] Large-Scale Dataset and Benchmark for Skin Tone Classification in the Wild

Vitor Pereira Matias,Márcus Vinícius Lobo Costa,João Batista Neto,Tiago Novello de Brito

Main category: cs.CV

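TL;DR: This paper presents a comprehensive framework for skin tone fairness: the large-scale open dataset STW (labeled on the 10-tone MST scale), a benchmark of classic and deep learning methods, and a new model, SkinToneNet (a fine-tuned ViT), which achieves state-of-the-art generalization on out-of-domain data and supports fairness auditing of datasets such as CelebA and VGGFace2.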

Details Motivation: Existing skin tone analysis is limited by coarse scales (e.g., the 6-tone Fitzpatrick scale), small, private, or non-reproducible datasets, dated methods that rely on classic computer vision, and neglect of train-test leakage and class imbalance; a finer-grained, open, and robust evaluation framework is needed. Method: Builds the large-scale open skin tone dataset STW (42,313 images, 3,564 individuals, labeled on the 10-tone MST scale); systematically benchmarks classic computer vision (SkinToneCCV) and deep learning methods; proposes SkinToneNet, a fine-tuned ViT, and runs out-of-domain generalization and fairness auditing experiments. Result: Classic methods perform near chance; deep learning methods approach annotator accuracy; SkinToneNet achieves state-of-the-art generalization on out-of-domain data and is applied to fairness audits of CelebA and VGGFace2. Conclusion: The work sets a new benchmark for skin tone classification and fairness assessment and, through the STW dataset, careful benchmarking, and SkinToneNet, advances the reproducibility, accuracy, and practicality of fine-grained skin tone fairness research. Abstract: Deep learning models often inherit biases from their training data. While fairness across gender and ethnicity is well-studied, fine-grained skin tone analysis remains a challenge due to the lack of granular, annotated datasets. Existing methods often rely on the medical 6-tone Fitzpatrick scale, which lacks visual representativeness, use small, private datasets that prevent reproducibility, or rely on classic computer vision pipelines, with only a few using deep learning. They also overlook issues like train-test leakage and dataset imbalance. In this work, we present a comprehensive framework for skin tone fairness. First, we introduce the STW, a large-scale, open-access dataset comprising 42,313 images from 3,564 individuals, labeled using the 10-tone MST scale. Second, we benchmark both Classic Computer Vision (SkinToneCCV) and Deep Learning approaches, demonstrating that classic models provide near-random results, while deep learning reaches nearly annotator accuracy. Finally, we propose SkinToneNet, a fine-tuned ViT that achieves state-of-the-art generalization on out-of-domain data, which enables reliable fairness auditing of public datasets like CelebA and VGGFace2. This work provides state-of-the-art results in skin tone classification and fairness assessment. Code and data available soon.
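
The train-test leakage mentioned above typically arises when images of the same person land in both splits. A minimal sketch of an identity-disjoint split follows; the abstract does not describe how STW is actually split, so the function name, the split fraction, and the toy mapping are illustrative assumptions.

```python
import random


def split_by_individual(image_to_person, test_fraction=0.2, seed=0):
    """Split images so that no individual appears in both train and test."""
    people = sorted(set(image_to_person.values()))
    rng = random.Random(seed)
    rng.shuffle(people)
    n_test = max(1, round(len(people) * test_fraction))
    test_people = set(people[:n_test])
    train = [img for img, p in image_to_person.items() if p not in test_people]
    test = [img for img, p in image_to_person.items() if p in test_people]
    return train, test


# Toy mapping from image file to person ID (hypothetical data).
toy = {
    "a1.jpg": "A", "a2.jpg": "A",
    "b1.jpg": "B", "b2.jpg": "B",
    "c1.jpg": "C", "c2.jpg": "C",
}
train_images, test_images = split_by_individual(toy, test_fraction=1 / 3)
print(train_images, test_images)
```

Splitting at the image level instead would let a model see the same face in training and evaluation, which tends to inflate reported accuracy.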

[62] E2E-GNet: An End-to-End Skeleton-based Geometric Deep Neural Network for Human Motion Recognition

Mubarak Olaoluwa,Hassen Drira

Main category: cs.CV

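TL;DR: This paper proposes E2E-GNet, an end-to-end geometric deep neural network for skeleton-based action recognition. A geometric transformation layer and a distortion-aware optimization layer strengthen motion discrimination in non-Euclidean space and reduce projection distortion, improving recognition accuracy and efficiency.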

Details Motivation: Improve the discriminative power of skeleton-based action recognition in non-Euclidean space and overcome the loss of geometric information that traditional methods incur during projection. Method: Proposes E2E-GNet, which comprises a geometric transformation layer (jointly optimizing skeleton motion sequences and applying a differentiable logarithm map activation) and a distortion-aware optimization layer (limiting the skeleton shape distortion caused by projection). Result: Experiments on five datasets spanning three domains show that E2E-GNet outperforms existing methods at lower computational cost; ablation studies confirm the contribution of each module. Conclusion: E2E-GNet retains discriminative geometric cues and improves the accuracy and robustness of skeleton-based action recognition while keeping computational overhead low. Abstract: Geometric deep learning has recently gained significant attention in the computer vision community for its ability to capture meaningful representations of data lying in a non-Euclidean space. To this end, we propose E2E-GNet, an end-to-end geometric deep neural network for skeleton-based human motion recognition. To enhance the discriminative power between different motions in the non-Euclidean space, E2E-GNet introduces a geometric transformation layer that jointly optimizes skeleton motion sequences on this space and applies a differentiable logarithm map activation to project them onto a linear space. Building on this, we further design a distortion-aware optimization layer that limits skeleton shape distortions caused by this projection, enabling the network to retain discriminative geometric cues and achieve a higher motion recognition rate. We demonstrate the impact of each layer through ablation studies. Extensive experiments across five datasets spanning three domains show that E2E-GNet outperforms other methods with lower cost.

[63] ModalPatch: A Plug-and-Play Module for Robust Multi-Modal 3D Object Detection under Modality Drop

Shuangzhi Li,Lei Ma,Xingyu Li

Main category: cs.CV

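TL;DR: This paper proposes ModalPatch, a plug-and-play module that handles transient loss of arbitrary modalities in multi-modal 3D object detection. It requires no architectural changes or retraining and improves robustness and accuracy through temporal modeling and uncertainty-guided cross-modality fusion.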

Details Motivation: Multi-modal 3D object detection is critical for autonomous driving, but in practice sensor modalities can drop momentarily because of hardware glitches, adverse weather, or occlusion (especially when several modalities drop at once), leaving perception blind spots that threaten driving safety. Method: Proposes the ModalPatch module: (1) it exploits the temporal continuity of sensor data, using historical features to predict and compensate for the features of the currently missing modality; (2) it introduces an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of the compensated features, suppressing biased signals and reinforcing informative ones. Result: Across diverse modality-drop scenarios, ModalPatch consistently improves the robustness and accuracy of state-of-the-art 3D detectors without architectural changes or additional training. Conclusion: ModalPatch is presented as the first plug-and-play module for robust detection under arbitrary modality drop, offering a practical and general fault-tolerance solution for multi-modal perception systems. Abstract: Multi-modal 3D object detection is pivotal for autonomous driving, integrating complementary sensors like LiDAR and cameras. However, its real-world reliability is challenged by transient data interruptions and missing modalities, where modalities can momentarily drop due to hardware glitches, adverse weather, or occlusions. This poses a critical risk, especially during a simultaneous modality drop, where the vehicle is momentarily blind. To address this problem, we introduce ModalPatch, the first plug-and-play module designed to enable robust detection under arbitrary modality-drop scenarios. Without requiring architectural changes or retraining, ModalPatch can be seamlessly integrated into diverse detection frameworks. Technically, ModalPatch leverages the temporal nature of sensor data for perceptual continuity, using a history-based module to predict and compensate for transiently unavailable features. To improve the fidelity of the predicted features, we further introduce an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of compensated features, suppressing biased signals while reinforcing informative ones. Extensive experiments show that ModalPatch consistently enhances both robustness and accuracy of state-of-the-art 3D object detectors under diverse modality-drop conditions.
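
The two ideas in the method, predicting a dropped modality from its history and down-weighting the prediction when it is unreliable, can be sketched on toy vectors. This is not ModalPatch itself: linear extrapolation, inverse-variance weighting, and all names and numbers below are assumptions chosen to keep the example small.

```python
def extrapolate(history):
    """Predict the next feature vector by linear extrapolation from the last two."""
    prev, last = history[-2], history[-1]
    return [2 * b - a for a, b in zip(prev, last)]


def reliability_weighted_fusion(present, predicted, predicted_variance,
                                present_variance=1.0):
    """Blend two feature vectors with inverse-variance weights.

    A larger variance on the predicted features means lower reliability,
    so they contribute less to the fused result.
    """
    inv_pred = 1.0 / predicted_variance
    inv_pres = 1.0 / present_variance
    w_pred = inv_pred / (inv_pred + inv_pres)
    fused = [w_pred * p + (1.0 - w_pred) * q for p, q in zip(predicted, present)]
    return w_pred, fused


# Hypothetical scenario: the camera drops out at the current frame.
camera_history = [[0.0, 1.0], [1.0, 1.5]]
predicted_camera = extrapolate(camera_history)
lidar_now = [1.0, 1.0]
weight, fused = reliability_weighted_fusion(
    lidar_now, predicted_camera, predicted_variance=4.0
)
print(predicted_camera, weight, fused)
```

In the paper both the predictor and the reliability estimate are learned; the sketch only shows how an uncertainty value can gate the influence of compensated features.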

[64] WTHaar-Net: a Hybrid Quantum-Classical Approach

Vittorio Palladino,Tsai Idden,Ahmet Enis Cetin

Main category: cs.CV

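TL;DR: This paper proposes WTHaar-Net, a hybrid quantum-classical architecture that embeds the Haar Wavelet Transform (HWT) in a convolutional neural network. Compared with Hadamard-transform approaches, the HWT provides spatially localized, multi-resolution representations that better match the inductive biases of vision tasks, and it can be realized on near-term quantum hardware.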

Details Motivation: Structured linear transforms can be implemented efficiently in quantum computing, and CNNs need spatial locality and multi-scale features; together these motivate a better hybrid quantum-classical network architecture. Method: Replaces the Hadamard transform used in earlier hybrid architectures with the Haar Wavelet Transform (HWT) and designs a quantum circuit realization based on structured Hadamard gates; the transform is integrated into the CNN. Result: Substantial parameter reduction with competitive accuracy on CIFAR-10 and Tiny-ImageNet; better than ResNet and Hadamard-based baselines on Tiny-ImageNet; the quantum implementation is validated on IBM Quantum cloud hardware. Conclusion: The Haar wavelet transform suits quantum-classical CNN architectures for vision better than the Hadamard transform, and its structured quantum realization is compatible with near-term devices, pointing to a route toward lightweight, quantum-enhanced deep learning. Abstract: Convolutional neural networks rely on linear filtering operations that can be reformulated efficiently in suitable transform domains. At the same time, advances in quantum computing have shown that certain structured linear transforms can be implemented with shallow quantum circuits, opening the door to hybrid quantum-classical approaches for enhancing deep learning models. In this work, we introduce WTHaar-Net, a convolutional neural network that replaces the Hadamard Transform used in prior hybrid architectures with the Haar Wavelet Transform (HWT). Unlike the Hadamard Transform, the Haar transform provides spatially localized, multi-resolution representations that align more closely with the inductive biases of vision tasks. We show that the HWT admits a quantum realization using structured Hadamard gates, enabling its decomposition into unitary operations suitable for quantum circuits. Experiments on CIFAR-10 and Tiny-ImageNet demonstrate that WTHaar-Net achieves substantial parameter reduction while maintaining competitive accuracy. On Tiny-ImageNet, our approach outperforms both ResNet and Hadamard-based baselines. We validate the quantum implementation on IBM Quantum cloud hardware, demonstrating compatibility with near-term quantum devices.
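
For readers unfamiliar with the Haar transform, one level of the orthonormal 1D version is easy to write out. The sketch below is a plain classical illustration, not the WTHaar-Net layer or its quantum circuit; the function names and test signal are assumptions. Each pair of inputs is combined by the 2x2 matrix [[1, 1], [1, -1]] / sqrt(2), which is the same matrix as a single-qubit Hadamard gate.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length list.

    Returns (approximation, detail): scaled pairwise sums and differences.
    """
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Invert haar_step by applying the same 2x2 butterfly to each pair."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


x = [4.0, 2.0, 5.0, 5.0]
approx, detail = haar_step(x)
restored = inverse_haar_step(approx, detail)
print(approx, detail, restored)
```

The detail coefficient for the flat pair (5, 5) is zero, which is the spatial locality the abstract contrasts with the Hadamard transform: a Haar coefficient responds only to the samples it covers. Applying `haar_step` again to `approx` gives the next, coarser resolution level.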

[65] SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data

Lekang Wen,Liang Liao,Jing Xiao,Mi Wang

Main category: cs.CV

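TL;DR: This paper proposes the SGMA framework to address three challenges in Incomplete Multimodal Semantic Segmentation (IMSS): modality imbalance, intra-class variation, and cross-modal heterogeneity. Two modules, Semantic-Guided Fusion (SGF) and Modality-Aware Sampling (MAS), provide balanced learning and robust fusion.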

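Details Motivation: Practical remote sensing systems often lose modalities because of sensor failures or incomplete coverage, and existing methods for Incomplete Multimodal Semantic Segmentation (IMSS) suffer from over-alignment, neglect of modality-specific cues, modality imbalance, and unmodeled intra-class variation and cross-modal conflict. Method: Proposes the Semantic-Guided Modality-Aware (SGMA) framework with two plug-and-play modules: (1) the Semantic-Guided Fusion (SGF) module extracts class-wise semantic prototypes across modalities, estimates the robustness of each modality, and fuses adaptively with robustness weights; (2) the Modality-Aware Sampling (MAS) module uses SGF's robustness estimates to dynamically reweight training samples and strengthen learning on fragile modalities. Result: Experiments on multiple datasets and backbones show that SGMA consistently outperforms state-of-the-art methods, with especially large gains on fragile modalities. Conclusion: Through semantic guidance, SGMA mitigates modality imbalance, intra-class variation, and cross-modal heterogeneity, offering a general and robust solution for incomplete multimodal remote sensing segmentation. Abstract: Multimodal semantic segmentation integrates complementary information from diverse sensors for remote sensing Earth observation. However, practical systems often encounter missing modalities due to sensor failures or incomplete coverage, termed Incomplete Multimodal Semantic Segmentation (IMSS). IMSS faces three key challenges: (1) multimodal imbalance, where dominant modalities suppress fragile ones; (2) intra-class variation in scale, shape, and orientation across modalities; and (3) cross-modal heterogeneity with conflicting cues producing inconsistent semantic responses. Existing methods rely on contrastive learning or joint optimization, which risk over-alignment that discards modality-specific cues or imbalanced training that favors robust modalities, while largely overlooking intra-class variation and cross-modal heterogeneity. To address these limitations, we propose the Semantic-Guided Modality-Aware (SGMA) framework, which ensures balanced multimodal learning while reducing intra-class variation and reconciling cross-modal inconsistencies through semantic guidance.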
SGMA introduces two complementary plug-and-play modules: (1) Semantic-Guided Fusion (SGF) module extracts multi-scale, class-wise semantic prototypes that capture consistent categorical representations across modalities, estimates per-modality robustness based on prototype-feature alignment, and performs adaptive fusion weighted by robustness scores to mitigate intra-class variation and cross-modal heterogeneity; (2) Modality-Aware Sampling (MAS) module leverages robustness estimations from SGF to dynamically reweight training samples, prioritizing challenging samples from fragile modalities to address modality imbalance. Extensive experiments across multiple datasets and backbones demonstrate that SGMA consistently outperforms state-of-the-art methods, with particularly significant improvements in fragile modalities.
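
The SGF idea of scoring each modality by how well its features align with a class prototype, then fusing with those scores as weights, can be sketched as follows. The abstract does not give the exact formulas, so cosine similarity, the softmax with a temperature, and every name and number here are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def robustness_weighted_fusion(features, prototype, temperature=0.1):
    """Weight each modality by prototype alignment and fuse the features.

    features: dict mapping modality name -> feature vector.
    Returns (weights, fused_vector); the weights sum to 1.
    """
    scores = {m: cosine(f, prototype) for m, f in features.items()}
    top = max(scores.values())
    exps = {m: math.exp((s - top) / temperature) for m, s in scores.items()}
    total = sum(exps.values())
    weights = {m: e / total for m, e in exps.items()}
    fused = [
        sum(weights[m] * features[m][i] for m in features)
        for i in range(len(prototype))
    ]
    return weights, fused


# Hypothetical per-pixel features from two modalities and one class prototype.
features = {"optical": [1.0, 0.0], "sar": [0.0, 1.0]}
prototype = [1.0, 0.2]
weights, fused = robustness_weighted_fusion(features, prototype)
print(weights, fused)
```

The same robustness scores could drive sample reweighting in the spirit of MAS, for example by giving more weight to samples whose modality scores are low.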

[66] Beyond Anatomy: Explainable ASD Classification from rs-fMRI via Functional Parcellation and Graph Attention Networks

Syeda Hareem Madani,Noureen Bibi,Adam Rafiq Jeraj,Sumra Khan,Anas Zafar,Rizwan Qureshi

Main category: cs.CV

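TL;DR: This paper proposes a graph neural network framework that compares anatomical (AAL) and functional (MSDL) brain parcellation strategies and reaches 95.0% accuracy for Autism Spectrum Disorder (ASD) classification on ABIDE I. It shows that functional parcellation is the most important factor for performance, and explainability analyses confirm that the model focuses on key regions of the Default Mode Network.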

Details Motivation: Anatomical parcellations dominate rs-fMRI-based ASD classification, but their rigid boundaries may fail to capture the idiosyncratic functional connectivity patterns characteristic of ASD. Method: Builds a Graph Convolutional Network (GCN) and Graph Attention Network (GAT) framework comparing AAL (116 ROIs) with MSDL (39 ROIs); uses FSL preprocessing to handle multi-site heterogeneity; augments training samples with Gaussian noise; uses site-stratified 70/15/15 splits to prevent data leakage. Result: The GAT ensemble reaches 95.0% accuracy (AUC=0.98), outperforming recent GNN-based methods; switching to the functional atlas (MSDL) alone yields a 10.7-percentage-point accuracy gain; explainability analyses consistently identify the Posterior Cingulate Cortex and Precuneus as the key discriminative regions. Conclusion: Functionally derived parcellation characterizes ASD connectivity better than anatomical parcellation and is the most impactful modelling choice for rs-fMRI classification; model decisions reflect neuropathology rather than acquisition artefacts. Abstract: Anatomical brain parcellations dominate rs-fMRI-based Autism Spectrum Disorder (ASD) classification, yet their rigid boundaries may fail to capture the idiosyncratic connectivity patterns that characterise ASD. We present a graph-based deep learning framework comparing anatomical (AAL, 116 ROIs) and functionally-derived (MSDL, 39 ROIs) parcellation strategies on the ABIDE I dataset. Our FSL preprocessing pipeline handles multi-site heterogeneity across 400 balanced subjects, with site-stratified 70/15/15 splits to prevent data leakage. Gaussian noise augmentation within training folds expands samples from 280 to 1,680. A three-phase pipeline progresses from a baseline GCN with AAL (73.3% accuracy, AUC=0.74), to an optimised GCN with MSDL (84.0%, AUC=0.84), to a Graph Attention Network ensemble achieving 95.0% accuracy (AUC=0.98), outperforming all recent GNN-based benchmarks on ABIDE I. The 10.7-point gain from atlas substitution alone demonstrates that functional parcellation is the most impactful modelling decision. Gradient-based saliency and GNNExplainer analyses converge on the Posterior Cingulate Cortex and Precuneus as core Default Mode Network hubs, validating that model decisions reflect ASD neuropathology rather than acquisition artefacts. All code and datasets will be publicly released upon acceptance.
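
The figures quoted in the abstract are internally consistent, which a few lines of arithmetic make explicit. The variable names are illustrative; the only inputs are numbers stated in the abstract.

```python
subjects = 400

# Site-stratified 70/15/15 split of 400 subjects.
train = round(subjects * 0.70)
val = round(subjects * 0.15)
test = subjects - train - val

# Gaussian noise augmentation expands the training set from 280 to 1,680.
augmented_train = 1680
augmentation_factor = augmented_train // train

# Accuracy gain from swapping the AAL atlas for MSDL (baseline GCN to optimised GCN).
atlas_gain = round(84.0 - 73.3, 1)

print(train, val, test, augmentation_factor, atlas_gain)
```

The sixfold factor would be consistent with, for example, each training subject plus five noisy copies, although the abstract does not state the breakdown. Only the 280 training subjects are augmented; the validation and test sets stay at 60 subjects each.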

[67] NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

Liang Zeng,Valerio Marsocci,Wufan Zhao,Andrea Nascetti,Maarten Vergauwen

Main category: cs.CV

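TL;DR: NeighborMAE is a new self-supervised learning method that models spatial dependencies by jointly reconstructing neighboring remote sensing images, significantly improving representation learning for Earth observation imagery.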

Details Motivation: Existing masked image modeling methods overlook the inherent spatial continuity of Earth observation imagery and the contextual relationships between neighboring images, even though these spatial dependencies carry rich information for self-supervised learning. Method: Proposes NeighborMAE, which jointly reconstructs remote sensing images of neighboring areas and uses a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight so that reconstruction remains challenging. Result: Across multiple pretraining datasets and downstream tasks, NeighborMAE significantly outperforms existing baselines. Conclusion: Modeling spatial dependencies through neighboring images is valuable for self-supervised learning on remote sensing imagery, and NeighborMAE's design choices are effective. Abstract: Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remain largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs.

[68] EIMC: Efficient Instance-aware Multi-modal Collaborative Perception

Kang Yang,Peng Wang,Lantao Li,Tianci Bu,Chen Sun,Deying Li,Yongcai Wang

Main category: cs.CV

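TL;DR: EIMC proposes an early collaborative perception paradigm that combines lightweight collaborative voxel injection, a heatmap-driven consensus protocol, and instance-centric messaging to improve multi-modal collaborative perception while sharply reducing communication bandwidth.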

Details Motivation: Existing methods follow a "local fusion, then communication" pipeline that incurs high bandwidth cost; multi-modal collaborative perception for autonomous driving needs to become both safer and more efficient. Method: (1) Lightweight collaborative voxels transmitted by neighbor agents are injected into the ego vehicle's local modality fusion; (2) a heatmap-driven consensus protocol locates low-confidence, high-discrepancy regions; (3) only the Top-K instance vectors in those regions are queried from peers and fused via cross-attention; (4) a refinement fusion applies self-attention to the Top-K most confident instances from each agent. Result: On OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Conclusion: EIMC delivers efficient, compact, and informative collaborative perception, cutting communication redundancy while still recovering critical occluded objects. Abstract: Multi-modal collaborative perception calls for great attention to enhancing the safety of autonomous driving. However, current multi-modal approaches remain a "local fusion to communication" sequence, which fuses multi-modal data locally and needs high bandwidth to transmit an individual's feature data before collaborative fusion. EIMC innovatively proposes an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego's local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the Top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers, then fused via cross-attention for completion. Afterwards, we apply a refinement fusion that involves collecting the top-K most confident instances from each agent and enhancing their features using self-attention. The above instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code publicly released at https://github.com/sidiangongyuan/EIMC.
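
The selection step of the consensus protocol, requesting information only where the ego agent is unsure and disagrees most with a peer, can be sketched on flattened heatmaps. This is an illustrative reading of the abstract, not the released EIMC code: the confidence threshold, the absolute-difference discrepancy measure, and the function name are assumptions.

```python
def select_queries(ego_conf, peer_conf, k, conf_thresh=0.5):
    """Pick up to k cell indices to query from a peer.

    A cell qualifies if the ego confidence is below conf_thresh; qualifying
    cells are ranked by how much the peer's confidence differs from the ego's.
    """
    candidates = [
        (abs(p - e), i)
        for i, (e, p) in enumerate(zip(ego_conf, peer_conf))
        if e < conf_thresh
    ]
    # Largest discrepancy first; ties broken by the lower cell index.
    candidates.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in candidates[:k]]


# Flattened per-cell confidence heatmaps (hypothetical values).
ego = [0.9, 0.2, 0.4, 0.1]
peer = [0.9, 0.8, 0.5, 0.95]
queries = select_queries(ego, peer, k=2)
print(queries)
```

Cell 0 is skipped because the ego agent is already confident there, and cell 2 loses out because the two agents nearly agree. Sending instance vectors for only the selected cells, rather than full feature maps, is the kind of saving behind the reported bandwidth reduction.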

[69] ForestPersons: A Large-Scale Dataset for Under-Canopy Missing Person Detection

Deokyun Kim,Jeongjun Lee,Jungwon Choi,Jonggeon Park,Giyoung Lee,Yookyung Kim,Myungseok Ki,Juho Lee,Jihun Cha

Main category: cs.CV

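TL;DR: This paper introduces ForestPersons, a dataset designed for under-canopy person detection in forests. It contains 96,482 images and 204,078 annotations across diverse environmental and temporal conditions, with fine-grained labels such as pose and occlusion visibility. Experiments show that existing detectors perform poorly on it, underscoring its value for real-world search-and-rescue tasks.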

Details Motivation: Dense forest canopy hides missing persons from UAV aerial imagery, so a person-detection benchmark built for under-canopy viewpoints is urgently needed. Method: Build the large-scale ForestPersons dataset of ground-level and low-altitude images, annotated with bounding boxes, pose, and visibility labels, and evaluate baseline models on it. Result: Standard object detectors perform poorly on ForestPersons, indicating that existing datasets do not adequately support under-canopy missing-person detection. Conclusion: ForestPersons fills the gap left by the absence of an under-canopy person-detection benchmark and supports robust person detection in search-and-rescue scenarios. Abstract: Detecting missing persons in forest environments remains a challenge, as dense canopy cover often conceals individuals from detection in top-down or oblique aerial imagery typically captured by Unmanned Aerial Vehicles (UAVs). While UAVs are effective for covering large, inaccessible areas, their aerial perspectives often miss critical visual cues beneath the forest canopy. This limitation underscores the need for under-canopy perspectives better suited for detecting missing persons in such environments. To address this gap, we introduce ForestPersons, a novel large-scale dataset specifically designed for under-canopy person detection. ForestPersons contains 96,482 images and 204,078 annotations collected under diverse environmental and temporal conditions. Each annotation includes a bounding box, pose, and visibility label for occlusion-aware analysis. ForestPersons provides ground-level and low-altitude perspectives that closely reflect the visual conditions encountered by Micro Aerial Vehicles (MAVs) during forest Search and Rescue (SAR) missions. Our baseline evaluations reveal that standard object detection models, trained on prior large-scale object detection datasets or SAR-oriented datasets, show limited performance on ForestPersons. This indicates that prior benchmarks are not well aligned with the challenges of missing person detection under the forest canopy. We offer this benchmark to support advanced person detection capabilities in real-world SAR scenarios. The dataset is publicly available at https://huggingface.co/datasets/etri/ForestPersons.

[70] On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding

Zhanzhong Pang,Dibyadip Chatterjee,Fadime Sener,Angela Yao

Main category: cs.CV

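TL;DR: This paper proposes a Generation-Assisted Discriminative (GAD) classifier for closed-set action understanding that markedly improves accuracy and inference efficiency while remaining compatible with MLLM pretraining.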

Details Motivation: Generative classifiers built on multimodal large language models (MLLMs) are inefficient, and semantic overlap between labels makes generation ambiguous; discriminative classifiers are efficient and accurate but lack the complementary benefits of generative modeling. Method: First compare generative and discriminative classifiers for closed-set action understanding; then design strategies that improve the generative classifier; finally combine the strengths of both in the Generation-Assisted Discriminative (GAD) classifier, which operates only during fine-tuning. Result: GAD reaches state of the art on four tasks across five datasets, with an average 2.5% accuracy gain and 3x faster inference on the largest benchmark, COIN. Conclusion: Discriminative classification outperforms generative classification; GAD combines the strengths of both, delivering high accuracy and efficiency without breaking compatibility with MLLM pretraining. Abstract: Multimodal Large Language Models (MLLMs) have advanced open-world action understanding and can be adapted as generative classifiers for closed-set settings by autoregressively generating action labels as text. However, this approach is inefficient, and shared subwords across action labels introduce semantic overlap, leading to ambiguity in generation. In contrast, discriminative classifiers learn task-specific representations with clear decision boundaries, enabling efficient one-step classification without autoregressive decoding. We first compare generative and discriminative classifiers with MLLMs for closed-set action understanding, revealing the superior accuracy and efficiency of the latter. To bridge the performance gap, we design strategies that elevate generative classifiers toward performance comparable with discriminative ones. Furthermore, we show that generative modeling can complement discriminative classifiers, leading to better performance while preserving efficiency. To this end, we propose Generation-Assisted Discriminative (GAD) classifier for closed-set action understanding. GAD operates only during fine-tuning, preserving full compatibility with MLLM pretraining. Extensive experiments on temporal action understanding benchmarks demonstrate that GAD improves both accuracy and efficiency over generative methods, achieving state-of-the-art results on four tasks across five datasets, including an average 2.5% accuracy gain and 3x faster inference on our largest COIN benchmark.

[71] SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding

Sheng Ye,Zhen-Hui Dong,Ruoyu Fan,Tian Lv,Yong-Jin Liu

Main category: cs.CV

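TL;DR: This paper proposes SemGS, a feed-forward framework that reconstructs semantic fields from sparse image inputs. A dual-branch network and a camera-aware attention mechanism enable efficient, generalizable 3D semantic scene reconstruction and novel-view semantic synthesis.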

Details Motivation: Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis depend on dense multi-view inputs and per-scene optimization, which limits their practicality and scalability. Method: SemGS uses a dual-branch architecture with shared shallow CNN layers to extract color and semantic features; adds a camera-aware attention mechanism to model geometric relationships between viewpoints; decodes features into dual-Gaussians that are geometrically consistent but keep branch-specific attributes; and applies a regional smoothness loss to improve semantic coherence. Result: Achieves state-of-the-art performance on multiple benchmark datasets, with fast inference and strong generalization across synthetic and real-world scenes. Conclusion: SemGS addresses the generalization and efficiency challenges of semantic field reconstruction from sparse inputs, offering a practical approach to semantic understanding for robots in complex environments. Abstract: Semantic understanding of 3D scenes is essential for robots to operate effectively and safely in complex environments. Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis often rely on dense multi-view inputs and require scene-specific optimization, limiting their practicality and scalability in real-world applications. To address these challenges, we propose SemGS, a feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs. SemGS uses a dual-branch architecture to extract color and semantic features, where the two branches share shallow CNN layers, allowing semantic reasoning to leverage textural and structural cues in color appearance. We also incorporate a camera-aware attention mechanism into the feature extractor to explicitly model geometric relationships between camera viewpoints. The extracted features are decoded into dual-Gaussians that share geometric consistency while preserving branch-specific attributes, and further rasterized to synthesize semantic maps under novel viewpoints. Additionally, we introduce a regional smoothness loss to enhance semantic coherence. Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, while providing rapid inference and strong generalization capabilities across diverse synthetic and real-world scenarios.

[72] Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation

Chonghua Lv,Dong Zhao,Shuang Wang,Dou Quan,Ning Huyan,Nicu Sebe,Zhun Zhong

Main category: cs.CV

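TL;DR: This paper proposes GKD, a generalization-oriented knowledge distillation framework. By decoupling representation learning from task learning and introducing a query-based soft distillation mechanism, it substantially improves student generalization under domain shift, particularly when distilling vision foundation models.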

Details Motivation: Conventional knowledge distillation for semantic segmentation focuses on in-domain accuracy and neglects out-of-domain generalization; vision foundation models (VFMs) generalize well, but conventional distillation weakens that advantage. Method: A multi-stage Generalizable Knowledge Distillation (GKD) framework: stage one learns domain-agnostic representations via selective feature distillation; stage two freezes those representations for task adaptation; a query-based soft distillation mechanism lets student features act as queries that retrieve transferable spatial knowledge from the teacher (VFM). Result: On five domain generalization benchmarks, GKD improves average performance by +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation, clearly outperforming existing methods. Conclusion: GKD preserves and strengthens the generalization ability of vision foundation models, offering a new approach to compressing semantic segmentation models for distribution-shift settings. Abstract: Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation. The code will be available at https://github.com/Younger-hua/GKD.

[73] Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Zhiyu Pan,Yizheng Wu,Jiashen Hua,Junyi Feng,Shaotian Yan,Bing Deng,Zhiguo Cao,Jieping Ye

Main category: cs.CV

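TL;DR: This paper proposes VC-STaR, a framework that uses visual contrast to strengthen the self-improvement of vision language models (VLMs) and to mitigate visual hallucinations in reasoning. It also builds a new dataset, VisCoR-55K, to improve VLM visual reasoning.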

Details Motivation: Language-based self-improvement methods transfer poorly to vision language models (VLMs) because the generated reasoning paths contain visual hallucinations that are hard to verify and correct. Method: Building on the observation that contrastive VQA pairs (visually similar images with synonymous questions) help a VLM locate visual cues more precisely, the paper proposes the Visual Contrastive Self-Taught Reasoner (VC-STaR); it curates contrastive VQA pairs by multi-modal similarity, generates high-quality rationales, and assembles the VisCoR-55K dataset for supervised finetuning. Result: VC-STaR significantly improves visual reasoning across several VLMs, outperforming existing self-improvement methods and also models finetuned on current SoTA visual reasoning datasets. Conclusion: The contrastive ability inherent in VLMs can be harnessed to bootstrap their own reasoning, and VC-STaR offers a new way to address reasoning hallucinations in VLMs. Abstract: Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.

[74] CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

Maoyuan Shao,Yutong Gao,Xinyang Huang,Chuang Zhu,Lijuan Sun,Guoshun Nan

Main category: cs.CV

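TL;DR: This paper proposes CAPT, a framework that builds a Confusion Bank and a Multi-Granularity Difference Expert module to improve fine-grained category discrimination in vision-language models and substantially reduce systematic misclassification.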

Details Motivation: Vision-language models such as CLIP systematically misclassify visually and semantically similar categories; this confusion is not random but reflects intrinsic model bias and limited fine-grained discriminative ability. Method: The Confusion-Aware Prompt Tuning (CAPT) framework builds a Confusion Bank, designs a Semantic Confusion Miner (SEM) and a Sample Confusion Miner (SAM), and introduces a Multi-Granularity Difference Expert (MGDE) module that fuses semantic-level and sample-level confusion information. Result: On 11 benchmark datasets, CAPT significantly reduces confusion-induced errors, improves discriminability and generalization on both base and novel classes, and resolves 50.72% of confusable sample pairs. Conclusion: By explicitly modeling and exploiting the model's own confusion patterns, CAPT strengthens the fine-grained recognition robustness of cross-modal models and offers a new way to mitigate systematic bias. Abstract: Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model's intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at https://github.com/greatest-gourmet/CAPT.

[75] CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration

Huichun Liu,Xiaosong Li,Zhuangfan Huang,Tao Ye,Yang Liu,Haishu Tan

Main category: cs.CV

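TL;DR: This paper proposes CAWM-Mamba, the first end-to-end framework for joint multimodal image fusion and compound adverse weather restoration (e.g., haze+rain, rain+snow). Three modules (weather-aware preprocessing, cross-modal feature interaction, and a wavelet space state block) improve robustness and generalization, and the method achieves state-of-the-art results on multiple datasets and downstream tasks.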

Details Motivation: Existing multimodal image fusion methods for adverse weather usually target a single degradation type (haze, rain, or snow) and struggle when several degradations coexist, which limits their reliability in applications such as autonomous driving and UAV monitoring. Method: CAWM-Mamba comprises a Weather-Aware Preprocess Module (WAPM), a Cross-modal Feature Interaction Module (CFIM), and a Wavelet Space State Block (WSSB); WSSB introduces Freq-SSM to model anisotropic high-frequency degradation and uses a unified degradation representation mechanism to generalize better to compound weather. Result: Clearly outperforms existing methods on AWMM-100K and three standard fusion datasets, and performs well on downstream semantic segmentation and object detection, confirming the practical usefulness of the fused results. Conclusion: CAWM-Mamba is the first unified framework supporting end-to-end multimodal image fusion and restoration under compound adverse weather, with strong generalization, robustness, and deployment value. Abstract: Multimodal Image Fusion (MMIF) integrates complementary information from various modalities to produce clearer and more informative fused images. MMIF under adverse weather is particularly crucial in autonomous driving and UAV monitoring applications. However, existing adverse weather fusion methods generally only tackle single types of degradation such as haze, rain, or snow, and fail when multiple degradations coexist (e.g., haze+rain, rain+snow). To address this challenge, we propose Compound Adverse Weather Mamba (CAWM-Mamba), the first end-to-end framework that jointly performs image fusion and compound weather restoration with unified shared weights. Our network contains three key components: (1) a Weather-Aware Preprocess Module (WAPM) that enhances degraded visible features and extracts global weather embeddings; (2) a Cross-modal Feature Interaction Module (CFIM) to facilitate the alignment of heterogeneous modalities and exchange of complementary features across modalities; and (3) a Wavelet Space State Block (WSSB) that leverages wavelet-domain decomposition to decouple multi-frequency degradations. WSSB includes Freq-SSM, a module that models anisotropic high-frequency degradation without redundancy, and a unified degradation representation mechanism to further improve generalization across complex compound weather conditions. Extensive experiments on the AWMM-100K benchmark and three standard fusion datasets demonstrate that CAWM-Mamba consistently outperforms state-of-the-art methods in both compound and single-weather scenarios. In addition, our fusion results excel in downstream tasks covering semantic segmentation and object detection, confirming the practical value in real-world adverse weather perception. The source code will be available at https://github.com/Feecuin/CAWM-Mamba.

[76] Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

Jiahao Lu,Jiayi Xu,Wenbo Hu,Ruijie Zhu,Chengfeng Zhao,Sai-Kit Yeung,Ying Shan,Yuan Liu

Main category: cs.CV

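TL;DR: This paper proposes Track4World, a feed-forward model that efficiently and holistically estimates the world-coordinate 3D trajectory of every pixel in a monocular video. It encodes the global 3D scene with a VGGT-style ViT and introduces a novel 3D correlation scheme that jointly estimates dense 2D and 3D flow, significantly improving 2D/3D flow estimation and 3D tracking.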

Details Motivation: Existing monocular 3D tracking methods are limited to sparse point tracking or slow optimization-based frameworks, and cannot deliver efficient, dense, world-coordinate per-pixel 3D trajectories. Method: Track4World is a feed-forward model that builds a global 3D scene representation with a VGGT-style ViT and applies a novel 3D correlation scheme to jointly estimate pixel-wise dense 2D and 3D flow between arbitrary frame pairs; combined with the reconstructed 3D geometry, this enables efficient 3D tracking of every pixel. Result: Across multiple benchmarks, the method consistently outperforms existing approaches in 2D/3D flow estimation and 3D tracking, showing strong robustness and scalability. Conclusion: Track4World provides efficient, dense, world-consistent 4D (3D + time) reconstruction for monocular video, advancing monocular dynamic scene understanding. Abstract: Estimating the 3D trajectory of every pixel from a monocular video is crucial and promising for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking works demonstrate impressive performance, but are limited to either tracking sparse points on the first frame or a slow optimization-based framework for dense tracking. In this paper, we propose a feedforward model, called Track4World, enabling an efficient holistic 3D tracking of every pixel in the world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate the pixel-wise 2D and 3D dense flow between arbitrary frame pairs. The estimated scene flow, along with the reconstructed 3D geometry, enables subsequent efficient 3D tracking of every pixel of this video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.

[77] ATD: Improved Transformer with Adaptive Token Dictionary for Image Restoration

Leheng Zhang,Wei Long,Yawei Li,Xingyu Zhou,Xiaorui Zhao,Shuhang Gu

Main category: cs.CV

TL;DR: 本文提出了一种名为自适应标记字典(ATD)的新型Transformer架构,通过引入可学习的标记字典和标记字典交叉注意力机制(TDCA),在保持线性计算复杂度的同时实现全局依赖建模,显著提升了图像超分辨率等图像恢复任务的性能。

Details Motivation: Transformer-based image restoration methods are constrained by the quadratic complexity of self-attention and usually fall back on local window attention, which limits the receptive field and yields suboptimal performance; a better balance between performance and compute is needed. Method: Adaptive Token Dictionary (ATD): 1) build a learnable token dictionary that captures external image priors; 2) design a token dictionary cross-attention (TDCA) mechanism to enhance input features; 3) use the category information in the TDCA attention maps to group features into clusters, and feed that information into the feed-forward network to improve feature fusion. The paper also derives a lightweight version, ATD-light, and a multi-scale variant, ATD-U. Result: ATD and its variants reach state-of-the-art performance on multiple image super-resolution benchmarks; ATD-U also performs strongly on image denoising and JPEG compression artifact removal, with both quantitative and qualitative results supporting its effectiveness. Conclusion: ATD achieves global modeling at linear complexity, balancing efficiency and performance, and offers a new approach to using Transformers efficiently for image restoration. Abstract: Recently, Transformers have gained significant popularity in image restoration tasks such as image super-resolution and denoising, owing to their superior performance. However, balancing performance and computational burden remains a long-standing problem for transformer-based architectures. Due to the quadratic complexity of self-attention, existing methods often restrict attention to local windows, resulting in limited receptive field and suboptimal performance. To address this issue, we propose Adaptive Token Dictionary (ATD), a novel transformer-based architecture for image restoration that enables global dependency modeling with linear complexity relative to image size. The ATD model incorporates a learnable token dictionary, which summarizes external image priors (i.e., typical image structures) during the training process. To utilize this information, we introduce a token dictionary cross-attention (TDCA) mechanism that enhances the input features via interaction with the learned dictionary. Furthermore, we exploit the category information embedded in the TDCA attention maps to group input features into multiple categories, each representing a cluster of similar features across the image and serving as an attention group. We also integrate the learned category information into the feed-forward network to further improve feature fusion. ATD and its lightweight version, ATD-light, achieve state-of-the-art performance on multiple image super-resolution benchmarks. Moreover, we develop ATD-U, a multi-scale variant of ATD, to address other image restoration tasks, including image denoising and JPEG compression artifacts removal. Extensive experiments demonstrate the superiority of our proposed models, both quantitatively and qualitatively.
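To see why attending to a fixed-size dictionary is linear in image size, consider the rough sketch below. It is a deliberately simplified illustration, not the ATD architecture: there are no learned projections, the dictionary serves as both keys and values, the residual connection and the argmax-based category assignment are assumptions, and the name `dictionary_cross_attention` is hypothetical. With N image tokens and M dictionary entries of dimension d, the cost is O(N·M·d), and M does not grow with the image.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dictionary_cross_attention(
    tokens: List[Vector], dictionary: List[Vector]
) -> Tuple[List[Vector], List[int]]:
    """Enhance each token by attending to a small, fixed dictionary.

    Returns the enhanced tokens and, per token, the index of the
    dictionary entry with the largest attention weight (used here as an
    assumed stand-in for the 'category' read from the attention map).
    """
    d = len(dictionary[0])
    enhanced, categories = [], []
    for x in tokens:  # N iterations, each costing O(M * d)
        logits = [
            sum(a * b for a, b in zip(x, entry)) / math.sqrt(d)
            for entry in dictionary
        ]
        weights = softmax(logits)
        mixed = [
            sum(w * entry[j] for w, entry in zip(weights, dictionary))
            for j in range(d)
        ]
        # Residual connection: original token plus dictionary read-out.
        enhanced.append([xi + mi for xi, mi in zip(x, mixed)])
        categories.append(max(range(len(weights)), key=weights.__getitem__))
    return enhanced, categories


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.1]]
dictionary = [[1.0, 0.0], [0.0, 1.0]]
enhanced, categories = dictionary_cross_attention(tokens, dictionary)
print(categories)
```

Tokens that share a category index could then be grouped for attention within the group, which is the role the abstract assigns to the category information.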

[78] Neural Electromagnetic Fields for High-Resolution Material Parameter Reconstruction

Zhe Chen,Peilin Zheng,Wenshuo Chen,Xiucheng Wang,Yutao Yue,Nan Cheng

Main category: cs.CV

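TL;DR: This paper proposes NEMF, a framework that anchors on high-fidelity geometry from images to systematically disentangle the ambient field from material properties, enabling non-contact, non-invasive physical inversion and the construction of functional digital twins.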

Details Motivation: Digital twins produced by methods such as NeRF are visually rich but functionally incomplete because they lack material properties (e.g., permittivity, conductivity); recovering material parameters at every point in a scene through non-contact, non-invasive sensing is a severely ill-posed physical inversion problem. Method: NEMF uses high-fidelity geometry from images as a strong prior and first resolves the ambient field; it then combines non-invasive RF signals with a differentiable layer of physical reflection models to build a physics-guided decoder that outputs a continuous, spatially varying field of material parameters. Result: Validation on high-fidelity synthetic datasets shows that the method reconstructs material parameter maps with high accuracy and supports high-fidelity physical simulation. Conclusion: NEMF moves beyond the limits of purely visual digital twins, shifting from passive visual representation to functional, simulatable digital twins. Abstract: Creating functional Digital Twins, simulatable 3D replicas of the real world, is a central challenge in computer vision. Current methods like NeRF produce visually rich but functionally incomplete twins. The key barrier is the lack of underlying material properties (e.g., permittivity, conductivity). Acquiring this information for every point in a scene via non-contact, non-invasive sensing is a primary goal, but it demands solving a notoriously ill-posed physical inversion problem. Standard remote signals, like images and radio frequencies (RF), deeply entangle the unknown geometry, ambient field, and target materials. We introduce NEMF, a novel framework for dense, non-invasive physical inversion designed to build functional digital twins. Our key insight is a systematic disentanglement strategy. NEMF leverages high-fidelity geometry from images as a powerful anchor, which first enables the resolution of the ambient field. By constraining both geometry and field using only non-invasive data, the original ill-posed problem transforms into a well-posed, physics-supervised learning task. This transformation unlocks our core inversion module: a decoder. Guided by ambient RF signals and a differentiable layer incorporating physical reflection models, it learns to explicitly output a continuous, spatially-varying field of the scene's underlying material parameters. We validate our framework on high-fidelity synthetic datasets. Experiments show our non-invasive inversion reconstructs these material maps with high accuracy, and the resulting functional twin enables high-fidelity physical simulation. This advance moves beyond passive visual replicas, enabling the creation of truly functional and simulatable models of the physical world.
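The abstract does not say which physical reflection models NEMF uses. As a hedged illustration of how permittivity and conductivity shape an RF reflection, the sketch below evaluates the textbook Fresnel reflection coefficient for a plane wave travelling in air and hitting a flat, non-magnetic material at normal incidence. The function name and the chosen frequency are assumptions; the formula uses the complex relative permittivity ε_c = ε_r − jσ/(ωε₀), with the e^{jωt} time convention.

```python
import cmath
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m


def normal_incidence_reflection(eps_r: float, sigma: float, freq_hz: float) -> complex:
    """Fresnel reflection coefficient, air -> material, normal incidence.

    eps_r:   relative permittivity of the material (dimensionless)
    sigma:   conductivity of the material in S/m
    freq_hz: signal frequency in Hz
    """
    omega = 2.0 * math.pi * freq_hz
    # Complex relative permittivity; losses enter through the conductivity.
    eps_c = complex(eps_r, -sigma / (omega * EPS0))
    n = cmath.sqrt(eps_c)  # complex refractive index (non-magnetic medium)
    return (1.0 - n) / (1.0 + n)


lossless = normal_incidence_reflection(4.0, 0.0, 2.4e9)
lossy = normal_incidence_reflection(4.0, 1.0, 2.4e9)
print(abs(lossless), abs(lossy))
```

Because the coefficient depends smoothly on both parameters, a forward model of this kind can in principle be differentiated with respect to the material field, which is the property an inversion pipeline needs.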

[79] Maximizing Generalization: The Effect of Different Augmentation Techniques on Lightweight Vision Transformer for Bengali Character Classification

Rafi Hassan Chowdhury,Naimul Haque,Kaniz Fatiha

Main category: cs.CV

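TL;DR: This paper studies image data augmentation for handwritten character recognition in low-resource languages such as Bengali. Using the lightweight EfficientViT model, the combination of Random Affine and Color Jitter gives the highest accuracy on the Ekush and AIBangla datasets (97.48% and 97.57%).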

Details Motivation: Resource-limited languages such as Bengali lack large annotated image datasets, which leaves deep learning models prone to overfitting or underfitting; data augmentation is explored as a way to ease this small-data bottleneck. Method: Systematically evaluate several image augmentation techniques (CLAHE, Random Rotation, Random Affine, Color Jitter, and their combinations) with the lightweight vision model EfficientViT, training and comparing on two Bengali handwritten character datasets, Ekush and AIBangla. Result: The combination of Random Affine and Color Jitter achieves the highest accuracy on the two datasets, 97.48% and 97.57% respectively, outperforming all other single or combined augmentations. Conclusion: A well-chosen image augmentation strategy can significantly improve lightweight models on handwritten character recognition for low-resource languages, and the approach is reusable in similar settings. Abstract: Deep learning models have proven to be highly effective in computer vision, with deep convolutional neural networks achieving impressive results across various computer vision tasks. However, these models rely heavily on large datasets to avoid overfitting. When a model learns features with either low or high variance, it can lead to underfitting or overfitting on the training data. Unfortunately, large-scale datasets may not be available in many domains, particularly for resource-limited languages such as Bengali. In this experiment, a series of tests were conducted in the field of image data augmentation as an approach to addressing the limited data problem for Bengali handwritten characters. The study also provides an in-depth analysis of the performance of different augmentation techniques. Data augmentation refers to a set of techniques applied to data to increase its size and diversity, making it more suitable for training deep learning models. The image augmentation techniques evaluated in this study include CLAHE, Random Rotation, Random Affine, Color Jitter, and their combinations. The study further explores the use of augmentation methods with a lightweight model such as EfficientViT. Among the different augmentation strategies, the combination of Random Affine and Color Jitter produced the best accuracy on the Ekush [1] and AIBangla [2] datasets, achieving accuracies of 97.48% and 97.57%, respectively. This combination outperformed all other individual and combined augmentation techniques. Overall, this analysis presents a thorough examination of the impact of image data augmentation in resource-scarce languages, particularly in the context of Bengali handwritten character recognition using lightweight models.

[80] Synthetic-Child: An AIGC-Based Synthetic Data Pipeline for Privacy-Preserving Child Posture Estimation

Taowen Zeng

Main category: cs.CV

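TL;DR: This paper proposes Synthetic-Child, a synthetic data pipeline that uses AIGC to generate child posture images with ground-truth annotations and no real child photographs. It significantly improves child posture estimation and supports real-time deployment on edge devices.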

Details Motivation: Child posture estimation is critical for AI study companion devices, but privacy and ethical concerns make large-scale annotated real data hard to obtain. Method: Build a programmable 3D child model with SMPL-X and Blender to generate plausible seated poses; drive FLUX-1 Dev with a dual ControlNet (pose + depth) to synthesize 12,000 images; improve quality through ViTPose-based filtering and augmentation; fine-tune RTMPose-M, pair it with geometric features and a lightweight MLP, and deploy with INT8 quantization. Result: On a real-child test set, the FP16 model reaches 71.2 AP (+12.5 AP over the adult-data baseline); after INT8 quantization it retains 70.4 AP and runs at 22 FPS on an RK3568 NPU; in a single-subject comparison with a commercial device it shows higher recognition rates and responds about 1.8x faster. Conclusion: A carefully designed AIGC synthetic data pipeline can greatly reduce dependence on real child imagery while maintaining high accuracy and edge deployability, and may extend to other privacy-sensitive domains. Abstract: Accurate child posture estimation is critical for AI-powered study companion devices, yet collecting large-scale annotated datasets of children is both expensive and ethically prohibitive due to privacy concerns. We present Synthetic-Child, an AIGC-based synthetic data pipeline that produces photorealistic child posture training images with ground-truth-projected keypoint annotations, requiring zero real child photographs. The pipeline comprises four stages: (1) a programmable 3D child body model (SMPL-X) in Blender generates diverse desk-study poses with IK-constrained anatomical plausibility and automatic COCO-format ground-truth export; (2) a custom PoseInjectorNode feeds 3D-derived skeletons into a dual ControlNet (pose + depth) conditioned on FLUX-1 Dev, synthesizing 12,000 photorealistic images across 10 posture categories with low annotation drift; (3) ViTPose-based confidence filtering and targeted augmentation remove generation failures and improve robustness; (4) RTMPose-M (13.6M params) is fine-tuned on the synthetic data and paired with geometric feature engineering and a lightweight MLP for posture classification, then quantized to INT8 for real-time edge deployment. On a real-child test set (n~300), the FP16 model achieves 71.2 AP, a +12.5 AP improvement over the COCO-pretrained adult-data baseline at identical model capacity. After INT8 quantization the model retains 70.4 AP while running at 22 FPS on a 0.8-TOPS Rockchip RK3568 NPU. In a single-subject controlled comparison with a commercial posture corrector, our system achieves substantially higher recognition rates across most tested categories and responds ~1.8x faster on average. These results demonstrate that carefully designed AIGC pipelines can substantially reduce dependence on real child imagery while achieving deployment-ready accuracy, with potential applications to other privacy-sensitive domains.

[81] VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction

A. Enes Doruk,Hasan F. Ates

Main category: cs.CV

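TL;DR: This paper proposes VLMFusionOcc3D, a multi-modal 3D semantic occupancy prediction framework that incorporates a vision-language model (VLM). Its InstVLM, WeathFusion, and DAGA modules improve robustness and accuracy in adverse weather and in semantically ambiguous scenes.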

Details Motivation: Existing voxel-based occupancy prediction models suffer from semantic ambiguity in sparse geometric grids and degrade under adverse weather. Method: Build a dual-branch feature extraction network; propose Instance-driven VLM Attention (InstVLM), which uses gated cross-attention and LoRA-adapted CLIP embeddings to inject semantic and geographic priors; design Weather-Aware Adaptive Fusion (WeathFusion) to dynamically re-weight multi-sensor inputs; and introduce a Depth-Aware Geometric Alignment (DAGA) loss that aligns camera-derived depth with LiDAR geometry. Result: On nuScenes and SemanticKITTI, the modules consistently improve state-of-the-art voxel-based baselines, with especially large gains in adverse weather, and they are plug-and-play and scalable. Conclusion: By combining linguistic priors, weather-aware fusion, and geometric alignment, VLMFusionOcc3D offers an effective solution for robust 3D semantic occupancy prediction in complex urban environments. Abstract: This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and performance degradation under adverse weather conditions. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework initiates with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which utilizes gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. Furthermore, we introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions based on real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss is employed to align dense camera-derived geometry with sparse, spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently enhance the performance of state-of-the-art voxel-based baselines. Notably, our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.

[82] Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs

Zhikang Xu,Qianqian Xu,Zitai Wang,Cong Hua,Sicong Li,Zhiyong Yang,Qingming Huang

Main category: cs.CV

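TL;DR: This paper proposes InterNeg, a framework that improves out-of-distribution (OOD) detection with vision-language models (VLMs) by consistently enhancing inter-modal distance from both textual and visual perspectives, clearly outperforming existing methods.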

Details Motivation: Existing VLM-based OOD detection methods often mix in intra-modal distance (for example, comparing negative texts with ID labels), which conflicts with the inter-modal distance objective that CLIP-like models are optimized for and limits performance. Method: From the textual perspective, InterNeg designs an inter-modal criterion for selecting negative texts; from the visual perspective, it dynamically identifies high-confidence OOD images and inverts them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Result: Significantly outperforms existing methods on multiple benchmarks: FPR95 drops by 3.47% on ImageNet, and AUROC rises by 5.50% on Near-OOD. Conclusion: Consistently using inter-modal rather than intra-modal distance significantly improves VLM performance on OOD detection, and InterNeg provides a simple and effective recipe for doing so. Abstract: Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
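For readers unfamiliar with the FPR95 metric quoted above, the sketch below shows one common way to compute it: pick the score threshold that still accepts at least 95% of in-distribution (ID) samples, then report the fraction of OOD samples that the same threshold wrongly accepts. The function name and the toy scores are illustrative assumptions, and the convention that a higher score means "more in-distribution" is also assumed; the paper's own evaluation code may handle ties or thresholds differently.

```python
from typing import Sequence


def fpr_at_95_tpr(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """False positive rate on OOD data at a 95% true positive rate on ID data.

    Assumes a higher score means the sample looks more in-distribution.
    """
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of ID samples that reaches a 95% true positive rate,
    # computed with integer arithmetic: ceil(0.95 * n).
    n_keep = (95 * len(ranked) + 99) // 100
    threshold = ranked[n_keep - 1]
    # OOD samples scoring at or above the threshold are false positives.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


id_scores = list(range(1, 21))  # 20 toy ID scores: 1, 2, ..., 20
ood_scores = [0, 1, 2, 5]       # 4 toy OOD scores
print(fpr_at_95_tpr(id_scores, ood_scores))
```

In this toy case the threshold is 2 (19 of 20 ID samples pass), and two of the four OOD scores also pass, so the function returns 0.5. Lower is better, which is why the abstract reports a reduction in FPR95 as an improvement.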

[83] Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild

Seunguk Do,Minwoo Huh,Joonghyuk Shin,Jaesik Park

Main category: cs.CV

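TL;DR: This paper proposes DrPose, a direct reward fine-tuning method that post-trains a multi-view diffusion model using only single-view images paired with 2D poses, with no expensive 3D human assets. It maximizes a differentiable PoseScore reward to make reconstructed human poses more natural and diverse.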

Details Motivation: Existing single-view 3D human reconstruction methods often produce unnatural poses for dynamic or challenging inputs, mainly because 3D human datasets with high-quality, diverse poses are limited in scale. Method: DrPose applies direct reward fine-tuning to a multi-view diffusion model using single-view images paired with 2D poses; it defines a differentiable PoseScore reward that measures consistency between the generated multi-view latent image and the ground-truth pose; and it builds a new dataset, DrPose15K (derived from a human motion dataset and a pose-conditioned video generative model), covering a broader pose distribution. Result: Consistent qualitative and quantitative gains on standard benchmarks, in-the-wild images, and a newly built challenging-pose benchmark, with notably more natural reconstructions for complex poses. Conclusion: DrPose mitigates the lack of pose diversity in 3D human reconstruction and shows that 2D pose supervision alone can efficiently improve a multi-view diffusion model's handling of human poses. Abstract: Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing direct reward fine-tuning to maximize PoseScore, which is our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset that was constructed from an existing human motion dataset and a pose-conditioned video generative model. Constructed from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution compared to existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks. Project page: https://seunguk-do.github.io/drpose.

[84] Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

Kaifang Long,Lianbo Ma,Jiaqi Liu,Liming Liu,Guoyang Xie

Main category: cs.CV

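TL;DR: This paper proposes IB-IUMAD, a framework in which a Mamba decoder and an information bottleneck fusion module jointly denoise features to mitigate catastrophic forgetting in incremental unified multimodal anomaly detection.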

Details Motivation: Catastrophic forgetting in incremental unified multimodal anomaly detection is aggravated by spurious and redundant features, an effect that existing methods overlook. Method: IB-IUMAD is a denoising framework that combines a Mamba decoder, which disentangles inter-object feature coupling, with an information bottleneck fusion module, which filters redundant features, so that discriminative information is preserved. Result: Theoretical analysis and experiments on the MVTec 3D-AD and Eyecandies datasets show that IB-IUMAD mitigates forgetting and achieves competitive performance. Conclusion: Spurious and redundant features markedly worsen catastrophic forgetting; by disentangling and refining features, IB-IUMAD improves the stability and generalization of incremental multimodal anomaly detection. Abstract: The quest for incremental unified multimodal anomaly detection seeks to empower a single model with the ability to systematically detect anomalies across all categories and support incremental learning to accommodate emerging objects/categories. Central to this pursuit is resolving the catastrophic forgetting dilemma, which involves acquiring new knowledge while preserving prior learned knowledge. Despite some efforts to address this dilemma, a key oversight persists: ignoring the potential impact of spurious and redundant features on catastrophic forgetting. In this paper, we delve into the negative effect of spurious and redundant features on this dilemma in incremental unified frameworks, and reveal that under similar conditions, the multimodal framework developed by naive aggregation of unimodal architectures is more prone to forgetting. To address this issue, we introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and information bottleneck fusion module: the former is dedicated to disentangling inter-object feature coupling, preventing spurious feature interference between objects; the latter filters out redundant features from the fused features, thus explicitly preserving discriminative information. A series of theoretical analyses and experiments on MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD.

[85] SEP-YOLO: Fourier-Domain Feature Representation for Transparent Object Instance Segmentation

Fengming Zhang,Tao Yan,Jianchao Huang

Main category: cs.CV

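TL;DR: This paper proposes SEP-YOLO, a framework that improves transparent object instance segmentation through frequency-domain detail enhancement and multi-scale spatial refinement, and releases high-quality instance-level annotations for Trans10K.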

Details Motivation: Transparent objects have blurred boundaries, low contrast, and strong dependence on the background, so existing methods that rely on strong appearance cues and clear boundaries fail on them. Method: SEP-YOLO comprises a Frequency Domain Detail Enhancement Module, which separates and enhances high-frequency boundary components with learnable complex weights, and a multi-scale spatial refinement stream, made up of a Content-Aware Alignment Neck and a Multi-scale Gated Refinement Block; the authors also provide high-quality instance-level annotations for Trans10K. Result: State-of-the-art performance on the Trans10K and GVD datasets. Conclusion: The dual-domain collaborative mechanism of SEP-YOLO overcomes the main difficulties of transparent object segmentation, supports combining frequency-domain modeling with spatial refinement, and advances data resources for the field. Abstract: Transparent object instance segmentation presents significant challenges in computer vision, due to the inherent properties of transparent objects, including boundary blur, low contrast, and high dependence on background context. Existing methods often fail as they depend on strong appearance cues and clear boundaries. To address these limitations, we propose SEP-YOLO, a novel framework that integrates a dual-domain collaborative mechanism for transparent object instance segmentation. Our method incorporates a Frequency Domain Detail Enhancement Module, which separates and enhances weak high-frequency boundary components via learnable complex weights. We further design a multi-scale spatial refinement stream, which consists of a Content-Aware Alignment Neck and a Multi-scale Gated Refinement Block, to ensure precise feature alignment and boundary localization in deep semantic features. We also provide high-quality instance-level annotations for the Trans10K dataset, filling the critical data gap in transparent object instance segmentation. Extensive experiments on the Trans10K and GVD datasets show that SEP-YOLO achieves state-of-the-art (SOTA) performance.
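The idea of boosting high-frequency components with complex weights can be sketched on a 1-D signal. This is only an illustration under stated assumptions: the function names, the `cutoff` parameter, and the fixed `weights` (which would be learned in a real model) are hypothetical, and the abstract does not describe the actual module at this level of detail.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a 1-D sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def enhance_high_freq(signal, weights, cutoff):
    """Multiply bins at or above `cutoff` by complex weights, leave the rest alone.

    `weights` stands in for learnable parameters (an assumption). The output is
    real only if weights[n - k] is the complex conjugate of weights[k].
    """
    n = len(signal)
    spectrum = dft(signal)
    out = []
    for k, value in enumerate(spectrum):
        freq = min(k, n - k)  # bins k and n - k carry the same frequency
        out.append(value * weights[k] if freq >= cutoff else value)
    return [x.real for x in idft(out)]

# An alternating signal lives entirely in the highest bin, so it is doubled;
# a constant signal lives in the DC bin and passes through unchanged.
edges = enhance_high_freq([1, -1] * 4, [2 + 0j] * 8, cutoff=2)
flat = enhance_high_freq([3.0] * 8, [2 + 0j] * 8, cutoff=1)
```

A 2-D image version would apply the same masking to a 2-D FFT of a feature map; the 1-D form is used here only to keep the sketch dependency-free.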

[86] OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning

Zhengwei Yang,Andi Long,Hao Li,Zechao Hu,Kui Jiang,Zheng Wang

Main category: cs.CV

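TL;DR: This paper presents the FashionX dataset and the unified OmniFashion framework, which address the inconsistent visual-semantic structures caused by fragmented supervision and incomplete annotations in fashion intelligence, enabling multi-task understanding and interactive dialogue.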

Details Motivation: Fashion intelligence is limited by fragmented supervision and incomplete fashion annotations, which make consistent visual-semantic structures hard to form and keep vision-language models from serving as a generalist "fashion brain". Method: Build FashionX, a million-scale dataset with fine-grained annotations, and on top of it propose OmniFashion, a unified vision-language framework that casts retrieval, recommendation, recognition, and dialogue as a single fashion dialogue paradigm. Result: OmniFashion shows strong task-level accuracy and cross-task generalization on multiple subtasks and retrieval benchmarks. Conclusion: OmniFashion offers a viable path toward scalable, dialogue-oriented generalist fashion intelligence. Abstract: Fashion intelligence spans multiple tasks, i.e., retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. Therefore, we construct FashionX, a million-scale dataset that exhaustively annotates visible fashion items within an outfit and organizes attributes from global to part-level. Built upon this foundation, we propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a unified fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multi-subtasks and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, offering a scalable path toward universal, dialogue-oriented fashion intelligence.

[87] DREAM: Where Visual Understanding Meets Text-to-Image Generation

Chao Li,Tianhong Li,Sai Vidyaranya Nuthalapati,Hong-You Chen,Satya Narayan Shukla,Yonghuan Yang,Jun Xiao,Xiangjun Fan,Aashu Singh,Dina Katabi,Shlok Kumar Mishra

Main category: cs.CV

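TL;DR: DREAM is a framework that unifies visual representation learning and text-to-image generation; with a Masking Warmup training strategy and Semantically Aligned Decoding at inference, it improves both ImageNet linear-probing accuracy and FID while training only on CC12M.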

Details Motivation: Unifying the discriminative objective of visual representation learning with the generative objective of text-to-image generation is a central challenge in multimodal learning. Method: DREAM rests on two techniques: (1) Masking Warmup during training, a progressive masking schedule that starts at a low masking ratio to establish contrastive alignment and then moves gradually to full masking for stable generative training; (2) Semantically Aligned Decoding at inference, which picks, among partially masked image candidates, the one best aligned with the text and continues decoding from it, improving text-image fidelity. Result: Trained on CC12M, DREAM reaches 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. Conclusion: Discriminative and generative objectives can be optimized synergistically; DREAM shows that a single unified model can perform well on both visual understanding and image generation. Abstract: Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives, while learning strong visual representations. DREAM is built on two key techniques: During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation.
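A progressive masking schedule of the kind described for Masking Warmup might look like the sketch below. The abstract only says that masking starts minimal and moves gradually to full masking; the linear ramp, the function name, and the default `start_ratio` of 0.1 are assumptions made for illustration.

```python
def masking_warmup_ratio(step, warmup_steps, start_ratio=0.1, end_ratio=1.0):
    """Return the masking ratio to use at a given training step.

    Assumption: a linear ramp from start_ratio to end_ratio over warmup_steps.
    The schedule shape and the default values are hypothetical.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return end_ratio
    progress = max(step, 0) / warmup_steps
    return start_ratio + (end_ratio - start_ratio) * progress

# Early steps keep most tokens visible; after warmup everything is masked.
schedule = [masking_warmup_ratio(s, warmup_steps=100) for s in (0, 50, 100, 500)]
```

Other monotone shapes (cosine, step-wise) would fit the same description equally well; the point is only that the ratio grows monotonically toward full masking.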

[88] MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

Ashutosh Chaubey,Jiacheng Pang,Mohammad Soleymani

Main category: cs.CV

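TL;DR: This paper proposes Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni-modal large language models (omni LLMs); modality-aware regularization and a language-prior debiasing penalty substantially reduce cross-modal hallucinations.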

Details Motivation: Omni LLMs perform well on audiovisual understanding tasks but are prone to cross-modal hallucinations caused by spurious modality correlations and strong language priors. Method: MoD-DPO adds modality-aware regularization terms that increase sensitivity to perturbations in relevant modalities and invariance to corruptions in irrelevant ones, together with a language-prior debiasing penalty that suppresses text-only, hallucination-prone responses. Result: On multiple audiovisual hallucination benchmarks, MoD-DPO consistently improves perception accuracy and hallucination resistance under similar training budgets, outperforming existing preference optimization baselines. Conclusion: Modality-faithful alignment matters, and MoD-DPO offers a scalable path toward more reliable and robust multimodal foundation models. Abstract: Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
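For context, the standard DPO objective that MoD-DPO builds on can be written in a few lines. The sketch below covers only that base loss for a single preference pair; the modality-aware regularizers and the language-prior penalty are not shown because the abstract does not specify their form. The function names and the default `beta` are illustrative.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy and under a frozen
    reference model. The loss is -log(sigmoid(beta * margin)).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return softplus(-beta * margin)

# The loss falls below log(2) once the policy prefers the chosen response
# more strongly than the reference model does.
loss = dpo_loss(-1.0, -3.0, ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
```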

[89] VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

Jinxiang Lai,Zexin Lu,Jiajun He,Rongwei Quan,Wenzhe Zhao,Qinyu Yang,Qi Chen,Qin Lin,Chuyue Li,Tao Gao,Yuhao Shan,Shuai Shao,Song Guo,Qinglin Lu

Main category: cs.CV

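TL;DR: This paper proposes VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities in an end-to-end learnable framework to generate complex visual content autonomously.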

Details Motivation: General models lack a deep understanding of design conventions and creative workflows, while workflow-based agents lack the specialized knowledge needed for autonomous creative planning. Method: Build the VisGenData-4k dataset, using a metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; propose the VisionCreator model, optimized with Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) in a high-fidelity simulated environment; establish the VisGenBench benchmark, with 1.2k test samples across diverse scenarios, for evaluating multi-step visual creation. Result: The VisionCreator-8B/32B models outperform larger closed-source models on multiple evaluation dimensions. Conclusion: VisionCreator lays a foundation for research on visual-generation agentic systems and advances end-to-end learnable visual agents with UTPC capabilities. Abstract: Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows, capabilities that are challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology using metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) the VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) remarkably, our VisionCreator-8B/32B models demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.

[90] ReCo-Diff: Residual-Conditioned Deterministic Sampling for Cold Diffusion in Sparse-View CT

Yong Eun Choi,Hyoung Suk Park,Kiwan Jeon,Hyun-Cheol Park,Sung Ho Kang

Main category: cs.CV

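TL;DR: This paper proposes ReCo-Diff, a framework that improves sparse-view CT reconstruction through residual-conditioned self-guided sampling, addressing the sampling instability and error accumulation of existing cold diffusion models.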

Details Motivation: Existing cold diffusion models for sparse-view CT reconstruction rely on heuristic or fixed sampling strategies and are sensitive to error accumulation and sampling instability. Method: At each sampling step, ReCo-Diff first produces an unconditioned baseline reconstruction and then conditions the next prediction on the observation residual between the predicted image and the sparse-view measurement, giving residual-driven self-guided sampling. Result: Experiments show that ReCo-Diff outperforms existing cold diffusion baselines in reconstruction accuracy, sampling stability, and robustness under severe sparsity. Conclusion: Residual conditioning makes diffusion models for inverse problems more measurement-aware and makes deterministic sampling more reliable, offering a new approach to sparse-view CT reconstruction. Abstract: Cold and generalized diffusion models have recently shown strong potential for sparse-view CT reconstruction by explicitly modeling deterministic degradation processes. However, existing sampling strategies often rely on ad hoc sampling controls or fixed schedules, which remain sensitive to error accumulation and sampling instability. We propose ReCo-Diff, a residual-conditioned diffusion framework that leverages observation residuals through residual-conditioned self-guided sampling. At each sampling step, ReCo-Diff first produces a null (unconditioned) baseline reconstruction and then conditions subsequent predictions on the observation residual between the predicted image and the measured sparse-view input. This residual-driven guidance provides continuous, measurement-aware correction while preserving a deterministic sampling schedule, without requiring heuristic interventions. Experimental results demonstrate that ReCo-Diff consistently outperforms existing cold diffusion sampling baselines, achieving higher reconstruction accuracy, improved stability, and enhanced robustness under severe sparsity.

[91] FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution

Aro Kim,Myeongjin Jang,Chaewon Moon,Youngjin Shin,Jinwoo Jeong,Sang-hyo Park

Main category: cs.CV

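TL;DR: This paper proposes FiDeSR, a high-fidelity, detail-preserving one-step diffusion super-resolution framework that uses a detail-aware weighting strategy, low- and high-frequency adaptive enhancers, and residual-in-residual noise refinement to improve visual quality and content fidelity in real-world image super-resolution.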

Details Motivation: Existing diffusion-based super-resolution methods struggle to preserve fine details and reconstruct with high fidelity at the same time, which leads to suboptimal visual quality. Method: FiDeSR uses a detail-aware weighting strategy during training, low- and high-frequency adaptive enhancers during inference, and a residual-in-residual noise refinement module that corrects errors in the predicted diffusion noise. Result: FiDeSR outperforms existing diffusion-based methods on real-world super-resolution, producing outputs with both high perceptual quality and faithful content restoration. Conclusion: By combining multi-stage detail enhancement with noise refinement, FiDeSR reconciles detail preservation with fidelity in diffusion-based super-resolution and offers a better option for practical use. Abstract: Diffusion-based approaches have recently driven remarkable progress in real-world image super-resolution (SR). However, existing methods still struggle to simultaneously preserve fine details and ensure high-fidelity reconstruction, often resulting in suboptimal visual quality. In this paper, we propose FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework. During training, we introduce a detail-aware weighting strategy that adaptively emphasizes regions where the model exhibits higher prediction errors. During inference, low- and high-frequency adaptive enhancers further refine the reconstruction without requiring model retraining, enabling flexible enhancement control. To further improve the reconstruction accuracy, FiDeSR incorporates a residual-in-residual noise refinement, which corrects prediction errors in the diffusion noise and enhances fine detail recovery. FiDeSR achieves superior real-world SR performance compared to existing diffusion-based methods, producing outputs with both high perceptual quality and faithful content restoration. The source code will be released at: https://github.com/Ar0Kim/FiDeSR.
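One way to read "emphasizes regions where the model exhibits higher prediction errors" is as an error-dependent weight on a per-pixel loss. The sketch below is a guess at that general idea, not the paper's formula: the weighting rule, the `alpha` parameter, and both function names are hypothetical.

```python
def detail_aware_weights(pred, target, alpha=1.0):
    """Per-pixel weights that grow with the current absolute error.

    Hypothetical rule: weight = 1 + alpha * error / max_error, so the
    worst-predicted pixel gets weight 1 + alpha and a perfect pixel gets 1.
    """
    errors = [abs(p - t) for p, t in zip(pred, target)]
    peak = max(errors)
    if peak == 0:
        return [1.0] * len(errors)
    return [1.0 + alpha * e / peak for e in errors]

def weighted_l1(pred, target, alpha=1.0):
    """L1 loss in which high-error pixels count more (weights sum-normalised)."""
    weights = detail_aware_weights(pred, target, alpha)
    total = sum(w * abs(p - t) for w, p, t in zip(weights, pred, target))
    return total / sum(weights)

# Flattened toy "images": the third pixel has the largest error and weight.
loss = weighted_l1([0.0, 0.0, 0.0], [0.0, 1.0, 2.0])
```

In a gradient-based trainer the weights would normally be treated as constants (detached from the graph) so that they rescale the loss without contributing gradients of their own.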

[92] ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

Jiayi Zhu,Jianing Zhang,Yiying Yang,Wei Cheng,Xiaoyun Yuan

Main category: cs.CV

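TL;DR: ShareVerse is a video generation framework for multi-agent shared world modeling; by building a multi-view interactive dataset and adding a spatial concatenation strategy and cross-agent attention, it generates large-scale videos that are geometrically consistent and consistent across agents.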

Details Motivation: Existing work does not support multiple agents jointly building a unified shared world, which makes shared world modeling in dynamic environments difficult. Method: (1) Build a multi-agent interactive video dataset with front, rear, left, and right views on the CARLA platform; (2) propose a four-view spatial concatenation strategy to model a broader environment and keep multi-view geometry consistent; (3) add cross-agent attention blocks to a pretrained video model so that spatial-temporal information is exchanged between agents. Result: ShareVerse supports 49-frame large-scale video generation, accurately perceives the positions of dynamic agents, and models the shared world consistently in both overlapping and non-overlapping regions. Conclusion: ShareVerse offers a new paradigm for multi-agent video generation and shared world modeling, and improves the spatial consistency and semantic plausibility of the generated results. Abstract: This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

[93] Intelligent Pathological Diagnosis of Gestational Trophoblastic Diseases via Visual-Language Deep Learning Model

Yuhang Liu,Yueyang Cang,Wenge Que,Xinru Bai,Xingtong Wang,Kuisheng Chen,Jingya Li,Xiaoteng Zhang,Xinmin Li,Lixia Zhang,Pingge Hu,Qiaoting Xie,Peiyu Xu,Xianxu Zeng,Li Shi

Main category: cs.CV

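TL;DR: This paper presents GTDoctor, an expert model, and GTDiagnosis, its companion software, for the pathological diagnosis of gestational trophoblastic disease (GTD); together they substantially improve diagnostic accuracy, speed, and consistency.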

Details Motivation: Pathological diagnosis of GTD is slow, depends on the pathologist's experience, and shows low consistency at initial diagnosis, which threatens maternal and infant health. Method: Develop the GTDoctor expert model, which performs pixel-level lesion segmentation on pathological slides and outputs diagnostic conclusions with personalized analysis; build the GTDiagnosis software system on top of it and run retrospective and prospective clinical trials. Result: In the retrospective study, mean precision for lesion detection exceeded 0.91 (n=679); in the prospective study, pathologists using the tool reached a positive predictive value of 95.59% (n=68), and per-case diagnostic time fell from 56 to 16 seconds (n=285). Conclusion: GTDoctor and GTDiagnosis offer a new approach to GTD pathological diagnosis that improves diagnostic performance and efficiency while keeping clinical interpretability. Abstract: The pathological diagnosis of gestational trophoblastic disease (GTD) takes a long time, relies heavily on the experience of pathologists, and the consistency of initial diagnosis is low, which seriously threatens maternal health and reproductive outcomes. We developed an expert model for GTD pathological diagnosis, named GTDoctor. GTDoctor can perform pixel-based lesion segmentation on pathological slides, and output diagnostic conclusions and personalized pathological analysis results. We developed a software system, GTDiagnosis, based on this technology and conducted clinical trials. The retrospective results demonstrated that GTDiagnosis achieved a mean precision of over 0.91 for lesion detection in pathological slides (n=679 slides). In prospective studies, pathologists using GTDiagnosis attained a Positive Predictive Value of 95.59% (n=68 patients). The tool reduced average diagnostic time from 56 to 16 seconds per case (n=285 patients). GTDoctor and GTDiagnosis offer a novel solution for GTD pathological diagnosis, enhancing diagnostic performance and efficiency while maintaining clinical interpretability.
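The reported figures can be put in context with two small helpers. The 56 and 16 second timings come from the abstract; the counts passed to `positive_predictive_value` are purely illustrative, since the abstract does not report the underlying confusion matrix.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP): the share of positive calls that were correct."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        raise ValueError("no positive predictions")
    return true_positives / predicted_positive

def relative_time_reduction(before_seconds, after_seconds):
    """Fraction of the original time that was saved."""
    return (before_seconds - after_seconds) / before_seconds

# Timings reported in the abstract: 56 s before, 16 s with the tool.
reduction = relative_time_reduction(56, 16)   # 40 / 56, roughly a 71% reduction
speedup = 56 / 16                             # 3.5x faster per case

# Hypothetical counts, chosen only to show how a PPV near 95.59% could arise.
example_ppv = positive_predictive_value(true_positives=65, false_positives=3)
```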

[94] MiM-DiT: MoE in MoE with Diffusion Transformers for All-in-One Image Restoration

Lingshun Kong,Jiawei Zhang,Zhengpeng Duan,Xiaohe Wu,Yueqi Yang,Xiaotao Wang,Dongqing Zou,Lei Lei,Jinshan Pan

Main category: cs.CV

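TL;DR: This paper proposes a unified image restoration framework that combines a dual-level Mixture-of-Experts (MoE) with a pretrained diffusion model; coarse-grained adaptation across degradation types and fine-grained modulation within each type let it handle several real-world degradations (haze, blur, noise, low light) jointly and efficiently.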

Details Motivation: A single model struggles to handle very different degradation types (haze, blur, noise, low light), so a unified yet flexible restoration framework is needed. Method: Build a dual-level MoE architecture: the Inter-MoE layer adaptively combines expert groups according to degradation type, the Intra-MoE layer selects sub-experts within each group to handle fine-grained intra-class variation, and the whole is integrated with a pretrained diffusion model. Result: Clearly better than current state-of-the-art methods on multiple image restoration tasks. Conclusion: Pairing a dual-level MoE with a diffusion model captures both the coarse distinction between degradation types and the fine differences within them, giving an efficient and scalable paradigm for all-in-one image restoration. Abstract: All-in-one image restoration is challenging because different degradation types, such as haze, blur, noise, and low-light, impose diverse requirements on restoration strategies, making it difficult for a single model to handle them effectively. In this paper, we propose a unified image restoration framework that integrates a dual-level Mixture-of-Experts (MoE) architecture with a pretrained diffusion model. The framework operates at two levels: the Inter-MoE layer adaptively combines expert groups to handle major degradation types, while the Intra-MoE layer further selects specialized sub-experts to address fine-grained variations within each type. This design enables the model to achieve coarse-grained adaptation across diverse degradation categories while performing fine-grained modulation for specific intra-class variations, ensuring high specialization in handling complex, real-world corruptions. Extensive experiments demonstrate that the proposed method performs favorably against the state-of-the-art approaches on multiple image restoration tasks.
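The two routing levels can be sketched with scalar "experts". This is a minimal illustration that assumes router logits are given directly (in a real model they would come from learned routers), soft mixing at the inter level, and top-k selection at the intra level; all names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(weights, k):
    """Return (index, renormalised weight) pairs for the k largest weights."""
    chosen = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    total = sum(weights[i] for i in chosen)
    return [(i, weights[i] / total) for i in chosen]

def two_level_moe(x, group_scores, sub_scores, experts, k_sub=1):
    """experts[g][e] is a callable; group_scores and sub_scores[g] are logits."""
    group_weights = softmax(group_scores)            # inter level: mix expert groups
    output = 0.0
    for g, group_weight in enumerate(group_weights):
        picks = top_k(softmax(sub_scores[g]), k_sub)  # intra level: pick sub-experts
        group_out = sum(w * experts[g][e](x) for e, w in picks)
        output += group_weight * group_out
    return output

# Two groups of two toy experts each; equal group logits mix the groups 50/50.
toy_experts = [[lambda v: v + 1, lambda v: v * 2], [lambda v: -v, lambda v: v * v]]
result = two_level_moe(3.0, [0.0, 0.0], [[5.0, 0.0], [0.0, 5.0]], toy_experts)
```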

[95] From "What" to "How": Constrained Reasoning for Autoregressive Image Generation

Ruxue Yan,Xubo Liu,Wenya Guo,Zhengkun Zhang,Ying Zhang,Xiaojie Yuan

Main category: cs.CV

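TL;DR: This paper proposes CoR-Painter, a framework that introduces Constrained Reasoning to realize a "How-to-What" generation paradigm, explicitly modeling spatial relationships and compositional rules; combined with a Dual-Objective GRPO optimization strategy, it reaches state-of-the-art results on spatial plausibility and related metrics for image generation.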

Details Motivation: Existing autoregressive image generation methods only rewrite the prompt to specify "what" to draw and do not reason about "how" to structure the whole image, which leads to structural problems such as spatial ambiguity and overlapping objects. Method: CoR-Painter (1) derives explicit visual constraints (spatial relationships, key attributes, compositional rules) from the text prompt, which is the "how to draw" reasoning; (2) generates a more structured detailed description ("what to draw") based on those constraints; (3) uses a Dual-Objective GRPO strategy to jointly optimize the textual reasoning and the visual projection processes. Result: State-of-the-art performance on T2I-CompBench, GenEval, and WISE, with clear gains on spatial metrics (for example, +5.41% on T2I-CompBench). Conclusion: The "How-to-What" paradigm with constrained reasoning improves the structural plausibility and visual fidelity of autoregressive image generation and opens a path to controllable, interpretable text-to-image generation. Abstract: Autoregressive image generation has seen recent improvements with the introduction of chain-of-thought and reinforcement learning. However, current methods merely specify "What" details to depict by rewriting the input prompt, yet fundamentally fail to reason about "How" to structure the overall image. This inherent limitation gives rise to persistent issues, such as spatial ambiguity directly causing unrealistic object overlaps. To bridge this gap, we propose CoR-Painter, a novel framework that pioneers a "How-to-What" paradigm by introducing Constrained Reasoning to guide the autoregressive generation. Specifically, it first deduces "How to draw" by deriving a set of visual constraints from the input prompt, which explicitly govern spatial relationships, key attributes, and compositional rules. These constraints steer the subsequent generation of a detailed description "What to draw", providing a structurally sound and coherent basis for accurate visual synthesis. Additionally, we introduce a Dual-Objective GRPO strategy that specifically optimizes the textual constrained reasoning and visual projection processes to ensure the coherence and quality of the entire generation pipeline. Extensive experiments on T2I-CompBench, GenEval, and WISE demonstrate that our method achieves state-of-the-art performance, with significant improvements in spatial metrics (e.g., +5.41% on T2I-CompBench).
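GRPO replaces a learned value baseline with statistics computed over a group of samples drawn for the same prompt. The sketch below shows that group-relative normalisation only. How the paper combines its two objectives is not stated in the abstract, so normalising each reward stream separately, as done at the end, is an assumption, and the reward values are made up.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps).

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four samples from one prompt: one stream scores the
# textual reasoning, the other scores the rendered image.
text_advantages = group_relative_advantages([0.2, 0.9, 0.4, 0.5])
image_advantages = group_relative_advantages([0.7, 0.6, 0.1, 0.8])
```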

[96] TenExp: Mixture-of-Experts-Based Tensor Decomposition Structure Search Framework

Ting-Wei Zhou,Xi-Le Zhao,Sheng Liu,Wei-Hao Wu,Yu-Bang Zheng,Deyu Meng

Main category: cs.CV

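TL;DR: This paper proposes TenExp, a framework that uses a mixture-of-experts mechanism for dynamic, unsupervised tensor decomposition structure search; it removes the restriction to a fixed factor-interaction family, supports both single and mixed decompositions, and comes with a theoretical error bound.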

Details Motivation: Choosing a tensor decomposition that accurately captures the low-rank structure behind the data remains difficult; existing structure search methods are confined to a fixed factor-interaction family (e.g., tensor contraction) and cannot produce mixtures of decompositions. Method: Design TenExp, a mixture-of-experts-based tensor decomposition structure search framework that selects and activates suitable decompositions dynamically and without supervision, and derive an upper bound on its approximation error. Result: TenExp clearly outperforms existing tensor decomposition methods on synthetic and real datasets, offering both the flexibility of a single decomposition and the expressiveness of a mixture. Conclusion: TenExp addresses both the fixed-interaction restriction and the lack of mixtures in tensor decomposition structure search, and is both practical and theoretically grounded. Abstract: New tensor decompositions continue to emerge and receive increasing attention. Selecting a suitable tensor decomposition to exactly capture the low-rank structures behind the data is at the heart of the tensor decomposition field, which remains a challenging and relatively under-explored problem. Current tensor decomposition structure search methods are still confined by a fixed factor-interaction family (e.g., tensor contraction) and cannot deliver the mixture of decompositions. To address this problem, we design a mixture-of-experts-based tensor decomposition structure search framework (termed TenExp), which allows us to dynamically select and activate suitable tensor decompositions in an unsupervised fashion. This framework enjoys two unique advantages over the state-of-the-art tensor decomposition structure search methods. First, TenExp can provide a suitable single decomposition beyond a fixed factor-interaction family. Second, TenExp can deliver a suitable mixture of decompositions beyond a single decomposition. Theoretically, we also provide the approximation error bound of TenExp, which reveals the approximation capability of TenExp. Extensive experiments on both synthetic and realistic datasets demonstrate the superiority of the proposed TenExp compared to the state-of-the-art tensor decomposition-based methods.

[97] Cross-view geo-localization, Image retrieval, Multiscale geometric modeling, Frequency domain enhancement

Hongying Zhang,ShuaiShuai Ma

Main category: cs.CV

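TL;DR: This paper proposes SFDE, a cross-view geo-localization method that models the spatial and frequency domains jointly to reduce geometric asymmetry and texture inconsistency, improving localization performance while staying lightweight.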

Details Motivation: Existing CVGL methods rely on spatial-domain feature alignment, which copes poorly with large viewpoint changes and local disturbances; geometric asymmetry, texture inconsistency, and the degradation of local discriminative information add to the difficulty. Method: Propose the Spatial and Frequency Domain Enhancement Network (SFDE), a three-branch parallel architecture that models global semantic context, local geometric structure, and frequency-domain statistical stability; the complementary multi-granularity features are jointly optimized in a unified embedding space through progressive enhancement and coupled constraints. Result: SFDE matches or surpasses current state-of-the-art methods on several benchmarks while remaining lightweight and computationally efficient. Conclusion: Combining spatial and frequency-domain representations captures cross-view consistency more robustly, and SFDE offers a new direction for visual localization in GNSS-denied environments. Abstract: Cross-view geo-localization (CVGL) aims to establish spatial correspondences between images captured from significantly different viewpoints and constitutes a fundamental technique for visual localization in GNSS-denied environments. Nevertheless, CVGL remains challenging due to severe geometric asymmetry, texture inconsistency across imaging domains, and the progressive degradation of discriminative local information. Existing methods predominantly rely on spatial domain feature alignment, which is inherently sensitive to large-scale viewpoint variations and local disturbances. To alleviate these limitations, this paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), which leverages complementary representations from spatial and frequency domains. SFDE adopts a three-branch parallel architecture to model global semantic context, local geometric structure, and statistical stability in the frequency domain, respectively, thereby characterizing consistency across domains from the perspectives of scene topology, multiscale structural patterns, and frequency invariance. The resulting complementary features are jointly optimized in a unified embedding space via progressive enhancement and coupled constraints, enabling the learning of cross-view representations with consistency across multiple granularities. Comprehensive experiments show that SFDE achieves competitive performance and in many cases even surpasses state-of-the-art methods, while maintaining a lightweight and computationally efficient design. Our code is available at https://github.com/Mashuaishuai669/SFDE.

[98] Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation

Hongbo Zheng,Afshin Bozorgpour,Dorit Merhof,Minjia Zhang

Main category: cs.CV

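TL;DR: This paper proposes PVT-GDLA, a decoder-centric Transformer with linear complexity; its Gated Differential Linear Attention (GDLA) improves boundary accuracy and long-range modeling in medical image segmentation, balances efficiency with performance, and reaches state-of-the-art results on multi-modality medical imaging benchmarks.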

Details Motivation: Medical image segmentation must preserve fine anatomical boundaries while staying efficient enough for clinical deployment; standard Transformers are costly and data-hungry, CNNs lack global reasoning, and existing linear attention suffers from training instability and diluted attention. Method: PVT-GDLA uses a pretrained PVT as the encoder and GDLA as the core of the decoder: two kernelized attention paths are computed in parallel on complementary subspaces and subtracted, with a learnable channel-wise scale and a lightweight head-specific gate adding nonlinearity and sparsity, plus a local branch with depthwise convolution that strengthens neighboring-token interactions; the whole keeps O(N) time complexity and a low parameter count. Result: Across CT, MRI, ultrasound, and dermoscopy segmentation tasks and under equal training budgets, it outperforms CNN, standard Transformer, hybrid, and other linear-attention baselines, with comparable parameters but lower FLOPs. Conclusion: PVT-GDLA offers an efficient, scalable, high-fidelity option for medical image segmentation in resource-constrained clinical settings. Abstract: Medical image segmentation requires models that preserve fine anatomical boundaries while remaining efficient for clinical deployment. While transformers capture long-range dependencies, they suffer from quadratic attention cost and large data requirements, whereas CNNs are compute-friendly yet struggle with global reasoning. Linear attention offers $\mathcal{O}(N)$ scaling, but often exhibits training instability and attention dilution, yielding diffuse maps. We introduce PVT-GDLA, a decoder-centric Transformer that restores sharp, long-range dependencies at linear time. Its core, Gated Differential Linear Attention (GDLA), computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable, channel-wise scale to cancel common-mode noise and amplify relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity, mitigating attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions, improving boundary fidelity, all while retaining $\mathcal{O}(N)$ complexity and low parameter overhead. Coupled with a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines. PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.
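The differential step in GDLA can be illustrated with a dependency-free sketch of kernelized linear attention. Several choices here are assumptions rather than details from the abstract: the `elu(x) + 1` feature map (a common choice for linear attention), a single head, and a sigmoid gate applied per token. The local depthwise-convolution branch is omitted.

```python
import math

def feature_map(vec):
    """Positive kernel feature map elu(x) + 1."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def linear_attention(queries, keys, values):
    """O(N) attention: build key-value summaries once, then read them per query."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]   # sum over tokens of phi(k) v^T
    k_sum = [0.0] * d_k                       # sum over tokens of phi(k)
    for key, value in zip(keys, values):
        phi_k = feature_map(key)
        for a in range(d_k):
            k_sum[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * value[b]
    outputs = []
    for query in queries:
        phi_q = feature_map(query)
        norm = sum(p * s for p, s in zip(phi_q, k_sum))
        outputs.append([sum(phi_q[a] * kv[a][b] for a in range(d_k)) / norm
                        for b in range(d_v)])
    return outputs

def gated_differential_attention(q1, k1, q2, k2, values, scale, gate_logits):
    """Subtract a second attention path (channel-wise scale), then gate each token."""
    path_a = linear_attention(q1, k1, values)
    path_b = linear_attention(q2, k2, values)
    out = []
    for row_a, row_b, logit in zip(path_a, path_b, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-logit))
        out.append([gate * (a - scale[c] * b)
                    for c, (a, b) in enumerate(zip(row_a, row_b))])
    return out

# If both paths are identical and the scale is 1, the shared component cancels.
q = [[0.5, -1.0], [1.0, 2.0]]
k = [[0.1, 0.3], [-0.4, 0.8]]
v = [[1.0, 2.0], [1.0, 2.0]]
cancelled = gated_differential_attention(q, k, q, k, v, [1.0, 1.0], [0.0, 0.0])
```

The cancellation in the last line is the "common-mode" intuition from the abstract in its simplest form: whatever both paths attend to equally is removed, and only the difference survives.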

[99] CoShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model

Waqas Ahmed,Dean Diepeveen,Ferdous Sohel

Main category: cs.CV

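TL;DR: This paper proposes a multi-object shadow generation method built on a pretrained text-to-image diffusion model: an image pathway provides spatial guidance, a text pathway encodes the shadow location of each object, and an attention-alignment loss sharpens localization, giving state-of-the-art results for both single- and multi-object shadow generation.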

Details Motivation: Existing shadow generation methods target the insertion of a single object and struggle to keep shadows jointly consistent in geometry, attachment, and location when several foreground objects are composited at once, even though multi-object compositing is common in practice. Method: Exploit the multimodal capabilities of a pretrained text-to-image diffusion model: the image pathway injects dense multi-scale features for fine-grained spatial guidance; the text pathway encodes each object's shadow bounding box as learned positional tokens and fuses them through cross-attention; an attention-alignment loss aligns these tokens with their shadow regions. Result: On an augmented DESOBAv2 dataset, which adds multi-object composite scenes and combined prompts, the method achieves state-of-the-art performance in both single- and multi-object shadow generation. Conclusion: This work addresses the under-explored problem of multi-object shadow generation; the dual-pathway diffusion framework produces physically plausible, spatially consistent shadows for multiple objects and offers a new direction for realistic image compositing. Abstract: Realistic shadow generation is crucial for achieving seamless image compositing, yet existing methods primarily focus on single-object insertion and often fail to generalize when multiple foreground objects are composited into a background scene. In practice, however, modern compositing pipelines and real-world applications often insert multiple objects simultaneously, necessitating shadows that are jointly consistent in terms of geometry, attachment, and location. In this paper, we address the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. An image pathway injects dense, multi-scale features to provide fine-grained spatial guidance, while a text-based pathway encodes per-object shadow bounding boxes as learned positional tokens and fuses them via cross-attention. An attention-alignment loss further grounds these tokens to their corresponding shadow regions. To support this task, we augment the DESOBAv2 dataset by constructing composite scenes with multiple inserted objects and automatically derive prompts combining object category and shadow positioning information. Experimental results demonstrate that our method achieves state-of-the-art performance in both single and multi-object shadow generation settings.

[100] iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

HanZpeng Liu,Yaqian Li,Zidan Wang,Shuoxi Zhang,Zihao Bo,Rinyoichi Takezoe,Kaiwen Long,Kun He

Main category: cs.CV

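TL;DR: This paper proposes iGVLM, a framework that performs instruction-guided visual modulation through a decoupled dual-branch architecture (a frozen representation branch plus a dynamic conditioning branch); it removes the representation bottleneck caused by static vision encoders in current large vision-language models and markedly improves instruction sensitivity and logical consistency across multiple queries.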

Details Motivation: Current large vision-language models rely on static, instruction-agnostic vision encoders, which limits fine-grained reasoning that needs task-specific visual cues. Method: iGVLM uses a decoupled dual-branch architecture: a frozen representation branch keeps the task-agnostic visual representations learned in pre-training, and a dynamic conditioning branch applies affine feature modulation through Adaptive Layer Normalization (AdaLN); the authors also build MM4, a diagnostic probe for logical consistency under multi-query, multi-instruction settings. Result: iGVLM markedly improves instruction sensitivity on standard benchmarks and on the new MM4 probe, works with a range of language backbones, and is plug-and-play. Conclusion: iGVLM bridges passive perception and active reasoning, giving LVLMs a more flexible, stable, and task-adaptive visual representation mechanism. Abstract: Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
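Affine modulation through adaptive layer normalization can be sketched as follows. The `(1 + scale)` form follows a common AdaLN convention and is an assumption here, as are the function names and the use of plain linear maps from the instruction embedding; the abstract does not give the exact parameterisation.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weights, bias, vec):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def adaln_modulate(visual_token, instruction_emb, w_scale, b_scale, w_shift, b_shift):
    """Scale and shift normalised visual features using the instruction embedding."""
    scale = linear(w_scale, b_scale, instruction_emb)
    shift = linear(w_shift, b_shift, instruction_emb)
    normed = layer_norm(visual_token)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]

# With all-zero modulation weights the branch reduces to plain layer norm, so
# the instruction has no effect until those weights are trained.
token = [1.0, 2.0, 3.0]
instruction = [0.5, -0.5]
zero_w = [[0.0, 0.0] for _ in range(3)]
zero_b = [0.0] * 3
unchanged = adaln_modulate(token, instruction, zero_w, zero_b, zero_w, zero_b)
```

Starting from zero modulation is one plausible way to keep pre-trained visual features stable early in training, though whether iGVLM initialises its conditioning branch this way is not stated in the abstract.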

[101] Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing

Yi Liu,Jing Zhang,Di Wang,Xiaoyu Tian,Haonan Guo,Bo Du

Main category: cs.CV

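TL;DR: This paper introduces RSHBench, a benchmark for fine-grained diagnosis of factual and logical hallucinations in remote sensing visual question answering, and RADAR, a training-free inference method that uses the intrinsic attention of MLLMs for progressive localization and fine-grained local reasoning, substantially reducing hallucinations caused by visual grounding failures in large scenes and misjudged small targets.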

Details Motivation: Multimodal large language models (MLLMs) hallucinate noticeably in remote sensing visual question answering (RS-VQA), mainly because of visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. Method: The authors introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis, and RADAR, a training-free inference method that uses the intrinsic attention of MLLMs to guide progressive localization and fine-grained local reasoning at test time. Result: Extensive experiments across several MLLMs show that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Conclusion: RADAR is a lightweight, general, training-free inference enhancement that systematically mitigates hallucinations caused by visual grounding failures in RS-VQA, and RSHBench provides an interpretable diagnostic tool for follow-up research. Abstract: Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: https://github.com/MiliLab/RADAR

[102] ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

HanZpeng Liu,Yaqian Li,Zidan Wang,Shuoxi Zhang,Zonglin Zhao,Zihao Bo,Rinyoichi Takezoe,Kaiwen Long,Kun He

Main category: cs.CV

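TL;DR: This paper proposes ITO, a framework that combines multimodal multiple alignment with a lightweight training-time multimodal fusion module to address the fact that representations from image-text contrastive pretraining remain partly organized by modality, improving cross-modal representation quality while keeping inference efficient.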

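Details Motivation: Representations produced by existing image-text contrastive pretraining methods remain partly organized by modality, which limits cross-modal semantic alignment and generalization. Method: ITO combines two synergistic mechanisms: (1) multimodal multiple alignment, which mines diverse image-text correspondences to enrich the supervision signal; (2) a lightweight training-time multimodal fusion module that enforces structured cross-modal interaction and is removed at inference to keep dual-encoder efficiency. Result: ITO consistently outperforms strong baselines on classification, retrieval, and multimodal benchmarks; the analysis indicates that multiple alignment improves discriminative power, while training-time fusion acts as a key structural regularizer that removes the modality gap, stabilizes training dynamics, and prevents early saturation of contrastive learning. Conclusion: ITO mitigates modality separation in contrastive pretraining and markedly improves cross-modal representation quality and training stability without sacrificing inference efficiency. Abstract: Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.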

[103] HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning

Zihao Peng,Nan Zou,Jiandian Zeng,Guo Li,Ke Chen,Boyuan Li,Tian Wang

Main category: cs.CV

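TL;DR: This paper proposes HiLoRA, a hierarchical LoRA framework for efficiently fine-tuning Vision Transformers in federated learning. Adapters at the root, cluster, and leaf tiers capture global, subgroup, and client-specific knowledge respectively, and a LoRA-subspace adaptive clustering mechanism improves both personalization and generalization.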

Details Motivation: Existing LoRA-based federated fine-tuning methods overlook the latent client structure found in real-world settings, which limits shared representation learning and hinders adaptation to unseen clients. Method: HiLoRA places LoRA adapters at three tiers (root, cluster, and leaf), introduces cross-tier orthogonality constraints and cascaded optimization, designs a LoRA-Subspace Adaptive Clustering mechanism that infers client groups from subspace similarity, and provides a tier-wise generalization analysis. Result: Experiments with ViT backbones on CIFAR-100 and DomainNet show consistent gains in both personalization and generalization. Conclusion: Through structure-aware hierarchical adaptation and subspace clustering, HiLoRA improves communication efficiency, knowledge sharing, and cross-client generalization for ViTs in federated learning. Abstract: Vision Transformers (ViTs) have been widely adopted in vision tasks due to their strong transferability. In Federated Learning (FL), where full fine-tuning is communication heavy, Low-Rank Adaptation (LoRA) provides an efficient and communication-friendly way to adapt ViTs. However, existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, limiting shared representation learning and hindering effective adaptation to unseen clients. To address this, we propose HiLoRA, a hierarchical LoRA framework that places adapters at three levels: root, cluster, and leaf, each designed to capture global, subgroup, and client-specific knowledge, respectively. Through cross-tier orthogonality and cascaded optimization, HiLoRA separates update subspaces and aligns each tier with its residual personalized objective. In particular, we develop a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups via subspace similarity analysis, thereby facilitating knowledge sharing across structurally aligned clients. Theoretically, we establish a tier-wise generalization analysis that supports HiLoRA's design. Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization.
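One way to picture the three-tier adapter idea from the HiLoRA abstract is the sketch below, which adds root, cluster, and leaf low-rank updates to a frozen weight matrix and measures how much two tiers overlap. The additive composition, the Frobenius-norm overlap penalty, and all names here are illustrative assumptions, not the paper's stated formulation.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def effective_weight(w0, adapters):
    """Frozen weight plus one low-rank update B @ A per tier.
    Assumption: tiers combine additively (root + cluster + leaf)."""
    w = w0
    for b, a in adapters:  # B is d_out x r, A is r x d_in
        w = add(w, matmul(b, a))
    return w

def orthogonality_penalty(a_first, a_second):
    """Squared Frobenius norm of A1 @ A2^T; zero when the row spaces
    of the two tiers are orthogonal (a hypothetical penalty form)."""
    cross = matmul(a_first, transpose(a_second))
    return sum(v * v for row in cross for v in row)
```

In a federated setting one would expect the root pair to be shared by all clients, the cluster pair by clients in the same inferred group, and the leaf pair to stay local, though the abstract does not spell out the aggregation protocol.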

[104] Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language

Michelle Stegeman,Lena Philipp,Fennie van der Graaf,Marina D'Amato,Clément Grisi,Luc Builtjes,Joeran S. Bosma,Judith Lefkes,Rianne A. Weber,James A. Meakin,Thomas Koopman,Anne Mickan,Mathias Prokop,Ewoud J. Smit,Geert Litjens,Jeroen van der Laak,Bram van Ginneken,Maarten de Rooij,Henkjan Huisman,Colin Jacobs,Francesco Ciompi,Alessa Hering

Main category: cs.CV

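TL;DR: This paper presents UNICORN, a public benchmark for systematically evaluating medical foundation models. It uses a two-step framework that decouples model inference from few-shot task adaptation and introduces the UNICORN Score as a single measure of performance across modalities and tasks.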

Details Motivation: Existing medical benchmarks are fragmented and lack standardization and reproducibility, making it hard to assess how well medical foundation models generalize across tasks and modalities. Method: The authors build sequestered test sets derived from clinical cohorts, a standardized few-shot adaptation evaluation pipeline, and an open platform, and propose the UNICORN Score as an aggregate performance metric. Result: UNICORN covers multimodal data from more than 2,400 patients at 17 institutions in eight countries (3,700+ vision cases and 2,400+ clinical reports), spans eight anatomical regions and four imaging modalities, and provides both task-level and aggregated leaderboards. Conclusion: UNICORN offers the first unified, public, reproducible multi-task, multi-modality evaluation benchmark for medical foundation models, supporting their reliable development and fair comparison. Abstract: Medical foundation models show promise to learn broadly generalizable features from large, diverse datasets. This could be the base for reliable cross-modality generalization and rapid adaptation to new, task-specific goals, with only a few task-specific examples. Yet, evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks, as existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, limiting assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed indirectly accessible sequestered test sets derived from clinically relevant cohorts, along with standardized evaluation code and a submission interface on an open benchmarking platform. Performance is aggregated into a single UNICORN Score, a new metric that we introduce to support direct comparison of foundation models across diverse medical domains, modalities, and task types. The UNICORN test dataset includes data from more than 2,400 patients, including over 3,700 vision cases and over 2,400 clinical reports collected from 17 institutions across eight countries. The benchmark spans eight anatomical regions and four imaging modalities. Both task-specific and aggregated leaderboards enable accessible, standardized, and reproducible evaluation. By standardizing multi-task, multi-modality assessment, UNICORN establishes a foundation for reproducible benchmarking of medical foundation models. Data, baseline methods, and the evaluation platform are publicly available via unicorn.grand-challenge.org.

[105] VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning

Ruiyang Zhang,Qianguo Sun,Chao Song,Yiyan Qi,Zhedong Zheng

Main category: cs.CV

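TL;DR: This paper proposes VSearcher, a method that turns a static multimodal large model into a multimodal search agent capable of long-horizon, multi-turn tool use (text search, image search, web browsing) in real web environments. It combines Iterative Injection Data Synthesis with an SFT-then-RL training pipeline and a dedicated benchmark, MM-SearchExam, and substantially improves multimodal web search performance.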

Details Motivation: Text-only large language models are limited to a single modality and therefore to narrower applications, while multimodal large models perceive well but cannot access up-to-date web information, so they need the ability to use tools dynamically. Method: The VSearcher framework comprises Iterative Injection Data Synthesis for generating high-quality, complex multimodal QA data, a two-stage SFT-then-RL training strategy, and MM-SearchExam, a multimodal search benchmark built for evaluation. Result: VSearcher performs strongly on multiple multimodal search benchmarks, surpassing recent open-source multimodal search agents and even several proprietary closed-source models. Conclusion: Upgrading a static multimodal model into a search agent that interacts with real web environments is feasible and effective; the key ingredients are high-quality multimodal data construction, RL-driven training for multi-turn tool calling, and a targeted evaluation benchmark. Abstract: Large models are increasingly becoming autonomous agents that interact with real-world environments and use external tools to augment their static capabilities. However, most recent progress has focused on text-only large language models, which are limited to a single modality and therefore have narrower application scenarios. On the other hand, multimodal large models, while offering stronger perceptual capabilities, remain limited to static knowledge and lack the ability to access and leverage up-to-date web information. In this paper, we propose VSearcher, turning static multimodal model into multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments, including text search, image search, and web browsing, via reinforcement learning. Specifically, we introduce Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions, which are further filtered with comprehensive metrics to ensure high quality and sufficient difficulty. We then adopt an SFT-then-RL training pipeline to turn base multimodal models to agent capable of multi-turn tool calling in real-world web environments. Besides, we propose a multimodal search benchmark MM-SearchExam dedicated to evaluating search capabilities of multimodal search agents, which proves highly challenging for recent proprietary models. Extensive evaluations across multiple multimodal search benchmarks reveal effectiveness of our method. VSearcher achieves superior performance compared to recent multimodal search agents and even surpasses several proprietary models on multimodal web search tasks.

[106] R3GW: Relightable 3D Gaussians for Outdoor Scenes in the Wild

Margherita Lea Corona,Wieland Morgenstern,Peter Eisert,Anna Hilsmann

Main category: cs.CV

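TL;DR: This paper proposes R3GW, which separates a scene into a relightable foreground and a sky background and combines physically based rendering with 3D Gaussian Splatting to obtain relightable 3D reconstructions of in-the-wild outdoor scenes under varying illumination.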

Details Motivation: 3D Gaussian Splatting (3DGS) excels at static scene reconstruction and novel view synthesis, but it does not explicitly model illumination, so it is unsuitable for relighting, and it reconstructs poorly from unconstrained in-the-wild photo collections with large lighting changes. Method: R3GW splits the scene into a relightable foreground (modeled by one set of Gaussians) and a non-reflective sky background (modeled by a second set), and combines Physically Based Rendering (PBR) with the 3DGS representation under varying illumination to model view-dependent foreground reflections. Result: Quantitative and qualitative evaluation on the NeRF-OSR dataset shows state-of-the-art performance, photorealistic novel view synthesis under arbitrary illumination, and fewer depth reconstruction artifacts at the sky-foreground boundary. Conclusion: R3GW yields a high-quality, physically consistent relightable 3D Gaussian representation for in-the-wild scenes, considerably extending the applicability of 3DGS under real, complex lighting. Abstract: 3D Gaussian Splatting (3DGS) has established itself as a leading technique for 3D reconstruction and novel view synthesis of static scenes, achieving outstanding rendering quality and fast training. However, the method does not explicitly model the scene illumination, making it unsuitable for relighting tasks. Furthermore, 3DGS struggles to reconstruct scenes captured in the wild by unconstrained photo collections featuring changing lighting conditions. In this paper, we present R3GW, a novel method that learns a relightable 3DGS representation of an outdoor scene captured in the wild. Our approach separates the scene into a relightable foreground and a non-reflective background (the sky), using two distinct sets of Gaussians. R3GW models view-dependent lighting effects in the foreground reflections by combining Physically Based Rendering with the 3DGS scene representation in a varying illumination setting. We evaluate our method quantitatively and qualitatively on the NeRF-OSR dataset, offering state-of-the-art performance and enhanced support for physically-based relighting of unconstrained scenes. Our method synthesizes photorealistic novel views under arbitrary illumination conditions. Additionally, our representation of the sky mitigates depth reconstruction artifacts, improving rendering quality at the sky-foreground boundary.

[107] NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

Tianlin Pan,Jiayi Dai,Chenpu Yuan,Zhengyao Lv,Binxin Yang,Hubery Yin,Chen Li,Jing Lyu,Caifeng Shan,Chenyang Si

Main category: cs.CV

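TL;DR: NOVA is a new framework for video editing without paired data. It combines sparse control (user-edited keyframes) with dense synthesis (preserving the motion and texture of the original video) and uses a degradation-simulation training strategy, markedly improving edit fidelity, motion preservation, and temporal consistency.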

Details Motivation: Existing video editing models depend on large-scale paired data, but high-quality paired data for local video editing is hard to obtain, and existing pair-free methods perform poorly on background and temporal consistency. Method: NOVA consists of a sparse branch, which provides semantic guidance from user-edited keyframes, and a dense branch, which continuously incorporates motion and texture information from the original video; a degradation-simulation training strategy trains on artificially degraded videos so the model learns motion reconstruction and temporal consistency. Result: NOVA outperforms existing methods in edit fidelity, motion preservation, and temporal consistency. Conclusion: NOVA achieves high-quality local video editing without paired data and offers a new paradigm for unsupervised or weakly supervised video editing. Abstract: Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.

[108] Structure-Aware Text Recognition for Ancient Greek Critical Editions

Nicolas Angleraud,Antonia Karamolegkou,Benoît Sagot,Thibault Clérice

Main category: cs.CV

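TL;DR: This paper studies the structure-aware text recognition ability of vision-language models (VLMs) on structurally complex historical documents such as Ancient Greek critical editions. It introduces a synthetic dataset and a benchmark of real scans and evaluates several state-of-the-art VLMs, finding that they perform poorly zero-shot but that a fine-tuned Qwen3VL-8B reaches the best result, a 1.0% character error rate.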

Details Motivation: Existing VLMs struggle with the complex layout semantics of historical scholarly texts that have dense reference hierarchies and extensive marginal annotations (such as Ancient Greek critical editions), so structure-aware text recognition methods are needed. Method: The authors build two new resources: (i) a synthetic corpus of 185,000 page images generated from TEI/XML with controlled typographic and layout variation; (ii) a benchmark of real scanned critical editions spanning more than a century of publishing practice. They evaluate three state-of-the-art VLMs in zero-shot and fine-tuned settings. Result: Current VLMs show substantial limitations on highly structured historical documents; in the zero-shot setting most models fall well behind traditional OCR software; after fine-tuning, Qwen3VL-8B achieves a state-of-the-art median character error rate of 1.0% on real scans. Conclusion: Current VLMs are not yet directly applicable to recognizing the structure of complex historical editions, but adaptation (as with Qwen3VL-8B) shows considerable potential; future work should put more emphasis on layout structure modeling and training specific to historical documents. Abstract: Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
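For readers unfamiliar with the Character Error Rate quoted above, the sketch below computes it in the usual way, as the Levenshtein edit distance between reference and hypothesis divided by the reference length. The function names are illustrative, and the paper's exact evaluation protocol (for example, how Unicode normalization or whitespace is handled) is not described in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions)
    computed with a rolling dynamic-programming row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

A CER of 1.0% therefore corresponds to roughly one character edit per hundred reference characters. For polytonic Greek, both strings would normally be brought to the same Unicode normalization form first, since precomposed and combining diacritics otherwise count as different characters.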

[109] ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink

Douglass Wang

Main category: cs.CV

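TL;DR: This paper proposes ScribeTokens, a digital ink tokenization that decomposes pen movement into unit pixel steps, so that only 10 base tokens are needed to represent any digital ink. Combined with BPE compression and self-supervised next-ink-token prediction pretraining, it clearly outperforms existing vector and token representations on both handwritten text generation and recognition.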

Details Motivation: Existing digital ink representations have drawbacks: continuous vector representations produce long sequences and train unstably, while existing token representations need large vocabularies, run into out-of-vocabulary issues, and recognize worse than vectors. Method: ScribeTokens decomposes pen movement into unit pixel steps which, together with two pen-state tokens, form a fixed 10-token base vocabulary that is then compressed with BPE; next-ink-token prediction is introduced as a self-supervised pretraining strategy. Result: On handwriting generation the CER is 17.33% (far better than 70.29% for vectors); on recognition ScribeTokens beats vectors even without pretraining; with pretraining it reaches CERs of 8.27% on IAM and 9.83% on DeepWriting, the best among all representations. Conclusion: ScribeTokens gives an efficient, robust digital ink representation with a very small fixed vocabulary, and self-supervised pretraining further improves generation and recognition, underlining the importance of careful tokenization design for ink modeling. Abstract: Digital ink -- the coordinate stream captured from stylus or touch input -- lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition. We propose ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, this fixed 10-token base vocabulary suffices to represent any digital ink and enables aggressive BPE compression. On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER), showing tokens are far more effective for generation. On recognition, ScribeTokens is the only token representation to outperform vectors without pretraining. We further introduce next-ink-token prediction as a self-supervised pretraining strategy, which consistently improves recognition across all token-based models and accelerates convergence by up to 83x. With pretraining, ScribeTokens achieves the best recognition results across all representations on both datasets (8.27% CER on IAM, 9.83% on DeepWriting).
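The unit-step idea can be sketched as follows. The abstract only states that a 10-token base vocabulary results from unit pixel steps plus two pen-state tokens; this sketch assumes the remaining eight tokens are the eight neighbor directions on the pixel grid, uses a Bresenham-style walk between consecutive points, and encodes pen-up travel between strokes with the same step tokens. The token names and these choices are hypothetical, and BPE merging is left out.

```python
# Hypothetical token names: eight unit steps (screen coordinates, +y is down).
STEP_TOKENS = {(1, 0): "R", (1, 1): "RD", (0, 1): "D", (-1, 1): "LD",
               (-1, 0): "L", (-1, -1): "LU", (0, -1): "U", (1, -1): "RU"}
PEN_DOWN, PEN_UP = "PEN_DOWN", "PEN_UP"

def unit_steps(x0, y0, x1, y1):
    """Bresenham walk from (x0, y0) to (x1, y1) as a list of unit steps."""
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = (x1 > x0) - (x1 < x0)
    sy = (y1 > y0) - (y1 < y0)
    err, x, y, steps = dx - dy, x0, y0, []
    while (x, y) != (x1, y1):
        e2, step_x, step_y = 2 * err, 0, 0
        if e2 > -dy:
            err -= dy
            step_x = sx
        if e2 < dx:
            err += dx
            step_y = sy
        x, y = x + step_x, y + step_y
        steps.append((step_x, step_y))
    return steps

def tokenize(strokes):
    """Turn strokes (lists of integer pixel points) into base tokens."""
    tokens = []
    cursor = strokes[0][0] if strokes else (0, 0)
    for stroke in strokes:
        # Travel to the stroke start with the pen lifted.
        tokens.extend(STEP_TOKENS[s] for s in unit_steps(*cursor, *stroke[0]))
        tokens.append(PEN_DOWN)
        cursor = stroke[0]
        for point in stroke[1:]:
            tokens.extend(STEP_TOKENS[s] for s in unit_steps(*cursor, *point))
            cursor = point
        tokens.append(PEN_UP)
    return tokens
```

Because every displacement is reduced to the same eight steps, no coordinate value can fall outside the vocabulary, which matches the fixed-vocabulary property the abstract emphasizes; long straight runs then become repeated tokens that a BPE stage could merge.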

[110] BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation

Zihao Zhu,Ruotong Wang,Siwei Lyu,Min Zhang,Baoyuan Wu

Main category: cs.CV

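TL;DR: This paper proposes BrandFusion, a framework that addresses, for the first time, seamless brand integration in text-to-video (T2V) generation. It preserves semantic fidelity to user intent while improving brand recognizability and contextual naturalness, supporting the commercial deployment of T2V.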

Details Motivation: Text-to-video (T2V) models are advancing quickly, but their commercial potential is largely untapped, and existing methods cannot embed advertiser brands into generated videos naturally and faithfully. Method: BrandFusion is a multi-agent framework: in the offline phase it builds a Brand Knowledge Base and adapts to novel brands via lightweight fine-tuning; in the online phase five agents iteratively refine the user prompt, using the knowledge base and real-time contextual tracking to ensure brand visibility and semantic consistency. Result: Experiments on 18 established brands and 2 custom brands across multiple state-of-the-art T2V models show that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness; human evaluation also confirms higher user satisfaction. Conclusion: BrandFusion offers a practical path to commercializing T2V technology, benefiting both brand integration and generation quality. Abstract: The rapid advancement of text-to-video (T2V) models has revolutionized content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity to user intent. This task confronts three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. To address them, we propose BrandFusion, a novel multi-agent framework comprising two synergistic phases. In the offline phase (advertiser-facing), we construct a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine-tuning. In the online phase (user-facing), five agents jointly refine user prompts through iterative refinement, leveraging the shared knowledge base and real-time contextual tracking to ensure brand visibility and semantic alignment. Experiments on 18 established and 2 custom brands across multiple state-of-the-art T2V models demonstrate that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations further confirm higher user satisfaction, establishing a practical pathway for sustainable T2V monetization.

[111] Toward Early Quality Assessment of Text-to-Image Diffusion Models

Huanlei Guo,Hongxin Wei,Bingyi Jing

Main category: cs.CV

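TL;DR: This paper proposes Probe-Select, which predicts final image quality from early denoising activations during text-to-image generation so that low-quality samples can be terminated early, substantially reducing compute while improving the quality of the retained images.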

Details Motivation: Existing text-to-image systems follow a "generate-then-select" paradigm that is resource-intensive and evaluates quality only after the fact, so an efficient way to assess image quality during generation is needed. Method: Probe-Select is a plug-in module that analyzes intermediate activations from early denoising steps (coarse structure, object layout, and spatial relations) to predict the final image quality score, and dynamically terminates generation for low-scoring seeds. Result: Validated on diffusion and flow-matching models, evaluation at only 20% of the generation trajectory ranks candidate seeds accurately, cuts sampling cost by more than 60%, and improves the quality of the selected images. Conclusion: Early structural signals are sufficient to guide selective generation without modifying the underlying generative model, offering a new paradigm for efficient T2I generation. Abstract: Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a "generate-then-select" mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post-hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure (object layout and spatial arrangement) that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.
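The cost argument behind early termination can be made concrete with a small back-of-the-envelope model. The sketch below assumes every denoising step costs the same, ignores the overhead of the probe itself, and treats the fraction of seeds kept as a free parameter; none of these simplifications, nor the function names, come from the paper.

```python
def relative_cost(probe_fraction, keep_fraction):
    """Sampling cost relative to running every seed to completion.
    All seeds run up to the probe point; only the kept ones finish."""
    return probe_fraction + keep_fraction * (1.0 - probe_fraction)

def select_seeds(predicted_scores, keep_fraction):
    """Keep the seeds whose predicted final quality ranks highest.
    `predicted_scores` maps a seed id to the probe's score."""
    k = max(1, round(len(predicted_scores) * keep_fraction))
    ranked = sorted(predicted_scores, key=predicted_scores.get, reverse=True)
    return ranked[:k]

# Probing at 20% of the trajectory and continuing a quarter of the seeds
# costs 0.2 + 0.25 * 0.8 = 0.4 of the full budget under this model.
cost = relative_cost(probe_fraction=0.2, keep_fraction=0.25)
```

Under this simplified model, a saving above 60% with a probe at 20% of the trajectory would require continuing fewer than a quarter of the seeds; the actual selection ratios used in the paper are not given in the abstract.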

[112] Scale-invariant Gaussian derivative residual networks

Andrzej Perzanowski,Tony Lindeberg

Main category: cs.CV

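TL;DR: This paper proposes provably scale-invariant Gaussian derivative residual networks (GaussDerResNets), built by cascading scale-covariant Gaussian derivative residual blocks. They markedly improve generalization to image scales unseen during training, and experiments on several rescaled datasets confirm strong scale generalization and scale selection.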

Details Motivation: Deep networks generalize poorly to image scales not seen during training (an out-of-distribution problem), which remains a fundamental challenge. Method: The authors build cascaded networks from scale-covariant Gaussian derivative residual blocks (GaussDerResNets), adding residual connections to raise accuracy while preserving scale invariance; they give proofs of scale covariance and scale invariance in arbitrary dimensions; they run systematic experiments with single-scale training and multi-scale testing on rescaled versions of STL-10, Fashion-MNIST, and CIFAR-10; and they carry out ablations on architectural variants such as depthwise-separable convolutions. Result: GaussDerResNets show strong scale generalization and scale selection on all three rescaled datasets; depthwise-separable convolutions reduce parameters and computation while largely maintaining accuracy and scale generalization. Conclusion: GaussDerResNets are a provably scale-invariant architecture that also generalizes well in practice, offering a new and effective approach to scale robustness in deep networks. Abstract: Generalisation across image scales remains a fundamental challenge for deep networks, which often fail to handle images at scales not seen during training (the out-of-distribution problem). In this paper, we present provably scale-invariant Gaussian derivative residual networks (GaussDerResNets), constructed out of scale-covariant Gaussian derivative residual blocks coupled in cascade, aimed at addressing this problem. By adding residual skip connections to the previous notion of Gaussian derivative layers, deeper networks with substantially increased accuracy can be constructed, while preserving very good scale generalisation properties at the higher level of accuracy. Explicit proofs are provided regarding the underlying scale-covariant and scale-invariant properties in arbitrary dimensions. To analyse the ability of GaussDerResNets to generalise to new scales, we apply them on the new rescaled version of the STL-10 dataset, where training is done at a single fixed scale and evaluation is performed on multiple copies of the test set, each rescaled to a single distinct spatial scale, with scale factors extending over a range of 4. We also conduct similar systematic experiments on the rescaled versions of Fashion-MNIST and CIFAR-10 datasets. Experimentally, we demonstrate that the GaussDerResNets have strong scale generalisation and scale selection properties on all the three rescaled datasets. In our ablation studies, we investigate different architectural variants of GaussDerResNets, demonstrating that basing the architecture on depthwise-separable convolutions allows for decreasing both the number of parameters and the amount of computations, with reasonably maintained accuracy and scale generalisation.

[113] Multimodal-Prior-Guided Importance Sampling for Hierarchical Gaussian Splatting in Sparse-View Novel View Synthesis

Kaiqiang Xiong,Zhanke Wang,Ronggang Wang

Main category: cs.CV

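TL;DR: This paper proposes multimodal-prior-guided importance sampling for hierarchical 3D Gaussian Splatting (3DGS) in sparse-view novel view synthesis. It fuses photometric residuals with semantic and geometric priors to decide where fine Gaussians are injected, and reaches state-of-the-art performance on several sparse-view benchmarks.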

Details Motivation: Under sparse views, 3DGS is prone to texture overfitting and to noise from pose and appearance inconsistencies; the goal is a more robust estimate of local recoverability. Method: The authors propose a multimodal-prior-guided importance sampling mechanism that fuses photometric rendering residuals, semantic priors, and geometric priors; they build a coarse-to-fine Gaussian representation and design a geometry-aware sampling and retention policy that focuses on geometrically critical regions and protects newly added Gaussians in underconstrained areas. Result: The method achieves state-of-the-art reconstruction on sparse-view benchmarks such as DTU, with PSNR gains of up to +0.3 dB. Conclusion: Sampling guided by multimodal priors is more robust than relying on photometric residuals alone; it suppresses noise and overfitting and improves sparse-view reconstruction quality. Abstract: We present multimodal-prior-guided importance sampling as the central mechanism for hierarchical 3D Gaussian Splatting (3DGS) in sparse-view novel view synthesis. Our sampler fuses complementary cues (photometric rendering residuals, semantic priors, and geometric priors) to produce a robust, local recoverability estimate that directly drives where to inject fine Gaussians. Built around this sampling core, our framework comprises (1) a coarse-to-fine Gaussian representation that encodes global shape with a stable coarse layer and selectively adds fine primitives where the multimodal metric indicates recoverable detail; and (2) a geometric-aware sampling and retention policy that concentrates refinement on geometrically critical and complex regions while protecting newly added primitives in underconstrained areas from premature pruning. By prioritizing regions supported by consistent multimodal evidence rather than raw residuals alone, our method alleviates overfitting texture-induced errors and suppresses noise from pose/appearance inconsistencies. Experiments on diverse sparse-view benchmarks demonstrate state-of-the-art reconstructions, with up to +0.3 dB PSNR on DTU.

[114] Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

Jialiang Zhang,Junlong Tong,Junyan Lin,Hao Wu,Yirong Sun,Yunpu Ma,Xiaoyu Shen

Main category: cs.CV

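TL;DR: This paper proposes Think-as-You-See (TaYS), a framework that lets large vision-language models perform truly concurrent chain-of-thought reasoning on streaming video input, markedly improving response speed and reasoning performance.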

Details Motivation: Most existing LVLMs reason over a complete video in batch mode, which does not fit real video streams where frames arrive one by one, so reasoning paradigms suited to streaming input are needed. Method: TaYS is a unified framework comprising parallelized CoT generation, stream-constrained training, and stream-parallel inference; it introduces temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. Result: Evaluated on the Qwen2.5-VL family, TaYS outperforms batch and interleaved baselines on video CoT tasks such as event dynamics analysis, causal reasoning, and thematic understanding, while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. Conclusion: A data-aligned streaming reasoning paradigm improves the efficiency and responsiveness of LVLMs in video understanding, and TaYS offers a new approach to real-time video understanding. Abstract: Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS.

[115] SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion

Xinjie Zhu,Zijing Zhao,Hui Jin,Qingxiao Guo,Yilong Ma,Yunhao Wang,Xiaobing Guo,Weifeng Zhang

Main category: cs.CV

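TL;DR: This paper proposes SIGMark, a scalable, blind-extraction watermarking framework for video diffusion models. It uses a global set of frame-wise pseudorandom coding keys (GF-PRC) for distortion-free, low-overhead blind watermark embedding, and a Segment Group-Ordering module (SGO) designed for causal 3D VAEs to improve robustness under temporal disturbance.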

Details Motivation: Existing in-generation video watermarking methods are mostly non-blind, requiring storage of many key-message pairs and heavy computation, and they are not robust to temporal disturbance under modern causal 3D VAE architectures. Method: SIGMark (1) generates watermarked initial noise with a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), enabling blind extraction while preserving the noise distribution; (2) adds a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs to make watermark inversion robust under temporal disturbance. Result: Across several modern video diffusion models, SIGMark maintains very high bit accuracy under both temporal and spatial disturbances with small computational overhead, showing good scalability and robustness. Conclusion: SIGMark is the first efficient, blind-extraction, highly robust watermarking scheme for video diffusion models with causal 3D VAEs, providing a practical route to secure provenance tracing of AI-generated video. Abstract: Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has been advanced rapidly. Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety. Beyond post-processing watermarks which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models. However, existing in-generation approaches are non-blind: they require maintaining all the message-key pairs and performing template-based matching during extraction, which incurs prohibitive computational costs at scale. Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak. To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion. To achieve blind-extraction, we propose to generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), reducing the cost of storing large-scale information while preserving noise distribution and diversity for distortion-free watermarking. To enhance robustness, we further design a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance. Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit-accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness. Our project is available at https://jeremyzhao1998.github.io/SIGMark-release/.

[116] SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers

Wonsuk Jang,Thierry Tambe

Main category: cs.CV

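TL;DR: This paper proposes SemanticDialect, which scales the formatbook with lookup tables, adds activation decomposition, and introduces semantic-aware dialect assignment (SeDA), markedly improving quantization of video diffusion Transformers (VDiT) and lowering edge deployment cost while preserving generation quality.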

Details Motivation: Diffusion Transformers (DiT) generate high-quality video, but their memory and compute costs make edge deployment difficult, and existing quantization methods tend to degrade quality when activations vary strongly and semantic and temporal coherence must be preserved. Method: SemanticDialect (1) scales the formatbook using lookup tables, enabling low-cost selection of the optimal per-block format (dialect); (2) applies activation decomposition, combining attention-guided re-quantization of salient tokens with residual error compensation; (3) introduces semantic-aware dialect assignment (SeDA), which shares a sub-formatbook among semantically correlated tokens to improve the consistency of quantized values. Result: Experiments on video DiT (VDiT) models show that SemanticDialect outperforms prior VDiT quantization methods and fine-grained block-wise format baselines, approaching FP16 generation quality on Open-Sora 2.0. Conclusion: With a structured, semantics-driven mixed-format quantization strategy, SemanticDialect balances video generation quality against edge deployment efficiency and offers a new paradigm for efficient video diffusion models. Abstract: Diffusion Transformers (DiT) achieve strong video generation quality, but their memory and compute costs hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality under high activation variation and the need to preserve semantic/temporal coherence. We propose SemanticDialect, which advances recent block-wise mixed-format quantization (selecting a per-block optimal format, a dialect, from multiple candidates, a formatbook) by scaling the formatbook with lookup tables for quantization error and quantized values, enabling efficient per-block format selection and quantization at low online cost. We also introduce activation decomposition that reduces quantization error by re-quantizing and adding back residual errors, with attention-guided salient token selection. We further propose semantic-aware dialect assignment (SeDA) to improve quantized value consistency by sharing a sub-formatbook among semantically correlated tokens. Experiments on video DiT (VDiT) models show that SemanticDialect outperforms prior VDiT quantization methods and fine-grained block-wise format baselines, while approaching FP16 quality on Open-Sora 2.0.
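Per-block format selection, the starting point that SemanticDialect builds on, can be illustrated with a brute-force sketch: each block is scaled by its largest magnitude, quantized onto every candidate grid in a formatbook, and assigned the grid with the smallest squared error. The two grids below, the max-abs scaling, and all names are hypothetical; the paper replaces this exhaustive search with lookup tables, which the sketch does not attempt to reproduce.

```python
# Hypothetical formatbook: each "dialect" is a grid of normalized values.
FORMATBOOK = {
    "uniform": [-1.0, -2 / 3, -1 / 3, 0.0, 1 / 3, 2 / 3, 1.0],
    "log": [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0],
}

def quantize_block(block, grid):
    """Scale the block by its max magnitude and snap each value to the grid."""
    scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
    return [min(grid, key=lambda g: abs(v / scale - g)) * scale for v in block]

def squared_error(block, quantized):
    return sum((v - q) ** 2 for v, q in zip(block, quantized))

def select_dialect(block, formatbook=FORMATBOOK):
    """Try every format and return (name, error, quantized block)
    for the one with the lowest squared quantization error."""
    best = None
    for name, grid in formatbook.items():
        quantized = quantize_block(block, grid)
        err = squared_error(block, quantized)
        if best is None or err < best[1]:
            best = (name, err, quantized)
    return best
```

Blocks whose values are spread evenly tend to pick the uniform grid, while blocks dominated by a few large values with many small ones tend to pick the log-like grid, which is the intuition behind letting the format vary per block.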

[117] StegaFFD: Privacy-Preserving Face Forgery Detection via Fine-Grained Steganographic Domain Lifting

Guoqing Ma,Xun Lin,Hui Ma,Ajian Liu,Yizhong Liu,Wenzhong Tang,Shan Yu,Chenqi Kong,Yi Yu

Main category: cs.CV

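TL;DR: This paper proposes StegaFFD, a steganography-based face forgery detection framework that protects facial privacy without arousing attackers' suspicion while maintaining high detection accuracy.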

Details Motivation: Existing face forgery detection models rely on raw face images, which risks privacy leakage in client-server architectures; traditional privacy-protection methods (e.g., anonymization, encryption, distortion) tend to introduce semantic distortion or interfere with detection performance. Method: The StegaFFD framework hides face images within natural images and performs forgery detection directly in the steganographic domain; LFAD (Low-Frequency-Aware Decomposition) and SFDA (Spatial-Frequency Differential Attention) suppress background interference and enhance perception of hidden features; SDA (Steganographic Domain Alignment) aligns steganographic-domain representations with those of the raw domain. Result: Validated on seven FFD datasets, StegaFFD combines strong imperceptibility with low suspiciousness and preserves detection accuracy significantly better than existing privacy-protection methods. Conclusion: StegaFFD offers a new paradigm for balancing privacy protection and forgery detection performance, overcoming the trade-off in traditional methods between semantic fidelity and covert security. Abstract: Most existing Face Forgery Detection (FFD) models assume access to raw face images. In practice, under a client-server framework, private facial data may be intercepted during transmission or leaked by untrusted servers. Previous privacy protection approaches, such as anonymization, encryption, or distortion, partly mitigate leakage but often introduce severe semantic distortion, making images appear obviously protected. This alerts attackers, provoking more aggressive strategies and turning the process into a cat-and-mouse game. Moreover, these methods heavily manipulate image contents, introducing degradation or artifacts that may confuse FFD models, which rely on extremely subtle forgery traces. Inspired by advances in image steganography, which enable high-fidelity hiding and recovery, we propose a Steganography-based Face Forgery Detection framework (StegaFFD) to protect privacy without raising suspicion. StegaFFD hides facial images within natural cover images and directly conducts forgery detection in the steganographic domain. However, the hidden forgery-specific features are extremely subtle and interfered with by cover semantics, posing significant challenges. To address this, we propose Low-Frequency-Aware Decomposition (LFAD) and Spatial-Frequency Differential Attention (SFDA), which suppress interference from low-frequency cover semantics and enhance hidden facial feature perception.
Furthermore, we introduce Steganographic Domain Alignment (SDA) to align the representations of hidden faces with those of their raw counterparts, enhancing the model's ability to perceive subtle facial cues in the steganographic domain. Extensive experiments on seven FFD datasets demonstrate that StegaFFD achieves strong imperceptibility, avoids raising attackers' suspicion, and better preserves FFD accuracy compared to existing facial privacy protection methods.

[118] LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval

Minh-Chi Phung,Thien-Bao Le,Cam-Tu Tran-Thi,Thu-Dieu Nguyen-Thi,Vu-Hung Dao

Main category: cs.CV

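TL;DR: This paper proposes LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval that handles complex real-world queries adaptively through four collaborative stages (query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis). Its core Landmark Knowledge Agent converts cultural/spatial landmarks in Vietnamese scenes into visual prompts to strengthen CLIP semantic matching, and the framework adds an LLM-assisted image-to-image retrieval pipeline based on Gemini 2.5 Flash and an OCR refinement module, significantly improving retrieval performance in Vietnamese-language scenarios.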

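Details Motivation: Video data is growing in diversity and scale, calling for retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-knowledge integration; culturally specific settings such as Vietnam pose particular challenges in landmark recognition, semantic matching, and text recognition. Method: LLandMark, a modular multi-agent framework, comprises four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A Landmark Knowledge Agent converts landmarks into descriptive visual prompts to strengthen CLIP matching; an LLM-driven (Gemini 2.5 Flash) image-to-image retrieval pipeline automatically performs landmark detection, image query generation, image retrieval, and CLIP visual similarity matching; an OCR refinement module based on Gemini and LlamaIndex improves Vietnamese text recognition accuracy. Result: On complex real-world queries, LLandMark delivers adaptive, culturally grounded, and explainable multimodal video retrieval without requiring manually supplied images, and it significantly improves landmark recognition, visual-semantic matching, and text recognition in Vietnamese-language scenarios. Conclusion: LLandMark demonstrates the effectiveness of multi-agent collaboration and large-model support for culturally sensitive multimodal video retrieval, offering a new paradigm for region-specific, explainable, end-to-end retrieval systems. Abstract: The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.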

[119] Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting

Kaiqiang Xiong,Rui Peng,Jiahao Wu,Zhanke Wang,Jie Liang,Xiaoyun Zheng,Feng Gao,Ronggang Wang

Main category: cs.CV

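TL;DR: This paper proposes MVD-HuGaS, which uses an enhanced multi-view human diffusion model to generate multi-view images and combines camera-pose alignment with a facial depth-distortion correction module to optimize a 3D Gaussian representation, achieving high-fidelity free-view human rendering from a single image.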

Details Motivation: Existing diffusion-based methods for single-image 3D human reconstruction suffer from flattened structure, over-smoothing, and poor generalization to real-world scenes. Method: 1) A multi-view human diffusion model, fine-tuned on high-quality 3D human datasets, generates multi-view images that carry geometry and structure priors; 2) an alignment module jointly optimizes the 3D Gaussians and camera poses; 3) a depth-based facial distortion mitigation module improves facial fidelity; 4) the 3D Gaussians are optimized from the refined multi-view images and poses. Result: State-of-the-art single-view 3D human rendering performance on the Thuman2.0 and 2K2K datasets. Conclusion: By combining multi-view diffusion priors, accurate pose estimation, and facial detail correction, MVD-HuGaS significantly improves the quality and generalization of single-image 3D human reconstruction. Abstract: 3D human reconstruction from a single image is a challenging problem and has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling (SDS) or generating a back-view image for facilitating reconstruction. However, these methods tend to produce unsatisfactory artifacts (e.g., flattened human structure or over-smoothing results caused by inconsistent priors from multiple views) and struggle with real-world generalization in the wild. In this work, we present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is well fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction. Finally, leveraging the refined multi-view images, along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on Thuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.

[120] 3D-DRES: Detailed 3D Referring Expression Segmentation

Qi Chen,Changli Wu,Jiayi Ji,Yiwei Ma,Liujuan Cao

Main category: cs.CV

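TL;DR: This paper introduces the fine-grained task of Detailed 3D Referring Expression Segmentation (3D-DRES), builds DetailRefer, the first 3D dataset supporting phrase-to-instance mapping, and designs the dual-mode baseline DetailBase; besides improving phrase-level segmentation, it unexpectedly improves performance on the traditional sentence-level 3D-RES task.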

Details Motivation: Existing 3D visual grounding tasks handle only sentence-level detection or segmentation and cannot fully exploit the rich compositional contextual reasoning in natural language. Method: A new task, 3D-DRES, is proposed; the finely annotated DetailRefer dataset is built, with 54,432 descriptions covering 11,054 objects and a pioneering paradigm that explicitly maps noun phrases to 3D elements; the dual-mode baseline architecture DetailBase supports joint sentence-level and phrase-level segmentation. Result: Models trained on DetailRefer not only perform well on phrase-level segmentation but also show unexpected gains on traditional 3D-RES benchmarks. Conclusion: Fine-grained phrase-instance alignment strengthens 3D vision-language understanding and in turn benefits coarse-grained tasks, supporting the need for and effectiveness of compositional reasoning. Abstract: Current 3D visual grounding tasks only process sentence-level detection or segmentation, which critically fails to leverage the rich compositional contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping, aiming at enhancing fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm where each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.

[121] ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization

Hao Cao,Chengbin Liang,Wenqi Guo,Zhijin Qin,Jungong Han

Main category: cs.CV

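TL;DR: This paper proposes ProGIC, a progressive generative image compression method that uses residual vector quantization (RVQ) and a lightweight network to deliver efficient, flexible, and practical low-bitrate deployment while preserving high-quality reconstruction.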

Details Motivation: Existing generative image compression (GIC) methods rely on large, rigid models that are ill-suited to flexible transmission and practical deployment in low-bitrate scenarios. Method: ProGIC, a progressive generative image compression framework based on residual vector quantization (RVQ), is paired with a lightweight backbone (depthwise-separable convolutions plus small attention blocks) and supports coarse-to-fine reconstruction and progressive bitstream transmission. Result: On the Kodak dataset, ProGIC saves up to 57.57% and 58.83% bitrate on DISTS and LPIPS, respectively, relative to MS-ILLM; encoding and decoding are more than 10 times faster; progressive transmission is supported. Conclusion: ProGIC strikes a good balance among perceptual quality, compression efficiency, computational speed, and deployment flexibility, providing a viable solution for practical low-bitrate applications. Abstract: Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GICs rely on large-scale and rigid models, which severely constrain their utility for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers encodes the residuals stage by stage, each with its own codebook. The resulting codewords sum to a coarse-to-fine reconstruction and a progressive bitstream, enabling previews from partial data. We pair this with a lightweight backbone based on depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices. Experimental results show that ProGIC attains compression performance comparable to previous methods. It achieves bitrate savings of up to 57.57% on DISTS and 58.83% on LPIPS compared to MS-ILLM on the Kodak dataset. Beyond perceptual quality, ProGIC enables progressive transmission for flexibility, and also delivers over 10 times faster encoding and decoding compared with MS-ILLM on GPUs for efficiency.
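The stage-by-stage residual encoding described in this abstract can be sketched on a toy example. The three hand-written codebooks below are hypothetical (a real codec learns them and operates on latent features, not raw 2D points); the sketch only shows how summing a prefix of the codewords yields a coarse preview that later stages refine.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = list(x)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected codewords; a prefix of indices gives a coarse preview."""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out


# Hypothetical toy codebooks, from coarse to fine.
CODEBOOKS = [
    [[0, 0], [4, 4], [-4, 4], [4, -4]],
    [[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]],
    [[0, 0], [0.25, 0.25], [-0.25, 0.25], [0.25, -0.25], [-0.25, -0.25]],
]

if __name__ == "__main__":
    idx = rvq_encode([5.25, 4.25], CODEBOOKS)
    for n in range(1, len(idx) + 1):
        # Decoding only the first n indices mimics a partial bitstream.
        print(n, rvq_decode(idx[:n], CODEBOOKS))
```

Because decoding is a plain sum, a receiver can stop after any stage and still obtain a valid, if coarser, reconstruction.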

[122] Harmonic Beltrami Signature Network: a Shape Prior Module in Deep Learning Framework

Chenran Lin,Lok Ming Lui

Main category: cs.CV

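TL;DR: This paper proposes the Harmonic Beltrami Signature Network (HBSN), a new deep learning architecture for efficiently computing the Harmonic Beltrami Signature (HBS) from binary images; HBS is in one-to-one correspondence with shapes, is invariant to translation, scaling, and rotation, and can be embedded into segmentation models as a shape prior to improve performance.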

Details Motivation: Existing methods struggle to extract, efficiently and accurately, shape representations from images that are both geometrically invariant and in one-to-one correspondence with shapes, which limits the use of shape priors in deep learning. Method: HBSN comprises a pre-Spatial Transformer Network (pre-STN) for shape normalization, a UNet backbone for HBS prediction, and a post-STN for angle regularization. Result: HBSN accurately computes the HBS of complex shapes; embedding it into existing segmentation models improves segmentation performance; its effectiveness as a general-purpose geometric shape embedding module is confirmed. Conclusion: HBSN is a robust, efficient, plug-and-play framework for learning shape representations, offering a new route to incorporating precise geometric priors into vision tasks. Abstract: This paper presents the Harmonic Beltrami Signature Network (HBSN), a novel deep learning architecture for computing the Harmonic Beltrami Signature (HBS) from binary-like images. HBS is a shape representation that provides a one-to-one correspondence with 2D simply connected shapes, with invariance to translation, scaling, and rotation. By exploiting the function approximation capacity of neural networks, HBSN enables efficient extraction and utilization of shape prior information. The proposed network architecture incorporates a pre-Spatial Transformer Network (pre-STN) for shape normalization, a UNet-based backbone for HBS prediction, and a post-STN for angle regularization. Experiments show that HBSN accurately computes HBS representations, even for complex shapes. Furthermore, we demonstrate how HBSN can be directly incorporated into existing deep learning segmentation models, improving their performance through the use of shape priors. The results confirm the utility of HBSN as a general-purpose module for embedding geometric shape information into computer vision pipelines.

[123] Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement

Hao Ai,Wenjie Chang,Jianbo Jiao,Ales Leonardis,Ofek Eyal

Main category: cs.CV

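TL;DR: This paper proposes Articulation in Motion (AiM), a new framework that, from a user-object interaction video and an initial scan, achieves high-accuracy part segmentation, kinematic analysis, and 3D reconstruction without priors such as a preset number of parts.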

Details Motivation: Existing methods depend on clear observations of two different poses and on prior knowledge of the number of parts, which limits generalization and robustness, with performance degrading when parts are invisible or occluded. Method: A dual-Gaussian scene representation is learned from an initial 3DGS scan together with a motion video; motion cues drive part segmentation and joint localization; a sequential RANSAC performs prior-free part mobility analysis, automatically determining the number of parts and estimating the kinematic parameters. Result: Part segmentation quality exceeds that of previous methods on both simple and complex objects, without structural priors such as the number of parts, and with strong generalization. Conclusion: The AiM framework overcomes the reliance on multi-view/multi-state observations and structural priors, enabling more robust and automated 3D understanding and modeling of articulated objects. Abstract: Articulated objects are ubiquitous in daily life. Our goal is to achieve a high-quality reconstruction, segmentation of independent moving parts, and analysis of articulation. Recent methods analyse two different articulation states and perform per-point part segmentation, optimising per-part articulation using cross-state correspondences, given a priori knowledge of the number of parts. Such assumptions greatly limit their applications and performance. Their robustness is reduced when objects cannot be clearly visible in both states. To address these issues, in this paper, we present a new framework, Articulation in Motion (AiM). We infer part-level decomposition, articulation kinematics, and reconstruct an interactive 3D digital replica from a user-object interaction video and a start-state scan. We propose a dual-Gaussian scene representation that is learned from an initial 3DGS scan of the object and a video that shows the movement of separate parts. It uses motion cues to segment the object into parts and assign articulation joints. Subsequently, a robust, sequential RANSAC is employed to achieve part mobility analysis without any part-level structural priors, which clusters moving primitives into rigid parts and estimates kinematics while automatically determining the number of parts. The proposed approach separates the object into parts, each represented as a 3D Gaussian set, enabling high-quality rendering. Our approach yields higher quality part segmentation than previous methods, without prior knowledge.
Extensive experimental analysis on both simple and complex objects validates the effectiveness and strong generalisation ability of our approach. Project page: https://haoai-1997.github.io/AiM/.

[124] Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers

Youngjun Jun,Seil Kang,Woojung Han,Seong Jae Hwang

Main category: cs.CV

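TL;DR: This paper proposes GramCol and IMAP for localizing the spatial and temporal features of motion concepts in video diffusion transformers; they require no gradient computation or parameter updates and markedly improve the interpretability of motion localization and zero-shot video semantic segmentation.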

Details Motivation: How video diffusion transformers turn motion words into video is not well understood, and interpretable saliency maps for motion behavior have received little study. Method: GramCol generates per-frame concept saliency maps, and a motion-feature selection algorithm yields a spatio-temporally localized Interpretable Motion-Attentive Map (IMAP). Result: Experiments show strong performance on motion localization and zero-shot video semantic segmentation, with clearer and more interpretable saliency maps. Conclusion: Without any gradient computation or parameter update, the proposed method reveals the internal mechanisms of motion concepts in video diffusion models and improves model interpretability. Abstract: Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity from given text descriptions involving motion. However, how Video DiTs convert motion words into video remains insufficiently understood. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.

[125] HDINO: A Concise and Efficient Open-Vocabulary Detector

Hao Zhang,Yiqun Wang,Qinran Lin,Runze Fan,Yong Li

Main category: cs.CV

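TL;DR: This paper proposes HDINO, an efficient open-vocabulary object detector that needs neither manually curated data nor cross-modal feature extraction; with a two-stage training strategy (including a semantic alignment mechanism and a difficulty-weighted classification loss) and a lightweight feature fusion module, it achieves state-of-the-art performance on COCO.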

Details Motivation: Existing open-vocabulary object detection methods depend on large amounts of manually curated fine-grained data and compute-intensive cross-modal feature extraction, which limits their efficiency and scalability. Method: A two-stage training strategy is built on the DINO model: the first stage uses noisy samples to construct a One-to-Many Semantic Alignment Mechanism (O2M) and introduces a Difficulty Weighted Classification Loss (DWCL) to mine hard examples; the second stage adds a lightweight feature fusion module to enhance sensitivity to linguistic semantics. Result: HDINO-T reaches 49.2 mAP on COCO using only 2.2M training images without manual curation, surpassing Grounding DINO-T and T-Rex2; after fine-tuning, HDINO-T and HDINO-L reach 56.4 and 59.2 mAP, respectively. Conclusion: HDINO removes the dependence on manual curation and grounding data while remaining efficient, generalizable, and scalable, offering a new paradigm for open-vocabulary detection. Abstract: Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or the use of grounding data, surpassing Grounding DINO-T and T-Rex2 by 0.8 mAP and 2.8 mAP, respectively, which are trained on 5.4M and 6.5M images. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.

[126] GloPath: An Entity-Centric Foundation Model for Glomerular Lesion Assessment and Clinicopathological Insights

Qiming He,Jing Li,Tian Guan,Yifei Ma,Zimo Zhao,Yanxia Wang,Hongjing Chen,Yingming Xu,Shuang Ge,Yexing Zhang,Yizhi Wang,Xinrui Chen,Lianghui Zhu,Yiqing Liu,Qingxia Hou,Shuyan Zhao,Xiaoqin Wang,Lili Ma,Peizhen Hu,Qiang Huang,Zihan Wang,Zhiyuan Shen,Junru Cheng,Siqi Zeng,Jiurun Chen,Zhen Song,Chao He,Zhe Wang,Yonghong He

Main category: cs.CV

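TL;DR: GloPath is an entity-centric foundation model trained on over one million glomerular images with multi-scale, multi-view self-supervised learning; it substantially outperforms existing methods in both glomerular lesion assessment and the discovery of clinicopathological associations.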

Details Motivation: Current AI methods struggle with the heterogeneity of glomerular morphology and with subtle lesion patterns, creating a need for more robust, interpretable, and clinically translatable AI tools. Method: GloPath is trained with multi-scale and multi-view self-supervised learning on more than one million glomerular images extracted from 14,049 renal biopsy specimens; it is validated on 52 tasks (recognition, grading, few-shot classification, and cross-modality diagnosis) and on association analysis over 224 morphology-clinical variable pairs. Result: GloPath outperforms the state of the art on 42 of 52 lesion assessment tasks (80.8%); in a real-world study it reaches an ROC-AUC of 91.51% for lesion recognition; statistically significant associations are found across 224 morphology-clinical variable pairs. Conclusion: GloPath is a scalable, interpretable platform that marks a key step toward clinical deployment of AI in renal pathology. Abstract: Glomerular pathology is central to the diagnosis and prognosis of renal diseases, yet the heterogeneity of glomerular morphology and fine-grained lesion patterns remain challenging for current AI approaches. We present GloPath, an entity-centric foundation model trained on over one million glomeruli extracted from 14,049 renal biopsy specimens using multi-scale and multi-view self-supervised learning. GloPath addresses two major challenges in nephropathology: glomerular lesion assessment and clinicopathological insights discovery. For lesion assessment, GloPath was benchmarked across three independent cohorts on 52 tasks, including lesion recognition, grading, few-shot classification, and cross-modality diagnosis, outperforming state-of-the-art methods in 42 tasks (80.8%). In the large-scale real-world study, it achieved an ROC-AUC of 91.51% for lesion recognition, demonstrating strong robustness in routine clinical settings. For clinicopathological insights, GloPath systematically revealed statistically significant associations between glomerular morphological parameters and clinical indicators across 224 morphology-clinical variable pairs, demonstrating its capacity to connect tissue-level pathology with patient-level outcomes. Together, these results position GloPath as a scalable and interpretable platform for glomerular lesion assessment and clinicopathological discovery, representing a step toward clinically translatable AI in renal pathology.

[127] TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

Xiangzhao Hao,Shijie Wang,Tianyu Yang,Tianyue Wang,Haiyun Guo,JinQiao Wang

Main category: cs.CV

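TL;DR: This paper proposes TRACE, a framework that improves multimodal retrieval by combining generative reasoning with discriminative representation learning. It first generates a structured chain of thought (CoT) for explicit reasoning and then compresses it into a compact embedding; trained on the newly built M-BEIR-CoT dataset, it exhibits implicit task-adaptive reasoning, high accuracy, high throughput, and zero-shot transfer.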

Details Motivation: In multimodal retrieval, existing MLLMs are confined to the role of static encoders, leaving their generative reasoning ability underused and making it hard to handle complex user intents that require logical reasoning. Method: The TRACE framework 1) generates a structured Chain-of-Thought for explicit query reasoning; 2) compresses the reasoning trace into a compact embedding via a dedicated token; 3) is trained on M-BEIR-CoT, a dataset built with a difficulty-aware routing strategy. Result: TRACE achieves state-of-the-art results on the M-BEIR benchmark; it shows implicit task-adaptive reasoning (reasoning is activated for complex queries and skipped for simple ones), balancing retrieval accuracy and inference throughput; and it transfers well zero-shot to unseen domains and novel constraints. Conclusion: Internalizing generative reasoning into embedding learning is an effective paradigm for universal multimodal retrieval; TRACE demonstrates the feasibility and advantages of unifying generative and discriminative modeling. Abstract: Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.

[128] TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration

Benlei Cui,Shaoxuan He,Bukun Huang,Zhizeng Ye,Yunyun Sun,Longtao Huang,Hui Xue,Yang Yang,Jingqun Tang,Zhou Zhao,Haiwen Hong

Main category: cs.CV

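TL;DR: This paper proposes TC-Padé, a trajectory-consistent feature prediction framework based on Padé approximation that accelerates diffusion sampling in the low-step regime (20-30 steps), addressing the error accumulation of Taylor-style extrapolators and the failure of conventional caching strategies to account for the differing dynamics of denoising phases.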

Details Motivation: Diffusion models generate high-quality output, but iterative sampling is computationally expensive; existing feature caching methods perform poorly at low step counts, Taylor-style extrapolation is prone to error accumulation and trajectory drift, and the differing dynamics of denoising phases are not taken into account. Method: Trajectory-Consistent Padé (TC-Padé) models feature evolution with rational functions and introduces adaptive coefficient modulation (using historical residuals to detect trajectory changes) together with step-aware prediction strategies (distinguishing early, mid, and late sampling dynamics). Result: Effectiveness is validated on DiT-XL/2, FLUX.1-dev, and Wan2.1; acceleration reaches 2.88x on FLUX.1-dev and 1.72x on Wan2.1 while FID, CLIP, Aesthetic, and VBench-2.0 metrics remain high, substantially outperforming existing feature caching methods. Conclusion: By modeling feature evolution more accurately and adapting to the dynamics of different sampling stages, TC-Padé delivers efficient and stable acceleration at low step counts, offering a new direction for making diffusion models practical. Abstract: Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases. To address these challenges, we propose Trajectory-Consistent Padé approximation, a feature prediction framework grounded in Padé approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Padé incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Padé.
For instance, TC-Padé achieves 2.88x acceleration on FLUX.1-dev and 1.72x on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.
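To see why a rational function can extrapolate better than a truncated Taylor polynomial over a large step, consider a scalar toy case. The sketch below is not the TC-Padé predictor: it builds the textbook [1/1] Padé approximant from three series coefficients (which, in a caching setting, one might assume are estimated from finite differences of cached features) and compares it with the second-order Taylor polynomial. The function names are hypothetical.

```python
def taylor2(c0, c1, c2, x):
    """Second-order Taylor polynomial c0 + c1*x + c2*x^2."""
    return c0 + c1 * x + c2 * x * x


def pade11(c0, c1, c2, x):
    """[1/1] Pade approximant (a0 + a1*x) / (1 + b1*x).

    The coefficients are chosen so that its series expansion matches
    c0 + c1*x + c2*x^2 through second order:
        a0 = c0,  a1 = c1 + b1*c0,  b1 = -c2 / c1   (requires c1 != 0).
    """
    b1 = -c2 / c1
    a0 = c0
    a1 = c1 + b1 * c0
    return (a0 + a1 * x) / (1.0 + b1 * x)


if __name__ == "__main__":
    # f(x) = 1 / (1 + x) has the series 1 - x + x^2 - ...
    c0, c1, c2 = 1.0, -1.0, 1.0
    x = 1.0  # a "large" extrapolation step; the true value is 0.5
    print("taylor:", taylor2(c0, c1, c2, x))
    print("pade:  ", pade11(c0, c1, c2, x))
```

Both approximants use the same three coefficients; the rational form simply spends one of them on a denominator, which lets it follow saturating behavior that a polynomial cannot.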

[129] Semi-Supervised Few-Shot Adaptation of Vision-Language Models

Julio Silva-Rodríguez,Ender Konukoglu

Main category: cs.CV

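TL;DR: This paper proposes a semi-supervised method that exploits unlabeled data, propagating text-informed pseudo-labels to improve vision-language models (VLMs) on few-shot medical image classification and substantially reduce annotation cost.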

Details Motivation: Medical imaging tasks are heavily class-imbalanced, which degrades model performance in extremely low-shot settings; expert annotation is also costly, so the annotation requirement needs to be reduced. Method: An efficient semi-supervised solver generates and propagates pseudo-labels from text information during few-shot adaptation, adapting the VLM with a small amount of labeled data. Result: Labeling effort in low-shot regimes is reduced by more than 50%, and recognition of minority classes improves. Conclusion: The method mitigates the challenges that annotation scarcity and class imbalance pose in medical imaging, offering a new route to low-cost, robust few-shot adaptation of VLMs. Abstract: Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by >50% in low-shot regimes.

[130] Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention

Wensheng Wu,Zheming Lu,Ziqian Lu,Zewei He,Xuecheng Sun,Zhao Wang,Jungong Han,Yunlong Yu

Main category: cs.CV

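TL;DR: This paper proposes a foundation-model-based anomaly synthesis pipeline (FMAS) and a Wavelet Domain Attention Module (WDAM) for industrial anomaly detection; FMAS generates high-fidelity anomalous samples without fine-tuning or class-specific training, and WDAM improves anomaly feature extraction through adaptive sub-band processing.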

Details Motivation: Industrial anomaly detection is challenged by the scarcity of anomalous samples and the complexity of real-world anomalies, and anomalies have distinctive frequency-domain characteristics. Method: The FMAS anomaly synthesis pipeline and the Wavelet Domain Attention Module (WDAM) are proposed; WDAM uses the wavelet transform for adaptive sub-band processing to enhance anomaly feature extraction and is a plug-and-play module. Result: Experiments on the MVTec AD and VisA datasets show that WDAM substantially outperforms existing baselines while remaining computationally efficient. Conclusion: Combining FMAS with WDAM improves anomaly detection sensitivity, providing an efficient and general new paradigm for few-shot industrial anomaly detection. Abstract: Industrial anomaly detection faces significant challenges due to the scarcity of anomalous samples and the complexity of real-world anomalies. In this paper, we propose a foundation model-based anomaly synthesis pipeline (FMAS) that generates highly realistic anomalous samples without fine-tuning or class-specific training. Motivated by the distinct frequency-domain characteristics of anomalies, we introduce a Wavelet Domain Attention Module (WDAM), which exploits adaptive sub-band processing to enhance anomaly feature extraction. The combination of FMAS and WDAM significantly improves anomaly detection sensitivity while maintaining computational efficiency. Comprehensive experiments on MVTec AD and VisA datasets demonstrate that WDAM, as a plug-and-play module, achieves substantial performance gains against existing baselines.

[131] TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

Jiaxing Liu,Zexi Zhang,Xiaoyan Li,Boyue Wang,Yongli Hu,Baocai Yin

Main category: cs.CV

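TL;DR: This paper proposes TagaVLM, a framework that explicitly injects topological structure into a vision-language model (via the STAR-Att attention mechanism and an interleaved navigation prompt) to improve spatial and topological reasoning in embodied vision-language navigation (VLN), reaching state-of-the-art performance on the R2R benchmark.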

Details Motivation: Large vision-language models (VLMs) are pretrained on static, disembodied tasks and adapt poorly to the dynamic, embodied, spatially structured task of vision-language navigation (VLN); existing methods convert visual and spatial information into text, so spatial-topological relations are modeled only implicitly or global action capability is limited. Method: TagaVLM is an end-to-end framework in which 1) Spatial Topology Aware Residual Attention (STAR-Att) embeds topological edge information into the VLM's self-attention mechanism; 2) an Interleaved Navigation Prompt strengthens node-level visual-text alignment; 3) the embedded topological graph enables global action reasoning and path correction. Result: In R2R unseen environments, the Success Rate (SR) reaches 51.09% and SPL reaches 47.18, improving on prior large-model-based methods by 3.39% SR and 9.08 SPL and setting the state of the art among such methods. Conclusion: For embodied spatial reasoning, targeted topological enhancement of smaller open-source VLMs is more effective than simply scaling up the model; explicitly modeling spatial topology is a key route to better VLN performance. Abstract: Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL.
This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released upon publication. Project page: https://apex-bjut.github.io/Taga-VLM

[132] Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection

Ertunc Erdil,Nico Schulthess,Guney Tombak,Ender Konukoglu

Main category: cs.CV

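TL;DR: This paper proposes an efficient unsupervised anomaly detection framework based on a 2D autoregressive CNN that uses patch-level DINO features and explicitly models spatial and contextual dependencies, avoiding the memory-intensive feature storage or prototype matching of conventional methods and substantially reducing inference latency and memory overhead.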

Details Motivation: Existing DINO-based unsupervised anomaly detection methods ignore spatial and neighborhood relationships between patches and rely on large memory banks or prototype representations, leading to high memory and compute overhead. Method: A lightweight 2D autoregressive convolutional neural network directly models the spatial and contextual dependencies of DINO patch embeddings, learning a compact parametric model of the normative distribution. Result: Evaluated on the BMAD medical imaging benchmark (three datasets), the method maintains competitive anomaly detection performance while greatly reducing inference time and memory requirements. Conclusion: Explicitly modeling spatial dependencies between patches is an effective way to improve UAD efficiency and performance; the AR-CNN structure offers a more efficient and scalable way to model the distribution of DINO features. Abstract: DINO models provide rich patch-level representations that have recently enabled strong performance in unsupervised anomaly detection (UAD). Most existing methods extract patch embeddings from "normal" images and model them independently, ignoring spatial and neighborhood relationships between patches. This implicitly assumes that self-attention and positional encodings sufficiently encode contextual information within each patch embedding. In addition, the normative distribution is often modeled as memory banks or prototype-based representations, which require storing large numbers of features and performing costly comparisons at inference time, leading to substantial memory and computational overhead. In this work, we address both limitations by proposing a simple and efficient framework that explicitly models spatial and contextual dependencies between patch embeddings using a 2D autoregressive (AR) model. Instead of storing embeddings or clustering prototypes, our approach learns a compact parametric model of the normative distribution via an AR convolutional neural network (CNN). At test time, anomaly detection reduces to a single forward pass through the network and enables fast and memory-efficient inference. We evaluate our method on the BMAD benchmark, which comprises three medical imaging datasets, and compare it against existing work including recent DINO-based methods. Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements. Code is available at the project page: https://eerdil.github.io/spatial-ar-dinov3-uad/.
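A common way to make a convolution autoregressive over a 2D grid is to mask its kernel so that each position sees only positions earlier in raster-scan order. The abstract does not specify how its AR CNN enforces causality, so the PixelCNN-style mask below is an assumption, and `causal_mask` is a hypothetical helper.

```python
def causal_mask(k, include_center=False):
    """Raster-scan causal mask for a k x k convolution kernel (k odd).

    Entries are 1 where the kernel may look (rows above the center, and
    positions to the left of the center in the center row) and 0 elsewhere.
    With include_center=False the center itself is hidden, so a prediction
    for a patch never depends on that patch's own embedding.
    """
    if k % 2 != 1:
        raise ValueError("kernel size must be odd")
    c = k // 2
    mask = [[0] * k for _ in range(k)]
    for r in range(k):
        for col in range(k):
            earlier = r < c or (r == c and col < c)
            is_center = r == c and col == c
            if earlier or (include_center and is_center):
                mask[r][col] = 1
    return mask


if __name__ == "__main__":
    for row in causal_mask(3):
        print(row)
```

Multiplying a kernel's weights elementwise by such a mask before each convolution is one way a network can predict every patch embedding from its already-seen neighbors in a single forward pass; a large prediction error at a patch could then serve as its anomaly score.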

[133] The Dresden Dataset for 4D Reconstruction of Non-Rigid Abdominal Surgical Scenes

Reuben Docea,Rayan Younis,Yonghao Long,Maxime Fleury,Jinjing Xu,Chenyang Li,André Schulze,Ann Wierick,Johannes Bender,Micha Pfeiffer,Qi Dou,Martin Wagner,Stefanie Speidel

Main category: cs.CV

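TL;DR: This paper introduces the D4D dataset, which provides paired endoscopic video and high-quality structured-light geometry for evaluating 3D reconstruction of deforming abdominal soft tissue under realistic surgical conditions.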

Details Motivation: To address the lack of high-quality benchmark data for evaluating 3D reconstruction of non-rigidly deforming abdominal soft tissue under realistic surgical conditions. Method: Data were acquired in six porcine cadaver sessions using a da Vinci Xi stereo endoscope and a Zivid structured-light camera, and registered via optical tracking and manually curated iterative alignment. The data comprise three sequence types (whole deformations, incremental deformations, moved-camera clips) and provide rectified stereo images, per-frame instrument masks, stereo depth maps, start/end structured-light point clouds, curated camera poses, and camera intrinsics. Postprocessing uses ICP and semi-automatic registration, and generates the instrument masks. Result: The resulting D4D dataset contains over 300,000 frames and 369 point clouds across 98 curated recordings, and supports quantitative geometric evaluation in both visible and occluded regions as well as photometric view-synthesis baselines. Conclusion: The D4D dataset can serve as a comprehensive benchmark for developing and evaluating non-rigid SLAM, 4D reconstruction, and depth estimation algorithms. Abstract: The D4D Dataset provides paired endoscopic video and high-quality structured-light geometry for evaluating 3D reconstruction of deforming abdominal soft tissue in realistic surgical conditions. Data were acquired from six porcine cadaver sessions using a da Vinci Xi stereo endoscope and a Zivid structured-light camera, registered via optical tracking and manually curated iterative alignment methods. Three sequence types (whole deformations, incremental deformations, and moved-camera clips) probe algorithm robustness to non-rigid motion, deformation magnitude, and out-of-view updates. Each clip provides rectified stereo images, per-frame instrument masks, stereo depth, start/end structured-light point clouds, curated camera poses and camera intrinsics. In postprocessing, ICP and semi-automatic registration techniques are used to register data, and instrument masks are created. The dataset enables quantitative geometric evaluation in both visible and occluded regions, alongside photometric view-synthesis baselines. Comprising over 300,000 frames and 369 point clouds across 98 curated recordings, this resource can serve as a comprehensive benchmark for developing and evaluating non-rigid SLAM, 4D reconstruction, and depth estimation methods.

[134] VIRGi: View-dependent Instant Recoloring of 3D Gaussians Splats

Alessio Mazzucchelli,Ivan Ojeda-Martin,Fernando Rivas-Manzaneque,Elena Garces,Adrian Penate-Sanchez,Francesc Moreno-Noguer

Main category: cs.CV

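TL;DR: This paper proposes VIRGi, a method for rapidly editing the color of scenes modeled with 3D Gaussian Splatting (3DGS) while preserving view-dependent effects such as specular highlights. By separating diffuse and view-dependent color components, training on multiple views, and fine-tuning on a single manually edited image, it recolors an entire scene seamlessly within two seconds.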

Details Motivation: 3DGS currently lacks an efficient and photorealistic way to edit scene appearance, and it is especially hard to edit color while preserving view-dependent effects such as specular highlights. Method: VIRGi (1) uses a new architecture that decouples color into diffuse and view-dependent components; (2) trains jointly on image patches from multiple viewpoints to improve 3DGS reconstruction accuracy; and (3) needs only one image manually edited by the user, fine-tuning the weights of a single MLP together with a single-shot segmentation module to recolor the whole scene quickly. Result: VIRGi clearly outperforms NeRF-based competitors, both quantitatively and qualitatively, on multiple datasets; recoloring takes only two seconds, supports real-time interaction, and lets the user adjust the strength of view-dependent effects. Conclusion: VIRGi offers the first efficient, photorealistic, and interactive color-editing solution for 3DGS, reaching unprecedented editing speed and quality while preserving geometry and view-dependent properties. Abstract: 3D Gaussian Splatting (3DGS) has recently transformed the fields of novel view synthesis and 3D reconstruction due to its ability to accurately model complex 3D scenes and its unprecedented rendering performance. However, a significant challenge persists: the absence of an efficient and photorealistic method for editing the appearance of the scene's content. In this paper we introduce VIRGi, a novel approach for rapidly editing the color of scenes modeled by 3DGS while preserving view-dependent effects such as specular highlights. Key to our method are a novel architecture that separates color into diffuse and view-dependent components, and a multi-view training strategy that integrates image patches from multiple viewpoints. Improving over the conventional single-view batch training, our 3DGS representation provides more accurate reconstruction and serves as a solid representation for the recoloring task. For 3DGS recoloring, we then introduce a rapid scheme requiring only one manually edited image of the scene from the end-user. By fine-tuning the weights of a single MLP, alongside a module for single-shot segmentation of the editable area, the color edits are seamlessly propagated to the entire scene in just two seconds, facilitating real-time interaction and providing control over the strength of the view-dependent effects. An exhaustive validation on diverse datasets demonstrates significant quantitative and qualitative advancements over competitors based on Neural Radiance Fields representations.

[135] Any Resolution Any Geometry: From Multi-View To Multi-Patch

Wenqing Cui,Zhenyu Li,Mykola Lavreniuk,Jian Shi,Ramzi Idoughi,Xiangjun Tang,Peter Wonka

Main category: cs.CV

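TL;DR: This paper proposes the Ultra Resolution Geometry Transformer (URGT), a multi-patch transformer for joint monocular high-resolution depth and normal estimation. It uses cross-patch attention and a GridMix sampling strategy to improve local detail and global consistency, and reaches SOTA performance on UnrealStereo4K.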

Details Motivation: Joint high-resolution estimation of surface normals and depth struggles to preserve local detail and maintain global consistency at the same time. Method: VGGT is adapted into a unified multi-patch transformer framework; the high-resolution image is split into patches that are augmented with coarse geometric priors from pre-trained models; cross-patch attention enables long-range geometric reasoning; and a GridMix strategy that randomly samples grid configurations improves spatial robustness. Result: On UnrealStereo4K, AbsRel drops from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and the mean angular error of normals from 23.36° to 18.51°; the geometric outputs are sharper and more stable, and the model shows zero-shot transfer and cross-domain generalization. Conclusion: URGT provides an efficient and extensible solution for high-resolution geometry refinement and markedly improves the accuracy and robustness of joint monocular depth and normal estimation. Abstract: Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth-normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation, reducing AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36 degrees to 18.51 degrees, while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
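For readers unfamiliar with the three reported metrics, the sketch below gives their standard textbook definitions. The benchmark's exact protocol (valid-pixel masks, depth caps, alignment) is not stated in the abstract, so this is only an assumed, simplified version operating on flat lists.

```python
import math


def abs_rel(pred, gt):
    """Mean absolute relative depth error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared depth error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle, in degrees, between corresponding normal vectors."""
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        total += math.degrees(math.acos(cos))
    return total / len(gt_normals)
```

Lower is better for all three, so the reported drop in AbsRel from 0.0582 to 0.0291 corresponds to halving the average relative depth error.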

[136] BRIGHT: A Collaborative Generalist-Specialist Foundation Model for Breast Pathology

Xiaojing Guo,Jiatai Lin,Yumian Jia,Jingqi Huang,Zeyan Xu,Weidong Li,Longfei Wang,Jingjing Chen,Qin Li,Weiwei Wang,Lifang Cui,Wen Yue,Zhiqiang Cheng,Xiaolong Wei,Jianzhong Yu,Xia Jin,Baizhou Li,Honghong Shen,Jing Li,Chunlan Li,Yanfen Cui,Yi Dai,Yiling Yang,Xiaolong Qian,Liu Yang,Yang Yang,Guangshen Gao,Yaqing Li,Lili Zhai,Chenying Liu,Tianhua Zhang,Zhenwei Shi,Cheng Lu,Xingchen Zhou,Jing Xu,Miaoqing Zhao,Fang Mei,Jiaojiao Zhou,Ning Mao,Fangfang Liu,Chu Han,Zaiyi Liu

Main category: cs.CV

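TL;DR: This paper presents BRIGHT, the first pathology foundation model designed specifically for breast pathology. It uses a collaborative generalist-specialist framework, is validated on large-scale multi-center data where it shows superior performance across 24 clinical tasks, and supports the broader applicability of this collaborative paradigm to developing organ-specific foundation models.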

Details Motivation: It remains unclear how well existing generalist pathology foundation models perform on the full range of clinical tasks within a single organ system such as the breast, because large-scale single-organ validation cohorts are lacking, as is a tailored training paradigm that turns general histomorphological knowledge into specialist-level interpretation. Method: BRIGHT is trained on approximately 210 million histopathology tiles from about 51,000 breast whole-slide images (over 40,000 patients, 19 hospitals). It uses a collaborative generalist-specialist framework to extract both general and organ-specific features. The authors also build the largest multi-center breast cohorts to date for downstream evaluation (10 hospitals, over 25,000 WSIs), covering 24 clinical tasks. Result: BRIGHT reaches SOTA in 21 of 24 internal validation tasks and in 5 of 10 external validation tasks, with excellent heatmap interpretability. Conclusion: BRIGHT demonstrates clinical utility in breast oncology and also validates the collaborative generalist-specialist paradigm as a scalable template for developing organ-specific pathology foundation models. Abstract: Generalist pathology foundation models (PFMs), pretrained on large-scale multi-organ datasets, have demonstrated remarkable predictive capabilities across diverse clinical applications. However, their proficiency on the full spectrum of clinically essential tasks within a specific organ system remains an open question due to the lack of large-scale validation cohorts for a single organ as well as the absence of a tailored training paradigm that can effectively translate broad histomorphological knowledge into the organ-specific expertise required for specialist-level interpretation. In this study, we propose BRIGHT, the first PFM specifically designed for breast pathology, trained on approximately 210 million histopathology tiles from over 51,000 breast whole-slide images derived from a cohort of over 40,000 patients across 19 hospitals. BRIGHT employs a collaborative generalist-specialist framework to capture both universal and organ-specific features. To comprehensively evaluate the performance of PFMs on breast oncology, we curate the largest multi-institutional cohorts to date for downstream task development and evaluation, comprising over 25,000 WSIs across 10 hospitals. The validation cohorts cover the full spectrum of breast pathology across 24 distinct clinical tasks spanning diagnosis, biomarker prediction, treatment response and survival prediction.
Extensive experiments demonstrate that BRIGHT outperforms three leading generalist PFMs, achieving state-of-the-art (SOTA) performance in 21 of 24 internal validation tasks and in 5 of 10 external validation tasks with excellent heatmap interpretability. By evaluating on large-scale validation cohorts, this study not only demonstrates BRIGHT's clinical utility in breast oncology but also validates a collaborative generalist-specialist paradigm, providing a scalable template for developing PFMs on a specific organ system.

[137] EduVQA: Benchmarking AI-Generated Video Quality Assessment for Education

Baoliang Chen,Xinlong Bu,Lingyu Zhu,Hanwei Zhu,Xiangjie Sui

Main category: cs.CV

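TL;DR: This paper presents EduAIGV-1k, the first benchmark and dataset for evaluating AI-generated videos (AIGVs) in educational settings (specifically early math learning for children), together with a fine-grained, multi-dimensional annotation scheme. It also proposes the EduVQA model and a new S2D-MoE module, which markedly improve automatic AIGV assessment of both perceptual quality and prompt alignment.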

Details Motivation: AI-generated videos (AIGVs) achieve strong visual realism, but their potential for visual, story-driven learning, especially for teaching math concepts to young children, has not been systematically explored or evaluated, and dedicated benchmarks and methods are missing. Method: The authors build the EduAIGV-1k dataset of 1,130 short videos generated by 10 SOTA text-to-video models from 113 pedagogy-oriented prompts. They design fine-grained annotations along two axes (spatial/temporal perceptual quality, and word-level/sentence-level prompt alignment). They propose the EduVQA assessment model, whose core is a Structured 2D Mixture-of-Experts (S2D-MoE) module that models the dependency between each sub-dimension and overall quality through shared experts and a dynamic 2D gating matrix. Result: EduVQA consistently outperforms existing video quality assessment (VQA) baselines on both perceptual quality and prompt alignment; EduAIGV-1k is the first educational AIGV benchmark that offers multi-dimensional, interpretable supervision signals; the dataset and code will be open-sourced. Conclusion: This work fills a gap in the evaluation of education-oriented AIGVs and, through fine-grained annotation and a structured assessment model, provides key infrastructure and methodology for developing high-quality, pedagogically trustworthy AI-generated educational videos. Abstract: While AI-generated content (AIGC) models have achieved remarkable success in generating photorealistic videos, their potential to support visual, story-driven learning in education remains largely untapped. To close this gap, we present EduAIGV-1k, the first benchmark dataset and evaluation framework dedicated to assessing the quality of AI-generated videos (AIGVs) designed to teach foundational math concepts, such as numbers and geometry, to young learners. EduAIGV-1k contains 1,130 short videos produced by ten state-of-the-art text-to-video (T2V) models using 113 pedagogy-oriented prompts. Each video is accompanied by rich, fine-grained annotations along two complementary axes: (1) Perceptual quality, disentangled into spatial and temporal fidelity, and (2) Prompt alignment, labeled at the word-level and sentence-level to quantify the degree to which each mathematical concept in the prompt is accurately grounded in the generated video. These fine-grained annotations transform each video into a multi-dimensional, interpretable supervision signal, far beyond a single quality score. Leveraging this dense feedback, we introduce EduVQA for both perceptual and alignment quality assessment of AIGVs. In particular, we propose a Structured 2D Mixture-of-Experts (S2D-MoE) module, which enhances the dependency between overall quality and each sub-dimension by shared experts and dynamic 2D gating matrix. Extensive experiments show our EduVQA consistently outperforms existing VQA baselines.
Both our dataset and code will be publicly available.

[138] TinyIceNet: Low-Power SAR Sea Ice Segmentation for On-Board FPGA Inference

Mhd Rashed Al Koutayni,Mohamed Selim,Gerd Reis,Alain Pagani,Didier Stricker

Main category: cs.CV

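TL;DR: This paper proposes TinyIceNet, a lightweight semantic segmentation network designed for on-board satellite processing that produces sea ice maps from Sentinel-1 SAR imagery in near real time and at low power.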

Details Motivation: Conventional ground-based processing is limited by downlink bandwidth, latency, and energy cost; a solution is needed that processes SAR imagery on board in real time and produces usable sea ice products. Method: TinyIceNet is a compact semantic segmentation network designed around hardware constraints (FPGA, low power). It combines SAR-aware architectural simplifications with low-precision quantization and is deployed on a Xilinx Zynq UltraScale+ FPGA platform via High-Level Synthesis (HLS). Result: On the AI4Arctic dataset it achieves a 75.216% F1 score for SOD segmentation, uses 2x less energy than full-precision GPU baselines, and supports near-real-time inference. Conclusion: Hardware-algorithm co-design at the chip level is a key route to better real-time performance and energy efficiency in spaceborne and edge AI systems. Abstract: Accurate sea ice mapping is essential for safe maritime navigation in polar regions, where rapidly changing ice conditions require timely and reliable information. While Sentinel-1 Synthetic Aperture Radar (SAR) provides high-resolution, all-weather observations of sea ice, conventional ground-based processing is limited by downlink bandwidth, latency, and energy costs associated with transmitting large volumes of raw data. On-board processing, enabled by dedicated inference chips integrated directly within the satellite payload, offers a transformative alternative by generating actionable sea ice products in orbit. In this context, we present TinyIceNet, a compact semantic segmentation network co-designed for on-board Stage of Development (SOD) mapping from dual-polarized Sentinel-1 SAR imagery under strict hardware and power constraints. Trained on the AI4Arctic dataset, TinyIceNet combines SAR-aware architectural simplifications with low-precision quantization to balance accuracy and efficiency. The model is synthesized using High-Level Synthesis and deployed on a Xilinx Zynq UltraScale+ FPGA platform, demonstrating near-real-time inference with significantly reduced energy consumption. Experimental results show that TinyIceNet achieves 75.216% F1 score on SOD segmentation while reducing energy consumption by 2x compared to full-precision GPU baselines, underscoring the potential of chip-level hardware-algorithm co-design for future spaceborne and edge AI systems.
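The abstract mentions low-precision quantization without giving the bit width or scheme. As a rough illustration only, the sketch below assumes symmetric, per-tensor, 8-bit uniform quantization, which is one common choice for FPGA inference; the function names are hypothetical and not from the paper.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers with one shared scale (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from integer levels."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.03, -0.7, 0.25, 1.2]
    q, s = quantize_symmetric(weights)
    print(q, dequantize(q, s))
```

With this scheme the round-trip error of any in-range value is at most half a quantization step (`scale / 2`), which is the accuracy cost traded for cheaper integer arithmetic and storage.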

[139] MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection

Jun Yeong Park,JunYoung Seo,Minji Kang,Yu Rang Park

Main category: cs.CV

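TL;DR: This paper proposes MoECLIP, a Mixture-of-Experts extension of CLIP that dynamically routes image patches to specialized LoRA experts and introduces FOFS and an ETF loss to improve zero-shot anomaly detection (ZSAD).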

Details Motivation: Existing ZSAD methods use a patch-agnostic design that cannot treat image patches differently according to their characteristics, which limits how well the model can adapt to anomaly detection while retaining CLIP's generalization. Method: MoECLIP (1) dynamically routes each patch to a specialized LoRA expert based on its features; (2) introduces Frozen Orthogonal Feature Separation (FOFS) to orthogonally separate the input feature space; and (3) applies a simplex equiangular tight frame (ETF) loss that constrains expert outputs to be maximally equiangular. Result: On 14 benchmark datasets spanning industrial and medical domains, MoECLIP clearly outperforms existing state-of-the-art methods. Conclusion: Through patch-level adaptation and functional decoupling of experts, MoECLIP balances CLIP's generalization against the task specificity of ZSAD and offers a new paradigm for zero-shot anomaly detection. Abstract: The CLIP model's outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP's powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. The code is available at https://github.com/CoCoRessa/MoECLIP.
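A simplex ETF over K vectors is a set of unit vectors whose pairwise cosine similarity is exactly -1/(K-1), the most negative value that all pairs can share. The sketch below constructs one and defines an illustrative penalty measuring how far a set of vectors is from that configuration. The penalty is an assumption made for exposition; the abstract does not give the exact form of the paper's ETF loss.

```python
import math


def simplex_etf(k):
    """Return k unit vectors in R^k with pairwise cosine -1/(k-1)."""
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def etf_penalty(vectors):
    """Mean squared gap between pairwise cosines and the ETF target (assumed form)."""
    k = len(vectors)
    target = -1.0 / (k - 1)
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += (cosine(vectors[i], vectors[j]) - target) ** 2
            pairs += 1
    return total / pairs
```

The penalty is zero for a perfect simplex ETF and grows as the vectors collapse toward each other, which matches the stated goal of preventing functional redundancy among experts.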

[140] AWDiff: An a trous wavelet diffusion model for lung ultrasound image synthesis

Maryam Heidari,Nantheera Anantrasirichai,Steven Walker,Rahul Bhatnagar,Alin Achim

Main category: cs.CV

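TL;DR: This paper proposes AWDiff, a diffusion-based augmentation framework that combines the a trous wavelet transform with BioMedCLIP semantic conditioning to generate lung ultrasound images, increasing data diversity and clinical relevance while preserving subtle diagnostic features such as B-lines.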

Details Motivation: Scarce lung ultrasound (LUS) data limits machine learning progress, and existing generative augmentation methods (such as GANs and diffusion models) lose key diagnostic features (such as B-lines and pleural irregularities) because of downsampling. Method: A trous Wavelet Diffusion (AWDiff) (1) replaces conventional downsampling with the a trous wavelet transform to preserve fine-scale structures, and (2) uses BioMedCLIP for semantic conditioning so that generated images align with clinical labels. Result: On a LUS dataset, AWDiff achieves lower distortion and higher perceptual quality than existing methods while offering both structural fidelity and clinical diversity. Conclusion: AWDiff addresses the loss of detail and the lack of clinical relevance in LUS image generation and offers a new paradigm for medical image data augmentation. Abstract: Lung ultrasound (LUS) is a safe and portable imaging modality, but the scarcity of data limits the development of machine learning methods for image interpretation and disease monitoring. Existing generative augmentation methods, such as Generative Adversarial Networks (GANs) and diffusion models, often lose subtle diagnostic cues due to resolution reduction, particularly B-lines and pleural irregularities. We propose A trous Wavelet Diffusion (AWDiff), a diffusion based augmentation framework that integrates the a trous wavelet transform to preserve fine-scale structures while avoiding destructive downsampling. In addition, semantic conditioning with BioMedCLIP, a vision language foundation model trained on large scale biomedical corpora, enforces alignment with clinically meaningful labels. On a LUS dataset, AWDiff achieved lower distortion and higher perceptual quality compared to existing methods, demonstrating both structural fidelity and clinical diversity.
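The a trous ("with holes") wavelet transform avoids downsampling by dilating the smoothing kernel instead of shrinking the signal, so every scale keeps the full resolution. The 1D sketch below assumes the widely used B3-spline kernel and clamped borders; the abstract does not state which kernel or boundary handling the paper uses, nor how the planes enter the diffusion model.

```python
B3_KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]  # assumed smoothing kernel


def smooth(signal, level):
    """Smooth with the kernel taps spaced 2**level samples apart (the 'holes')."""
    step = 2 ** level
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(B3_KERNEL):
            idx = min(max(i + (k - 2) * step, 0), n - 1)  # clamp at the borders
            acc += weight * signal[idx]
        out.append(acc)
    return out


def atrous_decompose(signal, levels):
    """Return (detail planes, residual); every plane has the input's length."""
    current = [float(x) for x in signal]
    details = []
    for level in range(levels):
        smoothed = smooth(current, level)
        details.append([c - s for c, s in zip(current, smoothed)])
        current = smoothed
    return details, current


def atrous_reconstruct(details, residual):
    """Sum the detail planes and the residual to recover the signal."""
    out = list(residual)
    for plane in details:
        out = [o + d for o, d in zip(out, plane)]
    return out
```

Because each detail plane is the difference between two successive smoothings, summing all planes with the final residual telescopes back to the input, so no fine-scale information is discarded at any level.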

[141] Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Jiyuan Wang,Chunyu Lin,Lei Sun,Zhi Cao,Yuyang Yin,Lang Nie,Zhenlong Yuan,Xiangxiang Chu,Yunchao Wei,Kang Liao,Guosheng Lin

Main category: cs.CV

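TL;DR: This paper proposes RL3DEdit, a reinforcement-learning-based single-pass 3D editing framework that uses reward signals from the 3D foundation model VGGT (such as confidence maps and pose estimation errors) to guide a 2D diffusion model toward multi-view consistent 3D edits, without relying on scarce paired 3D supervision.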

Details Motivation: Existing methods struggle to guarantee multi-view consistency in 3D editing driven by 2D diffusion models, and there is not enough 3D-consistent paired editing data to support supervised fine-tuning. Verifying 3D consistency is easier than generating it, which makes reinforcement learning a feasible route. Method: RL3DEdit casts the 2D editing process as a reinforcement learning task, using the confidence maps and pose estimation errors output by the 3D foundation model VGGT as reward signals, and uses RL optimization to align the 2D editing priors directly with a 3D-consistent manifold. Result: It outperforms existing state-of-the-art methods in multi-view consistency, editing quality, and runtime efficiency, with stable performance. Conclusion: RL3DEdit shows that verifiable 3D-consistency rewards alone, without paired 3D annotations, can effectively improve the 3D editing ability of 2D diffusion models, offering a new paradigm for 3D editing when data is scarce. Abstract: Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.

[142] Kling-MotionControl Technical Report

Kling Team,Jialu Chen,Yikang Ding,Zhixue Fang,Kun Gai,Kang He,Xu He,Jingyun Hua,Mingming Lao,Xiaohan Li,Hui Liu,Jiwen Liu,Xiaoqiang Liu,Fan Shi,Xiaoyu Shi,Peiqin Sun,Songlin Tang,Pengfei Wan,Tiancheng Wen,Zhiyong Wu,Haoxian Zhang,Runze Zhao,Yuanxing Zhang,Yan Zhou

Main category: cs.CV

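TL;DR: This paper presents Kling-MotionControl, a unified DiT-based framework for high-fidelity, robust, precise, and expressive holistic character animation. It uses a divide-and-conquer strategy to combine body, face, and hand motion representations, and supports cross-identity generalization, appearance preservation, and text-driven control.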

Details Motivation: Existing methods struggle to combine large-scale structural stability with fine-grained motion expressiveness, and fall short on cross-identity generalization and appearance fidelity, so a more unified, controllable, and efficient animation framework is needed. Method: The DiT-based Kling-MotionControl framework models the heterogeneous motion of body, face, and hands with a divide-and-conquer strategy; introduces adaptive identity-agnostic learning for cross-identity generalization; uses carefully designed identity injection and fusion mechanisms plus a subject library to keep appearance consistent; and applies multi-stage distillation to accelerate inference. Result: In human preference evaluations it clearly outperforms leading commercial and open-source solutions, leading in holistic motion control accuracy, open-domain generalization, and visual quality and coherence; inference is more than 10x faster. Conclusion: Kling-MotionControl is a robust, high-quality, controllable, and lifelike character animation solution that moves generative animation toward practical and fine-grained use. Abstract: Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation. Leveraging a divide-and-conquer strategy within a cohesive system, the model orchestrates heterogeneous motion representations tailored to the distinct characteristics of body, face, and hands, effectively reconciling large-scale structural stability with fine-grained articulatory expressiveness. To ensure robust cross-identity generalization, we incorporate adaptive identity-agnostic learning, facilitating natural motion retargeting for diverse characters ranging from realistic humans to stylized cartoons. Simultaneously, we guarantee faithful appearance preservation through meticulous identity injection and fusion designs, further supported by a subject library mechanism that leverages comprehensive reference contexts. To ensure practical utility, we implement an advanced acceleration framework utilizing multi-stage distillation, boosting inference speed by over 10x. Kling-MotionControl distinguishes itself through intelligent semantic motion understanding and precise text responsiveness, allowing for flexible control beyond visual inputs.
Human preference evaluations demonstrate that Kling-MotionControl delivers superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence. These results establish Kling-MotionControl as a robust solution for high-quality, controllable, and lifelike character animation.

[143] Conditioned Activation Transport for T2I Safety Steering

Maciej Chrabąszcz,Aleksander Szymczyk,Jan Dubiński,Tomasz Trzciński,Franziska Boenisch,Adam Dziedzic

Main category: cs.CV

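TL;DR: This paper proposes the Conditioned Activation Transport (CAT) framework, which uses geometric conditioning and nonlinear transport maps to reduce the risk of text-to-image models generating harmful content without degrading image quality.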

Details Motivation: Current text-to-image (T2I) models readily generate unsafe and toxic content, while existing activation steering methods often degrade image quality on benign prompts, creating a trade-off between safety and quality. Method: The authors construct SafeSteerDataset, containing 2,300 safe/unsafe prompt pairs with high cosine similarity, and propose the CAT framework, which uses a geometry-based conditioning mechanism and nonlinear transport maps so that transport acts only within unsafe activation regions. Result: Validated on two SOTA architectures, Z-Image and Infinity, CAT significantly reduces Attack Success Rate while maintaining image fidelity relative to unsteered generations. Conclusion: CAT is an effective inference-time intervention that precisely suppresses harmful generation, balances safety with image quality, and generalizes across architectures. Abstract: Despite their impressive capabilities, current Text-to-Image (T2I) models remain prone to generating unsafe and toxic content. While activation steering offers a promising inference-time intervention, we observe that linear activation steering frequently degrades image quality when applied to benign prompts. To address this trade-off, we first construct SafeSteerDataset, a contrastive dataset containing 2300 safe and unsafe prompt pairs with high cosine similarity. Leveraging this data, we propose Conditioned Activation Transport (CAT), a framework that employs a geometry-based conditioning mechanism and nonlinear transport maps. By conditioning transport maps to activate only within unsafe activation regions, we minimize interference with benign queries. We validate our approach on two state-of-the-art architectures: Z-Image and Infinity. Experiments demonstrate that CAT generalizes effectively across these backbones, significantly reducing Attack Success Rate while maintaining image fidelity compared to unsteered generations. Warning: This paper contains potentially offensive text and images.

[144] ProSMA-UNet: Decoder Conditioning for Proximal-Sparse Skip Feature Selection

Chun-Wun Cheng,Yanqi Cheng,Peiyuan Jing,Guang Yang,Carola-Bibiane Schönlieb,Angelica I. Aviles-Rivero

Main category: cs.CV

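TL;DR: This paper proposes ProSMA-UNet, which implements sparse skip-connection gating through a multi-scale compatibility field and a learnable ℓ1 proximal operator, explicitly removing irrelevant features. It markedly improves segmentation of low-contrast medical images, with gains of about 20% on 3D tasks.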

Details Motivation: Skip connections in a conventional U-Net pass along low-level textures, background clutter, and noise, which bypass deeper semantic filtering; the problem is most severe in low-contrast clinical images. Existing attention gates only softly reweight features rather than explicitly removing irrelevant ones. Method: The Proximal-Sparse Multi-Scale Attention (ProSMA) mechanism (1) builds a multi-scale compatibility field with lightweight depthwise dilated convolutions; (2) introduces an ℓ1 proximal operator with learnable per-channel thresholds, giving a closed-form soft-thresholding gate that explicitly sparsifies features; and (3) adds channel gating driven by global decoder context to suppress semantically irrelevant channels. Result: It reaches SOTA performance on several challenging 2D and 3D medical image segmentation benchmarks, improving metrics such as mDice by about 20% on difficult 3D tasks. Conclusion: By replacing conventional softly weighted skip connections with explicit sparse gating, ProSMA-UNet suppresses noise and irrelevant information and shows stronger robustness and generalization in low-contrast medical image segmentation. Abstract: Medical image segmentation commonly relies on U-shaped encoder-decoder architectures such as U-Net, where skip connections preserve fine spatial detail by injecting high-resolution encoder features into the decoder. However, these skip pathways also propagate low-level textures, background clutter, and acquisition noise, allowing irrelevant information to bypass deeper semantic filtering, an issue that is particularly detrimental in low-contrast clinical imaging. Although attention gates have been introduced to address this limitation, they typically produce dense sigmoid masks that softly reweight features rather than explicitly removing irrelevant activations. We propose ProSMA-UNet (Proximal-Sparse Multi-Scale Attention U-Net), which reformulates skip gating as a decoder-conditioned sparse feature selection problem. ProSMA constructs a multi-scale compatibility field using lightweight depthwise dilated convolutions to capture relevance across local and contextual scales, then enforces explicit sparsity via an ℓ1 proximal operator with learnable per-channel thresholds, yielding a closed-form soft-thresholding gate that can remove noisy responses. To further suppress semantically irrelevant channels, ProSMA incorporates decoder-conditioned channel gating driven by global decoder context. Extensive experiments on challenging 2D and 3D benchmarks demonstrate state-of-the-art performance, with particularly large gains (approximately 20%) on difficult 3D segmentation tasks. Project page: https://math-ml-x.github.io/ProSMA-UNet/
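The proximal operator of the scaled ℓ1 norm has a well-known closed form, soft-thresholding: values whose magnitude is at most the threshold become exactly zero, and the rest shrink toward zero by the threshold. The sketch below shows only that operator with one threshold per channel; how ProSMA wires it into the skip pathway is not specified in the abstract, and the function names are hypothetical.

```python
def soft_threshold(x, tau):
    """Proximal operator of tau * |x|: shrink toward zero, zeroing small values."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def sparse_gate(features, thresholds):
    """Apply soft-thresholding channel by channel.

    features: list of channels, each a list of floats.
    thresholds: one non-negative threshold per channel (learnable in the paper,
    fixed numbers in this sketch).
    """
    return [
        [soft_threshold(value, tau) for value in channel]
        for channel, tau in zip(features, thresholds)
    ]


if __name__ == "__main__":
    gated = sparse_gate([[0.1, -0.5, 0.9], [2.0, -0.3, 0.0]], [0.2, 1.0])
    print(gated)
```

Unlike a sigmoid mask, which always leaves a nonzero fraction of every activation, this gate outputs exact zeros for weak responses, which is the sense in which it removes rather than reweights features.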

[145] Chain of World: World Model Thinking in Latent Motion

Fuxiang Yang,Donglin Di,Lulu Tang,Xuancheng Zhang,Lei Fan,Hao Li,Chen Wei,Tonghua Su,Baorui Ma

Main category: cs.CV

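TL;DR: CoWVLA introduces a new "Chain of World" paradigm that combines the temporal reasoning of world models with a disentangled latent motion representation. It extracts structure and motion latents with a pretrained video VAE and jointly models continuous latent motion chains and discrete action sequences during pre-training and co-fine-tuning, improving the efficiency and effectiveness of visuomotor learning.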

Details Motivation: Existing Vision-Language-Action (VLA) models overlook the temporal-causal structure of visual dynamics; world-model VLAs redundantly reconstruct backgrounds, and latent-action VLAs lack continuous dynamic modeling and world knowledge. Method: CoWVLA (1) uses a pretrained video VAE as a latent motion extractor that factorizes video into structure and motion latents; (2) during pre-training, infers a continuous latent motion chain from the instruction and initial frame and predicts the terminal frame; and (3) during co-fine-tuning, jointly models sparse keyframes and action sequences in a unified autoregressive decoder. Result: It clearly outperforms existing world-model and latent-action methods on robotic simulation benchmarks, combining computational efficiency with strong performance. Conclusion: CoWVLA unifies the temporal reasoning of world models with the compactness and interpretability of latent actions, making it a more effective VLA pretraining paradigm. Abstract: Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.

[146] Specificity-aware reinforcement learning for fine-grained open-world classification

Samuele Angheben,Davide Berasi,Alessandro Conti,Elisa Ricci,Yiming Wang

Main category: cs.CV

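TL;DR: This paper proposes SpeciaRL, a specificity-aware reinforcement learning framework that improves the balance between correctness and specificity of reasoning large vision-language models on open-world fine-grained image classification.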

Details Motivation: Existing reasoning large vision-language models tend to produce overly generic predictions in fine-grained image classification; they possess the underlying fine-grained knowledge but struggle to be both correct and specific. Method: The SpeciaRL framework uses a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, steering the model toward greater specificity while avoiding incorrect predictions. Result: In out-of-domain experiments, SpeciaRL achieves the best correctness-specificity trade-off across multiple fine-grained benchmarks, outperforming existing methods. Conclusion: SpeciaRL addresses the difficulty of being both correct and specific in open-world fine-grained image classification and advances this line of work. Abstract: Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at https://github.com/s-angheben/SpeciaRL.

[147] COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design

Miguel Espinosa,Eva Gmelich Meijling,Valerio Marsocci,Elliot J. Crowley,Mikolaj Czerkawski

Main category: cs.CV

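TL;DR: This paper proposes COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous remote sensing data (optical, radar, elevation, and more) and supports conditional generation between arbitrary modalities with uncertainty modeling.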

Details Motivation: Relationships between multi-source remote sensing data are non-injective, and deterministic models cannot capture their inherent uncertainty and diversity, which limits tasks such as data completion and cross-sensor translation. Method: COP-GEN is a multimodal transformer based on latent diffusion that models the joint distribution of the modalities at their native resolutions and parameterizes cross-modal mappings as conditional distributions, supporting zero-shot translation, band infilling, and generation under missing inputs. Result: On a large-scale global multimodal dataset, COP-GEN generates diverse yet physically consistent samples while maintaining high fidelity across optical, radar, and elevation modalities; quantitative and qualitative analyses show that it captures meaningful cross-modal structure and adapts its output uncertainty as conditioning information increases. Conclusion: Stochastic generative modeling is of practical importance for Earth observation, and evaluation should move beyond single-reference, pointwise metrics. Abstract: Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products. Relationships between these modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Thus, such conditional mappings should be parametrised as data distributions. As a result, deterministic models tend to collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining. Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases.
These results highlight the practical importance of stochastic generative modeling for Earth observation and motivate evaluation protocols that move beyond single-reference, pointwise metrics. Website: https:// miquel-espinosa.github.io/cop-gen

[148] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Zimo Wen,Boxiu Li,Wanbo Zhang,Junxiang Lei,Xiaoyu Chen,Yijia Fan,Qi Zhang,Yujiang Wang,Lili Qiu,Bo Li,Ziwei Liu,Caihua Shan,Yifan Yang,Yifei Shen

Main category: cs.CV

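TL;DR: This paper introduces UniG2U-Bench, a benchmark that systematically evaluates the role of generation in understanding tasks for unified multimodal models. It finds that generation does not always improve understanding and helps only on specific tasks (such as spatial reasoning, visual illusions, and multi-round reasoning), and it reveals inductive biases induced by generation-understanding coupling.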

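Details Motivation: Existing benchmarks lack a systematic investigation of whether and when generation facilitates understanding, so an evaluation suite covering diverse generation-to-understanding (G2U) scenarios is needed. Method: Builds UniG2U-Bench, covering 7 regimes and 30 subtasks that require varying degrees of implicit or explicit visual transformation; evaluates over 30 unified multimodal models at scale and analyses the generation-understanding relationship and the factors that influence it. Result: 1) Unified models generally underperform their base VLMs, and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference; 2) generation yields consistent gains on spatial intelligence, visual illusion, and multi-round reasoning tasks; 3) tasks with similar structure and models with similar architectures behave in correlated ways, suggesting that generation-understanding coupling induces inductive biases that are consistent across tasks, pretraining data, and architectures. Conclusion: Generation is not a panacea; its benefit to understanding is task-specific, and more diverse training data and new paradigms are needed to unlock the potential of unified multimodal modeling. Abstract: Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.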

[149] DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction

Yufu Wang,Evonne Ng,Soyong Shin,Rawal Khirodkar,Yuan Dong,Zhaoen Su,Jinhyung Park,Kris Kitani,Alexander Richard,Fabian Prada,Michael Zollhofer

Main category: cs.CV

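TL;DR: DuoMo is a two-stage diffusion method that recovers human motion in world coordinates from unconstrained videos. It first models motion in camera coordinates, then lifts it to world coordinates and refines it for global consistency, generating mesh-vertex motion directly without a parametric body model. It substantially reduces world-space reconstruction error on the EMDB and RICH datasets.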

Details Motivation: Recovering consistent and accurate world-space human motion from unconstrained videos with noisy, incomplete observations involves a fundamental trade-off between generalization and global consistency, which existing methods struggle to balance. Method: Proposes a dual diffusion architecture: in the first stage, a diffusion model estimates initial motion in camera coordinates; in the second stage, another diffusion model lifts that estimate to world coordinates and refines it for global consistency. The pipeline generates mesh-vertex motion directly and does not rely on parametric models such as SMPL. Result: On EMDB, world-space reconstruction error drops by 16% while foot skating stays low; on RICH, world-space error drops by 30%; the method achieves state-of-the-art performance. Conclusion: By decoupling camera-space and world-space modeling, DuoMo balances generalization and global consistency, and demonstrates the feasibility and advantages of non-parametric, end-to-end generation of world-space motion. Abstract: We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: https://yufu-wang.github.io/duomo/
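
For orientation, the geometric relation between the camera and world frames in the DuoMo abstract can be sketched as a rigid transform. This is only an illustration under the assumption that a camera-to-world pose is known; DuoMo itself learns the lifting with a diffusion model, and the function name `camera_to_world` and its arguments are hypothetical.

```python
def camera_to_world(points_cam, rotation, translation):
    """Map 3D points from camera to world coordinates.

    Assumes a known camera-to-world pose (an assumption for illustration):
      rotation    -- 3x3 rotation matrix as a list of rows
      translation -- camera position in world coordinates
    Each point is mapped as p_world = R @ p_cam + t.
    """
    world = []
    for p in points_cam:
        world.append(tuple(
            sum(rotation[r][k] * p[k] for k in range(3)) + translation[r]
            for r in range(3)
        ))
    return world


if __name__ == "__main__":
    # A 90-degree rotation about the z axis, with the camera at the world origin.
    rot_z90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
    print(camera_to_world([(1, 0, 0)], rot_z90, (0, 0, 0)))
```

Per-frame transforms like this give no guarantee of a globally consistent trajectory when the poses are noisy, which is the gap the world-space refinement stage is described as addressing.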

[150] LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Junyi Zhang,Charles Herrmann,Junhwa Hur,Chen Sun,Ming-Hsuan Yang,Forrester Cole,Trevor Darrell,Deqing Sun

Main category: cs.CV

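TL;DR: LoGeR is a new architecture for long-sequence geometric reconstruction. It processes video in chunks and links them with a learned hybrid memory module (parametric TTT memory plus non-parametric sliding window attention), achieving minutes-long dense 3D reconstruction without post-optimization and substantially improving global consistency and accuracy.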

Details Motivation: Existing feedforward geometric foundation models are limited by quadratic attention complexity or by the limited effective memory of recurrent designs, making them hard to scale to minutes-long video reconstruction. Method: Proposes the LoGeR architecture: 1) it processes the video stream in chunks, using bidirectional priors for high-fidelity intra-chunk reasoning; 2) it introduces a learned hybrid memory module comprising a parametric Test-Time Training (TTT) memory that anchors the global coordinate frame and a non-parametric Sliding Window Attention (SWA) mechanism that preserves uncompressed context. Result: Reduces ATE on KITTI by over 74% and achieves robust, globally consistent reconstruction on the VBR dataset with sequences of up to 19k frames; it is trained on only 128 frames and generalizes to thousands of frames at inference. Conclusion: LoGeR breaks through the bottleneck of long-horizon dense 3D reconstruction, achieving for the first time high-accuracy, globally consistent geometric reconstruction of minutes-long videos without relying on post-optimization. Abstract: Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.
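
To make the sliding-window idea in the LoGeR abstract concrete, here is a minimal sketch of a generic chunk-level sliding-window attention mask. It is not LoGeR's implementation: the chunk granularity, the causal direction, and the names `sliding_window_mask` and `visible_chunks` are assumptions chosen for illustration.

```python
def sliding_window_mask(num_chunks, window):
    """Build a boolean mask where mask[i][j] is True if chunk i may attend to chunk j.

    Chunk i sees itself and the `window` chunks immediately before it,
    so memory per chunk stays bounded regardless of sequence length.
    """
    return [
        [0 <= i - j <= window for j in range(num_chunks)]
        for i in range(num_chunks)
    ]


def visible_chunks(mask, i):
    """List the chunk indices that chunk i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


if __name__ == "__main__":
    mask = sliding_window_mask(num_chunks=5, window=2)
    for i in range(5):
        print(i, visible_chunks(mask, i))
```

Because each chunk only reaches a fixed number of neighbours, information from far earlier in the video has to be carried some other way, which is the role the abstract assigns to the parametric TTT memory.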

[151] Beyond Language Modeling: An Exploration of Multimodal Pretraining

Shengbang Tong,David Fan,John Nguyen,Ellis Brown,Gaoyue Zhou,Shengyi Qian,Boyang Zheng,Théophane Vallaeys,Junlin Han,Rob Fergus,Naila Murray,Marjan Ghazvininejad,Mike Lewis,Nicolas Ballas,Amir Bar,Michael Rabbat,Jakob Verbeek,Luke Zettlemoyer,Koustuv Sinha,Yann LeCun,Saining Xie

Main category: cs.CV

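TL;DR: Through controlled, from-scratch pretraining experiments, this paper explores the design space of multimodal foundation models. It adopts the Transfusion framework (next-token prediction for language and diffusion for vision) and reports four key insights: RAE is the best choice for a unified visual representation; visual and language data are complementary and jointly improve downstream capabilities; unified multimodal pretraining leads naturally to world modeling; and MoE architectures scale efficiently and induce modality specialization, while mitigating the scaling asymmetry in which vision is more data-hungry than language.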

Details Motivation: Despite growing interest in multimodal foundation models that include vision, their native design space remains unclear; controlled experiments free of interference from language pretraining are needed to provide empirical insight. Method: Adopts the Transfusion framework, combining next-token prediction for language with diffusion modeling for vision, and pretrains from scratch on diverse data including text, video, image-text pairs, and action-conditioned video; conducts IsoFLOP analysis to model multimodal scaling laws. Result: RAE outperforms other visual representations; visual and language data are synergistic; unified pretraining leads naturally to emergent world-modeling capabilities; MoE supports efficient multimodal scaling and induces modality specialization; vision is more data-hungry than language, and MoE reconciles this scaling asymmetry. Conclusion: The MoE architecture is a key path toward truly unified, scalable multimodal foundation models, accommodating both the high model capacity required by language and the data-intensive nature of vision. Abstract: The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.

[152] CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance

Hanyang Wang,Yiyang Liu,Jiawei Chi,Fangfu Liu,Ran Xue,Yueqi Duan

Main category: cs.CV

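TL;DR: This paper proposes SMC-CFG, which recasts classifier-free guidance (CFG) as a sliding mode control problem and uses a nonlinear feedback mechanism to improve the semantic alignment and robustness of diffusion models at high guidance scales.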

Details Motivation: Existing CFG methods based on linear control are prone to instability, overshoot, and semantic distortion at high guidance scales, calling for a more robust control paradigm. Method: Proposes the unified CFG-Ctrl framework, which treats CFG as a control applied to the first-order continuous-time generative flow; interprets standard CFG as a proportional controller (P-control) and further introduces sliding mode control (SMC-CFG), defining an exponential sliding mode surface and a switching control term, with a Lyapunov stability analysis supporting finite-time convergence. Result: Evaluated on text-to-image models including Stable Diffusion 3.5, Flux, and Qwen-Image, SMC-CFG outperforms standard CFG, improving semantic alignment and robustness across guidance scales. Conclusion: Sliding mode control offers a more stable, nonlinear view of CFG, and SMC-CFG is a theoretically grounded and practically effective guidance method. Abstract: Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional-unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales. Project Page: https://hanyang-21.github.io/CFG-Ctrl
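
The proportional-control reading of vanilla CFG described in the abstract can be sketched in a few lines. This shows only the standard CFG combination rule, not SMC-CFG; the function name `cfg_velocity` and the use of plain lists for the velocity field are assumptions for illustration, and the paper's exact parametrisation may differ.

```python
def cfg_velocity(v_uncond, v_cond, scale):
    """Standard classifier-free guidance on a velocity (or noise) prediction.

    Read as a proportional controller: the error signal is the
    conditional-unconditional discrepancy, and `scale` is a fixed gain.
    """
    # Error signal: how far the conditional prediction is from the unconditional one.
    error = [c - u for u, c in zip(v_uncond, v_cond)]
    # Proportional correction with the same constant gain on every component.
    return [u + scale * e for u, e in zip(v_uncond, error)]


if __name__ == "__main__":
    v_u = [0.0, 1.0]
    v_c = [1.0, 3.0]
    print(cfg_velocity(v_u, v_c, 1.0))  # scale 1 recovers the conditional prediction
    print(cfg_velocity(v_u, v_c, 2.0))  # a larger gain pushes further along the error
```

Since the correction grows linearly with the gain, large guidance scales amplify any error in the discrepancy signal, which is consistent with the instability and overshoot the abstract attributes to linear control.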

[153] MIBURI: Towards Expressive Interactive Gesture Synthesis

M. Hamza Mughal,Rishabh Dabral,Vera Demberg,Christian Theobalt

Main category: cs.CV

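TL;DR: This paper presents MIBURI, the first online, causal framework for generating full-body gestures and facial expressions driven by real-time dialogue. Using body-part-aware hierarchical discrete encoding and two-dimensional causal autoregressive modeling, it achieves expressive, diverse, and real-time synchronized motion.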

Details Motivation: Existing LLM-based conversational agents lack embodiment and natural gestures; traditional ECAs produce rigid, low-diversity motion, and generative co-speech gesture methods depend on future speech and are computationally slow, so none meets the needs of real-time interaction. Method: Proposes the MIBURI framework: body-part-aware hierarchical gesture codecs encode motion into multi-level discrete tokens; conditioned on LLM-based speech-text embeddings, a two-dimensional causal autoregressive model captures temporal dynamics and the part-level hierarchy in real time; auxiliary objectives improve expressiveness and diversity and prevent convergence to static poses. Result: Comparative experiments show that this causal, real-time approach outperforms recent baselines in naturalness and contextual alignment, and supports synchronized generation of full-body gestures and facial expressions during live dialogue. Conclusion: MIBURI bridges the key gap among real-time performance, expressiveness, and naturalness in embodied conversational agents, offering a scalable, low-latency generation paradigm for next-generation interactive ECAs. Abstract: Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures against recent baselines. We urge the reader to explore demo videos on https://vcai.mpi-inf.mpg.de/projects/MIBURI/.

[154] Utonia: Toward One Encoder for All Point Clouds

Yujia Zhang,Xiaoyang Wu,Yunhan Yang,Xianzhe Fan,Han Li,Yuechen Zhang,Zehao Huang,Naiyan Wang,Hengshuang Zhao

Main category: cs.CV

TL;DR: Utonia is a self-supervised point transformer encoder trained jointly across diverse point cloud domains (e.g., remote sensing, LiDAR, RGB-D, CAD, video-lifted), presented as a first step toward a unified, transferable representation that enhances perception and benefits downstream embodied and multimodal tasks.

Details Motivation: To unify heterogeneous point cloud data from multiple domains into a single self-supervised model, enabling cross-domain transfer and serving as a foundation model for sparse 3D data. Method: Proposes Utonia, a self-supervised point transformer encoder trained jointly on diverse point cloud domains (remote sensing, outdoor LiDAR, indoor RGB-D, object-centric CAD, and RGB-video-lifted point clouds) despite their differences in geometry, density, and priors. Result: Utonia learns a consistent, cross-domain representation space; improves perception performance; reveals emergent behaviors that arise only under joint training; boosts robotic manipulation when used to condition vision-language-action policies; and enhances spatial reasoning in vision-language models. Conclusion: Utonia demonstrates the feasibility and benefits of unified self-supervised learning on heterogeneous point clouds, paving the way toward foundation models for sparse 3D data with broad applications in AR/VR, robotics, and autonomous driving. Abstract: We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.