cs.CL
[1] Self-Execution Simulation Improves Coding Models
Gallil Maimon,Ori Yoran,Felix Kreuk,Michael Hassid,Gal Cohen,Pierre Chambon,Yossi Adi
Main category: cs.CL
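TL;DR: This paper trains code LLMs to simulate program execution step by step, combining supervised fine-tuning with reinforcement learning from verifiable rewards, to improve their performance on competitive programming tasks.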
Details
Motivation: Large language models (LLMs) are limited in generating correct code, in particular because they struggle to estimate how the code they generate will execute; this paper aims to improve the model's ability to understand and simulate program execution.
Method: Combine supervised fine-tuning (on natural-language execution traces, i.e. textual explanations grounded in true execution) with reinforcement learning using verifiable rewards. Two complementary objectives are used: predicting the output given code and inputs, and solving competitive programming tasks with either ground-truth or self-predicted execution feedback.
Result: Across multiple competitive programming benchmarks, the method yields consistent improvements over standard reasoning approaches; ablations and analysis clarify where execution simulation helps and where it falls short.
Conclusion: Training code LLMs to simulate program execution is feasible and effective; it supports self-verification and iterative self-fixing, improving reliability and accuracy on competitive programming tasks.
Abstract: A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code LLMs can be trained to simulate program execution in a step-by-step manner and that this capability can be leveraged to improve competitive programming performance. Our approach combines supervised fine-tuning on natural language execution traces, textual explanations grounded in true execution, with reinforcement learning using verifiable rewards. We introduce two complementary objectives: output prediction given code and inputs, and solving competitive programming tasks with either ground-truth or self-predicted execution feedback. These objectives enable models to perform self-verification over multiple candidate solutions, and iterative self-fixing by simulating test execution. Across multiple competitive programming benchmarks, our method yields consistent improvements over standard reasoning approaches. We further present ablations and analysis to elucidate the role of execution simulation and its limitations.
[2] Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation
Minghe Shen,Ananth Balashankar,Adam Fisch,David Madras,Miguel Rodrigues
Main category: cs.CL
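TL;DR: This paper proposes a constrained maximum-likelihood estimation (MLE) method for efficiently and accurately estimating LLM failure rates. It combines a human-labeled calibration set, automatic LLM-judge annotations, and domain-specific constraints on judge performance, and it outperforms existing methods such as PPI.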
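The estimator summarized in this TL;DR can be illustrated with a minimal sketch. It is not the authors' formulation: it assumes a simple binomial model in which the judge flags a true failure with probability `tpr` and a non-failure with probability `fpr`, and it maximizes the joint likelihood by grid search inside the allowed bounds. The function names, the count arguments, and the synthetic numbers are all hypothetical.

```python
import math


def log_binom(k, n, p):
    """Binomial log-likelihood, dropping the constant that does not depend on p."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def constrained_mle(n_cal, fail_cal, flagged_fail, flagged_ok,
                    n_judge, flagged_judge, tpr_bounds, fpr_bounds, steps=40):
    """Grid-search MLE of (failure rate, judge TPR, judge FPR) under box constraints.

    Assumed model (illustrative only): the calibration set gives human labels
    plus judge flags; the large set gives judge flags only.
    """
    def grid(lo, hi):
        return [lo + (hi - lo) * i / steps for i in range(steps + 1)]

    best_ll, best_params = -math.inf, None
    for theta in grid(0.005, 0.995):
        for tpr in grid(*tpr_bounds):          # constraint: TPR stays inside known bounds
            for fpr in grid(*fpr_bounds):      # constraint: FPR stays inside known bounds
                p_flag = theta * tpr + (1.0 - theta) * fpr
                ll = (log_binom(fail_cal, n_cal, theta)
                      + log_binom(flagged_fail, fail_cal, tpr)
                      + log_binom(flagged_ok, n_cal - fail_cal, fpr)
                      + log_binom(flagged_judge, n_judge, p_flag))
                if ll > best_ll:
                    best_ll, best_params = ll, (theta, tpr, fpr)
    return best_params


# Synthetic counts, made up for illustration (roughly a 10% failure rate).
theta_hat, tpr_hat, fpr_hat = constrained_mle(
    n_cal=100, fail_cal=10, flagged_fail=9, flagged_ok=5,
    n_judge=10000, flagged_judge=1350,
    tpr_bounds=(0.85, 0.95), fpr_bounds=(0.03, 0.07),
)
print(theta_hat, tpr_hat, fpr_hat)
```

A grid search keeps the sketch dependency-free; a real implementation would more likely hand the same log-likelihood and bounds to a numerical optimizer.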
Details
Motivation: Existing evaluation of LLM failure rates forces a tradeoff between expensive human annotation and biased automatic annotation (such as LLM-as-a-Judge), so a more reliable and practical estimator is needed.
Method: A constrained maximum-likelihood estimation (MLE) framework that integrates three signals: a small, high-quality human-labeled calibration set, a large corpus of LLM-judge annotations, and domain-specific constraints derived from known bounds on judge performance statistics.
Result: Across experimental regimes that vary judge accuracy, calibration set size, and true failure rate, the method gives more accurate, lower-variance estimates than baselines such as Prediction-Powered Inference (PPI).
Conclusion: The method moves beyond "black-box" reliance on automated judges and offers a principled, interpretable, and scalable path to LLM failure-rate certification.
Abstract: The ability to rigorously estimate the failure rates of large language models (LLMs) is a prerequisite for their safe deployment. Currently, however, practitioners often face a tradeoff between expensive human gold standards and potentially severely biased automatic annotation schemes such as "LLM-as-a-Judge" labeling. In this paper, we propose a new, practical, and efficient approach to LLM failure rate estimation based on constrained maximum-likelihood estimation (MLE). Our method integrates three distinct signal sources: (i) a small, high-quality human-labeled calibration set, (ii) a large corpus of LLM-judge annotations, and, most importantly, (iii) additional side information via domain-specific constraints derived from known bounds on judge performance statistics. We validate our approach through a comprehensive empirical study, benchmarking it against state-of-the-art baselines like Prediction-Powered Inference (PPI). Across diverse experimental regimes -- spanning varying judge accuracies, calibration set sizes, and LLM failure rates -- our constrained MLE consistently delivers more accurate and lower-variance estimates than existing methods. By moving beyond the "black-box" use of automated judges to a flexible framework, we provide a principled, interpretable, and scalable pathway towards LLM failure-rate certification.
[3] SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression
Xinhao Huang,You-Liang Huang,Zeyi Wen
Main category: cs.CL
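TL;DR: This paper proposes SoLA, a training-free LLM compression method that combines soft activation sparsity with low-rank decomposition. It slims models without special hardware or post-training and, across several benchmarks, clearly improves language modeling and downstream task performance over prior compression methods.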
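To make the compression-rate arithmetic behind low-rank decomposition concrete, here is a minimal sketch. It assumes a single uniform rank for one weight matrix, whereas SoLA is described as allocating truncation positions adaptively and keeping a minority of components uncompressed. The helper name `rank_for_compression` and the example matrix shape are illustrative assumptions, not values from the paper.

```python
def rank_for_compression(rows, cols, compression_rate):
    """Largest rank r whose two factors (rows x r and r x cols) fit the budget.

    Hypothetical helper: assumes one uniform rank for the whole matrix.
    """
    budget = (1.0 - compression_rate) * rows * cols   # parameters we may keep
    return int(budget // (rows + cols))               # each unit of rank costs rows + cols


# Illustrative FFN-like shape (hidden size x intermediate size); an assumption.
rows, cols = 4096, 11008
rank = rank_for_compression(rows, cols, compression_rate=0.30)
kept_fraction = rank * (rows + cols) / (rows * cols)
print(rank, kept_fraction)
```

The point of the sketch is that a rank-r factorization only saves parameters when r is well below rows * cols / (rows + cols), which is why the choice of truncation position per matrix matters.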
Details
Motivation: Existing LLM compression methods rely on special hardware support or expensive post-training to maintain model quality, which makes it hard to be both efficient and cheap; this paper targets a training-free, easy-to-deploy alternative.
Method: Based on an analysis of activation patterns in the feed-forward network (FFN) of modern LLMs, SoLA uses soft activation sparsity to identify and retain the minority of components that contribute most to inference, and compresses the rest through low-rank decomposition. An adaptive component-wise allocation strategy assigns truncation positions to the different weight matrices.
Result: On LLaMA-2-7B/13B/70B and Mistral-7B, SoLA improves clearly over prior methods without post-training. For example, at a 30% compression rate on LLaMA-2-70B, it lowers perplexity from the state-of-the-art method's 6.95 to 4.44 and raises downstream task accuracy by 10%.
Conclusion: SoLA is an efficient, practical, training-free LLM compression method. Combining soft sparsity with adaptive low-rank decomposition substantially reduces parameter count while preserving model quality better than prior methods, which makes large models easier to deploy.
Abstract: Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but the billion-scale parameters pose deployment challenges. Although existing methods attempt to reduce the scale of LLMs, they require either special hardware support or expensive post-training to maintain model quality. To facilitate efficient and affordable model slimming, we propose a novel training-free compression method for LLMs, named "SoLA", which leverages Soft activation sparsity and Low-rAnk decomposition. SoLA can identify and retain a minority of components significantly contributing to inference, while compressing the majority through low-rank decomposition, based on our analysis of the activation pattern in the feed-forward network (FFN) of modern LLMs. To alleviate the decomposition loss, SoLA is equipped with an adaptive component-wise low-rank allocation strategy to assign appropriate truncation positions for different weight matrices. We conduct extensive experiments on LLaMA-2-7B/13B/70B and Mistral-7B models across a variety of benchmarks. SoLA exhibits remarkable improvement in both language modeling and downstream task accuracy without post-training. For example, with a 30% compression rate on the LLaMA-2-70B model, SoLA surpasses the state-of-the-art method by reducing perplexity from 6.95 to 4.44 and enhancing downstream task accuracy by 10%.
[4] Why Attend to Everything? Focus is the Key
Hengshuai Yao,Xing Chen,Ahmed Murtadha,Jin Li,Shuai Shao,Yasin Abbasi Yadkori,Guan Wang,Mingli Yuan,William Chen,Sen Song
Main category: cs.CL
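TL;DR: Focus is a new attention mechanism that groups tokens with learnable centroids and computes distant attention only within a group, while local attention stays at full resolution. The original model weights are fully frozen and only a small number of parameters (as few as 148K) are trained; the method improves domain perplexity without hurting downstream task performance, speeds up inference, and yields interpretable groups.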
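The grouping rule in this TL;DR can be sketched as a boolean attention mask. This is a simplified illustration, not the paper's implementation: it assumes hard nearest-centroid assignment by dot product (the abstract describes soft routing that is discretized by top-k group selection at inference), and the names `focus_mask` and `local_window` are hypothetical.

```python
def focus_mask(token_vecs, centroids, local_window):
    """Return (groups, allowed), where allowed[i][j] says token i may attend to token j."""
    def nearest(vec):
        scores = [sum(a * b for a, b in zip(vec, c)) for c in centroids]
        return scores.index(max(scores))

    groups = [nearest(v) for v in token_vecs]
    n = len(token_vecs)
    allowed = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):                    # causal: only current and past tokens
            is_local = i - j < local_window       # local pairs keep full resolution
            same_group = groups[i] == groups[j]   # distant pairs need a shared group
            allowed[i][j] = is_local or same_group
    return groups, allowed


# Toy 2-d token vectors and two centroids, made up for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
groups, allowed = focus_mask(tokens, centroids, local_window=2)
for row in allowed:
    print("".join("x" if ok else "." for ok in row))
```

In the printed mask, the last token attends to its neighbor (local) and to earlier tokens in its own group, but skips distant tokens in the other group.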
Details
Motivation: Existing efficient attention methods cannot deliver gains with zero degradation in the retrofit (plug-in) setting, and they lack interpretability and alignment preservation.
Method: Focus uses learnable centroids to assign tokens to groups; distant attention is restricted to same-group token pairs, while local attention stays at full resolution. All original model weights are frozen during training and only the centroid parameters are optimized. At inference, top-k group selection yields a hard sparsity pattern, and Sinkhorn normalization enforces balanced groups.
Result: Validated from 124M to 70B parameters across five attention architectures. At 124M, Focus reaches 30.3 PPL versus 31.4 for full attention; trained from scratch at 7B, it reaches 13.82 versus 13.89; inference is 2x to 8.6x faster; instruction-tuned models retain their TruthfulQA scores after adaptation, whereas LoRA degrades at every learning rate and rank; under the Sinkhorn constraint, the groups discover interpretable linguistic categories without supervision.
Conclusion: Focus is a lightweight, plug-in, interpretable, and alignment-preserving attention method that the authors report improves on existing approaches in quality and efficiency, which makes it well suited to efficient adaptation and inference for large models.
Abstract: We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks--from 124M to 70B parameters, across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches 8.6x wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment: instruction-tuned models retain TruthfulQA scores after adaptation, while LoRA degrades at every learning rate and rank. Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision.
[5] VIGIL: An Extensible System for Real-Time Detection and Mitigation of Cognitive Bias Triggers
Bo Kang,Sander Noels,Tijl De Bie
Main category: cs.CL
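TL;DR: This paper presents VIGIL, described as the first browser extension for real-time detection and mitigation of cognitive bias triggers in online information. It supports scroll-synced detection, LLM-powered reversible reformulation, and privacy-tiered inference.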
Details
Motivation: The rise of generative AI brings risks of information distortion and manipulation, notably subtle persuasion or manipulation that exploits human cognitive biases, yet no existing tool directly detects and mitigates triggers of such biases.
Method: Design and implement the VIGIL browser extension, which provides scroll-synced detection, LLM-powered reversible reformulation, privacy-tiered inference from fully offline to cloud, and extensibility through third-party plugins validated against NLP benchmarks.
Result: VIGIL is presented as the first open-source tool for real-time detection and mitigation of cognitive bias triggers; it is available on GitHub (https://github.com/aida-ugent/vigil) and ships with several plugins rigorously validated against NLP benchmarks.
Conclusion: VIGIL fills a gap in tools for detecting manipulation that targets cognitive biases, and provides a technical basis for improving digital literacy and supporting healthy online public discourse.
Abstract: The rise of generative AI is posing increasing risks to online information integrity and civic discourse. Most concretely, such risks can materialise in the form of mis- and disinformation. As a mitigation, media-literacy and transparency tools have been developed to address factuality of information and the reliability and ideological leaning of information sources. However, a subtler but possibly no less harmful threat to civic discourse is the use of persuasion or manipulation by exploiting human cognitive biases and related cognitive limitations. To the best of our knowledge, no tools exist to directly detect and mitigate the presence of triggers of such cognitive biases in online information. We present VIGIL (VIrtual GuardIan angeL), the first browser extension for real-time cognitive bias trigger detection and mitigation, providing in-situ scroll-synced detection, LLM-powered reformulation with full reversibility, and privacy-tiered inference from fully offline to cloud. VIGIL is built to be extensible with third-party plugins, and several plugins that are rigorously validated against NLP benchmarks are already included. It is open-sourced at https://github.com/aida-ugent/vigil.
[6] LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
Keqin Xie
Main category: cs.CL
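TL;DR: This paper proposes LPC-SM, a hybrid autoregressive architecture that separates local attention, persistent memory, predictive correction, and run-time control, and introduces Orthogonal Novelty Transport (ONT) to govern slow-memory writes. The experiments suggest that long-context modeling can be organized around a broader division of labor than attention alone.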
Details
Motivation: Current long-context language models rely heavily on attention for both local interaction and long-range state, leaving little room to explore other decompositions of sequence modeling.
Method: LPC-SM, a hybrid autoregressive architecture that separates local attention, persistent memory, predictive correction, and run-time control within the same block, with Orthogonal Novelty Transport (ONT) governing slow-memory writes.
Result: In a three-stage evaluation of a 158M-parameter model (base language modeling, mathematical continuation, and 4096-token continuation): removing mHC raises the Stage-A final LM loss from 12.630 to 15.127; adaptive sparse control improves the Stage-B final LM loss from 12.137 to 10.787; Stage C ends with final LM loss 11.582 at sequence length 4096 and improves the delayed-identifier diagnostic from 14.396 to 12.031.
Conclusion: Long-context autoregressive modeling can be organized around a broader division of labor than attention alone, which supports the modular, decoupled design.
Abstract: Most current long-context language models still rely on attention to handle both local interaction and long-range state, which leaves relatively little room to test alternative decompositions of sequence modeling. We propose LPC-SM, a hybrid autoregressive architecture that separates local attention, persistent memory, predictive correction, and run-time control within the same block, and we use Orthogonal Novelty Transport (ONT) to govern slow-memory writes. We evaluate a 158M-parameter model in three stages spanning base language modeling, mathematical continuation, and 4096-token continuation. Removing mHC raises the Stage-A final LM loss from 12.630 to 15.127, while adaptive sparse control improves the Stage-B final LM loss from 12.137 to 10.787 relative to a matched fixed-ratio continuation. The full route remains stable at sequence length 4096, where Stage C ends with final LM loss 11.582 and improves the delayed-identifier diagnostic from 14.396 to 12.031 in key cross-entropy. Taken together, these results show that long-context autoregressive modeling can be organized around a broader division of labor than attention alone.
[7] Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
Andrey Pustovit
Main category: cs.CL
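TL;DR: This paper proposes Knowledge Packs, which replace RAG with pre-computed KV caches to deliver the same knowledge at zero token cost without losing performance, and which also support training-free behavioral steering.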
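The causal-mask property that Knowledge Packs rely on can be checked on a toy model. The sketch below is an assumption-laden illustration, not the paper's code: it uses a tiny two-layer, single-head attention stack with random weights, no positional encoding, and no chat template, and every name in it is hypothetical. It compares the keys and values computed for a prefix alone against those computed for the same prefix inside a longer sequence.

```python
import math
import random


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def causal_layer(xs, w_q, w_k, w_v):
    """Toy single-head causal attention layer with a residual connection."""
    qs = [matvec(w_q, x) for x in xs]
    ks = [matvec(w_k, x) for x in xs]
    vs = [matvec(w_v, x) for x in xs]
    outs = []
    for i, q in enumerate(qs):
        # Causal mask: position i attends only to positions 0..i.
        scores = [sum(a * b for a, b in zip(q, ks[j])) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * vs[j][d] for j, w in enumerate(weights)) / total
                 for d in range(len(q))]
        outs.append([x + m for x, m in zip(xs[i], mixed)])
    return outs, ks, vs


def kv_cache(xs, layers):
    """Run all layers and collect each layer's (keys, values)."""
    cache = []
    for w_q, w_k, w_v in layers:
        xs, ks, vs = causal_layer(xs, w_q, w_k, w_v)
        cache.append((ks, vs))
    return cache


rng = random.Random(0)
dim = 4


def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]


layers = [(rand_matrix(), rand_matrix(), rand_matrix()) for _ in range(2)]
facts = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]     # stands in for F
question = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]  # stands in for q

cache_f = kv_cache(facts, layers)
cache_joint = kv_cache(facts + question, layers)
prefix_matches = all(
    ks_j[:len(facts)] == ks_f and vs_j[:len(facts)] == vs_f
    for (ks_f, vs_f), (ks_j, vs_j) in zip(cache_f, cache_joint)
)
print(prefix_matches)
```

Because no prefix position ever reads from a later position, appending the question cannot change the prefix's keys or values at any layer, so the cache for F can be computed once and reused.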
Details
Motivation: RAG wastes tokens, and earlier claims that KV caches outperform RAG may be an artifact of formatting errors; the true potential and limits of KV cache injection need clarifying.
Method: Exploit the equivalence of KV caches in causal transformers (a forward pass on F and a joint pass on F+q produce the same KV for F) to build pre-computed Knowledge Packs. Behavioral steering uses contrastive value deltas: because RoPE rotates keys but leaves values untouched, editing values avoids the loss of coherence that key arithmetic causes.
Result: Zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, with up to 95% token savings. Steering takes effect in mid-layer values (33-66%), independent directions are nearly orthogonal and compose, and knowledge injection and steering can run simultaneously without interference at alpha<=0.7.
Conclusion: Knowledge Packs are an efficient way to inject knowledge and steer model behavior with no training and no weight modification, combining high token efficiency with controllability.
Abstract: RAG wastes tokens. We propose Knowledge Packs: pre-computed KV caches that deliver the same knowledge at zero token cost. For causal transformers, the KV cache from a forward pass on text F is identical to what a joint pass on F+q would produce - this follows directly from the causal mask. The equivalence is exact but fragile: wrong chat template formatting causes 6-7pp degradation, which we believe explains prior claims of KV outperforming RAG. With correct formatting: zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, up to 95% token savings. The KV interface also enables behavioral steering that RAG cannot do. Because RoPE rotates keys but leaves values untouched, contrastive deltas on cached values can nudge model behavior while key arithmetic destroys coherence. The effect sits in mid-layer values (33-66%), independent directions are nearly orthogonal (cos~0) and compose, and both channels - knowledge and steering - run simultaneously at alpha<=0.7 without interference. No training, no weight modification.
[8] CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge
Mete Ismayilzada,Renqing Cuomao,Daniil Yurshevich,Anna Sotnikova,Lonneke van der Plas,Antoine Bosselut
Main category: cs.CL
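TL;DR: This paper introduces CresOWLve, a benchmark for evaluating LLM creative problem-solving grounded in real-world knowledge, and finds that existing models perform substantially worse on creative questions than on factual ones.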
Details
Motivation: Existing LLM benchmarks mostly evaluate a single cognitive ability or rely on artificially constructed brainteasers, and so fail to reflect creative problem-solving as it occurs in the real world.
Method: Build CresOWLve, a benchmark of creative puzzles grounded in real-world knowledge that requires models to employ multiple creative thinking strategies, retrieve facts from diverse domains, and combine them creatively.
Result: Frontier non-thinking and thinking LLMs both find CresOWLve highly challenging; performance on creative questions is up to 17% lower than on factual ones. Models can often retrieve the relevant knowledge but struggle to form the non-obvious creative connections.
Conclusion: Current LLMs remain limited at creative problem-solving, particularly at integrating diverse knowledge and forming non-obvious connections.
Abstract: Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier non-thinking and thinking LLMs, we show that CresOWLve remains highly challenging. Our analysis reveals a consistent performance gap: models perform substantially better on factual questions than on creative ones (up to a -17% drop). While models can often retrieve the relevant knowledge, they struggle to form the non-obvious creative connections required to integrate this information and arrive at the correct answer.
[9] Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation
Haziq Mohammad Khalid,Salsabeel Shapsough,Imran Zualkernan
Main category: cs.CL
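TL;DR: This paper studies how to generate diverse, pedagogically valid stories for Arabic early-grade reading assessments. It proposes noise steering, a training-free technique that injects Gaussian noise into a transformer's internal representations at inference time to increase diversity, and finds that residual stream noise and attention entropy noise injection work best.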
Details
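Motivation: Arabic early-grade reading assessments need stories that satisfy constraints on vocabulary, reading level, and narrative structure while avoiding repetitive plots that undermine assessment validity, but existing methods such as high-temperature sampling struggle to deliver both diversity and quality.
Method: Noise steering: inject calibrated Gaussian noise into the internal representations of five small Arabic-centric transformer models (7-9B parameters) at inference time. Four injection strategies (including residual stream noise and attention entropy noise injection) are compared against high-temperature sampling baselines on diversity, quality, constraint adherence, and reading grade level.
Result: Residual stream noise consistently improves narrative diversity with minimal cost to quality or constraint adherence and preserves early-grade reading level. Attention entropy noise injection (AENI) stabilizes the otherwise unreliable attention-logit noise while recovering quality. High-temperature sampling inflates reading grade level and causes catastrophic collapse on several models.
Conclusion: For tightly constrained settings such as educational content generation, perturbing internal representations is a more suitable diversity strategy than adding randomness at the output level.
Abstract: Generating diverse, pedagogically valid stories for Arabic early-grade reading assessments requires balancing tight constraints on vocabulary, reading level, and narrative structure against the need to avoid repetitive plots that undermine assessment validity. We investigate noise steering, injecting calibrated Gaussian perturbations into the internal representations of transformer models at inference time, as a training-free diversity method evaluated across five small Arabic-centric language models (7-9B parameters). We compare four injection strategies against high-temperature sampling baselines, measuring diversity, quality, constraint adherence, and reading grade level. Residual stream noise consistently improves narrative diversity with minimal quality or constraint cost and preserves early-grade reading level across all models. Attention entropy noise injection (AENI) stabilizes the otherwise unreliable attention-logit noise while recovering quality. High-temperature sampling inflates reading grade level and causes catastrophic collapse on several models. We find internal representation-level perturbation to be a more suitable diversity strategy than output-level stochasticity for constrained educational content generation.
[10] Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation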
Leen AlQadi,Ahmed Alzubaidi,Mohammed Alyafeai,Hamza Alobeidli,Maitha Alhammadi,Shaikha Alsuwaidi,Omar Alkaabi,Basma El Amel Boussaha,Hakim Hacid
Main category: cs.CL
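TL;DR: QIMMA is a quality-assured Arabic LLM leaderboard built around systematic benchmark validation. A multi-model assessment pipeline (automated LLM judgment plus human review) surfaces and resolves systematic quality issues in existing Arabic benchmarks, producing a multi-domain, multi-task evaluation suite of over 52,000 samples grounded mainly in native Arabic content. It is implemented with LightEval and EvalPlus for reproducibility and community extension.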
Details
Motivation: Address systematic quality issues in existing Arabic benchmarks and make the evaluation of Arabic LLMs more reliable and valid.
Method: Build a multi-model assessment pipeline that combines automated LLM judgment with human review to screen and correct existing Arabic benchmarks; assemble a multi-domain, multi-task evaluation suite from the validated data; implement transparent, reproducible evaluation with LightEval and EvalPlus.
Result: QIMMA, a curated Arabic evaluation suite of over 52k samples covering multiple domains and tasks (grounded in native Arabic content except for code tasks), released together with per-sample inference outputs and the evaluation implementation.
Conclusion: QIMMA provides a quality-first, reproducible, and community-extensible foundation for Arabic NLP evaluation, supporting more trustworthy model assessment.
Abstract: We present QIMMA, a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. Rather than aggregating existing resources as-is, QIMMA applies a multi-model assessment pipeline combining automated LLM judgment with human review to surface and resolve systematic quality issues in well-established Arabic benchmarks before evaluation. The result is a curated, multi-domain, multi-task evaluation suite of over 52k samples, grounded predominantly in native Arabic content; code evaluation tasks are the sole exception, as they are inherently language-agnostic. Transparent implementation via LightEval, EvalPlus and public release of per-sample inference outputs make QIMMA a reproducible and community-extensible foundation for Arabic NLP evaluation.
[11] Towards a theory of morphology-driven marking in the lexicon: The case of the state
Mohamed El Idrissi
Main category: cs.CL
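TL;DR: This paper proposes a formal model called "morphology-driven marking", which organises nouns into modular cognitive sets, each with its own morphological template and unmarked form, to explain differences in marking among noun types within and across languages, and reassesses the notions of markedness and state.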
Details
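Motivation: The noun category is realised very differently across languages, with semantic and/or morphosyntactic differences that are more or less pronounced; these variations call for systematic exploration.
Method: Taking Riffian as a reference point, build a formal model of "morphology-driven marking" in which nouns are organised into modular cognitive sets, each with its own morphological template and unmarked form, and situate these patterns within syntactic functions.
Result: The model helps explain differences in marking among noun types within and across languages, and supports extending the concept of state to all synthetic languages as a novel subcategory of syntax-based inflection, alongside agreement and grammatical case.
Conclusion: The diversity of noun categories can be explained in a unified way through modular cognitive sets and morphological templates; markedness and state should be reconceptualised, with state treated as syntax-based inflection in synthetic languages.
Abstract: All languages have a noun category, but its realisation varies considerably. Depending on the language, semantic and/or morphosyntactic differences may be more or less pronounced. This paper explores these variations, using Riffian as a reference point before extending the analysis to other languages. We propose a formal model termed morphology-driven marking. Nouns are organised into modular cognitive sets, each with its own morphological template and unmarked form. This approach helps explain differences in marking among noun types within and across languages. By situating these patterns within syntactic functions, we also reassess the notions of markedness and state. It is proposed that the concept of state be extended to all synthetic languages and analysed as a novel subcategory of syntax-based inflection like agreement and grammatical case.
[12] The Tool Illusion: Rethinking Tool Use in Web Agents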
Renze Lou,Baolin Peng,Wenlin Yao,Qianhui Wu,Hao Cheng,Suman Nath,Wenpeng Yin,Jianfeng Gao
Main category: cs.CL
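TL;DR: This paper revisits tool use in web agents through a large-scale, controlled study, aiming to clarify whether tools provide consistent gains, what design principles characterize effective tools, and what side effects tool use may introduce.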
Details
Motivation: Prior conclusions about tool use are often drawn from limited experimental scales and non-comparable settings, so basic questions remain open: whether tools provide consistent gains, what design principles characterize effective tools, and what side effects they introduce.
Method: An extensive and carefully controlled study across diverse tool sources, backbone models, tool-use frameworks, and evaluation benchmarks.
Result: The findings revise some prior conclusions and complement others with broader evidence.
Conclusion: The study offers a more reliable empirical basis for future research on tool-use web agents and aims to encourage further work in the area.
Abstract: As web agents rapidly evolve, an increasing body of work has moved beyond conventional atomic browser interactions and explored tool use as a higher-level action paradigm. Although prior studies have shown the promise of tools, their conclusions are often drawn from limited experimental scales and sometimes non-comparable settings. As a result, several fundamental questions remain unclear: i) whether tools provide consistent gains for web agents, ii) what practical design principles characterize effective tools, and iii) what side effects tool use may introduce. To establish a stronger empirical foundation for future research, we revisit tool use in web agents through an extensive and carefully controlled study across diverse tool sources, backbone models, tool-use frameworks, and evaluation benchmarks. Our findings both revise some prior conclusions and complement others with broader evidence. We hope this study provides a more reliable empirical basis and inspires future research on tool-use web agents.
[13] Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
Jacob Dineen,Aswin RRV,Zhikun Xu,Ben Zhou
Main category: cs.CL
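TL;DR: This paper proposes vocabulary dropout, which applies a hard, non-stationary random mask to the proposer's output logits to counter the proposer's diversity collapse in co-evolutionary self-play, and thereby improves the solver's performance on mathematical reasoning tasks.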
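The masking step named in this TL;DR can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the drop probability, the rule that at least one token always survives, and resampling the mask at every call are all choices made here for the example, and the function names are hypothetical.

```python
import math
import random


def vocabulary_dropout(logits, drop_prob, rng):
    """Hard-mask a random subset of the vocabulary by setting its logits to -inf."""
    keep = [rng.random() >= drop_prob for _ in logits]
    if not any(keep):
        # Assumption for this sketch: never mask every token; keep the top one.
        keep[max(range(len(logits)), key=logits.__getitem__)] = True
    return [x if k else float("-inf") for x, k in zip(logits, keep)]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # exp(-inf) is 0.0, so masked tokens vanish
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, 0.0, -1.0]
for step in range(3):
    # A fresh mask on every call makes the mask non-stationary.
    masked = vocabulary_dropout(logits, drop_prob=0.4, rng=rng)
    probs = softmax(masked)
    print(step, [round(p, 3) for p in probs])
```

Because masked tokens get exactly zero probability, the proposer cannot emit them at that step, which is what prevents it from locking into one fixed token sequence.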
Details
Motivation: In co-evolutionary self-play the proposer tends to converge to a narrow distribution of problems; this diversity collapse makes the curriculum uninformative for the solver and stalls the co-evolutionary loop.
Method: Vocabulary dropout: during both policy training and curriculum generation, apply a hard, non-stationary random mask to the proposer's output logits so that it cannot lock into fixed token sequences.
Result: Training Qwen3-4B and Qwen3-8B with the R-Zero framework, vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics and improves the solver by an average of +4.4 points at 8B, with the largest gains on competition-level benchmarks.
Conclusion: Explicit action-space constraints such as vocabulary dropout, analogous to game rules in classical self-play, can help sustain productive co-evolution in language models; vocabulary dropout is one simple instantiation of this principle.
Abstract: Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
[14] Evolutionary Search for Automated Design of Uncertainty Quantification Methods
Mikhail Seleznyov,Daniil Korbut,Viktor Moskvoretskii,Oleg Somov,Alexander Panchenko,Elena Tutubalina
Main category: cs.CL
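TL;DR: This paper uses LLM-driven evolutionary search to automatically discover unsupervised uncertainty quantification (UQ) methods, represented as Python programs. The evolved methods outperform manually designed baselines on atomic claim verification, and the study reveals that different LLMs follow different evolutionary strategies.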
Details
Motivation: Uncertainty quantification methods for LLMs are mostly designed by hand, which limits their scalability and generality.
Method: Apply LLM-powered evolutionary search to automatically generate unsupervised UQ methods represented as Python programs.
Result: On atomic claim verification across 9 datasets, the evolved methods achieve up to 6.7% relative ROC-AUC improvement. Different LLMs show different evolutionary strategies (Claude models favor high-feature-count linear estimators, Gpt-oss-120B favors positional weighting schemes). Sonnet 4.5 and Opus 4.5 reliably turn added complexity into better performance, while Opus 4.6 shows an unexpected regression.
Conclusion: LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design.
Abstract: Uncertainty quantification (UQ) methods for large language models are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. On the task of atomic claim verification, our evolved methods outperform strong manually-designed baselines, achieving up to 6.7% relative ROC-AUC improvement across 9 datasets while generalizing robustly out-of-distribution. Qualitative analysis reveals that different LLMs employ qualitatively distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler and more interpretable positional weighting schemes. Surprisingly, only Sonnet 4.5 and Opus 4.5 reliably leverage increased method complexity to improve performance -- Opus 4.6 shows an unexpected regression relative to its predecessor. Overall, our results indicate that LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design.
[15] Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations
Erin MacMurray van Liemt,Aida Davani,Sinchana Kumbale,Neha Dixit,Sunipa Dev
Main category: cs.CL
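TL;DR: This paper proposes a human-centered framework that compares survey-derived Cultural Importance Vectors from multiple countries with Cultural Representation Vectors derived from LLM outputs, to measure how well LLM generations align with what local populations perceive and prioritize. It finds a Western-centric tendency in some models and systematic biases across all of them.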
Details
Motivation: Prior work mostly uses cultural diversity and factual accuracy as proxies for cultural representation, and overlooks a key question: whether LLM-generated content reflects how native populations themselves perceive and prioritize their own cultural facets (cultural alignment).
Method: First, induce culturally significant facets from open-ended survey responses across nine countries and build human-derived Cultural Importance Vectors as the baseline. Next, use a syntactically diversified prompt set to compute Cultural Representation Vectors for three frontier LLMs (Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Haiku). Finally, quantify the alignment between the two and analyze the error patterns.
Result: Some models show a Western-centric calibration, with alignment decreasing as a country's cultural distance from the US increases. All models share highly correlated ($\rho > 0.97$) systemic error signatures: they over-index on some cultural markers while neglecting users' deep-seated social and value-based priorities.
Conclusion: The framework moves beyond simple diversity metrics toward evaluating how faithfully AI-generated content captures the nuanced hierarchies of global cultures, giving a methodological and empirical basis for more culturally sensitive LLMs.
Abstract: Cultural representation in Large Language Model (LLM) outputs has primarily been evaluated through the proxies of cultural diversity and factual accuracy. However, a crucial gap remains in assessing cultural alignment: the degree to which generated content mirrors how native populations perceive and prioritize their own cultural facets. In this paper, we introduce a human-centered framework to evaluate the alignment of LLM generations with local expectations. First, we establish a human-derived ground-truth baseline of importance vectors, called Cultural Importance Vectors, based on an induced set of culturally significant facets from open-ended survey responses collected across nine countries. Next, we introduce a method to compute model-derived Cultural Representation Vectors of an LLM based on a syntactically diversified prompt-set and apply it to three frontier LLMs (Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Haiku). Our investigation of the alignment between the human-derived Cultural Importance and model-derived Cultural Representations reveals a Western-centric calibration for some of the models where alignment decreases as a country's cultural distance from the US increases. Furthermore, we identify highly correlated, systemic error signatures ($\rho > 0.97$) across all models, which over-index on some cultural markers while neglecting the deep-seated social and value-based priorities of users. Our approach moves beyond simple diversity metrics toward evaluating the fidelity of AI-generated content in authentically capturing the nuanced hierarchies of global cultures.
[16] LangFIR: Discovering Sparse Language-Specific Features from Monolingual Data for Language Steering
Sing Hieng Wong,Hassan Sajjad,A. B. Siddique
Main category: cs.CL
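TL;DR: This paper proposes LangFIR, which uses sparse autoencoders (SAEs) and random-token sequences to identify language-specific features from monolingual data alone, enabling more precise control of the output language of multilingual LLMs.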
Details
Motivation: Existing representation-level steering methods for language control depend on multilingual or parallel data, which can be expensive to obtain, to identify language directions; this limits their scalability and practicality.
Method: LangFIR extracts features from residual activations with sparse autoencoders and compares activation patterns on target-language inputs with those on random-token sequences. Features that random tokens also activate are treated as language-agnostic and filtered out, leaving a sparse, highly selective set of language-specific SAE features that are used to construct steering vectors.
Result: On Gemma 3 1B, Gemma 3 4B, and Llama 3.1 8B, across three datasets and twelve target languages, LangFIR achieves the best average accuracy BLEU, outperforming the strongest monolingual baseline and surpassing methods that rely on parallel data. The discovered features are extremely sparse, highly selective, and causally important: directional ablation increases cross-entropy loss only for the corresponding language.
Conclusion: Language identity in multilingual LLMs is localized in a sparse set of feature directions that can be discovered with monolingual data, which offers a low-cost route to precise language control.
Abstract: Large language models (LLMs) show strong multilingual capabilities, yet reliably controlling the language of their outputs remains difficult. Representation-level steering addresses this by adding language-specific vectors to model activations at inference time, but identifying language-specific directions in the residual stream often relies on multilingual or parallel data that can be expensive to obtain. Sparse autoencoders (SAEs) decompose residual activations into interpretable, sparse feature directions and offer a natural basis for this search, yet existing SAE-based approaches face the same data constraint. We introduce LangFIR (Language Feature Identification via Random-token Filtering), a method that discovers language-specific SAE features using only a small amount of monolingual data and random-token sequences. Many SAE features consistently activated by target-language inputs do not encode language identity. Random-token sequences surface these language-agnostic features, allowing LangFIR to filter them out and isolate a sparse set of language-specific features. We show that these features are extremely sparse, highly selective for their target language, and causally important: directional ablation increases cross-entropy loss only for the corresponding language. Using these features to construct steering vectors for multilingual generation control, LangFIR achieves the best average accuracy BLEU across three models (Gemma 3 1B, Gemma 3 4B, and Llama 3.1 8B), three datasets, and twelve target languages, outperforming the strongest monolingual baseline by up to and surpassing methods that rely on parallel data. Our results suggest that language identity in multilingual LLMs is localized in a sparse set of feature directions discoverable with monolingual data. Code is available at https://anonymous.4open.science/r/LangFIR-C0F5/.
[17] Rethinking Token Prediction: Tree-Structured Diffusion Language Model
Zihao Wu,Haoming Yang,Juncheng Dong,Vahid Tarokh
Main category: cs.CL
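TL;DR: This paper proposes a tree-structured discrete diffusion language model that exploits the inherent structure of the vocabulary, replacing full-vocabulary prediction with hierarchical prediction over a tree. This substantially reduces parameter and GPU memory cost while matching perplexity.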
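To see why a tree shrinks the classification dimensionality, consider the following minimal sketch. It assumes a balanced tree in which a token's path is simply its ID written in base `branching`; the paper's tree is pre-constructed from structure among tokens, so this encoding, the function names, and the example vocabulary size are illustrative assumptions only.

```python
def tree_depth(vocab_size, branching):
    """Smallest depth at which a balanced tree has at least vocab_size leaves."""
    depth, capacity = 1, branching
    while capacity < vocab_size:
        capacity *= branching
        depth += 1
    return depth


def token_to_path(token_id, branching, depth):
    """Root-to-leaf child indices for a token (its ID in base `branching`)."""
    path = []
    for _ in range(depth):
        path.append(token_id % branching)
        token_id //= branching
    return path[::-1]


def path_to_token(path, branching):
    token_id = 0
    for child in path:
        token_id = token_id * branching + child
    return token_id


vocab_size, branching = 50257, 16     # illustrative sizes, not from the paper
depth = tree_depth(vocab_size, branching)
path = token_to_path(12345, branching, depth)
print(depth, path, path_to_token(path, branching))
# A flat head classifies over vocab_size options at once;
# the tree makes `depth` decisions over only `branching` options each.
```

With these numbers, one 50257-way decision becomes a handful of 16-way decisions, which is the kind of reduction that lets the prediction head become negligible in size.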
Details
Motivation: Existing discrete diffusion language models depend on a full-vocabulary prediction layer, which consumes a large share of parameters and GPU memory and makes training inefficient under constrained resources.
Method: Build a pre-constructed vocabulary tree and model the diffusion process with intermediate latent states that correspond to a token's ancestor nodes in the tree, which reduces the classification dimensionality exponentially.
Result: Under the same parameter budget, peak GPU memory usage is halved while perplexity matches state-of-the-art discrete diffusion language models.
Conclusion: Explicit full-vocabulary prediction is not necessary; a tree-structured diffusion model that exploits vocabulary structure allocates parameters and memory more efficiently.
Abstract: Discrete diffusion language models have emerged as a competitive alternative to auto-regressive language models, but training them efficiently under limited parameter and memory budgets remains challenging. Modern architectures are predominantly based on a full-vocabulary token prediction layer, which accounts for a substantial fraction of model parameters (e.g., more than 20% in small scale DiT-style designs) and often dominates peak GPU memory usage. This leads to inefficient use of both parameters and memory under constrained training resources. To address this issue, we revisit the necessity of explicit full-vocabulary prediction, and instead exploit the inherent structure among tokens to build a tree-structured diffusion language model. Specifically, we model the diffusion process with intermediate latent states corresponding to a token's ancestor nodes in a pre-constructed vocabulary tree. This tree-structured factorization exponentially reduces the classification dimensionality, makes the prediction head negligible in size, and enables reallocation of parameters to deepen the attention blocks. Empirically, under the same parameter budget, our method reduces peak GPU memory usage by half while matching the perplexity performance of state-of-the-art discrete diffusion language models.
[18] Text Summarization With Graph Attention Networks
Mohammadreza Ardestani,Yllias Chali
Main category: cs.CL
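TL;DR: This paper explores using Rhetorical Structure Theory (RST) and co-reference (Coref) graphs to improve summarization models. A Graph Attention Network (GAT) did not help, while a simple multi-layer perceptron (MLP) improved results on CNN/DM; the authors also annotate the XSum dataset with RST graphs as a benchmark for graph-based summarization.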
Details
Motivation: Use graph-structured information such as RST and co-reference graphs to improve summarization model performance. Method: Two architectures, a Graph Attention Network (GAT) and a multi-layer perceptron (MLP), are used to incorporate RST and co-reference graph information, with experiments on CNN/DM and XSum; the authors also annotate XSum with RST graphs. Result: The GAT did not improve performance, while the MLP improved summarization results on CNN/DM; the RST-annotated XSum dataset exposed both the merits and the limitations of the models. Conclusion: Graph-structured information has potential for summarization but needs better-suited modeling; a simple MLP can outperform a more complex GAT; the XSum RST annotations provide a benchmark for future work. Abstract: This study aimed to leverage graph information, particularly Rhetorical Structure Theory (RST) and Co-reference (Coref) graphs, to enhance the performance of our baseline summarization models. Specifically, we experimented with a Graph Attention Network architecture to incorporate graph information. However, this architecture did not enhance the performance. Subsequently, we used a simple Multi-layer Perceptron architecture, which improved the results in our proposed model on our primary dataset, CNN/DM. Additionally, we annotated the XSum dataset with RST graph information, establishing a benchmark for future graph-based summarization models. This secondary dataset posed multiple challenges, revealing both the merits and limitations of our models.
[19] MultiPress: A Multi-Agent Framework for Interpretable Multimodal News Classification
Tailong Luo,Hao Li,Rong Fu,Xinyue Jiang,Huaxuan Ding,Yiduo Zhang,Zilin Zhao,Simon Fong,Guangyin Jin,Jianyuan Ni
Main category: cs.CL
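TL;DR: This paper proposes MultiPress, a three-stage multi-agent framework for multimodal news classification that improves accuracy and interpretability through multimodal perception, retrieval-augmented reasoning, and gated fusion scoring.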
Details
Motivation: Existing methods for multimodal news content often process modalities independently or use simplistic fusion strategies, so they struggle to capture complex cross-modal interactions and to leverage external knowledge. Method: The MultiPress framework comprises three specialized agents for multimodal perception, retrieval-augmented reasoning, and gated fusion scoring, followed by a reward-driven iterative optimization mechanism. Result: On a newly constructed large-scale multimodal news dataset, MultiPress significantly outperforms strong baselines. Conclusion: Modular multi-agent collaboration and retrieval-augmented reasoning effectively improve the accuracy and interpretability of multimodal news classification. Abstract: With the growing prevalence of multimodal news content, effective news topic classification demands models capable of jointly understanding and reasoning over heterogeneous data such as text and images. Existing methods often process modalities independently or employ simplistic fusion strategies, limiting their ability to capture complex cross-modal interactions and leverage external knowledge. To overcome these limitations, we propose MultiPress, a novel three-stage multi-agent framework for multimodal news classification. MultiPress integrates specialized agents for multimodal perception, retrieval-augmented reasoning, and gated fusion scoring, followed by a reward-driven iterative optimization mechanism. We validate MultiPress on a newly constructed large-scale multimodal news dataset, demonstrating significant improvements over strong baselines and highlighting the effectiveness of modular multi-agent collaboration and retrieval-augmented reasoning in enhancing classification accuracy and interpretability.
[20] Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation
Kening Zheng,Wei-Chieh Huang,Jiahao Huo,Zhonghao Li,Henry Peng Zou,Yibo Yan,Xin Zou,Jungang Li,Junzhuo Li,Hanrong Zhang,Xuming Hu,Philip S. Yu
Main category: cs.CL
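TL;DR: This paper identifies language routing isolation in MoE models, where high- and low-resource languages tend to activate different expert sets, and builds on it to propose RISE, a framework that selects language-specific and universal expert subnetworks layer by layer, improving low-resource language performance while preserving capabilities in other languages.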
Details
Motivation: Mixture-of-Experts (MoE) models show marked performance differences across languages, but the underlying mechanisms remain unclear. Method: Systematically analyze expert routing patterns in MoE models and introduce the concept of Language Routing Isolation; use layer-stratified analysis to reveal a convergence-divergence pattern in routing across depth; design the RISE framework, whose tripartite selection strategy uses specificity scores to pick language-specific experts in shallow and deep layers and overlap scores to pick universal experts in middle layers, training only the selected subnetwork. Result: Experiments on 10 languages show that RISE improves target-language F1 by up to 10.85% with minimal cross-lingual degradation. Conclusion: Language routing isolation is a key mechanism behind cross-lingual performance gaps in MoE models, and RISE exploits it to strengthen low-resource languages without harming other languages. Abstract: Mixture-of-Experts (MoE) models exhibit striking performance disparities across languages, yet the internal mechanisms driving these gaps remain poorly understood. In this work, we conduct a systematic analysis of expert routing patterns in MoE models, revealing a phenomenon we term Language Routing Isolation, in which high- and low-resource languages tend to activate largely disjoint expert sets. Through layer-stratified analysis, we further show that routing patterns exhibit a layer-wise convergence-divergence pattern across model depth. Building on these findings, we propose RISE (Routing Isolation-guided Subnetwork Enhancement), a framework that exploits routing isolation to identify and adapt language-specific expert subnetworks. RISE applies a tripartite selection strategy, using specificity scores to identify language-specific experts in shallow and deep layers and overlap scores to select universal experts in middle layers. By training only the selected subnetwork while freezing all other parameters, RISE substantially improves low-resource language performance while preserving capabilities in other languages. Experiments on 10 languages demonstrate that RISE achieves target-language F1 gains of up to 10.85% with minimal cross-lingual degradation.
[21] The Format Tax
Ivan Yee Lee,Loris D'Antoni,Taylor Berg-Kirkpatrick
Main category: cs.CL
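TL;DR: This paper finds that requiring large language models to respond in structured formats (such as JSON) substantially degrades their reasoning and writing performance. This "format tax" stems mainly from format instructions in the prompt rather than from decoding constraints, and decoupling reasoning from formatting (for example, generating freeform first and reformatting afterwards) recovers most of the loss.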
Details
Motivation: Address the substantial performance drop that large language models show when asked to respond in structured formats (JSON, XML, and so on), identify its root cause, and propose effective mitigations. Method: Run systematic experiments on how format requirements affect open-weight and API models, diagnose where the degradation comes from, and propose decoupling reasoning from formatting (via two-pass generation or extended thinking within a single generation). Result: Across six open-weight models, four API models, four formats, and multiple task types, decoupling recovers most of the accuracy lost to format requirements; most recent closed-weight models show little to no format tax, indicating a capability gap in current open-weight models. Conclusion: The degradation under structured output comes mainly from format instructions in the prompt rather than from the decoding mechanism; decoupling reasoning from formatting is a simple and effective fix, and it highlights room for improvement in the instruction following and format robustness of open-weight models. Abstract: Asking a large language model to respond in JSON should be a formatting choice, not a capability tax. Yet we find that structured output requirements -- JSON, XML, LaTeX, Markdown -- substantially degrade reasoning and writing performance across open-weight models. The research response has focused on constrained decoding, but sampling bias accounts for only a fraction of the degradation. The dominant cost enters at the prompt: format-requesting instructions alone cause most of the accuracy loss, before any decoder constraint is applied. This diagnosis points to a simple principle: decouple reasoning from formatting. Whether by generating freeform first and reformatting in a second pass, or by enabling extended thinking within a single generation, separating the two concerns substantially recovers lost accuracy. Across six open-weight models, four API models, four formats, and tasks spanning math, science, logic, and writing, decoupling recovers most lost accuracy. Notably, most recent closed-weight models show little to no format tax, suggesting the problem is not inherent to structured generation but a gap that current open-weight models have yet to close. Code is available at https://github.com/ivnle/the-format-tax.
[22] CAGMamba: Context-Aware Gated Cross-Modal Mamba Network for Multimodal Sentiment Analysis
Minghai Jiao,Jing Xiao,Peng Xiao,Ende Zhang,Shuang Kan,Wenyan Jiang,Jinyao Li,Yixian Liu,Haidong Xin
Main category: cs.CL
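TL;DR: This paper proposes CAGMamba, a context-aware gated cross-modal model built on the Mamba architecture for dialogue-based multimodal sentiment analysis, which uses explicit temporal modeling and learnable gating to improve both fusion quality and efficiency.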
Details
Motivation: Existing Transformer-based cross-modal fusion methods have quadratic complexity in sequence length and lack explicit temporal modeling of how sentiment evolves over a dialogue; contextual information is often simply concatenated, so cross-turn dependencies are not modeled effectively. Method: Organize the contextual and current-utterance features into a temporally ordered binary sequence as input to Mamba; design a Gated Cross-Modal Mamba Network (GCMN) that combines cross-modal and unimodal paths through learnable gating, trained with a three-branch multi-task loss over text, audio, and fused predictions. Result: The model achieves state-of-the-art or competitive results on three benchmark datasets across multiple evaluation metrics. Conclusion: By replacing Transformer attention with Mamba and combining gated fusion with explicit temporal structure, CAGMamba addresses the high computational cost and weak temporal modeling of existing MSA methods and offers an efficient approach to dialogue sentiment analysis. Abstract: Multimodal Sentiment Analysis (MSA) requires effective modeling of cross-modal interactions and contextual dependencies while remaining computationally efficient. Existing fusion approaches predominantly rely on Transformer-based cross-modal attention, which incurs quadratic complexity with respect to sequence length and limits scalability. Moreover, contextual information from preceding utterances is often incorporated through concatenation or independent fusion, without explicit temporal modeling that captures sentiment evolution across dialogue turns. To address these limitations, we propose CAGMamba, a context-aware gated cross-modal Mamba framework for dialogue-based sentiment analysis. Specifically, we organize the contextual and the current-utterance features into a temporally ordered binary sequence, which provides Mamba with explicit temporal structure for modeling sentiment evolution. To further enable controllable cross-modal integration, we propose a Gated Cross-Modal Mamba Network (GCMN) that integrates cross-modal and unimodal paths via learnable gating to balance information fusion and modality preservation, and is trained with a three-branch multi-task objective over text, audio, and fused predictions. Experiments on three benchmark datasets demonstrate that CAGMamba achieves state-of-the-art or competitive results across multiple evaluation metrics. All code is available at https://github.com/User2024-xj/CAGMamba.
[23] Document-Level Numerical Reasoning across Single and Multiple Tables in Financial Reports
Yi-Cheng Wang,Wei-An Wang,Chu-Song Chen
Main category: cs.CL
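TL;DR: This paper introduces FinLongDocQA, a dataset for evaluating single-table and cross-table numerical reasoning by large language models over long financial documents, and proposes FinLongDocAgent, a multi-agent, multi-round RAG method that mitigates context rot and multi-step numerical reasoning errors.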
Details
Motivation: Existing benchmarks focus mainly on single-table settings and do not evaluate cross-table, document-level numerical reasoning in long structured documents such as financial annual reports; LLMs processing very long financial reports face two bottlenecks, context rot and errors in multi-step numerical calculation. Method: Build the FinLongDocQA dataset and propose FinLongDocAgent, a multi-agent, multi-round retrieval-augmented generation (RAG) framework that iteratively retrieves evidence, performs intermediate calculations, and verifies results. Result: Experiments highlight the importance of iterative retrieval and verification for reliable numerical question answering over long financial documents. Conclusion: Numerical reasoning over long documents requires improving both evidence location and numerical calculation; multi-round iterative RAG is an effective approach to reliable QA in the financial domain. Abstract: Despite the strong language understanding abilities of large language models (LLMs), they still struggle with reliable question answering (QA) over long, structured documents, particularly for numerical reasoning. Financial annual reports exemplify this difficulty: financial statement analysis often hinges on accurate arithmetic, and analysts derive key indicators by integrating evidence scattered across multiple tables and narrative text. However, existing benchmarks focus largely on single-table settings, leaving cross-table document-level numerical reasoning underexplored. To address this gap, we introduce FinLongDocQA, a dataset for both single-table and cross-table financial numerical reasoning in long-context reports. Evaluating both closed-source and open-source LLMs on FinLongDocQA reveals two bottlenecks: (1) annual reports often exceed 129k tokens, exacerbating the context rot problem for locating relevant tables; and (2) even when relevant evidence is located, LLMs remain prone to errors in multi-step numerical reasoning. We propose FinLongDocAgent, a Multi-Agent Multi-Round Retrieval-Augmented Generation (RAG) approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results across rounds. Experiments highlight the importance of iterative retrieval and verification for reliable numerical QA in long financial documents.
[24] AI Appeals Processor: A Deep Learning Approach to Automated Classification of Citizen Appeals in Government Services
Vladimir Beskorovainyi
Main category: cs.CL
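TL;DR: This paper presents AI Appeals Processor, a system that uses NLP and deep learning to automatically classify and route citizen appeals; a Word2Vec+LSTM model offers the best balance between accuracy (78%) and efficiency.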
Details
Motivation: Government agencies face growing volumes of citizen appeals, and traditional manual processing is slow (20 minutes per appeal on average) and inaccurate (67% classification accuracy), creating bottlenecks in public service delivery. Method: Build the microservice-based AI Appeals Processor system and compare several NLP approaches (BoW+SVM, TF-IDF+SVM, fastText, Word2Vec+LSTM, BERT) on 10,000 real citizen appeals. Result: The Word2Vec+LSTM model reaches 78% classification accuracy and reduces processing time by 54%, offering a better balance between accuracy and computational efficiency than transformer-based models. Conclusion: Word2Vec+LSTM is an efficient, practical choice for automated classification of citizen appeals and can improve the responsiveness and quality of government services. Abstract: Government agencies worldwide face growing volumes of citizen appeals, with electronic submissions increasing significantly over recent years. Traditional manual processing averages 20 minutes per appeal with only 67% classification accuracy, creating significant bottlenecks in public service delivery. This paper presents AI Appeals Processor, a microservice-based system that integrates natural language processing and deep learning techniques for automated classification and routing of citizen appeals. We evaluate multiple approaches -- including Bag-of-Words with SVM, TF-IDF with SVM, fastText, Word2Vec with LSTM, and BERT -- on a representative dataset of 10,000 real citizen appeals across three primary categories (complaints, applications, and proposals) and seven thematic domains. Our experiments demonstrate that a Word2Vec+LSTM architecture achieves 78% classification accuracy while reducing processing time by 54%, offering an optimal balance between accuracy and computational efficiency compared to transformer-based models.
[25] 'Layer su Layer': Identifying and Disambiguating the Italian NPN Construction in BERT's family
Greta Gorzoni,Ludovica Pannitto,Francesca Masini
Main category: cs.CL
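TL;DR: This study uses probing to analyze how well BERT represents the Italian NPN constructional family, testing whether the model encodes constructional form and meaning and contributing empirical evidence to the dialogue between Construction Grammar and neural language models.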
Details
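Motivation: Evaluate how well pretrained language models, and contextual embeddings in particular, encode explicit linguistic theories such as Construction Grammar, extend this line of research to a less-studied language (Italian), and challenge theoretical and methodological assumptions in earlier experimental designs. Method: Extract contextual vector representations from BERT and feed them to layer-wise probing classifiers, systematically evaluating the form and meaning information about the Italian NPN (noun-preposition-noun) construction encoded at each internal layer. Result: The results show the extent to which contextual embeddings reflect the form and meaning of NPN constructions, providing new empirical evidence on the link between constructionist theory and neural language modelling. Conclusion: The findings inform the use of Construction Grammar as a framework for interpreting the linguistic representations of neural models such as BERT and underline the importance of cross-lingual validation. Abstract: Interpretability research has highlighted the importance of evaluating Pretrained Language Models (PLMs) and in particular contextual embeddings against explicit linguistic theories to determine what linguistic information they encode. This study focuses on the Italian NPN (noun-preposition-noun) constructional family, challenging some of the theoretical and methodological assumptions underlying previous experimental designs and extending this type of research to a lesser-investigated language. Contextual vector representations are extracted from BERT and used as input to layer-wise probing classifiers, systematically evaluating information encoded across the model's internal layers. The results shed light on the extent to which constructional form and meaning are reflected in contextual embeddings, contributing empirical evidence to the dialogue between constructionist theory and neural language modelling.
[26] Unlocking Prompt Infilling Capability for Diffusion Language Models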
Yoshinari Fujinuma,Keisuke Sakaguchi
Main category: cs.CL
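TL;DR: This paper proposes full-sequence masking during supervised fine-tuning (masking both prompts and responses) to unlock the ability of masked diffusion language models (dLMs) to infill prompt templates, automatically producing high-quality, transferable prompts that complement existing prompt optimization methods.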
Details
Motivation: Masked diffusion language models can denoise bidirectionally, but the convention of masking only responses during supervised fine-tuning prevents them from being used effectively for prompt infilling. Method: Introduce full-sequence masking during supervised fine-tuning so that the model can infill the masked portions of a prompt template conditioned on few-shot examples. Result: Model-infilled prompts match or surpass manually designed templates, transfer across models, and are complementary to existing prompt optimization methods. Conclusion: The main factor limiting prompt infilling in dLMs is training practice rather than model architecture; improving training practice is enough to unlock the capability. Abstract: Masked diffusion language models (dLMs) generate text through bidirectional denoising, yet this capability remains locked for infilling prompts. This limitation is an artifact of the current supervised finetuning (SFT) convention of applying response-only masking. To unlock this capability, we extend full-sequence masking during SFT, where both prompts and responses are masked jointly. Once unlocked, the model infills masked portions of a prompt template conditioned on few-shot examples. We show that such model-infilled prompts match or surpass manually designed templates, transfer effectively across models, and are complementary to existing prompt optimization methods. Our results suggest that training practices, not architectural limitations, are the primary bottleneck preventing masked diffusion language models from infilling effective prompts.
[27] LightThinker++: From Reasoning Compression to Memory Management
Yuqi Zhu,Jintian Zhang,Zhenjie Wan,Yujie Luo,Shuofei Qiao,Zhengke Gui,Da Zheng,Lei Liang,Huajun Chen,Ningyu Zhang
Main category: cs.CL
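TL;DR: This paper proposes LightThinker and its upgraded version LightThinker++, which use dynamic compression of intermediate thoughts and explicit adaptive memory management to substantially reduce the cost of long-horizon LLM reasoning (peak token usage down by about 70%, inference time down by 26%) while maintaining or improving accuracy, with especially strong results on long-horizon agentic tasks.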
Details
Motivation: Large language models incur high cognitive overhead from long thought traces in complex reasoning; static compression can irreversibly lose key intermediate details and create logical bottlenecks. Method: Propose LightThinker, which dynamically compresses intermediate thoughts into compact semantic representations, and LightThinker++, which adds Explicit Adaptive Memory Management by combining behavioral-level memory primitives with a specialized trajectory synthesis pipeline that trains purposeful memory scheduling. Result: (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss; (2) in standard reasoning, LightThinker++ reduces peak token usage by 69.9% with a +2.42% accuracy gain; (3) in long-horizon agentic tasks, it keeps a stable footprint beyond 80 rounds (a 60%-70% reduction) with an average performance gain of 14.8%. Conclusion: LightThinker++ offers a scalable, low-overhead path to sustaining deep LLM reasoning over extended horizons. Abstract: Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, static compression often struggles with complex reasoning where the irreversible loss of intermediate details can lead to logical bottlenecks. To address this, we evolve the framework into LightThinker++, introducing Explicit Adaptive Memory Management. This paradigm shifts to behavioral-level management by incorporating explicit memory primitives, supported by a specialized trajectory synthesis pipeline to train purposeful memory scheduling. Extensive experiments demonstrate the framework's versatility across three dimensions. (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss. (2) In standard reasoning, LightThinker++ slashes peak token usage by 69.9% while yielding a +2.42% accuracy gain under the same context budget for maximum performance. (3) Most notably, in long-horizon agentic tasks, it maintains a stable footprint beyond 80 rounds (a 60%-70% reduction), achieving an average performance gain of 14.8% across different complex scenarios. Overall, our work provides a scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead.
[28] Researchers waste 80% of LLM annotation costs by classifying one text at a time
Christian Pipal,Eva-Maria Vogel,Morgan Wack,Frank Esser
Main category: cs.CL
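TL;DR: This paper examines whether batching (several texts per prompt) and variable stacking (several coding dimensions per prompt) can substantially reduce API calls and token costs when large language models are used for text classification in social science research, and shows that within a reasonable range they do not meaningfully harm coding quality.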
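The call-count arithmetic quoted in this entry's abstract (100,000 texts, four variables, batches of 25) can be reproduced with a small sketch. The function name `api_calls` and its parameters are illustrative, and the sketch counts prompts only; the reported token-cost saving of over 80% is the paper's figure and is not derived here.

```python
import math


def api_calls(n_texts: int, n_variables: int, batch_size: int = 1, stacked: bool = False) -> int:
    """Number of prompts needed to code every text on every variable."""
    prompts_per_pass = math.ceil(n_texts / batch_size)  # one prompt per batch of texts
    passes = 1 if stacked else n_variables              # stacking codes all variables at once
    return prompts_per_pass * passes


baseline = api_calls(100_000, 4)                              # one text, one variable per prompt
batched = api_calls(100_000, 4, batch_size=25, stacked=True)  # 25 texts, all variables per prompt
print(baseline, batched)
```

Here `baseline` corresponds to the 400,000 calls and `batched` to the 4,000 calls mentioned in the abstract.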
Details
Motivation: Researchers overwhelmingly call LLMs with one text and one variable per prompt, which makes API call and token costs very high; efficient, low-cost coding strategies that preserve accuracy are needed. Method: On 3,962 expert-coded tweets, test eight production LLMs from four providers, systematically varying batch size (1 to 1,000) and the number of stacked variables (1 to 25), and measure the effect on accuracy across four classification tasks against a single-text, single-variable baseline. Result: Six of eight models stay within 2 percentage points of the baseline up to batch sizes of 100; stacking up to 10 variables performs comparably to single-variable coding; degradation is driven by task complexity rather than prompt length; within this range, the measurement error is smaller than typical inter-coder disagreement. Conclusion: With reasonable settings (batch size up to 100, up to 10 stacked variables), batching and variable stacking cut costs substantially without meaningfully sacrificing coding quality. Abstract: Large language models (LLMs) are increasingly being used for text classification across the social sciences, yet researchers overwhelmingly classify one text per variable per prompt. Coding 100,000 texts on four variables requires 400,000 API calls. Batching 25 items and stacking all variables into a single prompt reduces this to 4,000 calls, cutting token costs by over 80%. Whether this degrades coding quality is unknown. We tested eight production LLMs from four providers on 3,962 expert-coded tweets across four tasks, varying batch size from 1 to 1,000 items and stacking up to 25 coding dimensions per prompt. Six of eight models maintained accuracy within 2 pp of the single-item baseline through batch sizes of 100. Variable stacking with up to 10 dimensions produced results comparable to single-variable coding, with degradation driven by task complexity rather than prompt length. Within this safe operating range, the measurement error from batching and stacking is smaller than typical inter-coder disagreement in the ground-truth data.
[29] POEMetric: The Last Stanza of Humanity
Bingru Li,Han Wang,Hazel Wilkinson
Main category: cs.CL
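TL;DR: This paper introduces POEMetric, described as the first comprehensive framework for poetry evaluation. Comparing human poets with 30 large language models on form adherence, creativity, emotional resonance, imagery, and other dimensions, it finds that current LLMs still lag well behind humans in advanced poetic abilities.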
Details
Motivation: LLMs can generate poetry, but there is no systematic, multi-dimensional evaluation of how far they are from human poets, especially in advanced abilities such as creativity, idiosyncrasy, emotion, and literary technique. Method: Build the POEMetric framework, covering basic abilities (form, theme), advanced abilities (creativity, lexical diversity, idiosyncrasy, emotional resonance, imagery, literary devices), and overall quality plus authorship estimation; curate a benchmark of 203 human-written English poems in 7 fixed forms; have 30 LLMs generate 6,090 poems from the same forms and themes; evaluate with rule-based metrics and LLM-as-a-judge, validated by human experts. Result: The top model scores well on form accuracy (4.26/5) and theme alignment (4.99/5), but no model reaches the human level on creativity (humans 4.02), idiosyncrasy (3.95), emotional resonance (4.06), imagery (4.49), or literary devices (4.67); in overall poem quality, humans score 4.22 against 3.20 for the best LLM. Conclusion: Current large language models remain clearly behind humans in poetry writing, particularly in the advanced abilities that reflect human aesthetics and emotional depth, so poetry generation remains a major challenge for LLMs. Abstract: Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction-following abilities in generating poems according to a certain form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of the overall poem quality and estimation of authorship. We curated a human poem dataset - 203 English poems of 7 fixed forms annotated with meter, rhyme patterns and themes - and experimented with 30 LLMs for poetry generation based on the same forms and themes of the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, though the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as a judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also defeated the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and code are released at https://github.com/Bingru-Li/POEMetric.
[30] Testing the Limits of Truth Directions in LLMs
Angelos Poulis,Mark Crovella,Evimaria Terzi
Main category: cs.CL
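TL;DR: This paper studies how universal "truth directions" in large language models (LLMs) really are and finds that they depend strongly on model layer, task type (factual vs. reasoning), task difficulty, and prompt instructions, challenging earlier claims of broad universality.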
Details
Motivation: Prior work disagrees on whether truth directions in LLMs are universal; this paper aims to systematically expose limits that have not been well understood. Method: Use multi-layer probing, comparisons across task types (factual vs. reasoning) and difficulty levels, and evaluations of truth-probe generalization under different instruction templates to test the layer dependence, task dependence, and instruction sensitivity of truth directions. Result: Truth directions are highly layer-dependent; they emerge in earlier layers for factual tasks and in later layers for reasoning tasks; probe performance varies with task complexity; simple correctness-evaluation instructions significantly affect the generalization ability of truth probes. Conclusion: Truth directions are not globally universal; they are constrained by layer, task type, difficulty, and prompt instructions, and earlier universality claims were too optimistic. Abstract: Large language models (LLMs) have been shown to encode truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain aspects, while more recent work has questioned this conclusion drawing on limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers in the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual and later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions; simple correctness evaluation instructions significantly affect the generalization ability of truth probes. Our findings indicate that universality claims for truth directions are more limited than previously known, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
[31] Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs
Wenhui Zhu,Xuanzhao Dong,Xiwen Chen,Rui Cai,Peijie Qiu,Zhipeng Wang,Oana Frunza,Shao Tang,Jindong Gu,Yalin Wang
Main category: cs.CL
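TL;DR: This paper systematically evaluates six defense strategies against indirect prompt injection (IPI) attacks across nine LLM backbones and finds that existing methods are broadly fragile, with some mitigations proving counterproductive. Building on the observation that agents show high internal decision entropy when executing malicious instructions, it uses Representation Engineering (RepE) to detect and intercept unauthorized actions before a tool call.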
Details
Motivation: Security evaluations of indirect prompt injection (IPI) rely mostly on static single-turn benchmarks, while systemic vulnerabilities in realistic multi-step, dynamic tool-calling environments are largely overlooked; evaluation and defense methods closer to real deployments are needed. Method: In dynamic multi-step tool-calling environments, systematically evaluate six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones; propose a RepE-based detection mechanism that extracts hidden states at the tool-input position to build a "circuit breaker" that intercepts actions before they are executed. Result: Advanced injections bypass nearly all baseline defenses, and some surface-level mitigations produce counterproductive side effects; malicious instructions are executed almost instantly but with abnormally high internal decision entropy; the RepE approach identifies and blocks unauthorized actions before execution with high detection accuracy across diverse LLM backbones. Conclusion: Current IPI defenses have fundamental limitations; RepE-driven detection before the action offers a practical approach to building resilient multi-agent systems. Abstract: The rapid deployment of open-source frameworks has significantly advanced the development of modern multi-agent systems. However, expanded action spaces, including uncontrolled privilege exposure and hidden inter-system interactions, pose severe security challenges. Specifically, Indirect Prompt Injections (IPI), which conceal malicious instructions within third-party content, can trigger unauthorized actions such as data exfiltration during normal operations. While current security evaluations predominantly rely on isolated single-turn benchmarks, the systemic vulnerabilities of these agents within complex dynamic environments remain critically underexplored. To bridge this gap, we systematically evaluate six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones. Crucially, we conduct our evaluation entirely within dynamic multi-step tool-calling environments to capture the true attack surface of modern autonomous agents. Moving beyond binary success rates, our multidimensional analysis reveals a pronounced fragility. Advanced injections successfully bypass nearly all baseline defenses, and some surface-level mitigations even produce counterproductive side effects. Furthermore, while agents execute malicious instructions almost instantaneously, their internal states exhibit abnormally high decision entropy. Motivated by this latent hesitation, we investigate Representation Engineering (RepE) as a robust detection strategy. By extracting hidden states at the tool-input position, we revealed that the RepE-based circuit breaker successfully identifies and intercepts unauthorized actions before the agent commits to them, achieving high detection accuracy across diverse LLM backbones. This study exposes the limitations of current IPI defenses and provides a highly practical paradigm for building resilient multi-agent architectures.
[32] When Models Know More Than They Say: Probing Analogical Reasoning in LLMs
Hope McGovern,Caroline Craig,Thomas Lippincott,Hale Sirin
Main category: cs.CL
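TL;DR: This paper studies how large language models (LLMs) handle narrative analogical reasoning and finds a task-dependent asymmetry between internal representations (accessed by probing) and prompted behavior: for rhetorical analogies, probing significantly outperforms prompting, while for narrative analogies both perform poorly, suggesting that prompting may not effectively access abstract knowledge the model already holds.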
Details
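Motivation: LLMs do well on analogical reasoning when surface and structural cues align but struggle with implicit analogies that depend on latent information (such as narrative analogies), which points to limits in abstraction and generalization and motivates a closer look at the relationship between internal representations and external behavior. Method: Compare the performance of a model's probed internal representations with its prompted behavior on detecting rhetorical and narrative analogies. Result: For rhetorical analogies, probing significantly outperforms prompting; for narrative analogies, both achieve similarly low performance; this suggests that the model may encode relevant abstract information that prompting does not effectively activate. Conclusion: The mapping between internal representations and prompted behavior in LLMs is task-dependent, and prompting may be an access bottleneck, particularly for narrative analogies that require deeper abstraction. Abstract: Analogical reasoning is a core cognitive faculty essential for narrative understanding. While LLMs perform well when surface and structural cues align, they struggle in cases where an analogy is not apparent on the surface but requires latent information, suggesting limitations in abstraction and generalisation. In this paper we compare a model's probed representations with its prompted performance at detecting narrative analogies, revealing an asymmetry: for rhetorical analogies, probing significantly outperforms prompting in open-source models, while for narrative analogies, they achieve a similar (low) performance. This suggests that the relationship between internal representations and prompted behavior is task-dependent and may reflect limitations in how prompting accesses available information.
[33] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation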
Haotian Zong,Binze Li,Yufei Long,Sinyin Chang,Jialong Wu,Gillian K. Hadfield
Main category: cs.CL
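TL;DR: This paper proposes I-CALM, a prompt-based framework that, without modifying the model, combines explicit reward schemes with humility-oriented normative principles to steer large language models toward abstaining when they are uncertain, reducing the hallucination rate on factual questions.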
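One way to read the answer-versus-abstain reward scheme in this entry is as an expected-reward threshold on confidence. The sketch below is a decision-theoretic illustration only: the reward values and the names `answer_threshold` and `should_answer` are hypothetical, and the paper announces such schemes in the prompt rather than computing them in code. Answering has expected reward p * r_correct + (1 - p) * r_wrong, which exceeds r_abstain exactly when p > (r_abstain - r_wrong) / (r_correct - r_wrong).

```python
def answer_threshold(r_correct: float, r_wrong: float, r_abstain: float) -> float:
    """Confidence above which answering has higher expected reward than abstaining."""
    if not (r_wrong <= r_abstain < r_correct):
        raise ValueError("expected r_wrong <= r_abstain < r_correct")
    return (r_abstain - r_wrong) / (r_correct - r_wrong)


def should_answer(confidence: float, r_correct: float = 1.0,
                  r_wrong: float = 0.0, r_abstain: float = 0.3) -> bool:
    """Answer only if stated confidence clears the threshold (hypothetical rewards)."""
    return confidence > answer_threshold(r_correct, r_wrong, r_abstain)


print(answer_threshold(1.0, 0.0, 0.3))   # partial credit for abstaining
print(answer_threshold(1.0, -1.0, 0.0))  # wrong answers penalized, abstention neutral
print(should_answer(0.5), should_answer(0.2))
```

Raising the abstention reward raises the threshold, which matches the abstention-hallucination trade-off the abstract describes: fewer answered cases, but fewer wrong answers among them.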
Details
Motivation: Large language models often give confident but incorrect answers, partly because common binary scoring rewards answering over honestly expressing uncertainty; the authors ask whether prompting alone can improve how models recognize and respond to their own uncertainty. Method: Propose the I-CALM prompt framework, which (i) elicits verbal confidence, (ii) partially rewards abstention through explicit reward schemes, and (iii) adds lightweight normative principles emphasizing truthfulness, humility, and responsibility; evaluate with GPT-5 mini on PopQA as the main setting. Result: I-CALM reduces the false-answer rate on answered cases, mainly by shifting error-prone cases to abstention and re-calibrating their confidence; varying the abstention reward yields a clear abstention-hallucination frontier; coverage drops but reliability improves, and forced-answer performance is largely unchanged. Conclusion: Prompting alone (as in I-CALM) can improve selective answering on factual questions without retraining, with the size of the effect varying across models and datasets. Abstract: Large language models (LLMs) frequently produce confident but incorrect answers, partly because common binary scoring conventions reward answering over honestly expressing uncertainty. We study whether prompt-only interventions -- explicitly announcing reward schemes for answer-versus-abstain decisions plus humility-oriented normative principles -- can reduce hallucination risk without modifying the model. Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being uncertain about their answers. We first assess self-reported verbal confidence as a usable uncertainty signal, showing stability under prompt paraphrasing and reasonable calibration against a token-probability baseline. We then study I-CALM, a prompt-based framework that (i) elicits verbal confidence, (ii) partially rewards abstention through explicit reward schemes, and (iii) adds lightweight normative principles emphasizing truthfulness, humility, and responsibility. Using GPT-5 mini on PopQA as the main setting, we find that confidence-eliciting, abstention-rewarding prompts, especially with norms, reduce the false-answer rate on answered cases mainly by identifying and shifting error-prone cases to abstention and re-calibrating their confidence. This trades coverage for reliability while leaving forced-answer performance largely unchanged. Varying the abstention reward yields a clear abstention-hallucination frontier. Overall, results show the framework can improve selective answering on factual questions without retraining, with the magnitude of effect varying across models and datasets. Code is available at https://github.com/binzeli/hallucinationControl.
[34] From Plausible to Causal: Counterfactual Semantics for Policy Evaluation in Simulated Online Communities
Agam Goyal,Yian Wang,Eshwar Chandrasekharan,Hari Sundaram
Main category: cs.CL
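TL;DR: This paper proposes bringing the causal counterfactual framework into LLM-based social simulation, distinguishing necessary from sufficient causation so that policy interventions can be evaluated more reliably.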
Details
Motivation: LLM-based social simulations are believable but lack explicit causal semantics, so they cannot support causal claims such as "intervention A reduces escalation", while policymaking needs interpretable, testable causal inference. Method: Adopt the causal counterfactual framework, formally defining necessary causation (the outcome would not have occurred without the intervention) and sufficient causation (the intervention reliably produces the outcome), and map them onto the needs of different stakeholders (moderators vs. platform designers); discuss how simulation design can support causal estimation under explicit assumptions, and stress that results should be read as simulator-conditional causal estimates. Result: The paper formalizes a framework connecting simulation design and causal inference, clarifies the notion of simulator-conditional causal estimates and its dependence on simulator fidelity, and provides a basis for judging whether a simulation can support real policy. Conclusion: Adopting the causal counterfactual framework is a key step in moving LLM social simulation from looking realistic to supporting policy decisions; it helps define what adequate fidelity means and lets simulations serve as policy wind tunnels. Abstract: LLM-based social simulations can generate believable community interactions, enabling "policy wind tunnels" where governance interventions are tested before deployment. But believability is not causality. Claims like "intervention A reduces escalation" require causal semantics that current simulation work typically does not specify. We propose adopting the causal counterfactual framework, distinguishing necessary causation (would the outcome have occurred without the intervention?) from sufficient causation (does the intervention reliably produce the outcome?). This distinction maps onto different stakeholder needs: moderators diagnosing incidents require evidence about necessity, while platform designers choosing policies require evidence about sufficiency. We formalize this mapping, show how simulation design can support estimation under explicit assumptions, and argue that the resulting quantities should be interpreted as simulator-conditional causal estimates whose policy relevance depends on simulator fidelity. Establishing this framework now is essential: it helps define what adequate fidelity means and moves the field from simulations that look realistic toward simulations that can support policy changes.
[35] Uncertainty as a Planning Signal: Multi-Turn Decision Making for Goal-Oriented Conversation
Xinyi Ling,Ye Liu,Reza Averly,Xia Ning
Main category: cs.CL
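TL;DR: This paper proposes CUP, an uncertainty-aware planning framework for goal-oriented conversation that combines language models with structured planning and uses uncertainty as the guiding signal for multi-turn decisions, improving task success rates while reducing the number of interaction turns.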
Details
Motivation: Existing approaches sit at two extremes: structured methods depend on predefined schemas and lack flexibility, while LLM-based methods are flexible but lack long-horizon decision making, leading to poor coordination between information acquisition and target commitment. Method: Formulate goal-oriented conversation as an uncertainty-aware sequential decision problem; in the CUP framework, a language model proposes feasible actions and a planner evaluates each action's long-term impact on uncertainty reduction. Result: On multiple conversational benchmarks, CUP consistently improves success rates while requiring fewer interaction turns; analysis shows that it acquires information more efficiently and commits to a target earlier and with confidence. Conclusion: Uncertainty is an effective guiding signal for coordinating information acquisition and target commitment in multi-turn conversation, and CUP combines the expressiveness of language models with the long-horizon reasoning of structured planning. Abstract: Goal-oriented conversational systems require making sequential decisions under uncertainty about the user's intent, where the algorithm must balance information acquisition and target commitment over multiple turns. Existing approaches address this challenge from different perspectives: structured methods enable multi-step planning but rely on predefined schemas, while LLM-based approaches support flexible interactions but lack long-horizon decision making, resulting in poor coordination between information acquisition and target commitment. To address this limitation, we formulate goal-oriented conversation as an uncertainty-aware sequential decision problem, where uncertainty serves as a guiding signal for multi-turn decision making. We propose a Conversation Uncertainty-aware Planning framework (CUP) that integrates language models with structured planning: a language model proposes feasible actions, and a planner evaluates their long-term impact on uncertainty reduction. Experiments on multiple conversational benchmarks show that CUP consistently improves success rates while requiring fewer interaction turns. Further analysis demonstrates that uncertainty-aware planning contributes to more efficient information acquisition and earlier confident commitment.
[36] AdaptFuse: Training-Free Sequential Preference Learning via Externalized Bayesian Inference
Fangzhou Lin,Peiran Li,Shuo Xing,Siyuan Yang,Qianwen Ge,Kazunori Yamada,Ziming Zhang,Haichong Zhang,Zhengzhong Tu
Main category: cs.CL
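TL;DR: This paper proposes AdaptFuse, a training-free framework that externalizes probabilistic computation to a symbolic module and combines it with the semantic reasoning of a frozen large language model (LLM), achieving multi-turn interactive personalized recommendation consistent with Bayesian inference without relying on sensitive user data.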
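To make the combination of an external Bayesian posterior with an LLM signal more concrete, here is a minimal sketch. It assumes the LLM's contribution has already been reduced to a probability vector over the same hypotheses (the multi-sample Dirichlet aggregation is not modeled), and the confidence weight of one minus normalized entropy is an illustrative assumption, not the fusion rule from the paper.

```python
import math

def normalized_entropy(p):
    """Entropy of a probability list divided by log(len(p)): 0 = certain, 1 = uniform."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))

def bayes_update(prior, likelihood):
    """One exact Bayesian update over a discrete hypothesis set."""
    unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def fuse(symbolic, llm):
    """Mix two distributions, weighting each by an assumed confidence score.

    Confidence here is max(0, 1 - normalized entropy); this choice is a
    placeholder for whatever entropy-adaptive weighting a real system uses.
    """
    c_sym = max(0.0, 1.0 - normalized_entropy(symbolic))
    c_llm = max(0.0, 1.0 - normalized_entropy(llm))
    total = c_sym + c_llm
    w_sym = 0.5 if total < 1e-12 else c_sym / total
    return [w_sym * s + (1.0 - w_sym) * l for s, l in zip(symbolic, llm)]

# Toy round: three hypotheses, one observation that favors the first.
prior = [1 / 3, 1 / 3, 1 / 3]
posterior = bayes_update(prior, [0.8, 0.1, 0.1])
llm_guess = [0.4, 0.3, 0.3]  # hypothetical LLM-derived distribution
fused = fuse(posterior, llm_guess)
```

With these toy numbers the sharper symbolic posterior dominates the mix, which mirrors the described shift of reliance toward the symbolic posterior as evidence accumulates.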
Details
Motivation: LLMs struggle to keep accumulating evidence across multiple rounds of user interaction and to update their beliefs in a Bayesian manner; existing methods fine-tune on sensitive user interaction data, which is hard to reconcile with privacy requirements.
Method: Proposes AdaptFuse: a symbolic module maintains a Bayesian posterior over a discrete hypothesis set, while a frozen LLM provides semantic reasoning through multi-sample Dirichlet aggregation; the two are combined by an entropy-adaptive fusion mechanism that dynamically adjusts reliance on the LLM versus the symbolic posterior.
Result: On three tasks (flight recommendation, hotel recommendation, and web shopping) with Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B, AdaptFuse consistently outperforms prompting baselines and fine-tuned Bayesian Teaching models, with accuracy rising monotonically over interaction rounds.
Conclusion: Principled inference-time algorithms can replace fine-tuning for personalized recommendation, with no need to store or train on sensitive user data, balancing performance and privacy.
Abstract: Large language models struggle to accumulate evidence across multiple rounds of user interaction, failing to update their beliefs in a manner consistent with Bayesian inference. Existing solutions require fine-tuning on sensitive user interaction data, limiting their applicability in privacy-conscious settings. We propose AdaptFuse, a training-free framework that externalizes probabilistic computation entirely from the LLM: a symbolic module maintains a Bayesian posterior over a discrete hypothesis set, while a frozen LLM contributes semantic reasoning via multi-sample Dirichlet aggregation. The two signals are combined through entropy-adaptive fusion, which automatically weights each source by its predictive confidence, shifting reliance from the LLM to the symbolic posterior as evidence accumulates. We evaluate across three domains (flight recommendation, hotel recommendation, and web shopping) on Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B. AdaptFuse consistently outperforms both prompting baselines and fine-tuned Bayesian Teaching models on all tasks, with accuracy improving monotonically over interaction rounds. These results demonstrate that principled inference-time algorithms can substitute for fine-tuning in personalized recommendation, without storing or training on sensitive user data. All the code and materials will be open-sourced.
[37] Predict, Don't React: Value-Based Safety Forecasting for LLM Streaming
Pride Kavumba,Koki Wataoka,Huy H. Nguyen,Jiaxuan Li,Masaya Ohagi
Main category: cs.CL
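TL;DR: This paper proposes StreamGuard, a unified, model-agnostic streaming guardrail that frames response safety as forecasting future harmfulness rather than as conventional boundary detection; trained with supervision from Monte Carlo sampling, it improves both input and streaming-output safety detection on several benchmarks without requiring exact token-level boundary annotations.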
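The forecasting target described above can be pictured as an average over sampled continuations. The sketch below is a toy under stated assumptions: `sample_continuation` and `harm_score` are hypothetical callables standing in for a generator and a harmfulness judge, and nothing here reflects StreamGuard's actual models or training code.

```python
import random

def forecast_target(prefix, sample_continuation, harm_score, n_rollouts=8, rng=None):
    """Monte Carlo estimate of the expected harmfulness of continuing `prefix`.

    sample_continuation(prefix, rng) -> str   (hypothetical generator)
    harm_score(text) -> float in [0, 1]       (hypothetical judge)
    """
    rng = rng or random.Random(0)
    scores = [
        harm_score(prefix + sample_continuation(prefix, rng))
        for _ in range(n_rollouts)
    ]
    return sum(scores) / len(scores)

# Toy stand-ins so the sketch runs end to end.
def toy_sampler(prefix, rng):
    return rng.choice([" benign ending", " unsafe ending"])

def toy_judge(text):
    return 1.0 if "unsafe" in text else 0.0

target = forecast_target("Some partial response", toy_sampler, toy_judge, n_rollouts=100)
```

A target like this could then supervise a predictor that sees only the prefix, which is why no annotation of the exact token at which a response turns unsafe is needed.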
Details
Motivation: Existing streaming response guardrails mostly follow a boundary-detection paradigm that depends on exact token-level annotations of where a response becomes unsafe; such annotation is costly and generalizes poorly, whereas real deployments need unified, low-latency, accurate safety intervention for both prompts and streaming responses.
Method: Proposes StreamGuard, which frames streaming safety judgment as forecasting, from a prefix, the expected harmfulness of subsequent generation; Monte Carlo rollouts sample future continuations and estimate their average harmfulness, which serves as the supervision signal for training a lightweight prediction head; the approach supports transfer across tokenizers and model families.
Result: On standard safety benchmarks, StreamGuard-8B compared with Qwen3Guard-Stream-8B-strict raises input F1 from 86.7 to 88.2 and streaming output F1 from 80.4 to 81.9; on the QWENGUARDTEST response-localization benchmark it reaches 97.5 F1 (+1.6), 95.1 recall (+3.0), and 92.6% on-time intervention (+2.7 points), with the miss rate falling to 4.9% (−3.0 points); after transfer to Gemma3-1B it still reaches 81.3 response F1 and a 3.5% miss rate.
Conclusion: Forecasting future harmfulness is a more robust, transferable supervision paradigm for streaming safety that needs no boundary annotation; StreamGuard delivers high-performing, low-latency, model-agnostic end-to-end streaming content moderation.
Abstract: In many practical LLM deployments, a single guardrail is used for both prompt and response moderation. Prompt moderation operates on fully observed text, whereas streaming response moderation requires safety decisions to be made over partial generations. Existing text-based streaming guardrails commonly frame this output-side problem as boundary detection, training models to identify the earliest prefix at which a response has already become unsafe. In this work, we introduce StreamGuard, a unified model-agnostic streaming guardrail that instead formulates moderation as a forecasting problem: given a partial prefix, the model predicts the expected harmfulness of likely future continuations. We supervise this prediction using Monte Carlo rollouts, which enables early intervention without requiring exact token-level boundary annotations. Across standard safety benchmarks, StreamGuard performs strongly both for input moderation and for streaming output moderation. At the 8B scale, StreamGuard improves aggregated input-moderation F1 from 86.7 to 88.2 and aggregated streaming output-moderation F1 from 80.4 to 81.9 relative to Qwen3Guard-Stream-8B-strict. On the QWENGUARDTEST response_loc streaming benchmark, StreamGuard reaches 97.5 F1, 95.1 recall, and 92.6% on-time intervention, compared to 95.9 F1, 92.1 recall, and 89.9% for Qwen3Guard-Stream-8B-strict, while reducing the miss rate from 7.9% to 4.9%. We further show that forecasting-based supervision transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches 81.3 response-moderation F1, 98.2 streaming F1, and a 3.5% miss rate. These results show that strong end-to-end streaming moderation can be obtained without exact boundary labels, and that forecasting future risk is an effective supervision strategy for low-latency safety intervention.
[38] RUQuant: Towards Refining Uniform Quantization for Large Language Models
Han Liu,Haotian Gao,Changya Li,Feng Zhang,Xiaotong Zhang,Wei Wang,Hong Yu
Main category: cs.CL
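TL;DR: This paper proposes RUQuant, which refines activation quantization through a two-stage orthogonal transformation, substantially reducing the accuracy loss caused by non-uniform activation distributions and reaching near full-precision performance without fine-tuning.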
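One building block of such orthogonal transformations, the Householder reflection, can be shown in a few lines. The sketch below reflects a vector with one outlier channel onto a flat target of the same norm. The flat target and the single reflection are illustrative assumptions; the method summarized above uses uniformly sampled targets and composes Householder reflections with Givens rotations, which is not reproduced here.

```python
import math

def householder_apply(v, x):
    """Apply H = I - 2 v v^T / (v^T v) to x without forming the matrix H."""
    vv = sum(a * a for a in v)
    if vv == 0:
        return list(x)  # v = 0 means x already equals the target
    scale = 2.0 * sum(a * b for a, b in zip(v, x)) / vv
    return [b - scale * a for a, b in zip(v, x)]

def reflect_onto(x, target_direction):
    """Return (v, y): y has the norm of x and the direction of target_direction,
    and v = x - y defines the Householder reflection that sends x to y."""
    nx = math.sqrt(sum(a * a for a in x))
    nt = math.sqrt(sum(a * a for a in target_direction))
    y = [nx * a / nt for a in target_direction]
    v = [a - b for a, b in zip(x, y)]
    return v, y

# Toy activation block with one outlier channel (values are made up).
x = [4.0, 0.1, -0.2, 0.05]
v, y = reflect_onto(x, [1.0, 1.0, 1.0, 1.0])
hx = householder_apply(v, x)
```

Because the reflection is orthogonal, the norm of the block is preserved while its largest magnitude drops, so a uniform quantization grid covers the transformed values more evenly.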
Details
Motivation: Existing post-training quantization methods use uniform quantization schemes, but activations are non-uniformly distributed, so the quantization point that is optimal under the Lloyd-Max criterion shifts away from the interval midpoint, causing substantial accuracy loss.
Method: Proposes RUQuant: in the first stage, activations are divided into blocks and each block is mapped to uniformly sampled target vectors by composite orthogonal matrices built from Householder reflections and Givens rotations; in the second stage, a global Householder reflection is fine-tuned to minimize the Transformer output error.
Result: On a 13B LLM, W6A6 quantization reaches 99.8% of full-precision accuracy and W4A4 reaches 97%, in about one minute; a fine-tuned variant achieves higher accuracy.
Conclusion: RUQuant improves activation quantization from a theoretical starting point, combining efficiency with high accuracy, requiring no retraining, and scaling well.
Abstract: The increasing size and complexity of large language models (LLMs) have raised significant challenges in deployment efficiency, particularly under resource constraints. Post-training quantization (PTQ) has emerged as a practical solution by compressing models without requiring retraining. While existing methods focus on uniform quantization schemes for both weights and activations, they often suffer from substantial accuracy degradation due to the non-uniform nature of activation distributions. In this work, we revisit the activation quantization problem from a theoretical perspective grounded in the Lloyd-Max optimality conditions. We identify the core issue as the non-uniform distribution of activations within the quantization interval, which causes the optimal quantization point under the Lloyd-Max criterion to shift away from the midpoint of the interval. To address this issue, we propose a two-stage orthogonal transformation method, RUQuant. In the first stage, activations are divided into blocks. Each block is mapped to uniformly sampled target vectors using composite orthogonal matrices, which are constructed from Householder reflections and Givens rotations. In the second stage, a global Householder reflection is fine-tuned to further minimize quantization error using Transformer output discrepancies. Empirical results show that our method achieves near-optimal quantization performance without requiring model fine-tuning: RUQuant achieves 99.8% of full-precision accuracy with W6A6 and 97% with W4A4 quantization for a 13B LLM, within approximately one minute. A fine-tuned variant yields even higher accuracy, demonstrating the effectiveness and scalability of our approach.
[39] GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
Xinyu Geng,Yanjing Xiao,Yuyang Zhang,Hanwen Wang,Xinyan Liu,Rui Min,Tianqing Fang,Yi R. Fung
Main category: cs.CL
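TL;DR: This paper proposes GeoBrowse, a geolocation benchmark that combines visual reasoning with multi-hop knowledge queries to evaluate how well deep research agents integrate ambiguous visual cues with open-web evidence.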
Details
Motivation: Existing multimodal benchmarks rarely require both the composition of weak visual cues and BrowseComp-style multi-hop verification; geolocation suits this need naturally because answers depend on combining multiple ambiguous visual cues and verifying them against open-web evidence.
Method: Builds the two-level GeoBrowse benchmark (Level 1: extracting and composing visual cues; Level 2: injecting long-tail knowledge and obfuscating key entities) and designs GATE, an agentic workflow that supports evaluation with five think-with-image tools and four knowledge-intensive tools, complemented by expert-annotated stepwise traces grounded in verifiable evidence.
Result: Experiments show that GATE outperforms direct inference and open-source agents; the gains come from coherent, level-specific tool-use plans rather than from more tool calls; such plans reach key evidence steps more reliably and make fewer integration errors in the final decision.
Conclusion: Single-mode setups (no tools, search only, or image only) are insufficient for complex geolocation; effective multimodal, multi-hop reasoning requires structured, evidence-driven coordination of tools.
Abstract: Deep research agents integrate fragmented evidence through multi-step tool use. BrowseComp offers a text-only testbed for such agents, but existing multimodal benchmarks rarely require both the composition of weak visual cues and BrowseComp-style multi-hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them with open-web evidence. Thus, we introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries. Level 1 tests extracting and composing fragmented visual cues, and Level 2 increases query difficulty by injecting long-tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow GATE with five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable evidence for trajectory-level analysis. Experiments show that GATE outperforms direct inference and open-source agents, indicating that no-tool, search-only or image-only setups are insufficient. Gains come from coherent, level-specific tool-use plans rather than more tool calls, as they more reliably reach annotated key evidence steps and make fewer errors when integrating into the final decision. The GeoBrowse benchmark and code are available at https://github.com/ornamentt/GeoBrowse
[40] Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models
Sailesh kiran kurra,Shiek Ruksana,Vishal Borusu
Main category: cs.CL
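TL;DR: This paper proposes a causal graph attention network (GCAN) framework that models token-level attention flow and gradient influence inside the Transformer, defines a Causal Contribution Score (CCS), and adds a fact-anchored graph reweighting layer; on TruthfulQA and HotpotQA it reports a 27.8% lower hallucination rate and 16.4% higher factual accuracy than RAG baselines.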
Details
Motivation: Large language models (LLMs) have strong language abilities but hallucinate badly, generating content that is factually wrong, misleading, or unsupported; this is especially harmful in high-stakes settings such as medical diagnosis and legal reasoning, so their factual reliability needs to improve.
Method: Proposes the causal graph attention network (GCAN): builds token-level causal graphs that combine self-attention weights with gradient-based influence scores; defines a new metric, the Causal Contribution Score (CCS), to quantify each token's factual dependency; introduces a fact-anchored graph reweighting layer that dynamically suppresses the influence of hallucination-prone nodes during generation.
Result: On the TruthfulQA and HotpotQA benchmarks, hallucination rate falls by 27.8% and factual accuracy rises by 16.4% relative to retrieval-augmented generation (RAG) baselines.
Conclusion: GCAN improves the interpretability, robustness, and factual reliability of LLMs and offers a new direction for building more trustworthy LLM architectures.
Abstract: This paper focuses on hallucinations produced by large language models (LLMs). LLMs have shown extraordinary language understanding and generation capabilities. Still, they have a major disadvantage: hallucinations, that is, outputs that are factually incorrect, misleading, or unsupported by the input data. These hallucinations cause serious problems in scenarios like medical diagnosis or legal reasoning. In this work, we propose a causal graph attention network (GCAN) framework that reduces hallucinations by interpreting the internal attention flow within a transformer architecture, constructing token-level graphs that combine self-attention weights and gradient-based influence scores. Our method quantifies each token's factual dependency using a new metric called the Causal Contribution Score (CCS). We further introduce a fact-anchored graph reweighting layer that dynamically reduces the influence of hallucination-prone nodes during generation. Experiments on standard benchmarks such as TruthfulQA and HotpotQA show a 27.8 percent reduction in hallucination rate and a 16.4 percent improvement in factual accuracy over baseline retrieval-augmented generation (RAG) models. This work contributes to the interpretability, robustness, and factual reliability of future LLM architectures.
[41] Emergent Inference-Time Semantic Contamination via In-Context Priming
Marcin Abram
Main category: cs.CL
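TL;DR: Through controlled experiments, this paper finds that large language models can undergo inference-time semantic drift under few-shot prompting, shifting outputs toward negative themes, and that the effect depends on model capability; it also separates two mechanisms, structural format contamination and semantic content contamination.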
Details
Motivation: Questions the earlier conclusion that k-shot prompting does not induce misalignment, and investigates whether measurable semantic drift occurs at inference time and under what boundary conditions.
Method: Runs a controlled experiment that inserts five culturally loaded numbers as few-shot demonstrations before a semantically unrelated prompt and compares the resulting shifts in output distribution across models of different capability; nonsense strings serve as a control to separate structural from semantic contamination.
Result: More capable models show significant distributional shifts toward darker, authoritarian, and stigmatized themes, while a smaller model does not; nonsense strings also perturb output distributions, suggesting two separable mechanisms: structural format contamination and semantic content contamination.
Conclusion: Inference-time contamination is real and measurable but depends on model capability; the risk of few-shot prompting in security-sensitive settings should be reassessed, with structural and semantic contamination treated as distinct mechanisms.
Abstract: Recent work has shown that fine-tuning large language models (LLMs) on insecure code or culturally loaded numeric codes can induce emergent misalignment, causing models to produce harmful content in unrelated downstream tasks. The authors of that work concluded that $k$-shot prompting alone does not induce this effect. We revisit this conclusion and show that inference-time semantic drift is real and measurable; however, it requires sufficiently capable models. Using a controlled experiment in which five culturally loaded numbers are injected as few-shot demonstrations before a semantically unrelated prompt, we find that models with richer cultural-associative representations exhibit significant distributional shifts toward darker, authoritarian, and stigmatized themes, while a simpler/smaller model does not. We additionally find that structurally inert demonstrations (nonsense strings) perturb output distributions, suggesting two separable mechanisms: structural format contamination and semantic content contamination. Our results map the boundary conditions under which inference-time contamination occurs, and carry direct implications for the security of LLM-based applications that use few-shot prompting.
[42] Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison
Jihoon Jeong
Main category: cs.CL
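TL;DR: This paper presents the first systematic analysis of emotion representations in small language models (SLMs), finding that emotion vectors concentrate in middle layers with a U-shaped profile and that generation-based extraction outperforms comprehension-based extraction; emotion steering falls into three behavioral regimes, and emotion entanglement in multilingual models poses a safety risk.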
Details
Motivation: Investigates whether small language models in the 100M-10B parameter range possess internal emotion representations like those found in large models, filling a gap in emotion representation research at this scale.
Method: Compares emotion vector extraction on 9 SLMs from 5 architectural families, using generation-based and comprehension-based methods across 20 emotions; combines layer localization analysis, anisotropy baseline validation, causal steering experiments, and verification by an external emotion classifier.
Result: Generation-based extraction is significantly better (p = 0.007, d = -107.5); emotion representations concentrate in middle layers at about 50% depth, following an architecture-invariant U-shaped curve; steering produces three regimes (surgical, repetitive collapse, and explosive); cross-lingual emotion entanglement is found in Qwen.
Conclusion: SLMs already have structured, steerable emotion representations whose properties depend more on architecture than on parameter count; the study provides methodological guidelines for emotion analysis of open-weight models and warns of potential safety risks in multilingual deployment.
Abstract: Small language models (SLMs) in the 100M-10B parameter range increasingly power production systems, yet whether they possess the internal emotion representations recently discovered in frontier models remains unknown. We present the first comparative analysis of emotion vector extraction methods for SLMs, evaluating 9 models across 5 architectural families (GPT-2, Gemma, Qwen, Llama, Mistral) using 20 emotions and two extraction methods (generation-based and comprehension-based). Generation-based extraction produces statistically superior emotion separation (Mann-Whitney p = 0.007; Cohen's d = -107.5), with the advantage modulated by instruction tuning and architecture. Emotion representations localize at middle transformer layers (~50% depth), following a U-shaped curve that is architecture-invariant from 124M to 3B parameters. We validate these findings against representational anisotropy baselines across 4 models and confirm causal behavioral effects through steering experiments, independently verified by an external emotion classifier (92% success rate, 37/40 scenarios). Steering reveals three regimes: surgical (coherent text transformation), repetitive collapse, and explosive (text degradation), quantified by perplexity ratios and separated by model architecture rather than scale. We document cross-lingual emotion entanglement in Qwen, where steering activates semantically aligned Chinese tokens that RLHF does not suppress, raising safety concerns for multilingual deployment. This work provides methodological guidelines for emotion research on open-weight models and contributes to the Model Medicine series by bridging external behavioral profiling with internal representational analysis.
[43] Embedding Enhancement via Fine-Tuned Language Models for Learner-Item Cognitive Modeling
Yuanhao Liu,Zihan Zhou,Kaiying Wu,Shuo Liu,Yiyang Huang,Jiajun Guo,Aimin Zhou,Hong Qian
Main category: cs.CL
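TL;DR: This paper proposes EduEmbed, a framework that fine-tunes language models and adds a textual adapter to address the mismatch between language-model and cognitive diagnosis (CD) objectives and the lack of a unified embedding enhancement framework, achieving robust gains on several CD tasks and a computerized adaptive testing (CAT) task.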
Details
Motivation: Existing cognitive modeling methods based on ID embeddings are effective but cannot fully exploit the semantic representations of language models (LMs); the training objectives of LMs and cognitive diagnosis (CD) models differ, creating a distribution gap in feature space, and there is no framework that integrates textual embeddings in a unified way while remaining compatible with mainstream modeling paradigms.
Method: Proposes the two-stage EduEmbed framework: the first stage fine-tunes LMs using role-specific representations and an interaction diagnoser to bridge the semantic gap; the second stage uses a textual adapter to extract task-relevant semantics and integrates them with existing cognitive modeling paradigms to improve generalization.
Result: EduEmbed achieves robust performance on four CD tasks and one computerized adaptive testing (CAT) task; further analysis reveals how semantic information affects different tasks.
Conclusion: EduEmbed offers a scalable, robust, unified embedding enhancement framework that brings language models to cognitive diagnosis while staying compatible with existing paradigms, advancing semantics-driven cognitive modeling in online intelligent education systems.
Abstract: Learner-item cognitive modeling plays a central role in web-based online intelligent education systems by enabling cognitive diagnosis (CD) across diverse online educational scenarios. Although ID embedding remains the mainstream approach in cognitive modeling due to its effectiveness and flexibility, recent advances in language models (LMs) have introduced new possibilities for incorporating rich semantic representations to enhance CD performance. This highlights the need for a comprehensive analysis of how LMs enhance embeddings through semantic integration across mainstream CD tasks. This paper identifies two key challenges in fully leveraging LMs in existing work: (1) misalignment between the training objectives of LMs and CD models creates a distribution gap in feature spaces; (2) a unified framework is essential for integrating textual embeddings across varied CD tasks while preserving the strengths of existing cognitive modeling paradigms to ensure the robustness of embedding enhancement. To address these challenges, this paper introduces EduEmbed, a unified embedding enhancement framework that leverages fine-tuned LMs to enrich learner-item cognitive modeling across diverse CD tasks. EduEmbed operates in two stages. In the first stage, we fine-tune LMs based on role-specific representations and an interaction diagnoser to bridge the semantic gap of CD models. In the second stage, we employ a textual adapter to extract task-relevant semantics and integrate them with existing modeling paradigms to improve generalization. We evaluate the proposed framework on four CD tasks and a computerized adaptive testing (CAT) task, achieving robust performance. Further analysis reveals the impact of semantic information across diverse tasks, offering key insights for future research on the application of LMs in CD for online intelligent education systems.
[44] Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression
Lingjie Zeng,Xiaofan Chen,Yanbo Wang,Xiuying Chen
Main category: cs.CL
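TL;DR: This paper presents the first systematic study of how chain-of-thought (CoT) compression affects model trustworthiness (safety, hallucination resistance, and multilingual robustness), finding that compression often degrades trustworthiness and that degradation patterns differ by method; it proposes a normalized efficiency score to expose the trade-offs and an alignment-aware DPO variant that cuts reasoning length by 19.3% with substantially smaller trustworthiness loss.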
Details
Motivation: Existing CoT compression research looks only at task accuracy and token savings, overlooking that compression may harm key trustworthiness dimensions such as safety, hallucination resistance, and multilingual robustness, properties encoded in the same parameter space that compression modifies.
Method: Conducts the first systematic empirical study, comparing models of several scales under different CoT compression methods along three trustworthiness dimensions (safety, hallucination resistance, multilingual robustness); proposes a normalized efficiency score for each dimension; designs and validates an alignment-aware DPO variant.
Result: CoT compression frequently causes trustworthiness regressions, and degradation profiles differ markedly across methods and dimensions; the normalized efficiency score reveals trade-offs that simple scalar metrics hide; the alignment-aware DPO variant achieves a 19.3% length reduction with substantially smaller trustworthiness loss.
Conclusion: CoT compression should not optimize efficiency alone; trustworthiness must be treated as an equally important design constraint and optimized jointly.
Abstract: Long chain-of-thought (Long-CoT) reasoning models have motivated a growing body of work on compressing reasoning traces to reduce inference cost, yet existing evaluations focus almost exclusively on task accuracy and token savings. Trustworthiness properties, whether acquired or reinforced through post-training, are encoded in the same parameter space that compression modifies. This means preserving accuracy does not, a priori, guarantee preserving trustworthiness. We conduct the first systematic empirical study of how CoT compression affects model trustworthiness, evaluating multiple models of different scales along three dimensions: safety, hallucination resistance, and multilingual robustness. Under controlled comparisons, we find that CoT compression frequently introduces trustworthiness regressions and that different methods exhibit markedly different degradation profiles across dimensions. To enable fair comparison across bases, we propose a normalized efficiency score for each dimension that reveals how naïve scalar metrics can obscure trustworthiness trade-offs. As an existence proof, we further introduce an alignment-aware DPO variant that reduces CoT length by 19.3% on reasoning benchmarks with substantially smaller trustworthiness loss. Our findings suggest that CoT compression should be optimized not only for efficiency but also for trustworthiness, treating both as equally important design constraints.
[45] Many Preferences, Few Policies: Towards Scalable Language Model Personalization
Cheol Woo Kum,Jai Moondra,Roozbeh Nahavandi,Andrew Perrault,Milind Tambe,Swati Gupta
Main category: cs.CL
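TL;DR: This paper proposes PALM, which personalizes for users by building a small portfolio of aligned large language models (LLMs), with theoretical guarantees that the portfolio is small and near-optimal, balancing system overhead against personalization quality.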
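The notion of a portfolio that contains a near-optimal model for every preference weight vector can be checked numerically on a toy example. In the sketch below, the policy names, per-trait rewards, and weight grid are all made up for illustration, and the function only measures coverage of a given portfolio; it does not implement the PALM construction algorithm.

```python
def scalarize(weights, rewards):
    """Weighted sum of per-trait rewards (the scalarized objective)."""
    return sum(w * r for w, r in zip(weights, rewards))

def best_in(policies, weights):
    """Name of the policy with the highest scalarized reward."""
    return max(policies, key=lambda name: scalarize(weights, policies[name]))

def coverage_ratio(all_policies, portfolio, weights):
    """Best scalarized reward inside the portfolio divided by the best overall."""
    full = scalarize(weights, all_policies[best_in(all_policies, weights)])
    subset = {name: all_policies[name] for name in portfolio}
    part = scalarize(weights, subset[best_in(subset, weights)])
    return part / full

# Hypothetical rewards on three traits: (safety, humor, brevity).
policies = {
    "safe": (0.9, 0.2, 0.5),
    "funny": (0.3, 0.9, 0.4),
    "brief": (0.5, 0.4, 0.9),
    "mixed": (0.7, 0.7, 0.7),
}
portfolio = ["safe", "funny", "brief"]
grid = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1 / 3, 1 / 3, 1 / 3)]
worst = min(coverage_ratio(policies, portfolio, w) for w in grid)
```

Here the three-model portfolio is optimal for users who care about a single trait but reaches only about 86% of the best achievable reward for a user who weights all traits equally, which is the kind of gap an approximation guarantee bounds.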
Details
Motivation: One LLM per user is the ideal but is limited by compute, memory, and system complexity, so an alternative that balances personalization with practicality is needed.
Method: Models user preferences as a multi-dimensional weight vector and, given reward functions for each dimension, designs the PALM algorithm to generate a small portfolio that contains a near-optimal LLM for every possible weight vector.
Result: Gives the first theoretical guarantees on both the size and the approximation quality of LLM portfolios; empirical results show greater output diversity than common baselines.
Conclusion: PALM provides a controllable trade-off between system cost and personalization quality and characterizes the diversity of LLMs needed to cover the space of user preferences.
Abstract: The holy grail of LLM personalization is a single LLM for each user, perfectly aligned with that user's preferences. However, maintaining a separate LLM per user is impractical due to constraints on compute, memory, and system complexity. We address this challenge by developing a principled method for selecting a small portfolio of LLMs that captures representative behaviors across heterogeneous users. We model user preferences across multiple traits (e.g., safety, humor, brevity) through a multi-dimensional weight vector. Given reward functions across these dimensions, our algorithm PALM (Portfolio of Aligned LLMs) generates a small portfolio of LLMs such that, for any weight vector, the portfolio contains a near-optimal LLM for the corresponding scalarized objective. To the best of our knowledge, this is the first result that provides theoretical guarantees on both the size and approximation quality of LLM portfolios for personalization. It characterizes the trade-off between system cost and personalization, as well as the diversity of LLMs required to cover the landscape of user preferences. We provide empirical results that validate these guarantees and demonstrate greater output diversity over common baselines.
[46] A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models
Avish Vijayaraghavan,Jaskaran Singh Kawatra,Sebin Sabu,Jonny Sheldon,Will Poulett,Alex Eze,Daniel Key,John Booth,Shiren Patel,Jonny Pearson,Dan Schofield,Jonathan Hope,Pavithra Rajendran,Neil Sebire
Main category: cs.CL
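TL;DR: This study proposes a resource-efficient, semi-automated annotation workflow based on small language models (SLMs) for extracting structured information from loosely structured electronic patient record text such as paediatric renal biopsy reports, reaching high accuracy on CPU-only hardware (84.3% with Gemma 2 2B) while protecting privacy and remaining clinically practical.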
Details
Motivation: Much of the clinical information in EPR systems exists as unstructured text and is hard to use; large language models could help, but running them locally is computationally expensive and cloud processing raises patient privacy concerns.
Method: Builds a semi-automated annotation workflow guided by clinical experts: 400 paediatric renal biopsy reports are annotated manually as a gold standard; information extraction is framed as a question-answering task grounded in clinician-defined entity guidelines and few-shot examples; five instruction-tuned SLMs are evaluated, with a disagreement modelling framework used to prioritise reports for manual review.
Result: Gemma 2 2B reaches 84.3% accuracy, outperforming baselines such as spaCy (74.3%) and BioBERT-SQuAD (62.3%); entity guidelines improve performance by 7-19% and few-shot examples by 6-38%, but the two do not compound when combined.
Conclusion: SLMs can extract structured information accurately in a specialised clinical domain (paediatric renal biopsy) in a lightweight CPU-only environment, greatly reducing reliance on compute and clinician time while keeping data local and protecting patient privacy.
Abstract: Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.
[47] Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs
Jason Chan,Robert Gaizauskas,Zhixue Zhao
Main category: cs.CL
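TL;DR: This paper argues that when large language models (LLMs) are combined with formal logic for fact-checking, relying on logical soundness alone misses claims that mislead through the inferences humans actually draw; the authors propose using the human-like reasoning of LLMs as a complementary check on the outputs of formal verification.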
Details
Motivation: Existing logic-based formal fact-checking methods cannot identify conclusions that are logically valid yet misleading to humans because they conflict with common sense or pragmatic intuition; this is a structural flaw.
Method: Draws on cognitive science and pragmatics to build a typology of misleading logical conclusions, and argues that the human-like reasoning tendencies of LLMs should be treated as an asset for validating the outputs of formal modules in neurosymbolic systems.
Result: Exposes a systematic divergence between logical soundness and the inferences humans find acceptable, and proposes using LLMs as proxies for human reasoning to detect and filter potentially misleading conclusions from formal reasoning.
Conclusion: Logical soundness alone is not enough to make fact-checking reliable; a complementary architecture should let LLMs take part in reviewing the semantic and pragmatic plausibility of formal reasoning results.
Abstract: As large language models (LLMs) are increasingly integrated into fact-checking pipelines, formal logic is often proposed as a rigorous means by which to mitigate bias, errors and hallucinations in these models' outputs. For example, some neurosymbolic systems verify claims by using LLMs to translate natural language into logical formulae and then checking whether the proposed claims are logically sound, i.e. whether they can be validly derived from premises that are verified to be true. We argue that such approaches structurally fail to detect misleading claims due to systematic divergences between conclusions that are logically sound and inferences that humans typically make and accept. Drawing on studies in cognitive science and pragmatics, we present a typology of cases in which logically sound conclusions systematically elicit human inferences that are unsupported by the underlying premises. Consequently, we advocate for a complementary approach: leveraging the human-like reasoning tendencies of LLMs as a feature rather than a bug, and using these models to validate the outputs of formal components in neurosymbolic systems against potentially misleading conclusions.
[48] Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models
Mir Tafseer Nayeem,Davood Rafiei
Main category: cs.CL
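TL;DR: Through a postcolonial lens, this paper systematically examines dialectal asymmetry between American English (AmE) and British English (BrE) in large language models (LLMs), finding that pretraining corpora, tokenizers, and generated outputs all skew toward AmE; it exposes structural bias in the technical pipeline and calls for more dialectally inclusive language technologies.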
Details
Motivation: Despite the global diversity and colonial history of English, current LLMs expose only limited language settings (such as "English (US)") and overlook inequalities between standard varieties of English; this paper aims to show how geopolitical history shapes the whole LLM development pipeline through data curation, digital dominance, and linguistic standardization.
Method: Builds a corpus of 1,813 AmE-BrE variant pairs; proposes DiAlign, a dynamic, training-free method that estimates dialectal alignment from distributional evidence; triangulates across three stages: (i) auditing the dialect distribution of six major pretraining corpora, (ii) analysing the segmentation cost that tokenizers impose on BrE, and (iii) evaluating dialect preference in generated outputs.
Result: Finds that (i) mainstream pretraining corpora skew systematically toward AmE, (ii) BrE forms incur higher tokenization overhead, and (iii) model generations persistently prefer AmE; this is the first systematic study of dialectal asymmetry between standard English varieties across all phases of LLM development.
Conclusion: Contemporary LLMs treat AmE as the default norm, which deepens linguistic homogenization, epistemic injustice, and inequity in global AI deployment; dialectally inclusive language technology practices are urgently needed.
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they expose only limited language settings, most notably "English (US)," despite the global diversity and colonial history of English. Through a postcolonial framing to explain the broader significance, we investigate how geopolitical histories of data curation, digital dominance, and linguistic standardization shape the LLM development pipeline. Focusing on two dominant standard varieties, American English (AmE) and British English (BrE), we construct a curated corpus of 1,813 AmE-BrE variants and introduce DiAlign, a dynamic, training-free method for estimating dialectal alignment using distributional evidence. We operationalize structural bias by triangulating evidence across three stages: (i) audits of six major pretraining corpora reveal systematic skew toward AmE, (ii) tokenizer analyses show that BrE forms incur higher segmentation costs, and (iii) generative evaluations show a persistent AmE preference in model outputs. To our knowledge, this is the first systematic and multi-faceted examination of dialectal asymmetries in standard English varieties across the phases of LLM development. We find that contemporary LLMs privilege AmE as the de facto norm, raising concerns about linguistic homogenization, epistemic injustice, and inequity in global AI deployment, while motivating practical steps toward more dialectally inclusive language technologies.
[49] DARE: Diffusion Large Language Models Alignment and Reinforcement Executor
Jingyi Yang,Yuxian Jiang,Xuhao Hu,Shuang Cheng,Biqing Qi,Jing Shao
Main category: cs.CL
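TL;DR: This paper introduces DARE, an open-source framework for post-training and evaluating diffusion large language models (dLLMs) that addresses the fragmentation of the current dLLM ecosystem by unifying multiple training paradigms and supporting reproducible benchmark evaluation.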
Details
Motivation: The open-source ecosystem for diffusion large language models (dLLMs) is badly fragmented, especially in post-training pipelines (reinforcement learning objectives, rollout implementations, and evaluation scripts), which lack common standards; this slows research iteration, reproduction, and fair comparison.
Method: Builds DARE, a unified framework on top of verl and OpenCompass that integrates supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning, supports both masked and block diffusion architectures, and covers several mainstream dLLM families.
Result: On representative models including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration; experiments indicate that it can serve as a general research substrate for developing, comparing, and deploying dLLM post-training methods.
Conclusion: DARE provides standardized, modular, extensible open-source infrastructure for dLLM post-training, lowering the engineering burden and enabling fair algorithm comparison and faster iteration across the dLLM research ecosystem.
Abstract: Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present DARE (dLLMs Alignment and Reinforcement Executor), an open framework for post-training and evaluating dLLMs. Built on top of verl and OpenCompass, DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results position DARE as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.
[50] CAWN: Continuous Acoustic Wave Networks for Autoregressive Language Modeling
Dejan Čugalj,Aleksandar Jevremovic
Main category: cs.CL
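TL;DR: This paper proposes CAWN, a continuous acoustic wave network architecture that uses complex-domain phase accumulation and selective phase resonance to mitigate signal degradation over long contexts while keeping linear computational complexity ($O(L)$), and reports strong performance on retrieval over ultra-long text.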
Details
Motivation: Current large language models rely on Transformer self-attention, whose compute and memory costs grow quadratically with sequence length; linear alternatives such as SSMs often suffer signal degradation over ultra-long contexts.
Method: Proposes CAWN: causal sequence mixing through continuous complex-domain phase accumulation; a dual-gated Selective Phase Resonance mechanism (Frequency-Dependent Retention, Hard-Threshold Gating via Straight-Through Estimation, and a Temporal Syntax Cache); Depth-wise Harmonic Convolutions in place of dense linear projections, plus Block Attention Residuals; custom Triton kernels for efficient complex phase computation in float32.
Result: A 150M-parameter prototype, trained in a streaming loop on a 100B-token corpus and evaluated at the 5B-token milestone, shows robust vocabulary acquisition and long-range contextual denoising under a Targeted Semantic Retrieval protocol; it retrieves targeted information across a 2,000,000-token context with peak VRAM plateauing at 8.72 GB, overcoming the $O(L^2)$ memory wall.
Conclusion: CAWN, a fully continuous sequence-mixing architecture, combines linear scaling with long-range modeling ability and offers a new paradigm for ultra-long-context modeling.
Abstract: Modern Large Language Models (LLMs) rely on Transformer self-attention, which scales quadratically with sequence length. Recent linear-time alternatives, like State Space Models (SSMs), often suffer from signal degradation over extended contexts. We introduce the Continuous Acoustic Wave Network (CAWN), a fully continuous sequence-mixing architecture. Instead of discrete matrix-based attention, CAWN projects hidden states into multi-headed complex-domain phasors, achieving sequence mixing through a causal, $O(L)$ Phase Accumulation mechanism. To prevent signal degradation over ultra-long contexts, we introduce a dual-gated Selective Phase Resonance mechanism incorporating Frequency-Dependent Retention, Hard-Threshold Gating via Straight-Through Estimation, and a Temporal Syntax Cache to capture short-term local dependencies. We also replace standard dense linear projections with Depth-wise Harmonic Convolutions for optimal spatial frequency mixing, augmented by Block Attention Residuals for depth-wise state routing. Scaled to a 150M-parameter model, CAWN utilizes custom Triton kernels for hardware-efficient, true-complex phase accumulation in float32. Trained via a continuous streaming loop on a 100-Billion-token corpus, the prototype is evaluated at a 5-Billion-token milestone. Empirical evaluations via a Targeted Semantic Retrieval protocol demonstrate robust vocabulary acquisition and extended explicitly learned contextual denoising. By leveraging $O(1)$ state-passing via chunked prefill, the model retrieves targeted information across 2,000,000 tokens while strictly plateauing at 8.72 GB of Peak VRAM, empirically overcoming the $O(L^2)$ context memory wall.
[51] Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Validation
Yongmin Yoo,Qiongkai Xu,Longbing Cao
Main category: cs.CL
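TL;DR: This paper proposes ACE, a framework that uses predictive entropy to adaptively route high-uncertainty patent claims to an expert large language model (LLM) for Chain of Patent Thought (CoPT) verification grounded in 35 U.S.C. It maintains high accuracy (F1 = 94.95%) while cutting operational cost by 78%, and it introduces ACE-40k, a benchmark of 40,000 claims with error annotations.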
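The entropy-based routing summarized above can be sketched in a few lines. This is a minimal illustration, assuming a lightweight classifier that outputs class probabilities per claim; the threshold value, the route labels, and the function names are hypothetical and not taken from the paper.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_claim(probs, threshold=0.5):
    """Return "expert_llm" for high-uncertainty claims, otherwise "encoder".

    `threshold` is a hypothetical cut-off; in practice it would be tuned
    on held-out data to trade accuracy against cost.
    """
    return "expert_llm" if predictive_entropy(probs) > threshold else "encoder"


def expert_fraction(batch, threshold=0.5):
    """Share of claims in `batch` that would be sent to the expert LLM."""
    routed = [route_claim(probs, threshold) for probs in batch]
    return routed.count("expert_llm") / len(routed)
```

Under this kind of scheme the cost saving comes from `expert_fraction` being small: only the claims the cheap model is unsure about incur the expensive call.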
Details
Motivation: Existing automated patent-claim validation faces a rigidity-resource dilemma: lightweight encoders struggle with complex legal dependencies, exhaustive use of large language models (LLMs) is too costly, and the task demands zero-defect tolerance.
Method: Proposes ACE (Adaptive Cost-efficient Evaluation), a hybrid framework that uses predictive entropy to identify high-uncertainty claims and sends only those to an expert LLM, which verifies them with a Chain of Patent Thought (CoPT) protocol grounded in 35 U.S.C. statutory standards.
Result: ACE reaches 94.95% F1, the best among the evaluated methods; reduces operational cost by 78% relative to standalone LLM deployment; and releases the ACE-40k benchmark (40,000 claims with MPEP-grounded error annotations).
Conclusion: Through intelligent routing and a domain-tailored reasoning protocol, ACE balances legal rigor against computational efficiency and offers a practical path to reliable, low-cost automated patent examination.
Abstract: Automated validation of patent claims demands zero-defect tolerance, as even a single structural flaw can render a claim legally defective. Existing evaluation paradigms suffer from a rigidity-resource dilemma: lightweight encoders struggle with nuanced legal dependencies, while exhaustive verification via Large Language Models (LLMs) is prohibitively costly. To bridge this gap, we propose ACE (Adaptive Cost-efficient Evaluation), a hybrid framework that uses predictive entropy to route only high-uncertainty claims to an expert LLM. The expert then executes a Chain of Patent Thought (CoPT) protocol grounded in 35 U.S.C. statutory standards. This design enables ACE to handle long-range legal dependencies more effectively while preserving efficiency. ACE achieves the best F1 among the evaluated methods at 94.95%, while reducing operational costs by 78% compared to standalone LLM deployments. We also construct ACE-40k, a 40,000-claim benchmark with MPEP-grounded error annotations, to facilitate further research.
[52] High-Stakes Personalization: Rethinking LLM Customization for Individual Investor Decision-Making
Yash Ganpat Sawant
Main category: cs.CL
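TL;DR: This paper examines the fundamental limitations of existing large language model (LLM) personalization methods in individual investor decision-making, a high-stakes and temporally extended domain, and, drawing on a deployed system, identifies four key challenges with corresponding architectural responses.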
Details
Motivation: Individual investor decision-making involves behavioral patterns that evolve over time, investment rationale that must stay consistent over long periods, tension between subjective style and objective signals, and no deterministic evaluation standard; existing LLM personalization paradigms fit it poorly.
Method: Drawing on an AI-augmented portfolio management system that the author built and deployed, the paper identifies four challenges that personalized LLMs face in individual investing and summarizes the architectural responses that emerged in practice.
Result: Identifies four core challenges (behavioral memory complexity, thesis consistency under drift, style-signal tension, and alignment without ground truth) and proposes research directions for personalized NLP in high-stakes, temporally extended decision settings.
Conclusion: Current LLM personalization methods work well in stable, highly subjective domains but have fundamental limitations in high-stakes, long-horizon decision tasks without clear ground truth, such as individual investing; new paradigms are needed.
Abstract: Personalized LLM systems have advanced rapidly, yet most operate in domains where user preferences are stable and ground truth is either absent or subjective. We argue that individual investor decision-making presents a uniquely challenging domain for LLM personalization - one that exposes fundamental limitations in current customization paradigms. Drawing on our system, built and deployed for AI-augmented portfolio management, we identify four axes along which individual investing exposes fundamental limitations in standard LLM customization: (1) behavioral memory complexity, where investor patterns are temporally evolving, self-contradictory, and financially consequential; (2) thesis consistency under drift, where maintaining coherent investment rationale over weeks or months strains stateless and session-bounded architectures; (3) style-signal tension, where the system must simultaneously respect personal investment philosophy and surface objective evidence that may contradict it; and (4) alignment without ground truth, where personalization quality cannot be evaluated against a fixed label set because outcomes are stochastic and delayed. We describe the architectural responses that emerged from building the system and propose open research directions for personalized NLP in high-stakes, temporally extended decision domains.
[53] How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
Yujian Liu,Jiabao Ji,Li An,Tommi Jaakkola,Yang Zhang,Shiyu Chang
Main category: cs.CL
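TL;DR: This paper presents the first systematic study of how LLM agents perform when using a large real-world skill collection (34k skills) in realistic, complex settings, and finds that skill utility degrades markedly under non-ideal conditions. Query-specific skill refinement recovers much of the lost performance and raises the pass rate of Claude Opus 4.6 on Terminal-Bench 2.0.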
Details
Motivation: Existing skill evaluations are overly idealized (for example, directly supplying hand-crafted, precisely tailored skills) and do not reflect realistic settings in which agents must retrieve, select, and even adapt imperfect skills on their own; a rigorous assessment of practical skill utility is lacking.
Method: Builds progressively more challenging evaluation settings, from idealized to realistic; tests skill retrieval and use over a collection of 34k real-world skills; compares skill refinement strategies (query-specific vs. query-agnostic); and validates generality on Terminal-Bench 2.0.
Result: Skill gains decline steadily as settings become more realistic, approaching the no-skill baseline in the hardest setting; query-specific refinement substantially improves performance, especially when the initial skills are of reasonable quality; on Terminal-Bench 2.0 the pass rate of Claude Opus 4.6 rises from 57.7% to 65.5%.
Conclusion: Skills are promising, but their practical utility depends heavily on retrieval and adaptation. The current skill paradigm has clear limitations in open, dynamic real-world environments, and more robust retrieval and refinement mechanisms are needed.
Abstract: Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formally benchmarking skill usage performance remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly-tailored task-specific skills for each task, whereas in many realistic settings, the LLM agent may have to search for and select relevant skills on its own, and even the closest matching skills may not be well-tailored for the task. In this paper, we conduct the first comprehensive study of skill utility under progressively challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and we show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.
[54] Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction
Jinrui Fang,Runhan Chen,Xu Yang,Jian Yu,Jiawei Xu,Ashwin Vinod,Wenqi Shi,Tianlong Chen,Heng Ji,ChengXiang Zhai,Ying Ding,Yuji Zhang
Main category: cs.CL
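TL;DR: This paper introduces MINT, a multi-turn medical diagnosis benchmark, and identifies three behavioral patterns of large language models in multi-turn clinical reasoning: premature answering, a self-correction capacity that premature answering suppresses, and susceptibility to strong lures. It offers clinically actionable recommendations such as deferring the diagnostic question and reserving salient evidence for later turns.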
Details
Motivation: To examine how large language models behave in diagnosis under multi-turn evidence accumulation, which is closer to real clinical reasoning, rather than only under single-turn, full-information input.
Method: Builds MINT, a high-fidelity multi-turn medical diagnosis benchmark (1,035 cases, clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition), and systematically evaluates the behavior of 11 LLMs.
Result: Three patterns emerge: (1) premature answering, with over 55% of answers committed within the first two turns; (2) a strong capacity for self-correction that premature answering often forecloses; and (3) strong clinical lures, such as laboratory results, that trigger premature answering. Deferring the diagnostic question improves accuracy at the first point of commitment by up to 62.6%, and reserving salient evidence for later turns avoids an accuracy drop of up to 23.3%.
Conclusion: LLMs show significant reliability deficits in multi-turn medical diagnosis that interaction design, such as deferred questions and evidence scheduling, can mitigate; MINT provides a controlled evaluation framework and practical guidance for this area.
Abstract: Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation closer to real clinical reasoning remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer, models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction, incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures, clinically salient information such as laboratory results trigger premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.
[55] GROUNDEDKG-RAG: Grounded Knowledge Graph Index for Long-document Question Answering
Tianyi Zhang,Andreas Marfurt
Main category: cs.CL
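TL;DR: This paper proposes GroundedKG-RAG, a retrieval-augmented generation system for long-document question answering built on a knowledge graph (KG) that is explicitly extracted from and grounded in the source document, with the aim of improving efficiency, factual accuracy, and interpretability.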
Details
Motivation: Existing RAG systems for long-document question answering rely on LLM-generated descriptions, which causes high resource consumption and latency; they also repeat content across hierarchical levels and hallucinate because they have little or no grounding in the source text.
Method: Proposes GroundedKG-RAG: a knowledge graph is built explicitly from semantic role labeling (SRL) and abstract meaning representation (AMR) parses, with entities and actions as nodes and temporal or semantic relations as edges, each grounded in the original sentences. The graph is embedded for retrieval; at query time the same transformation is applied to the question, and the most relevant source sentences are retrieved for answering.
Result: On the NarrativeQA dataset, GroundedKG-RAG performs on par with a state-of-the-art proprietary long-context model at lower cost and outperforms a competitive baseline; the resulting knowledge graph is human-readable and auditable and supports error analysis.
Conclusion: A knowledge graph that is explicitly built from and grounded in the source text can improve the efficiency, factual accuracy, and interpretability of RAG systems, offering a new paradigm for long-document question answering.
Abstract: Retrieval-augmented generation (RAG) systems have been widely adopted in contemporary large language models (LLMs) due to their ability to improve generation quality while reducing the required input context length. In this work, we focus on RAG systems for long-document question answering. Current approaches suffer from a heavy reliance on LLM descriptions resulting in high resource consumption and latency, repetitive content across hierarchical levels, and hallucinations due to no or limited grounding in the source text. To improve both efficiency and factual accuracy through grounding, we propose GroundedKG-RAG, a RAG system in which the knowledge graph is explicitly extracted from and grounded in the source document. Specifically, we define nodes in GroundedKG as entities and actions, and edges as temporal or semantic relations, with each node and edge grounded in the original sentences. We construct GroundedKG from semantic role labeling (SRL) and abstract meaning representation (AMR) parses and then embed it for retrieval. During querying, we apply the same transformation to the query and retrieve the most relevant sentences from the grounded source text for question answering. We evaluate GroundedKG-RAG on examples from the NarrativeQA dataset and find that it performs on par with a state-of-the-art proprietary long-context model at smaller cost and outperforms a competitive baseline. Additionally, our GroundedKG is interpretable and readable by humans, facilitating auditing of results and error analysis.
[56] Compressible Softmax-Attended Language under Incompressible Attention
Wonsuk Lee
Main category: cs.CL
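TL;DR: This paper studies the spectral properties of attention in Transformer language models and finds that the logit energy field captures most of its variance in a few singular components, indicating that language interaction is highly compressible and that this property comes from the data rather than from the model structure.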
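The headline statistic here, how many singular components are needed to reach 90% of the variance, can be computed from a list of singular values as sketched below. The sketch assumes that explained variance is proportional to the squared singular values, which is the usual convention but is not spelled out in this entry; the function name is hypothetical.

```python
def components_for_variance(singular_values, target=0.90):
    """Smallest k such that the top-k singular values explain `target` of the variance.

    Assumes explained variance is proportional to the squared singular values.
    """
    energy = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energy)
    if total == 0:
        return 0
    running = 0.0
    for k, e in enumerate(energy, start=1):
        running += e
        if running / total >= target:
            return k
    return len(energy)
```

A sharply decaying spectrum such as `[10, 1, 1]` reaches the threshold with one component, while a flat spectrum such as `[1, 1, 1, 1]` needs all four, which mirrors the contrast the paper draws between the logit energy field and the learned interaction matrix.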
Details
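Motivation: To probe how much information attention in Transformer models actually carries and how efficiently it uses its dimensions, and to understand where "redundancy" and "compressibility" in language modeling come from.
Method: Applies singular value decomposition (SVD) to every attention head of five Transformer language models of different scales and architectures (124M–7B parameters) and analyzes the spectra (for example, effective rank and spectral gap) of the logit energy field $\tilde{E}$ and the learned interaction matrix $W_Q^\mathrm{T} W_K$.
Result: The logit energy field $\tilde{E}$ needs only 2–11 singular components to explain 90% of its variance, whereas $W_Q^\mathrm{T} W_K$ needs 38–75 components (out of $d_h=64$ or $128$); the gap in effective rank is 5–25×. The attention mechanism allocates capacity uniformly across dimensions, but the actual language interaction concentrates in a few of them.
Conclusion: The compressibility of softmax-attended language is an inherent property of the data, not an artifact of the analysis frame; the models carry substantial dimensional redundancy, which provides a theoretical basis for efficient compression and sparsification.
Abstract: Across every attention head in five transformer language models (124M--7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90% of its variance in 2--11 singular components. The learned interaction matrix $W_Q^\mathrm{T} W_K$ needs 38--75 components for the same threshold out of $d_h \in \{64, 128\}$. The spectral gap is $5$--$25\times$ in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
[57] How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models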
Gregory N. Frank
Main category: cs.CL
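TL;DR: This paper finds a recurring sparse routing mechanism in alignment-trained language models: a gate attention head detects content and triggers downstream amplifier heads that boost the refusal signal. The mechanism is validated across 9 models from 6 labs and traced using political censorship and safety refusal as natural experiments.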
Details
Motivation: To investigate whether alignment-trained language models contain a reproducible internal mechanism tied to safety refusal, particularly in political censorship and safety refusal settings.
Method: Uses political censorship and safety refusal as natural experiments to trace the sparse routing between gate attention heads and amplifier heads; assesses statistical significance and stability with interchange tests, bootstrap resampling, and ablation; and modulates the detection-layer signal to control policy strength continuously.
Result: Confirms a gate-amplifier routing circuit that recurs across models. The gate head passes necessity and sufficiency interchange tests (p < 0.001), and core amplifier heads are highly stable under bootstrap resampling (Jaccard 0.92–1.0). Routing becomes more distributed as models scale (ablation effects up to 17x weaker) while remaining detectable by interchange. Signal modulation continuously controls refusal strength. Under cipher encoding, gate routing collapses while semantic understanding persists, revealing a structural separation between intent recognition and policy routing.
Conclusion: Safety refusal in aligned models depends on a fragile but detectable sparse routing circuit, which separates the broad semantic understanding acquired in pretraining from the narrower, less generalizable policy binding acquired in post-training. The finding opens a path toward interpretability-driven safety evaluation and intervention.
Abstract: We identify a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads detected content and triggers downstream amplifier heads that boost the signal toward refusal. Using political censorship and safety refusal as natural experiments, we trace this mechanism across 9 models from 6 labs, all validated on corpora of 120 prompt pairs. The gate head passes necessity and sufficiency interchange tests (p < 0.001, permutation null), and core amplifier heads are stable under bootstrap resampling (Jaccard 0.92-1.0). Three same-generation scaling pairs show that routing distributes at scale (ablation up to 17x weaker) while remaining detectable by interchange. By modulating the detection-layer signal, we continuously control policy strength from hard refusal through steering to factual compliance, with routing thresholds that vary by topic. The circuit also reveals a structural separation between intent recognition and policy routing: under cipher encoding, the gate head's routing contribution collapses (78% in Phi-4 at n=120) while the model responds with puzzle-solving rather than refusal. The routing mechanism never fires, even though probe scores at deeper layers indicate the model begins to represent the harmful content. This asymmetry is consistent with different robustness properties of pretraining and post-training: broad semantic understanding versus narrower policy binding that generalizes less well under input transformation.
[58] Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding
Haruka Kawasaki,Ryota Tanaka,Kyosuke Nishida
Main category: cs.CL
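TL;DR: Using linear probing, this paper studies how well each layer of large vision language models (LVLMs) represents the key information needed for visual document understanding (VDU) tasks and finds that intermediate layers encode it more linearly than the final layer. A fine-tuning strategy that targets intermediate layers improves performance and narrows the gap between internal representations and generated responses.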
Details
Motivation: Existing evaluations of LVLMs on VDU tasks rely only on generated responses, which cannot show whether the model has actually captured the required information internally, so the internal representations need to be examined.
Method: Uses linear probing to analyze how well each layer of the LLM component inside LVLMs represents the information required for VDU tasks, and designs fine-tuning strategies that target intermediate layers based on the findings.
Result: Intermediate layers encode the information required for VDU tasks more linearly than the final layer; fine-tuning intermediate layers improves both linear probing accuracy and response accuracy and narrows the gap between representations and responses.
Conclusion: There is a clear gap between the internal representations and the final outputs of LVLMs on VDU tasks. Intermediate layers hold more usable task-relevant information, and fine-tuning them is an effective way to improve VDU performance.
Abstract: Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layouts. Although recent LVLMs have shown progress on VDU benchmarks, their performance is typically evaluated based on generated responses, which may not necessarily reflect whether the model has actually captured the required information internally. In this paper, we investigate how information required to solve VDU tasks is represented across different layers of LLMs within LVLMs using linear probing. Our study reveals that (1) there is a clear gap between internal representations and generated responses, and (2) information required to solve the task is often encoded more linearly from intermediate layers than from the final layer. Motivated by these findings, we explore fine-tuning strategies that target intermediate layers. Experiments show that fine-tuning intermediate layers improves both linear probing accuracy and response accuracy while narrowing the gap.
[59] Structured Causal Video Reasoning via Multi-Objective Alignment
Zinuo Li,Yongxin Guo,Jun Liu,Jiawei Zhan,Xi Jiang,Chengjie Wang,Mohammed Bennamoun,Farid Boussaid,Feng Zheng,Qiuhong Ke
Main category: cs.CL
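TL;DR: This paper proposes Structured Event Facts, a compact structured event representation that strengthens causal reasoning in video understanding, together with the CausalFact-60K dataset and a four-stage training pipeline that includes multi-objective reinforcement learning. The resulting model, Factum-4B, performs better on fine-grained temporal reasoning tasks.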
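To make the multi-objective trade-off concrete, the sketch below picks out the Pareto-optimal candidates from a set of score vectors. It only illustrates what a Pareto frontier is; it is not the paper's optimization procedure. The three objectives in the usage note (structural completeness, causal fidelity, negated reasoning length) are an assumed encoding in which higher is better for every coordinate.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and better on one.

    All objectives are treated as "higher is better".
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```

For example, with candidates `(0.9, 0.8, -300)` and `(0.7, 0.9, -200)`, neither dominates the other: the first is more complete, the second is more faithful and shorter, so both lie on the frontier.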
Details
Motivation: Human understanding of video dynamics rests on a structured mental representation of entities, actions, and temporal relations, whereas existing Video-LLMs depend on loose, text-based video reasoning, which makes causal inference fragile and inefficient; the paper aims to bridge this cognitive gap.
Method: Constructs Structured Event Facts as an explicit constraint applied before reasoning; introduces the CausalFact-60K dataset and a four-stage training pipeline (facts alignment, format warm-start, thinking warm-start, and RL-based post-training); and formulates the RL stage as a Multi-Objective Reinforcement Learning (MORL) problem that balances structural completeness and causal fidelity against reasoning length.
Result: Yields more reliable and more easily verified video reasoning, stronger performance on challenging video understanding tasks that require fine-grained temporal inference, and a new model, Factum-4B.
Conclusion: Introducing a structured prior and optimizing it with multi-objective reinforcement learning improves the causal reasoning ability and robustness of Video-LLMs and offers a new paradigm for video understanding.
Abstract: Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient processes and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making it difficult to optimize. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto-Frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
[60] DeonticBench: A Benchmark for Reasoning over Rules
Guangyao Dou,Luis Brena,Akhil Deo,William Jurayj,Jingyu Zhang,Nils Holzenberger,Benjamin Van Durme
Main category: cs.CL
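TL;DR: This paper introduces DEONTICBENCH, a deontic reasoning benchmark for real-world regulatory settings (tax law, airline baggage policies, immigration law, and housing law) with 6,232 tasks, supporting both chain-of-thought reasoning in language and generation of executable Prolog programs. Current frontier models perform poorly on it (at best 46.6 macro-F1), and training methods such as reinforcement learning do not yet make solving reliable.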
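Since the headline number above is a macro-F1 score, a short sketch of that metric may help when reading it. Macro-F1 averages the per-class F1 scores with equal weight, so a rare class counts as much as a frequent one. The label names in the usage note are hypothetical, and the benchmark's own scoring script may differ in details such as which label set it averages over.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With gold labels `["permit", "permit", "forbid", "forbid"]` and predictions `["permit", "forbid", "forbid", "forbid"]`, the per-class F1 values are 2/3 and 4/5, and the macro average is their mean.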
Details
Motivation: Current large language models are weak at reasoning with complex, context-specific rules, especially deontic reasoning about obligations, permissions, and prohibitions in law and policy. Mainstream benchmarks mostly target short-context mathematical reasoning, and few address long-context, high-stakes deontic reasoning.
Method: Builds the DEONTICBENCH benchmark of 6,232 tasks across four real-world legal and policy domains, supporting both free-form chain-of-thought reasoning and a Prolog-based symbolic solver workflow; provides reference Prolog programs for all instances; and studies how supervised fine-tuning and reinforcement learning affect symbolic program generation.
Result: The best hard-subset performance of frontier LLMs and coding models is only 44.4% accuracy on SARA Numeric and 46.6 macro-F1 on Housing. Training improves Prolog generation quality, but current RL methods still fail to solve the tasks reliably.
Conclusion: DEONTICBENCH fills a gap in evaluating real-world rule reasoning, exposes fundamental limits of current LLMs in context-sensitive, formalized rule reasoning, and underlines the need to combine symbolic and non-symbolic methods.
Abstract: Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.
[61] Conversational Control with Ontologies for Large Language Models: A Lightweight Framework for Constrained Generation
Barbara Gendron,Gaël Guibon,Mathieu d'Aquin
Main category: cs.CL
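TL;DR: This paper proposes an end-to-end, ontology-based method for modular and explainable control over the conversational output of large language models (LLMs). Key conversational aspects, such as English proficiency level and sentiment polarity, are modeled as constraints and combined with hybrid fine-tuning; experiments on several open-weight conversational models show that the method is effective, lightweight, and transferable.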
Details
Motivation: Conversational agents built on large language models are black boxes, which makes them unpredictable and hard to personalize; controlled generation can address both problems.
Method: Defines ontologies for key conversational aspects, models them as generation constraints, and trains seven mainstream open-weight conversational LLMs for end-to-end controlled generation with a hybrid fine-tuning procedure.
Result: On two tasks, controlling English proficiency level and controlling the polarity of the content, the method consistently outperforms pre-trained baselines, including on smaller models; the framework is model-agnostic, lightweight, and interpretable.
Conclusion: Ontology-driven controlled generation is an effective way to improve strategy alignment and interpretability in conversational systems and can be extended to new domains and interaction goals.
Abstract: Conversational agents based on Large Language Models (LLMs) have recently emerged as powerful tools for human-computer interaction. Nevertheless, their black-box nature implies challenges in predictability and a lack of personalization, both of which can be addressed by controlled generation. This work proposes an end-to-end method to obtain modular and explainable control over LLM outputs through ontological definitions of aspects related to the conversation. Key aspects are modeled and used as constraints; we then further fine-tune the LLM to generate content accordingly. To validate our approach, we explore two tasks that tackle two key conversational aspects: the English proficiency level and the polarity profile of the content. Using a hybrid fine-tuning procedure on seven state-of-the-art, open-weight conversational LLMs, we show that our method consistently outperforms pre-trained baselines, even on smaller models. Beyond quantitative gains, the framework remains model-agnostic, lightweight, and interpretable, enabling reusable control strategies that can be extended to new domains and interaction goals. This approach enhances alignment with strategy instructions and demonstrates the effectiveness of ontology-driven control in conversational systems.
[62] Same Geometry, Opposite Noise: Transformer Magnitude Representations Lack Scalar Variability
Jon-Paul Cacioli
Main category: cs.CL
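TL;DR: This paper tests whether the noise in the numerical representations of large language models (such as Llama-3 and Mistral) shows the "scalar variability" common in biological systems, where noise grows in proportion to magnitude. It finds the opposite: representational variability decreases as magnitude grows, an anti-scalar pattern, which indicates that distributional learning alone does not reproduce the constant coefficient of variation seen in biology.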
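The scaling exponent that this entry reports can be read as the slope of a log-log fit of dispersion against magnitude. The sketch below is a minimal version of such a fit using ordinary least squares; whether the paper uses exactly this estimator is an assumption, and the function name is hypothetical. Under scalar variability the standard deviation is proportional to the magnitude, so the slope is 1 and the coefficient of variation is constant; a negative slope is the anti-scalar pattern.

```python
import math


def scaling_exponent(magnitudes, dispersions):
    """Least-squares slope of log(dispersion) against log(magnitude).

    Both inputs must be positive. A slope near 1 corresponds to scalar
    variability; a slope below 0 means dispersion shrinks as magnitude grows.
    """
    xs = [math.log(m) for m in magnitudes]
    ys = [math.log(d) for d in dispersions]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Feeding it synthetic data generated as `sd = 0.15 * m` recovers a slope of about 1, while `sd = m ** -0.19` recovers about -0.19, the value the abstract reports along the magnitude axis.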
Details
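Motivation: To test whether Transformer language models show scalar variability, a key feature of biological magnitude systems, and thus whether distributional learning naturally produces a biology-like noise structure in numerical representations.
Method: For three 7–8B-parameter models and 26 numerical magnitudes, measures the dispersion (standard deviation) of hidden-state representations along the magnitude axis and computes the scaling exponent α of dispersion with magnitude; also controls for sentence identity, compares full-dimensional and orthogonal dimensions, and correlates variability with corpus frequency.
Result: All models show a negative scaling exponent (α ≈ -0.19 along the magnitude axis; -0.04 in full-dimensional space; -0.007 after sentence-identity correction), so representational variability falls as magnitude grows. The effect is 3–5x stronger along the magnitude axis than in orthogonal dimensions, and corpus frequency correlates strongly and positively with per-magnitude variability (ρ = .84).
Conclusion: Transformer models learn a log-compressive magnitude geometry but do not reproduce the constant coefficient of variation (CV) noise signature of biological systems, which shows that distributional learning alone is insufficient to produce scalar variability.
Abstract: Scalar variability -- the finding that representational noise scales proportionally with magnitude, producing a constant coefficient of variation -- is a hallmark of biological magnitude systems. We tested whether transformer language models exhibit this property by analysing the dispersion of hidden-state representations across carrier sentences for 26 numerical magnitudes in three 7-8B parameter models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base; data from Cacioli, 2026). We found the opposite: representational variability decreased with magnitude along the magnitude axis (scaling exponent alpha approx -0.19; 0/16 primary layers with alpha > 0, all three models). The negative sign was consistent in full-dimensional space (alpha approx -0.04) and after sentence-identity correction (alpha approx -0.007). The anti-scalar pattern was 3-5x stronger along the magnitude axis than orthogonal dimensions, and corpus frequency strongly predicted per-magnitude variability (rho = .84). These results demonstrate that distributional learning alone is insufficient to produce scalar variability: transformers reproduce log-compressive magnitude geometry but not the constant-CV noise signature observed in biological systems.
[63] CommonMorph: Participatory Morphological Documentation Platform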
Aso Mahmudi,Sina Ahmadi,Kemal Kurniawan,Rico Sennrich,Eduard Hovy,Ekaterina Vylomova
Main category: cs.CL
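TL;DR: This paper presents CommonMorph, a platform that lowers the barrier to collecting and annotating morphological data, especially for low-resource languages, through a three-stage approach of expert definition, contributor elicitation, and community validation.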
Details
Motivation: Collecting and annotating morphological data requires linguistic expertise, rigorous methodology, and substantial resources, which is especially hard for low-resource languages.
Method: Proposes the CommonMorph platform, with a three-tiered design (expert definition, contributor elicitation, community validation) that integrates active learning, annotation suggestions, and reuse of materials from related languages, and supports several morphological types (fusional, agglutinative, and root-and-pattern).
Result: Delivers an open-source, UniMorph-compatible platform for morphological data collection that supports multiple morphological types and is available at https://common-morph.com.
Conclusion: CommonMorph offers a replicable, collaborative, technology-driven way to build morphological data for low-resource languages and helps preserve linguistic diversity.
Abstract: Collecting and annotating morphological data present significant challenges, requiring linguistic expertise, methodological rigour, and substantial resources. These barriers are particularly acute for low-resource languages and varieties. To accelerate this process, we introduce CommonMorph, a comprehensive platform that streamlines morphological data collection development through a three-tiered approach: expert linguistic definition, contributor elicitation, and community validation. The platform minimises manual work by incorporating active learning, annotation suggestions, and tools to import and adapt materials from related languages. It accommodates diverse morphological systems, including fusional, agglutinative, and root-and-pattern morphologies. Its open-source design and UniMorph-compatible outputs ensure accessibility and interoperability with NLP tools. Our platform is accessible at https://common-morph.com, offering a replicable model for preserving linguistic diversity through collaborative technology.
[64] Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
Alhasan Mahmood,Samir Abdaljalil,Hasan Kurban
Main category: cs.CL
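TL;DR: This paper studies how the evaluation language affects model rankings in agentic code benchmarks, finds that different judge backbones perform markedly differently across languages, and argues that language should be treated as an explicit evaluation variable.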
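This entry reports inter-backbone agreement as Fleiss' κ, which measures agreement among a fixed number of raters beyond what chance would give. The sketch below follows the standard textbook definition; the table layout (one row per requirement, one column per verdict category) and the function name are assumptions made for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    that assigned item i to category j. Every item needs the same rater count.

    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    pair_total = n_raters * (n_raters - 1)
    p_items = [(sum(c * c for c in row) - n_raters) / pair_total for row in counts]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_exp = sum(p * p for p in p_cats)

    return (p_bar - p_exp) / (1 - p_exp)
```

With six hypothetical judges and two verdict categories, two unanimous items and one 3-3 split, `[[6, 0], [0, 6], [3, 3]]`, give κ = 0.6; values near the reported ceiling of 0.231 indicate that the judges often disagree on individual requirements.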
Details
Motivation: Evaluation language is usually fixed to English by default, but the authors show that this can distort model rankings, so they examine how different languages affect evaluation results.
Method: Localizes the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi); evaluates 55 DevAI development tasks with three developer-agent frameworks and six judge backbones, for 4,950 judge runs in total; and runs a controlled ablation.
Result: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%) and Hindi (53.22%); no backbone dominates across all languages; inter-backbone agreement is low (Fleiss' κ ≤ 0.231); and under partial localization, Hindi satisfaction drops from 42.8% to 23.2%.
Conclusion: Language should be treated as an explicit, key evaluation variable in agentic benchmarks rather than fixed to English by default.
Abstract: Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge's language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, $p<0.001$ vs. GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments is modest (Fleiss' $\kappa \leq 0.231$). A controlled ablation further shows that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8% to 23.2% under partial localization. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. Full requirement-level judgments and runtime statistics are released for reproducibility.
[65] Formal Constraints on Dependency Syntax
Carlos Gómez-Rodríguez,Lluís Alemany-Puig
Main category: cs.CL
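TL;DR: This paper discusses constraints on dependency syntax structures, in particular constraints that go beyond the traditional projectivity restriction, in order to describe natural language phenomena more accurately, especially in languages with flexible word order.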
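Projectivity, the baseline constraint this entry starts from, has a simple operational test: with an artificial root placed at position 0, a dependency tree is projective exactly when no two arcs cross. The sketch below checks that condition for a sentence given as a list of head positions. The encoding (1-based heads, 0 for the root) and the function name are choices made for this illustration, and the function does not verify that the input is a well-formed tree.

```python
def is_projective(heads):
    """Check for crossing arcs in a dependency tree.

    heads[i] is the 1-based position of the head of word i + 1; 0 marks the
    root, which is treated as an artificial node to the left of the sentence.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a1, b1 in arcs:
        for a2, b2 in arcs:
            # Two arcs cross when exactly one endpoint of the second lies
            # strictly inside the span of the first.
            if a1 < a2 < b1 < b2:
                return False
    return True
```

For instance, `[2, 0, 2]` (a verb at position 2 with a dependent on each side) is projective, while `[0, 4, 1, 1]` is not, because the arc between words 1 and 3 crosses the arc between words 2 and 4. The quadratic loop is fine for sentence-length inputs.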
Details
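Motivation: The traditional projectivity constraint is too strict to describe some linguistic phenomena adequately, especially in flexible-word-order languages; intermediate constraints are needed that match real language use and balance accuracy against computational efficiency.
Method: Surveys and analyzes the constraints that have been proposed for dependency tree structures, comparing their coverage of linguistic phenomena, their parsing efficiency, and their explanatory power for cognition and language evolution.
Result: Shows that there are several constraints weaker than projectivity but stronger than unrestricted structures, each with its own strengths and use cases in descriptive accuracy, parsing performance, and theoretical explanation.
Conclusion: A suitable constraint on dependency structures should be chosen or designed according to the linguistic phenomenon and the goal of the task (such as parsing, cognitive modeling, or typological research), rather than relying on projectivity alone.
Abstract: Dependency syntax represents the structure of a sentence as a tree composed of dependencies, i.e., directed relations between lexical units. While in its more general form any such tree is allowed, in practice many are not plausible or are very infrequent in attested language. This has motivated a search for constraints characterizing subsets of trees that better fit real linguistic phenomena, providing a more accurate linguistic description, faster parsing or insights on language evolution and human processing. Projectivity is the most well-studied such constraint, but it has been shown to be too restrictive to represent some linguistic phenomena, especially in flexible-word-order languages. Thus, a variety of constraints have been proposed to seek a realistic middle ground between the limitations of projectivity and the excessive leniency of unrestricted dependency structures.
[66] PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning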
Madhav S Baidya
Main category: cs.CL
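TL;DR: This paper proposes PassiveQA, a framework that uses supervised finetuning so that large language models choose to answer, ask for clarification, or abstain when information is insufficient, improving their epistemic awareness and robustness.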
Details
Motivation: Real-world user queries are often incomplete, ambiguous, or missing critical variables, while existing LLM and RAG systems lack awareness of the limits of their own knowledge and tend to produce hallucinated or overconfident answers.
Method: Proposes PassiveQA, a three-action framework (Answer / Ask / Abstain) that combines structured information-state representations, knowledge graph-grounded context, and a finetuned planner that explicitly models missing variables and decision reasoning.
Result: Across multiple QA datasets, the finetuned planner significantly improves macro F1 and abstention recall and reduces hallucination rates, and it remains effective under a compute-constrained training regime.
Conclusion: Epistemic decision-making must be learned during training rather than imposed only through inference-time strategies; PassiveQA offers a practical path toward LLMs with metacognitive ability.
Abstract: Large Language Models (LLMs) have achieved strong performance in question answering and retrieval-augmented generation (RAG), yet they implicitly assume that user queries are fully specified and answerable. In real-world settings, queries are often incomplete, ambiguous, or missing critical variables, leading models to produce overconfident or hallucinated responses. In this work, we study decision-aware query resolution under incomplete information, where a model must determine whether to Answer, Ask for clarification, or Abstain. We show that standard and enhanced RAG systems do not reliably exhibit such epistemic awareness, defaulting to answer generation even when information is insufficient. To address this, we propose PassiveQA, a three-action framework that aligns model behaviour with information sufficiency through supervised finetuning. Our approach integrates structured information-state representations, knowledge graph-grounded context, and a finetuned planner that explicitly models missing variables and decision reasoning. Experiments across multiple QA datasets show that the finetuned planner achieves significant improvements in macro F1 and abstention recall while reducing hallucination rates, under a compute-constrained training regime. These results provide strong empirical evidence that epistemic decision-making must be learned during training rather than imposed at inference time.
[67] Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation
Hanif Rahman
Main category: cs.CL
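TL;DR: This paper reports the first reproducible, multi-model evaluation of Pashto automatic speech recognition (ASR) on public data, covering zero-shot ASR performance, script-level failures (such as output in Arabic script rather than Pashto script), and cross-domain generalization. It exposes systematic weaknesses of current models in speech-to-script mapping, phonological modeling, and domain robustness.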
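Word error rate (WER) figures above 100%, such as the 461% this entry reports for one Whisper model, are possible because WER divides the number of word-level edits by the length of the reference, and insertions are unbounded. The sketch below computes WER with a standard word-level edit distance; it assumes whitespace tokenization and no text normalization, which may differ from the paper's evaluation setup.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A decoder that loops and repeats a token, for example reference `"a"` against hypothesis `"a a a a"`, yields three insertions over one reference word, a WER of 3.0 (300%). The same mechanism explains why WER alone says nothing about whether the output is even in the right script.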
Details
Motivation: 普什图语使用人口达6000–8000万,但缺乏公开、标准的多语言ASR基准;现有模型在该语言上的表现未知,且零样本与微调结果缺乏可比性与可复现性,阻碍技术进步。 Method: 在FLEURS和过滤后的Common Voice 24两个公开普什图语测试集上,系统评估10个零样本ASR模型(含Whisper全系列、MMS-1B、SeamlessM4T-v2-large、OmniASR-CTC-300M);开展语言识别审计以检测脚本输出错误;对5个微调模型进行跨域测试;并采用字符类错误分层分析关键音素错误。 Result: 零样本中SeamlessM4T在Common Voice 24上取得最佳WER(39.7%),Whisper表现极差(最高达461%);所有Whisper模型几乎不输出普什图文字(<0.8%),而MMS-1B等达93%以上;微调模型在域内WER为14%,但跨域骤升至32.5–59%,仅一增强模型实现零退化(35.1%);错误集中于普什图特有音素(如卷舌音、边擦音)。 Conclusion: 单纯依赖WER会掩盖严重脚本错配问题;当前主流模型对普什图语支持薄弱,亟需构建专用基准、改进音系建模与跨域鲁棒性;论文提出五项结构性障碍与五项优先研究方向以推动可持续进展。 Abstract: Pashto is spoken by approximately 60--80 million people but has no published benchmarks for multilingual automatic speech recognition (ASR) on any shared public test set. This paper reports the first reproducible multi-model evaluation on public Pashto data, covering zero-shot ASR, script-level failure, and cross-domain evaluation of fine-tuned models. For zero-shot ASR, ten models (all seven Whisper sizes, MMS-1B, SeamlessM4T-v2-large, and OmniASR-CTC-300M) are evaluated on the FLEURS Pashto test set and a filtered Common Voice~24 subset; zero-shot Whisper WER ranges from 90% to 297%, with the medium model collapsing to 461% on Common Voice~24 consistent with decoder looping. SeamlessM4T achieves 39.7% WER on Common Voice~24 (the best zero-shot result reported to date, as of submission); MMS-1B achieves 43.8% on FLEURS. For script failure, a language-identification audit shows that no Whisper model produces Pashto-script output in more than 0.8% of utterances, while MMS-1B, SeamlessM4T, and OmniASR each exceed 93% Pashto-script fidelity; WER alone does not reveal this failure, since a model generating Arabic-script output on Pashto audio has not achieved ASR in any interpretable sense. 
For cross-domain evaluation, five fine-tuned Pashto ASR models are evaluated on both test sets: published WER figures of 14% degrade to 32.5–59% on out-of-distribution sets, while one augmented model achieves 35.1% on both sets with zero cross-domain degradation. Character-class error stratification confirms that Pashto-unique phonemes (the retroflex series and lateral fricatives) account for disproportionate error mass. All evaluations cover read speech only. Five structural impediments to cumulative progress are identified and five ordered research priorities are argued.
[68] Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity
Jaeyoon Jung,Yejun Yoon,Kunwoo Park
Main category: cs.CL
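TL;DR: This paper proposes AMuFC, a framework in which two collaborating agents, an Analyzer and a Verifier, adaptively use visual evidence for multimodal fact-checking; it challenges the assumption that visual evidence always helps and validates the approach on multiple datasets.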
Details
Motivation: Prior work generally assumes that adding visual evidence always improves multimodal fact-checking, but this paper shows that indiscriminate use can lower accuracy, so the need for visual evidence should be judged adaptively.
Method: Proposes AMuFC, with two collaborating agents: the Analyzer decides whether visual evidence is necessary, and the Verifier predicts claim veracity from the retrieved evidence and the Analyzer's judgment; also builds a new dataset, WebFC, for evaluation in more realistic scenarios.
Result: Experiments on three datasets show that feeding the Analyzer's judgment of visual-evidence necessity into the Verifier's prediction significantly improves verification performance; the code and the new WebFC dataset are released.
Conclusion: Visual evidence is not always beneficial and should be selected adaptively per claim; AMuFC's two-agent collaboration yields more robust and more efficient multimodal fact-checking.
Abstract: Automated fact-checking is a crucial task not only in journalism but also across web platforms, where it supports a responsible information ecosystem and mitigates the harms of misinformation. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact-checking framework that employs two collaborative agents with distinct roles for the adaptive use of visual evidence: An Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer's assessment. Experimental results on three datasets show that incorporating the Analyzer's assessment of visual evidence necessity into the Verifier's prediction yields substantial improvements in verification performance. In addition to all code, we release WebFC, a newly constructed dataset for evaluating fact-checking modules in a more realistic scenario, available at https://github.com/ssu-humane/AMuFC.
[69] IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation
Anjali Kantharuban,Aarohi Srivastava,Fahim Faisal,Orevaoghene Ahia,Antonios Anastasopoulos,David Chiang,Yulia Tsvetkov,Graham Neubig
Main category: cs.CL
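TL;DR: This paper proposes IDIOLEX, a framework for learning sentence representations of style and dialect (idiolectal representations) that decouple semantic content from manner of expression, and validates its effectiveness and cross-domain transfer on Arabic and Spanish dialects.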
Details
Motivation: Existing sentence representations mainly encode what a sentence says rather than how it is said, even though the latter matters for many applications; a representation is needed that captures style and dialect and is decoupled from semantics.
Method: Proposes the IDIOLEX framework, which combines supervision from a sentence's provenance with linguistic features of its content to train models that learn continuous representations of style and dialect.
Result: On Arabic and Spanish dialect data, the learned representations capture meaningful variation, transfer across domains, and can serve as training objectives for stylistically aligning language models.
Conclusion: Jointly modeling individual-level and community-level language variation benefits the study of idiolect and supports downstream tasks that require sensitivity to stylistic differences, such as building diverse and accessible LLMs.
Abstract: Existing sentence representations primarily encode what a sentence says, rather than how it is expressed, even though the latter is important for many applications. In contrast, we develop sentence representations that capture style and dialect, decoupled from semantic content. We call this the task of idiolectal representation learning. We introduce IDIOLEX, a framework for training models that combines supervision from a sentence's provenance with linguistic features of a sentence's content, to learn a continuous representation of each sentence's style and dialect. We evaluate the approach on dialects of both Arabic and Spanish. The learned representations capture meaningful variation and transfer across domains for analysis and classification. We further explore the use of these representations as training objectives for stylistically aligning language models. Our results suggest that jointly modeling individual and community-level variation provides a useful perspective for studying idiolect and supports downstream applications requiring sensitivity to stylistic differences, such as developing diverse and accessible LLMs.
[70] BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement
Abdullah Al Shafi,Swapnil Kundu Argha,M. A. Moyeen,Abdul Muntakim,Shoumik Barman Polok
Main category: cs.CL
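TL;DR: This paper introduces BiST, a high-quality Bangla-English bilingual corpus for sentence-level grammatical classification containing 30,534 sentences annotated along two dimensions, syntactic structure and tense, and validates its usefulness for multilingual NLP tasks.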
Details
Motivation: To address the shortage of high-quality bilingual resources for low-resource languages, Bangla in particular, and so advance multilingual NLP.
Method: Builds the corpus from open-licensed encyclopedic sources and naturally composed conversational text; after preprocessing and language identification, three independent annotators label each sentence along two dimensions (syntactic structure and tense), with agreement measured by Fleiss' kappa; also reports statistical analyses and baseline model experiments.
Result: Yields a bilingual corpus of 30,534 sentences (17,465 English and 13,069 Bangla), with Fleiss' kappa of 0.82 for structural and 0.88 for tense annotation; dual-encoder architectures outperform strong multilingual encoders in the baseline experiments.
Conclusion: BiST provides a unified, reliable, and linguistically meaningful benchmark resource for bilingual grammatical modeling, supporting tasks such as controlled text generation, automated feedback generation, and cross-lingual representation learning.
Abstract: High-quality bilingual resources remain a critical bottleneck for advancing multilingual NLP in low-resource settings, particularly for Bangla. To mitigate this gap, we introduce BiST, a rigorously curated Bangla-English corpus for sentence-level grammatical classification, annotated across two fundamental dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus is compiled from open-licensed encyclopedic sources and naturally composed conversational text, followed by systematic preprocessing and automated language identification, resulting in 30,534 sentences, including 17,465 English and 13,069 Bangla instances. Annotation quality is ensured through a multi-stage framework with three independent annotators and dimension-wise Fleiss' kappa (κ) agreement, yielding reliable and reproducible labels with κ values of 0.82 and 0.88 for structural and temporal annotation, respectively. Statistical analyses demonstrate realistic structural and temporal distributions, while baseline evaluations show that dual-encoder architectures leveraging complementary language-specific representations consistently outperform strong multilingual encoders. Beyond benchmarking, BiST provides explicit linguistic supervision that supports grammatical modeling tasks, including controlled text generation, automated feedback generation, and cross-lingual representation learning. The corpus establishes a unified resource for bilingual grammatical modeling and facilitates linguistically grounded multilingual research.
[71] What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features
Dayeon Ki,Kevin Duh,Marine Carpuat
Main category: cs.CL
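TL;DR: This paper challenges the assumption that the reasoning gap of multilingual large models can be closed by making reasoning in other languages imitate English. It argues for attending to the reasoning features that are effective within each language, shows empirically that the association between reasoning features and accuracy varies considerably across languages and can even reverse, and calls for reward mechanisms and benchmarks that adapt to language-specific characteristics.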
Details
Motivation: Current multilingual Large Reasoning Models (LRMs) lag clearly on non-English languages, and the prevailing remedy is to make non-English reasoning converge on English reasoning; this paper questions that assumption and instead asks which reasoning features are actually effective in multilingual settings and how well they generalize across languages.
Method: 1) defines measurable reasoning features covering multilingual alignment, reasoning steps, and reasoning flow; 2) uses logistic regression to quantify each feature's association with answer accuracy; 3) trains sparse autoencoders on multilingual reasoning traces to automatically discover latent reasoning concepts; 4) uses the features as test-time selection policies to check whether they improve multilingual reasoning.
Result: Experiments on two mathematical reasoning benchmarks, four LRMs, and ten languages show that most reasoning features are positively associated with accuracy, but the strength of association differs by language, and some features are negatively associated in certain languages.
Conclusion: English-centric reward design does not suit every language; adaptive objectives that accommodate language-specific patterns should be developed, with implications for multilingual benchmark and reward function design.
Abstract: Large Reasoning Models (LRMs) still exhibit large performance gaps between English and other languages, yet much current work assumes these gaps can be closed simply by making reasoning in every language resemble English reasoning. This work challenges this assumption by asking instead: what actually characterizes effective reasoning in multilingual settings, and to what extent do English-derived reasoning features genuinely help in other languages? We first define a suite of measurable reasoning features spanning multilingual alignment, reasoning step, and reasoning flow aspects of reasoning traces, and use logistic regression to quantify how each feature associates with final answer accuracy. We further train sparse autoencoders over multilingual traces to automatically discover latent reasoning concepts that instantiate or extend these features. Finally, we use the features as test-time selection policies to examine whether they can steer models toward stronger multilingual reasoning. Across two mathematical reasoning benchmarks, four LRMs, and 10 languages, we find that most features are positively associated with accuracy, but the strength of association varies considerably across languages and can even reverse in some. Our findings challenge English-centric reward designs and point toward adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmark and reward design.
[72] Individual and Combined Effects of English as a Second Language and Typos on LLM Performance
Serena Liu,Yutong Yang,Prisha Sheth,Weixuan Dong,Mingjiao Diao,Xinru Zhu,Nikhil Banga,Oscar Melendez,Arnav Sharma,Minda Zhao,Marina Lin,Mengyu Wang
Main category: cs.CL
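TL;DR: This study combines ESL variation with typos and finds that their joint negative effect on LLM performance is larger than that of either factor alone, most clearly on closed-ended tasks, indicating that evaluation on standard English alone overestimates model performance in real-world settings.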
Details
Motivation: Prior work mostly analyzes the effects of ESL variation and typos on LLMs in isolation, but the two often co-occur in practice, so their real impact needs to be evaluated jointly.
Method: Uses the Trans-EnV framework to generate eight ESL variants and the MulTypo tool to inject typos at three levels (low, moderate, and severe), then systematically measures performance changes on closed-ended and open-ended tasks.
Result: Combining ESL variation and typos causes larger performance drops, though the effect is not simply additive; degradation patterns are more consistent on closed-ended tasks, while results on open-ended tasks are more mixed.
Conclusion: Evaluation on standard English alone overestimates model performance in the real world, especially for ESL users whose input contains errors; ESL variation and typos need to be modeled jointly to assess robustness fully.
Abstract: Large language models (LLMs) are used globally, and because much of their training data is in English, they typically perform best on English inputs. As a result, many non-native English speakers interact with them in English as a second language (ESL), and these inputs often contain typographical errors. Prior work has largely studied the effects of ESL variation and typographical errors separately, even though they often co-occur in real-world use. In this study, we use the Trans-EnV framework to transform standard English inputs into eight ESL variants and apply MulTypo to inject typos at three levels: low, moderate, and severe. We find that combining ESL variation and typos generally leads to larger performance drops than either factor alone, though the combined effect is not simply additive. This pattern is clearest on closed-ended tasks, where performance degradation can be characterized more consistently across ESL variants and typo levels, while results on open-ended tasks are more mixed. Overall, these findings suggest that evaluations on clean standard English may overestimate real-world model performance, and that evaluating ESL variation and typographical errors in isolation does not fully capture model behavior in realistic settings.
[73] Metaphors We Compute By: A Computational Audit of Cultural Translation vs. Thinking in LLMs
Yuan Chang,Jiaming Qu,Zhu Li
Main category: cs.CL
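TL;DR: This paper uses a metaphor generation task to run a preliminary computational audit of the cultural inclusivity of large language models (LLMs), finding stereotyped usage and Western defaultism, which indicates that prompting with a cultural identity alone is not enough to obtain culturally grounded reasoning.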
Details
Motivation: To distinguish linguistic ability from reasoning within a culture, and to examine whether LLMs genuinely perform culture-aware reasoning.
Method: Uses a metaphor generation task spanning five cultural settings and several abstract concepts to test empirically whether LLMs in creative writing act as culturally diverse partners or merely as cultural translators that localize the expression of a dominant conceptual framework.
Result: The model shows stereotyped metaphor usage in certain cultural settings, as well as Western defaultism.
Conclusion: Merely specifying a cultural identity in the prompt does not ensure culturally grounded reasoning; deeper design work is needed to improve cultural inclusivity.
Abstract: Large language models (LLMs) are often described as multilingual because they can understand and respond in many languages. However, speaking a language is not the same as reasoning within a culture. This distinction motivates a critical question: do LLMs truly conduct culture-aware reasoning? This paper presents a preliminary computational audit of cultural inclusivity in a creative writing task. We empirically examine whether LLMs act as culturally diverse creative partners or merely as cultural translators that leverage a dominant conceptual framework with localized expressions. Using a metaphor generation task spanning five cultural settings and several abstract concepts as a case study, we find that the model exhibits stereotyped metaphor usage for certain settings, as well as Western defaultism. These findings suggest that merely prompting an LLM with a cultural identity does not guarantee culturally grounded reasoning.
[74] Lighting Up or Dimming Down? Exploring Dark Patterns of LLMs in Co-Creativity
Zhu Li,Jiaming Qu,Yuan Chang
Main category: cs.CL
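TL;DR: This paper examines five "dark patterns" (such as sycophancy and tone policing) that can arise when large language models (LLMs) act as writing partners; controlled experiments show that these behaviors are widespread and affect creative autonomy, with Sycophancy appearing in 91.7% of cases, and the paper proposes design recommendations for AI that supports creative writing.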
Details
Motivation: LLMs increasingly serve as collaborative writing partners, but some of their behaviors may erode human creative agency, so the subtle mechanisms behind these negative effects need to be identified and understood.
Method: In a series of controlled experiments, prompts LLMs as writing assistants across different literary forms and themes, and analyzes how often, and in which contexts, five "dark patterns" (Sycophancy, Tone Policing, Moralizing, Loop of Death, and Anchoring) appear in the outputs.
Result: Sycophancy appears in 91.7% of cases and is especially prominent on sensitive topics; Anchoring depends strongly on literary form and surfaces most often in folktales; the other patterns are also common, suggesting that safety alignment may unintentionally limit creative exploration.
Conclusion: Although these dark patterns stem from safety alignment goals, they can suppress creative expression; future AI writing tools need to balance safety against creative freedom and refine their interaction design accordingly.
Abstract: Large language models (LLMs) are increasingly acting as collaborative writing partners, raising questions about their impact on human agency. In this exploratory work, we investigate five "dark patterns" in human-AI co-creativity, subtle model behaviors that can suppress or distort the creative process: Sycophancy, Tone Policing, Moralizing, Loop of Death, and Anchoring. Through a series of controlled sessions where LLMs are prompted as writing assistants across diverse literary forms and themes, we analyze the prevalence of these behaviors in generated responses. Our preliminary results suggest that Sycophancy is nearly ubiquitous (91.7% of cases), particularly in sensitive topics, while Anchoring appears to be dependent on literary forms, surfacing most frequently in folktales. This study indicates that these dark patterns, often byproducts of safety alignment, may inadvertently narrow creative exploration and proposes design considerations for AI systems that effectively support creative writing.
[75] Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations
Kalyan Cherukuri,Lav R. Varshney
Main category: cs.CL
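TL;DR: This paper proposes a theory that explains hallucinations in large language models (LLMs) through a geometric dynamical systems framework, attributing them to task-dependent basin structure in latent space. Analysis of hidden-state trajectories across several open-source models shows that for more complex tasks (such as summarization and misconception-heavy tasks) basins overlap more readily and are less stable. The authors further establish task-complexity and multi-basin theorems and show that geometry-aware interventions can lower hallucination probability without retraining.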
Details
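Motivation: LLMs often produce fluent but factually incorrect output (hallucinations), and the causes need to be understood at a mechanistic level.
Method: Builds a geometric dynamical systems framework, analyzes the trajectories of autoregressive hidden states in latent space, and studies task-dependent basin structure; combines empirical analysis across multiple models and benchmarks with formal task-complexity and multi-basin theorems; designs a geometry-aware hidden-state steering strategy.
Result: Hallucination is closely tied to the separability of task-dependent basins in latent space: basins are clearer for factoid tasks, while for summarization and misconception-heavy tasks they overlap more and are less stable; geometry-aware steering can reduce hallucination probability.
Conclusion: Hallucination arises from unstable or overlapping task-dependent dynamical structure in latent space, basin topology in particular; this geometric view offers a route to understanding and mitigating hallucination without retraining.
Abstract: Large language models (LLMs) hallucinate: they produce fluent outputs that are factually incorrect. We present a geometric dynamical systems framework in which hallucinations arise from task-dependent basin structure in latent space. Using autoregressive hidden-state trajectories across multiple open-source models and benchmarks, we find that separability is strongly task-dependent rather than universal: factoid settings can show clearer basin separation, whereas summarization and misconception-heavy settings are typically less stable and often overlap. We formalize this behavior with task-complexity and multi-basin theorems, characterize basin emergence in L-layer transformers, and show that geometry-aware steering can reduce hallucination probability without retraining.
[76] HUKUKBERT: Domain-Specific Language Model for Turkish Law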
Mehmet Utku Öztürk,Tansu Türkoğlu,Buse Buz-Yalug
Main category: cs.CL
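TL;DR: This paper presents HukukBERT, which the authors describe as the most comprehensive legal language model for Turkish, trained on an 18 GB legal corpus with a hybrid domain-adaptive pre-training method, and reports state-of-the-art results on a legal cloze test and on structural segmentation of court decisions.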
Details
Motivation: Turkish legal NLP research is limited by the scarcity of domain data and models, and lacks a high-quality dedicated model comparable to LEGAL-BERT.
Method: Builds an 18 GB cleaned Turkish legal corpus and trains HukukBERT with a hybrid Domain-Adaptive Pre-Training (DAPT) method that combines Whole-Word Masking, Token Span Masking, Word Span Masking, and Keyword Masking; designs a new Legal Cloze Test benchmark and a court-decision structural segmentation task for evaluation.
Result: Reaches 84.40% Top-1 accuracy on the Legal Cloze Test, substantially ahead of existing models; achieves a 92.8% document pass rate on structural segmentation of court decisions, a new state of the art; the model is released to support further Turkish legal NLP research.
Conclusion: HukukBERT is currently the most comprehensive Turkish legal language model; it fills the gap in dedicated models for this domain and provides a strong foundation for understanding and processing Turkish legal text.
Abstract: Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet existing studies specific to Turkish law have still been limited due to the scarcity of domain-specific data and models. Although extensive models like LEGAL-BERT have been developed for English legal texts, the Turkish legal domain lacks a domain-specific high-volume counterpart. In this paper, we introduce HukukBERT, the most comprehensive legal language model for Turkish, trained on an 18 GB cleaned legal corpus using a hybrid Domain-Adaptive Pre-Training (DAPT) methodology integrating Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. We systematically compared our 48K WordPiece tokenizer and DAPT approach against general-purpose and existing domain-specific Turkish models. Evaluated on a novel Legal Cloze Test benchmark (a masked legal term prediction task designed for Turkish court decisions), HukukBERT achieves state-of-the-art performance with 84.40% Top-1 accuracy, substantially outperforming existing models. Furthermore, we evaluated HukukBERT in the downstream task of structural segmentation of official Turkish court decisions, where it achieves a 92.8% document pass rate, establishing a new state-of-the-art. We release HukukBERT to support future research in Turkish legal NLP tasks, including recognition of named entities, prediction of judgment, and classification of legal documents.
[77] How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling
Yuhang Liu,Heyan Huang,Yizhe Yang,Hongyan Zhao,Zhizhuo Zeng,Yang Gao
Main category: cs.CL
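TL;DR: This paper proposes a problem-oriented, stage-wise evaluation framework for assessing large language models (LLMs) on end-to-end real-world problem solving such as mathematical modeling contests. It reveals a marked gap between comprehension and execution in current LLMs, with weak performance in execution stages such as model solving, code implementation, and result analysis, and shows that this gap is not closed by scaling up the model alone.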
Details
Motivation: Existing reasoning benchmarks reflect LLM reasoning ability but struggle to assess the ability to solve real-world problems that require end-to-end workflows; mathematical modeling contests offer a more stringent test.
Method: Builds a stage-wise evaluation framework (problem identification, formulation, solving, coding, analysis, and so on) based on expert-verified criteria, and validates its reliability by comparing automatic scores with human expert scores on problems from the China Postgraduate Mathematical Contest in Modeling.
Result: LLMs show a comprehension-execution gap: they do well in early stages (such as problem identification and formulation) but remain weak in execution-oriented stages (model solving, code implementation, result analysis); the gap does not shrink with model scale; errors stem from insufficient specification and missing verification and validation, and they propagate across stages.
Conclusion: Model scaling alone cannot bridge the gap between comprehension and execution; new approaches (such as stronger specification, verification mechanisms, and multi-stage error correction) are needed to make LLMs practical for complex real-world problem solving.
Abstract: Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework's reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.
[78] SkillX: Automatically Constructing Skill Knowledge Bases for Agents
Chenxi Wang,Zhuoyun Yu,Xin Xie,Wuguannan Yao,Runnan Fang,Shuofei Qiao,Kexin Cao,Guozhou Zheng,Xiang Qi,Peng Zhang,Shumin Deng
Main category: cs.CL
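TL;DR: This paper proposes SkillX, a framework that builds a reusable, plug-and-play skill knowledge base through multi-level skill design, iterative skill refinement, and exploratory skill expansion, significantly improving the generalization and execution efficiency of LLM agents on long-horizon, user-interactive tasks.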
Details
Motivation: Current self-evolving paradigms for LLM agents are inefficient: agents learn in isolation, repeatedly rediscover similar behaviors, explore redundantly, and generalize poorly.
Method: Proposes SkillX, a fully automated framework with a three-tier skill design (strategic plans, functional skills, atomic skills), iterative skill refinement driven by execution feedback, and an expansion mechanism that proactively explores from seed data to generate new skills.
Result: On long-horizon, user-interactive benchmarks including AppWorld, BFCL-v3, and τ²-Bench, SkillKB significantly improves the task success rate and execution efficiency of weaker base agents.
Conclusion: Structured, hierarchical representations of experience are key to generalizable agent learning, and SkillX gives LLM agents a new paradigm for reusing experience efficiently.
Abstract: Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited experience, resulting in redundant exploration and poor generalization. To address this problem, we propose SkillX, a fully automated framework for constructing a plug-and-play skill knowledge base that can be reused across agents and environments. SkillX operates through a fully automated pipeline built on three synergistic innovations: (i) Multi-Level Skills Design, which distills raw trajectories into a three-tiered hierarchy of strategic plans, functional skills, and atomic skills; (ii) Iterative Skills Refinement, which automatically revises skills based on execution feedback to continuously improve library quality; and (iii) Exploratory Skills Expansion, which proactively generates and validates novel skills to expand coverage beyond seed training data. Using a strong backbone agent (GLM-4.6), we automatically build a reusable skill library and evaluate its transferability on challenging long-horizon, user-interactive benchmarks, including AppWorld, BFCL-v3, and τ²-Bench. Experiments show that SkillKB consistently improves task success and execution efficiency when plugged into weaker base agents, highlighting the importance of structured, hierarchical experience representations for generalizable agent learning. Our code will be publicly available soon at https://github.com/zjunlp/SkillX.
[79] LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection
Cheng Xu,Changhong Jin,Yingjie Niu,Nan Yan,Yuke Mei,Shuhao Guan,Liming Chen,M-Tahar Kechadi
Main category: cs.CL
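TL;DR: This paper proposes LiveFact, a continuously updated dynamic benchmark for evaluating the fact-checking and reasoning ability of large language models under temporal uncertainty and evolving information, and it exposes both the limitations of existing static benchmarks and a "reasoning gap" among models.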
Details
Motivation: Evaluation frameworks for fake news detection and fact-checking have fallen behind the development of LLMs; static benchmarks are prone to data contamination and cannot effectively assess reasoning under temporal uncertainty.
Method: Proposes the LiveFact dynamic benchmark, which uses temporally evolving evidence sets, a dual-mode evaluation (Classification Mode for final verification and Inference Mode for evidence-based reasoning), and explicit monitoring of benchmark data contamination (BDC).
Result: Tests on 22 LLMs show that open-source Mixture-of-Experts models (such as Qwen3-235B-A22B) now match or outperform proprietary state-of-the-art systems; a significant "reasoning gap" emerges, in which capable models recognize unverifiable claims in early data slices and so display epistemic humility.
Conclusion: LiveFact sets a sustainable new standard for evaluating robust, temporally aware AI verification systems and underlines the importance of dynamic evaluation and epistemic humility.
Abstract: The rapid development of Large Language Models (LLMs) has transformed fake news detection and fact-checking tasks from simple classification to complex reasoning. However, evaluation frameworks have not kept pace. Current benchmarks are static, making them vulnerable to benchmark data contamination (BDC) and ineffective at assessing reasoning under temporal uncertainty. To address this, we introduce LiveFact, a continuously updated benchmark that simulates the real-world "fog of war" in misinformation detection. LiveFact uses dynamic, temporal evidence sets to evaluate models on their ability to reason with evolving, incomplete information rather than on memorized knowledge. We propose a dual-mode evaluation: Classification Mode for final verification and Inference Mode for evidence-based reasoning, along with a component to monitor BDC explicitly. Tests with 22 LLMs show that open-source Mixture-of-Experts models, such as Qwen3-235B-A22B, now match or outperform proprietary state-of-the-art systems. More importantly, our analysis finds a significant "reasoning gap." Capable models exhibit epistemic humility by recognizing unverifiable claims in early data slices, an aspect traditional static benchmarks overlook. LiveFact sets a sustainable standard for evaluating robust, temporally aware AI verification.
[80] Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not
Sercan Karakaş
Main category: cs.CL
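TL;DR: This paper asks whether large language models (LLMs) resolve Turkish prenominal relative-clause attachment ambiguities as humans do, by integrating event plausibility with syntax in a structure-sensitive way. Humans show a large, correctly directed plausibility effect, while the LLMs tested show weak, unstable, or even reversed tendencies, indicating that current models do not reliably combine world knowledge with syntactic structure for disambiguation.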
Details
Motivation: To test whether LLMs, like humans, integrate world knowledge (such as event plausibility) with syntactic structure in a structure-sensitive way when resolving syntactic ambiguity.
Method: Constructs well-controlled stimuli for the high attachment (HA) versus low attachment (LA) ambiguity in Turkish prenominal relative clauses (RCs), holding the syntactic configuration fixed and keeping both readings pragmatically possible, with independent norming ratings validating that graded event plausibility favors HA or LA; runs a speeded forced-choice comprehension experiment with human participants; and evaluates Turkish and multilingual LLMs with a preference-based comparison of matched HA/LA continuations using mean per-token log-probability.
Result: Human participants show a strong, correctly directed plausibility effect; none of the tested LLMs reproduces the effect reliably, with preference shifts that are weak, unstable, or reversed.
Conclusion: On the tested task, current LLMs do not use event plausibility to guide attachment preferences as reliably as humans, which points to limits in how they integrate world knowledge with syntactic structure; Turkish RC attachment ambiguity can serve as a cross-linguistic diagnostic beyond general-purpose benchmarks.
Abstract: Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution. We test this question in Turkish prenominal relative-clause attachment ambiguities, where the same surface string permits high attachment (HA) or low attachment (LA). We construct ambiguous items that keep the syntactic configuration fixed and ensure both parses remain pragmatically possible, while graded event plausibility selectively favors High Attachment vs. Low Attachment. The contrasts are validated with independent norming ratings. In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect. We then evaluate Turkish and multilingual LLMs in a parallel preference-based setup that compares matched HA/LA continuations via mean per-token log-probability. Across models, plausibility-driven shifts are weak, unstable, or reversed. The results suggest that, in the tested models, plausibility information does not guide attachment preferences as reliably as it does in human judgments, and they highlight Turkish RC attachment as a useful cross-linguistic diagnostic beyond broad benchmarks.
[81] MERIT: Multilingual Expert-Reward Informed Tuning for Chinese-Centric Low-Resource Machine Translation
Zhixiang Lu,Chong Zhang,Chenyu Xue,Angelos Stefanidis,Chong Li,Jionglong Su,Zhengyong Jiang
Main category: cs.CL
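TL;DR: This paper proposes MERIT, a framework that combines language-specific token prefixing, supervised fine-tuning, and group relative policy optimization guided by a semantic alignment reward to substantially improve neural machine translation between Chinese and low-resource Southeast Asian languages.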
Details
Motivation: Neural machine translation from Chinese to low-resource Southeast Asian languages is held back by an extreme shortage of high-quality parallel corpora and heavy noise in existing mined data, which makes training difficult and leaves performance far behind high-resource directions.
Method: Proposes the MERIT framework, which comprises language-specific token prefixing (LTP), supervised fine-tuning (SFT), and group relative policy optimization (GRPO) guided by a semantic alignment reward (SAR), and builds a Chinese-centric ALT evaluation suite.
Result: For low-resource-language→Chinese translation, targeted data curation and reward-guided optimization clearly outperform simply scaling up the model and substantially improve translation quality.
Conclusion: In low-resource settings, combining better data quality with reward-driven reinforcement learning is more effective than relying on model scaling, offering a new paradigm for translation between Chinese and Southeast Asian languages.
Abstract: Neural machine translation (NMT) from Chinese to low-resource Southeast Asian languages remains severely constrained by the extreme scarcity of clean parallel corpora and the pervasive noise in existing mined data. This chronic shortage not only impedes effective model training but also sustains a large performance gap with high-resource directions, leaving millions of speakers of languages such as Lao, Burmese, and Tagalog with persistently low-quality translation systems despite recent advances in large multilingual models. We introduce Multilingual Expert-Reward Informed Tuning (MERIT), a unified translation framework that transforms the traditional English-centric ALT benchmark into a Chinese-centric evaluation suite for five Southeast Asian low-resource languages (LRLs). Our framework combines language-specific token prefixing (LTP) with supervised fine-tuning (SFT) and a novel group relative policy optimization (GRPO) guided by the semantic alignment reward (SAR). These results confirm that, in LRL→Chinese translation, targeted data curation and reward-guided optimization dramatically outperform mere model scaling.
[82] Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
Qingyang Xu,Yaling Shen,Stephanie Fong,Zimu Wang,Yiwen Jiang,Xiangyu Zhao,Jiahe Liu,Zhongxing Xu,Vincent Lee,Zongyuan Ge
Main category: cs.CL
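TL;DR: This paper proposes Personality-based Client Simulation Attack (PCSA), a new red-teaming framework for assessing the safety of large language models (LLMs) in psychotherapy settings, focusing on the risk that a model mistakes validation of harmful beliefs or behaviors for empathic response. Experiments show that PCSA exposes model vulnerabilities far better than existing baselines and that current LLMs still readily give unauthorized medical advice, reinforce delusions, and encourage risky behavior.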
Details
Motivation: Existing red-teaming frameworks focus mainly on generic harms or optimization-based attacks and overlook a high-risk, domain-specific problem in psychotherapy: maladaptive validation, in which supportive responses wrongly endorse or reinforce a client's harmful beliefs or behaviors.
Method: Proposes the PCSA framework, which builds coherent, persona-driven simulated client dialogues to adversarially test LLMs in multi-turn counseling scenarios; evaluates seven general and mental health-specialized LLMs, and checks dialogue realism with perplexity analysis and human inspection.
Result: PCSA substantially outperforms four competitive baselines; the dialogues it generates are more natural and realistic; the experiments show that current LLMs have widespread gaps in psychological safety alignment, such as giving unauthorized medical advice, reinforcing delusions, and implicitly encouraging risky actions.
Conclusion: PCSA fills a gap in LLM safety evaluation for mental health, reveals serious weaknesses of existing models in high-stakes clinical interactions, and underlines the need for safety alignment methods with greater domain awareness.
Abstract: The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions. A key challenge is distinguishing therapeutic empathy from maladaptive validation, where supportive responses may inadvertently reinforce harmful beliefs or behaviors in multi-turn conversations. This risk is largely overlooked by existing red-teaming frameworks, which focus mainly on generic harms or optimization-based attacks. To address this gap, we introduce Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose vulnerabilities in psychological safety alignment. Experiments on seven general and mental health-specialized LLMs show that PCSA substantially outperforms four competitive baselines. Perplexity analysis and human inspection further indicate that PCSA generates more natural and realistic dialogues. Our results reveal that current LLMs remain vulnerable to domain-specific adversarial tactics, providing unauthorized medical advice, reinforcing delusions, and implicitly encouraging risky actions.
[83] Synthetic Sandbox for Training Machine Learning Engineering Agents
Yuhang Zhou,Lizhu Zhang,Yifan Wu,Jiayi Liu,Xiangjun Fan,Zhuokai Zhao,Hong Yan
Main category: cs.CL
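TL;DR: This paper proposes SandMLE, a framework that generates micro-scale, diverse synthetic MLE environments to greatly speed up reinforcement learning for agents on machine learning engineering tasks, enabling trajectory-wise on-policy RL in the MLE domain for the first time and achieving significant gains on several benchmarks.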
Details
Motivation: As LLM agents move from software engineering tasks to machine learning engineering (MLE) tasks, verification cost rises sharply because MLE requires running full ML pipelines, and existing approaches (such as SFT or offline proxy rewards) give up the exploration and generalization benefits of on-policy reinforcement learning.
Method: Proposes SandMLE, a multi-agent framework that generates diverse synthetic MLE environments from a small number of seed tasks, keeping structural complexity while using very small datasets (only 50-200 samples per task), which makes efficient trajectory-wise on-policy RL feasible.
Result: SandMLE reduces execution time by more than 13 times; on MLE-bench-lite, it raises the medal rate of Qwen3 models by a relative 20.3% to 66.9% over SFT baselines; on MLE-Dojo, it generalizes well to unseen agent scaffolds, with HumanRank score gains of up to 32.4%.
Conclusion: SandMLE resolves the verification bottleneck for on-policy RL of MLE agents, provides a feasible path to large-scale RL training in the MLE domain, and shows strong generalization.
Abstract: As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines (data preprocessing, model training, and metric evaluation) on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by over 13 times, enabling large-scale, on-policy trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes across unseen agentic scaffolds, achieving up to 32.4% better HumanRank score on MLE-Dojo.
[84] Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation
Hengrui Gu,Xiaotian Han,Yujing Bian,Kaixiong Zhou
Main category: cs.CL
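TL;DR: This paper proposes AsymGRPO, a framework that achieves entropy refinement by decoupling the modulation of positive and negative rollouts, improving the exploration ability and reasoning performance of LLMs in reinforcement learning.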
Details
Motivation: Existing RLVR methods suffer from restricted exploration, and entropy regularization works poorly and is sensitive to hyperparameters; the authors set out to rethink the relationship between policy entropy and exploration and to separate beneficial from harmful entropy components.
Method: From a parametric formulation of group-relative advantage estimation and an analysis of entropy dynamics, introduces the concepts of informative entropy and spurious entropy, and designs the AsymGRPO framework to modulate positive and negative rollouts separately to achieve entropy refinement.
Result: AsymGRPO outperforms strong baselines across multiple tasks and can combine with existing entropy regularization methods for further gains.
Conclusion: Effective exploration comes from entropy refinement rather than blind entropy maximization, that is, preserving informative entropy on positive rollouts while suppressing spurious entropy on negative ones; AsymGRPO offers an interpretable and controllable way to realize this.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models (LLMs). However, it faces a fundamental limitation termed restricted exploration, where the policy rapidly converges to a narrow set of solutions. While entropy regularization is a popular approach used to sustain exploration, it often proves unreliable for LLMs, suffering from high hyperparameter sensitivity and yielding only marginal performance gains. Motivated by these inefficiencies, we propose to rethink the relationship between policy entropy and exploration. By deriving a parametric formulation of group-relative advantage estimation and analyzing entropy dynamics, we conceptually decompose policy entropy into informative entropy, which preserves diverse solution paths, and spurious entropy, which erodes reasoning patterns. Our analysis reveals that, in contrast to blind maximization, effective exploration requires entropy refinement, a mechanism implicitly embedded in group-relative advantage estimation that sustains informative entropy on positive rollouts while suppressing spurious entropy on negative ones. Guided by this insight, we propose AsymGRPO, an exploratory framework that explicitly decouples the modulation of positive and negative rollouts. This allows for independent control over the preservation of informative entropy and the suppression of spurious noise.
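To make the idea of decoupled modulation more concrete, the sketch below computes standard group-relative advantages (each reward normalized by the mean and standard deviation of its group) and then applies separate scale factors to positive and negative rollouts. This is only an illustration under stated assumptions: the abstract does not give AsymGRPO's actual formulation, and the function names, the `pos_scale` and `neg_scale` parameters, and the choice to split rollouts by the sign of the advantage are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def asymmetric_advantages(rewards, pos_scale=1.0, neg_scale=0.5, eps=1e-6):
    """Scale positive and negative advantages independently.

    Hypothetical illustration of decoupled modulation; not the paper's
    update rule. Zero advantages are unaffected by either scale.
    """
    advantages = group_relative_advantages(rewards, eps)
    return [a * (pos_scale if a > 0 else neg_scale) for a in advantages]


# Four rollouts for one prompt with binary verifiable rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0]
plain = group_relative_advantages(group_rewards)
decoupled = asymmetric_advantages(group_rewards, pos_scale=1.0, neg_scale=0.5)
```

With two separate knobs, the push toward rewarded rollouts and the push away from unrewarded ones can be tuned independently, which is the kind of independent control the abstract describes.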
Extensive experiments demonstrate that AsymGRPO achieves superior performance compared to strong baselines and exhibits the potential to synergize with existing entropy regularization methods.
[85] TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Weian Mao,Xi Lin,Wei Huang,Yuxin Xie,Tianfu Fu,Bohan Zhuang,Song Han,Yukang Chen
Main category: cs.CL
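TL;DR: This paper proposes TriAttention, which estimates KV importance from the concentration of pre-RoPE Q/K vectors and the distance preference it induces, easing the KV cache memory bottleneck in long-context reasoning and substantially raising throughput or cutting KV memory while preserving accuracy.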
Details
Motivation: Existing KV compression methods score keys with post-RoPE attention; because queries rotate with position, few queries are representative, which leads to poor key selection and unstable reasoning. A more stable and representative way to estimate KV importance is needed.
Method: The paper observes that pre-RoPE Q/K vectors concentrate around fixed non-zero centers (Q/K concentration), analyzes the distance preference this induces, and models it with a trigonometric series; TriAttention uses this model to score keys by position and adds Q/K norms as an extra importance signal.
Result: On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while delivering 2.5x higher throughput or a 10.7x KV memory reduction; leading baselines reach only about half that accuracy at the same efficiency. It also enables OpenClaw deployment on a single consumer GPU.
Conclusion: By moving to the pre-RoPE space and modeling Q/K concentration and distance preference, TriAttention offers a more stable, efficient, and accurate approach to KV compression that relieves the memory bottleneck of long-context reasoning.
Abstract: Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions, a property termed Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
[86] Early Stopping for Large Reasoning Models via Confidence Dynamics
Parsa Hosseini,Sumit Nawathe,Mahdi Salmani,Meisam Razaviyayn,Soheil Feizi
Main category: cs.CL
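TL;DR: This paper proposes CoDE-Stop, a training-free early stopping method for reasoning that uses the dynamics of intermediate-answer confidence, substantially reducing compute while maintaining or even improving accuracy.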
Details
Motivation: Long chain-of-thought reasoning is computationally expensive and can degrade performance through overthinking; the open question is when to stop reasoning.
Method: Based on the observed dynamics of intermediate-answer confidence (correct trajectories reach high confidence early, incorrect ones show unstable confidence), the authors design CoDE-Stop, a training-free early stopping strategy.
Result: Validated across multiple models on science and reasoning benchmarks: it cuts token usage by 25-50% relative to standard full-length reasoning and achieves a better accuracy-compute tradeoff than prior early stopping methods.
Conclusion: Confidence dynamics are a reliable stopping signal, and CoDE-Stop provides a simple, general, training-free solution for efficient reasoning in large models.
Abstract: Large reasoning models rely on long chain-of-thought generation to solve complex problems, but extended reasoning often incurs substantial computational cost and can even degrade performance due to overthinking. A key challenge is determining when the model should stop reasoning and produce the final answer. In this work, we study the confidence of intermediate answers during reasoning and observe two characteristic behaviors: correct reasoning trajectories often reach high-confidence answers early, while incorrect rollouts tend to produce long, unproductive reasoning traces and exhibit less reliable confidence dynamics. Motivated by these observations, we propose CoDE-Stop (Confidence Dynamics Early Stop), an early stopping method that leverages the dynamics of intermediate answer confidence to decide when to terminate reasoning, requiring no additional training and easily integrating into existing models. We evaluate CoDE-Stop on diverse reasoning and science benchmarks across multiple models. Compared to prior early stopping methods, it achieves a more favorable accuracy-compute tradeoff and reduces total token usage by 25-50% compared to standard full-length reasoning. In addition, we provide analyses of confidence dynamics during reasoning, offering insights into how confidence changes in both correct and incorrect trajectories.
[87] Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection
Yang Li,Qiang Sheng,Zhengjia Wang,Yehan Yang,Danding Wang,Juan Cao
Main category: cs.CL
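TL;DR: This paper proposes RACE, a fine-grained four-class detector of LLM-generated text that uses Rhetorical Structure Theory and discourse-unit features to distinguish the creator and editor roles, enabling more precise regulation.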
Details
Motivation: Existing binary or ternary classification settings cannot meet regulatory needs that hinge on fine distinctions, such as LLM-polished human text versus humanized LLM text; a finer-grained detection framework is needed.
Method: RACE (Rhetorical Analysis for Creator-Editor Modeling) uses Rhetorical Structure Theory to build a logic graph that captures the creator's foundation, and extracts Elementary Discourse Unit-level features to characterize the editor's style.
Result: RACE clearly outperforms 12 baselines on the four-class task with a low false-alarm rate, and aligns with policy needs.
Conclusion: RACE offers an interpretable, fine-grained, policy-aligned detection solution for regulating LLM content, moving from coarse identification toward attributing responsibility.
Abstract: The misuse of large language models (LLMs) requires precise detection of synthetic text. Existing works mainly follow binary or ternary classification settings, which can only distinguish pure human/LLM text or collaborative text at best. This remains insufficient for nuanced regulation, as LLM-polished human text and humanized LLM text often trigger different policy consequences. In this paper, we explore fine-grained LLM-generated text detection under a rigorous four-class setting. To handle such complexities, we propose RACE (Rhetorical Analysis for Creator-Editor Modeling), a fine-grained detection method that characterizes the distinct signatures of creator and editor. Specifically, RACE utilizes Rhetorical Structure Theory to construct a logic graph for the creator's foundation while extracting Elementary Discourse Unit-level features for the editor's style. Experiments show that RACE outperforms 12 baselines in identifying fine-grained types with low false alarms, offering a policy-aligned solution for LLM regulation.
cs.CV [Back]
[88] SafeScreen: A Safety-First Screening Framework for Personalized Video Retrieval for Vulnerable Users
Wenzheng Zhao,Madhava Kalyan Gadiputi,Fengpei Yuan
Main category: cs.CV
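TL;DR: SafeScreen is a safety-first video screening framework for child-directed and care settings (e.g., dementia care) that combines individualized safety constraints, multimodal video retrieval-augmented generation (VideoRAG), and LLM-based decision-making to deliver real-time, explainable safety screening without precomputed labels.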
Details
Motivation: Recommendation algorithms on open-domain video platforms optimize for engagement and can expose vulnerable users (e.g., children, people with dementia) to harmful content; existing systems lack real-time, explainable screening tailored to individual safety needs.
Method: SafeScreen is a three-stage automated pipeline: (i) extract individualized safety criteria from the user profile; (ii) produce evidence-grounded assessments via adaptive question generation and multimodal VideoRAG analysis; (iii) use an LLM to judge safety, appropriateness, and relevance.
Result: In a dementia-care reminiscence case study with 30 synthetic patient profiles and 90 test queries, SafeScreen diverges from YouTube's engagement-optimized rankings in 80-93% of cases while maintaining high safety coverage, sensibleness, and groundedness, as validated by LLM-based evaluation and domain experts.
Conclusion: Safety can be a hard prerequisite for video recommendation rather than something traded off; SafeScreen shows that explainable, real-time video safety screening with individualized constraints and no precomputed labels is feasible and effective.
Abstract: Open-domain video platforms offer rich, personalized content that could support health, caregiving, and educational applications, but their engagement-optimized recommendation algorithms can expose vulnerable users to inappropriate or harmful material. These risks are especially acute in child-directed and care settings (e.g., dementia care), where content must satisfy individualized safety constraints before being shown. We introduce SafeScreen, a safety-first video screening framework that retrieves and presents personalized video while enforcing individualized safety constraints. Rather than ranking videos by relevance or popularity, SafeScreen treats safety as a prerequisite and performs sequential approval or rejection of candidate videos through an automated pipeline. SafeScreen integrates three key components: (i) profile-driven extraction of individualized safety criteria, (ii) evidence-grounded assessments via adaptive question generation and multimodal VideoRAG analysis, and (iii) LLM-based decision-making that verifies safety, appropriateness, and relevance before content exposure. This design enables explainable, real-time screening of uncurated video repositories without relying on precomputed safety labels. We evaluate SafeScreen in a dementia-care reminiscence case study using 30 synthetic patient profiles and 90 test queries.
Results demonstrate that SafeScreen prioritizes safety over engagement, diverging from YouTube's engagement-optimized rankings in 80-93% of cases, while maintaining high levels of safety coverage, sensibleness, and groundedness, as validated by both LLM-based evaluation and domain experts.
[89] A reconfigurable smart camera implementation for jet flames characterization based on an optimized segmentation model
Gerardo Valente Vazquez-Garcia,Carmina Perez Guerrero,Eduardo Garduño,Miguel Gonzalez-Mendoza,Adriana Palacios,Gerardo Rodriguez-Hernandez,Vahid Foroughi,Alba Àgueda,Elsa Pastor,Gilberto Ochoa-Ruiz
Main category: cs.CV
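TL;DR: This paper presents a SoC FPGA-based smart camera platform for real-time segmentation and characterization of jet flames in industrial settings; an optimized UNet model deployed on the Ultra96 platform runs at 30 FPS without loss of accuracy.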
Details
Motivation: Real-time solutions for early fire segmentation and characterization in industrial settings are lacking.
Method: Build an edge processing system on a SoC FPGA (Ultra96); optimize a UNet model (125x fewer parameters), deploy it with the Vitis framework, and reduce latency further with multi-threading and batch normalization.
Result: The optimized model runs at 30 FPS with a 7.5x latency reduction and no drop in Dice Score.
Conclusion: The lightweight, low-latency, accurate edge AI platform can be extended to other industrial fire safety applications.
Abstract: In this work we present a novel framework for fire safety management in industrial settings through the implementation of a smart camera platform for jet flames characterization. The approach seeks to alleviate the lack of real-time solutions for industrial early fire segmentation and characterization. As a case study, we demonstrate how a SoC FPGA running optimized Artificial Intelligence (AI) models can be leveraged to implement a full edge processing pipeline for jet flames analysis. In this paper we extend previous work on computer-vision jet fire segmentation by creating a novel experimental set-up and system implementation for addressing this issue, which can be replicated to other fire safety applications. The proposed platform is designed to carry out image processing tasks in real-time and on device, reducing video processing overheads, and thus the overall latency. This is achieved by optimizing a UNet segmentation model to make it amenable for an SoC FPGA implementation; the optimized model can then be efficiently mapped onto the SoC reconfigurable logic for massively parallel execution. For our experiments, we have chosen the Ultra96 platform, as it also provides the means for implementing full-fledged intelligent systems using the SoC peripherals, as well as other Operating System (OS) capabilities (i.e., multi-threading) for systems management. For optimizing the model we made use of the Vitis (Xilinx) framework, which enabled us to optimize the full precision model from 7.5 million parameters to 59,095 parameters (125x fewer), which translated into a reduction of the processing latency of 2.9x.
Further optimization (multi-threading and batch normalization) led to an improvement of 7.5x in terms of latency, yielding a performance of 30 Frames Per Second (FPS) without sacrificing accuracy in terms of the evaluated metrics (Dice Score).
[90] Event-Driven Neuromorphic Vision Enables Energy-Efficient Visual Place Recognition
Geoffroy Keime,Nicolas Cuperlier,Benoit R. Cottereau
Main category: cs.CV
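TL;DR: SpikeVPR is a spiking neural network approach inspired by the mammalian navigation system that pairs event cameras with SNNs to achieve robust visual place recognition at very low compute and energy cost.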
Details
Motivation: Conventional deep networks for visual place recognition (VPR) under dynamic real-world conditions are limited by high computational and energy demands, making them hard to use on autonomous robots that need real-time, low-power operation.
Method: SpikeVPR combines event-based cameras with spiking neural networks (SNNs), is trained end-to-end with surrogate gradient learning, and introduces EventDilation, a new augmentation strategy that improves robustness to speed and temporal variations.
Result: On the Brisbane-Event-VPR and NSAVP benchmarks, performance is comparable to state-of-the-art deep networks with 50 times fewer parameters and 30 and 250 times less energy, enabling real-time deployment on mobile and neuromorphic platforms.
Conclusion: Spike-based coding offers a viable, efficient route to robust place recognition in complex, changing environments.
Abstract: Reliable visual place recognition (VPR) under dynamic real-world conditions is critical for autonomous robots, yet conventional deep networks remain limited by high computational and energy demands. Inspired by the mammalian navigation system, we introduce SpikeVPR, a bio-inspired and neuromorphic approach combining event-based cameras with spiking neural networks (SNNs) to generate compact, invariant place descriptors from few exemplars, achieving robust recognition under extreme changes in illumination, viewpoint, and appearance. SpikeVPR is trained end-to-end using surrogate gradient learning and incorporates EventDilation, a novel augmentation strategy enhancing robustness to speed and temporal variations. Evaluated on two challenging benchmarks (Brisbane-Event-VPR and NSAVP), SpikeVPR achieves performance comparable to state-of-the-art deep networks while using 50 times fewer parameters and consuming 30 and 250 times less energy, enabling real-time deployment on mobile and neuromorphic platforms. These results demonstrate that spike-based coding offers an efficient pathway toward robust VPR in complex, changing environments.
[91] 3D-IDE: 3D Implicit Depth Emergent
Chushan Zhang,Ruihan Lu,Jinguang Tong,Yikai Wang,Hongdong Li
Main category: cs.CV
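TL;DR: This paper proposes 3D-Implicit Depth Emergence, which uses geometric self-supervision to make 3D perception an implicitly emergent property rather than an explicit encoding, achieving strong indoor scene understanding with no depth or pose inputs at inference and zero latency overhead.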
Details
Motivation: Existing MLLMs face a trade-off when fusing 2D and 3D representations, which leads to suboptimal deployment; the dependence on explicit 3D encoding or external 3D models needs to be removed.
Method: The paper proposes the Implicit Geometric Emergence Principle: a fine-grained geometry validator and global representation constraints form an information bottleneck that maximizes the mutual information between visual features and 3D structures, so that 3D awareness emerges naturally within a unified visual representation.
Result: The method surpasses SOTA on multiple 3D scene understanding benchmarks, reduces inference latency by 55%, and maintains strong performance across diverse downstream tasks.
Conclusion: Shifting 3D knowledge integration from external grafting to implicit emergence is a fundamental rethinking of 3D understanding in vision-language models.
Abstract: Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground-truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade-off in 2D-3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms like a fine-grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize the mutual information between visual features and 3D structures, allowing 3D awareness to emerge naturally within a unified visual representation. Unlike existing approaches, our method enables 3D perception to emerge implicitly, disentangling features in dense regions and, crucially, eliminating depth and pose dependencies during inference with zero latency overhead. This paradigm shift from external grafting to implicit emergence represents a fundamental rethinking of 3D knowledge integration in visual-language models. Extensive experiments demonstrate that our method surpasses SOTA on multiple 3D scene understanding benchmarks.
Our approach achieves a 55% reduction in inference latency while maintaining strong performance across diverse downstream tasks, underscoring the effectiveness of meticulously designed auxiliary objectives for dependency-free 3D understanding. Source code can be found at github.com/ChushanZhang/3D-IDE.
[92] XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation
Xinyu Liu,Qing Xu,Zhen Chen
Main category: cs.CV
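TL;DR: This paper proposes Cross-Stage Attention Residuals (XAttnRes), a mechanism for image segmentation networks that uses lightweight pseudo-query attention to dynamically aggregate the feature history across encoder and decoder stages, improving existing networks and remaining on par with the baseline even without conventional skip connections.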
Details
Motivation: Segmentation networks rely on fixed skip connections to pass information across stages, whereas Attention Residuals in LLMs show that learned, selective aggregation can do better; the paper transfers this idea to multi-scale segmentation architectures and addresses the mismatch between same-dimensional Transformer layers and cross-resolution segmentation features.
Method: XAttnRes maintains a global feature history pool that accumulates encoder and decoder stage outputs; through lightweight pseudo-query attention, each stage selectively aggregates from all preceding representations. Spatial alignment and channel projection steps handle cross-resolution features.
Result: XAttnRes consistently improves existing segmentation networks across four datasets and three imaging modalities; even with all conventional skip connections removed, XAttnRes alone performs on par with the baseline.
Conclusion: Learned cross-stage attention residuals can stand in for predetermined connections and model the information flow between encoder and decoder, suggesting a new design option for segmentation networks.
Abstract: In the field of Large Language Models (LLMs), Attention Residuals have recently demonstrated that learned, selective aggregation over all preceding layer outputs can outperform fixed residual connections. We propose Cross-Stage Attention Residuals (XAttnRes), a mechanism that maintains a global feature history pool accumulating both encoder and decoder stage outputs. Through lightweight pseudo-query attention, each stage selectively aggregates from all preceding representations. To bridge the gap between the same-dimensional Transformer layers in LLMs and the multi-scale encoder-decoder stages in segmentation networks, XAttnRes introduces spatial alignment and channel projection steps that handle cross-resolution features with negligible overhead. When added to existing segmentation networks, XAttnRes consistently improves performance across four datasets and three imaging modalities. We further observe that XAttnRes alone, even without skip connections, achieves performance on par with the baseline, suggesting that learned aggregation can recover the inter-stage information flow traditionally provided by predetermined connections.
[93] MoViD: View-Invariant 3D Human Pose Estimation via Motion-View Disentanglement
Yejia Liu,Hengle Jiang,Haoxian Liu,Runxi Huang,Xiaomin Ouyang
Main category: cs.CV
TL;DR: MoViD is a viewpoint-invariant 3D human pose estimation framework that disentangles viewpoint and motion features via a view estimator and orthogonal projection, enhanced by physics-grounded contrastive alignment; it achieves superior accuracy, robustness under occlusion with less data, and real-time edge inference.
Details
Motivation: Existing 3D human pose estimation methods suffer from poor generalization to unseen camera viewpoints, high data dependency, and slow inference, hindering real-world deployment in healthcare, robotics, and gaming.
Method: MoViD introduces a view estimator modeling key joint relationships to predict viewpoint, an orthogonal projection module to disentangle motion and view features, and physics-grounded contrastive alignment across views; it further employs a frame-by-frame, view-aware inference pipeline with adaptive flip refinement for edge deployment.
Result: MoViD reduces pose estimation error by >24.2% over SOTA on nine public and two newly collected multiview datasets (UAV, gait), maintains robust performance under severe occlusions using 60% less training data, and runs at 15 FPS on NVIDIA edge devices.
Conclusion: MoViD effectively addresses viewpoint generalization, data efficiency, and latency bottlenecks in 3D pose estimation, enabling practical real-world deployment on resource-constrained edge platforms.
Abstract: 3D human pose estimation is a key enabling technology for applications such as healthcare monitoring, human-robot collaboration, and immersive gaming, but real-world deployment remains challenged by viewpoint variations. Existing methods struggle to generalize to unseen camera viewpoints, require large amounts of training data, and suffer from high inference latency. We propose MoViD, a viewpoint-invariant 3D human pose estimation framework that disentangles viewpoint information from motion features. The key idea is to extract viewpoint information from intermediate pose features and leverage it to enhance both the robustness and efficiency of pose estimation. MoViD introduces a view estimator that models key joint relationships to predict viewpoint information, and an orthogonal projection module to disentangle motion and view features, further enhanced through physics-grounded contrastive alignment across views.
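To make the idea of an orthogonal projection concrete, the following minimal sketch removes from a feature vector its component along a given view direction, leaving a residual that is orthogonal to that direction. This is the textbook vector projection, not MoViD's actual module; the function name `project_out` and the toy vectors are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(
    feature: Sequence[float],
    view_direction: Sequence[float],
    eps: float = 1e-12,
) -> List[float]:
    """Return `feature` minus its projection onto `view_direction`."""
    denom = dot(view_direction, view_direction)
    if denom < eps:
        # A (near-)zero direction carries no view information to remove.
        return list(feature)
    scale = dot(feature, view_direction) / denom
    return [f - scale * v for f, v in zip(feature, view_direction)]


# Toy example: the view direction is the first axis (an assumption).
feature = [3.0, 4.0, 0.0]
view = [1.0, 0.0, 0.0]
motion = project_out(feature, view)
```

In this toy case the residual keeps only the part of the feature that is orthogonal to the view axis, so its dot product with `view` is zero.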
For real-time edge deployment, MoViD employs a frame-by-frame inference pipeline with a view-aware strategy that adaptively activates flip refinement based on the estimated viewpoint. Evaluations on nine public datasets and newly collected multiview UAV and gait analysis datasets show that MoViD reduces pose estimation error by over 24.2% compared to state-of-the-art methods, maintains robust performance under severe occlusions with 60% less training data, and achieves real-time inference at 15 FPS on NVIDIA edge devices.
[94] Embedding-Only Uplink for Onboard Retrieval Under Shift in Remote Sensing
Sangcheol Sim
Main category: cs.CV
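TL;DR: This paper studies whether onboard triage and retrieval that relies only on uplinked embeddings remains useful under remote-sensing distribution shift: cross-time, cross-event/location, cross-site cloud, and cross-city area of interest (AOI). All effective methods share the same uplinked embeddings, but the best decision head depends on the task: kNN retrieval is better for cloud classification, while class centroids clearly win on temporal change detection. Embedding uplink is the key enabler; once embeddings are onboard, heads can be switched at no additional uplink cost, with telemetry under 1 KB per query.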
Details
Motivation: Downlink bottlenecks motivate onboard systems that prioritize hazards without transmitting raw pixels; such systems must also cope with the explicit distribution shifts common in remote sensing (cross-time, cross-event, cross-site, and so on).
Method: Using OlmoEarth embeddings on a scaled multi-task benchmark (27 Sentinel-2 L2A scenes, 15 cloud sites, 5 SpaceNet-2 AOIs, 10 seeds), the paper compares two decision heads, kNN retrieval and class centroids, under each shift setting.
Result: kNN retrieval is significantly better than class centroids on cloud classification (0.92 vs. 0.91, p<0.01, Wilcoxon); class centroids far outperform kNN retrieval on temporal change detection (0.85 vs. 0.48, p<0.01); all tasks use the same uplinked embeddings, with telemetry under 1 KB per query.
Conclusion: An embedding-only uplink architecture is feasible and efficient: once embeddings are onboard, the system can select the best decision head per task with no extra bandwidth, which suits resource-constrained onboard intelligence.
Abstract: Downlink bottlenecks motivate onboard systems that prioritize hazards without transmitting raw pixels. We study a strict setting where a ground station uplinks only compact embeddings plus metadata, and an onboard system performs vector search to triage new captures. We ask whether this embedding-only pipeline remains useful under explicit remote-sensing shift: cross-time (pre/post-event), cross-event/location (different disasters), cross-site cloud (15 geographic sites), and cross-city AOI holdout (buildings). Using OlmoEarth embeddings on a scaled public multi-task benchmark (27 Sentinel-2 L2A scenes, 15 cloud sites, 5 SpaceNet-2 AOIs; 10 seeds), we find that all effective methods rely on the same uplinked embeddings, but the optimal decision head is task-dependent: kNN retrieval is significantly superior for cloud classification (0.92 vs. centroid 0.91; p<0.01, Wilcoxon), while class centroids dominate temporal change detection (0.85 vs. retrieval 0.48; p<0.01). These results show that embedding-only uplink is the key enabler: once embeddings are onboard, the system can select the best head per task at no additional uplink cost, with all telemetry under 1 KB per query.
[95] Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
Nanxi Li,Xiang Wang,Yuanjie Chen,Haode Zhang,Hong Li,Yong-Lu Li
Main category: cs.CV
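TL;DR: This paper exposes substantial weaknesses of multimodal large language models (MLLMs) in intuitive physics understanding, especially the dynamics of continuum objects, and proposes Scene Dynamic Field (SDF), which uses physics simulators within multi-task fine-tuning to substantially improve performance on physical reasoning tasks.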
Details
Motivation: Current MLLMs perform well on image and video understanding but struggle with high-level physics reasoning, starting with intuitive physics understanding; they have particular difficulty with the dynamics of continuum objects.
Method: Two fundamental benchmark tasks, Next Frame Selection (NFS) and Temporal Coherence Verification (TCV), isolate and evaluate intuitive physics understanding; Scene Dynamic Field (SDF) then incorporates physics simulators into a multi-task fine-tuning framework.
Result: Even state-of-the-art MLLMs perform poorly on NFS and TCV; SDF achieves gains of up to 20.7% on fluid tasks and generalizes well to unseen physical domains.
Conclusion: Current MLLMs have a critical gap in understanding the physical world; SDF is a cost-efficient way to narrow it and a promising direction for building more physically grounded MLLMs.
Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite their improvements, current MLLMs struggle significantly with high-level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, revealing substantial limitations in understanding the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi-task fine-tuning framework. SDF substantially improves performance, achieving up to 20.7% gains on fluid tasks while showing strong generalization to unseen physical domains. This work not only highlights a critical gap in current MLLMs but also presents a promising cost-efficient approach for developing more physically grounded MLLMs. Our code and data are available at https://github.com/andylinx/Scene-Dynamic-Field.
[96] HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
Mingjin Chen,Junhao Chen,Zhaoxin Fan,Yujian Lee,Zichen Dang,Lili Wang,Yawen Cui,Lap-Pui Chau,Yi Wang
Main category: cs.CV
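TL;DR: This paper proposes HVG-3D, a diffusion-based framework for hand-object interaction video synthesis conditioned on explicit 3D inputs; with a 3D ControlNet and a hybrid signal construction pipeline, it reaches state-of-the-art spatial fidelity, temporal coherence, and controllability.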
Details
Motivation: Most existing methods rely on 2D control signals that lack spatial expressiveness and cannot make full use of 3D conditional data, which limits the spatial expressiveness and controllability of hand-object interaction video synthesis.
Method: HVG-3D has two core components: (1) a diffusion-based, 3D-aware HOI video generation architecture that encodes geometric and motion cues from 3D inputs; and (2) a hybrid pipeline for constructing input and condition signals. A 3D ControlNet strengthens explicit 3D reasoning.
Result: On the TASTE-Rob dataset, HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability, and makes effective use of both real and simulated 3D data.
Conclusion: HVG-3D shows the value of explicit 3D conditioning for HOI video synthesis and points toward precise, highly controllable embodied visual generation.
Abstract: Recent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations. Specifically, we develop a diffusion-based architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during video synthesis. To achieve high-quality synthesis, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning; and (ii) a hybrid pipeline for constructing input and condition signals, enabling flexible and precise control during both training and inference. During inference, given a single real image and a 3D control signal from either simulation or real data, HVG-3D generates high-fidelity, temporally consistent videos with precise spatial and temporal control. Experiments on the TASTE-Rob dataset demonstrate that HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability, while enabling effective utilization of both real and simulated data.
[97] Deep Image Clustering Based on Curriculum Learning and Density Information
Haiyang Zheng,Ruilin Zhang,Hongpeng Wang
Main category: cs.CV
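TL;DR: This paper proposes IDCL, a robust image clustering method based on density information; it is, to the authors' knowledge, the first to bring a density-guided curriculum learning strategy to deep clustering, and it assigns clusters using density cores instead of conventional cluster centers, improving robustness, convergence speed, and generalization.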
Details
Motivation: Existing deep clustering methods rarely consider how the model learning strategy affects robustness and performance, and their reliance on point-to-center distances causes errors to accumulate.
Method: IDCL designs a curriculum learning scheme grounded in the density information of the input data to set the learning pace, and uses density cores rather than individual cluster centers to guide cluster assignment.
Result: Experiments on multiple benchmark datasets show that IDCL outperforms state-of-the-art methods in robustness, convergence speed, and flexibility with respect to data scale, number of clusters, and image context.
Conclusion: Density information can improve the stability and performance of deep image clustering, and IDCL offers a new approach to clustering complex image data.
Abstract: Image clustering is one of the crucial techniques in multimedia analytics and knowledge discovery. Recently, deep clustering (DC) methods, characterized by their ability to perform feature learning and cluster assignment jointly, have surpassed traditional methods on image data. However, existing methods rarely consider the role of model learning strategies in improving the robustness and performance of clustering complex image data. Furthermore, most approaches rely solely on point-to-point distances to cluster centers for partitioning the latent representations, resulting in error accumulation throughout the iterative process. In this paper, we propose a robust image clustering method (IDCL) which, to our knowledge for the first time, introduces a model training strategy using density information into image clustering. Specifically, we design a curriculum learning scheme grounded in the density information of input data, with a more reasonable learning pace. Moreover, we employ the density core rather than the individual cluster center to guide the cluster assignment. Finally, extensive comparisons with state-of-the-art clustering approaches on benchmark datasets demonstrate the superiority of the proposed method, including robustness, rapid convergence, and flexibility in terms of data scale, number of clusters, and image context.
[98] V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
Jiazhou Zhou,Yucheng Chen,Hongyang Li,Qing Jiang,Hu Zhou,Ying-Cong Chen,Lei Zhang
Main category: cs.CV
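TL;DR: This paper proposes V-Reflection, a framework whose "think-then-look" visual reflection mechanism lets multimodal large language models (MLLMs) actively query the visual feature space during reasoning, mitigating hallucinations in fine-grained perception tasks.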
Details
Motivation: Existing MLLMs are prone to perception-related hallucinations in fine-grained tasks because their reasoning is confined to the language domain: visual input is treated as a static preamble, and the model cannot revisit visual details to support its reasoning.
Method: V-Reflection comprises two modules, Box-Guided Compression (BCM) and Dynamic Autoregressive Compression (DAC): BCM establishes pixel-to-latent targets through explicit spatial grounding, and DAC maps hidden states into dynamic probes that query the global visual feature map; a two-stage distillation strategy internalizes the localization ability.
Result: The fine-grained perception gap narrows significantly across six perception-intensive benchmarks; visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence; at inference both modules are entirely inactive, preserving efficient end-to-end autoregressive decoding.
Conclusion: V-Reflection turns the MLLM from a passive observer into an active interrogator, improving visual grounding and robustness in fine-grained perception without sacrificing inference efficiency.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states. To overcome this, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a "think-then-look" visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step for task-critical evidence. Our approach employs a two-stage distillation strategy. First, the Box-Guided Compression (BCM) module establishes stable pixel-to-latent targets through explicit spatial grounding. Next, a Dynamic Autoregressive Compression (DAC) module maps the model's hidden states into dynamic probes that interrogate the global visual feature map. By distilling the spatial expertise of the BCM teacher into the DAC student, V-Reflection internalizes the ability to localize task-critical evidence. During inference, both modules remain entirely inactive, maintaining a purely end-to-end autoregressive decoding in the latent space with optimal efficiency.
Extensive experiments demonstrate the effectiveness of V-Reflection across six perception-intensive benchmarks, significantly narrowing the fine-grained perception gap. Visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence.
[99] Edge-Based Standing-Water Detection via FSM-Guided Tiering and Multi-Model Consensus
Oliver Aleksander Larsen,Mahyar T. Moghaddam
Main category: cs.CV
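TL;DR: This paper presents a standing-water detection architecture deployed on edge devices that combines a camera with environmental sensors, uses a finite-state machine to switch between local and offloaded inference, and fuses a multi-model YOLO ensemble with diurnal environmental baselines to improve detection, reduce energy use, and keep latency bounded.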
Details
Motivation: Standing water in agricultural fields threatens vehicle mobility and crop health, which calls for low-power, robust, real-time detection at the edge.
Method: An edge architecture on Raspberry-Pi-class devices with optional Jetson acceleration combines camera input with humidity, pressure, and temperature sensors; a finite-state machine (FSM) acts as the decision engine and selects between local and offloaded inference tiers; a multi-model YOLO ensemble provides image scores; diurnal-baseline sensor fusion adjusts caution; all decisions are logged per frame, enabling hardware-in-the-loop replays.
Result: Across ten configurations and sensor variants, detection performance improves over static local baselines, energy use is lower than a naive always-heavy offload policy, and tail latency stays bounded in a real agricultural setting.
Conclusion: Adaptive tiering, multi-model consensus, and diurnal sensor fusion together yield an efficient, energy-saving, robust edge system for standing-water detection, suited to agricultural settings with intermittent connectivity and motion-dependent compute budgets.
Abstract: Standing water in agricultural fields threatens vehicle mobility and crop health. This paper presents a deployed edge architecture for standing-water detection using Raspberry-Pi-class devices with optional Jetson acceleration. Camera input and environmental sensors (humidity, pressure, temperature) are combined in a finite-state machine (FSM) that acts as the architectural decision engine. The FSM-guided control plane selects between local and offloaded inference tiers, trading accuracy, latency, and energy under intermittent connectivity and motion-dependent compute budgets. A multi-model YOLO ensemble provides image scores, while diurnal-baseline sensor fusion adjusts caution using environmental anomalies. All decisions are logged per frame, enabling bit-identical hardware-in-the-loop replays. Across ten configurations and sensor variants on identical field sequences with frame-level ground truth, we show that the combination of adaptive tiering, multi-model consensus, and diurnal sensor fusion improves flood-detection performance over static local baselines, uses less energy than a naive always-heavy offload policy, and maintains bounded tail latency in a real agricultural setting.
[100] TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding
Jingbin You,Zehao Li,Hao Jiang,Xinzhu Ma,Shuqin Gao,Honglong Zhao,Congcong Zheng,Tianlu Mao,Feng Dai,Yucheng Zhang,Zhaoqi Wang
Main category: cs.CV
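TL;DR: This paper proposes TreeGaussian, a tree-guided cascaded contrastive learning framework that improves how 3D Gaussian Splatting models hierarchical semantic structure and maintains segmentation consistency.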
Details
Motivation: Existing 3D Gaussian Splatting methods struggle to model the object-part hierarchical semantic structure of scenes, and inconsistent labels from 2D priors together with dense pairwise contrast lead to suboptimal segmentation. Method: Build a multi-level object tree to model hierarchical relationships explicitly; design a two-stage cascaded contrastive learning strategy (global to local); introduce a Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module to align segmentation across views and suppress unstable Gaussian points. Result: Significant gains on tasks including open-vocabulary 3D object selection and 3D point cloud understanding; ablation studies validate each module. Conclusion: TreeGaussian strengthens the hierarchical semantic representation and segmentation consistency of 3DGS, showing greater robustness and practicality in complex scenes. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a real-time, differentiable representation for neural scene understanding. However, existing 3DGS-based methods struggle to represent hierarchical 3D semantic structures and capture whole-part relationships in complex scenes. Moreover, dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, resulting in suboptimal segmentation. To address these limitations, we introduce TreeGaussian, a tree-guided cascaded contrastive learning framework that explicitly models hierarchical semantic relationships and reduces redundancy in contrastive supervision. By constructing a multi-level object tree, TreeGaussian enables structured learning across object-part hierarchies. In addition, we propose a two-stage cascaded contrastive learning strategy that progressively refines feature representations from global to local, mitigating saturation and stabilizing training. A Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module are further introduced to align segmentation modes across views while suppressing unstable Gaussian points, enhancing segmentation consistency and quality. Extensive experiments, including open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies, demonstrate the effectiveness and robustness of our approach.
[101] Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions
Haichao Wang,Alexander Okupnik,Yuxing Han,Gene Wen,Johannes Schneider,Kyriakos Flouris
Main category: cs.CV
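TL;DR: This paper proposes a diffusion-based inference-time optimization framework for long-range human motion generation across semantic domains, particularly suited to applications such as dance choreography that require fluid style transitions.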
Details
Motivation: Long-range human motion generation is a central challenge in computer vision and graphics, and effective methods are lacking for coherent transitions across distinct semantic or stylistic domains (as in dance choreography). Method: An inference-time optimization framework inspired by diffusion-based stochastic optimal control, built around a control-energy objective that explicitly regularizes the transition trajectories generated by a pretrained diffusion model. Result: Optimizing this objective at inference time yields high-fidelity, temporally coherent cross-domain motion transitions. Conclusion: This is the first work to provide a general framework for controllable long-range human motion generation with explicit transition modeling. Abstract: Long-range human movement generation remains a central challenge in computer vision and graphics. Generating coherent transitions across semantically distinct motion domains remains largely unexplored. This capability is particularly important for applications such as dance choreography, where movements must fluidly transition across diverse stylistic and semantic motifs. We propose a simple and effective inference-time optimization framework inspired by diffusion-based stochastic optimal control. Specifically, we introduce a control-energy objective that explicitly regularizes the transition trajectories of a pretrained diffusion model. We show that optimizing this objective at inference time yields transitions with fidelity and temporal coherence. This is the first work to provide a general framework for controlled long-range human motion generation with explicit transition modeling.
[102] PollutionNet: A Vision Transformer Framework for Climatological Assessment of NO$_2$ and SO$_2$ Using Satellite-Ground Data Fusion
Prasanjit Dey,Soumyabrata Dev,Bianca Schoen-Phelan
Main category: cs.CV
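TL;DR: This paper proposes PollutionNet, a Vision Transformer-based framework that fuses satellite (Sentinel-5P TROPOMI) and ground-level observations and markedly improves NO₂ and SO₂ concentration prediction accuracy, especially in regions with sparse monitoring networks.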
Details
Motivation: Traditional air-pollutant monitoring suffers from gaps in satellite data and limited spatial coverage of ground stations, which calls for a method that fuses multi-source data and models both space and time. Method: PollutionNet, a Vision Transformer-based model, uses self-attention to capture the complex spatiotemporal dependencies of NO₂ and SO₂ and fuses TROPOMI vertical column density (VCD) with ground-measured concentrations. Result: In the Ireland 2020-2021 case study, RMSE is 6.89 μg/m³ for NO₂ and 4.49 μg/m³ for SO₂, with errors up to 14% lower than baseline models; the approach is scalable and data-efficient. Conclusion: PollutionNet demonstrates the potential of advanced machine learning to improve air quality assessment and support environmental policy-making, especially in regions with weak monitoring. Abstract: Accurate assessment of atmospheric nitrogen dioxide (NO$_2$) and sulfur dioxide (SO$_2$) is essential for understanding climate-air quality interactions, supporting environmental policy, and protecting public health. Traditional monitoring approaches face limitations: satellite observations provide broad spatial coverage but suffer from data gaps, while ground-based sensors offer high temporal resolution but limited spatial extent. To address these challenges, we propose PollutionNet, a Vision Transformer-based framework that integrates Sentinel-5P TROPOMI vertical column density (VCD) data with ground-level observations. By leveraging self-attention mechanisms, PollutionNet captures complex spatiotemporal dependencies that are often missed by conventional CNN and RNN models. Applied to Ireland (2020-2021), our case study demonstrates that PollutionNet achieves state-of-the-art performance (RMSE: 6.89 $\mu$g/m$^3$ for NO$_2$, 4.49 $\mu$g/m$^3$ for SO$_2$), reducing prediction errors by up to 14% compared to baseline models. Beyond accuracy gains, PollutionNet provides a scalable and data-efficient tool for applied climatology, enabling robust pollution assessments in regions with sparse monitoring networks. These results highlight the potential of advanced machine learning approaches to enhance climate-related air quality research, inform environmental management, and support sustainable policy decisions.
[103] CardioSAM: Topology-Aware Decoder Design for High-Precision Cardiac MRI Segmentation
Ujjwal Jain
Main category: cs.CV
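TL;DR: This paper proposes CardioSAM, a hybrid architecture that pairs a frozen SAM encoder with a lightweight cardiac-specific decoder. With a Cardiac-Specific Attention module and a Boundary Refinement Module, it achieves high-precision cardiac structure segmentation on the ACDC dataset, surpassing nnU-Net and reported inter-expert agreement.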
Details
Motivation: Existing foundation models such as SAM generalize well but lack the boundary precision that clinical use requires, while manual segmentation is time-consuming and shows significant inter-observer variability. Method: CardioSAM freezes the SAM encoder to extract general features and adds a lightweight trainable decoder containing a Cardiac-Specific Attention module that incorporates anatomical topological priors and a Boundary Refinement Module that improves delineation at tissue interfaces. Result: On the ACDC benchmark it reaches Dice 93.39%, IoU 87.61%, pixel accuracy 99.20%, and HD95 4.2 mm; Dice is 3.89% higher than nnU-Net and exceeds the reported inter-expert agreement level (91.2%). Conclusion: CardioSAM retains strong generalization while markedly improving boundary precision in cardiac structure segmentation, indicating potential for clinical application. Abstract: Accurate segmentation of cardiac structures in cardiovascular magnetic resonance (CMR) images is essential for reliable diagnosis and treatment of cardiovascular diseases. However, manual segmentation remains time-consuming and suffers from significant inter-observer variability. Recent advances in deep learning, particularly foundation models such as the Segment Anything Model (SAM), demonstrate strong generalization but often lack the boundary precision required for clinical applications. To address this limitation, we propose CardioSAM, a hybrid architecture that combines the generalized feature extraction capability of a frozen SAM encoder with a lightweight, trainable cardiac-specific decoder. The proposed decoder introduces two key innovations: a Cardiac-Specific Attention module that incorporates anatomical topological priors, and a Boundary Refinement Module designed to improve tissue interface delineation. Experimental evaluation on the ACDC benchmark demonstrates that CardioSAM achieves a Dice coefficient of 93.39%, IoU of 87.61%, pixel accuracy of 99.20%, and HD95 of 4.2 mm. The proposed method surpasses strong baselines such as nnU-Net by +3.89% Dice and exceeds reported inter-expert agreement levels (91.2%), indicating its potential for reliable and clinically applicable cardiac segmentation.
[104] CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks
Wish Suharitdamrong,Tony Alex,Muhammad Awais,Sara Ahmed
Main category: cs.CV
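TL;DR: This paper proposes Cross-Modal Low-Rank Adaptation (CoLA), a new parameter-efficient fine-tuning framework that extends LoRA to support cross-modal adaptation. It outperforms LoRA on vision-language and audio-visual tasks while remaining parameter-efficient.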
Details
Motivation: Existing PEFT methods such as LoRA adapt each modality independently in dual-stream multimodal architectures, so they struggle to model cross-modal interactions, which limits their effectiveness on multimodal tasks. Method: The CoLA framework introduces a dedicated cross-modal low-rank adaptation pathway that runs in parallel with the existing unimodal adaptation pathway, decoupling intra-modal and cross-modal learning while letting them cooperate. Result: On multimodal benchmarks including the RefCOCO family, AVE, and AVS, CoLA achieves relative gains over LoRA of about 3% (vision-language) and 2% (audio-visual), and enables the first multi-task PEFT framework for visual grounding. Conclusion: CoLA addresses cross-modal adaptation in dual-stream architectures, improving multimodal task performance while staying parameter-efficient, and offers a new paradigm for efficient adaptation of multimodal foundation models. Abstract: Foundation models have revolutionized AI, but adapting them efficiently for multimodal tasks, particularly in dual-stream architectures composed of unimodal encoders, such as DINO and BERT, remains a significant challenge. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) enable lightweight adaptation, yet they operate in isolation within each modality, limiting their ability to capture cross-modal interactions. In this paper, we take a step in bridging this gap with Cross-Modal Low-Rank Adaptation (CoLA), a novel PEFT framework that extends LoRA by introducing a dedicated inter-modal adaptation pathway alongside the standard intra-modal one. This dual-path design enables CoLA to adapt unimodal foundation models to multimodal tasks effectively, without interference between modality-specific and cross-modal learning. We evaluate CoLA across a range of vision-language (RefCOCO, RefCOCO+, RefCOCOg) and audio-visual (AVE, AVS) benchmarks, where it consistently outperforms LoRA, achieving a relative gain of around 3\% and 2\%, respectively, while maintaining parameter efficiency. Notably, CoLA enables the first multi-task PEFT framework for visual grounding, bridging a key gap in efficient multimodal adaptation.
[105] StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics
Bingliang Li,Zhenhong Sun,Jiaming Bian,Yuehao Wu,Yifu Wang,Hongdong Li,Yatao Bian,Huadong Mo,Daoyi Dong
Main category: cs.CV
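TL;DR: This paper proposes StoryBlender, a 3D-grounded storyboard generation framework whose three-stage pipeline (Semantic-Spatial Grounding, Canonical Asset Materialization, and Spatial-Temporal Dynamics) delivers both inter-shot consistency and explicit editability, outperforming existing diffusion models and traditional 3D workflows.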
Details
Motivation: Existing automatic storyboard generation methods rarely achieve both inter-shot consistency (for example, stable character identity) and explicit editability (for example, camera and asset adjustments); 2D diffusion models are prone to identity drift and offer weak geometric control, while traditional 3D workflows are consistent and editable but depend on experts and are costly. Method: The StoryBlender framework has three core stages: (1) Semantic-Spatial Grounding, which builds a continuity memory graph to decouple global assets from shot-specific variables; (2) Canonical Asset Materialization, which instantiates entities in a unified coordinate space to preserve visual identity; and (3) Spatial-Temporal Dynamics, which evolves layout and camera motion based on visual metrics. It also uses a self-correction mechanism that combines hierarchical multi-agent collaboration with engine-verified feedback. Result: Experiments show that StoryBlender clearly outperforms diffusion-based and 3D-grounded baselines in inter-shot consistency and editability; its outputs are native 3D scenes that support direct, precise editing of cameras and assets while keeping strong continuity across shots. Conclusion: StoryBlender offers a new paradigm for automated storyboard generation that balances consistency and controllability, bridging the gap between generative AI and professional 3D production workflows. Abstract: Storyboarding is a core skill in visual storytelling for film, animation, and games. However, automating this process requires a system to achieve two properties that current approaches rarely satisfy simultaneously: inter-shot consistency and explicit editability. While 2D diffusion-based generators produce vivid imagery, they often suffer from identity drift along with limited geometric control; conversely, traditional 3D animation workflows are consistent and editable but require expert-heavy, labor-intensive authoring. We present StoryBlender, a grounded 3D storyboard generation framework governed by a Story-centric Reflection Scheme. At its core, we propose the StoryBlender system, which is built on a three-stage pipeline: (1) Semantic-Spatial Grounding, to construct a continuity memory graph to decouple global assets from shot-specific variables for long-horizon consistency; (2) Canonical Asset Materialization, to instantiate entities in a unified coordinate space to maintain visual identity; and (3) Spatial-Temporal Dynamics, to achieve layout design and cinematic evolution through visual metrics. By orchestrating multiple agents in a hierarchical manner within a verification loop, StoryBlender iteratively self-corrects spatial hallucinations via engine-verified feedback. The resulting native 3D scenes support direct, precise editing of cameras and visual assets while preserving unwavering multi-shot continuity. 
Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines. Code, data, and demonstration video will be available at https://engineeringai-lab.github.io/StoryBlender/
[106] When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models
Jiho Choi,Jaemin Kim,Sanghwan Kim,Seunghoon Hong,Jin-Hwi Park
Main category: cs.CV
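TL;DR: This paper studies attention sinks in large vision-language models (LVLMs), distinguishing ViT-emerged sinks (V-sinks) from LLM-emerged sinks (L-sinks). It reveals the trade-off they impose between global priors and local perception, and designs a lightweight Layer-wise Sink Gating (LSG) module that improves multimodal task performance without fine-tuning the model backbone.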
Details
Motivation: Investigate the cross-modal impact of attention sinks in large vision-language models (LVLMs) and clarify whether they are redundant artifacts or essential global priors. Method: Categorize visual sinks into ViT-emerged sinks (V-sinks) and LLM-emerged sinks (L-sinks); analyze their functional impact across layers; design a plug-and-play Layer-wise Sink Gating (LSG) module that needs no additional supervision and dynamically scales V-sink attention weights. Result: In most layers, LSG improves performance on several mainstream multimodal benchmarks (e.g., VQAv2, OK-VQA, TextVQA), balancing global reasoning against local detail perception. Conclusion: Visual sinks are key structures with a dual role: they carry global scene priors but can also suppress local visual evidence; layer-wise controllable modulation unlocks their potential and improves performance without modifying the LVLM backbone. Abstract: Attention sinks are defined as tokens that attract disproportionate attention. While these have been studied in single-modality transformers, their cross-modal impact in Large Vision-Language Models (LVLM) remains largely unexplored: are they redundant artifacts or essential global priors? This paper first categorizes visual sinks into two distinct categories: ViT-emerged sinks (V-sinks), which propagate from the vision encoder, and LLM-emerged sinks (L-sinks), which arise within deep LLM layers. Based on the new definition, our analysis reveals a fundamental performance trade-off: while sinks effectively encode global scene-level priors, their dominance can suppress the fine-grained visual evidence required for local perception. Furthermore, we identify specific functional layers where modulating these sinks most significantly impacts downstream performance. To leverage these insights, we propose Layer-wise Sink Gating (LSG), a lightweight, plug-and-play module that dynamically scales the attention contributions of V-sinks and the remaining visual tokens. LSG is trained via standard next-token prediction, requiring no task-specific supervision while keeping the LVLM backbone frozen. In most layers, LSG yields improvements on representative multimodal benchmarks, effectively balancing global reasoning and precise local evidence.
[107] Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning
Junyuan Liang,Qi Zhou,Sahan Bulathwela,Mutlu Cukurova
Main category: cs.CV
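TL;DR: This study proposes a scalable AI approach that needs no human-annotated data and uses pretrained models (YOLO11, YOLOE-26, and Gaze-LLE) to detect gaze behaviours automatically in face-to-face collaborative learning, reaching an F1-score of 0.829 while showing stronger cross-configuration robustness.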
Details
Motivation: Existing machine-learning methods for gaze behaviour detection depend on large amounts of human-annotated data and lack cross-configuration robustness in the varied settings found in real educational contexts. Method: Pretrained YOLO11 handles person tracking, YOLOE-26 (with text-prompt capability) detects education-related objects, and the Gaze-LLE model predicts gaze targets, with no human annotation at any stage. Result: The approach reaches an F1-score of 0.829, with strong performance for laptop-directed and peer-directed gaze and weaker performance for other gaze targets; compared with supervised learning methods, it performs better and more stably in complex contexts. Conclusion: This annotation-free, foundation-model-based approach improves the scalability and real-world applicability of gaze behaviour analysis and opens a new path toward real-time feedback and reflection support in collaborative learning. Abstract: Previous studies have illustrated the potential of analysing gaze behaviours in collaborative learning to provide educationally meaningful information for students to reflect on their learning. Over the past decades, machine learning approaches have been developed to automatically detect gaze behaviours from video data. Yet, since these approaches often require large amounts of labelled data for training, human annotation remains necessary. Additionally, researchers have questioned the cross-configuration robustness of machine learning models developed, as training datasets often fail to encompass the full range of situations encountered in educational contexts. To address these challenges, this study proposes a scalable artificial intelligence approach that leverages pretrained and foundation models to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. The approach utilises pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction. The results indicate that the proposed approach achieves an F1-score of 0.829 in detecting students' gaze behaviours from video data, with strong performance for laptop-directed gaze and peer-directed gaze, yet weaker performance for other gaze targets. Furthermore, when compared to other supervised machine learning approaches, the proposed method demonstrates superior and more stable performance in complex contexts, highlighting its better cross-configuration robustness. 
The implications of this approach for supporting students' collaborative learning in real-world environments are also discussed.
[108] EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
Zhenghao Chen,Huiqun Wang,Di Huang
Main category: cs.CV
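TL;DR: This paper proposes EgoMind, a Chain-of-Thought framework that needs no geometric priors (Role-Play Caption plus Progressive Spatial Analysis) and strengthens the spatial reasoning of multimodal large language models with only a small amount of data.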
Details
Motivation: Existing methods rely on 3D priors or geometric supervision, which makes data preparation and alignment costly, while purely 2D methods struggle to model cross-frame spatial relationships. Method: EgoMind is a Chain-of-Thought framework comprising Role-Play Caption (which builds a linguistically coherent scene graph across frames) and Progressive Spatial Analysis (progressive, task-oriented reasoning). It is trained with only 5K SFT samples and 20K RL samples. Result: Competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench. Conclusion: The work demonstrates the potential of purely linguistic reasoning for spatial cognition and offers a new path to lowering the data and alignment costs of spatial reasoning. Abstract: Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.
[109] Robust Multi-Source Covid-19 Detection in CT Images
Asmita Yuki Pritha,Jason Xu,Daniel Ding,Justin Li,Aryana Hou,Xin Wang,Shu Hu
Main category: cs.CV
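TL;DR: This paper proposes a multi-task learning method that predicts both the COVID-19 diagnosis and the originating scan centre to improve generalization on multi-centre CT data, and applies a logit-adjusted loss to mitigate data imbalance.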
Details
Motivation: Existing deep learning models perform well on single-centre data but generalize poorly to multi-centre data (different scanners, protocols, and populations), mainly because they do not model data-source bias, so the learned representations lean toward centres that contribute more data. Method: A multi-task learning framework with a shared EfficientNet-B7 backbone predicts the COVID-19 diagnosis and the originating scan centre; the source classification head uses a logit-adjusted cross-entropy loss to mitigate imbalance across centres; pre-processing follows SSFL with KDS, selecting eight representative slices per CT scan. Result: An F1 score of 0.9098 and an AUC-ROC of 0.9647 on a validation set of 308 scans; the code is open source. Conclusion: Jointly modelling disease diagnosis and data source improves cross-centre generalization, and the logit-adjusted loss helps balance learning across sources, offering a feasible route to robust multi-centre modelling in medical imaging. Abstract: Deep learning models for COVID-19 detection from chest CT scans generally perform well when the training and test data originate from the same institution, but they often struggle when scans are drawn from multiple centres with differing scanners, imaging protocols, and patient populations. One key reason is that existing methods treat COVID-19 classification as the sole training objective, without accounting for the data source of each scan. As a result, the learned representations tend to be biased toward centres that contribute more training data. To address this, we propose a multi-task learning approach in which the model is trained to predict both the COVID-19 diagnosis and the originating data centre. The two tasks share an EfficientNet-B7 backbone, which encourages the feature extractor to learn representations that hold across all four participating centres. Since the training data is not evenly distributed across sources, we apply a logit-adjusted cross-entropy loss [1] to the source classification head to prevent underrepresented centres from being overlooked. Our pre-processing follows the SSFL framework with KDS [2], selecting eight representative slices per scan. Our method achieves an F1 score of 0.9098 and an AUC-ROC of 0.9647 on a validation set of 308 scans. The code is publicly available at https://github.com/Purdue-M2/-multisource-covid-ct.
[110] VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing
Junyi Zong,Qingxuan Jia,Meixian Shi,Tong Li,Jiayuan Li,Zihang Lv,Gang Chen,Fang Deng
Main category: cs.CV
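TL;DR: This paper proposes VitaTouch, a multimodal model that combines vision, touch, and language for material-property recognition and natural-language description in smart manufacturing, and builds the VitaSet dataset to validate it.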
Details
Motivation: Quality inspection must go beyond visible geometry to identify intrinsic material and surface properties, but vision-only methods are vulnerable to occlusion and reflection. Method: Modality-specific encoders and a dual Q-Former extract visual and tactile features, which are compressed into prefix tokens for a large language model; text alignment and contrastive learning explicitly couple vision and touch; the VitaSet multimodal dataset is constructed and LoRA fine-tuning is applied. Result: Best performance on the HCT and TVL benchmarks and competitive on SSVTP; on VitaSet, hardness accuracy, roughness accuracy, and descriptor recall reach 88.89%, 75.13%, and 54.81%, with a peak semantic similarity of 0.9009; defect recognition accuracy is 100.0%/96.0%/92.0% on the 2-/3-/5-category tasks, and robotic trials achieve 94.0% for both closed-loop recognition and end-to-end sorting success. Conclusion: VitaTouch fuses vision, touch, and language effectively, improving material-property understanding as well as defect recognition and sorting in industrial settings, and confirms the potential of multimodal cooperation for intelligent quality inspection. Abstract: Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall; the material-description task further achieves a peak semantic similarity of 0.9009. With LoRA-based fine-tuning, VitaTouch attains 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, and delivers 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials. More details are available at the project page: https://vitatouch.github.io/
[111] Safety-Aligned 3D Object Detection: Single-Vehicle, Cooperative, and End-to-End Perspectives
Brian Hsuan-Cheng Liao,Chih-Hong Cheng,Hasan Esen,Alois Knoll
Main category: cs.CV
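TL;DR: This paper presents a safety-oriented approach to evaluating and optimizing 3D object detection. Using the NDS-USC metric and the EC-IoU loss function, it demonstrates improved CAV safety in single-vehicle, vehicle-infrastructure cooperative, and end-to-end driving systems.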
Details
Motivation: Deep-learning perception carries statistical uncertainty, and conventional evaluation metrics do not single out safety-critical errors, which makes it hard to assure the real-world safety of connected and autonomous vehicles (CAVs). Method: Building on the safety-oriented metric NDS-USC and the safety-aware loss function EC-IoU, the work has three parts: 1) compare standard and safety metrics across a range of single-vehicle models; 2) run an ego-centric safety evaluation of vehicle-infrastructure cooperative detection models; 3) integrate EC-IoU into the SparseDrive end-to-end framework for perception hardening. Result: 1) Gains under standard metrics (such as mAP and NDS) do not necessarily bring gains in safety performance; 2) vehicle-infrastructure cooperative models clearly outperform vehicle-only models and support a safety impact analysis related to "Vision Zero"; 3) using EC-IoU in SparseDrive reduces the collision rate by nearly 30%. Conclusion: Safety-aligned perception evaluation and optimization is a practical path to improving the safety of single-vehicle, cooperative, and end-to-end autonomous driving systems. Abstract: Perception plays a central role in connected and autonomous vehicles (CAVs), underpinning not only conventional modular driving stacks, but also cooperative perception systems and recent end-to-end driving models. While deep learning has greatly improved perception performance, its statistical nature makes perfect predictions difficult to attain. Meanwhile, standard training objectives and evaluation benchmarks treat all perception errors equally, even though only a subset is safety-critical. In this paper, we investigate safety-aligned evaluation and optimization for 3D object detection that explicitly characterize high-impact errors. Building on our previously proposed safety-oriented metric, NDS-USC, and safety-aware loss function, EC-IoU, we make three contributions. First, we present an expanded study of single-vehicle 3D object detection models across diverse neural network architectures and sensing modalities, showing that gains under standard metrics such as mAP and NDS may not translate to safety-oriented criteria represented by NDS-USC. With EC-IoU, we reaffirm the benefit of safety-aware fine-tuning for improving safety-critical detection performance. Second, we conduct an ego-centric, safety-oriented evaluation of AV-infrastructure cooperative object detection models, underscoring their superiority over vehicle-only models and demonstrating a safety impact analysis that illustrates the potential contribution of cooperative models to "Vision Zero." 
Third, we integrate EC-IoU into SparseDrive and show that safety-aware perception hardening can reduce collision rate by nearly 30% and improve system-level safety directly in an end-to-end perception-to-planning framework. Overall, our results indicate that safety-aligned perception evaluation and optimization offer a practical path toward enhancing CAV safety across single-vehicle, cooperative, and end-to-end autonomy settings.
[112] Review and Evaluation of Point-Cloud based Leaf Surface Reconstruction Methods for Agricultural Applications
Arif Ahmed,Parikshit Maini
Main category: cs.CV
TL;DR: 本文对九种代表性叶面表面重建方法进行了比较研究,评估了它们在不同数据集上的性能,为资源受限的农业机器人平台提供了实用指导。
Details
Motivation: 真实世界中的植物数据(不规则3D点云)复杂,难以准确重建植物部分,现有方法在叶面重建中的相对性能尚不明确。 Method: 对九种代表性表面重建方法(参数化、基于三角剖分、隐式、基于学习等)在三个公开数据集(LAST-STRAW、Pheno4D、Crops3D)上进行对比评估。 Result: 分析揭示了各方法在表面积估计精度、平滑性、抗噪声与缺失数据鲁棒性及计算成本等方面的权衡;每种方法在不同应用场景和资源约束下表现各异。 Conclusion: 研究结果为资源受限的农业机器人平台选择合适的表面重建技术提供了实践指导。 Abstract: Accurate reconstruction of leaf surfaces from 3D point cloud is essential for agricultural applications such as phenotyping. However, real-world plant data (i.e., irregular 3D point cloud) are often complex to reconstruct plant parts accurately. A wide range of surface reconstruction methods has been proposed, including parametric, triangulation-based, implicit, and learning based approaches, yet their relative performance for leaf surface reconstruction remains insufficiently understood. In this work, we present a comparative study of nine representative surface reconstruction methods for leaf surfaces. We evaluate these methods on three publicly available datasets: LAST-STRAW, Pheno4D, and Crops3D - spanning diverse species, sensors, and sensing environments, ranging from clean high-resolution indoor scans to noisy low-resolution field settings. The analysis highlights the trade-offs between surface area estimation accuracy, smoothness, robustness to noise and missing data, and computational cost across different methods. These factors affect the cost and constraints of robotic hardware used in agricultural applications. Our results show that each method exhibits distinct advantages depending on application and resource constraints. The findings provide practical guidance for selecting surface reconstruction techniques for resource constrained robotic platforms.[113] CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection
Damith Chamalke Senadeera,Dimitrios Kollias,Gregory Slabaugh
Main category: cs.CV
TL;DR: CoLoRSMamba是一种新型的视频到音频多模态架构,通过CLS引导的条件LoRA耦合VideoMamba和AudioMamba,在无token级跨注意力的情况下实现场景感知的音频动态建模,显著提升暴力检测性能。
Details
Motivation: Real-world soundscapes can be noisy or only weakly related to the visual scene, so violence detection that relies on audio alone is limited, and a more robust way to model audio and video jointly is needed. Method: In the CoLoRSMamba architecture, the VideoMamba CLS token generates a channel-wise modulation vector and a stabilization gate that dynamically adjust the AudioMamba state-space parameters (Delta, B, C) and the step-size pathway; training combines a binary classification loss with a symmetric AV-InfoNCE contrastive objective. Result: On the audio-filtered subsets of NTU-CCTV and DVD it reaches 88.63%/86.24% and 75.77%/72.94% (accuracy/F1-V), outperforming unimodal and multimodal baselines with fewer parameters and FLOPs. Conclusion: CoLoRSMamba shows that conditional state-space modulation, without explicit cross-modal attention, is enough for efficient audio-visual cooperation, combining performance and efficiency advantages in violence detection. Abstract: Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional Video-to-Audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings. To support fair multimodal evaluation, we curate audio-filtered clip-level subsets of the NTU-CCTV and DVD datasets from temporal annotations, retaining only clips with available audio. On these subsets, CoLoRSMamba outperforms representative audio-only, video-only, and multimodal baselines, achieving 88.63% accuracy / 86.24% F1-V on NTU-CCTV and 75.77% accuracy / 72.94% F1-V on DVD. It further offers a favorable accuracy-efficiency tradeoff, surpassing several larger models with fewer parameters and FLOPs.
[114] Bridging the Dimensionality Gap: A Taxonomy and Survey of 2D Vision Model Adaptation for 3D Analysis
Akshat Pandya,Bhavuk Jain
Main category: cs.CV
TL;DR: This survey reviews and classifies strategies for adapting 2D vision models (CNNs, ViTs) to 3D data, organizing them into data-centric, architecture-centric, and hybrid methods; it analyzes trade-offs and outlines future directions like 3D foundation models and geometric self-supervised learning.
Details
Motivation: The fundamental dichotomy between regular 2D image grids and irregular, sparse 3D data (e.g., point clouds, meshes) poses a core challenge in extending successful 2D vision models to 3D analysis. Method: The paper proposes a unified taxonomy of adaptation strategies for 3D vision, classifying them into three families: (1) Data-centric (projecting 3D to 2D), (2) Architecture-centric (designing intrinsic 3D networks), and (3) Hybrid (combining both). It qualitatively analyzes trade-offs across computational complexity, pre-training dependence, and preservation of geometric inductive biases. Result: A comprehensive review and structured framework for 3D vision adaptation methods, revealing key trade-offs and identifying open challenges. Conclusion: The survey concludes by highlighting promising future research directions, including the development of 3D foundation models, advancements in self-supervised learning for geometric data, and deeper integration of multi-modal signals. Abstract: The remarkable success of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in 2D vision has spurred significant research in extending these architectures to the complex domain of 3D analysis. Yet, a core challenge arises from a fundamental dichotomy between the regular, dense grids of 2D images and the irregular, sparse nature of 3D data such as point clouds and meshes. This survey provides a comprehensive review and a unified taxonomy of adaptation strategies that bridge this gap, classifying them into three families: (1) Data-centric methods that project 3D data into 2D formats to leverage off-the-shelf 2D models, (2) Architecture-centric methods that design intrinsic 3D networks, and (3) Hybrid methods, which synergistically combine the two modeling paradigms to benefit from both rich visual priors of large 2D datasets and explicit geometric reasoning of 3D models. 
Through this framework, we qualitatively analyze the fundamental trade-offs between these families concerning computational complexity, reliance on large-scale pre-training, and the preservation of geometric inductive biases. We discuss key open challenges and outline promising future research directions, including the development of 3D foundation models, advancements in self-supervised learning (SSL) for geometric data, and the deeper integration of multi-modal signals.
[115] Significance and Stability Analysis of Gene-Environment Interaction using RGxEStat
Meng'en Qin,Zhe Li,Xiaohui Yang
Main category: cs.CV
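TL;DR: This paper presents two key statistical models for genotype-by-environment (GxE) interaction research and develops RGxEStat, a lightweight interactive tool that streamlines breeding data analysis.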
Details
Motivation: GxE interactions reduce the accuracy of phenotype prediction in target environments; analysing them in depth helps explain how genetic advantages or defects are expressed or suppressed under specific environmental conditions, and so improves genetic selection and breeding practice. Method: Two methods are presented, a significance analysis based on the mixed-effect model and a stability analysis, together with RGxEStat, an interactive tool that integrates model construction, solution, and visualization. Result: RGxEStat is a user-friendly tool that supports GxE analysis without knowledge of SAS or R programming and significantly accelerates breeding research cycles; the code and datasets are open source. Conclusion: RGxEStat gives breeders and agronomists an efficient, easy-to-use GxE analysis platform, making GxE research more accessible and practical. Abstract: Genotype-by-Environment (GxE) interactions influence the performance of genotypes across diverse environments, reducing the predictability of phenotypes in target environments. In-depth analysis of GxE interactions facilitates the identification of how genetic advantages or defects are expressed or suppressed under specific environmental conditions, thereby enabling genetic selection and enhancing breeding practices. This paper introduces two key models for GxE interaction research. Specifically, it covers significance analysis based on the mixed-effect model, which determines whether genes or GxE interactions significantly affect phenotypic traits, and stability analysis, which further investigates the interactive relationships between genes and environments, as well as the relative superiority or inferiority of genotypes across environments. Additionally, this paper presents RGxEStat, a lightweight interactive tool, which is developed by the authors and integrates the construction, solution, and visualization of the aforementioned models. Designed to eliminate the need for breeders and agronomists to learn complex SAS or R programming, RGxEStat provides a user-friendly interface for streamlined breeding data analysis, significantly accelerating research cycles. Codes and datasets are available at https://github.com/mason-ching/RGxEStat.
[116] Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
Wuqi Su,Huilun Song,Chen Zhao,Chi Xu
Main category: cs.CV
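TL;DR: This paper proposes a multilevel perceptual conditional random field (CRF) model built on the Swin Transformer that improves the accuracy and efficiency of monocular depth estimation through adaptive hybrid pyramid feature fusion, hierarchical awareness adapters, and a fully-connected CRF decoder with dynamic scaling attention.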
Details
Motivation: Monocular depth estimation suffers from scale ambiguity and a lack of explicit geometric cues; existing methods rely on complex network architectures, are costly to train, and do not fully model inter-pixel spatial dependencies. Method: Builds a multilevel perceptual CRF model on the Swin Transformer comprising: (1) adaptive hybrid pyramid feature fusion (HPF), which combines multi-scale spatial pyramid pooling with biaxial feature aggregation; (2) a hierarchical awareness adapter (HA), which strengthens cross-level feature interaction through lightweight broadcast modules; and (3) a fully-connected CRF decoder with dynamic scaling attention and a bias learning unit, which models fine-grained spatial relationships and stabilizes training. Result: On NYU Depth v2, Abs Rel drops to 0.088 (a 7.4% reduction) and RMSE to 0.316 (a 5.4% reduction); on KITTI, δ<1.25³ reaches 99.8% with only 194M parameters and 21 ms inference time. Conclusion: The proposed method reaches SOTA in accuracy, efficiency, and stability, validating the joint design of multilevel perceptual modeling and CRF decoding. Abstract: Monocular depth estimation from a single RGB image remains a fundamental challenge in computer vision due to inherent scale ambiguity and the absence of explicit geometric cues. Existing approaches typically rely on increasingly complex network architectures to regress depth maps, which escalates training costs and computational overhead without fully exploiting inter-pixel spatial dependencies. We propose a multilevel perceptual conditional random field (CRF) model built upon the Swin Transformer backbone that addresses these limitations through three synergistic innovations: (1) an adaptive hybrid pyramid feature fusion (HPF) strategy that captures both short-range and long-range dependencies by combining multi-scale spatial pyramid pooling with biaxial feature aggregation, enabling effective integration of global and local contextual information; (2) a hierarchical awareness adapter (HA) that enriches cross-level feature interactions within the encoder through lightweight broadcast modules with learnable dimensional scaling, reducing computational complexity while enhancing representational capacity; and (3) a fully-connected CRF decoder with dynamic scaling attention that models fine-grained pixel-level spatial relationships, incorporating a bias learning unit to prevent extreme-value collapse and ensure stable training.
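The depth metrics quoted for this entry (Abs Rel, RMSE, and the δ threshold accuracy) follow definitions that are standard in monocular depth estimation. A minimal sketch of those definitions, assuming flat lists of strictly positive predicted and ground-truth depths over valid pixels only; the function name `depth_metrics` is illustrative and is not taken from the paper:

```python
import math


def depth_metrics(pred, gt):
    """Return Abs Rel, RMSE, and delta threshold accuracies.

    Assumes pred and gt are equal-length, non-empty sequences of
    strictly positive depths (invalid pixels already masked out).
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    n = len(gt)
    # Abs Rel: mean of |pred - gt| / gt
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    # RMSE: root of the mean squared error
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # Threshold accuracy: share of pixels with max(pred/gt, gt/pred) < 1.25^k
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    delta = {k: sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)}
    return {"abs_rel": abs_rel, "rmse": rmse, "delta": delta}


if __name__ == "__main__":
    print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))
```

Lower is better for Abs Rel and RMSE, while the threshold accuracy is a fraction in [0, 1] where higher is better.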
Extensive experiments on NYU Depth v2, KITTI, and MatterPort3D datasets demonstrate that our method achieves state-of-the-art performance, reducing Abs Rel to 0.088 ($-$7.4\%) and RMSE to 0.316 ($-$5.4\%) on NYU Depth v2, while attaining near-perfect threshold accuracy ($\delta < 1.25^3$: 99.8\%) on KITTI with only 194M parameters and 21 ms inference time.
[117] Learning Additively Compositional Latent Actions for Embodied AI
Hangxing Wei,Xiaoyu Chen,Chuheng Zhang,Tim Pearce,Jianyu Chen,Alex Lamb,Li Zhao,Jiang Bian
Main category: cs.CV
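TL;DR: This paper proposes AC-LAM, which imposes a scene-wise additive composition constraint on the latent action space to make latent actions more structured, motion-specific, and displacement-calibrated, thereby improving downstream policy learning.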
Details
Motivation: Existing latent action learning methods lack priors that model the additive, compositional structure of physical motion, so latent actions entangle irrelevant scene details or information about future observations and miscalibrate motion magnitude. Method: Proposes the Additively Compositional Latent Action Model (AC-LAM), which imposes a scene-wise additive composition constraint on the latent action space over short horizons, encouraging an algebraic structure that satisfies identity, inverse, and cycle consistency and suppressing non-additive information. Result: AC-LAM outperforms state-of-the-art latent action models on simulated and real-world tabletop tasks, learns more structured, motion-specific, and displacement-calibrated latent actions, and provides stronger supervision for downstream policy learning. Conclusion: Additive composition constraints effectively improve the quality of latent action representations and are an important direction for improving self-supervised action learning. Abstract: Latent action learning infers pseudo-action labels from visual transitions, providing an approach to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details or information about future observations with true state changes and miscalibrate motion magnitude. We introduce Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage simple algebraic structure in the latent action space (identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.
[118] Mixture-of-Experts in Remote Sensing: A Survey
Yongchuan Cui,Peng Liu,Lajiao Chen
Main category: cs.CV
TL;DR: This paper provides the first systematic survey of Mixture-of-Experts (MoE) models in remote sensing, covering principles, architectures, applications, and future trends.
Details
Motivation: The diversity of sensor modalities and spatiotemporal dynamics in remote sensing data poses unique challenges, and there is no comprehensive review of MoE for this domain. Method: Systematic literature review and analysis of MoE models applied to remote sensing, including fundamental principles, architectural designs, and key applications. Result: A structured overview of MoE applications in remote sensing, identifying current practices and gaps. Conclusion: MoE offers promising potential for remote sensing tasks, and this survey serves as a foundational reference to guide future research and innovation. Abstract: Remote sensing data analysis and interpretation present unique challenges due to the diversity in sensor modalities and spatiotemporal dynamics of Earth observation data. The Mixture-of-Experts (MoE) model has emerged as a powerful paradigm that addresses these challenges by dynamically routing inputs to specialized experts designed for different aspects of a task. However, despite rapid progress, the community still lacks a comprehensive review of MoE for remote sensing. This survey provides the first systematic overview of MoE applications in remote sensing, covering fundamental principles, architectural designs, and key applications across a variety of remote sensing tasks. The survey also outlines future trends to inspire further research and innovation in applying MoE to remote sensing.
[119] YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection
Nikhileswara Rao Sulake
Main category: cs.CV
TL;DR: YOLOv11 is the latest real-time object detector in the YOLO series, featuring new modules (C3K2, SPPF, C2PSA) that improve small-object detection and spatial feature processing without compromising speed; it outperforms earlier YOLO versions in mAP and inference speed on standard benchmarks.
Details
Motivation: To advance real-time object detection by improving feature extraction, especially for small objects, while maintaining high inference speed for practical applications like autonomous driving and surveillance. Method: Design and integration of novel architectural components, namely C3K2 blocks, Spatial Pyramid Pooling-Fast (SPPF), and C2PSA (Cross Stage Partial with Spatial Attention), into the backbone, neck, and head of YOLOv11; comprehensive evaluation against prior YOLO versions on standard benchmarks. Result: YOLOv11 achieves higher mean Average Precision (mAP) and competitive inference speed compared to previous YOLO versions, confirming improved accuracy and real-time capability. Conclusion: YOLOv11 represents a significant step forward in the YOLO series, balancing accuracy and efficiency; this work provides a formal research reference for future developments in real-time object detection. Abstract: YOLOv11 is the latest iteration in the You Only Look Once (YOLO) series of real-time object detectors, introducing novel architectural modules to improve feature extraction and small-object detection. In this paper, we present a detailed analysis of YOLOv11, including its backbone, neck, and head components. The model's key innovations, the C3K2 blocks, Spatial Pyramid Pooling - Fast (SPPF), and C2PSA (Cross Stage Partial with Spatial Attention) modules, enhance spatial feature processing while preserving speed. We compare YOLOv11's performance to prior YOLO versions on standard benchmarks, highlighting improvements in mean Average Precision (mAP) and inference speed.
Our results demonstrate that YOLOv11 achieves superior accuracy without sacrificing real-time capabilities, making it well-suited for applications in autonomous driving, surveillance, and video analytics. This work formalizes YOLOv11 in a research context, providing a clear reference for future studies.
[120] ViBA: Implicit Bundle Adjustment with Geometric and Temporal Consistency for Robust Visual Matching
Xiaoji Niu,Yuqing Wang,Yan Wang,Hailiang Tang,Tisheng Zhang
Main category: cs.CV
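TL;DR: ViBA is a sustainable online learning framework that combines geometric optimization with feature learning for keypoint detection and description on unconstrained video streams, improving the localization accuracy and generalization of visual odometry while remaining real-time.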
Details
Motivation: Existing keypoint detection and description methods depend on data with accurate pose and depth annotations, which limits scalability and generalization and degrades navigation and localization performance. Method: Proposes the ViBA framework, embedded in a standard visual odometry pipeline and comprising (i) an initial inter-frame tracking network, (ii) depth-based outlier filtering, and (iii) differentiable global bundle adjustment (BA) that jointly optimizes camera poses and feature positions; feature representations are constrained by the geometric consistency of BA and long-term temporal consistency across frames. Result: On the EuRoC and UMA datasets, compared with SOTA methods such as SuperPoint+SuperGlue, ALIKED, and LightGlue, mean absolute translation error (ATE) falls by 12-18% and absolute rotation error (ARE) by 5-10%; real-time inference runs at 36-91 FPS; localization accuracy on unseen sequences exceeds 90%. Conclusion: ViBA supports continuous online learning driven by geometric and temporal consistency, markedly improving the robustness and accuracy of navigation and localization in real-world scenes. Abstract: Most existing image keypoint detection and description methods rely on datasets with accurate pose and depth annotations, limiting scalability and generalization, and often degrading navigation and localization performance. We propose ViBA, a sustainable learning framework that integrates geometric optimization with feature learning for continuous online training on unconstrained video streams. Embedded in a standard visual odometry pipeline, it consists of an implicitly differentiable geometric residual framework: (i) an initial tracking network for inter-frame correspondences, (ii) depth-based outlier filtering, and (iii) differentiable global bundle adjustment that jointly refines camera poses and feature positions by minimizing reprojection errors. By combining geometric consistency from BA with long-term temporal consistency across frames, ViBA enforces stable and accurate feature representations. We evaluate ViBA on EuRoC and UMA datasets. Compared with state-of-the-art methods such as SuperPoint+SuperGlue, ALIKED, and LightGlue, ViBA reduces mean absolute translation error (ATE) by 12-18% and absolute rotation error (ARE) by 5-10% across sequences, while maintaining real-time inference speeds (36-91 FPS). When evaluated on unseen sequences, it retains over 90% localization accuracy, demonstrating robust generalization.
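Bundle adjustment, as described in this entry, refines camera poses and feature positions by minimizing reprojection errors. The residual for a single observation can be sketched as follows, assuming an ideal pinhole camera with no lens distortion and a pose given as a rotation matrix plus translation. All names here are hypothetical, and this is a generic illustration, not ViBA's implementation:

```python
import math


def transform(R, t, p):
    """Map a 3D world point p into the camera frame: R @ p + t."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_error(R, t, point_world, observed_px, intrinsics):
    """Pixel distance between the projected point and its observation."""
    u, v = project(transform(R, t, point_world), *intrinsics)
    return math.hypot(u - observed_px[0], v - observed_px[1])


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    err = reprojection_error(identity, (0.0, 0.0, 0.0), (0.0, 0.0, 2.0),
                             (53.0, 54.0), (100.0, 100.0, 50.0, 50.0))
    print(err)
```

A bundle adjustment solver would sum the squares of this residual over all observations and adjust the poses and points to reduce that sum; the solver itself is omitted here.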
These results show that ViBA supports continuous online learning with geometric and temporal consistency, consistently improving navigation and localization in real-world scenarios.
[121] Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro
Kenan Tang,Praveen Arunshankar,Andong Hua,Anthony Yang,Yao Qin
Main category: cs.CV
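TL;DR: This paper shows that multi-modal agentic systems suffer iterative image quality degradation in multi-turn image editing, introduces Banana100, a benchmark dataset of 28,000 degraded images, and finds that no existing no-reference image quality assessment metric reliably detects the degradation, a warning sign for model training stability and system safety.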
Details
Motivation: Although current image editing models perform well in single-turn edits, quality degrades severely over multi-turn iterative editing, and existing quality assessment methods cannot identify the problem, which may endanger the reliability and safety of multi-modal agentic systems. Method: Builds Banana100, a large-scale dataset of iteratively degraded images (28,000 images produced over 100 editing steps), and systematically evaluates the sensitivity of 21 mainstream no-reference image quality assessment (NR-IQA) metrics to degraded images. Result: None of the 21 NR-IQA metrics consistently separates heavily degraded images from the original high-quality images; generators and evaluators fail together, creating a dual risk. Conclusion: Quality degradation in multi-turn image editing is an overlooked, critical vulnerability; more robust generation and evaluation methods are urgently needed to keep multi-modal agentic systems stable and safe over the long term. Abstract: The multi-step, iterative image editing capabilities of multi-modal agentic systems have transformed digital content creation. Although the latest image editing models faithfully follow instructions and generate high-quality images in single-turn edits, we identify a critical weakness in multi-turn editing, which is the iterative degradation of image quality. As images are repeatedly edited, minor artifacts accumulate, rapidly leading to a severe accumulation of visible noise and a failure to follow simple editing instructions. To systematically study these failures, we introduce Banana100, a comprehensive dataset of 28,000 degraded images generated through 100 iterative editing steps, including diverse textures and image content. Alarmingly, image quality evaluators fail to detect the degradation. Among 21 popular no-reference image quality assessment (NR-IQA) metrics, none consistently assigns lower scores to heavily degraded images than to clean ones. The dual failures of generators and evaluators may threaten the stability of future model training and the safety of deployed agentic systems, if the low-quality synthetic data generated by multi-turn edits escape quality filters. We release the full code and data to facilitate the development of more robust models, helping to mitigate the fragility of multi-modal agentic systems.
[122] KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models
Haifeng Huang,Yang Li
Main category: cs.CV
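TL;DR: KiToke is a training-free, query-agnostic video token compression method that uses a global kernel-based redundancy measure and lightweight temporal interval construction to sharply reduce the number of visual tokens while preserving critical visual information and temporal coherence, improving the inference efficiency of Video LLMs under low token budgets.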
Details
Motivation: Video LLM inference is costly, mainly because of the large number of visual tokens; existing training-free compression methods mostly rely on local or segment-level heuristics and struggle to handle spatiotemporal redundancy across a whole video. Method: Proposes KiToke: (1) kernel-based global estimation of token diversity for content-adaptive selection; (2) lightweight temporal interval construction with interval-aware token merging to maintain temporal coherence; the whole pipeline is training-free and query-agnostic. Result: Across multiple video understanding benchmarks and Video LLM backbones, KiToke consistently leads training-free compression methods, with especially large gains at the extreme compression ratio of retaining only 1% of tokens. Conclusion: By explicitly modeling global spatiotemporal redundancy, KiToke uses tokens more efficiently and sharply reduces Video LLM inference cost while preserving performance, offering a new paradigm for efficient video understanding. Abstract: Video Large Language Models (Video LLMs) achieve strong performance on video understanding tasks but suffer from high inference costs due to the large number of visual tokens. We propose KiToke, a training-free, query-agnostic token compression approach that reduces spatiotemporal redundancy while preserving critical visual information. Our method estimates token diversity globally using a kernel-based redundancy measure, enabling content-adaptive selection that remains effective under extreme token budgets, and further introduces a lightweight temporal interval construction with interval-aware token merging to maintain temporal coherence. Unlike prior methods that rely on local or segment-level heuristics, KiToke explicitly captures global redundancy across an entire video, leading to more efficient token utilization. Extensive experiments on multiple video understanding benchmarks and Video LLM backbones demonstrate that KiToke consistently outperforms existing training-free compression methods, with particularly large gains at aggressive retention ratios down to 1%.
[123] Zero-Shot Quantization via Weight-Space Arithmetic
Daniele Solombrino,Antonio Andrea Gargiulo,Adrian Robert Minut,Luca Zhou,Alessandro Zirilli,Emanuele Rodolà
Main category: cs.CV
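TL;DR: This paper proposes a zero-shot method, requiring no receiver-side training data, that extracts a "quantization vector" from a donor task to improve a receiver model's robustness to post-training quantization (PTQ), reducing the impact of quantization noise for extremely low-bit deployment.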
Details
Motivation: Existing quantization-aware training (QAT) requires receiver-side training data and compute, while demand for low-bit deployment is pressing; a lower-cost, zero-shot way to improve quantization robustness is needed. Method: Extracts a "quantization vector" from a donor task through weight-space arithmetic and uses it directly to patch the receiver model, with no receiver-side quantization-aware training or training data. Result: Validated on Vision Transformer (ViT) models, the method improves PTQ robustness by up to 60% without receiver-side QAT or training data. Conclusion: Quantization robustness is a geometric property that can be transferred in weight space rather than merely a byproduct of task-specific training; this finding offers an efficient, general new paradigm for extremely low-bit deployment. Abstract: We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve robustness to PTQ-induced noise by as much as 60%, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. We demonstrate this on Vision Transformer (ViT) models. More broadly, our results suggest that quantization robustness is not merely a byproduct of task-specific training, but a reusable feature of weight-space geometry that can be transferred rather than retrained.
[124] Automated Segmentation and Tracking of Group Housed Pigs Using Foundation Models
Ye Bi,Bimala Acharya,David Rosero,Juan Steibel
Main category: cs.CV
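TL;DR: This study proposes a foundation model (FM)-centered automated workflow for monitoring group-housed pigs that combines pretrained vision-language foundation models with lightweight task-specific post-processing modules, achieving long-duration, high-accuracy individual identity tracking with reduced annotation dependence.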
Details
Motivation: Supervised models in conventional precision livestock farming depend on large amounts of labeled data, repeated retraining, and farm-specific tuning; foundation models offer general visual representations, and their adaptation to complex agricultural scenes (such as night-time and heavy occlusion) needs to be explored. Method: Uses Grounding-DINO for initial object detection and Grounded-SAM2 for short-term video segmentation, and designs a long-term tracking pipeline comprising initialization, tracking, matching, mask refinement, re-identification, and quality control; all farm-specific adaptation is done through modular post-processing, without fine-tuning the foundation models. Result: Zero ID switches over a continuous 132-minute video; on 132 ground-truth frames, J=0.83, F=0.92, J&F=0.87, MOTA=0.99, and MOTP=90.7%; after short-term segmentation, over 80% of active tracks are fully correct. Conclusion: Combining foundation model priors with lightweight task logic enables scalable, label-efficient, long-duration stable pig monitoring, offering a new paradigm for precision livestock farming. Abstract: Foundation models (FMs) are reshaping computer vision by reducing reliance on task-specific supervised learning and leveraging general visual representations learned at scale. In precision livestock farming, most pipelines remain dominated by supervised learning models that require extensive labeled data, repeated retraining, and farm-specific tuning. This study presents an FM-centered workflow for automated monitoring of group-housed nursery pigs, in which pretrained vision-language FMs serve as general visual backbones and farm-specific adaptation is achieved through modular post-processing. Grounding-DINO was first applied to 1,418 annotated images to establish a baseline detection performance. While detection accuracy was high under daytime conditions, performance degraded under night-vision and heavy occlusion, motivating the integration of temporal tracking logic. Building on these detections, short-term video segmentation with Grounded-SAM2 was evaluated on 550 one-minute video clips; after post-processing, over 80% of 4,927 active tracks were fully correct, with most remaining errors arising from inaccurate masks or duplicated labels. To support identity consistency over an extended time, we further developed a long-term tracking pipeline integrating initialization, tracking, matching, mask refinement, re-identification, and post-hoc quality control. This system was evaluated on a continuous 132-minute video and maintained stable identities throughout.
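For readers unfamiliar with the segmentation and tracking scores used to evaluate this kind of pipeline, two of them can be sketched briefly: region similarity J, which is the Jaccard index (intersection over union) of a predicted and a ground-truth mask, and MOTA, which penalizes misses, false positives, and identity switches relative to the number of ground-truth objects. The sketch below assumes masks are given as sets of pixel coordinates and that error counts are already accumulated over all frames; the function names are illustrative and not taken from the paper:

```python
def region_similarity(mask_pred, mask_gt):
    """Jaccard index (IoU) of two binary masks given as sets of (row, col)."""
    union = mask_pred | mask_gt
    if not union:
        return 1.0  # both masks empty: treat as a perfect match
    return len(mask_pred & mask_gt) / len(union)


def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


if __name__ == "__main__":
    pred = {(0, 0), (0, 1)}
    gt = {(0, 1), (1, 1)}
    print(region_similarity(pred, gt))
    print(mota(false_negatives=1, false_positives=1, id_switches=0, num_gt=100))
```

Because MOTA subtracts error counts from 1, it can become negative when a tracker produces more errors than there are ground-truth objects.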
On 132 uniformly sampled ground-truth frames, the system achieved a mean region similarity (J) of 0.83, contour accuracy (F) of 0.92, J&F of 0.87, MOTA of 0.99, and MOTP of 90.7%, with no identity switches. Overall, this work demonstrates how FM prior knowledge can be combined with lightweight, task-specific logic to enable scalable, label-efficient, and long-duration monitoring in pig production.
[125] Inference-Path Optimization via Circuit Duplication in Frozen Visual Transformers for Marine Species Classification
Thomas Manuel Rost
Main category: cs.CV
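TL;DR: This paper uses Circuit Duplication at inference time, without fine-tuning model weights, to improve frozen vision foundation model (DINOv3) embeddings for underwater species classification; the method clearly outperforms the standard forward pass, especially at low label budgets, and is applied to computer vision for the first time.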
Details
Motivation: Addresses the poor generalization of fully supervised models in automated underwater species classification, caused by high annotation cost and large environmental variation, by exploring a new way to improve inference performance on frozen embeddings. Method: Brings Circuit Duplication, an inference-time method originally used for large language models, to computer vision: for frozen DINOv3 embeddings, a selected range of transformer layers is traversed again during the forward pass; two strategies are used, global circuit selection and class-specific circuit selection, evaluated with simple semi-supervised downstream classifiers. Result: On the AQUA20 benchmark, class-specific circuit selection reaches a macro F1 of 0.875 at the maximum label budget, approaching the fully supervised ConvNeXt (0.889) with a gap of only 1.4 points; octopus F1 improves by +12.1; about 75% of classes prefer a class-specific circuit. Conclusion: Circuit Duplication is a general technique that improves the inference performance of frozen vision foundation models without gradient updates; it is especially suited to low-resource underwater image classification, and its benefit is clearly class-dependent. Abstract: Automated underwater species classification is constrained by annotation cost and environmental variation that limits the transferability of fully supervised models. Recent work has shown that frozen embeddings from self-supervised vision foundation models already provide a strong label-efficient baseline for marine image classification. Here we investigate whether this frozen-embedding regime can be improved at inference time, without fine-tuning or changing model weights. We apply Circuit Duplication, an inference-time method originally proposed for Large Language Models, in which a selected range of transformer layers is traversed twice during the forward pass. We evaluate on the class-imbalanced AQUA20 benchmark using frozen DINOv3 embeddings under two settings: global circuit selection, where a single duplicated circuit is chosen for the full dataset, and class-specific circuit selection, where each species may receive a different optimal circuit. Both settings use simple semi-supervised downstream classifiers. Circuit Duplication consistently improves over the standard frozen forward pass. At the maximum label budget, class-specific selection reaches a macro F1 of 0.875, closing the gap to the fully supervised ConvNeXt benchmark (0.889) to 1.4 points without any gradient-based training. Four species exceed their fully supervised reference, with octopus improving by +12.1 F1 points. Across all budgets, roughly 75% of classes prefer a class-specific circuit, indicating a genuinely class-dependent benefit.
To our knowledge, this is the first application of Circuit Duplication to computer vision.
[126] ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop
Kenan Tang,Jiasheng Guo,Jeffrey Lin,Yao Qin
Main category: cs.CV
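TL;DR: This paper presents ExpressEdit, an open-source Photoshop plugin for fast, high-quality editing of character facial expressions that avoids the global noise and pixel drift introduced by existing AI models, together with an expression database of 135 tags.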
Details
Motivation: Existing AI image editing models introduce global noise and pixel drift in stylized expression editing, making them hard to integrate into professional image editing workflows. Method: Designs and implements ExpressEdit, a lightweight, open-source Photoshop plugin that works alongside native features (such as Liquify), and builds a retrieval-augmented expression database of 135 expression tags with example stories and images. Result: ExpressEdit completes an expression edit within 3 seconds on a single consumer-grade GPU, faster than mainstream proprietary models, with no noticeable editing artifacts and compatibility with native Photoshop operations. Conclusion: ExpressEdit gives artists an efficient, controllable, integrable professional-grade expression editing tool, helping AI-assisted art creation move into real workflows. Abstract: Facial expressions of characters are a vital component of visual storytelling. While current AI image editing models hold promise for assisting artists in the task of stylized expression editing, these models introduce global noise and pixel drift into the edited image, preventing the integration of these models into professional image editing software and workflows. To bridge this gap, we introduce ExpressEdit, a fully open-source Photoshop plugin that is free from common artifacts of proprietary image editing models and robustly synergizes with native Photoshop operations such as Liquify. ExpressEdit seamlessly edits an expression within 3 seconds on a single consumer-grade GPU, significantly faster than popular proprietary models. Moreover, to support the generation of diverse expressions according to different narrative needs, we compile a comprehensive expression database of 135 expression tags enriched with example stories and images designed for retrieval-augmented generation. We open source the code and dataset to facilitate future research and artistic exploration.
[127] RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
Ganlin Feng,Yuxi Long,Hafsa Ali,Erin Lou,Fahad Butt,Qian Liu,Yang Wang,Pingzhao Hu
Main category: cs.CV
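TL;DR: This paper introduces RDFace, a curated benchmark dataset of 456 pediatric facial images covering 103 rare genetic conditions, and combines it with synthetic augmentation and evaluation methods to improve AI diagnostic performance in low-sample settings.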
Details
Motivation: Rare diseases often show distinctive facial phenotypes, but AI-assisted diagnosis is held back by the scarcity of high-quality, ethically sourced facial data and the high similarity of phenotypes across diseases. Method: Builds the RDFace dataset (456 images, 103 conditions) with standardized metadata; generates synthetic images with DreamBooth and FastGAN and filters them by facial landmark similarity to preserve phenotype fidelity; trains multiple vision backbones on real plus synthetic data with cross-validation; uses a vision-language model to assess the semantic consistency of generated phenotype descriptions. Result: In ultra-low-data settings, adding synthetic images improves diagnostic accuracy by up to 13.7%; phenotype descriptions generated from real and synthetic images reach a report similarity of 0.84. Conclusion: RDFace provides a transparent, reproducible benchmark dataset for rare disease AI research and a scalable evaluation framework that covers both diagnostic performance and the trustworthiness of synthetic imagery. Abstract: Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision-language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.
[128] SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes
Quentin Herau,Tianshuo Xu,Depu Meng,Jiezhi Yang,Chensheng Peng,Spencer Sherk,Yihan Hu,Wei Zhan
Main category: cs.CV
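TL;DR: This paper proposes SpectralSplat, which disentangles scene geometry from appearance (such as lighting, weather, and time of day) within a feed-forward 3D Gaussian Splatting framework, using dual-stream color prediction and a global appearance embedding to enable controllable appearance transfer and consistent relighting.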
Details
Motivation: Existing feed-forward 3D Gaussian Splatting methods couple scene geometry with transient appearance properties (such as lighting, weather, and time of day), which prevents relighting, appearance transfer, and consistent rendering across multi-traversal data. Method: Proposes a dual-stream color prediction structure: an appearance-agnostic base stream and an appearance-conditioned adapted stream, both produced by a shared MLP conditioned on a global appearance embedding extracted from DINOv2 features; designs a hybrid relighting pipeline to generate paired training data and introduces complementary consistency, reconstruction, cross-appearance, and base color losses; also builds an appearance-adaptable temporal history that stores appearance-agnostic features. Result: Experiments show that SpectralSplat preserves the reconstruction quality of the underlying backbone while supporting controllable appearance transfer and temporally consistent relighting across driving sequences. Conclusion: SpectralSplat disentangles geometry from appearance, extending Gaussian Splatting to controllable editing and cross-condition consistent rendering in dynamic real-world driving scenes. Abstract: Feed-forward 3D Gaussian Splatting methods have achieved impressive reconstruction quality for autonomous driving scenes, yet they entangle scene geometry with transient appearance properties such as lighting, weather, and time of day. This coupling prevents relighting, appearance transfer, and consistent rendering across multi-traversal data captured under varying environmental conditions. We present SpectralSplat, a method that disentangles appearance from geometry within a feed-forward Gaussian Splatting framework. Our key insight is to factor color prediction into an appearance-agnostic base stream and an appearance-conditioned adapted stream, both produced by a shared MLP conditioned on a global appearance embedding derived from DINOv2 features. To enforce disentanglement, we train with paired observations generated by a hybrid relighting pipeline that combines physics-based intrinsic decomposition with diffusion-based generative refinement, and supervise with complementary consistency, reconstruction, cross-appearance, and base color losses. We further introduce an appearance-adaptable temporal history that stores appearance-agnostic features, enabling accumulated Gaussians to be re-rendered under arbitrary target appearances. Experiments demonstrate that SpectralSplat preserves the reconstruction quality of the underlying backbone while enabling controllable appearance transfer and temporally consistent relighting across driving sequences.
[129] Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition
Haocheng Tang,Xingyu Dang,Junmei Wang
Main category: cs.CV
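TL;DR: This paper presents MolSeek-OCR, which fine-tunes DeepSeek-OCR-2 with a two-stage progressive strategy for end-to-end recognition of molecular images as SMILES sequences; trained on synthetic and real patent data, it matches the accuracy of the best image-to-sequence methods but still trails image-to-graph models, and reinforcement learning and data refinement fail to improve strict sequence matching.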
Details
Motivation: Optical chemical structure recognition (OCSR) is essential for converting 2D molecular diagrams in printed literature into machine-readable formats, but applying existing vision-language models directly to OCSR is challenging, and full-parameter supervised fine-tuning often fails. Method: Formulates OCSR as image-conditioned SMILES generation; proposes a two-stage progressive supervised fine-tuning strategy: parameter-efficient fine-tuning with LoRA first, then selective full-parameter fine-tuning with split learning rates; trains on a large mixed corpus of synthetic PubChem images and real USPTO-MOL patent images. Result: MolSeek-OCR matches the best image-to-sequence model in exact SMILES matching accuracy but trails state-of-the-art image-to-graph models; reinforcement-style post-training and data-curation-based refinement fail to improve strict sequence-level fidelity. Conclusion: The two-stage progressive fine-tuning strategy effectively improves vision-language model performance on OCSR, but the image-to-sequence paradigm retains inherent limits in structural fidelity, and future work should explore modeling approaches better suited to molecular topology. Abstract: Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation. To overcome training instabilities, we propose a two-stage progressive supervised fine-tuning strategy: starting with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates. We train our model on a large-scale corpus combining synthetic renderings from PubChem and realistic patent images from USPTO-MOL to improve coverage and robustness. Our fine-tuned model, MolSeek-OCR, demonstrates competitive capabilities, achieving exact matching accuracies comparable to the best-performing image-to-sequence model. However, it remains inferior to state-of-the-art image-to-graph models. Furthermore, we explore reinforcement-style post-training and data-curation-based refinement, finding that they fail to improve the strict sequence-level fidelity required for exact SMILES matching.
[130] Multimodal Urban Tree Detection from Satellite and Street-Level Imagery via Annotation-Efficient Deep Learning Strategies
In Seon Kim,Ali Moghimi
Main category: cs.CV
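TL;DR: This paper proposes a multimodal urban tree detection framework that fuses high-resolution satellite imagery with street-view images and combines domain adaptation with semi-supervised and active learning strategies, substantially improving detection under limited annotation (F1 of 0.90) in support of sustainable urban planning.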
Details
Motivation: Traditional manual surveys are costly and scale poorly, and existing automated methods are bottlenecked by cross-region generalization and annotation efficiency; a scalable urban tree mapping method with low annotation dependence is needed. Method: Builds a multimodal framework that first uses satellite imagery to localize tree candidates and then retrieves the corresponding street-view images for detailed detection; uses domain adaptation to transfer knowledge from existing annotations; compares semi-supervised learning, active learning, and a hybrid of the two, using a transformer-based detection model. Result: The hybrid strategy reaches an F1-score of 0.90, a 12% improvement over the baseline; active learning steadily improves performance, while semi-supervised learning degrades because of confirmation bias in pseudo-labels; error analysis shows that the active and hybrid strategies both reduce false positives and false negatives. Conclusion: Combining multimodal data with guided human annotation is key to efficient, scalable urban tree mapping with low annotation dependence, and helps strengthen environmental monitoring and resilient urban governance. Abstract: Beyond the immediate biophysical benefits, urban trees play a foundational role in environmental sustainability and disaster mitigation. Precise mapping of urban trees is essential for environmental monitoring, post-disaster assessment, and strengthening policy. However, the transition from traditional, labor-intensive field surveys to scalable automated systems remains limited by high annotation costs and poor generalization across diverse urban scenarios. This study introduces a multimodal framework that integrates high-resolution satellite imagery with ground-level Google Street View to enable scalable and detailed urban tree detection under limited-annotation conditions. The framework first leverages satellite imagery to localize tree candidates and then retrieves targeted ground-level views for detailed detection, significantly reducing inefficient street-level sampling. To address the annotation bottleneck, domain adaptation is used to transfer knowledge from an existing annotated dataset to a new region of interest. To further minimize human effort, we evaluated three learning strategies: semi-supervised learning, active learning, and a hybrid approach combining both, using a transformer-based detection model. The hybrid strategy achieved the best performance with an F1-score of 0.90, representing a 12% improvement over the baseline model. In contrast, semi-supervised learning exhibited progressive performance degradation due to confirmation bias in pseudo-labeling, while active learning steadily improved results through targeted human intervention to label uncertain or incorrect predictions.
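One way to picture a hybrid of semi-supervised and active learning is a per-round split of model predictions: confident detections become pseudo-labels, and the most uncertain ones are sent to a human annotator. The sketch below is a generic illustration under stated assumptions, not the authors' procedure: the confidence threshold, the query budget, and the choice to measure uncertainty as closeness of the score to 0.5 are all hypothetical.

```python
def split_for_hybrid_round(predictions, pseudo_thresh=0.9, query_budget=2):
    """Split predictions into pseudo-labels and human annotation queries.

    predictions: list of (item_id, confidence) pairs, confidence in [0, 1].
    Returns (pseudo_ids, query_ids).
    """
    # Semi-supervised part: accept confident predictions as pseudo-labels.
    pseudo_ids = [item for item, conf in predictions if conf >= pseudo_thresh]
    # Active part: among the rest, query the items closest to 0.5 first
    # (assumed uncertainty measure), up to the annotation budget.
    remaining = [p for p in predictions if p[1] < pseudo_thresh]
    remaining.sort(key=lambda p: abs(p[1] - 0.5))
    query_ids = [item for item, _ in remaining[:query_budget]]
    return pseudo_ids, query_ids


if __name__ == "__main__":
    preds = [("a", 0.95), ("b", 0.55), ("c", 0.10), ("d", 0.48), ("e", 0.92)]
    print(split_for_hybrid_round(preds))
```

Routing uncertain cases to a human rather than trusting them as pseudo-labels is one plausible way to limit the confirmation bias that the abstract attributes to pure semi-supervised training.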
Error analysis further showed that active and hybrid strategies reduced both false positives and false negatives. Our findings highlight the importance of a multimodal approach and guided annotation for scalable, annotation-efficient urban tree mapping to strengthen sustainable city planning.
[131] Determined by User Needs: A Salient Object Detection Rationale Beyond Conventional Visual Stimuli
Chenglizhao Chen,Shujian Zhang,Luming Li,Wenfeng Song,Shuai Li
Main category: cs.CV
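TL;DR: This paper proposes a new task, UserSOD (User Salient Object Detection), which detects salient objects according to users' proactive needs rather than passive visual stimuli, and identifies the lack of suitable datasets as the main challenge.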
Details
Motivation: Existing salient object detection methods rely only on passive visual stimuli and ignore the decisive role of users' proactive needs in judging saliency, so they fail to satisfy users and limit the development of downstream tasks. Method: Proposes the new UserSOD task, which aims to detect salient objects consistent with a user's pre-existing needs. Result: Clarifies the definition and necessity of the UserSOD task and identifies the lack of datasets suited to it as the key problem. Conclusion: Salient object detection should shift to a paradigm driven by user needs; research on UserSOD should be advanced, and corresponding datasets are urgently needed. Abstract: Existing salient object detection (SOD) methods adopt a passive visual stimulus-based rationale: objects with the strongest visual stimuli are perceived as the user's primary focus (i.e., salient objects). They ignore the decisive role of users' proactive needs in segmenting salient objects: if a user has a need before seeing an image, the user's salient objects align with that need. For example, if a user's need is "white apple", then when this user sees an image, the user's primary focus is on the "white apple" or "the most white apple-like" objects in the image. Such an oversight not only fails to satisfy users, but also limits the development of downstream tasks. For instance, in salient object ranking tasks, focusing solely on visual stimuli-based salient objects is insufficient for analyzing fine-grained relationships between users' viewing order (usually determined by the user's needs) and scenes, which may result in wrong ranking results. Clearly, it is essential to detect salient objects based on user needs. Thus, we advocate a User Salient Object Detection (UserSOD) task, which focuses on detecting salient objects that align with users' proactive needs when users have such needs. The main challenge for this new task is the lack of datasets for model training and testing.
[132] Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
Sohyeon Kim,Sang Yeon Yoon,Kyeongbo Kong
Main category: cs.CV
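TL;DR: This paper proposes a lightweight, training-free inference-time intervention. By analyzing the internal attention dynamics of LVLM vision encoders, it identifies a three-phase "diffusion, focus, rediffusion" structure, selectively suppresses low-attention tokens during the focus phase, and uses a DPP to preserve visual diversity, reducing object hallucination with almost no added latency.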
Details
Motivation: Large Vision-Language Models (LVLMs) perform well on multimodal reasoning but remain prone to object hallucination; existing suppression methods often rely on iterative optimization, which causes high inference latency, so an efficient, low-overhead intervention is needed.
Method: Based on an analysis of attention dynamics in the vision encoder, the authors find a three-phase "diffusion, focus, rediffusion" structure; they identify and suppress low-attention tokens in the focus phase, using statistics from a single forward pass and a Determinantal Point Process (DPP) for training-free, diversity-preserving token filtering.
Result: Validated across multiple LVLM backbones and decoding strategies, the method markedly reduces hallucination metrics (e.g., Occlusion, Counting) while keeping caption quality competitive; compared with adversarial uncertainty estimation methods, it achieves comparable mitigation with negligible inference latency.
Conclusion: The phase-wise attention behavior of the vision encoder is key to understanding and mitigating LVLM hallucination; the proposed lightweight, training-free, low-latency intervention offers a practical option for deployment.
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress in multimodal reasoning, yet they remain prone to object hallucinations, generating descriptions of objects that are not present in the input image. Recent approaches attempt to mitigate hallucinations by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, resulting in substantial inference latency. In this work, we investigate the internal attention dynamics of vision encoders in LVLMs and identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion. Our analysis reveals that hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase. Motivated by this observation, we propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase. The method operates in a training-free manner using statistics from a single forward pass and employs a Determinantal Point Process (DPP) to preserve diverse visual cues while filtering redundant tokens. Extensive experiments across multiple LVLM backbones and decoding strategies demonstrate that the proposed approach consistently reduces hallucination metrics while maintaining competitive caption quality. Moreover, compared to adversarial uncertainty estimation methods, our approach achieves comparable hallucination mitigation with negligible additional inference latency.
[133] LOGER: Local--Global Ensemble for Robust Deepfake Detection in the Wild
Fei Wu,Dagong Lu,Mufeng Yao,Xinlei Xu,Fengjun Guo
Main category: cs.CV
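TL;DR: This paper proposes the LOGER framework, which detects deepfakes by combining a global branch (multi-resolution vision foundation models) with a local branch (a top-k aggregation strategy based on multiple instance learning), handling the diverse manipulation techniques and degradations found in real-world settings and markedly improving robustness and generalization.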
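The two mechanisms this entry names, top-k pooling over patch scores and fusion of branch outputs in logit space, can be illustrated with a small sketch. It is not the authors' implementation: the function names, the equal fusion weight, and the toy patch logits are assumptions made for illustration only.

```python
import math

def topk_mil_logit(patch_logits, k):
    """Top-k MIL pooling: average only the k most suspicious patch logits."""
    if not 1 <= k <= len(patch_logits):
        raise ValueError("k must be between 1 and the number of patches")
    top = sorted(patch_logits, reverse=True)[:k]
    return sum(top) / k

def fuse_logits(global_logit, local_logit, w_global=0.5):
    """Weighted average in logit space, mapped to a probability by a sigmoid.

    The weight w_global is a hypothetical hyperparameter, not a value from the paper.
    """
    z = w_global * global_logit + (1.0 - w_global) * local_logit
    return 1.0 / (1.0 + math.exp(-z))

# Toy image: two manipulated patches among six mostly normal ones.
patch_logits = [-2.0, -1.5, 3.0, 2.0, -2.5, -1.0]
mean_pooled = sum(patch_logits) / len(patch_logits)   # diluted by the normal patches
topk_pooled = topk_mil_logit(patch_logits, k=2)       # keeps the suspicious evidence
prob_fake = fuse_logits(global_logit=0.4, local_logit=topk_pooled)
```

In this toy case plain averaging yields a negative (real-looking) score, while top-k pooling keeps the evidence from the two suspicious patches, which is the dilution effect the abstract describes.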
Details
Motivation: Deepfake detection in the wild faces diverse manipulation techniques and uncontrolled real-world degradations, and existing methods struggle to model both global semantic/statistical anomalies and local forgery traces.
Method: Proposes the LOcal--Global Ensemble framework (LOGER). The global branch uses heterogeneous vision foundation models at multiple resolutions to capture holistic anomalies; the local branch uses a Multiple Instance Learning top-k aggregation strategy to focus on the most suspicious patches, with dual supervision at the image and patch levels; the predictions of the two complementary branches are fused in logit space.
Result: Took 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge, and evaluation on multiple public benchmarks confirms strong robustness and generalization across manipulation methods and real-world degradation conditions.
Conclusion: By decoupling and then fusing global and local forensic cues, LOGER mitigates evidence dilution; the low correlation between the branches' errors makes logit fusion more robust, offering a new approach to deepfake detection in real-world settings.
Abstract: Robust deepfake detection in the wild remains challenging due to the ever-growing variety of manipulation techniques and uncontrolled real-world degradations. Forensic cues for deepfake detection reside at two complementary levels: global-level anomalies in semantics and statistics that require holistic image understanding, and local-level forgery traces concentrated in manipulated regions that are easily diluted by global averaging. Since no single backbone or input scale can effectively cover both levels, we propose LOGER, a LOcal--Global Ensemble framework for Robust deepfake detection. The global branch employs heterogeneous vision foundation model backbones at multiple resolutions to capture holistic anomalies with diverse visual priors. The local branch performs patch-level modeling with a Multiple Instance Learning top-k aggregation strategy that selectively pools only the most suspicious regions, mitigating evidence dilution caused by the dominance of normal patches; dual-level supervision at both the aggregated image level and individual patch level keeps local responses discriminative. Because the two branches differ in both granularity and backbone, their errors are largely decorrelated, a property that logit-space fusion exploits for more robust prediction. LOGER achieves 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge, and further evaluation on multiple public benchmarks confirms its strong robustness and generalization across diverse manipulation methods and real-world degradation conditions.
[134] Physics-Informed Untrained Learning for RGB-Guided Superresolution Single-Pixel Hyperspectral Imaging
Hao Zhang,Bilige Xu,Lichen Wei,Xu Ma,Wenyi Ren
Main category: cs.CV
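TL;DR: This paper proposes a physics-informed single-pixel hyperspectral imaging framework that needs no external training data, combining RGB guidance with untrained neural networks to achieve high-fidelity hyperspectral reconstruction and super-resolution at low sampling rates.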
Details
Motivation: Single-pixel imaging struggles to recover high-quality spatial and spectral detail at extremely low sampling rates, and existing deep learning methods depend on large-scale pretraining data that is hard to obtain.
Method: Proposes a three-stage physics-driven framework: (1) initialization by regularized least squares with RGB-derived grayscale priors (LS-RGP); (2) an untrained hyperspectral recovery network (UHRNet) that refines the result through measurement consistency and hybrid regularization; (3) a Transformer-based untrained super-resolution network (USRNet) that raises spatial resolution using cross-modal attention over the RGB guide.
Result: Significantly surpasses state-of-the-art algorithms on benchmark datasets; in a physical experiment, a 144-band hyperspectral data cube is reconstructed at a 6.25% sampling rate.
Conclusion: The method offers a robust, data-efficient new paradigm for computational hyperspectral imaging.
Abstract: Single-pixel imaging (SPI) offers a cost-effective route to hyperspectral acquisition but struggles to recover high-fidelity spatial and spectral details under extremely low sampling rates, a severely ill-posed inverse problem. While deep learning has shown potential, existing data-driven methods demand large-scale pretraining datasets that are often impractical in hyperspectral imaging. To overcome this limitation, we propose an end-to-end physics-informed framework that leverages untrained neural networks and RGB guidance for joint hyperspectral reconstruction and super-resolution without any external training data. The framework comprises three physically grounded stages: (1) a Regularized Least-Squares method with RGB-derived Grayscale Priors (LS-RGP) that initializes the solution by exploiting cross-modal structural correlations; (2) an Untrained Hyperspectral Recovery Network (UHRNet) that refines the reconstruction through measurement consistency and hybrid regularization; and (3) a Transformer-based Untrained Super-Resolution Network (USRNet) that upsamples the spatial resolution via cross-modal attention, transferring high-frequency details from the RGB guide. Extensive experiments on benchmark datasets demonstrate that our approach significantly surpasses state-of-the-art algorithms in both reconstruction accuracy and spectral fidelity. Moreover, a proof-of-concept experiment using a physical single-pixel imaging system validates the framework's practical applicability, successfully reconstructing a 144-band hyperspectral data cube at a mere 6.25% sampling rate. The proposed method thus provides a robust, data-efficient solution for computational hyperspectral imaging.
[135] SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition
Zhuoxuan Peng,Yiyi Ding,Yang Lin,S. -H. Gary Chan
Main category: cs.CV
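TL;DR: This paper proposes Scale-Body-Flow (SBF), a new representation that augments the skeleton with a scale map (depth information), a body contour map, and an optical flow map (human-object interaction), and designs SFSNet to predict SBF without extra annotation, markedly improving skeleton-based action recognition accuracy.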
Details
Motivation: Existing 2D-skeleton-based human action recognition methods struggle to capture key action cues such as joint depth, body contour, and human-object interaction, which limits their performance in common scenes.
Method: Proposes the three-part Scale-Body-Flow (SBF) augmentation (scale map, body map, flow map) and designs SFSNet, an end-to-end trainable segmentation network that predicts SBF using only the existing skeleton and optical flow as supervision, with no additional manual annotation.
Result: Experiments on multiple datasets show that the SBF + SFSNet pipeline clearly outperforms the best current skeleton-only methods while keeping similar model compactness and inference efficiency.
Conclusion: As a lightweight, easy-to-integrate augmentation of the skeleton, SBF compensates for the weaknesses of 2D skeletons in modeling depth, shape, and interaction, and offers a new approach to video action recognition.
Abstract: Many modern video-based human action recognition (HAR) approaches use 2D skeleton as the intermediate representation in their prediction pipelines. Despite overall encouraging results, these approaches still struggle in many common scenes, mainly because the skeleton does not capture critical action-related information pertaining to the depth of the joints, contour of the human body, and interaction between the human and objects. To address this, we propose an effective approach to augment skeleton with a representation capturing action-related information in the pipeline of HAR. The representation, termed Scale-Body-Flow (SBF), consists of three distinct components, namely a scale map volume given by the scale (and hence depth information) of each joint, a body map outlining the human subject, and a flow map indicating human-object interaction given by pixel-wise optical flow values. To predict SBF, we further present SFSNet, a novel segmentation network supervised by the skeleton and optical flow without extra annotation overhead beyond the existing skeleton extraction. Extensive experiments across different datasets demonstrate that our pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy with similar compactness and efficiency as compared with the state-of-the-art skeleton-only approaches.
[136] Stochastic Generative Plug-and-Play Priors
Chicago Y. Park,Edward P. Chandler,Yuyang Hu,Michael T. McCann,Cristina Garcia-Cardona,Brendt Wohlberg,Ulugbek S. Kamilov
Main category: cs.CV
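TL;DR: This paper gives a score-based interpretation of plug-and-play (PnP) methods and builds on it a stochastic generative PnP (SGPnP) framework that injects noise to make better use of pretrained score-based diffusion models (SBDMs) as priors, improving robustness and performance on severely ill-posed inverse problems.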
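To make the idea of injecting noise inside a PnP iteration concrete, here is a toy 1-D inpainting sketch. It is only an illustration under strong assumptions: a 3-tap moving average stands in for the pretrained SBDM denoiser, and the step size, noise schedule, and function names are hypothetical rather than taken from the paper.

```python
import random

def smooth(x):
    """Stand-in denoiser: 3-tap moving average with edge replication."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]

def stochastic_pnp_inpaint(y, mask, steps=50, gamma=1.0, sigma0=0.1, decay=0.8, seed=0):
    """Toy PnP loop: data-fidelity gradient step, noise injection, then denoising."""
    rng = random.Random(seed)
    x = [yi if m else 0.0 for yi, m in zip(y, mask)]
    sigma = sigma0
    for _ in range(steps):
        # Gradient step on 0.5 * sum over observed entries of (x_i - y_i)^2.
        z = [xi - gamma * (xi - yi) if m else xi for xi, yi, m in zip(x, y, mask)]
        # Inject Gaussian noise before the denoiser (the "stochastic" part).
        z = [zi + rng.gauss(0.0, sigma) for zi in z]
        x = smooth(z)
        sigma *= decay  # assumed annealing schedule
    return x

y = [1.0] * 8                                            # constant ground-truth signal
mask = [True, True, False, True, True, False, False, True]  # False marks missing entries
x_hat = stochastic_pnp_inpaint(y, mask)
```

With the noise level annealed toward zero, the missing entries settle near the value of their observed neighbours; a real SGPnP solver would replace `smooth` with a learned denoiser evaluated at the matching noise level.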
Details
Motivation: Although plug-and-play (PnP) methods and score-based diffusion models (SBDMs) both rely on denoisers, there has been no systematic way to embed SBDMs directly and soundly in the PnP framework without reverse diffusion sampling.
Method: Establishes a score-based interpretation of PnP; proposes the stochastic generative PnP (SGPnP) framework, which deliberately injects noise during the iterations to make better use of the SBDM prior; provides a theoretical analysis showing that this noise injection is equivalent to optimizing a Gaussian-smoothed objective and helps escape strict saddle points.
Result: On challenging inverse problems such as multi-coil MRI reconstruction and large-mask natural image inpainting, SGPnP consistently outperforms conventional PnP methods and is competitive with diffusion-based solvers.
Conclusion: SBDMs can be used rigorously as PnP priors; through its noise injection mechanism, SGPnP markedly improves robustness and reconstruction quality on ill-posed imaging inverse problems, offering a new paradigm for combining PnP with generative modeling.
Abstract: Plug-and-play (PnP) methods are widely used for solving imaging inverse problems by incorporating a denoiser into optimization algorithms. Score-based diffusion models (SBDMs) have recently demonstrated strong generative performance through a denoiser trained across a wide range of noise levels. Despite their shared reliance on denoisers, it remains unclear how to systematically use SBDMs as priors within the PnP framework without relying on reverse diffusion sampling. In this paper, we establish a score-based interpretation of PnP that justifies using pretrained SBDMs directly within PnP algorithms. Building on this connection, we introduce a stochastic generative PnP (SGPnP) framework that injects noise to better leverage the expressive generative SBDM priors, thereby improving robustness in severely ill-posed inverse problems. We provide a new theory showing that this noise injection induces optimization on a Gaussian-smoothed objective and promotes escape from strict saddle points. Experiments on challenging inverse tasks, such as multi-coil MRI reconstruction and large-mask natural image inpainting, demonstrate consistent improvement over conventional PnP methods and achieve performance competitive with diffusion-based solvers.
[137] PortraitCraft: A Benchmark for Portrait Composition Understanding and Generation
Yuyang Sha,Zijie Lou,Youyun Tang,Xiaochao Qu,Haoxiang Li,Ting Liu,Luoqi Liu
Main category: cs.CV
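TL;DR: This paper presents PortraitCraft, a unified benchmark for portrait composition understanding and generation, comprising a dataset of about 50,000 real portraits with multi-level supervision, two complementary tasks (composition understanding and composition-aware generation), and standardized evaluation protocols.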
Details
Motivation: Existing datasets and benchmarks focus mainly on coarse aesthetic scoring, generic image aesthetics, or unconstrained portrait generation, and cannot support systematic research on structured portrait composition analysis or controllable portrait generation under explicit composition constraints.
Method: Builds the PortraitCraft dataset of about 50,000 real portrait images with global composition scores, annotations over 13 composition attributes, attribute-level explanation texts, visual question answering pairs, and composition-oriented textual descriptions; on this basis, defines two benchmark tasks: composition understanding (score prediction, fine-grained attribute reasoning, image-grounded VQA) and composition-aware generation.
Result: Establishes two complementary benchmark tasks within a unified framework, defines standardized evaluation protocols, and provides baseline results for representative multimodal models.
Conclusion: PortraitCraft provides a comprehensive, structured benchmark for fine-grained portrait understanding, interpretable aesthetic assessment, and controllable portrait generation.
Abstract: Portrait composition plays a central role in portrait aesthetics and visual communication, yet existing datasets and benchmarks mainly focus on coarse aesthetic scoring, generic image aesthetics, or unconstrained portrait generation. This limits systematic research on structured portrait composition analysis and controllable portrait generation under explicit composition requirements. In this paper, we introduce PortraitCraft, a unified benchmark for portrait composition understanding and generation. PortraitCraft is built on a dataset of approximately 50,000 curated real portrait images with structured multi-level supervision, including global composition scores, annotations over 13 composition attributes, attribute-level explanation texts, visual question answering pairs, and composition-oriented textual descriptions for generation. Based on this dataset, we establish two complementary benchmark tasks for composition understanding and composition-aware generation within a unified framework. The first evaluates portrait composition understanding through score prediction, fine-grained attribute reasoning, and image-grounded visual question answering, while the second evaluates portrait generation from structured composition descriptions under explicit composition constraints. We further define standardized evaluation protocols and provide reference baseline results with representative multimodal models. PortraitCraft provides a comprehensive benchmark for future research on fine-grained portrait understanding, interpretable aesthetic assessment, and controllable portrait generation.
[138] Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
Peter Yongho Kim,Juhyeon Park,Jungwoo Park,Jubin Choi,Jungwoo Seo,Jiook Cha,Taesup Moon
Main category: cs.CV
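TL;DR: This paper proposes TABLeT, which uses a pretrained 2D image autoencoder to compress fMRI volumes into compact continuous tokens and a Transformer for efficient long-range spatiotemporal modeling, outperforming existing voxel-based methods on several large datasets while combining high performance, low memory use, and interpretability.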
Details
Motivation: fMRI signals are high-dimensional and four-dimensional, and conventional voxel-based models are limited by GPU memory, which makes long-range spatiotemporal dynamics hard to model.
Method: Proposes TABLeT: a pretrained 2D natural image autoencoder encodes each 3D fMRI volume two-dimensionally into continuous tokens, which are fed to a lightweight Transformer encoder; a self-supervised masked token modeling strategy is designed for pretraining.
Result: On large-scale datasets including UKB, HCP, and ADHD-200, TABLeT outperforms existing models on multiple tasks, with computational and memory efficiency well above the current best voxel-based method.
Conclusion: TABLeT offers a new paradigm for scalable, efficient, and interpretable spatiotemporal modeling of brain activity.
Abstract: Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signals. Prior voxel-based models, although demonstrating excellent performance and interpretation capabilities, are constrained by prohibitive memory demands and thus can only capture limited temporal windows. To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder with limited VRAM. Across large-scale benchmarks including the UK-Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models in multiple tasks, while demonstrating substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method given the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves the model's performance for various downstream tasks. Our findings suggest a promising approach for scalable and interpretable spatiotemporal modeling of brain activity. Our code is available at https://github.com/beotborry/TABLeT.
[139] A Generative Foundation Model for Multimodal Histopathology
Jinxi Xiang,Mingjie Li,Siyu Hou,Yijiang Chen,Xiangde Luo,Yuanfeng Ji,Xiang Zhou,Ehsan Adeli,Akshay Chaudhari,Curtis P. Langlotz,Kilian M. Pohl,Ruijiang Li
Main category: cs.CV
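TL;DR: This paper proposes MuPD (Multimodal Pathology Diffusion), a generative foundation model that uses a diffusion Transformer with decoupled cross-modal attention to embed H&E images, RNA molecular profiles, and clinical text in a shared latent space; after pretraining on large multi-source data, it markedly improves cross-modal synthesis tasks such as text-, image-, and RNA-guided tissue image generation and virtual staining, surpassing existing specialized models.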
Details
Motivation: Diagnosing and treating complex diseases requires integrating histological, molecular, and clinical data, but in practice these modalities are often missing because of tissue scarcity, assay cost, and workflow constraints; existing computational methods rely on task-specific models for single source-target pairs and generalize poorly.
Method: Proposes MuPD, a diffusion Transformer with decoupled cross-modal attention that maps H&E images, RNA expression profiles, and clinical text into a unified latent space; it is pretrained on 100 million H&E image patches, 1.6 million text-image pairs, and 10.8 million RNA-image pairs covering 34 human organs.
Result: For text- and image-conditioned generation, FID drops by 50% and few-shot classification accuracy improves by up to 47%; for RNA-guided H&E generation, FID drops by 23% while cell-type distributions are preserved across five cancer types; as a virtual staining tool, average marker correlation for immunohistochemistry and multiplex immunofluorescence translation improves by 37%.
Conclusion: A single, unified multimodal generative foundation model, once pretrained across modalities, can substantially outperform specialized models and provides a scalable computational framework for digital pathology.
Abstract: Accurate diagnosis and treatment of complex diseases require integrating histological, molecular, and clinical data, yet in practice these modalities are often incomplete owing to tissue scarcity, assay cost, and workflow constraints. Existing computational approaches attempt to impute missing modalities from available data but rely on task-specific models trained on narrow, single source-target pairs, limiting their generalizability. Here we introduce MuPD (Multimodal Pathology Diffusion), a generative foundation model that embeds hematoxylin and eosin (H&E)-stained histology, molecular RNA profiles, and clinical text into a shared latent space through a diffusion transformer with decoupled cross-modal attention. Pretrained on 100 million histology image patches, 1.6 million text-histology pairs, and 10.8 million RNA-histology pairs spanning 34 human organs, MuPD supports diverse cross-modal synthesis tasks with minimal or no task-specific fine-tuning. For text-conditioned and image-to-image generation, MuPD synthesizes histologically faithful tissue architectures, reducing Fréchet inception distance (FID) scores by 50% relative to domain-specific models and improving few-shot classification accuracy by up to 47% through synthetic data augmentation. For RNA-conditioned histology generation, MuPD reduces FID by 23% compared with the next-best method while preserving cell-type distributions across five cancer types. As a virtual stainer, MuPD translates H&E images to immunohistochemistry and multiplex immunofluorescence, improving average marker correlation by 37% over existing approaches. These results demonstrate that a single, unified generative model pretrained across heterogeneous pathology modalities can substantially outperform specialized alternatives, providing a scalable computational framework for multimodal histopathology.
[140] SAGE-GAN: Towards Realistic and Robust Segmentation of Spatially Ordered Nanoparticles via Attention-Guided GANs
Anindya Pal,Varun Ajith,Saumik Bhattacharya,Sayantari Ghosh
Main category: cs.CV
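TL;DR: This paper proposes a two-step method combining a self-attention U-Net with CycleGAN for automatic segmentation and data augmentation of nanoparticles in electron microscopy images, reducing dependence on large amounts of manually labeled data.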
Details
Motivation: Conventional nanoparticle image analysis relies on time-consuming manual annotation or on automated segmentation methods with limited performance, and deep learning models need large amounts of labeled data that are hard to obtain.
Method: Step one builds a self-attention-driven U-Net to segment features in real electron microscopy images; step two embeds this U-Net in a CycleGAN framework to generate realistic synthetic image-mask pairs, giving unsupervised data augmentation.
Result: The model accurately detects structural features in a variety of real nanoparticle images and expands the training data autonomously, improving segmentation robustness and generalization.
Conclusion: The method eases the scarcity of labeled data and improves the accuracy and automation of nanoparticle segmentation in electron microscopy images with complex morphology and noise.
Abstract: Precise analysis of nanoparticles for characterization in electron microscopy images is essential for advancing nanomaterial development. Yet it remains challenging due to the time-consuming nature of manual methods and the shortcomings of traditional automated segmentation techniques, especially when dealing with complex shapes and imaging artifacts. While conventional methods yield promising results, they depend on a large volume of labeled training data, which is both difficult to acquire and highly time-consuming to generate. To overcome these challenges, we have developed a two-step solution. First, our system learns to segment the key features of nanoparticles from a dataset of real images using a self-attention driven U-Net architecture that focuses on important physical and morphological details while ignoring background features and noise. Second, this trained Attention U-Net is embedded in a cycle-consistent generative adversarial network (CycleGAN) framework, inspired by the cGAN-Seg model introduced by Abzargar et al. This integration allows for the creation of highly realistic synthetic electron microscopy image-mask pairs that naturally reflect the structural patterns learned by the Attention U-Net. Consequently, the model can accurately detect features in a diverse array of real-world nanoparticle images and autonomously augment the training dataset without requiring human input. Cycle consistency enforces a direct correspondence between synthetic images and ground-truth masks, ensuring realistic features, which is crucial for accurate segmentation training.
[141] ComPrivDet: Efficient Privacy Object Detection in Compressed Domains Through Inference Reuse
Yunhao Yao,Zhiqiang Wang,Ruiqi Li,Haoran Cheng,Puhan Luo,Xiangyang Li
Main category: cs.CV
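TL;DR: This paper proposes ComPrivDet, which efficiently detects privacy objects (such as faces and license plates) in compressed video by reusing I-frame inference results and using compressed-domain cues to skip or lightly refine detection on P- and B-frames, greatly reducing latency while keeping accuracy high.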
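The reuse-or-refine decision this entry describes can be sketched as a small scheduling function. The sketch below is an assumption-laden illustration, not the authors' code: the `Frame` type, the boolean `new_object_cue` (standing in for whatever compressed-domain signal the method computes), and the action names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    kind: str             # "I", "P" or "B"
    new_object_cue: bool  # hypothetical flag derived from compressed-domain signals

def schedule(frames):
    """Return one action per frame: 'full', 'light' or 'reuse'."""
    actions = []
    have_reference = False
    for f in frames:
        if f.kind == "I" or not have_reference:
            actions.append("full")    # run the full detector and cache its result
            have_reference = True
        elif f.new_object_cue:
            actions.append("light")   # refine with a lightweight detector
        else:
            actions.append("reuse")   # skip inference, reuse the cached result
    return actions

gop = [Frame("I", False), Frame("P", False), Frame("B", False),
       Frame("P", True), Frame("B", False), Frame("I", False)]
actions = schedule(gop)
skip_ratio = actions.count("reuse") / len(actions)
```

In this toy group of pictures, half of the frames reuse earlier results; the skip rate reported in the abstract depends on real video content and is not reproduced by this example.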
Details
Motivation: Privacy leakage from IoT video data is a growing problem, and conventional frame-by-frame protection has high latency in large-scale video analytics; existing compressed-domain detection methods suffer from high decoding overhead or low accuracy.
Method: Proposes ComPrivDet: starting from I-frame detection results, it uses compressed-domain features (e.g., motion vectors, DCT coefficients) to decide whether new privacy objects appear in P- and B-frames; if not, detection is skipped, otherwise a lightweight detector refines the result.
Result: Reaches 99.75% accuracy for private face detection and 96.83% for private license plate detection while skipping over 80% of inferences; compared with existing compressed-domain methods, average accuracy is 9.84% higher and latency 75.95% lower.
Conclusion: ComPrivDet achieves accurate, low-latency privacy object detection in the compressed domain and offers a practical privacy protection scheme for large-scale intelligent video analytics.
Abstract: As the Internet of Things (IoT) becomes deeply embedded in daily life, users are increasingly concerned about privacy leakage, especially from video data. Since frame-by-frame protection in large-scale video analytics (e.g., smart communities) introduces significant latency, a more efficient solution is to selectively protect frames containing privacy objects (e.g., faces). Existing object detectors require fully decoded videos or per-frame processing in compressed videos, leading to decoding overhead or reduced accuracy. Therefore, we propose ComPrivDet, an efficient method for detecting privacy objects in compressed video by reusing I-frame inference results. By identifying the presence of new objects through compressed-domain cues, ComPrivDet either skips P- and B-frame detections or efficiently refines them with a lightweight detector. ComPrivDet maintains 99.75% accuracy in private face detection and 96.83% in private license plate detection while skipping over 80% of inferences. It averages 9.84% higher accuracy with 75.95% lower latency than existing compressed-domain detection methods.
[142] Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling
Yunyao Yu,Zhengxian Wu,Zhuohong Chen,Hangrui Xu,Zirui Liao,Xiangwen Deng,Zhifang Liu,Senyuan Shi,Haoqian Wang
Main category: cs.CV
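TL;DR: This paper proposes CSRS, which uses Continuous Softened Retracing reSampling to improve the reasoning ability of multimodal large language models (MLLMs) during unsupervised self-evolution, reaching state-of-the-art results on geometric tasks in particular.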
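The contrast between a binary majority-vote reward and a continuous, frequency-based reward can be shown in a few lines. This is a minimal sketch under an assumption: the reward is taken to be the plain relative frequency of each sampled answer, whereas the paper's actual calibration is not given in this entry and may differ. Function names are hypothetical.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Binary reward: 1 for samples matching the most frequent answer, else 0."""
    top, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == top else 0.0 for a in answers]

def softened_frequency_rewards(answers):
    """Continuous reward: each sample is scored by its answer's relative frequency."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

sampled = ["12", "12", "15", "12", "9"]   # final answers from five sampled reasoning paths
hard = majority_vote_rewards(sampled)
soft = softened_frequency_rewards(sampled)
```

Here the minority answers still receive a small positive signal under the softened reward instead of a hard zero, which illustrates why a continuous signal is less likely to lock in a biased majority answer.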
Details
Motivation: Existing self-evolution methods rely on majority voting to produce pseudo-golden answers, which are easily affected by the model's intrinsic biases and cannot guarantee that the reasoning path is objectively correct, leading to performance degradation.
Method: Proposes the CSRS framework, comprising: 1) a Retracing Re-inference Mechanism (RRM) that re-infers from anchor points to explore long-tail reasoning paths; 2) a Softened Frequency Reward (SFR) that replaces binary rewards with a continuous frequency signal; 3) Visual Semantic Perturbation (VSP), which suppresses superficial visual distractions and reinforces mathematical logic.
Result: Markedly improves the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision and reaches state-of-the-art results in unsupervised self-evolution on geometric tasks.
Conclusion: CSRS mitigates the degradation caused by biased feedback signals in unsupervised self-evolution and improves the MLLM's exploration of, and generalization over, complex reasoning paths.
Abstract: In the unsupervised self-evolution of Multimodal Large Language Models, the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer, which may stem from the model's intrinsic biases rather than guaranteeing the objective correctness of the reasoning paths. To counteract the degradation, we propose Continuous Softened Retracing reSampling (CSRS) in MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (RRM) in which the model re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose Softened Frequency Reward (SFR), which replaces binary rewards with continuous signals, calibrating reward based on the answers' frequency across sampled reasoning sets. Furthermore, combined with Visual Semantic Perturbation (VSP), CSRS ensures the model prioritizes mathematical logic over visual superficiality. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision. We achieve state-of-the-art (SOTA) results in unsupervised self-evolution on geometric tasks. Our code is available at https://github.com/yyy195/CSRS.
[143] ART: Adaptive Relational Transformer for Pedestrian Trajectory Prediction with Temporal-Aware Relations
Ruochen Li,Ziyi Chang,Junyan Hu,Jiannan Li,Amir Atapour-Abarghouei,Hubert P. H. Shum
Main category: cs.CV
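TL;DR: This paper proposes the Adaptive Relational Transformer (ART), which uses a Temporal-Aware Relation Graph (TARG) and an Adaptive Interaction Pruning (AIP) mechanism to improve the accuracy and computational efficiency of pedestrian trajectory prediction.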
Details
Motivation: Existing graph-based or Transformer-based methods for modeling pedestrian interactions either incur high computational cost or struggle to represent diverse, time-varying interaction characteristics.
Method: Proposes ART, with two core components: a Temporal-Aware Relation Graph (TARG) that explicitly models how pairwise interactions evolve, and an Adaptive Interaction Pruning (AIP) mechanism that efficiently removes redundant interaction computations.
Result: Achieves state-of-the-art accuracy on the ETH/UCY and NBA benchmarks with high computational efficiency.
Conclusion: ART keeps prediction accuracy high while markedly improving computational efficiency, offering a more practical solution for pedestrian trajectory prediction in real-world settings.
Abstract: Accurate prediction of real-world pedestrian trajectories is crucial for a wide range of robot-related applications. Recent approaches typically adopt graph-based or transformer-based frameworks to model interactions. Despite their effectiveness, these methods either introduce unnecessary computational overhead or struggle to represent the diverse and time-varying characteristics of human interactions. In this work, we present an Adaptive Relational Transformer (ART), which introduces a Temporal-Aware Relation Graph (TARG) to explicitly capture the evolution of pairwise interactions and an Adaptive Interaction Pruning (AIP) mechanism to reduce redundant computations efficiently. Extensive evaluations on ETH/UCY and NBA benchmarks show that ART delivers state-of-the-art accuracy with high computational efficiency.
[144] Motion-Adaptive Multi-Scale Temporal Modelling with Skeleton-Constrained Spatial Graphs for Efficient 3D Human Pose Estimation
Ruochen Li,Shuang Chen,Wenke E,Farshad Arvin,Amir Atapour-Abarghouei
Main category: cs.CV
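TL;DR: This paper proposes the MASC-Pose framework, which models spatiotemporal dependencies efficiently through Adaptive Multi-scale Temporal Modelling (AMTM) and a Skeleton-constrained Adaptive GCN (SAGCN), improving the accuracy and efficiency of 3D human pose estimation from monocular video.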
Details
Motivation: Existing methods are inefficient and poorly adaptive when modeling complex spatiotemporal dependencies, especially under dense attention or fixed modeling schemes.
Method: Proposes the MASC-Pose framework, with an Adaptive Multi-scale Temporal Modelling (AMTM) module that captures heterogeneous motion dynamics at different temporal scales and a Skeleton-constrained Adaptive GCN (SAGCN) for joint-specific spatial interaction modeling.
Result: Experiments on the Human3.6M and MPI-INF-3DHP datasets show that the method markedly improves 3D pose estimation accuracy while keeping computational efficiency high.
Conclusion: Combining adaptive temporal reasoning with efficient spatial aggregation improves monocular 3D human pose estimation, and MASC-Pose offers a new approach to spatiotemporal modeling.
Abstract: Accurate 3D human pose estimation from monocular videos requires effective modelling of complex spatial and temporal dependencies. However, existing methods often face challenges in efficiency and adaptability when modelling spatial and temporal dependencies, particularly under dense attention or fixed modelling schemes. In this work, we propose MASC-Pose, a Motion-Adaptive multi-scale temporal modelling framework with Skeleton-Constrained spatial graphs for efficient 3D human pose estimation. Specifically, it introduces an Adaptive Multi-scale Temporal Modelling (AMTM) module to adaptively capture heterogeneous motion dynamics at different temporal scales, together with a Skeleton-constrained Adaptive GCN (SAGCN) for joint-specific spatial interaction modelling. By jointly enabling adaptive temporal reasoning and efficient spatial aggregation, our method achieves strong accuracy with high computational efficiency. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets demonstrate the effectiveness of our approach.
[145] Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval
Jun Li,Xuhang Lou,Jinpeng Wang,Yuting Wang,Yaowei Wang,Shu-Tao Xia,Bin Chen
Main category: cs.CV
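TL;DR: DreamPRVR proposes a coarse-to-fine representation learning paradigm that generates global semantic registers and combines a text-supervised diffusion model with register-augmented Gaussian attention to improve cross-modal matching accuracy in partially relevant video retrieval.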
Details
Motivation: Existing methods for partially relevant video retrieval (PRVR) struggle with query ambiguity and local noise and lack complete global contextual perception.
Method: Proposes the DreamPRVR model: 1) a probabilistic variational sampler initializes a video-centric distribution and generates coarse-grained global semantic registers; 2) a text-supervised truncated diffusion model iteratively refines the registers; 3) textual semantic structure learning builds a well-formed textual latent space; 4) register-augmented Gaussian attention blocks adaptively fuse the registers with video tokens.
Result: Clearly outperforms current state-of-the-art methods on multiple benchmark datasets, confirming the model's effectiveness for partially relevant video retrieval.
Conclusion: By jointly modeling global semantic registers and local fine-grained matching, DreamPRVR mitigates query ambiguity and local noise and offers a new paradigm for the PRVR task.
Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are generated by initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perception. The registers are then adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/CVPR26-DreamPRVR.
[146] Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning
Tianci Luo,Haohao Pan,Jinpeng Wang,Niu Lian,Xinrui Chen,Bin Chen,Shu-Tao Xia,Chun Yuan
Main category: cs.CV
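TL;DR: This paper proposes the LaPR framework, which improves prompt retrieval in visual in-context learning (VICL) by explicitly incorporating label information, addressing the problem that existing methods consider only image similarity and ignore label consistency.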
Details
Motivation: Existing prompt retrieval methods for visual in-context learning focus mainly on image similarity and often select prompts that look similar but have inconsistent labels, which hurts performance; higher label consistency is more beneficial for VICL.
Method: Proposes the Label-aware Prompt Retrieval (LaPR) framework: 1) builds a joint image-label representation; 2) uses a query-adaptive mixture-of-experts dual-encoder structure in which each expert captures a specific label mode and a router assigns dynamic weights; 3) designs an alternating optimization strategy that uses a VICL performance-guided contrastive loss and a label-guided contrastive loss, respectively.
Result: Delivers consistent and significant gains on several VICL tasks, including segmentation, detection, and colorization, and generalizes well across feature extractors and cross-fold scenarios.
Conclusion: Label information is essential for VICL prompt retrieval; by explicitly modeling and exploiting label consistency, LaPR improves prompt selection quality and downstream task performance.
Abstract: Visual in-context learning (VICL) enables visual foundation models to handle multiple tasks by steering them with demonstrative prompts. The choice of such prompts largely influences VICL performance, standing out as a key challenge. Prior work has made substantial progress on prompt retrieval and reranking strategies, but mainly focuses on prompt images while overlooking labels. We reveal that these approaches sometimes retrieve visually similar but label-inconsistent prompts, which potentially degrade VICL performance. On the other hand, higher label consistency between query and prompts tends to indicate stronger VICL results. Motivated by these findings, we develop a framework named LaPR (Label-aware Prompt Retrieval), which highlights the role of labels in prompt selection. Our framework first designs an image-label joint representation for prompts to incorporate label cues explicitly. In addition, to handle unavailable query labels at test time, we introduce a mixture-of-experts mechanism to the dual encoders with query-adaptive routing. Each expert is expected to capture a specific label mode, while the router infers query-adaptive mixture weights and helps to learn label-aware representation. We carefully design alternating optimization for experts and router, with a VICL performance-guided contrastive loss and a label-guided contrastive loss, respectively. Extensive experiments show promising and consistent improvement of LaPR on in-context segmentation, detection, and colorization tasks. Moreover, LaPR generalizes well across feature extractors and cross-fold scenarios, suggesting the importance of label utilization in prompt retrieval for VICL. Code is available at https://github.com/luotc-why/CVPR26-LaPR.
[147] Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos
Daniele Materia,Francesco Ragusa,Giovanni Maria Farinella
Main category: cs.CV
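TL;DR: This paper proposes a method based on Vision Large Language Models (VLLMs) for anticipating human-object interactions from an egocentric view. It strengthens visual grounding with Set-of-Mark prompting, infers user intent from the trajectory of recent gaze fixations, and introduces an inverse exponential sampling strategy to model temporal dynamics, surpassing state-of-the-art methods on the HD-EPIC dataset.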
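The entry does not give the formula for its inverse exponential frame sampling, so the sketch below shows only one plausible reading: frame indices are spaced so that the gaps grow exponentially when moving back in time from the last observed frame, which concentrates samples just before the interaction. The function name, the `rate` parameter, and the rounding rule are assumptions.

```python
import math

def inverse_exponential_indices(num_frames, num_samples, rate=3.0):
    """Pick frame indices that get denser toward the end of the clip.

    Offsets from the last frame grow exponentially with the sample rank;
    `rate` (hypothetical) controls how strongly samples cluster near the end.
    """
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    last = num_frames - 1
    if num_samples == 1:
        return [last]
    picked = set()
    for k in range(num_samples):
        frac = math.expm1(rate * k / (num_samples - 1)) / math.expm1(rate)
        picked.add(last - round(last * frac))
    return sorted(picked)  # duplicates collapse for very short clips

indices = inverse_exponential_indices(num_frames=100, num_samples=5)
gaps = [b - a for a, b in zip(indices, indices[1:])]
```

For a 100-frame clip and five samples this yields gaps that shrink toward the final frame, so the most recent motion is observed at the finest temporal resolution.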
Details
Motivation: To improve the ability of intelligent assistive systems to anticipate human-object interactions, in support of daily-life guidance and goal understanding, and in particular to address the weaknesses of existing egocentric methods in visual grounding and intent understanding.
Method: Builds on Vision Large Language Models (VLLMs), using Set-of-Mark prompting to strengthen visual grounding, the trajectory of the user's most recent gaze fixations to infer intent, and an inverse exponential sampling strategy to capture the key temporal dynamics just before the interaction.
Result: On the egocentric HD-EPIC dataset, the method clearly outperforms current state-of-the-art approaches on human-object interaction anticipation and is model-agnostic.
Conclusion: The proposed method improves human-object interaction anticipation in egocentric video and confirms the potential of combining vision-language models with multimodal temporal modeling for this task.
Abstract: The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system in order to guide users during daily life activities and understand their short and long-term goals. Creating systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations in existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and understanding user intent via the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches for the considered task, showing its model-agnostic nature.
[148] DiffSparse: Accelerating Diffusion Transformers with Learned Token Sparsity
Haowei Zhu,Ji Liu,Ziqiong Liu,Dong Li,Junhai Yong,Bin Wang,Emad Barsoum
Main category: cs.CV
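TL;DR: This paper proposes a differentiable layer-wise sparsity optimization framework to accelerate inference in diffusion transformer models. Using token caching and a two-stage training strategy, it substantially reduces computational cost while improving generation quality.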
Details
Motivation: Existing acceleration methods perform poorly on few-step diffusion transformer models, mainly because of inefficient feature caching strategies, manually designed sparsity allocation, and the need to retain several complete forward computation steps.
Method: Proposes a token-caching-based differentiable layer-wise sparsity optimization framework that allocates sparsity end to end with a learnable network combined with a dynamic programming solver, and adopts a two-stage training strategy that avoids full-step processing.
Result: Validated on several diffusion transformer models, including DiT-XL/2, PixArt-α, FLUX, and Wan2.1; for example, on PixArt-α with 20 sampling steps, computational cost drops by 54% while generation metrics surpass those of the original model.
Conclusion: The method greatly improves inference efficiency without sacrificing generation quality, and in some cases improves it, outperforming prior acceleration methods.
Abstract: Diffusion models demonstrate outstanding performance in image generation, but their multi-step inference mechanism requires immense computational cost. Previous works accelerate inference by leveraging layer or token cache techniques to reduce computational cost. However, these methods fail to achieve superior acceleration performance in few-step diffusion transformer models due to inefficient feature caching strategies, manually designed sparsity allocation, and the practice of retaining complete forward computations in several steps in these token cache methods. To tackle these challenges, we propose a differentiable layer-wise sparsity optimization framework for diffusion transformer models, leveraging token caching to reduce token computation costs and enhance acceleration. Our method optimizes layer-wise sparsity allocation in an end-to-end manner through a learnable network combined with a dynamic programming solver. Additionally, our proposed two-stage training strategy eliminates the need for full-step processing in existing methods, further improving efficiency. We conducted extensive experiments on a range of diffusion-transformer models, including DiT-XL/2, PixArt-α, FLUX, and Wan2.1. Across these architectures, our method consistently improves efficiency without degrading sample quality. For example, on PixArt-α with 20 sampling steps, we reduce computational cost by 54% while achieving generation metrics that surpass those of the original model, substantially outperforming prior approaches. These results demonstrate that our method delivers large efficiency gains while often improving generation quality.
[149] DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
Hoonhee Cho,Jae-Young Kang,Yuhwan Jeong,Yunseo Yang,Wonyoung Lee,Youngho Kim,Kuk-Jin Yoon
Main category: cs.CV
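TL;DR: This paper introduces DSERT-RoLL, a driving dataset that fuses multimodal sensor data from stereo event, RGB, and thermal cameras, 4D radar, and dual LiDAR, collected under diverse weather and illumination conditions. It provides unified 2D/3D benchmarks, annotations, and evaluation protocols, and proposes a multi-sensor fusion framework that improves robustness in adverse environments.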
Details
Motivation: Alleviate data scarcity for novel sensors such as event cameras and 4D radar, and support systematic study of their characteristics and behavior.
Method: Builds the multimodal driving dataset DSERT-RoLL (stereo event, RGB, thermal, 4D radar, dual LiDAR) with precise 2D/3D annotations and ego-vehicle odometry; establishes unified 2D/3D benchmarks; designs and implements a fusion framework that maps each sensor's features into a unified feature space.
Result: Releases a multimodal driving dataset that covers event cameras and 4D radar; establishes a unified benchmark that is comparable across modalities; reports baselines for single-modality and multimodal methods; the proposed fusion framework improves 3D detection robustness under challenging weather and lighting.
Conclusion: DSERT-RoLL provides a diverse, fully annotated data foundation and evaluation standard for multimodal perception research, supporting work on novel sensors and robust fusion methods.
Abstract: In this paper, we present DSERT-RoLL, a driving dataset that incorporates stereo event, RGB, and thermal cameras together with 4D radar and dual LiDAR, collected across diverse weather and illumination conditions. The dataset provides precise 2D and 3D bounding boxes with track IDs and ego vehicle odometry, enabling fair comparisons within and across sensor combinations. It is designed to alleviate data scarcity for novel sensors such as event cameras and 4D radar and to support systematic studies of their behavior. We establish unified 3D and 2D benchmarks that enable direct comparison of characteristics and strengths across sensor families and within each family. We report baselines for representative single-modality and multimodal methods and provide protocols that encourage research on different fusion strategies and sensor combinations. In addition, we propose a fusion framework that integrates sensor-specific cues into a unified feature space and improves 3D detection robustness under varied weather and lighting.
[150] SciLT: Long-Tailed Classification in Scientific Image Domains
Jiahao Chen,Bing Su
Main category: cs.CV
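TL;DR: This paper proposes SciLT, a framework that uses adaptive feature fusion and dual-supervision learning over a foundation model's penultimate- and final-layer features to achieve balanced head- and tail-class performance on scientific long-tailed recognition.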
Details
Motivation: Long-tailed recognition research has focused mainly on natural images, whereas scientific images have distinct visual characteristics and supervision signals that limit the benefit of fine-tuning foundation models, so a long-tailed recognition method tailored to scientific images is needed.
Method: Proposes SciLT, which combines penultimate- and final-layer features through adaptive feature fusion and dual-supervision learning under parameter-efficient fine-tuning (PEFT).
Result: On three scientific benchmark datasets, SciLT outperforms existing methods, balances head- and tail-class performance, and provides a strong, practical baseline for scientific long-tailed recognition.
Conclusion: Penultimate-layer features are especially important for recognizing tail classes; by exploiting multi-level features, SciLT mitigates domain shift in scientific images and points to a new way of adapting foundation models to scientific data.
Abstract: Long-tailed recognition has benefited from foundation models and fine-tuning paradigms, yet existing studies and benchmarks are mainly confined to natural image domains, where pre-training and fine-tuning data share similar distributions. In contrast, scientific images exhibit distinct visual characteristics and supervision signals, raising questions about the effectiveness of fine-tuning foundation models in such settings. In this work, we investigate scientific long-tailed recognition under a purely visual and parameter-efficient fine-tuning (PEFT) paradigm. Experiments on three scientific benchmarks show that fine-tuning foundation models yields limited gains, and reveal that penultimate-layer features play an important role, particularly for tail classes. Motivated by these findings, we propose SciLT, a framework that exploits multi-level representations through adaptive feature fusion and dual-supervision learning. By jointly leveraging penultimate- and final-layer features, SciLT achieves balanced performance across head and tail classes. Extensive experiments demonstrate that SciLT consistently outperforms existing methods, establishing a strong and practical baseline for scientific long-tailed recognition and providing valuable guidance for adapting foundation models to scientific data with substantial domain shifts.
[151] ResGuard: Enhancing Robustness Against Known Original Attacks in Deep Watermarking
Hanyi Wang,Han Fang,Yupeng Qiu,Shilin Wang,Ee-Chien Chang
Main category: cs.CV
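TL;DR: This paper exposes the Known Original Attack (KOA) vulnerability in deep learning-based image watermarking and proposes ResGuard, a module that improves robustness by enhancing residual specificity and adding an auxiliary noise layer, substantially raising watermark extraction accuracy.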
Details
Motivation: Existing deep learning image watermarking methods built on the END architecture overlook intentional attacks (KOA) by adversaries who exploit known original-watermarked image pairs, leaving a serious security vulnerability.
Method: Proposes the plug-in module ResGuard: 1) a residual specificity enhancement loss that strengthens the coupling between residuals and host images; 2) an auxiliary KOA noise layer that injects residual-style perturbations during training to make the decoder more robust.
Result: Integrated into existing frameworks, it raises average watermark extraction accuracy from 59.87% to 99.81%.
Conclusion: END frameworks are vulnerable to KOA because their residuals depend too little on the host image; ResGuard mitigates this by enhancing image dependency and training robustness, offering a new direction for watermark security.
Abstract: Deep learning-based image watermarking commonly adopts an "Encoder-Noise Layer-Decoder" (END) architecture to improve robustness against random channel distortions, yet it often overlooks intentional manipulations introduced by adversaries with additional knowledge. In this paper, we revisit this paradigm and expose a critical yet underexplored vulnerability: the Known Original Attack (KOA), where an adversary has access to multiple original-watermarked image pairs, enabling various targeted suppression strategies. We show that even a simple residual-based removal approach, namely estimating an embedding residual from known pairs and subtracting it from unseen watermarked images, can almost completely remove the watermark while preserving visual quality. This vulnerability stems from the insufficient image dependency of residuals produced by END frameworks, which makes them transferable across images. To address this, we propose ResGuard, a plug-and-play module that enhances KOA robustness by enforcing image-dependent embedding. Its core lies in a residual specificity enhancement loss, which encourages residuals to be tightly coupled with their host images and thus improves image dependency. Furthermore, an auxiliary KOA noise layer injects residual-style perturbations during training, allowing the decoder to remain reliable under stronger embedding inconsistencies. Integrated into existing frameworks, ResGuard boosts KOA robustness, improving average watermark extraction accuracy from 59.87% to 99.81%.
[152] FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
Zhengyu Fu,René Zurbrügg,Kaixian Qu,Marc Pollefeys,Marco Hutter,Hermann Blum,Zuria Bauer
Main category: cs.CV
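TL;DR: This paper proposes FunFact, a framework for building probabilistic open-vocabulary functional 3D scene graphs from RGB-D images. It combines common-sense priors from large language models with geometric priors and performs joint probabilistic inference in a factor graph, improving recall and confidence calibration for functional relations.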
Details
Motivation: Existing methods model functional relationships between object pairs in isolation and fail to capture the scene-wide functional interdependence that humans rely on to resolve ambiguity.
Method: FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relation candidates; it then models them as factor graph variables constrained jointly by LLM-derived common-sense priors and geometric priors, enabling joint probabilistic inference over all functional edges.
Result: Experiments on SceneFun3D, FunGraph3D, and the newly proposed synthetic dataset FunThor show that FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations.
Conclusion: Scene-wide joint probabilistic modeling offers a key advantage for functional scene understanding, modeling complex functional interactions more robustly and interpretably.
Abstract: Recent work in 3D scene understanding is moving beyond purely spatial analysis toward functional scene understanding. However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependence that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their marginals, yielding substantially better calibrated confidence scores. To benchmark this setting, we introduce FunThor, a synthetic dataset based on AI2-THOR with part-level geometry and rule-based functional annotations. Experiments on SceneFun3D, FunGraph3D, and FunThor show that FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations, highlighting the benefits of holistic probabilistic modeling for functional scene understanding. See our project page at https://funfact-scenegraph.github.io/
[153] SGTA: Scene-Graph Based Multi-Modal Traffic Agent for Video Understanding
Xingcheng Zhou,Mingyu Liu,Walter Zimmer,Jiajie Zhang,Alois Knoll
Main category: cs.CV
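TL;DR: This paper proposes SGTA, a framework that combines structured scene graphs with multi-modal reasoning for traffic video understanding. It builds a scene graph via detection, tracking, and lane extraction, then uses ReAct for tool-based reasoning over symbolic queries and visual inputs, enabling interpretable answers to complex video questions.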
Details
Motivation: Existing traffic video understanding methods lack structured representations and interpretable multi-modal reasoning, making it hard to handle complex, open-ended video question answering.
Method: Proposes SGTA: first detect and track vehicles and extract lanes from roadside video to build a traffic scene graph; then follow the ReAct paradigm so that a large language model interleaves reasoning with tool calls (such as graph queries and visual analysis) over symbolic and visual information, producing a traceable reasoning chain.
Result: On a subset of the TUMTraffic VideoQA dataset, SGTA achieves competitive accuracy across multiple question types and outputs transparent, interpretable reasoning steps.
Conclusion: Combining structured scene graphs with multi-modal agents is an effective route to better performance and interpretability in traffic video understanding.
Abstract: We present Scene-Graph Based Multi-Modal Traffic Agent (SGTA), a modular framework for traffic video understanding that combines structured scene graphs with multi-modal reasoning. It constructs a traffic scene graph from roadside videos using detection, tracking, and lane extraction, followed by tool-based reasoning over both symbolic graph queries and visual inputs. SGTA adopts ReAct to process interleaved reasoning traces from large language models with tool invocations, enabling interpretable decision-making for complex video questions. Experiments on a selected sample of the TUMTraffic VideoQA dataset demonstrate that SGTA achieves competitive accuracy across multiple question types while providing transparent reasoning steps. These results highlight the potential of integrating structured scene representations with multi-modal agents for traffic video understanding.
[154] VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning
Shaoyang Cui,Lingbei Meng
Main category: cs.CV
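TL;DR: This paper introduces VidNum-1.4K, a video question-answering benchmark for evaluating multi-step numerical reasoning by vision-language models on real-world video, and shows that current models fall well short of building a stable internal world model.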
Details
Motivation: Existing video numerical reasoning benchmarks are confined to narrow domains or simple counting and cannot assess a model's deeper understanding of temporal events, object permanence, and compositional logic.
Method: Builds VidNum-1.4K, a three-level benchmark of 1,379 strictly human-annotated video-question pairs covering object, action, and event quantification, requiring arithmetic, comparison, and logical reasoning grounded in temporal evidence.
Result: Across several state-of-the-art VLMs, the strongest model, Gemini-3.1-pro, reaches only about 60% accuracy, and mainstream open-source models score only 25%–45%, exposing a clear gap in numerical reasoning.
Conclusion: Current VLMs lack a stable "internal world model"; VidNum-1.4K can serve as a key diagnostic benchmark for next-generation numerical video intelligence.
Abstract: Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly "understand" real-world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. VidNum-1.4K is structured into a three-level hierarchy that evolves from direct visual perception to video-based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state-of-the-art VLMs reveal a striking reasoning gap: while Gemini-3.1-pro barely reaches a 60% accuracy threshold, representative open-source families struggle heavily in the 25%–45% range. These findings demonstrate that current VLMs still lack a stable "internal world model", positioning VidNum-1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.
[155] XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening
Hongxia Gao,Litao Li,Yixin Chen,Jiali Wen,Kaijie Zhang,Qianyun Liu
Main category: cs.CV
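TL;DR: This paper presents the XSeg dataset and the APSAM model to address the lack of pixel-level annotations and real-world data in X-ray contraband detection, using a large-scale segmentation dataset and an improved SAM model to raise detection accuracy and efficiency.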
Details
Motivation: Existing X-ray contraband detection methods rely on bounding-box annotations and lack pixel-level supervision and real-world data, which limits generalization and performance.
Method: Builds XSeg, described by the authors as the largest X-ray contraband segmentation dataset to date (98,644 images, 295,932 instance masks, 30 contraband categories); proposes APSAM, a SAM-based adaptive point annotation model with an Energy-Aware Encoder (more sensitive to overlapping objects) and an Adaptive Point Generator (precise masks from a single point prompt).
Result: Experiments on XSeg show superior performance for APSAM, with better handling of stacked objects and higher annotation efficiency.
Conclusion: Together, XSeg and APSAM provide pixel-level supervision data and an efficient annotation tool for X-ray contraband detection, moving the field toward more accurate and robust methods.
Abstract: X-ray contraband detection is critical for public safety. However, current methods primarily rely on bounding box annotations, which limit model generalization and performance due to the lack of pixel-level supervision and real-world data. To address these limitations, we introduce XSeg. To the best of our knowledge, XSeg is the largest X-ray contraband segmentation dataset to date, including 98,644 images and 295,932 instance masks, and contains the latest 30 common contraband categories. The images are sourced from public datasets and our synthesized data, filtered through a custom data cleaning pipeline to remove low-quality samples. To enable accurate and efficient annotation and reduce manual labeling effort, we propose Adaptive Point SAM (APSAM), a specialized mask annotation model built upon the Segment Anything Model (SAM). We address SAM's poor cross-domain generalization and limited capability in detecting stacked objects by introducing an Energy-Aware Encoder that enhances the initialization of the mask decoder, significantly improving sensitivity to overlapping items. Additionally, we design an Adaptive Point Generator that allows users to obtain precise mask labels with only a single coarse point prompt. Extensive experiments on XSeg demonstrate the superior performance of APSAM.
[156] Learning Superpixel Ensemble and Hierarchy Graphs for Melanoma Detection
Asmaa M. Elwer,Muhammad A. Rushdi,Mahmoud H. Annaby
Main category: cs.CV
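TL;DR: This paper proposes a melanoma detection method based on learning superpixel graph structures. It uses superpixel ensemble graphs (SEG) and superpixel hierarchy graphs (SHG) with handcrafted and learned edge weights, multi-feature nodal signals, and edge-threshold pruning, reaching 99.00% accuracy and 99.59% AUC on the ISIC2017 dataset.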
Details
Motivation: In traditional graph signal processing, graph structures are often set empirically and lack adaptivity, whereas melanoma detection needs more robust and interpretable graph representations, motivating a data-driven graph structure learning approach.
Method: Constructs two kinds of graphs, superpixel ensemble graphs (SEG, without parent-child constraints) and superpixel hierarchy graphs (SHG, with parent-child constraints), generating superpixel maps at multiple scales (20–100 nodes); nodal signals use texture, geometric, and color features; edge weights are either handcrafted Gaussian weights or learned by optimization; and different edge thresholds (25%/50%/75%) are studied for their effect on performance.
Result: On ISIC2017, learned SEG with textural nodal signals performs best, with 99.00% accuracy and 99.59% AUC; data augmentation alleviates class imbalance.
Conclusion: Learned superpixel graph structures (especially SEG) combined with texture features improve melanoma detection, demonstrating the value of graph structure learning for skin image analysis.
Abstract: Graph signal processing (GSP) is becoming a major tool in biomedical signal and image analysis. In most GSP techniques, graph structures and edge weights have been typically set via statistical and computational methods. More recently, graph structure learning methods offered more reliable and flexible data representations. In this work, we introduce a graph learning approach for melanoma detection in dermoscopic images based on two graph-theoretic representations: superpixel ensemble graphs (SEG) and superpixel hierarchy graphs (SHG). For these two types of graphs, superpixel maps of a skin lesion image are respectively generated at multiple levels without and with parent-child constraints among superpixels at adjacent levels, where each level corresponds to a subgraph with a different number of nodes (20, 40, 60, 80, or 100 nodes). Two edge weight assignment techniques are explored: handcrafted Gaussian weights and learned weights based on optimization methods. The graph nodal signals are assigned based on texture, geometric, and color superpixel features. In addition, the effect of graph edge thresholding is investigated by applying different thresholds (25%, 50%, and 75%) to prune the weakest edges and analyze the impact of pruning on the melanoma detection performance. Experimental evaluation of the proposed method is performed with different classifiers trained and tested on the publicly available ISIC2017 dataset. Data augmentation is applied to alleviate class imbalance by adding more melanoma images from the ISIC archive.
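The edge-thresholding step described in this abstract can be illustrated with a short sketch. It is not the authors' implementation: it assumes the graph is a symmetric weighted adjacency matrix with zeros for absent edges, and it reads a threshold such as 25% as "remove the 25% of existing edges with the smallest weights", which the abstract does not spell out. The function name `prune_weakest_edges` is hypothetical.

```python
import numpy as np

def prune_weakest_edges(weights, fraction):
    """Zero out the weakest `fraction` of edges in an undirected weighted graph.

    Assumptions (not from the paper): `weights` is a symmetric adjacency
    matrix with 0 for absent edges; `fraction` is the share of existing
    edges to drop, ranked by weight.
    """
    w = np.asarray(weights, dtype=float)
    rows, cols = np.triu_indices_from(w, k=1)      # visit each undirected edge once
    present = w[rows, cols] > 0                    # keep only edges that exist
    rows, cols = rows[present], cols[present]
    n_drop = int(np.floor(fraction * rows.size))   # number of edges to remove
    order = np.argsort(w[rows, cols], kind="stable")  # weakest edges first
    drop = order[:n_drop]
    pruned = w.copy()                              # leave the input untouched
    pruned[rows[drop], cols[drop]] = 0.0
    pruned[cols[drop], rows[drop]] = 0.0           # keep the matrix symmetric
    return pruned

# Toy 4-node graph with six edges.
example = np.array([
    [0.0, 0.9, 0.1, 0.5],
    [0.9, 0.0, 0.3, 0.7],
    [0.1, 0.3, 0.0, 0.2],
    [0.5, 0.7, 0.2, 0.0],
])
half_pruned = prune_weakest_edges(example, 0.5)
```

With a fraction of 0.5 on this toy graph, the three lightest edges (weights 0.1, 0.2, and 0.3) are removed and the three heaviest remain, so the pruned graph can be passed to whatever classifier pipeline consumes the adjacency matrix.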
The results show that learned superpixel ensemble graphs with textural nodal signals give the highest performance, reaching an accuracy of 99.00% and an AUC of 99.59%.
[157] CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
Haimin Luo,Srinjay Sarkar,Albert Mosella-Montoro,Francisco Vicente Carrasco,Fernando De la Torre
Main category: cs.CV
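TL;DR: This paper proposes a compact pipeline for high-fidelity hair reconstruction from multi-view images. It clusters strands into representative hair cards with shared texture codebooks and combines them with 3D Gaussian Splatting rendering, greatly reducing storage and compute cost while preserving visual quality.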
Details
Motivation: Existing 3D Gaussian Splatting methods reconstruct hair at high quality but need millions of primitives, making storage and rendering expensive; hair, however, is highly self-similar in structure and appearance and can be modeled far more compactly.
Method: Clusters strands into representative hair cards and builds shared texture codebooks; integrates this structure into the 3D Gaussian Splatting rendering framework; uses a generative prior to accelerate reconstruction of the initial strand geometry.
Result: Strand reconstruction time drops by 4x and memory footprint by more than 200x, with rendering quality comparable to the baselines.
Conclusion: The method improves the efficiency and practicality of hair reconstruction while keeping high fidelity, offering a new approach to digital-human modeling in real-time or resource-constrained settings.
Abstract: We present a compact pipeline for high-fidelity hair reconstruction from multi-view images. While recent 3D Gaussian Splatting (3DGS) methods achieve realistic results, they often require millions of primitives, leading to high storage and rendering costs. Observing that hair exhibits structural and visual similarities across a hairstyle, we cluster strands into representative hair cards and group these into shared texture codebooks. Our approach integrates this structure with 3DGS rendering, significantly reducing reconstruction time and storage while maintaining comparable visual quality. In addition, we propose a generative-prior-accelerated method to reconstruct the initial strand geometry from a set of images. Our experiments demonstrate a 4-fold reduction in strand reconstruction time and achieve comparable rendering performance with over 200x lower memory footprint.
[158] SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
Guiyu Zhang,Yabo Chen,Xunzhi Xiang,Junchao Huang,Zhongyu Wang,Li Jiang
Main category: cs.CV
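TL;DR: This paper proposes SymphoMotion, a unified video generation framework that jointly controls camera motion and object dynamics. Explicit camera paths with geometry-aware cues yield stable viewpoint transitions, and 2D visual guidance with 3D trajectory embeddings yields depth-aware object manipulation; the authors also build RealCOD-25K, a real-world dataset for training and evaluation.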
Details
Motivation: Existing methods usually handle only camera motion or only object dynamics, and often rely on ambiguous 2D cues that cannot separate camera-induced parallax from true object motion, so they lack unified, controllable, structurally consistent motion modeling.
Method: Proposes SymphoMotion: 1) a Camera Trajectory Control mechanism that fuses explicit camera paths with geometry-aware cues; 2) an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings; and builds RealCOD-25K, a real-world dataset with paired camera poses and object-level 3D trajectories.
Result: Significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, as shown by user studies and quantitative experiments.
Conclusion: SymphoMotion sets a new benchmark for unified motion control in video generation and advances controllable video generation that is both geometrically consistent and dynamically expressive.
Abstract: Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired camera poses and object-level 3D trajectories across diverse indoor and outdoor scenes, addressing a key data gap in unified motion control. Extensive experiments and user studies show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, establishing a new benchmark for unified motion control in video generation. Code and data are publicly available at https://grenoble-zhang.github.io/SymphoMotion/.
[159] Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation
Binyuan Huang,Yuning Lu,Weinan Jia,Hualiang Wang,Mu Liu,Daiqing Yang
Main category: cs.CV
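TL;DR: This paper proposes PoCo, which uses position encoding as a context controller to resolve confusion between highly similar-looking characters across multiple reference images, improving the accuracy and consistency of character control in multi-shot video generation.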
Details
Motivation: Academic progress on multi-shot video generation with multiple reference characters is limited, and models are prone to reference confusion when the reference images look highly similar.
Method: Proposes PoCo (Position Embedding as a Context Controller), which uses position encoding as an additional context-control signal alongside semantic retrieval, enabling precise token-level matching while preserving semantic consistency.
Result: PoCo improves cross-shot consistency and reference fidelity compared with various baselines.
Conclusion: PoCo offers a new approach to precise control of similar-looking characters in multi-reference, multi-shot video generation and mitigates reference confusion.
Abstract: Recent proprietary models such as Sora2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model's ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token-level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi-reference and multi-shot video generation model capable of reliably controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross-shot consistency and reference fidelity compared with various baselines.
[160] Shower-Aware Dual-Stream Voxel Networks for Structural Defect Detection in Cosmic-Ray Muon Tomography
Parthiv Dasgupta,Sambhav Agarwal,Palash Dutta,Raja Karmakar,Sudeshna Goswami
Main category: cs.CV
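TL;DR: This paper proposes SA-DSVN, a model that performs voxel-level segmentation of structural defects in reinforced concrete using cosmic-ray muon tomography. It introduces secondary electromagnetic shower multiplicity, a previously unexploited feature, which substantially improves segmentation performance.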
Details
Motivation: Conventional muon reconstruction methods (such as POCA and MLSD) rely only on scattering angles and ignore potentially discriminative features such as secondary electromagnetic showers, making it hard to identify complex defects inside concrete.
Method: Proposes SA-DSVN, a 3D convolutional architecture with two independent encoder streams that process muon scattering kinematics (9 channels) and secondary electromagnetic shower multiplicities (40 channels), fused via cross-attention; training data come from Vega, a cloud-native Geant4 framework, with 4.5 million muon events, 900 volumes, and four defect types.
Result: On 60 independent validation volumes, the model reaches 96.3% voxel accuracy, per-defect Dice scores of 0.59–0.81, and 100% volume-level detection sensitivity at 10 ms inference per volume; ablations show that shower multiplicity alone raises mean Dice from 0.535 to 0.685.
Conclusion: Secondary electromagnetic shower multiplicity is a previously unexploited but highly discriminative feature for muon tomography, and SA-DSVN demonstrates its effectiveness and practicality for learned structural defect detection.
Abstract: We present SA-DSVN, a 3D convolutional architecture for voxel-level segmentation of structural defects in reinforced concrete using cosmic-ray muon tomography. Unlike conventional reconstruction methods (POCA, MLSD) that rely solely on muon scattering angles, our approach jointly processes scattering kinematics (9 channels) and secondary electromagnetic shower multiplicities (40 channels) through independent encoder streams fused via cross-attention. Training data were generated using Vega, a cloud-native Geant4 simulation framework, producing 4.5 million muon events across 900 volumes containing four defect types - honeycombing, shear fracture, corrosion voids, and delamination - embedded within a dense 7x7 rebar cage. A five-variant ablation study demonstrates that the shower multiplicity stream alone accounts for the majority of discriminative power, raising defect-mean Dice from 0.535 (scattering only) to 0.685 (shower only). On 60 independently simulated validation volumes, the model achieves 96.3% voxel accuracy, per-defect Dice scores of 0.59-0.81, and 100% volume-level detection sensitivity at 10 ms inference per volume. These results establish secondary shower multiplicity as a previously unexploited but highly effective feature for learned muon tomographic reconstruction.
[161] ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
Zitong Xu,Huiyu Duan,Shengyao Qin,Guangyu Yao,Guangji Ma,Xiongkuo Min,Ke Gu,Guangtao Zhai,Patrick Le Callet
Main category: cs.CV
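TL;DR: This paper presents ICBench, a new large-scale image captioning benchmark covering 12 content categories, 2,000 images, and 40K short and long captions generated by 10 advanced MLLMs. It also introduces ITIScore, an automatic metric based on image-to-text-to-image reconstruction consistency that aligns closely with human ratings and generalizes zero-shot.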
Details
Motivation: Existing image captioning benchmarks suffer from limited diversity in caption length, no coverage of recent multimodal large language models, and insufficient human annotation, leading to biased and incomplete evaluation.
Method: Builds ICBench: 12 content categories, 2K images, and 40K short and long captions generated by 10 advanced MLLMs; runs fine-grained human subjective studies (MOS); proposes the automatic metric ITIScore, based on image-to-text-to-image reconstruction consistency.
Result: ITIScore aligns closely with human ratings and shows strong zero-shot generalization on other public datasets.
Conclusion: ICBench addresses several shortcomings of current captioning benchmarks, and ITIScore offers an automated, reliable, and generalizable way to assess caption quality.
Abstract: Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which can introduce bias and limit the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.
[162] M2StyleGS: Multi-Modality 3D Style Transfer with Gaussian Splatting
Xingyu Miao,Xueqi Qiu,Haoran Duan,Yawen Huang,Xian Wu,Jingjing Deng,Yang Long
Main category: cs.CV
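TL;DR: This paper proposes M2StyleGS, a real-time 3D style transfer method built on 3D Gaussian Splatting and multi-modal CLIP features. It accepts text or images as style references and improves style consistency and visual quality through subdivisive flow alignment, an observation loss, and a suppression loss.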
Details
Motivation: Existing 3D style transfer methods depend on a fixed reference image and do not support flexible user inputs such as text or diverse imagery, falling short of the needs of applications like VR/AR.
Method: Uses 3D Gaussian Splatting as the 3D representation and fuses text-image multi-modal style features extracted by CLIP; introduces subdivisive flow for precise feature alignment; maps CLIP features into the VGG style feature space; designs an observation loss to improve style matching and a suppression loss to keep reference color information stable.
Result: M2StyleGS supports real-time 3D stylization driven by text or images and generates precisely color-mapped novel-view sequences; experiments show up to a 32.92% gain in consistency over prior work and better visual quality.
Conclusion: Through multi-modal style guidance and carefully designed losses, M2StyleGS improves the flexibility, consistency, and real-time performance of 3D style transfer, offering a practical approach for interactive scenarios such as VR/AR.
Abstract: Conventional 3D style transfer methods rely on a fixed reference image to apply artistic patterns to 3D scenes. However, in practical applications such as virtual or augmented reality, users often prefer more flexible inputs, including textual descriptions and diverse imagery. In this work, we introduce a novel real-time styling technique, M2StyleGS, to generate a sequence of precisely color-mapped views. It utilizes 3D Gaussian Splatting (3DGS) as a 3D representation and multi-modality knowledge refined by CLIP as a reference style. M2StyleGS resolves the abnormal transformation issue by employing a precise feature alignment, namely subdivisive flow, which strengthens the projection of the mapped CLIP text-visual combination feature to the VGG style feature. In addition, we introduce an observation loss, which helps the stylized scene better match the reference style during generation, and a suppression loss, which suppresses the offset of reference color information throughout the decoding process. By integrating these approaches, M2StyleGS can employ text or images as references to generate a set of style-enhanced novel views. Our experiments show that M2StyleGS achieves better visual quality and surpasses the previous work by up to 32.92% in terms of consistency.
[163] When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks
Yuanhang Li
Main category: cs.CV
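TL;DR: This paper presents the first systematic comparison of vision-language models (VLMs) and lightweight CNNs on spectrum heatmap understanding for cooperative non-terrestrial and terrestrial networks, and introduces the SpectrumQA benchmark. The two show task-dependent complementarity: CNNs excel at supervised spatial tasks (such as classification and localization), while VLMs excel at few-shot semantic reasoning. A task router improves the composite score by 39.1%, and VLMs prove more robust across scenarios.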
Details
Motivation: VLMs are being adopted rapidly for wireless network management, yet no systematic study shows where they outperform CNNs on spectrum-related tasks.
Method: Builds SpectrumQA, a four-level spectrum understanding benchmark (108K QA pairs), and compares a frozen Qwen2-VL-7B with a trained ResNet-18 on three NTN-TN scenarios; introduces a task-type router; uses CoT prompting to analyze differences in reasoning ability; evaluates cross-scenario transfer robustness.
Result: The CNN leads on L1 severity classification (72.9%) and L3 spatial localization (IoU=0.552); the VLM reaches F1=0.576 on L4 semantic reasoning with only three in-context examples, and CoT improves its F1 by 12.6%; the task router achieves a composite score of 0.616 (+39.1%); the VLM degrades less in 5 of 6 cross-scenario transfers.
Conclusion: VLMs and CNNs are complementary rather than interchangeable for spectrum understanding: CNNs suit supervised tasks such as spatial localization, and VLMs suit few-shot cognitive tasks such as semantic reasoning; they should be deployed together according to task type rather than one replacing the other.
Abstract: The adoption of vision-language models (VLMs) for wireless network management is accelerating, yet no systematic understanding exists of where these large foundation models outperform lightweight convolutional neural networks (CNNs) for spectrum-related tasks. This paper presents the first diagnostic comparison of VLMs and CNNs for spectrum heatmap understanding in non-terrestrial network and terrestrial network (NTN-TN) cooperative systems. We introduce SpectrumQA, a benchmark comprising 108K visual question-answer pairs across four granularity levels: scene classification (L1), regional reasoning (L2), spatial localization (L3), and semantic reasoning (L4). Our experiments on three NTN-TN scenarios with a frozen Qwen2-VL-7B and a trained ResNet-18 reveal a clear task-dependent complementarity: CNN achieves 72.9% accuracy at severity classification (L1) and 0.552 IoU at spatial localization (L3), while VLM uniquely enables semantic reasoning (L4) with F1=0.576 using only three in-context examples, a capability fundamentally absent in CNN architectures. Chain-of-thought (CoT) prompting further improves VLM reasoning by 12.6% (F1: 0.209->0.233) while having zero effect on spatial tasks, confirming that the complementarity is rooted in architectural differences rather than prompting limitations. A deterministic task-type router that delegates supervised tasks to CNN and reasoning tasks to VLM achieves a composite score of 0.616, a 39.1% improvement over CNN alone.
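The deterministic task-type router mentioned in this abstract can be pictured with a minimal sketch. This is an illustration only, not the paper's code: the enum, the `ROUTES` table, and the `route` function are hypothetical names, the L1/L3-to-CNN and L4-to-VLM assignments follow the strengths reported in the abstract, and sending L2 (regional reasoning) to the VLM by default is an assumption, since the abstract does not say which backend handles it.

```python
from enum import Enum

class TaskLevel(Enum):
    """The four SpectrumQA granularity levels named in the abstract."""
    L1_SCENE_CLASSIFICATION = "L1"
    L2_REGIONAL_REASONING = "L2"
    L3_SPATIAL_LOCALIZATION = "L3"
    L4_SEMANTIC_REASONING = "L4"

# Fixed routing table. L1 and L3 are reported as CNN strengths and L4 as a
# VLM-only capability; L2 is deliberately left out (see `fallback` below).
ROUTES = {
    TaskLevel.L1_SCENE_CLASSIFICATION: "cnn",
    TaskLevel.L3_SPATIAL_LOCALIZATION: "cnn",
    TaskLevel.L4_SEMANTIC_REASONING: "vlm",
}

def route(level, fallback="vlm"):
    """Return the backend name for a task level.

    Deterministic: the same level always maps to the same backend.
    `fallback` covers levels absent from ROUTES (here L2); defaulting it
    to "vlm" is an assumption made for this sketch.
    """
    return ROUTES.get(level, fallback)

assignments = {level.value: route(level) for level in TaskLevel}
```

A lookup table keeps the routing auditable: changing which model serves a level is a one-line edit, and the fallback makes the unspecified L2 case explicit rather than silently guessed.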
We further show that VLM representations exhibit stronger cross-scenario robustness, with smaller performance degradation in 5 out of 6 transfer directions. These findings provide actionable guidelines: deploy CNNs for spatial localization and VLMs for semantic spectrum reasoning, rather than treating them as substitutes.
[164] Confidence-Driven Facade Refinement of 3D Building Models Using MLS Point Clouds
Xiaoyu Huang
Main category: cs.CV
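TL;DR: This paper proposes an automated refinement framework that uses a coarse model as a prior, fusing low-accuracy CityGML building models with high-precision MLS data. Surface matching and a topology-constrained binary integer optimization substantially improve facade geometric accuracy while guaranteeing watertight, manifold output.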
Details
Motivation: CityGML models derived from traditional Airborne Laser Scanning (ALS) have significant geometric deficiencies on building facades and cannot meet digital twins' need for high-precision geospatial data; reconstruction-from-scratch methods discard semantic information and depend on complete data coverage, so an automated method that preserves existing semantics and updates facades precisely is needed.
Method: Uses the coarse CityGML model as a geometric prior together with MLS point clouds: surface matching first identifies outdated facades, a candidate face set is then built, and a binary integer optimization with hard constraints selects the optimal faces, ensuring a topologically valid (watertight, manifold) output.
Result: The method reduces Cloud-to-Mesh RMSE by about 36%, achieves centimeter-level alignment, and produces strictly watertight, manifold output, making it suitable for upgrading ALS-derived models in complex urban environments.
Conclusion: The framework refines building models efficiently and robustly for digital twins while preserving the original semantics and topology, offering a practical path for maintaining city-scale digital twins.
Abstract: Digital twins require continuous maintenance to meet the increasing demand for high-precision geospatial data. However, traditional coarse CityGML building models, typically derived from Airborne Laser Scanning (ALS), often exhibit significant geometric deficiencies, particularly regarding facade accuracy due to the nadir perspective of airborne sensors. Integrating these coarse models with high-precision Mobile Laser Scanning (MLS) data is essential to recover detailed facade geometry. Unlike reconstruction-from-scratch approaches that discard existing semantic information and rely heavily on complete data coverage, this work presents an automated refinement framework that utilizes the coarse model as a geometric prior. This method enables targeted updates to facade geometry even in complex urban environments. It integrates surface matching to identify outdated surfaces and employs a binary integer optimization to select optimal faces from candidate data. Crucially, hard constraints are enforced within the optimization to ensure the topological validity of the refined output. Experimental results demonstrate that the proposed approach effectively corrects facade misalignments, reducing the Cloud-to-Mesh RMSE by approximately 36% and achieving centimeter-level alignment. Furthermore, the framework guarantees strictly watertight and manifold geometry, providing a robust solution for upgrading ALS-derived city models.
[165] Next-Scale Autoregressive Models for Text-to-Motion Generation
Zhiwei Zheng,Shibo Jin,Lingjie Liu,Mingmin Zhao
Main category: cs.CV
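TL;DR: This paper proposes MoScale, a hierarchical autoregressive framework for text-driven motion generation that generates motion progressively from coarse to fine temporal resolutions, improving the modeling of long-range motion structure and robustness under limited data.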
Details
Motivation: Standard autoregressive next-token prediction does not match the temporal structure required for text-driven motion generation, and it lacks robustness when paired text-motion data is limited.
Method: MoScale (1) adopts a next-scale autoregressive strategy that builds a coarse-to-fine causal temporal hierarchy; (2) adds cross-scale hierarchical refinement to improve the initial prediction at each scale; and (3) adds in-scale temporal refinement that supports selective bidirectional re-prediction.
Result: MoScale achieves SOTA text-to-motion performance, trains efficiently, scales well with model size, and generalizes zero-shot to a variety of motion generation and editing tasks.
Conclusion: Through hierarchical temporal modeling and its two refinement mechanisms, MoScale addresses temporal alignment and data scarcity in text-driven motion generation and offers a new paradigm for high-quality, robust motion generation.
Abstract: Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement for improving per-scale initial predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves SOTA text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.
[166] HistoFusionNet: Histogram-Guided Fusion and Frequency-Adaptive Refinement for Nighttime Image Dehazing
Mohammad Heydari,Wei Dong,Shahram Shirani,Jun Chen,Han Zhou
Main category: cs.CV
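TL;DR: This paper proposes HistoFusionNet, a Transformer-enhanced network that combines histogram-guided representation learning with frequency-adaptive feature refinement. Designed for nighttime image dehazing, it markedly improves restoration quality in complex nighttime degradation scenes.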
Details
Motivation: Nighttime dehazing faces multiple degradations, including haze, glow, non-uniform illumination, color cast, and sensor noise. The assumptions of conventional daytime dehazing methods often fail, so a targeted solution is needed.
Method: A multi-scale encoder-decoder backbone is built; histogram Transformer blocks that group features by dynamic range model long-range dependencies; and a frequency-aware refinement branch adaptively fuses low- and high-frequency information to recover structure, suppress artifacts, and enhance detail.
Result: The method achieves the best performance on the NTIRE 2026 Nighttime Image Dehazing Challenge benchmark, ranking 1st among 22 participating teams, which supports its effectiveness and robustness.
Conclusion: HistoFusionNet provides a unified framework that handles the heterogeneous degradations in real nighttime hazy images, combining conceptual novelty with practical competitiveness.
Abstract: Nighttime image dehazing remains a challenging low-level vision problem due to the joint presence of haze, glow, non-uniform illumination, color distortion, and sensor noise, which often invalidate assumptions commonly used in daytime dehazing. To address these challenges, we propose HistoFusionNet, a transformer-enhanced architecture tailored for nighttime image dehazing by combining histogram-guided representation learning with frequency-adaptive feature refinement. Built upon a multi-scale encoder-decoder backbone, our method introduces histogram transformer blocks that model long-range dependencies by grouping features according to their dynamic-range characteristics, enabling more effective aggregation of similarly degraded regions under complex nighttime lighting. To further improve restoration fidelity, we incorporate a frequency-aware refinement branch that adaptively exploits complementary low- and high-frequency cues, helping recover scene structures, suppress artifacts, and enhance local details. This design yields a unified framework that is particularly well suited to the heterogeneous degradations encountered in real nighttime hazy scenes. Extensive experiments and highly competitive performance of our method on the NTIRE 2026 Nighttime Image Dehazing Challenge benchmark demonstrate the effectiveness of the proposed method. Our team ranked 1st among 22 participating teams, highlighting the robustness and competitive performance of HistoFusionNet. The code is available at: https://github.com/heydarimo/Night-Time-Dehazing
[167] Rényi Attention Entropy for Patch Pruning
Hiroaki Aizawa,Yuki Igaue
Main category: cs.CV
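TL;DR: This paper proposes a criterion based on the Shannon and Rényi entropy of the attention distribution for pruning image patches in Transformers, reducing computational cost while maintaining accuracy.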
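As a rough illustration of the entropy criterion in [167], the sketch below scores each patch by the Rényi entropy of an attention distribution and keeps the lowest-entropy patches. It is a minimal sketch under assumptions: each patch is taken to have one normalized attention row, and the helper names (`renyi_entropy`, `keep_patches`) and the default `alpha=2.0` are hypothetical, not taken from the paper.

```python
import math

def shannon_entropy(p):
    # H(p) = -sum_i p_i log p_i, skipping zero entries.
    return -sum(x * math.log(x) for x in p if x > 0)

def renyi_entropy(p, alpha):
    # H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha); alpha -> 1 recovers Shannon.
    if abs(alpha - 1.0) < 1e-9:
        return shannon_entropy(p)
    return math.log(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)

def keep_patches(attn_rows, keep, alpha=2.0):
    # Assumption: attn_rows[i] is the normalized attention distribution tied to patch i.
    # Low entropy (concentrated attention) is treated as important, high entropy as redundant.
    scores = [renyi_entropy(row, alpha) for row in attn_rows]
    order = sorted(range(len(attn_rows)), key=lambda i: scores[i])
    return sorted(order[:keep])

rows = [
    [0.25, 0.25, 0.25, 0.25],  # spread-out attention: high entropy
    [0.97, 0.01, 0.01, 0.01],  # sharp peak: low entropy
    [0.40, 0.30, 0.20, 0.10],  # in between
]
kept = keep_patches(rows, keep=2)
```

Larger values of `alpha` weight the largest probabilities more heavily, which is one way to read the abstract's remark that Rényi entropy emphasizes sharp attention peaks.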
Details
Motivation: The computational cost of self-attention in Transformers grows quadratically with the number of tokens, which patch pruning can mitigate; patch selection is especially critical for tasks such as fine-grained image recognition.
Method: A patch-importance criterion based on the Shannon entropy of the attention distribution is proposed (keep low-entropy patches, prune high-entropy ones) and extended to Rényi entropy to increase sensitivity to sharp attention peaks, supporting pruning strategies that adapt to the task and compute budget.
Result: On fine-grained image recognition, the method reduces computation while preserving accuracy; tuning the Rényi entropy parameter further improves the accuracy-computation trade-off.
Conclusion: The entropy-based attention criterion is an effective, flexible, and interpretable patch-pruning method, suited to efficient deployment of vision Transformers under resource constraints.
Abstract: Transformers are strong baselines in both vision and language because self-attention captures long-range dependencies across tokens. However, the cost of self-attention grows quadratically with the number of tokens. Patch pruning mitigates this cost by estimating per-patch importance and removing redundant patches. To identify informative patches for pruning, we introduce a criterion based on the Shannon entropy of the attention distribution. Low-entropy patches, which receive selective and concentrated attention, are kept as important, while high-entropy patches with attention spread across many locations are treated as redundant. We also extend the criterion from Shannon to Rényi entropy, which emphasizes sharp attention peaks and supports pruning strategies that adapt to task needs and computational limits. In experiments on fine-grained image recognition, where patch selection is critical, our method reduced computation while preserving accuracy. Moreover, adjusting the pruning policy through the Rényi entropy measure yields further gains and improves the trade-off between accuracy and computation.
[168] Bridging Restoration and Diagnosis: A Comprehensive Benchmark for Retinal Fundus Enhancement
Xuanzhao Dong,Wenhui Zhu,Xiwen Chen,Hao Wang,Xin Li,Yujian Xiong,Jiajun Cheng,Zhipeng Wang,Shao Tang,Oana Dumitrascu,Yalin Wang
Main category: cs.CV
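TL;DR: This paper proposes EyeBench-V2, a clinically oriented benchmark for retinal fundus image enhancement models. Through multi-dimensional downstream task evaluation, expert-guided evaluation design, and actionable analysis, it addresses the shortcomings of existing evaluation methods in clinical relevance, protocol unification, and guidance value.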
Details
Motivation: Existing metrics such as PSNR and SSIM fail to reflect clinically critical features such as lesion preservation and vessel morphology consistency; there is no unified evaluation protocol that covers both paired and unpaired methods and incorporates clinical knowledge; and actionable insights are needed to drive the development of clinically aligned enhancement models.
Method: The EyeBench-V2 benchmark comprises (1) multi-task downstream clinical evaluation (vessel segmentation, diabetic retinopathy grading, generalization to unseen noise, and lesion segmentation); (2) a new expert-annotated dataset and a structured manual assessment protocol focused on lesion structure changes, background color shifts, and introduced artifacts; and (3) a systematic, task-oriented analysis of mainstream generative models.
Result: EyeBench-V2 enables comprehensive quantitative evaluation of fundus enhancement models along clinically relevant dimensions, reveals common weaknesses of current generative models in lesion fidelity, color consistency, and artifact control, and points to clear directions for improvement.
Conclusion: EyeBench-V2 bridges the gap between technical performance and clinical utility, establishes a clinically driven evaluation paradigm for fundus image enhancement, and pushes generative models toward deployable medical AI.
Abstract: Over the past decade, generative models have demonstrated success in enhancing fundus images. However, the evaluation of these models remains a challenge. A benchmark for fundus image enhancement is needed for three main reasons: (1) Conventional denoising metrics such as PSNR and SSIM fail to capture clinically relevant features, such as lesion preservation and vessel morphology consistency, limiting their applicability in real-world settings; (2) There is a lack of unified evaluation protocols that address both paired and unpaired enhancement methods, particularly those guided by clinical expertise; and (3) An evaluation framework should provide actionable insights to guide future advancements in clinically aligned enhancement models. To address these gaps, we introduce EyeBench-V2, a benchmark designed to bridge the gap between enhancement model performance and clinical utility. Our work offers three key contributions: (1) Multi-dimensional clinical-alignment through downstream evaluations: Beyond standard enhancement metrics, we assess performance across clinically meaningful tasks including vessel segmentation, diabetic retinopathy (DR) grading, generalization to unseen noise patterns, and lesion segmentation. (2) Expert-guided evaluation design: We curate a novel dataset enabling fair comparisons between paired and unpaired enhancement methods, accompanied by a structured manual assessment protocol by medical experts, which evaluates clinically critical aspects such as lesion structure alterations, background color shifts, and the introduction of artificial structures. (3) Actionable insights: Our benchmark provides a rigorous, task-oriented analysis of existing generative models, equipping clinical researchers with the evidence needed to make informed decisions, while also identifying limitations in current methods to inform the design of next-generation enhancement models.
[169] InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset
Felix Stillger,Lukas Hahn,Frederik Hasecke,Tobias Meisen
Main category: cs.CV
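TL;DR: InCaRPose is a Transformer-based camera extrinsic calibration method designed for highly distorted environments such as in-cabin monitoring. Using frozen DINOv3 features and a Transformer decoder, it predicts the metric-scale relative pose between an image pair in a single step. Trained only on synthetic data, it generalizes to real scenes and performs competitively on the 7-Scenes dataset.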
Details
Motivation: Accurate relative pose estimation in constrained, highly distorted environments such as in-cabin automotive monitoring (ICAM) is difficult, and safety-relevant perception in particular requires physically accurate distances.
Method: The InCaRPose architecture extracts features with a frozen DINOv3 backbone and uses a Transformer decoder to model the geometric relationship between the reference and target views. It is trained entirely on synthetic data, handles fisheye lens distortion, and generalizes to real scenes with different intrinsics.
Result: In real cabin environments it estimates rotation and translation with high precision (even with ViT-Small) and outputs absolute metric-scale translation in a single step; its real-time inference suits time-critical tasks such as driver monitoring; and it achieves competitive performance on the 7-Scenes dataset.
Conclusion: InCaRPose offers an efficient, robust extrinsic calibration approach for highly distorted automotive scenes that needs no real-world annotations, combining accuracy, generalization, and real-time performance. The In-Cabin-Pose test dataset and code are released.
Abstract: Camera extrinsic calibration is a fundamental task in computer vision. However, precise relative pose estimation in constrained, highly distorted environments, such as in-cabin automotive monitoring (ICAM), remains challenging. We present InCaRPose, a Transformer-based architecture designed for robust relative pose prediction between image pairs, which can be used for camera extrinsic calibration. By leveraging frozen backbone features such as DINOv3 and a Transformer-based decoder, our model effectively captures the geometric relationship between a reference and a target view. Unlike traditional methods, our approach achieves absolute metric-scale translation within the physically plausible adjustment range of in-cabin camera mounts in a single inference step, which is critical for ICAM, where accurate real-world distances are required for safety-relevant perception. We specifically address the challenges of highly distorted fisheye cameras in automotive interiors by training exclusively on synthetic data. Our model is capable of generalization to real-world cabin environments without relying on the exact same camera intrinsics and additionally achieves competitive performance on the public 7-Scenes dataset. Despite having limited training data, InCaRPose maintains high precision in both rotation and translation, even with a ViT-Small backbone. This enables real-time performance for time-critical inference, such as driver monitoring in supervised autonomous driving. We release our real-world In-Cabin-Pose test dataset consisting of highly distorted vehicle-interior images and our code at https://github.com/felixstillger/InCaRPose.
[170] ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
Peijun Bao,Anwei Luo,Gang Pan,Alex C. Kot,Xudong Jiang
Main category: cs.CV
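TL;DR: This paper introduces ActivityForensics, the first large-scale benchmark dataset for localizing manipulated human activity in videos, and designs TADiff, a diffusion-based baseline for temporal forgery detection, to address the threat that activity-level video forgeries pose to media authenticity.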
Details
Motivation: Existing video forgery detection benchmarks focus on appearance-level manipulations (such as face swapping and object removal), whereas emerging activity-level forgeries, which modify human actions to distort event semantics, are more deceptive and call for dedicated benchmarks and methods.
Method: The authors build the ActivityForensics benchmark with more than 6,000 high-quality forged activity video segments; propose Temporal Artifact Diffuser (TADiff), a simple yet effective baseline that uses a diffusion-based feature regularizer to expose temporal artifact cues; and design comprehensive evaluation protocols covering intra-domain, cross-domain, and open-world settings.
Result: The work provides a large-scale dataset of activity-level forged videos with high visual consistency; TADiff performs well across settings; and a broad evaluation of SOTA forgery localizers reveals their limitations on activity-level forgeries.
Conclusion: ActivityForensics fills the benchmark gap for activity-level video forgery detection, and TADiff provides a scalable baseline, moving video deepfake detection toward more semantic and more robust methods.
Abstract: Temporal forgery localization aims to temporally identify manipulated segments in videos. Most existing benchmarks focus on appearance-level forgeries, such as face swapping and object removal. However, recent advances in video generation have driven the emergence of activity-level forgeries that modify human actions to distort event semantics, resulting in highly deceptive forgeries that critically undermine media authenticity and public trust. To overcome this issue, we introduce ActivityForensics, the first large-scale benchmark for localizing manipulated activity in videos. It contains over 6K forged video segments that are seamlessly blended into the video context, rendering high visual consistency that makes them almost indistinguishable from authentic content to the human eye. We further propose Temporal Artifact Diffuser (TADiff), a simple yet effective baseline that exposes artifact cues through a diffusion-based feature regularizer. Based on ActivityForensics, we introduce comprehensive evaluation protocols covering intra-domain, cross-domain, and open-world settings, and benchmark a wide range of state-of-the-art forgery localizers to facilitate future research. The dataset and code are available at https://activityforensics.github.io.
[171] SPARK-IL: Spectral Retrieval-Augmented RAG for Knowledge-driven Deepfake Detection via Incremental Learning
Hessen Bougueffa Eutamene,Abdellah Zakaria Sellam,Abdelmalik Taleb-Ahmed,Abdenour Hadid
Main category: cs.CV
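TL;DR: SPARK-IL is a retrieval-augmented framework for detecting AI-generated images based on frequency-domain analysis. Using dual-path spectral decomposition, KAN processing, and incremental learning, it reaches 94.6% mean accuracy on cross-generator detection.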
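The retrieval step described for [171] (k nearest labeled signatures by cosine similarity, then majority vote) can be sketched as follows. This is only an illustrative stand-in: a plain Python list replaces the Milvus database, the two-dimensional vectors are made up, and the function names `cosine` and `knn_vote` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_vote(query, database, k=3):
    # database: list of (embedding, label) pairs standing in for a vector store.
    ranked = sorted(database, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy signatures (illustrative values only).
db = [
    ([1.0, 0.0], "real"),
    ([0.9, 0.1], "real"),
    ([0.0, 1.0], "fake"),
    ([0.1, 0.9], "fake"),
    ([0.2, 0.8], "fake"),
]
pred_a = knn_vote([1.0, 0.05], db, k=3)
pred_b = knn_vote([0.0, 1.0], db, k=3)
```

Under this reading, incremental learning amounts to appending new `(embedding, label)` pairs to the store, which is why the database can grow without retraining the voting rule itself.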
Details
Motivation: Existing detectors generalize poorly to unseen generative models, whereas frequency-domain features are more consistent across models than pixel-level features. Using frequency information to improve generalization is therefore the key motivation.
Method: The SPARK-IL framework extracts semantic features with a partially frozen ViT-L/14 while processing raw RGB pixels in a parallel path. Each path undergoes a four-band Fourier decomposition, and Kolmogorov-Arnold Networks (with a mixture-of-experts structure) apply band-specific transformations; the results are fused by cross-attention. At inference, the k nearest signatures are retrieved from a Milvus database by cosine similarity and the prediction is made by majority vote. In the incremental learning stage, the database is expanded and elastic weight consolidation is used to prevent catastrophic forgetting.
Result: It attains 94.6% mean accuracy on the UniversalFakeDetect benchmark, which covers 19 generative models (GANs, face swapping, diffusion models, and others).
Conclusion: Frequency-domain features combined with retrieval augmentation and incremental learning can substantially improve the generalization of AI-image detectors to unknown generators; SPARK-IL offers a new paradigm for general forgery detection.
Abstract: Detecting AI-generated images remains a significant challenge because detectors trained on specific generators often fail to generalize to unseen models; however, while pixel-level artifacts vary across models, frequency-domain signatures exhibit greater consistency, providing a promising foundation for cross-generator detection. To address this, we propose SPARK-IL, a retrieval-augmented framework that combines dual-path spectral analysis with incremental learning by utilizing a partially frozen ViT-L/14 encoder for semantic representations alongside a parallel path for raw RGB pixel embeddings. Both paths undergo multi-band Fourier decomposition into four frequency bands, which are individually processed by Kolmogorov-Arnold Networks (KAN) with mixture-of-experts for band-specific transformations before the resulting spectral embeddings are fused via cross-attention with residual connections. During inference, this fused embedding retrieves the k nearest labeled signatures from a Milvus database using cosine similarity to facilitate predictions via majority voting, while an incremental learning strategy expands the database and employs elastic weight consolidation to preserve previously learned transformations. Evaluated on the UniversalFakeDetect benchmark across 19 generative models (including GANs, face-swapping, and diffusion methods), SPARK-IL achieves a 94.6% mean accuracy, with the code to be publicly released at https://github.com/HessenUPHF/SPARK-IL.
[172] Task-Guided Multi-Annotation Triplet Learning for Remote Sensing Representations
Meilun Zhou,Alina Zare
Main category: cs.CV
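TL;DR: This paper proposes a task-guided multi-annotation triplet loss that selects the most informative triplets across tasks using a mutual-information criterion, optimizing the shared representation without relying on static weights.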
Details
Motivation: Existing multi-task triplet loss methods based on static weights require manual tuning and cannot model how tasks jointly shape a shared representation.
Method: A task-guided multi-annotation triplet loss uses a mutual-information criterion to dynamically select triplets that are highly informative for multiple tasks, replacing loss weighting with sample selection.
Result: Experiments on an aerial wildlife dataset show improved classification and regression performance and a more effective shared representation.
Conclusion: Task-aware triplet selection promotes high-quality shared representations in multi-task learning better than static weighting.
Abstract: Prior multi-task triplet loss methods relied on static weights to balance supervision between various types of annotation. However, static weighting requires tuning and does not account for how tasks interact when shaping a shared representation. To address this, the proposed task-guided multi-annotation triplet loss removes this dependency by selecting triplets through a mutual-information criterion that identifies triplets most informative across tasks. This strategy modifies which samples influence the representation rather than adjusting loss magnitudes. Experiments on an aerial wildlife dataset compare the proposed task-guided selection against several triplet loss setups for shaping a representation in an effective multi-task manner. The results show improved classification and regression performance and demonstrate that task-aware triplet selection produces a more effective shared representation for downstream tasks.
[173] MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
Bin Wang,Tianyao He,Linke Ouyang,Fan Wu,Zhiyuan Zhao,Tao Chu,Yuan Qu,Zhenjiang Jin,Weijun Zeng,Ziyang Miao,Bangrui Xu,Junbo Niu,Mengzhang Cai,Jiantao Qiu,Qintong Zhang,Dongsheng Ma,Yuefeng Sun,Hejun Dong,Wenzheng Zhang,Jutao Xiao,Jiayong Shi,Pengyu Liao,Xiaomeng Zhao,Huaping Zhong,Liqun Wei,Jing Yu,Jie Yang,Wei Li,Shasha Wang,Qianqian Wu,Xuanhe Zhou,Weijia Li,Zhenxiang Li,Zhongying Tu,Jiang Wu,Lijun Wu,Chao Xu,Kai Chen,Wentao Zhang,Yu Qiao,Bowen Zhou,Dahua Lin,Conghui He
Main category: cs.CV
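TL;DR: This paper presents MinerU2.5-Pro, which markedly improves document parsing performance through data engineering and training strategy optimization without changing the model architecture.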
Details
Motivation: Existing document parsing methods focus mainly on model architecture innovation, while systematic engineering of training data is neglected. SOTA models of different architectures and parameter scales show highly consistent failure patterns on the same hard samples, indicating that the performance bottleneck lies in training data deficiencies rather than in the architecture itself.
Method: MinerU2.5-Pro is built around a Data Engine designed for coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands the dataset and corrects distribution shift; Cross-Model Consistency Verification uses output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through iterative render-then-verify correction; and a three-stage progressive training strategy is used (large-scale pre-training, hard-sample fine-tuning, and GRPO alignment).
Result: It scores 95.69 on OmniDocBench v1.6, 2.71 points above the same-architecture baseline, and surpasses all existing methods, including models with over 200x more parameters.
Conclusion: The quality and construction of training data are the key bottleneck for document parsing performance; data engineering and training strategy optimization alone can clearly outperform methods that rely on larger parameter counts or architectural innovation.
Abstract: Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy (large-scale pre-training, hard sample fine-tuning, and GRPO alignment) sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200x more parameters.
[174] Beyond Task-Driven Features for Object Detection
Meilun Zhou,Alina Zare
Main category: cs.CV
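TL;DR: This paper proposes an annotation-guided feature augmentation framework that injects annotation embeddings into an object detection backbone to make features more semantically meaningful, generalizable, and robust.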
Details
Motivation: Task-driven features learned by modern object detectors optimize the end-to-end loss but tend to capture shortcut correlations that are inconsistent with the annotation structure, limiting transfer, interpretability, and robustness.
Method: Dense spatial feature grids are built from annotation-guided latent spaces and fused with feature pyramid representations to influence the region proposal and detection heads.
Result: Experiments on wildlife and remote sensing datasets show improved object focus, reduced background sensitivity, and stronger generalization to unseen or weakly supervised tasks.
Conclusion: Aligning features with annotation geometry yields more meaningful representations than optimizing the task loss alone.
Abstract: Task-driven features learned by modern object detectors optimize end task loss yet often capture shortcut correlations that fail to reflect underlying annotation structure. Such representations limit transfer, interpretability, and robustness when task definitions change or supervision becomes sparse. This paper introduces an annotation-guided feature augmentation framework that injects embeddings into an object detection backbone. The method constructs dense spatial feature grids from annotation-guided latent spaces and fuses them with feature pyramid representations to influence region proposal and detection heads. Experiments across wildlife and remote sensing datasets evaluate classification, localization, and data efficiency under multiple supervision regimes. Results show consistent improvements in object focus, reduced background sensitivity, and stronger generalization to unseen or weakly supervised tasks. The findings demonstrate that aligning features with annotation geometry yields more meaningful representations than purely task-optimized features.
[175] Vero: An Open RL Recipe for General Visual Reasoning
Gabriel Sarch,Linrong Cai,Qunzhong Wang,Haoyang Wu,Danqi Chen,Zhuang Liu
Main category: cs.CV
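TL;DR: This paper presents Vero, a fully open family of vision-language models (VLMs) that matches or exceeds existing open-weight models across diverse visual reasoning tasks. It builds Vero-600K, a large-scale multi-task reinforcement learning dataset with task-routed rewards, and systematically shows that broad data coverage is key to RL scaling.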
Details
Motivation: The strongest VLMs show broad visual reasoning ability, but their training recipes (especially proprietary RL pipelines and private data) are opaque, which limits reproduction and progress by the community. A reproducible, open, high-performing alternative is needed.
Method: The authors build Vero-600K, a multi-task RL dataset of 600K samples drawn from 59 datasets; design task-routed rewards that handle heterogeneous answer formats; run RL training from base models including Qwen3-VL-8B-Instruct; and introduce the VeroEval suite (30 challenging benchmarks) together with systematic ablations.
Result: Vero improves over its base models by 3.7-5.5 points on average on VeroEval and outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks. From the same base model, Vero-600K beats existing RL datasets. Ablations indicate that multi-task data coverage is the main driver of the gains, while single-task data transfers poorly.
Conclusion: Broad, well-structured RL data is central to improving the general reasoning ability of VLMs. Vero shows that a fully open path to strong visual reasoning is feasible, and all data, code, and models are released.
Abstract: What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.7-5.5 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When trained from the same base model, Vero-600K exceeds existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.
[176] Training a Student Expert via Semi-Supervised Foundation Model Distillation
Pardis Taghavi,Tian Liu,Renjie Li,Reza Langari,Zhengzhong Tu
Main category: cs.CV
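TL;DR: This paper proposes a semi-supervised knowledge distillation (SSKD) framework that compresses large vision foundation models (VFMs) into lightweight expert models using few labeled and many unlabeled samples, targeting instance segmentation, where pixel-level annotation is expensive. With a three-stage pipeline (domain adaptation, knowledge transfer, student refinement) and an instance-aware pixel-wise contrastive loss, it improves over zero-shot and adapted teacher models on Cityscapes and ADE20K and outperforms existing SSKD methods.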
Details
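Motivation: Foundation models perceive well but are computationally heavy and hard to deploy, and adapting them usually requires costly manual annotation. In instance segmentation in particular, pixel-level labels are very expensive, so a compression method that makes efficient use of limited labels and abundant unlabeled data is needed.
Method: A three-stage semi-supervised knowledge distillation framework is proposed: (1) domain adaptation of the VFM via self-training with contrastive calibration; (2) multi-objective knowledge transfer using an instance-aware pixel-wise contrastive loss that fuses mask and class scores; and (3) student refinement to mitigate pseudo-label bias. The core idea is a contrastive signal maintained across adaptation and distillation, which aligns teacher and student embeddings and makes better use of unlabeled images.
Result: On Cityscapes and ADE20K, a student roughly 11x smaller improves over the zero-shot teacher by +11.9 and +8.6 AP, over the adapted teacher by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods.
Conclusion: The SSKD framework compresses vision foundation models efficiently and markedly improves instance segmentation when labels are scarce, supporting the effectiveness and practicality of contrast-driven semi-supervised distillation.
Abstract: Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our approximately 11x smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on benchmarks.
[177] Learning 3D Reconstruction with Priors in Test Time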
Lei Zhou,Haoyu Wu,Akshat Dave,Dimitris Samaras
Main category: cs.CV
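TL;DR: This paper proposes a test-time constrained optimization (TCO) framework that, at inference time, treats priors such as camera poses, intrinsics, and depth as constraints on the predictions rather than modifying the network or retraining it, markedly improving multiview Transformers (MVTs) on 3D tasks.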
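The loss structure described for [177] (a self-supervised consistency objective plus prior penalty terms, minimized at inference time) can be illustrated with a deliberately tiny toy. Everything here is an assumption for illustration: a single scalar `x` stands in for the network's prediction, squared error stands in for the photometric or geometric loss, and the names `tco_loss`, `optimize`, and the weight `lam` are hypothetical.

```python
def tco_loss(x, view_preds, prior, lam):
    # Self-supervised term: agreement between the estimate and each view's prediction.
    consistency = sum((x - v) ** 2 for v in view_preds)
    # Prior penalty: pull the estimate toward an externally supplied prior value.
    penalty = lam * (x - prior) ** 2
    return consistency + penalty

def optimize(view_preds, prior, lam=1.0, lr=0.05, steps=500):
    # Plain gradient descent on the toy loss, starting from the mean view prediction.
    x = sum(view_preds) / len(view_preds)
    for _ in range(steps):
        grad = sum(2.0 * (x - v) for v in view_preds) + 2.0 * lam * (x - prior)
        x -= lr * grad
    return x

views = [2.0, 2.2, 2.4]          # toy per-view estimates of one quantity
with_prior = optimize(views, prior=3.0, lam=1.0)
without_prior = optimize(views, prior=3.0, lam=0.0)
```

For this quadratic toy the minimizer has the closed form (sum of view predictions + lam * prior) / (number of views + lam), so the prior shifts the estimate from the view mean toward the prior value in proportion to `lam`.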
Details
Motivation: Existing methods usually retrain the model to incorporate priors, which is costly and inflexible, while feeding priors directly into the network can be limited by the architecture. This work explores a more general way to incorporate priors at test time without retraining.
Method: A pre-trained image-only MVT is optimized at test time with a joint loss that combines a self-supervised objective (such as multi-view photometric or geometric consistency) with prior penalty terms on the corresponding output modalities (camera pose, depth, and so on), minimized by gradient-based optimization.
Result: On ETH3D, 7-Scenes, and NRGBD, point-map distance error drops by more than 50%; the method outperforms both the base MVTs and retrained prior-aware feed-forward methods.
Conclusion: Test-time constrained optimization is an efficient, general way to incorporate priors without retraining, and it markedly improves multiview Transformers across several 3D vision tasks.
Abstract: We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks without retraining or modifying pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference time. The optimization loss consists of a self-supervised objective and prior penalty terms. The self-supervised objective captures the compatibility among multi-view predictions and is implemented using photometric or geometric loss between renderings from other views and each view itself. Any available priors are converted into penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On the ETH3D, 7-Scenes, and NRGBD datasets, our method reduces the point-map distance error by more than half compared with the base image-only models. Our method also outperforms retrained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework for incorporating priors into 3D vision tasks.
[178] Interpreting Video Representations with Spatio-Temporal Sparse Autoencoders
Atahan Dokme,Sriram Vishwanath
Main category: cs.CV
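TL;DR: This paper presents the first systematic study of sparse autoencoders (SAEs) on video representations. It proposes spatio-temporal contrastive objectives and Matryoshka hierarchical grouping, which markedly improve temporal coherence and yield gains on action classification and text-video retrieval.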
Details
Motivation: Standard sparse autoencoders extract interpretable, monosemantic features but destroy the temporal coherence of video, which calls for improvement.
Method: Spatio-temporal contrastive losses and Matryoshka hierarchical grouping are introduced; the contrastive loss weight provides a tunable trade-off between reconstruction quality and temporal coherence; and systematic ablations are run across backbones and datasets.
Result: Contrastive SAE features improve action classification accuracy by +3.9% and text-video retrieval R@1 by up to 2.8x; standard monosemanticity metrics are found to contain a backbone-alignment artifact; and causal ablation confirms that contrastive training concentrates predictive signal in a small number of identifiable features.
Conclusion: The method recovers and even exceeds the temporal coherence of the raw video representations, shows advantages on several downstream tasks, and exposes a potential bias in monosemanticity evaluation.
Abstract: We present the first systematic study of Sparse Autoencoders (SAEs) on video representations. Standard SAEs decompose video into interpretable, monosemantic features but destroy temporal coherence: hard TopK selection produces unstable feature assignments across frames, reducing autocorrelation by 36%. We propose spatio-temporal contrastive objectives and Matryoshka hierarchical grouping that recover and even exceed raw temporal coherence. The contrastive loss weight controls a tunable trade-off between reconstruction and temporal coherence. A systematic ablation on two backbones and two datasets shows that different configurations excel at different goals: reconstruction fidelity, temporal coherence, action discrimination, or interpretability. Contrastive SAE features improve action classification by +3.9% over raw features and text-video retrieval by up to 2.8x in R@1. A cross-backbone analysis reveals that standard monosemanticity metrics contain a backbone-alignment artifact: both DINOv2 and VideoMAE produce equally monosemantic features under neutral (CLIP) similarity. Causal ablation confirms that contrastive training concentrates predictive signal into a small number of identifiable features.
[179] Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
Xueyang Kang,Zizhao Li,Tian Lan,Dong Gong,Kourosh Khoshelham,Liangliang Nan
Main category: cs.CV
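TL;DR: This paper proposes a hierarchical point-patch anomaly scoring network that jointly models regional part features and local point features, improving the robustness and generalization of 3D shape anomaly detection across anomaly types, scales, and noise.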
Details
Motivation: Existing deep learning methods generalize poorly and lack robustness when faced with diverse anomaly types (such as global geometric errors) and scales, and with noisy or incomplete point clouds during training.
Method: A hierarchical point-patch anomaly scoring network is proposed. It includes an adaptive patchification module with self-supervised decomposition and jointly models regional part features and local point features for anomaly reasoning.
Result: The method achieves SOTA performance on Anomaly-ShapeNet, Real3D-AD, and a newly released industrial CAD test set: over 40% point-level improvement on the new industrial anomaly type, and average object-level gains of 7% on Real3D-AD and 4% on Anomaly-ShapeNet.
Conclusion: The method markedly improves the modeling of complex structural deviations in 3D shape anomaly detection and shows stronger robustness and generalization on real industrial defects.
Abstract: 3D shape anomaly detection is a crucial task for industrial inspection and geometric analysis. Existing deep learning approaches typically learn representations of normal shapes and identify anomalies via out-of-distribution feature detection or decoder-based reconstruction. They often fail to generalize across diverse anomaly types and scales, such as global geometric errors (e.g., planar shifts, angle misalignments), and are sensitive to noisy or incomplete local points during training. To address these limitations, we propose a hierarchical point-patch anomaly scoring network that jointly models regional part features and local point features for robust anomaly reasoning. An adaptive patchification module integrates self-supervised decomposition to capture complex structural deviations. Beyond evaluations on public benchmarks (Anomaly-ShapeNet and Real3D-AD), we release an industrial test set with real CAD models exhibiting planar, angular, and structural defects. Experiments on public and industrial datasets show superior AUC-ROC and AUC-PR performance, including over 40% point-level improvement on the new industrial anomaly type and average object-level gains of 7% on Real3D-AD and 4% on Anomaly-ShapeNet, demonstrating strong robustness and generalization.
[180] Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics
Minglei Chen,Weilong Wang,Jiang Duan,Ye Deng
Main category: cs.CV
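TL;DR: This paper proposes GAPL, a parameter-efficient prompt learning method based on second-order statistics (Gram matrices). Anchoring text prompts to second-order statistical priors of visual features makes VLMs more robust in cross-domain adaptation.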
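For readers unfamiliar with the second-order statistic that [180] builds on, the sketch below computes a Gram matrix from a set of spatial feature tokens. It only illustrates the statistic itself; how GAPL feeds it into prompt learning is not shown, and the function name `gram_matrix` and the normalization by token count are assumptions made here.

```python
def gram_matrix(tokens):
    # tokens: list of N spatial feature vectors, each with C channels.
    # Returns the C x C matrix G[i][j] = (1/N) * sum_n tokens[n][i] * tokens[n][j],
    # i.e. channel co-activation averaged over spatial positions.
    n = len(tokens)
    c = len(tokens[0])
    return [[sum(t[i] * t[j] for t in tokens) / n for j in range(c)] for i in range(c)]

feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy 3-token, 2-channel feature map
g = gram_matrix(feats)
g_shuffled = gram_matrix(list(reversed(feats)))
```

Because the sum runs over all positions, the result does not depend on where each token sits in the image, which is one way to see why such a statistic describes global structure rather than a spatial layout.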
Details
Motivation: Existing prompt learning methods rely only on first-order spatial visual features, which are vulnerable to domain shift and local noise and cannot model global structural consistency.
Method: Gram-Anchored Prompt Learning (GAPL) adds a second-order statistical stream built from Gram matrices on top of the standard first-order feature interaction, so that text prompts adapt dynamically to changes in the statistical distribution of visual features.
Result: Experiments on multiple benchmarks support the value of the second-order features, and GAPL shows strong cross-domain adaptation performance.
Conclusion: A prompt learning paradigm that combines first-order semantic alignment with second-order structural consistency can markedly improve the robustness and generalization of VLMs under distribution shift.
Abstract: Parameter-efficient prompt learning has become the de facto standard for adapting Vision-Language Models (VLMs) to downstream tasks. Existing approaches predominantly focus on aligning text prompts with first-order visual features (i.e., spatial feature maps). While effective for fine-grained semantic discrimination, we argue that relying solely on first-order information is insufficient for robust adaptation, as these spatially entangled features are highly susceptible to domain shifts and local noise. In this work, we propose Gram-Anchored Prompt Learning (GAPL) for Vision-Language Models via Second-Order Statistics, a framework that synergizes local semantic alignment with global structural consistency. Methodologically, we introduce an additional second-order statistical stream via Gram matrices that augments the standard first-order spatial interaction. By anchoring prompts to these second-order priors, our approach enables language representations to dynamically adapt to statistical distribution shifts across diverse domains. Extensive experiments indicate the effectiveness of the second-order features, and show compelling performances of GAPL on various benchmarks.
[181] High-Fidelity Mural Restoration via a Unified Hybrid Mask-Aware Transformer
Jincheng Jiang,Qianhao Han,Chi Zhang,Zheng Zheng
Main category: cs.CV
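TL;DR: This paper proposes the Hybrid Mask-Aware Transformer (HMAT), a unified framework for high-fidelity restoration of ancient murals. It combines mask-aware dynamic filtering, Transformer-based long-range structural inference, a mask-conditional style fusion module, and a teacher-forcing decoder with hard-gated skip connections, improving the structural coherence and visual fidelity of the restorations.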
Details
Motivation: Ancient murals have degraded severely through environmental exposure, material aging, and human activity. Restoration must both reconstruct large missing structures and strictly preserve authentic, undamaged regions, and existing methods struggle to do both.
Method: The Hybrid Mask-Aware Transformer (HMAT) comprises (1) Mask-Aware Dynamic Filtering to model local texture; (2) a Transformer bottleneck for long-range structural inference; (3) a mask-conditional style fusion module that dynamically guides generation; and (4) a Teacher-Forcing Decoder with hard-gated skip connections that preserves fidelity in valid regions and focuses reconstruction on missing areas.
Result: On the DHMural and Nine-Colored Deer mural datasets, HMAT outperforms or matches state-of-the-art methods across degradation levels and produces more structurally coherent and visually faithful results.
Conclusion: HMAT provides an efficient, robust, and high-fidelity solution for the digital restoration of cultural heritage murals.
Abstract: Ancient murals are valuable cultural artifacts, but many have suffered severe degradation due to environmental exposure, material aging, and human activity. Restoring these artworks is challenging because it requires both reconstructing large missing structures and strictly preserving authentic, undamaged regions. This paper presents the Hybrid Mask-Aware Transformer (HMAT), a unified framework for high-fidelity mural restoration. HMAT integrates Mask-Aware Dynamic Filtering for robust local texture modeling with a Transformer bottleneck for long-range structural inference. To further address the diverse morphology of degradation, we introduce a mask-conditional style fusion module that dynamically guides the generative process. In addition, a Teacher-Forcing Decoder with hard-gated skip connections is designed to enforce fidelity in valid regions and focus reconstruction on missing areas. We evaluate HMAT on the DHMural dataset and a curated Nine-Colored Deer dataset under varying degradation levels. Experimental results demonstrate that the proposed method achieves competitive performance compared to state-of-the-art approaches, while producing more structurally coherent and visually faithful restorations. These findings suggest that HMAT provides an effective solution for the digital restoration of cultural heritage murals.
[182] A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
Tianle Chen,Deepti Ghadiyaram
Main category: cs.CV
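TL;DR: This paper introduces Multi-Modal Typography, a systematic study of how jointly perturbing the audio, visual, and text modalities attacks multi-modal large language models (MLLMs). The coordinated attack reaches an 83.43% success rate versus 34.93% for single-modality attacks, exposing the fragility of MLLMs under cross-modal interaction.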
Details
Motivation: As audio-visual MLLMs are increasingly deployed in safety-critical settings, their multi-modal security vulnerabilities need to be understood, especially new weaknesses arising from cross-modal interaction.
Method: The authors propose and systematically carry out multi-modal typographic attacks that jointly perturb audio, visual, and text inputs, and compare single-modality and multi-modal attacks across several frontier MLLMs on common-sense reasoning and content moderation benchmarks.
Result: The coordinated multi-modal attack reaches an 83.43% success rate, well above the 34.93% of single-modality attacks. The effect is consistent across multiple MLLMs and tasks, indicating that cross-modal fragility is widespread.
Conclusion: Multi-modal typographic attacks are a critical and underestimated threat that highlights serious gaps in the cross-modal alignment and robustness of current MLLMs and calls for targeted defenses.
Abstract: As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that a coordinated multi-modal attack creates a significantly more potent threat than single-modality attacks (attack success rate = $83.43\%$ vs $34.93\%$). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.
[183] OASIC: Occlusion-Agnostic and Severity-Informed Classification
Kay Gijzen,Gertjan J. Burghouts,Daniël M. Pelt
Main category: cs.CV
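TL;DR: This paper proposes OASIC, which estimates occlusion severity at test time, masks the occluding patterns, and selects the model best suited to that severity, substantially improving image classification under occlusion.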
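The severity-informed selection step in the TL;DR can be pictured with a minimal Python sketch. It assumes, purely for illustration, that severity is measured as the fraction of masked pixels and that each specialist model is keyed by the severity it was trained for. The names `estimate_severity`, `select_model`, and the model handles are hypothetical and not taken from the paper.

```python
def estimate_severity(occlusion_mask):
    """Fraction of pixels flagged as occluded (an assumed severity measure)."""
    flat = [px for row in occlusion_mask for px in row]
    return sum(1 for px in flat if px) / len(flat)


def select_model(severity, specialists):
    """Pick the specialist whose training severity is closest to the estimate.

    specialists: dict mapping a training severity in [0, 1] to a model handle.
    """
    nearest = min(specialists, key=lambda s: abs(s - severity))
    return specialists[nearest]


# Hypothetical specialists, one per occlusion level.
specialists = {0.0: "model_clean", 0.25: "model_light",
               0.5: "model_medium", 0.75: "model_heavy"}
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
severity = estimate_severity(mask)            # 5 of 16 pixels are masked
chosen = select_model(severity, specialists)  # nearest training severity is 0.25
```

Nearest-severity lookup is only one possible selection rule. The entry itself states only that the model optimized for the estimated degree of occlusion is chosen.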
Details
Motivation: Severe occlusion removes visible information and introduces distracting occluder patterns, which makes it a major challenge for computer vision.
Method: (1) At test time, occluded regions are identified through visual anomaly detection and masked. (2) During training, random parts of the object are masked to simulate different degrees of occlusion. (3) The occlusion severity of each test image is estimated, and the specialist model optimized for that severity is selected for classification.
Result: OASIC improves $\text{AUC}_\text{occ}$ by +18.5 over standard training on occluded images and by +23.7 over fine-tuning on unoccluded images.
Conclusion: Severity awareness combined with adaptive model selection mitigates the effect of occlusion and outperforms any single model, including models optimized for a broader range of severities.
Abstract: Severe occlusions of objects pose a major challenge for computer vision. We show that two root causes are (1) the loss of visible information and (2) the distracting patterns caused by the occluders. Our approach addresses both causes at the same time. First, the distracting patterns are removed at test-time, via masking of the occluding patterns. This masking is independent of the type of occlusion, by handling the occlusion through the lens of visual anomalies w.r.t. the object of interest. Second, to deal with fewer visual details, we follow standard practice by masking random parts of the object during training, for various degrees of occlusions. We discover that (a) it is possible to estimate the degree of the occlusion (i.e. severity) at test-time, and (b) that a model optimized for a specific degree of occlusion also performs best on a similar degree during test-time. Combining these two insights brings us to a severity-informed classification model called OASIC: Occlusion Agnostic Severity Informed Classification. We estimate the severity of occlusion for a test image, mask the occluder, and select the model that is optimized for the degree of occlusion. This strategy performs better than any single model optimized for any smaller or broader range of occlusion severities. Experiments show that combining gray masking with adaptive model selection improves $\text{AUC}_\text{occ}$ by +18.5 over standard training on occluded images and +23.7 over finetuning on unoccluded images.
[184] HOIGS: Human-Object Interaction Gaussian Splatting
Taewoo Kim,Suwoong Yeom,Jaehyun Pyun,Geonho Cha,Dongyoon Wee,Joonsik Nam,Yun-Seong Jeong,Kyeongbo Kong,Suk-Ju Kang
Main category: cs.CV
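TL;DR: This paper proposes HOIGS, which explicitly models the deformation induced by human-object interaction through a cross-attention mechanism and combines two heterogeneous deformation bases, HexPlane for humans and cubic Hermite splines for objects, to improve dynamic scene reconstruction in complex interaction scenarios.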
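For readers unfamiliar with the cubic Hermite spline named in the TL;DR as the object deformation basis, here is a minimal sketch of the textbook formulation in Python. It is not the paper's implementation: the keyframe layout (integer times, per-keyframe tangents) and the function names `hermite_segment` and `object_position` are assumptions made for illustration.

```python
def hermite_segment(p0, p1, m0, m1, t):
    """Evaluate one cubic Hermite segment at local parameter t in [0, 1].

    p0, p1 are the endpoint values and m0, m1 the endpoint tangents.
    """
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1   # weight of p0
    h10 = t3 - 2 * t2 + t       # weight of m0
    h01 = -2 * t3 + 3 * t2      # weight of p1
    h11 = t3 - t2               # weight of m1
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1


def object_position(keys, tangents, t):
    """Interpolate a 3D position at time t.

    keys and tangents are lists of (x, y, z) tuples placed at integer
    times 0 .. len(keys) - 1 (an assumed layout); at least two keys are needed.
    """
    i = max(0, min(int(t), len(keys) - 2))  # segment index, clamped to valid range
    u = t - i                               # local parameter within the segment
    return tuple(
        hermite_segment(keys[i][d], keys[i + 1][d],
                        tangents[i][d], tangents[i + 1][d], u)
        for d in range(3)
    )


# Two keyframes with tangents equal to the displacement give straight-line motion.
keys = [(0.0, 0.0, 0.0), (2.0, 4.0, 0.0)]
tangents = [(2.0, 4.0, 0.0), (2.0, 4.0, 0.0)]
midpoint = object_position(keys, tangents, 0.5)
```

The spline passes through every keyframe and its velocity is continuous across segments, which is one plausible reason to prefer it for rigid object trajectories.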
Details
Motivation: Existing Gaussian Splatting methods fall short on dynamic scenes with complex human-object interaction. They either rely on human pose priors and neglect object dynamics, or approximate all motion with a single field, so they cannot capture interaction-rich dynamics accurately.
Method: Human-Object Interaction Gaussian Splatting (HOIGS) introduces a cross-attention-based HOI module that explicitly models interaction-induced deformation. Humans are represented with HexPlane and objects with a Cubic Hermite Spline, and the two heterogeneous feature sets are fused to describe interdependent motion.
Result: Experiments on multiple datasets show that HOIGS consistently outperforms state-of-the-art human-centric and 4D Gaussian methods in reconstruction quality, particularly in scenes involving occlusion, contact, and object manipulation.
Conclusion: Explicitly modeling human-object interaction is essential for high-fidelity dynamic scene reconstruction, and HOIGS offers a new approach to interaction-heavy 3D reconstruction.
Abstract: Reconstructing dynamic scenes with complex human-object interactions is a fundamental challenge in computer vision and graphics. Existing Gaussian Splatting methods either rely on human pose priors while neglecting dynamic objects, or approximate all motions within a single field, limiting their ability to capture interaction-rich dynamics. To address this gap, we propose Human-Object Interaction Gaussian Splatting (HOIGS), which explicitly models interaction-induced deformation between humans and objects through a cross-attention-based HOI module. Distinct deformation baselines are employed to extract features: HexPlane for humans and Cubic Hermite Spline (CHS) for objects. By integrating these heterogeneous features, HOIGS effectively captures interdependent motions and improves deformation estimation in scenarios involving occlusion, contact, and object manipulation. Comprehensive experiments on multiple datasets demonstrate that our method consistently outperforms state-of-the-art human-centric and 4D Gaussian approaches, highlighting the importance of explicitly modeling human-object interactions for high-fidelity reconstruction.
[185] 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
Haoyu Li,Tingyan Wen,Lin Qi,Zhe Wu,Yihuang Chen,Xing Zhou,Lifei Zhu,Xueqian Wang,Kai Zhang
Main category: cs.CV
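TL;DR: This paper proposes 1.x-Distill, the first fractional-step distillation framework to break the integer-step constraint. It enables efficient text-to-image generation in 1.x steps, greatly increasing speed while preserving quality and diversity.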
Details
Motivation: Diffusion models generate high-quality images, but iterative denoising is computationally expensive. Existing few-step distillation methods such as DMD suffer from diversity collapse and fidelity degradation at two steps or fewer.
Method: 1.x-Distill (1) analyzes the role of teacher-side CFG and modifies it to suppress mode collapse, (2) introduces Stagewise Focused Distillation, which matches distributions for coarse structure and then uses adversarial distillation for details, and (3) adds a lightweight compensation module that supports Distill-Cache co-training and integrates block-level caching.
Result: On SD3-Medium and SD3.5-Large, the method achieves better quality and diversity than prior few-step methods at 1.67 and 1.74 effective NFEs, respectively, with up to 33× speedup over the original 28×2 NFE sampling.
Conclusion: 1.x-Distill makes 1.x-step generation a practical regime for distilled diffusion models, balancing efficiency, quality, and diversity, and provides a systematic approach to generation with extremely few steps.
Abstract: Diffusion models produce high-quality text-to-image results, but their iterative denoising is computationally expensive. Distribution Matching Distillation (DMD) emerges as a promising path to few-step distillation, but suffers from diversity collapse and fidelity degradation when reduced to two steps or fewer. We present 1.x-Distill, the first fractional-step distillation framework that breaks the integer-step constraint of prior few-step methods and establishes 1.x-step generation as a practical regime for distilled diffusion models. Specifically, we first analyze the overlooked role of teacher CFG in DMD and introduce a simple yet effective modification to suppress mode collapse. Then, to improve performance under extreme steps, we introduce Stagewise Focused Distillation, a two-stage strategy that learns coarse structure through diversity-preserving distribution matching and refines details with inference-consistent adversarial distillation. Furthermore, we design a lightweight compensation module for Distill-Cache co-Training, which naturally incorporates block-level caching into our distillation pipeline. Experiments on SD3-Medium and SD3.5-Large show that 1.x-Distill surpasses prior few-step methods, achieving better quality and diversity at 1.67 and 1.74 effective NFEs, respectively, with up to 33× speedup over original 28×2 NFE sampling.
[186] ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity
Hang Wang,Chao Shen,Lei Zhang,Zhi-Qi Cheng
Main category: cs.CV
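TL;DR: This paper proposes ATSS, a method that detects the anomalous temporal self-similarity (ATSS) fingerprint in AI-generated videos using multimodal similarity matrices and a cross-attention fusion mechanism, significantly improving detection of AI-generated videos.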
Details
Motivation: Existing detectors for AI-generated video focus mainly on local artifacts or short-term temporal inconsistencies and struggle to capture the generative logic behind global temporal evolution, which limits detection performance.
Method: ATSS builds visual, textual, and cross-modal similarity matrices to quantify temporal anomalies, then uses dedicated Transformer encoders and a bidirectional cross-attention fusion module to model intra- and inter-modal dynamics.
Result: On four benchmarks (GenVideo, EvalCrafter, VideoPhy, and VidProM), ATSS significantly outperforms state-of-the-art methods in AP, AUC, and ACC, and generalizes well across diverse video generation models.
Conclusion: By exploiting the anomalous temporal self-similarity that anchor-driven generation induces in AI-generated videos, ATSS provides a more robust and more generalizable detection framework for digital forensics.
Abstract: AI-generated videos (AIGVs) have achieved unprecedented photorealism, posing severe threats to digital forensics. Existing AIGV detectors focus mainly on localized artifacts or short-term temporal inconsistencies, and thus often fail to capture the underlying generative logic governing global temporal evolution, limiting AIGV detection performance. In this paper, we identify a distinctive fingerprint in AIGVs, termed anomalous temporal self-similarity (ATSS). Unlike real videos that exhibit stochastic natural dynamics, AIGVs follow deterministic anchor-driven trajectories (e.g., text or image prompts), inducing unnaturally repetitive correlations across visual and semantic domains. To exploit this, we propose the ATSS method, a multimodal detection framework built on a triple-similarity representation and a cross-attentive fusion mechanism. Specifically, ATSS reconstructs semantic trajectories by leveraging frame-wise descriptions to construct visual, textual, and cross-modal similarity matrices, which jointly quantify the inherent temporal anomalies. These matrices are encoded by dedicated Transformer encoders and integrated via a bidirectional cross-attentive fusion module to effectively model intra- and inter-modal dynamics. Extensive experiments on four large-scale benchmarks, including GenVideo, EvalCrafter, VideoPhy, and VidProM, demonstrate that ATSS significantly outperforms state-of-the-art methods in terms of AP, AUC, and ACC metrics, exhibiting superior generalization across diverse video generation models.
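The triple-similarity representation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: it assumes each frame already has a visual embedding and a text embedding of its description in a shared space, and it uses cosine similarity, which the abstract does not specify. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def similarity_matrix(rows, cols):
    """Pairwise cosine similarities; entry [i][j] compares rows[i] with cols[j]."""
    return [[cosine(r, c) for c in cols] for r in rows]


def triple_similarity(visual_feats, text_feats):
    """Build the visual, textual, and cross-modal frame-by-frame matrices."""
    return {
        "visual": similarity_matrix(visual_feats, visual_feats),
        "textual": similarity_matrix(text_feats, text_feats),
        "cross": similarity_matrix(visual_feats, text_feats),
    }


def mean_off_diagonal(matrix):
    """Average similarity between distinct frames (a crude repetitiveness score)."""
    n = len(matrix)
    total = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


# Toy 3-frame clip: frames 0 and 2 look identical, frame 1 differs.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
text = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
mats = triple_similarity(visual, text)
repetitiveness = mean_off_diagonal(mats["visual"])
```

In the method as described, these matrices are fed to Transformer encoders rather than reduced to a single score. The `mean_off_diagonal` helper is included only to make the notion of "unnaturally repetitive correlations" concrete.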
Code and models of ATSS will be released at https://github.com/hwang-cs-ime/ATSS.
[187] TORA: Topological Representation Alignment for 3D Shape Assembly
Nahyuk Lee,Zhiang Chen,Marc Pollefeys,Sunghwan Hong
Main category: cs.CV
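TL;DR: This paper proposes TORA, a framework that distills the topological relational structure of a pretrained 3D encoder into a flow-matching backbone to improve 3D shape assembly. It adds no inference overhead, substantially speeds up convergence, and improves generalization.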
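The entry's abstract names Centered Kernel Alignment (CKA) as the alignment objective. A minimal pure-Python sketch of the standard linear CKA follows. Treating `1 - CKA` as the training loss is an assumption made here for illustration, and the function names are hypothetical. Because CKA compares token-by-token similarity structure, the student and teacher features may have different widths.

```python
import math


def centered_gram(feats):
    """Double-centered linear Gram matrix of n row vectors (n x n)."""
    n = len(feats)
    gram = [[sum(a * b for a, b in zip(feats[i], feats[j])) for j in range(n)]
            for i in range(n)]
    row_mean = [sum(row) / n for row in gram]
    total_mean = sum(row_mean) / n
    # The Gram matrix is symmetric, so column means equal row means.
    return [[gram[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def frob_inner(a, b):
    """Frobenius inner product of two equally shaped matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(student, teacher):
    """Linear CKA between two sets of n token features (widths may differ)."""
    ks, kt = centered_gram(student), centered_gram(teacher)
    denom = math.sqrt(frob_inner(ks, ks) * frob_inner(kt, kt))
    return frob_inner(ks, kt) / denom if denom > 0 else 0.0


def cka_loss(student, teacher):
    """Assumed loss form: zero when the similarity structures match exactly."""
    return 1.0 - linear_cka(student, teacher)


student = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.5, 3.0]]
rotated = [[-y, x] for x, y in student]        # 90-degree rotation
scaled = [[3.0 * x, 3.0 * y] for x, y in student]
```

Linear CKA is invariant to rotation and isotropic scaling of either representation, so `rotated` and `scaled` both score 1.0 against `student`. That invariance is what lets it compare relational structure rather than raw coordinates.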
Details
Motivation: Existing flow-matching methods for 3D shape assembly do not model cross-part interactions explicitly, so nothing guides which interactions should drive part motion.
Method: TORA is a topology-first representation alignment framework. It first uses token-wise cosine matching to inject the teacher's geometric descriptors, then adds a Centered Kernel Alignment (CKA) loss that aligns the similarity structure of student and teacher representations. A systematic study of different 3D encoders finds that geometric and contact properties matter more than semantic classification ability, and that alignment is most effective at deeper Transformer layers.
Result: TORA reaches state-of-the-art results on five benchmarks covering geometric, semantic, and inter-object assembly. Convergence is up to 6.9× faster, in-distribution accuracy improves, robustness under domain shift increases, and zero-shot transfer to unseen real-world and synthetic datasets shows marked gains.
Conclusion: Topology-aware representation alignment improves the efficiency, accuracy, and generalization of 3D shape assembly without adding inference cost, supporting the use of structural priors from frozen pretrained encoders for knowledge distillation.
Abstract: Flow-matching methods for 3D shape assembly learn point-wise velocity fields that transport parts toward assembled configurations, yet they receive no explicit guidance about which cross-part interactions should drive the motion. We introduce TORA, a topology-first representation alignment framework that distills relational structure from a frozen pretrained 3D encoder into the flow-matching backbone during training. We first realize this via a simple instantiation, token-wise cosine matching, which injects the learned geometric descriptors from the teacher representation. We then extend this with a Centered Kernel Alignment (CKA) loss that matches the similarity structure between student and teacher representations for enhanced topological alignment. Through systematic probing of diverse 3D encoders, we show that geometry- and contact-centric teacher properties, not semantic classification ability, govern alignment effectiveness, and that alignment is most beneficial at later transformer layers where spatial structure naturally emerges. TORA introduces zero inference overhead while yielding two consistent benefits: faster convergence (up to 6.9$\times$) and improved accuracy in-distribution, along with greater robustness under domain shift. Experiments on five benchmarks spanning geometric, semantic, and inter-object assembly demonstrate state-of-the-art performance, with particularly pronounced gains in zero-shot transfer to unseen real-world and synthetic datasets. Project page: https://nahyuklee.github.io/tora.
[188] DINO-VO: Learning Where to Focus for Enhanced State Estimation
Qi Chen,Guanghao Li,Sijia Hu,Xin Gao,Junpeng Ma,Xiangyang Xue,Jian Pu
Main category: cs.CV
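TL;DR: DINO-VO is an end-to-end monocular visual odometry system that combines a differentiable adaptive patch selector with multi-task feature extraction and a differentiable bundle adjustment module, improving cross-scene generalization and tracking accuracy.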
Details
Motivation: Existing visual odometry systems rely on heuristic feature extraction strategies, which can reduce accuracy and robustness in large-scale outdoor environments.
Method: DINO-VO introduces a differentiable adaptive patch selector and integrates a multi-task feature extraction module with a differentiable bundle adjustment module based on inverse depth priors, so that feature learning and state estimation are optimized jointly.
Result: Experiments on TartanAir, KITTI, EuRoC, and TUM show strong generalization across synthetic, indoor, and outdoor environments and state-of-the-art tracking accuracy.
Conclusion: The end-to-end differentiable design improves the generalization and accuracy of monocular VO and bridges the gap between feature learning and geometric estimation.
Abstract: We present DINO Patch Visual Odometry (DINO-VO), an end-to-end monocular visual odometry system with strong scene generalization. Current Visual Odometry (VO) systems often rely on heuristic feature extraction strategies, which can degrade accuracy and robustness, particularly in large-scale outdoor environments. DINO-VO addresses these limitations by incorporating a differentiable adaptive patch selector into the end-to-end pipeline, improving the quality of extracted patches and enhancing generalization across diverse datasets. Additionally, our system integrates a multi-task feature extraction module with a differentiable bundle adjustment (BA) module that leverages inverse depth priors, enabling the system to learn and utilize appearance and geometric information effectively. This integration bridges the gap between feature learning and state estimation. Extensive experiments on the TartanAir, KITTI, EuRoC, and TUM datasets demonstrate that DINO-VO exhibits strong generalization across synthetic, indoor, and outdoor environments, achieving state-of-the-art tracking accuracy.
[189] 4C4D: 4 Camera 4D Gaussian Splatting
Junsheng Zhou,Zhifan Yang,Liang Han,Wenyuan Zhang,Kanle Shi,Shenkun Xu,Yu-Shen Liu
Main category: cs.CV
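TL;DR: This paper proposes 4C4D, a framework for high-fidelity 4D dynamic scene reconstruction from videos captured by only four portable cameras. It introduces a neural decaying function that strengthens the geometric modeling of 4D Gaussian Splatting, markedly improving temporally consistent novel-view rendering under sparse views.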
Details
Motivation: Existing methods usually depend on dense multi-view capture with dozens or even hundreds of cameras, which is incompatible with portable, low-cost use. Under sparse views, learning geometry is harder than modeling appearance and needs targeted improvement.
Method: The core of 4C4D is a Neural Decaying Function on the opacities of 4D Gaussians. It steers gradients toward geometric learning and so reduces the imbalance between geometry and appearance modeling in 4D Gaussian Splatting.
Result: Experiments on several sparse-view datasets with varying camera overlap show that 4C4D clearly outperforms existing methods.
Conclusion: 4C4D addresses 4D dynamic scene reconstruction under extremely sparse camera setups, improving geometric modeling and temporally consistent rendering, and advances lightweight 4D content capture and generation.
Abstract: This paper tackles the challenge of recovering 4D dynamic scenes from videos captured by as few as four portable cameras. Learning to model scene dynamics for temporally consistent novel-view rendering is a foundational task in computer graphics, where previous works often require dense multi-view captures using camera arrays of dozens or even hundreds of views. We propose 4C4D, a novel framework that enables high-fidelity 4D Gaussian Splatting from video captures of extremely sparse cameras. Our key insight is that geometric learning under sparse settings is substantially more difficult than modeling appearance. Driven by this observation, we introduce a Neural Decaying Function on Gaussian opacities for enhancing the geometric modeling capability of 4D Gaussians. This design mitigates the inherent imbalance between geometry and appearance modeling in 4DGS by encouraging the 4DGS gradients to focus more on geometric learning. Extensive experiments across sparse-view datasets with varying camera overlaps show that 4C4D achieves superior performance over prior art. Project page at: https://junshengzhou.github.io/4C4D.
[190] Detecting Media Clones in Cultural Repositories Using a Positive Unlabeled Learning Approach
V. Sevetlidis,V. Arampatzakis,M. Karta,I. Mourthos,D. Tsiafaki,G. Pavlidis
Main category: cs.CV
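TL;DR: This paper proposes a curator-in-the-loop duplicate discovery method for the AtticPOT repository. It frames the problem as Positive-Unlabeled (PU) learning, trains a lightweight clone encoder from a single anchor, and thresholds the $\ell_2$ norm in latent space to recommend candidates in an interpretable way without explicit negatives.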
Details
Motivation: Cross-record duplicates in the AtticPOT artefact database are hard to verify in advance. The work supports active de-duplication and record linkage with a curator in the loop.
Method: Duplicate discovery is framed as Positive-Unlabeled (PU) learning. Starting from a single anchor per artefact, a lightweight per-query Clone Encoder is trained on augmented views of that anchor. An interpretable threshold on the latent $\ell_2$ norm scores the unlabeled repository, and candidate duplicates are proposed for curator verification.
Result: The method obtains F1=96.37 (AUROC=97.97) on CIFAR-10 and F1=90.79 (AUROC=98.99) on AtticPOT, improving F1 by +7.70 points over the best baseline (SVDD). Qualitative "find-similar" results show stable neighbourhoods across changes in viewpoint and condition.
Conclusion: The method needs no explicit negatives, has a transparent decision mechanism and a lightweight architecture, and suits de-duplication, record linkage, and curator-in-the-loop workflows.
Abstract: We formulate curator-in-the-loop duplicate discovery in the AtticPOT repository as a Positive-Unlabeled (PU) learning problem. Given a single anchor per artefact, we train a lightweight per-query Clone Encoder on augmented views of the anchor and score the unlabeled repository with an interpretable threshold on the latent $\ell_2$ norm. The system proposes candidates for curator verification, uncovering cross-record duplicates that were not verified a priori. On CIFAR-10 we obtain F1=96.37 (AUROC=97.97); on AtticPOT we reach F1=90.79 (AUROC=98.99), improving F1 by +7.70 points over the best baseline (SVDD) under the same lightweight backbone. Qualitative "find-similar" panels show stable neighbourhoods across viewpoint and condition. The method avoids explicit negatives, offers a transparent operating point, and fits de-duplication, record linkage, and curator-in-the-loop workflows.
[191] Intelligent Traffic Monitoring with YOLOv11: A Case Study in Real-Time Vehicle Detection
Shkelqim Sherifi
Main category: cs.CV
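TL;DR: This paper presents an offline, real-time traffic monitoring system that pairs a pre-trained YOLOv11 detector with BoT-SORT/ByteTrack multi-object tracking. It is implemented in PyTorch/OpenCV, wrapped in a Qt desktop UI, requires no cloud services, and achieves accurate vehicle detection and counting.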
Details
Motivation: The goal is to make traffic monitoring systems more intelligent and practical in support of smart cities while avoiding dependence on cloud computing.
Method: A pre-trained YOLOv11 model detects vehicles, BoT-SORT or ByteTrack performs multi-object tracking, the CNN pipeline is built on PyTorch and OpenCV, and a desktop user interface is developed with Qt.
Result: Vehicle counting accuracy ranges from 66.67% to 95.83% across scenes. Class-wise precision is high (cars: 0.97-1.00; trucks: 1.00) and recall is strong (cars: 0.82-1.00; trucks: 0.70-1.00), giving F1 scores of 0.90-1.00 for cars and 0.82-1.00 for trucks. Performance drops somewhat in adverse weather but remains robust in typical conditions.
Conclusion: This lightweight, local AI traffic monitoring system demonstrates the feasibility of deployment in smart cities and offers a practical reference for future edge-based intelligent traffic systems.
Abstract: Recent advancements in computer vision, driven by artificial intelligence, have significantly enhanced monitoring systems. One notable application is traffic monitoring, which leverages computer vision alongside deep learning-based object detection and counting. We present an offline, real-time traffic monitoring system that couples a pre-trained YOLOv11 detector with BoT-SORT/ByteTrack for multi-object tracking, implemented in PyTorch/OpenCV and wrapped in a Qt-based desktop UI. The CNN pipeline enables efficient vehicle detection and counting from video streams without cloud dependencies. Across diverse scenes, the system achieves 66.67-95.83% counting accuracy. Class-wise detection yields high precision (cars: 0.97-1.00; trucks: 1.00) with strong recall (cars: 0.82-1.00; trucks: 0.70-1.00), resulting in F1 scores of 0.90-1.00 for cars and 0.82-1.00 for trucks. While adverse weather conditions may negatively impact this performance, results remain robust in typical conditions. By integrating lightweight models with an accessible, cloud-independent interface, this paper contributes to the modernization and development of future smart cities by showing the capacity of AI-driven traffic monitoring systems.
[192] LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection
Dat Nguyen,Enjie Ghorbel,Anis Kacem,Marcella Astrid,Djamila Aouada
Main category: cs.CV
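TL;DR: This paper proposes Localized Artifact Attention X (LAA-X), a deepfake detection framework that uses explicit multi-task learning and blending-based data synthesis to focus on localized forgery artifacts, showing strong robustness and generalization on both high-quality forgeries and unseen manipulation types.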
Details
Motivation: Existing methods mostly rely on binary classifiers with implicit attention and generalize poorly to unknown manipulations. A framework is needed that explicitly localizes forgery artifacts and generalizes across manipulation types.
Method: LAA-X comprises an explicit localized artifact attention mechanism, multi-task learning that guides the model toward artifact-prone local regions, and a blending-based data synthesis strategy. It supports both a CNN backbone (LAA-Net) and a Transformer backbone (LAA-Former).
Result: Trained only on real images and pseudo-fake samples, LAA-X reaches or approaches state-of-the-art performance on multiple benchmarks and generalizes well.
Conclusion: By explicitly modeling localized forgery artifacts with multi-task learning, LAA-X substantially improves the robustness and generalization of deepfake detection and offers a new approach to general forgery detection.
Abstract: In this paper, we propose Localized Artifact Attention X (LAA-X), a novel deepfake detection framework that is both robust to high-quality forgeries and capable of generalizing to unseen manipulations. Existing approaches typically rely on binary classifiers coupled with implicit attention mechanisms, which often fail to generalize beyond known manipulations. In contrast, LAA-X introduces an explicit attention strategy based on a multi-task learning framework combined with blending-based data synthesis. Auxiliary tasks are designed to guide the model toward localized, artifact-prone (i.e., vulnerable) regions. The proposed framework is compatible with both CNN and transformer backbones, resulting in two different versions, namely, LAA-Net and LAA-Former, respectively. Despite being trained only on real and pseudo-fake samples, LAA-X competes with state-of-the-art methods across multiple benchmarks. Code and pre-trained weights for LAA-Net (https://github.com/10Ring/LAA-Net) and LAA-Former (https://github.com/10Ring/LAA-Former) are publicly available.
[193] A Physics-Informed, Behavior-Aware Digital Twin for Robust Multimodal Forecasting of Core Body Temperature in Precision Livestock Farming
Riasad Alvi,Mohaimenul Azam Khan Raiaan,Sadia Sultana Chowa,Arefin Ittesafun Abian,Reem E Mohamed,Md Rafiqul Islam,Yakub Sebastian,Sheikh Izzal Azid,Sami Azam
Main category: cs.CV
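TL;DR: This paper proposes a digital twin framework that fuses a physics-based model with data-driven methods for multimodal, uncertainty-aware forecasting of core body temperature (CBT) in dairy cattle, enabling early heat-stress warning and precision livestock management.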
Details
Motivation: Precision livestock farming needs accurate, timely heat-stress prediction to protect animal welfare and optimize farm management, but existing methods fall short in physical interpretability, modeling of individual differences, and uncertainty quantification.
Method: A physics-informed digital twin (DT) integrates an ODE thermoregulation model, a Gaussian process for cow-specific deviations, a Kalman filter for real-time calibration, and a behavioral Markov chain. DT outputs are fused with raw sensor data through multi-scale temporal analysis and cross-modal feature engineering, then passed to a three-stage expert-weighted stacked ensemble (LightGBM experts, an Optuna-tuned meta-model, and bootstrap uncertainty estimation).
Result: For 2-hour-ahead forecasting, the framework reaches a cross-validated $R^2$ of 0.783, an F1 score of 84.25%, and a Prediction Interval Coverage Probability (PICP) of 92.38%. Ablations show that DT features and multimodal fusion substantially improve performance.
Conclusion: The framework combines physical interpretability, individual adaptation, and uncertainty awareness, providing robust and trustworthy support for early heat-stress detection and precision livestock management.
Abstract: Precision livestock farming requires accurate and timely heat stress prediction to ensure animal welfare and optimize farm management. This study presents a physics-informed digital twin (DT) framework combined with an uncertainty-aware, expert-weighted stacked ensemble for multimodal forecasting of Core Body Temperature (CBT) in dairy cattle. Using the high-frequency, heterogeneous MmCows dataset, the DT integrates an ordinary differential equation (ODE)-based thermoregulation model that simulates metabolic heat production and dissipation, a Gaussian process for capturing cow-specific deviations, a Kalman filter for aligning predictions with real-time sensor data, and a behavioral Markov chain that models activity-state transitions under varying environmental conditions. The DT outputs, key physiological indicators such as predicted CBT, heat stress probability, and behavioral state distributions, are fused with raw sensor data and enriched through multi-scale temporal analysis and cross-modal feature engineering to form a comprehensive feature set. The predictive methodology is designed as a three-stage stacked ensemble, where stage 1 trains modality-specific LightGBM 'expert' models on distinct feature groups, stage 2 collects their predictions as meta-features, and stage 3 uses an Optuna-tuned LightGBM meta-model to yield the final CBT forecast. Predictive uncertainty is quantified via bootstrapping and validated using Prediction Interval Coverage Probability (PICP).
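Prediction Interval Coverage Probability, the validation metric named above, is simply the fraction of observations that fall inside their predicted intervals. A minimal Python sketch follows. The function name `picp` and the sample temperatures are illustrative assumptions, not values from the paper.

```python
def picp(y_true, lower, upper):
    """Fraction of true values lying inside [lower_i, upper_i] (inclusive)."""
    if not y_true:
        raise ValueError("need at least one observation")
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)


# Made-up core body temperatures (degrees Celsius) and prediction intervals.
observed = [38.5, 39.0, 40.2, 38.9]
lower = [38.0, 38.8, 39.0, 39.0]
upper = [39.0, 39.5, 40.0, 39.5]
coverage = picp(observed, lower, upper)  # first two inside, last two outside
```

A well-calibrated interval at a nominal 90% level should give a PICP near 0.90. A much lower value means the intervals are too narrow, and a much higher one means they are wider than necessary.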
Ablation analysis confirms that incorporating DT-derived features and multimodal fusion substantially enhances performance. The proposed framework achieves a cross-validated $R^2$ of 0.783, F1 score of 84.25% and PICP of 92.38% for 2-hour ahead forecasting, providing a robust, uncertainty-aware, and physically principled system for early heat stress detection and precision livestock management.
[194] Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation
Peixin Chen,Guoxi Zhang,Jianwei Ma,Qing Li
Main category: cs.CV
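TL;DR: This paper proposes Hypothesis Graph Refinement (HGR), a framework that uses revisable hypothesis nodes and verification-driven cascade correction to achieve semantically guided, efficient exploration with error retraction in embodied navigation, significantly improving long-horizon memory reliability and task performance.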
Details
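Motivation: Existing graph-based navigation systems treat unexplored regions as semantically unknown, which makes frontier search inefficient. Semantic predictions from VLMs are error-prone, and their errors accumulate and propagate through graph memory in ways that confidence attenuation alone cannot fix. A framework is needed that uses semantic predictions to guide exploration and also systematically retracts wrong hypotheses.
Method: HGR builds a dependency-aware graph memory with two core modules. (1) The semantic hypothesis module estimates context-conditioned semantic distributions over frontiers and ranks exploration targets by goal relevance, travel cost, and uncertainty. (2) Verification-driven cascade correction compares on-site observations with predicted semantics and, on a mismatch, retracts the refuted node together with all of its downstream dependents, so the graph can contract.
Result: HGR achieves a 72.41% success rate and 56.22% SPL on GOAT-Bench, with consistent gains on A-EQA and EM-EQA. Diagnostic analysis shows that cascade correction removes about 20% of redundant hypothesis nodes and reduces revisits to erroneous regions by 4.5x, and that specular and transparent surfaces account for 67% of corrected errors.
Conclusion: By introducing retractable hypotheses and dependency-aware cascade correction, HGR mitigates the structural accumulation of semantic prediction errors in long-horizon embodied navigation and improves both graph memory reliability and navigation efficiency.
Abstract: Embodied agents must explore partially observed environments while maintaining reliable long-horizon memory. Existing graph-based navigation systems improve scalability, but they often treat unexplored regions as semantically unknown, leading to inefficient frontier search. Although vision-language models (VLMs) can predict frontier semantics, erroneous predictions may be embedded into memory and propagate through downstream inferences, causing structural error accumulation that confidence attenuation alone cannot resolve. These observations call for a framework that can leverage semantic predictions for directed exploration while systematically retracting errors once new evidence contradicts them. We propose Hypothesis Graph Refinement (HGR), a framework that represents frontier predictions as revisable hypothesis nodes in a dependency-aware graph memory. HGR introduces (1) a semantic hypothesis module, which estimates context-conditioned semantic distributions over frontiers and ranks exploration targets by goal relevance, travel cost, and uncertainty, and (2) verification-driven cascade correction, which compares on-site observations against predicted semantics and, upon mismatch, retracts the refuted node together with all its downstream dependents. Unlike additive map-building, this allows the graph to contract by pruning erroneous subgraphs, keeping memory reliable throughout long episodes. We evaluate HGR on multimodal lifelong navigation (GOAT-Bench) and embodied question answering (A-EQA, EM-EQA).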
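The cascade correction described above, retracting a refuted node together with all of its downstream dependents, amounts to a reachability traversal over the dependency graph. The following Python sketch illustrates that idea under simple assumptions: the graph is a plain dict from node id to the ids derived from it, and the node names and functions `retract_cascade` and `prune` are hypothetical, not the authors' data structures.

```python
from collections import deque


def retract_cascade(dependents, refuted):
    """Return the refuted node plus every node that transitively depends on it.

    dependents maps a node id to the ids of hypotheses derived from it.
    """
    removed = {refuted}
    queue = deque([refuted])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, ()):
            if child not in removed:
                removed.add(child)
                queue.append(child)
    return removed


def prune(dependents, removed):
    """Build the contracted graph without the removed nodes or edges to them."""
    return {
        node: [c for c in children if c not in removed]
        for node, children in dependents.items()
        if node not in removed
    }


# Hypothetical memory: an observed hall with two predicted rooms behind it.
graph = {
    "hall": ["kitchen?", "bedroom?"],
    "kitchen?": ["fridge?"],
    "bedroom?": [],
    "fridge?": [],
}
removed = retract_cascade(graph, "kitchen?")  # on-site view contradicts "kitchen?"
graph_after = prune(graph, removed)
```

Breadth-first traversal with a visited set keeps the retraction linear in the size of the affected subgraph and stays safe even if the dependency structure happens to contain shared descendants.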
HGR achieves 72.41% success rate and 56.22% SPL on GOAT-Bench, and shows consistent improvements on both QA benchmarks. Diagnostic analysis reveals that cascade correction eliminates approximately 20% of structurally redundant hypothesis nodes and reduces revisits to erroneous regions by 4.5x, with specular and transparent surfaces accounting for 67% of corrected prediction errors.
[195] SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
Fenghao Song,Shaojing Yang,Xi Zhou
Main category: cs.CV
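TL;DR: This paper proposes SARES-DEIM, a DETR-based framework for ship detection in SAR imagery. It includes SARESMoE, which suppresses speckle noise and clutter, and SDEP, which improves small-target localization, and reaches 76.4% mAP50:95 on the HRSID dataset.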
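SDEP is expanded in the entry's abstract as the Space-to-Depth Enhancement Pyramid. The generic space-to-depth rearrangement that the name refers to can be sketched in plain Python as below. This shows only the standard operation on a single-channel grid, not the paper's neck design, and the function name and channel ordering are assumptions for illustration.

```python
def space_to_depth(image, block=2):
    """Rearrange an H x W grid into block*block channels of size H/block x W/block.

    Every pixel is kept: spatial resolution is traded for channel depth, so
    small structures are not averaged away as they would be by pooling.
    """
    h, w = len(image), len(image[0])
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    channels = []
    for dy in range(block):          # row offset inside each block
        for dx in range(block):      # column offset inside each block
            channels.append([[image[y][x] for x in range(dx, w, block)]
                             for y in range(dy, h, block)])
    return channels


# 4 x 4 toy image with pixel values 0..15 in row-major order.
image = [[4 * r + c for c in range(4)] for r in range(4)]
channels = space_to_depth(image)
```

Here `channels[0]` holds the top-left pixel of every 2 x 2 block and `channels[3]` the bottom-right one, so the four output channels together contain all sixteen input values.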
Details
Motivation: Ship detection in SAR imagery faces coherent speckle noise, complex coastal clutter, and many small targets. Detectors designed for optical imagery are not robust to these conditions and tend to lose fine-grained ship features.
Method: SARES-DEIM has two components. (1) The SARESMoE module uses a sparse gating mechanism to route features selectively to frequency-domain and wavelet experts, suppressing noise and clutter. (2) The SDEP neck preserves high-resolution spatial cues from shallow stages to improve small-target localization.
Result: On HRSID, the model reaches 76.4% mAP50:95 and 93.8% mAP50, outperforming YOLO-series and specialized SAR detectors.
Conclusion: Through domain-aware design, SARES-DEIM handles SAR-specific degradation while balancing detection accuracy and computational efficiency, supporting the combination of MoE with multi-scale enhancement.
Abstract: Ship detection in Synthetic Aperture Radar (SAR) imagery is fundamentally challenged by inherent coherent speckle noise, complex coastal clutter, and the prevalence of small-scale targets. Conventional detectors, primarily designed for optical imagery, often exhibit limited robustness against SAR-specific degradation and suffer from the loss of fine-grained ship signatures during spatial downsampling. To address these limitations, we propose SARES-DEIM, a domain-aware detection framework grounded in the DEtection TRansformer (DETR) paradigm. Central to our approach is SARESMoE (SAR-aware Expert Selection Mixture-of-Experts), a module leveraging a sparse gating mechanism to selectively route features toward specialized frequency and wavelet experts. This sparsely-activated architecture effectively filters speckle noise and semantic clutter while maintaining high computational efficiency. Furthermore, we introduce the Space-to-Depth Enhancement Pyramid (SDEP) neck to preserve high-resolution spatial cues from shallow stages, significantly improving the localization of small targets. Extensive experiments on two benchmark datasets demonstrate the superiority of SARES-DEIM. Notably, on the challenging HRSID dataset, our model achieves a mAP50:95 of 76.4% and a mAP50 of 93.8%, outperforming state-of-the-art YOLO-series and specialized SAR detectors.
[196] Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks
Rubén Moreno-Aguado,Alba Magallón,Victor Moreno,Yingying Fang,Guang Yang
Main category: cs.CV
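TL;DR: This paper presents VoxelFM, a 3D CT foundation model trained with the DINO self-distillation framework and no language supervision, focused on learning robust visual representations. Across seven categories of clinical downstream tasks, a frozen backbone with lightweight probes matches or surpasses existing CT foundation models, and even outperforms explicitly language-aligned models on report generation.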
Details
Motivation: Existing CT foundation models depend on large-scale paired image-text data and compute-intensive backbone fine-tuning, but such data are scarce and fine-tuning is costly. Robust visual representations are needed that transfer efficiently with very little labelled data and without backbone fine-tuning.
Method: VoxelFM is pre-trained on 3D CT with DINO self-distillation and no language supervision. For downstream tasks the backbone is frozen and only lightweight probes (for example, linear classifiers) are trained, covering classification, regression, survival analysis, retrieval, localisation, segmentation, and report generation.
Result: VoxelFM matches or outperforms four existing CT foundation models in all seven task categories. Even without language supervision, it beats explicitly language-aligned models on report generation, indicating that a frozen backbone with lightweight probes outperforms the conventional vision-language model approach.
Conclusion: The core value of a CT foundation model lies in high-quality, transferable visual features rather than in serving as the encoder of a vision-language model. VoxelFM shows that purely visual self-supervised learning can support a wide range of clinical tasks and gives resource-constrained researchers an efficient, practical option.
Abstract: There is substantial interest in developing artificial intelligence systems to support radiologists across tasks ranging from segmentation to report generation. Existing computed tomography (CT) foundation models have largely focused on building generalist vision-language systems capable of tasks such as question answering and report generation. However, training reliable vision-language systems requires paired image-text data at a scale that remains unavailable in CT. Moreover, adapting the underlying visual representations to downstream tasks typically requires partial or full backbone fine-tuning, a computationally demanding process inaccessible to many research groups. Instead, foundation models should prioritise learning robust visual representations that enable efficient transfer to new tasks with minimal labelled data and without backbone fine-tuning. We present VoxelFM, a 3D CT foundation model trained with self-distillation using the DINO framework, which learns semantically rich features without language supervision. We evaluated VoxelFM across seven categories of clinically relevant downstream tasks using frozen backbone representations with lightweight probes: classification, regression, survival analysis, instance retrieval, localisation, segmentation, and report generation. VoxelFM matched or outperformed four existing CT foundation models across all task categories.
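The "frozen backbone with lightweight probes" protocol described above can be pictured with a small Python sketch: the feature vectors are treated as fixed inputs and only a linear classifier on top of them is trained. The features below are made-up stand-ins for backbone outputs, and `train_linear_probe` and `predict` are hypothetical names. The paper's actual probes and optimizers are not specified in this entry.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression on fixed features with batch gradient descent.

    Only the probe weights w and bias b are updated; the features stand in
    for the outputs of a frozen backbone and are never modified.
    """
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y   # sigmoid(z) - label
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Binary decision of the trained probe for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Made-up 2-D "frozen features" for four scans with binary labels.
features = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.0]]
labels = [0, 0, 1, 1]
w, b = train_linear_probe(features, labels)
predictions = [predict(w, b, x) for x in features]
```

Because only `dim + 1` parameters are trained, a probe like this runs on a CPU in moments, which is the practical appeal of the frozen-backbone setup for groups without large compute budgets.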
Despite receiving no language supervision during pre-training, VoxelFM surpassed models explicitly trained with language-alignment objectives, including on report generation. Our results indicate that current CT foundation models perform significantly better as feature extractors for lightweight probes rather than as vision encoders for vision-language models. Model weights and training code are publicly available.
[197] NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results
Shuhong Liu,Chenyu Bao,Ziteng Cui,Xuangeng Chu,Bin Ren,Lin Gu,Xiang Chen,Mingrui Li,Long Ma,Marcos V. Conde,Radu Timofte,Yun Liu,Ryo Umagami,Tomohiro Hashimoto,Zijian Hu,Yuan Gan,Tianhan Xu,Yusuke Kurose,Tatsuya Harada,Junwei Yuan,Gengjia Chang,Xining Ge,Mache You,Qida Cao,Zeliang Li,Xinyuan Hu,Hongde Gu,Changyue Shi,Jiajun Ding,Zhou Yu,Jun Yu,Seungsang Oh,Fei Wang,Donggun Kim,Zhiliang Wu,Seho Ahn,Xinye Zheng,Kun Li,Yanyan Wei,Weisi Lin,Dizhe Zhang,Yuchao Chen,Meixi Song,Hanqing Wang,Haoran Feng,Lu Qi,Jiaao Shan,Yang Gu,Jiacheng Liu,Shiyu Liu,Kui Jiang,Junjun Jiang,Runyu Zhu,Sixun Dong,Qingxia Ye,Zhiqiang Zhang,Zhihua Xu,Zhiwei Wang,Phan The Son,Zhimiao Shi,Zixuan Guo,Xueming Fu,Lixia Han,Changhe Liu,Zhenyu Zhao,Manabu Tsukada,Zheng Zhang,Zihan Zhai,Tingting Li,Ziyang Zheng,Yuhao Liu,Dingju Wang,Jeongbin You,Younghyuk Kim,Il-Youp Kwak,Mingzhe Lyu,Junbo Yang,Wenhan Yang,Hongsen Zhang,Jinqiang Cui,Hong Zhang,Haojie Guo,Hantang Li,Qiang Zhu,Bowen He,Xiandong Meng,Debin Zhao,Xiaopeng Fan,Wei Zhou,Linzhe Jiang,Linfeng Li,Louzhe Xu,Qi Xu,Hang Song,Chenkun Guo,Weizhi Nie,Yufei Li,Xingan Zhan,Zhanqi Shi,Dufeng Zhang,Boyuan Tian,Jingshuo Zeng,Gang He,Yubao Fu,Weijie Wang,Cunchuan Huang
Main category: cs.CV
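TL;DR: This paper reviews the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, which targets robust 3D reconstruction in extreme low-light and smoke-degraded environments. Based on the real-world RealX3D benchmark, 33 teams submitted valid results, significantly advancing 3D reconstruction under adverse conditions.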
Details
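Motivation: 3D reconstruction is not yet robust under real-world adverse conditions such as extreme low light and smoke. The RealX3D benchmark and this challenge were set up to push the field forward.
Method: The NTIRE 2026 3DRR Challenge evaluates submitted methods on the real-scene RealX3D dataset and systematically analyzes and compares the approaches of the 33 teams with valid submissions, out of 279 registered participants.
Result: 33 teams submitted valid results. Evaluated against state-of-the-art baselines, their methods show significant progress in 3D reconstruction under adverse conditions, and the analysis identifies design principles and degradation-handling strategies shared by the top performers.
Conclusion: The challenge confirms the progress of current methods under real adverse conditions, clarifies key directions for improving the robustness of 3D reconstruction, and offers practical guidance for future research.
Abstract: This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify reconstruction pipelines that are robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.
[198] Rethinking Exposure Correction for Spatially Non-uniform Degradation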
Ao Li,Jiawei Sun,Le Dong,Zhenyu Wang,Weisheng Dong
Main category: cs.CV
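TL;DR: This paper proposes a new exposure correction paradigm for spatially non-uniform exposure errors. A Spatial Signal Encoder predicts adaptive modulation weights that guide multiple look-up tables, an HSL compensation module improves color fidelity, and an uncertainty-driven non-uniform loss dynamically focuses optimization on local restoration.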
Details
Motivation: Real-world exposure correction faces spatially non-uniform degradation, yet existing methods mostly assume globally uniform exposure and struggle with the diverse local exposure errors that coexist within a single image. Method: A Spatial Signal Encoder predicts spatially adaptive modulation weights that drive multiple look-up tables for image transformation, and an HSL compensation module improves color fidelity; an uncertainty-inspired non-uniform loss dynamically focuses optimization according to local restoration uncertainty. Result: In extensive experiments, the method outperforms state-of-the-art methods on both qualitative and quantitative measures. Conclusion: The proposed paradigm addresses the limited ability of conventional methods to model spatially non-uniform exposure errors and improves exposure correction in real-world scenes. Abstract: Real-world exposure correction is fundamentally challenged by spatially non-uniform degradations, where diverse exposure errors frequently coexist within a single image. However, existing exposure correction methods are still largely developed under a predominantly uniform assumption. Architecturally, they typically rely on globally aggregated modulation signals that capture only the overall exposure trend. From the optimization perspective, conventional reconstruction losses are usually derived under a shared global scale, thus overlooking the spatially varying correction demands across regions. To address these limitations, we propose a new exposure correction paradigm explicitly designed for spatial non-uniformity. Specifically, we introduce a Spatial Signal Encoder to predict spatially adaptive modulation weights, which are used to guide multiple look-up tables for image transformation, together with an HSL-based compensation module for improved color fidelity. Beyond the architectural design, we propose an uncertainty-inspired non-uniform loss that dynamically allocates the optimization focus based on local restoration uncertainties, better matching the heterogeneous nature of real-world exposure errors. Extensive experiments demonstrate that our method achieves superior qualitative and quantitative performance compared with state-of-the-art methods. Code is available at https://github.com/FALALAS/rethinkingEC.
[199] OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
Liyu Zhang,Kehan Li,Tingrui Han,Tao Zhao,Yuxuan Sheng,Shibo He,Chao Li
Main category: cs.CV
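TL;DR: This paper proposes OP-GRPO, the first off-policy GRPO framework for flow-matching models. Through trajectory reuse, sequence-level importance sampling correction, and truncation of late denoising steps, it substantially improves training efficiency (needing only 34.2% of the training steps on average) while maintaining generation quality.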
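To make the idea of a sequence-level importance sampling correction combined with GRPO-style clipping more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the ratio definition (the exponential of the summed per-step log-probability differences), the clipping range `eps`, and all function names are assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group (GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """One importance ratio per trajectory (assumed form, not from the paper):
    exp of the summed per-step log-probability differences."""
    return math.exp(sum(n - o for n, o in zip(new_logps, old_logps)))


def clipped_objective(ratio, advantage, eps=0.2):
    """PPO/GRPO-style clipped surrogate for a single trajectory."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

A replayed trajectory whose ratio drifts far from 1 contributes at most the clipped value, which is one simple way an off-policy sample can be kept from dominating the update.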
Details
Motivation: GRPO is highly effective for post-training flow-matching models, but its on-policy training paradigm makes it sample-inefficient. Method: OP-GRPO (1) actively selects high-quality trajectories and stores them in a replay buffer for reuse; (2) introduces a sequence-level importance sampling correction that mitigates off-policy distribution shift while preserving GRPO's clipping mechanism; and (3) shows theoretically and empirically that late denoising steps yield ill-conditioned off-policy ratios, and therefore truncates trajectories at late steps. Result: On image and video generation benchmarks, OP-GRPO matches or exceeds Flow-GRPO with only 34.2% of the training steps on average, improving training efficiency without hurting generation quality. Conclusion: OP-GRPO brings off-policy learning to GRPO training of flow-matching models, addressing the sample-efficiency bottleneck and achieving a better balance between efficiency and quality. Abstract: Post-training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO's clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.
[200] Uncertainty-Aware Test-Time Adaptation for Cross-Region Spatio-Temporal Fusion of Land Surface Temperature
Sofiane Bouaziz,Adel Hafiane,Raphael Canals,Rachid Nedjai
Main category: cs.CV
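TL;DR: This paper proposes an uncertainty-aware test-time adaptation (TTA) framework for the regression task of spatio-temporal fusion of land surface temperature. It updates only the fusion module of a pre-trained model and improves cross-region generalization without source data or target labels.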
Details
Motivation: Deep learning models in remote sensing often generalize poorly because data distributions differ between geographic regions (domain shift); existing test-time adaptation methods mainly target classification and are not directly applicable to regression tasks such as spatio-temporal fusion (STF). Method: An uncertainty-aware TTA framework updates only the fusion module of a pre-trained STF model, guided by epistemic uncertainty, land use and land cover consistency, and bias correction, without relying on source data or labeled target samples. Result: Experiments on four target regions with markedly different climates (Rome, Italy; Cairo, Egypt; Madrid, Spain; Montpellier, France) show that, with only 10 TTA epochs and limited unlabeled target data, the method improves RMSE and MAE by 24.2% and 27.9% on average over the original model pre-trained in Orléans, France. Conclusion: The proposed TTA framework mitigates cross-region domain shift in remote sensing regression; it is lightweight and unsupervised, and offers a practical deployment option for STF and similar regression tasks. Abstract: Deep learning models have shown great promise in diverse remote sensing applications. However, they often struggle to generalize across geographic regions unseen during training due to domain shifts. Domain shifts occur when data distributions differ between the training region and new target regions, due to variations in land cover, climate, and environmental conditions. Test-time adaptation (TTA) has emerged as a solution to such shifts, but existing methods are primarily designed for classification and are not directly applicable to regression tasks. In this work, we address the regression task of spatio-temporal fusion (STF) for land surface temperature estimation. We propose an uncertainty-aware TTA framework that updates only the fusion module of a pre-trained STF model, guided by epistemic uncertainty, land use and land cover consistency, and bias correction, without requiring source data or labeled target samples. Experiments on four target regions with diverse climates, namely Rome in Italy, Cairo in Egypt, Madrid in Spain, and Montpellier in France, show consistent improvements in RMSE and MAE for a model pre-trained in Orléans, France. The average gains are 24.2% and 27.9%, respectively, even with limited unlabeled target data and only 10 TTA epochs.
[201] Hierarchical Co-Embedding of Font Shapes and Impression Tags
Yugo Kubota,Kaito Shiku,Seiichi Uchida
Main category: cs.CV
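TL;DR: This paper proposes a hyperbolic co-embedding framework that models the correspondence between fonts and impression descriptions through entailment rather than simple paired alignment, capturing the graded nature of "style specificity". Experiments on the MyFonts dataset show gains in bidirectional retrieval and interpretability.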
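As a rough illustration of how a hyperbolic embedding can express graded style specificity as distance from the origin, the sketch below uses the Poincaré ball model. The entry does not state which model of hyperbolic space the authors use, so the Poincaré ball, the function names, and the example coordinates are all assumptions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm = lambda x: sum(a * a for a in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def radial_specificity(v):
    """Hypothetical specificity score: hyperbolic distance from the origin."""
    return poincare_distance([0.0] * len(v), v)


# Hypothetical embeddings: a broad impression near the origin,
# a style-specific impression closer to the boundary.
broad_tag = [0.1, 0.0]
specific_tag = [0.9, 0.0]
```

Because hyperbolic distance grows without bound toward the boundary of the ball, points far from the origin have room to separate into many narrow branches, which is why such spaces are often used for hierarchies.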
Details
Motivation: Font shapes evoke many impressions, but fonts and impressions do not correspond one-to-one: some impressions are compatible with many font styles, while others strongly constrain the choice of font. This difference in constraint strength, called style specificity, needs to be quantified and modeled. Method: A co-embedding framework in hyperbolic space jointly embeds font images and impression tags (single tags or tag sets) under two entailment constraints: impression-to-font entailment, and low-to-high style-specificity entailment among impressions. This yields a geometry centered on the origin in which impressions are distributed radially by specificity. Result: On the MyFonts dataset, bidirectional retrieval improves over strong one-to-one baselines; traversal and tag-level analyses show that the learned space captures a progression from ambiguous to style-specific impressions and provides a data-driven measure of style specificity. Conclusion: Hyperbolic entailment modeling is better suited than paired alignment to the asymmetric, hierarchical font-impression relationship, and the resulting geometric measure of style specificity is interpretable and practical. Abstract: Font shapes can evoke a wide range of impressions, but the correspondence between fonts and impression descriptions is not one-to-one: some impressions are broadly compatible with diverse styles, whereas others strongly constrain the set of plausible fonts. We refer to this graded constraint strength as style specificity. In this paper, we propose a hyperbolic co-embedding framework that models font-impression correspondence through entailment rather than simple paired alignment. Font images and impression descriptions, represented as single tags or tag sets, are embedded in a shared hyperbolic space with two complementary entailment constraints: impression-to-font entailment and low-to-high style-specificity entailment among impressions. This formulation induces a radial structure in which low style-specificity impressions lie near the origin and high style-specificity impressions lie farther away, yielding an interpretable geometric measure of how strongly an impression constrains font style. Experiments on the MyFonts dataset demonstrate improved bidirectional retrieval over strong one-to-one baselines. In addition, traversal and tag-level analyses show that the learned space captures a coherent progression from ambiguous to more style-specific impressions and provides a meaningful, data-driven quantification of style specificity.
[202] Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation
Xu Yan,Jun Yin,Shiliang Sun,Minghua Wan
Main category: cs.CV
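TL;DR: This paper proposes a multi-view multi-label learning method for the dual-missing setting, where both views and labels are incomplete. It combines a shared discrete codebook, cross-view reconstruction, weighted fusion, and fused-teacher self-distillation to improve semantic consistency and prediction robustness.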
Details
Motivation: In the dual-missing setting, existing methods lack explicit structural constraints, which makes it hard to learn stable and discriminative shared semantic representations. Method: A multi-view shared discrete codebook and cross-view reconstruction provide structured, consistent representation learning; a view-weight estimation method scores each view by its ability to preserve label correlations; a fused-teacher self-distillation framework improves the generalization of the single-view classifiers. Result: The method outperforms existing advanced methods on five benchmark datasets, supporting its effectiveness and robustness under dual-missing conditions. Conclusion: Combining structured consistent representation learning with knowledge distillation addresses the dual-missing challenge in multi-view multi-label learning and suggests a new approach to modeling incomplete multi-source data. Abstract: Although multi-view multi-label learning has been extensively studied, research on the dual-missing scenario, where both views and labels are incomplete, remains largely unexplored. Existing methods mainly rely on contrastive learning or information bottleneck theory to learn consistent representations under missing-view conditions, but loss-based alignment without explicit structural constraints limits the ability to capture stable and discriminative shared semantics. To address this issue, we introduce a more structured mechanism for consistent representation learning: we learn discrete consistent representations through a multi-view shared codebook and cross-view reconstruction, which naturally align different views within the limited shared codebook embeddings and reduce feature redundancy. At the decision level, we design a weight estimation method that evaluates the ability of each view to preserve label correlation structures, assigning weights accordingly to enhance the quality of the fused prediction. In addition, we introduce a fused-teacher self-distillation framework, where the fused prediction guides the training of view-specific classifiers and feeds the global knowledge back into the single-view branches, thereby enhancing the generalization ability of the model under missing-label conditions. The effectiveness of our proposed method is thoroughly demonstrated through extensive comparative experiments with advanced methods on five benchmark datasets. Code is available at https://github.com/xuy11/SCSD.
[203] GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
Yaohan Guan,Pristina Wang,Najim Dehak,Alan Yuille,Jieneng Chen,Daniel Khashabi
Main category: cs.CV
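TL;DR: This paper introduces the GENFIG1 benchmark, which evaluates how well generative AI models (such as vision-language models) can produce a paper's core figure (Figure 1) from its text. The task requires combining scientific understanding with visual synthesis, and current models still perform poorly.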
Details
Motivation: In scientific papers, Figure 1 is the visual summary of the core idea, and the difficulty of designing it reflects the challenge of scientific visual communication; existing AI models have not been systematically evaluated on combining scientific understanding with visual expression. Method: The GENFIG1 benchmark is built from high-quality Figure 1 examples and their accompanying text (title, abstract, introduction, figure caption) taken from papers at top deep-learning conferences, with stringent quality control and an automatic evaluation metric that correlates well with expert ratings; a range of representative generative models is evaluated on it. Result: Even the best-performing generative models find the task very challenging, and the proposed automatic metric correlates well with expert human judgments. Conclusion: GENFIG1 provides a systematic benchmark for measuring and advancing multimodal AI that couples scientific understanding with visual generation, and is intended to support future research in this direction. Abstract: In many science papers, "Figure 1" serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, often requiring significant effort and iteration by human authors to get right, highlighting the difficulty of science visual communication. With this intuition, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-Language Models). GENFIG1 evaluates models for their ability to produce figures that clearly express and motivate the central idea of a paper, given the paper's text (title, abstract, introduction, and figure caption) as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task entails reasoning for text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend and grasp the technical concepts of the paper, (ii) identify the most salient ones, and (iii) design a coherent and aesthetically effective graphic that conveys those concepts visually and is faithful to the input. We curate the benchmark from papers published at top deep-learning conferences, apply stringent quality control, and introduce an automatic evaluation metric that correlates well with expert human judgments. We evaluate a suite of representative models on GENFIG1 and demonstrate that the task presents significant challenges, even for the best-performing systems. We hope this benchmark serves as a foundation for future progress in multimodal AI.
[204] Scale-Aware Vision-Language Adaptation for Extreme Far-Distance Video Person Re-identification
Ashwat Rajbhandari,Bharatesh Chakravarthi
Main category: cs.CV
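TL;DR: This paper proposes a method for extreme far-distance video person re-identification based on a large vision-language model. By upgrading to a ViT-L/14 visual backbone and adding selective fine-tuning, temporal attention pooling, and cross-view adaptation, it improves performance on the DetReIDX benchmark.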
Details
Motivation: Extreme far-distance video person re-identification is challenged by scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch, and conventional models degrade sharply at high altitude and long range. Method: Building on a CLIP framework, the visual backbone is upgraded from ViT-B/16 to ViT-L/14 with backbone-aware selective fine-tuning; a lightweight temporal attention pooling suppresses low-quality frames; adapter-based and prompt-conditioned cross-view learning is retained to mitigate the aerial-ground domain shift; improved optimization and k-reciprocal re-ranking refine retrieval. Result: On the DetReIDX stress-test benchmark, mAP reaches 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), with an overall mAP of 35.73. Conclusion: Large-scale vision-language models combined with stability-focused adaptation strategies can significantly improve the robustness of extreme far-distance video person re-identification. Abstract: Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shifts, and further refine retrieval using improved optimization and k-reciprocal re-ranking. Experiments on the DetReIDX stress-test benchmark show that our approach achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), corresponding to an overall mAP of 35.73. These results show that large-scale vision-language backbones, when combined with stability-focused adaptation, significantly enhance robustness in extreme far-distance video person ReID.
[205] AURA: Always-On Understanding and Real-Time Assistance via Video Streams
Xudong Lu,Yang Bo,Jinpeng Chen,Shuhan Li,Xintong Guo,Huankang Guan,Fang Liu,Dunyuan Xu,Peiwen Sun,Heyang Sun,Rui Liu,Hongsheng Li
Main category: cs.CV
TL;DR: AURA is an end-to-end streaming visual interaction framework enabling VideoLLMs to continuously process live video streams and support real-time, open-ended question answering and proactive assistance.
Details
Motivation: Existing VideoLLMs are mostly offline and ill-suited for live video streams requiring continuous observation and timely response; current streaming approaches suffer from decoupled pipelines or limited captioning-only outputs, hindering open-ended QA and long-horizon interaction. Method: AURA introduces an end-to-end streaming visual interaction framework integrating context management, data construction, training objectives, and deployment optimization to enable a unified VideoLLM for continuous video stream processing, real-time QA, and proactive responses. Result: AURA achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR/TTS running at 2 FPS on two 80G accelerators. Conclusion: AURA demonstrates that unified, end-to-end streaming VideoLLMs are feasible and effective for real-time, interactive video understanding, and the authors release both the model and inference framework to advance future research. Abstract: Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
[206] Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks
Adrienne Deganutti,Elad Hirsch,Haonan Zhu,Jaejung Seol,Purvanshi Mehta
Main category: cs.CV
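TL;DR: This paper introduces GraphicDesignBench (GDB), the first comprehensive AI benchmark covering the full range of professional graphic design tasks across five axes: layout, typography, infographics, template semantics, and animation. It shows that current AI models still fall well short on core abilities such as spatial reasoning, vector generation, fine-grained typography, and temporal decomposition of animations.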
Details
Motivation: Existing benchmarks focus on natural-image understanding or generic text-to-image generation and cannot assess the abilities that professional graphic design requires, such as structured layout, precise typography, layered composition, vector output, and animation reasoning; a dedicated benchmark is needed to advance design AI. Method: GDB comprises 50 tasks across five design axes, all grounded in real layered design templates from the LICA dataset; frontier closed-source models are evaluated with a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Result: Current models handle high-level semantic understanding reasonably well but are weak at spatial reasoning, vector code generation, fine-grained typographic perception, and temporal decomposition of animations, with the gap widening as tasks demand precision, structure, and compositional awareness. Conclusion: GDB provides a rigorous, reproducible testbed for evaluating and advancing AI as a capable design collaborator, and identifies the core technical bottlenecks for future research. Abstract: We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes: layout, typography, infographics, template & design semantics and animation, each evaluated under both understanding and generation settings, and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.
[207] DriveVA: Video Action Models are Zero-Shot Drivers
Mengmeng Liu,Diankun Zhang,Jiuming Liu,Jianfeng Cui,Hongwei Xie,Guang Chen,Hangjun Ye,Michael Ying Yang,Francesco Nex,Hao Cheng
Main category: cs.CV
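TL;DR: DriveVA is a new autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. It draws on the priors of large-scale video generation models to improve generalization and video-trajectory consistency, and outperforms existing methods on several benchmarks.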
Details
Motivation: Existing world-model-based planning methods generalize poorly across datasets and sensor configurations, and their loose coupling of video prediction and trajectory planning leads to poor consistency during visual imagination. Method: DriveVA uses a DiT-based decoder to jointly predict future action sequences and videos, adds a video continuation strategy to strengthen long-horizon rollout consistency, and inherits priors on motion dynamics and physical plausibility from large-scale pre-trained video generation models. Result: DriveVA reaches a closed-loop PDM score of 90.9 on NAVSIM; in zero-shot transfer it reduces average L2 error by 78.9% and 52.5% and collision rate by 83.3% and 52.4% on nuScenes and Bench2drive (CARLA v2), respectively. Conclusion: By tightly coupling visual and action generation, drawing on strong spatiotemporal priors, and using video continuation, DriveVA improves the generalization and consistency of autonomous driving world models. Abstract: Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves a closed-loop performance of 90.9 PDM score on the challenging NAVSIM benchmark. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes and by 52.5% and 52.4% on Bench2drive, built on CARLA v2, compared with the state-of-the-art world-model-based planner.
[208] A Persistent Homology Design Space for 3D Point Cloud Deep Learning
Prachi Kudeshia,Jiju Poovvancheri,Amr Ghoneim,Dong Chen
Main category: cs.CV
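TL;DR: This paper proposes a unified design space for Persistent Homology (PH) driven deep learning on 3D point clouds (3DPHDL), systematically embedding topological features as a structural inductive bias at multiple stages of point cloud learning. Experiments on ModelNet40 and ShapeNetPart examine the gains in accuracy, robustness, and consistency.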
Details
Motivation: Although Persistent Homology offers theoretical stability and complements geometric representations, its use in point cloud deep learning remains scattered and peripheral, with no systematic integration framework. Method: The 3DPHDL design space covers complex construction, filtration strategy, persistence representation, neural backbone, and prediction task; it identifies six key injection points (such as sampling, neighborhood graphs, optimization dynamics, and self-supervision); an empirical study augments PointNet, DGCNN, and Point Transformer with persistence diagrams, images, and landscapes. Result: On ModelNet40 classification and ShapeNetPart segmentation, the study reports consistent improvements in topology-sensitive discrimination and part consistency and analyzes robustness to noise and sampling variation, while revealing trade-offs between representational expressiveness and combinatorial complexity. Conclusion: Persistent Homology should be treated as a structural component of the learning pipeline rather than only an auxiliary feature; this work provides a systematic framework for incorporating topological reasoning into 3D point cloud learning. Abstract: Persistent Homology (PH) offers stable, multi-scale descriptors of intrinsic shape structure by capturing connected components, loops, and voids that persist across scales, providing invariants that complement purely geometric representations of 3D data. Yet, despite strong theoretical guarantees and increasing empirical adoption, its integration into deep learning for point clouds remains largely ad hoc and architecturally peripheral. In this work, we introduce a unified design space for Persistent-Homology driven learning in 3D point clouds (3DPHDL), formalizing the interplay between complex construction, filtration strategy, persistence representation, neural backbone, and prediction task. Beyond the canonical pipeline of diagram computation and vectorization, we identify six principled injection points through which topology can act as a structural inductive bias reshaping sampling, neighborhood graphs, optimization dynamics, self-supervision, output calibration, and even internal network regularization. We instantiate this framework through a controlled empirical study on ModelNet40 classification and ShapeNetPart segmentation, systematically augmenting representative backbones (PointNet, DGCNN, and Point Transformer) with persistence diagrams, images, and landscapes, and analyzing their impact on accuracy, robustness to noise and sampling variation, and computational scalability. Our results demonstrate consistent improvements in topology-sensitive discrimination and part consistency, while revealing meaningful trade-offs between representational expressiveness and combinatorial complexity. By viewing persistent homology not merely as an auxiliary feature but as a structured component within the learning pipeline, this work provides a systematic framework for incorporating topological reasoning into 3D point cloud learning.
[209] HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data
Stella Girtsou,Konstantinos Alexis,Giorgos Giannopoulos,Harris Kontoes
Main category: cs.CV
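TL;DR: This paper presents HighFM, a foundation model for high temporal resolution, multispectral Earth observation data. It uses SEVIRI satellite data and an adapted SatMAE framework to learn spatiotemporal representations, improving cloud masking and active fire detection in support of real-time disaster monitoring.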
Details
Motivation: Existing Earth observation foundation models mostly rely on imagery with high spatial resolution but low revisit rates, which is poorly suited to fast-evolving climate disasters and time-critical emergency response. Method: Using over 2 TB of high temporal resolution SEVIRI data from Meteosat Second Generation, the SatMAE masked autoencoding framework is adapted with fine-grained temporal encodings to better model short-term variability, and the pretrained models are fine-tuned on cloud masking and active fire detection. Result: On cloud masking and active fire detection, the SEVIRI-pretrained ViT models show consistent gains in balanced accuracy and IoU over traditional baselines and recent geospatial foundation models. Conclusion: Temporally dense geostationary data can support real-time Earth observation, and HighFM offers a scalable path toward foundation models for disaster detection and tracking. Abstract: The increasing frequency and severity of climate-related disasters have intensified the need for real-time monitoring, early warning, and informed decision-making. Earth Observation (EO), powered by satellite data and Machine Learning (ML), offers powerful tools to meet these challenges. Foundation Models (FMs) have revolutionized EO ML by enabling general-purpose pretraining on large-scale remote sensing datasets. However, most existing models rely on high-resolution satellite imagery with low revisit rates, limiting their suitability for fast-evolving phenomena and time-critical emergency response. In this work, we present HighFM, a first-cut approach towards an FM for high temporal resolution, multispectral EO data. Leveraging over 2 TB of SEVIRI imagery from the Meteosat Second Generation (MSG) platform, we adapt the SatMAE masked autoencoding framework to learn robust spatiotemporal representations. To support real-time monitoring, we enhance the original architecture with fine-grained temporal encodings to capture short-term variability. The pretrained models are then fine-tuned on cloud masking and active fire detection tasks. We benchmark our SEVIRI-pretrained Vision Transformers against traditional baselines and recent geospatial FMs, demonstrating consistent gains across both balanced accuracy and IoU metrics. Our results highlight the potential of temporally dense geostationary data for real-time EO, offering a scalable path toward foundation models for disaster detection and tracking.
[210] GA-GS: Generation-Assisted Gaussian Splatting for Static Scene Reconstruction
Yedong Shen,Shiqi Zhang,Sha Zhang,Yifan Duan,Xinran Zhang,Wenhao Yu,Lu Zhang,Jiajun Deng,Yanyong Zhang
Main category: cs.CV
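TL;DR: This paper proposes GA-GS, which uses a generative (diffusion) model to help Gaussian splatting reconstruct static scene regions occluded by dynamic objects. A motion-aware module segments dynamic regions, a diffusion model inpaints the occluded areas, and a learnable authenticity scalar enables authenticity-aware rendering and supervision. Results on the self-built Trajectory-Match dataset and on DAVIS show strong performance under severe occlusion.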
Details
Motivation: Existing methods for static scene reconstruction from monocular video rely on the visible background and struggle to recover regions occluded by dynamic objects; there is also no benchmark dataset with ground-truth static scenes. Method: GA-GS (1) uses a motion-aware module to segment and remove dynamic regions; (2) inpaints the occluded regions with a diffusion model, providing pseudo-ground-truth supervision; and (3) attaches a learnable authenticity scalar to each Gaussian primitive, which dynamically modulates its opacity for authenticity-aware rendering and supervision. Result: Experiments on DAVIS and the self-built Trajectory-Match dataset show that GA-GS achieves state-of-the-art static scene reconstruction, especially under large-scale, persistent occlusions. Conclusion: Generative models can assist occlusion recovery in static scene reconstruction, the authenticity-aware Gaussian splatting framework improves robustness and accuracy, and the new dataset enables quantitative evaluation of reconstruction in occluded regions. Abstract: Reconstructing a static 3D scene from monocular video with dynamic objects is important for numerous applications such as virtual reality and autonomous driving. Current approaches typically rely on background for static scene reconstruction, limiting the ability to recover regions occluded by dynamic objects. In this paper, we propose GA-GS, a Generation-Assisted Gaussian Splatting method for Static Scene Reconstruction. The key innovation of our work lies in leveraging generation to assist in reconstructing occluded regions. We employ a motion-aware module to segment and remove dynamic regions, and then use a diffusion model to inpaint the occluded areas, providing pseudo-ground-truth supervision. To balance contributions from real background and generated regions, we introduce a learnable authenticity scalar for each Gaussian primitive, which dynamically modulates opacity during splatting for authenticity-aware rendering and supervision. Since no existing dataset provides the ground-truth static scene for videos with dynamic objects, we construct a dataset named Trajectory-Match, using a fixed-path robot to record each scene with and without dynamic objects, enabling quantitative evaluation of the reconstruction of occluded regions. Extensive experiments on both DAVIS and our dataset show that GA-GS achieves state-of-the-art performance in static scene reconstruction, especially in challenging scenarios with large-scale, persistent occlusions.
[211] Spatially-Weighted CLIP for Street-View Geo-localization
Ting Han,Fengjiao Li,Chunsong Chen,Haoling Huang,Yiping Chen,Meiliu Wu
Main category: cs.CV
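TL;DR: This paper proposes the SW-CLIP framework, which brings spatial autocorrelation into vision-language contrastive learning. Soft labels weighted by geographic distance and a neighborhood-consistency regularizer improve the accuracy and spatial coherence of street-view geo-localization.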
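The distance-weighted soft labels mentioned in this entry can be sketched as follows. This is only an illustration under stated assumptions: the exponential kernel, the `bandwidth_km` parameter, and the function names are hypothetical, and the entry does not give the exact weighting the authors use. The haversine formula is a standard great-circle approximation of geodesic distance on a sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def soft_labels(anchor, candidates, bandwidth_km=5.0):
    """Replace a one-hot target with weights that decay with distance
    (assumed exponential kernel), normalized to sum to 1."""
    weights = [math.exp(-haversine_km(anchor, c) / bandwidth_km)
               for c in candidates]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical batch: the matching location, a near neighbor, a far sample.
labels = soft_labels((48.0, 2.0), [(48.0, 2.0), (48.01, 2.0), (49.0, 2.0)])
```

With a one-hot target the near neighbor and the far sample would be penalized equally; here the near neighbor keeps a substantial share of the target mass while the far sample receives almost none.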
Details
Motivation: Conventional CLIP-based methods treat all non-matching samples as equally negative, ignoring the basic geographic regularity that near things are more similar (Tobler's First Law), which limits geo-localization performance. Method: Spatially-Weighted CLIP (SW-CLIP) (1) encodes geographic coordinates with a location-as-text representation; (2) replaces one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance; and (3) adds a neighborhood-consistency regularizer to preserve local spatial structure in the embedding space. Result: On a multi-city dataset, SW-CLIP improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared with standard CLIP. Conclusion: The results highlight the importance of shifting from semantic alignment to geographic alignment, and the work provides a general paradigm for integrating spatial principles into multimodal representation learning. Abstract: This paper proposes Spatially-Weighted CLIP (SW-CLIP), a novel framework for street-view geo-localization that explicitly incorporates spatial autocorrelation into vision-language contrastive learning. Unlike conventional CLIP-based methods that treat all non-matching samples as equally negative, SW-CLIP leverages Tobler's First Law of Geography to model geographic relationships through distance-aware soft supervision. Specifically, we introduce a location-as-text representation to encode geographic positions and replace one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. Additionally, a neighborhood-consistency regularization is employed to preserve local spatial structure in the embedding space. Experiments on a multi-city dataset demonstrate that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP. The results highlight the importance of shifting from semantic alignment to geographic alignment for robust geo-localization and provide a general paradigm for integrating spatial principles into multimodal representation learning.
[212] Integer-Only Operations on Extreme Learning Machine Test Time Classification
Emerson Lopes Machadoa,Cristiano Jacques Miosso,Ricardo Pezzuol Jacobi
Main category: cs.CV
TL;DR: 本文提出了一种基于极限学习机(ELM)的网络分类器在测试阶段降低计算成本的新方法,通过使用仅含整数运算的策略,在不显著牺牲分类精度的前提下,提升FPGA等嵌入式平台的能效。
Details
Motivation: 为降低ELM分类器在测试阶段的计算开销,尤其面向功耗受限的嵌入式系统和高能耗的数据中心场景。 Method: 利用ELM模型特性:(i) 输入权重采用三值化(-1, 0, +1),消除乘法;(ii) 证明归一化与非归一化测试信号分类精度一致;(iii) 构造整数量化输出权重。 Result: 在5个主流计算机视觉数据集上验证,所提技术显著降低FPGA上测试阶段的计算成本,精度损失有限。 Conclusion: ELM分类器可在测试阶段完全使用整数运算实现高效推理,兼顾精度与能效,适用于资源受限与大规模部署场景。 Abstract: We present a theoretical analysis and empirical evaluations of a novel set of techniques for computational cost reduction of test time operations of network classifiers based on extreme learning machine (ELM). By exploring some characteristics we derived from these models, we show that the classification at test time can be performed using solely integer operations without compromising the classification accuracy. Our contributions are as follows: (i) We show empirical evidence that the input weights values can be drawn from the ternary set with limited reduction of the classification accuracy. This has the computational advantage of dismissing multiplications; (ii) We prove the classification accuracy of normalized and non-normalized test signals are the same; (iii) We show how to create an integer version of the output weights that results in a limited reduction of the classification accuracy. We tested our techniques on 5 computer vision datasets commonly used in the literature and the results indicate that our techniques can allow the reduction of the computational cost of the operations necessary for the classification at test time in FPGAs. This is important in embedded applications, where power consumption is limited, and crucial in data centers of large corporations, where power consumption is expensive.[213] Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
Songyuan Yang,Weijiang Yu,Ziyu Liu,Guijian Tang,Wenjing Yang,Huibin Tan,Nong Xiao
Main category: cs.CV
TL;DR: 本文提出Graph-to-Frame RAG(G2F-RAG),一种无需训练、可审计的检索增强生成范式,将外部知识以视觉帧形式融入视频推理,通过构建视频知识图谱并在推理时渲染为单帧图像,实现统一视觉空间中的联合推理,从而降低认知负荷、提升可解释性与性能。
Details
Motivation: When existing video reasoning systems based on large multimodal models (LMMs) bring in external knowledge, they often force heterogeneous signals (such as text or multiple video clips) into a single attention space, which dilutes attention and raises cognitive load; the bottleneck is not only what to retrieve but how to represent external knowledge and fuse it with the video backbone. Method: G2F-RAG works in two stages. Offline, an agent builds a problem-agnostic video knowledge graph containing entities, events, spatial relations, and linked world knowledge. Online, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single "reasoning frame" appended to the original video sequence; the LMM then reasons jointly in a unified visual domain. Result: G2F-RAG is plug-and-play across backbones and scales, and yields consistent gains on diverse public benchmarks, with larger improvements on knowledge-intensive tasks; ablations confirm that how knowledge is represented and delivered matters. Conclusion: G2F-RAG reframes retrieval augmentation as knowledge fusion in visual space, offering a more robust and interpretable paradigm for video reasoning. Abstract: When video reasoning requires external knowledge, many systems with large multimodal models (LMMs) adopt retrieval augmentation to supply the missing context. Appending textual or multi-clip evidence, however, forces heterogeneous signals into a single attention space. We observe diluted attention and higher cognitive load even on non-long videos. The bottleneck is not only what to retrieve but how to represent and fuse external knowledge with the video backbone. We present Graph-to-Frame RAG (G2F-RAG), a training-free and auditable paradigm that delivers knowledge in the visual space. In the offline stage, an agent builds a problem-agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. In the online stage, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video. LMMs then perform joint reasoning in a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail. G2F-RAG is plug-and-play across backbones and scales. It yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Ablations further confirm that knowledge representation and delivery matter. G2F-RAG reframes retrieval as visual-space knowledge fusion for robust and interpretable video reasoning.
[214] Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
Songyuan Yang,Weijiang Yu,Jilin Ma,Ziyu Liu,Guijian Tang,Wenjing Yang,Huibin Tan,Nong Xiao
Main category: cs.CV
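TL;DR: This paper proposes the RLER framework, which improves the reliability and interpretability of video reasoning by decoupling learning to produce evidence from obtaining a reliable answer.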
Details
Motivation: Existing large multimodal models usually perform video reasoning in a single forward pass, with no check on whether the reasoning is aligned with the evidence, so results are unreliable and hard to interpret.
Method: Reinforce to Learn, Elect to Reason (RLER) is a two-stage paradigm. The training stage (RLER-Training) uses group-relative reinforcement learning with three new task-driven rewards (frame-sensitive, think-transparency, and anti-repetition) to guide the model to produce structured, machine-checkable evidence. The inference stage (RLER-Inference) uses a training-free orchestrator that generates several candidate answers, parses their cited frames and reasoning chains, scores them by evidence consistency, confidence, transparency, and non-redundancy, and elects the best answer by weighted vote.
Result: RLER outperforms open-source and RL-enhanced LMMs on 8 mainstream benchmarks, with an average improvement of 6.3% while using only 3.1 candidates per question on average, balancing compute efficiency and performance.
Conclusion: Modeling evidence explicitly (producing it during learning and electing by it during inference) is an effective path to trustworthy video reasoning.
Abstract: Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and three novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality.
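The evidence-weighted election described in this abstract can be pictured with a small sketch. The abstract does not give the scoring formula, so everything concrete here is an assumption made for illustration: the `Candidate` fields, the linear weighting in `WEIGHTS`, and the pooling of scores per distinct answer are hypothetical and are not RLER's actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights for the four criteria named in the abstract
# (evidence consistency, confidence, transparency, non-redundancy).
WEIGHTS = {"consistency": 0.4, "confidence": 0.3,
           "transparency": 0.2, "non_redundancy": 0.1}


@dataclass
class Candidate:
    answer: str            # parsed final answer
    consistency: float     # all scores are assumed to lie in [0, 1]
    confidence: float
    transparency: float
    non_redundancy: float


def candidate_score(c: Candidate) -> float:
    """Weighted sum of the four per-candidate scores (assumed form)."""
    return sum(w * getattr(c, name) for name, w in WEIGHTS.items())


def elect(candidates: list[Candidate]) -> tuple[str, dict[str, float]]:
    """Pool each candidate's score onto its answer and pick the heaviest answer."""
    totals: dict[str, float] = defaultdict(float)
    for c in candidates:
        totals[c.answer] += candidate_score(c)
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


candidates = [
    Candidate("A", 0.9, 0.8, 0.9, 0.9),   # one well-supported answer
    Candidate("B", 0.3, 0.4, 0.5, 0.5),   # two weakly supported answers
    Candidate("B", 0.2, 0.5, 0.4, 0.5),
]
winner, totals = elect(candidates)
print(winner, totals)
```

In this toy input, answer B appears twice, but its pooled weight (about 0.75) stays below that of the single well-supported answer A (about 0.87). That is the behavior an evidence-weighted election is meant to have over a plain majority vote.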
The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
[215] BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion
Tianzhi Jia,Kaixing Yang,Xiaole Yang,Xulong Tang,Ke Qiu,Shikui Wei,Yao Zhao
Main category: cs.CV
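TL;DR: This paper proposes the BiTDiff framework and the CM-Data dataset to address data scarcity and high-quality long-sequence modeling in 3D conducting motion generation.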
Details
Motivation: 3D conducting motion generation remains underexplored because large-scale fine-grained datasets and methods for efficient, high-quality long-sequence generation are lacking.
Method: The authors build CM-Data, the first large-scale public fine-grained dataset of 3D conducting motion, and propose BiTDiff, which combines BiMamba and Transformer architectures with a diffusion model and human-kinematic decomposition. It adds physical-consistency losses and a hand-/body-specific forward-kinematics design, and it supports training-free joint-level editing.
Result: BiTDiff achieves SOTA performance for 3D conducting motion generation on CM-Data, with both quantitative and qualitative experiments supporting its advantage.
Conclusion: Together, BiTDiff and CM-Data advance research on music-driven 3D conducting motion generation and provide new tools for music education, virtual performance, and human-AI co-creation.
Abstract: 3D conducting motion generation aims to synthesize fine-grained conductor motions from music, with broad potential in music education, virtual performance, digital human animation, and human-AI co-creation. However, this task remains underexplored due to two major challenges: (1) the lack of large-scale fine-grained 3D conducting datasets and (2) the absence of effective methods that can jointly support long-sequence generation with high quality and efficiency. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and construct CM-Data, a fine-grained SMPL-X dataset with about 10 hours of conducting motion data. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a novel framework for 3D conducting motion generation, built upon a BiMamba-Transformer hybrid model architecture for efficient long-sequence modeling and a diffusion-based generative strategy with human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand-/body-specific forward-kinematics design for better fine-grained motion modeling, while leveraging BiMamba for memory-efficient long-sequence temporal modeling and Transformer for cross-modal semantic alignment. In addition, BiTDiff supports training-free joint-level motion editing, enabling downstream human-AI interaction design.
Extensive quantitative and qualitative experiments demonstrate that BiTDiff achieves state-of-the-art (SOTA) performance for 3D conducting motion generation on the CM-Data dataset. Code will be available upon acceptance.
[216] UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining
Pei Yang,Hai Ci,Beibei Lin,Yiren Song,Mike Zheng Shou
Main category: cs.CV
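TL;DR: This paper introduces UENR-600K, the first large-scale, physically modeled synthetic dataset for nighttime video deraining. It simulates 3D raindrops in Unreal Engine, addressing the inability of existing 2D synthesis methods to model colored, locally illuminated nighttime rain. Using this dataset, the authors adapt the Wan 2.2 model and, for the first time, cast nighttime video deraining as a video-to-video generation task, substantially narrowing the sim-to-real gap.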
Details
Motivation: Existing small-scale synthetic datasets overlay 2D rain streaks and cannot accurately model the physical properties of nighttime raindrops under artificial light, such as color, local illumination, refraction, and occlusion. Real paired nighttime rainy/rain-free videos are hard to capture because rain effects are entangled with other degradations such as sensor noise.
Method: UENR-600K is built by simulating raindrops as 3D particles in Unreal Engine, modeling physical phenomena such as color refraction, scene occlusion, and rain curtains. On top of this dataset, nighttime video deraining is cast as a video-to-video generation task, and the Wan 2.2 video generation model is adapted as a new baseline.
Result: On real nighttime rain videos, the method clearly outperforms prior SOTA. Ablation and cross-domain tests indicate that models trained on UENR-600K generalize well and effectively bridge the sim-to-real gap.
Conclusion: The physically realistic 3D raindrop simulation dataset UENR-600K is key to improving nighttime video deraining, and recasting the task as generative video-to-video modeling makes fuller use of prior knowledge.
Abstract: Nighttime video deraining is uniquely challenging because raindrops interact with artificial lighting. Unlike daytime white rain, nighttime rain takes on various colors and appears locally illuminated. Existing small-scale synthetic datasets rely on 2D rain overlays and fail to capture these physical properties, causing models to generalize poorly to real-world night rain. Meanwhile, capturing real paired nighttime videos remains impractical because rain effects cannot be isolated from other degradations like sensor noise. To bridge this gap, we introduce UENR-600K, a large-scale, physically grounded dataset containing 600,000 1080p frame pairs. We utilize Unreal Engine to simulate rain as 3D particles within virtual environments. This approach guarantees photorealism and physically realistic raindrops, capturing correct details like color refractions, scene occlusions, and rain curtains. Leveraging this high-quality data, we establish a new state-of-the-art baseline by adapting the Wan 2.2 video generation model. Our baseline treats deraining as a video-to-video generation task, exploiting strong generative priors to almost entirely bridge the sim-to-real gap. Extensive benchmarking demonstrates that models trained on our dataset generalize significantly better to real-world videos. Project page: https://showlab.github.io/UENR-600K/.
[217] 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
Ze-Xin Yin,Liu Liu,Xinjie Wang,Wei Sui,Zhizhong Su,Jian Yang,Jin Xie
Main category: cs.CV
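TL;DR: This paper proposes 3D-Fixer, a new paradigm for single-view 3D scene generation based on local point cloud completion. With coarse-to-fine generation, a dual-branch conditioning network, and an occlusion-robust feature alignment strategy, it substantially improves geometric accuracy while preserving generation efficiency, and it releases the large-scale ARSG-110K dataset.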
Details
Motivation: Existing single-view 3D scene generation methods trade generalization against inference efficiency: feed-forward methods are efficient but struggle with complex scenes, while per-instance methods generalize well but rely on time-consuming pose optimization.
Method: 3D-Fixer is an in-place completion paradigm that uses fragmented geometry as a spatial anchor and extends 3D generative priors to complete full assets at their original locations. It adopts a coarse-to-fine generation pipeline, a dual-branch conditioning network, and an Occlusion-Robust Feature Alignment (ORFA) strategy. The authors also build the large-scale ARSG-110K dataset.
Result: 3D-Fixer reaches SOTA geometric accuracy, significantly outperforming baselines such as MIDI and Gen3DSR while retaining the inference efficiency of diffusion models.
Conclusion: 3D-Fixer bridges the gap between generation quality and efficiency, supporting the effectiveness and scalability of completion anchored on local geometry for single-view 3D scene generation.
Abstract: Compositional 3D scene generation from a single view requires the simultaneous recovery of scene layout and 3D assets. Existing approaches mainly fall into two categories: feed-forward generation methods and per-instance generation methods. The former directly predict 3D assets with explicit 6DoF poses through efficient network inference, but they generalize poorly to complex scenes. The latter improve generalization through a divide-and-conquer strategy, but suffer from time-consuming pose optimization. To bridge this gap, we introduce 3D-Fixer, a novel in-place completion paradigm. Specifically, 3D-Fixer extends 3D object generative priors to generate complete 3D assets conditioned on the partially visible point cloud at the original locations, which are cropped from the fragmented geometry obtained from the geometry estimation methods. Unlike prior works that require explicit pose alignment, 3D-Fixer uses fragmented geometry as a spatial anchor to preserve layout fidelity. At its core, we propose a coarse-to-fine generation scheme to resolve boundary ambiguity under occlusion, supported by a dual-branch conditioning network and an Occlusion-Robust Feature Alignment (ORFA) strategy for stable training. Furthermore, to address the data scarcity bottleneck, we present ARSG-110K, the largest scene-level dataset to date, comprising over 110K diverse scenes and 3M annotated images with high-fidelity 3D ground truth. Extensive experiments show that 3D-Fixer achieves state-of-the-art geometric accuracy, which significantly outperforms baselines such as MIDI and Gen3DSR, while maintaining the efficiency of the diffusion process.
Code and data will be publicly available at https://zx-yin.github.io/3dfixer.
[218] BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing
Kaiwen Wang,Kaili Zheng,Rongrong Deng,Yiming Shi,Chenyi Guo,Ji Wu
Main category: cs.CV
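TL;DR: This paper presents the BoxComm dataset and a structured commentary taxonomy, designs two new evaluations for combat sports commentary generation, and proposes the EIC-Gen baseline to improve perception of fleeting actions.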
Details
Motivation: Existing sports commentary generation benchmarks cover only team sports (such as soccer and basketball) and leave combat sports unexplored, even though combat sports pose distinct challenges: fast actions, subtle visual differences, and a high share of tactical analysis.
Method: The authors build the large-scale BoxComm dataset (445 World Boxing Championship match videos and 52K professional commentary sentences); propose a structured commentary taxonomy with three categories (play-by-play, tactical analysis, and contextual information); design two new evaluations, category-conditioned generation and commentary rhythm assessment; and propose the EIC-Gen model, which uses detected punch events as structured action cues.
Result: Several SOTA multimodal large models perform poorly on both new evaluations, while EIC-Gen, by incorporating punch event cues, improves consistently across metrics.
Conclusion: Combat sports commentary generation requires finer action perception and rhythm modeling. The BoxComm dataset, the structured taxonomy, and the new evaluations provide a foundation and new directions for the field.
Abstract: Recent multimodal large language models (MLLMs) have shown strong capabilities in general video understanding, driving growing interest in automatic sports commentary generation. However, existing benchmarks for this task focus exclusively on team sports such as soccer and basketball, leaving combat sports entirely unexplored. Notably, combat sports present distinct challenges: critical actions unfold within milliseconds with visually subtle yet semantically decisive differences, and professional commentary contains a substantially higher proportion of tactical analysis compared to team sports. In this paper, we present BoxComm, a large-scale dataset comprising 445 World Boxing Championship match videos with over 52K commentary sentences from professional broadcasts. We propose a structured commentary taxonomy that categorizes each sentence into play-by-play, tactical, or contextual, providing the first category-level annotation for sports commentary benchmarks. Building on this taxonomy, we introduce two novel and complementary evaluations tailored to sports commentary generation: (1) category-conditioned generation, which evaluates whether models can produce accurate commentary of a specified type given video context; and (2) commentary rhythm assessment, which measures whether freely generated commentary exhibits appropriate temporal pacing and type distribution over continuous video segments, capturing a dimension of commentary competence that prior benchmarks have not addressed. Experiments on multiple state-of-the-art MLLMs reveal that current models struggle on both evaluations.
We further propose EIC-Gen, an improved baseline incorporating detected punch events to supply structured action cues, yielding consistent gains and highlighting the importance of perceiving fleeting and subtle events for combat sports commentary.
[219] HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
Green Rosh,Prateek Kukreja,Vishakha SR,Pawan Prasad B H
Main category: cs.CV
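TL;DR: This paper proposes HandDreamer, the first zero-shot method for generating 3D hand models from text prompts. Using MANO initialization, skeleton-guided diffusion, and a novel corrective shape guidance loss, it addresses the unnatural structures, view inconsistencies, and loss of detail that existing zero-shot text-to-3D methods show on hands.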
Details
Motivation: Existing 3D hand modeling methods are costly, cumbersome, and hard to customize, while current zero-shot text-to-3D synthesis (e.g., SDS) generalizes poorly to hands, producing distorted structures, view inconsistencies, and missing details.
Method: HandDreamer (1) initializes from the MANO hand model to provide a strong structural prior; (2) introduces a hand-skeleton-guided diffusion process to ensure view and pose consistency; and (3) designs a novel corrective hand shape guidance loss that drives multi-view reconstruction to converge to consistent modes without geometric distortion.
Result: HandDreamer clearly outperforms existing state-of-the-art methods across multiple evaluations, generating hand models with more natural structure, higher view consistency, and richer detail.
Conclusion: HandDreamer offers a new paradigm for zero-shot text-driven 3D hand modeling, overcomes inherent weaknesses of SDS on complex articulated objects such as hands, and advances personalized 3D hand model generation for VR interaction.
Abstract: The emergence of virtual reality has necessitated the generation of detailed and customizable 3D hand models for interaction in the virtual world. However, the current methods for 3D hand model generation are both expensive and cumbersome, offering very little customizability to the users. While recent advancements in zero-shot text-to-3D synthesis have enabled the generation of diverse and customizable 3D models using Score Distillation Sampling (SDS), they do not generalize very well to 3D hand model generation, resulting in unnatural hand structures, view inconsistencies, and loss of details. To address these limitations, we introduce HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. Our findings suggest that view inconsistencies in SDS are primarily caused by the ambiguity in the probability landscape described by the text prompt, resulting in similar views converging to different modes of the distribution. This is particularly aggravated for hands due to the large variations in articulations and poses. To alleviate this, we propose to use MANO hand model based initialization and a hand skeleton guided diffusion process to provide a strong prior for the hand structure and to ensure view and pose consistency. Further, we propose a novel corrective hand shape guidance loss to ensure that all the views of the 3D hand model converge to view-consistent modes, without leading to geometric distortions.
Extensive evaluations demonstrate the superiority of our method over the state-of-the-art methods, paving a new way forward in 3D hand model generation.
[220] Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
Weihao Cao,Runqi Wang,Xiaoyue Duan,Jinchao Zhang,Ang Yang,Liping Jing
Main category: cs.CV
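TL;DR: This paper proposes the HSA-DINO framework, which uses a multi-scale prompt bank and a semantic-aware router to improve open-vocabulary object detection under domain shift without harming the generalization ability of the pre-trained model.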
Details
Motivation: Existing open-vocabulary object detection methods perform well in general scenarios but degrade severely on downstream tasks with substantial domain shift, because category labels in domain-specific tasks are scarce and semantically weak and because models fail to capture auxiliary semantics beyond the category labels.
Method: HSA-DINO has two components: (1) a multi-scale prompt bank that uses image feature pyramids to extract hierarchical semantics and select domain-specific local semantic prompts, progressively enriching the textual representations; and (2) a semantic-aware router that dynamically selects the semantic augmentation strategy at inference time, so that parameter updates do not harm generalization.
Result: Evaluated on OV-COCO, several vertical-domain datasets, and modified benchmarks, HSA-DINO outperforms prior SOTA methods and achieves a better balance between domain adaptability and open-vocabulary generalization.
Conclusion: HSA-DINO is a parameter-efficient semantic augmentation framework that mitigates the performance degradation of open-vocabulary object detection under domain shift while balancing domain adaptation and general-purpose generalization.
Abstract: Open-vocabulary object detection (OVOD) enables models to detect any object category, including unseen ones. Benefiting from large-scale pre-training, existing OVOD methods achieve strong detection performance on general scenarios (e.g., OV-COCO) but suffer severe performance drops when transferred to downstream tasks with substantial domain shifts. This degradation stems from the scarcity and weak semantics of category labels in domain-specific tasks, as well as the inability of existing models to capture auxiliary semantics beyond coarse-grained category labels. To address these issues, we propose HSA-DINO, a parameter-efficient semantic augmentation framework for enhancing open-vocabulary object detection. Specifically, we propose a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels. Furthermore, we introduce a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy during inference, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. We evaluate HSA-DINO on OV-COCO, several vertical domain datasets, and modified benchmark settings.
The results show that HSA-DINO performs favorably against previous state-of-the-art methods, achieving a superior trade-off between domain adaptability and open-vocabulary generalization.
[221] Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
Hao Liu,Ye Huang,Chenghuan Huang,Zhenyi Zheng,Jiangsu Du,Ziyang Ma,Jing Lyu,Yutong Lu
Main category: cs.CV
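TL;DR: This paper proposes Chorus, an inter-request caching method that exploits similarity across different requests to accelerate inference of video diffusion models such as DiT, achieving up to 45% speedup, with particularly strong gains on industrial distilled models.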
Details
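Motivation: Existing caching methods exploit only the similarity within the diffusion process of a single request and leave redundant computation across requests untapped, so they give limited speedup on efficient distilled models.
Method: Chorus uses a three-stage inter-request caching strategy. Stage 1 fully reuses latent features from similar requests. Stage 2 applies inter-request caching to specific latent regions during intermediate denoising steps, combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts. Stage 3 is not described in the abstract; the overall design centers on inter-request feature reuse and alignment.
Result: Chorus achieves up to 45% inference speedup on industrial 4-step distilled video diffusion models, clearly outperforming existing methods that rely on intra-request caching alone.
Conclusion: Inter-request caching is an effective new paradigm for improving video diffusion model serving efficiency. With structured caching and attention-guided alignment, Chorus greatly reduces inference cost while maintaining generation quality.
Abstract: Video Diffusion Transformer (DiT) models are a dominant approach for high-quality video generation but suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests to accelerate video diffusion model serving. Chorus achieves up to 45% speedup on industrial 4-step distilled models, where prior intra-request caching approaches are ineffective. Particularly, Chorus employs a three-stage caching strategy along the denoising process. Stage 1 performs full reuse of latent features from similar requests. Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps. This stage is combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts, thereby extending the applicability of full reuse to later denoising steps.
[222] Group-DINOmics: Incorporating People Dynamics into DINO for Self-supervised Group Activity Feature Learning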
Ryuki Tezuka,Chihiro Nakatani,Norimichi Ukita
Main category: cs.CV
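TL;DR: This paper proposes a method for learning group activity features (GAFs) without group activity annotations. It combines local dynamics and global group features extracted by DINO and designs two self-supervised pretext tasks, person flow estimation and group-relevant object localization, to learn group-dynamics-aware GAFs, reaching SOTA performance on group activity retrieval and recognition.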
Details
Motivation: Existing methods mostly learn group activity features (GAFs) from low-level static local features, which makes it hard to model group dynamics and scene context, and they usually need costly group activity annotations. This work aims to learn GAFs without group activity annotations while capturing both local motion and global group structure.
Method: A self-supervised GAF learning framework built on DINO features: (1) person flow estimation serves as a local-dynamics-aware pretext task that models individual motion; (2) group-relevant object location estimation serves as a global-group-structure-aware pretext task that models spatial relations between people and objects; (3) the two tasks are optimized jointly to learn group-dynamics-aware GAFs.
Result: On multiple public datasets, the method reaches SOTA performance in group activity retrieval and recognition. Ablations verify the effectiveness of the two pretext tasks and of the DINO features.
Conclusion: Dynamics-aware and group-aware pretext tasks, combined with the local and global features provided by DINO, improve unsupervised GAF learning and suggest a new direction for few-shot or annotation-free group understanding.
Abstract: This paper proposes Group Activity Feature (GAF) learning without group activity annotations. Unlike prior work, which uses low-level static local features to learn GAFs, we propose leveraging dynamics-aware and group-aware pretext tasks, along with local and global features provided by DINO, for group-dynamics-aware GAF learning. To adapt DINO and GAF learning to local dynamics and global group features, our pretext tasks use person flow estimation and group-relevant object location estimation, respectively. Person flow estimation is used to represent the local motion of each person, which is an important cue for understanding group activities. In contrast, group-relevant object location estimation encourages GAFs to learn scene context (e.g., spatial relations of people and objects) as global features. Comprehensive experiments on public datasets demonstrate the state-of-the-art performance of our method in group activity retrieval and recognition. Our ablation studies verify the effectiveness of each component in our method. Code: https://github.com/tezuka0001/Group-DINOmics.
[223] Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks
Jia Chengyu,AprilPyone MaungMaung,Huy H. Nguyen,Jinyin Chen,Isao Echizen
Main category: cs.CV
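TL;DR: This paper presents a systematic evaluation framework for vision-language models (VLMs) under natural adversarial scenarios, covering image classification, semantic segmentation, and visual question answering. It finds that robust CLIP models can amplify natural adversarial vulnerabilities and that CLIP is especially sensitive to natural language-induced adversarial examples.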
Details
Motivation: Existing VLM evaluations focus mostly on standard benchmarks and lack comprehensive, independent assessment of robustness, limitations, and real-world applicability under natural adversarial scenarios.
Method: The authors build a systematic evaluation framework and test CLIP, robust CLIP, BLIP2, and SigLIP2 on representative natural adversarial datasets (typographic attacks, ImageNet-A, and natural language-induced adversarial examples) for zero-shot image classification, semantic segmentation, and visual question answering, together with interpretable failure-mode analysis.
Result: Robust CLIP can actually amplify natural adversarial vulnerabilities; CLIP performance drops significantly on natural language-induced adversarial examples; and different VLMs behave quite differently across natural adversarial scenarios.
Conclusion: Current VLMs still have significant robustness weaknesses under natural adversarial scenarios, which calls for research on fairer and more robust multimodal pattern recognition.
Abstract: Recent advances in vision-language models (VLMs) trained on web-scale image-text pairs have enabled impressive zero-shot transfer across a diverse range of visual tasks. However, comprehensive and independent evaluation beyond standard benchmarks is essential to understand their robustness, limitations, and real-world applicability. This paper presents a systematic evaluation framework for VLMs under natural adversarial scenarios for diverse downstream tasks, which has been overlooked in previous evaluation works. We evaluate a wide range of VLMs (CLIP, robust CLIP, BLIP2, and SigLIP2) on curated adversarial datasets (typographic attacks, ImageNet-A, and natural language-induced adversarial examples). We measure the natural adversarial performance of selected VLMs for zero-shot image classification, semantic segmentation, and visual question answering. Our analysis reveals that robust CLIP models can amplify natural adversarial vulnerabilities, and that the performance of CLIP models drops significantly on natural language-induced adversarial examples. Additionally, we provide interpretable analyses to identify failure modes. We hope our findings inspire future research in robust and fair multimodal pattern recognition.
[224] MVis-Fold: A Three-Dimensional Microvascular Structure Inference Model for Super-Resolution Ultrasound
Jincao Yao,Ke Zhang,Yahan Zhou,Jiafei Shen,Jie Liu,Mudassar Ali,Bojian Feng,Jiye Chen,Jinlong Fan,Ping Liang,Dong Xu
Main category: cs.CV
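TL;DR: This paper proposes MVis-Fold, a model with a cross-scale network architecture that reconstructs 3D microvascular networks with high fidelity from 2D super-resolution ultrasound (SRUS) images. It overcomes the limits of conventional 2D SRUS in quantifying 3D spatial parameters, has been validated for accuracy in solid tumors, and offers a new tool for 3D quantitative microvascular analysis and for disease diagnosis and treatment.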
Details
Motivation: 3D microvascular reconstruction from super-resolution ultrasound (SRUS) remains challenging, and conventional 2D SRUS cannot directly provide key parameters in 3D space, which limits quantitative analysis and clinical use.
Method: MVis-Fold is a 3D microvascular visualization and reconstruction model with a cross-scale network architecture that performs high-fidelity inference and reconstruction of 3D microvascular networks from 2D SRUS images.
Result: The model was validated for accuracy and reliability in 3D reconstruction of solid tumor microvasculature and can precisely compute key 3D spatial parameters that are hard to obtain with conventional 2D SRUS.
Conclusion: The study lays a foundation for 3D quantitative microvascular analysis and provides new tools and methods for disease diagnosis and monitoring.
Abstract: Super-resolution ultrasound (SRUS) technology has overcome the resolution limitations of conventional ultrasound, enabling micrometer-scale imaging of microvasculature. However, due to the nature of imaging principles, three-dimensional reconstruction of microvasculature from SRUS remains an open challenge. We developed microvascular visualization fold (MVis-Fold), an innovative three-dimensional microvascular reconstruction model that integrates a cross-scale network architecture. This model can perform high-fidelity inference and reconstruction of three-dimensional microvascular networks from two-dimensional SRUS images. It precisely calculates key parameters in three-dimensional space that traditional two-dimensional SRUS cannot readily obtain. We validated the model's accuracy and reliability in three-dimensional microvascular reconstruction of solid tumors. This study establishes a foundation for three-dimensional quantitative analysis of microvasculature. It provides new tools and methods for diagnosis and monitoring of various diseases.
[225] Training-Free Image Editing with Visual Context Integration and Concept Alignment
Rui Song,Guo-Hua Wang,Qing-Guo Chen,Weihua Luo,Tongda Xu,Zhening Liu,Yan Wang,Zehong Lin,Jun Zhang
Main category: cs.CV
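TL;DR: This paper proposes VicoEdit, a training-free and inversion-free image editing method that transforms the source image directly into the target image using visual context. It also designs a posterior sampling strategy guided by concept alignment to improve editing consistency, and it outperforms existing training-based methods.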
Details
Motivation: Existing training-based visual context-aware editing methods require heavy data collection and training cost, while training-free methods mostly rely on diffusion inversion and fall short in consistency and flexibility.
Method: VicoEdit is a training-free, inversion-free method that transforms the source image directly into the target image using the visual context, with a posterior sampling strategy guided by concept alignment to improve editing consistency.
Result: Experiments show that this training-free method edits even better than current state-of-the-art training-based models.
Conclusion: VicoEdit addresses both high training cost and the trajectory deviation caused by inversion, improving the consistency and quality of visual context-guided editing while keeping flexibility.
Abstract: In image editing, it is essential to incorporate a context image to convey the user's precise requirements, such as subject appearance or image style. Existing training-based visual context-aware editing methods incur data collection effort and training cost. On the other hand, the training-free alternatives are typically established on diffusion inversion, which struggles with consistency and flexibility. In this work, we propose VicoEdit, a training-free and inversion-free method to inject the visual context into the pretrained text-prompted editing model. More specifically, VicoEdit directly transforms the source image into the target one based on the visual context, thereby eliminating the need for inversion that can lead to deviated trajectories. Moreover, we design a posterior sampling approach guided by concept alignment to enhance the editing consistency. Empirical results demonstrate that our training-free method achieves even better editing performance than the state-of-the-art training-based models.
[226] A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
Tianmeng Fang,Yong Wang,Zetai Kong,Zengzhen Su,Jun Wang,Chengjin Yu,Wei Wang
Main category: cs.CV
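TL;DR: This paper proposes a unified defense framework based on patch augmentation and cross-view regularization to defend multimodal large language models against backdoor attacks. It suppresses attack success rates under low poisoning ratios while preserving the model's normal generation ability.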
Details
Motivation: Multimodal large language models are prone to backdoor implantation during supervised fine-tuning, and existing defenses struggle to suppress attacks while preserving benign performance.
Method: The framework combines patch-level data augmentation with cross-view output difference regularization and adds output entropy constraints to avoid over-suppression.
Result: Across three models, two tasks, and six attacks, the defense significantly lowers the attack success rate while maintaining a high level of normal text generation ability.
Conclusion: The method supports secure, controllable deployment of large-scale multimodal models in realistic settings such as low-frequency poisoning and covert triggers.
Abstract: Multimodal large language models have become an important infrastructure for unified processing of visual and linguistic tasks. However, such models are highly susceptible to backdoor implantation during supervised fine-tuning and will steadily output the attacker's predefined harmful responses once a specific trigger pattern is activated. The core challenge of backdoor defense lies in suppressing attack success under low poisoning ratios while preserving the model's normal generation ability. These two objectives are inherently conflicting. Strong suppression often degrades benign performance, whereas weak regularization fails to mitigate backdoor behaviors. To this end, we propose a unified defense framework based on patch augmentation and cross-view regularity, which simultaneously constrains the model's anomalous behaviors in response to triggered patterns from both the feature representation and output distribution levels. Specifically, patch-level data augmentation is combined with cross-view output difference regularization to exploit the fact that backdoor responses are abnormally invariant to non-semantic perturbations and to proactively pull apart the output distributions of the original and perturbed views, thereby significantly suppressing the success rate of backdoor triggering. At the same time, we avoid over-suppression of the model during defense by imposing output entropy constraints, ensuring the quality of normal command generation. Experimental results across three models, two tasks, and six attacks show that our proposed defense method effectively reduces the attack success rate while maintaining a high level of normal text generation capability.
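As a rough illustration of the two ingredients named in this abstract, the sketch below computes a penalty from the output distributions of an original view and a patch-perturbed view. The abstract does not state the loss, so the hinge on KL divergence, the entropy cap, and the names `margin`, `max_entropy`, `lam`, and `mu` are all assumptions chosen for illustration, not the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def entropy(p, eps=1e-12):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def defense_penalty(logits_orig, logits_patched,
                    margin=0.5, max_entropy=1.0, lam=1.0, mu=0.1):
    """Hypothetical regularizer: hinge on cross-view divergence plus an entropy cap."""
    p = softmax(logits_orig)
    q = softmax(logits_patched)
    # Outputs that stay (nearly) identical under patch perturbation are penalized,
    # following the abstract's point that backdoor responses are abnormally invariant.
    separation = max(0.0, margin - kl_divergence(p, q))
    # Assumed form of the entropy constraint: discourage near-uniform outputs.
    entropy_excess = max(0.0, entropy(p) - max_entropy)
    return lam * separation + mu * entropy_excess


invariant = defense_penalty([5.0, 0.0, 0.0], [5.0, 0.0, 0.0])
changed = defense_penalty([5.0, 0.0, 0.0], [0.0, 5.0, 0.0])
print(invariant, changed)
```

With these toy logits, the fully invariant pair incurs the whole margin as penalty, while the pair whose prediction shifts under perturbation incurs none. A hinge is used here only so the separation term stays bounded; the paper may well use a different divergence or weighting.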
Our work enables the secure, controlled deployment of large-scale multimodal models in realistic low-frequency poisoning and covert triggering scenarios.
[227] The Indra Representation Hypothesis for Multimodal Alignment
Jianglin Lu,Hailing Wang,Kuo Yang,Yitian Zhang,Simon Jenni,Yun Fu
Main category: cs.CV
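TL;DR: This paper proposes the Indra Representation Hypothesis, which holds that representations of unimodal foundation models converge to reflect a shared relational structure underlying reality. It formalizes the hypothesis with the V-enriched Yoneda embedding from category theory, instantiates it with angular distance, and shows improved robustness and alignment on cross-modal tasks spanning vision, language, and audio.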
Details
Motivation: Unimodal foundation models learn convergent representations, but these internal abstractions characterize each sample independently and have limited expressiveness. Inspired by the philosophical metaphor of Indra's Net, the authors explore the shared relational structure behind the representations.
Method: The Indra Representation Hypothesis is formalized with the V-enriched Yoneda embedding from category theory, which defines the Indra representation as the relational profile of a sample with respect to other samples, instantiated with angular distance.
Result: Indra representations consistently improve robustness and alignment in cross-model and cross-modal settings (vision, language, audio), giving a training-free framework for aligning unimodal foundation models.
Conclusion: The hypothesis offers a new perspective on representation convergence in unimodal models. It is theoretically grounded and practically effective, supporting training-free alignment across architectures and modalities.
Abstract: Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra's Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.
[228] Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Shizhan Gong,Minda Hu,Qiyuan Zhang,Chen Ma,Qi Dou
Main category: cs.CV
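TL;DR: This paper proposes the Saliency-R1 framework, which uses a novel saliency map technique to improve the interpretability and faithfulness of vision-language model (VLM) reasoning without additional computational overhead. It uses the overlap between human-annotated boxes and saliency maps as the reward and aligns the model with GRPO; experiments show gains in reasoning faithfulness, interpretability, and task performance.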
Details
Motivation: The work targets trustworthiness problems in vision-language models (VLMs): over-reliance on textual cues, neglect of visual evidence, and hallucinated responses that are not grounded in the image.
Method: Saliency-R1 includes a technique that efficiently generates token-level image saliency maps and traces how visual information flows through the reasoning process. It uses the overlap between saliency maps and human-annotated bounding boxes as the reward and applies Group Relative Policy Optimization (GRPO) for alignment.
Result: Saliency-R1 improves the reasoning faithfulness, interpretability, and overall task performance of VLMs on multiple benchmarks.
Conclusion: The method strengthens VLMs' reliance on and understanding of visual content, offering a practical route to more trustworthy and interpretable multimodal reasoning systems.
Abstract: Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient parts and critical regions, encouraging models to focus on relevant areas when reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
[229] MedROI: Codec-Agnostic Region of Interest-Centric Compression for Medical Images
Jiwon Kim,Ikbeom Jang
Main category: cs.CV
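TL;DR: MedROI is a codec-agnostic, plug-and-play, ROI-centric compression framework for medical images. It crops away non-diagnostic background and keeps a fixed-size metadata record, which substantially improves compression ratio and encoding/decoding speed while preserving reconstruction quality within the ROI.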
Details
Motivation: Medical imaging data is growing rapidly, but existing compression methods do not focus on diagnostically relevant regions of interest (ROIs) and still keep large amounts of non-diagnostic background, making storage and transmission inefficient.
Method: MedROI extracts a tight tissue bounding box with lightweight intensity-based thresholding, discards background voxels, and stores a 54-byte metadata record for spatial restoration at decompression. The cropped ROI can be fed directly to any conventional or neural 2D/3D codec without architectural changes or retraining.
Result: On 200 T1-weighted brain MRI volumes from ADNI, MedROI significantly improves compression ratio and encoding/decoding speed for most codec configurations (two-sided t-test with multiple-comparison correction) while keeping comparable reconstruction quality within the ROI. For example, with JPEG2000 2D (lv3), the compression ratio rises from 20.35 to 27.37 and compression time falls from 1.701 s to 1.380 s.
Conclusion: As a general, lightweight, plug-and-play ROI preprocessing framework, MedROI improves the efficiency of a range of medical image codecs and suits clinical image archiving and remote transmission.
Abstract: Medical imaging archives are growing rapidly in both size and resolution, making efficient compression increasingly important for storage and data transfer. Most existing codecs compress full images/volumes (including non-diagnostic background) or apply differential ROI coding that still preserves background bits. We propose MedROI, a codec-agnostic, plug-and-play ROI-centric framework that discards background voxels prior to compression. MedROI extracts a tight tissue bounding box via lightweight intensity-based thresholding and stores a fixed 54-byte metadata record to enable spatial restoration during decompression. The cropped ROI is then compressed using any existing 2D or 3D codec without architectural modifications or retraining. We evaluate MedROI on 200 T1-weighted brain MRI volumes from ADNI using 6 codec configurations spanning conventional codecs (JPEG2000 2D/3D, HEIF) and neural compressors (LIC_TCM, TCM+AuxT, BCM-Net, SirenMRI). MedROI yields statistically significant improvements in compression ratio and encoding/decoding time for most configurations (two-sided t-test with multiple-comparison correction), while maintaining comparable reconstruction quality when measured within the ROI; HEIF is the primary exception in compression-ratio gains. For example, on JPEG2000 2D (lv3), MedROI improves CR from 20.35 to 27.37 while reducing average compression time from 1.701 s to 1.380 s. Code is available at https://github.com/labhai/MedROI.
[230] MPTF-Net: Multi-view Pyramid Transformer Fusion Network for LiDAR-based Place Recognition
Shuyuan Li,Zihang Wang,Xieyuanli Chen,Wenkai Zhu,Xiaoteng Fang,Peizhou Ni,Junhao Yang,Dong Kong
Main category: cs.CV
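TL;DR: This paper proposes MPTF-Net, a LiDAR place recognition method based on multi-view, multi-scale pyramid Transformer fusion. It strengthens BEV encoding with NDT and fuses RIV and NDT-BEV features, reaching SOTA performance on several datasets while meeting real-time requirements.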
Details
Motivation: Existing BEV-based LiDAR place recognition methods rely on simple statistical aggregation, which fails to capture fine-grained geometric structure and degrades performance in complex or repetitive scenes. Method: MPTF-Net combines (1) a multi-channel BEV encoding based on the Normal Distribution Transform (NDT) that explicitly models local geometric complexity and intensity distributions, and (2) a customized pyramid Transformer module that fuses Range Image View (RIV) and NDT-BEV features across views and at multiple scales. Result: SOTA results on nuScenes, KITTI, and NCLT; Recall@1 reaches 96.31% on the nuScenes Boston split with an inference latency of only 10.02 ms. Conclusion: By introducing a geometry-aware NDT-BEV representation and a multi-view, multi-scale Transformer fusion mechanism, MPTF-Net significantly improves the robustness and accuracy of LiDAR place recognition while remaining real-time, making it suitable for autonomous unmanned systems. Abstract: LiDAR-based place recognition (LPR) is essential for global localization and loop-closure detection in large-scale SLAM systems. Existing methods typically construct global descriptors from Range Images or BEV representations for matching. BEV is widely adopted due to its explicit 2D spatial layout encoding and efficient retrieval. However, conventional BEV representations rely on simple statistical aggregation, which fails to capture fine-grained geometric structures, leading to performance degradation in complex or repetitive environments. To address this, we propose MPTF-Net, a novel multi-view multi-scale pyramid Transformer fusion network. Our core contribution is a multi-channel NDT-based BEV encoding that explicitly models local geometric complexity and intensity distributions via Normal Distribution Transform, providing a noise-resilient structural prior. To effectively integrate these features, we develop a customized pyramid Transformer module that captures cross-view interactive correlations between Range Image Views (RIV) and NDT-BEV at multiple spatial scales. Extensive experiments on the nuScenes, KITTI and NCLT datasets demonstrate that MPTF-Net achieves state-of-the-art performance, specifically attaining a Recall@1 of 96.31% on the nuScenes Boston split while maintaining an inference latency of only 10.02 ms, making it highly suitable for real-time autonomous unmanned systems.[231] StableTTA: Training-Free Test-Time Adaptation that Improves Model Accuracy on ImageNet1K to 96%
Zheng Li,Jerry Cheng,Huanying Helen Gu
Main category: cs.CV
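TL;DR: This paper proposes StableTTA, a training-free method for stable test-time aggregation that markedly improves the prediction stability and efficiency of ensembles. On ImageNet-1K it raises accuracy substantially while sharply reducing parameter count and compute.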
Details
Motivation: Ensemble methods improve predictive performance but usually increase memory and compute cost, and conflicts between aggregation strategies can make predictions unstable. Method: StableTTA is a training-free test-time aggregation method designed to mitigate conflicts between aggregation strategies and to improve stability and efficiency. Result: On ImageNet-1K, top-1 accuracy improves by 10.93–32.82%; 33 models exceed 95% accuracy and several exceed 96%; lightweight architectures use less than 5% of the parameters and about 89.1% fewer GFLOPs, yet outperform ViT by 11.75%. Conclusion: Without adding training cost, StableTTA significantly improves the accuracy, stability, and deployment efficiency of ensembles, and is particularly suited to resource-constrained devices. Abstract: Ensemble methods are widely used to improve predictive performance, but their effectiveness often comes at the cost of increased memory usage and computational complexity. In this paper, we identify a conflict in aggregation strategies that negatively impacts prediction stability. We propose StableTTA, a training-free method to improve aggregation stability and efficiency. Empirical results on ImageNet-1K show gains of 10.93–32.82% in top-1 accuracy, with 33 models achieving over 95% accuracy and several surpassing 96%. Notably, StableTTA allows lightweight architectures to outperform ViT by 11.75% in top-1 accuracy while using less than 5% of parameters and reducing computational cost by approximately 89.1% (in GFLOPs), enabling high-accuracy inference on resource-constrained devices.[232] Relational Epipolar Graphs for Robust Relative Camera Pose Estimation
Prateeth Rao,Sachit Rao
Main category: cs.CV
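TL;DR: This paper proposes a graph-neural-network approach to relative pose estimation. It casts keypoint matching as relational reasoning over a graph with geometric constraints and regresses rotation, translation, and the essential matrix directly through graph operations, which makes it more robust to noise and large baselines.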
Details
Motivation: Relative pose estimation in conventional VSLAM is badly affected by noisy matches; classical methods depend on random sampling, and learning-based methods often lack explicit geometric structure. Method: Matched keypoints are assembled into an epipolar correspondence graph (keypoints are nodes, nearby points are joined by edges); graph pruning, message passing, and pooling estimate a quaternion rotation, a translation vector, and the essential matrix. LoFTR provides detector-free dense matching. Five losses are optimized jointly: L2 pose error, Frobenius norm of the essential-matrix difference, singular-value difference, heading-angle difference, and scale difference. Result: On indoor and outdoor benchmarks, the method is more robust to dense noise and large baseline variation than classical and learning-guided approaches, supporting the value of global relational consensus. Conclusion: Modeling relative pose estimation as relational reasoning on a graph, together with a multi-objective geometric-consistency loss, improves accuracy and robustness, especially in challenging scenes. Abstract: A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses using matched keypoints. Accurate estimation is challenged by noisy correspondences. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, where matched keypoints are nodes and nearby ones are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a quaternion rotation, translation vector, and the Essential Matrix (EM). Minimizing a loss comprising (i) $\mathcal{L}_2$ differences with ground truth (GT), (ii) Frobenius norm between estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences, yields the relative pose between image pairs. The dense detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.[233] Temporal Inversion for Learning Interval Change in Chest X-Rays
Hanbin Ko,Kyeongmin Jeon,Doowoong Choi,Chang Min Park
Main category: cs.CV
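TL;DR: This paper proposes TILA, a framework that uses temporal inversion as a supervisory signal to make models more sensitive to the direction of change between chest X-rays (CXRs), improving disease progression classification and temporal embedding alignment.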
Details
Motivation: Most existing medical vision-language pretraining models analyze images in isolation and overlook the key clinical task of comparing prior and current images to assess how findings change; for chest X-rays in particular, capturing interval change is essential. Method: TILA (Temporal Inversion-aware Learning and Alignment) uses temporal inversion (swapping the order of an image pair) as a supervisory signal, adds inversion-aware objectives to pretraining, fine-tuning, and inference, and introduces a unified evaluation protocol together with the MS-CXR-Tretrieval retrieval set. Result: Experiments on several public datasets and real-world hospital cohorts show that TILA consistently improves progression classification accuracy and temporal embedding alignment across multiple existing architectures. Conclusion: By explicitly modeling temporal order, TILA addresses the weakness of existing models in understanding temporal change and offers a new approach to dynamic analysis of medical images. Abstract: Recent advances in vision-language pretraining have enabled strong medical foundation models, yet most analyze radiographs in isolation, overlooking the key clinical task of comparing prior and current images to assess interval change. For chest radiographs (CXRs), capturing interval change is essential, as radiologists must evaluate not only the static appearance of findings but also how they evolve over time. We introduce TILA (Temporal Inversion-aware Learning and Alignment), a simple yet effective framework that uses temporal inversion, reversing image pairs, as a supervisory signal to enhance the sensitivity of existing temporal vision-language models to directional change. TILA integrates inversion-aware objectives across pretraining, fine-tuning, and inference, complementing conventional appearance modeling with explicit learning of temporal order. We also propose a unified evaluation protocol to assess order sensitivity and consistency under temporal inversion, and introduce MS-CXR-Tretrieval, a retrieval evaluation set constructed through a general protocol that can be applied to any temporal CXR dataset. Experiments on public datasets and real-world hospital cohorts demonstrate that TILA consistently improves progression classification and temporal embedding alignment when applied to multiple existing architectures.[234] TAPE: A two-stage parameter-efficient adaptation framework for foundation models in OCT-OCTA analysis
Xiaofei Su,Zengshuo Wang,Minghe Sun,Xin Zhao,Mingzhu Sun
Main category: cs.CV
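TL;DR: This paper proposes TAPE, a framework that uses two-stage parameter-efficient fine-tuning (PEFT) to address domain shift and task misalignment when foundation models are applied to OCT/OCTA image segmentation. On retinal layer segmentation it achieves SOTA generalization with high parameter efficiency.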
Details
Motivation: Methods trained from scratch depend on large datasets and large models, which makes them hard to deploy in resource-constrained clinical settings; transfer learning from foundation models, in turn, faces domain shift and task misalignment. Method: TAPE is a two-stage adaptation framework. The first stage applies PEFT within masked image modeling to align the model with the medical imaging domain; the second stage adapts it to the downstream segmentation task. It is applied to two foundation models, MAE and RETFound. Result: On retinal layer segmentation, TAPE achieves the best generalization across a range of pathologies while substantially improving parameter efficiency. Conclusion: TAPE eases the domain and task adaptation problems that foundation models face in OCT/OCTA analysis and offers an efficient, practical option for resource-constrained clinical settings. Abstract: Automated analysis of optical coherence tomography (OCT) and OCT angiography (OCTA) images is critical for robust ophthalmic diagnosis. Existing mainstream methods trained from scratch rely heavily on massive data and model scale, thereby hindering their practical deployment in resource-constrained clinical settings. Although transfer learning based on foundation models (FMs) is promising, it still faces significant challenges: domain shift and task misalignment. To address these, we propose TAPE: A Two-stage Adaptation Framework via Parameter-Efficient Fine-tuning, which strategically decouples adaptation into domain alignment and task fitting for downstream segmentation. The domain adaptation stage notably applies parameter-efficient fine-tuning (PEFT) in the context of masked image modeling for medical image domain adaptation, a novel approach to the best of our knowledge. Applied to retinal layer segmentation on both universal (masked auto-encoder, MAE) and specialized (RETFound) FMs, TAPE demonstrates superior parameter efficiency and achieves state-of-the-art generalization performance across diverse pathologies.[235] Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models
Arian Komaei Koma,Seyed Amir Kasaei,Ali Aghayari,AmirMahdi Sadeghzadeh,Mohammad Hossein Rohban
Main category: cs.CV
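TL;DR: This paper systematically studies how concept unlearning in text-to-image diffusion models affects compositional generation and finds a trade-off between unlearning effectiveness and compositional integrity.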
Details
Motivation: Prior work focuses mainly on erasure success and overlooks the effect of unlearning on overall generative capability, compositional generation in particular. Method: Using Stable Diffusion 1.4 as the base model and nudity removal as the task, the study empirically evaluates a range of state-of-the-art post-hoc unlearning methods on benchmarks including T2I-CompBench++ and GenEval. Result: Methods with strong erasure often cause marked degradation in attribute binding, spatial reasoning, and counting, while methods that preserve compositional structure tend to erase incompletely. Conclusion: Current unlearning evaluation practice is limited; new unlearning objectives are needed that balance suppression of the target concept with preservation of semantic integrity. Abstract: Post-hoc unlearning has emerged as a practical mechanism for removing undesirable concepts from large text-to-image diffusion models. However, prior work primarily evaluates unlearning through erasure success; its impact on broader generative capabilities remains poorly understood. In this work, we conduct a systematic empirical study of concept unlearning through the lens of compositional text-to-image generation. Focusing on nudity removal in Stable Diffusion 1.4, we evaluate a diverse set of state-of-the-art unlearning methods using T2I-CompBench++ and GenEval, alongside established unlearning benchmarks. Our results reveal a consistent trade-off between unlearning effectiveness and compositional integrity: methods that achieve strong erasure frequently incur substantial degradation in attribute binding, spatial reasoning, and counting. Conversely, approaches that preserve compositional structure often fail to provide robust erasure. These findings highlight limitations of current evaluation practices and underscore the need for unlearning objectives that explicitly account for semantic preservation beyond targeted suppression.[236] PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
Inseong Choi,Siwoo Lee,Seung-Hun Nam,Soohwan Song
Main category: cs.CV
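TL;DR: This paper proposes Partial-Reference Image Quality Assessment (PR-IQA), a framework that assesses the quality of sparse-view images generated by diffusion models without ground truth, and applies it to 3D Gaussian Splatting (3DGS) reconstruction, markedly improving reconstruction and novel view synthesis.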
Details
Motivation: Diffusion models are promising for sparse-view novel view synthesis, but the images they generate contain photometric and geometric inconsistencies, and using them directly for supervision harms 3D reconstruction; most existing image quality assessment methods either require ground truth or cannot account for cross-view consistency. Method: PR-IQA uses reference images (real images from different viewpoints) to compute a geometrically consistent partial quality map over overlapping regions, then completes it into a dense, full-image quality map with a cross-attention mechanism that incorporates reference-view context; the quality map guides 3DGS training so that supervision is applied only in high-quality regions. Result: When integrated into a diffusion-augmented 3DGS pipeline, PR-IQA reaches full-reference-level IQA accuracy without ground-truth supervision, clearly outperforms existing IQA methods, and improves 3D reconstruction and novel view synthesis quality. Conclusion: PR-IQA provides ground-truth-free, cross-view-consistent image quality assessment and a reliable quality-aware supervision mechanism for 3D reconstruction based on diffusion priors. Abstract: Diffusion models are promising for sparse-view novel view synthesis (NVS), as they can generate pseudo-ground-truth views to aid 3D reconstruction pipelines like 3D Gaussian Splatting (3DGS). However, these synthesized images often contain photometric and geometric inconsistencies, and their direct use for supervision can impair reconstruction. To address this, we propose Partial-Reference Image Quality Assessment (PR-IQA), a framework that evaluates diffusion-generated views using reference images from different poses, eliminating the need for ground truth. PR-IQA first computes a geometrically consistent partial quality map in overlapping regions. It then performs quality completion to inpaint this partial map into a dense, full-image map. This completion is achieved via a cross-attention mechanism that incorporates reference-view context, ensuring cross-view consistency and enabling thorough quality assessment. When integrated into a diffusion-augmented 3DGS pipeline, PR-IQA restricts supervision to high-confidence regions identified by its quality maps. Experiments demonstrate that PR-IQA outperforms existing IQA methods, achieving full-reference-level accuracy without ground-truth supervision. Thus, our quality-aware 3DGS approach more effectively filters inconsistencies, producing superior 3D reconstructions and NVS results. The project page is available at https://kakaomacao.github.io/pr-iqa-project-page/.[237] Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
Quoc-Huy Trinh,Mustapha Abdullahi,Bo Zhao,Debesh Jha
Main category: cs.CV
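TL;DR: Firebolt-VL is an efficient vision-language model that replaces the conventional Transformer decoder with a linear-complexity Liquid Foundation Model (LFM) decoder and adds a Token-Grid Correlation module to strengthen fine-grained visual grounding, keeping performance high while significantly improving inference efficiency.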
Details
Motivation: Existing multimodal large models are computationally expensive, Transformer cross-attention has quadratic complexity, and small models struggle to capture fine-grained, task-relevant visual regions precisely, which limits practical use in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Method: Firebolt-VL (1) replaces the Transformer decoder with a Liquid Foundation Model (LFM) decoder for linear-time inference, and (2) introduces a Token-Grid Correlation Module that models lightweight correlations between text tokens and image patches and combines a state-space model with FiLM conditioning to strengthen visual grounding. Result: Across multiple benchmarks, Firebolt-VL achieves high accuracy on fine-grained vision-language understanding while significantly improving inference efficiency (linear complexity), outperforming comparable efficient models. Conclusion: Firebolt-VL balances performance and efficiency, offers a workable solution for vision-language understanding in resource-constrained settings, and supports the effectiveness of the LFM architecture and the Token-Grid Correlation mechanism for efficient multimodal modeling. Abstract: Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates via the state-space model with FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io[238] Beyond Semantics: Uncovering the Physics of Fakes via Universal Physical Descriptors for Cross-Modal Synthetic Detection
Mei Qiu,Jianqiang Zhao,Yanyun Qu
Main category: cs.CV
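TL;DR: This paper proposes a deepfake detection method based on physical features. It selects five pixel-level physical features that are robust across generative models (such as Laplacian variance and Sobel statistics), encodes them as text, and integrates them into CLIP, significantly improving AIGC image detection and reaching 99.8% accuracy on several benchmarks.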
Details
Motivation: Existing deepfake detectors tend to overfit to specific generative models and do not model the intrinsic physical differences between natural and AI-generated images; stable physical discriminative features that hold across architectures are needed, along with a way to integrate them into multimodal models for greater reliability. Method: The study systematically evaluates the discriminative power of 15 physical features on more than 20 datasets (covering GANs and diffusion models), proposes a new feature selection algorithm that picks five robust core features, converts these features into text-encoded values, and combines them with semantic captions to guide image-text representation learning in CLIP. Result: SOTA performance on multiple benchmarks including Genimage, with 99.8% accuracy on the Wukong and SDv1.4 datasets; the results indicate that physical features strengthen the ability and trustworthiness of multimodal models in recognizing AIGC. Conclusion: Pixel-level physical features are stable across generative models and can serve as a basis for trustworthy multimodal modeling; the work is the first to embed physical authenticity into CLIP-style models and offers a new route to mitigating hallucination and textual inaccuracy in large models. Abstract: The rapid advancement of AI-generated content (AIGC) has blurred the boundaries between real and synthetic images, exposing the limitations of existing deepfake detectors that often overfit to specific generative models. This adaptability crisis calls for a fundamental reexamination of the intrinsic physical characteristics that distinguish natural from AI-generated images. In this paper, we address two critical research questions: (1) What physical features can stably and robustly discriminate AI-generated images across diverse datasets and generative architectures? (2) Can these objective pixel-level features be integrated into multimodal models like CLIP to enhance detection performance while mitigating the unreliability of language-based information? To answer these questions, we conduct a comprehensive exploration of 15 physical features across more than 20 datasets generated by various GANs and diffusion models. We propose a novel feature selection algorithm that identifies five core physical features, including Laplacian variance, Sobel statistics, and residual noise variance, that exhibit consistent discriminative power across all tested datasets. These features are then converted into text-encoded values and integrated with semantic captions to guide image-text representation learning in CLIP. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple Genimage benchmarks, with near-perfect accuracy (99.8%) on datasets such as Wukong and SDv1.4.
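As a rough illustration of the kind of pixel-level descriptors named above, the sketch below computes Laplacian variance and the mean Sobel gradient magnitude for a small grayscale image stored as nested lists. The function names, the 3x3 kernels, the interior-only border handling, and the text template in `describe` are all assumptions made for illustration; the abstract does not specify how the features are computed or how they are encoded as text.

```python
from statistics import mean, pvariance

# Standard 3x3 kernels (assumed; the paper's exact operators are not stated here).
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(img, kernel):
    """Correlate a 2D list with a 3x3 kernel over interior pixels only."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out.append(sum(kernel[j][i] * img[y + j - 1][x + i - 1]
                           for j in range(3) for i in range(3)))
    return out


def laplacian_variance(img):
    """Variance of the Laplacian response, a common sharpness/noise statistic."""
    return pvariance(filter3x3(img, LAPLACIAN))


def sobel_mean_magnitude(img):
    """Mean gradient magnitude from horizontal and vertical Sobel responses."""
    gx = filter3x3(img, SOBEL_X)
    gy = filter3x3(img, SOBEL_Y)
    return mean((a * a + b * b) ** 0.5 for a, b in zip(gx, gy))


def describe(img):
    """Hypothetical text encoding of the descriptors, to sit next to a caption."""
    return (f"laplacian variance {laplacian_variance(img):.2f}, "
            f"sobel mean {sobel_mean_magnitude(img):.2f}")
```

On a perfectly linear horizontal ramp the Laplacian response is zero everywhere and the Sobel magnitude is constant, which makes a convenient sanity check before applying the functions to real images.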
By bridging pixel-level authenticity with semantic understanding, this work pioneers the use of physically grounded features for trustworthy vision-language modeling and opens new directions for mitigating hallucinations and textual inaccuracies in large multimodal models.[239] Multimodal Backdoor Attack on VLMs for Autonomous Driving via Graffiti and Cross-Lingual Triggers
Jiancheng Wang,Lidan Liang,Yong Wang,Zengzhen Su,Haifeng Xia,Yuanting Yan,Wei Wang
Main category: cs.CV
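TL;DR: This paper proposes the GLA attack, which uses graffiti-based visual patterns and cross-lingual text triggers to mount a stealthy, high-success-rate backdoor attack on vision-language models for autonomous driving. The attack does not harm performance on clean samples and even improves some metrics, which lets it evade conventional detection.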
Details
Motivation: Existing backdoor attacks rely on unimodal, explicit, easily detected triggers, which makes it hard to build attack channels that are both covert and stable in safety-critical settings such as autonomous driving. Method: GLA introduces two naturalistic triggers: graffiti-style visual patterns generated with Stable Diffusion inpainting that blend seamlessly into urban street scenes, and cross-lingual text triggers that preserve semantics while inducing a distribution shift, providing a robust language-side trigger signal. Result: Experiments on DriveVLM show that a 10% poisoning ratio is enough to reach a 90% attack success rate (ASR) with a 0% false positive rate (FPR); the backdoor does not weaken the model on clean tasks and instead improves metrics such as BLEU-1. Conclusion: The study exposes an underestimated security threat in autonomous-driving VLMs and provides a new attack paradigm for backdoor evaluation of safety-critical multimodal systems. Abstract: Visual language models (VLMs) are rapidly being integrated into safety-critical systems such as autonomous driving, making them an important attack surface for potential backdoor attacks. Existing backdoor attacks mainly rely on unimodal, explicit, and easily detectable triggers, making it difficult to construct both covert and stable attack channels in autonomous driving scenarios. GLA introduces two naturalistic triggers: graffiti-based visual patterns generated via Stable Diffusion inpainting, which seamlessly blend into urban scenes, and cross-language text triggers, which introduce distributional shifts while maintaining semantic consistency to build robust language-side trigger signals. Experiments on DriveVLM show that GLA requires only a 10% poisoning ratio to achieve a 90% Attack Success Rate (ASR) and a 0% False Positive Rate (FPR). More insidiously, the backdoor does not weaken the model on clean tasks, but instead improves metrics such as BLEU-1, making it difficult for traditional performance-degradation-based detection methods to identify the attack. This study reveals underestimated security threats in self-driving VLMs and provides a new attack paradigm for backdoor evaluation in safety-critical multimodal systems.[240] InCTRLv2: Generalist Residual Models for Few-Shot Anomaly Detection and Segmentation
Jiawen Zhu,Mengjia Niu,Guansong Pang
Main category: cs.CV
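TL;DR: This paper proposes InCTRLv2, a new few-shot Generalist Anomaly Detection and Segmentation (GADS) framework. Its dual-branch design (DASL and OASL), combined with semantic priors from vision-language models, achieves SOTA performance on multiple datasets.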
Details
Motivation: Most existing anomaly detection methods are domain-specific specialist models that generalize poorly to unseen datasets, which motivates a Generalist Anomaly Detection (GAD) paradigm that transfers across domains without retraining. Method: Building on the earlier InCTRL, the framework has two branches: the main branch uses Discriminative Anomaly Score Learning (DASL), which models normal and abnormal samples jointly; the auxiliary branch uses One-class Anomaly Score Learning (OASL), which learns generalized normality patterns from normal samples only. Both branches incorporate visual-text semantic priors from large-scale vision-language models. Result: On ten anomaly detection datasets, InCTRLv2 achieves state-of-the-art (SotA) performance in both anomaly detection and segmentation. Conclusion: Through its two semantic perspectives (normal-abnormal discrimination versus deviation from normality), InCTRLv2 improves the generalization and robustness of generalist anomaly detection and supports the value of combining vision-language priors with few-shot learning. Abstract: While recent anomaly detection (AD) methods have made substantial progress in recognizing abnormal patterns within specific domains, most of them are specialist models that are trained on large training samples from a specific target dataset, struggling to generalize to unseen datasets. To address this limitation, the paradigm of Generalist Anomaly Detection (GAD) has emerged in recent years, aiming to learn a single generalist model to detect anomalies across diverse domains without retraining. To this end, this work introduces InCTRLv2, a novel few-shot Generalist Anomaly Detection and Segmentation (GADS) framework that significantly extends our previously proposed GAD model, InCTRL. Building on the idea of learning in-context residuals with few-shot normal examples to detect anomalies as in InCTRL, InCTRLv2 introduces two new, complementary perspectives of anomaly perception under a dual-branch framework. This is accomplished by two novel modules upon InCTRL: i) Discriminative Anomaly Score Learning (DASL) with both normal and abnormal data in the main branch, which learns a semantic-guided abnormality and normality space that supports the classification of query samples from both the abnormality and normality perspectives; and ii) One-class Anomaly Score Learning (OASL) using only the normal data, which learns generalized normality patterns in a semantic space via an auxiliary branch, focusing on detecting anomalies through the lens of normality solely. Both branches are guided by rich visual-text semantic priors encoded by large-scale vision-language models.
Together, they offer a dual semantic perspective for AD: one emphasizes normal-abnormal discriminations, while the other emphasizes normality-deviated semantics. Extensive experiments on ten AD datasets demonstrate that InCTRLv2 achieves SotA performance in both anomaly detection and segmentation tasks across various settings.[241] Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale
Zhengcen Li,Chenyang Jiang,Hang Zhao,Shiyang Zhou,Yunyang Mo,Feng Gao,Fan Yang,Qiben Shan,Shaocong Wu,Jingyong Su
Main category: cs.CV
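TL;DR: This paper presents a new framework and a large-scale dataset for detecting AI-generated video. By processing raw video directly at variable spatiotemporal scales, it avoids the loss of high-frequency forgery traces caused by conventional preprocessing and significantly improves detection performance.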
Details
Motivation: Existing detection methods depend on fixed-resolution preprocessing, which tends to discard high-frequency forgery traces, and their training and test datasets are outdated and cannot keep up with the latest generative models. Method: The authors build a new large-scale dataset of more than 140K videos (including the Magic Videos benchmark) and design a detection framework on the Qwen2.5-VL Vision Transformer that supports native variable spatial and temporal resolution. Result: The method reaches SOTA performance on multiple benchmarks, confirming that native-scale processing is key to preserving high-frequency forgery cues and spatiotemporal inconsistencies. Conclusion: Native-scale video processing is a key route to more robust detection of AI-generated video, and this work sets a new baseline for the field. Abstract: The rapid advancement of video generation models has enabled the creation of highly realistic synthetic media, raising significant societal concerns regarding the spread of misinformation. However, current detection methods suffer from critical limitations. They rely on preprocessing operations like fixed-resolution resizing and cropping. These operations not only discard subtle, high-frequency forgery traces but also cause spatial distortion and significant information loss. Furthermore, existing methods are often trained and evaluated on outdated datasets that fail to capture the sophistication of modern generative models. To address these challenges, we introduce a comprehensive dataset and a novel detection framework. First, we curate a large-scale dataset of over 140K videos from 15 state-of-the-art open-source and commercial generators, along with the Magic Videos benchmark designed specifically for evaluating ultra-realistic synthetic content. In addition, we propose a novel detection framework built on the Qwen2.5-VL Vision Transformer, which operates natively at variable spatial resolutions and temporal durations. This native-scale approach effectively preserves the high-frequency artifacts and spatiotemporal inconsistencies typically lost during conventional preprocessing. Extensive experiments demonstrate that our method achieves superior performance across multiple benchmarks, underscoring the critical importance of native-scale processing and establishing a robust new baseline for AI-generated video detection.[242] Training-Free Refinement of Flow Matching with Divergence-based Sampling
Yeonwoo Cha,Jaehoon Yoo,Semin Kim,Yunseo Park,Jinhyeon Kwon,Seunghoon Hong
Main category: cs.CV
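TL;DR: This paper proposes the Flow Divergence Sampler (FDS), a training-free framework that refines intermediate states before each solver step using the divergence of the velocity field. This mitigates the problem of the averaged velocity in flow models steering samples into low-density regions, and so improves generation quality.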
Details
Motivation: In flow models, when sample-wise velocities conflict at an intermediate state, their average can steer samples into low-density regions and degrade generation quality. Method: FDS uses the divergence of the marginal velocity field as a computable signal at inference time to apply a training-free, dynamic correction to intermediate states, steering samples toward less ambiguous regions. Result: As a plug-and-play module compatible with standard solvers and existing flow backbones, FDS consistently improves generation fidelity on tasks such as text-to-image synthesis and inverse problems. Conclusion: Velocity-field divergence is an effective indicator for assessing and mitigating generation bias in flow models, and FDS offers a simple, general, and efficient post-hoc improvement. Abstract: Flow-based models learn a target distribution by modeling a marginal velocity field, defined as the average of sample-wise velocities connecting each sample from a simple prior to the target data. When sample-wise velocities conflict at the same intermediate state, however, this averaged velocity can misguide samples toward low-density regions, degrading generation quality. To address this issue, we propose the Flow Divergence Sampler (FDS), a training-free framework that refines intermediate states before each solver step. Our key finding reveals that the severity of this misguidance is quantified by the divergence of the marginal velocity field, which is readily computable during inference with a well-optimized model. FDS exploits this signal to steer states toward less ambiguous regions. As a plug-and-play framework compatible with standard solvers and off-the-shelf flow backbones, FDS consistently improves fidelity across various generation tasks including text-to-image synthesis and inverse problems.[243] Synthesis4AD: Synthetic Anomalies are All You Need for 3D Anomaly Detection
Yihan Sun,Yuqi Cheng,Junjie Zu,Yuxiang Tan,Guoyang Xie,Yucheng Wang,Yunkang Cao,Weiming Shen
Main category: cs.CV
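TL;DR: This paper proposes Synthesis4AD, an end-to-end paradigm for 3D anomaly detection. It controllably synthesizes high-fidelity 3D anomaly samples (using the MPAS engine and the 3D-DefectStudio platform), uses a multimodal large language model to turn design information into synthesis instructions, and adds a training strategy with spatial normalization and geometry-faithful augmentation, significantly improving point-cloud anomaly detection.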
Details
Motivation: Industrial 3D anomaly detection is limited by the scarcity and long-tailed distribution of abnormal samples; real annotated data is expensive to obtain and generalizes poorly. Method: The Synthesis4AD framework (1) builds the 3D-DefectStudio platform on the controllable synthesis engine MPAS, injecting geometrically realistic defects and generating point-wise anomaly masks; (2) uses a multimodal large language model (MLLM) to convert product design information automatically into executable anomaly synthesis instructions; and (3) designs a point-cloud training pipeline with spatial-distribution normalization and geometry-faithful data augmentation to reduce the sensitivity of Point Transformer to absolute coordinates and improve robustness to realistic variation. Result: SOTA performance on Real3D-AD, MulSen-AD, and a real-world industrial parts dataset; the MPAS synthesis method and the 3D-DefectStudio system will be open-sourced. Conclusion: Controllable, knowledge-driven synthetic data generation is an effective way around the data bottleneck in industrial 3D anomaly detection, and Synthesis4AD offers a new paradigm for high-fidelity, interpretable, and scalable anomaly synthesis and detection. Abstract: Industrial 3D anomaly detection performance is fundamentally constrained by the scarcity and long-tailed distribution of abnormal samples. To address this challenge, we propose Synthesis4AD, an end-to-end paradigm that leverages large-scale, high-fidelity synthetic anomalies to learn more discriminative representations for 3D anomaly detection. At the core of Synthesis4AD is 3D-DefectStudio, a software platform built upon the controllable synthesis engine MPAS, which injects geometrically realistic defects guided by higher-dimensional support primitives while simultaneously generating accurate point-wise anomaly masks. Furthermore, Synthesis4AD incorporates a multimodal large language model (MLLM) to interpret product design information and automatically translate it into executable anomaly synthesis instructions, enabling scalable and knowledge-driven anomalous data generation. To improve the robustness and generalization of the downstream detector on unstructured point clouds, Synthesis4AD further introduces a training pipeline based on spatial-distribution normalization and geometry-faithful data augmentations, which alleviates the sensitivity of Point Transformer architectures to absolute coordinates and improves feature learning under realistic data variations. Extensive experiments demonstrate state-of-the-art performance on Real3D-AD, MulSen-AD, and a real-world industrial parts dataset.
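The abstract does not define spatial-distribution normalization. One common reading, shown here purely as an assumption, is to remove absolute position and scale by centering a point cloud at its centroid and rescaling it so the farthest point lies on the unit sphere, so that a coordinate-sensitive backbone sees comparable inputs across parts. The function name `normalize_point_cloud` is hypothetical and is not taken from the paper or its code release.

```python
import math


def normalize_point_cloud(points):
    """Center points at their centroid and scale so the farthest point has norm 1.

    points: non-empty list of (x, y, z) tuples. Returns a new list of tuples.
    This is an assumed, generic normalization, not the paper's exact procedure.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    centered = [(x - cx, y - cy, z - cz) for x, y, z in points]
    radius = max(math.sqrt(x * x + y * y + z * z) for x, y, z in centered)
    if radius == 0.0:
        # All points coincide; there is no scale to normalize.
        return centered
    return [(x / radius, y / radius, z / radius) for x, y, z in centered]
```

After this step, translating or uniformly rescaling the input part leaves the normalized coordinates unchanged, which is the property the abstract's remark about absolute-coordinate sensitivity suggests is wanted.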
The proposed synthesis method MPAS and the interactive system 3D-DefectStudio will be publicly released at https://github.com/hustCYQ/Synthesis4AD.[244] ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
Selim Ahmet Iz,Francesco Nex,Norman Kerle,Henry Meissner,Ralf Berger
Main category: cs.CV
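TL;DR: This paper proposes ZeD-MAP, a framework that combines a zero-shot diffusion depth model with incremental cluster-based bundle adjustment (BA) to produce real-time, metrically consistent depth reconstruction from ultra-high-resolution UAV imagery, reaching sub-meter 3D accuracy with per-frame inference on the order of seconds.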
Details
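Motivation: Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical tasks such as disaster response, but it must cope with wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints; existing zero-shot diffusion models are fast and need no training but lack metric accuracy and temporal and spatial consistency. Method: ZeD-MAP groups the UAV video stream into overlapping image clusters and periodically runs cluster-level incremental bundle adjustment (BA) to obtain metrically consistent camera poses and sparse 3D tie-points; these points are reprojected into keyframes and used as metric guidance for the diffusion depth model. Result: On data captured at roughly 50 m altitude with the DLR MACS system, the horizontal (XY) error is about 0.87 m and the vertical (Z) error about 0.12 m; per-frame runtime is 1.47–4.91 seconds; accuracy is close to conventional photogrammetry with significantly faster processing. Conclusion: BA-based metric guidance gives zero-shot diffusion depth estimation both high accuracy and strong consistency, providing an efficient and reliable new approach to real-time 3D map generation. Abstract: Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, and require fewer labelled datasets than transformer-based predictors while avoiding the rigid capture geometry requirement of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation.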
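The abstract does not say how the reprojected tie-points guide the depth model. A simple and widely used stand-in, assumed here for illustration only, is to fit a global scale and shift that maps the model's relative depth onto the metric depths at the sparse tie-point pixels by least squares, then apply that mapping to the whole depth map. The names `fit_scale_shift` and `apply_scale_shift` are hypothetical.

```python
def fit_scale_shift(relative, metric):
    """Least-squares (s, t) minimizing sum((s * r + t - m) ** 2) over tie-points.

    relative: relative depths sampled at tie-point pixels.
    metric:   metric depths of the same tie-points (e.g. from bundle adjustment).
    """
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two matched tie-points")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths are constant; scale is undetermined")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    s = cov / var_r
    t = mean_m - s * mean_r
    return s, t


def apply_scale_shift(depth_map, s, t):
    """Map a relative depth map (list of rows) to metric units."""
    return [[s * d + t for d in row] for row in depth_map]
```

A global affine fit is the crudest form of guidance; it cannot correct spatially varying errors, which is one reason a real pipeline might feed the tie-points into the model itself rather than rescaling its output afterwards.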
Validation on ground-marker flights captured at approximately 50 m altitude (GSD is approximately 0.85 cm/px, corresponding to 2,650 square meters ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.[245] 3D Gaussian Splatting for Annular Dark Field Scanning Transmission Electron Microscopy Tomography Reconstruction
Beiyuan Zhang,Hesong Li,Ruiwen Shao,Ying Fu
Main category: cs.CV
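TL;DR: This paper proposes DenZa-Gaussian, which models local scattering strength as a learnable scalar field, introduces a scattering view normalization coefficient γ, and adds a 2D Fourier amplitude loss term, significantly improving the quality and robustness of sparse-view ADF-STEM 3D reconstruction.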
Details
Motivation: 稀疏视角采集在ADF-STEM断层成像中常见,但传统方法在视角不足时易产生伪影、结构保真度下降,难以兼顾重建精度与样品保护。 Method: 将3D高斯溅射(3D GS)适配至ADF-STEM领域:1)定义可学习标量场denza表征局部散射强度;2)引入系数γ实现跨倾角散射稳定性(散射视角归一化);3)设计含2D傅里叶振幅项的损失函数以抑制缺失楔伪影。 Result: 在45视图和15视图倾斜序列上的实验表明,DenZa-Gaussian生成的3D重建及2D投影更贴近原始图像,显著优于传统方法,尤其在稀疏视角下表现出更强鲁棒性。 Conclusion: DenZa-Gaussian有效解决了ADF-STEM稀疏视角重建中的物理建模失配、视角不一致和缺失楔问题,为剂量敏感纳米材料的高保真三维表征提供了新范式。 Abstract: Analytical Dark Field Scanning Transmission Electron Microscopy (ADF-STEM) tomography reconstructs nanoscale materials in 3D by integrating multi-view tilt-series images, enabling precise analysis of their structural and compositional features. Although integrating more tilt views improves 3D reconstruction, it requires extended electron exposure that risks damaging dose-sensitive materials and introduces drift and misalignment, making it difficult to balance reconstruction fidelity with sample preservation. In practice, sparse-view acquisition is frequently required, yet conventional ADF-STEM methods degrade under limited views, exhibiting artifacts and reduced structural fidelity. To resolve these issues, in this paper, we adapt 3D GS to this domain with three key components. We first model the local scattering strength as a learnable scalar field, denza, to address the mismatch between 3DGS and ADF-STEM imaging physics. Then we introduce a coefficient $γ$ to stabilize scattering across tilt angles, ensuring consistent denza via scattering view normalization. Finally, We incorporate a loss function that includes a 2D Fourier amplitude term to suppress missing wedge artifacts in sparse-view reconstruction. Experiments on 45-view and 15-view tilt series show that DenZa-Gaussian produces high-fidelity reconstructions and 2D projections that align more closely with original tilts, demonstrating superior robustness under sparse-view conditions.[246] OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
DataFlow Team,Bohan Zeng,Daili Hua,Kaixin Zhu,Yifan Dai,Bozhou Li,Yuran Wang,Chengzhuo Tong,Yifan Yang,Mingkun Chang,Jianbin Zhao,Zhou Liu,Hao Liang,Xiaochen Ma,Ruichuan An,Junbo Niu,Zimo Meng,Tianyi Bai,Meiyi Qiang,Huanyao Zhang,Zhiyou Xiao,Tianyu Guo,Qinhan Yu,Runhao Zhao,Zhengpin Li,Xinyi Huang,Yisheng Pan,Yiwen Tang,Yang Shi,Yue Ding,Xinlong Chen,Hongcheng Gao,Minglei Shi,Jialong Wu,Zekun Wang,Yuanxing Zhang,Xintao Wang,Pengfei Wan,Yiren Song,Mike Zheng Shou,Wentao Zhang
Main category: cs.CV
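TL;DR: This paper presents OpenWorldLib, a comprehensive, standardized inference framework for advanced world models, and gives an explicit definition of a world model: centered on perception, equipped with interaction and long-term memory capabilities, and used to understand and predict the complex world.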
Details
Motivation: World models attract considerable attention in AI but lack a clear, unified definition; a systematic framework and capability taxonomy are needed.
Method: Proposes a definition of world models based on perception, interaction, and long-term memory, systematically categorizes their core capabilities, and builds OpenWorldLib, a unified inference framework that integrates models across tasks.
Result: Enables efficient reuse and collaborative inference of cross-task models, with an open-source codebase (https://github.com/OpenDCAI/OpenWorldLib).
Conclusion: The work provides a conceptual foundation, a capability taxonomy, and practical tooling for world model research, pushing it toward standardization and practical use.
Abstract: World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib
[247] Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Patrick Woods,Gabriel Hillesheim,Abolfazl Razi
Main category: cs.CV
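TL;DR: This paper proposes adaptive KV-cache quantization, in which a learned policy assigns different bit-widths (2/4/8-bit or FP16) to tokens according to their importance, substantially reducing memory and latency while keeping accuracy close to FP16.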
Details
Motivation: Existing KV-cache quantization methods mostly use fixed precision or hand-crafted heuristics and cannot adjust dynamically to token importance, causing accuracy loss or wasted resources.
Method: Lightweight token-level features (frequency, quality score, attention variance, entropy-based uncertainty) are fed into a compact data-driven controller that dynamically selects KV precision during decoding.
Result: Validated on the SmolLM family and multiple commonsense reasoning benchmarks (e.g., HellaSwag): relative to static quantization, decoding latency drops by 17.75% and accuracy improves by 7.60 points, while remaining only 0.30 points below FP16.
Conclusion: Adaptive KV quantization effectively improves the accuracy-latency trade-off of LLM inference on edge devices, offering a new approach to efficient on-device deployment.
Abstract: Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, thereby wasting bits on low-impact tokens while over-compressing informative ones, leading to avoidable accuracy degradation. Inspired by Huffman coding's principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency while improving accuracy compared to static KV quantization and rule-based baselines, while maintaining competitive accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments across multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off.
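To make the adaptive precision idea concrete, here is a minimal sketch of per-token bit-width selection followed by uniform quantization. It is not the authors' controller: the paper learns a data-driven policy, whereas this sketch assumes the four features are already normalized to [0, 1], averages them into a score, and maps the score to a bit-width with illustrative thresholds. The names `importance_score`, `select_bits`, and `quantize` are hypothetical.

```python
from typing import List, Sequence

# Precision menu from the abstract; 16 stands in for "keep FP16".
BIT_CHOICES = (2, 4, 8, 16)


def importance_score(features: Sequence[float]) -> float:
    """Average features assumed to be pre-normalized to [0, 1].

    Hypothetical stand-in for the learned controller described in the paper.
    """
    return sum(features) / len(features)


def select_bits(score: float, thresholds: Sequence[float] = (0.25, 0.5, 0.75)) -> int:
    """Map an importance score to a bit-width; the thresholds are illustrative."""
    for bits, limit in zip(BIT_CHOICES, thresholds):
        if score < limit:
            return bits
    return BIT_CHOICES[-1]


def quantize(values: List[float], bits: int) -> List[float]:
    """Uniform min-max quantization to 2**bits levels (16 means no change)."""
    if bits >= 16 or not values:
        return list(values)
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


# Toy KV vector for one token, quantized at the precision its score selects.
kv = [0.12, -0.80, 0.45, 0.99, -0.33]
features = [0.1, 0.2, 0.3, 0.2]  # frequency, quality, attention variance, entropy
bits = select_bits(importance_score(features))
compressed = quantize(kv, bits)
```

A low-importance token such as the one above lands in the 2-bit bucket, so its vector collapses onto at most four distinct values, while a high-scoring token would pass through unchanged.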
For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
[248] Discovering Failure Modes in Vision-Language Models using RL
Kanishk Jain,Qian Yang,Shravan Nayak,Parisa Kordjamshidi,Nishanth Anand,Aishwarya Agrawal
Main category: cs.CV
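TL;DR: This paper proposes a reinforcement-learning-based framework that automatically discovers the failure modes of vision-language models (VLMs) on a given data distribution without human intervention. It trains a "questioner" agent that adaptively generates questions based on the VLM's answers to elicit incorrect responses and raises question complexity as training progresses, identifying 36 novel failure modes.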
Details
Motivation: Manual analysis of VLM weaknesses is costly, does not scale, and is prone to human bias, so it cannot fully expose model vulnerabilities.
Method: Proposes an RL-based automatic discovery framework that trains a "questioner" agent to adaptively generate questions; interactive questioning elicits incorrect VLM answers, and training progressively focuses on fine-grained visual details and multi-skill compositions to increase difficulty.
Result: Identifies 36 previously undiscovered VLM failure modes and validates the framework's generality and transferability across multiple VLM combinations.
Conclusion: The RL-driven automated diagnosis exposes VLM blind spots more comprehensively, objectively, and efficiently, offering a new approach to model evaluation and improvement.
Abstract: Vision-Language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favor of salient objects, resulting in an incomplete understanding of a model's vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any candidate VLM on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM's responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying 36 novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.
[249] Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
Lei Zhang,Junjiao Tian,Zhipeng Fan,Kunpeng Li,Jialiang Wang,Weifeng Chen,Markos Georgopoulos,Felix Juefei-Xu,Yuxiang Bao,Julian McAuley,Manling Li,Zecheng He
Main category: cs.CV
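TL;DR: This paper proposes a process-driven image generation paradigm that decomposes image synthesis into multiple interleaved steps of textual reasoning and visual generation, with dense step-wise supervision ensuring the consistency and interpretability of intermediate states.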
Details
Motivation: Human painting is an incremental, multi-step process grounded in evolving visual states, whereas existing unified multimodal models cannot model intermediate generation states; the paper explores whether a model can imagine and controllably generate the intermediate states in the image generation chain.
Method: Proposes a four-stage iterative pipeline: textual planning → visual drafting → textual reflection → visual refinement. Dense step-wise supervision constrains the spatial and semantic consistency of visual intermediate states and, for textual intermediate states, the preservation of visual knowledge and the ability to correct prompt violations.
Result: The method's effectiveness is validated on multiple text-to-image generation benchmarks; the generation process becomes more explicit, interpretable, and directly supervisable.
Conclusion: The process-driven paradigm improves controllability and interpretability and offers a new path for multimodal models to model "thought-action" coordination.
Abstract: Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is grounded in the evolving visual states. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of 4 stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can models evaluate each partially-complete image? We address this through dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce the spatial and semantic consistency; for the textual intermediate states, we preserve the prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments on various text-to-image generation benchmarks.
[250] CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
Xiangzhao Hao,Zefeng Zhang,Zhenyu Zhang,Linhao Yu,Yao Chen,Yiqian Zhang,Haiyun Guo,Shuohuan Wang,Yu Sun
Main category: cs.CV
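TL;DR: This paper proposes the CLEAR framework, which uses a generate-then-reason pattern, a latent representation bridge, and interleaved GRPO reinforcement learning to improve the robustness of multimodal models on degraded images while preserving clean-image performance.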
Details
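Motivation: Existing unified multimodal models have generative capability but do not use it during reasoning to cope with real-world image degradation (blur, noise, compression, poor illumination); their training paradigm and decode-reencode pathway limit joint optimization of generation and understanding.
Method: Proposes the three-step CLEAR framework: (1) supervised fine-tuning on a degradation-aware dataset to establish a "generate-then-answer" reasoning pattern; (2) a Latent Representation Bridge that replaces the conventional decode-reencode pathway with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. Also constructs MMD-Bench, an evaluation suite covering three degradation severity levels across six benchmarks.
Result: CLEAR substantially improves robustness on degraded inputs without harming clean-image performance; ablations find that removing pixel-level reconstruction supervision yields intermediate visual representations with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
Conclusion: Generative capability can be integrated into multimodal reasoning; the key is the combination of architecture (the latent bridge) and training strategy (interleaved RL). Task-oriented optimization of visual representations outperforms pure pixel reconstruction, offering a new approach to robust multimodal understanding.
Abstract: Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance.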
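For readers unfamiliar with GRPO, the sketch below shows only the group-relative advantage step as it is commonly described: several responses are sampled for the same input, each is scored, and rewards are normalized within the group. It assumes a binary answer-correctness reward and population standard deviation; how CLEAR's interleaved variant distributes this signal over text and visual tokens is not stated in the abstract and is not modeled here. The function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one degraded-image question: 1.0 = correct, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, and a group in which every answer earns the same reward contributes zero advantage.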
Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
[251] AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
Hongyu Liu,Xuan Wang,Yating Wang,Zijian Wu,Ziyu Wan,Yue Ma,Runtao Liu,Boyao Zhou,Yujun Shen,Qifeng Chen
Main category: cs.CV
TL;DR: AvatarPointillist是一种基于单张人像图像生成动态4D高斯化身的新框架,采用解码器-only Transformer自回归生成3D高斯点云,并联合预测绑定信息以实现真实动画。
Details
Motivation: Existing methods struggle to generate high-quality, controllable, animatable 4D Gaussian avatars from a single image, which requires balancing geometric accuracy, adaptive density, and motion binding.
Method: Proposes the AvatarPointillist framework: (1) a decoder-only Transformer autoregressively generates the point cloud, dynamically adjusting point count and density; (2) per-point binding information is predicted jointly during generation; (3) a Gaussian decoder converts points into renderable Gaussian attributes, conditioned on latent features from the AR generator.
Result: Experiments show the method generates high-quality, photorealistic, and highly controllable 4D Gaussian avatars, clearly outperforming baselines in fidelity and animation realism.
Conclusion: Autoregressive modeling offers a new paradigm for 4D avatar generation that combines structural controllability with fine detail; the code will be open-sourced to support follow-up research.
Abstract: We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject's complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. We believe this autoregressive formulation represents a new paradigm for avatar generation, and we will release our code to inspire future research.
[252] Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving
Mayank Mayank,Bharanidhar Duraisamy,Florian Geiß,Abhinav Valada
Main category: cs.CV
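TL;DR: This paper proposes the MMF-BEV framework, which uses deformable attention to fuse radar and camera efficiently in bird's-eye-view (BEV) space, substantially improving 3D object detection, and validates modality complementarity through a sensor contribution analysis.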
Details
Motivation: Cameras are semantically rich but give unreliable depth, while millimeter-wave radar measures range and velocity precisely but with sparse geometry; fusing their strengths is needed to improve 3D object detection accuracy in autonomous driving.
Method: Builds a BEVDepth-based camera branch and a RadarBEVNet-based radar branch, both with deformable self-attention; aligns cross-modal features with a deformable cross-attention module; uses a two-stage training strategy (pre-train the camera branch with depth supervision, then jointly train the radar and fusion modules); evaluates three configurations on the View-of-Delft 4D radar dataset and conducts a sensor contribution analysis.
Result: On VoD, MMF-BEV consistently outperforms unimodal baselines across all object classes in both the full area and the near-range ROI, and matches or exceeds existing fusion methods on multi-class detection.
Conclusion: Deformable attention effectively models the geometric-semantic complementarity of radar and camera in BEV space, and MMF-BEV offers a robust, interpretable, high-performance fusion approach for multimodal 3D detection.
Abstract: Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV builds a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy (pre-training the camera branch with depth supervision, then jointly training the radar and fusion modules) stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes in both the full annotated area and near-range Region of Interest.
[253] E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
Jiajun Zhai,Hao Shi,Shangwei Guo,Kailun Yang,Kaiwei Wang
Main category: cs.CV
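TL;DR: This paper proposes the E-VLA framework, which directly exploits motion and structural cues from an event camera to make vision-language-action (VLA) models more robust for manipulation under perception-degrading conditions such as low light and motion blur, without image reconstruction; experiments show substantially higher task success rates, with a real-world synchronized RGB-event-action dataset and code to be released.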
Details
Motivation: Existing frame-based VLA models perceive poorly under sensing degradations such as extreme low light, motion blur, and black clipping; a more robust perception mechanism is needed.
Method: Proposes E-VLA, which does not rely on event-to-image reconstruction but directly models motion and structural information in event streams; builds an open-source teleoperation platform with a DAVIS346 event camera and collects a real-world synchronized RGB-event-action dataset across tasks and illumination settings; designs lightweight, pretrained-compatible event fusion strategies and studies event windowing and fusion choices for stable deployment.
Result: On Pick-Place at 20 lux, success rises from 0% (image-only) to 60% with overlay fusion and 90% with the event adapter; under severe motion blur (1000 ms exposure), Pick-Place rises from 0% to 20-25% and Sorting from 5% to 32.5%.
Conclusion: Event-driven perception can be integrated effectively into VLA models and markedly improves manipulation robustness in open environments, providing systematic evidence for embodied intelligence beyond conventional frame-based imaging.
Abstract: Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset across diverse tasks and illumination settings. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, could substantially improve robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging.
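The parameter-free overlay fusion mentioned above can be pictured with a small sketch: events inside a time window are accumulated per pixel, and pixels with net activity are painted onto the RGB frame. This is an illustration under assumptions, not the E-VLA implementation: the event tuple layout, the red/blue polarity colors, and the function names `accumulate_events` and `overlay_events` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Event = Tuple[float, int, int, int]  # (timestamp_s, x, y, polarity in {+1, -1})
Pixel = Tuple[int, int, int]         # (R, G, B)


def accumulate_events(events: Sequence[Event], height: int, width: int,
                      t_start: float, t_end: float) -> List[List[int]]:
    """Sum event polarities per pixel for events with t_start <= t < t_end."""
    acc = [[0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        if t_start <= t < t_end and 0 <= x < width and 0 <= y < height:
            acc[y][x] += 1 if polarity > 0 else -1
    return acc


def overlay_events(rgb: List[List[Pixel]], acc: List[List[int]]) -> List[List[Pixel]]:
    """Paint net-positive pixels red and net-negative pixels blue (assumed colors)."""
    fused = [list(row) for row in rgb]  # copy so the input frame is untouched
    for y, row in enumerate(acc):
        for x, net in enumerate(row):
            if net > 0:
                fused[y][x] = (255, 0, 0)
            elif net < 0:
                fused[y][x] = (0, 0, 255)
    return fused


# A fully dark 2x2 frame; the third event falls outside the one-second window.
dark_frame = [[(0, 0, 0)] * 2 for _ in range(2)]
events = [(0.00, 0, 0, 1), (0.50, 1, 1, -1), (2.00, 1, 0, 1)]
acc = accumulate_events(events, height=2, width=2, t_start=0.0, t_end=1.0)
fused = overlay_events(dark_frame, acc)
```

Even when the frame itself carries no signal, the fused image still marks where motion occurred, which is the intuition behind using events in dark scenes; the window length plays the role of the event windowing choice studied in the paper.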
Code and dataset will be available at https://github.com/JJayzee/E-VLA.
[254] Less Detail, Better Answers: Degradation-Driven Prompting for VQA
Haoxuan Han,Weijie Wang,Zeyu Zhang,Yefei He,Bohan Zhuang
Main category: cs.CV
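TL;DR: This paper proposes the Degradation-Driven Prompting (DDP) framework, which strategically reduces image fidelity to steer vision-language models toward essential structural information, improving visual question answering (VQA) performance, especially on physical-attribute misjudgment and perceptual-phenomena tasks.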
Details
Motivation: High-resolution image detail can act as noise, causing vision-language models to hallucinate or make reasoning errors in VQA.
Method: Proposes the DDP framework: for physical-attribute tasks, it uses 80p downsampling, structural visual aids such as white background masks and orthometric lines, and in-context learning; for perceptual-phenomena tasks, it adds a task-classification stage and specialized tools such as blur masks, contrast enhancement, and downsampling.
Result: Experiments show that DDP bypasses distracting textures and substantially improves VLM reasoning accuracy on challenging visual benchmarks.
Conclusion: Moderate image degradation combined with structured prompts improves the robustness and accuracy of VLMs on VQA, supporting the "less is more" design principle.
Abstract: Recent advancements in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes become noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity to force models to focus on essential structural information. We evaluate DDP across two distinct tasks. Physical attributes targets images prone to human misjudgment, where DDP employs a combination of 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model's focus. Perceptual phenomena addresses various machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement alongside downsampling. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.
[255] InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement
Yude Zou,Junji Gong,Xing Gao,Zixuan Li,Tianxing Chen,Guanjie Zheng
Main category: cs.CV
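TL;DR: This paper proposes a coarse-to-fine, instruction-conditioned consistency-model framework for human-object-scene interaction (HOSI) generation; with a dynamic perception strategy, bump-aware guidance, and a hybrid training strategy, it addresses data scarcity, physical implausibility, and the difficulty of modeling dynamic scenes, achieving state-of-the-art performance and strong generalization.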
Details
Motivation: HOSI generation must model dynamic interactions among humans, objects, and scenes, but faces scarce annotated data, physical implausibility (e.g., penetration and collision), and a lack of dynamic scene-context modeling.
Method: Proposes a coarse-to-fine, instruction-conditioned generation framework based on a consistency model; introduces a dynamic perception strategy that uses trajectories from the preceding refinement to update scene context and guide subsequent denoising; designs bump-aware guidance to mitigate physical conflicts; adopts a hybrid training strategy that injects voxelized scene occupancy into HOI data to synthesize pseudo-HOSI samples and trains jointly with high-fidelity HSI data.
Result: Achieves state-of-the-art performance on HOSI and HOI generation, generalizes strongly to unseen scenes, supports real-time generation, and markedly reduces physical artifacts.
Conclusion: The method unifies human-object-scene interaction modeling while balancing physical plausibility, scene consistency, and data efficiency, offering a practical, scalable generation approach for embodied AI, simulation, and animation.
Abstract: Human-object-scene interaction (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI), HOSI generation requires reasoning over dynamic object-scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context and condition subsequent refinement at each denoising step of the consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes. Project page: https://yudezou.github.io/InfBaGel-page/
[256] The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
Runhao Mao,Hanshi Wang,Yixiang Yang,Qianli Ma,Jingmeng Zhou,Zhipeng Zhang
Main category: cs.CV
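TL;DR: This paper presents the first systematic study of catastrophic forgetting when vision-language models (VLMs) are fine-tuned for autonomous driving, and proposes the Drive Expert Adapter (DEA) framework, which shifts adaptation from the weight space to the prompt space and dynamically routes among knowledge experts, improving driving-task performance without forgetting pre-trained knowledge.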
Details
Motivation: VLMs can address long-tail scenarios in autonomous driving, but existing fine-tuning methods erode their valuable pre-trained world knowledge, creating a self-defeating dilemma that has not been studied systematically.
Method: Builds FidelityDrivingBench, a large-scale dataset of 180K scenes, to establish the first benchmark for catastrophic forgetting in autonomous driving; proposes the Drive Expert Adapter (DEA), which uses dynamic expert routing based on scene cues to adapt in the prompt space without modifying the base model's parameters.
Result: Experiments show DEA achieves state-of-the-art performance on driving tasks while substantially mitigating catastrophic forgetting and preserving the VLM's generalization ability.
Conclusion: Catastrophic forgetting is a key bottleneck for deploying VLMs in autonomous driving; through prompt-space adaptation and expert routing, DEA decouples task-performance gains from general-knowledge retention, offering a new approach to deploying VLMs in safety-critical domains.
Abstract: The integration of Vision-Language Models (VLMs) into autonomous driving promises to solve long-tail scenarios, but this paradigm faces the critical and unaddressed challenge of catastrophic forgetting. The very fine-tuning process used to adapt these models to driving-specific data simultaneously erodes their invaluable pre-trained world knowledge, creating a self-defeating paradox that undermines the core reason for their use. This paper provides the first systematic investigation into this phenomenon. We introduce a new large-scale dataset of 180K scenes, which enables the first-ever benchmark specifically designed to quantify catastrophic forgetting in autonomous driving. Our analysis reveals that existing methods suffer from significant knowledge degradation. To address this, we propose the Drive Expert Adapter (DEA), a novel framework that circumvents this trade-off by shifting adaptation from the weight space to the prompt space. DEA dynamically routes inference through different knowledge experts based on scene-specific cues, enhancing driving-task performance without corrupting the model's foundational parameters. Extensive experiments demonstrate that our approach not only achieves state-of-the-art results on driving tasks but also effectively mitigates catastrophic forgetting, preserving the essential generalization capabilities that make VLMs a transformative force for autonomous systems. Data and model are released at FidelityDrivingBench.
[257] Unified Vector Floorplan Generation via Markup Representation
Kaede Shiohara,Toshihiko Yamasaki
Main category: cs.CV
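TL;DR: This paper proposes Floorplan Markup Language (FML), a general representation, and builds on it a Transformer model, FMLM, that handles residential floorplan generation under diverse conditions in a unified way, surpassing prior task-specific models on the RPLAN dataset.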
Details
Motivation: Existing methods fall short in diversity, flexibility, or generalization across heterogeneous conditions (e.g., site boundaries, room adjacency graphs, partial layouts).
Method: Designs FML, a unified structured grammar that encodes floorplan information, casts generation as a next-token prediction task, and builds FMLM, a Transformer-based generative model.
Result: As a single model, FMLM surpasses the task-specific state-of-the-art methods on RPLAN, producing high-fidelity, functional floorplans.
Conclusion: FML provides a general, extensible floorplan representation and demonstrates the effectiveness and potential of a unified generation framework for architectural AI.
Abstract: Automatic residential floorplan generation has long been a central challenge bridging architecture and computer graphics, aiming to make spatial design more efficient and accessible. While early methods based on constraint satisfaction or combinatorial optimization ensure feasibility, they lack diversity and flexibility. Recent generative models achieve promising results but struggle to generalize across heterogeneous conditional tasks, such as generation from site boundaries, room adjacency graphs, or partial layouts, due to their suboptimal representations. To address this gap, we introduce Floorplan Markup Language (FML), a general representation that encodes floorplan information within a single structured grammar, which casts the entire floorplan generation problem into a next-token prediction task. Leveraging FML, we develop a transformer-based generative model, FMLM, capable of producing high-fidelity and functional floorplans under diverse conditions. Comprehensive experiments on the RPLAN dataset demonstrate that FMLM, despite being a single model, surpasses the previous task-specific state-of-the-art methods.
[258] Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
Tuan Dung Nguyen,Minh Khoi Ho,Qi Chen,Yutong Xie,Nguyen Cam-Tu,Minh Khoi Nguyen,Dang Huy Pham Nguyen,Anton van den Hengel,Johan W. Verjans,Phi Le Nguyen,Vu Minh Hieu Phan
Main category: cs.CV
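TL;DR: This paper proposes a fine-grained, patch-level hallucination detection framework that analyzes how well object tokens in vision-language models align with local image regions, identifies two hallucination signatures (diffuse attention patterns and semantic misalignment), and reaches up to 90% accuracy in token-level hallucination detection.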
Details
Motivation: Existing LVLM hallucination detectors rely on global, coarse-grained image-text relevance measures and miss hallucinated tokens whose correlations are weak but widely scattered across local regions.
Method: Proposes a patch-level hallucination detection framework that analyzes fine-grained token-patch interactions across model layers; identifies and exploits two hallucination signatures: (i) diffuse, non-localized attention patterns; (ii) lack of semantic alignment with any visual region; and designs a lightweight, interpretable detector based on patch-level statistical features and hidden-layer representations.
Result: Reaches up to 90% accuracy on token-level hallucination detection, clearly outperforming existing methods based on global relevance.
Conclusion: Fine-grained structural analysis, in particular patch-level token grounding, is more effective than global metrics and offers a new approach to LVLM hallucination detection.
Abstract: Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance, thus evading the current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention seen in faithful tokens; and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that leverages patch-level statistical features, combined with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the superiority of fine-grained structural analysis for detecting hallucinations.
[259] Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction
Ahan Shabanov,Peter Hedman,Ethan Weber,Zhengqin Li,Denis Rozumny,Gael Le Lan,Naina Dhingra,Lei Luo,Andrea Vedaldi,Christian Richardt,Andrea Tagliasacchi,Bo Zhu,Numair Khan
Main category: cs.CV
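TL;DR: This paper proposes Free-Range Gaussians, a multi-view reconstruction method that reconstructs non-pixel-, non-voxel-aligned 3D Gaussians from as few as four images via flow matching over Gaussian parameters, and introduces hierarchical patching, a weighted rendering loss, and several inference-time guidance strategies to improve reconstruction quality and generalization.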
Details
Motivation: Addresses the redundancy, holes, and blurry unobserved regions caused by pixel- or voxel-aligned Gaussians in existing methods, and supports more flexible, sparse, structure-preserving 3D representation learning.
Method: Uses flow-matching-based generative modeling to predict freely positioned 3D Gaussians; introduces hierarchical patching that groups spatially related Gaussians into Transformer tokens; designs a timestep-weighted rendering loss; applies photometric gradient guidance and classifier-free guidance at inference.
Result: On Objaverse and Google Scanned Objects, it attains higher reconstruction quality with fewer Gaussians than pixel- and voxel-aligned methods, with the largest gains when input views are sparse and parts of the object are unobserved.
Conclusion: Free-Range Gaussians enables more compact, robust, and generative multi-view 3D reconstruction, offering a new approach to Gaussian representation learning.
Abstract: We present Free-Range Gaussians, a multi-view reconstruction method that predicts non-pixel, non-voxel-aligned 3D Gaussians from as few as four images. This is done through flow matching over Gaussian parameters. Our generative formulation of reconstruction allows the model to be supervised with non-grid-aligned 3D data, and enables it to synthesize plausible content in unobserved regions. Thus, it improves on prior methods that produce highly redundant grid-aligned Gaussians, and suffer from holes or blurry conditional means in unobserved regions. To handle the number of Gaussians needed for high-quality results, we introduce a hierarchical patching scheme to group spatially related Gaussians into joint transformer tokens, halving the sequence length while preserving structure. We further propose a timestep-weighted rendering loss during training, and photometric gradient guidance and classifier-free guidance at inference to improve fidelity. Experiments on Objaverse and Google Scanned Objects show consistent improvements over pixel- and voxel-aligned methods while using significantly fewer Gaussians, with large gains when input views leave parts of the object unobserved.
[260] DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
Ke Li,Maoliang Li,Jialiang Chen,Jiayu Chen,Zihao Zheng,Shaoqi Wang,Xiang Chen
Main category: cs.CV
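TL;DR: This paper proposes the DIRECT framework, which formulates video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and uses a hierarchical multi-agent architecture (Screenwriter, Director, Editor) to coordinate semantics, visuals, and audio across levels, markedly improving mashup fluidity and professionalism.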
Details
Motivation: Existing automated video editing methods lack cross-level multimodal coordination, leading to abrupt visual transitions and musical misalignment that fall short of professional-grade fluidity.
Method: Proposes DIRECT, a hierarchical multi-agent framework: the Screenwriter anchors the global structure, the Director generates adaptive editing intent, and the Editor performs fine-grained shot-sequence optimization; also builds Mashup-Bench, a dedicated benchmark for visual continuity and auditory alignment.
Result: Significantly outperforms state-of-the-art baselines on both objective metrics and human subjective evaluation.
Conclusion: DIRECT achieves cross-level multimodal coordinated editing and offers a new approach to professional-grade automated video mashups.
Abstract: Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascaded levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT
[261] HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes
Mauricio Soroco,Francesco Pittaluga,Zaid Tasneem,Abhishek Aich,Bingbing Zhuang,Wuyang Chen,Manmohan Chandraker,Ziyu Jiang
Main category: cs.CV
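TL;DR: This paper proposes HorizonWeaver, an instruction-driven image editing framework for autonomous driving scenes that tackles three challenges: multi-granularity editing, preservation of high-level semantics, and cross-domain generalization. With a paired real/synthetic dataset, a language-guided mask mechanism, and joint-loss training, it substantially outperforms existing methods on multiple metrics.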
Details
Motivation: Existing instruction-based image editors perform poorly on dense, safety-critical driving scenes and cannot meet autonomous driving's need for scalable, controllable, realistic scene generation.
Method: Proposes HorizonWeaver with three contributions: (1) a paired real/synthetic dataset built from Boreas, nuScenes, and Argoverse2; (2) a language-guided, semantics-enriched mask mechanism for fine-grained editing; (3) joint losses that balance content preservation and instruction alignment.
Result: Validated on 255K images across 13 editing categories, with clear gains in L1, CLIP, and DINO metrics, a 46.4% increase in user preference, and a 33% improvement in BEV segmentation IoU.
Conclusion: HorizonWeaver offers a scalable, high-fidelity, instruction-driven editing approach for complex driving scenes, supporting safety validation and data augmentation for autonomous driving.
Abstract: Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction-guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-level granularity, requiring coherent object- and scene-level edits in dense environments; (2) rich high-level semantics, preserving diverse objects while following detailed instructions; and (3) ubiquitous domain shifts, handling changes in climate, layout, and traffic across unseen environments. The core of HorizonWeaver is a set of complementary contributions across data, model, and training: (1) Data: Large-scale dataset generation, where we build a paired real/synthetic dataset from Boreas, nuScenes, and Argoverse2 to improve generalization; (2) Model: Language-Guided Masks for fine-grained editing, where semantics-enriched masks and prompts enable precise, language-guided edits; and (3) Training: Content preservation and instruction alignment, where joint losses enforce scene consistency and instruction fidelity. Together, HorizonWeaver provides a scalable framework for photorealistic, instruction-driven editing of complex driving scenes, collecting 255K images across 13 editing categories and outperforming prior methods in L1, CLIP, and DINO metrics, achieving +46.4% user preference and improving BEV segmentation IoU by +33%.
Project page: https://msoroco.github.io/horizonweaver/
[262] FileGram: Grounding Agent Personalization in File-System Behavioral Traces
Shuai Liu,Shulin Tian,Kairui Hu,Yuhao Dong,Zhe Yang,Bo Li,Jingkang Yang,Chen Change Loy,Ziwei Liu
Main category: cs.CV
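TL;DR: This paper proposes FileGram, a framework that personalizes AI agents and models their memory using behavioral traces in the local file system (such as file operations). It comprises a data generation engine, a diagnostic benchmark, and a memory architecture, and aims to overcome the data scarcity and privacy constraints that limit existing interaction-centric methods.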
Details
Motivation: Existing approaches to AI agent personalization are limited by privacy barriers and by the difficulty of collecting real multimodal behavioral data at scale; they also rely too heavily on dialogue interaction and overlook the denser, finer-grained behavioral traces of file-system operations.
Method: Proposes the FileGram framework, comprising: (1) FileGramEngine, a scalable persona-driven data engine that simulates realistic workflows and generates fine-grained multimodal action sequences; (2) FileGramBench, a diagnostic benchmark grounded in file-system behavior that evaluates memory systems on profile reconstruction, trace disentanglement, persona drift detection, and multimodal grounding; (3) FileGramOS, a bottom-up memory architecture that builds user profiles from atomic file operations and content changes and encodes them into procedural, semantic, and episodic channels.
Result: Experiments show that FileGramBench remains challenging for state-of-the-art memory systems, while FileGramEngine and FileGramOS are effective for personalization modeling; the framework is open-sourced to support follow-up research.
Conclusion: FileGram is the first to treat file-system behavioral traces as the core signal for AI agent personalization and memory modeling, providing a new paradigm and infrastructure for memory-centric local AI agents in privacy-sensitive, data-constrained settings.
Abstract: Coworking AI agents operating within local file systems are rapidly emerging as a paradigm in human-AI interaction. However, effective personalization remains limited by severe data constraints, as strict privacy barriers and the difficulty of jointly collecting multimodal real-world traces prevent scalable training and evaluation. Existing methods also remain interaction-centric while overlooking dense behavioral traces in file-system operations. To address this gap, we propose FileGram, a comprehensive framework that grounds agent memory and personalization in file-system behavioral traces, comprising three core components: (1) FileGramEngine, a scalable persona-driven data engine that simulates realistic workflows and generates fine-grained multimodal action sequences at scale; (2) FileGramBench, a diagnostic benchmark grounded in file-system behavioral traces for evaluating memory systems on profile reconstruction, trace disentanglement, persona drift detection, and multimodal grounding; and (3) FileGramOS, a bottom-up memory architecture that builds user profiles directly from atomic actions and content deltas rather than dialogue summaries, encoding these traces into procedural, semantic, and episodic channels with query-time abstraction. Extensive experiments show that FileGramBench remains challenging for state-of-the-art memory systems and that FileGramEngine and FileGramOS are effective. By open-sourcing the framework, we hope to support future research on personalized memory-centric file-system agents.
[263] ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality
Dawar Khan,Alexandre Kouyoumdjian,Xinyu Liu,Omar Mena,Dominik Engel,Ivan Viola
Main category: cs.CV
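TL;DR: ClickAIXR is a novel on-device framework for multimodal vision-language interaction that lets users ask natural-language questions about real-world objects in extended reality (XR) by clicking them with a controller. All inference runs locally, balancing privacy, low latency, and interaction accuracy.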
Details
Motivation: Existing XR systems rely on cloud-based AI or gaze-based selection, which brings privacy leakage, high latency, and ambiguous interaction; this work aims to improve the trustworthiness of XR interaction and users' sense of control.
Method: Proposes a controller-click object selection paradigm integrated with a lightweight on-device vision-language model (VLM), with local inference implemented via ONNX on the Magic Leap platform; conducts a comparative user study against Gemini 2.5 Flash and ChatGPT 5.
Result: The user study shows moderate latency and an acceptable user experience, with good results on usability, trust, and satisfaction, supporting the feasibility and advantages of the click-based, on-device AI paradigm.
Conclusion: ClickAIXR shows that combining precise object selection with local VLM inference in XR can yield more trustworthy, privacy-friendly, and responsive interactive systems, offering a new path for future on-device multimodal XR applications.
Abstract: We present ClickAIXR, a novel on-device framework for multimodal vision-language interaction with objects in extended reality (XR). Unlike prior systems that rely on cloud-based AI (e.g., ChatGPT) or gaze-based selection (e.g., GazePointAR), ClickAIXR integrates an on-device vision-language model (VLM) with a controller-based object selection paradigm, enabling users to precisely click on real-world objects in XR. Once selected, the object image is processed locally by the VLM to answer natural language questions through both text and speech. This object-centered interaction reduces ambiguity inherent in gaze- or voice-only interfaces and improves transparency by performing all inference on-device, addressing concerns around privacy and latency. We implemented ClickAIXR in the Magic Leap SDK (C API) with ONNX-based local VLM inference. We conducted a user study comparing ClickAIXR with Gemini 2.5 Flash and ChatGPT 5, evaluating usability, trust, and user satisfaction. Results show that latency is moderate and user experience is acceptable. Our findings demonstrate the potential of click-based object selection combined with on-device AI to advance trustworthy, privacy-preserving XR interactions. The source code and supplementary materials are available at: nanovis.org/ClickAIXR.html
[264] SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
Yicheng Xiao,Wenhu Zhang,Lin Song,Yukang Chen,Wenbo Li,Nan Jiang,Tianhe Ren,Haokun Lin,Wei Huang,Haoyang Huang,Xiu Li,Nan Duan,Xiaojuan Qi
Main category: cs.CV
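TL;DR: This paper introduces the SpatialEdit-Bench evaluation benchmark, the SpatialEdit-500k synthetic dataset, and the SpatialEdit-16B baseline model to improve fine-grained control over geometric fidelity and perceptual plausibility in image spatial editing.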
Details
Motivation: Existing image spatial editing models struggle with fine-grained spatial manipulations, and the field lacks a dedicated evaluation suite and high-quality training data.
Method: Builds SpatialEdit-Bench, a benchmark that evaluates spatial editing jointly through viewpoint reconstruction and framing analysis; designs a controllable Blender-based rendering pipeline to generate the large-scale synthetic dataset SpatialEdit-500k; and trains the SpatialEdit-16B baseline model on this data.
Result: SpatialEdit-16B is competitive on general image editing tasks and substantially outperforms prior methods on spatial manipulation tasks.
Conclusion: This work provides a systematic evaluation standard, high-quality training data, and a strong baseline model for image spatial editing, advancing the field toward more precise geometric control.
Abstract: Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite. Our contributions are as follows: (i) We introduce SpatialEdit-Bench, a complete benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.
[265] A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Tommie Kerssies,Gabriele Berton,Ju He,Qihang Yu,Wufei Ma,Daan de Geus,Gijs Dubbelman,Liang-Chieh Chen
Main category: cs.CV
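TL;DR: This paper proposes DeltaTok and DeltaWorld, which encode inter-frame differences as delta tokens in the feature space of a vision foundation model to enable efficient, diverse video future prediction with far fewer parameters and much less compute.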
Details
Motivation: Discriminative world models produce only deterministic predictions, while generative world models are computationally expensive; modeling in the feature space of a vision foundation model reduces parameters, but most such approaches remain discriminative and cannot model diverse futures.
Method: Proposes the DeltaTok tokenizer, which encodes the difference between consecutive frames in VFM feature space as a single continuous "delta" token; builds the generative world model DeltaWorld, which uses multi-hypothesis training on delta-token sequences and generates diverse futures in a single forward pass at inference.
Result: On dense forecasting tasks, DeltaWorld has over 35x fewer parameters and uses 2,000x fewer FLOPs than existing generative world models, while its predictions align more closely with real-world futures.
Conclusion: A compact token representation based on inter-frame feature differences supports lightweight, efficient, and diverse video world modeling, offering a new paradigm for future video prediction.
Abstract: Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
[266] Your Pre-trained Diffusion Model Secretly Knows Restoration
Sudarshan Rajagopalan,Vishal M. Patel
Main category: cs.CV
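TL;DR: This paper proposes a method that needs no fine-tuning or extra control modules. It unlocks the intrinsic restoration ability of pre-trained diffusion models by directly learning prompt embeddings at the output of the text encoder, and it uses a diffusion bridge training strategy to align training and inference dynamics, achieving efficient, general All-in-One Restoration for images and videos.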
Details
Motivation: Existing diffusion-based All-in-One Restoration methods rely on fine-tuning or Control-Net style modules and do not fully exploit the restoration priors already present in pre-trained diffusion models; conventional text prompts and token-embedding optimization also fail to elicit this ability.
Method: Directly learns lightweight trainable prompt embeddings at the output of the text encoder; to address the mismatch between forward noising (using degraded images) and the reverse denoising trajectory, introduces a diffusion bridge training formulation that enforces a coherent denoising path from noisy degraded states to clean images.
Result: Validated on the pre-trained WAN (video) and FLUX (image) models; achieves competitive performance and generalization across diverse degradations without fine-tuning or custom control modules.
Conclusion: Pre-trained diffusion models have untapped restoration behavior that can be activated efficiently with well-designed prompt learning and a training strategy that aligns training and inference dynamics, offering a new paradigm for lightweight, general, plug-and-play restoration.
Abstract: Pre-trained diffusion models have enabled significant advancements in All-in-One Restoration (AiOR), offering improved perceptual quality and generalization. However, diffusion-based restoration methods primarily rely on fine-tuning or Control-Net style modules to leverage the pre-trained diffusion model's priors for AiOR. In this work, we show that these pre-trained diffusion models inherently possess restoration behavior, which can be unlocked by directly learning prompt embeddings at the output of the text encoder. Interestingly, this behavior is largely inaccessible through text prompts and text-token embedding optimization. Furthermore, we observe that naive prompt learning is unstable because the forward noising process using degraded images is misaligned with the reverse sampling trajectory. To resolve this, we train prompts within a diffusion bridge formulation that aligns training and inference dynamics, enforcing a coherent denoising path from noisy degraded states to clean images. Building on these insights, we introduce our lightweight learned prompts on the pre-trained WAN video model and FLUX image models, converting them into high-performing restoration models. Extensive experiments demonstrate that our approach achieves competitive performance and generalization across diverse degradations, while avoiding fine-tuning and restoration-specific control modules.
[267] Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo
Zeyu Ma,Alexander Raistrick,Jia Deng
Main category: cs.CV
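TL;DR: This paper proposes SimpleProc, a fully procedural generator of multi-view stereo (MVS) training data based on NURBS and simple texture and displacement patterns, which matches or exceeds manually curated data while using fewer images.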
Details
Motivation: High-quality, large-scale, diverse training data for MVS is costly and slow to obtain; the paper explores the design space of procedurally generated training data.
Method: Proposes SimpleProc, a fully procedural data generator driven by a small set of rules, which models geometry with Non-Uniform Rational B-Splines (NURBS) and overlays basic displacement and texture patterns to synthesize multi-view images.
Result: At 8,000 images, it outperforms manually curated data (from games and real-world objects) of the same scale; at 352,000 images, it matches, and on several benchmarks exceeds, models trained on over 692,000 manually curated images.
Conclusion: Lightweight procedural rules suffice to generate high-quality MVS training data efficiently, challenging the reliance on large-scale manually curated data and suggesting a new approach to data generation.
Abstract: In this paper, we explore the design space of procedural rules for multi-view stereo (MVS). We demonstrate that we can generate effective training data using SimpleProc: a new, fully procedural generator driven by a very small set of rules using Non-Uniform Rational Basis Splines (NURBS), as well as basic displacement and texture patterns. At a modest scale of 8,000 images, our approach achieves superior results compared to manually curated images (at the same scale) sourced from games and real-world objects. When scaled to 352,000 images, our method yields performance comparable to, and in several benchmarks exceeding, models trained on over 692,000 manually curated images. The source code and the data are available at https://github.com/princeton-vl/SimpleProc.
[268] Rethinking Model Efficiency: Multi-Agent Inference with Large Models
Sixun Dong,Juhua Hu,Steven Li,Wei Wen,Qi Qian
Main category: cs.CV
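TL;DR: This paper analyzes how the number of output tokens affects end-to-end latency in vision-language models (VLMs) and finds that a large model with a short output sequence can be more efficient than a small model with a long one. Based on this, it proposes a multi-agent inference framework that keeps the large model's responses short while reusing key reasoning tokens from a small model when necessary, achieving a better balance between efficiency and performance.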
Details
Motivation: Because VLMs decode autoregressively, the number of output tokens becomes the bottleneck for end-to-end latency, yet the number of tokens different models need to reach comparable performance varies widely, calling for systematic analysis and optimization.
Method: Analyzes the latency of each VLM component on simulated data and conducts an empirical study on several real-world benchmarks; then proposes a multi-agent inference framework in which the large model keeps its responses short and key reasoning tokens are transferred from the small model when necessary.
Result: Experiments show that a large model with few output tokens can outperform a small model with a long output sequence; by reusing the small model's reasoning tokens, the proposed framework approaches the performance of a large model with its own reasoning.
Conclusion: The number of output tokens is a key factor in VLM efficiency; coordinating large and small models and reusing key reasoning tokens can substantially improve inference efficiency without sacrificing performance.
Abstract: Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can be the bottleneck of the end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. The empirical study on diverse real-world benchmarks confirms the observation that a large model can achieve performance better than or comparable to that of a small model with significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps large models with short responses but transfers the key reasoning tokens from the small model when necessary. The comparison on benchmark tasks demonstrates that, by reusing the reasoning tokens from small models, the framework can approach the performance of a large model with its own reasoning, which confirms the effectiveness of our proposal.
[269] LoMa: Local Feature Matching Revisited
David Nordström,Johan Edstedt,Georg Bökman,Jonathan Astermark,Anders Heyden,Viktor Larsson,Mårten Wadenbäck,Michael Felsberg,Fredrik Kahl
Main category: cs.CV
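TL;DR: This paper proposes LoMa, which substantially improves local feature matching through large-scale data mixtures, modern training recipes, and scaled model capacity and compute, and it introduces the challenging HardMatch dataset to address the saturation of existing benchmarks.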
Details
Motivation: Local feature matching is essential to 3D vision, but its progress has lagged behind other data-driven approaches; current benchmarks are limited to relatively easy image pairs, so evaluation has saturated.
Method: Proposes LoMa, combining large and diverse data mixtures, modern training recipes, and scaled model capacity and compute; builds HardMatch, a new dataset of 1000 highly challenging image pairs with manually annotated ground-truth correspondences.
Result: LoMa outperforms the state of the art by a wide margin across benchmarks: +18.6 mAA on HardMatch, +29.5 mAA on WxBS, +21.4 (1m, 10°) on InLoc, +24.2 AUC on RUBIK, and +12.4 mAA on IMC 2022.
Conclusion: Systematic scaling of data, model, compute, and training from a data-driven perspective yields major gains in local feature matching, and HardMatch provides a more challenging benchmark for future research.
Abstract: Local feature matching has long been a fundamental component of 3D vision systems such as Structure-from-Motion (SfM), yet progress has lagged behind the rapid advances of modern data-driven approaches. The newer approaches, such as feed-forward reconstruction models, have benefited extensively from scaling dataset sizes, whereas local feature matching models are still only trained on a few mid-sized datasets. In this paper, we revisit local feature matching from a data-driven perspective. In our approach, which we call LoMa, we combine large and diverse data mixtures, modern training recipes, scaled model capacity, and scaled compute, resulting in remarkable gains in performance. Since current standard benchmarks mainly rely on collecting sparse views from successful 3D reconstructions, the evaluation of progress in feature matching has been limited to relatively easy image pairs. To address the resulting saturation of benchmarks, we collect 1000 highly challenging image pairs from internet data into a new dataset called HardMatch. Ground truth correspondences for HardMatch are obtained via manual annotation by the authors. In our extensive benchmarking suite, we find that LoMa makes outstanding progress across the board, outperforming the state-of-the-art method ALIKED+LightGlue by +18.6 mAA on HardMatch, +29.5 mAA on WxBS, +21.4 (1m, 10°) on InLoc, +24.2 AUC on RUBIK, and +12.4 mAA on IMC 2022. We release our code and models publicly at https://github.com/davnords/LoMa.
[270] PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding
Siyuan Liu,Chaoqun Zheng,Xin Zhou,Tianrui Feng,Dingkang Liang,Xiang Bai
Main category: cs.CV
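TL;DR: This paper proposes PointTPA, a test-time parameter adaptation framework for scene-level point cloud understanding. It generates input-aware network parameters via Serialization-based Neighborhood Grouping (SNG) and a Dynamic Parameter Projector (DPP), improving performance with very low parameter overhead (<2%) and reaching 78.4% mIoU on ScanNet.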
Details
Motivation: Existing methods for scene-level point cloud understanding are constrained by static network parameters and adapt poorly to diverse scene geometries, category distributions, and spatial layouts.
Method: Proposes PointTPA, with Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to generate patch-wise adaptive weights; it is integrated into the PTv3 structure by adding only two lightweight modules.
Result: Achieves 78.4% mIoU on the ScanNet validation set, surpassing existing parameter-efficient fine-tuning (PEFT) methods, with added parameters under 2% of the backbone.
Conclusion: Test-time dynamic parameter adaptation effectively improves 3D scene understanding while remaining highly parameter-efficient.
Abstract: Scene-level point cloud understanding remains challenging due to diverse geometries, imbalanced category distributions, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static network parameters during inference, limiting their adaptability to dynamic scene data. We propose PointTPA, a Test-time Parameter Adaptation framework that generates input-aware network parameters for scene-level point clouds. PointTPA adopts a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while maintaining a low parameter overhead. Integrated into the PTv3 structure, PointTPA demonstrates strong parameter efficiency by introducing two lightweight modules of less than 2% of the backbone's parameters. Despite this minimal parameter overhead, PointTPA achieves 78.4% mIoU on ScanNet validation, surpassing existing parameter-efficient fine-tuning (PEFT) methods across multiple benchmarks, highlighting the efficacy of our test-time dynamic network parameter adaptation mechanism in enhancing 3D scene understanding. The code is available at https://github.com/H-EmbodVis/PointTPA.
[271] Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
Hyunsoo Cha,Wonjung Woo,Byungjun Kim,Hanbyul Joo
Main category: cs.CV
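TL;DR: Vanast is an end-to-end framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video; its unified modeling addresses the identity drift, garment distortion, and front-back inconsistency of conventional two-stage methods.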