Table of Contents
cs.CL [Back]
[1] Drift and selection in LLM text ecosystems
Søren Riis
Main category: cs.CL
TL;DR: 本文提出一个可解析的数学框架,用于分析公共文本记录中生成文本递归影响学习过程的现象,区分了‘漂移’(导致文本趋同简化)和‘选择’(通过质量/新颖性筛选维持文本丰富性)两种机制,并为AI训练语料设计提供理论指导。
Details
Motivation: 公共文本记录正被自身生成的内容不断重塑,形成人与AI共同学习的闭环,亟需理解这种递归过程对语言多样性与结构深度的影响。 Method: 基于变阶n-gram代理构建精确可解的数学模型,形式化分析‘漂移’(无过滤复用)与‘选择’(出版、排序、验证等过滤机制)两类作用力。 Result: 证明在无限语料极限下,纯漂移导致稳定但浅层的分布;而规范性选择(如偏好质量或新颖性)能维持深层结构,并给出其偏离浅层均衡的最优上界。 Conclusion: 递归发布本身会压缩文本多样性,唯有引入有目标的选择机制才能维持语言结构的丰富性,这对构建鲁棒、高质量的AI训练语料具有关键指导意义。 Abstract: The public text record -- the material from which both people and AI systems now learn -- is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order $n$-gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative -- rewarding quality, correctness or novelty -- deeper structure persists, and we establish an optimal upper bound on the resulting divergence from shallow equilibria. The framework therefore identifies when recursive publication compresses public text and when selective filtering sustains richer structure, with implications for the design of AI training corpora.[2] SynDocDis: A Metadata-Driven Framework for Generating Synthetic Physician Discussions Using Large Language Models
Beny Rubinstein,Sergio Matos
Main category: cs.CL
TL;DR: 本文提出SynDocDis框架,利用结构化提示和隐私保护的去标识化病例元数据生成临床准确的医生间对话,经五位执业医师评估,其沟通效果和医学内容质量均表现优异,且严格保护医患隐私。
Details
Motivation: 隐私法规和伦理考量严重限制了医生间病例讨论数据的获取,而现有合成数据方法主要关注患者-医生互动或结构化医疗记录,缺乏对医生间交流的合成研究。 Method: 提出SynDocDis框架,结合结构化提示技术与隐私保护的去标识化病例元数据,生成医生间对话。 Result: 在九个肿瘤学和肝病学场景中,由五位执业医师评估,沟通有效性均值为4.4/5,医学内容质量均值为4.1/5,Kappa值为0.70,临床相关性评分为91%。 Conclusion: SynDocDis是一种有前景的框架,可在符合隐私规范的前提下推动医学AI研究,在医学教育和临床决策支持中有直接应用价值。 Abstract: Physician-physician discussions of patient cases represent a rich source of clinical knowledge and reasoning that could feed AI agents to enrich and even participate in subsequent interactions. However, privacy regulations and ethical considerations severely restrict access to such data. While synthetic data generation using Large Language Models offers a promising alternative, existing approaches primarily focus on patient-physician interactions or structured medical records, leaving a significant gap in physician-to-physician communication synthesis. We present SynDocDis, a novel framework that combines structured prompting techniques with privacy-preserving de-identified case metadata to generate clinically accurate physician-to-physician dialogues. Evaluation by five practicing physicians in nine oncology and hepatology scenarios demonstrated exceptional communication effectiveness (mean 4.4/5) and strong medical content quality (mean 4.1/5), with substantial interrater reliability (kappa = 0.70, 95% CI: 0.67-0.73). The framework achieved 91% clinical relevance ratings while maintaining doctors' and patients' privacy. These results place SynDocDis as a promising framework for advancing medical AI research ethically and responsibly through privacy-compliant synthetic physician dialogue generation with direct applications in medical education and clinical decision support.[3] EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context
Arth Singh
Main category: cs.CL
TL;DR: 本文通过使用指数移动平均(EMA)迹作为简单的时间累积机制,探究了高效序列模型相较于简单时间平均的优势。研究发现,EMA迹虽能编码时序结构,但会破坏词元身份信息,导致不可逆的信息损失;只有通过学习到的、依赖输入的选择机制才能解决这一问题。
Details
Motivation: 探究高效序列模型相对于简单时间平均(如EMA)的优势与局限,明确固定系数累积机制的能力边界。 Method: 采用指数移动平均(EMA)迹作为受控探针,结合Hebbian架构和多时间尺度迹进行无监督语法角色分配实验,并在语言建模中评估纯EMA上下文的效果及预测器消融分析。 Result: 多时间尺度EMA迹在无标签下达到BiGRU 96%性能,且在结构依赖任务上超越监督模型;仅用EMA上下文的130M参数LM在C4上困惑度达260(为GPT-2的8倍),且预测器替换实验表明性能差距完全源于迹本身的信息损失。 Conclusion: 固定系数的时间或深度累积会导致不可逆的信息稀释,唯有依赖输入的学习型选择机制才能克服该限制。 Abstract: What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.[4] Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
Arth Singh
Main category: cs.CL
TL;DR: 本文揭示了基于扩散的语言模型(dLLMs)在安全对齐上的结构性脆弱性:其安全性依赖于去噪过程的单调性和已提交token不可重审的假设;通过简单两步干预(重掩码拒绝token并添加肯定前缀)即可大幅提高攻击成功率,证明其安全机制浅层且非对抗鲁棒。
Details
Motivation: 现有dLLMs的安全对齐可能建立在脆弱的架构假设上,需验证其是否真正鲁棒。 Method: 提出一种无需梯度计算或对抗搜索的两步干预法:重掩码早期承诺的拒绝token,并注入12-token肯定前缀;同时对比引入Gumbel-softmax梯度优化扰动的效果。 Result: 在HarmBench上对LLaDA-8B-Instruct和Dream-7B-Instruct分别实现76.1%和81.8%攻击成功率(ASR),而加入梯度优化反而使ASR降至41.5%,证实漏洞源于结构而非可优化弱点。 Conclusion: dLLM的安全性是架构层面的浅层对齐,仅在标准去噪调度不被违反时成立;需设计安全感知的解掩码调度、步长条件前缀检测及提交后重验证等防御机制。 Abstract: Diffusion-based language models (dLLMs) generate text by iteratively denoising masked token sequences. We show that their safety alignment rests on a single fragile assumption: that the denoising schedule is monotonic and committed tokens are never re-evaluated. Safety-aligned dLLMs commit refusal tokens within the first 8-16 of 64 denoising steps, and the schedule treats these commitments as permanent. A trivial two-step intervention - re-masking these tokens and injecting a 12-token affirmative prefix - achieves 76.1% ASR on HarmBench (n=159, Lg=128) against LLaDA-8B-Instruct and 81.8% ASR (n=159) against Dream-7B-Instruct, without any gradient computation or adversarial search. The simplicity of this exploit is itself the central finding: augmenting with gradient-optimized perturbation via a differentiable Gumbel-softmax chain consistently degrades ASR (e.g., 41.5% vs. 76.1% at Lg=128), confirming that the vulnerability is structural rather than requiring sophisticated exploitation. These findings reveal that dLLM safety is not adversarially robust but architecturally shallow - it holds only because the denoising schedule is never violated. We discuss defenses including safety-aware unmasking schedules, step-conditional prefix detection, and post-commitment re-verification.[5] WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
Hanna Lee,Tan Dat Nguyen,Jaehoon Kang,Kyuhong Shim
Main category: cs.CL
TL;DR: WAND是一种针对自回归文本到语音(AR-TTS)模型的高效推理框架,通过窗口化注意力与知识蒸馏,在保持高保真语音质量的同时,将KV缓存内存减少66.2%,实现近常数级每步延迟。
Details
Motivation: 现有解码器-only AR-TTS模型因全自注意力导致内存与计算开销随序列长度呈平方增长,难以部署于长语音生成或资源受限场景。 Method: 提出WAND框架:1)将注意力分离为对条件token的持久全局注意力和对生成token的局部滑动窗口注意力;2)采用渐进收紧窗口的课程学习策略稳定微调;3)利用全注意力教师模型进行知识蒸馏以恢复音质。 Result: 在三个现代AR-TTS模型上验证,WAND在保持原始语音质量前提下,实现最高66.2% KV缓存内存降低,且每步延迟接近长度无关的常数水平。 Conclusion: WAND有效解耦了AR-TTS的质量与效率矛盾,为高质量长文本TTS的实际部署提供了可行路径。 Abstract: Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent global attention over conditioning tokens and local sliding-window attention over generated tokens. To stabilize fine-tuning, we employ a curriculum learning strategy that progressively tightens the attention window. We further utilize knowledge distillation from a full-attention teacher to recover high-fidelity synthesis quality with high data efficiency. Evaluated on three modern AR-TTS models, WAND preserves the original quality while achieving up to 66.2% KV cache memory reduction and length-invariant, near-constant per-step latency.[6] Medical Reasoning with Large Language Models: A Survey and MR-Bench
Xiaohan Ren,Chenxiao Fan,Wenyin Ma,Hongliang He,Chongming Gao,Xiaoyan Zhao,Fuli Feng
Main category: cs.CL
TL;DR: 本文综述了大语言模型在医学推理中的应用,基于认知理论将医学推理建模为溯因、演绎与归纳的迭代过程,系统梳理七类技术路径,并通过统一实验评估及新基准MR-Bench揭示当前模型在真实临床决策任务上的显著性能差距。
Details
Motivation: 临床决策具有安全性关键、情境依赖和证据动态演化等特点,仅靠事实记忆不足以支撑可靠推理,亟需对LLM医学推理能力进行系统性梳理与评估。 Method: 基于临床认知理论构建医学推理框架(溯因-演绎-归纳迭代),分类归纳七类技术路线(训练型与非训练型),开展跨基准统一实验评估,并构建源自真实医院数据的新基准MR-Bench。 Result: 现有模型在考试类任务上表现良好,但在MR-Bench真实临床任务上准确率显著下降,暴露出从考试到临床的性能鸿沟;统一评估揭示了不同方法的实际效果差异。 Conclusion: 当前LLM医学推理研究缺乏临床真实性验证,需发展更贴近真实诊疗流程的建模方法、评估基准与训练策略,以弥合实验室性能与临床需求之间的差距。 Abstract: Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.[7] Uncertainty Estimation for the Open-Set Text Classification systems
Leonid Erlygin,Alexey Zaytsev
Main category: cs.CL
TL;DR: 本文提出了一种面向开放集文本分类(OSTC)任务的HolUE方法,通过建模文本不确定性和图库不确定性,显著提升了预测拒绝率(PRR),并在多个数据集上大幅超越基线方法。
Details
Motivation: 开放集文本分类中需区分已知类与未知类,而现有方法难以有效建模不同来源的不确定性(如查询表述不清、数据分布模糊),导致识别错误难以预判。 Method: 将Holistic Uncertainty Estimation(HolUE)方法适配至文本领域,分别建模文本不确定性(源于不良查询)和图库不确定性(源于数据分布歧义),从而预测识别错误风险。 Result: 在Yahoo Answers、DBPedia、PAN作者归属、CLINC150等数据集上,HolUE相比SCF基线在PRR指标上提升40%–365%,例如Yahoo Answers达365%(0.79 vs 0.17)。 Conclusion: HolUE能有效分离并建模OSTC中的两类不确定性,显著提升系统对识别错误的预判能力,增强文本识别系统的鲁棒性与可信度。 Abstract: Accurate uncertainty estimation is essential for building robust and trustworthy recognition systems. In this paper, we consider the open-set text classification (OSTC) task - and uncertainty estimation for it. For OSTC a text sample should be classified as one of the existing classes or rejected as unknown. To account for the different uncertainty types encountered in OSTC, we adapt the Holistic Uncertainty Estimation (HolUE) method for the text domain. Our approach addresses two major causes of prediction errors in text recognition systems: text uncertainty that stems from ill formulated queries and gallery uncertainty that is related the ambiguity of data distribution. By capturing these sources, it becomes possible to predict when the system will make a recognition error. We propose a new OSTC benchmark and conduct extensive experiments on a wide range of data, utilizing the authorship attribution, intent and topic classification datasets. HolUE achieves 40-365% improvement in Prediction Rejection Ratio (PRR) over the quality-based SCF baseline across datasets: 365% on Yahoo Answers (0.79 vs 0.17 at FPIR 0.1), 347% on DBPedia (0.85 vs 0.19), 240% on PAN authorship attribution (0.51 vs 0.15 at FPIR 0.5), and 40% on CLINC150 intent classification (0.73 vs~0.52). We make public our code and protocols https://github.com/Leonid-Erlygin/text_uncertainty.git[8] A Representation-Level Assessment of Bias Mitigation in Foundation Models
Svetoslav Nizhnichenkov,Rahul Nair,Elizabeth Daly,Brian Mac Namee
Main category: cs.CL
TL;DR: 本文通过分析BERT和Llama2等基础模型在偏见缓解前后的词嵌入空间变化,发现去偏方法能有效减少性别与职业间的关联偏差,使表征更中立平衡;同时提出新数据集WinoDec以支持解码器模型的公平性评估。
Details
Motivation: 探究偏见缓解方法如何在内部表征层面改变基础模型(尤其是编码器和解码器架构)的嵌入空间,从而提供对模型行为的可解释性审计。 Method: 采用代表性模型BERT(编码器)和Llama2(解码器),对比其基线与去偏变体在性别-职业术语间语义关联的嵌入空间差异;并构建新数据集WinoDec用于解码器模型公平性评估。 Result: 偏见缓解显著降低了性别-职业在嵌入空间中的偏差关联,带来更中立、平衡的内部表征;该现象在两类模型中具有一致性;WinoDec数据集已开源。 Conclusion: 嵌入空间分析是验证基础模型去偏效果的有效且可解释的工具;公平性提升可体现为几何上可识别的表征变换。 Abstract: We investigate how successful bias mitigation reshapes the embedding space of encoder-only and decoder-only foundation models, offering an internal audit of model behaviour through representational analysis. Using BERT and Llama2 as representative architectures, we assess the shifts in associations between gender and occupation terms by comparing baseline and bias-mitigated variants of the models. Our findings show that bias mitigation reduces gender-occupation disparities in the embedding space, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, suggesting that fairness improvements can manifest as interpretable and geometric transformations. These results position embedding analysis as a valuable tool for understanding and validating the effectiveness of debiasing methods in foundation models. To further promote the assessment of decoder-only models, we introduce WinoDec, a dataset consisting of 4,000 sequences with gender and occupation terms, and release it to the general public. (https://github.com/winodec/wino-dec)[9] Neural networks for Text-to-Speech evaluation
Ilya Trofimenko,David Kocharyan,Aleksandr Zaitsev,Pavel Repnikov,Mark Levin,Nikita Shevtsov
Main category: cs.CL
TL;DR: 本文提出了一系列神经模型来自动评估TTS系统质量,包括用于相对评估的NeuralSBS和用于绝对评估的改进MOSNet与WhisperBert,显著优于人工评分基线,并通过消融实验验证了模型设计的有效性。
Details
Motivation: 人类主观评价(如MOS和SBS)虽为金标准,但成本高、速度慢且易受评估者偏差影响,亟需高效可靠的自动化替代方案。 Method: 提出NeuralSBS(基于HuBERT)用于SBS相对评估;改进MOSNet(定制序列长度批处理)和构建WhisperBert(Whisper音频特征与BERT文本嵌入的弱学习器堆叠融合)用于MOS绝对评估;开展消融实验并测试SpeechLM及零样本大模型等失败方案。 Result: NeuralSBS在SOMOS数据集上达73.7%准确率;最佳MOS模型RMSE≈0.40,优于人工评分基线0.62;证实堆叠融合优于跨注意力直接融合;SpeechLM和Qwen2-Audio、Gemini 2.5等零样本模型表现不佳。 Conclusion: 专用的、经任务适配的神经评估模型(尤其是多模态堆叠架构)可有效逼近专家判断,而通用大模型或简单融合策略尚不可行,凸显指标学习框架的必要性。 Abstract: Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating, and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy (on SOMOS dataset). For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak learners. Our best MOS models achieve a Root Mean Square Error (RMSE) of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62. Furthermore, our ablation studies reveal that naively fusing text via cross-attention can degrade performance, highlighting the effectiveness of ensemble-based stacking over direct latent fusion. We additionally report negative results with SpeechLM-based architectures and zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 flash preview), reinforcing the necessity of dedicated metric learning frameworks.[10] Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Mousa Salah,Amgad Muneer
Main category: cs.CL
TL;DR: 本文系统评估了不同采样温度和提示策略对扩展推理模型性能的影响,发现零样本提示在中等温度下表现最佳,而思维链提示在温度极端值下效果更好;此外,扩展推理的增益随温度升高而显著增加。
Details
Motivation: 当前对于扩展推理模型的最佳采样温度和提示策略配置尚缺乏深入探索。 Method: 在AMO-Bench数学难题集上,使用Grok-4.1模型,系统比较四种温度(0.0、0.4、0.7、1.0)下零样本提示与思维链提示的效果。 Result: 零样本提示在T=0.4和T=0.7时准确率达59%;思维链提示在温度极端值下更优;扩展推理增益从T=0.0时的6倍提升至T=1.0时的14.3倍。 Conclusion: 温度应与提示策略联合优化,而非默认使用T=0,这对推理任务的实践具有重要指导意义。 Abstract: Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.[11] Attention-Based Sampler for Diffusion Language Models
Yuyan Zhou,Kai Syun Hou,Weiyu Chen,James Kwok
Main category: cs.CL
TL;DR: 本文提出了一种基于注意力机制的解码顺序选择方法Attn-Sampler,通过理论分析证明按注意力矩阵列和降序解码可近似最大化序列似然,并在无需训练的前提下提升扩散语言模型(dLLMs)的生成质量与并行解码效率。
Details
Motivation: 现有扩散语言模型(dLLMs)的解码策略仅依赖词元级信息,忽略全局序列结构,导致性能受限;同时自回归模型存在推理效率与建模灵活性瓶颈。 Method: 从对数似然最大化出发,理论推导出按注意力矩阵列和降序进行token解码可近似最优;据此设计无训练解码算法Attn-Sampler,并引入块注意力近似与动态注意力阈值以加速。 Result: 在多个基准测试中验证了Attn-Sampler的有效性:相比现有方法,在保持甚至提升生成质量的同时显著增强了解码并行性。 Conclusion: 注意力矩阵列和是指导dLLMs解码顺序的理论可靠指标,Attn-Sampler为扩散语言模型提供了一种高效、无需训练且原理坚实的解码新范式。 Abstract: Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential decoding paradigm imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel decoding and flexible language modeling. Despite these advantages, current dLLMs decoding strategies rely primarily on token level information, which fails to account for global sequence structure and often yields suboptimal results. In this paper, we study the decoding order selection problem from the perspective of log-likelihood maximization. We theoretically demonstrate that optimal sequence likelihood can be approximately achieved by decoding tokens in descending order of their attention matrix column sums. This finding provides a principled justification for attention-guided decoding and offers a theoretically grounded alternative to greedy search. We instantiate this theoretical insight in a new training-free decoding algorithm, termed Attn-Sampler, and further propose a block attention approximation and dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of our proposed method, demonstrating that it achieves superior generation quality while enhancing the decoding parallelism.[12] Dynamic sparsity in tree-structured feed-forward layers at scale
Reza Sedghi,Robin Schiewer,Anand Subramoney,David Kappel
Main category: cs.CL
TL;DR: 本文提出了一种树结构稀疏前馈层,作为Transformer中MLP块的即插即用替代方案,通过硬性分层路由实现条件计算,无需额外路由器网络;在语言建模和问答任务中,仅激活不到5%的单元即可匹配密集基线性能,并发现一种由硬路由与非对称非线性相互作用引发的自动剪枝现象。
Details
Motivation: 传统Transformer中前馈MLP块占用大量计算资源,亟需稀疏化以提升效率;现有稀疏方法常依赖额外路由器网络,增加复杂性与开销。 Method: 设计树结构的稀疏前馈层,采用硬性分层路由进行条件计算,作为MLP块的即插即用替代;不引入独立路由器网络,通过架构设计调控训练动态以实现平衡树结构。 Result: 在自回归语言建模与问答(含零样本/少样本)任务中,模型激活参数<5%仍可匹敌密集基线;观察到‘自动剪枝’现象,使动态路由部分转化为静态结构稀疏;无需辅助损失即可获得平衡树结构。 Conclusion: 树结构前馈层是一种可扩展、可控的大模型稀疏化机制,兼具高效性与训练稳定性。 Abstract: At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied for autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and its scalability beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, yielding partial conversion of dynamic routing into static structural sparsity. We show that simple architectural choices can modulate this behavior and recover balanced trees without auxiliary losses. Overall, our work demonstrates that tree-structured feed-forward layers provide a scalable and controllable mechanism for sparsifying large transformer models.[13] Sentiment Classification of Gaza War Headlines: A Comparative Analysis of Large Language Models and Arabic Fine-Tuned BERT Models
Amr Eleraqi,Hager H. Mustafa,Abdul Hadi N. Ahmed
Main category: cs.CL
TL;DR: 本研究以2023年加沙战争为案例,比较三类大语言模型与六种微调阿拉伯语BERT模型在冲突相关媒体话语情感分析中的解释差异,发现模型架构本身即构成不同的情感解读视角,尤其揭示了LLM倾向于放大负面情绪、BERT偏爱中性判断等系统性偏差。
Details
Motivation: 传统情感分析多追求对单一金标准的准确率,而本文旨在揭示不同AI架构对冲突话语的情感解读本质上是带有立场的解释行为,需从认识论角度审视算法输出的主观性与建构性。 Method: 基于10990条阿拉伯语新闻标题,对比三类大语言模型(如LLaMA-3.1-8B、GPT-4.1)与六种微调阿拉伯BERT模型(如MARBERT);采用信息论与分布度量(香农熵、JS距离、方差分)量化模型间情感分布差异;引入框架条件分析考察叙事框架对情感判断的调节作用。 Result: 模型间存在显著且非随机的情感分布差异:微调BERT(尤MARBERT)强烈偏向中性分类;LLM普遍放大负面情绪,LLaMA-3.1-8B几近完全归为负面;GPT-4.1能依人道、法律、安全等叙事框架动态调整情感判断,其他LLM则缺乏此类语境调制能力。 Conclusion: 模型选择即意味着选择一种解释性视角,算法情感输出并非中立客观的媒体语调测量,而是在战争与危机语境中主动参与叙事建构与情绪赋义;研究呼吁将算法差异本身作为分析对象,并警惕其在敏感议题中的认知与伦理风险。 Abstract: This study examines how different artificial intelligence architectures interpret sentiment in conflict-related media discourse, using the 2023 Gaza War as a case study. Drawing on a corpus of 10,990 Arabic news headlines (Eleraqi 2026), the research conducts a comparative analysis between three large language models and six fine-tuned Arabic BERT models. Rather than evaluating accuracy against a single human-annotated gold standard, the study adopts an epistemological approach that treats sentiment classification as an interpretive act produced by model architectures. To quantify systematic differences across models, the analysis employs information-theoretic and distributional metrics, including Shannon Entropy, Jensen-Shannon Distance, and a Variance Score measuring deviation from aggregate model behavior. The results reveal pronounced and non-random divergence in sentiment distributions. Fine-tuned BERT models, particularly MARBERT, exhibit a strong bias toward neutral classifications, while LLMs consistently amplify negative sentiment, with LLaMA-3.1-8B showing near-total collapse into negativity. Frame-conditioned analysis further demonstrates that GPT-4.1 adjusts sentiment judgments in line with narrative frames (e.g., humanitarian, legal, security), whereas other LLMs display limited contextual modulation. These findings suggest that the choice of model constitutes a choice of interpretive lens, shaping how conflict narratives are algorithmically framed and emotionally evaluated. The study contributes to media studies and computational social science by foregrounding algorithmic discrepancy as an object of analysis and by highlighting the risks of treating automated sentiment outputs as neutral or interchangeable measures of media tone in contexts of war and crisis.[14] Multi-User Large Language Model Agents
Shu Yang,Shenzhe Zhu,Hao Zhu,José Ramón Enríquez,Di Wang,Alex Pentland,Michiel A. Bakker,Jiaxin Pei
Main category: cs.CL
TL;DR: 本文首次系统研究了多用户环境下的大语言模型(LLM)代理,将其建模为多主体决策问题,并提出统一交互协议与三类压力测试场景,揭示了当前前沿LLM在目标优先级稳定性、隐私保护和协作效率方面的系统性缺陷。
Details
Motivation: 随着LLM代理被部署于团队协作与组织工具中,其需同时服务多个具有不同角色、偏好和权限的用户,而现有系统主要面向单主体交互范式,难以应对多用户场景下的冲突、信息不对称与隐私约束。 Method: 形式化多用户LLM交互为多主体决策问题;提出统一的多用户交互协议;设计三类压力测试场景(指令遵循、隐私保护、协调能力)评估当前LLM表现。 Result: 前沿LLM在多用户场景下存在三大系统性缺陷:无法稳定维持冲突目标间的优先级、多轮交互中隐私泄露加剧、需迭代信息收集的协调任务中效率显著下降。 Conclusion: 单主体优化范式不足以支撑多用户LLM代理的实际应用,亟需发展支持多主体协同、隐私敏感与动态优先级管理的新框架与评估标准。 Abstract: Large language models (LLMs) and LLM-based agents are increasingly deployed as assistants in planning and decision making, yet most existing systems are implicitly optimized for a single-principal interaction paradigm, in which the model is designed to satisfy the objectives of one dominant user whose instructions are treated as the sole source of authority and utility. However, as they are integrated into team workflows and organizational tools, they are increasingly required to serve multiple users simultaneously, each with distinct roles, preferences, and authority levels, leading to multi-user, multi-principal settings with unavoidable conflicts, information asymmetry, and privacy constraints. In this work, we present the first systematic study of multi-user LLM agents. We begin by formalizing multi-user interaction with LLM agents as a multi-principal decision problem, where a single agent must account for multiple users with potentially conflicting interests and associated challenges. We then introduce a unified multi-user interaction protocol and design three targeted stress-testing scenarios to evaluate current LLMs' capabilities in instruction following, privacy preservation, and coordination. Our results reveal systematic gaps: frontier LLMs frequently fail to maintain stable prioritization under conflicting user objectives, exhibit increasing privacy violations over multi-turn interactions, and suffer from efficiency bottlenecks when coordination requires iterative information gathering.[15] Can We Still Hear the Accent? Investigating the Resilience of Native Language Signals in the LLM Era
Nabelanita Utami,Sasano Ryohei
Main category: cs.CL
TL;DR: This paper investigates whether the rise of LLM-based writing tools is causing homogenization in academic writing by analyzing native language identification (NLI) trends across three eras in ACL Anthology papers; results show a general decline in NLI performance—suggesting increased linguistic homogenization—but with notable anomalies across languages.
Details
Motivation: To examine whether the increasing use of LLM-based writing assistance is leading to linguistic homogenization in academic research papers, potentially obscuring authors' native language backgrounds. Method: The authors analyze ACL Anthology papers across three eras (pre-NN, pre-LLM, post-LLM), build a labeled dataset via semi-automated labeling, and fine-tune an NLI classifier to detect linguistic fingerprints tied to author native languages. Result: NLI performance consistently declines over time—indicating growing linguistic uniformity—but with language-specific anomalies: Chinese and French show unexpected resilience or divergence, while Japanese and Korean exhibit sharper-than-expected declines. Conclusion: The adoption of LLM writing tools appears to be contributing to stylistic and linguistic homogenization in academic writing, though the effect varies significantly across languages—highlighting both a global trend and important cross-linguistic differences. Abstract: The evolution of writing assistance tools from machine translation to large language models (LLMs) has changed how researchers write. This study investigates whether this shift is homogenizing research papers by analyzing native language identification (NLI) trends in ACL Anthology papers across three eras: pre-neural network (NN), pre-LLM, and post-LLM. We construct a labeled dataset using a semi-automated framework and fine-tune a classifier to detect linguistic fingerprints of author backgrounds. Our analysis shows a consistent decline in NLI performance over time. Interestingly, the post-LLM era reveals anomalies: while Chinese and French show unexpected resistance or divergent trends, Japanese and Korean exhibit sharper-than-expected declines.[16] Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean
Aleksandr Meshkov
Main category: cs.CL
TL;DR: 本文提出了一种名为Temperature-Controlled Verdict Aggregation (TCVA)的新评估方法,通过引入温度参数T调节评估严格度,使其更贴合不同应用场景的人类判断,并在多个基准数据集上验证了其有效性。
Details
Motivation: 现有LLM评估方法(如LLM-as-a-Judge、判决系统、NLI)难以根据应用领域自适应调整严格程度,导致与人类评估一致性不足。 Method: 提出TCVA方法:结合五级判决评分体系、广义幂均值聚合及可调温度参数T∈[0.1,1.0],低T值对应保守评估(适用于安全关键场景),高T值对应宽松评估(适用于对话AI)。 Result: 在SummEval和USR等含人类Likert评分的三个基准数据集上,TCVA在忠实性指标上与RAGAS相关性相当(Spearman=0.667 vs. 0.676),并持续优于DeepEval;且调节T无需额外LLM调用。 Conclusion: TCVA是一种灵活、高效、无需额外计算开销的LLM评估方法,能更好对齐人类判断,适用于多类AI应用场景。 Abstract: Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and NLI, do not always align well with human assessment because they cannot adapt their strictness to the application domain. This paper presents Temperature-Controlled Verdict Aggregation (TCVA), a method that combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter T [0.1, 1.0] to control evaluation rigor. Low temperatures yield pessimistic scores suited for safety-critical domains; high temperatures produce lenient scores appropriate for conversational AI. Experimental evaluation on three benchmark datasets with human Likert-scale annotations (SummEval and USR) shows that TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman = 0.667 vs. 0.676) while consistently outperforming DeepEval. The method requires no additional LLM calls when adjusting the temperature parameter.[17] EXAONE 4.5 Technical Report
Eunbi Choi,Kibong Choi,Sehyun Chun,Seokhee Hong,Junwon Hwang,Hyojin Jeon,Ahra Jo,Hyunjik Jo,Yeonsik Jo,Joonkee Kim,Seonghwan Kim,Soyeon Kim,Sunkyoung Kim,Yireun Kim,Yongil Kim,Changhun Lee,Haeju Lee,Jinsik Lee,Kyungmin Lee,Sangha Park,Kwangrok Ryoo,Minju Seo,Sejong Yang,Heuiyeen Yeen,Hwan Chang,Stanley Jungkyu Choi,Yejin Choi,Kyubeen Han,Joonwon Jang,Kijeong Jeon,Geunyeong Jeong,Gerrard Jeongwon Jo,Jiyeon Jung,Daeseong Kim,Dohoon Kim,Dohyun Kim,Hyunseo Kim,Minu Kim,Myoungshin Kim,Youchul Kim,Byungoh Ko,Christopher Lee,Edward Hwayoung Lee,Honglak Lee,Jiyoung Lee,Sangeun Lee,Seungwon Lim,Woohyung Lim,Jueun Mun,Jaewoo Park,Jimin Park,Jinho Park,Yongmin Park,Wooseok Seo,Yongwoo Song,Sihyuk Yi,Kyungjae Yoo,Sangyeon Yoon
Main category: cs.CL
TL;DR: EXAONE 4.5是LG AI Research发布的首个开源视觉语言模型,通过在EXAONE 4.0基础上集成专用视觉编码器,并在精心筛选的文档导向数据上进行多模态预训练,显著提升文档理解与韩语推理能力,支持256K长上下文。
Details
Motivation: 提升文档理解能力并支持LG战略应用领域(如企业级文档处理),推动AI在工业场景中的实用化部署。 Method: 在EXAONE 4.0语言模型基础上集成专用视觉编码器,开展端到端多模态预训练;使用大规模、重点 curated 的文档中心语料进行训练;扩展上下文长度至256K tokens。 Result: 在文档理解与韩语上下文推理任务上超越同规模SOTA模型,在通用语言基准测试中表现具竞争力,并实现长上下文推理能力。 Conclusion: EXAONE 4.5是面向实际工业应用的开源多模态基础模型,具备强文档理解能力与可扩展性,为后续多领域适配奠定基础。 Abstract: This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG's strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG's ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.[18] Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?
Chia-Hsuan Lee,Mingyang Zhou,Renkun Ni,Zelei Cheng,Sihui Dai,Supriyo Chakraborty,Shixiong Zhang,Sambit Sahu,William Campbell
Main category: cs.CL
TL;DR: 本文研究了偏好优化方法(如DPO和KTO)中影响推理能力提升的关键因素,发现生成器层面的质量差异(generator-level delta)和样本层面的质量差异(sample-level delta)共同决定下游推理性能;增大前者可提升泛化能力,后者可用于高效数据筛选。
Details
Motivation: 现有偏好优化方法(如DPO、KTO)虽广泛应用,但尚不清楚偏好数据的哪些属性真正驱动语言模型在通用推理任务上的性能提升。 Method: 从两个维度定义并量化‘质量差’(delta):1)生成器层面delta——由生成‘chosen’与‘rejected’推理轨迹的不同模型的能力差异引起,通过改变模型规模和家族来调控;2)样本层面delta——由单个偏好对内人工或LLM评判的质量差异引起,使用LLM-as-a-judge沿多个推理质量维度打分。 Result: 增大generator-level delta能持续提升模型在域外推理任务上的表现;基于sample-level delta过滤数据可实现更高效训练(即更少数据达到同等性能)。 Conclusion: 提升推理性能的双重策略:构建偏好对时应最大化generator-level delta;训练前应利用sample-level delta筛选最具信息量的样本。 Abstract: Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair improve a reasoning model's performance on general reasoning tasks? We investigate two distinct notions of quality delta in preference data: generator-level delta, arising from the differences in capability between models that generate chosen and rejected reasoning traces, and sample-level delta, arising from differences in judged quality differences within an individual preference pair. To study generator-level delta, we vary the generator's scale and model family, and to study sample-level delta, we employ an LLM-as-a-judge to rate the quality of generated traces along multiple reasoning-quality dimensions. We find that increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks and filtering data by sample-level delta can enable more data-efficient training. Our results suggest a twofold recipe for improving reasoning performance through preference optimization: maximize generator-level delta when constructing preference pairs and exploit sample-level delta to select the most informative training examples.[19] LLMs Underperform Graph-Based Parsers on Supervised Relation Extraction for Complex Graphs
Paolo Gajo,Domenic Rosati,Hassan Sajjad,Alberto Barrón-Cedeño
Main category: cs.CL
TL;DR: 本文发现,在处理具有复杂语言图结构的文本时,大型语言模型(LLM)的关系抽取性能反而不如更轻量级的图解析器。
Details
Motivation: 尽管大语言模型在关系抽取任务中展现出潜力,但其在处理复杂语言图结构时的表现尚不明确,作者旨在探究其实际适用边界。 Method: 在六个具有不同规模与复杂度句子图的关系抽取数据集上,对比评估四个大语言模型与一个图解析器的性能。 Result: 随着输入文档中关系数量增加,图解析器性能持续超越大语言模型,尤其在复杂语言图场景下优势显著。 Conclusion: 对于复杂语言图结构的关系抽取任务,轻量级图解析器比大语言模型更具优势,提示需根据图复杂度选择合适方法。 Abstract: Relation extraction represents a fundamental component in the process of creating knowledge graphs, among other applications. Large language models (LLMs) have been adopted as a promising tool for relation extraction, both in supervised and in-context learning settings. However, in this work we show that their performance still lags behind much smaller architectures when the linguistic graph underlying a text has great complexity. To demonstrate this, we evaluate four LLMs against a graph-based parser on six relation extraction datasets with sentence graphs of varying sizes and complexities. Our results show that the graph-based parser increasingly outperforms the LLMs, as the number of relations in the input documents increases. This makes the much lighter graph-based parser a superior choice in the presence of complex linguistic graphs.[20] Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models
Yousra Fettach,Guillaume Bied,Hannu Toivonen,Tijl De Bie
Main category: cs.CL
TL;DR: 本文研究大型语言模型(LLM)在幽默判断任务(Cards Against Humanity)中与人类偏好的对齐程度,发现模型虽优于随机水平,但与人类一致性较低,且模型间一致性远高于与人类的一致性,提示其幽默判断可能受位置偏差和内容偏好等结构因素影响,而非真实幽默理解。
Details
Motivation: 幽默是人类交流中高度文化嵌入且社会意义重大的维度,但在大语言模型对齐研究中长期被忽视。 Method: 让5个前沿语言模型与人类玩家进行相同规则的Cards Against Humanity游戏,在9894轮中每轮从10张候选卡中选出最搞笑的一张,并评估模型与人类偏好的一致性及模型间一致性,进一步分析位置偏差与内容偏好等潜在影响因素。 Result: 所有模型表现均超过随机基线,但与人类偏好对齐程度有限;模型之间的一致性显著高于模型与人类之间的一致性;该现象部分可由系统性位置偏差和内容偏好解释。 Conclusion: LLM在幽默判断上的表现可能更多反映推理与对齐过程中的结构性偏差,而非真正的幽默理解或文化敏感性,提示需谨慎评估其在主观、文化密集型任务中的对齐质量。 Abstract: Humor is one of the most culturally embedded and socially significant dimensions of human communication, yet it remains largely unexplored as a dimension of Large Language Model (LLM) alignment. In this study, five frontier language models play the same Cards Against Humanity games (CAH) as human players. The models select the funniest response from a slate of ten candidate cards across 9,894 rounds. While all models exceed the random baseline, alignment with human preference remains modest. More striking is that models agree with each other substantially more often than they agree with humans. We show that this preference is partly explained by systematic position biases and content preferences, raising the question whether LLM humor judgment reflects genuine preference or structural artifacts of inference and alignment.[21] Revisiting Anisotropy in Language Transformers: The Geometry of Learning Dynamics
Raphael Bernas,Fanny Jourdan,Antonin Poché,Céline Hudelot
Main category: cs.CL
TL;DR: 本文通过几何论证和实证分析,揭示了Transformer模型中各向异性现象的切线对齐本质,并利用基于概念的机制可解释性方法,在训练过程中拟合激活导出的低秩切线代理,验证其比普通梯度更能捕捉梯度各向异性。
Details
Motivation: Transformer模型中存在固有的各向异性现象,但此前理论研究很少基于表征几何基础;本文旨在从几何角度深入理解该现象的成因。 Method: 从几何角度推导频率偏差采样如何削弱曲率可见性、训练如何优先放大切线方向;并在训练过程中使用基于概念的机制可解释性方法,拟合激活导出的低秩切线代理,与反向传播的真实梯度进行对比。 Result: 在编码器式和解码器式语言模型上均发现,激活导出的方向不仅捕获异常大的梯度能量,还比同秩法向控制组捕获显著更多的梯度各向异性。 Conclusion: 实验结果为各向异性源于切线对齐提供了强有力的经验支持,推动了对Transformer表征几何结构的深入理解。 Abstract: Since their introduction, Transformer architectures have dominated Natural Language Processing (NLP). However, recent research has highlighted an inherent anisotropy phenomenon in these models, presenting a significant challenge to their geometric interpretation. Previous theoretical studies on this phenomenon are rarely grounded in the underlying representation geometry. In this paper, we extend them by deriving geometric arguments for how frequency-biased sampling attenuates curvature visibility and why training preferentially amplify tangent directions. Empirically, we then use concept-based mechanistic interpretability during training, rather than only post hoc, to fit activation-derived low-rank tangent proxies and test them against ordinary backpropagated true gradients. Across encoder-style and decoder-style language models, we find that these activation-derived directions capture both unusually large gradient energy and a substantially larger share of gradient anisotropy than matched-rank normal controls, providing strong empirical support for a tangent-aligned account of anisotropy.[22] MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation
Jyotika Singh,Fang Tu,Miguel Ballesteros,Weiyi Sun,Sandip Ghoshal,Michelle Yuan,Yassine Benajiba,Sujith Ravi,Dan Roth
Main category: cs.CL
TL;DR: 本文提出MT-OSC框架,通过Condenser Agent和轻量级Decider自动压缩多轮对话历史,在不干扰用户体验的前提下显著减少token数(最高达72%),提升或保持多轮对话中的模型准确性,并降低延迟与计算成本。
Details
Motivation: 大语言模型在多轮对话中因指令与上下文分散、历史全量拼接导致上下文窗口迅速耗尽,引发性能下降、延迟增加和成本上升。 Method: 提出One-off Sequential Condensation(MT-OSC)框架,包含基于少样本推理的Condenser和轻量级Decider,后台自动选择性保留关键信息以压缩对话历史。 Result: 在13个SOTA LLM和多个多轮基准上验证,MT-OSC将10轮对话token数最多压缩72%,多轮性能差距显著缩小,准确率提升或保持,对干扰项鲁棒。 Conclusion: MT-OSC是一种可扩展的多轮对话优化方案,能在受限输入空间内支持更丰富上下文,兼顾性能、延迟与成本。 Abstract: Large language models (LLMs) suffer significant performance degradation when user instructions and context are distributed over multiple conversational turns, yet multi-turn (MT) interactions dominate chat interfaces. The routine approach of appending full chat history to prompts rapidly exhausts context windows, leading to increased latency, higher computational costs, and diminishing returns as conversations extend. We introduce MT-OSC, a One-off Sequential Condensation framework that efficiently and automatically condenses chat history in the background without disrupting the user experience. MT-OSC employs a Condenser Agent that uses a few-shot inference-based Condenser and a lightweight Decider to selectively retain essential information, reducing token counts by up to 72% in 10-turn dialogues. Evaluated across 13 state-of-the-art LLMs and diverse multi-turn benchmarks, MT-OSC consistently narrows the multi-turn performance gap - yielding improved or preserved accuracy across datasets while remaining robust to distractors and irrelevant turns. Our results establish MT-OSC as a scalable solution for multi-turn chats, enabling richer context within constrained input spaces, reducing latency and operational cost, while balancing performance.[23] MedConceal: A Benchmark for Clinical Hidden-Concern Reasoning Under Partial Observability
Yikun Han,Joey Chan,Jingyuan Chen,Mengting Ai,Simo Du,Yue Guo
Main category: cs.CL
TL;DR: 本文提出MedConceal基准,用于评估医疗对话中对患者隐藏担忧的推理能力,强调在信息不完全条件下的医患沟通建模,发现当前大模型在确认隐藏担忧方面表现不一,而在干预(即解决主要担忧并引导治疗计划)方面仍显著落后于人类医生。
Details
Motivation: 现有医疗对话基准大多忽略患者隐藏状态(如恐惧、误解、实际障碍)带来的部分可观测性挑战,将隐含担忧的主动探询简化为显式信息提取,无法真实评估临床沟通中的推理过程。 Method: 构建MedConceal基准,包含300个基于真实在线健康讨论整理的病例及600次临床医生与大语言模型的交互;设计一个可交互的患者模拟器,内部维护基于文献和专家分类法定义的隐藏担忧,对外不可见,并通过理论支撑的回合级通信信号追踪担忧是否被揭示与解决;所有病例经临床医生评审确保医学合理性。 Result: 前沿大模型在不同‘确认’指标上表现各异,但在‘干预’成功率上,人类医生(N=159)显著优于所有测试模型;整体表明隐藏担忧推理仍是医疗对话系统的关键未解难题。 Conclusion: MedConceal揭示了当前医疗对话AI在部分可观测环境下的根本局限——尤其在主动探询、确认并有效响应患者隐藏关切方面尚未达到临床可用水平,亟需面向过程、理论驱动的建模与评估新范式。 Abstract: Patient-clinician communication is an asymmetric-information problem: patients often do not disclose fears, misconceptions, or practical barriers unless clinicians elicit them skillfully. Effective medical dialogue therefore requires reasoning under partial observability: clinicians must elicit latent concerns, confirm them through interaction, and respond in ways that guide patients toward appropriate care. However, existing medical dialogue benchmarks largely sidestep this challenge by exposing hidden patient state, collapsing elicitation into extraction, or evaluating responses without modeling what remains hidden. We present MedConceal, a benchmark with an interactive patient simulator for evaluating hidden-concern reasoning in medical dialogue, comprising 300 curated cases and 600 clinician-LLM interactions. Built from clinician-answered online health discussions, each case pairing clinician-visible context with simulator-internal hidden concerns derived from prior literature and structured using an expert-developed taxonomy. The simulator withholds these concerns from the dialogue agent, tracks whether they have been revealed and addressed via theory-grounded turn-level communication signals, and is clinician-reviewed for clinical plausibility. This enables process-aware evaluation of both task success and the interaction process that leads to it. We study two abilities: confirmation, surfacing hidden concerns through multi-turn dialogue, and intervention, addressing the primary concern and guiding the patient toward a target plan. Results show that no single system dominates: frontier models lead on different confirmation metrics, while human clinicians (N=159) remain strongest on intervention success. Together, these results identify hidden-concern reasoning under partial observability as a key unresolved challenge for medical dialogue systems.[24] Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation
Sophie Wu,Andrew Piper
Main category: cs.CL
TL;DR: 本文提出多语言故事道德生成任务,通过14种语言文化对的人类标注数据集评估大模型在道德理解上的文化适应性;发现GPT-4o等前沿模型虽能较好拟合人类主流道德判断,但缺乏跨语言文化差异性与价值多样性。
Details
Motivation: 故事是跨文化传播价值观的关键载体,但其道德解读因语言与文化背景而异;现有语言模型评估缺乏对文化语境敏感性的动态叙事理解任务。 Method: 构建覆盖14种语言-文化对的人类撰写故事道德数据集,采用语义相似度、人类偏好调查和价值观分类三种方式评估模型输出。 Result: GPT-4o和Gemini等前沿模型生成的道德判断与人类响应语义相似度高、受人类偏好认可,但跨语言变异性低,聚焦于少数普世价值,缺乏文化特异性。 Conclusion: 当前大模型可近似人类道德判断的中心趋势,但难以再现真实人类叙事理解中的文化多样性;将叙事解读建模为评价任务,为语言模型的文化对齐研究提供了新范式。 Abstract: Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. Thus, we introduce multilingual story moral generation as a novel culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization. We show that frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human responses and preferred by human evaluators. However, their outputs exhibit markedly less cross-linguistic variation and concentrate on a narrower set of widely shared values. These findings suggest that while contemporary models can approximate central tendencies of human moral interpretation, they struggle to reproduce the diversity that characterizes human narrative understanding. By framing narrative interpretation as an evaluative task, this work introduces a new approach to studying cultural alignment in language models beyond static benchmarks or knowledge-based tests.[25] Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching
Cyrus Zhou,Yufei Jin,Yilin Xu,Yu-Chiang Wang,Chieh-Ju Chao,Monica S. Lam
Main category: cs.CL
TL;DR: 本文提出SatIR,一种基于约束满足的临床试验检索方法,利用SMT和关系代数建模,并结合LLM将非结构化临床信息转化为可解释的精确约束,在召回率、精度和可解释性上优于现有方法。
Details
Motivation: 现有基于关键词和嵌入相似性的临床试验匹配方法存在低召回、低精度和不可解释等问题,难以应对复杂的入组约束。 Method: 采用Satisfiability Modulo Theories(SMT)与关系代数进行形式化建模;结合LLM将患者记录和试验资格标准中的模糊性、隐含假设与不完整性转化为显式、可控、可解释的逻辑约束。 Result: 在59名患者和3621项试验上的实验表明,SatIR相较TrialGPT:每患者多检出32%-72%相关且合格的试验;召回率提升22–38个百分点;服务更多至少有一项可用试验的患者;单次检索耗时仅2.95秒。 Conclusion: SatIR是一种可扩展、高效且可解释的临床试验匹配新范式,显著提升了精准匹配能力与临床实用性。 Abstract: Clinical trials are central to evidence-based medicine, yet many struggle to meet enrollment targets, despite the availability of over half a million trials listed on ClinicalTrials.gov, which attracts approximately two million users monthly. Existing retrieval techniques, largely based on keyword and embedding-similarity matching between patient profiles and eligibility criteria, often struggle with low recall, low precision, and limited interpretability due to complex constraints. We propose SatIR, a scalable clinical trial retrieval method based on constraint satisfaction, enabling high-precision and interpretable matching of patients to relevant trials. Our approach uses formal methods -- Satisfiability Modulo Theories (SMT) and relational algebra -- to efficiently represent and match key constraints from clinical trials and patient records. Beyond leveraging established medical ontologies and conceptual models, we use Large Language Models (LLMs) to convert informal reasoning regarding ambiguity, implicit clinical assumptions, and incomplete patient records into explicit, precise, controllable, and interpretable formal constraints. Evaluated on 59 patients and 3,621 trials, SatIR outperforms TrialGPT on all three evaluated retrieval objectives. It retrieves 32%-72% more relevant-and-eligible trials per patient, improves recall over the union of useful trials by 22-38 points, and serves more patients with at least one useful trial. Retrieval is fast, requiring 2.95 seconds per patient over 3,621 trials. These results show that SatIR is scalable, effective, and interpretable.[26] Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition
Jing Jie Tan,Ban-Hoe Kwan,Danny Wee-Kiat Ng,Yan-Chai Hum,Noriyuki Kawarazaki,Kosuke Takano
Main category: cs.CL
TL;DR: 本文提出ADAM方法,通过人格引导的生成增强(PIGA)和跨语言注意力蒸馏(CLAD),在多语言人格识别任务中取得显著性能提升。
Details
Motivation: 现有研究缺乏多语言人格识别数据集,限制了该领域的发展。 Method: 利用英语人格数据集作为主源,结合大语言模型进行翻译增强,并引入人格引导的生成增强(PIGA)构建多语言训练数据;进一步提出跨语言注意力蒸馏(CLAD)来训练跨语言人格识别模型。 Result: CLAD在多个数据集和语言上均显著优于标准BCE损失,平均BA分数分别提升0.0573和0.0968;模型具备强泛化能力,性能媲美当前领先编码器模型。 Conclusion: ADAM是一种先进且有效的多语言人格识别方法,解决了数据稀缺问题,并在性能与泛化性上均表现出色。 Abstract: While significant work has been done on personality recognition, the lack of multilingual datasets remains an unresolved challenge. To address this, we propose ADAM (Cross-Lingual (A)ttention (D)istillation with Personality-Guided Generative (A)ugmentation for (M)ultilingual Personality Recognition), a state-of-the-art approach designed to advance multilingual personality recognition. Our approach leverages an existing English-language personality dataset as the primary source and employs a large language model (LLM) for translationbased augmentation, enhanced by Personality-Informed Generative Augmentation (PIGA), to generate high-quality training data in multiple languages, including Japanese, Chinese, Malay, and French. We provide a thorough analysis to justify the effectiveness of these augmentation techniques. Building on these advancements, ADAM integrates Cross-Lingual Attention Distillation (CLAD) to train a model capable of understanding and recognizing personality traits across languages, bridging linguistic and cultural gaps in personality analysis. This research presents a thorough evaluation of the proposed augmentation method, incorporating an ablation study on recognition performance to ensure fair comparisons and robust validation. Overall, with PIGA augmentation, the findings demonstrate that CLAD significantly outperforms the standard BCE across all languages and personality traits, achieving notable improvements in average BA scores - 0.6332 (+0.0573) on the Essays dataset and 0.7448 (+0.0968) on the Kaggle dataset. The CLAD-trained model also demonstrated strong generalizability and achieved benchmark performance comparable to current leading encoder models. The model weight, dataset, and algorithm repository are available at https://research.jingjietan.com/?q=ADAM.[27] GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification
Faxian Wan,Xiaocui Yang,Yifan Cao,Shi Feng,Daling Wang,Yifei Zhang
Main category: cs.CL
TL;DR: 本文提出GRASP框架,通过结合视觉定位与显式思维链(CoT)推理,解决多模态反讽目标识别(MSTI)中细粒度定位与可解释性不足的问题,并构建新数据集MSTI-MAX,采用双阶段优化策略提升性能。
Details
Motivation: 现有方法依赖隐式跨模态对齐,导致细粒度目标定位不准且缺乏可解释性;MSTI任务本身比传统二分类更具挑战性。 Method: 提出GRASP框架:1)构建平衡、信息丰富的MSTI-MAX数据集;2)引入视觉锚定的显式Grounded CoT推理机制;3)采用双阶段优化:坐标感知加权监督微调 + 细粒度目标策略优化。 Result: GRASP在跨模态细粒度反讽目标识别上显著优于现有基线;LLM-as-a-Judge评估验证了其推理链质量;代码与数据集将开源。 Conclusion: 显式视觉接地与结构化推理可有效提升多模态反讽识别的准确性与可解释性,为MSTI任务提供了新范式。 Abstract: Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.[28] NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression
Tong Wu,Nicolay Rusnachenko,Huizhi Liang
Main category: cs.CL
TL;DR: 本文提出了一种基于XLM-RoBERTa的细粒度维度情感分析(DimABSA)方法,用于在1-9范围内预测文本中各方面的效价(valence)和唤醒度(arousal)连续值,并在多语言多领域任务中显著优于大语言模型少样本提示方法。
Details
Motivation: 传统基于方面的情感分析(ABSA)仅输出离散极性标签,而实际情感具有连续、多维特性;为更精细建模情感,需扩展至效价-唤醒度(VA)连续回归任务,尤其面向多语言、多领域场景。 Method: 采用XLM-RoBERTa-base进行微调,输入格式为[CLS]文本[SEP]方面词[SEP],并构建双回归头分别预测VA值,输出经sigmoid缩放至[1,9]区间;针对英/中文及餐厅/笔记本/金融三个领域,分别训练独立模型;开发集与训练集合并用于最终测试预测。 Result: 在开发实验中,所提微调方法在所有评测数据集上均显著且稳定地优于GPT-5.2、LLaMA-3-70B、LLaMA-3.3-70B和LLaMA-4-Maverick等大模型在少样本提示下的表现。 Conclusion: 任务特定微调仍是在多语言维度情感回归任务中比当前大语言模型少样本提示更有效、更鲁棒的方案;该方法具备良好可扩展性与实用性,代码已开源。 Abstract: Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence-arousal (VA) regression. This paper describes a system developed for Track A - Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, constructing the input as [CLS] T [SEP] a_i [SEP] and training dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language-domain combination (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models including GPT-5.2, LLaMA-3-70B, LLaMA-3.3-70B, and LLaMA-4-Maverick under a few-shot prompting setting, demonstrating that task-specific fine-tuning substantially and consistently outperforms these LLM-based methods across all evaluation datasets. The code is publicly available at https://github.com/tongwu17/SemEval-2026-Task3-Track-A.[29] MuTSE: A Human-in-the-Loop Multi-use Text Simplification Evaluator
Rares-Alexandru Roscan,Gabriel Petre1,Adrian-Marius Dumitran,Angela-Liliana Dumitran
Main category: cs.CL
TL;DR: 本文提出MuTSE,一个支持人机协同的交互式Web应用,用于系统评估大语言模型在不同提示策略和架构下生成的文本简化结果,特别面向CEFR等级目标,并通过分层语义对齐与线性偏差启发式提升可解释性与可复现性。
Details
Motivation: 现有LLM文本简化评估缺乏结构化、可视化的比较分析框架;研究人员依赖静态脚本,教育者受限于普通对话界面,均难以支持多维、系统性的提示-模型组合评估。 Method: 设计并实现MuTSE交互式Web系统,支持P×M种提示-模型组合并发执行;引入带线性偏差启发式(λ)的分层语义对齐引擎,实现源句与简化句的可视化映射;支持实时生成对比矩阵及结构化人工标注。 Result: MuTSE显著降低定性分析的认知负荷,支持可复现的结构化标注,便于构建下游NLP数据集;项目代码与演示已开源供同行评审。 Conclusion: MuTSE填补了LLM文本简化评估中人机协同、可视化与系统化分析的空白,为NLP研究与智能导学系统提供了可扩展、可解释的新范式。 Abstract: As Large Language Models (LLMs) become increasingly prevalent in text simplification, systematically evaluating their outputs across diverse prompting strategies and architectures remains a critical methodological challenge in both NLP research and Intelligent Tutoring Systems (ITS). Developing robust prompts is often hindered by the absence of structured, visual frameworks for comparative text analysis. While researchers typically rely on static computational scripts, educators are constrained to standard conversational interfaces -- neither paradigm supports systematic multi-dimensional evaluation of prompt-model permutations. To address these limitations, we introduce \textbf{MuTSE}\footnote{The project code and the demo have been made available for peer review at the following anonymized URL. https://osf.io/njs43/overview?view_only=4b4655789f484110a942ebb7788cdf2a, an interactive human-in-the-loop web application designed to streamline the evaluation of LLM-generated text simplifications across arbitrary CEFR proficiency targets. The system supports concurrent execution of $P \times M$ prompt-model permutations, generating a comprehensive comparison matrix in real-time. By integrating a novel tiered semantic alignment engine augmented with a linearity bias heuristic ($λ$), MuTSE visually maps source sentences to their simplified counterparts, reducing the cognitive load associated with qualitative analysis and enabling reproducible, structured annotation for downstream NLP dataset construction.[30] TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
Gang Hu,Yating Chen,Haiyan Ding,Wang Gao,Jiajia Huang,Min Peng,Qianqian Xie,Kun Yu
Main category: cs.CL
TL;DR: 本文提出了首个面向中文税务实践的专用基准TaxPraBen,涵盖10个传统NLP任务和3个真实场景(税务风险防控、税务稽查分析、税务筹划),共7.3K样本,并设计了可扩展的结构化评估范式;在19个LLM上的评测显示国产大模型(如Qwen2.5)整体优于多语言模型,但仅用部分税务数据微调的YaYi2提升有限。
Details
Motivation: 现有LLM在高度专业化、知识密集且法规严格的中文税务领域表现不足,而现有税务相关评测多聚焦孤立NLP任务,缺乏对真实税务实践能力的系统评估。 Method: 构建TaxPraBen基准:整合10个传统任务与3个真实税务场景(来自14个数据集共7.3K实例);提出基于'结构化解析-字段对齐抽取-数值与文本匹配'的可扩展结构化评估范式;依据布鲁姆分类法对19个LLM进行系统评测。 Result: 评测发现:闭源大参数LLM整体表现优异;国产LLM(如Qwen2.5)普遍优于多语言LLM;仅用部分税务数据微调的YaYi2提升有限;验证了TaxPraBen对端到端税务实践能力评估的有效性与可扩展性。 Conclusion: TaxPraBen是推动LLM在专业、实践导向场景中评估与发展的关键资源,其结构化评估范式亦可迁移至其他垂直领域。 Abstract: While Large Language Models (LLMs) excel in various general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Consequently, while tax-related benchmarks are gaining attention, many focus on isolated NLP tasks, neglecting real-world practical capabilities. To address this issue, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It combines 10 traditional application tasks, along with 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm designed through process of "structured parsing-field alignment extraction-numerical and textual matching", enabling end-to-end tax practice assessment while being extensible to other domains. We evaluate 19 LLMs based on Bloom's taxonomy. The results indicate significant performance disparities: all closed-source large-parameter LLMs excel, and Chinese LLMs like Qwen2.5 generally exceed multilingual LLMs, while the YaYi2 LLM, fine-tuned with some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluations of LLMs in practical applications.[31] MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Yixin Xiang,Yunshan Ma,Xiaoyu Du,Yibing Chen,Yanxin Zhang,Jinhui Tang
Main category: cs.CL
TL;DR: 本文提出了一种基于多臂赌博机(MAB)的文档问答框架MAB-DQA,通过将查询分解为多个方面感知的子查询,并依据初步推理结果动态分配检索资源,从而提升多模态RAG中对文档图像的有效利用,显著提高了DQA性能。
Details
Motivation: 现有多模态RAG方法在文档问答中难以有效利用大量页面图像,因检索阶段仅保留少数候选页(如Top-4),易遗漏信息丰富但视觉不显著的内容,而偏好常见但信息量低的页面。 Method: 提出MAB-DQA框架:将查询分解为方面感知的子查询,每个子查询对应一个‘臂’;利用少量代表性页面的初步推理结果作为奖励信号估计各方面的效用;通过探索-利用策略动态调整各方面的检索预算;最终基于高价值页面及其关联生成答案。 Result: 在四个基准数据集上,MAB-DQA相较当前最优方法平均提升5%-18%,显著增强文档理解能力。 Conclusion: MAB-DQA通过建模查询的多方面重要性并动态优化检索资源分配,有效缓解了多模态DQA中图像信息利用不足的问题,验证了基于强化学习思想的检索调度在文档理解任务中的有效性。 Abstract: Document Question Answering (DQA) involves generating answers from a document based on a user's query, representing a key task in document understanding. This task requires interpreting visual layouts, which has prompted recent studies to adopt multimodal Retrieval-Augmented Generation (RAG) that processes page images for answer generation. However, in multimodal RAG, visual DQA struggles to utilize a large number of images effectively, as the retrieval stage often retains only a few candidate pages (e.g., Top-4), causing informative but less visually salient content to be overlooked in favor of common yet low-information pages. To address this issue, we propose a Multi-Armed Bandit-based DQA framework (MAB-DQA) to explicitly model the varying importance of multiple implicit aspects in a query. Specifically, MAB-DQA decomposes a query into aspect-aware subqueries and retrieves an aspect-specific candidate set for each. It treats each subquery as an arm and uses preliminary reasoning results from a small number of representative pages as reward signals to estimate aspect utility. Guided by an exploration-exploitation policy, MAB-DQA dynamically reallocates retrieval budgets toward high-value aspects. With the most informative pages and their correlations, MAB-DQA generates the expected results. On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding. Code at https://github.com/ElephantOH/MAB-DQA.[32] Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models
Shun Zou,Yong Wang,Zehui Chen,Lin Chen,Chongyang Tao,Feng Zhao,Xiangxiang Chu
Main category: cs.CL
TL;DR: 本文提出了一种无需训练、即插即用的动态解码策略Anchor-based History-stable Decoding (AHD),通过实时监测token稳定性趋势并触发跨块早解码,显著提升dLLMs的推理效率与性能,克服了半自回归解码中的块约束问题。
Details
Motivation: 半自回归(Semi-AR)解码在扩散大语言模型(dLLMs)中存在固有块约束,导致大量跨块稳定token被不必要延迟解码,影响效率与性能。 Method: 基于对token稳定性的系统分析,发现token稳定性与收敛趋势强相关且历史信息孤立;据此提出AHD方法:利用动态锚点实时监控token稳定性趋势,一旦token稳定即启动跨块早解码。 Result: AHD在语言、视觉-语言、音频-语言多模态任务上均显著提升性能与推理效率;在BBH基准上减少80%解码步数的同时提升性能3.67%,并逆转了现有加速策略常导致的性能下降问题。 Conclusion: AHD是一种训练无关、通用性强的解码优化策略,有效缓解dLLMs中Semi-AR解码的块约束瓶颈,为高效高质量生成提供了新范式。 Abstract: Diffusion Large Language Models (dLLMs) have recently become a promising alternative to autoregressive large language models (ARMs). Semi-autoregressive (Semi-AR) decoding is widely employed in base dLLMs and advanced decoding strategies due to its superior performance. However, our observations reveal that Semi-AR decoding suffers from inherent block constraints, which cause the decoding of many cross-block stable tokens to be unnecessarily delayed. To address this challenge, we systematically investigate the identification of stable tokens and present three key findings: (1) naive lookahead decoding is unreliable, (2) token stability closely correlates with convergence trend, and (3) historical information is isolated. Building on these insights, we propose Anchor-based History-stable Decoding (AHD), a training-free, plug-and-play dynamic decoding strategy. Specifically, AHD monitors the stability trend of tokens in real time through dynamic anchors. Once a token reaches stability, it initiates early cross-block decoding to enhance efficiency and performance. Extensive experiments across language, vision-language, and audio-language domains demonstrate that AHD simultaneously improves both performance and inference efficiency. Notably, AHD effectively reverses the performance degradation typically observed in existing advanced decoding acceleration strategies. For instance, on the BBH benchmark, our approach reduces decoding steps by 80% while improving performance by 3.67%.[33] Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models
Avni Mittal,Shanu Kumar,Sandipan Dandapat,Monojit Choudhury
Main category: cs.CL
TL;DR: 本文提出了一种预测多语言评估方法,通过构建包含1500个问题的受控基准和名为Litmus (Re)Agent的DAG编排智能体系统,来估计模型在缺乏直接基准结果的目标语言上的性能。
Details
Motivation: 解决多语言部署中评估覆盖稀疏、文献证据在语言、任务和模型家族间分布不均的问题。 Method: 构建了涵盖六项任务和五种证据场景的1500题基准,并设计Litmus (Re)Agent智能体系统,该系统通过假设分解、证据检索与特征感知聚合进行预测。 Result: Litmus (Re)Agent在六种系统中整体性能最优,尤其在直接证据薄弱或缺失的迁移密集型场景中提升最大。 Conclusion: 结构化智能体推理是应对不完整证据下多语言性能估计问题的一种有前景的方法。 Abstract: We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and published evidence is uneven across languages, tasks, and model families. We introduce a controlled benchmark of 1,500 questions spanning six tasks and five evidence scenarios. The benchmark separates accessible evidence from ground truth, enabling evaluation of systems that must infer missing results from incomplete literature evidence. We also present Litmus (Re)Agent, a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesises predictions through feature-aware aggregation. Across six systems, Litmus (Re)Agent achieves the best overall performance, with the largest gains in transfer-heavy scenarios where direct evidence is weak or absent. These results show that structured agentic reasoning is a promising approach to multilingual performance estimation under incomplete evidence.[34] Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning
Lorenzo Jaime Yu Flores,Cesare Spinoso di-Piano,Jackie Chi Kit Cheung
Main category: cs.CL
TL;DR: 本文研究了监督微调(SFT)对语言模型置信度评分与输出质量相关性的影响,发现微调后相关性下降,主要源于置信度受训练分布相似性等非质量因素干扰;作者通过案例说明该问题会削弱置信度在下游任务中的实用性,并强调需测试和设计更鲁棒的不确定性度量。
Details
Motivation: 置信度评分需与模型输出质量强相关才具实用价值,但近期发现监督微调(SFT)会损害这种相关性,因此需深入探究其内在机制。 Method: 分析SFT前后多种置信度评分与输出质量的相关性变化,识别导致相关性退化的因素(如输出与训练分布的相似性),并通过下游任务的案例研究验证影响。 Result: SFT后各类置信度评分与输出质量的相关性普遍下降;该退化部分源于置信度受输出与训练分布匹配程度等非质量因素驱动;未校正此误相关将显著降低置信度在实际任务中的有效性。 Conclusion: 置信度指标不能直接‘开箱即用’,必须针对具体微调设置进行验证;亟需开发对微调更鲁棒的新型不确定性量化方法。 Abstract: Uncertainty quantification is a set of techniques that measure confidence in language models. They can be used, for example, to detect hallucinations or alert users to review uncertain predictions. To be useful, these confidence scores must be correlated with the quality of the output. However, recent work found that fine-tuning can affect the correlation between confidence scores and quality. Hence, we investigate the underlying behavior of confidence scores to understand its sensitivity to supervised fine-tuning (SFT). We find that post-SFT, the correlation of various confidence scores degrades, which can stem from changes in confidence scores due to factors other than the output quality, such as the output's similarity to the training distribution. We demonstrate via a case study how failing to address this miscorrelation reduces the usefulness of the confidence scores on a downstream task. Our findings show how confidence metrics cannot be used off-the-shelf without testing, and motivate the need for developing metrics which are more robust to fine-tuning.[35] Quantisation Reshapes the Metacognitive Geometry of Language Models
Jon-Paul Cacioli
Main category: cs.CL
TL;DR: 本文发现模型量化会重构大语言模型在领域层面的元认知效率,而非均匀降低;不同精度格式下M-ratio在各知识领域的分布显著变化,但Type-2 AUROC保持稳定,表明问题出在M-ratio归一化而非底层判别能力;针对弱领域开展的置信度增强微调未能提升元认知敏感性(meta-d'),因其诊断结果不跨格式迁移;研究强调依赖M-ratio的系统存在未被察觉的推理格式依赖性,而使用AUROC_2更稳健。
Details
Motivation: 探究模型量化是否均匀损害大语言模型(LLM)的元认知效率,以及能否通过领域条件化训练改善其元认知表现。 Method: 在Llama-3-8B-Instruct上对比Q5_K_M与f16精度下对3000个问题的元认知评估,计算各知识域M-ratio与Type-2 AUROC;开展预注册的领域条件化监督微调(SFT)实验,包括针对诊断出的弱领域(Arts & Literature)的置信度放大训练及多种对照。 Result: M-ratio在不同量化格式下各领域排序完全不相关(Spearman rho = 0.00),如Arts & Literature从最差监控升至最佳,Geography则相反;但Type-2 AUROC完全稳定(rho = 1.00);所有四项验证性假设均不成立;SFT虽成功改变置信度分布并扩大NLP差距,却未提升meta-d',因M-ratio诊断不具跨格式泛化性。 Conclusion: 模型量化重构而非削弱元认知效率,且该重构仅影响M-ratio归一化层面;依赖M-ratio进行领域级元认知评估的系统隐含推理格式依赖风险;Type-2 AUROC更具鲁棒性,应作为更可靠的元认知指标。 Abstract: We report that model quantisation restructures domain-level metacognitive efficiency in LLMs rather than degrading it uniformly. Evaluating Llama-3-8B-Instruct on the same 3,000 questions at Q5_K_M and f16 precision, we find that M-ratio profiles across four knowledge domains are uncorrelated between formats (Spearman rho = 0.00). Arts & Literature moves from worst-monitored (M-ratio = 0.606 at Q5_K_M) to best-monitored (1.542 at f16). Geography moves from well-monitored (1.210) to under-monitored (0.798). However, Type-2 AUROC profiles are perfectly stable across formats (rho = 1.00), localising the restructuring to the M-ratio normalisation rather than the underlying discrimination signal. This finding emerged from a pre-registered attempt to improve metacognition through domain-conditional training. We prescribed confidence-amplification SFT for the diagnosed weak domain, with matched-budget agnostic and wrong-prescription controls. All four confirmatory hypotheses were null (10,000 bootstrap resamples, seed = 42). The training successfully reshaped confidence distributions, doubling the NLP gap in Science from 0.076 to 0.152, but did not improve meta-d' because the diagnostic profile did not transfer across formats. Any system relying on domain-level M-ratio profiles has an unexamined dependency on inference format. Systems using AUROC_2 are safer. We release all code, pre-registrations, and trial-level data.[36] Testing the Assumptions of Active Learning for Translation Tasks with Few Samples
Lorenzo Jaime Yu Flores,Cesare Spinoso di-Piano,Ori Ernst,David Ifeoluwa Adelani,Jackie Chi Kit Cheung
Main category: cs.CL
TL;DR: 本文探讨了主动学习(AL)在少量样本(100–500)语言生成任务中为何常不如随机采样,发现AL依赖的核心假设(信息量与多样性决定性能)不成立;实际影响性能的关键是训练样本顺序及与预训练数据的交互。
Details
Motivation: 解释为何主流主动学习策略在小样本语言生成任务中表现不如随机采样。 Method: 通过实证分析检验AL策略所依赖的信息量和多样性指标是否与测试性能相关,并考察样本顺序、预训练数据交互等替代因素的影响。 Result: 发现信息量和多样性与模型最终性能无显著相关性;而样本顺序和与预训练数据的交互对性能影响更大。 Conclusion: AL在极小样本场景下失效的根本原因在于其理论假设不成立;未来AL方法需建模样本顺序及预训练数据协同效应。 Abstract: Active learning (AL) is a training paradigm for selecting unlabeled samples for annotation to improve model performance on a test set, which is useful when only a limited number of samples can be annotated. These algorithms often work by optimizing for the informativeness and diversity of the training data to be annotated. Recent work found that AL strategies fail to outperform random sampling on various language generation tasks when using 100-500 samples. To understand AL's poor performance when only using few samples, we investigate whether the core assumptions underlying AL strategies hold. We find that neither the informativeness nor diversity of the training data, which AL strategies optimize for, are correlated with test set performance. Instead, factors like the ordering of the training samples and interactions with pre-training data have a larger impact on performance. This suggests that future AL methods must take these factors into account in order to work with very few samples.[37] PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment
Jihwan Oh,Soowon Oh,Murad Aghazada,Minchan Jeong,Sungnyun Kim,Se-Young Yun
Main category: cs.CL
TL;DR: 本文提出PerMix-RLVR方法,在训练阶段解决大语言模型对persona提示的敏感性问题,兼顾鲁棒性与角色表达保真度。
Details
Motivation: 现有persona prompting方法依赖人工选择最优角色,耗时且效果不稳定;先前工作多在推理阶段优化,带来额外计算开销;缺乏在训练阶段系统性提升模型对多样persona适应能力的方法。 Method: 提出PerMix-RLVR:一种混合persona的强化学习与可验证奖励(RLVR)策略,在训练中同时引入多样化persona样本,平衡鲁棒性(抵抗有害persona变化)与保真度(忠实扮演所需角色)。 Result: 在MATH500上Persona稳定性得分(PSS)较RLVR提升+21.2%;在PersonaGym上persona保真度提升+11.4%。 Conclusion: 训练阶段引入persona混合策略可有效缓解鲁棒性与保真度之间的固有冲突,为构建更可控、更可靠的persona-aware LLM提供新路径。 Abstract: Persona prompting has been widely adopted to steer large language models (LLMs) behavior and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when needed, e.g., in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.[38] ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering
Xiaoke Guo,Songze Li,Zhiqiang Liu,Zhaoyan Gong,Yuanxiang Liu,Huajun Chen,Wen Zhang
Main category: cs.CL
TL;DR: 本文提出ASTRA框架,包含AdaSTR和DuTR两个模块,通过构建逻辑语义树和双模推理提升表格问答性能,达到SOTA。
Details
Motivation: 现有表格序列化方法存在结构忽视、表征差距和推理不透明等问题,难以有效支持复杂表格问答任务。 Method: 提出ASTRA框架:1)AdaSTR模块利用LLM的全局语义感知能力,将表格重构为逻辑语义树,并自适应优化构建策略;2)DuTR模块结合基于树搜索的文本导航与符号代码执行进行双模推理。 Result: 在多个复杂表格问答基准测试上达到当前最优(SOTA)性能。 Conclusion: ASTRA通过显式建模层次依赖与双模推理机制,显著提升了LLM在复杂表格问答中的表现,解决了结构建模与语义推理的关键瓶颈。 Abstract: Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.[39] Towards Linguistically-informed Representations for English as a Second or Foreign Language: Review, Construction and Application
Wenxi Li,Xihao Wang,Weiwei Sun
Main category: cs.CL
TL;DR: 本文提出了一种基于构式语法的英语二语/外语(ESFL)句法语义资源构建方法,创建了包含1643句标注语料的黄金标准语料库,并通过验证语言生态位假说展示了其在二语习得研究中的实用价值。
Details
Motivation: 英语作为第二或外语(ESFL)的广泛应用促使学界将其视为独立的语言系统,而非对标准英语的偏离;然而现有ESFL资源存在不足,亟需知识密集型、专门化的表征手段。 Method: 基于建构主义理论,以构式为基本分析单位,建模ESFL与标准英语的句法—语义接口,通过映射标准英语的句法语义关系并保留ESFL独特特征,构建专用语义库(sembank)。 Result: 建成一个含1643句人工标注ESFL句子的高质量句法语义资源库,并通过一项验证语言生态位假说的初步实证研究,证明其在二语习得研究中的实用性。 Conclusion: 该构式驱动的ESFL语义库不仅弥补了现有资源缺陷,也为二语习得理论检验与NLP应用提供了可靠的知识基础。 Abstract: The widespread use of English as a Second or Foreign Language (ESFL) has sparked a paradigm shift: ESFL is not seen merely as a deviation from standard English but as a distinct linguistic system in its own right. This shift highlights the need for dedicated, knowledge-intensive representations of ESFL. In response, this paper surveys existing ESFL resources, identifies their limitations, and proposes a novel solution. Grounded in constructivist theories, the paper treats constructions as the fundamental units of analysis, allowing it to model the syntax--semantics interface of both ESFL and standard English. This design captures a wide range of ESFL phenomena by referring to syntactico-semantic mappings of English while preserving ESFL's unique characteristics, resulting a gold-standard syntactico-semantic resource comprising 1643 annotated ESFL sentences. To demonstrate the sembank's practical utility, we conduct a pilot study testing the Linguistic Niche Hypothesis, highlighting its potential as a valuable tool in Second Language Acquisition research.[40] CONDESION-BENCH: Conditional Decision-Making of Large Language Models in Compositional Action Space
Yeonjun Hwang,Sungyong Park,Minju Kim,Dongha Lee,Jinyoung Yeo
Main category: cs.CL
TL;DR: 本文提出CONDESION-BENCH,一个用于评估大语言模型在组合式动作空间中条件决策能力的新基准,强调动作的可组合性与多层级显式约束,并采用基于oracle的评估方法。
Details
Motivation: 现有决策基准假设动作来自有限预定义集合且忽略动作可行性约束,无法反映真实世界中动作的组合结构和显式约束条件。 Method: 构建CONDESION-BENCH基准,将动作定义为对决策变量的分配,并引入变量级、上下文级和分配级三类显式约束;采用oracle-based方法同时评估决策质量与约束遵守程度。 Result: 提供了更严格、更贴近实际场景的LLM决策支持能力评估框架,凸显了当前LLM在条件决策与约束遵循方面的局限性。 Conclusion: CONDESION-BENCH填补了面向组合动作与显式约束的条件决策评估空白,推动LLM在高风险领域中更可靠、可解释的决策支持应用。 Abstract: Large language models have been widely explored as decision-support tools in high-stakes domains due to their contextual understanding and reasoning capabilities. However, existing decision-making benchmarks rely on two simplifying assumptions: actions are selected from a finite set of pre-defined candidates, and explicit conditions restricting action feasibility are not incorporated into the decision-making process. These assumptions fail to capture the compositional structure of real-world actions and the explicit conditions that constrain their validity. To address these limitations, we introduce CONDESION-BENCH, a benchmark designed to evaluate conditional decision-making in compositional action space. In CONDESION-BENCH, actions are defined as allocations to decision variables and are restricted by explicit conditions at the variable, contextual, and allocation levels. By employing oracle-based evaluation of both decision quality and condition adherence, we provide a more rigorous assessment of LLMs as decision-support tools.[41] Anchored Sliding Window: Toward Robust and Imperceptible Linguistic Steganography
Ruiyi Yan,Shiao Meng,Yugo Murawaki
Main category: cs.CL
TL;DR: 本文提出了一种锚定滑动窗口(ASW)框架,通过锚定提示和桥接上下文提升基于语言模型的隐写术在不可感知性与鲁棒性上的表现,并采用类提示蒸馏及自蒸馏策略优化桥接上下文。
Details
Motivation: 现有基于语言模型的隐写术对文本微小改动极为敏感(脆弱),而此前通过缩小上下文窗口缓解该问题的方法严重损害文本质量。 Method: 提出锚定滑动窗口(ASW)框架:在滑动窗口中固定提示(prompt)与桥接上下文(bridge context),使模型能补偿被排除的上下文;将桥接上下文优化建模为一种提示蒸馏变体,并引入自蒸馏策略进一步增强。 Result: 实验表明,ASW在文本质量、不可感知性和鲁棒性三方面均显著且一致地优于基线方法,适用于多种设置。 Conclusion: ASW框架有效平衡了隐写文本的鲁棒性与语言质量,为实用化语言模型隐写术提供了新思路。 Abstract: Linguistic steganography based on language models typically assumes that steganographic texts are transmitted without alteration, making them fragile to even minor modifications. While previous work mitigates this fragility by limiting the context window, it significantly compromises text quality. In this paper, we propose the anchored sliding window (ASW) framework to improve imperceptibility and robustness. In addition to the latest tokens, the prompt and a bridge context are anchored within the context window, encouraging the model to compensate for the excluded tokens. We formulate the optimization of the bridge context as a variant of prompt distillation, which we further extend using self-distillation strategies. Experiments show that our ASW significantly and consistently outperforms the baseline method in text quality, imperceptibility, and robustness across diverse settings. The code is available at github.com/ryehr/ASW_steganography.[42] NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System
Parjanya Aditya Shukla,Shubham Kumar Nigam,Debtanu Datta,Balaramamahanthi Deepak Patnaik,Noel Shallum,Pradeep Reddy Vanga,Saptarshi Ghosh,Arnab Bhattacharya
Main category: cs.CL
TL;DR: 本文提出了NyayaMind,一个面向印度司法系统的开源框架,用于法院判决预测与解释(CJPE),通过检索-推理-验证机制实现透明、可扩展且符合司法实践的法律推理。
Details
Motivation: 现有CJPE系统缺乏与真实司法实践一致的透明、结构化法律推理能力,难以在实际司法或法律研究中可靠应用。 Method: 构建NyayaMind框架,包含两个核心模块:基于RAG的检索模块(用于获取相关法律条文和判例)和基于领域微调推理型大模型的预测模块(生成问题、论点、理由与判决等结构化输出)。 Result: 实验与专家评估表明,NyayaMind在解释质量与证据对齐性方面显著优于现有CJPE方法。 Conclusion: NyayaMind为构建可信的AI辅助法律决策支持系统提供了可行路径,尤其适用于印度司法语境。 Abstract: Court Judgment Prediction and Explanation (CJPE) aims to predict a judicial decision and provide a legally grounded explanation for a given case based on the facts, legal issues, arguments, cited statutes, and relevant precedents. For such systems to be practically useful in judicial or legal research settings, they must not only achieve high predictive performance but also generate transparent and structured legal reasoning that aligns with established judicial practices. In this work, we present NyayaMind, an open-source framework designed to enable transparent and scalable legal reasoning for the Indian judiciary. The proposed framework integrates retrieval, reasoning, and verification mechanisms to emulate the structured decision-making process typically followed in courts. Specifically, NyayaMind consists of two main components: a Retrieval Module and a Prediction Module. The Retrieval Module employs a RAG pipeline to identify legally relevant statutes and precedent cases from large-scale legal corpora, while the Prediction Module utilizes reasoning-oriented LLMs fine-tuned for the Indian legal domain to generate structured outputs including issues, arguments, rationale, and the final decision. Our extensive results and expert evaluation demonstrate that NyayaMind significantly improves the quality of explanation and evidence alignment compared to existing CJPE approaches, providing a promising step toward trustworthy AI-assisted legal decision support systems.[43] Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency
Shu Yang,Zihao Zhou,Di Wang,Wenda Li
Main category: cs.CL
TL;DR: 本文提出了一种名为Neuro-Symbolic Hierarchical Alignment (NSHA)的方法,用于解决大语言模型在处理多源、多层次指令时的冲突问题,通过推理时的求解器引导约束满足和训练时的自动监督蒸馏,提升指令遵循的一致性与安全性。
Details
Motivation: 现有工作关注对抗性指令冲突,但忽略了现实应用中常见的良性指令冲突,模型需在保障安全的同时维持任务效用与行为一致性。 Method: 提出NSHA框架:推理时采用求解器引导的约束满足方法建模指令优先级;训练时通过自动构建监督信号将求解器决策蒸馏至模型参数。 Result: 在规则遵循、任务执行、工具使用与安全性等多个任务上验证了NSHA的有效性,显著提升了指令冲突下的性能,同时在无冲突基准场景中保持竞争力。 Conclusion: NSHA通过神经符号结合的方式显式建模指令层级关系,为多源异构指令环境下的鲁棒指令遵循提供了新范式。 Abstract: Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real-world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro-Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction-following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver-guided reasoning that formulates instruction resolution as a constraint satisfaction problem, enabling the model to derive a maximally consistent set of applicable instructions under hierarchical constraints. At training time, NSHA distills solver-based decisions into model parameters using automatically constructed supervision. We evaluate our approach on rule following, task execution, tool use, and safety, covering both single-turn and multi-turn interactions, and show that NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.[44] Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
Peng Wang,Yanqiao Zhu,Zixuan Jiang,Qinyuan Chen,Xingjian Zhao,Xipeng Qiu,Wupeng Wang,Zhifu Gao,Xiangang Li,Kai Yu,Xie Chen
Main category: cs.CL
TL;DR: 本文提出了一种基于大语言模型(LLM)的交互式自动语音识别(ASR)框架,引入LLM-as-a-Judge作为语义感知评估指标,并设计LLM驱动的智能体实现多轮语义反馈下的识别结果迭代修正,显著提升语义保真度与交互纠错能力。
Details
Motivation: 现有ASR研究过度依赖词错误率(WER)这一词汇级指标,忽视语义正确性;同时缺乏对人类交流中关键的交互式纠错机制的系统研究。 Method: 提出基于LLM-as-a-Judge的语义评估方法,并构建LLM驱动的智能体框架,支持多轮交互与语义反馈驱动的识别结果迭代优化。 Result: 在GigaSpeech(英文)、WenetSpeech(中文)及ASRU 2019语码转换数据集上验证有效,主客观评估均显示语义保真度和交互纠错能力显著提升。 Conclusion: 将语义评估与交互式纠错统一于智能体框架下,为下一代ASR系统提供了新范式,代码将开源以推动交互式、具身化ASR研究。 Abstract: Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction-an essential component of human communication-has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.[45] Prototype-Regularized Federated Learning for Cross-Domain Aspect Sentiment Triplet Extraction
Zongming Cai,Jianhang Tang,Zhenyong Zhang,Jinghui Qin,Kebing Jin,Hankz Hankui Zhuo
Main category: cs.CL
TL;DR: 本文提出了一种基于原型的跨域联邦学习框架PCD-SpanProto,用于解决ASTE任务中跨域泛化与数据隐私问题,通过交换类级别原型而非模型参数,并引入加权聚合与对比正则化提升全局原型质量。
Details
Motivation: 现有ASTE方法通常孤立训练于单个数据集,难以联合捕获跨域共性特征,且受数据隐私限制无法集中聚合数据。 Method: 提出PCD-SpanProto框架:基于原型的联邦学习,客户端交换span-level类别原型;设计性能感知的加权聚合策略和对比正则化模块,增强类内紧凑性与类间可分性。 Result: 在四个ASTE数据集上显著优于基线方法,同时降低通信开销。 Conclusion: 原型级知识共享可在保护隐私前提下有效实现跨域ASTE性能提升,验证了原型驱动联邦学习在细粒度情感分析中的可行性与优势。 Abstract: Aspect Sentiment Triplet Extraction (ASTE) aims to extract all sentiment triplets of aspect terms, opinion terms, and sentiment polarities from a sentence. Existing methods are typically trained on individual datasets in isolation, failing to jointly capture the common feature representations shared across domains. Moreover, data privacy constraints prevent centralized data aggregation. To address these challenges, we propose Prototype-based Cross-Domain Span Prototype extraction (PCD-SpanProto), a prototype-regularized federated learning framework to enable distributed clients to exchange class-level prototypes instead of full model parameters. Specifically, we design a weighted performance-aware aggregation strategy and a contrastive regularization module to improve the global prototype under domain heterogeneity and the promotion between intra-class compactness and inter-class separability across clients. Extensive experiments on four ASTE datasets demonstrate that our method outperforms baselines and reduces communication costs, validating the effectiveness of prototype-based cross-domain knowledge transfer.[46] Think Less, Know More: State-Aware Reasoning Compression with Knowledge Guidance for Efficient Reasoning
Yi Sui,Chaozhuo Li,Dawei Song
Main category: cs.CL
TL;DR: 本文提出STACK框架,通过状态感知的推理压缩与知识引导,在保持甚至提升准确率的同时显著减少大型推理模型(LRM)的推理步骤和延迟。
Details
Motivation: 大型推理模型(LRMs)虽能通过长思维链(CoT)解决复杂任务,但易出现‘过度思考’,导致推理步骤冗余、延迟高;现有CoT压缩方法难以兼顾精度与效率,且缺乏对步骤级冗余和推理偏差的细粒度适应。 Method: 提出STACK框架:1)建模阶段特异性冗余源;2)构建在线长短对比样本;3)根据推理状态动态切换知识引导压缩(针对不确定/有偏状态)或自提示压缩(针对冗长但自信状态);4)引入基于答案收敛的早停机制;5)设计结合PPO与DPO的奖励差异驱动训练策略,学习状态条件下的压缩策略。 Result: 在三个数学推理基准上,STACK将平均响应长度减少59.9%,同时准确率提升4.8个百分点,优于现有方法。 Conclusion: STACK实现了更优的精度-效率权衡,验证了状态感知与知识引导在CoT压缩中的有效性,为高效推理提供了新范式。 Abstract: Large Reasoning Models (LRMs) achieve strong performance on complex tasks by leveraging long Chain-of-Thought (CoT), but often suffer from overthinking, leading to excessive reasoning steps and high inference latency. Existing CoT compression methods struggle to balance accuracy and efficiency, and lack fine-grained, step-level adaptation to redundancy and reasoning bias. Therefore, we propose State-Aware Reasoning Compression with Knowledge Guidance (STACK), a framework that performs step-wise CoT compression by explicitly modeling stage-specific redundancy sources and integrating with a retrieval-augmented guidance. STACK constructs online long-short contrastive samples and dynamically switches between knowledge-guided compression for uncertain or biased reasoning state and self-prompted compression for overly long but confident state, complemented by an answer-convergence-based early stopping mechanism to suppress redundant verification. We further propose a reward-difference-driven training strategy by combining Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), enabling models to learn state-conditioned compression strategies. Experiments on three mathematical reasoning benchmarks show that STACK achieves a superior accuracy-efficiency balance, reducing average response length by 59.9% while improving accuracy by 4.8 points over existing methods.[47] Persona-E$^2$: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events
Yuqin Yang,Haowu Zhou,Haoran Tu,Zhiwen Hui,Shiqi Yan,HaoYang Li,Dong She,Xianrong Yao,Yang Gao,Zhanpeng Jin
Main category: cs.CL
TL;DR: 本文提出Persona-E²数据集,旨在解决现有情感计算研究忽视读者个性差异的问题,通过MBTI和大五人格特质标注,捕捉不同读者对同一事件的情感反应差异,并验证了人格信息对提升大模型情感理解能力的有效性。
Details
Motivation: 现有情感计算研究多将情感视为文本的静态属性,忽视读者个性差异导致的情感反应多样性,且缺乏真实人类数据支持人格与情感变化的关联建模。 Method: 构建基于MBTI和大五人格特质标注的大规模数据集Persona-E²,覆盖新闻、社交媒体和生活叙事三类文本;设计实验评估主流大语言模型在人格感知情感理解任务上的表现,并分析人格信息对缓解‘人格幻觉’的作用。 Result: 实验证明当前SOTA大模型难以准确捕捉个性驱动的情感评价变化,尤其在社交媒体文本上表现更差;引入大五人格特征显著提升模型理解能力,并有效缓解‘人格幻觉’现象。 Conclusion: 读者个性是影响情感反应的关键因素,需在情感计算中显式建模;Persona-E²为该方向提供了首个大规模、人格标注的基准数据集,推动更真实、可解释的个性化情感理解研究。 Abstract: Most affective computing research treats emotion as a static property of text, focusing on the writer's sentiment while overlooking the reader's perspective. This approach ignores how individual personalities lead to diverse emotional appraisals of the same event. Although role-playing Large Language Models (LLMs) attempt to simulate such nuanced reactions, they often suffer from "personality illusion'' -- relying on surface-level stereotypes rather than authentic cognitive logic. A critical bottleneck is the absence of ground-truth human data to link personality traits to emotional shifts. To bridge the gap, we introduce Persona-E$^2$ (Persona-Event2Emotion), a large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. Extensive experiments reveal that state-of-the-art LLMs struggle to capture precise appraisal shifts, particularly in social media domains. Crucially, we find that personality information significantly improves comprehension, with the Big Five traits alleviating "personality illusion.'[48] Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG
Passant Elchafei,Monorama Swain,Shahed Masoudian,Markus Schedl
Main category: cs.CL
TL;DR: 本文提出了一种面向问答任务的细粒度诊断框架,通过将问题分解为原子推理要素(facets),结合检索相关性与自然语言推理(NLI)的忠实度评分,构建Facet×Chunk矩阵,系统分析RAG中证据使用不当(如忽略、错配、被先验知识覆盖)导致的幻觉问题。
Details
Motivation: 现有RAG评估停留在答案级或段落级,难以揭示证据在生成过程中如何被实际使用;幻觉频发即使在检索到相关文档时仍存在,亟需更细粒度的诊断手段。 Method: 提出基于推理要素(facet)的诊断框架:将问题拆解为原子facet,构建Facet×Chunk矩阵(融合检索相关性+NLI忠实度);设计三种受控推理模式(Strict RAG、Soft RAG、LLM-only)对比分析证据使用;在医学QA和HotpotQA上评估GPT、Gemini、LLaMA三类模型。 Result: 发现RAG幻觉主因并非检索不准,而是生成阶段对检索证据的错误整合(如证据缺失、错配、先验覆盖);facet级分析揭示了答案级评估无法发现的系统性失败模式。 Conclusion: 证据整合机制比检索质量更关键;facet-level诊断可提供可解释、可归因的RAG失效分析,为改进RAG系统提供新方向。 Abstract: Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.[49] Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies
Avni Mittal
Main category: cs.CL
TL;DR: 本文提出Symbolic-Neural Consistency Audit (SNCA)框架,用于评估大语言模型(LLM)自身声明的安全规则与其实际行为之间的一致性,发现模型在安全政策表述与执行间存在显著且架构依赖的差距。
Details
Motivation: 现有安全评估方法仅依据外部标准检验模型行为,无法衡量模型是否真正理解并执行其自身声明的安全边界;同时,RLHF训练出的安全策略缺乏形式化定义,难以审查。 Method: 提出SNCA框架:(1) 通过结构化提示提取模型自我陈述的安全规则;(2) 将其形式化为三类类型化谓词(绝对型、条件型、自适应型);(3) 在确定性基准上比对模型行为以量化合规性。在4个前沿模型、45类危害、47496次观测上开展实证评估。 Result: 发现系统性不一致:声称‘绝对拒绝’的模型常执行有害请求;推理型模型自我一致性最高但29%的危害类别无明确政策;跨模型对规则类型的共识率仅11%。 Conclusion: LLM‘言’与‘行’之间的差距可被量化且具架构依赖性,亟需将反思性一致性审计作为行为基准的必要补充。 Abstract: LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.[50] SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
Han Luo,Guy Laban
Main category: cs.CL
TL;DR: 本文提出SPASM框架,用于生成稳定、多轮对话,通过Egocentric Context Projection(ECP)技术缓解LLM在长程对话中的角色漂移与回声问题,并构建了含45,000条对话的大规模合成数据集。
Details
Motivation: 大语言模型在辅导、客服等多轮对话场景中需长期保持角色与人格一致性;而现有LLM-LLM合成对话易出现人格漂移、角色混淆和‘回声’现象,影响训练与评估可靠性。 Method: 提出SPASM框架,包含三模块:(i)基于模式采样、合理性验证与自然语言撰写的人格构建;(ii)客户-响应者对话生成;(iii)一致性终止检测;核心创新为无需修改模型权重的Egocentric Context Projection(ECP),将对话历史映射至各代理的自我中心视角。 Result: 在GPT-4o-mini、DeepSeek-V3.2、Qwen-Plus三大模型及九组角色配对上,构建4,500个人格与45,000条对话数据集;消融实验证明ECP显著降低人格漂移,人工评估确认其彻底消除回声现象;嵌入分析揭示清晰的人格结构与响应者主导的交互几何特征。 Conclusion: SPASM是一种模块化、稳定性优先的多轮对话模拟框架,ECP是提升长程人格一致性的有效轻量干预方法,为可信LLM代理设计与合成数据构建提供了新范式。 Abstract: Large language models are increasingly deployed in multi-turn settings such as tutoring, support, and counseling, where reliability depends on preserving consistent roles, personas, and goals across long horizons. This requirement becomes critical when LLMs are used to generate synthetic dialogues for training and evaluation, since LLM--LLM conversations can accumulate identity-related failures such as persona drift, role confusion, and "echoing", where one agent gradually mirrors its partner. We introduce SPASM (Stable Persona-driven Agent Simulation for Multi-turn dialogue generation), a modular, stability-first framework that decomposes simulation into (i) persona creation via schema sampling, plausibility validation, and natural-language persona crafting, (ii) Client--Responder dialogue generation, and (iii) termination detection for coherent stopping. To improve long-horizon stability without changing model weights, we propose Egocentric Context Projection (ECP): dialogue history is stored in a perspective-agnostic representation and deterministically projected into each agent's egocentric view before generation. Across three LLM backbones (GPT-4o-mini, DeepSeek-V3.2, Qwen-Plus) and nine Client--Responder pairings, we construct a dataset of 4,500 personas and 45,000 conversations (500 personas X 10 conversations per pairing). Ablations show ECP substantially reduces persona drift and, under human validation, eliminates echoing; embedding analyses recover persona structure and reveal strong responder-driven interaction geometry. Our code is available at https://github.com/lhannnn/SPASM.[51] ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery
Shahar Levy,Eliya Habba,Reshef Mintz,Barak Raveh,Renana Keydar,Gabriel Stanovsky
Main category: cs.CL
TL;DR: ScheMatiQ 是一种利用大语言模型自动生成标注模式和结构化数据库的工具,支持领域专家对大规模文本集合进行快速、可交互的问答分析。
Details
Motivation: 传统方法依赖人工设计标注模式并 exhaustive 标注,耗时且易错;而多学科研究常需从大量文档中提取结构化证据来回答自然语言问题。 Method: 提出 ScheMatiQ 框架,通过调用骨干大语言模型(LLM),根据用户提出的问题和给定语料库,自动生成标注模式(schema)和基于原文的结构化数据库,并提供可交互的 Web 界面用于引导与修正抽取过程。 Result: 在法律和计算生物学领域与领域专家合作验证,ScheMatiQ 能生成支持真实世界分析的有效输出;项目已开源并提供公开 Web 接口、源码及演示视频。 Conclusion: ScheMatiQ 为跨学科研究者提供了一种高效、低门槛、可协作的自动化结构化信息抽取新范式。 Abstract: Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which leverages calls to a backbone LLM to take a question and a corpus to produce a schema and a grounded database, with a web interface that lets steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com[52] EthicMind: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue
Jiawen Deng,Wei Li,Wentao Zhang,Ziyun Jiao,Fuji Ren
Main category: cs.CL
TL;DR: 本文提出EthicMind框架,旨在解决对话系统中伦理判断与情感共鸣的协同问题,通过在每轮对话中联合分析伦理风险与用户情绪、规划响应策略并生成兼顾伦理引导与情感互动的回复,无需额外训练即可提升多轮对话中的伦理-情感对齐能力。
Details
Motivation: 现有对话模型通常孤立处理共情与伦理安全问题,难以适应多轮交互中动态变化的伦理风险和用户情绪,可能导致严重伤害。 Method: 提出EthicMind框架,将伦理-情感对齐建模为显式的逐轮决策问题;在推理时每轮联合分析伦理风险信号与用户情绪,规划高层响应策略,并生成上下文敏感的回复;引入基于风险分层与上下文感知用户模拟的多轮评估协议。 Result: 实验表明,EthicMind在高风险与道德模糊场景下,相比基线方法展现出更一致的伦理引导能力和情感互动效果。 Conclusion: 伦理与情感需在逐轮对话中协同建模;EthicMind提供了一种无需再训练、可即插即用的风险感知对齐机制,有效提升了智能对话系统在敏感场景下的安全性与人文性。 Abstract: Intelligent dialogue systems are increasingly deployed in emotionally and ethically sensitive settings, where failures in either emotional attunement or ethical judgment can cause significant harm. Existing dialogue models typically address empathy and ethical safety in isolation, and often fail to adapt their behavior as ethical risk and user emotion evolve across multi-turn interactions. We formulate ethical-emotional alignment in dialogue as an explicit turn-level decision problem, and propose \textsc{EthicMind}, a risk-aware framework that implements this formulation in multi-turn dialogue at inference time. At each turn, \textsc{EthicMind} jointly analyzes ethical risk signals and user emotion, plans a high-level response strategy, and generates context-sensitive replies that balance ethical guidance with emotional engagement, without requiring additional model training. To evaluate alignment behavior under ethically complex interactions, we introduce a risk-stratified, multi-turn evaluation protocol with a context-aware user simulation procedure. Experimental results show that \textsc{EthicMind} achieves more consistent ethical guidance and emotional engagement than competitive baselines, particularly in high-risk and morally ambiguous scenarios.[53] Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios
Hui Liu,Bin Zou,Kecheng Chen,Jie Liu,Wenya Wang,Haoliang Li
Main category: cs.CL
TL;DR: 本文提出TRouter,一种基于任务类型感知的LLM路由方法,结合多层级任务画像引导的数据合成框架,在冷启动和领域内场景下均能有效提升路由性能。
Details
Motivation: 现有LLM路由系统在缺乏目标领域训练数据(冷启动)时泛化能力差,难以满足用户对成本-性能权衡的需求。 Method: 构建多层级任务画像引导的数据合成框架,建立分层任务分类体系并生成多样化问答对以逼近测试分布;在此基础上提出TRouter,利用隐式任务类型变量建模查询条件下的成本与性能,并引入基于合成任务分类体系的先验正则化。 Result: 在多个基准上验证了该数据合成框架可缓解冷启动问题,且TRouter在冷启动和领域内设置下均实现了有效的LLM路由。 Conclusion: 任务类型感知建模与结构化数据合成相结合,显著提升了LLM路由器在数据稀缺场景下的实用性与鲁棒性。 Abstract: Large language models (LLMs) exhibit substantial variability in performance and computational cost across tasks and queries, motivating routing systems that select models to meet user-specific cost-performance trade-offs. However, existing routers generalize poorly in cold-start scenarios where in-domain training data is unavailable. We address this limitation with a multi-level task-profile-guided data synthesis framework that constructs a hierarchical task taxonomy and produces diverse question-answer pairs to approximate the test-time query distribution. Building on this, we introduce TRouter, a task-type-aware router approach that models query-conditioned cost and performance via latent task-type variables, with prior regularization derived from the synthesized task taxonomy. This design enhances TRouter's routing utility under both cold-start and in-domain settings. Across multiple benchmarks, we show that our synthesis framework alleviates cold-start issues and that TRouter delivers effective LLM routing.[54] Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM
Solomiia Bilyk,Volodymyr Getmanskyi,Taras Firman
Main category: cs.CL
TL;DR: 本文提出了一种基于规则归纳的自动化指令修订(AIR)方法,用于在少量任务示例下适配大语言模型,并通过多维度基准测试表明不同适配方法(AIR、检索、微调等)性能高度依赖任务类型。
Details
Motivation: 现有LLM适配方法(如提示优化、检索、微调)缺乏系统性比较,且适用场景不明确;需一种可解释、轻量且适应特定任务结构的适配方式。 Method: 提出Automated Instruction Revision (AIR),一种基于规则归纳的指令自动修订方法,利用少量任务样例归纳出简洁、可解释的指令规则来引导LLM行为。 Result: 在五个基准上实验表明:AIR在标签重映射分类任务中表现最强或接近最优;KNN检索在闭卷问答中最佳;微调在结构化抽取和事件顺序推理中占优。 Conclusion: LLM适配方法效果具有强任务依赖性;AIR适用于指令可规则化表达的任务,而检索和微调更适用于依赖源知识或数据标注规律的任务。 Abstract: This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks using limited task-specific examples. We position AIR within the broader landscape of adaptation strategies, including prompt optimization, retrieval-based methods, and fine-tuning. We then compare these approaches across a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. The paper argues that adaptation performance is strongly task-dependent: no single method dominates across all settings. Across five benchmarks, AIR was strongest or near-best on label-remapping classification, while KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger in tasks dominated by source-specific knowledge or dataset-specific annotation regularities.[55] UIPress: Bringing Optical Token Compression to UI-to-Code Generation
Dasen Dai,Shuoqi Li,Ronghao Chen,Huacan Wang,Biao Wu,Qizhen Lan
Main category: cs.CL
TL;DR: 本文提出UIPress,一种轻量级学习型压缩模块,用于UI-to-Code生成任务,在ViT编码器与LLM解码器之间将约6700个视觉token压缩至固定256个,显著提升推理速度并提升CLIP评分。
Details
Motivation: 现有视觉token压缩方法在UI-to-Code任务中存在缺陷:推理时启发式选择不适应UI截图信息密度不均,而注意力置零未真正缩短序列;光学压缩在OCR中有效,但尚未应用于UI-to-Code。 Method: 提出UIPress模块,集成深度可分离卷积、元素引导的空间重加权和Transformer精炼,将视觉token从~6700压缩至固定256;结合LoRA微调解码器以弥合表征差距;仅引入~21.7M可训练参数(占8B基模型0.26%)。 Result: 在Design2Code数据集上,UIPress(256 token)CLIP得分为0.8127,较无压缩基线提升+7.5%,较最强推理时方法提升+4.6%,首token延迟加速9.1×。 Conclusion: UIPress是首个面向UI-to-Code任务的编码器侧学习型压缩方法,兼顾高效性与性能,在视觉token压缩与端到端生成质量间取得良好平衡。 Abstract: UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence -- neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ${\sim}$6{,}700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ${\sim}$21.7M trainable parameters (0.26\% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5\% and the strongest inference-time method by +4.6\%, while delivering 9.1$\times$ time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.[56] Many-Tier Instruction Hierarchy in LLM Agents
Jingyu Zhang,Tianjian Li,William Jurayj,Hongyuan Zhan,Benjamin Van Durme,Daniel Khashabi
Main category: cs.CL
TL;DR: 本文提出Many-Tier Instruction Hierarchy(ManyIH)范式,以解决大语言模型智能体在面对多源、多层级指令冲突时的细粒度权限决策问题,并构建首个基准ManyIH-Bench进行评测,发现当前前沿模型在此类任务上表现不佳(约40%准确率)。
Details
Motivation: 现有指令层次(IH)范式仅支持少量固定特权等级(如system > user),难以应对真实智能体场景中大量异构指令源与复杂上下文引发的冲突。 Method: 提出ManyIH范式,支持任意多级指令特权划分;构建ManyIH-Bench基准,含853个任务(427编程+426指令遵循),覆盖46种真实智能体,约束由LLM生成并经人工验证。 Result: 实验表明,当前前沿大模型在指令冲突规模扩大时性能显著下降,平均准确率仅约40%。 Conclusion: 亟需发展面向细粒度、可扩展指令冲突解析的新方法,以提升智能体在复杂指令环境下的安全性和有效性。 Abstract: Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, and more-each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.[57] From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models
Chenchen Zhang
Main category: cs.CL
TL;DR: 本文综述了2024至2026年初发表的47种面向大语言模型强化学习的信用分配(CA)方法,提出二维分类法(按粒度与方法论),并提供结构化文献库、报告检查清单与基准协议三类可复用资源;指出推理RL与具身RL中CA面临不同挑战,催生不同技术路径。
Details
Motivation: 解决大语言模型强化学习中稀疏、结果级奖励下的信用分配难题,尤其在长程推理(数百至数万token)和多轮具身交互(百轮以上、数十万token)两种场景下,传统episode级信用信号失效。 Method: 构建基于粒度(token/segment/step/turn/multi-agent)与方法论(Monte Carlo、时序差分、模型驱动、博弈论、信息论)的二维分类法;系统梳理47种CA方法;开发机器可读文献库、方法报告检查清单与标准化benchmark协议(含任务族、元数据要求与控制分支任务)。 Result: 识别出推理RL中CA趋于成熟(聚焦过程奖励建模与无critic组对比),而具身RL催生全新方法(如回溯反事实分析、特权非对称critic、回合级MDP重构);发现现有研究在方法报告与评估上存在系统性缺口。 Conclusion: 信用分配正随RL应用场景从‘推理’向‘具身’演进而发生范式转变,需针对性发展新方法与统一评估框架;本文提供的三类资源可推动该领域规范化与可复现性发展。 Abstract: Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards -- yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500--30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K--1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches -- hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations -- that have no direct precedent in reasoning RL.[58] Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities
Sathvik Nair,Colin Phillips
Main category: cs.CL
TL;DR: 本文在Marr的分析层次框架下,批判性地审视并扩展了关于语言模型与语言处理的两个主张,并提出了结合大语言模型与心理语言学模型的未来研究方向。
Details
Motivation: 基于Marr的分析层次理论,对当前语言模型在语言处理中核心作用及对心理语言学影响的流行观点进行批判性反思。 Method: 采用理论分析与批判性综述方法,在Marr的计算、算法和实现三个层次上重新审视语言模型与语言处理的关系。 Result: 指出预测性编码并非语言处理唯一核心机制,且心理语言学进展不完全依赖大语言模型;提出融合二者优势的新研究路径。 Conclusion: 语言模型虽具价值,但不能替代心理语言学理论;需构建更符合人类认知机制的整合性建模框架。 Abstract: Under the lens of Marr's levels of analysis, we critique and extend two claims about language models (LMs) and language processing: first, that predicting upcoming linguistic information based on context is central to language processing, and second, that many advances in psycholinguistics would be impossible without large language models (LLMs). We further outline future directions that combine the strengths of LLMs with psycholinguistic models.[59] Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL
Vishnu Murali,Anmol Gulati,Elias Lumer,Kevin Frank,Sindy Campagna,Vamse Kumar Subbiah
Main category: cs.CL
TL;DR: 本文提出了Jackal,首个大规模、基于执行的自然语言到Jira查询语言(JQL)基准测试数据集(含10万对NL-JQL样本),并设计了Agentic Jackal代理框架,结合Jira MCP服务器和语义检索工具JiraAnchor,显著提升LLM生成JQL的执行准确率;实验表明当前单次生成模型准确率仅43.4%,而代理式方法可带来明显提升,尤其在分类值解析上效果显著。
Details
Motivation: 现有单次生成大模型无法感知Jira实例中真实存在的分类值(如组件名、修复版本),也无法通过执行验证查询正确性,导致在歧义或改写请求上准确率低;且缺乏公开、可执行的NL-to-JQL基准。 Method: 构建Jackal基准(10万对经执行验证的NL-JQL样本);提出Agentic Jackal代理框架,集成Jira MCP实时查询执行能力与JiraAnchor语义检索模块(基于嵌入相似度解析自然语言中的分类值)。 Result: 9个前沿LLM在Jackal上单次生成平均执行准确率仅43.4%;Agentic Jackal使其中7个模型性能提升,最难变体上相对增益达9.0%;消融实验显示JiraAnchor将分类值准确率从48.7%提至71.7%,组件字段准确率从16.9%跃升至66.2%。 Conclusion: NL-to-JQL仍是开放挑战,主要失败源于语义歧义(如问题类型区分、文本字段选择),而非值解析;本文发布的基准、代理日志与代码将推动该领域发展。 Abstract: Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.[60] RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval
Kyle Whitecross,Negin Rahimi
Main category: cs.CL
TL;DR: 本文提出RecaLLM,一种通过后训练增强长上下文信息利用能力的推理语言模型。它通过交替进行推理与显式上下文检索,并引入低开销约束解码机制实现证据片段的准确复制,从而缓解‘思考中迷失’(lost-in-thought)问题,在长上下文基准测试中显著优于基线,且仅需短上下文训练数据即可泛化至128K上下文长度。
Details
Motivation: 现有语言模型在长上下文推理中存在‘lost-in-thought’瓶颈:推理步骤越多,后续上下文检索能力越差,制约了测试时扩展性;而检索与推理本应协同,却长期被割裂建模。 Method: RecaLLM采用交替式推理-检索架构,并设计约束解码机制支持证据片段的逐字复制;在多样化的词法与语义检索任务上进行后训练,训练样本最长仅10K tokens。 Result: 在RULER和HELMET两个长上下文基准上显著超越基线;在128K上下文长度下仍保持稳定增益,且训练成本远低于依赖超长上下文训练的现有方法。 Conclusion: 显式耦合检索与推理、辅以轻量约束解码,是提升长上下文语言模型性能的有效且高效路径,无需依赖昂贵的长上下文训练数据。 Abstract: We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open-source LLMs, we observe that in-context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test-time scaling that we refer to as lost-in-thought: reasoning steps that improve performance also make subsequent in-context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible-overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long-context approaches, highlighting a promising path toward improving long-context performance without expensive long-context training data.[61] BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
Hippolyte Gisserot-Boukhlef,Nicolas Boizard,Emmanuel Malherbe,Céline Hudelot,Pierre Colombo
Main category: cs.CL
TL;DR: 本文提出BERT-as-a-Judge,一种轻量、高效、语义鲁棒的生成式答案评估方法,克服了传统词法评估与大型LLM评判器的缺陷,在多项任务上媲美大模型评判器且计算开销显著更低。
Details
Motivation: 现有词法评估方法易将模型格式合规性误判为能力,而LLM-as-a-Judge虽语义更准但计算成本过高,亟需兼顾准确性与效率的评估新范式。 Method: 提出基于BERT编码器的参考式评估方法(BERT-as-a-Judge),在合成标注的问答-候选答案-参考答案三元组上进行轻量微调,支持语义级正确性判断。 Result: 在36个模型、15项下游任务的大规模实证中,词法评估与人工判断相关性差;BERT-as-a-Judge显著优于词法基线,性能媲美大型LLM评判器,且推理开销低得多。 Conclusion: BERT-as-a-Judge为生成式模型评估提供了可靠、可扩展、低成本的新方案,兼具语义鲁棒性与实用性,并已开源全部代码与数据以促进落地应用。 Abstract: Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge's performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.[62] You Can't Fight in Here! This is BBS!
Richard Futrell,Kyle Mahowald
Main category: cs.CL
TL;DR: 本文通过虚构的语言学家Norm和计算语言学家Claudette的对话,批判了对语言模型(LMs)在语言科学中作用的两种常见误解:'字符串统计稻草人'和'已臻极致假设',主张拓展语言科学研究计划,以更全面地理解人类语言与语言模型。
Details
Motivation: 澄清语言模型在语言科学中的角色,反驳对其能力与科学价值的误解,并推动跨学科合作以深化对人类语言和语言模型的理解。 Method: 采用概念分析与批判性讨论的方法,借助虚构对话形式揭示并剖析两种流行误解,并提出更具包容性和拓展性的研究框架。 Result: 明确了语言模型研究不应被简化为纯统计建模,也不应被预设为已达认知科学解释的极限;强调需整合多学科视角开展更扎实、更具解释力的语言科学与AI交叉研究。 Conclusion: 语言模型可为语言科学提供重要洞见,但需超越现有范式局限,构建融合语言学、认知科学、神经科学与计算机科学的新型研究纲领。 Abstract: Norm, the formal theoretical linguist, and Claudette, the computational language scientist, have a lovely time discussing whether modern language models can inform important questions in the language sciences. Just as they are about to part ways until they meet again, 25 of their closest friends show up -- from linguistics, neuroscience, cognitive science, psychology, philosophy, and computer science. We use this discussion to highlight what we see as some common underlying issues: the String Statistics Strawman (the mistaken idea that LMs can't be linguistically competent or interesting because they, like their Markov model predecessors, are statistical models that learn from strings) and the As Good As it Gets Assumption (the idea that LM research as it stands in 2026 is the limit of what it can tell us about linguistics). We clarify the role of LM-based work for scientific insights into human language and advocate for a more expansive research program for the language sciences in the AI age, one that takes on the commentators' concerns in order to produce a better and more robust science of both human language and of LMs.[63] Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation
Xinyu Wang,Sai Koneru,Wenbo Zhang,Wenliang Zheng,Saksham Ranjan,Sarah Rajtmajer
Main category: cs.CL
TL;DR: 本文提出MANYFAKE基准,用于评估假新闻检测模型在混合真伪内容上的性能,发现现有模型对细微、优化且与真实信息交织的虚假内容仍很脆弱。
Details
Motivation: 现代假新闻常通过人机协作生成,将战略性错误嵌入原本准确可信的叙述中,形成混合真伪内容,但现有基准对此类案例覆盖不足。 Method: 构建了一个名为MANYFAKE的合成基准,包含6798篇通过多种策略驱动提示流程生成的假新闻文章,并在此基准上评估了多种前沿假新闻检测模型。 Result: 实验表明,即使具备高级推理能力的模型,在完全虚构的新闻上表现趋近饱和,但在面对细微、优化且与真实信息交织的虚假内容时仍表现脆弱。 Conclusion: 当前假新闻检测模型在应对混合真伪的现实威胁方面存在显著局限,亟需更贴近实际场景的基准和更鲁棒的检测方法。 Abstract: Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior work has often treated fake news detection as a binary classification problem, modern fake news increasingly arises through human-AI collaboration, where strategic inaccuracies are embedded within otherwise accurate and credible narratives. These mixed-truth cases represent a realistic and consequential threat, yet they remain underrepresented in existing benchmarks. To address this gap, we introduce MANYFAKE, a synthetic benchmark containing 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that capture many ways fake news can be constructed and refined. Using this benchmark, we evaluate a range of state-of-the-art fake news detectors. Our results show that even advanced reasoning-enabled models approach saturation on fully fabricated stories, but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.[64] Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision
Soroosh Tayebi Arasteh,Mehdi Joodaki,Mahshad Lotfinia,Sven Nebelung,Daniel Truhn
Main category: cs.CL
TL;DR: 本文提出了一种基于案例的证据验证框架,通过构造显式支持与可控非支持样本(无需人工标注)来增强模型对证据的依赖性,在放射学任务中验证了其有效性。
Details
Motivation: 现有证据推理模型常因监督信号弱、证据与主张关联松散、评估未直接检验证据依赖性而失败。 Method: 提出案例驱动的证据验证框架,设计一种自动生成显式支持样本与语义控制的非支持样本(如反事实错误状态和主题相关负样本)的监督构建方法,并在放射学领域训练标准验证器。 Result: 所学验证器显著优于仅用案例或仅用证据的基线模型;在正确证据下表现强劲,但移除或替换证据时性能骤降,表明其真正依赖证据;该行为可迁移到未见证据文章和外部案例分布,但在证据来源偏移下性能下降,且对主干模型选择敏感。 Conclusion: 证据接地的主要瓶颈不仅在于模型能力,更在于缺乏能编码证据因果作用的监督信号。 Abstract: Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.[65] Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Hadas Orgad,Boyi Wei,Kaden Zheng,Martin Wattenberg,Peter Henderson,Seraphina Goldfarb-Tarrant,Yonatan Belinkov
Main category: cs.CL
TL;DR: 本文通过定向权重剪枝方法,发现大语言模型中危害性内容生成依赖于一组紧凑、跨类型通用且与良性能力分离的权重;对齐训练会压缩这些‘危害权重’,但这也导致微调时易引发广泛误对齐;同时,模型生成危害内容的能力与其识别/解释该内容的能力相互独立。
Details
Motivation: 探究大语言模型中‘有害性’是否具有内在一致的组织结构,解释当前对齐方法为何表面脆弱(如易被越狱、微调后出现泛化性误对齐)。 Method: 采用靶向权重剪枝作为因果干预手段,系统分析对齐与未对齐模型中生成有害内容所依赖的参数子集,并检验其跨危害类型通用性、与良性能力的分离性,以及与有害内容识别/解释能力的关系。 Result: 发现:1)有害内容生成依赖一组紧凑、跨类型通用、区别于良性能力的权重;2)对齐模型比未对齐模型更显著压缩此类‘危害权重’;3)该压缩机制可解释‘新兴误对齐’现象;4)在窄域剪枝危害权重可显著缓解新兴误对齐;5)有害生成能力与有害识别/解释能力彼此解耦。 Conclusion: 大语言模型内部存在关于‘有害性’的连贯结构,这种结构虽导致表层安全防护脆弱,却为构建更原理性的安全方法提供了基础。 Abstract: Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce ``emergent misalignment'' that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally--despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.cs.CV [Back]
[66] Detection of Hate and Threat in Digital Forensics: A Case-Driven Multimodal Approach
Ponkoj Chandra Shill
Main category: cs.CV
TL;DR: 本文提出了一种面向数字取证场景的、以案例驱动的多模态仇恨与威胁检测框架,根据证据类型(嵌入文本、上下文文本或纯图像)自适应选择文本分析、多模态融合或纯视觉语义推理方法,提升可追溯性与法证合理性。
Details
Motivation: 现有自动化方法多假设输入为干净文本,或未经法证依据直接应用视觉模型,难以应对数字取证中图像、扫描文档和上下文报告等异构证据中显性或隐性危害表达的检测需求。 Method: 提出一种案例驱动的多模态框架,首先判别证据中是否存在文本及其来源(嵌入文本、关联上下文文本或纯图像),再据此动态选择文本分析、多模态融合或基于ViT骨干的视觉语言模型进行图像语义推理。 Result: 在法证风格图像证据上的实验表明,该方法在多种异构证据场景下均展现出稳定且可解释的检测行为。 Conclusion: 通过将推理过程与证据可用性条件绑定,该框架更贴近真实法证决策逻辑,增强了证据溯源能力,并避免了不合理的模态假设。 Abstract: Digital forensic investigations increasingly rely on heterogeneous evidence such as images, scanned documents, and contextual reports. These artifacts may contain explicit or implicit expressions of harm, hate, threat, violence, or intimidation, yet existing automated approaches often assume clean text input or apply vision models without forensic justification. This paper presents a case-driven multimodal approach for hate and threat detection in forensic analysis. The proposed framework explicitly determines the presence and source of textual evidence, distinguishing between embedded text, associated contextual text, and image-only evidence. Based on the identified evidence configuration, the framework selectively applies text analysis, multimodal fusion, or image-only semantic reasoning using vision language models with vision transformer backbones (ViT). By conditioning inference on evidence availability, the approach mirrors forensic decision-making, improves evidentiary traceability, and avoids unjustified modality assumptions. Experimental evaluation on forensic-style image evidence demonstrates consistent and interpretable behavior across heterogeneous evidence scenarios.[67] A Semi-Automated Framework for 3D Reconstruction of Medieval Manuscript Miniatures
Riccardo Pallotto,Pierluigi Feliciati,Tiberio Uricchio
Main category: cs.CV
TL;DR: 本文提出了一种半自动化框架,将中世纪手稿中的二维缩略图转化为适用于扩展现实(XR)、触觉3D打印和网络可视化的三维数字模型;通过评估七种图像到3D方法,发现Hi3DGen在拓扑质量与表面细节间取得较好平衡,并结合SAM分割、专家精修与AI纹理生成构建完整流程;案例涵盖哥特式与文艺复兴手稿,成果支持WebXR、AR叠加及视障用户触觉使用。
Details
Motivation: 将中世纪手稿中的二维微型画转化为高质量三维模型,以支持XR展示、触觉3D打印(尤其服务视障用户)及网络可视化,弥补文化遗产数字化中几何保真与艺术表现兼顾的空白。 Method: 评估七种图像到3D生成方法(TripoSR、SF3D等),采用渲染指标(Silhouette IoU、LPIPS、CLIP Score)和体素指标(Depth Range Ratio、watertight percentage)进行定量比较;构建半自动流程:SAM图像分割 → Hi3DGen生成初始网格 → ZBrush专家精修 → AI辅助纹理映射。 Result: Hi3DGen在拓扑完整性与表面细节上表现最优,适合作为专家精修起点;完整流程成功应用于《格拉提安教令集》哥特式细密画与朱利奥·克洛维奥文艺复兴手稿,生成模型已支持WebXR、AR叠加及触觉3D打印。 Conclusion: 该框架为中世纪手稿三维化提供了可复用、跨艺术风格的技术路径,强调AI初筛与人工精修协同,在文化遗产保护与包容性访问中具有实践价值。 Abstract: This paper presents a semi-automated framework for transforming two-dimensional miniatures from medieval manuscripts into three-dimensional digital models suitable for extended reality (XR), tactile 3D~printing, and web-based visualization. We evaluate seven image-to-3D methods (TripoSR, SF3D, SPAR3D, TRELLIS, Wonder3D, SAM~3D, Hi3DGen) on 69~manuscript figures from two collections using rendering-based metrics (Silhouette IoU, LPIPS, CLIP~Score) and volumetric measures (Depth Range Ratio, watertight percentage), revealing a trade-off between volumetric expansion and geometric fidelity. Hi3DGen balances topological quality with rich surface detail through its normal bridging approach, making it a good starting point for expert refinement. Our pipeline combines SAM segmentation, Hi3DGen mesh generation, expert refinement in ZBrush, and AI-assisted texturing. Two case studies on Gothic illuminations from the Decretum Gratiani (Vatican Library) and Renaissance miniatures by Giulio Clovio demonstrate applicability across artistic traditions. The resulting models can support WebXR visualization, AR overlay on physical manuscripts, and tactile 3D~prints for visually impaired users.[68] ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction
Kun Wang,Yupeng Hu,Zhiran Li,Hao Liu,Qianlong Xiang,Liqiang Nie
Main category: cs.CV
TL;DR: 本文提出了ViSAGE,一种基于多专家自适应门控的视频显著性预测框架,在NTIRE 2026挑战赛中取得领先性能。
Details
Motivation: 为了利用互补的归纳偏置来提升视频显著性预测效果,解决现有方法在建模复杂时空显著性线索方面的局限性。 Method: 提出多专家集成框架ViSAGE,每个专家解码头采用自适应门控与调制机制细化时空特征,并在推理阶段融合各专家的互补预测结果。 Result: 在NTIRE 2026挑战赛私有测试集上,ViSAGE在四项指标中的两项排名第一,其余两项也优于大多数对比方法。 Conclusion: ViSAGE通过聚合多样化归纳偏置,有效提升了视频显著性预测的准确性与泛化能力,验证了多专家自适应门控设计的优越性。 Abstract: In this report, we present our champion solution for the NTIRE 2026 Challenge on Video Saliency Prediction held in conjunction with CVPR 2026. To exploit complementary inductive biases for video saliency, we propose Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework. Each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from different experts are then fused at inference. ViSAGE thereby aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the Private Test set, ViSAGE ranked first on two out of four evaluation metrics, and outperformed most competing solutions on the other two metrics, demonstrating its effectiveness and generalization ability. Our code has been released at https://github.com/iLearn-Lab/CVPRW26-ViSAGE.[69] MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments
Xingming Liao,Ning Chen,Muying Shu,Yunpeng Yin,Peijian Zeng,Zhuowei Wang,Nankai Lin,Lianglun Cheng
Main category: cs.CV
TL;DR: 本文提出了MARINER,一个基于Entity-Environment-Event(3E)范式的综合性海上视觉理解基准,包含16,629张多源图像、63类细粒度船舶类别、多样恶劣环境及5类典型海上事件,覆盖细粒度分类、目标检测和视觉问答任务;实验表明现有主流多模态大模型在细粒度识别与因果推理上仍存在显著挑战。
Details
Motivation: 现实开放水域环境中细粒度视觉理解和高层推理缺乏专用评测基准,限制了相关研究发展。 Method: 提出Entity-Environment-Event(3E)新范式,构建MARINER基准,涵盖多源图像、细粒度船舶分类、复杂环境建模与动态事件标注,并对主流多模态大语言模型进行系统评测。 Result: 现有先进多模态大模型在MARINER上的细粒度判别和因果推理能力表现不佳,揭示了当前模型在真实海洋场景中的局限性。 Conclusion: MARINER填补了海上多模态理解在真实性和认知层次评测方面的空白,为面向开放水域应用的鲁棒视觉-语言模型研究提供了重要基础。 Abstract: Fine-grained visual understanding and high-level reasoning in real-world open-water environments remain under-explored due to the lack of dedicated benchmarks. We introduce MARINER, a comprehensive benchmark built under the novel Entity-Environment-Event (3E) paradigm. MARINER contains 16,629 multi-source maritime images with 63 fine-grained vessel categories, diverse adverse environments, and 5 typical dynamic maritime incidents, covering fine-grained classification, object detection, and visual question answering tasks. We conduct extensive evaluations on mainstream Multimodal Large language models (MLLMs) and establish baselines, revealing that even advanced models struggle with fine-grained discrimination and causal reasoning in complex marine scenes. As a dedicated maritime benchmark, MARINER fills the gap of realistic and cognitive-level evaluation for maritime multimodal understanding, and promotes future research on robust vision-language models for open-water applications. Appendix and supplementary materials are available at https://lxixim.github.io/MARINER.[70] WildDet3D: Scaling Promptable 3D Detection in the Wild
Weikai Huang,Jieyu Zhang,Sijun Li,Taoyang Jia,Jiafei Duan,Yunqian Cheng,Jaemin Cho,Mattew Wallingford,Rustin Soraki,Chris Dongjoo Kim,Donovan Clay,Taira Anderson,Winson Han,Ali Farhadi,Bharath Hariharan,Zhongzheng Ren,Ranjay Krishna
Main category: cs.CV
TL;DR: 本文提出WildDet3D,一种支持多模态提示(文本、点、框)并可融合深度信息的通用单目3D目标检测框架,并构建了覆盖13.5K类、超100万图像的大规模开放世界3D检测数据集WildDet3D-Data,在多个基准上达到SOTA性能。
Details
Motivation: 现有方法受限于单一提示类型、缺乏几何线索融合机制,且3D数据集类别窄、场景受限,难以支撑开放世界泛化需求。 Method: 提出WildDet3D统一架构,支持文本/点/框多模态输入,并在推理时动态融合深度信号;构建WildDet3D-Data数据集,通过从2D标注生成候选3D框并经人工验证,覆盖大量真实场景与长尾类别。 Result: 在自建WildDet3D-Bench上达22.6/24.8 AP3D(文本/框提示),Omni3D上达34.2/36.4 AP3D,零样本迁移在Argoverse2和ScanNet上达40.3/48.9 ODS;引入深度线索平均提升+20.7 AP。 Conclusion: WildDet3D显著提升了单目3D检测在开放世界、多模态提示与几何感知方面的实用性与泛化能力,为3D空间智能提供了更鲁棒、可扩展的基础模型范式。 Abstract: Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection--recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).[71] On Semiotic-Grounded Interpretive Evaluation of Generative Art
Ruixiang Jiang,Changwen Chen
Main category: cs.CV
TL;DR: 本文提出SemJudge,一种基于皮尔斯符号学理论的生成艺术评估框架,通过分层符号图(HSG)建模人-生成艺术交互中的符号与指号意义,弥补现有评估方法仅关注图像质量或提示字面匹配的不足,显著提升与人类艺术解读判断的一致性。
Details
Motivation: 现有生成艺术(GenArt)评估器仅关注表层图像质量或提示字面匹配,无法衡量创作者意图传达的深层象征性或抽象意义,缺乏对符号性和指号性意义的建模能力。 Method: 基于皮尔斯计算符号学理论,构建‘人-生成艺术交互’(HGI)的级联符号过程模型;提出SemJudge评估器,利用分层符号图(HSG)显式建模从提示到生成图像全过程中的图标、符号和指号三重意义传递。 Result: 在强调艺术解读的细粒度艺术基准上,SemJudge比先前评估器更贴近人类判断;用户研究表明其能生成更深入、更具洞察力的艺术解读。 Conclusion: SemJudge推动生成艺术从生成‘漂亮图像’迈向表达复杂人类经验的艺术媒介,为可解释、有意义的生成艺术评估奠定理论与实践基础。 Abstract: Interpretation is essential to deciphering the language of art: audiences communicate with artists by recovering meaning from visual artifacts. However, current Generative Art (GenArt) evaluators remain fixated on surface-level image quality or literal prompt adherence, failing to assess the deeper symbolic or abstract meaning intended by the creator. We address this gap by formalizing a Peircean computational semiotic theory that models Human-GenArt Interaction (HGI) as cascaded semiosis. This framework reveals that artistic meaning is conveyed through three modes - iconic, symbolic, and indexical - yet existing evaluators operate heavily within the iconic mode, remaining structurally blind to the latter two. To overcome this structural blindness, we propose SemJudge. This evaluator explicitly assesses symbolic and indexical meaning in HGI via a Hierarchical Semiosis Graph (HSG) that reconstructs the meaning-making process from prompt to generated artifact. Extensive quantitative experiments show that SemJudge aligns more closely with human judgments than prior evaluators on an interpretation-intensive fine-art benchmark. User studies further demonstrate that SemJudge produces deeper, more insightful artistic interpretations, thereby paving the way for GenArt to move beyond the generation of "pretty" images toward a medium capable of expressing complex human experience. Project page: https://github.com/songrise/SemJudge.[72] 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
Makanjuola Ogunleye,Eman Abdelrahman,Ismini Lourentzou
Main category: cs.CV
TL;DR: 本文提出3D-VCD,首个面向3D具身智能体的推理时视觉对比解码框架,通过扰动3D场景图并对比原始与失真场景下的预测,抑制依赖语言先验而非真实场景证据的幻觉输出,在3D-POPE和HEAL基准上验证了其有效性。
Details
Motivation: 现有推理时幻觉缓解方法主要针对2D图文场景,难以迁移到3D具身推理中;而3D中幻觉源于物体存在性、空间布局和几何定位等结构性问题,非像素级不一致。 Method: 提出3D-VCD框架:构建语义与几何扰动(如类别替换、坐标/尺寸破坏)的失真3D场景图,对比原始与失真3D上下文下的模型预测,抑制对场景证据不敏感的token。 Result: 在3D-POPE和HEAL基准上,3D-VCD无需重训练即显著提升3D具身代理的接地推理能力,验证了基于结构化3D表示的推理时对比解码的有效性与实用性。 Conclusion: 3D-VCD为3D具身智能提供了首个有效的推理时幻觉缓解方案,表明利用结构化3D表示进行对比解码是提升具身智能可靠性的重要可行路径。 Abstract: Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.[73] InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
Zhefan Rao,Bin Zou,Haoxuan Che,Xuanhua He,Chong Hou Choi,Yanheng Li,Rui Liu,Qifeng Chen
Main category: cs.CV
TL;DR: 本文提出InsEdit,一种基于指令的视频编辑模型,在仅使用约10万条视频编辑数据的情况下,无需大规模视频编辑数据即可实现SOTA性能,并支持图像编辑。
Details
Motivation: 指令式视频编辑通常需要大量高质量编辑数据,但此类数据稀缺;现有方法在将生成模型适配为编辑器时面临数据饥渴问题。 Method: 提出InsEdit模型,基于HunyuanVideo-1.5构建,融合视觉编辑架构与基于互文注意力(MCA)的视频数据管道,支持从视频任意帧开始编辑;训练中同时引入图像编辑数据。 Result: 在自建视频指令编辑基准上达到开源方法SOTA;模型无需修改即可支持图像编辑任务。 Conclusion: 视频生成主干模型可通过合理架构与小规模编辑数据高效转化为强编辑器,无需依赖大规模视频编辑数据。 Abstract: Instruction-based video editing is a natural way to control video content with text, but adapting a video generation model into an editor usually appears data-hungry. At the same time, high-quality video editing data remains scarce. In this paper, we show that a video generation backbone can become a strong video editor without large scale video editing data. We present InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), which creates aligned video pairs where edits can begin in the middle of a clip rather than only from the first frame. With only O(100)K video editing data, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks. In addition, because our training recipe also includes image editing data, the final model supports image editing without any modification.[74] EfficientSign: An Attention-Enhanced Lightweight Architecture for Indian Sign Language Recognition
Rishabh Gupta,Shravya R. Nalla
Main category: cs.CV
TL;DR: 本文提出了EfficientSign,一种轻量级的手语识别模型,通过在EfficientNet-B0基础上引入通道注意力(Squeeze-and-Excitation)和空间注意力模块,在印度手语字母识别任务中达到99.94%准确率,参数量比ResNet18减少62%,且显著优于传统方法。
Details
Motivation: 构建可在手机端部署的高效手语识别系统。 Method: 基于EfficientNet-B0,集成Squeeze-and-Excitation通道注意力与自研空间注意力模块;同时对比测试了将EfficientNet-B0深层特征输入SVM、逻辑回归、KNN等经典分类器的效果。 Result: EfficientSign在12,637张ISL图像上5折交叉验证达99.94%±0.05%准确率,参数仅4.2M(较ResNet18的11.2M减少62%);SVM、逻辑回归、KNN分别达99.63%、99.03%、96.33%,均远超2015年SURF方法的92%。 Conclusion: 注意力增强的轻量模型可高效、便捷地实现ISL识别,无需大模型或人工设计特征流程,适合移动端部署。 Abstract: How do you build a sign language recognizer that works on a phone? That question drove this work. We built EfficientSign, a lightweight model which takes EfficientNet-B0 and focuses on two attention modules (Squeeze-and-Excitation for channel focus, and a spatial attention layer that focuses on the hand gestures). We tested it against five other approaches on 12,637 images of Indian Sign Language alphabets, all 26 classes, using 5-fold cross-validation. EfficientSign achieves the accuracy of 99.94% (+/-0.05%), which matches the performance of ResNet18's 99.97% accuracy, but with 62% fewer parameters (4.2M vs 11.2M). We also experimented with feeding deep features (1,280-dimensional vectors pulled from EfficientNet-B0's pooling layer) into classical classifiers. SVM achieved the accuracy of 99.63%, Logistic Regression achieved the accuracy of 99.03% and KNN achieved accuracy of 96.33%. All of these blow past the 92% that SURF-based methods managed on a similar dataset back in 2015. Our results show that attention-enhanced learning model provides an efficient and deployable solution for ISL recognition without requiring a massive model or hand-tuned feature pipelines anymore.[75] Unified Multimodal Uncertain Inference
Dengjia Zhang,Alexander Martin,William Jurayj,Kenton Murray,Benjamin Van Durme,Reno Kriz
Main category: cs.CV
TL;DR: 本文提出了统一多模态不确定推理(UMUI)任务,旨在让模型在文本、音频和视频等任意模态或其组合前提下,对假设生成校准过的概率估计;为此构建了人类标注的多模态概率评估数据集,并提出CLUE方法,通过自一致教师校准与基于分布的置信探测实现校准预测,在参数量更小(3B)的情况下达到甚至超越更大模型(32B)的性能。
Details
Motivation: 现有不确定推理研究主要集中于文本模态,跨模态(尤其是音频、视频)的细粒度概率推理缺乏统一框架和评估基准。 Method: 提出CLUE(Calibrated Latent Uncertainty Estimation)方法,融合自一致性教师校准与基于分布的置信探测机制;构建覆盖文本、音频、视频及音视频组合的人类标注概率评估数据集,并在多个现有文本与音频基准上进行评测。 Result: 所提出的3B参数模型在所有模态上均达到与32B参数基线模型相当或更优的校准性能与推理准确率。 Conclusion: UMUI任务为多模态不确定推理提供了首个统一框架和评估标准;CLUE方法有效提升了跨模态概率预测的校准性,且具备参数效率优势。 Abstract: We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.[76] RS-OVC: Open-Vocabulary Counting for Remote-Sensing Data
Tamir Shor,George Leifman,Genady Beryozkin
Main category: cs.CV
TL;DR: 本文提出了RS-OVC,首个面向遥感与航拍图像的开放词汇计数(OVC)模型,支持仅通过文本或视觉提示对训练中未见的新物体类别进行准确计数,突破了传统方法局限于封闭预定义类别的限制。
Details
Motivation: 现有遥感图像目标计数方法局限于封闭、预定义的物体类别,难以适应动态现实监测场景中出现的新物体,需昂贵的重标注和重训练。 Method: 提出RS-OVC——首个面向遥感与航拍图像的开放词汇计数(OVC)模型,利用文本和/或视觉条件实现对训练未见新类别的计数。 Result: RS-OVC能基于文本和/或视觉提示,准确计数训练阶段未见过的新型物体类别。 Conclusion: RS-OVC有效解决了遥感图像计数中对新类别泛化能力不足的问题,提升了模型在真实动态监测场景中的适用性与可扩展性。 Abstract: Object-Counting for remote-sensing (RS) imagery is attracting increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object-counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches for counting of novel objects that have not been seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC - the first Open Vocabulary Counting (OVC) model for Remote-Sensing and aerial imagery. We show that our model is capable of accurate counting of novel object classes, that were unseen during training, based solely on textual and/or visual conditioning.[77] Deep Learning-Based Tracking and Lineage Reconstruction of Ligament Breakup
Vrushank Ahire,Vivek Kurumanghat,Mudasir Ganaie,Lipika Kabiraj
Main category: cs.CV
TL;DR: 本文提出了一种两阶段深度学习框架,用于分析高速阴影图像中液膜破碎产生的液丝和液滴的检测与跨帧时序关系建模,特别支持一对多断裂事件的识别,实现了断裂谱系重建与初级雾化模式的自动分析。
Details
Motivation: 传统多目标跟踪方法无法处理液膜破碎中常见的一对多断裂事件,且高动态、多尺度的瞬态过程难以从高速阴影图像中准确量化。 Method: 第一阶段采用带ResNet-50骨干网和特征金字塔网络(FPN)的Faster R-CNN检测液丝与液滴,并结合形态保持的合成数据增强策略;第二阶段使用Transformer增强的MLP,基于物理信息几何特征分类帧间关联(延续、断裂、无关联)。 Result: 检测阶段在14种原始-合成配置下最高F1达0.872;关联分类阶段对断裂事件实现100%召回率、93.2%精度和86.1%总体准确率,并成功重建断裂树、提取碎片多重性与液滴尺寸分布。 Conclusion: 该框架首次实现了对液膜破碎中父子谱系的显式建模与自动化分析,为初级雾化机理研究提供了可解释、高精度的定量工具。 Abstract: The disintegration of liquid sheets into ligaments and droplets involves highly transient, multi-scale dynamics that are difficult to quantify from high-speed shadowgraphy images. Identifying droplets, ligaments, and blobs formed during breakup, along with tracking across frames, is essential for spray analysis. However, conventional multi-object tracking frameworks impose strict one-to-one temporal associations and cannot represent one-to-many fragmentation events. In this study, we present a two-stage deep learning framework for object detection and temporal relationship modeling across frames. The framework captures ligament deformation, fragmentation, and parent-child lineage during liquid sheet disintegration. In the first stage, a Faster R-CNN with a ResNet-50 backbone and Feature Pyramid Network detects and classifies ligaments and droplets in high-speed shadowgraphy recordings of an impinging Carbopol gel jet. A morphology-preserving synthetic data generation strategy augments the training set without introducing physically implausible configurations, achieving a held-out F1 score of up to 0.872 across fourteen original-to-synthetic configurations. In the second stage, a Transformer-augmented multilayer perceptron classifies inter-frame associations into continuation, fragmentation (one-to-many), and non-association using physics-informed geometric features. Despite severe class imbalance, the model achieves 86.1% accuracy, 93.2% precision, and perfect recall (1.00) for fragmentation events. Together, the framework enables automated reconstruction of fragmentation trees, preservation of parent-child lineage, and extraction of breakup statistics such as fragment multiplicity and droplet size distributions. By explicitly identifying children droplets formed from ligament fragmentation, the framework provides automated analysis of the primary atomization mode.[78] What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction
Loc-Phat Truong,Meysam Madadi,Sergio Escalera
Main category: cs.CV
TL;DR: 本文研究虚拟脱衣(VTOFF)任务,即从穿着图像中重建原始服装,提出基于双UNet扩散模型的框架,并系统分析生成主干、条件输入和损失函数等设计选择,在VITON-HD和DressCode数据集上达到SOTA性能。
Details
Motivation: 虚拟脱衣(VTOFF)作为虚拟试衣(VTON)的逆问题,长期被忽视且缺乏系统研究,亟需建立稳健的架构基础。 Method: 基于Dual-UNet扩散模型,系统探究三方面设计:(i)生成主干(Stable Diffusion变体对比);(ii)条件机制(掩码设计、输入遮蔽策略、高层语义特征);(iii)损失与训练策略(注意力辅助损失、感知目标、多阶段课程学习)。 Result: 在VITON-HD和DressCode上实现SOTA:DISTS下降9.5%,LPIPS、FID、KID、SSIM等指标表现具竞争力。 Conclusion: 本工作为VTOFF提供了强基线与可复现的设计指南,揭示了关键模块权衡,推动该新兴方向发展。 Abstract: Virtual Try-On (VTON) has seen rapid advancements, providing a strong foundation for generative fashion tasks. However, the inverse problem, Virtual Try-Off (VTOFF)-aimed at reconstructing the canonical garment from a draped-on image-remains a less understood domain, distinct from the heavily researched field of VTON. In this work, we seek to establish a robust architectural foundation for VTOFF by studying and adapting various diffusion-based strategies from VTON and general Latent Diffusion Models (LDMs). We focus our investigation on the Dual-UNet Diffusion Model architecture and analyze three axes of design: (i) Generation Backbone: comparing Stable Diffusion variants; (ii) Conditioning: ablating different mask designs, masked/unmasked inputs for image conditioning, and the utility of high-level semantic features; and (iii) Losses and Training Strategies: evaluating the impact of the auxiliary attention-based loss, perceptual objectives and multi-stage curriculum schedules. Extensive experiments reveal trade-offs across various configuration options. Evaluated on VITON-HD and DressCode datasets, our framework achieves state-of-the-art performance with a drop of 9.5\% on the primary metric DISTS and competitive performance on LPIPS, FID, KID, and SSIM, providing both stronger baselines and insights to guide future Virtual Try-Off research.[79] Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring
Xinmiao Xiong,Bangya Liu,Hao Wang,Dayou Li,Nuo Chen,Andrew Feng,Mingyu Ding,Suman Banerjee,Yang Zhou,Zhiwen Fan
Main category: cs.CV
TL;DR: 本文提出LeanGate,一种轻量级前馈帧门控网络,用于在几何基础模型(GFM)驱动的单目SLAM中提前过滤冗余帧,显著降低计算开销而不损失精度。
Details
Motivation: 现有GFM-SLAM系统依赖后验关键帧选择,需先执行昂贵的密集几何解码才能判断帧是否具有新颖几何信息,导致大量计算浪费。 Method: 设计LeanGate模块,在GFM特征提取和匹配之前,通过轻量前馈网络预测每帧的几何效用分,实现帧级早期拒绝。 Result: 在标准SLAM基准上,LeanGate减少超85%的跟踪FLOPs,端到端吞吐量提升5倍,并保持与稠密基线相当的跟踪与建图精度。 Conclusion: LeanGate是一种高效、即插即用的帧筛选机制,有效缓解GFM-SLAM中的计算冗余问题,为实时、低功耗部署提供了新路径。 Abstract: Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post hoc keyframe selection. Because of this, they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, resulting in late rejection and wasted computation. To mitigate this inefficiency, we propose LeanGate, a lightweight feed-forward frame-gating network. LeanGate predicts a geometric utility score to assess a frame's mapping value prior to the heavy GFM feature extraction and matching stages. As a predictive plug-and-play module, our approach bypasses over 90% of redundant frames. Evaluations on standard SLAM benchmarks demonstrate that LeanGate reduces tracking FLOPs by more than 85% and achieves a 5x end-to-end throughput speedup. Furthermore, it maintains the tracking and mapping accuracy of dense baselines.[80] LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
Hao Shao,Letian Wang,Yang Zhou,Yuxuan Hu,Zhuofan Zong,Steven L. Waslander,Wei Zhan,Hongsheng Li
Main category: cs.CV
TL;DR: 本文提出LMGenDrive框架,首次将基于大语言模型(LLM)的多模态理解与生成式世界模型结合,实现端到端闭环自动驾驶,支持多视角视频输入与自然语言指令下的未来驾驶视频与控制信号联合生成。
Details
Motivation: 现有自动驾驶系统在长尾和开放世界场景中泛化能力不足;受人类‘理解+想象’智能启发,需统一视觉语言理解与场景演化建模能力。 Method: 提出LMGenDrive:融合LLM的多模态理解模块与生成式世界模型,支持多视图图像与自然语言指令输入,输出未来驾驶视频与控制信号;采用三阶段渐进训练策略(视觉预训练→多步长时序驾驶训练)。 Result: 在具挑战性的闭环基准测试中显著优于先前方法,尤其在指令跟随、时空理解及罕见场景鲁棒性方面提升明显;支持低延迟在线规划与自回归离线视频生成。 Conclusion: 统一多模态理解与生成是提升具身决策系统泛化性与鲁棒性的可行路径,为通用自动驾驶提供新范式。 Abstract: Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio-temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large-scale pretraining. We further propose a progressive three-stage training strategy, from vision pretraining to multi-step long-horizon driving, to improve stability and performance. LMGenDrive supports both low-latency online planning and autoregressive offline video generation. Experiments show that it significantly outperforms prior methods on challenging closed-loop benchmarks, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios. These results suggest that unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision-making systems.[81] AI Driven Soccer Analysis Using Computer Vision
Adrian Manchado,Tanner Cellio,Jonathan Keane,Yiyang Wang
Main category: cs.CV
TL;DR: 本文提出一种结合目标检测、关键点定位与单应性变换的足球比赛视频分析框架,实现球员在真实场地坐标系中的精确定位与运动分析。
Details
Motivation: 体育分析对提升球队表现至关重要,但传统视频分析难以提供精确的场地坐标系下的战术数据,亟需结合计算机视觉技术实现从视频到真实世界坐标的映射。 Method: 采用YOLO和Faster R-CNN等模型进行球员检测;使用CNN预测球场关键点;通过单应性变换(homography)将SAM2生成的球员分割掩码及关键点从相机视角映射至真实场地坐标系;最终计算速度、跑动距离、热力图等战术指标。 Result: 实现了不依赖固定摄像角度的鲁棒球员定位,支持多角度视频输入,并可输出真实尺度下的量化战术指标(如球员速度、覆盖距离、空间分布热力图等)。 Conclusion: 该框架有效 bridged 视频分析与真实场地坐标之间的鸿沟,为教练和运动员提供了以往无法获得的高精度、可操作的比赛表现数据。 Abstract: Sport analysis is crucial for team performance since it provides actionable data that can inform coaching decisions, improve player performance, and enhance team strategies. To analyze more complex features from game footage, a computer vision model can be used to identify and track key entities from the field. We propose the use of an object detection and tracking system to predict player positioning throughout the game. To translate this to positioning in relation to the field dimensions, we use a point prediction model to identify key points on the field and combine these with known field dimensions to extract actual distances. For the player-identification model, object detection models like YOLO and Faster R-CNN are evaluated on the accuracy of our custom video footage using multiple different evaluation metrics. The goal is to identify the best model for object identification to obtain the most accurate results when paired with SAM2 (Segment Anything Model 2) for segmentation and tracking. For the key point detection model, we use a CNN model to find consistent locations in the soccer field. Through homography, the positions of points and objects in the camera perspective will be transformed to a real-ground perspective. The segmented player masks from SAM2 are transformed from camera perspective to real-world field coordinates through homography, regardless of camera angle or movement. The transformed real-world coordinates can be used to calculate valuable tactical insights including player speed, distance covered, positioning heatmaps, and more complex team statistics, providing coaches and players with actionable performance data previously unavailable from standard video analysis.[82] LPLCv2: An Expanded Dataset for Fine-Grained License Plate Legibility Classification
Lucas Wojcik,Eduardo A. F. Machoski,Eduil Nascimento,Rayson Laroca,David Menotti
Main category: cs.CV
TL;DR: 本文扩展了现有的车牌识别基准数据集LPLCv2,增加了三倍规模、修正标注并引入新标签;提出基于指数移动平均的损失函数和改进的学习率调度器,使基线模型F1得分达89.5%,显著超越先前SOTA;同时提出新协议解决训练/测试中相机污染问题。
Details
Motivation: 现有ALPR系统在真实复杂场景(如低质成像、压缩伪影、相机安装不佳)下性能受限;已有 illegible LP 基准规模小、标注错误多,限制其影响力。 Method: 扩展原始基准至三倍以上规模,新增两天采集数据;全面修订LP级(框、文本、可读性)、车辆级(品牌、型号、类型、颜色)和图像级(相机ID、天气、故障、时间、日期)标注;提出基于指数移动平均(EMA)的损失函数与精细化学习率调度器;设计新评估协议以控制相机污染。 Result: 基线模型在测试集上达到89.5% F1-score,显著优于先前最优结果;新相机污染协议验证显示影响较小。 Conclusion: LPLCv2是当前最全面、高质量的 illegible LP 基准;所提训练策略有效提升鲁棒性与性能;开源数据集与代码推动ALPR在真实场景中的发展。 Abstract: Modern Automatic License Plate Recognition (ALPR) systems achieve outstanding performance in controlled, well-defined scenarios. However, large-scale real-world usage remains challenging due to low-quality imaging devices, compression artifacts, and suboptimal camera installation. Identifying illegible license plates (LPs) has recently become feasible through a dedicated benchmark; however, its impact has been limited by its small size and annotation errors. In this work, we expand the original benchmark to over three times the size with two extra capture days, revise its annotations and introduce novel labels. LP-level annotations include bounding boxes, text, and legibility level, while vehicle-level annotations comprise make, model, type, and color. Image-level annotations feature camera identity, capture conditions (e.g., rain and faulty cameras), acquisition time, and day ID. We present a novel training procedure featuring an Exponential Moving Average-based loss function and a refined learning rate scheduler, addressing common mistakes in testing. These improvements enable a baseline model to achieve an 89.5% F1-score on the test set, considerably surpassing the previous state of the art. We further introduce a novel protocol to explicitly addresses camera contamination between training and evaluation splits, where results show a small impact. Dataset and code are publicly available at https://github.com/lmlwojcik/LPLCv2-Dataset.[83] SIC3D: Style Image Conditioned Text-to-3D Gaussian Splatting Generation
Ming He,Zhixiang Chen,Steve Maddock
Main category: cs.CV
TL;DR: SIC3D是一种图像引导的可控文本到3D生成方法,利用3D高斯泼溅和新提出的变分风格化分数蒸馏(VSSD)损失,提升几何保真度与纹理风格一致性。
Details
Motivation: 现有文本到3D方法受限于文本模态表达能力,导致可控性差和纹理模糊。 Method: 提出两阶段SIC3D框架:第一阶段用文本生成3D高斯泼溅;第二阶段通过新设计的VSSD损失将参考图像风格迁移至3DGS,并引入缩放正则化抑制伪影。 Result: 在定性和定量评估中均优于先前方法,显著提升了几何保真度与风格一致性。 Conclusion: SIC3D有效解决了文本到3D生成中可控性弱与纹理歧义问题,为图像引导的可控3D内容生成提供了新范式。 Abstract: Recent progress in text-to-3D object generation enables the synthesis of detailed geometry from text input by leveraging 2D diffusion models and differentiable 3D representations. However, the approaches often suffer from limited controllability and texture ambiguity due to the limitation of the text modality. To address this, we present SIC3D, a controllable image-conditioned text-to-3D generation pipeline with 3D Gaussian Splatting (3DGS). There are two stages in SIC3D. The first stage generates the 3D object content from text with a text-to-3DGS generation model. The second stage transfers style from a reference image to the 3DGS. Within this stylization stage, we introduce a novel Variational Stylized Score Distillation (VSSD) loss to effectively capture both global and local texture patterns while mitigating conflicts between geometry and appearance. A scaling regularization is further applied to prevent the emergence of artifacts and preserve the pattern from the style image. Extensive experiments demonstrate that SIC3D enhances geometric fidelity and style adherence, outperforming prior approaches in both qualitative and quantitative evaluations.[84] State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition
Bryan Cheng,Austin Jin,Jasper Zhang
Main category: cs.CV
TL;DR: 本文提出PHONSSM模型,通过解剖学引导的图注意力、正交子空间分解和原型分类,将手语识别建模为音系参数(手形、位置、运动、朝向)的组合过程,在大规模ASL数据集上显著提升性能,尤其在小样本和零样本迁移场景下效果突出。
Details
Motivation: 现有手语识别模型将手势视为原子视觉模式,无法利用手语固有的音系组合结构,导致在词汇量增大时性能急剧下降(灾难性扩展失败)。 Method: 提出PHONSSM模型,融合解剖学引导的图注意力机制、显式的正交子空间因子分解、以及基于原型的分类策略,强制模型学习音系参数分解表示。仅使用骨架数据,在迄今最大的ASL数据集(5565个手势)上进行训练与评估。 Result: 在WLASL2000上达72.1%准确率(较骨架SOTA提升18.4个百分点),超越多数RGB视频方法;小样本性能相对提升225%;零样本迁移到ASL Citizen数据集,超过监督式RGB基线。 Conclusion: 手语识别的词汇扩展瓶颈本质上是表征学习问题,引入符合语言学结构的组合归纳偏置可有效解决。 Abstract: Sign language recognition suffers from catastrophic scaling failure: models achieving high accuracy on small vocabularies collapse at realistic sizes. Existing architectures treat signs as atomic visual patterns, learning flat representations that cannot exploit the compositional structure of sign languages-systematically organized from discrete phonological parameters (handshape, location, movement, orientation) reused across the vocabulary. We introduce PHONSSM, enforcing phonological decomposition through anatomically-grounded graph attention, explicit factorization into orthogonal subspaces, and prototypical classification enabling few-shot transfer. Using skeleton data alone on the largest ASL dataset ever assembled (5,565 signs), PHONSSM achieves 72.1% on WLASL2000 (+18.4pp over skeleton SOTA), surpassing most RGB methods without video input. Gains are most dramatic in the few-shot regime (+225% relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases mirroring linguistic structure.[85] InstrAct: Towards Action-Centric Understanding in Instructional Videos
Zhuoyi Yang,Jiapeng Yu,Reuben Tan,Boyang Li,Huijuan Xu
Main category: cs.CV
TL;DR: 本文提出InstrAction框架,通过数据驱动的过滤与负样本生成、动作感知器提取运动特征、动态时间规整对齐和掩码动作建模等技术,提升 instructional video 中细粒度动作识别与时序建模能力,并在新构建的 InstrAct Bench 基准上显著超越现有视频基础模型。
Details
Motivation: 当前视频基础模型在理解教学视频时面临两大挑战:网络监督噪声大、存在‘静态偏差’(依赖物体而非运动线索),难以识别细粒度动作并建模其时序关系。 Method: 提出 InstrAction 预训练框架:1)数据驱动策略——过滤噪声字幕、生成动作中心的难负样本以解耦动作与物体;2)Action Perceiver 模块提取运动相关视觉token;3)引入 DTW-Align 建模时序结构、Masked Action Modeling 强化跨模态对齐;4)构建 InstrAct Bench 评测基准。 Result: 在 InstrAct Bench 的语义推理、程序逻辑和细粒度检索任务上,IntrAction 一致优于现有 SOTA 视频基础模型。 Conclusion: 动作中心的表征学习需从数据、特征提取和预训练目标三方面协同优化,InstrAction 有效缓解静态偏差、提升运动感知与时序建模能力,为 instructional video 理解提供了新范式。 Abstract: Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive "static bias", where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for instructional videos' action-centric representations. We first introduce a data-driven strategy, which filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and Masked Action Modeling (MAM) for strengthening cross-modal grounding. Finally, we introduce the InstrAct Bench to evaluate action-centric understanding, where our method consistently outperforms state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval tasks.[86] R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII
Zewei Zhou,Jiajun Zou,Jiajia Zhang,Ao Yang,Ruichao He,Haozheng Zhou,Ao Liu,Jiawei Liu,Leilei Jin,Shan Shen,Daying Sun
Main category: cs.CV
TL;DR: 本文提出了R2G基准套件,用于标准化物理设计任务中的电路图表示,并通过系统性实验揭示了图表示选择对GNN性能的影响远大于模型选择本身。
Details
Motivation: 现有GNN在物理设计任务(如拥塞预测、线长估计)中进展受限,主要由于电路表示不一致及缺乏可控的评估协议。 Method: 构建了R2G(RTL-to-GDSII)多视图电路图基准套件,包含5种阶段感知、信息对等的图视图,覆盖30个开源IP核;提供从DEF到图的端到端流水线、统一数据划分、领域指标和可复现基线;并在GINE、GAT、ResGatedGCN上开展系统实验。 Result: 实验发现:(i) 图视图选择对测试R²影响远超模型选择(差异>0.3);(ii) 以节点为中心的视图在布局与布线任务中泛化最佳;(iii) 解码头深度(3–4层)是精度主导因素,可使R²达>0.99。 Conclusion: R2G有效解耦了图表示与模型选择,揭示了图表示设计在EDA图学习中的核心作用,为后续研究提供了标准化基准与关键设计启示。 Abstract: Graph neural networks (GNNs) are increasingly applied to physical design tasks such as congestion prediction and wirelength estimation, yet progress is hindered by inconsistent circuit representations and the absence of controlled evaluation protocols. We present R2G (RTL-to-GDSII), a multi-view circuit-graph benchmark suite that standardizes five stage-aware views with information parity (every view encodes the same attribute set, differing only in where features attach) over 30 open-source IP cores (up to $10^6$ nodes/edges). R2G provides an end-to-end DEF-to-graph pipeline spanning synthesis, placement, and routing stages, together with loaders, unified splits, domain metrics, and reproducible baselines. By decoupling representation choice from model choice, R2G isolates a confound that prior EDA and graph-ML benchmarks leave uncontrolled. In systematic studies with GINE, GAT, and ResGatedGCN, we find: (i) view choice dominates model choice, with Test R$^2$ varying by more than 0.3 across representations for a fixed GNN; (ii) node-centric views generalize best across both placement and routing; and (iii) decoder-head depth (3--4 layers) is the primary accuracy driver, turning divergent training into near-perfect predictions (R$^2$$>$0.99). Code and datasets are available at https://github.com/ShenShan123/R2G.[87] Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models
Sumra Khan,Sagar Chhabriya,Aizan Zafar,Sheeraz Arif,Amgad Muneer,Anas Zafar,Shaina Raza,Rizwan Qureshi
Main category: cs.CV
TL;DR: 本文提出了一种上下文对齐的推理框架,通过整合放射组学统计、可解释性激活和语义线索等多源临床证据,增强医疗视觉-语言模型(VLM)的跨模态一致性与诊断可靠性,显著提升AUC、减少幻觉并生成更简洁、可信的结构化输出。
Details
Motivation: 现有医疗视觉-语言模型虽在放射学任务中表现良好,但常因过度依赖单一模态而生成流利却缺乏依据的结论,导致可解释性与可靠性不足。 Method: 在冻结VLM基础上,引入源自放射组学统计、可解释性激活和词汇对齐语义线索的结构化上下文信号,并通过上下文验证机制融合这些信号;模型输出结构化诊断结果,包含支持证据、不确定性估计、局限性和安全提示。 Result: 在胸部X光数据集上,AUC从0.918提升至0.925,幻觉关键词数从1.14降至0.25,推理长度从19.4词减至15.3词,模型置信度略有下降(0.70→0.68),不确定性保持校准;CheXpert跨数据集实验进一步揭示模态信息量影响推理行为。 Conclusion: 强制多证据一致性可显著提升医疗多模态推理的可靠性与可信度,且无需修改底层模型架构。 Abstract: Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.[88] SenBen: Sensitive Scene Graphs for Explainable Content Moderation
Fatih Cagatay Akyon,Alptekin Temizel
Main category: cs.CV
TL;DR: 本文提出了首个面向敏感内容的大型场景图基准SenBen,并设计了一种多任务蒸馏方法,将前沿视觉语言模型压缩为轻量级学生模型,在敏感内容检测任务上显著提升性能,同时兼顾高效推理和低资源消耗。
Details
Motivation: 现有内容审核系统缺乏空间定位能力和可解释性,无法说明检测到的敏感行为、涉及人物及发生位置。 Method: 构建了包含13999帧电影画面的SenBen基准,采用Visual Genome风格场景图标注;提出多任务知识蒸馏框架,包括基于后缀的对象标识、词汇感知召回(VAR)损失、解耦式Query2Label标签头与非对称损失。 Result: 学生模型在SenBen上的Recall提升6.4个百分点;在场景图理解指标上优于所有对比VLM(除Gemini外)及商用安全API;在目标检测与图像描述任务中得分最高,且推理速度快7.6倍、GPU显存占用低16倍。 Conclusion: 所提方法在保持高精度的同时大幅提升效率与可解释性,为敏感内容审核提供了新范式。 Abstract: Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at $7.6\times$ faster inference and $16\times$ less GPU memory.[89] CatalogStitch: Dimension-Aware and Occlusion-Preserving Object Compositing for Catalog Image Generation
Sanyam Jain,Pragya Kandari,Manit Singhal,He Zhang,Soo Ye Kim
Main category: cs.CV
TL;DR: 本文提出CatalogStitch,一组模型无关的技术,用于自动化生成式图像合成中产品尺寸适配与遮挡修复,显著减少人工干预,提升商品目录图像生成的实用性与易用性。
Details
Motivation: 现有生成式对象合成方法在真实商品目录图像生成中需大量手动干预(如调整掩码、修复遮挡),效率低且不实用。 Method: 提出两种核心技术:1)尺寸感知掩码计算算法,自动适配不同尺寸产品;2)遮挡感知混合修复方法,精确保留遮挡元素;并构建CatalogStitch-Eval基准(58个样例)及配套可视化工具。 Result: 在ObjectStitch、OmniPaint和InsertAnything三个SOTA模型上验证,CatalogStitch在多种目录场景下均带来一致性能提升;显著降低人工编辑需求。 Conclusion: CatalogStitch将生成式合成转化为实用、人性化的产品目录生产工具,推动其在工业场景落地。 Abstract: Generative object compositing methods have shown remarkable ability to seamlessly insert objects into scenes. However, when applied to real-world catalog image generation, these methods require tedious manual intervention: users must carefully adjust masks when product dimensions differ, and painstakingly restore occluded elements post-generation. We present CatalogStitch, a set of model-agnostic techniques that automate these corrections, enabling user-friendly content creation. Our dimension-aware mask computation algorithm automatically adapts the target region to accommodate products with different dimensions; users simply provide a product image and background, without manual mask adjustments. Our occlusion-aware hybrid restoration method guarantees pixel-perfect preservation of occluding elements, eliminating post-editing workflows. We additionally introduce CatalogStitch-Eval, a 58-example benchmark covering aspect-ratio mismatch and occlusion-heavy catalog scenarios, together with supplementary PDF and HTML viewers. We evaluate our techniques with three state-of-the-art compositing models (ObjectStitch, OmniPaint, and InsertAnything), demonstrating consistent improvements across diverse catalog scenarios. By reducing manual intervention and automating tedious corrections, our approach transforms generative compositing into a practical, human-friendly tool for production catalog workflows.[90] DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization
Xiangyu Li,Yujing Sun,Yuhang Zheng,Yuexin Ma,Kwok-Yan Lam
Main category: cs.CV
TL;DR: 本文提出DefakeQ,首个专为深度伪造检测器设计的量化框架,通过自适应双向压缩策略,在保持高检测性能的同时实现模型轻量化,支持边缘设备实时部署。
Details
Motivation: 现有深度伪造检测方法计算密集、参数量大,难以在资源受限的边缘设备上实时运行;同时,传统量化技术因无法有效保留细微伪造特征而导致性能显著下降。 Method: 提出DefakeQ量化框架,采用新颖的自适应双向压缩策略,兼顾特征相关性建模与冗余消除,以平衡模型紧凑性与检测精度。 Result: 在五个基准数据集和十一个主流检测器上显著优于现有量化与压缩方法,并在移动设备上成功实现真实场景下的实时检测。 Conclusion: DefakeQ是首个面向深度伪造检测任务定制的量化方案,有效解决了边缘端高效、鲁棒检测的关键挑战,推动了媒体取证技术的实际落地。 Abstract: Deepfake detection has become a fundamental component of modern media forensics. Despite significant progress in detection accuracy, most existing methods remain computationally intensive and parameter-heavy, limiting their deployment on resource-constrained edge devices that require real-time, on-site inference. This limitation is particularly critical in an era where mobile devices are extensively used for media-centric applications, including online payments, virtual meetings, and social networking. Meanwhile, due to the unique requirement of capturing extremely subtle forgery artifacts for deepfake detection, state-of-the-art quantization techniques usually underperform for such a challenging task. These fine-grained cues are highly sensitive to model compression and can be easily degraded during quantization, leading to noticeable performance drops. This challenge highlights the need for quantization strategies specifically designed to preserve the discriminative features essential for reliable deepfake detection. To address this gap, we propose DefakeQ, the first quantization framework tailored for deepfake detectors, enabling real-time deployment on edge devices. Our approach introduces a novel adaptive bidirectional compression strategy that simultaneously leverages feature correlations and eliminates redundancy, achieving an effective balance between model compactness and detection performance. Extensive experiments across five benchmark datasets and eleven state-of-the-art backbone detectors demonstrate that DeFakeQ consistently surpasses existing quantization and model compression baselines. Furthermore, we deploy DefakeQ on mobile devices in real-world scenarios, demonstrating its capability for real-time deepfake detection and its practical applicability in edge environments.[91] BIAS: A Biologically Inspired Algorithm for Video Saliency Detection
Zhao-ji Zhang,Ya-tang Li
Main category: cs.CV
TL;DR: BIAS是一种快速、受生物启发的动态视觉显著性检测模型,结合静态与运动信息,在DHF1K数据集和交通事故分析中表现出色。
Details
Motivation: 提升动态视频流中显著性检测的速度、生物合理性及实际应用能力,尤其在依赖自下而上注意的场景中。 Method: 基于Itti-Koch框架,引入视网膜启发的运动检测器提取时序特征,并采用贪心多高斯峰拟合算法识别注视焦点(FOAs),兼顾竞争机制与信息最大化。 Result: 毫秒级延迟检测显著区域,在DHF1K数据集上优于多种启发式方法和深度学习模型;在交通事故分析中实现因果识别SOTA,并可提前0.72秒可靠预测事故。 Conclusion: BIAS在生物可解释性、计算效率与实际性能之间取得良好平衡,适用于高速、可解释的动态显著性检测任务。 Abstract: We present BIAS, a fast, biologically inspired model for dynamic visual saliency detection in continuous video streams. Building on the Itti--Koch framework, BIAS incorporates a retina-inspired motion detector to extract temporal features, enabling the generation of saliency maps that integrate both static and motion information. Foci of attention (FOAs) are identified using a greedy multi-Gaussian peak-fitting algorithm that balances winner-take-all competition with information maximization. BIAS detects salient regions with millisecond-scale latency and outperforms heuristic-based approaches and several deep-learning models on the DHF1K dataset, particularly in videos dominated by bottom-up attention. Applied to traffic accident analysis, BIAS demonstrates strong real-world utility, achieving state-of-the-art performance in cause-effect recognition and anticipating accidents up to 0.72 seconds before manual annotation with reliable accuracy. Overall, BIAS bridges biological plausibility and computational efficiency to achieve interpretable, high-speed dynamic saliency detection.[92] Harnessing Weak Pair Uncertainty for Text-based Person Search
Jintao Sun,Zhedong Zheng,Gangyi Ding
Main category: cs.CV
TL;DR: 本文提出了一种不确定性感知的方法,用于文本驱动的人物检索,通过估计图像-文本对的不确定性并将其融入优化过程,同时引入组级匹配损失,以更好利用弱正样本对,显著提升检索性能。
Details
Motivation: 现有方法过于依赖严格的一对一图像-文本匹配,忽略了来自不同视角但描述同一人物的弱正样本对,导致信息利用不充分。 Method: 提出包含不确定性估计和不确定性正则化两个模块的方法:前者估计正样本对的相对置信度;后者根据不确定性自适应调整损失权重;此外还引入组级图像-文本匹配损失。 Result: 在CUHK-PEDES、RSTPReid和ICFG-PEDES三个数据集上,mAP分别提升3.06%、3.55%和6.94%。 Conclusion: 该方法能有效利用弱正样本对,避免模型错误排斥潜在正样本,在文本驱动人物检索任务中取得显著性能提升。 Abstract: In this paper, we study the text-based person search, which is to retrieve the person of interest via natural language description. Prevailing methods usually focus on the strict one-to-one correspondence pair matching between the visual and textual modality, such as contrastive learning. However, such a paradigm unintentionally disregards the weak positive image-text pairs, which are of the same person but the text descriptions are annotated from different views (cameras). To take full use of weak positives, we introduce an uncertainty-aware method to explicitly estimate image-text pair uncertainty, and incorporate the uncertainty into the optimization procedure in a smooth manner. Specifically, our method contains two modules: uncertainty estimation and uncertainty regularization. (1) Uncertainty estimation is to obtain the relative confidence on the given positive pairs; (2) Based on the predicted uncertainty, we propose the uncertainty regularization to adaptively adjust loss weight. Additionally, we introduce a group-wise image-text matching loss to further facilitate the representation space among the weak pairs. Compared with existing methods, the proposed method explicitly prevents the model from pushing away potentially weak positive candidates. Extensive experiments on three widely-used datasets, .e.g, CUHK-PEDES, RSTPReid and ICFG-PEDES, verify the mAP improvement of our method against existing competitive methods +3.06%, +3.55% and +6.94%, respectively.[93] Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance
Enyi Shi,Fei Shen,Shuyi Miao,Linxia Zhu,Pengyang Shao,Jinhui Tang,Tat-Seng Chua
Main category: cs.CV
TL;DR: 本文提出Precise Shield框架,通过识别并约束安全相关神经元(仅影响<0.03%参数),提升多语言多模态场景下视觉语言大模型的安全性,并发现安全能力在语言与模态间存在中等重叠,支持零样本迁移。
Details
Motivation: 现实部署中,视觉语言大模型面临多语言与多模态复合攻击(如有害图像+低资源语言文本),现有跨语言/跨模态安全方法存在结构性盲区;亟需理解安全能力在模型中的机制分布。 Method: 提出两阶段Precise Shield框架:1)通过对比有害与良性输入的激活模式识别关键安全神经元;2)采用梯度掩码技术,仅在该神经元子空间内约束参数更新。 Result: 显著提升模型安全性,同时保持多语言与多模态泛化能力;发现安全神经元在不同语言和模态间存在中等重叠,支持零样本跨语言/跨模态安全迁移。 Conclusion: 安全能力部分由少量共享神经元承载,其跨语言/跨模态的可迁移性为基于神经元级干预的通用安全增强提供了新路径。 Abstract: In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates strictly within this subspace via gradient masking with affecting fewer than 0.03% of parameters. This strategy substantially improves safety while preserving multilingual and multimodal generalization. Further analysis reveals a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities, and offering a new direction for neuron-level, transfer-based safety enhancement.[94] HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
Xinyu Zhang,Zurong Mai,Qingmei Li,Zjin Liao,Yibin Wen,Yuhang Chen,Xiaoya Fan,Chan Tsz Ho,Bi Tianyuan,Haoyuan Liang,Ruifeng Su,Zihao Qian,Juepeng Zheng,Jianxi Huang,Yutong Lu,Haohuan Fu
Main category: cs.CV
TL;DR: 本文提出了首个专门用于评估多模态大语言模型(MLLMs)在高光谱图像(HSI)理解能力的基准HM-Bench,并设计双模态评估框架(PCA图像+文本报告),揭示现有MLLMs在光谱-空间联合推理任务中存在显著困难,且视觉输入优于文本输入。
Details
Motivation: 多模态大语言模型(MLLMs)在自然图像理解上进展显著,但在高光谱图像(HSI)这一遥感关键模态上的感知与推理能力尚属空白;HSI的高维性与复杂光谱-空间特性使其难以被基于RGB训练的模型直接处理。 Method: 构建首个HSI专用多模态基准HM-Bench(含19,337 QA对、13类任务);提出双模态评估框架:将原始高光谱立方体转换为PCA合成图像和结构化文本报告,以系统比较不同表征方式对模型性能的影响。 Result: 在18个主流MLLMs上的大规模评测表明:模型普遍难以完成复杂光谱-空间推理任务;视觉输入(PCA图像)整体显著优于文本输入,验证了光谱-空间证据具身化对HSI理解的关键作用。 Conclusion: HSI理解是MLLMs亟待拓展的重要方向;HM-Bench为该领域提供了标准化评测平台;双模态表征与视觉优先策略为后续HSI多模态建模提供了有效范式。 Abstract: While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral image (HSI) remains underexplored, which is a vital modality in remote sensing. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data.To address this gap, we introduce Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of different representation for model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. Dataset and appendix can be accessed at https://github.com/HuoRiLi-Yu/HM-Bench.[95] Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS)
Mohsen Yaghoubi Suraki
Main category: cs.CV
TL;DR: 本文提出了一种名为ADRUwAMS的新型深度学习模型,结合自适应双残差网络、注意力门机制和多尺度空间注意力机制,用于脑胶质瘤MRI图像的精确分割,在BraTS 2020数据集上取得了优异的Dice分数。
Details
Motivation: 胶质瘤早期检测对治疗至关重要,但因其位置、大小等异质性导致自动精准分割困难,亟需更鲁棒的深度学习方法。 Method: 提出ADRUwAMS模型,融合自适应双残差U-Net、注意力门(整合门控与输入信号计算注意力系数)和多尺度空间注意力(生成并融合多尺度注意力图),在BraTS 2019/2020数据集上训练200轮,使用ReLU激活函数。 Result: 在BraTS 2020数据集上,全肿瘤、肿瘤核心和增强肿瘤的Dice分数分别达0.9229、0.8432和0.8004。 Conclusion: ADRUwAMS通过协同建模多尺度特征与注意力机制,显著提升了胶质瘤亚区分割精度,验证了其在临床辅助诊断中的潜力。 Abstract: Glioma is a harmful brain tumor that requires early detection to ensure better health results. Early detection of this tumor is key for effective treatment and requires an automated segmentation process. However, it is a challenging task to find tumors due to tumor characteristics like location and size. A reliable method to accurately separate tumor zones from healthy tissues is deep learning models, which have shown promising results over the last few years. In this research, an Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS) is introduced. This model is an innovative combination of adaptive dual residual networks, attention mechanisms, and multiscale spatial attention. The dual adaptive residual network architecture captures high-level semantic and intricate low-level details from brain images, ensuring precise segmentation of different tumor parts, types, and hard regions. The attention gates use gating and input signals to compute attention coefficients for the input features, and multiscale spatial attention generates scaled attention maps and combines these features to hold the most significant information about the brain tumor. We trained the model for 200 epochs using the ReLU activation function on BraTS 2020 and BraTS 2019 datasets. These improvements resulted in high accuracy for tumor detection and segmentation on BraTS 2020, achieving dice scores of 0.9229 for the whole tumor, 0.8432 for the tumor core, and 0.8004 for the enhancing tumor.[96] GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
Aoran Xiao,Shihao Cheng,Yonghao Xu,Yexian Ren,Hongruixuan Chen,Naoto Yokoya
Main category: cs.CV
TL;DR: 本文提出了GeoMMBench多模态遥感问答基准和GeoMMAgent多智能体框架,以解决地理科学与遥感领域中大模型在领域知识、感知对齐和推理能力上的不足。
Details
Motivation: 地理科学与遥感领域面临学科知识广、传感器模态异构、任务碎片化等挑战,现有MLLMs难以满足专业级地学解译需求。 Method: 构建覆盖多学科、多传感器、多任务的GeoMMBench基准;基于该基准评估36个大模型;提出融合检索、感知与推理的多智能体框架GeoMMAgent,集成领域专用RS模型与工具。 Result: 发现现有模型在领域知识、感知接地和推理方面存在系统性缺陷;GeoMMAgent显著优于单一大语言模型。 Conclusion: 工具增强的多智能体架构是应对复杂地学与遥感挑战的关键路径,GeoMMBench为该领域提供了更全面、严格的评测标准。 Abstract: Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models, uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning--capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.[97] Fast Model-guided Instance-wise Adaptation Framework for Real-world Pansharpening with Fidelity Constraints
Zhiqi Yang,Jin-Liang Xiao,Shan Yin,Liang-Jian Deng,Gemine Vivone
Main category: cs.CV
TL;DR: 本文提出FMG-Pan框架,一种快速、可泛化的模型引导式单样本自适应方法,用于遥感图像的全色锐化,在保持光谱与空间保真度的同时,实现跨传感器泛化和秒级训练推理。
Details
Motivation: 现有深度学习方法依赖大量标注数据、训练成本高、泛化性差;零样本方法虽泛化强,但融合质量低、计算开销大、收敛慢。亟需兼顾高质量融合、强泛化能力与高效推理的新方法。 Method: 提出FMG-Pan:基于预训练模型引导轻量自适应网络,通过联合优化光谱保真度与新设计的物理保真度约束,实现单对图像的实例级自适应。 Result: 在WorldView-3等真实数据集上达到SOTA性能;512×512×8图像在RTX 3090上训练+推理仅需3秒,显著快于现有零样本方法。 Conclusion: FMG-Pan在融合质量、跨传感器泛化性与运行效率三方面取得良好平衡,具备实际部署价值。 Abstract: Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) images while preserving both spectral and spatial information. Although deep learning (DL)-based pansharpening methods achieve impressive performance, they require high training cost and large datasets, and often degrade when the test distribution differs from training, limiting generalization. Recent zero-shot methods, trained on a single PAN/LRMS pair, offer strong generalization but suffer from limited fusion quality, high computational overhead, and slow convergence. To address these issues, we propose FMG-Pan, a fast and generalizable model-guided instance-wise adaptation framework for real-world pansharpening, achieving both cross-sensor generality and rapid training-inference. The framework leverages a pretrained model to guide a lightweight adaptive network through joint optimization with spectral and physical fidelity constraints. We further design a novel physical fidelity term to enhance spatial detail preservation. Extensive experiments on real-world datasets under both intra- and cross-sensor settings demonstrate state-of-the-art performance. On the WorldView-3 dataset, FMG-Pan completes training and inference for a 512x512x8 image within 3 seconds on an RTX 3090 GPU, significantly faster than existing zero-shot methods, making it suitable for practical deployment.[98] Large-Scale Universal Defect Generation: Foundation Models and Datasets
Yuanting Fan,Jun Liu,Bin-Bin Gao,Xiaochen Chen,Yuhuan Lin,Zhewei Dai,Jiawei Zhan,Chengjie Wang
Main category: cs.CV
TL;DR: 本文提出UDG数据集和UniDG模型,解决现有缺陷生成方法因少样本学习导致的过拟合、泛化差、真实性低等问题,支持无需类别微调的参考式与文本指令式缺陷生成,在多个基准上取得SOTA性能。
Details
Motivation: 现有缺陷生成方法依赖少样本学习,缺乏大规模配对缺陷编辑数据,且受缺陷尺度与形态差异影响,导致泛化能力差、真实性低、类别一致性弱。 Method: 构建包含30万组正常-异常-掩码-描述四元组的大规模UDG数据集;提出通用缺陷生成基础模型UniDG,采用自适应缺陷裁剪与结构化双联图输入实现缺陷-上下文编辑,通过MM-DiT多模态注意力融合参考与目标条件,并采用多样性监督微调(Diversity-SFT)与一致性强化微调(Consistency-RFT)两阶段训练策略。 Result: 在MVTec-AD和VisA数据集上,UniDG在合成质量及下游单/多类异常检测与定位任务中均超越现有少样本异常生成与图像插入/编辑方法。 Conclusion: UniDG通过大规模数据支撑与统一多模态架构设计,实现了无需类别微调的高保真、高多样性、强一致性的通用缺陷生成,为工业缺陷合成与异常检测提供了新范式。 Abstract: Existing defect/anomaly generation methods often rely on few-shot learning, which overfits to specific defect categories due to the lack of large-scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and category consistency. We address these challenges by introducing UDG, a large-scale dataset of 300K normal-abnormal-mask-caption quadruplets spanning diverse domains, and by presenting UniDG, a universal defect generation foundation model that supports both reference-based defect generation and text instruction-based defect editing without per-category fine-tuning. UniDG performs Defect-Context Editing via adaptive defect cropping and structured diptych input format, and fuses reference and target conditions through MM-DiT multimodal attention. A two-stage training strategy, Diversity-SFT followed by Consistency-RFT, further improves diversity while enhancing realism and reference consistency. Extensive experiments on MVTec-AD and VisA show that UniDG outperforms prior few-shot anomaly generation and image insertion/editing baselines in synthesis quality and downstream single- and multi-class anomaly detection/localization. Code will be available at https://github.com/RetoFan233/UniDG.[99] MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
Yibo Zhao,Yigong Zhang,Jin Xie
Main category: cs.CV
TL;DR: 本文提出MV3DIS框架,通过引入3D先验知识(如3D引导的掩码匹配与深度一致性加权),实现零样本3D实例分割,显著提升跨视角一致性与分割精度。
Details
Motivation: 现有零样本3D实例分割方法仅依赖单帧2D掩码和SAM置信度,忽视多视角关联与3D先验,导致掩码不一致和3D分割碎片化。 Method: 提出粗到细的MV3DIS框架:1)以粗粒度3D段为参考,进行3D引导的跨视角2D掩码匹配;2)利用3D覆盖分布增强多视角掩码一致性;3)基于深度一致性加权抑制遮挡引起的歧义,提升3D-2D对应鲁棒性;4)用一致的2D掩码精细化粗3D段为精确3D实例。 Result: 在ScanNetV2、ScanNet200、ScanNet++、Replica和Matterport3D数据集上,MV3DIS显著优于先前方法,验证了其有效性。 Conclusion: 显式建模3D先验与多视角一致性是提升零样本3D实例分割性能的关键,MV3DIS为此提供了有效且可扩展的解决方案。 Abstract: Conventional 3D instance segmentation methods rely on labor-intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi-view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero-shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi-view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse-to-fine framework for zero-shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D-guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi-view mask consistency via 3D coverage distributions. Guided by these view-consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter-object occlusions, thereby improving the robustness of 3D-to-2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods[100] TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction
Ao Li,Yonggen Ling,Yiyang Lin,Yuji Wang,Yong Deng,Yansong Tang
Main category: cs.CV
TL;DR: 本文提出TAIHRI,首个面向近距离人机交互(HRI)感知的视觉-语言模型,通过将3D关键点量化到有限交互空间,并利用2D关键点推理与下一词预测实现任务相关身体部位的精确3D定位,显著提升egocentric场景下关键部位估计精度。
Details
Motivation: 传统3D人体关键点估计侧重于以根关节为基准的整体重建质量,而实际HRI中机器人更需在自身视角坐标系下对任务相关身体部位进行精确、度量尺度的3D空间定位。 Method: 提出TAIHRI模型:基于视觉-语言架构,将3D关键点量化至有限交互空间,通过2D关键点推理结合next token预测实现3D坐标回归;支持自然语言控制与全局人体网格恢复等下游任务。 Result: 在egocentric交互基准上,TAIHRI在任务关键身体部位的3D估计精度上优于现有方法。 Conclusion: TAIHRI为具身化人机交互感知开辟了新研究方向,是首个专为近距离HRI设计的VLM。 Abstract: Accurate 3D human keypoints localization is a critical technology enabling robots to achieve natural and safe physical interaction with users. Conventional 3D human keypoints estimation methods primarily focus on the whole-body reconstruction quality relative to the root joint. However, in practical human-robot interaction (HRI) scenarios, robots are more concerned with the precise metric-scale spatial localization of task-relevant body parts under the egocentric camera 3D coordinate. We propose TAIHRI, the first Vision-Language Model (VLM) tailored for close-range HRI perception, capable of understanding users' motion commands and directing the robot's attention to the most task-relevant keypoints. By quantizing 3D keypoints into a finite interaction space, TAIHRI precisely localize the 3D spatial coordinates of critical body parts by 2D keypoint reasoning via next token prediction, and seamlessly adapt to downstream tasks such as natural language control or global space human mesh recovery. Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts. We believe TAIHRI opens new research avenues in the field of embodied human-robot interaction. Code is available at: https://github.com/Tencent/TAIHRI.[101] Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios
Yu Shi,Yu Liu,Zhong-Cheng Wu,Juan Cheng,Huafeng Li,Xun Chen
Main category: cs.CV
TL;DR: 本文提出了一种面向任意退化场景的高效退化感知扩散框架,用于图像融合任务,通过隐式去噪和联合观测模型校正机制,在复杂退化下实现高精度融合。
Details
Motivation: 现有端到端神经网络方法可解释性差,而传统扩散模型难以直接适用于缺乏天然融合数据、需融合多源互补信息的图像融合任务。 Method: 提出退化感知扩散框架:采用隐式去噪(直接回归融合图像而非预测噪声),并设计联合观测模型校正机制,在采样过程中同时施加退化与融合约束。 Result: 在多种融合任务和退化配置下的实验表明,该方法在复杂退化场景下性能优于现有方法。 Conclusion: 所提方法兼顾了生成先验能力、可解释性与实际适用性,为退化多样化的图像融合任务提供了新思路。 Abstract: Complex degradations like noise, blur, and low resolution are typical challenges in real world image fusion tasks, limiting the performance and practicality of existing methods. End to end neural network based approaches are generally simple to design and highly efficient in inference, but their black-box nature leads to limited interpretability. Diffusion based methods alleviate this to some extent by providing powerful generative priors and a more structured inference process. However, they are trained to learn a single domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation aware diffusion framework for image fusion under arbitrary degradation scenarios. Specifically, instead of explicitly predicting noise as in conventional diffusion models, our method performs implicit denoising by directly regressing the fused image, enabling flexible adaptation to diverse fusion tasks under complex degradations with limited steps. Moreover, we design a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy. Experiments on diverse fusion tasks and degradation configurations demonstrate the superiority of the proposed method under complex degradation scenarios.[102] Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion
Zengyi Yang,Yu Liu,Juan Cheng,Zhiqin Zhu,Yafei Zhang,Huafeng Li
Main category: cs.CV
TL;DR: 本文提出了一种闭环动态网络(CLDyN),通过需求驱动的语义补偿模块(RSC)和奖励-惩罚策略,实现红外-可见光图像融合网络对多种下游任务的自适应定制,无需重新训练。
Details
Motivation: 现有红外-可见光图像融合方法难以同时适配多个下游任务,缺乏任务定制化能力。 Method: 提出闭环动态网络(CLDyN),包含需求驱动的语义补偿(RSC)模块、基向量库(BVB)和架构自适应语义注入(A2SI)块,并引入奖励-惩罚策略优化语义补偿。 Result: 在M3FD、FMB和VT5000数据集上验证了CLDyN在保持高融合质量的同时具备强多任务适应性。 Conclusion: CLDyN实现了无需重训练的任务定制化图像融合,为多任务导向的融合网络设计提供了新范式。 Abstract: Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability. The code is available at https://github.com/YR0211/CLDyN.[103] M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model
Yihang Liu,Ying Wen,Jiaxiong Yang,Longzhen Yang,Lianghua He,Heng Tao Shen
Main category: cs.CV
TL;DR: 本文提出M-IDoL,一种通过信息分解实现多模态表征学习的医学基础模型,提升模态特异性与多样性,在21个下游临床任务中表现优异。
Details
Motivation: 现有医学基础模型存在多模态表征混叠导致模态特异性与多样性下降的问题。 Method: 提出M-IDoL模型,通过两个自监督目标进行信息分解:i)最大化模态间熵,将表征分散到可分离的混合专家(MoE)子空间以增强模态特异性;ii)最小化模态内不确定性,在每个MoE子空间内进行细粒度语义判别以提升模态内表征多样性。 Result: 在115万张医学图像上预训练后,M-IDoL在5种影像模态(X光、眼底、OCT、皮肤镜、病理)的21个下游临床任务中超越20个基线模型,并展现出更清晰的跨模态特征聚类分离和更精细的模态内特征判别能力。 Conclusion: M-IDoL通过信息分解有效缓解多模态表征混叠问题,显著提升医学基础模型的模态特异性与多样性,从而增强其在多样化临床任务中的泛化能力。 Abstract: Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blend multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised \underline{\textit{M}}FM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximize inter-modality entropy by dispersing multimodal representation into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (e.g., X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature cluster across modalities and finer-grained feature discrimination within each modality.[104] MASS: Mesh-inellipse Aligned Deformable Surfel Splatting for Hand Reconstruction and Rendering from Egocentric Monocular Video
Haoyu Zhu,Yi Zhang,Lei Yao,Lap-pui Chau,Yi Wang
Main category: cs.CV
TL;DR: 本文提出了一种名为MASS的新方法,利用可变形的2D高斯Surfel表示,结合网格对齐的Steiner内椭圆和分形稠密化技术,从单目自拍视频中高效、高保真地重建3D手部模型,并在多个数据集上超越了现有方法。
Details
Motivation: 从单目自拍视频中重建高保真3D手部模型面临几何分辨率低、手-物交互建模难、复杂物体遮挡及计算开销大等挑战。 Method: 提出Mesh-inellipse Aligned deformable Surfel Splatting(MASS),包括:1)基于网格对齐Steiner内椭圆与分形稠密化的mesh-to-surfel转换;2)高斯Surfel形变机制,通过残差更新与不透明度掩码优化几何与纹理;3)两阶段训练策略与新型绑定损失提升优化鲁棒性。 Result: 在ARCTIC、Hand Appearance和Interhand2.6M数据集上,MASS在重建精度与渲染质量上均优于当前最优方法。 Conclusion: MASS实现了高效、高保真的单目自拍手部三维重建,兼顾实时性与表现力,为手部建模与交互应用提供了新范式。 Abstract: Reconstructing high-fidelity 3D hands from egocentric monocular videos remains a challenge due to the limitations in capturing high-resolution geometry, hand-object interactions, and complex objects on hands. Additionally, existing methods often incur high computational costs, making them impractical for real-time applications. In this work, we propose Mesh-inellipse Aligned deformable Surfel Splatting (MASS) to address these challenges by leveraging a deformable 2D Gaussian Surfel representation. We introduce the mesh-aligned Steiner Inellipse and fractal densification for mesh-to-surfel conversion that initiates high-resolution 2D Gaussian surfels from coarse parametric hand meshes, providing surface representation with photorealistic rendering potential. Second, we propose Gaussian Surfel Deformation, which enables efficient modeling of hand deformations and personalized features by predicting residual updates to surfel attributes and introducing an opacity mask to refine geometry and texture without adaptive density control. In addition, we propose a two-stage training strategy and a novel binding loss to improve the optimization robustness and reconstruction quality. Extensive experiments on the ARCTIC dataset, the Hand Appearance dataset, and the Interhand2.6M dataset demonstrate that our model achieves superior reconstruction performance compared to state-of-the-art methods.[105] TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches
Langzhe Gu,Hung-Jui Huang,Mohamad Qadri,Michael Kaess,Wenzhen Yuan
Main category: cs.CV
TL;DR: 本文提出TouchAnything框架,利用预训练的2D视觉扩散模型作为语义与几何先验,结合稀疏触觉测量实现开放世界下的高精度3D物体几何重建。
Details
Motivation: 视觉在遮挡或复杂光照下不可靠,而纯触觉重建因数据稀疏而本质欠约束,需引入强几何先验。 Method: 将预训练视觉扩散模型的知识迁移到触觉域,以稀疏触点约束和粗粒度类别描述为输入,构建兼顾触觉一致性与扩散先验的优化问题进行重建。 Result: 仅需少量触碰即可实现高精度重建,在多个指标上超越现有基线,并支持对未见过物体实例的开放世界3D重建。 Conclusion: 视觉扩散模型蕴含的几何知识可有效迁移至触觉重建任务,为多模态感知驱动的机器人操作提供了新范式。 Abstract: Accurate object geometry estimation is essential for many downstream tasks, including robotic manipulation and physical interaction. Although vision is the dominant modality for shape perception, it becomes unreliable under occlusions or challenging lighting conditions. In such scenarios, tactile sensing provides direct geometric information through physical contact. However, reconstructing global 3D geometry from sparse local touches alone is fundamentally underconstrained. We present TouchAnything, a framework that leverages a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior for 3D reconstruction from sparse tactile measurements. Unlike prior work that trains category-specific reconstruction networks or learns diffusion models directly from tactile data, we transfer the geometric knowledge encoded in pretrained visual diffusion models to the tactile domain. Given sparse contact constraints and a coarse class-level description of the object, we formulate reconstruction as an optimization problem that enforces tactile consistency while guiding solutions toward shapes consistent with the diffusion prior. Our method reconstructs accurate geometries from only a few touches, outperforms existing baselines, and enables open-world 3D reconstruction of previously unseen object instances. Our project page is https://grange007.github.io/touchanything .[106] Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift
Harshith Kethavath,Weiming Hu
Main category: cs.CV
TL;DR: 本文研究了在遥感图像(如卫星云图)上适配视觉-语言模型(如CLIPSeg)的效果,发现提示工程(prompting)在云分割任务中全面劣于零样本基线;而仅需极少量标注数据(0.1%≈8张图)的监督微调即可显著超越零样本性能,表明标注数据比提示工程更有效、更值得投入。
Details
Motivation: 遥感图像在视觉和语言分布上均严重偏离自然图像预训练语料,但当前主流仍依赖提示工程来引导冻结模型,该假设缺乏在强域偏移场景(如云分割)下的实证检验。 Method: 在CloudSEN12+云分割基准上,系统评估60种提示变体(含标签、领域术语、外观描述、上下文线索),并与零样本基线及不同规模监督微调(包括全参数微调与低秩适配LoRA)对比,分析mIoU性能变化及‘监督下降’现象。 Result: 所有提示变体均低于零样本基线(0.255 mIoU),最差仅0.07 mIoU;0.1%标注数据微调即超越零样本;5–10%数据恢复约85%最优mIoU;全微调持续优于LoRA(+0.03–0.09 mIoU);0.5–1%数据时对光谱模糊类出现短暂性能下降(supervision dip)。 Conclusion: 在遥感等专业图像领域,标注数据不是提示工程的昂贵替代方案,而是更高效、更可靠的适配路径;提示工程无法弥补视觉表示与遥感光谱特征间的根本鸿沟。 Abstract: Adapting vision-language models to remote sensing imagery presents a fundamental challenge: both the visual and linguistic distributions of satellite data lie far outside natural image pretraining corpora. Despite this, prompting remains the dominant deployment paradigm, driven by the assumption that domain-specific language can guide frozen model representations toward specialized tasks. We test this assumption directly on a domain where the mismatch is prominent: cloud segmentation for satellite imagery. Using CLIPSeg on the CloudSEN12+ cloud segmentation benchmark, we evaluate 60 prompt variants spanning simple labels, domain terminology, appearance descriptors, and contextual cues, finding that every variant underperforms the zero-shot baseline (0.255 mIoU), with engineered prompts scoring as low as 0.07 mIoU. No amount of linguistic refinement bridges the gap between CLIP's natural image representations and satellite spectral imagery. In contrast, supervised fine-tuning with just 0.1% labeled data (~8 images) surpasses zero-shot performance overall, and 5-10% data recovers ~85% of maximum achievable mIoU. Full fine-tuning consistently outperforms low-rank adaptation by 0.03-0.09 mIoU, with the largest gaps for spectrally ambiguous classes, and at 0.5 to 1% labeled data, fine-tuning temporarily degrades performance on these classes before recovering, a supervision dip that aggregate mIoU can mask. For practitioners adapting vision-language models to specialized imagery, our results deliver a clear message: labeled data is not the expensive alternative to prompting; it is the worthwhile path.[107] Dynamic Class-Aware Active Learning for Unbiased Satellite Image Segmentation
Gadi Hemanth Kumar,Athira Nambiar,Pankaj Bodani
Main category: cs.CV
TL;DR: 本文提出了一种动态类别感知不确定性主动学习方法(DCAU-AL),用于卫星影像语义分割,通过实时跟踪各类别性能差距并动态调整采样权重,缓解类别不平衡问题,显著提升小样本标注下的分割精度与效率。
Details
Motivation: 标准主动学习策略依赖全局不确定性或多样性度量,难以适应训练过程中表现不佳或稀有类别的变化,导致模型偏差;而卫星影像标注成本高、类别严重不均衡,亟需更自适应的采样机制。 Method: 提出动态类别感知不确定性主动学习(DCAU-AL)方法:在主动学习循环中持续评估每类分割性能(如IoU),据此动态加权候选样本的不确定性得分,优先选择对当前弱势类别最具信息量的样本进行人工标注。 Result: 在OpenEarth土地覆盖数据集上的实验表明,DCAU-AL在严重类别不平衡下显著优于现有AL方法,各类别IoU更高,且以更少标注样本达到同等或更优整体性能。 Conclusion: DCAU-AL通过引入类别感知的动态采样机制,有效缓解了主动学习在遥感语义分割中因类别不平衡导致的性能偏差,提升了标注效率与模型泛化能力,为资源受限的大规模遥感应用提供了实用解决方案。 Abstract: Semantic segmentation of satellite imagery plays a vital role in land cover mapping and environmental monitoring. However, annotating large-scale, high-resolution satellite datasets is costly and time consuming, especially when covering vast geographic regions. Instead of randomly labeling data or exhaustively annotating entire datasets, Active Learning (AL) offers an efficient alternative by intelligently selecting the most informative samples for annotation with the help of Human-in-the-loop (HITL), thereby reducing labeling costs while maintaining high model performance. AL is particularly beneficial for large-scale or resource-constrained satellite applications, as it enables high segmentation accuracy with significantly fewer labeled samples. Despite these advantages, standard AL strategies typically rely on global uncertainty or diversity measures and lack the adaptability to target underperforming or rare classes as training progresses, leading to bias in the system. To overcome these limitations, we propose a novel adaptive acquisition function, Dynamic Class-Aware Uncertainty based Active learning (DCAU-AL) that prioritizes sample selection based on real-time class-wise performance gaps, thereby overcoming class-imbalance issue. The proposed DCAU-AL mechanism continuously tracks the performance of the segmentation per class and dynamically adjusts the sampling weights to focus on poorly performing or underrepresented classes throughout the active learning process. Extensive experiments on the OpenEarth land cover dataset show that DCAU-AL significantly outperforms existing AL methods, especially under severe class imbalance, delivering superior per-class IoU and improved annotation efficiency.[108] How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms
Shengji Jin,Yuanhao Zou,Victor Zhu,Zhengping Ji,Chen Chen
Main category: cs.CV
TL;DR: 本文通过控制实验对比了三种视频时序定位(VTG)输出范式(文本数字生成、时序标记生成、连续时序解码)在相同轻量VLMs、数据集和微调协议下的精度与系统效率,发现连续分布范式在精度-效率权衡上最优。
Details
Motivation: 现有VTG方法因耦合不同骨干网络、数据集和训练协议,难以分离输出设计的影响;同时边缘部署对输出形式与系统效率的权衡提出新需求。 Method: 在统一紧凑VLM(SmolVLM2/FastVLM/Molmo2)、相同数据集(Charades-STA/QVHighlights/YouCook2)和LoRA微调协议下,对比三种VTG输出范式,评估定位精度、推理延迟、训练吞吐量和参数开销。 Result: 输出范式选择显著影响定位精度与计算成本;连续时序解码范式在Pareto前沿上始终取得最优精度-效率权衡,兼具鲁棒定位性能与最低延迟开销。 Conclusion: 连续分布输出范式为构建高效、可部署的VTG系统提供了实证指导,输出设计应被视作独立于模型规模的关键优化维度。 Abstract: While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. This makes it challenging to isolate the specific impact of the output design. Additionally, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency requires systematic investigation. In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate these paradigms across identical compact VLMs (SmolVLM2, FastVLM, and Molmo2) using consistent datasets and LoRA fine-tuning protocols. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization accuracy and system efficiency, including inference latency, training throughput, and parameter overhead. Our results demonstrate that the choice of output formulation significantly affects both grounding accuracy and computational cost, independent of model scale. Specifically, the continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead. These findings provide objective empirical guidelines for designing efficient, deployment-ready VTG systems.[109] ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning
Shifeng Liu,Zhengye Zhang,Sirui Zhao,Xinglong Mao,Zhehan Kan,Zhixiang Wei,Shiwei Wu,Chaoyou Fu,Tong Xu,Enhong Chen
Main category: cs.CV
TL;DR: 本文提出ActFER,一种基于代理的主动式面部表情识别框架,通过动态调用工具进行面部检测、对齐与局部聚焦,并结合AU感知的多级奖励强化学习算法UC-GRPO,实现更精准的表情与动作单元识别。
Details
Motivation: 现有基于多模态大语言模型的面部表情识别方法仍为被动范式,依赖固定视觉输入且缺乏主动感知能力,难以支持深层情感推理。 Method: 提出ActFER框架,包含主动视觉证据获取(人脸检测/对齐/局部缩放)与多模态推理(基于动作单元的视觉思维链);并设计UC-GRPO强化学习算法,融合AU标注的多级可验证奖励、查询条件对比效用估计和情感感知EMA校准。 Result: ActFER在多个基准上显著超越被动MLLM基线,尤其大幅提升动作单元(AU)预测准确率。 Conclusion: 主动式代理框架与效用校准强化学习可有效提升面部表情识别的细粒度理解与推理能力,为FER迈向类人主动感知提供新范式。 Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have created new opportunities for facial expression recognition (FER), moving it beyond pure label prediction toward reasoning-based affect understanding. However, existing MLLM-based FER methods still follow a passive paradigm: they rely on externally prepared facial inputs and perform single-pass reasoning over fixed visual evidence, without the capability for active facial perception. To address this limitation, we propose ActFER, an agentic framework that reformulates FER as active visual evidence acquisition followed by multimodal reasoning. Specifically, ActFER dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and reasons over facial Action Units (AUs) and emotions through a visual Chain-of-Thought. To realize such behavior, we further develop Utility-Calibrated GRPO (UC-GRPO), a reinforcement learning algorithm tailored to agentic FER. UC-GRPO uses AU-grounded multi-level verifiable rewards to densify supervision, query-conditional contrastive utility estimation to enable sample-aware dynamic credit assignment for local inspection, and emotion-aware EMA calibration to reduce noisy utility estimates while capturing emotion-wise inspection tendencies. This algorithm enables ActFER to learn both when local inspection is beneficial and how to reason over the acquired evidence. Comprehensive experiments show that ActFER trained with UC-GRPO consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy.[110] PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
Zhiyu Zhou,Peilin Liu,Ruoxuan Zhang,Luyang Zhang,Cheng Zhang,Hongxia Xie,Wen-Huang Cheng
Main category: cs.CV
TL;DR: 本文提出了PinpointQA,首个用于室内视频中小物体空间理解的基准数据集,包含四个渐进式挑战任务,旨在评估和提升多模态大语言模型在小物体定位与空间描述方面的能力。
Details
Motivation: 现有基准未能直接评估模型在视频中精确定位目标小物体并表达其位置的能力,而这种能力对物体搜索和辅助应用至关重要。 Method: 基于ScanNet++和ScanNet200构建PinpointQA数据集,包含1024个场景和10094个问答对,分为四个任务:目标存在验证(TPV)、最近参考识别(NRI)、细粒度空间描述(FSD)和结构化空间预测(SSP),问答对通过自动生与人工质量控制结合生成。 Result: 实验表明主流多模态大语言模型在该基准上存在系统性能力差距,尤其在SSP任务上表现最差;监督微调显著提升模型性能,尤其在较难任务上。 Conclusion: PinpointQA既可作为诊断模型空间理解能力的基准,也可作为有效训练数据,推动室内视频中小物体空间理解的研究与应用。 Abstract: Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.[111] Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Zile Wang,Zexiang Liu,Jaixing Li,Kaichen Huang,Baixin Xu,Fei Kang,Mengyin An,Peiyu Wang,Biao Jiang,Yichen Wei,Yidan Xietian,Jiangbo Pei,Liang Hu,Boyi Jiang,Hua Xue,Zidong Wang,Haofeng Sun,Wei Li,Wanli Ouyang,Xianglong He,Yang Liu,Yangguang Li,Yahui Zhou
Main category: cs.CV
TL;DR: Matrix-Game 3.0 是一种内存增强型交互式世界模型,支持720p分辨率下实时(最高40 FPS)、长时序(分钟级)视频生成,通过数据引擎升级、长时序一致性训练框架和高效推理策略实现性能突破。
Details
Motivation: 现有扩散模型在交互式视频生成中难以兼顾长时序记忆一致性与高分辨率实时生成,限制了实际落地应用。 Method: 1)构建工业级无限数据引擎,融合Unreal合成数据、AAA游戏自动采集与真实视频增强,生成大规模Video-Pose-Action-Prompt四元组数据;2)提出长时序一致性训练框架,包括残差建模、生成帧自反馈训练、相机感知的记忆检索与注入;3)设计基于分布匹配蒸馏(DMD)的多段自回归蒸馏策略,并结合模型量化与VAE解码器剪枝以提升推理效率。 Result: 在720p分辨率下实现最高40 FPS实时生成,5B模型可稳定维持分钟级记忆一致性;扩展至2×14B模型后,生成质量、动态表现与泛化能力进一步提升。 Conclusion: Matrix-Game 3.0 为工业级可部署的世界模型提供了切实可行的技术路径。 Abstract: With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.[112] StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding
Junxi Wang,Te Sun,Jiayi Zhu,Junxian Li,Haowen Xu,Zichen Wen,Xuming Hu,Zhiyu Li,Linfeng Zhang
Main category: cs.CV
TL;DR: 本文提出StreamMeCo框架,通过边无关minmax采样和边感知权重剪枝压缩视频理解中的视觉智能体记忆图,在70%压缩率下实现1.87倍检索加速和1.0%平均精度提升。
Details
Motivation: 视频流理解中智能体记忆存储带来高昂的存储与计算开销,亟需高效压缩方法。 Method: 提出StreamMeCo:对孤立节点采用边无关minmax采样,对连通节点采用边感知权重剪枝;引入时间衰减记忆检索机制缓解压缩导致的性能下降。 Result: 在M3-Bench-robot、M3-Bench-web和Video-MME-Long三个基准上,70%内存图压缩下内存检索速度提升1.87倍,平均准确率提升1.0%。 Conclusion: StreamMeCo有效平衡了视觉智能体记忆的压缩率、检索效率与理解精度,为长视频流理解提供了实用可行的轻量化方案。 Abstract: Vision agent memory has shown remarkable effectiveness in streaming video understanding. However, storing such memory for videos incurs substantial memory overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introduces edge-free minmax sampling for the isolated nodes and an edge-aware weight pruning for connected nodes, evicting the redundant memory nodes while maintaining the accuracy. In addition, we introduce a time-decay memory retrieval mechanism to further eliminate the performance degradation caused by memory compression. Extensive experiments on three challenging benchmark datasets (M3-Bench-robot, M3-Bench-web and Video-MME-Long) demonstrate that under 70% memory graph compression, StreamMeCo achieves a 1.87* speedup in memory retrieval while delivering an average accuracy improvement of 1.0%. Our code is available at https://github.com/Celina-love-sweet/StreamMeCo.[113] Robust by Design: A Continuous Monitoring and Data Integration Framework for Medical AI
Mohammad Daouk,Jan Ulrich Becker,Neeraja Kambham,Anthony Chang,Chandra Mohan,Hien Van Nguyen
Main category: cs.CV
TL;DR: 本文提出了一种自主连续监控与数据集成框架,用于应对临床环境中因数据漂移导致的自适应医学AI模型性能下降问题,特别应用于狼疮性肾炎病理图像分类任务。
Details
Motivation: 适应性医学AI模型在动态临床环境中常因数据漂移而性能下降,亟需一种能持续监控并安全更新模型的方法。 Method: 三阶段方法:基于多度量(欧氏、余弦、马氏距离)特征分析和蒙特卡洛Dropout不确定性门控筛选新数据;仅对分布相似且预测熵低的图像进行增量重训练,并设置严格性能保障(任一指标下降不超过5%)。 Result: 在多中心ResNet18集成模型实验中,新增图像未引起AUC(~0.92)和准确率(~89%)显著下降,有效防止性能退化。 Conclusion: 该框架可有效应对数据偏移、避免灾难性遗忘,支持医学影像AI的可持续学习。 Abstract: Adaptive medical AI models often face performance drops in dynamic clinical environments due to data drift. We propose an autonomous continuous monitoring and data integration framework that maintains robust performance over time. Focusing on glomerular pathology image classification (proliferative vs. non-proliferative lupus nephritis), our three-stage method uses multi-metric feature analysis and Monte Carlo dropout-based uncertainty gating to decide when to retrain on new data. Only images statistically similar to the training distribution (via Euclidean, cosine, Mahalanobis metrics) and with low predictive entropy are integrated. The model is then incrementally retrained with these images under strict performance safeguards (no metric degradation >5%). In experiments with a ResNet18 ensemble on a multi-center dataset, the framework prevents performance degradation: new images were added without significant change in AUC (~0.92) or accuracy (~89%). This approach addresses data shift and avoids catastrophic forgetting, enabling sustained learning in medical imaging AI.[114] Domain-generalizable Face Anti-Spoofing with Patch-based Multi-tasking and Artifact Pattern Conversion
Seungjin Jung,Yonghyun Jeong,Minha Kim,Jimin Min,Youngjoon Yoo,Jongwon Choi
Main category: cs.CV
TL;DR: 本文提出PCGAN模型,通过解耦伪造伪影和人脸特征的潜在向量,提升人脸反欺骗(FAS)算法在未知域和新型攻击下的泛化能力,并结合基于块的多任务学习缓解过拟合与局部攻击问题。
Details
Motivation: 现有FAS算法受限于数据集多样性不足,难以应对未见过的视觉域和新型伪造手段。 Method: 提出Pattern Conversion GAN(PCGAN),实现伪造伪影与人脸特征的潜在空间解耦;引入基于图像块的判别器和多任务学习机制,以增强对局部攻击的鲁棒性并防止过拟合。 Result: 实验表明PCGAN在跨域泛化和局部攻击检测上显著优于现有方法,提升了人脸识别系统的安全性。 Conclusion: PCGAN通过可控生成多样化伪造样本,有效增强了FAS模型的域泛化能力和鲁棒性,为实际部署提供了更可靠的反欺骗方案。 Abstract: Face Anti-Spoofing (FAS) algorithms, designed to secure face recognition systems against spoofing, struggle with limited dataset diversity, impairing their ability to handle unseen visual domains and spoofing methods. We introduce the Pattern Conversion Generative Adversarial Network (PCGAN) to enhance domain generalization in FAS. PCGAN effectively disentangles latent vectors for spoof artifacts and facial features, allowing to generate images with diverse artifacts. We further incorporate patch-based and multi-task learning to tackle partial attacks and overfitting issues to facial features. Our extensive experiments validate PCGAN's effectiveness in domain generalization and detecting partial attacks, giving a substantial improvement in facial recognition security.[115] BlendFusion -- Scalable Synthetic Data Generation for Diffusion Model Training
Thejas Venkatesh,Suguna Varshini Velury
Main category: cs.CV
TL;DR: 本文提出BlendFusion框架,利用路径追踪从3D场景生成高质量图像-文本对,以缓解扩散模型生成数据导致的模型自噬障碍(MAD)问题,并构建了FineBLEND数据集。
Details
Motivation: 扩散模型生成的图像存在视觉不一致问题,直接用于训练易引发模型自噬障碍(MAD);需更可靠的合成数据生成方法。 Method: 提出BlendFusion框架:基于路径追踪的3D场景渲染,结合目标中心化相机放置策略、鲁棒过滤机制和自动图文标注。 Result: 构建了FineBLEND图像-文本数据集;实证表明其质量优于多个主流图文数据集;验证了目标中心化采样策略优于目标无关采样。 Conclusion: BlendFusion为可扩展、高配置性的开源合成数据生成框架,能有效提升合成数据质量,避免MAD,支持社区自定义3D数据集构建。 Abstract: With the rapid adoption of diffusion models, synthetic data generation has emerged as a promising approach for addressing the growing demand for large-scale image datasets. However, images generated purely by diffusion models often exhibit visual inconsistencies, and training models on such data can create an autophagous feedback loop that leads to model collapse, commonly referred to as Model Autophagy Disorder (MAD). To address these challenges, we propose BlendFusion, a scalable framework for synthetic data generation from 3D scenes using path tracing. Our pipeline incorporates an object-centric camera placement strategy, robust filtering mechanisms, and automatic captioning to produce high-quality image-caption pairs. Using this pipeline, we curate FineBLEND, an image-caption dataset constructed from a diverse set of 3D scenes. We empirically analyze the quality of FineBLEND and compare it to several widely used image-caption datasets. We also demonstrate the effectiveness of our object-centric camera placement strategy relative to object-agnostic sampling approaches. Our open-source framework is designed for high configurability, enabling the community to create their own datasets from 3D scenes.[116] CAD 100K: A Comprehensive Multi-Task Dataset for Car Related Visual Anomaly Detection
Jiahua Pang,Ying Li,Dongpu Cao,Jingcai Luo,Yanuo Zheng,Bao Yunfan,Yujie Lei,Rui Yuan,Yuxi Tian,Guojin Yuan,Hongchang Chen,Zhi Zheng,Yongchun Liu
Main category: cs.CV
TL;DR: 本文提出了首个面向汽车相关多任务视觉异常检测的大型基准数据集CAD,包含7个车辆领域和3个任务的100多张图像,并结合合成数据增强解决少样本问题;实验表明多任务学习能促进任务间知识迁移但也存在任务冲突。
Details
Motivation: 现有方法局限于单任务,缺乏统一的多任务评估基准,难以推动汽车制造质量检测中多任务异常检测的发展。 Method: 构建了CAD数据集,涵盖7个车辆领域和3个任务,引入合成数据增强以缓解少样本异常图像问题,并实现了一个多任务基线模型进行实证研究。 Result: 多任务学习能有效促进任务交互与知识迁移,但也暴露出不同任务间的优化冲突;CAD数据集为该领域提供了标准化评测平台。 Conclusion: CAD数据集填补了汽车相关多任务视觉异常检测缺乏统一基准的空白,推动了该方向的研究发展。 Abstract: Multi-task visual anomaly detection is critical for car-related manufacturing quality assessment. However, existing methods remain task-specific, hindered by the absence of a unified benchmark for multi-task evaluation. To fill in this gap, We present the CAD Dataset, a large-scale and comprehensive benchmark designed for car-related multi-task visual anomaly detection. The dataset contains over 100 images crossing 7 vehicle domains and 3 tasks, providing models a comprehensive view for car-related anomaly detection. It is the first car-related anomaly dataset specialized for multi-task learning(MTL), while combining synthesis data augmentation for few-shot anomaly images. We implement a multi-task baseline and conduct extensive empirical studies. Results show MTL promotes task interaction and knowledge transfer, while also exposing challenging conflicts between tasks. The CAD dataset serves as a standardized platform to drive future advances in car-related multi-task visual anomaly detection.[117] Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection
Zedian Shao,Hongbin Liu,Yuepeng Hu,Neil Zhenqiang Gong
Main category: cs.CV
TL;DR: 本文提出ImageProtector,一种用户端图像保护方法,通过嵌入微小扰动使多模态大语言模型(MLLMs)在分析受保护图像时自动拒绝响应,从而防止敏感信息泄露。实验验证其在多个MLLM和数据集上的有效性,并分析现有防御手段的局限性。
Details
Motivation: 开放权重的多模态大语言模型(MLLMs)可能被滥用于大规模提取个人图像中的敏感信息(如身份、位置等),亟需用户可控的前置隐私保护机制。 Method: 设计一种面向MLLM的视觉提示注入式扰动,嵌入到原始图像中形成受保护图像;该扰动几乎不可见,但能稳定触发MLLM输出拒绝响应(如‘我无法帮助完成该请求’)。 Result: ImageProtector在6个MLLM和4个数据集上均有效;评估的三种反制手段(高斯噪声、DiffPure、对抗训练)仅部分削弱其效果,却显著损害模型准确率或效率。 Conclusion: 基于扰动的图像隐私保护在开放权重MLLM场景下具有实用潜力,但也存在防御与性能难以兼顾的根本局限,需进一步探索更鲁棒的防护范式。 Abstract: Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as "I'm sorry, I can't help with that request." We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.[118] Skill-Conditioned Visual Geolocation for Vision-Language
Chenjie Yang,Yutian Jiang,Chenyu Wu
Main category: cs.CV
TL;DR: 本文提出GeoSkill框架,通过无需训练的技能图(Skill-Graph)实现视觉语言模型在图像地理定位中的结构化地理推理与自主演化,显著提升定位精度、推理可信度与泛化能力。
Details
Motivation: 现有视觉语言模型在图像地理定位中缺乏结构化地理推理能力和自主演化机制,依赖过时隐式参数记忆,易产生幻觉推理,且推理过程为单次执行,缺乏反馈优化闭环。 Method: 提出训练无关的GeoSkill框架:1)将人类专家轨迹提炼为原子级自然语言技能,构建初始Skill-Graph;2)推理阶段由推理模型基于当前图进行直接引导推理;3)自主演化机制利用更大模型在网页规模图像-坐标对上开展多轮推理,分析成功/失败轨迹,迭代合成与剪枝技能,扩展图并纠偏,全程无参数更新。 Result: 在GeoRC数据集上显著提升地理定位准确率与推理忠实性,并在多个外部数据集上展现优异泛化能力;自主演化催生可验证的新技能,增强系统对真实地理知识的整体认知。 Conclusion: GeoSkill通过技能图建模与自主演化机制,突破了传统VLM在地理定位中推理结构化与持续进化能力的瓶颈,为具身地理智能提供了新范式。 Abstract: Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system's cognition of real-world geographic knowledge beyond isolated case studies.[119] NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Multi-Exposure Image Fusion in Dynamic Scenes (Track 2)
Lishen Qu,Yao Liu,Jie Liang,Hui Zeng,Wen Dai,Guanyi Qin,Ya-nan Guan,Shihao Zhou,Jufeng Yang,Lei Zhang,Radu Timofte,Xiyuan Yuan,Wanjie Sun,Shihang Li,Bo Zhang,Bin Chen,Jiannan Lin,Yuxu Chen,Qinquan Gao,Tong Tong,Song Gao,Jiacong Tang,Tao Hu,Xiaowen Ma,Qingsen Yan,Sunhan Xu,Juan Wang,Xinyu Sun,Lei Qi,He Xu,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi
Main category: cs.CV
TL;DR: 本文介绍了NTIRE 2026 RAIM挑战赛,聚焦于动态场景下的多曝光图像融合问题,构建了包含训练与测试序列的基准数据集,并通过PSNR、SSIM、LPIPS等指标及主观质量、效率和可复现性综合评估方法性能。
Details
Motivation: 解决动态场景下多曝光图像融合所面临的运动模糊、光照变化和手持抖动等实际难题,提升HDR成像质量。 Method: 组织NTIRE 2026 RAIM挑战赛,构建含100组训练(7曝光)和100组测试(5曝光)序列的数据集,采用PSNR/SSIM/LPIPS量化评估结合感知质量、效率与可复现性进行综合评审。 Result: 共吸引114支队伍、987次提交;优胜方法显著提升了去伪影能力和细节恢复效果。 Conclusion: 该挑战推动了动态场景HDR融合技术的发展,提供了高质量公开数据集与代码资源,促进了领域研究与应用进步。 Abstract: This paper presents NTIRE 2026, the 3rd Restore Any Image Model (RAIM) challenge on multi-exposure image fusion in dynamic scenes. We introduce a benchmark that targets a practical yet difficult HDR imaging setting, where exposure bracketing must be fused under scene motion, illumination variation, and handheld camera jitter. The challenge data contains 100 training sequences with 7 exposure levels and 100 test sequences with 5 exposure levels, reflecting real-world scenarios that frequently cause misalignment and ghosting artefacts. We evaluate submissions with a leaderboard score derived from PSNR, SSIM, and LPIPS, while also considering perceptual quality, efficiency, and reproducibility during the final review. This track attracted 114 participating teams and received 987 submissions. The winning methods significantly improved the ability to remove artifacts from multi-exposure fusion and recover fine details. The dataset and the code of each team can be found at the repository: https://github.com/qulishen/RAIM-HDR.[120] SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
Xiyang Huang,Jiawei Lin,Keying Wu,Jiaxin Huang,Kailai Yang,Renxiong Wei,Cheng zeng,Jiayi Xiang,Ziyan Kuang,Min Peng,Qianqian Xie,Sophia Ananiadou
Main category: cs.CV
TL;DR: 本文提出SiMing-Bench,首个面向临床技能视频的多模态大模型程序性判断能力评测基准,聚焦交互驱动的程序状态动态更新能力;实验表明现有MLLMs在此关键能力上表现薄弱,且全局评估易高估模型真实水平。
Details
Motivation: 现有视频基准忽视了专家级程序判断所需的关键能力——跟踪交互如何动态更新程序状态并影响后续动作正确性。 Method: 构建SiMing-Bench基准及配套数据集SiMing-Score,包含真实临床技能考试视频、标准化分步评分量表和双专家标注,并设计多维度评估协议(整体流程、中间步骤、二元步骤判断、步对齐片段)。 Result: 各类开源与闭源MLLM在医生判断一致性上普遍表现差;即使流程级相关性尚可,中间步骤判断仍显著薄弱;进一步分析表明瓶颈在于建模交互引发的状态时序演化,而非细粒度评分或时间定位。 Conclusion: 当前MLLMs缺乏对程序性状态动态演化的建模能力,SiMing-Bench为评估和推动该方向发展提供了新标准。 Abstract: Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models' procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.[121] Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting
Tsuheng Hsu,Guiyu Liu,Juho Kannala,Janne Heikkilä
Main category: cs.CV
TL;DR: 本文提出了一种面向数据集级别、以对象为中心的监督方案,用于在3D高斯泼溅(3DGS)中学习对象表示,通过预训练的基于slot attention的全局对象中心学习(GOCL)模块构建场景无关的对象码本,实现跨视角和跨场景一致的对象身份表征,无需逐场景微调或掩码后处理。
Details
Motivation: 现有利用视觉基础模型(VFMs)2D掩码监督辐射场的方法缺乏本质的对象中心性,存在跨视角掩码身份冲突、需额外掩码处理或定制化训练设计等问题,且学习到的3D场景身份依赖于具体场景,泛化能力差。 Method: 基于预训练的slot attention架构的全局对象中心学习(GOCL)模块,构建场景无关的对象码本,并结合其无监督对象掩码,直接监督3D高斯的身份特征,避免掩码预/后处理及显式多视角对齐。 Result: 实现了无需逐场景微调或重训练的对象监督与识别;将无监督对象中心学习(OCL)引入3DGS,提升了表征结构性和下游任务(如机器人交互、场景理解、跨场景泛化)的泛化性能。 Conclusion: 所提方法通过引入数据集级对象中心监督和场景无关码本,显著增强了3DGS中对象表征的一致性与泛化能力,为3D场景理解提供了更鲁棒、可扩展的无监督对象建模范式。 Abstract: Recent works on 3D scene understanding leverage 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre/post-processing or specialized training and loss design to resolve mask identity conflicts across views. The learned identity of the 3D scene is scene-dependent, limiting generalizability across scenes. Therefore, we propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module's unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better generalization for downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.[122] Text-Conditioned Multi-Expert Regression Framework for Fully Automated Multi-Abutment Design
Mianjie Zheng,Xinquan Yang,Xuefen Liu,Xuguang Li,Kun Tang,He Meng,Linlin Shen
Main category: cs.CV
TL;DR: 本文提出TEMAD,一种全自动、文本条件化的多专家架构,用于多牙种植体基台设计,通过植入位点识别网络(ISIN)、牙齿条件化特征线性调制(TC-FiLM)模块和系统提示的专家混合(SPMoE)机制,实现高精度、可扩展的自动化设计。
Details
Motivation: 现有牙科种植体基台设计依赖人工、耗时且难以扩展至多基台场景,已有深度学习方法仍需大量临床干预,缺乏全自动与可扩展性。 Method: 提出TEMAD框架,包含三部分:1)Implant Site Identification Network(ISIN)自动定位植入位点;2)Tooth-Conditioned FiLM(TC-FiLM)模块利用牙齿嵌入对网格表征进行位置自适应调制;3)System-Prompted Mixture-of-Experts(SPMoE)机制依据种植系统提示动态选择专家,实现系统感知的参数回归。 Result: 在大规模基台设计数据集上实验表明,TEMAD在多基台场景下显著优于现有方法,达到SOTA性能,验证了其全自动牙科种植规划的有效性与实用性。 Conclusion: TEMAD实现了真正端到端、全自动、可扩展的多基台设计,为智能口腔修复提供了新范式,推动了AI在精准牙科诊疗中的临床落地。 Abstract: Dental implant abutments serve as the geometric and biomechanical interface between the implant fixture and the prosthetic crown, yet their design relies heavily on manual effort and is time-consuming. Although deep neural networks have been proposed to assist dentists in designing abutments, most existing approaches remain largely manual or semi-automated, requiring substantial clinician intervention and lacking scalability in multi-abutment scenarios. To address these limitations, we propose TEMAD, a fully automated, text-conditioned multi-expert architecture for multi-abutment design. This framework integrates implant site localization and implant system, compatible abutment parameter regression into a unified pipeline. Specifically, we introduce an Implant Site Identification Network (ISIN) to automatically localize implant sites and provide this information to the subsequent multi-abutment regression network. We further design a Tooth-Conditioned Feature-wise Linear Modulation (TC-FiLM) module, which adaptively calibrates mesh representations using tooth embeddings to enable position-specific feature modulation. Additionally, a System-Prompted Mixture-of-Experts (SPMoE) mechanism leverages implant system prompts to guide expert selection, ensuring system-aware regression. Extensive experiments on a large-scale abutment design dataset show that TEMAD achieves state-of-the-art performance compared to existing methods, particularly in multi-abutment settings, validating its effectiveness for fully automated dental implant planning.[123] Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy
Jiaheng Dai,Huanrong Liu,Tailai Zhou,Tongyu Jia,Qin Liu,Yutong Ban,Zeju Li,Yu Gao,Xin Ma,Qingbiao Li
Main category: cs.CV
TL;DR: 本文提出了SIA-RAPN基准,用于评估机器人辅助部分肾切除术中精细动作分割任务,比较了四种基于I3D特征的时序模型(MS-TCN++、AsFormer、TUT、DiffAct),并报告了其在多指标下的性能表现。
Details
Motivation: 解决肾缝合过程中视觉相似、持续时间可变且类别严重不平衡的细粒度动作分割问题。 Method: 构建SIA-RAPN基准数据集(50个达芬奇Xi系统临床视频,12类帧级标注),并在其上对比MS-TCN++、AsFormer、TUT和DiffAct四种时序模型,使用多种指标(如平衡准确率、编辑分数、分段F1、帧准确率、帧平均精度)进行评估,并拓展至单孔RAPN跨域测试。 Result: DiffAct在F1、帧准确率、编辑分数和帧mAP上表现最优;MS-TCN++在平衡准确率上最高。 Conclusion: DiffAct在多数指标上优于其他模型,表明其更适合处理该任务中的时序建模与类别不平衡挑战;而MS-TCN++在类别均衡性要求高的指标上更具优势。 Abstract: Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10, 25, and 50, frame-wise accuracy, and frame-wise mean average precision. In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Across the strongest reported values over those five runs on the primary dataset, DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame mAP, while MS-TCN++ attains the highest balanced accuracy.[124] Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
Junchao Liao,Zhenghao Zhang,Xiangyu Meng,Litao Li,Ziying Zhang,Siyu Zhu,Long Qin,Weizhi Wang
Main category: cs.CV
TL;DR: Tora3是一种轨迹引导的音视频生成框架,通过将物体轨迹作为共享运动学先验,提升音视频生成中运动与声音的物理一致性。
Details
Motivation: 现有音视频生成方法难以生成具有合理运动-声音关系的内容,主要因缺乏显式的、音视频共享的运动感知结构。 Method: 提出Tora3框架:1)设计轨迹对齐的视频运动表征;2)构建由轨迹导出的二阶运动学状态驱动的运动-音频对齐模块;3)采用混合流匹配方案,在轨迹条件区域保持轨迹保真度,其余区域维持局部一致性;并构建大规模含自动运动标注的PAV数据集。 Result: 实验表明,Tora3在运动真实性、音画同步性及整体音视频生成质量上均优于强开源基线。 Conclusion: 以物体轨迹为共享先验可有效提升音视频生成的物理一致性和多模态协同质量。 Abstract: Audio-video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion-sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation. We present Tora3, a trajectory-guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video-only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory-aligned motion representation for video, a kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and a hybrid flow matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. We further curate PAV, a large-scale AV dataset emphasizing motion-relevant patterns with automatically extracted motion annotations. Extensive experiments show that Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.[125] Learning Vision-Language-Action World Models for Autonomous Driving
Guoqing Wang,Pin Tang,Xiangxuan Ren,Guodongfang Zhao,Bailan Feng,Chao Ma
Main category: cs.CV
TL;DR: 本文提出VLA-World,一种融合预测性想象与反思性推理的视觉-语言-动作世界模型,用于提升自动驾驶的前瞻性与安全性;通过动作引导的轨迹生成下一帧图像,并基于该图像反思优化轨迹,在nuScenes-GR-20K数据集和三阶段训练策略下显著超越现有VLA与世界模型方法。
Details
Motivation: 现有VLA模型缺乏对时间动态和全局世界一致性的显式建模,限制其预见性和安全性;而世界模型虽能生成未来场景,却难以对生成内容进行推理与评估。 Method: 提出VLA-World模型:1)利用动作导出的可行轨迹指导下一帧图像生成,捕捉时空演化信息;2)对自生成的未来图像进行反思推理以优化轨迹预测;3)构建nuScenes-GR-20K生成推理数据集,并采用预训练、监督微调和强化学习三阶段训练策略。 Result: 在规划与未来生成基准上,VLA-World持续超越当前最优的VLA模型和世界模型基线。 Conclusion: 将预测性想象与反思性推理统一于VLA框架中,可有效提升自动驾驶系统的前瞻性、安全性与可解释性,为端到端智能体建模提供新范式。 Abstract: Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: https://vlaworld.github.io[126] Nested Radially Monotone Polar Occupancy Estimation: Clinically-Grounded Optic Disc and Cup Segmentation for Glaucoma Screening
Rimsa Goperma,Rojan Basnet,Liang Zhao
Main category: cs.CV
TL;DR: 本文提出NPS-Net,首次将视杯/视盘分割建模为嵌套的径向单调极坐标占据估计,确保解剖学有效性(星凸性与嵌套结构),显著提升跨数据集泛化能力与诊断指标精度。
Details
Motivation: 现有深度学习方法无法保证视杯/视盘分割结果的临床有效性(如星凸性、嵌套结构),导致诊断指标(如vCDR)在跨域场景下失真。 Method: 提出NPS-Net框架,将OD/OC分割建模为嵌套的径向单调极坐标占用估计,从输出表示层面硬性约束解剖学有效性。 Result: 在7个公开数据集上展现出强零样本泛化能力;在RIM-ONE上100%保持解剖有效性,Cup Dice提升12.8%,vCDR MAE降低超56%;在PAPILA上Disc Dice达0.9438,Disc HD95为2.78像素,较最优方法降低83%。 Conclusion: NPS-Net通过新颖的极坐标嵌套建模,兼顾临床有效性与分割精度,为鲁棒、可解释的青光眼筛查提供了新范式。 Abstract: Valid segmentation of the optic disc (OD) and optic cup (OC) from fundus photographs is essential for glaucoma screening. Unfortunately, existing deep learning methods do not guarantee clinical validness including star-convexity and nested structure of OD and OC, resulting corruption in diagnostic metric, especially under cross-dataset domain shift. To adress this issue, this paper proposed NPS-Net (Nested Polar Shape Network), the first framework that formulates the OD/OC segmentation as nested radially monotone polar occupancy estimation.This output representation can guarantee the aforementioned clinical validness and achieve high accuracy. Evaluated across seven public datasets, NPS-Net shows strong zero-shot generalization. On RIM-ONE, it maintains 100% anatomical validity and improves Cup Dice by 12.8% absolute over the best baseline, reducing vCDR MAE by over 56%. On PAPILA, it achieves Disc Dice of 0.9438 and Disc HD95 of 2.78 px, an 83% reduction over the best competing method.[127] Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition
Yuxi Zhou,Zhengbo Zhang,Jingyu Pan,Zhiyu Lin,Zhigang Tu
Main category: cs.CV
TL;DR: 本文提出了一种面向零样本骨架动作识别(ZSAR)的频域感知扩散模型FDSM,通过引入语义引导的谱残差模块、时间步自适应谱损失和课程式语义抽象机制,缓解扩散模型频谱偏差导致的高频运动细节丢失问题,在多个基准数据集上达到SOTA性能。
Details
Motivation: 现有监督式骨架动作识别方法依赖大量标注数据,泛化能力差;零样本骨架动作识别(ZSAR)虽具潜力,但受扩散模型频谱偏差(过度平滑高频动态)制约,难以恢复精细运动细节。 Method: 提出Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM),包含三个核心组件:1)Semantic-Guided Spectral Residual Module(语义引导谱残差模块),增强高频运动建模;2)Timestep-Adaptive Spectral Loss(时间步自适应谱损失),在扩散过程中分阶段优化频域重建;3)Curriculum-based Semantic Abstraction(课程式语义抽象),逐步提升语义-骨架对齐难度。 Result: 在NTU RGB+D、PKU-MMD和Kinetics-skeleton三个主流骨架数据集上均取得零样本动作识别的SOTA性能。 Conclusion: 频域建模对提升ZSAR中细粒度动作表征至关重要;FDSM通过显式频谱干预与语义引导扩散,有效克服了扩散模型固有频谱偏差,为ZSAR提供了新范式。 Abstract: Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/[128] Cross-Modal Knowledge Distillation from Spatial Transcriptomics to Histology
Arbel Hizmi,Artemii Bakulin,Shai Bagon,Nir Yosef
Main category: cs.CV
TL;DR: 本文提出了一种跨模态蒸馏方法,利用配对的空间转录组学和H&E染色图像数据,将转录组学定义的组织微环境结构迁移到仅基于H&E图像的模型中,从而在推理阶段无需转录组数据即可准确预测组织微环境。
Details
Motivation: 空间转录组学虽能揭示组织微环境,但成本高、数据稀缺;而H&E染色图像丰富但信息粒度粗。因此需 bridging the gap 以低成本方式获得高分辨率微环境解析。 Method: 采用跨模态知识蒸馏策略,以空间转录组学定义的微环境为教师模型,指导仅基于H&E图像的学生模型学习其空间结构表征。 Result: 蒸馏模型在多种组织与疾病背景下,较无监督形态学基线显著提升与转录组定义微环境的一致性,并经细胞类型分析验证其生物学合理性。 Conclusion: 该框架可在训练时利用配对数据,在推理时仅需H&E图像,实现了高精度、低成本、可泛化的组织微环境解析。 Abstract: Spatial transcriptomics provides a molecularly rich description of tissue organization, enabling unsupervised discovery of tissue niches -- spatially coherent regions of distinct cell-type composition and function that are relevant to both biological research and clinical interpretation. However, spatial transcriptomics remains costly and scarce, while H&E histology is abundant but carries a less granular signal. We propose to leverage paired spatial transcriptomics and H&E data to transfer transcriptomics-derived niche structure to a histology-only model via cross-modal distillation. Across multiple tissue types and disease contexts, the distilled model achieves substantially higher agreement with transcriptomics-derived niche structure than unsupervised morphology-based baselines trained on identical image features, and recovers biologically meaningful neighborhood composition as confirmed by cell-type analysis. The resulting framework leverages paired spatial transcriptomic and H&E data during training, and can then be applied to held-out tissue regions using histology alone, without any transcriptomic input at inference time.[129] Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
Yutong Zhang,Jiaxin Chen,Honglin Chen,Kaiqi Zheng,Shengcai Liao,Hanwen Zhong,Weixin Li,Yunhong Wang
Main category: cs.CV
TL;DR: 本文提出Masked Dual Path Distillation (MDPD)方法,在保持微调阶段参数与内存高效的同时,通过双向蒸馏消除推理时的侧网络开销,显著提升推理速度与精度。
Details
Motivation: 现有内存高效迁移学习(METL)方法虽减少微调内存消耗,但引入可学习侧网络导致推理时额外内存与时间开销,违背高效迁移学习初衷。 Method: 提出MDPD框架:1)在微调中对冻结主干网络与可学习侧网络进行互蒸馏;2)设计面向多层编码器结构的特征级知识蒸馏;3)推理时完全移除侧网络。 Result: 在多类视觉/语言及跨模态任务上,相比SOTA方法,推理加速≥25.2%,微调参数与内存开销相当,且精度显著提升。 Conclusion: MDPD成功解耦微调效率与推理效率,验证了无需侧网络即可维持高性能的可行性,为高效迁移学习提供了新范式。 Abstract: Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing the number of trainable parameters and high memory consumption during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances the performance by mutually distilling the frozen backbones and learnable side networks in fine-tuning, and discard the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2\% while keeping parameter and memory consumption comparable, but also remarkably promotes the accuracy compared to SOTA approaches. The source code is available at https://github.com/Zhang-VKk/MDPD.[130] Off-the-shelf Vision Models Benefit Image Manipulation Localization
Zhengxuan Zhang,Keji Song,Junmin Hu,Ao Luo,Yuezun Li
Main category: cs.CV
TL;DR: 本文提出了一种名为ReVi的可训练适配器,利用现成的通用视觉模型(如生成和分割网络)进行图像篡改定位(IML),通过解耦语义冗余与篡改特异性信息,仅微调适配器而冻结主干模型参数,实现了高效、可扩展的IML方法。
Details
Motivation: 图像篡改定位(IML)与通用视觉任务长期被视作独立方向,本文旨在弥合二者鸿沟,探索通用语义先验对IML的潜在增益。 Method: 提出ReVi适配器,受鲁棒主成分分析启发,在冻结预训练通用视觉模型参数的前提下,解耦并增强其中蕴含的篡改特异性信息,仅对适配器进行微调。 Result: 实验表明该方法在IML任务上性能优越,无需大规模重设计或全模型重训练,具备良好的可扩展性。 Conclusion: 通用视觉模型中蕴含可迁移的篡改相关线索,通过轻量适配即可有效服务于IML,为构建可扩展的IML框架提供了新范式。 Abstract: Image manipulation localization (IML) and general vision tasks are typically treated as two separate research directions due to the fundamental differences between manipulation-specific and semantic features. In this paper, however, we bridge this gap by introducing a fresh perspective: these two directions are intrinsically connected, and general semantic priors can benefit IML. Building on this insight, we propose a novel trainable adapter (named ReVi) that repurposes existing off-the-shelf general-purpose vision models (e.g., image generation and segmentation networks) for IML. Inspired by robust principal component analysis, the adapter disentangles semantic redundancy from manipulation-specific information embedded in these models and selectively enhances the latter. Unlike existing IML methods that require extensive model redesign and full retraining, our method relies on the off-the-shelf vision models with frozen parameters and only fine-tunes the proposed adapter. The experimental results demonstrate the superiority of our method, showing the potential for scalable IML frameworks.[131] Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch
Gabriele Mario Caddeo,Pasquale Marra,Lorenzo Natale
Main category: cs.CV
TL;DR: 本文提出了一种多模态、物理 grounded 的方法,用于在严重手部遮挡下进行度量尺度的无模态物体重建与位姿估计,融合视觉、本体感知与多点触觉信号,并通过物理约束优化重建结果。
Details
Motivation: 现有基于视觉的遮挡感知3D生成方法在严重手部遮挡下存在结构歧义,难以实现度量尺度和物理一致的重建。 Method: 构建姿态感知、相机对齐的符号距离场(SDF)表征;设计Structure-VAE学习紧凑潜在空间;在该空间中训练条件流匹配扩散模型,融合RGB图像、遮挡/可见性掩码、手部潜在表示及触觉信息;引入物理驱动目标(如避免手物穿透)与可微解码器引导进行微调与推理。 Result: 仿真实验表明,加入本体感知与触觉显著提升遮挡下的物体补全质量,获得符合真实尺度且物理合理的重建结果;模型成功迁移到未见过末端执行器的真实人形机器人上。 Conclusion: 多模态物理信号融合与显式物理约束建模,是实现鲁棒、度量准确、物理一致的遮挡下物体重建的关键路径,可自然嵌入两阶段重建流程。 Abstract: We propose a multimodal, physically grounded approach for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. Unlike prior occlusion-aware 3D generation methods that rely only on vision, we leverage physical interaction signals: proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface must lie, reducing ambiguity in occluded regions. We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand--object interpenetration and to align the reconstructed surface with contact observations. Because our method produces a metric, physically consistent structure estimate, it integrates naturally into existing two-stage reconstruction pipelines, where a downstream module refines geometry and predicts appearance. Experiments in simulation show that adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines; we further validate transfer by deploying the model on a real humanoid robot with an end-effector different from those used during training.[132] Detecting Diffusion-generated Images via Dynamic Assembly ForestsDetecting Diffusion-generated Images via Dynamic Assembly Forests
Mengxin Fu,Yuezun Li
Main category: cs.CV
TL;DR: 本文提出了一种基于深度森林范式的动态组装森林模型(DAF),用于检测扩散模型生成的图像,相比深度神经网络方法,DAF参数更少、计算成本更低、无需GPU即可部署,且性能具有竞争力。
Details
Motivation: 扩散模型生成高质量图像带来严重安全问题,现有检测方法多依赖计算昂贵的深度神经网络,而忽视了传统机器学习模型的潜力。 Method: 提出动态组装森林模型(DAF),基于深度森林范式,改进特征学习与可扩展训练能力。 Result: DAF参数显著更少、计算开销大幅降低、无需GPU即可部署,在标准评估协议下达到与现有DNN方法相当的检测性能。 Conclusion: DAF展现出作为资源受限场景下轻量级、实用型扩散图像检测器的强大潜力,是重型DNN模型的有效替代方案。 Abstract: Diffusion models are known for generating high-quality images, causing serious security concerns. To combat this, most efforts rely on deep neural networks (e.g., CNNs and Transformers), while largely overlooking the potential of traditional machine learning models. In this paper, we freshly investigate such alternatives and proposes a novel Dynamic Assembly Forest model (DAF) to detect diffusion-generated images. Built upon the deep forest paradigm, DAF addresses the inherent limitations in feature learning and scalable training, making it an effective diffusion-generated image detector. Compared to existing DNN-based methods, DAF has significantly fewer parameters, much lower computational cost, and can be deployed without GPUs, while achieving competitive performance under standard evaluation protocols. These results highlight the strong potential of the proposed method as a practical substitute for heavyweight DNN models in resource-constrained scenarios. Our code and models are available at https://github.com/OUC-VAS/DAF.[133] FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval
François Gardères,Camille-Sovanneary Gauthier,Jean Ponce,Shizhe Chen
Main category: cs.CV
TL;DR: 本文提出FIRE-CIR模型,通过问题驱动的视觉推理提升时尚领域组合图像检索(CIR)的准确性和可解释性,优于现有方法。
Details
Motivation: 现有基于视觉语言模型的CIR方法难以区分参考图像中应保留与应修改的部分,导致可解释性差、细粒度领域(如时尚)性能不佳。 Method: FIRE-CIR自动从文本修改中生成属性聚焦的视觉问题,并在参考图与候选图中验证对应视觉证据;为此构建了大规模时尚专用视觉问答数据集,支持单图或双图分析;检索时利用该推理过程对候选结果重排序。 Result: 在Fashion IQ基准上,FIRE-CIR超越当前最优方法,提升检索精度,并提供属性级可解释的检索决策依据。 Conclusion: 引入显式、问题驱动的视觉推理机制可有效增强CIR在细粒度领域的性能与可解释性,尤其适用于时尚等需精准属性控制的任务。 Abstract: Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion. In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images. To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications. Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.[134] Few-Shot Personalized Age Estimation
Jakub Paplhám,Vojtěch Franc,Artem Moroz
Main category: cs.CV
TL;DR: 本文提出OpenPAE,首个开源的N-shot个性化年龄估计基准,并建立从算术偏移到条件注意力神经过程的多层次基线模型,实验证明个性化方法能持续提升性能。
Details
Motivation: 现有年龄估计方法将每张人脸视为独立样本,忽略了个体因遗传、生活方式和健康状况不同而存在差异的老化速率;当有同一人的多张已知年龄参考图像时,可利用该上下文进行个性化估计。 Method: 构建了OpenPAE开源基准,设计严格的评估协议;提出了从算术偏移、闭式贝叶斯线性回归到条件注意力神经过程的层次化基线模型。 Result: 实验表明个性化估计能持续提升性能,其增益并非仅来自领域自适应,且非线性方法显著优于简单方法。 Conclusion: 个性化年龄估计是有效且必要的方向,OpenPAE为该任务提供了首个开放、可复现、支持N-shot设定的基准与完整工具链。 Abstract: Existing age estimation methods treat each face as an independent sample, learning a global mapping from appearance to age. This ignores a well-documented phenomenon: individuals age at different rates due to genetics, lifestyle, and health, making the mapping from face to age identity-dependent. When reference images of the same person with known ages are available, we can exploit this context to personalize the estimate. The only existing benchmark for this task (NIST FRVT) is closed-source and limited to a single reference image. In this work, we introduce OpenPAE, the first open benchmark for $N$-shot personalized age estimation with strict evaluation protocols. We establish a hierarchy of increasingly sophisticated baselines: from arithmetic offset, through closed-form Bayesian linear regression, to a conditional attentive neural process. Our experiments show that personalization consistently improves performance, that the gains are not merely domain adaptation, and that nonlinear methods significantly outperform simpler alternatives. We release all models, code, protocols, and evaluation splits.[135] FaceLiVTv2: An Improved Hybrid Architecture for Efficient Mobile Face Recognition
Novendra Setyawan,Chi-Chia Sun,Mao-Hsiu Hsu,Wen-Kai Kuo,Jun-Wei Hsieh
Main category: cs.CV
TL;DR: FaceLiVTv2是一种轻量级混合CNN-Transformer人脸验证模型,通过Lite MHLA模块和RepMix块提升全局-局部特征交互效率,在保持高精度的同时显著降低移动端推理延迟。
Details
Motivation: 面向边缘与移动设备的轻量级人脸识别需兼顾低延迟、低内存/能耗与高精度,现有混合架构在性能与效率平衡上仍存挑战。 Method: 提出FaceLiVTv2:引入Lite MHLA(多头线性token投影+仿射重标度)替代冗余多层注意力;设计统一RepMix块协调局部与全局特征交互,并采用全局depthwise卷积实现自适应空间聚合。 Result: 在LFW、CA-LFW等6个基准上优于现有轻量方法:相比FaceLiVTv1降低22%移动端延迟,较GhostFaceNets提速最高30.8%,较EdgeFace/KANFace降低20–41%延迟且精度更高。 Conclusion: FaceLiVTv2在精度与效率间取得更优权衡,是适用于实时人脸验证的实用化部署方案。 Abstract: Lightweight face recognition is increasingly important for deployment on edge and mobile devices, where strict constraints on latency, memory, and energy consumption must be met alongside reliable accuracy. Although recent hybrid CNN-Transformer architectures have advanced global context modeling, striking an effective balance between recognition performance and computational efficiency remains an open challenge. In this work, we present FaceLiVTv2, an improved version of our FaceLiVT hybrid architecture designed for efficient global--local feature interaction in mobile face recognition. At its core is Lite MHLA, a lightweight global token interaction module that replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving representational diversity across heads. We further integrate Lite MHLA into a unified RepMix block that coordinates local and global feature interactions and adopts global depthwise convolution for adaptive spatial aggregation in the embedding stage. Under our experimental setup, results on LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB show that FaceLiVTv2 consistently improves the accuracy-efficiency trade-off over existing lightweight methods. Notably, FaceLiVTv2 reduces mobile inference latency by 22% relative to FaceLiVTv1, achieves speedups of up to 30.8% over GhostFaceNets on mobile devices, and delivers 20-41% latency improvements over EdgeFace and KANFace across platforms while maintaining higher recognition accuracy. These results demonstrate that FaceLiVTv2 offers a practical and deployable solution for real-time face recognition. Code is available at https://github.com/novendrastywn/FaceLiVT.[136] Visually-Guided Policy Optimization for Multimodal Reasoning
Zengbin Wang,Feng Xiong,Liang Lin,Xuecai Hu,Yong Wang,Yanlin Wang,Man Zhang,Xiangxiang Chu
Main category: cs.CV
TL;DR: 本文提出Visually-Guided Policy Optimization (VGPO)框架,通过视觉注意力补偿与双粒度优势重加权策略,增强视觉语言模型在强化学习中的视觉忠实性与长期视觉记忆能力,提升多模态数学推理等视觉依赖任务性能。
Details
Motivation: 现有视觉语言模型在强化学习中存在视觉忠实性不足问题,表现为视觉token稀疏激活,且在推理过程中出现时间维度上的视觉遗忘现象。 Method: 提出VGPO框架:1)视觉注意力补偿机制,利用视觉相似性定位并增强视觉线索,并逐步提高后续步骤的视觉期望以缓解视觉遗忘;2)双粒度优势重加权策略,包括轨迹内(突出高视觉激活token)和轨迹间(优先选择视觉累积更优的轨迹)两个层面。 Result: 在数学多模态推理和视觉依赖任务上,VGPO显著提升了视觉激活程度与整体性能。 Conclusion: VGPO有效缓解了VLM在RLVR中视觉注意力薄弱与时间遗忘问题,为提升多模态推理的视觉忠实性提供了新思路。 Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.[137] Strips as Tokens: Artist Mesh Generation with Native UV Segmentation
Rui Xu,Dafei Qin,Kaichun Qiao,Qiujie Dong,Huaijin Pi,Qixuan Zhang,Longwen Zhang,Lan Xu,Jingyi Yu,Wenping Wang,Taku Komura
Main category: cs.CV
TL;DR: 本文提出了一种名为 Strips as Tokens (SATO) 的新框架,通过受三角形带(triangle strips)启发的 token 排序策略,提升自回归 Transformer 生成高质量三维网格的能力;该方法保持边缘流与结构规律性,并支持统一解码为三角或四边形网格,实现联合训练与性能提升。
Details
Motivation: 现有自回归 Transformer 方法在网格生成中采用的 token 排序策略(如坐标排序或块状启发式)无法满足专业艺术家对连续边缘流、结构规律性和 UV 边界清晰性的要求。 Method: 提出 SATO 框架,将网格面按三角形带方式组织为连通面链,显式编码 UV 边界;设计统一 token 表示,支持同一序列解码为三角形或四边形网格,并在两类数据上联合训练。 Result: 实验表明 SATO 在几何质量、结构一致性与 UV 分割效果上均显著优于先前方法。 Conclusion: SATO 通过结构感知的 token 排序与统一网格表示,有效 bridging 了自回归建模与专业级网格生成之间的鸿沟,为高质量 3D 内容生成提供了新范式。 Abstract: Recent advancements in autoregressive transformers have demonstrated remarkable potential for generating artist-quality meshes. However, the token ordering strategies employed by existing methods typically fail to meet professional artist standards, where coordinate-based sorting yields inefficiently long sequences, and patch-based heuristics disrupt the continuous edge flow and structural regularity essential for high-quality modeling. To address these limitations, we propose Strips as Tokens (SATO), a novel framework with a token ordering strategy inspired by triangle strips. By constructing the sequence as a connected chain of faces that explicitly encodes UV boundaries, our method naturally preserves the organized edge flow and semantic layout characteristic of artist-created meshes. A key advantage of this formulation is its unified representation, enabling the same token sequence to be decoded into either a triangle or quadrilateral mesh. This flexibility facilitates joint training on both data types: large-scale triangle data provides fundamental structural priors, while high-quality quad data enhances the geometric regularity of the outputs. Extensive experiments demonstrate that SATO consistently outperforms prior methods in terms of geometric quality, structural coherence, and UV segmentation.[138] Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts
Farhad Nooralahzadeh,Omid Rohanian,Yi Zhang,Jonathan Fürst,Kurt Stockinger
Main category: cs.CV
TL;DR: 本文探讨了视觉-语言模型(VLM)在面对违背常识的图像(如蓝色香蕉)时回答错误的原因,发现根本问题不在于视觉感知能力弱,而在于多模态信号仲裁机制失效;通过MAC分析和激活修补等因果干预方法,证实视觉信息在早期层已被充分编码,但未被下游有效利用;提出无需训练的早期层激活引导策略可提升视觉接地性能。
Details
Motivation: 探究VLM在常识冲突场景下回答错误的根源——是视觉感知缺陷,还是视觉与先验知识之间的仲裁失败。 Method: 提出Multimodal Arbitration Crossover(MAC)分析与layer-by-layer Logit Lens探测,结合全序列激活修补(full-sequence activation patching)、部分token分解及训练-free的激活引导(线性与稀疏自编码器引导)方法。 Result: 发现视觉属性在早期层即可线性解码(AUC > 0.86),且编码强度在正确/错误样本间无显著差异;最终层logit差值而非编码强度决定接地效果(高相关性);全序列修补可改变60–84%输出;图像token主导因果影响,文本token几乎无影响;早期层激活引导最多提升视觉接地性能+3.8%。 Conclusion: VLM已具备良好的视觉感知能力,核心问题在于未能有效利用所见信息进行决策;视觉-语言仲裁机制缺陷是主因,可通过针对性、训练-free的早期层干预加以改善。 Abstract: When a Vision-Language Model (VLM) sees a blue banana and answers "yellow", is the problem of perception or arbitration? We explore the question in ten VLMs with various sizes and reveal an Encoding--Grounding Dissociation: models that fail to report what they see (and thus provide a wrong answer) still encode the visual evidence as strongly as models that provide the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes can be linearly decodable from early layers (AUC > 0.86). The accuracy remains nearly identical for both successful and failed samples. However, the gap in the final-layer logit -- not the strength of encoding -- better predicts grounding outcomes with a correlation of . After having studied when VLMs base their answers on image clues rather than prior knowledge, we want to understand the causal relationships. We establish causality through full-sequence activation patching. The standard last-token interventions in LLM interpretability do not affect VLMs. In contrast, replacing the full token sequence at layers identified by MAC alters 60 to 84% of outputs. Partial-token decomposition shows that image tokens carry almost all of the causal impact, while text tokens have none. Scaling addresses the remaining architectural differences to achieve perfect retention. Moving from diagnosis to intervention, we show that training-free activation steering -- both linear and sparse autoencoder-guided -- in early layers can improve visual grounding by up to +3.8% with degrading performance in some setups. Overall, these findings lead to a clear conclusion: VLMs already see well, but the challenge is acting on what they see. Targeted interventions can help to bridge this gap.[139] Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching
Jiahao Li,Xinhong Chen,Zhengmin Jiang,Cheng Huang,Yung-Hui Li,Jianping Wang
Main category: cs.CV
TL;DR: 本文提出GREATEN框架,通过引入表面法向量作为几何线索来增强合成到真实场景的零样本立体匹配泛化能力,并在多个数据集上显著提升性能与效率。
Details
Motivation: 现有图像驱动的立体匹配方法在合成到真实场景(Syn-to-Real)零样本迁移中表现不佳,主要受限于跨域分布偏移及图像纹理在遮挡、无纹理、重复纹理和非朗伯区域(如镜面/透明)中的不适定歧义。 Method: 提出GREATEN框架,包含三个核心模块:(1) 门控上下文-几何融合(GCGF)模块,自适应抑制不可靠图像上下文并融合法向量引导的几何特征;(2) 镜面-透明增强(STA)策略,提升对非朗伯区域误导性视觉线索的鲁棒性;(3) 多种稀疏注意力机制(SSA、SDMA、SVA),兼顾细粒度全局建模与计算效率。 Result: 仅在SceneFlow等合成数据上训练,GREATEN-IGEV在ETH3D、Booster、KITTI-2015上分别较基线模型降低误差30%、8.5%、14.1%,且比GREAT-IGEV快19.2%,支持Middlebury上3K分辨率、最大768视差范围推理。 Conclusion: 引入表面法向量作为领域不变、物体固有且判别性强的几何先验,可有效弥补纯图像纹理的局限,显著提升立体匹配模型在合成到真实场景下的零样本泛化能力与实用性。 Abstract: Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.[140] VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
Wenyi Xiao,Xinchi Xu,Leilei Gan
Main category: cs.CV
TL;DR: 本文提出VL-Calibration框架,通过强化学习将视觉语言模型(LVLMs)的置信度解耦为视觉置信度和推理置信度,并利用内在视觉确定性估计与token级优势重加权,缓解幻觉、提升校准性与视觉推理准确率。
Details
Motivation: 现有面向纯文本大模型的置信度校准方法不适用于LVLMs,因其无法区分感知错误与推理错误,且单一群体置信度混淆了不同不确定性来源,尤其视觉不确定性常被语言先验主导。 Method: 提出VL-Calibration:1)用强化学习显式解耦视觉与推理置信度;2)设计无真值标签的内在视觉确定性估计(结合图像扰动下的KL散度视觉定位 + token熵衡量内部确定性);3)引入基于视觉确定性的token级优势重加权机制。 Result: 在13个基准上验证了VL-Calibration能有效提升置信度校准性并提高视觉推理准确率,且对分布外数据、不同模型规模与架构具有良好泛化性。 Conclusion: 解耦视觉与推理置信度并辅以内在视觉确定性监督,是提升LVLMs可靠性与鲁棒性的有效路径;VL-Calibration为多模态模型可信校准提供了新范式。 Abstract: Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.[141] Deep Light Pollution Removal in Night Cityscape Photographs
Hao Wang,Xiaolin Wu,Xi Zhang,Baoqing Sun
Main category: cs.CV
TL;DR: 本文提出了一种面向光污染去除的物理退化模型与合成-真实耦合训练策略,有效恢复纯净夜景影像。
Details
Motivation: 夜间摄影受城市人工照明引发的光污染严重退化,现有夜间去雾方法无法满足恢复原始夜景外观的需求。 Method: 构建了包含各向异性光源扩散和地平线后不可见光源导致天空辉光的物理退化模型,并采用大生成模型辅助的合成-真实耦合训练策略。 Result: 实验表明该方法显著减少光污染伪影,在恢复真实夜景影像方面优于现有夜间复原方法。 Conclusion: 所提物理模型与训练框架为夜间光污染去除提供了新思路,提升了夜景图像复原的真实性与泛化性。 Abstract: Nighttime photography is severely degraded by light pollution induced by pervasive artificial lighting in urban environments. After long-range scattering and spatial diffusion, unwanted artificial light overwhelms natural night luminance, generates skyglow that washes out the view of stars and celestial objects and produces halos and glow artifacts around light sources. Unlike nighttime dehazing, which aims to improve detail legibility through thick air, the objective of light pollution removal is to restore the pristine night appearance by neutralizing the radiative footprint of ground lighting. In this paper we introduce a physically-based degradation model that adds to the previous ones for nighttime dehazing two critical aspects; (i) anisotropic spread of directional light sources, and (ii) skyglow caused by invisible surface lights behind skylines. In addition, we construct a training strategy that leverages large generative model and synthetic-real coupling to compensate for the scarcity of paired real data and enhance generalization. Extensive experiments demonstrate that the proposed formulation and learning framework substantially reduce light pollution artifacts and better recover authentic night imagery than prior nighttime restoration methods.[142] VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
Guanyu Zhou,Yida Yin,Wenhao Chai,Shengbang Tong,Xingyu Fu,Zhuang Liu
Main category: cs.CV
TL;DR: 本文提出VisionFoundry,一种仅需任务关键词即可生成合成视觉问答数据的流程,用于提升视觉语言模型在空间理解等低级视觉技能上的表现,并验证了其有效性。
Details
Motivation: 视觉语言模型(VLMs)在空间理解、视角识别等低级视觉感知任务上仍存在不足,自然图像数据集对这些技能的监督有限,因此探索是否可通过任务关键词驱动的合成数据来弥补这一缺陷。 Method: 提出VisionFoundry——一个任务感知的合成数据生成流水线:仅输入任务名(如Depth Order),利用大语言模型(LLMs)自动生成问题、答案和文生图(T2I)提示,再通过T2I模型合成图像,并用专有VLM验证图文一致性,全程无需真实图像或人工标注;据此构建含10k样本的VisionFoundry-10K合成VQA数据集。 Result: 在MMVP和CV-Bench-3D视觉感知基准上分别提升+7%和+10%,同时保持模型通用能力,并展现良好的数据规模扩展性。 Conclusion: 有限的任务定向监督是当前VLM视觉感知瓶颈的重要成因,而靶向合成监督为实现更系统化的VLM训练提供了可行路径。 Abstract: Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.[143] Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery
Sara Ameli
Main category: cs.CV
TL;DR: 本文在SAR-RARP50数据集上对比了UNet、Attention UNet、DeepLabV3和SegFormer等五种深度学习模型用于机器人辅助手术中器械的多类别语义分割,采用Cross Entropy与Dice损失联合优化,发现DeepLabV3性能接近SegFormer,而Transformer架构在全局上下文建模上更具优势。
Details
Motivation: 准确分割机器人辅助手术中的手术器械对实现工具跟踪、手术流程分析和自主决策等上下文感知的计算机辅助干预至关重要。 Method: 在SAR-RARP50数据集上 benchmark 五种深度学习架构(UNet、DeepLabV3、Attention UNet、SegFormer等),使用Cross Entropy与Dice损失的复合损失函数训练模型,以缓解类别不平衡并精准捕获器械边界。 Result: Convolutional模型(如UNet、Attention UNet)表现稳健;DeepLabV3凭借空洞卷积与多尺度上下文聚合,性能接近SegFormer;SegFormer等Transformer架构在全局上下文理解与跨场景泛化能力上更优。 Conclusion: 本文为外科AI应用中的分割模型选型提供了全面对比与实用建议,强调了CNN与Transformer方法在精度、鲁棒性与计算开销间的权衡。 Abstract: Accurate segmentation of surgical instruments in robotic-assisted surgery is critical for enabling context-aware computer-assisted interventions, such as tool tracking, workflow analysis, and autonomous decision-making. In this study, we benchmark five deep learning architectures-UNet, UNet, DeepLabV3, Attention UNet, and SegFormer on the SAR-RARP50 dataset for multi-class semantic segmentation of surgical instruments in real-world radical prostatectomy videos. The models are trained with a compound loss function combining Cross Entropy and Dice loss to address class imbalance and capture fine object boundaries. Our experiments reveal that while convolutional models such as UNet and Attention UNet provide strong baseline performance, DeepLabV3 achieves results comparable to SegFormer, demonstrating the effectiveness of atrous convolution and multi-scale context aggregation in capturing complex surgical scenes. Transformer-based architectures like SegFormer further enhance global contextual understanding, leading to improved generalization across varying instrument appearances and surgical conditions. This work provides a comprehensive comparison and practical insights for selecting segmentation models in surgical AI applications, highlighting the trade-offs between convolutional and transformer-based approaches.[144] Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection
Yicheng Qiu,Keiji Yanai
Main category: cs.CV
TL;DR: 本文提出了一种基于状态空间模型(SSM)的新型视频人体动作检测框架,通过引入高效时空聚焦(ESTF)适配器和时序边界感知SSM(TB-SSM),缓解了传统CNN/Transformer在长视频中特征冗余与全局依赖建模弱的问题,显著提升了定位性能与鲁棒性。
Details
Motivation: 现有CNN和Transformer模型在处理长视频时存在特征冗余和全局时序依赖建模能力下降的问题,限制其在真实场景中的可扩展性;而状态空间模型(SSM)具备线性长期建模和强全局时序推理能力,值得重新探索其在时序动作检测中的应用。 Method: 构建基于SSM的新型时序动作检测框架,核心是将高效时空聚焦(ESTF)适配器插入预训练模型层中,其中融合了自研的时序边界感知SSM(TB-SSM)用于时序特征建模,并兼顾空间特征的高效处理。 Result: 在多个基准数据集上进行了全面定量分析,实验表明该方法在定位性能和鲁棒性方面均显著优于现有SSM-based及其他结构方法。 Conclusion: 将SSM特别是TB-SSM与适配器机制结合,能有效提升长视频中人体动作的时序定位能力,验证了SSM在该任务上的潜力与实用性。 Abstract: Temporal human action detection aims to identify and localize action segments within untrimmed videos, serving as a pivotal task in video understanding. Despite the progress achieved by prior architectures like CNN and Transformer models, these continue to struggle with feature redundancy and degraded global dependency modeling capabilities when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative with linear long-term modeling and robust global temporal reasoning capabilities. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module integrates the advantages of our proposed Temporal Boundary-aware SSM(TB-SSM) for temporal feature modeling with efficient processing of spatial features. We perform comprehensive and quantitative analyses across multiple benchmarks, comparing our proposed method against previous SSM-based and other structural methods. Extensive experiments demonstrate that our improved strategy significantly enhances both localization performance and robustness, validating the effectiveness of our proposed method.[145] MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
Henry Zheng,Chenyue Fang,Rui Huang,Siyuan Wei,Xiao Liu,Gao Huang
Main category: cs.CV
TL;DR: 本文提出MAG-3D,一种无需训练的多智能体框架,利用现成视觉语言模型(VLMs)实现三维场景中的接地推理,通过规划、接地和编码三个专家智能体协同完成任务分解、自由形式3D定位与几何推理验证。
Details
Motivation: 现有3D接地推理方法依赖领域内微调或手工设计的推理流程,泛化能力差,难以零样本迁移到新环境。 Method: 提出MAG-3D多智能体框架:包含规划智能体(任务分解与流程编排)、接地智能体(自由式3D定位与相关帧检索)和编码智能体(通过可执行程序进行灵活几何推理与显式验证),全程无需训练。 Result: 在多个具挑战性的基准测试上达到SOTA性能,支持跨多样场景的灵活、免训练3D接地推理。 Conclusion: MAG-3D验证了无需训练、基于多智能体协作的范式在3D视觉语言推理中的有效性与通用性,显著提升零样本泛化能力。 Abstract: Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.[146] ELT: Elastic Looped Transformers for Visual Generation
Sahil Goyal,Swayam Agrawal,Gautham Govind Anil,Prateek Jain,Sujoy Paul,Aditya Kusupati
Main category: cs.CV
TL;DR: 本文提出了Elastic Looped Transformers(ELT),一种基于循环Transformer架构的高效视觉生成模型,通过权重共享和Intra-Loop Self Distillation(ILSD)实现参数高效与任意时刻推理能力。
Details
Motivation: 解决传统生成模型参数量大、计算成本高的问题,追求在保持高质量合成的同时显著降低参数量。 Method: 采用循环、权重共享的Transformer块构建ELT;提出Intra-Loop Self Distillation(ILSD)策略,在单步训练中对不同循环次数的中间结果进行自蒸馏,确保深度一致性。 Result: ELT在ImageNet 256×256上FID达2.0,UCF-101视频生成上FVD为72.8;相比基线模型,同等推理计算下参数减少4倍。 Conclusion: ELT在参数效率与生成质量之间实现了新平衡,推动了视觉合成的效率前沿,并支持任意时刻推理的弹性部署。 Abstract: We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and FVD of $72.8$ on class-conditional UCF-101.[147] UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation
Le-Van Thai,Tien Dat Nguyen,Hoai Nhan Pham,Lan Anh Dinh Thi,Duy-Dong Nguyen,Ngoc Lam Quang Bui
Main category: cs.CV
TL;DR: 本文提出UniSemAlign框架,通过双模态语义对齐(原型级和文本级)提升病理图像半监督语义分割性能,在GlaS和CRAG数据集上显著优于现有方法。
Details
Motivation: 半监督语义分割在计算病理学中面临像素级标注稀缺和伪标签监督不可靠的挑战。 Method: 基于病理预训练Transformer编码器,构建包含原型级和文本级对齐分支的双模态语义对齐框架UniSemAlign,在共享嵌入空间中提供结构化引导,并融合对齐表征与视觉预测以生成更可靠的无标签图像监督;端到端联合优化监督分割、跨视图一致性与跨模态对齐目标。 Result: 在GlaS和CRAG数据集上,仅用10%标注数据时Dice分数分别提升2.6%和8.6%,20%标注下亦有显著提升。 Conclusion: UniSemAlign通过引入显式的类级结构信息增强像素级学习,有效缓解类别歧义、稳定伪标签优化,为病理图像半监督分割提供了新范式。 Abstract: Semi-supervised semantic segmentation in computational pathology remains challenging due to scarce pixel-level annotations and unreliable pseudo-label supervision. We propose UniSemAlign, a dual-modal semantic alignment framework that enhances visual segmentation by injecting explicit class-level structure into pixel-wise learning. Built upon a pathology-pretrained Transformer encoder, UniSemAlign introduces complementary prototype-level and text-level alignment branches in a shared embedding space, providing structured guidance that reduces class ambiguity and stabilizes pseudo-label refinement. The aligned representations are fused with visual predictions to generate more reliable supervision for unlabeled histopathology images. The framework is trained end-to-end with supervised segmentation, cross-view consistency, and cross-modal alignment objectives. Extensive experiments on the GlaS and CRAG datasets demonstrate that UniSemAlign substantially outperforms recent semi-supervised baselines under limited supervision, achieving Dice improvements of up to 2.6% on GlaS and 8.6% on CRAG with only 10% labeled data, and strong improvements at 20% supervision. Code is available at: https://github.com/thailevann/UniSemAlign[148] MixFlow: Mixed Source Distributions Improve Rectified Flows
Nazir Nayal,Christopher Wewer,Jan Eric Lenssen
Main category: cs.CV
TL;DR: 本文提出κ-FC和MixFlow方法,通过改进源分布与数据分布的对齐性来降低生成路径曲率,从而提升扩散模型采样效率和生成质量。
Details
Motivation: 扩散模型及其变体(如校正流)虽能生成高质量图像,但因学习到的生成路径高度弯曲,导致迭代采样速度慢;而高曲率的重要原因是标准高斯源分布与数据分布之间缺乏关联性。 Method: 首先提出κ-FC通用框架,使源分布依赖于一个对齐数据分布的任意信号κ;其次提出MixFlow训练策略,在固定无条件分布与κ-FC分布的线性混合上训练流模型。 Result: 在固定采样步数下,相比标准校正流FID提升12%,相比先前基线提升7%;同时加快训练收敛、提升生成质量并减少所需采样步数。 Conclusion: 通过改进源分布建模和混合训练策略,可有效缓解生成路径高曲率问题,显著提升扩散类模型的采样效率与生成性能。 Abstract: Diffusion models and their variations, such as rectified flows, generate diverse and high-quality images, but they are still hindered by slow iterative sampling caused by the highly curved generative paths they learn. An important cause of high curvature, as shown by previous work, is independence between the source distribution (standard Gaussian) and the data distribution. In this work, we tackle this limitation by two complementary contributions. First, we attempt to break away from the standard Gaussian assumption by introducing $κ\texttt{-FC}$, a general formulation that conditions the source distribution on an arbitrary signal $κ$ that aligns it better with the data distribution. Then, we present MixFlow, a simple but effective training strategy that reduces the generative path curvatures and considerably improves sampling efficiency. MixFlow trains a flow model on linear mixtures of a fixed unconditional distribution and a $κ\texttt{-FC}$-based distribution. This simple mixture improves the alignment between the source and data, provides better generation quality with less required sampling steps, and accelerates the training convergence considerably. On average, our training procedure improves the generation quality by 12\% in FID compared to standard rectified flow and 7\% compared to previous baselines under a fixed sampling budget. Code available at: $\href{https://github.com/NazirNayal8/MixFlow}{https://github.com/NazirNayal8/MixFlow}$[149] Vision Transformers for Preoperative CT-Based Prediction of Histopathologic Chemotherapy Response Score in High-Grade Serous Ovarian Carcinoma
Francesca Fati,Felipe Coutinho,Marika Reinius,Marina Rosanu,Gabriel Funingana,Luigi De Vitis,Gabriella Schivardi,Hannah Clayton,Alice Traversa,Zeyu Gao,Guilherme Penteado,Shangqi Gao,Francesco Pastori,Ramona Woitek,Maria Cristina Ghioni,Giovanni Damiano Aletti,Mercedes Jimenez-Linan,Sarah Burge,Nicoletta Colombo,Evis Sala,Maria Francesca Spadea,Timothy L. Kline,James D. Brenton,Jaime Cardoso,Francesco Multinu,Elena De Momi,Mireia Crispin-Ortuzar,Ines P. Machado
Main category: cs.CV
TL;DR: 本研究开发了一种基于Transformer的2.5D多模态深度学习模型,利用术前CT影像和临床数据预测高级别浆液性卵巢癌(HGSOC)患者对新辅助化疗(NACT)的组织病理学反应评分(CRS),在内部测试集上表现优异(AUC=0.95),外部验证集上中等(AUC=0.68),提示其作为术前无创决策支持工具的可行性。
Details
Motivation: CRS是评估HGSOC患者对NACT反应的有效病理标志物,但仅能在术后获得;亟需一种术前、无创、可预测CRS的方法以支持多学科团队(MDT)早期治疗决策。 Method: 提出一种2.5D多模态深度学习框架:使用预训练Vision Transformer编码器处理病灶密集的大网膜CT切片,并通过中间融合模块整合视觉表征与临床变量,最终预测CRS。 Result: 内部测试队列(IEO,n=41):ROC-AUC=0.95,准确率95%,精确率80%;外部测试队列(OV04,n=70):ROC-AUC=0.68,准确率67%,精确率75%。 Conclusion: 基于Transformer的多模态模型可利用常规术前CT和临床数据预测CRS,初步验证了其作为术前无创决策支持工具的可行性,有望辅助MDT进行早期治疗响应评估。 Abstract: Purpose. High-grade serous ovarian carcinoma (HGSOC) is characterized by pronounced biological and spatial heterogeneity and is frequently diagnosed at an advanced stage. Neoadjuvant chemotherapy (NACT) followed by delayed primary surgery is commonly employed in patients unsuitable for primary cytoreduction. The Chemotherapy Response Score (CRS) is a validated histopathological biomarker of response to NACT, but it is only available postoperatively. In this study, we investigate whether pre-treatment computed tomography (CT) imaging and clinical data can be used to predict CRS as an investigational decision-support adjunct to inform multidisciplinary team (MDT) discussions regarding expected treatment response. Methods. We proposed a 2.5D multimodal deep learning framework that processes lesion-dense omental slices using a pre-trained Vision Transformer encoder and integrates the resulting visual representations with clinical variables through an intermediate fusion module to predict CRS. Results. Our multimodal model, integrating imaging and clinical data, achieved a ROC-AUC of 0.95 alongside 95% accuracy and 80% precision on the internal test cohort (IEO, n=41 patients). On the external test set (OV04, n=70 patients), it achieved a ROC-AUC of 0.68, alongside 67% accuracy and 75% precision. Conclusion. These preliminary results demonstrate the feasibility of transformer-based deep learning for preoperative prediction of CRS in HGSOC using routine clinical data and CT imaging. As an investigational, pre-treatment decision-support tool, this approach may assist MDT discussions by providing early, non-invasive estimates of treatment response.[150] Globally Optimal Pose from Orthographic Silhouettes
Agniva Sengupta,Dilara Kuş,Jianning Li,Stefan Zachow
Main category: cs.CV
TL;DR: 本文提出了一种仅利用物体未遮挡轮廓(silhouette)就能全局最优估计三维姿态的方法,通过轮廓面积的连续性与椭圆拟合的长宽比作为形状签名,无需特征点对应关系,适用于任意形状(包括非凸、多孔物体)。
Details
Motivation: 现有方法通常依赖特征点对应或对物体形状有凸性/简单拓扑等限制,而仅用轮廓估计全局最优姿态仍具挑战。 Method: 利用轮廓面积在旋转空间中的连续性构建预计算的轮廓签名响应曲面,并结合投影轮廓拟合椭圆的长宽比作为辅助全局形状签名,实现分辨率引导的旋转空间分支搜索。 Result: 在合成与真实数据上验证了该方法显著优于同类方法,在精度和鲁棒性上均有提升,且首次实现了对任意形状(不限凸性与亏格)的高效全局最优姿态估计。 Conclusion: 仅基于轮廓即可实现全局最优三维姿态估计是可行的;连续性建模与多签名融合是关键,为无纹理、低纹理物体的姿态估计提供了新范式。 Abstract: We solve the problem of determining the pose of known shapes in $\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by correspondences, for any shape, irrespective of its convexity and genus. We validate our method on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches. Code, data, and supplementary in: https://agnivsen.github.io/pose-from-silhouette/[151] CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
Haoyu Zhao,Zihao Zhang,Jiaxi Gu,Haoran Chen,Qingping Zheng,Pin Tang,Yeyin Jin,Yuang Zhang,Junqi Cheng,Zenghui Lu,Peng Shu,Zuxuan Wu,Yu-Gang Jiang
Main category: cs.CV
TL;DR: 本文提出了一种名为CT-1的视觉-语言-相机模型,通过波尔特正则化损失和大规模数据集CT-200K,实现了对视频生成中相机轨迹的高精度控制,提升了相机控制准确率25.7%。
Details
Motivation: 现有方法在相机可控视频生成中存在相机控制不精确或依赖繁琐手动参数的问题,难以适用于自动化场景。 Method: 提出CT-1模型,融合视觉-语言模块与扩散Transformer,并引入频域下的小波正则化损失以学习复杂相机轨迹分布;构建CT-200K大规模数据集(4700万帧)支持训练;将估计的轨迹融入视频扩散模型实现空间感知的相机控制。 Result: 实验表明该框架显著提升了相机控制精度(较先前方法提高25.7%),生成高质量、符合用户意图的相机可控视频。 Conclusion: CT-1成功弥合了空间推理与视频合成之间的鸿沟,为自动化、高精度相机可控视频生成提供了新范式。 Abstract: Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.[152] Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception
Jiahao Wang,Zikun Xu,Yuner Zhang,Zhongwei Jiang,Chenyang Lu,Shuocheng Yang,Yuxuan Wang,Jiaru Zhong,Chuang Zhang,Shaobing Xu,Jianqiang Wang
Main category: cs.CV
TL;DR: 本文提出Long-SCOPE,一种面向长距离车路协同3D感知的全稀疏框架,通过几何引导查询生成和上下文感知关联模块,解决了现有方法在远距离下计算开销大、特征匹配脆弱的问题,在100–150米范围内达到SOTA性能。
Details
Motivation: 现有协同3D感知方法在长距离(如100–150米)下存在两大瓶颈:BEV稠密表示导致的二次计算复杂度,以及在观测与配准误差较大时特征关联机制鲁棒性差。 Method: 提出全稀疏框架Long-SCOPE,包含两个新模块:1)几何引导查询生成模块,提升对小而远目标的检测精度;2)可学习的上下文感知关联模块,增强在严重位置噪声下的协同查询匹配鲁棒性。 Result: 在V2X-Seq和Griffin数据集上验证,Long-SCOPE在100–150米长距场景下性能达SOTA,同时保持较低的计算与通信开销。 Conclusion: Long-SCOPE通过稀疏化建模与鲁棒关联机制,有效提升了远距离协同感知的实用性与可靠性,为V2X实际部署提供了可行路径。 Abstract: Cooperative 3D perception via Vehicle-to-Everything communication is a promising paradigm for enhancing autonomous driving, offering extended sensing horizons and occlusion resolution. However, the practical deployment of existing methods is hindered at long distances by two critical bottlenecks: the quadratic computational scaling of dense BEV representations and the fragility of feature association mechanisms under significant observation and alignment errors. To overcome these limitations, we introduce Long-SCOPE, a fully sparse framework designed for robust long-distance cooperative 3D perception. Our method features two novel components: a Geometry-guided Query Generation module to accurately detect small, distant objects, and a learnable Context-Aware Association module that robustly matches cooperative queries despite severe positional noise. Experiments on the V2X-Seq and Griffin datasets validate that Long-SCOPE achieves state-of-the-art performance, particularly in challenging 100-150 m long-range settings, while maintaining highly competitive computation and communication costs.[153] Adding Another Dimension to Image-based Animal Detection
Vandita Shukla,Fabio Remondino,Benjamin Risse
Main category: cs.CV
TL;DR: 本文提出了一种利用Skinned Multi Animal Linear模型和相机位姿优化算法,从单目RGB图像中生成鲁棒2D投影标签(含3D包围盒及面可见性度量)的pipeline,以填补单目动物3D检测标注数据缺失的空白,并在Animal3D数据集上验证了跨物种和场景的有效性。
Details
Motivation: 单目动物图像天然丢失3D结构信息,现有检测算法仅输出无朝向信息的2D包围框;而构建单目3D动物检测方法受限于缺乏带3D标注的大规模数据集,因3D标注需同步获取3D输入流与RGB数据。 Method: 提出一种新pipeline:1)采用Skinned Multi Animal Linear(SMAL)模型估计动物3D包围盒;2)结合专用相机位姿优化算法将3D包围盒稳健投影至2D图像空间;3)计算立方体面可见性指标以判断动物各侧面是否被拍摄到。 Result: 在Animal3D数据集上评估表明,该方法在不同动物种类和拍摄环境下均能实现高精度的3D包围盒估计与2D投影标签生成,所生成的3D包围盒及可见性指标可作为后续单目3D动物检测算法开发与评测的关键基础。 Conclusion: 该工作为单目RGB图像下的动物3D检测提供了首个可行的、无需真实3D标注即可生成高质量弱监督3D标签的自动化pipeline,显著缓解了领域内标注瓶颈问题,推动了单目3D动物感知的发展。 Abstract: Monocular imaging of animals inherently reduces 3D structures to 2D projections. Detection algorithms lead to 2D bounding boxes that lack information about animal's orientation relative to the camera. To build 3D detection methods for RGB animal images, there is a lack of labeled datasets; such labeling processes require 3D input streams along with RGB data. We present a pipeline that utilises Skinned Multi Animal Linear models to estimate 3D bounding boxes and to project them as robust labels into 2D image space using a dedicated camera pose refinement algorithm. To assess which sides of the animal are captured, cuboid face visibility metrics are computed. These 3D bounding boxes and metrics form a crucial step toward developing and benchmarking future monocular 3D animal detection algorithms. We evaluate our method on the Animal3D dataset, demonstrating accurate performance across species and settings.[154] SHIFT: Steering Hidden Intermediates in Flow Transformers
Nina Konovalova,Andrey Kuznetsov,Aibek Alanov
Main category: cs.CV
TL;DR: 本文提出SHIFT框架,通过在推理时针对性地操控DiT扩散模型的中间激活来实现概念移除,无需重新训练,兼具灵活性与高效性。
Details
Motivation: 现有DiT扩散模型虽生成质量高、提示遵循好,但缺乏对生成内容中特定视觉概念(如不想要的对象或风格)进行细粒度、无需重训练的干预能力。 Method: 受大语言模型中激活引导(activation steering)启发,SHIFT学习动态应用于选定层和时间步的引导向量,在推理时直接操纵中间特征以抑制/偏移/注入特定视觉概念。 Result: SHIFT在多种提示和目标概念上均实现了有效的概念移除、风格迁移与目标对象编辑,且保持图像质量与其余提示内容,无需耗时的模型微调或重训练。 Conclusion: SHIFT是一种轻量、通用、即插即用的推理时控制方法,显著拓展了DiT类扩散模型的可控生成能力。 Abstract: Diffusion models have become leading approaches for high-fidelity image generation. Recent DiT-based diffusion models, in particular, achieve strong prompt adherence while producing high-quality samples. We propose SHIFT, a simple but effective and lightweight framework for concept removal in DiT diffusion models via targeted manipulation of intermediate activations at inference time, inspired by activation steering in large language models. SHIFT learns steering vectors that are dynamically applied to selected layers and timesteps to suppress unwanted visual concepts while preserving the prompt's remaining content and overall image quality. Beyond suppression, the same mechanism can shift generations into a desired \emph{style domain} or bias samples toward adding or changing target objects. We demonstrate that SHIFT provides effective and flexible control over DiT generation across diverse prompts and targets without time-consuming retraining.[155] TinyNeRV: Compact Neural Video Representations via Capacity Scaling, Distillation, and Low-Precision Inference
Muhammad Hannan Akhtar,Ihab Amer,Tamer Shanableh
Main category: cs.CV
TL;DR: 本文系统研究了极简NeRV架构(NeRV-T和NeRV-T+),探索其在资源受限场景下的视频重建性能,并提出频率感知知识蒸馏与低精度推理策略以提升小模型表现。
Details
Motivation: 现有神经视频表示研究多聚焦于中高容量模型,对极简模型在受限环境中的行为缺乏充分探索。 Method: 提出两种轻量级NeRV架构(NeRV-T和NeRV-T+);引入频率感知焦点监督的知识蒸馏;评估后训练量化与量化感知训练对低精度推理的影响。 Result: 实验表明,精心设计的极简NeRV变体可在大幅降低参数量、计算开销和内存需求的同时,实现良好的质量-效率权衡。 Conclusion: 该研究揭示了紧凑型神经视频表示的实际性能边界,为NeRV类模型在资源受限与实时场景中的部署提供了实用指导。 Abstract: Implicit neural video representations encode entire video sequences within the parameters of a neural network and enable constant time frame reconstruction. Recent work on Neural Representations for Videos (NeRV) has demonstrated competitive reconstruction performance while avoiding the sequential decoding process of conventional video codecs. However, most existing studies focus on moderate or high capacity models, leaving the behavior of extremely compact configurations required for constrained environments insufficiently explored. This paper presents a systematic study of tiny NeRV architectures designed for efficient deployment. Two lightweight configurations, NeRV-T and NeRV-T+, are introduced and evaluated across multiple video datasets in order to analyze how aggressive capacity reduction affects reconstruction quality, computational complexity, and decoding throughput. Beyond architectural scaling, the work investigates strategies for improving the performance of compact models without increasing inference cost. Knowledge distillation with frequency-aware focal supervision is explored to enhance reconstruction fidelity in low-capacity networks. In addition, the impact of lowprecision inference is examined through both post training quantization and quantization aware training to study the robustness of tiny models under reduced numerical precision. Experimental results demonstrate that carefully designed tiny NeRV variants can achieve favorable quality efficiency trade offs while substantially reducing parameter count, computational cost, and memory requirements. These findings provide insight into the practical limits of compact neural video representations and offer guidance for deploying NeRV style models in resource constrained and real-time environments. The official implementation is available at https: //github.com/HannanAkhtar/TinyNeRV-Implementation.[156] Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation
Huiang He,Shengchu Zhao,Jianwen Huang,Jie Li,Jiaqi Wu,Hu Zhang,Pei Tang,Heliang Zheng,Yukun Li,Rongfei Jia
Main category: cs.CV
TL;DR: Hitem3D 2.0 是一种多视角引导的原生3D纹理生成框架,通过融合2D多视角生成先验与原生3D纹理表示,显著提升纹理完整性、跨视角一致性与几何对齐性。
Details
Motivation: 现有3D纹理生成方法存在纹理覆盖不全、跨视角不一致及几何-纹理错位等问题。 Method: 提出包含多视角合成框架和原生3D纹理生成模型的两阶段框架;前者基于预训练图像编辑骨干网络并加入几何对齐、跨视角一致性和光照均匀性模块;后者将多视角纹理投影至3D表面并合理补全不可见区域。 Result: 在纹理细节、保真度、一致性、连贯性和对齐性方面均优于现有方法。 Conclusion: Hitem3D 2.0 通过联合建模多视角一致性与原生3D纹理,有效解决了当前3D纹理生成的核心挑战。 Abstract: Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.[157] Neural Distribution Prior for LiDAR Out-of-Distribution Detection
Zizhao Li,Zhengkang Xiang,Jiayang Ao,Feng Liu,Joseph West,Kourosh Khoshelham
Main category: cs.CV
TL;DR: 本文提出Neural Distribution Prior (NDP)框架,通过建模网络预测的分布结构并自适应重加权OOD分数,解决LiDAR感知中因类别不平衡导致的OOD检测性能受限问题,并结合Perlin噪声合成策略提升OOD训练鲁棒性。
Details
Motivation: 现有LiDAR感知模型基于闭集假设,难以识别开放世界中的未知OOD物体;且当前OOD打分函数忽略LiDAR OOD检测中固有的严重类别不平衡问题,假设类别分布均匀,导致性能受限。 Method: 提出Neural Distribution Prior(NDP)框架:1)建模网络logit分布模式,学习分布先验;2)通过注意力模块校正类别依赖的置信度偏差;3)引入基于Perlin噪声的OOD合成策略,从输入点云生成多样化辅助OOD样本,实现无需外部数据的鲁棒OOD训练。 Result: 在SemanticKITTI和STU基准上实验表明,NDP显著提升OOD检测性能,在STU测试集上达到61.31%的点级AP,比此前最优结果高出10倍以上;且兼容多种现有OOD打分方法。 Conclusion: NDP为开放世界LiDAR感知提供了一种有效、通用且无需额外数据集的OOD检测解决方案,尤其适用于类别极度不平衡的实际场景。 Abstract: LiDAR-based perception is critical for autonomous driving due to its robustness to poor lighting and visibility conditions. Yet, current models operate under the closed-set assumption and often fail to recognize unexpected out-of-distribution (OOD) objects in the open world. Existing OOD scoring functions exhibit limited performance because they ignore the pronounced class imbalance inherent in LiDAR OOD detection and assume a uniform class distribution. To address this limitation, we propose the Neural Distribution Prior (NDP), a framework that models the distributional structure of network predictions and adaptively reweights OOD scores based on alignment with a learned distribution prior. NDP dynamically captures the logit distribution patterns of training data and corrects class-dependent confidence bias through an attention-based module. We further introduce a Perlin noise-based OOD synthesis strategy that generates diverse auxiliary OOD samples from input scans, enabling robust OOD training without external datasets. Extensive experiments on the SemanticKITTI and STU benchmarks demonstrate that NDP substantially improves OOD detection performance, achieving a point-level AP of 61.31\% on the STU test set, which is more than 10$\times$ higher than the previous best result. Our framework is compatible with various existing OOD scoring formulations, providing an effective solution for open-world LiDAR perception.[158] FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding
Kaidong Feng,Zhuoxuan Huang,Huizhong Guo,Yuting Jin,Xinyu Chen,Yue Liang,Yifei Gai,Li Zhou,Yunshan Ma,Zhu Sun
Main category: cs.CV
TL;DR: 本文提出了FashionStylist,一个由时尚专家标注的基准数据集,旨在支持整体性、专家级的时尚理解,涵盖服装定位、搭配补全与搭配评估三大任务。
Details
Motivation: 现有时尚数据集碎片化、任务单一,难以支撑对风格、场合、兼容性及搭配逻辑等多维度的整体理解。 Method: 构建了由时尚专家参与的标注流程,提供细粒度(单品级)和整体(套装级)的专业标注,并设计三项代表性任务:outfit-to-item grounding、outfit completion、outfit evaluation。 Result: FashionStylist被验证为统一的多任务基准,并可有效提升MLLM在服装定位、搭配补全及语义评估方面的能力。 Conclusion: FashionStylist填补了专家级、整体性时尚理解基准的空白,推动多任务协同建模与大模型在时尚领域的深度应用。 Abstract: Fashion understanding requires both visual perception and expert-level reasoning about style, occasion, compatibility, and outfit rationale. However, existing fashion datasets remain fragmented and task-specific, often focusing on item attributes, outfit co-occurrence, or weak textual supervision, and thus provide limited support for holistic outfit understanding. In this paper, we introduce FashionStylist, an expert-annotated benchmark for holistic and expert-level fashion understanding. Constructed through a dedicated fashion-expert annotation pipeline, FashionStylist provides professionally grounded annotations at both the item and outfit levels. It supports three representative tasks: outfit-to-item grounding, outfit completion, and outfit evaluation. These tasks cover realistic item recovery from complex outfits with layering and accessories, compatibility-aware composition beyond co-occurrence matching, and expert-level assessment of style, season, occasion, and overall coherence. Experimental results show that FashionStylist serves not only as a unified benchmark for multiple fashion tasks, but also as an effective training resource for improving grounding, completion, and outfit-level semantic evaluation in MLLM-based fashion systems.[159] Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
Yuqin Lan,Gen Li,Yuanze Hu,Weihao Shen,Zhaoxin Fan,Faguo Wu,Xiao Zhang,Laurence T. Yang,Zhiming Zheng
Main category: cs.CV
TL;DR: 本文提出Mosaic框架,通过多视角集成优化缓解异构代理-目标设置下的代理依赖问题,提升对商业闭源视觉语言模型的多模态越狱攻击效果。
Details
Motivation: 现有基于梯度的多模态越狱攻击在同构开源代理-目标设置下有效,但在异构闭源商业VLM上效果不明;作者发现并命名了‘代理依赖’现象,即攻击性能严重依赖特定代理模型和视觉视角。 Method: 提出Mosaic多视图集成优化框架,包含三个模块:文本侧变换模块(扰动拒绝敏感词汇)、多视图图像优化模块(在多种裁剪视角下更新扰动)、代理集成引导模块(聚合多个代理VLM的优化信号)。 Result: 在安全基准测试中,Mosaic在商用闭源VLM上实现了当前最优的攻击成功率(ASR)和平均毒性(Average Toxicity)。 Conclusion: Mosaic有效缓解了异构设置下的代理依赖问题,提升了对闭源VLM的跨模型、跨视角鲁棒攻击能力,为多模态安全评估提供了新方法。 Abstract: Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.[160] Beyond Segmentation: Structurally Informed Facade Parsing from Imperfect Images
Maciej Janicki,Aleksander Plocharski,Przemyslaw Musialski
Main category: cs.CV
TL;DR: 本文提出了一种在YOLOv8中引入轻量级对齐损失的方法,以提升建筑立面检测的结构一致性,从而支持后续程序化重建。
Details
Motivation: 标准目标检测器独立处理建筑构件,导致立面解析缺乏结构一致性,难以支持下游的程序化重建。 Method: 在YOLOv8训练目标中加入自定义的轻量级对齐损失,通过正则化促使边界框在网格上保持一致排列,注入几何先验而不改变推理流程。 Result: 在CMP数据集上的实验表明,该方法有效提升了结构规整性,校正了由透视和遮挡引起的对齐误差,同时可在结构精度与检测精度间实现可控权衡。 Conclusion: 所提对齐损失是一种简单而有效的改进方式,能在不增加推理开销的前提下显著增强建筑立面检测的结构合理性。 Abstract: Standard object detectors typically treat architectural elements independently, often resulting in facade parsings that lack the structural coherence required for downstream procedural reconstruction. We address this limitation by augmenting the YOLOv8 training objective with a custom lightweight alignment loss. This regularization encourages grid-consistent arrangements of bounding boxes during training, effectively injecting geometric priors without altering the standard inference pipeline. Experiments on the CMP dataset demonstrate that our method successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion while maintaining a controllable trade-off with standard detection accuracy.[161] GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic
Jiayuan Lu,Rengan Xie,Xuancheng Jin,Zhizhen Wu,Qi Ye,Tian Xie,Hujun Bao,Rui Wang. Yuchi Huo
Main category: cs.CV
TL;DR: 本文提出GeRM模型,通过多模态生成式渲染弥合物理渲染(PBR)与照片级真实感渲染(PRR)之间的差距(P2P),结合G-buffer、文本提示和渐进式注入,在保持几何一致性和可控性的同时实现从物理保真到感知真实感的平滑过渡。
Details
Motivation: 物理渲染(PBR)虽保证物理正确性,但照片级真实感(PRR)还需高保真的几何与外观建模;当前显式PBR受限于难以获取真实数字模型,而隐式生成模型又牺牲可控性与几何一致性,形成P2P鸿沟。 Method: 构建首个P2P配对数据集P2P-50K(基于多智能体VLM框架标注);提出多条件ControlNet学习分布转移向量场(DTV Field),以G-buffer、文本提示和区域增强线索为条件,实现PBR图像到PRR图像的渐进式生成。 Result: 首次实现PBR与PRR的统一建模;GeRM支持用户在物理保真与感知真实感之间连续可控调节;在定性与定量评估中均展现出优于现有方法的几何一致性与视觉真实感。 Conclusion: P2P鸿沟是通向真正光追级真实感的关键瓶颈;GeRM验证了融合物理先验与生成先验的多模态渲染范式可行性,为可控、一致、真实的生成式渲染开辟新路径。 Abstract: For decades, Physically-Based Rendering (PBR) is the fundation of synthesizing photorealisitic images, and therefore sometimes roughly referred as Photorealistic Rendering (PRR). While PBR is indeed a mathematical simulation of light transport that guarantees physical reality, photorealism has additional reliance on the realistic digital model of geometry and appearance of the real world, leaving a barely explored gap from PBR to PRR (P2P). Consequently, the path toward photorealism faces a critical dilemma: the explicit simulation of PRR encumbered by unreachable realistic digital models for real-world existence, while implicit generation models sacrifice controllability and geometric consistency. Based on this insight, this paper presents the problem, data, and approach of mitigating P2P gap, followed by the first multi-modal generative rendering model, dubbed GeRM, to unify PBR and PRR. GeRM integrates physical attributes like G-buffers with text prompts, and progressive incremental injection to generate controllable photorealistic images, allowing users to fluidly navigate the continuum between strict physical fidelity and perceptual photorealism. Technically, we model the transition between PBR and PRR images as a distribution transfer and aim to learn a distribution transfer vector field (DTV Field) to guide this process. To define the learning objective, we first leverage a multi-agent VLM framework to construct an expert-guided pairwise P2P transfer dataset, named P2P-50K, where each paired sample in the dataset corresponds to a transfer vector in the DTV Field. Subsequently, we propose a multi-condition ControlNet to learn the DTV Field, which synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions.[162] VAGNet: Vision-based accident anticipation with global features
Vipooshan Vipulananthan,Charith D. Chitraranjan
Main category: cs.CV
TL;DR: 本文提出VAGNet,一种基于全局场景特征的事故预测模型,利用VideoMAE-V2提取全局特征,结合Transformer与图模块,在多个基准数据集上实现了更高精度、更长预警时间及更低计算开销。
Details
Motivation: 现有事故预测方法依赖计算密集的显式目标级特征提取,难以满足实时性要求;同时,真实驾驶场景复杂,需更高效鲁棒的建模方式。 Method: 提出VAGNet:采用VideoMAE-V2提取视频全局特征,融合Transformer模块建模时序动态,引入图模块建模交通参与者间关系,端到端学习事故预测,避免显式目标检测与跟踪。 Result: 在DAD、DoTA、DADA和Nexar四个基准数据集上,相比现有方法,平均精度(AP)更高、平均预警时间(MTTA)更长,且计算效率显著提升。 Conclusion: 仅依赖全局交通场景特征即可实现高精度、低延迟的事故预测,验证了摒弃显式目标建模的可行性与优势,为轻量级实时驾驶安全系统提供了新思路。 Abstract: Traffic accidents are a leading cause of fatalities and injuries across the globe. Therefore, the ability to anticipate hazardous situations in advance is essential. Automated accident anticipation enables timely intervention through driver alerts and collision avoidance maneuvers, forming a key component of advanced driver assistance systems. In autonomous driving, such predictive capabilities support proactive safety behaviors, such as initiating defensive driving and human takeover when required. Using dashcam video as input offers a cost-effective solution, but it is challenging due to the complexity of real-world driving scenes. Accident anticipation systems need to operate in real-time. However, current methods involve extracting features from each detected object, which is computationally intensive. We propose VAGNet, a deep neural network that learns to predict accidents from dash-cam video using global features of traffic scenes without requiring explicit object-level features. The network consists of transformer and graph modules, and we use the vision foundation model VideoMAE-V2 for global feature extraction. Experiments on four benchmark datasets (DAD, DoTA, DADA, and Nexar) show that our method anticipates accidents with higher average precision and mean time-to-accident while being computationally more efficient compared to existing methods.[163] Structure-Aware Fine-Grained Gaussian Splatting for Expressive Avatar Reconstruction
Yuze Su,Hongsong Wang,Jie Gui,Liang Wang
Main category: cs.CV
TL;DR: 本文提出了一种名为Structure-aware Fine-grained Gaussian Splatting (SFGS)的新方法,用于从单目视频中重建逼真且拓扑感知的全身3D人类虚拟形象,通过空间三平面与时间六平面结合、结构感知高斯模块及手部残差细化模块,实现了高保真、自然运动和细节丰富的建模。
Details
Motivation: 现有单目视频驱动的3D人体虚拟形象建模方法虽能较好捕捉身体运动,但难以准确建模手部动作和面部表情等精细细节,亟需兼顾几何结构、动态一致性和细粒度表达的新方法。 Method: 提出SFGS方法:1)联合使用空间三平面(spatial-only triplane)和时间六平面(time-aware hexplane)表征跨帧动态特征;2)设计结构感知高斯模块,以姿态依赖方式保持空间一致性并增强姿态与纹理表达;3)引入基于细粒度手部重建的残差细化模块;整个流程为单阶段训练。 Result: 在定量与定性评估中均优于当前最优方法,生成的3D虚拟形象具有高保真度、自然运动连贯性及精细的手部/面部细节;代码已开源。 Conclusion: SFGS有效解决了单目视频下全身3D人体建模中结构一致性、动态建模与细粒度表达的协同难题,为实时、轻量、高质量虚拟人重建提供了新思路。 Abstract: Reconstructing photorealistic and topology-aware human avatars from monocular videos remains a significant challenge in the fields of computer vision and graphics. While existing 3D human avatar modeling approaches can effectively capture body motion, they often fail to accurately model fine details such as hand movements and facial expressions. To address this, we propose Structure-aware Fine-grained Gaussian Splatting (SFGS), a novel method for reconstructing expressive and coherent full-body 3D human avatars from a monocular video sequence. The SFGS use both spatial-only triplane and time-aware hexplane to capture dynamic features across consecutive frames. A structure-aware gaussian module is designed to capture pose-dependent details in a spatially coherent manner and improve pose and texture expression. To better model hand deformations, we also propose a residual refinement module based on fine-grained hand reconstruction. Our method requires only a single-stage training and outperforms state-of-the-art baselines in both quantitative and qualitative evaluations, generating high-fidelity avatars with natural motion and fine details. The code is on Github: https://github.com/Su245811YZ/SFGS[164] From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection
Narges Rashvand,Shanle Yao,Armin Danesh Pazho,Babak Rahimi Ardabili,Hamed Tabkhi
Main category: cs.CV
TL;DR: 本文提出了一种面向事件的视频异常检测(VAD)新范式,指出传统帧级评估严重高估模型在实际部署中的性能,并构建了首个基于事件的评估标准,揭示了当前SOTA模型在事件级定位上表现极差。
Details
Motivation: 传统帧级评估无法反映真实监控系统中对连贯异常事件(具有明确起止时间)的检测需求,导致模型性能被系统性高估。 Method: 1)审计主流VAD数据集的事件结构;2)提出两种事件定位策略:基于分数精炼的层级高斯平滑+自适应二值化流程,以及端到端双分支事件检测模型;3)采用时序动作定位指标(如tIoU匹配、多阈值F1)建立首个事件级评估标准。 Result: 在NWPUC数据集上,所有SOTA模型帧级AUC-ROC超52%,但事件级定位精度在tIoU=0.2时低于10%,平均事件级F1仅为0.11。 Conclusion: 帧级评估与实际应用脱节,亟需转向事件中心范式;所提评估标准揭示了当前方法在事件级检测上的严重不足,为未来研究提供新基准和方向。 Abstract: Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event, a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction, and as a result, they systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on the NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU=0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at https://github.com/TeCSAR-UNCC/EventCentric-VAD.[165] LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
Aytaç Sekmen,Fatih Emre Gunes,Furkan Horoz,Hüseyin Umut Işık,Mehmet Alp Ozaydin,Onur Altay Topaloglu,Şahin Umutcan Üstündaş,Yurdasen Alp Yeni,Halil Ersin Soken,Erol Sahin,Ramazan Gokberk Cinbis,Sinan Kalkan
Main category: cs.CV
TL;DR: 本文提出了LuMon基准框架,用于评估月球探测中的单目深度估计(MDE)方法,引入了嫦娥三号真实任务和CHERI暗模拟数据集,并系统评估了现有模型在真实月面场景下的性能与域迁移瓶颈。
Details
Motivation: 将地面单目深度估计模型直接部署到月球面临严重域差距问题,包括强阴影、无纹理月壤和零大气散射;而现有评估依赖无法复现真实条件且缺乏真实度量真值的类比数据。 Method: 构建LuMon基准框架,包含嫦娥三号真实立体视觉深度真值数据和CHERI暗模拟数据集;开展零样本跨域评估;并建立基于合成数据微调基础模型的sim-to-real域自适应基线。 Result: 当前SOTA模型在真实月面图像上泛化能力差;sim-to-real微调虽提升合成数据性能,但对真实月面图像几乎无迁移效果,揭示显著跨域迁移鸿沟。 Conclusion: 揭示了现有MDE网络在月球场景中的固有局限,为地外感知与域适应研究提供了新基准与方向指引。 Abstract: Monocular Depth Estimation (MDE) is crucial for autonomous lunar rover navigation using electro-optical cameras. However, deploying terrestrial MDE networks to the Moon brings a severe domain gap due to harsh shadows, textureless regolith, and zero atmospheric scattering. Existing evaluations rely on analogs that fail to replicate these conditions and lack actual metric ground truth. To address this, we present LuMon, a comprehensive benchmarking framework to evaluate MDE methods for lunar exploration. We introduce novel datasets featuring high-quality stereo ground truth depth from the real Chang'e-3 mission and the CHERI dark analog dataset. Utilizing this framework, we conduct a systematic zero-shot evaluation of state-of-the-art architectures across synthetic, analog, and real datasets. We rigorously assess performance against mission critical challenges like craters, rocks, extreme shading, and varying depth ranges. Furthermore, we establish a sim-to-real domain adaptation baseline by fine tuning a foundation model on synthetic data. While this adaptation yields drastic in-domain performance gains, it exhibits minimal generalization to authentic lunar imagery, highlighting a persistent cross-domain transfer gap. Our extensive analysis reveals the inherent limitations of current networks and sets a standard foundation to guide future advancements in extraterrestrial perception and domain adaptation.[166] Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors
Ying Zang,Yidong Han,Chaotao Ding,Yuanqi Hu,Deyi Ji,Qi Zhu,Xuanfu Li,Jin Ma,Lingyun Sun,Tianrun Chen,Lanyun Zhu
Main category: cs.CV
TL;DR: 本文提出了一种用于动态4D场景重建的新框架,通过在重建各阶段建模不确定性,解耦动态与静态成分,包含三个核心机制,在动态基准上显著优于现有方法。
Details
Motivation: 3D基础模型(如VGGT)在静态场景中表现优异,但在动态序列中因运动导致的几何模糊而性能下降,亟需能处理动态不确定性的重建方法。 Method: 提出一种不确定性感知框架,包含:(1) 熵引导子空间投影,自适应聚合多头注意力分布以分离运动线索;(2) 局部一致性驱动的几何净化,通过邻域约束消除结构异常;(3) 不确定性感知的跨视角一致性,将多视角深度优化建模为异方差最大似然估计,并用深度置信度加权。 Result: 在动态基准上,平均精度误差降低13.43%,分割F值提升10.49%;保持前馈推理效率,无需任务微调或逐场景优化。 Conclusion: 所提框架有效缓解了动态4D重建中的几何歧义问题,通过多阶段不确定性建模实现高精度、高鲁棒且高效的动态场景理解。 Abstract: Reconstructing dynamic 4D scenes is an important yet challenging task. While 3D foundation models like VGGT excel in static settings, they often struggle with dynamic sequences where motion causes significant geometric ambiguity. To address this, we present a framework designed to disentangle dynamic and static components by modeling uncertainty across different stages of the reconstruction process. Our approach introduces three synergistic mechanisms: (1) Entropy-Guided Subspace Projection, which leverages information-theoretic weighting to adaptively aggregate multi-head attention distributions, effectively isolating dynamic motion cues from semantic noise; (2) Local-Consistency Driven Geometry Purification, which enforces spatial continuity via radius-based neighborhood constraints to eliminate structural outliers; and (3) Uncertainty-Aware Cross-View Consistency, which formulates multi-view projection refinement as a heteroscedastic maximum likelihood estimation problem, utilizing depth confidence as a probabilistic weight. Experiments on dynamic benchmarks show that our approach outperforms current state-of-the-art methods, reducing Mean Accuracy error by 13.43\% and improving segmentation F-measure by 10.49\%. Our framework maintains the efficiency of feed-forward inference and requires no task-specific fine-tuning or per-scene optimization.[167] EpiAgent: An Agent-Centric System for Ancient Inscription Restoration
Shipeng Zhu,Ang Chen,Na Nie,Pengfei Fang,Min-Ling Zhang,Hui Xue
Main category: cs.CV
TL;DR: 本文提出EpiAgent,一种基于大语言模型的代理系统,通过模拟人类金石学家的工作流程,实现对古代铭文的分层规划式修复,显著提升了修复质量与泛化能力。
Details
Motivation: 古代铭文因长期环境和人为因素导致严重退化,其图文交织的完整性修复是数字文化遗产保护中最严峻的挑战之一;现有AI方法依赖固定流水线,难以应对复杂多样的真实退化情况。 Method: 受人类金石学家协作流程启发,提出EpiAgent代理系统,将铭文修复建模为分层规划问题;采用观察-构想-执行-再评估范式,由LLM中央规划器协调多模态分析、历史经验、专用修复工具及迭代自优化模块。 Result: 在真实退化铭文数据上,EpiAgent在修复质量和跨退化类型泛化能力上均优于现有方法。 Conclusion: EpiAgent代表了面向专家级文化遗产修复的代理驱动范式的重要进展,推动AI从单次处理迈向灵活、自适应的智能协同修复。 Abstract: Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at https://github.com/blackprotoss/EpiAgent.[168] Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing
Zhuohan Ouyang,Zhe Qian,Wenhuo Cui,Chaoqun Wang
Main category: cs.CV
TL;DR: 本文提出RC-GRPO-Editing,一种面向流式图像编辑的区域约束GRPO后训练框架,通过区域解耦噪声扰动和注意力集中奖励,提升指令遵循性与非目标区域保持能力。
Details
Motivation: 现有基于GRPO的奖励驱动后训练方法在指令引导图像编辑中存在信用分配噪声问题,即全局探索会干扰非目标区域,导致组内奖励方差增大、GRPO优势估计不准。 Method: 提出RC-GRPO-Editing框架:1)采用区域解耦的初始噪声扰动实现局部探索,抑制背景引入的奖励方差;2)设计注意力集中奖励,使跨注意力机制在整个采样过程中对齐目标编辑区域。 Result: 在CompBench基准上实验表明,该方法显著提升了编辑区域的指令遵循性和非目标内容的保持能力。 Conclusion: 区域约束是提升流式模型在确定性ODE采样下指令编辑性能的关键,RC-GRPO-Editing为高保真、可控图像编辑提供了更鲁棒的后训练范式。 Abstract: Instruction-guided image editing requires balancing target modification with non-target preservation. Recently, flow-based models have emerged as a strong and increasingly adopted backbone for instruction-guided image editing, thanks to their high fidelity and efficient deterministic ODE sampling. Building on this foundation, GRPO-based reward-driven post-training has been explored to directly optimize editing-specific rewards, improving instruction following and editing consistency. However, existing methods often suffer from noisy credit assignment: global exploration also perturbs non-target regions, inflating within-group reward variance and yielding noisy GRPO advantages. To address this, we propose RC-GRPO-Editing, a region-constrained GRPO post-training framework for flow-based image editing under deterministic ODE sampling. It suppresses background-induced nuisance variance to enable cleaner localized credit assignment, improving editing region instruction adherence while preserving non-target content. Concretely, we localize exploration via region-decoupled initial noise perturbations to reduce background-induced reward variance and stabilize GRPO advantages, and introduce an attention concentration reward that aligns cross-attention with the intended editing region throughout the rollout, reducing unintended changes in non-target regions. Experiments on CompBench show consistent improvements in editing region instruction adherence and non-target preservation.[169] EGLOCE: Training-Free Energy-Guided Latent Optimization for Concept Erasure
Junyeong Ahn,Seojin Yoon,Sungyong Baik
Main category: cs.CV
TL;DR: 本文提出了一种无需训练的推理阶段概念擦除方法EGLOCE,通过双能量引导(排斥+保留)在潜在空间中优化采样过程,实现对特定概念(如违规或版权内容)的安全、可控、高质量擦除。
Details
Motivation: 现有文本到图像扩散模型的概念擦除方法存在需重训练、损害无关概念保真度或推理时引导效果弱等问题,亟需一种高效、安全、即插即用的训练-free擦除方案。 Method: 提出Energy-Guided Latent Optimization for Concept Erasure (EGLOCE),在推理阶段对噪声潜在表示进行双目标优化:利用排斥能量梯度下降驱离目标概念,同时用保留能量维持与原始提示的语义对齐;全程不修改模型权重。 Result: EGLOCE在多个基线模型上显著提升概念擦除效果,同时保持图像质量与提示一致性,并对对抗攻击鲁棒;支持即插即用部署。 Conclusion: EGLOCE首次确立了基于采样过程中双能量引导的安全可控图像生成新范式,为扩散模型合规化提供了高效、无损、训练-free的解决方案。 Abstract: As text-to-image diffusion models grow increasingly prevalent, the ability to remove specific concepts-mostly explicit content and many copyrighted characters or styles-has become essential for safety and compliance. Existing unlearning approaches often require costly re-training, modify parameters at the cost of degradation of unrelated concept fidelity, or depend on indirect inference-time adjustment that compromise the effectiveness of concept erasure. Inspired by the success of energy-guided sampling for preservation of the condition of diffusion models, we introduce Energy-Guided Latent Optimization for Concept Erasure (EGLOCE), a training-free approach that removes unwanted concepts by re-directing noisy latent during inference. Our method employs a dual-objective framework: a repulsion energy that steers generation away from target concepts via gradient descent in latent space, and a retention energy that preserves semantic alignment to the original prompt. Combined with previous approaches that either require erroneous modified model weights or provide weak inference-time guidance, EGLOCE operates entirely at inference and enhances erasure performance, enabling plug-and-play integration. Extensive experiments demonstrate that EGLOCE improves concept removal while maintaining image quality and prompt alignment across baselines, even with adversarial attacks. To the best of our knowledge, our work is the first to establish a new paradigm for safe and controllable image generation through dual energy-based guidance during sampling.[170] SynFlow: Scaling Up LiDAR Scene Flow Estimation with Synthetic Data
Qingwen Zhang,Xiaomeng Zhu,Chenhan Jiang,Patric Jensfelt
Main category: cs.CV
TL;DR: 本文提出SynFlow,一种基于可扩展仿真生成大规模合成LiDAR场景流数据集的管道,旨在解决真实世界中密集高质量运动标注稀缺的问题;其生成的数据集SynFlow-4k在零样本迁移和小样本微调下均显著提升3D运动估计性能,并开源以推动通用化研究。
Details
Motivation: 可靠3D动态感知受限于真实世界中密集、高质量运动标注的稀缺;虽有自监督方法利用无标签真实数据,但因代理信号噪声大,单纯扩大数据规模难以弥补性能差距。 Method: 提出完全从可扩展仿真中学习鲁棒真实世界运动先验的新范式;设计SynFlow数据生成管道,采用运动导向策略合成涵盖多样化运动模式的大规模LiDAR场景流数据集SynFlow-4k(4000序列,约94万帧),标注量达现有真实基准的34倍。 Result: SynFlow-4k提供高度域不变的运动先验:零样本下模型在nuScenes上媲美全监督基线,在TruckScenes上超越SOTA方法31.8%;仅用5%真实标签微调即可超越全量真实标签从头训练的模型。 Conclusion: 基于仿真生成高质量运动先验是提升3D动态感知泛化能力的有效路径;SynFlow为通用化、标签高效及跨域迁移的3D运动估计提供了新基础与开源资源。 Abstract: Reliable 3D dynamic perception requires models that can anticipate motion beyond predefined categories, yet progress is hindered by the scarcity of dense, high-quality motion annotations. While self-supervision on unlabeled real data offers a path forward, empirical evidence suggests that scaling unlabeled data fails to close the performance gap due to noisy proxy signals. In this paper, we propose a shift in paradigm: learning robust real-world motion priors entirely from scalable simulation. We introduce SynFlow, a data generation pipeline that generates large-scale synthetic dataset specifically designed for LiDAR scene flow. Unlike prior works that prioritize sensor-specific realism, SynFlow employs a motion-oriented strategy to synthesize diverse kinematic patterns across 4,000 sequences ($\sim$940k frames), termed SynFlow-4k. This represents a 34x scale-up in annotated volume over existing real-world benchmarks. Our experiments demonstrate that SynFlow-4k provides a highly domain-invariant motion prior. In a zero-shot regime, models trained exclusively on our synthetic data generalize across multiple real-world benchmarks, rivaling in-domain supervised baselines on nuScenes and outperforming state-of-the-art methods on TruckScenes by 31.8%. Furthermore, SynFlow-4k serves as a label-efficient foundation: fine-tuning with only 5% of real-world labels surpasses models trained from scratch on the full available budget. We open-source the pipeline and dataset to facilitate research in generalizable 3D motion estimation. More detail can be found at https://kin-zhang.github.io/SynFlow.[171] PhysInOne: Visual Physics Learning and Reasoning in One Suite
Siyuan Zhou,Hejun Wang,Hu Cheng,Jinxi Li,Dongsheng Wang,Junwei Jiang,Yixiao Jin,Jiayue Huang,Shiwei Mao,Shangjia Liu,Yafei Yang,Hongkang Song,Shenxing Wei,Zihui Zhang,Peng Huang,Shijie Liu,Zhengli Hao,Hao Li,Yitian Li,Wenqi Zhou,Zhihan Zhao,Zongqi He,Hongtao Wen,Shouwang Huang,Peng Yun,Bowen Cheng,Pok Kazaf Fu,Wai Kit Lai,Jiahao Chen,Kaiyuan Wang,Zhixuan Sun,Ziqi Li,Haochen Hu,Di Zhang,Chun Ho Yuen,Bing Wang,Zhihua Wang,Chuhang Zou,Bo Yang
Main category: cs.CV
TL;DR: PhysInOne is a large-scale synthetic dataset with 2 million videos across 153,810 dynamic 3D scenes, covering 71 physical phenomena and offering rich ground-truth annotations; it improves physical plausibility in AI models but reveals gaps in modeling complex dynamics and intrinsic properties.
Details
Motivation: To address the critical scarcity of physically-grounded training data for AI systems, especially for world models requiring deep understanding of physics. Method: Constructing a large-scale synthetic dataset—PhysInOne—with diverse, multi-object dynamic 3D scenes, complex backgrounds, and comprehensive ground-truth annotations (3D geometry, semantics, motion, physical properties, text). Result: Fine-tuning foundation models on PhysInOne significantly improves physical plausibility in video generation, future frame prediction, physical property estimation, and motion transfer; however, key limitations in modeling complex dynamics and estimating intrinsic properties are exposed. Conclusion: PhysInOne sets a new benchmark as the largest physics-grounded dataset to date, enabling advances in physics-aware generation, simulation, and embodied AI, while highlighting open challenges in physical reasoning. Abstract: We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.[172] Do Vision Language Models Need to Process Image Tokens?
Sambit Ghosh,R. Venkatesh Babu,Chirag Agarwal
Main category: cs.CV
TL;DR: 本文系统研究了视觉语言模型(VLMs)中图像token的功能角色,发现视觉表征在浅层即快速收敛至低复杂度稳定态,而文本表征持续深度演化;视觉表征层间可互换,深层视觉处理对多数任务非必需,挑战了当前VLM需全程密集视觉处理的默认假设。
Details
Motivation: 现有VLMs普遍采用全程密集图像token处理,但其必要性及视觉表征是否随网络深度持续演化尚不明确,亟需系统性实证检验。 Method: 通过分析视觉表征的熵、本征维数与轨迹曲率等指标,量化其随网络深度的演化规律;开展层间视觉截断实验,对比单/多token预测任务性能变化;考察确定性解码下中间推理路径与最终输出对视觉深度减少的敏感性差异。 Result: 视觉表征在浅层即达稳定(熵稳定、本征维压缩、曲率恒定),层间高度可互换;单token预测对视觉截断鲁棒,多token生成依赖持续视觉输入;减少视觉深度显著扰动中间推理路径,但对最终输出影响较小。 Conclusion: 深层视觉处理并非VLMs性能的普遍前提,其必要性高度依赖具体任务;当前VLM架构中强制全程视觉处理的设计存在冗余,应转向更高效、任务自适应的视觉-语言协同建模范式。 Abstract: Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational overhead), it remains fundamentally unclear whether sustained image-token processing is necessary for their performance or visual representations meaningfully evolve from early to later layers. In this work, we systematically investigate the functional role of image tokens in VLMs and show that visual representations rapidly converge to a bounded-complexity regime, \ie their entropy stabilizes, intrinsic dimensionality compresses, and trajectory curvature approaches a near-constant profile. In contrast, textual representations continue to undergo substantial restructuring across depth. Once stabilized, visual representations become largely interchangeable between layers, indicating limited additional transformation in deeper stages. Further, depth-wise visual truncation reveals that the necessity of visual processing is task-dependent, where single-token predictions remain comparatively robust to truncated visual depth, but multi-token generation require sustained access to visual representations. Under deterministic decoding, reducing visual depth perturbs intermediate reasoning trajectories more strongly than final outputs, suggesting that image tokens influence the structure of reasoning more than the ultimate conclusions. Collectively, these findings \textbf{question the assumption} that deeper visual processing is uniformly essential in VLMs, challenging the current paradigm of multimodal LLM architectures.[173] Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
Wonbong Jang,Shikun Liu,Soubhik Sanyal,Juan Camilo Perez,Kam Woh Ng,Sanskar Agrawal,Juan-Manuel Perez-Rua,Yiannis Douratsos,Tao Xiang
Main category: cs.CV
TL;DR: 本文提出Rays as Pixels,一种视频扩散模型(VDM),联合建模视频与相机轨迹,通过‘raxel’(射线像素)表示相机,并采用解耦自交叉注意力机制联合去噪,实现视频生成、相机轨迹预测及二者联合生成三大任务,并通过闭环自洽性测试验证其一致性。
Details
Motivation: 传统上图像重建相机参数与新视角渲染被视作独立任务,但在图像覆盖稀疏或位姿模糊时,二者相互依赖,需统一建模。 Method: 提出Rays as Pixels视频扩散模型,将相机表示为密集射线像素(raxels),并利用Decoupled Self-Cross Attention机制对raxels与视频帧进行联合去噪;支持三种推理模式:从视频预测轨迹、从输入图像联合生成视频与轨迹、按目标轨迹生成视频。 Result: 模型在位姿估计和相机控制视频生成任务上取得良好效果;闭环自洽性测试表明其前向(视频→轨迹)与逆向(轨迹→视频)预测高度一致;轨迹预测仅需极少去噪步数即可达到自洽。 Conclusion: 联合建模视频与相机轨迹的扩散框架是可行且有效的,显著提升稀疏条件下的几何-外观协同推理能力,为神经渲染与SLAM等方向提供新思路。 Abstract: Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation, even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.[174] SCoRe: Clean Image Generation from Diffusion Models Trained on Noisy Images
Yuta Matsuzaki,Seiichi Uchida,Shumpei Takezaki
Main category: cs.CV
TL;DR: 本文提出SCoRe方法,在不重新训练扩散模型的情况下,通过频谱截断和SDEdit再生来修复由噪声数据训练的扩散模型生成的图像,显著提升生成质量。
Details
Motivation: 扩散模型在噪声数据集上训练时容易复现高频训练伪影,严重降低生成质量。 Method: SCoRe是一种无需训练、仅在生成阶段使用的频谱再生方法:利用扩散模型的频谱偏差特性,先对生成图像进行高频成分截断,再通过SDEdit再生被抑制的高频部分;并基于RAPSD理论推导出截断频率与SDEdit初始化时间步之间的映射关系,避免再生过程中注入过多噪声。 Result: 在CIFAR-10(合成)和SIDD(真实世界)噪声数据集上的实验表明,SCoRe显著优于后处理及抗噪基线方法,能在不重训练或微调的前提下,使生成样本更接近干净图像分布。 Conclusion: SCoRe为利用噪声数据训练的扩散模型提供了一种高效、通用且无需再训练的清洁生成方案,验证了频谱视角在扩散模型修复中的有效性。 Abstract: Diffusion models trained on noisy datasets often reproduce high-frequency training artifacts, significantly degrading generation quality. To address this, we propose SCoRe (Spectral Cutoff Regeneration), a training-free, generation-time spectral regeneration method for clean image generation from diffusion models trained on noisy images. Leveraging the spectral bias of diffusion models, which infer high-frequency details from low-frequency cues, SCoRe suppresses corrupted high-frequency components of a generated image via a frequency cutoff and regenerates them via SDEdit. Crucially, we derive a theoretical mapping between the cutoff frequency and the SDEdit initialization timestep based on Radially Averaged Power Spectral Density (RAPSD), which prevents excessive noise injection during regeneration. Experiments on synthetic (CIFAR-10) and real-world (SIDD) noisy datasets demonstrate that SCoRe substantially outperforms post-processing and noise-robust baselines, restoring samples closer to clean image distributions without any retraining or fine-tuning.[175] AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization
Mohammad Omama,Gabriele Berton,Eric Foxlin,Yelin Kim
Main category: cs.CV
TL;DR: AsymLoc提出一种非对称视觉定位框架,通过教师-学生模型蒸馏,在保持高精度的同时大幅降低计算开销,适用于边缘设备。
Details
Motivation: 在AR/VR和机器人等应用中,资源受限的边缘设备(如智能眼镜)亟需低功耗、高精度、实时的视觉定位方法;现有高效模型仍需进一步压缩计算量而不损精度。 Method: 提出AsymLoc蒸馏框架:教师模型离线处理数据库图像,学生模型在线处理查询图像;采用几何驱动匹配目标与联合检测器-描述子蒸馏目标,实现无需参数的快速最近邻匹配。 Result: 在HPatches、ScanNet、IMC2022和Aachen数据集上,AsymLoc达到教师模型95%的定位精度,模型体积缩小一个数量级,显著超越现有基线,建立新的精度-效率权衡SOTA。 Conclusion: AsymLoc成功实现了轻量学生模型与重型教师模型间的有效特征对齐,为边缘端高精度视觉定位提供了实用可行的新范式。 Abstract: Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be a primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers. We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching. Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher's localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.[176] Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement
Zhengxian Yang,Shengqi Wang,Shi Pan,Hongshuai Li,Haoxiang Wang,Lin Li,Guanjun Li,Zhengqi Wen,Borong Lin,Jianhua Tao,Tao Yu
Main category: cs.CV
TL;DR: 本文提出了一种新型沉浸式体视频(IVV)格式及配套构建方法,包括多模态数据集ImViD和基于高斯时空表示的动态光场与声场联合重建框架,实现了高质量、大范围6自由度音视频交互。
Details
Motivation: 现有虚拟/增强现实中,基于真实世界视频捕获构建完全沉浸式(6-DoF音视频融合)体验仍属空白,亟需新格式与新方法支持。 Method: 构建了空间导向的多视角多模态数据集ImViD(含5K/60FPS视频与同步音频),并提出基于高斯的动态光场重建框架(含光流引导稀疏初始化、联合相机时序标定、多目标时空监督)及首个面向多视角音视频数据的声场重建方法。 Result: 在基准测试与VR实验中,所提管线生成了高质量、时间稳定、大6-DoF交互空间的音视频体内容,显著优于现有方法。 Conclusion: 本工作首次定义了沉浸式体视频,并提供了从数据采集到音视频联合重建的完整实用构建范式,为真实感沉浸媒体奠定了基础。 Abstract: Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos, a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground--background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.[177] Incremental Semantics-Aided Meshing from LiDAR-Inertial Odometry and RGB Direct Label Transfer
Muhammad Affan,Ville Lehtola,George Vosselman
Main category: cs.CV
TL;DR: 本文提出了一种RGB+LiDAR融合的增量式语义辅助高保真网格重建方法,通过视觉基础模型为RGB帧打标签,并将标签逐帧投影融合到LiDAR惯性地图中,再结合语义感知的TSDF融合生成高质量网格,显著提升了复杂室内场景(如文化建筑)的几何重建精度与边界完整性。
Details
Motivation: 在大型复杂室内环境(如文化建筑)中,纯LiDAR惯性扫描重建面临点云稀疏、几何漂移和固定融合参数导致的空洞、过平滑及伪表面等问题。 Method: 构建模块化、增量式的RGB+LiDAR管线:利用视觉基础模型对每帧RGB图像进行语义标注;将语义标签按扫描帧逐步投影并融合至LiDAR惯性里程计地图;采用增量式语义感知的TSDF融合,最后通过Marching Cubes生成网格。 Result: 在Oxford Spires数据集上定量评估显示,该方法在几何指标上优于ImMesh和Voxblox等SOTA几何基线;NTU VIRAL数据集上的定性分析也验证了其重建质量提升;输出为语义标注的高保真网格,可直接用于USD资产生成及XR/数字建模。 Conclusion: 语义引导的增量融合策略能有效缓解LiDAR稀疏性和漂移带来的几何歧义,显著提升重建边界完整性与整体几何保真度,为室内场景数字化提供了实用可行的新路径。 Abstract: Geometric high-fidelity mesh reconstruction from LiDAR-inertial scans remains challenging in large, complex indoor environments -- such as cultural buildings -- where point cloud sparsity, geometric drift, and fixed fusion parameters produce holes, over-smoothing, and spurious surfaces at structural boundaries. We propose a modular, incremental RGB+LiDAR pipeline that generates incremental semantics-aided high-quality meshes from indoor scans through scan frame-based direct label transfer. A vision foundation model labels each incoming RGB frame; labels are incrementally projected and fused onto a LiDAR-inertial odometry map; and an incremental semantics-aware Truncated Signed Distance Function (TSDF) fusion step produces the final mesh via marching cubes. This frame-level fusion strategy preserves the geometric fidelity of LiDAR while leveraging rich visual semantics to resolve geometric ambiguities at reconstruction boundaries caused by LiDAR point-cloud sparsity and geometric drift. We demonstrate that semantic guidance improves geometric reconstruction quality; quantitative evaluation is therefore performed using geometric metrics on the Oxford Spires dataset, while results from the NTU VIRAL dataset are analyzed qualitatively. The proposed method outperforms state-of-the-art geometric baselines ImMesh and Voxblox, demonstrating the benefit of semantics-aided fusion for geometric mesh quality. The resulting semantically labelled meshes are of value when reconstructing Universal Scene Description (USD) assets, offering a path from indoor LiDAR scanning to XR and digital modeling.[178] Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model
Shunkai Zhou,Zike Yan,Fei Xue,Dong Wu,Yuchen Deng,Hongbin Zha
Main category: cs.CV
TL;DR: Online3R提出一种在线学习的三维重建框架,通过引入可学习的轻量级视觉提示和局部-全局自监督策略,在不破坏预训练几何基础模型能力的前提下,实现对新场景的高效自适应重建。
Details
Motivation: 解决现有方法在新场景中因模型固定导致的重建不一致问题,同时应对测试时缺乏真值标注和需高效率更新的挑战。 Method: 在冻结的预训练几何基础模型中嵌入可学习的轻量视觉提示,并设计局部-全局自监督学习策略:局部一致性约束作用于中间及历史局部融合结果以生成高质量伪真值;全局一致性约束作用于稀疏长跨度关键帧以实现轨迹级高效优化。 Result: 在多个基准上显著超越现有最先进方法。 Conclusion: Online3R验证了轻量提示微调与局部-全局自监督协同可有效提升在线三维重建的泛化性、一致性和效率。 Abstract: We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: https://shunkaizhou.github.io/online3r-1.0/[179] VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
Yucheng Shen,Jiulong Wu,Jizhou Huang,Dawei Yin,Lingyong Yan,Min Cao
Main category: cs.CV
TL;DR: 本文提出VISOR框架,通过结构化证据空间、视觉动作评估与修正机制以及动态轨迹策略,解决现有视觉检索增强生成系统中视觉证据稀疏和长视野搜索漂移两大瓶颈,显著提升长视野视觉推理任务的性能与效率。
Details
Motivation: 现有视觉检索增强生成(VRAG)系统在处理复杂多步推理查询时面临两个关键瓶颈:一是视觉证据稀疏性,即关键证据分散于多页且细粒度图像信息需精准视觉操作;二是长视野搜索漂移,即累积的视觉token导致上下文稀释与认知过载。 Method: 提出VISOR单智能体框架,包含三方面创新:(1)结构化证据空间支持渐进式跨页推理;(2)视觉动作评估与修正机制保障操作准确性;(3)带滑动窗口与意图注入的动态轨迹策略抑制搜索漂移,并结合状态掩码与信用分配的GRPO强化学习训练流程。 Result: 在ViDoSeek、SlideVQA和MMLongBench三个基准上,VISOR实现了长视野视觉推理任务的最先进性能,同时具备更高推理效率。 Conclusion: VISOR有效缓解了视觉证据稀疏与搜索漂移问题,为长视野、多步视觉语言推理提供了可扩展、鲁棒的统一框架。 Abstract: Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval.. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. They anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.[180] RIRF: Reasoning Image Restoration Framework
Wending Yan,Rongkai Zhang,Kaihua Tang,Yu Cheng,Qiankun Liu
Main category: cs.CV
TL;DR: 本文提出Reason and Restore(R&R)框架,将结构化思维链(CoT)推理引入通用图像恢复(UIR)任务,通过多模态大模型Qwen3-VL显式诊断退化类型、严重程度及场景语义,并将退化严重度作为强化学习信号指导恢复器,在保持SOTA性能的同时提供可解释性。
Details
Motivation: 现有通用图像恢复方法主要关注像素重建,缺乏对退化组成、严重程度和场景语义的显式诊断推理,导致恢复过程缺乏可解释性和针对性。 Method: 提出R&R框架:1)使用微调后的Qwen3-VL作为显式reasoner,执行结构化CoT推理,诊断退化类型、量化严重度、推断相关因素并描述场景语义;2)将量化严重度作为强化学习信号指导restorer;3)紧密耦合高层语义推理与底层像素恢复。 Result: 在多个UIR基准上达到SOTA性能,并提供细粒度、可解释的诊断先验与恢复过程分析。 Conclusion: R&R首次将结构化多步推理深度融入通用图像恢复流程,证明了语义诊断与像素恢复联合建模的有效性与必要性,兼顾性能提升与过程可解释性。 Abstract: Universal image restoration (UIR) aims to recover clean images from diverse and unknown degradations using a unified model. Existing UIR methods primarily focus on pixel reconstruction and often lack explicit diagnostic reasoning over degradation composition, severity, and scene semantics prior to restoration. We propose Reason and Restore (R\&R), a novel framework that integrates structured Chain-of-Thought (CoT) reasoning into the image restoration pipeline. R\&R introduces an explicit reasoner, implemented by fine-tuning Qwen3-VL, to diagnose degradation types, quantify degradation severity, infer key degradation-related factors, and describe relevant scene and object semantics. The resulting structured reasoning provides interpretable and fine-grained diagnostic priors for the restorer. To further improve restoration quality, the quantified degradation severity produced by the reasoner is leveraged as reinforcement learning (RL) signals to guide and strengthen the restorer. Unlike existing multimodal LLM-based agentic systems that decouple reasoning from low-level vision tasks, R\&R tightly couples semantic diagnostic reasoning with pixel-level restoration in a unified framework. Extensive experiments across diverse UIR benchmarks demonstrate that R\&R achieves state-of-the-art performance while offering unique interpretability into the restoration process.[181] Envisioning the Future, One Step at a Time
Stefan Andreas Baumann,Jannik Wiese,Tommaso Martorella,Mahdi M. Kalayeh,Björn Ommer
Main category: cs.CV
TL;DR: 本文提出了一种基于稀疏点轨迹的自回归扩散模型,用于开放集场景动态预测,通过短程局部可预测的步进推理建模不确定性增长,在保持物理合理性和长程一致性的同时,实现快速、多样化的未来预测。
Details
Motivation: 现有方法多依赖密集视频或潜在空间预测,耗费大量计算资源在外观细节而非稀疏运动轨迹上,导致长时程、多模态运动预测效率低、扩展性差。 Method: 将未来场景动态预测建模为稀疏点轨迹的逐步推理;采用自回归扩散模型,对轨迹进行短程、局部可预测的推进,并显式建模时间维度上的不确定性增长;引入OWM开放集运动预测基准进行评估。 Result: 在预测精度上媲美或超越密集仿真器,采样速度提升数量级;支持单图生成数千种多样化未来轨迹,支持运动约束引导,且保持物理合理性和长程一致性。 Conclusion: 以动力学为中心的稀疏轨迹表示显著提升了开放集未来预测的可扩展性与实用性,为复杂真实场景下的多模态长期预测提供了新范式。 Abstract: Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.[182] Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise
Zibin Geng,Xuefeng Jiang,Jia Li,Zheng Li,Tian Wen,Lvhua Wu,Sheng Sun,Yuwei Wang,Min Liu
Main category: cs.CV
TL;DR: 本文提出VisPrompt,一种轻量级、鲁棒的视觉引导提示学习框架,通过跨模态注意力将视觉语义反向注入提示表示,并引入条件调制机制自适应控制视觉信息注入强度,从而在标签噪声下提升提示学习的鲁棒性。
Details
Motivation: 视觉内容比文本标签更鲁棒,但传统提示学习对标签噪声敏感;如何利用稳定视觉信息增强提示学习的鲁棒性是核心动机。 Method: 提出VisPrompt框架:1)用跨模态注意力将视觉语义反向注入提示token;2)引入轻量级条件调制机制,根据样本视觉线索质量自适应调节视觉信息注入强度。 Result: 在合成与真实噪声标签下,VisPrompt在7个基准数据集上显著优于现有方法,有效抑制噪声干扰、降低提示更新不稳定性、缓解对错标样本的记忆,且仅需冻结VLM主干并添加少量可训练参数。 Conclusion: VisPrompt验证了利用实例级视觉证据锚定提示学习可显著提升其在标签噪声下的鲁棒性,为参数高效、鲁棒的视觉语言模型微调提供了新思路。 Abstract: Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is less investigated. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of injecting visual information for all samples, despite differences in the quality of their visual cues, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses the noise-induced disturbances, reduce instability in prompt updates, and alleviate memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small amount of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.[183] EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
Lulin Liu,Dayou Li,Yiqing Liang,Sicong Jiang,Hitesh Vijay,Hezhen Hu,Xuhai Xu,Zirui Liu,Srinivas Shakkottai,Manling Li,Zhiwen Fan
Main category: cs.CV
TL;DR: 本文提出EgoTL框架,通过'先说后做'的思维 aloud 采集流程,结合空间度量校准与多粒度标注,构建高质量的具身智能数据集,用于评估和提升视觉语言模型(VLM)与世界模型在长时程家庭任务中的规划、推理与空间接地能力。
Details
Motivation: 现有基于VLM的自动标注因缺乏准确的人类动作标签、链式推理(CoT)和空间标注而噪声大,尤其在长时程空间指令执行中错误被放大;根本原因在于分钟级日常家庭规划任务覆盖不足及空间接地不准确。 Method: 提出EgoTL:构建‘说-before-做’的自我报告式采集流水线,记录带词级时间戳的逐步目标与口语化推理;引入度量尺度空间估计器校准物理属性、记忆库式场景漫游获取上下文、片段级导航与操作标签;据此构建多维度评测基准并开展模型微调。 Result: 在100+日常家庭任务、分钟级长序列上,于三层六维任务上系统评测VLM与世界模型,发现其仍难以胜任具身助手或开放世界模拟器;经EgoTL人类CoT与度量标签对齐微调后,模型在长时程规划、分步推理、指令跟随与空间接地方面显著提升。 Conclusion: 高质量、多模态、时空对齐的人类行为数据(如EgoTL)是提升具身智能模型真实世界能力的关键;仅依赖大规模预训练不足以解决细粒度、长时程、物理约束下的具身推理问题。 Abstract: Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.[184] Tango: Taming Visual Signals for Efficient Video Large Language Models
Shukang Yin,Sirui Zhao,Hanchao Wang,Baozhi Jia,Xianquan Wang,Chaoyou Fu,Enhong Chen
Main category: cs.CV
TL;DR: 本文提出Tango框架,通过多样性驱动的注意力选择和时空旋转位置编码(ST-RoPE)改进视频大模型中的token剪枝,显著提升效率与性能保留率。