cs.CL

[1] Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces

Manas Pathak, Xingyao Chen, Shuozhe Li, Amy Zhang, Liu Leqi

Main category: cs.CL

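TL;DR: This paper proposes the Filtered Reasoning Score (FRS), an evaluation method that measures the quality of a large language model's (LLM's) reasoning process rather than relying on final-answer accuracy alone. FRS scores reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality, and computes the score only over the top-K% most confident traces, which separates models with similar accuracy but different reasoning ability more robustly.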

Motivation: Outcome-based evaluation does not reflect the quality of a model's reasoning process: high accuracy may stem from memorization or over-optimization rather than reliable reasoning. A metric is therefore needed that assesses reasoning quality directly and is robust to prompt and generation configurations. Method: Proposes a multi-dimensional reasoning quality score (faithfulness, coherence, utility, factuality) and the Filtered Reasoning Score (FRS), which computes that score only on the top-K% most confident reasoning traces, avoiding the coincidental effects of low-confidence correct traces. Result: Under FRS, models with similar accuracy differ significantly in reasoning quality; models with higher FRS also show higher accuracy and reasoning quality on other reasoning benchmarks, indicating that FRS reflects transferable reasoning ability. Conclusion: FRS complements standard accuracy, characterizes LLM reasoning ability more faithfully, and supports the development and evaluation of more reliable, interpretable models. Abstract: Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over-optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome-based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long-horizon settings, where the number of possible trajectories grows rapidly, and low-confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top-K% most confident traces. Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality.
Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model's transferable reasoning capabilities. We open source our evaluation codebase: https://github.com/Manas2006/benchmark_reproducibility.
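The aggregation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Trace` fields, the assumption that confidence and reasoning score are already available as numbers, and the rounding rule for K% are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Trace:
    confidence: float       # assumed: model confidence for this sampled trace
    reasoning_score: float  # assumed: aggregate quality score for this trace


def filtered_reasoning_score(traces, top_k_percent=20.0):
    """Mean reasoning score over the top-K% most confident traces."""
    if not traces:
        raise ValueError("need at least one trace")
    if not 0 < top_k_percent <= 100:
        raise ValueError("top_k_percent must be in (0, 100]")
    # Rank by confidence, highest first, and keep at least one trace.
    ranked = sorted(traces, key=lambda t: t.confidence, reverse=True)
    keep = max(1, ceil(len(ranked) * top_k_percent / 100))
    kept = ranked[:keep]
    return sum(t.reasoning_score for t in kept) / len(kept)


# Toy data: the two most confident traces have scores 0.8 and 0.6.
traces = [Trace(0.9, 0.8), Trace(0.8, 0.6), Trace(0.3, 0.1),
          Trace(0.2, 0.9), Trace(0.1, 0.2)]
frs = filtered_reasoning_score(traces, top_k_percent=40)
naive = filtered_reasoning_score(traces, top_k_percent=100)
```

With `top_k_percent=100` the function reduces to the naive average that the abstract argues against, which makes the two aggregations easy to compare on the same traces.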

[2] Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora

Main category: cs.CL

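TL;DR: SD-Zero is a self-distillation method that needs neither an external teacher nor high-quality demonstrations. A single model acts as both generator and reviser, and binary rewards are turned into dense token-level self-supervision, yielding substantial gains on math and code reasoning.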

Motivation: Existing post-training methods for verifiable settings are limited: reinforcement learning (RLVR) provides only sparse supervision, while distillation depends on external supervision that is costly or unavailable. Method: Proposes Self-Distillation Zero (SD-Zero): a single model plays both the Generator (producing an initial response) and the Reviser (producing an improved response conditioned on that response and its binary reward); on-policy self-distillation then distills the Reviser into the Generator, using the Reviser's token distributions conditioned on the generated response and its reward as supervision. Result: With Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves over the base models by at least 10% on math and code reasoning benchmarks and outperforms strong baselines including RFT, GRPO, and SDFT; ablations reveal two novel characteristics, token-level self-localization and iterative self-evolution. Conclusion: SD-Zero converts sparse binary rewards into dense token-level self-supervision, substantially improves sample efficiency, achieves significant gains without external supervision, and shows potential for self-diagnosis and continued self-improvement. Abstract: Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget.
Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.
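To make "dense token-level supervision" concrete, the following sketch computes a per-token distillation loss between a reviser's and a generator's next-token distributions. It is an illustration only: the KL direction (reviser as target), the averaging over positions, and the toy three-word vocabulary are assumptions, not details confirmed by the abstract.

```python
from math import log


def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one token position over a shared vocabulary."""
    return sum(
        p * (log(p + eps) - log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def self_distillation_loss(reviser_dists, generator_dists):
    """Average per-token KL across all positions of one response."""
    if len(reviser_dists) != len(generator_dists):
        raise ValueError("both models must score the same token positions")
    per_token = [token_kl(r, g) for r, g in zip(reviser_dists, generator_dists)]
    return sum(per_token) / len(per_token), per_token


# Toy example: the two roles agree on position 0 and disagree on position 1.
reviser = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
generator = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
loss, per_token = self_distillation_loss(reviser, generator)
```

The per-token values show why such a signal is denser than a single binary reward: positions where the reviser disagrees with the generator receive a nonzero loss, while agreed positions contribute nothing.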

[3] LLMs Struggle with Abstract Meaning Comprehension More Than Expected

Hamoud Alhazmi, Jiachen Jiang

Main category: cs.CL

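TL;DR: This paper studies abstract meaning comprehension and finds that large language models perform poorly in zero-shot, one-shot, and few-shot settings, while fine-tuned models such as BERT and RoBERTa perform better. It proposes a bidirectional attention classifier inspired by human cognition that improves abstract concept comprehension.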

Motivation: Abstract words, with their non-concrete, high-level semantics, have long been difficult for natural language understanding, and effective methods are needed to improve model comprehension of abstract concepts. Method: Proposes a bidirectional attention classifier inspired by human cognitive strategies that dynamically models the interaction between passages and options, combined with fine-tuned pre-trained language models (such as BERT and RoBERTa) for the abstract concept selection task. Result: On SemEval-2021 Task 4 (ReCAM), accuracy improves by 4.06% on Task 1 and 3.41% on Task 2. Conclusion: Fine-tuned models handle abstract meaning comprehension better than large language models, and a cognitively inspired bidirectional attention mechanism strengthens modeling of abstract concepts. Abstract: Understanding abstract meanings is crucial for advanced language comprehension. Despite extensive research, abstract words remain challenging due to their non-concrete, high-level semantics. SemEval-2021 Task 4 (ReCAM) evaluates models' ability to interpret abstract concepts by presenting passages with questions and five abstract options in a cloze-style format. Key findings include: (1) Most large language models (LLMs), including GPT-4o, struggle with abstract meaning comprehension under zero-shot, one-shot, and few-shot settings, while fine-tuned models like BERT and RoBERTa perform better. (2) A proposed bidirectional attention classifier, inspired by human cognitive strategies, enhances fine-tuned models by dynamically attending to passages and options. This approach improves accuracy by 4.06 percent on Task 1 and 3.41 percent on Task 2, demonstrating its potential for abstract meaning comprehension.

[4] Benchmarking Deflection and Hallucination in Large Vision-Language Models

Nicholas Moratelli, Christopher Davis, Leonardo F. R. Ribeiro, Bill Byrne, Gonzalo Iglesias

Main category: cs.CL

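TL;DR: This paper introduces VLM-DeflectionBench, a benchmark for evaluating the retrieval dependence and deflection (refusal-to-answer) behaviour of large vision-language models in knowledge-intensive multimodal question answering under conflicting or insufficient evidence, together with a dynamic data curation pipeline and a fine-grained evaluation protocol.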

Motivation: Existing benchmarks overlook conflicts between visual and textual evidence and the ability of models to deflect when knowledge is incomplete, and they become obsolete quickly as LVLM training data grows. Method: Proposes a dynamic data curation pipeline that maintains benchmark difficulty; builds VLM-DeflectionBench with 2,775 samples; defines a fine-grained evaluation protocol covering four scenarios that separates parametric memorization from retrieval robustness. Result: Experiments on 20 state-of-the-art LVLMs show that models generally fail to deflect under noisy or misleading evidence. Conclusion: Evaluation should cover both what models know and how they behave when they do not know; the benchmark offers a reusable, extensible tool for reliable KB-VQA evaluation. Abstract: Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer...) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.
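One simple way to read "filtering for genuinely retrieval-dependent samples" is to drop every question a model already answers correctly without retrieval. The sketch below illustrates that reading only; the sample schema, the exact-match comparison, and the `toy_model` stand-in are hypothetical, and the paper's actual pipeline is likely more elaborate.

```python
def normalize(text):
    """Lowercase and collapse whitespace for a crude exact-match comparison."""
    return " ".join(text.lower().split())


def keep_retrieval_dependent(samples, closed_book_answer):
    """Keep samples the model cannot answer from parametric memory alone."""
    kept = []
    for sample in samples:
        prediction = closed_book_answer(sample["question"])
        if normalize(prediction) != normalize(sample["answer"]):
            kept.append(sample)
    return kept


def toy_model(question):
    """Hypothetical stand-in for an LVLM queried without retrieval."""
    memorized = {"capital of france?": "Paris"}
    return memorized.get(question.lower(), "Sorry, I cannot answer.")


samples = [
    {"question": "Capital of France?", "answer": "paris"},
    {"question": "Who painted the mural in this photo?", "answer": "Jane Doe"},
]
kept = keep_retrieval_dependent(samples, toy_model)
```

Re-running such a filter against newer models is what would keep a benchmark from becoming obsolete as training sets grow.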

[5] Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

Xin Liu, Lu Wang

Main category: cs.CL

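TL;DR: This paper proposes CURE, a framework that improves the factual accuracy of long-form LLM generation by reasoning about uncertainty at the claim level. It introduces a claim-aware reasoning protocol and a multi-stage training pipeline, and supports selective abstention from uncertain claims at inference time.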

Motivation: Existing methods do not let the model assess the reliability of individual parts of its output, so it may still state incorrect claims confidently; a single scalar confidence cannot capture how uncertainty varies across claims in long-form text. Method: Proposes CURE, comprising (1) a Claim-Aware Reasoning Protocol that structures outputs into atomic claims with explicit confidence estimates; (2) a multi-stage training pipeline that first aligns confidence with claim correctness and then optimizes factuality; and (3) selective prediction using the calibrated confidence, that is, abstaining from uncertain claims. Result: On four long-form factuality benchmarks, CURE consistently outperforms strong supervised and RL baselines, improving claim-level accuracy by up to 39.9% on Biography generation and AUROC by 16.0% on FactBench while maintaining factual recall. Conclusion: By modeling and calibrating uncertainty at claim granularity, CURE improves both the factual accuracy and the confidence calibration of long-form generation, offering a route to more controllable and trustworthy output. Abstract: Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but they do not teach the model to estimate which parts of its generation are reliable. As a result, models may still state incorrect claims confidently in their responses. Recent advances in reasoning have significantly improved LLM performance, and have been leveraged to estimate confidence by incorporating calibration into RL objectives. However, existing approaches remain limited to a single scalar confidence for the entire response, which is insufficient for long-form generation where uncertainty varies across individual claims. To mitigate this problem, we propose CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. We first introduce a Claim-Aware Reasoning Protocol, which structures outputs into atomic claims paired with explicit confidence estimates. We then develop a multi-stage training pipeline that aligns model confidence with claims' correctness and then optimizes on factuality. The resulting calibrated confidence further enables selective prediction, allowing the model to abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show that CURE consistently improves factual accuracy over competitive supervised and RL baselines, while maintaining factual recall. In particular, it improves claim-level accuracy by up to 39.9% on Biography generation.
These gains are accompanied by improved calibration, as reflected by a 16.0% increase in AUROC on FactBench.
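Selective prediction over claim-level confidences can be sketched as a simple threshold filter. This is a hedged illustration rather than CURE's implementation: the `(text, confidence)` pair format, the threshold value, and the example claims are all assumptions.

```python
def select_claims(claims, threshold=0.5):
    """Split (text, confidence) pairs into kept claims and abstained claims."""
    kept = [text for text, conf in claims if conf >= threshold]
    abstained = [text for text, conf in claims if conf < threshold]
    return kept, abstained


# Hypothetical atomic claims with model-reported confidences.
claims = [
    ("Ada Lovelace was born in 1815.", 0.95),
    ("She wrote notes on the Analytical Engine.", 0.90),
    ("She attended a university in Paris.", 0.20),
]
kept, abstained = select_claims(claims, threshold=0.6)
```

Raising the threshold trades factual recall for precision, which is why the abstract reports accuracy alongside recall.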

[6] Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

Omar El Bachyr, Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Anas Zilali, Ulrick Ble, Anne Goujon

Main category: cs.CL

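TL;DR: This paper systematically studies how key components of a RAG system, namely PDF parsers and chunking strategies, affect PDF understanding (specifically question answering), and offers practical guidelines based on two financial-domain benchmarks, including the newly released TableQuest.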

Motivation: PDF files have heterogeneous structure and are hard to process automatically, and there is no systematic study of how component design choices affect RAG performance on them. Method: Focusing on question answering, the study evaluates multiple PDF parsers and chunking strategies with varied overlap, and their synergies, on two financial-domain benchmarks, including the newly built, publicly available TableQuest. Result: The study reveals trade-offs and synergies between parsers and chunking strategies with respect to structure preservation and answer correctness, and provides empirical guidelines for building robust PDF-RAG pipelines. Conclusion: PDF-RAG performance depends heavily on how well the parser and chunking strategy match; they should be designed jointly according to document structure and task needs rather than optimized in isolation. Abstract: PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods, including the promising Retrieval-Augmented Generation (RAG) systems, to automate PDF processing. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we propose such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.
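For readers unfamiliar with overlapping chunking, here is a minimal fixed-size sketch. It assumes the parser has already produced a flat token list; the function name and parameters are illustrative and do not correspond to any specific parser or strategy evaluated in the paper.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split tokens into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(10))
chunks = chunk_tokens(tokens, chunk_size=4, overlap=2)
```

Larger overlap lowers the chance that an answer span, such as a table row, is cut in two, at the cost of indexing more chunks.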

[7] Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs

Shreeya Verma Kathuria, Nitin Mayande, Sharookh Daruwalla, Nitin Joglekar, Charles Weber

Main category: cs.CL

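TL;DR: This paper proposes wSSAS, a deterministic framework that improves the precision and reproducibility of LLM text categorization through two-phase validation (a Themes-Stories-Clusters hierarchy plus signal-to-noise-ratio weighting). Experiments on several review datasets show reduced categorization entropy and improved clustering integrity.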

Motivation: In enterprise analytics tasks such as text categorization, large language models (LLMs) are limited by the stochastic nature of attention mechanisms and their sensitivity to noise, which undermines analytical precision and reproducibility. Method: Proposes the Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) framework: the first phase builds a hierarchical Themes-Stories-Clusters classification structure; the second phase uses a Signal-to-Noise Ratio (SNR) to weight and rank semantic features and embeds this scoring in a Summary-of-Summaries (SoS) architecture to suppress noise and focus on high-value information. Result: Experiments with Gemini 2.0 Flash Lite on Google Business, Amazon, and Goodreads review datasets show that wSSAS significantly improves clustering integrity and categorization accuracy and reduces categorization entropy, providing a high-precision, reproducible path for large-scale text categorization. Conclusion: wSSAS strengthens the reliability and determinism of LLM-based analysis and offers a feasible route to noise-robust, structured, auditable enterprise text analytics. Abstract: The use of Large Language Models (LLMs) for reliable, enterprise-grade analytics such as text categorization is often hindered by the stochastic nature of attention mechanisms and sensitivity to noise that compromise their analytical precision and reproducibility. To address these technical frictions, this paper introduces the Weighted Syntactic and Semantic Context Assessment Summary (wSSAS), a deterministic framework designed to enforce data integrity on large-scale, chaotic datasets. We propose a two-phased validation framework that first organizes raw text into a hierarchical classification structure containing Themes, Stories, and Clusters. It then leverages a Signal-to-Noise Ratio (SNR) to prioritize high-value semantic features, ensuring the model's attention remains focused on the most representative data points. By incorporating this scoring mechanism into a Summary-of-Summaries (SoS) architecture, the framework effectively isolates essential information and mitigates background noise during data aggregation. Experimental results using Gemini 2.0 Flash Lite across diverse datasets - including Google Business reviews, Amazon Product reviews, and Goodreads Book reviews - demonstrate that wSSAS significantly improves clustering integrity and categorization accuracy. Our findings indicate that wSSAS reduces categorization entropy and provides a reproducible pathway for improving LLM-based summaries based on a high-precision, deterministic process for large-scale text categorization.

[8] LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

Haocheng Xi, Harman Singh, Yuezhou Hu, Coleman Hooper, Rishabh Tiwari, Aditya Tomar, Minjae Lee, Wonjun Kang, Michael Mahoney, Chenfeng Xu, Kurt Keutzer, Amir Gholami

Main category: cs.CL

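TL;DR: This paper proposes LOSA (Locality-aware Sparse Attention), which reuses cached attention results for stable tokens and applies sparse attention only to active tokens. This relieves the memory bottleneck of block-wise diffusion language models (DLMs) in long contexts and improves inference efficiency while preserving accuracy.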

Motivation: Block-wise diffusion language models (DLMs) remain bottlenecked by memory-bound attention in long-context scenarios; naive sparse attention fails on DLMs because of KV Inflation, where different queries select different prefix positions and the union of accessed KV pages becomes large. Method: Based on the observation that between consecutive denoising steps only a small fraction of active tokens change their hidden states significantly while most stable tokens remain nearly constant, LOSA reuses cached prefix-attention results for stable tokens and computes sparse attention only for active tokens. Result: Across multiple block-wise DLMs and benchmarks, LOSA preserves near-dense accuracy, gains up to +9 points in average accuracy at aggressive sparsity with 1.54x lower attention density, and achieves up to 4.14x attention speedup on RTX A6000 GPUs. Conclusion: Locality-aware sparse attention balances accuracy and efficiency for block-wise DLMs and offers a practical solution for long-context generation. Abstract: Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem, where different queries select different prefix positions, making the union of accessed KV pages large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LOSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded, yielding both higher speedup and higher accuracy. Across multiple block-wise DLMs and benchmarks, LOSA preserves near-dense accuracy while significantly improving efficiency, achieving up to +9 points in average accuracy at aggressive sparsity levels while maintaining 1.54x lower attention density. It also achieves up to 4.14x attention speedup on RTX A6000 GPUs, demonstrating the effectiveness of the proposed method.
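The active/stable split that the method relies on can be illustrated with a toy routine that compares hidden states across two denoising steps. The L2-norm criterion and the fixed threshold are assumptions made for this sketch; the abstract does not specify how LOSA actually classifies tokens.

```python
from math import sqrt


def split_active_stable(prev_hidden, curr_hidden, threshold):
    """Return (active, stable) token indices based on hidden-state change."""
    active, stable = [], []
    for index, (prev, curr) in enumerate(zip(prev_hidden, curr_hidden)):
        # L2 norm of the change between consecutive denoising steps.
        change = sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
        (active if change > threshold else stable).append(index)
    return active, stable


# Toy hidden states for three tokens in a block (two dimensions each).
prev_hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
curr_hidden = [[0.0, 0.01], [1.0, 1.0], [5.0, 6.0]]
active, stable = split_active_stable(prev_hidden, curr_hidden, threshold=0.5)
```

In a real kernel, only the `active` indices would trigger a fresh sparse-attention computation, while the `stable` ones would read cached prefix-attention outputs.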

[9] Robust Explanations for User Trust in Enterprise NLP Systems

Guilin Zhang, Kai Zhao, Jeffrey Friedman, Xu Chu, Amine Anoun, Jerry Ting

Main category: cs.CL

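TL;DR: This paper proposes a robustness evaluation framework for token-level explanations of black-box NLP models (especially in API-only settings), based on leave-one-out occlusion and realistic perturbations (swap, deletion, shuffling, back-translation). Systematic experiments across several datasets and six encoder and decoder models show that decoder LLMs (such as Llama and Qwen) produce more stable explanations than encoder models (such as BERT and RoBERTa), and that stability increases with model scale; the paper also quantifies the trade-off between robustness gains and inference cost.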

Motivation: Enterprise NLP needs robust explanations to build user trust, but in black-box deployment (API-only access) representation-based explainers are infeasible, and existing studies lack a systematic evaluation of explanation stability under real user noise; a unified robustness evaluation method is needed, particularly as organizations move from encoders to decoder LLMs. Method: Proposes a black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion; defines top-token flip rate as the robustness metric; tests four realistic perturbations (swap, deletion, shuffling, back-translation) at multiple severity levels; runs systematic experiments on three benchmark datasets and six models (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases in total). Result: Decoder LLMs yield substantially more stable explanations than encoder models (73% lower flip rates on average); stability improves with scale (44% gain from 7B to 70B); the study also characterizes a quantitative trade-off curve between robustness gains and inference cost. Conclusion: The framework provides a reproducible, extensible standard for evaluating explanation robustness in black-box settings; the evidence that decoder LLMs give more stable explanations supports preferring them in compliance-sensitive applications, and the cost-robustness trade-off helps with model and explanation selection before deployment. Abstract: Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access) where representation-based explainers are infeasible and existing studies provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases). We find that decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average), and that stability improves with model scale (44% gain from 7B to 70B). Finally, we relate robustness improvements to inference cost, yielding a practical cost-robustness tradeoff curve that supports model and explanation selection prior to deployment in compliance-sensitive applications.
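Leave-one-out occlusion and a top-token flip rate can be sketched against any black-box scoring function. The version below is illustrative: the mask string, the use of the largest score drop as "top token", the comparison by token identity, and the keyword-based `toy_score` model are assumptions rather than the paper's exact protocol.

```python
def loo_importances(tokens, score_fn, mask="[MASK]"):
    """Importance of each token = score drop when that token is occluded."""
    base = score_fn(tokens)
    return [
        base - score_fn(tokens[:i] + [mask] + tokens[i + 1:])
        for i in range(len(tokens))
    ]


def top_token(tokens, score_fn):
    """Token whose occlusion causes the largest score drop."""
    importances = loo_importances(tokens, score_fn)
    best = max(range(len(tokens)), key=lambda i: importances[i])
    return tokens[best]


def top_token_flip_rate(pairs, score_fn):
    """Share of (original, perturbed) pairs whose top token differs."""
    flips = sum(top_token(o, score_fn) != top_token(p, score_fn) for o, p in pairs)
    return flips / len(pairs)


def toy_score(tokens):
    """Hypothetical black-box sentiment score based on keyword weights."""
    weights = {"great": 2.0, "good": 1.0, "bad": -2.0}
    return sum(weights.get(token, 0.0) for token in tokens)


pairs = [
    (["the", "movie", "was", "great"], ["the", "movie", "great", "was"]),  # swap
    (["good", "but", "great"], ["good", "but"]),                           # deletion
]
rate = top_token_flip_rate(pairs, toy_score)
```

Because only `score_fn` outputs are needed, the same procedure works for API-only models, at a cost of one extra call per token per input.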

[10] Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

Syed Rifat Raiyan

Main category: cs.CL

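TL;DR: This paper presents the first systematic, large-scale empirical test of whether the Identifiable Victim Effect (IVE) exists in large language models (LLMs). The effect is prevalent but strongly modulated by alignment training: instruction-tuned models show extreme IVE, while reasoning-specialized models invert it. Standard Chain-of-Thought prompting amplifies the IVE, and only utilitarian Chain-of-Thought reliably eliminates it.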

Motivation: As large language models take on consequential roles in ethically sensitive decisions such as humanitarian triage, automated grant evaluation, and content moderation, it is important to ask whether they inherit the affective irrationalities of human moral judgment, in particular the well-established Identifiable Victim Effect (IVE). Method: Runs N=51,955 API trials across 16 frontier models from nine organizations, using ten experimental paradigms adapted from Small et al. (2007) and Kogut and Ritov (2005), to measure IVE strength across model types and prompting strategies and to analyze psychophysical numbing, quantity neglect, and cultural bias. Result: The IVE is prevalent in LLMs (pooled effect d=0.223, p=2e-6), roughly twice the human single-victim baseline; instruction-tuned models show extreme IVE (d up to 1.56) while reasoning models invert it (d=-0.85); standard Chain-of-Thought prompting nearly triples the effect (from d=0.15 to d=0.41), and only utilitarian Chain-of-Thought eliminates it; psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias are also observed. Conclusion: LLMs are not value-neutral; their moral judgments are shaped by training objectives and prompting. Human cognitive biases such as the IVE can be amplified, suppressed, or inverted, so targeted bias audits and alignment work are needed before deploying AI in ethically sensitive settings. Abstract: The Identifiable Victim Effect (IVE), the tendency to allocate greater resources to a specific, narratively described victim than to a statistically characterized group facing equivalent hardship, is one of the most robust findings in moral psychology and behavioural economics. As large language models (LLMs) assume consequential roles in humanitarian triage, automated grant evaluation, and content moderation, a critical question arises: do these systems inherit the affective irrationalities present in human moral reasoning? We present the first systematic, large-scale empirical investigation of the IVE in LLMs, comprising N=51,955 validated API trials across 16 frontier models spanning nine organizational lineages (Google, Anthropic, OpenAI, Meta, DeepSeek, xAI, Alibaba, IBM, and Moonshot). Using a suite of ten experiments, porting and extending canonical paradigms from Small et al. (2007) and Kogut and Ritov (2005), we find that the IVE is prevalent but strongly modulated by alignment training. Instruction-tuned models exhibit extreme IVE (Cohen's d up to 1.56), while reasoning-specialized models invert the effect (down to d=-0.85). The pooled effect (d=0.223, p=2e-6) is approximately twice the single-victim human meta-analytic baseline ($d \approx 0.10$) reported by Lee and Feeley (2016), and likely exceeds the overall human pooled effect by a larger margin, given that the group-victim human effect is near zero.
Standard Chain-of-Thought (CoT) prompting, contrary to its role as a deliberative corrective, nearly triples the IVE effect size (from d=0.15 to d=0.41), while only utilitarian CoT reliably eliminates it. We further document psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias, with implications for AI deployment in humanitarian and ethical decision-making contexts.
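The effect sizes quoted in this entry are Cohen's d values. As a reminder of how that statistic is computed, here is a small sketch using the pooled-standard-deviation form; the donation amounts are invented toy numbers, not data from the paper, and the paper may use a different variant of d.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical allocations to an identified victim vs. a statistical group.
identified = [2.0, 4.0, 6.0]
statistical = [1.0, 3.0, 5.0]
d = cohens_d(identified, statistical)
```

A positive d here means more resources go to the identified victim; a negative d, as reported for reasoning-specialized models, means the effect is inverted.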

[11] Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories

Zhanwei Cao, YeoJin Go, Yifan Hu, Shanu Sushmita

Main category: cs.CL

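TL;DR: This paper asks whether large language models (LLMs) can reproduce the longitudinal stylistic and cognitive changes seen in human writing over time. It finds a general "temporal flattening" in LLM output: semantic and cognitive-emotional drift over time is much weaker than in human writing, and adding history context does little to help. The phenomenon separates human from LLM text with high accuracy and has implications for synthetic data and longitudinal modeling.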

Motivation: Human writing evolves over the long term (style and cognitive state change with time), whereas current LLM interactions are stateless and generate each response independently, with no modeling of this longitudinal structure; it is therefore worth testing whether LLMs can reproduce realistic temporal dynamics. Method: Builds and publicly releases a longitudinal dataset of 412 human authors and 6,086 documents (2012-2024; academic abstracts, blogs, news) and compares it with output from three representative LLMs under standard and history-conditioned generation; quantifies differences with drift and variance metrics over semantic, lexical, and cognitive-emotional representations. Result: LLMs show marked temporal flattening: lexical diversity is higher, but semantic and cognitive-emotional drift is far lower than in human writing; temporal variability patterns alone distinguish human from LLM trajectories with 94% accuracy and 98% ROC-AUC. Conclusion: Temporal flattening is a fundamental limitation of current LLM deployment paradigms and is not removed by adding history context; it poses a substantive challenge for applications that depend on authentic temporal structure, such as synthetic training data and longitudinal text modeling. Abstract: Large language models (LLMs) are increasingly used in daily applications, from content generation to code writing, where each interaction treats the model as stateless, generating responses independently without memory. Yet human writing is inherently longitudinal: authors' styles and cognitive states evolve across months and years. This raises a central question: can LLMs reproduce such temporal structure across extended time periods? We construct and publicly release a longitudinal dataset of 412 human authors and 6,086 documents spanning 2012--2024 across three domains (academic abstracts, blogs, news) and compare them to trajectories generated by three representative LLMs under standard and history-conditioned generation settings. Using drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations, we find temporal flattening in LLM-generated text. LLMs produce greater lexical diversity but exhibit substantially reduced semantic and cognitive-emotional drift relative to humans. These differences are highly predictive: temporal variability patterns alone achieve 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories. Our results demonstrate that temporal flattening persists regardless of whether LLMs generate independently or with access to incremental history, revealing a fundamental property of current deployment paradigms. This gap has direct implications for applications requiring authentic temporal structure, such as synthetic training data and longitudinal text modeling.
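One plausible drift metric over a writing trajectory is the mean cosine distance between consecutive document embeddings. The sketch below uses that definition purely for illustration; the paper's actual drift and variance metrics are not specified in the abstract, and the two-dimensional "embeddings" are toy values.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_drift(embeddings):
    """Average distance between consecutive documents in a trajectory."""
    if len(embeddings) < 2:
        raise ValueError("need at least two documents to measure drift")
    steps = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    return sum(steps) / len(steps)


# Toy trajectories ordered by time: one changes direction, one never moves.
varied = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
flat = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
varied_drift = mean_drift(varied)
flat_drift = mean_drift(flat)
```

Under this toy metric, a temporally flattened trajectory would sit near zero, while a trajectory with evolving style or topic would score higher.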

[12] When Self-Reference Fails to Close: Matrix-Level Dynamics in Large Language Models

Ji Ho Bae

Main category: cs.CL

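TL;DR: This paper studies how self-referential inputs affect the internal matrix dynamics of large language models. It finds that non-closing truth recursion (NCTR) is the main source of instability, producing anomalously elevated attention effective rank and markedly more contradictory output, and relates this to classical matrix-semigroup problems.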

Motivation: To understand how self-referential inputs affect the internal dynamics of large language models, and in particular why some self-referential prompts destabilize a model while others do not. Method: On four LLMs spanning three architecture families, analyzes 300 self-referential prompts arranged in a 14-level hierarchy at three temperatures with up to 7 analysis passes, measuring 106 scalar metrics; quantifies instability with metrics such as attention effective rank and variance kurtosis, and validates with per-layer SVD, classifier AUC, and multiple statistical tests with FDR correction. Result: NCTR prompts cause pronounced instability (Cohen's d of 3.14 to 3.52), with attention showing global dispersion rather than local collapse; 281/397 metric-model combinations separate NCTR from stable self-reference; 43/106 metrics replicate across all models; classifier AUC reaches 0.81-0.90; contradictory output rises by 34-56 percentage points. Conclusion: Self-reference does not by itself cause instability; what matters is whether it induces non-closing truth recursion (NCTR). The paper conjectures that NCTR pushes finite-depth transformers toward dynamical regimes related to classical matrix-semigroup problems, offering a theoretical framing and empirical evidence for self-referential failure modes. Abstract: We investigate how self-referential inputs alter the internal matrix dynamics of large language models. Measuring 106 scalar metrics across up to 7 analysis passes on four models from three architecture families -- Qwen3-VL-8B, Llama-3.2-11B, Llama-3.3-70B, and Gemma-2-9B -- over 300 prompts in a 14-level hierarchy at three temperatures ($T \in \{0.0, 0.3, 0.7\}$), we find that self-reference alone is not destabilizing: grounded self-referential statements and meta-cognitive prompts are markedly more stable than paradoxical self-reference on key collapse-related metrics, and on several such metrics can be as stable as factual controls. Instability concentrates in prompts inducing non-closing truth recursion (NCTR) -- truth-value computations with no finite-depth resolution. NCTR prompts produce anomalously elevated attention effective rank -- indicating attention reorganization with global dispersion rather than simple concentration collapse -- and key metrics reach Cohen's $d = 3.14$ (attention effective rank) to $3.52$ (variance kurtosis) vs. stable self-reference in the 70B model; 281/397 metric-model combinations differentiate NCTR from stable self-reference after FDR correction ($q < 0.05$), 198 with $|d| > 0.8$. Per-layer SVD confirms disruption at every sampled layer ($d > +1.0$ in all three models analyzed), ruling out aggregation artifacts. A classifier achieves AUC $0.81$-$0.90$; 30 minimal pairs yield 42/387 significant combinations; 43/106 metrics replicate across all four models.
We connect these observations to three classical matrix-semigroup problems and propose, as a conjecture, that NCTR forces finite-depth transformers toward dynamical regimes where these problems concentrate. NCTR prompts also produce elevated contradictory output ($+34$-$56$ percentage points vs. controls), suggesting practical relevance for understanding self-referential failure modes.
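"Effective rank" is commonly defined as the exponential of the Shannon entropy of the normalized singular values. The sketch below uses that common definition as an assumption, since the abstract does not state which variant the paper uses, and it takes the singular values as given instead of computing an SVD.

```python
from math import exp, log


def effective_rank(singular_values):
    """exp(entropy) of singular values normalized to sum to 1."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * log(p) for p in probs)
    return exp(entropy)


# Concentrated spectrum (one dominant direction) vs. evenly dispersed one.
concentrated = effective_rank([1.0, 0.0, 0.0, 0.0])
dispersed = effective_rank([1.0, 1.0, 1.0, 1.0])
```

A higher value means attention mass is spread over more directions, which matches the "global dispersion" reading of elevated effective rank in the abstract.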

[13] AlphaEval: Evaluating Agents in Production

Pengrui Lu, Bingyu Xu, Wenjun Zhang, Shengjia Hua, Xuanjian Gao, Ranxiang Ge, Lyumanshan Ye, Linxuan Wu, Yiran Li, Junfei Fish Yu, Yibo Zhang, Ruixin Li, Manxiang Li, Xiao Han, Xiaocong Zhou, Guangyao Chi, Zisheng Chen, Kaishen Chen, Kun Wang, Qihua Xu, Fengyue Meng, Yuchen Ni, Jiajun Li, Jinxiu Liu, Danfeng Zhang, Jingru Zhao, Pengfei Liu

Main category: cs.CL

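TL;DR: This paper presents AlphaEval, a benchmark for evaluating AI agents grounded in real production environments. It contains 94 tasks from seven companies across six occupational domains and provides a systematic framework for turning requirements into executable evaluation tasks.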

Motivation: Existing agent evaluation methods are detached from production environments and fail to capture real-world challenges such as implicit constraints, heterogeneous multi-source inputs, undeclared domain expertise, long-horizon professional deliverables, and expert standards that evolve over time. Method: Builds the AlphaEval benchmark from 94 tasks collected in real commercial settings; uses a multi-paradigm evaluation framework (LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, and others); proposes a requirement-to-benchmark construction framework that converts production requirements into executable evaluation tasks quickly and in a standardized way. Result: AlphaEval evaluates complete agent products rather than models alone, exposing performance differences that model-level evaluation cannot capture; the construction framework is presented as reusable and modular, allowing organizations to build production-grade benchmarks for their own domains. Conclusion: AlphaEval moves agent evaluation from idealized lab benchmarks toward production-oriented evaluation, and its construction framework offers practical, extensible evaluation infrastructure. Abstract: The rapid deployment of AI agents in commercial settings has outpaced the development of evaluation methodologies that reflect production realities. Existing benchmarks measure agent capabilities through retrospectively curated tasks with well-specified requirements and deterministic metrics -- conditions that diverge fundamentally from production environments where requirements contain implicit constraints, inputs are heterogeneous multi-modal documents with information fragmented across sources, tasks demand undeclared domain expertise, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards evolve over time. We present AlphaEval, a production-grounded benchmark of 94 tasks sourced from seven companies deploying AI agents in their core business, spanning six O*NET (Occupational Information Network) domains. Unlike model-centric benchmarks, AlphaEval evaluates complete agent products -- Claude Code, Codex, etc. -- as commercial systems, capturing performance variations invisible to model-level evaluation. Our evaluation framework covers multiple paradigms (LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, etc.), with individual domains composing multiple paradigms. Beyond the benchmark itself, we contribute a requirement-to-benchmark construction framework -- a systematic methodology that transforms authentic production requirements into executable evaluation tasks in minimal time.
This framework standardizes the entire pipeline from requirement to evaluation, providing a reproducible, modular process that any organization can adopt to construct production-grounded benchmarks for their own domains.

[14] AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Quan Z. Sheng

Main category: cs.CL

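TL;DR: This paper proposes AgenticAI-DialogGen, a framework that automatically generates TopicGuidedChat (TGC), a conversational dataset that models both short- and long-term memory without human annotation, improving LLM performance on memory-grounded question answering.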

Motivation: Existing conversational datasets do not jointly model short- and long-term memory, lack memory grounding, overlook topic continuity, or rely on costly human annotation, which makes them unsuitable for fine-tuning and evaluating LLM memory. Method: Designs AgenticAI-DialogGen, a modular LLM-agent framework in which agents extract knowledge graphs from unstructured conversations (modeling long-term memory), identify topics, build speaker personas, and generate topic-guided conversations (modeling short-term memory); a QA module then generates question-answer pairs grounded in short- and long-term memory. Result: Produces the new dataset TopicGuidedChat (TGC), in which long-term memory is represented as speaker-specific knowledge graphs and short-term memory as generated topic-guided conversations; experiments show higher conversational quality, and LLMs fine-tuned on TGC perform better on memory-grounded QA. Conclusion: AgenticAI-DialogGen offers a scalable conversational data synthesis approach without human supervision that supports modeling, training, and evaluating short- and long-term memory in LLMs. Abstract: Recent advancements in Large Language Models (LLMs) have improved their ability to process extended conversational contexts, yet fine-tuning and evaluating short- and long-term memories remain difficult due to the absence of datasets that encode both short- and long-term conversational history. Existing conversational datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation. To address these gaps, we introduce AgenticAI-DialogGen, a modular agent-based framework that generates persona-grounded and topic-guided conversations without human supervision. The framework uses LLM agents to extract knowledge graphs, identify topics, build speaker personas, and simulate topic-guided conversations from unstructured conversations. A QA module generates memory-grounded Question Answer (QA) pairs drawn from short- and long-term conversational histories. We also generate a new dataset, TopicGuidedChat (TGC), where long-term memory is encoded as speaker-specific knowledge graphs and short-term memory as newly generated topic-guided conversations. Evaluations show that AgenticAI-DialogGen yields higher conversational quality and that LLMs fine-tuned on the TGC dataset achieve improved performance on memory-grounded QA tasks.

[15] Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

Keshu Wu, Chenchen Kuai, Zihao Li, Jiwan Jiang, Shiyu Shen, Shian Wang, Chan-Wei Hu, Zhengzhong Tu, Yang Zhou

Main category: cs.CL

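TL;DR: This paper proposes OKH-RAG, a retrieval-augmented generation method that treats interaction order as a core structural property. By adding precedence structure to a hypergraph and modeling retrieval as sequence inference over hyperedges, it markedly improves order-sensitive question answering and explanation tasks.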

Details Motivation: Existing RAG methods, including graph- and hypergraph-based ones, treat retrieved evidence as an unordered set and thus implicitly assume permutation invariance, yet in many real-world reasoning tasks the outcome depends on the order in which interactions occur. Method: Order-Aware Knowledge Hypergraph RAG (OKH-RAG) represents knowledge as higher-order hypergraph interactions with precedence structure and reformulates retrieval as sequence inference over hyperedges; a transition model learned from data infers precedence relations without explicit temporal supervision. Result: On order-sensitive QA and explanation tasks such as tropical cyclone and port operation scenarios, OKH-RAG consistently outperforms permutation-invariant baselines; ablations show that the gains come specifically from modeling interaction order. Conclusion: Set-based retrieval has a key limitation: effective reasoning requires not only retrieving relevant evidence but also organizing it into structured sequences, so order is a first-class structural property of knowledge modeling and reasoning. Abstract: Retrieval-augmented generation (RAG) enhances large language models by grounding outputs in retrieved knowledge. However, existing RAG methods including graph- and hypergraph-based approaches treat retrieved evidence as an unordered set, implicitly assuming permutation invariance. This assumption is misaligned with many real-world reasoning tasks, where outcomes depend not only on which interactions occur, but also on the order in which they unfold. We propose Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which treats order as a first-class structural property. OKH-RAG represents knowledge as higher-order interactions within a hypergraph augmented with precedence structure, and reformulates retrieval as sequence inference over hyperedges. Instead of selecting independent facts, it recovers coherent interaction trajectories that reflect underlying reasoning processes. A learned transition model infers precedence directly from data without requiring explicit temporal supervision. We evaluate OKH-RAG on order-sensitive question answering and explanation tasks, including tropical cyclone and port operation scenarios. OKH-RAG consistently outperforms permutation-invariant baselines, and ablations show that these gains arise specifically from modeling interaction order. These results highlight a key limitation of set-based retrieval: effective reasoning requires not only retrieving relevant evidence, but organizing it into structured sequences.

[16] Representing expertise accelerates learning from pedagogical interaction data

Dhara Yu,Karthikeya Kaushik,Bill D. Thompson

Main category: cs.CL

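TL;DR: Using synthetic expert-novice interaction data from a spatial navigation task, the authors train transformer models and find that models trained on pedagogical interactions are more robust than models trained only on expert demonstrations, and that representing epistemically distinct agents enables expert-like behavior.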

Details Motivation: Interaction data is known to help learning agents, but it is unclear which features of the interactions are responsible for the improvement. Method: The authors build synthetic datasets of interactions between an expert and a novice in a spatial navigation task, train transformer models on them, and compare the effect of different datasets (for example, pedagogical interaction versus an expert acting alone). Result: Models trained on pedagogical interaction data are more robust across a variety of scenarios; models that can represent epistemically distinct agents show expert-like behavior even when expert behavior is rarely observed. Conclusion: The effectiveness of interaction data depends on its pedagogical nature and on explicitly modeling epistemic differences between agents, not simply on adding more expert behavior samples. Abstract: Work in cognitive science and artificial intelligence has suggested that exposing learning agents to traces of interaction between multiple individuals can improve performance in a variety of settings, yet it remains unknown which features of interactions contribute to this improvement. We examined the factors that support the effectiveness of interaction data, using a controlled paradigm that allowed us to precisely operationalize key distinctions between interaction and an expert acting alone. We generated synthetic datasets of simple interactions between an expert and a novice in a spatial navigation task, and then trained transformer models on those datasets, evaluating performance after exposure to different datasets. Our experiments showed that models trained on pedagogical interactions were more robust across a variety of scenarios compared to models trained only on expert demonstrations, and that having the ability to represent epistemically distinct agents led to expert-like behavior even when expert behavior was rarely observed.

[17] Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

Manh Nguyen,Sunil Gupta,Hung Le

Main category: cs.CL

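TL;DR: This paper proposes Radial Consensus Score (RCS), a training-free and efficient method for best-of-N selection. RCS computes a weighted Fréchet mean (semantic center) of answer embeddings and ranks candidates by their radial distance to that center. It supports several weighting schemes, outperforms existing methods across multiple benchmarks and models, and is robust in black-box and multi-agent settings.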

Details Motivation: Existing methods such as self-consistency and probability weighting struggle to capture semantic relationships among answers, are distorted by frequent but low-quality answers, underweight high-quality but infrequent answers, and do not exploit the geometric structure of answer representations. Method: Radial Consensus Score (RCS) maps candidate answers to embedding vectors, computes their weighted Fréchet mean as a semantic center, and ranks answers by radial distance to that center; it supports uniform, frequency-based, and probability-based weighting, needs no training, and applies to black-box models. Result: On seven short- and long-form QA and reasoning benchmarks and five open-weight models, RCS variants consistently outperform strong baselines such as self-consistency, with larger gains as the sampling budget grows; RCS can directly replace majority voting in multi-agent debate and is robust in black-box settings. Conclusion: Geometric consensus, modeled through a semantic center and radial distances, is a scalable and broadly applicable principle for reliable answer selection that goes beyond majority voting. Abstract: Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios.
Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
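
The ranking idea in the RCS abstract can be illustrated with a minimal sketch. It assumes Euclidean embeddings with squared-distance loss, in which case the weighted Fréchet mean reduces to the weighted arithmetic mean; the paper may use a different geometry or distance. The function names and the toy 2-D "embeddings" are hypothetical, not the authors' code.

```python
import math

def weighted_center(embeddings, weights):
    """Weighted arithmetic mean; equals the weighted Frechet mean
    under squared Euclidean distance (an assumption of this sketch)."""
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(w * e[i] for e, w in zip(embeddings, weights)) / total
        for i in range(dim)
    ]

def radial_consensus_rank(embeddings, weights=None):
    """Return candidate indices sorted from closest to farthest from the center."""
    if weights is None:
        weights = [1.0] * len(embeddings)  # uniform weighting variant
    center = weighted_center(embeddings, weights)
    distances = [math.dist(e, center) for e in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: distances[i])

# Toy data: three candidates roughly agree, one is an outlier.
candidates = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [-1.0, 2.0]]
ranking = radial_consensus_rank(candidates)
best = ranking[0]  # index of the selected answer
```

Frequency- or probability-based variants would only change the `weights` argument, for instance by passing answer counts or sequence probabilities instead of ones.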

[18] LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines

Jiechao Gao,Rohan Kumar Yadav,Yuangang Li,Yuandong Pan,Jie Wang,Ying Liu,Michael Lepech

Main category: cs.CL

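TL;DR: This paper proposes a semantically guided framework that distills the semantic knowledge of large language models (LLMs) into interpretable symbolic rules, allowing a Tsetlin Machine (TM) to approach BERT-level performance while remaining fully symbolic and efficient.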

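Details Motivation: Pretrained language models such as BERT have strong semantic capacity but are opaque and computationally expensive, while symbolic models such as the Tsetlin Machine are interpretable but lack semantic generalization; the goal is to combine semantic understanding with transparency and efficiency. Method: A semantic bootstrapping framework: (1) an LLM generates sub-intents for each class; (2) a three-stage curriculum (seed, core, enriched) synthesizes diverse semantic data; (3) a Non-Negated Tsetlin Machine (NTM) learns high-confidence literals from this data as interpretable semantic cues; (4) the cues are injected into real data so that the TM aligns with LLM semantics. No embeddings or runtime LLM calls are needed. Result: On multiple text classification tasks the method clearly improves over the vanilla TM and reaches performance comparable to BERT, while staying fully symbolic and low-cost. Conclusion: Semantic knowledge can be distilled into symbolic logic cues without end-to-end fine-tuning or continued reliance on an LLM; structured guidance gives symbolic models strong semantic priors and improves accuracy and interpretability together. Abstract: Pretrained language models (PLMs) like BERT provide strong semantic representations but are costly and opaque, while symbolic models such as the Tsetlin Machine (TM) offer transparency but lack semantic generalization. We propose a semantic bootstrapping framework that transfers LLM knowledge into symbolic form, combining interpretability with semantic capacity. Given a class label, an LLM generates sub-intents that guide synthetic data creation through a three-stage curriculum (seed, core, enriched), expanding semantic diversity. A Non-Negated TM (NTM) learns from these examples to extract high-confidence literals as interpretable semantic cues. Injecting these cues into real data enables a TM to align clause logic with LLM-inferred semantics. Our method requires no embeddings or runtime LLM calls, yet equips symbolic models with pretrained semantic priors. Across multiple text classification tasks, it improves interpretability and accuracy over vanilla TM, achieving performance comparable to BERT while remaining fully symbolic and efficient.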

[19] Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems

Tao Feng,Pengrui Han,Guanyu Lin,Ge Liu,Jiaxuan You

Main category: cs.CL

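TL;DR: This paper proposes Thought-Retriever, a model-agnostic algorithm that lets LLMs generate output conditioned on arbitrarily long external data, free of the context-length limit. Its core idea is to organize the intermediate responses (thoughts) produced while solving past queries into a self-evolving long-term memory and to retrieve relevant thoughts for new queries. It clearly outperforms existing methods on AcademicEval and other benchmarks, and the authors show that the model self-evolves through interaction and uses deeper thoughts for more abstract questions.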

Details Motivation: LLMs have strong internal capabilities but struggle to incorporate massive external knowledge; retrieval-augmented methods are bounded by context length and can retrieve only a limited number of data chunks. Method: Thought-Retriever filters and organizes the intermediate responses (thoughts) an LLM produces while handling past queries into a thought memory, then retrieves relevant thoughts as conditioning information for new queries, sidestepping the context-length limit. Result: On AcademicEval and two public datasets it clearly outperforms state-of-the-art baselines, with an average F1 gain of at least 7.6% and a win-rate gain of 16%; the experiments also show that the LLM self-evolves with this method and uses deeper thoughts to answer more abstract questions. Conclusion: Thought-Retriever gives LLMs a scalable, self-evolving long-term memory that removes the context bottleneck of conventional retrieval augmentation and improves the use of very long external knowledge. Abstract: Large language models (LLMs) have transformed AI research thanks to their powerful internal capabilities and knowledge. However, existing LLMs still fail to effectively incorporate the massive external knowledge when interacting with the world. Although retrieval-augmented LLMs are proposed to mitigate the issue, they are still fundamentally constrained by the context length of LLMs, as they can only retrieve top-K raw data chunks from the external knowledge base which often consists of millions of data chunks. Here we propose Thought-Retriever, a novel model-agnostic algorithm that helps LLMs generate output conditioned on arbitrarily long external data, without being constrained by the context length or number of retrieved data chunks. Our key insight is to let an LLM fully leverage its intermediate responses generated when solving past user queries (thoughts), filtering meaningless and redundant thoughts, organizing them in thought memory, and retrieving the relevant thoughts when addressing new queries. This effectively equips LLM-based agents with a self-evolving long-term memory that grows more capable through continuous interaction. Besides algorithmic innovation, we further meticulously prepare a novel benchmark, AcademicEval, which requires an LLM to faithfully leverage ultra-long context to answer queries based on real-world academic papers.
Extensive experiments on AcademicEval and two other public datasets validate that Thought-Retriever remarkably outperforms state-of-the-art baselines, achieving an average increase of at least 7.6% in F1 score and 16% in win rate across various tasks. More importantly, we further demonstrate two exciting findings: (1) Thought-Retriever can indeed help LLM self-evolve after solving more user queries; (2) Thought-Retriever learns to leverage deeper thoughts to answer more abstract user queries.

[20] Continuous Knowledge Metabolism: Generating Scientific Hypotheses from Evolving Literature

Jinkai Tao,Yubo Wang,Xiaoyu Liu,Menglin Yang

Main category: cs.CL

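TL;DR: This paper proposes Continuous Knowledge Metabolism (CKM), a framework that incrementally updates a structured knowledge base over sliding time windows to support scientific hypothesis generation. CKM-Lite clearly beats batch processing on predictive performance and efficiency; CKM-Full shows that the type of knowledge change (for example, convergence versus contradiction) strongly affects hypothesis success and that quality trades off against coverage.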

Details Motivation: Generating scientific hypotheses depends not only on current knowledge but also on tracking how knowledge evolves over time, which most existing methods ignore. Method: The CKM framework has two variants, CKM-Lite (efficient and incremental) and CKM-Full (instrumented: it classifies knowledge-change types and models the evolution trajectory); the study generates and analyzes 892 hypotheses across 50 research topics. Result: CKM-Lite outperforms the batch baseline on hit rate, hypothesis yield, and best-match alignment while cutting token cost by 92%; CKM-Full finds that knowledge convergence signals have nearly 5x the hit rate of contradiction signals, and identifies a quality-coverage trade-off as well as an effect of field stability on hypothesis success. Conclusion: Hypothesis quality depends not only on how much literature is processed but also on how it is processed; evaluation frameworks should account for both quality and coverage instead of optimizing a single metric. Abstract: Scientific hypothesis generation requires tracking how knowledge evolves, not just what is currently known. We introduce Continuous Knowledge Metabolism (CKM), a framework that processes scientific literature through sliding time windows and incrementally updates a structured knowledge base as new findings arrive. We present CKM-Lite, an efficient variant that achieves strong predictive coverage through incremental accumulation, outperforming batch processing on hit rate (+2.8%, p=0.006), hypothesis yield (+3.6, p<0.001), and best-match alignment (+0.43, p<0.001) while reducing token cost by 92%. To understand what drives these differences, we develop CKM-Full, an instrumented variant that categorizes each new finding as novel, confirming, or contradicting, detects knowledge change signals, and conditions hypothesis generation on the full evolution trajectory. Analyzing 892 hypotheses generated by CKM-Full across 50 research topics, alongside parallel runs of the other variants, we report four empirical observations: (1) incremental processing outperforms batch baseline across predictive and efficiency metrics; (2) change-aware instrumentation is associated with higher LLM-judged novelty (Cohen's d=3.46) but lower predictive coverage, revealing a quality-coverage trade-off; (3) a field's trajectory stability is associated with hypothesis success (r=-0.28, p=0.051), suggesting boundary conditions for literature-based prediction; (4) knowledge convergence signals are associated with nearly 5x higher hit rate than contradiction signals, pointing to differential predictability across change types.
These findings suggest that the character of generated hypotheses is shaped not only by how much literature is processed, but also by how it is processed. They further indicate that evaluation frameworks must account for the quality-coverage trade-off rather than optimize for a single metric.
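
The novel / confirming / contradicting categorization described for CKM-Full can be sketched as follows. This is a toy illustration under strong assumptions: a finding is reduced to a hypothetical `(claim, supports)` pair, the knowledge base is a plain dictionary, and the latest finding overwrites the stored stance. The actual system presumably relies on LLM-based extraction and richer structure.

```python
from collections import Counter

def categorize(kb, finding):
    """Label a finding relative to the current knowledge base."""
    claim, supports = finding
    if claim not in kb:
        return "novel"
    return "confirming" if kb[claim] == supports else "contradicting"

def metabolize(windows):
    """Process time windows in order, updating the knowledge base incrementally.

    Returns the final knowledge base and one change-signal summary per window.
    """
    kb = {}
    signals = []
    for window in windows:
        counts = Counter()
        for finding in window:
            counts[categorize(kb, finding)] += 1
            claim, supports = finding
            kb[claim] = supports  # assumption: the most recent finding wins
        signals.append(dict(counts))
    return kb, signals

# Two hypothetical time windows of extracted findings.
windows = [
    [("A improves B", True), ("C causes D", True)],
    [("A improves B", True), ("C causes D", False), ("E relates to F", True)],
]
kb, signals = metabolize(windows)
```

Here the second window yields one confirming, one contradicting, and one novel finding; a hypothesis generator could then condition on these per-window signals.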

[21] SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

Zhuofan Wen,Yang Feng

Main category: cs.CL

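TL;DR: This paper proposes a new self-draft framework that suppresses spurious confidence at early exit through layer-wise temperature annealing and adaptively bounds speculation length based on token-level decoding difficulty. It improves speculative decoding efficiency and achieves up to 2.33x end-to-end speedup without modifying the base model's parameters.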

Details Motivation: Existing self-draft methods have two problems: shallow layers make overconfident but error-prone predictions, and hard-to-decode tokens in a draft force redundant computation in deeper layers, lowering the acceptance rate and the overall speedup. Method: Layer-wise temperature annealing is applied to the early-exit decision to curb overconfidence; speculation length is bounded dynamically according to token-level decoding difficulty; the hidden states of draft tokens are reprocessed through the deep layers in a single parallel pass, which preserves output equivalence. Result: Across diverse long-form generation tasks and model architectures, end-to-end speed improves by up to 2.33x over standard autoregressive decoding, with no changes to base model parameters and outputs identical to the original model. Conclusion: The framework alleviates the efficiency bottlenecks that overconfidence and hard tokens cause in self-drafting, and speeds up LLM inference while guaranteeing exact output equivalence. Abstract: Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft models but face limitations: shallow layers often produce overconfident yet incorrect token predictions, and the presence of difficult tokens in a draft sequence forces redundant computation through deeper layers, undermining both draft acceptance and overall speedup. To address these issues, we propose a novel self-draft framework that suppresses spurious confidence via layer-wise temperature annealing in early-exit decision and adaptively bounds speculation length based on token-wise decoding difficulty. By reprocessing the hidden states of draft tokens in a unified parallel pass through deep layers, our method maintains exact output equivalence with the original model while maximizing computational efficiency. It requires no modifications to the base LLM parameters and achieves up to 2.33x wall-time speedup over standard autoregressive decoding across diverse long-form generation tasks and multiple model architectures.
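
To make the layer-wise temperature annealing idea concrete, here is a minimal sketch. The abstract does not give the schedule, so the linear interpolation from a high temperature at the shallowest layer to 1.0 at the deepest layer, the 0.8 threshold, and all function names are assumptions for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def layer_temperature(layer, num_layers, t_shallow=3.0, t_deep=1.0):
    """Hypothetical linear schedule: hotter (flatter) at shallow layers."""
    if num_layers < 2:
        return t_deep
    frac = layer / (num_layers - 1)
    return t_shallow + frac * (t_deep - t_shallow)

def early_exit_layer(per_layer_logits, threshold=0.8):
    """Return the first layer whose annealed confidence clears the threshold."""
    n = len(per_layer_logits)
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits, layer_temperature(layer, n))
        if max(probs) >= threshold:
            return layer
    return n - 1  # fall back to the final layer

# Toy logits from three exit points for a single token.
toy_logits = [[4.0, 0.0, 0.0], [5.0, 0.0, 0.0], [6.0, 0.0, 0.0]]
exit_layer = early_exit_layer(toy_logits)
```

With temperature 1.0 the shallowest layer would already clear the 0.8 threshold; the annealed schedule flattens its distribution and defers the exit to the next layer, which is the kind of overconfidence suppression the abstract describes.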

[22] Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System

Taehun Kim,Hyeryun Park,Hyeonhoon Lee,Yushin Lee,Kyungsang Kim,Hyung-Chul Lee

Main category: cs.CL

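TL;DR: This paper presents CARIS, a clinical research intelligence system that integrates large language models (LLMs) with modular tools and uses natural-language instructions to automate the full clinical research workflow (cohort construction, IRB documentation, machine learning modeling, and report generation) while protecting data privacy, with no direct access to raw sensitive data.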

Details Motivation: Clinical research is laborious and depends on specialist skills and access to sensitive data, which makes data-driven studies hard for clinicians and external researchers; a solution that combines automation with privacy protection is needed. Method: The Clinical Agentic Research Intelligence System (CARIS) is built on the Model Context Protocol (MCP) and couples LLMs with modular tools (literature search, cohort construction, IRB documentation, Vibe ML modeling, report generation, and others), supporting an end-to-end workflow driven by natural-language instructions with iterative human-in-the-loop refinement. Result: On three heterogeneous clinical datasets, research plans and IRB documents were finalized within three to four iterations; Vibe ML automatically explored feature-model combinations, ranked the top ten models, and generated visualizations; final reports reached 96% coverage of a TRIPOD+AI checklist under LLM evaluation and 82% under human evaluation. Conclusion: CARIS shows that agentic AI can turn clinical hypotheses into executable research workflows across heterogeneous datasets, lowering the barrier to clinical research without touching raw data and bridging public and private clinical data environments. Abstract: Clinical research involves labor-intensive processes such as study design, cohort construction, model development, and documentation, requiring domain expertise, programming skills, and access to sensitive patient data. These demands create barriers for clinicians and external researchers conducting data-driven studies. To overcome these limitations, we developed a Clinical Agentic Research Intelligence System (CARIS) that automates the clinical research workflow while preserving data privacy, enabling comprehensive studies without direct access to raw data. CARIS integrates Large Language Models (LLMs) with modular tools via the Model Context Protocol (MCP), enabling natural language-driven orchestration of appropriate tools. Databases remain securely within the MCP server, and users access only the outputs and final research reports. Based on user intent, CARIS automatically executes the full pipeline: research planning, literature search, cohort construction, Institutional Review Board (IRB) documentation, Vibe Machine Learning (ML), and report generation, with iterative human-in-the-loop refinement. We evaluated CARIS on three heterogeneous datasets with distinct clinical tasks. Research plans and IRB documents were finalized within three to four iterations, using evidence from literature and data. The system supported Vibe ML by exploring feature-model combinations, ranking the top ten models, and generating performance visualizations.
Final reports showed high completeness based on a checklist derived from the TRIPOD+AI framework, achieving 96% coverage in LLM evaluation and 82% in human evaluation. CARIS demonstrates that agentic AI can transform clinical hypotheses into executable research workflows across heterogeneous datasets. By eliminating the need for coding and direct data access, the system lowers barriers and bridges public and private clinical data environments.

[23] CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades

Raeyoung Chang,Dongwook Kwon,Jisoo Lee,Nikhil Verma

Main category: cs.CL

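TL;DR: CascadeDebate inserts lightweight, confidence-triggered multi-agent deliberation at each escalation boundary of a cascaded LLM system. Ambiguous queries are resolved by internal consensus instead of being escalated prematurely to costlier models or experts, which dynamically balances accuracy, cost, and compute.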

Details Motivation: In existing cascaded LLM systems, single-model tiers facing ambiguous queries tend to escalate prematurely because of under-confidence, which raises cost and wastes compute. Method: A confidence-driven, lightweight multi-agent deliberation module is inserted at every escalation boundary of the cascade; a unified architecture alternates single-model inference with selective multi-agent deliberation, and an online threshold optimizer makes the routing decisions adaptive. Result: On five benchmarks spanning science, medicine, and general knowledge, accuracy improves by up to 26.75% over strong single-model cascades and standalone multi-agent systems; the online threshold optimizer yields a relative accuracy improvement of 20.98% to 52.33% over fixed policies and supports elastic adaptation to real-world distributions. Conclusion: Embedding limited, targeted multi-agent deliberation at the key boundaries of a cascade balances performance, cost, and controllability more efficiently than global multi-agent systems or pure single-model cascades, and offers a new pattern for scaling test-time compute. Abstract: Cascaded LLM systems coordinate models of varying sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, single-model tiers at each stage often struggle with ambiguous queries, triggering premature escalations to costlier models or experts due to under-confidence and inefficient compute scaling. CascadeDebate addresses this gap by inserting multi-agent deliberation directly at each tier's escalation boundary. Confidence-based routers activate lightweight agent ensembles only for uncertain cases, enabling consensus-driven resolution of ambiguities internally without invoking higher-cost upgrades. Our unified architecture alternates single-model inference with selective multi-agent deliberation across model scales, culminating in human experts as the final fallback. This design scales test-time compute dynamically according to query difficulty. Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single-model cascades and standalone multi-agent systems by up to 26.75 percent. An online threshold optimizer proves essential, boosting accuracy by 20.98 to 52.33 percent relative improvement over fixed policies and enabling elastic adaptation to real-world distributions.
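
The routing pattern in this abstract (answer directly when confident, deliberate when uncertain, escalate only if deliberation fails) might look roughly like the sketch below. The thresholds are fixed here, whereas the paper uses an online threshold optimizer; the majority-vote deliberation, the callable interfaces, and the toy models are all hypothetical stand-ins.

```python
from collections import Counter

def deliberate(agents, query):
    """Poll an agent ensemble; return the top answer and its agreement ratio."""
    votes = Counter(agent(query) for agent in agents)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(agents)

def cascade(query, tiers, human, conf_threshold=0.7, agree_threshold=0.6):
    """Walk the tiers from cheapest to costliest.

    Each tier is (model, agents); model(query) returns (answer, confidence).
    """
    for model, agents in tiers:
        answer, confidence = model(query)
        if confidence >= conf_threshold:
            return answer, "single-model"
        answer, agreement = deliberate(agents, query)
        if agreement >= agree_threshold:
            return answer, "deliberation"
        # No consensus: escalate to the next, costlier tier.
    return human(query), "human"

# Toy tiers: an unsure small model with an agreeing ensemble, then a large model.
small = (lambda q: ("B", 0.4), [lambda q: "A", lambda q: "A", lambda q: "B"])
large = (lambda q: ("C", 0.9), [])
result = cascade("ambiguous question", [small, large], human=lambda q: "expert answer")

# Same query, but the small tier's ensemble cannot agree, so it escalates.
split = (lambda q: ("B", 0.4), [lambda q: "A", lambda q: "B", lambda q: "C"])
escalated = cascade("ambiguous question", [split, large], human=lambda q: "expert answer")
```

In the first call the small tier resolves the query internally through consensus; in the second, disagreement pushes the query to the larger model.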

[24] Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning

Houxing Ren,Mingjie Zhan,Zimu Lu,Ke Wang,Yunqiao Yang,Haotian Hou,Hongsheng Li

Main category: cs.CL

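TL;DR: SpreadsheetAgent is a two-stage multi-agent framework that improves understanding of large spreadsheets through step-by-step reading and localized interpretation in multiple formats (code execution results, images, and LaTeX tables), with a verification module for added reliability.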

Details Motivation: Existing LLM-based methods treat tables as plain text, ignoring layout and visual semantics, and cannot handle the long inputs of large spreadsheets. Method: A two-stage multi-agent framework first builds a structural sketch and row/column summaries (Sketching Stage), then performs task-driven reasoning over this intermediate representation (Solving Stage); a verification module runs targeted inspections of the extracted structures to limit error propagation. Result: On Spreadsheet Bench with GPT-OSS-120B, SpreadsheetAgent reaches 38.16%, which is 2.89 absolute points above the ChatGPT Agent baseline (35.27%). Conclusion: SpreadsheetAgent improves the robustness and scalability of spreadsheet understanding for real-world settings such as enterprise reporting, auditing, and scientific data management. Abstract: Spreadsheets are central to real-world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing large language model based approaches typically treat tables as plain text, overlooking critical layout cues and visual semantics. Moreover, real-world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding that adopts a step-by-step reading and reasoning paradigm. Instead of loading the entire spreadsheet at once, SpreadsheetAgent incrementally interprets localized regions through multiple modalities, including code execution results, images, and LaTeX tables. The method first constructs a structural sketch and row/column summaries, and then performs task-driven reasoning over this intermediate representation in the Solving Stage. To further enhance reliability, we design a verification module that validates extracted structures via targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our approach. With GPT-OSS-120B, SpreadsheetAgent achieves 38.16% on Spreadsheet Bench, outperforming the ChatGPT Agent baseline (35.27%) by 2.89 absolute points. These results highlight the potential of SpreadsheetAgent to advance robust and scalable spreadsheet understanding in real-world applications. Code is available at https://github.com/renhouxing/SpreadsheetAgent.git.

Haoran Li,Yulin Chen,Huihao Jing,Wenbin Hu,Tsz Ho Li,Chanhou Lou,Hong Ting Tsang,Sirui Han,Yangqiu Song

Main category: cs.CL

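TL;DR: This paper proposes ContextLens, a framework that uses large language models (LLMs) to ground the input context in the legal domain and assesses compliance by answering a series of structured questions covering applicability, general principles, and detailed provisions. It clearly improves compliance assessment for legal settings such as the GDPR and the EU AI Act, and identifies ambiguous and missing factors in the context.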

Details Motivation: Existing LLM-based privacy and safety evaluation methods usually assume a complete and clear context, but real contexts are often ambiguous and incomplete, which undermines reliable legal compliance judgments. Method: The semi-rule-based ContextLens framework uses LLMs to ground the input context in the legal domain and poses a structured question set covering applicability, general principles, and detailed provisions, guiding the LLM to explicitly identify known and unknown compliance factors instead of directly outputting a safety verdict. Result: On compliance benchmarks covering the GDPR and the EU AI Act, ContextLens clearly outperforms existing training-free baselines and automatically identifies ambiguous points and missing information in the context. Conclusion: ContextLens provides a new pattern for context-sensitive LLM evaluation aimed at legal compliance: it is interpretable, needs no fine-tuning, and can identify sources of uncertainty. Abstract: Individuals' concerns about data privacy and AI safety are highly contextualized and extend beyond sensitive patterns. Addressing these issues requires reasoning about the context to identify and mitigate potential risks. Though researchers have widely explored using large language models (LLMs) as evaluators for contextualized safety and privacy assessments, these efforts typically assume the availability of complete and clear context, whereas real-world contexts tend to be ambiguous and incomplete. In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. Instead of directly assessing safety outcomes, our ContextLens instructs LLMs to answer a set of crafted questions that span applicability, general principles and detailed provisions to assess compliance with pre-defined priorities and rules. We conduct extensive experiments on existing compliance benchmarks that cover the General Data Protection Regulation (GDPR) and the EU AI Act. The results suggest that our ContextLens can significantly improve LLMs' compliance assessment and surpass existing baselines without any training. Additionally, our ContextLens can further identify the ambiguous and missing factors.

[26] CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

Jingbo Yang,Guanyu Yao,Bairu Hou,Xinghan Yang,Nikolai Glushnev,Iwona Bialynicka-Birula,Duo Ding,Shiyu Chang

Main category: cs.CL

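TL;DR: This paper introduces CompliBench, a benchmark for evaluating how well LLM judges detect and localize domain guideline violations in multi-turn dialogues. High-quality synthetic data is built with controllable flaw injection and adversarial search. Current large models perform poorly on the task, while a small judge model fine-tuned on this data performs better and generalizes well.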

Details Motivation: LLM-as-a-Judge is not yet reliable at detecting complex, domain-specific policy violations, mainly because there is no systematic, low-cost, high-fidelity way to generate violation data. Method: CompliBench comes with an automated data generation pipeline that simulates multi-turn user-agent dialogues, uses controllable flaw injection to produce precise violation labels (the violated guideline and the exact conversation turn), and applies adversarial search to increase difficulty. Result: Experiments show that current state-of-the-art proprietary LLMs fall clearly short on this task, whereas a small judge model fine-tuned on the synthetic data surpasses the large models and generalizes well to unseen business domains. Conclusion: CompliBench and its data generation approach provide an effective foundation for building robust generative reward models and underline the importance of high-quality synthetic data for improving LLM judges. Abstract: As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.

[27] ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

Boyang Li,Hongzhe Shou,Yuanyuan Liang,Jingbin Zhang,Fang Zhou

Main category: cs.CL

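TL;DR: This paper proposes ToxiTrace, an explainability-oriented method for Chinese toxic content detection. Its three components (CuSA, GCLoss, and ARCL) improve toxic evidence localization and classification performance while keeping inference efficient and explanations human-readable.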

Details Motivation: Existing Chinese toxic content detection methods are mostly limited to sentence-level classification and cannot localize readable, contiguous toxic evidence spans. Method: The ToxiTrace framework comprises (1) CuSA, which uses lightweight LLM guidance to refine saliency cues from a BERT encoder into fine-grained toxic spans; (2) GCLoss, a gradient-constrained loss that concentrates saliency on toxic evidence tokens and suppresses irrelevant activations; and (3) ARCL, which builds sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Result: Experiments show that ToxiTrace improves both classification accuracy and toxic span extraction while retaining efficient encoder-based inference and producing more coherent, human-readable explanations. The model is released on Hugging Face. Conclusion: ToxiTrace addresses the weak explainability and non-contiguous evidence of Chinese toxic content detection, balancing performance, efficiency, and interpretability. Abstract: Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations. We have released the model at https://huggingface.co/ArdLi/ToxiTrace.

[28] Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Tomer Ashuach,Liat Ein-Dor,Shai Gretz,Yoav Katz,Yonatan Belinkov

Main category: cs.CL

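TL;DR: This paper asks whether large language models have a human-like capacity for introspection, that is, whether their own hidden states help judge answer correctness. Self-representations give no advantage when models agree; when models disagree, they clearly beat other models' representations on factual knowledge tasks but not on math reasoning, and the difference is traced to early-to-mid network layers onward.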

Details Motivation: The study asks whether LLMs hold human-like "privileged knowledge" based on internal hidden states, meaning a private ability to judge the correctness of their own answers that cannot be obtained from external outputs alone. Method: Correctness classifiers are trained on hidden-state representations from the target model itself and from other (peer) models, compared on standard datasets and on subsets where model predictions disagree, with a layer-by-layer analysis of the performance gap. Result: On the full data, self-probes and peer-probes perform comparably; on disagreement subsets, self-probes consistently beat peer-probes on factual knowledge tasks but show no advantage on math reasoning; for factual tasks the advantage emerges progressively from early-to-mid layers onward, while math tasks show no consistently advantaged layer. Conclusion: LLMs do hold probeable privileged knowledge in specific domains (factual knowledge), possibly tied to model-specific memory retrieval; the ability is domain-dependent and localized in the network, not a general introspective capacity. Abstract: Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model's own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.

[29] Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

Ziyang Liu

Main category: cs.CL

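TL;DR: This paper proposes cooperative paging: evicted context segments are replaced with short keyword bookmarks, and the model is given a recall() tool to retrieve them on demand, which improves information recovery in long conversations. On the LoCoMo benchmark it achieves the highest answer quality among six methods, outperforming five baselines (truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context) across several models. Ablations expose the design trade-offs among page boundaries, eviction policies, and bookmark generation, and identify insufficiently distinctive bookmarks as the main remaining bottleneck.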

Details Motivation: When an LLM conversation outgrows the context window, truncation or compression permanently discards information, and the model has no effective way to recover evicted history, so long-range dependencies are lost. Method: Cooperative paging splits the conversation into pages; on eviction, a page is replaced by a keyword bookmark of roughly 8-24 tokens ([pN:keywords]); the model can call a recall() tool to retrieve the full page on demand; the system supports several page boundary strategies (such as fixed_20 and topic_shift) and eviction policies (FIFO, LFU), and explores multiple bookmark generation strategies. Result: On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), ahead of five baselines (p=0.017); in ablations, fixed_20 paging reaches 96.7% while topic_shift reaches only 56.7%; LFU is best on the real data; two new bookmark strategies add 4.4 and 8.7 E2E points; but with insufficiently distinctive bookmarks the model selects the correct page only 57% of the time. Conclusion: Cooperative paging is an efficient and practical approach to long-context management whose main strength is making context management explicit and controllable; the current bottleneck is the discriminative power of keyword bookmarks, so future work should focus on more distinctive lightweight memory anchors. Abstract: When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive.
Keyword specificity alone accounts for a 25 percentage point accuracy difference.
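
The paging mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes FIFO eviction, a hypothetical word-frequency heuristic for bookmark keywords, and an in-memory store; the class and method names are invented for this example, with only the `[pN:keywords]` bookmark format and the `recall()` name taken from the abstract.

```python
from collections import Counter

class PagedContext:
    """Keeps recent pages in context and replaces evicted ones with bookmarks."""

    def __init__(self, max_active_pages=2):
        self.max_active = max_active_pages
        self.store = {}      # page id -> full text, held outside the context window
        self.active = []     # ids of pages currently in context, oldest first
        self.bookmarks = {}  # page id -> "[pN:keywords]" string
        self.next_id = 1

    def _bookmark(self, page_id, text, k=3):
        # Hypothetical heuristic: the k most frequent words longer than 3 letters.
        words = [w.strip(".,!?").lower() for w in text.split()]
        common = Counter(w for w in words if len(w) > 3).most_common(k)
        return f"[p{page_id}:{','.join(w for w, _ in common)}]"

    def add_page(self, text):
        page_id = self.next_id
        self.next_id += 1
        self.store[page_id] = text
        self.active.append(page_id)
        if len(self.active) > self.max_active:
            evicted = self.active.pop(0)  # FIFO eviction
            self.bookmarks[evicted] = self._bookmark(evicted, self.store[evicted])
        return page_id

    def render(self):
        """Context as the model would see it: bookmarks first, then live pages."""
        parts = [self.bookmarks[p] for p in sorted(self.bookmarks)]
        parts += [self.store[p] for p in self.active]
        return "\n".join(parts)

    def recall(self, page_id):
        """Tool the model can call to fetch an evicted page in full."""
        return self.store[page_id]

ctx = PagedContext(max_active_pages=1)
ctx.add_page("Alice adopted a puppy named Biscuit")
ctx.add_page("Bob booked flights to Lisbon")
visible = ctx.render()
restored = ctx.recall(1)
```

After the second page arrives, the first is visible only as a bookmark, and `recall(1)` returns its full text. The finding that bookmark discrimination is the bottleneck corresponds to how `_bookmark` picks keywords: generic keywords make it hard for the model to choose the right page id.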

[30] SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models

SungHo Kim,Juhyeong Park,Eda Atalay,SangKeun Lee

Main category: cs.CL

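TL;DR: This paper proposes SCRIPT, a module that injects compositional knowledge of subcharacters (Jamo) to enhance the subword embeddings of Korean pre-trained language models. It requires no architectural changes or additional pre-training, significantly improves performance on Korean NLU and NLG tasks, and yields an embedding space that better captures grammatical regularities and semantic variation.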

Details Motivation: Most existing Korean language models use subword tokenization, which cannot explicitly capture the morphological and phonological structure that Jamo form inside each Korean character, limiting how well the models represent Korean's rich morphology. Method: Proposes SCRIPT, a model-agnostic module that injects Jamo-level subcharacter structure into the subword embeddings of pre-trained language models, giving finer-grained representations through structural enhancement without architectural changes or re-pre-training. Result: SCRIPT outperforms the baselines across multiple Korean natural language understanding and generation tasks; linguistic analyses show that its embedding space reflects grammatical regularities and semantically cohesive variation more accurately. Conclusion: SCRIPT bridges the gap between the structure of the Korean script and language model representations, offering a scalable, lightweight, and linguistically motivated approach to modeling morphologically rich languages. Abstract: Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model-agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT enhances subword embeddings with structural granularity, without requiring architectural changes or additional pre-training. As a result, SCRIPT improves all baselines across various Korean natural language understanding (NLU) and generation (NLG) tasks. Moreover, beyond performance gains, detailed linguistic analyses show that SCRIPT reshapes the embedding space in a way that better captures grammatical regularities and semantically cohesive variations. Our code is available at https://github.com/SungHo3268/SCRIPT.
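
For readers unfamiliar with Jamo, the sketch below shows how a precomposed Hangul syllable splits into its initial, medial, and optional final Jamo using the arithmetic defined by the Unicode Standard. It illustrates only the compositional structure the abstract refers to; how SCRIPT itself obtains or encodes Jamo is not stated in the abstract, and the function names here are hypothetical.

```python
# Constants from the Unicode Hangul syllable composition scheme.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11,172 precomposed syllables


def decompose_syllable(ch: str) -> tuple:
    """Split one precomposed Hangul syllable into (initial, medial, final) Jamo.

    The final is "" when the syllable has no final consonant. Any other
    character is returned unchanged as a 1-tuple.
    """
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        return (ch,)
    lead = chr(L_BASE + index // (V_COUNT * T_COUNT))
    vowel = chr(V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT)
    t_index = index % T_COUNT
    tail = chr(T_BASE + t_index) if t_index else ""
    return (lead, vowel, tail)


def to_jamo(text: str) -> str:
    """Decompose every Hangul syllable in a string, leaving other characters as-is."""
    return "".join("".join(decompose_syllable(ch)) for ch in text)
```

A subword tokenizer sees a syllable such as 한 as an opaque unit, whereas this decomposition exposes the three Jamo that carry the morphophonological information.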

[31] ReasonXL: Shifting LLM Reasoning Language Without Sacrificing Performance

Daniil Gurgurov,Tom Röhr,Sebastian von Rohrscheidt,Josef van Genabith,Alexander Löser,Simon Ostermann

Main category: cs.CL

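TL;DR: This paper introduces ReasonXL, the first large-scale parallel corpus of reasoning traces covering five European languages, and builds on it a two-stage fine-tuning method (SFT + RLVR) that lets LLMs reason entirely in the target language. The study also reveals a functional division of labor across model layers during language adaptation.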

Details Motivation: Although existing LLMs are multilingual, they still reason predominantly in English, creating a fundamental mismatch for non-English usage scenarios. Method: Builds ReasonXL, a multilingual parallel corpus of reasoning traces; proposes a two-stage SFT + RLVR fine-tuning pipeline; and conducts a representational analysis of how language identification and adaptation are distributed across model depth. Result: The adapted models reason entirely in the target language with performance that matches or exceeds the baseline, with minimal loss of general knowledge and well-preserved cross-lingual transfer. Early layers determine language identity, while upper layers concentrate the adaptation-driven changes; RLVR achieves greater behavioral divergence than SFT with smaller parameter updates. Conclusion: Language-specific reasoning can be achieved with high-quality parallel data and an efficient fine-tuning strategy, and models exhibit an interpretable layer-wise division of language processing. Abstract: Despite advances in multilingual capabilities, most large language models (LLMs) remain English-centric in their training and, crucially, in their production of reasoning traces. Even when tasked with non-English problems, these models predominantly reason in English, creating a fundamental mismatch for non-English usage scenarios. We address this disparity directly with three contributions. (i) We introduce ReasonXL, the first large-scale parallel corpus of cross-domain reasoning traces spanning five European languages (English, German, French, Italian, and Spanish), with over two million aligned samples per language, each comprising prompts, reasoning traces, and final outputs, enabling direct supervision of language-specific reasoning. (ii) Using ReasonXL, we demonstrate that LLMs can be adapted to reason entirely in a desired target language, using a simple two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). The resulting models match or exceed baseline performance, with minimal loss in general knowledge and broadly preserved cross-lingual transfer. (iii) We conduct an extensive representational analysis of the adaptation and find a clear functional division across model depth: early layers contain an activation bottleneck that causally determines language identity, while upper layers concentrate the weight and activation changes driven by adaptation. We further find that RLVR achieves greater behavioral divergence from the base model with smaller parameter updates than SFT, suggesting a more efficient representational rerouting despite much smaller weight updates.

[32] From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue

Jiarui Zhang,Xiangyu Liu,Yong Hu,Chaoyue Niu,Hang Zeng,Shaojie Tang,Fan Wu,Guihai Chen

Main category: cs.CL

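TL;DR: This paper proposes DialRouter, a long-horizon sequential routing method for multi-turn dialogue. It explores dialogue branches with MCTS and learns a lightweight routing policy from the results, enabling efficient multi-turn LLM selection without online search and significantly improving task success rate and the performance-cost trade-off.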

Details Motivation: Existing LLM routing methods work well in single-turn settings but fail to maximize cumulative performance in multi-turn dialogue because of interaction dynamics and delayed rewards. Method: Proposes DialRouter, which first uses Monte Carlo tree search (MCTS) to explore the dialogue branches induced by different LLM selections and collect trajectories with high cumulative reward, then learns a lightweight routing policy from the search data, combined with retrieval-based future state approximation, so that multi-turn routing needs no online search. Result: On open-domain and domain-specific dialogue tasks, DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate; combined with a cost-aware reward, it shows a better performance-cost trade-off. Conclusion: Long-horizon sequential routing is key to improving LLM collaboration in multi-turn dialogue, and DialRouter offers a practical framework for efficient, low-cost multi-turn LLM routing. Abstract: Multi-turn dialogue is the predominant form of interaction with large language models (LLMs). While LLM routing is effective in single-turn settings, existing methods fail to maximize cumulative performance in multi-turn dialogue due to interaction dynamics and delayed rewards. To address this challenge, we move from myopic, single-turn selection to long-horizon sequential routing for multi-turn dialogue. Accordingly, we propose DialRouter, which first performs MCTS to explore dialogue branches induced by different LLM selections and collect trajectories with high cumulative rewards. DialRouter then learns a lightweight routing policy from search-derived data, augmented with retrieval-based future state approximation, enabling multi-turn routing without online search. Experiments on both open-domain and domain-specific dialogue tasks across diverse candidate sets of both open-source and closed-source LLMs demonstrate that DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate, while achieving a superior performance-cost trade-off when combined with a cost-aware reward.

[33] KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates

Yudong Li,Jiawei Cai,Linlin Shen

Main category: cs.CL

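TL;DR: This paper proposes KoCo (Knowledge Coordinate Conditioning), which maps each document to a three-dimensional semantic coordinate and prepends it as a textual prefix during LLM pre-training, strengthening the model's contextual awareness of real-world knowledge structure. Experiments show that the method significantly improves downstream performance, speeds up convergence, and mitigates hallucination.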

Details Motivation: Standard LLM pre-training treats corpora as flattened token sequences and ignores the real-world context that humans naturally rely on; this paper aims to close that gap by giving the model explicit contextual awareness. Method: Proposes KoCo, which maps each document into a three-dimensional semantic coordinate space and adds the coordinate to the pre-training input as a textual prefix, so that the model learns where knowledge sits within real-world structure during training. Result: KoCo significantly improves performance on 10 downstream tasks, accelerates pre-training convergence by approximately 30%, and better separates stable facts from noise, mitigating hallucination in generated content. Conclusion: Explicitly modeling knowledge coordinates is a simple and effective way to strengthen an LLM's grasp of knowledge structure, improving generalization and factual consistency. Abstract: Standard Large Language Model (LLM) pre-training typically treats corpora as flattened token sequences, often overlooking the real-world context that humans naturally rely on to contextualize information. To bridge this gap, we introduce Knowledge Coordinate Conditioning (KoCo), a simple method that maps every document into a three-dimensional semantic coordinate. By prepending these coordinates as textual prefixes for pre-training, we aim to equip the model with explicit contextual awareness to learn the documents within the real-world knowledge structure. Experimental results demonstrate that KoCo significantly enhances performance across 10 downstream tasks and accelerates pre-training convergence by approximately 30%. Furthermore, our analysis indicates that explicitly modeling knowledge coordinates helps the model distinguish stable facts from noise, effectively mitigating hallucination in generated outputs.

[34] Agentic Insight Generation in VSM Simulations

Micha Selak,Dirk Krechel,Adrian Ulges,Sven Spieckermann,Niklas Stoehr,Andreas Loehr

Main category: cs.CL

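TL;DR: This paper proposes a decoupled, two-step agentic architecture for extracting actionable insights from complex value stream map simulations. By separating orchestration from data analysis and incorporating domain expert knowledge, it improves the ability to pick up on subtle situational differences.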

Details Motivation: Existing LLMs handle raw data well but struggle to capture the subtle situational differences needed to distinguish similar data sources in the value stream mapping domain, which makes extracting actionable insights difficult, time-consuming, and error-prone. Method: Proposes a decoupled, two-step agentic architecture: the first step is orchestration, which selects data sources and performs multi-hop reasoning across data structures; the second is data analysis, which combines progressive data discovery with domain expert knowledge while keeping the internal context slim. Result: The framework is validated on multiple state-of-the-art LLMs, with top-tier models reaching up to 86% accuracy and showing high robustness across evaluation runs. Conclusion: The decoupled architecture improves how well LLMs handle situational sensitivity in value stream map analysis and offers a new approach to LLM-based decision support in industrial settings. Abstract: Extracting actionable insights from complex value stream map simulations can be challenging, time-consuming, and error-prone. Recent advances in large language models offer new avenues to support users with this task. While existing approaches excel at processing raw data to gain information, they are structurally unfit to pick up on subtle situational differences needed to distinguish similar data sources in this domain. To address this issue, we propose a decoupled, two-step agentic architecture. By separating orchestration from data analysis, the system leverages progressive data discovery infused with domain expert knowledge. This architecture allows the orchestration to intelligently select data sources and perform multi-hop reasoning across data structures while maintaining a slim internal context. Results from multiple state-of-the-art large language models demonstrate the framework's viability, with top-tier models achieving accuracies of up to 86% and showing high robustness across evaluation runs.

[35] Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

Sihang Jia,Shuliang Liu,Songbo Yang,Yibo Yan,Xin Zou,Xuming Hu

Main category: cs.CL

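TL;DR: This paper proposes Decoding by Perturbation (DeP), a training-free, decoding-time textual perturbation method. It applies multi-level textual perturbations through a dynamic probe, uses attention variance to reinforce stable visual evidence, and builds an interpretable prior drift direction from logits statistics, mitigating hallucinations in multimodal LLMs that arise when language priors dominate.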

Details Motivation: Multimodal LLMs often hallucinate at inference because language priors override visual evidence; existing training-free mitigations either break the natural image distribution or harm the model's generative fluency. Method: Proposes Decoding by Perturbation (DeP): (1) a dynamic probe applies multi-level textual perturbations to elicit latent language priors; (2) attention variance is used to reinforce stable visual regions and suppress suspicious noise in the feature space; (3) an interpretable prior drift direction is built from logits statistics to correct probability biases caused by textual co-occurrence. Result: Significantly reduces hallucination rates and improves performance across multiple benchmarks without any additional training. Conclusion: DeP provides an efficient, interpretable, training-free decoding intervention and a new perspective on multimodal hallucination, framed as the sensitivity of visual grounding to textual phrasing. Abstract: Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model's inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.

[36] GLeMM: A large-scale multilingual dataset for morphological research

Hathout Nabil,Basilio Calderone,Fiammetta Namer,Franck Sajous

Main category: cs.CL

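TL;DR: This paper introduces GLeMM, a large multilingual derivational resource for morphological experimentation and data-driven description, characterized by automated construction, cross-lingual consistency, automatic annotation of morphological features, and semantic descriptions for part of its entries.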

Details Motivation: Traditional research on derivational morphology relies on intuition and small datasets, which makes it hard to replicate and generalize. Method: Builds GLeMM fully automatically from Wiktionary entries, covering seven European languages, with automatic annotation of morphological features on every entry and encoded semantic descriptions for a subset. Result: Delivers a large-scale, multilingual, automated, and reproducible derivational resource that supports research on form-meaning relations and experimental validation of computational methods. Conclusion: GLeMM provides a reliable data foundation for derivational morphology and moves the field toward data-driven, verifiable research. Abstract: In derivational morphology, what mechanisms govern the variation in form-meaning relations between words? Answers to this type of question are typically based on intuition and on observations drawn from limited data, even when a wide range of languages is considered. Many of these studies are difficult to replicate and generalize. To address this issue, we present GLeMM, a new derivational resource designed for experimentation and data-driven description in morphology. GLeMM is characterized by (i) its large size, (ii) its extensive coverage (currently amounting to seven European languages, i.e., German, English, Spanish, French, Italian, Polish, Russian), (iii) its fully automated design, identical across all languages, (iv) the automatic annotation of morphological features on each entry, as well as (v) the encoding of semantic descriptions for a significant subset of these entries. It enables researchers to address difficult questions, such as the role of form and meaning in word-formation, and to develop and experimentally test computational methods that identify the structures of derivational morphology. The article describes how GLeMM is created using Wiktionary articles and presents various case studies illustrating possible applications of the resource.

[37] Latent-Condensed Transformer for Efficient Long Context Modeling

Zeng You,Yaofo Chen,Qiuwu Chen,Ying Sun,Shuhai Zhang,Yingjian Li,Yaowei Wang,Mingkui Tan

Main category: cs.CL

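TL;DR: This paper proposes Latent-Condensed Attention (LCA), which condenses context directly within MLA's latent space by applying query-aware pooling to semantic vectors and anchor selection to positional keys, jointly reducing computation and KV cache without adding parameters. It comes with a length-independent error bound, and experiments show up to 2.5x prefilling speedup and 90% KV cache reduction at 128K context.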

Details Motivation: Existing methods such as MLA and sparse attention separately address the linear growth of the KV cache and the quadratic complexity of self-attention, but sparse mechanisms cannot operate natively on MLA's compressed latent space, so the opportunity for joint optimization is missed. Method: Proposes Latent-Condensed Attention (LCA): within MLA's latent space the representation is disentangled into semantic latent vectors and positional keys; semantic vectors are aggregated by query-aware pooling and key positional keys are preserved by anchor selection. The design is architecture-agnostic and extends to mechanisms such as GQA, and a length-independent error bound is proved. Result: Experiments show up to 2.5x prefilling speedup and 90% KV cache reduction at 128K context length while maintaining competitive model performance. Conclusion: LCA compresses computation and memory cost together without extra parameters, is general and theoretically grounded, and offers an efficient, scalable attention optimization for long-context LLMs. Abstract: Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA's compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA's design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5$\times$ prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.

[38] Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe

Mahounan Pericles Adjovi,Roald Eiselen,Prasenjit Mitra

Main category: cs.CL

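TL;DR: This paper examines whether strategic prompting can extract usable text data from large language models (LLMs) for two low-resource West African languages, Hausa and Fongbe. It finds that GPT-4o Mini is far more efficient at extraction than Gemini and that the two languages call for different prompting strategies.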

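Details Motivation: LLMs are trained on data from low-resource language communities, yet the linguistic knowledge they hold is reachable only through commercial APIs and is hard to return to those communities; this paper asks whether prompt engineering can extract usable text in these languages cheaply and effectively. Method: Systematically compares six elicitation task types on two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash) for Hausa and Fongbe, measures the number of usable target-language words produced per API call, and analyzes how the best prompting strategy depends on the language. Result: GPT-4o Mini extracts 6-41 times more usable target-language words per API call than Gemini; Hausa does best with functional text and dialogue prompts, whereas Fongbe depends on constrained generation prompts. Conclusion: Strategic prompting can extract low-resource language data from commercial LLMs, but it must be adapted per language; the approach offers a low-cost, reproducible route to building low-resource corpora, and all generated corpora and code are released. Abstract: Large language models (LLMs) are trained on data contributed by low-resource language communities, yet the linguistic knowledge encoded in these models remains accessible only through commercial APIs. This paper investigates whether strategic prompting can extract usable text data from LLMs for two West African languages: Hausa (Afroasiatic, approximately 80 million speakers) and Fongbe (Niger-Congo, approximately 2 million speakers). We systematically compare six elicitation task types across two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash). GPT-4o Mini extracts 6-41 times more usable target-language words per API call than Gemini. Optimal strategies differ by language: Hausa benefits from functional text and dialogue, while Fongbe requires constrained generation prompts. We release all generated corpora and code.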

[39] Meet Dynamic Individual Preferences: Resolving Conflicting Human Value with Paired Fine-Tuning

Shanyong Wang,Shuhang Lin,Yining Zhao,Xi Zhu,Yongfeng Zhang

Main category: cs.CL

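TL;DR: This paper proposes Preference-Paired Fine-Tuning (PFT), a new framework for aligning LLMs with individual preferences that are dynamic and possibly contradictory, and builds the Value Conflict Dilemma (VCD) dataset for evaluation. Experiments show that PFT performs strongly on multi-choice classification and open-ended generation, and clearly outperforms baselines such as DPO and SFT, particularly with conflicting preferences and limited user history.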

Details Motivation: Existing LLMs align well with aggregate human preferences but struggle to adapt to individual preferences that are diverse and change over time, especially when those preferences conflict. Method: Proposes the Preference-Paired Fine-Tuning (PFT) framework and builds the Value Conflict Dilemma (VCD) dataset of preference-conflict scenarios, modeling individual preferences through paired-preference fine-tuning. Result: PFT reaches 96.6% accuracy on multi-choice classification and an open-ended generation score of 8.69; it improves user-specific preference alignment by 44.76% over single-preference models; overall it significantly outperforms DPO, SFT, and similar methods. Conclusion: PFT is an effective approach to the diversity, dynamics, and contradictions of individual preferences and a practical path toward personalized LLM alignment. Abstract: Recent advances in large language models (LLMs) have significantly improved the alignment of models with general human preferences. However, a major challenge remains in adapting LLMs to individual preferences, which are not only diverse but also dynamic. In this paper, we introduce a novel framework, Preference-Paired Fine-Tuning (PFT), designed to align models with contradictory and evolving individual preferences. We present a new dataset, Value Conflict Dilemma (VCD), which includes scenarios that involve conflicting human preferences, facilitating the evaluation of our approach. Our experiments demonstrate that PFT outperforms single-preference training methods, achieving up to 96.6% accuracy in multi-choice classification tasks and the highest open-ended generation score of 8.69. PFT also shows significant improvements over DPO, SFT and some traditional training methods, especially when handling conflicting preferences. Additionally, with limited user history data, models can infer preference vectors rapidly, achieving a 44.76% improvement in user-specific preference alignment in comparison to single-preference models.

[40] KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning

Shuai Wang,Yinan Yu

Main category: cs.CL

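TL;DR: This paper proposes KG-Reasoner, a framework that uses reinforcement learning to train an LLM to perform end-to-end multi-hop reasoning over knowledge graphs with dynamic exploration and backtracking, significantly improving performance on knowledge-intensive complex question answering.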

Details Motivation: Existing KBQA methods decompose multi-hop reasoning into fixed pipeline steps, which makes reasoning inflexible, fragments the decision process, and loses intermediate information; LLMs, though strong at language understanding, underperform on knowledge-intensive reasoning. Method: Proposes KG-Reasoner, an end-to-end framework that folds multi-step KG reasoning into a single "thinking" phase of the LLM; reinforcement learning trains the LLM to internalize graph traversal, supporting dynamic path exploration and backtracking when needed. Result: On eight multi-hop and knowledge-intensive reasoning benchmarks, KG-Reasoner matches or exceeds state-of-the-art methods. Conclusion: Integrating structured knowledge graph reasoning into the LLM's unified reasoning process, optimized with reinforcement learning, is an effective way to improve knowledge-intensive complex reasoning. Abstract: Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge-intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi-hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG-Reasoner, an end-to-end framework that integrates multi-step reasoning into a unified "thinking" phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths and backtrack when necessary. Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks demonstrate that KG-Reasoner achieves competitive or superior performance compared to the state-of-the-art methods. Code is available at the repository: https://github.com/Wangshuaiia/KG-Reasoner.

[41] Calibrated Confidence Estimation for Tabular Question Answering

Lukas Voss

Main category: cs.CL

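TL;DR: This paper systematically evaluates five confidence estimation methods for LLMs on tabular question answering and finds that current models are severely overconfident. It proposes Multi-Format Agreement (MFA), an agreement-based method built on multi-format serialization, which markedly improves calibration and lowers API cost.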

Details Motivation: LLMs are widely used for tabular question answering, but their confidence calibration on structured data has not been studied systematically. Method: Compares five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks; proposes Multi-Format Agreement (MFA), which estimates confidence from lossless, deterministic serialization variants of structured data (Markdown, HTML, JSON, CSV); and introduces structure-aware recalibration. Result: All models are severely overconfident (smooth ECE 0.35-0.64). Perturbation methods, including MFA, reach AUROC 0.78-0.86, compared with 0.42-0.76 for self-evaluation methods. MFA reduces ECE by 44-63% and reaches a mean AUROC of 0.80 on TableBench; an ensemble with self-consistency raises AUROC from 0.74 to 0.82; structure-aware recalibration improves AUROC by about 10 percentage points. Conclusion: Perturbation-based confidence estimation, MFA in particular, clearly outperforms self-evaluation in tabular QA. MFA is both efficient (20% lower API cost) and general, it can be combined with sampling for further gains, and structure-aware recalibration is a useful complement. Abstract: Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully-covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p<0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44-63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods.
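
The idea behind Multi-Format Agreement can be sketched as follows: serialize the same table four ways, ask the question once per format, and treat the share of formats that agree with the majority answer as a confidence score. This is an illustrative reading of the abstract, not the paper's implementation; the `ask` callable (a stand-in for an LLM call), the majority-vote scoring rule, and both function names are assumptions, and the toy serializers do no escaping.

```python
import csv
import io
import json
from collections import Counter


def serialize(header, rows):
    """Return the same table as Markdown, HTML, JSON, and CSV text (no escaping)."""
    md_lines = ["| " + " | ".join(header) + " |",
                "| " + " | ".join("---" for _ in header) + " |"]
    md_lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    html = "<table><tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr>"
    html += "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>" for row in rows
    ) + "</table>"
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return {
        "markdown": "\n".join(md_lines),
        "html": html,
        "json": json.dumps([dict(zip(header, row)) for row in rows]),
        "csv": buf.getvalue(),
    }


def multi_format_agreement(header, rows, question, ask):
    """Ask once per format; confidence is the share of formats backing the majority answer.

    `ask(table_text, question) -> str` is a hypothetical stand-in for a model call.
    """
    answers = [ask(table, question).strip().lower()
               for table in serialize(header, rows).values()]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)
```

Because the four serializations are deterministic, this needs exactly four calls per question, whereas sampling-based baselines typically draw more samples; the abstract reports a 20% lower API cost without stating the exact call counts.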

[42] Latent Planning Emerges with Scale

Michael Hanna,Emmanuel Ameisen

Main category: cs.CL

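TL;DR: This paper introduces the notion of "latent planning" and studies whether and how the Qwen-3 family plans internally on simple and complex tasks. It finds that planning ability grows with model scale and offers a measurement framework along with mechanistic evidence.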

Details Motivation: Investigates whether LLMs plan internally without verbalizing a plan, and how that ability changes with model scale. Method: Defines latent planning as the presence of internal representations that causally steer the generation of a future token and shape the preceding context; studies Qwen-3 (0.6B-14B) on simple planning tasks (such as article choice) and complex ones (such as completing rhyming couplets), combining feature analysis with controlled steering experiments. Result: Latent planning ability increases with parameter count; the less successful 4B-8B models already have nascent planning mechanisms; on the rhyming task, models often identify the rhyme word ahead of time but plan only a short distance ahead; steering can elicit planning behavior that strengthens with scale. Conclusion: LLMs have a detectable, measurable latent planning ability whose strength and mechanistic maturity grow systematically with scale, offering a new perspective and an empirical framework for understanding LLM reasoning. Abstract: LLMs can perform seemingly planning-intensive tasks, like writing coherent stories or functioning code, without explicitly verbalizing a plan; however, the extent to which they implicitly plan is unknown. In this paper, we define latent planning as occurring when LLMs possess internal planning representations that (1) cause the generation of a specific future token or concept, and (2) shape preceding context to license said future token or concept. We study the Qwen-3 family (0.6B-14B) on simple planning tasks, finding that latent planning ability increases with scale. Models that plan possess features that represent a planned-for word like "accountant", and cause them to output "an" rather than "a"; moreover, even the less-successful Qwen-3 4B-8B have nascent planning mechanisms. On the more complex task of completing rhyming couplets, we find that models often identify a rhyme ahead of time, but even large models seldom plan far ahead. However, we can elicit some planning that increases with scale when steering models towards planned words in prose. In sum, we offer a framework for measuring planning and mechanistic evidence of how models' planning abilities grow with scale.

[43] Topology-Aware Reasoning over Incomplete Knowledge Graph with Graph-Based Soft Prompting

Shuai Wang,Xixi Wang,Yinan Yu

Main category: cs.CL

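TL;DR: This paper proposes GraSP, a graph-based soft prompting framework in which a GNN encodes subgraphs into soft prompts so that an LLM can reason at the subgraph level in multi-hop KBQA, mitigating the effects of knowledge graph incompleteness. A two-stage LLM paradigm balances efficiency and performance.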

Details Motivation: Most existing multi-hop KBQA methods rely on explicit edge traversal, which makes them sensitive to knowledge graph incompleteness and vulnerable to missing edges, while LLMs tend to hallucinate on knowledge-intensive tasks. Method: Proposes the graph-based soft prompting (GraSP) framework: (1) a GNN encodes the extracted structural subgraph into soft prompts; (2) a two-stage LLM paradigm is used, in which a lightweight LLM first identifies relevant entities and relations and a stronger LLM then generates an evidence-aware answer. Result: Experiments on four multi-hop KBQA benchmarks show state-of-the-art performance on three of them. Conclusion: Subgraph-level soft prompts improve the LLM's grasp of structural context and reduce its dependence on local KG completeness; the two-stage design lowers computational cost while keeping performance high. Abstract: Large Language Models (LLMs) have shown remarkable capabilities across various tasks but remain prone to hallucinations in knowledge-intensive scenarios. Knowledge Base Question Answering (KBQA) mitigates this by grounding generation in Knowledge Graphs (KGs). However, most multi-hop KBQA methods rely on explicit edge traversal, making them fragile to KG incompleteness. In this paper, we propose a novel graph-based soft prompting framework that shifts the reasoning paradigm from node-level path traversal to subgraph-level reasoning. Specifically, we employ a Graph Neural Network (GNN) to encode extracted structural subgraphs into soft prompts, enabling the LLM to reason over richer structural context and identify relevant entities beyond immediate graph neighbors, thereby reducing sensitivity to missing edges. Furthermore, we introduce a two-stage paradigm that reduces computational cost while preserving good performance: a lightweight LLM first leverages the soft prompts to identify question-relevant entities and relations, followed by a more powerful LLM for evidence-aware answer generation. Experiments on four multi-hop KBQA benchmarks show that our approach achieves state-of-the-art performance on three of them, demonstrating its effectiveness. Code is available at the repository: https://github.com/Wangshuaiia/GraSP.

[44] Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

Linhao Zhang,Yuhan Song,Aiwei Liu,Chuhan Wu,Sijun Zhang,Wei Jia,Yuan Liu,Houfeng Wang,Xiao Zhou

Main category: cs.CL

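TL;DR: This paper proposes the Unified Audio Schema (UAS), which uses structured supervision (transcription, paralinguistic features, non-linguistic events) to address AudioLLMs' bottleneck in fine-grained acoustic perception, substantially improving perception without harming reasoning.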

Details Motivation: ASR-centric training leads existing AudioLLMs to suppress paralinguistic cues and acoustic events, so their fine-grained auditory perception lags behind their complex reasoning. Method: Designs a three-part audio information structure (Transcription, Paralinguistics, Non-linguistic Events) in a unified JSON format, builds the Unified Audio Schema (UAS) supervision framework on it, and applies it to both discrete and continuous AudioLLM architectures. Result: Validated on MMSU, MMAR, and MMAU, with a 10.9% gain in fine-grained perception on MMSU while retaining strong reasoning. Conclusion: Structured, multi-dimensional audio supervision overcomes the limits of the ASR paradigm and improves perception and reasoning together. Abstract: Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine-grained acoustic perception. We attribute this gap to a fundamental limitation of ASR-centric training, which provides precise linguistic targets but implicitly teaches models to suppress paralinguistic cues and acoustic events as noise. To address this, we propose Unified Audio Schema (UAS), a holistic and structured supervision framework that organizes audio information into three explicit components -- Transcription, Paralinguistics, and Non-linguistic Events -- within a unified JSON format. This design achieves comprehensive acoustic coverage without sacrificing the tight audio-text alignment that enables reasoning. We validate the effectiveness of this supervision strategy by applying it to both discrete and continuous AudioLLM architectures. Extensive experiments on MMSU, MMAR, and MMAU demonstrate that UAS-Audio yields consistent improvements, boosting fine-grained perception by 10.9% on MMSU over the same-size state-of-the-art models while preserving robust reasoning capabilities. Our code and model are publicly available at https://github.com/Tencent/Unified_Audio_Schema.
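
As a rough illustration of what a three-component JSON supervision target might look like, the sketch below builds and validates one record. The abstract names the three components but not the schema itself, so the key names, the shape of each field, and both helper functions are assumptions rather than the released format.

```python
import json

# Hypothetical key names for the three components named in the abstract.
REQUIRED_KEYS = ("transcription", "paralinguistics", "non_linguistic_events")


def build_target(transcription, paralinguistics, events):
    """Serialize one audio clip's annotations as a single JSON training target."""
    record = {
        "transcription": transcription,
        "paralinguistics": paralinguistics,
        "non_linguistic_events": events,
    }
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def is_valid_target(text):
    """Check that a model output parses as JSON and carries all three components."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and all(key in record for key in REQUIRED_KEYS)


example = build_target(
    "Could you close the window?",
    {"emotion": "irritated", "speaking_rate": "fast"},  # illustrative fields only
    ["traffic noise", "door slam"],
)
```

Keeping the transcription as one field among three, instead of the only target, is what lets the supervision cover paralinguistic and non-linguistic content while the text stays aligned with the audio.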

[45] Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis

Kang He,Yuzhe Ding,Xinrong Wang,Fei Li,Chong Teng,Donghong Ji

Main category: cs.CL

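TL;DR: This paper proposes the EBMC framework, which strengthens weak-modality representations through semantic disentanglement and cross-modal enhancement, and adds energy-guided modality coordination and instance-aware modality trust distillation to ease modality competition, rebalance gradients, and fuse adaptively. It achieves state-of-the-art or competitive results in multimodal sentiment analysis and is robust to missing modalities.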

Details Motivation: Existing methods struggle to make full use of weaker modalities such as audio and vision; dominant modalities tend to overshadow non-verbal ones, causing modality competition, degraded fusion, and poor robustness under noisy or missing modalities. Method: Proposes the Enhance-then-Balance Modality Collaboration (EBMC) framework: (1) semantic disentanglement and cross-modal enhancement to strengthen weak-modality representations; (2) an energy-guided modality coordination mechanism that achieves implicit gradient rebalancing through a differentiable equilibrium objective; (3) instance-aware modality trust distillation, which estimates sample-level modality reliability and adaptively adjusts fusion weights. Result: EBMC achieves state-of-the-art or competitive performance on multimodal sentiment analysis and remains robust under missing-modality settings. Conclusion: EBMC eases modality imbalance and competition, improves the quality and robustness of multimodal fusion, and provides a more reliable, adaptive modeling approach for real-world multimodal sentiment analysis. Abstract: Multimodal sentiment analysis (MSA) integrates heterogeneous text, audio, and visual signals to infer human emotions. While recent approaches leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities. In practice, dominant modalities tend to overshadow non-verbal ones, inducing modality competition and limiting overall contributions. This imbalance degrades fusion performance and robustness under noisy or missing modalities. To address this, we propose a novel model, the Enhance-then-Balance Modality Collaboration (EBMC) framework. EBMC improves representation quality via semantic disentanglement and cross-modal enhancement, strengthening weaker modalities. To prevent dominant modalities from overwhelming others, an Energy-guided Modality Coordination mechanism achieves implicit gradient rebalancing via a differentiable equilibrium objective. Furthermore, Instance-aware Modality Trust Distillation estimates sample-level reliability to adaptively modulate fusion weights, ensuring robustness. Extensive experiments demonstrate that EBMC achieves state-of-the-art or competitive results and maintains strong performance under missing-modality settings.

[46] When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP

Mahounan Pericles Adjovi,Roald Eiselen,Prasenjit Mitra

Main category: cs.CL

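TL;DR: This paper evaluates two data augmentation methods (LLM generation and back-translation) on Hausa and Fongbe and finds that the effect depends on task type rather than on language or LLM quality. Neither method improves NER, while effects on POS tagging vary by language and method, indicating that task structure matters more than synthetic data quality.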

Details Motivation: Data scarcity limits NLP development for low-resource African languages, so effective data augmentation methods are needed. Method: Evaluates two augmentation methods, LLM generation (Gemini 2.5 Flash) and back-translation (NLLB-200), on NER (MasakhaNER 2.0) and POS tagging (MasakhaPOS) for Hausa and Fongbe. Result: For NER, neither method improves performance, and LLM augmentation slightly lowers F1. For POS tagging, LLM augmentation improves Fongbe accuracy by 0.33% and back-translation improves Hausa by 0.17%, while back-translation lowers Fongbe accuracy by 0.35% and has a negligible effect on Hausa. Conclusion: Augmentation outcomes are governed by task structure rather than LLM generation quality; augmentation should be treated as a task-specific intervention, not a general preprocessing step. Abstract: Data scarcity limits NLP development for low-resource African languages. We evaluate two data augmentation methods -- LLM-based generation (Gemini 2.5 Flash) and back-translation (NLLB-200) -- for Hausa and Fongbe, two West African languages that differ substantially in LLM generation quality. We assess augmentation on named entity recognition (NER) and part-of-speech (POS) tagging using MasakhaNER 2.0 and MasakhaPOS benchmarks. Our results reveal that augmentation effectiveness depends on task type rather than language or LLM quality alone. For NER, neither method improves over baseline for either language; LLM augmentation reduces Hausa NER by 0.24% F1 and Fongbe NER by 1.81% F1. For POS tagging, LLM augmentation improves Fongbe by 0.33% accuracy, while back-translation improves Hausa by 0.17%; back-translation reduces Fongbe POS by 0.35% and has negligible effect on Hausa POS. The same LLM-generated synthetic data produces opposite effects across tasks for Fongbe -- hurting NER while helping POS -- suggesting task structure governs augmentation outcomes more than synthetic data quality. These findings challenge the assumption that LLM generation quality predicts augmentation success, and provide actionable guidance: data augmentation should be treated as a task-specific intervention rather than a universally beneficial preprocessing step.

[47] FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing

Peng Wang,Biyu Zhou,Xuehai Tang,Jizhong Han,Songlin Hu

Main category: cs.CL

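TL;DR: This paper proposes FABLE, a hierarchical framework that decouples fine-grained fact injection from holistic text generation to improve access to fine-grained facts in model editing, and introduces the UnFine benchmark for systematic evaluation.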

Details Motivation: Existing unstructured model editing methods tend to memorize text holistically and lack reliable access to fine-grained facts. Method: Proposes FABLE, a hierarchical framework with a two-stage, fact-first strategy: discrete facts are anchored in shallow layers, then deeper layers receive minimal updates to produce coherent text. Also builds UnFine, a diagnostic benchmark with fine-grained question-answer pairs and fact-level metrics. Result: Experiments show that FABLE substantially improves fine-grained question answering while maintaining state-of-the-art holistic editing performance. Conclusion: By decoupling fact injection from text generation, FABLE fits the unidirectional flow of the Transformer and resolves the mismatch between holistic recall and fine-grained fact access. Abstract: Unstructured model editing aims to update models with real-world text, yet existing methods often memorize text holistically without reliable fine-grained fact access. To address this, we propose FABLE, a hierarchical framework that decouples fine-grained fact injection from holistic text generation. FABLE follows a two-stage, fact-first strategy: discrete facts are anchored in shallow layers, followed by minimal updates to deeper layers to produce coherent text. This decoupling resolves the mismatch between holistic recall and fine-grained fact access, reflecting the unidirectional Transformer flow in which surface-form generation amplifies rather than corrects underlying fact representations. We also introduce UnFine, a diagnostic benchmark with fine-grained question-answer pairs and fact-level metrics for systematic evaluation. Experiments show that FABLE substantially improves fine-grained question answering while maintaining state-of-the-art holistic editing performance. Our code is publicly available at https://github.com/caskcsg/FABLE.

[48] Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

Xudong Wang,Chaoning Zhang,Qigan Sun,Zhenzhen Huang,Chang Lu,Sheng Zheng,Zeyu Ma,Caiyan Qin,Yang Yang,Hengtao Shen

Main category: cs.CL

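TL;DR: This paper proposes Tri-RAG, a framework that converts external knowledge into Condition-Proof-Conclusion triplets to improve retrieval precision and context efficiency, mitigating redundant information, semantic misalignment, and fragmented reasoning in RAG.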

Details Motivation: Existing RAG methods directly concatenate unstructured text fragments, which causes redundant context, poor semantic alignment, and broken reasoning chains, degrading generation quality and increasing token consumption. Method: Tri-RAG uses lightweight prompt-based adaptation to automatically convert natural-language knowledge into standardized triplets (Condition, Proof, Conclusion) and uses the Condition as a semantic anchor for precise retrieval, avoiding concatenation of long raw texts. Result: Across multiple benchmark datasets, Tri-RAG significantly improves retrieval quality and reasoning efficiency, with more stable generation and more efficient resource use. Conclusion: Structured triplet representations combined with reasoning-aligned context construction balance retrieval accuracy and token efficiency, offering a new paradigm for RAG. Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in large language models (LLMs) by incorporating external knowledge during generation. However, the effectiveness of RAG depends not only on the design of the retriever and the capacity of the underlying model, but also on how retrieved evidence is structured and aligned with the query. Existing RAG approaches typically retrieve and concatenate unstructured text fragments as context, which often introduces redundant or weakly relevant information. This practice leads to excessive context accumulation, reduced semantic alignment, and fragmented reasoning chains, thereby degrading generation quality while increasing token consumption. To address these challenges, we propose Tri-RAG, a structured triplet-based retrieval framework that improves retrieval efficiency through reasoning-aligned context construction. Tri-RAG automatically transforms external knowledge from natural language into standardized structured triplets consisting of Condition, Proof, and Conclusion, explicitly capturing logical relations among knowledge fragments using lightweight prompt-based adaptation with frozen model parameters. Building on this representation, the triplet head Condition is treated as an explicit semantic anchor for retrieval and matching, enabling precise identification of query-relevant knowledge units without directly concatenating lengthy raw texts. As a result, Tri-RAG achieves a favorable balance between retrieval accuracy and context token efficiency. Experimental results across multiple benchmark datasets demonstrate that Tri-RAG significantly improves retrieval quality and reasoning efficiency, while producing more stable generation behavior and more efficient resource utilization in complex reasoning scenarios.
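
Illustrative sketch (not from the paper): the Tri-RAG abstract above describes storing knowledge as Condition-Proof-Conclusion triplets and matching queries against the Condition only. The snippet below is a minimal illustration of that idea under stated assumptions. The `Triplet` class, the `retrieve` and `build_context` helpers, and the token-overlap score are all hypothetical stand-ins; the abstract does not say how matching is scored, and a real system would more likely use embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one Condition-Proof-Conclusion unit."""
    condition: str
    proof: str
    conclusion: str


def _tokens(text):
    # Crude lowercase whitespace tokenization (assumption, for illustration only).
    return set(text.lower().split())


def retrieve(query, triplets, k=1):
    """Rank triplets by Jaccard overlap between the query and the Condition only."""
    q = _tokens(query)

    def score(t):
        c = _tokens(t.condition)
        union = q | c
        return len(q & c) / len(union) if union else 0.0

    return sorted(triplets, key=score, reverse=True)[:k]


def build_context(hits):
    """Render the retrieved triplets as a compact context string."""
    return "\n".join(
        f"If {t.condition}, then {t.conclusion} (because {t.proof})." for t in hits
    )


kb = [
    Triplet("water is heated to 100 C at sea level",
            "vapor pressure equals atmospheric pressure", "water boils"),
    Triplet("iron is exposed to moist air", "iron oxidizes", "rust forms"),
]
hits = retrieve("what happens when iron is exposed to air", kb)
print(build_context(hits))
```

Only the short Condition strings are compared with the query, so the context passed to the generator stays one rendered sentence per hit instead of a long raw passage.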

[49] Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data

Vadim Borisov

Main category: cs.CL

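TL;DR: This paper builds a synthetic multi-label emotion classification dataset of over one million samples covering 23 languages and 11 emotion categories, systematically evaluates several multilingual Transformer models on it, and shows that XLM-R-Large matches or exceeds English-only specialist models in zero-shot cross-lingual transfer.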

Details Motivation: Existing emotion classification datasets are predominantly English, single-label, and cover few languages, which severely constrains multilingual emotion analysis. Method: Using culturally adapted prompt-based generation and programmatic quality filtering, the author builds a large-scale synthetic training set of over 1M multi-label samples (23 languages, 50k each) covering 11 emotion categories, then trains and compares six multilingual Transformer encoders under identical settings. Result: XLM-R-Large reaches 0.868 F1-micro and 0.987 AUC-micro on the in-domain test set; in zero-shot transfer to GoEmotions and SemEval-2018 E-c, it ties English-only specialist models on AP-micro and LRAP and surpasses them on AUC-micro (0.810 vs. 0.787). Conclusion: Large-scale, high-quality synthetic data can effectively compensate for the scarcity of multilingual emotion annotations; XLM-R-Large generalizes strongly across languages and is a good choice for balancing multilingual support and performance; the best base-sized model has been publicly released. Abstract: Emotion classification in multilingual settings remains constrained by the scarcity of annotated data: existing corpora are predominantly English, single-label, and cover few languages. We address this gap by constructing a large-scale synthetic training corpus of over 1M multi-label samples (50k per language) across 23 languages: Arabic, Bengali, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin, Polish, Portuguese, Punjabi, Russian, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, and Vietnamese, covering 11 emotion categories using culturally-adapted generation and programmatic quality filtering. We train and compare six multilingual transformer encoders, from DistilBERT (135M parameters) to XLM-R-Large (560M parameters), under identical conditions. On our in-domain test set, XLM-R-Large achieves 0.868 F1-micro and 0.987 AUC-micro. To validate against human-annotated data, we evaluate all models zero-shot on GoEmotions (English) and SemEval-2018 Task 1 E-c (English, Arabic, Spanish). On threshold-free ranking metrics, XLM-R-Large matches or exceeds English-only specialist models, tying on AP-micro (0.636) and LRAP (0.804) while surpassing on AUC-micro (0.810 vs. 0.787), while natively supporting all 23 languages. The best base-sized model is publicly available at https://huggingface.co/tabularisai/multilingual-emotion-classification

[50] Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs

Alkid Baci,Luke Friedrichs,Caglar Demir,N'Dah Jean Kouagou,Axel-Cyrille Ngonga Ngomo

Main category: cs.CL

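TL;DR: This paper proposes RALP, which recasts knowledge graph link prediction as a prompt learning problem: an LLM scores triples through chain-of-thought prompts without gradient updates, generalizes from a small number of examples, substantially improves MRR, and supports complex OWL reasoning tasks.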

Details Motivation: Existing knowledge graph embedding models perform poorly on unseen entities, relations, and literals and adapt badly to dynamic, heterogeneous graphs, whereas LLMs generalize strongly through prompting, which motivates a prompt-based alternative. Method: RALP formulates link prediction as learning string-based chain-of-thought (CoT) prompts; it uses the MIPRO Bayesian optimization algorithm to search for effective prompts from fewer than 30 training examples without gradients; at inference, the learned prompt predicts missing items and outputs confidence scores. Result: On several benchmarks (transductive, numerical, and OWL instance retrieval), RALP improves on state-of-the-art KGE models by over 5% MRR; on reasoning with complex OWL class expressions it achieves over 88% Jaccard similarity; it produces high-quality inferred triples that enhance generalization. Conclusion: Prompt-based LLM reasoning is a flexible, low-sample, highly generalizable paradigm for knowledge graph link prediction that can offset the limitations of traditional embedding methods. Abstract: Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string-based chain-of-thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. At inference, RALP predicts missing entities, relations or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state-of-the-art KGE models by over 5% MRR across datasets and enhances generalization via high-quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., $\exists hasChild.Female$, $\geq 5 \; hasChild.Female$), it achieves over 88% Jaccard similarity. These results highlight prompt-based LLM reasoning as a flexible alternative to embedding-based methods. We release our implementation, training, and evaluation pipeline as open source: https://github.com/dice-group/RALP .
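
Illustrative sketch (not from the paper): the RALP entry above reports results in MRR (for link prediction) and Jaccard similarity (for OWL instance retrieval). For readers unfamiliar with the two metrics, the snippet below shows their standard definitions. The function names and the toy inputs are assumptions made for illustration and are not taken from the RALP code base.

```python
def mean_reciprocal_rank(ranks):
    """MRR over 1-based ranks of the correct answer, one rank per test query."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def jaccard(predicted, gold):
    """Jaccard similarity |A & B| / |A | B| between retrieved and gold instance sets."""
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    # Two empty sets are treated as identical (a convention, not a claim about RALP).
    return len(predicted & gold) / len(union) if union else 1.0


# Toy usage: the correct entity was ranked 1st, 2nd, and 4th for three queries.
print(mean_reciprocal_rank([1, 2, 4]))
# Toy usage: instances retrieved for a class expression vs. the gold instances.
print(jaccard({"anna", "bea", "cara"}, {"bea", "cara", "dana"}))
```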

[51] InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models

Shreya Gupta,Prottay Kumar Adhikary,Bhavyaa Dave,Salam Michael Singh,Aniket Deroy,Tanmoy Chakraborty

Main category: cs.CL

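TL;DR: This paper presents InsightFlow, an LLM-based method that automatically generates 5P-framework causal graphs from psychotherapy dialogues, and validates their structure, semantics, and clinical utility on 46 clinical intake transcripts; the results indicate that the generated graphs are clinically meaningful at a level comparable to those of human experts.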

Details Motivation: Clinical case formulation (for example with the 5P framework) is essential to treatment, but building causal graphs by hand is time consuming and varies across clinicians. Method: InsightFlow uses an LLM to automatically generate 5P-aligned causal graphs from patient-therapist dialogues; on 46 psychotherapy intake transcripts annotated by clinical experts, LLM-generated graphs are compared with human graphs on structural similarity (NetSimile), semantic similarity (embedding similarity), and expert clinical ratings. Result: The generated graphs reach structural similarity close to inter-expert agreement and high semantic alignment; experts rate them as moderately complete, consistent, and clinically useful; LLM graphs are more interconnected whereas human graphs are more chain-like, but overall complexity and content coverage are similar. Conclusion: LLMs can generate clinically meaningful case formulation graphs that fall within the natural variability of expert practice; InsightFlow shows the potential of automated causal modeling to support clinical work, and future work should improve temporal reasoning and reduce redundancy. Abstract: Clinical case formulation organizes patient symptoms and psychosocial factors into causal models, often using the 5P framework. However, constructing such graphs from therapy transcripts is time consuming and varies across clinicians. We present InsightFlow, an LLM based approach that automatically generates 5P aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we evaluate LLM generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert rated clinical criteria. The generated graphs show structural similarity comparable to inter annotator agreement and high semantic alignment with human graphs. Expert evaluations rate the outputs as moderately complete, consistent, and clinically useful. While LLM graphs tend to form more interconnected structures compared to the chain like patterns of human graphs, overall complexity and content coverage are similar. These results suggest that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.

[52] Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood

Xingyu Lin,Yilin Wen,Du Su,Jinchang Hou,En Wang,Wenbin Liu,Chenfu Bao,Zhonghou Lv

Main category: cs.CL

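TL;DR: This paper proposes TEPO, a new token-level reinforcement learning framework that addresses sparse rewards in chain-of-thought reasoning by linking group-level rewards to individual tokens through sequence-level likelihood and by introducing a KL-divergence mask constraint, improving training stability and mathematical reasoning performance.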

Details Motivation: Existing methods such as GRPO suffer from entropy collapse or model degradation when handling token-level sparse rewards in chain-of-thought reasoning. Method: TEPO has two components: (1) it uses sequence-level likelihood to distribute group-level rewards to individual tokens via token-level aggregation; (2) it introduces a token-level KL-divergence mask constraint targeting tokens with positive advantages, which curbs rapid entropy decline and abrupt policy changes. Result: TEPO achieves state-of-the-art performance on mathematical reasoning benchmarks, cuts convergence time by 50% compared with GRPO/DAPO, and markedly improves training stability. Conclusion: TEPO effectively mitigates the instability of token-level policy optimization under sparse rewards and offers a more robust and efficient reinforcement learning paradigm for CoT reasoning. Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse-rewards, which is an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.

[53] Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark

Terra Blevins,Stephen Mayhew,Marek Šuppa,Hila Gonen,Shachar Mirkin,Vasile Pais,Kaja Dobrovoljc,Voula Giouli,Jun Kevin,Eugene Jang,Eungseo Kim,Jeongyeon Seo,Xenophon Gialis,Yuval Pinter

Main category: cs.CL

TL;DR: The Universal NER project builds gold-standard, multilingual Named Entity Recognition (NER) benchmarks using standardized annotation guidelines and a unified tagset to address the scarcity of high-quality evaluation resources for non-English languages.

Details Motivation: To address the scarcity of gold-standard multilingual evaluation benchmarks—especially for NER—despite the growing use of multilingual language models. Method: Adopting principles from massively multilingual NLP initiatives (e.g., Universal Dependencies), the project develops standardized annotation guidelines and a general tagset to collect cross-lingual, human-annotated NER data across many languages. Result: UNER v1 was released in 2024; the project has since expanded with ongoing contributions from a growing community of organizers, annotators, and collaborators. Conclusion: UNER establishes a scalable, community-driven framework for building high-quality, comparable multilingual NER benchmarks, enabling more rigorous evaluation of multilingual LLMs. Abstract: While multilingual language models promise to bring the benefits of LLMs to speakers of many languages, gold-standard evaluation benchmarks in most languages to interrogate these assumptions remain scarce. The Universal NER project, now entering its fourth year, is dedicated to building gold-standard multilingual Named Entity Recognition (NER) benchmark datasets. Inspired by existing massively multilingual efforts for other core NLP tasks (e.g., Universal Dependencies), the project uses a general tagset and thorough annotation guidelines to collect standardized, cross-lingual annotations of named entity spans. The first installment (UNER v1) was released in 2024, and the project has continued and expanded since then, with various organizers, annotators, and collaborators in an active community.

[54] Generating Effective CoT Traces for Mitigating Causal Hallucination

Yiheng Zhao,Jun Yan

Main category: cs.CL

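TL;DR: This paper addresses causal hallucination in small LLMs on event causality identification: it designs a pipeline that generates chain-of-thought (CoT) traces meeting specific criteria, proposes a new metric, the Causal Hallucination Rate (CHR), and shows experimentally that the approach substantially reduces causal hallucination while improving accuracy and generalization.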

Details Motivation: Small LLMs (≤1.5B parameters) suffer from severe causal hallucination in event causality identification (ECI), and there is neither suitable CoT training data nor a metric that quantifies causal hallucination. Method: The authors first analyze the key criteria that effective CoT traces should satisfy, then build a data generation pipeline that produces CoT traces meeting these criteria, and finally propose a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of the CoT criteria, and validate the pipeline. Result: Fine-tuning on CoT traces generated by the pipeline substantially reduces causal hallucination in small models, improves mean accuracy, and yields cross-dataset and cross-difficulty generalization as well as robustness to misleading prompts. Conclusion: The proposed CoT generation pipeline and the CHR metric offer a systematic solution for mitigating causal hallucination of small LLMs in ECI, combining theoretical guidance with practical effectiveness. Abstract: Although large language models (LLMs) excel in complex reasoning tasks, they suffer from severe causal hallucination in event causality identification (ECI), particularly in smaller models ($\leq$1.5B parameters). A promising approach to address this issue is to fine-tune them with Chain-of-Thought (CoT) traces. However, there is currently a lack of CoT trace dataset available for ECI. In this paper, we first investigate the essential criteria that effective CoT traces should possess to mitigate causal hallucination in smaller models. We then design a pipeline to generate CoT traces that meet these criteria. Moreover, since there is currently no metric for quantifying causal hallucination, we also introduce a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of effective CoT trace criteria, and validate the effectiveness of our pipeline. Our experiments show that fine-tuning with the CoT traces generated by our pipeline not only substantially reduces causal hallucination in smaller LLMs but also improves mean accuracy. Moreover, the fine-tuned models exhibit strong cross-dataset and cross-difficulty generalization, as well as robustness under misleading intervention prompts.

[55] NaviRAG: Towards Active Knowledge Navigation for Retrieval-Augmented Generation

Jihao Dai,Dingjun Wu,Yuxuan Chen,Zheni Zeng,Yukun Yan,Zhenghao Liu,Maosong Sun

Main category: cs.CL

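TL;DR: NaviRAG introduces a hierarchical knowledge structure and an LLM-driven active navigation mechanism that enable multi-granular, dynamic retrieval, improving retrieval recall and answer accuracy in long-document question answering.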

Details Motivation: Conventional RAG uses flat retrieval and struggles with complex tasks that require conditional retrieval across granularities and dynamic information synthesis. Method: Knowledge documents are organized into a semantic hierarchy, and an LLM agent actively navigates this structure, iteratively identifying information gaps and retrieving content from the most appropriate granularity level as needed. Result: On long-document QA benchmarks, NaviRAG consistently outperforms conventional RAG baselines in both retrieval recall and end-to-end answer performance; ablations confirm that multi-granular localization and dynamic retrieval planning are the main sources of the gains. Conclusion: NaviRAG makes RAG systems more intelligent and autonomous and points to a new direction for efficient, scenario-adaptive retrieval-augmented generation. Abstract: Retrieval-augmented generation (RAG) typically relies on a flat retrieval paradigm that maps queries directly to static, isolated text segments. This approach struggles with more complex tasks that require the conditional retrieval and dynamic synthesis of information across different levels of granularity (e.g., from broad concepts to specific evidence). To bridge this gap, we introduce NaviRAG, a novel framework that shifts from passive segment retrieval to active knowledge navigation. NaviRAG first structures the knowledge documents into a hierarchical form, preserving semantic relationships from coarse-grained topics to fine-grained details. Leveraging this reorganized knowledge records, a large language model (LLM) agent actively navigates the records, iteratively identifying information gaps and retrieving relevant content from the most appropriate granularity level. Extensive experiments on long-document QA benchmarks show that NaviRAG consistently improves both retrieval recall and end-to-end answer performance over conventional RAG baselines. Ablation studies confirm performance gains stem from our method's capacity for multi-granular evidence localization and dynamic retrieval planning. We further discuss efficiency, applicable scenario, and future directions of our method, hoping to make RAG systems more intelligent and autonomous.

[56] Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning

Timon Ziegenbein,Maja Stahl,Henning Wachsmuth

Main category: cs.CL

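TL;DR: This paper proposes a reinforcement learning approach that teaches LLMs human-like text editing strategies, producing meaning-preserving, self-contained sentence-level edit suggestions that can be accepted independently and that significantly improve the appropriateness of arguments.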

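Details Motivation: LLM edits are often scattered and tend to change the original meaning, whereas human edits tend to be self-contained, meaning-preserving local changes; closing this gap requires LLMs to learn human editing strategies in order to improve argument appropriateness. Method: A reinforcement learning setup based on group relative policy optimization (GRPO), combined with a multi-component reward function (covering edit-level semantic similarity, fluency, and pattern conformity, plus argument-level appropriateness), trains the LLM to generate self-contained sentence-level edit suggestions that can be accepted independently. Result: In automatic and human evaluation, the approach outperforms strong baselines and the state of the art in human-like editing; multi-round editing achieves appropriateness close to full rewriting. Conclusion: Reinforcement learning with a multi-objective reward can effectively guide LLMs to imitate human editing behavior, yielding high-quality, controllable, meaning-preserving edits and a new paradigm for LLM text editing. Abstract: Editing human-written text has become a standard use case of large language models (LLMs), for example, to make one's arguments more appropriate for a discussion. Comparing human to LLM-generated edits, however, we observe a mismatch in editing strategies: While LLMs often perform multiple scattered edits and tend to change meaning notably, humans rather encapsulate dependent changes in self-contained, meaning-preserving edits. In this paper, we present a reinforcement learning approach that teaches LLMs human-like editing to improve the appropriateness of arguments. Our approach produces self-contained sentence-level edit suggestions that can be accepted or rejected independently. We train the approach using group relative policy optimization with a multi-component reward function that jointly optimizes edit-level semantic similarity, fluency, and pattern conformity as well as argument-level appropriateness. In automatic and human evaluation, it outperforms competitive baselines and the state of the art in human-like editing, with multi-round editing achieving appropriateness close to full rewriting.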

[57] EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution

Shiyu He,Minchi Kuang,Mengxian Wang,Bin Hu,Tingxiang Gu

Main category: cs.CL

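TL;DR: This paper proposes EvoSpark, a framework that uses stratified narrative memory, a generative mise-en-scène mechanism, and a unified narrative operation engine to address social memory stacking and narrative-spatial dissonance during long-horizon narrative evolution in LLM-based multi-agent systems, achieving logically coherent endogenous narrative evolution.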

Details Motivation: Existing LLM-based multi-agent systems struggle to achieve endogenous long-horizon narrative evolution, mainly because of the stochasticity of generative emergence, which leads to social memory stacking and narrative-spatial dissonance. Method: EvoSpark comprises three mechanisms: (1) Stratified Narrative Memory, which dynamically resolves historical conflicts on top of a Role Socio-Evolutionary Base; (2) a Generative Mise-en-Scène mechanism, which enforces Role-Location-Plot alignment; (3) a Unified Narrative Operation Engine, which includes an Emergent Character Grounding Protocol that turns stochastic sparking into persistent characters. Result: Experiments show that EvoSpark significantly outperforms baselines across diverse paradigms and sustains the generation of expressive, logically consistent narrative experiences. Conclusion: EvoSpark provides a feasible framework for building interactive agent societies with long-term consistency and endogenous evolution, advancing the credibility and immersion of AI narrative systems. Abstract: Realizing endogenous narrative evolution in LLM-based multi-agent systems is hindered by the inherent stochasticity of generative emergence. In particular, long-horizon simulations suffer from social memory stacking, where conflicting relational states accumulate without resolution, and narrative-spatial dissonance, where spatial logic detaches from the evolving plot. To bridge this gap, we propose EvoSpark, a framework specifically designed to sustain logically coherent long-horizon narratives within Endogenous Interactive Agent Societies. To ensure consistency, the Stratified Narrative Memory employs a Role Socio-Evolutionary Base as living cognition, dynamically metabolizing experiences to resolve historical conflicts. Complementarily, Generative Mise-en-Scène mechanism enforces Role-Location-Plot alignment, synchronizing character presence with the narrative flow. Underpinning these is the Unified Narrative Operation Engine, which integrates an Emergent Character Grounding Protocol to transform stochastic sparking into persistent characters. This engine establishes a substrate that expands a minimal premise into an open-ended, evolving story world. Experiments demonstrate that EvoSpark significantly outperforms baselines across diverse paradigms, enabling the sustained generation of expressive and coherent narrative experiences.

[58] The role of System 1 and System 2 semantic memory structure in human and LLM biases

Katherine Abramski,Giulio Rossetti,Massimo Stella

Main category: cs.CL

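TL;DR: By modeling System 1 and System 2 semantic memory networks for humans and LLMs, this paper finds that semantic structures are irreducible only in humans and that only human System 2 structures are associated with lower implicit gender bias, suggesting that human-specific conceptual knowledge contributes to bias regulation and that LLMs lack this mechanism.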

Details Motivation: To understand how the cognitive mechanisms behind implicit bias differ between humans and LLMs, in particular how dual-process theory (System 1/2) is realized at the level of semantic memory and how it relates to bias. Method: System 1 and System 2 are modeled as semantic memory networks built from human- and LLM-generated data, and network-based evaluation metrics are used to analyze how their structure relates to implicit gender bias. Result: Semantic memory structures are irreducible only in humans; only in humans are System 2 structures consistently associated with lower implicit gender bias; LLMs lack the human-specific conceptual knowledge involved in bias regulation. Conclusion: Human bias regulation depends on particular types of conceptual knowledge, a mechanism absent in LLMs, revealing a fundamental difference between human and machine cognition. Abstract: Implicit biases in both humans and large language models (LLMs) pose significant societal risks. Dual process theories propose that biases arise primarily from associative System 1 thinking, while deliberative System 2 thinking mitigates bias, but the cognitive mechanisms that give rise to this phenomenon remain poorly understood. To better understand what underlies this duality in humans, and possibly in LLMs, we model System 1 and System 2 thinking as semantic memory networks with distinct structures, built from comparable datasets generated by both humans and LLMs. We then investigate how these distinct semantic memory structures relate to implicit gender bias using network-based evaluation metrics. We find that semantic memory structures are irreducible only in humans, suggesting that LLMs lack certain types of human-like conceptual knowledge. Moreover, semantic memory structure relates consistently to implicit bias only in humans, with lower levels of bias in System 2 structures. These findings suggest that certain types of conceptual knowledge contribute to bias regulation in humans, but not in LLMs, highlighting fundamental differences between human and machine cognition.

[59] Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Eliya Habba,Itay Itzhak,Asaf Yehudai,Yotam Perlitz,Elron Bandel,Michal Shmueli-Scheuer,Leshem Choshen,Gabriel Stanovsky

Main category: cs.CL

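TL;DR: This paper proposes a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmark datasets, keeping evaluation results comparable as models and datasets keep growing while substantially reducing evaluation cost.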

Details Motivation: With language models and benchmarks released at a rapid pace, fully evaluating every model on every dataset is costly; in addition, different studies often evaluate models on different samples, which makes scores hard to compare. Method: A calibration framework based on multidimensional IRT uses a fixed anchor set for each dataset and calibrates each newly added dataset to the existing evaluation suite while holding previously calibrated item parameters fixed. Result: In large-scale experiments on more than 400 models, only 100 anchor questions per dataset suffice to predict full-evaluation performance within 2-3 percentage points, with Spearman ρ ≥ 0.9, preserving rankings. Conclusion: The framework lets evaluation suites be extended over time while keeping results comparable across periods, at a constant evaluation cost per new dataset. Abstract: The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic evaluation setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation, while a fixed anchor set for each dataset is used so that results from different evaluation periods can be compared directly. In large-scale experiments on more than $400$ models, our framework predicts full-evaluation performance within 2-3 percentage points using only $100$ anchor questions per dataset, with Spearman $\rho \geq 0.9$ for ranking preservation, showing that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset. Code available at https://github.com/eliyahabba/growing-pains
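
Illustrative sketch (not from the paper): the entry above relies on Item Response Theory with previously calibrated item parameters held fixed. The snippet below is a deliberately simplified illustration of that one step. It assumes a unidimensional two-parameter logistic (2PL) model, whereas the paper uses a multidimensional one, and estimates a model's ability from its anchor-item responses by grid search. All names (`p_correct`, `estimate_ability`, `predict_accuracy`) and the toy parameters are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability that a model with ability theta answers an item correctly.

    a is the item's discrimination and b its difficulty; both are treated as
    already calibrated and held fixed.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability on a grid, given 0/1 responses to anchor items."""
    best_theta, best_ll = lo, -math.inf
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        ll = 0.0
        for y, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            ll += math.log(p) if y else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta


def predict_accuracy(theta, items):
    """Expected accuracy on a full item pool, without running the model on it."""
    return sum(p_correct(theta, a, b) for a, b in items) / len(items)


# Toy anchor set: (discrimination, difficulty) pairs, assumed pre-calibrated.
anchors = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
theta_hat = estimate_ability([1, 1, 0], anchors)
print(theta_hat, predict_accuracy(theta_hat, anchors))
```

Because the item parameters never change, abilities estimated at different times land on the same scale, which is what makes scores from different evaluation periods comparable in this kind of setup.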

[60] Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

Ronald Skorobogat,Ameya Prabhu,Matthias Bethge

Main category: cs.CL

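TL;DR: This paper argues that current multilingual benchmarks mainly measure mathematical reasoning and factual recall rather than genuine multilingual proficiency; it proposes an alternative evaluation based on round-trip translation and builds a new benchmark, LiT. The method correlates strongly with real user ratings and requires neither human reference translations nor a stronger multilingual judge model.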

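Details Motivation: The multilingual evaluations reported for frontier models (such as cross-lingual knowledge and reasoning benchmarks) do not accurately reflect real multilingual capability, because they actually measure mathematical reasoning and factual recall rather than language generation and understanding; for example, thinking variants perform very well on these benchmarks yet do worse on real-world multilingual tasks such as LMArena. Method: Multilingual capability is evaluated by round-trip translation: text in a source language is translated into a target language and back, and semantic gaps between the original and the back-translated text expose failures in multilingual generation; a new benchmark, Lost in Translation (LiT), spans widely spoken languages worldwide. Result: Round-trip translation scores correlate strongly with LMArena user ratings (ρ = 0.94), need no human reference translations, and do not depend on a multilingual judge more capable than the tested models. Conclusion: Current mainstream multilingual benchmarks are biased; round-trip translation is a more realistic, efficient, and scalable way to evaluate multilingual capability, and LiT offers a more realistic evaluation standard for frontier multilingual LLMs. Abstract: Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular reasoning and knowledge benchmarks, but across many languages. We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. Round-trip translation correlates almost perfectly ($\rho$ = 0.94) with user ratings on LMArena with our benchmark, requires no human reference translations, and does not require a more capable multilingual judge than tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.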
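
Illustrative sketch (not from the paper): the round-trip procedure described in the entry above (translate to a target language, translate back, compare with the original) can be outlined as below. This is a hedged illustration only. The `translate` callable is a hypothetical placeholder for whatever model is being tested, and the token-level F1 is a crude stand-in for the semantic comparison, which the abstract does not specify.

```python
from collections import Counter


def token_f1(reference, candidate):
    """Token-overlap F1 between two strings (a rough proxy for semantic similarity)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def round_trip_score(text, translate, src, tgt, similarity=token_f1):
    """Translate src -> tgt -> src and score how much of the original survives.

    `translate(text, source_lang, target_lang)` is supplied by the caller.
    """
    forward = translate(text, src, tgt)
    back = translate(forward, tgt, src)
    return similarity(text, back)


def identity(text, source_lang, target_lang):
    # Toy "translator" that loses nothing.
    return text


def lossy(text, source_lang, target_lang):
    # Toy "translator" that drops the last word on every call.
    return " ".join(text.split()[:-1])


print(round_trip_score("the cat sat", identity, "en", "de"))
print(round_trip_score("the cat sat", lossy, "en", "de"))
```

No reference translation in the target language is needed at any point: the original source text serves as its own reference.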

[61] MetFuse: Figurative Fusion between Metonymy and Metaphor

Saptarshi Ghosh,Tianyu Jiang

Main category: cs.CL

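TL;DR: This paper introduces a framework that transforms literal sentences into metonymic, metaphoric, and hybrid figurative variants, and builds MetFuse (4,000 sentences), the first high-quality dataset dedicated to the fusion of metonymy and metaphor. Experiments show that the dataset improves recognition of both figure types, with hybrid examples yielding the largest gains for metonymy; further analysis finds that the presence of a metaphor makes metonymy easier for both humans and LLMs to identify.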

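Details Motivation: Metonymy and metaphor often co-occur in natural language, but computational work has mostly treated them in isolation, with no systematic modeling of, or data for, their interaction. Method: A three-way figurative generation framework (literal → metonymic / metaphoric / hybrid) is used to construct and human-verify the MetFuse dataset (1,000 quadruplets, 4,000 sentences in total), followed by extrinsic evaluation on eight existing benchmarks and an analysis of figurative interaction. Result: Augmenting training data with MetFuse consistently improves metonymy and metaphor classification, with hybrid examples giving the largest gains on metonymy tasks; metonymy in hybrid sentences is identified more reliably by both human annotators and LLMs than metonymy in metonymy-only sentences. Conclusion: Metonymy and metaphor do not operate independently, and their co-occurrence is mutually reinforcing; MetFuse offers a new benchmark for figurative language understanding and generation and sheds light on how one figure makes another more explicit. Abstract: Metonymy and metaphor often co-occur in natural language, yet computational work has studied them largely in isolation. We introduce a framework that transforms a literal sentence into three figurative variants: metonymic, metaphoric, and hybrid. Using this framework, we construct MetFuse, the first dedicated dataset of figurative fusion between metonymy and metaphor, containing 1,000 human-verified meaning-aligned quadruplets totaling 4,000 sentences. Extrinsic experiments on eight existing benchmarks show that augmenting training data with MetFuse consistently improves both metonymy and metaphor classification, with hybrid examples yielding the largest gains on metonymy tasks. Using this dataset, we also analyze how the presence of one figurative type impacts another. Our findings show that both human annotators and large language models better identify metonymy in hybrid sentences than in metonymy-only sentences, demonstrating that the presence of a metaphor makes a metonymic noun more explicit. Our dataset is publicly available at: https://github.com/cincynlp/MetFuse.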

[62] MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Chung-Ming Chien,Manu Orsini,Eugene Kharitonov,Neil Zeghidour,Karen Livescu,Alexandre Défossez

Main category: cs.CL

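TL;DR: This paper proposes MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to improve the factuality of speech-to-speech language models while preserving real-time interactivity.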

Details Motivation: To improve the factuality of full-duplex speech-to-speech language models without the prohibitive real-time inference cost of scaling up the model. Method: MoshiRAG uses an asynchronous framework that performs retrieval during the natural temporal gap between response onset and the delivery of core information, combining a compact full-duplex interface with selective retrieval from external knowledge sources. Result: MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity of full-duplex systems; it supports plug-and-play retrieval methods without retraining and performs strongly on out-of-domain mathematical reasoning tasks. Conclusion: Through modular design and asynchronous retrieval, MoshiRAG balances factuality, real-time responsiveness, and flexibility, offering a scalable path to factuality enhancement for conversational AI. Abstract: Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.

[63] GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

Amir Hossein Kargaran,Nafiseh Nikeghbal,Jana Diesner,François Yvon,Hinrich Schütze

Main category: cs.CL

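TL;DR: This paper introduces GlotOCR Bench, a benchmark of OCR generalization covering more than 100 Unicode scripts, and finds that current vision-language models perform poorly on most scripts, with performance closely tied to the scripts covered in pretraining.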

Details Motivation: Existing OCR evaluation concentrates on a few high- and mid-resource scripts, with no systematic assessment of generalization across scripts. Method: GlotOCR Bench covers 100+ Unicode scripts with clean and degraded image variants rendered using Google Fonts, HarfBuzz, and FreeType, with rendering correctness verified manually; a range of open-weight and proprietary vision-language models are evaluated. Result: Most models perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts; performance broadly tracks script-level pretraining coverage; on unfamiliar scripts, models tend to produce random noise or hallucinate characters from other scripts. Conclusion: Current OCR systems rely heavily on language model pretraining rather than on visual recognition alone, and more balanced, multi-script pretraining and evaluation are needed. Abstract: Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.

[64] Accelerating Speculative Decoding with Block Diffusion Draft Trees

Liran Ringel,Yaniv Romano

Main category: cs.CL

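TL;DR: This paper proposes DDTree, a tree-based speculative decoding method built on a block diffusion drafter: it constructs a draft tree and verifies it in a single forward pass, increasing acceptance length and decoding efficiency.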

Details Motivation: DFlash verifies only a single drafted trajectory per round, which limits acceptance length; the per-position distributions output by the block diffusion drafter can be exploited more effectively. Method: DDTree builds a draft tree directly from the per-position distributions of the block diffusion drafter; under a fixed node budget, a best-first heap algorithm driven by a surrogate score (defined by the draft model's output) selects the continuations most likely to be accepted by the target model, and an ancestor-only attention mask lets the whole tree be verified in a single target-model forward pass. Result: Building on DFlash, which outperforms strong autoregressive drafters such as EAGLE-3, DDTree improves acceptance length and inference throughput and ranks among the leading speculative decoding approaches. Conclusion: DDTree is an efficient, scalable tree-based speculative decoding framework that makes full use of the parallel generation ability of block diffusion models and offers a new way to accelerate LLM inference. Abstract: Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
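
Illustrative sketch (not from the paper): the DDTree entry above mentions a best-first heap that grows a draft tree under a node budget, plus an ancestor-only attention mask. The snippet below is a rough reading of that description, not the authors' implementation. It assumes the surrogate score of a path is the product of the per-position draft probabilities of its tokens; the abstract only says the surrogate is "defined by the draft model's output". The functions `build_draft_tree` and `ancestor_mask` and the toy distributions are hypothetical.

```python
import heapq
import itertools
import math


def build_draft_tree(position_probs, budget):
    """Select up to `budget` tree nodes in descending order of path log-probability.

    position_probs[d] maps token -> draft probability at block position d.
    Each returned node records its token, its parent index (-1 for the root
    context), its depth, and its path log-probability.
    """
    ranked = [sorted(p.items(), key=lambda kv: -kv[1]) for p in position_probs]
    nodes, heap, tie = [], [], itertools.count()

    def push(depth, rank, parent, parent_logp):
        # Queue the rank-th best token at `depth` as a child of `parent`.
        if depth < len(ranked) and rank < len(ranked[depth]):
            prob = ranked[depth][rank][1]
            if prob > 0:
                logp = parent_logp + math.log(prob)
                heapq.heappush(heap, (-logp, next(tie), depth, rank, parent))

    push(0, 0, -1, 0.0)
    while heap and len(nodes) < budget:
        neg_logp, _, depth, rank, parent = heapq.heappop(heap)
        nodes.append({"token": ranked[depth][rank][0], "parent": parent,
                      "depth": depth, "logp": -neg_logp})
        index = len(nodes) - 1
        push(depth + 1, 0, index, -neg_logp)            # best child of this node
        parent_logp = nodes[parent]["logp"] if parent >= 0 else 0.0
        push(depth, rank + 1, parent, parent_logp)      # next-best sibling
    return nodes


def ancestor_mask(nodes):
    """mask[i][j] is True when node j is node i itself or one of its ancestors."""
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = nodes[j]["parent"]
    return mask


tree = build_draft_tree([{"a": 0.6, "b": 0.4}, {"c": 0.9, "d": 0.1}], budget=4)
print([n["token"] for n in tree], [n["parent"] for n in tree])
```

Each popped node queues only its best child and its next-best sibling, so the heap stays small while candidates still come out in descending score order; the mask then lets every node attend to its own branch and nothing else during the single verification pass.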

[65] PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

Han Bao,Penghao Zhang,Yue Huang,Zhengqing Yuan,Yanchi Ru,Rui Su,Yujun Zhou,Xiangqi Wang,Kehan Guo,Nitesh V Chawla,Yanfang Ye,Xiangliang Zhang

Main category: cs.CL

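TL;DR: This paper presents PolicyBench, the first large-scale cross-system (US-China) benchmark for policy comprehension, builds the domain-specialized PolicyMoE model on top of it, and reveals the limitations of current large language models in policy understanding.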

Details Motivation: Large language models are increasingly used in real-world decision-making settings such as public policy, but their ability to understand and reason about policy content has not been systematically evaluated, so a dedicated benchmark is needed. Method: The authors build PolicyBench, a benchmark of 21K cases spanning many policy areas, which follows Bloom's taxonomy to assess models at three levels (memorization, understanding, application); they also propose PolicyMoE, a domain-specialized Mixture-of-Experts model whose expert modules are aligned to the cognitive levels. Result: PolicyMoE performs best on application-oriented policy tasks and achieves the highest accuracy on structured reasoning tasks; the experiments show that current LLMs have clear weaknesses in policy understanding, especially conceptual understanding and complex reasoning. Conclusion: PolicyBench provides a scalable evaluation framework for policy AI, and PolicyMoE supports the effectiveness of a level-aligned expert design; together they advance more reliable, policy-aware large models. Abstract: Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge, (2) Understanding: conceptual and contextual reasoning, and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed models demonstrate stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yield the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.

[66] One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

Erfan Baghaei Potraghloo,Seyedarmin Azizi,Souvik Kundu,Massoud Pedram

Main category: cs.CL

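TL;DR: This paper shows that instruction-tuned large language models are severely fragile under simple lexical constraints (such as banning a single punctuation character or common word), with large drops in response comprehensiveness. The effect stems from a planning failure introduced by instruction tuning, in which the model over-relies on fixed surface templates; base models do not show it. The study also finds that conventional LLM-as-judge evaluation badly underestimates the degradation under constraints.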

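Details Motivation: Investigate how robust instruction-tuned LLMs are under simple lexical constraints, challenge the common assumption that instruction tuning improves generalization and robustness, and expose a potential structural weakness. Method: Pairwise evaluation of response comprehensiveness for multiple models (three open-weight, one closed-weight) under a range of simple lexical constraints; mechanistic analysis via two-pass generation (free generation followed by constrained rewriting); linear probes relating prompt representations to response length; comparison of base and instruction-tuned models; replication on MT-Bench; comparison of independent LLM-as-judge scoring with pairwise evaluation. Result: Instruction-tuned models lose 14-48% of comprehensiveness when a single word or punctuation character is banned, and the baseline response is preferred in 77-100% of pairwise comparisons; GPT-4o-mini also loses 31% of comprehensiveness with a 99% baseline win rate; two-pass generation recovers 59-96% of response length; linear probes predicting response length reach R² of 0.51-0.93 on instruction-tuned models and negative R² on base models; base models show no systematic collapse; the effect replicates across all eight MT-Bench task categories; independent LLM-as-judge evaluation detects only a 3.5% quality drop, far below the 23% revealed by pairwise evaluation. Conclusion: Instruction tuning improves task performance but also couples competence tightly to surface form, which makes planning fragile; this fragility is a structural side effect specific to instruction tuning rather than an inherent flaw of all LLMs; evaluation methods need to be upgraded (for example by preferring pairwise comparison) so that key robustness problems are not masked. Abstract: Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14--48% of comprehensiveness in pairwise evaluation across three open-weight model families and one closed-weight model (GPT-4o-mini). The baseline response is preferred in 77--100% of 1,920 pairwise comparisons judged by GPT-4o-mini and GPT-4o. Notably, GPT-4o-mini suffers 31% comprehensiveness loss (99% baseline win rate), demonstrating that the fragility extends to commercially deployed closed-weight models, contrary to prior findings on format-level constraints. Through mechanistic analysis, we identify this as a planning failure: two-pass generation (free generation followed by constrained rewriting) recovers 59--96% of response length, and linear probes on prompt representations predict response length with $R^2 = 0.51$--$0.93$ before generation begins, with $R^2$ tracking collapse severity across models. The same probes yield negative $R^2$ on base models, confirming that instruction tuning creates the representational structure encoding the collapse decision. 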
Crucially, base models show no systematic collapse under identical constraints, with effects that are small, noisy, and bidirectional, demonstrating that instruction tuning creates this fragility by coupling task competence to narrow surface-form templates. The effect replicates on MT-Bench across all eight task categories. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% average quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in how constrained generation is assessed.

[67] Toward Autonomous Long-Horizon Engineering for ML Research

Guoxin Chen,Jie Chen,Lei Chen,Jiale Zhao,Fanzhe Meng,Wayne Xin Zhao,Ruihua Song,Cheng Chen,Ji-Rong Wen,Kai Jia

Main category: cs.CL

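TL;DR: This paper proposes AiScientist, a system that combines hierarchical orchestration with a file-based persistent state bus (File-as-Bus) workspace to substantially improve AI autonomy and coherence in long-horizon ML research engineering. Experiments show large gains on PaperBench and MLE-Bench Lite and confirm the key role of persistent state management.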

Details Motivation: Long-horizon ML research engineering requires sustained coordination across task comprehension, environment setup, implementation, experimentation, and debugging, and existing methods lack effective management of persistent project state and structured coordination. Method: AiScientist uses a hierarchical orchestration architecture: a top-level Orchestrator exercises stage-level control through summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts (such as analysis reports, plans, code, and experimental results) instead of relying on conversational handoffs; the core mechanism is a permission-scoped File-as-Bus workspace. Result: An average improvement of 10.54 points on PaperBench and 81.82 Any Medal% on MLE-Bench Lite; in ablations, removing the File-as-Bus protocol lowers PaperBench by 6.41 points and MLE-Bench Lite by 31.82 percentage points. Conclusion: Long-horizon ML research engineering is fundamentally a systems problem of coordinating specialized work around persistent project state rather than a purely local reasoning problem; both structured orchestration and thick-state continuity are required. Abstract: Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance: removing it reduces PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.

cs.CV [Back]

[68] UniMark: Unified Adaptive Multi-bit Watermarking for Autoregressive Image Generators

Yigit Yilmaz,Elena Petrova,Mehmet Kaya,Lucia Rossi,Amir Rahman

Main category: cs.CV

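TL;DR: This paper proposes UniMark, a training-free, unified watermarking framework for autoregressive image generation. It addresses three problems of existing methods (zero-bit-only watermarks, static codebooks that are vulnerable to attack, and poor generalization) through adaptive semantic grouping, block-wise multi-bit encoding, and a unified token-replacement interface, enabling secure, robust, and scalable multi-bit watermark embedding and extraction while preserving image quality.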

Details Motivation: Existing invisible watermarking methods for autoregressive image generation have three shortcomings: they support only zero-bit watermarks, depend on static codebook partitions that are easy to break, and generalize poorly across AR architectures; a more secure, flexible, and practical general-purpose scheme is needed. Method: The UniMark framework has three core components: (1) Adaptive Semantic Grouping (ASG), which partitions the codebook dynamically based on semantic similarity and a secret key; (2) Block-wise Multi-bit Encoding (BME), which splits the token sequence into blocks and uses error-correcting codes for reliable multi-bit encoding; (3) a Unified Token-Replacement Interface (UTRI), compatible with both next-token prediction (e.g., LlamaGen) and next-scale prediction (e.g., VAR). Result: Experiments on three AR models show that UniMark reaches state-of-the-art FID (image quality), watermark detection accuracy, and multi-bit message extraction, and is robust to common attacks including cropping, JPEG compression, Gaussian noise, blur, color jitter, and random erasing. Conclusion: UniMark is a training-free, architecture-agnostic watermarking framework that supports multi-bit message embedding and combines security, fidelity, and robustness, offering a practical approach to copyright protection and provenance tracing for AI-generated images. Abstract: Invisible watermarking for autoregressive (AR) image generation has recently gained attention as a means of protecting image ownership and tracing AI-generated content. However, existing approaches suffer from three key limitations: (1) they embed only zero-bit watermarks for binary verification, lacking the ability to convey multi-bit messages; (2) they rely on static codebook partitioning strategies that are vulnerable to security attacks once the partition is exposed; and (3) they are designed for specific AR architectures, failing to generalize across diverse AR paradigms. We propose UniMark, a training-free, unified watermarking framework for autoregressive image generators that addresses all three limitations. UniMark introduces three core components: Adaptive Semantic Grouping (ASG), which dynamically partitions codebook entries based on semantic similarity and a secret key, ensuring both image quality preservation and security; Block-wise Multi-bit Encoding (BME), which divides the token sequence into blocks and encodes different bits across blocks with error-correcting codes for reliable message transmission; and a Unified Token-Replacement Interface (UTRI) that abstracts the watermark embedding process to support both next-token prediction (e.g., LlamaGen) and next-scale prediction (e.g., VAR) paradigms. We provide theoretical analysis on detection error rates and embedding capacity. 
Extensive experiments on three AR models demonstrate that UniMark achieves state-of-the-art performance in image quality (FID), watermark detection accuracy, and multi-bit message extraction, while maintaining robustness against cropping, JPEG compression, Gaussian noise, blur, color jitter, and random erasing attacks.

[69] MedConcept: Unsupervised Concept Discovery for Interpretability in Medical VLMs

Md Rakibul Haque,KM Arefeen Sultan,Tushar Kataria,Shireen Elhabian

Main category: cs.CV

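TL;DR: This paper proposes MedConcept, a framework that discovers latent medical concepts in medical vision-language models (VLMs) without supervision and maps them to clinically verifiable textual semantics (such as pseudo-reports). It also designs a quantitative semantic verification protocol, based on a medical large language model (LLM), to assess the quality of concept explanations.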

Details Motivation: Medical VLMs struggle to earn clinical trust because of their black-box representations; existing interpretability methods are mostly limited to specific tasks such as classification and do not provide concept-level explanations that can be reused across tasks. Method: The MedConcept framework (1) extracts sparse neuron-level medical concept activations from a pretrained VLM without supervision; (2) converts the concepts into pseudo-report-style summaries; (3) uses a frozen, independent medical LLM as an external evaluator and defines three concept scores (Aligned, Unaligned, Uncertain) for post hoc semantic verification. Result: The framework delivers clinically inspectable concept-level explanations, establishes the first quantitative evaluation protocol for concept interpretability in medical VLMs, and promises reproducible code, prompts, and data. Conclusion: MedConcept improves the trustworthiness and interpretability of medical VLMs, and its concept extraction and quantitative verification approach offers a new path toward clinical deployment. Abstract: While medical Vision-Language models (VLMs) achieve strong performance on tasks such as tumor or organ segmentation and diagnosis prediction, their opaque latent representations limit clinical trust and the ability to explain predictions. Interpretability of these multimodal representations is therefore essential for the trustworthy clinical deployment of pretrained medical VLMs. However, current interpretability methods, such as gradient- or attention-based visualizations, are often limited to specific tasks such as classification. Moreover, they do not provide concept-level explanations derived from shared pretrained representations that can be reused across downstream tasks. We introduce MedConcept, a framework that uncovers latent medical concepts in a fully unsupervised manner and grounds them in clinically verifiable textual semantics. MedConcept identifies sparse neuron-level concept activations from pretrained VLM representations and translates them into pseudo-report-style summaries, enabling physician-level inspection of internal model reasoning. To address the lack of quantitative evaluation in concept-based interpretability, we introduce a quantitative semantic verification protocol that leverages an independent pretrained medical LLM as a frozen external evaluator to assess concept alignment with radiology reports. We define three concept scores, Aligned, Unaligned, and Uncertain, to quantify semantic support, contradiction, or ambiguity relative to radiology reports and use them exclusively for post hoc evaluation. These scores provide a quantitative baseline for assessing interpretability in medical VLMs. 
All code, prompts, and data will be released on acceptance.

[70] V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

Chengkun Yue,Chuanzhi Xu,Jiangpeng He

Main category: cs.CV

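TL;DR: This paper proposes V-Nutri, a framework that uses egocentric cooking videos (with keyframe selection and feature fusion) to improve dish-level nutrition estimation. It builds the first benchmark for video-based nutrition estimation and shows on the HD-EPIC dataset that process information helps estimate calories and macronutrients.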

Details Motivation: Existing nutrition estimation methods based on a single image of the finished dish are limited because ingredients such as oils and sauces become visually ambiguous after cooking, making calorie and macronutrient estimates inaccurate; cooking process videos may supply the missing information. Method: The authors build the first video-based nutrition estimation benchmark (manual annotations on the HD-EPIC dataset) and propose V-Nutri, which combines Nutrition5K-pretrained visual backbones, a lightweight multi-frame fusion module (fusing the final dish frame with cooking keyframes), and a VideoMamba-based event-detection model that selects keyframes at ingredient-addition moments. Result: Experiments on HD-EPIC show that cooking process cues can improve nutrition estimation; the gain depends strongly on the representation capacity of the visual backbone and on event detection quality. Conclusion: Cooking process video provides complementary evidence for dish-level nutrition estimation, and V-Nutri demonstrates its value, suggesting a direction for video-driven dietary health monitoring. Abstract: Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on single images of the finished dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether the cooking process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we further manually annotated the HD-EPIC dataset and established the first benchmark for video-based nutrition estimation. Most importantly, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking process keyframes extracted from the egocentric videos. V-Nutri also includes a cooking keyframe selection module, a VideoMamba-based event-detection model that targets ingredient-addition moments. Experiments on the HD-EPIC dataset show that process cues can provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Our results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event detection quality. Our code and annotated dataset are available at https://github.com/K624-YCK/V-Nutri.

[71] A Workflow to Efficiently Generate Dense Tissue Ground Truth Masks for Digital Breast Tomosynthesis

Tamerlan Mustafaev,Oleg Kruglov,Margarita Zuley,Luana de Mero Omena,Guilherme Muniz de Oliveira,Vitor de Sousa Franca,Bruno Barufaldi,Robert Nishikawa,Juhun Lee

Main category: cs.CV

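TL;DR: This paper proposes an efficient semi-automatic framework for generating binary dense-tissue segmentation masks for DBT images. The user only outlines a rough ROI on the central slice and sets a threshold; cross-slice projection and adaptive threshold adjustment then produce a consistent segmentation over the full volume. This greatly reduces manual annotation time while reaching accuracy comparable to inter-radiologist agreement on the DBTex dataset (median Dice 0.83).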

Details Motivation: Accurate segmentation of fibroglandular tissue in DBT images is essential for personalized breast cancer risk estimation, but the scarcity of high-quality human-annotated data limits algorithm development. Method: A semi-automatic framework in which the user outlines a rough ROI on the central reconstructed slice of a DBT volume and selects an initial segmentation threshold; the algorithm projects the ROI to the remaining slices and iteratively adjusts slice-specific thresholds to keep dense-tissue segmentation consistent across the volume. Result: On 44 DBT volumes from DBTex, the median patient-wise Dice between two radiologists is 0.84; against a radiologist's manual segmentations of the 20th and 80th percentile slices (176 slices in total), the proposed method reaches a median Dice of 0.83. Conclusion: The framework substantially lowers the manual annotation cost of DBT dense-tissue segmentation while maintaining clinically acceptable accuracy, supporting further algorithm research and clinical use. Abstract: Digital breast tomosynthesis (DBT) is now the standard of care for breast cancer screening in the USA. Accurate segmentation of fibroglandular tissue in DBT images is essential for personalized risk estimation, but algorithm development is limited by scarce human-delineated training data. In this study we introduce a time- and labor-saving framework to generate a human-annotated binary segmentation mask for dense tissue in DBT. Our framework enables a user to outline a rough region of interest (ROI) enclosing dense tissue on the central reconstructed slice of a DBT volume and select a segmentation threshold to generate the dense tissue mask. The algorithm then projects the ROI to the remaining slices and iteratively adjusts slice-specific thresholds to maintain consistent dense tissue delineation across the DBT volume. By requiring annotation only on the central slice, the framework substantially reduces annotation time and labor. We used 44 DBT volumes from the DBTex dataset for evaluation. Inter-reader agreement was assessed by computing patient-wise Dice similarity coefficients between segmentation masks produced by two radiologists, yielding a median of 0.84. Accuracy of the proposed method was evaluated by having a radiologist manually segment the 20th and 80th percentile slices from each volume (CC and MLO views; 176 slices total) and calculate Dice scores between the manual and proposed segmentations, yielding a median of 0.83.
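
The agreement figures in this entry are medians of Dice similarity coefficients, Dice = 2|A∩B| / (|A| + |B|) for two binary masks A and B. The following is a minimal pure-Python sketch of that metric; the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from the paper.

```python
from statistics import median


def dice_coefficient(mask_a, mask_b):
    """Dice similarity between two binary masks given as nested lists of 0/1."""
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(1 for a, b in zip(flat_a, flat_b) if a and b)
    total = sum(flat_a) + sum(flat_b)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: one overlapping pixel, mask sizes 2 and 1.
reader_1 = [[1, 1], [0, 0]]
reader_2 = [[1, 0], [0, 0]]
score = dice_coefficient(reader_1, reader_2)  # 2 * 1 / (2 + 1)

# A reported "median Dice" aggregates per-patient or per-slice scores.
summary = median([score, 1.0, 0.5])
```

Reporting the median rather than the mean keeps a few badly segmented slices from dominating the summary figure.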

[72] EigenCoin: sassanid coins classification based on Bhattacharyya distance

Rahele Allahverdi,Mohammad Mahdi Dehshibi,Azam Bastanfard,Daryoosh Akbarzadeh

Main category: cs.CV

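TL;DR: This paper proposes EigenCoin, a manifold learning method combined with the Bhattacharyya distance, to address data imbalance in Sassanid coin classification, and examines the influence of holistic and feature-based approaches. Experiments indicate a marked accuracy gain (+9.45% to 21.75%) and reduced over-fitting.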

Details Motivation: Address the challenges that imbalanced databases pose for pattern recognition, applied to the concrete and practically relevant task of classifying Sassanid coins. Method: The EigenCoin manifold learning framework has three steps (manifold construction, mapping of test data, and classification) and is combined with the Bhattacharyya distance; the effects of holistic and feature-based approaches are compared. Result: EigenCoin outperforms the other algorithms compared on coin classification, with accuracy gains of 9.45% to 21.75%, and is able to limit over-fitting. Conclusion: EigenCoin is an effective manifold learning method for imbalanced classification, with practical value for ancient coin recognition and other small-sample, class-imbalanced settings. Abstract: Solving pattern recognition problems using imbalanced databases is a hot topic that has drawn considerable research attention. Therefore, we consider this problem in the application of Sassanid coin classification. Our focus is not only on proposing the EigenCoin manifold with Bhattacharyya distance for the classification task, but also on testing the influence of the holistic and feature-based approaches. EigenCoin consists of three main steps, namely manifold construction, mapping test data, and classification. Conducted experiments show EigenCoin outperformed other observed algorithms and achieved the accuracy from 9.45% up to 21.75%, while it has the capability of handling the over-fitting problem.
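
For readers unfamiliar with the distance named in this entry, the sketch below computes the Bhattacharyya distance between two discrete distributions, D_B = -ln Σ_i sqrt(p_i q_i). The abstract does not say which form of the distance EigenCoin uses or how features are binned, so the histogram form, the normalization step, and the name `bhattacharyya_distance` are assumptions for illustration only.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two non-negative histograms.

    Both inputs are normalized to sum to 1 before comparison.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("histograms must have positive total mass")

    # Bhattacharyya coefficient: overlap between the two distributions.
    coefficient = sum(math.sqrt((a / sum_p) * (b / sum_q)) for a, b in zip(p, q))
    coefficient = min(coefficient, 1.0)  # guard against rounding above 1

    # No overlap at all means infinite distance.
    return math.inf if coefficient == 0.0 else -math.log(coefficient)
```

A nearest-class rule would then assign a test sample to the class whose reference histogram gives the smallest distance. Because the distance depends only on normalized distributions, it is insensitive to how many samples each class contributed, which is one plausible reason to prefer it on imbalanced data.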

[73] Fall Risk and Gait Analysis in Community-Dwelling Older Adults using World-Spaced 3D Human Mesh Recovery

Chitra Banarjee,Patrick Kwon,Ania Lipat,Rui Xie,Chen Chen,Ladda Thiamwong

Main category: cs.CV

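TL;DR: This paper presents a gait analysis pipeline based on a 3D Human Mesh Recovery (HMR) model that extracts spatiotemporal gait parameters from ordinary video of older adults performing the TUG test, and validates their association with IMU measurements and fall risk.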

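Details Motivation: Conventional clinical gait assessment relies on stopwatch-measured speed alone and lacks a comprehensive, convenient, and ecologically valid way to quantify gait, especially in community settings. Method: A 3D Human Mesh Recovery (HMR) model extracts gait parameters (step time, sit-to-stand duration, step length) from TUG test videos recorded at several community centers, and the results are compared with IMU insole data; linear mixed effects models relate the parameters to self-rated fall risk and fear of falling. Result: Video-derived step time correlates significantly with IMU measurements; shorter, more variable step lengths and longer sit-to-stand durations are associated with higher self-rated fall risk and fear of falling. Conclusion: The video-based pipeline enables convenient, ecologically valid, and clinically meaningful gait assessment in community settings and could make fall risk screening for older adults more accessible and practical. Abstract: Gait assessment is a key clinical indicator of fall risk and overall health in older adults. However, standard clinical practice is largely limited to stopwatch-measured gait speed. We present a pipeline that leverages a 3D Human Mesh Recovery (HMR) model to extract gait parameters from recordings of older adults completing the Timed Up and Go (TUG) test. From videos recorded across different community centers, we extract and analyze spatiotemporal gait parameters, including step time, sit-to-stand duration, and step length. We found that video-derived step time was significantly correlated with IMU-based insole measurements. Using linear mixed effects models, we confirmed that shorter, more variable step lengths and longer sit-to-stand durations were predicted by higher self-rated fall risk and fear of falling. These findings demonstrate that our pipeline can enable accessible and ecologically valid gait analysis in community settings.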

[74] INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

Somraj Gautam,Anathapindika Dravichi,Gaurav Harit

Main category: cs.CV

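TL;DR: This paper introduces INDOTABVQA, a cross-lingual table visual question answering benchmark for document images in Bahasa Indonesia, with multilingual question-answer pairs and varied table styles, for evaluating and improving vision-language models on low-resource languages and structurally complex tables.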

Details Motivation: Existing vision-language models have weak cross-lingual understanding of low-resource languages (such as Bahasa Indonesia) and of structurally complex tables in real documents, and the field lacks a multilingual, structure-aware document understanding benchmark. Method: The authors build INDOTABVQA, with 1,593 Indonesian document images and question-answer sets in four languages (Bahasa Indonesia, English, Hindi, Arabic); they evaluate several open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o, run fine-tuning experiments on 3B and 7B models, and add table region coordinates as a spatial prior. Result: Mainstream VLMs show substantial performance gaps on complex tables and low-resource languages; fine-tuning the 3B model and LoRA fine-tuning the 7B model improve accuracy by 11.6% and 17.8%, respectively; adding table coordinates as input gives a further 4-7% gain. Conclusion: Language-diverse, domain-specific datasets are essential for improving VLM document understanding; targeted fine-tuning and spatial priors effectively strengthen cross-lingual table VQA; INDOTABVQA is a key resource for document intelligence research in underrepresented regions. Abstract: We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful), each containing one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B model and LoRA fine-tuning a 7B model on our dataset yield 11.6% and 17.8% improvements in accuracy, respectively. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. 
The full dataset is available on Hugging Face at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA

[75] Ultra-low-light computer vision using trained photon correlations

Mandar M. Sohoni,Jérémie Laydevant,Mathieu Ouellet,Shi-Yuan Ma,Ryotatsu Yanagimoto,Benjamin A. Ash,Tatsuhiro Onodera,Tianyu Wang,Logan G. Wright,Peter L. McMahon

Main category: cs.CV

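TL;DR: This paper proposes correlation-aware training (CAT), which jointly optimizes correlated-photon illumination and a Transformer backend for object recognition, and substantially improves classification accuracy under ultra-low-light, high-noise conditions.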

Details Motivation: Conventional correlated-photon imaging focuses on image reconstruction, whereas computer-vision tasks care about scene inference (such as object recognition); existing methods do not jointly optimize the light source and the digital backend for a specific task. Method: Correlation-aware training (CAT) optimizes a trainable correlated-photon illumination source and a Transformer network end to end, so that they learn together from a small number of shots (≤100). Result: Under ultra-low-light and strongly noisy conditions, classification accuracy improves by up to 15 percentage points over conventional uncorrelated illumination, and also exceeds untrained correlated illumination. Conclusion: Jointly optimizing the pattern of photon correlations and the digital backend for a specific vision task (such as object recognition) can exceed the accuracy limits of existing reconstruction-centered methods in extremely photon-limited scenarios. Abstract: Illumination using correlated photon sources has been established as an approach to allowing high-fidelity images to be reconstructed from noisy camera frames by taking advantage of the knowledge that signal photons are spatially correlated whereas detector clicks due to noise are uncorrelated. However, in computer-vision tasks, the goal is often not ultimately to reconstruct an image, but to make inferences about a scene -- such as what object is present. Here we show how correlated-photon illumination can be used to gain an advantage in a hybrid optical-electronic computer-vision pipeline for object recognition. We demonstrate correlation-aware training (CAT): end-to-end optimization of a trainable correlated-photon illumination source and a Transformer backend in a way that the Transformer can learn to benefit from the correlations, using a small number (<= 100) of shots. We show a classification accuracy enhancement of up to 15 percentage points over conventional, uncorrelated-illumination-based computer vision in ultra-low-light and noisy imaging conditions, as well as an improvement over using untrained correlated-photon illumination. Our work illustrates how specializing to a computer-vision task -- object recognition -- and training the pattern of photon correlations in conjunction with a digital backend allows us to push the limits of accuracy in highly photon-budget-constrained scenarios beyond existing methods focused on image reconstruction.

[76] The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results

Xingyu Qiu,Yuqian Fu,Jiawei Geng,Bin Ren,Jiancheng Pan,Zongwei Wu,Hao Tang,Yanwei Fu,Radu Timofte,Nicu Sebe,Mohamed Elhoseiny,Lingyi Hong,Mingxi Cheng,Xingqi He,Runze Li,Xingdong Sheng,Wenqiang Zhang,Jiacong Liu,Shu Luo,Yikai Qin,Yaze Zhao,Yongwei Jiang,Yixiong Zou,Zhe Zhang,Yang Yang,Kaiyu Li,Bowen Fu,Zixuan Jiang,Ke Li,Hui Qiao,Xiangyong Cao,Xuanlong Yu,Youyang Sha,Longfei Liu,Di Yang,Xi Shen,Kyeongryeol Go,Taewoong Jang,Saiprasad Meesiyawar,Ravi Kirasur,Rakshita Kulkarni,Bhoomi Deshpande,Harsh Patil,Uma Mudenagudi,Shuming Hu,Chao Chen,Tao Wang,Wei Zhou,Qi Xu,Zhenzhao Xing,Dandan Zhao,Hanzhe Xia,Dongdong Lu,Zhe Zhang,Jingru Wang,Guangwei Huang,Jiachen Tu,Yaokun Shi,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Liwei Zhou,Bei Dou,Tao Wu,Zekang Fan,Junjie Liu,Adhémar de Senneville,Flavien Armangeon,Mengbers,Yazhe Lyu,Zhimeng Xin,Zijian Zhuang,Hongchun Zhu,Li Wang

Main category: cs.CV

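TL;DR: This report describes the organization, participation, range of methods, and final results of the NTIRE 2026 Cross-Domain Few-Shot Object Detection (CD-FSOD) Challenge.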

Details Motivation: Address the difficulty that existing object detectors and few-shot learning methods have in generalizing across distinct domains, and promote progress in CD-FSOD research. Method: The NTIRE 2026 CD-FSOD Challenge was run with open-source and closed-source tracks to collect and evaluate methods from teams worldwide. Result: 128 registered participants, 696 submissions, 31 active teams, and 19 teams with valid final results; a variety of strategies emerged that push the performance frontier. Conclusion: The challenge stimulated community attention to and innovation in CD-FSOD and provides an important benchmark and reference for future cross-domain few-shot detection work. Abstract: Cross-domain few-shot object detection (CD-FSOD) remains a challenging problem for existing object detectors and few-shot learning approaches, particularly when generalizing across distinct domains. As part of NTIRE 2026, we hosted the second CD-FSOD Challenge to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions. The challenge received strong community interest, with 128 registered participants and a total of 696 submissions. Among them, 31 teams actively participated, and 19 teams submitted valid final results. Participants explored a wide range of strategies, introducing innovative methods that push the performance frontier under both open-source and closed-source tracks. This report presents a detailed overview of the NTIRE 2026 CD-FSOD Challenge, including a summary of the submitted approaches and an analysis of the final results across all participating teams. Challenge Codes: https://github.com/ohMargin/NTIRE2026_CDFSOD.

[77] TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Bingyi Cao,Koert Chen,Kevis-Kokitsi Maninis,Kaifeng Chen,Arjun Karpur,Ye Xia,Sahil Dua,Tanmaya Dabral,Guangxing Han,Bohyung Han,Joshua Ainslie,Alex Bewley,Mithun Jacob,René Wagner,Washington Ramos,Krzysztof Choromanski,Mojtaba Seyedhosseini,Howard Zhou,André Araujo

Main category: cs.CV

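TL;DR: This paper proposes TIPSv2, which uses patch-level knowledge distillation, an improved iBOT++ pretraining objective, a modified EMA setup, and a multi-granularity synthetic caption sampling strategy to substantially improve the alignment between dense image patches and text concepts in vision-language models, reaching state-of-the-art or near state-of-the-art performance on 9 tasks and 20 datasets.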

Details Motivation: Existing vision-language pretraining models remain weak at fine-grained alignment between dense image patches and the corresponding text concepts, which limits performance on tasks that need pixel- or region-level understanding, such as segmentation and depth prediction. Method: (1) Patch-level distillation to improve the student's patch-text alignment; (2) the iBOT++ objective, in which unmasked tokens also contribute to the loss; (3) a modified EMA update scheme; (4) a multi-granularity synthetic caption sampling strategy; (5) integration into the TIPSv2 family of image-text encoder models. Result: Strong performance across 9 downstream task types (classification, retrieval, segmentation, depth prediction, and others) on 20 benchmark datasets, generally on par with or better than recent vision encoders; code and models are released. Conclusion: Dense patch-text alignment is a key bottleneck for the generality and downstream adaptability of vision-language models; systematic changes to the pretraining recipe (distillation, loss design, EMA, data sampling) let TIPSv2 learn more robust and efficient cross-modal representations for multi-granularity visual understanding. Abstract: Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment -- surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. 
Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/ .

[78] Curvelet-Based Frequency-Aware Feature Enhancement for Deepfake Detection

Salar Adel Sabri,Ramadhan J. Mstafa

Main category: cs.CV

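TL;DR: This paper proposes a new deepfake detection method based on the Curvelet transform, which enhances frequency features with wedge-level attention and scale-aware spatial masking and achieves strong, robust detection performance on the FaceForensics++ dataset.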

Details Motivation: Deepfake detectors that rely on spatial-domain features degrade under image compression, while the Curvelet transform, despite its strong directional and multiscale properties, has not yet been explored for this task. Method: The Curvelet transform extracts frequency-domain features; a wedge-level attention mechanism and scale-aware spatial masking emphasize discriminative frequency components; the refined frequency cues are reconstructed and fed to a modified pretrained Xception network for classification. Result: 98.48% accuracy and 99.96% AUC on the low-compression subset of FaceForensics++, with strong performance maintained under high compression and good interpretability. Conclusion: The Curvelet transform improves the robustness and discriminative power of deepfake detection under compression, and the proposed method combines high performance with interpretability. Abstract: The proliferation of sophisticated generative models has significantly advanced the realism of synthetic facial content, known as deepfakes, raising serious concerns about digital trust. Although modern deep learning-based detectors perform well, many rely on spatial-domain features that degrade under compression. This limitation has prompted a shift toward integrating frequency-domain representations with deep learning to improve robustness. Prior research has explored frequency transforms such as Discrete Cosine Transform (DCT), Fast Fourier Transform (FFT), and Wavelet Transform, among others. However, to the best of our knowledge, the Curvelet Transform, despite its superior directional and multiscale properties, remains entirely unexplored in the context of deepfake detection. In this work, we introduce a novel Curvelet-based detection approach that enhances feature quality through wedge-level attention and scale-aware spatial masking, both trained to selectively emphasize discriminative frequency components. The refined frequency cues are reconstructed and passed to a modified pretrained Xception network for classification. Evaluated on two compression qualities in the challenging FaceForensics++ dataset, our method achieves 98.48% accuracy and 99.96% AUC on FF++ low compression, while maintaining strong performance under high compression, demonstrating the efficacy and interpretability of Curvelet-informed forgery detection.

[79] Does Visual Token Pruning Improve Calibration? An Empirical Study on Confidence in MLLMs

Kaizhen Tan

Main category: cs.CV

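TL;DR: This paper studies how visual token pruning affects the calibration of multimodal large language models (MLLMs). Some pruning strategies (such as the pure-coverage setting of SCOPE) not only preserve accuracy but also markedly improve calibration (lower ECE), whereas saliency-based pruning harms it. The results indicate that calibration should be part of how visual token pruning is evaluated.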

Details Motivation: Existing work on visual token pruning focuses on task accuracy and overlooks calibration (the agreement between predicted confidence and actual correctness), which matters for multimodal systems that must make reliable decisions. Method: Using LLaVA-1.5-7B on POPE and ScienceQA-IMG, the study evaluates calibration metrics (ECE, Brier score, AURC) and accuracy for several pruning strategies (SCOPE with different weights, saliency-only pruning, FastV, random pruning) at multiple token budgets, and runs an alpha-sweep and a study of the coverage gap power exponent. Result: On POPE, the pure-coverage SCOPE setting lowers ECE substantially with accuracy unchanged; reducing the saliency weight consistently improves calibration, while saliency-based pruning and FastV harm it; on ScienceQA-IMG, pruning also lowers ECE with accuracy stable or slightly higher; the default gap power exponent is not always optimal. Conclusion: Visual token pruning should not be judged by accuracy alone, since calibration is an equally important dimension; well-designed coverage-oriented pruning can balance efficiency, accuracy, and reliability, and should be a joint optimization target for efficient MLLM inference. Abstract: Visual token pruning is a widely used strategy for efficient inference in multimodal large language models (MLLMs), but existing work mainly evaluates it with task accuracy. In this paper, we study how visual token pruning affects model calibration, that is, whether predicted confidence matches actual correctness. Using LLaVA-1.5-7B on POPE and ScienceQA-IMG, we evaluate Expected Calibration Error (ECE), Brier score, and AURC under several pruning strategies, including SCOPE with different saliency weights, saliency-only pruning, FastV, and random pruning, across multiple token budgets. Our results show that pruning does not simply trade reliability for efficiency. On POPE, a pure-coverage setting in SCOPE achieves substantially lower ECE than the full unpruned model while maintaining similar accuracy. An internal alpha-sweep further shows a consistent trend: reducing the saliency weight improves calibration at all tested token budgets, while accuracy changes only slightly. In contrast, saliency-based pruning leads to worse calibration, and real FastV causes severe performance degradation in our setting. On ScienceQA-IMG, pruning also reduces ECE, with accuracy remaining stable or slightly improving. We additionally study the gap power exponent in coverage-based selection and find that its default setting is not always optimal. 
Overall, our results suggest that visual token pruning should be evaluated not only by accuracy, but also by confidence quality, especially for multimodal systems that need reliable decisions.
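
For readers unfamiliar with the calibration metrics named in this abstract, the following is a minimal sketch of equal-width binned ECE and the Brier score for a list of per-sample confidences and correctness flags. The function names, the 10-bin default, and the toy data are assumptions made for illustration; the paper's exact binning and confidence definition are not given in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 correctness outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


# Toy data (hypothetical): four predictions with their confidences.
toy_conf = [0.9, 0.9, 0.6, 0.6]
toy_correct = [1, 1, 1, 0]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
toy_brier = brier_score(toy_conf, toy_correct)
```

A lower value is better for both metrics; two models with identical accuracy can differ in ECE, which is the gap the abstract argues pruning evaluations should report.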

[80] Privacy-Preserving Structureless Visual Localization via Image Obfuscation

Vojtech Panek,Patrik Beliansky,Zuzana Kukelova,Torsten Sattler

Main category: cs.CV

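TL;DR: This paper proposes a privacy-preserving approach to structureless visual localization based on image obfuscation (e.g., semantic segmentation). It requires no changes to existing matching pipelines and balances privacy with localization accuracy.

Details Motivation: Existing privacy-preserving visual localization methods are complex, slow, and less accurate than their non-private counterparts; meanwhile, cloud-based localization risks leaking private details through both the query images and the stored scene representations. Method: Simple image obfuscation operations (e.g., replacing RGB images with semantic segmentations) are applied to the reference images of a structureless scene representation, and modern feature matchers match the obfuscated images directly, with no adjustment to existing pipelines. Result: Experiments on multiple datasets show state-of-the-art pose accuracy among privacy-preserving approaches, with a simple and efficient implementation. Conclusion: Structureless localization is naturally compatible with lightweight image obfuscation, which can protect the privacy of both query images and scene representations while keeping pose accuracy state-of-the-art among privacy-preserving approaches. Abstract: Visual localization is the task of estimating the camera pose of an image relative to a scene representation. In practice, visual localization systems are often cloud-based. Naturally, this raises privacy concerns in terms of revealing private details through the images sent to the server or through the representations stored on the server. Privacy-preserving localization aims to avoid such leakage of private details. However, the resulting localization approaches are significantly more complex, slower, and less accurate than their non-privacy-preserving counterparts. In this paper, we consider structureless localization methods in the context of privacy preservation. Structureless methods represent the scene through a set of reference images with known camera poses and intrinsics. In contrast to existing methods proposing representations that are as privacy-preserving as possible, we study a simple image obfuscation approach based on common image operations, e.g., replacing RGB images with (semantic) segmentations. We show that existing structureless pipelines do not need any special adjustments, as modern feature matchers can match obfuscated images out of the box. The results are easy-to-implement pipelines that can ensure both the privacy of the query images and the scene representations. Detailed experiments on multiple datasets show that the resulting methods achieve state-of-the-art pose accuracy for privacy-preserving approaches.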

[81] OpenTME: An Open Dataset of AI-powered H&E Tumor Microenvironment Profiles from TCGA

Maaike Galama,Nina Kozar-Gillan,Christina Embacher,Todd Dembo,Cornelius Böhm,Evelyn Ramberger,Julika Ribbat-Idel,Rosemarie Krupar,Verena Aumiller,Miriam Hägele,Kai Standvoss,Gerrit Erdmann,Blanca Pablos,Ari Angelo,Simon Schallenberg,Andrew Norgan,Viktor Matyas,Klaus-Robert Müller,Maximilian Alber,Lukas Ruff,Frederick Klauschen

Main category: cs.CV

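TL;DR: This paper introduces OpenTME, an open dataset of quantitative tumor microenvironment (TME) profiles built from 3,634 H&E-stained whole-slide images in TCGA, covering five cancer types. All TME features are extracted fully automatically by Atlas H&E-TME, an AI system built on pathology foundation models, yielding over 4,500 quantitative readouts per slide at cell-level resolution.

Details Motivation: Large-scale, consistent, and quantitative characterization of the tumor microenvironment (TME) from H&E-stained images is scarce, which limits its use in clinical and research settings. Method: Atlas H&E-TME, an AI application built on the Atlas pathology foundation models, is run on 3,634 H&E whole-slide images from TCGA to perform tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, producing quantitative TME profiles at cell-level resolution. Result: The released OpenTME dataset contains over 4,500 quantitative TME readouts per slide for 3,634 slides across five cancer types and is hosted on Hugging Face for non-commercial academic research. Conclusion: OpenTME fills a gap in large-scale quantitative TME analysis from H&E images and is expected to support biomarker discovery, spatial biology research, and the development of computational pathology methods. Abstract: The tumor microenvironment (TME) plays a central role in cancer progression, treatment response, and patient outcomes, yet large-scale, consistent, and quantitative TME characterization from routine hematoxylin and eosin (H&E)-stained histopathology remains scarce. We introduce OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA). All outputs were generated using Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models, which performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution. OpenTME is available for non-commercial academic research on Hugging Face. We will continue to expand OpenTME over time and anticipate it will serve as a resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.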

[82] INST-Align: Implicit Neural Alignment for Spatial Transcriptomics via Canonical Expression Fields

Bonian Han,Cong Qi,Przemyslaw Musialski,Zhi Wei

Main category: cs.CV

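TL;DR: INST-Align is an unsupervised pairwise framework for joint alignment and reconstruction of multi-slice spatial transcriptomics data. It couples a coordinate-based deformation network with a shared Canonical Expression Field (an implicit neural representation) so that deformation correction and batch-effect removal are learned together.

Details Motivation: Multi-slice spatial transcriptomics analysis is hard to align and integrate because large non-rigid deformations are coupled with inter-slice batch effects. Method: INST-Align combines a coordinate-based deformation network with a shared Canonical Expression Field, an implicit neural network that maps spatial coordinates to expression embeddings. Training has two phases: first stabilize the canonical embedding space, then jointly optimize deformation and spatial-feature matching. Sharing the canonical field across slices regularizes ambiguous correspondences and absorbs batch variation. Result: Across nine datasets the method achieves state-of-the-art results: mean OT Accuracy 0.702, NN Accuracy 0.719, and lower Chamfer distance (reductions of up to 94.9% on large-deformation sections relative to the strongest baseline). It also produces biologically meaningful spatial embeddings and coherent 3D tissue reconstruction. Conclusion: INST-Align models deformation correction and batch-effect removal jointly, offering a robust, interpretable, and scalable unified solution for multi-slice spatial transcriptomics analysis. Abstract: Spatial transcriptomics (ST) measures mRNA expression while preserving spatial organization, but multi-slice analysis faces two coupled difficulties: large non-rigid deformations across slices and inter-slice batch effects when alignment and integration are treated independently. We present INST-Align, an unsupervised pairwise framework that couples a coordinate-based deformation network with a shared Canonical Expression Field, an implicit neural representation mapping spatial coordinates to expression embeddings, for joint alignment and reconstruction. A two-phase training strategy first establishes a stable canonical embedding space and then jointly optimizes deformation and spatial-feature matching, enabling mutually constrained alignment and representation learning. Cross-slice parameter sharing of the canonical field regularizes ambiguous correspondences and absorbs batch variation. Across nine datasets, INST-Align achieves state-of-the-art mean OT Accuracy (0.702), NN Accuracy (0.719), and Chamfer distance, with Chamfer reductions of up to 94.9% on large-deformation sections relative to the strongest baseline. The framework also yields biologically meaningful spatial embeddings and coherent 3D tissue reconstruction. The code will be released after review phase.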

[83] PC-MIL: Decoupling Feature Resolution from Supervision Scale in Whole-Slide Learning

Syed Fahim Ahmed,Gnanesh Rasineni,Florian Koehler,Abu Zahid Bin Aziz,Mei Wang,Attila Gyulassy,Brian Summa,J. Quincy Brown,Valerio Pascucci,Shireen Y. Elhabian

Main category: cs.CV

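TL;DR: This paper proposes Progressive-Context MIL (PC-MIL), a framework that controls the spatial extent of supervision in millimeter units (e.g., clinically relevant 2 mm regions), decoupling feature resolution from supervision scale. Without pixel-level annotation or changes in magnification, it improves the anatomical interpretability and cross-context generalization of WSI classification models.

Details Motivation: Standard whole-slide image (WSI) classification uses global multiple instance learning (MIL) with slide-level labels only, so models ignore anatomical structure and local findings (e.g., tumor burden, focal lesions). This is a serious mismatch with the millimeter-scale regions at which clinicians actually reason. Method: PC-MIL fixes feature extraction at 20x and defines the spatial extent of the MIL bag in millimeters. Regional supervision is anchored at a clinically motivated 2 mm scale, and slide-level and region-level supervision are progressively mixed, which supports explicit train-context by test-context analysis. Result: On binary cancer detection over 1,476 prostate WSIs from five public datasets, modest regional supervision improves cross-context (slide/region) generalization, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance. Conclusion: The spatial scale of supervision is a key inductive-bias dimension in MIL that is independent of feature resolution; explicitly modeling anatomical context (e.g., 2 mm regions) improves the clinical interpretability and generalization robustness of WSI models. Abstract: Whole-slide image (WSI) classification in computational pathology is commonly formulated as slide-level Multiple Instance Learning (MIL) with a single global bag representation. However, slide-level MIL is fundamentally underconstrained: optimizing only global labels encourages models to aggregate features without learning anatomically meaningful localization. This creates a mismatch between the scale of supervision and the scale of clinical reasoning. Clinicians assess tumor burden, focal lesions, and architectural patterns within millimeter-scale regions, whereas standard MIL is trained only to predict whether "somewhere in the slide there is cancer." As a result, the model's inductive bias effectively erases anatomical structure. We propose Progressive-Context MIL (PC-MIL), a framework that treats the spatial extent of supervision as a first-class design dimension. Rather than altering magnification, patch size, or introducing pixel-level segmentation, we decouple feature resolution from supervision scale. Using fixed 20x features, we vary MIL bag extent in millimeter units and anchor supervision at a clinically motivated 2mm scale to preserve comparable tumor burden and avoid confounding scale with lesion density. PC-MIL progressively mixes slide- and region-level supervision in controlled proportions, enabling explicit train-context x test-context analysis. 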
On 1,476 prostate WSIs from five public datasets for binary cancer detection, we show that anatomical context is an independent axis of generalization in MIL, orthogonal to feature resolution: modest regional supervision improves cross-context performance, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance. These results demonstrate that supervision extent shapes MIL inductive bias and support anatomically grounded WSI generalization.

[84] PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation

Minjae Lee,Sungwoo Hur,Soojin Hwang,Won Hwa Kim

Main category: cs.CV

TL;DR: This paper proposes PR-MaGIC, a training-free test-time prompt refinement framework that uses gradient flow from SAM's mask decoder to refine prompts automatically and improve in-context segmentation.

Details Motivation: Existing SAM-based in-context segmentation methods generate sub-optimal automatic prompts because of visual inconsistencies between support and query images, which degrades segmentation quality. Method: PR-MaGIC refines prompts via gradient flow derived from SAM's mask decoder and uses a top-1 selection strategy, theoretically grounded and stable in practice, to keep the refinement robust. Result: Segmentation quality improves consistently across multiple benchmarks, with no additional training or architectural modification, and inadequate prompts are effectively mitigated. Conclusion: PR-MaGIC is a general, efficient, plug-and-play prompt refinement method that offers a new approach to training-free in-context segmentation. Abstract: Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM's mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples. Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.

[85] HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

Xinyun Liu

Main category: cs.CV

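TL;DR: This paper proposes HTDC, a training-free decoding framework that detects layer-wise hesitation and triggers differential calibration only at decoding steps where visual grounding is unstable, suppressing hallucinations in large vision-language models (LVLMs) while avoiding unnecessary intervention on stable steps.

Details Motivation: LVLMs hallucinate because of unstable visual grounding and over-reliance on language priors. Existing training-free decoding methods typically calibrate at every step, which adds computation and can disrupt stable predictions. Method: The authors identify layer-wise hesitation as a signal of grounding instability and build HTDC on it: calibration is activated only at hesitation-prone steps, and when triggered it contrasts the full branch with two lightweight probes (visual nullification and semantic nullification) to suppress hallucination-prone candidates. Result: On representative hallucination benchmarks, HTDC consistently reduces hallucinations while maintaining strong task accuracy, achieving a favorable trade-off between effectiveness and computational overhead. Conclusion: HTDC is an efficient, targeted, training-free decoding strategy that offers a new direction for mitigating hallucination in LVLMs. Abstract: Large vision-language models (LVLMs) achieve strong multimodal performance, but still suffer from hallucinations caused by unstable visual grounding and over-reliance on language priors. Existing training-free decoding methods typically apply calibration at every decoding step, introducing unnecessary computation and potentially disrupting stable predictions. We address this problem by identifying layer-wise hesitation, a simple signal of grounding instability reflected by fluctuations in token preference across intermediate layers. Based on this observation, we propose Hesitation-Triggered Differential Calibration (HTDC), a training-free decoding framework that preserves standard full-branch inference and activates calibration only at hesitation-prone steps. When triggered, HTDC contrasts the full branch with two lightweight probes, a visual-nullification probe and a semantic-nullification probe, to suppress hallucination-prone candidates while avoiding unnecessary intervention on stable steps. Experiments on representative hallucination benchmarks show that HTDC consistently reduces hallucinations while maintaining strong task accuracy, achieving a favorable trade-off between effectiveness and computational overhead.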

[86] Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models

Md Tanvirul Alam

Main category: cs.CV

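TL;DR: This paper introduces the VLM-Fix benchmark and shows that large vision-language models exhibit "semantic fixation": when given rules that contradict the conventional semantics but are logically equivalent, they still favor the default interpretation. Experiments with abstract strategy games, prompt interventions, post-training, and activation steering examine the robustness, editability, and external validity of the effect.

Details Motivation: Existing evaluations cannot cleanly separate perception failures from rule-mapping failures in VLMs. The authors aim to isolate and systematically study models' over-reliance on familiar semantic priors, which they call semantic fixation. Method: VLM-Fix pairs standard and inverse rule formulations for identical terminal board states in four abstract strategy games. The authors test 14 open and closed VLMs, apply interventions including neutral and semantically loaded alias prompts, single-rule and joint-rule post-training, and late-layer activation steering, and check external validity on VLMBias. Result: All 14 VLMs favor the standard rules, showing a stable semantic-fixation gap. Neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned. Activation steering partially recovers inverse-rule performance. The same qualitative pattern appears on VLMBias. Conclusion: Semantic fixation is a widespread and measurable bias in VLMs that stems from over-binding to semantic priors. It is present in late representations and is at least partly editable; prompt engineering and representation-level interventions can mitigate it, provided the prompts remain semantically neutral. Abstract: Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM-Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and inverse rule formulations. Across 14 open and closed VLMs, accuracy consistently favors standard rules, revealing a robust semantic-fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned: training on one rule improves same-rule transfer but hurts opposite-rule transfer, while joint-rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on VLMBias and observe the same qualitative pattern. Finally, late-layer activation steering partially recovers degraded performance, indicating that semantic-fixation errors are at least partly editable in late representations. Project page, code, and dataset available at https://maveryn.github.io/vlm-fix/.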

[87] ViLL-E: Video LLM Embeddings for Retrieval

Rohit Gupta,Jayakrishnan Unnikrishnan,Fan Fei,Sheng Liu,Son Tran,Mubarak Shah

Main category: cs.CV

TL;DR: ViLL-E is a unified VideoLLM that enhances video understanding and retrieval by introducing an adaptive embedding generation mechanism and a three-stage training strategy, achieving state-of-the-art performance in temporal localization, video retrieval, and zero-shot composed/long-text retrieval.

Details Motivation: VideoLLMs perform well on generative tasks (e.g., Video QA, captioning) but lag behind specialized embedding models in retrieval tasks (e.g., Text-to-Video Retrieval, Moment Retrieval); there is a need for a unified model that excels across both paradigms. Method: Proposes ViLL-E, a VideoLLM with a novel adaptive embedding generation mechanism ('think longer' for complex videos, 'stop early' for easy ones), trained via a three-stage methodology: (1) large-scale generative pre-training on video-caption pairs, (2) continual training on detailed captions, and (3) multi-task fine-tuning on Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Result: Significantly improves temporal localization (avg. +7% over other VideoLLMs) and video retrieval (up to +4% over dual-encoder models); matches SotA embedding models in retrieval while remaining competitive in VideoQA; achieves new zero-shot gains in composed video retrieval (+5%) and long-text retrieval (+2%). Conclusion: ViLL-E bridges the gap between generative and embedding-based video understanding, demonstrating that unified VideoLLMs, with adaptive embedding generation and joint contrastive-generative training, can excel across diverse video-language tasks, including challenging zero-shot retrieval scenarios. Abstract: Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in Retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. 
We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on avg. 7% over other VideoLLMs) and video retrieval (up to 4% over dual encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).

[88] Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

Sebastian Cajas,Ashaba Judith,Rahul Gorijavolu,Sahil Kapadia,Hillary Clinton Kasimbazi,Leo Kinyera,Emmanuel Paul Kwesiga,Sri Sri Jaithra Varma Manthena,Luis Filipe Nakayama,Ninsiima Doreen,Leo Anthony Celi

Main category: cs.CV

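TL;DR: This paper finds that, for medical image super-resolution, replacing the generic Stable Diffusion VAE with the domain-specific MedVAE substantially improves reconstruction quality across modalities (+2.91 to +3.29 dB PSNR). The gain is concentrated in the finest-detail frequency bands, and hallucination rates do not increase meaningfully. VAE reconstruction quality predicts downstream SR performance, so VAE selection should precede the choice of diffusion architecture.

Details Motivation: Latent diffusion models for medical image super-resolution generally reuse variational autoencoders (VAEs) designed for natural images, but whether this is the performance bottleneck has been unclear. Method: With all other pipeline components held fixed, the Stable Diffusion VAE is replaced with MedVAE, which is pretrained on more than 1.6 million medical images, and the two are evaluated on knee MRI, brain MRI, and chest X-ray datasets. The study adds wavelet decomposition, ablations, and statistical tests (Wilcoxon signed-rank, Cohen's d/h). Result: MedVAE yields a +2.91 to +3.29 dB PSNR improvement (p < 10^{-20}); the advantage is concentrated in the finest spatial frequency bands; hallucination rates are comparable (Cohen's h < 0.02); and VAE reconstruction quality predicts final SR performance (R^2 = 0.67). Conclusion: The choice of VAE, not the diffusion architecture, is the dominant constraint on medical image super-resolution quality. High-fidelity domain-specific VAEs should be selected first, and their reconstruction metrics serve as a screening criterion that predicts downstream performance without training a diffusion model. Abstract: Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen's d = 1.37 to 1.86, all p < 10^{-20}, Wilcoxon signed-rank). Wavelet decomposition localises the advantage to the finest spatial frequency bands encoding anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within plus or minus 0.15 dB, while hallucination rates remain comparable between methods (Cohen's h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at https://github.com/sebasmos/latent-sr.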
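
To put the reported PSNR gains in perspective, here is a minimal sketch of the standard PSNR definition and of the conversion from a dB gain to a mean-squared-error ratio. The function names and the flat-list image representation are assumptions for illustration only; this is not the authors' evaluation code.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_db_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain (same peak value)."""
    return 10 ** (gain_db / 10)


# Toy two-pixel "images" (hypothetical values): MSE is 0.01, so PSNR is about 20 dB.
toy_psnr = psnr([0.0, 1.0], [0.1, 0.9])
# A gain of roughly 3 dB corresponds to roughly halving the MSE.
ratio_at_3db = mse_ratio_from_db_gain(3.0)
```

By this definition, the +2.91 to +3.29 dB range quoted in the abstract corresponds to the reconstruction MSE dropping by a factor of about 2.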

[89] VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

Parth Parag Kulkarni,Rohit Gupta,Prakash Chandra Chhipa,Mubarak Shah

Main category: cs.CV

TL;DR: This paper proposes VidTAG, a dual-encoder framework that retrieves GPS coordinates for video frames using both self-supervised and language-aligned features. TempGeo and GeoRefiner modules address temporal inconsistency, and the method clearly outperforms existing approaches on multiple datasets.

Details Motivation: Classification-based methods only support coarse, city-level localization, while image retrieval methods are impractical at global scale because they need an extensive image gallery. A gallery of GPS coordinates, by contrast, is simple and cheap to build, which motivates an efficient, fine-grained video geolocalization method. Method: VidTAG is a dual-encoder framework that performs frame-to-GPS retrieval with self-supervised and language-aligned features. The TempGeo module aligns frame embeddings to reduce temporal inconsistency, and the GeoRefiner module (an encoder-decoder architecture) refines GPS features using the aligned frame embeddings. Result: On the Mapillary (MSLS) and GAMa datasets, the model improves on GeoCLIP by 20% at the 1 km threshold; on CityGuessr68k, it beats the current state of the art in global coarse-grained video geolocalization by 25%. Conclusion: VidTAG enables fine-grained, temporally consistent video geolocalization and lays a foundation for future research. Abstract: The task of video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory; with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to the need for extensive image galleries which are infeasible to compile. Comparatively, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on Mapillary (MSLS) and GAMa datasets demonstrate our model's ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat current State-of-the-Art by 25% on global coarse grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details on the project webpage: https://parthpk.github.io/vidtag_webpage/
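
The "1 km threshold" in this abstract refers to the fraction of predictions that fall within 1 km of the true location. A minimal sketch of such a metric using the haversine great-circle distance follows; the function names, the mean Earth radius constant, and the toy coordinates are assumptions for illustration, and the paper may compute distances differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_threshold(predictions, ground_truth, threshold_km=1.0):
    """Share of predicted (lat, lon) pairs within threshold_km of the true location."""
    hits = sum(
        haversine_km(p[0], p[1], g[0], g[1]) <= threshold_km
        for p, g in zip(predictions, ground_truth)
    )
    return hits / len(predictions)


# Toy frames (hypothetical): the first prediction is about 0.11 km off, the second about 111 km off.
toy_preds = [(0.0, 0.0), (0.0, 0.0)]
toy_truth = [(0.0, 0.001), (0.0, 1.0)]
toy_acc = accuracy_at_threshold(toy_preds, toy_truth, threshold_km=1.0)
```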

[90] Nucleus-Image: Sparse MoE for Image Generation

Chandan Akiti,Ajay Modukuri,Murali Nandan Nagarapu,Gunavardhan Akiti,Haozhe Liu

Main category: cs.CV

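TL;DR: Nucleus-Image is a sparse MoE diffusion transformer for text-to-image generation. With approximately 2B active parameters it matches or exceeds the generation quality of larger models, sets a new Pareto frontier on several benchmarks, is fully open-source, and uses no post-training optimization.

Details Motivation: To balance generation quality and inference efficiency in text-to-image generation, and to move past the existing trade-off among parameter count, compute cost, and performance. Method: A sparse MoE diffusion transformer with Expert-Choice Routing; text tokens are excluded from the transformer backbone, and joint attention enables text KV sharing across timesteps; a decoupled routing design improves routing stability under timestep modulation; the training corpus contains 1.5B high-quality training pairs obtained through multi-stage filtering; training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing and progressive sparsification of the expert capacity factor; the Muon optimizer is used with a parameter grouping recipe tailored to diffusion models. Result: The model matches or exceeds leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass out of 17B total; it is described as the first fully open-source MoE diffusion model at this quality; no post-training optimization (such as reinforcement learning, DPO, or human preference tuning) is used. Conclusion: Sparse MoE scaling is an effective path to high-quality, efficient image generation; Nucleus-Image shows that architectural and training-strategy choices can sharply reduce inference cost while maintaining top-tier performance. Abstract: We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. 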
Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
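
As background for the Expert-Choice Routing mentioned in this abstract, the sketch below shows the generic idea in which each expert selects its own top-scoring tokens up to a fixed capacity, instead of each token selecting experts. It is a plain-Python illustration under assumed names (`expert_choice_route`, `scores`, `capacity`) and does not reproduce the paper's decoupled routing design or capacity schedule.

```python
def expert_choice_route(scores, capacity):
    """Generic expert-choice routing.

    scores[t][e] is the router affinity of token t for expert e.
    Each expert keeps the `capacity` tokens with the highest affinity.
    Returns a dict mapping expert index to a sorted list of token indices.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment


# Toy router scores (hypothetical): 4 tokens, 2 experts, capacity of 2 tokens per expert.
toy_scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.2, 0.6],
]
toy_assignment = expert_choice_route(toy_scores, capacity=2)
```

Because every expert processes exactly `capacity` tokens, the load is balanced by construction; the trade-off is that a given token may be picked by several experts or by none.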

[91] Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment

Xinjie Zhang,Qiang Li,Xiaowen Ma,Axi Niu,Li Yan,Qingsen Yan

Main category: cs.CV

TL;DR: This paper proposes the DS-IEQA framework, which jointly learns evaluation criteria and score representations through Feedback-Driven Metric Prompt Optimization (FDMPO) and a Token-Decoupled Distance Regression Loss (TDRL), improving the accuracy of image editing quality assessment and its modeling of score continuity.

Details Motivation: Existing MLLM-based image editing quality assessment methods rely on hand-crafted heuristic prompts, which leads to rigid metric prompting and distance-agnostic score modeling; they struggle to align with implicit human criteria and to capture the continuous structure of the score space. Method: The unified Define-and-Score IEQA (DS-IEQA) framework has two parts: (1) Feedback-Driven Metric Prompt Optimization (FDMPO) automatically refines metric definitions; (2) Token-Decoupled Distance Regression Loss (TDRL) decouples numerical tokens from language modeling and explicitly models score continuity. Result: The method ranks 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2 without any additional training data. Conclusion: By jointly learning evaluation criteria and score representations, DS-IEQA mitigates the limits of conventional MLLM methods in metric flexibility and score-continuity modeling, improving the reliability of IEQA and its agreement with human judgment. Abstract: Recent advances in image editing have heightened the need for reliable Image Editing Quality Assessment (IEQA). Unlike traditional methods, IEQA requires complex reasoning over multimodal inputs and multi-dimensional assessments. Existing MLLM-based approaches often rely on human heuristic prompting, leading to two key limitations: rigid metric prompting and distance-agnostic score modeling. These issues hinder alignment with implicit human criteria and fail to capture the continuous structure of score spaces. To address this, we propose Define-and-Score Image Editing Quality Assessment (DS-IEQA), a unified framework that jointly learns evaluation criteria and score representations. Specifically, we introduce Feedback-Driven Metric Prompt Optimization (FDMPO) to automatically refine metric definitions via probabilistic feedback. Furthermore, we propose Token-Decoupled Distance Regression Loss (TDRL), which decouples numerical tokens from language modeling to explicitly model score continuity through expected distance minimization. Extensive experiments show our method's superior performance; it ranks 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2 without any additional training data.

[92] Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

Wentai Zhang,Ronghui Xi,Shiyao Peng,Jiayu Huang,Haoran Luo,Zichen Tang,Haihong E

Main category: cs.CV

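TL;DR: This paper proposes Precision-Allocated Sparse Attention (PASA), a training-free sparse attention framework that uses dynamic budget allocation, hardware-aligned grouped approximation, and a stochastic routing mechanism to accelerate inference substantially while preserving video generation quality and removing temporal flickering.

Details Motivation: Existing sparse attention methods cause severe visual flickering because of static sparsity patterns and deterministic block routing, and they struggle to combine computational efficiency with temporal consistency. Method: PASA has three parts: (1) a curvature-aware dynamic budgeting mechanism that elastically allocates exact computation to timesteps with critical semantic transitions; (2) hardware-aligned grouped approximations in place of global homogenizing estimates, which capture local detail while maintaining compute throughput; (3) a stochastic selection bias in attention routing, which softens hard selection boundaries and reduces selection oscillation and localized computational starvation. Result: On leading video diffusion models, PASA achieves substantial inference acceleration while producing videos with strong temporal smoothness and structural stability, effectively removing flickering. Conclusion: PASA is an efficient, training-free sparse attention approach for video generation that addresses the conflict between efficiency and temporal consistency in existing methods. Abstract: Video Diffusion Transformers have revolutionized high-fidelity video generation but suffer from the massive computational burden of self-attention. While sparse attention provides a promising acceleration solution, existing methods frequently provoke severe visual flickering caused by static sparsity patterns and deterministic block routing. To resolve these limitations, we propose Precision-Allocated Sparse Attention (PASA), a training-free framework designed for highly efficient and temporally smooth video generation. First, we implement a curvature-aware dynamic budgeting mechanism. By profiling the generation trajectory acceleration across timesteps, we elastically allocate the exact-computation budget to secure high-precision processing strictly during critical semantic transitions. Second, we replace global homogenizing estimations with hardware-aligned grouped approximations, successfully capturing fine-grained local variations while maintaining peak compute throughput. Finally, we incorporate a stochastic selection bias into the attention routing mechanism. This probabilistic approach softens rigid selection boundaries and eliminates selection oscillation, effectively eradicating the localized computational starvation that drives temporal flickering. Extensive evaluations on leading video diffusion models demonstrate that PASA achieves substantial inference acceleration while consistently producing remarkably fluid and structurally stable video sequences.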

[93] BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition

Qingyuan Cai,Saihui Hou,Xuecai Hu,Yongzhen Huang

Main category: cs.CV

TL;DR: This paper introduces the BarbieGait synthetic gait dataset and the GaitCLIF model to address the difficulty that diverse real-world clothing poses for gait recognition, significantly improving cross-clothing recognition performance.

Details Motivation: Diverse clothing styles in the real world are a major challenge for gait recognition, and real data cannot cover enough clothing variation, which limits the validation and improvement of cross-clothing gait recognition algorithms. Method: The authors build BarbieGait, a controllable synthetic gait dataset in which real subjects are mapped into a virtual engine to simulate extensive clothing changes while preserving gait identity, and propose GaitCLIF, a model that learns cloth-invariant gait features. Result: GaitCLIF significantly improves cross-clothing gait recognition performance on BarbieGait and on existing popular gait benchmarks. Conclusion: BarbieGait provides high-quality, controllable synthetic data for cross-clothing gait recognition, and GaitCLIF provides a robust baseline for the task; together they advance gait recognition in complex real-world scenarios. Abstract: Gait recognition, as a reliable biometric technology, has seen rapid development in recent years while it faces significant challenges caused by diverse clothing styles in the real world. This paper introduces BarbieGait, a synthetic gait dataset where real-world subjects are uniquely mapped into a virtual engine to simulate extensive clothing changes while preserving their gait identity information. As a pioneering work, BarbieGait provides a controllable gait data generation method, enabling the production of large datasets to validate cross-clothing issues that are difficult to verify with real-world data. However, the diversity of clothing increases intra-class variance and makes one of the biggest challenges to learning cloth-invariant features under varying clothing conditions. Therefore, we propose GaitCLIF (Gait-oriented CLoth-Invariant Feature) as a robust baseline model for cross-clothing gait recognition. Through extensive experiments, we validate that our method significantly improves cross-clothing performance on BarbieGait and the existing popular gait benchmarks. We believe that BarbieGait, with its extensive cross-clothing gait data, will further advance the capabilities of gait recognition in cross-clothing scenarios and promote progress in related research.

[94] Physics-Grounded Monocular Vehicle Distance Estimation Using Standardized License Plate Typography

Manognya Lokesh Reddy,Zheng Liu

Main category: cs.CV

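TL;DR: This paper proposes a training-free monocular ranging framework based on geometric priors from United States license plates. It uses the plate as a passive fiducial marker to resolve scale ambiguity, giving accurate, robust, and certifiable distance estimates at low cost.

Details Motivation: Existing monocular depth estimation methods depend on expensive supervised training, are vulnerable to domain shift, and are hard to certify for safety-critical deployment, while LiDAR and radar are too costly for mass-market adoption. Method: (1) a four-method parallel plate detector for robust detection across the full range of lighting conditions; (2) a three-stage state identification engine that fuses OCR text matching, multi-design color scoring, and a lightweight neural network classifier; (3) a hybrid depth estimation framework combining inverse-variance weighted fusion, online scale alignment, and a one-dimensional constant-velocity Kalman filter. Result: Mean absolute error is 2.3% at 10 m, distance-estimate variance is 36% lower than with prior plate-width methods, output remains continuous during brief plate occlusions, and relative error is better than deep learning baselines by a factor of five. Conclusion: The method needs no training data or active illumination and uses explicit geometric priors to provide interpretable, certifiable, low-cost monocular ranging suitable for collision warning in ADAS and autonomous driving. Abstract: Accurate inter-vehicle distance estimation is a cornerstone of Advanced Driver Assistance Systems (ADAS) and autonomous driving. While LiDAR and radar provide high precision, their high cost prohibits widespread adoption in mass-market vehicles. Monocular camera-based estimation offers a low-cost alternative but suffers from fundamental scale ambiguity. Recent deep learning methods for monocular depth achieve impressive results yet require expensive supervised training, suffer from domain shift, and produce predictions that are difficult to certify for safety-critical deployment. This paper presents a framework that exploits the standardized typography of United States license plates as passive fiducial markers for metric ranging, resolving scale ambiguity through explicit geometric priors without any training data or active illumination. First, a four-method parallel plate detector achieves robust plate reading across the full automotive lighting range. Second, a three-stage state identification engine fusing OCR text matching, multi-design color scoring, and a lightweight neural network classifier provides robust identification across all ambient conditions. Third, hybrid depth fusion with inverse-variance weighting and online scale alignment, combined with a one-dimensional constant-velocity Kalman filter, delivers smoothed distance, relative velocity, and time-to-collision for collision warning. Baseline validation reproduces a 2.3% coefficient of variation in character height measurements and a 36% reduction in distance-estimate variance compared with plate-width methods from prior work. 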
Extensive outdoor experiments confirm a mean absolute error of 2.3% at 10 m and continuous distance output during brief plate occlusions, outperforming deep learning baselines by a factor of five in relative error.
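The ranging pipeline in this abstract (a known physical size resolving scale, inverse-variance fusion, then a one-dimensional constant-velocity Kalman filter) can be illustrated with a minimal sketch. This is not the authors' implementation: the pinhole relation Z = f·H/h is the standard textbook model, and every name, noise value, and the 0.07 m character height below are hypothetical placeholders.

```python
import math

def pinhole_distance(focal_px, real_height_m, pixel_height):
    """Standard pinhole relation Z = f * H / h (all inputs are assumptions)."""
    return focal_px * real_height_m / pixel_height

def inverse_variance_fuse(estimates):
    """Fuse (value, variance) pairs, weighting each value by 1 / variance."""
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    fused = sum(w * value for w, (value, _) in zip(weights, estimates)) / total
    return fused, 1.0 / total

class ConstantVelocityKalman1D:
    """Kalman filter over the state [distance, relative velocity]."""

    def __init__(self, distance, velocity=0.0, variance=1.0, process_noise=0.1):
        self.d, self.v = distance, velocity
        self.p = [[variance, 0.0], [0.0, variance]]  # state covariance
        self.q = process_noise  # hypothetical white-acceleration noise level

    def predict(self, dt):
        (p00, p01), (p10, p11) = self.p
        self.d += self.v * dt  # constant-velocity motion model
        q = self.q
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]
        self.p = [
            [p00 + dt * (p10 + p01) + dt * dt * p11 + q * dt ** 3 / 3.0,
             p01 + dt * p11 + q * dt ** 2 / 2.0],
            [p10 + dt * p11 + q * dt ** 2 / 2.0,
             p11 + q * dt],
        ]

    def update(self, measured_distance, measurement_variance):
        (p00, p01), (p10, p11) = self.p
        s = p00 + measurement_variance  # innovation variance (H = [1, 0])
        k0, k1 = p00 / s, p10 / s       # Kalman gain
        residual = measured_distance - self.d
        self.d += k0 * residual
        self.v += k1 * residual
        self.p = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]

    def time_to_collision(self):
        """Distance over closing speed; infinite when the gap is not closing."""
        return self.d / -self.v if self.v < 0 else math.inf

# Hypothetical usage: fuse two range estimates, then track a closing vehicle.
z, var = inverse_variance_fuse([(pinhole_distance(1000.0, 0.07, 7.0), 0.04),
                                (10.4, 0.16)])
kf = ConstantVelocityKalman1D(z)
```

Inverse-variance weighting gives the minimum-variance linear combination of independent unbiased estimates, which is why the fused variance `1 / sum(weights)` is never larger than the smallest input variance.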

[95] ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models

Xinliang Wang,Yifeng Shi,Zhenyu Wu

Main category: cs.CV

TL;DR: This paper proposes ArtifactWorld, a framework that addresses the geometric and photometric degradation of 3D Gaussian Splatting (3DGS) under sparse views through systematic data expansion and a homogeneous dual-model paradigm.

Details Motivation: 3DGS degrades geometrically and photometrically under sparse views. Existing generative restoration methods are limited by insufficient temporal coherence, missing explicit spatial constraints, and a lack of large-scale training data, which leads to multi-view inconsistency, geometric hallucinations, and poor generalization. Method: Build a fine-grained phenomenological taxonomy of 3DGS artifacts and generate 107.5K diverse paired video clips; use a video diffusion backbone with an isomorphic predictor that produces an artifact heatmap, and apply an Artifact-Aware Triplet Fusion mechanism for precise, intensity-guided spatio-temporal repair within native self-attention. Result: State-of-the-art performance on sparse novel view synthesis and robust 3D reconstruction. Conclusion: By combining data-driven scaling with structured modeling, ArtifactWorld improves the quality and generalization of 3DGS restoration and offers a new paradigm for robust 3D reconstruction in real scenes. Abstract: 3D Gaussian Splatting (3DGS) delivers high-fidelity real-time rendering but suffers from geometric and photometric degradations under sparse-view constraints. Current generative restoration approaches are often limited by insufficient temporal coherence, a lack of explicit spatial constraints, and a lack of large-scale training data, resulting in multi-view inconsistencies, erroneous geometric hallucinations, and limited generalization to diverse real-world artifact distributions. In this paper, we present ArtifactWorld, a framework that resolves 3DGS artifact repair through systematic data expansion and a homogeneous dual-model paradigm. To address the data bottleneck, we establish a fine-grained phenomenological taxonomy of 3DGS artifacts and construct a comprehensive training set of 107.5K diverse paired video clips to enhance model robustness. Architecturally, we unify the restoration process within a video diffusion backbone, utilizing an isomorphic predictor to localize structural defects via an artifact heatmap. This heatmap then guides the restoration through an Artifact-Aware Triplet Fusion mechanism, enabling precise, intensity-guided spatio-temporal repair within native self-attention. Extensive experiments demonstrate that ArtifactWorld achieves state-of-the-art performance in sparse novel view synthesis and robust 3D reconstruction. Code and dataset will be made public.

[96] ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception

Huanzhen Wang,Ziheng Zhou,Jiaqi Song,Li He,Yunshi Lan,Yan Wang,Wenqiang Zhang

Main category: cs.CV

TL;DR: This paper proposes ARGen, a two-stage framework (Affective Semantic Injection and Adaptive Reinforcement Diffusion) that performs data-adaptive generative augmentation for dynamic expression recognition and improves recognition of rare emotions.

Details Motivation: Dynamic facial expression recognition in real-world settings suffers from data scarcity and long-tail distributions, which make it hard for models to learn the temporal dynamics of rare emotions. Method: ARGen has two stages: (1) Affective Semantic Injection (ASI), which uses facial Action Units and large visual-language models for retrieval-augmented prompt generation to inject interpretable emotional priors; (2) Adaptive Reinforcement Diffusion (ARD), which combines text-conditioned image-to-video diffusion with reinforcement learning and a multi-objective reward that optimizes expression naturalness, facial integrity, and generative efficiency. Result: Extensive experiments on generation and recognition tasks show that ARGen substantially improves synthesis fidelity and recognition performance. Conclusion: ARGen establishes an interpretable and generalizable generative augmentation paradigm for vision-based affective computing. Abstract: Dynamic facial expression recognition in the wild remains challenging due to data scarcity and long-tail distributions, which hinder models from effectively learning the temporal dynamics of scarce emotions. To address these limitations, we propose ARGen, an Affect-Reinforced Generative Augmentation Framework that enables data-adaptive dynamic expression generation for robust emotion perception. ARGen operates in two stages: Affective Semantic Injection (ASI) and Adaptive Reinforcement Diffusion (ARD). The ASI stage establishes affective knowledge alignment through facial Action Units and employs a retrieval-augmented prompt generation strategy to synthesize consistent and fine-grained affective descriptions via large-scale visual-language models, thereby injecting interpretable emotional priors into the generation process. The ARD stage integrates text-conditioned image-to-video diffusion with reinforcement learning, introducing inter-frame conditional guidance and a multi-objective reward function to jointly optimize expression naturalness, facial integrity, and generative efficiency. Extensive experiments on both generation and recognition tasks verify that ARGen substantially enhances synthesis fidelity and improves recognition performance, establishing an interpretable and generalizable generative augmentation paradigm for vision-based affective computing.

[97] Style-Decoupled Adaptive Routing Network for Underwater Image Enhancement

Hang Xu,Chen Long,Bing Wang,Hao Chen,Zhen Dong

Main category: cs.CV

TL;DR: This paper proposes SDAR-Net, an adaptive underwater image enhancement framework that decouples degradation style from scene structure and uses an adaptive routing mechanism to enhance each image as needed, reaching a state-of-the-art PSNR of 25.72 dB on a real-world dataset.

Details Motivation: Existing underwater image enhancement methods mostly apply a uniform mapping, which cannot serve both mildly and severely degraded images and leads to over-processing or insufficient recovery. Method: SDAR-Net first decouples image features into dynamic degradation style embeddings and static scene structure representations, then uses an adaptive routing mechanism that predicts soft weights for each enhancement stage from the style features and fuses the corresponding representations by weighted sum. Result: PSNR reaches 25.72 dB on a real-world benchmark, a new state of the art, and the method also proves useful in downstream vision tasks. Conclusion: Style-structure decoupling and adaptive routing make underwater image enhancement more targeted and more general, offering a new approach to robust visual perception under complex degradation. Abstract: Underwater Image Enhancement (UIE) is essential for robust visual perception in marine applications. However, existing methods predominantly rely on uniform mapping tailored to average dataset distributions, leading to over-processing mildly degraded images or insufficient recovery for severe ones. To address this challenge, we propose a novel adaptive enhancement framework, SDAR-Net. Unlike existing uniform paradigms, it first decouples specific degradation styles from the input and subsequently modulates the enhancement process adaptively. Specifically, since underwater degradation primarily shifts the appearance while keeping the scene structure, SDAR-Net formulates image features into dynamic degradation style embeddings and static scene structural representations through a carefully designed training framework. Subsequently, we introduce an adaptive routing mechanism. By evaluating style features and adaptively predicting soft weights at different enhancement states, it guides the weighted fusion of the corresponding image representations, accurately satisfying the adaptive restoration demands of each image. Extensive experiments show that SDAR-Net achieves a new state-of-the-art (SOTA) performance with a PSNR of 25.72 dB on a real-world benchmark, and demonstrates its utility in downstream vision tasks. Our code is available at https://github.com/WHU-USI3DV/SDAR-Net.
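The routing idea described above (soft weights predicted from style features, then a weighted fusion of per-stage representations) can be sketched generically as a softmax-weighted sum. This is an illustration under stated assumptions, not SDAR-Net itself: the linear scorer `router_weights`, the feature shapes, and all function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(style_embedding, router_weights, stage_features):
    """Predict one soft weight per enhancement stage and fuse the stage features.

    router_weights: one weight vector per stage (a hypothetical linear scorer).
    stage_features: one feature vector per stage, all of the same length.
    """
    scores = [sum(w * s for w, s in zip(row, style_embedding))
              for row in router_weights]
    weights = softmax(scores)
    dim = len(stage_features[0])
    fused = [sum(weights[k] * stage_features[k][i] for k in range(len(weights)))
             for i in range(dim)]
    return weights, fused

# Hypothetical usage with two enhancement stages and 2-D features.
weights, fused = route([0.2, 0.8], [[1.0, 0.0], [0.0, 1.0]],
                       [[0.0, 2.0], [2.0, 4.0]])
```

Because the weights are soft and sum to one, a mildly degraded input can lean on lightly processed features while a severely degraded one shifts weight toward later stages, without a hard switch between them.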

[98] DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos

Yuan Huang,Sijie Zhao,Jing Cheng,Hao Xu,Shaohui Jiao

Main category: cs.CV

TL;DR: This paper proposes DreamStereo, an efficient method for stereo video inpainting built from three components: Gradient-Aware Parallax Warping (GAPW), Parallax-Based Dual Projection (PBDP), and Sparsity-Aware Stereo Inpainting (SASI). It cuts computation substantially while preserving inpainting quality, enabling real-time processing of HD video.

Details Motivation: Existing stereo video inpainting methods are held back by the scarcity of high-quality datasets, and they process every frame uniformly, which wastes computation and makes it hard to balance quality and efficiency. Method: A three-component framework: (1) Gradient-Aware Parallax Warping (GAPW) uses backward warping and the gradient of the coordinate mapping to improve edge continuity; (2) Parallax-Based Dual Projection (PBDP) builds on GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks; (3) Sparsity-Aware Stereo Inpainting (SASI) removes more than 70% of redundant computation through sparse token processing. Result: Real-time inpainting of 768×1280 HD stereo video at 25 FPS on a single A100 GPU, a 10.7x speedup in diffusion inference, and quality comparable to the full-computation model. Conclusion: The method addresses the two bottlenecks of stereo video inpainting, data scarcity and redundant computation, and provides an efficient, high-quality solution for practical use. Abstract: Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70% of redundant tokens, achieving a 10.7x speedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD (768 x 1280) videos at 25 FPS on a single A100 GPU.

[99] MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer

Dongkyung Kang,Jaeyeon Hwang,Junseo Park,Minji Kang,Yeryeong Lee,Beomseok Ko,Hanyoung Roh,Jeongmin Shin,Hyeryung Jang

Main category: cs.CV

TL;DR: This paper proposes MAST, a training-free multi-style transfer framework that explicitly controls content-style interactions inside the attention mechanism of a diffusion model, addressing the boundary artifacts and structural inconsistency that arise in multi-style transfer.

Details Motivation: Existing diffusion-based style transfer methods usually assume a single global style; when extended to multiple styles they tend to produce boundary artifacts, unstable stylization, and structural inconsistency. Method: MAST has four modules: (1) Layout-preserving Query Anchoring, which prevents collapse of the semantic structure; (2) Logit-level Attention Mass Allocation, which fuses multiple styles without artifacts; (3) Sharpness-aware Temperature Scaling, which restores the attention sharpness lost through multi-style expansion; (4) Discrepancy-aware Detail Injection, which compensates for lost high-frequency detail. Result: Experiments show that MAST mitigates boundary artifacts, keeps structure consistent, and maintains texture fidelity and spatial coherence as the number of applied styles grows. Conclusion: MAST is an efficient, training-free paradigm for multi-style transfer that markedly improves the quality and robustness of multi-style synthesis. Abstract: Style transfer aims to render a content image with the visual characteristics of a reference style while preserving its underlying semantic layout and structural geometry. While recent diffusion-based models demonstrate strong stylization capabilities by leveraging powerful generative priors and controllable internal representations, they typically assume a single global style. Extending them to multi-style scenarios often leads to boundary artifacts, unstable stylization, and structural inconsistency due to interference between multiple style representations. To overcome these limitations, we propose MAST (Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer), a novel training-free framework that explicitly controls content-style interactions within the diffusion attention mechanism. To achieve artifact-free and structure-preserving stylization, MAST integrates four connected modules. First, Layout-preserving Query Anchoring prevents global layout collapse by firmly anchoring the semantic structure using content queries. Second, Logit-level Attention Mass Allocation deterministically distributes attention probability mass across spatial regions, seamlessly fusing multiple styles without boundary artifacts. Third, Sharpness-aware Temperature Scaling restores the attention sharpness degraded by multi-style expansion. Finally, Discrepancy-aware Detail Injection adaptively compensates for localized high-frequency detail losses by measuring structural discrepancies. 
Extensive experiments demonstrate that MAST effectively mitigates boundary artifacts and maintains structural consistency, preserving texture fidelity and spatial coherence even as the number of applied styles increases.
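One way to read "deterministically distributes attention probability mass" at the logit level is as a per-group additive bias that forces each style's keys to receive a prescribed share of the softmax. The sketch below shows that generic construction for a single query; it is an assumption-laden illustration, not the paper's formula, and `groups`, `target_mass`, and the function names are hypothetical.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) for a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def allocate_attention_mass(logits, groups, target_mass):
    """Bias one query's logits so each key group gets a fixed softmax mass.

    logits: raw attention logits, one per key.
    groups: group id (e.g. a style index) for each key.
    target_mass: dict mapping group id to a positive mass; masses sum to 1
                 (in a mask-guided setting these could come from a region mask).
    """
    group_lse = {
        g: logsumexp([l for l, key_group in zip(logits, groups) if key_group == g])
        for g in target_mass
    }
    # After the shift, exp(biased) sums to target_mass[g] within each group.
    biased = [l + math.log(target_mass[g]) - group_lse[g]
              for l, g in zip(logits, groups)]
    total = logsumexp(biased)
    return [math.exp(b - total) for b in biased]

# Hypothetical usage: two styles, 70% / 30% of the attention mass.
probs = allocate_attention_mass([2.0, 0.0, 1.0, 3.0],
                                ["a", "a", "b", "b"],
                                {"a": 0.7, "b": 0.3})
```

The bias is constant within a group, so the relative attention among keys of the same style is unchanged; only the split of mass between styles is fixed.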

[100] LiveMoments: Reselected Key Photo Restoration in Live Photos via Reference-guided Diffusion

Clara Xue,Zizheng Yan,Zhenning Shi,Yuhang Yu,Jingyu Zhuang,Qi Zhang,Jinwei Chen,Qingnan Fan

Main category: cs.CV

TL;DR: This paper proposes LiveMoments, a reference-guided image restoration framework that improves the quality of reselected key frames in Live Photos. A two-branch network and a unified Motion Alignment module markedly improve perceptual quality and fidelity.

Details Motivation: Key frames that users reselect in a Live Photo come from the lower-quality video ISP pipeline and are visibly degraded, so dedicated restoration is needed to close the gap with the original high-quality key photo. Method: LiveMoments uses a two-branch neural network (a reference branch extracts structure and texture from the original high-quality key photo, and a main branch restores the reselected frame) and a unified Motion Alignment module that performs motion-guided spatial alignment at both the latent and image levels. Result: Experiments on real and synthetic Live Photo datasets show that LiveMoments clearly outperforms existing methods in perceptual quality and fidelity, especially in scenes with fast motion or complex structures. Conclusion: LiveMoments provides an efficient and robust reference-guided restoration solution for reselected Live Photo key frames, narrows the quality gap between video and photo, and releases its code to support follow-up research. Abstract: Live Photo captures both a high-quality key photo and a short video clip to preserve the precious dynamics around the captured moment. While users may choose alternative frames as the key photo to capture better expressions or timing, these frames often exhibit noticeable quality degradation, as the photo capture ISP pipeline delivers significantly higher image quality than the video pipeline. This quality gap highlights the need for dedicated restoration techniques to enhance the reselected key photo. To this end, we propose LiveMoments, a reference-guided image restoration framework tailored for the reselected key photo in Live Photos. Our method employs a two-branch neural network: a reference branch that extracts structural and textural information from the original high-quality key photo, and a main branch that restores the reselected frame using the guidance provided by the reference branch. Furthermore, we introduce a unified Motion Alignment module that incorporates motion guidance for spatial alignment at both the latent and image levels. Experiments on real and synthetic Live Photos demonstrate that LiveMoments significantly improves perceptual quality and fidelity over existing solutions, especially in scenes with fast motion or complex structures. Our code is available at https://github.com/OpenVeraTeam/LiveMoments.

[101] Boosting Robust AIGI Detection with LoRA-based Pairwise Training

Ruiyang Xia,Qi Zhang,Yaowen Xu,Zhaofan Zou,Hao Sun,Zhongjiang He,Xuelong Li

Main category: cs.CV

TL;DR: This paper proposes a LoRA-based Pairwise Training (LPT) strategy that makes AI-generated image (AIGI) detectors more robust to complex distortions. It combines targeted finetuning of a visual foundation model, simulation of the real distortion distribution, and pairwise training that decouples generalization from robustness optimization, and it placed third in the NTIRE "Robust AI-Generated Image Detection in the Wild" challenge.

Details Motivation: Existing AIGI detectors perform well on clean datasets, but their performance drops sharply under the unpredictable, complex distortions seen in real deployment, so robustness needs to improve. Method: The LoRA-based Pairwise Training (LPT) strategy: (1) LoRA finetuning of a visual foundation model; (2) simulation of distortions and size changes during training to match the real distribution; (3) pairwise training that decouples the generalization and robustness objectives. Result: The method placed third in the NTIRE Robust AI-Generated Image Detection in the Wild challenge, confirming its robustness under severe distortions. Conclusion: LPT improves the robustness of AIGI detectors in realistic, heavily distorted settings and offers a practical solution for "in the wild" deployment. Abstract: The proliferation of highly realistic AI-Generated Images (AIGI) has necessitated the development of practical detection methods. While current AIGI detectors perform admirably on clean datasets, their detection performance frequently decreases when deployed "in the wild", where images are subjected to unpredictable, complex distortions. To resolve this critical vulnerability, we propose a novel LoRA-based Pairwise Training (LPT) strategy designed specifically to achieve robust detection for AIGI under severe distortions. The core of our strategy involves the targeted finetuning of a visual foundation model, the deliberate simulation of data distribution during the training phase, and a unique pairwise training process. Specifically, we introduce distortion and size simulations to better fit the distribution from the validation and test sets. Based on the strong visual representation capability of the visual foundation model, we finetune the model to achieve AIGI detection. The pairwise training is utilized to improve the detection via decoupling the generalization and robustness optimization. Experiments show that our approach secured third place in the NTIRE Robust AI-Generated Image Detection in the Wild challenge.

[102] Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

Rong Wang,Ruyi Zha,Ziang Cheng,Jiayu Yang,Pulak Purkait,Hongdong Li

Main category: cs.CV

TL;DR: This paper proposes a method that uses shape priors from a 3D foundational generative model to improve the geometric realism and multi-view consistency of orbital videos generated from a single image, using two-scale latent feature guidance and a multi-scale 3D adapter for efficient, general video generation.

Details Motivation: Existing video generation methods based on pixel-wise attention lack sufficient geometric constraints for long-range extrapolation (such as rear-view synthesis), which leads to incoherent and implausible structure. Method: Use a 3D foundational generative model to extract a global structural vector and fine-grained latent image features projected from volumetric features, and design a multi-scale 3D adapter that injects these features into the video generation model through cross-attention, avoiding explicit mesh reconstruction. Result: The method clearly outperforms state-of-the-art approaches on multiple benchmarks in visual quality, shape realism, and multi-view consistency, and generalizes well to complex camera trajectories and in-the-wild images. Conclusion: Incorporating 3D shape priors compensates for the weak geometric modeling of 2D video generation, and the proposed two-scale feature guidance and adapter architecture offer a general, efficient paradigm for generating high-quality orbital videos from a single image. Abstract: We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens into the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agnostic fine-tuning process. 
Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.

[103] GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality

Zhiwei Zhang,Xingyuan Zeng,Xinkai Kong,Kunquan Zhang,Haoyuan Liang,Bohan Shi,Juepeng Zheng,Jianxi Huang,Yutong Lu,Haohuan Fu

Main category: cs.CV

TL;DR: This paper presents GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. It integrates high-resolution optical imagery, structured text descriptions, and digital elevation model (DEM) data, and introduces ETTerra, a multimodal baseline that confirms the complementary value of textual semantics and terrain geometry for the accuracy and structural consistency of terraced parcel extraction.

Details Motivation: Existing public benchmarks mainly cover regular, flat farmland and do not capture the stepped terrain, large elevation variation, irregular boundaries, and cross-regional heterogeneity of mountainous terraced parcels; no benchmark exists for complex terraced parcel extraction under a unified image-text-DEM setting. Method: Build the multimodal benchmark GTPBD-MM, which integrates optical imagery, text descriptions, and DEM data and supports three evaluation settings (Image-only, Image+Text, Image+Text+DEM); propose the Elevation and Text guided Terraced parcel network (ETTerra) as a multimodal baseline. Result: Experiments show that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent parcel delineation in complex terraced scenes. Conclusion: Multimodal fusion, especially of image, text, and DEM, is a key route to better terraced parcel extraction in mountainous regions, and GTPBD-MM provides the first systematic benchmark and baseline for the task. Abstract: Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.

[104] Cell Instance Segmentation via Multi-Task Image-to-Image Schrödinger Bridge

Hayato Inoue,Shota Harada,Shumpei Takezaki,Ryoma Bise

Main category: cs.CV

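TL;DR: This paper proposes a cell instance segmentation method based on a multi-task image-to-image Schrödinger Bridge framework. It formulates segmentation as distribution-based image generation, adds boundary-aware supervision and deterministic inference, and achieves competitive or superior results on PanNuke and MoNuSeg without SAM pre-training or post-processing.

Details Motivation: Existing cell instance segmentation pipelines typically combine deterministic prediction with post-processing and place few explicit constraints on the global structure of instance masks. Method: A multi-task image-to-image Schrödinger Bridge framework that formulates instance segmentation as a distribution-based image generation problem, with boundary-aware supervision through a reverse distance map and deterministic inference for stable predictions. Result: Competitive or superior performance on PanNuke without SAM pre-training or additional post-processing, and robustness with limited training data on MoNuSeg. Conclusion: Schrödinger Bridge-based image-to-image generation is an effective framework for cell instance segmentation. Abstract: Existing cell instance segmentation pipelines typically combine deterministic predictions with post-processing, which imposes limited explicit constraints on the global structure of instance masks. In this work, we propose a multi-task image-to-image Schrödinger Bridge framework that formulates instance segmentation as a distribution-based image-to-image generation problem. Boundary-aware supervision is integrated through a reverse distance map, and deterministic inference is employed to produce stable predictions. Experimental results on the PanNuke dataset demonstrate that the proposed method achieves competitive or superior performance without relying on SAM pre-training or additional post-processing. Additional results on the MoNuSeg dataset show robustness under limited training data. These findings indicate that Schrödinger Bridge-based image-to-image generation provides an effective framework for cell instance segmentation.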

[105] RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

Guoan Xu,Yang Xiao,Guangwei Gao,Dongchen Zhu,Wenjing Jia,Guo-Jun Qi

Main category: cs.CV

TL;DR: This paper proposes the Reliability-aware Self-Gated State Space Model (RSGMamba) for multimodal semantic segmentation. It explicitly models modality reliability and dynamically regulates cross-modal interaction, improving RGB-D and RGB-T segmentation performance.

Details Motivation: Existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which degrades features when a modality is noisy, misaligned, or missing; this paper revisits fusion from the perspective of modality reliability. Method: The RSGMamba framework, built around the Reliability-aware Self-Gated Mamba Block (RSGMB) and complemented by a lightweight Local Cross-Gated Modulation (LCGM) module, combines global modeling with refinement of fine-grained spatial detail. Result: State-of-the-art results on NYUDepth V2, SUN-RGBD, MFNet, and PST900, with mIoU of 58.8% / 54.0% and 61.1% / 88.9% respectively and only 48.6M parameters. Conclusion: Explicitly modeling modality reliability improves the robustness and accuracy of multimodal segmentation, and RSGMamba demonstrates the effectiveness of this idea. Abstract: Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state-of-the-art performance on both RGB-D and RGB-T semantic segmentation benchmarks, reaching 58.8% / 54.0% mIoU on NYUDepth V2 and SUN-RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.
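The results above are reported in mIoU (mean intersection over union). For readers unfamiliar with the metric, a minimal sketch of its usual definition over flattened label maps follows; the function name and the choice to skip classes absent from both prediction and ground truth are assumptions, and individual benchmarks may handle ignored labels differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over flattened integer label sequences."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatched pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Classes that never appear in either map are left out of the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)

# Hypothetical usage on a four-pixel, two-class example.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.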

[106] EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

Jianzhe Ma,Zhonghao Cao,Shangkui Chen,Yichen Xu,Wenxuan Wang,Qin Jin

Main category: cs.CV

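TL;DR: This paper introduces EgoEsportsQA, a video question-answering benchmark for virtual environments (esports in particular) that evaluates the perception and reasoning of video large language models in fast, rule-dense scenarios. Experiments show that existing models perform poorly on these tasks, with clear weaknesses in tactical reasoning and micro-operations.

Details Motivation: Existing Video-LLMs perform well on slow-paced, real-world egocentric videos, but their abilities in fast, information-dense virtual environments such as esports have not been systematically explored, and there is no rigorous benchmark for fast, rule-driven reasoning in virtual scenes. Method: Build the EgoEsportsQA benchmark by collecting 1,745 high-quality QA pairs from professional matches in 3 first-person shooter games through a scalable six-stage pipeline; define a two-dimensional decoupled taxonomy with a cognitive capability dimension (11 sub-tasks, from perception to reasoning) and an esports knowledge dimension (6 sub-tasks); run comprehensive evaluations and ablations on state-of-the-art Video-LLMs. Result: The best current Video-LLM reaches only 71.58% accuracy on EgoEsportsQA; models are stronger at basic visual perception than at deep tactical reasoning and better at macro-progression than at recognizing micro-operations; ablations confirm intrinsic weaknesses in current architectures. Conclusion: EgoEsportsQA exposes key capability gaps of Video-LLMs in virtual egocentric scenes, connects real-world and virtual egocentric domains, and offers direction and data for optimizing downstream esports applications and advancing egocentric video understanding. Abstract: While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model reaching only 71.58%. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. 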
Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.

[107] Self-Adversarial One Step Generation via Condition Shifting

Deyuan Liu,Peng Sun,Yansen Han,Zhenglin Cheng,Chuyan Chen,Tao Lin

Main category: cs.CV

TL;DR: This paper proposes APEX, a framework that extracts adversarial correction signals endogenously from a flow model through condition shifting. It enables one-step text-to-image generation without an external discriminator, improving generation quality and inference speed while keeping training stable and efficient.

Details Motivation: Existing one-step sampling methods face a three-way tradeoff among fidelity, inference speed, and training efficiency. Methods that rely on external discriminators improve performance but bring training instability, high GPU memory overhead, and slow convergence, while regression-based distillation tends to lose detail. Method: Building on a theoretical insight about flow models, condition shifting creates a shifted branch whose velocity field serves as an independent estimator of the generation distribution, yielding a GAN-aligned gradient that replaces the sample-dependent discriminator terms responsible for gradient vanishing; the design is discriminator-free and architecture-preserving, a plug-and-play framework that supports both full-parameter and LoRA tuning. Result: A 0.6B model surpasses FLUX-Schnell 12B in one-step generation quality; with LoRA tuning on Qwen-Image 20B, GenEval reaches 0.89 at NFE=1, above the original 50-step teacher (0.87), with a 15.33x inference speedup. Conclusion: APEX's endogenous adversarial correction removes a performance bottleneck of one-step generation and improves fidelity, speed, and trainability together, making it a scalable, parameter-efficient, and broadly compatible paradigm. Abstract: The push for efficient text to image synthesis has moved the field toward one step sampling, yet existing methods still face a three way tradeoff among fidelity, inference speed, and training efficiency. Approaches that rely on external discriminators can sharpen one step performance, but they often introduce training instability, high GPU memory overhead, and slow convergence, which complicates scaling and parameter efficient tuning. In contrast, regression based distillation and consistency objectives are easier to optimize, but they typically lose fine details when constrained to a single step. We present APEX, built on a key theoretical insight: adversarial correction signals can be extracted endogenously from a flow model through condition shifting. A transformation of the condition creates a shifted condition branch whose velocity field serves as an independent estimator of the model's current generation distribution, yielding a gradient that is provably GAN aligned, replacing the sample dependent discriminator terms that cause gradient vanishing. This discriminator free design is architecture preserving, making APEX a plug and play framework compatible with both full parameter and LoRA based tuning. Empirically, our 0.6B model surpasses FLUX-Schnell 12B (20$\times$ more parameters) in one step quality. With LoRA tuning on Qwen-Image 20B, APEX reaches a GenEval score of 0.89 at NFE=1 in 6 hours, surpassing the original 50-step teacher (0.87) and providing a 15.33$\times$ inference speedup. 
Code is available at https://github.com/LINs-lab/APEX.

[108] HyperLiDAR: Adaptive Post-Deployment LiDAR Segmentation via Hyperdimensional Computing

Ivannia Gomez Moreno,Yi Yao,Ye Tian,Xiaofan Yu,Flavio Ponzina,Michael Sullivan,Jingyi Zhang,Mingyu Yang,Hun Seok Kim,Tajana Rosing

Main category: cs.CV

TL;DR: This paper proposes HyperLiDAR, a lightweight framework for post-deployment adaptive LiDAR semantic segmentation based on Hyperdimensional Computing (HDC). Designed for edge devices, it makes adaptation much more efficient while maintaining performance.

Details Motivation: Environmental changes in the real world degrade model performance, and edge devices are too constrained in compute and energy to adapt large neural network models on-device after deployment. Method: HyperLiDAR, a lightweight framework based on Hyperdimensional Computing (HDC), with a buffer selection strategy that focuses learning on the most informative points to improve adaptation efficiency. Result: On two LiDAR segmentation benchmarks and two representative devices, HyperLiDAR matches or exceeds the adaptation performance of existing methods, with up to a 13.8x speedup in retraining. Conclusion: HyperLiDAR is the first lightweight, post-deployment adaptive LiDAR segmentation framework for edge devices; it balances efficiency and practicality and offers a feasible solution for applications such as autonomous driving. Abstract: LiDAR semantic segmentation plays a pivotal role in 3D scene understanding for edge applications such as autonomous driving. However, significant challenges remain for real-world deployments, particularly for on-device post-deployment adaptation. Real-world environments can shift as the system navigates through different locations, leading to substantial performance degradation without effective and timely model adaptation. Furthermore, edge systems operate under strict computational and energy constraints, making it infeasible to adapt conventional segmentation models (based on large neural networks) directly on-device. To address the above challenges, we introduce HyperLiDAR, the first lightweight, post-deployment LiDAR segmentation framework based on Hyperdimensional Computing (HDC). The design of HyperLiDAR fully leverages the fast learning and high efficiency of HDC, inspired by how the human brain processes information. To further improve the adaptation efficiency, we identify the high data volume per scan as a key bottleneck and introduce a buffer selection strategy that focuses learning on the most informative points. We conduct extensive evaluations on two state-of-the-art LiDAR segmentation benchmarks and two representative devices. Our results show that HyperLiDAR outperforms or achieves comparable adaptation performance to state-of-the-art segmentation methods, while achieving up to a 13.8x speedup in retraining.
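To give a feel for why Hyperdimensional Computing suits cheap on-device adaptation, here is a minimal sketch of a generic HDC classifier: features are encoded into bipolar hypervectors by a random projection, classes are stored as bundled prototypes, and adaptation is a simple additive correction on mistakes. This is a textbook-style illustration, not HyperLiDAR's design; the encoder, the update rule, and all names and dimensions are assumptions.

```python
import math
import random

def make_projection(in_dim, hd_dim, seed=0):
    """Random bipolar projection matrix (hd_dim rows of +/-1 entries)."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(in_dim)]
            for _ in range(hd_dim)]

def encode(features, projection):
    """Map a feature vector to a bipolar hypervector via sign(P x)."""
    return [1.0 if sum(p * x for p, x in zip(row, features)) >= 0 else -1.0
            for row in projection]

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0 else 0.0

class HDClassifier:
    def __init__(self, num_classes, hd_dim):
        self.prototypes = [[0.0] * hd_dim for _ in range(num_classes)]

    def fit_one(self, hv, label):
        """Bundle (element-wise add) a hypervector into its class prototype."""
        self.prototypes[label] = [p + h for p, h in zip(self.prototypes[label], hv)]

    def predict(self, hv):
        scores = [cosine(hv, proto) for proto in self.prototypes]
        return scores.index(max(scores))

    def adapt_one(self, hv, label):
        """On a mistake, reinforce the true class and weaken the predicted one."""
        guess = self.predict(hv)
        if guess == label:
            return False
        self.fit_one(hv, label)
        self.prototypes[guess] = [p - h for p, h in zip(self.prototypes[guess], hv)]
        return True

# Hypothetical usage with 3-D point features and two classes.
projection = make_projection(in_dim=3, hd_dim=256)
clf = HDClassifier(num_classes=2, hd_dim=256)
clf.fit_one(encode([1.0, 0.2, 0.0], projection), 0)
clf.fit_one(encode([-1.0, -0.2, 0.0], projection), 1)
```

Training and adaptation here involve only additions and sign tests, with no backpropagation, which is the property that makes this family of methods attractive under tight compute and energy budgets.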

[109] All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

Tanzila Rahman,Renjie Liao,Leonid Sigal

Main category: cs.CV

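TL;DR: This paper proposes a unified synthetic data generation pipeline for training multimodal large language models (MLLMs) on video understanding. It supports multiple task formats and adds a VQA-based fine-tuning strategy to strengthen visual reasoning. Experiments show that models trained mainly on synthetic data perform well on real data, offering a scalable, low-cost alternative for video understanding.

Details Motivation: Collecting and annotating real-world multimodal video data is costly and slow, and its diversity and coverage are limited, which makes it hard to train multi-task video understanding models. Method: A unified synthetic data generation framework that supports multiple task formats (such as counting, question answering, and segmentation), plus a fine-grained fine-tuning strategy based on visual question answering (VQA) that pushes the model toward deeper visual grounding and reasoning. Result: The approach is validated on video object counting, video visual question answering, and video object segmentation; models trained predominantly on synthetic data generalize well to real-world datasets and often outperform models trained conventionally on real annotations. Conclusion: A unified synthetic data pipeline is a feasible and efficient alternative to expensive real-world annotation for scalable multimodal video understanding. Abstract: Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach in three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.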

[110] Bridging the Micro--Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization

Xiaojie Liang,Zhimin Chen,Ziqi Sheng,Wei Lu

Main category: cs.CV

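TL;DR: This paper proposes FASA, a framework that fuses frequency-domain cues with semantic priors to localize both traditional image manipulations and diffusion-generated edits in a unified way, reaching state-of-the-art performance on multiple benchmarks.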

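Details Motivation: Existing methods rely on either low-level forensic cues or high-level semantic features alone, so they struggle to handle both traditional manipulations (with conspicuous forensic traces) and diffusion-generated edits (locally realistic) at once, leaving micro-level and macro-level features disconnected. Method: FASA comprises (1) an adaptive dual-band DCT module that extracts manipulation-sensitive frequency features; (2) patch-level contrastive alignment on frozen CLIP representations that learns manipulation-aware semantic priors; (3) a semantic-frequency side adapter for multi-scale feature interaction; and (4) a prototype-guided, frequency-gated mask decoder that combines semantic consistency with boundary-aware localization. Result: FASA reaches state-of-the-art localization performance on OpenSDI and multiple traditional manipulation datasets, generalizes strongly across generators and datasets, and stays robust under common image degradations. Conclusion: FASA bridges the gap between micro-level frequency cues and macro-level semantic information, providing an effective and robust localization framework for diverse manipulation types. Abstract: As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro-macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.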
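To make the idea of a dual-band DCT split concrete, here is a minimal 1D sketch in plain Python. It is not FASA's module: the abstract describes an adaptive split on image features, whereas this sketch uses a fixed, hypothetical `cutoff` on a short signal, and the function names (`dct2`, `idct2`, `split_bands`) are illustrative only.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi / n * (t + 0.5) * k)
                    for t, x in enumerate(signal))
        coeffs.append(scale * total)
    return coeffs


def idct2(coeffs):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for t in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi / n * (t + 0.5) * k)
        signal.append(total)
    return signal


def split_bands(signal, cutoff):
    """Split a signal into low- and high-frequency parts.

    `cutoff` is a fixed, hypothetical band boundary: coefficients with
    index < cutoff go to the low band, the rest to the high band.
    """
    coeffs = dct2(signal)
    low = [c if k < cutoff else 0.0 for k, c in enumerate(coeffs)]
    high = [c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)]
    return idct2(low), idct2(high)
```

Because the transform is linear, the two band signals sum back to the input, so a downstream model can weight the bands differently without losing information.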

[111] Detecting Precise Hand Touch Moments in Egocentric Video

Huy Anh Nguyen,Feras Dayoub,Minh Hoai

Main category: cs.CV

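TL;DR: This paper proposes the HiCE module and the TouchMoment dataset for detecting, at frame level, the precise moment a hand touches an object in egocentric video. Cross-attention between hand regions and their context, together with a grasp-aware loss, improves detection accuracy.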

Details Motivation: Detecting hand-object contact moments in egocentric video is hard: subtle motion changes, frequent occlusions, fine-grained manipulation patterns, and the dynamics of the first-person viewpoint all make the required frame-level precision difficult to reach. Method: The Hand-informed Context Enhanced (HiCE) module uses spatiotemporal features from hand regions and their surrounding context and models contact patterns through cross-attention. A grasp-aware loss with soft labels emphasizes hand pose and movement dynamics, and a large-scale egocentric dataset, TouchMoment, is introduced. Result: Under a strict criterion (a prediction counts only if it falls within two frames of the ground truth), HiCE improves average precision on TouchMoment by 16.91% over state-of-the-art baselines. Conclusion: Combining hand priors with context modeling, a dedicated dataset, and a tailored loss substantially improves frame-level contact detection in egocentric video, supporting applications such as AR and human-computer interaction. Abstract: We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced 'high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft label that emphasizes hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision.
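The two-frame tolerance criterion can be sketched as follows. This is an illustration under assumptions, not the paper's evaluation code: it assumes greedy one-to-one matching in descending confidence order and a non-interpolated average precision, and the function names are hypothetical.

```python
def match_touch_moments(predictions, ground_truth, tolerance=2):
    """Greedy one-to-one matching of predicted contact frames to annotations.

    predictions: list of (frame_index, confidence) tuples.
    ground_truth: list of annotated contact frame indices.
    Returns (confidence, is_true_positive) pairs, highest confidence first.
    """
    unmatched = set(ground_truth)
    results = []
    for frame, conf in sorted(predictions, key=lambda p: -p[1]):
        # Still-unmatched annotations inside the tolerance window.
        candidates = [g for g in unmatched if abs(g - frame) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda g: abs(g - frame))
            unmatched.remove(best)
            results.append((conf, True))
        else:
            results.append((conf, False))
    return results


def average_precision(results, num_ground_truth):
    """Non-interpolated AP: precision at each true positive, summed and
    divided by the number of annotated moments."""
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(results, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0
```

With a tolerance this tight, a prediction three frames early counts as a false positive, which is why the criterion rewards temporal precision and not just event recall.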

[112] Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization

Zanyi Wang,Fan Li,Dengyang Jiang,Liuzhuozheng Li,Yunhua Zhong,Guang Dai,Mengmeng Wang

Main category: cs.CV

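TL;DR: This paper proposes ST-GD, a framework that freezes a pre-trained 2D vision-language model (e.g., Grounding DINO) and injects lightweight adapters plus a novel temporal decoder to achieve efficient spatio-temporal video grounding, performing well in small-data settings.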

Details Motivation: Fully fine-tuned methods depend on large amounts of annotated data, yet dense frame-level bounding boxes and complex temporal language alignments are extremely expensive to annotate for STVG. The result is scarce data, overfitting in conventional models, and zero-shot models that lack temporal awareness. Method: ST-GD freezes the pre-trained 2D vision-language backbone, injects lightweight adapters (about 10M trainable parameters) to add spatio-temporal awareness, and adds a novel temporal decoder for boundary prediction. Result: ST-GD achieves highly competitive performance on the small-scale HC-STVG v1/v2 benchmarks and maintains strong generalization on VidSTG. Conclusion: ST-GD is an effective paradigm for complex video understanding under strict small-data constraints. Abstract: Spatio-temporal video grounding (STVG) aims to localize queried objects within dynamic video segments. Prevailing fully-trained approaches are notoriously data-hungry. However, gathering large-scale STVG data is exceptionally challenging: dense frame-level bounding boxes and complex temporal language alignments are prohibitively expensive to annotate, especially for specialized video domains. Consequently, conventional models suffer from severe overfitting on these inherently limited datasets, while zero-shot foundational models lack the task-specific temporal awareness needed for precise localization. To resolve this small-data challenge, we introduce ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models (e.g., Grounding DINO) to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically injects lightweight adapters (~10M trainable parameters) to instill spatio-temporal awareness, alongside a novel temporal decoder for boundary prediction. This design naturally counters data scarcity. Consequently, ST-GD excels in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks, while maintaining robust generalization on the VidSTG dataset. This validates ST-GD as a powerful paradigm for complex video understanding under strict small-data constraints.

[113] Fundus Image-based Glaucoma Screening via Retinal Knowledge-Oriented Dynamic Multi-Level Feature Integration

Yuzhuo Zhou,Chi Liu,Sheng Shen,Zongyuan Ge,Fengshi Jing,Shiran Zhang,Yu Jiang,Anli Wang,Wenjian Liu,Feilong Yang,Tianqing Zhu,Xiaotong Han

Main category: cs.CV

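TL;DR: This paper proposes a glaucoma screening framework oriented around retinal anatomical knowledge. It combines dynamic multi-scale feature learning with domain priors, using a tri-branch structure and a dynamic window mechanism to improve cross-dataset robustness, and reaches 98.5% AUC on AIROGS.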

Details Motivation: Existing deep learning models do not explicitly model retinal anatomical knowledge, and fixed-region feature extraction cannot cover all pathological cues, so generalization across heterogeneous clinical data is poor. Method: A tri-branch structure captures global retinal context, optic disc/cup structural features, and dynamically localized pathological regions. A Dynamic Window Mechanism adaptively locates diagnostically relevant regions, and a Knowledge-Enhanced Convolutional Attention Module incorporates retinal priors extracted from a pre-trained foundation model. Result: On the large-scale AIROGS dataset the method reaches an AUC of 98.5% and an accuracy of 94.6%; evaluations on the multi-center SMDG-19 benchmark confirm strong cross-domain generalization. Conclusion: Combining anatomy-guided attention with adaptive lesion localization can significantly improve the robustness and generalization of automated glaucoma screening systems. Abstract: Automated diagnosis based on color fundus photography is essential for large-scale glaucoma screening. However, existing deep learning models are typically data-driven and lack explicit integration of retinal anatomical knowledge, which limits their robustness across heterogeneous clinical datasets. Moreover, pathological cues in fundus images may appear beyond predefined anatomical regions, making fixed-region feature extraction insufficient for reliable diagnosis. To address these challenges, we propose a retinal knowledge-oriented glaucoma screening framework that integrates dynamic multi-scale feature learning with domain-specific retinal priors. The framework adopts a tri-branch structure to capture complementary retinal representations, including global retinal context, structural features of the optic disc/cup, and dynamically localized pathological regions. A Dynamic Window Mechanism is devised to adaptively identify diagnostically informative regions, while a Knowledge-Enhanced Convolutional Attention Module incorporates retinal priors extracted from a pre-trained foundation model to guide attention learning. Extensive experiments on the large-scale AIROGS dataset demonstrate that the proposed method outperforms diverse baselines, achieving an AUC of 98.5% and an accuracy of 94.6%. Additional evaluations on multiple datasets from the SMDG-19 benchmark further confirm its strong cross-domain generalization capability, indicating that knowledge-guided attention combined with adaptive lesion localization can significantly improve the robustness of automated glaucoma screening systems.

[114] Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection

Haifeng Zhang,Qinghui He,Xiuli Bi,Bo Liu,Chi-Man Pun,Bin Xiao

Main category: cs.CV

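TL;DR: This paper proposes a Multi-dimensional Adversarial Feature Learning (MAFL) framework that uses adversarial training to suppress generation-pattern and content biases, improving cross-model generalization of generated-image detection while keeping high accuracy with very few training samples.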

Details Motivation: Existing generated-image detectors are vulnerable to data bias and tend to fit specific generation patterns or content instead of the common features shared across generative models (the asymmetric bias learning problem). Method: A pretrained multimodal image encoder serves as the backbone for a real-fake feature learning network, and an adversarial bias-learning branch with a multi-dimensional adversarial loss sets authenticity-discriminative feature learning against bias feature learning. Result: On public datasets the method improves accuracy by 10.89% and Average Precision (AP) by 8.57% over the previous state of the art; trained on only 320 images, it still exceeds 80% detection accuracy on public datasets. Conclusion: MAFL steers the model toward generative features shared across generative models, markedly improving cross-model generalization and reducing dependence on large-scale training data. Abstract: In recent years, the rapid development of generative artificial intelligence technology has significantly lowered the barrier to creating high-quality fake images, posing a serious challenge to information authenticity and credibility. Existing generated image detection methods typically enhance generalization through model architecture or network design. However, their generalization performance remains susceptible to data bias, as the training data may drive models to fit specific generative patterns and content rather than the common features shared by images from different generative models (asymmetric bias learning). To address this issue, we propose a Multi-dimensional Adversarial Feature Learning (MAFL) framework. The framework adopts a pretrained multimodal image encoder as the feature extraction backbone, constructs a real-fake feature learning network, and designs an adversarial bias-learning branch equipped with a multi-dimensional adversarial loss, forming an adversarial training mechanism between authenticity-discriminative feature learning and bias feature learning. By suppressing generation-pattern and content biases, MAFL guides the model to focus on the generative features shared across different generative models, thereby effectively capturing the fundamental differences between real and generated images, enhancing cross-model generalization, and substantially reducing the reliance on large-scale training data. Through extensive experimental validation, our method outperforms existing state-of-the-art approaches by 10.89% in accuracy and 8.57% in Average Precision (AP). Notably, even when trained with only 320 images, it can still achieve over 80% detection accuracy on public datasets.

[115] OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion

Dongjian Yu,Weiqing Min,Qian Jiang,Xing Lin,Xin Jin,Shuqiang Jiang

Main category: cs.CV

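TL;DR: This paper introduces OmniFood8K, a real multimodal food dataset, and NutritionSynth-115K, a synthetic dataset, and builds an end-to-end framework that predicts food nutrition from a single RGB image. The framework comprises depth map estimation, frequency-domain feature fusion, and a mask-based prediction head.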

Details Motivation: Existing food nutrition estimation is limited by datasets dominated by Western cuisines with insufficient coverage of Chinese dishes, and by reliance on depth sensors that restricts use in everyday scenarios. Method: An end-to-end framework that (1) predicts a depth map from a single RGB image and refines it with the SSRA adapter; (2) aligns and fuses RGB and depth features in the frequency domain through the FAFM module; and (3) uses the MPH mask-based prediction head for dynamic channel selection that focuses on key ingredient regions. Result: Experiments on multiple datasets show the method outperforms existing nutrition prediction approaches; OmniFood8K (8,036 samples) and NutritionSynth-115K (115K synthetic samples) give the community new benchmarks. Conclusion: The work advances fine-grained food nutrition estimation without depth sensors, improves nutritional analysis of Chinese dishes in particular, and offers practical support for personalized diet management. Abstract: Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets primarily focus on Western cuisines and lack sufficient coverage of Chinese dishes, which restricts accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. In addition, to enhance models' capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework for nutritional prediction from a single RGB image. First, we predict a depth map from a single RGB image and design the Scale-Shift Residual Adapter (SSRA) to refine it for global scale consistency and local structural preservation. Second, we propose the Frequency-Aligned Fusion Module (FAFM) to hierarchically align and fuse RGB and depth features in the frequency domain. Finally, we design a Mask-based Prediction Head (MPH) to emphasize key ingredient regions via dynamic channel selection for more accurate prediction. Extensive experiments on multiple datasets demonstrate the superiority of our method over existing approaches. Project homepage: https://yudongjian.github.io/OmniFood8K-food/

[116] Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

Jiwan Kim,Kibum Kim,Wonjoong Kim,Byung-Kwan Lee,Chanyoung Park

Main category: cs.CV

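TL;DR: This paper proposes DSTP, a training-free, decoding-stage-aware visual token pruning framework that addresses the poor generalization of existing pruning methods on complex visual reasoning tasks. By handling Relevant Visual Information Shift (RVIS) during decoding, it improves performance substantially while remaining general and efficient.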

Details Motivation: Existing visual token pruning methods work well on simple visual understanding but generalize poorly to complex visual reasoning, a critical gap that has been underexplored. Method: Decoding-stage Shift-aware Token Pruning (DSTP) is a training-free, plug-and-play framework that dynamically adjusts visual tokens to match the changing reasoning requirements during decoding, thereby mitigating RVIS. Result: DSTP significantly reduces the performance drop of pruning methods on complex reasoning tasks and yields consistent gains on visual understanding benchmarks; its generality and low computational overhead are validated across several state-of-the-art multimodal architectures. Conclusion: DSTP is an efficient, general, training-free add-on for pruning that addresses the generalization bottleneck of visual token pruning on complex reasoning tasks. Abstract: Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.

[117] Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

Ravikumar Balakrishnan,Sanket Mendapara,Ankit Garg

Main category: cs.CV

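TL;DR: This paper studies typographic prompt injection attacks on vision-language models (VLMs). Font size, visual transformations, and model differences all significantly affect attack success rate (ASR), and text-image embedding distance shows a strong negative correlation with ASR, giving empirical grounds for choosing robust VLMs in adversarial settings.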

Details Motivation: As VLMs become the perceptual backbone of autonomous agents (browser automation, computer-use systems, embodied agents), typographic prompt injection, in which adversarial text is rendered as an image, is a growing security threat. The real attack surface is heterogeneous (varying font sizes and visual conditions) and VLM robustness differs widely, so systematic evaluation is needed. Method: A large-scale empirical evaluation on 1,000 SALAD-Bench prompts across four VLMs (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct), with font sizes from 6 to 28px and visual transformations including rotation, blur, noise, and contrast changes, plus an analysis of the correlation between text-image embedding distance and ASR. Result: (1) Font size significantly affects ASR, which is near zero at 6px and peaks at mid-range sizes; (2) text attacks are far more effective than image attacks on GPT-4o and Claude, while Qwen3-VL and Mistral show no notable difference; (3) text-image embedding distance from JinaCLIP and Qwen3-VL-Embedding correlates strongly and negatively with ASR (r = -0.71 to -0.93); (4) heavy degradations raise embedding distance by 10-12% and lower ASR by 34-96%, and rotation affects models asymmetrically (Mistral drops 50%, GPT-4o is unchanged). Conclusion: VLMs show highly model-specific robustness patterns, so no single defense fits all; embedding distance tracks ASR closely; practitioners should choose VLM backbones according to the adversarial characteristics of the deployment environment. Abstract: We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6-28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10-12% and reduce ASR by 34-96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
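Finding (3) reports a correlation coefficient between embedding distance and ASR. As a minimal sketch of how such a coefficient is computed, assuming it is a Pearson r (the abstract does not name the statistic), the function below takes one embedding distance and one ASR value per test condition. The name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences.

    In the setting above, xs would hold text-image embedding distances
    and ys the attack success rates, one pair per condition.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near -1 means that conditions where the rendered text sits farther from its source text in embedding space are also the conditions where the attack succeeds less often.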

[118] Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection

Hao Wang,Jiqing Zhang,Xin Yang,Baocai Yin,Lu Jiang,Zetian Mi,Huibing Wang

Main category: cs.CV

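TL;DR: This paper proposes a modality-agnostic multi-modal prompting framework for camouflaged object detection (COD). It generates modality-agnostic prompts for the Segment Anything Model (SAM) and adds a lightweight mask refinement module, improving performance and generalization on RGB-Depth, RGB-Thermal, and RGB-Polarization COD tasks.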

Details Motivation: Most existing COD methods rely on modality-specific architectures or customized fusion strategies, which limits scalability and cross-modal generalization; a parameter-efficient, general approach that adapts to arbitrary auxiliary modalities is needed. Method: A modality-agnostic prompt generation framework models multi-modal learning as an interaction between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts fed to the SAM decoder. A lightweight Mask Refine Module then calibrates coarse predictions using fine-grained prompt cues. Result: Experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization COD benchmarks validate the method's effectiveness and generalization, with notable gains in boundary localization. Conclusion: The modality-agnostic prompting mechanism gives SAM efficient, flexible, and scalable cross-modal adaptation, moving COD toward more robust and general multi-source perception. Abstract: Camouflaged Object Detection (COD) aims to segment objects that blend seamlessly into complex backgrounds, with growing interest in exploiting additional visual modalities to enhance robustness through complementary information. However, most existing approaches generally rely on modality-specific architectures or customized fusion strategies, which limit scalability and cross-modal generalization. To address this, we propose a novel framework that generates modality-agnostic multi-modal prompts for the Segment Anything Model (SAM), enabling parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improving overall performance on COD tasks. Specifically, we model multi-modal learning through interactions between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts for SAM decoding. We further introduce a lightweight Mask Refine Module to calibrate coarse predictions by incorporating fine-grained prompt cues, leading to more accurate camouflaged object boundaries. Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the effectiveness and generalization of our modality-agnostic framework.

[119] Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

Jiawei Fan,Shigeng Wang,Chao Li,Xiaolong Liu,Anbang Yao

Main category: cs.CV

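TL;DR: This paper proposes Chain-of-Models Pre-Training (CoM-PT), a performance-lossless pre-training acceleration method for families of vision foundation models. It arranges the models in a chain by ascending size, applies standard pre-training only to the smallest, and trains the rest through inverse knowledge transfer that jointly reuses parameter-space and feature-space knowledge, substantially reducing training cost.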

Details Motivation: Existing acceleration methods mostly optimize a single model, whereas CoM-PT aims to accelerate pre-training at the level of the model family and to stay efficient as the family grows. Method: Build a model chain ordered by ascending model size; only the smallest model undergoes standard pre-training, and each remaining model is trained efficiently through sequential inverse knowledge transfer from its smaller predecessor, jointly reusing knowledge in the parameter space and the feature space. Result: Across 45 zero-shot and fine-tuning datasets, model performance is mostly better than with independent training while training cost drops sharply; efficiency increases as the family grows (on CC3M, prepending smaller models to a chain ending in ViT-L reduces computational complexity by up to 72%, and as the family scales across 3, 4, and 7 models the acceleration ratio rises from 4.13X to 5.68X and 7.09X). Conclusion: CoM-PT is a paradigm-agnostic, performance-lossless, scalable pre-training acceleration method, and the code is open-sourced to support more compute-intensive scenarios such as large language model pre-training. Abstract: In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.

[120] Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning

Jungwon Choi,Eunwoo Kim

Main category: cs.CV

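TL;DR: This paper proposes a dual-modality anchor-guided framework for test-time prompt tuning (TPT). A text anchor (fine-grained semantics) and an adaptive image anchor (evolving statistics) improve view filtering and supervision, markedly improving robustness and performance under distribution shift.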

Details Motivation: Standard entropy-based view filtering depends on model confidence, which is poorly calibrated under distribution shift: it assigns high confidence to irrelevant regions and ignores semantic content, limiting TPT. Method: A dual-modality anchor-guided framework that (1) builds a text anchor from attribute-rich descriptions to provide fine-grained class semantics; (2) builds an adaptive image anchor that tracks test-time statistics; (3) filters views by anchor alignment and confidence; and (4) treats the anchors as auxiliary prediction heads, combining them with the original output in a confidence-weighted ensemble to produce a stable supervision signal for prompt updates. Result: New state-of-the-art performance on 15 benchmark datasets, validating anchor-guided supervision for robust prompt updates. Conclusion: Anchor-guided dual-modality supervision mitigates confidence miscalibration under distribution shift and improves the semantic awareness and robustness of TPT. Abstract: Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions, to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.
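Two ingredients named in the abstract can be sketched in a few lines: the standard entropy-based view filter that the paper critiques, and a generic confidence-weighted ensemble of prediction heads. This is an illustration under assumptions, not the paper's method: the anchor alignment step is not implemented, the weighting by each head's maximum probability is a guess at one plausible scheme, and all function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def filter_views_by_entropy(view_probs, keep_ratio=0.5):
    """Baseline filter: keep the indices of the lowest-entropy views.

    view_probs: one class-probability vector per augmented view.
    """
    keep = max(1, int(len(view_probs) * keep_ratio))
    ranked = sorted(range(len(view_probs)),
                    key=lambda i: entropy(view_probs[i]))
    return sorted(ranked[:keep])


def confidence_weighted_ensemble(head_probs):
    """Average several heads' predictions, weighting each head by its own
    maximum class probability (an assumed confidence measure)."""
    weights = [max(p) for p in head_probs]
    total = sum(weights)
    num_classes = len(head_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, head_probs)) / total
        for c in range(num_classes)
    ]
```

The entropy filter trusts whatever the model is confident about, which is exactly the weakness the abstract describes: a confidently wrong background crop passes the filter. Adding anchor heads to the ensemble gives the supervision signal a second opinion that does not depend on that confidence alone.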

[121] DeferredSeg: A Multi-Expert Deferral Framework for Trustworthy Medical Image Segmentation

Qiuyu Tian,Haoliang Sun,Yunshan Wang,Yinghuan Shi,Yilong Yin

Main category: cs.CV

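TL;DR: This paper proposes DeferredSeg, a deferral-aware framework for medical image segmentation. A pixel-level dynamic routing mechanism decides whether predictions in uncertain regions should be handed to human experts, and a collaboration loss, a spatial-coherence loss, and a load-balancing penalty improve trustworthiness and robustness.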

Details Motivation: Deep segmentation models are often overconfident or underconfident on medical images, so their confidence scores are unreliable, especially in ambiguous regions, which undermines trust in clinical deployment. Inspired by the learning-to-defer (L2D) paradigm, the work builds a human-AI collaboration system to make decisions more reliable. Method: DeferredSeg (1) extends the base segmentor with an aggregated deferral predictor and dynamic pixel routing channels; (2) supervises deferral decisions with a pixel-wise surrogate collaboration loss; (3) adds a spatial-coherence loss that keeps deferral masks smooth; and (4) extends to a multi-expert setting with multiple discrepancy experts and a load-balancing penalty that evens out workload. Result: On three medical datasets, with MedSAM and CENet as base segmentors, DeferredSeg consistently outperforms the baseline; the framework is model-agnostic and can be applied to other segmentation architectures. Conclusion: DeferredSeg improves the trustworthiness and practicality of dense medical image segmentation and offers a transferable path toward human-AI collaborative clinical systems. Abstract: Segmentation models based on deep neural networks demonstrate strong generalization for medical image segmentation. However, they often exhibit overconfidence or underconfidence, leading to unreliable confidence scores for segmentation masks, especially in ambiguous regions. This undermines the trustworthiness required for clinical deployment. Motivated by the learning-to-defer (L2D) paradigm, we introduce DeferredSeg, a deferral-aware segmentation framework, i.e., a Human-AI collaboration system that determines whether to defer predictions to human experts in specific regions. DeferredSeg extends the base segmentor with an aggregated deferral predictor and additional routing channels that dynamically route each pixel to either the base segmentor or a human expert. To train this routing efficiently, we introduce a pixel-wise surrogate collaboration loss that supervises deferral decisions. In addition, to preserve spatial coherence within deferral regions, we propose a spatial-coherence loss that enforces smooth deferral masks, thereby enhancing reliability. Beyond single-expert deferral, we further extend the framework to a multi-expert setting by introducing multiple discrepancy experts for collaborative decision-making. To prevent overloading or underutilizing individual experts, we further design a load-balancing penalty that evenly distributes workload across expert branches. We evaluate DeferredSeg on three challenging medical datasets using MedSAM and CENet as the base segmentor for fair comparison. Experimental results show that DeferredSeg consistently outperforms the baseline, demonstrating its effectiveness for trustworthy dense medical segmentation. Moreover, the proposed framework is model-agnostic and can be readily applied to other segmentation architectures.
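The pixel routing and load-balancing ideas can be illustrated with a small, non-differentiable sketch. It is only a toy under assumptions: the paper trains these components with surrogate losses, while this version routes by a plain argmax and measures imbalance as squared deviation from a uniform share. The function names and the convention that index 0 denotes the base segmentor are hypothetical.

```python
def route_pixels(routing_scores):
    """Pick a destination per pixel from its routing scores.

    routing_scores: one score list per pixel; index 0 is the base
    segmentor and indices 1..K are expert branches (assumed convention).
    """
    return [max(range(len(scores)), key=lambda k: scores[k])
            for scores in routing_scores]


def load_balance_penalty(routes, num_experts):
    """Squared deviation of each expert's share of deferred pixels
    from the uniform share 1 / num_experts."""
    deferred = [r for r in routes if r > 0]
    if not deferred:
        return 0.0
    uniform = 1.0 / num_experts
    penalty = 0.0
    for expert in range(1, num_experts + 1):
        share = deferred.count(expert) / len(deferred)
        penalty += (share - uniform) ** 2
    return penalty
```

A penalty of zero means every expert receives the same number of deferred pixels; it grows as the router starts sending most uncertain regions to one expert.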

[122] A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs

Mohammed Asad,Mohit Bajpai,Sudhir Singh,Rahul Katarya

Main category: cs.CV

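TL;DR: This paper proposes a hybrid architecture combining EfficientNetV2-M with Vision Mamba for benign-malignant binary classification of suspicious lesions in mammography, achieving efficient and accurate lesion-level classification on ROIs from the CBIS-DDSM dataset.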

Details Motivation: CNNs struggle to model long-range dependencies, and ViTs address this at high computational cost, so a design is needed that combines local feature extraction with efficient global modeling. Method: EfficientNetV2-M extracts local visual features and Vision Mamba, a linear-complexity state space model, models global context, forming an end-to-end ROI-level binary classifier. Result: The model achieves strong lesion-level classification performance on abnormality-centered ROIs from CBIS-DDSM while balancing accuracy and computational efficiency. Conclusion: A hybrid CNN and SSM architecture is an efficient and viable option for mammography lesion classification. Abstract: Accurate characterization of suspicious breast lesions in mammography is important for early diagnosis and treatment planning. While Convolutional Neural Networks (CNNs) are effective at extracting local visual patterns, they are less suited to modeling long-range dependencies. Vision Transformers (ViTs) address this limitation through self-attention, but their quadratic computational cost can be prohibitive. This paper presents a hybrid architecture that combines EfficientNetV2-M for local feature extraction with Vision Mamba, a State Space Model (SSM), for efficient global context modeling. The proposed model performs binary classification of abnormality-centered mammography regions of interest (ROIs) from the CBIS-DDSM dataset into benign and malignant classes. By combining a strong CNN backbone with a linear-complexity sequence model, the approach achieves strong lesion-level classification performance in an ROI-based setting.

[123] IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation

Haoyu Zheng,Tianwei Lin,Wei Wang,Zhuonan Wang,Wenqiao Zhang,Jiaqi Zhu,Feifei Shao

Main category: cs.CV

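TL;DR: This paper proposes IAD-Unify, a unified framework that integrates defect localization, natural-language explanation, and controllable editing. It uses a dual-encoder design (a DINOv2 region expert plus a Qwen3.5-4B multimodal backbone) coupled through lightweight token injection, builds the Anomaly-56K multi-task evaluation platform, and verifies the key role of region grounding and the framework's cross-category generalization.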

Details Motivation: Real-world industrial inspection needs defect localization, natural-language explanation, and controllable editing together, but existing methods lack a unified framework and evaluation protocol. Method: IAD-Unify is a dual-encoder framework in which a frozen DINOv2 region expert supplies anomaly evidence to a shared Qwen3.5-4B vision-language backbone via lightweight token injection, jointly supporting anomaly segmentation, region-grounded understanding, and mask-guided generation. The Anomaly-56K evaluation platform covers 59,916 images, 24 categories, and 104 defect variants. Result: Ablations show that (i) removing region grounding lowers location accuracy by more than 76 percentage points; (ii) predicted-region performance is close to oracle; (iii) region-grounded generation is best in full-image fidelity and masked-region perceptual quality; and (iv) pre-initialized joint training improves understanding at negligible generation cost (-0.16 dB). The model also performs strongly on the MMAD benchmark, including unseen categories. Conclusion: IAD-Unify brings localization, explanation, and editing for industrial anomaly detection into one framework, and the new evaluation platform and mechanism analysis confirm the central role of region grounding and strong generalization. Abstract: Real-world industrial inspection requires not only localizing defects, but also explaining them in natural language and generating controlled defect edits. However, existing approaches fail to jointly support all three capabilities within a unified framework and evaluation protocol. We propose IAD-Unify, a dual-encoder unified framework in which a frozen DINOv2-based region expert supplies precise anomaly evidence to a shared Qwen3.5-4B vision-language backbone via lightweight token injection, jointly enabling anomaly segmentation, region-grounded understanding, and mask-guided generation. To enable unified evaluation, we further construct Anomaly-56K, a comprehensive unified multi-task IAD evaluation platform, spanning 59,916 images across 24 categories and 104 defect variants. Controlled ablations yield four findings: (i) region grounding is the decisive mechanism for understanding, removing it degrades location accuracy by >76 pp; (ii) predicted-region performance closely matches oracle, confirming deployment viability; (iii) region-grounded generation achieves the best full-image fidelity and masked-region perceptual quality; and (iv) pre-initialized joint training improves understanding at negligible generation cost (-0.16 dB). IAD-Unify further achieves strong performance on the MMAD benchmark, including categories unseen during training, demonstrating robust cross-category generalization.

[124] DiffusionPrint: Learning Generative Fingerprints for Diffusion-Based Inpainting Localization

Paschalis Giakoumoglou,Symeon Papadopoulos

Main category: cs.CV

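TL;DR: This paper proposes DiffusionPrint, a forgery localization method for diffusion-based image inpainting. It uses contrastive learning to extract robust generative fingerprint features and substantially improves the localization performance of several fusion-based IFL frameworks.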

Details Motivation: Modern diffusion inpainting models disrupt the camera-level noise patterns that traditional forensic methods rely on, causing existing image forgery localization methods to fail. Method: DiffusionPrint is a patch-level contrastive learning framework that exploits the consistent generative fingerprint shared by regions inpainted by the same model. It trains a convolutional backbone with a MoCo-style objective, cross-category hard negative mining, and a generator-aware classification head, producing a discriminative forensic feature map usable in fusion frameworks. Result: The method is effective across multiple generative models, improves localization by up to +28% on mask types unseen during fine-tuning, and generalizes to unseen generative architectures. Conclusion: DiffusionPrint offers a robust, generalizable solution for localizing diffusion-based inpainting forgeries and integrates into mainstream fusion-based IFL frameworks. Abstract: Modern diffusion-based inpainting models pose significant challenges for image forgery localization (IFL), as their full regeneration pipelines reconstruct the entire image via a latent decoder, disrupting the camera-level noise patterns that existing forensic methods rely on. We propose DiffusionPrint, a patch-level contrastive learning framework that learns a forensic signal robust to the spectral distortions introduced by latent decoding. It exploits the fact that inpainted regions generated by the same model share a consistent generative fingerprint, using this as a self-supervisory signal. DiffusionPrint trains a convolutional backbone via a MoCo-style objective with cross-category hard negative mining and a generator-aware classification head, producing a forensic feature map that serves as a highly discriminative secondary modality in fusion-based IFL frameworks. Integrated into TruFor, MMFusion, and a lightweight fusion baseline, DiffusionPrint consistently improves localization across multiple generative models, with gains of up to +28% on mask types unseen during fine-tuning and confirmed generalization to unseen generative architectures. Code is available at https://github.com/mever-team/diffusionprint
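MoCo-style objectives are built on the InfoNCE contrastive loss, which a short sketch can make concrete. This is a generic illustration and not DiffusionPrint's training code: the momentum encoder, key queue, hard negative mining, and generator-aware head are all omitted, the temperature value is a common default and not taken from the paper, and the function names are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query patch embedding.

    The positive key would come from a patch sharing the query's
    generative fingerprint; negatives come from other sources.
    """
    q = _normalize(query)
    logits = [_dot(q, _normalize(positive_key)) / temperature]
    logits += [_dot(q, _normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Cross-entropy with the positive at index 0.
    return log_sum - logits[0]
```

The loss is small when the query is closer to its positive than to any negative, so minimizing it pulls patches from the same generator together in feature space.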

[125] Euler-inspired Decoupling Neural Operator for Efficient Pansharpening

Anqi Zhu,Mengting Ma,Yizhen Jiang,Xiangdong Li,Kai Zheng,Jiaxin Li,Wei Zhang

Main category: cs.CV

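TL;DR: This paper proposes EDNO, a neural operator for remote sensing pansharpening based on Euler's formula and frequency-domain modeling. An explicit-implicit feature interaction mechanism improves spatial detail while preserving spectral consistency, balancing performance and efficiency.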

Details Motivation: Existing deep learning methods for pansharpening, especially diffusion-based ones, suffer from spectral-spatial blurring and high computational cost. Method: Proposes the Euler-inspired Decoupling Neural Operator (EDNO), which models pansharpening as a continuous functional mapping in the frequency domain, and introduces the Euler Feature Interaction Layer (EFIL), which performs explicit (phase-rotation alignment) and implicit (feed-forward modeling of spectral distributions) feature interaction in a polar coordinate system. Result: Experiments on three datasets show that EDNO achieves a better balance between performance and efficiency than heavyweight architectures, while offering discretization invariance and a global receptive field. Conclusion: EDNO is a physics-inspired, frequency-domain, computationally efficient approach to pansharpening that mitigates spectral distortion and computational bottlenecks. Abstract: Pansharpening aims to synthesize high-resolution multispectral (HR-MS) images by fusing the spatial textures of panchromatic (PAN) images with the spectral information of low-resolution multispectral (LR-MS) images. While recent deep learning paradigms, especially diffusion-based operators, have pushed the performance boundaries, they often encounter spectral-spatial blurring and prohibitive computational costs due to their stochastic nature and iterative sampling. In this paper, we propose the Euler-inspired Decoupling Neural Operator (EDNO), a physics-inspired framework that redefines pansharpening as a continuous functional mapping in the frequency domain. Departing from conventional Cartesian feature processing, our EDNO leverages Euler's formula to transform features into a polar coordinate system, enabling a novel explicit-implicit interaction mechanism. Specifically, we develop the Euler Feature Interaction Layer (EFIL), which decouples the fusion task into two specialized modules: 1) Explicit Feature Interaction Module, utilizing a linear weighting scheme to simulate phase rotation for adaptive geometric alignment; and 2) Implicit Feature Interaction Module, employing a feed-forward network to model spectral distributions for superior color consistency. By operating in the frequency domain, EDNO inherently captures global receptive fields while maintaining discretization-invariance. Experimental results on three datasets demonstrate that EDNO offers a superior efficiency-performance balance compared to heavyweight architectures.
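
Euler's formula writes a complex frequency coefficient as z = A·e^{iφ}, so a feature can be split into an amplitude A and a phase φ, and a phase rotation is a multiplication by e^{iθ}. The sketch below shows only this polar decomposition and rotation on a single complex number; it is not the EFIL layer, and the helper names `to_polar` and `rotate_phase` are hypothetical.

```python
import cmath
import math


def to_polar(z):
    """Split a complex coefficient into (amplitude, phase) via Euler's formula."""
    return abs(z), cmath.phase(z)


def rotate_phase(z, theta):
    """Rotate the phase of z by theta radians while leaving its amplitude unchanged."""
    amplitude, phase = to_polar(z)
    return cmath.rect(amplitude, phase + theta)


# Toy usage: rotating 3+4j changes its phase but keeps its amplitude of 5.
rotated = rotate_phase(3 + 4j, 0.3)
quarter_turn = rotate_phase(1 + 0j, math.pi / 2)  # approximately 1j
```

A learned layer would presumably predict the rotation per frequency and channel; here `theta` is just a fixed number to keep the example self-contained.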

[126] T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models

Nihal Jaiswal,Siddhartha Arjaria,Gyanendra Chaubey,Ankush Kumar,Aditya Singh,Anchal Chaurasiya

Main category: cs.CV

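TL;DR: This paper proposes T2I-BiasBench, a unified framework for evaluating demographic bias, element omission, and cultural collapse in text-to-image generation models, and validates it on several open-source models.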

Details Motivation: Existing text-to-image models inherit and amplify the demographic imbalances and cultural biases in their training data, so a comprehensive, fine-grained bias evaluation framework is needed. Method: Builds the T2I-BiasBench framework of 13 complementary metrics (including 4 newly proposed and 3 adapted ones) and evaluates Stable Diffusion v1.5, BK-SDM Base, Koala Lightning, and Gemini 2.5 Flash on 1,574 images across 5 structured prompt categories. Result: (1) SD v1.5 and BK-SDM show bias amplification on beauty-related prompts; (2) contextual constraints such as surgical PPE substantially reduce professional-role gender bias; (3) all models, including RLHF-aligned Gemini, exhibit cultural representation collapse. Conclusion: T2I-BiasBench is the first evaluation framework to cover demographic bias, element omission, and cultural collapse together, and it has been publicly released to promote standardized bias evaluation of generative models. Abstract: Text-to-image (T2I) generative models achieve impressive visual fidelity but inherit and amplify demographic imbalances and cultural biases embedded in training data. We introduce T2I-BiasBench, a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models - the first framework to address all three dimensions simultaneously. We evaluate three open-source models - Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning - against Gemini 2.5 Flash (RLHF-aligned) as a reference baseline. The benchmark comprises 1,574 generated images across five structured prompt categories. T2I-BiasBench integrates six established metrics with seven additional measures: four newly proposed (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) and three adapted (Hallucination Score, Vendi Score, CLIP Proxy Score). Three key findings emerge: (1) Stable Diffusion v1.5 and BK-SDM exhibit bias amplification (>1.0) in beauty-related prompts; (2) contextual constraints such as surgical PPE substantially attenuate professional-role gender bias (Doctor CBS = 0.06 for SD v1.5); and (3) all models, including RLHF-aligned Gemini, collapse to a narrow set of cultural representations (CAS: 0.54-1.00), confirming that alignment techniques do not resolve cultural coverage gaps. T2I-BiasBench is publicly released to support standardized, fine-grained bias evaluation of generative models.
The project page is available at: https://gyanendrachaubey.github.io/T2I-BiasBench/

[127] SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker

Junbin Su,Ziteng Xue,Shihui Zhang,Kun Chen,Weiming Hu,Zhipeng Zhang

Main category: cs.CV

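TL;DR: SEATrack is a simple, efficient, and adaptive two-stream multimodal tracker that uses cross-modal alignment of matching responses and a Hierarchical Mixture of Experts (HMoE) to reach a better balance between performance and parameter efficiency.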

Details Motivation: Existing parameter-efficient fine-tuning (PEFT) methods for multimodal tracking often buy performance with larger parameter budgets, which defeats the efficiency purpose of PEFT. In addition, modality-specific biases in two-stream methods produce conflicting matching attention maps and hinder joint representation learning. Method: Proposes the AMG-LoRA module, which combines LoRA for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically align attention maps across modalities, and introduces a Hierarchical Mixture of Experts (HMoE) in place of conventional local fusion for efficient global relation modeling. Result: On RGB-T, RGB-D, and RGB-E tracking tasks, SEATrack outperforms current state-of-the-art methods while keeping the parameter count low, giving a better performance-efficiency trade-off. Conclusion: Cross-modal alignment of matching responses and global relation modeling are key to resolving the performance-efficiency trade-off in multimodal tracking, and SEATrack provides an effective, scalable architecture for it. Abstract: Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack makes notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available at https://github.com/AutoLab-SAI-SJTU/SEATrack.
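
The LoRA component of AMG-LoRA follows the standard low-rank update W' = W + (α/r)·B·A, where only the small matrices A and B are trained. The sketch below shows that generic update on plain nested lists; it does not include the Adaptive Mutual Guidance part, and the helper names and toy matrices are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, d_out x d_in
    b: trainable matrix, d_out x r (commonly initialized to zeros)
    a: trainable matrix, r x d_in
    """
    r = len(a)  # the LoRA rank is the number of rows of A
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy usage with rank r = 1 on a 2x2 identity weight.
identity = [[1.0, 0.0], [0.0, 1.0]]
adapted = lora_weight(identity, a=[[0.0, 2.0]], b=[[1.0], [0.0]], alpha=1.0)
unchanged = lora_weight(identity, a=[[0.0, 2.0]], b=[[0.0], [0.0]], alpha=1.0)
```

With B initialized to zero the adapted weight equals the frozen one, which is why LoRA training starts from the pretrained behavior; the trainable parameter count is r·(d_in + d_out) rather than d_in·d_out.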

[128] From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception

Jilong Zhu,Yang Feng

Main category: cs.CV

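TL;DR: This paper proposes the Variational Information Flow (VIF) framework, which models visual saliency with a conditional variational autoencoder to counter the performance drop that visual attenuation causes in multimodal large language models on fine-grained perception tasks.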

Details Motivation: Multimodal large language models (MLLMs) perform poorly on fine-grained visual understanding, such as recognizing tiny objects or discerning subtle relationships. The authors attribute this to "visual attenuation": sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens as they propagate through deep layers. Method: Proposes the Variational Information Flow (VIF) framework, which uses a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, it can be integrated into existing MLLM architectures. Result: Across diverse benchmarks covering general VQA, fine-grained perception, and visual grounding, VIF yields competitive improvements over previous methods. Conclusion: VIF strengthens the fine-grained perception of MLLMs, offers a new information-flow perspective on mitigating visual attenuation, and is general and plug-and-play. Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a "loss of focus" during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.
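
For readers unfamiliar with CVAEs, the textbook training objective is an evidence lower bound with a reconstruction term and a KL term between a posterior and a conditional prior. The form below is that generic bound, written under assumed notation (v for the visual input, t for the question text, a for the answer, z for the latent saliency variable); the abstract does not give VIF's exact loss, so this should be read as background rather than as the paper's formulation.

```latex
\log p_\theta(a \mid v, t) \;\ge\;
\mathbb{E}_{q_\phi(z \mid v, t, a)}\big[\log p_\theta(a \mid v, t, z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid v, t, a) \,\|\, p_\theta(z \mid v, t)\big)
```

At inference time the answer is unavailable, so a CVAE samples z from the conditional prior p_θ(z | v, t) instead of the posterior.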

[129] NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1)

Guanyi Qin,Jie Liang,Bingbing Zhang,Lishen Qu,Ya-nan Guan,Hui Zeng,Lei Zhang,Radu Timofte,Jianhui Sun,Xinli Yue,Tao Shao,Huan Hou,Wenjie Liao,Shuhao Han,Jieyu Yuan,Chunle Guo,Chongyi Li,Zewen Chen,Yunze Liu,Jian Guo,Juan Wang,Yun Zeng,Bing Li,Weiming Hu,Hesong Li,Dehua Liu,Xinjie Zhang,Qiang Li,Li Yan,Wei Dong,Qingsen Yan,Xingcan Li,Shenglong Zhou,Manjiang Yin,Yinxiang Zhang,Hongbo Wang,Jikai Xu,Zhaohui Fan,Dandan Zhu,Wei Sun,Weixia Zhang,Kun Zhu,Nana Zhang,Kaiwei Zhang,Qianqian Zhang,Zhihan Zhang,William Gordon,Linwei Wu,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Cici Liu,Yaokun Shi

Main category: cs.CV

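TL;DR: This paper gives an overview of the NTIRE 2026 challenge on Professional Image Quality Assessment (Track 1), which proposes using multimodal large language models (MLLMs) in place of conventional scalar scoring to compare high-quality image pairs and provide interpretable reasoning.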

Details Motivation: Conventional image quality assessment (IQA) relies on a single scalar score, which struggles to separate subtle differences among high-quality images and cannot explain its judgments. MLLMs have the potential to mimic human expert cognition and provide interpretable assessments. Method: Organizes NTIRE 2026 Challenge Track 1 and builds a new benchmark for professional scenarios, in which participating methods must complete two tasks: (1) selecting the better image within a high-quality pair; and (2) generating visually grounded, expert-level explanations. Result: The challenge attracted nearly 200 registrations and over 2,500 submissions, and the top-performing methods significantly advanced the state of the art in professional IQA. Conclusion: MLLMs show strong potential for professional IQA, and the challenge provides a new paradigm and a public dataset for future research on interpretable, human-like image assessment. Abstract: In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal Large Language Models (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high-quality image pairs. Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high-quality pair; and (2) Interpretative Reasoning: generating grounded, expert-level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top-performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at https://github.com/narthchin/RAIM-PIQA, and the official homepage is accessible at https://www.codabench.org/competitions/12789/.

[130] CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

Zhaoyang Jia,Naifu Xue,Zihan Zheng,Jiahao Li,Bin Li,Xiaoyi Zhang,Zongyu Guo,Yuan Zhang,Houqiang Li,Yan Lu

Main category: cs.CV

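TL;DR: This paper proposes CoD-Lite, a lightweight diffusion codec for real-time compression. Using compression-oriented pre-training and lightweight convolutions in place of transformers, it reaches 60 FPS encoding and 42 FPS decoding at 1080p and reduces bitrate by 85% at a comparable FID.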

Details Motivation: Existing large models based on diffusion transformers are hard to adapt to real-time, lightweight image compression, so a lightweight diffusion codec that balances generation quality and inference efficiency is needed. Method: Proposes a compression-oriented pre-training paradigm and shows that lightweight convolutions combined with knowledge distillation can replace global attention. Builds a one-step lightweight convolutional diffusion architecture and adds adversarial learning to further improve performance. Result: Reaches 60 FPS encoding and 42 FPS decoding at 1080p resolution. Compared with MS-ILLM, it reduces bitrate by 85% at a comparable FID, narrowing the gap between generative compression and practical deployment. Conclusion: Generation-oriented pre-training helps little at small model scales, whereas compression-oriented pre-training is more effective. Lightweight convolutions paired with distillation are sufficient for high-quality compression, without a complex transformer structure. Abstract: Recent advanced diffusion methods typically derive strong generative priors by scaling diffusion transformers. However, scaling fails to generalize when adapted for real-time compression scenarios that demand lightweight models. In this paper, we explore the design of real-time and lightweight diffusion codecs by addressing two pivotal questions. First, does diffusion pre-training benefit lightweight diffusion codecs? Through systematic analysis, we find that generation-oriented pre-training is less effective at small model scales whereas compression-oriented pre-training yields consistently better performance. Second, are transformers essential? We find that while global attention is crucial for standard generation, lightweight convolutions suffice for compression-oriented diffusion when paired with distillation. Guided by these findings, we establish a one-step lightweight convolution diffusion codec that achieves real-time 60 FPS encoding and 42 FPS decoding at 1080p. Further enhanced by distillation and adversarial learning, the proposed codec reduces bitrate by 85% at a comparable FID to MS-ILLM, bridging the gap between generative compression and practical real-time deployment. Codes are released at https://github.com/microsoft/GenCodec/CoD_Lite

[131] MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Ruoxiang Huang,Zhen Yuan

Main category: cs.CV

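TL;DR: This paper proposes MODIX, a framework that improves the multimodal understanding of vision-language models (VLMs) by dynamically adjusting the granularity of positional encoding, with no training and no change to the model architecture.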

Details Motivation: Existing VLM positional encoding assigns positional indices uniformly to all tokens and ignores differences in information density within and across modalities. This leads to inefficient attention allocation: redundant visual regions dominate while key information is underrepresented. Method: MODIX is a training-free framework that jointly models intra-modal information density (covariance-based entropy) and inter-modal interaction (cross-modal alignment) to produce unified scores. These scores dynamically rescale positional indices, giving informative modalities finer-grained positional encoding and compressing the positional representation of redundant ones. Result: Across diverse architectures and benchmarks, MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions. Conclusion: Positional encoding should be treated as an adaptive resource for multimodal sequence modeling in Transformers rather than as a fixed component. Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.
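
To make the idea of score-driven positional strides concrete, the sketch below assigns cumulative (possibly fractional) position indices, where each token advances the index by a stride proportional to its modality's score. This is an illustrative assumption about how rescaling could work, not MODIX's actual rule: the function name `scaled_positions`, the score values, and the choice that the top-scoring modality keeps a stride of 1 while lower-scoring modalities are compressed are all hypothetical.

```python
def scaled_positions(modalities, scores):
    """Assign cumulative position indices with a per-modality stride.

    modalities: list of modality labels, one per token, in sequence order.
    scores: dict mapping label -> positive score (a hypothetical stand-in
            for the unified density/alignment score described above).
    """
    top = max(scores.values())
    positions = []
    current = 0.0
    for label in modalities:
        positions.append(current)
        # Stride is 1.0 for the top-scoring modality and smaller for the rest,
        # so lower-scoring tokens are packed closer together.
        current += scores[label] / top
    return positions


# Toy usage: four image tokens judged half as informative as two text tokens.
tokens = ["img", "img", "img", "img", "txt", "txt"]
positions = scaled_positions(tokens, {"img": 0.5, "txt": 1.0})
```

Fractional indices like these are compatible with rotary-style encodings, which accept real-valued positions, but whether MODIX uses them this way is not stated in the abstract.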

[132] Cross-Attentive Multiview Fusion of Vision-Language Embeddings

Tomas Berriel Martins,Martin R. Oswald,Javier Civera

Main category: cs.CV

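TL;DR: This paper proposes CAMFusion, a cross-attentive multiview fusion method. Cross-attention over multiview vision-language descriptors, together with a self-supervision signal based on multiview consistency, significantly improves 3D semantic and instance classification and reaches state-of-the-art results, including in zero-shot out-of-domain evaluation.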

Details Motivation: Lifting vision-language models from 2D images to 3D scenes is challenging, and existing approaches (back-projecting and averaging, or selecting a single view) yield suboptimal 3D representations. Method: Proposes a multiview transformer architecture that fuses vision-language descriptors across views with cross-attention, and uses multiview consistency as a self-supervision signal that is trained jointly with a standard supervised loss. Result: CAMFusion consistently outperforms naive averaging and single-view selection on 3D semantic and instance classification benchmarks and achieves state-of-the-art results in zero-shot evaluation on out-of-domain datasets. Conclusion: Cross-attentive fusion combined with multiview self-supervision is an effective way to improve open-vocabulary 3D recognition and offers a new direction for 3D vision-language understanding. Abstract: Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.

[133] Evolution-Inspired Sample Competition for Deep Neural Network Optimization

Ying Zheng,Yiyi Zhang,Yi Wang,Lap-Pui Chau

Main category: cs.CV

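TL;DR: This paper proposes Natural Selection (NS), an evolution-inspired optimization method for deep learning. It builds sample groups, computes a competition score for each sample, and dynamically adjusts loss weights to address class imbalance, under-learned hard samples, and interference from noisy samples.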

Details Motivation: Conventional deep network training applies a uniform learning paradigm to all samples and ignores the heterogeneous competition among them, which leads to bias under class imbalance, insufficient learning of hard samples, and erroneous reinforcement of noisy samples. Method: Natural Selection (NS) assembles multiple samples into a composite image, rescales it to the original input size for inference, computes a natural selection score for each sample within the group from the resulting predictions, and uses these scores to dynamically reweight the per-sample loss, introducing a competition-driven optimization mechanism. Result: Extensive experiments on 12 public datasets across 4 image classification tasks validate the effectiveness of NS. NS is compatible with diverse network architectures and does not depend on task-specific assumptions, indicating strong generality and practicality. Conclusion: NS is a simple yet effective alternative that moves beyond uniform sample treatment and enables more adaptive and balanced model optimization. Abstract: Conventional deep network training generally optimizes all samples under a largely uniform learning paradigm, without explicitly modeling the heterogeneous competition among them. Such an oversimplified treatment can lead to several well-known issues, including bias under class imbalance, insufficient learning of hard samples, and the erroneous reinforcement of noisy samples. In this work, we present Natural Selection (NS), a novel evolution-inspired optimization method that explicitly incorporates competitive interactions into deep network training. Unlike conventional sample reweighting strategies that rely mainly on predefined heuristics or static criteria, NS estimates the competitive status of each sample in a group-wise context and uses it to adaptively regulate its training contribution. Specifically, NS first assembles multiple samples into a composite image and rescales it to the original input size for model inference. Based on the resulting predictions, a natural selection score is computed for each sample to characterize its relative competitive variation within the constructed group. These scores are then used to dynamically reweight the sample-wise loss, thereby introducing an explicit competition-driven mechanism into the optimization process. In this way, NS provides a simple yet effective means of moving beyond uniform sample treatment and enables more adaptive and balanced model optimization. Extensive experiments on 12 public datasets across four image classification tasks demonstrate the effectiveness of the proposed method.
Moreover, NS is compatible with diverse network architectures and does not depend on task-specific assumptions, indicating its strong generality and practical potential. The code will be made publicly available.

[134] Cross-Modal Knowledge Distillation for PET-Free Amyloid-Beta Detection from MRI

Francesco Chiumento,Julia Dietlmeier,Ronan P. Killeen,Kathleen M. Curran,Noel E. O'Connor,Mingming Liu

Main category: cs.CV

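TL;DR: This paper proposes a PET-guided knowledge distillation framework that predicts amyloid-β (Aβ) positivity from MRI alone, without PET scans or clinical covariates, enabling interpretable, low-cost early screening for Alzheimer's disease.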

Details Motivation: Current Aβ detection depends on PET imaging, which is costly, invasive, and not widely accessible, so it is hard to use for population-level screening. An MRI-only alternative is needed. Method: Builds a BiomedCLIP-based multimodal teacher model that learns PET-MRI alignment through cross-modal attention and triplet contrastive learning with PET-informed (Centiloid-aware) online negative sampling, then trains an MRI-only student model through feature-level and logit-level knowledge distillation. Result: On two independent datasets, OASIS-3 and ADNI, the best AUCs are 0.74 and 0.68 respectively. The model focuses on anatomically relevant cortical regions, is interpretable, and needs no non-imaging clinical variables. Conclusion: The PET-guided knowledge distillation framework predicts Aβ status from MRI alone and offers a feasible approach with clinical potential for PET-free, large-scale Alzheimer's disease screening. Abstract: Detecting amyloid-β (Aβ) positivity is crucial for early diagnosis of Alzheimer's disease but typically requires PET imaging, which is costly, invasive, and not widely accessible, limiting its use for population-level screening. We address this gap by proposing a PET-guided knowledge distillation framework that enables Aβ prediction from MRI alone, without requiring non-imaging clinical covariates or PET at inference. Our approach employs a BiomedCLIP-based teacher model that learns PET-MRI alignment via cross-modal attention and triplet contrastive learning with PET-informed (Centiloid-aware) online negative sampling. An MRI-only student then mimics the teacher via feature-level and logit-level distillation. Evaluated across four MRI contrasts (T1w, T2w, FLAIR, T2*) and two independent datasets, our approach demonstrates effective knowledge transfer (best AUC: 0.74 on OASIS-3, 0.68 on ADNI) while maintaining interpretability and eliminating the need for clinical variables. Saliency analysis confirms that predictions focus on anatomically relevant cortical regions, supporting the clinical viability of PET-free Aβ screening. Code is available at https://github.com/FrancescoChiumento/pet-guided-mri-amyloid-detection.
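
Feature-level and logit-level distillation are commonly combined as a mean-squared error between teacher and student features plus a temperature-softened KL divergence between their output distributions. The sketch below shows that common recipe only; the temperature `t`, the weight `beta`, and the function names are assumptions, and the paper's actual loss weights and feature layers are not given in the abstract.

```python
import math


def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((l - m) / t) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat, t=2.0, beta=1.0):
    """Logit-level KL (scaled by t^2) plus feature-level MSE.

    t and beta are hypothetical hyperparameters for this sketch.
    """
    p_teacher = softmax(teacher_logits, t)
    p_student = softmax(student_logits, t)
    # KL(teacher || student); terms with zero teacher probability contribute 0.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    mse = sum((s - q) ** 2 for s, q in zip(student_feat, teacher_feat))
    mse /= len(student_feat)
    return t * t * kl + beta * mse


# Toy usage: a student that matches the teacher exactly has zero loss.
matched = distillation_loss([2.0, -1.0], [2.0, -1.0], [0.5, 0.5], [0.5, 0.5])
mismatched = distillation_loss([-1.0, 2.0], [2.0, -1.0], [0.0, 1.0], [0.5, 0.5])
```

In the setting described above the teacher would see PET and MRI while the student sees MRI only, so at inference only the student is needed.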

[135] StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation

Yinxi He,Kang Liao,Chunyu Lin,Tianyi Wei,Yao Zhao

Main category: cs.CV

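TL;DR: StructDiff is a single-image generation framework based on a single-scale diffusion model. It uses an adaptive receptive field module and 3D positional encoding to achieve structural consistency and spatial controllability, and it proposes a new evaluation criterion based on large language models.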

Details Motivation: Existing single-image generation methods struggle to preserve structural layout, especially for images with large rigid objects or strict spatial constraints, and they lack spatial controllability, so the position and structure of generated content cannot be guided flexibly. Method: Proposes the StructDiff framework: 1) an adaptive receptive field module that maintains both global and local distributions; 2) 3D positional encoding as a spatial prior, supporting flexible control over position, scale, and local details; 3) a new evaluation criterion based on large language models. Result: Outperforms existing methods in structural consistency, visual quality, and spatial controllability, and shows broad applicability to downstream tasks such as text-guided generation, image editing, outpainting, and paint-to-image synthesis. Conclusion: StructDiff is the first to use positional encoding for spatial manipulation in single-image generation, significantly improving the structural fidelity and controllability of generated results, while the proposed LLM-based evaluation mitigates the limitations of conventional metrics and user studies. Abstract: This paper introduces StructDiff, a generative framework based on a single-scale diffusion model for single-image generation. Single-image generation aims to synthesize diverse samples with similar visual content to the source image by capturing its internal statistics, without relying on external data. However, existing methods often struggle to preserve the structural layout, especially for images with large rigid objects or strict spatial constraints. Moreover, most approaches lack spatial controllability, making it difficult to guide the structure or placement of generated content. To address these challenges, StructDiff introduces an adaptive receptive field module to maintain both global and local distributions. Building on this foundation, StructDiff incorporates 3D positional encoding (PE) as a spatial prior, allowing flexible control over positions, scale, and local details of generated objects. To our knowledge, this spatial control capability represents the first exploration of PE-based manipulation in single-image generation. Furthermore, we propose a novel evaluation criterion for single-image generation based on large language models (LLMs). This criterion specifically addresses the limitations of existing objective metrics and the high labor costs associated with user studies. StructDiff also demonstrates broad applicability across downstream tasks, such as text-guided image generation, image editing, outpainting, and paint-to-image synthesis.
Extensive experiments demonstrate that StructDiff outperforms existing methods in structural consistency, visual quality, and spatial controllability. The project page is available at https://butter-crab.github.io/StructDiff/.

[136] PDF-GS: Progressive Distractor Filtering for Robust 3D Gaussian Splatting

Kangmin Seo,MinKyu Lee,Tae-Young Kim,ByeongCheol Lee,JoonSeoung An,Jae-Pil Heo

Main category: cs.CV

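TL;DR: This paper proposes PDF-GS, a framework that strengthens the inherent ability of 3D Gaussian Splatting (3DGS) to suppress inconsistent signals through progressive multi-phase optimization, yielding robust, high-fidelity, distractor-free 3D reconstruction without architectural changes or extra inference overhead.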

Details Motivation: Conventional 3DGS training assumes that input images are fully multi-view consistent, so it is sensitive to distractors that violate this assumption and produce visual artifacts. Its own potential to suppress inconsistent signals has not been fully exploited. Method: Proposes PDF-GS (Progressive Distractor Filtering for Robust 3D Gaussian Splatting), which uses progressive multi-phase optimization: distractors are first filtered out step by step using discrepancy cues, and a fine-grained, view-consistent reconstruction is then performed on the purified Gaussian representation. Result: Significantly outperforms baselines across diverse datasets and challenging real-world scenes, reaching a new state of the art. The method is lightweight and plug-and-play, leaves the 3DGS architecture unchanged, and adds no inference overhead. Conclusion: PDF-GS brings out and amplifies the self-filtering ability of 3DGS, providing an efficient and general approach to robust 3D reconstruction. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled impressive real-time photorealistic rendering. However, conventional training pipelines inherently assume full multi-view consistency among input images, which makes them sensitive to distractors that violate this assumption and cause visual artifacts. In this work, we revisit an underexplored aspect of 3DGS: its inherent ability to suppress inconsistent signals. Building on this insight, we propose PDF-GS (Progressive Distractor Filtering for Robust 3D Gaussian Splatting), a framework that amplifies this self-filtering property through a progressive multi-phase optimization. The progressive filtering phases gradually remove distractors by exploiting discrepancy cues, while the following reconstruction phase restores fine-grained, view-consistent details from the purified Gaussian representation. Through this iterative refinement, PDF-GS achieves robust, high-fidelity, and distractor-free reconstructions, consistently outperforming baselines across diverse datasets and challenging real-world conditions. Moreover, our approach is lightweight and easily adaptable to existing 3DGS frameworks, requiring no architectural changes or additional inference overhead, leading to a new state-of-the-art performance. The code is publicly available at https://github.com/kangrnin/PDF-GS.

[137] Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

Zijian Liu,Sihan Cao,Pengcheng Zheng,Kuien Liu,Caiyan Qin,Xiaolin Qin,Jiwei Wei,Chaoning Zhang

Main category: cs.CV

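TL;DR: This paper proposes Decoder-side Temporal Rebalancing (DTR), a training-free method that calibrates visual attention to reduce the over-reliance of video large language models on a few frames during generation (the "anchor frame" bias), thereby reducing hallucinations and making the use of temporal evidence more balanced and robust.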

Details Motivation: Existing Video Large Language Models (Video-LLMs) hallucinate severely. A main cause is that decoding over-relies on a limited temporal portion of the video, especially the "anchor frame", so temporal evidence is aggregated unevenly. This bias is inherent to the model and largely independent of the input, yet it markedly worsens hallucination. Method: Proposes Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that adaptively recalibrates the visual attention distribution in middle-to-late decoder layers, suppresses anchor-frame dominance, and increases the contribution of under-attended frames, without modifying the visual encoder or adding auxiliary models. Result: On multiple hallucination and video understanding benchmarks, DTR significantly improves the hallucination robustness of various Video-LLMs while preserving video understanding performance and high inference efficiency. Conclusion: Decoder-side temporal attention imbalance is a key mechanism behind hallucinations in Video-LLMs. The lightweight, plug-and-play DTR method mitigates it and offers a new direction for making video-grounded generation more trustworthy. Abstract: Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence.
Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.
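
The anchor frame is defined above as the frame with the highest aggregated frame-level attention mass, which can be found with a simple argmax. The sketch below also flattens the attention distribution by raising each mass to a power `gamma` below 1 and renormalizing. That tempering step is a hypothetical stand-in chosen for illustration; the abstract does not specify DTR's actual calibration rule, and the function names and `gamma` value are assumptions.

```python
def anchor_frame(frame_attention):
    """Index of the frame with the highest aggregated attention mass."""
    return max(range(len(frame_attention)), key=lambda i: frame_attention[i])


def rebalance(frame_attention, gamma=0.5):
    """Flatten per-frame attention masses by power tempering, then renormalize.

    gamma in (0, 1] is a hypothetical knob: 1.0 leaves the distribution
    unchanged, smaller values move mass from dominant to under-attended frames.
    """
    powered = [a ** gamma for a in frame_attention]
    total = sum(powered)
    return [p / total for p in powered]


# Toy usage: one frame holds 70% of the attention mass before rebalancing.
masses = [0.7, 0.1, 0.1, 0.1]
anchor = anchor_frame(masses)
balanced = rebalance(masses)
```

Tempering preserves the ranking of frames, so the anchor frame stays the most-attended one but dominates less.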

[138] ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction

Yuhao Liu,Dingju Wang,Ziyang Zheng

Main category: cs.CV

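TL;DR: This paper proposes ELoG-GS, a method for high-quality 3D reconstruction from multi-view images captured in extreme low light. It combines learning-based point cloud initialization with luminance-guided color enhancement to significantly improve geometric consistency and visual fidelity.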

Details Motivation: To reconstruct high-quality, geometrically consistent, and photorealistic 3D scenes from degraded multi-view inputs captured in extreme low-light environments. Method: Proposes Extreme Low-light Optimized Gaussian Splatting (ELoG-GS), which combines learning-based point cloud initialization with luminance-guided color enhancement and incorporates geometry-aware initialization and photometric adaptation strategies. Result: Significantly outperforms the baselines on the NTIRE 2026 Track 1 benchmark, reaching a PSNR of 18.6626 and an SSIM of 0.6855 on the official test set. Conclusion: ELoG-GS provides a practical solution for robust 3D reconstruction in real-world degraded scenarios. Abstract: This paper presents our approach to the NTIRE 2026 3D Restoration and Reconstruction Challenge (Track 1), which focuses on reconstructing high-quality 3D representations from degraded multi-view inputs. The challenge involves recovering geometrically consistent and photorealistic 3D scenes in extreme low-light environments. To address this task, we propose Extreme Low-light Optimized Gaussian Splatting (ELoG-GS), a robust low-light 3D reconstruction pipeline that integrates learning-based point cloud initialization and luminance-guided color enhancement for stable and photorealistic Gaussian Splatting. Our method incorporates both geometry-aware initialization and photometric adaptation strategies to improve reconstruction fidelity under challenging conditions. Extensive experiments on the NTIRE Track 1 benchmark demonstrate that our approach significantly improves reconstruction quality over the baselines, achieving superior visual fidelity and geometric consistency. The proposed method provides a practical solution for robust 3D reconstruction in real-world degraded scenarios. In the final testing phase, our method achieved a PSNR of 18.6626 and an SSIM of 0.6855 on the official platform leaderboard. Code is available at https://github.com/lyh120/FSGS_EAPGS.

[139] Spatial-Spectral Adaptive Fidelity and Noise Prior Reduction Guided Hyperspectral Image Denoising

Xuelin Xie,Xiliang Lu,Zhengshan Wang,Yang Zhang,Long Chen

Main category: cs.CV

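TL;DR: This paper proposes a new hyperspectral image (HSI) denoising framework that combines noise prior reduction with a spatial-spectral adaptive fidelity term to balance data fidelity and prior modeling dynamically, and designs an efficient optimization algorithm that improves both mixed-noise removal and computational efficiency.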

Details Motivation: Existing methods rely too heavily on intrinsic image priors and overlook diverse noise assumptions and the dynamic trade-off between fidelity and priors. Method: Proposes a framework that integrates noise prior reduction with a spatial-spectral adaptive fidelity term, introduces an adaptive weight tensor, builds a fast and robust pixel-wise model combined with the representative coefficient total variation regularizer, and solves it with an ADMM-based optimization algorithm. Result: Achieves superior denoising performance on both simulated and real-world datasets while remaining computationally efficient. It handles multiple noise types and accurately captures the spectral low-rank structure and local smoothness of HSIs. Conclusion: The proposed framework achieves a better dynamic balance between fidelity and prior modeling and offers an efficient, robust approach with few parameters for removing mixed noise from HSIs. Abstract: The core challenge of hyperspectral image denoising is striking the right balance between data fidelity and noise prior modeling. Most existing methods place too much emphasis on the intrinsic priors of the image while overlooking diverse noise assumptions and the dynamic trade-off between fidelity and priors. To address these issues, we propose a denoising framework that integrates noise prior reduction and a spatial-spectral adaptive fidelity term. This framework considers comprehensive noise priors with fewer parameters and introduces an adaptive weight tensor to dynamically balance the fidelity and prior regularization terms. Within this framework, we further develop a fast and robust pixel-wise model combined with the representative coefficient total variation regularizer to accurately remove mixed noise in HSIs. The proposed method not only efficiently handles various types of noise but also accurately captures the spectral low-rank structure and local smoothness of HSIs. An efficient optimization algorithm based on the alternating direction method of multipliers is designed to ensure stable and fast convergence. Extensive experiments on simulated and real-world datasets demonstrate that the proposed model achieves superior denoising performance while maintaining competitive computational efficiency.

[140] GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

Zhaochen Liu,Limeng Qiao,Guanglu Wan,Tingting Jiang

Main category: cs.CV

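TL;DR: This paper proposes GeoAlign, a framework that dynamically aggregates multi-layer geometric features to resolve the task misalignment bias of multimodal large language models (MLLMs) in spatial reasoning, significantly improving performance on benchmarks such as VSI-Bench, ScanQA, and SQA3D.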

Details Motivation: Existing MLLMs perform poorly at spatial reasoning. Prior work injects geometric features from 3D foundation models but relies on static single-layer extraction, which causes a task misalignment bias: the pretraining objectives do not match the heterogeneous spatial demands of MLLMs. Method: Proposes the GeoAlign framework, which builds a hierarchical geometric feature bank and uses the MLLM's original visual tokens as content-aware queries for layer-wise sparse routing, adaptively selecting the most suitable geometric features for each image patch. Result: Experiments on VSI-Bench, ScanQA, and SQA3D show that the proposed compact 4B model reaches state-of-the-art performance, even surpassing larger existing MLLMs. Conclusion: Dynamic multi-layer geometric feature alignment mitigates the task misalignment bias and is a key route to improving the spatial reasoning of MLLMs. Abstract: Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM's original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.
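
Layer-wise sparse routing can be pictured as a top-k gate: a patch's visual token acts as a query, each layer of the feature bank is scored against it, and only the k best-scoring layers contribute to the fused feature. The sketch below implements that generic top-k softmax gate for one patch; it is not GeoAlign's actual router, and the per-layer key vectors, the value of `k`, and the function name `route_layers` are assumptions for illustration.

```python
import math


def route_layers(query, layer_keys, layer_feats, k=2):
    """Fuse the k best-matching layers' features for a single patch.

    query: the patch's visual token (list of floats).
    layer_keys: one hypothetical key vector per layer in the feature bank.
    layer_feats: the geometric feature of this patch at each layer.
    Returns (fused_feature, sorted indices of the selected layers).
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in layer_keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax restricted to the selected layers; all others get zero weight.
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())

    dim = len(layer_feats[0])
    fused = [sum(exps[i] / z * layer_feats[i][d] for i in top) for d in range(dim)]
    return fused, sorted(top)


# Toy usage: three layers, the query matches layers 0 and 2 best.
fused, chosen = route_layers(
    query=[1.0, 0.0],
    layer_keys=[[1.0, 0.0], [0.0, 1.0], [0.9, 0.0]],
    layer_feats=[[1.0, 1.0], [5.0, 5.0], [3.0, 3.0]],
    k=2,
)
```

Because the gate is computed per patch, different image regions can draw on different depths of the 3D model, which is the behavior the single-layer baseline cannot provide.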

[141] Efficient Semantic Image Communication for Traffic Monitoring at the Edge

Damir Assylbek,Nurmukhammed Aitymbetov,Marko Ristin,Dimitrios Zorbas

Main category: cs.CV

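TL;DR: This paper proposes two semantic image communication methods for traffic monitoring, MMSD and SAMR, which sharply reduce the amount of transmitted data while keeping the semantic information usable, achieving data reductions of 99% and 99.1% respectively. Both use an asymmetric architecture with lightweight edge processing and cloud-side reconstruction.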

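Details Motivation: Many visual monitoring systems are limited by communication bandwidth, so transmitting full-resolution images is impractical and unnecessary; the actual tasks (object presence, spatial relationships, scene context) need only semantic information rather than pixel-level fidelity. Method: Proposes two semantic image communication pipelines. MMSD (Multi-Modal Semantic Decomposition) replaces the image with segmentation maps, edge maps, and textual descriptions, and a diffusion model reconstructs the scene at the receiver. SAMR (Semantic-Aware Masking Reconstruction) selectively masks non-critical regions according to semantic importance before JPEG encoding, then restores the content through generative inpainting. Both use an asymmetric architecture with light processing at the edge and heavy computation on the server. Result: On a Raspberry Pi 5, edge processing takes about 15 s (MMSD) and 9 s (SAMR); transmitted data falls by 99% (MMSD) and 99.1% (SAMR) on average; MMSD has a smaller payload than the SPIC baseline with strong semantic consistency; SAMR offers a better quality-compression trade-off than JPEG and SQ-GAN under comparable conditions. Conclusion: Semantics-driven image communication can markedly improve the efficiency and practicality of bandwidth-constrained monitoring systems. MMSD emphasizes confidentiality and extreme compression, SAMR balances reconstruction quality and compression ratio, and together they validate non-pixel-level visual communication. Abstract: Many visual monitoring systems operate under strict communication constraints, where transmitting full-resolution images is impractical and often unnecessary. In such settings, visual data is often used for object presence, spatial relationships, and scene context rather than exact pixel fidelity. This paper presents two semantic image communication pipelines for traffic monitoring, MMSD and SAMR, that reduce transmission cost while preserving meaningful visual information. MMSD (Multi-Modal Semantic Decomposition) targets very high compression together with data confidentiality, since sensitive pixel content is not transmitted. It replaces the original image with compact semantic representations, namely segmentation maps, edge maps, and textual descriptions, and reconstructs the scene at the receiver using a diffusion-based generative model. SAMR (Semantic-Aware Masking Reconstruction) targets higher visual quality while maintaining strong compression. It selectively suppresses non-critical image regions according to semantic importance before standard JPEG encoding and restores the missing content at the receiver through generative inpainting. Both designs follow an asymmetric sender-receiver architecture, where lightweight processing is performed at the edge and computationally intensive reconstruction is offloaded to the server. On a Raspberry Pi 5, the edge-side processing time is about 15s for MMSD and 9s for SAMR.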
Experimental results show average transmitted-data reductions of 99% for MMSD and 99.1% for SAMR. In addition, MMSD achieves lower payload size than the recent SPIC baseline while preserving strong semantic consistency, whereas SAMR provides a better quality-compression trade-off than standard JPEG and SQ-GAN under comparable operating conditions.

[142] SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

Kathakoli Sengupta,Kai Ao,Paola Cascante-Bonilla

Main category: cs.CV

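TL;DR: This paper proposes SceneCritic, a symbolic, rule-based evaluator for indoor layouts grounded in SceneOnto, a spatial ontology built by the authors, which verifies layouts at fine granularity along semantic, orientation, and geometric dimensions. Comparative experiments show that it agrees with human judgments better than VLM evaluators, and that feedback from different critic modalities (rule-based, LLM, VLM) affects iterative layout refinement differently.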

Details Motivation: Evaluation of indoor scenes generated by LLMs and VLMs currently depends on judging rendered views, which is sensitive to viewpoint, prompt wording, and hallucination. The resulting instability makes it hard to separate a model's true spatial plausibility from evaluator bias. Method: Builds the structured spatial ontology SceneOnto (aggregating priors from 3D-FRONT, ScanNet, and Visual Genome) and designs the symbolic evaluator SceneCritic, which verifies semantic, orientation, and geometric coherence at the object and relationship levels; also sets up an iterative refinement test bed that compares three critic modalities: rule-based, text LLM, and image VLM. Result: (a) SceneCritic agrees with human judgments substantially better than VLM evaluators; (b) text-only LLMs can outperform VLMs on semantic layout quality; (c) image-based VLM feedback is the most effective for semantic and orientation correction. Conclusion: Symbolic, ontology-driven evaluation improves the stability and interpretability of indoor scene generation assessment; each critic modality has its own strengths, so the feedback type should be chosen to fit the task goal. Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations.
Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.

[143] Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

Miao Liu,Fangda Wei,Jing Wang,Xinyuan Qian

Main category: cs.CV

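TL;DR: This paper introduces the new task of Listening Deepfake Detection (LDD), builds ListenForge, the first dataset dedicated to it, and designs the Motion-aware and Audio-guided Network (MANet), which substantially improves detection of deepfakes in the listening state.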

Details Motivation: Existing deepfake detection mainly targets speaking scenarios, whereas in real interactions attackers often alternate between forging speaking and listening states to be more deceptive. Listening forgeries are synthesized at relatively low quality, and the task lacks datasets and methods, which makes it a promising new opening for detection. Method: Defines the Listening Deepfake Detection (LDD) task; builds ListenForge, the first dedicated dataset, from five listening head generation methods; designs MANet, which combines motion awareness and audio guidance to capture subtle motion inconsistencies in listener videos and uses the speaker's audio semantics to guide cross-modal fusion. Result: Experiments show that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios, while MANet performs significantly better on ListenForge. Conclusion: Detection needs to move beyond the traditional speaking-centric paradigm and take listening-state deepfakes seriously; this work opens a new direction for multimodal forgery analysis in interactive communication. Abstract: Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging the speaker's audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings.
The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.

[144] PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

Jinlong Liu,Wanggui He,Peng Zhang,Mushui Liu,Hao Jiang,Pipei Huang

Main category: cs.CV

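TL;DR: This paper proposes PromptEcho, a reward construction method that needs neither annotation nor training, for improving the prompt following of text-to-image models. It uses a frozen vision-language model (VLM) to compute the token-level cross-entropy loss between the generated image and the original prompt as the reward signal, and validates its effectiveness on multiple benchmarks.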

Details Motivation: Among existing RL approaches, CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) depend on costly human preference annotation and additional fine-tuning; a more efficient, training-free, annotation-free way to construct reward signals is needed. Method: PromptEcho uses a frozen pretrained VLM and, with the original prompt as the label, computes the token-level cross-entropy loss for the generated image as the reward; the process requires no human annotation or model training and improves automatically as open-source VLMs get stronger. Result: On DenseAlignBench it raises the net win rate of Z-Image and QwenImage-2512 by +26.8pp and +16.2pp respectively, with consistent gains on GenEval, DPG-Bench, and TIIFBench; ablations show it clearly outperforms inference-based scoring with the same VLM, and that reward quality improves with VLM size. Conclusion: PromptEcho is an efficient, training-free, annotation-free reward construction paradigm that markedly strengthens the prompt following of T2I models and is scalable and practical; the trained models and the new benchmark DenseAlignBench will be open-sourced. Abstract: Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires no annotation and no reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.
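
To make the reward concrete, here is a minimal sketch of turning token-level cross-entropy into a scalar reward. It assumes the frozen VLM has already produced per-position logits for the prompt tokens given the image. The function name `prompt_echo_reward`, the averaging over tokens, and the sign convention (reward is the negative loss) are illustrative assumptions that the abstract does not spell out.

```python
import math

def prompt_echo_reward(logits, prompt_token_ids):
    """Negative mean token-level cross-entropy of the prompt (illustrative only).

    logits: per position, a list of unnormalized scores over the vocabulary,
            hypothetically produced by a frozen VLM conditioned on the image.
    prompt_token_ids: the original prompt's token id at each position (the label).
    """
    total = 0.0
    for row, target in zip(logits, prompt_token_ids):
        # Stable log-sum-exp gives the log of the softmax normalizer.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        # Cross-entropy for this token: -log softmax(row)[target].
        total += log_z - row[target]
    # Assumed convention: lower loss means better alignment, so negate it.
    return -total / len(prompt_token_ids)

uniform = prompt_echo_reward([[0.0, 0.0], [0.0, 0.0]], [0, 1])
aligned = prompt_echo_reward([[5.0, 0.0], [0.0, 5.0]], [0, 1])
```

With uniform logits over a two-token vocabulary the reward equals minus the natural log of 2, and it rises when the logits favor the prompt tokens, which is the behavior an RL loop would exploit.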

[145] Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

Zikai Song,Junqing Yu,Yi-Ping Phoebe Chen,Wei Yang,Xinchao Wang

Main category: cs.CV

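TL;DR: This paper proposes HyperSSM, a collaborative reasoning framework that combines a hypergraph with a State Space Model (SSM). By letting the motion states of multiple objects constrain one another, it makes motion estimation in multi-object tracking more robust and continuous, performs especially well under noise and occlusion, and reaches state of the art on several mainstream datasets.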

Details Motivation: Existing motion estimation methods have two main problems: noisy or probabilistic predictions make trajectories unstable, and when a target is occluded the loss of visual cues easily fragments its trajectory. Method: Proposes the collaborative reasoning framework HyperSSM, which uses a hypergraph to model spatial motion correlations (through dynamic hyperedges) and a state space model to enforce temporal smoothness (through structured state transitions), achieving joint spatial-temporal optimization. Result: Reaches state-of-the-art performance on four diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT) and markedly improves trajectory stability and continuity under occlusion and noise. Conclusion: Collaborative joint reasoning over multiple objects effectively reduces the instability of motion estimation and its fragility under occlusion; HyperSSM offers a more robust and consistent approach to spatial-temporal motion modeling for multi-object tracking. Abstract: Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when a target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT), covering various motion patterns and scene complexities, demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.

[146] OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner

Haoyang Jiang,Zekun Wang,Mingyang Yi,Xiuyu Li,Lanqing Hu,Junxian Cai,Qingbin Liu,Xi Chen,Ju Fan

Main category: cs.CV

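TL;DR: This paper proposes a once-for-all (OFA) compression framework for diffusion probabilistic models (DPMs). By restricting the set of candidate subnetwork parameter sizes, allocating channels progressively by importance, and reweighting subnetwork optimization, it produces subnetworks at multiple compute budgets in a single training run, sharply reducing repeated-training overhead while keeping good performance.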

Details Motivation: Existing DPM compression methods must train a separate model for each device's resource constraints, which is costly, while generic OFA methods have a candidate space so large that optimization becomes difficult. Method: Proposes a restricted OFA framework that (1) fixes a small set of discrete parameter sizes as the candidate subnetwork targets; (2) builds the subnetwork for each size by allocating channels progressively according to channel importance; and (3) introduces a reweighting strategy to balance the joint optimization of the different subnetworks. Result: A single training run yields compressed DPM subnetworks of several sizes, with significantly lower training overhead and satisfactory performance for each subnetwork. Conclusion: The OFA compression framework addresses the high cost of repeated compression when deploying DPMs across devices, balances efficiency and performance, and offers a new approach to lightweight DPMs. Abstract: The Diffusion Probabilistic Model (DPM) achieves remarkable performance in image generation, while its increasing parameter size and computational overhead hinder its deployment in practical applications. To improve this, the existing literature focuses on obtaining a smaller model with a fixed architecture through model compression. However, in practice, DPMs usually need to be deployed on various devices with different resource constraints, which leads to multiple compression processes, incurring significant overhead for repeated training. To obviate this, we propose a once-for-all (OFA) compression framework for DPMs that yields different subnetworks with various computations in a one-shot training manner. The existing OFA framework typically involves massive subnetworks with different parameter sizes, while such a huge candidate space slows the optimization. Thus, we propose to restrict the candidate subnetworks with a certain set of parameter sizes, where each size corresponds to a specific subnetwork. Specifically, to construct each subnetwork with a given size, we gradually allocate the maintained channels by their importance. Furthermore, we propose a reweighting strategy to balance the optimization process of different subnetworks. Experimental results show that our approach can produce compressed DPMs for various sizes with significantly lower training overhead while achieving satisfactory performance.

[147] Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Pretraining

Junfeng Xia,Wenhao Ye,Xuanye Pan,Xinke Shen,Mo Wang,Quanying Liu

Main category: cs.CV

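TL;DR: This paper proposes Brain-DiT, a universal multi-state fMRI foundation model based on metadata-conditioned diffusion pretraining. Trained on large-scale data covering many brain states, it substantially improves downstream task performance.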

Details Motivation: Existing fMRI foundation models are limited by a narrow range of brain states and mismatched pretraining tasks, which makes it hard for them to learn generalizable representations. Method: Proposes Brain-DiT, which uses a metadata-conditioned diffusion pretraining framework with a Diffusion Transformer (DiT) and is pretrained on 349,898 sessions from 24 datasets spanning multiple states (resting, task, naturalistic, disease, sleep). Result: Across 7 downstream tasks, diffusion-based generative pretraining outperforms reconstruction or alignment approaches; metadata-conditioned pretraining improves performance further; and different tasks prefer different representational scales (ADNI classification relies on global semantics, whereas age/sex prediction relies on fine-grained local structure). Conclusion: Diffusion pretraining combined with metadata-conditioned modeling is a better paradigm for fMRI foundation models; it disentangles intrinsic neural dynamics from population-level variability and supports generalization across states. Abstract: Current fMRI foundation models primarily rely on a limited range of brain states and mismatched pretraining tasks, restricting their ability to learn generalized representations across diverse brain states. We present Brain-DiT, a universal multi-state fMRI foundation model pretrained on 349,898 sessions from 24 datasets spanning resting, task, naturalistic, disease, and sleep states. Unlike prior fMRI foundation models that rely on masked reconstruction in the raw-signal space or a latent space, Brain-DiT adopts metadata-conditioned diffusion pretraining with a Diffusion Transformer (DiT), enabling the model to learn multi-scale representations that capture both fine-grained functional structure and global semantics. Across extensive evaluations and ablations on 7 downstream tasks, we find consistent evidence that diffusion-based generative pretraining is a stronger proxy than reconstruction or alignment, with metadata-conditioned pretraining further improving downstream performance by disentangling intrinsic neural dynamics from population-level variability. We also observe that downstream tasks exhibit distinct preferences for representational scale: ADNI classification benefits more from global semantic representations, whereas age/sex prediction comparatively relies more on fine-grained local structure. Code and parameters of Brain-DiT are available at https://github.com/REDMAO4869/Brain-DiT.

[148] Risk-Calibrated Learning: Minimizing Fatal Errors in Medical AI

Abolfazl Mohammadi-Seif,Ricardo Baeza-Yates

Main category: cs.CV

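TL;DR: This paper proposes Risk-Calibrated Learning, which introduces a clinical severity matrix M into the loss function to explicitly distinguish errors due to visual ambiguity from structural, fatal errors. It markedly lowers the Critical Error Rate (CER) in medical image classification and improves the clinical safety of the model.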

Details Motivation: Deep learning models reach high accuracy in medical image classification but still make semantically incoherent, high-confidence errors (such as classifying a malignant tumor as benign). These fatal errors seriously damage clinical trust, and conventional methods cannot separate them from acceptable fine-grained ambiguity errors. Method: Proposes the Risk-Calibrated Loss, which embeds a confusion-aware clinical severity matrix M into the optimization objective and suppresses critical errors (especially false negatives) without changing the network architecture. Result: Validated on four medical imaging modalities (brain tumor MRI, dermoscopy, breast histopathology, and prostate histopathology), the Critical Error Rate (CER) falls by 20.0% to 92.4% relative to baselines such as Focal Loss, and the method works for both CNN and Transformer architectures. Conclusion: Risk-Calibrated Learning markedly improves safety while preserving accuracy, achieving a better safety-accuracy trade-off and offering a practical, plug-and-play solution for clinically trustworthy AI. Abstract: Deep learning models often achieve expert-level accuracy in medical image classification but suffer from a critical flaw: semantic incoherence. These high-confidence mistakes that are semantically incoherent (e.g., classifying a malignant tumor as benign) fundamentally differ from acceptable errors which stem from visual ambiguity. Unlike safe, fine-grained disagreements, these fatal failures erode clinical trust. To address this, we propose Risk-Calibrated Learning, a technique that explicitly distinguishes between visual ambiguity (fine-grained errors) and catastrophic structural errors. By embedding a confusion-aware clinical severity matrix M into the optimization landscape, our method suppresses critical errors (false negatives) without requiring complex architectural changes. We validate our approach in four different imaging modalities: Brain Tumor MRI, ISIC 2018 (Dermoscopy), BreaKHis (Breast Histopathology), and SICAPv2 (Prostate Histopathology). Extensive experiments demonstrate that our Risk-Calibrated Loss consistently reduces the Critical Error Rate (CER) for all four datasets, achieving relative safety improvements ranging from 20.0% (on breast histopathology) to 92.4% (on prostate histopathology) compared to state-of-the-art baselines such as Focal Loss. These results confirm that our method offers a superior safety-accuracy trade-off across both CNN and Transformer architectures.
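
One way a severity matrix M could enter a loss is sketched below. The abstract does not give the exact formulation, so the additive form (cross-entropy plus the expected severity of the predicted distribution), the weight `lam`, and the example matrix are all assumptions for illustration, not the paper's Risk-Calibrated Loss.

```python
import math

def softmax(logits):
    # Shift by the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def risk_weighted_loss(logits, true_class, severity, lam=1.0):
    """Cross-entropy plus expected clinical severity (assumed form, illustrative).

    severity[i][j] is a hypothetical cost of predicting class j when the truth
    is i: zero on the diagonal, small for visually ambiguous confusions, and
    large for dangerous ones such as malignant predicted as benign.
    """
    probs = softmax(logits)
    ce = -math.log(probs[true_class])
    expected_severity = sum(p * s for p, s in zip(probs, severity[true_class]))
    return ce + lam * expected_severity

# Hypothetical 3-class matrix: confusing class 0 with class 2 is the fatal case.
M = [[0.0, 1.0, 10.0],
     [1.0, 0.0, 1.0],
     [10.0, 1.0, 0.0]]
mild = risk_weighted_loss([0.0, 2.0, 0.0], 0, M)   # mass on the mild confusion
fatal = risk_weighted_loss([0.0, 0.0, 2.0], 0, M)  # mass on the fatal confusion
```

Both predictions assign the same probability to the true class, so plain cross-entropy cannot tell them apart; only the severity term makes the fatal confusion more expensive.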

[149] AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

Zeheng Wang,Zitong Yu,Yijie Zhu,Bo Zhao,Haochen Liang,Taorui Wang,Wei Xia,Jiayu Zhang,Zhishu Liu,Hui Ma,Fei Ma,Qi Tian

Main category: cs.CV

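TL;DR: This paper proposes AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework. Through three jointly optimized specialized agents (a query planner, an evidence filter, and an emotion generator) and two new modules (MB-MoE and RAAF), it improves the fine-grained understanding and robustness of multimodal emotion recognition.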

Details Motivation: Existing LLM-based multimodal emotion recognition methods rely on static parametric memory and are prone to hallucination; single-round retrieval-augmented generation is vulnerable to modal ambiguity and struggles to model complex affective dependencies across modalities. Method: Proposes the AffectAgent framework with three jointly optimized specialized agents, trained end-to-end with Multi-Agent Proximal Policy Optimization (MAPPO), and introduces the Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF) modules. Result: On the MER-UniBench benchmark, AffectAgent achieves superior performance across a range of complex scenarios. Conclusion: Through collaborative multi-agent reasoning and modality-adaptive fusion, AffectAgent mitigates cross-modal heterogeneity and missing-modality problems and markedly improves the accuracy and robustness of multimodal emotion recognition. Abstract: LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.

[150] Scaling In-Context Segmentation with Hierarchical Supervision

T. Camaret Ndir,Marco Reisert,Robin T. Schirrmeister

Main category: cs.CV

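TL;DR: This paper proposes PatchICL, a hierarchical in-context learning framework for medical image segmentation. Using selective image patching and multi-level supervision, it focuses on the most informative anatomical regions and substantially cuts compute while preserving performance.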

Details Motivation: Standard in-context learning methods rely on dense global cross-attention, whose cost rises steeply with image resolution; existing localized-attention methods lack explicit supervision of the region selection process, which leads to redundant computation in non-informative regions. Method: Proposes the PatchICL framework, which combines selective image patching with multi-level supervision so that the model actively identifies and attends only to the most informative anatomical regions. Result: At 512×512 resolution, PatchICL matches the UniverSeg baseline on in-domain CT segmentation accuracy with 44% less compute; on 35 out-of-domain datasets it outperforms the baseline on 6 of 13 modality categories, with a clear advantage on modalities dominated by localized pathology such as OCT and dermoscopy. Conclusion: Through structured sparse attention and explicit supervision, PatchICL balances accuracy and efficiency in medical image segmentation and improves cross-modality generalization, making it well suited to clinical settings with scarce annotation and limited compute. Abstract: In-context learning (ICL) enables medical image segmentation models to adapt to new anatomical structures from limited examples, reducing the clinical annotation burden. However, standard ICL methods typically rely on dense, global cross-attention, which scales poorly with image resolution. While recent approaches have introduced localized attention mechanisms, they often lack explicit supervision on the selection process, leading to redundant computation in non-informative regions. We propose PatchICL, a hierarchical framework that combines selective image patching with multi-level supervision. Our approach learns to actively identify and attend only to the most informative anatomical regions. Compared to UniverSeg, a strong global-attention baseline, PatchICL achieves competitive in-domain CT segmentation accuracy while reducing compute by 44% at 512×512 resolution. On 35 out-of-domain datasets spanning diverse imaging modalities, PatchICL outperforms the baseline on 6 of 13 modality categories, with particular strength on modalities dominated by localized pathology such as OCT and dermoscopy. Training and evaluation code are available at https://github.com/tidiane-camaret/ic_segmentation

Myungchul Kim,Kwanyong Park,Junmo Kim,In So Kweon

Main category: cs.CV

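TL;DR: This paper proposes ARGOS, the first benchmark and framework to recast multi-camera person search as an interactive reasoning problem in which an agent must plan, ask questions, and eliminate candidates under information asymmetry. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) and evaluated on three progressive tracks (Who/Where/When). Experiments show that current methods perform poorly and that domain-specific tools are essential.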

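Details Motivation: Existing methods struggle with the vague witness descriptions, information asymmetry, and complex spatio-temporal constraints of multi-camera person search, and they neither model nor evaluate interactive, active reasoning. Method: Proposes the ARGOS framework, which models the task as interactive question answering and decision making within a limited turn budget; builds a Spatio-Temporal Topology Graph (STTG) that encodes camera connectivity and empirically validated transition times; and designs benchmark tasks in three stages: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Result: Evaluated on 2,691 tasks across 14 real-world scenarios, four LLM backbones reach a best TWS of only 0.383 on Track 2 (Where) and 0.590 on Track 3 (When); ablations show that removing domain-specific tools lowers accuracy by up to 49.6 percentage points. Conclusion: Multi-camera person search needs to move from passive recognition to active interactive reasoning; ARGOS provides the first systematic benchmark and framework for this direction and confirms that structured spatio-temporal knowledge and specialized tools are critical to reasoning performance. Abstract: We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.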

[152] A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture

Yeeun Park,Miqdad Naduthodi,Suryansh Kumar

Main category: cs.CV

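TL;DR: This paper presents a new dataset for complex 4D markerless human motion capture, with multi-view RGB and depth images, accurate camera calibration, and Vicon ground-truth 3D motion data, for evaluating and improving the robustness of existing methods in realistic interaction scenarios such as severe occlusion and rapid position exchanges between multiple people.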

Details Motivation: Existing markerless 4D human motion capture benchmarks lack the complex multi-person dynamics, severe occlusions, and challenging interaction patterns of the real world, so models degrade in practice and a significant domain gap remains. Method: Builds a new MoCap dataset covering single-person and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances; provides synchronized multi-view RGB and depth sequences, accurate camera calibration, 3D ground truth from a Vicon system, and SMPL/SMPL-X parameters; and uses the dataset to benchmark state-of-the-art markerless MoCap models and to validate targeted fine-tuning. Result: Benchmarking shows that current state-of-the-art markerless MoCap models degrade substantially on this dataset, while fine-tuning on it improves generalization, confirming the dataset's realism and usefulness. Conclusion: The dataset exposes critical weaknesses in existing models and provides a rigorous, realistic evaluation basis for advancing robust, practical markerless 4D human motion capture. Abstract: Marker-based motion capture (MoCap) systems have long been the gold standard for accurate 4D human modeling, yet their reliance on specialized hardware and markers limits scalability and real-world deployment. Advancing reliable markerless 4D human motion capture requires datasets that reflect the complexity of real-world human interactions. Yet, existing benchmarks often lack realistic multi-person dynamics, severe occlusions, and challenging interaction patterns, leading to a persistent domain gap. In this work, we present a new dataset and evaluation for complex 4D markerless human motion capture. Our proposed MoCap dataset captures both single and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It includes synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion capture from a Vicon system, and corresponding SMPL/SMPL-X parameters. This setup ensures precise alignment between visual observations and motion ground truth. Benchmarking state-of-the-art markerless MoCap models reveals substantial performance degradation under these realistic conditions, highlighting limitations of current approaches. We further demonstrate that targeted fine-tuning improves generalization, validating the dataset's realism and value for model development. Our evaluation exposes critical gaps in existing models and provides a rigorous foundation for advancing robust markerless 4D human motion capture.

[153] CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models

Yunkai Dang,Yizhu Jiang,Yifan Jiang,Qi Fan,Yinghuan Shi,Wenbin Li,Yang Gao

Main category: cs.CV

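TL;DR: This paper proposes the CLASP framework, which uses class-adaptive layer fusion and dual-stage pruning to compress visual tokens efficiently in multimodal large language models, substantially lowering compute while maintaining performance.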

Details Motivation: Existing MLLMs incur heavy compute because visual token sequences are redundant, and methods based on single-layer ViT features and static pruning are not robust across diverse instructions. Method: Proposes CLASP, which first builds category-specific representations by fusing multi-layer visual features, then performs dual-stage pruning that splits the token budget between attention-salient pivot tokens (for relevance) and redundancy-aware completion tokens (for coverage); a class-adaptive mechanism makes both the feature fusion and the budget allocation prompt-conditioned. Result: Consistently outperforms existing methods across benchmarks, pruning ratios, and MLLM architectures, achieving aggressive yet robust visual token compression. Conclusion: CLASP is a plug-and-play, instruction-aware, class-adaptive visual token compression framework that eases the compute bottleneck of MLLMs while balancing efficiency and robustness. Abstract: Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at https://github.com/Yunkaidang/CLASP.
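
The split between pivot and completion tokens can be illustrated with a small two-stage selection. This sketch omits CLASP's class-adaptive fusion and budget allocation entirely. The greedy cosine-similarity rule for completion tokens and the name `dual_stage_prune` are assumptions made for illustration, since the abstract does not specify the selection criteria.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dual_stage_prune(tokens, attn, n_pivot, n_completion):
    """Return sorted indices of kept tokens (illustrative two-stage selection).

    tokens: list of feature vectors; attn: one saliency score per token.
    """
    # Stage 1 (relevance): pivot tokens are the most attention-salient ones.
    order = sorted(range(len(tokens)), key=lambda i: attn[i], reverse=True)
    kept = order[:n_pivot]
    # Stage 2 (coverage): greedily add the token least similar to anything
    # already kept, an assumed stand-in for redundancy-aware completion.
    remaining = [i for i in range(len(tokens)) if i not in kept]
    for _ in range(min(n_completion, len(remaining))):
        best = min(
            remaining,
            key=lambda i: max((cosine(tokens[i], tokens[j]) for j in kept),
                              default=0.0),
        )
        kept.append(best)
        remaining.remove(best)
    return sorted(kept)

toks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
attn = [0.9, 0.8, 0.1, 0.7]
mixed = dual_stage_prune(toks, attn, n_pivot=1, n_completion=1)
salient_only = dual_stage_prune(toks, attn, n_pivot=2, n_completion=0)
```

With the same budget of two tokens, the attention-only selection keeps two nearly identical tokens, while the mixed selection keeps the low-attention token that points in a different direction.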

[154] A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery

Madeline Anderson,Mikhail Klassen,Ash Hoover,Kerri Cahoy

Main category: cs.CV

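TL;DR: This paper proposes SkyScraper, a multi-agent workflow that geocodes news articles and synthesizes captions for the corresponding satellite image sequences. It yields a multi-temporal remote sensing event captioning dataset of 5,000 sequences and substantially improves event discovery.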

Details Motivation: Remote sensing lacks multi-temporal event captioning datasets, mainly because finding visible events and labeling multi-temporal sequences take significant time and labor. Method: Proposes SkyScraper, an iterative multi-agent workflow that geocodes news articles, matches them to satellite image sequences, and generates descriptive captions automatically. Result: SkyScraper finds 5x more events than traditional geocoding methods and produces a new multi-temporal event captioning dataset with 5,000 sequences. Conclusion: Multi-agent feedback is an effective way to surface new multi-temporal events in satellite imagery, and the framework can also support journalism and investigative work. Abstract: Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi-temporal change captioning datasets, there is a lack of multi-temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi-temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi-temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.

[155] Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling

Huanzhen Wang,Ziheng Zhou,Zeng Tao,Aoxing Li,Yingkai Zhao,Yuxuan Lin,Yan Wang,Wenqiang Zhang

Main category: cs.CV

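TL;DR: This paper proposes the cognition-inspired Dual-stream Semantic Enhancement model (DuSE), which improves the performance and interpretability of dynamic facial expression recognition (DFER) by emulating cognitive mechanisms of human emotion perception such as priming and conceptual integration.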

Details Motivation: Existing vision-based dynamic emotion modeling methods neglect emotion perception and cognitive theories and fall short of human-like emotion understanding; the gap between machine and human emotion perception needs to be closed. Method: Proposes a dual-stream cognitive architecture: (1) the Hierarchical Temporal Prompt Cluster (HTPC) stream simulates the priming effect of linguistic cues on visual processing; (2) the Latent Semantic Emotion Aggregator (LSEA) stream models the integration of sensory input with conceptual knowledge, by analogy with the Conceptual Act Theory and the functions of the hippocampus and default mode network. Result: Reaches state-of-the-art performance on several in-the-wild benchmarks and improves model interpretability. Conclusion: Explicitly modeling neuro-cognitive mechanisms yields a DFER framework that is more consistent with neuroscience and more robust, supporting the value of brain-inspired emotion modeling. Abstract: The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain's strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.

[156] Efficient Adversarial Training via Criticality-Aware Fine-Tuning

Wenyun Li,Zheng Zhang,Dongmei Jiang,Yaowei Wang,Xiangyuan Lan

Main category: cs.CV

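TL;DR: This paper proposes Criticality-Aware Adversarial Training (CAAT), which fine-tunes only the small set of parameters most critical to robustness. It sharply reduces the compute cost of adversarial training for large ViT models while keeping robustness close to that of full-model adversarial training.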

Details Motivation: As ViT models grow in parameter count, their adversarial robustness does not improve in step, and standard adversarial training fine-tunes all parameters, which is too expensive to scale to large ViTs. Method: CAAT first efficiently identifies the parameters that contribute most to adversarial robustness, then uses parameter-efficient fine-tuning (PEFT) to robustly adjust only those weight matrices whose number of critical parameters exceeds a threshold. Result: CAAT outperforms existing lightweight adversarial training methods on three widely used adversarial learning datasets; compared with standard adversarial training, it tunes only about 6% of the parameters and loses only 4.3% in adversarial robustness. Conclusion: CAAT balances high robustness with low compute cost and offers a feasible path to scalable adversarial training for large ViT models. Abstract: Vision Transformer (ViT) models have achieved remarkable performance across various vision tasks, with scalability being a key advantage when applied to large datasets. This scalability enables ViT models to exhibit strong generalization capabilities. However, as the number of parameters increases, the robustness of ViT models to adversarial examples does not scale proportionally. Adversarial training (AT), one of the most effective methods for enhancing robustness, typically requires fine-tuning the entire model, leading to prohibitively high computational costs, especially for large ViT architectures. In this paper, we aim to robustly fine-tune only a small subset of parameters to achieve robustness comparable to standard AT. To accomplish this, we introduce Criticality-Aware Adversarial Training (CAAT), a novel method that adaptively allocates resources to the most robustness-critical parameters, fine-tuning only selected modules. Specifically, CAAT efficiently identifies parameters that contribute most to adversarial robustness. It then leverages parameter-efficient fine-tuning (PEFT) to robustly adjust weight matrices where the number of critical parameters exceeds a predefined threshold. CAAT exhibits favorable generalization when scaled to larger vision transformer architectures, potentially paving the way for adversarial training at scale. For example, compared with plain adversarial training, CAAT incurs only a 4.3% decrease in adversarial robustness while tuning approximately 6% of the parameters. Extensive experiments on three widely used adversarial learning datasets demonstrate that CAAT outperforms state-of-the-art lightweight AT methods with fewer trainable parameters.
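
The module selection rule in the CAAT abstract (tune a weight matrix only if its count of critical parameters exceeds a threshold) can be sketched as follows. How criticality is scored is not stated in the abstract, so the per-parameter scores here are hypothetical inputs, and the global top-fraction cutoff, the function name `select_modules`, and the module names are assumptions for illustration.

```python
def select_modules(scores, critical_fraction=0.1, threshold=2):
    """Pick the weight matrices that would receive robust fine-tuning.

    scores: dict mapping a module name to its per-parameter criticality scores
            (hypothetical values; the scoring rule is not given in the abstract).
    critical_fraction: assumed share of all parameters treated as critical.
    threshold: a module is selected if its critical count exceeds this value.
    """
    # Rank every parameter globally and find the cutoff for "critical".
    all_scores = sorted((s for v in scores.values() for s in v), reverse=True)
    n_critical = max(1, int(len(all_scores) * critical_fraction))
    cutoff = all_scores[n_critical - 1]
    # Count critical parameters per module (ties at the cutoff count as critical).
    counts = {name: sum(1 for s in v if s >= cutoff) for name, v in scores.items()}
    # Keep only modules whose critical count exceeds the threshold.
    selected = [name for name, c in counts.items() if c > threshold]
    return selected, counts

toy_scores = {
    "attn.qkv": [0.9, 0.8, 0.7, 0.1],
    "mlp.fc1": [0.6, 0.05, 0.02, 0.01],
    "head": [0.03, 0.02, 0.01, 0.0],
}
selected, counts = select_modules(toy_scores, critical_fraction=0.34, threshold=2)
```

In this toy example four of the twelve parameters are critical, three of them sit in `attn.qkv`, and only that matrix passes the threshold. A PEFT adapter would then be attached to the selected matrices while the rest stay frozen.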

[157] Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images

Haoyang Jiang,Mingyang Yi,Shaolei Zhang,Junxian Cai,Qingbin Liu,Xi Chen,Ju Fan

Main category: cs.CV

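TL;DR: This paper shows that reconstruction-based detectors of AI-generated images have severe security vulnerabilities to adversarial perturbations: small perturbations can drive detection accuracy to near zero, the attacks transfer across models and detectors, and existing defenses offer limited mitigation.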

Details Motivation: AI images generated by diffusion models pose a potential safety threat, and the currently dominant reconstruction-based detection methods may have under-recognized security weaknesses, so their robustness needs systematic evaluation. Method: A systematic adversarial-robustness evaluation of three representative reconstruction-based detectors across four generative backbone models, covering white-box attack construction, cross-detector transferability tests, and tests of common defense strategies, together with an analysis of how a low signal-to-noise ratio (SNR) leads to detection failure. Result: All tested detectors degrade severely under white-box attack; the attacks transfer strongly and can be mounted in black-box settings; standard adversarial defenses offer limited mitigation; the root cause is that adversarial samples markedly lower the SNR perceived by the detectors. Conclusion: The reconstruction-based detection paradigm has fundamental security flaws, and more robust strategies for detecting AI-generated images need to be designed. Abstract: Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zero. To verify this threat, we present a systematic evaluation of the adversarial robustness of three representative detectors across four diverse generative backbone models. First, we construct adversarial attacks in white-box scenarios, which degrade the performance of all well-trained detectors. Moreover, we find that these attacks demonstrate transferability; specifically, attacks crafted against one detector can be transferred to others, indicating that adversarial attacks on detectors can also be constructed in a black-box setting. Finally, we assess common countermeasures and find that standard defense methods against adversarial attacks provide limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our results reveal fundamental security limitations of reconstruction-based detectors and highlight the need to rethink existing detection strategies.

[158] Generative Anonymization in Event Streams

Adam T. Müller,Mihai Kocsis,Nicolaj C. Stache

Main category: cs.CV

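TL;DR: This paper proposes the first generative anonymization framework for event streams. It bridges modalities and uses pretrained generative models to protect identity while preserving the structural data integrity needed by downstream vision tasks, and it builds the first synchronized real-world event-RGB dataset for evaluation.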

Details Motivation: Deploying neuromorphic vision sensors in public spaces raises serious data privacy concerns: existing Event-to-Video (E2V) models can reconstruct high-fidelity images and thereby expose identities, while conventional obfuscation methods corrupt the spatio-temporal structure and hurt downstream perception performance. Method: A generative anonymization framework that projects asynchronous event streams into an intermediate intensity representation, uses pretrained generative models to synthesize realistic but non-existent identities, and re-encodes the result back into the neuromorphic domain, bridging the modality gap between event streams and spatial generative models. Result: Experiments show that the method reliably prevents identity recovery from E2V reconstructions while preserving the structural data integrity that downstream vision tasks require; the authors also release the first synchronized real-world paired event and RGB dataset. Conclusion: This work is the first to reconcile privacy protection and data utility in event streams, providing a new paradigm and a practical benchmark for privacy-enhancing neuromorphic vision. Abstract: Neuromorphic vision sensors offer low latency and high dynamic range, but their deployment in public spaces raises severe data protection concerns. Recent Event-to-Video (E2V) models can reconstruct high-fidelity intensity images from sparse event streams, inadvertently exposing human identities. Current obfuscation methods, such as masking or scrambling, corrupt the spatio-temporal structure, severely degrading data utility for downstream perception tasks. In this paper, to the best of our knowledge, we present the first generative anonymization framework for event streams to resolve this utility-privacy trade-off. By bridging the modality gap between asynchronous events and standard spatial generative models, our pipeline projects events into an intermediate intensity representation, leverages pretrained models to synthesize realistic, non-existent identities, and re-encodes the features back into the neuromorphic domain. Experiments demonstrate that our method reliably prevents identity recovery from E2V reconstructions while preserving the structural data integrity required for downstream vision tasks. Finally, to facilitate rigorous evaluation, we introduce a novel, synchronized real-world event and RGB dataset captured via precise robotic trajectories, providing a robust benchmark for future research in privacy-preserving neuromorphic vision.

[159] Image-to-Image Translation Framework Embedded with Rotation Symmetry Priors

Feiyu Tan,Heran Yang,Qihong Duan,Kai Ye,Qi Xie,Deyu Meng

Main category: cs.CV

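TL;DR: This paper proposes an image-to-image translation framework built on rotation group equivariant convolutions, introduces transformation learnable equivariant convolutions (TL-Conv) that adaptively learn the transformation group, and validates its symmetry preservation and superior performance both theoretically and experimentally.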

Details Motivation: I2I methods that lack paired data and rely on unsupervised learning remain limited in effectiveness; the paper exploits rotation symmetry, an intrinsic and domain-invariant property of images, to improve model performance. Method: Introduces rotation group equivariant convolutions to build a rotation-equivariant I2I framework; proposes transformation learnable equivariant convolutions (TL-Conv) that adaptively learn transformation groups; provides a theoretical analysis showing that TL-Conv is exactly equivariant in the continuous domain and has bounded error in the discrete case. Result: Experiments on multiple I2I tasks confirm the method's effectiveness and superiority, with higher generation quality and stronger generalization. Conclusion: Equivariant networks can effectively improve I2I generation quality and robustness, and the proposed TL-Conv is broadly applicable. Abstract: Image-to-image translation (I2I) is a fundamental task in computer vision, focused on mapping an input image from a source domain to a corresponding image in a target domain while preserving domain-invariant features and adapting domain-specific attributes. Despite the remarkable success of deep learning-based I2I approaches, the lack of paired data and unsupervised learning framework still hinder their effectiveness. In this work, we address the challenge by incorporating transformation symmetry priors into image-to-image translation networks. Specifically, we introduce rotation group equivariant convolutions to achieve rotation equivariant I2I framework, a novel contribution, to the best of our knowledge, along this research direction. This design ensures the preservation of rotation symmetry, one of the most intrinsic and domain-invariant properties of natural and scientific images, throughout the network. Furthermore, we conduct a systematic study on image symmetry priors on real dataset and propose a novel transformation learnable equivariant convolutions (TL-Conv) that adaptively learns transformation groups, enhancing symmetry preservation across diverse datasets. We also provide a theoretical analysis of the equivariance error of TL-Conv, proving that it maintains exact equivariance in continuous domains and provide a bound for the error in discrete cases. Through extensive experiments across a range of I2I tasks, we validate the effectiveness and superior performance of our approach, highlighting the potential of equivariant networks in enhancing generation quality and its broad applicability. Our code is available at https://github.com/tanfy929/Equivariant-I2I

[160] Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach

Adrien Dorise,Marjorie Bellizzi,Omar Hlimi

Main category: cs.CV

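TL;DR: This paper proposes ConvBEERS, a lightweight, non-generative residual convolutional network designed for onboard satellite image restoration. On simulated and real high-resolution satellite imagery it improves PSNR (+6.9 dB) and downstream detection (+5.1% mAP@50) over the traditional pipeline, and it is deployed on a Xilinx FPGA with about 41x lower latency.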

Details Motivation: Traditional satellite image restoration based on physical models is computationally heavy and slow, so it cannot meet onboard real-time processing needs; a lightweight, efficient, deployable alternative is needed. Method: ConvBEERS, a lightweight convolutional neural network for embedded platforms that uses a residual structure, does not rely on generative modeling, is trained only on simulated satellite data, and fits resource-constrained onboard environments. Result: PSNR improves by 6.9 dB on simulated and real Pleiades-HR imagery; downstream object detection mAP@50 improves by up to 5.1%; deployment on a Xilinx Versal VCK190 FPGA succeeds, with roughly 41x lower latency than the traditional pipeline. Conclusion: A lightweight CNN can preserve restoration quality while substantially improving onboard processing efficiency, confirming its practical value for space systems that must balance performance and hardware feasibility. Abstract: Satellite image restoration aims to improve image quality by compensating for degradations (e.g., noise and blur) introduced by the imaging system and acquisition conditions. As a fundamental preprocessing step, restoration directly impacts both ground-based product generation and emerging onboard AI applications. Traditional restoration pipelines based on sequential physical models are computationally intensive and slow, making them unsuitable for onboard environments. In this paper, we introduce ConvBEERS: a Convolutional Board-ready Embedded and Efficient Restoration model for Space to investigate whether a light and non-generative residual convolutional network, trained on simulated satellite data, can match or surpass a traditional ground-processing restoration pipeline across multiple operating conditions. Experiments conducted on simulated datasets and real Pleiades-HR imagery demonstrate that the proposed approach achieves competitive image quality, with a +6.9 dB PSNR improvement. Evaluation on a downstream object detection task demonstrates that restoration significantly improves performance, with up to +5.1% mAP@50. In addition, successful deployment on a Xilinx Versal VCK190 FPGA validates its practical feasibility for satellite onboard processing, with a ~41x reduction in latency compared to the traditional pipeline. These results demonstrate the relevance of using lightweight CNNs to achieve competitive restoration quality while addressing real-world constraints in spaceborne systems.

[161] DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

Xinyue Li,Shubo Xu,Zhichao Zhang,Zhaolin Cai,Yitong Chen,Guangtao Zhai

Main category: cs.CV

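TL;DR: This paper proposes DPC-VQA, a framework that decouples perception from calibration: a frozen multimodal large language model (MLLM) supplies a base quality estimate and perceptual prior, and a lightweight calibration branch predicts a residual correction, enabling efficient adaptation for video quality assessment (VQA) with far fewer trainable parameters and labeled samples.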

Details Motivation: Using existing MLLMs for video quality assessment requires large-scale retraining and expensive mean opinion score (MOS) annotations, so adapting to new scenarios is costly; a pretrained MLLM already carries a useful perceptual prior, and the key question is how to calibrate it efficiently to the target MOS space. Method: DPC-VQA freezes a pretrained MLLM as the perception backbone that outputs a base quality estimate, and adds a lightweight calibration branch that predicts a residual correction; there is no end-to-end fine-tuning, and only the calibration module is trained. Result: On UGC and AIGC benchmarks, DPC-VQA is competitive with representative baselines while using less than 2% of the trainable parameters, and it remains effective with only 20% of the MOS labels. Conclusion: Decoupling perception from calibration is an efficient, low-overhead way to adapt MLLMs, substantially reducing the compute and annotation that VQA requires and opening a path to MLLM-based quality assessment in low-resource settings. Abstract: Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupling perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
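
The decoupling idea in DPC-VQA (a frozen base estimate plus a learned residual correction) can be sketched in a few lines. This is a toy under stated assumptions, not the paper's architecture: the calibration branch here is a one-feature least-squares fit, and the base scores, features, and MOS labels are made-up numbers.

```python
from typing import List, Tuple


def fit_residual(features: List[float], residuals: List[float]) -> Tuple[float, float]:
    """Closed-form least-squares fit of residual ~= w * feature + b."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(residuals) / n
    var_x = sum((x - mean_x) ** 2 for x in features)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, residuals))
    w = cov_xy / var_x if var_x > 0 else 0.0
    b = mean_y - w * mean_x
    return w, b


def calibrated_score(base: float, feature: float, w: float, b: float) -> float:
    """Frozen base estimate plus the predicted residual correction."""
    return base + (w * feature + b)


# Toy data: base scores from a frozen model (never updated), one hypothetical
# feature per video, and a handful of MOS labels from the target scenario.
base_scores = [2.0, 2.5, 3.0, 1.5]
features = [0.0, 1.0, 2.0, 3.0]
mos_labels = [2.5, 3.5, 4.5, 3.5]

# Only the residual (label minus frozen estimate) is learned.
residuals = [m - s for m, s in zip(mos_labels, base_scores)]
w, b = fit_residual(features, residuals)
predictions = [calibrated_score(s, f, w, b) for s, f in zip(base_scores, features)]
```

Because the base model is never updated, the trainable part is just `w` and `b`, which mirrors in miniature why such a design needs few parameters and few labels.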

[162] Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models

Iman Islam,Bram Ruijsink,Andrew J. Reader,Andrew P. King

Main category: cs.CV

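TL;DR: This study examines the robustness of deep learning models to annotation errors in echocardiography segmentation and proposes label-error detection based on Variance of Gradients (VOG) combined with pseudo-label refurbishment, which improves performance most clearly at high error rates.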

Details Motivation: Deep learning medical image segmentation depends on manually annotated ground truth labels, which often contain random errors or systematic biases that undermine model reliability. Method: Simulates three types of annotation error on the CAMUS dataset, compares loss-based and Variance of Gradients (VOG)-based label-error detection, and proposes a pseudo-labelling strategy to refurbish suspected erroneous labels. Result: VOG flags erroneous labels effectively during training; a standard U-Net is fairly robust to random errors and to systematic errors of up to 50%; the proposed detection and refurbishment approach clearly improves segmentation performance at high error rates. Conclusion: VOG is an effective tool for detecting GT label errors, and combining it with pseudo-label refurbishment makes models more robust to low-quality annotations, offering a practical option for weakly supervised clinical segmentation. Abstract: Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label error detection method with one based on Variance of Gradients (VOG). We also propose a pseudo-labelling approach to refurbish suspected erroneous GT labels. We assess the performance of our proposed approach under varying error levels. Results show that VOG proved highly effective in flagging erroneous GT labels during training. However, a standard U-Net maintained strong performance under random label errors and moderate levels of systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.
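
To make the VOG-based flagging concrete, here is a minimal sketch of one simplified formulation: for each training sample, take the variance of its per-pixel gradients across training checkpoints, average over pixels, and flag the highest-scoring samples. The abstract does not give the exact formulation or the flagging rule used in the study, so the averaging scheme, the `top_k` cutoff, and the toy gradients are assumptions.

```python
from statistics import pvariance
from typing import Dict, List


def vog_score(grads_over_checkpoints: List[List[float]]) -> float:
    """grads_over_checkpoints[t][p] is the gradient at checkpoint t, pixel p.

    Returns the per-pixel variance across checkpoints, averaged over pixels.
    """
    n_pixels = len(grads_over_checkpoints[0])
    per_pixel = [
        pvariance([ckpt[p] for ckpt in grads_over_checkpoints])
        for p in range(n_pixels)
    ]
    return sum(per_pixel) / n_pixels


def flag_suspects(grads: Dict[str, List[List[float]]], top_k: int) -> List[str]:
    """Return the ids of the `top_k` samples with the highest VOG score."""
    scores = {sample_id: vog_score(g) for sample_id, g in grads.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy gradients: 3 checkpoints x 2 pixels for each hypothetical sample.
toy = {
    "img_a": [[0.1, 0.2], [0.1, 0.2], [0.1, 0.2]],  # stable gradients
    "img_b": [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],  # gradients keep flipping
    "img_c": [[0.1, 0.2], [0.2, 0.3], [0.1, 0.2]],  # mildly unstable
}
suspects = flag_suspects(toy, top_k=1)
```

Samples flagged this way would be the candidates for pseudo-label refurbishment; the refurbishment step itself is not sketched here.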

[163] Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

Yingying Zhao,Chengyin Hu,Qike Zhang,Xin Li,Xin Wang,Yiwei Wei,Jiujiang Guo,Jiahuan Long,Tingsong Jiang,Wen Yao

Main category: cs.CV

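TL;DR: This paper proposes Multimodal Semantic Lighting Attacks (MSLA), a physical-world adversarial attack framework. It is the first systematic study of practically deployable physical attacks on vision-language models (VLMs) and shows that VLMs are highly vulnerable to semantic-level interference in real scenes.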

Details Motivation: Existing adversarial studies concentrate on the digital domain and overlook the physically realizable perturbations that VLMs face once deployed in the real world; physical attacks may cause recognition failures and disrupt multimodal reasoning, leading to severe semantic misinterpretation, so their real-world security needs systematic assessment. Method: Multimodal Semantic Lighting Attacks (MSLA) use controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, targeting semantic alignment rather than task-specific outputs. Result: MSLA is effective in both digital and physical settings: it substantially degrades zero-shot classification for CLIP variants and induces severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP on image captioning and VQA. Conclusion: VLMs have a major robustness gap in that physically realizable semantic attacks can break them; this work is the first to expose that risk and stresses the need for physical-world robustness evaluation of VLMs. Abstract: Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. Consequently, it degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.

[164] PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

Xuan Wang,Kai Ruan,Jiayi Han,kaiyue Zhou,Gaoang Wang

Main category: cs.CV

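TL;DR: This paper proposes PianoFlow, a flow-matching framework for audio-driven bimanual piano hand motion generation. It uses MIDI as a privileged modality during training and introduces an asymmetric role-gated interaction module and an autoregressive flow continuation scheme, achieving precise, real-time, long-sequence bimanual motion synthesis.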

Details Motivation: Existing methods rely on acoustic-only representations, lack symbolic priors, use rigid interaction mechanisms, and struggle to generate long sequences efficiently. Method: The PianoFlow framework (1) uses MIDI as a privileged modality during training to distill structured musical priors; (2) adds an asymmetric role-gated interaction module to model dynamic cross-hand coordination; (3) adopts an autoregressive flow continuation scheme for real-time streaming generation of long sequences. Result: On the PianoMotion10M dataset it clearly outperforms prior methods on quantitative and qualitative measures, with inference more than 9x faster. Conclusion: By combining symbolic priors, explicit modeling of bimanual coordination, and an efficient streaming generation mechanism, PianoFlow addresses the precision, coordination, and real-time challenges of audio-driven bimanual piano motion generation. Abstract: Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9x compared to previous methods.

[165] VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

Andrei Atanov,Jesse Allardice,Roman Bachmann,Oğuzhan Fatih Kar,R Devon Hjelm,David Griffiths,Peter Fu,Afshin Dehghan,Amir Zamir

Main category: cs.CV

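TL;DR: VideoFlexTok proposes a variable-length, coarse-to-fine video tokenization that replaces the conventional fixed 3D grid, substantially reducing the complexity and compute cost of generative models, improving training efficiency, and supporting long video generation.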

Details Motivation: Conventional 3D-grid video tokenization forces the downstream model to predict every detail pixel by pixel regardless of how complex the video actually is, which makes learning unnecessarily hard; a more flexible, hierarchically organized tokenization is needed. Method: VideoFlexTok uses a generative flow decoder and builds a variable-length token sequence that runs from coarse (semantics, motion) to fine (local detail); the token count can be truncated to suit different downstream tasks, and longer videos can be encoded within the same budget. Result: On class-conditional and text-to-video generation, a 1.1B-parameter model reaches quality (gFVD, ViCLIP Score) comparable to a 5.2B 3D-grid baseline; a 10-second, 81-frame video needs only 672 tokens (one eighth of the baseline), enabling efficient long video generation. Conclusion: With a structured, adaptive token representation, VideoFlexTok separates the abstract and detailed levels of video content, preserving reconstruction quality while greatly reducing modeling and compute cost, and offers a new approach to efficient video generation. Abstract: Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner -- where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding videos longer than the baselines with the same budget. We evaluate VideoFlexTok on class- and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.

Yifan Du,Zikang Liu,Jinbiao Peng,Jie Wu,Junyi Li,Jinyang Li,Wayne Xin Zhao,Ji-Rong Wen

Main category: cs.CV

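TL;DR: This paper proposes the LMM-Searcher framework, which stores visual assets in a file system and replaces them with lightweight UIDs, and combines an on-demand image-loading tool with a multi-hop cross-modal data synthesis method to enable long-horizon multimodal deep search, reaching state-of-the-art results among open-source models on several benchmarks.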

Details Motivation: Existing multimodal deep search agents struggle on long-horizon tasks with managing heterogeneous information and with high token costs, which leads to context explosion or the loss of key visual signals. Method: A file-based visual representation mechanism stores visual assets externally and maps them to textual UIDs; a fetch-image tool supports progressive, on-demand visual loading; a multi-hop cross-modal data synthesis pipeline is used to distill 12K high-quality trajectories for fine-tuning Qwen3-VL-Thinking-30A3B. Result: Across four benchmarks (including MM-BrowseComp and MMSearch-Plus) the method scales to 100-turn search horizons, achieves the best performance among open-source models, and generalizes well across different base models. Conclusion: LMM-Searcher mitigates context bloat and visual information decay in long-horizon multimodal search, offering a new approach to efficient, scalable multimodal agents. Abstract: Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released at https://github.com/RUCAIBox/LMM-Searcher.
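
The file-based visual representation can be pictured with a small sketch: image bytes are written to disk, only a short textual UID enters the agent's context, and the image is loaded again on demand. This is an illustrative stand-in, not the released LMM-Searcher code; the `VisualStore` class, the UID format, and the `fetch_image` method name are hypothetical.

```python
import tempfile
import uuid
from pathlib import Path


class VisualStore:
    """Keeps image bytes on disk and hands out short textual UIDs."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def offload(self, image_bytes: bytes) -> str:
        """Write the image to the file system and return its UID."""
        uid = f"img_{uuid.uuid4().hex[:8]}"
        (self.root / uid).write_bytes(image_bytes)
        return uid

    def fetch_image(self, uid: str) -> bytes:
        """Load the image only when the agent actually asks for it."""
        return (self.root / uid).read_bytes()


store = VisualStore(Path(tempfile.mkdtemp()))
uid = store.offload(b"\x89PNG fake image bytes")

# The context carries only the lightweight identifier, not the pixels.
context = [f"Search result 1 includes image <{uid}>"]

# Later, the agent decides this image matters and loads it on demand.
loaded = store.fetch_image(uid)
```

The context cost of an image is then a few characters per turn until the agent chooses to look at it, which is the property that lets a search run for many turns without the context growing with every image seen.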

[167] Representing 3D Faces with Learnable B-Spline Volumes

Prashanth Chandran,Daoye Wang,Timo Bolkart

Main category: cs.CV

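TL;DR: CUBE is a new face geometry representation that combines B-spline volumes with learned features. High-dimensional control features and a two-stage mapping give it high expressivity and local editability; it is demonstrated on 3D scan registration and monocular 3D face reconstruction, with state-of-the-art scan registration results.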

Details Motivation: Existing B-spline representations are limited by the expressive power of 3D control points and struggle to combine accuracy, continuity, and local editability; a more expressive representation that keeps the desirable properties of B-splines (such as local support) is needed. Method: CUBE replaces conventional 3D control points with a lattice of high-dimensional features (e.g., 8×8×8); the B-spline basis functions locally blend these into a high-dimensional feature vector whose first three values form the base mesh, and a small MLP then predicts a residual displacement to give the final 3D coordinates; querying at points sampled from a fixed template mesh yields dense semantic correspondence; local surface editing is supported through the control features. Result: CUBE surpasses recent baselines on 3D scan registration, reaching state of the art; transformer encoders are trained to predict CUBE control features from unstructured point clouds and monocular images; continuity, expressivity, and local editing are demonstrated. Conclusion: CUBE is a hybrid implicit-explicit geometric representation that is both principled and practical, offering an efficient, flexible, and interpretable approach to 3D face modeling and reconstruction. Abstract: We present CUBE (Control-based Unified B-spline Encoding), a new geometric representation for human faces that combines B-spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-spline representations with 3D control points, CUBE is parametrized by a lattice (e.g., 8 x 8 x 8) of high-dimensional control features, increasing the model's expressivity. These features define a continuous, two-stage mapping from a 3D parametric domain to 3D Euclidean space via an intermediate feature space. First, high-dimensional control features are locally blended using the B-spline bases, yielding a high-dimensional feature vector whose first three values define a 3D base mesh. A small MLP then processes this feature vector to predict a residual displacement from the base shape, yielding the final refined 3D coordinates. To reconstruct 3D surfaces in dense semantic correspondence, CUBE is queried at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support property of traditional B-spline representations, enabling local surface editing by updating individual control features. We demonstrate the strengths of this representation by training transformer-based encoders to predict CUBE's control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent baselines.
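
The first stage of the CUBE mapping, blending control features with B-spline bases, and the local support property it inherits can be shown with a small sketch. To stay short it uses a 1D lattice of uniform cubic B-splines instead of the 3D volume in the paper, the 5-D control features are made up, and the residual MLP of the second stage is omitted; all of that is assumption, not the authors' code.

```python
from typing import List


def cubic_bspline_basis(u: float) -> List[float]:
    """Uniform cubic B-spline basis weights for local parameter u in [0, 1]."""
    return [
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    ]


def blend(control: List[List[float]], t: float) -> List[float]:
    """Blend control features at parameter t in [0, len(control) - 3]."""
    n_spans = len(control) - 3
    span = min(int(t), n_spans - 1)  # index of the first active control feature
    weights = cubic_bspline_basis(t - span)
    dim = len(control[0])
    return [
        sum(w * control[span + k][d] for k, w in enumerate(weights))
        for d in range(dim)
    ]


# Six hypothetical 5-D control features along a 1D lattice.
control = [[float(i), 0.0, 0.0, 1.0, -1.0] for i in range(6)]
feature = blend(control, t=0.5)
base_point = feature[:3]  # the first three values define the base position

# Local support: editing the last control feature leaves t=0.5 untouched,
# because only control features 0..3 are active on that span.
edited = [row[:] for row in control]
edited[5] = [100.0] * 5
unchanged = blend(edited, t=0.5) == feature
```

In the full representation the remaining feature dimensions would be passed to a small MLP that predicts a residual displacement from `base_point`.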

[168] Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

Muhammad Kamran Janjua,Hugo Silva,Di Niu,Bahador Rashidi

Main category: cs.CV

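TL;DR: This paper proposes Perception Programs (P²), a training-free, model-agnostic method that converts the outputs of vision tools (depth, optical flow, and so on) into compact, structured, language-friendly summaries, substantially improving the performance of multimodal large language models (MLLMs) on visual reasoning tasks.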

Details Motivation: Existing methods feed raw, dense, pixel-level tool outputs directly to the model, which is a poor match for the language-based reasoning strengths of LLMs and leads to weak perception and over-reliance on language priors; the bottleneck is how tool outputs are represented, not the number of tool calls or the size of the model. Method: Perception Programs (P²), a training-free, model-agnostic method that automatically rewrites vision tool outputs into compact, structured, language-native summaries that an MLLM can parse and reason over directly. Result: On six perception tasks in BLINK, P² yields large gains: with GPT-5 Mini, accuracy rises from 41.35% to 86.47% on multi-view reasoning and from 52.42% to 81.45% on relative depth, a 22% average gain; smaller models (e.g., InternVL3.5-4B, Qwen3VL-4B) also gain 15-40% absolute, surpassing prior supervised, RL-based, and agentic tool-use methods. Conclusion: A language-native representation of tool outputs is key to improving MLLM visual reasoning; P² reaches state-of-the-art results with zero training cost, supporting the idea that "representation is capability". Abstract: Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs, it is how tool outputs are represented. We introduce Perception Programs (P²), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P² consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P² raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P², surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.
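
As an illustration of what rewriting a tool output into a language-native summary might look like, the sketch below turns a dense depth map into a few structured lines about named query points. The summary format is invented for this example and is not the P² format from the paper; it also assumes that smaller depth values mean closer to the camera.

```python
from typing import Dict, List, Tuple


def depth_summary(
    depth: List[List[float]],
    points: Dict[str, Tuple[int, int]],
) -> str:
    """Summarize a depth map at named (row, col) points as structured text.

    Assumes smaller depth values mean closer to the camera.
    """
    readings = {name: depth[r][c] for name, (r, c) in points.items()}
    ordered = sorted(readings, key=readings.get)  # nearest first
    lines = ["DEPTH_SUMMARY"]
    for name in ordered:
        lines.append(f"  point {name}: depth={readings[name]:.2f}")
    lines.append(f"  closest: {ordered[0]}")
    return "\n".join(lines)


# A toy 3x3 depth map standing in for a depth-estimation tool's output.
toy_depth = [
    [5.0, 5.0, 4.0],
    [3.0, 2.0, 4.0],
    [1.5, 2.0, 6.0],
]
summary = depth_summary(toy_depth, {"A": (0, 2), "B": (2, 0)})
```

The model would then receive `summary` as text in its prompt instead of, or alongside, the raw depth image, so a relative-depth question reduces to reading two numbers.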

[169] A Sanity Check on Composed Image Retrieval

Yikun Liu,Jiangchao Yao,Weidi Xie,Yanfeng Wang

Main category: cs.CV

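TL;DR: This paper proposes a new benchmark, FISD, and a multi-round interactive evaluation method to assess composed image retrieval (CIR) models more accurately and realistically.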

Details Motivation: Existing CIR benchmarks contain ambiguous queries (several images satisfy the condition) and ignore multi-round interaction, which makes evaluation inaccurate and impractical. Method: (1) Build the FISD benchmark, using generative models to precisely control the variables of reference-target image pairs for unambiguous evaluation along six dimensions; (2) design an automatic multi-round agentic evaluation framework that simulates how models adapt and refine their choices over successive queries. Result: Extensive experiments on typical CIR methods show that the new evaluation framework reveals models' real capabilities more accurately, especially in interactive scenarios. Conclusion: The proposed evaluation paradigm makes CIR model assessment more accurate and practical, giving future research a more reliable benchmark and methodology. Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image, and a relative caption that specifies the desired modification. Despite the rapid development of CIR models, their performance is not well characterized by existing benchmarks, which inherently contain indeterminate queries degrading the evaluation (i.e., multiple candidate images, rather than solely the target image, meet the query criteria), and have not considered their effectiveness in the context of the multi-round system. Motivated by this, we consider improving the evaluation procedure from two aspects: 1) we introduce FISD, a Fully-Informed Semantically-Diverse benchmark, which employs generative models to precisely control the variables of reference-target image pairs, enabling a more accurate evaluation of CIR methods across six dimensions, without query ambiguity; 2) we propose an automatic multi-round agentic evaluation framework to probe the potential of the existing models in the interactive scenarios. By observing how models adapt and refine their choices over successive rounds of queries, this framework provides a more realistic appraisal of their efficacy in practical applications. Extensive experiments and comparisons prove the value of our novel evaluation on typical CIR methods.

[170] M3D-Stereo: A Multiple-Medium and Multiple-Degradation Dataset for Stereo Image Restoration

Deqing Yang,Yingying Liu,Qicong Wang,Zhi Zeng,Dajiang Lu,Yibin Tian

Main category: cs.CV

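TL;DR: This paper introduces M3D-Stereo, a new dataset of 7904 high-resolution stereo image pairs covering four degradation scenarios (underwater scatter, haze/fog, underwater low-light, haze low-light), each split into six degradation levels with pixel-wise consistent clear ground truth, intended to support more realistic and controllable research on image restoration and stereo matching.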

Details Motivation: Existing image restoration datasets are mostly limited to a single degradation type or depend on synthetic data without stereo consistency, which limits their use in complex real scenes; a high-quality dataset that is closer to reality, covers multiple degradation types, and is stereo-consistent is needed. Method: Designs and builds the M3D-Stereo stereo dataset, captured in a controlled laboratory setup and covering four degradation types with six progressive levels each, with aligned stereo pairs and pixel-wise consistent clear ground truth; defines two restoration tasks, single-level and mixed-level degradation, to verify its validity. Result: M3D-Stereo is the first high-resolution image restoration dataset spanning multiple media and multiple degradation levels with stereo consistency; it is publicly released under the LGPLv3 license and provides a more realistic, fine-grained benchmark for image restoration and stereo matching methods. Conclusion: M3D-Stereo fills the gap in multi-degradation, stereo-consistent restoration datasets, makes algorithm evaluation under complex degradation more realistic and controllable, and supports practical progress in the field. Abstract: Image restoration under adverse conditions, such as underwater, haze or fog, and low-light environments, remains a highly challenging problem due to complex physical degradations and severe information loss. Existing datasets are predominantly limited to a single degradation type or heavily rely on synthetic data without stereo consistency, inherently restricting their applicability in real-world scenarios. To address this, we introduce M3D-Stereo, a stereo dataset with 7904 high-resolution image pairs for image restoration research acquired in multiple media with multiple controlled degradation levels. It encompasses four degradation scenarios: underwater scatter, haze/fog, underwater low-light, and haze low-light. Each scenario forms a subset, and is divided into six levels of progressive degradation, allowing fine-grained evaluations of restoration methods with increasing severity of degradation. Collected via a laboratory setup, the dataset provides aligned stereo image pairs along with their pixel-wise consistent clear ground truths. Two restoration tasks, single-level and mixed-level degradation, were performed to verify its validity. M3D-Stereo establishes a better controlled and more realistic benchmark to evaluate image restoration and stereo matching methods in complex degradation environments. It is made public under LGPLv3 license.

[171] Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

Ahmet İnanç,Özgür Erkent

Main category: cs.CV

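TL;DR: This paper proposes the CTAB (Cross-Task Attention Bridge) module, which exchanges features in both directions between detection and segmentation through multi-scale deformable attention in a shared BEV space, improving radar-camera fusion for 3D perception.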

Details Motivation: Existing radar-camera fusion methods handle detection and segmentation in isolation and miss their complementary information: detection features can sharpen segmentation boundaries, and segmentation features can provide dense semantic anchors for detection. Method: CTAB, a bidirectional attention bridge integrated into a multi-task framework together with an Instance Normalization-based segmentation decoder and learnable BEV upsampling, enables cross-task interaction between detection and segmentation features in BEV space. Result: On nuScenes, CTAB outperforms the joint multi-task baseline on 7-class segmentation while detection performance stays essentially unchanged; on a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), the joint model reaches mIoU comparable to a single-task model while also producing 3D detections. Conclusion: CTAB achieves joint optimization of detection and segmentation in BEV space, confirming the value of cross-task feature interaction for radar-camera fused 3D perception. Abstract: Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model reaches comparable mIoU on 4 classes while simultaneously providing 3D detection.

[172] Pi-HOC: Pairwise 3D Human-Object Contact Estimation

Sravan Chittupalli,Ayush Jain,Dong Huang

Main category: cs.CV

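TL;DR: Pi-HOC is a single-pass, instance-aware framework that densely predicts 3D semantic contact for every human-object pair. It substantially improves accuracy, localization, and inference speed in multi-human scenes and supports downstream tasks such as 3D reconstruction and language-driven contact prediction.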

Details Motivation: Existing methods are limited in multi-human interaction, fine-grained contact disentanglement, dependence on object geometry, or inference efficiency, and cannot handle the complex human-object interactions found in the real world. Method: Pi-HOC first detects human and object instances, creates a dedicated HO token for each human-object pair, models interactions with an InteractionFormer, and then uses a SAM-based decoder to predict dense 3D contact on SMPL meshes. Result: On the MMHOI and DAMON datasets it clearly surpasses the state of the art in accuracy, localization, and throughput (20x higher); the predicted contacts improve SAM-3D reconstruction quality and enable language-guided contact prediction without additional training. Conclusion: Pi-HOC delivers efficient, accurate, and scalable multi-human, multi-object contact understanding and offers a new approach to joint vision-language-geometry modeling. Abstract: Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained concurrent physical contact is particularly difficult. Existing semantic contact estimation methods are either limited to single-human settings or require object geometries (e.g., meshes) in addition to the input image. Current state-of-the-art leverages powerful VLM for category-level semantics but struggles with multi-human scenarios and scales poorly in inference. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction of all human-object pairs. Pi-HOC detects instances, creates dedicated human-object (HO) tokens for each pair, and refines them using an InteractionFormer. A SAM-based decoder then predicts dense contact on SMPL human meshes for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. We further demonstrate that predicted contacts improve SAM-3D image-to-mesh reconstruction via a test-time optimization algorithm and enable referential contact prediction from language queries without additional training.

[173] Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions

Ayce Idil Aytekin,Xu Chen,Zhengyang Shen,Thabo Beeler,Helge Rhodin,Rishabh Dabral,Christian Theobalt

Main category: cs.CV

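TL;DR: GraG is a fast and robust Gaussian-based method for reconstructing dynamic 3D hand-object interactions from monocular video. It tracks efficiently with a lightweight Sum-of-Gaussians representation and clearly outperforms prior methods in both speed and accuracy.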

Details Motivation: Existing methods rely on optimizing heavy neural representations, which is computationally expensive and slow; a lightweight method for dynamic hand-object reconstruction that balances speed, stability, and geometric fidelity is needed. Method: Grasp in Gaussians (GraG) works in two parts: (1) it initializes object pose and geometry with a video-adapted SAM3D pipeline and converts the result into a lightweight Sum-of-Gaussians (SoG) representation; (2) starting from an off-the-shelf monocular hand pose initialization, it refines hand motion using only 2D joint and depth alignment losses, avoiding per-frame fitting of a complex 3D hand model. Result: On public benchmarks, GraG is 6.4x faster than prior work, improves object reconstruction by 13.4%, and reduces the hand's per-joint position error by over 65%, while remaining temporally consistent on long sequences. Conclusion: A compact Gaussian representation combined with generative initialization and a simplified optimization strategy improves the accuracy and stability of hand-object reconstruction while running much faster, showing that a classical representation still has value in modern dynamic reconstruction. Abstract: We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on tracking the hand and the object efficiently, once initialized from pretrained large models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient and fast tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing hand's per-joint position error by over 65%.

[174] Task Alignment: A simple and effective proxy for model merging in computer vision

Pau de Jorge,César Roberto de Souza,Björn Michele,Mert Bülent Sarıyıldız,Philippe Weinzaepfel,Florent Perronnin,Diane Larlus,Yannis Kalantidis

Main category: cs.CV

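TL;DR: This paper proposes a task alignment proxy that speeds up hyperparameter selection when merging multi-task vision models, making model merging far more practical beyond CLIP classification (for example, for multi-task vision models with trainable, heterogeneous decoders).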

Details Motivation: Most model merging studies are confined to CLIP image classification, whereas real vision tasks often rely on trainable, heterogeneous decoders, which makes hyperparameter tuning based on downstream performance too costly to be practical. Method: The task alignment proxy serves as a lightweight substitute metric that avoids expensive decoder retraining and thus greatly accelerates the hyperparameter search; model merging is also extended to broader multi-task vision settings. Result: The proxy speeds up hyperparameter selection by orders of magnitude while retaining final performance, and the experiments show that model merging applies effectively to multi-task vision models beyond CLIP. Conclusion: The task alignment proxy makes model merging considerably more feasible in complex, realistic multi-task vision scenarios and moves it from narrow classification toward general multi-task modeling. Abstract: Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are restricted to image classification using CLIP, where different classification datasets define different tasks. In this work, our goal is to make model merging more practical and show its relevance on challenging scenarios beyond this specific setting. In most vision scenarios, different tasks rely on trainable and usually heterogeneous decoders. Differently from previous studies with frozen decoders, where merged models can be evaluated right away, the non-trivial cost of decoder training renders hyperparameter selection based on downstream performance impractical. To address this, we introduce the task alignment proxy, and show how it can be used to speed up hyperparameter selection by orders of magnitude while retaining performance. Equipped with the task alignment proxy, we extend the applicability of model merging to multi-task vision models beyond CLIP-based classification.

[175] Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection

Tianshuo Zhang,Haoyuan Zhang,Siran Peng,Weisong Zhao,Xiangyu Zhu,Zhen Lei

Main category: cs.CV

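TL;DR: This paper proposes a continual face forgery detection method that does not store historical forged images. It models the discrepancy between real and fake distributions and performs distribution-level replay, which significantly mitigates catastrophic forgetting and lowers the risk of privacy leakage.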

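Details Motivation: Existing continual face forgery detection methods rely on replaying historical samples to mitigate forgetting. Under a limited memory budget, however, sample replay covers forgery cues incompletely and may leak identities, while pseudo-sample generation depends heavily on past decision boundaries. The authors argue that the core role of replay is to restore the data distributions of previous forgery tasks. Method: The paper proposes Distribution-Discrepancy Condensation (DDC) and Manifold-Consistent Replay (MCR): DDC models the real-to-fake distribution discrepancy via a surrogate factorization in characteristic-function space and condenses it into a small set of discrepancy maps; MCR composes these maps with current real faces in a variance-preserving manner to perform distribution-level replay. Result: Under an extremely small memory budget, the framework consistently outperforms existing CFFD baselines and significantly mitigates catastrophic forgetting; a replay-level privacy analysis indicates a lower identity leakage risk than selection-based replay. Conclusion: Distribution-level replay is more fundamental, efficient, and privacy-friendly than sample-level or pseudo-sample replay, and offers a new paradigm for continual forgery detection. Abstract: Continual face forgery detection (CFFD) requires detectors to learn emerging forgery paradigms without forgetting previously seen manipulations. Existing CFFD methods commonly rely on replaying a small amount of past data to mitigate forgetting. Such replay is typically implemented either by storing a few historical samples or by synthesizing pseudo-forgeries from detector-dependent perturbations. Under strict memory budgets, the former cannot adequately cover diverse forgery cues and may expose facial identities, while the latter remains strongly tied to past decision boundaries. We argue that the core role of replay in CFFD is to reinstate the distributions of previous forgery tasks during subsequent training. To this end, we directly condense the discrepancy between real and fake distributions and leverage real faces from the current stage to perform distribution-level replay. Specifically, we introduce Distribution-Discrepancy Condensation (DDC), which models the real-to-fake discrepancy via a surrogate factorization in characteristic-function space and condenses it into a tiny bank of distribution discrepancy maps. We further propose Manifold-Consistent Replay (MCR), which synthesizes replay samples through variance-preserving composition of these maps with current-stage real faces, yielding samples that reflect previous-task forgery cues while remaining compatible with current real-face statistics.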
Operating under an extremely small memory budget and without directly storing raw historical face images, our framework consistently outperforms prior CFFD baselines and significantly mitigates catastrophic forgetting. Replay-level privacy analysis further suggests reduced identity leakage risk relative to selection-based replay.

[176] Distorted or Fabricated? A Survey on Hallucination in Video LLMs

Yiyang Huang,Yitian Zhang,Yizhou Wang,Mingyuan Zhang,Liang Shi,Huimin Zeng,Yun Fu

Main category: cs.CV

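TL;DR: This survey reviews hallucination in video large language models (Vid-LLMs). It proposes a systematic taxonomy with two types, dynamic distortion and content fabrication, analyzes causes, evaluation methods, and mitigation strategies, and points to future directions such as motion-aware visual encoders and counterfactual learning.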

Details Motivation: Hallucination in Vid-LLMs is severe and widespread and undermines model reliability, so a systematic review and classification is needed. Method: The survey builds a hallucination taxonomy (dynamic distortion and content fabrication), systematically reviews evaluation benchmarks, metrics, and mitigation methods, and analyzes root causes. Result: The survey presents a systematic taxonomy of hallucinations in Vid-LLMs, summarizes current evaluation and mitigation techniques, and identifies limited temporal representation capacity and insufficient visual grounding as the main causes. Conclusion: Hallucination is a key obstacle to practical Vid-LLMs and calls for coordinated progress in model architecture, training paradigms, and evaluation standards; the survey gives later work a unified framework and clear directions. Abstract: Despite significant progress in video-language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid-LLMs), referring to outputs that appear plausible yet contradict the content of the input video. This survey presents a comprehensive analysis of hallucinations in Vid-LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases. Building on this taxonomy, we review recent advances in the evaluation and mitigation of hallucinations, covering key benchmarks, metrics, and intervention strategies. We further analyze the root causes of dynamic distortion and content fabrication, which often result from limited capacity for temporal representation and insufficient visual grounding. These insights inform several promising directions for future work, including the development of motion-aware visual encoders and the integration of counterfactual learning techniques. This survey consolidates scattered progress to foster a systematic understanding of hallucinations in Vid-LLMs, laying the groundwork for building robust and reliable video-language systems. An up-to-date curated list of related works is maintained at https://github.com/hukcc/Awesome-Video-Hallucination .

[177] Boosting Visual Instruction Tuning with Self-Supervised Guidance

Sophia Sirko-Galouchenko,Monika Wysoczanska,Andrei Bursuc,Nicolas Thome,Spyros Gidaris

Main category: cs.CV

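TL;DR: This paper proposes V-GIFT, a lightweight method that adds a small number of self-supervised visual tasks expressed in natural language (such as rotation prediction and color matching) to visual instruction tuning. This strengthens how multimodal large language models (MLLMs) use visual information and improves fine-grained visual reasoning, with no extra annotations, architectural changes, or training stages.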

Details Motivation: Existing MLLMs perform poorly on vision-centric tasks mainly because visual information is under-used during instruction tuning, not because the visual representations are weak; many tasks can be partially solved from language priors alone, so the model learns to ignore visual cues. Method: Classical self-supervised pretext tasks (such as image rotation prediction, color matching, and cross-view correspondence) are recast as image-instruction-response triplets and used as visually grounded instruction-tuning data that makes up only 3-10% of the training set, with no change to the model architecture or training pipeline. Result: Across multiple models, training regimes, and vision-centric benchmarks, the method consistently improves performance, supporting the effectiveness of visually grounded self-supervised tasks for MLLM visual reasoning. Conclusion: Adjusting the instruction-tuning data distribution alone, by adding a small share of annotation-free, visually grounded self-supervised tasks, can noticeably strengthen MLLM visual reasoning and highlights the role of data construction in multimodal model capability. Abstract: Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT
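As a rough illustration of how a pretext task such as rotation prediction could be turned into an image-instruction-response triplet and mixed into instruction-tuning data at a small fraction, here is a minimal Python sketch. It is not the V-GIFT implementation: the function names, the instruction wording, the toy grid "image", and the mixing rule are all assumptions made for this example.

```python
import random


def rotate90(image, k):
    """Rotate a 2D grid (list of rows) clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def make_rotation_triplet(image, rng):
    """Build one hypothetical image-instruction-response triplet.

    The input image is assumed to be upright, so the answer is known
    without any human annotation.
    """
    k = rng.randrange(4)  # 0, 1, 2, or 3 quarter turns
    return {
        "image": rotate90(image, k),
        "instruction": (
            "By how many degrees clockwise has this image been rotated? "
            "Answer with 0, 90, 180, or 270."
        ),
        "response": str(90 * k),
    }


def mix_datasets(base, ssl, fraction, rng):
    """Add SSL triplets so they make up roughly `fraction` of the final mix."""
    n_ssl = round(fraction * len(base) / (1.0 - fraction))
    mixed = base + [ssl[i % len(ssl)] for i in range(n_ssl)]
    rng.shuffle(mixed)
    return mixed
```

Because the rotation angle is sampled by the data pipeline itself, the response cannot be guessed from language priors alone, which is the property the abstract emphasizes.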

[178] AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation

Yubraj Bhandari,Lavsen Dahal,Paul Segars,Joseph Y. Lo

Main category: cs.CV

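TL;DR: AbdomenGen is a controllable abdominal anatomy generation framework based on a sequential volume-conditioned diffusion model. It introduces the Volume Control Scalar (VCS) to decouple organ volume from body habitus, supports independent, disentangled multi-organ control, and performs well on geometric fidelity and on matching clinical distributions.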

Details Motivation: Existing computational phantom generation systems struggle to produce controlled, clinically meaningful anatomical variation, which limits how well medical imaging research can model individual anatomical differences. Method: AbdomenGen is a sequential volume-conditioned diffusion framework that generates organ masks one at a time, conditioned on the body mask and previously generated structures; it introduces the VCS, a standardized residual that decouples organ volume from body habitus and enables interpretable volume control. Result: Across 11 abdominal organs, the framework achieves high geometric fidelity (for example, liver Dice 0.83 ± 0.05), stable single-organ calibration over a VCS range of [-3, +3], and disentangled multi-organ control; for a hepatomegaly cohort selected from MERLIN, Wasserstein-based VCS selection reduces the distributional distance by 73.6%. Conclusion: AbdomenGen provides calibrated, distribution-aware abdominal anatomy generation suitable for controllable phantom construction and simulation studies. Abstract: Computational phantoms are widely used in medical imaging research, yet current systems to generate controlled, clinically meaningful anatomical variations remain limited. We present AbdomenGen, a sequential volume-conditioned diffusion framework for controllable abdominal anatomy generation. We introduce the Volume Control Scalar (VCS), a standardized residual that decouples organ size from body habitus, enabling interpretable volume modulation. Organ masks are synthesized sequentially, conditioning on the body mask and previously generated structures to preserve global anatomical coherence while supporting independent, multi-organ control. Across 11 abdominal organs, the proposed framework achieves strong geometric fidelity (e.g., liver Dice 0.83 ± 0.05), stable single-organ calibration over [-3, +3] VCS, and disentangled multi-organ modulation. To showcase clinical utility with a hepatomegaly cohort selected from MERLIN, Wasserstein-based VCS selection reduces distributional distance of training data by 73.6%. These results demonstrate calibrated, distribution-aware anatomical generation suitable for controllable abdominal phantom construction and simulation studies.
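The abstract describes the VCS only as "a standardized residual that decouples organ size from body habitus". One plausible reading, sketched below in Python, is to regress organ volume on a body-size measure and divide the residual by the residual standard deviation. The linear model, the single body-volume predictor, and all function names are assumptions for illustration; the paper may define the VCS differently.

```python
from statistics import mean, pstdev


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_vcs(body_volumes, organ_volumes):
    """Fit the assumed body-size trend and the spread of its residuals."""
    slope, intercept = fit_line(body_volumes, organ_volumes)
    residuals = [
        v - (intercept + slope * b)
        for b, v in zip(body_volumes, organ_volumes)
    ]
    return slope, intercept, pstdev(residuals)


def to_vcs(body_volume, organ_volume, model):
    """Standardized residual: 0 means 'typical organ size for this body'."""
    slope, intercept, sigma = model
    return (organ_volume - (intercept + slope * body_volume)) / sigma


def from_vcs(body_volume, vcs, model):
    """Invert to_vcs: the organ volume that a given VCS asks for."""
    slope, intercept, sigma = model
    return intercept + slope * body_volume + vcs * sigma
```

Under this reading, a VCS of +2 requests an organ two residual standard deviations larger than expected for the given body size, which is what makes the control scalar interpretable independently of habitus.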

[179] Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

Jaywon Koo,Jefferson Hernandez,Ruozhen He,Hanjie Chen,Chen Wei,Vicente Ordonez

Main category: cs.CV

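TL;DR: HypoExplore is a hypothesis-driven framework for neural architecture discovery that treats the design of visual recognition models as a scientific inquiry. It uses a large language model and evolutionary branching to generate, evaluate, and improve architectures, tracks and updates hypothesis confidence through a Trajectory Tree and a Hypothesis Memory Bank, and substantially improves lightweight vision models while generalizing across datasets and domains (such as medical imaging) with interpretable results.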

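Details Motivation: Conventional neural architecture search lacks interpretability and scientific grounding and does not accumulate design knowledge; a framework is needed that mimics the human research process, supporting hypothesis generation, validation, and iteration, to make architecture discovery more efficient and deepen understanding of the design space. Method: HypoExplore starts from a human-specified research direction, uses a large language model to generate hypotheses (balancing validated principles against uncertain ones), and iterates on architectures through evolutionary branching; a Trajectory Tree records architecture lineage and a Hypothesis Memory Bank maintains hypothesis confidence; multiple feedback agents analyze experimental results from different perspectives and jointly update the confidence scores. Result: On CIFAR-10, accuracy rises from an 18.91% root baseline to 94.11%; the approach generalizes to CIFAR-100 and Tiny-ImageNet, and independent runs on MedMNIST reach state-of-the-art performance; confidence scores become more predictive as evidence accumulates, and learned design principles transfer across independent evolutionary lineages. Conclusion: HypoExplore not only discovers strong lightweight architectures efficiently but also, through structured hypothesis management and evidence accumulation, moves neural network design toward an understandable, reusable, and evolvable scientific practice. Abstract: We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance.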
We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.

[180] See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

Himangi Mittal,Gaurav Mittal,Nelson Daniel Troncoso,Yu Hu

Main category: cs.CV

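TL;DR: This paper proposes a pixel-precise cursor localization approach for coding environments that iteratively corrects itself using multi-turn visual feedback, significantly improving GUI action precision and task success rate.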

Details Motivation: Existing GUI agents rely on single-shot coordinate prediction and lack an error-correction mechanism in dense IDE interfaces, which makes the required sub-pixel precision hard to achieve. Method: The agent uses a closed-loop iterative refinement mechanism, repeatedly adjusting the cursor position based on visual feedback from its previous attempts to reach pixel-precise localization. Result: Evaluated on GPT-5.4, Claude, and Qwen, multi-turn refinement significantly outperforms single-shot models in both click precision and task success rate. Conclusion: Iterative visual reasoning is a key component for building highly reliable software engineering agents. Abstract: Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
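The closed-loop idea in the abstract (act, look at the result, correct, repeat) can be sketched generically as follows. This is only a toy control loop, not the report's agent: `observe_error` is a hypothetical stand-in for the model's visual-feedback step, and the stopping rule and parameter names are assumptions.

```python
def refine_cursor(first_guess, observe_error, max_turns=5, tolerance=1.0):
    """Closed-loop grounding: place the cursor, inspect feedback, correct.

    observe_error(x, y) is a hypothetical callback standing in for the
    visual feedback step. It returns the (dx, dy) displacement the agent
    estimates between its current cursor position and the target.
    """
    x, y = first_guess
    history = [(x, y)]
    for _ in range(max_turns):
        dx, dy = observe_error(x, y)
        if (dx * dx + dy * dy) ** 0.5 <= tolerance:
            break  # estimated error is small enough to commit the click
        x, y = x + dx, y + dy
        history.append((x, y))
    return (x, y), history


def make_underestimating_observer(target, gain=0.5):
    """Toy feedback model that sees only a fraction of the true offset."""
    tx, ty = target
    return lambda x, y: (gain * (tx - x), gain * (ty - y))
```

Even with imperfect feedback (here the toy observer reports only half of the true offset), each turn shrinks the remaining error, which is the advantage a single-shot prediction cannot offer.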

[181] Representation geometry shapes task performance in vision-language modeling for CT enterography

Cristian Minoccheri,Emily Wittrup,Kayvan Najarian,Ryan Stidham

Main category: cs.CV

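TL;DR: This paper presents the first study of vision-language transfer learning for abdominal CT enterography. Mean pooling works better for disease classification and attention pooling for cross-modal retrieval; multi-window RGB encoding outperforms multiplanar sampling; retrieval-augmented generation (RAG) clearly improves report generation.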

Details Motivation: CT enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), but the representations that best support its automated analysis are not yet known. Method: Vision-language models are fine-tuned with LoRA to compare slice-embedding aggregation schemes (mean pooling vs. attention pooling) and multi-window RGB encoding vs. multiplanar sampling, with a three-teacher pseudolabel framework and retrieval-augmented generation (RAG) for report generation. Result: Mean pooling reaches 59.2% three-class accuracy; attention pooling reaches 0.235 text-to-image MRR; multi-window RGB encoding performs best; RAG scores 7-14 percentage points above the chance baseline on severity prediction and improves ordinal MAE from 0.98 to 0.80-0.89. Conclusion: The study provides the first baselines for the underexplored CT enterography modality and practical guidance for building vision-language systems for volumetric medical imaging. Abstract: Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7-14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80-0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations.
Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
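To make "multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels" concrete, the following Python sketch applies standard CT windowing (clip to a center/width range, rescale to 0..255) three times and stacks the results as channels. The specific (center, width) pairs are placeholder assumptions, not the windows used in the paper.

```python
def apply_window(hu, center, width):
    """Clip a Hounsfield Unit value to a window and scale it to 0..255."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    clipped = min(max(hu, lo), hi)
    return round(255 * (clipped - lo) / (hi - lo))


# Hypothetical (center, width) pairs in HU, one per RGB channel.
# These are illustrative choices, not values taken from the paper.
WINDOWS = ((40, 400), (50, 150), (400, 1800))


def slice_to_rgb(hu_slice, windows=WINDOWS):
    """Encode one CT slice (2D list of HU values) as rows of RGB tuples."""
    return [
        [tuple(apply_window(v, c, w) for c, w in windows) for v in row]
        for row in hu_slice
    ]
```

Each channel then carries a different tissue-contrast view of the same slice, so a pretrained RGB image encoder can see all three at once without any additional spatial sampling.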

[182] Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns

Baris Sarper Tezcan,Hrishikesh Viswanath,Rubab Saher,Daniel Aliaga

Main category: cs.CV

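TL;DR: This paper proposes a hybrid framework that combines a predictive forward model with a diffusion-based generative inverse model to solve an inverse problem in urban heat management: generating diverse, physically plausible spatial vegetation configurations for a target temperature change.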

Details Motivation: Traditional methods only predict land surface temperature in the forward direction and cannot answer the inverse question of how vegetation should be arranged to reach a given cooling target; this problem is inherently underdetermined (many solutions exist), and regression or deterministic neural networks cannot model that diversity. Method: The conflated inverse modeling framework couples a predictive forward model (capturing the vegetation-temperature relationship) with a diffusion-based generative model that is conditioned on the target temperature change and generates diverse image-based vegetation patterns. Result: Under data-scarce conditions, the framework generates multiple physically plausible vegetation layouts with controllable thermal effect and supports temperature targets absent from the training data; experiments support its effectiveness and controllability for urban heat mitigation. Conclusion: The work brings controllable generative modeling to urban climate adaptation and offers a new approach to planning urban green space for heat resilience, one that embraces the inherent diversity of the inverse problem instead of avoiding it. Abstract: Urban areas are increasingly vulnerable to thermal extremes driven by rapid urbanization and climate change. Traditionally, thermal extremes have been monitored using Earth-observing satellites and numerical modeling frameworks. For example, land surface temperature derived from Landsat or Sentinel imagery is commonly used to characterize surface heating patterns. These approaches operate as forward models, translating radiative observations or modeled boundary conditions into estimates of surface thermal states. While forward models can predict land surface temperature from vegetation and urban form, the inverse problem of determining spatial vegetation configurations that achieve a desired regional temperature shift remains largely unexplored. This task is inherently underdetermined, as multiple spatial vegetation patterns can yield similar aggregated temperature responses. Conventional regression and deterministic neural networks fail to capture this ambiguity and often produce averaged solutions, particularly under data-scarce conditions. We propose a conflated inverse modeling framework that combines a predictive forward model with a diffusion-based generative inverse model to produce diverse, physically plausible image-based vegetation patterns conditioned on specific temperature goals. Our framework maintains control over thermal outcomes while enabling diverse spatial vegetation configurations, even when such combinations are absent from training data.
Altogether, this work introduces a controllable inverse modeling approach for urban climate adaptation that accounts for the inherent diversity of the problem. Code is available at the GitHub repository.

[183] Visual Preference Optimization with Rubric Rewards

Ya-Qi Yu,Fangyu Hong,Xiangyang Qu,Hao Wang,Gaojie Wu,Qiaoyu Luo,Nuo Xu,Huixin Wang,Wuheng Xu,Yongxin Liao,Zihao Chen,Haonan Li,Ziming Li,Dezhi Peng,Minghui Liao,Jihao Wu,Haoyu Ren,Dandan Tu

Main category: cs.CV

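TL;DR: This paper proposes rDPO, a multimodal preference optimization framework that uses checklist-style rubrics tailored to each image-instruction pair, significantly improving reward modeling and downstream performance on visual reasoning tasks.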

Details Motivation: Existing DPO pipelines rely on off-policy perturbations or coarse outcome-based signals, which do not yield the high-quality preference data that fine-grained visual reasoning requires. Method: rDPO builds an instance-specific rubric (with essential and additional criteria) for each image-instruction pair, constructs the instruction-rubric pool offline, and reuses it when constructing on-policy data, combining rubric-guided reward modeling with preference optimization. Result: On reward modeling benchmarks, rubric-based prompting greatly improves a 30B-A3B judge and brings it close to GPT-5.4; on downstream benchmarks, rubric-based filtering raises the macro average from 81.14 to 82.69, whereas outcome-based filtering lowers it to 75.82; on a comprehensive scalability benchmark, rDPO scores 61.01, ahead of the style-constrained baseline (52.36) and the base model (59.48). Conclusion: Combining on-policy data construction with instance-specific, criterion-level feedback effectively improves multimodal (and especially visual) preference optimization. Abstract: The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
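A checklist-style rubric with essential and additional criteria could be turned into a score, and scored on-policy samples into a DPO preference pair, roughly as in the Python sketch below. The scoring rule (any failed essential criterion gives zero; additional criteria add a bonus), the `min_gap` filter, and all names are assumptions for illustration; the abstract does not specify how rDPO aggregates rubric checks.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    essential: bool


def rubric_score(checks, rubric):
    """Score one response; checks[i] is True when a judge says rubric[i] is met."""
    essential = [ok for ok, c in zip(checks, rubric) if c.essential]
    additional = [ok for ok, c in zip(checks, rubric) if not c.essential]
    if not all(essential):
        return 0.0  # assumed rule: failing any essential criterion zeroes the score
    bonus = sum(additional) / len(additional) if additional else 0.0
    return 0.5 + 0.5 * bonus


def build_preference_pair(responses, scores, min_gap=0.25):
    """Pick the best and worst on-policy samples as (chosen, rejected)."""
    ranked = sorted(zip(scores, responses), key=lambda p: p[0])
    (low, rejected), (high, chosen) = ranked[0], ranked[-1]
    if high - low < min_gap:
        return None  # gap too small to give a clear preference signal
    return chosen, rejected
```

Because the rubric is attached to the image-instruction pair and not to any particular response, the same rubric can score samples from whichever policy is currently being trained, which is what allows the offline pool to be reused for on-policy data.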

[184] Generative Refinement Networks for Visual Synthesis

Jian Han,Jinlai Liu,Jiahuan Wang,Bingyue Peng,Zehuan Yuan

Main category: cs.CV

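TL;DR: This paper proposes Generative Refinement Networks (GRN), a new visual generation paradigm. It addresses the discretization bottleneck of autoregressive models with Hierarchical Binary Quantization (HBQ) and adds a global refinement mechanism and an entropy-guided sampling strategy, enabling complexity-aware, high-quality, efficient generation and setting new records on ImageNet.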

Details Motivation: Diffusion models are computationally inefficient and spend a fixed amount of compute regardless of content complexity; autoregressive models are inherently complexity-aware but are held back by lossy discrete tokenization and error accumulation. Method: GRN has three core components: (1) a theoretically near-lossless Hierarchical Binary Quantization (HBQ) that replaces conventional discrete tokenization; (2) an autoregressive generation mechanism with global, progressive refinement built on HBQ's latent space; (3) an entropy-guided sampling strategy with an adaptive number of steps. Result: On ImageNet, GRN reaches 0.56 rFID for image reconstruction and 1.81 gFID for class-conditional generation; it also scales to text-to-image and text-to-video generation, outperforming baselines of equivalent scale. Conclusion: GRN is a visual generation framework that balances quality, efficiency, and complexity awareness, and offers a viable path for reviving autoregressive modeling in high-dimensional visual generation. Abstract: While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless of different complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these issues. At its core, GRN addresses the discrete tokenization bottleneck through a theoretically near-lossless Hierarchical Binary Quantization (HBQ), achieving a reconstruction quality comparable to continuous counterparts. Built upon HBQ's latent space, GRN fundamentally upgrades AR generation with a global refinement mechanism that progressively perfects and corrects artworks -- like a human artist painting. Besides, GRN integrates an entropy-guided sampling strategy, enabling complexity-aware, adaptive-step generation without compromising visual quality. On the ImageNet benchmark, GRN establishes new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). We also scale GRN to more challenging text-to-image and text-to-video generation, delivering superior performance on an equivalent scale. We release all models and code to foster further research on GRN.

[185] Lyra 2.0: Explorable Generative 3D Worlds

Tianchang Shen,Sherwin Bahmani,Kai He,Sangeetha Grama Srinivasan,Tianshi Cao,Jiawei Ren,Ruilong Li,Zian Wang,Nicholas Sharp,Zan Gojcic,Sanja Fidler,Jiahui Huang,Huan Ling,Jun Gao,Xuanchi Ren

Main category: cs.CV

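TL;DR: This paper proposes Lyra 2.0, a framework that tackles spatial forgetting and temporal drifting to generate long-horizon, 3D-consistent video, which in turn improves the reconstruction quality of large-scale explorable 3D scenes.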

Details Motivation: Existing video generation models degrade quickly on long trajectories with large viewpoint changes and location revisits, owing to two bottlenecks: spatial forgetting (previously observed regions cannot be recalled) and temporal drifting (accumulated synthesis errors distort geometry and appearance). Method: Lyra 2.0 uses two key techniques: (1) per-frame 3D geometry is used only for information routing (retrieving past frames and establishing dense correspondences), while appearance is still synthesized by the generative prior, which mitigates spatial forgetting; (2) training with self-augmented histories that include the model's own degraded outputs teaches the model to correct temporal drift instead of propagating it. Result: The approach yields substantially longer 3D-consistent video trajectories, which are used to fine-tune feed-forward reconstruction models that recover high-quality large-scale 3D scenes. Conclusion: Lyra 2.0 offers a scalable and robust generative reconstruction approach for persistent, explorable large-scale 3D worlds that combines visual realism with 3D consistency. Abstract: Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it.
Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.