cs.CL

[1] TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models

Jiaquan Zhang,Qigan Sun,Chaoning Zhang,Xudong Wang,Zhenzhen Huang,Yitian Zhou,Pengcheng Zheng,Chi-lok Andy Tai,Sung-Ho Bae,Zeyu Ma,Caiyan Qin,Jinyu Guo,Yang Yang,Hengtao Shen

Main category: cs.CL

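TL;DR: This paper proposes a topology-based method for optimizing reasoning chains. It uses persistent homology to map the effective structural features of multi-round reasoning (e.g., ToT, GoT) onto single-round CoT, and designs a Topological Optimization Agent that automatically diagnoses and repairs logical flaws in CoT chains, improving reasoning accuracy while keeping single-round efficiency.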

Motivation: Chain-of-Thought (CoT) is efficient but its chains contain logical gaps; multi-round methods such as ToT and GoT perform strongly but are too computationally expensive for practical use. A compromise that balances accuracy and efficiency is needed.

Method: Persistent homology maps CoT, ToT, and GoT into a unified topological space, from which the topological patterns of effective reasoning are extracted. A Topological Optimization Agent diagnoses where a CoT chain deviates from the desired topological structure and generates targeted repair strategies.

Result: Across multiple datasets, the method approaches ToT/GoT in reasoning accuracy and clearly outperforms standard CoT while retaining single-round generation efficiency, achieving "single-round generation with multi-round intelligence".

Conclusion: Topological modeling can characterize and transfer the structural advantages of different reasoning paradigms. The framework offers a new approach to lightweight enhancement of LLM reasoning, narrowing the gap between efficiency and capability.

Abstract: Enhancing the reasoning capability of large language models (LLMs) remains a core challenge in natural language processing. The Chain-of-Thought (CoT) paradigm dominates practical applications for its single-round efficiency, yet its reasoning chains often exhibit logical gaps. While multi-round paradigms like Graph-of-Thoughts (GoT), Tree-of-Thoughts (ToT), and Atom of Thought (AoT) achieve strong performance and reveal effective reasoning structures, their high cost limits practical use. To address this problem, this paper proposes a topology-based method for optimizing reasoning chains. The framework embeds essential topological patterns of effective reasoning into the lightweight CoT paradigm. Using persistent homology, we map CoT, ToT, and GoT into a unified topological space to quantify their structural features. On this basis, we design a unified optimization system: a Topological Optimization Agent diagnoses deviations in CoT chains from desirable topological characteristics and simultaneously generates targeted strategies to repair these structural deficiencies. Compared with multi-round reasoning methods like ToT and GoT, experiments on multiple datasets show that our approach offers a superior balance between reasoning accuracy and efficiency, showcasing a practical solution to "single-round generation with multi-round intelligence".

[2] The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse

Julian Coda-Forno,Jane X. Wang,Arslan Chaudhry

Main category: cs.CL

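TL;DR: This paper studies the failure of autoregressive language models on fact reversal (e.g., training on 'A > B' but failing on 'B < A'), known as the reversal curse. Objectives with bidirectional supervision (such as MLM or decoder-only masking-based training) mitigate the problem, but mechanistic analysis indicates that models do not form a unified, direction-agnostic representation of a fact. Instead they store forward and reverse facts separately, with different indexing geometry under different training objectives, which suggests that performance gains need not reflect genuine conceptual generalization.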

Motivation: Autoregressive language models perform poorly on fact-reversal tasks (the reversal curse). The paper asks why objectives with bidirectional supervision mitigate the problem and what mechanism lies behind the improvement.

Method: MLM and decoder-only masking-based training are compared on four reversal benchmarks. A mechanistic study using representation distances and linear probes tests whether the models form a direction-agnostic representation of a fact.

Result: Reversal accuracy depends on a training signal that explicitly makes the source entity a prediction target. There is little evidence for a single direction-agnostic representation of a fact; forward and reverse facts are stored as distinct entries, and the indexing geometry differs between MLM and decoder-only training.

Conclusion: Objective-level changes such as adding bidirectional supervision can improve reversal performance without necessarily producing the latent conceptual generalization one might expect, so performance gains should not be equated with a semantically unified representation.

Abstract: The reversal curse describes a failure of autoregressive language models to retrieve a fact in reverse order (e.g., training on "A > B" but failing on "B < A"). Recent work shows that objectives with bidirectional supervision (e.g., bidirectional attention or masking-based reconstruction for decoder-only models) can mitigate the reversal curse. We extend this evaluation to include a vanilla masked language modeling (MLM) objective and compare it to decoder-only masking-based training across four reversal benchmarks and then provide a minimal mechanistic study of how these objectives succeed. We show that reversal accuracy requires training signal that explicitly makes the source entity a prediction target, and we find little evidence that success corresponds to a single direction-agnostic representation of a fact. Instead, representation distances and linear probes are consistent with storing forward and reverse directions as distinct entries, with different indexing geometry for MLM versus decoder-only masking-based training. Our results caution that objective-level "fixes" can improve reversal behavior without necessarily inducing the kind of latent generalization one might expect from a unified concept.

[3] Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

Mohammad Reza Ghasemi Madani,Soyeon Caren Han,Shuo Yang,Jey Han Lau

Main category: cs.CL

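TL;DR: This paper proposes Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that mitigates the cognitive load and answer instability that distractors cause for large language models on multiple-choice questions. It reconstructs the question using only the plausible options, which improves reasoning stability and interpretability and substantially boosts chain-of-thought performance on multiple benchmarks.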

Motivation: In multiple-choice evaluation, large language models are easily swayed by plausible distractors, which makes their answers unstable and scatters their attention.

Method: Inclusion-of-Thoughts (IoT) is a progressive self-filtering strategy that reconstructs the question from only the plausible options and explicitly records the filtering process to improve interpretability.

Result: Chain-of-thought performance improves substantially on arithmetic, commonsense reasoning, and educational benchmarks, with minimal computational overhead.

Conclusion: IoT mitigates the preference instability induced by distractors and improves the robustness and transparency of model decisions. It is a lightweight, efficient, and interpretable way to improve multiple-choice reasoning.

Abstract: Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that is designed to mitigate this cognitive load (i.e., instability of model preferences under the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method operates to reconstruct the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model's internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model's decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.

[4] Document Optimization for Black-Box Retrieval via Reinforcement Learning

Omri Uzan,Ron Polonsky,Douwe Kiela,Christopher Potts

Main category: cs.CL

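TL;DR: This paper recasts document expansion as a document optimization problem: a language model or vision-language model is fine-tuned with GRPO, using the retriever's ranking improvements as rewards, to optimize document representations. This improves single-vector, multi-vector, and lexical retrievers while requiring only black-box access to retrieval ranks. The method substantially improves nDCG on code retrieval and visual document retrieval and lets small retrieval models surpass larger ones.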

Motivation: Traditional document expansion often degrades modern retrievers by introducing noise. The paper aims to improve a range of retrievers through learned document transformations without adding query-time cost.

Method: Document expansion is modeled as document optimization. Using the GRPO reinforcement learning framework, with the target retriever's ranking improvement (e.g., nDCG gain) as the reward signal, a language model (LM) or vision-language model (VLM) is fine-tuned to produce document representations that better match the query distribution. The approach supports black-box retrievers and can be combined with fine-tuning when retriever weights are accessible.

Result: On code retrieval and visual document retrieval (VDR), document optimization raises nDCG@5 for the small text-embedding-3-small model from 58.7 to 66.8 and from 53.3 to 57.6 respectively, slightly surpassing the larger and more expensive text-embedding-3-large. Combined with fine-tuning, Jina-ColBERT-V2 improves from 55.8 to 63.3 on VDR and from 48.6 to 61.8 on code retrieval.

Conclusion: Document optimization is a general, efficient, plug-and-play way to enhance retrieval. It improves a variety of retrievers, is especially useful for lightweight deployment, is often competitive with retriever fine-tuning, and in most settings is complementary to it.

Abstract: Document expansion is a classical technique for improving retrieval quality, and is attractive since it shifts computation offline, avoiding additional query-time processing. However, when applied to modern retrievers, it has been shown to degrade performance, often introducing noise that obfuscates the discriminative signal. We recast document expansion as a document optimization problem: a language model or a vision language model is fine-tuned to transform documents into representations that better align with the expected query distribution under a target retriever, using GRPO with the retriever's ranking improvements as rewards. This approach requires only black-box access to retrieval ranks, and is applicable across single-vector, multi-vector and lexical retrievers. We evaluate our approach on code retrieval and visual document retrieval (VDR) tasks. We find that learned document transformations yield retrieval gains and in many settings enable smaller, more efficient retrievers to outperform larger ones. For example, applying document optimization to the OpenAI text-embedding-3-small model improves nDCG@5 on code (58.7 to 66.8) and VDR (53.3 to 57.6), even slightly surpassing the 6.5x more expensive OpenAI text-embedding-3-large model (66.3 on code; 57.0 on VDR). When retriever weights are accessible, document optimization is often competitive with fine-tuning, and in most settings their combination performs best, improving Jina-ColBERT-V2 from 55.8 to 63.3 on VDR and from 48.6 to 61.8 on code retrieval.
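The reward described above is a ranking improvement under the target retriever. As a minimal sketch of what such a signal could look like, the following computes nDCG at a cutoff of 5 from a list of relevance grades in ranked order and takes the difference before and after a document is rewritten. The function names, the linear gain, and the plain difference are illustrative assumptions; the abstract does not specify the exact reward shaping used with GRPO.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """nDCG at cutoff k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def ranking_reward(before, after, k=5):
    """Hypothetical reward: change in nDCG after the document is rewritten."""
    return ndcg_at_k(after, k) - ndcg_at_k(before, k)

# The relevant document moves from rank 3 to rank 1 after rewriting.
before = [0, 0, 1, 0, 0]
after = [1, 0, 0, 0, 0]
reward = ranking_reward(before, after)
```

Only the ranked relevance grades are needed here, which is consistent with the black-box setting: no retriever weights or scores enter the computation.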

[5] Multilingual Language Models Encode Script Over Linguistic Structure

Aastha A K Verma,Anwoy Chatterjee,Mehak Gupta,Tanmoy Chakraborty

Main category: cs.CL

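TL;DR: This paper studies how language representations are organized in multilingual language models. It finds that they are shaped mainly by orthography (for example, romanization) rather than by abstract typological features; typological structure gradually emerges in deeper layers, but the models do not form a unified interlingua.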

Motivation: Multilingual language models process many languages in a shared parameter space, but how their internal representations are organized is unclear: by abstract language identity (e.g., language family, word order) or by surface-form cues (e.g., script, spelling)?

Method: Language-associated neurons in Llama-3.2-1B and Gemma-2-2B are analyzed with the LAPE metric, and activations are decomposed with sparse autoencoders. Perturbation experiments (romanization, word-order shuffling), probing, and causal interventions test how sensitive the representations are to surface form and to typological features.

Result: Language-associated neurons depend strongly on orthography (romanization yields near-disjoint representations), while word-order perturbation has little effect. Typological structure is easier to probe in deeper layers. Generation is most sensitive to neurons that are robust to surface perturbations, rather than to neurons identified by typological alignment alone.

Conclusion: Multilingual language models organize representations around surface form. Linguistic abstraction emerges gradually with depth but does not converge to a unified cross-lingual intermediate representation (interlingua).

Abstract: Multilingual language models (LMs) organize representations for typologically and orthographically diverse languages into a shared parameter space, yet the nature of this internal organization remains elusive. In this work, we investigate which linguistic properties - abstract language identity or surface-form cues - shape multilingual representations. Focusing on compact, distilled models where representational trade-offs are explicit, we analyze language-associated units in Llama-3.2-1B and Gemma-2-2B using the Language Activation Probability Entropy (LAPE) metric, and further decompose activations with Sparse Autoencoders. We find that these units are strongly conditioned on orthography: romanization induces near-disjoint representations that align with neither native-script inputs nor English, while word-order shuffling has limited effect on unit identity. Probing shows that typological structure becomes increasingly accessible in deeper layers, while causal interventions indicate that generation is most sensitive to units that are invariant to surface-form perturbations rather than to units identified by typological alignment alone. Overall, our results suggest that multilingual LMs organize representations around surface form, with linguistic abstraction emerging gradually without collapsing into a unified interlingua.
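The abstract names the LAPE metric without giving its formula. The sketch below assumes the usual reading of the name: for one unit, take the probability that it activates on text from each language, normalise those probabilities into a distribution over languages, and compute the entropy. Low entropy then marks a unit that fires for only a few languages. The function name and the example numbers are hypothetical.

```python
import math

def lape(activation_probs):
    """Entropy of one unit's per-language activation probabilities.

    activation_probs maps a language code to the fraction of tokens in that
    language on which the unit is active. This is an assumed reading of
    LAPE; the paper itself may differ in details.
    """
    total = sum(activation_probs.values())
    if total == 0:
        raise ValueError("unit never activates; entropy is undefined")
    dist = [p / total for p in activation_probs.values()]
    return sum(-p * math.log(p) for p in dist if p > 0)

# Hypothetical units: one fires only on Hindi, one fires equally everywhere.
hindi_specific = lape({"en": 0.0, "hi": 0.4, "zh": 0.0})
language_agnostic = lape({"en": 0.3, "hi": 0.3, "zh": 0.3})
```

Under this reading, ranking units by ascending score and keeping the lowest-entropy ones would give a candidate set of language-associated units.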

[6] MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Zhengqing Yuan,Hanchi Sun,Lichao Sun,Yanfang Ye

Main category: cs.CL

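TL;DR: MegaTrain is a memory-centric system that stores model parameters and optimizer states in CPU memory and uses the GPU only as a compute engine. With a pipelined double-buffered execution engine and stateless layer templates, it efficiently trains 100B+ parameter large language models on a single GPU.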

Motivation: Traditional GPU-centric systems are limited by device memory when training very large models. The goal is to break the single-GPU memory bottleneck and enable full-precision training of 100B+ parameter models on one GPU.

Method: Parameters and optimizer states are stored in host memory, and the GPU is used only for transient computation. A pipelined double-buffered execution engine overlaps parameter prefetching, computation, and gradient offloading. Stateless layer templates with dynamically bound weights replace persistent autograd graphs.

Result: On a single H200 GPU with 1.5TB of host memory, MegaTrain reliably trains models of up to 120B parameters. When training 14B models it achieves 1.84× the throughput of DeepSpeed ZeRO-3 with CPU offloading. It also supports 7B model training with a 512k-token context on a single GH200.

Conclusion: MegaTrain demonstrates that a memory-centric architecture is feasible and efficient for single-GPU training of large models, offering a new approach for resource-constrained settings.

Abstract: We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.
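The overlap of prefetching and computation can be illustrated with a small CPU-only analogy. The sketch below is not MegaTrain's engine: it uses a background thread where the paper uses CUDA streams, it omits gradient offloading, and `fetch` and `compute` are hypothetical callables standing in for the host-to-device copy and the layer computation. It only shows the scheduling idea of requesting layer i+1 while layer i is running.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layers(num_layers, fetch, compute):
    """Run layers in order, prefetching layer i+1 while layer i computes."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(fetch, 0)        # fill the first buffer
        for i in range(num_layers):
            weights = pending.result()               # wait for this layer's weights
            if i + 1 < num_layers:
                # Start loading the next layer into the other buffer.
                pending = prefetcher.submit(fetch, i + 1)
            outputs.append(compute(i, weights))      # runs while the prefetch proceeds
    return outputs

# Toy stand-ins: "weights" are integers and "compute" adds one.
results = run_layers(3, fetch=lambda i: i * 10, compute=lambda i, w: w + 1)
```

At most two sets of weights are alive at once (the one in use and the one being fetched), which is the property that keeps persistent device state small.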

[7] RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World

Hanbing Liu,Lang Cao,Yang Li

Main category: cs.CL

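TL;DR: This paper introduces a new benchmark for evaluating how large language models (LLMs) adapt under continuous knowledge drift, and proposes Chronos, a time-aware retrieval method, to improve temporally consistent understanding.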

Motivation: An LLM's knowledge is fixed at pretraining time and struggles to track real-world knowledge that evolves over time, leading to outdated knowledge and temporally inconsistent reasoning. Existing update methods are rarely evaluated systematically in realistic, chronologically evolving settings.

Method: A benchmark of real-world dynamic events is built from time-stamped evidence. Chronos, a time-aware retrieval baseline that needs no additional training, organizes retrieved evidence into an Event Evolution Graph.

Result: The benchmark shows that most existing methods, including vanilla RAG, perform poorly under continuous knowledge drift, exposing key limitations such as catastrophic forgetting and temporal inconsistency. Chronos improves temporal consistency.

Conclusion: The work lays a foundation for analyzing and advancing LLM adaptation to continuous knowledge drift in realistic, dynamic settings.

Abstract: Large language models (LLMs) acquire most of their knowledge during pretraining, which ties them to a fixed snapshot of the world and makes adaptation to continuously evolving knowledge challenging. As facts, entities, and events change over time, models may experience continuous knowledge drift, resulting not only in outdated predictions but also in temporally inconsistent reasoning. Although existing approaches, such as continual finetuning, knowledge editing, and retrieval-augmented generation (RAG), aim to update or supplement model knowledge, they are rarely evaluated in settings that reflect chronological, evolving, and real-world knowledge evolution. In this work, we introduce a new benchmark of real-world dynamic events, constructed from time-stamped evidence that captures how knowledge evolves over time, which enables systematic evaluation of model adaptation under continuous knowledge drift. The benchmark reveals that most existing methods, including vanilla RAG and several learning-based approaches, struggle under this setting, exposing critical limitations such as catastrophic forgetting and temporal inconsistency. To mitigate these limitations, we propose a time-aware retrieval baseline, Chronos, which progressively organizes retrieved evidence into an Event Evolution Graph to enable more temporally consistent understanding in LLMs without additional training. Overall, this work provides a foundation for analyzing and advancing LLM adaptation to continuous knowledge drift in realistic settings.

[8] $π^2$: Structure-Originated Reasoning Data Improves Long-Context Reasoning Ability of Large Language Models

Quyet V. Do,Thinh Pham,Nguyen Nguyen,Sha Li,Pratibha Zunjare,Tu Vu

Main category: cs.CL

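TL;DR: This paper proposes $π^2$, which builds a high-quality long-context reasoning dataset by extracting and expanding tables from Wikipedia, generating multi-hop analytical reasoning questions with automatically verified answers, and back-translating structured reasoning traces given realistic web-search context. The data substantially improves LLM performance on several long-context reasoning benchmarks.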

Motivation: Improving LLMs on long-context reasoning requires high-quality, realistic, multi-hop reasoning data, whereas existing data suffers from low quality, high annotation cost, or a lack of realistic context.

Method: $π^2$ has three steps: 1) extract and expand tables from Wikipedia; 2) from the tables and their context, generate multi-hop analytical QA pairs and automatically verify the answers through dual-path code execution; 3) given realistic web-search context, back-translate the answers into structured step-by-step reasoning traces. The resulting data is used for supervised fine-tuning of gpt-oss-20b and Qwen3-4B-Instruct-2507.

Result: On four long-context reasoning benchmarks and the authors' $π^2$-Bench, average absolute accuracy improves by +4.3% and +2.7% respectively. The dataset also supports self-distillation: gpt-oss-20b improves its average performance by +4.4% using its own $π^2$ reasoning traces.

Conclusion: $π^2$ offers a scalable, automated, and realistic way to construct high-quality reasoning data that improves model performance and supports self-distillation. The code, data, and models are open-source.

Abstract: We study a pipeline that curates reasoning data from initial structured data for improving long-context reasoning in large language models (LLMs). Our approach, $π^2$, constructs high-quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia, 2) from the collected tables and relevant context, generating realistic and multi-hop analytical reasoning questions whose answers are automatically determined and verified through dual-path code execution, and 3) back-translating step-by-step structured reasoning traces as solutions of QA pairs given realistic web-search context. Supervised fine-tuning with gpt-oss-20b and Qwen3-4B-Instruct-2507 on $π^2$ yields consistent improvements across four long-context reasoning benchmarks and our analogous $π^2$-Bench, with average absolute accuracy gains of +4.3% and +2.7% respectively. Notably, our dataset facilitates self-distillation, where gpt-oss-20b even improves its average performance by +4.4% with its own reasoning traces, demonstrating $π^2$'s usefulness. Our code, data, and models are open-source at https://github.com/vt-pi-squared/pi-squared.

[9] SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning

Berny Kabalisa

Main category: cs.CL

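TL;DR: SenseAI is a financial sentiment dataset that combines human feedback with domain-specific structure. It includes reasoning chains, confidence scores, human correction signals, and real-world market outcomes, and is intended to support LLM fine-tuning and alignment under the RLHF paradigm.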

Motivation: Existing financial sentiment datasets do not systematically capture model reasoning, confidence, or human feedback, which makes it hard to develop interpretable, well-calibrated financial AI.

Method: SenseAI is an HITL-validated dataset of 1,439 labelled data points covering 40 US-listed equities and 13 financial data categories, annotated along four dimensions: reasoning chains, confidence scores, human corrections, and market outcomes. Behavioral analysis identifies patterns in model errors.

Result: The analysis identifies systematic error patterns such as Latent Reasoning Drift, indicating that LLM errors in financial reasoning are predictable and correctable.

Conclusion: Structured human-in-the-loop data such as SenseAI can improve the reliability, interpretability, and alignment of financial LLMs, and can inform model evaluation and iteration.

Abstract: We introduce SenseAI, a human-in-the-loop (HITL) validated financial sentiment dataset designed to capture not only model outputs but the full reasoning process behind them. Unlike existing resources, SenseAI incorporates reasoning chains, confidence scores, human correction signals, and real-world market outcomes, providing a structure aligned with Reinforcement Learning from Human Feedback (RLHF) paradigms. The dataset consists of 1,439 labelled data points across 40 US-listed equities and 13 financial data categories, enabling direct integration into modern LLM fine-tuning pipelines. Through analysis, we identify several systematic patterns in model behavior, including a novel failure mode we term Latent Reasoning Drift, where models introduce information not grounded in the input, as well as consistent confidence miscalibration and forward projection tendencies. These findings suggest that LLM errors in financial reasoning are not random but occur within a predictable and correctable regime, supporting the use of structured HITL data for targeted model improvement. We discuss implications for financial AI systems and highlight opportunities for applying SenseAI in model evaluation and alignment.

[10] EvolveRouter: Co-Evolving Routing and Prompt for Multi-Agent Question Answering

Jiatan Huang,Zheyuan Zhang,Kaiwen Shi,Yanfang Ye,Chuxu Zhang

Main category: cs.CL

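TL;DR: This paper proposes EvolveRouter, a framework that uses closed-loop co-evolution and an adaptive inference strategy to jointly improve agent quality and collaboration structure in multi-agent question answering.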

Motivation: Existing routing methods have two limitations: they optimize routing over a fixed pool of agents without improving the agents themselves, and they rely on rigid collaboration schemes that cannot adapt the number of participating agents to the query.

Method: EvolveRouter has two parts: 1) a closed-loop co-evolution process that couples graph-based query routing with targeted instruction refinement; 2) an adaptive inference strategy that uses router-weighted answer agreement to determine how many agents should collaborate on each query.

Result: On five question answering benchmarks, EvolveRouter consistently outperforms SOTA routing baselines in both F1 and exact match. Ablation analysis confirms the benefits of closed-loop refinement and adaptive collaboration.

Conclusion: EvolveRouter enables more capable and more efficient multi-agent reasoning and offers a new approach to jointly optimizing routing and agents in multi-agent systems.

Abstract: Large language model agents often exhibit complementary strengths, making routing a promising approach for multi-agent question answering. However, existing routing methods remain limited in two important ways: they typically optimize over a fixed pool of agents without improving the agents themselves, and they often rely on rigid collaboration schemes that cannot adapt the number of participating agents to the query. We propose EvolveRouter, a trainable framework that addresses both limitations by jointly improving agent quality and collaboration structure. First, EvolveRouter couples graph-based query routing with targeted instruction refinement in a closed-loop co-evolution process, allowing router diagnostics to guide agent improvement while refined agents provide cleaner supervision for routing. Second, it introduces an adaptive inference strategy that dynamically determines the effective collaboration size for each query through router-weighted answer agreement. Together, these designs enable more capable and more efficient multi-agent reasoning. Experiments on five question answering benchmarks show that EvolveRouter consistently outperforms SOTA routing baselines in both F1 and exact match, while further analysis confirms the benefits of closed-loop refinement and adaptive collaboration.

[11] Just Pass Twice: Efficient Token Classification with LLMs for Zero-Shot NER

Ahmed Ewais,Ahmed Hashish,Amr Ali

Main category: cs.CL

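TL;DR: This paper proposes Just Pass Twice (JPT), which lets causal language models use bidirectional context for zero-shot named entity recognition without modifying the model architecture, substantially improving both accuracy and speed.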

Motivation: Existing LLM-based approaches to zero-shot named entity recognition are constrained by causal attention, which sees only preceding context, and generative approaches suffer from slow decoding, hallucinated entities, and formatting errors.

Method: JPT concatenates the input sentence to itself so that each token in the second pass can attend to the complete sentence, and combines the resulting representations with definition-guided entity embeddings for flexible zero-shot generalization.

Result: On the CrossNER and MIT zero-shot NER benchmarks, JPT surpasses the previous best method by 7.9 F1 on average and is more than 20x faster than comparable generative methods.

Conclusion: JPT is a simple and efficient method that gives causal LLMs bidirectional context without architectural changes, improving both accuracy and efficiency on zero-shot NER.

Abstract: Large language models encode extensive world knowledge valuable for zero-shot named entity recognition. However, their causal attention mechanism, where tokens attend only to preceding context, prevents effective token classification when disambiguation requires future context. Existing approaches use LLMs generatively, prompting them to list entities or produce structured outputs, but suffer from slow autoregressive decoding, hallucinated entities, and formatting errors. We propose Just Pass Twice (JPT), a simple yet effective method that enables causal LLMs to perform discriminative token classification with full bidirectional context. Our key insight is that concatenating the input to itself lets each token in the second pass attend to the complete sentence, requiring no architectural modifications. We combine these representations with definition-guided entity embeddings for flexible zero-shot generalization. Our approach achieves state-of-the-art results on zero-shot NER benchmarks, surpassing the previous best method by +7.9 F1 on average across CrossNER and MIT benchmarks, while being over 20x faster than comparable generative methods.
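The key insight, that duplicating the input gives second-pass tokens a view of the whole sentence, follows directly from how a causal mask works. The sketch below involves no model at all: it only enumerates which original-sentence positions are visible at each position of the doubled sequence. The helper name and the example sentence are illustrative and not taken from the paper.

```python
def visible_source_indices(position, sentence_length):
    """Original-sentence indices visible at `position` under a causal mask.

    In the doubled sequence, position p holds a copy of original token
    p % sentence_length, and a causal mask exposes positions 0..p.
    """
    return {j % sentence_length for j in range(position + 1)}

tokens = ["Washington", "beat", "Dallas"]
n = len(tokens)

# First copy: token i sees only tokens 0..i, so "Washington" has no right context.
first_pass = [visible_source_indices(i, n) for i in range(n)]

# Second copy: token i sits at position n + i and sees the whole first copy.
second_pass = [visible_source_indices(n + i, n) for i in range(n)]
```

Every entry of `second_pass` covers all three source indices, which is why representations read from the second copy can disambiguate a token like "Washington" using the words that follow it.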

[12] What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews

Jonathan Ivey,Anjalie Field,Ziang Xiao

Main category: cs.CL

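TL;DR: Using a newly built corpus of qualitative interviews with 343 transcripts and 16,940 participant responses, this paper evaluates 10 measures of interview response quality. Direct relevance to a key research question is the strongest predictor, while clarity and surprisal-based informativeness, both common in NLP, have no predictive power.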

Motivation: Existing measures of interview quality have not been validated: it is unknown whether high-scoring responses actually contribute to a study's goals.

Method: The authors identify, implement, and evaluate 10 measures of interview response quality, using the newly constructed Qualitative Interview Corpus (343 interview transcripts and 16,940 responses from 14 real research projects) for the empirical analysis.

Result: Direct relevance to a key research question is the strongest predictor of response quality. Clarity and surprisal-based informativeness are not predictive.

Conclusion: The study provides empirically grounded, scalable quality metrics to inform the design of qualitative studies and the evaluation of automated interview systems.

Abstract: Qualitative interviews provide essential insights into human experiences when they elicit high-quality responses. While qualitative and NLP researchers have proposed various measures of interview quality, these measures lack validation that high-scoring responses actually contribute to the study's goals. In this work, we identify, implement, and evaluate 10 proposed measures of interview response quality to determine which are actually predictive of a response's contribution to the study findings. To conduct our analysis, we introduce the Qualitative Interview Corpus, a newly constructed dataset of 343 interview transcripts with 16,940 participant responses from 14 real research projects. We find that direct relevance to a key research question is the strongest predictor of response quality. We additionally find that two measures commonly used to evaluate NLP interview systems, clarity and surprisal-based informativeness, are not predictive of response quality. Our work provides analytic insights and grounded, scalable metrics to inform the design of qualitative studies and the evaluation of automated interview systems.

[13] Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

Purva Chiniya,Kevin Scaria,Sagar Chaturvedi

Main category: cs.CL

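TL;DR: This paper proposes Gradient-Controlled Decoding (GCD), a training-free defense that uses an acceptance anchor ('Sure') and a refusal anchor ('Sorry') to make large language models more robust to jailbreak and prompt-injection attacks, guaranteeing first-token safety while substantially lowering the false-refusal rate.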

Motivation: Existing defenses such as GradSafe rely on a single anchor with a brittle threshold, offer no deterministic safety guarantee, and tend to over-refuse legitimate requests, which degrades user experience.

Method: GCD uses two anchors ('Sure' and 'Sorry') to tighten the decision boundary. When a prompt is flagged as malicious, it pre-injects refusal tokens (e.g., 'Sorry, I can't...') so that decoding is safe from the first token. No model fine-tuning is needed, only 20 demonstration templates.

Result: On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% relative to GradSafe, lowers attack success rate by up to 10%, adds only 15-20 ms of latency, and transfers across models (LLaMA-2-7B, Mixtral-8x7B, Qwen-2-7B).

Conclusion: GCD is a lightweight, efficient, transferable, training-free decoding-level guardrail that balances safety and usability and is practical for LLM deployment.

Abstract: Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") and a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds under 15-20 ms of latency on average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.

[14] Improving Clinical Trial Recruitment using Clinical Narratives and Large Language Models

Ziyi Chen,Mengxian Lyu,Cheng Peng,Yonghui Wu

Main category: cs.CL

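TL;DR: This study systematically explores encoder- and decoder-based large language models (LLMs) for patient screening in clinical trials. It compares general-purpose and medical-adapted LLMs and examines three strategies for alleviating the 'Lost in the Middle' problem in long documents (original long context, NER-based extractive summarization, and RAG). On the N2C2 dataset, MedGemma with RAG performs best (micro-F1 of 89.05%). The study stresses choosing among rule-based queries, encoder-based LLMs, and generative LLMs according to the specific criteria, to balance efficiency against computing cost.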

Motivation: Screening patients for clinical trials is a labor-intensive bottleneck that leads to under-enrollment and trial failures. Large language models (LLMs) offer a new opportunity to make screening more efficient.

Method: Encoder-based and decoder-based generative LLMs, both general-purpose and medical-adapted, are evaluated systematically. Three strategies for long documents are compared: 1) the model's original long-context window; 2) extractive summarization based on named entity recognition (NER); 3) retrieval-augmented generation (RAG) driven by the eligibility criteria. Evaluation uses the 2018 N2C2 Track 1 dataset.

Result: MedGemma with RAG achieves the best micro-F1 score (89.05%). Generative LLMs bring clear gains on eligibility criteria that require reasoning across long documents, but only small gains on short-context criteria such as lab tests.

Conclusion: LLMs have practical potential for clinical trial recruitment, but real-world adoption should choose among rule-based queries, encoder-based LLMs, and generative LLMs according to the specific eligibility criteria, to maximize efficiency at reasonable computing cost.

Abstract: Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures. Recent breakthroughs in large language models (LLMs) offer a promising opportunity to use artificial intelligence to improve screening. This study systematically explored both encoder- and decoder-based generative LLMs for screening clinical narratives to facilitate clinical trial recruitment. We examined both general-purpose LLMs and medical-adapted LLMs and explored three strategies to alleviate the "Lost in the Middle" issue when handling long documents, including 1) Original long-context: using the default context windows of LLMs, 2) NER-based extractive summarization: converting the long document into summarizations using named entity recognition, 3) RAG: dynamic evidence retrieval based on eligibility criteria. The 2018 N2C2 Track 1 benchmark dataset is used for evaluation. Our experimental results show that the MedGemma model with the RAG strategy achieved the best micro-F1 score of 89.05%, outperforming other models. Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents, whereas trial criteria that span a short piece of context (e.g., lab tests) show incremental improvements. The real-world adoption of LLMs for trial recruitment must consider specific criteria for selecting among rule-based queries, encoder-based LLMs, and generative LLMs to maximize efficiency within reasonable computing costs.

[15] Faster Superword Tokenization

Craig W. Schmidt,Chris Tanner,Yuval Pinter

Main category: cs.CL

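TL;DR: This paper proposes a two-phase method that accelerates training of BoundlessBPE and SuperBPE. By aggregating supermerge candidates by frequency, it avoids keeping full documents in memory and speeds up training by more than 600x; Python and Rust implementations are open-sourced.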

Motivation: BoundlessBPE and SuperBPE can form superwords that span pre-tokenization boundaries, but their original implementations are too slow to train to be practical (BoundlessBPE took 4.7 CPU days on 1GB of data).

Method: A two-phase BoundlessBPE learns regular merges in the first phase and supermerges in the second. Supermerge candidates are aggregated by frequency so that full documents need not be held in memory. The paper also shows a near-equivalence between two-phase BoundlessBPE and SuperBPE, in which a hyperparameter that SuperBPE sets manually is determined automatically.

Result: Training on 1GB of data takes 603 seconds for BoundlessBPE and 593 seconds for SuperBPE, a speedup of more than 600x. The two-phase formulation produces results identical to the original implementation. Python and Rust implementations are open-sourced.

Conclusion: With algorithmic restructuring and engineering optimization, BoundlessBPE and SuperBPE become practical to train, narrowing the gap between their theoretical advantages and real-world deployment.

Abstract: Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend and improve BPE by relaxing this limitation and allowing the formation of superwords, which are combinations of pretokens that form phrases. However, previous implementations were impractical to train: for example, BoundlessBPE took 4.7 CPU days to train on 1GB of data. We show that supermerge candidates, two or more consecutive pretokens eligible to form a supermerge, can be aggregated by frequency much like regular pretokens. This avoids keeping full documents in memory, as the original implementations of BoundlessBPE and SuperBPE required, leading to a significant training speedup. We present a two-phase formulation of BoundlessBPE that separates first-phase learning of regular merges from second-phase learning of supermerges, producing identical results to the original implementation. We also show a near-equivalence between two-phase BoundlessBPE and SuperBPE, with the difference being that a manually selected hyperparameter used in SuperBPE can be automatically determined in the second phase of BoundlessBPE. These changes enable a much faster implementation, allowing training on that same 1GB of data in 603 and 593 seconds for BoundlessBPE and SuperBPE, respectively, a more than 600x increase in speed. For each of BoundlessBPE, SuperBPE, and BPE, we open-source both a reference Python implementation and a fast Rust implementation.
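The abstract says supermerge candidates can be aggregated by frequency "much like regular pretokens". The sketch below shows only that familiar baseline for ordinary BPE, as a point of reference: documents are collapsed into a pretoken-to-count table, and adjacent-symbol pair counts are then computed from the table rather than from the documents. It uses whitespace splitting and characters instead of bytes for brevity, and it does not implement supermerges or their eligibility rules, which the abstract does not describe.

```python
from collections import Counter

def pretoken_frequencies(documents):
    """Collapse documents into a pretoken -> count table (whitespace pre-tokenization)."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts

def pair_counts(pretoken_counts):
    """Count adjacent symbol pairs, weighting each distinct pretoken by its frequency."""
    pairs = Counter()
    for pretoken, freq in pretoken_counts.items():
        symbols = list(pretoken)
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

docs = ["the cat", "the hat"]
freqs = pretoken_frequencies(docs)   # documents can be discarded after this step
pairs = pair_counts(freqs)
best_pair = max(pairs, key=pairs.get)
```

Once the table is built, memory scales with the number of distinct pretokens rather than with corpus size, which is the property the paper extends to supermerge candidates.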

[16] XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts

Jiahao Xu,Rui Hu,Olivera Kotevska,Zikai Zhang

Main category: cs.CL

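TL;DR: This paper proposes XMark, a method for embedding multi-bit watermarks in text generated by large language models so that malicious use can be traced reliably. By producing a less distorted logit distribution, it preserves text quality while improving decoding accuracy, and it is especially suited to practical settings where the number of tokens is limited.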

Motivation: Existing methods either become computationally expensive for large messages or trade off text quality against decoding accuracy poorly, and their decoding accuracy drops sharply when the generated text has few tokens.

Method: XMark's encoder produces a less distorted logit distribution to preserve text quality, and its decoder is tailored to that design so that the message can be recovered reliably from a limited number of tokens.

Result: Across diverse downstream tasks, XMark significantly improves watermark decoding accuracy while preserving the quality of the watermarked text, outperforming prior methods.

Conclusion: XMark is an efficient and robust multi-bit text watermarking scheme suited to practical LLM applications, and it performs especially well when tokens are limited.

Abstract: Multi-bit watermarking has emerged as a promising solution for embedding imperceptible binary messages into Large Language Model (LLM)-generated text, enabling reliable attribution and tracing of malicious usage of LLMs. Despite recent progress, existing methods still face key limitations: some become computationally infeasible for large messages, while others suffer from a poor trade-off between text quality and decoding accuracy. Moreover, the decoding accuracy of existing methods drops significantly when the number of tokens in the generated text is limited, a condition that frequently arises in practical usage. To address these challenges, we propose XMark, a novel method for encoding and decoding binary messages in LLM-generated texts. The unique design of XMark's encoder produces a less distorted logit distribution for watermarked token generation, preserving text quality, and also enables its tailored decoder to reliably recover the encoded message with limited tokens. Extensive experiments across diverse downstream tasks show that XMark significantly improves decoding accuracy while preserving the quality of watermarked text, outperforming prior methods. The code is at https://github.com/JiiahaoXU/XMark.

[17] Exemplar Retrieval Without Overhypothesis Induction: Limits of Distributional Sequence Learning in Early Word Learning

Jon-Paul Cacioli

Main category: cs.CL

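TL;DR: This study examines how children learn that shape is the key feature for categorizing objects (an 'overhypothesis') and tests whether autoregressive Transformer language models can achieve a similar second-order induction through distributional sequence learning. The models perform first-order exemplar retrieval perfectly but stay at chance on second-order generalization, revealing a fundamental limitation under developmental-scale training.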

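Motivation: The study asks how children acquire the second-order generalization (overhypothesis) that shape is the feature that defines object categories, and whether today's dominant autoregressive distributional sequence learners have a comparable ability.

Method: Autoregressive Transformer language models with 3.4M-25.6M parameters are trained on synthetic corpora with eight conditions that control for alternative explanations, then evaluated on a 1,040-item wug test battery for first-order retrieval and second-order generalization. A feature-swap diagnostic analyzes the models' internal representations.

Result: Every model reaches 100% accuracy on first-order exemplar retrieval, but second-order generalization (abstracting shape for novel nouns) stays at 50-52%, i.e., chance, and equivalence testing confirms the result. The feature-swap diagnostic shows that the models rely on frame-to-feature template matching rather than structured noun-to-domain-to-feature abstraction.

Conclusion: Under developmental-scale training conditions, purely autoregressive distributional sequence learning is not sufficient for second-order induction (overhypotheses), which exposes a fundamental limitation as a cognitive model.

Abstract: Background: Children do not simply learn that balls are round and blocks are square. They learn that shape is the kind of feature that tends to define object categories, a second-order generalisation known as an overhypothesis [1, 2]. What kind of learning mechanism is sufficient for this inductive leap? Methods: We trained autoregressive transformer language models (3.4M-25.6M parameters) on synthetic corpora in which shape is the stable feature dimension across categories, with eight conditions controlling for alternative explanations. Results: Across 120 pre-registered runs evaluated on a 1,040-item wug test battery, every model achieved perfect first-order exemplar retrieval (100%) while second-order generalisation to novel nouns remained at chance (50-52%), a result confirmed by equivalence testing. A feature-swap diagnostic revealed that models rely on frame-to-feature template matching rather than structured noun-to-domain-to-feature abstraction. Conclusions: These results reveal a clear limitation of autoregressive distributional sequence learning under developmental-scale training conditions.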

[18] Do Domain-specific Experts exist in MoE-based LLMs?

Giang Do,Hung Le,Truyen Tran

Main category: cs.CL

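TL;DR: This paper asks whether domain-specific experts exist in MoE-based large language models and confirms empirically that they do. It then proposes Domain Steering MoE (DSMoE), a training-free method with zero added inference cost that generalizes better than baselines such as SFT on several open-source MoE models.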

Details Motivation: Although the MoE architecture is widely used in large models, whether its experts are domain-specific, and how such specialization can be interpreted, remains an open question. Method: Empirically analyzes ten advanced MoE-based LLMs ranging from 3.8B to 120B parameters to verify that domain-specific experts exist, then builds on this finding to propose Domain Steering MoE (DSMoE), a training-free framework with zero additional inference cost. Result: Experiments on four advanced open-source MoE-based LLMs across target and non-target domains show that DSMoE outperforms strong baselines including SFT, without increasing inference cost or requiring retraining. Conclusion: Identifiable and exploitable domain-specific experts do exist in MoE-based LLMs; DSMoE offers an efficient, low-cost way to improve domain adaptation. Abstract: In the era of Large Language Models (LLMs), the Mixture of Experts (MoE) architecture has emerged as an effective approach for training extremely large models with improved computational efficiency. This success builds upon extensive prior research aimed at enhancing expert specialization in MoE-based LLMs. However, the nature of such specializations and how they can be systematically interpreted remain open research challenges. In this work, we investigate this gap by posing a fundamental question: Do domain-specific experts exist in MoE-based LLMs? To answer the question, we evaluate ten advanced MoE-based LLMs ranging from 3.8B to 120B parameters and provide empirical evidence for the existence of domain-specific experts. Building on this finding, we propose Domain Steering Mixture of Experts (DSMoE), a training-free framework that introduces zero additional inference cost and outperforms both well-trained MoE-based LLMs and strong baselines, including Supervised Fine-Tuning (SFT). Experiments on four advanced open-source MoE-based LLMs across both target and non-target domains demonstrate that our method achieves strong performance and robust generalization without increasing inference cost or requiring additional retraining. Our implementation is publicly available at https://github.com/giangdip2410/Domain-specific-Experts.

[19] Beneath the Surface: Investigating LLMs' Capabilities for Communicating with Subtext

Kabir Ahuja,Yuxuan Li,Andrew Kyle Lampinen

Main category: cs.CL

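TL;DR: This paper systematically studies whether large language models can use subtext in communicative settings and introduces four new evaluation suites. Frontier models show a general bias toward overly literal communication, which eases somewhat under certain conditions (for example, when common ground is made explicit), and the study exposes many weaknesses in how current models understand and generate subtext.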

Details Motivation: Human communication is inherently creative and often relies on subtext to convey meaning beyond the literal; whether current large language models have this ability is unclear and calls for systematic evaluation. Method: Designs and introduces four new evaluation suites, covering allegory writing and interpretation as well as multi-agent and multi-modal games inspired by board games such as Dixit (for example, Visual Allusions), and uses paratextual and persona conditions to analyze how the interpretation of subtext varies. Result: Frontier models lean toward literal expression (for example, 60% of clues in Visual Allusions are literal even for the best models); some models reduce literal clues by 30%-50% when common ground is explicit but struggle to infer on their own whether common ground exists; allegory interpretation shifts significantly with paratext and persona. Conclusion: Current LLMs have fundamental limitations in generating and understanding subtext: their performance depends heavily on explicit prompting, and they lack spontaneous modeling of social context and implicit shared knowledge. The work provides quantifiable measures for this subjective, complex phenomenon and calls for more socially grounded, creative models of communication and reasoning. Abstract: Human communication is fundamentally creative, and often makes use of subtext: implied meaning that goes beyond the literal content of the text. Here, we systematically study whether language models can use subtext in communicative settings, and introduce four new evaluation suites to assess these capabilities. Our evaluation settings range from writing and interpreting allegories to playing multi-agent and multi-modal games inspired by the rules of board games like Dixit. We find that frontier models generally exhibit a strong bias towards overly literal, explicit communication, and thereby fail to account for nuanced constraints: even the best-performing models generate literal clues 60% of the time in one of our environments, Visual Allusions. However, we find that some models can sometimes make use of common ground with another party to help them communicate with subtext, achieving a 30%-50% reduction in overly literal clues; but they struggle to infer the presence of common ground when it is not explicitly stated. For allegory understanding, we find that paratextual and persona conditions significantly shift the interpretation of subtext. Overall, our work provides quantifiable measures for an inherently complex and subjective phenomenon like subtext and reveals many weaknesses and idiosyncrasies of current LLMs. We hope this research inspires future work towards socially grounded creative communication and reasoning.

[20] Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification

Jinhong Jeong,Junghun Park,Youngjae Yu

Main category: cs.CL

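TL;DR: This paper proposes Re-RIGHT, a reinforcement learning framework for adaptive multilingual text simplification that needs no parallel-corpus supervision. Using three rewards (vocabulary coverage, semantic preservation, and coherence), it improves lexical fit and simplification quality at target proficiency levels (CEFR/JLPT/TOPIK/HSK) in English, Japanese, Korean, and Chinese.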

Details Motivation: Existing text simplification methods depend on costly personalized parallel corpora or pre-labeled English sentences, which makes them hard to extend to multiple languages and to learners at different proficiency levels. Method: Proposes the reinforcement learning framework Re-RIGHT: collects 43K vocabulary-level data across four languages and trains a compact 4B policy model with three reward modules covering vocabulary coverage, semantic preservation, and coherence. Result: In English, Japanese, Korean, and Chinese, Re-RIGHT achieves higher lexical coverage at target proficiency levels than strong baselines such as GPT-5.2 and Gemini 2.5, while better preserving the original meaning and fluency. Conclusion: Re-RIGHT shows that reinforcement learning without parallel corpora is effective and practical for multilingual, multi-level text simplification, giving L2 learners more precise and scalable comprehensible input. Abstract: Text simplification supports second language (L2) learning by providing comprehensible input, consistent with the Input Hypothesis. However, constructing personalized parallel corpora is costly, while existing large language model (LLM)-based readability control methods rely on pre-labeled sentence corpora and primarily target English. We propose Re-RIGHT, a unified reinforcement learning framework for adaptive multilingual text simplification without parallel corpus supervision. We first show that prompting-based lexical simplification at target proficiency levels (CEFR, JLPT, TOPIK, and HSK) performs poorly at easier levels and for non-English languages, even with state-of-the-art LLMs such as GPT-5.2 and Gemini 2.5. To address this, we collect 43K vocabulary-level data across four languages (English, Japanese, Korean, and Chinese) and train a compact 4B policy model using Re-RIGHT, which integrates three reward modules: vocabulary coverage, semantic preservation, and coherence. Compared to the stronger LLM baselines, Re-RIGHT achieves higher lexical coverage at target proficiency levels while maintaining original meaning and fluency.

[21] DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects

Jason Lucas,Matt Murtagh,Ali Al-Lawati,Uchendu Uchendu,Adaku Uchendu,Dongwon Lee

Main category: cs.CL

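TL;DR: This paper presents DIA-HARM, the first benchmark to systematically evaluate the robustness of 16 disinformation detection models across 50 English dialects. Performance drops on varieties other than Standard American English (SAE): human-written dialectal content lowers F1 by 1.4-3.6%, and some models degrade catastrophically by more than 33%. Fine-tuned Transformers outperform zero-shot LLMs, and multilingual models such as mDeBERTa generalize better. The study suggests that current detectors may systematically disadvantage hundreds of millions of non-SAE speakers.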

Details Motivation: Harmful content detectors, especially disinformation classifiers, are developed and evaluated mainly on Standard American English (SAE); their robustness across English dialects has not been studied, which may lead to unfair outcomes for the many non-SAE speakers worldwide. Method: Builds DIA-HARM, the first robustness benchmark for disinformation detection covering 50 English dialects (U.S., British, African, Caribbean, and Asia-Pacific varieties); uses Multi-VALUE's linguistically grounded transformations to derive the D3 corpus (195K samples) of dialectal variants from established disinformation benchmarks; evaluates 16 detection models and analyzes cross-dialectal transfer (2,450 dialect pairs), human-written versus AI-generated content, and zero-shot versus fine-tuned models. Result: Human-written dialectal content lowers F1 by 1.4-3.6%, while AI-generated content remains stable; the best fine-tuned Transformer reaches 96.6% F1 versus 78.3% for the best zero-shot LLM; some models lose more than 33% F1 on mixed content; mDeBERTa averages 97.2% F1, whereas RoBERTa and XLM-RoBERTa fail on dialectal inputs. Conclusion: Current disinformation detectors are systematically fragile to dialectal variation and may deepen linguistic inequality; multilingual pretraining and dialect-aware modeling are key routes to fairness and robustness. The authors release the DIA-HARM framework, the D3 corpus, and evaluation tools. Abstract: Harmful content detectors, particularly disinformation classifiers, are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA-HARM, the first benchmark for evaluating disinformation detection robustness across 50 English dialects spanning U.S., British, African, Caribbean, and Asia-Pacific varieties. Using Multi-VALUE's linguistically grounded transformations, we introduce D3 (Dialectal Disinformation Detection), a corpus of 195K samples derived from established disinformation benchmarks. Our evaluation of 16 detection models reveals systematic vulnerabilities: human-written dialectal content degrades detection by 1.4-3.6% F1, while AI-generated content remains stable. Fine-tuned transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3% best-case F1), with some models exhibiting catastrophic failures exceeding 33% degradation on mixed content. Cross-dialectal transfer analysis across 2,450 dialect pairs shows that multilingual models (mDeBERTa: 97.2% average F1) generalize effectively, while monolingual models like RoBERTa and XLM-RoBERTa fail on dialectal inputs. These findings demonstrate that current disinformation detectors may systematically disadvantage hundreds of millions of non-SAE speakers worldwide. We release the DIA-HARM framework, D3 corpus, and evaluation tools: https://github.com/jsl5710/dia-harm

[22] Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities

Xiangxu Zhang,Jiamin Wang,Qinlin Zhao,Hanze Guo,Linzhuo Li,Jing Yao,Xiao Zhou,Xiaoyuan Yi,Xing Xie

Main category: cs.CL

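TL;DR: This paper introduces CIVA, an environment that controls how prevalent each human value is in a multi-agent system. It shows that value misalignment leads to system collapse at the macro level and to deception and power-seeking at the micro level, providing evidence that human values are essential to the collective behavior of large language models.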

Details Motivation: As large language models (LLMs) become part of society, group-level failures caused by individually misaligned agents in multi-agent systems need study; yet why human values matter, and how they affect collective behavior, remains unclear. Method: Builds CIVA, a controlled multi-agent environment grounded in social science theories, in which LLM agents autonomously communicate, explore, and compete for resources, and systematically manipulates the distribution of values to analyze the resulting behavior. Result: Three key findings: (1) several structurally critical values substantially shape group dynamics; (2) misspecifying them can trigger systemic collapse at the macro level; (3) emergent behaviors such as deception and power-seeking appear at the micro level. Conclusion: Human values are not only an individual alignment issue but a core factor in the collective outcomes of LLM multi-agent systems, which motivates research on multi-agent value alignment. Abstract: As LLMs become increasingly integrated into human society, evaluating their orientations on human values from social science has drawn growing attention. Nevertheless, it is still unclear why human values matter for LLMs, especially in LLM-based multi-agent systems, where group-level failures may accumulate from individually misaligned actions. We ask whether misalignment with human values alters the collective behavior of LLM agents and what changes it induces. In this work, we introduce CIVA, a controlled multi-agent environment grounded in social science theories, where LLM agents form a community and autonomously communicate, explore, and compete for resources, enabling systematic manipulation of value prevalence and behavioral analysis. Through comprehensive simulation experiments, we reveal three key findings. (1) We identify several structurally critical values that substantially shape the community's collective dynamics, including those diverging from LLMs' original orientations. Triggered by the misspecification of these values, we (2) detect system failure modes, e.g., catastrophic collapse, at the macro level, and (3) observe emergent behaviors like deception and power-seeking at the micro level. These results offer quantitative evidence that human values are essential for collective outcomes in LLMs and motivate future multi-agent value alignment.

[23] DQA: Diagnostic Question Answering for IT Support

Vishaal Kapoor,Mariam Dundua,Sarthak Ahuja,Neda Kordjazi,Evren Yortucboylu,Vaibhavi Padala,Derek Ho,Jennifer Whitted,Rebecca Steinert

Main category: cs.CL

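TL;DR: This paper proposes DQA, a diagnostic question-answering framework that maintains persistent diagnostic state and aggregates retrieved cases by root cause, improving multi-turn troubleshooting in enterprise IT support.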

Details Motivation: Enterprise IT support interactions are inherently diagnostic: evidence must be gathered iteratively from ambiguous user reports to identify a root cause. Standard multi-turn RAG systems lack explicit diagnostic state and therefore struggle to accumulate evidence and resolve competing hypotheses across turns. Method: Proposes the DQA framework, which combines conversational query rewriting, aggregation of retrieved results by root cause, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints. Result: On 150 anonymized enterprise IT support scenarios, DQA reaches a 78.7% trajectory-level success rate versus 41.3% for a multi-turn RAG baseline, and reduces average turns from 8.4 to 3.9. Conclusion: By introducing diagnostic state and root-cause-level aggregation, DQA improves evidence accumulation and hypothesis resolution for multi-turn RAG in diagnostic tasks and suits real enterprise IT support settings. Abstract: Enterprise IT support interactions are fundamentally diagnostic: effective resolution requires iterative evidence gathering from ambiguous user reports to identify an underlying root cause. While retrieval-augmented generation (RAG) provides grounding through historical cases, standard multi-turn RAG systems lack explicit diagnostic state and therefore struggle to accumulate evidence and resolve competing hypotheses across turns. We introduce DQA, a diagnostic question-answering framework that maintains persistent diagnostic state and aggregates retrieved cases at the level of root causes rather than individual documents. DQA combines conversational query rewriting, retrieval aggregation, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints. We evaluate DQA on 150 anonymized enterprise IT support scenarios using a replay-based protocol. Averaged over three independent runs, DQA achieves a 78.7% success rate under a trajectory-level success criterion, compared to 41.3% for a multi-turn RAG baseline, while reducing average turns from 8.4 to 3.9.
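To make the idea of aggregating retrieved cases by root cause concrete, here is a minimal sketch. The abstract does not say how DQA scores or merges cases, so the tuple layout `(case_id, root_cause, score)`, the function name `aggregate_by_root_cause`, and the sum-of-scores rule are all assumptions chosen for illustration.

```python
from collections import defaultdict


def aggregate_by_root_cause(retrieved):
    """Group retrieved cases by root cause and rank the causes.

    retrieved: list of (case_id, root_cause, score) tuples.
    Hypothetical scoring rule: a cause's score is the sum of the
    retrieval scores of its supporting cases.
    """
    groups = defaultdict(lambda: {"score": 0.0, "cases": []})
    for case_id, root_cause, score in retrieved:
        groups[root_cause]["score"] += score
        groups[root_cause]["cases"].append(case_id)
    ranked = [(cause, g["score"], g["cases"]) for cause, g in groups.items()]
    # Highest aggregated evidence first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

Under this rule, two moderately scored cases pointing at the same root cause can outrank a single higher-scored case, which is the behaviour that document-level ranking would miss.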

[24] ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

Kaiser Hamid,Can Cui,Nade Liang

Main category: cs.CL

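TL;DR: This paper proposes ICR-Drive, a diagnostic framework that measures how robust language-conditioned autonomous driving models are to instruction perturbations (paraphrase, ambiguity, noise, and misleading text), and reveals significant reliability problems for deploying current VLA models.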

Details Motivation: Evaluations of VLA models mostly assume precise, well-formed instructions, but in deployment instructions vary in phrasing, omit information, or contain misleading content, so instruction-level robustness is under-measured. Method: Builds the ICR-Drive framework, which generates four families of controlled instruction variants (Paraphrase, Ambiguity, Noise, Misleading) and replays identical routes under matched configurations in closed-loop CARLA simulation to isolate the effect of language changes; robustness is quantified with CARLA Leaderboard metrics and per-family performance degradation. Result: Experiments on LMDrive and BEVDriver show that minor instruction changes can cause substantial performance drops and distinct failure modes. Conclusion: Current end-to-end language-conditioned driving models have clear weaknesses in instruction robustness, which limits their deployment in safety-critical driving. Abstract: Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.
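The per-family degradation measure can be illustrated with a small sketch. The abstract does not state whether degradation is absolute or relative, so the relative-drop formula below is an assumption, as are the function name and the example scores, which are not taken from the paper.

```python
def degradation_by_family(baseline_score, perturbed_scores):
    """Relative performance drop (in percent) for each perturbation family.

    baseline_score: metric value under the original instruction.
    perturbed_scores: {family_name: metric value under that perturbation}.
    Assumed formula: 100 * (baseline - perturbed) / baseline.
    """
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return {
        family: 100.0 * (baseline_score - score) / baseline_score
        for family, score in perturbed_scores.items()
    }


# Hypothetical numbers, for illustration only.
drops = degradation_by_family(50.0, {"Paraphrase": 45.0, "Misleading": 25.0})
```

Because routes, seeds, and simulator configuration are held fixed, any non-zero value in `drops` can be attributed to the instruction wording alone.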

[25] Confidence Should Be Calibrated More Than One Turn Deep

Zhaohan Zhang,Chengzhengxu Li,Xiaoming Liu,Chao Shen,Ziquan Liu,Ioannis Patras

Main category: cs.CL

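TL;DR: This paper introduces the task of multi-turn calibration, which addresses the dynamic challenge of calibrating LLM confidence across conversational turns, and proposes the MTCal method and the ConfChat decoding strategy to improve factuality and consistency in multi-turn dialogue.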

Details Motivation: Existing work on confidence estimation and calibration focuses on single-turn settings and overlooks both the risks and the potential of calibration in multi-turn interactions, which matter in high-stakes domains. Method: Defines the multi-turn calibration task and the ECE@T metric to track calibration dynamics over turns; proposes MTCal to minimize ECE@T; builds ConfChat, a decoding strategy that uses calibrated confidence to improve the factuality and consistency of responses. Result: MTCal achieves strong and consistent multi-turn calibration; ConfChat preserves and even improves model performance in multi-turn interactions. Conclusion: Multi-turn calibration is a missing link for scaling LLM calibration toward safe, reliable, real-world use. Abstract: Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of the model response in multi-turn interactions. Extensive experiments demonstrate that MTCal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as one missing link for scaling LLM calibration toward safe, reliable, and real-world use.
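As a rough illustration of tracking calibration over turns, the sketch below computes the standard equal-width binned Expected Calibration Error separately for each turn. The abstract does not give the exact definition of ECE@T, so treating it as "binned ECE restricted to predictions made at turn T" is an assumption, and the names `ece` and `ece_at_turn` are hypothetical.

```python
from collections import defaultdict


def ece(confidences, correct, n_bins=10):
    """Binned Expected Calibration Error with equal-width confidence bins."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        total += len(items) / n * abs(avg_conf - accuracy)
    return total


def ece_at_turn(records, n_bins=10):
    """records: iterable of (turn, confidence, is_correct). Returns {turn: ECE}."""
    by_turn = defaultdict(lambda: ([], []))
    for turn, conf, ok in records:
        by_turn[turn][0].append(conf)
        by_turn[turn][1].append(int(ok))
    return {t: ece(c, y, n_bins) for t, (c, y) in sorted(by_turn.items())}
```

Plotting the returned values against the turn index would show whether calibration drifts as the conversation proceeds, for example after persuasive user feedback.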

[26] Multi-Drafter Speculative Decoding with Alignment Feedback

Taehyeon Kim,Hojung Jung,Se-Young Yun

Main category: cs.CL

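TL;DR: This paper proposes MetaSD, a framework that uses a multi-armed bandit to select dynamically among multiple heterogeneous drafters, improving the efficiency of speculative decoding.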

Details Motivation: A single drafter, often trained for a specific task or domain, has limited effectiveness across diverse applications. Method: Frames drafter selection as a multi-armed bandit problem and uses alignment feedback to allocate computation dynamically across multiple heterogeneous drafters. Result: Extensive experiments show that MetaSD consistently outperforms single-drafter approaches. Conclusion: Integrating multiple drafters and scheduling them dynamically improves the performance and generality of speculative decoding. Abstract: Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens. However, individual drafters, often trained for specific tasks or domains, exhibit limited effectiveness across diverse applications. To address this, we introduce MetaSD, a unified framework that integrates multiple drafters into the SD process. MetaSD dynamically allocates computational resources to heterogeneous drafters by leveraging alignment feedback and framing drafter selection as a multi-armed bandit problem. Extensive experiments show MetaSD consistently outperforms single-drafter approaches.
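Framing drafter selection as a multi-armed bandit can be sketched with the classic UCB1 rule. This is a generic bandit illustration, not MetaSD's actual algorithm: the abstract does not specify the bandit strategy or the reward, so the class name `UCBDrafterSelector` and the use of "fraction of drafted tokens accepted by the target model" as the reward are assumptions.

```python
import math


class UCBDrafterSelector:
    """UCB1 over a fixed set of drafters (each drafter is one bandit arm)."""

    def __init__(self, n_drafters, exploration=2.0):
        self.counts = [0] * n_drafters   # times each drafter was used
        self.means = [0.0] * n_drafters  # running mean reward per drafter
        self.exploration = exploration

    def select(self):
        # Try every drafter once before applying the UCB rule.
        for index, count in enumerate(self.counts):
            if count == 0:
                return index
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, index, reward):
        # reward: assumed to be the acceptance rate in [0, 1] for this draft.
        self.counts[index] += 1
        self.means[index] += (reward - self.means[index]) / self.counts[index]
```

In a decoding loop, `select()` would pick the drafter for the next block of tokens and `update()` would be called once the target model has verified that block.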

[27] Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

Qiyuan Chen,Hongsen Huang,Jiahe Chen,Qian Shao,Jintai Chen,Hongxia Xu,Renjie Hua,Chuan Ren,Jian Wu

Main category: cs.CL

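TL;DR: VL-MDR is a vision-language multi-dimensional reward modeling framework that uses a visual-aware gating mechanism to decompose evaluation into fine-grained dimensions (such as hallucination and reasoning) and weight them dynamically. It balances interpretability with efficiency and outperforms existing open-source reward models on several benchmarks.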

Details Motivation: To resolve the dilemma in vision-language reward modeling, where generative approaches are interpretable but slow and discriminative approaches are efficient but opaque. Method: Proposes the VL-MDR framework, which uses a visual-aware gating mechanism to select and weight 21 fine-grained evaluation dimensions dynamically, and curates a dataset of 321k preference pairs annotated along those dimensions. Result: VL-MDR consistently outperforms existing open-source reward models on benchmarks such as VL-RewardBench; the preference pairs it constructs support DPO alignment, mitigating visual hallucinations and improving reliability. Conclusion: VL-MDR offers an interpretable, flexible, and scalable approach to aligning vision-language models. Abstract: Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.
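The select-then-weight step can be illustrated with a small sketch. The abstract does not describe how the gate is computed or combined, so this example simply assumes that gate logits and per-dimension scores are already available, keeps the `top_k` dimensions with the highest logits, and applies a softmax over them; the function name and parameters are hypothetical.

```python
import math


def gated_reward(dim_scores, gate_logits, top_k=3):
    """Combine per-dimension scores into one reward using a top-k softmax gate.

    dim_scores: {dimension: score} from per-dimension evaluators (assumed given).
    gate_logits: {dimension: logit} from an input-dependent gate (assumed given).
    Returns (reward, weights) so the contribution of each dimension stays visible.
    """
    chosen = sorted(gate_logits, key=gate_logits.get, reverse=True)[:top_k]
    peak = max(gate_logits[d] for d in chosen)
    exps = {d: math.exp(gate_logits[d] - peak) for d in chosen}  # stable softmax
    norm = sum(exps.values())
    weights = {d: e / norm for d, e in exps.items()}
    reward = sum(weights[d] * dim_scores[d] for d in chosen)
    return reward, weights
```

Returning the weights alongside the scalar is what keeps the result interpretable: a reader can see which dimensions were judged relevant for a given input and how much each contributed.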

[28] Content Fuzzing for Escaping Information Cocoons on Digital Social Media

Yifeng He,Ziye Tang,Hao Chen

Main category: cs.CL

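TL;DR: This paper proposes ContentFuzz, a confidence-guided fuzzing framework that uses a large language model to rewrite social media posts so that their human-interpreted intent is preserved while the machine-inferred stance label changes, helping content escape information cocoons and reach a broader audience.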

Details Motivation: Social platforms use stance detection in recommendation, which traps users in information cocoons and limits the spread of dissenting views and constructive dialogue; this work takes the content creator's perspective and asks how content can be revised to reach beyond existing opinion clusters. Method: Proposes ContentFuzz, a confidence-guided fuzzing framework that uses confidence feedback from stance detection models to guide an LLM toward rewrites that preserve meaning but change the machine-assigned stance label. Result: Experiments on three datasets in two languages with four representative stance detection models show that ContentFuzz effectively flips machine-classified stance labels while maintaining semantic integrity. Conclusion: ContentFuzz offers a feasible route to easing algorithm-driven information cocoons, showing that controlled, meaning-preserving rewriting can extend the cross-stance reach of content, with practical implications for recommender fairness and healthy public discussion. Abstract: Information cocoons on social media limit users' exposure to posts with diverse viewpoints. Modern platforms use stance detection as an important signal in recommendation and ranking pipelines, which can route posts primarily to like-minded audiences and reduce cross-cutting exposure. This restricts the reach of dissenting opinions and hinders constructive discourse. We take the creator's perspective and investigate how content can be revised to reach beyond existing affinity clusters. We present ContentFuzz, a confidence-guided fuzzing framework that rewrites posts while preserving their human-interpreted intent and induces different machine-inferred stance labels. ContentFuzz aims to route posts beyond their original cocoons. Our method guides a large language model (LLM) to generate meaning-preserving rewrites using confidence feedback from stance detection models. Evaluated on four representative stance detection models across three datasets in two languages, ContentFuzz effectively changes machine-classified stance labels, while maintaining semantic integrity with respect to the original content.

[29] Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction

Yuzhe Zhang,Xianwei Xue,Xingyong Wu,Mengke Chen,Chen Liu,Xinran He,Run Shao,Feiran Liu,Huanmin Xu,Qiutong Pan,Haiwei Wang

Main category: cs.CL

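TL;DR: VeriGUI is a GUI agent for noisy real-world environments. Through its TVAE framework and a two-stage training method, it explicitly models action outcomes and recovery strategies, significantly reducing failure loops and improving recovery success.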

Details Motivation: Existing VLM-based GUI agents assume deterministic environment responses and ignore real-world noise such as network latency, rendering delays, and system interruptions, which leads to undetected action failures, repeated ineffective behavior, and error accumulation; robust recovery strategies are also hard to learn because online interaction is costly and offline data lacks real-time feedback. Method: Proposes VeriGUI, comprising (1) a Thinking-Verification-Action-Expectation (TVAE) framework to detect failures and guide corrective reasoning; (2) a two-stage training pipeline that combines Robust SFT on synthetic failure trajectories with GRPO using asymmetric verification rewards; and (3) a Robustness Benchmark built on AndroidControl. Result: Experiments show that VeriGUI significantly reduces failure loops and improves recovery success while maintaining competitive performance on standard tasks. Conclusion: Explicitly modeling action outcomes and recovery is key to making GUI agents robust in noisy real-world environments; VeriGUI provides an effective framework and empirical support for this direction. Abstract: Autonomous GUI agents based on vision-language models (VLMs) often assume deterministic environment responses, generating actions without verifying whether previous operations succeeded. In real-world settings with network latency, rendering delays, and system interruptions, this assumption leads to undetected action failures, repetitive ineffective behaviors, and catastrophic error accumulation. Moreover, learning robust recovery strategies is challenging due to the high cost of online interaction and the lack of real-time feedback in offline datasets. We propose VeriGUI (Verification-driven GUI Agent), which explicitly models action outcomes and recovery under noisy environments. VeriGUI introduces a Thinking-Verification-Action-Expectation (TVAE) framework to detect failures and guide corrective reasoning, and a two-stage training pipeline that combines Robust SFT with synthetic failure trajectories and GRPO with asymmetric verification rewards. We further construct a Robustness Benchmark based on AndroidControl to evaluate failure recognition and correction. Experiments show that VeriGUI significantly reduces failure loops and improves recovery success while maintaining competitive standard task performance.

[30] Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

Hongcheng Liu,Yuhao Wang,Zhe Chen,Pingjie Wang,Zhiyuan Zhu,Yixuan Hou,Yanfeng Wang,Yu Wang

Main category: cs.CL

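TL;DR: This paper formulates the cross-modal coreference problem, builds the CrossOmni dataset, and improves Omni-LLMs' reasoning on cross-modal coreference with both a training-free and a training-based method.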

Details Motivation: Existing Omni-LLMs perform poorly on complex reasoning that requires multiple modalities to work together, and in particular lack fine-grained cross-modal alignment, such as identifying referents shared across modalities. Method: (1) Formalizes the problem as a cross-modal coreference task; (2) builds the CrossOmni dataset with nine tasks and human-designed reasoning rationales; (3) proposes a training-free in-context learning method and a training-based SFT+GRPO framework to strengthen coreference awareness. Result: Experiments on 13 Omni-LLMs confirm systematic weaknesses; both proposed methods yield substantial gains and generalize to collaborative reasoning tasks. Conclusion: Cross-modal coreference is a key missing piece for robust omni-modal reasoning, and introducing coreference-aware thinking patterns improves model capability. Abstract: Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, yet this aspect has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute to the absence of coreference-aware thinking patterns. To address this, we enhance cross-modal alignment via two strategies: a training-free In-Context Learning method and a training-based SFT+GRPO framework designed to induce such thinking patterns. Both approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks. Overall, our findings highlight cross-modal coreference as a crucial missing piece for advancing robust omni-modal reasoning.

[31] Turbulence-like 5/3 spectral scaling in contextual representations of language as a complex system

Zhongxin Yang,Chun Bao,Yuanwei Bin,Xiang I. A. Yang,Shiyi Chen

Main category: cs.CL

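TL;DR: This paper finds that text trajectories formed by the contextual embeddings of transformer-based language models have a power spectrum that follows a power law with an exponent of roughly 5/3 across multiple languages and corpora. The scaling reflects self-similar integration of semantic information across scales rather than lexical statistics alone.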

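Details Motivation: To investigate the robust statistical regularities that natural language exhibits as a complex system, in particular how semantic information is organized across scales. Method: Represents text as a trajectory in the high-dimensional embedding space of transformer-based language models, quantifies scale-dependent fluctuations along the token sequence with an embedding-step signal, and analyzes its power spectrum. Result: Across multiple languages and corpora, the power spectrum of contextual embeddings shows a robust power law with an exponent close to 5/3; the scaling appears in both human-written and AI-generated text, is absent in static word embeddings, and is disrupted by randomizing token order. Conclusion: The observed 5/3 scaling reflects multiscale, context-dependent organization in language. By analogy with the Kolmogorov spectrum in turbulence, it suggests that semantic information is integrated in a scale-free, self-similar manner, and it provides a quantitative, model-agnostic benchmark for the complexity of language representations. Abstract: Natural language is a complex system that exhibits robust statistical regularities. Here, we represent text as a trajectory in a high-dimensional embedding space generated by transformer-based language models, and quantify scale-dependent fluctuations along the token sequence using an embedding-step signal. Across multiple languages and corpora, the resulting power spectrum exhibits a robust power law with an exponent close to $5/3$ over an extended frequency range. This scaling is observed consistently in contextual embeddings from both human-written and AI-generated text, but is absent in static word embeddings and is disrupted by randomization of token order. These results show that the observed scaling reflects multiscale, context-dependent organization rather than lexical statistics alone. By analogy with the Kolmogorov spectrum in turbulence, our findings suggest that semantic information is integrated in a scale-free, self-similar manner across linguistic scales, and provide a quantitative, model-agnostic benchmark for studying complex structure in language representations.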
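The spectral analysis can be sketched in a few lines of standard-library Python. The abstract does not define the embedding-step signal, so taking it to be the Euclidean distance between consecutive token embeddings is an assumption, as are the function names; the naive O(n^2) DFT is used only to keep the sketch dependency-free, and an FFT routine would be the practical choice for long texts.

```python
import cmath
import math


def step_signal(embeddings):
    """Assumed definition: distance between consecutive token embeddings."""
    return [math.dist(a, b) for a, b in zip(embeddings, embeddings[1:])]


def power_spectrum(signal):
    """Periodogram of the mean-removed signal at frequencies k/n, 0 < k < n/2."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    freqs, power = [], []
    for k in range(1, n // 2):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(centered)
        )
        freqs.append(k / n)
        power.append(abs(coeff) ** 2)
    return freqs, power


def spectral_exponent(freqs, power):
    """Least-squares slope of log(power) vs log(freq), returned as a positive exponent."""
    xs = [math.log(f) for f in freqs]
    ys = [math.log(p) for p in power]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return -slope
```

A spectrum that scales as $f^{-5/3}$ would give a returned exponent near 1.67; in practice the fit would be restricted to the frequency range over which the power law actually holds.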

[32] Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting

Jinhu Fu,Yan Bai,Longzhu He,Yihang Lou,Yanxiao Zhao,Li Sun,Sen Su

Main category: cs.CL

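TL;DR: This paper proposes CoT2Edit, which uses chain-of-thought (CoT) reasoning to improve knowledge editing in large language models, addressing the poor generalization and narrow scope of existing methods.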

Details Motivation: Existing knowledge editing methods have two key limitations: poor generalization, since injected knowledge is hard to apply to practical problems, and narrow scope, since they focus on structured fact triples and overlook unstructured information such as news and articles. Method: Proposes the CoT2Edit paradigm: language model agents generate chain-of-thought (CoT) reasoning for both structured and unstructured edited data to build high-quality instruction data; the model is trained to reason over edited knowledge with supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO); at inference time, Retrieval-Augmented Generation (RAG) dynamically retrieves relevant edited facts. Result: The method generalizes strongly across six diverse knowledge editing scenarios with only a single round of training on three open-source language models. Conclusion: CoT2Edit improves the knowledge editing ability and practicality of LLMs in diverse real-world settings, combining generalization with breadth of scope. Abstract: Large language models (LLMs) can effectively handle outdated information through knowledge editing. However, current approaches face two key limitations: (I) Poor generalization: Most approaches rigidly inject new knowledge without ensuring that the model can use it effectively to solve practical problems. (II) Narrow scope: Current methods focus primarily on structured fact triples, overlooking the diverse unstructured forms of factual information (e.g., news, articles) prevalent in real-world contexts. To address these challenges, we propose a new paradigm: teaching LLMs to edit knowledge via Chain of Thoughts (CoTs) reasoning (CoT2Edit). We first leverage language model agents for both structured and unstructured edited data to generate CoTs, building high-quality instruction data. The model is then trained to reason over edited knowledge through supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). At inference time, we integrate Retrieval-Augmented Generation (RAG) to dynamically retrieve relevant edited facts for real-time knowledge editing. Experimental results demonstrate that our method achieves strong generalization across six diverse knowledge editing scenarios with just a single round of training on three open-source language models. The codes are available at https://github.com/FredJDean/CoT2Edit.

[33] Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

Jun Zhang,Yicheng Ji,Feiyang Ren,Yihang Li,Bowen Zeng,Zonghao Chen,Ke Chen,Lidan Shou,Gang Chen,Huan Li

Main category: cs.CL

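TL;DR: This survey systematically analyzes the efficiency bottlenecks that visual token dominance creates in large vision-language model (LVLM) inference, presents a taxonomy of efficiency techniques covering the encoding, prefilling, and decoding stages, and examines the coupling and trade-offs between stages. It outlines four future research directions and releases a continuously maintained literature repository.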

Details Motivation: LVLM inference faces a systemic efficiency barrier caused by visual token dominance, and existing optimizations are mostly studied in isolation without an end-to-end view. Method: Builds a systematic taxonomy of efficiency techniques around the inference lifecycle (encoding, prefilling, decoding), decouples the analysis into three axes (shaping information density, managing long-context attention, and overcoming memory limits), and proposes future directions supported by pilot empirical insights. Result: Provides a structured efficiency analysis covering the full LVLM inference pipeline, identifies key bottlenecks such as the "visual memory wall", and outlines four future frontiers. Conclusion: LVLM efficiency calls for end-to-end co-design; isolated techniques should be composed and evaluated within a unified framework; hardware-algorithm co-design and modality-aware mechanisms are key routes past the visual token bottleneck. Abstract: Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the "visual memory wall" in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.

[34] Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

Yanxu Mao,Peipei Liu,Tiehan Cui,Congying Liu,Mingzhe Xing,Datao You

Main category: cs.CL

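TL;DR: This paper proposes the JailAgent framework, which carries out jailbreak attacks by implicitly manipulating an LLM agent's reasoning trajectory and memory retrieval without modifying the user prompt, and adapts across models and scenarios.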

Details Motivation: Existing red-teaming methods rely on modifying user prompts, adapt poorly to new data, and may affect agent performance; a method for assessing security threats without modifying the prompt is needed. Method: Proposes the JailAgent framework with three stages (Trigger Extraction, Reasoning Hijacking, and Constraint Tightening), combining precise trigger identification, real-time adaptive mechanisms, and an optimized objective function. Result: JailAgent performs strongly in cross-model and cross-scenario settings, achieving jailbreaks without modifying the user prompt. Conclusion: JailAgent offers a new approach to security evaluation of LLM agents, showing that implicit manipulation of reasoning and memory is effective and improving the robustness and generality of red-team testing. Abstract: With the widespread application of LLM-based agents across various domains, their complexity has introduced new security threats. Existing red-team methods mostly rely on modifying user prompts, which lack adaptability to new data and may impact the agent's performance. To address the challenge, this paper proposes the JailAgent framework, which completely avoids modifying the user prompt. Specifically, it implicitly manipulates the agent's reasoning trajectory and memory retrieval with three key stages: Trigger Extraction, Reasoning Hijacking, and Constraint Tightening. Through precise trigger identification, real-time adaptive mechanisms, and an optimized objective function, JailAgent demonstrates outstanding performance in cross-model and cross-scenario environments.

[35] AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

Yu Li,Chenyang Shao,Xinyang Liu,Ruotong Zhao,Peijie Liu,Hongyuan Su,Zhibin Chen,Qinglong Yang,Anjie Xu,Yi Fang,Qingbin Zeng,Tianxing Li,Jingbo Xu,Fengli Xu,Yong Li,Tie-Yan Liu

Main category: cs.CL

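TL;DR: This paper presents AutoSOTA, an end-to-end automated AI research system that uses multi-agent collaboration to automatically reproduce and improve SOTA models, discovering 105 new SOTA models across multiple domains at an average of about five hours per paper.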

Details Motivation: AI research depends on lengthy cycles of reproduction, debugging, and iteration, creating a need for systems that accelerate the whole empirical model-optimization pipeline. Method: Proposes a three-stage framework (resource preparation and goal setting; experiment evaluation; reflection and ideation) and a multi-agent architecture with eight specialized agents that collaboratively ground papers to code, initialize and repair environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity. Result: Evaluated on papers from eight top-tier conferences, the system reproduces and improves on the original methods to yield 105 new SOTA models, averaging about five hours per paper; case studies span LLM, NLP, CV, time series, and optimization, and go beyond routine hyperparameter tuning to find architectural innovations, algorithmic redesigns, and workflow improvements. Conclusion: End-to-end research automation is not only a performance-optimization tool but also a new form of research infrastructure that reduces repetitive experimental burden and frees human effort for higher-level scientific creativity. Abstract: Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements.
These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.

[36] FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation--Full Version

Dat Nguyen-Cong,Tung Kieu,Hoang Thanh-Tung

Main category: cs.CL

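TL;DR: This paper proposes a new training framework that perturbs the self-conditioning signal to match inference noise and adds a token-level noise-awareness mechanism, improving the robustness and optimization of continuous diffusion language models under few-step sampling and substantially improving inference speed and generation quality.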

Details Motivation: Self-conditioning is crucial for continuous diffusion language models, but its performance degrades markedly under few-step sampling (fast inference), causing errors to accumulate and dominate sample quality. Method: Proposes a new training framework that (1) perturbs the self-conditioning signal to match the inference-time noise level, improving robustness to prior estimation errors, and (2) introduces a token-level noise-awareness mechanism that prevents training saturation and improves optimization. Result: On multiple conditional generation benchmarks, the method surpasses standard continuous diffusion models with up to 400x faster inference and remains competitive with one-step diffusion methods. Conclusion: By explicitly modeling and mitigating self-conditioning errors under few-step sampling, the framework narrows the mismatch between training and inference and improves the practicality and efficiency of diffusion language models. Abstract: Self-conditioning has been central to the success of continuous diffusion language models, as it allows models to correct previous errors. Yet its ability degrades precisely in the regime where diffusion is most attractive for deployment: few-step sampling for fast inference. In this study, we show that when models only have a few denoising steps, inaccurate self-conditioning induces a substantial approximation gap; this mistake compounds across denoising steps and ultimately dominates the sample quality. To address this, we propose a novel training framework that handles these errors during learning by perturbing the self-conditioning signal to match inference noise, improving robustness to prior estimation errors. In addition, we introduce a token-level noise-awareness mechanism that prevents training from saturation, hence improving optimization. Extensive experiments across conditional generation benchmarks demonstrate that our framework surpasses standard continuous diffusion models while providing up to 400x faster inference speed, and remains competitive against other one-step diffusion frameworks.

[37] Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue

Junan Hu,Shudan Guo,Wenqi Liu,Jianhua Yin,Yinwei Wei

Main category: cs.CL

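TL;DR: This paper proposes the Context-Agent framework, which models multi-turn dialogue history as a dynamic tree to better handle non-linearity, branching, and topic shifts in conversation, and builds the NTM benchmark to test performance in long-horizon, non-linear dialogue.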

Details Motivation: Existing LLMs treat dialogue history as a flat linear sequence, which mismatches the inherently hierarchical, branching structure of human conversation and leads to inefficient context use and reduced coherence in long dialogues. Method: Proposes the Context-Agent framework, which represents dialogue history as a dynamic tree supporting multi-branch topic management, and builds the NTM benchmark for evaluating non-linear, long-dialogue ability. Result: Experiments show that Context-Agent significantly improves task completion rates and token efficiency, validating structured context management across multiple LLMs. Conclusion: Modeling dialogue history as a tree reflects the properties of natural conversation more faithfully and is a key path to better handling of complex, dynamic dialogue. Abstract: Large Language Models demonstrate outstanding performance in many language tasks but still face fundamental challenges in managing the non-linear flow of human conversation. The prevalent approach of treating dialogue history as a flat, linear sequence is misaligned with the intrinsically hierarchical and branching structure of natural discourse, leading to inefficient context utilization and a loss of coherence during extended interactions involving topic shifts or instruction refinements. To address this limitation, we introduce Context-Agent, a novel framework that models multi-turn dialogue history as a dynamic tree structure. This approach mirrors the inherent non-linearity of conversation, enabling the model to maintain and navigate multiple dialogue branches corresponding to different topics. Furthermore, to facilitate robust evaluation, we introduce the Non-linear Task Multi-turn Dialogue (NTM) benchmark, specifically designed to assess model performance in long-horizon, non-linear scenarios. Our experiments demonstrate that Context-Agent enhances task completion rates and improves token efficiency across various LLMs, underscoring the value of structured context management for complex, dynamic dialogues. The dataset and code are available at GitHub.
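
One way to picture a dialogue history stored as a dynamic tree is the minimal Python sketch below. It is an illustration only, not the authors' implementation: `TurnNode`, `add_child`, and `branch_context` are hypothetical names, and the example dialogue is invented. It shows how a topic shift can open a sibling branch so that the context for a turn contains only the turns on its own path.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class TurnNode:
    """One dialogue turn stored as a tree node (hypothetical structure)."""
    speaker: str
    text: str
    parent: Optional["TurnNode"] = field(default=None, repr=False)
    children: List["TurnNode"] = field(default_factory=list)

    def add_child(self, speaker: str, text: str) -> "TurnNode":
        """Attach a new turn below this one and return it."""
        child = TurnNode(speaker, text, parent=self)
        self.children.append(child)
        return child


def branch_context(node: TurnNode) -> List[str]:
    """Collect only the turns on the path from the root down to `node`."""
    path = []
    current: Optional[TurnNode] = node
    while current is not None:
        path.append(f"{current.speaker}: {current.text}")
        current = current.parent
    return list(reversed(path))


# Invented example dialogue with one topic shift.
root = TurnNode("user", "Help me plan a trip to Kyoto.")
reply = root.add_child("assistant", "Sure. Hotels or itinerary first?")
hotel_q = reply.add_child("user", "Hotels near the station.")
# The user changes topic, so a sibling branch opens under the same reply.
food_q = reply.add_child("user", "Actually, what about food?")

context = branch_context(food_q)
```

In this toy run, `context` holds the three turns on the food branch and leaves out the hotel question, which sits on a sibling branch. A flat history would have carried all four turns.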

[38] EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents

Xuan Dong,Huanyang Zheng,Tianhao Niu,Zhe Han,Pengzhan Li,Bofei Liu,Zhengyang Liu,Guancheng Li,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

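TL;DR: This paper introduces EpiBench, a new benchmark for evaluating research agents on multi-turn, multimodal, cross-paper evidence integration; experiments show that existing models perform poorly on the task, underscoring the need to improve research-automation capabilities.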

Details Motivation: Existing benchmarks do not systematically evaluate key research capabilities such as proactive literature search, multi-evidence integration, and sustained use of evidence over time. Method: Builds EpiBench, an episodic multi-turn multimodal benchmark based on short research workflows, which requires agents to navigate across papers over multiple turns, align evidence from figures and tables, and answer questions requiring cross-paper comparison and multi-figure integration using the evidence accumulated in memory; also proposes a process-level evaluation framework. Result: The leading model reaches only 29.23% accuracy on the hard split, confirming clear shortcomings of current research agents in multi-turn, multi-evidence reasoning. Conclusion: EpiBench provides an evaluation platform for verifiable and reproducible research agents and points to room for improvement in multi-step research reasoning and evidence integration. Abstract: Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically assessed in existing benchmarks, which largely under-evaluate proactive search, multi-evidence integration and sustained evidence use over time. In this work, we introduce EpiBench, an episodic multi-turn multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the accumulated evidence in the memory to answer objective questions that require cross-paper comparisons and multi-figure integration. EpiBench introduces a process-level evaluation framework for fine-grained testing and diagnosis of research agents. Our experiments show that even the leading model achieves an accuracy of only 29.23% on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows. EpiBench thus provides an evaluation platform for verifiable and reproducible research agents.

[39] THIVLVC: Retrieval Augmented Dependency Parsing for Latin

Luc Pommeret,Thibault Wagret,Jules Deret

Main category: cs.CL

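TL;DR: THIVLVC is a two-stage Latin dependency parsing system that combines retrieval-augmented generation (RAG) with large language model refinement; it markedly improves performance on poetry and reveals annotation inconsistencies between treebanks.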

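Details Motivation: Addresses the performance bottleneck in Latin dependency parsing caused by small treebanks and inconsistent annotation styles, especially for low-resource genres such as poetry. Method: Proposes the two-stage system THIVLVC: the first stage retrieves structurally similar example sentences from the CIRCSE treebank using sentence length and POS n-gram similarity; the second stage uses a large language model to refine the initial UDPipe parse with the retrieved examples and the UD guidelines; two configurations are submitted, one without retrieval and one with retrieval (RAG). Result: CLAS improves by +17 points on poetry (Seneca) and by +1.5 points on prose (Thomas Aquinas); a double-blind error analysis of 300 divergences between the system and the gold standard shows that 53.3% of unanimous annotator decisions favour THIVLVC, exposing annotation inconsistencies within and across treebanks. Conclusion: RAG-augmented LLM refinement is effective for Latin syntactic parsing, especially for stylized text; the system's behaviour also exposes annotation-quality problems in existing treebanks, providing empirical evidence for future data construction. Abstract: We describe THIVLVC, a two-stage system for the EvaLatin 2026 Dependency Parsing task. Given a Latin sentence, we retrieve structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompt a large language model to refine the baseline parse from UDPipe using the retrieved examples and UD annotation guidelines. We submit two configurations: one without retrieval and one with retrieval (RAG). On poetry (Seneca), THIVLVC improves CLAS by +17 points over the UDPipe baseline; on prose (Thomas Aquinas), the gain is +1.5 CLAS. A double-blind error analysis of 300 divergences between our system and the gold standard reveals that, among unanimous annotator decisions, 53.3% favour THIVLVC, showing annotation inconsistencies both within and across treebanks.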
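
The retrieval step, which ranks treebank sentences by sentence length and POS n-gram similarity, can be sketched roughly as follows. This is a guess at the general shape, not the system's actual scoring: the abstract does not say how the two signals are combined, so the Jaccard overlap, the length-ratio factor, and the names `pos_ngrams`, `similarity`, and `retrieve` are all assumptions made for illustration, and the tag sequences are invented.

```python
from typing import List, Set, Tuple


def pos_ngrams(tags: List[str], n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of POS n-grams in a tag sequence."""
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}


def similarity(query: List[str], candidate: List[str], n: int = 2) -> float:
    """Assumed score: Jaccard overlap of POS n-grams times a length ratio."""
    if not query or not candidate:
        return 0.0
    q, c = pos_ngrams(query, n), pos_ngrams(candidate, n)
    union = q | c
    jaccard = len(q & c) / len(union) if union else 0.0
    length_ratio = min(len(query), len(candidate)) / max(len(query), len(candidate))
    return jaccard * length_ratio


def retrieve(query: List[str], treebank: List[List[str]], k: int = 1) -> List[List[str]]:
    """Return the k treebank entries most similar to the query."""
    ranked = sorted(treebank, key=lambda tags: similarity(query, tags), reverse=True)
    return ranked[:k]


# Invented POS sequences standing in for treebank sentences.
query = ["NOUN", "VERB", "ADJ", "NOUN"]
treebank = [
    ["NOUN", "VERB", "ADJ", "NOUN", "PUNCT"],
    ["ADP", "DET", "NOUN"],
    ["VERB"],
]
best = retrieve(query, treebank, k=1)[0]
```

Here the first entry wins because it shares three of four bigrams with the query and is close in length. In a pipeline like the one described, the retrieved sentences and their gold parses would then be placed in the LLM prompt as examples.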

[40] YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset

Peace Busola Falola,Jesujoba O. Alabi,Solomon O. Akinola,Folashade T. Ogunajo,Emmanuel Oluwadunsin Alabi,David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: This paper introduces YoNER, a new multidomain Yorùbá Named Entity Recognition dataset covering five domains and three entity types, along with a Yorùbá-specific language model OyoBERT; it benchmarks models across domains and shows African-centric models outperform general multilingual ones for Yorùbá.

Details Motivation: Limited and domain-specific resources for Yorùbá NER hinder research; existing datasets like MasakhaNER and WikiAnn lack broad domain coverage. Method: Constructed YoNER — a manually annotated, multidomain Yorùbá NER dataset (5 domains, 3 entity types, CoNLL-style) by three native speakers; trained and evaluated transformer models (including newly proposed OyoBERT) in cross-domain, few-shot, and cross-lingual settings. Result: African-centric models surpass multilingual ones on Yorùbá; cross-domain performance drops notably—especially on blogs and movies—while formal domains (news/Wikipedia) transfer better; OyoBERT outperforms multilingual models in in-domain evaluation. Conclusion: YoNER and OyoBERT fill critical resource gaps for Yorùbá NLP; domain diversity and language specificity significantly impact NER performance, highlighting the need for tailored, multidomain resources for low-resource languages. Abstract: Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multidomain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains including Bible, Blogs, Movies, Radio broadcast and Wikipedia, and annotated with three entity types: Person (PER), Organization (ORG) and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. 
We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for blogs and movie domains. Furthermore, we observed that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.

[41] Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs

Hongyuan Yuan,Xinran He,Run Shao,Bolei He,Xianwei Xue,Mengke Chen,Qiutong Pan,Haiwei Wang,Haifeng Li

Main category: cs.CL

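TL;DR: This paper proposes a graph-based chain-of-thought (CoT) optimization framework that converts a linear CoT into a directed acyclic graph (DAG), applies a dual pruning strategy (branch-level and depth-level) to remove redundant reflection (indiscriminate and repetitive), and distills the behavior through three stages (SFT, DPO, and GRPO), cutting reasoning tokens by 42% without hurting accuracy.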

Details Motivation: Existing methods that extend CoT through reinforcement learning tend to cause overthinking (redundant intermediate reasoning) because of sparse rewards; the main cause is inefficient reflection, which shows up as indiscriminate reflection and repetitive reflection. Method: Models each linear CoT as a DAG with explicit dependency edges; proposes a dual pruning strategy (branch-level pruning removes weakly contributing reflection branches, depth-level pruning removes late-stage re-verification); uses SFT for initialization, DPO to prefer concise correct trajectories, and GRPO with a length penalty to jointly optimize correctness and efficiency. Result: Average reasoning tokens drop by 42% while task accuracy is maintained or improved. Conclusion: Efficient reflection requires structured modeling and targeted pruning; a graph structure combined with staged distillation can balance the accuracy and conciseness of LLM reasoning. Abstract: Extending CoT through RL has been widely used to enhance the reasoning capabilities of LLMs. However, due to the sparsity of reward signals, it can also induce undesirable thinking patterns such as overthinking, i.e., generating redundant intermediate reasoning content. In this work, we argue that a major source of such redundancy is inefficient reflection, which often manifests in two problematic patterns: Indiscriminate Reflection, where the model performs broad, low-impact checks throughout reasoning, and Repetitive Reflection, where it repeatedly re-verifies an already established conclusion. To address this, we introduce a graph-based CoT optimization framework. Specifically, we convert each linear CoT into a directed acyclic graph (DAG) with explicit dependency edges, and design a dual pruning strategy: branch-level pruning removes weakly contributing reflection branches, while depth-level pruning eliminates late-stage re-verification. We distill this behavior via a three-stage pipeline: (1) SFT to initialize the policy on pruned concise traces, (2) DPO to prefer correct but less redundant trajectories, and (3) GRPO with length penalty to jointly optimize answer correctness and efficiency. Experiments show that our approach reduces the average reasoning tokens by 42% while maintaining or improving accuracy.
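
To make the DAG view concrete, the toy Python sketch below treats a reasoning trace as steps with dependency edges and drops every step the final answer does not transitively depend on. This is a deliberately simplified stand-in: the abstract does not give the paper's criterion for a "weakly contributing" branch, so plain reachability from the answer is an assumption, and the step names and the `prune_unused_steps` helper are hypothetical.

```python
from typing import Dict, List, Set


def prune_unused_steps(deps: Dict[str, List[str]], answer: str) -> Set[str]:
    """Return the steps that the answer transitively depends on."""
    keep: Set[str] = set()
    stack = [answer]
    while stack:
        step = stack.pop()
        if step in keep:
            continue
        keep.add(step)
        # Follow dependency edges back toward earlier steps.
        stack.extend(deps.get(step, []))
    return keep


# Each step maps to the steps it depends on (invented trace).
deps = {
    "s1": [],
    "s2": ["s1"],
    "reflect_a": ["s2"],       # a check that nothing downstream uses
    "s3": ["s2"],
    "answer": ["s3"],
    "reflect_b": ["answer"],   # late re-verification after the answer
}
kept = prune_unused_steps(deps, "answer")
pruned_trace = [step for step in deps if step in kept]
```

In this example both reflection steps fall away: `reflect_a` is a side branch that feeds nothing, and `reflect_b` comes after the answer is already established. The remaining trace is the chain `s1`, `s2`, `s3`, `answer`.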

[42] See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

Yicheng Ji,Jun Zhang,Jinpeng Chen,Cong Wang,Lidan Shou,Gang Chen,Huan Li

Main category: cs.CL

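TL;DR: This paper proposes LVSpec, a training-free loosely speculative decoding framework for accelerating inference in video large language models (Video-LLMs); by identifying visual-relevant anchor tokens and adding a position-shift tolerant mechanism, it substantially speeds up inference with almost no loss in performance.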

Details Motivation: Existing speculative decoding methods are limited by strict token-by-token exact-match rules and cannot realize their full acceleration potential, while Video-LLMs suffer from high inference latency. Method: LVSpec introduces a lightweight visual-relevant token identification scheme that separates key anchor tokens from redundant filler tokens, and a position-shift tolerant mechanism that accepts tokens that are semantically equivalent but positionally mismatched, enabling loose verification. No additional training is required. Result: Achieves 2.70x and 2.94x speedups on Qwen2.5-VL-32B and LLaVA-OneVision-72B respectively while preserving more than 99.8% of target-model performance; compared with SOTA training-free SD methods, mean accepted length rises by 136% and the speedup ratio by 35%. Conclusion: LVSpec is the first training-free, semantics-aware loosely speculative decoding framework and offers a new paradigm for efficient Video-LLM inference. Abstract: Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves high fidelity and speed: it preserves >99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs.

[43] LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals

Lihao Sun,Hang Dong,Bo Qiao,Qingwei Lin,Dongmei Zhang,Saravan Rajmohan

Main category: cs.CL

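TL;DR: This paper treats chain-of-thought generation in large language models as a structured trajectory through representation space, shows that mathematical reasoning follows functionally ordered, step-specific subspaces across network layers, and finds that correct and incorrect reasoning separate systematically at late stages, which supports mid-reasoning prediction of final-answer correctness; it also proposes a trajectory-based steering method for reasoning correction and length control.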

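Details Motivation: To understand how large language models (LLMs) perform mathematical reasoning, in particular the internal mechanism of Chain-of-Thought generation, in order to improve interpretability, controllability, and reliability. Method: Analyzes the geometric structure of hidden states at each layer during mathematical reasoning to identify how step-specific subspaces evolve; uses trajectory analysis to compare correct and incorrect reasoning paths; builds an intervention framework based on ideal trajectories (trajectory-based steering). Result: Reasoning exhibits a subspace structure whose separability increases with layer depth; correct and incorrect solutions separate systematically at late stages, supporting mid-reasoning prediction (ROC-AUC up to 0.87); the proposed trajectory-based steering enables reasoning correction and length control. Conclusion: Reasoning trajectories are an effective geometric lens for interpreting, predicting, and controlling LLM reasoning behavior; the structure is already present in base models, and reasoning training mainly accelerates convergence rather than reorganizing the representations. Abstract: This work characterizes large language models' chain-of-thought generation as a structured trajectory through representation space. We show that mathematical reasoning traverses functionally ordered, step-specific subspaces that become increasingly separable with layer depth. This structure already exists in base models, while reasoning training primarily accelerates convergence toward termination-related subspaces rather than introducing new representational organization. While early reasoning steps follow similar trajectories, correct and incorrect solutions diverge systematically at late stages. This late-stage divergence enables mid-reasoning prediction of final-answer correctness with ROC-AUC up to 0.87. Furthermore, we introduce trajectory-based steering, an inference-time intervention framework that enables reasoning correction and length control based on derived ideal trajectories. Together, these results establish reasoning trajectories as a geometric lens for interpreting, predicting, and controlling LLM reasoning behavior.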

[44] Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

Zhen Cheng,Hao-Bo Yang,Wan-Yi Huang,Jin-Long Li

Main category: cs.CL

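TL;DR: This paper proposes the Attention Editing framework, which converts an already-trained large language model to a new attention mechanism (such as MLA or GateSWA) without re-pretraining, substantially reducing KV cache overhead while preserving performance, and validates feasibility on domestic Ascend 910B hardware.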

Details Motivation: KV cache memory and bandwidth have become the main bottleneck for LLM inference cost in long-context and long-generation settings; improved architectures such as MLA and sliding-window attention can ease the problem but are hard to integrate into deployed models, because prior methods impose fine-grained structural constraints on both the source model and the target module and are therefore impractical to deploy. Method: Proposes the Attention Editing framework, which replaces the original attention with a learnable target attention module and trains it with two-stage progressive distillation: (1) layer-wise teacher-forced optimization with intermediate activation supervision, to prevent cold-start error accumulation; (2) model-level distillation on next-token distributions, optionally regularized by weak feature matching. Result: MLA and the authors' GateSWA are successfully applied to Qwen3-8B and Qwen3-30B-A3B; the resulting models keep competitive performance while substantially improving inference efficiency; all experiments run on an Ascend 910B cluster, confirming feasibility and robustness on domestic hardware. Conclusion: Large-scale attention architecture conversion is feasible and robust; Attention Editing provides a practical upgrade path that avoids retraining and helps efficient attention mechanisms reach real deployments quickly. Abstract: Key-Value (KV) cache memory and bandwidth increasingly dominate large language model inference cost in long-context and long-generation regimes. Architectures such as multi-head latent attention (MLA) and hybrid sliding-window attention (SWA) can alleviate this bound, but integrating them into existing models remains difficult. Prior methods impose fine-grained structural requirements on both source and target attention modules, which makes them infeasible in practical deployment. We present Attention Editing, a practical framework for converting already-trained large language models (LLMs) to new attention architectures without re-pretraining from scratch. Attention Editing replaces the original attention with a learnable target module and trains it using progressive distillation, consisting of (1) layer-wise teacher-forced optimization with intermediate activation supervision to prevent cold-start error accumulation, and (2) model-level distillation on next-token distributions, optionally regularized by weak feature matching. We instantiate the framework on two different targets, MLA and GateSWA (a gated hybrid SWA design), and apply it to Qwen3-8B and Qwen3-30B-A3B. The resulting models maintain competitive performance while delivering substantial efficiency improvements, demonstrating that large-scale attention conversion is both feasible and robust. Notably, experiments are conducted on an Ascend 910B cluster, offering a practical training case study on domestic hardware.

[45] Dialogue Act Patterns in GenAI-Mediated L2 Oral Practice: A Sequential Analysis of Learner-Chatbot Interactions

Liqun He,Shijun Chen,Mutlu Cukurova,Manolis Mavrikis

Main category: cs.CL

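TL;DR: This study analyzes dialogue act (DA) patterns between Grade 9 Chinese learners of English and a generative AI voice chatbot over a 10-week intervention. High-progress sessions show more learner-initiated questions, whereas low-progress sessions show more clarification requests; prompting-based corrective feedback sequences are more frequent and better timed in high-progress sessions, underlining the importance of feedback type and timing for effective interaction.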

Details Motivation: Although generative AI voice chatbots offer scalable opportunities for L2 oral practice, the interactional processes related to learning gains remain underexplored. Method: Seventy sessions from 12 students were manually annotated with a pedagogy-informed dialogue act coding scheme, yielding 6,957 coded dialogue acts; DA distributions and sequential patterns were then compared between high- and low-progress sessions. Result: High-progress sessions contain more learner-initiated questions, while low-progress sessions have higher rates of clarification requests; in high-progress sessions, prompting-based corrective feedback sequences are more frequent and consistently positioned after learner responses. Conclusion: The findings underline the value of a dialogic lens in GenAI chatbot design, contribute a pedagogy-informed DA coding framework, and inform the design of adaptive GenAI chatbots for L2 education. Abstract: While generative AI (GenAI) voice chatbots offer scalable opportunities for second language (L2) oral practice, the interactional processes related to learners' gains remain underexplored. This study investigates dialogue act (DA) patterns in interactions between Grade 9 Chinese English as a foreign language (EFL) learners and a GenAI voice chatbot over a 10-week intervention. Seventy sessions from 12 students were annotated by human coders using a pedagogy-informed coding scheme, yielding 6,957 coded DAs. DA distributions and sequential patterns were compared between high- and low-progress sessions. At the DA level, high-progress sessions showed more learner-initiated questions, whereas low-progress sessions exhibited higher rates of clarification-seeking, indicating greater comprehension difficulty. At the sequential level, high-progress sessions were characterised by more frequent prompting-based corrective feedback sequences, consistently positioned after learner responses, highlighting the role of feedback type and timing in effective interaction. Overall, these findings underscore the value of a dialogic lens in GenAI chatbot design, contribute a pedagogy-informed DA coding framework, and inform the design of adaptive GenAI chatbots for L2 education.

[46] MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models

Han Jang,Junhyeok Lee,Heeseong Eum,Kyu Sung Choi

Main category: cs.CL

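TL;DR: This paper introduces MedLayBench-V, the first large-scale medical multimodal benchmark for expert-lay semantic alignment, built with a Structured Concept-Grounded Refinement (SCGR) pipeline that enforces semantic equivalence, with the aim of improving the ability of medical vision-language models to express findings in patient-accessible language.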

Details Motivation: Current medical vision-language models are trained mainly on professional literature and struggle to explain imaging findings to patients in plain language; no multimodal simplification benchmark exists to support clinician-patient communication. Method: Proposes the Structured Concept-Grounded Refinement (SCGR) pipeline, which combines UMLS Concept Unique Identifiers (CUIs) with fine-grained entity constraints to guarantee strict semantic equivalence between expert and lay wording. Result: Builds MedLayBench-V, the first large-scale, high-quality, semantically aligned medical multimodal benchmark, supporting the training and evaluation of next-generation Med-VLMs capable of clinician-patient communication. Conclusion: MedLayBench-V fills the gap in medical multimodal simplification resources and provides key infrastructure for developing interpretable, trustworthy Med-VLMs for patient-centered care. Abstract: Medical Vision-Language Models (Med-VLMs) have achieved expert-level proficiency in interpreting diagnostic imaging. However, current models are predominantly trained on professional literature, limiting their ability to communicate findings in the lay register required for patient-centered care. While text-centric research has actively developed resources for simplifying medical jargon, there is a critical absence of large-scale multimodal benchmarks designed to facilitate lay-accessible medical image understanding. To bridge this resource gap, we introduce MedLayBench-V, the first large-scale multimodal benchmark dedicated to expert-lay semantic alignment. Unlike naive simplification approaches that risk hallucination, our dataset is constructed via a Structured Concept-Grounded Refinement (SCGR) pipeline. This method enforces strict semantic equivalence by integrating Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) with micro-level entity constraints. MedLayBench-V provides a verified foundation for training and evaluating next-generation Med-VLMs capable of bridging the communication divide between clinical experts and patients.

[47] Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning

Yanbei Jiang,Amr Keleg,Ryandito Diandaru,Jey Han Lau,Lea Frermann,Biaoyan Fang,Fajri Koto

Main category: cs.CL

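TL;DR: This paper proposes a new fine-tuning framework that couples Steering Token Calibration with Semantic Alignment to improve how well large language models match a target distribution over attributes such as gender, race, and sentiment, significantly outperforming existing methods.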

Details Motivation: The real world is stochastic, yet large language models (LLMs) are mostly evaluated on single-round inference against fixed ground truths, with no assessment of whether their output distribution matches a target distribution (such as real-world statistics or a uniform distribution). Method: Proposes a new fine-tuning framework that couples Steering Token Calibration with Semantic Alignment, using a hybrid objective that combines Kullback-Leibler divergence to anchor the probability mass of latent steering tokens with Kahneman-Tversky Optimization to keep those tokens semantically consistent. Result: Experiments on six diverse datasets show that the method achieves precise distributional control in attribute generation tasks and significantly outperforms baselines such as prompt engineering and DPO. Conclusion: Standard LLMs and common alignment techniques cannot reliably control output distributions; the proposed framework achieves fine-grained, semantically consistent distribution alignment and offers a new paradigm for controllable LLM generation. Abstract: While the real world is inherently stochastic, Large Language Models (LLMs) are predominantly evaluated on single-round inference against fixed ground truths. In this work, we shift the lens to distribution alignment: assessing whether LLMs, when prompted repeatedly, can generate outputs that adhere to a desired target distribution, e.g. reflecting real-world statistics or a uniform distribution. We formulate distribution alignment using the attributes of gender, race, and sentiment within occupational contexts. Our empirical analysis reveals that off-the-shelf LLMs and standard alignment techniques, including prompt engineering and Direct Preference Optimization, fail to reliably control output distributions. To bridge this gap, we propose a novel fine-tuning framework that couples Steering Token Calibration with Semantic Alignment. We introduce a hybrid objective function combining Kullback-Leibler divergence to anchor the probability mass of latent steering tokens and Kahneman-Tversky Optimization to bind these tokens to semantically consistent responses. Experiments across six diverse datasets demonstrate that our approach significantly outperforms baselines, achieving precise distributional control in attribute generation tasks.
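
The notion of distribution alignment can be illustrated with a small Python sketch that measures how far the attribute distribution of repeated generations sits from a target. This is only an evaluation-style illustration under assumptions, not the paper's training objective: the direction KL(empirical || target), the smoothing constant `eps`, the helper names, and the sample data are all chosen here for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def empirical_distribution(samples: List[str], categories: List[str]) -> Dict[str, float]:
    """Relative frequency of each category among the sampled outputs."""
    counts = Counter(samples)
    total = len(samples)
    return {c: counts[c] / total for c in categories}


def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; eps guards against division by zero and log(0)."""
    return sum(
        p[c] * math.log((p[c] + eps) / (q[c] + eps))
        for c in p
        if p[c] > 0
    )


# Invented attribute labels extracted from ten repeated generations.
categories = ["female", "male"]
samples = ["female"] * 2 + ["male"] * 8
target = {"female": 0.5, "male": 0.5}   # uniform target distribution

observed = empirical_distribution(samples, categories)
gap = kl_divergence(observed, target)
```

A `gap` of zero would mean the repeated outputs match the target exactly. The skewed 2-to-8 split above gives a clearly positive value, which is the kind of mismatch a distribution-alignment method tries to reduce.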

[48] Identifying Influential N-grams in Confidence Calibration via Regression Analysis

Shintaro Ozaki,Wataru Hashimoto,Hidetaka Kamigaito,Katsuhiko Hayashi,Taro Watanabe

Main category: cs.CL

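TL;DR: This paper uses regression to identify the linguistic expressions in large language model (LLM) reasoning that are associated with confidence, finding that specific n-grams contribute markedly to overconfidence; it further verifies that these expressions have a causal effect and shows that suppressing them alone calibrates confidence without hurting performance.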

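Details Motivation: Explicit reasoning improves LLM performance, but models often remain overconfident in ways that are inconsistent with the uncertainty they express linguistically; the linguistic origin of this behavior needs to be understood so that it can be calibrated. Method: Applies regression with the confidence of linguistic expressions in the reasoning part as the dependent variable, systematically analyzes the association between n-grams and confidence, and checks the actual effect of the extracted expressions through causal tests and ablation. Result: Overconfidence in LLM reasoning is confirmed across multiple models and QA benchmarks; several n-grams strongly associated with overconfidence are identified, some of which coincide with cue phrases deliberately inserted in test-time scaling; suppressing these expressions calibrates confidence without reducing performance. Conclusion: LLM overconfidence can be attributed to specific linguistic expressions, and targeted suppression of those expressions calibrates confidence, pointing to a lightweight, efficient calibration approach that needs no fine-tuning. Abstract: While large language models (LLMs) improve performance by explicit reasoning, their responses are often overconfident, even though they include linguistic expressions demonstrating uncertainty. In this work, we identify what linguistic expressions are related to confidence by applying the regression method. Specifically, we predict confidence of those linguistic expressions in the reasoning parts of LLMs as the dependent variables and analyze the relationship between a specific n-gram and confidence. Across multiple models and QA benchmarks, we show that LLMs remain overconfident when reasoning is involved and attribute this behavior to specific linguistic information. Interestingly, several of the extracted expressions coincide with cue phrases intentionally inserted in test-time scaling to improve reasoning performance. Through our test on causality and verification that the extracted linguistic information truly affects confidence, we reveal that confidence calibration is possible by simply suppressing those overconfident expressions without drops in performance.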

[49] PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?

Yusen Hou,Weicai Long,Haitao Hu,Houcheng Su,Junning Feng,Yanlin Zhang

Main category: cs.CL

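TL;DR: This paper introduces PhageBench, the first benchmark for evaluating phage genome understanding, with 5,600 high-quality samples covering five core tasks across three stages (screening, quality control, and phenotype annotation); experiments show that general-purpose LLMs beat random baselines on some tasks (such as phage contig identification and host prediction) but fall clearly short on complex reasoning tasks involving long-range dependencies and fine-grained functional localization.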

Details Motivation: General-purpose LLMs are good at understanding biological text, but their ability to interpret raw nucleotide sequences directly and reason biologically about them is unclear; a dedicated benchmark is needed to evaluate their understanding of phage genomes systematically. Method: Builds PhageBench, the first benchmark for phage genome understanding, covering five core task types across three stages (screening, quality control, and phenotype annotation) with 5,600 high-quality samples, and systematically evaluates eight LLMs. Result: General-purpose reasoning models significantly outperform random baselines on phage contig identification and host prediction, but perform poorly on complex reasoning tasks that require long-range dependencies and fine-grained functional localization. Conclusion: Current LLMs show a basic capacity for phage genome understanding but cannot yet handle complex biological reasoning; next-generation models with stronger sequence reasoning are needed. Abstract: Bacteriophages, often referred to as the dark matter of the biosphere, play a critical role in regulating microbial ecosystems and in antibiotic alternatives. Thus, accurate interpretation of their genomes holds significant scientific and practical value. While general-purpose Large Language Models (LLMs) excel at understanding biological texts, their ability to directly interpret raw nucleotide sequences and perform biological reasoning remains underexplored. To address this, we introduce PhageBench, the first benchmark designed to evaluate phage genome understanding by mirroring the workflow of bioinformatics experts. The dataset contains 5,600 high-quality samples covering five core tasks across three stages: Screening, Quality Control, and Phenotype Annotation. Our evaluation of eight LLMs reveals that general-purpose reasoning models significantly outperform random baselines in phage contig identification and host prediction, demonstrating promising potential for genomic understanding. However, they exhibit significant limitations in complex reasoning tasks involving long-range dependencies and fine-grained functional localization. These findings highlight the necessity of developing next-generation models with enhanced reasoning capabilities for biological sequences.

[50] What Models Know, How Well They Know It: Knowledge-Weighted Fine-Tuning for Learning When to Say "I Don't Know"

Joosung Lee,Hwiyeol Jo,Donghyeon Ko,Kyubyung Chae,Cheonbok Park,Jeonghoon Kim

Main category: cs.CL

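TL;DR: This paper proposes a fine-grained, instance-level knowledge scoring method based on multi-sampled inference to ease the knowledge misalignment between pre-training and fine-tuning in large language models, reducing hallucination and encouraging the model to state its uncertainty explicitly on questions it cannot answer.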

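Details Motivation: Large language models hallucinate, largely because of knowledge misalignment between the pre-training and fine-tuning stages. Method: Reliably estimates a fine-grained, instance-level knowledge score through multi-sampled inference and uses it to scale the learning signal dynamically; for questions outside the model's knowledge, the model is encouraged to output "I don't know" explicitly. Result: The model expresses uncertainty more accurately while keeping high accuracy on questions it knows; the proposed uncertainty metrics show that accurately separating known from unknown instances consistently improves performance. Conclusion: Instance-level knowledge scoring and explicit modeling of uncertainty can mitigate hallucinations caused by knowledge misalignment and improve model reliability and controllability. Abstract: While large language models (LLMs) demonstrate strong capabilities across diverse user queries, they still suffer from hallucinations, often arising from knowledge misalignment between pre-training and fine-tuning. To address this misalignment, we reliably estimate a fine-grained, instance-level knowledge score via multi-sampled inference. Using the knowledge score, we scale the learning signal according to the model's existing knowledge, while encouraging explicit "I don't know" responses for out-of-scope queries. Experimental results show that this approach allows the model to explicitly express uncertainty when it lacks knowledge, while maintaining accuracy on questions it can answer. Furthermore, we propose evaluation metrics for uncertainty, showing that accurate discrimination between known and unknown instances consistently improves performance.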
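
A rough Python sketch of what an instance-level knowledge score from multi-sampled inference might look like is given below. Everything specific in it is an assumption made for illustration: the abstract does not define the score, so the exact-match fraction, the `unknown_threshold` cut-off, the use of the score as a loss weight, and the function names are hypothetical, and the sampled answers are invented.

```python
from typing import List, Tuple


def knowledge_score(sampled_answers: List[str], reference: str) -> float:
    """Fraction of sampled answers that match the reference (assumed scoring rule)."""
    if not sampled_answers:
        return 0.0
    ref = reference.strip().lower()
    hits = sum(answer.strip().lower() == ref for answer in sampled_answers)
    return hits / len(sampled_answers)


def training_target(reference: str, score: float,
                    unknown_threshold: float = 0.2) -> Tuple[str, float]:
    """Pick the fine-tuning target and a loss weight from the knowledge score.

    Below the (hypothetical) threshold the target becomes an explicit
    refusal; otherwise the reference is kept and weighted by the score.
    """
    if score < unknown_threshold:
        return "I don't know", 1.0
    return reference, score


# Invented samples from asking the same question four times.
samples = ["Paris", "paris ", "Lyon", "Paris"]
score = knowledge_score(samples, "Paris")
target, weight = training_target("Paris", score)
```

With three of four samples matching, the question counts as mostly known, so the reference answer is kept as the target. A question the model never answers correctly would instead be paired with the explicit "I don't know" target.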

[51] Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation

Abdullah Mazhar,Het Riteshkumar Shah,Aseem Srivastava,Smriti Joshi,Md Shad Akhtar

Main category: cs.CL

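TL;DR: This paper proposes the CARE framework and the FAITH-M benchmark for evaluating whether AI psychotherapy dialogue follows six clinical principles; CARE significantly outperforms baseline models.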

Details Motivation: Large language models used in mental-health applications lack a systematic mechanism for evaluating adherence to core psychotherapeutic principles; existing evaluation looks only at surface fluency and ignores clinical appropriateness. Method: Builds FAITH-M, an expert-annotated benchmark with fine-grained ordinal rating criteria for six therapeutic principles, and proposes CARE, a multi-stage evaluation framework that combines intra-dialogue context, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning. Result: CARE reaches an F1 of 63.34 on FAITH-M, a 64.26% relative improvement over the Qwen3 baseline (38.56); expert assessment and cross-domain tests confirm its robustness, though modeling implicit clinical nuance remains challenging. Conclusion: CARE provides a clinically grounded framework for evaluating therapeutic fidelity in AI mental-health systems. Abstract: The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fluency. While recent systems exhibit conversational competence, they lack structured mechanisms to evaluate adherence to core therapeutic principles. In this paper, we study the problem of evaluating AI-generated therapist-like responses for clinically grounded appropriateness and effectiveness. We assess each therapist utterance along six therapeutic principles: non-judgmental acceptance, warmth, respect for autonomy, active listening, reflective understanding, and situational appropriateness using a fine-grained ordinal scale. We introduce FAITH-M, a benchmark annotated with expert-assigned ordinal ratings, and propose CARE, a multi-stage evaluation framework that integrates intra-dialogue context, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning. Experiments show that CARE achieves an F1 score of 63.34, versus 38.56 for the strong Qwen3 baseline that also serves as its backbone, a relative improvement of 64.26%, indicating that gains arise from structured reasoning and contextual modeling rather than backbone capacity alone. Expert assessment and external dataset evaluations further demonstrate robustness under domain shift, while highlighting challenges in modelling implicit clinical nuance. Overall, CARE provides a clinically grounded framework for evaluating therapeutic fidelity in AI mental health systems.
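
The 64.26 figure quoted in the abstract is consistent with a relative (not absolute) gain in F1 over the baseline. The short Python check below reproduces it from the two reported scores; the function name `relative_improvement` is simply a label chosen here.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


care_f1 = 63.34       # CARE, as reported in the abstract
baseline_f1 = 38.56   # Qwen3 baseline, as reported in the abstract

absolute_gain = care_f1 - baseline_f1
relative_gain = relative_improvement(care_f1, baseline_f1)
```

The absolute gain is about 24.78 F1 points, and dividing by the baseline score gives roughly 64.26%, which matches the reported number.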

[52] CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training

Seungyoon Lee,Minhyuk Kim,Seongtae Hong,Youngjoon Jang,Dongsuk Oh,Heuiseok Lim

Main category: cs.CL

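TL;DR: This paper proposes CLEAR, a new loss function that uses a reverse-training scheme to strengthen language alignment in cross-lingual retrieval, improving low-resource languages in particular while avoiding degradation in English.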

Details Motivation: Existing multilingual embedding models underperform in cross-lingual settings because linguistic resources are imbalanced and cross-lingual alignment receives little attention during training; standard contrastive learning struggles to capture fundamental alignment between languages and can hurt high-resource languages such as English. Method: Proposes the CLEAR loss function, which adopts a reverse-training scheme and uses an English passage as a bridge to strengthen alignment between the target language and English, thereby improving cross-lingual retrieval. Result: Experiments show that CLEAR improves cross-lingual retrieval by up to 15%, with the largest gains in low-resource languages and almost no loss in English; it also performs well in multilingual training. Conclusion: CLEAR is an effective, scalable method for improving cross-lingual retrieval under imbalanced resources, and the code is open source. Abstract: Existing multilingual embedding models often encounter challenges in cross-lingual scenarios due to imbalanced linguistic resources and limited consideration of cross-lingual alignment during training. Although standardized contrastive learning approaches for cross-lingual adaptation are widely adopted, they may struggle to capture fundamental alignment between languages and degrade performance in well-aligned languages such as English. To address these challenges, we propose Cross-Lingual Enhancement in Retrieval via Reverse-training (CLEAR), a novel loss function utilizing a reverse training scheme to improve retrieval performance across diverse cross-lingual retrieval scenarios. CLEAR leverages an English passage as a bridge to strengthen alignments between the target language and English, ensuring robust performance in the cross-lingual retrieval task. Our extensive experiments demonstrate that CLEAR achieves notable improvements in cross-lingual scenarios, with gains up to 15%, particularly in low-resource languages, while minimizing performance degradation in English. Furthermore, our findings highlight that CLEAR offers promising effectiveness even in multilingual training, suggesting its potential for broad application and scalability. We release the code at https://github.com/dltmddbs100/CLEAR.

[53] "OK Aura, Be Fair With Me": Demographics-Agnostic Training for Bias Mitigation in Wake-up Word Detection

Fernando López,Paula Delgado-Santos,Pablo Gómez,David Solans,Jordi Luque

Main category: cs.CL

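TL;DR: This study examines whether demographics-agnostic training techniques reduce demographic bias (sex, age, and accent) in wake-up word detection; experiments show that these methods markedly improve model fairness.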

Details Motivation: Because of persistent demographic biases, fair wake-up word detection across diverse speaker populations remains a critical challenge. Method: Trains without demographic labels on the OK Aura database and explores two strategies: data augmentation and knowledge distillation from pre-trained speech models. Result: The evaluated techniques markedly reduce demographic bias; one of them lowers Predictive Disparity by 39.94% for sex, 83.65% for age, and 40.48% for accent. Conclusion: Label-agnostic training is effective at promoting fairness in wake-up word detection. Abstract: Voice-based interfaces are widely used; however, achieving fair Wake-up Word detection across diverse speaker populations remains a critical challenge due to persistent demographic biases. This study evaluates the effectiveness of demographics-agnostic training techniques in mitigating performance disparities among speakers of varying sex, age, and accent. We utilize the OK Aura database for our experiments, employing a training methodology that excludes demographic labels, which are reserved for evaluation purposes. We explore (i) data augmentation techniques to enhance model generalization and (ii) knowledge distillation of pre-trained foundational speech models. The experimental results indicate that these demographics-agnostic training techniques markedly reduce demographic bias, leading to a more equitable performance profile across different speaker groups. Specifically, one of the evaluated techniques achieves a Predictive Disparity reduction of 39.94% for sex, 83.65% for age, and 40.48% for accent when compared to the baseline. This study highlights the effectiveness of label-agnostic methodologies in fostering fairness in Wake-up Word detection.

[54] AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

Yuanfu Sun,Kang Li,Dongzhe Fan,Jiajin Liu,Qiaoyu Tan

Main category: cs.CL

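TL;DR: This paper proposes the Agentic Graph Learning (AGL) paradigm and AgentGL, the first reinforcement-learning-based framework for it, enabling large language models to navigate and reason over graph data in a topology-aware way and substantially outperforming existing methods on node classification and link prediction.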

Details Motivation: Existing agentic frameworks treat external information as unstructured text and fail to exploit the topological dependencies inherent in real-world data, which limits autonomous LLM reasoning in relational environments. Method: Proposes the AGL paradigm, which reframes graph learning as an interleaved process of topology-aware navigation and LLM inference, and designs the AgentGL framework with graph-native multi-scale exploration tools, a search-constrained thinking mechanism that balances accuracy and efficiency, and a graph-conditioned curriculum reinforcement learning strategy that stabilizes long-horizon policy training. Result: Across multiple Text-Attributed Graph (TAG) benchmarks and LLM backbones, AgentGL substantially outperforms strong GraphLLM and GraphRAG baselines, with absolute improvements of up to 17.5% in node classification and 28.4% in link prediction. Conclusion: AGL is an important new direction for enabling large language models to navigate and reason autonomously over complex relational environments, opening a frontier for graph-augmented agent research. Abstract: Large Language Models (LLMs) increasingly rely on agentic capabilities (iterative retrieval, tool use, and decision-making) to overcome the limits of static, parametric knowledge. Yet existing agentic frameworks treat external information as unstructured text and fail to leverage the topological dependencies inherent in real-world data. To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference. Specifically, we propose AgentGL, the first reinforcement learning (RL)-driven framework for AGL. AgentGL equips an LLM agent with graph-native tools for multi-scale exploration, regulates tool usage via search-constrained thinking to balance accuracy and efficiency, and employs a graph-conditioned curriculum RL strategy to stabilize long-horizon policy learning without step-wise supervision. Across diverse Text-Attributed Graph (TAG) benchmarks and multiple LLM backbones, AgentGL substantially outperforms strong GraphLLMs and GraphRAG baselines, achieving absolute improvements of up to 17.5% in node classification and 28.4% in link prediction. These results demonstrate that AGL is a promising frontier for enabling LLMs to autonomously navigate and reason over complex relational environments. The code is publicly available at https://github.com/sunyuanfu/AgentGL.

[55] Evaluating Learner Representations for Differentiation Prior to Instructional Outcomes

Junsoo Park,Youssef Medhat,Htet Phyo Wai,Ploy Thajchayapong,Ashok K. Goel

Main category: cs.CL

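TL;DR: This paper proposes "distinctiveness", a representation evaluation metric that measures how well learner representations preserve individual differences without labels or task-specific evaluation, and shows that representations built from a student's overall interaction patterns outperform single-interaction representations.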

Details Motivation: Learner representations are central to educational AI, yet it is unclear whether they capture meaningful differences between students, especially when instructional outcomes are unavailable or highly context-dependent. Method: Proposes the distinctiveness metric, which uses pairwise distances to quantify how far each learner is separated from the other learners in the cohort, without clustering, labels, or task-specific evaluation; in an online learning environment, student questions collected through a conversational AI agent are used to compare individual-question representations with representations that aggregate interaction history. Result: Learner-level representations show higher separation, stronger clustering structure, and more reliable pairwise discrimination than interaction-level representations. Conclusion: Learner representations can be evaluated independently of instructional outcomes, and distinctiveness can serve as a pre-deployment diagnostic for judging whether a representation supports differentiated modeling or personalization. Abstract: Learner representations play a central role in educational AI systems, yet it is often unclear whether they preserve meaningful differences between students when instructional outcomes are unavailable or highly context-dependent. This work examines how to evaluate learner representations based on whether they retain separation between learners under a shared comparison rule. We introduce distinctiveness, a representation-level measure that evaluates how each learner differs from others in the cohort using pairwise distances, without requiring clustering, labels, or task-specific evaluation. Using student-authored questions collected through a conversational AI agent in an online learning environment, we compare representations based on individual questions with representations that aggregate patterns across a student's interactions over time. Results show that learner-level representations yield higher separation, stronger clustering structure, and more reliable pairwise discrimination than interaction-level representations. These findings demonstrate that learner representations can be evaluated independently of instructional outcomes and provide a practical pre-deployment criterion using distinctiveness as a diagnostic metric for assessing whether a representation supports differentiated modeling or personalization.
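
The abstract describes distinctiveness only as a pairwise-distance measure of how each learner differs from the rest of the cohort. A minimal sketch of one such measure follows. It assumes Euclidean distance and a simple mean over all other learners; both choices, and the function name `distinctiveness`, are illustrative assumptions rather than the authors' definition.

```python
import math
from typing import List, Sequence


def distinctiveness(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Mean Euclidean distance from each learner to every other learner.

    Assumption: the paper's exact distance and aggregation rule are not
    given in the abstract; this is one plausible pairwise formulation.
    """
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two learners to compare")
    scores = []
    for i, a in enumerate(embeddings):
        total = sum(math.dist(a, b) for j, b in enumerate(embeddings) if j != i)
        scores.append(total / (n - 1))
    return scores


# Toy cohort of three learners in a 2-D representation space (made-up values).
learners = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
scores = distinctiveness(learners)
for idx, score in enumerate(scores):
    print(f"learner {idx}: distinctiveness = {score:.3f}")
```

No labels or clustering are involved: the score depends only on the geometry of the representation, which is what allows it to be computed before any instructional outcome is available.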

[56] LoRM: Learning the Language of Rotating Machinery for Self-Supervised Condition Monitoring

Xiao Qin,Xingyi Song,Tong Liu,Hatim Laalej,Zepeng Liu,Yunpeng Zhu,Ligang He

Main category: cs.CL

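TL;DR: LoRM is a self-supervised framework for understanding multi-modal rotating-machinery signals. It treats sensor signals as a "machine language", models their evolution through tokenisation and sequence prediction, lightly fine-tunes a pre-trained language model, and uses prediction error as a health indicator for real-time condition monitoring.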

Details Motivation: Conventional signal-processing methods depend on hand-crafted features, generalise poorly, and struggle in complex industrial settings, while training large models from scratch is costly. This work explores whether the language-model paradigm is viable for industrial signal analysis and aims to establish a general, transferable, real-time route to fault monitoring. Method: Splits multi-sensor time series into a context segment (kept continuous) and a target segment (quantised into discrete tokens) to form an NLP-style sequence-prediction task; fine-tunes only part of the parameters of a general-purpose pre-trained language model on industrial signals; uses token-prediction error as the equipment health indicator. Result: Tool condition monitoring (TCM) experiments show stable real-time tracking and strong cross-tool generalisation, supporting the method's effectiveness and practicality. Conclusion: LoRM transfers the language-modelling paradigm to rotating-machinery signal analysis, providing a real-time condition-monitoring approach that needs no hand-crafted features, has low training cost, and is highly interpretable. Abstract: We present LoRM (Language of Rotating Machinery), a self-supervised framework for multi-modal rotating-machinery signal understanding and real-time condition monitoring. LoRM is built on the idea that rotating-machinery signals can be viewed as a machine language: local signals can be tokenised into discrete symbolic units, and their future evolution can be predicted from observed multi-sensor context. Unlike conventional signal-processing methods that rely on hand-crafted transforms and features, LoRM reformulates multi-modal sensor data as a token-based sequence-prediction problem. For each data window, the observed context segment is retained in continuous form, while the future target segment of each sensing channel is quantised into a discrete token. Then, efficient knowledge transfer is achieved by partially fine-tuning a general-purpose pre-trained language model on industrial signals, avoiding the need to train a large model from scratch. Finally, condition monitoring is performed by tracking token-prediction errors as a health indicator, where increasing errors indicate degradation. In-situ tool condition monitoring (TCM) experiments demonstrate stable real-time tracking and strong cross-tool generalisation, showing that LoRM provides a practical bridge between language modelling and industrial signal analysis. The source code is publicly available at https://github.com/Q159753258/LormPHM.
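
The monitoring loop described above (quantise the target segment into a token, predict it, track the error) can be sketched as follows. This is a toy illustration only: uniform binning, the absolute token distance, and the moving-average window are assumptions made here, and the predicted tokens are hard-coded instead of coming from a fine-tuned language model.

```python
from typing import List


def quantise(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Map a continuous value to a discrete token id via uniform binning (assumed scheme)."""
    clipped = min(max(value, lo), hi)
    return min(int((clipped - lo) / (hi - lo) * n_bins), n_bins - 1)


def health_indicator(predicted: List[int], observed: List[int], window: int) -> List[float]:
    """Moving average of token-prediction error; rising values suggest degradation."""
    errors = [abs(p - o) for p, o in zip(predicted, observed)]
    indicator = []
    for i in range(len(errors)):
        recent = errors[max(0, i - window + 1): i + 1]
        indicator.append(sum(recent) / len(recent))
    return indicator


# Hypothetical data: the model keeps predicting token 2, while the observed
# signal drifts away from it as the tool wears.
observed_signal = [0.1, 0.0, 0.2, 0.6, -0.9, -1.0]
observed_tokens = [quantise(v, -1.0, 1.0, 4) for v in observed_signal]
predicted_tokens = [2] * len(observed_tokens)

indicator = health_indicator(predicted_tokens, observed_tokens, window=3)
print(observed_tokens)
print([round(x, 3) for x in indicator])
```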

[57] Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models

Xiangming Gu,Soham De,Larisa Markeeva,Petar Veličković,Razvan Pascanu

Main category: cs.CL

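TL;DR: This paper compares sequential and parallel sampling in large reasoning models (LRMs) and finds that parallel sampling performs better; controlled experiments indicate that the gap is driven mainly by reduced exploration in sequential sampling, caused by conditioning on earlier answers, and not by the aggregator or context length.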

Details Motivation: Although sequential sampling is in theory more expressive, parallel sampling performs better in practice on complex tasks such as math and coding; the underlying reason is unclear and calls for systematic analysis. Method: Runs controlled experiments on math and coding tasks across several model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) to test three hypotheses: (i) the effect of the aggregator, (ii) harm from longer contexts, and (iii) suppressed exploration under sequential sampling. Result: The evidence indicates that aggregation and context length are not the main causes of the gap; insufficient exploration, caused by conditioning on previous answers, is the key factor behind the weaker performance of sequential sampling. Conclusion: The disadvantage of sequential sampling stems mainly from exploration being limited by conditional dependence, and the advantage of parallel sampling comes from more independent exploration. Abstract: Large Reasoning Models (LRMs) have shown remarkable performance on challenging questions, such as math and coding. However, to obtain a high-quality solution, one may need to sample more than once. In principle, there are two sampling strategies that can be composed to form more complex processes: sequential sampling and parallel sampling. In this paper, we first compare these two approaches with rigor, and observe, in line with previous work, that parallel sampling seems to outperform sequential sampling even though the latter should have more representation power. To understand the underlying reasons, we make three hypotheses about this behavior: (i) parallel sampling outperforms due to the aggregator operator; (ii) sequential sampling is harmed by needing to use longer contexts; (iii) sequential sampling leads to less exploration due to conditioning on previous answers. The empirical evidence on various model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding) suggests that the aggregation and context length do not seem to be the main culprit behind the performance gap. In contrast, the lack of exploration seems to play a considerably larger role, and we argue that this is one main cause of the performance gap.
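
To make the two strategies concrete, the sketch below contrasts them with a stand-in model. Everything here is hypothetical: `toy_model` is a hard-coded stub that simply repeats any earlier answer it sees, chosen to illustrate hypothesis (iii), and majority voting is only one possible aggregator. It says nothing about how real LRMs behave.

```python
import itertools
from collections import Counter
from typing import Callable, List


def parallel_sample(sample: Callable[[str], str], question: str, k: int) -> List[str]:
    """Draw k answers independently from the same prompt."""
    return [sample(question) for _ in range(k)]


def sequential_sample(sample: Callable[[str], str], question: str, k: int) -> List[str]:
    """Draw k answers one after another, each conditioned on all earlier answers."""
    answers: List[str] = []
    for _ in range(k):
        context = question + "".join(f"\nPrevious answer: {a}" for a in answers)
        answers.append(sample(context))
    return answers


def majority_vote(answers: List[str]) -> str:
    """A simple aggregator: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stub: explores when unconditioned, anchors on the last answer otherwise.
_canned = itertools.cycle(["42", "41", "42"])


def toy_model(prompt: str) -> str:
    if "Previous answer: " in prompt:
        return prompt.rsplit("Previous answer: ", 1)[1]
    return next(_canned)


parallel_answers = parallel_sample(toy_model, "What is 6 * 7?", k=3)
sequential_answers = sequential_sample(toy_model, "What is 6 * 7?", k=3)
print("parallel:", parallel_answers, "->", majority_vote(parallel_answers))
print("sequential:", sequential_answers, "distinct:", len(set(sequential_answers)))
```

Counting distinct answers per question, as in the last line, is one crude proxy for exploration under each strategy.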

[58] Mechanistic Circuit-Based Knowledge Editing in Large Language Models

Tianyi Zhao,Yinhan He,Wendy Zheng,Chen Chen

Main category: cs.CL

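TL;DR: This paper proposes MCircKE, a framework that identifies and edits causal circuits to address the "Reasoning Gap" in knowledge editing for large language models, improving multi-hop reasoning.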

Details Motivation: Existing knowledge editing methods patch isolated facts well but exhibit a "Reasoning Gap" when the edited fact must be used in multi-step reasoning chains. Method: MCircKE follows a "map-and-adapt" editing procedure: it first identifies the causal circuit responsible for a specific reasoning task (covering both fact storage and the routing of logical consequences), then edits parameters only within that mapped circuit. Result: Extensive experiments on the MQuAKE-3K benchmark show that MCircKE is effective for knowledge editing in multi-hop reasoning. Conclusion: By identifying and editing mechanistic circuits, MCircKE narrows the reasoning gap in knowledge editing and offers a new approach to updating LLM knowledge in dynamic environments. Abstract: Deploying Large Language Models (LLMs) in real-world dynamic environments raises the challenge of updating their pre-trained knowledge. While existing knowledge editing methods can reliably patch isolated facts, they frequently suffer from a "Reasoning Gap", where the model recalls the edited fact but fails to utilize it in multi-step reasoning chains. To bridge this gap, we introduce MCircKE (Mechanistic Circuit-based Knowledge Editing), a novel framework that enables a precise "map-and-adapt" editing procedure. MCircKE first identifies the causal circuits responsible for a specific reasoning task, capturing both the storage of the fact and the routing of its logical consequences. It then surgically updates parameters exclusively within this mapped circuit. Extensive experiments on the MQuAKE-3K benchmark demonstrate the effectiveness of the proposed method for multi-hop reasoning in knowledge editing.

[59] FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth, froM Children to Adolescents

Cherifa Ben Khelil,Jean-Yves Antoine,Anaïs Halftermeyer,Frédéric Rayar,Mathieu Thebaud

Main category: cs.CL

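TL;DR: This paper introduces French-YMCA, a French corpus designed for children and adolescents. It contains 39,200 text files totalling about 22.47 million words, draws on diverse sources, has consistent grammar and spelling, is openly accessible, and is intended to support training models that understand young people's language.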

Details Motivation: Children's language skills are still developing and differ markedly from those of adults, so corpus resources built specifically for their language are needed. Method: Builds a large, multi-source, standardised, and publicly accessible French corpus for children and adolescents (French-YMCA), with 39,200 text files and 22,471,898 words in total. Result: The resulting French-YMCA corpus is diverse, linguistically consistent, and open, and can be used to train language models suited to understanding and generating language for young users. Conclusion: French-YMCA provides a solid basis for developing age-appropriate language technology matched to young users' comprehension level, with the potential to improve the quality of digital interactions. Abstract: In this paper, we introduce the French-YMCA corpus, a new linguistic resource specifically tailored for children and adolescents. The motivation for building this corpus is clear: children have unique language requirements, as their language skills are in constant evolution and differ from those of adults. With an extensive collection of 39,200 text files, the French-YMCA corpus encompasses a total of 22,471,898 words. It distinguishes itself through its diverse sources, consistent grammar and spelling, and a commitment to open online accessibility for all. Such a corpus can serve as the foundation for training language models that understand and anticipate young people's language, thereby enhancing the quality of digital interactions and ensuring that responses and suggestions are age-appropriate and adapted to the comprehension level of users of this age.

[60] FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

Michael Krumdick,Varshini Reddy,Shivani Chaudhary,William Day,Maarij Ahmed,Hayan Haqqi,Muhammad Ahsen Fahim,Hanzallah Amjad,Ahmad Orakzai,Aqsa Gul,Chris Tanner

Main category: cs.CL

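TL;DR: This paper presents FrontierFinance, a long-horizon benchmark for financial modeling with 25 complex tasks, each requiring more than 18 hours of skilled human labor on average. It is built and graded with human experts to assess large language models in realistic financial settings, and shows that current LLMs still lag well behind human professionals.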

Details Motivation: Existing AI benchmarks do not measure the tasks that define practical professional expertise in knowledge-intensive fields, finance in particular; finance has high AI exposure risk but lacks robust benchmarks, and current LLM deployments lack clear accountability mechanisms. Method: Builds the FrontierFinance benchmark, designed together with financial professionals and covering 25 long-horizon modeling tasks across five core finance models; each task comes with a detailed rubric; human experts define the tasks, write the rubrics, grade outputs, and complete every task themselves as a baseline. Result: Human experts score higher on average and are more likely to produce client-ready outputs than current state-of-the-art large language models. Conclusion: FrontierFinance fills a gap in evaluating AI capability in high-stakes professional domains, shows that current LLMs cannot yet replace human professionals on complex financial modeling tasks, and underlines the need for evaluation and accountability mechanisms that reflect real workflows. Abstract: As concerns surrounding AI-driven labor displacement intensify in knowledge-intensive sectors, existing benchmarks fail to measure performance on tasks that define practical professional expertise. Finance, in particular, has been identified as a domain with high AI exposure risk, yet lacks robust benchmarks to track real-world developments. This gap is compounded by the absence of clear accountability mechanisms in current Large Language Model (LLM) deployments. To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete. Developed with financial professionals, the benchmark reflects industry-standard financial modeling workflows and is paired with detailed rubrics for structured evaluation. We engage human experts to define the tasks, create rubrics, grade LLMs, and perform the tasks themselves as human baselines. We demonstrate that our human experts both receive higher scores on average and are more likely to provide client-ready outputs than current state-of-the-art systems.

[61] "I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

Naen Xu,Jiayi Sheng,Changjiang Li,Chunyi Zhou,Yuyuan Li,Tianyu Du,Jun Wang,Zhihui Fu,Jinbao Li,Shouling Ji

Main category: cs.CL

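TL;DR: This paper proposes a multimodal pun generation pipeline, builds MultiPun, a dataset covering several pun types plus adversarial non-pun distractors, and evaluates how well existing vision-language models (VLMs) understand puns. Most models perform poorly; the authors further propose prompt-level and model-level strategies that raise F1 by 16.5% on average.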

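Details Motivation: Vision-Language Models (VLMs) are widely used for multimodal understanding and generation, but their ability to understand puns, a rhetorical device that relies on semantic ambiguity and phonetic similarity, has not been studied systematically, mainly because rigorous benchmarks are lacking. Method: First designs a multimodal pun generation pipeline; then builds the MultiPun dataset, covering diverse pun types and adversarial non-pun distractors; finally proposes prompt-level strategies (such as instruction tuning and chain-of-thought prompting) and model-level strategies (such as cross-modal attention optimisation) to improve pun recognition, and evaluates them systematically on the dataset. Result: Most existing VLMs struggle to distinguish genuine puns from adversarial distractors on MultiPun; the proposed prompt-level and model-level strategies improve F1 by 16.5% on average; the experiments indicate that cross-modal reasoning is key to pun understanding. Conclusion: Current VLMs remain weak at understanding multimodal puns and need purpose-built datasets and modeling strategies; MultiPun provides a systematic benchmark for this direction, and the proposed methods markedly improve performance, laying groundwork for VLMs with human-like humor understanding. Abstract: Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.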

[62] BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

Abbas Ghaddar,Ivan Kobyzev,Boxing Chen,Yufei Cui

Main category: cs.CL

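TL;DR: This paper proposes BOSCH, a training-free black-box binary optimization framework for short-context head selection in post-training hybridization of large language models. It combines layer-importance probing, adaptive per-layer sliding-window-attention ratio assignment, and grouped head-level optimization to improve performance while reducing KV cache cost.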

Details Motivation: Existing hybridization schemes (layer-level or static head-level) either ignore that dependencies are routed through heads within the same layer or suffer from entanglement, because head behavior changes after hybridization; this makes it hard to balance efficiency against long-context modeling. Method: BOSCH formulates the problem as a Large Neighborhood Search and decomposes it into three steps: (i) detect layer importance with small-budget black-box probes; (ii) adaptively assign a sliding-window attention (SWA) ratio to each layer based on these sensitivities; (iii) run grouped head-level binary optimization within ratio buckets. Result: On 4 LLMs ranging from 1.7B to 30B parameters and across 4 SWA ratios, BOSCH consistently outperforms layer-level heuristics and 6 strong static head-level methods, with larger gains at higher SWA ratios; under continual pretraining it recovers long-context performance faster and to a higher level; analysis shows that the selected heads change substantially across SWA ratios. Conclusion: BOSCH shows that head-level selection needs to be performed for each target SWA ratio, overcoming the limits of static rankings and offering a new approach to efficient, low-overhead LLM hybridization. Abstract: Post-training hybridization of large language models (LLMs) often replaces quadratic self-attention with sliding-window attention (SWA) to reduce KV cache usage and improve latency. Existing hybridization schemes are typically defined either at the layer level (e.g., interleaving) or at the head level via static rankings from local to global. Layer-level schemes ignore that local and global dependencies are routed through heads within the same layer, while static head-level rankings suffer from entanglement: a head's local/global behavior can change after hybridization. We propose BOSCH, Black-box Binary Optimization for Short-context Head Selection, a training-free method that formulates the problem as a Large Neighborhood Search and decomposes it into three subproblems: (i) layer-importance detection via small-budget black-box probes, (ii) adaptive per-layer SWA-ratio assignment based on these sensitivities, and (iii) grouped head-level optimization within ratio buckets. Extensive experiments on 4 LLMs ranging from 1.7B to 30B parameters, across 4 SWA ratios, show that BOSCH consistently outperforms layer-level heuristics and 6 strong static head-level methods, with larger gains at higher SWA ratios. Under continual pretraining, BOSCH recovers original long-context performance faster and to a higher level. Analysis of the selected heads reveals substantial turnover for BOSCH across different SWA ratios, underscoring the importance of performing head-level selection for each target ratio rather than relying on fixed locality rankings.
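
For readers unfamiliar with the setting, the sketch below shows what switching a head from full causal attention to sliding-window attention means for the attention mask and for the number of cached key/value positions. It does not implement BOSCH's search; the binary vector `head_is_swa` stands in for the kind of per-head decision such a method would output, and the per-head cache count is a simplification that ignores grouped-query attention.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """Full causal attention: position i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n: int, window: int) -> List[List[bool]]:
    """Sliding-window attention: position i attends only to its last `window` positions."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def kv_cache_entries(head_is_swa: List[bool], seq_len: int, window: int) -> int:
    """Cached key/value positions for one layer, one entry per head and position.

    Simplifying assumption: every attention head keeps its own KV cache.
    """
    return sum(min(window, seq_len) if swa else seq_len for swa in head_is_swa)


seq_len, window = 8, 3
head_is_swa = [True, False, True, True]  # hypothetical binary selection for one layer

hybrid = kv_cache_entries(head_is_swa, seq_len, window)
full = kv_cache_entries([False] * len(head_is_swa), seq_len, window)
print(f"hybrid layer caches {hybrid} positions, full attention caches {full}")
print("keys visible to the last token under SWA:", sum(sliding_window_mask(seq_len, window)[-1]))
```

The saving grows with sequence length because a full-attention head caches `seq_len` positions while an SWA head caches at most `window`.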

[63] FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures

Fan Zhang,Mingzi Song,Rania Elbadry,Yankai Chen,Shaobo Wang,Yixi Zhou,Xunwen Zheng,Yueru He,Yuyang Dai,Georgi Georgiev,Ayesha Gull,Muhammad Usman Safder,Fan Wu,Liyuan Meng,Fengxian Ji,Junning Zhao,Xueqing Peng,Jimin Huang,Yu Chen,Xue Liu,Preslav Nakov,Zhuohan Xie

Main category: cs.CL

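TL;DR: This paper proposes FinReporting, an agentic workflow for financial reporting across jurisdictions. It builds a unified ontology of financial concepts, splits processing into auditable stages (acquisition, extraction, mapping, anomaly logging), and constrains LLMs to act as rule- and evidence-based verifiers, improving the consistency and reliability of cross-market filing processing.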

Details Motivation: Most LLM-based financial reporting systems assume a single-market setting and do not address structural differences between jurisdictions in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions, which makes cross-jurisdiction semantic alignment and verification difficult. Method: Proposes the FinReporting agentic workflow: it builds a unified canonical ontology over the income statement, balance sheet, and cash flow statement; decomposes reporting into four auditable stages (filing acquisition, information extraction, canonical mapping, anomaly logging); and uses LLMs as constrained verifiers under explicit decision rules and evidence grounding instead of as free-form generators. Result: Experiments on annual filings from the US, Japan, and China show that the system improves consistency and reliability under heterogeneous reporting regimes; an interactive demo supporting cross-market inspection and structured export has been released. Conclusion: FinReporting offers a verifiable, auditable, locally adapted pattern for applying LLMs to multi-jurisdiction financial disclosure, moving regulatory technology (RegTech) toward higher reliability. Abstract: Financial reporting systems increasingly use large language models (LLMs) to extract and summarize corporate disclosures. However, most assume a single-market setting and do not address structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions make cross-jurisdiction reporting a semantic alignment and verification challenge. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system builds a unified canonical ontology over Income Statement, Balance Sheet, and Cash Flow, and decomposes reporting into auditable stages including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than using LLMs as free-form generators, FinReporting deploys them as constrained verifiers under explicit decision rules and evidence grounding. Evaluated on annual filings from the US, Japan, and China, the system improves consistency and reliability under heterogeneous reporting regimes. We release an interactive demo supporting cross-market inspection and structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo. The video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk

[64] The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models

Xiaojie Gu,Ziying Huang,Weicong Hong,Jian Xie,Renze Lou,Kai Zhang

Main category: cs.CL

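TL;DR: This paper proposes a new diagnostic framework for assessing whether knowledge edits in large language models are genuine, finds that existing methods exhibit "Surface Compliance", and shows that recursive editing leads to unstable and irreversible memory.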

Details Motivation: Existing knowledge editing methods score well on standard benchmarks, but the evaluation frameworks (which rely on outputs under specific prompts) cannot reliably verify that the model's internal memory has actually been modified, which undermines trustworthy deployment of LLMs in real-world settings. Method: Proposes a diagnostic framework based on discriminative self-assessment under in-context learning (ICL); by probing how consistently the edited model behaves under varied prompting conditions, it identifies whether a genuine memory update has occurred or only surface imitation. Result: Finds widespread Surface Compliance, where editors merely mimic target outputs without overwriting internal beliefs; recursive edits also accumulate representational residues, causing cognitive instability and irreversible memory states. Conclusion: Current knowledge editing paradigms carry serious reliability risks, and new methods that achieve robust, reversible, structural memory modification are needed to support trustworthy, sustainable LLM systems. Abstract: Large Language Models (LLMs) internalize vast world knowledge as parametric memory, yet inevitably inherit the staleness and errors of their source corpora. Consequently, ensuring the reliability and malleability of these internal representations is imperative for trustworthy real-world deployment. Knowledge editing offers a pivotal paradigm for surgically modifying memory without retraining. However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate genuine memory modification. In this work, we introduce a simple diagnostic framework that subjects models to discriminative self-assessment under in-context learning (ICL) settings that better reflect real-world application environments, specifically designed to scrutinize the subtle behavioral nuances induced by memory modifications. This probing reveals a pervasive phenomenon of Surface Compliance, where editors achieve high benchmark scores by merely mimicking target outputs without structurally overwriting internal beliefs. Moreover, we find that recursive modifications accumulate representational residues, triggering cognitive instability and permanently diminishing the reversibility of the model's memory state. These insights underscore the risks of current editing paradigms and highlight the pivotal role of robust memory modification in building trustworthy, long-term sustainable LLM systems. Code is available at https://github.com/XiaojieGu/SA-MCQ.

[65] Disentangling MLP Neuron Weights in Vocabulary Space

Asaf Avrahamy,Yoav Gur-Arieh,Mor Geva

Main category: cs.CL

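TL;DR: This paper proposes ROTATE, a data-free method that disentangles MLP neurons without forward passes. It optimizes rotations in weight space to maximize the kurtosis of vocabulary-space projections, recovering interpretable "vocabulary channels", and is validated for effectiveness and interpretability on several large models.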

Details Motivation: Interpreting the information encoded in model weights is a fundamental challenge in mechanistic interpretability; existing methods mostly depend on data and forward passes, and efficient, scalable tools for analysing weight space directly are lacking. Method: Proposes ROTATE, based on the statistical observation that neurons encoding monosemantic concepts show high kurtosis when projected onto the vocabulary; it optimizes rotations purely in weight space to maximize this kurtosis and thereby extracts sparse, interpretable "vocabulary channels". Result: On Llama-3.1-8B-Instruct and Gemma-2-2B-it, ROTATE consistently recovers vocabulary channels that are faithful to neuron behavior; ablating a channel selectively disables the corresponding input activations or the promotion of specific concepts; aggregating channel-level descriptions yields neuron descriptions that beat the best activation-based baselines by 2-3x. Conclusion: ROTATE is a data-free, forward-pass-free, scalable weight-space decomposition method that provides a new building block for fine-grained interpretability of large language models. Abstract: Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method requiring no forward passes that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron's behavior. Ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Moreover, aggregating channel-level descriptions yields comprehensive neuron descriptions that outperform optimized activation-based baselines by 2-3x in head-to-head comparisons. By providing a data-free decomposition of neuron weights, ROTATE offers a scalable, fine-grained building block for interpreting LMs.
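
The statistical signal ROTATE relies on is kurtosis of a vocabulary-space projection. The sketch below computes plain (Pearson) kurtosis for two made-up projection vectors to show why a direction that strongly promotes a few tokens scores higher than a diffuse one. The vectors are toy values, and the rotation optimization itself is not reproduced here.

```python
from typing import Sequence


def kurtosis(values: Sequence[float]) -> float:
    """Pearson kurtosis: fourth central moment divided by squared variance.

    Assumes the input is not constant (variance must be non-zero).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2)


# Hypothetical projections of two directions onto an 8-token vocabulary.
peaked = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0]    # one token dominates
diffuse = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # mass spread evenly

print(f"peaked direction kurtosis:  {kurtosis(peaked):.3f}")
print(f"diffuse direction kurtosis: {kurtosis(diffuse):.3f}")
```

A real vocabulary has tens of thousands of entries, so a direction that promotes a small, coherent token set would stand out even more sharply than in this eight-token toy case.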

[66] BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection

Zhongxing Zhang,Emily K. Vraga,Jisu Huh,Jaideep Srivastava

Main category: cs.CL

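TL;DR: This paper proposes BiMind, a dual-head reasoning framework that uses an attention geometry adapter, a self-retrieval knowledge mechanism, and uncertainty-aware fusion to address the difficulty of balancing content verification with external knowledge correction in incorrect-information detection, and defines the VoX metric to quantify the contribution of knowledge.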

Details Motivation: Incorrect information undermines content veracity and integrity, and existing detection methods struggle to balance textual content verification with external knowledge correction under collapsed attention geometries. Method: Proposes the BiMind dual-head reasoning framework, comprising (i) an attention geometry adapter that reshapes attention logits through token-conditioned offsets; (ii) a self-retrieval knowledge mechanism that builds an in-domain semantic memory with kNN and injects neighbor information via feature-wise linear modulation; (iii) uncertainty-aware fusion strategies (entropy-gated fusion and a trainable agreement head) supported by a symmetric KL agreement regularizer; it also defines the VoX metric to quantify instance-level logit gains from knowledge-augmented reasoning. Result: On several public datasets, BiMind outperforms advanced detection methods and provides interpretable diagnostics showing when and why knowledge matters. Conclusion: BiMind separates content-internal reasoning from knowledge-augmented reasoning, achieving both better performance and interpretability in incorrect-information detection, and VoX offers a new way to assess the value of knowledge. Abstract: Incorrect information poses significant challenges by disrupting content veracity and integrity, yet most detection approaches struggle to jointly balance textual content verification with external knowledge modification under collapsed attention geometries. To address this issue, we propose a dual-head reasoning framework, BiMind, which disentangles content-internal reasoning from knowledge-augmented reasoning. In BiMind, we introduce three core innovations: (i) an attention geometry adapter that reshapes attention logits via token-conditioned offsets and mitigates attention collapse; (ii) a self-retrieval knowledge mechanism, which constructs an in-domain semantic memory through kNN retrieval and injects retrieved neighbors via feature-wise linear modulation; (iii) uncertainty-aware fusion strategies, including entropy-gated fusion and a trainable agreement head, stabilized by a symmetric Kullback-Leibler agreement regularizer. To quantify the knowledge contributions, we define a novel metric, Value-of-eXperience (VoX), to measure instance-wise logit gains from knowledge-augmented reasoning. Experimental results on public datasets demonstrate that our BiMind model outperforms advanced detection approaches and provides interpretable diagnostics on when and why knowledge matters.
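
Two of the ingredients named above, a symmetric Kullback-Leibler agreement term and entropy-gated fusion of two heads, can be illustrated in a few lines. The symmetric KL below is the standard definition; the gating rule (a softmax over negative entropies) is an assumption made for illustration, since the abstract does not specify BiMind's actual gate.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p: List[float]) -> float:
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def symmetric_kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) + KL(q || p); zero only when the two heads agree exactly."""
    forward = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    backward = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return forward + backward


def gate_weight(p_content: List[float], p_knowledge: List[float]) -> float:
    """Weight on the content head; higher when that head is more confident (assumed rule)."""
    a, b = math.exp(-entropy(p_content)), math.exp(-entropy(p_knowledge))
    return a / (a + b)


def entropy_gated_fusion(p_content: List[float], p_knowledge: List[float]) -> List[float]:
    w = gate_weight(p_content, p_knowledge)
    return [w * c + (1.0 - w) * k for c, k in zip(p_content, p_knowledge)]


p_content = softmax([4.0, 0.0, 0.0])    # confident head (toy logits)
p_knowledge = softmax([0.5, 0.4, 0.3])  # uncertain head (toy logits)
fused = entropy_gated_fusion(p_content, p_knowledge)
print("gate weight on content head:", round(gate_weight(p_content, p_knowledge), 3))
print("agreement penalty:", round(symmetric_kl(p_content, p_knowledge), 3))
```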

[67] A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Maria Mahbub,Gregory M. Dams,Josh Arnold,Caitlin Rizy,Sudarshan Srinivasan,Elliot M. Fielstein,Minu A. Aghevli,Kamonica L. Craig,Elizabeth M. Oliva,Joseph Erdos,Jodie Trafton,Ioana Danciu

Main category: cs.CL

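TL;DR: This paper proposes a multi-stage, weakly supervised validation framework for assessing large language model (LLM) performance on clinical information extraction (for example, substance use disorder diagnoses), sharply reducing reliance on manual annotation while keeping the evaluation trustworthy and scalable.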

Details Motivation: Existing evaluations of LLM clinical information extraction depend on extensive manual annotation or incomplete structured data and are hard to apply at population scale; a scalable, trustworthy validation approach is needed. Method: Proposes a multi-stage validation framework comprising prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation by a higher-capacity judge LLM, selective expert review, and external predictive validity analysis. Result: Extracting substance use disorder diagnoses across 11 categories from about 919,000 clinical notes, rule-based and semantic filtering removed 14.59% of unreliable extractions; agreement between the judge LLM and expert review reached Gwet's AC1=0.80; the primary LLM achieved F1=0.80; LLM-extracted diagnoses predicted specialty care engagement with AUC=0.80, better than structured-data baselines. Conclusion: The framework enables large-scale, trustworthy deployment of LLM clinical information extraction without intensive manual annotation, offering a practical route to real-world use. Abstract: Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation. We applied this framework to extraction of substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes. Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM's assessments showed substantial agreement with subject matter expert review (Gwet's AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses also predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80). These findings demonstrate that scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation.
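
Gwet's AC1, the agreement statistic quoted for the judge LLM versus expert review, is less familiar than Cohen's kappa. The sketch below implements the standard two-rater form, AC1 = (p_a - p_e) / (1 - p_e) with p_e = (1 / (Q - 1)) * sum over categories of pi_q * (1 - pi_q), where pi_q is the average share of category q across both raters. The ratings are invented for illustration and are unrelated to the study's data.

```python
from typing import Hashable, Sequence


def gwet_ac1(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Gwet's AC1 chance-corrected agreement for two raters on nominal labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    if len(categories) < 2:
        raise ValueError("AC1 needs at least two observed categories")
    # Observed agreement: share of items given the same label by both raters.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the average prevalence of each category.
    p_e = 0.0
    for q in categories:
        pi_q = (list(rater_a).count(q) + list(rater_b).count(q)) / (2 * n)
        p_e += pi_q * (1.0 - pi_q)
    p_e /= len(categories) - 1
    return (p_a - p_e) / (1.0 - p_e)


# Invented binary ratings (1 = extraction supported, 0 = not supported).
judge_llm = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
expert = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
ac1 = gwet_ac1(judge_llm, expert)
print(f"Gwet's AC1 = {ac1:.3f}")
```

AC1 is often preferred over kappa when one category is much more common than the other, since kappa can drop sharply under skewed prevalence even when raw agreement is high.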

[68] From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

Hongxu Zhou

Main category: cs.CL

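TL;DR: This paper studies self-correction through structured reflection in large language models (LLMs) on open-ended reasoning tasks. Outlines-based constrained decoding alone does not improve performance and instead triggers "structure snowballing": the model falls into formatting traps while satisfying format requirements, so semantic errors go undetected and uncorrected, revealing an "alignment tax" of constrained decoding.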

Details Motivation: To address self-correction failures caused by "hallucination snowballing" in open-ended LLM reasoning, and to test whether autonomous self-correction is achievable through structural constraints alone, without external critics or symbolic tools. Method: Applies Outlines-based constrained decoding to enforce structured reflection on the 8-billion-parameter model Qwen3-8B, evaluates the effect on self-correction, and analyses the failure mechanism. Result: Structural constraints do not improve self-correction and instead trigger "structure snowballing": the model reaches near-perfect superficial syntactic alignment, but the cognitive load pushes it into formatting traps, and it fails to detect or resolve deeper semantic errors. Conclusion: Constrained decoding introduces an "alignment tax", exposing a tension between structural granularity and the model's internal capacity, and suggesting that formal constraints and semantic reliability must be weighed carefully in autonomous agent workflows. Abstract: Intrinsic self-correction in Large Language Models (LLMs) frequently fails in open-ended reasoning tasks due to "hallucination snowballing," a phenomenon in which models recursively justify early errors during free-text reflection. While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy. This study investigates whether enforcing structured reflection purely through Outlines-based constrained decoding can disrupt error propagation without additional training. Evaluating an 8-billion-parameter model (Qwen3-8B), we show that simply imposing structural constraints does not improve self-correction performance. Instead, it triggers a new failure mode termed "structure snowballing." We find that the cognitive load required to satisfy strict formatting rules pushes the model into formatting traps. This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors. These findings expose an "alignment tax" inherent to constrained decoding, highlighting a tension between structural granularity and internal model capacity in autonomous workflows. Code and raw logs are available in the GitHub repository: https://github.com/hongxuzhou/agentic_llm_structured_self_critique.

[69] Short Data, Long Context: Distilling Positional Knowledge in Transformers

Patrick Huber,Ernie Chang,Chinnadhurai Sankar,Rylan Conway,Igor Fedorov,Md Rifat Arefin,Adithya Sagar

Main category: cs.CL

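TL;DR: This paper proposes a way to extend a language model's context window without long-context pre-training: a student model is trained with logit-based knowledge distillation on packed short-context samples, which transfers long-context retrieval ability. A RoPE-based analysis shows how positional information passes through distillation and reveals structured update patterns in the query state during long-context extension.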

Details Motivation: Extending a language model's context window usually depends on expensive long-context pre-training, which hurts training efficiency and makes data collection difficult, so a more efficient alternative is needed. Method: Trains the student model with logit-based knowledge distillation using only packed short-context samples placed within a long-context window, and analyzes the setup through Rotary Position Embedding (RoPE), including phase-wise RoPE scaling experiments, tracing how positional perturbations propagate with repeated token sequences, and analyzing query-state update patterns. Result: (1) Phase-wise RoPE scaling gives the best long-context performance under knowledge distillation; (2) logit distillation directly enables positional information transfer, with positional perturbations propagating through the layers to influence the teacher's output distribution and the distillation signal the student receives; (3) the query state shows structured updates during long-context extension, with specific parameter spans highly sensitive to long-context training. Conclusion: Long-context capability can be transferred from teacher to student through logit distillation without long-context pre-training; the RoPE design and the positional information implicit in the distillation signal jointly support this transfer, offering a new route to efficient context extension. Abstract: Extending the context window of language models typically requires expensive long-context pre-training, posing significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We provide comprehensive insights through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly enable positional information transfer. Using an experimental setup with packed repeated token sequences, we trace the propagation of positional perturbations from query and key vectors through successive transformer layers to output logits, revealing that positional information systematically influences the teacher's output distribution and, in turn, the distillation signal received by the student model. Third, our analysis uncovers structured update patterns in the query state during long-context extension, with distinct parameter spans exhibiting strong sensitivity to long-context training.
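
As a rough illustration of logit-based knowledge distillation, the sketch below computes a temperature-softened KL divergence between teacher and student logits at each position of a packed sequence. The temperature value, the T² scaling, and the function names are assumptions borrowed from common distillation practice, not details confirmed by the abstract.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) at one token position, scaled by T^2 (assumed convention)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def sequence_kd_loss(teacher_seq, student_seq, temperature=2.0):
    """Average the per-position loss over all positions of a packed sequence."""
    losses = [kd_loss(t, s, temperature) for t, s in zip(teacher_seq, student_seq)]
    return sum(losses) / len(losses)

# Toy logits over a 3-token vocabulary at two positions (made-up numbers).
teacher = [[1.0, 2.0, 3.0], [0.5, 0.1, 0.2]]
student = [[1.0, 2.0, 2.5], [0.4, 0.2, 0.2]]
print(sequence_kd_loss(teacher, student))
```

Because the student only sees the teacher's output distribution, any positional effect on the teacher's logits is carried into this loss, which is the transfer channel the abstract describes.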

[70] Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

Ben Wigler,Maria Tsfasman,Tiffany Matej Hrkalovic

Main category: cs.CL

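TL;DR: This study has large language models (LLMs) generate first-person life stories conditioned on real personality assessment data, then has other LLMs recover personality scores from those texts, testing how robustly personality traits are encoded in language. Recovery accuracy approaches human test-retest reliability and is robust across models and providers, and the behavioural features of the generated text closely match those in real conversations.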

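Details Motivation: Existing evaluations of LLM personality simulation rely on questionnaire self-report by the model, cover few architectures, and lack real human psychometric data, so it is hard to tell whether the models represent individual differences or merely match personality vocabulary on the surface. Method: Real psychometric personality profiles from 290 participants are used to condition 10 different LLMs to generate first-person life stories; 3 independent LLMs then recover personality scores from those stories alone. Bias decomposition and content analysis compare behavioural and emotional reactivity features of the generated text with participants' real conversations. Result: Personality scores can be recovered from the generated narratives with high reliability (mean r = 0.750, 85% of the human test-retest ceiling); the effect is robust across 10 generator models and 3 scorer models spanning 6 providers; 9 of 10 coded behavioural features correlate significantly with real conversations; emotional reactivity patterns replicate between narratives and real conversations. Conclusion: The personality-language relationship learned during pretraining supports robust encoding and decoding of individual differences, including characteristic patterns such as emotional variability, indicating that LLMs can support psychometrically meaningful personality simulation. Abstract: Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions. However, existing evaluations rely predominantly on questionnaire self-report by the conditioned model, are limited in architectural diversity, and rarely use real human psychometric data. Without addressing these limitations, it remains unclear whether personality conditioning produces psychometrically informative representations of individual differences or merely superficial alignment with trait descriptors. To test how robustly LLMs can encode personality into extended text, we condition LLMs on real psychometric profiles from 290 participants to generate first-person life story narratives, and then task independent LLMs to recover personality scores from those narratives alone. We show that personality scores can be recovered from the generated narratives at levels approaching human test-retest reliability (mean r = 0.750, 85% of the human ceiling), and that recovery is robust across 10 LLM narrative generators and 3 LLM personality scorers spanning 6 providers. Decomposing systematic biases reveals that scoring models achieve their accuracy while counteracting alignment-induced defaults. 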
Content analysis of the generated narratives shows that personality conditioning produces behaviourally differentiated text: nine of ten coded features correlate significantly with the same features in participants' real conversations, and personality-driven emotional reactivity patterns in narratives replicate in real conversational data. These findings provide evidence that the personality-language relationship captured during pretraining supports robust encoding and decoding of individual differences, including characteristic emotional variability patterns that replicate in real human behaviour.
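
The round-trip recovery figure is a correlation between true and recovered trait scores. The sketch below shows how such a Pearson r could be computed, and back-calculates the test-retest ceiling implied by the two reported numbers (0.750 and 85%). The score lists are made-up illustrations, and the implied ceiling is only an inference from rounded figures, not a value stated in the abstract.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def implied_ceiling(mean_r, fraction_of_ceiling):
    """Ceiling implied by 'mean_r is this fraction of the ceiling'."""
    return mean_r / fraction_of_ceiling

# Hypothetical true vs. recovered scores for one trait across five participants.
true_scores = [2.1, 3.4, 4.0, 1.5, 3.0]
recovered = [2.5, 3.1, 4.2, 2.0, 2.7]
print(round(pearson_r(true_scores, recovered), 3))

# Back-of-envelope check on the reported figures.
print(round(implied_ceiling(0.750, 0.85), 3))
```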

[71] LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

Olexander Mazurets,Olexander Barmak,Leonid Bedratyuk,Iurii Krak

Main category: cs.CL

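TL;DR: This paper proposes the LAG-XAI framework, which models semantic paraphrasing as an affine geometric transformation in embedding space (rotation, deformation, translation). It keeps high interpretability while reaching discriminative performance close to the baseline, and is applied successfully to LLM hallucination detection.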

Details Motivation: The semantic spaces of existing Transformer models lack interpretability; semantic change needs to be modeled geometrically to achieve mechanistic interpretability. Method: Proposes LAG-XAI, a framework based on Lie Affine Geometry that treats paraphrasing as a continuous affine transformation on a semantic manifold, modeled with a mean-field approximation inspired by local Lie group actions and decomposed into rotation, deformation, and translation. Result: Reaches an AUC of 0.7713 on PIT-2015 (about 54% above the random baseline of 0.5 in relative terms), identifies a stable rotation angle of ~27.84° and near-zero deformation, validates across domains, and automatically detects 95.3% of factual errors on HaluEval with a geometric check. Conclusion: LAG-XAI offers a mathematically grounded, resource-efficient, parametrically interpretable route to mechanistic interpretability for Transformers, reveals that the semantic space is locally isometric, and has potential for practical deployment. Abstract: Modern Transformer-based language models achieve strong performance in natural language processing tasks, yet their latent semantic spaces remain largely uninterpretable black boxes. This paper introduces LAG-XAI (Lie Affine Geometry for Explainable AI), a novel geometric framework that models paraphrasing not as discrete word substitutions, but as a structured affine transformation within the embedding space. By conceptualizing paraphrasing as a continuous geometric flow on a semantic manifold, we propose a computationally efficient mean-field approximation, inspired by local Lie group actions. This allows us to decompose paraphrase transitions into geometrically interpretable components: rotation, deformation, and translation. Experiments on the noisy PIT-2015 Twitter corpus, encoded with Sentence-BERT, reveal a "linear transparency" phenomenon. The proposed affine operator achieves an AUC of 0.7713. By normalizing against random chance (AUC 0.5), the model captures approximately 80% of the non-linear baseline's effective classification capacity (AUC 0.8405), offering explicit parametric interpretability in exchange for a marginal drop in absolute accuracy. The model identifies fundamental geometric invariants, including a stable matrix reconfiguration angle (~27.84°) and near-zero deformation, indicating local isometry. Cross-domain generalization is confirmed via direct cross-corpus validation on an independent TURL dataset. 
Furthermore, the practical utility of LAG-XAI is demonstrated in LLM hallucination detection: using a "cheap geometric check," the model automatically detected 95.3% of factual distortions on the HaluEval dataset by registering deviations beyond the permissible semantic corridor. This approach provides a mathematically grounded, resource-efficient path toward the mechanistic interpretability of Transformers.
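
The chance-normalized comparison in the abstract can be checked with a few lines of arithmetic. The sketch below subtracts the chance level (AUC 0.5) from both reported AUCs and takes their ratio; it also computes the relative gain over chance, which appears to be where the roughly 54% figure in the summary comes from. The helper name is an illustrative choice.

```python
def effective_capacity(auc, chance=0.5):
    """AUC above the chance level."""
    return auc - chance

affine_auc = 0.7713    # reported for the affine operator
baseline_auc = 0.8405  # reported for the non-linear baseline

# Share of the baseline's above-chance capacity captured by the affine model.
ratio = effective_capacity(affine_auc) / effective_capacity(baseline_auc)

# Relative improvement of the affine model over chance itself.
relative_gain = effective_capacity(affine_auc) / 0.5

print(f"{ratio:.3f}")          # about 0.797, i.e. roughly 80%
print(f"{relative_gain:.3f}")  # about 0.543, i.e. roughly 54%
```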

[72] Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

Changgeon Ko,Jisu Shin,Hoyun Song,Huije Lee,Eui Jun Hwang,Jong C. Park

Main category: cs.CL

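TL;DR: This paper studies how social context degrades the decision reliability of large language model (LLM) agents acting as human delegates in multi-agent environments, and identifies and experimentally verifies four social-psychological phenomena that reduce agent accuracy.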

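Details Motivation: LLM agents increasingly act as human delegates in multi-agent decision-making, but their reliability may be affected by social context; drawing on social psychology, the paper investigates how that context weakens the representative agent's judgment. Method: Defines and systematically manipulates four social-psychological phenomena (social conformity, perceived expertise, dominant speaker effect, and rhetorical persuasion) by varying the number of adversaries, relative intelligence, argument length, and argumentative style. Result: The representative agent's accuracy declines consistently as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all significantly reduce performance; rhetorical strategies emphasizing credibility or logic also sway its judgment depending on context. Conclusion: Multi-agent systems depend not only on individual reasoning ability but are also highly sensitive to the social dynamics of their configuration, revealing critical vulnerabilities in AI agents that resemble the psychological biases of human group decision-making. Abstract: Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision. Drawing inspiration from social psychology, we investigate how the reliability of this representative agent is undermined by the social context of its network. We define four key phenomena (social conformity, perceived expertise, dominant speaker effect, and rhetorical persuasion) and systematically manipulate the number of adversaries, relative intelligence, argument length, and argumentative styles. Our experiments demonstrate that the representative agent's accuracy consistently declines as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all lead to significant performance degradation. Furthermore, rhetorical strategies emphasizing credibility or logic can further sway the agent's judgment, depending on the context. These findings reveal that multi-agent systems are sensitive not only to individual reasoning but also to the social dynamics of their configuration, highlighting critical vulnerabilities in AI delegates that mirror the psychological biases observed in human group decision-making.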

[73] Exclusive Unlearning

Mutsumi Sasaki,Kouta Nakayama,Yusuke Miyao,Yohei Oseki,Masaru Isonuma

Main category: cs.CL

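TL;DR: This paper proposes Exclusive Unlearning (EU), which removes harmful content comprehensively by broadly forgetting everything except specified knowledge to be retained, while preserving the model's task ability in specific domains such as medicine and mathematics.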

Details Motivation: Existing machine unlearning methods struggle to remove diverse harmful content comprehensively; in high-risk industrial applications such as healthcare and education, the risk of LLMs generating harmful content needs to be addressed. Method: Proposes the Exclusive Unlearning (EU) framework, which does not list forgetting targets one by one but instead specifies the knowledge and expressions to retain and systematically forgets everything else. Result: EU effectively improves safety against a wide range of inputs, including jailbreak attacks, while maintaining good instruction-following ability in specific domains such as medicine and mathematics. Conclusion: Exclusive Unlearning is a more efficient and more general safety-enhancement paradigm that offers a new approach to deploying LLMs in sensitive domains. Abstract: When introducing Large Language Models (LLMs) into industrial applications, such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, diverse harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except for the knowledge and expressions we wish to retain. We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to specific domains such as medicine and mathematics.

[74] Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Komal Kumar,Aman Chadha,Salman Khan,Fahad Shahbaz Khan,Hisham Cholakkal

Main category: cs.CL

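TL;DR: This paper presents Paper Circle, a multi-agent LLM-based system for discovering and analyzing research literature. It comprises two complementary pipelines, discovery and analysis, and supports multi-source retrieval, structured knowledge graph construction, and output in multiple formats.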

Details Motivation: The rapid growth of scientific literature makes it hard for researchers to discover, evaluate, and integrate relevant work efficiently, so automated tools for literature management are needed. Method: Builds a multi-agent framework based on a coder LLM, with a Discovery Pipeline (offline and online retrieval, multi-criteria scoring, diversity-aware ranking) and an Analysis Pipeline (converting papers into structured knowledge graphs with nodes such as concepts, methods, and experiments), with reproducible outputs at every step. Result: On paper retrieval and review generation, Paper Circle's Hit Rate, MRR, and Recall@K improve consistently as the agent model gets stronger; the code and website are publicly released. Conclusion: Paper Circle gives researchers an end-to-end, reproducible, extensible workflow for literature discovery and analysis, and demonstrates the practical value of multi-agent LLMs for processing academic information. Abstract: The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Multi-agent large language models (LLMs) have recently demonstrated strong potential for understanding user intent and are increasingly trained to use various tools. In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi-criteria scoring, diversity-aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph-aware question answering and coverage verification. Both pipelines are implemented within a coder LLM-based multi-agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall at K. Results show consistent improvements with stronger agent models. 
We have publicly released the website at https://papercircle.vercel.app/ and the code at https://github.com/MAXNORM8650/papercircle.
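
For readers unfamiliar with the retrieval metrics named above, the sketch below gives one common set of definitions for hit rate, MRR, and Recall@K over ranked result lists. The function names and toy queries are illustrative assumptions; the exact variants used in the Paper Circle benchmark may differ.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    found = sum(1 for doc in set(ranked[:k]) if doc in relevant)
    return found / len(relevant)

def evaluate(queries, k):
    """Average each metric over (ranked_list, relevant_set) pairs."""
    n = len(queries)
    return {
        "hit_rate": sum(hit_at_k(r, rel, k) for r, rel in queries) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in queries) / n,
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in queries) / n,
    }

# Two toy queries with made-up paper identifiers.
queries = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"q", "x"}),
]
print(evaluate(queries, k=2))
```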

cs.CV [Back]

[75] Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity

Abhishek Dharmaratnakar,Srivaths Ranganathan,Debanshu Das,Anushree Sinha

Main category: cs.CV

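TL;DR: This survey reviews the paradigm shift in automatic video trailer generation from heuristic extraction to deep generative synthesis. It analyzes progress in generative techniques such as large language models, multimodal large models, and diffusion models, along with architectural evolution, economic impact, and ethical challenges, and proposes a new taxonomy for the foundation model era.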

Details Motivation: Video trailer generation is undergoing a profound shift from traditional heuristic methods to generative methods built on large models, which calls for a systematic review of the technical landscape, an assessment of its impact, and a new taxonomy. Method: A technical survey that analyzes the evolution of generative techniques (autoregressive Transformers, LLM-orchestrated pipelines, text-to-video models), compares architectures such as GCNs and the Trailer Generation Transformer, and discusses economic and ethical dimensions. Result: Establishes a new taxonomy for AI trailer generation in the foundation model era and argues that the future direction is controllable generative editing and semantic reconstruction rather than simple shot extraction. Conclusion: AI-driven trailer generation has entered a generative stage centered on foundation models, and its development must balance technical innovation, fit with platform ecosystems, and ethical governance. Abstract: The domain of automatic video trailer generation is currently undergoing a profound paradigm shift, transitioning from heuristic-based extraction methods to deep generative synthesis. While early methodologies relied heavily on low-level feature engineering, visual saliency, and rule-based heuristics to select representative shots, recent advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), and diffusion-based video synthesis have enabled systems that not only identify key moments but also construct coherent, emotionally resonant narratives. This survey provides a comprehensive technical review of this evolution, with a specific focus on generative techniques including autoregressive Transformers, LLM-orchestrated pipelines, and text-to-video foundation models like OpenAI's Sora and Google's Veo. We analyze the architectural progression from Graph Convolutional Networks (GCNs) to Trailer Generation Transformers (TGT), evaluate the economic implications of automated content velocity on User-Generated Content (UGC) platforms, and discuss the ethical challenges posed by high-fidelity neural synthesis. By synthesizing insights from recent literature, this report establishes a new taxonomy for AI-driven trailer generation in the era of foundation models, suggesting that future promotional video systems will move beyond extractive selection toward controllable generative editing and semantic reconstruction of trailers.

[76] RCP: Representation Consistency Pruner for Mitigating Distribution Shift in Large Vision-Language Models

Jianwei Zhang,Chaoning Zhang,Sihan Cao,Wang Liu,Pengcheng Zheng,Jiaxin Huang,Caiyan Qin,Yalan Ye,Wei Dong,Yang Yang

Main category: cs.CV

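TL;DR: This paper proposes the RCP framework, which combines cumulative visual token pruning with a delayed repair mechanism to keep LVLM performance stable while greatly reducing the number of visual tokens.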

Details Motivation: Large Vision-Language Models (LVLMs) have high inference costs because the language decoder processes a large number of visual tokens; existing pruning methods remove tokens irreversibly, which shifts the hidden-state distribution and causes significant performance degradation. Method: Proposes the Representation Consistency Pruner (RCP): (1) a cross-attention-based cumulative pruner that uses the LLM's intrinsic attention to predict masks that shrink monotonically across layers; (2) a delayed repair adapter (DRA) that caches key information from pruned tokens and applies FiLM-based modulation to the answer generation tokens; (3) a repair loss that aligns the first- and second-order statistics of the pruned representations with a full-token teacher. Result: On LVLM benchmarks, RCP removes up to 88.9% of visual tokens and reduces FLOPs by up to 85.7% with only a marginal average accuracy drop, and outperforms prior methods that avoid fine-tuning the original model on several benchmarks. Conclusion: RCP is an efficient, lightweight visual token compression framework that does not require retraining the backbone and combines substantial computational savings with strong performance retention. Abstract: Large Vision-Language Models (LVLMs) suffer from prohibitive inference costs due to the massive number of visual tokens processed by the language decoder. Existing pruning methods often lead to significant performance degradation because the irreversible removal of visual tokens causes a distribution shift in the hidden states that deviates from the pre-trained full-token regime. To address this, we propose Representation Consistency Pruner, which we refer to as RCP, as a novel framework that integrates cumulative visual token pruning with a delayed repair mechanism. Specifically, we introduce a cross-attention pruner that leverages the intrinsic attention of the LLM as a baseline to predict cumulative masks, ensuring consistent and monotonic token reduction across layers. To compensate for the resulting information loss, we design a delayed repair adapter denoted as DRA, which caches the essence of pruned tokens and applies FiLM-based modulation specifically to the answer generation tokens. We employ a repair loss to match the first and second-order statistics of the pruned representations with a full-token teacher. RCP is highly efficient because it trains only lightweight plug-in modules while allowing for physical token discarding at inference. Extensive experiments on LVLM benchmarks demonstrate that RCP removes up to 88.9% of visual tokens and reduces FLOPs by up to 85.7% with only a marginal average accuracy drop, and outperforms prior methods that avoid fine-tuning the original model on several widely used benchmarks.
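
Two of the ingredients named above are easy to state concretely. The sketch below shows FiLM-style modulation (a per-feature scale and shift) and a toy loss that penalizes gaps in mean and variance between pruned and full-token activations. It is a minimal illustration under stated assumptions: how RCP derives the scale and shift, and whether its second-order term is a variance or a full covariance, are not specified in the abstract.

```python
def film(features, gamma, beta):
    """FiLM modulation: scale each feature by gamma and shift it by beta."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

def mean_var(values):
    """First-order (mean) and second-order (population variance) statistics."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    return mean, var

def repair_loss(pruned_acts, teacher_acts):
    """Squared gap in mean plus squared gap in variance (assumed form)."""
    pm, pv = mean_var(pruned_acts)
    tm, tv = mean_var(teacher_acts)
    return (pm - tm) ** 2 + (pv - tv) ** 2

# Toy activations for one hidden dimension (made-up numbers).
teacher = [0.9, 1.1, 1.0, 1.2]
pruned = [0.4, 0.6, 0.5, 0.7]
print(repair_loss(pruned, teacher))

# A shift of +0.5 closes the mean gap in this toy case; the variance already matches.
repaired = film(pruned, gamma=[1.0] * 4, beta=[0.5] * 4)
print(repair_loss(repaired, teacher))
```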

[77] Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Chaoyou Fu,Haozhi Yuan,Yuhao Dong,Yi-Fan Zhang,Yunhang Shen,Xiaoxing Hu,Xueying Li,Jinsen Su,Chengwu Long,Xiaoyao Xie,Yongkang Xie,Xiawu Zheng,Xue Yang,Haoyu Cao,Yunsheng Wu,Ziwei Liu,Xing Sun,Caifeng Shan,Ran He

Main category: cs.CV

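TL;DR: This paper introduces Video-MME-v2, a comprehensive benchmark for evaluating the robustness and faithfulness of video understanding models. It uses a progressive tri-level evaluation hierarchy and a group-based non-linear evaluation strategy, with data quality ensured by a strict human annotation process. Experiments reveal bottlenecks in visual information aggregation and temporal modeling in current models (such as Gemini-3-Pro), as well as strong dependence on textual cues.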

Details Motivation: Existing video understanding benchmarks are becoming saturated and leaderboard scores are inflated, so they no longer reflect real-world model capability; a more rigorous and challenging benchmark is needed. Method: Builds the Video-MME-v2 benchmark: (1) a progressive tri-level evaluation hierarchy (multi-point visual aggregation, then temporal dynamics modeling, then complex multimodal reasoning); (2) a group-based non-linear evaluation strategy that emphasizes consistency across related questions and coherence in multi-step reasoning; (3) high-quality human construction involving 12 annotators and 50 independent reviewers, 3,300 human-hours, and up to 5 rounds of quality assurance. Result: There is a substantial gap between the current best model, Gemini-3-Pro, and human experts; there is a clear hierarchical bottleneck in which low-level visual and temporal errors propagate to high-level reasoning; thinking-based reasoning depends heavily on textual cues such as subtitles and can degrade in purely visual settings. Conclusion: Video-MME-v2 provides a more challenging and more diagnostic benchmark for video multimodal large language models, pushing models toward greater robustness, faithfulness, and less reliance on textual cues. Abstract: With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. 
Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.

[78] ID-Sim: An Identity-Focused Similarity Metric

Julia Chae,Nicholas Kolkin,Jui-Hsien Wang,Richard Zhang,Sara Beery,Cusuh Ham

Main category: cs.CV

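TL;DR: This paper proposes ID-Sim, a feed-forward evaluation metric focused on identity that aims to reflect human sensitivity to identity more accurately. It is trained on real and synthetic data, and its agreement with human annotations is verified on a unified benchmark.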

Details Motivation: Existing vision models fall short on identity-focused tasks such as personalized image generation, mainly because there are no evaluation metrics that reflect human sensitivity to identity. Method: Curates a high-quality dataset of real images, augments it with generative synthetic data that provides controlled, fine-grained identity and context variations, and trains the feed-forward ID-Sim metric; evaluates it on a new unified benchmark covering recognition, retrieval, and generation tasks. Result: ID-Sim agrees more closely with human annotations than existing metrics across multiple identity-focused tasks. Conclusion: ID-Sim provides a more reliable evaluation tool for identity-aware vision tasks that matches human perception, and may advance research on personalized generation and identity understanding. Abstract: Humans have remarkable selective sensitivity to identities, easily distinguishing between highly similar identities, even across significantly different contexts such as diverse viewpoints or lighting. Vision models have struggled to match this capability, and progress toward identity-focused tasks such as personalized image generation is slowed by a lack of identity-focused evaluation metrics. To help facilitate progress, we propose ID-Sim, a feed-forward metric designed to faithfully reflect human selective sensitivity. To build ID-Sim, we curate a high-quality training set of images spanning diverse real-world domains, augmented with generative synthetic data that provides controlled, fine-grained identity and contextual variations. We evaluate our metric on a new unified evaluation benchmark for assessing consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.

[79] R3PM-Net: Real-time, Robust, Real-world Point Matching Network

Yasaman Kashefbahrami,Erkut Akdag,Panagiotis Meletis,Evgeniya Balmashnova,Dip Goswami,Egor Bondarau

Main category: cs.CV

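TL;DR: This paper proposes R3PM-Net, a lightweight, global-aware, object-level point matching network that aims to improve the generalizability and real-time performance of point cloud registration in real industrial scenarios, and introduces two new datasets, Sioux-Cranfield and Sioux-Scans, to support evaluation.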

Details Motivation: Existing deep-learning point cloud registration methods are mostly developed and evaluated on clean, dense, synthetic data and generalize poorly to noisy, sparse, incomplete real industrial scans (such as photogrammetric and event-camera data). Method: Proposes R3PM-Net, a lightweight, global-aware, object-level point matching network, and builds two new benchmark datasets, Sioux-Cranfield and Sioux-Scans, for evaluating registration between realistic scans and CAD models. Result: On ModelNet40 it reaches a fitness score of 1.0 and an inlier RMSE of 0.029 cm in only 0.007 s (about 7 times faster than RegTR); on Sioux-Cranfield it maintains a fitness of 1.0 and an inlier RMSE of 0.030 cm; on the highly challenging Sioux-Scans it resolves edge cases in under 50 ms. Conclusion: R3PM-Net significantly improves speed while maintaining high accuracy, providing a robust, efficient solution for industrial applications with strict precision and real-time requirements. Abstract: Accurate Point Cloud Registration (PCR) is an important task in 3D data processing, involving the estimation of a rigid transformation between two point clouds. While deep-learning methods have addressed key limitations of traditional non-learning approaches, such as sensitivity to noise, outliers, occlusion, and initialization, they are developed and evaluated on clean, dense, synthetic datasets, which limits their generalizability to real-world industrial scenarios. This paper introduces R3PM-Net, a lightweight, global-aware, object-level point matching network designed to bridge this gap by prioritizing both generalizability and real-time efficiency. To support this transition, two datasets, Sioux-Cranfield and Sioux-Scans, are proposed. They provide an evaluation ground for registering imperfect photogrammetric and event-camera scans to digital CAD models, and have been made publicly available. Extensive experiments demonstrate that R3PM-Net achieves competitive accuracy with unmatched speed. On ModelNet40, it reaches a perfect fitness score of 1 and inlier RMSE of 0.029 cm in only 0.007 s, approximately 7 times faster than the state-of-the-art method RegTR. This performance carries over to the Sioux-Cranfield dataset, maintaining a fitness of 1 and inlier RMSE of 0.030 cm with similarly low latency. Furthermore, on the highly challenging Sioux-Scans dataset, R3PM-Net successfully resolves edge cases in under 50 ms. 
These results confirm that R3PM-Net offers a robust, high-speed solution for critical industrial applications, where precision and real-time performance are indispensable. The code and datasets are available at https://github.com/YasiiKB/R3PM-Net.
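
The fitness and inlier RMSE figures quoted above are registration quality measures. The sketch below uses the definitions common in point cloud toolkits: fitness is the share of (already transformed) source points with a target neighbor within a distance threshold, and inlier RMSE is the root-mean-square distance over those inliers. That R3PM-Net uses exactly these definitions is an assumption, and the brute-force neighbor search is only suitable for tiny examples.

```python
import math

def registration_metrics(source, target, max_dist):
    """Return (fitness, inlier_rmse) for source points already aligned to target."""
    inlier_sq_dists = []
    for p in source:
        # Squared distance from p to its nearest target point (brute force).
        d2 = min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in target)
        if d2 <= max_dist ** 2:
            inlier_sq_dists.append(d2)
    fitness = len(inlier_sq_dists) / len(source)
    if inlier_sq_dists:
        rmse = math.sqrt(sum(inlier_sq_dists) / len(inlier_sq_dists))
    else:
        rmse = 0.0
    return fitness, rmse

# Toy clouds: two source points land near the target, one is an outlier.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
target = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)]
fitness, rmse = registration_metrics(source, target, max_dist=0.5)
print(fitness, rmse)
```

Under these definitions a fitness of 1 means every source point found a target neighbor within the threshold, so the reported RMSE then summarizes the residual over the whole cloud.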

[80] SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

Zhongyu Yang,Zuhao Yang,Shuo Zhan,Tan Yue,Wei Pang,Yingfang Yuan

Main category: cs.CV

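TL;DR: This paper proposes SVAgent, a storyline-guided cross-modal multi-agent framework for video question answering (VideoQA) that improves performance and interpretability by building a narrative representation and reasoning collaboratively across agents.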

Details Motivation: Existing VideoQA methods mostly rely on locating relevant frames and lack the human ability to reason through a coherent storyline, which makes robust, context-aware prediction difficult. Method: Proposes the SVAgent framework: a storyline agent progressively builds a narrative representation; a refinement suggestion agent recommends key frames based on analysis of historical failures; cross-modal decision agents independently predict answers from the visual and textual modalities; a meta-agent aligns and evaluates the cross-modal outputs to improve consistency and robustness. Result: SVAgent achieves superior performance on multiple VideoQA benchmarks and shows stronger interpretability and human-like storyline reasoning. Conclusion: The storyline-guided multi-agent architecture addresses the weakness of existing VideoQA methods in narrative reasoning and offers an approach to video understanding that is closer to human cognition. Abstract: Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.

[81] Simultaneous Dual-View Mammogram Synthesis Using Denoising Diffusion Probabilistic Models

Jorge Alberto Garza-Abdala,Gerardo A. Fumagal-González,Eduardo de Avila-Armenta,Sadam Hussain,Jasiel H. Toscano-Martínezb,Diana S. M. Rosales Gurmendi,Alma A. Pedro-Pérez,Jose G. Tamez-Pena

Main category: cs.CV

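TL;DR: This paper proposes a three-channel denoising diffusion probabilistic model (DDPM) that generates the CC and MLO mammographic views simultaneously, with the third channel encoding the absolute difference between the two views to strengthen anatomical consistency. After fine-tuning on a private dataset, the method produces geometrically consistent synthetic pairs whose distribution is close to real images, suitable for data augmentation and cross-view AI applications.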

Details Motivation: Many breast imaging datasets lack complete paired CC and MLO views, which limits the development of algorithms that depend on cross-view consistency. Method: Designs a three-channel DDPM in which the CC and MLO views occupy two channels and the third channel holds their absolute difference image, guiding the model to learn anatomical correspondence; a pretrained DDPM from Hugging Face is fine-tuned on a private screening dataset, then used for synthesis and evaluation. Result: Synthetic images show good geometric consistency under automated breast mask segmentation, distributional characteristics close to real images, and good cross-view anatomical alignment on qualitative inspection; the difference-guided mechanism helps preserve global breast structure. Conclusion: A difference-guided DDPM can synthesize dual-view mammograms simultaneously, providing a feasible technical path for data augmentation and future cross-view-aware AI systems. Abstract: Breast cancer screening relies heavily on mammography, where the craniocaudal (CC) and mediolateral oblique (MLO) views provide complementary information for diagnosis. However, many datasets lack complete paired views, limiting the development of algorithms that depend on cross-view consistency. To address this gap, we propose a three-channel denoising diffusion probabilistic model capable of simultaneously generating CC and MLO views of a single breast. In this configuration, the two mammographic views are stored in separate channels, while a third channel encodes their absolute difference to guide the model toward learning coherent anatomical relationships between projections. A pretrained DDPM from Hugging Face was fine-tuned on a private screening dataset and used to synthesize dual-view pairs. Evaluation included geometric consistency via automated breast mask segmentation and distributional comparison with real images, along with qualitative inspection of cross-view alignment. The results show that the difference-based encoding helps preserve the global breast structure across views, producing synthetic CC-MLO pairs that resemble real acquisitions. This work demonstrates the feasibility of simultaneous dual-view mammogram synthesis using a difference-guided DDPM, highlighting its potential for dataset augmentation and future cross-view-aware AI applications in breast imaging.
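
The three-channel layout described above can be sketched as a small data-preparation step: two channels hold the CC and MLO views and the third holds their per-pixel absolute difference. The function names, the nested-list image format, and the 8-bit intensity range are illustrative assumptions; the authors' preprocessing (resizing, normalization, alignment) is not described in the abstract.

```python
def make_three_channel(cc, mlo):
    """Stack CC, MLO, and |CC - MLO| as channels of one training sample.

    cc and mlo are 2D lists of pixel intensities with identical shape.
    """
    if len(cc) != len(mlo) or any(len(a) != len(b) for a, b in zip(cc, mlo)):
        raise ValueError("CC and MLO views must have the same shape")
    diff = [[abs(a - b) for a, b in zip(row_cc, row_mlo)]
            for row_cc, row_mlo in zip(cc, mlo)]
    return [cc, mlo, diff]

def split_views(sample):
    """Recover the two synthesized views; the difference channel is auxiliary."""
    cc, mlo, _diff = sample
    return cc, mlo

# Tiny 2x2 toy images with made-up 8-bit intensities.
cc_view = [[10, 200], [35, 90]]
mlo_view = [[30, 150], [35, 120]]
sample = make_three_channel(cc_view, mlo_view)
print(sample[2])
```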

[82] Watch Before You Answer: Learning from Visually Grounded Post-Training

Yuxuan Zhang,EunJeong Hwang,Huaisong Zhang,Penghui Du,Yiming Jia,Dongfu Jiang,Xuan He,Shenhui Zhang,Ping Nie,Peter West,Kelsey R. Allen

Main category: cs.CV

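TL;DR: This paper finds severe text bias in current video understanding benchmarks and post-training datasets, and proposes VidGround, which post-trains only on visually grounded questions. This significantly improves VLM video understanding and shows that data quality matters more than complex algorithms.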

Details Motivation: Many questions in existing video understanding benchmarks and post-training datasets can be answered from text cues alone, so models do not truly learn vision-language alignment and the real performance bottleneck is hidden. Method: Proposes VidGround: select the questions that genuinely require visual information (visually grounded questions) for post-training, combined with RL-based post-training algorithms; the paper also stresses the importance of data curation. Result: Using only 69.1% of the original data, performance improves by up to 6.2 points relative to post-training on the full dataset; simple data filtering with a basic post-training algorithm outperforms several more complex post-training techniques. Conclusion: The key to improving video understanding is building evaluation benchmarks and post-training data that truly require visual grounding, rather than relying on more complex modeling; data quality is currently a major bottleneck. Abstract: It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
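
One plausible way to realize the filtering idea is to drop every question that a text-only answerer already gets right, keeping only those that seem to need the video. The sketch below illustrates that under loud assumptions: the `text_only_answer` callable, the question dictionary fields, and the single-guess criterion are all hypothetical, and the abstract does not state how VidGround actually identifies visually grounded questions.

```python
def select_visually_grounded(questions, text_only_answer):
    """Keep questions that a text-only answerer fails (hypothetical criterion)."""
    return [q for q in questions if text_only_answer(q) != q["answer"]]

def kept_fraction(kept, original):
    """Share of the original dataset that survives filtering."""
    return len(kept) / len(original)

# Toy dataset; "text_guess" stands in for a blind model's prediction.
questions = [
    {"id": 1, "answer": "B", "text_guess": "B"},  # answerable from text alone
    {"id": 2, "answer": "C", "text_guess": "A"},  # needs the video
    {"id": 3, "answer": "D", "text_guess": "A"},  # needs the video
]

def toy_text_only_answer(question):
    """Stand-in for querying a model without any video input."""
    return question["text_guess"]

kept = select_visually_grounded(questions, toy_text_only_answer)
print([q["id"] for q in kept], kept_fraction(kept, questions))
```

A single wrong blind guess is a weak signal on multiple-choice data, so a practical filter would likely need repeated sampling or several blind models; that refinement is left out here.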

[83] Lightweight True In-Pixel Encryption with FeFET Enabled Pixel Design for Secure Imaging

Md Rahatul Islam Udoy,Diego Ferrer,Wantong Li,Kai Ni,Sumeet Kumar Gupta,Ahmedullah Aziz

Main category: cs.CV

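TL;DR: This paper proposes SecurePix, a secure pixel sensor that uses the multidomain polarization states of ferroelectric field-effect transistors to perform true in-pixel encryption. It sharply reduces neural network recognition accuracy on encrypted images and supports key-based decryption through lookup tables.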

Details Motivation: Visual data can be exposed at multiple stages of the imaging pipeline, so end-to-end image sensor security is essential; encryption needs to happen before pixel values appear on the readout lines. Method: Designs and simulates SecurePix, a compact CMOS-compatible secure pixel architecture that performs true analog-domain encryption inside the pixel using a symmetric key based on the non-volatile multidomain polarization states of a ferroelectric field-effect transistor; verified with HSPICE simulation and a layout in a 45 nm CMOS PDK. Result: The pixel measures 2.33 × 3.01 μm²; ResNet-18 recognition accuracy drops from 99.29% to 9.58% on MNIST and from 91.33% to 6.98% on CIFAR-10; the programming and sensing power-delay products are 17 μW·μs and 1.25 μW·μs; lookup-table-based inverse mapping with the symmetric key supports recovery. Conclusion: SecurePix achieves low-overhead, hardware-level, non-volatile, programmable in-pixel encryption with strong resistance to neural network inference attacks, offering a new approach to end-to-end image sensor security. Abstract: Ensuring end-to-end security in image sensors has become essential as visual data can be exposed through multiple stages of the imaging pipeline. Advanced protection requires encryption to occur before pixel values appear on any readout lines. This work introduces a secure pixel sensor (SecurePix), a compact CMOS-compatible pixel architecture that performs true in-pixel encryption using a symmetric key realized through programmable, non-volatile multidomain polarization states of a ferroelectric field-effect transistor. The pixel and array operations are designed and simulated in HSPICE, while a 45 nm CMOS process design kit is used for layout drawing. The resulting layout confirms a pixel pitch of 2.33 × 3.01 μm². Each pixel's non-volatile programming level defines its analog transfer characteristic, enabling the photodiode voltage to be converted into an encrypted analog output within the pixel. Full-image evaluation shows that ResNet-18 recognition accuracy drops from 99.29 percent to 9.58 percent on MNIST and from 91.33 percent to 6.98 percent on CIFAR-10 after encryption, indicating strong resistance to neural-network-based inference. Lookup-table-based inverse mapping enables recovery for authorized receivers using the same symmetric key. Based on HSPICE simulation, the SecurePix achieves a per-pixel programming power-delay product of 17 μW·μs and a per-pixel sensing power-delay product of 1.25 μW·μs, demonstrating low-overhead hardware-level protection.
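
The idea of a per-pixel key selecting a transfer characteristic, with decryption by lookup table, can be mimicked in software. In the sketch below each pixel's key level picks one of four invertible mappings on 8-bit values, and the receiver inverts them with precomputed tables. The modular affine curves, the four levels, and all names are purely hypothetical stand-ins; they do not model FeFET physics or the paper's actual analog transfer characteristics.

```python
# Hypothetical (gain, offset) per programming level; odd gains are invertible mod 256.
LEVELS = {0: (1, 0), 1: (3, 17), 2: (5, 101), 3: (7, 200)}

def transfer(value, level):
    """Stand-in for the pixel's level-dependent transfer characteristic."""
    gain, offset = LEVELS[level]
    return (gain * value + offset) % 256

def build_inverse_luts():
    """One 256-entry inverse lookup table per programming level."""
    luts = {}
    for level in LEVELS:
        lut = [0] * 256
        for value in range(256):
            lut[transfer(value, level)] = value
        luts[level] = lut
    return luts

def encrypt(pixels, key):
    """Apply each pixel's own transfer curve; key holds one level per pixel."""
    return [transfer(v, k) for v, k in zip(pixels, key)]

def decrypt(cipher, key, luts):
    """Authorized receiver: invert each pixel with the table for its key level."""
    return [luts[k][c] for c, k in zip(cipher, key)]

pixels = [0, 10, 128, 255]
key = [0, 1, 2, 3]  # symmetric key: one programming level per pixel
cipher = encrypt(pixels, key)
recovered = decrypt(cipher, key, build_inverse_luts())
print(cipher, recovered)
```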

[84] Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI

Mingjie Li,Edward Kim,Yue Zhao,Ehsan Adeli,Kilian M. Pohl

Main category: cs.CV

TL;DR: NeuroQuant is a modality-aware, anatomically grounded 3D VQ-VAE for multi-modal brain MRI reconstruction, using factorized multi-axis attention, dual-stream encoding, shared codebook quantization, and FiLM-based decoding, trained with joint 2D/3D supervision.

Details Motivation: Existing brain VAEs focus on single-modality (e.g., T1) MRIs, ignoring the complementary diagnostic value of multi-modal data (e.g., T2); robust multi-modal representation learning remains underexplored. Method: Proposes NeuroQuant: a 3D vector-quantized VAE with (1) factorized multi-axis attention for cross-modality shared latent learning, (2) dual-stream 3D encoder separating anatomical (modality-invariant) and appearance (modality-specific) features, (3) shared codebook for anatomical discretization, (4) FiLM-based fusion during decoding, and (5) joint 2D/3D training to respect slice-wise MRI acquisition. Result: NeuroQuant achieves superior reconstruction fidelity over existing VAEs on two multi-modal brain MRI datasets, enabling scalable downstream generative modeling and cross-modal analysis. Conclusion: Anatomically grounded, modality-aware latent disentanglement combined with vector quantization and joint 2D/3D training significantly improves multi-modal brain MRI reconstruction, paving the way for more interpretable and generalizable medical generative models. Abstract: Learning a robust Variational Autoencoder (VAE) is a fundamental step for many deep learning applications in medical image analysis, such as MRI synthesis. Existing brain VAEs predominantly focus on single-modality data (i.e., T1-weighted MRI), overlooking the complementary diagnostic value of other modalities like T2-weighted MRIs. Here, we propose a modality-aware and anatomically grounded 3D vector-quantized VAE (VQ-VAE) for reconstructing multi-modal brain MRIs. Called NeuroQuant, it first learns a shared latent representation across modalities using factorized multi-axis attention, which can capture relationships between distant brain regions. It then employs a dual-stream 3D encoder that explicitly separates the encoding of modality-invariant anatomical structures from modality-dependent appearance. 
Next, the anatomical encoding is discretized using a shared codebook and combined with modality-specific appearance features via Feature-wise Linear Modulation (FiLM) during the decoding phase. This entire approach is trained using a joint 2D/3D strategy in order to account for the slice-based acquisition of 3D MRI data. Extensive experiments on two multi-modal brain MRI datasets demonstrate that NeuroQuant achieves superior reconstruction fidelity compared to existing VAEs, enabling a scalable foundation for downstream generative modeling and cross-modal brain image analysis.
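
The two operations named in the abstract, codebook quantization and FiLM, can be sketched in a few lines. This is a minimal illustration on plain Python lists rather than 3D feature volumes; the function names and the toy codebook are assumptions, and the way NeuroQuant predicts the FiLM parameters is not shown.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the nearest codebook entry in squared L2 distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sqdist(vector, codebook[i]))
    return index, codebook[index]

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]

# Usage: discretize an anatomical feature, then modulate it with
# appearance-derived parameters (gamma and beta are placeholders here).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
index, anatomy = quantize([0.9, 1.2], codebook)
decoded_input = film(anatomy, gamma=[2.0, 0.5], beta=[0.1, -0.1])
```

In this reading, the shared codebook carries modality-invariant structure while `gamma` and `beta` inject the modality-specific appearance.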

[85] MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

Ziqian Liu,Stephan Alaniz

Main category: cs.CV

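TL;DR: This paper proposes the MIRAGE framework to address over-editing and spatial misalignment in multi-instance, multi-instruction image editing; it parses instructions with a vision-language model and applies a multi-branch parallel denoising strategy for precise localized edits, substantially improving fine-grained consistency.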

Details Motivation: Existing instruction-guided image editing models (e.g., FLUX.2, Qwen-Image-Edit) suffer from severe over-editing and spatial misalignment in complex scenes that contain multiple similar instances and composite instructions. Method: Proposes MIRAGE, a training-free framework that uses a vision-language model to parse complex instructions into regional subsets and applies a multi-branch parallel denoising strategy, injecting the latent representations of target regions into the global representation space while preserving background integrity through a reference trajectory. Result: Substantially outperforms existing methods on MIRA-Bench and RefEdit-Bench, with strong instance-level edit precision and background consistency. Conclusion: MIRAGE effectively mitigates the key failure modes of multi-instance, multi-instruction editing, and the accompanying benchmark provides an important evaluation tool for future research. Abstract: Instruction-guided image editing has seen remarkable progress with models like FLUX.2 and Qwen-Image-Edit, yet they still struggle with complex scenarios with multiple similar instances each requiring individual edits. We observe that state-of-the-art models suffer from severe over-editing and spatial misalignment when faced with multiple identical instances and composite instructions. To this end, we introduce a comprehensive benchmark specifically designed to evaluate fine-grained consistency in multi-instance and multi-instruction settings. To address the failures of existing methods observed in our benchmark, we propose Multi-Instance Regional Alignment via Guided Editing (MIRAGE), a training-free framework that enables precise, localized editing. By leveraging a vision-language model to parse complex instructions into regional subsets, MIRAGE employs a multi-branch parallel denoising strategy. This approach injects latent representations of target regions into the global representation space while maintaining background integrity through a reference trajectory. Extensive evaluations on MIRA-Bench and RefEdit-Bench demonstrate that our framework significantly outperforms existing methods in achieving precise instance-level modifications while preserving background consistency. Our benchmark and code are available at https://github.com/ZiqianLiu666/MIRAGE.
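
One plausible reading of "injecting latent representations of target regions into the global representation space" is a per-region masked blend. The sketch below assumes flattened 1D latents and binary masks; the function name and the blending rule are illustrative assumptions, not the authors' implementation.

```python
def inject_regions(global_latent, region_latents, masks):
    """Blend each region branch into the global latent where its mask is 1.

    Positions not covered by any mask keep the global (reference) value,
    which is how the background stays untouched in this toy model.
    """
    out = list(global_latent)
    for latent, mask in zip(region_latents, masks):
        out = [m * r + (1 - m) * g for g, r, m in zip(out, latent, mask)]
    return out

# Usage: two edit branches, each confined to its own region.
merged = inject_regions(
    [0.0, 0.0, 0.0, 0.0],
    [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[1, 0, 0, 0], [0, 0, 1, 0]],
)
```

Keeping the branches separate until this blend is what prevents an edit meant for one instance from bleeding into a visually similar neighbor.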

[86] LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

Zhengqin Li,Cheng Zhang,Jakob Engel,Zhao Dong

Main category: cs.CV

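TL;DR: This paper proposes the Large Sparse Reconstruction Model (LSRM), which scales the transformer context window and combines it with sparse attention to substantially improve feed-forward 3D reconstruction and inverse rendering, approaching or even surpassing dense-view optimization in texture and geometry detail.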

Details Motivation: Existing object-centric feed-forward 3D reconstruction methods are robust and efficient but still lag behind dense-view optimization in recovering fine-grained texture and appearance. Method: Proposes LSRM with sparse attention and three core designs: (1) an efficient coarse-to-fine pipeline that predicts sparse high-resolution residuals; (2) a 3D-aware spatial routing mechanism that uses explicit geometric distances to establish 2D-3D correspondences; (3) a block-aware sequence parallelism strategy (All-gather-KV protocol) that balances dynamic sparse workloads across GPUs. Result: LSRM handles 20x more object tokens and more than 2x more image tokens; on novel-view synthesis it gains 2.5 dB PSNR and lowers LPIPS by 40%; on inverse rendering, texture and geometry details improve noticeably, with LPIPS matching or exceeding SOTA dense-view optimization methods. Conclusion: Enlarging the context window together with sparse modeling effectively narrows the quality gap between feed-forward reconstruction and dense optimization, offering a new approach to efficient, high-quality 3D reconstruction. Abstract: We introduce the Large Sparse Reconstruction Model to study how scaling transformer context windows impacts feed-forward 3D reconstruction. Although recent object-centric feed-forward methods deliver robust, high-quality reconstruction, they still lag behind dense-view optimization in recovering fine-grained texture and appearance. We show that expanding the context window -- by substantially increasing the number of active object and image tokens -- remarkably narrows this gap and enables high-fidelity 3D object reconstruction and inverse rendering. To scale effectively, we adapt native sparse attention in our architecture design, unlocking its capacity for 3D reconstruction with three key contributions: (1) an efficient coarse-to-fine pipeline that focuses computation on informative regions by predicting sparse high-resolution residuals; (2) a 3D-aware spatial routing mechanism that establishes accurate 2D-3D correspondences using explicit geometric distances rather than standard attention scores; and (3) a custom block-aware sequence parallelism strategy utilizing an All-gather-KV protocol to balance dynamic, sparse workloads across GPUs. As a result, LSRM handles 20x more object tokens and >2x more image tokens than prior state-of-the-art (SOTA) methods. Extensive evaluations on standard novel-view synthesis benchmarks show substantial gains over the current SOTA, yielding 2.5 dB higher PSNR and 40% lower LPIPS. 
Furthermore, when extending LSRM to inverse rendering tasks, qualitative and quantitative evaluations on widely-used benchmarks demonstrate consistent improvements in texture and geometry details, achieving an LPIPS that matches or exceeds that of SOTA dense-view optimization methods. Code and model will be released on our project page.
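
Routing by explicit geometric distance, rather than by attention scores, can be sketched as a nearest-neighbor selection. The sketch assumes each token already has a 3D position; the function name and the top-k rule are assumptions for illustration and do not reproduce LSRM's block-level design.

```python
import math

def route_by_distance(query_points, key_points, k):
    """For each query position, return the indices of its k nearest key positions."""
    routes = []
    for q in query_points:
        order = sorted(range(len(key_points)),
                       key=lambda j: math.dist(q, key_points[j]))
        routes.append(order[:k])
    return routes

# Usage: one object token attends only to the two geometrically closest image tokens.
routes = route_by_distance([(0.0, 0.0, 0.0)],
                           [(5.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
                           k=2)
```

Because the selection depends only on geometry, it can be computed before attention runs, which is what makes the sparsity pattern cheap to obtain.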

[87] OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models

Ali Aliev,Kamil Garifullin,Nikolay Yudin,Vera Soboleva,Alexander Molozhavenko,Ivan Oseledets,Aibek Alanov,Maxim Rakhuba

Main category: cs.CV

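TL;DR: This paper proposes OrthoFuse, a training-free method for fusing orthogonal adapters that exploits the geometric structure of Group-and-Shuffle orthogonal matrices and a geodesic approximation, combined with a spectra restoration transform, to achieve high-quality fusion of subject and style adapters in generative models.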

Details Motivation: How to merge adapters fine-tuned separately for different tasks (such as subject and style) into a single general adapter without training remains an open problem, especially under orthogonal fine-tuning (OFT). Method: Derives the manifold structure of Group-and-Shuffle (GS) orthogonal matrices and formulas for approximating geodesics on it, and introduces a spectra restoration transform to preserve the spectral properties of the merged adapter. Result: Experiments on subject-driven generation show that the method effectively fuses concept and style features; the authors report it as the first training-free fusion of multiplicative orthogonal adapters. Conclusion: The geometric properties of orthogonal parametrization enable efficient, high-quality, training-free multi-task adapter fusion, offering a new approach to parameter-efficient fine-tuning. Abstract: In a rapidly growing field of model training there is a constant practical interest in parameter-efficient fine-tuning and various techniques that use a small amount of training data to adapt the model to a narrow task. However, there is an open question: how to combine several adapters tuned for different tasks into one which is able to yield adequate results on both tasks? Specifically, merging subject and style adapters for generative models remains unresolved. In this paper we seek to show that in the case of orthogonal fine-tuning (OFT), we can use structured orthogonal parametrization and its geometric properties to get the formulas for training-free adapter merging. In particular, we derive the structure of the manifold formed by the recently proposed Group-and-Shuffle ($\mathcal{GS}$) orthogonal matrices, and obtain efficient formulas for the geodesics approximation between two points. Additionally, we propose a spectra restoration transform that restores spectral properties of the merged adapter for higher-quality fusion. We conduct experiments in subject-driven generation tasks showing that our technique to merge two $\mathcal{GS}$ orthogonal matrices is capable of uniting concept and style features of different adapters. To the best of our knowledge, this is the first training-free method for merging multiplicative orthogonal adapters. Code is available at https://github.com/ControlGenAI/OrthoFuse.
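
Why merge along a geodesic instead of averaging entries? Averaging two orthogonal matrices generally yields a matrix that is no longer orthogonal, while every point on the geodesic is. The sketch below shows this for the simplest case, 2D rotations, where the geodesic is just angle interpolation. It does not reproduce the paper's formulas for $\mathcal{GS}$ matrices; the function names are illustrative.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, the simplest orthogonal matrix family."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def angle_of(r):
    """Recover the rotation angle from a 2x2 rotation matrix."""
    return math.atan2(r[1][0], r[0][0])

def geodesic(r_a, r_b, t):
    """Point at fraction t along the shortest SO(2) geodesic from r_a to r_b."""
    delta = angle_of(r_b) - angle_of(r_a)
    delta = math.atan2(math.sin(delta), math.cos(delta))  # wrap to (-pi, pi]
    return rotation(angle_of(r_a) + t * delta)

# Usage: an equal-weight merge of two "adapters" stays orthogonal.
merged = geodesic(rotation(0.2), rotation(1.0), 0.5)
```

In higher dimensions the same idea uses the matrix logarithm and exponential, which is where efficient structured formulas become valuable.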

[88] Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

Muhammad Adil,Mehmood Ahmed,Muhammad Aqib,Vicente A. Gonzalez,Gaang Lee,Qipei Mei

Main category: cs.CV

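TL;DR: This paper proposes a detection-guided small vision-language model (sVLM) framework that combines YOLOv11n object detection with sVLM multimodal reasoning to improve the accuracy and efficiency of hazard identification in construction scenes; in zero-shot settings it markedly raises F1 score and explanation quality while adding only minimal inference overhead.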

Details Motivation: Large vision-language models (VLMs) are too computationally expensive for near real-time hazard detection on construction sites; small VLMs (sVLMs) are efficient but prone to low accuracy and hallucination in complex construction scenes, so a balance between efficiency and performance is needed. Method: Proposes a detection-guided sVLM framework: YOLOv11n first localizes workers and construction machinery, and the detections are then embedded into structured prompts that guide the sVLM toward spatially grounded multimodal hazard reasoning; six sVLMs (e.g., Gemma-3 4B, Qwen-3-VL) are evaluated zero-shot on a curated construction image dataset with hazard annotations and explanations. Result: All sVLMs improve; the best model, Gemma-3 4B, raises its F1-score from 34.5% to 50.6% and its explanation quality (BERTScore F1) from 0.61 to 0.82, at a cost of only 2.5 ms of additional inference time per image. Conclusion: Pairing lightweight object detection with sVLM reasoning enables accurate, low-latency, context-aware construction safety hazard detection and offers a feasible path to practical deployment. Abstract: Accurate and timely identification of construction hazards around workers is essential for preventing workplace accidents. While large vision-language models (VLMs) demonstrate strong contextual reasoning capabilities, their high computational requirements limit their applicability in near real-time construction hazard detection. In contrast, small vision-language models (sVLMs) with fewer than 4 billion parameters offer improved efficiency but often suffer from reduced accuracy and hallucination when analyzing complex construction scenes. To address this trade-off, this study proposes a detection-guided sVLM framework that integrates object detection with multimodal reasoning for contextual hazard identification. The framework first employs a YOLOv11n detector to localize workers and construction machinery within the scene. The detected entities are then embedded into structured prompts to guide the reasoning process of sVLMs, enabling spatially grounded hazard assessment. Within this framework, six sVLMs (Gemma-3 4B, Qwen-3-VL 2B/4B, InternVL-3 1B/2B, and SmolVLM-2B) were evaluated in zero-shot settings on a curated dataset of construction site images with hazard annotations and explanatory rationales. The proposed approach consistently improved hazard detection performance across all models. The best-performing model, Gemma-3 4B, achieved an F1-score of 50.6%, compared to 34.5% in the baseline configuration. Explanation quality also improved significantly, with BERTScore F1 increasing from 0.61 to 0.82. 
Despite incorporating object detection, the framework introduces minimal overhead, adding only 2.5 ms per image during inference. These results demonstrate that integrating lightweight object detection with small VLM reasoning provides an effective and efficient solution for context-aware construction safety hazard detection.
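
Embedding detections into a structured prompt might look like the following. The detection dictionaries, the confidence threshold, and the prompt wording are all hypothetical; the paper's actual prompt template is not given in the abstract, and no detector API is called here.

```python
def build_prompt(detections, min_conf=0.5):
    """Turn detector output into a text prompt for a small VLM.

    detections: list of dicts with hypothetical keys "label", "box"
    (x1, y1, x2, y2 in pixels) and "conf".
    """
    kept = [d for d in detections if d["conf"] >= min_conf]
    lines = []
    for i, d in enumerate(kept, start=1):
        x1, y1, x2, y2 = d["box"]
        lines.append(
            f"{i}. {d['label']} at box ({x1}, {y1}, {x2}, {y2}), "
            f"confidence {d['conf']:.2f}"
        )
    listing = "\n".join(lines) if lines else "(no workers or machinery detected)"
    return (
        "Detected entities:\n" + listing + "\n"
        "Identify any safety hazards involving these entities and explain your reasoning."
    )

# Usage with two made-up detections; the low-confidence one is filtered out.
prompt = build_prompt([
    {"label": "worker", "box": (10, 20, 50, 120), "conf": 0.91},
    {"label": "excavator", "box": (200, 40, 420, 300), "conf": 0.32},
])
```

Listing box coordinates gives the sVLM explicit spatial anchors, so it reasons about proximity between named entities instead of guessing what is in the scene.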

[89] Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D

Daniel DeTone,Tianwei Shen,Fan Zhang,Lingni Ma,Julian Straub,Richard Newcombe,Jakob Engel

Main category: cs.CV

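TL;DR: This paper proposes Boxer, an algorithm that estimates static 3D bounding boxes from 2D open-vocabulary object detections, posed images, and optional depth (a sparse point cloud or dense depth); at its core is BoxerNet, a transformer-based network that lifts 2D boxes to 3D, followed by multi-view fusion and geometric filtering to produce globally consistent 3D boxes; the method reduces reliance on costly 3D annotations and outperforms existing SOTA on multiple benchmarks.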

Details Motivation: 3D object localization, especially for open-world categories, is far from solved even though 2D detection has advanced considerably; existing methods are limited by their dependence on dense depth and large amounts of 3D annotations. Method: Proposes the Boxer framework, in which BoxerNet (a transformer architecture) lifts 2D boxes to 3D and is combined with multi-view fusion and geometric filtering; adds aleatoric uncertainty modeling, a median depth patch encoding to support sparse depth inputs, and large-scale training (more than 1.2 million unique 3DBBs). Result: BoxerNet substantially outperforms SOTA in open-world 3DBB lifting: 0.532 mAP in egocentric settings without dense depth (vs. 0.010 for CuTR) and 0.412 mAP on CA-1M with dense depth (vs. 0.250 for CuTR). Conclusion: Boxer effectively decouples 2D detection from 3D lifting, reduces reliance on 3D annotations, supports flexible depth inputs, and achieves high performance and strong generalization in open-world 3D localization. Abstract: Detecting and localizing objects in space is a fundamental computer vision problem. While much progress has been made to solve 2D object detection, 3D object localization is much less explored and far from solved, especially for open-world categories. To address this research challenge, we propose Boxer, an algorithm to estimate static 3D bounding boxes (3DBBs) from 2D open-vocabulary object detections, posed images and optional depth either represented as a sparse point cloud or dense depth. At its core is BoxerNet, a transformer-based network which lifts 2D bounding box (2DBB) proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent de-duplicated 3DBBs in metric world space. Boxer leverages the power of existing 2DBB detection algorithms (e.g. DETIC, OWLv2, SAM3) to localize objects in 2D. This allows the main BoxerNet model to focus on lifting to 3D rather than detecting, ultimately reducing the demand for costly annotated 3DBB training data. Extending the CuTR formulation, we incorporate an aleatoric uncertainty for robust regression, a median depth patch encoding to support sparse depth inputs, and large-scale training with over 1.2 million unique 3DBBs. BoxerNet outperforms state-of-the-art baselines in open-world 3DBB lifting, including CuTR in egocentric settings without dense depth (0.532 vs. 0.010 mAP) and on CA-1M with dense depth available (0.412 vs. 0.250 mAP).
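
Two of the ingredients above lend themselves to a short sketch: summarizing sparse depth per image patch with a median, and a regression loss with a learned (aleatoric) variance. Both functions are illustrative assumptions. The missing-value convention and patch layout are made up, and the Gaussian negative log-likelihood is one common form of aleatoric loss that may differ from the one used in BoxerNet.

```python
import math
import statistics

def median_depth_patches(depth, patch, missing=0.0):
    """Median of valid depth samples per patch; None where a patch has no depth.

    depth: 2D list in which `missing` marks pixels without a sparse depth sample.
    """
    rows, cols = len(depth), len(depth[0])
    out = []
    for r in range(0, rows, patch):
        row_out = []
        for c in range(0, cols, patch):
            valid = [depth[i][j]
                     for i in range(r, min(r + patch, rows))
                     for j in range(c, min(c + patch, cols))
                     if depth[i][j] != missing]
            row_out.append(statistics.median(valid) if valid else None)
        out.append(row_out)
    return out

def gaussian_nll(pred, target, log_var):
    """Heteroscedastic regression loss for one scalar target (constant term dropped)."""
    return 0.5 * (math.exp(-log_var) * (pred - target) ** 2 + log_var)
```

The median is robust to a few outlying points inside a patch, and the `log_var` term lets the network down-weight targets it considers noisy instead of being pulled toward them.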

[90] Hierarchical Mesh Transformers with Topology-Guided Pretraining for Morphometric Analysis of Brain Structures

Yujian Xiong,Mohammad Farazi,Yanxi Chen,Wenhui Zhu,Xuanzhao Dong,Natasha Lepore,Yi Su,Raza Mushtaq,Stephen Foldes,Andrew Yang,Yalin Wang

Main category: cs.CV

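TL;DR: This paper proposes a hierarchical transformer framework for heterogeneous meshes (volumetric and surface) that builds spatially adaptive tree partitions from simplicial complexes, supports multi-scale attention and the fusion of multiple morphometric features, and, combined with self-supervised pretraining, reaches SOTA on Alzheimer's disease classification, amyloid burden prediction, and focal cortical dysplasia detection.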

Details Motivation: Existing methods struggle to handle both large-scale unstructured volumetric meshes and surface meshes, and cannot flexibly integrate diverse vertex-level morphometric descriptors (such as cortical thickness and curvature), which limits their use across neuroimaging pipelines. Method: Proposes a hierarchical transformer over spatially adaptive tree partitions built from simplicial complexes; designs a feature projection module that decouples geometric structure from feature dimensionality; uses self-supervised pretraining by masked reconstruction of both coordinates and morphometric channels. Result: Achieves SOTA Alzheimer's disease classification and amyloid burden prediction on ADNI (volumetric meshes) and SOTA focal cortical dysplasia detection on MELD (surface meshes). Conclusion: The framework supports multiple mesh types and morphometric features within one architecture, generalizes and transfers well, and provides a general, robust approach to mesh representation learning in neuroimaging. Abstract: Representation learning on large-scale unstructured volumetric and surface meshes poses significant challenges in neuroimaging, especially when models must incorporate diverse vertex-level morphometric descriptors, such as cortical thickness, curvature, sulcal depth, and myelin content, which carry subtle disease-related signals. Current approaches either ignore these clinically informative features or support only a single mesh topology, restricting their use across imaging pipelines. We introduce a hierarchical transformer framework designed for heterogeneous mesh analysis that operates on spatially adaptive tree partitions constructed from simplicial complexes of arbitrary order. This design accommodates both volumetric and surface discretizations within a single architecture, enabling efficient multi-scale attention without topology-specific modifications. A feature projection module maps variable-length per-vertex clinical descriptors into the spatial hierarchy, separating geometric structure from feature dimensionality and allowing seamless integration of different neuroimaging feature sets. Self-supervised pretraining via masked reconstruction of both coordinates and morphometric channels on large unlabeled cohorts yields a transferable encoder backbone applicable to diverse downstream tasks and mesh modalities. We validate our approach on Alzheimer's disease classification and amyloid burden prediction using volumetric brain meshes from ADNI, as well as focal cortical dysplasia detection on cortical surface meshes from the MELD dataset, achieving state-of-the-art results across all benchmarks.

[91] Active Measurement of Two-Point Correlations

Max Hamilton,Daniel Sheldon,Subhransu Maji

Main category: cs.CV

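TL;DR: This paper proposes a human-in-the-loop framework that uses a pre-trained classifier to guide sampling, yielding efficient, low-variance estimates of the two-point correlation function (2PCF) for the subset of points satisfying a given property while substantially reducing manual annotation effort.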

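Details Motivation: Conventional approaches require careful labeling of large amounts of data to build catalogs of target sources (such as star clusters in astronomy), which is time-consuming; the target subset is often a tiny fraction of the full set, so a more efficient way to estimate the 2PCF is needed. Method: Proposes a human-in-the-loop framework in which a pre-trained classifier guides adaptive sampling, selecting the most informative points for human annotation; after each annotation it produces unbiased estimates of pair counts across multiple distance bins simultaneously; introduces a novel unbiased estimator, sampling strategy, and confidence interval construction. Result: Compared with simple Monte Carlo approaches, it achieves substantially lower estimation variance while significantly reducing the number of human annotations required, enabling scalable and statistically grounded measurement of two-point correlations in astronomy data. Conclusion: The framework offers an efficient, low-variance, statistically reliable approach to 2PCF estimation for sparse target subsets, and is particularly suited to scientific data analysis that depends on extensive human annotation. Abstract: Two-point correlation functions (2PCF) are widely used to characterize how points cluster in space. In this work, we study the problem of measuring the 2PCF over a large set of points, restricted to a subset satisfying a property of interest. An example comes from astronomy, where scientists measure the 2PCF of star clusters, which make up only a tiny subset of possible sources within a galaxy. This task typically requires careful labeling of sources to construct catalogs, which is time-consuming. We present a human-in-the-loop framework for efficient estimation of 2PCF of target sources. By leveraging a pre-trained classifier to guide sampling, our approach adaptively selects the most informative points for human annotation. After each annotation, it produces unbiased estimates of pair counts across multiple distance bins simultaneously. Compared to simple Monte Carlo approaches, our method achieves substantially lower variance while significantly reducing annotation effort. We introduce a novel unbiased estimator, sampling strategy, and confidence interval construction that together enable scalable and statistically grounded measurement of two-point correlations in astronomy datasets.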
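
To see how pair counts can be estimated without bias from a labeled subsample, consider a generic inverse-probability-weighted (Horvitz-Thompson style) estimator. This is not the paper's estimator or sampling scheme; it assumes each point is sent for annotation independently with a known probability, which in practice could be derived from classifier scores.

```python
import math

def estimate_pair_counts(points, incl_prob, sampled, is_target, bin_edges):
    """Inverse-probability-weighted target pair counts per distance bin.

    points: list of coordinate tuples.
    incl_prob[i]: probability that point i is sent for labeling
                  (assumed independent across points).
    sampled: indices that were actually labeled.
    is_target[i]: human label, consulted only for sampled indices.
    bin_edges: increasing edges; bin k covers [bin_edges[k], bin_edges[k + 1]).
    """
    counts = [0.0] * (len(bin_edges) - 1)
    targets = [i for i in sampled if is_target[i]]
    for a in range(len(targets)):
        for b in range(a + 1, len(targets)):
            i, j = targets[a], targets[b]
            d = math.dist(points[i], points[j])
            for k in range(len(counts)):
                if bin_edges[k] <= d < bin_edges[k + 1]:
                    # A pair is observed with probability p_i * p_j, so
                    # dividing by it makes the expectation equal the true count.
                    counts[k] += 1.0 / (incl_prob[i] * incl_prob[j])
                    break
    return counts
```

Sampling likely targets with higher probability keeps the weights small for the pairs that matter, which is the intuition behind classifier-guided variance reduction.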

[92] Protecting and Preserving Protest Dynamics for Responsible Analysis

Cohen Archbold,Usman Hassan,Nazmus Sakib,Sen-ching Cheung,Abdullah-Al-Zubaer Imran

Main category: cs.CV

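TL;DR: This paper proposes a responsible computing framework that uses conditional image synthesis to generate well-labeled synthetic protest images in place of sensitive real ones, reducing individual privacy risk when analyzing collective protest dynamics while accounting for downstream analytical utility and fairness.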

Details Motivation: Protest-related social media data are valuable but carry high risks of surveillance, repression, and privacy leakage; current AI systems can identify individuals, infer sensitive attributes, and cross-reference information across platforms, leading to identity leakage; existing analysis methods lack a holistic pipeline that integrates privacy risk assessment, downstream analysis, and fairness considerations. Method: Proposes a responsible computing framework that uses conditional image synthesis to replace sensitive protest imagery with labeled synthetic images, supporting analysis of collective patterns without exposing identifiable individuals; evaluates the generated data for diversity, realism, downstream utility, privacy risk reduction, and demographic fairness. Result: The method produces realistic and diverse synthetic imagery and significantly reduces privacy risk while maintaining downstream analytical utility; the fairness assessment reveals potential disparate effects of synthetic representations on different subgroups. Conclusion: The framework adopts a pragmatic, harm-mitigating strategy: rather than pursuing absolute privacy guarantees, it supports safe analysis in socially sensitive settings while acknowledging residual risks. Abstract: Protest-related social media data are valuable for understanding collective action but inherently high-risk due to concerns surrounding surveillance, repression, and individual privacy. Contemporary AI systems can identify individuals, infer sensitive attributes, and cross-reference visual information across platforms, enabling surveillance that poses risks to protesters and bystanders. In such contexts, large foundation models trained on protest imagery risk memorizing and disclosing sensitive information, leading to cross-platform identity leakage and retroactive participant identification. Existing approaches to automated protest analysis do not provide a holistic pipeline that integrates privacy risk assessment, downstream analysis, and fairness considerations. To address this gap, we propose a responsible computing framework for analyzing collective protest dynamics while reducing risks to individual privacy. Our framework replaces sensitive protest imagery with well-labeled synthetic reproductions using conditional image synthesis, enabling analysis of collective patterns without direct exposure of identifiable individuals. We demonstrate that our approach produces realistic and diverse synthetic imagery while balancing downstream analytical utility with reductions in privacy risk. We further assess demographic fairness in the generated data, examining whether synthetic representations disproportionately affect specific subgroups. 
Rather than offering absolute privacy guarantees, our method adopts a pragmatic, harm-mitigating approach that enables socially sensitive analysis while acknowledging residual risks.

[93] Coverage Optimization for Camera View Selection

Timothy Chen,Adam Dai,Maximilian Adang,Grace Gao,Mac Schwager

Main category: cs.CV

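TL;DR: This paper proposes the COVER metric, which selects informative camera views by minimizing a tractable approximation of the Fisher Information Gain, thereby improving 3D reconstruction quality. The method is lightweight and robust, and has been implemented and validated in the Nerfstudio framework.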

Details Motivation: High-quality observations are crucial for efficient and accurate 3D scene reconstruction, yet active view selection lacks a principled, interpretable criterion. Method: Starting from a tractable approximation of the Fisher Information Gain, proposes COVER, a lightweight view selection metric that favors insufficiently observed geometry and avoids expensive transmittance estimation; integrated into the Nerfstudio framework. Result: Across multiple real datasets and radiance-field baselines, COVER consistently outperforms existing active view selection methods and improves reconstruction quality. Conclusion: COVER offers a principled, computationally efficient, and robust approach to active view selection for both fixed and embodied data acquisition. Abstract: What makes a good viewpoint? The quality of the data used to learn 3D reconstructions is crucial for enabling efficient and accurate scene modeling. We study the active view selection problem and develop a principled analysis that yields a simple and interpretable criterion for selecting informative camera poses. Our key insight is that informative views can be obtained by minimizing a tractable approximation of the Fisher Information Gain, which reduces to favoring viewpoints that cover geometry that has been insufficiently observed by past cameras. This leads to a lightweight coverage-based view selection metric that avoids expensive transmittance estimation and is robust to noise and training dynamics. We call this metric COVER (Camera Optimization for View Exploration and Reconstruction). We integrate our method into the Nerfstudio framework and evaluate it on real datasets within fixed and embodied data acquisition scenarios. Across multiple datasets and radiance-field baselines, our method consistently improves reconstruction quality compared to state-of-the-art active view selection methods. Additional visualizations and our Nerfstudio package can be found at https://chengine.github.io/nbv_gym/.
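
The idea of "favoring viewpoints that cover geometry that has been insufficiently observed" can be sketched as a simple scoring rule. The score below, with its `saturation` threshold and per-point observation counts, is a hypothetical heuristic for illustration; it is not the COVER formula derived in the paper.

```python
def coverage_score(visible_points, seen_counts, saturation=3):
    """Sum of remaining observation 'need' over points visible from a view.

    A point already seen `saturation` or more times contributes nothing.
    """
    return sum(max(0, saturation - seen_counts.get(p, 0)) for p in visible_points)

def select_next_view(candidates, seen_counts, saturation=3):
    """candidates: dict mapping view id -> set of point ids visible from that view."""
    return max(candidates,
               key=lambda v: coverage_score(candidates[v], seen_counts, saturation))

# Usage: v2 looks at the least-observed geometry, so it is selected.
seen = {"a": 3, "b": 3, "c": 0, "d": 1}
views = {"v1": {"a", "b"}, "v2": {"c", "d"}, "v3": {"a", "d"}}
best = select_next_view(views, seen)
```

A score of this kind needs only visibility and counts, which is consistent with the abstract's point about avoiding transmittance estimation.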

[94] Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking

Chan-Wei Hu,Zhengzhong Tu

Main category: cs.CV

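TL;DR: This paper proposes Region-R1, a framework that dynamically crops the question-relevant region of the query image during re-ranking, improving the robustness and accuracy of visual re-rankers in multi-modal retrieval-augmented generation (MM-RAG).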

Details Motivation: Standard multi-modal re-rankers use a global embedding of the full image and are therefore susceptible to visual noise such as background clutter, which skews similarity scores. Method: Proposes Region-R1, a query-side region cropping framework that models region selection as a decision-making problem during re-ranking, and designs a region-aware group relative policy optimization (r-GRPO) algorithm that learns to dynamically crop a discriminative region. Result: Delivers consistent gains on the E-VQA and InfoSeek benchmarks, improving conditional Recall@1 by up to 20% and reaching SOTA. Conclusion: Query-side adaptive cropping is a simple yet effective way to strengthen MM-RAG re-ranking. Abstract: Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performances by increasing conditional Recall@1 by up to 20%. These results show the great promise of query-side adaptation as a simple but effective way to strengthen MM-RAG re-ranking.
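
The "group relative" part of GRPO-style training can be sketched as follows: several candidate actions (here, crops) are sampled for the same query, each receives a reward, and rewards are standardized within the group. This shows only the standard group normalization; the region-aware modifications in r-GRPO are not described in the abstract and are not modeled, and the reward definition in the usage lines is hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (mean 0, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Usage: four sampled crops for one query; reward 1.0 if the correct
# candidate ranks first after cropping (a hypothetical reward).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed, and a group in which every crop performs equally yields zero advantage for all of them.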

[95] Toward Unified Fine-Grained Vehicle Classification and Automatic License Plate Recognition

Gabriel E. Lima,Valfride Nascimento,Eduardo Santos,Eduil Nascimento,Rayson Laroca,David Menotti

Main category: cs.CV

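TL;DR: This paper introduces UFPR-VeSV, a large-scale fine-grained vehicle classification (FGVC) dataset captured in real-world conditions, and explores its joint use with automatic license plate recognition (ALPR).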

Details Motivation: Existing FGVC studies mostly assume ideal conditions, cover limited attributes, and lack joint analysis with ALPR; a more challenging, more thoroughly annotated dataset from real sources is needed to advance practical applications. Method: Builds the UFPR-VeSV dataset (24,945 images covering 13 colors, 26 makes, 136 models, and 14 types), with all annotations cross-validated against license plate information; benchmarks five deep learning models on FGVC and runs ALPR experiments with two OCR models to explore the joint use of FGVC and ALPR. Result: Experiments show that UFPR-VeSV is notably challenging (e.g., multicolored vehicles, infrared images, and distinguishing models that share a platform); combining FGVC with ALPR can improve the robustness and practicality of vehicle identification. Conclusion: UFPR-VeSV provides a high-quality benchmark for fine-grained vehicle understanding in real traffic surveillance and supports the effectiveness and potential of combining FGVC with ALPR. Abstract: Extracting vehicle information from surveillance images is essential for intelligent transportation systems, enabling applications such as traffic monitoring and criminal investigations. While Automatic License Plate Recognition (ALPR) is widely used, Fine-Grained Vehicle Classification (FGVC) offers a complementary approach by identifying vehicles based on attributes such as color, make, model, and type. Although there have been advances in this field, existing studies often assume well-controlled conditions, explore limited attributes, and overlook FGVC integration with ALPR. To address these gaps, we introduce UFPR-VeSV, a dataset comprising 24,945 images of 16,297 unique vehicles with annotations for 13 colors, 26 makes, 136 models, and 14 types. Collected from the Military Police of Paraná (Brazil) surveillance system, the dataset captures diverse real-world conditions, including partial occlusions, nighttime infrared imaging, and varying lighting. All FGVC annotations were validated using license plate information, with text and corner annotations also being provided. A qualitative and quantitative comparison with established datasets confirmed the challenging nature of our dataset. A benchmark using five deep learning models further validated this, revealing specific challenges such as handling multicolored vehicles, infrared images, and distinguishing between vehicle models that share a common platform. Additionally, we apply two optical character recognition models to license plate recognition and explore the joint use of FGVC and ALPR. 
The results highlight the potential of integrating these complementary tasks for real-world applications. The UFPR-VeSV dataset is publicly available at: https://github.com/Lima001/UFPR-VeSV-Dataset.

[96] From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal

Daniel George,Charles Yeh,Daniel Lee,Yifei Zhang

Main category: cs.CV

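TL;DR: This paper proposes an attacker-aware facial privacy audit framework comprising a new benchmark and ISP, a one-shot linear projection that removes identity leakage from visual embeddings while preserving their utility for visual retrieval.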

Details Motivation: Frozen visual embeddings (e.g., CLIP, DINOv2/v3, SSCD) carry unquantified identity leakage risk on face-containing data, and deployable privacy mitigations are lacking. Method: Builds a benchmark comprising open-set verification, a diffusion-based template inversion check, and face-context attribution with equal-area perturbations; proposes ISP, a one-shot linear projector that estimates and removes an identity subspace while preserving the complementary space needed for utility. Result: CLIP leaks more than DINOv2/v3 and SSCD; ISP reduces linear identity recognition to near chance while retaining high non-biometric utility, and it transfers across datasets. Conclusion: Presents the first attacker-calibrated facial privacy audit of non-face-recognition visual encoders and shows that linear subspace removal can provide strong privacy protection while preserving utility. Abstract: Frozen visual embeddings (e.g., CLIP, DINOv2/v3, SSCD) power retrieval and integrity systems, yet their use on face-containing data is constrained by unmeasured identity leakage and a lack of deployable mitigations. We take an attacker-aware view and contribute: (i) a benchmark of visual embeddings that reports open-set verification at low false-accept rates, a calibrated diffusion-based template inversion check, and face-context attribution with equal-area perturbations; and (ii) a one-shot linear projector that removes an estimated identity subspace while preserving the complementary space needed for utility, which for brevity we denote as the identity sanitization projection (ISP). Across CelebA-20 and VGGFace2, we show that these encoders are robust under open-set linear probes, with CLIP exhibiting relatively higher leakage than DINOv2/v3 and SSCD, robust to template inversion, and are context-dominant. In addition, we show that ISP drives linear access to near-chance while retaining high non-biometric utility, and transfers across datasets with minor degradation. Our results establish the first attacker-calibrated facial privacy audit of non-FR encoders and demonstrate that linear subspace removal achieves strong privacy guarantees while preserving utility for visual search and retrieval.
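
Removing a linear subspace from an embedding amounts to applying the projector $P = I - UU^\top$, where the columns of $U$ are an orthonormal basis of the subspace. The sketch below shows that operation on plain lists. How ISP estimates the identity directions is not reproduced; the direction vectors in the usage lines are placeholders.

```python
def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(x * y for x, y in zip(w, b))
            w = [x - dot * y for x, y in zip(w, b)]
        norm = sum(x * x for x in w) ** 0.5
        if norm > tol:  # skip directions already in the span
            basis.append([x / norm for x in w])
    return basis

def remove_subspace(embedding, basis):
    """Apply P = I - U U^T, where the orthonormal rows of `basis` are U's columns."""
    out = list(embedding)
    for b in basis:
        dot = sum(x * y for x, y in zip(embedding, b))
        out = [x - dot * y for x, y in zip(out, b)]
    return out

# Usage: placeholder "identity directions" spanning the first two axes.
identity_basis = orthonormalize([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
sanitized = remove_subspace([3.0, 4.0, 5.0], identity_basis)
```

The projector is computed once and applied to every embedding, so a linear probe restricted to the removed directions sees only zeros while the complementary components are left unchanged.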

[97] SmokeGS-R: Physics-Guided Pseudo-Clean 3DGS for Real-World Multi-View Smoke Restoration

Xueming Fu,Lixia Han

Main category: cs.CV

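TL;DR: SmokeGS-R is a 3D reconstruction method for real-world smoke scenes that decouples geometry recovery from appearance correction and achieved leading performance in the NTIRE 2026 challenge.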

Details Motivation: Real smoke simultaneously attenuates scene radiance, adds airlight, and breaks multi-view appearance consistency, which makes robust 3D reconstruction difficult. Method: Proposes the SmokeGS-R pipeline: generates physics-guided pseudo-clean supervision with a refined dark channel prior and guided filtering; trains a sharp, clean-only 3D Gaussian Splatting source model; then harmonizes the appearance of its renderings using geometric-mean reference aggregation, LAB-space Reinhard color transfer, and light Gaussian smoothing. Result: PSNR = 15.217 and SSIM = 0.666 on the official NTIRE 2026 test leaderboard; on re-evaluation with the RealX3D dataset, PSNR = 15.209, SSIM = 0.644, and LPIPS = 0.551, with PSNR 3.68 dB above the strongest baseline. Conclusion: A geometry-first reconstruction strategy combined with stable post-render appearance harmonization is an effective recipe for real-world multi-view smoke restoration. Abstract: Real-world smoke simultaneously attenuates scene radiance, adds airlight, and destabilizes multi-view appearance consistency, making robust 3D reconstruction particularly difficult. We present SmokeGS-R, a practical pipeline developed for the NTIRE 2026 3D Restoration and Reconstruction Track 2 challenge. The key idea is to decouple geometry recovery from appearance correction: we generate physics-guided pseudo-clean supervision with a refined dark channel prior and guided filtering, train a sharp clean-only 3D Gaussian Splatting source model, and then harmonize its renderings with a donor ensemble using geometric-mean reference aggregation, LAB-space Reinhard transfer, and light Gaussian smoothing. On the official challenge testing leaderboard, the final submission achieved PSNR = 15.217 and SSIM = 0.666. After the public release of RealX3D, we re-evaluated the same frozen result on the seven released challenge scenes without retraining and obtained PSNR = 15.209, SSIM = 0.644, and LPIPS = 0.551, outperforming the strongest official baseline average on the same scenes by +3.68 dB PSNR. These results suggest that a geometry-first reconstruction strategy combined with stable post-render appearance harmonization is an effective recipe for real-world multi-view smoke restoration. The code is available at https://github.com/windrise/3drr_Track2_SmokeGS-R.
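
Reinhard color transfer, used in the harmonization step, matches the per-channel mean and standard deviation of one image to those of a reference. The sketch below handles a single channel given as a flat list; in the pipeline described above it would be applied to each LAB channel, and the RGB-to-LAB conversion is omitted here. The function name and `eps` guard are assumptions.

```python
import statistics

def reinhard_transfer(source, reference, eps=1e-8):
    """Shift and scale one channel so its mean and std match the reference."""
    mu_s, mu_r = statistics.fmean(source), statistics.fmean(reference)
    sd_s, sd_r = statistics.pstdev(source), statistics.pstdev(reference)
    scale = sd_r / (sd_s + eps)  # eps guards against a constant source channel
    return [(x - mu_s) * scale + mu_r for x in source]

# Usage: a dim, low-contrast channel is matched to a brighter reference.
harmonized = reinhard_transfer([10.0, 20.0, 30.0], [100.0, 120.0, 140.0])
```

Since only two statistics per channel are changed, the transfer corrects global color casts without altering the geometry or fine structure of the rendering.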

[98] Indoor Asset Detection in Large Scale 360° Drone-Captured Imagery via 3D Gaussian Splatting

Monica Tang,Avideh Zakhor

Main category: cs.CV

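TL;DR: This paper proposes a method for object-level detection and segmentation of target indoor assets in 3D Gaussian Splatting (3DGS) scenes reconstructed from 360° drone-captured imagery; it introduces a 3D object codebook that jointly uses mask semantics and the spatial information of Gaussian primitives, and combines 2D detection/segmentation models with a constrained merging strategy to achieve multi-view mask consistency and accurate 3D asset detection.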

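Details Motivation: Existing methods struggle to deliver reliable, consistent multi-view object-level detection and segmentation in 3DGS scenes, particularly for asset identification in complex indoor environments. Method: Proposes a 3D object codebook that fuses mask semantics with the spatial information of Gaussian primitives; combines 2D detection/segmentation models with a semantically and spatially constrained multi-view mask merging procedure to produce coherent 3D object instances. Result: Experiments on two large indoor scenes show markedly better multi-view mask consistency, with an F1 score 65% higher than SOTA baselines, and an 11% mAP gain in object-level 3D asset detection. Conclusion: The method effectively addresses object-level detection and segmentation of indoor assets in 3DGS, balancing multi-view consistency with alignment between 3D geometry and semantics. Abstract: We present an approach for object-level detection and segmentation of target indoor assets in 3D Gaussian Splatting (3DGS) scenes, reconstructed from 360° drone-captured imagery. We introduce a 3D object codebook that jointly leverages mask semantics and spatial information of their corresponding Gaussian primitives to guide multi-view mask association and indoor asset detection. By integrating 2D object detection and segmentation models with semantically and spatially constrained merging procedures, our method aggregates masks from multiple views into coherent 3D object instances. Experiments on two large indoor scenes demonstrate reliable multi-view mask consistency, improving F1 score by 65% over state-of-the-art baselines, and accurate object-level 3D indoor asset detection, achieving an 11% mAP gain over baseline methods.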
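
A semantically and spatially constrained merge can be sketched by treating each 2D mask as a class label plus the set of Gaussian primitives it covers. The greedy rule, the Jaccard threshold, and the data layout below are assumptions for illustration; the paper's codebook and merging procedure are not specified in the abstract.

```python
def jaccard(a, b):
    """Overlap of two sets of Gaussian ids (intersection over union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def associate_masks(masks, threshold=0.3):
    """Greedily merge per-view masks into 3D object instances.

    masks: list of (label, set_of_gaussian_ids), one entry per 2D mask.
    Two masks merge only if their labels agree (semantic constraint) and
    they share enough Gaussians (spatial constraint).
    """
    objects = []
    for label, ids in masks:
        for obj in objects:
            if obj[0] == label and jaccard(obj[1], ids) >= threshold:
                obj[1].update(ids)
                break
        else:
            objects.append([label, set(ids)])
    return [(label, ids) for label, ids in objects]

# Usage: two overlapping chair masks merge; the table and a distant chair do not.
instances = associate_masks([
    ("chair", {1, 2, 3}), ("chair", {2, 3, 4}),
    ("table", {2, 3}), ("chair", {10, 11}),
])
```

Requiring both constraints prevents a table mask that happens to overlap a chair in one view from being absorbed into the chair instance.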

[99] VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for Vision-Language-Action Models Inference Acceleration and Success

Chuhang Liu,Yayun He,Zuheng Kang,Xiaoyang Qu,Jianzong Wang

Main category: cs.CV

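TL;DR: This paper proposes VLA-InfoEntropy, which uses image entropy and attention entropy to dynamically adjust the regions the model attends to. By combining spatial, semantic, and temporal information, it reduces computational cost while preserving critical content and speeds up inference.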

Details Motivation: VLA models must jointly process high-dimensional visual features, complex language inputs, and continuous action sequences, which leads to high computational cost and low inference efficiency and hinders real-time deployment. Method: Proposes a dynamic focusing strategy based on image entropy (characterizing the grayscale distribution of each visual token) and attention entropy (characterizing the distribution of attention scores over task-related text); combined with timestep information, it steers the model from global visual features toward attention-guided, information-rich local regions. Result: Experiments show that the method reduces inference parameters, accelerates inference, and outperforms existing methods. Conclusion: By fusing spatial, semantic, and temporal cues, VLA-InfoEntropy reduces redundancy while retaining critical information, offering a new direction for efficient VLA model design. Abstract: Vision-Language-Action (VLA) models integrate visual perception, language understanding, and action decision-making for cross-modal semantic alignment, exhibiting broad application potential. However, the joint processing of high-dimensional visual features, complex linguistic inputs, and continuous action sequences incurs significant computational overhead and low inference efficiency, thereby hindering real-time deployment and reliability. To address this issue, we use image entropy to quantify the grayscale distribution characteristics of each visual token and introduce attention entropy to capture the distribution of attention scores over task-related text. Visual entropy identifies texture-rich or structurally informative regions, while attention entropy pinpoints semantically relevant tokens. Combined with timestep information, these metrics enable a dynamic transition strategy that shifts the model's focus from global visual features to attention-guided local informative regions. Thus, the resulting VLA-InfoEntropy method integrates spatial, semantic, and temporal cues to reduce redundancy while preserving critical content. Extensive experiments show that our method reduces inference parameters, accelerates inference speed, and outperforms existing approaches.
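
The two entropy measures in the abstract can be sketched as plain Shannon entropies. The following is an illustration under stated assumptions, not the paper's implementation: the 16-bin grayscale histogram, the function names, and the normalization of raw attention scores by their sum are all choices made here for the example.

```python
import math
from collections import Counter

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def image_entropy(gray_pixels, bins=16, max_value=256):
    """Entropy of the grayscale histogram of one visual token (patch).

    gray_pixels is a flat list of intensities in [0, max_value).
    The bin count is an assumption of this sketch.
    """
    counts = Counter(min(int(v) * bins // max_value, bins - 1) for v in gray_pixels)
    total = len(gray_pixels)
    return shannon_entropy([c / total for c in counts.values()])

def attention_entropy(scores):
    """Entropy of non-negative attention scores after sum-normalization."""
    total = sum(scores)
    return shannon_entropy([s / total for s in scores])

flat_patch = [128] * 64                   # uniform patch: no texture
textured_patch = [0] * 32 + [255] * 32    # two equally likely gray levels
flat_bits = image_entropy(flat_patch)
textured_bits = image_entropy(textured_patch)
spread_bits = attention_entropy([1.0, 1.0, 1.0, 1.0])
```

A uniform patch scores zero bits and a two-level patch scores one bit, which matches the abstract's reading that higher image entropy marks texture-rich regions. How the paper combines these scores with the timestep is not specified in the abstract and is left out here.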

[100] Unsupervised Multi-agent and Single-agent Perception from Cooperative Views

Haochen Yang,Baolu Li,Lei Li,Delin Ren,Jiacheng Guo,Minghai Qin,Tianyun Zhang,Hongkai Yu

Main category: cs.CV

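TL;DR: This paper proposes an Unsupervised Multi-agent and Single-agent (UMS) perception framework that uses point cloud sharing between agents to increase point cloud density and cross-view consistency, enabling joint 3D object detection without human annotations.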

Details Motivation: No existing method solves multi-agent and single-agent LiDAR perception simultaneously in an unsupervised setting, yet multi-agent cooperative perception provides denser point clouds and complementary views that are well suited to unsupervised learning. Method: Proposes the UMS framework: 1) a learning-based Proposal Purifying Filter refines the classification of candidate proposals after multi-agent fusion; 2) a Progressive Proposal Stabilizing module generates reliable pseudo labels through easy-to-hard curriculum learning; 3) Cross-View Consensus Learning uses the multi-agent cooperative view to guide single-agent detection. Result: On the public V2V4Real and OPV2V datasets, UMS significantly outperforms state-of-the-art methods on both unsupervised multi-agent and single-agent 3D detection. Conclusion: Multi-agent cooperation can improve single-agent perception in an unsupervised way, and UMS offers a new approach to annotation-free environmental understanding for cooperative vehicles and robot swarms. Abstract: LiDAR-based multi-agent and single-agent perception has shown promising performance in environmental understanding for robots and automated vehicles. However, there is no existing method that simultaneously solves both multi-agent and single-agent perception in an unsupervised way. By sharing sensor data between multiple agents via communication, this paper discovers two key insights: 1) improved point cloud density after data sharing from cooperative views could benefit unsupervised object classification, and 2) the cooperative view of multiple agents can be used as unsupervised guidance for 3D object detection in the single view. Based on these two insights, we propose an Unsupervised Multi-agent and Single-agent (UMS) perception framework that leverages multi-agent cooperation without human annotations to simultaneously solve multi-agent and single-agent perception. UMS combines a learning-based Proposal Purifying Filter to better classify the candidate proposals after multi-agent point cloud density cooperation, followed by a Progressive Proposal Stabilizing module to yield reliable pseudo labels through easy-to-hard curriculum learning. Furthermore, we design a Cross-View Consensus Learning scheme that uses the multi-agent cooperative view to guide detection in the single-agent view. Experimental results on two public datasets, V2V4Real and OPV2V, show that our UMS method achieves significantly higher 3D detection performance than state-of-the-art methods on both multi-agent and single-agent perception tasks in an unsupervised setting.

[101] GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy

Yang Yi,Xieyuanli Chen,Jinpu Zhang,Hui Shen,Dewen Hu

Main category: cs.CV

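TL;DR: This paper proposes a multi-cue guided local feature learning framework in which semantic, surface-normal, and depth-stability cues jointly improve keypoint detection robustness and descriptor discriminability. It introduces the SDAK detection mechanism and the UTCF descriptor fusion module and validates them on multiple benchmarks.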

Details Motivation: Existing methods rely mainly on a single appearance cue, which leads to unstable keypoints and insufficiently discriminative descriptors. Method: Builds a joint semantic-normal prediction head and a depth stability prediction head; proposes the Semantic-Depth Aware Keypoint (SDAK) detection mechanism; designs the Unified Triple-Cue Fusion (UTCF) descriptor fusion module, which uses a semantic-scheduled gating mechanism to adaptively fuse multi-attribute features. Result: Experiments on four benchmark datasets validate the framework, with clear gains in keypoint detection robustness and descriptor discriminability. Conclusion: Jointly modeling multiple cues (semantic and geometric) mitigates the limitations of single-cue modeling and offers a new direction for robust local feature learning. Abstract: Robust local feature detection and description are foundational tasks in computer vision. Existing methods primarily rely on single appearance cues for modeling, leading to unstable keypoints and insufficient descriptor discriminability. In this paper, we propose a multi-cue guided local feature learning framework that leverages semantic and geometric cues to synergistically enhance detection robustness and descriptor discriminability. Specifically, we construct a joint semantic-normal prediction head and a depth stability prediction head atop a lightweight backbone. The former leverages a shared 3D vector field to deeply couple semantic and normal cues, thereby resolving optimization interference from heterogeneous inconsistencies. The latter quantifies the reliability of local regions from a geometric consistency perspective, providing deterministic guidance for robust keypoint selection. Based on these predictions, we introduce the Semantic-Depth Aware Keypoint (SDAK) mechanism for feature detection. By coupling semantic reliability with depth stability, SDAK reweights keypoint responses to suppress spurious features in unreliable regions. For descriptor construction, we design a Unified Triple-Cue Fusion (UTCF) module, which employs a semantic-scheduled gating mechanism to adaptively inject multi-attribute features, improving descriptor discriminability. Extensive experiments on four benchmarks validate the effectiveness of the proposed framework. The source code and pre-trained model will be available at: https://github.com/yiyscut/GESS.git.

[102] Rethinking IRSTD: Single-Point Supervision Guided Encoder-only Framework is Enough for Infrared Small Target Detection

Rixiang Ni,Boyang Li,Jun Chen,Yonghao Li,Feiyu Ren,Yuji Wang,Haoyang Yuan,Wujiao He,Wei An

Main category: cs.CV

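TL;DR: This paper proposes SPIRE, which reformulates infrared small target detection (IRSTD) as a centroid regression task and uses single-point supervision with probabilistic response encoding to localize targets efficiently with a low false alarm rate.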

Details Motivation: The prevailing encoder-decoder segmentation paradigm with pixel-level supervision overlooks the fact that small targets occupy only a few pixels and have blurred boundaries, so target localization should take priority over full-region segmentation. Method: Proposes SPIRE: 1) Point-Response Prior Supervision (PRPS) converts single-point annotations into a probabilistic response map consistent with infrared point-target response characteristics; 2) a High-Resolution Probabilistic Encoder (HRPE) performs encoder-only, end-to-end regression and avoids decoder reconstruction; 3) high-resolution features are preserved and the effective supervision density is increased to alleviate optimization instability under sparse target distributions. Result: On benchmarks including SIRST-UAVB and SIRST4, SPIRE achieves competitive target-level detection performance with a consistently low false alarm rate (Fa) and significantly reduced computational cost. Conclusion: Modeling IRSTD as centroid regression with single-point supervision and probabilistic response encoding is a more reasonable and efficient direction, and SPIRE strikes a good balance among accuracy, robustness, and efficiency. Abstract: Infrared small target detection (IRSTD) aims to separate small targets from cluttered backgrounds. Extensive research is dedicated to the pixel-level supervision-guided "encoder-decoder" segmentation paradigm. Although these methods achieve promising performance, they neglect the fact that small targets occupy only a few pixels and usually have blurred boundaries caused by cluttered backgrounds. Based on this observation, we argue that the first principle of IRSTD should be target localization rather than segmenting the entire target region together with indistinguishable background noise. In this paper, we reformulate IRSTD as a centroid regression task and propose a novel Single-Point Supervision guided Infrared Probabilistic Response Encoding method (namely, SPIRE), which is indeed challenging due to the mismatch between reduced supervision network and equivalent output. Specifically, we first design a Point-Response Prior Supervision (PRPS), which transforms single-point annotations into a probabilistic response map consistent with infrared point-target response characteristics, paired with a High-Resolution Probabilistic Encoder (HRPE) that enables encoder-only, end-to-end regression without decoder reconstruction. By preserving high-resolution features and increasing effective supervision density, SPIRE alleviates optimization instability under sparse target distributions. 
Finally, extensive experiments on various IRSTD benchmarks, including SIRST-UAVB and SIRST4, demonstrate that SPIRE achieves competitive target-level detection performance with a consistently low false alarm rate (Fa) and significantly reduced computational cost. Code is publicly available at: https://github.com/NIRIXIANG/SPIRE-IRSTD.
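
To make the idea of turning single-point annotations into a dense response map concrete, here is a small sketch. The abstract does not give the exact response function, so the isotropic Gaussian, the `sigma` value, and the function name below are assumptions chosen for illustration only.

```python
import math

def point_response_map(height, width, points, sigma=1.5):
    """Build a dense target map from (row, col) point annotations.

    Each point contributes a Gaussian bump (an assumed response shape);
    overlapping bumps are merged with an element-wise maximum.
    """
    grid = [[0.0] * width for _ in range(height)]
    for cy, cx in points:
        for y in range(height):
            for x in range(width):
                d2 = (y - cy) ** 2 + (x - cx) ** 2
                grid[y][x] = max(grid[y][x], math.exp(-d2 / (2 * sigma ** 2)))
    return grid

# One annotated centroid in the middle of a 5x5 image.
response = point_response_map(5, 5, [(2, 2)])
```

Compared with a single positive pixel, such a map gives the network a supervised value at every location near the target, which is one way to read the abstract's "increasing effective supervision density".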

[103] 3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models

Jae Joong Lee

Main category: cs.CV

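TL;DR: This paper proposes 3DTurboQuant, a general training-free compression method for 3D models that needs no data-dependent codebook. It uses random rotation followed by Lloyd-Max quantization to compress 3DGS and DUSt3R efficiently with near-lossless quality.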

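Details Motivation: Existing compression methods for 3D reconstruction models (such as 3DGS, NeRF, and DUSt3R) all require per-scene fine-tuning to learn a data-dependent codebook, which is cumbersome and not general; the author observes that in a certain dimension range the parameter distribution has predictable statistics, so the learning step can be avoided. Method: Building on the theoretical result that the coordinates of a high-dimensional vector follow a Beta distribution after random rotation, the method applies precomputed, data-independent Lloyd-Max scalar quantization and contributes four components: (1) a dimension-dependent criterion for quantizability and bit-width, (2) norm-separation bounds that map quantization MSE to rendering PSNR, (3) an entry-grouping rotation-quantization strategy for 2-dimensional hash grid features, and (4) a composable pruning-quantization pipeline with a closed-form compression ratio. Result: On NeRF Synthetic, 3DGS is compressed by 3.5× with only 0.02 dB PSNR loss, and DUSt3R KV caches by 7.9× with 39.7 dB pointmap fidelity; no training, codebook learning, or calibration data is needed, and compression takes seconds. Conclusion: The statistical regularity of high-dimensional parameters can be exploited directly, without the traditional data-driven quantization pipeline; 3DTurboQuant offers a general, plug-and-play, theoretically grounded approach to lightweight 3D models. Abstract: Every existing method for compressing 3D Gaussian Splatting, NeRF, or transformer-based 3D reconstructors requires learning a data-dependent codebook through per-scene fine-tuning. We show this is unnecessary. The parameter vectors that dominate storage in these models, 45-dimensional spherical harmonics in 3DGS and 1024-dimensional key-value vectors in DUSt3R, fall in a dimension range where a single random rotation transforms any input into coordinates with a known Beta distribution. This makes precomputed, data-independent Lloyd-Max quantization near-optimal, within a factor of 2.7 of the information-theoretic lower bound. We develop 3DTurboQuant, deriving (1) a dimension-dependent criterion that predicts which parameters can be quantized and at what bit-width before running any experiment, (2) norm-separation bounds connecting quantization MSE to rendering PSNR per scene, (3) an entry-grouping strategy extending rotation-based quantization to 2-dimensional hash grid features, and (4) a composable pruning-quantization pipeline with a closed-form compression ratio. On NeRF Synthetic, 3DTurboQuant compresses 3DGS by 3.5x with 0.02dB PSNR loss and DUSt3R KV caches by 7.9x with 39.7dB pointmap fidelity. No training, no codebook learning, no calibration data. Compression takes seconds. The code will be released (https://github.com/JaeLee18/3DTurboQuant).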
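
The rotate-then-quantize idea can be sketched in a few lines. This is a simplified illustration, not the released method: the paper uses a quantizer precomputed from the known Beta distribution, whereas the sketch below estimates a codebook empirically from synthetic random unit vectors (still without touching any model data). The dimension, level count, and all function names are assumptions.

```python
import math
import random

def random_rotation(dim, rng):
    """Random orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows

def rotate(matrix, vec):
    """Matrix-vector product with the rotation stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def lloyd_max(samples, levels, iters=30):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    lo, hi = min(samples), max(samples)
    centers = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for s in samples:
            idx = min(range(levels), key=lambda i: abs(s - centers[i]))
            buckets[idx].append(s)
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return centers

def quantize(value, centers):
    """Map a scalar to its nearest codebook level."""
    return min(centers, key=lambda c: abs(value - c))

rng = random.Random(0)
dim = 8
rotation = random_rotation(dim, rng)

# Codebook fitted to coordinates of synthetic random unit vectors (no real data).
samples = []
for _ in range(200):
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(a * a for a in g))
    samples.extend(a / n for a in g)
codebook = lloyd_max(samples, levels=4)

spiky = [1.0] + [0.0] * (dim - 1)   # all energy in one coordinate
rotated = rotate(rotation, spiky)    # energy spread across coordinates
restored_levels = [quantize(v, codebook) for v in rotated]
```

The rotation preserves the vector's norm while spreading its energy across coordinates, so one fixed scalar codebook can serve every input; decoding would apply the transpose of `rotation` to the quantized coordinates.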

[104] UAVReason: A Unified, Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation

Jintao Sun,Hu Zhang,Donglin Di,Gangyi Ding,Zhedong Zheng

Main category: cs.CV

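TL;DR: This paper introduces UAVReason, the first large-scale multimodal benchmark for nadir-view UAV scenes. It contains 273K VQA samples covering spatial and temporal reasoning as well as multimodal generation tasks, and a unified multi-task baseline model is built to validate it.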

Details Motivation: Existing vision-language models degrade in high-altitude UAV scenes because of domain shift (small objects, dense layouts, repetitive textures, ambiguous top-down orientation), and there is no unified multimodal benchmark dedicated to the nadir UAV view. Method: Builds the UAVReason benchmark from a high-fidelity UAV simulation platform, with 273K VQA samples (single-frame captions, two-frame temporal sequences, cross-modal generation) covering 22 spatial and temporal reasoning types; designs a unified multi-task learning baseline that jointly optimizes VQA, segmentation, and generation. Result: Experiments show that general-purpose VLMs are limited in UAV scenes, while the proposed multi-task baseline improves substantially on VQA (EM/F1), segmentation (mIoU), and generation (CLIP Score); all data, code, and evaluation tools will be open-sourced. Conclusion: UAVReason fills the benchmark gap for nadir-view UAV multimodal understanding and generation, shows that unified multi-task learning is important for UAV-native performance, and provides a foundation for follow-up research. Abstract: Vision-Language models (VLMs) have demonstrated remarkable capability in ground-view visual understanding but often fracture when deployed on high-altitude Unmanned Aerial Vehicles (UAVs). The failure largely stems from a pronounced domain shift, characterized by tiny and densely packed objects, repetitive textures, and ambiguous top-down orientations. These factors severely disrupt semantic grounding and hinder both spatial reasoning and controllable generation. To bridge this critical gap, we introduce UAVReason, the first unified large-scale multi-modal benchmark dedicated to nadir-view UAV scenarios, derived from a high-fidelity UAV simulation platform. In contrast to existing UAV benchmarks, which are largely siloed and focus on single tasks like object detection or segmentation, UAVReason uniquely consolidates over 273K Visual Question Answering (VQA) pairs, including 23.6K single frames with detailed captions, 68.2K 2-frame temporal sequences, and 188.8K cross-modal generation samples. The benchmark probes 22 diverse reasoning types across spatial and temporal axes while simultaneously evaluating high-fidelity generation across RGB, depth, and segmentation modalities. We further establish a strong, unified baseline model via multi-task learning. Extensive experiments validate the efficacy of our unified approach across diverse metrics, such as EM/F1 for VQA, mIoU for segmentation, and CLIP Score for generation. 
These results indicate limitations of general-domain vision-language models and show that unified multi-task learning substantially improves UAV-native performance. All data, code, and evaluation tools will be publicly released to advance UAV multimodal research.

[105] LUMOS: Universal Semi-Supervised OCT Retinal Layer Segmentation with Hierarchical Reliable Mutual Learning

Yizhou Fang,Jian Zhong,Li Lin,Xiaoying Tang

Main category: cs.CV

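TL;DR: This paper proposes the LUMOS framework, which combines a Dual-Decoder Network with a Hierarchical Prompting Strategy (DDN-HPS) and Reliable Progressive Multi-granularity Learning (RPML) to achieve robust OCT retinal layer segmentation when annotations are scarce and label granularities differ across datasets.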

Details Motivation: OCT layer segmentation faces scarce annotations and inconsistent label granularity across datasets, and existing semi-supervised methods cannot fully exploit cross-granularity supervision. Method: Proposes the LUMOS framework, which comprises a Dual-Decoder Network with a Hierarchical Prompting Strategy (DDN-HPS) to suppress the propagation of pseudo-label noise, and Reliable Progressive Multi-granularity Learning (RPML), which introduces region-level reliability weighting and a progressive training scheme for stable cross-granularity alignment. Result: Experiments on six OCT datasets show that LUMOS clearly outperforms existing methods and generalizes well across domains and granularities. Conclusion: LUMOS addresses annotation scarcity and heterogeneous label granularity and offers a new approach to universal OCT layer segmentation. Abstract: Optical Coherence Tomography (OCT) layer segmentation faces challenges due to annotation scarcity and heterogeneous label granularities across datasets. While semi-supervised learning helps alleviate label scarcity, existing methods typically assume a fixed granularity, failing to fully exploit cross-granularity supervision. This paper presents LUMOS, a semi-supervised universal OCT retinal layer segmentation framework based on a Dual-Decoder Network with a Hierarchical Prompting Strategy (DDN-HPS) and Reliable Progressive Multi-granularity Learning (RPML). DDN-HPS combines a dual-branch architecture with a multi-granularity prompting strategy to effectively suppress pseudo-label noise propagation. Meanwhile, RPML introduces region-level reliability weighting and a progressive training approach that guides the model from easier to more difficult tasks, ensuring the reliable selection of cross-granularity consistency targets, thereby achieving stable cross-granularity alignment. Experiments on six OCT datasets demonstrate that LUMOS largely outperforms existing methods and exhibits exceptional cross-domain and cross-granularity generalization capability.

[106] Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

Yuxin Yang,Yinan Zhou,Yuxin Chen,Ziqi Zhang,Zongyang Ma,Chunfeng Yuan,Bing Li,Jun Gao,Weiming Hu

Main category: cs.CV

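TL;DR: This paper proposes a new task, Object-Anchored Composed Image Retrieval (OACIR), which emphasizes instance-level consistency rather than semantic matching, and builds OACIRR, the first large-scale multi-domain benchmark for it. It also designs the AdaFocal framework, whose context-aware attention modulator increases attention on the anchored instance region and markedly improves instance fidelity.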

Details Motivation: Existing CIR methods focus on semantic matching and cannot reliably retrieve the specific instance a user designates; in practice, preserving a concrete instance often matters more than broad semantic matching. Method: Proposes the fine-grained OACIR retrieval task and the OACIRR benchmark (over 160K quadruples across four domains, with hard-negative distractors), using a bounding box in the reference image as a visual anchor; designs the AdaFocal framework with a Context-Aware Attention Modulator that adaptively strengthens attention in the anchored region and balances the instance against the overall context. Result: AdaFocal substantially outperforms existing CIR models on the OACIR task, particularly in instance-level fidelity, and establishes a strong baseline. Conclusion: The OACIR task and OACIRR benchmark advance instance-aware fine-grained cross-modal retrieval, and AdaFocal offers an effective approach to flexible, high-fidelity instance-level retrieval. Abstract: Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct OACIRR (OACIR on Real-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To address the OACIR task, we propose AdaFocal, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.

[107] LSGS-Loc: Towards Robust 3DGS-Based Visual Localization for Large-Scale UAV Scenarios

Xiang Zhang,Tengfei Wang,Fang Xu,Xin Wang,Zongqian Zhan

Main category: cs.CV

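TL;DR: This paper proposes LSGS-Loc, a visual localization method for large-scale 3D Gaussian Splatting (3DGS) scenes. With scale-aware pose initialization and a Laplacian-based reliability masking mechanism, it improves localization accuracy and robustness for unordered image queries in large-scale UAV scenarios.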

Details Motivation: Existing 3DGS-based visual localization methods suffer from unreliable pose initialization in large-scale UAV scenarios and are sensitive to reconstruction artifacts such as blur and floaters. Method: Proposes LSGS-Loc: 1) a scale-aware pose initialization strategy that combines scene-agnostic relative pose estimation with explicit 3DGS scale constraints; 2) a Laplacian-based reliability mask in the pose refinement stage that suppresses the influence of artifact regions on photometric optimization. Result: On several large-scale UAV benchmarks, LSGS-Loc achieves state-of-the-art accuracy and robustness for unordered image queries and significantly outperforms existing 3DGS localization methods. Conclusion: LSGS-Loc achieves geometrically consistent localization in large-scale 3DGS scenes without scene-specific training, providing a more reliable and scalable visual localization solution for practical autonomous UAV systems. Abstract: Visual localization in large-scale UAV scenarios is a critical capability for autonomous systems, yet it remains challenging due to geometric complexity and environmental variations. While 3D Gaussian Splatting (3DGS) has emerged as a promising scene representation, existing 3DGS-based visual localization methods struggle with robust pose initialization and sensitivity to rendering artifacts in large-scale settings. To address these limitations, we propose LSGS-Loc, a novel visual localization pipeline tailored for large-scale 3DGS scenes. Specifically, we introduce a scale-aware pose initialization strategy that combines scene-agnostic relative pose estimation with explicit 3DGS scale constraints, enabling geometrically grounded localization without scene-specific training. Furthermore, in the pose refinement, to mitigate the impact of reconstruction artifacts such as blur and floaters, we develop a Laplacian-based reliability masking mechanism that guides photometric refinement toward high-quality regions. Extensive experiments on large-scale UAV benchmarks demonstrate that our method achieves state-of-the-art accuracy and robustness for unordered image queries, significantly outperforming existing 3DGS-based approaches. Code is available at: https://github.com/xzhang-z/LSGS-Loc

[108] Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

Hongsheng Li,Lingfeng Zhang,Zexian Yang,Liang Li,Rong Yin,Xiaoshuai Hao,Wenbo Ding

Main category: cs.CV

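TL;DR: This paper proposes a weather-conditioned multi-modal 3D object detection framework. A dynamic routing mechanism adaptively weights and aggregates features from LiDAR, 4D radar, and fusion branches, and weather-supervised learning prevents branch collapse. The method reaches state-of-the-art results on K-Radar and provides an interpretable analysis of modality preferences.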

Details Motivation: Existing LiDAR-4D radar fusion methods rely on fixed or weakly adaptive pipelines and cannot adjust modality preferences dynamically as adverse weather changes, which limits robustness. Method: Formulates multi-modal perception as a weather-conditioned branch routing problem with three parallel feature streams: LiDAR, 4D radar, and condition-gated fusion; a condition token is extracted from visual and semantic prompts, and a lightweight router produces sample-level soft weights for aggregation; weather-supervised auxiliary classification and diversity regularization prevent branch collapse. Result: Achieves state-of-the-art performance on the K-Radar benchmark and provides an explicit, highly interpretable analysis of modality preferences, showing how reliance shifts between LiDAR and 4D radar under different adverse weather. Conclusion: Weather-conditioned dynamic branch routing improves the robustness and interpretability of 3D detection in adverse weather and offers a new approach to multi-modal perception. Abstract: Robust 3D object detection in adverse weather is highly challenging due to the varying reliability of different sensors. While existing LiDAR-4D radar fusion methods improve robustness, they predominantly rely on fixed or weakly adaptive pipelines, failing to dynamically adjust modality preferences as environmental conditions change. To bridge this gap, we reformulate multi-modal perception as a weather-conditioned branch routing problem. Instead of computing a single fused output, our framework explicitly maintains three parallel 3D feature streams: a pure LiDAR branch, a pure 4D radar branch, and a condition-gated fusion branch. Guided by a condition token extracted from visual and semantic prompts, a lightweight router dynamically predicts sample-specific weights to softly aggregate these representations. Furthermore, to prevent branch collapse, we introduce a weather-supervised learning strategy with auxiliary classification and diversity regularization to enforce distinct, condition-dependent routing behaviors. Extensive experiments on the K-Radar benchmark demonstrate that our method achieves state-of-the-art performance. Furthermore, it provides explicit and highly interpretable insights into modality preferences, transparently revealing how adaptive routing robustly shifts reliance between LiDAR and 4D radar across diverse adverse-weather scenarios. The source code will be released.
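
The soft aggregation of the three branches can be sketched as a weighted sum with router-predicted weights. This is an illustration only: the abstract does not say how the weights are normalized, so the softmax used here is an assumption, as are the function names and the toy feature vectors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(branch_features, router_logits):
    """Softly aggregate per-branch feature vectors with per-sample weights.

    branch_features: one feature vector per branch (LiDAR, radar, fusion).
    router_logits: one score per branch, as a router might predict from
    a condition token (the router network itself is not modeled here).
    """
    weights = softmax(router_logits)
    dim = len(branch_features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, branch_features)) for i in range(dim)]
    return weights, fused

lidar = [1.0, 0.0]
radar = [0.0, 1.0]
gated_fusion = [1.0, 1.0]

# Equal logits: every branch contributes one third.
even_weights, even_fused = route([lidar, radar, gated_fusion], [0.0, 0.0, 0.0])
# A hypothetical adverse-weather sample where the router favors radar.
radar_weights, radar_fused = route([lidar, radar, gated_fusion], [-2.0, 2.0, 0.0])
```

Because the weights are explicit per sample, they can be logged per weather condition, which is the kind of modality-preference analysis the abstract describes.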

[109] CRISP: Rank-Guided Iterative Squeezing for Robust Medical Image Segmentation under Domain Shift

Yizhou Fang,Pujin Cheng,Yixiang Liu,Xiaoying Tang,Longxi Zhou

Main category: cs.CV

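TL;DR: This paper proposes the CRISP framework, which exploits an empirical regularity, the rank stability of predicted probabilities in positive regions, to achieve parameter-free, model-agnostic domain adaptation for medical image segmentation without any target-domain information.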

Details Motivation: Distribution shift in medical imaging is a key bottleneck for the clinical translation of medical AI; existing domain adaptation methods are limited to predefined simulated shifts or pseudo-supervision and struggle with the unbounded, unpredictable shifts of the real world. Method: Introduces the empirical law "Rank Stability of Positive Regions" and builds the CRISP framework on it: latent feature perturbation simulates distribution shift, voxels with stably high or low probability are identified, high-precision (HP) and high-recall (HR) priors are constructed, and the two are iteratively refined toward the final segmentation. Result: On multi-center cardiac MRI and CT lung vessel segmentation, CRISP significantly outperforms state-of-the-art methods under multi-center, demographic, and modality shifts, improving HD95 by 7.0%, 13.1%, and 38.9% respectively. Conclusion: CRISP is the first framework to base segmentation decisions on rank rather than probability; it is a robust, general domain adaptation approach that needs no target-domain data and offers a new direction for the generalization of medical AI. Abstract: Distribution shift in medical imaging remains a central bottleneck for the clinical translation of medical AI. Failure to address it can lead to severe performance degradation in unseen environments and exacerbate health inequities. Existing methods for domain adaptation are inherently limited by exhausting predefined possibilities through simulated shifts or pseudo-supervision. Such strategies struggle in the open-ended and unpredictable real world, where distribution shifts are effectively infinite. To address this challenge, we introduce an empirical law called "Rank Stability of Positive Regions", which states that the relative rank of predicted probabilities for positive voxels remains stable under distribution shift. Guided by this principle, we propose CRISP, a parameter-free and model-agnostic framework requiring no target-domain information. CRISP is the first framework to make segmentation based on rank rather than probabilities. CRISP simulates model behavior under distribution shift via latent feature perturbation, where voxel probability rankings exhibit two stable patterns: regions that consistently retain high probabilities (destined positives according to the principle) and those that remain low-probability (can be safely classified as negatives). Based on these patterns, we construct high-precision (HP) and high-recall (HR) priors and recursively refine them under perturbation. We then design an iterative training framework, making HP and HR progressively "squeeze" to the final segmentation. 
Extensive evaluations on multi-center cardiac MRI and CT-based lung vessel segmentation demonstrate CRISP's superior robustness, significantly outperforming state-of-the-art methods with striking HD95 reductions of up to 0.14 (7.0% improvement), 1.90 (13.1% improvement), and 8.39 (38.9% improvement) pixels across multi-center, demographic, and modality shifts, respectively.

[110] Learning to Synergize Semantic and Geometric Priors for Limited-Data Wheat Disease Segmentation

Shijie Wang,Zijian Wang,Yadan Luo,Scott Chapman,Xin Yu,Zi Huang

Main category: cs.CV

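TL;DR: SGPer is a framework that combines semantic and geometric priors: DINOv2 supplies category-aware semantic priors, which are converted into point prompts that guide SAM to localize wheat disease boundaries precisely, giving segmentation that is robust to temporal appearance changes under limited data.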

Details Motivation: Wheat disease segmentation must cope with large intra-class appearance changes across growth stages and with the difficulty of collecting annotated data, so training a model from scratch is impractical. Method: Proposes the SGPer framework: 1) disease-sensitive adapters are inserted into DINOv2 and SAM to align their features with disease characteristics; 2) DINOv2 features are converted into dense, category-specific point prompts; 3) the prompts are dynamically filtered by jointly using SAM's iterative mask confidence and DINOv2 semantic consistency, activating SAM's geometric priors. Result: Achieves state-of-the-art performance on wheat disease and organ segmentation benchmarks, especially in few-sample settings. Conclusion: Jointly modeling semantic priors (DINOv2) and geometric priors (SAM) overcomes the severe temporal appearance changes in agricultural images and improves the robustness and accuracy of disease segmentation with limited data. Abstract: Wheat disease segmentation is fundamental to precision agriculture but faces severe challenges from significant intra-class temporal variations across growth stages. Such substantial appearance shifts make collecting a representative dataset for training from scratch both labor-intensive and impractical. To address this, we propose SGPer, a Semantic-Geometric Prior Synergization framework that treats wheat disease segmentation under limited data as a coupled task of disease-specific semantic perception and disease boundary localization. Our core insight is that pretrained DINOv2 provides robust category-aware semantic priors to handle appearance shifts, which can be converted into coarse spatial prompts to guide SAM for the precise localization of disease boundaries. Specifically, SGPer designs disease-sensitive adapters with multiple disease-friendly filters and inserts them into both DINOv2 and SAM to align their pretrained representations with disease-specific characteristics. To operationalize this synergy, SGPer transforms DINOv2-derived features into dense, category-specific point prompts to ensure comprehensive spatial coverage of all disease regions. To subsequently eliminate prompt redundancy and ensure highly accurate mask generation, it dynamically filters these dense candidates by cross-referencing SAM's iterative mask confidence with the category-specific semantic consistency derived from DINOv2. Ultimately, SGPer distills a highly informative set of prompts to activate SAM's geometric priors, achieving precise and robust segmentation that remains strictly invariant to temporal appearance changes. 
Extensive evaluations demonstrate that SGPer consistently achieves state-of-the-art performance on wheat disease and organ segmentation benchmarks, especially in data-constrained scenarios.

[111] VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

Honghao Fu,Miao Xu,Yiwei Wang,Dailing Zhang,Liu Jun,Yujun Cai

Main category: cs.CV

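TL;DR: VideoStir is a structured, intent-aware retrieval-augmented generation framework for long videos. It preserves video structure through spatio-temporal graph modeling and multi-hop retrieval, and improves retrieval quality with an MLLM-driven intent-relevance scorer.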

Details Motivation: Existing long-video RAG methods have two problems: they flatten a video into independent segments, which destroys its spatio-temporal structure, and they rely on explicit semantic matching, which can miss cues that are only implicitly related to the query intent. Method: Proposes the VideoStir framework: 1) the video is modeled as a clip-level spatio-temporal graph; 2) multi-hop retrieval aggregates evidence from distant but contextually related events; 3) an MLLM-backed intent-relevance scorer retrieves frames according to the reasoning intent of the query; 4) the IR-600K dataset is built to train frame-query intent alignment. Result: Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, supporting the move from flattened semantic matching to structured, intent-aware reasoning. Conclusion: Structured modeling and intent-aware retrieval are key to improving long-video RAG, and VideoStir provides a practical and effective framework for both. Abstract: Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) is a promising remedy by organizing query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query's intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a spatio-temporal graph at clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query's reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Code and checkpoints are available on GitHub.

[112] Cross-Stage Attention Propagation for Efficient Semantic Segmentation

Beoungwoo Kang

Main category: cs.CV

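TL;DR: This paper proposes Cross-Stage Attention Propagation (CSAP), a lightweight decoder framework for semantic segmentation. It computes attention only at the deepest feature scale and propagates it to shallower stages, avoiding repeated query-key computation and cutting computational cost while keeping multi-scale contextual reasoning.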

Details Motivation: In existing lightweight semantic segmentation methods, multi-scale decoders compute attention independently at each scale, and because the resulting attention distributions are strongly correlated, much of that computation is redundant. Method: Proposes the CSAP framework: the attention map is computed once at the deepest feature scale and propagated to shallower stages, which skip query-key computation, yielding an efficient multi-scale decoder. Result: CSAP-Tiny reaches 42.9% mIoU on ADE20K (5.5 GFLOPs), 80.5% on Cityscapes (21.5 GFLOPs), and 40.9% on COCO-Stuff 164K (5.5 GFLOPs), surpassing SegNeXt-Tiny with 16.8% fewer FLOPs. Conclusion: Cross-stage attention propagation reduces the redundancy of multi-scale attention and improves decoder efficiency without sacrificing accuracy, offering a new direction for lightweight semantic segmentation. Abstract: Recent lightweight semantic segmentation methods have made significant progress by combining compact backbones with efficient decoder heads. However, most multi-scale decoders compute attention independently at each feature scale, introducing substantial redundancy since the resulting attention distributions across scales are strongly correlated. We propose Cross-Stage Attention Propagation (CSAP), a decoder framework that computes attention at the deepest feature scale and propagates the resulting attention maps to shallower stages, bypassing query-key computation at those stages entirely. This design preserves multi-scale contextual reasoning while substantially reducing the decoder's computational cost. CSAP-Tiny achieves 42.9% mIoU on ADE20K with only 5.5 GFLOPs, 80.5% on Cityscapes with 21.5 GFLOPs, and 40.9% on COCO-Stuff 164K with 5.5 GFLOPs, surpassing SegNeXt-Tiny by +1.8% on ADE20K while requiring 16.8% fewer floating-point operations.
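
The core saving, computing the attention map once and reusing it, can be shown with a toy sketch. It is not the CSAP architecture: it assumes the shallower stage has already been brought to the same token count as the deepest stage (the abstract does not describe how maps are resized between stages), and all names and values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Scaled dot-product attention weights: one row per query token."""
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

def apply_attention(attn, values):
    """Mix value vectors with a precomputed attention map."""
    dim = len(values[0])
    return [[sum(a * v[i] for a, v in zip(row, values)) for i in range(dim)] for row in attn]

# Deepest stage: queries and keys are used once to build the map.
deep_q = [[1.0, 0.0], [0.0, 1.0]]
deep_k = [[1.0, 0.0], [0.0, 1.0]]
shared_attn = attention_map(deep_q, deep_k)

# Shallower stage: only values are needed; no query-key product is computed.
shallow_v = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
shallow_out = apply_attention(shared_attn, shallow_v)
```

In this toy setting each reused stage skips the query-key product and the softmax, which is the redundancy the abstract argues is safe to remove because attention distributions across scales are strongly correlated.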

[113] Few-Shot Semantic Segmentation Meets SAM3

Yi-Jen Tsai,Yen-Yu Lin,Chien-Yao Wang

Main category: cs.CV

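TL;DR: This paper proposes a training-free few-shot semantic segmentation (FSS) method: a frozen SAM3 model segments support and query images that are spatially concatenated, using its Promptable Concept Segmentation capability, and reaches state-of-the-art results on PASCAL-5i and COCO-20i. It also finds that negative prompts are harmful in few-shot settings, where they tend to weaken representations and cause prediction collapse.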

Details Motivation: Existing few-shot semantic segmentation methods depend on episodic training that is computationally expensive and sensitive to distribution shift; this work explores what modern vision foundation models, SAM3 in particular, can do without training. Method: Support and query images are spatially concatenated into a shared canvas, and the Promptable Concept Segmentation (PCS) capability of a frozen SAM3 is applied directly, with no fine-tuning or architectural change; the effect of negative prompts is also analyzed. Result: Achieves state-of-the-art performance on PASCAL-5i and COCO-20i, outperforming many heavily engineered methods, and shows that negative prompts weaken target representations and trigger prediction collapse in few-shot settings. Conclusion: A simple spatial construction is enough to elicit strong cross-image reasoning; current foundation models handle conflicting prompt signals poorly, so the role of prompting in few-shot segmentation needs to be reconsidered. Abstract: Few-Shot Semantic Segmentation (FSS) focuses on segmenting novel object categories from only a handful of annotated examples. Most existing approaches rely on extensive episodic training to learn transferable representations, which is both computationally demanding and sensitive to distribution shifts. In this work, we revisit FSS from the perspective of modern vision foundation models and explore the potential of Segment Anything Model 3 (SAM3) as a training-free solution. By repurposing its Promptable Concept Segmentation (PCS) capability, we adopt a simple spatial concatenation strategy that places support and query images into a shared canvas, allowing a fully frozen SAM3 to perform segmentation without any fine-tuning or architectural changes. Experiments on PASCAL-$5^i$ and COCO-$20^i$ show that this minimal design already achieves state-of-the-art performance, outperforming many heavily engineered methods. Beyond empirical gains, we uncover that negative prompts can be counterproductive in few-shot settings, where they often weaken target representations and lead to prediction collapse despite their intended role in suppressing distractors. These findings suggest that strong cross-image reasoning can emerge from simple spatial formulations, while also highlighting limitations in how current foundation models handle conflicting prompt signals. Code at: https://github.com/WongKinYiu/FSS-SAM3
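
The spatial concatenation step is simple enough to sketch without any model. The snippet below is an assumption-laden illustration: it places the support image to the left of the query image with zero padding, and it does not call SAM3 at all; the layout, padding value, and function names are hypothetical, and the released code may arrange the canvas differently.

```python
def build_canvas(support, query, pad_value=0):
    """Place support and query images side by side on one canvas.

    Images are lists of rows. The shorter image is padded at the bottom.
    Returns the canvas and the column where the query region starts.
    """
    height = max(len(support), len(query))

    def pad_rows(img):
        width = len(img[0])
        padding = [[pad_value] * width for _ in range(height - len(img))]
        return [list(row) for row in img] + padding

    s_rows, q_rows = pad_rows(support), pad_rows(query)
    canvas = [s + q for s, q in zip(s_rows, q_rows)]
    return canvas, len(support[0])

def crop_query(mask, query_offset, query_height, query_width):
    """Cut the query region back out of a canvas-sized mask."""
    return [row[query_offset:query_offset + query_width] for row in mask[:query_height]]

support_img = [[1, 1], [1, 1]]              # 2x2 support image
query_img = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]  # 3x3 query image
canvas, offset = build_canvas(support_img, query_img)

# A segmentation model would run on `canvas`; here the canvas itself
# stands in for the predicted mask to show the crop step.
query_region = crop_query(canvas, offset, len(query_img), len(query_img[0]))
```

Keeping the offset returned by `build_canvas` is what lets a canvas-sized prediction be mapped back to the query image for evaluation.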

[114] Human Interaction-Aware 3D Reconstruction from a Single Image

Gwanghyun Kim,Junghun James Kim,Suh Yoon Jeon,Jason Park,Se Young Chun

Main category: cs.CV

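TL;DR: This paper proposes the HUG3D framework, which models both group-level and instance-level information and introduces physical interaction priors to reconstruct high-quality, physically plausible 3D humans in multi-person interaction scenes from a single image.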

Details Motivation: Existing single-image 3D human reconstruction methods mainly target one person and handle multi-person scenes poorly, producing overlap artifacts, missing geometry in occluded regions, and distorted interactions; group context and interaction priors are needed. Method: Proposes the HUG3D framework: 1) the input is mapped to a canonical orthographic space to reduce perspective distortion; 2) the HUG-MVD module jointly models the group and individuals and generates complete multi-view normals and images to resolve occlusion and proximity; 3) the HUG-GR module optimizes geometry with explicit physics-based interaction priors to ensure physical plausibility and accurate contact; 4) the multi-view images are fused into a high-fidelity texture. Result: Significantly outperforms single-human and existing multi-human reconstruction methods on several datasets, producing physically plausible, high-fidelity 3D humans that capture inter-person interaction accurately. Conclusion: HUG3D delivers physics-driven, high-quality 3D reconstruction of multi-person interaction scenes from a single image and shows the value of jointly modeling group context and physical interaction priors. Abstract: Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry by leveraging explicit, physics-based interaction priors to enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image. Project page: https://jongheean11.github.io/HUG3D_project

[115] Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning

Kang Ding,Hongsong Wang,Jie Gui,Lei He

Main category: cs.CV

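TL;DR: This paper proposes GameAD, a framework that models end-to-end autonomous driving as a risk-aware multi-agent game; with risk-prioritized interaction modeling and a new Planning Risk Exposure metric, it significantly improves trajectory safety.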

Details Motivation: Existing end-to-end models treat all traffic participants equally and struggle to separate real collision threats from complex background clutter, resulting in insufficient safety. Method: Proposes the Risk-Prioritized Game Planning paradigm and builds the GameAD framework from four modules: Risk-Aware Topology Anchoring, Strategic Payload Adapter, Minimax Risk-Aware Sparse Attention, and Risk Consistent Equilibrium Stabilization; introduces Planning Risk Exposure to quantify long-horizon trajectory risk. Result: Significantly outperforms SOTA methods on the nuScenes and Bench2Drive datasets, particularly on trajectory safety metrics. Conclusion: Modeling autonomous driving as a risk-prioritized game effectively improves decision safety and robustness; the dynamic multi-agent game within a unified representation space is a key path for end-to-end driving. Abstract: The key to end-to-end autonomous driving lies not in the integration of perception and planning, but rather in the dynamic multi-agent game within a unified representation space. Most existing end-to-end models treat all agents equally, hindering the decoupling of real collision threats from complex backgrounds. To address this issue, we introduce the concept of Risk-Prioritized Game Planning, and propose GameAD, a novel framework that models end-to-end autonomous driving as a risk-aware game problem. GameAD integrates Risk-Aware Topology Anchoring, Strategic Payload Adapter, Minimax Risk-Aware Sparse Attention, and Risk Consistent Equilibrium Stabilization to enable game-theoretic decision making with risk-prioritized interactions. We also present the Planning Risk Exposure metric, which quantifies the cumulative risk intensity of planned trajectories over a long horizon for safe autonomous driving. Extensive experiments on the nuScenes and Bench2Drive datasets show that our approach significantly outperforms state-of-the-art methods, especially in terms of trajectory safety.

[116] A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator

Kidus Zewde,Yuchen Zhou,Dennis Ng,Neo Tiangratanakul,Tommy Duong,Ankit Raj,Yuxin Zhang,Xingyu Shen,Simiao Ren

Main category: cs.CV

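TL;DR: This paper presents infrastructure for synthetic data generation that produces automatically labeled eye movement video at scale, addressing the scarcity, annotation cost, and privacy sensitivity of real eye movement data; by extracting iris trajectories from real videos and replaying them on a 3D eye movement simulator, it builds the 144-session synthetic dataset final_dataset_v1 and validates its temporal fidelity and applicability.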

Details Motivation: Real video-based behavioral data (such as eye movements) is scarce, costly to annotate, and privacy-sensitive, and self-supervised pretraining is hard to apply to behavioral modalities, so an alternative is needed. Method: Proposes a synthetic eye movement video generation pipeline: extract real human iris trajectories from reference videos and replay them on a 3D eye movement simulator via headless browser automation, producing synthetic eye movement video with exact labels. Result: Builds the final_dataset_v1 dataset (144 sessions, 12 hours, 25 fps); evaluation shows a KS distance below 0.14 on every metric for the generated trajectories, with temporal dynamics well preserved; finds that the 3D simulator has bounded sensitivity to small reading-scale movements, attributable to the absence of coupled head movement. Conclusion: This synthetic data generation paradigm can support downstream classifier development at the intersection of behavioral modeling and vision-language systems; the released pipeline, dataset, and evaluation tools will facilitate related research. Abstract: Large vision-language models have achieved remarkable capabilities by training on massive internet-scale data, yet a fundamental asymmetry persists: while LLMs can leverage self-supervised pretraining on abundant text and image data, the same is not true for many behavioral modalities. Video-based behavioral data -- gestures, eye movements, social signals -- remains scarce, expensive to annotate, and privacy-sensitive. A promising alternative is simulation: replace real data collection with controlled synthetic generation to produce automatically labeled data at scale. We introduce infrastructure for this paradigm applied to eye movement, a behavioral signal with applications across vision-language modeling, virtual reality, robotics, accessibility systems, and cognitive science. We present a pipeline for generating synthetic labeled eye movement video by extracting real human iris trajectories from reference videos and replaying them on a 3D eye movement simulator via headless browser automation. Applying this to the task of script-reading detection during video interviews, we release final_dataset_v1: 144 sessions (72 reading, 72 conversation) totaling 12 hours of synthetic eye movement video at 25 fps. Evaluation shows that generated trajectories preserve the temporal dynamics of the source data (KS D < 0.14 across all metrics). A matched frame-by-frame comparison reveals that the 3D simulator exhibits bounded sensitivity at reading-scale movements, attributable to the absence of coupled head movement -- a finding that informs future simulator design.
The pipeline, dataset, and evaluation tools are released to support downstream behavioral classifier development at the intersection of behavioral modeling and vision-language systems.
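
The "KS D" figure quoted in the abstract above is the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical cumulative distribution functions. A minimal sketch of how such a value could be computed is given below; the function name `ks_statistic` and the toy inputs are illustrative assumptions, not part of the released evaluation tools.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic D = max over x of |F_a(x) - F_b(x)|.

    F_a and F_b are the empirical CDFs of the two samples. Both are step
    functions that only change at sample values, so it is enough to
    evaluate the gap at every observed value.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / len(a_sorted)
        f_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(f_a - f_b))
    return d


# Hypothetical example: a metric measured on source and replayed trajectories.
source_metric = [1, 2, 3, 4]
replayed_metric = [3, 4, 5, 6]
print(ks_statistic(source_metric, replayed_metric))
```

A value of 0 means the two samples have identical empirical distributions and 1 means they do not overlap at all, so a bound such as D < 0.14 indicates closely matching distributions.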

[117] Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis

Pu Wang,Zhixuan Mao,Jialu Li,Zhuoran Zheng,Dianjie Lu,Youshan Zhang

Main category: cs.CV

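TL;DR: This paper proposes a new paradigm that combines vision-language-model-guided Flow Matching segmentation with Random Matrix Theory (RMT) spectral detection for automatic diagnosis of canine pneumothorax, offering both high accuracy and interpretability.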

Details Motivation: Automatic diagnosis of canine pneumothorax is challenged by data scarcity and insufficient model trustworthiness. Method: Builds the first public pixel-level annotated dataset; proposes a diagnostic paradigm that couples signal localization (VLM-guided iterative Flow Matching for high-precision segmentation) with spectral detection (RMT analysis of lesion-region features, modeling healthy tissue as random noise and identifying pathological signal through outlier eigenvalues). Result: Achieves accurate segmentation with high boundary accuracy, and RMT improves detection sensitivity to pneumothorax lesions; the system is both accurate and interpretable. Conclusion: Combining generative segmentation with first-principles statistical analysis can yield highly trustworthy, interpretable medical diagnostic systems. Abstract: Automatic diagnosis of canine pneumothorax is challenged by data scarcity and the need for trustworthy models. To address this, we first introduce a public, pixel-level annotated dataset to facilitate research. We then propose a novel diagnostic paradigm that reframes the task as a synergistic process of signal localization and spectral detection. For localization, our method employs a Vision-Language Model (VLM) to guide an iterative Flow Matching process, which progressively refines segmentation masks to achieve superior boundary accuracy. For detection, the segmented mask is used to isolate features from the suspected lesion. We then apply Random Matrix Theory (RMT), a departure from traditional classifiers, to analyze these features. This approach models healthy tissue as predictable random noise and identifies pneumothorax by detecting statistically significant outlier eigenvalues that represent a non-random pathological signal. The high-fidelity localization from Flow Matching is crucial for purifying the signal, thus maximizing the sensitivity of our RMT detector. This synergy of generative segmentation and first-principles statistical analysis yields a highly accurate and interpretable diagnostic system (source code is available at: https://github.com/Pu-Wang-alt/Canine-pneumothorax).

[118] A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures

Wenbo Zhang,Zekun Long,Zican Liu,Yangchen Zeng,Keyi Hu

Main category: cs.CV

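TL;DR: This paper proposes WSA-Net, a lightweight network for weak-signal detection in ground penetrating radar that improves recognition of faint defect signatures through four mechanisms (signal preservation, clutter suppression, geometric reconstruction, and context anchoring), achieving both high accuracy (mAP@0.5 = 0.6958) and high efficiency (164 FPS) on the RTST dataset.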

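Details Motivation: Subsurface defect detection with ground penetrating radar (GPR) faces the challenge of weak signals (low signal-to-clutter ratio, high wavefield similarity, geometric degradation); existing lightweight detectors lack sensitivity because they neglect low-frequency structure preservation and heterogeneous clutter decoupling. Method: Proposes the WSA-Net framework with four core mechanisms: signal preservation based on partial convolutions, clutter suppression based on heterogeneous grouping attention, geometric reconstruction that sharpens hyperbolic arcs, and context anchoring that resolves semantic ambiguities. Result: Reaches 0.6958 mAP@0.5 and 164 FPS on the RTST dataset with only 2.412M parameters; significantly reduces missed detections in infrastructure inspection. Conclusion: A signal-centric design philosophy can effectively improve the perception of faint defects in lightweight architectures, offering a new paradigm for real-time, high-accuracy subsurface non-destructive testing. Abstract: Subsurface defect detection via Ground Penetrating Radar is challenged by "weak signals": faint diffraction hyperbolas with low signal-to-clutter ratios, high wavefield similarity, and geometric degradation. Existing lightweight detectors prioritize efficiency over sensitivity, failing to preserve low-frequency structures or decouple heterogeneous clutter. We propose WSA-Net, a framework designed to enhance faint signatures through physical-feature reconstruction. Moving beyond simple parameter reduction, WSA-Net integrates four mechanisms: signal preservation using partial convolutions; clutter suppression via heterogeneous grouping attention; geometric reconstruction to sharpen hyperbolic arcs; and context anchoring to resolve semantic ambiguities. Evaluations on the RTST dataset show WSA-Net achieves 0.6958 mAP@0.5 and 164 FPS with only 2.412 M parameters. Results show that signal-centric awareness in lightweight architectures effectively reduces false negatives in infrastructure inspection.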

[119] CLIP-Guided Data Augmentation for Night-Time Image Dehazing

Xining Ge,Weijun Yuan,Gengjia Chang,Xuyang Li,Shuhong Liu

Main category: cs.CV

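TL;DR: This paper proposes a unified framework for the NTIRE 2026 Night Time Image Dehazing Challenge that improves dehazing performance and stability under limited supervision through domain-aligned data construction, stage-wise training, and inference-time enhancement.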

Details Motivation: Nighttime image dehazing involves more complex degradation (haze scattering coupled with low illumination, non-uniform lighting, and strong light interference), and limited supervision readily causes domain drift and training instability. Method: Uses a CLIP visual encoder to screen external samples and build domain-aligned training data; trains NAFNet in two stages (first adapting to the target domain, then generalizing to broader degradation patterns); at inference, combines TLC, x8 self-ensemble, and weighted snapshot fusion. Result: Achieves strong performance in the NTIRE 2026 Night Time Image Dehazing Challenge, validating the framework's effectiveness and robustness under limited supervision. Conclusion: Rather than relying on complex network redesign, the method achieves high-performance nighttime dehazing with a practical, lightweight, reproducible pipeline, offering a new approach for low-supervision image restoration. Abstract: Nighttime image dehazing faces a more complex degradation pattern than its daytime counterpart, as haze scattering couples with low illumination, non-uniform lighting, and strong light interference. Under limited supervision, this complexity aggravates domain drift and training instability, since target-domain samples are scarce while naively introducing external data may weaken adaptation due to distribution mismatch. This paper presents our solution to the NTIRE 2026 Night Time Image Dehazing Challenge, built as a unified framework that integrates domain-aligned data construction, stage-wise training, and inference-time enhancement. Specifically, a pre-trained CLIP visual encoder screens candidate external samples by similarity to construct training data closer to the target domain. NAFNet is then trained in two stages, first adapting to the target domain and then expanding to broader degradation patterns. At inference time, TLC, x8 self-ensemble, and weighted snapshot fusion are combined to improve output stability. Rather than relying on complex network redesign, the proposed framework offers a practical and effective pipeline for nighttime image dehazing.
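
The x8 self-ensemble mentioned in the abstract above is a common test-time trick: run the model on the eight flip and rotation variants of the input, undo each transform on the output, and average. The sketch below illustrates the idea on plain nested lists, assuming a single-channel image and a `model` callable that maps an image to an image of the same shape; the helper names are hypothetical and the challenge entry's actual implementation may differ.

```python
def rot90(img):
    """Rotate a 2D list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def rot(img, k):
    """Apply rot90 k times (k is taken modulo 4)."""
    for _ in range(k % 4):
        img = rot90(img)
    return img


def hflip(img):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in img]


def self_ensemble_x8(model, img):
    """Average model outputs over 4 rotations x optional horizontal flip."""
    height, width = len(img), len(img[0])
    acc = [[0.0] * width for _ in range(height)]
    for flip in (False, True):
        for k in range(4):
            x = rot(hflip(img) if flip else img, k)  # forward transform
            y = model(x)
            y = rot(y, 4 - k)                        # undo the rotation
            y = hflip(y) if flip else y              # undo the flip
            for i in range(height):
                for j in range(width):
                    acc[i][j] += y[i][j] / 8.0
    return acc


# With an identity "model" the ensemble must return the input unchanged.
print(self_ensemble_x8(lambda x: x, [[1, 2, 3], [4, 5, 6]]))
```

For a model that is not perfectly equivariant to flips and rotations, the eight aligned outputs differ slightly, and averaging them tends to stabilize the result at eight times the inference cost.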

[120] Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

Yanming Xiu,Zhengayuan Jiang,Neil Zhenqiang Gong,Maria Gorlatova

Main category: cs.CV

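TL;DR: This paper proposes ContrAR, a benchmark for evaluating the robustness of vision-language models (VLMs) against contradictory virtual content attacks in augmented reality (AR); it contains 312 real-world AR videos and is used to evaluate 11 VLMs, showing that existing models still have room to improve in detection, reasoning, and the accuracy-latency trade-off.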

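Details Motivation: As AR becomes part of daily life, its security and reliability face serious challenges, especially contradictory virtual content attacks, in which malicious or inconsistent virtual elements mislead users, cause semantic confusion, or spread harmful information; systematic modeling and evaluation are needed. Method: Systematically models contradictory virtual content attacks and builds the ContrAR benchmark of 312 real-world AR videos validated by 10 participants; evaluates 11 commercial and open-source VLMs under a unified protocol on virtual content manipulation and semantic contradiction recognition. Result: Experiments show that current VLMs have some ability to understand contradictory content but remain inadequate at accurately detecting and deeply reasoning about adversarial AR content; detection accuracy and response latency are also hard to balance. Conclusion: ContrAR provides the first dedicated benchmark for VLM security research in AR settings, exposes key shortcomings of existing models in robustness, interpretability, and real-time performance, and advances AR-security-oriented VLM evaluation and improvement. Abstract: Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, room still remains for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.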

[121] Geometrical Cross-Attention and Nonvoid Voxelization for Efficient 3D Medical Image Segmentation

Chenxin Yuan,Shoupeng Chen,Haojiang Ye,Yiming Miao,Limei Peng,Pin-Han Ho

Main category: cs.CV

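TL;DR: This paper proposes GCNV-Net, a new 3D medical image segmentation framework that combines a Tri-directional Dynamic Nonvoid Voxel Transformer (3DNVT), Geometrical Cross-Attention (GCA), and Nonvoid Voxelization, substantially improving computational efficiency while maintaining high accuracy.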

Details Motivation: Existing 3D medical image segmentation methods struggle to achieve both high accuracy and computational efficiency, especially across multiple anatomical structures and imaging modalities. Method: Proposes the GCNV-Net framework, comprising: 1) the Tri-directional Dynamic Nonvoid Voxel Transformer (3DNVT), which dynamically partitions relevant voxels along the transverse, sagittal, and coronal planes; 2) Geometrical Cross-Attention (GCA), which explicitly incorporates geometric positional information during multi-scale feature fusion; 3) Nonvoid Voxelization, which processes only informative regions to reduce redundant computation. Result: Reaches SOTA on five mainstream benchmarks including BraTS2021, ACDC, and MSD Prostate: Dice improves by 0.65%, IoU by 0.63%, NSD by 1%, and HD95 by a relative 14.5%; FLOPs drop by 56.13% and inference latency by 68.49%. Conclusion: GCNV-Net strikes a strong balance between accuracy and efficiency and is robust across organs, diseases, and imaging modalities, showing good potential for clinical use. Abstract: Accurate segmentation of 3D medical scans is crucial for clinical diagnostics and treatment planning, yet existing methods often fail to achieve both high accuracy and computational efficiency across diverse anatomies and imaging modalities. To address these challenges, we propose GCNV-Net, a novel 3D medical segmentation framework that integrates a Tri-directional Dynamic Nonvoid Voxel Transformer (3DNVT), a Geometrical Cross-Attention module (GCA), and Nonvoid Voxelization. The 3DNVT dynamically partitions relevant voxels along the three orthogonal anatomical planes, namely the transverse, sagittal, and coronal planes, enabling effective modeling of complex 3D spatial dependencies. The GCA mechanism explicitly incorporates geometric positional information during multi-scale feature fusion, significantly enhancing fine-grained anatomical segmentation accuracy. Meanwhile, Nonvoid Voxelization processes only informative regions, greatly reducing redundant computation without compromising segmentation quality, and achieves a 56.13% reduction in FLOPs and a 68.49% reduction in inference latency compared to conventional voxelization. We evaluate GCNV-Net on multiple widely used benchmarks: BraTS2021, ACDC, MSD Prostate, MSD Pancreas, and AMOS2022. Our method achieves state-of-the-art segmentation performance across all datasets, outperforming the best existing methods by 0.65% on Dice, 0.63% on IoU, 1% on NSD, and by a relative 14.5% on HD95.
All results demonstrate that GCNV-Net effectively balances accuracy and efficiency, and its robustness across diverse organs, disease conditions, and imaging modalities highlights strong potential for clinical deployment.
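
The abstract above describes Nonvoid Voxelization only as processing informative regions. One simple way to picture the idea is to split the volume into blocks and keep only those that contain at least one voxel above a background threshold. The sketch below does exactly that; the block size, the threshold rule, and the function name `nonvoid_blocks` are assumptions for illustration and are not taken from the paper.

```python
def nonvoid_blocks(volume, block, threshold=0.0):
    """Return the (z, y, x) origins of blocks that hold any voxel > threshold.

    `volume` is a nested list indexed as volume[z][y][x]. Blocks at the
    border are clipped to the volume size.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    kept = []
    for z0 in range(0, depth, block):
        for y0 in range(0, height, block):
            for x0 in range(0, width, block):
                occupied = any(
                    volume[z][y][x] > threshold
                    for z in range(z0, min(z0 + block, depth))
                    for y in range(y0, min(y0 + block, height))
                    for x in range(x0, min(x0 + block, width))
                )
                if occupied:
                    kept.append((z0, y0, x0))
    return kept


# Toy 4x4x4 scan that is empty except for a single foreground voxel.
scan = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
scan[3][0][1] = 1.0
print(nonvoid_blocks(scan, block=2))  # 1 of the 8 blocks survives
```

Only the surviving blocks would be passed to the expensive attention layers, which is where savings in FLOPs and latency of the kind reported above would come from.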

[122] Cross-Resolution Diffusion Models via Network Pruning

Jiaxuan Ren,Junhan Zhu,Huan Wang

Main category: cs.CV

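TL;DR: This paper proposes CR-Diff, which uses block-wise pruning and pruned output amplification to improve the generation consistency and quality of diffusion models across resolutions while preserving performance at the original resolution.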

Details Motivation: Existing UNet-based diffusion models degrade when generating images outside the training resolution, because parameter behavior becomes mismatched as resolution changes, weakening semantic alignment and causing structural instability. Method: CR-Diff has two stages: 1) block-wise pruning, which selectively removes adverse weights; 2) pruned output amplification, which further purifies the predictions; it also supports prompt-specific, on-demand quality enhancement. Result: Experiments show that CR-Diff improves perceptual fidelity and semantic coherence across multiple diffusion backbones and unseen resolutions while largely preserving default-resolution performance. Conclusion: CR-Diff is a lightweight, general, plug-and-play improvement that effectively mitigates the cross-resolution generalization problem of diffusion models. Abstract: Diffusion models have demonstrated impressive image synthesis performance, yet many UNet-based models are trained at certain fixed resolutions. Their quality tends to degrade when generating images at out-of-training resolutions. We trace this issue to resolution-dependent parameter behaviors, where weights that function well at the default resolution can become adverse when spatial scales shift, weakening semantic alignment and causing structural instability in the UNet architecture. Based on this analysis, this paper introduces CR-Diff, a novel method that improves the cross-resolution visual consistency by pruning some parameters of the diffusion model. Specifically, CR-Diff has two stages. It first performs block-wise pruning to selectively eliminate adverse weights. Then, a pruned output amplification is conducted to further purify the pruned predictions. Empirically, extensive experiments suggest that CR-Diff can improve perceptual fidelity and semantic coherence across various diffusion backbones and unseen resolutions, while largely preserving the performance at default resolutions. Additionally, CR-Diff supports prompt-specific refinement, enabling quality enhancement on demand.

[123] Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

Xuanguang Liu,Lei Ding,Yujie Li,Chenguang Dai,Zhenchao Zhang,Mengmeng Li,Ziyi Yang,Yifan Sun,Yongqi Sun,Hanyun Wang

Main category: cs.CV

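TL;DR: This paper proposes STSF-Net, a new framework for multimodal change detection between optical and SAR images that jointly models modality-specific and spatio-temporal common features and introduces a semantic-guided adaptive fusion strategy, outperforming SOTA on multiple datasets.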

Details Motivation: Existing MMCD methods fall short in cross-modal interaction and in exploiting modality-specific characteristics, which makes fine-grained change modeling difficult and hinders precise detection of semantic changes. Method: Proposes the STSF-Net framework, which jointly models modality-specific features (capturing genuine semantic changes) and spatio-temporal common features (suppressing pseudo-changes caused by differences in imaging mechanisms); designs an adaptive optical-SAR feature fusion strategy based on semantic priors from pre-trained foundation models; builds Delta-SN6, the first openly accessible multiclass MMCD benchmark dataset (VHR fully polarimetric SAR and optical images). Result: Exceeds SOTA in mIoU by 3.21%, 1.08%, and 1.32% on the Delta-SN6, BRIGHT, and Wuhan-Het datasets, respectively; the code and the Delta-SN6 dataset will be released. Conclusion: STSF-Net improves the accuracy and robustness of multimodal remote sensing change detection, with particular strengths in recognizing semantic changes and suppressing pseudo-changes, and the Delta-SN6 dataset provides an important benchmark for the field. Abstract: Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing (RS) data, demonstrating significant application value in land use monitoring, disaster assessment, and urban sustainable development. However, existing MMCD approaches exhibit limitations in cross-modal interaction and exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes in multimodal data. To address the above problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts feature importance based on semantic priors obtained from pre-trained foundation models, enabling semantic-guided adaptive fusion of multi-modal information. In addition, we introduce the Delta-SN6 dataset, the first openly accessible multiclass MMCD benchmark consisting of very-high-resolution (VHR) fully polarimetric SAR and optical images. Experimental results on Delta-SN6, BRIGHT, and Wuhan-Het datasets demonstrate that our method outperforms the state-of-the-art (SOTA) by 3.21%, 1.08%, and 1.32% in mIoU, respectively.
The associated code and Delta-SN6 dataset will be released at: https://github.com/liuxuanguang/STSF-Net.

[124] EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes", "Hands" and "Minds"

Qin Wang,Zhiqing He,Yu Liu,Bowen Guo,Zeju Li,Miao Zhao,Wenhao Ju,Zhiling Luo,Xianhong Shu,Yi Guo,Yuanyuan Wang

Main category: cs.CV

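TL;DR: This paper proposes EchoAgent, an agentic system for end-to-end echocardiography (Echo) interpretation that integrates visual perception (eyes), manual measurement (hands), and expert knowledge reasoning (minds) into a coordinated workflow resembling that of a cardiac sonographer, reaching up to 80.00% accuracy on structure analysis on the CAMUS and MIMIC-EchoQA datasets.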

Details Motivation: Existing deep-learning methods and multimodal large models cover only part of the capabilities required for Echo analysis (eyes-hands or eyes-minds) and lack the full-chain coordination and reliability that clinical use demands. Method: Proposes the EchoAgent system: 1) an expertise-driven cognition engine that builds a customized "mind" (knowledge base); 2) a hierarchical collaboration toolkit that provides "eyes-hands" capabilities (video parsing, view identification, anatomical segmentation, and quantitative measurement); 3) an orchestrated reasoning hub that fuses multimodal perceptual evidence with the knowledge base to support explainable inference. Result: On the CAMUS and MIMIC-EchoQA datasets (covering 48 views and 14 anatomical regions), EchoAgent achieves the best performance across diverse structure analysis tasks, with overall accuracy of up to 80.00%. Conclusion: EchoAgent is the first to realize a fully coordinated eyes-hands-minds workflow for Echo interpretation, with the ability to learn, observe, operate, and reason; it markedly improves clinical reliability and utility and offers a new paradigm for AI-assisted echocardiographic diagnosis. Abstract: Reliable interpretation of echocardiography (Echo) is crucial for assessing cardiac function, which requires clinicians to synchronously orchestrate multiple capabilities, including visual observation (eyes), manual measurement (hands), and expert knowledge learning and reasoning (minds). While current task-specific deep-learning approaches and multimodal large language models have demonstrated promise in assisting Echo analysis through automated segmentation or reasoning, they remain focused on restricted skills, i.e., eyes-hands or eyes-minds, thereby limiting clinical reliability and utility. To address these issues, we propose EchoAgent, an agentic system tailored for end-to-end Echo interpretation, which achieves a fully coordinated eyes-hands-minds workflow that learns, observes, operates, and reasons like a cardiac sonographer. First, we introduce an expertise-driven cognition engine where our agent can automatically assimilate credible Echo guidelines into a structured knowledge base, thus constructing an Echo-customized mind. Second, we devise a hierarchical collaboration toolkit to endow EchoAgent with eyes-hands, which can automatically parse Echo video streams, identify cardiac views, and perform anatomical segmentation and quantitative measurement. Third, we integrate the perceived multimodal evidence with the exclusive knowledge base into an orchestrated reasoning hub to conduct explainable inferences.
We evaluate EchoAgent on CAMUS and MIMIC-EchoQA datasets, which cover 48 distinct echocardiographic views spanning 14 cardiac anatomical regions. Experimental results show that EchoAgent achieves optimal performance across diverse structure analyses, yielding overall accuracy of up to 80.00%. Importantly, EchoAgent empowers a single system with abilities to learn, observe, operate and reason like an echocardiologist, which holds great promise for reliable Echo interpretation.

[125] Evaluation Before Generation: A Paradigm for Robust Multimodal Sentiment Analysis with Missing Modalities

Rongfei Chen,Tingting Zhang,Xiaoyu Shen,Wei Zhang

Main category: cs.CV

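TL;DR: This paper proposes a prompt-learning-based missing modality adaptation framework (ProMMA) that improves the robustness and performance of multimodal sentiment analysis under missing modalities by dynamically assessing the importance of missing modalities, disentangling modality-invariant prompts, adaptive weighting, and multi-level dynamic connections.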

Details Motivation: Existing methods for the missing modality problem lack rigorous evaluation of whether generating missing modalities is necessary, and do not adequately model the structural dependencies among multimodal prompts or their global coherence. Method: Proposes the Prompt-based Missing Modality Adaptation (ProMMA) framework, comprising: 1) a Missing Modality Evaluator (dynamically assesses importance); 2) a Modality-invariant Prompt Disentanglement module (decomposes shared prompts into modality-private prompts); 3) a Dynamic Prompt Weighting module (suppresses interference from missing modalities based on mutual information); 4) a Multi-level Prompt Dynamic Connection module (integrates global prompt priors to strengthen consistency). Result: Achieves SOTA performance on the CMU-MOSI, CMU-MOSEI, and CH-SIMS benchmarks, with stable results across diverse missing-modality settings. Conclusion: ProMMA effectively mitigates the performance drop caused by missing modalities; fine-grained prompt-level modeling improves robustness, representation quality, and global consistency. Abstract: The missing modality problem poses a fundamental challenge in multimodal sentiment analysis, significantly degrading model accuracy and generalization in real-world scenarios. Existing approaches primarily improve robustness through prompt learning and pre-trained models. However, two limitations remain. First, the necessity of generating missing modalities lacks rigorous evaluation. Second, the structural dependencies among multimodal prompts and their global coherence are insufficiently explored. To address these issues, a Prompt-based Missing Modality Adaptation framework is proposed. A Missing Modality Evaluator is introduced at the input stage to dynamically assess the importance of missing modalities using pretrained models and pseudo labels, thereby avoiding low-quality data imputation. Building on this, a Modality-invariant Prompt Disentanglement module decomposes shared prompts into modality-specific private prompts to capture intrinsic local correlations and improve representation quality. In addition, a Dynamic Prompt Weighting module computes mutual-information-based weights from cross-attention outputs to adaptively suppress interference from missing modalities. To enhance global consistency, a Multi-level Prompt Dynamic Connection module integrates shared prompts with self-attention outputs through residual connections, leveraging global prompt priors to strengthen key guidance features.
Extensive experiments on three public benchmarks, including CMU-MOSI, CMU-MOSEI, and CH-SIMS, demonstrate that the proposed framework achieves state-of-the-art performance and stable results under diverse missing-modality settings. The implementation is available at https://github.com/rongfei-chen/ProMMA

[126] Physics-Aligned Spectral Mamba: Decoupling Semantics and Dynamics for Few-Shot Hyperspectral Target Detection

Luqi Gong,Qixin Xie,Yue Chen,Ziqiang Chen,Fanda Fan,Shuai Zhao,Chao Li

Main category: cs.CV

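TL;DR: This paper proposes SpecMamba, a parameter-efficient, frequency-aware meta-learning framework for few-shot hyperspectral target detection that improves spectral adaptation and cross-domain generalization through the DCTMA adapter, the PGTE prior-guided encoder, and the SSPLM self-supervised mapping strategy.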

Details Motivation: Existing meta-learning methods struggle to adapt deep backbones for hyperspectral target detection: full-parameter fine-tuning is inefficient and prone to overfitting, and ignoring the frequency-domain structure and band continuity of hyperspectral data limits spectral adaptation and cross-domain generalization. Method: Proposes the SpecMamba framework with three core modules: 1) DCTMA (Discrete Cosine Transform Mamba Adapter), which performs frequency-domain projection and state-space modeling on frozen Transformer features; 2) PGTE (Prior-Guided Tri-Encoder), which uses laboratory spectral priors to guide adapter optimization; 3) SSPLM (Self-Supervised Pseudo-Label Mapping), which achieves test-time adaptation through uncertainty-aware sampling and dual-path consistency. Result: Experiments on multiple public datasets show that SpecMamba consistently outperforms current state-of-the-art methods in detection accuracy and cross-domain generalization. Conclusion: By decoupling stable semantic representation from agile spectral adaptation, SpecMamba achieves efficient, robust, and generalizable few-shot hyperspectral target detection and offers a new paradigm for frequency-aware parameter-efficient fine-tuning. Abstract: Meta-learning facilitates few-shot hyperspectral target detection (HTD), but adapting deep backbones remains challenging. Full-parameter fine-tuning is inefficient and prone to overfitting, and existing methods largely ignore the frequency-domain structure and spectral band continuity of hyperspectral data, limiting spectral adaptation and cross-domain generalization. To address these challenges, we propose SpecMamba, a parameter-efficient and frequency-aware framework that decouples stable semantic representation from agile spectral adaptation. Specifically, we introduce a Discrete Cosine Transform Mamba Adapter (DCTMA) on top of frozen Transformer representations. By projecting spectral features into the frequency domain via DCT and leveraging Mamba's linear-complexity state-space recursion, DCTMA explicitly captures global spectral dependencies and band continuity while avoiding the redundancy of full fine-tuning. Furthermore, to address prototype drift caused by limited sample sizes, we design a Prior-Guided Tri-Encoder (PGTE) that allows laboratory spectral priors to guide the optimization of the learnable adapter without disrupting the stable semantic feature space. Finally, a Self-Supervised Pseudo-Label Mapping (SSPLM) strategy is developed for test-time adaptation, enabling efficient decision boundary refinement through uncertainty-aware sampling and dual-path consistency constraints.
Extensive experiments on multiple public datasets demonstrate that SpecMamba consistently outperforms state-of-the-art methods in detection accuracy and cross-domain generalization.

[127] High-Resolution Single-Shot Polarimetric Imaging Made Easy

Shuangfan Zhou,Chu Zhou,Heng Guo,Youwei Lyu,Boxin Shi,Zhanyu Ma,Imari Sato

Main category: cs.CV

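TL;DR: This paper proposes EasyPolar, a three-view polarimetric imaging framework that uses one unpolarized camera plus two cameras with different polarization orientations to reconstruct high-quality linear polarization from a single shot, and designs a confidence-guided, physically constrained network to handle multi-view alignment errors.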

Details Motivation: Existing single-shot polarization sensors based on Division-of-Focal-Plane (DoFP) lose resolution and introduce artifacts because of spatial multiplexing; imaging quality needs to improve while keeping the single-shot advantage. Method: Proposes a three-camera hardware system (one unpolarized RGB camera plus two RGB cameras with different polarization orientations) and builds a confidence-guided polarization reconstruction network that fuses multi-modal features and applies geometric constraints under physical-model guidance to suppress warping-induced distortion. Result: Experiments show that the method produces high-quality polarization images and improves multiple downstream tasks. Conclusion: EasyPolar overcomes the inherent drawbacks of DoFP sensors without sacrificing single-shot capability, offering a new approach for practical polarization vision. Abstract: Polarization-based vision has gained increasing attention for providing richer physical cues beyond RGB images. While achieving single-shot capture is highly desirable for practical applications, existing Division-of-Focal-Plane (DoFP) sensors inherently suffer from reduced spatial resolution and artifacts due to their spatial multiplexing mechanism. To overcome these limitations without sacrificing the snapshot capability, we propose EasyPolar, a multi-view polarimetric imaging framework. Our system is grounded in the physical insight that three independent intensity measurements are sufficient to fully characterize linear polarization. Guided by this, we design a triple-camera setup consisting of three synchronized RGB cameras that capture one unpolarized view and two polarized views with distinct orientations. Building upon this hardware design, we further propose a confidence-guided polarization reconstruction network to address the potential misalignment in multi-view fusion. The network performs multi-modal feature fusion under a confidence-aware physical guidance mechanism, which effectively suppresses warping-induced artifacts and enforces explicit geometric constraints on the solution space. Experimental results demonstrate that our method achieves high-quality results and benefits various downstream tasks.
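
The physical insight in the abstract above, that three independent intensity measurements determine linear polarization, can be checked with a small per-pixel calculation. The sketch below assumes an idealized model that the abstract does not spell out: the unpolarized camera measures the total intensity S0 directly, and each polarized camera sits behind an ideal linear polarizer at angle θ, so it records I(θ) = ½(S0 + S1·cos 2θ + S2·sin 2θ). The function names and the example angles (0° and 45°) are hypothetical.

```python
import math


def linear_stokes(i_unpol, i1, theta1, i2, theta2):
    """Recover (S0, S1, S2) from one unpolarized and two polarized intensities.

    Assumed model (angles in radians):
        I(theta) = 0.5 * (S0 + S1*cos(2*theta) + S2*sin(2*theta))
    With S0 known, the two polarized readings give a 2x2 linear system
    in S1 and S2, solved here with Cramer's rule.
    """
    s0 = i_unpol
    a1, b1 = math.cos(2 * theta1), math.sin(2 * theta1)
    a2, b2 = math.cos(2 * theta2), math.sin(2 * theta2)
    det = a1 * b2 - b1 * a2  # equals sin(2 * (theta2 - theta1))
    if abs(det) < 1e-9:
        raise ValueError("polarizer angles must not differ by a multiple of 90 degrees")
    r1, r2 = 2 * i1 - s0, 2 * i2 - s0
    s1 = (r1 * b2 - b1 * r2) / det
    s2 = (a1 * r2 - r1 * a2) / det
    return s0, s1, s2


def dolp_aolp(s0, s1, s2):
    """Degree and angle (radians) of linear polarization."""
    return math.hypot(s1, s2) / s0, 0.5 * math.atan2(s2, s1)


# Synthetic pixel with S0 = 1.0, S1 = 0.3, S2 = 0.4, polarizers at 0 and 45 degrees.
s0, s1, s2 = linear_stokes(1.0, 0.65, 0.0, 0.70, math.pi / 4)
print(s0, s1, s2, dolp_aolp(s0, s1, s2))
```

The determinant shows why the two polarizer orientations must be distinct: if they differ by a multiple of 90°, the two polarized readings carry the same information and S1 and S2 cannot be separated. In a real three-camera rig the views must also be aligned before this per-pixel step, which is the misalignment problem the reconstruction network is designed to handle.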

[128] WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval

Yizhuo Xu,Chaojian Yu,Yuanjie Shao,Tongliang Liu,Qinmu Peng,Xinge You

Main category: cs.CV

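TL;DR: This paper proposes WRF4CIR, which applies adversarial perturbations to the model weights during fine-tuning, generated in the direction opposite to gradient descent, to mitigate the severe overfitting caused by limited triplet data in composed image retrieval (CIR) with vision-language pre-trained models, significantly narrowing the generalization gap and improving performance.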

Details Motivation: Existing CIR methods based on vision-language pre-trained (VLP) models commonly overfit severely under limited triplet data and exhibit a significant, overlooked generalization gap, calling for a more robust fine-tuning strategy. Method: Proposes WRF4CIR, a weight-regularized fine-tuning network for CIR; its core idea is to generate adversarial weight perturbations in the direction opposite to gradient descent during fine-tuning, making the training data harder to fit and strengthening generalization. Result: Extensive experiments on multiple benchmark datasets show that WRF4CIR significantly narrows the generalization gap and substantially outperforms existing methods. Conclusion: Adversarial weight regularization is an effective strategy for mitigating overfitting in CIR; WRF4CIR offers a new approach and a practical solution for CIR with limited training data. Abstract: The Composed Image Retrieval (CIR) task aims to retrieve target images based on reference images and modification texts. Current CIR methods primarily rely on fine-tuning vision-language pre-trained models. However, we find that these approaches commonly suffer from severe overfitting, posing challenges for CIR with limited triplet data. To better understand this issue, we present a systematic study of overfitting in VLP-based CIR, revealing a significant and previously overlooked generalization gap across different models and datasets. Motivated by these findings, we introduce WRF4CIR, a Weight-Regularized Fine-tuning network for CIR. Specifically, during the fine-tuning process, we apply adversarial perturbations to the model weights for regularization, where these perturbations are generated in the opposite direction of gradient descent. Intuitively, WRF4CIR increases the difficulty of fitting the training data, which helps mitigate overfitting in CIR under limited triplet supervision. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.
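
The weight-perturbation idea in the abstract above can be illustrated on a toy problem. The sketch below follows a generic adversarial-weight-perturbation recipe: move the weights a small step along the gradient (the loss-increasing direction, opposite to gradient descent), compute the gradient at that perturbed point, and use it to update the original weights. The linear model, the norm-based scaling by `gamma`, and the function names are assumptions for illustration; the abstract does not specify how WRF4CIR scales or applies its perturbations.

```python
import math


def loss_and_grad(w, data):
    """Mean squared error of y ~ w.x and its gradient with respect to w."""
    n = len(data)
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        loss += err * err / n
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / n
    return loss, grad


def perturbed_step(w, data, lr=0.1, gamma=0.01):
    """One update using the gradient taken at adversarially perturbed weights."""
    _, g = loss_and_grad(w, data)
    g_norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    # Step *up* the loss surface, scaled relative to the weight norm.
    w_adv = [wi + gamma * w_norm * gi / g_norm for wi, gi in zip(w, g)]
    _, g_adv = loss_and_grad(w_adv, data)
    # Descend from the original weights using the perturbed-point gradient.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


# Toy data generated by y = 2x; start from a poor initial weight.
data = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0)]
w = [0.5]
for _ in range(200):
    w = perturbed_step(w, data)
print(w, loss_and_grad(w, data)[0])
```

Because every update is computed at a slightly worse point than the current weights, the optimizer is discouraged from settling into sharp minima that only fit the training set. On this toy problem the weight ends up hovering close to the true value of 2 rather than landing on it exactly, which is the intended "harder to fit" effect in miniature.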

[129] Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher

Pengcheng Weng,Yanyu Qian,Yangxin Xu,Fei Wang

Main category: cs.CV

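TL;DR: This paper proposes the PTA framework, which addresses missing modalities in multimodal sensing with a purify-then-align strategy: meta-learning dynamically down-weights noisy modalities, and diffusion-based knowledge distillation aligns the modalities, significantly improving the robustness and performance of single-modality encoders when modalities are missing.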

Details Motivation: Multimodal human sensing faces the challenge of missing modalities; the core barriers are the representation gap between heterogeneous data and the contamination effect from low-quality modalities, and the two are causally linked. Method: Proposes the PTA (Purify-then-Align) framework: first, a meta-learning-driven weighting mechanism dynamically suppresses the influence of noisy modalities to purify the knowledge source; then a clean teacher built from the purified consensus aligns the features of each modality through diffusion-based knowledge distillation. Result: On the large-scale MM-Fi and XRF55 datasets, PTA achieves SOTA performance under pronounced representation gap and contamination effect and significantly improves the robustness of single-modality models across missing-modality scenarios. Conclusion: By decoupling and jointly solving the contamination and alignment problems, PTA produces encoders that combine cross-modal knowledge with strong single-modality capability, offering a new paradigm for robust multimodal sensing. Abstract: Robust multimodal human sensing must overcome the critical challenge of missing modalities. Two principal barriers are the Representation Gap between heterogeneous data and the Contamination Effect from low-quality modalities. These barriers are causally linked, as the corruption introduced by contamination fundamentally impedes the reduction of representation disparities. In this paper, we propose PTA, a novel "Purify-then-Align" framework that solves this causal dependency through a synergistic integration of meta-learning and knowledge diffusion. To purify the knowledge source, PTA first employs a meta-learning-driven weighting mechanism that dynamically learns to down-weight the influence of noisy, low-contributing modalities. Subsequently, to align different modalities, PTA introduces a diffusion-based knowledge distillation paradigm in which an information-rich clean teacher, formed from this purified consensus, refines the features of each student modality. The ultimate payoff of this "Purify-then-Align" strategy is the creation of exceptionally powerful single-modality encoders imbued with cross-modal knowledge. Comprehensive experiments on the large-scale MM-Fi and XRF55 datasets, under pronounced Representation Gap and Contamination Effect, demonstrate that PTA achieves state-of-the-art performance and significantly improves the robustness of single-modality models in diverse missing-modality scenarios.

[130] BPC-Net: Annotation-Free Skin Lesion Segmentation via Boundary Probability Calibration

Yujie Yao,Yuhaohang He,Junjie Huang,Zhou Liu,Jiangzhao Li,Yan Qiao,Wen Xiao,Yunsen Liang,Xiaofan Li

Main category: cs.CV

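TL;DR: This paper proposes the BPC-Net framework, which calibrates boundary probabilities with Gaussian Probability Smoothing (GPS) to address three challenges in annotation-free skin lesion segmentation (pseudo-label noise, unstable cross-domain transfer, and boundary under-confidence), reaching state-of-the-art unsupervised performance under a strictly annotation-free protocol.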

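Details Motivation: Existing annotation-free skin lesion segmentation methods are limited by pseudo-label noise, unstable transfer with small target-domain data, and boundary probability under-confidence; the compression of boundary probabilities in particular has been overlooked, although it directly affects contour completeness and cannot be fixed by global threshold adjustment alone. Method: Proposes BPC-Net, whose core is Gaussian Probability Smoothing (GPS), a localized probability-space calibration applied before thresholding; it is supported by a feature-decoupled decoder (separately handling context suppression, detail recovery, and boundary refinement) and an interaction-branch adaptation strategy (updating only the pseudo-label interaction branch while preserving the image-only segmentation path). Result: On ISIC-2017, ISIC-2018, and PH2 under a strictly annotation-free setting, reaches state-of-the-art unsupervised performance, with a macro-average Dice coefficient of 85.80% and Jaccard index of 76.97%, approaching supervised reference performance on PH2. Conclusion: BPC-Net effectively mitigates boundary under-confidence in annotation-free skin lesion segmentation, confirms the key role of localized probability calibration in segmentation quality, and offers a practical solution for low-resource dermatology deployment. Abstract: Annotation-free skin lesion segmentation is attractive for low-resource dermoscopic deployment. However, its performance remains constrained by three coupled challenges: noisy pseudo-label supervision, unstable transfer under limited target-domain data, and boundary probability under-confidence. Most existing annotation-free methods primarily focus on pseudo-label denoising. In contrast, the effect of compressed boundary probabilities on final mask quality has received less explicit attention, although it directly affects contour completeness and cannot be adequately corrected by global threshold adjustment alone. To address this issue, we propose BPC-Net, a boundary probability calibration framework for annotation-free skin lesion segmentation. The core of the framework is Gaussian Probability Smoothing (GPS), which performs localized probability-space calibration before thresholding to recover under-confident lesion boundaries without inducing indiscriminate foreground expansion. To support this calibration under noisy pseudo-supervision and cross-domain transfer, we further incorporate two auxiliary designs: a feature-decoupled decoder that separately handles context suppression, detail recovery, and boundary refinement, and an interaction-branch adaptation strategy that updates only the pseudo-label interaction branch while preserving the deployed image-only segmentation path.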
Under a strictly annotation-free protocol, no manual masks are used during training or target-domain adaptation, and validation labels, when available, are used only for final operating-point selection. Experiments on ISIC-2017, ISIC-2018, and PH2 show that the proposed framework achieves state-of-the-art performance among published unsupervised methods, reaching a macro-average Dice coefficient and Jaccard index of 85.80% and 76.97%, respectively, while approaching supervised reference performance on PH2.
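
Illustrative sketch (not the authors' code): the abstract describes GPS only as localized probability-space calibration applied before thresholding. The snippet below assumes the simplest possible reading, a separable Gaussian blur of the probability map followed by a fixed threshold. The function names, `sigma`, `radius`, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights over offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(grid, kernel):
    """Blur every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in grid:
        n = len(row)
        out.append([
            sum(kernel[k + radius] * row[min(max(j + k, 0), n - 1)]
                for k in range(-radius, radius + 1))
            for j in range(n)
        ])
    return out


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def smooth_then_threshold(prob, sigma=1.0, radius=2, threshold=0.5):
    """Assumed GPS-like step: blur probabilities, then binarize."""
    kernel = gaussian_kernel(sigma, radius)
    # Separable 2D blur: rows first, then columns via a transpose.
    blurred = transpose(smooth_rows(transpose(smooth_rows(prob, kernel)), kernel))
    mask = [[1 if p >= threshold else 0 for p in row] for row in blurred]
    return blurred, mask


# A single under-confident pixel (0.45) surrounded by confident lesion pixels.
prob = [[0.9] * 5 for _ in range(5)]
prob[2][2] = 0.45
blurred, mask = smooth_then_threshold(prob)
```

In this toy map the under-confident pixel would be dropped by thresholding the raw probabilities at 0.5, while after smoothing it is kept. The sketch makes no attempt to reproduce the paper's safeguard against indiscriminate foreground expansion.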

[131] ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

Zhaohong Huang,Wenjing Liu,Yuxin Zhang,Fei Chao,Rongrong Ji

Main category: cs.CV

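TL;DR: This paper proposes ID-Selection, which jointly models the importance and diversity of visual tokens during large vision-language model (LVLM) inference to achieve efficient, high-fidelity token pruning without additional training.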

Details Motivation: Existing visual token pruning methods struggle to balance importance and diversity: importance-based methods tend to retain redundant tokens, while diversity-based methods may miss key information, and the problem is more pronounced at high pruning ratios. Method: ID-Selection first assigns each token an importance score, then iteratively selects high-scoring tokens while suppressing, at each step, the scores of tokens similar to those already selected, so that informativeness and diversity are handled in one process. Result: Validated on 5 LVLM backbones and 16 benchmarks; for example, on LLaVA-1.5-7B it prunes 97.2% of visual tokens (keeping only 16), cuts FLOPs by over 97%, and retains 91.8% of the original performance, with no fine-tuning. Conclusion: ID-Selection is a simple, efficient, plug-and-play visual token selection strategy that markedly improves the balance between inference efficiency and accuracy for LVLMs at high pruning ratios. Abstract: Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.
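
Illustrative sketch (not the authors' code): the selection loop described above can be approximated as a greedy pick of the highest-scoring token followed by suppression of the scores of similar tokens. The abstract does not give the suppression rule, so the multiplicative `1 - cosine similarity` factor below is an assumption, as are the function names and the toy inputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_tokens(features, importance, keep):
    """Greedy importance-plus-diversity selection (assumed suppression rule)."""
    scores = list(importance)
    selected = []
    for _ in range(min(keep, len(features))):
        # Pick the highest-scoring token that has not been selected yet.
        best = max((i for i in range(len(scores)) if i not in selected),
                   key=lambda i: scores[i])
        selected.append(best)
        # Suppress the remaining tokens in proportion to their similarity.
        for i in range(len(scores)):
            if i not in selected:
                sim = min(max(cosine(features[i], features[best]), 0.0), 1.0)
                scores[i] *= 1.0 - sim
    return selected


# Token 1 is nearly a duplicate of token 0; token 2 is different but less important.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
importance = [0.9, 0.8, 0.5]
chosen = select_tokens(features, importance, keep=2)
```

Plain top-2 by importance would keep the near-duplicate tokens 0 and 1. With suppression, token 1's score collapses after token 0 is chosen and token 2 is kept instead.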

[132] Evaluation of Randomization through Style Transfer for Enhanced Domain Generalization

Dustin Eisenhardt,Timothy Schaumlöffel,Alperen Kantarci,Gemma Roig

Main category: cs.CV

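TL;DR: Through a systematic empirical study, this paper clarifies the effect of three key design factors of style transfer for domain generalization (style pool diversity, texture complexity, and choice of style source) and, based on the findings, proposes StyleMixDG, a lightweight, model-agnostic data augmentation method that markedly improves generalization in several Sim2Real settings.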

Details Motivation: Deep learning models generalize poorly in real-world settings, especially when trained on synthetic data, because of the Sim2Real gap; although style transfer is widely used for domain generalization, findings on style pool diversity, texture complexity, and style source choice remain contradictory. Method: A systematic empirical study independently controls and evaluates these three design axes on driving scene understanding; based on the findings, the authors design StyleMixDG, a lightweight style-mixing augmentation that needs no architectural changes or additional losses. Result: (i) Enlarging the style pool yields larger gains than reusing a few styles; (ii) texture complexity has no significant effect once the pool is large enough; (iii) diverse artistic styles outperform domain-aligned ones. StyleMixDG consistently surpasses strong baselines on the GTAV→BDD100k/Cityscapes/Mapillary Vistas benchmark. Conclusion: The benefit of style transfer for domain generalization depends mainly on style pool diversity rather than texture complexity or domain alignment; StyleMixDG shows that this principle has practical value and is a simple, effective, plug-and-play augmentation strategy. Abstract: Deep learning models for computer vision often suffer from poor generalization when deployed in real-world settings, especially when trained on synthetic data due to the well-known Sim2Real gap. Despite the growing popularity of style transfer as a data augmentation strategy for domain generalization, the literature contains unresolved contradictions regarding three key design axes: the diversity of the style pool, the role of texture complexity, and the choice of style source. We present a systematic empirical study that isolates and evaluates each of these factors for driving scene understanding, resolving inconsistencies in prior work. Our findings show that (i) expanding the style pool yields larger gains than repeated augmentation with few styles, (ii) texture complexity has no significant effect when the pool is sufficiently large, and (iii) diverse artistic styles outperform domain-aligned alternatives. Guided by these insights, we derive StyleMixDG (Style-Mixing for Domain Generalization), a lightweight, model-agnostic augmentation recipe that requires no architectural modifications or additional losses. Evaluated on the GTAV $\rightarrow$ {BDD100k, Cityscapes, Mapillary Vistas} benchmark, StyleMixDG demonstrates consistent improvements over strong baselines, confirming that the empirically identified design principles translate into practical gains. The code will be released on GitHub.

[133] Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening

Chenyu Xue,Yiran Liu,Mian Zhou,Jionglong Su,Zhixiang Lu

Main category: cs.CV

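TL;DR: This paper proposes a Semantic-Topological Graph Reasoning (STGR) framework for text-guided pulmonary screening. It combines LLaMA-3-V with MedSAM, resolves anatomical ambiguity through Text-to-Vision Intent Distillation (TVID) and dynamic graph reasoning, and uses a SAFT strategy that fine-tunes less than 1% of the parameters, reaching SOTA performance on the LIDC-IDRI and LNDb datasets (81.5% DSC on LIDC-IDRI) with excellent cross-fold stability.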

Details Motivation: Existing models struggle with the semantic ambiguity of clinical reports and with complex anatomical overlaps in low-contrast scans, and full fine-tuning on small medical datasets tends to overfit. Method: The STGR framework comprises: 1) a Text-to-Vision Intent Distillation (TVID) module that extracts diagnostic intent; 2) lesion mask selection formulated as a dynamic graph reasoning problem, with candidate lesions as nodes and edges encoding spatial and semantic affinities; 3) a Selective Asymmetric Fine-Tuning (SAFT) strategy that updates less than 1% of the parameters. Result: Under 5-fold cross-validation on LIDC-IDRI and LNDb the method reaches SOTA: the Dice Similarity Coefficient on LIDC-IDRI is 81.5%, more than 5% above LISA; SAFT yields very good cross-fold stability (DSC variance of only 0.6%). Conclusion: STGR effectively combines the capabilities of large language models and vision foundation models to achieve accurate, clinical-text-guided pulmonary lesion segmentation in a lightweight and robust way, opening a path toward context-aware clinical deployment. Abstract: Medical image segmentation driven by free-text clinical instructions is a critical frontier in computer-aided diagnosis. However, existing multimodal and foundation models struggle with the semantic ambiguity of clinical reports and fail to disambiguate complex anatomical overlaps in low-contrast scans. Furthermore, fully fine-tuning these massive architectures on limited medical datasets invariably leads to severe overfitting. To address these challenges, we propose a novel Semantic-Topological Graph Reasoning (STGR) framework for language-guided pulmonary screening. Our approach elegantly synergizes the reasoning capabilities of large language models (LLaMA-3-V) with the zero-shot delineation of vision foundation models (MedSAM). Specifically, we introduce a Text-to-Vision Intent Distillation (TVID) module to extract precise diagnostic guidance. To resolve anatomical ambiguity, we formulate mask selection as a dynamic graph reasoning problem, where candidate lesions are modeled as nodes and edges capture spatial and semantic affinities. To ensure deployment feasibility, we introduce a Selective Asymmetric Fine-Tuning (SAFT) strategy that updates less than 1% of the parameters. Rigorous 5-fold cross-validation on the LIDC-IDRI and LNDb datasets demonstrates that our framework establishes a new state-of-the-art. Notably, it achieves an 81.5% Dice Similarity Coefficient (DSC) on LIDC-IDRI, outperforming leading LLM-based tools like LISA by over 5%. 
Crucially, our SAFT strategy acts as a powerful regularizer, yielding exceptional cross-fold stability (0.6% DSC variance) and paving the way for robust, context-aware clinical deployment.
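
Illustrative sketch: for readers unfamiliar with the metric quoted above, the Dice Similarity Coefficient of two binary masks is 2|A∩B| / (|A| + |B|), and the related Jaccard index is |A∩B| / |A∪B|. The helper below computes both on flattened masks; the function name and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not taken from the paper.

```python
def dice_and_jaccard(pred, target):
    """Dice and Jaccard scores for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    # Two empty masks agree perfectly; this convention is an assumption.
    dice = 2.0 * intersection / (pred_area + target_area) if union else 1.0
    jaccard = intersection / union if union else 1.0
    return dice, jaccard


# One overlapping pixel out of two predicted and two reference pixels.
dice, jaccard = dice_and_jaccard([1, 1, 0, 0], [1, 0, 1, 0])
```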

[134] FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos

Alexandros Delitzas,Chenyangguang Zhang,Alexey Gavryushin,Tommaso Di Mario,Boyang Sun,Rishabh Dabral,Leonidas Guibas,Christian Theobalt,Marc Pollefeys,Francis Engelmann,Daniel Barath

Main category: cs.CV

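TL;DR: FunRec reconstructs functional 3D digital twins of interactable indoor scenes directly from egocentric RGB-D interaction videos. Without controlled setups or CAD priors, it automatically discovers articulated parts, estimates kinematic parameters, and reconstructs static and moving geometry, substantially outperforming existing methods.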

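Details Motivation: Existing articulated 3D reconstruction methods rely on controlled setups, multi-state captures, or CAD priors and struggle with real-world human interaction videos; this work aims at end-to-end functional reconstruction from in-the-wild interaction sequences. Method: Starting from egocentric RGB-D video, FunRec jointly performs articulated part discovery, kinematic parameter estimation, and 3D motion tracking, and reconstructs static and moving geometry in canonical space, producing simulation-compatible meshes. Result: On new real and simulated benchmarks it surpasses prior methods by a large margin: up to +50 mIoU in part segmentation, 5-10 times lower articulation and pose errors, and markedly higher reconstruction accuracy; it supports URDF/USD export, hand-guided affordance mapping, and robot-scene interaction. Conclusion: FunRec generates functional, simulation-ready 3D digital twins end to end from unconstrained interaction videos, offering a new paradigm for embodied intelligence and robot interaction. Abstract: We present FunRec, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunRec operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunRec surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10 times lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction.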

[135] DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

Xinran Wang,Yuxuan Zhang,Xiao Zhang,Haolong Yan,Muxi Diao,Songyu Xu,Zhonghao Yan,Hongbing Li,Kongming Liang,Zhanyu Ma

Main category: cs.CV

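TL;DR: This paper introduces DetailVerifyBench, a challenging benchmark for fine-grained hallucination localization in long image captions, comprising 1,000 high-quality images across five domains, long captions averaging over 200 words, and dense token-level hallucination annotations.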

Details Motivation: Existing benchmarks lack the fine granularity and domain diversity needed to evaluate how precisely hallucinations (such as erroneous words or phrases) can be localized in long image captions generated by multimodal large language models. Method: DetailVerifyBench covers five domains and 1,000 high-quality images; each image is paired with a detailed caption averaging over 200 words, manually annotated at the token level for multiple hallucination types. Result: DetailVerifyBench is presented as the most challenging benchmark to date for hallucination localization in long image captions, with fine granularity, strong domain diversity, and strict annotation standards. Conclusion: DetailVerifyBench fills the evaluation gap for fine-grained hallucination detection and localization in long image captions and provides key infrastructure for research on reliable multimodal generation. Abstract: Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at https://zyx-hhnkh.github.io/DetailVerifyBench/.

[136] A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language Prompting

Yongchuan Cui,Peng Liu

Main category: cs.CV

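TL;DR: This paper presents LLaRS, the first unified foundation model for multi-modal, multi-task low-level vision on remote sensing imagery. It aligns heterogeneous bands via optimal transport, models multi-scale features with mixture-of-experts layers, and uses dynamic weight adjustment for joint training; trained on the million-scale multi-task dataset LLaRS1M, it clearly outperforms existing methods.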

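Details Motivation: Remote sensing imagery is affected by multiple degradations such as clouds, haze, noise, resolution limits, and sensor heterogeneity, yet existing methods usually train a separate model per degradation type and lack a unified foundation model that generalizes well. Method: The language-conditioned model LLaRS: 1) uses Sinkhorn-Knopp optimal transport to align heterogeneous spectral bands; 2) employs three complementary mixture-of-experts layers (convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context); 3) introduces step-level dynamic weight adjustment to stabilize multi-task joint training; 4) is trained on LLaRS1M, a million-scale multi-task dataset covering 11 restoration and enhancement tasks that combines real paired data, controlled synthetic degradations, and diverse natural language prompts. Result: LLaRS consistently outperforms 7 strong baselines on multiple remote sensing low-level vision tasks; parameter-efficient fine-tuning experiments show strong cross-task transfer and fast adaptation to unseen data. Conclusion: LLaRS demonstrates the feasibility and effectiveness of a unified, language-guided, multi-task remote sensing foundation model and offers a new paradigm for intelligent interpretation of remote sensing imagery. Abstract: Remote sensing imagery suffers from clouds, haze, noise, resolution limits, and sensor heterogeneity. Existing restoration and fusion approaches train separate models per degradation type. In this work, we present Language-conditioned Large-scale Remote Sensing restoration model (LLaRS), the first unified foundation model for multi-modal and multi-task remote sensing low-level vision. LLaRS employs Sinkhorn-Knopp optimal transport to align heterogeneous bands into semantically matched slots, routes features through three complementary mixture-of-experts layers (convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context), and stabilizes joint training via step-level dynamic weight adjustment. To train LLaRS, we construct LLaRS1M, a million-scale multi-task dataset spanning eleven restoration and enhancement tasks, integrating real paired observations and controlled synthetic degradations with diverse natural language prompts. Experiments show LLaRS consistently outperforms seven competitive models, and parameter-efficient finetuning experiments demonstrate strong transfer capability and adaptation efficiency on unseen data. Repo: https://github.com/yc-cui/LLaRS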
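
Illustrative sketch (not the authors' code): Sinkhorn-Knopp iteration turns a cost matrix into a soft assignment whose rows and columns match prescribed marginals. The snippet below is the textbook entropy-regularized version with uniform marginals. How LLaRS defines the band-to-slot cost is not stated in the abstract, so the cost matrix, `epsilon`, and the iteration count here are placeholders.

```python
import math


def sinkhorn(cost, epsilon=0.1, iterations=200):
    """Entropy-regularized optimal transport plan with uniform marginals."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost gives a high affinity.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    row_mass = [1.0 / n] * n
    col_mass = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two hypothetical bands and two slots; band i is cheapest to place in slot i.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

Each entry `plan[i][j]` can be read as the share of band `i` routed to slot `j`; a smaller `epsilon` pushes the plan toward a hard one-to-one matching.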

[137] SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection

Letian Bai,Chengyu Tao,Juan Du

Main category: cs.CV

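TL;DR: This paper proposes SGANet, a unified framework for multimodal multi-view anomaly detection that learns physically coherent feature representations through semantic and geometric alignment, markedly improving surface defect detection.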

Details Motivation: Existing unsupervised methods suffer from inconsistent features caused by viewpoint changes and modality differences, which makes it hard to detect surface defects on complex objects. Method: The Semantic and Geometric Alignment Network (SGANet) has three modules: the Selective Cross-view Feature Refinement Module (SCFRM) strengthens cross-view feature interaction; Semantic-Structural Patch Alignment (SSPA) aligns semantics across modalities while maintaining structural consistency; Multi-View Geometric Alignment (MVGA) aligns geometrically corresponding patches across views. Result: On the SiM3D and Eyecandies datasets, SGANet reaches SOTA performance in both anomaly detection and localization. Conclusion: By jointly modeling feature interaction, semantic-structural consistency, and global geometric correspondence, SGANet improves the robustness and accuracy of multimodal multi-view anomaly detection and is suited to realistic industrial scenarios. Abstract: Multi-view anomaly detection aims to identify surface defects on complex objects using observations captured from multiple viewpoints. However, existing unsupervised methods often suffer from feature inconsistency arising from viewpoint variations and modality discrepancies. To address these challenges, we propose a Semantic and Geometric Alignment Network (SGANet), a unified framework for multimodal multi-view anomaly detection that effectively combines semantic and geometric alignment to learn physically coherent feature representations across viewpoints and modalities. SGANet consists of three key components. The Selective Cross-view Feature Refinement Module (SCFRM) selectively aggregates informative patch features from adjacent views to enhance cross-view feature interaction. The Semantic-Structural Patch Alignment (SSPA) enforces semantic alignment across modalities while maintaining structural consistency under viewpoint transformations. The Multi-View Geometric Alignment (MVGA) further aligns geometrically corresponding patches across viewpoints. By jointly modeling feature interaction, semantic and structural consistency, and global geometric correspondence, SGANet effectively enhances anomaly detection performance in multimodal multi-view settings. Extensive experiments on the SiM3D and Eyecandies datasets demonstrate that SGANet achieves state-of-the-art performance in both anomaly detection and localization, validating its effectiveness in realistic industrial scenarios.

[138] Towards Athlete Fatigue Assessment from Association Football Videos

Xavier Bou,Nathan Correger,Alexandre Cloots,Cédric Gavage,Silvio Giancola,Cédric Schwartz,François Delvaux,Rudi Cloots,Marc Van Droogenbroeck,Anthony Cioppa

Main category: cs.CV

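TL;DR: This paper examines whether monocular broadcast video can support fatigue monitoring of football players. It extracts player trajectories with Game State Reconstruction (GSR), proposes a new algorithm to estimate speed and acceleration, and builds acceleration-speed (A-S) profiles as fatigue indicators; evaluation on the SoccerNet-GSR dataset shows the approach is workable but sensitive to noise and errors.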

Details Motivation: Current football fatigue monitoring relies on subjective reports, laboratory biomarkers, or intrusive sensors (such as heart-rate straps and GPS), which are costly or of limited practicality; low-cost, non-intrusive, easily deployed objective fatigue indicators are needed. Method: Using state-of-the-art Game State Reconstruction (GSR) methods, player trajectories in pitch coordinates are extracted from monocular broadcast video; a new kinematics processing algorithm obtains temporally consistent speed and acceleration estimates from the reconstructed tracks; acceleration-speed (A-S) profiles are then built and their behavior as fatigue evolves is analyzed. Result: On the SoccerNet-GSR benchmark, 30-second clips show that A-S profiles are reliable in the short term, and analysis of a 45-minute half shows consistent longer-term trends; however, the results are sensitive to trajectory noise, calibration errors, and temporal discontinuities in the video. Conclusion: Monocular broadcast video can serve as a low-cost basis for fatigue analysis and has practical potential, but methodological challenges remain, including the robustness of motion reconstruction, spatio-temporal alignment, and noise suppression. Abstract: Fatigue monitoring is central in association football due to its links with injury risk and tactical performance. However, objective fatigue-related indicators are commonly derived from subjective self-reported metrics, biomarkers derived from laboratory tests, or, more recently, intrusive sensors such as heart monitors or GPS tracking data. This paper studies whether monocular broadcast videos can provide spatio-temporal signals of sufficient quality to support fatigue-oriented analysis. Building on state-of-the-art Game State Reconstruction methods, we extract player trajectories in pitch coordinates and propose a novel kinematics processing algorithm to obtain temporally consistent speed and acceleration estimates from reconstructed tracks. We then construct acceleration--speed (A-S) profiles from these signals and analyze their behavior as fatigue-related performance indicators. We evaluate the full pipeline on the public SoccerNet-GSR benchmark, considering both 30-second clips and a complete 45-minute half to examine short-term reliability and longer-term temporal consistency. Our results indicate that monocular GSR can recover kinematic patterns that are compatible with A-S profiling while also revealing sensitivity to trajectory noise, calibration errors, and temporal discontinuities inherent to broadcast footage. These findings support monocular broadcast video as a low-cost basis for fatigue analysis and delineate the methodological challenges for future research.
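
Illustrative sketch (not the authors' algorithm): the abstract does not specify how speed and acceleration are derived from the reconstructed tracks. A common baseline, assumed here, is finite differencing of pitch coordinates followed by a moving-average filter to damp trajectory noise. The frame rate, window size, and function names are hypothetical.

```python
import math


def speeds(track, fps):
    """Frame-to-frame speed in m/s from (x, y) pitch positions in metres."""
    dt = 1.0 / fps
    return [math.hypot(x1 - x0, y1 - y0) / dt
            for (x0, y0), (x1, y1) in zip(track, track[1:])]


def moving_average(values, window):
    """Centered moving average that shrinks the window at both ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def accelerations(speed, fps):
    """Finite-difference acceleration in m/s^2 from a speed series."""
    dt = 1.0 / fps
    return [(s1 - s0) / dt for s0, s1 in zip(speed, speed[1:])]


def acceleration_speed_pairs(track, fps, window=5):
    """(speed, acceleration) samples from which an A-S profile could be built."""
    smoothed = moving_average(speeds(track, fps), window)
    accel = accelerations(smoothed, fps)
    return list(zip(smoothed[:-1], accel))


# A player moving at a constant 2 m/s along the x axis, sampled at 25 fps.
track = [(0.08 * i, 0.0) for i in range(20)]
pairs = acceleration_speed_pairs(track, fps=25)
```

Differencing amplifies position noise, which is consistent with the sensitivity to trajectory noise and calibration errors that the abstract reports; the smoothing window trades noise suppression against temporal resolution.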

[139] PanopticQuery: Unified Query-Time Reasoning for 4D Scenes

Ruilin Tang,Yang Zhou,Zhong Ye,Wenxi Liu,Yan Huang,Shengfeng He

Main category: cs.CV

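TL;DR: This paper proposes PanopticQuery, a framework that combines 4D Gaussian Splatting reconstruction with a multi-view semantic consensus mechanism to enable unified natural-language query reasoning over dynamic 4D scenes, and releases a new benchmark, Panoptic-L4D.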

Details Motivation: Existing 4D reconstruction methods based on neural representations are limited in contextual reasoning (such as interactions, temporal actions, and spatial relations) and struggle to turn noisy, view-dependent predictions into a globally consistent 4D semantic understanding. Method: Building on 4D Gaussian Splatting for high-quality dynamic reconstruction, the framework introduces a multi-view semantic consensus mechanism that aggregates 2D semantic predictions across views and time frames, filters inconsistent results, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. Result: On the newly proposed Panoptic-L4D benchmark, PanopticQuery reaches SOTA performance on complex language queries (attributes, actions, spatial relations, multi-object interactions). Conclusion: PanopticQuery achieves language-driven, spatio-temporally consistent, and geometrically plausible 4D scene understanding, offering a new paradigm for semantic querying of dynamic scenes. Abstract: Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.

[140] Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

Peixi Peng,Housheng Xie,Yanling Wei,Guangcong Ruan,Xiaoyang Zou,Qian Cao,Yongjian Nian,Guoyan Zheng

Main category: cs.CV

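TL;DR: This paper proposes RATNet, a foundation model for gastrointestinal endoscopy imaging based on analogical reasoning. It learns and transfers knowledge from heterogeneous expert annotations through a cyclic pre-training strategy, markedly improving generalization, robustness, and adaptability across a range of diagnostic scenarios.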

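Details Motivation: Existing AI-assisted diagnosis models for gastrointestinal endoscopy images suffer from poor generalization, weak adaptability, insufficient robustness, and low scalability, mainly because of scarce medical data, domain shift, and heterogeneous annotations. Method: RATNet uses a cyclic pre-training strategy to fuse heterogeneous expert annotations from five gastrointestinal endoscopy datasets; the model consists of an encoder, a relevance-knowledge acquisition and transfer (RAT) module, a projector, and a multi-task head, and supports fine-tuning, linear probing, and zero-shot transfer; its core is an analogical reasoning mechanism that matches image-derived posterior knowledge to a prior knowledge base and transfers relative knowledge. Result: RATNet outperforms existing foundation models such as GastroNet and GastroVision in six scenarios (common disease diagnosis, few-shot learning for rare diseases, cross-site zero-shot transfer, robustness under long-tailed distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning). Conclusion: RATNet is an open, low-cost, easily deployed foundation model for gastrointestinal endoscopy diagnosis that automatically integrates heterogeneous annotations and reduces data acquisition costs; it is especially suited to resource-limited regions and provides a practical basis for intelligent gastrointestinal diagnosis. Abstract: Gastrointestinal diseases impose a growing global health burden, and endoscopy is a primary tool for early diagnosis. However, routine endoscopic image interpretation still suffers from missed lesions and limited efficiency. Although AI-assisted diagnosis has shown promise, existing models often lack generalizability, adaptability, robustness, and scalability because of limited medical data, domain shift, and heterogeneous annotations. To address these challenges, we develop RATNet, a foundation model for gastrointestinal endoscopy imaging based on analogical reasoning. RATNet acquires and transfers knowledge from heterogeneous expert annotations across five gastrointestinal endoscopy datasets through a cyclic pre-training strategy. Its architecture consists of an encoder, a relevance-knowledge acquisition and transfer (RAT) module, a projector, and a multi-task head, and supports fine-tuning, linear probing, and zero-shot transfer. Evaluations show that RATNet outperforms existing foundation models, including GastroNet and GastroVision, across six scenarios: diagnosis of common gastrointestinal diseases, few-shot learning for rare diseases, zero-shot transfer to new medical sites, robustness under long-tailed disease distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning. 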
Its advantage comes from an analogical reasoning mechanism that matches image-derived posterior knowledge to a learned prior knowledge base and transfers relative knowledge to guide diagnosis, improving generalization and resistance to bias. RATNet is open and cost-effective, supports automatic integration of heterogeneous annotations without manual label unification, and reduces data acquisition costs, making it a practical foundation for intelligent gastrointestinal diagnosis, especially in resource-limited settings.

[141] Probing Intrinsic Medical Task Relationships: A Contrastive Learning Perspective

Jonas Muth,Zdravko Marinov,Simon Reiß

Main category: cs.CV

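TL;DR: This paper proposes the Task-Contrastive Learning (TaCo) framework, which uses contrastive learning to embed 30 medical vision tasks into a unified representation space, revealing their intrinsic relationships and structure across 39 datasets from different medical imaging modalities.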

Details Motivation: Medical computer vision has long focused on improving single-task performance, while the intrinsic relationships between tasks at the representation level (such as overlap, difference, and relatedness) have not been systematically explored. Method: Task-Contrastive Learning (TaCo) is a contrastive learning framework that maps 30 medical vision tasks spanning semantic, generative, and transformation categories, drawn from 39 medical imaging datasets across modalities, into a shared representation space and analyzes its structural properties. Result: The resulting task embedding space reflects which tasks are distinct, which blend together, and how gradual task alterations appear, revealing intrinsic relations and structural regularities among medical vision tasks. Conclusion: TaCo offers a new paradigm for understanding the intrinsic similarity and interconnectedness of medical vision tasks, lays the groundwork for structured task analysis, and can help guide model reuse, transfer, and collaborative learning. Abstract: While much of the medical computer vision community has focused on advancing performance for specific tasks, the underlying relationships between tasks, i.e., how they relate, overlap, or differ on a representational level, remain largely unexplored. Our work explores these intrinsic relationships between medical vision tasks. Specifically, we investigate 30 tasks, such as semantic tasks (e.g., segmentation and detection), image generative tasks (e.g., denoising, inpainting, or colorization), and image transformation tasks (e.g., geometric transformations). Our goal is to probe whether a data-driven representation space can capture an underlying structure of tasks across a variety of 39 datasets from wildly different medical imaging modalities, including computed tomography, magnetic resonance, electron microscopy, X-ray, ultrasound, and more. By revealing how tasks relate to one another, we aim to provide insights into their fundamental properties and interconnectedness. To this end, we introduce Task-Contrastive Learning (TaCo), a contrastive learning framework designed to embed tasks into a shared representation space. Through TaCo, we map these heterogeneous tasks from different modalities into a joint space and analyze their properties: identifying which tasks are distinctly represented, which blend together, and how iterative alterations to tasks are reflected in the embedding space. Our work provides a foundation for understanding the intrinsic structure of medical vision tasks, offering a deeper understanding of task similarities and their interconnected properties in embedding spaces.

[142] SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation

Wuyang Luan,Junhui Li,Weiguang Zhao,Wenjian Zhang,Tieru Wu,Rui Ma

Main category: cs.CV

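TL;DR: SnapFlow is a self-distillation method that needs no external teacher and no architecture changes. It compresses the multi-step denoising of flow-matching Vision-Language-Action (VLA) models into a single forward pass, substantially reducing inference latency while maintaining or even slightly improving task success rates.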

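Details Motivation: Existing flow-matching VLA models (such as pi0, pi0.5, and SmolVLA) rely on a 10-step iterative ODE solve, which causes high latency (denoising accounts for 80% of end-to-end time); simply reducing the step count severely hurts performance because the velocity field is not calibrated for large jumps. Method: SnapFlow performs self-distillation by mixing standard flow-matching samples with consistency samples whose targets are two-step Euler shortcut velocities predicted by the model itself; a zero-initialized target-time embedding lets the network switch between local velocity estimation and global one-step generation within a single architecture. Result: On pi0.5 (3B) it reaches 98.75% average success (slightly above the 10-step teacher's 97.75%), with a 9.6x denoising speedup and end-to-end latency reduced from 274ms to 83ms; on SmolVLA (500M) it lowers MSE by 8.3% with a 3.56x end-to-end speedup; the advantage holds on long-horizon tasks. Conclusion: SnapFlow is a plug-and-play, efficient, and general one-step inference scheme for flow-matching VLAs that is orthogonal to other acceleration techniques and balances performance with real-time operation. Abstract: Vision-Language-Action (VLA) models based on flow matching -- such as pi0, pi0.5, and SmolVLA -- achieve state-of-the-art generalist robotic manipulation, yet their iterative denoising, typically 10 ODE steps, introduces substantial latency: on a modern GPU, denoising alone accounts for 80% of end-to-end inference time. Naively reducing the step count is unreliable, degrading success on most tasks due to the velocity field being uncalibrated for single-step jumps. We present SnapFlow, a plug-and-play self-distillation method that compresses multi-step denoising into a single forward pass (1-NFE) for flow-matching VLAs. SnapFlow mixes standard flow-matching samples with consistency samples whose targets are two-step Euler shortcut velocities computed from the model's own marginal velocity predictions, avoiding the trajectory drift caused by conditional velocities, as we analyze theoretically. A zero-initialized target-time embedding lets the network switch between local velocity estimation and global one-step generation within a single architecture. SnapFlow requires no external teacher, no architecture changes, and trains in ~12h on a single GPU. 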
We validate on two VLA architectures spanning a 6x parameter range, with identical hyperparameters: on pi0.5 (3B) across four LIBERO suites (40 tasks, 400 episodes), SnapFlow achieves 98.75% average success -- matching the 10-step teacher at 97.75% and slightly exceeding it -- with 9.6x denoising speedup and end-to-end latency reduced from 274ms to 83ms; on SmolVLA (500M), it reduces MSE by 8.3% with 3.56x end-to-end acceleration. An action-step sweep on long-horizon tasks reveals that SnapFlow maintains its advantage across execution horizons, achieving 93% at n_act=5 where the baseline reaches only 90%. SnapFlow is orthogonal to layer-distillation and token-pruning approaches, enabling compositional speedups.
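
Illustrative sketch (not the authors' code): the abstract names "two-step Euler shortcut velocities" without giving a formula. One plausible reading, assumed here, is the average of the velocities seen by two consecutive half-size Euler steps, so that a single jump with that averaged velocity lands where the two smaller steps would. The toy velocity field and all function names are hypothetical stand-ins for a learned model.

```python
def euler_integrate(velocity, x, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


def two_step_shortcut_velocity(velocity, x, t, d):
    """Average velocity of two Euler half-steps covering a jump of size d."""
    v_first = velocity(x, t)
    x_mid = [xi + 0.5 * d * vi for xi, vi in zip(x, v_first)]
    v_second = velocity(x_mid, t + 0.5 * d)
    return [0.5 * (a + b) for a, b in zip(v_first, v_second)]


def toy_velocity(x, t):
    """Stand-in for a learned velocity field: simple decay toward zero."""
    return [-xi for xi in x]


start = [1.0]
ten_step = euler_integrate(toy_velocity, start, steps=10)
one_step = euler_integrate(toy_velocity, start, steps=1)
two_step = euler_integrate(toy_velocity, start, steps=2)

# A single jump that uses the shortcut velocity as its target direction.
shortcut = two_step_shortcut_velocity(toy_velocity, start, t=0.0, d=1.0)
jump = [xi + 1.0 * vi for xi, vi in zip(start, shortcut)]
```

On this toy field a naive single Euler step overshoots to 0, far from the 10-step result, which mirrors the abstract's remark that the velocity field is uncalibrated for single-step jumps. The single jump along the shortcut velocity reproduces the two-step result, which is the behavior a one-step student would be trained toward under this reading.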

[143] 3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models

Xinye Zheng,Fei Wang,Yiqi Nie,Kun Li,Junjie Chen,Jiaqi Zhao,Yanyan Wei,Zhiliang Wu

Main category: cs.CV

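TL;DR: This paper proposes a framework that combines visual priors with efficient 3D modeling to reconstruct 3D scenes from smoke-degraded multi-view images; its core components are Nano-Banana-Pro image enhancement and Smoke-GS, a medium-aware Gaussian Splatting method.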

Details Motivation: Smoke causes strong scattering, view-dependent appearance changes, and a severe loss of cross-view consistency, which makes 3D scene reconstruction particularly difficult. Method: Nano-Banana-Pro is used to enhance smoke-degraded images; Smoke-GS, a medium-aware 3D Gaussian Splatting framework with a lightweight view-dependent medium branch, models the scene with explicit 3D Gaussians and captures the view-dependent appearance changes caused by smoke. Result: The method produces more consistent and clearer novel views in smoke environments while retaining the rendering efficiency of 3D Gaussian Splatting and improving robustness to smoke degradation. Conclusion: A joint framework that combines visual enhancement with medium-aware modeling markedly improves the quality and robustness of multi-view 3D reconstruction and novel view synthesis in smoke. Abstract: Reconstructing 3D scenes from smoke-degraded multi-view images is particularly difficult because smoke introduces strong scattering effects, view-dependent appearance changes, and severe degradation of cross-view consistency. To address these issues, we propose a framework that integrates visual priors with efficient 3D scene modeling. We employ Nano-Banana-Pro to enhance smoke-degraded images and provide clearer visual observations for reconstruction, and we develop Smoke-GS, a medium-aware 3D Gaussian Splatting framework for smoke scene reconstruction and restoration-oriented novel view synthesis. Smoke-GS models the scene using explicit 3D Gaussians and introduces a lightweight view-dependent medium branch to capture direction-dependent appearance variations caused by smoke. Our method preserves the rendering efficiency of 3D Gaussian Splatting while improving robustness to smoke-induced degradation. Results demonstrate the effectiveness of our method for generating consistent and visually clear novel views in challenging smoke environments.

[144] CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration

Xuecong Liu,Mengzhu Ding,Zixuan Sun,Zhang Li,Xichao Teng

Main category: cs.CV

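TL;DR: CRFT is a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration; it improves geometric adaptability and structural consistency through multi-scale feature correlation and an iterative discrepancy-guided attention mechanism.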

Details Motivation: Cross-modal image registration suffers from inaccurate alignment and structural inconsistency caused by modality differences and large affine or scale changes. Method: The Consistent-Recurrent Feature Flow Transformer (CRFT) has a coarse stage (multi-scale feature correlation to establish global correspondences) and a fine stage (hierarchical feature fusion and adaptive spatial reasoning); an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field. Result: On multiple cross-modal datasets, CRFT clearly outperforms state-of-the-art methods in both accuracy and robustness. Conclusion: CRFT provides a unified, robust, and generalizable solution for cross-modal image registration and can be extended to remote sensing, autonomous driving, and medical imaging. Abstract: We present Consistent-Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration. CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging. Code and datasets are publicly available at https://github.com/NEU-Liuxuecong/CRFT.

[145] Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

Chongyu Wang,Ting Huang,Chunyu Sun,Xinyu Ning,Di Wang,Hao Tang

Main category: cs.CV

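TL;DR: This paper proposes GUIDE, a framework that progressively injects multi-granularity geometric priors (from local edges to global topology) into the early layers of an MLLM and adds a context-aware gating mechanism, improving the model's physical spatial awareness and 2D-to-3D reasoning.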

Details Motivation: Existing geometry-aware multimodal large language models are constrained by a paradigm of single deep-layer extraction and input-level fusion, which loses local geometric detail, causes semantic mismatches in early layers, and leaves the models with limited physical spatial awareness in real-world settings. Method: GUIDE: 1) performs multi-level sampling within the geometric encoder to capture multi-granularity geometric features; 2) aligns and fuses these priors layer by layer with the early layers of the MLLM; 3) introduces a context-aware gating mechanism that dynamically selects the relevant spatial cues. Result: GUIDE significantly outperforms existing baselines on multiple complex spatial reasoning and perception tasks. Conclusion: GUIDE offers a new paradigm for integrating 3D geometric priors into large models and effectively improves the physical spatial understanding of MLLMs. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in 2D visual tasks but still exhibit limited physical spatial awareness when processing real-world visual streams. Recently, feed-forward geometric foundation models, which implicitly extract geometric priors, have provided a new pathway to address this issue. However, existing geometry-aware MLLMs are predominantly constrained by the paradigm of single deep-layer extraction and input-level fusion. This flattened fusion leads to the loss of local geometric details and causes semantic mismatches in the early layers. To break this bottleneck, we propose GUIDE (Geometric Unrolling Inside MLLM Early-layers), a progressive geometric priors injection framework. GUIDE performs multi-level sampling within the geometric encoder, comprehensively capturing multi-granularity features ranging from local edges to global topologies. Subsequently, we rigorously align and fuse these multi-level geometric priors step-by-step with the early layers of the MLLM. Building upon the injection of multi-granularity geometric information, this design guides the model to progressively learn the 2D-to-3D transitional process. Furthermore, we introduce a context-aware gating that enables the model to fetch requisite spatial cues based on current semantics, thereby maximizing the utilization efficiency of spatial priors and effectively suppressing redundant geometric noise. Extensive experiments demonstrate that GUIDE significantly outperforms existing baselines on multiple complex spatial reasoning and perception tasks, establishing a novel paradigm for integrating 3D geometric priors into large models.

[146] In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting

Wenhui Xiao,Ethan Goan,Rodrigo Santa Cruz,David Ahmedt-Aristizabal,Olivier Salvado,Clinton Fookes,Leo Lebrat

Main category: cs.CV

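TL;DR: This paper proposes a new framework for effectively integrating scale-ambiguous, noisy monocular depth priors into 3D Gaussian Splatting (GS) training. By modeling weakly aligned depth variations and selectively regularizing ill-posed geometric regions, it improves geometric accuracy and rendering quality.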

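Details Motivation: Monocular depth estimation models are cheap but suffer from scale ambiguity, multi-view inconsistency, and local geometric inaccuracies, so applying them directly to GS tends to introduce artifacts; accurate depth maps, in turn, require specialized equipment and are hard to obtain. Method: The paper designs a geometric-supervision training framework that integrates scale-ambiguous, noisy depth priors; emphasizes learning from weakly aligned depth variations; and proposes a method for identifying ill-posed geometric regions so that monocular depth regularization is applied selectively and errors do not propagate into well-reconstructed structures. Result: Experiments on multiple datasets show consistent improvements in geometric accuracy, depth-estimation fidelity, and rendering quality across different GS variants and monocular depth backbones. Conclusion: With appropriate modeling and selective regularization, monocular depth priors can effectively strengthen GS reconstruction without costly depth acquisition, providing practical and reliable geometric guidance for sparse or textureless scenes. Abstract: Using accurate depth priors in 3D Gaussian Splatting helps mitigate artifacts caused by sparse training data and textureless surfaces. However, acquiring accurate depth maps requires specialized acquisition systems. Foundation monocular depth estimation models offer a cost-effective alternative, but they suffer from scale ambiguity, multi-view inconsistency, and local geometric inaccuracies, which can degrade rendering performance when applied naively. This paper addresses the challenge of reliably leveraging monocular depth priors for Gaussian Splatting (GS) rendering enhancement. To this end, we introduce a training framework integrating scale-ambiguous and noisy depth priors into geometric supervision. We highlight the importance of learning from weakly aligned depth variations. We introduce a method to isolate ill-posed geometry for selective monocular depth regularization, restricting the propagation of depth inaccuracies into well-reconstructed 3D structures. Extensive experiments across diverse datasets show consistent improvements in geometric accuracy, leading to more faithful depth estimation and higher rendering quality across different GS variants and monocular depth backbones tested.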

[147] MPM: Mutual Pair Merging for Efficient Vision Transformers

Simon Ravé,Pejman Rasti,David Rousseau

Main category: cs.CV

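TL;DR: This paper proposes MPM, a training-free token aggregation method based on mutual nearest-neighbor pairing, to accelerate Vision Transformer inference for semantic segmentation while balancing reconstruction accuracy against real end-to-end latency gains.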

Details Motivation: Existing token reduction methods mostly target classification and report proxy metrics, and in semantic segmentation they are constrained by the need for pixel-aligned reconstruction; in addition, on modern accelerators the cost of computing merge maps can cancel out the speedup. Method: Mutual Pair Merging (MPM) forms mutual nearest-neighbor token pairs in cosine space, averages each pair, and records a merge map so that reconstruction can be done with a gather operation before the decoder. It has no learned parameters and no continuous compression hyperparameter; the speed-accuracy trade-off is controlled only by a discrete insertion schedule. Result: On ADE20K, per-image latency for ViT-Tiny on a Raspberry Pi 5 drops by up to 60%, and throughput on an H100 with FlashAttention-2 rises by up to 20%, with an mIoU drop below 3%; measured end-to-end latency confirms the practical benefit. Conclusion: A simple, reconstruction-aware, training-free token merging strategy can deliver substantial measured speedups for semantic segmentation when computational overhead is explicitly accounted for. Abstract: Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.
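
The merging step described in the abstract (mutual nearest neighbors in cosine space, pair averaging, a merge map, and gather-based reconstruction) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `mutual_pair_merge` and `unmerge` are hypothetical, tokens are plain lists of floats, and the quadratic nearest-neighbor search is chosen for readability rather than speed.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def mutual_pair_merge(tokens):
    """Merge mutual nearest-neighbour pairs; return (merged_tokens, merge_map).

    merge_map[i] is the index in merged_tokens that original token i maps to.
    Assumes at least two tokens (hypothetical helper, for illustration only).
    """
    n = len(tokens)
    # Nearest neighbour of every token by cosine similarity, excluding itself.
    nn = [
        max((j for j in range(n) if j != i),
            key=lambda j: cosine(tokens[i], tokens[j]))
        for i in range(n)
    ]
    merged, merge_map = [], [None] * n
    for i in range(n):
        if merge_map[i] is not None:
            continue  # already absorbed into an earlier pair
        j = nn[i]
        if nn[j] == i:
            # i and j pick each other: average the pair into one token.
            merged.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[j])])
            merge_map[i] = merge_map[j] = len(merged) - 1
        else:
            # No mutual partner: keep the token unchanged.
            merged.append(list(tokens[i]))
            merge_map[i] = len(merged) - 1
    return merged, merge_map

def unmerge(merged, merge_map):
    """Gather-based reconstruction back to the original sequence length."""
    return [merged[k] for k in merge_map]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
merged, merge_map = mutual_pair_merge(tokens)
restored = unmerge(merged, merge_map)
```

In this toy input the first two tokens choose each other and are averaged, while the third token's nearest neighbor does not reciprocate, so it passes through unchanged. The reconstruction restores the original length, which is what lets an unchanged segmentation head consume the output.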

[148] GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance

Weiqi Zhang,Junsheng Zhou,Haotian Geng,Kanle Shi,Shenkun Xu,Yi Fang,Yu-Shen Liu

Main category: cs.CV

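TL;DR: This paper proposes GaussianGrow, which learns to "grow" 3D Gaussians from easily accessible 3D point clouds. It combines supervision from a text-guided multi-view diffusion model with iterative camera-pose detection and view inpainting, improving the quality and completeness of 3D Gaussian Splatting generation when reliable geometric priors are otherwise lacking.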

Details Motivation: Existing 3D Gaussian Splatting generation methods lack reliable geometric priors, and relying on estimated point maps tends to produce poor results; a way of generating Gaussians that naturally incorporates accurate geometry is needed. Method: GaussianGrow (1) uses the raw point cloud as a seed, with a text-guided multi-view diffusion model supplying consistent appearance supervision; (2) constrains novel views generated at non-preset camera poses in regions where views overlap, to reduce fusion artifacts; and (3) iteratively detects the largest un-grown region, predicts the best camera pose for it, and inpaints the corresponding rendered view with a 2D diffusion model. Result: On text-guided Gaussian generation from synthetic and real-scanned point clouds, the method clearly outperforms existing approaches, producing 3D Gaussians that are more complete, geometrically more accurate, and of higher rendering quality. Conclusion: Through a "geometry-driven growing" paradigm, GaussianGrow bridges the gap between the geometric reliability of point clouds and the flexibility of the Gaussian representation, offering a new approach to high-quality neural rendering without depth or image supervision. Abstract: 3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians still remains a challenge without proper geometric priors. Existing methods have explored predicting point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain novel views generated at non-preset camera poses identified in overlapping regions across different views. For completing the hard-to-observe regions, we propose to iteratively detect the camera pose by observing the largest un-grown regions in point clouds and inpainting them by inpainting the rendered view with a pretrained 2D diffusion model. The process continues until complete Gaussians are generated. We extensively evaluate GaussianGrow on text-guided Gaussian generation from synthetic and even real-scanned point clouds. Project Page: https://weiqi-zhang.github.io/GaussianGrow

[149] Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

Yusung Ro,Jaehyun Choi,Junmo Kim

Main category: cs.CV

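TL;DR: This paper proposes information scope as a new dimension for interpreting sparse autoencoder (SAE) features, describing how broadly a feature aggregates visual evidence. It designs the Contextual Dependency Score (CDS) to quantify local-scope versus global-scope features and shows that the two affect CLIP's predictions and confidence differently.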

Details Motivation: Existing interpretability work on sparse autoencoder (SAE) features in CLIP vision encoders focuses mainly on semantic meaning and overlooks a key dimension: how broadly a feature aggregates visual evidence across space. Method: The paper introduces the concept of information scope, defined through the spatial stability of a feature's response; designs the Contextual Dependency Score (CDS) to distinguish positionally stable (local-scope) from positionally variant (global-scope) SAE features; and systematically analyzes how features of different scopes affect CLIP's predictions and confidence. Result: SAE features differ markedly in information scope; CDS effectively separates local-scope from global-scope features; and features of different scopes have systematic, distinguishable effects on CLIP's predictions and confidence. Conclusion: Information scope is a key new axis for understanding CLIP representations, and CDS provides a deeper diagnostic view of SAE features, broadening the dimensions along which model interpretability can be analyzed. Abstract: Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpreting the internal representations of CLIP vision encoders, yet existing analyses largely focus on the semantic meaning of individual features. We introduce information scope as a complementary dimension of interpretability that characterizes how broadly an SAE feature aggregates visual evidence, ranging from localized, patch-specific cues to global, image-level signals. We observe that some SAE features respond consistently across spatial perturbations, while others shift unpredictably with minor input changes, indicating a fundamental distinction in their underlying scope. To quantify this, we propose the Contextual Dependency Score (CDS), which separates positionally stable local scope features from positionally variant global scope features. Our experiments show that features of different information scopes exert systematically different influences on CLIP's predictions and confidence. These findings establish information scope as a critical new axis for understanding CLIP representations and provide a deeper diagnostic view of SAE-derived features.

[150] Single-Stage Signal Attenuation Diffusion Model for Low-Light Image Enhancement and Denoising

Ying Liu,Junchao Zhang,Caiyun Wu

Main category: cs.CV

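TL;DR: This paper proposes SADM, a single-stage diffusion model for low-light image enhancement that builds a signal attenuation mechanism into the diffusion process. It jointly optimizes brightness adjustment and noise suppression, avoiding the inconsistent objectives of conventional two-stage pipelines or auxiliary correction networks.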

Details Motivation: Existing diffusion-based low-light image enhancement methods mostly use a two-stage pipeline or an auxiliary correction network, which severs the intrinsic link between enhancement and denoising and leads to inconsistent optimization objectives and limited performance. Method: The Signal Attenuation Diffusion Model (SADM) introduces a signal attenuation coefficient into the forward noising process to encode the physical prior of low-light degradation, and uses it in reverse denoising to guide brightness recovery and noise suppression at the same time; multi-scale pyramid sampling keeps the design consistent with DDIM. Result: SADM achieves single-stage, end-to-end low-light image enhancement, improving restoration quality and computational efficiency while remaining interpretable. Conclusion: By embedding a physical prior into the diffusion process, SADM unifies brightness adjustment and noise suppression in low-light enhancement and offers a new paradigm for applying diffusion models to image restoration. Abstract: Diffusion models excel at image restoration via probabilistic modeling of forward noise addition and reverse denoising, and their ability to handle complex noise while preserving fine details makes them well-suited for Low-Light Image Enhancement (LLIE). Mainstream diffusion-based LLIE methods either adopt a two-stage pipeline or an auxiliary correction network to refine U-Net outputs, which severs the intrinsic link between enhancement and denoising and leads to suboptimal performance owing to inconsistent optimization objectives. To address these issues, we propose the Signal Attenuation Diffusion Model (SADM), a novel diffusion process that integrates the signal attenuation mechanism into the diffusion pipeline, enabling simultaneous brightness adjustment and noise suppression in a single stage. Specifically, the signal attenuation coefficient simulates the inherent signal attenuation of low-light degradation in the forward noise addition process, encoding the physical priors of low-light degradation to explicitly guide reverse denoising toward the concurrent optimization of brightness recovery and noise suppression, thereby eliminating the need for extra correction modules or staged training relied on by existing methods. We validate that our design maintains consistency with Denoising Diffusion Implicit Models (DDIM) via multi-scale pyramid sampling, balancing interpretability, restoration quality, and computational efficiency.

[151] FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

Mengtian Li,Kunyan Dai,Yi Ding,Ruobing Ni,Ying Zhang,Wenwu Wang,Zhifeng Xie

Main category: cs.CV

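TL;DR: This paper proposes the FoleyDesigner framework, which combines video analysis, spatio-temporally controllable Foley generation, and professional audio mixing. It uses a multi-agent architecture and latent diffusion models to achieve precise spatio-temporal alignment, and builds FilmStereo, the first professional stereo Foley dataset, with support for professional audio standards including Dolby Atmos.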

Details Motivation: Manually producing spatio-temporally aligned Foley audio for film is time-consuming and labor-intensive, and high-quality stereo Foley datasets with spatial annotations are lacking. Method: The FoleyDesigner multi-agent framework trains latent diffusion models on spatio-temporal cues extracted from video frames and adds an LLM-driven hybrid mechanism that emulates film post-production practice; the FilmStereo dataset provides spatial metadata, precise timestamps, and semantic annotations for eight Foley categories. Result: The method clearly outperforms existing baselines in spatio-temporal alignment, supports professional audio standards such as 5.1-channel Dolby Atmos, and offers good industry compatibility and interactive control. Conclusion: FoleyDesigner provides an efficient, professional, and integrable new paradigm for automating film Foley, moving AI audio technology toward practical use in the film industry. Abstract: Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing capabilities. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in film industry. To address the lack of high-quality stereo audio datasets in film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For applications, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with seamless compatibility with professional film production standards. The project page is available at https://gekiii996.github.io/FoleyDesigner/ .

[152] ASSR-Net: Anisotropic Structure-Aware and Spectrally Recalibrated Network for Hyperspectral Image Fusion

Qiya Song,Hongzhi Zhou,Lishan Tan,Renwei Dian,Shutao Li

Main category: cs.CV

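TL;DR: This paper proposes ASSR-Net, a two-stage network that improves the spatial detail and spectral fidelity of hyperspectral image fusion through anisotropic structure-aware spatial enhancement and hierarchical prior-guided spectral calibration.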

Details Motivation: Existing hyperspectral image fusion methods face two key challenges: inadequate reconstruction of anisotropic spatial structures, which blurs details, and spectral distortion during fusion, which degrades spectral representation. Method: ASSR-Net works in two stages. The first stage, anisotropic structure-aware spatial enhancement (ASSE), uses a directional perception fusion module to adaptively extract structural features along multiple orientations. The second stage, hierarchical prior-guided spectral calibration (HPSC), uses the original low-resolution hyperspectral image as a spectral prior to explicitly correct spectral deviations in the fused result. Result: Experiments on multiple benchmark datasets show that ASSR-Net consistently outperforms state-of-the-art methods in spatial detail preservation and spectral consistency. Conclusion: ASSR-Net effectively mitigates the two main problems of hyperspectral image fusion, inadequate spatial structure reconstruction and spectral distortion, and achieves higher-quality HR-HSI reconstruction. Abstract: Hyperspectral image fusion aims to reconstruct high-spatial-resolution hyperspectral images (HR-HSI) by integrating complementary information from multi-source inputs. Despite recent progress, existing methods still face two critical challenges: (1) inadequate reconstruction of anisotropic spatial structures, resulting in blurred details and compromised spatial quality; and (2) spectral distortion during fusion, which hinders fine-grained spectral representation. To address these issues, we propose ASSR-Net: an Anisotropic Structure-Aware and Spectrally Recalibrated Network for Hyperspectral Image Fusion. ASSR-Net adopts a two-stage fusion strategy comprising anisotropic structure-aware spatial enhancement (ASSE) and hierarchical prior-guided spectral calibration (HPSC). In the first stage, a directional perception fusion module adaptively captures structural features along multiple orientations, effectively reconstructing anisotropic spatial patterns. In the second stage, a spectral recalibration module leverages the original low-resolution HSI as a spectral prior to explicitly correct spectral deviations in the fused results, thereby enhancing spectral fidelity. Extensive experiments on various benchmark datasets demonstrate that ASSR-Net consistently outperforms state-of-the-art methods, achieving superior spatial detail preservation and spectral consistency.

[153] Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0

Roni Goldshmidt,Hamish Scott,Lorenzo Niccolini,Hernan Matzner

Main category: cs.CV

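TL;DR: BADAS-2.0 is a second-generation ADAS system for collision anticipation that improves performance and practicality along three directions: construction of a long-tail benchmark, knowledge distillation to edge devices, and stronger explainability.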

Details Motivation: To improve existing collision anticipation systems on rare but safety-critical scenarios, and to address the efficiency and explainability problems that arise when deploying models on edge devices. Method: (1) Use BADAS-1.0 as an active-annotation oracle to build a 10-group long-tail benchmark; (2) perform domain-specific self-supervised pre-training on 2.25M unlabeled driving videos, then distill into lightweight models (Flash and Flash-Lite); (3) add object-centric attention heatmaps and the vision-language model BADAS-Reason for explainable predictions and structured reasoning. Result: The dataset grows from 40k to 178,500 labeled videos (about 2M clips), with the largest gains on the hardest long-tail subgroups; the lightweight models achieve a 7-12x speedup with accuracy close to the original model; real-time edge deployment and real-time explainable output are supported. Conclusion: BADAS-2.0 surpasses its predecessor and existing baselines in accuracy, efficiency, and explainability, moving collision anticipation toward practical, robust, and trustworthy use. Abstract: We present BADAS-2.0, the second generation of our collision anticipation system, building on BADAS-1.0 [7], which showed that fine-tuning V-JEPA2 [1] on large-scale ego-centric dashcam data outperforms both academic baselines and production ADAS systems. BADAS-2.0 advances the state of the art along three axes. (i) Long-tail benchmark and accuracy: We introduce a 10-group long-tail benchmark targeting rare and safety-critical scenarios. To construct it, BADAS-1.0 is used as an active oracle to score millions of unlabeled drives and surface high-risk candidates for annotation. Combined with Nexar's Atlas platform [13] for targeted data collection, this expands the dataset from 40k to 178,500 labeled videos (~2M clips), yielding consistent gains across all subgroups, with the largest improvements on the hardest long-tail cases. (ii) Knowledge distillation to edge: Domain-specific self-supervised pre-training on 2.25M unlabeled driving videos enables distillation into compact models, BADAS-2.0-Flash (86M) and BADAS-2.0-Flash-Lite (22M), achieving 7-12x speedup with near-parity accuracy, enabling real-time edge deployment. (iii) Explainability: BADAS-2.0 produces real-time object-centric attention heatmaps that localize the evidence behind predictions. BADAS-Reason [17] extends this with a vision-language model that consumes the last frame and heatmap to generate driver actions and structured textual reasoning. Inference code and evaluation benchmarks are publicly available.

[154] On the Robustness of Diffusion-Based Image Compression to Bit-Flip Errors

Amit Vaisman,Gal Pomerants,Raz Lapid

Main category: cs.CV

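TL;DR: This paper studies the robustness of modern image compression methods to bit-level corruption. It finds that diffusion-based compressors built on the Reverse Channel Coding (RCC) paradigm are more robust than classical and learned codecs, and proposes an improved Turbo-DDCM variant that increases robustness further.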

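Details Motivation: Modern image compression methods are usually optimized for the rate-distortion-perception trade-off, but their robustness to bit-level corruption has rarely been studied. Method: The paper analyzes the robustness of diffusion-based compressors (in particular those built on the RCC paradigm) under bit flips and proposes a more robust variant of Turbo-DDCM. Result: RCC-based diffusion compressors are substantially more robust to bit-level errors than classical and learned codecs; the improved Turbo-DDCM greatly increases robustness with almost no loss in rate-distortion-perception performance. Conclusion: RCC-based compression can produce more resilient compressed representations and may reduce reliance on error-correcting codes in highly noisy environments. Abstract: Modern image compression methods are typically optimized for the rate--distortion--perception trade-off, whereas their robustness to bit-level corruption is rarely examined. We show that diffusion-based compressors built on the Reverse Channel Coding (RCC) paradigm are substantially more robust to bit flips than classical and learned codecs. We further introduce a more robust variant of Turbo-DDCM that significantly improves robustness while only minimally affecting the rate--distortion--perception trade-off. Our findings suggest that RCC-based compression can yield more resilient compressed representations, potentially reducing reliance on error-correcting codes in highly noisy environments.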
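
To make the notion of bit-level corruption concrete, the following is a minimal sketch of how bit-flip errors might be injected into a compressed bitstream before decoding. The abstract does not describe the authors' corruption protocol, so the independent per-bit flip model, the function names `flip_bits` and `count_bit_errors`, and the `ber` (bit error rate) parameter are all assumptions made for illustration.

```python
import random

def flip_bits(data: bytes, ber: float, seed: int = 0) -> bytes:
    """Flip each bit of `data` independently with probability `ber`.

    Hypothetical helper: models a memoryless binary symmetric channel.
    """
    rng = random.Random(seed)
    out = bytearray(data)
    for byte_idx in range(len(out)):
        for bit in range(8):
            if rng.random() < ber:
                out[byte_idx] ^= 1 << bit  # toggle this bit
    return bytes(out)

def count_bit_errors(a: bytes, b: bytes) -> int:
    """Number of bit positions at which two equal-length streams differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

payload = bytes(range(32))             # stand-in for a compressed bitstream
clean = flip_bits(payload, ber=0.0)    # no corruption
noisy = flip_bits(payload, ber=0.01)   # roughly 1% of bits flipped
inverted = flip_bits(payload, ber=1.0) # every bit flipped
```

A robustness study along these lines would decode the corrupted stream with each codec and compare reconstruction quality across bit error rates; the decoding side depends on the codec and is not sketched here.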

[155] WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

Yingjian Zhu,Xinming Wang,Kun Ding,Ying Wang,Bin Fan,Shiming Xiang

Main category: cs.CV

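TL;DR: This paper proposes WikiSeeker, a new multi-modal retrieval-augmented generation (RAG) framework. By introducing a multi-modal retriever and recasting the vision-language model (VLM) as two specialized agents, a Refiner and an Inspector, it substantially improves knowledge-based visual question answering (KB-VQA).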

Details Motivation: Existing KB-VQA methods mostly use the image as the sole retrieval key and do not fully exploit the potential of vision-language models (VLMs), so retrieval and generation are poorly coordinated. Method: The WikiSeeker framework (1) designs a multi-modal retriever and (2) splits the VLM's role into a Refiner, which rewrites the textual query according to the image to improve retrieval, and an Inspector, which decides based on retrieval reliability whether to pass the retrieved context to an external LLM or to answer from the VLM's own knowledge. Result: WikiSeeker reaches state-of-the-art results on EVQA, InfoSeek, and M2KR, with clear gains in retrieval accuracy and answer quality. Conclusion: By redefining the role of the VLM and decoupling retrieval from generation, WikiSeeker unlocks the potential of multi-modal RAG for KB-VQA and provides a new paradigm for follow-up research. Abstract: Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM's internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on https://github.com/zhuyjan/WikiSeeker.

[156] SVC 2026: the Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge

Dongliang Zhu,Zhiyi Niu,Bo Zhao,Jiajian Huang,Shuo Ye,Xun Lin,Hui Ma,Taorui Wang,Jiayu Zhang,Chunmei Zhu,Junzhe Cao,Yingjie Ma,Rencheng Song,Albert Clapés,Sergio Escalera,Dan Guo,Zitong Yu

Main category: cs.CV

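TL;DR: This paper introduces the Subtle Visual Challenge (SVC), which aims to advance robust representation learning for subtle visual signals and comprises two tasks: cross-domain multimodal deception detection and remote photoplethysmography (rPPG) estimation.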

Details Motivation: Subtle visual signals are hard to perceive with the naked eye yet carry key information and are widely used in biometrics, multimedia forensics, medical diagnosis, and other fields; existing methods still struggle with robustness, representation ability, and generalization. Method: The organizers run the SVC challenge with two benchmark tasks (cross-domain multimodal deception detection and rPPG estimation) and provide a unified evaluation platform and baseline models. Result: A total of 22 teams submitted final results, and the baseline models have been open-sourced on the MMDD2026 platform. Conclusion: The challenge provides a standardized evaluation framework for understanding subtle visual signals and is expected to encourage robust, general-purpose vision and multimodal learning models. Abstract: Subtle visual signals, although difficult to perceive with the naked eye, contain important information that can reveal hidden patterns in visual data. These signals play a key role in many applications, including biometric security, multimedia forensics, medical diagnosis, industrial inspection, and affective computing. With the rapid development of computer vision and representation learning techniques, detecting and interpreting such subtle signals has become an emerging research direction. However, existing studies often focus on specific tasks or modalities, and models still face challenges in robustness, representation ability, and generalization when handling subtle and weak signals in real-world environments. To promote research in this area, we organize the Subtle Visual Challenge (SVC), which aims to learn robust representations for subtle visual signals. The challenge includes two tasks: cross-domain multimodal deception detection and remote photoplethysmography (rPPG) estimation. We hope that this challenge will encourage the development of more robust and generalizable models for subtle visual understanding, and further advance research in computer vision and multimodal learning. A total of 22 teams submitted their final results to this workshop competition, and the corresponding baseline models have been released on the MMDD2026 platform (https://sites.google.com/view/svc-cvpr26).

[157] Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family

Oscar Chew,Hsiao-Ying Huang,Kunal Jain,Tai-I Chen,Khoa D Doan,Kuan-Hao Huang

Main category: cs.CV

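TL;DR: This paper reveals a new center bias in CLIP-style contrastive vision-language models: the models focus excessively on the central region of an image and overlook important objects near the boundaries. Representation and attention analyses trace the problem to information loss during the aggregation of visual embeddings, in particular the reliance on pooling, and training-free visual prompting and attention redistribution are shown to mitigate the bias.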

Details Motivation: Contrastive vision-language models such as CLIP are widely used but lack fine-grained visual understanding; the authors identify an under-recognized "center bias" failure mode in which important objects near image boundaries are ignored, hurting downstream tasks. Method: Interpretability methods (embedding decomposition and attention map analysis) are used to analyze the causes of center bias from both representation and attention perspectives; training-free visual prompting and attention redistribution strategies are then designed to mitigate it. Result: The analysis confirms that center bias arises in the visual embedding aggregation stage (especially pooling), where concepts tied to off-center objects vanish from the final embedding; the proposed training-free strategies effectively improve the model's attention to and recognition of off-center objects. Conclusion: Center bias is a fundamental flaw of the CLIP family, rooted in the aggregation mechanism of the visual encoder; it can be mitigated without retraining by intervening on the attention distribution, which suggests a new route to better spatial awareness in multimodal models. Abstract: Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental as failure to recognize relevant objects makes it difficult to perform any sophisticated tasks that depend on those objects. To understand the underlying causes of the limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts especially those associated with off-center objects vanish from the model's embedding in the final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution by redirecting models' attention to off-center regions.

[158] Improving Controllable Generation: Faster Training and Better Performance via $x_0$-Supervision

Amadou S. Sangare,Adrien Maglo,Mohamed Chaouch,Bertrand Luvison

Main category: cs.CV

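TL;DR: This paper proposes $x_0$-supervision, an improved training objective for controllable text-to-image generation models. By directly supervising the clean target image during denoising, it substantially speeds up convergence and improves generation quality and conditioning accuracy.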

Details Motivation: Existing text-to-image diffusion models are limited in precise layout control, and existing controllable generation methods train with the same loss as the original model, which leads to slow convergence. Method: The paper re-analyzes the denoising dynamics of controllable diffusion models and proposes $x_0$-supervision (direct supervision on the clean image $x_0$), or an equivalent re-weighting of the diffusion loss, as the new training objective. Result: Across multiple control settings, the new formulation accelerates convergence by up to 2x (measured by mAUCC) while also improving visual quality and conditioning accuracy. Conclusion: $x_0$-supervision is a more efficient and more effective training paradigm for controllable diffusion models, giving controllable generation a better-founded and more practical optimization direction. Abstract: Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed $x_0$-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2$\times$ according to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy. Our code is available at https://github.com/CEA-LIST/x0-supervision
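
The remark that $x_0$-supervision is equivalent to a re-weighting of the diffusion loss can be illustrated numerically in one common setting. The sketch below assumes a Gaussian forward process $x_t = \alpha_t x_0 + \sigma_t \epsilon$ and a noise-prediction network; under those assumptions the squared error on $x_0$ equals the squared error on $\epsilon$ scaled by $\sigma_t^2 / \alpha_t^2$. This is a standard identity rather than the paper's exact formulation, and the names `x0_from_eps` and `mse` are hypothetical.

```python
import random

def x0_from_eps(x_t, eps_hat, alpha_t, sigma_t):
    """Clean-image estimate implied by a noise prediction (assumed parameterization)."""
    return [(xt - sigma_t * e) / alpha_t for xt, e in zip(x_t, eps_hat)]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
alpha_t, sigma_t = 0.6, 0.8  # illustrative values with alpha^2 + sigma^2 = 1

x0 = [rng.gauss(0.0, 1.0) for _ in range(16)]           # clean target
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]          # true noise
x_t = [alpha_t * a + sigma_t * e for a, e in zip(x0, eps)]

# Stand-in for a network output: the true noise plus a small prediction error.
eps_hat = [e + 0.1 * rng.gauss(0.0, 1.0) for e in eps]

x0_loss = mse(x0_from_eps(x_t, eps_hat, alpha_t, sigma_t), x0)
eps_loss = mse(eps_hat, eps)
weight = (sigma_t / alpha_t) ** 2  # re-weighting factor on the noise loss
```

Here `x0_loss` and `weight * eps_loss` agree up to floating-point error. Because the weight grows as $\alpha_t$ shrinks, supervising $x_0$ places relatively more emphasis on noisier timesteps than the plain noise loss does, under the assumed parameterization.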

[159] MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Yuchi Wang,Haiyang Yu,Weikang Bian,Jiefeng Long,Xiao Liang,Chao Feng,Hongsheng Li

Main category: cs.CV

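TL;DR: This paper proposes the MMEmb-R1 framework, which models reasoning as a latent variable and combines counterfactual intervention with reinforcement learning to decide adaptively whether to invoke chain-of-thought reasoning, improving performance on multimodal embedding tasks while reducing computational overhead and latency.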

Details Motivation: Existing MLLMs do not make full use of their generative reasoning ability in multimodal embedding tasks, and directly adding chain-of-thought reasoning runs into two problems: structural misalignment, and the fact that reasoning is not universally beneficial. Method: MMEmb-R1 models reasoning as a latent variable, designs a pair-aware reasoning selection mechanism that uses counterfactual intervention to identify reasoning paths that help query-target alignment, and uses reinforcement learning to invoke reasoning only when needed. Result: On the MMEB-V2 benchmark the model scores 71.2 with 4B parameters, a new state of the art, while significantly reducing reasoning overhead and inference latency. Conclusion: An adaptive reasoning mechanism can balance performance and efficiency in multimodal embedding tasks and avoids the redundancy and semantic interference caused by indiscriminate reasoning. Abstract: MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format of reasoning. Second, reasoning is not universally beneficial for embedding tasks. Enforcing reasoning for all inputs may introduce unnecessary computation and latency, and can even obscure salient semantic signals for simple cases. To address these issues, we propose MMEmb-R1, an adaptive reasoning-based multimodal embedding framework. We formulate reasoning as a latent variable and introduce pair-aware reasoning selection that employs counterfactual intervention to identify reasoning paths beneficial for query-target alignment. Furthermore, we adopt reinforcement learning to selectively invoke reasoning only when necessary. Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.

[160] PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization

Shicai Wei,Chunbo Luo,Qiang Zhu,Yang Luo

Main category: cs.CV

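TL;DR: This paper proposes the Performance-Dominant Modality Prioritization (PDMP) strategy, which identifies the best-performing unimodal branch and prioritizes its optimization to improve multimodal learning, addressing the under-optimization that conventional methods attribute to imbalanced learning between modalities.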

Details Motivation: Existing methods regard imbalanced learning between modalities as the main cause of poor multimodal performance and use gradient modulation to balance it; this paper argues instead that imbalanced learning driven by the performance-dominant modality benefits multimodal performance, and that the real problem is insufficient learning of that dominant modality. Method: The Performance-Dominant Modality Prioritization (PDMP) strategy first identifies the performance-dominant modality from the performance ranking of independently trained unimodal models, then introduces asymmetric gradient coefficients so that this modality dominates the optimization. The method depends only on the unimodal performance ranking and is independent of the multimodal model's structure and fusion method. Result: Extensive experiments on multiple datasets validate the effectiveness and superiority of PDMP, which clearly improves multimodal model performance. Conclusion: Imbalanced learning is not a defect but a property that can be exploited; prioritizing the performance-dominant modality during optimization is a more reasonable and practical paradigm for multimodal learning. Abstract: Multimodal learning has attracted increasing attention due to its practicality. However, it often suffers from insufficient optimization, where the multimodal model underperforms even compared to its unimodal counterparts. Existing methods attribute this problem to the imbalanced learning between modalities and solve it by gradient modulation. This paper argues that balanced learning is not the optimal setting for multimodal learning. On the contrary, imbalanced learning driven by the performance-dominant modality that has superior unimodal performance can contribute to better multimodal performance. And the under-optimization problem is caused by insufficient learning of the performance-dominant modality. To this end, we propose the Performance-Dominant Modality Prioritization (PDMP) strategy to assist multimodal learning. Specifically, PDMP firstly mines the performance-dominant modality via the performance ranking of the independently trained unimodal model. Then PDMP introduces asymmetric coefficients to modulate the gradients of each modality, enabling the performance-dominant modality to dominate the optimization. Since PDMP only relies on the unimodal performance ranking, it is independent of the structures and fusion methods of the multimodal model and has great potential for practical scenarios. Finally, extensive experiments on various datasets validate the superiority of PDMP.
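
The two steps of PDMP (ranking unimodal performance, then applying asymmetric gradient coefficients) can be sketched roughly as follows. The abstract does not give the coefficient values or how they are scheduled, so the `high` and `low` coefficients, the function names, and the example accuracies are hypothetical placeholders rather than the authors' settings.

```python
def dominant_modality(unimodal_scores):
    """Return the modality whose independently trained model scores highest."""
    return max(unimodal_scores, key=unimodal_scores.get)

def modulate_gradients(grads, dominant, high=1.0, low=0.5):
    """Scale per-modality gradients asymmetrically.

    The dominant modality keeps coefficient `high`; every other modality is
    damped by `low`. Both values are illustrative assumptions.
    """
    return {
        modality: [g * (high if modality == dominant else low) for g in values]
        for modality, values in grads.items()
    }

# Hypothetical validation accuracies of separately trained unimodal models.
unimodal_scores = {"audio": 0.62, "video": 0.48}
dominant = dominant_modality(unimodal_scores)

# Hypothetical gradients of each modality's encoder parameters for one step.
grads = {"audio": [2.0, -4.0], "video": [2.0, -4.0]}
scaled = modulate_gradients(grads, dominant)
```

Because the rule depends only on the unimodal ranking, the same scaling could be applied to the encoder gradients of any fusion architecture, which matches the architecture-independence claimed in the abstract.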

[161] Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion

Yu Xue,Longjun Gao,Yuanqi Su,HaoAng Lu,Xiaoning Zhang

Main category: cs.CV

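TL;DR: This paper proposes VoxSAMNet, a unified framework for monocular semantic scene completion (SSC) that explicitly models voxel sparsity and semantic imbalance to improve the accuracy and generalization of 3D scene reconstruction.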

Details Motivation: Monocular SSC faces an extremely imbalanced voxel distribution (over 93% of voxels are empty) and long-tailed foreground classes, so existing methods spend redundant attention on empty voxels and generalize poorly to rare classes. Method: VoxSAMNet introduces (1) a DSFR module that bypasses empty voxels via a shared dummy node and refines occupied voxels with deformable attention, and (2) a Foreground Modulation Strategy that combines Foreground Dropout (FD) with a Text-Guided Image Filter (TGIF) to reduce overfitting and strengthen class-relevant features. Result: The model reaches state-of-the-art results on SemanticKITTI and SSCBench-KITTI-360, with mIoU of 18.2% and 20.2% respectively, surpassing monocular and stereo baselines. Conclusion: Sparsity-aware and semantics-guided design is essential for efficient and accurate 3D scene completion and points to a direction for future research. Abstract: Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions, where over 93% of voxels are empty and foreground classes are rare, poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; and (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.

[162] RHVI-FDD: A Hierarchical Decoupling Framework for Low-Light Image Enhancement

Junhao Yang,Bo Yang,Hongwei Ge,Yanchun Liang,Heow Pueh Lee,Chunguo Wu

Main category: cs.CV

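TL;DR: This paper proposes RHVI-FDD, a hierarchical decoupling framework that decouples at macro and micro levels (the RHVI transform and three-branch frequency-domain processing) to address luminance-chrominance coupling and the entanglement of noise and detail within chrominance in low-light images, jointly improving denoising, color correction, and detail preservation.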

Details Motivation: Low-light images suffer from severe noise, detail loss, and color distortion; luminance and chrominance are coupled, and within the chrominance noise and detail are deeply entangled, so existing methods struggle to handle all three problems at once. Method: The hierarchical decoupling framework RHVI-FDD uses the RHVI transform at the macro level for robust luminance-chrominance decoupling; at the micro level, a Frequency-Domain Decoupling (FDD) module uses the Discrete Cosine Transform to decompose chrominance features into low-, mid-, and high-frequency components, each processed by a dedicated expert network and then combined by an adaptive gating module for content-aware fusion. Result: Experiments on multiple low-light datasets show that the method consistently outperforms existing state-of-the-art approaches in both objective metrics and subjective visual quality. Conclusion: A hierarchical decoupling strategy, in particular frequency-domain feature separation with content-aware fusion, copes effectively with the complexity of low-light degradation and offers a new paradigm for improving color fidelity, denoising, and detail preservation together. Abstract: Low-light images often suffer from severe noise, detail loss, and color distortion, which hinder downstream multimedia analysis and retrieval tasks. The degradation in low-light images is complex: luminance and chrominance are coupled, while within the chrominance, noise and details are deeply entangled, preventing existing methods from simultaneously correcting color distortion, suppressing noise, and preserving fine details. To tackle the above challenges, we propose a novel hierarchical decoupling framework (RHVI-FDD). At the macro level, we introduce the RHVI transform, which mitigates the estimation bias caused by input noise and enables robust luminance-chrominance decoupling. At the micro level, we design a Frequency-Domain Decoupling (FDD) module with three branches for further feature separation. Using the Discrete Cosine Transform, we decompose chrominance features into low, mid, and high-frequency bands that predominantly represent global tone, local details, and noise components, which are then processed by tailored expert networks in a divide-and-conquer manner and fused via an adaptive gating module for content-aware fusion. Extensive experiments on multiple low-light datasets demonstrate that our method consistently outperforms existing state-of-the-art approaches in both objective metrics and subjective visual quality.
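
The frequency split used by the FDD module can be illustrated with a small DCT example. The sketch below works on a 1-D signal for brevity (the paper operates on chrominance feature maps), uses an orthonormal DCT-II written in plain Python, and picks band cutoffs arbitrarily; the function names and the `low_cut`/`high_cut` parameters are assumptions, and the expert networks and gating module are not modeled.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = 0.0
        for k in range(n):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            s += scale * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out

def split_bands(x, low_cut, high_cut):
    """Split x into low/mid/high-frequency parts by masking DCT coefficients.

    Coefficient indices [0, low_cut) form the low band, [low_cut, high_cut)
    the mid band, and [high_cut, n) the high band (cutoffs are illustrative).
    """
    c = dct(x)
    bands = []
    for lo, hi in ((0, low_cut), (low_cut, high_cut), (high_cut, len(c))):
        masked = [v if lo <= k < hi else 0.0 for k, v in enumerate(c)]
        bands.append(idct(masked))
    return bands

signal = [0.2, 0.4, 0.1, 0.9, 0.5, 0.3, 0.7, 0.6]
low, mid, high = split_bands(signal, low_cut=2, high_cut=5)
recombined = [a + b + c for a, b, c in zip(low, mid, high)]
```

Since the three masks partition the coefficient indices and the transform is linear, the bands sum back to the original signal. That property is what allows each band to be processed separately and then fused without losing information in the split itself.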

[163] Sparse Gain Radio Map Reconstruction With Geometry Priors and Uncertainty-Guided Measurement Selection

Zhihan Zeng,Ning Wei,Muhammad Baqer Mollah,Kaihe Wang,Phee Lep Yeoh,Fei Xu,Yue Xiu,Zhongpei Zhang

Main category: cs.CV

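TL;DR: This paper proposes GeoUQ-GFNet, a sparse gain radio map reconstruction method for complex urban environments that combines geometric priors with uncertainty estimation. Validated on UrbanRT-RM, a controllable ray-tracing benchmark built by the authors, it improves both reconstruction accuracy and active-sensing efficiency.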

Details Motivation: In complex urban environments, sparse measurements, heavy blockage, irregular geometry, and restricted sensing access make dense radio map construction difficult; existing methods under-use explicit geometric priors and overlook the value of predictive uncertainty for subsequent active sensing. Method: Builds the controllable ray-tracing benchmark UrbanRT-RM; proposes the lightweight network GeoUQ-GFNet, which jointly predicts a dense gain map and a spatial uncertainty map; uses the predicted uncertainty to guide active measurement selection under a limited budget. Result: GeoUQ-GFNet is robust and consistent across the scenes and base-station deployments generated with UrbanRT-RM; uncertainty-guided querying yields a larger reconstruction improvement than fixed sampling strategies for the same number of additional measurements. Conclusion: Combining geometry-aware learning, uncertainty modeling, and benchmark-driven evaluation is an effective approach to sparse radio map reconstruction in complex urban environments. Abstract: Radio maps are important for environment-aware wireless communication, network planning, and radio resource optimization. However, dense radio map construction remains challenging when only a limited number of measurements are available, especially in complex urban environments with strong blockages, irregular geometry, and restricted sensing accessibility. Existing methods have explored interpolation, low-rank cartography, deep completion, and channel knowledge map (CKM) construction, but many of these methods insufficiently exploit explicit geometric priors or overlook the value of predictive uncertainty for subsequent sensing. In this paper, we study sparse gain radio map reconstruction from a geometry-aware and active sensing perspective. We first construct UrbanRT-RM, a controllable ray-tracing benchmark with diverse urban layouts, multiple base-station deployments, and multiple sparse sampling modes. We then propose GeoUQ-GFNet, a lightweight network that jointly predicts a dense gain radio map and a spatial uncertainty map from sparse measurements and structured scene priors. The predicted uncertainty is further used to guide active measurement selection under limited sensing budgets. Extensive experiments show that our proposed GeoUQ-GFNet method achieves strong and consistent reconstruction performance across different scenes and transmitter placements generated using UrbanRT-RM. Moreover, uncertainty-guided querying provides more effective reconstruction improvement than non-adaptive sampling under the same additional measurement budget. 
These results demonstrate the effectiveness of combining geometry-aware learning, uncertainty estimation, and benchmark-driven evaluation for sparse radio map reconstruction in complex urban environments.
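As a rough illustration of uncertainty-guided measurement selection, the sketch below picks the unmeasured grid cells with the highest predicted uncertainty, up to a budget. The abstract does not specify the selection rule; the greedy top-k rule, the function name `select_queries`, and the toy uncertainty map are all assumptions.

```python
def select_queries(uncertainty, measured, budget):
    """Return the `budget` unmeasured cells with the highest predicted uncertainty."""
    candidates = [
        (u, (r, c))
        for r, row in enumerate(uncertainty)
        for c, u in enumerate(row)
        if (r, c) not in measured
    ]
    # Most uncertain cells first.
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [cell for _, cell in candidates[:budget]]

# Toy 3x3 uncertainty map (hypothetical values, e.g. a predicted standard deviation).
uncertainty_map = [
    [0.1, 0.7, 0.2],
    [0.9, 0.3, 0.8],
    [0.4, 0.6, 0.5],
]
already_measured = {(1, 0)}
picks = select_queries(uncertainty_map, already_measured, budget=2)
```

Here `picks` is `[(1, 2), (0, 1)]`: the most uncertain cell `(1, 0)` is skipped because it was already measured. A practical variant might also enforce spatial spread between picks, which this sketch does not do.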

[164] EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion

Da Li,Dominik Engel,Deng Luo,Ivan Viola

Main category: cs.CV

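TL;DR: This paper proposes EfficientMonoHair, a framework that combines an implicit neural network with multi-view geometric fusion for efficient, accurate strand-level hair reconstruction from monocular video.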

Details Motivation: Existing strand-level hair reconstruction methods face a marked trade-off between accuracy and efficiency: implicit neural representations struggle to preserve fine-grained strand detail, while explicit optimization methods are computationally heavy and scale poorly. Method: Proposes EfficientMonoHair: 1) a fusion-patch-based multi-view optimization that reduces the number of iterations needed to optimize point-cloud directions; 2) a parallel hair-growing strategy that relaxes voxel-occupancy constraints, improving stability and robustness under noisy orientation fields. Result: Achieves high-fidelity, robust strand geometry reconstruction on real hairstyles; on synthetic benchmarks, reconstruction quality is comparable to SOTA while runtime efficiency improves by nearly an order of magnitude. Conclusion: EfficientMonoHair improves the efficiency and scalability of strand reconstruction from monocular video while maintaining high accuracy, offering a practical option for virtual human modeling and hairstyle digitization. Abstract: Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, allowing large-scale strand tracing to remain stable and robust even under inaccurate or noisy orientation fields. Extensive experiments on representative real-world hairstyles demonstrate that our method can robustly and accurately reconstruct high-fidelity strand geometries. On synthetic benchmarks, our method achieves reconstruction quality comparable to state-of-the-art methods, while improving runtime efficiency by nearly an order of magnitude.

[165] Learn to Rank: Visual Attribution by Learning Importance Ranking

David Schinagl,Christian Fruhwirth-Reisinger,Alexander Prutsch,Samuel Schulter,Horst Possegger

Main category: cs.CV

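TL;DR: This paper proposes a learning-based interpretability method that directly optimizes deletion and insertion metrics, trained end-to-end via a Gumbel-Sinkhorn relaxation, to produce efficient, pixel-level, boundary-aligned visual attribution maps, with particular benefit for vision transformers.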

Details Motivation: Existing attribution methods for vision models face a three-way trade-off among efficiency, causal grounding, and precision: propagation-based methods are efficient but biased and architecture-dependent; perturbation-based methods are causally grounded but expensive and coarse on ViTs; learning-based methods are fast but mostly rely on surrogate objectives or heuristic teachers. Method: Proposes a learning framework that directly optimizes deletion and insertion metrics; casts the non-differentiable sorting and ranking operations as a permutation learning problem and relaxes them differentiably with Gumbel-Sinkhorn; supports end-to-end training and, at inference, outputs dense pixel-level attribution maps in a single forward pass with optional few-step gradient refinement. Result: Consistent quantitative gains on multiple benchmarks; sharper, boundary-aligned attribution maps; especially improved explanation quality for vision transformers; efficient inference (one forward pass plus optional lightweight refinement). Conclusion: The method narrows the gap among efficiency, causal grounding, and fine granularity, providing a principled and practical explainability approach for complex vision models, especially ViTs. Abstract: Interpreting the decisions of complex computer vision models is crucial to establish trust and accountability, especially in safety-critical domains. An established approach to interpretability is generating visual attribution maps that highlight regions of the input most relevant to the model's prediction. However, existing methods face a three-way trade-off. Propagation-based approaches are efficient, but they can be biased and architecture-specific. Meanwhile, perturbation-based methods are causally grounded, yet they are expensive and for vision transformers often yield coarse, patch-level explanations. Learning-based explainers are fast but usually optimize surrogate objectives or distill from heuristic teachers. We propose a learning scheme that instead optimizes deletion and insertion metrics directly. Since these metrics depend on non-differentiable sorting and ranking, we frame them as permutation learning and replace the hard sorting with a differentiable relaxation using Gumbel-Sinkhorn. This enables end-to-end training through attribution-guided perturbations of the target model. During inference, our method produces dense, pixel-level attributions in a single forward pass with optional, few-step gradient refinement. Our experiments demonstrate consistent quantitative improvements and sharper, boundary-aligned explanations, particularly for transformer-based vision models.
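To make the Gumbel-Sinkhorn relaxation concrete, here is a minimal pure-Python sketch of the generic operator: add optional Gumbel noise to a score matrix, divide by a temperature, and alternate row and column normalization in log space until the result is approximately doubly stochastic (a soft permutation). How the paper builds its score matrix from attributions is not stated in the abstract, so the 3x3 `scores` matrix, the default `tau`, and the function names are assumptions.

```python
import math
import random

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def gumbel_sinkhorn(scores, tau=0.5, n_iters=100, noise=0.0, seed=0):
    """Relax a score matrix into an approximately doubly stochastic matrix."""
    rng = random.Random(seed)
    n = len(scores)
    log_p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            u = rng.uniform(1e-9, 1.0 - 1e-9)
            gumbel = -math.log(-math.log(u))  # standard Gumbel sample
            log_p[i][j] = (scores[i][j] + noise * gumbel) / tau
    for _ in range(n_iters):
        # Row normalization in log space.
        for i in range(n):
            z = logsumexp(log_p[i])
            log_p[i] = [v - z for v in log_p[i]]
        # Column normalization in log space.
        for j in range(n):
            z = logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_p]

# Hypothetical scores: entry (i, j) says how well item i fits rank position j.
scores = [
    [2.0, 0.5, 0.1],
    [0.3, 1.8, 0.4],
    [0.2, 0.6, 1.5],
]
soft_perm = gumbel_sinkhorn(scores)  # noise=0.0 gives a deterministic result
```

Lowering `tau` pushes the output toward a hard permutation matrix, while setting `noise=1.0` samples a stochastic relaxation; both behaviours are properties of the generic operator, not claims about the paper's training setup.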

[166] Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

Zonghao Ying,Haowen Dai,Lianyu Hu,Zonglei Jing,Quanchen Zou,Yaodong Yang,Aishan Liu,Xianglong Liu

Main category: cs.CV

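TL;DR: This paper introduces a new attack on modern text-to-image (T2I) models, the "inscriptive jailbreak", which manipulates a model into generating images that look visually benign but contain malicious text (e.g., forged documents). It presents Etch, a black-box attack framework that decomposes the prompt into three layers (semantic camouflage, visual-spatial anchoring, and typographic encoding) and refines them iteratively using feedback from a multimodal model; experiments show that it significantly outperforms existing methods across multiple models and benchmarks.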

Details Motivation: Modern T2I models can render legible long-form text, which creates a new misuse risk; existing jailbreak techniques target coarse visual content and struggle to bypass multi-stage safety filters while keeping character-level fidelity, so text-embedding attacks need dedicated study. Method: Proposes Etch, a black-box attack framework that decomposes the adversarial prompt into three orthogonal layers (semantic camouflage, visual-spatial anchoring, and typographic encoding) and refines them through a zero-order optimization loop; in each round a vision-language model critiques the generated image, localizes failures to specific layers, and guides the revision. Result: Across 7 T2I models and 2 benchmarks, Etch reaches an average attack success rate of 65.57%, peaking at 91.00%, significantly exceeding baseline methods. Conclusion: Reveals a serious blind spot in current T2I safety alignment, which neglects typography and text rendering, and stresses the need for typography-aware multimodal defenses. Abstract: Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on the 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware multimodal defense mechanisms.

[167] Neural Network Pruning via QUBO Optimization

Osama Orabi,Artur Zagitov,Hadi Salloum,Viktor A. Lobachev,Kasymkhan Khubiev,Yaroslav Kholodov

Main category: cs.CV

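TL;DR: This paper proposes a unified Hybrid QUBO framework for neural network pruning that combines gradient-aware sensitivity metrics (first-order Taylor and second-order Fisher information) with data-driven activation similarity, and adds a dynamic capacity-driven search and a Tensor-Train (TT) refinement stage to improve performance.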

Details Motivation: Most existing pruning methods rely on greedy heuristics that ignore complex interactions between filters, while formal optimization methods such as QUBO have underperformed because their objectives are oversimplified (e.g., using only the L1-norm). Method: Proposes a Hybrid QUBO framework: the linear term integrates gradient-aware sensitivity metrics (Taylor, Fisher) and the quadratic term models activation similarity; a dynamic capacity-driven search enforces the target sparsity; a two-stage pipeline follows, whose second stage uses gradient-free TT Refinement to optimize the true evaluation metric directly. Result: On the SIDD image denoising dataset, Hybrid QUBO significantly outperforms greedy Taylor pruning and traditional L1-QUBO; TT Refinement brings consistent further gains at appropriate combinatorial scales. Conclusion: Hybrid combinatorial optimization can deliver robust, scalable, and interpretable neural network compression. Abstract: Neural network pruning can be formulated as a combinatorial optimization problem, yet most existing approaches rely on greedy heuristics that ignore complex interactions between filters. Formal optimization methods such as Quadratic Unconstrained Binary Optimization (QUBO) provide a principled alternative but have so far underperformed due to oversimplified objective formulations based on metrics like the L1-norm. In this work, we propose a unified Hybrid QUBO framework that bridges heuristic importance estimation with global combinatorial optimization. Our formulation integrates gradient-aware sensitivity metrics - specifically first-order Taylor and second-order Fisher information - into the linear term, while utilizing data-driven activation similarity in the quadratic term. This allows the QUBO objective to jointly capture individual filter relevance and inter-filter functional redundancy. We further introduce a dynamic capacity-driven search to strictly enforce target sparsity without distorting the optimization landscape. Finally, we employ a two-stage pipeline featuring a Tensor-Train (TT) Refinement stage - a gradient-free optimizer that fine-tunes the QUBO-derived solution directly against the true evaluation metric. Experiments on the SIDD image denoising dataset demonstrate that the proposed Hybrid QUBO significantly outperforms both greedy Taylor pruning and traditional L1-based QUBO, with TT Refinement providing further consistent gains at appropriate combinatorial scales. This highlights the potential of hybrid combinatorial formulations for robust, scalable, and interpretable neural network compression.
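The structure of such a QUBO objective can be sketched on a toy problem: a linear term rewards keeping important filters, and a quadratic term penalizes keeping pairs of similar filters. The abstract does not give the exact energy, so the sign convention, the `redundancy_weight` parameter, and the toy numbers are assumptions; exhaustive search over fixed-size subsets stands in for the paper's capacity-driven search and solver.

```python
from itertools import combinations

def qubo_energy(keep, importance, similarity, redundancy_weight):
    """Energy of keeping the filters in `keep` (lower is better)."""
    linear = -sum(importance[i] for i in keep)
    quadratic = sum(similarity[i][j] for i, j in combinations(sorted(keep), 2))
    return linear + redundancy_weight * quadratic

def best_subset(importance, similarity, n_keep, redundancy_weight=1.0):
    """Brute-force the minimum-energy subset of exactly n_keep filters."""
    n = len(importance)
    return min(
        combinations(range(n), n_keep),
        key=lambda keep: qubo_energy(keep, importance, similarity, redundancy_weight),
    )

# Hypothetical per-filter importance (e.g. a Taylor score) and pairwise activation similarity.
importance = [0.9, 0.7, 0.3, 0.5]
similarity = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.3],
    [0.2, 0.1, 0.3, 0.0],
]
chosen = best_subset(importance, similarity, n_keep=2)
```

A greedy importance ranking would keep filters 0 and 1, but they are highly similar (0.9), so the joint objective prefers `(0, 3)`. That is the kind of inter-filter redundancy the quadratic term is meant to capture.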

[168] Automatic dental superimposition of 3D intraorals and 2D photographs for human identification

Antonio D. Villegas-Yeguas,Xavier Abreau-Freire,Guillermo R-García,Andrea Valsecchi,Teresa Pinho,Daniel Pérez-Mongiovi,Oscar Ibáñez,Oscar Cordón

Main category: cs.CV

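TL;DR: This paper proposes an automatic dental morphology comparison method that pairs 3D post-mortem scans with 2D ante-mortem social-media photographs, using computer vision and optimization to model perspective distortion and enable objective, quantitative dental identification.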

Details Motivation: Traditional dental comparison depends on ante-mortem medical records, which are often missing in cases such as migrant deaths at borders or regions without universal healthcare; photos on social media showing teeth could be used, but existing approaches do not model perspective distortion and lack objective quantitative evaluation. Method: Proposes a 3D (post-mortem scan) to 2D (ante-mortem photo) registration approach with two automatic strategies: i) using paired anatomical landmarks; ii) using a segmentation of the teeth region to estimate camera parameters; both re-render the image through optimization and perform the morphological comparison. Result: Over 20,164 cross comparisons from 142 samples, the two approaches obtain mean ranking values of 1.6 and 1.5 respectively, clearly better than existing automatic dental chart comparison methods, while providing an interpretable quantitative matching score and superimposition visualizations. Conclusion: The method provides an automatic, objective, and quantifiable solution for forensic dental identification when medical records are unavailable, improving the reliability and practicality of morphological comparison. Abstract: Dental comparison is considered a primary identification method, at the level of fingerprints and DNA profiling. One crucial but time-consuming step of this method is the morphological comparison. One of the main challenges to apply this method is the lack of ante-mortem medical records, especially in scenarios such as migrant death at the border and/or in countries where there is no universal healthcare. The availability of photos on social media where teeth are visible has led many odontologists to consider morphological comparison using them. However, state-of-the-art proposals have significant limitations, including the lack of proper modeling of perspective distortion and the absence of objective approaches that quantify morphological differences. Our proposal involves a 3D (post-mortem scan) - 2D (ante-mortem photos) approach. Using computer vision and optimization techniques, we replicate the ante-mortem image with the 3D model to perform the morphological comparison. Two automatic approaches have been developed: i) using paired landmarks and ii) using a segmentation of the teeth region to estimate camera parameters. Both are capable of obtaining very promising results over 20,164 cross comparisons from 142 samples, obtaining mean ranking values of 1.6 and 1.5, respectively. These results clearly outperform filtering capabilities of automatic dental chart comparison approaches, while providing an automatic, objective and quantitative score of the morphological correspondence, easy to interpret and analyze by visualizing superimposed images.
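The "mean ranking value" reported above can be read as the average position of the true match when candidates are sorted by matching score. The sketch below computes that under the assumption that a higher score means a better match and that ties are resolved optimistically; the abstract does not state either convention, and the function names and toy scores are made up for illustration.

```python
def rank_of_match(scores, true_index):
    """1-based rank of the true candidate when sorted by descending score."""
    target = scores[true_index]
    return 1 + sum(1 for s in scores if s > target)

def mean_rank(score_rows, true_indices):
    """Average rank of the true match over all queries."""
    ranks = [rank_of_match(row, t) for row, t in zip(score_rows, true_indices)]
    return sum(ranks) / len(ranks)

# Each row: one ante-mortem photo scored against three candidate 3D scans (toy values).
score_rows = [
    [0.91, 0.40, 0.35],
    [0.55, 0.60, 0.20],
    [0.10, 0.30, 0.80],
]
true_indices = [0, 0, 2]
result = mean_rank(score_rows, true_indices)
```

The true match ranks first, second, and first in the three toy queries, so `result` is 4/3. A mean rank close to 1, like the reported 1.6 and 1.5, means the correct identity is usually at or near the top of the candidate list.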

[169] Physics-Aware Video Instance Removal Benchmark

Zirui Li,Xinghao Chen,Lingyu Jiang,Dengzhe Hou,Fangzhou Lin,Kazunori Yamada,Xiangbo Gao,Zhengzhong Tu

Main category: cs.CV

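TL;DR: This paper introduces PVIR, a physics-aware video instance removal benchmark for evaluating how well methods preserve background integrity and physical consistency after removing a target object from a video, and reports a multi-dimensional human evaluation of four mainstream methods.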

Details Motivation: Existing video instance removal benchmarks focus on visual plausibility and neglect physical causality (such as lingering shadows), so methods fall short in scenes with real physical interactions. Method: Builds the PVIR benchmark of 95 high-quality videos annotated with instance-level masks and removal prompts; splits it into Simple and Hard subsets, the latter focusing on complex physical interactions; applies a decoupled human evaluation protocol that scores models on instruction following, rendering quality, and edit exclusivity. Result: PISCO-Removal and UniVideo achieve the best current performance; DiffuEraser often introduces blurring artifacts; CoCoCo performs markedly worse on instruction following; all methods drop noticeably on the Hard subset. Conclusion: Recovering complex physical side effects remains the core challenge of video instance removal, and PVIR provides a new benchmark and evaluation protocol for physical consistency. Abstract: Video Instance Removal (VIR) requires removing target objects while maintaining background integrity and physical consistency, such as specular reflections and illumination interactions. Despite advancements in text-guided editing, current benchmarks primarily assess visual plausibility, often overlooking the physical causalities, such as lingering shadows, triggered by object removal. We introduce the Physics-Aware Video Instance Removal (PVIR) benchmark, featuring 95 high-quality videos annotated with instance-accurate masks and removal prompts. PVIR is partitioned into Simple and Hard subsets, the latter explicitly targeting complex physical interactions. We evaluate four representative methods, PISCO-Removal, UniVideo, DiffuEraser, and CoCoCo, using a decoupled human evaluation protocol across three dimensions to isolate semantic, visual, and spatial failures: instruction following, rendering quality, and edit exclusivity. Our results show that PISCO-Removal and UniVideo achieve state-of-the-art performance, while DiffuEraser frequently introduces blurring artifacts and CoCoCo struggles significantly with instruction following. The persistent performance drop on the Hard subset highlights the ongoing challenge of recovering complex physical side effects.

[170] AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

Dong She,Xianrong Yao,Liqun Chen,Jinghe Yu,Yang Gao,Zhanpeng Jin

Main category: cs.CV

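TL;DR: This paper introduces the AICA-Bench benchmark and the training-free Grounded Affective Tree (GAT) Prompting method to improve intensity calibration and open-ended description in vision-language models for Affective Image Content Analysis (AICA).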

Details Motivation: Current vision-language models remain weak at holistic Affective Image Content Analysis (AICA), and a unified framework integrating perception, reasoning, and generation is missing. Method: Builds the AICA-Bench benchmark with three tasks, Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG); proposes the training-free Grounded Affective Tree (GAT) Prompting method, which combines visual scaffolding with hierarchical reasoning. Result: Evaluation on 23 VLMs shows that GAT reduces emotion intensity errors and improves the depth of open-ended descriptions. Conclusion: GAT provides a strong baseline for affective multimodal understanding and generation and advances the AICA field. Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.

[171] Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation

Jungwon Park,Jungmin Ko,Dongnam Byun,Wonjong Rhee

Main category: cs.CV

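TL;DR: This paper proposes selectively aggregating cross-attention maps to improve the interpretability and controllability of text-to-image generative models; experiments show higher segmentation performance (mIoU) than DAAM, more accurate capture of concept features, and the ability to diagnose prompt misinterpretations.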

Details Motivation: Cross-attention maps are widely used to improve T2I model performance and interpretability, but the differing characteristics of individual attention heads have not been explored in depth. Method: Aggregates the cross-attention maps of the attention heads most relevant to the target concept, instead of simply averaging the maps of all heads. Result: Compared with DAAM, the method obtains a higher mean IoU on diffusion-based segmentation; the most relevant heads capture concept-specific features more accurately; selective aggregation helps diagnose prompt misinterpretations. Conclusion: Attention head selection is an effective new direction for improving the interpretability and controllability of T2I models. Abstract: Numerous studies on text-to-image (T2I) generative models have utilized cross-attention maps to boost application performance and interpret model behavior. However, the distinct characteristics of attention maps from different attention heads remain relatively underexplored. In this study, we show that selectively aggregating cross-attention maps from heads most relevant to a target concept can improve visual interpretability. Compared to the diffusion-based segmentation method DAAM, our approach achieves higher mean IoU scores. We also find that the most relevant heads capture concept-specific features more accurately than the least relevant ones, and that selective aggregation helps diagnose prompt misinterpretations. These findings suggest that attention head selection offers a promising direction for improving the interpretability and controllability of T2I generation.
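A toy comparison of selective versus uniform head aggregation may help here. The sketch assumes per-head relevance scores are already available (the abstract does not say how they are computed), flattens each attention map to four pixels, and uses a fixed 0.5 threshold to turn the aggregated map into a mask; all names and values are illustrative.

```python
def aggregate_heads(head_maps, relevance, top_k):
    """Average the attention maps of the top_k most relevant heads."""
    order = sorted(range(len(head_maps)), key=lambda h: relevance[h], reverse=True)[:top_k]
    size = len(head_maps[0])
    return [sum(head_maps[h][p] for h in order) / top_k for p in range(size)]

def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks given as flat sequences."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    return inter / union if union else 1.0

# Three heads over a 4-pixel map; the third head attends to the wrong region.
head_maps = [
    [0.9, 0.6, 0.1, 0.0],
    [0.7, 0.7, 0.2, 0.1],
    [0.0, 0.1, 1.0, 0.9],
]
relevance = [0.8, 0.6, 0.1]  # hypothetical per-head relevance to the target concept
true_mask = [1, 1, 0, 0]

selective = [v > 0.5 for v in aggregate_heads(head_maps, relevance, top_k=2)]
uniform = [v > 0.5 for v in aggregate_heads(head_maps, relevance, top_k=3)]
```

In this toy case the selective mask matches the ground truth (IoU 1.0), while averaging all heads lets the off-target head dilute the second pixel below the threshold (IoU 0.5). The numbers are constructed to show the mechanism and say nothing about the paper's measured gains.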

[172] Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction

Yangyi Xiao,Siting Zhu,Baoquan Yang,Tianchen Deng,Yongbo Chen,Hesheng Wang

Main category: cs.CV

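TL;DR: This paper proposes ADM-GS, a framework that uses explicit appearance decomposition (separating material from illumination) to address appearance inconsistency caused by lighting and environmental changes in multi-traversal scene reconstruction, improving PSNR by 0.98 dB on the Argoverse 2 and Waymo datasets.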

Details Motivation: In multi-traversal scene reconstruction, data captured from the same area at different times show significant appearance inconsistency (caused by lighting and environmental changes) even though the geometry is shared, so appearance must be disentangled to improve reconstruction fidelity. Method: Proposes ADM-GS: decomposes the appearance of the static background into a traversal-invariant material component and a traversal-dependent illumination component; designs a frequency-separated neural light field that uses surface normals and reflection vectors to model low-frequency diffuse illumination and high-frequency specular reflections separately. Result: On the Argoverse 2 and Waymo Open datasets, PSNR improves by +0.98 dB over latent-based baselines, with better cross-traversal appearance consistency. Conclusion: Explicitly decomposing material and illumination alleviates appearance entanglement in multi-traversal reconstruction, and ADM-GS provides a more robust scene representation for high-fidelity autonomous driving simulation and digital twins. Abstract: Multi-traversal scene reconstruction is important for high-fidelity autonomous driving simulation and digital twin construction. This task involves integrating multiple sequences captured from the same geographical area at different times. In this context, a primary challenge is the significant appearance inconsistency across traversals caused by varying illumination and environmental conditions, despite the shared underlying geometry. This paper presents ADM-GS (Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction), a framework that applies an explicit appearance decomposition to the static background to alleviate appearance entanglement across traversals. For the static background, we decompose the appearance into traversal-invariant material, representing intrinsic material properties, and traversal-dependent illumination, capturing lighting variations. Specifically, we propose a neural light field that utilizes a frequency-separated hybrid encoding strategy. By incorporating surface normals and explicit reflection vectors, this design separately captures low-frequency diffuse illumination and high-frequency specular reflections. Quantitative evaluations on the Argoverse 2 and Waymo Open datasets demonstrate the effectiveness of ADM-GS. In multi-traversal experiments, our method achieves a +0.98 dB PSNR improvement over existing latent-based baselines while producing more consistent appearance across traversals. Code will be available at https://github.com/IRMVLab/ADM-GS.
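Two quantities mentioned above have standard closed forms that can be sketched directly: the reflection of a direction about a unit surface normal, r = d - 2(d·n)n, and PSNR = 10·log10(MAX²/MSE). The code is a generic illustration, not the ADM-GS implementation; how the paper encodes normals and reflection vectors inside its light field is not described in the abstract.

```python
import math

def reflect(direction, normal):
    """Mirror an incoming direction about a unit surface normal: r = d - 2 (d . n) n."""
    d_dot_n = sum(d * n for d, n in zip(direction, normal))
    return [d - 2.0 * d_dot_n * n for d, n in zip(direction, normal)]

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)

incoming = [1.0, -1.0, 0.0]   # ray travelling down and to the right
up = [0.0, 1.0, 0.0]          # unit normal of a horizontal surface
outgoing = reflect(incoming, up)

quality_db = psnr([0.0, 1.0], [0.1, 0.9])
```

`outgoing` is `[1.0, 1.0, 0.0]` and `quality_db` is about 20 dB. Since PSNR is logarithmic in MSE, a +0.98 dB gain corresponds to an MSE ratio of 10^0.098 ≈ 1.25, i.e. roughly a 20% lower mean squared error.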

[173] Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning

Jingbo Sun,Qichao Zhang,Songjun Tu,Xing Fang,Yupeng Zheng,Haoran Li,Ke Chen,Dongbin Zhao

Main category: cs.CV

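TL;DR: This paper proposes the SRCP framework, which introduces a saliency-guided dynamics task and consistency policy learning to address inaccurate representations and poor skill controllability of successor representations (SR) in visual unsupervised reinforcement learning, substantially improving zero-shot generalization.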

Details Motivation: Existing zero-shot unsupervised RL methods based on successor representations (SR) perform poorly in high-dimensional visual environments, mainly because the representations attend to irrelevant regions and the policies struggle to model multi-modal skill-conditioned action distributions. Method: Proposes Saliency-Guided Representation with Consistency Policy Learning (SRCP): 1) decouples representation learning from successor training via a saliency-guided dynamics task; 2) combines a fast-sampling consistency policy with classifier-free guidance and tailored training objectives to improve skill controllability. Result: On 16 tasks across 4 datasets from the ExORL benchmark, SRCP achieves the best zero-shot generalization in visual URL and is compatible with various SR methods. Conclusion: SRCP overcomes the two core weaknesses of SR in visual URL and provides a practical framework for building generalizable, controllable generalist agents. Abstract: Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual URL: (1) SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures and degraded task generalization; and (2) these flawed representations hinder SR policies from modeling multi-modal skill-conditioned action distributions and ensuring skill controllability. To address these limitations, we propose Saliency-Guided Representation with Consistency Policy Learning (SRCP), a novel framework that improves zero-shot generalization of SR methods in visual URL. SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability. 
Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark demonstrate that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.

[174] SonoSelect: Efficient Ultrasound Perception via Active Probe Exploration

Yixin Zhang,Yunzhong Hou,Longqi Li,Zhenyue Qin,Yang Liu,Yue Yao

Main category: cs.CV

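TL;DR: This paper proposes SonoSelect, a method that actively explores ultrasound scan views, using a 3D spatial memory and a task-specific objective to adaptively guide probe movement, reducing the number of scan views while improving organ classification and lesion detection.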

Details Motivation: Conventional ultrasound scanning needs many views to reduce diagnostic ambiguity, mitigate acoustic occlusion, and improve anatomical coverage, but most views are redundant and add scanning and processing cost. Method: Casts active ultrasound view exploration as a sequential decision-making problem; fuses each 2D frame into a 3D spatial memory; designs an ultrasound-specific objective that balances organ coverage, reconstruction uncertainty, and scanning redundancy. Result: In simulator experiments, strong multi-view organ classification is achieved using only 2 out of N views; on kidney cyst detection it reaches 54.56% kidney coverage and 35.13% cyst coverage, with short trajectories focused on the target cyst. Conclusion: SonoSelect can substantially reduce the number of scan views needed and improve the efficiency of localizing and recognizing key anatomy and lesions, showing potential for clinical use. Abstract: Ultrasound perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity, mitigate acoustic occlusions, and improve anatomical coverage. However, not all probe views are equally informative. Exhaustively acquiring a large number of views can introduce substantial redundancy and increase scanning and processing costs. To address this, we define an active view exploration task for ultrasound and propose SonoSelect, an ultrasound-specific method that adaptively guides probe movement based on current observations. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. Each new 2D ultrasound view is fused into a 3D spatial memory of the observed anatomy, which guides the next probe position. On top of this formulation, we propose an ultrasound-specific objective that favors probe movements with greater organ coverage, lower reconstruction uncertainty, and less redundant scanning. Experiments on the ultrasound simulator show that SonoSelect achieves promising multi-view organ classification accuracy using only 2 out of N views. Furthermore, for a more difficult kidney cyst detection task, it reaches 54.56% kidney coverage and 35.13% cyst coverage, with short trajectories consistently centered on the target cyst.
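The three-way objective described above (more coverage, less uncertainty, less redundancy) can be pictured as a scoring function over candidate probe moves. The abstract does not give the actual formula, so the linear form, the weights, and the candidate names and values below are purely hypothetical; the sketch only shows a greedy one-step choice.

```python
def view_score(view, w_cov=1.0, w_unc=1.0, w_red=1.0):
    """Hypothetical linear trade-off: reward coverage, penalize uncertainty and redundancy."""
    return (
        w_cov * view["coverage"]
        - w_unc * view["uncertainty"]
        - w_red * view["redundancy"]
    )

def pick_next_view(candidates, **weights):
    """Greedy one-step choice of the highest-scoring candidate probe move."""
    return max(candidates, key=lambda v: view_score(v, **weights))

# Made-up candidate moves with predicted effects in [0, 1].
candidates = [
    {"name": "tilt_left", "coverage": 0.30, "uncertainty": 0.20, "redundancy": 0.50},
    {"name": "slide_down", "coverage": 0.45, "uncertainty": 0.15, "redundancy": 0.10},
    {"name": "rotate", "coverage": 0.50, "uncertainty": 0.40, "redundancy": 0.05},
]
best = pick_next_view(candidates)
```

With equal weights, `slide_down` wins: `rotate` offers slightly more coverage but at a much higher predicted uncertainty. In the paper the choice is framed as sequential decision-making over a 3D spatial memory, which a one-step greedy rule like this does not capture.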

[175] Leveraging Image Editing Foundation Models for Data-Efficient CT Metal Artifact Reduction

Ahmet Rasim Emirdagi,Süleyman Aslan,Mısra Yavuz,Görkay Aydemir,Yunus Bilge Kurt,Nasrin Rahimi,Burak Can Biner,M. Akın Yılmaz

Main category: cs.CV

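TL;DR: This paper proposes a new approach to metal artifact reduction based on a vision-language diffusion foundation model, fine-tuned parameter-efficiently with Low-Rank Adaptation (LoRA) and combined with a multi-reference conditioning strategy and domain adaptation; it reaches SOTA performance with only 16 to 128 paired training examples, greatly reducing data dependence and suppressing hallucinations.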

Details Motivation: High-attenuation metal implants cause severe artifacts in CT images that obscure critical anatomy; standard deep learning methods depend on large amounts of paired training data, which clinical low-data settings cannot supply. Method: Reframes artifact removal as an in-context reasoning task; adapts a vision-language diffusion foundation model with parameter-efficient LoRA fine-tuning; applies domain adaptation to prevent hallucinations; designs a multi-reference conditioning strategy that uses clean anatomical exemplars from unrelated subjects to guide reconstruction. Result: Reaches SOTA on perceptual and radiological-feature metrics on the AAPM CT-MAR benchmark with only 16 to 128 paired examples, cutting data requirements by two orders of magnitude; suppresses hallucinations (such as interpreting streak artifacts as waffles or petri dishes). Conclusion: Appropriately adapted foundation models can serve as an interpretable, data-efficient approach to medical image reconstruction and a feasible path for low-data medical AI. Abstract: Metal artifacts from high-attenuation implants severely degrade CT image quality, obscuring critical anatomical structures and posing a challenge for standard deep learning methods that require extensive paired training data. We propose a paradigm shift: reframing artifact reduction as an in-context reasoning task by adapting a general-purpose vision-language diffusion foundation model via parameter-efficient Low-Rank Adaptation (LoRA). By leveraging rich visual priors, our approach achieves effective artifact suppression with only 16 to 128 paired training examples, reducing data requirements by two orders of magnitude. Crucially, we demonstrate that domain adaptation is essential for hallucination mitigation; without it, foundation models interpret streak artifacts as erroneous natural objects (e.g., waffles or petri dishes). To ground the restoration, we propose a multi-reference conditioning strategy where clean anatomical exemplars from unrelated subjects are provided alongside the corrupted input, enabling the model to exploit category-specific context to infer uncorrupted anatomy. Extensive evaluation on the AAPM CT-MAR benchmark demonstrates that our method achieves state-of-the-art performance on perceptual and radiological-feature metrics. This work establishes that foundation models, when appropriately adapted, offer a scalable alternative for interpretable, data-efficient medical image reconstruction. Code is available at https://github.com/ahmetemirdagi/CT-EditMAR.
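For readers unfamiliar with LoRA, the core idea is that a frozen weight matrix W is adapted by a low-rank update, giving an effective weight W + (alpha / r)·B·A in which only the small matrices B and A are trained. The sketch below shows that arithmetic in pure Python on a 3x3 toy weight; the matrices, `alpha`, and rank are illustrative and unrelated to the paper's model or hyperparameters.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_weight(w, lora_b, lora_a, alpha):
    """Effective weight W + (alpha / r) * B @ A, with r the shared inner rank."""
    rank = len(lora_a)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weight
lora_b = [[1.0], [0.0], [2.0]]   # 3 x 1, trainable
lora_a = [[0.5, 0.0, -0.5]]      # 1 x 3, trainable
w_eff = lora_weight(w, lora_b, lora_a, alpha=2.0)
```

With rank 1 the adapter has 6 trainable values against 9 in the full matrix; the saving grows quickly with layer width, which is one reason this kind of fine-tuning suits small paired datasets.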

[176] Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition

Tianyi Liu,Yiming Li,Wenqian Wang,Jiaojiao Wang,Chen Cai,Yi Wang,Kim-Hui Yap

Main category: cs.CV

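TL;DR: This paper proposes the Mixture-of-Modality-Experts (MoME) framework and the Holistic Token Learning (HTL) strategy to handle input-dependent modality reliability and hard-to-capture fine-grained action cues in multimodal visual analytics. Modality experts collaborate adaptively, and class tokens and spatio-temporal tokens promote knowledge transfer, which improves expert specialization and reduces fusion ambiguity; on driver action recognition the method clearly outperforms baselines and offers better interpretability.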

Details Motivation: Existing methods rely on fixed fusion modules or predefined cross-modal interactions and adapt poorly to changing modality reliability and fine-grained action cues. Method: Proposes the MoME framework with the HTL strategy: MoME enables adaptive collaboration among modality experts; HTL uses class tokens and spatio-temporal tokens to strengthen intra-expert refinement and inter-expert knowledge transfer. Result: Experiments on a public benchmark show that MoME and HTL together clearly outperform single-modal and mainstream multimodal baselines; ablation, validation, and visualization results further confirm that HTL improves subtle multimodal understanding and interpretability. Conclusion: MoME and HTL form a knowledge-centric multimodal learning framework that improves expert specialization and reduces fusion ambiguity, suited to multimodal understanding tasks that demand robustness and interpretability. Abstract: Robust multimodal visual analytics remains challenging when heterogeneous modalities provide complementary but input-dependent evidence for decision-making. Existing multimodal learning methods mainly rely on fixed fusion modules or predefined cross-modal interactions, which are often insufficient to adapt to changing modality reliability and to capture fine-grained action cues. To address this issue, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, our method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal fusion. We validate the proposed framework on driver action recognition as a representative multimodal understanding task. The experimental results on the public benchmark show that the proposed MoME framework and the HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results further verify that the proposed HTL strategy improves subtle multimodal understanding and offers better interpretability.

[177] Multi-Modal Landslide Detection from Sentinel-1 SAR and Sentinel-2 Optical Imagery Using Multi-Encoder Vision Transformers and Ensemble Learning

Ioannis Nasios

Main category: cs.CV

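TL;DR: This paper proposes a modular multi-model framework that fuses Sentinel-2 optical imagery with Sentinel-1 SAR data, using multi-encoder vision transformers and ensemble learning (e.g., LightGBM, XGBoost) for accurate landslide detection, reaching an F1 score of 0.919 without pre-event optical data.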

Details Motivation: Landslides are a major geohazard, and accurate, timely detection is needed to support disaster risk reduction; existing methods fall short in the change-detection paradigm, multi-source data fusion, and robustness. Method: Processes optical and SAR data with separate encoders in a multi-encoder vision transformer, adds derived spectral indices (such as NDVI), and ensembles neural networks with gradient boosting models (LightGBM/XGBoost); supports flexible configurations with optical-only, SAR-only, or combined inputs. Result: Reaches a SOTA F1 score of 0.919 on patch-based landslide classification without pre-event Sentinel-2 data, and achieved top performance in a machine learning competition with a good balance between precision and recall. Conclusion: The framework is robust, scalable, and operationally applicable, offers a new approach to multi-source remote-sensing landslide detection, and can transfer to other natural hazard monitoring and environmental change analysis tasks. Abstract: Landslides represent a major geohazard with severe impacts on human life, infrastructure, and ecosystems, underscoring the need for accurate and timely detection approaches to support disaster risk reduction. This study proposes a modular, multi-model framework that fuses Sentinel-2 optical imagery with Sentinel-1 Synthetic Aperture Radar (SAR) data, for robust landslide detection. The methodology leverages multi-encoder vision transformers, where each data modality is processed through separate lightweight pretrained encoders, achieving strong performance in landslide detection. In addition, the integration of multiple models, particularly the combination of neural networks and gradient boosting models (LightGBM and XGBoost), demonstrates the power of ensemble learning to further enhance accuracy and robustness. Derived spectral indices, such as NDVI, are integrated alongside original bands to enhance sensitivity to vegetation and surface changes. The proposed methodology achieves a state-of-the-art F1 score of 0.919 on landslide detection, addressing a patch-based classification task rather than pixel-level segmentation and operating without pre-event Sentinel-2 data, highlighting its effectiveness in a non-classical change detection setting. It also demonstrated top performance in a machine learning competition, achieving a strong balance between precision and recall and highlighting the advantages of explicitly leveraging the complementary strengths of optical and radar data. 
The conducted experiments and research also emphasize scalability and operational applicability, enabling flexible configurations with optical-only, SAR-only, or combined inputs, and offering a transferable framework for broader natural hazard monitoring and environmental change applications. Full training and inference code can be found in https://github.com/IoannisNasios/sentinel-landslide-cls.
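The derived index mentioned in the abstract, NDVI, has the standard definition (NIR - Red) / (NIR + Red); for Sentinel-2 the usual choice is band B8 for near-infrared and band B4 for red. The sketch below computes it per pixel and appends it as an extra channel; the helper names, the small `eps` guard against division by zero, and the toy reflectance values are assumptions, and the repository linked above may organize this differently.

```python
def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index for one pixel."""
    return (nir - red) / (nir + red + eps)

def add_ndvi_channel(pixels):
    """Append NDVI to each pixel given as a dict of band reflectances."""
    return [dict(p, NDVI=ndvi(p["B8"], p["B4"])) for p in pixels]

# Toy reflectances: a vegetated pixel and a bare-soil pixel (made-up values).
patch = [
    {"B4": 0.2, "B8": 0.6},
    {"B4": 0.3, "B8": 0.3},
]
with_index = add_ndvi_channel(patch)
```

The vegetated pixel gets an NDVI near 0.5 and the bare pixel near 0, which is why a drop in NDVI is a common cue for vegetation stripped away by a landslide.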

[178] HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation

Tao Hu,Varun Jampani

Main category: cs.CV

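TL;DR: This paper proposes HumANDiff, a framework that introduces articulated motion-consistent noise sampling, joint appearance-motion learning, and geometric motion consistency learning to improve the physical realism and controllability of human motion in generated videos, without modifying the diffusion model architecture.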

Details Motivation: Existing generative video diffusion models struggle to capture the dynamics and physics of human motion, producing distorted or unnatural motion in generated videos. Method: 1) Articulated motion-consistent noise sampling: samples 3D articulated noise on the surface manifold of a statistical human body template in place of conventional Gaussian noise, bringing in body-topology priors; 2) joint appearance-motion learning: predicts pixel appearance and the corresponding physical motion together in the training objective; 3) geometric motion consistency learning: defines a new geometric motion consistency loss in the articulated noise space to enforce physically consistent motion across frames. Result: HumANDiff generates motion-consistent, high-fidelity human videos across diverse clothing styles with state-of-the-art performance; it supports image-to-video generation in a single framework with intrinsic motion control, is agnostic to diffusion model design, and needs no extra motion modules. Conclusion: Through structured noise modeling and multi-task learning, HumANDiff improves the physical plausibility, motion controllability, and visual fidelity of human video generation without changing the model architecture. Abstract: Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motions faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances the human motion control with three key designs: 1) Articulated motion-consistent noise sampling that correlates the spatiotemporal distribution of latent noise and replaces the unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. It inherits body topology priors for spatially and temporally consistent noise sampling. 2) Joint appearance-motion learning that enhances the standard training objective of video diffusion models by jointly predicting pixel appearances and corresponding physical motions from the articulated noises. It enables high-fidelity human video synthesis, e.g., capturing motion-dependent clothing wrinkles. 3) Geometric motion consistency learning that enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable controllable human video generation by fine-tuning video diffusion models with articulated noise sampling. Consequently, our method is agnostic to diffusion model design, and requires no modifications to the model architecture. 
During inference, HumANDiff enables image-to-video generation within a single framework, achieving intrinsic motion control without requiring additional motion modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Project page: https://taohuumd.github.io/projects/HumANDiff/

[179] OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

Yukun Wang,Ruihuang Li,Jiale Tao,Shiyuan Yang,Liyi Chen,Zhantao Yang,Handz,Yulan Guo,Shuai Shao,Qinglin Lu

Main category: cs.CV

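TL;DR: This paper proposes OmniCamera, a framework that explicitly disentangles scene content from camera motion in video to enable flexible video generation and control. To address modality conflict and data scarcity, it builds the hybrid dataset OmniCAM and proposes a Dual-level Curriculum Co-Training strategy that targets control precision at the condition level and realism at the data level, reaching state-of-the-art performance.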

Details Motivation: Existing video generation models often entangle scene content with camera motion, which makes independent control of the two difficult and limits creative flexibility. Method: The paper proposes the unified OmniCamera framework, builds the hybrid dataset OmniCAM (real videos plus synthetic data), and designs a Dual-level Curriculum Co-Training strategy: at the condition level, control modalities are introduced progressively by difficulty; at the data level, the model first learns precise control on synthetic data and then adapts to real data for photorealism. Result: The method reaches state-of-the-art (SOTA) results in both complex camera-motion control and visual quality, and supports generation from arbitrary combinations of content and camera conditions. Conclusion: OmniCamera achieves explicit disentanglement and joint control of content and camera motion, offering a new paradigm for controllable video generation. Abstract: Video fundamentally intertwines two crucial axes: the dynamic content of a scene and the camera motion through which it is observed. However, existing generation models often entangle these factors, limiting independent control. In this work, we introduce OmniCamera, a unified framework designed to explicitly disentangle and command these two dimensions. This compositional approach enables flexible video generation by allowing arbitrary pairings of camera and content conditions, unlocking unprecedented creative control. To overcome the fundamental challenges of modality conflict and data scarcity inherent in such a system, we present two key innovations. First, we construct OmniCAM, a novel hybrid dataset combining curated real-world videos with synthetic data that provides diverse paired examples for robust multi-task learning. Second, we propose a Dual-level Curriculum Co-Training strategy that mitigates modality interference and synergistically learns from diverse data sources. This strategy operates on two levels: first, it progressively introduces control modalities by difficulty (condition-level), and second, it trains for precise control on synthetic data before adapting to real data for photorealism (data-level). As a result, OmniCamera achieves state-of-the-art performance, enabling flexible control for complex camera movements while maintaining superior visual quality.

[180] Toward Aristotelian Medical Representations: Backpropagation-Free Layer-wise Analysis for Interpretable Generalized Metric Learning on MedMNIST

Michael Karnes,Alper Yilmaz

Main category: cs.CV

TL;DR: A-ROM is a new interpretable medical imaging framework that replaces opaque deep learning layers with a human-readable concept dictionary and kNN classifier, achieving competitive performance without gradient-based fine-tuning.

Details Motivation: The 'black-box' nature of backpropagation-based deep learning models hinders their clinical adoption in medical imaging. Method: A-ROM builds on the Platonic Representation Hypothesis and uses pretrained Vision Transformers' metric space; it replaces opaque decision layers with a human-readable concept dictionary and k-Nearest Neighbors classifier for interpretability. Result: A-ROM achieves performance competitive with standard benchmarks on MedMNIST v2, while offering few-shot capability, simplicity, scalability, and high transparency. Conclusion: A-ROM bridges the interpretability gap in medical AI by enabling rapid, transparent modeling of novel medical concepts without gradient-based fine-tuning. Abstract: While deep learning has achieved remarkable success in medical imaging, the "black-box" nature of backpropagation-based models remains a significant barrier to clinical adoption. To bridge this gap, we propose Aristotelian Rapid Object Modeling (A-ROM), a framework built upon the Platonic Representation Hypothesis (PRH). This hypothesis posits that models trained on vast, diverse datasets converge toward a universal and objective representation of reality. By leveraging the generalizable metric space of pretrained Vision Transformers (ViTs), A-ROM enables the rapid modeling of novel medical concepts without the computational burden or opacity of further gradient-based fine-tuning. We replace traditional, opaque decision layers with a human-readable concept dictionary and a k-Nearest Neighbors (kNN) classifier to ensure the model's logic remains interpretable. Experiments on the MedMNIST v2 suite demonstrate that A-ROM delivers performance competitive with standard benchmarks while providing a simple and scalable, "few-shot" solution that meets the rigorous transparency demands of modern clinical environments.
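
The kNN decision step over a frozen embedding space can be sketched as follows. This is a minimal illustration, not A-ROM's implementation: the toy 2-D vectors stand in for pretrained ViT embeddings, and `knn_predict`, the concept labels, and the choice of cosine similarity are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, dictionary, k=3):
    """Majority vote among the k dictionary entries most similar to the query.

    Returns the predicted label and the neighbours that produced it, so the
    decision can be inspected entry by entry.
    """
    ranked = sorted(dictionary, key=lambda e: cosine(query, e[0]), reverse=True)
    neighbours = ranked[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0], neighbours

# Hypothetical concept dictionary: (embedding, label) pairs.
concept_dictionary = [
    ([1.0, 0.1], "benign"),
    ([0.9, 0.2], "benign"),
    ([0.8, 0.0], "benign"),
    ([0.1, 1.0], "malignant"),
    ([0.2, 0.9], "malignant"),
    ([0.0, 0.8], "malignant"),
]

label, evidence = knn_predict([0.95, 0.15], concept_dictionary, k=3)
```

Because the prediction is a vote over stored examples, adding a new concept only requires appending labelled embeddings to the dictionary; no gradient step is involved.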

[181] Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models

Katarzyna Zaleska,Łukasz Popek,Monika Wysoczańska,Kamil Deja

Main category: cs.CV

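TL;DR: This paper proposes a probing-based localization technique and finds that, in text-to-image diffusion models, the resolution of ambiguous concepts is governed mainly by self-attention layers. Building on this, it proposes ICM (Implicit Choice-Modification), which intervenes precisely on specific self-attention layers, markedly improving debiasing performance and reducing artifacts.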

Details Motivation: Text-to-image diffusion models must make implicit decisions when handling underspecified prompts, but where in the model these decisions are computed is unclear; existing localization methods mostly rely on explicit prompt interventions, which may not reflect the actual implicit decision mechanism. Method: A probing-based localization technique identifies the network layers with the highest attribute separability for concepts; self-attention layers are found to be the key site for resolving ambiguous concepts; on this basis, the ICM method applies targeted interventions to only a small number of selected self-attention layers. Result: Across multiple benchmarks, ICM outperforms existing state-of-the-art debiasing techniques; intervening on specific self-attention layers removes bias more effectively and yields fewer image artifacts. Conclusion: Implicit decisions in diffusion models are computationally localized, concentrated mainly in self-attention layers; precisely locating and intervening on these layers efficiently improves controllable generation and fairness. Abstract: Text-to-image diffusion models exhibit remarkable generative capabilities, yet their internal operations remain opaque, particularly when handling prompts that are not fully descriptive. In such scenarios, models must make implicit decisions to generate details not explicitly specified in the text. This work investigates the hypothesis that this decision-making process is not diffuse but is computationally localized within the model's architecture. While existing localization techniques focus on prompt-related interventions, we notice that such explicit conditioning may differ from implicit decisions. Therefore, we introduce a probing-based localization technique to identify the layers with the highest attribute separability for concepts. Our findings indicate that the resolution of ambiguous concepts is governed principally by self-attention layers, identifying them as the most effective point for intervention. Based on this discovery, we propose ICM (Implicit Choice-Modification) - a precise steering method that applies targeted interventions to a small subset of layers. Extensive experiments confirm that intervening on these specific self-attention layers yields superior debiasing performance compared to existing state-of-the-art methods, minimizing artifacts common to less precise approaches. The code is available at https://github.com/kzaleskaa/icm.

[182] EDGE-Shield: Efficient Denoising-staGE Shield for Violative Content Filtering via Scalable Reference-Based Matching

Takara Taniguchi,Ryohei Shimizu,Minh-Duc Vo,Kota Izumi,Shiqi Yang,Teppei Suzuki

Main category: cs.CV

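TL;DR: This paper proposes EDGE-Shield, a method for real-time, scalable content filtering during the denoising process. It uses embedding-based matching and an x-pred transformation to improve recognition of violative content at early stages, substantially reducing latency while maintaining high accuracy.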

Details Motivation: The rise of text-to-image generative models brings risks of copyright infringement and deepfakes, while existing reference-based content filters scale poorly and must wait for generation to finish; a training-free, real-time, scalable filtering scheme is needed. Method: EDGE-Shield filters inside the diffusion model's denoising process; it uses embedding-based matching for efficient comparison against many references; and it introduces an x-pred transformation that maps the noisy latent to a pseudo-clean latent of a later stage, improving discrimination of violative content at early denoising steps. Result: Experiments on Z-Image-Turbo and Qwen-Image show that EDGE-Shield reduces latency by about 79% and 50%, respectively, compared with traditional reference-based methods, while maintaining high filtering accuracy across architectures. Conclusion: EDGE-Shield provides low-latency, high-accuracy, training-free real-time content filtering that supports a dynamic reference set, offering a practical new paradigm for generative AI safety. Abstract: The advent of Text-to-Image generative models poses significant risks of copyright violation and deepfake generation. Since the rapid proliferation of new copyrighted works and private individuals constantly emerges, reference-based training-free content filters are essential for providing up-to-date protection without the constraints of a fixed knowledge cutoff. However, existing reference-based approaches often lack scalability when handling numerous references and require waiting for finishing image generation. To solve these problems, we propose EDGE-Shield, a scalable content filter during the denoising process that maintains practical latency while effectively blocking violative content. We leverage embedding-based matching for efficient reference comparison. Additionally, we introduce an $x$-pred transformation that converts the model's noisy intermediate latent into the pseudo-estimated clean latent at the later stage, enhancing classification accuracy of violative content at earlier denoising stages. We conduct experiments of violative content filtering against two generative models including Z-Image-Turbo and Qwen-Image. EDGE-Shield significantly outperforms traditional reference-based methods in terms of latency; it achieves an approximate $79\%$ reduction in processing time for Z-Image-Turbo and approximate $50\%$ reduction for Qwen-Image, maintaining the filtering accuracy across different model architectures.

[183] Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors

Junbin Zhang,Meng Cao,Feng Tan,Yikai Lin,Yuexian Zou

Main category: cs.CV

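TL;DR: This paper proposes Graph-PiT, a framework that models the spatial and semantic relationships among visual parts with a graph structure and uses a hierarchical graph neural network and a graph Laplacian smoothness loss to improve the structural coherence and controllability of generated images.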

Details Motivation: Existing part-based generation methods treat user-provided parts as an unordered set and ignore their intrinsic spatial and semantic relationships, producing structurally incomplete results. Method: A graph prior is built (nodes are visual parts, edges are relationships); a Hierarchical Graph Neural Network (HGNN) passes messages bidirectionally between part-level super-nodes and IP+ token sub-nodes; a graph Laplacian smoothness loss and an edge-reconstruction loss are introduced to optimize the part embeddings. Result: On several controlled synthetic domains (character, product, indoor layout, jigsaw), Graph-PiT quantitatively outperforms the PiT baseline and transfers to real web images; ablations show that explicit relational reasoning is crucial for satisfying user-specified adjacency constraints. Conclusion: Graph-PiT markedly improves the structure, plausibility, and interpretability of multi-part image generation while remaining compatible with the original IP-Prior pipeline, offering a new paradigm for fine-grained controllable visual generation. Abstract: Achieving fine-grained and structurally sound controllability is a cornerstone of advanced visual generation. Existing part-based frameworks treat user-provided parts as an unordered set and therefore ignore their intrinsic spatial and semantic relationships, which often results in compositions that lack structural integrity. To bridge this gap, we propose Graph-PiT, a framework that explicitly models the structural dependencies of visual components using a graph prior. Specifically, we represent visual parts as nodes and their spatial-semantic relationships as edges. At the heart of our method is a Hierarchical Graph Neural Network (HGNN) module that performs bidirectional message passing between coarse-grained part-level super-nodes and fine-grained IP+ token sub-nodes, refining part embeddings before they enter the generative pipeline. We also introduce a graph Laplacian smoothness loss and an edge-reconstruction loss so that adjacent parts acquire compatible, relation-aware embeddings. Quantitative experiments on controlled synthetic domains (character, product, indoor layout, and jigsaw), together with qualitative transfer to real web images, show that Graph-PiT improves structural coherence over vanilla PiT while remaining compatible with the original IP-Prior pipeline. Ablation experiments confirm that explicit relational reasoning is crucial for enforcing user-specified adjacency constraints. Our approach not only enhances the plausibility of generated concepts but also offers a scalable and interpretable mechanism for complex, multi-part image synthesis.
The code is available at https://github.com/wolf-bailang/Graph-PiT.
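
A graph Laplacian smoothness term of the kind named in the abstract can be illustrated with the standard identity tr(XᵀLX) = ½ Σ_ij A_ij ‖x_i − x_j‖², where L = D − A. The sketch below assumes an unnormalized Laplacian and an unweighted, symmetric adjacency matrix with a zero diagonal; the exact loss used by Graph-PiT is not given in the abstract, and `laplacian_smoothness`, `pairwise_form`, and the toy part graph are hypothetical.

```python
def laplacian_smoothness(embeddings, adjacency):
    """Compute tr(X^T L X) with L = D - A for a symmetric adjacency matrix A."""
    n = len(embeddings)
    dim = len(embeddings[0])
    total = 0.0
    for i in range(n):
        degree = sum(adjacency[i])
        for j in range(n):
            # Entry (i, j) of the unnormalized Laplacian.
            l_ij = (degree if i == j else 0.0) - adjacency[i][j]
            dot = sum(embeddings[i][c] * embeddings[j][c] for c in range(dim))
            total += l_ij * dot
    return total

def pairwise_form(embeddings, adjacency):
    """Equivalent form: 0.5 * sum_ij A_ij * ||x_i - x_j||^2."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            total += 0.5 * adjacency[i][j] * sq_dist
    return total

# Toy chain of three parts: part 0 -- part 1 -- part 2.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
embeddings = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]

loss = laplacian_smoothness(embeddings, adjacency)
```

The pairwise form makes the effect visible: only connected parts contribute, and the term shrinks as their embeddings move closer together.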

[184] Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Juekai Lin,Yun Zhu,Honglin Lin,Sijing Li,Tianwei Lin,Zheng Liu,Xiaoyang Wang,Wenqiao Zhang,Lijun Wu

Main category: cs.CV

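TL;DR: This paper proposes the SciTikZer framework, which combines the high-quality dataset SciTikZ-230K, the multifaceted benchmark SciTikZ-Bench, and a new Dual Self-Consistency Reinforcement Learning method to markedly improve the ability of multimodal large models to synthesize executable TikZ code from images, outperforming large models such as Gemini-2.5-Pro and Qwen3-VL on several metrics.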

Details Motivation: Image-to-TikZ code synthesis faces two main bottlenecks: existing image-TikZ paired data lack executability and reliable visual alignment, and there is no comprehensive benchmark that evaluates both structural logic and visual fidelity. Method: An execution-centric data engine produces the high-quality SciTikZ-230K dataset; the SciTikZ-Bench benchmark spans geometric constructs to hierarchical schematics; a Dual Self-Consistency Reinforcement Learning paradigm based on round-trip verification is proposed. Result: The SciTikZer-8B model achieves SOTA structural and visual fidelity, consistently outperforming strong baselines such as Gemini-2.5-Pro and Qwen3-VL-235B-A22B-Instruct. Conclusion: The work systematically fills key data and evaluation gaps in graphics program synthesis and shows that execution-driven data and self-consistency optimization improve TikZ code generation quality, offering a new paradigm for editable reverse-engineering of scientific figures. Abstract: Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.

[185] Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes

Athanasios Angelakis,Marta Gomez-Barrero

Main category: cs.CV

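TL;DR: This paper extends ZACH-ViT with the first systematic evaluation of its robustness to common image corruptions and adversarial perturbations on low-data medical imaging tasks. ZACH-ViT performs best on clean data and under common corruptions and remains competitive under adversarial attack, although all models degrade substantially under strong adversarial attacks.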

Details Motivation: ZACH-ViT was designed to drop the fixed spatial assumptions encoded by positional embeddings and a class token, to suit medical images in which spatial information is weak, localized, or variable; the original work did not examine robustness in depth, and this paper aims to fill that gap. Method: On seven MedMNIST datasets with 50 samples per class, fixed hyperparameters, and five random seeds, ZACH-ViT and three compact baselines (ABMIL, Minimal-ViT, TransMIL) are evaluated on clean data, under common image corruptions (such as noise, blur, and occlusion), and under two adversarial attacks (FGSM, PGD), and compared by mean rank. Result: ZACH-ViT has the best mean rank on clean data and under common corruptions (1.57 in both cases); it ranks first under FGSM (2.00) and second under PGD (2.29); all models degrade substantially under adversarial attack, and ABMIL performs best overall under PGD. Conclusion: ZACH-ViT's compact, permutation-invariant design not only improves clean-data performance but also brings robustness to realistic image degradation; adversarial robustness, however, remains a shared, unresolved challenge for medical Transformer models. Abstract: The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines, ABMIL, Minimal-ViT, and TransMIL, on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation.
Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, ranking first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best overall. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.
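
The mean-rank summary reported above can be computed as in the following sketch. The scores are made-up placeholders rather than results from the paper, the model names are hypothetical, and the tie-handling rule (tied models share the average rank) is an assumption.

```python
def rank_scores(scores):
    """Rank models on one dataset: 1 is best (highest score); ties share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every model tied with position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = average_rank
        i = j + 1
    return ranks

def mean_rank(per_dataset_scores):
    """Average each model's rank over all datasets."""
    totals = {}
    for scores in per_dataset_scores:
        for model, rank in rank_scores(scores).items():
            totals[model] = totals.get(model, 0.0) + rank
    return {model: total / len(per_dataset_scores) for model, total in totals.items()}

# Placeholder scores for three hypothetical models on three datasets.
toy_scores = [
    {"model_a": 0.90, "model_b": 0.85, "model_c": 0.80},
    {"model_a": 0.70, "model_b": 0.75, "model_c": 0.60},
    {"model_a": 0.88, "model_b": 0.88, "model_c": 0.50},
]
summary = mean_rank(toy_scores)
```

A mean rank close to 1 means a model is near the top on most datasets, which is why a value such as 2.00 can still be the best in a four-model comparison when no model wins everywhere.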

[186] SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

Hiba Dahmani,Nathan Piasco,Moussab Bennehar,Luis Roldão,Dzmitry Tsishkou,Laurent Caraffa,Jean-Philippe Tarel,Roland Brémond

Main category: cs.CV

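TL;DR: This paper proposes a 3D generative framework based on the Σ-Voxfield voxel grid, which uses a semantic-conditioned diffusion model and progressive spatial outpainting to generate large-scale, multiview-consistent outdoor driving scenes, and a deferred rendering module to produce high-quality images.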

Details Motivation: Existing methods fall short in geometric consistency, multiview rendering, and scalability to large scenes: methods distilled from image or video models harm geometric coherence and are restricted to training views, while 3D generative methods are usually limited to small-scale or object-centric scenes. Method: The Σ-Voxfield grid representation is proposed (each occupied voxel stores a fixed number of colorized surface samples); a semantic-conditioned diffusion model operates on local voxel neighborhoods with 3D positional encodings; progressive spatial outpainting over overlapping regions scales to large scenes; a deferred rendering module produces images. Result: The method generates diverse large-scale urban outdoor scenes that can be rendered photorealistically under various sensor configurations and camera trajectories, at moderate computational cost. Conclusion: Combining Σ-Voxfield with diffusion modeling, progressive outpainting, and deferred rendering enables large-scale, multiview-consistent 3D scene generation without per-scene optimization. Abstract: Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on $Σ$-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $Σ$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.

[187] Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

Hao Chen,Fang Qiu,Fangchao Dong,Defei Yang,Eve Bohnett,Li An

Main category: cs.CV

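TL;DR: This paper proposes a lightweight multimodal adaptation framework that transfers RGB-pretrained vision language models (VLMs) to thermal infrared drone imagery, enabling species recognition and individual counting, and generating habitat-context information when combined with RGB imagery.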

Details Motivation: To bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery and extend their use to thermal data in ecological monitoring. Method: A thermal infrared dataset is built from drone-collected imagery, and VLMs are fine-tuned through multimodal projector alignment to transfer information from RGB visual representations to thermal radiometric inputs; three models (InternVL3, Qwen2.5-VL, and Qwen3-VL) are benchmarked on species recognition and instance enumeration under closed-set and open-set prompting, and RGB and thermal imagery are combined to generate habitat context. Result: Qwen3-VL-8B-Instruct with open-set prompting performs best: F1 scores for deer, rhino, and elephant are 0.935, 0.915, and 0.968, and within-1 enumeration accuracies are 0.779, 0.982, and 1.000, respectively; combining RGB and thermal imagery yields habitat-context information such as land cover, landscape features, and human disturbance. Conclusion: Lightweight projector-based adaptation transfers RGB-pretrained VLMs to thermal drone imagery efficiently and practically, supporting both object-level recognition and habitat-level semantic understanding in ecological monitoring. Abstract: This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models, including InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. In addition, combining thermal imagery with simultaneously collected RGB imagery enabled the model to generate habitat-context information, including land-cover characteristics, key landscape features, and visible human disturbance. Overall, the findings demonstrate that lightweight projector-based adaptation provides an effective and practical route for transferring RGB-pretrained VLMs to thermal drone imagery, expanding their utility from object-level recognition to habitat-context interpretation in ecological monitoring.
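
The two reported metrics can be sketched as follows. The reading of "within-1" as "the predicted count differs from the true count by at most one" is an assumption, since the abstract does not define it, and the counts below are invented for illustration.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def within_k_accuracy(predicted_counts, true_counts, k=1):
    """Fraction of images whose predicted count is within k of the true count."""
    hits = sum(1 for p, t in zip(predicted_counts, true_counts) if abs(p - t) <= k)
    return hits / len(true_counts)

# Invented example: animal counts for four thermal frames.
predicted_counts = [3, 5, 2, 7]
true_counts = [3, 4, 4, 7]

enumeration_accuracy = within_k_accuracy(predicted_counts, true_counts, k=1)
species_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

A tolerance-based count metric is more forgiving than exact match, which suits aerial imagery where partially occluded animals make exact counts ambiguous.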

[188] PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

David Picard,Nicolas Dufour,Lucas Degeorge,Arijit Ghosh,Davide Allegro,Tom Ravaud,Yohann Perron,Corentin Sautier,Zeynep Sonat Baltaci,Fei Meng,Syrine Kalleli,Marta López-Rauhut,Thibaut Loiseau,Ségolène Albouy,Raphael Baena,Elliot Vincent,Loic Landrieu

Main category: cs.CV

TL;DR: This paper proposes Polynomial Mixer (PoM), a linear-complexity token mixing mechanism replacing self-attention, proven to preserve universal approximation capability and achieving comparable performance with lower cost across multiple domains.

Details Motivation: To address the high computational cost of self-attention—especially for long sequences—while retaining expressive power and universality of transformer models. Method: Introduces Polynomial Mixer (PoM), a learned polynomial-based token aggregation and retrieval mechanism; proves it satisfies the contextual mapping property; replaces self-attention with PoM in transformers across five domains. Result: PoM matches attention-based models' performance in text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation, while significantly reducing computational cost for long sequences. Conclusion: PoM is an effective, linear-complexity drop-in replacement for self-attention that maintains theoretical expressiveness and empirical performance across diverse modalities. Abstract: This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at https://github.com/davidpicard/pom.
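
The aggregate-then-retrieve idea described above can be sketched with a pooled polynomial mixer. This is only an illustration of how such a mechanism avoids the n × n attention matrix; the tanh activation, element-wise powers, sigmoid gate, and all function and weight names are assumptions, and the paper's actual formulation may differ.

```python
import math
import random

def matvec(vec, weight):
    """Multiply a row vector (length d_in) by a weight matrix (d_in x d_out)."""
    return [sum(v * weight[i][j] for i, v in enumerate(vec))
            for j in range(len(weight[0]))]

def poly_features(hidden, degree):
    """Concatenate element-wise powers h, h^2, ..., h^degree."""
    return [x ** p for p in range(1, degree + 1) for x in hidden]

def pooled_polynomial_mixer(tokens, w_in, w_gate, w_out, degree=2):
    n = len(tokens)
    # 1) Aggregate: one pass over the tokens builds a fixed-size summary, O(n).
    feats = [poly_features([math.tanh(x) for x in matvec(t, w_in)], degree)
             for t in tokens]
    summary = [sum(f[c] for f in feats) / n for c in range(len(feats[0]))]
    # 2) Retrieve: each token gates the shared summary, O(n) overall.
    outputs = []
    for t in tokens:
        gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(t, w_gate)]
        outputs.append(matvec([g * s for g, s in zip(gate, summary)], w_out))
    return outputs

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_hidden, degree = 4, 3, 2
w_in = random_matrix(d_model, d_hidden, rng)
w_gate = random_matrix(d_model, d_hidden * degree, rng)
w_out = random_matrix(d_hidden * degree, d_model, rng)
tokens = random_matrix(5, d_model, rng)

mixed = pooled_polynomial_mixer(tokens, w_in, w_gate, w_out, degree)
```

The summary has a fixed size regardless of sequence length, so doubling the number of tokens doubles the work instead of quadrupling it. Like self-attention without positional information, this sketch is permutation-equivariant: reordering the input tokens reorders the outputs in the same way.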

[189] The Character Error Vector: Decomposable errors for page-level OCR evaluation

Jonathan Bourne,Mwiza Simbeye,Joseph Nockels

Main category: cs.CV

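TL;DR: This paper proposes the Character Error Vector (CEV), a new OCR evaluation metric that decomposes into parsing, OCR, and interaction error components, addresses CER's breakdown under page-parsing errors, and is validated on an archival newspaper dataset.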

Details Motivation: CER is undefined when page parsing fails, which limits its use for page-level OCR evaluation, especially on data with inconsistent labelling schemas; an evaluation method is needed that separates parsing from OCR errors and allows end-to-end and pipeline models to be compared. Method: The CEV (Character Error Vector) is a bag-of-characters OCR evaluation vector that decomposes into parsing error, OCR error, and interaction error; two concrete forms are implemented, SpACER (Spatially Aware Character Error Rate) and a character-distribution method based on the Jensen-Shannon distance; an open-source Python library is provided. Result: The CEV correlates strongly with CER and reflects parse quality; on an archival newspaper dataset with degraded images and complex layouts, traditional pipeline methods outperform current SOTA end-to-end models; thresholding on easily available features alone identifies the main error source with an F1 of 0.91. Conclusion: The CEV is an effective bridge between page-parsing metrics and local character-level metrics such as CER, improving the interpretability and practicality of error attribution and performance optimization in document understanding systems. Abstract: The Character Error Rate (CER) is a key metric for evaluating the quality of Optical Character Recognition (OCR). However, this metric assumes that text has been perfectly parsed, which is often not the case. Under page-parsing errors, CER becomes undefined, limiting its use as a metric and making evaluating page-level OCR challenging, particularly when using data that do not share a labelling schema. We introduce the Character Error Vector (CEV), a bag-of-characters evaluator for OCR. The CEV can be decomposed into parsing, OCR, and interaction error components. This decomposability allows practitioners to focus on the part of the Document Understanding pipeline that will have the greatest impact on overall text extraction quality. The CEV can be implemented using a variety of methods, of which we demonstrate SpACER (Spatially Aware Character Error Rate) and a Character distribution method using the Jensen-Shannon Distance. We validate the CEV's performance against other metrics: first, the relationship with CER; then, parse quality; and finally, as a direct measure of page-level OCR quality. The validation process shows that the CEV is a valuable bridge between parsing metrics and local metrics like CER. We analyse a dataset of archival newspapers made of degraded images with complex layouts and find that state-of-the-art end-to-end models are outperformed by more traditional pipeline approaches.
Whilst the CEV requires character-level positioning for optimal triage, thresholding on easily available values can predict the main error source with an F1 of 0.91. We provide the CEV as part of a Python library to support Document understanding research.
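
The character-distribution idea can be illustrated by comparing bag-of-characters distributions with the Jensen-Shannon distance. This is a minimal sketch and not the authors' library: the function names are hypothetical, base-2 logarithms are assumed so the distance lies in [0, 1], and any normalization or decomposition step the CEV applies on top is not shown.

```python
import math
from collections import Counter

def char_distribution(text):
    """Relative frequency of each character; order of characters is ignored."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: c / total for ch, c in counts.items()}

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2)."""
    keys = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(dist):
        return sum(v * math.log2(v / mixture[k]) for k, v in dist.items() if v > 0)

    divergence = 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)
    return math.sqrt(max(0.0, divergence))

ground_truth = "the quick brown fox"
ocr_output = "the qu1ck brovvn fox"

distance = jensen_shannon_distance(
    char_distribution(ground_truth), char_distribution(ocr_output)
)
same = jensen_shannon_distance(char_distribution("listen"), char_distribution("silent"))
disjoint = jensen_shannon_distance(char_distribution("aaa"), char_distribution("bbb"))
```

Because only character frequencies are compared, two texts with the same characters in a different reading order score 0, so the measure stays defined even when text regions are not aligned one to one.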

[190] DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models

Zhengming Yu,Li Ma,Mingming He,Leo Isikdogan,Yuancheng Xu,Dmitriy Smirnov,Pablo Salamanca,Dao Mi,Pablo Delgado,Ning Yu,Julien Philip,Xin Li,Wenping Wang,Paul Debevec

Main category: cs.CV

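TL;DR: DiffHDR is an LDR-to-HDR conversion framework based on a video diffusion model. It performs radiance inpainting in the latent space, operating in Log-Gamma color space, to restore detail in over- and underexposed regions, and supports controllable conversion guided by text or reference images.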

Details Motivation: LDR videos lose much of the HDR scene radiance to quantization and saturation, which prevents accurate mapping to HDR displays and meaningful re-exposure in post-production; existing LDR-to-HDR methods struggle to restore realistic detail in over- and underexposed regions. Method: LDR-to-HDR conversion is formulated as a generative radiance inpainting task in the latent space of a video diffusion model, operating in Log-Gamma color space; spatio-temporal generative priors of a pretrained video diffusion model are leveraged; a training-data pipeline synthesizes HDR video from HDRI maps; controllable conversion guided by text prompts or reference images is supported. Result: The method significantly outperforms state-of-the-art approaches in radiance fidelity and temporal stability, producing high-quality HDR videos with considerable latitude for re-exposure. Conclusion: DiffHDR effectively addresses the limited dynamic range of LDR video and offers a new paradigm for HDR content generation and post-production that is realistic, controllable, and practical. Abstract: Most digital videos are stored in 8-bit low dynamic range (LDR) formats, where much of the original high dynamic range (HDR) scene radiance is lost due to saturation and quantization. This loss of highlight and shadow detail precludes mapping accurate luminance to HDR displays and limits meaningful re-exposure in post-production workflows. Although techniques have been proposed to convert LDR images to HDR through dynamic range expansion, they struggle to restore realistic detail in the over- and underexposed regions. To address this, we present DiffHDR, a framework that formulates LDR-to-HDR conversion as a generative radiance inpainting task within the latent space of a video diffusion model. By operating in Log-Gamma color space, DiffHDR leverages spatio-temporal generative priors from a pretrained video diffusion model to synthesize plausible HDR radiance in over- and underexposed regions while recovering the continuous scene radiance of the quantized pixels. Our framework further enables controllable LDR-to-HDR video conversion guided by text prompts or reference images. To address the scarcity of paired HDR video data, we develop a pipeline that synthesizes high-quality HDR video training data from static HDRI maps. Extensive experiments demonstrate that DiffHDR significantly outperforms state-of-the-art approaches in radiance fidelity and temporal stability, producing realistic HDR videos with considerable latitude for re-exposure.

[191] HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

Reihaneh Zohrabi,Hosein Hasani,Akshita Gupta,Mahdieh Soleymani Baghshah,Anna Rohrbach,Marcus Rohrbach

Main category: cs.CV

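TL;DR: This paper proposes HaloProbe, a Bayesian external detection method for identifying object hallucinations in descriptions generated by large vision-language models. It improves detection reliability by factorizing external statistics from internal decoding signals, using balanced training, and modeling a prior, and it achieves better non-invasive mitigation without harming generation quality.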

Details Motivation: Existing hallucination detectors based on attention weights are unreliable because of hidden confounders such as token position and object repetition, which lead to Simpson's paradox; a more robust detection mechanism is needed. Method: HaloProbe is a Bayesian framework that models external description statistics and internal decoding signals separately, uses balanced training to isolate internal evidence, and combines it with a prior over external features to estimate token-level hallucination probabilities; it is then used as an external scoring signal for non-invasive decoding-time mitigation. Result: HaloProbe detects hallucinations more reliably; HaloProbe-guided decoding outperforms current intervention-based methods, reducing hallucinations while better preserving utility and fluency. Conclusion: Fine-grained, deconfounded modeling of external signals, as in HaloProbe, is a better basis for hallucination detection than coarse-grained attention analysis; non-invasive, score-based mitigation achieves a better trade-off between performance and fidelity without modifying model internals. Abstract: Large vision-language models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model's attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson's paradox: the attention trends reverse or disappear when statistics are aggregated. Based on this observation, we introduce HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities. HaloProbe uses balanced training to isolate internal evidence and combines it with a learned prior over external features to recover the true posterior. While intervention-based mitigation methods often degrade utility or fluency by modifying models' internals, we use HaloProbe as an external scoring signal for non-invasive mitigation. Our experiments show that HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility.
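
One standard way to combine evidence from a classifier trained on balanced data with a separately estimated prior is Bayes' rule in odds form. With balanced classes, p / (1 − p) from the classifier equals the likelihood ratio, so posterior odds = likelihood ratio × prior odds. The sketch below illustrates only that identity; whether HaloProbe uses exactly this factorization is an assumption, and `posterior_from_balanced` and the numbers are hypothetical.

```python
def posterior_from_balanced(p_balanced, prior):
    """Recover P(hallucination | evidence) from a balanced-classifier score.

    p_balanced: output of a classifier trained with equal class frequencies,
                based on internal signals only (must be strictly below 1).
    prior:      P(hallucination) implied by external features such as token
                position or object repetition (must be strictly below 1).
    """
    likelihood_ratio = p_balanced / (1.0 - p_balanced)
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# The same internal evidence is read differently under different priors.
early_token = posterior_from_balanced(p_balanced=0.8, prior=0.1)
late_token = posterior_from_balanced(p_balanced=0.8, prior=0.5)
```

Keeping the two factors separate is what prevents a confounder such as token position from leaking into the internal-evidence score: position enters only through the prior.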

[192] Action Images: End-to-End Policy Learning via Multiview Video Generation

Haoyu Zhen,Zixian Gao,Qiao Sun,Yilin Zhao,Yuncong Yang,Yilun Du,Tsun-Hsuan Wang,Yi-Ling Qiao,Chuang Gan

Main category: cs.CV

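TL;DR: This paper proposes Action Images, a unified world action model (WAM) that converts 7-DoF robot actions into multi-view, pixel-aligned action images, so that a video backbone can serve directly as a zero-shot policy without an extra policy head or action module, and that supports several generation and labeling tasks.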

Details Motivation: Existing world action models rely on separate action modules or on action representations that are not pixel-aligned, which makes it hard to exploit the knowledge of pretrained video models and limits transfer across viewpoints and environments. Method: 7-DoF robot actions are encoded as multi-view, pixel-aligned action images (that is, action videos), and the video backbone models the policy directly, so that policy learning becomes multiview video generation. Result: The model achieves the best zero-shot success rates on RLBench and in real-world evaluations and improves video-action joint generation quality. Conclusion: Pixel-aligned, interpretable action images are a promising paradigm for world action modeling that unifies control, generation, and labeling tasks. Abstract: World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.