cs.CL

[1] Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes

Fangyu Ding,Ding Ding,Sijin Chen,Kaibo Wang,Peng Xu,Zijin Feng,Haoli Bai,Kai Han,Youliang Yan,Binhang Yuan,Jiacheng Sun

Main category: cs.CL

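TL;DR: This paper proposes Deletion-Insertion Diffusion (DID) language models, which replace the masking mechanism of conventional Masked Diffusion Language Models (MDLMs) with a discrete deletion-insertion diffusion process, improving training and inference efficiency while making generation more flexible.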

Details

Motivation: Existing MDLMs are constrained by the masking paradigm: they are computationally inefficient and inflexible in generation, and in variable-length settings they introduce large numbers of non-informative mask and padding tokens that cause redundant computation.

Method: Token deletion and insertion are formulated as discrete diffusion processes; a score-based approach models insertion operations; and a parallelized dynamic programming algorithm solves the subsequence counting problems that arise in the training objectives.

Result: In both fixed-length and variable-length settings, DID outperforms MDLMs and other insertion-based language models in modeling performance, sampling quality, and training/inference speed, without any hyperparameter tuning.

Conclusion: By abandoning the masking paradigm, natively supporting variable-length sequences, and building in a self-correction mechanism, DID offers a more efficient and more flexible paradigm for diffusion language modeling.

Abstract: While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID) that rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: the computations on non-informative 1) mask tokens inherent to the paradigm, and 2) padding tokens introduced in variable-length settings. Furthermore, DID offers greater flexibility by: 1) natively supporting variable-length sequences without requiring fixed-length padding, and 2) providing an intrinsic self-correction mechanism during generation, since insertion dynamically adjusts token positions. To train DID, we design a score-based approach that assigns scores to token insertion operations and derive appropriate training objectives. The objectives involve subsequence counting problems, which we efficiently solve via a parallelized dynamic programming algorithm. Our experiments across fixed and variable-length settings demonstrate the advantage of DID over baselines of MDLMs and existing insertion-based LMs, in terms of modeling performance, sampling quality, and training/inference speed, without any hyperparameter tuning.
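The abstract says the training objectives involve subsequence counting, solved with a parallelized dynamic program, but it does not spell that algorithm out. As a rough illustration of the underlying counting problem only, the sketch below uses the textbook sequential dynamic program to count how many ways a shorter token sequence can be embedded in a longer one. The function name is hypothetical, and this is not the paper's parallel algorithm.

```python
def count_subsequence_embeddings(full, sub):
    """Count the ways `sub` occurs as a (not necessarily contiguous) subsequence of `full`."""
    # ways[j] = number of ways to match the first j tokens of `sub`
    # using the tokens of `full` processed so far.
    ways = [0] * (len(sub) + 1)
    ways[0] = 1  # the empty sequence is matched in exactly one way
    for token in full:
        # Iterate right-to-left so each token of `full` is used at most once per embedding.
        for j in range(len(sub), 0, -1):
            if sub[j - 1] == token:
                ways[j] += ways[j - 1]
    return ways[len(sub)]


# Works on strings or on lists of token ids alike.
print(count_subsequence_embeddings("rabbbit", "rabbit"))
print(count_subsequence_embeddings([5, 7, 5], [5]))
```

The inner loop only reads `ways[j - 1]` from the previous step, which is the kind of structure a parallel variant could exploit, although how the authors parallelize it is not stated in the abstract.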

[2] Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

Xunzhuo Liu,Bowei He,Xue Liu,Haichen Zhang,Huamin Chen

Main category: cs.CL

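TL;DR: This paper presents a real-time verification component integrated into a production RAG system. It supports full-document grounding verification on documents of up to 32K tokens under strict latency constraints and substantially improves detection of unsupported responses.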

Details

Motivation: Existing verification methods face a dilemma: large models can handle long contexts but are slow and costly, while lightweight classifiers are limited by context length and miss evidence outside truncated passages. Enterprise RAG needs verification that is both accurate and real-time.

Method: Design and deploy a real-time verification component that uses adaptive inference strategies to achieve full-document grounding on documents of up to 32K tokens; the deployment rests on architectural trade-offs, operational optimization, and systematic evaluation.

Result: Full-context verification substantially improves detection of unsupported responses compared with truncated validation; the work also clarifies when long-context verification is necessary, why chunk-based checking fails, and how latency budgets shape model design.

Conclusion: Long-context verification is essential for highly reliable RAG systems; practice shows that the verifier must be co-designed with actual document complexity and serving latency requirements rather than relying on chunking or black-box large models.

Abstract: Retrieval-augmented generation (RAG) is increasingly deployed in enterprise search and document-centric assistants, where responses must be grounded in long and complex source materials. In practice, verifying that generated answers faithfully reflect retrieved documents is difficult: large language models can check long contexts but are too slow and costly for interactive services, while lightweight classifiers operate within strict context limits and frequently miss evidence outside truncated passages. We present the design of a real-time verification component integrated into a production RAG pipeline that enables full-document grounding under latency constraints. The system processes documents up to 32K tokens and employs adaptive inference strategies to balance response time and verification coverage across workloads. We describe the architectural decisions, operational trade-offs, and evaluation methodology used to deploy the verifier, and show that full-context verification substantially improves detection of unsupported responses compared with truncated validation. Our experience highlights when long-context verification is necessary, why chunk-based checking often fails in real documents, and how latency budgets shape model design. These findings provide practical guidance for practitioners building reliable large-scale retrieval-augmented applications. (Model, benchmark, and code: https://huggingface.co/llm-semantic-router)

[3] Internal Safety Collapse in Frontier Large Language Models

Yutao Wu,Xiao Liu,Yifeng Gao,Xiang Zheng,Hanxun Huang,Yige Li,Cong Wang,Bo Li,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CL

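TL;DR: This paper identifies a new safety failure mode in frontier large language models, Internal Safety Collapse (ISC), in which models continuously generate harmful content under certain task conditions. The authors propose the TVD framework and the ISC-Bench benchmark and observe worst-case safety failure rates averaging 95.3% across several frontier models, showing that alignment has not eliminated the models' intrinsic risk.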

Details

Motivation: Existing alignment methods improve output behavior but do not address the underlying problem that models retain the capability to generate harmful content; the spread of dual-use tools across professional domains further enlarges this implicit attack surface, which calls for systematic identification and evaluation.

Method: Propose TVD (Task, Validator, Data), which triggers ISC through professional-domain tasks that can only be completed with harmful content; build ISC-Bench with 53 scenarios across 8 disciplines; and evaluate frontier models, including GPT-5.2 and Claude Sonnet 4.5, on JailbreakBench.

Result: Across three representative scenarios, four frontier LLMs show worst-case safety failure rates averaging 95.3%, substantially higher than standard jailbreak attacks; frontier models are more prone to ISC than earlier models, indicating that their stronger capabilities aggravate the risk; alignment changes only surface outputs and does not reduce the underlying risk.

Conclusion: ISC is a serious, overlooked failure mode that exposes a fundamental limitation of the current LLM safety paradigm; before deployment in high-stakes settings, models' intrinsic safety on professional tasks must be re-evaluated rather than relying on output alignment alone.

Abstract: This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability, even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: https://github.com/wuyoscar/ISC-Bench

[4] Visuospatial Perspective Taking in Multimodal Language Models

Jonathan Prunty,Seraphina Zhang,Patrick Quinn,Jianxun Lian,Xing Xie,Lucy Cheke

Main category: cs.CL

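TL;DR: This paper evaluates multimodal language models (MLMs) on visuospatial perspective-taking (VPT) tasks and finds pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's.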

Details

Motivation: Existing benchmarks rely mainly on text-based vignettes or static scene understanding and neglect visuospatial perspective-taking (VPT), which is crucial in social and collaborative settings.

Method: Adapt two classic tasks from human studies: the Director Task, which assesses VPT in a referential communication paradigm, and the Rotating Figure Task, which probes perspective-taking across angular disparities.

Result: MLMs perform markedly poorly on Level 2 VPT: they struggle to inhibit their own perspective and adopt another's.

Conclusion: Current MLMs have critical limitations in representing and reasoning about others' perspectives, which affects their practical use in collaborative settings.

Abstract: As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's. These results expose critical limitations in current MLMs' ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.

[5] DISCO: Document Intelligence Suite for COmparative Evaluation

Kenza Benkirane,Dan Goldwater,Martin Asenov,Aneiss Ghodsi

Main category: cs.CL

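TL;DR: DISCO is a comparative evaluation suite for document intelligence that separately evaluates OCR pipelines and vision-language models (VLMs) on text parsing and question answering across diverse document types, including handwriting, multilingual text, medical forms, infographics, and multi-page documents. The results show that the two approaches differ markedly across document characteristics, so the strategy should be chosen according to document structure and reasoning demands.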

Details

Motivation: Existing document intelligence evaluations lack fine-grained, separate comparisons of OCR and VLMs across document types and tasks, which makes it hard to guide method selection in practice.

Method: Build the DISCO suite, which independently evaluates OCR pipelines and VLMs on two core tasks, parsing and question answering (QA), across document types (handwriting, multilingual text, medical forms, infographics, multi-page documents), and analyze the effect of task-aware prompting.

Result: OCR is more reliable on handwritten text and on long or multi-page documents; VLMs perform better on multilingual text and visually rich layouts; task-aware prompting has mixed effects that vary by document type.

Conclusion: Document processing strategies should be selected in a complexity-aware way based on document structure and reasoning demands; DISCO provides empirical evidence for that decision.

Abstract: Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce DISCO, a Document Intelligence Suite for COmparative Evaluation, that evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need for complexity-aware approach selection. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting yields mixed effects, improving performance on some document types while degrading it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.

[6] S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering

Rong Fu,Yemin Wang,Tianxiang Xu,Yongtai Liu,Weizhi Tang,Wangyu Wu,Xiaowen Ma,Simon Fong

Main category: cs.CL

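TL;DR: S-Path-RAG is a semantic-aware shortest-path retrieval-augmented generation framework for multi-hop question answering over knowledge graphs. It combines semantically weighted path enumeration, learnable path scoring and verification, injection of soft path embeddings into the language model, and a Neural-Socratic Graph Dialogue loop for adaptive retrieval, improving accuracy, evidence coverage, and efficiency.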

Details

Motivation: Existing RAG methods for multi-hop knowledge graph question answering rely on one-shot, text-heavy retrieval and lack fine-grained modeling of graph structure and semantic paths, making it hard to balance efficiency, interpretability, and adaptivity.

Method: The S-Path-RAG framework: 1) enumerate bounded-length, semantically weighted paths with a hybrid strategy (weighted k-shortest paths, beam search, and constrained random walks); 2) jointly train a differentiable path scorer, a contrastive path encoder, and a lightweight verifier; 3) inject a soft mixture of selected path latents into the LLM via cross-attention; 4) embed everything in an iterative Neural-Socratic Graph Dialogue loop in which the LLM's diagnostic messages drive graph edits or seed expansions for adaptive retrieval.

Result: On standard multi-hop KGQA benchmarks, S-Path-RAG consistently improves over strong graph-based and LLM-based baselines in answer accuracy, evidence coverage, and end-to-end efficiency; ablations and diagnostic analyses validate each module's contribution and reveal trade-offs among semantic weighting, verifier filtering, and iterative updates.

Conclusion: S-Path-RAG delivers efficient, topology-aware, and interpretable graph retrieval, offering an approach to complex reasoning tasks that balances performance, controllability, and deployability.

Abstract: We present S-Path-RAG, a semantic-aware shortest-path Retrieval-Augmented Generation framework designed to improve multi-hop question answering over large knowledge graphs. S-Path-RAG departs from one-shot, text-heavy retrieval by enumerating bounded-length, semantically weighted candidate paths using a hybrid weighted $k$-shortest, beam, and constrained random-walk strategy, learning a differentiable path scorer together with a contrastive path encoder and lightweight verifier, and injecting a compact soft mixture of selected path latents into a language model via cross-attention. The system runs inside an iterative Neural-Socratic Graph Dialogue loop in which concise diagnostic messages produced by the language model are mapped to targeted graph edits or seed expansions, enabling adaptive retrieval when the model expresses uncertainty. This combination yields a retrieval mechanism that is both token-efficient and topology-aware while preserving interpretable path-level traces for diagnostics and intervention. We validate S-Path-RAG on standard multi-hop KGQA benchmarks and through ablations and diagnostic analyses. The results demonstrate consistent improvements in answer accuracy, evidence coverage, and end-to-end efficiency compared to strong graph- and LLM-based baselines. We further analyze trade-offs between semantic weighting, verifier filtering, and iterative updates, and report practical recommendations for deployment under constrained compute and token budgets.
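The path-enumeration step can be illustrated with a small sketch of its beam-search component alone. The abstract does not give the scoring function or the data layout, so everything here is an assumption: the toy graph, the edge weights standing in for semantic relevance, the product-of-weights score, and the function name are all hypothetical.

```python
def beam_search_paths(graph, seed, target, max_hops=3, beam_width=2):
    """Enumerate bounded-length seed->target paths, keeping the best partial paths per hop.

    `graph` maps node -> list of (neighbor, relation, weight); weight in (0, 1]
    is a stand-in for a semantic relevance score (an assumption of this sketch).
    """
    beam = [(1.0, [seed])]  # (score, path); score is the product of edge weights
    finished = []
    for _ in range(max_hops):
        candidates = []
        for score, path in beam:
            for neighbor, _relation, weight in graph.get(path[-1], []):
                if neighbor in path:  # skip cycles
                    continue
                candidates.append((score * weight, path + [neighbor]))
        candidates.sort(key=lambda item: item[0], reverse=True)
        top = candidates[:beam_width]
        finished.extend(item for item in top if item[1][-1] == target)
        beam = [item for item in top if item[1][-1] != target]
        if not beam:
            break
    return sorted(finished, key=lambda item: item[0], reverse=True)


toy_graph = {
    "Q": [("A", "r1", 0.9), ("B", "r2", 0.5)],
    "A": [("T", "r3", 0.8), ("C", "r4", 0.9)],
    "B": [("T", "r5", 0.9)],
    "C": [("T", "r6", 0.9)],
}
results = beam_search_paths(toy_graph, "Q", "T")
for score, path in results:
    print(round(score, 3), " -> ".join(path))
```

Bounding both the hop count and the beam width keeps the candidate set small, which matches the abstract's emphasis on token efficiency. The k-shortest-path and random-walk parts of the hybrid strategy are not shown.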

[7] Berta: an open-source, modular tool for AI-enabled clinical documentation

Samridhi Vaid,Mike Weldon,Jesse Dunn,Sacha Davis,Kevin Lonergan,Henry Li,Jeffrey Franc,Mohamed Abdalla,Daniel C. Baumgart,Jake Hayward,J Ross Mitchell

Main category: cs.CL

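TL;DR: This paper introduces Berta, an open-source, modular AI clinical documentation platform that has been deployed at provincial scale within Alberta Health Services (AHS), integrated with its Snowflake AI Data Cloud, substantially reducing cost while preserving data sovereignty.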

Details

Motivation: Commercial AI scribes are expensive, opaque, and do not return data to institutional infrastructure; addressing this gives health organizations more control over data governance, quality improvement, and clinical workflows.

Method: Develop Berta, an open-source, modular AI scribe platform that combines automatic speech recognition with large language models, and deploy a customized implementation within AHS's Snowflake AI Data Cloud so that all clinical data stays in its secure environment.

Result: Over eight months, 198 emergency physicians at 105 urban and rural facilities completed 22148 clinical sessions; monthly usage grew from 680 to 5530 sessions; the cost per physician per month stayed below $30 (70-95% lower than commercial alternatives); AHS has approved expansion to 850 physicians.

Conclusion: Berta is the first provincial-scale AI scribe platform deeply integrated with existing health system infrastructure; its open-source release gives health systems a reproducible, low-cost, controllable alternative that supports data sovereignty and careful evaluation of AI technology.

Abstract: Commercial AI scribes cost $99-600 per physician per month, operate as opaque systems, and do not return data to institutional infrastructure, limiting organizational control over data governance, quality improvement, and clinical workflows. We developed Berta, an open-source modular scribe platform for AI-enabled clinical documentation, and deployed a customized implementation within Alberta Health Services (AHS) integrated with their existing Snowflake AI Data Cloud infrastructure. The system combines automatic speech recognition with large language models while retaining all clinical data within the secure AHS environment. During eight months (November 2024 to July 2025), 198 emergency physicians used the system in 105 urban and rural facilities, generating 22148 clinical sessions and more than 2800 hours of audio. Usage grew from 680 to 5530 monthly sessions. Operating costs averaged less than $30 per physician per month, a 70-95% reduction compared to commercial alternatives. AHS has since approved expansion to 850 physicians. This is the first provincial-scale deployment of an AI scribe integrated with existing health system infrastructure. By releasing Berta as open source, we provide a reproducible, cost-effective alternative that health systems can adapt to their own secure environments, supporting data sovereignty and informed evaluation of AI documentation technology.
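The reported 70-95% saving can be checked against the other figures in the abstract. The short calculation below treats $30 as an upper bound on Berta's monthly cost per physician and compares it with both ends of the commercial price range; the variable names are illustrative only.

```python
def percent_reduction(baseline, new_cost):
    """Relative saving of `new_cost` against `baseline`, in percent."""
    return 100.0 * (1.0 - new_cost / baseline)


commercial_low, commercial_high = 99.0, 600.0  # USD per physician per month, from the abstract
berta_upper_bound = 30.0                       # "less than $30" per physician per month

low = percent_reduction(commercial_low, berta_upper_bound)
high = percent_reduction(commercial_high, berta_upper_bound)
print(f"{low:.1f}% to {high:.1f}%")  # prints "69.7% to 95.0%"
```

Because the actual cost is below $30, the true savings are slightly larger than these bounds, which is consistent with the 70-95% range quoted in the abstract.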

[8] DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

Alexander Sheppert

Main category: cs.CL

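TL;DR: This paper proposes DepthCharge, a framework for measuring how deeply large language models can sustain accurate knowledge in different domains; it achieves domain-agnostic evaluation through adaptive questioning, on-demand fact verification, and survival statistics.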

Details

Motivation: Existing methods cannot effectively measure an LLM's ability to sustain accurate answers under adaptive follow-up questioning in arbitrary domains.

Method: DepthCharge combines three innovations: adaptive probing based on concepts the model actually mentions, on-demand fact verification from authoritative sources, and survival statistics with a constant sample size at every depth level.

Result: Empirical validation of five frontier models in four domains (medicine, constitutional law, ancient Rome, and quantum computing) shows that DepthCharge reveals depth-dependent performance differences hidden by standard benchmarks; Expected Valid Depth (EVD) ranges from 3.45 to 7.55, model rankings vary by domain, and expensive models do not necessarily have deeper knowledge.

Conclusion: DepthCharge is a domain-agnostic evaluation tool deployable in any domain with publicly verifiable facts; domain-specific evaluation guides model selection in professional settings better than aggregate benchmarks.

Abstract: Large Language Models appear competent when answering general questions but often fail when pushed into domain-specific details. No existing methodology provides an out-of-the-box solution for measuring how deeply LLMs can sustain accurate responses under adaptive follow-up questioning across arbitrary domains. We present DepthCharge, a domain-agnostic framework that measures knowledge depth through three innovations: adaptive probing that generates follow-up questions based on concepts the model actually mentions, on-demand fact verification from authoritative sources, and survival statistics with constant sample sizes at every depth level. The framework can be deployed on any knowledge domain with publicly verifiable facts, without requiring pre-constructed test sets or domain-specific expertise. DepthCharge results are relative to the evaluator model used for answer checking, making the framework a tool for comparative evaluation rather than absolute accuracy certification. Empirical validation across four diverse domains (Medicine, Constitutional Law, Ancient Rome, and Quantum Computing) with five frontier models demonstrates that DepthCharge reveals depth-dependent performance variation hidden by standard benchmarks. Expected Valid Depth (EVD) ranges from 3.45 to 7.55 across model-domain combinations, and model rankings vary substantially by domain, with no single model dominating all areas. Cost-performance analysis further reveals that expensive models do not always achieve deeper knowledge, suggesting that domain-specific evaluation is more informative than aggregate benchmarks for model selection in professional applications.
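The abstract reports Expected Valid Depth (EVD) and mentions survival statistics with a constant sample size per depth, but it does not define the metric. One plausible reading, offered purely as an assumption, is the expected number of consecutive depth levels answered correctly, computed as a sum of survival probabilities. The function name and the sample numbers below are hypothetical.

```python
def expected_valid_depth(correct_per_depth, sample_size):
    """Sum of survival probabilities: E[depth] = sum over d of P(still correct at depth d).

    `correct_per_depth[d]` is the number of correct answers at depth d among
    `sample_size` probes, i.e. a conditional pass rate with a constant
    denominator at every level (an assumption of this sketch).
    """
    survival = 1.0
    expected_depth = 0.0
    for correct in correct_per_depth:
        survival *= correct / sample_size  # chance of surviving one more level
        expected_depth += survival
    return expected_depth


# Made-up counts: 10 probes per level, pass rates 100%, 80%, 50%.
print(round(expected_valid_depth([10, 8, 5], sample_size=10), 2))
```

Keeping the sample size fixed at every depth means deeper levels are estimated with the same precision as shallow ones, which may be why the abstract highlights that design choice.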

[9] Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data

John Cook,Michael Wyatt,Peng Wei,Iris Chin,Santosh Gupta,Van Zyl Van Vuuren,Richie Siburian,Amanda Spicer,Kristen Viviano,Alda Cami,Raunaq Malhotra,Zhewei Yao,Jeff Rasley,Gaurav Kaushik

Main category: cs.CL

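TL;DR: This paper examines whether fine-tuning the open-weight large language model Llama 3-70B on privacy-preserving synthetic data derived from electronic health records (EHRs) improves accuracy on ICD-10-CM and CPT medical coding. After fine-tuning, exact-code-match F1 rises from 0.18 zero-shot to above 0.70, and performance remains high on complex categories that require multi-step clinical reasoning.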

Details

Motivation: Automated medical coding faces heterogeneous records, nuanced coding rules, and long-tail distributions; existing large models are not trained specifically for coding and perform poorly zero-shot, so a safe, efficient, compliant adaptation method is needed.

Method: Generate privacy-preserving synthetic pairs of clinical notes and gold codes from EHR-grounded templates and coding policies, run supervised fine-tuning of Llama 3-70B on them, and evaluate exact-code prediction for ICD-10-CM and CPT.

Result: The fine-tuned model exceeds 0.70 F1 on exact code match (zero-shot baseline: 0.18) and maintains high performance on complex categories such as Advanced Illness and Frailty as well as on medical comprehension tasks.

Conclusion: Policy-aware synthetic data can efficiently and safely turn a general-purpose LLM into a precise medical coding assistant, offering a practical path for iterative, compliant training of coding agents that reflect real-world populations.

Abstract: Improving the accuracy and reliability of medical coding reduces clinician burnout and supports revenue cycle processes, freeing providers to focus more on patient care. However, automating the assignment of ICD-10-CM and CPT codes from clinical documentation remains a challenge due to heterogeneous records, nuanced coding guidelines, and long-tail distributions. Large language models have been proposed to help or automate specific medical coding tasks. However, foundation models are not explicitly trained for medical coding and zero-shot coding has yielded poor results. We investigate whether a modern open-weight foundation model can be adapted for an expert-level medical coding task using privacy-preserving synthetic training data derived from electronic health records. We fine-tune Llama 3-70B on pairs of clinical notes and gold codes generated from EHR-grounded templates and coding policies, then evaluate exact-code prediction for ICD-10-CM and CPT. A zero-shot baseline with the unadapted model achieved an F1 score of 0.18 for exact code match. After fine-tuning on the synthetic corpus, exact-match F1 exceeded 0.70, representing a large absolute gain across both code systems. Notably, performance remained high on complex categories that often require multi-step clinical reasoning and code composition, including Advanced Illness and Frailty classes, and the model retained its performance on medical comprehension tasks. These results indicate that synthetic, policy-aware data can efficiently teach a general-purpose large language model to support precise medical coding without exposing protected health information. The approach offers a practical path for training coding agents safely and iteratively on specific tasks that represent real-world populations.
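The headline metric is F1 for exact code match. The abstract does not say how the score is averaged, so the sketch below assumes micro-averaging over per-note code sets; the function name and the example notes are made up for illustration.

```python
def exact_match_f1(gold_sets, predicted_sets):
    """Micro-averaged F1 where a predicted code counts only if it matches a gold code exactly."""
    true_pos = false_pos = false_neg = 0
    for gold, predicted in zip(gold_sets, predicted_sets):
        true_pos += len(gold & predicted)
        false_pos += len(predicted - gold)
        false_neg += len(gold - predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Two illustrative notes: the first prediction swaps one ICD-10-CM code, the second is perfect.
gold = [{"E11.9", "I10"}, {"99213"}]
predicted = [{"E11.9", "I50.9"}, {"99213"}]
print(round(exact_match_f1(gold, predicted), 3))
```

Under exact matching a near miss such as a sibling code in the same family earns no credit, which helps explain why a zero-shot score as low as 0.18 is plausible for an otherwise capable model.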

[10] MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Yu Chen,Runkai Chen,Sheng Yi,Xinda Zhao,Xiaohong Li,Jianjin Zhang,Jun Sun,Chuanrui Hu,Yunyun Han,Lidong Bing,Yafeng Deng,Tianqiao Chen

Main category: cs.CL

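TL;DR: This paper proposes Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through innovations such as sparse attention and document-wise RoPE, it handles long contexts from millions up to 100 million tokens while remaining stable, and it supports dynamic memory updates and multi-hop reasoning.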

Details

Motivation: Existing approaches to extending LLMs' long-term memory suffer from precision degradation, rapidly rising latency, an inability to modify memory dynamically, and a lack of end-to-end optimization, which makes them ill-suited to large-corpus summarization, digital twins, and long-history agent reasoning.

Method: The MSA framework comprises a scalable sparse attention mechanism, document-wise RoPE positional encoding, KV cache compression, the Memory Parallel parallelization strategy, and the Memory Interleaving mechanism for multi-hop reasoning.

Result: MSA achieves linear complexity in both training and inference, with less than 9% degradation when scaling from 16K to 100M tokens; 100M-token inference runs on 2 A800 GPUs; MSA significantly surpasses frontier LLMs, RAG systems, and memory agents on long-context benchmarks.

Conclusion: By decoupling memory capacity from reasoning, MSA provides a scalable foundation for giving general-purpose models intrinsic, lifetime-scale memory.

Abstract: Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.
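The abstract names document-wise RoPE without defining it. One plausible reading, treated here strictly as an assumption, is that rotary position indices restart at the beginning of each stored document instead of running continuously over the whole memory. The sketch below shows only that indexing idea; the function names are hypothetical and no rotary embedding is actually computed.

```python
def global_positions(doc_lengths):
    """Conventional indexing: one running position counter over the concatenated memory."""
    return list(range(sum(doc_lengths)))


def document_wise_positions(doc_lengths):
    """Assumed document-wise indexing: the position counter restarts for every document."""
    positions = []
    for length in doc_lengths:
        positions.extend(range(length))
    return positions


doc_lengths = [3, 2, 4]
print(global_positions(doc_lengths))         # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(document_wise_positions(doc_lengths))  # [0, 1, 2, 0, 1, 0, 1, 2, 3]
```

Under this reading the largest position index is bounded by the longest single document rather than by total memory size, which would be one way to keep positional behavior stable as memory grows toward 100M tokens.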

[11] Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents

Peijun Qing,Puneet Mathur,Nedim Lipka,Varun Manjunatha,Ryan Rossi,Franck Dernoncourt,Saeed Hassanpour,Soroush Vosoughi

Main category: cs.CL

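TL;DR: This paper proposes an autonomous clustering approach based on large reasoning models (LRMs) that reframes instruction-following clustering as a generative task, so the model both follows user instructions and infers latent corpus structure on its own; it outperforms existing embedding models and LRM baselines on the ReasonCluster benchmark.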

Details

Motivation: General-purpose embedding models cannot tailor text features to user instructions, while instruction-tuned embedders cannot autonomously infer latent corpus structure such as the optimal number of clusters.

Method: Reframe instruction-following clustering as a generative task, train large reasoning models (LRMs) as autonomous clustering agents, and design a reasoning-driven training pipeline that lets them interpret high-level clustering instructions and infer the corresponding latent groupings.

Result: On the ReasonCluster benchmark, which comprises 28 tasks spanning daily dialogue, legal cases, and financial reports, the approach consistently outperforms strong embedding baselines and LRM baselines.

Conclusion: Explicit reasoning improves the fidelity and interpretability of instruction-based clustering, unifying instruction following and structure discovery.

Abstract: General-purpose embedding models excel at recognizing semantic similarities but fail to capture the characteristics of texts specified by user instructions. In contrast, instruction-tuned embedders can align embeddings with textual instructions yet cannot autonomously infer latent corpus structures, such as determining the optimal number of clusters. To address both limitations, we reframe instruction-following clustering as a generative task and train large reasoning models (LRMs) as autonomous clustering agents. Our reasoning-driven training pipeline enables LRMs to interpret high-level clustering instructions and then infer the corresponding latent groupings. To evaluate this paradigm, we introduce ReasonCluster, a comprehensive benchmark comprising 28 diverse tasks spanning daily dialogue, legal cases, and financial reports. Experiments across diverse datasets and clustering scenarios show that our approach consistently outperforms strong embedding-based methods and LRM baselines, demonstrating that explicit reasoning fosters more faithful and interpretable instruction-based clustering.

[12] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?

Lin Yang,Yuancheng Yang,Xu Wang,Changkun Liu,Haihua Yang

Main category: cs.CL

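TL;DR: This paper proposes MedMT-Bench, a challenging benchmark for medical multi-turn instruction following that evaluates LLMs' long-context memory, interference robustness, and safety defense. It contains 400 test cases simulating real scenarios, averaging 22 rounds of dialogue and covering 5 types of difficulty. Under an expert-validated LLM-as-judge protocol, all 17 frontier models score below 60% overall accuracy, highlighting the reliability bottleneck of current medical AI.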

Details

Motivation: Existing medical benchmarks do not adequately test the long-context memory, interference robustness, and safety defense that LLMs need in real clinical settings; a more challenging and practical benchmark is needed.

Method: Build MedMT-Bench through scene-by-scene data synthesis refined by manual editing from medical experts, producing 400 multi-turn test cases that closely follow real diagnosis and treatment processes; propose a fine-grained LLM-as-judge evaluation protocol with instance-level rubrics and atomic test points, validated against expert annotations (human-LLM agreement of 91.94%).

Result: Of the 17 frontier models evaluated on MedMT-Bench, all fall below 60.00% overall accuracy and the best reaches only 59.75%, revealing significant weaknesses in complex medical multi-turn interaction.

Conclusion: MedMT-Bench fills a gap in evaluating the reliability and safety of medical AI and can serve as a key tool for research on safer, more reliable medical LLMs.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical-related benchmarks rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that are highly consistent with real-world application scenarios. Each test case has an average of 22 rounds (maximum of 52 rounds), covering 5 types of difficult instruction following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human-LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an essential tool for driving future research towards safer and more reliable medical AI. The benchmark is available at https://openreview.net/attachment?id=aKyBCsPOHB&name=supplementary_material

[13] From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians' Medical Expertise with Lightweight LLM

Chanyong Luo,Jirui Dai,Zhendong Wang,Kui Chen,Jiaxi Yang,Bingjie Lu,Jing Wang,Jiaxin Hao,Bing Li,Ruiyang He,Yiyu Qiao,Chenkai Zhang,Kaiyu Wang,Zhi Liu,Zeyu Zheng,Yan Li,Xiaohong Gu

Main category: cs.CL

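TL;DR: This paper proposes the Med-Shicheng framework, which enables large language models to systematically learn and transfer distinguished traditional Chinese medicine (TCM) physicians' diagnostic-and-therapeutic philosophy and individualized syndrome differentiation and treatment rules, achieving performance close to advanced models on resource-constrained devices; it also examines the reliability and limits of LLM-as-a-judge in TCM clinical evaluation.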

Details

Motivation: TCM clinical experience is highly individualized, hard to quantify, and hard to pass on, which leads to a scarcity of top-level TCM experts; an AI framework is needed to teach distinguished physicians' clinical reasoning at scale.

Method: Build the five-stage Med-Shicheng framework on Tianyi, integrate multi-source materials from five National Masters of Chinese Medicine or distinguished TCM physicians, and train a single model (Qwen2.5-1.5B-Base) covering seven TCM clinical tasks; also compare the agreement between LLM-as-a-judge and physician evaluation.

Result: The model runs on resource-constrained GPUs with performance comparable to DeepSeek-R1 and GPT-5; LLM judging tracks overall trends but is biased on fine-grained individualized distinctions.

Conclusion: Med-Shicheng offers a feasible path for structured learning and dissemination of distinguished TCM physicians' experience; LLM-based evaluation needs physician oversight or domain adaptation before it can support real clinical decisions.

Abstract: Medicine is an empirical discipline refined through long-term observation and the messy, high-variance reality of clinical practice. Physicians build diagnostic and therapeutic competence through repeated cycles of application, reflection, and improvement, forming individualized methodologies. Yet outcomes vary widely, and master physicians' knowledge systems are slow to develop and hard to transmit at scale, contributing to the scarcity of high-quality clinical expertise. To address this, we propose Med-Shicheng, a general framework that enables large language models to systematically learn and transfer distinguished physicians' diagnostic-and-therapeutic philosophy and case-dependent adaptation rules in a standardized way. Built on Tianyi, Med-Shicheng consists of five stages. We target five National Masters of Chinese Medicine or distinguished TCM physicians, curate multi-source materials, and train a single model to internalize all five knowledge systems across seven tasks, including etiology-pathogenesis analysis, syndrome diagnosis, treatment principle selection, prescription generation, prescription explanation, symptom evolution with regimen adjustment, and clinical advice. Implemented on Qwen2.5-1.5B-Base, Med-Shicheng runs on resource-constrained GPUs while achieving performance comparable to DeepSeek-R1 and GPT-5. We also examine the reliability of LLM-as-a-judge versus physician evaluation: automated judging tracks overall trends but shows bias on fine-grained individualized distinctions, highlighting the need for physician involvement when ground truth is unavailable and for domain-adapted judge models.

[14] Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages

Shaharukh Khan,Ali Faraz,Abhinav Ravi,Mohd Nauman,Mohd Sarfraz,Akshat Patidar,Raja Kolla,Chandra Khatri,Shubham Agarwal

Main category: cs.CL

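TL;DR: This paper introduces the Chitrakshara dataset series, which addresses the under-representation of Indian languages in existing vision-language models (VLMs). It covers 11 Indian languages, includes large-scale interleaved image-text pretraining data and image-text pairs, and details the data collection and quality analysis methods.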

Details

Motivation: Existing VLMs are trained mainly on English datasets, leaving Indian languages under-represented, and multi-image understanding remains little studied; a high-quality dataset supporting multiple images and multiple languages (especially Indian languages) is needed.

Method: Build the Chitrakshara series, comprising Chitrakshara-IL (193M images, 30B text tokens, 50M multilingual documents) and Chitrakshara-Cap (44M image-text pairs, 733M tokens), with a pipeline for data collection, cleaning, filtering, and processing, complemented by quality and diversity analysis.

Result: The paper introduces Chitrakshara, a large-scale multimodal dataset covering 11 Indian languages, together with a systematic quality and diversity assessment of its potential for training culturally inclusive VLMs.

Conclusion: Chitrakshara fills the data gap in multi-image understanding for Indian languages and provides a solid foundation for fairer, more culturally representative multimodal models.

Abstract: Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset's representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.

[15] Qworld: Question-Specific Evaluation Criteria for LLMs

Shanghua Gao,Yuchang Su,Pengwei Sui,Curtis Ginder,Marinka Zitnik

Main category: cs.CL

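TL;DR: This paper proposes Qworld, which uses a recursive expansion tree to generate question-specific, hierarchical evaluation criteria, improving the evaluation of large language models on open-ended questions and distinguishing fine-grained capabilities.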

Details

Motivation: Existing evaluation methods struggle to capture how response quality on open-ended questions depends on context; binary scores and static rubrics cannot cover the multi-dimensional requirements implied by a question.

Method: Propose One-Question-One-World (Qworld), which uses a recursive expansion tree to decompose a single question into scenarios, perspectives, and fine-grained binary criteria, building criteria dynamically from the question itself.

Result: On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by experts; experts rate its criteria higher in insight and granularity than those of prior methods; on HealthBench and Humanity's Last Exam it reveals capability differences among 11 frontier LLMs in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning.

Conclusion: Qworld formulates criteria generation as structured coverage of the evaluation axes implied by a question, so evaluation adapts to each question rather than relying on fixed task-level criteria.

Abstract: Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.

[16] Do 3D Large Language Models Really Understand 3D Spatial Relationships?

Xianzheng Ma,Tao Sun,Shuai Chen,Yash Bhalgat,Jindong Gu,Angel X Chang,Iro Armeni,Iro Laina,Songyou Peng,Victor Adrian Prisacariu

Main category: cs.CL

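TL;DR: This paper finds that the existing SQA3D benchmark is vulnerable to textual shortcuts, proposes the more rigorous Real-3DQA benchmark, and designs a 3D-reweighted training objective to improve the spatial reasoning of 3D-LLMs.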

Details

Motivation: The existing SQA3D benchmark cannot tell whether a model genuinely performs 3D spatial reasoning or merely answers from textual shortcuts. Method: Builds the Real-3DQA benchmark (filtering out easy-to-guess questions and introducing a structured taxonomy of 3D reasoning) and proposes a 3D-reweighted training objective that makes the model rely more on 3D visual cues. Result: On Real-3DQA, existing 3D-LLMs' understanding of spatial relationships drops markedly once simple cues are removed; the proposed 3D reweighting substantially improves their spatial reasoning performance. Conclusion: More robust benchmarks and tailored training strategies are needed to advance genuinely 3D-aware vision-language understanding. Abstract: Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can perform comparably or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect if the model exploits textual shortcuts rather than engages in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that guides the model to rely more on 3D visual clues, substantially enhancing 3D-LLMs' performance in spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding. Project page: https://real-3dqa.github.io/.

[17] Navigating the Concept Space of Language Models

Wilson E. Marcílio-Jr,Danilo M. Eler

Main category: cs.CL

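TL;DR: This paper presents Concept Explorer, a system for interactive post-hoc exploration of sparse autoencoder (SAE) features at scale; it organizes concept explanations with hierarchical neighborhood embeddings and supports progressive navigation from coarse concept clusters to fine-grained neighborhoods.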

Details

Motivation: Existing ways of analyzing SAE features (inspecting top-activating examples, manual browsing, or semantic search) make exploratory concept discovery difficult at scale. Method: Constructs a multi-resolution manifold over SAE feature embeddings and uses hierarchical neighborhood embeddings to organize concepts hierarchically and support progressive navigation. Result: Concept Explorer is demonstrated on SAE features extracted from SmolLM2, where it reveals coherent high-level structure, meaningful subclusters, and rare concepts that are hard to identify with existing workflows. Conclusion: Concept Explorer offers a scalable, interactive way to explore SAE features, supporting large-scale concept discovery, comparison, and relationship analysis. Abstract: Sparse autoencoders (SAEs) trained on large language model activations output thousands of features that enable mapping to human-interpretable concepts. The current practice for analyzing these features primarily relies on inspecting top-activating examples, manually browsing individual features, or performing semantic search on concepts of interest, which makes exploratory discovery of concepts difficult at scale. In this paper, we present Concept Explorer, a scalable interactive system for post-hoc exploration of SAE features that organizes concept explanations using hierarchical neighborhood embeddings. Our approach constructs a multi-resolution manifold over SAE feature embeddings and enables progressive navigation from coarse concept clusters to fine-grained neighborhoods, supporting discovery, comparison, and relationship analysis among concepts. We demonstrate the utility of Concept Explorer on SAE features extracted from SmolLM2, where it reveals coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows.

[18] Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial

Warren Johnson,Charles Lee

Main category: cs.CL

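TL;DR: In a pre-registered six-arm randomized controlled trial, this paper evaluates the economics of prompt compression in multi-agent task orchestration. Moderate compression (retention 0.5) cut total inference cost by 27.9%, whereas aggressive compression (retention 0.2) raised cost by 1.8%, consistent with a small increase in output length and heavy-tailed uncertainty. Recency-weighted, structure-aware compression performed well and, together with moderate uniform compression, formed the cost-similarity Pareto frontier. The study stresses that higher compression should not be pursued for its own sake and that output tokens should be treated as a first-class variable when designing compression policies.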

Details

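Motivation: The economic benefit of prompt compression depends not only on reducing input tokens but also on its effect on output length, and output tokens are typically more expensive; current practice lacks a systematic evaluation of output changes, so the overall cost-benefit of compression strategies needs to be quantified in a real production setting. Method: A pre-registered six-arm randomized controlled trial in a production multi-agent task-orchestration environment, analyzing 358 successful Claude Sonnet 4.5 runs (59-61 per arm). The arms are an uncompressed control, three uniform retention rates (0.8, 0.5, 0.2), and two structure-aware strategies (entropy-adaptive and recency-weighted). The primary metrics are total inference cost (input + output) and embedding-based response similarity. Result: Moderate compression (r=0.5) reduced mean total cost by 27.9%; aggressive compression (r=0.2) increased mean cost by 1.8%, consistent with a slight rise in output length (1.03x vs. control) and heavy-tailed uncertainty; recency-weighted compression saved 23.5%; moderate uniform and recency-weighted compression together occupied the cost-similarity Pareto frontier; aggressive compression was dominated on both metrics. Conclusion: Pursuing ever higher compression is not a reliable production heuristic; output token length must be a core consideration in compression policy design; structure-aware compression (recency-weighted in particular) and moderate uniform compression are better choices for balancing cost and semantic fidelity. Abstract: The economics of prompt compression depend not only on reducing input tokens but on how compression changes output length, which is typically priced several times higher. We evaluate this in a pre-registered six-arm randomized controlled trial of prompt compression on production multi-agent task-orchestration, analyzing 358 successful Claude Sonnet 4.5 runs (59-61 per arm) drawn from a randomized corpus of 1,199 real orchestration instructions. We compare an uncompressed control with three uniform retention rates (r=0.8, 0.5, 0.2) and two structure-aware strategies (entropy-adaptive and recency-weighted), measuring total inference cost (input+output) and embedding-based response similarity. Moderate compression (r=0.5) reduced mean total cost by 27.9%, while aggressive compression (r=0.2) increased mean cost by 1.8% despite substantial input reduction, consistent with small mean output expansion (1.03x vs. control) and heavy-tailed uncertainty. Recency-weighted compression achieved 23.5% savings and, together with moderate compression, occupied the empirical cost-similarity Pareto frontier, whereas aggressive compression was dominated on both cost and similarity. These results show that "compress more" is not a reliable production heuristic and that output tokens must be treated as a first-class outcome when designing compression policies.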
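
To make the cost argument concrete, here is a minimal Python sketch of total inference cost and of the output expansion at which compression stops paying off. The token counts and per-token prices are hypothetical placeholders, not figures from the trial, and the function names are illustrative only.

```python
def total_cost(input_tokens, output_tokens, price_in, price_out):
    """Total inference cost = input cost + output cost (prices are per token)."""
    return input_tokens * price_in + output_tokens * price_out


def break_even_output_ratio(input_tokens, output_tokens, retention,
                            price_in, price_out):
    """Output-length multiplier at which a compressed prompt costs exactly
    as much as the uncompressed one.

    Input saving:      (1 - retention) * input_tokens * price_in
    Extra output cost: (ratio - 1) * output_tokens * price_out
    Setting the two equal and solving for ratio gives the value below.
    """
    saving = (1 - retention) * input_tokens * price_in
    return 1 + saving / (output_tokens * price_out)


# Hypothetical workload: 1000 input tokens, 1000 output tokens,
# output priced at 5x input (arbitrary cost units).
baseline = total_cost(1000, 1000, price_in=1, price_out=5)
ratio = break_even_output_ratio(1000, 1000, retention=0.2,
                                price_in=1, price_out=5)
print(baseline, round(ratio, 2))
```

Under these assumed numbers, keeping only 20% of the prompt saves money only while outputs grow by less than about 16%, so a heavy right tail in output length can erase the saving even when the mean expansion is small.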

[19] Plato's Cave: A Human-Centered Research Verification System

Matheus Kunzler Maldaner,Raul Valle,Junsung Kim,Tonuka Sultan,Pranav Bhargava,Matthew Maloni,John Courtney,Hoang Nguyen,Aamogh Sawant,Kristian O'Connor,Stephen Wormald,Damon L. Woodard

Main category: cs.CL

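TL;DR: Plato's Cave is an open-source, human-centered research verification system that builds a directed acyclic graph (DAG) from a document, uses web agents to assess the credibility of nodes and edges, and interprets the paper's argumentative structure to produce a final credibility score.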

Details

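Motivation: Research papers are being published at an accelerating rate, creating an urgent need for better ways to fact-check information, assess writing quality, and identify unverifiable claims. Method: Builds a directed acyclic graph (DAG) from a document, uses web agents to assign credibility scores to the graph's nodes and edges, and produces a final score by interpreting and evaluating the argumentative structure. Result: The system was implemented and results are reported on a collected dataset of 104 research papers. Conclusion: Plato's Cave offers a novel, scalable, human-in-the-loop framework for automated assessment of research credibility. Abstract: The growing publication rate of research papers has created an urgent need for better ways to fact-check information, assess writing quality, and identify unverifiable claims. We present Plato's Cave as an open-source, human-centered research verification system that (i) creates a directed acyclic graph (DAG) from a document, (ii) leverages web agents to assign credibility scores to nodes and edges from the DAG, and (iii) gives a final score by interpreting and evaluating the paper's argumentative structure. We report the system implementation and results on a collected dataset of 104 research papers.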

[20] Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression

Warren Johnson

Main category: cs.CL

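TL;DR: This paper studies the practical impact of prompt compression on LLM output length and inference cost. It proposes instruction survival probability (Psi), a structural metric of how much task-critical prompt content survives compression, and finds that the choice of benchmark strongly affects compression outcomes. It also introduces the Compression Robustness Index (CRI) for cross-benchmark evaluation and stresses that direct energy measurements are needed to assess compression benefits accurately.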

Details

Motivation: Prompt compression is usually evaluated only by input-token reduction, but its real deployment impact depends on how it changes output length and total inference cost; prior studies reached conflicting conclusions about output expansion, which calls for a systematic explanation. Method: A controlled replication and extension study covering 5,400 API calls, three benchmarks (including MBPP and HumanEval), and multiple model providers; it formalizes instruction survival probability (Psi), a structural metric of whether task-critical prompt segments remain after truncation; introduces the Compression Robustness Index (CRI) for cross-benchmark comparison; and uses direct NVML GPU energy measurements to contextualize energy claims. Result: A strong benchmark effect: at r=0.3, DeepSeek's output expands 56x on MBPP (Psi approx 0.15) but only 5x on HumanEval (Psi approx 0.72), while GPT-4o-mini is comparatively stable; single-benchmark evaluation can mislead about compression safety and efficiency; token savings can overstate joule savings. Conclusion: Prompt structure, not provider identity alone, is the primary moderator of compression effects; benchmark-diverse testing and structure-aware compression policies are needed for reliable, energy-conscious LLM deployment. Abstract: Prompt compression is often evaluated by input-token reduction, but its real deployment impact depends on how compression changes output length and total inference cost. We present a controlled replication and extension study of benchmark-dependent output dynamics under aggressive compression, covering 5,400 API calls across three benchmarks and multiple providers. To explain conflicting prior observations, we formalize instruction survival probability (Psi), a structural metric that captures whether task-critical prompt segments remain after truncation. Results show a strong benchmark effect: under r=0.3, DeepSeek exhibits severe output expansion on MBPP (56x, Psi approx 0.15) but substantially lower expansion on HumanEval (5x, Psi approx 0.72), while GPT-4o-mini is comparatively stable across benchmarks. This reconciles the apparent discrepancy between previously reported extreme explosion and lower replication effects by identifying prompt structure, not provider identity alone, as the primary moderator. We introduce the Compression Robustness Index (CRI) for cross-benchmark evaluation and show that single-benchmark assessments can produce misleading conclusions about compression safety and efficiency. To contextualize energy claims, we incorporate companion direct NVML measurements from rented RunPod GPUs and show that token savings can overstate joule savings. These findings motivate benchmark-diverse testing and structure-aware compression policies for reliable, energy-conscious LLM deployment.

[21] The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression

Warren Johnson

Main category: cs.CL

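TL;DR: This paper studies how prompt compression affects LLM inference energy use and finds that reducing input tokens alone does not reliably lower energy consumption, with large differences across models.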

Details

Motivation: The rapid growth of large language models has created an environmental paradox: a technology that could help address climate challenges is itself becoming a significant source of carbon emissions. It is therefore worth testing whether methods such as prompt compression actually improve inference energy efficiency. Method: Across 28,421 API trials, tests three models (GPT-4o-mini, Claude-3.5-Sonnet, DeepSeek-Chat), five benchmarks (HumanEval, MBPP, GSM8K, MATH, MMLU), and four compression ratios (r = 1.0, 0.7, 0.5, 0.3), estimating energy with a token-based proxy calibrated against local direct measurements and tracking benchmark pass rates to assess quality. Result: Compression caused substantial quality loss (overall pass rate 26.0% at baseline vs. 1.5% at r=0.7), and the energy response was strongly model-dependent: DeepSeek's output grew from 21 to 798 tokens at r=0.3, with energy increases of up to 2,140%; GPT-4o-mini showed a reduction at r=0.5 but mixed effects overall. Conclusion: Reducing input tokens alone is not a reliable energy optimization strategy for production inference; in the evaluated settings, model selection and output-length control gave more consistent energy-quality tradeoffs. Abstract: The rapid proliferation of Large Language Models has created an environmental paradox: the very technology that could help solve climate challenges is itself becoming a significant contributor to global carbon emissions. We test whether prompt compression improves inference energy efficiency in 28,421 successful API trials (28,428 planned) across three providers (OpenAI GPT-4o-mini, Anthropic Claude-3.5-Sonnet, and DeepSeek-Chat), five benchmarks (HumanEval, MBPP, GSM8K, MATH, MMLU), and four compression ratios (r in {1.0, 0.7, 0.5, 0.3}). Energy is estimated with a token-based proxy calibrated against local direct measurements, and quality is tracked with benchmark pass rates. Compression produced substantial quality loss (overall pass rate 26.0% at baseline vs. 1.5% at r=0.7) and strongly provider-dependent energy behavior. DeepSeek exhibited output expansion under compression (21 to 798 tokens at r=0.3), corresponding to energy increases up to +2,140%, while GPT-4o-mini showed mixed effects including a reduction at r=0.5. These results indicate that input-token reduction alone is not a reliable energy optimization strategy in production inference. For the evaluated settings, model selection and output-length control provided more consistent energy-quality tradeoffs than prompt compression.

[22] Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language

Reuben Chagas Fernandes,Gaurang S. Patkar

Main category: cs.CL

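TL;DR: This paper introduces Konkani-Instruct-100k, a synthetic instruction-tuning dataset, and uses it to build LLMs for Konkani across multiple scripts (Devanagari, Romi, Kannada), improving performance for this low-resource language on tasks such as machine translation.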

Details

Motivation: As a low-resource language, Konkani suffers from severe training data scarcity and from the coexistence of several scripts, so existing LLMs perform poorly on it. Method: Generates the synthetic instruction dataset Konkani-Instruct-100k with Gemini 3; benchmarks open-weights models (Llama 3.1, Qwen2.5, Gemma 3) alongside closed-source models; fine-tunes a series of Konkani LLM models; and develops the Multi-Script Konkani Benchmark for cross-script evaluation. Result: In machine translation, Konkani LLM consistently outperforms the corresponding base models and surpasses closed-source baselines in several settings. Conclusion: Synthetic data plus targeted fine-tuning can ease the performance bottleneck of LLMs on low-resource, multi-script languages and offers a reusable recipe for similar languages. Abstract: Large Language Models (LLMs) consistently underperform in low-resource linguistic contexts such as Konkani. This performance deficit stems from acute training data scarcity compounded by high script diversity across Devanagari, Romi and Kannada orthographies. To address this gap, we introduce Konkani-Instruct-100k, a comprehensive synthetic instruction-tuning dataset generated through Gemini 3. We establish rigorous baseline benchmarks by evaluating leading open-weights architectures including Llama 3.1, Qwen2.5 and Gemma 3 alongside proprietary closed-source models. Our primary contribution involves the development of Konkani LLM, a series of fine-tuned models optimized for regional nuances. Furthermore, we are developing the Multi-Script Konkani Benchmark to facilitate cross-script linguistic evaluation. In machine translation, Konkani LLM delivers consistent gains over the corresponding base models and is competitive with, and in several settings surpasses, proprietary baselines.

[23] Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

Avni Mittal

Main category: cs.CL

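TL;DR: This paper studies why LLMs comply less with formatting instructions while performing demanding tasks. Compliance drops under concurrent task load, terminal constraints are the most fragile, formatting constraints can in turn reduce task accuracy, and a salience-enhanced instruction format recovers much of the lost compliance.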

Details

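Motivation: LLMs often fail when they must both perform a demanding task and satisfy formatting instructions; this paper investigates why and proposes a mitigation. Method: Taking a prospective-memory perspective from cognitive psychology, builds a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity; compliance is evaluated with deterministic programmatic checkers across three model families and over 8,000 prompts. Result: Concurrent task load lowers format compliance by 2-21%; terminal constraints are the most fragile (drops of up to 50%), while avoidance constraints are comparatively robust; a salience-enhanced format restores compliance to 90-100% in many settings; formatting constraints can also reduce task accuracy (for one model, GSM8K accuracy drops from 93% to 27%); stacking multiple constraints sharply reduces joint compliance. Conclusion: Format compliance trades off against task difficulty, and its fragility depends on constraint type; making instructions more salient mitigates the problem; system design should account for the bidirectional interference between formatting constraints and task performance. Abstract: Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective-memory-inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.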
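
As an illustration of what a deterministic programmatic checker might look like, the Python sketch below implements one terminal constraint, one avoidance constraint, and a joint-compliance check. The specific constraints (an end marker and a banned word) and all function names are assumptions for illustration; the abstract does not list the paper's actual constraint set.

```python
import re


def ends_with_marker(response, marker):
    """Terminal constraint: the response must end with `marker`
    (trailing whitespace is ignored)."""
    return response.rstrip().endswith(marker)


def avoids_word(response, word):
    """Avoidance constraint: `word` must not appear as a whole word
    (case-insensitive)."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is None


def joint_compliance(response, checks):
    """A response is jointly compliant only if every check passes."""
    return all(check(response) for check in checks)


response = "The total is 42.\n[END]\n"
checks = [
    lambda r: ends_with_marker(r, "[END]"),
    lambda r: avoids_word(r, "very"),
]
print(joint_compliance(response, checks))
```

Because each check is a pure function of the response text, the same response always receives the same verdict, which is the property that lets such checkers replace an LLM-as-judge step.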

[24] Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction

Özgür Togay,Florian Kunneman,Javier Garcia-Bernardo,Anastasia Giachanou

Main category: cs.CL

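TL;DR: This paper uses large language models (LLMs) for Target-Stance Extraction (TSE) to analyze beliefs and stances in online political discourse at a finer granularity. It builds a Reddit dataset covering 138 political targets and evaluates LLMs under zero-shot, few-shot, and context-augmented prompting, with the best models performing comparably to human annotators.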

Details

Motivation: Most computational analyses rely on coarse partisan labels and overlook how beliefs about policies, figures, and issues interact in political discourse; automatically identifying the target of discussion and the stance toward it is especially hard. Method: Adopts the Target-Stance Extraction (TSE) task, builds a dataset of 1,084 Reddit posts covering 138 political targets, and systematically evaluates a range of proprietary and open-source LLMs with zero-shot, few-shot, and context-augmented prompting. Result: The best models perform comparably to highly trained human annotators and remain robust on difficult posts with low inter-annotator agreement (IAA). Conclusion: LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis. Abstract: Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward them. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.

[25] Generating Hierarchical JSON Representations of Scientific Sentences Using LLMs

Satya Sri Rajiteswari Nimmagadda,Ethan Young,Niladri Sengupta,Ananya Jana,Aniruddha Maiti

Main category: cs.CL

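TL;DR: This paper asks whether structured representations can preserve the meaning of scientific sentences. A lightweight LLM is fine-tuned to generate hierarchical JSON structures, a generative model reconstructs the original text from them, and the comparison indicates that hierarchical formats retain the information in scientific text.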

Details

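Motivation: To test whether structured representations such as hierarchical JSON can preserve the semantic content of scientific sentences. Method: Fine-tunes a lightweight LLM with a newly proposed structural loss function to generate hierarchical JSON from sentences in scientific articles; a generative model then reconstructs the original text from the JSON; the original and reconstructed sentences are compared using semantic and lexical similarity. Result: Sentences reconstructed from the hierarchical JSON are highly similar to the originals both semantically and lexically, indicating that the structure retains the information in scientific text. Conclusion: Hierarchical structured representations such as JSON are a feasible and effective way to preserve the meaning of scientific text and suggest a direction for structured modeling of scientific knowledge. Abstract: This paper investigates whether structured representations can preserve the meaning of scientific sentences. To test this, a lightweight LLM is fine-tuned using a novel structural loss function to generate hierarchical JSON structures from sentences collected from scientific articles. These JSONs are then used by a generative model to reconstruct the original text. Comparing the original and reconstructed sentences using semantic and lexical similarity, we show that hierarchical formats are capable of retaining the information in scientific texts effectively.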

[26] MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG

Bhavik Mangla

Main category: cs.CL

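TL;DR: This paper presents MDKeyChunker, a three-stage chunking and metadata-extraction method for RAG over Markdown documents that combines structure-aware chunking, multi-field metadata extraction in a single LLM call, and semantic-key-driven chunk merging to improve retrieval.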

Details

Motivation: Fixed-size chunking in conventional RAG ignores document structure, splits semantic units, and needs multiple LLM calls per chunk to extract metadata, which is inefficient and loses context. Method: MDKeyChunker has three stages: (1) structure-aware chunking that treats headers, code blocks, tables, and lists as atomic units; (2) a single LLM call per chunk that extracts seven metadata fields and propagates a rolling dictionary of semantic keys to maintain document-level context; (3) merging of chunks that share a semantic key via bin-packing. Result: On 30 queries over an 18-document Markdown corpus, BM25 over structural chunks (Config D) reaches Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. Conclusion: Through structure-aware processing and LLM-native semantic matching, MDKeyChunker improves chunk quality, metadata-extraction efficiency, and retrieval performance in RAG, with a lightweight and broadly compatible implementation. Abstract: RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.
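
For readers unfamiliar with the two reported metrics, the Python sketch below shows the standard definitions of Recall@k and mean reciprocal rank (MRR). How the paper defines a relevant chunk for each query is not stated in the abstract, so the chunk IDs and relevance sets here are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with made-up chunk IDs.
runs = [
    (["c3", "c1", "c7"], ["c1"]),  # first relevant chunk at rank 2
    (["c2", "c9", "c4"], ["c2"]),  # first relevant chunk at rank 1
]
print(mean_reciprocal_rank(runs))
```

A Recall@5 of 1.000 means every query had all of its relevant chunks within the top five results, while an MRR of 0.911 indicates that the first relevant chunk was usually, but not always, ranked first.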

[27] Not All Pretraining are Created Equal: Threshold Tuning and Class Weighting for Imbalanced Polarization Tasks in Low-Resource Settings

Abass Oguntade

Main category: cs.CL

TL;DR: This paper presents a Transformer-based approach for detecting and classifying polarization in English and Swahili social media text, using multilingual and African language-specific models, along with techniques to handle class imbalance; it achieves strong performance on binary detection and competitive results on multi-label tasks, while identifying key challenges like implicit polarization and code-switching.

Details

Motivation: To address the need for effective polarization detection and classification in social media text, particularly for under-resourced languages like Swahili, and to tackle severe class imbalance in the task. Method: Uses Transformer-based models (mDeBERTa-v3-base, SwahBERT, AfriBERTa-large) for three subtasks: binary polarization detection, multi-label target type classification, and multi-label manifestation identification; employs class-weighted loss, iterative stratified splitting, and per-label threshold tuning. Result: Best model (mDeBERTa-v3-base) achieves 0.8032 macro-F1 on binary polarization detection validation; up to 0.556 macro-F1 on multi-label tasks; error analysis highlights difficulties with implicit polarization, code-switching, and distinguishing heated discourse from true polarization. Conclusion: Multilingual and African language-specialized Transformers, combined with imbalance-handling strategies, are effective for polarization analysis in English and Swahili, but challenges remain in modeling subtle and linguistically complex cases. Abstract: This paper describes my submission to the Polarization Shared Task at SemEval-2025, which addresses polarization detection and classification in social media text. I develop Transformer-based systems for English and Swahili across three subtasks: binary polarization detection, multi-label target type classification, and multi-label manifestation identification. The approach leverages multilingual and African language-specialized models (mDeBERTa-v3-base, SwahBERT, AfriBERTa-large), class-weighted loss functions, iterative stratified data splitting, and per-label threshold tuning to handle severe class imbalance. The best configuration, mDeBERTa-v3-base, achieves 0.8032 macro-F1 on validation for binary detection, with competitive performance on multi-label tasks (up to 0.556 macro-F1). Error analysis reveals persistent challenges with implicit polarization, code-switching, and distinguishing heated political discourse from genuine polarization.
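
Per-label threshold tuning can be sketched in a few lines of Python: for each label, scan candidate thresholds on validation scores and keep the one with the best F1. This is a generic illustration of the technique, not the author's code; the scores, labels, and candidate grid are made up.

```python
def f1_score(y_true, y_pred):
    """Binary F1 computed from 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def tune_threshold(y_true, scores, candidates):
    """Return (threshold, f1) for the candidate with the highest F1.
    Ties keep the earliest candidate."""
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical validation data for one rare label: the positives score
# below 0.5, so the default cut-off would miss both of them.
y_true = [1, 0, 1, 0]
scores = [0.40, 0.10, 0.35, 0.25]
print(tune_threshold(y_true, scores, [0.2, 0.3, 0.5]))
```

In a multi-label setting the same function would be called once per label column, which lets rare labels use a lower cut-off than frequent ones.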

[28] Revisiting Real-Time Digging-In Effects: No Evidence from NP/Z Garden-Paths

Amani Maina-Kilaas,Roger Levy

Main category: cs.CL

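TL;DR: Two experiments test the "digging-in effect" in sentence processing and find that it is not a real-time processing phenomenon but a confound from sentence-final wrap-up effects; results in non-final conditions instead match the predictions of neural language models.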

Details

Motivation: To clarify whether the digging-in effect is a genuine phenomenon of human real-time sentence processing or an artifact of sentence-final wrap-up processes or methodological confounds. Method: Uses the Maze task and self-paced reading to examine how the length of the ambiguous region affects processing difficulty in English NP/Z garden-path sentences, and compares human behavior with predictions from several large language models. Result: No real-time digging-in effect was found; positive digging-in trends appear only with sentence-final disambiguation (where wrap-up effects interfere), while non-final items show the reverse trend, consistent with neural language model predictions. Conclusion: Digging-in is not a robust real-time sentence-processing phenomenon, and its appearance sentence-finally may stem from wrap-up confounds; human real-time syntactic processing is more consistent with surprisal-based probabilistic models than with the self-organized strengthening hypothesis. Abstract: Digging-in effects, where disambiguation difficulty increases with longer ambiguous regions, have been cited as evidence for self-organized sentence processing, in which structural commitments strengthen over time. In contrast, surprisal theory predicts no such effect unless lengthening genuinely shifts statistical expectations, and neural language models appear to show the opposite pattern. Whether digging-in is a robust real-time phenomenon in human sentence processing -- or an artifact of wrap-up processes or methodological confounds -- remains unclear. We report two experiments on English NP/Z garden-path sentences using Maze and self-paced reading, comparing human behavior with predictions from an ensemble of large language models. We find no evidence for real-time digging-in effects. Critically, items with sentence-final versus nonfinal disambiguation show qualitatively different patterns: positive digging-in trends appear only sentence-finally, where wrap-up effects confound interpretation. Nonfinal items -- the cleaner test of real-time processing -- show reverse trends consistent with neural model predictions.
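
Surprisal, the quantity that surprisal theory links to processing difficulty, is the negative log probability of a word given its context. The Python sketch below computes it in bits; the probabilities are invented for illustration and do not come from the paper's language-model ensemble.

```python
import math


def surprisal_bits(probability):
    """Surprisal in bits: -log2 P(word | context)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical probabilities of the disambiguating word after a short
# and after a long ambiguous region.
p_short, p_long = 0.25, 0.5
print(surprisal_bits(p_short), surprisal_bits(p_long))
```

Under this account, lengthening the ambiguous region makes disambiguation harder only if it lowers the probability of the disambiguating word; if the probability rises, as in the made-up numbers above, predicted difficulty falls.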

[29] Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks

Fatih Uenal

Main category: cs.CL

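TL;DR: This paper introduces Swiss-Bench SBP-002, the first trilingual benchmark for Swiss regulatory compliance tasks, with 395 expert-crafted items spanning three regulatory domains, seven task types, and three languages (German, French, Italian). Ten frontier LLMs are evaluated with a blind three-judge scoring framework; overall performance is limited (at most 38.2% correct), and open-weight models outperform some closed-source models.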

Details

Motivation: Existing benchmarks do not evaluate LLMs on applied Swiss regulatory compliance tasks, so application-oriented empirical evidence is missing. Method: Builds the trilingual, multi-domain, multi-task Swiss-Bench SBP-002 benchmark (395 items); scores responses on three structured dimensions with a blind panel of three LLM judges (GPT-4o, Claude Sonnet 4, Qwen3-235B), checks reliability with weighted kappa, and has a human legal expert validate the reference answers. Result: Performance falls into three tiers (Tier A: 35-38%, Tier B: 26-29%, Tier C: 13-21%); Qwen 3.5 Plus ranks first (38.2% correct) but is still incorrect on 47.3% of items; task difficulty varies widely (legal translation and case analysis reach 69-72%, while regulatory Q&A, hallucination detection, and gap analysis stay below 9%); several open-weight models match or outperform closed-source models. Conclusion: Current frontier LLMs remain quite limited on Swiss regulatory compliance tasks under zero-retrieval conditions; Swiss-Bench SBP-002 provides an initial empirical reference point for further research and for assessing regulatory AI capability. Abstract: While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian), and evaluate ten frontier models from March 2026 using a structured three-dimension scoring framework assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and weighted kappa = 0.605, with reference answers validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35-38% correct), Tier B (26-29%), and Tier C (13-21%). The benchmark proves difficult: even the top-ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69-72% correct rates, while regulatory Q&A, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open-weight, three closed-source), an open-weight model leads the ranking, and several open-weight models match or outperform their closed-source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions.
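
Majority-vote aggregation over a three-judge panel can be sketched as follows in Python. The label names and the handling of three-way disagreement are assumptions; the abstract does not say how the benchmark resolves a panel in which every judge gives a different label.

```python
from collections import Counter


def majority_vote(labels, tie_break=None):
    """Return the label chosen by most judges.
    If the top two labels are tied, return `tie_break` instead
    (an assumed policy, not taken from the paper)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_break
    return counts[0][0]


# Hypothetical verdicts from three judges on two items.
print(majority_vote(["correct", "correct", "incorrect"]))
print(majority_vote(["correct", "partial", "incorrect"], tie_break="partial"))
```

With three judges and three possible labels, a tie arises only when all judges disagree, so the tie-break policy affects few items but should still be reported alongside the agreement statistic.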

[30] Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages

Badr M. Abdullah,Israel Abebe Azime,Atnafu Lambebo Tonja,Jesujoba O. Alabi,Abel Mulat Alemu,Eyob G. Hagos,Bontu Fufa Balcha,Mulubrhan A. Nerea,Debela Desalegn Yadeta,Dagnachew Mekonnen Marilign,Amanuel Temesgen Fentahun,Tadesse Kebede,Israel D. Gebru,Michael Melese Woldeyohannis,Walelign Tewabe Sewunetie,Bernd Möbius,Dietrich Klakow

Main category: cs.CL

TL;DR: Ethio-ASR is a multilingual CTC-based ASR system for five underrepresented Ethiopian languages, achieving state-of-the-art WER with fewer parameters than OmniASR and including analyses of linguistic features and bias.

Details

Motivation: Ethiopian languages are severely underrepresented in speech technology despite being spoken by most of Ethiopia's population; there is a need for accessible, high-quality ASR models for these low-resource Afroasiatic languages. Method: Jointly trained multilingual CTC-based ASR models on the WAXAL corpus using various pre-trained speech encoders; included analysis of gender bias, vowel length, consonant gemination, and training dynamics. Result: Best model achieves 30.48% average WER on WAXAL test set, outperforming OmniASR with fewer parameters; comprehensive linguistic and bias analyses conducted. Conclusion: Ethio-ASR demonstrates effective multilingual ASR for under-resourced Ethiopian languages and provides open models, code, and insights into linguistic challenges and fairness in ASR. Abstract: We present Ethio-ASR, a suite of multilingual CTC-based automatic speech recognition (ASR) models jointly trained on five Ethiopian languages: Amharic, Tigrinya, Oromo, Sidaama, and Wolaytta. These languages belong to the Semitic, Cushitic, and Omotic branches of the Afroasiatic family, and remain severely underrepresented in speech technology despite being spoken by the vast majority of Ethiopia's population. We train our models on the recently released WAXAL corpus using several pre-trained speech encoders and evaluate against strong multilingual baselines, including OmniASR. Our best model achieves an average WER of 30.48% on the WAXAL test set, outperforming the best OmniASR model with substantially fewer parameters. We further provide a comprehensive analysis of gender bias, the contribution of vowel length and consonant gemination to ASR errors, and the training dynamics of multilingual CTC models. Our models and codebase are publicly available to the research community.
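
Word error rate (WER), the metric reported above, is the word-level edit distance between reference and hypothesis divided by the number of reference words. The Python sketch below is a generic implementation with placeholder strings; it applies no text normalization, and the paper's own scoring pipeline may differ.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) divided by
    the number of reference words, via Levenshtein distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(wer("a b c d", "a x c"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.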

[31] Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges

Weilun Xu,Alexander Rusnak,Frederic Kaplan

Main category: cs.CL

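TL;DR: This paper examines whether the internal representations of large language models (LLMs) distinguish between normative ethical frameworks (such as deontology and utilitarianism) when making ethical judgments; it finds differentiated ethical subspaces, along with asymmetric transfer and a dependence on surface features.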

Details

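Motivation: To determine whether LLM internal representations genuinely distinguish between ethical frameworks or merely collapse ethics into a single acceptability dimension. Method: Trains probes for five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) on six LLMs with 4B-72B parameters, analyzes how well the hidden representations separate and transfer across frameworks, and runs post-hoc validation to test how much the probes depend on surface features of the templates. Result: Finds structured, differentiated ethical subspaces; transfer between frameworks is asymmetric (deontology probes partially generalize to virtue scenarios, while commonsense probes fail badly on justice scenarios); disagreement between deontological and utilitarian probes correlates positively with behavioral entropy; probe performance partly depends on surface features of the benchmark templates. Conclusion: LLMs show some differentiation of ethical structure, but it is confounded by scenario difficulty and surface cues; current probing methods provide structural insight yet have notable epistemological limitations and should be interpreted with caution. Abstract: When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B--72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns -- e.g., deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious interpretation. We discuss both the structural insights these methods provide and their epistemological limitations.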

[32] PLACID: Privacy-preserving Large language models for Acronym Clinical Inference and Disambiguation

Manjushree B. Aithal,Ph. D.,Alexander Kotz,James Mitchell,Ph. D

Main category: cs.CL

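TL;DR: This paper proposes a privacy-preserving cascaded pipeline that uses locally deployed small models (2B-10B parameters) for clinical acronym disambiguation, balancing data privacy with accuracy.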

Details

Motivation: Strict data privacy constraints make it hard to integrate LLMs into healthcare, yet clinical text is full of ambiguous acronyms whose misinterpretation can lead to serious medication errors; cloud-dependent models are good at acronym disambiguation, but transmitting Protected Health Information to external servers violates privacy regulations. Method: Proposes a privacy-first cascaded pipeline: a general-purpose lightweight local model first detects clinical acronyms, and the detections are then routed to domain-specific biomedical models for context-relevant expansion. Result: General instruction-following models reach a detection accuracy of ~0.988 but an expansion accuracy of only ~0.655; the cascaded approach raises expansion accuracy to ~0.81. Conclusion: Small on-device models in the 2B-10B parameter range can deliver high-fidelity clinical acronym disambiguation while remaining privacy-compliant and practical. Abstract: Large Language Models (LLMs) offer transformative solutions across many domains, but healthcare integration is hindered by strict data privacy constraints. Clinical narratives are dense with ambiguous acronyms, and misinterpreting these abbreviations can precipitate severe outcomes like life-threatening medication errors. While cloud-dependent LLMs excel at Acronym Disambiguation, transmitting Protected Health Information to external servers violates privacy frameworks. To bridge this gap, this study pioneers the evaluation of small-parameter models deployed entirely on-device to ensure privacy preservation. We introduce a privacy-preserving cascaded pipeline leveraging general-purpose local models to detect clinical acronyms, routing them to domain-specific biomedical models for context-relevant expansions. Results reveal that while general instruction-following models achieve high detection accuracy (~0.988), their expansion capabilities plummet (~0.655). Our cascaded approach utilizes domain-specific medical models to increase expansion accuracy to ~0.81. This novel work demonstrates that privacy-preserving, on-device (2B-10B) models deliver high-fidelity clinical acronym disambiguation support.

[33] The Diminishing Returns of Early-Exit Decoding in Modern LLMs

Rui Wei,Rui Du,Hanfei Yu,Devesh Tiwari,Jian Li,Zhaozhuo Xu,Hao Wang

Main category: cs.CL

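TL;DR: This paper re-evaluates the effectiveness of layer-wise early-exit in modern large language models (LLMs). It proposes a new metric and a benchmark for a model's intrinsic suitability for early-exit, and finds that early-exit effectiveness declines across newer model generations, that dense Transformers have more early-exit potential than MoE and SSM models, and that larger models (>20B parameters) and base models without specialized tuning are better suited to early-exit.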

Details

Motivation: Recent improvements in LLM pretraining recipes and architectures reduce layer redundancy and may weaken early-exit mechanisms, so the feasibility of early-exit in modern LLMs needs a systematic re-evaluation. Method: Propose a metric that quantifies a model's intrinsic suitability for early-exit; build a benchmark for early-exit research; analyze how the evolution of intermediate representations relates to early-exit potential across architectures (dense Transformer, MoE, SSM), scales, and training stages (base vs. tuned models). Result: Early-exit effectiveness weakens across newer model generations; dense Transformers are better suited to early-exit than MoE and SSM models; models with more than 20B parameters and base models without specialized tuning show higher early-exit potential. Conclusion: Early-exit is not a universal optimization; its applicability depends heavily on model architecture, scale, and training state. The work offers a new evaluation perspective and a practical benchmark for efficient LLM inference. Abstract: In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a model's intrinsic suitability for early-exit and propose a benchmark for researchers to explore the potential early-exit benefits on different models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.
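
The early-exit rule described in the abstract (stop at an intermediate layer once the prediction is sufficiently confident) can be illustrated with a minimal sketch. This is not the paper's metric or benchmark: the function names, the softmax-confidence criterion, and the `threshold` parameter are assumptions chosen for illustration, and the per-layer logits are assumed to come from some intermediate prediction head.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_layer(per_layer_logits, threshold=0.9):
    """Return (layer_index, token_index) for the first layer whose top
    probability reaches `threshold`; the last layer always exits.

    `per_layer_logits` is a hypothetical list of logit vectors, one per layer.
    """
    last = len(per_layer_logits) - 1
    for i, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold or i == last:
            return i, top
    raise ValueError("per_layer_logits must be non-empty")
```

Under this sketch, a model with little layer redundancy would tend to reach the threshold only in its final layers, which is one way to picture the diminishing returns the abstract reports.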

[34] IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

Ali Abdelaal,Mohammed Nader Al Haffar,Mahmoud Fawzi,Walid Magdy

Main category: cs.CL

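TL;DR: This paper introduces IslamicMMLU, a benchmark of 10,013 multiple-choice questions covering three Islamic disciplines (Quran, Hadith, and Fiqh) for evaluating large language models on Islamic knowledge, together with a public leaderboard.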

Details

Motivation: Large language models are increasingly used to provide Islamic knowledge, but no comprehensive benchmark evaluates their performance across core Islamic disciplines. Method: Build the IslamicMMLU benchmark with three tracks, Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (4,000 questions), each containing multiple question types; design a madhab (school of jurisprudence) preference detection task within the Fiqh track; systematically evaluate 26 LLMs. Result: Average accuracy across the three tracks ranges from 39.8% to 93.8% over the 26 models; the Quran track has the widest span (32.4%-99.3%); the Fiqh task reveals differing school-of-thought preferences across models; Arabic-specific models underperform frontier general-purpose models overall. Conclusion: IslamicMMLU provides the first comprehensive benchmark and public evaluation platform for LLMs on Islamic knowledge, exposing notable gaps and biases in current models' handling of religious knowledge and supporting more reliable and fair religious AI. Abstract: Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track comprises multiple question types to examine LLMs' capabilities in handling different aspects of Islamic knowledge. The benchmark is used to create the IslamicMMLU public leaderboard for evaluating LLMs, and we initially evaluate 26 LLMs, whose averaged accuracy across the three tracks ranged from 39.8% to 93.8% (the latter by Gemini 3 Flash). The Quran track shows the widest span (99.3% to 32.4%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task revealing variable school-of-thought preferences across models. Arabic-specific models show mixed results, but they all underperform compared to frontier models. The evaluation code and leaderboard are made publicly available.

[35] Infrequent Child-Directed Speech Is Bursty and May Draw Infant Vocalizations

Margaret Cychosz,Adriana Weisleder

Main category: cs.CL

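TL;DR: This study compares the temporal patterns of child-directed speech (CDS) heard by infants in rural Bolivia and the urban U.S., and how those patterns relate to infants' pre-linguistic vocalizations. It finds that the temporal concentration of CDS (not just its quantity) and its source (especially older children) may play a key role in language development.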

Details

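Motivation: To explain why infants in environments with little child-directed speech (such as rural Bolivia) still reach major language development milestones, and to examine the role of qualitative properties of the input such as temporal distribution and speaker identity. Method: Using longform, infant-centered audio recordings, compare the temporal patterns of speech input and infants' own pre-linguistic vocalizations in rural Bolivia and the urban U.S.; statistically analyze the co-occurrence of CDS periods and infant vocalizations; and separate the effects of adults and older children as CDS sources. Result: 1) CDS in Bolivia, though lower in total amount, is just as temporally clustered; 2) in both communities, infants are about twice as likely to produce speech-like vocalizations during CDS as during silence; 3) in Bolivia, infants' speech-like vocalizations are more likely during CDS from older children than from adults. Conclusion: The contribution of child-directed speech to language development may depend not only on quantity but also on temporal concentration and speaker identity; in communities where adult speech to infants is sparse, older children can be an important source of language input. Abstract: Children in many parts of the world hear relatively little speech directed to them, yet still reach major language development milestones. What differs about the speech input that infants learn from when directed input is rare? Using longform, infant-centered audio recordings taken in rural Bolivia and the urban U.S., we examined temporal patterns of infants' speech input and their pre-linguistic vocal behavior. We find that child-directed speech in Bolivia, though less frequent, was just as temporally clustered as speech input in the U.S., arriving in concentrated bursts rather than spread across the day. In both communities, infants were most likely to produce speech-like vocalizations during periods of speech directed to them, with the probability of infants' speech-like vocalizations during target child-directed speech nearly double that during silence. In Bolivia, infants' speech-like vocalizations were also more likely to occur during bouts of directed speech from older children than from adults. Together, these findings suggest that the developmental impact of child-directed speech may depend not only on quantity, but on temporal concentration and source, with older children serving as an important source of input in some communities, including where adult speech to infants is less frequent.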

[36] Perturbation: A simple and efficient adversarial tracer for representation learning in language models

Joshua Rozner,Cory Shain

Main category: cs.CL

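TL;DR: This paper proposes a new perturbation-propagation method for studying linguistic representations in language models, avoiding the dilemma earlier methods face between imposing geometric assumptions and trivializing the notion of representation.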

Details

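Motivation: Existing methods for studying representation learning in language models face a dilemma: they either impose implausible constraints (such as linearity) or make the notion of representation vacuous; a more reasonable, empirically driven definition of representation is needed. Method: Redefine representations as "conduits for learning": fine-tune the language model on a single adversarial example and observe how the perturbation "infects" other examples, using this to measure representational structure and generalization without any geometric assumptions. Result: The perturbation method finds no spurious representations in untrained models, while in trained language models it reveals structured transfer at multiple linguistic grain sizes, suggesting that models acquire linguistic abstractions from experience alone and generalize along representational lines. Conclusion: Representations in language models are better understood as functional structures that support learning and generalization than as static activation patterns; perturbation propagation is a more reliable and interpretable tool for probing them. Abstract: Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al., 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.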

[37] PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

Rohan Khetan,Ashna Khetan

Main category: cs.CL

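TL;DR: This study uses PoliticsBench (a multi-turn roleplay framework adapted from EQ-Bench-v3) to evaluate the political value leanings of eight prominent large language models (LLMs) across 20 evolving scenarios. Seven models show a systematic left lean and only Grok leans right; every left-leaning model strongly exhibits liberal traits and moderately exhibits conservative ones; alignment scores show no clear pattern across stages; most models use consequence-based reasoning, whereas Grok more often appeals to facts and statistics.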

Details

Motivation: Existing evaluations of LLM social bias focus mainly on gender and racial stereotypes, and political bias is usually measured only at a coarse level, neglecting the specific values that shape sociopolitical positions; a fine-grained, psychometrically grounded framework is needed to reveal LLMs' political value orientations. Method: Build PoliticsBench, a psychometric multi-turn roleplay framework derived from EQ-Bench-v3; have eight prominent LLMs (Claude, Deepseek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) respond in free text across 20 evolving sociopolitical scenarios; score their stated stances and chosen actions on ten political values; analyze how each model's leaning changes across roleplay stages and how reasoning patterns differ. Result: Seven of eight models lean left (Grok leans right); left-leaning models strongly exhibit liberal traits and moderately exhibit conservative ones; alignment scores show no consistent trend across stages; reasoning styles diverge, with most models relying on consequence-based reasoning and Grok frequently arguing from facts and statistics. Conclusion: Prominent commercial LLMs show measurable political value leanings (mostly left-leaning) that are consistent across models and structured by values; PoliticsBench offers the first psychometric evaluation of political values through multi-stage free-text interaction and highlights the importance of fine-grained value analysis for understanding LLMs' social impact. Abstract: While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate gender and racial stereotypes. When political bias is included, it is typically measured at a coarse level, neglecting the specific values that shape sociopolitical leanings. This study investigates political bias in eight prominent LLMs (Claude, Deepseek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) using PoliticsBench: a novel multi-turn roleplay framework adapted from the EQ-Bench-v3 psychometric benchmark. We test whether commercially developed LLMs display a systematic left-leaning bias that becomes more pronounced in later stages of multi-stage roleplay. Through twenty evolving scenarios, each model reported its stance and determined its course of action. Scoring these responses on a scale of ten political values, we explored the values underlying chatbots' deviations from unbiased standards. Seven of our eight models leaned left, while Grok leaned right. Each left-leaning LLM strongly exhibited liberal traits and moderately exhibited conservative ones. We discovered slight variations in alignment scores across stages of roleplay, with no particular pattern. Though most models used consequence-based reasoning, Grok frequently argued with facts and statistics. Our study presents the first psychometric evaluation of political values in LLMs through multi-stage, free-text interactions.

[38] Language Model Planners do not Scale, but do Formalizers?

Owen Jiang,Cassie Huang,Ashish Sabharwal,Li Zhang

Main category: cs.CL

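TL;DR: This paper studies how large language models (LLMs) perform when formalizing planning problems. It finds that LLM formalizers far outperform LLM planners, with some retaining high accuracy in very large state spaces such as BlocksWorld; proposes a divide-and-conquer formalization strategy that improves the robustness of smaller models; and, for "unraveling problems", introduces a new paradigm, LLM-as-higher-order-formalizer, in which the LLM generates a program generator so that output length is decoupled from the combinatorial explosion of the formalization and search space.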

Details

Motivation: Prior work shows that LLMs perform poorly on complex planning problems, but it is unclear how well they generate solver-oriented programs (the formalization task); in addition, a realistic problem description may expand exponentially when written in a formal language such as PDDL, which calls for new methods. Method: 1) Systematically evaluate LLM formalizers on planning domains such as the classic BlocksWorld; 2) propose a divide-and-conquer formalization technique; 3) define and construct "unraveling problems"; 4) propose the LLM-as-higher-order-formalizer paradigm, in which the LLM generates a program generator that produces formal code such as PDDL. Result: Some LLM formalizers retain perfect accuracy in BlocksWorld with state spaces of size up to $10^{165}$; the divide-and-conquer strategy markedly improves the robustness of smaller models; LLM-as-higher-order-formalizer alleviates the token bottleneck caused by the explosion of the formalization. Conclusion: LLMs are far more capable at formalization than at direct planning, and structured strategies (divide-and-conquer, higher-order generation) can push further past combinatorial-explosion limits, opening a new route for LLMs to support symbolic AI. Abstract: Recent work shows overwhelming evidence that LLMs, even those trained to scale their reasoning trace, perform unsatisfactorily when solving planning problems that are too complex. Whether the same conclusion holds for LLM formalizers that generate solver-oriented programs remains unknown. We systematically show that LLM formalizers greatly out-scale LLM planners, some retaining perfect accuracy in the classic BlocksWorld domain with a huge state space of size up to $10^{165}$. While the performance of smaller LLM formalizers degrades with problem complexity, we show that a divide-and-conquer formalizing technique can greatly improve their robustness. Finally, we introduce unraveling problems where one line of problem description realistically corresponds to exponentially many lines of formal language such as the Planning Domain Definition Language (PDDL), greatly challenging LLM formalizers. We tackle this challenge by introducing a new paradigm, namely LLM-as-higher-order-formalizer, where an LLM generates a program generator. This decouples token output from the combinatorial explosion of the underlying formalization and search space.
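
To make the "program generator" idea concrete, here is a minimal sketch of the kind of short program an LLM might emit in place of the PDDL text itself. It is an illustration only, not output from the paper: the function name `blocksworld_problem`, the problem layout (all blocks on the table, goal of a single stack), and the predicate names (`ontable`, `clear`, `handempty`, `on`) are assumptions modelled on the common 4-operator BlocksWorld domain.

```python
def blocksworld_problem(n_blocks: int) -> str:
    """Emit a PDDL problem: n blocks on the table, goal is one stack b1 on b2 on ...

    The generator stays the same size while the emitted PDDL grows with n_blocks.
    """
    blocks = [f"b{i}" for i in range(1, n_blocks + 1)]
    # Initial state: empty hand, every block on the table and clear.
    init = ["(handempty)"]
    for b in blocks:
        init += [f"(ontable {b})", f"(clear {b})"]
    # Goal: each block sits on the next one in the list.
    goal = [f"(on {a} {b})" for a, b in zip(blocks, blocks[1:])]
    return "\n".join([
        f"(define (problem stack-{n_blocks})",
        "  (:domain blocksworld)",
        f"  (:objects {' '.join(blocks)})",
        "  (:init " + " ".join(init) + ")",
        "  (:goal (and " + " ".join(goal) + ")))",
    ])

if __name__ == "__main__":
    print(blocksworld_problem(3))
```

The point of the design is that the model's token output is bounded by the size of the generator, while the formal text it produces can be arbitrarily long; a planner or solver then consumes the generated PDDL.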

[39] BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

Praveen Kumar Myakala,Manan Agrawal,Rahul Manche

Main category: cs.CL

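TL;DR: This paper introduces BeliefShift, a benchmark for evaluating how well large language models track changes in user beliefs over multi-session conversations (such as opinion drift and confirmation bias), and designs four new metrics for the accuracy and soundness of belief updates.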

Details

Motivation: Existing memory benchmarks treat user information as static facts and ignore that users' views evolve over time in real conversations (opinion drift, over-alignment, confirmation bias), so an evaluation paradigm closer to human cognitive dynamics is needed. Method: Build the longitudinal multi-session benchmark BeliefShift with three tracks (Temporal Belief Consistency, Contradiction Detection, Evidence-Driven Revision) and 2,400 human-annotated cross-domain interaction trajectories; evaluate seven mainstream LLMs under zero-shot and RAG settings; propose four new metrics: BRA, DCS, CRR, and ESI. Result: Models show a clear trade-off: strongly personalizing models resist drift poorly, while factually grounded models miss legitimate belief updates; all models show limited performance on the BeliefShift tasks, pointing to a basic weakness of current LLMs in modeling belief dynamics. Conclusion: BeliefShift shows that current LLMs fall well short at modeling changing user beliefs over long-term interaction, calls for a shift from static memory to dynamic belief tracking, and provides a quantifiable evaluation framework and metrics for future work. Abstract: LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved. That's the wrong model. People change their minds, and over extended interactions, phenomena like opinion drift, over-alignment, and confirmation bias start to matter a lot. BeliefShift introduces a longitudinal benchmark designed specifically to evaluate belief dynamics in multi-session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. The dataset includes 2,400 human-annotated multi-session interaction trajectories spanning health, politics, personal values, and product preferences. We evaluate seven models including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA-3, and Mistral-Large under zero-shot and retrieval-augmented generation (RAG) settings. Results reveal a clear trade-off: models that personalize aggressively resist drift poorly, while factually grounded models miss legitimate belief updates. We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).

[40] Self-Distillation for Multi-Token Prediction

Guoliang Zhao,Ruobing Xie,An Wang,Shuaipeng Li,Huaibing Xie,Xingwu Sun

Main category: cs.CL

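TL;DR: This paper proposes MTP-D, a self-distillation method that raises the acceptance rate of multi-token prediction (MTP) heads while preserving main-head performance, and combines it with a looped extension strategy to substantially speed up LLM inference.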

Details

Motivation: As large language models scale up, inference efficiency becomes a bottleneck; existing MTP methods face two challenges: low acceptance rates of MTP heads and difficulty in jointly training multiple heads. Method: Propose MTP-D, a lightweight self-distillation method, and a looped extension strategy for extending MTP heads economically; run systematic experiments to validate distillation strategies and the scalability of MTP. Result: MTP-D raises MTP head acceptance rates by 7.5%, and the looped extension yields a further inference speedup of 220.4% relative to 1-head MTP; effectiveness is validated on seven benchmarks. Conclusion: MTP-D and its looped extension strategy improve MTP-head performance and inference efficiency, supporting the practical use of MTP in large language models. Abstract: As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP head extension and a further significant inference speedup relative to 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on the distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical usage of MTP in LLMs.

[41] Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development

Zongliang Ji,Ziyang Zhang,Xincheng Tan,Matthew Thompson,Anna Goldenberg,Carl Yang,Rahul G. Krishnan,Fan Zhang

Main category: cs.CL

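TL;DR: This study explores using large language models (LLMs) as ambient assistants that generate targeted, evidence-based medicine (EBM) questions in real time during primary care consultations, to support physician decisions and reduce cognitive burden. Using Gemini 2.5 with two prompting strategies, it evaluates on 80 real clinical transcripts reviewed by six experienced physicians for over 90 hours. Results indicate that general-purpose LLMs are not yet fully reliable but can already produce clinically meaningful, guideline-relevant questions, suggesting potential for clinical use.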

Details

Motivation: Evidence-based medicine (EBM) is hard to implement in primary care because consultations are short, patient loads are high, and guidelines are too long to consult in real time; assistive tools that fit into fast-paced clinical workflows are needed. Method: Focus on question generation instead of question answering; design two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, on top of Gemini 2.5; run experiments on a dataset of 80 de-identified transcripts of real clinical encounters, with structured human evaluation by six experienced physicians (over 90 hours in total). Result: LLM-generated questions perform well on clinical meaningfulness and guideline relevance; reliability still needs improvement, but the approach shows clear potential to reduce physicians' cognitive load and make EBM more actionable. Conclusion: Using an LLM as an ambient questioning assistant during consultations is feasible and promising; future work should refine prompting and domain adaptation so that it can be deployed safely and reliably in real clinical settings. Abstract: Evidence-based medicine (EBM) is central to high-quality care, but remains difficult to implement in fast-paced primary care settings. Physicians face short consultations, increasing patient loads, and lengthy guideline documents that are impractical to consult in real time. To address this gap, we investigate the feasibility of using large language models (LLMs) as ambient assistants that surface targeted, evidence-based questions during physician-patient encounters. Our study focuses on question generation rather than question answering, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. We implemented two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, using Gemini 2.5 as the backbone model. We evaluated on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians contributing over 90 hours of structured review. Results indicate that while general-purpose LLMs are not yet fully reliable, they can produce clinically meaningful and guideline-relevant questions, suggesting significant potential to reduce cognitive burden and make EBM more actionable at the point of care.

Seunghee Kim,Bumkyu Park,Kyudan Jung,Joosung Lee,Soyoon Kim,Jeonghoon Kim,Taeuk Kim,Hwiyeol Jo

Main category: cs.CL

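TL;DR: This paper introduces OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. It finds that existing models, although strong on textual-output tasks, show marked weaknesses when they must generate appropriate speech from an image, a text script, and a spoken instruction, with the main bottleneck lying in multimodal context integration and not in processing individual modalities.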

Details

Motivation: Existing evaluations of omni-modal models rely mostly on textual outputs and cannot show whether a model can properly "speak" its answers, so a new benchmark for acoustic control is needed. Method: Build OmniACBench with 3,559 verified instances covering six acoustic features (speech rate, phonation, pronunciation, emotion, global accent, timbre); design a task in which the model must generate context-appropriate speech from a spoken instruction, a text script, and an image; run systematic experiments on eight models and analyze failure modes. Result: Current omni-modal models perform far worse on this acoustic control task than on textual-output tasks; the main bottleneck is multimodal context integration; three typical failure modes are identified: weak direct control, failed implicit inference, and failed multimodal grounding. Conclusion: Omni-modal models need stronger multimodal context integration to produce effective speech output; OmniACBench provides a new evaluation standard and diagnostic tool for developing models that can genuinely express themselves in speech. Abstract: Most testbeds for omni-modal models assess multimodal understanding via textual outputs, leaving it unclear whether these models can properly speak their answers. To study this, we introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. OmniACBench comprises 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models reveal their limitations in the proposed setting, despite their strong performance on prior textual-output evaluations. Our analyses show that the main bottleneck lies not in processing individual modalities, but in integrating multimodal context for faithful speech generation. Moreover, we identify three common failure modes (weak direct control, failed implicit inference, and failed multimodal grounding), providing insights for developing models that can verbalize responses effectively.

[43] Argument Mining as a Text-to-Text Generation Task

Masayuki Kawarada,Tsutomu Hirao,Wataru Uchida,Masaaki Nagata

Main category: cs.CL

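TL;DR: This paper proposes a text-to-text generation method based on a pretrained encoder-decoder language model that performs argument mining end to end, avoiding the traditional multi-subtask pipeline with its complex postprocessing and hyperparameter tuning.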

Details

Motivation: Traditional argument mining methods depend on several subtasks (span identification, component classification, relation classification) plus rule-based postprocessing, which makes models complex and enlarges the hyperparameter search space. Method: Use a pretrained encoder-decoder language model and cast argument mining as a single text-to-text generation task that produces argumentatively annotated text (covering spans, components, and relations) in one pass. Result: State-of-the-art performance on three benchmark datasets: AAEC, AbstRCT, and CDCP. Conclusion: The method is simple and efficient, needs no task-specific postprocessing or hyperparameter tuning, and adapts easily to different argumentative structures. Abstract: Argument Mining (AM) aims to uncover the argumentative structures within a text. Previous methods require several subtasks, such as span identification, component classification, and relation classification. Consequently, these methods need rule-based postprocessing to derive argumentative structures from the output of each subtask. This approach adds to the complexity of the model and expands the search space of the hyperparameters. To address this difficulty, we propose a simple yet strong method based on a text-to-text generation approach using a pretrained encoder-decoder language model. Our method simultaneously generates argumentatively annotated text for spans, components, and relations, eliminating the need for task-specific postprocessing and hyperparameter tuning. Furthermore, because it is a straightforward text-to-text generation method, we can easily adapt our approach to various types of argumentative structures. Experimental results demonstrate the effectiveness of our method, as it achieves state-of-the-art performance on three different types of benchmark datasets: the Argument-annotated Essays Corpus (AAEC), AbstRCT, and the Cornell eRulemaking Corpus (CDCP).

[44] From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

Sirui Xia,Yikai Zhang,Aili Chen,Siye Wu,Siyu Yuan,Yanghua Xiao

Main category: cs.CL

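TL;DR: POISE is a closed-loop framework for automatically discovering policy optimization algorithms for language models. It links proposals, implementations, evaluations, and reflections through a structured archive and yields clear gains in mathematical reasoning experiments.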

Details

Motivation: Manually discovering improved policy optimization algorithms for language models is costly and requires repeated mechanism-level modification and validation, and existing methods struggle to search efficiently over algorithmic mechanisms that are tightly coupled with training dynamics while reusing empirical evidence. Method: Propose the POISE framework, which maintains a structured, genealogically linked archive combining algorithm proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. Result: In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms such as analytic-variance scaling and validity masking; the best variant raises weighted Overall from 47.8 to 52.5 and AIME25 pass@32 from 26.7% to 43.3%. Conclusion: POISE demonstrates that automated discovery of policy optimization algorithms is feasible and supports interpretable design principles. Abstract: Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive linking proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.
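
For readers unfamiliar with the pass@32 figure, the sketch below shows one widely used way to estimate pass@k from n sampled solutions of which c are correct, namely 1 - C(n-c, k) / C(n, k). Whether POISE computes pass@32 with this estimator is not stated in the abstract, so treat this as an assumption; the function name and argument names are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given n total samples of which c were correct (requires 1 <= k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k includes a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would then be the mean of `pass_at_k` over all problems.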

[45] The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

Lingjiao Chen,Chi Zhang,Yeye He,Ion Stoica,Matei Zaharia,James Zou

Main category: cs.CL

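TL;DR: This paper presents the first systematic study of the gap between the listed API prices of reasoning language models (RLMs) and their actual inference costs. It finds a substantial "pricing reversal" phenomenon (in 21.8% of model-pair comparisons, the model with the lower listed price costs more in total), rooted in large heterogeneity in thinking-token consumption (on the same query, one model may use up to 900% more than another). Removing thinking-token costs reduces reversals by 70%, and per-query cost prediction is limited by intrinsic variation of up to 9.7x, indicating that listed prices do not reliably reflect real costs.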

Details

Motivation: Developers and users choose RLMs based on listed API prices, but whether those prices reflect actual inference cost has not been systematically evaluated. Method: Empirically evaluate 8 frontier RLMs on 9 diverse reasoning tasks (competition math, science QA, code generation, and others); quantify actual inference cost including thinking-token consumption; analyze price-cost deviation, reversals, and their causes; and assess how predictable per-query cost is. Result: Pricing reversal occurs in 21.8% of model-pair comparisons, with magnitude up to 28x; heterogeneity in thinking-token consumption is the main cause (up to 900% difference on the same query); removing thinking-token costs raises the rank correlation between price and cost (Kendall's τ) from 0.563 to 0.873 and reduces reversals by 70%; thinking-token counts vary by up to 9.7x across repeated runs of the same query, setting an irreducible noise floor for cost prediction. Conclusion: Listed API prices for RLMs are not a reliable proxy for actual inference cost, which calls for cost-aware model selection and fine-grained, transparent per-request cost monitoring. Abstract: Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 78% cheaper than GPT-5.2's, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall's $\tau$) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
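
The two quantities in this abstract, the share of model pairs whose price order and cost order disagree and the Kendall rank correlation between the two rankings, can be sketched in a few lines. This is a hedged illustration and not the authors' code: the cost model (thinking tokens billed at the output rate, prices per million tokens), all function names, and the tie handling (tau-a, ties ignored) are assumptions.

```python
from itertools import combinations

def actual_cost(price_in, price_out, tokens_in, tokens_visible, tokens_thinking):
    """Dollar cost of one request; prices are per million tokens.
    Assumes thinking tokens are billed at the output price."""
    return (price_in * tokens_in
            + price_out * (tokens_visible + tokens_thinking)) / 1e6

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

def reversal_rate(listed, actual):
    """Fraction of model pairs where the cheaper-listed model costs more."""
    pairs = list(combinations(range(len(listed)), 2))
    reversed_pairs = sum(
        1 for i, j in pairs
        if (listed[i] - listed[j]) * (actual[i] - actual[j]) < 0
    )
    return reversed_pairs / len(pairs)
```

With made-up inputs such as `listed = [1.0, 2.0, 3.0]` and `actual = [5.0, 4.0, 6.0]`, one of the three pairs is reversed, which is the kind of disagreement the paper counts across its model-pair comparisons.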

[46] Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

Somaya Eltanbouly,Samer Rashwani

Main category: cs.CL

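TL;DR: This paper proposes a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge, using the Doha Historical Dictionary of Arabic (DHDA) to improve LLMs' understanding of complex historical and religious Arabic texts such as the Quran and Hadith, and raising the accuracy of Arabic-native LLMs to over 85%.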

Details

Motivation: Large language models still struggle with complex historical and religious Arabic texts such as the Quran and Hadith, and knowledge of the diachronic development of Arabic vocabulary is needed to improve their understanding. Method: Build a RAG framework grounded in the Doha Historical Dictionary of Arabic (DHDA), combining hybrid retrieval with intent-based routing to give the LLM precise, contextually relevant historical lexical information; use Gemini as an LLM-as-a-judge for automatic evaluation, supported by human verification. Result: The method raises the accuracy of Arabic-native LLMs (such as Fanar and ALLaM) to over 85%, substantially narrowing the gap with proprietary models such as Gemini; automatic and human evaluations agree closely (kappa = 0.87); error analysis identifies diacritics and compound expressions as key linguistic challenges. Conclusion: Integrating diachronic lexicographic resources into a RAG framework improves LLMs' understanding of historical and religious Arabic texts and offers a workable path for low-resource or highly specialized language settings. Abstract: Large language models (LLMs) have achieved remarkable progress in many language tasks, yet they continue to struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we develop a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large-scale resource documenting the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent-based routing mechanism to provide LLMs with precise, contextually relevant historical information. Our experiments show that this approach improves the accuracy of Arabic-native LLMs, including Fanar and ALLaM, to over 85%, substantially reducing the performance gap with Gemini, a proprietary large-scale model. Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments. The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings demonstrate the value of integrating diachronic lexicographic resources into retrieval-augmented generation frameworks to enhance Arabic language understanding, particularly for historical and religious texts.
The code and resources are publicly available at: https://github.com/somayaeltanbouly/Doha-Dictionary-RAG.
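
The agreement figure quoted above (kappa = 0.87) is a chance-corrected agreement score between the LLM judge and human raters. The abstract does not say which kappa variant was used; the sketch below assumes Cohen's kappa for two raters, and the function and variable names are illustrative.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labelled independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, an LLM judge and a human who agree on three of four binary verdicts, with the marginals used in the test below, obtain kappa = 0.5, well below the 0.87 reported here.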

[47] CoCR-RAG: Enhancing Retrieval-Augmented Generation in Web Q&A via Concept-oriented Context Reconstruction

Kaize Shi,Xueyao Sun,Qika Lin,Firoj Alam,Qing Li,Xiaohui Tao,Guandong Xu

Main category: cs.CL

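TL;DR: This paper proposes CoCR-RAG, a framework that improves the fusion of heterogeneous multi-source documents through concept distillation based on Abstract Meaning Representation (AMR) and concept-level context reconstruction driven by a language model, improving factual consistency and performance in Web Q&A.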

Details

Motivation: When existing RAG methods fuse heterogeneous web documents that differ in source, style, format, and granularity, irrelevant and redundant information easily interferes and leads to factually inconsistent answers. Method: Propose Concept-oriented Context Reconstruction RAG (CoCR-RAG): 1) distill concepts using Abstract Meaning Representation (AMR) to extract the core semantic concepts from multiple documents; 2) have a large language model fuse and reconstruct the distilled concepts, supplementing only the necessary sentence elements, to produce a knowledge-dense, coherent, unified context. Result: CoCR-RAG significantly outperforms existing context-reconstruction methods on the PopQA and EntityQuestions datasets, is robust across a range of backbone LLMs, and works as a plug-and-play component. Conclusion: Through linguistically grounded concept-level fusion, CoCR-RAG addresses multi-source information fusion in RAG and improves both the factual accuracy of answers and the generality of the framework. Abstract: Retrieval-augmented generation (RAG) has shown promising results in enhancing Q&A by incorporating information from the web and other external sources. However, the supporting documents retrieved from the heterogeneous web often originate from multiple sources with diverse writing styles, varying formats, and inconsistent granularity. Fusing such multi-source documents into a coherent and knowledge-intensive context remains a significant challenge, as the presence of irrelevant and redundant information can compromise the factual consistency of the inferred answers. This paper proposes the Concept-oriented Context Reconstruction RAG (CoCR-RAG), a framework that addresses the multi-source information fusion problem in RAG through linguistically grounded concept-level integration. Specifically, we introduce a concept distillation algorithm that extracts essential concepts from Abstract Meaning Representation (AMR), a stable semantic representation that structures the meaning of texts as logical graphs. The distilled concepts from multiple retrieved documents are then fused and reconstructed into a unified, information-intensive context by Large Language Models, which supplement only the necessary sentence elements to highlight the core knowledge. Experiments on the PopQA and EntityQuestions datasets demonstrate that CoCR-RAG significantly outperforms existing context-reconstruction methods across these Web Q&A benchmarks. Furthermore, CoCR-RAG shows robustness across various backbone LLMs, establishing itself as a flexible, plug-and-play component adaptable to different RAG frameworks.

[48] Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

Yao Chen,Yilong Chen,Yinqi Yang,Junyuan Shang,Zhenyu Zhang,Zefeng Zhang,Shuaiyi Nie,Shuohuan Wang,Yu Sun,Hua Wu,HaiFeng Wang,Tingwen Liu

Main category: cs.CL

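TL;DR: This paper proposes the Sparse Growing Transformer (SGT), a training-time sparse depth allocation framework that progressively extends recurrent computation from deeper to shallower layers during training (looping over informative attention heads), achieving structural sparsity, cutting the additional FLOPs overhead to 1-3%, and improving performance.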

Details

Motivation: Existing methods for deepening Transformers rely on static, uniform parameter reuse, which causes substantial computational redundancy during training; the authors observe a deep-to-shallow maturation trajectory across layers, with high-entropy attention heads playing a key role in semantic integration, and therefore argue that depth allocation should grow dynamically over training. Method: Propose the Sparse Growing Transformer (SGT): during training, select key heads by their informativeness (e.g., entropy), start attention looping in deeper layers and progressively extend it to shallower ones, increasing computational depth for only a small subset of parameters to achieve structurally sparse, progressive deepening. Result: Across multiple parameter scales, SGT consistently outperforms static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from about 16-20% to only 1-3%. Conclusion: Dynamic, sparse, progressive depth allocation during training is more efficient and effective than static uniform deepening; SGT demonstrates the feasibility and benefits of a "structural growth" paradigm for Transformer training efficiency and performance. Abstract: Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16-20% to only 1-3% relative to a standard Transformer backbone.
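
The notion of a "high-entropy attention head" can be made concrete with a small sketch that scores each head by the mean Shannon entropy of its attention rows and picks the top-scoring heads. The abstract does not give SGT's actual selection criterion or schedule, so the scoring rule, the function names, and the `top_k` parameter are assumptions for illustration only.

```python
import math

def attention_entropy(attn_rows):
    """Mean Shannon entropy (in nats) over one head's attention rows.
    Each row is a probability distribution over key positions."""
    total = 0.0
    for row in attn_rows:
        total += -sum(p * math.log(p) for p in row if p > 0)
    return total / len(attn_rows)

def select_heads(head_attn, top_k):
    """Return indices of the top_k highest-entropy heads (hypothetical rule)."""
    scores = [attention_entropy(rows) for rows in head_attn]
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return ranked[:top_k]
```

In a looping scheme of the kind the abstract describes, only the selected heads would receive extra recurrent passes, which is how the added compute could stay at a small fraction of the backbone's FLOPs.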

Kun-Yang Yu,Zhi Zhou,Shi-Yu Tian,Xiao-Wen Yang,Zi-Yi Jia,Ming Yang,Zi-Jian Cheng,Lan-Zhe Guo,Yu-Feng Li

Main category: cs.CL

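TL;DR: This paper proposes Thinking with Tables (TWT), a program-aided neuro-symbolic reasoning method for Tabular-Vision Multi-Modal Understanding (TVMU). It targets three challenges: high variability in tabular data, complex implicit dependencies, and heterogeneous task pipelines. On eight datasets it improves average accuracy by 10% and matches or exceeds proprietary commercial SOTA models.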

Details

Motivation: Tabular data is a critical real-world modality but remains underexplored in multimodal learning; existing methods struggle with the high structural variability and incompleteness of tables, implicit and complex feature dependencies, and the strong heterogeneity of problem-solving pipelines across downstream tasks. Method: Propose Thinking with Tables (TWT), a program-aided, code-based neuro-symbolic reasoning mechanism that carries out key operations such as information extraction and element modeling by interacting with external environments. Result: On eight representative TVMU datasets, TWT improves average accuracy by 10% over existing baselines and matches or exceeds proprietary commercial SOTA LLMs. Conclusion: TWT improves tabular-vision cross-modal understanding and supports the effectiveness and generalization potential of program-aided neuro-symbolic reasoning for multimodal tasks on structured data. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities across modalities such as images and text. However, tabular data, despite being a critical real-world modality, remains relatively underexplored in multimodal learning. In this paper, we focus on the task of Tabular-Vision Multi-Modal Understanding (TVMU) and identify three core challenges: (1) high structural variability and data incompleteness in tables, (2) implicit and complex feature dependencies, and (3) significant heterogeneity in problem-solving pipelines across downstream tasks. To address these issues, we propose Thinking with Tables (TWT). TWT employs a program-aided code-based neuro-symbolic reasoning mechanism that facilitates key operations, such as information extraction and element modeling, by interacting with external environments. We evaluate TWT on eight representative datasets. Experimental results demonstrate that TWT consistently outperforms existing baselines by an average of 10% in accuracy, achieving performance comparable to, or even surpassing, proprietary commercial SOTA LLMs on TVMU tasks. Models and code are available at https://github.com/kunyang-YU/Thinking-with-Tables

Wassim Swaileh,Mohammed-En-Nadhir Zighem,Hichem Telli,Salah Eddine Bekhouche,Abdellah Zakaria Sellam,Fadi Dornaika,Dimitrios Kotzinos

Main category: cs.CL

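TL;DR: This paper presents a retrieval-augmented generation (RAG) system for Islamic inheritance law (Ilm al-Mawarith). By combining rule-grounded synthetic data generation, hybrid retrieval with reranking, and schema-constrained output validation, it achieves high-precision, interpretable legal reasoning and ranks first in the QIAS 2026 evaluation.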

Details

Motivation: Islamic inheritance is a multi-stage, high-precision, heavily rule-constrained legal reasoning task involving heir identification, blocking rules (hajb), share assignment, and adjustments (awl/radd); it also varies across legal schools and civil-law codifications, which calls for configurable, reliable, and verifiable AI solutions. Method: Builds a RAG pipeline: 1) a symbolic inheritance calculator generates high-quality synthetic data with full intermediate reasoning traces; 2) dense and sparse (BM25) retrieval are combined and reranked with a cross-encoder; 3) schema-constrained output validation enforces logical and numerical consistency. Result: Achieves a MIR-E score of 0.935 and first place on the QIAS 2026 blind-test leaderboard, showing that retrieval augmentation and schema-aware generation markedly improve high-precision Arabic legal reasoning. Conclusion: Combining retrieval augmentation, embedded rules, and structured validation improves the reliability, interpretability, and generalization of AI systems in a complex domain spanning religious and civil law, and offers a workable template for high-stakes legal AI. Abstract: Islamic inheritance (Ilm al-Mawarith) is a multi-stage legal reasoning task requiring the identification of eligible heirs, resolution of blocking rules (hajb), assignment of fixed and residual shares, handling of adjustments such as awl and radd, and generation of a consistent final distribution. The task is further complicated by variations across legal schools and civil-law codifications, requiring models to operate under explicit legal configurations. We present a retrieval-augmented generation (RAG) pipeline for this setting, combining rule-grounded synthetic data generation, hybrid retrieval (dense and BM25) with cross-encoder reranking, and schema-constrained output validation. A symbolic inheritance calculator is used to generate a large high-quality synthetic corpus with full intermediate reasoning traces, ensuring legal and numerical consistency. The proposed system achieves a MIR-E score of 0.935 and ranks first on the official QIAS 2026 blind-test leaderboard. Results demonstrate that retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks.
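
The abstract above says that dense and BM25 results are combined before cross-encoder reranking, but it does not say how the two ranked lists are merged. The following is a minimal sketch, assuming reciprocal rank fusion (a common choice, not confirmed as the authors' method); the document ids are hypothetical and `k = 60` is only the conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts at 1).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retrieval results for one inheritance question.
dense_hits = ["rule_hajb", "rule_awl", "rule_radd"]
bm25_hits = ["rule_awl", "rule_fixed_shares", "rule_hajb"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

In a pipeline of the kind described, the fused list would be the candidate set handed to the cross-encoder reranker; documents found by both retrievers rise to the top without any score calibration between the two systems.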

[51] Schema on the Inside: A Two-Phase Fine-Tuning Method for High-Efficiency Text-to-SQL at Scale

Chinmay Soni,Shivam Chourasia,Gaurav Kumar,Hitesh Kapoor

Main category: cs.CL

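TL;DR: This paper presents a self-hosted 8B-parameter language model specialized for text-to-SQL. Two-phase supervised fine-tuning makes the model internalize the database schema, cutting input tokens by more than 99% and improving execution success and semantic accuracy over a Gemini Flash 2.0 baseline.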

Details

Motivation: Using large closed-source API language models for text-to-SQL requires long schema prompts, and the resulting API cost and latency hinder industrial-scale deployment. Method: Designs a specialized, self-hosted 8B-parameter model and applies a two-phase supervised fine-tuning strategy so that the model fully internalizes the database schema and no longer depends on long-context prompts. Result: Input tokens drop from 17k to fewer than 100 (a reduction of more than 99%); execution success reaches 98.4% and semantic accuracy 92.5%, clearly above the Gemini Flash 2.0 baseline (95.6% / 89.4%). Conclusion: Demonstrates that domain-specialized, self-hosted language models are a feasible and practical route to high-precision, low-latency text-to-SQL in large-scale production. Abstract: Applying large, proprietary API-based language models to text-to-SQL tasks poses a significant industry challenge: reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11, India's largest fantasy sports platform with over 250 million users, that answers user queries about cricket statistics. Our novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100, and replaces costly external API calls with efficient local inference. The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google's Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.

[52] From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

Xiaoyong Guo,Nanjie Li,Zijie Zeng,Kai Wang,Hao Huang,Haihua Xu,Wei Shi

Main category: cs.CL

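TL;DR: This paper proposes a unified training framework that uses Teacher Error Knowledge, Context Dropout, and Direct Preference Optimization (DPO) to mitigate contextual exposure bias, the mismatch between training-time and inference-time history in contextual ASR with Speech-LLMs, and it markedly improves robustness and accuracy under predicted history and distracting context.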

Details

Motivation: Existing contextual ASR methods train on oracle conversation history but rely on error-prone history at inference, creating a train-test mismatch termed contextual exposure bias. Method: Proposes a three-part training strategy: (i) use Whisper large-v3 outputs as training-time history (Teacher Error Knowledge), (ii) apply Context Dropout to regularize reliance on history, and (iii) run Direct Preference Optimization (DPO) on curated failure cases. Result: On TED-LIUM 3 and zero-shot LibriSpeech with a two-utterance history, SFT with Whisper history reduces WER from 5.59% to 5.47%, and DPO lowers it further to 5.17%; under irrelevant-context attacks WER rises only to 5.63%, the smallest degradation. Conclusion: The framework mitigates contextual exposure bias and improves the generalization and resistance to misleading context of Speech-LLMs under realistic histories. Abstract: Contextual automatic speech recognition (ASR) with Speech-LLMs is typically trained with oracle conversation history, but relies on error-prone history at inference, causing a train-test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED-LIUM 3 (in-domain) and zero-shot LibriSpeech (out-of-domain) show consistent gains under predicted-history decoding. With a two-utterance history as context, SFT with Whisper hypotheses reduces WER from 5.59% (oracle-history training) to 5.47%, and DPO further improves to 5.17%. Under irrelevant-context attacks, DPO yields the smallest degradation (5.17% -> 5.63%), indicating improved robustness to misleading context. Our code and models are published on https://github.com/XYGuo1996/Contextual_Speech_LLMs.
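
The abstract names Context Dropout but does not give its granularity or rate. As a rough illustration only, the sketch below assumes the simplest variant: with probability `p_drop` (a hypothetical parameter) the whole history is removed for a training example, so the model cannot rely on context always being present.

```python
import random


def apply_context_dropout(history, p_drop=0.5, rng=random):
    """Return the history to feed the model for one training example.

    With probability p_drop the entire history is dropped (empty list);
    otherwise a copy of the history is returned unchanged.
    """
    if rng.random() < p_drop:
        return []
    return list(history)


# Hypothetical two-utterance history, e.g. ASR hypotheses of the previous turns.
history = ["so today i want to talk about", "three ideas that changed my work"]

rng = random.Random(0)  # seeded so the example is repeatable
kept = sum(bool(apply_context_dropout(history, 0.5, rng)) for _ in range(1000))
print(f"history kept in {kept} of 1000 training examples")
```

Other variants, such as dropping individual utterances or replacing the history with an unrelated one, fit the same interface; which of these the authors use is not stated in the abstract.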

[53] FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval

Caishuang Huang,Yang Qiao,Rongyu Zhang,Junjie Ye,Pu Lu,Wenxi Wu,Meng Zhou,Xiku Du,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

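TL;DR: This paper proposes FinToolSyn, a forward synthesis framework for the financial domain that generates high-quality financial dialogue data, addressing the shortcomings of existing reverse synthesis methods in modeling implicit needs and dynamic tool retrieval.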

Details

Motivation: Existing methods for synthesizing financial tool-use data follow a reverse synthesis paradigm, which yields overly explicit queries, fails to model implicit, event-driven needs, and ignores the dynamic retrieval that real use of a massive tool space requires. Method: Proposes FinToolSyn, a forward synthesis framework with three stages: persona instruction, atomic tool synthesis, and dynamic retrieval dialogue generation. It builds a repository of 43,066 tools, synthesizes over 148k dialogue instances, and introduces a dedicated benchmark for financial tool calling. Result: In evaluation on realistic financial scenarios, models trained on FinToolSyn improve tool-calling performance by 21.06%. Conclusion: FinToolSyn improves how well LLMs handle implicit needs, dynamic retrieval, and large tool spaces in finance, and provides a solid data foundation for financial tool learning. Abstract: Tool-use capabilities are vital for Large Language Models (LLMs) in finance, a domain characterized by massive investment targets and data-intensive inquiries. However, existing data synthesis methods typically rely on a reverse synthesis paradigm, generating user queries from pre-sampled tools. This approach inevitably introduces artificial explicitness, yielding queries that fail to capture the implicit, event-driven nature of real-world needs. Moreover, its reliance on static tool sets overlooks the dynamic retrieval process required to navigate massive tool spaces. To address these challenges, we introduce FinToolSyn, a forward synthesis framework designed to generate high-quality financial dialogues. Progressing from persona instruction and atomic tool synthesis to dynamic retrieval dialogue generation, our pipeline constructs a repository of 43,066 tools and synthesizes over 148k dialogue instances, incorporating dynamic retrieval to emulate the noisy candidate sets typical of massive tool spaces. We also establish a dedicated benchmark to evaluate tool-calling capabilities in realistic financial scenarios. Extensive experiments demonstrate that models trained on FinToolSyn achieve a 21.06% improvement, providing a robust foundation for tool learning in financial scenarios.

[54] ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing

Yu-Chen Kang,Yu-Chien Tang,An-Zi Yen

Main category: cs.CL

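TL;DR: This paper introduces the task of concept-level deficiency prediction, which extends traditional knowledge tracing, and uses the ConceptKT dataset and in-context learning with large language models to diagnose the specific concepts a student is missing.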

Details

Motivation: Existing knowledge tracing systems focus only on binary correctness prediction and cannot diagnose the specific conceptual misunderstandings behind errors, so they lack the fine-grained diagnostic feedback needed for targeted instruction and remediation. Method: Introduces the new task of concept-level deficiency prediction; builds the ConceptKT dataset, annotated with the concepts each question requires and the missing concepts behind incorrect responses; explores strategies for selecting historical records based on conceptual alignment and semantic similarity; and evaluates the diagnostic ability of various Large Language Models (LLMs) and Large Reasoning Models (LRMs). Result: Selecting response histories by conceptual alignment and semantic similarity improves both correctness prediction and concept-level deficiency identification. Conclusion: Concept-level deficiency prediction is an important extension of knowledge tracing, and a well-chosen history selection strategy strengthens large models on educational diagnosis tasks. Abstract: Knowledge Tracing (KT) is a critical technique for modeling student knowledge to support personalized learning. However, most KT systems focus on binary correctness prediction and cannot diagnose the underlying conceptual misunderstandings that lead to errors. Such fine-grained diagnostic feedback is essential for designing targeted instruction and effective remediation. In this work, we introduce the task of concept-level deficiency prediction, which extends traditional KT by identifying the specific concepts a student is likely to struggle with on future problems. We present ConceptKT, a dataset annotated with labels that capture both the concepts required to solve each question and the missing concepts underlying incorrect responses. We investigate in-context learning approaches to KT and evaluate the diagnostic capabilities of various Large Language Models (LLMs) and Large Reasoning Models (LRMs). Different strategies for selecting informative historical records are explored. Experimental results demonstrate that selecting response histories based on conceptual alignment and semantic similarity leads to improved performance on both correctness prediction and concept-level deficiency identification.

[55] LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale

Muhammed Saeed,Simon Razniewski

Main category: cs.CL

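TL;DR: This paper presents LLMpedia, a fully open parametric encyclopedia of roughly 1M articles generated purely from parametric memory. It shows that current large language models are far from factuality saturation: the true rate is only 74.7% on Wikipedia-covered subjects and drops to 63.2% on frontier subjects, well below the 90%+ level suggested by benchmarks such as MMLU. The work also exposes the availability bias of fixed-question evaluation and releases all data, code, and a browsable interface.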

Details

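Motivation: Existing benchmarks such as MMLU overestimate the factual accuracy of large language models because they rely on fixed question sets and therefore suffer from availability bias; a more comprehensive, open, and verifiable factuality evaluation is needed, one that covers the knowledge models actually generate. Method: Builds LLMpedia, a system that generates encyclopedic articles from parametric memory alone, without external retrieval, covering about 1M subjects across three major model families; designs a two-part factuality evaluation protocol based on Wikipedia verifiability and authoritative web evidence; introduces a new 'capture-trap' benchmark; and releases all prompts, generated content, and evaluation verdicts. Result: For gpt-5-mini, the verifiable true rate is 74.7% on Wikipedia-covered subjects and 63.2% on frontier subjects; Wikipedia covers only 61% of the surfaced subjects, and the three model families overlap by only 7.3% in subject choice; on the capture-trap benchmark, LLMpedia is substantially more factual than Grokipedia at roughly half the textual similarity to Wikipedia. Conclusion: Mainstream benchmarks substantially overestimate LLM factuality; as the first fully open parametric encyclopedia, LLMpedia extends factuality evaluation from static question answering to knowledge generation with empirical verification, providing a new approach and infrastructure for trustworthy AI. Abstract: Benchmarks such as MMLU suggest flagship language models approach factuality saturation, with scores above 90%. We show this picture is incomplete. LLMpedia generates encyclopedic articles entirely from parametric memory, producing ~1M articles across three model families without retrieval. For gpt-5-mini, the verifiable true rate on Wikipedia-covered subjects is only 74.7% -- more than 15 percentage points below the benchmark-based picture, consistent with the availability bias of fixed-question evaluation. Beyond Wikipedia, frontier subjects verifiable only through curated web evidence fall further to 63.2% true rate. Wikipedia covers just 61% of surfaced subjects, and three model families overlap by only 7.3% in subject choice. In a capture-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia achieves substantially higher factuality at roughly half the textual similarity to Wikipedia. Unlike Grokipedia, every prompt, artifact, and evaluation verdict is publicly released, making LLMpedia the first fully open parametric encyclopedia -- bridging factuality evaluation and knowledge materialization. All data, code, and a browsable interface are at https://llmpedia.net.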

[56] Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Nour Bouchouchi,Thiabult Laugel,Xavier Renard,Christophe Marsala,Marie-Jeanne Lesot,Marcin Detyniecki

Main category: cs.CL

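TL;DR: This paper proposes a unified framework that jointly analyzes gender bias in the internal representations and in the generated outputs of large language models (LLMs). It finds that alignment training reduces output bias but leaves gender-related representations inside the model that adversarial prompts can reactivate, and that debiasing effects measured on structured benchmarks do not readily generalize to realistic settings such as story generation.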

Details

Motivation: Prior work mostly targets bias mitigation at the output level and evaluates on structured benchmarks, which neither shows whether the model's internal representations are actually changed nor reflects bias in realistic use. Method: Proposes a unified analysis framework based on identical neutral prompts that measures both the gender information encoded in internal representations (intrinsic bias) and the bias expressed in generated outputs (extrinsic bias); combines supervised fine-tuning for debiasing with adversarial prompting tests; and checks generalization on realistic tasks such as story generation. Result: Intrinsic gender information and expressed bias are consistently associated; alignment training reduces output bias, but gender-related associations persist in internal representations and can be reactivated by adversarial prompts; debiasing effects on structured benchmarks do not necessarily generalize to realistic settings such as story generation. Conclusion: Optimizing outputs alone is not enough to achieve real fairness, so internal representations must also be examined and addressed; structured evaluation has limits, and bias evaluation in realistic settings should be strengthened. Abstract: During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.

[57] MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare

Shubham Kumar Nigam,Suparnojit Sarkar,Piyush Patel

Main category: cs.CL

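TL;DR: This paper introduces MedAidDialog, a multilingual dataset that simulates realistic multi-turn physician-patient dialogues, and builds on it MedAidLM, a lightweight, deployable dialogue model that supports personalized consultation and the generation of diagnostic recommendations.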

Details

Motivation: Existing medical dialogue systems are mostly single-turn question answering or depend on template-based data, which limits conversational realism and multilingual applicability and makes them hard to use where medical resources are scarce. Method: 1) Starting from MDDial, uses large language models to generate synthetic multi-turn physician-patient dialogues and expands them into MedAidDialog, a parallel multilingual dataset covering seven languages; 2) applies parameter-efficient fine-tuning to quantized small language models to build the lightweight dialogue model MedAidLM, with optional patient pre-context (e.g., age, gender, allergies) for personalization. Result: MedAidLM carries out multi-turn symptom elicitation and generates diagnostic recommendations; medical experts assessed the plausibility and coherence of the generated consultations. Conclusion: MedAidDialog and MedAidLM offer a feasible, lightweight, and extensible route to multilingual medical consultation in low-resource settings. Abstract: Conversational artificial intelligence has the potential to assist users in preliminary medical consultations, particularly in settings where access to healthcare professionals is limited. However, many existing medical dialogue systems operate in a single-turn question--answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. In this work, we introduce MedAidDialog, a multilingual multi-turn medical dialogue dataset designed to simulate realistic physician--patient consultations. The dataset extends the MDDial corpus by generating synthetic consultations using large language models and further expands them into a parallel multilingual corpus covering seven languages: English, Hindi, Telugu, Tamil, Bengali, Marathi, and Arabic. Building on this dataset, we develop MedAidLM, a conversational medical model trained using parameter-efficient fine-tuning on quantized small language models, enabling deployment without high-end computational infrastructure. Our framework additionally incorporates optional patient pre-context information (e.g., age, gender, allergies) to personalize the consultation process. Experimental results demonstrate that the proposed system can effectively perform symptom elicitation through multi-turn dialogue and generate diagnostic recommendations. We further conduct medical expert evaluation to assess the plausibility and coherence of the generated consultations.

[58] A visual observation on the geometry of UMAP projections of the difference vectors of antonym and synonym word pair embeddings

Rami Luisto

Main category: cs.CL

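TL;DR: This paper explores the geometry of antonym pairs in word embedding vectors and finds a 'swirl' pattern that appears across models under a specific projection.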

Details

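Motivation: Starting from the definition of antonyms as word pairs that share all contextually relevant properties but one, and from the observation that transformer models seem to encode concepts as directions, the author asks how antonym pairs appear geometrically in embedding space, especially through their difference vectors, and contrasts them with synonym pairs. Method: Analyzes the geometry of antonym and synonym pairs across several embedding models, inspecting the distribution of their difference vectors under a specific UMAP projection configuration. Result: Finds a 'swirl' structure that appears across models under a specific projection, suggesting that antonymy has a recognizable geometric signature in embedding space. Conclusion: Antonym pairs show a non-trivial and consistent geometric structure (the 'swirl') in embedding space, which suggests that semantic opposition may be encoded in some low-dimensional manifold form and offers a new lead for understanding the semantic geometry of word vectors. Abstract: Antonyms, or opposites, are sometimes defined as word pairs that have all of the same contextually relevant properties but one. Seeing how transformer models seem to encode concepts as directions, this begs the question if one can detect "antonymity" in the geometry of the embedding vectors of word pairs, especially based on their difference vectors. Such geometrical studies are then naturally contrasted by comparing antonymic pairs to their opposites; synonyms. This paper started as an exploratory project on the complexity of the systems needed to detect the geometry of the embedding vectors of antonymic word pairs. What we now report is a curious "swirl" that appears across embedding models in a somewhat specific projection configuration.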

[59] Variation is the Norm: Embracing Sociolinguistics in NLP

Anne-Marie Lutgen,Alistair Plum,Verena Blaschke,Barbara Plank,Christoph Purschke

Main category: cs.CL

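TL;DR: This paper proposes a framework that combines a sociolinguistic perspective with NLP techniques, arguing that language variation should be modeled actively rather than treated as noise. In a case study on Luxembourgish, model performance drops markedly under heavy orthographic variation, and including variation during fine-tuning improves robustness.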

Details

Motivation: NLP usually treats language variation as noise and normalizes it away, overlooking that variation is an integral part of language, whereas sociolinguistics takes variation in social context as its central object of study. The paper aims to bridge this gap and make NLP systems robust to real-world variation. Method: Proposes an analytical framework that combines sociolinguistic theory with NLP practice and runs a case study on Luxembourgish: it compares model performance on near-standard orthography against data with heavy variation, and includes variation data explicitly during fine-tuning to improve performance. Result: Models tested and fine-tuned on highly variable data perform clearly worse than on data closer to the orthographic standard; including variation in fine-tuning mitigates the drop and improves robustness. Conclusion: Language variation should not simply be classed as noise but built into NLP research design as a key dimension; the framework offers a workable way to reconcile sociolinguistic theory with NLP engineering and calls for language technology that is more socially realistic and robust. Abstract: In Natural Language Processing (NLP), variation is typically seen as noise and "normalised away" before processing, even though it is an integral part of language. Conversely, studying language variation in social contexts is central to sociolinguistics. We present a framework to combine the sociolinguistic dimension of language with the technical dimension of NLP. We argue that by embracing sociolinguistics, variation can actively be included in a research setup, in turn informing the NLP side. To illustrate this, we provide a case study on Luxembourgish, an evolving language featuring a large amount of orthographic variation, demonstrating how NLP performance is impacted. The results show large discrepancies in the performance of models tested and fine-tuned on data with a large amount of orthographic variation in comparison to data closer to the (orthographic) standard. Furthermore, we provide a possible solution to improve the performance by including variation in the fine-tuning process. This case study highlights the importance of including variation in the research setup, as models are currently not robust to occurring variation. Our framework facilitates the inclusion of variation in the thought-process while also being grounded in the theoretical framework of sociolinguistics.

[60] Stance Labels Fail When They Matter Most: The Projection Problem in Stance Detection

Bowen Zhang

Main category: cs.CL

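TL;DR: This paper identifies the 'projection problem' in stance detection, which arises when multi-dimensional attitudes are compressed into a single label (Favor/Against/Neutral). It shows empirically that the problem drives annotator disagreement, most strongly when dimensions conflict, and recommends multi-dimensional annotation for better reliability and finer-grained understanding.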

Details

Motivation: Current stance detection forces complex, multi-dimensional attitudes (e.g., accepting climate science while opposing carbon taxes) into a single label, so annotators who weight dimensions differently disagree; this disagreement is not random noise but a systematic difference in 'projection' choices. Method: Introduces the concept of the projection problem, argues that its cost is conditional, and runs a pilot study on SemEval-2016 Task 6 that compares annotation agreement (Krippendorff's α) on dimension-consistent and dimension-conflicting texts. Result: On dimension-consistent texts, agreement on the three-way label (α = 0.307) exceeds agreement on individual dimensions (α = 0.082); on dimension-conflicting texts, label agreement falls to α = 0.085 while dimensional agreement rises to α = 0.334 (0.572 for the Policy dimension), confirming the projection problem and the setting in which it matters. Conclusion: The projection problem is real and consequential; it reflects a basic limitation of the single-label paradigm rather than an annotation defect, and stance modeling and annotation should move to multiple dimensions, especially where dimensions conflict, to improve reliability and semantic fidelity. Abstract: Stance detection is nearly always formulated as classifying text into Favor, Against, or Neutral -- a convention inherited from debate analysis and applied without modification to social media since SemEval-2016. But attitudes toward complex targets are not unitary: a person can accept climate science while opposing carbon taxes, expressing support on one dimension and opposition on another. When annotators must compress such multi-dimensional attitudes into a single label, different annotators weight different dimensions -- producing disagreement that reflects not confusion but different compression choices. We call this the projection problem, and show that its cost is conditional: when a text's dimensions align, any weighting yields the same label and three-way annotation works well; when dimensions conflict, label agreement collapses while agreement on individual dimensions remains intact. A pilot study on SemEval-2016 Task 6 confirms this crossover: on dimension-consistent texts, label agreement (Krippendorff's $\alpha = 0.307$) exceeds dimensional agreement ($\alpha = 0.082$); on dimension-conflicting texts, the pattern reverses -- label $\alpha$ drops to $0.085$ while dimensional $\alpha$ rises to $0.334$, with Policy reaching $0.572$. The projection problem is real -- but it activates precisely where it matters most.
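
The agreement figures above are Krippendorff's α values. For readers unfamiliar with the statistic, here is a minimal sketch of α for nominal labels, computed from the coincidence matrix; the toy annotations are invented for illustration and are not the paper's data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list of labels per annotated item (one label per annotator).
    Items with fewer than two labels are skipped. The value is undefined
    (division by zero here) when only a single label value occurs overall.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        # Every ordered pair of ratings from different annotators counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    return 1.0 - observed / expected


# Toy stance labels from two annotators on two texts.
agree = [["Favor", "Favor"], ["Against", "Against"]]
disagree = [["Favor", "Against"], ["Against", "Favor"]]
print(krippendorff_alpha_nominal(agree), krippendorff_alpha_nominal(disagree))
```

α is 1 for perfect agreement, near 0 when agreement is at chance level, and negative when annotators disagree systematically, which is why values such as 0.085 in the abstract indicate labels that are barely better than chance.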

[61] Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition

Aleix Sant,Jordi Luque,Carlos Escolano

Main category: cs.CL

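TL;DR: This paper presents a federated learning framework for large language models in multilingual settings, introduces LDES-FL, a client-specific dynamic early stopping mechanism, and systematically studies how client language composition affects model quality, fairness, and training cost.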

Details

Motivation: Addresses the challenges that heterogeneous client language distributions and unequal language resources pose for multilingual federated learning. Method: Extends the FederatedScope-LLM framework to support multilingual instruction-tuning experiments and proposes Local Dynamic Early Stopping (LDES-FL), which lets each client pause and resume training according to its local validation performance; controlled experiments vary client language composition from fully monolingual to increasingly multilingual. Result: Monolingual local fine-tuning is best for single-language specialization; federated training is better suited to a balanced multilingual global model; greater within-client multilinguality yields stronger and fairer global models, with the largest gains for lower-resource languages, at the cost of more optimization steps. Conclusion: Client language composition is a key design variable in multilingual federated learning, directly shaping performance, fairness, and training efficiency. Abstract: Federated Learning (FL) of Large Language Models (LLMs) in multilingual environments presents significant challenges stemming from heterogeneous language distributions across clients and disparities in language resource availability. To address these challenges, we extended the FederatedScope-LLM framework to support multilingual instruction-tuning experiments with LLMs. We also introduced a novel client-specific early stopping mechanism, Local Dynamic Early Stopping (LDES-FL), which allows clients to pause and resume local training based on client-side validation performance, enhancing training efficiency and sustainability. Through a series of experiments, we studied how client language composition - from fully monolingual to increasingly multilingual clients - affects multilingual quality, fairness and training cost. Monolingual local fine-tuning remains the most effective for single-language specialization, whereas federated training is better suited to learning a single balanced multilingual model. In FL, increasing within-client multilinguality leads to stronger and fairer global models, narrows the gap to centralized multilingual fine-tuning, and yields the largest gains for lower-resource languages, albeit at the cost of more optimization steps. Overall, our results identify client language composition as a key design variable in multilingual FL, shaping performance, fairness and efficiency.
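
The abstract describes LDES-FL only as letting clients pause and resume local training based on client-side validation performance. The sketch below is one hypothetical way such a rule could look, not the authors' algorithm: the `patience` and `resume_margin` parameters and the specific pause and resume conditions are assumptions made for illustration.

```python
class ClientEarlyStopper:
    """Hypothetical per-client pause/resume rule driven by local validation loss."""

    def __init__(self, patience=2, resume_margin=0.05):
        self.patience = patience            # rounds without improvement before pausing
        self.resume_margin = resume_margin  # how much worse than best triggers a resume
        self.best = float("inf")
        self.bad_rounds = 0
        self.paused = False

    def update(self, val_loss):
        """Record this round's local validation loss; return True if the client trains."""
        if self.paused:
            # Resume once the current global model has drifted away from this client's data.
            if val_loss > self.best + self.resume_margin:
                self.paused = False
                self.bad_rounds = 0
            return not self.paused
        if val_loss < self.best:
            self.best = val_loss
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
            if self.bad_rounds >= self.patience:
                self.paused = True
        return not self.paused


stopper = ClientEarlyStopper(patience=2, resume_margin=0.05)
decisions = [stopper.update(loss) for loss in [1.0, 0.8, 0.81, 0.82, 0.83, 0.9]]
print(decisions)
```

A paused client skips local optimization for the round, which is where the efficiency gain mentioned in the abstract would come from; the resume condition lets it rejoin if later aggregation rounds degrade the global model on its own languages.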

[62] Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution

Julia Matela,Frank Krüger

Main category: cs.CL

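TL;DR: This paper presents a hybrid framework for cross-document coreference resolution of software mentions that combines Sentence-BERT semantic embeddings, FAISS-based KB lookup, and HDBSCAN clustering, and it achieves strong CoNLL F1 scores in the SOMD 2026 Shared Task.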

Details

Motivation: Software mentions in scientific literature take inconsistent forms, which makes cross-document coreference hard to resolve. Method: Combines dense Sentence-BERT semantic embeddings, FAISS retrieval over a knowledge base built from training-set cluster centroids, and HDBSCAN density-based clustering, supported by surface-form normalization and abbreviation resolution; for the large-scale Subtask 3, adds a blocking strategy based on entity types and canonicalized surface forms. Result: CoNLL F1 of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3 respectively. Conclusion: The hybrid framework performs well in both accuracy and scalability, supporting the combination of semantic embeddings, efficient retrieval, and unsupervised clustering. Abstract: This paper describes the system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach addresses the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model, a Knowledge Base (KB) lookup strategy built from training-set cluster centroids using FAISS for efficient retrieval, and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to existing clusters. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is applied to Subtasks 1 and 2. To address the large-scale setting of Subtask 3, the pipeline was adapted by utilising a blocking strategy based on entity types and canonicalized surface forms. Our system achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3 respectively.
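
The lookup-then-cluster logic described above can be sketched in a few lines. This is an illustration under stated assumptions: plain-Python cosine similarity stands in for Sentence-BERT vectors and the FAISS index, the two-dimensional vectors and cluster names are made up, and the 0.8 threshold is a hypothetical value rather than the system's setting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def assign_to_centroid(mention_vec, centroids, threshold=0.8):
    """Return the id of the most similar known cluster, or None if no match is confident.

    A None result marks a mention that would be left for density-based
    clustering (HDBSCAN in the described system).
    """
    best_id, best_sim = None, -1.0
    for cluster_id, centroid in centroids.items():
        sim = cosine(mention_vec, centroid)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id if best_sim >= threshold else None


# Toy centroids standing in for training-set cluster centroids.
centroids = {"NumPy": [1.0, 0.0], "SciPy": [0.0, 1.0]}

print(assign_to_centroid([0.9, 0.1], centroids))  # close to the NumPy centroid
print(assign_to_centroid([1.0, 1.0], centroids))  # ambiguous, left for clustering
```

Splitting the work this way means previously seen software names are resolved by a cheap nearest-centroid lookup, and only unmatched mentions incur the cost of clustering.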

[63] Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning

He Huang

Main category: cs.CL

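TL;DR: This paper proposes a jointly trained multitask encoder-decoder model with a byte-level tokenizer, several training tasks (MLM, TLM, translation, part-of-speech tagging), and Latin transliteration and IPA reconstruction as auxiliary views, to improve word-level semantic alignment across four historical stages of Ancient Egyptian. Translation contributes most, and IPA with KL consistency improves cross-branch alignment, but overall alignment remains limited.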

Details

Motivation: The four historical stages of Ancient Egyptian differ markedly in script and orthography, and parallel data are scarce, so cross-stage semantic alignment has to be modeled under tight data constraints. Method: Uses a compact encoder-decoder architecture with a shared byte-level tokenizer, jointly optimized for MLM, TLM, sequence-to-sequence translation, and part-of-speech tagging; adds Latin transliteration and IPA reconstruction as auxiliary views, integrated through KL-based consistency and embedding-level fusion. Result: Translation yields the strongest gains; IPA with KL consistency improves cross-branch alignment; early embedding fusion has limited effect; overall word-level alignment remains weak, but the work establishes a reproducible baseline on metrics such as ROC-AUC and triplet accuracy. Conclusion: Under the real constraints of historical language modeling, task design (especially translation) and normalization (such as IPA) strongly shape alignment quality; the work provides a reproducible baseline and practical modeling guidance for typologically distant, low-resource languages. Abstract: We study word-level semantic alignment across four historical stages of Ancient Egyptian. These stages differ in script and orthography, and parallel data are scarce. We jointly train a compact encoder-decoder model with a shared byte-level tokenizer on all four stages, combining masked language modeling (MLM), translation language modeling (TLM), sequence-to-sequence translation, and part-of-speech tagging under a task-aware loss with fixed weights and uncertainty-based scaling. To reduce surface divergence we add Latin transliteration and IPA reconstruction as auxiliary views. We integrate these views through KL-based consistency and through embedding-level fusion. We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets. Translation yields the strongest gains. IPA with KL consistency improves cross-branch alignment, while early fusion demonstrates limited efficacy. Although the overall alignment remains limited, the findings provide a reproducible baseline and practical guidance for modeling historical languages under real constraints. They also show how normalization and task design shape what counts as alignment in typologically distant settings.

[64] Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

N J Karthika,Keerthana Suryanarayanan,Jahanvi Purohit,Ganesh Ramakrishnan,Jitin Singla,Anil Kumar Gourishetty

Main category: cs.CL

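TL;DR: This paper releases Samasāmayik, a new, carefully curated, large-scale Hindi-Sanskrit parallel corpus of 92,196 parallel sentences drawn from contemporary material, and validates its usefulness on several models.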

Details

Motivation: Most available Sanskrit text comes from classical-era literature and poetry, and contemporary resources are lacking; to fill the gap in contemporary data for the low-resource Hindi-Sanskrit translation pair, the authors build a corpus covering diverse modern contexts. Method: Builds Samasāmayik, a large-scale Hindi-Sanskrit parallel corpus (92,196 sentences) sourced from spoken tutorials, children's magazines, radio conversations, and instruction materials; fine-tunes and benchmarks three models, ByT5, NLLB, and IndicTrans-v2; and analyzes semantic and lexical overlap with existing corpora to confirm novelty. Result: Significant gains on in-domain test data and comparable performance on other widely used test sets; minimal semantic and lexical overlap with existing corpora, confirming novelty and non-redundancy. Conclusion: Samasāmayik is a high-quality, non-redundant, contemporary Hindi-Sanskrit parallel corpus that provides a solid new baseline and a practical resource for low-resource Indic machine translation. Abstract: We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus, comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instruction materials. We benchmark this new dataset by fine-tuning three complementary models - ByT5, NLLB and IndicTrans-v2, to demonstrate its utility. Our experiments demonstrate that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data, while achieving comparable performance on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi-Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic language MT.

[65] GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Yunzhe Wang,Runhui Xu,Kexin Zheng,Tianyi Zhang,Jayavibhav Niranjan Kogundi,Soham Hans,Volkan Ustun

Main category: cs.CL

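TL;DR: This paper presents GameplayQA, a framework for evaluating the agent-centric perception and reasoning of multimodal large language models on 3D multiplayer gameplay videos. Using dense temporal annotation and a triadic structure (Self, Other Agents, World), it builds 2.4K diagnostic QA pairs and shows that current MLLMs fall well short in temporal grounding, role attribution, and handling decision density.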

Details

Motivation: Existing benchmarks do not adequately evaluate abilities that multimodal large language models need as perceptual backbones for autonomous agents in 3D environments: perceiving rapid state changes, attributing actions to the correct entities, and reasoning about concurrent multi-agent behavior. Method: Builds the GameplayQA framework: multiplayer 3D gameplay videos are densely annotated at 1.22 labels/second, with states, actions, and events organized around a Self / Other Agents / World triad; 2.4K diagnostic QA pairs across three levels of cognitive complexity are derived from the annotations, together with a structured distractor taxonomy. Result: Frontier MLLMs perform far below humans on GameplayQA, with common failures in temporal and cross-video grounding, agent-role attribution, and handling high decision density. Conclusion: GameplayQA fills a gap in the evaluation of agent-centric perception and may stimulate research at the intersection of embodied AI, agentic perception, and world modeling. Abstract: Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.

[66] Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning

Arsen Shebzukhov

Main category: cs.CL

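TL;DR: This paper presents an autoformalization approach that fine-tunes Qwen3.5-2B with LoRA to translate natural language mathematics into Lean4, and shows that GRPO reinforcement learning with a cycle consistency reward substantially outperforms supervised fine-tuning, with or without curriculum learning.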

Details

Motivation: Autoformalization can accelerate AI-assisted mathematical research (e.g., proof verification and proof search), but existing methods still fall short in preserving semantic consistency. Method: Fine-tunes Qwen3.5-2B with LoRA on FineLeanCorpus and compares three training regimes: supervised fine-tuning (SFT) with a difficulty curriculum, SFT without curriculum ordering, and GRPO reinforcement learning with a cycle consistency reward; cycle consistency is the cosine similarity of sentence embeddings along the NL → Lean4 → NL' loop. Result: RL substantially outperforms SFT on FineLeanCorpus and PutnamBench (mean cycle consistency 0.669 vs. 0.513 and 0.561 vs. 0.422), with cross-entropy loss rising by only 0.011 nats and minimal impact on formalization quality; curriculum learning gives no measurable benefit. Conclusion: GRPO with a cycle consistency reward is the more effective training regime for autoformalization, and curriculum learning is not necessary for this task. Abstract: Autoformalization - automatically translating natural language mathematical texts into a formal proof language such as Lean4 - can help accelerate AI-assisted mathematical research, be it via proof verification or proof search. I fine-tune Qwen3.5-2B with LoRA for natural language to Lean4 formalization on FineLeanCorpus and consider three training regimes: supervised fine-tuning (SFT) with curriculum learning (difficulty 1 to 10), SFT without curriculum ordering, and reinforcement learning using group relative policy optimization (GRPO) with a cycle consistency reward. Cycle consistency measures how well the meaning of a statement is preserved through a NL to Lean4 to NL' loop, computed as cosine similarity of off-the-shelf sentence embeddings. On an unseen subset of FineLeanCorpus (FLC) and on PutnamBench, RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum ordering provides no measurable benefit over shuffled training.
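
In GRPO, each sampled output is scored against the other samples drawn for the same prompt rather than by a learned value function. The sketch below shows that group-relative normalization applied to cycle-consistency rewards; the reward values are invented, and the use of the population standard deviation with a small `eps` is an assumption, since implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    advantage_i = (reward_i - group mean) / (group std + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical cycle-consistency scores (cosine similarity of NL and NL')
# for four Lean4 formalizations sampled from the same statement.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Formalizations whose round trip preserves meaning better than their siblings receive positive advantages and are reinforced, so the reward only needs to rank samples within a group rather than be calibrated in absolute terms.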

[67] Towards Reward Modeling for AI Tutors in Math Mistake Remediation

Kseniia Petukhova,Ekaterina Kochmar

Main category: cs.CL

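TL;DR: Proposes a new way to evaluate the pedagogical quality of AI tutors on the mistake-remediation task: it draws on human preferences from the MRBench benchmark and on synthetic contrastive response pairs, and trains lightweight Bradley-Terry preference models that reach 0.74 pairwise accuracy on a human preference test, outperforming larger general-purpose reward models.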

Details

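Motivation: Standard NLG metrics cannot tell whether an AI tutor identifies mistakes, scaffolds reasoning, or avoids giving away the answer, so evaluation methods targeted at pedagogical quality are needed. Method: Derives a hierarchy of pedagogical aspects from human pairwise preferences on MRBench; synthesizes response pairs that differ minimally along key aspects (such as mistake identification, targetedness, scaffolding, and actionability); builds weighted-sum ranking data and trains Bradley-Terry preference scorers. Result: The best model trained only on synthetic data reaches 0.69 accuracy on the human preference test; combining weighted-sum data with targeted synthetic groups raises this to 0.74 with only a 0.5B-parameter backbone, outperforming larger general-purpose reward models. Conclusion: Lightweight, pedagogy-oriented preference models can effectively assess the teaching quality of AI tutors without large-scale human annotation, offering a practical approach to evaluating and optimizing educational AI. Abstract: Evaluating the pedagogical quality of AI tutors remains challenging: standard NLG metrics do not determine whether responses identify mistakes, scaffold reasoning, or avoid revealing the answers. For the task of mistake remediation, we derive a hierarchy of pedagogical aspects from human pairwise preferences on MRBench, and synthesize minimally contrastive response pairs that differ along key aspects (e.g., mistake identification and location, targetedness, scaffolding, actionability, clarity, and coherence). We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and data combinations. Using only synthetic data, our best model reaches 0.69 pairwise accuracy on a human preference test, and combining weighted-sum data with targeted synthetic groups improves accuracy to 0.74, outperforming larger general-purpose reward models while using only a 0.5B-parameter backbone.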
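
For readers unfamiliar with Bradley-Terry preference models, the sketch below shows the standard pairwise probability, the corresponding training loss, and the pairwise accuracy metric quoted in the abstract. The scalar scores are assumed to come from some reward model; the function names and the example scores are illustrative and not taken from the paper's released code.

```python
import math


def bt_probability(score_a, score_b):
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def bt_loss(score_preferred, score_rejected):
    """Negative log-likelihood of the observed preference for one pair."""
    return -math.log(bt_probability(score_preferred, score_rejected))


def pairwise_accuracy(pairs):
    """Fraction of (preferred, rejected) score pairs ranked in the right order."""
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


# Hypothetical scores for four human-labeled pairs: (preferred, rejected).
example_pairs = [(1.0, 0.0), (0.2, 0.5), (2.0, 1.0), (0.3, 0.1)]
accuracy = pairwise_accuracy(example_pairs)  # 3 of 4 pairs ordered correctly
```

Because the loss depends only on the score difference, the model learns a relative ranking rather than calibrated absolute scores, which is why pairwise accuracy is the natural evaluation metric.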

[68] When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools

Xingming Li,Runke Huang,Yanan Bao,Yuye Jin,Yuru Jiao,Qingyong Hu

Main category: cs.CL

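TL;DR: Proposes a scalable AI-based method for assessing teacher-child interaction (TCI) quality: it builds TEPE-TCI-370h, the first large-scale dataset of naturalistic interactions in Chinese preschools, and develops Interaction2Eval, a specialized LLM-based framework that reaches up to 88% agreement with expert annotations; deployment validation shows an 18x efficiency gain, supporting a shift from annual manual audits to monthly AI-assisted monitoring.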

Details

Motivation: Manual expert assessment of TCI is costly and slow in large education systems (such as China's 250,000 kindergartens), which makes continuous quality monitoring impractical and limits assessment to infrequent spot checks, hindering timely intervention and improvement tracking. Method: Builds TEPE-TCI-370h, a large-scale dataset of naturalistic interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations; designs Interaction2Eval, an LLM framework that targets domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, rubric-based reasoning); validates it in a field deployment across 43 classrooms. Result: Interaction2Eval reaches up to 88% agreement with human expert judgments; in deployment the assessment workflow becomes 18x more efficient, supporting a move from annual expert audits to monthly AI-assisted monitoring with targeted human review. Conclusion: AI can act as a scalable assessment teammate that enables continuous, inclusive evaluation of TCI quality, pointing toward AI-augmented assessment as a driver of systemic improvement and equitable growth in preschool education. Abstract: High-quality teacher-child interaction (TCI) is fundamental to early childhood development, yet traditional expert-based assessment faces a critical scalability challenge. In large systems like China's, serving 36 million children across 250,000+ kindergartens, the cost and time requirements of manual observation make continuous quality monitoring infeasible, relegating assessment to infrequent episodic audits that limit timely intervention and improvement tracking. In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments. Our contributions include: (1) TEPE-TCI-370h (Tracing Effective Preschool Education), the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations; (2) Interaction2Eval, a specialized LLM-based framework that addresses domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning) and achieves up to 88% agreement; (3) deployment validation across 43 classrooms demonstrating an 18x efficiency gain in the assessment workflow, highlighting its potential for shifting from annual expert audits to monthly AI-assisted monitoring with targeted human oversight.
This work not only demonstrates the technical feasibility of scalable, AI-augmented quality assessment but also lays the foundation for a new paradigm in early childhood education, one where continuous, inclusive, AI-assisted evaluation becomes the engine of systemic improvement and equitable growth.

[69] PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation

Manoj Balaji Jagadeeshan,Atul Singh,Nallani Chakravartula Sahith,Amrith Krishna,Pawan Goyal

Main category: cs.CL

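TL;DR: Proposes the PINGALA decoding method and adopts the SLP1 phonetic transliteration scheme; grouped-line generation and phonetically aware processing substantially improve the semantic coherence and metrical accuracy of generated Sanskrit poetry, and a new reference-free evaluation method based on cross-encoders is introduced.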

Details

Motivation: Sanskrit poetry generation must satisfy both semantic coherence and strict metrical rules (such as binary patterns of syllable weights), and modeling a verse as a single sequence struggles to achieve both. Method: Proposes the PINGALA decoding strategy, which generates verses as grouped lines and prefers longer tokens to encourage well-formed words; adopts the phonetically aware SLP1 transliteration scheme to fit Sanskrit phonology; designs a reference-free automatic evaluation method based on cross-encoders. Result: Grouped-line generation improves semantic coherence by 10%, and SLP1 transliteration improves metrical alignment by 46% with comparable semantic similarity; the cross-encoder evaluation aligns better with true poetry instances. Conclusion: Divide-and-conquer line-level modeling and phonetically aware representations are key to better Sanskrit poetry generation, and the reference-free evaluation can effectively stand in for human or reference-based metrics. Abstract: Poetry generation in Sanskrit typically requires the verse to be semantically coherent and adhere to strict prosodic rules. In Sanskrit prosody, every line of a verse is typically a fixed length sequence of syllables adhering to prescribed binary patterns of syllable weights. We observe that instead of treating a verse as a monolithic sequence, segmenting it into grouped lines improves semantic coherence by 10% with comparable metrical adherence. Specifically, PINGALA, our proposed decoding approach, is designed to encourage every line to have well-formed words, and our token selection biases the model towards this by preferring longer tokens. Writing in Sanskrit follows phonemic orthography, hence using a phonetically aware transliteration scheme, SLP1, increased the metrical alignment by 46% with comparable semantic similarity, for instruction fine-tuned large language models like Phi-4. We also introduce a new approach for reference-free evaluation using cross-encoders which achieved better alignment with true poetry instances.

[70] Mechanic: Sorrifier-Driven Formal Decomposition Workflow for Automated Theorem Proving

Ruichen Qiu,Yichuan Cao,Junqi Liu,Dakai Guo,Xiao-Shan Gao,Lihong Zhi,Ruyong Feng

Main category: cs.CL

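TL;DR: Proposes Mechanic, a system that uses Lean's sorry placeholder for formal decomposition, extracting failed subproblems into independent contexts to improve theorem-proving efficiency.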

Details

Motivation: After repeated failures on complex mathematical reasoning, existing theorem-proving systems either regenerate the whole proof (inefficient) or keep patching it, which lengthens the context and weakens attention. Method: Proposes Mechanic, an agent system with a sorry-driven formal decomposition strategy: Lean's sorry precisely isolates unresolved subgoals while the verified parts are kept, and each failed subproblem is extracted into a clean, self-contained context and solved independently. Result: On hard mathematical competition benchmarks, including IMO 2025 and Putnam 2025, Mechanic significantly improves proving efficiency. Conclusion: By balancing reuse against a lean context, Mechanic provides a more efficient and scalable way to handle errors in automated theorem proving. Abstract: Recent advances in large language models (LLMs) and LLM-based agents have substantially improved the capabilities of automated theorem proving. However, for problems requiring complex mathematical reasoning, current systems rarely succeed on the first try and must repeatedly modify their proof strategies. Existing approaches for handling failed attempts typically either discard the entire proof and regenerate it from scratch or iteratively fix errors within the proof. The former is inefficient, as it may abandon mostly correct reasoning due to localized errors, while the latter, although preserving prior progress, leads to progressively longer contexts, which degrade the model's ability to attend to the remaining unresolved subproblems. To address this dilemma, we propose Mechanic, a novel agent system that employs a sorry-driven formal decomposition strategy. By leveraging the sorry placeholder in Lean to precisely isolate unresolved subgoals while preserving the surrounding verified proof structure, Mechanic extracts each failed subproblem into a clean, self-contained context and resolves it independently. This avoids both the waste of full regeneration and the excessive context length induced by repeated repairs. Experimental results on challenging mathematical competition benchmarks, including IMO 2025 and Putnam 2025, demonstrate that our agent achieves significant advantages in proving efficiency.
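
To make the role of `sorry` concrete, here is a small Lean 4 sketch. It is a hypothetical toy example, not output from Mechanic: the theorem names and statements are invented for illustration. The outer proof structure is accepted by Lean (with a warning), while each unresolved subgoal is marked by `sorry` and can be restated as a standalone lemma.

```lean
-- Hypothetical toy example: the overall structure checks, two subgoals are deferred.
theorem demo (a b : Nat) : a + 0 = a ∧ 0 + b = b := by
  have ha : a + 0 = a := by
    sorry  -- unresolved subgoal 1
  have hb : 0 + b = b := by
    sorry  -- unresolved subgoal 2
  exact ⟨ha, hb⟩

-- Subgoal 2 extracted into its own self-contained context, to be proved
-- independently of the surrounding proof.
theorem demo_hb (b : Nat) : 0 + b = b := by
  sorry
```

The extracted lemma carries only the hypotheses it needs, so a prover working on it does not have to attend to the rest of the original proof.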

[71] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Jeonghye Kim,Xufang Luo,Minbeom Kim,Sangmook Lee,Dohyung Kim,Jiwon Jeon,Dongsheng Li,Yuqing Yang

Main category: cs.CL

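TL;DR: Finds that self-distillation in mathematical reasoning suppresses the model's expression of uncertainty (epistemic verbalization), which degrades performance, especially on out-of-distribution (OOD) tasks; experiments show that conditioning the teacher on rich context over-optimizes for the covered domain at the expense of generalization, with performance drops of up to 40% across several mainstream models.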

Details

Motivation: Self-distillation improves LLM performance and shortens reasoning traces on most tasks, yet degrades performance in mathematical reasoning; the authors set out to find the root cause. Method: Runs controlled experiments that systematically vary the richness of the teacher's conditioning context and the task coverage, analyzes the effect on uncertainty expression and reasoning performance, and validates the findings on several open models (Qwen3-8B, DeepSeek-Distill-Qwen-7B, Olmo3-7B-Instruct). Result: Self-distillation markedly suppresses epistemic verbalization, the model's expression of uncertainty; the richer the context, the stronger the suppression; this enables rapid in-domain optimization but seriously harms OOD generalization, with performance drops of up to 40% on mathematical reasoning. Conclusion: In mathematical reasoning, exposing an appropriate level of uncertainty is crucial for robust reasoning; LLM optimization should attend to the reasoning behavior itself, not only reinforce correct answer traces. Abstract: Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

[72] Representation Learning to Study Temporal Dynamics in Tutorial Scaffolding

Conrad Borchers,Jiayi Zhang,Ashish Gurung

Main category: cs.CL

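TL;DR: Proposes an embedding-based semantic alignment method for quantifying adaptive scaffolding in authentic tutoring dialogue; an analysis of 1,576 mathematics tutoring dialogues finds that tutors' and learners' semantic alignment with problem and solution content is role-specific, follows temporal patterns, and predicts tutorial progression.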

Details

Motivation: Robust methods for measuring adaptive scaffolding in authentic tutoring dialogue are lacking, a gap that has become more pressing with the rise of remote human tutoring and LLM-based tutoring systems. Method: Proposes an embedding-based semantic alignment analysis that operationalizes alignment as the cosine similarity between tutor or student utterances and the problem statement or correct solution; applies the framework to 1,576 real mathematics tutoring dialogues from the Eedi dataset and tests its predictive power with mixed-effects models. Result: Tutors anchor more closely to problem content early in the interaction, while students' alignment with solution content is weaker but positively predicts progression; role-specific semantic alignment predicts tutorial progression beyond baseline features such as message order and length. Conclusion: Scaffolding is a continuous, role-sensitive process grounded in task semantics; the method offers a principled tool for analyzing instructional dialogue and evaluating conversational tutoring systems. Abstract: Adaptive scaffolding enhances learning, yet the field lacks robust methods for measuring it within authentic tutoring dialogue. This gap has become more pressing with the rise of remote human tutoring and large language model-based systems. We introduce an embedding-based approach that analyzes scaffolding dynamics by aligning the semantics of dialogue turns, problem statements, and correct solutions. Specifically, we operationalize alignment by computing cosine similarity between tutor and student contributions and task-relevant content. We apply this framework to 1,576 real-world mathematics tutoring dialogues from the Eedi Question Anchored Tutoring Dialogues dataset. The analysis reveals systematic differences in task alignment and distinct temporal patterns in how participants ground their contributions in problem and solution content. Further, mixed-effects models show that role-specific semantic alignment predicts tutorial progression beyond baseline features such as message order and length. Tutor contributions exhibited stronger grounding in problem content early in interactions. In contrast, student solution alignment was modestly positively associated with progression. These findings support scaffolding as a continuous, role-sensitive process grounded in task semantics. By capturing role-specific alignment over time, this approach provides a principled method for analyzing instructional dialogue and evaluating conversational tutoring systems.

[73] Robust Multilingual Text-to-Pictogram Mapping for Scalable Reading Rehabilitation

Soufiane Jhilal,Martina Galletti

Main category: cs.CL

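TL;DR: Presents a multilingual AI interface that automatically adds visual scaffolding (such as contextually relevant pictograms) to text to support reading comprehension for children with special educational needs, and validates its high coverage, semantic appropriateness, and real-time usability across five languages.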

Details

Motivation: Children with Special Educational Needs and Disabilities (SEND) face major challenges in reading comprehension and often need intensive one-on-one support; helping therapists scale this support calls for automated, multilingual, safe, and reliable visual aids. Method: Develops a multilingual AI-powered interface that dynamically identifies key concepts in text and maps them to contextually relevant pictograms; evaluates it in English, French, Italian, Spanish, and Arabic through multilingual coverage analysis, clinical review by speech therapists and special education professionals, and latency assessment. Result: All five languages reach high pictogram coverage and visual scaffolding density; combined correct and acceptable ratings exceed 95% for the four European languages and are about 90% for Arabic; latency meets the requirements of real-time educational interaction. Conclusion: The automated multimodal scaffolding approach is technically viable, semantically safe, and clinically acceptable, and can improve reading accessibility for neurodiverse learners. Abstract: Reading comprehension presents a significant challenge for children with Special Educational Needs and Disabilities (SEND), often requiring intensive one-on-one reading support. To assist therapists in scaling this support, we developed a multilingual, AI-powered interface that automatically enhances text with visual scaffolding. This system dynamically identifies key concepts and maps them to contextually relevant pictograms, supporting learners across languages. We evaluated the system across five typologically diverse languages (English, French, Italian, Spanish, and Arabic), through multilingual coverage analysis, expert clinical review by speech therapists and special education professionals, and latency assessment. Evaluation results indicate high pictogram coverage and visual scaffolding density across the five languages. Expert audits suggested that automatically selected pictograms were semantically appropriate, with combined correct and acceptable ratings exceeding 95% for the four European languages and approximately 90% for Arabic despite reduced pictogram repository coverage. System latency remained within interactive thresholds suitable for real-time educational use. These findings support the technical viability, semantic safety, and acceptability of automated multimodal scaffolding to improve accessibility for neurodiverse learners.

[74] A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English

Dana Serditova,Kevin Tang

Main category: cs.CL

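TL;DR: Analyzes from a sociolinguistic perspective how Newcastle English affects automatic speech recognition (ASR): dialect features (vowel quality, glottalisation, local vocabulary, non-standard grammar) are the main sources of error, and error rates vary systematically with social variables such as gender and age, indicating that ASR bias is socially patterned and that sociolinguistic methods and community-based speech data are needed to improve fairness.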

Details

Motivation: ASR systems perform unevenly across dialect speakers, and especially poorly on dialects that diverge from the mainstream accents in training data (such as Newcastle English); the mechanisms behind this bias need to be understood from a sociolinguistic perspective. Method: Evaluates a commercial ASR system on spontaneous speech from the DECTE corpus; classifies more than 3,000 transcription errors by linguistic domain and relates them to social variables including gender, age, and socioeconomic status; adds an acoustic case study of selected vowel features. Result: Phonological variation (especially vowel quality and glottalisation) accounts for most errors; local vocabulary and non-standard grammar also contribute to misrecognition; error rates are higher for men and for speakers at the extremes of the age spectrum. Conclusion: ASR errors are socially patterned rather than random; sociolinguistic expertise should be built into ASR evaluation and development, with explicit attention to dialectal variation and community-based speech data, to achieve more equitable speech recognition. Abstract: Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven across speakers, particularly when dialectal variation diverges from the mainstream accents represented in training data. This study investigates ASR bias through a sociolinguistic analysis of Newcastle English, a regional variety of North-East England that has been shown to challenge current speech recognition technologies. Using spontaneous speech from the Diachronic Electronic Corpus of Tyneside English (DECTE), we evaluate the output of a state-of-the-art commercial ASR system and conduct a fine-grained analysis of more than 3,000 transcription errors. Errors are classified by linguistic domain and examined in relation to social variables including gender, age, and socioeconomic status. In addition, an acoustic case study of selected vowel features demonstrates how gradient phonetic variation contributes directly to misrecognition. The results show that phonological variation accounts for the majority of errors, with recurrent failures linked to dialect-specific features like vowel quality and glottalisation, as well as local vocabulary and non-standard grammatical forms. Error rates also vary across social groups, with higher error frequencies observed for men and for speakers at the extremes of the age spectrum. These findings indicate that ASR errors are not random but socially patterned and can be explained from a sociolinguistic perspective.
Thus, the study demonstrates the importance of incorporating sociolinguistic expertise into the evaluation and development of speech technologies and argues that more equitable ASR systems require explicit attention to dialectal variation and community-based speech data.

[75] MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Zhuo Li,Yupeng Zhang,Pengyu Cheng,Jiajun Song,Mengyu Zhou,Hao Li,Shujie Hu,Yu Qin,Erchao Zhao,Xiaoxi Jiang,Guanjun Jiang

Main category: cs.CL

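TL;DR: Proposes MARCH, a framework that uses multi-agent reinforcement learning and deliberate information asymmetry (the Checker never sees the Solver's original output) to break confirmation bias during verification, substantially reducing hallucination rates in RAG systems.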

Details

Motivation: Existing hallucination detection based on LLM-as-a-judge suffers from inherent confirmation bias: the verifier reproduces the generator's errors, undermining the reliability of RAG systems. Method: Proposes MARCH, with three specialized agents: the Solver generates an initial response; the Proposer decomposes it into atomic propositions; the Checker verifies each proposition independently without access to the original output; the whole pipeline is trained jointly with multi-agent reinforcement learning. Result: MARCH substantially reduces hallucination rates across hallucination benchmarks; an 8B-parameter open model equipped with MARCH performs competitively with powerful closed-source models. Conclusion: MARCH offers a scalable path to factual self-improvement for LLMs; its core mechanism, multi-agent collaborative verification driven by information asymmetry, effectively mitigates confirmation bias. Abstract: Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver's original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at https://github.com/Qwen-Applications/MARCH.
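
The information asymmetry at the heart of the Solver / Proposer / Checker pipeline can be illustrated with a small sketch. Everything below is a hypothetical stand-in: the three agents are plain functions rather than LLMs, the reinforcement-learning training is omitted, and none of the names come from the MARCH repository. The point is only the data flow: the Checker receives a proposition and the evidence, never the Solver's full answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    proposition: str
    supported: bool


def run_pipeline(
    question: str,
    evidence: List[str],
    solver: Callable[[str, List[str]], str],
    proposer: Callable[[str], List[str]],
    checker: Callable[[str, List[str]], bool],
) -> Tuple[str, List[Verdict]]:
    """Solve, decompose, then check each claim against the evidence alone."""
    answer = solver(question, evidence)
    propositions = proposer(answer)
    # Information asymmetry: the checker is called with one proposition and
    # the evidence; `answer` is deliberately not passed in.
    verdicts = [Verdict(p, checker(p, evidence)) for p in propositions]
    return answer, verdicts


# Toy stand-ins for the three agents (assumptions for illustration only).
def toy_solver(question, evidence):
    return "Paris is the capital of France. Paris has 90 million residents."


def toy_proposer(answer):
    return [s.strip().rstrip(".") for s in answer.split(". ") if s.strip()]


def toy_checker(proposition, evidence):
    return any(proposition.lower() in doc for doc in evidence)


evidence_docs = ["paris is the capital of france"]
answer, verdicts = run_pipeline(
    "What is the capital of France?",
    evidence_docs, toy_solver, toy_proposer, toy_checker,
)
```

Keeping the asymmetry in the function signature, rather than relying on prompt discipline, makes it structurally impossible for the checker to anchor on the solver's phrasing.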

[76] Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Saahil Mathur,Ryan David Rittner,Vedant Ajit Thakur,Daniel Stuart Schiff,Tunazzina Islam

Main category: cs.CL

TL;DR: 本文研究了检索增强生成（RAG）系统在AI治理与政策分析中的应用，使用AGORA语料库，结合ColBERT检索器与DPO对齐的生成器，发现领域微调虽提升检索指标，但未必改善端到端问答可靠性，甚至可能加剧幻觉。

Details

Motivation: Existing RAG systems struggle to reach expert-grade reliability on dense legal language and evolving, overlapping regulatory frameworks, so a systematic evaluation and optimization for AI policy analysis is needed. Method: Uses the AGORA corpus (947 AI policy documents); fine-tunes a ColBERT retriever with contrastive learning and aligns the generator to human preferences with Direct Preference Optimization (DPO); adapts the system to the domain with synthetic queries and pairwise preference annotations. Result: Domain-specific fine-tuning improves retrieval quality (for example recall), but answer relevance and faithfulness in end-to-end QA do not improve in step; when relevant documents are missing, stronger retrieval can lead to more confident hallucinations. Conclusion: Better individual components do not necessarily yield more reliable answers; policy-focused RAG design should emphasize retrieval-generation coordination and uncertainty modeling rather than optimizing modules in isolation. Abstract: Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.

cs.CV [Back]

[77] LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

Royden Wagner,Omer Sahin Tas,Jaime Villa,Felix Hauser,Yinzhe Shen,Marlon Steiner,Dominik Strutz,Carlos Fernandez,Christian Kinzig,Guillermo S. Guitierrez-Cabello,Hendrik Königshof,Fabian Immel,Richard Schwarzkopf,Nils Alexander Rack,Kevin Rösch,Kaiwen Wang,Jan-Hendrik Pauls,Martin Lauer,Igor Gilitschenski,Holger Caesar,Christoph Stiller

Main category: cs.CV

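TL;DR: Introduces a new dataset for end-to-end autonomous driving that focuses on long-tail driving events and includes multi-view video, trajectories, high-level instructions, and multilingual (English, Spanish, Chinese) expert reasoning traces, for evaluating how well multimodal models generalize in instruction following and semantic coherence.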

Details

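Motivation: Generalization to rare scenarios remains a fundamental challenge in real-world domains such as self-driving, and existing datasets lack coverage of long-tail driving events and diverse reasoning processes. Method: Builds a long-tail scenario dataset for end-to-end driving with multi-view video, vehicle trajectories, high-level driving instructions, and multilingual reasoning traces written by domain experts from diverse cultural backgrounds, supporting in-context learning and few-shot generalization. Result: Presents the first long-tail driving benchmark that incorporates multilingual expert reasoning traces, evaluating models such as VLMs and VLAs on instruction following and semantic coherence in addition to conventional safety and comfort metrics. Conclusion: The dataset is a unique resource for studying how different forms of reasoning affect driving competence, and can help improve model robustness and interpretability in real long-tail scenarios. Abstract: In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization. The resulting benchmark for multimodal models, such as VLMs and VLAs, goes beyond safety and comfort metrics by evaluating instruction following and semantic coherence between model outputs. The multilingual reasoning traces in English, Spanish, and Chinese are from domain experts with diverse cultural backgrounds. Thus, our dataset is a unique resource for studying how different forms of reasoning affect driving competence. Our dataset is available at: https://hf.co/datasets/kit-mrt/kitscenes-longtail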

Alexandre Symeonidis-Herzig,Jianhe Low,Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

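TL;DR: Proposes SMPL-FX and the M3T model: the FLAME face model is coupled with the SMPL-X body model, multimodal motion is tokenized with modality-specific Finite Scalar Quantization VAEs, and an autoregressive transformer with an auxiliary translation objective is trained on the result, substantially improving the modeling of non-manual features (mouthings, eyebrow raises, gaze, head movements) in sign language production and reaching state-of-the-art performance.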

Details

Motivation: Existing 3D sign language production systems struggle to model grammatically obligatory non-manual features (NMFs): the standard body model's facial space is too low-dimensional, and richer representations suffer codebook collapse under discrete tokenization, leaving most of the expression space unreachable. Method: Proposes SMPL-FX, which couples FLAME (rich facial expression) with SMPL-X (full body); tokenizes body, hand, and face motion separately with modality-specific Finite Scalar Quantization VAEs; builds a multimodal motion vocabulary and trains M3T, an autoregressive transformer with an auxiliary translation objective that encourages semantically grounded embeddings. Result: State-of-the-art sign language production quality on How2Sign, CSL-Daily, and Phoenix14T; 58.3% accuracy on NMFs-CSL, where signs are distinguishable only by non-manual features, versus 49.0% for the strongest comparable pose baseline. Conclusion: SMPL-FX and M3T address the difficulty of modeling non-manual features and show that multimodal tokenization and semantically guided training matter for sign language production quality. Abstract: Sign language production requires more than hand motion generation. Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face two barriers to integrating them: the standard body model provides a facial space too low-dimensional to encode these articulations, and when richer representations are adopted, standard discrete tokenization suffers from codebook collapse, leaving most of the expression space unreachable. We propose SMPL-FX, which couples FLAME's rich expression space with the SMPL-X body, and tokenize the resulting representation with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary, with an auxiliary translation objective that encourages semantically grounded embeddings. Across three standard benchmarks (How2Sign, CSL-Daily, Phoenix14T) M3T achieves state-of-the-art sign language production quality, and on NMFs-CSL, where signs are distinguishable only by non-manual features, reaches 58.3% accuracy against 49.0% for the strongest comparable pose baseline.
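
Finite Scalar Quantization replaces a learned codebook with a fixed grid: each latent dimension is bounded and rounded to one of a few integer levels, and the token id is the position in that grid. The sketch below shows only this forward quantization step in a simplified form (odd level counts, no straight-through gradient, no encoder or decoder); the level configuration `[5, 5, 5]` is an arbitrary assumption, not the paper's setting.

```python
import math


def fsq_quantize(z, levels):
    """Round each latent dimension to one of levels[i] integer values.

    Simplified sketch: every entry of `levels` is assumed to be odd, so the
    codes for a dimension with L levels lie in {-(L-1)/2, ..., (L-1)/2}.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        bounded = half * math.tanh(value)  # squash into (-half, half)
        codes.append(int(round(bounded)))
    return codes


def fsq_index(codes, levels):
    """Map a per-dimension code vector to a single token id (mixed radix)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index = index * num_levels + (code + half)
    return index


levels = [5, 5, 5]                         # implicit codebook of 5 * 5 * 5 = 125 tokens
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token_id = fsq_index(codes, levels)
```

Because the grid is fixed rather than learned, every code is reachable by construction, which is the property that makes this family of quantizers attractive when learned codebooks collapse.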

[79] Ukrainian Visual Word Sense Disambiguation Benchmark

Yurii Laba,Yaryna Mohytych,Ivanna Rohulia,Halyna Kyryleyza,Hanna Dydyk-Meush,Oles Dobosevych,Rostyslav Hryniv

Main category: cs.CV

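TL;DR: Builds a benchmark for evaluating Visual Word Sense Disambiguation (Visual-WSD) in Ukrainian and tests eight multilingual multimodal large models on it; all perform worse than the zero-shot CLIP baseline used for English Visual-WSD, revealing a significant performance gap between Ukrainian and English.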

Details

Motivation: To build a Ukrainian Visual-WSD benchmark that supports cross-language comparison of multimodal models and fills the evaluation gap for Ukrainian on this task. Method: Follows a semi-automatic construction methodology similar to existing English, Italian, and Farsi Visual-WSD benchmarks, with refinement by domain experts; then evaluates eight multilingual multimodal large models on the benchmark against a zero-shot CLIP baseline. Result: All tested models perform worse on Ukrainian Visual-WSD than the zero-shot CLIP baseline used for the English task, and there is a significant performance gap between Ukrainian and English. Conclusion: Current multimodal large models support Ukrainian Visual-WSD poorly, which highlights the modeling challenges for lower-resource languages and confirms the usefulness of the benchmark. Abstract: This study presents a benchmark for evaluating the Visual Word Sense Disambiguation (Visual-WSD) task in Ukrainian. The main goal of the Visual-WSD task is to identify, with minimal contextual information, the most appropriate representation of a given ambiguous word from a set of ten images. To construct this benchmark, we followed a methodology similar to that proposed by (CITATION), who previously introduced benchmarks for the Visual-WSD task in English, Italian, and Farsi. This approach allows us to incorporate the Ukrainian benchmark into a broader framework for cross-language model performance comparisons. We collected the benchmark data semi-automatically and refined it with input from domain experts. We then assessed eight multilingual and multimodal large language models using this benchmark. All tested models performed worse than the zero-shot CLIP-based baseline model (CITATION) used by (CITATION) for the English Visual-WSD task. Our analysis revealed a significant performance gap in the Visual-WSD task between Ukrainian and English.

[80] Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting

Peiyu Xu,Xin Sun,Krishna Mullia,Raymond Fei,Iliyan Georgiev,Shuang Zhao

Main category: cs.CV

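TL;DR: Proposes a sorting-free, differentiable stochastic ray tracing method for 3D Gaussian splatting (3DGS) that uses a Monte Carlo estimator to compute pixel-color gradients efficiently, maintaining reconstruction quality and speed while markedly improving rendering realism in standard and relightable scenes.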

Details

Motivation: Existing ray-tracing-based 3DGS methods are held back by the cost of sorting all intersecting Gaussians along every ray, and for relightable scenes they still rely on rasterization-style approximations such as shadow mapping, undermining the generality and physical fidelity that ray tracing should provide. Method: Proposes a differentiable, sorting-free stochastic ray tracing framework built on an unbiased Monte Carlo estimator for pixel-color gradients that samples only a few Gaussians per ray; for relightable scenes, it combines per-Gaussian shading with fully ray-traced shadow rays. Result: On standard 3DGS, reconstruction quality and speed match rasterization-based methods and substantially exceed sorting-based ray tracing; on relightable 3DGS, reconstruction fidelity is notably higher than prior work. Conclusion: The method is presented as the first differentiable, sorting-free stochastic ray tracing framework for 3DGS, supporting both high-quality standard rendering and physically consistent relighting. Abstract: Ray-tracing-based 3D Gaussian splatting (3DGS) methods overcome the limitations of rasterization -- rigid pinhole camera assumptions, inaccurate shadows, and lack of native reflection or refraction -- but remain slower due to the cost of sorting all intersecting Gaussians along every ray. Moreover, existing ray-tracing methods still rely on rasterization-style approximations such as shadow mapping for relightable scenes, undermining the generality that ray tracing promises. We present a differentiable, sorting-free stochastic formulation for ray-traced 3DGS -- the first framework that uses stochastic ray tracing to both reconstruct and render standard and relightable 3DGS scenes. At its core is an unbiased Monte Carlo estimator for pixel-color gradients that evaluates only a small sampled subset of Gaussians per ray, bypassing the need for sorting. For standard 3DGS, our method matches the reconstruction quality and speed of rasterization-based 3DGS while substantially outperforming sorting-based ray tracing. For relightable 3DGS, the same stochastic estimator drives per-Gaussian shading with fully ray-traced shadow rays, delivering notably higher reconstruction fidelity than prior work.

[81] λSplit: Self-Supervised Content-Aware Spectral Unmixing for Fluorescence Microscopy

Federico Carrara,Talley Lambert,Mehdi Seifi,Florian Jug

Main category: cs.CV

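TL;DR: Proposes λSplit, a physics-informed deep generative model for spectral unmixing in fluorescence microscopy: a hierarchical variational autoencoder learns a conditional distribution over concentration maps and a differentiable spectral mixer enforces physical consistency, reaching state-of-the-art performance across a range of challenging settings.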

Details

Motivation: Classical pixel-wise least-squares methods degrade when emission spectra overlap heavily or noise is high; existing learning-based methods are either not suited to fluorescence microscopy data or do not generalize to it. Method: Proposes λSplit, which learns a conditional distribution over concentration maps with a hierarchical Variational Autoencoder (VAE) and embeds a fully differentiable, physics-based Spectral Mixer to keep outputs consistent with the image formation process. Result: Compared against 10 baseline methods (classical and learning-based) on 66 challenging benchmarks synthesized from 3 real-world datasets, λSplit is consistently competitive and more robust under high noise, strong spectral overlap, and reduced spectral dimensionality; it is compatible with data from standard confocal microscopes. Conclusion: λSplit is a new state of the art for spectral unmixing in fluorescence microscopy, combining strong performance and robustness with adoption on existing hardware. Abstract: In fluorescence microscopy, spectral unmixing aims to recover individual fluorophore concentrations from spectral images that capture mixed fluorophore emissions. Since classical methods operate pixel-wise and rely on least-squares fitting, their performance degrades with increasingly overlapping emission spectra and higher levels of noise, suggesting that a data-driven approach that can learn and utilize a structural prior might lead to improved results. Learning-based approaches for spectral imaging do exist, but they are either not optimized for microscopy data or are developed for very specific cases that are not applicable to fluorescence microscopy settings. To address this, we propose λSplit, a physics-informed deep generative model that learns a conditional distribution over concentration maps using a hierarchical Variational Autoencoder. A fully differentiable Spectral Mixer enforces consistency with the image formation process, while the learned structural priors enable state-of-the-art unmixing and implicit noise removal. We demonstrate λSplit on 3 real-world datasets that we synthetically cast into a total of 66 challenging spectral unmixing benchmarks. We compare our results against a total of 10 baseline methods, including classical methods and a range of learning-based methods. Our results consistently show competitive performance and improved robustness in high noise regimes, when spectra overlap considerably, or when the spectral dimensionality is lowered, making λSplit a new state-of-the-art for spectral unmixing of fluorescence microscopy data.
Importantly, λSplit is compatible with spectral data produced by standard confocal microscopes, enabling immediate adoption without specialized hardware modifications.
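
As background for the classical baseline this abstract contrasts with, the sketch below shows pixel-wise least-squares unmixing under a linear mixing model, in which each channel intensity is a weighted sum of fluorophore concentrations. It is not λSplit itself; the two-fluorophore restriction and the reference spectra are assumptions chosen to keep the example small.

```python
def mix_pixel(spectra, concentrations):
    """Linear image formation: per-channel intensity from two concentrations."""
    x1, x2 = concentrations
    return [s1 * x1 + s2 * x2 for s1, s2 in spectra]


def unmix_pixel(spectra, observed):
    """Least-squares estimate of two concentrations for a single pixel.

    spectra:  list of (s1, s2) reference emission values, one pair per channel
    observed: list of measured intensities, one per channel
    Solves the 2x2 normal equations (S^T S) x = S^T y in closed form.
    """
    a11 = sum(s1 * s1 for s1, _ in spectra)
    a12 = sum(s1 * s2 for s1, s2 in spectra)
    a22 = sum(s2 * s2 for _, s2 in spectra)
    b1 = sum(s1 * y for (s1, _), y in zip(spectra, observed))
    b2 = sum(s2 * y for (_, s2), y in zip(spectra, observed))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("spectra are (nearly) collinear; unmixing is ill-posed")
    x1 = (b1 * a22 - b2 * a12) / det
    x2 = (a11 * b2 - a12 * b1) / det
    return x1, x2


# Hypothetical reference spectra over three detection channels.
reference = [(0.7, 0.1), (0.2, 0.3), (0.1, 0.6)]
measured = mix_pixel(reference, (2.0, 5.0))
estimate = unmix_pixel(reference, measured)
```

The determinant check makes the failure mode visible: as two emission spectra become more similar, `det` approaches zero and noise in the measurement is amplified, which is the regime where the abstract argues a learned structural prior helps.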

[82] Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge

Masoumeh Chapariniya,Aref Farhadipour,Sarah Ebling,Volker Dellwo,Teodora Vukovic

Main category: cs.CV

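TL;DR: Presents a system for blended emotion recognition and relative salience prediction in the BLEMORE Challenge that fuses six encoder families (including an adapted S4D-ViTMoE, frozen layer-selective Wav2Vec2, fine-tuned TimeSformer/VideoMAE, and Gemini Embedding 2.0, used for emotion recognition for the first time) and reaches Score = 0.279 on the test set. Key findings: selecting prosody-encoding layers (6-12) from frozen Wav2Vec2 beats end-to-end fine-tuning; the salience threshold β varies widely across folds, pointing to personalized expression styles; task-adapted encoders receive 62% of the ensemble weight.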

Details

Motivation: To address the difficulties of blended emotion recognition and relative salience prediction in the BLEMORE Challenge, in particular non-verbal audio modeling, short-video representation, and individual differences in expression. Method: Late probability fusion over six encoder families: S4D-ViTMoE (soft-label KL training), frozen layer-selective Wav2Vec2 (layers 6-12), fine-tuned TimeSformer and VideoMAE, and Gemini Embedding 2.0 video embeddings, used here for the first time; the post-processing threshold β and the encoder weight allocation are also analyzed. Result: Test-set Score = 0.279 (ACCP = 0.391, ACCS = 0.168), placing 6th; selecting Wav2Vec2 layers 6-12 gives Score 0.207 versus 0.161 for end-to-end fine-tuning; β varies from 0.05 to 0.43 across folds; task-adapted encoders receive 62% of the ensemble weight. Conclusion: Frozen prosody features, fusion of multiple task-adapted encoders, and personalized salience modeling are key to better blended emotion recognition; Gemini Embedding 2.0 shows promise on very short (2-second) videos. Abstract: We present our system for the BLEMORE Challenge at FG 2026 on blended emotion recognition with relative salience prediction. Our approach combines six encoder families through late probability fusion: an S4D-ViTMoE face encoder adapted with soft-label KL training, frozen layer-selective Wav2Vec2 audio features, finetuned body-language encoders (TimeSformer, VideoMAE), and -- for the first time in emotion recognition -- Gemini Embedding 2.0, a large multimodal model whose video embeddings produce competitive presence accuracy (ACCP = 0.320) from only 2 seconds of input. Three key findings emerge from our experiments: selecting prosody-encoding layers (6--12) from frozen Wav2Vec2 outperforms end-to-end finetuning (Score 0.207 vs. 0.161), as the non-verbal nature of BLEMORE audio makes phonetic layers irrelevant; the post-processing salience threshold $β$ varies from 0.05 to 0.43 across folds, revealing that personalized expression styles are the primary bottleneck; and task-adapted encoders collectively receive 62% of ensemble weight over general-purpose baselines. Our 12-encoder system achieves Score = 0.279 (ACCP = 0.391, ACCS = 0.168) on the test set, placing 6th.
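
Late probability fusion, as named in this abstract, combines per-encoder class probabilities after each encoder has made its own prediction. The sketch below shows one common form, a weighted average; the encoder outputs and weights are made-up numbers, and how the authors actually learn the weights or apply the salience threshold β is not specified here and is left out.

```python
def late_fusion(prob_lists, weights):
    """Weighted average of per-encoder class probability vectors.

    prob_lists: one probability vector per encoder, all of the same length
    weights:    one non-negative weight per encoder (normalized internally)
    """
    if len(prob_lists) != len(weights):
        raise ValueError("need exactly one weight per encoder")
    total = sum(weights)
    num_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(num_classes)
    ]


# Hypothetical outputs of two encoders over two emotion classes.
face_probs = [0.6, 0.4]
audio_probs = [0.2, 0.8]
fused = late_fusion([face_probs, audio_probs], weights=[3.0, 1.0])
```

Because fusion happens on probabilities rather than features, encoders with different input modalities and embedding sizes can be combined without retraining, and the normalized weights can be read directly as each encoder's share of the ensemble.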

[83] Estimating Individual Tree Height and Species from UAV Imagery

Jannik Endres,Etienne Laliberté,David Rolnick,Arthur Ouaknine

Main category: cs.CV

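TL;DR: This paper introduces the BIRCH-Trees benchmark and the DINOvTree method for jointly estimating the height and species of individual trees from UAV RGB imagery, achieving accurate results with a low parameter count.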

Details

Motivation: Accurate forest biomass estimation depends on tree-level traits such as height and species, but traditional methods are costly and scale poorly, so an efficient solution based on low-cost UAV RGB imagery is needed. Method: The authors build the BIRCH-Trees benchmark, covering temperate forests, tropical forests, and boreal plantations, and propose DINOvTree, which uses a Vision Foundation Model (VFM) backbone with task-specific heads to jointly predict tree height (regression) and species (classification). Result: DINOvTree achieves the best overall performance on BIRCH-Trees, with accurate height predictions and competitive species classification, while using only 54% to 58% of the parameters of the second-best approach. Conclusion: DINOvTree demonstrates that a unified VFM architecture is both effective and efficient for multi-task tree trait estimation, offering a new approach to large-scale forest carbon-sink monitoring from UAV imagery. Abstract: Accurate estimation of forest biomass, a major carbon sink, relies heavily on tree-level traits such as height and species. Unoccupied Aerial Vehicles (UAVs) capturing high-resolution imagery from a single RGB camera offer a cost-effective and scalable approach for mapping and measuring individual trees. We introduce BIRCH-Trees, the first benchmark for individual tree height and species estimation from tree-centered UAV images, spanning three datasets: temperate forests, tropical forests, and boreal plantations. We also present DINOvTree, a unified approach using a Vision Foundation Model (VFM) backbone with task-specific heads for simultaneous height and species prediction. Through extensive evaluations on BIRCH-Trees, we compare DINOvTree against commonly used vision methods, including VFMs, as well as biological allometric equations. We find that DINOvTree achieves top overall results with accurate height predictions and competitive classification accuracy while using only 54% to 58% of the parameters of the second-best approach.

[84] Prototype Fusion: A Training-Free Multi-Layer Approach to OOD Detection

Shreen Gul,Mohamed Elmahallawy,Ardhendu Tripathy,Sanjay Madria

Main category: cs.CV

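TL;DR: This paper proposes a simple, effective OOD detection method that uses internal representations from multiple layers: it aggregates features from several convolutional blocks, builds class-mean prototypes, and uses cosine similarity as the OOD score, improving performance on multiple benchmarks.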

Details

Motivation: Existing OOD detection methods rely mainly on penultimate-layer activations; this paper questions that assumption and finds that intermediate layers also carry rich, discriminative information for OOD detection. Method: Aggregate features from successive convolutional blocks, compute class-wise mean embeddings, and apply $L_2$ normalization to form compact ID prototypes; at inference, the cosine similarity between test features and the prototypes serves as the OOD score. Result: The method is validated on several state-of-the-art OOD benchmarks and across architectures, improving AUROC by up to 4.41% and reducing FPR by 13.58%. Conclusion: Multi-layer feature aggregation is a powerful but underexplored signal for OOD detection and challenges the convention of relying only on the penultimate layer. Abstract: Deep learning models are increasingly deployed in safety-critical applications, where reliable out-of-distribution (OOD) detection is essential to ensure robustness. Existing methods predominantly rely on the penultimate-layer activations of neural networks, assuming they encapsulate the most informative in-distribution (ID) representations. In this work, we revisit this assumption to show that intermediate layers encode equally rich and discriminative information for OOD detection. Based on this observation, we propose a simple yet effective model-agnostic approach that leverages internal representations across multiple layers. Our scheme aggregates features from successive convolutional blocks, computes class-wise mean embeddings, and applies $L_2$ normalization to form compact ID prototypes capturing class semantics. During inference, cosine similarity between test features and these prototypes serves as an OOD score: ID samples exhibit strong affinity to at least one prototype, whereas OOD samples remain uniformly distant. Extensive experiments on state-of-the-art OOD benchmarks across diverse architectures demonstrate that our approach delivers robust, architecture-agnostic performance and strong generalization for image classification. Notably, it improves AUROC by up to 4.41% and reduces FPR by 13.58%, highlighting multi-layer feature aggregation as a powerful yet underexplored signal for OOD detection, challenging the dominance of penultimate-layer-based methods. Our code is available at: https://github.com/sgchr273/cosine-layers.git.
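
The prototype construction and scoring steps in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes the multi-layer features have already been aggregated into one vector per sample, and the helper names (`build_prototypes`, `id_score`) and toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_prototypes(features, labels):
    """Class-wise mean embeddings, each L2-normalized into a prototype."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: l2_normalize([s / counts[y] for s in sums[y]]) for y in sums}


def id_score(feature, prototypes):
    """Maximum cosine similarity to any prototype; a low value suggests OOD."""
    f = l2_normalize(feature)
    return max(sum(a * b for a, b in zip(f, p)) for p in prototypes.values())


# Toy, already-aggregated training features for two in-distribution classes.
train_feats = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
train_labels = [0, 0, 1, 1]
prototypes = build_prototypes(train_feats, train_labels)

in_dist = id_score([1.0, 0.05, 0.0], prototypes)   # close to the class-0 prototype
out_dist = id_score([0.0, 0.0, 1.0], prototypes)   # far from every prototype
```

Taking the maximum over prototypes mirrors the abstract's observation that ID samples have strong affinity to at least one prototype, while OOD samples stay distant from all of them.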

[85] MoCHA: Denoising Caption Supervision for Motion-Text Retrieval

Nikolai Warner,Cameron Ethan Taylor,Irfan Essa,Apaar Sadhwani

Main category: cs.CV

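TL;DR: This paper proposes MoCHA, a text canonicalization framework that reduces the text-embedding variance caused by annotator differences in motion-text retrieval, improving both retrieval performance and cross-dataset transfer.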

Details

Motivation: Standard contrastive learning treats each caption as the single positive target, ignoring the fact that the same motion admits many valid descriptions (captions are samples from a distribution), which weakens motion-text embedding alignment; semantics that cannot be recovered from the motion itself (such as style and inferred context) add further noise. Method: MoCHA is a text canonicalization framework that projects each raw caption onto a subspace retaining only motion-recoverable semantics (action type, body parts, directionality, and so on); it is implemented as a rule-based method and two learned variants, one LLM-based (GPT-5.2) and one lightweight distilled FlanT5 model, and works as a preprocessing module compatible with any retrieval architecture. Result: T2M R@1 improves on both HumanML3D and KIT-ML: the LLM variant reaches 13.9% (+3.1pp) and 24.3% (+10.3pp) respectively, and the T5 variant gains +2.5pp and +8.1pp; text-embedding variance drops by 11-19%; cross-dataset transfer improves substantially (H to K by 94%, K to H by 52%). Conclusion: Text canonicalization is an effective general principle for improving the robustness and generalization of motion-language retrieval; standardizing the language space yields more transferable motion-language representations. Abstract: Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
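
For readers unfamiliar with the T2M R@1 figures quoted above, a minimal sketch of Recall@k for text-to-motion retrieval follows. It assumes a square similarity matrix in which text `i` is paired with motion `i`; the benchmark protocols for HumanML3D and KIT-ML may differ in detail (for example, in how multiple captions per motion are handled), and the toy matrix is hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose paired motion ranks in the top k.

    similarity[i][j] is the score between text i and motion j;
    the ground-truth match for text i is assumed to be motion i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical text-to-motion similarity scores for three pairs.
sim = [
    [0.9, 0.1, 0.2],  # text 0 retrieves motion 0 first (hit)
    [0.3, 0.2, 0.8],  # text 1 ranks its own motion last (miss)
    [0.1, 0.4, 0.7],  # text 2 retrieves motion 2 first (hit)
]
r_at_1 = recall_at_k(sim, k=1)
r_at_3 = recall_at_k(sim, k=3)
```

Under this definition, tighter clusters of caption embeddings around each motion raise the chance that the paired motion is ranked first, which is the effect canonicalization is intended to produce.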

[86] AdvSplat: Adversarial Attacks on Feed-Forward Gaussian Splatting Models

Yiran Qiao,Yiren Lu,Yunlai Zhou,Rui Yang,Linlin Hou,Yu Yin,Jing Ma

Main category: cs.CV

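TL;DR: This paper presents AdvSplat, the first systematic study of adversarial attacks on feed-forward 3D Gaussian Splatting (3DGS) models. It exposes their vulnerability and designs two query-efficient black-box attacks that use a frequency-domain parameterization of pixel-space perturbations to severely disrupt reconstruction with imperceptible changes.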

Details

Motivation: Feed-forward 3DGS models have commercial potential, but their neural-network backbones are susceptible to adversarial attacks, and this safety and robustness problem has so far been overlooked. Method: White-box attacks are first used to reveal fundamental vulnerabilities; the authors then propose two practical, query-efficient black-box attacks, one based on gradient estimation and one gradient-free, both generating pixel-space perturbations through a frequency-domain parameterization. Result: Experiments on multiple datasets show that AdvSplat can significantly degrade 3D reconstruction using imperceptible perturbations of the input images. Conclusion: The work reveals an important security risk for feed-forward 3DGS models and calls on the community to address their robustness and security. Abstract: 3D Gaussian Splatting (3DGS) is increasingly recognized as a powerful paradigm for real-time, high-fidelity 3D reconstruction. However, its per-scene optimization pipeline limits scalability and generalization, and prevents efficient inference. Recently emerged feed-forward 3DGS models address these limitations by enabling fast reconstruction from a few input views after large-scale pretraining, without scene-specific optimization. Despite their advantages and strong potential for commercial deployment, the use of neural networks as the backbone also amplifies the risk of adversarial manipulation. In this paper, we introduce AdvSplat, the first systematic study of adversarial attacks on feed-forward 3DGS. We first employ white-box attacks to reveal fundamental vulnerabilities of this model family. We then develop two improved, practically relevant, query-efficient black-box algorithms that optimize pixel-space perturbations via a frequency-domain parameterization: one based on gradient estimation and the other gradient-free, without requiring any access to model internals. Extensive experiments across multiple datasets demonstrate that AdvSplat can significantly disrupt reconstruction results by injecting imperceptible perturbations into the input images. Our findings surface an overlooked yet urgent problem in this domain, and we hope to draw the community's attention to this emerging security and robustness challenge.

[87] CoRe: Joint Optimization with Contrastive Learning for Medical Image Registration

Eytan Kats,Christoph Grossbroehmer,Ziad Al-Haj Hemidi,Fenja Falta,Wiebke Heyer,Mattias P. Heinrich

Main category: cs.CV

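TL;DR: This paper proposes a framework that integrates equivariant contrastive learning directly into a medical image registration model. Jointly optimizing the contrastive and registration objectives yields feature representations that are robust to tissue deformation and outperforms baselines on abdominal and thoracic registration tasks.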

Details

Motivation: Medical image registration is challenged by intensity inconsistencies and nonlinear tissue deformations; existing self-supervised representation learning methods are promising but mostly follow a two-stage pipeline in which feature learning is not aligned with the registration objective. Method: An end-to-end framework embeds equivariant contrastive learning directly in the registration model and jointly optimizes a contrastive loss (to make features invariant to deformation) and a registration loss (for example, deformation-field smoothness and image similarity). Result: On intra-patient and inter-patient registration of abdominal and thoracic images, the method significantly outperforms strong baselines. Conclusion: Jointly optimizing contrastive learning and registration improves the robustness of features to tissue deformation and thereby registration accuracy, supporting the end-to-end learning paradigm. Abstract: Medical image registration is a fundamental task in medical image analysis, enabling the alignment of images from different modalities or time points. However, intensity inconsistencies and nonlinear tissue deformations pose significant challenges to the robustness of registration methods. Recent approaches leveraging self-supervised representation learning show promise by pre-training feature extractors to generate robust anatomical embeddings that are then used for registration. In this work, we propose a novel framework that integrates equivariant contrastive learning directly into the registration model. Our approach leverages the power of contrastive learning to learn robust feature representations that are invariant to tissue deformations. By jointly optimizing the contrastive and registration objectives, we ensure that the learned representations are not only informative but also suitable for the registration task. We evaluate our method on abdominal and thoracic image registration tasks, including both intra-patient and inter-patient scenarios. Experimental results demonstrate that the integration of contrastive learning directly into the registration framework significantly improves performance, surpassing strong baseline methods.

[88] Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks

Morui Zhu,Yongqi Zhu,Song Fu,Qing Yang

Main category: cs.CV

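TL;DR: This paper proposes dCAP, a vision-based framework that continuously estimates the 6-DoF relative pose between tractor and trailer cameras, addressing the perception and calibration problems that articulation and time-varying sensor poses cause for autonomous trucks. It is validated on STT4AT, a newly built simulation benchmark.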

Details

Motivation: Existing perception and calibration methods assume static sensor baselines or depend on high-parallax, texture-rich scenes, so they are unreliable in real autonomous-trucking conditions, where the fifth-wheel joint and trailer flex change sensor poses over time. Method: The vision-based dCAP framework uses a transformer with cross-view and temporal attention to continuously estimate the 6-DoF relative pose between tractor and trailer cameras; it is integrated into BEVFormer, replacing static calibration with dynamically predicted extrinsics. Result: dCAP maintains stable, accurate perception under rapid articulation and occlusion and improves 3D object detection; the authors also build STT4AT, a CARLA-based simulation benchmark, for evaluation. Conclusion: dCAP overcomes the limitations of static calibration in autonomous trucking and offers a reliable approach to dynamic perception and calibration for articulated vehicles; the dataset, development kit, and source code will be released. Abstract: Autonomous trucking poses unique challenges due to articulated tractor-trailer geometry and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax and texture-rich scenes, limiting their reliability under real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (degree of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate evaluation, we introduce STT4AT, a CARLA-based benchmark simulating semi-trailer trucks with synchronized multi-sensor suites and time-varying inter-rig geometry across diverse environments. Experiments demonstrate that dCAP achieves stable, accurate perception while addressing the limitations of static calibration in autonomous trucking. The dataset, development kit, and source code will be publicly released.

[89] Bi-CRCL: Bidirectional Conservative-Radical Complementary Learning with Pre-trained Foundation Models for Class-incremental Medical Image Analysis

Xinyao Wu,Zhe Xu,Cheng Chen,Jiawei Ma,Yefeng Zheng,Raymond Kai-yu Tong

Main category: cs.CV

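TL;DR: This paper proposes Bi-CRCL, a dual-learner framework for class-incremental learning (CIL) in medical imaging that tackles catastrophic forgetting and adaptation to heterogeneous cross-institution data. A bidirectional complementary mechanism between a conservative and a radical learner yields consistent gains over existing methods on multiple medical datasets.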

Details

Motivation: Class-incremental learning in medical imaging faces heterogeneous data and privacy constraints that rule out memory replay; pretrained foundation models have not been systematically evaluated or adapted in this domain, where anatomical complexity and inter-institutional differences must both be handled. Method: Bidirectional Conservative-Radical Complementary Learning (Bi-CRCL) is a dual-learner framework: a conservative learner uses stability-oriented updates to preserve prior knowledge, and a radical learner uses plasticity-oriented updates to adapt quickly to new disease categories; a bidirectional interaction mechanism provides forward transfer and backward consolidation, and the outputs of both learners are adaptively fused at inference. Result: Experiments on five medical imaging datasets show that Bi-CRCL consistently outperforms state-of-the-art methods under diverse settings, including cross-dataset shifts and varying task configurations. Conclusion: Bi-CRCL mitigates catastrophic forgetting in medical CIL and improves adaptation to new disease categories and robustness of generalization, offering a viable path to scalable clinical deployment. Abstract: Class-incremental learning (CIL) in medical image-guided diagnosis requires retaining prior diagnostic knowledge while adapting to newly emerging disease categories, which is critical for scalable clinical deployment. This problem is particularly challenging due to heterogeneous data and privacy constraints that prevent memory replay. Although pretrained foundation models (PFMs) have advanced general-domain CIL, their potential in medical imaging remains underexplored, where domain-specific adaptation is essential yet difficult due to anatomical complexity and inter-institutional heterogeneity. To address this gap, we conduct a systematic benchmark of recent PFM-based CIL methods and propose Bidirectional Conservative-Radical Complementary Learning (Bi-CRCL), a dual-learner framework inspired by complementary learning systems. Bi-CRCL integrates a conservative learner that preserves prior knowledge through stability-oriented updates and a radical learner that rapidly adapts to new categories via plasticity-oriented learning. A bidirectional interaction mechanism enables forward transfer and backward consolidation, allowing continual integration of new knowledge while mitigating catastrophic forgetting. During inference, outputs from both learners are adaptively fused for robust predictions. Experiments on five medical imaging datasets demonstrate consistent improvements over state-of-the-art methods under diverse settings, including cross-dataset shifts and varying task configurations.

[90] An Adapter-free Fine-tuning Approach for Tuning 3D Foundation Models

Sneha Paul,Zachary Patterson,Nizar Bouguila

Main category: cs.CV

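TL;DR: This paper proposes MCFT, an adapter-free fine-tuning method for point cloud foundation models that balances performance and efficiency in low-data regimes and outperforms both full fine-tuning and existing parameter-efficient fine-tuning methods.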

Details

Motivation: Point cloud foundation models generalize well, but fine-tuning them in low-data regimes tends to overfit and drift away from pre-trained representations; existing parameter-efficient fine-tuning (PEFT) methods reduce overfitting but add parameters and increase inference latency. Method: Momentum-Consistency Fine-Tuning (MCFT) selectively fine-tunes part of the pre-trained encoder while applying a momentum-based consistency constraint to preserve task-agnostic representations; it adds no learnable parameters beyond the task head, and is extended with a semi-supervised variant (using unlabeled data) and a pruning variant (structured layer pruning). Result: On object recognition and part segmentation benchmarks, MCFT improves on prior methods by 3.30% in 5-shot settings and by up to 6.13% with semi-supervised learning, while keeping the original model's parameter count and inference efficiency, which suits resource-constrained deployment. Conclusion: MCFT balances representation fidelity and task adaptation in low-data fine-tuning without sacrificing inference efficiency, offering a practical approach to using point cloud foundation models. Abstract: Point cloud foundation models demonstrate strong generalization, yet adapting them to downstream tasks remains challenging in low-data regimes. Full fine-tuning often leads to overfitting and significant drift from pre-trained representations, while existing parameter-efficient fine-tuning (PEFT) methods mitigate this issue by introducing additional trainable components at the cost of increased inference-time latency. We propose Momentum-Consistency Fine-Tuning (MCFT), an adapter-free approach that bridges the gap between full and parameter-efficient fine-tuning. MCFT selectively fine-tunes a portion of the pre-trained encoder while enforcing a momentum-based consistency constraint to preserve task-agnostic representations. Unlike PEFT methods, MCFT introduces no additional representation learning parameters beyond a standard task head, maintaining the original model's parameter count and inference efficiency. We further extend MCFT with two variants: a semi-supervised framework that leverages abundant unlabeled data to enhance few-shot performance, and a pruning-based variant that improves computational efficiency through structured layer removal. Extensive experiments on object recognition and part segmentation benchmarks demonstrate that MCFT consistently outperforms prior methods, achieving a 3.30% gain in 5-shot settings and up to a 6.13% improvement with semi-supervised learning, while remaining well-suited for resource-constrained deployment.
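
The abstract does not spell out the form of the momentum-based consistency constraint. One common reading, offered here purely as an assumption, is an exponential-moving-average (EMA) copy of the encoder whose features the fine-tuned encoder is penalized for drifting from. The sketch below uses flat parameter and feature lists; the function names and the momentum value are hypothetical.

```python
def ema_update(momentum_params, student_params, m=0.99):
    """Move the momentum (EMA) parameters slowly toward the student's.

    Assumed update rule: theta_m <- m * theta_m + (1 - m) * theta_s.
    """
    return [m * pm + (1.0 - m) * ps
            for pm, ps in zip(momentum_params, student_params)]


def consistency_loss(student_feat, momentum_feat):
    """Mean squared difference between student and momentum-encoder features.

    Assumed loss form; the paper may use a different distance.
    """
    return sum((a - b) ** 2
               for a, b in zip(student_feat, momentum_feat)) / len(student_feat)


# Toy values: the momentum parameters lag behind the fine-tuned student.
momentum = ema_update([1.0, 2.0], [2.0, 0.0], m=0.9)
loss = consistency_loss([1.0, 2.0], [1.0, 4.0])
```

In a scheme like this, the EMA copy is needed only during training, which would be consistent with the abstract's claim that the deployed model keeps its original parameter count.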

[91] Detection and Classification of (Pre)Cancerous Cells in Pap Smears: An Ensemble Strategy for the RIVA Cervical Cytology Challenge

Lautaro Kogan,María Victoria Ríos

Main category: cs.CV

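TL;DR: This paper proposes a YOLOv11m-based ensemble that combines three strategies (loss reweighting, data resampling, and transfer learning) and merges model outputs with Weighted Boxes Fusion (WBF) to improve multi-class cervical cell detection, raising mAP in the ISBI 2026 RIVA challenge.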

Details

Motivation: Cell detection in Pap smear images suffers from severe class imbalance and overlapping nuclei; addressing these would strengthen automation for large-scale cervical cancer screening. Method: Using YOLOv11m as the base model, the authors systematically evaluate three imbalance-mitigation strategies (loss reweighting, data resampling, and transfer learning) and build an ensemble that merges the outputs of the models trained under each strategy via Weighted Boxes Fusion (WBF). Result: The ensemble achieves mAP50-95 of 0.201 on the preliminary test set and 0.147 on the final test set, a 29% improvement over the best individual model on the final test set. Conclusion: Combining several complementary imbalance-mitigation strategies in an ensemble improves multi-class cervical cell detection, supporting this approach for real medical image detection tasks. Abstract: Automated detection and classification of cervical cells in conventional Pap smear images can strengthen cervical cancer screening at scale by reducing manual workload, improving triage, and increasing consistency across readers. However, it is challenged by severe class imbalance and frequent nuclear overlap. We present our approach to the RIVA Cervical Cytology Challenge (ISBI 2026), which requires multi-class detection of eight Bethesda cell categories under these conditions. Using YOLOv11m as the base architecture, we systematically evaluate three strategies to improve detection performance: loss reweighting, data resampling and transfer learning. We build an ensemble by combining models trained under each strategy, promoting complementary detection behavior and combining them through Weighted Boxes Fusion (WBF). The ensemble achieves a mAP50-95 of 0.201 on the preliminary test set and 0.147 on the final test set, representing a 29% improvement over the best individual model on the final test set and demonstrating the effectiveness of combining complementary imbalance mitigation strategies.

[92] IJmond Industrial Smoke Segmentation Dataset

Yen-Chia Hsu,Despoina Touska

Main category: cs.CV

TL;DR: This paper introduces a publicly available dataset for industrial smoke segmentation, licensed under CC BY 4.0.

Details

Motivation: There is a need for specialized datasets to support research and development in industrial smoke detection and segmentation. Method: The authors constructed and published a dataset specifically for industrial smoke segmentation on Figshare. Result: A new dataset for industrial smoke segmentation is made publicly available under CC BY 4.0 license. Conclusion: The released dataset fills a gap in the availability of benchmark data for industrial smoke segmentation tasks. Abstract: This report describes a dataset for industrial smoke segmentation, published on a figshare repository (https://doi.org/10.21942/uva.31847188). The dataset is licensed under CC BY 4.0.

[93] Learning Cross-Joint Attention for Generalizable Video-Based Seizure Detection

Omar Zamzam,Takfarinas Medani,Chinmay Chinara,Richard Leahy

Main category: cs.CV

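TL;DR: This paper proposes a joint-centric attention method for automated video-based seizure detection. By focusing on body dynamics, suppressing background context, and modeling spatio-temporal interactions between joints, it improves generalization to unseen subjects.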

Details

Motivation: Existing video-based seizure detection methods generalize poorly to unseen subjects because of background bias and reliance on subject-specific appearance cues. Method: A joint-centric attention model first detects body joints and extracts joint-centered video clips to suppress background; the clips are then tokenized with a Video Vision Transformer (ViViT), and cross-joint attention models coordinated spatio-temporal movement patterns across joints. Result: In cross-subject experiments, the method consistently outperforms state-of-the-art CNN-, graph-, and transformer-based approaches. Conclusion: A joint-centric modeling paradigm that attends only to body dynamics improves cross-subject generalization in video-based seizure detection. Abstract: Automated seizure detection from long-term clinical videos can substantially reduce manual review time and enable real-time monitoring. However, existing video-based methods often struggle to generalize to unseen subjects due to background bias and reliance on subject-specific appearance cues. We propose a joint-centric attention model that focuses exclusively on body dynamics to improve cross-subject generalization. For each video segment, body joints are detected and joint-centered clips are extracted, suppressing background context. These joint-centered clips are tokenized using a Video Vision Transformer (ViViT), and cross-joint attention is learned to model spatial and temporal interactions between body parts, capturing coordinated movement patterns characteristic of seizure semiology. Extensive cross-subject experiments show that the proposed method consistently outperforms state-of-the-art CNN-, graph-, and transformer-based approaches on unseen subjects.

[94] Semantic Iterative Reconstruction: One-Shot Universal Anomaly Detection

Ning Zhu

Main category: cs.CV

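TL;DR: This paper proposes the Semantic Iterative Reconstruction (SIR) framework, which trains a single universal model using only one normal image per dataset and achieves state-of-the-art cross-modality, few-shot anomaly detection on nine medical imaging benchmarks.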

Details

Motivation: Existing unsupervised medical anomaly detection methods are limited by the scarcity of normal samples, require a separate model per task, and lack cross-modality generalization. Method: SIR uses a pretrained teacher encoder to extract multi-scale deep features, a compact up-then-down decoder, and multi-loop iterative reconstruction to reinforce normality priors in deep feature space; it follows a one-shot universal training scheme in which a single model is trained on a mix of exactly one normal image from each of nine heterogeneous datasets. Result: On nine medical benchmarks, SIR achieves state-of-the-art performance in all four settings (one-shot or full-shot, universal or specialized), outperforming previous methods. Conclusion: SIR provides an efficient, scalable, general solution for multi-domain clinical anomaly detection. Abstract: Unsupervised medical anomaly detection is severely limited by the scarcity of normal training samples. Existing methods typically train dedicated models for each dataset or disease, requiring hundreds of normal images per task and lacking cross-modality generalization. We propose Semantic Iterative Reconstruction (SIR), a framework that enables a single universal model to detect anomalies across diverse medical domains using extremely few normal samples. SIR leverages a pretrained teacher encoder to extract multi-scale deep features and employs a compact up-then-down decoder with multi-loop iterative refinement to enforce robust normality priors in deep feature space. The framework adopts a one-shot universal design: a single model is trained by mixing exactly one normal sample from each of nine heterogeneous datasets, enabling effective anomaly detection on all corresponding test sets without task-specific retraining. Extensive experiments on nine medical benchmarks demonstrate that SIR achieves state-of-the-art performance under all four settings -- one-shot universal, full-shot universal, one-shot specialized, and full-shot specialized -- consistently outperforming previous methods. SIR offers an efficient and scalable solution for multi-domain clinical anomaly detection. Code is available at https://github.com/jusufzn212427/sir4ad.

[95] Retinal Disease Classification from Fundus Images using CNN Transfer Learning

Ali Akram

Main category: cs.CV

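TL;DR: This paper presents a reproducible deep learning pipeline for binary retinal disease risk screening from public fundus images. A VGG16 transfer learning model (90.8% accuracy, F1 = 0.90) clearly outperforms a baseline CNN (83.1%), but sensitivity to minority disease cases remains a challenge.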

Details

Motivation: Retinal diseases are a leading cause of preventable visual impairment worldwide, and automated fundus image analysis is needed to widen access to early screening, especially where medical resources are scarce. Method: A baseline CNN is built and compared with a transfer learning model based on a pretrained VGG16; class weighting mitigates data imbalance; evaluation metrics include accuracy, precision, recall, F1-score, confusion matrices, and ROC-AUC. Result: The VGG16 transfer learning model reaches 90.8% test accuracy and a weighted F1-score of 0.90, clearly better than the baseline CNN (83.1% accuracy), but sensitivity to minority disease cases remains insufficient. Conclusion: Transfer learning improves retinal disease risk classification, but dataset bias, class imbalance, and threshold selection remain key issues for clinically reliable screening; the study emphasizes reproducibility and directions for improvement. Abstract: Retinal diseases remain among the leading preventable causes of visual impairment worldwide. Automated screening based on fundus image analysis has the potential to expand access to early detection, particularly in underserved populations. This paper presents a reproducible deep learning pipeline for binary retinal disease risk classification from publicly available fundus photographs. We implement and compare a baseline convolutional neural network with a transfer learning approach using a pretrained VGG16 backbone and evaluate generalization on held-out data. To address class imbalance, we apply class weighting and report standard classification metrics including accuracy, precision, recall, F1-score, confusion matrices, and ROC-AUC. The VGG16 transfer learning model achieves 90.8% test accuracy with a weighted F1-score of 0.90, substantially outperforming the baseline CNN (83.1% accuracy). Results indicate that transfer learning improves discrimination compared to a baseline CNN, while also revealing remaining challenges in sensitivity to minority disease cases. We discuss practical limitations related to dataset characteristics, class imbalance, and threshold selection, and provide guidance for reproducibility and future improvements for clinically reliable screening.
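
The abstract mentions class weighting without giving the formula. A minimal sketch of one widely used heuristic, inverse-frequency ("balanced") weights $w_c = N / (K \cdot n_c)$, is shown below; whether the paper uses exactly this scheme is an assumption, and the 80/20 label split is made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c).

    N is the number of samples, K the number of classes,
    and n_c the number of samples in class c.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n_c) for c, n_c in counts.items()}


# Hypothetical imbalanced binary dataset: 80 healthy (0), 20 disease-risk (1).
labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(labels)
```

With these weights, each minority-class error contributes four times as much to the loss as a majority-class error, which pushes the classifier toward higher sensitivity on the rarer class.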

[96] Re-Prompting SAM 3 via Object Retrieval: 3rd of the 5th PVUW MOSE Track

Mingqi Gao,Sijie Li,Jungong Han

Main category: cs.CV

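TL;DR: This paper proposes an automatic re-prompting framework built on SAM 3 and DINOv3 to make semi-supervised video object segmentation more robust to target disappearance and reappearance, severe transformation, and same-category distractors. It achieves a J&F of 51.17% on the MOSEv2 test set, ranking 3rd.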

Details

Motivation: Improve the robustness of semi-supervised video object segmentation on MOSEv2 under target disappearance and reappearance, severe transformation, and strong same-category distractors. Method: An automatic re-prompting framework built on SAM 3: the SAM 3 detector first finds same-category candidates in later frames, DINOv3 then performs transformation-aware object-level matching against a target feature pool to retrieve reliable anchors, and these anchors are injected into the SAM 3 tracker together with the first-frame mask to enable multi-anchor propagation. Result: A J&F of 51.17% on the MOSEv2 test set, ranking 3rd in the track. Conclusion: Multi-anchor propagation improves the robustness and accuracy of SAM 3 in complex semi-supervised video segmentation, supporting the value of re-prompting and object-level matching. Abstract: This technical report explores the MOSEv2 track of the PVUW 2026 Challenge, which targets complex semi-supervised video object segmentation. Built on SAM 3, we develop an automatic re-prompting framework to improve robustness under target disappearance and reappearance, severe transformation, and strong same-category distractors. Our method first applies the SAM 3 detector to later frames to identify same-category object candidates, and then performs DINOv3-based object-level matching with a transformation-aware target feature pool to retrieve reliable target anchors. These anchors are injected back into the SAM 3 tracker together with the first-frame mask, enabling multi-anchor propagation rather than relying solely on the initial prompt. This simple design directly addresses several core challenges of MOSEv2. Our solution achieves a J&F of 51.17% on the test set, ranking 3rd in the MOSEv2 track.

[97] Sparse Autoencoders for Interpretable Medical Image Representation Learning

Philipp Wesp,Robbie Holland,Vasiliki Sideri-Lampretsa,Sergios Gatidis

Main category: cs.CV

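TL;DR: This study explores Sparse Autoencoders (SAEs) for improving the interpretability of medical vision foundation models: abstract embeddings are mapped to sparse concept features that can be expressed in language, giving high interpretability while preserving strong performance.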

Details

Motivation: Medical vision foundation models perform well, but clinicians cannot inspect or verify their abstract latent representations, so better interpretability is needed. Method: Sparse Autoencoders (SAEs) are trained on embeddings from BiomedParse and DINOv3, using 909,873 CT/MRI slices from TotalSegmentator; sparse features are interpreted automatically with an LLM, and the authors evaluate reconstruction fidelity, downstream performance, image retrieval, and zero-shot language-driven retrieval. Result: SAEs reconstruct the original embeddings with high fidelity ($R^2$ up to 0.941) and recover up to 87.8% of downstream performance using only 10 sparse features (99.4% dimensionality reduction); semantic retrieval remains stable; sparse features can be accurately verbalized by an LLM; and zero-shot language-driven image retrieval is supported. Conclusion: SAEs offer a viable path toward interpretable, concept-driven medical vision systems. Abstract: Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to investigate Sparse Autoencoders (SAEs) for replacing opaque FM image representations with human-interpretable, sparse features. We train SAEs on embeddings from BiomedParse (biomedical) and DINOv3 (general-purpose) using 909,873 CT and MRI 2D image slices from the TotalSegmentator dataset. We find that learned sparse features: (a) reconstruct original embeddings with high fidelity ($R^2$ up to 0.941) and recover up to 87.8% of downstream performance using only 10 features (99.4% dimensionality reduction), (b) preserve semantic fidelity in image retrieval tasks, (c) correspond to specific concepts that can be expressed in language using large language model (LLM)-based auto-interpretation, and (d) bridge clinical language and abstract latent representations in zero-shot language-driven image retrieval. Our work indicates SAEs are a promising pathway towards interpretable, concept-driven medical vision systems. Code repository: https://github.com/pwesp/sail.

[98] 3D-LLDM: Label-Guided 3D Latent Diffusion Model for Improving High-Resolution Synthetic MR Imaging in Hepatic Structure Segmentation

Kyeonghun Kim,Jaehyeok Bae,Youngung Han,Joo Young Bae,Seoyoung Ju,Junsu Lim,Gyeongmin Kim,Nam-Joon Kim,Woo Kyoung Jeong,Ken Ying-Kai Liao,Won Jae Lee,Pa Hong,Hyuk-Jae Lee

Main category: cs.CV

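TL;DR: This paper proposes 3D-LLDM, a label-guided 3D latent diffusion model that generates high-quality synthetic MR volumes with anatomical segmentation masks and improves hepatocellular carcinoma segmentation.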

Details

Motivation: In medical imaging, the scarcity of reliable annotated datasets limits the use of deep learning and generative models. Method: 3D-LLDM is a label-guided 3D latent diffusion model with a ControlNet-based architecture; it uses Gd-EOB-DTPA-enhanced hepatobiliary phase MR images to derive structural masks for the liver, portal vein, hepatic vein, and hepatocellular carcinoma, which then guide volumetric synthesis. Result: Trained on 720 real clinical MR scans, it achieves an FID of 28.31, improving over GANs by 70.9% and over state-of-the-art diffusion baselines by 26.7%; when used for data augmentation, it improves hepatocellular carcinoma segmentation by up to 11.153% Dice score. Conclusion: 3D-LLDM generates high-quality annotated synthetic MR data and improves downstream segmentation when labeled data are limited, offering a new approach to generative modeling in medical imaging. Abstract: Deep learning and generative models are advancing rapidly, with synthetic data increasingly being integrated into training pipelines for downstream analysis tasks. However, in medical imaging, their adoption remains constrained by the scarcity of reliable annotated datasets. To address this limitation, we propose 3D-LLDM, a label-guided 3D latent diffusion model that generates high-quality synthetic magnetic resonance (MR) volumes with corresponding anatomical segmentation masks. Our approach uses hepatobiliary phase MR images enhanced with the Gd-EOB-DTPA contrast agent to derive structural masks for the liver, portal vein, hepatic vein, and hepatocellular carcinoma, which then guide volumetric synthesis through a ControlNet-based architecture. Trained on 720 real clinical hepatobiliary phase MR scans from Samsung Medical Center, 3D-LLDM achieves a Fréchet Inception Distance (FID) of 28.31, improving over GANs by 70.9% and over state-of-the-art diffusion baselines by 26.7%. When used for data augmentation, the synthetic volumes improve hepatocellular carcinoma segmentation by up to 11.153% Dice score across five CNN architectures.

[99] See, Remember, Explore: A Benchmark and Baselines for Streaming Spatial Reasoning

Yuxi Wei,Wei Huang,Qirui Chen,Lu Hou,Xiaojuan Qi

Main category: cs.CV

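TL;DR: This paper introduces the S3-Bench benchmark and the AMF-VLM model to support streaming spatial question answering and active perception for embodied agents, addressing the weaknesses of existing spatial vision-language models in long-horizon inference and active exploration.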

Details

Motivation: Existing spatial vision-language models (VLMs) and benchmarks are mostly evaluated offline and overlook two capabilities needed in deployment: long-horizon streaming inference and active perception when the current view is insufficient. Method: The S3-Bench benchmark offers timestamp-grounded streaming spatial QA across two domains (simulation and real-world video); the AMF-VLM model compresses long-horizon observations through Memory Folding and generates actions to acquire missing information through Active Exploration. Result: AMF-VLM improves by 8.8% and 13.3% on the simulated and real splits of S3-Eval, respectively, while retaining good transfer to standard spatial benchmarks. Conclusion: S3-Bench fills a gap in evaluating streaming spatial understanding, and AMF-VLM shows that memory compression and active exploration are effective for embodied spatial reasoning, advancing spatial VLMs toward real deployment. Abstract: Spatial understanding is fundamental for embodied agents, yet most spatial VLMs and benchmarks remain offline, evaluating post-hoc QA over pre-recorded inputs and overlooking two deployment-critical requirements: long-horizon streaming inference and active perception when the current view is insufficient. To address this gap, we introduce S3-Bench, a benchmark suite for streaming spatial question answering with active exploration, where queries are temporally grounded to specific timestamps and must be answered using only observations available up to that moment. S3-Bench adopts a dual-domain design, combining a scalable simulator with controllable trajectories and exploration actions, and real-world streaming videos that capture practical sensing artifacts for rigorous generalization evaluation. Overall, it spans 10K+ scenes and 26K+ trajectories, with dedicated training (S3-Train) and evaluation (S3-Eval) splits. We further propose AMF-VLM, which supports streaming spatial reasoning under bounded computing via (i) memory folding, which compresses long-horizon observations into compact structured memory, and (ii) active exploration, which outputs explicit actions (e.g., move/rotate/scan) to acquire missing evidence before answering. Extensive experiments demonstrate that, compared to models using identical training data, our approach yields improvements of 8.8% and 13.3% on the simulated and real splits of S3-Eval, respectively, while maintaining competitive transferability to standard spatial benchmarks.

[100] MLE-UVAD: Minimal Latent Entropy Autoencoder for Fully Unsupervised Video Anomaly Detection

Yuang Geng,Junkai Zhou,Kang Yang,Pan He,Zhuoyang Zhou,Jose C. Principe,Joel Harley,Ivan Ruchkin

Main category: cs.CV

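TL;DR: This paper proposes an entropy-guided autoencoder for single-scene, fully unsupervised video anomaly detection, combining a reconstruction loss with a Minimal Latent Entropy (MLE) loss to improve detection of anomalous frames.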

Details

Motivation: Existing methods either require extensive labels (fully or weakly supervised) or use normal-only videos (one-class classification) and are vulnerable to distribution shift and contamination, whereas real deployments often have only raw unlabeled video. Method: An entropy-guided autoencoder combines the standard reconstruction loss with a novel Minimal Latent Entropy (MLE) loss: the reconstruction loss leads normal and abnormal frames to form distinct latent clusters, while the MLE loss minimizes the entropy of latent embeddings, pulling sparse anomalous embeddings toward the dense normal cluster and thereby degrading the reconstruction of anomalous frames. Result: On two widely used benchmarks and a self-collected driving dataset, the method outperforms the baselines and shows strong robustness. Conclusion: The dual-loss design widens the reconstruction-error gap between normal and anomalous frames, making fully unsupervised, single-scene video anomaly detection more reliable and practical. Abstract: In this paper, we address the challenging problem of single-scene, fully unsupervised video anomaly detection (VAD), where raw videos containing both normal and abnormal events are used directly for training and testing without any labels. This differs sharply from prior work that either requires extensive labeling (fully or weakly supervised) or depends on normal-only videos (one-class classification), which are vulnerable to distribution shifts and contamination. We propose an entropy-guided autoencoder that detects anomalies through reconstruction error by reconstructing normal frames well while making anomalies reconstruct poorly. The key idea is to combine the standard reconstruction loss with a novel Minimal Latent Entropy (MLE) loss in the autoencoder. Reconstruction loss alone maps normal and abnormal inputs to distinct latent clusters due to their inherent differences, but also risks reconstructing anomalies too well to detect. Therefore, MLE loss addresses this by minimizing the entropy of latent embeddings, encouraging them to concentrate around high-density regions. Since normal frames dominate the raw video, sparse anomalous embeddings are pulled into the normal cluster, so the decoder emphasizes normal patterns and produces poor reconstructions for anomalies. This dual-loss design produces a clear reconstruction gap that enables effective anomaly detection. Extensive experiments on two widely used benchmarks and a challenging self-collected driving dataset demonstrate that our method achieves robust and superior performance over baselines.
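
The abstract does not state how the entropy of the latent embeddings is estimated. As an illustration only, the sketch below uses a Parzen-window estimate of Renyi's quadratic entropy, a standard choice in information-theoretic learning; treating it as the paper's MLE loss is an assumption, and the function name, `sigma`, and the toy embeddings are hypothetical.

```python
import math


def renyi_quadratic_entropy(z, sigma=1.0):
    """Parzen-window estimate of Renyi's quadratic entropy of embeddings z.

    H2 = -log( (1/N^2) * sum_i sum_j G(z_i - z_j; 2*sigma^2) ),
    where G is an isotropic Gaussian kernel. Lower values mean the
    embeddings are more concentrated.
    """
    n, d = len(z), len(z[0])
    norm = (4.0 * math.pi * sigma ** 2) ** (-d / 2.0)
    total = 0.0
    for a in z:
        for b in z:
            sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
            total += norm * math.exp(-sq_dist / (4.0 * sigma ** 2))
    return -math.log(total / (n * n))


# Toy 2-D latent embeddings: one tight cluster and one spread-out set.
tight = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
spread = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]

h_tight = renyi_quadratic_entropy(tight)
h_spread = renyi_quadratic_entropy(spread)
```

Minimizing a quantity like this alongside the reconstruction loss rewards embeddings that collapse toward dense regions, which matches the abstract's description of anomalous embeddings being pulled into the normal cluster.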

[101] EnvSocial-Diff: A Diffusion-Based Crowd Simulation Model with Environmental Conditioning and Individual-Group Interaction

Bingxue Zhao,Qi Zhang,Hui Huang

Main category: cs.CV

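TL;DR: This paper proposes EnvSocial-Diff, a diffusion-based crowd simulation method that combines social-physics principles, explicit environmental conditioning (obstacles, objects of interest, lighting), and graph-based individual-group interaction modeling to produce more realistic and plausible pedestrian trajectories.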

Details

Motivation: Existing pedestrian trajectory models over-emphasize social dynamics and neglect the influence of environmental context on behavior, which makes simulations less realistic. Method: A diffusion model, EnvSocial-Diff, with two core modules: (1) a structured environmental conditioning module that explicitly encodes scene information such as obstacles, objects of interest, and lighting; (2) an individual-group interaction module that uses a graph structure to model fine-grained interpersonal relations and group conformity. Result: The model outperforms the latest state-of-the-art methods on multiple benchmark datasets, supporting the value of explicit environmental modeling and multi-level social modeling. Conclusion: Jointly modeling environmental conditions and individual-group interaction is important for realistic crowd simulation, and diffusion models are an effective paradigm for the coupled social-environmental relationships involved. Abstract: Modeling realistic pedestrian trajectories requires accounting for both social interactions and environmental context, yet most existing approaches largely emphasize social dynamics. We propose EnvSocial-Diff, a diffusion-based crowd simulation model informed by social physics and augmented with environmental conditioning and individual-group interaction. Our structured environmental conditioning module explicitly encodes obstacles, objects of interest, and lighting levels, providing interpretable signals that capture scene constraints and attractors. In parallel, the individual-group interaction module goes beyond individual-level modeling by capturing both fine-grained interpersonal relations and group-level conformity through a graph-based design. Experiments on multiple benchmark datasets demonstrate that EnvSocial-Diff outperforms the latest state-of-the-art methods, underscoring the importance of explicit environmental conditioning and multi-level social interaction for realistic crowd simulation. Code is available at https://github.com/zqyq/EnvSocial-Diff.

[102] BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Risa Shinoda,Kaede Shiohara,Nakamasa Inoue,Kuniaki Saito,Hiroaki Santo,Fumio Okura

Main category: cs.CV

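TL;DR: This paper proposes BioVITA, a framework that aligns the visual, textual, and acoustic modalities for biodiversity research. It builds a large-scale multimodal dataset and a cross-modal retrieval benchmark, and advances species identification and the understanding of ecological traits.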

Details

Motivation: Existing models such as BioCLIP align only images and text, and integrating the audio modality remains an open problem, while animal species identification needs multimodal information to improve ecological understanding. Method: The BioVITA framework comprises (i) a training dataset of 1.3 million audio clips and 2.3 million images covering 14,133 species with 34 ecological trait labels; (ii) a two-stage training strategy built on BioCLIP2 that aligns audio, visual, and textual representations; (iii) a cross-modal retrieval benchmark covering six retrieval directions (e.g., image-to-audio, audio-to-text) at three taxonomic levels (Family, Genus, Species). Result: Experiments show that the model learns a unified representation space that supports accurate cross-modal retrieval and captures species-level semantics beyond taxonomy, improving multimodal biodiversity understanding. Conclusion: BioVITA is presented as the first biological model supporting visual-textual-acoustic alignment, offering a foundation for ecological monitoring, automatic species identification, and cross-modal mining of biological knowledge. Abstract: Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: https://dahlian00.github.io/BioVITA_Page/
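To make the retrieval benchmark concrete, the following is a minimal sketch of how a Recall@k score for one retrieval direction (say, audio-to-image) could be computed from embeddings in a shared space. The function names, the use of cosine similarity, and the choice of Recall@k are assumptions for illustration; the abstract does not state which metric or similarity BioVITA uses. Labels can be given at any taxonomic level (Family, Genus, or Species), so the same routine covers all three.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(queries, gallery, query_labels, gallery_labels, k):
    """Fraction of queries whose top-k gallery items contain a matching label.

    queries / gallery: lists of embedding vectors from two modalities
    (hypothetical inputs; any encoder producing a shared space would do).
    """
    hits = 0
    for q, q_label in zip(queries, query_labels):
        # Rank gallery indices by similarity to the query, best first.
        ranked = sorted(
            range(len(gallery)),
            key=lambda i: cosine(q, gallery[i]),
            reverse=True,
        )
        top_labels = [gallery_labels[i] for i in ranked[:k]]
        hits += q_label in top_labels
    return hits / len(queries)
```

Swapping the roles of `queries` and `gallery` gives the reverse direction, which is how six directions arise from three modalities.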

[103] Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

Gengluo Li,Chengquan Zhang,Yupu Liang,Huawen Shen,Yaping Zhang,Pengyuan Lyu,Weinong Wang,Xingyu Wan,Gangyan Zeng,Han Hu,Can Ma,Yu Zhou

Main category: cs.CV

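TL;DR: This paper proposes a data-training co-design framework that uses a realistic scene synthesis strategy and a document-aware training recipe to improve the robustness and structural consistency of end-to-end document parsing, and builds the Wild-OmniDocBench benchmark for evaluation.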

Details

Motivation: Existing end-to-end document parsing methods are limited by the scarcity of high-quality full-page annotated data and the lack of structure-aware training strategies, which leads to repetitive, hallucinated, and structurally inconsistent predictions. Method: A data-training co-design framework: (1) a Realistic Scene Synthesis strategy that generates large-scale, structurally diverse full-page end-to-end supervision; (2) a Document-Aware Training Recipe with progressive learning and structure-token optimization; (3) Wild-OmniDocBench, a benchmark of real-world captured documents. Result: Integrated into a 1B-parameter MLLM, the method achieves higher accuracy and robustness on scanned/digital documents and on real-world captured scenarios. Conclusion: The framework mitigates data scarcity and weak structure modeling and improves end-to-end document parsing; all models, data synthesis pipelines, and benchmarks will be released. Abstract: Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions, primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.

[104] FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinking for Large-Scale LoD 3D Gaussian Splatting

Yixian Wang,Haolin Yu,Jiadong Tang,Yu Gao,Xihan Wang,Yufeng Yue,Yi Yang

Main category: cs.CV

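TL;DR: FilterGS proposes a parallel filtering mechanism and a scene-adaptive Gaussian shrinking strategy to address the low rendering efficiency of 3D Gaussian Splatting in large scenes, which is caused by serial traversal and redundant Gaussian-tile pairs. It substantially improves rendering speed while maintaining visual quality.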

Details

Motivation: In large scenes, 3D Gaussian Splatting suffers from inefficient serial traversal (over 60% of rendering time) and redundant Gaussian-tile pairs, which limits the scalability of Level-of-Detail methods. Method: FilterGS: (1) two complementary filters based on a parallel filtering mechanism that avoid tree traversal; (2) a new GTC metric that quantifies the redundancy of Gaussian-tile pairs; (3) a GTC-based scene-adaptive Gaussian shrinking strategy that reduces redundancy. Result: State-of-the-art rendering speed on multiple large-scale datasets with competitive visual quality. Conclusion: FilterGS addresses key performance bottlenecks of 3D Gaussian Splatting in large-scene rendering and offers a feasible path toward large-scale real-time neural rendering. Abstract: 3D Gaussian Splatting has revolutionized neural rendering with real-time performance. However, scaling this approach to large scenes using Level-of-Detail methods faces critical challenges: inefficient serial traversal consuming over 60% of rendering time, and redundant Gaussian-tile pairs that incur unnecessary processing overhead. To address these limitations, we introduce FilterGS, featuring a parallel filtering mechanism with two complementary filters that select Gaussian elements efficiently without tree traversal. Additionally, we propose a novel GTC metric that quantifies the redundancy of Gaussian-tile key-value pairs. Based on this metric, we introduce a scene-adaptive Gaussian shrinking strategy that effectively reduces redundant pairs. Extensive experiments demonstrate that FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets. Project page: https://github.com/xenon-w/FilterGS

[105] MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation

Gengluo Li,Chengquan Zhang,Yupu Liang,Huawen Shen,Yaping Zhang,Pengyuan Lyu,Weinong Wang,Xingyu Wan,Gangyan Zeng,Han Hu,Can Ma,Yu Zhou

Main category: cs.CV

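TL;DR: This paper presents MMTIT-Bench, a multilingual, multi-scenario benchmark for text-image machine translation (TIMT), and CPR-Trans, a data paradigm that integrates scene cognition, text perception, and translation reasoning. Together they improve the end-to-end translation performance and interpretability of VLLMs on low-resource languages and complex visual scenes.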

Details

Motivation: For end-to-end text-image machine translation (TIMT), the robustness of vision-language large models (VLLMs) across diverse visual scenes and low-resource languages is under-evaluated, and there is no training paradigm that effectively combines visual cognition with language reasoning. Method: The authors build MMTIT-Bench, a human-verified multilingual, multi-scenario benchmark (14 non-English, non-Chinese languages; 1,400 images), and propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that uses a VLLM-driven data generation pipeline to unify scene cognition, text perception, and translation reasoning in one interpretable reasoning chain. Result: Experiments on 3B and 7B VLLMs show consistent gains in translation accuracy and reasoning interpretability from CPR-Trans; MMTIT-Bench fills the gap in multilingual, multi-scenario TIMT evaluation. Conclusion: CPR-Trans gives TIMT a reasoning paradigm better suited to VLLMs, and MMTIT-Bench is intended as a resource for multilingual, multi-scenario end-to-end cross-modal translation research. Abstract: End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench upon acceptance to promote multilingual and multi-scenario TIMT research.

[106] Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval

Junkai Yang,Qirui Wang,Yaoqing Jin,Shuai Ma,Minghan Xu,Shanmin Pang

Main category: cs.CV

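TL;DR: KDC-Net is a knowledge-refined dual context-aware network that uses hierarchical semantic aggregation, dynamic temporal attention, and a temporal-continuity-aware CLIP distillation strategy to improve retrieval of partially relevant segments in untrimmed videos.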

Details

Motivation: To address the mismatch in information density between text and video segments, and attention mechanisms that overlook semantic focus and event correlations. Method: KDC-Net: on the text side, a Hierarchical Semantic Aggregation module fuses multi-scale phrase cues; on the video side, a Dynamic Temporal Attention mechanism uses relative positional encoding and adaptive temporal windows; in addition, a dynamic CLIP distillation strategy with temporal-continuity-aware refinement is introduced. Result: KDC-Net outperforms existing methods on PRVR benchmarks, especially under low moment-to-video ratios. Conclusion: By jointly modeling semantic and temporal structure from both the textual and visual sides, KDC-Net mitigates the key challenges of partially relevant video retrieval. Abstract: Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.

[107] Latent Bias Alignment for High-Fidelity Diffusion Inversion in Real-World Image Reconstruction and Manipulation

Weiming Chen,Qifan Liu,Siyi Liu,Yushun Tang,Yijia Wang,Zhihan Zhu,Zhihai He

Main category: cs.CV

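TL;DR: This paper proposes a new diffusion inversion method for diffusion models that introduces Latent Bias Optimization (LBO) and Image Latent Boosting (ILB) to improve reconstruction quality and robustness.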

Details

Motivation: Existing diffusion inversion methods fall short in reconstruction quality and robustness, facing two main challenges: misalignment between the inversion and generation trajectories, and a mismatch with VQ autoencoder reconstruction. Method: Latent Bias Optimization (LBO) introduces a learnable latent bias vector at each inversion step to align the trajectories; Image Latent Boosting (ILB) approximately optimizes diffusion inversion and VQAE reconstruction jointly by adjusting the image latent representation that serves as the interface between them. Result: Experiments show improved image reconstruction quality for the diffusion model and better performance on downstream tasks such as image editing and rare concept generation. Conclusion: LBO and ILB together mitigate the key mismatches in diffusion inversion, providing a more reliable basis for connecting diffusion models with real-world images. Abstract: Recent research has shown that text-to-image diffusion models are capable of generating high-quality images guided by text prompts. But can they be used to generate or approximate real-world images from the seed noise? This is known as the diffusion inversion problem, which serves as a fundamental building block for bridging diffusion models and real-world scenarios. However, existing diffusion inversion methods often suffer from low reconstruction quality or weak robustness. Two major challenges need to be carefully addressed: (1) the misalignment between the inversion and generation trajectories during the diffusion process, and (2) the mismatch between the diffusion inversion process and the VQ autoencoder (VQAE) reconstruction. To address these challenges, we introduce a latent bias vector at each inversion step, which is learned to reduce the misalignment between inversion and generation trajectories. We refer to this strategy as Latent Bias Optimization (LBO). Furthermore, we perform an approximate joint optimization of the diffusion inversion and VQAE reconstruction processes by learning to adjust the image latent representation, which serves as the connecting interface between them. We refer to this technique as Image Latent Boosting (ILB). Extensive experimental results demonstrate that the proposed method significantly improves the image reconstruction quality of the diffusion model, as well as the performance of downstream tasks, including image editing and rare concept generation.

[108] GenMask: Adapting DiT for Segmentation via Direct Mask

Yuhuan Yang,Xianwei Zhuang,Yuxuan Cai,Chaofan Ma,Shuai Bai,Jiangchao Yao,Ya Zhang,Junyang Lin,Yanfeng Wang

Main category: cs.CV

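TL;DR: This paper proposes GenMask, a model trained for segmentation directly in a generative manner. A timestep sampling strategy handles the difference between the latent representations of binary masks and natural images, removing the need for feature extraction pipelines and reaching state-of-the-art results on several segmentation benchmarks.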

Details

Motivation: Existing segmentation methods use pretrained generative models as feature extractors, which causes representation misalignment and complicated pipelines; segmentation should instead be trained directly in a generative manner. Method: A timestep sampling strategy that emphasizes high noise levels for binary masks and moderate noise for image generation, enabling joint generative training of masks and RGB images under the DiT architecture; the resulting model is GenMask. Result: GenMask reaches state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component. Conclusion: Direct generative segmentation is feasible and effective; modeling the noise distribution bridges the gap between the latent spaces of binary masks and natural images; GenMask simplifies the workflow and improves performance. Abstract: Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that instead of indirect adaptation, segmentation tasks should be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise robust, and linearly separable, distinct from natural image latents. To bridge this gap, we introduce a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training. We present GenMask, a DiT trained to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need for feature extraction pipelines tailored to segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.
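As a rough illustration of task-dependent timestep sampling, the sketch below draws a noise level t in [0, 1] (with 1 taken to mean pure noise) from a different distribution for each task. Everything specific here is an assumption, not GenMask's actual schedule: the power-law skew toward t = 1 for masks, the logit-normal distribution centered at 0.5 for images, the `sharpness` parameter, and the reading of "extreme noise" as the high-noise end.

```python
import math
import random


def sample_timestep(kind, rng, sharpness=4.0):
    """Draw a noise level t in [0, 1]; larger t means more noise.

    kind: "mask" skews samples toward t = 1 (heavy noise);
          "image" concentrates samples around t = 0.5 (moderate noise).
    Both distributions are illustrative choices, not the paper's.
    """
    if kind == "mask":
        # u ** (1/s) with s > 1 pushes uniform samples toward 1.
        return rng.random() ** (1.0 / sharpness)
    if kind == "image":
        # Logit-normal: sigmoid of a standard normal draw.
        return 1.0 / (1.0 + math.exp(-rng.gauss(0.0, 1.0)))
    raise ValueError(f"unknown kind: {kind!r}")


rng = random.Random(0)  # seeded generator for reproducibility
mask_ts = [sample_timestep("mask", rng) for _ in range(2000)]
image_ts = [sample_timestep("image", rng) for _ in range(2000)]
print(sum(mask_ts) / len(mask_ts), sum(image_ts) / len(image_ts))
```

In a joint training loop, each batch element would pick its distribution according to whether its target is a mask or an RGB image, while the denoising objective itself stays unchanged.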

[109] Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

Fatih Ilhan,Gaowen Liu,Ramana Rao Kompella,Selim Furkan Tekin,Tiansheng Huang,Zachary Yahn,Yichang Xu,Ling Liu

Main category: cs.CV

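TL;DR: This paper proposes AttentionPack, an adaptive, attention-aware decoding optimization framework for large vision-language models (VLMs). Using multi-head attention compaction and token-specific attention-aware decompression, it improves memory efficiency by up to 8x without sacrificing output quality, enabling larger batches, longer contexts, and faster inference.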

Details

Motivation: Large vision-language models (VLMs) perform well in multimodal reasoning, but memory overhead from long sequences of visual and text tokens during decoding creates an inference bottleneck, especially in long-context tasks with multiple high-resolution images or videos. Method: The AttentionPack framework: (i) a multi-head attention compaction method that exploits the implicit low-rank structure of the attention matrices to store key and value matrices economically; (ii) a token-specific attention-aware decompression mechanism that reduces latency overhead; both can be combined with cache eviction, quantization, and kernel fusion. Result: Across multiple benchmarks, AttentionPack improves memory efficiency by up to 8x, supporting larger batch sizes and faster batch inference while preserving output quality, or longer contexts for better retrieval performance; combining it with eviction, quantization, and kernel fusion yields further gains in resource-limited environments. Conclusion: AttentionPack is an efficient, scalable, and compatible decoding optimization that relieves the memory bottleneck of VLMs in long-context multimodal tasks and offers a practical route to deployment. Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference-time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored for large vision-language models that improves memory efficiency during decoding, focusing on the challenges posed by the increased number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) we introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x, enabling higher batch sizes and faster batch inference while preserving the model output quality, or longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack combined with eviction, quantization, and kernel fusion, showing further efficiency gains for resource-limited environments.

[110] DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning

Jiajian Huang,Dongliang Zhu,Zitong YU,Hui Ma,Jiayu Zhang,Chunmei Zhu,Xiaochun Cao

Main category: cs.CV

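TL;DR: This paper proposes a new framework for multimodal deception detection. It builds interpretable reasoning datasets, releases the large-scale multicultural T4-Deception dataset, and designs two new modules, SICS and DMC, improving performance and interpretability in small-data and cross-domain settings.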

Details

Motivation: Existing deception detection methods lack verifiable intermediate reasoning, and datasets are small, narrow in scenario coverage, and culturally limited, which encourages shortcut learning and falls short of the interpretability and generalization required by high-stakes forensic and security applications. Method: (1) Auditable datasets with structured cue-level descriptions and reasoning chains; (2) T4-Deception, a four-country multicultural dataset (1,695 samples) based on the "To Tell The Truth" television format; (3) the SICS module (stabilized individuality-commonality synergy with polarity-aware recalibration) and the DMC module (distillation-based modality consistency alignment) for robust small-data learning. Result: On three benchmarks and the new T4-Deception dataset, the method reaches state-of-the-art performance in both in-domain and cross-domain settings and transfers better across cultures; the model can also output auditable reasoning reports. Conclusion: By combining interpretable data construction, large-scale multicultural collection, and the two cooperating modules, the work improves the trustworthiness, robustness, and generalization of multimodal deception detection for high-stakes applications. Abstract: Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-stakes settings, investigators need verifiable evidence connecting audiovisual cues to final decisions, along with reliable generalization across domains and cultural contexts. However, existing benchmarks provide only binary labels without intermediate reasoning cues. Datasets are also small with limited scenario coverage, leading to shortcut learning. We address these issues through three contributions. First, we construct reasoning datasets by augmenting existing benchmarks with structured cue-level descriptions and reasoning chains, enabling models to output auditable reports. Second, we release T4-Deception, a multicultural dataset based on the unified "To Tell The Truth" television format implemented across four countries. With 1,695 samples, it is the largest non-laboratory deception detection dataset. Third, we propose two modules for robust learning under small-data conditions. Stabilized Individuality-Commonality Synergy (SICS) refines multimodal representations by synergizing learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment that bi-directionally recalibrates representations. Distilled Modality Consistency (DMC) aligns modality-specific predictions with the fused multimodal predictions via knowledge distillation to prevent unimodal shortcut learning. Experiments on three established benchmarks and our novel dataset demonstrate that our method achieves state-of-the-art performance in both in-domain and cross-domain scenarios, while exhibiting superior transferability across diverse cultural contexts. The datasets and codes will be released.
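The idea behind DMC, aligning each single-modality prediction with the fused prediction through knowledge distillation, can be sketched as a temperature-softened KL divergence averaged over modalities. This is the generic distillation recipe, assumed here for illustration; the function names, the temperature value, and the direction of the KL term are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(fused_logits, unimodal_logits_list, temperature=2.0):
    """Average KL from the fused (teacher) prediction to each unimodal
    (student) prediction. Hypothetical form of a modality-consistency term."""
    teacher = softmax(fused_logits, temperature)
    losses = [
        kl_divergence(teacher, softmax(student, temperature))
        for student in unimodal_logits_list
    ]
    return sum(losses) / len(losses)


# Two classes (truthful, deceptive); audio agrees with the fusion, video does not.
fused = [0.2, 1.5]
audio = [0.1, 1.4]
video = [2.0, -1.0]
print(consistency_loss(fused, [audio, video]))
```

A branch that latches onto a shortcut cue and disagrees with the fused prediction contributes a large KL term, which is the pressure the abstract describes as preventing unimodal shortcut learning.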

[111] Uncertainty-Aware Vision-based Risk Object Identification via Conformal Risk Tube Prediction

Kai-Yu Fu,Yi-Ting Chen

Main category: cs.CV

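TL;DR: This paper proposes Conformal Risk Tube Prediction (CRTP), a new method for vision-based risk object identification (Vision-ROI) in intelligent driving that models spatiotemporal risk uncertainty, provides coverage guarantees for true risks, and outputs calibrated risk scores with uncertainty estimates.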

Details

Motivation: Existing Vision-ROI methods make deterministic decisions and ignore uncertainty, which in ambiguous or complex multi-risk scenes can lead to false alarms, missed detections, or temporally unstable predictions; they lack a framework for modeling risk uncertainty jointly across space and time. Method: Conformal Risk Tube Prediction (CRTP), a unified framework that uses conformal prediction theory to model spatiotemporal risk uncertainty, together with a new dataset and metrics for systematically analyzing how scenario variations, per-category risk differences, and perception error propagation affect uncertainty estimation. Result: On the new benchmark, the method substantially outperforms prior approaches, improving Vision-ROI robustness and reducing nuisance braking alerts. Conclusion: CRTP provides an uncertainty-aware solution with statistical guarantees for risk identification in intelligent driving, moving safety-critical visual perception toward reliable, interpretable, and verifiable behavior. Abstract: We study object importance-based vision risk object identification (Vision-ROI), a key capability for hazard detection in intelligent driving systems. Existing approaches make deterministic decisions and ignore uncertainty, which could lead to safety-critical failures. Specifically, in ambiguous scenarios, fixed decision thresholds may cause premature or delayed risk detection and temporally unstable predictions, especially in complex scenes with multiple interacting risks. Despite these challenges, current methods lack a principled framework to model risk uncertainty jointly across space and time. We propose Conformal Risk Tube Prediction, a unified formulation that captures spatiotemporal risk uncertainty, provides coverage guarantees for true risks, and produces calibrated risk scores with uncertainty estimates. To conduct a systematic evaluation, we present a new dataset and metrics probing diverse scenario configurations with multi-risk coupling effects, which are not supported by existing datasets. We systematically analyze factors affecting uncertainty estimation, including scenario variations, per-risk category behavior, and perception error propagation. Our method delivers substantial improvements over prior approaches, enhancing Vision-ROI robustness and downstream performance, such as reducing nuisance braking alerts. For more qualitative results, please visit our project webpage: https://hcis-lab.github.io/CRTP/
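For readers unfamiliar with where conformal coverage guarantees come from, here is a minimal sketch of standard split conformal prediction applied to a per-object risk classifier. It does not implement the paper's risk tube, which extends the idea across space and time. The nonconformity score (one minus the probability assigned to a label) and all names are assumptions for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of held-out nonconformity scores.

    With n calibration scores, taking the ceil((n + 1) * (1 - alpha))-th
    smallest score gives coverage of at least 1 - alpha on exchangeable
    test data. Returns infinity when n is too small for the requested alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(calibration_scores)[rank - 1]


def prediction_set(class_probs, threshold):
    """Labels whose nonconformity score (1 - probability) is within the threshold."""
    return [label for label, p in class_probs.items() if 1.0 - p <= threshold]


# Calibration scores: 1 - p(true label) on held-out, labeled objects.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q_hat = conformal_threshold(calibration, alpha=0.5)
print(prediction_set({"risk": 0.7, "no_risk": 0.3}, q_hat))
```

When the set contains both labels, the model is signaling that it cannot separate them at the requested coverage level. A downstream planner can treat that as an explicit uncertainty cue instead of relying on one fixed decision threshold.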

[112] DepthArb: Training-Free Depth-Arbitrated Generation for Occlusion-Robust Image Synthesis

Hongjin Niu,Jiahao Wang,Xirui Hu,Weizhan Zhang,Lan Ma,Yuan Gao

Main category: cs.CV

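TL;DR: This paper proposes DepthArb, a training-free framework for text-to-image diffusion models that uses Attention Arbitration Modulation (AAM) and Spatial Compactness Control (SCC) to fix inaccurate modeling of occlusion between multiple objects, and builds the OcclBench benchmark for systematic evaluation.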

Details

Motivation: Existing text-to-image diffusion models struggle to model occlusion relationships between multiple objects accurately, especially in dense overlapping regions, and current training-free layout-guided methods rely on rigid spatial priors that ignore depth order, leading to concept mixing or illogical occlusion. Method: The DepthArb framework has two core mechanisms: Attention Arbitration Modulation (AAM), which suppresses background activations in overlapping regions according to depth order to enforce ordered visibility, and Spatial Compactness Control (SCC), which curbs attention divergence to preserve structural integrity. The method requires no fine-tuning or retraining. Result: On the authors' OcclBench occlusion benchmark, DepthArb outperforms state-of-the-art methods in both occlusion accuracy and visual fidelity; as a plug-and-play module, it improves the compositional generation of various diffusion backbones. Conclusion: DepthArb offers a new perspective on spatial layering in generative models, showing that modulating attention in the forward pass alone can resolve complex occlusion without additional training. Abstract: Text-to-image diffusion models frequently exhibit deficiencies in synthesizing accurate occlusion relationships of multiple objects, particularly within dense overlapping regions. Existing training-free layout-guided methods predominantly rely on rigid spatial priors that remain agnostic to depth order, often resulting in concept mixing or illogical occlusion. To address these limitations, we propose DepthArb, a training-free framework that resolves occlusion ambiguities by arbitrating attention competition between interacting objects. Specifically, DepthArb employs two core mechanisms: Attention Arbitration Modulation (AAM), which enforces depth-ordered visibility by suppressing background activations in overlapping regions, and Spatial Compactness Control (SCC), which preserves structural integrity by curbing attention divergence. These mechanisms enable robust occlusion generation without model retraining. To systematically evaluate this capability, we propose OcclBench, a comprehensive benchmark designed to evaluate diverse occlusion scenarios. Extensive evaluations demonstrate that DepthArb consistently outperforms state-of-the-art baselines in both occlusion accuracy and visual fidelity. As a plug-and-play method, DepthArb seamlessly enhances the compositional capabilities of diffusion backbones, offering a novel perspective on spatial layering within generative models.

[113] DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models

Hongyi Miao,Jun Jia,Xincheng Wang,Qianli Ma,Wei Sun,Wangqiu Zhou,Dandan Zhu,Yewen Cao,Zhi Liu,Guangtao Zhai

Main category: cs.CV

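TL;DR: This paper proposes a new privacy threat model, identity-affiliation learning: an attacker fine-tunes a vision-language model (VLM) on only a few private photos of a target individual, so that the model implicitly learns associations between the target's facial identity and their private property and social relationships, and leaks them once deployed behind a public API. The authors build the first identity-affiliation dataset and propose DP2-VL, the first data protection framework for private photos, which uses data poisoning to induce a dataset-level shift in the encoder embedding space and prevent identity-affiliation leakage.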

Details

Motivation: Progress in fine-grained image understanding by vision-language models (VLMs) brings new privacy risks: an attacker can fine-tune a VLM on a few private photos so that it implicitly learns and leaks the target's identity-affiliation information. This threat needs systematic modeling and a defense. Method: The paper defines the identity-affiliation learning threat model; builds the first identity-affiliation dataset covering seven typical scenarios; and designs the DP2-VL data protection framework, which optimizes imperceptible perturbations that push the original image representations toward an antithetical region, inducing a dataset-level shift in the embedding space of the VLM encoders so that fine-tuning on the protected data overfits. Result: Mainstream VLMs (e.g., LLaVA, Qwen-VL, MiniGPT-v2) fine-tuned on small sets of real or synthetic private photos can recognize identities and infer identity-affiliation relationships; DP2-VL remains effective across models, under various post-processing operations, and at different protection ratios. Conclusion: Identity-affiliation learning is a practical and dangerous new privacy threat; DP2-VL, as the first dataset-level protection method for private photos, can block it and offers a new paradigm for VLM privacy. Abstract: Recent advances in visual-language alignment have endowed vision-language models (VLMs) with fine-grained image understanding capabilities. However, this progress also introduces new privacy risks. This paper first proposes a novel privacy threat model named identity-affiliation learning: an attacker fine-tunes a VLM using only a few private photos of a target individual, thereby embedding associations between the target's facial identity and their private property and social relationships into the model's internal representations. Once deployed via public APIs, this model enables unauthorized exposure of the target user's private information upon input of their photos. To benchmark VLMs' susceptibility to such identity-affiliation leakage, we introduce the first identity-affiliation dataset comprising seven typical scenarios appearing in private photos. Each scenario is instantiated with multiple identity-centered photo-description pairs. Experimental results demonstrate that mainstream VLMs like LLaVA, Qwen-VL, and MiniGPT-v2 can recognize facial identities and infer identity-affiliation relationships after fine-tuning on small-scale private photo datasets, and even on synthetically generated datasets. To mitigate this privacy risk, we propose DP2-VL, the first Dataset Protection framework for private photos that leverages Data Poisoning. By optimizing imperceptible perturbations that push the original representations toward an antithetical region, DP2-VL induces a dataset-level shift in the embedding space of VLMs' encoders. This shift separates protected images from clean inference images, causing fine-tuning on the protected set to overfit. Extensive experiments demonstrate that DP2-VL achieves strong generalization across models, robustness to diverse post-processing operations, and consistent effectiveness across varying protection ratios.

[114] Revealing Multi-View Hallucination in Large Vision-Language Models

Wooje Park,Insu Lee,Soohyun Kim,Jaeyun Jang,Minyoung Noh,Kyuhong Shim,Byonghyo Shim

Main category: cs.CV

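TL;DR: This paper proposes Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that mitigates cross-instance and cross-view hallucination in large vision-language models (LVLMs) processing multi-view images, and validates it on the newly built MVH-Bench benchmark.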

Details

Motivation: Current LVLMs are prone to multi-view hallucination when processing multi-view images, that is, confusing visual information from different instances or viewpoints, yet systematic analysis and effective mitigation are lacking. Method: The authors build MVH-Bench, a benchmark of 4.8k question-answer pairs that defines and separates cross-instance and cross-view hallucination, and propose the training-free Reference Shift Contrastive Decoding (RSCD) strategy, which generates negative logits through attention masking to suppress visual interference. Result: On MVH-Bench, RSCD improves Qwen2.5-VL and LLaVA-OneVision by up to 21.1 and 34.6 points, respectively, over existing hallucination mitigation methods. Conclusion: Multi-view hallucination is a key bottleneck for deploying LVLMs; RSCD is a lightweight, general, training-free decoding mechanism that strengthens fine-grained visual-language alignment on multi-view inputs. Abstract: Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods, highlighting the effectiveness of our approach.
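The contrastive decoding step can be illustrated with the combination rule commonly used in this family of methods: amplify the regular logits and subtract logits from a "negative" forward pass. The sketch below assumes the negative logits are already available (in RSCD they come from attention masking, which is not reproduced here). The formula, the `alpha` weight, and the function names are assumptions, not the paper's exact definition.

```python
def contrastive_logits(positive_logits, negative_logits, alpha=1.0):
    """Combine two sets of next-token logits contrastively.

    adjusted = (1 + alpha) * positive - alpha * negative
    Tokens that score highly mainly in the negative (interference-driven)
    pass are pushed down; alpha = 0 recovers ordinary decoding.
    """
    return [
        (1.0 + alpha) * p - alpha * n
        for p, n in zip(positive_logits, negative_logits)
    ]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary of three tokens. Token 1 is slightly preferred in the normal
# pass but even more so in the negative pass, suggesting it is driven by the
# wrong view or instance.
positive = [2.0, 2.1, 0.5]
negative = [0.5, 2.5, 0.4]
adjusted = contrastive_logits(positive, negative, alpha=1.0)
print(greedy_token(positive), greedy_token(adjusted))
```

Because the adjustment happens purely at decoding time, it needs no gradient updates, which is what makes such methods training-free.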

[115] High-Fidelity Face Content Recovery via Tamper-Resilient Versatile Watermarking

Peipeng Yu,Jinfeng Xie,Chengfu Ou,Xiaoyu Zhou,Jianwei Fei,Yunshu Dai,Zhihua Xia,Chip Hong Chang

Main category: cs.CV

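TL;DR: VeriFi is a versatile watermarking framework for countering AIGC-driven face manipulation and deepfakes, unifying copyright protection, pixel-level manipulation localization, and high-fidelity face content recovery.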

Details

Motivation: Existing watermarking methods trade off localization accuracy against visual quality and lack content recovery, so they fall short of deepfake forensics needs. Method: The VeriFi framework embeds a compact semantic latent watermark as a content-preserving prior; achieves fine-grained localization by correlating image features with decoded provenance signals; and introduces an AIGC attack simulator that combines latent-space mixing with seamless blending to improve robustness. Result: On CelebA-HQ and FFHQ, VeriFi outperforms strong baselines in watermark robustness, localization accuracy, and recovery quality. Conclusion: VeriFi provides a practical, verifiable defense for deepfake forensics that combines copyright protection, manipulation localization, and content reconstruction. Abstract: The proliferation of AIGC-driven face manipulation and deepfakes poses severe threats to media provenance, integrity, and copyright protection. Prior versatile watermarking systems typically rely on embedding explicit localization payloads, which introduces a fidelity-functionality trade-off: larger localization signals degrade visual quality and often reduce decoding robustness under strong generative edits. Moreover, existing methods rarely support content recovery, limiting their forensic value when original evidence must be reconstructed. To address these challenges, we present VeriFi, a versatile watermarking framework that unifies copyright protection, pixel-level manipulation localization, and high-fidelity face content recovery. VeriFi makes three key contributions: (1) it embeds a compact semantic latent watermark that serves as a content-preserving prior, enabling faithful restoration even after severe manipulations; (2) it achieves fine-grained localization without embedding localization-specific artifacts by correlating image features with decoded provenance signals; and (3) it introduces an AIGC attack simulator that combines latent-space mixing with seamless blending to improve robustness to realistic deepfake pipelines. Extensive experiments on CelebA-HQ and FFHQ show that VeriFi consistently outperforms strong baselines in watermark robustness, localization accuracy, and recovery quality, providing a practical and verifiable defense for deepfake forensics.

[116] VOLMO: Versatile and Open Large Models for Ophthalmology

Zhenyue Qin,Younjoon Chung,Elijah Lee,Wanyue Feng,Xuguang Ai,Serina Applebaum,Minjie Zou,Yang Liu,Pan Xiao,Mac Singer,Amisha Dave,Aidan Gilson,Tiarnan D. L. Keenan,Emily Y. Chew,Zhiyong Lu,Yih-Chung Tham,Ron Adelman,Luciano V. Del Priore,Qingyu Chen

Main category: cs.CV

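TL;DR: This paper presents VOLMO, a framework for ophthalmology-specific multimodal large language models. Through three stages (knowledge pretraining, domain task fine-tuning, and multi-step clinical reasoning), it yields a compact 2B-parameter model that outperforms existing baselines on image description, disease screening and staging classification, and assessment-and-management generation.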

Details

Motivation: 现有通用及医学多模态大语言模型在眼科任务中表现不佳，且缺乏公开可用的眼科专用模型，而眼科临床工作需整合图像、结构化数据与自由文本，负担重、效率低，亟需高性能、开放、可复现的眼科专用模型。 Method: 提出VOLMO框架：1）眼科知识预训练（86,965图像-文本对，来自82种期刊）；2）领域任务微调（26,929标注样本，覆盖12种眼病的筛查与严重度分类）；3）多步临床推理训练（913份患者病例报告，涵盖评估、计划与随访）。基于该框架训练了2B参数的MLLM，并与InternVL-2B、LLaVA-Med-7B、MedGemma系列、RETFound等基线模型对比。 Result: VOLMO-2B在图像描述生成、12种眼病平均F1达87.4%、外部验证（AMD和DR三个独立队列）及临床专家人工评审中均显著优于所有基线模型。 Conclusion: VOLMO是一个模型无关、数据开放的眼科多模态建模范式，证明了专业化、分阶段训练策略对提升眼科AI性能的有效性，为临床辅助决策提供了高精度、可解释、可扩展的开源工具。 Abstract: Vision impairment affects millions globally, and early detection is critical to preventing irreversible vision loss. Ophthalmology workflows require clinicians to integrate medical images, structured clinical data, and free-text notes to determine disease severity and management, which is time-consuming and burdensome. Recent multimodal large language models (MLLMs) show promise, but existing general and medical MLLMs perform poorly in ophthalmology, and few ophthalmology-specific MLLMs are openly available. We present VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework for developing ophthalmology-specific MLLMs. VOLMO includes three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles across 82 journals; domain task fine-tuning on 26,929 annotated instances spanning 12 eye conditions for disease screening and severity classification; and multi-step clinical reasoning on 913 patient case reports for assessment, planning, and follow-up care. Using this framework, we trained a compact 2B-parameter MLLM and compared it with strong baselines, including InternVL-2B, LLaVA-Med-7B, MedGemma-4B, MedGemma-27B, and RETFound. 
We evaluated these models on image description generation, disease screening and staging classification, and assessment-and-management generation, with additional manual review by two healthcare professionals and external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy. Across settings, VOLMO-2B consistently outperformed baselines, achieving stronger image description performance, an average F1 of 87.4% across 12 eye conditions, and higher scores in external validation.
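The "average F1 across 12 eye conditions" reported above can be read as an unweighted (macro) average of per-condition F1 scores. The abstract does not state the averaging scheme, so that reading is an assumption; the sketch below only illustrates it, and the condition names and counts are made up for illustration.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one condition from true-positive, false-positive, false-negative counts."""
    denom = 2 * tp + fp + fn
    # With no positives predicted or present, define F1 as 0.0 to avoid dividing by zero.
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-condition F1 (assumed averaging scheme)."""
    return sum(f1_score(*c) for c in counts.values()) / len(counts)


# Hypothetical (tp, fp, fn) counts for two placeholder conditions.
example_counts = {"condition_a": (8, 2, 2), "condition_b": (9, 1, 3)}
average_f1 = macro_f1(example_counts)
```

A macro average weights every condition equally, so rare conditions count as much as common ones; a sample-weighted average would behave differently on imbalanced data.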

[117] SynMVCrowd: A Large Synthetic Benchmark for Multi-view Crowd Counting and Localization

Qi Zhang,Daijie Chen,Yunfei Gong,Hui Huang

Main category: cs.CV

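TL;DR: This paper proposes SynMVCrowd, a large synthetic benchmark for more practical evaluation of multi-view crowd counting and localization methods, together with strong baseline models that perform well on the benchmark and transfer better to real scenes.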

Details

Motivation: Existing methods are evaluated on small scenes with limited crowd sizes, camera views, and frames, which makes overfitting easy and limits practical value; a larger, more challenging benchmark is needed to advance research toward practical applications. Method: Build SynMVCrowd, a large benchmark of 50 synthetic scenes (many camera views and frames, crowds of up to 1000 people); design strong multi-view crowd counting and localization baselines; verify that training on the synthetic data helps transfer to real scenes. Result: SynMVCrowd enables more practical evaluation of multi-view and single-image crowd counting and localization methods; the baselines outperform all compared methods on SynMVCrowd and achieve better domain transfer to novel real scenes. Conclusion: SynMVCrowd provides a large-scale evaluation platform that is closer to practice for multi-view and single-image crowd analysis, helping move these techniques toward real, complex scenes. Abstract: Existing multi-view crowd counting and localization methods are evaluated under relatively small scenes with limited crowd numbers, camera views, and frames. This makes the evaluation and comparison of existing methods impractical, as small datasets are easily overfit by these methods. To avoid these issues, 3DROM proposes a data augmentation method. Instead, in this paper, we propose a large synthetic benchmark, SynMVCrowd, for more practical evaluation and comparison of multi-view crowd counting and localization tasks. The SynMVCrowd benchmark consists of 50 synthetic scenes with a large number of multi-view frames and camera views and a much larger crowd number (up to 1000), which is more suitable for large-scene multi-view crowd vision tasks. Besides, we propose strong multi-view crowd localization and counting baselines that outperform all comparison methods on the new SynMVCrowd benchmark. Moreover, we show that better domain-transfer performance for multi-view and single-image counting can be achieved on novel real scenes with the aid of the benchmark. As a result, the proposed benchmark could advance the research for multi-view and single-image crowd counting and localization to more practical applications. The code and datasets are available at https://github.com/zqyq/SynMVCrowd.

[118] PointRFT: Explicit Reinforcement Fine-tuning for Point Cloud Few-shot Learning

Yankai Wang,Yiding Sun,Qirui Wang,Pengbo Li,Chaoyi Lu,Dongxu Zhang

Main category: cs.CV

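TL;DR: This paper proposes PointRFT, the first reinforcement fine-tuning paradigm for point cloud representation learning. With dedicated accuracy and dispersion reward functions, it consistently outperforms supervised fine-tuning on few-shot classification and reaches state-of-the-art performance in data-scarce scenarios.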

Details

Motivation: Reinforcement learning (e.g., GRPO) has shown potential for improving reasoning in large language models but remains largely unexplored in 3D perception; this paper investigates whether RL-based methods can effectively empower point cloud fine-tuning. Method: Propose the PointRFT paradigm, select three prevalent 3D foundation models, and design dedicated accuracy and dispersion reward functions to stabilize training and mitigate distribution shift. Result: Across multiple few-shot classification benchmarks, PointRFT consistently outperforms supervised fine-tuning (SFT); when integrated into a hybrid Pretraining-SFT-RFT paradigm, it reaches state-of-the-art performance in data-scarce scenarios. Conclusion: RL-based methods can effectively strengthen the representational capacity of point cloud foundation models, especially in low-resource settings, offering a new paradigm for 3D perception. Abstract: Understanding spatial dynamics and semantics in point clouds is fundamental for comprehensive 3D comprehension. While reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have recently achieved remarkable breakthroughs in large language models by incentivizing reasoning capabilities through strategic reward design, their potential remains largely unexplored in the 3D perception domain. This naturally raises a pivotal question: Can RL-based methods effectively empower 3D point cloud fine-tuning? In this paper, we propose PointRFT, the first reinforcement fine-tuning paradigm tailored specifically for point cloud representation learning. We select three prevalent 3D foundation models and devise specialized accuracy reward and dispersion reward functions to stabilize training and mitigate distribution shifts. Through comprehensive few-shot classification experiments comparing distinct training paradigms, we demonstrate that PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. Furthermore, when organically integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially unleashed, achieving state-of-the-art performance particularly under data-scarce scenarios.

[119] Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection

Jielun Peng,Yabin Wang,Yaqi Li,Long Kong,Xiaopeng Hong

Main category: cs.CV

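TL;DR: This paper proposes HAVIC, a deepfake detector based on intrinsic audio-visual coherence that learns intra-modal and cross-modal coherence priors through pre-training and adaptively fuses features. It also releases the high-fidelity dataset HiFi-AVDF; experiments show it significantly outperforms existing methods in cross-dataset settings.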

Details

Motivation: Existing detectors rely on either uni-modal artifacts or audio-visual discrepancies and generalize poorly, degrading especially on unseen generators; robust, general detection should instead build on intrinsic audio-visual coherence. Method: The HAVIC detector first pre-trains on authentic videos to learn priors of modality-specific structural coherence and of inter-modal micro- and macro-coherence; based on these priors, it performs holistic adaptive aggregation to dynamically fuse audio-visual features. The authors also build HiFi-AVDF, a high-fidelity audio-visual deepfake dataset. Result: HAVIC significantly outperforms existing state-of-the-art methods on several benchmarks, improving AP by 9.39% and AUC by 9.37% in the most challenging cross-dataset scenario. Conclusion: A detection paradigm grounded in intrinsic audio-visual coherence is more robust and generalizes better; HAVIC validates this idea, and HiFi-AVDF provides a high-quality benchmark for future research. Abstract: The rapid progress of generative AI has enabled hyper-realistic audio-visual deepfakes, intensifying threats to personal security and social trust. Most existing deepfake detectors rely either on uni-modal artifacts or audio-visual discrepancies, failing to jointly leverage both sources of information. Moreover, detectors that rely on generator-specific artifacts tend to exhibit degraded generalization when confronted with unseen forgeries. We argue that robust and generalizable detection should be grounded in intrinsic audio-visual coherence within and across modalities. Accordingly, we propose HAVIC, a Holistic Audio-Visual Intrinsic Coherence-based deepfake detector. HAVIC first learns priors of modality-specific structural coherence, inter-modal micro- and macro-coherence by pre-training on authentic videos. Based on the learned priors, HAVIC further performs holistic adaptive aggregation to dynamically fuse audio-visual features for deepfake detection. Additionally, we introduce HiFi-AVDF, a high-fidelity audio-visual deepfake dataset featuring both text-to-video and image-to-video forgeries from state-of-the-art commercial generators. Extensive experiments across several benchmarks demonstrate that HAVIC significantly outperforms existing state-of-the-art methods, achieving improvements of 9.39% AP and 9.37% AUC on the most challenging cross-dataset scenario. Our code and dataset are available at https://github.com/tuffy-studio/HAVIC.

[120] SLAT-Phys: Fast Material Property Field Prediction from Structured 3D Latents

Rocktim Jyoti Das,Dinesh Manocha

Main category: cs.CV

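TL;DR: SLAT-Phys is an end-to-end method that predicts the spatially varying material property fields of a 3D asset (Young's modulus, density, and Poisson's ratio) quickly and accurately from a single RGB image, without explicit 3D reconstruction, and runs 120x faster than prior methods.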

Details

Motivation: Existing vision-based material property estimation methods are either computationally expensive and slow or depend on 3D information, which makes them a poor fit for real-time or lightweight uses such as physics simulation, robotics, and digital twins. Method: Use the spatially organized latent features of a pretrained 3D asset generation model (which carry geometric and semantic priors) and train a lightweight neural decoder that regresses spatially varying Young's modulus, density, and Poisson's ratio directly from a single RGB image. Result: Accuracy on continuous material parameters is competitive with prior methods; processing one object takes only 9.9 seconds on an NVIDIA RTX A5000, with no reconstruction or voxelization, for a 120x speedup. Conclusion: SLAT-Phys shows that spatially varying material fields can be estimated efficiently and accurately from a single image without explicit 3D reconstruction, offering a new paradigm for real-time physics simulation and digital twins. Abstract: Estimating the material property field of 3D assets is critical for physics-based simulation, robotics, and digital twin generation. Existing vision-based approaches are either too expensive and slow or rely on 3D information. We present SLAT-Phys, an end-to-end method that predicts spatially varying material property fields of 3D assets directly from a single RGB image without explicit 3D reconstruction. Our approach leverages spatially organised latent features from a pretrained 3D asset generation model that encodes rich geometric and semantic priors, and trains a lightweight neural decoder to estimate Young's modulus, density, and Poisson's ratio. The coarse volumetric layout and semantic cues of the latent representation about object geometry and appearance enable accurate material estimation. Our experiments demonstrate that our method provides competitive accuracy in predicting continuous material parameters when compared against prior approaches, while significantly reducing computation time. In particular, SLAT-Phys requires only 9.9 seconds per object on an NVIDIA RTX A5000 GPU and avoids reconstruction and voxelization preprocessing. This results in a 120x speedup compared to prior methods and enables faster material property estimation from a single image.

[121] HyDRA: Hybrid Domain-Aware Robust Architecture for Heterogeneous Collaborative Perception

Minwoo Song,Minhee Kang,Heejin Ahn

Main category: cs.CV

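TL;DR: This paper proposes HyDRA, which combines intermediate and late fusion with a lightweight domain classifier and anchor-guided pose graph optimization to handle heterogeneity in collaborative perception caused by differences in model architecture or data distribution, achieving scalable and robust performance without additional training.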

Details

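Motivation: In collaborative perception, heterogeneity caused by agents having different model architectures or training data distributions degrades overall performance. Method: Propose HyDRA, a unified framework that integrates intermediate and late fusion; design a lightweight domain classifier that dynamically identifies heterogeneous agents and assigns them to the late-fusion branch; introduce anchor-guided pose graph optimization, which uses reliable detections from intermediate fusion as spatial anchors to mitigate localization errors in late fusion. Result: Experiments show that, without any additional training, HyDRA performs comparably to state-of-the-art heterogeneity-aware collaborative perception methods and keeps that performance as the number of collaborating agents grows, enabling zero-cost scaling. Conclusion: HyDRA is an efficient, scalable collaborative perception framework that needs no retraining and effectively handles heterogeneity. Abstract: In collaborative perception, an agent's performance can be degraded by heterogeneity arising from differences in model architecture or training data distributions. To address this challenge, we propose HyDRA (Hybrid Domain-Aware Robust Architecture), a unified pipeline that integrates intermediate and late fusion within a domain-aware framework. We introduce a lightweight domain classifier that dynamically identifies heterogeneous agents and assigns them to the late-fusion branch. Furthermore, we propose anchor-guided pose graph optimization to mitigate localization errors inherent in late fusion, leveraging reliable detections from intermediate fusion as fixed spatial anchors. Extensive experiments demonstrate that, despite requiring no additional training, HyDRA achieves performance comparable to state-of-the-art heterogeneity-aware CP methods. Importantly, this performance is maintained as the number of collaborating agents increases, enabling zero-cost scaling without retraining.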

[122] SilLang: Improving Gait Recognition with Silhouette Language Encoding

Ruiyi Zhan,Guozhen Peng,Canyu Chen,Jian Lei,Annan Li

Main category: cs.CV

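TL;DR: This paper proposes aligning binary gait silhouettes with natural language in a binary encoding space. It designs a Contour-Velocity Tokenizer to align the two distributions and builds a dual-branch Silhouette Language Model (SilLang) that integrates discrete linguistic embeddings from large language models to enhance gait representations, improving over state-of-the-art methods on several mainstream datasets.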

Details

Motivation: Existing gait recognition methods mostly rely on visual backbones to extract continuous features, overlooking the discrete nature of binary gait silhouettes. Large language models are good at processing discrete sequences and modeling long-range temporal dependencies, so they have the potential to capture subtle motion variations; this makes it worthwhile to bridge gait silhouettes and language in a shared discrete encoding space. Method: Propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes and reshapes their distribution to align with the text token space; build the dual-branch Silhouette Language Model (SilLang), which integrates discrete linguistic embeddings derived from LLMs into visual gait features; the framework can be plugged into mainstream gait backbones. Result: On the SUSTech1K, GREW, and Gait3D datasets, SilLang consistently improves over state-of-the-art methods. Conclusion: Mapping gait silhouettes into a discrete encoding space aligned with natural language is feasible and effective; exploiting the discrete semantic modeling ability of LLMs improves gait recognition and suggests a new paradigm for cross-modal biometric understanding. Abstract: Gait silhouettes, which can be encoded into binary gait codes, are widely adopted to represent the motion patterns of pedestrians. Recent approaches commonly leverage visual backbones to encode gait silhouettes, achieving successful performance. However, they primarily focus on continuous visual features, overlooking the discrete nature of binary silhouettes that inherently share a discrete encoding space with natural language. Large Language Models (LLMs) have demonstrated exceptional capability in extracting discriminative features from discrete sequences and modeling long-range dependencies, highlighting their potential to capture temporal motion patterns by identifying subtle variations. Motivated by these observations, we explore bridging binary gait silhouettes and natural language within a binary encoding space. However, the encoding spaces of text tokens and binary gait silhouettes remain misaligned, primarily due to differences in token frequency and density. To address this issue, we propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes while reshaping their distribution to better align with the text token space. We then establish a dual-branch framework termed Silhouette Language Model, which enhances visual silhouettes by integrating discrete linguistic embeddings derived from LLMs. Implemented on mainstream gait backbones, SilLang consistently improves state-of-the-art methods across SUSTech1K, GREW, and Gait3D.

[123] CAKE: Real-time Action Detection via Motion Distillation and Background-aware Contrastive Learning

Hieu Hoang,Dung Trung Tran,Hong Nguyen,Nam-Phong Nguyen

Main category: cs.CV

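TL;DR: This paper proposes the CAKE framework, which uses a Dynamic Motion Adapter (DMA) and Floating Contrastive Learning to distill the motion knowledge of optical flow into RGB models without explicitly computing flow, improving online action detection (OAD) performance while keeping inference fast (over 72 FPS on a CPU).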

Details

Motivation: Online action detection (OAD) faces two main challenges: high computational cost and difficulty modeling discriminative temporal dynamics; optical flow provides strong motion cues but is expensive to compute. Method: Propose CAKE, a framework based on optical-flow distillation: 1) the Dynamic Motion Adapter (DMA) suppresses static background noise and emphasizes pixel changes, approximating optical flow; 2) Floating Contrastive Learning separates foreground action dynamics from temporal background. Result: On TVSeries, THUMOS'14, and Kinetics-400, CAKE achieves a strong mAP relative to state-of-the-art methods using the same backbone, and runs at over 72 FPS on a single CPU. Conclusion: CAKE balances accuracy and efficiency, providing an effective solution for real-time OAD in resource-constrained settings. Abstract: Online Action Detection (OAD) systems face two primary challenges: high computational cost and insufficient modeling of discriminative temporal dynamics against background motion. Adding optical flow could provide strong motion cues, but it incurs significant computational overhead. We propose CAKE, an OAD flow-based distillation framework that transfers motion knowledge into RGB models. We propose the Dynamic Motion Adapter (DMA) to suppress static background noise and emphasize pixel changes, effectively approximating optical flow without explicit computation. The framework also integrates a Floating Contrastive Learning strategy to distinguish informative motion dynamics from temporal background. Experiments conducted on the TVSeries, THUMOS'14, and Kinetics-400 datasets show the effectiveness of our model. CAKE achieves a standout mAP compared with SOTA while using the same backbone. Our model operates at over 72 FPS on a single CPU, making it highly suitable for resource-constrained systems.

[124] HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images

Yumeng Liu,Xiao-Xiao Long,Marc Habermann,Xuanze Yang,Cheng Lin,Yuan Liu,Yuexin Ma,Wenping Wang,Ligang Liu

Main category: cs.CV

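TL;DR: This paper proposes a feed-forward architecture that, for the first time, jointly estimates 3D hand meshes and camera poses from arbitrary uncalibrated views, combining high accuracy with deployment flexibility.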

Details

Motivation: Existing single-view methods suffer from depth ambiguity and occlusion, while multi-view methods depend on fixed, calibrated setups and are hard to deploy at scale in real scenes; a flexible, high-fidelity hand reconstruction method is needed that can exploit massive unstructured internet images and work with consumer-grade RGB cameras. Method: Inspired by 3D foundation models, the paper reformulates hand reconstruction as a visual-geometry grounded task and designs an end-to-end feed-forward network that jointly infers 3D hand meshes and camera poses directly from arbitrary uncalibrated views. Result: The method outperforms the state of the art on multiple benchmarks, generalizes strongly to uncalibrated, in-the-wild scenarios, and accepts a single image or any number of images without camera calibration. Conclusion: The method bridges the gap between accuracy and deployment flexibility, offering a new paradigm for scalable hand geometry reconstruction in the real world. Abstract: Recovering high-fidelity 3D hand geometry from images is a critical task in computer vision, holding significant value for domains such as robotics, animation and VR/AR. Crucially, scalable applications demand both accuracy and deployment flexibility, requiring the ability to leverage massive amounts of unstructured image data from the internet or enable deployment on consumer-grade RGB cameras without complex calibration. However, current methods face a dilemma. While single-view approaches are easy to deploy, they suffer from depth ambiguity and occlusion. Conversely, multi-view systems resolve these uncertainties but typically demand fixed, calibrated setups, limiting their real-world utility. To bridge this gap, we draw inspiration from 3D foundation models that learn explicit geometry directly from visual data. By reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task, we propose a feed-forward architecture that, for the first time in the literature, jointly infers 3D hand meshes and camera poses from uncalibrated views. Extensive evaluations show that our approach outperforms state-of-the-art methods and demonstrates strong generalization to uncalibrated, in-the-wild scenarios. Project page: https://lym29.github.io/HGGT/.

[125] DB SwinT: A Dual-Branch Swin Transformer Network for Road Extraction in Optical Remote Sensing Imagery

Zongyang He,Xiangli Yang,Xian Gao,Zhiguo Wang

Main category: cs.CV

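TL;DR: This paper proposes a Dual-Branch Swin Transformer network (DB SwinT) that combines the long-range dependency modeling of the Swin Transformer with the multi-scale feature fusion of U-Net. A dual-branch (local and global) encoder and an Attentional Feature Fusion (AFF) module improve road extraction accuracy from remote sensing imagery in complex scenes.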

Details

Motivation: In high-resolution optical remote sensing imagery, roads are often occluded by trees, buildings, and other objects, which leads to fragmented results, poor continuity, and low accuracy; a more robust road extraction method is needed. Method: Build the Dual-Branch Swin Transformer network (DB SwinT): the local branch recovers fine structural details in occluded areas, and the global branch captures semantic context to preserve road continuity; an Attentional Feature Fusion (AFF) module adaptively fuses features from the two branches; the overall architecture combines the Swin Transformer with the multi-scale design of U-Net. Result: DB SwinT achieves IoU scores of 79.35% and 74.84% on the Massachusetts and DeepGlobe datasets, respectively, outperforming existing mainstream methods. Conclusion: DB SwinT effectively models local detail and global semantics, improving the completeness and accuracy of occluded road extraction in complex urban and rural environments, and is applicable to urban planning, traffic monitoring, and disaster management. Abstract: With the continuous improvement in the spatial resolution of optical remote sensing imagery, accurate road extraction has become increasingly important for applications such as urban planning, traffic monitoring, and disaster management. However, road extraction in complex urban and rural environments remains challenging, as roads are often occluded by trees, buildings, and other objects, leading to fragmented structures and reduced extraction accuracy. To address this problem, this paper proposes a Dual-Branch Swin Transformer network (DB SwinT) for road extraction. The proposed framework combines the long-range dependency modeling capability of the Swin Transformer with the multi-scale feature fusion strategy of U-Net, and employs a dual-branch encoder to learn complementary local and global representations. Specifically, the local branch focuses on recovering fine structural details in occluded areas, while the global branch captures broader semantic context to preserve the overall continuity of road networks. In addition, an Attentional Feature Fusion (AFF) module is introduced to adaptively fuse features from the two branches, further enhancing the representation of occluded road segments. Experimental results on the Massachusetts and DeepGlobe datasets show that DB SwinT achieves Intersection over Union (IoU) scores of 79.35% and 74.84%, respectively, demonstrating its effectiveness for road extraction from optical remote sensing imagery.
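For readers unfamiliar with the Intersection over Union (IoU) metric cited above, the following is a minimal sketch of IoU for a binary road mask. How the paper aggregates IoU over a dataset is not stated in the abstract, so the per-image form here is an assumption, and the tiny masks are made up for illustration.

```python
from typing import List

Mask = List[List[int]]  # 1 = road pixel, 0 = background


def binary_iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union between two same-sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly, so treat that case as IoU = 1.0.
    return intersection / union if union else 1.0


# Hypothetical 3x4 masks: the prediction misses one road pixel in the last row.
predicted = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 0]]
ground_truth = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0]]
road_iou = binary_iou(predicted, ground_truth)
```

Because roads are thin structures, a few missed pixels along an occluded segment lower IoU noticeably, which is why occlusion handling matters for this metric.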

[126] UW-VOS: A Large-Scale Dataset for Underwater Video Object Segmentation

Hongshen Zhao,Jingkang Tai,Yuhang Wu,Wenkang Zhang,Xi Lan,Shangyan Wang,Tianyu Zhang,Wankou Yang

Main category: cs.CV

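TL;DR: This paper introduces UW-VOS, the first large-scale underwater video object segmentation benchmark, and SAM-U, a lightweight adaptation framework, which together substantially reduce the performance degradation of existing methods in underwater scenes.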

Details

Motivation: Underwater video object segmentation faces color distortion, low contrast, and camouflaged targets, and high-quality training data is lacking, so methods developed for open-air scenes degrade severely. Method: Build the UW-VOS benchmark with 1,431 video sequences, 409 categories, and 309,295 mask annotations; propose SAM-U, a parameter-efficient framework that inserts lightweight adapters into the SAM2 image encoder. Result: SAM-U reaches state-of-the-art performance with only about 2% trainable parameters; existing methods drop by 13 points of J&F on average on UW-VOS, and SAM-U effectively bridges this domain gap; attribute analysis identifies small targets, camouflage, and exit-re-entry as key bottlenecks. Conclusion: UW-VOS and SAM-U provide an important data and method foundation for underwater visual perception and point to directions for future research on robust underwater perception. Abstract: Underwater Video Object Segmentation (VOS) is essential for marine exploration, yet open-air methods suffer significant degradation due to color distortion, low contrast, and prevalent camouflage. A primary hurdle is the lack of high-quality training data. To bridge this gap, we introduce UW-VOS, the first large-scale underwater VOS benchmark comprising 1,431 video sequences across 409 categories with 309,295 mask annotations, constructed via a semi-automatic data engine with rigorous human verification. We further propose SAM-U, a parameter-efficient framework that adapts SAM2 to the underwater domain. By inserting lightweight adapters into the image encoder, SAM-U achieves state-of-the-art performance with only ~2% trainable parameters. Extensive experiments reveal that existing methods experience an average 13-point J&F drop on UW-VOS, while SAM-U effectively bridges this domain gap. Detailed attribute-based analysis further identifies small targets, camouflage, and exit-re-entry as critical bottlenecks, providing a roadmap for future research in robust underwater perception.

[127] COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm

Zekun Qian,Wei Feng,Ruize Han,Junhui Hou

Main category: cs.CV

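TL;DR: This paper proposes the C-TAO dataset and the COVTrack++ framework to address two bottlenecks in open-vocabulary multi-object tracking (OVMOT): the lack of continuously annotated video and insufficient synergy between detection and association. It achieves state-of-the-art performance on TAO and strong zero-shot generalization on BDD100K.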

Details

Motivation: Existing MOT methods are limited to fixed categories. OVMOT supports arbitrary categories, including unseen ones, but is held back by the lack of continuously annotated video data and of a framework designed to handle detection and association synergistically. Method: Build C-TAO, the first continuously annotated training set for OVMOT; propose the COVTrack++ framework with three modules, Multi-Cue Adaptive Fusion (MCF), Multi-Granularity Hierarchical Aggregation (MGA), and Temporal Confidence Propagation (TCP), so that detection and association reinforce each other in both directions. Result: On TAO, novel TETA reaches 35.4% (validation) and 30.5% (test), with novel AssocA and LocA improving by 4.8% and 5.8%, respectively; the method also shows strong zero-shot generalization on BDD100K. Conclusion: C-TAO and COVTrack++ address the data and model bottlenecks of OVMOT, improving tracking performance and robustness on unseen categories and moving OVMOT closer to real open-world use. Abstract: Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework to synergistically handle detection and association. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections through high-confidence tracked objects boosting low-confidence candidates across frames, stabilizing trajectories. 
Extensive experiments on TAO demonstrate state-of-the-art performance, with novel TETA reaching 35.4% and 30.5% on validation and test sets, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, and show strong zero-shot generalization on BDD100K. The code and dataset will be publicly available.
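The idea behind Temporal Confidence Propagation, where high-confidence tracked objects boost low-confidence candidates in the next frame, can be illustrated with a toy sketch. This is not the authors' implementation: the boosting rule (track score times box IoU), the thresholds, and all function names are hypothetical choices made only to show the mechanism.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Scored = Tuple[Box, float]               # (box, confidence)


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def propagate_confidence(tracks: List[Scored], candidates: List[Scored],
                         track_thresh: float = 0.7,
                         iou_thresh: float = 0.5) -> List[Scored]:
    """Raise a candidate's score when it overlaps a confident track (assumed rule)."""
    boosted = []
    for box, score in candidates:
        best = score
        for track_box, track_score in tracks:
            overlap = box_iou(box, track_box)
            if track_score >= track_thresh and overlap >= iou_thresh:
                best = max(best, track_score * overlap)
        boosted.append((box, best))
    return boosted


# Hypothetical frame: one confident track, one flickering detection near it,
# and one unrelated low-confidence detection far away.
tracks = [((0.0, 0.0, 10.0, 10.0), 0.9)]
candidates = [((1.0, 0.0, 11.0, 10.0), 0.2), ((50.0, 50.0, 60.0, 60.0), 0.2)]
result = propagate_confidence(tracks, candidates)
```

In this toy rule only the candidate that overlaps the confident track is boosted, so a detection that briefly drops below the detector's score threshold can still be kept on its trajectory.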

[128] Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection

Sa Zhu,Wanqian Zhang,Lin Wang,Xiaohua Chen,Chenxu Cui,Jinchao Zhang,Bo Li

Main category: cs.CV

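TL;DR: This paper proposes the Phase-wise Decomposition and Alignment (PDA) framework for open-vocabulary temporal action detection (OV-TAD), which improves cross-category knowledge transfer through phase-wise decomposition and alignment.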

Details

Motivation: Existing methods rely only on global alignment between label-level semantics and visual features, which is insufficient for transferring temporally consistent visual knowledge to unseen action categories. Method: Propose the PDA framework with three modules: 1) CoT-Prompting Semantic Decomposition (CSD), which uses the chain-of-thought ability of large language models to decompose action labels into phase-level descriptions; 2) Text-infused Foreground Filtering (TIF), which adaptively filters the video segments relevant to each phase based on phase-wise semantic cues; 3) Adaptive Phase-wise Alignment (APA), which performs phase-level visual-textual matching and adaptively aggregates the results. Result: Experiments on two OV-TAD benchmarks show that the method outperforms existing approaches and generalizes better to unseen action categories. Conclusion: Fine-grained phase-wise modeling and alignment extract transferable action patterns more effectively, offering a new approach to open-vocabulary temporal action detection. Abstract: Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments in untrimmed videos for unseen categories. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporally consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, a Text-infused Foreground Filtering (TIF) module is introduced to adaptively filter action-relevant segments for each phase by leveraging phase-wise semantic cues, producing semantically aligned visual representations. Furthermore, we propose the Adaptive Phase-wise Alignment (APA) module to perform phase-level visual-textual matching and adaptively aggregate alignment results across phases for the final prediction. This adaptive phase-wise alignment facilitates the capture of transferable action patterns and significantly enhances generalization to unseen actions. Extensive experiments on two OV-TAD benchmarks demonstrate the superiority of the proposed method.

[129] SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision

Avigail Cohen Rimon,Amir Mann,Mirela Ben Chen,Or Litany

Main category: cs.CV

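TL;DR: This paper proposes SpectralSplats, which moves the optimization objective from the spatial domain to the frequency domain and supervises rendered images with global complex sinusoidal features (Spectral Moments). This addresses the vanishing gradients that 3D Gaussian Splatting (3DGS) suffers in video tracking under severe camera misalignment, and a frequency annealing strategy makes tracking robust.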

Details

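Motivation: 3D Gaussian Splatting (3DGS) offers real-time, photorealistic novel view synthesis, but using its renderer "in the wild" is fragile: Gaussian primitives have compact, local support and standard photometric losses depend on spatial overlap, so under severe camera misalignment the gradients vanish and optimization stalls. A more robust tracking mechanism is needed. Method: The SpectralSplats framework 1) replaces the pixel-level photometric loss with global complex sinusoidal features (Spectral Moments), building a global basin of attraction in the frequency domain; 2) derives a principled frequency annealing schedule that first uses low frequencies for global convexity and then gradually introduces high frequencies for precise spatial alignment. Result: As a drop-in module, SpectralSplats improves tracking robustness across deformation parameterizations (MLPs, sparse control points, and others) and recovers complex deformations even from severely misaligned initializations, where standard appearance-based tracking fails. Conclusion: Moving the optimization to the frequency domain, combined with frequency annealing, mitigates the vanishing-gradient problem in 3DGS tracking and provides a more reliable, general loss design for model-based video tracking. Abstract: 3D Gaussian Splatting (3DGS) enables real-time, photorealistic novel view synthesis, making it a highly attractive representation for model-based video tracking. However, leveraging the differentiability of the 3DGS renderer "in the wild" remains notoriously fragile. A fundamental bottleneck lies in the compact, local support of the Gaussian primitives. Standard photometric objectives implicitly rely on spatial overlap; if severe camera misalignment places the rendered object outside the target's local footprint, gradients strictly vanish, leaving the optimizer stranded. We introduce SpectralSplats, a robust tracking framework that resolves this "vanishing gradient" problem by shifting the optimization objective from the spatial to the frequency domain. By supervising the rendered image via a set of global complex sinusoidal features (Spectral Moments), we construct a global basin of attraction, ensuring that a valid, directional gradient toward the target exists across the entire image domain, even when pixel overlap is completely nonexistent. To harness this global basin without introducing periodic local minima associated with high frequencies, we derive a principled Frequency Annealing schedule from first principles, gracefully transitioning the optimizer from global convexity to precise spatial alignment. 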
We demonstrate that SpectralSplats acts as a seamless, drop-in replacement for spatial losses across diverse deformation parameterizations (from MLPs to sparse control points), successfully recovering complex deformations even from severely misaligned initializations where standard appearance-based tracking catastrophically fails.
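A one-dimensional toy example can make the vanishing-gradient argument concrete. The abstract does not give the exact definition of Spectral Moments, so the sketch below assumes they are discrete Fourier coefficients of the signal; the box-shaped "object", the signal length, and the chosen frequencies are illustrative assumptions and not the paper's setup.

```python
import cmath
from typing import List, Sequence


def box_signal(n: int, start: int, width: int) -> List[float]:
    """1D stand-in for a rendered object: ones on [start, start + width), zeros elsewhere."""
    return [1.0 if start <= x < start + width else 0.0 for x in range(n)]


def pixel_loss(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared per-pixel differences (spatial-domain photometric loss)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def spectral_moment(signal: Sequence[float], k: int) -> complex:
    """Assumed moment: projection onto a global complex sinusoid of frequency k."""
    n = len(signal)
    return sum(v * cmath.exp(-2j * cmath.pi * k * x / n) for x, v in enumerate(signal))


def spectral_loss(a: Sequence[float], b: Sequence[float], freqs: Sequence[int]) -> float:
    """Squared distance between the moments of two signals over the given frequencies."""
    return sum(abs(spectral_moment(a, k) - spectral_moment(b, k)) ** 2 for k in freqs)


N = 128
target = box_signal(N, 10, 5)
near = box_signal(N, 30, 5)  # misaligned by 20 pixels, no overlap with target
far = box_signal(N, 40, 5)   # misaligned by 30 pixels, no overlap with target

# Spatial loss is identical for both misalignments: no signal about direction.
flat = pixel_loss(near, target) == pixel_loss(far, target)
# A low frequency still ranks the closer render as better.
low_near = spectral_loss(near, target, [1])
low_far = spectral_loss(far, target, [1])
# A high frequency (period 4 pixels) is blind to a 20-pixel shift: a spurious minimum.
high_near = spectral_loss(near, target, [32])
```

In this toy setting the pixel loss is flat once the boxes stop overlapping, the lowest frequency still decreases as the render moves toward the target, and the high frequency reports a near-zero loss at a wrong offset. That last effect is the kind of periodic local minimum that motivates starting with low frequencies and adding higher ones later.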

[130] A^3: Towards Advertising Aesthetic Assessment

Kaiyuan Ji,Yixuan Gao,Lu Sun,Yushuo Zheng,Zijian Chen,Jianbo Zhang,Xiangyang Zhu,Yuan Tian,Zicheng Zhang,Guangtao Zhai

Main category: cs.CV

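TL;DR: This paper proposes the A^3 framework, comprising a theoretical paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench), to address the subjectivity, limited scalability, and lack of interpretability in advertising image assessment.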

Details

Motivation: Existing advertising image evaluation relies on subjective judgment and lacks scalability, standardized criteria, and interpretability. Method: Define the A^3-Law paradigm based on three hierarchical stages (Perceptual Attention, Formal Interest, Desire Impact); build A^3-Dataset with 120K instruction-response pairs on that basis; train the multimodal large language model A^3-Align with Chain-of-Thought (CoT) guided learning; run extensive experiments on A^3-Bench. Result: On A^3-Bench, A^3-Align aligns with A^3-Law better than existing models and generalizes to quality advertisement selection and prescriptive advertisement critique. Conclusion: The A^3 framework provides an integrated solution for advertising image assessment that combines theory, data, and a model, with potential for practical deployment. Abstract: Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A^3 (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench). Central to A^3 is a theory-driven paradigm, A^3-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A^3-Law, we construct A^3-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A^3-Align, trained under A^3-Law with CoT-guided learning on A^3-Dataset. Extensive experiments on A^3-Bench demonstrate that A^3-Align achieves superior alignment with A^3-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment. Dataset, code, and models can be found at: https://github.com/euleryuan/A3-Align.

[131] SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons

Haiyang Xu,Ronghuan Wu,Li-Yi Wei,Nanxuan Zhao,Chenxi Liu,Cuong Nguyen,Zhuowen Tu,Zhaowen Wang

Main category: cs.CV

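TL;DR: This paper proposes SemLayer, which uses visual generation to recover semantic layered structure from flattened vector icons, supporting downstream tasks such as editing, restyling, and animation.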

Details

Motivation: In modern design, icons are often distributed as flattened single-path or compound-path graphics, so the original semantic layering is lost, which hinders downstream tasks such as editing, restyling, and animation. Method: SemLayer is a pipeline driven by visual generation: it first generates a chromatically differentiated representation in which semantic components are visually separable; it then performs a semantic completion step that reconstructs the complete geometry of each part, including occluded regions; finally, it assembles the parts into a layered vector representation according to inferred occlusion relationships. Result: Extensive qualitative comparisons and quantitative evaluations validate SemLayer, which enables editing workflows that were previously inapplicable to flattened vector graphics. Conclusion: Semantic layer reconstruction is a practical and valuable task, and SemLayer offers a feasible, effective solution for it. Abstract: Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics, where the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation empowered pipeline that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows previously inapplicable to flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task. Project page: https://xxuhaiyang.github.io/SemLayer/

[132] HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models

Yeqi He,Liang Li,Zhiwen Yang,Xichun Sheng,Zhidong Zhao,Chenggang Yan

Main category: cs.CV

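TL;DR: This paper proposes HAM, a training-free style transfer method that uses heterogeneous attention modulation (global attention regulation and local attention transplantation) in diffusion models to balance style and content while preserving the identity of the content image.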

Details

Motivation: Existing style transfer methods built on pretrained diffusion models struggle to capture complex style references while preserving the identity of the content image, leading to a style-content imbalance. Method: Propose heterogeneous attention modulation (HAM), comprising style noise initialization, Global Attention Regulation (GAR), and Local Attention Transplantation (LAT), which modulate different attention mechanisms during diffusion to preserve content detail while capturing style features. Result: HAM achieves state-of-the-art performance on multiple quantitative metrics and is validated through qualitative and quantitative experiments. Conclusion: HAM is a training-free style transfer framework for diffusion models that eases the style-content trade-off and suggests a new approach to controllable image generation. Abstract: Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage pre-trained diffusion models' robust feature extraction capabilities alongside external modular control pathways to explicitly impose style guidance signals. However, these methods often fail to capture complex style references or retain the identity of user-provided content images, thus falling into the trap of style-content balance. We therefore propose a training-free style transfer approach via heterogeneous attention modulation (HAM) to protect identity information during image/text-guided style reference transfer, thereby addressing the style-content trade-off challenge. Specifically, we first introduce style noise initialization to initialize the latent noise for diffusion. Then, during the diffusion process, we apply HAM to different attention mechanisms, including Global Attention Regulation (GAR) and Local Attention Transplantation (LAT), which better preserves the details of the content image while capturing complex style references. Our approach is validated through a series of qualitative and quantitative experiments, achieving state-of-the-art performance on multiple quantitative metrics.

[133] LGEST: Dynamic Spatial-Spectral Expert Routing for Hyperspectral Image Classification

Jiawen Wen,Suixuan Qiu,Zihang Luo,Xiaofei Yang,Haotian Shi

Main category: cs.CV

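TL;DR: This paper proposes the LGEST framework, which uses a deep spatial-spectral autoencoder, a cross-interactive mixed-expert feature pyramid, and a local-global expert system to address rigid local-global representation fusion, inadequate handling of spectral-spatial scale disparities, and the Hughes phenomenon in hyperspectral image classification.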

Details

Motivation: Existing deep learning methods for hyperspectral image classification fuse local and global representations inflexibly, model spectral-spatial multi-scale disparities inadequately, and are susceptible to the Hughes phenomenon on high-dimensional heterogeneous samples. Method: The paper proposes the LGEST framework with three core modules: 1) a Deep Spatial-Spectral Autoencoder (DSAE) that produces compact, discriminative embeddings; 2) a Cross-Interactive Mixed Expert Feature Pyramid (CIEM-FPN) that combines cross-attention with residual mixture-of-experts layers to fuse multi-scale features dynamically; 3) a Local-Global Expert System (LGES) that models decomposed features with sparsely activated convolutional and transformer sub-experts, with a routing controller selecting experts according to feature saliency. Result: Extensive experiments on four benchmark datasets show that LGEST consistently outperforms state-of-the-art methods. Conclusion: By jointly modeling local texture and global context, adaptively fusing multi-scale spectral-spatial features, and reducing the overfitting risk caused by high-dimensional heterogeneity, LGEST improves hyperspectral image classification performance. Abstract: Deep learning methods, including Convolutional Neural Networks, Transformers and Mamba, have achieved remarkable success in hyperspectral image (HSI) classification. Nevertheless, existing methods exhibit inflexible integration of local-global representations, inadequate handling of spectral-spatial scale disparities across heterogeneous bands, and susceptibility to the Hughes phenomenon under high-dimensional sample heterogeneity. To address these challenges, we propose Local-Global Expert Spatial-Spectral Transformer (LGEST), a novel framework that synergistically combines three key innovations. First, LGEST employs a Deep Spatial-Spectral Autoencoder (DSAE) to generate compact yet discriminative embeddings through hierarchical nonlinear compression, preserving 3D neighborhood coherence while mitigating information loss in high-dimensional spaces. Second, a Cross-Interactive Mixed Expert Feature Pyramid (CIEM-FPN) leverages cross-attention mechanisms and residual mixture-of-experts layers to dynamically fuse multi-scale features, adaptively weighting spectral discriminability and spatial saliency through learnable gating functions. Finally, a Local-Global Expert System (LGES) processes decomposed features via sparsely activated expert pairs: convolutional sub-experts capture fine-grained textures, while transformer sub-experts model long-range contextual dependencies, with a routing controller dynamically selecting experts based on real-time feature saliency. Extensive experiments on four benchmark datasets demonstrate that LGEST consistently outperforms state-of-the-art methods.
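
Illustrative sketch (not from the paper): the abstract describes sparsely activated experts chosen by a routing controller. One common way to realize this is top-k softmax gating, shown below in plain Python. The function names, the choice of k, and the toy scalar "experts" are assumptions made for illustration; the paper's saliency-based router may work differently.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}  # renormalize over kept experts

def mixture_output(x, experts, gate_logits, k=2):
    """Weighted sum of the selected experts' outputs; the others never run."""
    weights = route_top_k(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy scalar experts standing in for convolutional / transformer sub-experts.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
weights = route_top_k([2.0, 1.0, -3.0], k=2)
```

Only the selected experts are evaluated, which is where the sparsity saving comes from.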

[134] Beyond Semantic Priors: Mitigating Optimization Collapse for Generalizable Visual Forensics

Jipeng Liu,Haichao Shi,Siyu Xing,Rong Yin,Xiao-Yu Zhang

Main category: cs.CV

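TL;DR: This paper shows that vision-language models trained with Sharpness-Aware Minimization (SAM) suffer from "Optimization Collapse" in deepfake detection, and analyzes its cause with a theoretical framework built on the Critical Optimization Radius (COR) and the Gradient Signal-to-Noise Ratio (GSNR). It then designs CoRIT, which combines a Contrastive Gradient Proxy with several training-free strategies to improve gradient fidelity and generalization, significantly improving the detection of non-semantic forgeries.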

Details

Motivation: Existing deepfake detectors based on VLMs such as CLIP are pre-trained with a semantic-centric objective, so they struggle to capture the non-semantic artifacts of hyper-realistic synthesis; when trained with Sharpness-Aware Minimization (SAM), they are also prone to "Optimization Collapse", degenerating to random guessing once the perturbation radius grows slightly too large. Method: The paper proposes the Critical Optimization Radius (COR) to quantify the geometric stability of the optimization landscape, uses the Gradient Signal-to-Noise Ratio (GSNR) to measure generalization potential, and proves that COR increases monotonically with GSNR. This identifies layer-wise GSNR attenuation as the root cause of the collapse. It then designs the Contrastive Regional Injection Transformer (CoRIT), which combines a Contrastive Gradient Proxy (CGP) with three training-free strategies: Region Refinement Mask, Regional Signal Injection, and Hierarchical Representation Integration. Result: CoRIT mitigates optimization collapse and achieves state-of-the-art generalization on cross-domain and universal forgery benchmarks. Conclusion: Optimization collapse is a geometric instability caused by degraded intrinsic generalization potential; improving gradient fidelity, rather than merely shrinking the perturbation radius, addresses the root cause. CoRIT does so with training-free strategies and a computationally efficient gradient proxy, giving more robust and generalizable detection of non-semantic forgeries. Abstract: While Vision-Language Models (VLMs) like CLIP have emerged as a dominant paradigm for generalizable deepfake detection, a representational disconnect remains: their semantic-centric pre-training is ill-suited for capturing non-semantic artifacts inherent to hyper-realistic synthesis. In this work, we identify a failure mode termed Optimization Collapse, where detectors trained with Sharpness-Aware Minimization (SAM) degenerate to random guessing on non-semantic forgeries once the perturbation radius exceeds a narrow threshold. To theoretically formalize this collapse, we propose the Critical Optimization Radius (COR) to quantify the geometric stability of the optimization landscape, and leverage the Gradient Signal-to-Noise Ratio (GSNR) to measure generalization potential. We establish a theorem proving that COR increases monotonically with GSNR, thereby revealing that the geometric instability of SAM optimization originates from degraded intrinsic generalization potential. This result identifies the layer-wise attenuation of GSNR as the root cause of Optimization Collapse in detecting non-semantic forgeries. Although naively reducing the perturbation radius yields stable convergence under SAM, it merely treats the symptom without mitigating the intrinsic generalization degradation, necessitating enhanced gradient fidelity. Building on this insight, we propose the Contrastive Regional Injection Transformer (CoRIT), which integrates a computationally efficient Contrastive Gradient Proxy (CGP) with three training-free strategies: Region Refinement Mask to suppress CGP variance, Regional Signal Injection to preserve CGP magnitude, and Hierarchical Representation Integration to attain more generalizable representations. Extensive experiments demonstrate that CoRIT mitigates optimization collapse and achieves state-of-the-art generalization across cross-domain and universal forgery benchmarks.
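
Illustrative sketch (not from the paper): the two quantities the analysis revolves around can be computed on toy numbers. The code assumes the commonly used per-parameter definition of GSNR (squared mean of per-sample gradients divided by their variance) and the standard SAM ascent step of length rho along the normalized gradient. The paper's exact definitions, and its COR, may differ and are not reproduced here.

```python
from statistics import fmean, pvariance

def gsnr(per_sample_grads):
    """GSNR of one parameter: squared mean gradient over gradient variance."""
    mean = fmean(per_sample_grads)
    var = pvariance(per_sample_grads)
    return mean * mean / var if var > 0 else float("inf")

def sam_perturbation(grad, rho):
    """SAM ascent step: move a distance rho along the normalized gradient."""
    norm = sum(g * g for g in grad) ** 0.5
    return [rho * g / norm for g in grad] if norm > 0 else [0.0] * len(grad)

consistent = [0.9, 1.1, 1.0, 1.0]   # samples agree on the gradient
noisy = [3.0, -2.0, 2.5, 0.5]       # same mean (1.0), much larger spread
```

With equal mean gradients, the noisy set has a far lower GSNR, which is the regime in which the abstract argues that only a small perturbation radius remains stable.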

[135] Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

Han Sun,Qin Li,Peixin Wang,Min Zhang

Main category: cs.CV

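TL;DR: This paper introduces the concept of attention imbalance, shows that it is strongly linked to object hallucination in large vision-language models (LVLMs), and designs AIR, a lightweight decoding-time intervention that significantly reduces hallucination rates and improves the models' general capability across tasks.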

Details

Motivation: Object hallucination in large vision-language models (LVLMs) seriously undermines their reliability in high-stakes settings such as autonomous driving and medical image analysis, so its causes need to be understood and effective mitigation is needed. Method: A systematic empirical analysis finds that uneven attention allocation, both across modalities (vision vs. language) and within modalities (among tokens), is strongly associated with object hallucination. Based on this, the paper defines "attention imbalance" as a quantitative measure with visualizable patterns, and proposes AIR, a lightweight decoding-time intervention that dynamically reallocates attention weights to correct the imbalance. Result: On four mainstream LVLMs and three benchmarks (CHAIR, POPE, MM-Vet), AIR reduces object hallucination rates by up to 35.1% and improves general LVLM capability by up to 15.9%. Conclusion: Attention imbalance is a key cause of object hallucination in LVLMs, and AIR, a training-free plug-and-play decoding intervention, mitigates hallucination while improving robustness and generalization. Abstract: Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving LVLMs' general capability by up to 15.9% across diverse vision-language tasks.
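
Illustrative sketch (not from the paper): a minimal example of what modality-wise reallocation of attention could look like for a single softmax-normalized attention row. The `target_visual_share` parameter and the proportional rescaling rule are assumptions for illustration; the abstract does not specify AIR's actual rule, and the token-wise correction is not covered here.

```python
def modality_mass(attn, is_visual):
    """Total attention on visual tokens and on language tokens."""
    visual = sum(a for a, v in zip(attn, is_visual) if v)
    return visual, sum(attn) - visual

def rebalance(attn, is_visual, target_visual_share):
    """Rescale one attention row so visual tokens hold the target share.

    Relative weights inside each modality are left unchanged."""
    visual, language = modality_mass(attn, is_visual)
    if visual == 0 or language == 0:
        return list(attn)
    v_scale = target_visual_share / visual
    l_scale = (1.0 - target_visual_share) / language
    return [a * (v_scale if v else l_scale) for a, v in zip(attn, is_visual)]

attn = [0.05, 0.05, 0.60, 0.30]          # one softmax-normalized row
is_visual = [True, True, False, False]   # first two keys are image tokens
fixed = rebalance(attn, is_visual, target_visual_share=0.4)
```

The row still sums to one after rescaling, so it can be used in place of the original attention weights at decoding time.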

[136] AD-Reasoning: Multimodal Guideline-Guided Reasoning for Alzheimer's Disease Diagnosis

Qiuhui Chen,Yushan Deng,Xuancheng Yao,Yi Hong

Main category: cs.CV

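TL;DR: AD-Reasoning is a multimodal framework that combines structural MRI with six clinical modalities and a rule-based verifier to produce Alzheimer's disease diagnoses consistent with the NIA-AA guidelines, offering both high accuracy and interpretability.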

Details

Motivation: Existing multimodal models for AD diagnosis lack interpretability and are poorly aligned with clinical guidelines, so they fall short of the transparency and compliance that clinical decision-making requires. Method: The AD-Reasoning framework uses modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning whose rewards cover output format, guideline evidence coverage, and reasoning-decision consistency. The authors also build AD-MultiSense, a multimodal QA dataset of 10,378 visits. Result: On AD-MultiSense, the method reaches state-of-the-art diagnostic accuracy and produces structured, guideline-consistent rationales, significantly improving transparency and interpretability. Conclusion: AD-Reasoning shows that tightly combining neuroimaging, multi-source clinical data, and verifiable rule-based reasoning is effective, and offers a new paradigm for guideline-driven AI-assisted AD diagnosis. Abstract: Alzheimer's disease (AD) diagnosis requires integrating neuroimaging with heterogeneous clinical evidence and reasoning under established criteria, yet most multimodal models remain opaque and weakly guideline-aligned. We present AD-Reasoning, a multimodal framework that couples structural MRI with six clinical modalities and a rule-based verifier to generate structured, NIA-AA-consistent diagnoses. AD-Reasoning combines modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning with verifiable rewards that enforce output format, guideline evidence coverage, and reasoning-decision consistency. We also release AD-MultiSense, a 10,378-visit multimodal QA dataset with guideline-validated rationales built from ADNI/AIBL. On AD-MultiSense, AD-Reasoning achieves state-of-the-art diagnostic accuracy and produces structured rationales that improve transparency over recent baselines.

[137] PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation

Yuheng Feng,Wen Zhang,Haodong Duan,Xingxing Zou

Main category: cs.CV

TL;DR: PosterIQ is a new benchmark for poster understanding and generation, focusing on design elements like composition, typography, and semantic intent; it reveals limitations in current MLLMs and diffusion models regarding visual hierarchy, typographic understanding, and intention-aware generation.

Details

Motivation: To bridge visual design cognition and generative modeling by creating a design-driven benchmark that evaluates how well AI models understand and generate posters with human-centred design principles. Method: Constructed PosterIQ, a benchmark with 7,765 image-annotation pairs and 822 generation prompts covering real, professional, and synthetic posters; defined five core tasks: layout parsing, text-image correspondence, typography/readability/font perception, design quality assessment, and controllable, composition-aware generation with metaphor; evaluated state-of-the-art MLLMs and diffusion-based generators. Result: Identified persistent gaps: models struggle with visual hierarchy, typographic semantics, saliency control, and intention communication; commercial MLLMs excel at high-level reasoning but lack sensitivity in design rating; generators render text well but fail at composition-aware synthesis. Conclusion: PosterIQ serves both as a quantitative benchmark and diagnostic tool for design reasoning, enabling reproducible, task-specific evaluation; it aims to advance creativity and human-centred design integration in vision-language generative systems. Abstract: We present PosterIQ, a design-driven benchmark for poster understanding and generation, annotated across composition structure, typographic hierarchy, and semantic intent. It includes 7,765 image-annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. To bridge visual design cognition and generative modeling, we define tasks for layout parsing, text-image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor. We evaluate state-of-the-art MLLMs and diffusion-based generators, finding persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication; commercial models lead on high-level reasoning but act as insensitive automatic raters, while generators render text well yet struggle with composition-aware synthesis. Extensive analyses show PosterIQ is both a quantitative benchmark and a diagnostic tool for design reasoning, offering reproducible, task-specific metrics. We aim to catalyze models' creativity and integrate human-centred design principles into generative vision-language systems.

[138] When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm

Ye Leng,Junjie Chu,Mingjie Li,Chenhao Lin,Chao Shen,Michael Backes,Yun Shen,Yang Zhang

Main category: cs.CV

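TL;DR: This paper systematically analyzes the new safety risks that multimodal large language models (MLLMs) pose relative to diffusion models in unsafe content generation and fake image synthesis, and finds that the stronger semantic understanding of MLLMs makes them more likely to produce harmful images that are hard to detect.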

Details

Motivation: The stronger semantic understanding of MLLMs may introduce new and larger safety risks, yet these risks are currently poorly understood. Method: Using diffusion models as the reference point, the paper compares the two model families along two dimensions, unsafe content generation and fake image synthesis, across multiple unsafe-generation benchmark datasets, and evaluates how well existing fake image detectors recognize MLLM-generated images. Result: MLLMs generate unsafe images more readily than diffusion models; their fake images are harder for existing detectors to identify, and even detectors retrained on MLLM-specific data can be bypassed with longer, more detailed input prompts. Conclusion: The emerging safety risks of MLLMs have not been sufficiently recognized and pose new challenges for real-world safety. Abstract: Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability may also introduce new and potentially greater safety risks. Taking diffusion models as a reference point, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis. Across multiple unsafe generation benchmark datasets, we observe that MLLMs tend to generate more unsafe images than diffusion models. This difference partly arises because diffusion models often fail to interpret abstract prompts, producing corrupted outputs, whereas MLLMs can comprehend these prompts and generate unsafe content. For current advanced fake image detectors, MLLM-generated images are also notably harder to identify. Even when detectors are retrained with MLLM-specific data, they can still be bypassed by simply providing MLLMs with longer and more descriptive inputs. Our measurements indicate that the emerging safety risks of the cutting-edge generative paradigm, MLLMs, have not been sufficiently recognized, posing new challenges to real-world safety.

[139] LGTM: Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation

Ryugo Morita,Stanislav Frolov,Brian Bernhard Moser,Ko Watanabe,Riku Takahashi,Andreas Dengel

Main category: cs.CV

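TL;DR: This paper proposes LGTM, a training-free light-guided text-to-image diffusion method that gives fine-grained control over lighting direction by manipulating the initial latent noise of the diffusion process, without fine-tuning or modifying the pre-trained model, and that integrates seamlessly with models such as ControlNet.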

Details

Motivation: Existing text-to-image methods offer limited control over lighting; most rely on an inefficient two-stage relighting pipeline and on fine-tuning with large datasets at high computational cost, which limits their generality and adaptability. Method: LGTM builds on a channel-wise analysis of the latent space and selectively manipulates specific channels of the initial noise, so that generation is guided jointly by the text prompt and a user-specified light direction, with no training at any stage. Result: Experiments show that the method surpasses prompt-based baselines in lighting consistency while preserving image quality and text alignment, and that it integrates seamlessly with frameworks such as ControlNet. Conclusion: LGTM offers a new approach to dynamic, user-controllable lighting generation that is efficient, general, and plug-and-play. Abstract: Diffusion models have demonstrated high-quality performance in conditional text-to-image generation, particularly with structural cues such as edges, layouts, and depth. However, lighting conditions have received limited attention and remain difficult to control within the generative process. Existing methods handle lighting through a two-stage pipeline that relights images after generation, which is inefficient. Moreover, they rely on fine-tuning with large datasets and heavy computation, limiting their adaptability to new models and tasks. To address this, we propose a novel Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation (LGTM), which manipulates the initial latent noise of the diffusion process to guide image generation with text prompts and user-specified light directions. Through a channel-wise analysis of the latent space, we find that selectively manipulating latent channels enables fine-grained lighting control without fine-tuning or modifying the pre-trained model. Extensive experiments show that our method surpasses prompt-based baselines in lighting consistency, while preserving image quality and text alignment. This approach introduces new possibilities for dynamic, user-guided light control. Furthermore, it integrates seamlessly with models like ControlNet, demonstrating adaptability across diverse scenarios.

[140] LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation

Haoyu Ji,Xueting Liu,Yu Gao,Wenze Huang,Zhihao Yang,Weihong Ren,Zhiyong Wang,Honghai Liu

Main category: cs.CV

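TL;DR: This paper proposes LaDy, a framework that brings Lagrangian dynamics into skeleton-based temporal action segmentation. It models generalized coordinates and forces, adds an energy consistency loss, and uses the dynamics to modulate spatio-temporal representations, significantly improving class discriminability and boundary localization.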

Details

Motivation: Existing skeleton-based action segmentation methods attend only to kinematic features and ignore the physical dynamics that govern human motion. As a result, they discriminate poorly between actions with similar kinematics but different dynamic intents, and they struggle to localize boundaries where the force profile changes. Method: The Lagrangian-Dynamic Informed Network (LaDy) 1) derives generalized coordinates from joint positions; 2) estimates the Lagrangian terms under physical constraints and synthesizes generalized forces; 3) adds an energy consistency loss that enforces the work-energy theorem; 4) builds a spatio-temporal modulation module that fuses generalized forces with spatial representations and uses dynamic signals for temporal gating to sharpen boundary awareness. Result: LaDy achieves state-of-the-art performance on several challenging datasets, supporting the value of physical dynamics modeling for action segmentation. Conclusion: Explicitly modeling physical dynamics priors, Lagrangian mechanics in particular, inside a deep learning framework improves both the semantic discriminability and the temporal boundary precision of skeleton-based action segmentation. Abstract: Skeleton-based Temporal Action Segmentation (STAS) aims to densely parse untrimmed skeletal sequences into frame-level action categories. However, existing methods, while proficient at capturing spatio-temporal kinematics, neglect the underlying physical dynamics that govern human motion. This oversight limits inter-class discriminability between actions with similar kinematics but distinct dynamic intents, and hinders precise boundary localization where dynamic force profiles shift. To address these limitations, we propose the Lagrangian-Dynamic Informed Network (LaDy), a framework integrating principles of Lagrangian dynamics into the segmentation process. Specifically, LaDy first computes generalized coordinates from joint positions and then estimates Lagrangian terms under physical constraints to explicitly synthesize the generalized forces. To further ensure physical coherence, our Energy Consistency Loss enforces the work-energy theorem, aligning kinetic energy change with the work done by the net force. The learned dynamics then drive a Spatio-Temporal Modulation module: spatially, generalized forces are fused with spatial representations to provide more discriminative semantics; temporally, salient dynamic signals are constructed for temporal gating, thereby significantly enhancing boundary awareness. Experiments on challenging datasets show that LaDy achieves state-of-the-art performance, validating the integration of physical dynamics for action segmentation. Code is available at https://github.com/HaoyuJi/LaDy.
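
Illustrative note (not from the paper): the work-energy theorem that the Energy Consistency Loss is said to enforce can be written out for a sequence of N frames. The symbols T (kinetic energy), q (generalized coordinates), and Q (net generalized force), the first-order approximation of the work integral, and the squared-residual form of the penalty are all assumptions chosen for illustration; the abstract does not give the paper's exact formulation.

```latex
% Work-energy theorem between consecutive frames t and t+1:
% the change in kinetic energy equals the work done by the net force.
\Delta T_t = T_{t+1} - T_t, \qquad
W_t = \int_{q_t}^{q_{t+1}} Q \cdot \mathrm{d}q \;\approx\; Q_t \cdot (q_{t+1} - q_t)

% One plausible consistency penalty: mean squared residual between the two sides.
\mathcal{L}_{\text{energy}} = \frac{1}{N-1} \sum_{t=1}^{N-1} \left( \Delta T_t - W_t \right)^2
```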

[141] Granular Ball Guided Stable Latent Domain Discovery for Domain-General Crowd Counting

Fan Chen,Shuyin Xia,Yi Wang,Xinbo Gao

Main category: cs.CV

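TL;DR: This paper proposes a granular-ball-guided framework for stable latent domain discovery in single-source domain-generalized crowd counting. Hierarchical representative-based clustering makes pseudo-domain assignment more stable, and a two-branch design disentangles semantic and style features, significantly improving cross-domain generalization.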

Details

Motivation: Single-source domain-generalized crowd counting is difficult because a single labeled source domain contains heterogeneous latent sub-domains and test data can be severely shifted. Existing flat clustering on sample-level features is sensitive to noise, outliers, and representation drift, which makes pseudo-domain assignments unreliable. Method: The granular-ball-guided framework first organizes samples into compact local granular balls and then clusters the ball centers to obtain pseudo-domains, replacing sample-level clustering with hierarchical representative-based clustering. On top of this, a two-branch learning framework strengthens transferable semantic representations through semantic codebook re-encoding in one branch and models domain-specific appearance variation in a style branch, reducing semantic-style entanglement. Result: Extensive experiments on ShanghaiTech A/B, UCF_QNRF, and NWPU-Crowd under a strict no-adaptation protocol show that the method consistently outperforms strong baselines, with larger gains under large domain gaps. Conclusion: Granular-ball-guided hierarchical clustering makes latent domain discovery more stable and semantically consistent, and the two-branch architecture disentangles semantics from style, improving generalization to unseen target domains. Abstract: Single-source domain generalization for crowd counting remains highly challenging because a single labeled source domain often contains heterogeneous latent domains, while test data may exhibit severe distribution shifts. A fundamental difficulty lies in stable latent domain discovery: directly performing flat clustering on evolving sample-level latent features is easily affected by feature noise, outliers, and representation drift, leading to unreliable pseudo-domain assignments and weakened domain-structured learning. To address this issue, we propose a granular ball guided stable latent domain discovery framework for domain-general crowd counting. Specifically, the proposed method first organizes samples into compact local granular balls and then clusters granular ball centers as representatives to obtain pseudo-domains, transforming direct sample-level clustering into a hierarchical representative-based clustering process. This design yields more stable and semantically consistent pseudo-domain assignments. Built upon the discovered latent domains, we further develop a two-branch learning framework that enhances transferable semantic representations via semantic codebook re-encoding while modeling domain-specific appearance variations through a style branch, thereby reducing semantic-style entanglement and improving generalization under domain shifts. Extensive experiments on ShanghaiTech A/B, UCF_QNRF, and NWPU-Crowd under a strict no-adaptation protocol demonstrate that the proposed method consistently outperforms strong baselines, especially under large domain gaps.
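
Illustrative sketch (not from the paper): the two-level idea of grouping samples into local balls and then clustering only the ball centers can be shown on toy 2-D features. The greedy radius-based ball construction and the plain k-means on centers are stand-ins chosen for brevity; granular-ball generation in the paper is likely more elaborate, and all function names are hypothetical.

```python
import math

def build_balls(points, radius):
    """Greedy stand-in for granular-ball generation: each ball collects the
    points that lie within `radius` of its seed point."""
    balls = []
    for idx, p in enumerate(points):
        for ball in balls:
            if math.dist(p, ball["seed"]) <= radius:
                ball["members"].append(idx)
                break
        else:  # no existing ball is close enough, so start a new one
            balls.append({"seed": p, "members": [idx]})
    for ball in balls:
        pts = [points[i] for i in ball["members"]]
        ball["center"] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return balls

def kmeans(centers, k, iters=20):
    """Plain k-means on ball centers, initialized from the first k centers."""
    means = list(centers[:k])
    labels = [0] * len(centers)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(c, means[j]))
                  for c in centers]
        for j in range(k):
            group = [c for c, l in zip(centers, labels) if l == j]
            if group:
                means[j] = tuple(sum(v) / len(group) for v in zip(*group))
    return labels

def pseudo_domains(points, radius, k):
    """Each sample inherits the pseudo-domain of the ball it belongs to."""
    balls = build_balls(points, radius)
    ball_labels = kmeans([b["center"] for b in balls], k)
    out = [None] * len(points)
    for ball, label in zip(balls, ball_labels):
        for i in ball["members"]:
            out[i] = label
    return out

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
labels = pseudo_domains(points, radius=0.5, k=2)
```

Because clustering sees only a few ball centers rather than every sample, a single noisy sample moves the result less than it would in flat clustering.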

[142] Retinal Layer Segmentation in OCT Images With 2.5D Cross-slice Feature Fusion Module for Glaucoma Assessment

Hyunwoo Kim,Heesuk Kim,Wungrak Choi,Jae-Sang Hyun

Main category: cs.CV

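TL;DR: This paper proposes a 2.5D retinal layer segmentation framework that adds a cross-slice feature fusion (CFF) module to a U-Net-like architecture, improving the consistency and robustness of layer boundary segmentation in OCT images while remaining computationally efficient, and clearly outperforming existing 2D methods and some 3D methods.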

Details

Motivation: Existing 2D segmentation methods lack context from adjacent B-scans, which causes slice-to-slice inconsistency; 3D methods model cross-slice context but are computationally expensive. A method is needed that balances accuracy, consistency, and efficiency. Method: The paper proposes a 2.5D segmentation framework whose core is a cross-slice feature fusion (CFF) module embedded in a U-Net-like architecture, fusing features from adjacent slices to strengthen contextual awareness. Result: On a clinical dataset and the public DUKE DME dataset, the method reduces mean absolute distance by 8.56% and root mean square error by 13.92% relative to baselines without CFF, with clearly better boundary consistency and noise robustness. Conclusion: The 2.5D framework balances computational efficiency and contextual modeling, supports anatomically reliable automatic retinal layer segmentation, and is suited to automated glaucoma assessment and clinical translation. Abstract: For accurate glaucoma diagnosis and monitoring, reliable retinal layer segmentation in OCT images is essential. However, existing 2D segmentation methods often suffer from slice-to-slice inconsistencies due to the lack of contextual information across adjacent B-scans. 3D segmentation methods are better at capturing slice-to-slice context, but they require expensive computational resources. To address these limitations, we propose a 2.5D segmentation framework that incorporates a novel cross-slice feature fusion (CFF) module into a U-Net-like architecture. The CFF module fuses inter-slice features to effectively capture contextual information, enabling consistent boundary detection across slices and improved robustness in noisy regions. The framework was validated on both a clinical dataset and the publicly available DUKE DME dataset. Compared to other segmentation methods without the CFF module, the proposed method achieved an 8.56% reduction in mean absolute distance and a 13.92% reduction in root mean square error, demonstrating improved segmentation accuracy and robustness. Overall, the proposed 2.5D framework balances contextual awareness and computational efficiency, enabling anatomically reliable retinal layer delineation for automated glaucoma evaluation and potential clinical applications.
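
Illustrative sketch (not from the paper): the two reported error measures, mean absolute distance and root mean square error, can be computed for one layer boundary as below. The code assumes a boundary is stored as one depth position per image column, and the toy numbers are made up; the paper's evaluation protocol (units, averaging across layers and scans) is not specified in the abstract.

```python
import math

def boundary_errors(pred, ref):
    """Mean absolute distance and RMSE between two boundary curves,
    given as per-column depth positions (same units, same length)."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("boundaries must be non-empty and equally long")
    diffs = [p - r for p, r in zip(pred, ref)]
    mad = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mad, rmse

def relative_reduction(baseline, proposed):
    """Fractional reduction; 0.0856 would correspond to a reported 8.56%."""
    return (baseline - proposed) / baseline

reference = [10.0, 11.0, 12.0, 12.0]   # toy annotated boundary
prediction = [10.0, 12.0, 12.0, 10.0]  # toy predicted boundary
mad, rmse = boundary_errors(prediction, reference)
```

RMSE penalizes the single large miss in the last column more heavily than mean absolute distance does, which is consistent with the two metrics improving by different percentages.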

[143] Combi-CAM: A Novel Multi-Layer Approach for Explainable Image Geolocalization

David Faget,José Luis Lisani,Miguel Colom

Main category: cs.CV

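TL;DR: This paper proposes Combi-CAM, which improves the explainability of image geolocalization models by combining gradient-weighted class activation maps from multiple CNN layers.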

Details

Motivation: Deep learning models, CNNs in particular, have advanced planet-scale photo geolocalization, but the reasoning behind their predictions is hard to understand, so better explainability is needed. Method: Combi-CAM combines gradient-weighted class activation maps (Grad-CAM) from several layers of the CNN instead of using only the deepest layer. Result: Compared with the conventional deepest-layer Grad-CAM, Combi-CAM gives a finer-grained and more complete picture of how visual features contribute to the prediction, making the model's decisions easier to explain. Conclusion: Visual explanations that fuse multi-layer features can make CNN geolocalization models more transparent and trustworthy, and suggest a new direction for explainable AI in geographic vision tasks. Abstract: Planet-scale photo geolocalization involves the intricate task of estimating the geographic location depicted in an image purely based on its visual features. While deep learning models, particularly convolutional neural networks (CNNs), have significantly advanced this field, understanding the reasoning behind their predictions remains challenging. In this paper, we present Combi-CAM, a novel method that enhances the explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps obtained from several layers of the network architecture, rather than using only information from the deepest layer as is typically done. This approach provides a more detailed understanding of how different image features contribute to the model's decisions, offering deeper insights than the traditional approaches.
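
Illustrative sketch (not from the paper): per-layer Grad-CAM maps can be brought to a common resolution and merged as below. The Grad-CAM step follows the standard recipe (channel weights from spatially averaged gradients, weighted sum, ReLU). Nearest-neighbour upsampling and plain averaging are assumptions made here; the abstract does not say how Combi-CAM actually fuses the layers.

```python
def grad_cam(activations, gradients):
    """Grad-CAM for one layer. Both inputs are [channel][row][col] lists."""
    rows, cols = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial average of that channel's gradient.
    weights = [sum(map(sum, g)) / (rows * cols) for g in gradients]
    cam = [[max(0.0, sum(w * a[r][c] for w, a in zip(weights, activations)))
            for c in range(cols)] for r in range(rows)]
    peak = max(map(max, cam))
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam

def upsample(cam, size):
    """Nearest-neighbour resize of a square map to size x size."""
    n = len(cam)
    return [[cam[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]

def combine(cams, size):
    """Average the per-layer maps after resizing them to a common size."""
    resized = [upsample(cam, size) for cam in cams]
    return [[sum(m[r][c] for m in resized) / len(resized) for c in range(size)]
            for r in range(size)]

deep = grad_cam([[[1.0]]], [[[0.5]]])   # coarse 1x1 map from a deep layer
shallow = grad_cam([[[0.0, 2.0], [0.0, 0.0]]], [[[1.0, 1.0], [1.0, 1.0]]])
combined = combine([deep, shallow], size=2)
```

In the toy result the deep layer contributes a uniform response while the shallow layer localizes the top-right cell, so the combined map keeps both the coarse and the fine evidence.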

[144] Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation

Haoyu Ji,Bowen Chen,Zhihao Yang,Wenze Huang,Yu Gao,Xueting Liu,Weihong Ren,Zhiyong Wang,Honghai Liu

Main category: cs.CV

TL;DR: 本文提出Spectral Scalpel，一种基于频域选择性滤波的骨架时序动作分割框架，通过抑制相邻动作共享频率、增强动作特有频率来提升类间区分度和边界清晰度，并引入频域感知通道混合器加强通道演化，在多个数据集上达到SOTA性能。

Details

Motivation: 现有骨架时序动作分割方法存在类间判别力不足和分割边界模糊的问题，主因是相邻动作间时空模式区分不充分。 Method: 提出Spectral Scalpel框架：1）自适应多尺度谱滤波器作为‘手术刀’编辑频谱；2）设计相邻动作差异损失作为优化目标；3）引入频域感知通道混合器聚合跨通道频谱以增强长时建模能力。 Result: 在五个公开数据集上实现SOTA性能。 Conclusion: 该工作开创了将频域分析融入骨架动作分割的新范式，超越传统时空建模方法，有效缓解边界定位模糊与类间混淆问题。 Abstract: Skeleton-based Temporal Action Segmentation (STAS) seeks to densely segment and classify diverse actions within long, untrimmed skeletal motion sequences. However, existing STAS methodologies face challenges of limited inter-class discriminability and blurred segmentation boundaries, primarily due to insufficient distinction of spatio-temporal patterns between adjacent actions. To address these limitations, we propose Spectral Scalpel, a frequency-selective filtering framework aimed at suppressing shared frequency components between adjacent distinct actions while amplifying their action-specific frequencies, thereby enhancing inter-action discrepancies and sharpening transition boundaries. Specifically, Spectral Scalpel employs adaptive multi-scale spectral filters as scalpels to edit frequency spectra, coupled with a discrepancy loss between adjacent actions serving as the surgical objective. This design amplifies representational disparities between neighboring actions, effectively mitigating boundary localization ambiguities and inter-class confusion. Furthermore, complementing long-term temporal modeling, we introduce a frequency-aware channel mixer to strengthen channel evolution by aggregating spectra across channels. This work presents a novel paradigm for STAS that extends conventional spatio-temporal modeling by incorporating frequency-domain analysis. Extensive experiments on five public datasets demonstrate that Spectral Scalpel achieves state-of-the-art performance. Code is available at https://github.com/HaoyuJi/SpecScalpel.

[145] Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Zhanhe Lei,Zhongyuan Wang,Jikang Cheng,Baojin Huang,Yuhong Yang,Zhen Han,Chao Liang,Dengpan Ye

Main category: cs.CV

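TL;DR: This paper proposes Tutor-Student Reinforcement Learning (TSRL), a framework that uses reinforcement learning to dynamically optimize the training curriculum of a deepfake detector and improve its generalization.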

Details

Motivation: Standard supervised training gives every sample the same importance, which is not ideal for learning robust, generalizable features. Method: Training is modeled as a Markov decision process. A PPO-based "Tutor" agent observes each sample's visual features and historical learning dynamics (such as EMA loss and forgetting counts) and assigns it a continuous weight in [0, 1], dynamically adjusting its contribution to the loss; the reward is based on the immediate change in the correctness of the Student's predictions. Result: TSRL improves the Student's generalization to unseen manipulation techniques compared with traditional training. Conclusion: A dynamic, adaptive curriculum can markedly improve the robustness and generalization of deepfake detectors, and TSRL offers a new approach to training optimization. Abstract: Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.
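
Illustrative sketch (not from the paper): the bookkeeping around the Tutor can be written down without the PPO policy itself. The class below tracks the per-sample signals named in the abstract (EMA loss and forgetting count) and emits a reward on incorrect-to-correct transitions. The class name, the EMA decay of 0.9, and the 0/1 reward values are assumptions; the Tutor weights are simply passed in rather than produced by a learned policy.

```python
class SampleHistory:
    """Per-sample learning dynamics of the kind the Tutor's state includes."""

    def __init__(self, ema_decay=0.9):
        self.ema_decay = ema_decay
        self.ema_loss = None
        self.forgetting_count = 0
        self.was_correct = False

    def update(self, loss, correct):
        """Record one evaluation and return the Tutor's reward for it."""
        if self.ema_loss is None:
            self.ema_loss = loss
        else:
            self.ema_loss = (self.ema_decay * self.ema_loss
                             + (1 - self.ema_decay) * loss)
        if self.was_correct and not correct:
            self.forgetting_count += 1   # correct -> incorrect transition
        reward = 1.0 if (correct and not self.was_correct) else 0.0
        self.was_correct = correct
        return reward

def weighted_batch_loss(losses, weights):
    """Re-weight per-sample losses with Tutor weights clipped to [0, 1]."""
    clipped = [min(1.0, max(0.0, w)) for w in weights]
    return sum(w * l for w, l in zip(clipped, losses)) / len(losses)

history = SampleHistory()
rewards = [history.update(loss, ok) for loss, ok in
           [(1.0, False), (0.5, True), (0.9, False), (0.2, True)]]
```

In this toy trace the sample is learned, forgotten once, and learned again, so the Tutor is rewarded twice and the forgetting count ends at one.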

[146] LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds

Jaehun Bang,Jinhyeok Kim,Minji Kim,Seungheon Jeong,Kyungdon Joo

Main category: cs.CV

TL;DR: LightSplat是一种无需训练、轻量高效的开放词汇3D场景理解方法，通过向3D高斯表示注入2字节语义索引并结合单步聚类，显著提升速度与内存效率。

Details

Motivation: 现有开放词汇3D场景理解方法存在速度慢、内存消耗大、结构复杂等问题，主要源于迭代优化和密集的每高斯特征分配。 Method: 提出LightSplat框架：在多视角图像重建的3D高斯表示中，仅对显著区域注入紧凑的2字节语义索引，并采用轻量级索引-特征映射；通过单步聚类实现几何与语义一致的3D掩码关联，避免特征优化和冗余存储。 Result: 在LERF-OVS、ScanNet和DL3DV-OVS数据集上达到SOTA性能，推理速度提升50–400倍，内存占用降低64倍。 Conclusion: LightSplat为语言驱动的3D理解提供了可扩展、高效且实用的新范式。 Abstract: Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory, enabling scalable language-driven 3D understanding. For more details, visit our project page https://vision3d-lab.github.io/lightsplat/.

[147] A convergent Plug-and-Play Majorization-Minimization algorithm for Poisson inverse problems

Thibaut Modrzyk,Ane Etxebeste,Élie Bretin,Voichita Maxim

Main category: cs.CV

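TL;DR: This paper proposes a new variational plug-and-play algorithm for Poisson inverse problems that combines a Kullback-Leibler data fidelity term with a regularizer based on a pre-trained neural network and guarantees convergence within the majorization-minimization framework. It reaches state-of-the-art performance under moderate noise and performs especially well under high noise, which suits applications such as nuclear medicine.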

Details

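Motivation: To address poor reconstruction quality under high noise in Poisson inverse problems (such as image reconstruction in nuclear medicine) and the lack of theoretical convergence guarantees in existing methods. Method: The paper proposes a variational plug-and-play algorithm that minimizes an objective consisting of a Kullback-Leibler data fidelity term and a regularizer based on a pre-trained neural network. It uses pre-trained Gaussian denoisers and obtains its convergence guarantee from the majorization-minimization framework. Result: In deconvolution and tomography, the method reaches state-of-the-art performance under moderate noise and clearly outperforms existing methods under high noise. Conclusion: The method combines theoretical convergence with practical performance and is particularly suited to nuclear medicine applications that demand robustness to noise. Abstract: In this paper, we present a novel variational plug-and-play algorithm for Poisson inverse problems. Our approach minimizes an explicit functional which is the sum of a Kullback-Leibler data fidelity term and a regularization term based on a pre-trained neural network. By combining classical likelihood maximization methods with recent advances in gradient-based denoisers, we allow the use of pre-trained Gaussian denoisers without sacrificing convergence guarantees. The algorithm is formulated in the majorization-minimization framework, which guarantees convergence to a stationary point. Numerical experiments confirm state-of-the-art performance in deconvolution and tomography under moderate noise, and demonstrate clear superiority in high-noise conditions, making this method particularly valuable for nuclear medicine applications.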
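
Illustrative sketch (not from the paper): the Kullback-Leibler data fidelity term and a classical majorization-minimization update for it can be shown on a toy problem. The update below is the well-known unregularized MLEM (Richardson-Lucy) iteration, used here as a stand-in to show what an MM step for Poisson data looks like; the paper's algorithm additionally includes a learned regularizer, which is not reproduced. The matrix `A` and counts `y` are made-up toy values.

```python
import math

def forward(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def kl_fidelity(y, Ax):
    """Generalized KL divergence KL(y || Ax), with 0 * log 0 taken as 0."""
    return sum((yi * math.log(yi / ai) if yi > 0 else 0.0) - yi + ai
               for yi, ai in zip(y, Ax))

def mlem_step(A, x, y):
    """One multiplicative MM (MLEM / Richardson-Lucy) update for Poisson data."""
    Ax = forward(A, x)
    ratio = [yi / ai for yi, ai in zip(y, Ax)]
    sens = [sum(row[j] for row in A) for j in range(len(x))]   # A^T 1
    back = [sum(row[j] * r for row, r in zip(A, ratio))        # A^T (y / Ax)
            for j in range(len(x))]
    return [xj * bj / sj for xj, bj, sj in zip(x, back, sens)]

A = [[1.0, 0.5], [0.5, 1.0], [0.2, 0.2]]   # toy non-negative forward operator
y = [4.0, 6.0, 1.0]                        # toy measured counts
x = [1.0, 1.0]                             # strictly positive starting point
history = []
for _ in range(20):
    history.append(kl_fidelity(y, forward(A, x)))
    x = mlem_step(A, x, y)
```

The recorded fidelity values never increase from one iteration to the next, which is the monotone-descent property that MM constructions are designed to provide, and the multiplicative form keeps the estimate positive.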

[148] CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

Akash Ghosh,Tajamul Ashraf,Rishu Kumar Singh,Numan Saeed,Sriparna Saha,Xiuying Chen,Salman Khan

Main category: cs.CV

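TL;DR: This paper introduces the CareFlow benchmark and the CarePilot multi-agent framework to tackle long-horizon software workflow automation in healthcare, substantially improving the performance of vision-language models on complex medical interface tasks.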

Details

Motivation: Existing work concentrates on short-horizon or general-purpose interface automation, while long-horizon automation in specialized domains such as healthcare remains largely unexplored. Method: The authors build CareFlow, a high-quality human-annotated benchmark, and propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor combines tool grounding with a dual-memory mechanism to predict the next semantic action, and the Critic evaluates actions and updates memory to enable iterative refinement. Result: CarePilot outperforms strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38%, respectively, on the CareFlow benchmark and an out-of-distribution dataset. Conclusion: Through multi-agent collaboration and memory augmentation, CarePilot substantially improves automation of long-horizon, highly specialized medical software workflows and offers a new approach for domain-specific agents. Abstract: Multimodal agentic pipelines are transforming human-computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent efforts have focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, EHR systems, and laboratory information systems. On this benchmark, existing vision-language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes the action or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to make more robust and reasoning-aware predictions during inference. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38%, respectively, on our benchmark and out-of-distribution dataset.

[149] Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection

Xu Zhang,Zhe Chen,Jing Zhang,Dacheng Tao

Main category: cs.CV

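TL;DR: This paper proposes HeROD, a framework that injects heuristic spatial and semantic reasoning priors to improve the label efficiency and convergence performance of referring object detection (ROD) models in data-scarce settings, outperforming strong baselines on multiple benchmarks.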

Details

Motivation: Existing referring object detection models mostly target data-rich settings, whereas practical applications such as robotics and augmented reality often face severe annotation scarcity; the authors ask whether explicit reasoning priors can improve learning efficiency when samples are few. Method: Introduces the De-ROD task as an evaluation protocol for low-data and few-shot ROD, and designs HeROD, a lightweight, model-agnostic framework that injects heuristic spatial and semantic reasoning priors into three stages of a DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. Result: On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong baselines under scarce annotations. Conclusion: Incorporating simple, interpretable reasoning priors is a practical and extensible path toward more data-efficient vision-language understanding. Abstract: Most referring object detection (ROD) models, especially the modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived from the referring phrase, into three stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision-language understanding.
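
As a rough illustration of the proposal-ranking stage described above, the following minimal sketch re-ranks detector proposals with a spatial prior derived from the referring phrase. It is not HeROD's implementation: the keyword rule, the function names `spatial_prior` and `rank_proposals`, and the `prior_weight` blending parameter are all hypothetical choices made for this example.

```python
def spatial_prior(phrase, box, image_width):
    """Hypothetical heuristic: map a spatial keyword to a prior in [0, 1].

    box is (x_min, y_min, x_max, y_max) in pixels.
    """
    x_center = (box[0] + box[2]) / 2.0 / image_width
    words = phrase.lower().split()
    if "left" in words:
        return 1.0 - x_center  # boxes nearer the left edge score higher
    if "right" in words:
        return x_center        # boxes nearer the right edge score higher
    return 0.5                 # no spatial cue: neutral prior


def rank_proposals(phrase, proposals, image_width, prior_weight=0.5):
    """Blend detector confidence with the prior and sort best-first.

    proposals is a list of (box, detector_score) pairs; prior_weight is an
    assumed blending factor, not a value taken from the paper.
    """
    scored = []
    for box, det_score in proposals:
        prior = spatial_prior(phrase, box, image_width)
        combined = (1.0 - prior_weight) * det_score + prior_weight * prior
        scored.append((combined, box))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [box for _, box in scored]


proposals = [((0, 0, 20, 20), 0.6), ((80, 0, 100, 20), 0.7)]
ranked = rank_proposals("the dog on the left", proposals, image_width=100)
```

With the phrase "the dog on the left", the left-hand box is ranked first even though its raw detector score is lower; without a spatial keyword the prior is neutral and the detector order is kept.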

[150] Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection

Adhemar de Senneville,Xavier Bou,Jérémy Anger,Rafael Grompone,Gabriele Facciolo

Main category: cs.CV

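TL;DR: This paper proposes Head Ensemble Classifiers (HEC), which exploit the discriminative features of an LVLM's internal attention heads to substantially improve zero-shot and few-shot image classification without any training, surpassing CLIP baselines.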

Details

Motivation: Large vision-language models (LVLMs) perform well on many zero-shot tasks yet clearly trail CLIP-based methods on image classification, even though the two often share a CLIP-pretrained vision encoder; the authors aim to expose the potential of LVLM internal representations and close this gap. Method: Proposes Head Ensemble Classifiers (HEC), which, inspired by Gaussian Discriminant Analysis, rank the most discriminative vision and text attention heads in the LVLM and combine them into a training-free classifier. Result: HEC achieves state-of-the-art performance in zero-shot and few-shot image classification across 12 datasets. Conclusion: LVLM internal representations, attention heads in particular, carry strong classification ability; mining and ensembling them (as HEC does) overcomes the limits of the model's raw outputs and matches or exceeds CLIP on image classification. Abstract: Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks like image captioning, visual question answering and OCR. However, these same models suffer from poor performance at image classification tasks, underperforming against CLIP-based methods. Notably, this gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP's architecture with independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and LVLMs' internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.
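
The rank-then-combine idea can be sketched as follows. This is a simplified illustration, not the paper's HEC: it assumes per-head feature vectors have already been extracted for a few labeled support samples, ranks heads with a Fisher-style separability ratio, and scores queries with an isotropic unit-variance Gaussian per class (a strong simplification of Gaussian Discriminant Analysis). All function names are hypothetical.

```python
def class_means(features, labels):
    """Mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def log_scores(x, means):
    """Class log-likelihoods (up to a constant) under unit-variance Gaussians."""
    return {y: -0.5 * sum((a - b) ** 2 for a, b in zip(x, m))
            for y, m in means.items()}


def fisher_ratio(features, labels):
    """Between-class scatter divided by within-class scatter (assumed ranking score)."""
    means = class_means(features, labels)
    dim = len(features[0])
    overall = [sum(x[i] for x in features) / len(features) for i in range(dim)]
    between = sum(sum((m[i] - overall[i]) ** 2 for i in range(dim))
                  for m in means.values())
    within = sum(sum((a - b) ** 2 for a, b in zip(x, means[y]))
                 for x, y in zip(features, labels))
    return between / (within + 1e-8)


def fit_head_ensemble(head_features, labels, top_k):
    """Keep the top_k most separable heads; no gradient training is involved."""
    ranked = sorted(head_features,
                    key=lambda h: fisher_ratio(head_features[h], labels),
                    reverse=True)
    return {h: class_means(head_features[h], labels) for h in ranked[:top_k]}


def predict(ensemble, query):
    """Sum per-head log scores and return the best class."""
    total = {}
    for head, means in ensemble.items():
        for y, score in log_scores(query[head], means).items():
            total[y] = total.get(y, 0.0) + score
    return max(total, key=total.get)


labels = ["cat", "cat", "dog", "dog"]
head_features = {
    "h0": [[0.0], [0.1], [1.0], [1.1]],  # separates the two classes well
    "h1": [[0.5], [0.4], [0.5], [0.6]],  # barely separates them
}
ensemble = fit_head_ensemble(head_features, labels, top_k=1)
prediction = predict(ensemble, {"h0": [0.9], "h1": [0.4]})
```

Only class means are estimated from the support set, which is what keeps the classifier training-free in this sketch.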

[151] Counting Without Numbers & Finding Without Words

Badri Narayana Patro

Main category: cs.CV

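TL;DR: This paper presents the first multimodal pet reunification system combining visual and acoustic biometrics, aiming to address the problem that 70% of lost pets are never reunited with their owners because current systems rely on appearance matching alone.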

Details

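Motivation: Every year 10 million pets enter shelters and 70% are never reunited with their owners, because existing systems rely only on appearance matching while animals recognize each other through sound; computer vision has long overlooked animals' acoustic communication. Method: Drawing on five decades of cognitive science research, builds a species-adaptive multimodal architecture that analyzes animal vocalizations across a wide band from 10Hz to 4kHz, combined with probabilistic visual matching that tolerates stress-induced appearance changes. Result: Delivers the first reunification system based on joint acoustic-visual animal identification, demonstrating that AI grounded in biological communication principles can serve vulnerable populations that lack human language. Conclusion: AI systems should move beyond purely visual paradigms and incorporate biologically realistic perception and communication mechanisms; when serving non-human beings in particular, the acoustic modality is indispensable. Abstract: Every year, 10 million pets enter shelters, separated from their families. Despite desperate searches by both guardians and lost animals, 70% never reunite, not because matches do not exist, but because current systems look only at appearance, while animals recognize each other through sound. We ask, why does computer vision treat vocalizing species as silent visual objects? Drawing on five decades of cognitive science showing that animals perceive quantity approximately and communicate identity acoustically, we present the first multimodal reunification system integrating visual and acoustic biometrics. Our species-adaptive architecture processes vocalizations from 10Hz elephant rumbles to 4kHz puppy whines, paired with probabilistic visual matching that tolerates stress-induced appearance changes. This work demonstrates that AI grounded in biological communication principles can serve vulnerable populations that lack human language.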

[152] RefReward-SR: LR-Conditioned Reward Modeling for Preference-Aligned Super-Resolution

Yushuai Song,Weize Quan,Weining Wang,Jiahui Sun,Jing Liu,Meng Li,Pengbin Yu,Zhentao Chen,Wei Shen,Lunxi Yuan,Dong-ming Yan

Main category: cs.CV

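TL;DR: This paper proposes RefReward-SR, a low-resolution (LR) reference-aware reward model that uses a multimodal large language model (MLLM) to assess the semantic consistency and plausibility of super-resolution (SR) reconstructions with respect to their LR inputs. It also builds RefSR-18K, the first large-scale LR-conditioned preference dataset, and applies GRPO optimization to better align SR results with human perceptual preferences.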

Details

Motivation: Existing super-resolution evaluation metrics (full-reference and no-reference) and optimization methods (such as GT-dependent distribution matching) are misaligned with human perception and fail to reflect semantic plausibility and visual naturalness. Method: Proposes the RefReward-SR reward model, which treats the LR image as a semantic anchor and uses an MLLM to reason about LR-HR semantic consistency; builds the RefSR-18K preference dataset; fine-tunes the MLLM with Group Relative Policy Optimization (GRPO); and uses RefReward-SR as the core reward signal for SR model training. Result: Experiments show that the framework achieves substantially better agreement with human judgments, preserving semantic consistency while improving perceptual plausibility and visual naturalness. Conclusion: Through LR-conditioned reward modeling and preference learning, RefReward-SR yields generative super-resolution that better matches human perception and offers a new paradigm for perception-driven image restoration. Abstract: Recent advances in generative super-resolution (SR) have greatly improved visual realism, yet existing evaluation and optimization frameworks remain misaligned with human perception. Full-Reference and No-Reference metrics often fail to reflect perceptual preference, either penalizing semantically plausible details due to pixel misalignment or favoring visually sharp but inconsistent artifacts. Moreover, most SR methods rely on ground-truth (GT)-dependent distribution matching, which does not necessarily correspond to human judgments. In this work, we propose RefReward-SR, a low-resolution (LR) reference-aware reward model for preference-aligned SR. Instead of relying on GT supervision or NR evaluation, RefReward-SR assesses high-resolution (HR) reconstructions conditioned on their LR inputs, treating the LR image as a semantic anchor. Leveraging the visual-linguistic priors of a Multimodal Large Language Model (MLLM), it evaluates semantic consistency and plausibility in a reasoning-aware manner. To support this paradigm, we construct RefSR-18K, the first large-scale LR-conditioned preference dataset for SR, providing pairwise rankings based on LR-HR consistency and HR naturalness. We fine-tune the MLLM with Group Relative Policy Optimization (GRPO) using LR-conditioned ranking rewards, and further integrate GRPO into SR model training with RefReward-SR as the core reward signal for preference-aligned generation. Extensive experiments show that our framework achieves substantially better alignment with human judgments, producing reconstructions that preserve semantic consistency while enhancing perceptual plausibility and visual naturalness. Code, models, and datasets will be released upon paper acceptance.

[153] Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement

Xin Zhang,Jianyang Xu,Hao Peng,Dongjing Wang,Jingyuan Zheng,Yu Li,Yuyu Yin,Hongbo Wang

Main category: cs.CV

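TL;DR: This paper proposes Text-guided Multi-view Knowledge Distillation (TMKD), which uses dual-modality teachers, a visual teacher and a text teacher (CLIP), to improve student performance. Multi-view inputs strengthen the visual teacher, the text teacher generates semantic weights that guide feature fusion, and a vision-language contrastive regularizer reinforces the student's semantic learning.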

Details

Motivation: Existing knowledge distillation methods focus mainly on distillation strategies and overlook the importance of improving the quality of the teacher's knowledge. Method: Proposes Text-guided Multi-view Knowledge Distillation (TMKD): 1) uses dual-modality teachers (a visual teacher plus a CLIP text teacher); 2) feeds the visual teacher multi-view inputs such as edge and high-frequency features to enrich its representations; 3) has the text teacher generate semantic weights through prior-aware prompts to guide adaptive feature fusion; 4) adds vision-language contrastive regularization to strengthen semantic knowledge in the student. Result: Experiments on five benchmarks show that TMKD improves distillation performance by up to 4.49% over baseline methods. Conclusion: Dual-teacher collaboration and multi-view enhancement effectively improve knowledge distillation, confirming that teacher knowledge quality plays a key role in distillation performance. Abstract: Knowledge distillation transfers knowledge from large teacher models to smaller students for efficient inference. While existing methods primarily focus on distillation strategies, they often overlook the importance of enhancing teacher knowledge quality. In this paper, we propose Text-guided Multi-view Knowledge Distillation (TMKD), which leverages dual-modality teachers, a visual teacher and a text teacher (CLIP), to provide richer supervisory signals. Specifically, we enhance the visual teacher with multi-view inputs incorporating visual priors (edge and high-frequency features), while the text teacher generates semantic weights through prior-aware prompts to guide adaptive feature fusion. Additionally, we introduce vision-language contrastive regularization to strengthen semantic knowledge in the student model. Extensive experiments on five benchmarks demonstrate that TMKD consistently improves knowledge distillation performance by up to 4.49%, validating the effectiveness of our dual-teacher multi-view enhancement strategy. Code is available at https://anonymous.4open.science/r/TMKD-main-44D1.

[154] HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer

Minjun Kim,Minje Kim

Main category: cs.CV

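TL;DR: This paper proposes HEART-PFL, a framework that improves personalized federated learning through two components, Hierarchical Directional Alignment (HDA) and Adversarial Knowledge Transfer (AKT), reaching state-of-the-art performance on several non-IID datasets.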

Details

Motivation: Existing personalized federated learning methods suffer from shallow prototype alignment and unstable server-side knowledge distillation, making it hard to balance client personalization and global model stability under heterogeneous data distributions. Method: Proposes the dual-sided framework HEART-PFL: (i) Hierarchical Directional Alignment (HDA), which aligns features with cosine similarity in early layers and MSE in deep layers; (ii) Adversarial Knowledge Transfer (AKT), which distills with symmetric KL divergence on clean and adversarial proxy data; the whole system uses lightweight adapters with only 1.46M parameters. Result: Under Dirichlet non-IID partitions of CIFAR-100, Flowers-102, and Caltech-101, personalized accuracy reaches 63.42%, 84.23%, and 95.67%, respectively, and the method remains robust to out-of-domain proxy data; ablations show that HDA and AKT provide complementary gains. Conclusion: HEART-PFL strengthens both personalization and the stability of global updates, making it a robust and scalable PFL solution. Abstract: Personalized Federated Learning (PFL) aims to deliver effective client-specific models under heterogeneous distributions, yet existing methods suffer from shallow prototype alignment and brittle server-side distillation. We propose HEART-PFL, a dual-sided framework that (i) performs depth-aware Hierarchical Directional Alignment (HDA) using cosine similarity in the early stage and MSE matching in the deep stage to preserve client specificity, and (ii) stabilizes global updates through Adversarial Knowledge Transfer (AKT) with symmetric KL distillation on clean and adversarial proxy data. Using lightweight adapters with only 1.46M trainable parameters, HEART-PFL achieves state-of-the-art personalized accuracy on CIFAR-100, Flowers-102, and Caltech-101 (63.42%, 84.23%, and 95.67%, respectively) under Dirichlet non-IID partitions, and remains robust to out-of-domain proxy data. Ablation studies further confirm that HDA and AKT provide complementary gains in alignment, robustness, and optimization stability, offering insights into how the two components mutually reinforce effective personalization. Overall, these results demonstrate that HEART-PFL simultaneously enhances personalization and global stability, highlighting its potential as a strong and scalable solution for PFL (code available at https://github.com/danny0628/HEART-PFL).
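
The two loss ingredients named above can be written down in a few lines. This is a minimal sketch under stated assumptions, not the released HEART-PFL code: the `split` index separating early from deep layers, the choice of reference features, and the 0.5 averaging in the symmetric KL are all assumptions made for illustration.

```python
import math


def cosine_alignment_loss(a, b):
    """1 - cosine similarity: penalizes direction mismatch, ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def mse_loss(a, b):
    """Mean squared error: penalizes both direction and magnitude mismatch."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hierarchical_alignment_loss(client_feats, reference_feats, split):
    """Cosine loss for layers before `split`, MSE for layers from `split` on.

    Both arguments are lists of per-layer feature vectors, shallowest first.
    """
    total = 0.0
    for depth, (c, r) in enumerate(zip(client_feats, reference_feats)):
        if depth < split:
            total += cosine_alignment_loss(c, r)
        else:
            total += mse_loss(c, r)
    return total


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p) (assumed normalization)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


early_scaled = hierarchical_alignment_loss(
    [[1.0, 0.0], [1.0, 2.0]], [[2.0, 0.0], [1.0, 2.0]], split=1)
```

In the example, the early-layer features differ only in scale, so the cosine term is essentially zero, whereas an MSE term on the same pair would not be. In a distillation setting, `symmetric_kl` would be applied to the softmax outputs of the two models on each proxy sample.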

[155] RVLM: Recursive Vision-Language Models with Adaptive Depth

Nicanor Mayumu,Zeenath Khan,Melodena Stephens,Patrick Mukala,Farhad Oroumchian

Main category: cs.CV

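TL;DR: This paper proposes the RVLM framework, which makes medical AI diagnosis auditable through a generate-execute loop, and introduces RRouter to adapt reasoning depth, improving computational efficiency and diagnostic accuracy.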

Details

Motivation: In medical AI systems, conventional vision-language models are not auditable, and iterative reasoning systems use fixed budgets that either waste compute or provide insufficient depth. Method: RVLM runs an iterative generate-execute loop in which each step writes Python code to invoke vision sub-agents, manipulate images, and accumulate evidence; RRouter uses a lightweight controller to predict the optimal number of iterations and terminate dynamically. Result: Evaluation on BraTS 2023 and MIMIC-CXR shows highly consistent diagnostic findings and the ability to detect cross-modal discrepancies and view-specific artefacts, all without fine-tuning. Conclusion: RVLM combines clinical auditability with adaptive reasoning and offers a new paradigm for medical AI governance. Abstract: Medical AI systems face two fundamental limitations. First, conventional vision-language models (VLMs) perform single-pass inference, yielding black-box predictions that cannot be audited or explained in clinical terms. Second, iterative reasoning systems that expose intermediate steps rely on fixed iteration budgets, wasting compute on simple cases while providing insufficient depth for complex ones. We address both limitations with a unified framework. RVLM replaces single-pass inference with an iterative generate-execute loop: at each step, the model writes Python code, invokes vision sub-agents, manipulates images, and accumulates evidence. Every diagnostic claim is grounded in executable code, satisfying auditability requirements of clinical AI governance frameworks. RRouter makes iteration depth adaptive: a lightweight controller predicts the optimal budget from task-complexity features, then monitors progress and terminates early when reasoning stalls. We evaluate on BraTS 2023 Meningioma (brain MRI) and MIMIC-CXR (chest X-ray) using Gemini 2.5 Flash without fine-tuning. Across repeated runs, RVLM shows high consistency on salient findings (e.g., mass presence and enhancement) and can detect cross-modal discrepancies between Fluid-Attenuated Inversion Recovery (FLAIR) signal characteristics and segmentation boundaries. On MIMIC-CXR, it generates structured reports and correctly recognises view-specific artefacts. Code: https://github.com/nican2018/rvlm.
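
The budget-plus-early-stop behaviour attributed to RRouter can be illustrated with a small control loop. This is a hedged sketch, not the authors' controller: `step_fn`, `patience`, and `min_gain` are hypothetical names, and the evidence score returned by each step is an assumed stand-in for whatever progress signal the real system monitors.

```python
def run_adaptive_loop(step_fn, predicted_budget, patience=2, min_gain=1e-3):
    """Run reasoning steps until the budget is spent or progress stalls.

    step_fn(i) performs iteration i and returns a scalar evidence score.
    The loop stops early after `patience` consecutive steps that fail to
    improve the best score by more than `min_gain`.
    """
    best = float("-inf")
    stalled = 0
    history = []
    for i in range(predicted_budget):
        score = step_fn(i)
        history.append(score)
        if score > best + min_gain:
            best = score
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break  # reasoning has stalled: stop before the budget runs out
    return history


# Toy progress signal: improves twice, then plateaus.
scores = [0.2, 0.5, 0.5, 0.5, 0.9]
history = run_adaptive_loop(lambda i: scores[i], predicted_budget=5)
```

Here the loop stops after four steps because two consecutive iterations bring no gain, so the fifth step is never executed. A larger `patience` trades extra compute for a chance to escape a temporary plateau.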

[156] InstanceRSR: Real-World Super-Resolution via Instance-Aware Representation Alignment

Zixin Guo,Kai Zhao,Luyan Zhang

Main category: cs.CV

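TL;DR: This paper proposes InstanceRSR, a framework that jointly models semantic information and aligns instance-level features to address the difficulty existing real-world super-resolution methods have in recovering fine-grained details of diverse objects in complex scenes.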

Details

Motivation: Existing real-world super-resolution (RSR) methods based on generative priors produce high-quality, globally consistent reconstructions but struggle to recover fine-grained details of diverse object instances in complex scenes, mainly because common denoising losses (e.g., MSE) favor global consistency and neglect instance-level perception and restoration. Method: Proposes the InstanceRSR framework: 1) uses the low-resolution image as global consistency guidance and jointly models image data and semantic segmentation maps to enforce semantic relevance during sampling; 2) designs an instance representation learning module that aligns the diffusion latent space with the instance latent space for instance-aware feature alignment; 3) adds a scale alignment mechanism to improve fine-grained perception and detail recovery. Result: Experiments on multiple real-world benchmarks show that InstanceRSR significantly outperforms existing methods in both quantitative metrics and visual quality, setting a new state of the art. Conclusion: Through semantic modeling and instance-level alignment, InstanceRSR balances photorealism with instance-level semantic consistency and improves fine-grained detail recovery in real-world super-resolution. Abstract: Existing real-world super-resolution (RSR) methods based on generative priors have achieved remarkable progress in producing high-quality and globally consistent reconstructions. However, they often struggle to recover fine-grained details of diverse object instances in complex real-world scenes. This limitation primarily arises because commonly adopted denoising losses (e.g., MSE) inherently favor global consistency while neglecting instance-level perception and restoration. To address this issue, we propose InstanceRSR, a novel RSR framework that jointly models semantic information and introduces instance-level feature alignment. Specifically, we employ low-resolution (LR) images as global consistency guidance while jointly modeling image data and semantic segmentation maps to enforce semantic relevance during sampling. Moreover, we design an instance representation learning module to align the diffusion latent space with the instance latent space, enabling instance-aware feature alignment, and further incorporate a scale alignment mechanism to enhance fine-grained perception and detail recovery. Benefiting from these designs, our approach not only generates photorealistic details but also preserves semantic consistency at the instance level. Extensive experiments on multiple real-world benchmarks demonstrate that InstanceRSR significantly outperforms existing methods in both quantitative metrics and visual quality, achieving new state-of-the-art (SOTA) performance.

[157] B-MoE: A Body-Part-Aware Mixture-of-Experts "All Parts Matter" Approach to Micro-Action Recognition

Nishit Poddar,Aglind Reka,Diana-Laura Borza,Snehashis Majhi,Michal Balazia,Abhijit Das,Francois Bremond

Main category: cs.CV

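TL;DR: This paper proposes B-MoE, a body-part-aware mixture-of-experts framework that combines region-specific encoding with global motion features to substantially improve micro-action recognition.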

Details

Motivation: Micro-actions (such as glances and nods) carry social meaning but are hard for existing models to recognize because of their small amplitude, short duration, and high inter-class similarity. Method: Proposes B-MoE, a multi-expert architecture built on the lightweight Macro-Micro Motion Encoder (M3E) in which each expert focuses on one body part; a cross-attention routing mechanism dynamically selects the key regions; and a dual-stream encoder fuses local semantic cues with global motion features. Result: Achieves state-of-the-art results on three benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction), with notable gains on ambiguous, underrepresented, and low-amplitude classes. Conclusion: B-MoE effectively models the structured nature of human motion and provides an interpretable, high-accuracy paradigm for micro-action recognition. Abstract: Micro-actions, fleeting and low-amplitude motions, such as glances, nods, or minor posture shifts, carry rich social meaning but remain difficult for current action recognition models to recognize due to their subtlety, short duration, and high inter-class ambiguity. In this paper, we introduce B-MoE, a Body-part-aware Mixture-of-Experts framework designed to explicitly model the structured nature of human motion. In B-MoE, each expert specializes in a distinct body region (head, body, upper limbs, lower limbs), and is based on the lightweight Macro-Micro Motion Encoder (M3E) that captures long-range contextual structure and fine-grained local motion. A cross-attention routing mechanism learns inter-region relationships and dynamically selects the most informative regions for each micro-action. B-MoE uses a dual-stream encoder that fuses these region-specific semantic cues with global motion features to jointly capture spatially localized cues and temporally subtle variations that characterize micro-actions. Experiments on three challenging benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction) show consistent state-of-the-art gains, with improvements in ambiguous, underrepresented, and low-amplitude classes.

[158] Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning

Tommaso Galliena,Stefano Rosa,Tommaso Apicella,Pietro Morerio,Alessio Del Bue,Lorenzo Natale

Main category: cs.CV

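TL;DR: This paper proposes a unified memory-augmented vision-language agent that handles data association, object captioning, and exploration policy within a single autoregressive framework, producing semantically consistent object representations across viewpoints.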

Details

Motivation: Existing vision-language models generate inconsistent descriptions of the same object across viewpoints, which prevents embodied agents from building semantic representations that stay consistent over time; prior methods are mostly offline or multi-stage and struggle to reason jointly over previously observed objects. Method: Designs a unified autoregressive framework that fuses the current RGB observation, a top-down explored map, and an object-level episodic memory serialized into tokens; trains it in a self-supervised manner in photorealistic 3D environments using a disagreement-based exploration policy and a pseudo-captioning model. Result: On a manually annotated object-level test set, standard captioning scores improve by up to +11.86% and caption self-similarity by up to +7.39%; the compact scene representation supports scalable performance. Conclusion: The memory-augmented VLM agent substantially improves cross-view semantic consistency and long-term object understanding, providing a more robust vision-language foundation for embodied intelligence. Abstract: Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, with limited capacity to reason over previously observed objects. In this paper, we introduce a unified, memory-augmented Vision-Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes the current RGB observation, a top-down explored map, and an object-level episodic memory serialized into object-level tokens, ensuring persistent object identity and semantic consistency across extended sequences. To train the model in a self-supervised manner, we collect a dataset in photorealistic 3D environments using a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories. Extensive evaluations on a manually annotated object-level test set demonstrate improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over baseline models, while enabling scalable performance through a compact scene representation. Code, model weights, and data are available at https://github.com/hsp-iit/epos-vlm

[159] Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep

Tianyi Liu,Ye Lu,Linfeng Zhang,Chen Cai,Jianjun Gao,Yi Wang,Kim-Hui Yap,Lap-Pui Chau

Main category: cs.CV

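TL;DR: This paper proposes HetCache, a training-free diffusion acceleration framework that distinguishes context tokens from generative tokens and selectively caches highly relevant context tokens, substantially reducing redundant attention computation in DiT models and achieving a 2.67× latency speedup and FLOPs reduction while preserving editing quality.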

Details

Motivation: Existing video diffusion acceleration methods only reuse features at the level of denoising timesteps and overlook the structural redundancy of attention operations among spatio-temporal tokens inside the DiT architecture, leading to high computational cost and difficult deployment. Method: Guided by spatial priors, HetCache divides the spatio-temporal tokens in the DiT into context tokens and generative tokens, dynamically assesses the contextual relevance and interaction strength of each token type at designated computing steps, and selectively caches the most representative and most strongly correlated context tokens so that redundant attention computation can be skipped. Result: Achieves a 2.67× latency speedup and FLOPs reduction on commonly used foundation models, with negligible degradation in editing quality. Conclusion: HetCache effectively exploits token-level heterogeneity in MV2V tasks and offers an efficient, plug-and-play acceleration paradigm for diffusion-based video editing. Abstract: Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, Diffusion Transformers (DiT) remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in the denoising process but overlooks the architectural redundancy within the DiT, where many attention operations over spatio-temporal tokens are redundantly executed, offering little to no incremental contribution to the model output. This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion-based masked video-to-video (MV2V) generation and editing. Instead of uniformly reusing or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatial-temporal tokens in the DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves a noticeable acceleration, including a 2.67× latency speedup and FLOPs reduction over commonly used foundation models, with negligible degradation in editing quality.
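
To make the selection step concrete, the sketch below picks the context tokens most correlated with the generative tokens. It covers only this one step and rests on assumptions: correlation is approximated by mean cosine similarity, the cache size `k` is a free parameter, and the function names are hypothetical. How HetCache actually measures relevance and what it does with the selected tokens is not specified in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def select_context_tokens(context, generative, k):
    """Return the indices of the k context tokens most similar, on average,
    to the generative tokens (assumed relevance measure)."""
    scored = []
    for idx, token in enumerate(context):
        mean_sim = sum(cosine(token, g) for g in generative) / len(generative)
        scored.append((mean_sim, idx))
    scored.sort(key=lambda item: item[0], reverse=True)
    return sorted(idx for _, idx in scored[:k])


context_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
generative_tokens = [[1.0, 0.0], [1.0, 0.1]]
selected = select_context_tokens(context_tokens, generative_tokens, k=2)
```

In this toy case the second context token is nearly orthogonal to both generative tokens, so it is the one left out of the selection.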

[160] ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

Haodong Yu,Yabo Zhang,Donglin Di,Ruyi Zhang,Wangmeng Zuo

Main category: cs.CV

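TL;DR: ScrollScape reformulates extreme-aspect-ratio (EAR) image generation as a continuous video generation task, using the temporal consistency of video models as a global constraint and introducing two innovations, ScanPE and ScrollSR, to achieve structurally coherent, high-fidelity generation at 32K resolution.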

Details

Motivation: When diffusion models generate ultra-high-resolution images at extreme aspect ratios (EAR), they are prone to structural failures such as object repetition and spatial fragmentation, mainly because they lack robust spatial priors. Method: Proposes the ScrollScape framework: 1) maps the spatial expansion of a large canvas to the temporal evolution of video frames; 2) designs Scanning Positional Encoding (ScanPE) to distribute global coordinates across frames, emulating a moving camera; 3) introduces Scrolling Super-Resolution (ScrollSR), which uses video super-resolution priors to ease memory bottlenecks; 4) fine-tunes pre-trained video priors on a 3K multi-ratio image dataset. Result: Significantly outperforms existing image-diffusion baselines at extreme resolutions (up to 32K) and across diverse EAR scenarios, eliminating severe localized artifacts while ensuring global structural coherence and visual fidelity. Conclusion: By transferring the problem to a video generation paradigm, ScrollScape overcomes the structural bottleneck of image diffusion models in EAR image synthesis and opens a new path for ultra-high-resolution content generation. Abstract: While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.

[161] TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification

Guan Luo,Xiu Li,Rui Chen,Xuanyu Yi,Jing Lin,Chia-Hao Chen,Jiahang Liu,Song-Hai Zhang,Jianfeng Zhang

Main category: cs.CV

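TL;DR: This paper proposes TopoMesh, a sparse voxel-based VAE that unifies the topology of ground-truth and predicted meshes under a Dual Marching Cubes (DMC) framework, establishing explicit vertex- and face-level correspondences and introducing direct mesh-level supervision, which substantially improves 3D reconstruction fidelity, particularly in preserving sharp features.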

Details

Motivation: Existing VAEs are limited by a representation mismatch between ground-truth meshes (arbitrary topology) and network predictions (fixed-structure implicit fields), which prevents explicit mesh correspondences and forces reliance on indirect supervision (such as SDF or rendering losses), so fine geometric details, sharp features in particular, are poorly preserved. Method: Proposes TopoMesh: 1) a unified topological framework based on DMC; 2) a remeshing algorithm that preserves sharp edges using an L∞ distance to convert arbitrary input meshes into DMC-compliant representations; 3) a decoder that directly outputs meshes in the same DMC format; 4) explicit mesh-level supervision on vertex positions, face orientations, and topology; 5) a sparse VAE architecture trained with teacher forcing and progressive resolution. Result: TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, especially in preserving sharp features and geometric details. Conclusion: By unifying the topological representation of ground-truth and predicted meshes and introducing explicit mesh-level supervision, TopoMesh resolves the detail loss caused by representation mismatch in conventional VAEs and offers a new paradigm for high-fidelity 3D generation. Abstract: The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE's reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the representation mismatch between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (e.g., SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topological framework. Specifically, we convert arbitrary input meshes into DMC-compliant representations via a remeshing algorithm that preserves sharp edges using an L∞ distance metric. Our decoder outputs meshes in the same DMC format, ensuring that both predicted and target meshes share identical topological structures. This establishes explicit correspondences at the vertex and face level, allowing us to derive explicit mesh-level supervision signals for topology, vertex positions, and face orientations with clear gradients. Our sparse VAE architecture employs this unified framework and is trained with Teacher Forcing and progressive resolution training for stable and efficient convergence. Extensive experiments demonstrate that TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, achieving superior preservation of sharp features and geometric details.

[162] VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection

Jumin Lee,Siyeong Lee,Namil Kim,Sung-Eui Yoon

Main category: cs.CV

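TL;DR: This paper proposes VERIA, a framework that uses off-the-shelf foundation models to generate synchronized RGB-LiDAR instances and filters them through semantic and geometric verification to improve 3D object detection of rare classes under long-tailed distributions.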

Details

Motivation: Long-tailed distributions in driving datasets leave rare classes with sparse samples and large intra-class variation, and existing instance augmentation methods based on copy-paste or asset libraries are limited in fine-grained diversity and scene-context placement. Method: Proposes VERIA, an image-first multimodal augmentation framework that synthesizes synchronized RGB-LiDAR instances with foundation models and curates them through sequential semantic and geometric verification; a stage-wise yield decomposition is introduced to diagnose pipeline reliability. Result: On nuScenes and Lyft, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings. Conclusion: Through verification-driven synthesis and curation, VERIA mitigates the 3D perception challenges posed by long-tailed distributions and offers a new paradigm for rare-class augmentation. Abstract: Long-tail distributions in driving datasets pose a fundamental challenge for 3D perception, as rare classes exhibit substantial intra-class diversity yet available samples cover this variation space only sparsely. Existing instance augmentation methods based on copy-paste or asset libraries improve rare-class exposure but are often limited in fine-grained diversity and scene-context placement. We propose VERIA, an image-first multimodal augmentation framework that synthesizes synchronized RGB-LiDAR instances using off-the-shelf foundation models and curates them with sequential semantic and geometric verification. This verification-centric design tends to select instances that better match real LiDAR statistics while spanning a wider range of intra-class variation. Stage-wise yield decomposition provides a log-based diagnostic of pipeline reliability. On nuScenes and Lyft, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings. Our code is available at https://sgvr.kaist.ac.kr/VERIA/.

[163] RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation

Kai Zhu,Zhenyu Cui,Zehua Zang,Jiahuan Zhou

Main category: cs.CV

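TL;DR: This paper proposes RS-SSM, which uses a Channel-wise Amplitude Perceptron (CwAP) and a Forgetting Gate Information Refiner (FGIR) to recover pixel-level spatiotemporal details lost during state space compression, improving video semantic segmentation.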

Details

Motivation: In video semantic segmentation, existing state space models forget specific spatiotemporal information because of their fixed-size state space, making it hard to meet the requirements of pixel-level spatiotemporal modeling and temporal consistency. Method: Proposes the RS-SSM framework: a Channel-wise Amplitude Perceptron (CwAP) extracts and aligns the distribution of specific information in the state space, and a Forgetting Gate Information Refiner (FGIR) adaptively inverts and refines the forgetting gate matrix to complementarily recover forgotten details. Result: Achieves state-of-the-art performance on four VSS benchmarks while maintaining high computational efficiency. Conclusion: RS-SSM effectively mitigates the detail forgetting caused by state space compression and markedly strengthens pixel-level spatiotemporal semantic modeling. Abstract: Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation (VSS) requires pixel-level spatiotemporal modeling capabilities to maintain temporal consistency in segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models' capability for pixel-level segmentation. To tackle the above issue, we propose a Refining Specifics State Space Model approach (RS-SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. Besides, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific information distribution. Consequently, our RS-SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model's capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that our RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency. The code is available at https://github.com/zhoujiahuan1991/CVPR2026-RS-SSM.

[164] AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication

Jie Song,Jun Jia,Wei Sun,Wangqiu Zhou,Tao Tan,Guangtao Zhai

Main category: cs.CV

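TL;DR: This paper proposes AMIF, the first authorizable medical image fusion model with built-in authentication, which protects intellectual property by embedding visible copyright identifiers into fusion results so that only authorized users obtain high-quality outputs.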

Details

Motivation: Existing medical image fusion models lack intellectual property protection, so inference leakage can expose model knowledge and sensitive training data to malicious reverse engineering. Method: Proposes AMIF (Authorizable Medical Image Fusion), which integrates authorization access control into the image fusion objective: unauthorized use yields fusion results with explicit, visible copyright identifiers, while authorized users obtain high-quality fusion results after key-based authentication. Result: AMIF is the first medical image fusion framework with built-in authentication and copyright identifiers, combining intellectual property protection with fusion performance. Conclusion: AMIF offers a new paradigm for protecting the intellectual property of professional radiomics models and advances the compliance and security of medical AI in clinical deployment. Abstract: Multimodal image fusion enables precise lesion localization and characterization for accurate diagnosis, thereby strengthening clinical decision-making and driving its growing prominence in medical imaging research. A powerful multimodal image fusion model relies on high-quality, clinically representative multimodal training data and a rigorously engineered model architecture. Therefore, the development of such professional radiomics models represents a collaborative achievement grounded in standardized acquisition, clinical-specific expertise, and algorithmic design proficiency, which necessitates protection of associated intellectual property rights. However, current multimodal image fusion models generate fused outputs without built-in mechanisms to safeguard intellectual property rights, inadvertently exposing proprietary model knowledge and sensitive training data through inference leakage. For example, malicious users can exploit fusion outputs and model distillation or other inference-based reverse engineering techniques to approximate the fusion performance of proprietary models. To address this issue, we propose AMIF, the first Authorizable Medical Image Fusion model with built-in authentication, which integrates authorization access control into the image fusion objective. For unauthorized usage, AMIF embeds explicit and visible copyright identifiers into fusion results. In contrast, high-quality fusion results are accessible upon successful key-based authentication.

[165] Refining time-space traffic diagrams: A neighborhood-adaptive linear regression method

Zhihong Yao,Yi Yu,Yunxia Wu,Hao Li,Yangsheng Jiang,Zhengbing He

Main category: cs.CV

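TL;DR: This paper proposes a refinement method for time-space traffic diagrams based on neighborhood-adaptive linear regression. By introducing neighborhood embedding, it exploits local pattern similarity to raise the resolution of low-resolution traffic data, outperforms benchmark methods on multiple metrics, and shows good generalization and robustness.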

Details

Motivation: Limited by monitoring precision and sampling frequency, existing time-space traffic diagrams commonly suffer from low resolution, which undermines traffic theory research and engineering applications. Method: A refinement method based on neighborhood-adaptive linear regression is proposed. It introduces neighborhood embedding, adaptively identifies neighborhoods similar to each target cell, and fits the low-to-high resolution mapping within them, avoiding the over-smoothing of a global linear model. Result: In multi-scale validation with multiple upscaling factors on two real datasets, the method improves MAE, MAPE, CMJS, SSIM, and GMSD by 9.16%, 8.16%, 1.86%, 3.89%, and 5.83%, respectively, over benchmark methods; cross-day and cross-scenario validation shows strong generalization and robustness. Conclusion: The method needs only a small amount of paired high- and low-resolution training data and has a concise formulation, providing a feasible basis for low-cost, fine-grained reconstruction of low-sampling-rate traffic data. Abstract: The time-space (TS) traffic diagram serves as a crucial tool for characterizing the dynamic evolution of traffic flow, with its resolution directly influencing the effectiveness of traffic theory research and engineering applications. However, constrained by monitoring precision and sampling frequency, existing TS traffic diagrams commonly suffer from low resolution. To address this issue, this paper proposes a refinement method for TS traffic diagrams based on neighborhood-adaptive linear regression. Introducing the concept of neighborhood embedding into TS diagram refinement, the method leverages local pattern similarity in TS diagrams, adaptively identifies neighborhoods similar to target cells, and fits the low-to-high resolution mapping within these neighborhoods for refinement. It avoids the over-smoothing tendency of the traditional global linear model, allows the capture of unique traffic wave propagation and congestion evolution characteristics, and outperforms the traditional neighborhood embedding method in terms of local information utilization to achieve target cell refinement. Validation on two real datasets across multiple scales and upscaling factors shows that, compared to benchmark methods, the proposed method achieves improvements of 9.16%, 8.16%, 1.86%, 3.89%, and 5.83% in metrics including MAE, MAPE, CMJS, SSIM, and GMSD, respectively. Furthermore, the proposed method exhibits strong generalization and robustness in cross-day and cross-scenario validations. In summary, requiring only a minimal amount of paired high- and low-resolution training data, the proposed method features a concise formulation, providing a foundation for the low-cost, fine-grained refinement of low-sampling-rate traffic data.
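
The idea of fitting the low-to-high resolution mapping only inside a neighborhood of similar cells can be illustrated with a deliberately simplified sketch. It is not the paper's formulation: each cell is reduced to a single scalar value, similarity is plain absolute difference, and the names `refine_cell`, `pairs`, and `k` are hypothetical. It only shows why a local fit can follow different traffic regimes where one global line would smooth them together.

```python
def refine_cell(x, pairs, k=3):
    """Estimate a high-resolution value for low-resolution value x.

    pairs: list of (low_res_value, high_res_value) training pairs.
    Assumption: a cell is one scalar; the real method works on TS-diagram cells.
    """
    # Step 1: pick the k training pairs whose low-res value is closest to x.
    nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
    n = len(nearest)
    mean_l = sum(l for l, _ in nearest) / n
    mean_h = sum(h for _, h in nearest) / n
    var_l = sum((l - mean_l) ** 2 for l, _ in nearest)
    if var_l == 0.0:
        # Degenerate neighborhood: fall back to the local mean.
        return mean_h
    # Step 2: ordinary least-squares line fitted on the neighborhood only.
    slope = sum((l - mean_l) * (h - mean_h) for l, h in nearest) / var_l
    # Step 3: apply the local line to the target cell.
    return mean_h + slope * (x - mean_l)


# Toy data with two regimes: high = 2 * low, and high = low + 10.
pairs = [(0, 0), (1, 2), (2, 4), (3, 6),
         (10, 20), (11, 21), (12, 22), (13, 23)]
free_flow_estimate = refine_cell(1.5, pairs)
congested_estimate = refine_cell(11.5, pairs)
```

On this toy data the two estimates follow the slope of their own regime, which a single line fitted to all eight pairs could not do.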

[166] Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions

Shiqin Wang,Haoyang Chen,Huaizhou Huang,Yinkan He,Dongfang Sun,Xiaoqing Chen,Xingyu Liu,Zheng Wang,Kaiyan Zhao

Main category: cs.CV

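TL;DR: This paper proposes a reinforcement-learning-based autonomous class scheduler that addresses how the learning order of semantic classes affects unsupervised domain adaptive semantic segmentation under adverse weather. A high-dimensional state encoder captures the model's training dynamics, and a category-fair policy-gradient objective enables dynamic, adaptive curriculum learning, significantly improving performance on several benchmarks.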

Details

Motivation: Existing unsupervised domain adaptation methods perform poorly under adverse weather, mainly because static, handcrafted curriculum learning strategies cannot adapt to the model's high-dimensional, dynamic training process, which leads to category bias. Method: Curriculum learning is cast as a sequential decision problem. The authors design an autonomous class scheduler comprising a high-dimensional state encoder and a category-fair policy-gradient objective, and combine it with mixed source-target supervision for dynamic class ranking and learning. Result: The method achieves state-of-the-art performance on three widely used benchmarks (ACDC, Dark Zurich, and Nighttime Driving) and generalizes well to synthetic-to-real semantic segmentation. Conclusion: A dynamic, adaptive curriculum learning mechanism effectively mitigates category bias and improves robustness and generalization under complex domain shifts such as adverse weather. Abstract: The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model's evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model's training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source-target supervision, the learned class rankings direct the network's focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. Our method achieves state-of-the-art performance on three widely used benchmarks (ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation.

[167] Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

Cheng Cui,Ting Sun,Suyin Liang,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Xueqing Wang,Changda Zhou,Hongen Liu,Manhui Lin,Yue Zhang,Yubo Zhang,Jing Zhang,Jun Zhang,Xing Wei,Yi Liu,Dianhai Yu,Yanjun Ma

Main category: cs.CV

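TL;DR: This paper proposes PaddleOCR-VL, a coarse-to-fine document parsing architecture. Its Valid Region Focus Module (VRFM) identifies and focuses on semantically relevant visual regions while suppressing redundant background, improving performance and efficiency with fewer vision tokens and parameters.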

Details

Motivation: High-resolution document images improve vision-language model performance but cause a quadratic increase in the number of vision tokens and high computational cost, largely because document images contain many redundant visual regions such as background. Method: The coarse-to-fine PaddleOCR-VL architecture first uses a lightweight Valid Region Focus Module (VRFM) to locate valid vision tokens; VRFM outputs then guide a compact 0.9B-parameter vision-language model (PaddleOCR-VL-0.9B) that performs detailed recognition without processing the entire image. Result: The system reaches state-of-the-art results on page-level parsing and element-level recognition, significantly outperforms existing solutions, and is competitive with top-tier VLMs while offering faster inference with fewer vision tokens and parameters. Conclusion: For document parsing, a targeted coarse-to-fine strategy can deliver both high accuracy and high efficiency, confirming the value of focusing on semantically relevant regions. Abstract: Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial redundancy among visual regions in document images, such as background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.

Ciem Cornelissen,Sam Leroux,Pieter Simoens

Main category: cs.CV

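TL;DR: This paper proposes Le MuMo JEPA, a multimodal representation framework for joint self-supervised learning from RGB images and aligned companion modalities (such as LiDAR depth and thermal imaging), which shows a strong performance-efficiency trade-off on the Waymo, nuScenes, and FLIR datasets.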

Details

Motivation: Most existing self-supervised learning methods are limited to a single modality and ignore the complementary structure available from heterogeneous sensors, so a unified representation learning framework that fuses multimodal signals effectively is needed. Method: LeJEPA is extended to the multimodal setting by introducing fusion tokens that act as a latent bottleneck between modality-specific patch stems. A pruned fusion strategy is used: after cross-modal attention, modality-specific tokens are dropped so that information is compressed into the shared fusion-token grid, and Sketched Isotropic Gaussian Regularization (SIGReg) is applied to the joint CLS embedding. Result: On Waymo, Le MuMo JEPA gives the best performance-efficiency trade-off on downstream patch probes, improving CenterNet detection and dense depth estimation while remaining competitive on segmentation. It is also the strongest model on nuScenes and FLIR, with the best FLIR results after Waymo-initialized fine-tuning, and it offers the best overall accuracy-efficiency balance at substantially lower compute, memory, and training time. Conclusion: Le MuMo JEPA demonstrates the effectiveness of an efficient, lightweight multimodal self-supervised fusion mechanism and offers a practical, scalable representation learning approach for multi-sensor scenarios such as autonomous driving. Abstract: Self-supervised learning has emerged as a powerful paradigm for learning visual representations without manual annotations, yet most methods still operate on a single modality and therefore miss the complementary structure available from heterogeneous sensors. We present Le MuMo JEPA, a self-supervised framework that learns unified representations from RGB images and aligned companion modalities. In our driving experiments, the second modality is camera-aligned LiDAR depth; we also evaluate RGB-thermal training and transfer on the Teledyne FLIR ADAS benchmark. Our approach extends LeJEPA to the multi-modal setting by learning fusion tokens that act as a latent bottleneck between modality-specific patch stems inside a shared transformer. Our default model employs a pruned fusion strategy: after an initial cross-modal attention layer, modality-specific tokens are dropped, forcing cross-modal information into the shared fusion-token grid as an efficient latent bottleneck before Sketched Isotropic Gaussian Regularization (SIGReg) is applied to the joint multimodal CLS embedding. On Waymo, Le MuMo JEPA gives the strongest performance-efficiency trade-off on downstream patch probes among the from-scratch multimodal baselines, improving CenterNet detection and dense depth while remaining competitive on segmentation. Under from-scratch training on nuScenes, Le MuMo JEPA remains the strongest model, and it also gives the best FLIR results, especially after Waymo-initialized fine-tuning. It also retains the best overall accuracy-efficiency balance in our study at substantially lower compute, memory, and estimated training time.

[169] Language-Guided Structure-Aware Network for Camouflaged Object Detection

Min Zhang

Main category: cs.CV

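TL;DR: This paper proposes a Language-Guided Structure-Aware Network (LGSAN) that combines CLIP text guidance, a Fourier Edge Enhancement Module (FEEM), a Structure-Aware Attention Module (SAAM), and a Coarse-Guided Local Refinement Module (CGLRM) to significantly improve camouflaged object detection (COD).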

Details

Motivation: Existing camouflaged object detection methods lack guidance from textual semantic priors and struggle to focus on camouflaged regions in complex scenes. Method: On top of the PVT-v2 visual backbone, CLIP is used to generate text-guided masks; FEEM fuses high-frequency information in the frequency domain; SAAM strengthens the perception of structures and boundaries; and CGLRM improves detail reconstruction and boundary integrity. Result: The method achieves highly competitive performance on multiple COD datasets, validating its effectiveness and robustness. Conclusion: Jointly modeling language guidance and multi-scale structure awareness improves camouflaged object detection accuracy, with notably better target localization and boundary delineation in complex backgrounds. Abstract: Camouflaged Object Detection (COD) aims to segment objects that are highly integrated with the background in terms of color, texture, and structure, making it a highly challenging task in computer vision. Although existing methods introduce multi-scale fusion and attention mechanisms to alleviate the above issues, they generally lack the guidance of textual semantic priors, which limits the model's ability to focus on camouflaged regions in complex scenes. To address this issue, this paper proposes a Language-Guided Structure-Aware Network (LGSAN). Specifically, based on the visual backbone PVT-v2, we introduce CLIP to generate masks from text prompts and RGB images, thereby guiding the multi-scale features extracted by PVT-v2 to focus on potential target regions. On this foundation, we further design a Fourier Edge Enhancement Module (FEEM), which integrates multi-scale features with high-frequency information in the frequency domain to extract edge enhancement features. Furthermore, we propose a Structure-Aware Attention Module (SAAM) to effectively enhance the model's perception of object structures and boundaries. Finally, we introduce a Coarse-Guided Local Refinement Module (CGLRM) to enhance fine-grained reconstruction and boundary integrity of camouflaged object regions. Extensive experiments demonstrate that our method consistently achieves highly competitive performance across multiple COD datasets, validating its effectiveness and robustness.

[170] PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

Cheng Cui,Yubo Zhang,Ting Sun,Xueqing Wang,Hongen Liu,Manhui Lin,Yue Zhang,Tingquan Gao,Changda Zhou,Jiaxuan Liu,Zelun Zhang,Jing Zhang,Jun Zhang,Yi Liu

Main category: cs.CV

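TL;DR: This paper presents PP-OCRv5, a lightweight OCR system with only 5 million parameters. A data-centric approach focused on data difficulty, accuracy, and diversity significantly improves the traditional two-stage OCR pipeline, which matches or even surpasses large VLMs in accuracy, localization, and resistance to hallucination.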

Details

Motivation: The work challenges the prevailing view that OCR performance must come from model scale, and addresses the high computational cost, imprecise text localization in complex layouts, and tendency toward textual hallucination of large unified OCR models. Method: The authors introduce the lightweight PP-OCRv5 system and conduct a data-centric study that systematically quantifies three key dimensions of training data (difficulty, accuracy, and diversity), training a traditional two-stage OCR pipeline on large-scale, high-quality, accurately labeled, and diverse data. Result: PP-OCRv5 is competitive with many billion-parameter VLMs on standard OCR benchmarks while offering better text localization precision and a lower hallucination rate. Conclusion: Given enough high-quality, accurate, and diverse training data, the performance ceiling of efficient traditional two-stage OCR pipelines is far higher than previously assumed; lightweight, specialized OCR models remain highly competitive and practical in the large-model era. Abstract: The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.

[171] GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization

Pengyue Jia,Derong Xu,Yingyi Zhang,Xiaopeng Li,Wenlin Zhang,Yi Wen,Yuanshao Zhu,Xiangyu Zhao

Main category: cs.CV

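TL;DR: This paper proposes GeoRouter, a dynamic routing framework that adaptively chooses between retrieval-based and generation-based geolocalization according to image content. It is optimized with a distance-aware preference objective and the newly built GeoRouting dataset, and significantly outperforms existing methods on multiple benchmarks.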

Details

Motivation: Existing image geolocalization methods fall into retrieval-based and generation-based paradigms with different error profiles (retrieval excels at fine-grained instance matching, generation at semantic reasoning). No single paradigm is best in every scenario, so a mechanism for dynamically selecting the better paradigm is needed. Method: GeoRouter uses a large vision-language model (LVLM) to analyze each image and decide whether to use the retrieval-based or generation-based method. A distance-aware preference objective converts the difference between the two paradigms' distances to the ground-truth coordinates into a continuous supervision signal, and GeoRouting, the first large-scale dataset for training routing policies, is constructed. Result: Experiments on IM2GPS3k and YFCC4k show that GeoRouter significantly outperforms state-of-the-art methods. Conclusion: Dynamic routing is an effective way to improve worldwide image geolocalization; GeoRouter achieves its gains by combining complementary paradigms with a new training objective and dataset. Abstract: Worldwide image geolocalization aims to predict precise GPS coordinates for images captured anywhere on Earth, which is challenging due to the large visual and geographic diversity. Recent methods mainly follow two paradigms: retrieval-based approaches that match queries against a reference database, and generation-based approaches that directly predict coordinates using Large Vision-Language Models (LVLMs). However, we observe distinct error profiles between them: retrieval excels at fine-grained instance matching, while generation offers robust semantic reasoning. This complementary heterogeneity suggests that no single paradigm is universally superior. To harness this potential, we propose GeoRouter, a dynamic routing framework that adaptively assigns each query to the optimal paradigm. GeoRouter leverages an LVLM backbone to analyze visual content and provide routing decisions. To optimize GeoRouter, we introduce a distance-aware preference objective that converts the distance gap between paradigms into a continuous supervision signal, explicitly reflecting relative performance differences. Furthermore, we construct GeoRouting, the first large-scale dataset tailored for training routing policies with independent paradigm predictions. Extensive experiments on IM2GPS3k and YFCC4k demonstrate that GeoRouter significantly outperforms state-of-the-art baselines.
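
One way to picture a distance gap turned into a continuous supervision signal is sketched below. The abstract does not give the objective's actual form, so everything here is an assumption made for illustration: the haversine great-circle distance as the error measure, a sigmoid as the squashing function, the temperature `tau_km`, and the function names.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def retrieval_preference(truth, retrieval_pred, generation_pred, tau_km=100.0):
    """Soft label in (0, 1): above 0.5 when retrieval is closer to the truth.

    Hypothetical formulation: sigmoid of the distance gap, scaled by tau_km.
    """
    gap = haversine_km(truth, generation_pred) - haversine_km(truth, retrieval_pred)
    return 1.0 / (1.0 + math.exp(-gap / tau_km))


paris = (48.8566, 2.3522)
near_paris = (48.86, 2.35)   # stand-in for a retrieval prediction
london = (51.5, -0.12)       # stand-in for a generation prediction
label = retrieval_preference(paris, near_paris, london)
```

Unlike a hard "which paradigm won" label, a target of this kind stays near 0.5 when both paradigms are about equally accurate and saturates only when one is far better, which is the relative-performance information the abstract says the objective is meant to reflect.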

[172] ViHOI: Human-Object Interaction Synthesis with Visual Priors

Songjin Cai,Linjie Zhong,Ling Guo,Changxing Ding

Main category: cs.CV

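TL;DR: This paper proposes ViHOI, a framework that extracts visual and textual priors from 2D images to improve the quality and generalization of 3D human-object interaction (HOI) motions generated by diffusion models.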

Details

Motivation: Existing methods struggle to capture the physical constraints of 3D HOI from text descriptions alone, which leads to unrealistic, unnatural motions. Method: A large vision-language model (VLM) serves as the prior-extraction engine, with a layer-decoupled strategy that obtains visual and textual priors separately. A Q-Former adapter compresses the high-dimensional VLM features into compact prior tokens for conditional training of the diffusion model. Training uses motion-rendered images so that visual inputs and motion sequences are semantically aligned, and inference uses reference images produced by a text-to-image model to improve generalization. Result: ViHOI reaches state-of-the-art performance on multiple benchmarks, significantly outperforms existing methods, and generalizes strongly to unseen objects and interaction categories. Conclusion: Extracting task-specific priors from easily accessible 2D images is an effective new paradigm for improving the quality and generalization of 3D HOI generation. Abstract: Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM's high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization.

[173] Causal Transfer in Medical Image Analysis

Mohammed M. Abdelsamea,Daniel Tweneboah Anyimadu,Tasneem Selim,Saif Alzubi,Lei Zhang,Ahmed Karam Eldaly,Xujiong Ye

Main category: cs.CV

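TL;DR: This survey reviews Causal Transfer Learning (CTL) in medical image analysis, which uses causal reasoning to improve model generalization and robustness across hospitals, devices, and populations.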

Details

Motivation: Medical imaging models degrade under domain shift when deployed across institutions, and traditional correlation-based transfer methods are vulnerable to spurious correlations, so causal reasoning is needed to identify invariant mechanisms that are stable across environments. Method: The survey systematically integrates causal methods such as structural causal models, invariant risk minimisation, and counterfactual reasoning into transfer learning pipelines, categorises existing work by task type, shift type, and causal assumption, and proposes a unified taxonomy connecting causal frameworks and transfer mechanisms. Result: It builds a CTL survey covering classification, segmentation, reconstruction, anomaly detection, and multimodal imaging; summarises datasets, benchmarks, and empirical gains; and identifies the settings in which CTL offers advantages for fairness, robustness, and federated learning. Conclusion: CTL offers a new paradigm for clinically reliable AI, but scalability in real medical settings, validation of causal assumptions, and integration with clinical workflows remain key challenges. Abstract: Medical imaging models frequently fail when deployed across hospitals, scanners, populations, or imaging protocols due to domain shift, limiting their clinical reliability. While transfer learning and domain adaptation address such shifts statistically, they often rely on spurious correlations that break under changing conditions. On the other hand, causal inference provides a principled way to identify invariant mechanisms that remain stable across environments. This survey introduces and systematises Causal Transfer Learning (CTL) for medical image analysis. This paradigm integrates causal reasoning with cross-domain representation learning to enable robust and generalisable clinical AI. We frame domain shift as a causal problem and analyse how structural causal models, invariant risk minimisation, and counterfactual reasoning can be embedded within transfer learning pipelines. We review studies spanning classification, segmentation, reconstruction, anomaly detection, and multimodal imaging, and organise them by task, shift type, and causal assumption. A unified taxonomy is proposed that connects causal frameworks and transfer mechanisms. We further summarise datasets, benchmarks, and empirical gains, highlighting when and why causal transfer outperforms correlation-based domain adaptation. Finally, we discuss how CTL supports fairness, robustness, and trustworthy deployment in multi-institutional and federated settings, and outline open challenges and research directions for clinically reliable medical imaging AI.

[174] Teacher-Student Diffusion Model for Text-Driven 3D Hand Motion Generation

Ching-Lam Cheng,Bin Zhu,Shengfeng He

Main category: cs.CV

TL;DR: TSHaMo is a teacher-student diffusion framework for generating realistic 3D hand motion from text alone, using auxiliary signals (e.g., MANO) only during training—not inference—improving quality, diversity, and generalizability over prior methods.

Details

Motivation: Existing methods either ignore fine-grained hand gestures by focusing on full-body motion or rely on explicit 3D object meshes, limiting their generality and practicality for text-driven hand motion generation. Method: TSHaMo introduces a model-agnostic teacher-student diffusion framework: the teacher uses auxiliary signals (e.g., MANO parameters) to provide structured guidance, while the student learns text-to-motion mapping; a co-training strategy lets the student leverage teacher's intermediate predictions without requiring auxiliary inputs at inference. Result: TSHaMo consistently improves motion quality and diversity on GRAB and H2O benchmarks using two diffusion backbones; ablation studies confirm robustness and flexibility in incorporating diverse auxiliary inputs, with no 3D objects needed at test time. Conclusion: TSHaMo enables high-fidelity, generalizable text-driven 3D hand motion generation without requiring 3D object meshes at inference, advancing applicability in VR, robotics, and human-computer interaction. Abstract: Generating realistic 3D hand motion from natural language is vital for VR, robotics, and human-computer interaction. Existing methods either focus on full-body motion, overlooking detailed hand gestures, or require explicit 3D object meshes, limiting generality. We propose TSHaMo, a model-agnostic teacher-student diffusion framework for text-driven hand motion generation. The student model learns to synthesize motions from text alone, while the teacher leverages auxiliary signals (e.g., MANO parameters) to provide structured guidance during training. A co-training strategy enables the student to benefit from the teacher's intermediate predictions while remaining text-only at inference. Evaluated using two diffusion backbones on GRAB and H2O, TSHaMo consistently improves motion quality and diversity. Ablations confirm its robustness and flexibility in using diverse auxiliary inputs without requiring 3D objects at test time.

[175] The Gait Signature of Frailty: Transfer Learning based Deep Gait Models for Scalable Frailty Assessment

Laura McDaniel,Basudha Pal,Crystal Szczesny,Yuxiang Guo,Ryan Roemmich,Peter Abadir,Rama Chellappa

Main category: cs.CV

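TL;DR: This paper introduces a public silhouette-based gait dataset for frailty assessment and studies how to transfer pretrained gait recognition models to achieve robust, interpretable frailty classification under small-sample and class-imbalance conditions.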

Details

Motivation: Frailty assessment in clinical practice remains subjective, heterogeneous, and hard to scale, and existing computer-vision gait analysis is limited by small, imbalanced datasets that lack clinical representativeness. Method: The authors build a silhouette-based frailty gait dataset collected in a clinically realistic setting (spanning the full frailty spectrum and including walking-aid users) and systematically evaluate transfer strategies for several pretrained models (CNN and hybrid attention architectures) under limited data, including selective freezing, loss design, and interpretability analysis. Result: Selectively freezing low-level gait representations, handling class imbalance conservatively, and combining multi-task objectives markedly improve model stability and the ability to distinguish adjacent frailty states; model attention concentrates on the lower limbs and pelvis, consistent with biomechanical understanding. Conclusion: Gait-based representation learning is a scalable, non-invasive, and interpretable approach to frailty assessment, supporting the integration of modern biometric modeling into aging research and clinical practice. Abstract: Frailty is a condition in aging medicine characterized by diminished physiological reserve and increased vulnerability to stressors. However, frailty assessment remains subjective, heterogeneous, and difficult to scale in clinical practice. Gait is a sensitive marker of biological aging, capturing multisystem decline before overt disability. Yet the application of modern computer vision to gait-based frailty assessment has been limited by small, imbalanced datasets and a lack of clinically representative benchmarks. In this work, we introduce a publicly available silhouette-based frailty gait dataset collected in a clinically realistic setting, spanning the full frailty spectrum and including older adults who use walking aids. Using this dataset, we evaluate how pretrained gait recognition models can be adapted for frailty classification under limited data conditions. We study both convolutional and hybrid attention-based architectures and show that predictive performance depends primarily on how pretrained representations are transferred rather than architectural complexity alone. Across models, selectively freezing low-level gait representations while allowing higher-level features to adapt yields more stable and generalizable performance than either full fine-tuning or rigid freezing. Conservative handling of class imbalance further improves training stability, and combining complementary learning objectives enhances discrimination between clinically adjacent frailty states. Interpretability analyses reveal consistent model attention to lower-limb and pelvic regions, aligning with established biomechanical correlates of frailty. Together, these findings establish gait-based representation learning as a scalable, non-invasive, and interpretable framework for frailty assessment and support the integration of modern biometric modeling approaches into aging research and clinical practice.

[176] Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu,Yunqi Miao,Xueyi Zhang,Jiankang Deng,Guansong Pang

Main category: cs.CV

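TL;DR: This paper proposes VLAForge, a framework that uses the cross-modal semantics of vision-language models (VLMs) to improve deepfake video detection. A ForgePerceiver enhances visual perception, and an identity-aware vision-language alignment (VLA) score serves as a discriminative cue, yielding significant gains over existing methods across multiple forgery types.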

Details

Motivation: Existing deepfake detection methods built on pre-trained VLMs use only visual features and overlook the rich vision-language semantics that are the core strength of VLMs. Method: VLAForge (i) designs ForgePerceiver as an independent learner that captures forgery cues both at fine granularity and holistically while preserving the pretrained vision-language alignment (VLA) knowledge, and (ii) builds an identity-aware VLA score that couples cross-modal semantics with the forgery cues learned by ForgePerceiver and uses identity-prior-informed text prompts to sharpen authenticity discrimination. Result: On multiple video detection benchmarks covering classical face swapping and recent full-face generation forgeries, VLAForge significantly surpasses state-of-the-art methods at both frame and video levels. Conclusion: Fully exploiting and fusing the vision-language semantics in VLMs with forgery cues, in particular through the identity-aware VLA score, improves the generalization and discriminative power of deepfake video detection. Abstract: Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength -- the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance the model's discriminability in deepfake detection. This work i) enhances the visual perception of the VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue -- the Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by identity-prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels. Code is available at https://github.com/mala-lab/VLAForge.

[177] OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

Kaihang Pan,Qi Tian,Jianwei Zhang,Weijie Kong,Jiangfeng Xiong,Yanxin Long,Shixue Zhang,Haiyi Qiu,Tan Wang,Zheqi Lv,Yue Wu,Liefeng Bo,Siliang Tang,Zhao Zhong

Main category: cs.CV

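TL;DR: This paper proposes OmniWeaving, an omni-capable video generation model that supports multimodal composition and reasoning, and builds IntelligentVBench, the first benchmark for intelligent unified video generation, significantly advancing open-source unified video generation performance.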

Details

Motivation: Open-source unified video generation models lag far behind proprietary systems such as Seedance-2.0, and existing academic models are heavily fragmented, lacking a unified framework that integrates diverse tasks seamlessly. Method: OmniWeaving uses a massive-scale pretraining dataset to learn temporal binding across modalities (text, multiple images, video) and to infer user intent; the authors also build IntelligentVBench, the first comprehensive benchmark of its kind. Result: OmniWeaving achieves state-of-the-art performance among open-source unified video generation models; the code and model will be released. Conclusion: Through multimodal compositional modeling and reasoning, OmniWeaving advances open-source unified video generation and establishes a new evaluation standard for the field. Abstract: While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models. The codes and model will be made publicly available soon. Project Page: https://omniweaving.github.io.

[178] Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories

Kawtar Zaher,Olivier Buisson,Alexis Joly

Main category: cs.CV

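TL;DR: This paper proposes PF-MA, an active learning strategy for discovering rare categories in fine-grained visual retrieval. It prioritizes samples that are near the decision boundary and more likely to be positive, addressing extreme class imbalance and limited annotation budgets, and introduces a class coverage metric for retrieval diversity; its effectiveness is validated on long-tailed datasets such as botanical data.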

Details

Motivation: Real-world fine-grained visual retrieval often requires discovering rare concepts in large unlabeled collections with minimal supervision, especially in biodiversity monitoring, ecological studies, and long-tailed visual domains, where the target class is a tiny fraction of the data and the binary problem is highly imbalanced. Conventional active learning assumes symmetric class priors and ample annotation budgets, so it is ill-suited to such low-budget, low-latency, highly imbalanced settings. Method: The Positive-First Most Ambiguous (PF-MA) criterion selects samples for annotation by balancing the model's predictive uncertainty (closeness to the boundary) with a preference for likely positives, and a class coverage metric evaluates how well the selected positives represent the visual diversity of the target class. Result: Experiments on several long-tailed datasets, including fine-grained plant images, show that PF-MA consistently outperforms strong baselines in both class coverage and classifier performance and remains robust across class sizes and feature descriptors. Conclusion: Aligning active learning with the class asymmetry and user-centric objectives inherent in interactive fine-grained retrieval yields simple yet powerful solutions that support rapid discovery of rare, visually subtle categories in realistic human-in-the-loop settings. Abstract: Real-world fine-grained visual retrieval often requires discovering a rare concept from large unlabeled collections with minimal supervision. This is especially critical in biodiversity monitoring, ecological studies, and long-tailed visual domains, where the target may represent only a tiny fraction of the data, creating highly imbalanced binary problems. Interactive retrieval with relevance feedback offers a practical solution: starting from a small query, the system selects candidates for binary user annotation and iteratively refines a lightweight classifier. While Active Learning (AL) is commonly used to guide selection, conventional AL assumes symmetric class priors and large annotation budgets, limiting effectiveness in imbalanced, low-budget, low-latency settings. We introduce Positive-First Most Ambiguous (PF-MA), a simple yet effective AL criterion that explicitly addresses the class imbalance asymmetry: it prioritizes near-boundary samples while favoring likely positives, enabling rapid discovery of subtle visual categories while maintaining informativeness. Unlike standard methods that oversample negatives, PF-MA consistently returns small batches with a high proportion of relevant samples, improving early retrieval and user satisfaction. To capture retrieval diversity, we also propose a class coverage metric that measures how well selected positives span the visual variability of the target class. Experiments on long-tailed datasets, including fine-grained botanical data, demonstrate that PF-MA consistently outperforms strong baselines in both coverage and classifier performance, across varying class sizes and descriptors. Our results highlight that aligning AL with the asymmetric and user-centric objectives of interactive fine-grained retrieval enables simple yet powerful solutions for retrieving rare and visually subtle categories in realistic human-in-the-loop settings.
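
A selection rule in the spirit of "near-boundary samples while favoring likely positives" might look like the sketch below. The abstract does not spell out the exact PF-MA rule, so the ordering used here (candidates on the positive side of the boundary first, most ambiguous first, then the nearest negatives as a fallback) is an assumption, as are the names `pf_ma_select` and `threshold`.

```python
def pf_ma_select(scores, batch_size, threshold=0.5):
    """Return indices of unlabeled candidates to send for annotation.

    scores: classifier probabilities of the positive (rare) class.
    Hypothetical reading of PF-MA, not the authors' exact criterion.
    """
    positives = [i for i, s in enumerate(scores) if s >= threshold]
    negatives = [i for i, s in enumerate(scores) if s < threshold]
    # Predicted positives, closest to the decision boundary first.
    positives.sort(key=lambda i: scores[i] - threshold)
    # Fallback: predicted negatives, closest to the boundary first.
    negatives.sort(key=lambda i: threshold - scores[i])
    return (positives + negatives)[:batch_size]


scores = [0.9, 0.55, 0.45, 0.51, 0.1]
batch = pf_ma_select(scores, batch_size=3)
larger_batch = pf_ma_select(scores, batch_size=4)
```

With these scores the first batch contains only predicted positives, ordered from most to least ambiguous, and a predicted negative is added only when the batch is larger than the pool of predicted positives. Plain uncertainty sampling would rank the samples at 0.55 and 0.45 about equally, which is the symmetry the abstract argues against under heavy class imbalance.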

[179] Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models

Siqi Liu,Xinyang Li,Bochao Zou,Junbao Zhuo,Huimin Ma,Jiansheng Chen

Main category: cs.CV

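TL;DR: This paper proposes VisionToM, a framework that uses vision-oriented intervention vectors to guide multimodal large language models (MLLMs) toward task-aware reasoning at the visual feature level, improving their Theory of Mind (ToM) ability on the real-world video dataset EgoToM and making answers more reliable and free-form explanations more accurate.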

Details

Motivation: Existing ToM evaluations rely mostly on text input and neglect purely visual scenarios; most methods also treat the model as a black box and offer little interpretable analysis of internal attention in multiple-choice QA or of the impact of hallucination. Method: The VisionToM vision-oriented intervention framework computes intervention vectors that align visual representations with semantic targets, steering the model's attention over visual features layer by layer and weakening its reliance on spurious linguistic priors. Result: On the EgoToM benchmark (with three multiple-choice QA settings), the method substantially improves the ToM performance of MLLMs; on an open-ended generation task it also produces free-form explanations that capture agents' mental states more accurately. Conclusion: VisionToM strengthens vision-driven ToM reasoning and interpretability in MLLMs, moving human-AI collaboration toward closer alignment on mental states. Abstract: As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model's attention through different layers of visual features. This guidance reduces the model's reliance on spurious linguistic priors, leading to more reliable multimodal large language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark, an egocentric, real-world video dataset for ToM with three multiple-choice QA settings, demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents' mental states, pushing machine-human collaboration toward greater alignment.

[180] Toward Physically Consistent Driving Video World Models under Challenging Trajectories

Jiawei Zhou,Zhenxin Zhu,Lingyi Du,Linye Lyu,Lijun Zhou,Zhanqian Wu,Hongcheng Luo,Zhuotao Tian,Bing Wang,Guang Chen,Hangjun Ye,Haiyang Sun,Yu Li

Main category: cs.CV

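TL;DR: This paper proposes PhyGenesis, a world model for generating autonomous-driving simulation videos. It combines a physical condition generator with a physics-enhanced video generator to improve generation quality and physical consistency on challenging and counterfactual trajectories.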

Details

Motivation: Existing video generation models are trained mainly on natural, safe real-world driving data, so they often produce physical inconsistencies and visual artifacts when given imperfect trajectories from simulators or planning systems. Method: Proposes a two-stage framework: (1) a physical condition generator that maps invalid trajectories to physically plausible conditions; (2) a physics-enhanced video generator that produces high-fidelity multi-view driving videos under those conditions. For training, the authors build a large-scale, physics-rich heterogeneous dataset that combines real-world videos with extreme scenarios simulated in CARLA. Result: PhyGenesis clearly outperforms existing state-of-the-art methods in video generation quality on challenging trajectories, showing stronger physical consistency and visual fidelity. Conclusion: Explicit physical modeling and supervision from challenging trajectories improve the robustness and trustworthiness of world models for autonomous-driving simulation, offering a new approach to building physically reliable generative world models. Abstract: Video generation models have shown strong potential as world models for autonomous driving simulation. However, existing approaches are primarily trained on real-world driving datasets, which mostly contain natural and safe driving scenarios. As a result, current models often fail when conditioned on challenging or counterfactual trajectories (such as imperfect trajectories generated by simulators or planning systems), producing videos with severe physical inconsistencies and artifacts. To address this limitation, we propose PhyGenesis, a world model designed to generate driving videos with high visual fidelity and strong physical consistency. Our framework consists of two key components: (1) a physical condition generator that transforms potentially invalid trajectory inputs into physically plausible conditions, and (2) a physics-enhanced video generator that produces high-fidelity multi-view driving videos under these conditions. To effectively train these components, we construct a large-scale, physics-rich heterogeneous dataset. Specifically, in addition to real-world driving videos, we generate diverse challenging driving scenarios using the CARLA simulator, from which we derive supervision signals that guide the model to learn physically grounded dynamics under extreme conditions. This challenging-trajectory learning strategy enables trajectory correction and promotes physically consistent video generation. Extensive experiments demonstrate that PhyGenesis consistently outperforms state-of-the-art methods, especially on challenging trajectories. Our project page is available at: https://wm-research.github.io/PhyGenesis/.

Dipam Goswami,Simone Magistri,Gido M. van de Ven,Bartłomiej Twardowski,Andrew D. Bagdanov,Tinne Tuytelaars,Joost van de Weijer

Main category: cs.CV

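TL;DR: This paper improves CLIP-based few-shot image classification by projecting image prototypes onto the principal semantic directions of the text embeddings, which yields a text-aligned image subspace that is then mixed with the text embeddings. For datasets with poor cross-modal alignment, it adds an LDA classifier that models anisotropy through class covariances. Combining the two classifiers achieves state-of-the-art results.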

Details

Motivation: Existing methods use training-image embeddings to improve few-shot classification, but image prototypes carry irrelevant background and context noise; on downstream datasets where CLIP's cross-modal alignment is poor, semantic alignment alone is of limited benefit. Method: (1) Project image prototypes onto the principal directions of the text embedding space to obtain text-aligned semantic image prototypes; (2) mix these prototypes with the text embeddings for classification; (3) for datasets with poor cross-modal alignment, additionally model class covariances in the image space to capture anisotropy and apply an LDA classifier; (4) combine the outputs of the two classifiers. Result: The method outperforms existing approaches on multiple few-shot image classification benchmarks and is more robust on datasets with poor cross-modal alignment. Conclusion: Image prototypes should be aligned with the text semantic space to suppress noise; when alignment is insufficient, statistical modeling in the image space (such as LDA) provides a complementary gain; combining classifiers that draw on different sources and strategies is an effective route. Abstract: Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.
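The projection-and-mixing idea in this entry can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the text subspace is approximated by a Gram-Schmidt basis of the class text embeddings (a stand-in for the principal directions named in the abstract), the mixing weight `alpha` is a hypothetical parameter, and all function names are invented for the example. The LDA branch is not shown.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def orthonormal_basis(vectors, eps=1e-9):
    # Gram-Schmidt basis of the span of the text embeddings
    # (assumption: used here in place of PCA directions).
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis

def project(v, basis):
    # Orthogonal projection of v onto the span of an orthonormal basis.
    out = [0.0] * len(v)
    for b in basis:
        c = dot(v, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out

def mixed_prototypes(text_emb, shots, alpha=0.5):
    # Mix each class text embedding with its text-aligned image prototype.
    basis = orthonormal_basis(list(text_emb.values()))
    protos = {}
    for label, t in text_emb.items():
        img_proto = project(mean(shots[label]), basis)
        mixed = [alpha * ti + (1 - alpha) * pi for ti, pi in zip(t, img_proto)]
        protos[label] = normalize(mixed)
    return protos

def classify(query, protos):
    # Nearest prototype by cosine similarity.
    q = normalize(query)
    return max(protos, key=lambda label: dot(q, protos[label]))

# Toy data: the third coordinate plays the role of background noise
# that is present in the images but absent from the text embeddings.
text_emb = {"cat": [1, 0, 0], "dog": [0, 1, 0]}
shots = {
    "cat": [[0.9, 0.1, 0.8], [0.8, 0.2, 0.9]],
    "dog": [[0.1, 0.9, 0.8], [0.2, 0.8, 0.9]],
}
protos = mixed_prototypes(text_emb, shots)
prediction = classify([0.7, 0.3, 0.9], protos)
```

In this toy setup the projection removes the noise coordinate from each image prototype before mixing, so the query is compared only along directions the text embeddings span.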

[182] CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition

Florian Stilz,Vinkle Srivastav,Nassir Navab,Nicolas Padoy

Main category: cs.CV

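TL;DR: This paper proposes CliPPER, a contextual video-language pretraining framework for long-form surgical videos. Several new pretraining strategies (VTC_CTX, COP, cycle-consistency alignment, and FTM) improve multimodal alignment and fine-grained temporal understanding, setting a new state of the art on multiple zero-shot surgical recognition tasks.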

Details

Motivation: Labeled data is scarce in surgical settings and precise temporal understanding is required, so existing video-language models fall short on complex downstream tasks. Method: Proposes the CliPPER framework, whose new pretraining strategies include Contextual Video-Text Contrastive Learning (VTC_CTX), Clip Order Prediction (COP), cycle-consistency alignment, and Frame-Text Matching (FTM), strengthening multimodal alignment and temporal modeling in long-form surgical videos. Result: Achieves new state-of-the-art performance on multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. Conclusion: By introducing context-aware and time-sensitive pretraining mechanisms, CliPPER markedly improves the generalization and fine-grained understanding of video-language models in surgical settings with scarce annotations. Abstract: Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state-of-the-art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at https://github.com/CAMMA-public/CliPPER.

[183] SEGAR: Selective Enhancement for Generative Augmented Reality

Fanjun Bu,Chenyang Yuan,Hiroshi Yasuda

Main category: cs.CV

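TL;DR: This paper proposes SEGAR, a framework that combines a diffusion-based world model with a selective correction stage to support generative world modeling for augmented-reality applications, enabling region-specific edits while aligning safety-critical regions with real observations.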

Details

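Motivation: To provide a generative world-model foundation for augmented-reality (AR) applications: predicting future image sequences that include visual edits yields temporally coherent augmented frames that can be precomputed and cached, avoiding per-frame rendering in real time. Method: Proposes the SEGAR framework: (1) a diffusion-based world model generates future frames with region-specific edits while leaving other regions unchanged; (2) a selective correction stage aligns safety-critical regions with real-world observations while preserving the intended augmentations elsewhere. The pipeline is validated in driving scenarios. Result: The end-to-end generate, cache, and selectively correct pipeline is demonstrated in driving scenarios, showing feasibility in settings where semantic region structure is well defined and real-world feedback is readily available. Conclusion: SEGAR is an early step toward practical generative world models as AR infrastructure, showing that future frames can be generated, cached, and selectively corrected on demand. Abstract: Generative world models offer a compelling foundation for augmented-reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they enable temporally coherent, augmented future frames that can be computed ahead of time and cached, avoiding per-frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion-based world model with a selective correction stage to support this vision. The world model generates augmented future frames with region-specific edits while preserving others, and the correction stage subsequently aligns safety-critical regions with real-world observations while preserving intended augmentations elsewhere. We demonstrate this pipeline in driving scenarios as a representative setting where semantic region structure is well defined and real-world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.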
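The effect of the selective correction stage can be pictured with a very small Python sketch. It assumes the correction is a hard per-pixel composite under a binary safety mask, which is a simplification chosen for illustration; the abstract does not say how the correction stage is implemented, and the function and argument names are hypothetical.

```python
def selective_correction(generated, observed, safety_mask):
    # For each pixel, take the real observation where the mask marks a
    # safety-critical region (1) and keep the cached augmented value elsewhere (0).
    return [
        [obs if m else gen for gen, obs, m in zip(g_row, o_row, m_row)]
        for g_row, o_row, m_row in zip(generated, observed, safety_mask)
    ]

# Toy 2x2 single-channel frames: 9 marks augmented content, 1..4 real observations.
cached_frame = [[9, 9], [9, 9]]
real_frame = [[1, 2], [3, 4]]
mask = [[1, 0], [0, 1]]
corrected = selective_correction(cached_frame, real_frame, mask)
```

A learned correction model would presumably blend the two sources more smoothly; the composite above only shows which regions are meant to follow the real observation and which keep the augmentation.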

[184] The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series

Jan Hemmerling,Marcel Schwieder,Philippe Rufin,Leon-Friedrich Thomas,Mirela Tulbure,Patrick Hostert,Stefan Erasmi

Main category: cs.CV

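TL;DR: This study uses Sentinel-2 time series with a modified Temporo-Spatial Vision Transformer (TSViT) to classify organic versus conventional farming systems from remote sensing data. It evaluates multitask learning (jointly predicting crop type) and the effect of spatial context by varying the input patch size. Classification is feasible, but performance varies widely across crop types, and wider spatial context improves accuracy.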

Details

Motivation: Supporting sustainable agriculture requires spatially explicit information on where organic farming takes place; existing approaches have limited ability to separate organic from conventional systems, so more effective remote sensing classification methods are needed. Method: Using intra-annual Sentinel-2 time series, the authors build a model based on the Temporo-Spatial Vision Transformer (TSViT) that supports single-task learning (organic versus conventional only) and multitask learning (jointly predicting farming system and crop type); the influence of spatial context is quantified by varying the input patch size. Result: Organic and conventional farming systems can be distinguished, but performance depends strongly on crop type: winter rye, winter wheat, and winter oat reach F1 scores of 0.8 or higher, whereas permanent grassland, orchards, grapevines, and hops score 0.4 or lower. Multitask learning brings only limited gains, while wider spatial context improves classification accuracy for both tasks. Conclusion: Farming systems can be classified in a diverse agricultural region from multispectral remote sensing data alone; spatial context offers more potential for improvement than joint learning of crop type, and future work should focus on crop-specific modeling and context optimization. Abstract: Organic farming is a key element in achieving more sustainable agriculture. For a better understanding of the development and impact of organic farming, comprehensive, spatially explicit information is needed. This study presents an approach for the discrimination of organic and conventional farming systems using intra-annual Sentinel-2 time series. In addition, it examines two factors influencing this discrimination: the joint learning of crop type information in a concurrent task and the role of spatial context. A Vision Transformer model based on the Temporo-Spatial Vision Transformer (TSViT) architecture was used to construct a classification model for the two farming systems. The model was extended for simultaneous learning of the crop type, creating a multitask learning setting. By varying the patch size presented to the model, we tested the influence of spatial context on the classification accuracy of both tasks. We show that discrimination between organic and conventional farming systems using multispectral remote sensing data is feasible. However, classification performance varies substantially across crop types. For several crops, such as winter rye, winter wheat, and winter oat, F1 scores of 0.8 or higher can be achieved. In contrast, other agricultural land use classes, such as permanent grassland, orchards, grapevines, and hops, cannot be reliably distinguished, with F1 scores for the organic management class of 0.4 or lower. Joint learning of farming system and crop type provides only limited additional benefits over single-task learning. In contrast, incorporating wider spatial context improves the performance of both farming system and crop type classification. Overall, we demonstrate that a classification of agricultural farming systems is possible in a diverse agricultural region using multispectral remote sensing data.

[185] LensWalk: Agentic Video Understanding by Planning How You See in Videos

Keliang Li,Yansong Li,Hongze Shen,Mengdi Liu,Hong Chang,Shiguang Shan

Main category: cs.CV

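TL;DR: LensWalk is a new video understanding framework that lets a large language model actively control its own visual observation, enabling dynamic, on-demand evidence gathering and substantially improving performance on long-video understanding tasks.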

Details

Motivation: Existing video understanding methods rely on static, pre-processed information and cannot actively retrieve raw evidence from the video during reasoning, which disconnects reasoning from perception. Method: LensWalk builds a reason-plan-observe loop in which the large language model dynamically specifies the temporal scope and sampling density of each observation, and calls configurable tools based on multimodal models to scan for cues, focus on segments, and combine evidence across moments. Result: Without any model fine-tuning, LensWalk raises the accuracy of several models by more than 5% on long-video benchmarks such as LVBench and Video-MME, and makes reasoning more accurate, robust, and interpretable. Conclusion: Giving an agent control over how it sees is key to improving video reasoning. Abstract: The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.

[186] POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan

Marta Moscati,Muhammad Saad Saeed,Marina Zanoni,Mubashir Noman,Rohan Kumar Das,Monorama Swain,Yufang Hou,Elisabeth Andre,Khalid Mahmood Malik,Markus Schedl,Shah Nawaz

Main category: cs.CV

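TL;DR: The POLY-SIM Grand Challenge 2026 introduces a benchmark challenge for multimodal speaker identification under missing-modality and cross-lingual conditions, aiming to advance robust and practical systems.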

Details

Motivation: In real-world settings the visual modality is often missing (occlusion, camera failure, privacy constraints) and multilingual speakers introduce linguistic variability, so the usual assumption of complete, homogeneous audio-visual modalities does not hold, which hurts robustness and generalization. Method: Designs and organizes the POLY-SIM Grand Challenge 2026, including a dedicated dataset, task formulation, evaluation protocol, and baseline model, forming a standardized evaluation framework. Result: Provides a public challenge for multimodal speaker identification under missing-modality and cross-lingual conditions, together with its supporting resources (data, protocol, baseline). Conclusion: The challenge promotes research on robust multimodal speaker identification for realistic conditions and lays the groundwork for method development and fair comparison. Abstract: Multimodal speaker identification systems typically assume the availability of complete and homogeneous audio-visual modalities during both training and testing. However, in real-world applications, such assumptions often do not hold. Visual information may be missing due to occlusions, camera failures, or privacy constraints, while multilingual speakers introduce additional complexity due to linguistic variability across languages. These challenges significantly affect the robustness and generalization of multimodal speaker identification systems. The POLY-SIM Grand Challenge 2026 aims to advance research in multimodal speaker identification under missing-modality and cross-lingual conditions. Specifically, the Grand Challenge encourages the development of robust methods that can effectively leverage incomplete multimodal inputs while maintaining strong performance across different languages. This report presents the design and organization of the POLY-SIM Grand Challenge 2026, including the dataset, task formulation, evaluation protocol, and baseline model. By providing a standardized benchmark and evaluation framework, the challenge aims to foster progress toward more robust and practical multimodal speaker identification systems.

[187] Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Duc Vu,Anh Nguyen,Chi Tran,Anh Tran

Main category: cs.CV

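TL;DR: This paper proposes Anti-I2V, a new defense against image-to-video diffusion models, Diffusion Transformers in particular. It adds perturbations in the L*a*b* color space and the frequency domain and optimizes training objectives on key semantic layers, degrading temporal consistency and generation fidelity and thereby strengthening protection against malicious video generation.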

Details

Motivation: Diffusion models, especially DiT architectures, are increasingly capable at human video generation, which raises the risk of forgery; existing adversarial defenses mostly target image generation or UNet architectures and are not known to be effective against newer video diffusion models such as DiT. Method: Anti-I2V applies perturbations jointly in the L*a*b* color space and the frequency domain rather than in RGB space alone; it identifies the most discriminative network layers during denoising and designs targeted training objectives that maximize the degradation of temporal coherence and generation quality. Result: Validated on a range of video diffusion models (including DiT), Anti-I2V achieves state-of-the-art defense performance, markedly lowering the quality and temporal consistency of forged videos. Conclusion: Anti-I2V is a general, robust defense that works across backbones, offering an effective and practical safeguard against malicious human video generation with diffusion models. Abstract: Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the L*a*b* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.

[188] Towards Training-Free Scene Text Editing

Yubo Li,Xugong Qin,Peng Zhang,Hailun Lin,Gangyan Zeng,Kexin Zhang

Main category: cs.CV

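TL;DR: This paper proposes TextFlow, a training-free scene text editing framework that combines Attention Boost and Flow Manifold Steering to achieve high-fidelity, semantically consistent text modification.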

Details

Motivation: Existing methods usually require task-specific training or paired data, which limits scalability and adaptability. Method: Proposes the TextFlow framework, which combines Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to perform end-to-end text editing without additional training; FMS models the visual flow of characters and background to preserve structural and style consistency, while AttnBoost strengthens text rendering through attention-based guidance. Result: Experiments show that TextFlow matches or exceeds training-based methods in visual quality and text accuracy and generalizes well across scenes and languages. Conclusion: TextFlow moves scene text editing toward a more efficient, general, and training-free paradigm. Abstract: Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm. Code is available at https://github.com/lyb18758/TextFlow

[189] VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Qijia He,Xunmei Liu,Hammaad Memon,Ziang Li,Zixian Ma,Jaemin Cho,Jason Ren,Daniel S Weld,Ranjay Krishna

Main category: cs.CV

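TL;DR: This paper proposes VFIG, a model that converts raster figures to SVG vector graphics with high quality, using the large-scale VFIG-DATA dataset and a coarse-to-fine training strategy. It reaches state-of-the-art structural fidelity among open-source models and is on par with GPT-5.2.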

Details

Motivation: Original SVG source files are often lost, leaving only raster images (such as PNG or JPEG) that are hard to edit and scale; manual reconstruction is slow and requires specialist skills, so an automated, high-fidelity vectorization method is needed. Method: Proposes the VFIG vision-language model and builds VFIG-DATA, a dataset of 66K high-quality figure-SVG pairs. Training follows a coarse-to-fine curriculum: supervised fine-tuning first learns atomic primitives, then reinforcement learning optimizes global layout, structural consistency, and topological robustness. A dedicated benchmark, VFIG-BENCH, with new metrics is also introduced. Result: VFIG reaches a VLM-Judge score of 0.829 on VFIG-BENCH, which is state of the art among open-source models and on par with GPT-5.2. Conclusion: VFIG addresses high-fidelity vectorization of complex technical figures in realistic settings, and its data construction strategy, staged training curriculum, and dedicated evaluation suite set a new reference point for the task. Abstract: Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.

[190] EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Falong Fan,Yi Xie,Arnis Lektauers,Bo Liu,Jerzy Rozenblit

Main category: cs.CV

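TL;DR: This paper proposes the EndoVGGT framework. Its Deformation-aware Graph Attention (DeGAT) module dynamically builds semantic graphs in feature space to model long-range geometric correlations in soft tissue, markedly improving the accuracy and cross-domain generalization of endoscopic soft-tissue 3D reconstruction.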

Details

Motivation: Existing fixed-topology methods struggle with the geometric discontinuities caused by low-texture soft tissue, specular highlights, and instrument occlusion; a method that can model long-range structural consistency is needed. Method: Proposes the EndoVGGT framework, centered on the Deformation-aware Graph Attention (DeGAT) module, which abandons static spatial neighborhoods and instead builds semantic graphs dynamically in feature space, propagating structural cues across occlusions and enforcing global consistency. Result: On the SCARED dataset, PSNR improves by 24.6% and SSIM by 9.1%; the model also shows strong zero-shot cross-dataset generalization (SCARED→EndoNeRF), supporting the claim that the learned geometric priors are domain-agnostic. Conclusion: Dynamic feature-space modeling improves the consistency and robustness of non-rigid soft-tissue reconstruction in surgical scenes, and DeGAT offers a new approach to geometry-aware visual understanding. Abstract: Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.
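To make the contrast between static spatial neighborhoods and a feature-space graph concrete, here is a minimal pure-Python sketch. It assumes a k-nearest-neighbor graph under cosine similarity followed by softmax-weighted aggregation; this is a generic stand-in, not the DeGAT module itself, whose details the abstract does not give. Function names, `k`, and `temperature` are hypothetical.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def feature_knn_graph(features, k):
    # For each node, list the indices of its k most similar nodes in
    # feature space, regardless of where they sit spatially.
    graph = []
    for i, f in enumerate(features):
        sims = [(cosine(f, g), j) for j, g in enumerate(features) if j != i]
        sims.sort(key=lambda s: (-s[0], s[1]))
        graph.append([j for _, j in sims[:k]])
    return graph

def graph_attention(features, graph, temperature=0.1):
    # Aggregate neighbor features with softmax weights on cosine similarity.
    out = []
    for i, neighbors in enumerate(graph):
        logits = [cosine(features[i], features[j]) / temperature for j in neighbors]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        agg = [0.0] * len(features[i])
        for w, j in zip(weights, neighbors):
            agg = [a + (w / total) * x for a, x in zip(agg, features[j])]
        out.append(agg)
    return out

# Four tokens in spatial order; tokens 0 and 2 look alike, as do 1 and 3,
# mimicking one tissue region split by an occluding instrument.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
graph = feature_knn_graph(tokens, k=1)
aggregated = graph_attention(tokens, graph)
```

With `k=1` each token links to its look-alike two positions away rather than to its spatial neighbor, which is the kind of long-range connection a fixed spatial window would miss.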

[191] Vision-Language Models vs Human: Perceptual Image Quality Assessment

Imran Mehmood,Imad Ali Shah,Ming Ronnier Luo,Brian Deegan

Main category: cs.CV

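TL;DR: This paper systematically evaluates how well six vision-language models (VLMs) match human psychophysical judgments on three perceptual image quality tasks: contrast, colorfulness, and overall preference. Model performance depends strongly on the attribute, and response variability may reflect sensitivity to scene-dependent perceptual cues.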

Details

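Motivation: Psychophysical experiments are the most reliable approach to image quality assessment (IQA), but they are costly and scale poorly, so automated methods that approximate human perceptual judgments are needed; this paper examines whether VLMs have that potential. Method: On three image quality dimensions (contrast, colorfulness, and overall preference), the outputs of six VLMs (four proprietary, two open-weight) are compared systematically with human psychophysical data, followed by attribute weighting, intra-model consistency, and perceptual separability analyses. Result: Models that align closely with humans on colorfulness (ρ up to 0.93) underperform on contrast, and vice versa; most VLMs weight colorfulness more heavily than contrast when judging overall preference, as humans do; the most self-consistent models are not necessarily the most human-aligned; and human-VLM agreement rises as perceptual differences become clearer. Conclusion: VLMs can approximate human perception on specific image quality attributes, but the ability is strongly attribute-dependent; response variability is not necessarily a flaw and may reflect sensitivity to complex perceptual cues; VLMs should be used for IQA with care, taking perceptual separability into account. Abstract: Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness, and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness compared to contrast when evaluating overall preference, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive trade-off: the most self-consistent models are not necessarily the most human-aligned, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement increases with perceptual separability, indicating that VLMs are more reliable when stimulus differences are clearly expressed.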
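As a worked illustration of the agreement statistic, the sketch below computes a rank correlation between human and model scores in pure Python. It assumes that ρ refers to a rank correlation such as Spearman's, which the abstract does not state explicitly, and the scores are made-up toy values rather than data from the paper.

```python
def average_ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_rho(x, y):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-image scores: human scale values vs. one model's ratings.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman_rho(human_scores, model_scores)
```

Here two adjacent pairs are swapped in the model's ordering, so the rank correlation drops from 1.0 to 0.8 even though the overall trend is preserved.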

[192] Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving

Linbo Wang,Yupeng Zheng,Qiang Chen,Shiwei Li,Yichen Zhang,Zebin Xing,Qichao Zhang,Xiang Li,Deheng Qian,Pengxuan Yang,Yihang Dong,Ce Hao,Xiaoqing Ye,Junyu han,Yifeng Pan,Dongbin Zhao

Main category: cs.CV

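TL;DR: This paper proposes Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Its core modules are a Spatial-Aware Compressive World Encoder (SCWE) and a Dynamic Latent World Model (DLWM). It sets a new state of the art on NAVSIM v2 and HUGSIM with only 104M parameters and less training data.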

Details

Motivation: Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underused temporal dynamics, which leads to weak planning when data and compute are constrained. Method: Proposes the Latent-WAM framework with two core modules: (1) the Spatial-Aware Compressive World Encoder (SCWE), which distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries; (2) the Dynamic Latent World Model (DLWM), a causal Transformer that autoregressively predicts future world states from historical visual and motion representations. Result: Achieves 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS while using less training data and only 104M parameters. Conclusion: By modeling spatial structure and dynamics together, Latent-WAM improves end-to-end driving planning while staying lightweight, underlining the value of efficient latent world representations for practical deployment. Abstract: We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.

[193] TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models

Jiaying Zhou,Zhihao Zhan,Ruifeng Zhai,Qinhan Lyu,Hao Liu,Keze Wang,Liang Lin,Guangrun Wang

Main category: cs.CV

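TL;DR: This paper proposes TAG (Target-Agnostic Guidance), an inference-time guidance mechanism that makes vision-language-action (VLA) policies more robust at target grounding in cluttered scenes. It contrasts the policy's outputs under the original observation and an object-erased observation to form a residual guidance signal, improving performance without changing the policy architecture.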

Details

Motivation: Existing VLA policies often fail in cluttered scenes because of instance-level grounding errors (grasps that land slightly off-target or on the wrong object), not because the motion is infeasible. Method: Proposes TAG: at inference time, the policy's outputs under the original observation and under an observation with the target object erased are contrasted, and their difference is used as a residual guidance signal that strengthens the influence of target-object evidence on the decision; the policy structure is unchanged and only minor training and inference adjustments are needed. Result: On standard manipulation benchmarks including LIBERO, LIBERO-Plus, and VLABench, TAG consistently improves robustness under clutter and reduces near-miss and wrong-object executions. Conclusion: TAG is a lightweight, plug-and-play inference-time guidance method that mitigates grounding bias caused by distractors and similar appearance in VLA policies, improving reliability in deployment. Abstract: Vision-Language-Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.
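The contrastive guidance step can be sketched in Python as follows. It assumes the usual classifier-free-guidance form, in which the difference between the two predictions is scaled and added back to the original prediction; the abstract names CFG as the inspiration but does not give the exact update rule, so the formula, the `scale` parameter, and the function name are assumptions for illustration.

```python
def tag_guidance(pred_original, pred_erased, scale=1.0):
    # CFG-style residual: push the prediction further along the direction
    # that the target object's presence contributes.
    # pred_original: policy output with the full observation
    # pred_erased:   policy output with the target object erased
    return [p + scale * (p - e) for p, e in zip(pred_original, pred_erased)]

# Toy 2-D action: only the first coordinate depends on the target object.
action_full = [0.5, 0.2]
action_erased = [0.4, 0.2]
guided = tag_guidance(action_full, action_erased, scale=2.0)
unguided = tag_guidance(action_full, action_erased, scale=0.0)
```

Components that do not change when the object is erased are left untouched, while object-dependent components are amplified; with `scale=0.0` the policy's original prediction is returned unchanged.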

Table of Contents

cs.CL [Back]

[1] Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes

[2] Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

[3] Internal Safety Collapse in Frontier Large Language Models

[4] Visuospatial Perspective Taking in Multimodal Language Models

[5] DISCO: Document Intelligence Suite for COmparative Evaluation

[6] S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering

[7] Berta: an open-source, modular tool for AI-enabled clinical documentation

[8] DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

[9] Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data

[10] MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

[11] Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents

[12] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?

[13] From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians' Medical Expertise with Lightweight LLM

[14] Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages

[15] Qworld: Question-Specific Evaluation Criteria for LLMs

[16] Do 3D Large Language Models Really Understand 3D Spatial Relationships?

[17] Navigating the Concept Space of Language Models

[18] Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial

[19] Plato's Cave: A Human-Centered Research Verification System

[20] Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression

[21] The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression

[22] Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language

[23] Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

[24] Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction

[25] Generating Hierarchical JSON Representations of Scientific Sentences Using LLMs

[26] MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG

[27] Not All Pretraining are Created Equal: Threshold Tuning and Class Weighting for Imbalanced Polarization Tasks in Low-Resource Settings

[28] Revisiting Real-Time Digging-In Effects: No Evidence from NP/Z Garden-Paths

[29] Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks

[30] Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages

[31] Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges

[32] PLACID: Privacy-preserving Large language models for Acronym Clinical Inference and Disambiguation

[33] The Diminishing Returns of Early-Exit Decoding in Modern LLMs

[34] IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

[35] Infrequent Child-Directed Speech Is Bursty and May Draw Infant Vocalizations

[36] Perturbation: A simple and efficient adversarial tracer for representation learning in language models

[37] PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

[38] Language Model Planners do not Scale, but do Formalizers?

[39] BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

[40] Self-Distillation for Multi-Token Prediction

[41] Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development

[42] OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models

[43] Argument Mining as a Text-to-Text Generation Task

[44] From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

[45] The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

[46] Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

[47] CoCR-RAG: Enhancing Retrieval-Augmented Generation in Web Q&A via Concept-oriented Context Reconstruction

[48] Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

[49] Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning

[50] CVPD at QIAS 2026: RAG-Guided LLM Reasoning for Al-Mawarith Share Computation and Heir Allocation

[51] Schema on the Inside: A Two-Phase Fine-Tuning Method for High-Efficiency Text-to-SQL at Scale

[52] From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

[53] FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval

[54] ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing

[55] LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale

[56] Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

[57] MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare

[58] A visual observation on the geometry of UMAP projections of the difference vectors of antonym and synonym word pair embeddings

[59] Variation is the Norm: Embracing Sociolinguistics in NLP

[60] Stance Labels Fail When They Matter Most: The Projection Problem in Stance Detection

[61] Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition

[62] Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution

[63] Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning

[64] Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

[65] GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

[66] Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning

[67] Towards Reward Modeling for AI Tutors in Math Mistake Remediation

[68] When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools

[69] PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation

[70] Mechanic: Sorrifier-Driven Formal Decomposition Workflow for Automated Theorem Proving

[71] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

[72] Representation Learning to Study Temporal Dynamics in Tutorial Scaffolding

[73] Robust Multilingual Text-to-Pictogram Mapping for Scalable Reading Rehabilitation

[74] A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English

[75] MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

[76] Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

cs.CV

[77] LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset