Table of Contents
cs.CL
[1] Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models
Andrew Kiruluta
Main category: cs.CL
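TL;DR: This paper proposes a compressed-sensing-based framework for dynamic large language model (LLM) execution. Through random measurements, sparse recovery, and compilation into hardware-friendly sparse execution paths, it jointly compresses the model and the prompt in a task-conditioned, token-adaptive way, balancing accuracy, speed, and deployment efficiency.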
Details
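Motivation: Most existing model-compression methods are static and optimized offline, and they do not exploit the fact that different prompts and decoding steps activate different computational pathways. Prompt-compression methods shorten the sequence but do not adjust the model subnetwork that is actually executed. The two lines of work remain separate, which limits end-to-end speedup.
Method: The paper builds a unified compressed-sensing-guided framework. Random measurement operators probe the model's latent computation usage; sparse recovery estimates task-relevant, token-adaptive sparse support sets; and the recovered supports are compiled into hardware-efficient (for example, GPU-friendly) sparse execution paths over blocks, attention heads, channels, and FFN substructures. The framework introduces task-conditioned measurements, token-adaptive recovery, theoretical sample-complexity bounds, hardware-constrained compilation, and a joint prompt-model optimization objective.
Result: The framework enables dynamic, fine-grained, deployment-oriented acceleration of LLM inference, substantially reducing memory footprint and decoding latency while preserving accuracy; it provides a theoretical analysis with explicit approximation guarantees (under restricted isometry or mutual incoherence assumptions); and it supports end-to-end joint compression of prompts and model structure.
Conclusion: Recasting LLM inference as a measurement-and-recovery problem unifies prompt compression and model pruning, moves beyond the limits of static compression, and offers a new paradigm for efficient, adaptive, deployable LLM inference.
Abstract: Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial compression, while prompt-compression methods reduce latency by removing redundant input tokens. However, these two directions remain largely separate. Most model-compression methods are static and optimized offline, and they do not exploit the fact that different prompts and decoding steps activate different latent computational pathways. Prompt-compression methods reduce sequence length, but they do not adapt the executed model subnetwork. We propose a unified compressed-sensing-guided framework for dynamic LLM execution. Random measurement operators probe latent model usage, sparse recovery estimates task-conditioned and token-adaptive support sets, and the recovered supports are compiled into hardware-efficient sparse execution paths over blocks, attention heads, channels, and feed-forward substructures. The framework introduces five key contributions: task-conditioned measurements, so different prompts induce different sparse supports; token-adaptive recovery, so active substructures are re-estimated during decoding; formal sample-complexity bounds under restricted isometry or mutual incoherence assumptions; compile-to-hardware constraints that restrict recovery to GPU-efficient structures; and a joint objective that unifies prompt compression with model reduction.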
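The measurement-and-recovery step described above can be illustrated with a minimal NumPy sketch. It is not the paper's method: it recovers a sparse "usage" vector over hypothetical substructures from random Gaussian measurements using orthogonal matching pursuit (OMP), a standard sparse-recovery algorithm chosen here purely for illustration. The dimensions, the sparsity level, and the noiseless setting are all assumptions.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily pick up to k columns of A that explain y."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # Least-squares fit restricted to the chosen columns.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat, sorted(support)

rng = np.random.default_rng(0)
n, m, k = 64, 32, 4   # hypothetical: 64 substructures, 32 measurements, 4 active
true_support = sorted(rng.choice(n, size=k, replace=False).tolist())
x = np.zeros(n)
x[true_support] = rng.uniform(1.0, 2.0, size=k)   # toy "usage" magnitudes
A = rng.standard_normal((m, n)) / np.sqrt(m)      # random Gaussian measurement operator
y = A @ x                                         # noiseless measurements

x_hat, est_support = omp(A, y, k)
print("true support:     ", true_support)
print("recovered support:", est_support)
```

With far fewer measurements than substructures (32 versus 64), recovery of the active set is only possible because the usage vector is sparse; guarantees of the restricted-isometry kind make that intuition precise.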
Together, these components recast LLM inference as a measurement-and-recovery problem with explicit approximation guarantees and deployment-oriented speedup constraints.
[2] MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios
Yihang Ding, Wanke Xia, Yiting Zhao, Jinbo Su, Jialiang Yang, Zhengbo Zhang, Ke Wang, Wenming Yang
Main category: cs.CL
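TL;DR: This paper proposes MemGround, a long-term memory benchmark grounded in rich, gamified interactive scenarios. A three-tier hierarchical framework evaluates surface state memory, temporal associative memory, and reasoning-based memory, and a multi-dimensional metric suite quantifies memory utilization and behavioral trajectories. Experiments show that current state-of-the-art LLMs and memory agents still struggle with dynamic tracking, temporal event association, and complex reasoning.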
Details
Motivation: Existing evaluations of long-term memory in LLMs are too static. Confined to simple retrieval and short-context inference, they overlook the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning.
Method: The paper proposes the MemGround benchmark, which comprises a three-tier hierarchical framework (Surface State Memory, Temporal Associative Memory, Reasoning-Based Memory) and a multi-dimensional metric suite (QA Overall, MFU, MFCO, ETD), and runs a systematic evaluation in gamified interactive scenarios.
Result: Experiments show that current state-of-the-art LLMs and memory agents perform poorly at sustained dynamic tracking, temporal event association, and complex reasoning over evidence accumulated over the long term.
Conclusion: MemGround offers a more comprehensive, dynamic, and interactive paradigm for evaluating long-term memory, and it exposes key shortcomings in the memory of existing models in realistic interactive environments.
Abstract: Current evaluations of long-term memory in LLMs are fundamentally static. By fixating on simple retrieval and short-context inference, they neglect the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning in continuous interactions. To overcome these limitations, we propose MemGround, a rigorous long-term memory benchmark natively grounded in rich, gamified interactive scenarios. To systematically assess these capabilities, MemGround introduces a three-tier hierarchical framework that evaluates Surface State Memory, Temporal Associative Memory, and Reasoning-Based Memory through specialized interactive tasks. Furthermore, to comprehensively quantify both memory utilization and behavioral trajectories, we propose a multi-dimensional metric suite comprising Question-Answer Score (QA Overall), Memory Fragments Unlocked (MFU), Memory Fragments with Correct Order (MFCO), and Exploration Trajectory Diagrams (ETD). Extensive experiments reveal that state-of-the-art LLMs and memory agents still struggle with sustained dynamic tracking, temporal event association, and complex reasoning derived from long-term accumulated evidence in interactive environments.
[3] HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization
Baocai Shan, Yuzhuang Xu, Wanxiang Che
Main category: cs.CL
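TL;DR: This paper presents HUOZIIME, a personalized, privacy-preserving, real-time mobile input method editor (IME) built on a lightweight large language model (LLM). Fine-tuning on synthetic data, a hierarchical memory mechanism, and system-level optimizations let it generate personalized text efficiently and with high fidelity on mobile devices.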
Details
Motivation: Existing mobile IMEs rely mainly on manual typing and struggle to generate personalized text. Lightweight LLMs make on-device generation feasible, but combining deep personalization, privacy preservation, and real-time responsiveness remains a fundamental challenge.
Method: (1) Post-train a base LLM on synthesized personalization data to give it initial human-like prediction ability; (2) design a hierarchical memory mechanism that continually captures and uses the user's input history; (3) apply systematic optimizations for mobile deployment (such as inference efficiency and resource usage).
Result: Experiments show that HUOZIIME runs efficiently on mobile devices and achieves high-fidelity, memory-driven personalized text generation. The code and installation package are open source.
Conclusion: HUOZIIME demonstrates that a high-performance, strongly personalized, privacy-preserving generative IME can be built on resource-constrained mobile devices, offering a new paradigm for on-device LLM applications.
Abstract: Mobile input method editors (IMEs) are the primary interface for text input, yet they remain constrained to manual typing and struggle to produce personalized text. While lightweight large language models (LLMs) make on-device auxiliary generation feasible, enabling deeply personalized, privacy-preserving, and real-time generative IMEs poses fundamental challenges. To this end, we present HUOZIIME, a personalized on-device IME powered by an LLM. We endow HUOZIIME with initial human-like prediction ability by post-training a base LLM on synthesized personalization data. Notably, a hierarchical memory mechanism is designed to continually capture and leverage user-specific input history. Furthermore, we perform systemic optimizations tailored to on-device LLM-based IME deployment, ensuring efficient and responsive operation under mobile constraints. Experiments demonstrate efficient on-device execution and high-fidelity memory-driven personalization. Code and package are available at https://github.com/Shan-HIT/HuoziIME.
[4] Can Large Language Models Detect Methodological Flaws? Evidence from Gesture Recognition for UAV-Based Rescue Operation Based on Deep Learning
Domonkos Varga
Main category: cs.CL
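TL;DR: This paper examines whether large language models (LLMs) can act as independent analytical agents that identify common methodological flaws, such as data leakage, in machine learning papers. Using a gesture-recognition paper as a case study, the author finds subject-level data leakage in its evaluation protocol. Six advanced LLMs, given no additional context, all identify the problem and attribute it to non-independent training and test splits. The results suggest that LLMs could serve as auxiliary auditing tools for improving research reproducibility.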
Details
Motivation: Reliable evaluation is essential in machine learning research, yet methodological flaws, especially data leakage, continue to undermine the validity of results. Automated, scalable means of scientific auditing are needed.
Method: Taking a gesture-recognition paper with potential data leakage as a case study, the author systematically analyzes its evaluation protocol, then has six state-of-the-art LLMs each read the original paper under an identical prompt and judge whether the evaluation is sound, and finally analyzes the consistency of their diagnoses and the evidence they cite.
Result: All six LLMs identify subject-level data leakage caused by non-independent training and test splits, citing evidence such as overlapping learning curves, a minimal generalization gap, and near-100% accuracy.
Conclusion: LLMs can detect common methodological flaws from the published paper alone. Their highly consistent judgments indicate potential as complementary tools for improving reproducibility and supporting scientific auditing.
Abstract: Reliable evaluation is essential in machine learning research, yet methodological flaws, particularly data leakage, continue to undermine the validity of reported results. In this work, we investigate whether large language models (LLMs) can act as independent analytical agents capable of identifying such issues in published studies. As a case study, we analyze a gesture-recognition paper reporting near-perfect accuracy on a small, human-centered dataset. We first show that the evaluation protocol is consistent with subject-level data leakage due to non-independent training and test splits. We then assess whether this flaw can be detected independently by six state-of-the-art LLMs, each analyzing the original paper without prior context using an identical prompt. All models consistently identify the evaluation as flawed and attribute the reported performance to non-independent data partitioning, supported by indicators such as overlapping learning curves, minimal generalization gap, and near-perfect classification results. These findings suggest that LLMs can detect common methodological issues based solely on published artifacts. While not definitive, their consistent agreement highlights their potential as complementary tools for improving reproducibility and supporting scientific auditing.
[5] Decoupling Scores and Text: The Politeness Principle in Peer Review
Yingxuan Wen
Main category: cs.CL
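TL;DR: This paper studies how authors interpret peer review feedback. It finds that numerical scores predict paper acceptance more accurately than review text, because the Politeness Principle leads review text to mask the true rejection signal.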
Details
Motivation: Authors often struggle to interpret peer review feedback correctly: polite comments can raise false hope, and specific low scores can cause confusion.
Method: The author builds a dataset of over 30,000 ICLR submissions from 2021 to 2025, compares acceptance prediction from numerical scores against prediction from review text, and analyzes the gap in terms of score distribution characteristics and review sentiment.
Result: Score-based models reach 91% accuracy, while text-based models peak at only 81%. The misclassified samples have score distributions with high kurtosis and negative skewness. Reviews of rejected papers contain more positive than negative words, reflecting the Politeness Principle.
Conclusion: Numerical scores reflect reviewers' true intent more reliably than review text. Linguistic politeness weakens the signal in textual feedback and affects authors' judgment.
Abstract: Authors often struggle to interpret peer review feedback, deriving false hope from polite comments or feeling confused by specific low scores. To investigate this, we construct a dataset of over 30,000 ICLR 2021-2025 submissions and compare acceptance prediction performance using numerical scores versus text reviews. Our experiments reveal a significant performance gap: score-based models achieve 91% accuracy, while text-based models reach only 81% even with large language models, indicating that textual information is considerably less reliable. To explain this phenomenon, we first analyze the 9% of samples that score-based models fail to predict, finding their score distributions exhibit high kurtosis and negative skewness, which suggests that individual low scores play a decisive role in rejection even when the average score falls near the borderline. We then examine why text-based accuracy significantly lags behind scores from a review sentiment perspective, revealing the prevalence of the Politeness Principle: reviews of rejected papers still contain more positive than negative sentiment words, masking the true rejection signal and making it difficult for authors to judge outcomes from text alone.
[6] SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models
Tomer Atia, Yehudit Aperstein, Alexander Apartsin
Main category: cs.CL
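TL;DR: This paper presents SeaAlert, a large language model (LLM)-based framework for robust analysis of maritime distress voice communications. A synthetic data generation pipeline mitigates the scarcity of labeled real-world data and improves parsing under noise, ASR errors, and non-standard phrasing.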
Details
Motivation: Maritime VHF distress voice communications are safety-critical, but they are brief, noisy, phrased irregularly under stress, and degraded by ASR errors, all of which severely hinder reliable automatic analysis.
Method: The SeaAlert framework uses an LLM to generate diverse synthetic distress messages (including hard cases in which standard codewords are omitted or replaced), then applies speech synthesis, simulated VHF channel noise, and ASR transcription to build a noisy-transcript dataset that is close to real conditions and supports downstream analysis.
Result: The paper builds a synthetic data generation pipeline for maritime distress communications. The generated data is realistic and diverse and can support robust LLM analysis under noise and non-standard phrasing.
Conclusion: SeaAlert offers a workable paradigm for safety-critical speech understanding in minority-language, low-resource, high-noise settings, and it supports the effectiveness of synthetic-data-driven LLM methods in maritime NLP.
Abstract: Maritime distress communications transmitted over very high frequency (VHF) radio are safety-critical voice messages used to report emergencies at sea. Under the Global Maritime Distress and Safety System (GMDSS), such messages follow standardized procedures and are expected to convey essential details, including vessel identity, position, nature of the distress, and required assistance. In practice, however, automatic analysis remains difficult because distress messages are often brief, noisy, and produced under stress, may deviate from the prescribed format, and are further degraded by automatic speech recognition (ASR) errors caused by channel noise and speaker stress. This paper presents SeaAlert, an LLM-based framework for robust analysis of maritime distress communications. To address the scarcity of labeled real-world data, we develop a synthetic data generation pipeline in which an LLM produces realistic and diverse maritime messages, including challenging variants in which standard distress codewords are omitted or replaced with less explicit expressions. The generated utterances are synthesized into speech, degraded with simulated VHF noise, and transcribed by an ASR system to obtain realistic noisy transcripts.
[7] How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge, Bowen Li, He Du, Kai Chen, Qipeng Guo
Main category: cs.CL
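TL;DR: This paper proposes the TESSY framework, in which teacher and student models cooperate to generate data. It addresses the drop in fine-tuning performance caused by stylistic mismatch between teacher-generated data and the student model, and it substantially improves the student on code generation tasks.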
Details
Motivation: Supervised fine-tuning (SFT) on synthetic data from a stronger teacher model often fails to improve emerging reasoning models such as Qwen3-8B and can even hurt performance, mainly because of substantial stylistic divergence between teacher-generated data and the student model's distribution.
Method: The Teacher-Student Cooperation Data Synthesis framework (TESSY) has the teacher and student models alternately generate style and non-style tokens, producing synthetic sequences that keep the teacher's strong reasoning while matching the student's stylistic distribution.
Result: In code generation experiments with GPT-OSS-120B as teacher and Qwen3-8B as student, fine-tuning on conventional teacher-generated data lowers LiveCodeBench-Pro and OJBench by 3.25% and 10.02%, respectively, whereas TESSY raises them by 11.25% and 6.68%.
Conclusion: Stylistic consistency is a key factor in whether synthetic data works for SFT. TESSY's cooperative generation bridges the teacher-student style gap and markedly improves the student's reasoning ability.
Abstract: A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher-generated data and the distribution of the student as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the distribution of the student. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.
[8] EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
Naman Ahuja, Saniya Mulla, Muhammad Ali Khan, Zaryab Bin Riaz, Kaneez Zahra Rubab Khakwani, Mohamad Bassam Sonbol, Irbaz Bin Riaz, Vivek Gupta
Main category: cs.CL
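TL;DR: EviSearch is a multi-agent system that automatically builds ontology-aligned evidence tables from native clinical trial PDFs and provides traceable provenance for every cell, so that clinicians can review and correct the results.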
Details
Motivation: Manual construction of clinical evidence tables is slow, error-prone, and hard to audit. The goal is to improve the accuracy, traceability, and clinical usability of evidence extraction for systematic reviews.
Method: A multi-agent architecture: a PDF-query agent that preserves the original layout and figures; a retrieval-guided search agent; a reconciliation module that forces page-level verification when the agents disagree; and logging of reconciliation decisions and human edits throughout to generate supervision signals.
Result: On a clinician-annotated benchmark of oncology trials, EviSearch substantially improves extraction accuracy over strong parsed-text baselines and achieves comprehensive provenance attribution coverage. It also supports iterative model improvement.
Conclusion: EviSearch provides a safe, auditable, interactive LLM-driven extraction approach for living systematic reviews in evidence-based medicine and can reduce the manual curation burden.
Abstract: We present EviSearch, a multi-agent extraction system that automates the creation of ontology-aligned clinical evidence tables directly from native trial PDFs while guaranteeing per-cell provenance for audit and human verification. EviSearch pairs a PDF-query agent (which preserves rendered layout and figures) with a retrieval-guided search agent and a reconciliation module that forces page-level verification when agents disagree. The pipeline is designed for high-precision extraction across multimodal evidence sources (text, tables, figures) and for generating reviewer-actionable provenance that clinicians can inspect and correct. On a clinician-curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy relative to strong parsed-text baselines while providing comprehensive attribution coverage. By logging reconciler decisions and reviewer edits, the system produces structured preference and supervision signals that bootstrap iterative model improvement. EviSearch is intended to accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM-based extraction into evidence synthesis pipelines.
[9] Hierarchical Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text
Filippo Morbiato, Markus Keller, Priya Nair, Luca Romano
Main category: cs.CL
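TL;DR: This paper proposes H-TechniqueRAG, a hierarchical retrieval-augmented generation framework that incorporates the tactic-technique hierarchy of ATT&CK, markedly improving the accuracy, efficiency, and interpretability of mapping CTI text to ATT&CK technique IDs.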
Details
Motivation: Existing RAG-based methods ignore the hierarchy between tactics and techniques in the MITRE ATT&CK framework and use flat retrieval, which leads to low efficiency, limited precision, and poor interpretability.
Method: A two-stage hierarchical retrieval mechanism (tactics first, then techniques), combined with a tactic-aware reranking module and a hierarchy-constrained context organization strategy, exploits the ATT&CK hierarchy as a prior and mitigates LLM context overload.
Result: Across three CTI datasets, F1 improves by 3.8%, inference latency drops by 62.4%, and LLM API calls fall by 60%. Cross-domain generalization and the interpretability of decision paths also improve.
Conclusion: Injecting the ATT&CK hierarchy into RAG as a strong inductive bias delivers performance, efficiency, and interpretability together, offering a new paradigm for automated CTI analysis.
Abstract: Mapping Cyber Threat Intelligence (CTI) text to MITRE ATT&CK technique IDs is a critical task for understanding adversary behaviors and automating threat defense. While recent Retrieval-Augmented Generation (RAG) approaches have demonstrated promising capabilities in this domain, they fundamentally rely on a flat retrieval paradigm. By treating all techniques uniformly, these methods overlook the inherent taxonomy of the ATT&CK framework, where techniques are structurally organized under high-level tactics. In this paper, we propose H-TechniqueRAG, a novel hierarchical RAG framework that injects this tactic-technique taxonomy as a strong inductive bias to achieve highly efficient and accurate annotation. Our approach introduces a two-stage hierarchical retrieval mechanism: it first identifies the macro-level tactics (the adversary's technical goals) and subsequently narrows the search to techniques within those tactics, effectively reducing the candidate search space by 77.5%. To further bridge the gap between retrieval and generation, we design a tactic-aware reranking module and a hierarchy-constrained context organization strategy that mitigates LLM context overload and improves reasoning precision. Comprehensive experiments across three diverse CTI datasets demonstrate that H-TechniqueRAG not only outperforms the state-of-the-art TechniqueRAG by 3.8% in F1 score, but also achieves a 62.4% reduction in inference latency and a 60% decrease in LLM API calls.
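The two-stage retrieval described above (tactics first, then only the techniques under the retained tactics) can be sketched as follows. This is a toy illustration, not the H-TechniqueRAG implementation: the catalog is heavily abbreviated with paraphrased descriptions, and the token-overlap score is a hypothetical stand-in for whatever retriever and reranker the authors use.

```python
def overlap(query, text):
    """Toy relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Toy, abbreviated catalog; descriptions are paraphrased for illustration only.
TACTICS = {
    "Initial Access": "adversary gains entry into the network via phishing email or exposed application",
    "Execution": "adversary runs malicious code or script commands on a system",
    "Credential Access": "adversary steals account passwords or credential hashes",
}
TECHNIQUES = {
    "Initial Access": {"T1566": "phishing email with malicious attachment or link",
                       "T1190": "exploit public-facing application vulnerability"},
    "Execution": {"T1059": "command and scripting interpreter such as powershell script",
                  "T1204": "user execution of malicious file"},
    "Credential Access": {"T1003": "os credential dumping of password hashes",
                          "T1110": "brute force password guessing"},
}

def hierarchical_retrieve(query, top_tactics=1, top_techniques=2):
    # Stage 1: rank tactics and keep only the best few.
    ranked = sorted(TACTICS, key=lambda t: overlap(query, TACTICS[t]), reverse=True)
    kept = ranked[:top_tactics]
    # Stage 2: rank only the techniques that belong to the kept tactics.
    candidates = [(tid, desc) for tactic in kept
                  for tid, desc in TECHNIQUES[tactic].items()]
    candidates.sort(key=lambda c: overlap(query, c[1]), reverse=True)
    return kept, [tid for tid, _ in candidates[:top_techniques]]

query = "the attacker sent a phishing email with a malicious attachment"
tactics, techniques = hierarchical_retrieve(query)
print(tactics, techniques)
```

Stage 2 scores only two of the six techniques here, which is the source of the search-space reduction: the saving grows with the number of tactics pruned in stage 1, at the cost of missing a technique whenever its tactic is ranked too low.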
Further analysis reveals that our hierarchical structural priors equip the model with superior cross-domain generalization and provide security analysts with highly interpretable, step-by-step decision paths.
[10] Chinese Essay Rhetoric Recognition Using LoRA, In-context Learning and Model Ensemble
Yuxuan Lai, Xiajing Wang, Chen Zheng
Main category: cs.CL
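TL;DR: This paper applies large language models (LLMs) with LoRA fine-tuning and in-context learning to Chinese essay rhetoric recognition, using structured JSON output with keys translated into Chinese, and adds model ensembling for further gains. It places first on all three tracks of the CCL 2025 evaluation.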
Details
Motivation: Rhetoric recognition is a key component of automated essay scoring and helps assess students' linguistic ability and higher-order thinking. Chinese rhetoric recognition still needs more effective AI methods.
Method: LoRA-based fine-tuning of LLMs and in-context learning, with outputs standardized as JSON with Chinese keys, plus exploration of several model ensemble strategies.
Result: The method achieves the best performance on all three tracks of the CCL 2025 Chinese essay rhetoric recognition evaluation, winning first prize.
Conclusion: Combining LoRA fine-tuning, structured prompting, and model ensembling effectively improves LLM performance and interpretability on Chinese rhetoric recognition.
Abstract: Rhetoric recognition is a critical component in automated essay scoring. By identifying rhetorical elements in student writing, AI systems can better assess linguistic and higher-order thinking skills, making it an essential task in the area of AI for education. In this paper, we leverage Large Language Models (LLMs) for the Chinese rhetoric recognition task. Specifically, we explore Low-Rank Adaptation (LoRA) based fine-tuning and in-context learning to integrate rhetoric knowledge into LLMs. We formulate the outputs as JSON to obtain structural outputs and translate keys to Chinese. To further enhance the performance, we also investigate several model ensemble methods. Our method achieves the best performance on all three tracks of CCL 2025 Chinese essay rhetoric recognition evaluation task, winning the first prize.
[11] SAGE Celer 2.6 Technical Card
SAGEA Research Team, Basab Jha, Firoj Paudel, Ujjwal Puri, Adrian Liu, Ethan Henkel, Zhang Yuting, Mateusz Kowalczyk, Mei Huang, Choi Donghyuk, Wang Junhao
Main category: cs.CL
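TL;DR: SAGE Celer 2.6 is SAGEA's new generation of general-purpose large models. It comes in several parameter sizes (5B/10B/27B), is trained with an Inverse Reasoning (IR) mechanism, and has native multimodal capability through an end-to-end vision encoder. It performs well on mathematics, coding, and general intelligence (ACUMEN) benchmarks and on South Asian languages (Nepali, Hindi), while keeping low latency and strong English reasoning.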
Details
Motivation: To address cascading errors and hallucination in complex reasoning, and to strengthen support for South Asian languages (especially the Devanagari script), filling gaps in existing models' regional language coverage and multimodal integration.
Method: Training with the Inverse Reasoning (IR) pipeline so that the model validates its own logic paths; an end-to-end vision encoder for native multimodality; a custom tokenizer for the Devanagari script; and architectural modifications combined with additional pre-training.
Result: High performance on mathematics, coding, and general intelligence benchmarks such as ACUMEN; markedly better Nepali and Hindi understanding without harming English reasoning; low latency and native multimodal capability.
Conclusion: Celer 2.6 is an efficient general-purpose model optimized for South Asian languages that combines strong reasoning with native multimodality, pointing to a new direction in the co-design of regional adaptation and robust reasoning.
Abstract: We introduce SAGE Celer 2.6, the latest in our line of general-purpose Celer models from SAGEA. Celer 2.6 is available in 5B, 10B, and 27B parameter sizes and benefits from extensive architectural modifications and further pre-training on an undisclosed model. Using our Inverse Reasoning (IR) pipeline, SAGEA natively trains Celer 2.6 to validate its own logic paths, minimizing cascading error and hallucination in complex reasoning tasks. Celer 2.6 also boasts natively integrated multimodal functionality with an end-to-end vision encoder to avoid common pitfalls in adapter-based approaches. Celer 2.6 provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN), along with low latency. Most importantly, Celer 2.6 is specifically optimized for South Asian language support, with a custom tokenizer for the Devanagari script and strong performance in both Nepali and Hindi without sacrificing English reasoning ability.
[12] Chronological Knowledge Retrieval: A Retrieval-Augmented Generation Approach to Construction Project Documentation
Ioannis-Aris Kostis, Natalia Sanchiz, Steeve De Schryver, François Denis, Pierre Schaus
Main category: cs.CL
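TL;DR: This paper proposes a conversational system built on a RAG framework for retrieving time-annotated decision histories from construction project meeting minutes. It supports natural-language queries and returns semantically relevant, time-stamped answers.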
Details
Motivation: Decisions in large construction projects evolve continuously and meeting minutes are voluminous. Tracing the history of a specific decision by hand is slow and error-prone, so an efficient, accurate, traceable retrieval method is needed.
Method: A Retrieval-Augmented Generation (RAG) framework combines semantic search with large language models to answer natural-language questions over meeting minutes, ensuring that answers are semantically relevant and explicitly time-annotated. The approach is validated on a real industry dataset (anonymized meeting minutes from a completed project of a large company in Belgium), which is annotated by experts and enriched with designed queries to support system evaluation.
Result: The authors build and validate a time-aware, semantically informed conversational retrieval system, and they release the annotated dataset and an open-source implementation to advance research in this direction.
Conclusion: The RAG framework effectively supports interactive access to time-sensitive decision information in engineering documentation and offers a practical new paradigm for project knowledge management.
Abstract: In large-scale construction projects, the continuous evolution of decisions generates extensive records, most often captured in meeting minutes. Since decisions may override previous ones, professionals often need to reconstruct the history of specific choices. Retrieving such information manually from raw archives is both labor-intensive and error-prone. From a user perspective, we address this challenge by enabling conversational access to the whole set of project meeting minutes. Professionals can pose natural-language questions and receive answers that are both semantically relevant and explicitly time-annotated, allowing them to follow the chronology of decisions. From a technical perspective, our solution employs a Retrieval-Augmented Generation (RAG) framework that integrates semantic search with large language models to ensure accurate and context-aware responses. We demonstrate the approach using an anonymized, industry-sourced dataset of meeting minutes from a completed construction project by a large company in Belgium. The dataset is annotated and enriched with expert-defined queries to support systematic evaluation. Both the dataset and the open-source implementation are made available to the community to foster further research on conversational access to time-annotated project documentation.
[13] Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning
Qi Dong, Ziheng Lin, Ning Ding
Main category: cs.CL
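TL;DR: This paper proposes a stateful, evidence-driven, iterative RAG framework that builds a persistent evidence pool and iteratively refines queries to improve the robustness and stability of question answering.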
Details
Motivation: Existing RAG methods suffer from flat context representations and stateless retrieval, which lead to unstable performance.
Method: Question answering is modeled as progressive evidence accumulation. Retrieved documents are converted into structured reasoning units with relevance and confidence signals and kept in a persistent evidence pool that holds both supportive and non-supportive information. Evidence-driven deficiency analysis identifies information gaps and conflicts, and queries are refined iteratively to guide subsequent retrieval.
Result: The framework consistently outperforms standard RAG and multi-step baselines on multiple question answering benchmarks, accumulates high-quality evidence effectively, and stays stable under substantial retrieval noise.
Conclusion: Statefulness, evidence-driven analysis, and iterative reasoning can markedly improve the robustness and reliability of RAG systems.
Abstract: Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) in external knowledge but often suffers from flat context representations and stateless retrieval, leading to unstable performance. We propose Stateful Evidence-Driven RAG with Iterative Reasoning, a framework that models question answering as a progressive evidence accumulation process. Retrieved documents are converted into structured reasoning units with explicit relevance and confidence signals and maintained in a persistent evidence pool capturing both supportive and non-supportive information. The framework performs evidence-driven deficiency analysis to identify gaps and conflicts and iteratively refines queries to guide subsequent retrieval. This iterative reasoning process enables stable evidence aggregation and improves robustness to noisy retrieval. Experiments on multiple question answering benchmarks demonstrate consistent improvements over standard RAG and multi-step baselines, while effectively accumulating high-quality evidence and maintaining stable performance under substantial retrieval noise.
[14] Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali
Ananda Rimal, Adarsha Rimal
Main category: cs.CL
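TL;DR: This study systematically evaluates the zero-shot and fine-tuned performance of three open-weight models, Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B, on Romanized Nepali. It presents the first rigorous benchmark for the language and confirms the adaptation headroom hypothesis: the model with the weakest zero-shot ability (Llama-3.1-8B) gains the most from fine-tuning.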
Details
Motivation: Romanized Nepali is the dominant medium of informal digital communication in Nepal, yet it is severely under-resourced in the LLM landscape. A comparable, reproducible benchmark of adaptation is needed.
Method: Three open-weight models of comparable size are tested zero-shot and after QLoRA + rsLoRA fine-tuning (r=32, training only about 1% of parameters) on a bilingual instruction dataset of 10,000 samples. Evaluation uses PPL, BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU (five metrics spanning seven measurements).
Result: Zero-shot, none of the three models can generate Romanized Nepali. After fine-tuning, all converge to BERTScore of about 0.75 and chrF++ above 23. Qwen3-8B is the only model whose zero-shot output is semantically relevant, and it leads the structural alignment metrics. Llama-3.1-8B is weakest zero-shot but gains the most from fine-tuning (PPL down by 49.77, BERTScore up by 0.3287).
Conclusion: The work establishes the first rigorous adaptation baseline for Romanized Nepali among comparable open-weight LLMs and confirms the adaptation headroom hypothesis, giving guidance on model choice for iterative low-resource development: Qwen3-8B suits out-of-the-box use, while Llama-3.1-8B suits continued optimization.
Abstract: Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically under-resourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. We evaluate these architectures under zero-shot and fine-tuned settings using a curated bilingual dataset of 10,000 transliterated instruction-following samples. Performance is quantified across five metrics spanning seven measurement dimensions: Perplexity (PPL), BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, capturing fluency, phonetic consistency, and semantic integrity. Models were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) with Rank-Stabilized LoRA (rsLoRA) at rank r=32 on dual NVIDIA Tesla T4 GPUs, training only approximately 1% of each model's parameters in under 27 total GPU-hours. At zero-shot, all three models fail to generate Romanized Nepali, each exhibiting a distinct architecture-specific failure mode. Following fine-tuning, all three resolve these failures and converge to BERTScore approximately 0.75 and chrF++ greater than 23. Overall dimension-wise assessment across ten criteria identifies Qwen3-8B as the overall recommended architecture, being the only model to produce semantically relevant zero-shot output and leading all structural alignment metrics post-SFT.
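To make the rank-32 rsLoRA setting above concrete, the following small sketch computes the trainable parameter count of one LoRA pair and compares the update scaling of standard LoRA (alpha / r) with rank-stabilized LoRA (alpha / sqrt(r)). The value of alpha and the 4096 x 4096 projection size are assumptions for illustration; the abstract states neither.

```python
import math

def lora_param_count(d_in, d_out, r):
    """Trainable parameters of one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scale(alpha, r, rank_stabilized):
    """Standard LoRA scales the update by alpha / r; rsLoRA uses alpha / sqrt(r)."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r

r = 32                # rank reported in the abstract
alpha = 16            # hypothetical: the abstract does not state alpha
d_in = d_out = 4096   # illustrative square projection size

full = d_in * d_out
adapter = lora_param_count(d_in, d_out, r)
print(f"adapter params per matrix: {adapter} ({adapter / full:.2%} of the frozen matrix)")
print(f"LoRA scale:   {lora_scale(alpha, r, rank_stabilized=False):.4f}")
print(f"rsLoRA scale: {lora_scale(alpha, r, rank_stabilized=True):.4f}")
```

The per-matrix fraction shown here is not the model-wide figure: the overall share of trainable parameters depends on which modules receive adapters, which the abstract summarizes only as approximately 1%.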
The adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline, achieves the largest absolute fine-tuning gains in PPL (Delta = -49.77) and BERTScore (Delta = +0.3287), making it the preferred choice for iterative low-resource development pipelines. This work establishes the first rigorous baseline for Romanized Nepali adaptation in comparable-sized open-weight LLMs.
[15] Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations
Ziyin Zhou, Jianyi Zhang, Xu ji, Yilong Li, Jiameng Han, Zhangchi Zhao
Main category: cs.CL
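TL;DR: This paper proposes the CRVA-TGRAG framework, a two-stage method (improved retrieval plus teacher-guided preference optimization) that addresses the knowledge conflicts and hallucinations LLMs exhibit in CVE vulnerability analysis because of stale knowledge, improving retrieval accuracy for the latest vulnerabilities and answer precision.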
Details
Motivation: LLMs used for cybersecurity vulnerability analysis suffer from lagging knowledge updates: more than 30,000 CVEs were modified or updated over the past decade, so training data diverges from current knowledge, causing knowledge conflicts, factual errors, and hallucinations.
Method: The two-stage CRVA-TGRAG framework: (1) in the retrieval stage, parent document segmentation and a retrieval strategy that fuses semantic similarity with inverted indexing; (2) in the generation stage, teacher-guided preference optimization to fine-tune the LLM.
Result: Experiments show higher accuracy in retrieving the latest CVEs than external knowledge bases, and the method mitigates the knowledge conflicts and inconsistencies that arise from relying on the LLM alone.
Conclusion: By combining better RAG retrieval quality with the benefits of preference fine-tuning, CRVA-TGRAG markedly improves LLM reliability and accuracy when vulnerability knowledge changes over time.
Abstract: Large Language Models (LLMs) are essential for analyzing and addressing vulnerabilities in cybersecurity. However, of the more than 200,000 vulnerabilities discovered in the past decade, more than 30,000 have been changed or updated. This necessitates frequent updates to the training datasets and internal knowledge bases of LLMs to maintain knowledge consistency. In this paper, we focus on the problem of knowledge discrepancy and conflict within CVE (Common Vulnerabilities and Exposures) detection and analysis. This problem hinders LLMs' ability to retrieve the latest knowledge from original training datasets, leading to knowledge conflicts, fabrications of factually incorrect results, and generation hallucinations. To address this problem, we propose an innovative two-stage framework called CRVA-TGRAG (Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generation). First, to improve document retrieval accuracy during the retrieval stage, we utilize Parent Document Segmentation and an ensemble retrieval scheme based on semantic similarity and inverted indexing. Second, to enhance LLMs' capabilities based on the retrieval of the CVE dataset in the generation stage, we employ a teacher-guided preference optimization technique to fine-tune LLMs. Our framework not only enhances the quality of content retrieval through RAG but also leverages the advantages of preference fine-tuning in LLMs to answer questions more effectively and precisely. Experiments demonstrate our method achieves higher accuracy in retrieving the latest CVEs compared to external knowledge bases.
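The ensemble retrieval idea above, combining a semantic-similarity signal with an inverted index, can be sketched as a weighted score fusion. Everything concrete here is an assumption: the records are placeholders rather than real CVE entries, bag-of-words cosine similarity stands in for a dense semantic retriever, and the equal-weight linear fusion is one simple choice that the abstract does not confirm.

```python
import math
from collections import Counter, defaultdict

# Hypothetical records; IDs and descriptions are placeholders, not real CVE entries.
DOCS = {
    "CVE-0000-0001": "remote code execution in logging library via crafted lookup string",
    "CVE-0000-0002": "sql injection in login form allows authentication bypass",
    "CVE-0000-0003": "buffer overflow in image parser leads to remote code execution",
}

def tokens(text):
    return text.lower().split()

# Inverted index: token -> set of document ids containing it (the id itself is indexed too).
INDEX = defaultdict(set)
for doc_id, text in DOCS.items():
    for tok in set(tokens(text)) | {doc_id.lower()}:
        INDEX[tok].add(doc_id)

def keyword_scores(query):
    """Fraction of query tokens that hit each document through the inverted index."""
    q = set(tokens(query))
    hits = Counter(doc_id for tok in q for doc_id in INDEX.get(tok, ()))
    return {d: hits[d] / max(len(q), 1) for d in DOCS}

def cosine_scores(query):
    """Bag-of-words cosine similarity, a stand-in for a dense semantic retriever."""
    q = Counter(tokens(query))
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    out = {}
    for d, text in DOCS.items():
        v = Counter(tokens(text))
        dot = sum(q[t] * v[t] for t in q)
        norm = q_norm * math.sqrt(sum(c * c for c in v.values()))
        out[d] = dot / norm if norm else 0.0
    return out

def ensemble_rank(query, w_semantic=0.5, w_keyword=0.5):
    """Rank documents by a weighted sum of the two retrievers' scores."""
    sem, kw = cosine_scores(query), keyword_scores(query)
    fused = {d: w_semantic * sem[d] + w_keyword * kw[d] for d in DOCS}
    return sorted(fused, key=fused.get, reverse=True)

print(ensemble_rank("sql injection authentication bypass"))
print(ensemble_rank("CVE-0000-0003"))
```

The second query shows why the two signals complement each other: an exact identifier has no textual similarity to any description, so only the inverted index can surface the right record.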
In conclusion, our framework significantly mitigates potential knowledge conflicts and inconsistencies that may arise from relying solely on LLMs for knowledge retrieval.
[16] Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters
Bryan Sanchez
Main category: cs.CL
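TL;DR: This paper proposes a lightweight post-transformer adapter with only 786K parameters. Trained on frozen hidden states, it mitigates the suppression of factual log-probabilities on politically sensitive topics in alignment-tuned language models without harming other knowledge. Key findings: the adapter must be applied only at the current prediction position during generation to keep text coherent; hidden-state intervention works better than logit-space intervention; and the work uncovers an undocumented silent gradient bug in Apple MLX.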
Details
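Motivation: Alignment-tuned language models often suppress factual log-probabilities on politically sensitive topics even though their hidden layers retain the relevant knowledge. A low-overhead method that restores factual expression without knowledge regression is needed.
Method: Design and train an ultra-light post-transformer adapter (about 0.02% of the base model's parameters) on the hidden states of a frozen base model; use anchored training to prevent forgetting; compare gated (SwiGLU) and ungated (linear bottleneck) structures; systematically evaluate application modes (all positions versus last position), model scales (Qwen3-4B/8B/14B), and fact splits; and diagnose and fix a silent gradient bug in the MLX framework.
Result: The adapter corrects log-probability suppression on 31 ideology-discriminating facts. It memorizes all 15 training facts and generalizes to 11 to 39% of 16 held-out facts (5 random splits), with zero knowledge regression. Last-position application yields coherent, less censored text, whereas all-position and logit-space adapters fail. SwiGLU and linear structures show no statistically significant difference (Fisher exact test, p > 0.09). The silent zero-gradient bug involving nn.value_and_grad in MLX is located and fixed.
Conclusion: A lightweight hidden-state adapter is an effective and safe way to correct factual suppression in aligned models; its success depends on applying it at the right position and on correct gradient computation. Beyond a practical technique, the work shows how seriously a framework-level bug can affect empirical research.
Abstract: Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11 to 39% of 16 held-out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log-probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last-position-only), the adapter produces coherent, less censored text. A logit-space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden-state intervention is the correct level for generation correction.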
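The difference between the two application modes above can be shown with a toy NumPy sketch. It assumes an ungated linear-bottleneck adapter with a residual connection and random, untrained weights; the sizes are hypothetical and the paper's exact architecture is not reproduced here. The point is only the indexing: last-position-only application leaves every earlier position's hidden state untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_bottleneck = 6, 16, 4   # hypothetical toy sizes

# Ungated linear-bottleneck adapter with a residual connection (random, untrained weights).
W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.1
W_up = rng.standard_normal((d_bottleneck, d_model)) * 0.1

def adapter(h):
    """Residual correction applied to hidden states of shape (..., d_model)."""
    return h + (h @ W_down) @ W_up

def apply_all_positions(hidden):
    return adapter(hidden)

def apply_last_position_only(hidden):
    out = hidden.copy()
    out[-1] = adapter(hidden[-1])   # only the current prediction position is modified
    return out

hidden = rng.standard_normal((seq_len, d_model))  # stand-in for frozen final hidden states
all_pos = apply_all_positions(hidden)
last_only = apply_last_position_only(hidden)

print("earlier positions unchanged:", np.allclose(last_only[:-1], hidden[:-1]))
print("last position matches full adapter:", np.allclose(last_only[-1], all_pos[-1]))
```

Both modes produce the same vector at the position being predicted, so the next-token distribution for a single step is identical; they differ only in whether the representations of already-processed positions are also altered.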
A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern nn.value_and_grad(model, fn)(model.parameters()) returns zero gradients without error; the correct pattern nn.value_and_grad(model, fn)(model, data) resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.
[17] QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment
Mohammad AL-Smadi
Main category: cs.CL
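TL;DR: This paper presents a unified system for two subtasks of the ArchEHR-QA shared task: answer generation (Subtask 3) and evidence sentence alignment (Subtask 4). Subtask 3 uses two-stage QLoRA fine-tuning of Qwen3-4B, and Subtask 4 builds a weighted ensemble of three retrieval methods. Both subtasks are limited by having only 20 annotated samples, which highlights the need for data augmentation.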
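A minimal sketch of how a weighted ensemble of three retrieval scores might select evidence sentences. The weights, the min-max normalization, the threshold on the combined score, and the function names are illustrative assumptions, not the authors' configuration.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_select(bm25, tfidf, cross, weights=(0.3, 0.3, 0.4), threshold=0.5):
    """Combine per-sentence scores from three retrievers and keep sentences
    whose weighted score reaches the threshold.

    weights and threshold are hypothetical values chosen for illustration.
    Returns (selected sentence indices, combined scores).
    """
    columns = [min_max(bm25), min_max(tfidf), min_max(cross)]
    combined = [
        sum(w * col[i] for w, col in zip(weights, columns))
        for i in range(len(bm25))
    ]
    selected = [i for i, s in enumerate(combined) if s >= threshold]
    return selected, combined


# Toy scores for three candidate note sentences.
selected, combined = ensemble_select(
    bm25=[2.0, 8.0, 5.0],
    tfidf=[0.1, 0.9, 0.5],
    cross=[0.2, 0.8, 0.9],
)
```

Normalizing each retriever before weighting keeps the unbounded BM25 scale from dominating the bounded cosine and cross-encoder scores.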
Details
Motivation: The ArchEHR-QA shared task lacks sufficient annotated data (only 20 training examples), which makes it hard for models to distinguish relevant from irrelevant sentences in clinical text; this calls for unified modeling and an exploration of efficient fine-tuning and retrieval strategies. Method: Subtask 3: two-stage QLoRA on Qwen3-4B in 4-bit NF4 quantization, first clinical-domain adaptation on emrQA-MedSQuAD (30,000 samples), then task-style fine-tuning on the 20 development-set cases. Subtask 4: a weighted retrieval ensemble combining BM25 (relative thresholding), TF-IDF cosine similarity, and a fine-tuned cross-encoder. Result: Subtask 3 achieves an overall score of 32.87 on test-2026 (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04); Subtask 4 reaches a micro-F1 of 67.16 on the 100-case test set; experiments indicate that the bottleneck shared by both tasks is the difficulty of judging relevance from so few samples. Conclusion: Two-stage QLoRA and a multi-strategy retrieval ensemble can handle few-shot clinical question answering and evidence alignment, but the fundamental bottleneck is the extreme scarcity of annotated data; targeted data augmentation is the top priority for future work. Abstract: We present a unified system addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA Shared Task. For Subtask 3, we apply two-stage Quantised Low-Rank Adaptation (QLoRA) to Qwen3-4B loaded in 4-bit NF4 quantisation: first on 30,000 samples from the emrQA-MedSQuAD corpus to establish clinical domain competence, then on the 20 annotated development cases to learn the task-specific output style. Our system achieves an overall score of 32.87 on the official test-2026 split (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04). For Subtask 4, we develop a weighted ensemble of three retrieval methods - BM25 with relative thresholding, TF-IDF cosine similarity, and a fine-tuned cross-encoder - to identify note sentences supporting a given gold answer, achieving a micro-F1 of 67.16 on the 100-case test set. Experiments reveal that both subtasks expose the same fundamental challenge: 20 annotated training cases are insufficient to distinguish relevant from irrelevant clinical sentences, pointing to data augmentation as the highest-leverage future direction.
[18] Listen, Correct, and Feed Back: Spoken Pedagogical Feedback Generation
Junhong Liang,Yifan Lu,Ekaterina Kochmar,Fajri Koto
Main category: cs.CL
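TL;DR: This paper introduces the SPFG dataset for generating learner-oriented spoken grammatical error correction and pedagogical feedback, and compares supervised fine-tuning (SFT) with preference-alignment methods (DPO/KTO) on jointly generating corrections and feedback. SFT proves more stable and effective, and correction quality is only weakly correlated with feedback quality.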
Details
Motivation: Research on grammatical error correction (GEC) and explanation (GEE) has advanced quickly, but real teaching scenarios need pedagogical feedback that is actionable, matched to the learner's level, and encouraging; current approaches lack spoken-language, teacher-style, human-verified feedback data and corresponding modeling methods. Method: Build the SPFG dataset from the Speak & Improve Challenge 2025 corpus, containing fluency-oriented transcriptions, GEC targets, and human-verified teacher-style feedback (including preference pairs); in a Spoken Grammatical Error Correction (SGEC) setting, use three instruction-tuned LLMs (Qwen2.5, Llama-3.1, and GLM-4) to compare SFT with preference-based alignment (DPO/KTO) on jointly generating corrections and feedback. Result: SFT gives the most consistent improvements in both correction and feedback quality; DPO/KTO gains are smaller or unstable; correction quality and feedback quality are only weakly correlated; implementations of all models and methods are open-sourced. Conclusion: Generating learner-oriented spoken pedagogical feedback requires high-quality, teacher-style, human-verified data; SFT remains the more reliable base training paradigm for now; future work should explore decoupling correction and feedback modeling and strengthening the coordination between them. Abstract: Grammatical error correction (GEC) and explanation (GEE) have made rapid progress, but real teaching scenarios also require learner-friendly pedagogical feedback that is actionable, level-appropriate, and encouraging. We introduce SPFG (Spoken Pedagogical Feedback Generation), a dataset built on the Speak & Improve Challenge 2025 corpus, pairing fluency-oriented transcriptions with GEC targets and human-verified teacher-style feedback, including preferred/rejected feedback pairs for preference learning. We study a transcript-based Spoken Grammatical Error Correction (SGEC) setting and evaluate three instruction-tuned LLMs (Qwen2.5, Llama-3.1, and GLM-4), comparing supervised fine-tuning (SFT) with preference-based alignment (using DPO and KTO) for jointly generating corrections and feedback. Results show that SFT provides the most consistent improvements, while DPO/KTO yield smaller or mixed gains, and that correction quality and feedback quality are weakly coupled. Our implementation is available at https://github.com/Skywalker-Harrison/spfg.
[19] An Underexplored Frontier: Large Language Models for Rare Disease Patient Education and Communication -- A scoping review
Zaifu Zhan,Yu Hou,Kai Yu,Min Zeng,Anita Burgun,Xiaoyi Chen,Rui Zhang
Main category: cs.CL
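TL;DR: Through a scoping review, this paper analyzes 12 studies published between January 2022 and March 2026 on the use of large language models (LLMs) for rare disease patient education and communication. Current research relies mostly on general-purpose models (such as ChatGPT), focuses on static question answering, and lacks real-world scenarios, multilingual support, and patient-centered evaluation dimensions; overall the field is at an early stage.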
Details
Motivation: Rare disease patients face complex care pathways, scarce clinical expertise, and unmet long-term communication needs; LLMs show promise for patient education, but their actual application in rare diseases remains unclear. Method: Conduct a scoping review, systematically searching major databases for relevant studies published between January 2022 and March 2026 and including 12 studies; extract study characteristics, application scenarios, model usage, and evaluation methods, and synthesize them with descriptive and qualitative analyses. Result: Existing studies are concentrated on recent, general-purpose LLMs (especially ChatGPT); they mainly address question answering over manually curated question sets; real-world data and longitudinal communication scenarios are rarely used; evaluation emphasizes accuracy and neglects patient-centered dimensions such as readability, empathy, and communication quality; multilingual support is almost absent. Conclusion: LLM applications in rare diseases are still at an early stage; patient-centered design, domain-adaptation methods, and real-world deployment need to be strengthened to achieve safe, adaptive, and effective communication support. Abstract: Rare diseases affect over 300 million people worldwide and are characterized by complex care pathways, limited clinical expertise, and substantial unmet communication needs throughout the long patient journey. Recent advances in large language models (LLMs) offer new opportunities to support patient education and communication, yet their application in rare diseases remains unclear. We conducted a scoping review of studies published between January 2022 and March 2026 across major databases, identifying 12 studies on LLM-based rare disease patient education and communication. Data were extracted on study characteristics, application scenarios, model usage, and evaluation methods, and synthesized using descriptive and qualitative analyses. The literature is highly recent and dominated by general-purpose models, particularly ChatGPT. Most studies focus on patient question answering using curated question sets, with limited use of real-world data or longitudinal communication scenarios. Evaluations are primarily centered on accuracy, with limited attention to patient-centered dimensions such as readability, empathy, and communication quality. Multilingual communication is rarely addressed. Overall, the field remains at an early stage. Future research should prioritize patient-centered design, domain-adapted methods, and real-world deployment to support safe, adaptive, and effective communication in rare diseases.
[20] Internal Knowledge Without External Expression: Probing the Generalization Boundary of a Classical Chinese Language Model
Jiuting Chen,Yuan Lian,Hao Wu,Tianqi Huang,Hiroshi Sasaki,Makoto Kouno,Jongil Choi
Main category: cs.CL
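TL;DR: This paper trains a 318M-parameter Transformer language model on pure Classical Chinese. OOD testing shows that the model internally distinguishes real from fabricated historical events (evidence of factual encoding) but never expresses uncertainty in its generated text. This dissociation between internal knowledge and external expression holds across languages and model scales, indicating that metacognitive expression (such as saying "I don't know") does not emerge spontaneously from language modeling alone and requires explicit training signals such as RLHF.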
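A minimal sketch of a perplexity jump ratio between fabricated and real inputs, computed from per-token log-probabilities. Averaging per-item perplexities before taking the ratio is an assumption made here for illustration; the paper's exact aggregation is not stated in this summary, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one text from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def jump_ratio(real_items, fabricated_items):
    """Mean perplexity on fabricated items divided by mean perplexity on real items.

    Each item is a list of per-token log-probabilities for one text.
    (Assumed aggregation: arithmetic mean of per-item perplexities.)
    """
    real = sum(perplexity(lp) for lp in real_items) / len(real_items)
    fabricated = sum(perplexity(lp) for lp in fabricated_items) / len(fabricated_items)
    return fabricated / real


# Toy example: every real token has probability 0.5, every fabricated token 0.25.
real_items = [[math.log(0.5)] * 4]
fabricated_items = [[math.log(0.25)] * 4]
ratio = jump_ratio(real_items, fabricated_items)
```

A ratio above 1 means the model finds the fabricated texts more surprising than the real ones, which is the internal signal the study contrasts with the absence of hedging in generated text.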
Details
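Motivation: Investigate whether a large language model can, without dedicated training, spontaneously develop the ability to recognize and express that an input is unknown, that is, metacognition. Method: Train a 318M-parameter Transformer from scratch on a pure Classical Chinese corpus (1.56 billion tokens); design systematic OOD tests (real, fabricated, and semi-fabricated historical events) to evaluate internal uncertainty (measured by perplexity) and external expression of uncertainty (measured by the frequency of Classical Chinese epistemic markers); replicate across Chinese, English, and Japanese and across multiple model scales. Result: Internal perplexity clearly separates real from fabricated events (semi-fabricated events reach 4.24x the perplexity of real ones, p = 1.1e-16), demonstrating factual encoding; in generated text, however, the model uses uncertainty-marking vocabulary less often (3.5% on OOD vs. 8.3% in-distribution, p = 0.023); this dissociation is stable across languages and models, and the frequency of uncertainty expression is determined entirely by rhetorical conventions in the training data (Classical Chinese shows a "humility paradox", while Japanese almost never hedges). Conclusion: Metacognitive expression (such as actively admitting ignorance) is not a natural product of language modeling and cannot be obtained through scaling or data diversity alone; it requires dedicated training with explicit supervision signals such as reinforcement learning from human feedback (RLHF). Abstract: We train a 318M-parameter Transformer language model from scratch on a curated corpus of 1.56 billion tokens of pure Classical Chinese, with zero English characters or Arabic numerals. Through systematic out-of-distribution (OOD) testing, we investigate whether the model can distinguish known from unknown inputs, and crucially, whether it can express this distinction in its generated text. We find a clear dissociation between internal and external uncertainty. Internally, the model exhibits a perplexity jump ratio of 2.39x between real and fabricated historical events (p = 8.9e-11, n = 92 per group), with semi-fabricated events (real figures + fictional events) showing the highest perplexity (4.24x, p = 1.1e-16), demonstrating genuine factual encoding beyond syntactic pattern matching. Externally, however, the model never learns to express uncertainty: classical Chinese epistemic markers appear at lower rates for OOD questions (3.5%) than for in-distribution questions (8.3%, p = 0.023), reflecting rhetorical conventions rather than genuine metacognition. We replicate both findings across three languages (Classical Chinese, English, Japanese), three writing systems, and eight models from 110M to 1.56B parameters. We further show that uncertainty expression frequency is determined entirely by training data conventions, with Classical Chinese models showing a "humility paradox" (more hedging for known topics), while Japanese models almost never hedge.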
We argue that metacognitive expression -- the ability to say "I don't know" -- does not emerge from language modeling alone and requires explicit training signals such as RLHF.
[21] Attention to Mamba: A Recipe for Cross-Architecture Distillation
Abhinav Moudgil,Ningyuan Huang,Eeshan Gunesh Dhekane,Pau Rodríguez,Luca Zappella,Federico Danieli
Main category: cs.CL
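TL;DR: This paper proposes a two-stage knowledge distillation method that effectively distills a Transformer (such as Pythia-1B) into a pure SSM architecture (Mamba). By using kernel-trick-based linearized attention as an intermediate representation and a principled initialization for Mamba, the distilled Mamba nearly matches the teacher on downstream tasks (perplexity 14.11 vs. 13.86).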
Details
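Motivation: SSMs such as Mamba are efficient at inference but lack a mature pretraining ecosystem, while Transformers have abundant pretrained models that are hard to distill directly into a pure SSM architecture: naive cross-architecture distillation performs poorly and often needs hybrid Attention and SSM blocks. The paper aims to use existing Transformer resources to obtain an efficient, pure-SSM distillation recipe. Method: Propose a two-stage distillation framework: the first stage uses the kernel trick to distill the Transformer into a linearized-attention model; the second stage distills that linearized model into an adapted Mamba with no Attention blocks, for which a principled initialization is designed. Hybrid architectures are avoided throughout, with emphasis on the design of the initialization and the intermediate representation. Result: At 1B-parameter scale, after training on 10B tokens, the distilled Mamba reaches a downstream perplexity of 14.11, close to the 13.86 of the Pythia-1B teacher; large-scale ablations examine sensitivity to sequence mixer architecture, model scaling, total distillation tokens, and the token allocation between the two stages. Conclusion: Principled initialization and linearized attention as an intermediate representation are key to high-quality pure-SSM distillation; the method offers a reproducible, efficient way to transfer Transformer knowledge to SSMs without hybrid blocks. Abstract: State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block.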
Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11, close to the teacher's 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens, varying sequence mixer architecture, a scaling analysis on model sizes and total distillation tokens, and a sensitivity analysis on token allocation between stages.
[22] The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure
David A. Cook
Main category: cs.CL
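TL;DR: This paper proposes the PICCO framework for structuring large language model (LLM) prompt design, comprising five core elements (Persona, Instructions, Context, Constraints, Output), and clarifies the hierarchy among related concepts.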
Details
Motivation: Existing prompt design lacks consistency and systematicity; a unified, reusable reference framework is needed to improve conceptual clarity and engineering discipline. Method: Through a multi-database search and synthesis of 11 existing prompting frameworks, apply a rigorous concept-synthesis approach to construct the five-element PICCO architecture and its supporting conceptual system. Result: A taxonomy that explicitly distinguishes prompt frameworks, elements, generation, techniques, and engineering; the five-element PICCO reference architecture, with each element's function, scope, and relationships defined; and an overview of key techniques, iterative methods, responsible-prompting principles, and future directions. Conclusion: PICCO is a conceptual and methodological contribution that formalizes prompt specification and comparison criteria, not an empirically validated performance-optimization method. Abstract: Large language model (LLM) performance depends heavily on prompt design, yet prompt construction is often described and applied inconsistently. Our purpose was to derive a reference framework for structuring LLM prompts. This paper presents PICCO, a framework derived through a rigorous synthesis of 11 previously published prompting frameworks identified through a multi-database search. The analysis yields two main contributions. First, it proposes a taxonomy that distinguishes prompt frameworks, prompt elements, prompt generation, prompting techniques, and prompt engineering as related but non-equivalent concepts. Second, it derives a five-element reference architecture for prompt generation: Persona, Instructions, Context, Constraints, and Output (PICCO). For each element, we define its function, scope, and relationship to other elements, with the goal of improving conceptual clarity and supporting more systematic prompt design. Finally, to support application of the framework, we outline key concepts relevant to implementation, including prompting techniques (e.g., zero-shot, few-shot, chain-of-thought, ensembling, decomposition, and self-critique, with selected variants), human and automated approaches to iterative prompt engineering, responsible prompting considerations such as security, privacy, bias, and trust, and priorities for future research.
This work is a conceptual and methodological contribution: it formalizes a common structure for prompt specification and comparison, but does not claim empirical validation of PICCO as an optimization method.
[23] Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate
Simiao Ren,Xingyu Shen,Yuchen Zhou,Dennis Ng,Ankit Raj
Main category: cs.CL
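TL;DR: Using the SWE-bench Lite benchmark, this paper empirically tests the popular claim that Chinese prompts use fewer tokens than English ones and finds that it does not hold: Chinese does not generally reduce token consumption, models behave in opposite ways (MiniMax-2.7 costs more with Chinese, GLM-5 costs less), and success rates with Chinese prompts are generally lower than with English. The cost per successful task, which combines token cost and success rate, shows no advantage for Chinese.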
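A minimal sketch of an expected cost per successful task, which combines token consumption with resolution rate. It assumes independent retries until success, so the expected number of attempts is 1 / success_rate; this formula, the function name, the price, and all numbers below are illustrative assumptions, not figures or definitions from the paper.

```python
def expected_cost_per_success(tokens_per_attempt, price_per_token, success_rate):
    """Expected spend to obtain one solved task.

    Assumes each attempt succeeds independently with probability success_rate,
    so the expected number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return tokens_per_attempt * price_per_token / success_rate


# Hypothetical comparison: a prompt language that uses 1.28x the tokens
# and resolves fewer tasks costs more per solved task.
price = 2e-6  # hypothetical price per token
english = expected_cost_per_success(100_000, price, success_rate=0.30)
chinese = expected_cost_per_success(128_000, price, success_rate=0.25)
```

Under this measure a language can look cheaper per attempt yet still cost more per solved task when its success rate is lower.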
Details
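Motivation: A claim circulating on social media and in developer communities holds that Chinese prompts are more token-efficient in LLM coding tasks and can cut costs by up to 40%, which is influencing practice; the claim needs rigorous verification. Method: On the SWE-bench Lite software engineering benchmark, run controlled experiments on several mainstream open- and closed-source LLMs (such as MiniMax-2.7 and GLM-5), comparing token consumption and task success rate under Chinese and English prompts, and compute a combined cost efficiency (expected cost per successful task). Result: (1) Chinese shows no consistent token-efficiency advantage; (2) the change in token cost varies by model (MiniMax-2.7: +28% with Chinese; GLM-5: slightly lower with Chinese); (3) the success rate of Chinese prompts is generally lower than English across the tested models; (4) the combined cost-efficiency measure shows no advantage for Chinese. Conclusion: The effect of language on token cost is highly dependent on model architecture; simply switching to Chinese prompts neither reliably reduces cost nor improves performance, and current evidence does not support Chinese as a general cost-saving strategy. Abstract: A claim has been circulating on social media and practitioner forums that Chinese prompts are more token-efficient than English for LLM coding tasks, potentially reducing costs by up to 40%. This claim has influenced developers to consider switching to Chinese for "vibe coding" to save on API costs. In this paper, we conduct a rigorous empirical study using SWE-bench Lite, a benchmark of software engineering tasks, to evaluate whether this claim of Chinese token efficiency holds up to scrutiny. Our results reveal three key findings: First, the efficiency advantage of Chinese is not observed. Second, token cost varies by model architecture in ways that defy simple assumptions: while MiniMax-2.7 shows 1.28x higher token costs for Chinese, GLM-5 actually consumes fewer tokens with Chinese prompts. Third, and most importantly, we found that the success rate when prompting in Chinese is generally lower than in English across all models we tested. We also measure cost efficiency as expected cost per successful task -- jointly accounting for token consumption and task resolution rate.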
These findings should be interpreted as preliminary evidence rather than a definitive conclusion, given the limited number of models evaluated and the narrow set of benchmarks tested due to resource constraints; they indicate that language effects on token cost are model-dependent, and that practitioners should not expect cost savings or performance gains just by switching their prompt language to Chinese.
[24] CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization
Deep Shah,Sanket Badhe,Nehal Kathrotia,Priyanka Tiwari
Main category: cs.CL
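TL;DR: This paper proposes CROP, which adds response-length regularization to automatic prompt optimization to reduce token consumption and latency in LLM reasoning while maintaining high task accuracy.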
Details
Motivation: Existing automatic prompt optimization frameworks focus only on task accuracy, which leads to verbose reasoning and increases latency and token cost. Method: Propose Cost-Regularized Optimization of Prompts (CROP), which, in addition to standard accuracy feedback, generates textual feedback during optimization to regularize response length, pushing prompts to elicit concise reasoning that contains only what is critical. Result: Evaluated on complex reasoning datasets including GSM8K, LogiQA, and BIG-Bench Hard, CROP reduces token consumption by 80.6% with only a negligible drop in performance. Conclusion: CROP offers a practical solution for deploying efficient, low-cost agentic AI systems in production. Abstract: Large Language Models utilizing reasoning techniques improve task performance but incur significant latency and token costs due to verbose generation. Existing automatic prompt optimization (APO) frameworks target task accuracy exclusively at the expense of generating long reasoning traces. We propose Cost-Regularized Optimization of Prompts (CROP), an APO method that introduces regularization on response length by generating textual feedback in addition to standard accuracy feedback. This forces the optimization process to produce prompts that elicit concise responses containing only critical information and reasoning. We evaluate our approach on complex reasoning datasets, specifically GSM8K, LogiQA and BIG-Bench Hard. We achieved an 80.6% reduction in token consumption while maintaining competitive accuracy, seeing only a nominal decline in performance. This presents a pragmatic solution for deploying token-efficient and cost-effective agentic AI systems in production pipelines.
[25] MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes
Samir Wagle,Reewaj Khanal,Abiral Adhikari
Main category: cs.CL
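TL;DR: This paper presents a multimodal hate speech detection system for Devanagari-script social media memes that combines CLIP with BGE-M3 and adds a dynamically gated cross-modal attention mechanism. It improves performance markedly under low-resource conditions and also reveals that English-centric vision models fail on Devanagari and that standard ensembling degrades with small samples.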
Details
Motivation: Address the compounded challenges of hate speech detection in Devanagari-script social media memes: multimodal structure, linguistic complexity, and extreme data scarcity. Method: Propose a hybrid cross-modal attention fusion architecture: CLIP (ViT-B/32) encodes images, BGE-M3 encodes multilingual text, and 4-head self-attention with a learnable gating network dynamically weights each modality's contribution. Result: On Subtask A, a 5.9% F1-macro improvement over the text-only baseline; English-centric vision models predict at near-random levels on Devanagari script, and standard ensembling degrades severely with very small samples (about 850 per fold) because of correlated overfitting. Conclusion: Explicit cross-modal reasoning is essential for low-resource multimodal hate detection; model selection and ensembling strategy must be adapted to the target script and data scale, and English-dominant solutions cannot be transferred directly. Abstract: Hate speech detection in Devanagari-scripted social media memes presents compounded challenges: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three-class sentiment classification: positive, neutral, negative). We propose a hybrid cross-modal attention fusion architecture that combines CLIP (ViT-B/32) for visual encoding with BGE-M3 for multilingual text representation, connected through 4-head self-attention and a learnable gating network that dynamically weights modality contributions on a per-sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines on Subtask A, while uncovering two unexpected but critical findings: English-centric vision models exhibit near-random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N ≈ 850 per fold) due to correlated overfitting. The code can be accessed at https://github.com/Tri-Yantra-Technologies/MEME-Fusion/
[26] ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
Zhuofeng Li,Yi Lu,Dongfu Jiang,Haoxiang Zhang,Yuyang Bai,Chuan Li,Yu Wang,Shuiwang Ji,Jianwen Xie,Yu Zhang
Main category: cs.CL
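TL;DR: This paper introduces the REVIEWBENCH benchmark and the REVIEWGROUNDER multi-agent framework. By bringing in explicit rubrics and contextual evidence consolidation, it substantially improves the quality of LLM-generated reviews and their agreement with human judgments.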
Details
Motivation: Existing LLM reviewing systems often produce superficial, formulaic comments that lack substantive, evidence-based feedback, mainly because they underuse the explicit rubrics of human reviewing and contextual grounding in existing work. Method: Build the REVIEWBENCH benchmark, generating paper-specific rubrics from official guidelines, paper content, and human reviews; propose the REVIEWGROUNDER multi-agent framework, which splits reviewing into drafting and grounding stages and uses tool calls for targeted evidence consolidation. Result: On REVIEWBENCH, REVIEWGROUNDER (Phi-4-14B drafter + GPT-OSS-120B grounding) outperforms larger and stronger backbones such as GPT-4.1 and DeepSeek-R1-670B across 8 dimensions, with better agreement with human judgments and better rubric compliance. Conclusion: Explicit rubric guidance and contextual evidence grounding are key to improving LLM review quality; REVIEWGROUNDER offers a scalable, evaluable approach to AI-assisted peer review. Abstract: The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper's content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at https://github.com/EigenTom/ReviewGrounder.
[27] EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation
Francesco Andrea Causio,Vittorio De Vita,Olivia Riccomi,Michele Ferramola,Federico Felizzi,Antonio Cristiano,Lorenzo De Mori,Chiara Battipaglia,Melissa Sawaya,Luigi De Angelis,Marcello Di Pumpo,Alessandra Piscitelli,Pietro Eric Risuleo,Alessia Longo,Giulia Vojvodic,Mariapia Vassalli,Bianca Destro Castaniti,Nicolò Scarsi,Manuel Del Medico
Main category: cs.CL
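TL;DR: This study presents EuropeMedQA, the first multilingual, multimodal medical examination dataset drawn from official regulatory exams in Italy, France, Spain, and Portugal. It is designed to evaluate multimodal LLMs on cross-lingual transfer and visual reasoning and to advance more generalizable medical AI.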
Details
Motivation: Current LLMs perform well on English medical exams, but their performance drops on non-English languages and multimodal diagnostic tasks; a multilingual, multimodal benchmark that reflects the complexity of European clinical practice is needed. Method: Following FAIR data principles and SPIRIT-AI guidelines, build the EuropeMedQA dataset with a rigorous curation process and an automated translation pipeline, and evaluate current multimodal LLMs with a zero-shot, strictly constrained prompting strategy. Result: EuropeMedQA is established as the first multilingual, multimodal medical evaluation benchmark covering official medical exams from four countries; it is contamination-resistant and supports evaluation of cross-lingual transfer and visual reasoning. Conclusion: EuropeMedQA provides key infrastructure and an evaluation standard for developing medical AI systems that target real European clinical settings and handle multiple languages and modalities. Abstract: While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.
[28] Tracking the Temporal Dynamics of News Coverage of Catastrophic and Violent Events
Emily Lugos,Maurício Gruppi
Main category: cs.CL
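TL;DR: By analyzing 126,602 online news articles, this study quantifies the temporal and semantic dynamics of reporting on violent and catastrophic events and shows that sudden events follow predictable news-cycle patterns.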
Details
Motivation: Understanding how narratives form, spread, and evolve in public discourse during crises is essential for interpreting how media framing changes. Method: On a large-scale corpus of online news (126,602 articles), quantify narrative change with publication volume, semantic drift, semantic dispersion, and term relevance. Result: Sudden events show structured, predictable news-cycle patterns: coverage surges quickly, semantic drift is pronounced early on, and coverage then declines gradually toward the baseline; the key terms driving these temporal patterns are identified. Conclusion: News cycles follow dynamics that can be modeled, and semantic analysis can reveal the key drivers of narrative evolution, providing quantitative methodological support for crisis communication research. Abstract: The modern news cycle has been fundamentally reshaped by the rapid exchange of information online. As a result, media framing shifts dynamically as new information, political responses, and social reactions emerge. Understanding how these narratives form, propagate, and evolve is essential for interpreting public discourse during moments of crisis. In this study, we examine the temporal and semantic dynamics of reporting for violent and catastrophic events using a large-scale corpus of 126,602 news articles collected from online publishers. We quantify narrative change through publication volume, semantic drift, semantic dispersion, and term relevance. Our results show that sudden events of impact exhibit structured and predictable news-cycle patterns characterized by rapid surges in coverage, early semantic drift, and gradual declines toward the baseline. In addition, our results indicate the terms that are driving the temporal patterns.
[29] LLM Predictive Scoring and Validation: Inferring Experience Ratings from Unstructured Text
Jason Potteiger,Andrew Hong,Ito Zapata
Main category: cs.CL
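TL;DR: This paper uses GPT-4.1 to predict fans' overall game-day experience ratings (0-10) from their open-ended text feedback. Predictions agree closely with actual ratings (67% within ±1 point) but are systematically about 1 point lower; the gap reflects a real difference between the two measures: self-ratings are integrative judgments, whereas AI predictions emphasize salient, emotionally intense, or actionable moments.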
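A minimal sketch of the agreement statistics reported for predicted versus self-reported ratings: exact-match rate, share within one point, and mean signed gap. The function name and the toy ratings are illustrative assumptions, not data from the study.

```python
def agreement(predicted, actual):
    """Agreement between predicted and self-reported ratings on a 0-10 scale.

    Returns the exact-match rate, the share of predictions within one point,
    and the mean signed gap (negative means predictions run lower).
    """
    n = len(predicted)
    diffs = [p - a for p, a in zip(predicted, actual)]
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_1": sum(abs(d) <= 1 for d in diffs) / n,
        "mean_gap": sum(diffs) / n,
    }


# Toy ratings for four hypothetical responses.
stats = agreement(predicted=[7, 8, 5, 9], actual=[8, 8, 7, 10])
```

Reporting the signed gap alongside the within-one-point rate is what separates a systematic offset, like the roughly one-point shortfall described here, from random disagreement.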
Details
Motivation: Investigate whether a large language model can reliably predict fans' overall experience ratings from their open-ended text alone, and understand the nature of the differences between AI predictions and human self-ratings. Method: Use GPT-4.1 with a single prompt to predict 0–10 ratings for about 10,000 open-ended responses from fans of five MLB teams; compare predictions with the actual survey ratings and analyze agreement, correlation, and sources of systematic bias. Result: 67% of predictions fall within ±1 point and 36% match exactly; across three independent runs, 87% agree exactly and 99.9% fall within ±1; correlation is highest with the overall rating (r = 0.82), but predictions are systematically about 1 point lower, and this gap cannot be attributed to any single aspect of the experience. Conclusion: A simple, unoptimized prompt yields directionally accurate predictions; the gap between predicted and self-reported ratings is not error but reflects two complementary constructs, an overall evaluation versus a quantification of salient moments, and should be preserved rather than eliminated. Abstract: We tasked GPT-4.1 to read what baseball fans wrote about their game-day experience and predict the overall experience rating each fan gave on a 0-10 survey scale. The model received only the text of a single open-ended response. These AI predictions were compared with the actual experience ratings captured by the survey instrument across approximately 10,000 fan responses from five Major League Baseball teams. In total two-thirds of predicted ratings fell within one point of self-reported fan ratings (67% within +/-1, 36% exact match), and the predicted measurement was near-deterministic across three independent scoring runs (87% exact agreement, 99.9% within +/-1). Predicted ratings aligned most strongly with the overall experience rating (r = 0.82) rather than with any specific aspect of the game-day experience such as parking, concessions, staff, etc. However, predictions were systematically lower than self-reported ratings by approximately one point, and this gap was not driven by any single aspect. Rather, our analysis shows that self-reported ratings capture the fan's verdict, an overall evaluative judgment that integrates the entire experience, while predicted ratings quantify the impact of salient moments characterized as memorable, emotionally intense, unusual, or actionable. Each measure contains information the other misses.
These baseline results establish that a simple, unoptimized prompt can directionally predict how fans rate their experience from the text a fan wrote and that a gap between the two numbers can be interpreted as a construct difference worth preserving rather than an error to eliminate.
[30] Purging the Gray Zone: Latent-Geometric Denoising for Precise Knowledge Boundary Awareness
Hao An,Yibin Lou,Jiayi Guo,Yang Xu
Main category: cs.CL
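TL;DR: This paper proposes the GeoDe framework, which uses geometric distance as a confidence signal for geometric denoising. It addresses the hallucination and over-abstention that arise when a model's internal beliefs are ambiguous near the decision boundary, and substantially improves truthfulness and OOD generalization.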
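A minimal sketch of distance-based filtering around a linear decision hyperplane, in the spirit of using geometric distance as a confidence signal. The probe weights, the margin, and the function names are hypothetical; in the paper the hyperplane comes from linear probes on hidden states, which this toy example does not train.

```python
import math


def signed_distance(x, w, b):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def drop_gray_zone(samples, w, b, margin):
    """Keep only samples at least `margin` away from the hyperplane.

    Samples closer than the margin are treated as ambiguous and removed
    before fine-tuning (margin is an assumed hyperparameter).
    """
    return [x for x in samples if abs(signed_distance(x, w, b)) >= margin]


# Hypothetical 2-D "hidden states" and probe parameters.
w, b = [3.0, 4.0], 0.0
samples = [[3.0, 4.0], [0.3, 0.4], [-3.0, -4.0]]
kept = drop_gray_zone(samples, w, b, margin=1.0)
```

Dividing by the norm of w makes the margin a true geometric distance, so it stays comparable across probes with differently scaled weights.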
Details
Motivation: Existing abstention fine-tuning methods partition datasets by response accuracy and suffer severe label noise near the decision boundary, which causes over-abstention or hallucination; the authors find that a "gray zone" near the decision hyperplane in latent space, where internal beliefs are ambiguous, is the performance bottleneck. Method: Propose the GeoDe (Geometric Denoising) framework: build a truth hyperplane with linear probes and use each sample's geometric distance to that hyperplane as a confidence signal to filter and denoise ambiguous boundary samples, improving the quality of abstention fine-tuning. Result: Validated on several models (including Llama3 and Qwen3) and on TriviaQA, NQ, SciQ, and SimpleQA, GeoDe significantly improves truthfulness and generalizes strongly in out-of-distribution (OOD) scenarios. Conclusion: Starting from the geometric structure of latent space, modeling confidence with geometric distance effectively mitigates label noise in abstention fine-tuning and offers a new approach to more trustworthy LLM reasoning. Abstract: Large language models (LLMs) often exhibit hallucinations due to their inability to accurately perceive their own knowledge boundaries. Existing abstention fine-tuning methods typically partition datasets directly based on response accuracy, causing models to suffer from severe label noise near the decision boundaries and consequently exhibit high rates of abstentions or hallucinations. This paper adopts a latent space representation perspective, revealing a "gray zone" near the decision hyperplane where internal belief ambiguity constitutes the core performance bottleneck. Based on this insight, we propose the GeoDe (Geometric Denoising) framework for abstention fine-tuning. This method constructs a truth hyperplane using linear probes and performs "geometric denoising" by employing geometric distance as a confidence signal for abstention decisions. This approach filters out ambiguous boundary samples while retaining high-fidelity signals for fine-tuning. Experiments across multiple models (Llama3, Qwen3) and benchmark datasets (TriviaQA, NQ, SciQ, SimpleQA) demonstrate that GeoDe significantly enhances model truthfulness and demonstrates strong generalization in out-of-distribution (OOD) scenarios. Code is available at https://github.com/Notbesidemoon/GeoDe.
[31] Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance
Bar Alon,Itamar Zimerman,Lior Wolf
Main category: cs.CL
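TL;DR: This paper evaluates the epistemic faithfulness of post-hoc textual explanations generated by large language models (LLMs) and finds that they are often unfaithful. It then proposes a training-free method based on attention interventions, which uses token-level heatmaps from a faithful attribution method to guide explanation generation and significantly improves epistemic faithfulness across models, benchmarks, and prompts.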
Details
Motivation: LLMs lack explainability and are treated as black boxes, which limits their use in domains that require transparency and trust; existing post-hoc textual explanations are subjectively convincing, but it is unclear whether they reflect the evidence the model actually relied on (epistemic faithfulness). Method: First assess the epistemic faithfulness of LLM explanations with counterfactuals; then propose a training-free method that uses token-level heatmaps from a faithful attribution method to intervene at the attention level and guide explanation generation. Result: Experiments show that the method significantly improves epistemic faithfulness across multiple LLMs, benchmark datasets, and prompts. Conclusion: Post-hoc explanations generated by LLMs are often epistemically unfaithful; a training-free method based on attention interventions can effectively improve their faithfulness and offers a new direction for explainable AI. Abstract: Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.
[32] Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation
Zichong Li,Chen Liang,Liliang Ren,Tuo Zhao,Yelong Shen,Weizhu Chen
Main category: cs.CL
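TL;DR: This paper proposes RoPE-Perturbed Self-Distillation, which perturbs RoPE position encodings to create different views of the same sequence and uses self-distillation to keep the model's predictions consistent across positions, improving the positional robustness of long-context understanding in LLMs.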
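A minimal sketch of the two ingredients named here: a perturbed view of position indices and a consistency measure between predictions from two views. Shuffling fixed-size chunks of position ids and using KL divergence as the consistency term are assumptions chosen for illustration; the paper's actual perturbation scheme and loss are not specified in this summary, and no model is run.

```python
import math
import random


def perturbed_positions(seq_len, chunk_size, rng):
    """Return position ids where whole chunks are moved to new locations.

    Token i keeps its content but is assigned position ids[i], so parts of
    the context appear to sit elsewhere (assumed perturbation scheme).
    """
    chunks = [
        list(range(start, min(start + chunk_size, seq_len)))
        for start in range(0, seq_len, chunk_size)
    ]
    rng.shuffle(chunks)
    return [pos for chunk in chunks for pos in chunk]


def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


rng = random.Random(0)
position_ids = perturbed_positions(seq_len=10, chunk_size=4, rng=rng)

# Hypothetical next-token distributions from the original and perturbed views.
original_view = [0.7, 0.2, 0.1]
perturbed_view = [0.5, 0.3, 0.2]
consistency_loss = kl_divergence(original_view, perturbed_view)
```

In training, a term like this would be minimized so that predictions depend on content and not on where the evidence happens to sit.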
Details
Motivation: Standard long-context fine-tuning is highly sensitive to the absolute position of evidence in the context, showing large positional variance that undermines reliability. Method: Propose RoPE-Perturbed Self-Distillation: during training, perturb RoPE position indices to create several positional views of the same sequence, and use self-distillation to make the model's outputs consistent across views, reducing position dependence and strengthening semantic modeling. Result: Experiments on Llama-3-8B and Qwen-3-4B show gains of up to 12.04% on RULER-64K and 2.71% on RULER-256K, along with better extrapolation beyond the training length. Conclusion: RoPE perturbation combined with self-distillation is a simple and effective regularization strategy that markedly improves positional robustness and length extrapolation in long-context models. Abstract: Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
[33] When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden
Apoorv Prasad,Susan McRoy
Main category: cs.CL
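TL;DR: This paper presents an approach based on small open-source language models for transparent, explainable detection in social media posts of the triple burden faced by women with polycystic ovary syndrome (PCOS), namely body image distress, disordered eating, and metabolic challenges; it reaches 75.3% exact match accuracy on 150 held-out posts.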
Details
Motivation: Women with PCOS often face body image distress, disordered eating, and metabolic challenges at the same time, but existing NLP approaches lack transparency and struggle to identify co-occurring presentations. Method: Collect 1,000 PCOS-related Reddit posts, labeled by two annotators following the Lee et al. (2017) clinical framework; fine-tune three lightweight open-source models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) with LoRA to generate structured explanations with textual evidence. Result: The best model reaches 75.3% exact match accuracy on 150 test posts, with good comorbidity detection and explainability; performance declines as diagnostic complexity increases. Conclusion: The approach is suited to preliminary screening for PCOS-related psychological and metabolic risks rather than autonomous diagnosis, and highlights the potential of small models for explainable clinical NLP tasks. Abstract: Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines operationalizing the Lee et al. (2017) clinical framework. Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating that the models are best used for screening rather than autonomous diagnosis.
[34] APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI
Pratyay Banerjee,Masud Moshtaghi,Shivashankar Subramanian,Amita Misra,Ankit Chadha
Main category: cs.CL
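TL;DR: This paper presents APEX-MEM, a property-graph-based conversational memory system that combines an entity-centric design, temporal modeling, and a multi-tool retrieval agent to significantly improve accuracy on long-term conversational memory tasks.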
Details
Motivation: Large language models remain unreliable at long-term conversational memory; enlarging context windows or applying naive retrieval tends to introduce noise and destabilize responses. Method: APEX-MEM combines three core innovations: (1) a property graph built on a domain-agnostic ontology that models conversations as temporally grounded events; (2) append-only storage that preserves the full temporal evolution of information; (3) a multi-tool retrieval agent that understands and resolves conflicting or evolving information at query time, producing a compact and contextually relevant memory summary. Result: 88.88% accuracy on the LOCOMO question answering task and 86.2% on LongMemEval, outperforming existing session-aware approaches. Conclusion: Structured property graphs support more temporally coherent long-term conversational reasoning, confirming the value of explicit knowledge structures for improving the long-term memory of LLMs. Abstract: Large language models still struggle with reliable long-term conversational memory: simply enlarging context windows or applying naive retrieval often introduces noise and destabilizes responses. We present APEX-MEM, a conversational memory system that combines three key innovations: (1) a property graph which uses domain-agnostic ontology to structure conversations as temporally grounded events in an entity-centric framework, (2) append-only storage that preserves the full temporal evolution of information, and (3) a multi-tool retrieval agent that understands and resolves conflicting or evolving information at query time, producing a compact and contextually relevant memory summary. This retrieval-time resolution preserves the full interaction history while suppressing irrelevant details. APEX-MEM achieves 88.88% accuracy on LOCOMO's Question Answering task and 86.2% on LongMemEval, outperforming state-of-the-art session-aware approaches and demonstrating that structured property graphs enable more temporally coherent long-term conversational reasoning.
[35] The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
Akshay Paruchuri,Ishan Chatterjee,Henry Fuchs,Ehsan Adeli,Piotr Didyk
Main category: cs.CL
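TL;DR: This paper proposes centroid replacement to probe modal dependence in multimodal language models and finds that language representations broadly overshadow visual ones; it then designs text centroid contrastive decoding, which markedly improves accuracy on visual perception tasks at inference time, with effects that vary by training approach.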
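As a rough illustration of the probe named in this summary, the sketch below collapses each token vector to its nearest centroid under squared Euclidean distance. The centroids are taken as given (in practice they would come from K-means), and `contrastive_logits` uses a commonly seen contrastive-decoding form, `(1 + alpha) * main - alpha * reference`; both the function names and that exact form are assumptions, since the abstract does not give the formula.

```python
def nearest_centroid(vec, centroids):
    """Return the centroid closest to vec under squared Euclidean distance."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, c)))

def centroid_replace(tokens, centroids):
    """Collapse every token vector to its nearest centroid (the erasure probe)."""
    return [list(nearest_centroid(t, centroids)) for t in tokens]

def contrastive_logits(main, reference, alpha=1.0):
    """Amplify what the main pass knows beyond the centroid-erased reference pass."""
    return [(1 + alpha) * m - alpha * r for m, r in zip(main, reference)]

centroids = [[1.0, 0.0], [0.0, 1.0]]
erased = centroid_replace([[0.9, 0.1], [0.1, 1.2]], centroids)
adjusted = contrastive_logits([2.0, 1.0], [1.0, 1.0], alpha=1.0)
```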
Details
Motivation: Multimodal language models systematically underperform on visual perception tasks, but the structural cause of this failure is unclear. Method: Propose centroid replacement (replacing each token with its K-means cluster centroid) as a controlled probe of modal dependence; further propose text centroid contrastive decoding, which decodes at inference time by contrasting against a reference output whose text centroids have been erased. Result: Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure; the contrastive decoding method improves accuracy by up to +16.9% on individual tasks; standard fine-tuned models gain +5.6% on average, while preference-optimized models gain only +1.5%. Conclusion: Modal competition is structurally localized, can be corrected at inference time without retraining, and can be quantified as a diagnostic signal to guide future multimodal training. Abstract: Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
[36] BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking
Hyunkyung Park,Arkaitz Zubiaga
Main category: cs.CL
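TL;DR: This paper proposes a conservative rewriting method for colloquial claims in dialogue (staged de-colloquialisation) and a semantics-aware consistency gate (BiCon-Gate) to improve the robustness and accuracy of automated fact-checking in multi-turn dialogue.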
Details
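Motivation: Existing automated fact-checking research does not adequately handle the colloquial language that is frequent yet understudied in multi-turn dialogue. Method: First generate a conservative rewrite candidate through staged de-colloquialisation (lightweight surface normalisation plus scoped in-claim coreference resolution); then introduce BiCon-Gate, which chooses between the rewrite and the original claim according to how well the dialogue context semantically supports the rewrite. Result: On the DialFact benchmark, the method significantly improves evidence retrieval and fact verification, with especially clear gains on the SUPPORTS class, and outperforms several strong baselines, including a one-shot LLM rewrite. Conclusion: A staged, semantically gated rewriting strategy effectively mitigates the instability that colloquial expressions introduce into fact-checking and offers a more reliable preprocessing paradigm for dialogue fact-checking. Abstract: Automated fact-checking in dialogue involves multi-turn conversations where colloquial language is frequent yet understudied. To address this gap, we propose a conservative rewrite candidate for each response claim via staged de-colloquialisation, combining lightweight surface normalisation with scoped in-claim coreference resolution. We then introduce BiCon-Gate, a semantics-aware consistency gate that selects the rewrite candidate only when it is semantically supported by the dialogue context, otherwise falling back to the original claim. This gated selection stabilises downstream fact-checking and yields gains in both evidence retrieval and fact verification. On the DialFact benchmark, our approach improves retrieval and verification, with particularly strong gains on SUPPORTS, and outperforms competitive baselines, including a decoder-based one-shot LLM rewrite that attempts to perform all de-colloquialisation steps in a single pass.
[37] Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection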
David Basil,Chirooth Girigowda,Bradley Hauer,Sahir Momin,Ning Shi,Grzegorz Kondrak
Main category: cs.CL
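TL;DR: This paper proposes a method that automatically generates multilingual WordNet-style sense resources through semantic projection and dictionary filtering, improving precision while remaining interpretable and requiring few external resources.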
Details
Motivation: Sense resources such as WordNet are hard to extend to new languages, which calls for an automatic, efficient, and interpretable method for cross-lingual sense generation. Method: Using an English/target-language parallel corpus, project English senses (synsets) onto target-language lemmas through an augmented pre-trained aligner, and use a bilingual dictionary both to improve the alignments and to filter out incorrect projections. Result: The method outperforms prior methods as well as dictionary-based and large language model baselines on multiple languages, significantly improving precision while remaining highly interpretable and requiring few external resources. Conclusion: The project-and-filter strategy is an effective, robust, and practical way to extend sense resources across languages; the code and generated sense inventories will be released. Abstract: We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects English synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate these alignments and ensure their quality, we augment a pre-trained base aligner with a bilingual dictionary, which is also used to filter out incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and requiring few external resources. We plan to make our code, documentation, and generated sense inventories accessible.
[38] The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious
Ferdinand M. Schessl
Main category: cs.CL
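TL;DR: This paper shows that current evaluations of multi-turn human-LLM conversations ignore autocorrelation between turns, proposes a two-stage correction framework combining effective degrees of freedom with a conversation-level block bootstrap, and shows that the correction substantially improves the reliability of statistical inference.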
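The two correction ideas named in this summary can be sketched in a few lines of standard-library Python. The effective sample size below uses a common lag-1 (AR(1)) simplification, `n * (1 - r1*r2) / (1 + r1*r2)`, which is an assumption and simpler than the multi-lag Chelton (1983) formulation cited in the abstract; the bootstrap resamples whole conversations so that within-conversation dependence is kept intact. All function names are hypothetical.

```python
import random

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a sequence (0.0 if the sequence is constant)."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        return 0.0
    return sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1)) / denom

def effective_n(x, y):
    """AR(1)-style effective sample size for a correlation between x and y."""
    r = lag1_autocorr(x) * lag1_autocorr(y)
    r = max(-0.99, min(0.99, r))  # guard against division by zero
    return len(x) * (1 - r) / (1 + r)

def cluster_bootstrap_means(conversations, n_boot, rng):
    """Resample whole conversations with replacement; return bootstrap means."""
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(conversations) for _ in conversations]
        turns = [v for conv in sample for v in conv]
        means.append(sum(turns) / len(turns))
    return means

trend = [float(i) for i in range(20)]        # strongly autocorrelated metric
n_eff = effective_n(trend, trend)            # far fewer than 20 independent points
boot = cluster_bootstrap_means([[1.0, 1.0], [3.0, 3.0, 3.0]], 200, random.Random(0))
```

Using `n_eff` instead of the raw turn count when computing a p-value, and reading confidence intervals off the percentiles of `boot`, is the usual way these two pieces are applied.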
Details
Motivation: Existing evaluation pipelines generally treat the turns of a multi-turn conversation as independent samples, ignoring their temporal dependence and autocorrelation, which severely inflates statistical significance. Method: Systematically characterize the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations; propose a two-stage correction framework that combines Chelton (1983) effective degrees of freedom with a conversation-level block bootstrap; validate it on a pre-registered hold-out split. Result: 42% of the associations that appear significant under naive pooled testing fail to survive cluster-robust correction; the inflation rate differs markedly across metric families (14% vs. 33%); cluster-robust metrics replicate at 57%, versus 30% for pooled-only metrics; a survey finds that about 87% of recent papers do not correct for temporal dependence. Conclusion: Turn-level statistics must account for the clustering of turns within conversations; the correction framework, design principles, checklist, and open-source code provided here can help standardize statistical rigor in LLM conversation evaluation. Abstract: Turn-level metrics are widely used to evaluate properties of multi-turn human-LLM conversations, from safety and sycophancy to dialogue quality. However, consecutive turns within a conversation are not statistically independent, a fact that virtually all current evaluation pipelines fail to correct for in their statistical inference. We systematically characterize the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations (11,639 turn pairs, 5 German-speaking users, 4 LLM platforms) and demonstrate that naive pooled analysis produces severely inflated significance estimates: 42% of associations that appear significant under standard pooled testing fail to survive cluster-robust correction. The inflation varies substantially across categories rather than scaling linearly with autocorrelation: three memoryless families (embedding velocity, directional, differential) aggregate to 14%, while the seven non-memoryless families (thermo-cycle, frame distance, lexical/structural, rolling windows, cumulative, interaction, timestamp) aggregate to 33%, with individual category rates ranging from 0% to 100% depending on per-family effect size. We present a two-stage correction framework combining Chelton (1983) effective degrees of freedom with conversation-level block bootstrap, and validate it on a pre-registered hold-out split where cluster-robust metrics replicate at 57% versus 30% for pooled-only metrics. We provide concrete design principles, a publication checklist, and open-source code for the correction pipeline.
A survey of ~30 recent papers at major NLP and AI venues that compute turn-level statistics in LLM evaluations finds that only 4 address temporal dependence at all, and 26 do not correct for it.
[39] Three-Phase Transformer
Mohammad R. Abu Ayyash
Main category: cs.CL
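TL;DR: This paper proposes the Three-Phase Transformer (3PT), an architecture that partitions the residual stream into cyclic channels, applies phase-respecting operations (such as Givens rotations), and injects a Gabriel's horn position encoding into the DC subspace, significantly improving the training efficiency and performance of decoder-only Transformers.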
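Two properties behind the phase-rotation idea can be checked numerically: angles spaced 2π/N apart give sinusoids that sum to zero (the balanced three-phase picture for N=3), and a 2D Givens rotation preserves vector norm. The sketch below shows only these two facts plus the horn profile r(p) = 1/(p+1); how 3PT pairs coordinates within a channel is not specified in the abstract, so the helper names and the single-pair rotation are illustrative assumptions.

```python
import math

def phase_angles(theta, n):
    """Per-channel rotation angles theta + i * (2*pi/n) for i = 0..n-1."""
    return [theta + i * (2 * math.pi / n) for i in range(n)]

def givens_rotate(x, y, angle):
    """Rotate the 2D pair (x, y) by angle; the Euclidean norm is unchanged."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x - s * y, s * x + c * y

def horn_profile(p):
    """Gabriel's horn profile r(p) = 1 / (p + 1) for position p >= 0."""
    return 1.0 / (p + 1)

angles = phase_angles(theta=0.3, n=3)
balance = sum(math.cos(a) for a in angles)   # numerically zero for n = 3
rotated = givens_rotate(3.0, 4.0, angles[1])  # still has norm 5
```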
Details
Motivation: Improve the training stability and convergence speed of decoder-only Transformers without the complexity of extra modules, and explore how structural priors can shape a model's intrinsic geometry. Method: Partition the hidden vector into N equally-sized cyclic channels; introduce a per-channel RMSNorm, a 2D Givens rotation between attention and FFN (phase offset theta + i*(2π/N)), and a constraint aligning the number of GQA heads with the number of channels; inject an absolute position encoding r(p) = 1/(p+1) into the one-dimensional DC subspace orthogonal to the channels, composing orthogonally with RoPE's relative position encoding. Result: On WikiText-103, a 123M-parameter model reduces perplexity by 7.20% (-2.62% bpb) relative to a RoPE baseline while adding only 1,536 parameters (0.00124%), with a 1.93x speedup in convergence steps (1.64x wall-clock); N=3 is among the best configurations, and the mechanism is shown to have a self-stabilizing geometry, a U-shaped depth profile of rotation-angle drift, and orthogonal composability with RoPE, attention, and FFN. Conclusion: 3PT shows that a residual-stream structural prior can serve as a lightweight, self-stabilizing, and interpretable modeling paradigm; its core components (channel partitioning, phase rotation, per-phase normalization, and DC injection) together form a new conservation-law mechanism for neural networks and offer a new perspective on Transformer architecture design. Abstract: We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable.
The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
[40] Hierarchical vs. Flat Iteration in Shared-Weight Transformers
Sang-Il Han
Main category: cs.CL
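TL;DR: This paper presents HRM-LM, which replaces the independently stacked layers of a standard Transformer with a shared-weight, two-speed recurrent structure (a Fast module that runs every step and a Slow module that runs every T steps), and reports empirically that this recurrent structure falls significantly behind independent-layer stacking in representational quality.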
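The two-speed schedule described here (a Fast module at every step, a Slow module every T steps, unrolled for M = N x T steps with shared parameters) can be sketched as a plain loop. The abstract does not say whether the Slow module fires at the start or end of each window or how the two modules exchange state, so the end-of-window placement and the single shared `state` below are assumptions; `fast` and `slow` are hypothetical callables standing in for the shared-weight modules.

```python
def unroll(state, fast, slow, n_cycles, t):
    """Run fast every step and slow after every t-th step, for n_cycles * t steps."""
    trace = []
    for step in range(1, n_cycles * t + 1):
        state = fast(state)            # local refinement, every step
        trace.append("fast")
        if step % t == 0:
            state = slow(state)        # global compression, every t steps
            trace.append("slow")
    return state, trace

# Toy modules: the same two functions are reused at every step (shared weights).
final, trace = unroll(0, fast=lambda s: s + 1, slow=lambda s: s * 2, n_cycles=2, t=3)
```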
Details
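Motivation: Investigate whether hierarchically structured, shared-weight recurrence can match the representational capacity of independently stacked layers in a Transformer. Method: Propose HRM-LM, a two-speed recurrent pair made of a Fast module (runs every step) and a Slow module (runs every T steps), with shared parameters unrolled for M = N×T steps; compare it in an ablation against a parameter-matched Universal Transformer (UniTF, 1.2B) across five independent runs. Result: Under parameter matching, HRM-LM shows a significant and robust performance gap relative to UniTF, suggesting that shared-weight recurrence struggles to reach the representational quality of independent-layer stacking. Conclusion: Hierarchical, shared-weight recurrent designs cannot currently replace independent-layer stacking in Transformer language modeling; independent parameterization is critical to modeling capacity. Abstract: We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module operating at every step for local refinement, and a Slow module operating every T steps for global compression. This recurrent hierarchy is unrolled for M = N x T steps with shared parameters. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.
[41] MARCA: A Checklist-Based Benchmark for Multilingual Web Search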
Thales Sales Almeida,Giovana Kerche Bonás,Ramon Pires,Celio Larcher,Hugo Abonizio,Marcos Piau,Roseval Malaquias Junior,Rodrigo Nogueira,Thiago Laitz
Main category: cs.CL
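TL;DR: This paper presents MARCA, a bilingual (English and Portuguese) benchmark for evaluating large language models on web-based information seeking. The benchmark contains 52 manually authored multi-entity questions with accompanying checklist-style rubrics, and evaluates 14 models under two interaction frameworks (Basic and Orchestrator), revealing performance differences between models, the effectiveness of orchestration, and uneven transfer from English to Portuguese.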
Details
Motivation: Existing benchmarks under-evaluate web browsing and agentic tool use in multilingual settings, Portuguese in particular; a bilingual benchmark is needed that measures the full information-seeking ability of LLMs (search, evidence selection, and answer synthesis). Method: Build the bilingual (English/Portuguese) benchmark MARCA, with 52 multi-entity questions and manually validated checklist-style rubrics; design two interaction frameworks, Basic (direct search plus scraping) and Orchestrator (task decomposition with cooperating subagents); run 14 models multiple times to quantify uncertainty in the results. Result: Performance differs substantially across models; the Orchestrator framework usually improves answer coverage; transfer from English to Portuguese is highly variable; all experiments report run-level uncertainty. Conclusion: MARCA fills a gap in multilingual web information-seeking evaluation and provides a reproducible, fine-grained, bilingual benchmark for assessing and improving the reliability of LLMs on complex real-world queries. Abstract: Large language models (LLMs) are increasingly used as sources of information, yet their reliability depends on the ability to search the web, select relevant evidence, and synthesize complete answers. While recent benchmarks evaluate web-browsing and agentic tool use, multilingual settings, and Portuguese in particular, remain underexplored. We present MARCA, a bilingual (English and Portuguese) benchmark for evaluating LLMs on web-based information seeking. MARCA consists of 52 manually authored multi-entity questions, paired with manually validated checklist-style rubrics that explicitly measure answer completeness and correctness. We evaluate 14 models under two interaction settings: a Basic framework with direct web search and scraping, and an Orchestrator framework that enables task decomposition via delegated subagents. To capture stochasticity, each question is executed multiple times and performance is reported with run-level uncertainty. Across models, we observe large performance differences, find that orchestration often improves coverage, and identify substantial variability in how models transfer from English to Portuguese. The benchmark is available at https://github.com/maritaca-ai/MARCA
[42] Filling in the Mechanisms: How do LMs Learn Filler-Gap Dependencies under Developmental Constraints?
Atrey Desai,Sathvik Nair
Main category: cs.CL
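TL;DR: This paper studies whether language models trained on limited data form representations of filler-gap dependencies that are shared across syntactic constructions, as humans do, and finds that although shared mechanisms exist, models still need far more data than humans to reach comparable generalization.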
Details
Motivation: Investigate whether language models trained on developmentally feasible amounts of data form representations of filler-gap dependencies that are shared across syntactic constructions, as humans do. Method: Apply Distributed Alignment Search (DAS) to language models trained on varying amounts of BabyLM challenge data, and analyze whether representations of filler-gap dependencies transfer between wh-questions and topicalization. Result: With limited training data, models may develop shared yet item-sensitive mechanisms; however, they need far more data than humans and do not reach the same degree of generalization. Conclusion: Models of language acquisition need language-specific prior biases to compensate for the weak low-data generalization of current language models. Abstract: For humans, filler-gap dependencies require a shared representation across different syntactic constructions. Although causal analyses suggest this may also be true for LLMs (Boguraev et al., 2025), it is still unclear if such a representation also exists for language models trained on developmentally feasible quantities of data. We applied Distributed Alignment Search (DAS, Geiger et al. (2024)) to LMs trained on varying amounts of data from the BabyLM challenge (Warstadt et al., 2023), to evaluate whether representations of filler-gap dependencies transfer between wh-questions and topicalization, which greatly vary in terms of their input frequency. Our results suggest shared, yet item-sensitive mechanisms may develop with limited training data. More importantly, LMs still require far more data than humans to learn comparable generalizations, highlighting the need for language-specific biases in models of language acquisition.
[43] Psychological Steering of Large Language Models
Leonardo Blas,Robin Jia,Emilio Ferrara
Main category: cs.CL
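TL;DR: This paper proposes a psychology-based framework for steering LLM behavior that uses the IPIP-NEO-120 inventory to calibrate residual-stream injections and performs unbounded sweeps in semantically calibrated units. It finds that mean-difference (MD) injections outperform conventional Personality Prompting (P²) on most models and that a hybrid of MD and P² works best; it also supports the Linear Representation Hypothesis while revealing a gap between model representations and human psychology (for example, the Big Two model).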
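A mean-difference injection in its generic form can be sketched as follows: the steering direction is the difference between the mean activation on "high-trait" examples and the mean activation on "low-trait" examples, and it is added to a hidden state scaled by a strength `alpha`. This is the standard textbook construction written in plain Python; the paper's calibration with IPIP-NEO-120 and its fluency-constrained sweep are not reproduced, and the function names are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_difference(positive, negative):
    """Steering direction: mean(positive activations) - mean(negative activations)."""
    return [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]

def inject(hidden, direction, alpha):
    """Additive residual-stream injection: hidden + alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

direction = mean_difference([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = inject([0.0, 0.0], direction, alpha=0.5)
```

Sweeping `alpha` over a range and scoring the resulting generations is the injection-strength sweep the abstract refers to.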
Details
Motivation: Existing activation-intervention methods restrict the search space and sweep in uncalibrated activation units, so they may miss optimal intervention settings; a semantically interpretable, psychologically aligned intervention paradigm is needed. Method: Propose a psychological steering framework that uses the IPIP-NEO-120 inventory to calibrate residual-stream injections, compare six injection methods (including MD, P², and their hybrid), and evaluate open-ended generation on 14 LLMs. Result: MD injections outperform P² on 11 of 14 models (gains of 3.6% to 16.4%); the MD+P² hybrid is best on 13 of 14 models (gains of 5.6% to 21.9% over P² and 3.3% to 26.7% over MD); MD is consistent with the Linear Representation Hypothesis, but the OCEAN covariance it induces departs from the human Big Two model. Conclusion: Representation engineering, rather than prompting alone, is a new frontier for psychological steering; semantically calibrated residual injections are more effective; hybrid strategies have an advantage; and model-internal representations differ systematically from real personality structure. Abstract: Large language models (LLMs) emulate a consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P²), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P² and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P² ranging from 5.6% to 21.9% and from 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering.
Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.
[44] CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling
Karthik Singaravadivelan,Anant Gupta,Zekun Wang,Christopher MacLellan
Main category: cs.CL
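TL;DR: This paper presents CobwebTM, a low-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation that builds semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without a predefined number of topics.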
Details
Motivation: Neural topic models perform well but are hard to tune and struggle with lifelong learning, while classical probabilistic models lack flexibility and adaptability to streaming data. Method: Adapt the Cobweb algorithm to a continuous document-embedding space and build a lifelong hierarchical topic model on incremental probabilistic concept formation. Result: Across multiple datasets, the model shows high topic coherence, temporal stability, and high-quality hierarchies. Conclusion: Incremental symbolic concept formation combined with pretrained representations is an effective route to efficient topic modeling. Abstract: Topic modeling seeks to uncover latent semantic structure in text corpora with minimal supervision. Neural approaches achieve strong performance but require extensive tuning and struggle with lifelong learning due to catastrophic forgetting and fixed capacity, while classical probabilistic models lack flexibility and adaptability to streaming data. We introduce CobwebTM, a low-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation. By adapting the Cobweb algorithm to continuous document embeddings, CobwebTM constructs semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without predefining the number of topics. Across diverse datasets, CobwebTM achieves strong topic coherence, stable topics over time, and high-quality hierarchies, demonstrating that incremental symbolic concept formation combined with pretrained representations is an efficient approach to topic modeling.
[45] PeerPrism: Peer Evaluation Expertise vs Review-writing AI
Soroush Sadeghian,Alireza Daqiq,Radin Cheraghi,Sajad Ebrahimi,Negar Arabzadeh,Ebrahim Bagheri
Main category: cs.CL
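TL;DR: This paper presents the PeerPrism benchmark for evaluating LLM detection in human-AI collaborative peer review, showing that existing detection methods conflate surface text generation with the origin of ideas.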
Details
Motivation: Existing LLM detection methods reduce authorship to a binary question (human vs. AI), ignoring the hybrid workflows in which humans and AI write reviews together, and cannot separate the origin of ideas from the generation of text. Method: Build the large-scale PeerPrism benchmark (20,690 reviews) covering fully human, fully synthetic, and several hybrid generation regimes; systematically evaluate mainstream LLM text detectors and combine stylometric and semantic analyses to study why they fail. Result: Detectors perform well on the standard binary task, but under hybrid regimes (for example, human ideas with AI-written text) their predictions diverge sharply and contradict one another; the analysis shows that they wrongly equate stylistic features with intellectual attribution. Conclusion: LLM detection in peer review cannot be reduced to binary attribution and should instead be modeled as a multidimensional authorship problem spanning semantic reasoning and stylistic realization; PeerPrism is the first benchmark for human-AI collaborative reviewing. Abstract: Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI) without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration. In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state-of-the-art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI-generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution. Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem.
Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark evaluating human-AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at https://github.com/Reviewerly-Inc/PeerPrism.
[46] Mechanistic Decoding of Cognitive Constructs in LLMs
Yitong Shou,Manhao Guan
Main category: cs.CL
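TL;DR: This paper proposes a cognitive reverse-engineering framework based on representation engineering to analyze the internal mechanisms of social-comparison jealousy in large language models. It finds that models encode jealousy as a linear combination of two psychological antecedents (superiority of the comparison person and domain self-definitional relevance), verifies consistency with the human psychological construct, and shows that toxic emotional states can be mechanically detected and precisely suppressed.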
Details
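Motivation: Existing interpretability methods mostly treat models as black boxes or focus only on coarse-grained basic emotions, which makes it hard to uncover the cognitive structure of complex emotions such as social-comparison jealousy; a new framework is needed to study the internal mechanisms of complex affect in LLMs. Method: Propose a cognitive reverse-engineering framework based on Representation Engineering (RepE) that combines appraisal theory, subspace orthogonalization, regression-based weighting, and bidirectional causal steering to isolate and quantify the two psychological antecedents of jealousy (superiority of the comparison person and domain self-definitional relevance) and test their causal effects on model judgments. Result: Experiments on eight LLMs from the Llama, Qwen, and Gemma families show that models natively encode jealousy as a structured linear combination of the two antecedents; their representations are broadly consistent with the human psychological construct, with superiority as the foundational trigger and relevance as the intensity multiplier; and toxic emotional states can be mechanically detected and precisely suppressed. Conclusion: LLMs internally exhibit a cognitive structure for complex emotions that is consistent with that of humans; the framework both exposes the computational mechanism of jealousy and offers a feasible path toward representational monitoring and intervention for AI safety, especially in multi-agent environments. Abstract: While Large Language Models (LLMs) demonstrate increasingly sophisticated affective capabilities, the internal mechanisms by which they process complex emotions remain unclear. Existing interpretability approaches often treat models as black boxes or focus on coarse-grained basic emotions, leaving the cognitive structure of more complex affective states underexplored. To bridge this gap, we propose a Cognitive Reverse-Engineering framework based on Representation Engineering (RepE) to analyze social-comparison jealousy. By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance, and examine their causal effects on model judgments. Experiments on eight LLMs from the Llama, Qwen, and Gemma families suggest that models natively encode jealousy as a structured linear combination of these constituent factors. Their internal representations are broadly consistent with the human psychological construct, treating Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier. Our framework also demonstrates that toxic emotional states can be mechanically detected and surgically suppressed, suggesting a possible route toward representational monitoring and intervention for AI safety in multi-agent environments.
[47] NLP needs Diversity outside of 'Diversity'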
Joshua Tint
Main category: cs.CL
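TL;DR: This paper argues that progress on diversity in NLP is overly concentrated in fairness-related research while other subfields are neglected; by analyzing the demographics of researchers across NLP subfields, the author exposes institutional barriers that keep marginalized researchers out of non-fairness research and recommends breaking feedback loops and removing geographical and linguistic barriers.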
Details
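Motivation: Address the uneven development of diversity in NLP, focusing on the marginalization of marginalized researchers in non-fairness subfields and its causes. Method: Conduct an empirical investigation based on demographic data on researchers across NLP subfields, combined with an analysis of incentives, biases, and structural barriers, to support the argument and propose improvements. Result: Progress on diversity is found to be highly concentrated in fairness subfields; several systemic barriers that hinder marginalized researchers from taking part in non-fairness research are identified, including feedback loops and geographical and linguistic barriers. Conclusion: Systemic reform is needed to make all NLP subfields more inclusive and equitable, in particular by breaking the feedback mechanisms that reinforce inequality and by lowering geographical and linguistic barriers. Abstract: This position paper argues that recent progress with diversity in NLP is disproportionately concentrated on a small number of areas surrounding fairness. We further argue that this is the result of a number of incentives, biases, and barriers which come together to disenfranchise marginalized researchers in non-fairness fields, or to move them into fairness-related fields. We substantiate our claims with an investigation into the demographics of NLP researchers by subfield, using our research to support a number of recommendations for ensuring that all areas within NLP can become more inclusive and equitable. In particular, we highlight the importance of breaking down feedback loops that reinforce disparities, and the need to address geographical and linguistic barriers that hinder participation in NLP research.
[48] CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification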
Yian Wang,Yuen Chen,Agam Goyal,Hari Sundaram
Main category: cs.CL
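TL;DR: This paper presents the CAUSALDETOX framework, which uses causal analysis to identify and intervene on the key attention heads responsible for toxic generation, combining inference-time intervention with fine-tuning to significantly reduce toxicity while preserving language quality.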
Details
Motivation: Large language models often generate toxic content, and existing mitigation methods tend to degrade generation quality or depend on costly human annotation. Method: Identify key attention heads using the Probability of Necessity and Sufficiency (PNS), and design two strategies: local inference-time intervention (dynamic steering vectors) and PNS-guided fine-tuning; also build the PARATOX benchmark for counterfactual evaluation. Result: On ToxiGen, ImplicitHate, and ParaDetox, toxicity reduction improves by up to 5.34% while linguistic fluency is preserved, and head selection is 7x faster. Conclusion: CAUSALDETOX offers an efficient, interpretable, low-degradation paradigm for toxicity mitigation. Abstract: Large language models (LLMs) frequently generate toxic content, posing significant risks for safe deployment. Current mitigation strategies often degrade generation quality or require costly human annotation. We propose CAUSALDETOX, a framework that identifies and intervenes on the specific attention heads causally responsible for toxic generation. Using the Probability of Necessity and Sufficiency (PNS), we isolate a minimal set of heads that are necessary and sufficient for toxicity. We utilize these components via two complementary strategies: (1) Local Inference-Time Intervention, which constructs dynamic, input-specific steering vectors for context-aware detoxification, and (2) PNS-Guided Fine-Tuning, which permanently unlearns toxic representations. We also introduce PARATOX, a novel benchmark of aligned toxic/non-toxic sentence pairs enabling controlled counterfactual evaluation. Experiments on ToxiGen, ImplicitHate, and ParaDetox show that CAUSALDETOX achieves up to 5.34% greater toxicity reduction compared to baselines while preserving linguistic fluency, and offers a 7x speedup in head selection.
[49] Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring
Sumit Mukherjee,Juan Shu,Nairwita Mazumder,Tate Kernell,Celena Wheeler,Shannon Hastings,Chris Sidey-Gibbons
Main category: cs.CL
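TL;DR: This paper proposes Retrieval-Augmented Set Completion (RASC), a new method for clinical value set authoring that retrieves similar existing value sets and classifies candidate codes, significantly improving the accuracy and efficiency of code identification.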
Details
Motivation: Clinical value set authoring is a key bottleneck in clinical quality measurement and phenotyping, and prompting a large language model to generate codes directly is limited because clinical vocabularies are large, strictly version-controlled, and not reliably memorized during pretraining. Method: Propose Retrieval-Augmented Set Completion (RASC): first retrieve the K most similar value sets from a curated corpus of existing value sets to form a candidate pool, then score each candidate code with a classifier; fine-tune a SAPBert-based cross-encoder and compare against an MLP, LightGBM, and zero-shot GPT-4o. Result: On a large-scale benchmark of 11,803 VSAC value sets, RASC reaches AUROC 0.852 and value-set-level F1 0.298, outperforming the MLP (F1 0.250) and GPT-4o (F1 0.105); it reduces the number of irrelevant candidates per true positive from 12.3 to roughly 3.2 to 4.4; the advantage grows with value set size. Conclusion: RASC improves statistical efficiency by shrinking the output space, and its effectiveness does not depend on a particular model type, offering a scalable and robust paradigm for automated clinical value set construction. Abstract: Clinical value set authoring, the task of identifying all codes in a standardized vocabulary that define a clinical concept, is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the K most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves AUROC 0.852 and value-set-level F1 0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC 0.799, F1 0.250); both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4 respectively. Zero-shot GPT-4o achieves value-set-level F1 0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage.
We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code, is available at: https://github.com/mukhes3/RASC.
[50] StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation
Geonhui Jang,Dongyoon Han,YoungJoon Yoo
Main category: cs.CL
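TL;DR: This paper proposes StoryCoder, a framework that reformulates code generation problems as coherent natural-language narratives (task overview, constraints, and test cases). It substantially improves zero-shot code generation, with an average pass@10 gain of 18.7%, and encourages better algorithmic strategy selection and more modular implementations.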
Details
Motivation: Existing code generation methods augment reasoning steps or inject structure into how models think, but they do not systematically organize scattered problem conditions. Inspired by how humans integrate fragmented information into coherent explanations, the authors argue for a problem representation with richer contextual structure. Method: StoryCoder is a narrative reformulation framework that turns a code generation problem into a three-part natural-language narrative: a task overview, constraints, and example test cases. The narrative is guided by the selected algorithm and genre. Result: Experiments on 11 models across HumanEval, LiveCodeBench, and CodeForces show an average gain of 18.7% in zero-shot pass@10. The analyses indicate that the method steers models toward correct algorithmic strategies, reduces implementation errors, and makes code more modular, and that these effects depend on narrative coherence and genre alignment. Conclusion: Structured problem representation, such as narrative reformulation, is important for code generation, and its benefits hold regardless of model scale or architecture. Abstract: Effective code generation requires both model capability and a problem representation that carefully structures how models reason and plan. Existing approaches augment reasoning steps or inject specific structure into how models think, but leave scattered problem conditions unchanged. Inspired by the way humans organize fragmented information into coherent explanations, we propose StoryCoder, a narrative reformulation framework that transforms code generation questions into coherent natural language narratives, providing richer contextual structure than simple rephrasings. Each narrative consists of three components: a task overview, constraints, and example test cases, guided by the selected algorithm and genre. Experiments across 11 models on HumanEval, LiveCodeBench, and CodeForces demonstrate consistent improvements, with an average gain of 18.7% in zero-shot pass@10. Beyond accuracy, our analyses reveal that narrative reformulation guides models toward correct algorithmic strategies, reduces implementation errors, and induces a more modular code structure. The analyses further show that these benefits depend on narrative coherence and genre alignment, suggesting that structured problem representation is important for code generation regardless of model scale or architecture. Our code is available at https://github.com/gu-ni/StoryCoder.
[51] Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
Nahyun Lee,Guijin Son
Main category: cs.CL
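TL;DR: This paper proposes a massive-option (100-option) multiple-choice evaluation protocol that tests large language models more rigorously on Korean orthography error detection, exposing weaknesses such as semantic confusion and position bias that low-option settings tend to mask.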
Details
Motivation: With few options, multiple-choice evaluation easily reaches near-ceiling accuracy, but that accuracy may rest on shortcut strategies rather than genuine language understanding, so model performance is overestimated. Method: The paper proposes a massive-option evaluation protocol (N = 100). In a Korean orthography error detection task, the model must identify the single incorrect sentence among many candidates. Fixed targets with repeated resampling and shuffling are used to separate positional artifacts from content-driven failures, and padding-controlled and length-matched experiments separate the effect of context length from that of candidate ranking. Result: Strong performance in low-option settings often drops markedly at high N. Two main failure modes are identified: semantic confusion and a preference for early positions under uncertainty. The control experiments indicate that the bottleneck is candidate ranking rather than context length. Conclusion: Massive-option evaluation is an effective stress-testing framework that exposes reliability gaps conventional benchmarks miss, and it is suited to validating model competence under high distractor density. Abstract: Multiple choice evaluation is widely used for benchmarking large language models, yet near-ceiling accuracy in low-option settings can be sustained by shortcut strategies that obscure true competence. Therefore, we propose a massive-option evaluation protocol that scales the candidate set to one hundred options and sharply reduces the impact of chance performance. We apply this framework to a Korean orthography error detection task where models must pick the single incorrect sentence from a large candidate set. With fixed targets and repeated resampling and shuffling, we obtain stable estimates while separating content-driven failures from positional artifacts. Across experiments, results indicate that strong performance in low-option settings can overstate model competence. This apparent advantage often weakens under dense interference at high $N$, revealing gaps that conventional benchmarks tend to obscure. We identify two failure modes: semantic confusion, and position bias toward early options under uncertainty. To isolate the effect of context length, we run padding-controlled and length-matched tests, which suggest that the main bottleneck is candidate ranking rather than context length. Together, these findings support massive-option evaluation as a general framework for stress testing model reliability under extreme distractor density, beyond what low-option benchmarks can reveal.
[52] Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models
Cuong Hoang,Le-Minh Nguyen
Main category: cs.CL
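TL;DR: This paper presents a reference-free method for financial misinformation detection that combines zero-shot and few-shot prompting with LoRA fine-tuning of large language models, achieving state-of-the-art results on the RFC-BENCH benchmark (95.4% on the public test set, 96.3% on the private test set).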
Details
Motivation: Widespread financial misinformation threatens market stability and investor trust, and in practice external evidence for cross-verification is often unavailable, so detection methods that do not depend on references are needed. Method: Building on the RFC-BENCH framework, the approach combines in-context learning with large language models (zero-shot and few-shot prompting) and parameter-efficient fine-tuning (LoRA) to improve recognition of the linguistic cues of financial manipulation. Result: The system ranked first on both official leaderboards, with 95.4% accuracy on the public test set and 96.3% on the private test set. The 14B and 32B models are released. Conclusion: The method shows that analysis of semantics and contextual consistency alone is effective for financial misinformation detection, and it advances context-aware misinformation detection in financial NLP. Abstract: The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing.
Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.
[53] CURaTE: Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge
Seyun Bae,Seokhan Lee,Eunho Yang
Main category: cs.CL
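TL;DR: This paper proposes CURaTE, which trains a sentence embedding model to detect and refuse, in real time, inputs similar to stored forget requests. This enables continual real-time unlearning for large language models while leaving the model's existing knowledge almost entirely intact.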
Details
Motivation: Not all potentially problematic data can be filtered out of LLM pre-training in advance, so specific knowledge must be unlearned after training. Existing unlearning methods do not support continuous, immediate action, which leads to degraded utility as updates accumulate and prolonged exposure of sensitive information. Method: CURaTE first trains a sentence embedding model on a purpose-built dataset so that it forms sharp decision boundaries for deciding whether an input corresponds to a stored forget request. The similarity between the input prompt and the forget requests then determines whether to answer or refuse. The language model's parameters are never modified. Result: CURaTE forgets more effectively than existing methods. Because it does not modify model parameters, it maintains near-perfect knowledge preservation over any number of updates, and it is the only method that supports continual, real-time unlearning. Conclusion: CURaTE provides an efficient, safe, and sustainable mechanism for real-time unlearning in large language models that balances forgetting effectiveness with knowledge integrity. Abstract: The inability to filter out in advance all potentially problematic data from the pre-training of large language models has given rise to the need for methods for unlearning specific pieces of knowledge after training. Existing techniques overlook the need for continuous and immediate action, causing them to suffer from degraded utility as updates accumulate and protracted exposure of sensitive information. To address these issues, we propose Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge (CURaTE). Our method begins by training a sentence embedding model on a dataset designed to enable the formation of sharp decision boundaries for determining whether a given input prompt corresponds to any stored forget requests. The similarity of a given input to the forget requests is then used to determine whether to answer or return a refusal response. We show that even with such a simple approach, not only does CURaTE achieve more effective forgetting than existing methods, but by avoiding modification of the language model parameters, it also maintains near perfect knowledge preservation over any number of updates and is the only method capable of continual unlearning in real-time.
[54] CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction
Sizhe Wang,Ziqi Xu,Claire Najjuuko,Charles Alba,Chenyang Lu
Main category: cs.CL
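TL;DR: This paper proposes CURA, a framework that fine-tunes clinical language models with a bi-level uncertainty objective to improve the calibration and reliability of uncertainty estimates in clinical risk prediction.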
Details
Motivation: Uncertainty estimates from clinical language models in risk prediction are often poorly calibrated and clinically unreliable. Method: CURA first fine-tunes domain-specific clinical language models to obtain task-adapted patient embeddings, then performs uncertainty fine-tuning of a multi-head classifier with a bi-level uncertainty objective that combines an individual-level calibration term with a cohort-aware regularizer. Result: Across multiple MIMIC-IV clinical risk prediction tasks, CURA consistently improves calibration metrics without substantially compromising discrimination. It also reduces overconfident false reassurance and makes uncertainty estimates more clinically trustworthy. Conclusion: CURA is an effective way to improve the calibration and clinical reliability of uncertainty in clinical language models, providing more trustworthy risk and uncertainty estimates for downstream clinical decision support. Abstract: Clinical language models (LMs) are increasingly applied to support clinical risk prediction from free-text notes, yet their uncertainty estimates often remain poorly calibrated and clinically unreliable. In this work, we propose Clinical Uncertainty Risk Alignment (CURA), a framework that aligns clinical LM-based risk estimates and uncertainty with both individual error likelihoods and cohort-level ambiguities. CURA first fine-tunes domain-specific clinical LMs to obtain task-adapted patient embeddings, and then performs uncertainty fine-tuning of a multi-head classifier using a bi-level uncertainty objective. Specifically, an individual-level calibration term aligns predictive uncertainty with each patient's likelihood of error, while a cohort-aware regularizer pulls risk estimates toward event rates in their local neighborhoods in the embedding space and places extra weight on ambiguous cohorts near the decision boundary. We further show that this cohort-aware term can be interpreted as a cross-entropy loss with neighborhood-informed soft labels, providing a label-smoothing view of our method. Extensive experiments on MIMIC-IV clinical risk prediction tasks across various clinical LMs show that CURA consistently improves calibration metrics without substantially compromising discrimination. Further analysis illustrates that CURA reduces overconfident false reassurance and yields more trustworthy uncertainty estimates for downstream clinical decision support.
[55] SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models
Binxian Su,Haoye Lou,Shucheng Zhu,Weikang Wang,Ying Liu,Dong Yu,Pengyuan Liu
Main category: cs.CL
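TL;DR: This paper proposes SPAGBias, the first framework to systematically evaluate spatial gender bias in large language models (LLMs). Combining a taxonomy of urban micro-spaces, a prompt library, and three diagnostic layers, it finds structured gender-space associations that go beyond the public-private divide, shows that they are embedded and reinforced across pre-training, instruction tuning, and reward modeling, and shows that they cause failures in downstream applications.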
Details
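Motivation: Gendered space theory holds that gender hierarchies are embedded in spatial organization, and LLMs are increasingly used in urban planning, which raises the concern that they may reproduce or amplify spatial gender bias. Method: The SPAGBias framework comprises a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis). Six representative models are tested along several dimensions, including story generation, prompt design, the effects of temperature and model scale, and tracing across training stages. Result: The models show fine-grained gender-space mappings that go beyond the public-private divide. Story generation reveals that emotion, wording, and social roles jointly shape "spatial gender narratives". The bias runs through pre-training, instruction tuning, and reward modeling, and model associations substantially exceed real-world distributions. Downstream, the bias produces failures in both normative and descriptive settings. Conclusion: LLMs not only reflect linguistic bias but also encode the spatial dimension of social gender cognition. By connecting sociological theory with computational analysis, the study extends bias research into the spatial domain and provides theory and tools for fairness governance in urban AI applications. Abstract: Large language models (LLMs) are being increasingly used in urban planning, but since gendered space theory highlights how gender hierarchies are embedded in spatial organization, there is concern that LLMs may reproduce or amplify such biases. We introduce SPAGBias, the first systematic framework to evaluate spatial gender bias in LLMs. It combines a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis). Testing six representative models, we identify structured gender-space associations that go beyond the public-private divide, forming nuanced micro-level mappings. Story generation reveals how emotion, wording, and social roles jointly shape "spatial gender narratives". We also examine how prompt design, temperature, and model scale influence bias expression. Tracing experiments indicate that these patterns are embedded and reinforced across the model pipeline (pre-training, instruction tuning, and reward modeling), with model associations found to substantially exceed real-world distributions. Downstream experiments further reveal that such biases produce concrete failures in both normative and descriptive application settings.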
This work connects sociological theory with computational analysis, extending bias research into the spatial domain and uncovering how LLMs encode social gender cognition through language.
[56] Which bird does not have wings: Negative-constrained KGQA with Schema-guided Semantic Matching and Self-directed Refinement
Midan Shim,Seokju Hwang,Kaehyun Um,Kyong-Ho Lee
Main category: cs.CL
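TL;DR: This paper introduces NEST KGQA, a new task focused on KGQA questions with negative constraints, together with the NestKGQA dataset. It designs PyLF, a readable logical form that expresses negation clearly, and proposes the CUCKOO framework, which improves semantic executability and robustness on multi-constraint questions through constraint-aware logical form generation, schema-guided semantic matching, and self-directed refinement.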
Details
Motivation: Existing KGQA methods and benchmarks are biased toward positive and calculation constraints and neglect the negative constraints that appear frequently in real-world questions, so models handle such questions poorly. Method: The paper introduces the NEST KGQA task and the NestKGQA dataset, designs PyLF, a Python-formatted logical form that expresses negation more clearly, and builds the CUCKOO framework, which consists of constraint-aware logical form drafting, schema-guided semantic matching, and a self-directed refinement step that is triggered only when execution returns an empty result. Result: Under few-shot settings, CUCKOO consistently outperforms baselines on both conventional KGQA and NEST-KGQA benchmarks. Conclusion: Negative constraints are a key challenge in KGQA that should not be overlooked. Together, PyLF and CUCKOO improve the understanding and execution of complex multi-constraint questions, especially those with negation. Abstract: Large language models still struggle with faithfulness and hallucinations despite their remarkable reasoning abilities. In Knowledge Graph Question Answering (KGQA), semantic parsing-based approaches address the limitations by understanding constraints in a user's question and converting them into a logical form to execute on a knowledge graph. However, existing KGQA benchmarks and methods are biased toward positive and calculation constraints. Negative constraints are neglected, although they frequently appear in real-world questions. In this paper, we introduce a new task, NEgative-conSTrained (NEST) KGQA, where each question contains at least one negative constraint, and a corresponding dataset, NestKGQA. We also design PyLF, a Python-formatted logical form, since existing logical forms are hardly suitable to express negation clearly while maintaining readability. Furthermore, NEST questions naturally contain multiple constraints. To mitigate their semantic complexity, we present a novel framework named CUCKOO, specialized to multiple-constrained questions and ensuring semantic executability. CUCKOO first generates a constraint-aware logical form draft and performs schema-guided semantic matching. It then selectively applies self-directed refinement only when executing improper logical forms yields an empty result, reducing cost while improving robustness. Experimental results demonstrate that CUCKOO consistently outperforms baselines on both conventional KGQA and NEST-KGQA benchmarks under few-shot settings.
[57] CoPA: Benchmarking Personalized Question Answering with Data-Informed Cognitive Factors
Hang Su,Zequn Liu,Chen Hu,Xuesong Lu,Yingce Xia,Zhen Liu
Main category: cs.CL
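TL;DR: This paper proposes the CoPA benchmark, which mines Community-Individual Preference Divergence (CIPD) to identify six personalization dimensions for fine-grained evaluation of how well large language models personalize in question answering.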
Details
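Motivation: Existing evaluation of personalized question answering relies on lexical similarity or manual heuristics and lacks sufficient data-driven validation, so it cannot measure personalization accurately. Method: The authors mine Community-Individual Preference Divergence (CIPD) from user interaction data, distill six key personalization factors, and build the CoPA benchmark with 1,985 user profiles to quantify how well model outputs align with users' cognitive preferences. Result: CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics, and the code is released to support follow-up research. Conclusion: CoPA offers a fine-grained evaluation framework for personalized QA that is grounded in real user behavior and oriented toward cognitive preferences, moving personalization evaluation from heuristics toward a data-driven approach. Abstract: While LLMs have demonstrated remarkable potential in Question Answering (QA), evaluating personalization remains a critical bottleneck. Existing paradigms predominantly rely on lexical-level similarity or manual heuristics, often lacking sufficient data-driven validation. We address this by mining Community-Individual Preference Divergence (CIPD), where individual choices override consensus, to distill six key personalization factors as evaluative dimensions. Accordingly, we introduce CoPA, a benchmark with 1,985 user profiles for fine-grained, factor-level assessment. By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The code is available at https://github.com/bjzgcai/CoPA.
[58] Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems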
Nishanth Madhusudhan,Vikas Yadav,Alexandre Lacoste
Main category: cs.CL
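TL;DR: This paper proposes the MM-AQA benchmark to systematically evaluate effective abstention by multimodal models on vision-language tasks. It finds that current models rarely abstain when they should, and that abstention-aware training is needed rather than better prompting or more agents.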
Details
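Motivation: Existing evaluation paradigms for vision-language models and multi-agent systems assume that every question is answerable and ignore the reliability requirement to abstain when evidence is insufficient. Abstention has been studied in text-only settings, but multimodal settings still lack fine-grained benchmarks that reflect realistic failure modes. Method: MM-AQA constructs unanswerable instances by transforming answerable ones along two axes, visual modality dependency and evidence sufficiency. Three frontier VLMs (closed and open-source) and two MAS architectures are evaluated on 2,079 samples, with analysis of how prompting strategies, architecture design (sequential vs. iterative), and type of missing evidence affect abstention. Result: (1) Under standard prompting, VLMs rarely abstain, and simple confidence baselines do better. (2) Multi-agent systems abstain more but at a cost in accuracy. (3) Sequential designs match or exceed iterative ones, which suggests that the core problem is miscalibration rather than reasoning depth. (4) Models abstain only when image or text evidence is entirely absent, and still attempt reconciliation when evidence is degraded or contradictory. Conclusion: Effective multimodal abstention cannot be achieved through prompt engineering or more agents alone and requires dedicated abstention-aware training. Abstract: Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2,079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.
[59] Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem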
Zeguan Xiao,Siqing Li,Yong Wang,Xuetao Wei,Jian Yang,Yun Chen,Guanhua Chen
Main category: cs.CL
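TL;DR: This paper recasts machine unlearning for large language models (LLMs) as an asymmetric two-task problem, with retention as the primary objective and forgetting as the auxiliary one, and proposes a retention-prioritized gradient synthesis framework. Its new method, SAGO, uses sign-constrained constructive gradient synthesis and outperforms baselines such as PCGrad both in theory and in experiments, substantially improving retention on benchmarks such as WMDP (for example, recovery of target-model MMLU performance rises from 44.6% to 96.0%) while maintaining forgetting strength.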
Details
Motivation: LLM unlearning faces an inherent conflict between forgetting and capability retention. Prior methods mostly re-balance losses, whereas the authors argue that the key is reshaping gradient geometry. Method: The paper proposes a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from conflict-aware combination. It adapts PCGrad to resolve gradient conflicts and introduces SAGO, which performs constructive gradient synthesis under sign constraints. Both variants keep the cosine similarity with the retain gradient non-negative, and SAGO achieves strictly tighter alignment. Result: On the WMDP Bio/Cyber and RWKU benchmarks, SAGO consistently pushes the Pareto frontier. On WMDP Bio, for example, recovery of target-model MMLU performance reaches 96.0%, compared with 44.6% for the naive baseline and 94.0% with PCGrad, at comparable forgetting strength. Conclusion: Reshaping gradient geometry mitigates the unlearning-retention trade-off more effectively than re-weighting losses, and SAGO demonstrates the benefit of a retention-prioritized design in both theoretical guarantees and empirical performance. Abstract: Machine unlearning for large language models (LLMs) aims to remove targeted knowledge while preserving general capability. In this paper, we recast LLM unlearning as an asymmetric two-task problem: retention is the primary objective and forgetting is an auxiliary one. From this perspective, we propose a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from conflict-aware combination. Instantiating the framework, we adapt established PCGrad to resolve gradient conflicts, and introduce SAGO, a novel retention-prioritized gradient synthesis method. Theoretically, both variants ensure non-negative cosine similarity with the retain gradient, while SAGO achieves strictly tighter alignment through constructive sign-constrained synthesis. Empirically, on WMDP Bio/Cyber and RWKU benchmarks, SAGO consistently pushes the Pareto frontier: e.g., on WMDP Bio (SimNPO+GD), recovery of target model MMLU performance progresses from 44.6% (naive) to 94.0% (+PCGrad) and further to 96.0% (+SAGO), while maintaining comparable forgetting strength. Our results show that re-shaping gradient geometry, rather than re-balancing losses, is the key to mitigating unlearning-retention trade-offs.
[60] Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations
Rami Luisto,Liisa Petäinen,Tommi Grönholm,Jan Böhm,Maarit Ahtiainen,Tomi Lilja,Ilkka Pölönen,Sami Äyrämö
Main category: cs.CL
TL;DR: This paper explores domain fine-tuning of Finnish BERT on Finnish medical text to address limited labeled data in healthcare NLP, and investigates whether embedding geometry changes can predict the benefit of domain-specific pre-training.
Details
Motivation: The common situation in healthcare AI where acquiring labeled datasets is delayed, especially for medical domains with scarce annotations. Method: Fine-tuning the Finnish BERT model on unlabeled Finnish medical text and analyzing embedding geometry changes to predict benefits of domain-specific pre-training. Result: Observations from fine-tuning Finnish BERT on Finnish medical text and preliminary insights into using embedding geometry to forecast gains from domain pre-training. Conclusion: Domain fine-tuning is valuable for low-resource medical NLP in Finnish, and embedding geometry analysis shows promise as a predictor for domain pre-training benefits, though further validation is needed. Abstract: In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.
[61] Pangu-ACE: Adaptive Cascaded Experts for Educational Response Generation on EduBench
Dinghao Li,Wenlong Zhou,Zhimin Chen,Yuehan Peng,Hong Ni,Chengfu Zou,Guoyu Shi,Yaochen Li
Main category: cs.CL
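TL;DR: This paper presents Pangu-ACE, a cascaded educational assistant (1B to 7B) that allocates computation according to task demand. It improves answer quality and format validity on the EduBench benchmark and stresses that its efficiency benefit comes from routing selectivity rather than lower absolute latency.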
Details
Motivation: Educational assistants should spend computation only where it is needed, and earlier offline evaluation was over-optimistic because of superficial format checks and needed correcting. Method: A 1B tutor-router produces a draft answer and routing signals. Simple samples are accepted directly, and complex ones are escalated to a 7B specialist prompt. CPU-side rescoring from saved JSONL predictions corrects the evaluation bias, and a reproducible artifact-first paper workflow is provided. Result: On 7,013 Chinese test samples, cascade_final improves deterministic quality from 0.457 to 0.538 and format validity from 0.707 to 0.866 over the legacy rule_v2 system. The 1B model handles 19.7% of requests directly, and routing varies sharply by task (for example, 78.0% of IP samples are completed at 1B, while QG and EC are almost always escalated). Conclusion: Pangu-ACE achieves clear gains in quality and format validity through sample-level cascading and fine-grained routing. Its main efficiency benefit is routing selectivity rather than reduced latency: the current deployment does not yet show an end-to-end latency advantage, and final alignment with GPT-5.4 is pending infrastructure repair. Abstract: Educational assistants should spend more computation only when the task needs it. This paper rewrites our earlier draft around the system that was actually implemented and archived in the repository: a sample-level 1B to 7B cascade for the shared-8 EduBench benchmark. The final system, Pangu-ACE, uses a 1B tutor-router to produce a draft answer plus routing signals, then either accepts the draft or escalates the sample to a 7B specialist prompt. We also correct a major offline evaluation bug: earlier summaries over-credited some open-form outputs that only satisfied superficial format checks. After CPU-side rescoring from saved prediction JSONL, the full Chinese test archive (7,013 samples) shows that cascade_final improves deterministic quality from 0.457 to 0.538 and format validity from 0.707 to 0.866 over the legacy rule_v2 system while accepting 19.7% of requests directly at 1B. Routing is strongly task dependent: IP is accepted by 1B 78.0% of the time, while QG and EC still escalate almost always. The current archived deployment does not yet show latency gains, so the defensible efficiency story is routing selectivity rather than wall-clock speedup.
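The accept-or-escalate control flow of a sample-level cascade can be illustrated as follows. This is a minimal sketch under assumptions, not the archived Pangu-ACE code: the `Draft` fields, the confidence threshold, and the toy small and large models are hypothetical, since the abstract does not specify which routing signals are used.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Draft:
    text: str
    confidence: float  # hypothetical routing signal in [0, 1]
    format_ok: bool    # hypothetical format-validity flag

def cascade(sample: str,
            small: Callable[[str], Draft],
            large: Callable[[str, str], str],
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Accept the small model's draft when its signals pass, else escalate."""
    draft = small(sample)
    if draft.format_ok and draft.confidence >= min_confidence:
        return draft.text, "accepted_small"
    # The large model sees both the sample and the rejected draft.
    return large(sample, draft.text), "escalated"

def toy_small(sample: str) -> Draft:
    """Toy stand-in for the small model: confident only on short inputs."""
    confidence = 0.9 if len(sample) < 20 else 0.3
    return Draft(text=f"draft:{sample}", confidence=confidence, format_ok=True)

def toy_large(sample: str, draft_text: str) -> str:
    """Toy stand-in for the large model."""
    return f"refined:{sample}"

samples = ["2+2?", "Write a graded essay rubric for unit 3"]
results = [cascade(s, toy_small, toy_large) for s in samples]
accept_rate = sum(route == "accepted_small" for _, route in results) / len(results)
print(results, accept_rate)
```

Tracking `accept_rate` per task is the kind of routing-selectivity statistic the abstract reports (for example, the share of requests accepted directly by the small model).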
We also package a reproducible artifact-first paper workflow and clarify the remaining external-baseline gap: GPT-5.4 re-judging is implemented locally, but the configured provider endpoint and key are invalid, so final sampled-baseline alignment with GPT-5.4 remains pending infrastructure repair.
[62] Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution
Yufeng Wu
Main category: cs.CL
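TL;DR: This paper treats Behavioral Profile (BP) annotation as a bundle of annotation skills rather than a single task, and uses a skill-file-driven pipeline to evaluate how well large language models (LLMs) can assist BP annotation of Chinese metaphorical color-term derivatives. BP annotation proves highly heterogeneous at the skill level: GPT-5.4 is reliable on some skills but not globally feasible, and human and model difficulty profiles align closely at the skill level but not at the instance or lexical-item level. The authors conclude that automatic annotation should be judged by skill feasibility rather than task-level automation.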
Details
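Motivation: BP annotation is hard to automate because it requires coding several linguistic dimensions at once. Existing approaches treat it as a single task and overlook differences among its component skills, so this paper re-examines the feasibility of LLM-assisted annotation from a skill-decomposition perspective. Method: Using a 300-instance validation subset and a 14-feature BP schema, the authors build a skill-file-driven annotation pipeline with externally defined schema files, decision rules, and examples. Two human annotators complete a two-round schema-only protocol, which allows skills to be classified as directly operable, recoverable under focused re-annotation, or structurally underspecified. GPT-5.4 and three open-source models are then evaluated under the same setup. Result: Of the 14 BP skills, 5 are directly operable, 4 are recoverable after focused re-annotation, and 5 remain structurally underspecified. On the retained skills, GPT-5.4 reaches accuracy 0.678, kappa 0.665, and weighted F1 0.695. Human and GPT difficulty profiles are strongly correlated at the skill level (r = 0.881) but essentially uncorrelated at the instance level (r = 0.016) and lexical-item level (r = -0.142). GPT is better viewed as an independent "third skill voice" than as a human substitute, and open-source failures stem mainly from schema-to-skill execution problems. Conclusion: Automatic annotation should be evaluated in terms of skill feasibility rather than task-level automation. Humans and models show a pattern of shared taxonomy and independent execution, which suggests that collaborative annotation design should focus on skill matching rather than wholesale replacement. Abstract: Behavioral Profile (BP) annotation is difficult to automate because it requires simultaneous coding across multiple linguistic dimensions. We treat BP annotation as a bundle of annotation skills rather than a single task and evaluate LLM-assisted BP annotation from this perspective. Using 3,134 concordance lines of 30 Chinese metaphorical color-term derivatives and a 14-feature BP schema, we implement a skill-file-driven pipeline in which each feature is externally defined through schema files, decision rules, and examples. Two human annotators completed a two-round schema-only protocol on a 300-instance validation subset, enabling BP skills to be classified as directly operable, recoverable under focused re-annotation, or structurally underspecified. GPT-5.4 and three locally deployable open-source models were then evaluated under the same setup. Results show that BP annotation is highly heterogeneous at the skill level: 5 skills are directly operable, 4 are recoverable after focused re-annotation, and 5 remain structurally underspecified. GPT-5.4 executes the retained skills with substantial reliability (accuracy = 0.678, κ = 0.665, weighted F1 = 0.695), but this feasibility is selective rather than global. Human and GPT difficulty profiles are strongly aligned at the skill level (r = 0.881), but not at the instance level (r = 0.016) or lexical-item level (r = -0.142), a pattern we describe as shared taxonomy, independent execution.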
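For readers unfamiliar with the agreement statistic reported above, the following sketch computes a chance-corrected kappa for two annotators. It assumes Cohen's kappa, which the abstract does not confirm, and the labels are made-up toy data rather than anything from the paper.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy labels (hypothetical): literal vs. figurative readings.
human = ["lit", "fig", "fig", "lit", "fig", "lit"]
model = ["lit", "fig", "lit", "lit", "fig", "fig"]
kappa = cohen_kappa(human, model)
print(round(kappa, 3))
```

Here raw agreement is 4/6 while chance agreement is 0.5, so kappa is lower than accuracy. The same gap explains why accuracy and kappa can differ in reported results.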
Pairwise agreement further suggests that GPT is better understood as an independent third skill voice than as a direct human substitute. Open-source failures are concentrated in schema-to-skill execution problems. These findings suggest that automatic annotation should be evaluated in terms of skill feasibility rather than task-level automation.
[63] ClimateCause: Complex and Implicit Causal Structures in Climate Reports
Liesbeth Allein,Nataly Pineda-Castañeda,Andrea Rocci,Marie-Francine Moens
Main category: cs.CL
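TL;DR: This paper introduces ClimateCause, an expert-annotated dataset of higher-order causal structures (including implicit and nested causality) from climate reports. It is intended to support the modeling and evaluation of complex causal networks, and it reveals the weakness of large language models in causal chain reasoning.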
Details
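Motivation: Existing causal discovery datasets mainly cover explicit, direct causal relations and cannot support tasks such as climate change analysis that require complex causal reasoning, so a dataset that represents implicit, nested, and higher-order causal structures is needed. Method: ClimateCause is built from science-for-policy climate reports, with higher-order causal structures annotated manually by domain experts. Cause-effect expressions are normalized and disentangled into individual causal relations, which are annotated with correlation, relation type, and spatiotemporal context. The authors also quantify readability from the semantic complexity of the underlying causal graphs and benchmark large language models on correlation inference and causal chain reasoning. Result: The dataset is constructed and shown to be useful for quantifying text readability. Benchmarking highlights causal chain reasoning, rather than correlation inference, as the key challenge for large language models. Conclusion: ClimateCause fills a data gap for higher-order and implicit causal modeling, provides a new benchmark for climate reasoning, causal discovery, and readability analysis, and exposes the limits of current large language models in deep causal chain reasoning. Abstract: Understanding climate change requires reasoning over complex causal networks. Yet, existing causal discovery datasets predominantly capture explicit, direct causal relations. We introduce ClimateCause, a manually expert-annotated dataset of higher-order causal structures from science-for-policy climate reports, including implicit and nested causality. Cause-effect expressions are normalized and disentangled into individual causal relations to facilitate graph construction, with unique annotations for cause-effect correlation, relation type, and spatiotemporal context. We further demonstrate ClimateCause's value for quantifying readability based on the semantic complexity of causal graphs underlying a statement. Finally, large language model benchmarking on correlation inference and causal chain reasoning highlights the latter as a key challenge.
[64] Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding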
Yifan Le
Main category: cs.CL
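TL;DR: This paper studies how the wording of schema keys acts as an implicit instruction channel in structured generation with large language models (LLMs). It finds that this channel interacts with explicit prompt instructions and that model families differ in how sensitive they are to it.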
Details
Motivation: Existing constrained decoding methods treat the schema as a purely structural constraint and overlook the possibility that its wording affects model behavior. The author asks whether schema key wording constitutes an implicit instruction channel. Method: Structured generation is reframed as a multi-channel instruction problem, and the roles and interaction of prompt-level and schema-level instructions are analyzed systematically, with experiments on several mathematical reasoning benchmarks. Result: Qwen models consistently benefit from schema-level instructions, while LLaMA models rely more on prompt-level guidance. Combining the two channels produces non-additive interaction effects and does not always help. Conclusion: Schema design not only determines output structure but also carries instruction signals, which offers a new perspective on structured generation with LLMs. Abstract: Constrained decoding has been widely adopted for structured generation with large language models (LLMs), ensuring that outputs satisfy predefined formats such as JSON and XML. However, existing approaches largely treat schemas as purely structural constraints and overlook the possibility that their linguistic formulation may affect model behavior. In this work, we study how instruction placement influences model performance in structured generation and show that merely changing the wording of schema keys, without modifying the prompt or model parameters, can significantly alter model performance under constrained decoding. Based on this observation, we propose to reinterpret structured generation as a multi-channel instruction problem, where instructions can be conveyed explicitly through prompts and implicitly through schema keys during decoding. To the best of our knowledge, this is the first work to systematically study how schema key formulation acts as an implicit instruction channel and affects model performance under constrained decoding. Experiments on multiple mathematical reasoning benchmarks show that different model families exhibit distinct sensitivities to these instruction channels: Qwen models consistently benefit from schema-level instructions, while LLaMA models rely more heavily on prompt-level guidance. We further observe non-additive interaction effects between instruction channels, showing that combining multiple channels does not always lead to further improvement.
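To make the notion of a schema-level instruction concrete, the sketch below builds two JSON schemas that are structurally identical and differ only in key wording. The key names and the `reword_keys` helper are hypothetical illustrations, not the schemas used in the paper. Under constrained decoding the model must emit the key itself, so the wording becomes part of the generated context.

```python
import json

# Structurally minimal schema with a neutral key name (hypothetical example).
neutral_schema = {
    "type": "object",
    "properties": {"output": {"type": "string"}},
    "required": ["output"],
}

def reword_keys(schema, mapping):
    """Return a copy of a flat object schema with property keys renamed."""
    properties = {mapping.get(k, k): v for k, v in schema["properties"].items()}
    required = [mapping.get(k, k) for k in schema.get("required", [])]
    return {**schema, "properties": properties, "required": required}

# Same structure, but the key now carries an instruction-like phrase.
instructive_schema = reword_keys(
    neutral_schema,
    {"output": "step_by_step_solution_then_final_answer"},
)

print(json.dumps(instructive_schema, indent=2))
```

Both schemas accept exactly one string field, so any performance difference between them under constrained decoding would be attributable to key wording rather than structure.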
These findings suggest that schema design not only determines output structure, but also carries instruction signals, offering a new perspective on structured generation in LLMs.
[65] Segment-Level Coherence for Robust Harmful Intent Probing in LLMs
Xuanli He,Bilgehan Sel,Faizan Ali,Jenny Bao,Hoagy Cunningham,Jerry Wei
Main category: cs.CL
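TL;DR: This paper proposes a new streaming probing objective that requires multiple evidence tokens to consistently support a prediction instead of relying on isolated high-scoring tokens, which improves the robustness and accuracy of detecting adversarial jailbreaks of large language models in CBRN domains.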
Details
Motivation: In high-stakes domains such as CBRN, existing streaming probes produce false alarms when sensitive terms appear in benign contexts, because they do not model contextual consistency. Method: The authors design a streaming probing objective that requires multiple evidence tokens to jointly support a prediction, compare probes on Attention, MLP, and residual-stream features, and test generalization to novel adversarial attacks such as character-level ciphers. Result: At a 1% false-positive rate, the true-positive rate improves by 35.55% relative to strong streaming baselines. AUROC also improves substantially even though the baseline is already near saturation (AUROC = 97.40%), and probes built for the base LLMs achieve an AUROC above 98.85% on cipher-obfuscated attacks. Probing Attention or MLP activations consistently outperforms residual-stream features. Conclusion: A streaming probe that aggregates multiple pieces of evidence is more robust and transferable, and offers a reliable plug-and-play option for LLM safety monitoring in high-stakes settings. Abstract: Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from near-saturated baseline performance (AUROC = 97.40%). We also show that probing Attention or MLP activations consistently outperforms residual-stream features. Finally, even when adversarial fine-tuning enables novel character-level ciphers, harmful intent remains detectable: probes developed for the base LLMs can be applied "plug-and-play" to these obfuscated attacks, achieving an AUROC of over 98.85%.
[66] RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
Zihong Zhang,Zuchao Li,Lefei Zhang,Ping Wang,Hai Zhao
Main category: cs.CL
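TL;DR: RACER is a lightweight, training-free speculative decoding method that combines retrieved exact patterns with logit-driven future cues to speed up LLM inference, achieving more than 2× speedup.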
Details
Motivation: Existing training-free speculative decoding methods have drawbacks: retrieval-based methods fail when no exact match exists, and logits-based methods lack structural guidance.
Method: RACER combines reliable anchors obtained by retrieval with the future-token distribution predicted from logits to produce richer, more accurate speculative drafts.
Result: On Spec-Bench, HumanEval, and MGSM-ZH, RACER achieves more than 2× speedup over autoregressive decoding and outperforms other training-free baselines.
Conclusion: RACER is a scalable, plug-and-play solution for efficient LLM decoding that balances reliability and flexibility.
Abstract: Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at https://github.com/hkr04/RACER.
[67] Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
Danae Sánchez Villegas,Samuel Lewis-Lim,Nikolaos Aletras,Desmond Elliott
Main category: cs.CL
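TL;DR: This paper analyzes the reasoning dynamics of 18 vision-language models (VLMs) and finds "answer inertia": early predictions tend to be reinforced rather than revised. Reasoning-trained models correct themselves more strongly, but their gains depend on modality conditions. Models are swayed by misleading textual cues, and this influence can appear in the Chain-of-Thought (CoT) yet be hard to detect. CoT reflects multimodal decision-making only partially, which poses challenges for system transparency and safety.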
Details
Motivation: Investigate how VLMs integrate visual and textual information during reasoning, in particular whether Chain-of-Thought (CoT) faithfully reflects multimodal decision-making, and what this implies for transparency and safety.
Method: Systematically analyze 18 VLMs spanning instruction-tuned and reasoning-trained models: track confidence over the CoT, quantify the corrective effect of reasoning, and evaluate the contribution of intermediate steps; run controlled experiments with misleading textual cues to test how much models rely on those cues and whether that reliance is detectable in the CoT.
Result: Answer inertia is widespread; reasoning-trained models correct more strongly, but the gain depends on modality conditions; models are consistently influenced by misleading textual cues, and this influence can surface in the CoT with detectability that varies by model; reasoning-trained models cite the cues more explicitly, yet their CoTs can mask the true modality reliance, whereas instruction-tuned models cite the cues less often but their shorter CoTs expose inconsistencies with the visual input.
Conclusion: CoT offers only a partial view of multimodal decision-making and does not fully reveal the real contributions of the visual and textual modalities, a key challenge for the interpretability, transparency, and safe deployment of VLMs.
Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input.
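The notions of answer inertia and corrective behavior described in this abstract can be made concrete with a small sketch. The definitions below are illustrative assumptions, not the paper's metrics: each trace is taken to be the list of answers a model would give after each reasoning step, and the function names `inertia_rate` and `corrective_rate` are hypothetical.

```python
from typing import Sequence


def inertia_rate(traces: Sequence[Sequence[str]]) -> float:
    """Fraction of traces whose final answer equals the answer after step one.

    Assumed definition for illustration; the paper may measure inertia differently.
    """
    kept = sum(1 for trace in traces if trace[0] == trace[-1])
    return kept / len(traces)


def corrective_rate(traces: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Among traces that start with a wrong answer, fraction that end correct."""
    wrong_start = [(t, g) for t, g in zip(traces, gold) if t[0] != g]
    if not wrong_start:
        return 0.0
    fixed = sum(1 for t, g in wrong_start if t[-1] == g)
    return fixed / len(wrong_start)


# Toy data: per-step answers for four questions and their gold labels.
traces = [["A", "A", "A"], ["B", "C", "A"], ["B", "B", "B"], ["C", "C", "C"]]
gold = ["A", "A", "A", "C"]
print(inertia_rate(traces), corrective_rate(traces, gold))
```

A high inertia rate combined with a low corrective rate would correspond to the pattern the abstract describes, where early commitments are reinforced instead of revised.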
Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
[68] Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Evaldas Vaiciukynas,Paulius Danenas,Linas Ablonskis,Algirdas Sukys,Edgaras Dambrauskas,Voldemaras Zitkus,Rita Butkiene,Rimantas Butleris
Main category: cs.CL
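TL;DR: This paper studies how well modern multilingual sentence embedding models support hate speech detection in Lithuanian, Russian, and English. It introduces LtHate, a new Lithuanian corpus, and compares six embedding models with different downstream classifiers (HBOS anomaly detection and CatBoost binary classification) and PCA dimensionality reduction in a unified framework. Supervised binary classification on multilingual embeddings performs best, reaching 92.19% accuracy in Russian.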
Details
Motivation: Online hate speech and abusive language are a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian, where high-quality data and effective models are scarce.
Method: Build LtHate, a new Lithuanian hate speech corpus; evaluate six multilingual sentence embeddings (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset; for each embedding, train a one-class HBOS anomaly detector and a two-class CatBoost model, with and without PCA compression to 64 dimensions; run all experiments in a unified Python pipeline.
Result: Two-class supervised models clearly outperform one-class anomaly detection. Best results: Lithuanian (jina) 80.96% accuracy, AUC 0.887; Russian (e5) 92.19% accuracy, AUC 0.978; English (e5 with PCA) 77.21% accuracy, AUC 0.859. PCA costs almost nothing in the supervised setting but has a slight negative effect in the unsupervised one.
Conclusion: Modern multilingual sentence embeddings combined with gradient-boosted decision trees such as CatBoost provide a robust, practical soft-computing solution for multilingual hate speech detection, particularly for low-resource languages.
Abstract: Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding, we train both a one-class HBOS anomaly detector and a two-class CatBoost classifier, with and without principal component analysis (PCA) compression to 64-dimensional feature vectors. Across all datasets, two-class supervised models consistently and substantially outperform one-class anomaly detection, with the best configurations achieving up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian (jina), 92.19% accuracy and AUC ROC of 0.978 in Russian (e5), and 77.21% accuracy and AUC ROC of 0.859 in English (e5 with PCA). PCA compression preserves almost all discriminative power in the supervised setting, while showing some negative impact for the unsupervised anomaly detection case.
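To make the size of the comparison concrete, the experimental design in this abstract can be enumerated as a grid. This is only a sketch: it assumes the full cross product of datasets, encoders, downstream models, and PCA settings was run, which the abstract implies but does not state outright, and the name `experiment_grid` is hypothetical.

```python
from itertools import product

# Names taken from the abstract; the grid structure itself is an assumption.
ENCODERS = ("potion", "gemma", "bge", "snow", "jina", "e5")
DATASETS = ("LtHate", "RuToxic", "EnSuperset")
MODELS = ("HBOS (one-class)", "CatBoost (two-class)")
PCA_DIMS = (None, 64)  # None = full embedding, 64 = PCA-compressed features


def experiment_grid():
    """Yield one configuration per (dataset, encoder, model, PCA setting)."""
    for dataset, encoder, model, pca_dim in product(DATASETS, ENCODERS, MODELS, PCA_DIMS):
        yield {"dataset": dataset, "encoder": encoder, "model": model, "pca_dim": pca_dim}


configs = list(experiment_grid())
print(len(configs))  # 3 datasets * 6 encoders * 2 models * 2 PCA settings
```

Under this reading, each reported best result (for example e5 with PCA on EnSuperset) is one cell of the grid.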
These results demonstrate how modern multilingual sentence embeddings combined with gradient boosted decision trees provide robust soft-computing solutions for multilingual hate speech detection applications.
[69] IE as Cache: Information Extraction Enhanced Agentic Reasoning
Hang Lv,Sheng Liang,Hongchao Gu,Wei Guo,Defu Lian,Yong Liu,Hao Wang,Enhong Chen
Main category: cs.CL
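TL;DR: This paper proposes IE-as-Cache, a framework that treats information extraction (IE) as a reusable cognitive cache to strengthen agentic reasoning; experiments show that it significantly improves multi-step reasoning accuracy.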
Details
Motivation: Traditional information extraction (IE) is treated only as a terminal objective: extracted results are consumed in isolation rather than maintained and reused during multi-step reasoning. This work aims to move past that limitation and make IE a dynamic cognitive resource that supports reasoning.
Method: Inspired by hierarchical computer memory, the IE-as-Cache framework combines query-driven extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise.
Result: Experiments on several challenging benchmarks and across different LLMs show significant improvements in reasoning accuracy.
Conclusion: IE can be effectively recast as a reusable cognitive cache, offering a new paradigm and research direction for integrating IE more deeply into downstream tasks.
Abstract: Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose IE-as-Cache, a framework that repurposes IE as a cognitive cache to enhance agentic reasoning. Drawing inspiration from hierarchical computer memory, our approach combines query-driven extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise. Experiments on challenging benchmarks across diverse LLMs demonstrate significant improvements in reasoning accuracy, indicating that IE can be effectively repurposed as a reusable cognitive resource and offering a promising direction for future research on downstream uses of IE.
[70] XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics
Jingxuan Liu,Zhi Qu,Jin Tei,Hidetaka Kamigaito,Lemao Liu,Taro Watanabe
Main category: cs.CL
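TL;DR: This paper introduces the XQ-MEval dataset for systematically measuring cross-lingual scoring bias in automatic metrics for multilingual machine translation, and proposes a normalization strategy based on the dataset to make multilingual evaluation fairer and more reliable.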
Details
Motivation: Automatic metrics for multilingual translation systems commonly suffer from cross-lingual scoring bias, where translations of equal quality receive different scores in different languages; because no benchmark with parallel quality annotations across languages exists, the problem has not been studied systematically.
Method: Build the semi-automatic XQ-MEval benchmark: automatically inject MQM-defined errors into high-quality translations, have native speakers filter them, and merge errors to produce pseudo translations of controllable quality, forming source-translation-reference triplets; evaluate nine mainstream metrics on this benchmark and propose a normalization strategy for cross-lingual score distributions.
Result: Experiments show that averaging metric scores across languages is inconsistent with human judgment, providing the first empirical evidence of cross-lingual scoring bias; the proposed normalization improves the fairness and reliability of multilingual evaluation.
Conclusion: Cross-lingual scoring bias is a key problem in multilingual translation evaluation; XQ-MEval provides the first reproducible benchmark for it, and the normalization method derived from it can be applied broadly in multilingual automatic evaluation.
Abstract: Automatic evaluation metrics are essential for building multilingual translation systems. The common practice of evaluating these systems is averaging metric scores across languages, yet this is questionable, since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, filter them by native speakers for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are then paired with corresponding sources and references to form triplets used in assessing the qualities of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal the inconsistency between averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.
[71] Text2Arch: A Dataset for Generating Scientific Architecture Diagrams from Natural Language Descriptions
Shivank Garg,Sankalp Mittal,Manish Gupta
Main category: cs.CL
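TL;DR: This paper presents a method that uses language models to generate high-fidelity scientific architecture diagrams from text. It builds Text2Arch, an open dataset of images, textual descriptions, and DOT code, and validates it by fine-tuning small language models and by in-context learning with GPT-4o. The fine-tuned models perform on par with GPT-4o, and all resources are released.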
Details
Motivation: Describing complex system designs or scientific processes in text alone is inefficient and ambiguous; a system that converts text into architecture diagrams with high fidelity is needed, but large-scale public datasets and effective open models are lacking.
Method: Build Text2Arch, a dataset of scientific architecture images, their textual descriptions, and DOT code; fine-tune several small language models on it, and use GPT-4o in-context learning to generate DOT code that is then rendered into diagrams.
Result: Text2Arch models significantly outperform baselines such as DiagramAgent and perform on par with GPT-4o in-context learning; all code, data, and models are released.
Conclusion: Small language models trained on the dedicated Text2Arch dataset can generate high-quality architecture diagrams efficiently and at low cost, offering a practical approach for AI-driven software design, enterprise architecture visualization, and educational content creation.
Abstract: Communicating complex system designs or scientific processes through text alone is inefficient and prone to ambiguity. A system that automatically generates scientific architecture diagrams from text with high semantic fidelity can be useful in multiple applications like enterprise architecture visualization, AI-driven software design, and educational content creation. Hence, in this paper, we focus on leveraging language models to perform semantic understanding of the input text description to generate intermediate code that can be processed to generate high-fidelity architecture diagrams. Unfortunately, no clean large-scale open-access dataset exists, implying lack of any effective open models for this task. Hence, we contribute a comprehensive dataset, Text2Arch, comprising scientific architecture images, their corresponding textual descriptions, and associated DOT code representations. Leveraging this resource, we fine-tune a suite of small language models, and also perform in-context learning using GPT-4o. Through extensive experimentation, we show that Text2Arch models significantly outperform existing baseline models like DiagramAgent and perform at par with in-context learning-based generations from GPT-4o. We make the code, data and models publicly available.
[72] Explain the Flag: Contextualizing Hate Speech Beyond Censorship
Jason Liartis,Eirini Kaldeli,Lambrini Gyftokosta,Eleftherios Chelioudakis,Orfeas Menis Mastromichalakis
Main category: cs.CL
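TL;DR: This paper proposes a hybrid approach that combines large language models (LLMs) with three newly built vocabularies (English, French, Greek) to detect hate speech and explain the detection, balancing accuracy with transparency.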
Details
Motivation: Most automated hate speech detection systems focus on removing content; they lack transparency and explainability and struggle to balance content governance with freedom of expression.
Method: Build a hybrid system with two pipelines: one detects and disambiguates offensive terms using human-curated multilingual vocabularies; the other uses an LLM as a context-aware evaluator of group-targeted content; the outputs are fused into grounded explanations.
Result: In human evaluation, the hybrid approach outperforms LLM-only baselines in both detection accuracy and explanation quality.
Conclusion: A hybrid paradigm that combines curated vocabulary resources with LLM contextual understanding yields more reliable, transparent, and explainable multilingual hate speech detection.
Abstract: Hate, derogatory, and offensive speech remains a persistent challenge in online platforms and public discourse. While automated detection systems are widely used, most focus on censorship or removal, raising concerns for transparency and freedom of expression, and limiting opportunities to explain why content is harmful. To address these issues, explanatory approaches have emerged as a promising solution, aiming to make hate speech detection more transparent, accountable, and informative. In this paper, we present a hybrid approach that combines Large Language Models (LLMs) with three newly created and curated vocabularies to detect and explain hate speech in English, French, and Greek. Our system captures both inherently derogatory expressions tied to identity characteristics and direct group-targeted content through two complementary pipelines: one that detects and disambiguates problematic terms using the curated vocabularies, and one that leverages LLMs as context-aware evaluators of group-targeting content. The outputs are fused into grounded explanations that clarify why content is flagged. Human evaluation shows that our hybrid approach is accurate, with high-quality explanations, outperforming LLM-only baselines.
[73] IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation
Haozhi Fan,Jinhao Duan,Kaidi Xu
Main category: cs.CL
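TL;DR: This paper proposes Interrogative Uncertainty Quantification (IUQ), a new framework for quantifying the uncertainty of large language models (LLMs) in long-form generation; it uses inter-sample consistency and intra-sample faithfulness to assess claim-level uncertainty and model faithfulness.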
Details
Motivation: Existing methods work well on short or constrained outputs but do not carry over to real-world settings that need long, free-form generation; LLMs often produce text that is semantically coherent yet factually wrong, and the multifaceted semantics and complex linguistic structure make uncertainty quantification difficult.
Method: Propose the IUQ framework, which follows an interrogate-then-respond paradigm and combines inter-sample consistency with intra-sample faithfulness to quantify uncertainty in long-form generation.
Result: Across several model families and sizes, IUQ shows superior performance on two widely used long-form generation datasets.
Conclusion: IUQ offers a reliable and interpretable route to uncertainty quantification for long-form generation, improving the trustworthiness and controllability of LLM outputs.
Abstract: Despite the rapid advancement of Large Language Models (LLMs), uncertainty quantification in LLM generation is a persistent challenge. Although recent approaches have achieved strong performance by restricting LLMs to produce short or constrained answer sets, many real-world applications require long-form and free-form text generation. A key difficulty in this setting is that LLMs often produce responses that are semantically coherent yet factually inaccurate, while the underlying semantics are multifaceted and the linguistic structure is complex. To tackle this challenge, this paper introduces Interrogative Uncertainty Quantification (IUQ), a novel framework that leverages inter-sample consistency and intra-sample faithfulness to quantify the uncertainty in long-form LLM outputs. By utilizing an interrogate-then-respond paradigm, our method provides reliable measures of claim-level uncertainty and the model's faithfulness. Experimental results across diverse model families and model sizes demonstrate the superior performance of IUQ over two widely used long-form generation datasets. The code is available at https://github.com/louisfanhz/IUQ.
[74] Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling
Zhijun Guo,Alvina Lai,Emmanouil Korakas,Aristeidis Vagenas,Irshad Ahamed,Christo Albor,Hengrui Zhang,Justin Healy,Kezhi Li
Main category: cs.CL
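TL;DR: This study develops and evaluates a retrieval-grounded large language model (LLM) conversational agent (CA) that helps people with diabetes understand continuous glucose monitoring (CGM) data and prepare for consultations. The CA's responses were rated significantly higher in quality than clinicians' responses, especially for empathy and actionability, with comparable safety, but the system is suited to an adjunct role only, not autonomous decision-making.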
Details
Motivation: Interpreting CGM data is time-consuming and calls for empathetic communication; empirical evidence on retrieval-grounded LLM systems in CGM-informed counseling is limited, so their potential as an aid needs to be evaluated.
Method: Build a retrieval-grounded LLM conversational agent that produces plain-language, non-individualized CGM explanations; construct 12 CGM cases from public datasets; have 6 senior UK diabetes clinicians each review 2 assigned cases and answer 24 questions; in a blinded multi-rater design, have 3 clinicians independently rate each CA and clinician response on 6 quality dimensions; analyze with linear mixed-effects models.
Result: The CA's overall quality score was significantly higher than the clinicians' (mean 4.37 vs 3.58, estimated difference 0.782, P<.001); the largest gains were in empathy (+1.062) and actionability (+0.992); major safety concerns were rare and similar in both groups (0.7% each).
Conclusion: Retrieval-grounded LLM systems can be a useful adjunct for CGM review, patient education, and preconsultation preparation, but they do not replace clinical judgment, and the findings do not support unsupervised real-world use or autonomous therapeutic decisions.
Abstract: Continuous glucose monitoring (CGM) is central to diabetes care, but explaining CGM patterns clearly and empathetically remains time-intensive. Evidence for retrieval-grounded large language model (LLM) systems in CGM-informed counseling remains limited. This study evaluates whether a retrieval-grounded LLM-based conversational agent (CA) could support patient understanding of CGM data and preparation for routine diabetes consultations. We developed a retrieval-grounded LLM-based CA for CGM interpretation and diabetes counseling support. The system generated plain-language responses while avoiding individualized therapeutic advice. Twelve CGM-informed cases were constructed from publicly available datasets. Between Oct 2025 and Feb 2026, 6 senior UK diabetes clinicians each reviewed 2 assigned cases and answered 24 questions. In a blinded multi-rater evaluation, each CA-generated and clinician-authored response was independently rated by 3 clinicians on 6 quality dimensions. Safety flags and perceived source labels were also recorded. Primary analyses used linear mixed-effects models. A total of 288 unique responses (144 CA and 144 clinician) generated 864 ratings. The CA received higher quality scores than clinician responses (mean 4.37 vs 3.58), with an estimated mean difference of 0.782 points (95% CI 0.692-0.872; P<.001). The largest differences were for empathy (1.062, 95% CI 0.948-1.177) and actionability (0.992, 95% CI 0.877-1.106). Safety flag distributions were similar, with major concerns rare in both groups (3/432, 0.7% each).
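The sample sizes reported in this abstract fit together arithmetically, which a short check makes explicit. The sketch assumes one reading of the design: each of the 6 clinicians answered 24 questions, every clinician answer was paired with one CA answer, and the 432 denominator is the number of ratings per arm. Variable names are illustrative.

```python
# Design figures quoted in the abstract.
clinicians = 6
questions_per_clinician = 24
raters_per_response = 3

clinician_responses = clinicians * questions_per_clinician  # clinician-authored answers
total_responses = 2 * clinician_responses                   # plus one CA answer per question
total_ratings = total_responses * raters_per_response       # every response rated 3 times
ratings_per_arm = total_ratings // 2                        # CA arm and clinician arm

major_flag_rate = 3 / ratings_per_arm                       # "3/432" in the abstract
raw_mean_gap = 4.37 - 3.58                                  # unadjusted difference in means

print(clinician_responses, total_responses, total_ratings, ratings_per_arm)
print(round(100 * major_flag_rate, 1), round(raw_mean_gap, 2))
```

The unadjusted gap between the rounded means (about 0.79) is close to, but not the same as, the reported estimate of 0.782; a small difference of this kind is expected because the estimate comes from a linear mixed-effects model and the means are rounded.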
Retrieval-grounded LLM systems may have value as adjunct tools for CGM review, patient education, and preconsultation preparation. However, these findings do not support autonomous therapeutic decision-making or unsupervised real-world use.
[75] DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering
Neha Srikanth,Jordan Boyd-Graber,Rachel Rudinger
Main category: cs.CL
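TL;DR: DiscoTrace is a method for identifying the rhetorical strategies answerers use when responding to information-seeking questions. The study finds that human communities construct answers in diverse ways, whereas large language models (LLMs) lack this rhetorical diversity and tend to over-extend the scope of their answers.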
Details
Motivation: Understand the diverse rhetorical strategies humans use in question answering, in order to make LLMs more useful and appropriate in contextualized QA.
Method: Propose DiscoTrace, which represents an answer as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on top of rhetorical structure theory (RST) parses; apply it to answers from nine human communities and compare them with LLM-generated answers.
Result: Human communities show clearly different preferences in rhetorical strategy when constructing answers; LLMs lack this diversity, and prompting them to mimic specific community guidelines does not help; LLMs tend to cover a broader range of question interpretations, including ones human answerers usually leave out.
Conclusion: LLMs currently lack context sensitivity and diversity in their choice of rhetorical strategy; QA models with stronger pragmatic competence should draw on human practice.
Abstract: We introduce DiscoTrace, a method to identify the rhetorical strategies that answerers use when responding to information-seeking questions. DiscoTrace represents answers as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on top of rhetorical structure theory parses. Applying DiscoTrace to answers from nine different human communities reveals that communities have diverse preferences for answer construction. In contrast, LLMs do not exhibit rhetorical diversity in their answers, even when prompted to mimic specific human community answering guidelines. LLMs also systematically opt for breadth, addressing interpretations of questions that human answerers choose not to address. Our findings can guide the development of pragmatic LLM answerers that consider a range of strategies informed by context in QA.
[76] QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies
Alexey Khoroshilov,Alexey Chernysh,Orkhan Ekhtibarov,Nini Kamkia,Dmitry Zmitrovich
Main category: cs.CL
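TL;DR: This paper presents QuantCode-Bench, a benchmark for systematically evaluating how well large language models (LLMs) generate trading strategies for the Backtrader framework from English descriptions. It contains 400 diverse tasks from multiple sources and evaluates them through a multi-stage pipeline (syntax check, backtest execution, trade generation, semantic alignment). The main bottleneck of current models is not syntax but modeling financial logic, using the API correctly, and staying consistent with task semantics.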
Details
Motivation: LLMs perform well on general-purpose programming tasks, but their ability to generate executable algorithmic trading strategies has not been evaluated systematically; generating a trading strategy requires financial domain knowledge, command of a specialized API, and code that actually produces trades on historical data, which sets it apart from standard code benchmarks.
Method: Build QuantCode-Bench (400 tasks drawn from Reddit, TradingView, StackExchange, GitHub, and synthetic sources); design a multi-stage evaluation pipeline (syntactic correctness, then backtest execution, then presence of trades, then semantic alignment judged by an LLM); and compare state-of-the-art models in a single-turn setting and in a feedback-driven multi-turn agentic setting.
Result: Current mainstream LLMs largely get the syntax right, but many failures occur in implementing trading logic, misusing the Backtrader API, and diverging from the natural-language description; multi-turn interaction improves performance but does not resolve the underlying gaps in domain logic.
Conclusion: Algorithmic trading strategy generation is a distinct class of domain-specific code generation whose success depends not only on technical correctness but on close alignment between the natural-language description, the financial logic, and the strategy's actual behavior; future work should strengthen domain-knowledge injection and logic verification.
Abstract: Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs in generating strategies for the Backtrader framework from textual descriptions in English. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Evaluation is conducted through a multi-stage pipeline that checks syntactic correctness, successful backtest execution, the presence of trades, and semantic alignment with the task description using an LLM judge. We compare state-of-the-art models in two settings: single-turn, where the strategy must be generated correctly on the first attempt, and agentic multi-turn, where the model receives iterative feedback and may repair its errors. We analyze the failure modes across different stages of the pipeline and show that the main limitations of current models are not related to syntax, but rather to the correct operationalization of trading logic, proper API usage, and adherence to task semantics.
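The four-stage evaluation described in this abstract can be sketched as a short-circuiting pipeline. This is a minimal illustration, not the benchmark's code: `run_backtest` and `judge` are hypothetical callbacks standing in for the Backtrader run and the LLM judge, and only the syntax stage uses a real library call (`ast.parse`).

```python
import ast
from typing import Callable, Optional


def evaluate_strategy(
    source: str,
    run_backtest: Callable[[str], Optional[int]],
    judge: Callable[[str], bool],
) -> str:
    """Return the name of the first failed stage, or "pass" if all succeed."""
    # Stage 1: syntactic correctness, checked with the standard-library parser.
    try:
        ast.parse(source)
    except SyntaxError:
        return "syntax"
    # Stage 2: backtest execution. The hypothetical runner returns the number
    # of trades, or None if the backtest crashed.
    n_trades = run_backtest(source)
    if n_trades is None:
        return "backtest"
    # Stage 3: the strategy must actually trade on the historical data.
    if n_trades == 0:
        return "trades"
    # Stage 4: semantic alignment with the task description (hypothetical judge).
    if not judge(source):
        return "semantic"
    return "pass"


# Usage with stub callbacks: code that parses and runs but never trades.
print(evaluate_strategy("x = 1", lambda src: 0, lambda src: True))
```

Reporting the first failed stage per task is what allows failure modes to be broken down by pipeline stage, as the abstract describes.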
These findings suggest that trading strategy generation constitutes a distinct class of domain-specific code generation tasks in which success requires not only technical correctness, but also alignment between natural-language descriptions, financial logic, and the observable behavior of the strategy on data.
[77] Compressing Sequences in the Latent Embedding Space: K-Token Merging for Large Language Models
Zihao Xu,John Harvill,Ziwei Fan,Yizhou Sun,Hao Ding,Hao Wang
Main category: cs.CL
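TL;DR: This paper proposes K-Token Merging, a lightweight framework that compresses long prompts in the latent embedding space by merging every K consecutive token embeddings into one, substantially reducing the compute and memory cost of long inputs while preserving generation quality.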
Details
Motivation: Existing token compression methods operate in token space and overlook redundancy and inefficiency in the latent embedding space, while the quadratic cost of self-attention makes long prompts expensive in compute and memory.
Method: Propose the K-Token Merging framework: at the embedding layer, a lightweight encoder merges each block of K consecutive token embeddings into a single embedding; the compressed sequence is fed to a LoRA-adapted LLM, and decoding still uses the original vocabulary.
Result: On Textualized Tree (structural reasoning), Amazon Reviews (sentiment classification), and CommitPackFT (code editing), the method achieves up to 75% input length reduction with minimal performance loss and lies on the Pareto frontier of performance vs. compression.
Conclusion: K-Token Merging is an efficient, plug-and-play latent-space compression method that combines a high compression ratio with strong model performance, offering a new approach to long-context LLM inference.
Abstract: Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
[78] Fabricator or dynamic translator?
Lisa Vasileva,Karin Sim
Main category: cs.CL
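TL;DR: This paper examines overgeneration by large language models (LLMs) in machine translation, analyzing its types (such as self-explanations, risky confabulations, and appropriate explanations) and how to detect them, and reports strategies and results from a commercial setting.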
Details
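Motivation: LLMs perform well at machine translation, but their generative nature leads to several kinds of overgeneration; these differ from the neurobabble of traditional NMT and need to be identified and classified accurately to make translation more reliable and comprehensible.
Method: Explore and compare several strategies for detecting and classifying overgeneration in LLM translation, with an empirical study in a commercial setting.
Result: The paper presents overgeneration detection strategies suited to commercial use and reports experimental results on how well different strategies identify the types of overgeneration.
Conclusion: Overgeneration in LLM translation is diverse and context-dependent, so detection should be designed around the goals of the task; appropriate overgeneration, such as explanatory output, can improve the reader's understanding and should not be rejected wholesale.
Abstract: LLMs are proving to be adept at machine translation, although, due to their generative nature, they may at times overgenerate in various ways. These overgenerations are different from the neurobabble seen in NMT and range from LLM self-explanations, to risky confabulations, to appropriate explanations, where the LLM is able to act as a human translator would, enabling greater comprehension for the target audience. Detecting and determining the exact nature of the overgenerations is a challenging task. We detail different strategies we have explored for our work in a commercial setting, and present our results.
[79] MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events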
Raunak Agarwal,Markus Wenzel,Simon Baur,Jonas Zimmer,George Harvey,Jackie Ma
Main category: cs.CL
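TL;DR: This paper introduces MADE, a continuously updated multi-label text classification (MLTC) benchmark built from medical device adverse event reports. It targets the saturation and data contamination of existing benchmarks and systematically evaluates how various models trade off accuracy against uncertainty quantification (UQ).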
Details
Motivation: Existing MLTC benchmarks are becoming saturated and are vulnerable to training-data contamination, making it hard to separate genuine reasoning from memorization; high-stakes domains such as healthcare need models that combine strong performance with reliable uncertainty quantification.
Method: Build MADE, a benchmark that is continuously updated, has long-tailed hierarchical labels, and uses strict temporal splits; establish baselines on more than 20 encoder and decoder models (fine-tuned, few-shot, and instruction-tuned or reasoning variants); systematically evaluate entropy-based, consistency-based, and self-verbalized UQ methods.
Result: Smaller discriminatively fine-tuned decoders give the best balance between head-to-tail accuracy and UQ; generative fine-tuning yields the most reliable UQ; large reasoning models improve performance on rare labels but show weak UQ; self-verbalized confidence is unreliable.
Conclusion: MADE provides a contamination-resistant, reproducible evaluation platform for medical MLTC and exposes non-trivial trade-offs among model scale, training paradigm, and UQ capability, underlining the need for purpose-built UQ strategies rather than reliance on default confidence outputs.
Abstract: Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty.
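The two families of uncertainty scores named in this abstract, entropy-based and consistency-based, can be illustrated for the multi-label case. The formulas below are common textbook choices assumed here for illustration (mean per-label binary entropy, and disagreement with the most frequent sampled label set); the benchmark's exact definitions may differ, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, List


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no decision made with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def entropy_uncertainty(label_probs: Dict[str, float]) -> float:
    """Mean per-label binary entropy, in [0, 1]; higher means less certain."""
    if not label_probs:
        return 0.0
    return sum(binary_entropy(p) for p in label_probs.values()) / len(label_probs)


def consistency_uncertainty(samples: List[FrozenSet[str]]) -> float:
    """One minus the share of sampled label sets that match the most common set."""
    counts = Counter(samples)
    return 1.0 - counts.most_common(1)[0][1] / len(samples)


# Toy usage: per-label probabilities, then four sampled predictions.
print(entropy_uncertainty({"device_failure": 0.5, "injury": 1.0}))
print(consistency_uncertainty([
    frozenset({"injury"}), frozenset({"injury"}),
    frozenset({"injury", "device_failure"}), frozenset({"device_failure"}),
]))
```

The entropy score needs access to label probabilities, while the consistency score only needs repeated samples, which is why the latter also applies to API-only generative models.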
Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.
[80] From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
Kiran Purohit,Ramasuri Narayanam,Soumyabrata Pal
Main category: cs.CL
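TL;DR: This paper proposes SpecGuard, a verification-aware speculative decoding framework based on model-internal signals, which uses step-level verification to improve both the accuracy and the efficiency of LLM reasoning.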
Details
Motivation: Existing speculative decoding methods let erroneous steps propagate, while adding an external reward model increases latency and compute and limits generalizability.
Method: At each step, SpecGuard samples multiple draft candidates, selects the most consistent step, and validates it jointly with two lightweight model-internal signals (an attention-based grounding score and a log-probability-based confidence score) to decide whether to accept the step or recompute it with the target model.
Result: Across several reasoning benchmarks, SpecGuard improves accuracy by 3.6% and reduces latency by about 11%, outperforming both standard speculative decoding and reward-guided speculative decoding.
Conclusion: By relying only on model-internal signals, SpecGuard achieves efficient and accurate step-level verification, offering a more general, lower-overhead way to improve speculative decoding.
Abstract: Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed using the target, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.
cs.CV [Back]
[81] QualiaNet: An Experience-Before-Inference Network
Paul Linton
Main category: cs.CV
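TL;DR: This paper proposes QualiaNet, a two-stage computational model of human 3D vision. An experience module extracts stereo depth relative to fixation, and an inference module exploits a natural scene statistic (near scenes produce pronounced disparity gradients, far scenes are comparatively flat) by using a CNN to estimate distance from disparity maps.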
Details
Motivation: Explain why stereo visual experience, which carries no absolute distance information, still affects our inferences about visual scale.
Method: Build the two-stage QualiaNet model: the first stage produces disparity maps that simulate human stereo experience; the second stage uses a CNN to learn to estimate distance from disparity gradients.
Result: QualiaNet recovers distance from disparity gradients alone, supporting the use of natural scene statistics for 3D inference.
Conclusion: The inference module of human stereo vision may rely on the natural statistical difference in disparity gradients between near and far scenes, a mechanism that a computational model can capture.
Abstract: Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although our experience of stereo vision does not provide us with distance information, it does affect our inferences about visual scale. We propose that the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.
[82] HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
Team HY-World,Chenjie Cao,Xuhui Zuo,Zhenwei Wang,Yisu Zhang,Junta Wu,Zhenyang Liu,Yuning Gong,Yang Liu,Bo Yuan,Chao Zhang,Coopers Li,Dongyuan Guo,Fan Yang,Haiyu Zhang,Hang Cao,Jianchen Zhu,Jiaxin Lin,Jie Xiao,Jihong Zhang,Junlin Yu,Lei Wang,Lifu Wang,Lilin Wang,Linus,Minghui Chen,Peng He,Penghao Zhao,Qi Chen,Rui Chen,Rui Shao,Sicong Liu,Wangchen Qin,Xiaochuan Niu,Xiang Yuan,Yi Sun,Yifei Tang,Yifu Sun,Yihang Lian,Yonghao Tan,Yuhong Liu,Yuyang Yin,Zhiyuan Min,Tengfei Wang,Chunchao Guo
Main category: cs.CV
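TL;DR: HY-World 2.0 is an open-source world model framework that accepts multi-modal inputs (text, single-view or multi-view images, video) and generates high-quality, navigable 3D Gaussian Splatting (3DGS) scenes. It comprises several new modules (HY-Pano 2.0, WorldNav, WorldStereo 2.0, WorldMirror 2.0, and the WorldLens rendering platform), reaches state-of-the-art performance among open-source approaches, and is comparable to the closed-source model Marble.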
Details
Motivation: Improve multi-modal 3D world modeling: support richer input forms (such as video and multi-view images), raise the quality, consistency, and interactivity of generated 3D scenes, and advance open-source research on 3D world models.
Method: Propose a four-stage generation pipeline: a) HY-Pano 2.0 generates the panorama; b) WorldNav plans trajectories; c) WorldStereo 2.0 expands views using keyframes and a memory mechanism; d) WorldMirror 2.0, upgraded into a universal 3D prediction model, composes the world. Also build WorldLens, a high-performance 3DGS rendering platform with IBL lighting, collision detection, and training-rendering co-design.
Result: The framework achieves the best performance among open-source methods on several benchmarks, with results close to the closed-source model Marble; it generates navigable 3DGS scenes from text or a single image and reconstructs worlds from multi-view images or video; models, code, and technical details are fully released.
Conclusion: HY-World 2.0 is a full-featured, high-performance, open, and reproducible framework for multi-modal 3D world modeling that advances 3D content generation and interactive world modeling.
Abstract: We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Also, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble.
We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.
[83] Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
Ahmed Bourouis,Savas Ozkan,Andrea Maracani,Yi-Zhe Song,Mete Ozay
Main category: cs.CV
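TL;DR: This paper proposes a new method for generating geometrically consistent multi-view scenes from a single freehand sketch. By building a new dataset, adding geometry-prior attention adapters, and introducing a sparse correspondence supervision loss, it generates all views in a single denoising process, markedly improving realism and geometric consistency while speeding up inference.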
Details
Motivation: Existing methods cannot handle freehand sketches as input, since sketches carry very little geometric information and contain spatial distortions; sketch-to-multi-view generation has not been attempted before. Method: Three contributions: (i) a paired sketch-to-multiview dataset of about 9k samples; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; (iii) a Sparse Correspondence Supervision Loss (CSL) derived from SfM reconstructions. Result: Improves FID by over 60% and Corr-Acc by 23%, with up to a 3.7× inference speedup, without requiring reference images, iterative refinement, or per-scene tuning. Conclusion: The framework is the first to generate geometrically consistent multi-view scenes end to end from a single freehand sketch, opening a new path for sketch-driven 3D content creation. Abstract: We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of $\sim$9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7$\times$ inference speedup.
[84] DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines
Gabriel Pimenta de Freitas Cardoso,Caio Lucas da Silva Chacon,Jonas Felipe da Fonseca Oliveira,Paulo Henrique de Medeiros Araujo
Main category: cs.CV
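TL;DR: This paper presents DharmaOCR Full and Lite, two small language models specialized for structured OCR, together with DharmaOCR-Benchmark, a benchmark covering several document types. It also applies Direct Preference Optimization (DPO) to OCR for the first time, lowering the text degeneration rate while preserving or improving extraction quality.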
Details
Motivation: Address the trade-off among transcription quality, generation stability, and inference cost in OCR, and account for the negative impact of text degeneration on real deployment performance. Method: Proposes two SSLMs, DharmaOCR Full and Lite; builds DharmaOCR-Benchmark with a unified evaluation protocol; applies DPO to OCR for the first time, using degenerate samples as rejected examples to suppress looping generation; combines it with SFT to enforce JSON-structured output; and uses AWQ quantization to reduce compute cost. Result: DharmaOCR Full (7B) and Lite (3B) reach extraction-quality scores of 0.925 and 0.911 on DharmaOCR-Benchmark, with degeneration rates as low as 0.40% and 0.20%, respectively; DPO lowers the degeneration rate by up to 87.6% (relative); AWQ quantization cuts per-page cost by up to 22% with negligible quality loss. Conclusion: The DharmaOCR models jointly optimize quality, stability, and cost for structured OCR; DPO plus schema-enforcing SFT is an effective recipe for suppressing degeneration and offers a new methodology for training OCR models. Abstract: This manuscript introduces DharmaOCR Full and Lite, a pair of specialized small language models (SSLMs) for structured OCR that jointly optimize transcription quality, generation stability, and inference cost. It also presents DharmaOCR-Benchmark, a benchmark that covers printed, handwritten, and legal/administrative documents, and proposes a unified evaluation protocol that measures fidelity and structure while explicitly tracking text degeneration as a first-class benchmark metric (alongside unit cost). Beyond reporting degeneration rates, the manuscript empirically shows degeneration is not merely a quality failure, since it materially worsens production performance by increasing response time, reducing throughput, and inflating computational cost due to abnormally long generations. To the best of the authors' knowledge, as a methodological contribution, this is the first application of Direct Preference Optimization (DPO) for OCR, explicitly using degenerate generations as rejected examples to penalize looping behavior. Combined with Supervised Fine-Tuning (SFT) for enforcing a strict JSON schema (header, margin, footer, and text), DPO consistently reduces degeneration rate across model families (up to 87.6% relative) while preserving or improving extraction quality. 
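As a rough illustration of how degenerate generations can act as rejected examples, the sketch below computes the standard DPO loss for one preference pair. It assumes sequence-level log-probabilities are already available from the policy and from a frozen reference model; the function name, argument names, and the `beta` value are illustrative, and the abstract does not state which DPO variant or hyperparameters the authors use.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    In the OCR setting described above, `chosen` would be a clean
    transcription and `rejected` a looping (degenerate) generation.
    All inputs are summed token log-probabilities of a full sequence.
    """
    # Margin: how much more the policy prefers the chosen output over the
    # rejected one, measured relative to the frozen reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability away from the degenerate output.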
The resulting models, namely DharmaOCR Full (7B) and DharmaOCR Lite (3B), set a new state of the art on DharmaOCR-Benchmark, outperforming every open-source and commercial baseline evaluated on extraction quality, reaching scores of 0.925 and 0.911 with degeneration rates of 0.40% and 0.20%. AWQ quantization reduces per-page cost by up to 22% with negligible quality loss, enabling a strong quality-cost trade-off in comparison to proprietary OCR APIs and open-source alternatives.
[85] Interpretable Human Activity Recognition for Subtle Robbery Detection in Surveillance Videos
Bryan Jhoan Cazáres Leyva,Ulises Gachuz Davila,José Juan González Fonseca,Juan Irving Vasquez,Vanessa A. Camacho-Vázquez,Sergio Isahí Garrido-Castañeda
Main category: cs.CV
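TL;DR: This paper proposes a lightweight hybrid method based on pose estimation and interpretable features for real-time detection of non-violent street snatch-and-run events in unconstrained surveillance video, with edge deployment on a Jetson Nano.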
Details
Motivation: Non-violent snatch-and-run robberies are brief and subtle, and in open-scene surveillance footage they are hard to tell apart from ordinary human interactions; existing methods lack real-time capability and interpretability and are difficult to deploy on edge devices. Method: A YOLO-based pose estimator extracts body keypoints; kinematic and interaction features are built from hand speed, arm extension, inter-person distance, and relative motion; a Random Forest classifier makes the decision; a temporal hysteresis filter stabilizes frame-level predictions. Result: Effectiveness and cross-scene generalization are shown on a staged dataset and on a test set of real videos collected from the internet; the pipeline runs end to end in real time on an NVIDIA Jetson Nano. Conclusion: The pose-driven lightweight hybrid framework balances detection performance, interpretability, and edge-deployment feasibility, offering a practical solution for real-time, proactive street-robbery alerting. Abstract: Non-violent street robberies (snatch-and-run) are difficult to detect automatically because they are brief, subtle, and often indistinguishable from benign human interactions in unconstrained surveillance footage. This paper presents a hybrid, pose-driven approach for detecting snatch-and-run events that combines real-time perception with an interpretable classification stage suitable for edge deployment. The system uses a YOLO-based pose estimator to extract body keypoints for each tracked person and computes kinematic and interaction features describing hand speed, arm extension, proximity, and relative motion between an aggressor-victim pair. A Random Forest classifier is trained on these descriptors, and a temporal hysteresis filter is applied to stabilize frame-level predictions and reduce spurious alarms. We evaluate the method on a staged dataset and on a disjoint test set collected from internet videos, demonstrating promising generalization across different scenes and camera viewpoints. Finally, we implement the complete pipeline on an NVIDIA Jetson Nano and report real-time performance, supporting the feasibility of proactive, on-device robbery detection.
[86] SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning
Xue Wu,Shengting Cao,Jiaqi Gong
Main category: cs.CV
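TL;DR: This paper proposes SatBLIP, a framework that combines satellite imagery with language models to automatically identify key environmental features of rural areas (such as roof condition, road width, and vegetation), predicting the county-level Social Vulnerability Index (SVI) more accurately and making rural environmental risk assessment finer-grained and more interpretable.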
Details
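Motivation: Current rural environmental risk assessment relies on coarse vulnerability indices that fail to reflect local conditions (such as housing quality, road access, and land-surface patterns); traditional remote-sensing pipelines suffer from handcrafted features, manual virtual audits, and general-purpose vision-language models that do not fit satellite-image semantics. Method: SatBLIP, a vision-language framework for satellite imagery: 1) GPT-4o generates structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, road context); 2) a BLIP model adapted to satellite semantics is fine-tuned to generate image captions; 3) captions are encoded with CLIP and fused with LLM-derived embeddings via attention; 4) SVI is estimated under spatial aggregation; 5) SHAP is used to analyze the key driving features. Result: SatBLIP significantly improves county-level SVI prediction; SHAP analysis identifies stable, interpretable risk drivers such as roof form/condition, street width, vegetation cover, and cars/open space, supporting interpretable mapping of rural risk environments. Conclusion: Through satellite-specific vision-language modeling and explainable AI, SatBLIP overcomes the limitations of traditional remote sensing and vulnerability assessment, offering a new paradigm for fine-grained, traceable understanding of rural environmental risk. Abstract: Rural environmental risks are shaped by place-based conditions (e.g., housing quality, road access, land-surface patterns), yet standard vulnerability indices are coarse and provide limited insight into risk contexts. We propose SatBLIP, a satellite-specific vision-language framework for rural context understanding and feature identification that predicts county-level Social Vulnerability Index (SVI). SatBLIP addresses limitations of prior remote sensing pipelines (handcrafted features, manual virtual audits, and natural-image-trained VLMs) by coupling contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics. We use GPT-4o to generate structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, and road context), then fine-tune a satellite-adapted BLIP model to generate captions for unseen images. Captions are encoded with CLIP and fused with LLM-derived embeddings via attention for SVI estimation under spatial aggregation. Using SHAP, we identify salient attributes (e.g., roof form/condition, street width, vegetation, cars/open space) that consistently drive robust predictions, enabling interpretable mapping of rural risk environments.
[87] FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images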
Sabab Ishraq,Aarushi Aarushi,Juncai Jiang,Chen Chen
Main category: cs.CV
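TL;DR: This paper introduces the FoodSense dataset for cross-sensory inference, containing 66,842 participant-image pairs with ratings and descriptors across four sensory dimensions (taste, smell, texture, and sound), and builds the FoodSense-VL model, which predicts multisensory ratings from food images and generates image-grounded explanations.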
Details
Motivation: Humans can infer taste, smell, texture, and even sound from food images, but existing vision-language research focuses mostly on recognition tasks; image-based prediction of cross-sensory experience remains under-explored. Method: Builds the human-annotated FoodSense dataset with multisensory ratings and free-text descriptors, and uses a large language model to generate image-grounded reasoning traces; on this basis trains the multimodal FoodSense-VL model for sensory rating prediction and explanation generation. Result: Releases the FoodSense dataset and the FoodSense-VL model, demonstrates the feasibility of cross-sensory inference, and shows that commonly used evaluation metrics fall short for visual sensory inference. Conclusion: The work connects cognitive-science findings on cross-sensory perception with modern instruction tuning of multimodal models, moving vision-language models toward richer sensory understanding. Abstract: Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision-language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision-language benchmark model, to produce both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visual sensory inference.
[88] Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers
Felipe Parodi,Jordan Matelsky,Melanie Segado
Main category: cs.CV
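TL;DR: Using several replacement controls (mean substitution, noise substitution, and cross-image register shuffling), this paper finds that zero-ablation severely overstates the functional importance of registers in vision transformers; what matters is that registers carry plausible register-like activation patterns, not exact image-specific values.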
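The four interventions named in this entry can be sketched on toy data as follows. This is a minimal illustration rather than the authors' code: each image is assumed to have one register vector stored as a plain list, and the batch-level mean and the Gaussian noise statistics are assumptions about how such controls might be computed.

```python
import random


def zero_ablation(batch):
    """Replace every register vector with zeros."""
    return [[0.0] * len(vec) for vec in batch]


def mean_substitution(batch):
    """Replace every register vector with the per-dimension batch mean."""
    n, dim = len(batch), len(batch[0])
    mean = [sum(vec[k] for vec in batch) / n for k in range(dim)]
    return [list(mean) for _ in batch]


def noise_substitution(batch, rng):
    """Replace registers with Gaussian noise matched to the batch mean and std."""
    n, dim = len(batch), len(batch[0])
    mean = [sum(vec[k] for vec in batch) / n for k in range(dim)]
    std = [(sum((vec[k] - mean[k]) ** 2 for vec in batch) / n) ** 0.5 for k in range(dim)]
    return [[rng.gauss(mean[k], std[k]) for k in range(dim)] for _ in batch]


def cross_image_shuffle(batch, rng):
    """Permute register vectors across images, so content no longer matches the image."""
    order = list(range(len(batch)))
    rng.shuffle(order)
    return [list(batch[i]) for i in order]
```

Only `zero_ablation` pushes activations off the distribution the model saw in training; the other three keep register-like statistics while discarding image-specific content, which is the contrast the entry draws.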
Details
Motivation: Zero-ablation is widely used to probe token function in vision transformers, but it is unclear whether it truly reflects how necessary a component is; this paper tests whether registers are really as indispensable as zero-ablation suggests. Method: On DINOv2+registers and DINOv3, compares zero-ablation with three replacement controls (mean substitution, noise substitution, cross-image register shuffling) on classification, correspondence, and segmentation, and analyzes changes in the cosine similarity of internal representations. Result: Zero-ablation causes large drops (up to -36.6 pp classification, -30.9 pp segmentation), whereas all three replacement controls keep performance stable (within about 1 pp); cosine-similarity analysis shows that zero-ablation perturbs representations far more than the other methods; the conclusions replicate at ViT-B scale. Conclusion: Zero-ablation overstates dependence on exact register content; the core function of registers is to provide plausible register-like activation structure (for example, buffering dense features and encoding compressed patch geometry), not to store image-specific information. Abstract: Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to $-36.6$\,pp classification, $-30.9$\,pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within ${\sim}1$\,pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from \texttt{[CLS]} dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
[89] Crowdsourcing of Real-world Image Annotation via Visual Properties
Xiaolei Diao,Fausto Giunchiglia
Main category: cs.CV
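TL;DR: This paper proposes an image annotation method that combines knowledge representation, natural language processing, and computer vision; using visual property constraints and an interactive crowdsourcing framework built on a category hierarchy, it reduces annotator subjectivity and mitigates the semantic gap problem.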
Details
Motivation: Existing object recognition datasets suffer from the semantic gap problem, which creates complex many-to-many mappings between visual data and linguistic descriptions, introduces annotation bias, and hurts performance on computer vision tasks. Method: Proposes an image annotation method that integrates knowledge representation, NLP, and CV; designs a dynamic, interactive crowdsourcing framework driven by a predefined object category hierarchy and annotator feedback, using visual property constraints to guide annotation. Result: Experiments confirm the effectiveness of the method, and the crowdsourcing setup is optimized by analyzing annotator feedback. Conclusion: The proposed method reduces annotator subjectivity, mitigates the semantic gap, and improves the quality and consistency of image annotation. Abstract: Recent advances in data-centric artificial intelligence highlight inherent limitations in object recognition datasets. One of the primary issues stems from the semantic gap problem, which results in complex many-to-many mappings between visual data and linguistic descriptions. This bias adversely affects performance in computer vision tasks. This paper proposes an image annotation methodology that integrates knowledge representation, natural language processing, and computer vision techniques, aiming to reduce annotator subjectivity by applying visual property constraints. We introduce an interactive crowdsourcing framework that dynamically asks questions based on a predefined object category hierarchy and annotator feedback, guiding image annotation by visual properties. Experiments demonstrate the effectiveness of this methodology, and annotator feedback is discussed to optimize the crowdsourcing setup.
[90] Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images
Jue Jiang,Aneesh Rangnekar,Harini Veeraraghavan
Main category: cs.CV
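TL;DR: This paper proposes DAGMaN, a framework that combines attention-guided masking with a co-distillation scheme using a noisy teacher to address information leakage and reduced attention-head diversity in masked image modeling on medical images, and validates it on multiple downstream tasks.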
Details
Motivation: Random masking on medical images tends to leak information between contextually similar patches, weakening self-supervised learning; the Swin Transformer lacks a global [CLS] token, which makes advanced masking strategies hard to apply. Method: Proposes an attention-guided masking mechanism embedded in a co-distillation framework, and introduces a noisy teacher model for the first time, enabling selective masking while preserving attention-head diversity. Result: Achieves superior performance on multiple medical imaging tasks, including lung nodule classification (full and few-shot), immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering. Conclusion: DAGMaN mitigates information leakage and attention homogenization in self-supervised pretraining on medical images, improving generalization and downstream performance. Abstract: Masked image modeling (MIM) is a highly effective self-supervised learning (SSL) approach to extract useful feature representations from unannotated data. Predominantly used random masking methods make SSL less effective for medical images due to the contextual similarity of neighboring patches, leading to information leakage and SSL simplification. The hierarchical shifted window (Swin) transformer, a highly effective approach for medical images, cannot use advanced masking methods as it lacks a global [CLS] token. Hence, we introduce an attention guided masking mechanism for Swin within a co-distillation learning framework to selectively mask semantically co-occurring and discriminative patches, to reduce information leakage and increase the difficulty of SSL pretraining. However, attention guided masking inevitably reduces the diversity of attention heads, which negatively impacts downstream task performance. To address this, we integrate, for the first time, a noisy teacher into the co-distillation framework (termed DAGMaN) that performs attentive masking while preserving high attention head diversity. We demonstrate the capability of DAGMaN on multiple tasks including full- and few-shot lung nodule classification, immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering.
[91] H2VLR: Heterogeneous Hypergraph Vision-Language Reasoning for Few-Shot Anomaly Detection
Jianghong Huang,Luping Ji,Weiwei Duan,Mao Ye
Main category: cs.CV
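TL;DR: This paper proposes H2VLR, a heterogeneous-hypergraph vision-language reasoning framework for few-shot anomaly detection (FSAD). By jointly modeling visual regions and semantic concepts, it overcomes the reliance of existing VLM methods on pairwise feature matching and significantly improves performance.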
Details
Motivation: Most existing few-shot anomaly detection (FSAD) methods based on vision-language models (VLMs) perform only pairwise feature matching, ignoring structural dependencies and global consistency, which limits performance. Method: Proposes the Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework, which casts FSAD as a high-order inference problem over visual-semantic relations and jointly models visual regions and semantic concepts in a unified hypergraph. Result: Experiments on representative industrial and medical benchmarks show that H2VLR often reaches state-of-the-art (SOTA) performance. Conclusion: By modeling high-order visual-semantic relations with a heterogeneous hypergraph, H2VLR improves few-shot anomaly detection and offers a new direction for applying VLMs to FSAD. Abstract: As a classic vision task, anomaly detection has been widely applied in industrial inspection and medical imaging. In this task, data scarcity is a frequent issue. To address it, the few-shot anomaly detection (FSAD) scheme is attracting increasing attention. In recent years, beyond the traditional visual paradigm, Vision-Language Models (VLMs) have been extensively explored to boost this field. However, almost all existing VLM-based FSAD schemes perform anomaly inference only by pairwise feature matching, ignoring structural dependencies and global consistency. To further advance VLM-based FSAD, we propose a Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework. It reformulates FSAD as a high-order inference problem of visual-semantic relations, by jointly modeling visual regions and semantic concepts in a unified hypergraph. Experimental comparisons verify the effectiveness and advantages of H2VLR. It could often achieve state-of-the-art (SOTA) performance on representative industrial and medical benchmarks. Our code will be released upon acceptance.
[92] Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Ziyang Luo,Nian Liu,Junwei Han
Main category: cs.CV
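TL;DR: This paper proposes Chain of Modality (CoM), a framework that dynamically selects the input modality topology and uses a dual-path cognitive execution mechanism to address the perceptual fragility that static fusion structures cause in existing Omni-MLLMs, significantly improving the robustness and generalization of multimodal reasoning.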
Details
Motivation: Although Omni-MLLMs aim for unified multi-sensory modeling, in practice they are often outperformed by unimodal baselines; the root cause is their static fusion structure (such as sequential or interleaved input), which introduces positional bias and alignment traps that damage attention. Method: Proposes Chain of Modality (CoM), an agentic framework that switches dynamically among parallel, sequential, and interleaved fusion pathways to remove structural bias, and splits cognitive execution into two paths, "Direct-Decide" (direct perception) and "Reason-Decide" (analytical auditing); it supports training-free use or fine-tuning with a small amount of supervision. Result: CoM achieves robust, consistent generalization across diverse benchmarks, clearly outperforming statically fused Omni-MLLMs in both training-free and data-efficient settings. Conclusion: Dynamic, task-adaptive modality fusion and separation of cognitive paths are key to improving the real multimodal capability of Omni-MLLMs; CoM offers a new paradigm for more robust and interpretable multimodal large models. Abstract: Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined ``Direct-Decide'' path for direct perception and a structured ``Reason-Decide'' path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.
[93] FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking
Jinlin You,Muyu Li,Xudong Zhao
Main category: cs.CV
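TL;DR: This paper proposes FreqTrack, a frequency-aware RGB-Event (RGBE) tracking framework that builds complementary inter-modal correlations through frequency-domain transforms to improve tracking in complex dynamic scenes.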
Details
Motivation: Existing single-modal RGB trackers are limited in complex dynamic scenes; event sensors are promising, but current RGB-event fusion methods mostly work in the spatial domain and fail to exploit the temporal response and high-frequency characteristics of event data. Method: Proposes the FreqTrack framework, comprising a Spectral Enhancement Transformer (SET) layer (using multi-head dynamic Fourier filtering) and a Wavelet Edge Refinement (WER) module (extracting multi-scale edge structures with learnable wavelet transforms). Result: Experiments on the COESOT and FE108 datasets show leading performance, in particular 76.6% precision on the COESOT benchmark. Conclusion: Frequency-domain modeling improves the robustness and precision of RGBE tracking, especially in high-speed and low-light scenarios. Abstract: Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive performance, particularly attaining leading precision of 76.6\% on the COESOT benchmark, validating the effectiveness of frequency-domain modeling for RGBE tracking.
[94] Design and Validation of a Low-Cost Smartphone Based Fluorescence Detection Platform Compared with Conventional Microplate Readers
Zhendong Cao,Katrina G. Salvante,Ash Parameswaran,Pablo A. Nepomnaschy,Hongji Dai
Main category: cs.CV
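TL;DR: This paper presents a low-cost fluorescence-based optical detection system that uses a smartphone camera in place of an expensive conventional microplate reader (such as the Perkin Elmer Victor); by analyzing the relationship between the sample's image color in RGB color space and the molar concentration of the fluorescent substance, it detects microorganisms and molecules in diluted samples.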
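One simple way to relate image color to concentration is a calibration line fitted by least squares, sketched below. This is an assumption for illustration only: the entry does not say which color channel, color feature, or fitting model the authors use, a linear response is only expected to be reasonable for dilute samples, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_concentration(signal, slope, intercept):
    """Invert the calibration line to map a color signal back to a concentration."""
    return (signal - intercept) / slope


# Made-up calibration points (NOT from the paper): known concentrations of
# standards and the mean channel intensity measured in each well's image.
concentrations = [0.0, 1.0, 2.0, 4.0]
intensities = [10.0, 30.0, 50.0, 90.0]
slope, intercept = fit_line(concentrations, intensities)
unknown = estimate_concentration(70.0, slope, intercept)
```

In practice the calibration would be repeated for each phone and lighting setup, since camera response and exposure settings change the mapping.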
Details
Motivation: Lower the cost of fluorescence detection equipment and improve portability and accessibility by avoiding expensive optical components (such as exciter filters, barrier filters, and photomultipliers). Method: Designs a setup compatible with standard 96-well plates, uses a smartphone camera as the fluorescence detector, and establishes a quantitative relationship between the RGB image color of the sample and the molar concentration of the fluorescent substance. Result: A fluorescence detection system that needs no high-end optical components is built, showing that a phone camera is feasible for this task. Conclusion: The low-cost system offers a workable option for rapid biological testing in the field and in resource-limited settings, with potential for practical use. Abstract: A low cost fluorescence-based optical system is developed for detecting the presence of certain microorganisms and molecules within a diluted sample. A specifically designed device setup compatible with conventional 96 well plates is chosen to create an ideal environment in which a smart phone camera can be used as the optical detector. In comparison with conventional microplate reading machines such as the Perkin Elmer Victor, the device presented in this paper is not equipped with expensive elements such as an exciter filter, barrier filter, and photomultiplier; instead, a phone camera is all that is needed to detect fluorescence within the sample. The strategy is to determine the relationship between the image color of the sample in RGB color space and the molar concentration of the fluorescent specimen in that sample. This manuscript is a preprint version of work related to a publication in IEEE. The final version may differ from this manuscript.
[95] WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms
Yucheng Pan,Heping Li,Zhangle Liu,Sajid Hussain,Bin Pan
Main category: cs.CV
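TL;DR: This paper proposes WILD-SAM, a framework that adapts the SAM model with a Phase-Aware Mixture-of-Experts Adapter and a Wavelet-Guided Subband Enhancement strategy to detect slow-moving landslides in wrapped-phase interferograms with high precision.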
Details
Motivation: Detecting slow-moving landslides directly from wrapped InSAR interferograms is vital for geohazard monitoring, but it is challenged by severe phase ambiguity and complex coherence noise; SAM is hard to transfer directly because of the spectral domain shift. Method: Proposes WILD-SAM: a Phase-Aware Mixture-of-Experts (PA-MoE) Adapter is embedded in the frozen encoder to align spectral distributions, and a Wavelet-Guided Subband Enhancement (WGSE) strategy generates frequency-aware dense prompts, using wavelet transforms to disentangle high-frequency subbands and enhance directional phase textures. Result: Experiments on the ISSLIDE and ISSLIDE+ benchmarks show that WILD-SAM clearly outperforms existing methods in target completeness and contour fidelity, reaching SOTA performance. Conclusion: WILD-SAM bridges the spectral gap between general-purpose vision models and InSAR phase data, providing an efficient and robust approach to accurate landslide segmentation on wrapped-phase images. Abstract: Detecting slow-moving landslides directly from wrapped Interferometric Synthetic Aperture Radar (InSAR) interferograms is crucial for efficient geohazard monitoring, yet it remains fundamentally challenged by severe phase ambiguity and complex coherence noise. While the Segment Anything Model (SAM) offers a powerful foundation for segmentation, its direct transfer to wrapped phase data is hindered by a profound spectral domain shift, which suppresses the high-frequency fringes essential for boundary delineation. To bridge this gap, we propose WILD-SAM, a novel parameter-efficient fine-tuning framework specifically designed to adapt SAM for high-precision landslide detection on wrapped interferograms. Specifically, the architecture integrates a Phase-Aware Mixture-of-Experts (PA-MoE) Adapter into the frozen encoder to align spectral distributions and introduces a Wavelet-Guided Subband Enhancement (WGSE) strategy to generate frequency-aware dense prompts. The PA-MoE Adapter exploits a dynamic routing mechanism across heterogeneous convolutional experts to adaptively aggregate multi-scale spectral-textural priors, effectively aligning the distribution discrepancy between natural images and interferometric phase data. Meanwhile, the WGSE strategy leverages discrete wavelet transforms to explicitly disentangle high-frequency subbands and refine directional phase textures, injecting these structural cues as dense prompts to ensure topological integrity along sharp landslide boundaries. 
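To make the idea of disentangling high-frequency subbands concrete, the sketch below performs a one-level 2D Haar transform on a small image stored as a list of rows. The Haar wavelet and the function name are assumptions chosen for simplicity; the abstract does not specify the wavelet family, the number of levels, or how WGSE refines the subbands.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of an image with even height and width.

    Returns four half-resolution subbands:
      approx - low-pass average of each 2x2 block
      horiz  - top-minus-bottom detail (responds to horizontal edges)
      vert   - left-minus-right detail (responds to vertical edges)
      diag   - diagonal detail
    """
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, len(img), 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, len(img[0]), 2):
            tl, tr = img[r][c], img[r][c + 1]          # top-left, top-right
            bl, br = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            a_row.append((tl + tr + bl + br) / 2)
            h_row.append((tl + tr - bl - br) / 2)
            v_row.append((tl - tr + bl - br) / 2)
            d_row.append((tl - tr - bl + br) / 2)
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, horiz, vert, diag
```

The three detail subbands separate edge content by orientation, which is the kind of directional high-frequency cue that fringe boundaries in an interferogram would produce.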
Extensive experiments on the ISSLIDE and ISSLIDE+ benchmarks demonstrate that WILD-SAM achieves state-of-the-art performance, significantly outperforming existing methods in both target completeness and contour fidelity.
[96] Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars
Yicheng Gong,Jiawei Zhang,Liqiang Liu,Yanwen Wang,Lei Chu,Jiahao Li,Hao Pan,Hao Zhu,Yan Lu
Main category: cs.CV
TL;DR: 本文提出了一种在前馈式单图像3D头像重建中显式控制情绪的框架,通过双路径调制机制将情绪作为独立可控信号注入现有架构,在保持重建质量的同时实现情绪迁移、解耦操控与平滑插值。
Details
Motivation: 现有方法中情绪常与几何或外观隐式耦合,缺乏对情绪的显式、一致且跨身份的独立控制能力。 Method: 提出双路径调制机制:几何调制在参数空间中进行情绪条件归一化,解耦情绪与语音驱动形变;外观调制捕捉身份感知的情绪相关视觉线索;并构建时序对齐、情绪一致的多身份数据集以支持训练。 Result: 该框架可集成到多种SOTA骨干网络中,在保持高保真重建与重演效果的同时,实现了可控情绪迁移、情绪-身份解耦操作及平滑情绪插值。 Conclusion: 本工作推进了具表现力与可扩展性的3D头像建模,确立了情绪作为第一类控制信号的可行性与有效性。 Abstract: We present a framework for explicit emotion control in feed-forward, single-image 3D head avatar reconstruction. Unlike existing pipelines where emotion is implicitly entangled with geometry or appearance, we treat emotion as a first-class control signal that can be manipulated independently and consistently across identities. Our method injects emotion into existing feed-forward architectures via a dual-path modulation mechanism without modifying their core design. Geometry modulation performs emotion-conditioned normalization in the original parametric space, disentangling emotional state from speech-driven articulation, while appearance modulation captures identity-aware, emotion-dependent visual cues beyond geometry. To enable learning under this setting, we construct a time-synchronized, emotion-consistent multi-identity dataset by transferring aligned emotional dynamics across identities. Integrated into multiple state-of-the-art backbones, our framework preserves reconstruction and reenactment fidelity while enabling controllable emotion transfer, disentangled manipulation, and smooth emotion interpolation, advancing expressive and scalable 3D head avatars.[97] Controllable Video Object Insertion via Multiview Priors
Xia Qi,Peishan Cong,Yichen Yao,Ziyi Wang,Yaoqin Ye,Yuexin Ma
Main category: cs.CV
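TL;DR: This paper proposes a new video object insertion method that uses multi-view object priors, a dual-path view-consistent conditioning mechanism, a quality-aware weighting mechanism, and an Integration-Aware Consistency Module to address appearance inconsistency, occlusion handling, spatial alignment, and temporal coherence.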
Details
Motivation: When inserting new objects into existing videos, current video generation methods struggle to keep object appearance consistent, spatially aligned, and temporally coherent; occlusion and viewpoint changes in dynamic environments are especially problematic. Method: Proposes a video object insertion framework that incorporates multi-view object priors: 2D reference images are lifted into multi-view representations; a dual-path view-consistent conditioning mechanism stabilizes identity guidance; a quality-aware weighting mechanism handles noisy inputs; an Integration-Aware Consistency Module resolves occlusion and boundary artifacts and maintains spatio-temporal continuity. Result: Experiments show that the method significantly improves the quality of video object insertion, with better appearance stability, spatial realism, and temporal continuity. Conclusion: The method overcomes key challenges of video object insertion in dynamic environments, offering a new approach and a practical framework for high-quality, robust video editing. Abstract: Video object insertion is a critical task for dynamically inserting new objects into existing environments. Previous video generation methods focus primarily on synthesizing entire scenes while struggling with ensuring consistent object appearance, spatial alignment, and temporal coherence when inserting objects into existing videos. In this paper, we propose a novel solution for Video Object Insertion, which integrates multi-view object priors to address the common challenges of appearance inconsistency and occlusion handling in dynamic environments. By lifting 2D reference images into multi-view representations and leveraging a dual-path view-consistent conditioning mechanism, our framework ensures stable identity guidance and robust integration across diverse viewpoints. A quality-aware weighting mechanism is also employed to adaptively handle noisy or imperfect inputs. Additionally, we introduce an Integration-Aware Consistency Module that guarantees spatial realism, effectively resolving occlusion and boundary artifacts while maintaining temporal continuity across frames. Experimental results show that our solution significantly improves the quality of video object insertion, providing stable and realistic integration.
[98] The Fourth Challenge on Image Super-Resolution ($\times$4) at NTIRE 2026: Benchmark Results and Method Overview
Zheng Chen,Kai Liu,Jingkai Wang,Xianglong Yan,Jianze Li,Ziqing Zhang,Jue Gong,Jiatong Li,Lei Sun,Xiaoyang Liu,Radu Timofte,Yulun Zhang,Jihye Park,Yoonjin Im,Hyungju Chun,Hyunhee Park,MinKyu Park,Zheng Xie,Xiangyu Kong,Weijun Yuan,Zhan Li,Qiurong Song,Luen Zhu,Fengkai Zhang,Xinzhe Zhu,Junyang Chen,Congyu Wang,Yixin Yang,Zhaorun Zhou,Jiangxin Dong,Jinshan Pan,Shengwei Wang,Jiajie Ou,Baiang Li,Sizhuo Ma,Qiang Gao,Jusheng Zhang,Jian Wang,Keze Wang,Yijiao Liu,Yingsi Chen,Hui Li,Yu Wang,Congchao Zhu,Saeed Ahmad,Ik Hyun Lee,Jun Young Park,Ji Hwan Yoon,Kainan Yan,Zian Wang,Weibo Wang,Shihao Zou,Chao Dong,Wei Zhou,Linfeng Li,Jaeseong Lee,Jaeho Chae,Jinwoo Kim,Seonjoo Kim,Yucong Hong,Zhenming Yan,Junye Chen,Ruize Han,Song Wang,Yuxuan Jiang,Chengxi Zeng,Tianhao Peng,Fan Zhang,David Bull,Tongyao Mu,Qiong Cao,Yifan Wang,Youwei Pan,Leilei Cao,Xiaoping Peng,Wei Deng,Yifei Chen,Wenbo Xiong,Xian Hu,Yuxin Zhang,Xiaoyun Cheng,Yang Ji,Zonghao Chen,Zhihao Xue,Junqin Hu,Nihal Kumar,Snehal Singh Tomar,Klaus Mueller,Surya Vashisth,Prateek Shaily,Jayant Kumar,Hardik Sharma,Ashish Negi,Sachin Chaudhary,Akshay Dudhane,Praful Hambarde,Amit Shukla,Shijun Shi,Jiangning Zhang,Yong Liu,Kai Hu,Jing Xu,Xianfang Zeng,Amitesh M,Hariharan S,Chia-Ming Lee,Yu-Fan Lin,Chih-Chung Hsu,Nishalini K,Sreenath K A,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Shuling Zheng,Zhiheng Fu,Feng Zhang,Zhanglu Chen,Boyang Yao,Nikhil Pathak,Aagam Jain,Milan Kumar,Kishor Upla,Vivek Chavda,Sarang N S,Raghavendra Ramachandra,Zhipeng Zhang,Qi Wang,Shiyu Wang,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi,Yuqi Li,Chuanguang Yang,Weilun Feng,Zhuzhi Hong,Hao Wu,Junming Liu,Yingli Tian,Amish Bhushan Kulkarni,Tejas R R Shet,Saakshi M Vernekar,Nikhil Akalwadi,Kaushik Mallibhat,Ramesh Ashok Tabib,Uma Mudenagudi,Yuwen Pan,Tianrun Chen,Deyi Ji,Qi Zhu,Lanyun Zhu,Heyan Zhangyi
Main category: cs.CV
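TL;DR: This paper presents the NTIRE 2026 image super-resolution (×4) challenge, which has a restoration track and a perceptual track and aims to advance super-resolution techniques and provide a unified benchmark.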
Details
Motivation: Reflect the evolving objectives of image super-resolution and drive progress in both pixel fidelity and visual realism. Method: Organizes the NTIRE 2026 super-resolution challenge with a restoration track (ranked by PSNR) and a perceptual track (evaluated with a perceptual score), generates LR images by bicubic ×4 downsampling, and analyzes the results of the 31 teams with valid entries out of 194 registrants. Result: 194 participants registered and 31 teams submitted valid results; the report summarizes the challenge design, datasets, evaluation protocol, and team methods, and offers insights into current progress and future directions. Conclusion: The challenge establishes a unified benchmark and promotes joint progress in fidelity and perceptual quality for image super-resolution. Abstract: This paper presents the NTIRE 2026 image super-resolution ($\times$4) challenge, one of the associated competitions of the NTIRE 2026 Workshop at CVPR 2026. The challenge aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective super-resolution solutions and analyze recent advances in the field. To reflect the evolving objectives of image super-resolution, the challenge includes two tracks: (1) a restoration track, which emphasizes pixel-wise fidelity and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and evaluates results using a perceptual score. A total of 194 participants registered for the challenge, with 31 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and methods of participating teams. The challenge provides a unified benchmark and offers insights into current progress and future directions in image super-resolution.
[99] DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration
Zheng Chen,Bowen Chai,Rongjun Gao,Mingtao Nie,Xi Li,Bingnan Duan,Jianping Fang,Xiaohong Liu,Linghe Kong,Yulun Zhang
Main category: cs.CV
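TL;DR: This paper proposes DVFace, a one-step diffusion framework for real-world video face restoration that uses a spatio-temporal dual-codebook design and an asymmetric spatio-temporal fusion module to achieve high-quality, temporally stable, identity-preserving restoration.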
Details
Motivation: Existing diffusion-based methods rely on generic diffusion priors and multi-step sampling, which limits facial adaptation and inference efficiency; this motivates exploring one-step diffusion to improve both quality and efficiency. Method: The DVFace framework comprises a spatio-temporal dual-codebook design that extracts spatial and temporal facial priors, and an asymmetric spatio-temporal fusion module that injects these priors into the diffusion backbone. Result: Evaluation on multiple benchmarks shows that DVFace outperforms recent methods in restoration quality, temporal consistency, and identity preservation. Conclusion: DVFace achieves high-quality video face restoration with one-step diffusion and shows good generalization and practicality in real-world scenarios. Abstract: Video face restoration aims to enhance degraded face videos into high-quality results with realistic facial details, stable identity, and temporal coherence. Recent diffusion-based methods have brought strong generative priors to restoration and enabled more realistic detail synthesis. However, existing approaches for face videos still rely heavily on generic diffusion priors and multi-step sampling, which limit both facial adaptation and inference efficiency. These limitations motivate the use of one-step diffusion for video face restoration, yet achieving faithful facial recovery alongside temporally stable outputs remains challenging. In this paper, we propose DVFace, a one-step diffusion framework for real-world video face restoration. Specifically, we introduce a spatio-temporal dual-codebook design to extract complementary spatial and temporal facial priors from degraded videos. We further propose an asymmetric spatio-temporal fusion module to inject these priors into the diffusion backbone according to their distinct roles. Evaluation on various benchmarks shows that DVFace delivers superior restoration quality, temporal consistency, and identity preservation compared to recent methods. Code: https://github.com/zhengchen1999/DVFace.
[100] Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Mingqian Ji,Shanshan Zhang,Jian Yang
Main category: cs.CV
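TL;DR: This paper proposes SEPatch3D, a framework that dynamically adjusts patch sizes, selects informative patches, and enhances features across granularities, substantially speeding up inference of ViT-based sparse multi-view 3D detectors while preserving 3D detection accuracy.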
Details
Motivation: ViT-based sparse multi-view 3D detectors are accurate but suffer from high inference latency due to heavy token processing; existing token-compression methods (pruning, merging, patch-size enlargement) tend to discard background cues, disrupt contextual consistency, and lose fine-grained semantics, which hurts detection performance. Method: SEPatch3D has three core modules: 1) Spatiotemporal-aware Patch Size Selection (SPSS), which dynamically assigns small or large patches according to scene content (nearby objects vs. background-dominated); 2) Informative Patch Selection (IPS), which selects the informative patches that need refinement; 3) Cross-Granularity Feature Enhancement (CGFE), which injects fine-grained details into coarse patches to enrich semantics. Result: On the nuScenes and Argoverse 2 validation sets, inference is up to 57% faster than the StreamPETR baseline and 20% more efficient than the state-of-the-art ToC3D-faster, with comparable detection accuracy. Conclusion: Through a semantics-aware dynamic patch strategy and cross-granularity feature enhancement, SEPatch3D relieves the computational bottleneck of ViT-based 3D detectors without sacrificing accuracy, offering a new direction for real-time multi-view 3D perception. Abstract: Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. 
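As a rough illustration of why the patch-size choice described above affects cost, the sketch below counts ViT tokens for a given input resolution and picks a patch size from a scalar scene score. The function names, the `nearby_object_score` input, the 0.5 threshold, the 16/32 patch sizes, and the 256×704 resolution are all hypothetical choices for illustration and are not taken from SEPatch3D.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for one image."""
    return (height // patch) * (width // patch)


def select_patch_size(nearby_object_score: float,
                      threshold: float = 0.5,
                      small: int = 16,
                      large: int = 32) -> int:
    """Hypothetical rule: small patches when nearby objects dominate,
    large patches for background-dominated scenes."""
    return small if nearby_object_score >= threshold else large


if __name__ == "__main__":
    h, w = 256, 704  # assumed input resolution, for illustration only
    for score in (0.8, 0.1):
        p = select_patch_size(score)
        print(f"score={score} patch={p} tokens={num_tokens(h, w, p)}")
```

In this simplified picture, doubling the patch side length cuts the token count by a factor of four, which is where the compute saving for background-dominated scenes would come from.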
Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at https://github.com/Mingqj/SEPatch3D.
[101] Learning Adaptive Reasoning Paths for Efficient Visual Reasoning
Yixu Huang,Tinghui Zhu,Muhao Chen
Main category: cs.CV
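TL;DR: This paper proposes AVR, a framework that adaptively selects a reasoning format (Full Format, Perception-Only Format, or Direct Answer) to cut redundant reasoning paths in visual reasoning models, reducing token usage by 50-90% while maintaining accuracy.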
Details
Motivation: Visual reasoning models often overthink because of redundant reasoning paths, producing unnecessarily long reasoning chains, especially for visual questions that do not need complex reasoning. Method: AVR, an adaptive visual reasoning framework, decomposes visual reasoning into visual perception, logical reasoning, and answer application, and supports three response formats; it is trained with FS-GRPO, an adaptation of GRPO that encourages choosing the most efficient format while preserving correctness. Result: On multiple vision-language benchmarks, AVR reduces token usage by 50-90% while maintaining overall accuracy, with the strongest results on perception-intensive tasks. Conclusion: AVR shows that adaptive visual reasoning can effectively mitigate overthinking in visual reasoning models and improve reasoning efficiency. Abstract: Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for any task. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50-90% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.
[102] Deepfake Detection Generalization with Diffusion Noise
Hongyuan Qi,Wenjin Hou,Hehe Fan,Jun Xiao
Main category: cs.CV
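TL;DR: This paper proposes Attention-guided Noise Learning (ANL), a framework based on diffusion noise characteristics that improves the generalization of deepfake detectors to novel synthetic images, especially those generated by diffusion models. By using the denoising process of a pre-trained diffusion model together with an attention mechanism, ANL captures global distribution discrepancies and substantially improves cross-model generalization without adding inference overhead.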
Details
Motivation: Existing deepfake detectors generalize poorly to the high-fidelity forgeries produced by emerging diffusion models, so a method that adapts to unseen forgery types is needed. Method: The Attention-guided Noise Learning (ANL) framework embeds a frozen pre-trained diffusion model in the detection pipeline: the detector predicts the noise contained in the input image at a given diffusion step, and the predicted noise is used to produce an attention map that steers the network toward globally distributed rather than local anomalies, providing regularization and robust feature learning. Result: ANL significantly outperforms existing methods on multiple benchmarks and reaches state-of-the-art accuracy on diffusion-generated forgeries; ACC/AP improve substantially on unseen forgery models, with no additional inference overhead. Conclusion: Diffusion noise is a strong and generalizable signal; ANL exploits it to improve detectors' adaptation to unknown synthesis techniques, offering a new paradigm for general forgery detection. Abstract: Deepfake detectors face growing challenges in generalization as new image synthesis techniques emerge. In particular, deepfakes generated by diffusion models are highly photorealistic and often evade detectors trained on GAN-based forgeries. This paper addresses the generalization problem in deepfake detection by leveraging diffusion noise characteristics. We propose an Attention-guided Noise Learning (ANL) framework that integrates a pre-trained diffusion model into the deepfake detection pipeline to guide the learning of more robust features. Specifically, our method uses the diffusion model's denoising process to expose subtle artifacts: the detector is trained to predict the noise contained in an input image at a given diffusion step, forcing it to capture discrepancies between real and synthetic images, while an attention-guided mechanism derived from the predicted noise is introduced to encourage the model to focus on globally distributed discrepancies rather than local patterns. By harnessing the frozen diffusion model's learned distribution of natural images, the ANL method acts as a form of regularization, improving the detector's generalization to unseen forgery types. Extensive experiments demonstrate that ANL significantly outperforms existing methods on multiple benchmarks, achieving state-of-the-art accuracy in detecting diffusion-generated deepfakes. Notably, the proposed framework boosts generalization performance (e.g., improving ACC/AP by a substantial margin on unseen models) without introducing additional overhead during inference. 
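The noise-prediction objective mentioned above can be pictured with a minimal sketch. It assumes the standard DDPM-style forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, with a linear beta schedule; the abstract does not state which schedule or parameterization ANL uses, so the schedule constants and all function names here are assumptions, and the attention-guided part is not modeled.

```python
import math
import random


def linear_alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t, linear beta schedule (assumed)."""
    alpha_bar = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        alpha_bar *= 1.0 - beta
    return alpha_bar


def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: mix the clean signal x0 with Gaussian noise eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def noise_mse(predicted, target):
    """Mean squared error between predicted and true noise (the training target)."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [0.1, -0.4, 0.7]                     # toy stand-in for image pixels
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # noise the detector should predict
    x_t = add_noise(x0, eps, linear_alpha_bar(200))
    # A real detector would predict eps from x_t; zeros serve as a dummy prediction.
    print(noise_mse([0.0] * len(eps), eps))
```

A detector trained this way is scored on how well it recovers `eps` from `x_t`; how ANL turns the predicted noise into an attention map is specific to the paper and is left out here.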
Our results highlight that diffusion noise provides a powerful signal for generalizable deepfake detection.
[103] M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection
Haotian Wu,Yue Cheng,Shan Bian
Main category: cs.CV
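TL;DR: This paper proposes M3D-Net, a multi-modal 3D facial feature reconstruction network for deepfake detection that improves detection accuracy and robustness through self-supervised 3D face reconstruction and multi-modal feature fusion.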
Details
Motivation: Most existing deepfake detection methods reconstruct facial attributes in isolation and fail to exploit the complementarity of multi-modal features, and they struggle with the cybersecurity and information-authenticity threats posed by increasingly realistic forgeries. Method: M3D-Net is an end-to-end dual-stream architecture comprising a self-supervised 3D facial reconstruction module (recovering geometry and reflectance), a 3D Feature Pre-fusion Module (PFM), and a Multi-modal Fusion Module (MFM) that fuses RGB and 3D-reconstructed features with attention mechanisms to strengthen discrimination. Result: Experiments on multiple public datasets show that the method reaches state-of-the-art detection accuracy, robustness, and cross-scenario generalization, significantly outperforming existing methods. Conclusion: By jointly modeling RGB and 3D facial features, M3D-Net improves deepfake detection performance and offers a new direction for multi-modal representation learning in forgery detection. Abstract: With the rapid advancement of deep learning in image generation, facial forgery techniques have achieved unprecedented realism, posing serious threats to cybersecurity and information authenticity. Most existing deepfake detection approaches rely on the reconstruction of isolated facial attributes without fully exploiting the complementary nature of multi-modal feature representations. To address these challenges, this paper proposes a novel Multi-Modal 3D Facial Feature Reconstruction Network (M3D-Net) for deepfake detection. Our method leverages an end-to-end dual-stream architecture that reconstructs fine-grained facial geometry and reflectance properties from single-view RGB images via a self-supervised 3D facial reconstruction module. The network further enhances detection performance through a 3D Feature Pre-fusion Module (PFM), which adaptively adjusts multi-scale features, and a Multi-modal Fusion Module (MFM) that effectively integrates RGB and 3D-reconstructed features using attention mechanisms. Extensive experiments on multiple public datasets demonstrate that our approach achieves state-of-the-art performance in terms of detection accuracy and robustness, significantly outperforming existing methods while exhibiting strong generalization across diverse scenarios.
[104] TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
Xiangyu Liu,Feng Gao,Xiaomei Zhang,Yong Zhang,Xiaoming Wei,Zhen Lei,Xiangyu Zhu
Main category: cs.CV
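TL;DR: This paper proposes TurboTalk, a two-stage progressive distillation framework that compresses a multi-step audio-driven video diffusion model into a single-step generator, achieving a 120× inference speedup while maintaining high quality.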
Details
Motivation: Existing audio-driven video digital human generation models rely on multi-step denoising, which is computationally expensive and hard to deploy; one-step distillation is fast but unstable to train. Method: A two-stage progressive distillation: the first stage uses Distribution Matching Distillation to obtain a stable 4-step student; the second stage uses adversarial distillation with a progressive timestep sampling strategy and a self-compare adversarial objective to reduce the student step by step from 4 steps to 1. Result: Single-step talking-avatar video generation with a 120× inference speedup while maintaining high generation quality. Conclusion: Through progressive distillation, TurboTalk balances generation quality and inference efficiency, offering a feasible route to real-time audio-driven video generation. Abstract: Existing audio-driven video digital human generation models rely on multi-step denoising, resulting in substantial computational overhead that severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that effectively compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective that provides an intermediate adversarial reference that stabilizes progressive distillation. Our method achieves single-step generation of video talking avatars, boosting inference speed by 120 times while maintaining high generation quality.
[105] MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models
Ruiqi Wang,Qi Yu,Jie Ma,Hanlin Wu
Main category: cs.CV
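TL;DR: This paper proposes MapSR, a prompt-driven framework for land-cover map super-resolution that uses low-resolution labels only once to extract class prompts, after which high-resolution mapping requires no training; it achieves strong performance on the Chesapeake Bay dataset.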
Details
Motivation: High-resolution land-cover mapping is limited by the high cost of dense high-resolution annotation, and existing weakly supervised methods depend on retraining models at substantial computational cost. Method: MapSR uses frozen vision foundation model features and a lightweight linear probe to extract class prompts from low-resolution labels in a single pass; high-resolution prediction and refinement are then training-free, via cosine-similarity matching and graph-based spatial propagation. Result: 59.64% mIoU on the Chesapeake Bay dataset without any high-resolution labels; trainable parameters are reduced by four orders of magnitude and training time drops from hours to minutes. Conclusion: MapSR enables efficient, scalable high-resolution land-cover mapping and substantially reduces dependence on high-resolution annotation and compute. Abstract: High-resolution (HR) land-cover mapping is often constrained by the high cost of dense HR annotations. We revisit this problem from the perspective of map super-resolution, which enhances coarse low-resolution (LR) land-cover products into HR maps at the resolution of the input imagery. Existing weakly supervised methods can leverage LR labels, but they typically use them to retrain dense predictors with substantial computational cost. We propose MapSR, a prompt-driven framework that decouples supervision from model training. MapSR uses LR labels once to extract class prompts from frozen vision foundation model features through a lightweight linear probe, after which HR mapping proceeds via training-free metric inference and graph-based prediction refinement. Specifically, class prompts are estimated by aggregating high-confidence HR features identified by the linear probe, and HR predictions are obtained by cosine-similarity matching followed by graph-based propagation for spatial refinement. Experiments on the Chesapeake Bay dataset show that MapSR achieves 59.64% mIoU without any HR labels, remaining competitive with the strongest weakly supervised baseline and surpassing a fully supervised baseline. Notably, MapSR reduces trainable parameters by four orders of magnitude and shortens training time from hours to minutes, enabling scalable HR mapping under limited annotation and compute budgets. The code is available at https://github.com/rikirikirikiriki/MapSR.
[106] Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models
Amir El-Ghoussani,Marc Hölle,Gustavo Carneiro,Vasileios Belagiannis
Main category: cs.CV
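TL;DR: This paper proposes Masked Logit Nudging (MLN) for prompt-guided image editing in visual autoregressive models. Using source-image token maps and semantic-trajectory alignment, it performs efficient, high-quality edits while leaving unrelated regions unchanged, and reaches state-of-the-art performance on several benchmarks.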
Details
Motivation: To address prompt-guided image editing in visual autoregressive models: given a source image and a target text prompt, modify only the regions related to the prompt while preserving everything else. Method: Masked Logit Nudging (MLN): 1) source-image token maps are converted to logits, and the model's logits are nudged along a semantic trajectory defined by the source and target prompts; 2) spatial masks built from cross-attention differences between the source and edited prompts restrict where edits apply; 3) a refinement step corrects quantization errors and improves reconstruction quality. Result: Best image editing performance on the PIE benchmark at 512px and 1024px; more faithful reconstructions on COCO (512px) and OpenImages (1024px); outperforms VAR-related methods and matches or exceeds diffusion models while being much faster at inference. Conclusion: MLN is an efficient, precise, and general prompt-driven editing method that balances edit fidelity, region control, and computational cost, offering a new paradigm for editing with visual autoregressive models. Abstract: We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions which are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct quantization errors and improve reconstruction quality. Our approach achieves the best image editing performance on the PIE benchmark at 512px and 1024px resolutions. Beyond editing, our method delivers faithful reconstructions and outperforms previous methods on COCO at 512px and OpenImages at 1024px. Overall, our method outperforms VAR-related approaches and achieves comparable or even better performance than diffusion models, while being much faster. Code is available at 'https://github.com/AmirMaEl/MLN'.
[107] Towards Design Compositing
Abhinav Mahajan,Abhikhya Tripathy,Sudeeksha Reddy Pala,Vaibhav Methi,K J Joseph,Balaji Vasan Srinivasan
Main category: cs.CV
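TL;DR: This paper proposes GIST, a training-free, identity-preserving image compositor that improves the stylistic consistency and harmony of visual elements from multiple sources in graphic design; it plugs into existing design-generation pipelines and markedly improves aesthetic quality.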
Details
Motivation: Existing methods assume that the input multimodal design elements (images, text, logos) are already stylistically consistent, but in practice they come from different sources and are visually mismatched, so identity-preserving stylization and compositing are needed. Method: GIST is a training-free, identity-preserving image compositor that sits between layout prediction and typography generation and can be plugged into any existing components-to-design or design-refining pipeline (such as LaDeCo and Design-o-meter). Result: GIST significantly improves visual harmony and aesthetic quality on two different baselines, as validated by LLaVA-OV and GPT-4V in aspect-wise ratings and pairwise preference tests against naive pasting. Conclusion: Identity-preserving stylized compositing is a key step toward truly harmonized components-to-design pipelines, and GIST offers a general, lightweight, and effective solution. Abstract: Graphic design creation involves harmoniously assembling multimodal components such as images, text, logos, and other visual assets collected from diverse sources, into a visually-appealing and cohesive design. Recent methods have largely focused on layout prediction or complementary element generation, while retaining input elements exactly, implicitly assuming that provided components are already stylistically harmonious. In practice, inputs often come from disparate sources and exhibit visual mismatch, making this assumption limiting. We argue that identity-preserving stylization and compositing of input elements is a critical missing ingredient for truly harmonized components-to-design pipelines. To this end, we propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation, and can be plugged into any existing components-to-design or design-refining pipeline without modification. We demonstrate this by integrating GIST with two substantially different existing methods, LaDeCo and Design-o-meter. GIST shows significant improvements in visual harmony and aesthetic quality across both pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting. Project Page: abhinav-mahajan10.github.io/GIST/.
[108] Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening
Junfeng Li,Wenyang Zhou,Xueheng Li,Xuanhua He,Jianhou Gan,Wenqi Ren
Main category: cs.CV
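TL;DR: This paper proposes a multigrain-aware semantic prototype scanning paradigm for pan-sharpening, built on a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering; a semantics-driven scanning strategy, tri-token prompt learning, and an invertible Q-Shift operation together improve image reconstruction quality.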
Details
Motivation: The bidirectional raster scanning of conventional RWKV is semantic-agnostic and prone to positional bias, and existing methods struggle to combine global semantic consistency with recovery of local high-frequency detail. Method: 1) Multigrain-aware semantic prototype scanning: locality-sensitive hashing groups semantically related regions and builds multi-grain prototypes for context-aware token reordering; 2) Tri-token prompt learning: a global token, cluster-derived prototype tokens, and a learnable register token jointly guide RWKV modeling; 3) Invertible Q-Shift: center difference convolution on the value pathway injects high-frequency information, and an invertible multi-scale Q-shift provides lossless feature transformation. Result: Experiments show that the method outperforms existing state-of-the-art methods on several standard datasets and markedly improves pan-sharpening quality. Conclusion: A semantics-driven scanning strategy combined with lightweight, efficient components (tri-token prompting and invertible Q-Shift) strengthens RWKV modeling for pan-sharpening and offers a new direction for remote-sensing image fusion. Abstract: In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering. Specifically, our method contains three key components: 1) Multigrain-aware Semantic Prototype Scanning. Although RWKV offers an efficient linear-complexity alternative to Transformers, its conventional bidirectional raster scanning is still semantic-agnostic and prone to positional bias. To address this issue, we introduce a semantic-driven scanning strategy that leverages locality-sensitive hashing to group semantically related regions and construct multi-grain semantic prototypes, enabling context-aware token reordering and more coherent global interaction. 2) Tri-token Prompt Learning. We design a tri-token prompting mechanism consisting of a global token, cluster-derived prototype tokens, and a learnable register token. The global and prototype tokens provide complementary semantic priors for RWKV modeling, while the register token helps suppress noisy and artifact-prone intermediate representations. 3) Invertible Q-Shift. To counteract the loss of spatial details, we apply center difference convolution on the value pathway to inject high-frequency information, and introduce an invertible multi-scale Q-shift operation for efficient and lossless feature transformation without parameter-heavy receptive field expansion. Experimental results demonstrate the superiority of our method.
[109] Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
Haoyi Sun,Xiaoxiao Wang,Ning Mao,Qian Wang,Lifu Mu,Wen Zheng,Tao Wei,Wei Chen
Main category: cs.CV
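TL;DR: This paper proposes Switch-KD, a framework that unifies vision-language knowledge transfer within a shared text-probability space through visual-switch distillation and a dynamic bi-directional logits difference loss, substantially improving the multimodal performance of small VLMs.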
Details
Motivation: Existing knowledge distillation methods for VLMs do not model multimodal alignment explicitly, which leads to inconsistent cross-modal knowledge transfer and makes it hard to deploy large models efficiently in resource-constrained settings. Method: Switch-KD comprises (1) Visual-Switch Distillation, which routes the student's visual outputs through the teacher's language pathway to build cross-modal probabilistic references; and (2) the Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving distributional structure. Result: A 0.5B TinyLLaVA distills knowledge from a 3B teacher without architectural changes and improves by 3.6 points on average across 10 multimodal benchmarks. Conclusion: Switch-KD achieves more consistent and efficient multimodal knowledge transfer, offering a new paradigm for deploying lightweight VLMs. Abstract: Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
[110] CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
Inseok Jeon,Suhwan Cho,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Jungho Lee,Chaewon Park,Donghyeong Kim,Sangyoun Lee
Main category: cs.CV
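TL;DR: This paper proposes cross-modal token modulation, which strengthens the interaction between appearance and motion cues through relation transformer blocks and adds a token masking strategy to improve learning efficiency, reaching state-of-the-art performance in unsupervised video object segmentation.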
Details
Motivation: Existing two-stream architectures fuse appearance and motion cues but do not fully model the interdependencies between them. Method: A cross-modal token modulation mechanism builds dense connections between the tokens of the two modalities and uses relation transformer blocks for intra-modal and inter-modal information propagation; a token masking strategy improves learning efficiency instead of relying only on greater model complexity. Result: State-of-the-art performance on all public benchmarks, outperforming existing methods. Conclusion: Cross-modal token modulation strengthens the joint modeling of appearance and motion cues and is an effective way to improve unsupervised video object segmentation. Abstract: Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
[111] High-Speed Full-Color HDR Imaging via Unwrapping Modulo-Encoded Spike Streams
Chu Zhou,Siqi Yang,Kailong Zhang,Heng Guo,Zhaofei Yu,Boxin Shi,Imari Sato
Main category: cs.CV
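TL;DR: This paper presents a full-color, high-speed HDR imaging system based on modulo sensors. An exposure-decoupled formulation and an iteration-free unwrapping algorithm driven by diffusion priors give efficient, physics-consistent HDR reconstruction, and hardware feasibility is demonstrated at 1000 FPS.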
Details
Motivation: Conventional RGB HDR imaging faces a fundamental trade-off between multi-exposure capture (motion artifacts) and single-shot capture (information loss); existing modulo-sensor solutions are limited by iterative unwrapping overhead and low-speed grayscale acquisition. Method: An exposure-decoupled formulation of modulo imaging that supports temporally interleaved multi-frame capture; an iteration-free unwrapping algorithm that combines diffusion-based generative priors with the physical least absolute remainder property of modulo images; and a hardware prototype based on modulo-encoded spike streams. Result: Full-color HDR imaging at 1000 FPS with output data bandwidth reduced from about 20 Gbps to 6 Gbps; the algorithm is efficient and physics-consistent, and the system is robust in dynamic scenes. Conclusion: By advancing the sensing formulation and the algorithm together, the work overcomes systemic bottlenecks of modulo imaging in speed, color, and efficiency, and demonstrates its feasibility for practical dynamic HDR applications. Abstract: Conventional RGB-based high dynamic range (HDR) imaging faces a fundamental trade-off between motion artifacts in multi-exposure captures and irreversible information loss in single-shot techniques. Modulo sensors offer a promising alternative by encoding theoretically unbounded dynamic range into wrapped measurements. However, existing modulo solutions remain bottlenecked by iterative unwrapping overhead and hardware constraints limiting them to low-speed, grayscale capture. In this work, we present a complete modulo-based HDR imaging system that enables high-speed, full-color HDR acquisition by synergistically advancing both the sensing formulation and the unwrapping algorithm. At the core of our approach is an exposure-decoupled formulation of modulo imaging that allows multiple measurements to be interleaved in time, preserving a clean, observation-wise measurement model. Building upon this, we introduce an iteration-free unwrapping algorithm that integrates diffusion-based generative priors with the physical least absolute remainder property of modulo images, supporting highly efficient, physics-consistent HDR reconstruction. Finally, to validate the practical viability of our system, we demonstrate a proof-of-concept hardware implementation based on modulo-encoded spike streams. This setup preserves the native high temporal resolution of spike cameras, achieving 1000 FPS full-color imaging while reducing output data bandwidth from approximately 20 Gbps to 6 Gbps. 
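To make the idea of wrapped measurements concrete, the sketch below shows modulo encoding and a classical 1-D unwrapping rule in the spirit of Itoh's phase-unwrapping method: wrapped differences between neighbors are re-centered and summed. This is a swapped-in textbook technique, not the diffusion-prior algorithm described above. The period of 256, the requirement that neighboring samples differ by less than half a period, the assumption that the first sample is below the period, and the function names are all illustrative assumptions.

```python
def wrap(signal, period=256):
    """Modulo encoding: keep only the remainder of each sample."""
    return [x % period for x in signal]


def unwrap_1d(wrapped, period=256):
    """Recover the signal by summing re-centered neighbor differences.

    Valid only if true neighboring samples differ by less than period / 2
    and the first sample was not wrapped (both are assumptions here).
    """
    half = period // 2
    out = [wrapped[0]]
    for prev, cur in zip(wrapped, wrapped[1:]):
        diff = cur - prev
        diff = (diff + half) % period - half  # map into [-half, half)
        out.append(out[-1] + diff)
    return out


if __name__ == "__main__":
    hdr = [10, 100, 200, 300, 400, 350, 260]  # toy high-dynamic-range row
    measured = wrap(hdr)
    print(measured)
    print(unwrap_1d(measured) == hdr)
```

The smoothness assumption is exactly what breaks down at sharp HDR edges, which is one reason stronger priors are attractive for this problem. Separately, the reported bandwidth change from about 20 Gbps to 6 Gbps corresponds to roughly a 70% reduction (1 − 6/20).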
Extensive evaluations indicate that our coordinated approach successfully overcomes key systemic bottlenecks, demonstrating the feasibility of deploying modulo imaging in dynamic scenarios.
[112] Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification
Weiwei Zhuang,Wangze Xie,Qi Zhang,Xia Du,Zihan Lin,Zheng Lin,Hanlin Cai,Jizhe Zhou,Zihan Fang,Chi-man Pun,Wei Ni,Jun Luo
Main category: cs.CV
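TL;DR: This paper proposes FogFool, a physically realizable fog-based adversarial attack framework that models atmospheric patterns with Perlin noise to generate visually realistic, highly transferable, and defense-resistant adversarial examples for remote sensing image classification.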
Details
Motivation: Most existing adversarial attacks use pixel-level perturbations, ignore the atmospheric characteristics of remote sensing imagery, and do not stay effective in real-world conditions. Method: FogFool generates adversarial perturbations by iteratively optimizing Perlin-noise-based atmospheric patterns that simulate fog, using the mid-to-low-frequency content and spatial coherence of fog to embed adversarial information into structural features shared across models. Result: On two remote sensing benchmark datasets, FogFool shows superior white-box performance, black-box transferability of up to 83.74% TASR, and strong robustness to preprocessing defenses such as JPEG compression and filtering; CAM analysis shows that it induces a universal shift in model attention. Conclusion: FogFool is a practical, stealthy, and highly persistent threat to remote sensing classification systems and provides a robust benchmark for evaluating model reliability in complex environments. Abstract: Adversarial attacks pose a severe threat to the reliability of deep learning models in remote sensing (RS) image classification. Most existing methods rely on direct pixel-wise perturbations, failing to exploit the inherent atmospheric characteristics of RS imagery or survive real-world image degradations. In this paper, we propose FogFool, a physically plausible adversarial framework that generates fog-based perturbations by iteratively optimizing atmospheric patterns based on Perlin noise. By modeling fog formations with natural, irregular structures, FogFool generates adversarial examples that are not only visually consistent with authentic RS scenes but also deceptive. By leveraging the spatial coherence and mid-to-low-frequency nature of atmospheric phenomena, FogFool embeds adversarial information into structural features shared across diverse architectures. Extensive experiments on two benchmark RS datasets demonstrate that FogFool achieves superior performance: not only does it excel in white-box settings, but it also exhibits exceptional black-box transferability (reaching 83.74% TASR) and robustness against common preprocessing-based defenses such as JPEG compression and filtering. Detailed analyses, including confusion matrices and Class Activation Map (CAM) visualizations, reveal that our atmospheric-driven perturbations induce a universal shift in model attention. These results indicate that FogFool represents a practical, stealthy, and highly persistent threat to RS classification systems, providing a robust benchmark for evaluating model reliability in complex environments.
[113] Chaotic CNN for Limited Data Image Classification
Anusree M,Akhila Henry,Pramod P Nair
Main category: cs.CV
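TL;DR: This paper proposes a chaos-based feature transformation that applies logistic, skew tent, and sine maps to normalized feature vectors before the CNN classification layer, improving limited-data image classification without adding model parameters and yielding clear accuracy gains on MNIST, Fashion-MNIST, and CIFAR-10.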
Details
Motivation: With limited training data, CNNs overfit easily and lack feature diversity, which leads to poor generalization. Method: Logistic, skew tent, and sine chaotic maps are applied as nonlinear transformations to the normalized feature vector from the last CNN layer before it is passed to the classifier; no additional trainable parameters are introduced. Result: In limited-data settings the method clearly outperforms the baseline CNN on MNIST (+5.43%), Fashion-MNIST (+9.11%), and CIFAR-10 (+7.47%); the gains are attributed to properties shared by chaotic systems, namely nonlinearity and dynamical behavior. Conclusion: The chaotic feature transformation is a lightweight, plug-and-play, efficient, and general way to strengthen CNNs when data is scarce. Abstract: Convolutional neural networks (CNNs) often exhibit poor generalisation in limited training data scenarios due to overfitting and insufficient feature diversity. In this work, a simple and effective chaos-based feature transformation is proposed to enhance CNN performance without increasing model complexity. The method applies nonlinear transformations using logistic, skew tent, and sine maps to normalised feature vectors before the classification layer, thereby reshaping the feature space and improving class separability. The approach is evaluated on greyscale datasets (MNIST and Fashion-MNIST) and an RGB dataset (CIFAR-10) using CNN architectures of varying depth under limited data conditions. The results show consistent improvement over the standalone (SA) CNN across all datasets. Notably, a maximum performance gain of 5.43% is achieved on MNIST using the skew tent map with a 3-layer CNN at 40 samples per class. A higher gain of 9.11% is observed on Fashion-MNIST using the sine map with a 3-layer CNN at 50 samples per class. Additionally, a strong gain of 7.47% is obtained on CIFAR-10 using the skew tent map at 200 samples per class. The consistent improvements across different chaotic maps indicate that the performance gain is driven by the shared nonlinear and dynamical properties of chaotic systems. The proposed method is computationally efficient, requires no additional trainable parameters, and can be easily integrated into existing CNN architectures, making it a practical solution for data-scarce image classification tasks.
[114] Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting
Inseok Jeon,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Suhwan Cho,Sangyoun Lee
Main category: cs.CV
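TL;DR: This paper proposes Seen-to-Scene, a framework that combines propagation-based and generation-based paradigms to address spatio-temporal inconsistency in video outpainting, using flow-based propagation and reference-guided latent propagation to improve temporal coherence and visual realism.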
Details
Motivation: Existing generative approaches to video outpainting (such as diffusion models) suffer from implicit temporal modeling and limited spatial context, which causes intra-frame and inter-frame inconsistencies, especially in dynamic scenes and large outpainting regions. Method: Seen-to-Scene 1) performs flow-based propagation with a flow completion network pre-trained for video inpainting and fine-tuned end to end to bridge the domain gap; and 2) introduces a reference-guided latent propagation mechanism that efficiently propagates source content across frames. Result: In extensive experiments the method surpasses prior state-of-the-art methods in temporal coherence and visual realism, with efficient inference and no input-specific adaptation. Conclusion: Seen-to-Scene unifies the propagation and generation paradigms, substantially reduces spatio-temporal inconsistency in video outpainting, and offers a new direction for efficient, high-quality video outpainting. Abstract: Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.
[115] DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
Bo Qian,Dahu Shi,Xing Wei
Main category: cs.CV
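TL;DR: This paper proposes the DETR-ViP framework, which improves the class discriminability of visual prompts through global prompt integration and visual-textual prompt relation distillation, substantially improving open-vocabulary visual-prompted object detection.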
Details
Motivation: Visual-prompted object detection currently underperforms, mainly because visual prompts lack global discriminability; the direction has also been overlooked and is often treated as a byproduct of training text-prompted detectors.
Method: On top of basic image-text contrastive learning, introduce global prompt integration, visual-textual prompt relation distillation, and a selective fusion strategy to build the DETR-ViP detection framework.
Result: On COCO, LVIS, ODinW, and Roboflow100, DETR-ViP substantially outperforms existing SOTA methods on visual-prompted detection; ablation studies validate each module.
Conclusion: Improving the global discriminability of visual prompts is key; DETR-ViP's structured design learns more robust and distinguishable visual prompts and advances visual-prompted detection.
Abstract: Visual-prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual-prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text-prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.
[116] Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
Zhixuan Wu,Quanxing Zha,Teng Wang,Genbao Xu,Wenyuan Gu,Wei Rao,Nan Ma,Bo Cheng,Soujanya Poria
Main category: cs.CV
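TL;DR: This paper proposes Chain-of-Glimpse, a framework that improves the modeling of object changes over time in video understanding through search-guided progressive visual object grounding and multi-step reasoning.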
Details
Motivation: Most existing video understanding methods are object-agnostic, struggle with the substantial changes objects undergo over time, and lack explicit modeling and spatial grounding of key visual objects.
Method: Chain-of-Glimpse formulates video reasoning as the step-by-step construction of spatially grounded traces and introduces a search-guided controller (optimized with reinforcement learning, using a format reward to strengthen grounding) that anchors each reasoning step to specific visual regions, enabling interpretable and compositional multi-step decisions.
Result: Consistent gains on several benchmarks, including NExTQA, Video-Holmes, CG-Bench Reasoning, and VRBench, with strong robustness and cross-domain generalization.
Conclusion: Chain-of-Glimpse demonstrates the value of explicit object grounding and progressive spatial reasoning for video understanding and offers a new paradigm for interpretable, generalizable video reasoning.
Abstract: Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness and generalization of Chain-of-Glimpse across diverse video reasoning tasks.
[117] The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment
Songlin Li,Zhiqing Guo,Dan Ma,Changtao Miao,Gaobo Yang
Main category: cs.CV
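TL;DR: This paper proposes a courtroom-style adjudication framework for image manipulation localization (IML). A prosecution stream (asserting manipulation), a defense stream (asserting authenticity), and a judge model (performing reinforcement-learning-driven re-inference and calibration on uncertain regions) jointly model the confrontation between manipulated and authentic evidence, substantially improving localization robustness in ambiguous or degraded scenarios.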
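The abstract for this entry says the judge model is trained with a soft-IoU objective. As a rough illustration only, the sketch below shows one common definition of soft IoU on a flattened probability mask; the function names and the `eps` smoothing term are assumptions, and the paper's exact formulation may differ.

```python
def soft_iou(pred, target, eps=1e-6):
    """Soft IoU between predicted probabilities and a binary mask (flat lists).

    Uses products instead of hard set operations, so it stays differentiable
    when pred holds probabilities in [0, 1]. `eps` is an assumed smoothing term.
    """
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return (inter + eps) / (union + eps)


def soft_iou_loss(pred, target):
    """Loss form: 0 for a perfect mask, close to 1 for a disjoint one."""
    return 1.0 - soft_iou(pred, target)


# Hypothetical 4-pixel example: a confident, mostly correct prediction.
print(soft_iou_loss([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0]))
```

With hard 0/1 predictions this reduces to the ordinary intersection-over-union.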
Details
Motivation: Existing IML methods introduce authenticity supervision but use it only as an auxiliary training objective and do not explicitly model the opposition between manipulated and authentic evidence, so localization is unreliable when traces are subtle or disturbed by post-processing and noise.
Method: Build a dual-hypothesis segmentation architecture: on a shared multi-scale encoder, a prosecution stream (localizing manipulation) and a defense stream (localizing authentic regions) run in parallel and, using edge priors, cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement, produce the two kinds of evidence; design a reinforcement-learning judge model that performs strategic re-inference on uncertain regions with advantage-based rewards and a soft-IoU objective, and calibrates reliability via entropy and cross-hypothesis consistency.
Result: Average performance on multiple benchmarks exceeds current SOTA IML methods.
Conclusion: Modeling IML as evidence confrontation followed by adjudication, explicitly and jointly modeling the manipulation and authenticity hypotheses, and introducing a learnable judgment mechanism improves localization robustness and trustworthiness.
Abstract: Although some existing image manipulation localization (IML) methods incorporate authenticity-related supervision, this information is typically utilized merely as an auxiliary training signal to enhance the model's sensitivity to manipulation artifacts, rather than being explicitly modeled as localization evidence opposing the manipulated regions. Consequently, when manipulation traces are subtle or degraded by post-processing and noise, these methods struggle to explicitly compare manipulated and authentic evidence, resulting in unreliable predictions in ambiguous areas. To address these issues, we propose a courtroom-style adjudication framework that regards the IML task as the confrontation of evidence followed by judgment. The framework comprises a prosecution stream, a defense stream, and a judge model. We first build a dual-hypothesis segmentation architecture on a shared multi-scale encoder, in which the prosecution stream asserts manipulation and the defense stream asserts authenticity. Guided by edge priors, it produces evidence for manipulated and authentic regions through cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. We further develop a reinforcement learning judge model that performs strategic re-inference and refinement on uncertain regions, yielding a manipulated-region mask. The judge model is trained with advantage-based rewards and a soft-IoU objective, and reliability is calibrated via entropy and cross-hypothesis consistency. Experimental results show that our model achieves superior average performance compared with SOTA IML methods.
[118] NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation
Yi He,Tao Wang,Yi Jin,Congyan Lang,Yidong Li,Haibin Ling
Main category: cs.CV
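TL;DR: This paper proposes the NG-GS framework, which combines analysis of ambiguous Gaussians, RBF interpolation, and multi-resolution hash encoding with joint optimization of a lightweight NeRF module to substantially improve segmentation accuracy at object boundaries in 3D Gaussian Splatting.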
Details
Motivation: 3D Gaussian Splatting (3DGS) performs well in novel view synthesis, but its discrete Gaussian representation causes aliasing and artifacts at object boundaries, making accurate segmentation difficult.
Method: 1) Use mask variance analysis to automatically identify ambiguous Gaussians at boundaries; 2) build a spatially continuous feature field with radial basis function (RBF) interpolation, enhanced by multi-resolution hash encoding for multi-scale representation; 3) design a joint optimization strategy that aligns 3DGS with a lightweight NeRF module through an alignment loss and a spatial continuity loss.
Result: SOTA performance on NVOS, LERF-OVS, and ScanNet, with significant gains in boundary mIoU.
Conclusion: NG-GS mitigates boundary segmentation errors caused by the discrete representation in 3DGS and offers a new paradigm for high-quality 3D scene semantic understanding.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS, LERF-OVS, and ScanNet benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU. Code is available at https://github.com/BJTU-KD3D/NG-GS.
[119] G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
Jiyoung Lim,Heejae Yang,Jee-Hyong Lee
Main category: cs.CV
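TL;DR: This paper proposes G-MIXER, which expands implicit semantics through geodesic mixup and re-ranks candidates using explicit semantics generated by a multimodal large language model, improving the diversity and accuracy of zero-shot composed image retrieval (ZS-CIR) without additional training.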
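To make the geodesic mixup named in the TL;DR concrete, here is a minimal sketch that interpolates between two L2-normalized feature vectors along the great circle of the unit sphere (spherical linear interpolation). Treating geodesic mixup as slerp, the convention that `lam = 0` returns the image feature, and the toy 2-D features are all assumptions for illustration, not details confirmed by the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def geodesic_mixup(a, b, lam):
    """Point at fraction `lam` along the great circle from a (lam=0) to b (lam=1)."""
    a, b = normalize(a), normalize(b)
    cos_t = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(cos_t)
    sin_t = math.sin(theta)
    if sin_t < 1e-8:
        # Parallel or antipodal inputs: the geodesic is degenerate, return a.
        return a
    wa = math.sin((1.0 - lam) * theta) / sin_t
    wb = math.sin(lam * theta) / sin_t
    return [wa * x + wb * y for x, y in zip(a, b)]


# Hypothetical image and text features; several ratios give several queries.
image_feat, text_feat = [1.0, 0.0], [0.0, 1.0]
queries = [geodesic_mixup(image_feat, text_feat, lam) for lam in (0.25, 0.5, 0.75)]
```

Unlike a plain weighted average, every mixed query stays on the unit sphere, so cosine-similarity retrieval sees vectors of the same norm as the original features.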
Details
Motivation: Existing ZS-CIR methods rely too heavily on the textual modality and struggle to model the candidate diversity that fuzzy retrieval requires, which limits both the diversity and the accuracy of retrieval results.
Method: Propose the training-free G-MIXER: 1) apply geodesic mixup at multiple ratios to the features of the reference image-text pair, building composed query features that reflect implicit semantics and generating a diverse candidate set; 2) re-rank the candidate set with explicit semantics extracted by a multimodal large language model (MLLM).
Result: G-MIXER achieves the best performance on multiple ZS-CIR benchmarks, significantly improving retrieval diversity and accuracy without additional training.
Conclusion: G-MIXER models implicit and explicit semantics together and provides an efficient, general, training-free approach to zero-shot composed image retrieval.
Abstract: Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at https://github.com/maya0395/gmixer.
[120] MS-SSE-Net: A Multi-Scale Spatial Squeeze-and-Excitation Network for Structural Damage Detection in Civil and Geotechnical Engineering
Saif ur Rehman Khan,Imad Ahmed Waqar,Arooj Zaib,Saad Ahmed,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim
Main category: cs.CV
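TL;DR: This paper proposes MS-SSE-Net, a new deep learning framework for structural damage classification that uses multi-scale feature extraction with channel and spatial attention and outperforms baselines such as DenseNet201 on the StructDamage dataset.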
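The channel attention mentioned in the TL;DR is described in the abstract as "squeeze-and-excitation style". The sketch below shows the generic squeeze-and-excitation pattern in plain Python: pool each channel to one scalar, pass the result through a small bottleneck, and rescale the channels by the resulting gates. The weight matrices `w1` and `w2` and the toy feature map are hypothetical; this is not the MS-SSE-Net implementation.

```python
import math


def se_channel_attention(feat, w1, w2):
    """Generic squeeze-and-excitation gate.

    feat: list of C channels, each an H x W nested list.
    w1:   (C/r) x C bottleneck weights; w2: C x (C/r) expansion weights.
    Returns the reweighted feature map and the per-channel gates.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # Excitation: linear -> ReLU -> linear -> sigmoid.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h)))) for row in w2]
    # Scale: multiply every value in channel c by its gate s[c].
    out = [[[s[c] * x for x in row] for row in ch] for c, ch in enumerate(feat)]
    return out, s


# Toy input: two 2x2 channels and hand-picked (hypothetical) weights.
feat = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
out, gates = se_channel_attention(feat, w1=[[1.0, 1.0]], w2=[[0.0], [1.0]])
```

In a trained network the two weight matrices are learned, so the gates emphasize channels that carry useful evidence and damp the rest.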
Details
Motivation: Accurately identifying different types of structural damage in images is challenging, mainly because damage patterns and environmental conditions vary widely.
Method: Built on a DenseNet201 backbone, the model adds multi-scale feature extraction, depthwise separable convolutions, squeeze-and-excitation style channel attention, and spatial attention, followed by global average pooling and a fully connected classification layer.
Result: On the StructDamage dataset it achieves 99.31% precision, 99.25% recall, 99.27% F1-score, and 99.26% accuracy, outperforming the DenseNet201 baseline.
Conclusion: MS-SSE-Net improves the accuracy and robustness of structural damage image classification and validates the combination of multi-scale features with attention mechanisms.
Abstract: Structural damage detection is essential for maintaining the safety and reliability of civil infrastructure. However, accurately identifying different types of structural damage from images remains challenging due to variations in damage patterns and environmental conditions. To address these challenges, this paper proposes MS-SSE-Net, a novel deep learning (DL) framework for structural damage classification. The proposed model is built upon the DenseNet201 backbone and integrates novel multi-scale feature extraction with channel and spatial attention mechanisms (MS-SSE-Net). Specifically, parallel depthwise convolutions capture both local and contextual features, while squeeze-and-excitation style channel attention and spatial attention emphasize informative regions and suppress irrelevant noise. The refined features are then processed through global average pooling and a fully connected classification layer to generate the final predictions. Experiments are conducted on the StructDamage dataset containing multiple structural damage categories. The proposed MS-SSE-Net demonstrates superior performance compared with the baseline DenseNet201 and other comparative approaches. Specifically, the proposed method achieves 99.31% precision, 99.25% recall, 99.27% F1-score, and 99.26% accuracy, outperforming the baseline model which achieved 98.62% precision, 98.53% recall, 98.58% F1-score, and 98.53% accuracy.
[121] Data Synthesis Improves 3D Myotube Instance Segmentation
David Exler,Nils Friederich,Martin Krüger,John Jbeily,Mario Vitacolonna,Rüdiger Rudolf,Ralf Mikut,Markus Reischl
Main category: cs.CV
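TL;DR: This paper proposes a geometry-based synthetic data generation pipeline to address the scarcity of real annotated data for 3D myotube instance segmentation; combining biophysically inspired synthesis, domain adaptation, and a lightweight 3D U-Net with self-supervised pretraining, a model trained only on synthetic data significantly outperforms zero-shot transfer models.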
Details
Motivation: Existing pretrained biomedical segmentation models fail to generalize to myotubes because no large annotated myotube datasets exist, yet precise 3D instance segmentation is essential for quantitative morphological analysis of myotubes.
Method: Build a geometric myotube synthesis pipeline based on polynomial centerlines, varying radii, branching structures, and ellipsoidal end caps; add realistic noise, optical artifacts, and CycleGAN-based domain adaptation; train a compact 3D U-Net with self-supervised pretraining on synthetic data only.
Result: A mean instance prediction quality (IPQ) of 0.22 on real data, significantly better than three zero-shot segmentation models.
Conclusion: Biophysics-driven synthetic data can compensate for missing annotations and support 3D instance segmentation, offering a new approach to annotation-scarce biomedical image analysis.
Abstract: Myotubes are multinucleated muscle fibers serving as key model systems for studying muscle physiology, disease mechanisms, and drug responses. Mechanistic studies and drug screening thereby rely on quantitative morphological readouts such as diameter, length, and branching degree, which in turn require precise three-dimensional instance segmentation. Yet established pretrained biomedical segmentation models fail to generalize to this domain due to the absence of large annotated myotube datasets. We introduce a geometry-driven synthesis pipeline that models individual myotubes via polynomial centerlines, locally varying radii, branching structures, and ellipsoidal end caps derived from real microscopy observations. Synthetic volumes are rendered with realistic noise, optical artifacts, and CycleGAN-based Domain Adaptation (DA). A compact 3D U-Net with self-supervised encoder pretraining, trained exclusively on synthetic data, achieves a mean IPQ of 0.22 on real data, significantly outperforming three established zero-shot segmentation models, demonstrating that biophysics-driven synthesis enables effective instance segmentation in annotation-scarce biomedical domains.
[122] HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
Badri N. Patro,Vijay S. Agneeswaran
Main category: cs.CV
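TL;DR: HAMSA is a scanning-free vision state space model that operates directly in the spectral domain. It improves efficiency and performance through a simplified kernel parameterization, SpectralPulseNet (SPN), and a Spectral Adaptive Gating Unit (SAGU), reaching 85.7% top-1 accuracy on ImageNet-1K with faster inference and lower memory and energy use.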
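Operating in the spectral domain, as the TL;DR describes, rests on the convolution theorem: a convolution becomes an elementwise product after a Fourier transform, which is where the O(L log L) cost quoted in the abstract comes from. The sketch below demonstrates this for a 1-D sequence with a small radix-2 FFT written from scratch. It is a generic illustration with hypothetical function names, not HAMSA's kernel, gating, or 2-D pipeline.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. The inverse is unnormalized."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def fft_conv(signal, kernel):
    """Linear convolution via zero-padded FFT: O(L log L) instead of O(L^2)."""
    out_len = len(signal) + len(kernel) - 1
    n = 1
    while n < out_len:
        n *= 2
    fs = fft([complex(v) for v in signal] + [0j] * (n - len(signal)))
    fk = fft([complex(v) for v in kernel] + [0j] * (n - len(kernel)))
    prod = [a * b for a, b in zip(fs, fk)]  # pointwise product in frequency space
    res = fft(prod, inverse=True)
    return [v.real / n for v in res[:out_len]]


def direct_conv(signal, kernel):
    """Reference O(L^2) convolution used to check the FFT path."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out
```

Because every output position is computed at once from the frequency-domain product, no sequential scan over the input is needed.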
Details
Motivation: Existing vision SSMs (such as Vim, VMamba, and SiMBA) rely on complex scanning strategies to process 2D images, which adds computational overhead and architectural complexity; a simpler, more efficient alternative is needed.
Method: Propose HAMSA, a scanning-free spectral-domain SSM with three innovations: (1) a single Gaussian-initialized complex kernel that replaces the traditional (A, B, C) matrices; (2) an input-dependent SpectralPulseNet (SPN) for adaptive spectral modulation; (3) a magnitude-based Spectral Adaptive Gating Unit (SAGU) that keeps gradients stable in the frequency domain; the whole model uses FFT-based convolution for O(L log L) complexity.
Result: 85.7% top-1 accuracy on ImageNet-1K (SOTA among SSMs); inference 2.2× faster than DeiT-S (4.2ms vs 9.2ms) and 1.4-1.9× faster than scanning-based SSMs; lower memory (2.1GB vs 3.2-4.5GB) and energy (12.5J vs 18-25J); strong results on transfer learning and dense prediction tasks.
Conclusion: HAMSA shows that spectral-domain modeling can avoid the scanning bottleneck, matching or exceeding accuracy while substantially improving efficiency, stability, and hardware friendliness, and offers a new paradigm for vision SSMs.
Abstract: Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization: a single Gaussian-initialized complex kernel replacing traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN): an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU): magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2× faster inference than transformers (4.2ms vs 9.2ms for DeiT-S) and 1.4-1.9× speedup over scanning-based SSMs, while using less memory (2.1GB vs 3.2-4.5GB) and energy (12.5J vs 18-25J). HAMSA demonstrates strong generalization across transfer learning and dense prediction tasks.
[123] Find the Differences: Differential Morphing Attack Detection vs Face Recognition
Una M. Kelly,Luuk J. Spreeuwers,Raymond N. J. Veldhuis
Main category: cs.CV
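TL;DR: This paper examines the vulnerability of face recognition systems to morphing attacks, argues that face recognition and differential morphing attack detection (D-MAD) are essentially similar tasks, proposes using existing face recognition systems for morphing detection, and introduces a new evaluation threshold that limits vulnerability to morphing attacks of unknown types.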
Details
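Motivation: Many existing face recognition systems are vulnerable to morphing attacks, and current decision thresholds create an inherent tradeoff between performance on normal images and resistance to morphing attacks; a method that serves both is needed.
Method: Compare an FR system with two existing D-MAD methods to analyze task similarity; analyze theoretically why current decision thresholds cause vulnerability; propose reusing existing FR systems for morphing detection and design a new evaluation threshold that guarantees an upper limit on morphing-attack vulnerability.
Result: Confirms the strong similarity between the FR and D-MAD tasks; shows that current threshold settings are the root cause of FR vulnerability to morphing attacks; the new threshold controls vulnerability to known and unknown morphing attacks while preserving normal recognition performance.
Conclusion: FR systems themselves can be used for morphing attack detection without an additional dedicated model; the key is a suitable evaluation threshold, which gives a better balance between practicality and security.
Abstract: Morphing is a challenge to face recognition (FR) for which several morphing attack detection solutions have been proposed. We argue that face recognition and differential morphing attack detection (D-MAD) in principle perform very similar tasks, which we support by comparing an FR system with two existing D-MAD approaches. We also show that currently used decision thresholds inherently lead to FR systems being vulnerable to morphing attacks and that this explains the tradeoff between performance on normal images and vulnerability to morphing attacks. We propose using FR systems that are already in place for morphing detection and introduce a new evaluation threshold that guarantees an upper limit to the vulnerability to morphing attacks - even of unknown types.
[124] Efficient closed-form approaches for pose estimation using Sylvester forms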
Jana Vráblíková,Ezio Malis,Laurent Busé
Main category: cs.CV
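TL;DR: This paper proposes a new class of resultant-based solvers built on Sylvester forms for nonlinear least-squares pose estimation, substantially reducing computational cost while preserving accuracy.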
Details
Motivation: Nonlinear least-squares pose estimation is time-consuming yet essential in real-time vision applications; existing resultant-based solvers are effective but leave room for further speedup.
Method: Propose new resultant-based solvers built on Sylvester forms that convert pose estimation into a system of polynomial equations solved in closed form, applicable to both 3D-3D and 3D-2D point correspondences.
Result: Numerical accuracy on par with the current best solvers, with lower computation time.
Conclusion: Resultant-based solvers using Sylvester forms are an efficient and accurate approach to pose estimation, suitable for real-time vision tasks with several kinds of correspondences.
Abstract: Solving the non-linear least-squares problem for pose estimation (rotation and translation) is often a time-consuming yet fundamental problem in several real-time computer vision applications. With an adequate rotation parametrization, the optimization problem can be reduced to the solution of a system of polynomial equations and solved in closed form. Recent advances in efficient closed-form solvers utilizing resultant matrices have shown a promising research direction to decrease the computation time while preserving the estimation accuracy. In this paper, we propose a new class of resultant-based solvers that exploit Sylvester forms to further reduce the complexity of the resolution. We demonstrate that our proposed methods are numerically as accurate as the state-of-the-art solvers, and outperform them in terms of computational time. We show that this approach can be applied for pose estimation in two different types of problems: estimating a pose from 3D to 3D correspondences, and estimating a pose from 3D points to 2D points correspondences.
[125] ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation
Yanguang Sun,Hengmin Zhang,Jianjun Qian,Jian Yang,Lei Luo
Main category: cs.CV
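TL;DR: This paper proposes ASGNet, an adaptive spectrum guidance network that combines spectral features with global attributes to improve the accuracy of polyp segmentation in colonoscopy images.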
Details
Motivation: Existing deep learning methods for polyp segmentation are biased toward local regions and struggle to capture complete polyp structures, which degrades segmentation.
Method: Propose ASGNet, comprising a spectrum-guided non-local perception module, a multi-source semantic extractor, and a dense cross-layer interaction decoder, which fuse spectral features with global information to strengthen the discriminability of polyp structures and boundary accuracy.
Result: Significantly outperforms 21 state-of-the-art methods on five mainstream polyp segmentation benchmarks; both quantitative and qualitative results confirm its advantage.
Conclusion: By introducing spectral-domain modeling and global perception, ASGNet overcomes the limitation imposed by local correlations in the spatial domain and improves overall polyp segmentation performance.
Abstract: Early identification and removal of polyps can reduce the risk of developing colorectal cancer. However, the diverse morphologies, complex backgrounds and often concealed nature of polyps make polyp segmentation in colonoscopy images highly challenging. Despite the promising performance of existing deep learning-based polyp segmentation methods, their perceptual capabilities remain biased toward local regions, mainly because of the strong spatial correlations between neighboring pixels in the spatial domain. This limitation makes it difficult to capture the complete polyp structures, ultimately leading to sub-optimal segmentation results. In this paper, we propose a novel adaptive spectrum guidance network, called ASGNet, which addresses the limitations of spatial perception by integrating spectral features with global attributes. Specifically, we first design a spectrum-guided non-local perception module that jointly aggregates local and global information, thereby enhancing the discriminability of polyp structures and refining their boundaries. Moreover, we introduce a multi-source semantic extractor that integrates rich high-level semantic information to assist in the preliminary localization of polyps. Furthermore, we construct a dense cross-layer interaction decoder that effectively integrates diverse information from different layers and strengthens it to generate high-quality representations for accurate polyp segmentation. Extensive quantitative and qualitative results demonstrate the superiority of our ASGNet approach over 21 state-of-the-art methods across five widely-used polyp segmentation benchmarks. The code will be publicly available at: https://github.com/CSYSI/ASGNet.
[126] OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism
Jordan Shipard,Arnold Wiliem,Kien Nguyen Thanh,Wei Xiang,Clinton Fookes
Main category: cs.CV
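TL;DR: This paper proposes OmniGCD, a modality-agnostic generalized category discovery (GCD) method that uses modality-specific encoders and a synthetically trained Transformer model to achieve substantial gains in a zero-shot setting across four modalities and 16 datasets.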
Details
Motivation: Existing GCD methods are limited to a single modality and require dataset-specific fine-tuning, so they cannot mimic human abstract category formation; a modality-agnostic GCD framework that needs no fine-tuning and is closer to human learning is needed.
Method: Propose OmniGCD: modality-specific encoders extract features, dimension reduction builds a GCD latent space, and at test time a Transformer model trained on synthetic data transforms the representation to suit clustering; introduce a zero-shot GCD evaluation setting in which the model is trained once, on synthetic data only.
Result: On zero-shot GCD across 4 modalities and 16 real datasets, OmniGCD significantly outperforms baselines: vision +6.2 pp, text +17.9 pp, audio +1.5 pp, remote sensing +12.7 pp; this supports decoupling strong encoders from category discovery.
Conclusion: OmniGCD achieves, for the first time, truly modality-agnostic zero-shot GCD, providing a new paradigm and benchmark for scalable, human-like category discovery and supporting the independent development of encoders and GCD methods.
Abstract: Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our OmniGCD leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a GCD latent space, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a zero-shot GCD setting where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. Trained once on synthetic data, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of +6.2, +17.9, +1.5 and +12.7 for vision, text, audio and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available at https://github.com/Jordan-HS/OmniGCD.
[127] AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
Peifeng Zhang,Zice Qiu,Donghua Yu,Shilei Cao,Juepeng Zheng,Yutong Lu,Haohuan Fu
Main category: cs.CV
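TL;DR: This paper proposes Asymmetric Information Masking (AIM) to address catastrophic forgetting caused by structural asymmetry when vision-language models learn continual visual question answering; modality-sensitive targeted masks balance stability and plasticity, substantially improving performance while preserving compositional reasoning.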
Details
Motivation: Existing continual learning methods are designed mainly for symmetric, unimodal architectures, whereas modern vision-language models (VLMs) are inherently asymmetric; standard global regularization therefore favors the language decoder, leaving the critical visual projection layers vulnerable to interference and severely harming compositional reasoning.
Method: Propose Asymmetric Information Masking (AIM), which applies targeted masks according to modality-specific sensitivity to balance stability and plasticity and mitigate the catastrophic forgetting caused by the asymmetric structure.
Result: In continual VQA settings on VQA v2 and GQA, AIM achieves SOTA on both Average Performance (AP) and Average Forgetting (AF) and better preserves generalization to novel skill-concept compositions.
Conclusion: AIM addresses the structural mismatch of VLMs in continual VQA and confirms that modality-aware local regularization is key to maintaining compositional reasoning.
Abstract: In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.
[128] Integrating Object Detection, LiDAR-Enhanced Depth Estimation, and Segmentation Models for Railway Environments
Enrico Francesco Giannico,Federico Nesti,Gianluca D'Amico,Mauro Marinoni,Edoardo Carosio,Filippo Salotti,Salvatore Sabina,Giorgio Buttazzo
Main category: cs.CV
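TL;DR: This paper proposes a modular, flexible framework for obstacle detection and distance estimation in railway environments that combines object detection, track segmentation, and monocular depth estimation with LiDAR point clouds, achieving a mean absolute error of 0.63 meters on the synthetic SynDRA dataset.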
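The TL;DR reports a mean absolute error after combining monocular depth with LiDAR. The summary does not say how the two are fused, so the sketch below only illustrates one simple, commonly used option under that stated assumption: fit a scale and shift that map relative monocular depth to metric LiDAR ranges at pixels with LiDAR returns, then score the distances with MAE. All names and numbers are hypothetical.

```python
def fit_scale_shift(mono, lidar):
    """Least-squares a, b such that a * mono + b approximates lidar.

    mono:  relative depth values at pixels that also have a LiDAR return.
    lidar: metric ranges (meters) at the same pixels.
    Assumption: an affine alignment is adequate; the paper may fuse differently.
    """
    n = len(mono)
    mx = sum(mono) / n
    my = sum(lidar) / n
    var = sum((x - mx) ** 2 for x in mono)
    cov = sum((x - mx) * (y - my) for x, y in zip(mono, lidar))
    a = cov / var
    return a, my - a * mx


def mean_absolute_error(pred, truth):
    """Average absolute difference between predicted and true distances."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


# Hypothetical calibration pixels, then two obstacle distances to evaluate.
a, b = fit_scale_shift([0.1, 0.2, 0.4, 0.8], [7.0, 12.0, 22.0, 42.0])
pred = [a * d + b for d in (0.3, 0.5)]
mae = mean_absolute_error(pred, [17.5, 26.0])
```

The appeal of this kind of alignment is that sparse LiDAR supplies the metric scale while the dense monocular map supplies depth at every pixel, including those the LiDAR misses.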
Details
Motivation: Obstacle detection is critical to safety in railway environments, but most existing work addresses only detection or track identification; complete systems that handle both detection and distance estimation are lacking, and the absence of real annotated data makes quantitative evaluation difficult.
Method: Build a modular framework that integrates three neural networks, for object detection, track segmentation, and monocular depth estimation, and fuses LiDAR point clouds to improve distance estimation; evaluate on the synthetic SynDRA dataset, which provides accurate ground-truth annotations.
Result: On SynDRA, the mean absolute error (MAE) of obstacle distance estimation is as low as 0.63 meters, improving spatial perception and the reliability of quantitative evaluation.
Conclusion: The framework is modular, flexible, and accurate; by combining multi-task neural networks with multimodal data fusion it addresses obstacle detection and distance estimation in railways and provides a reproducible evaluation benchmark for future work.
Abstract: Obstacle detection in railway environments is crucial for ensuring safety. However, very few studies address the problem using a complete, modular, and flexible system that can both detect objects in the scene and estimate their distance from the vehicle. Most works focus solely on detection, others attempt to identify the track, and only a few estimate obstacle distances. Additionally, evaluating these systems is challenging due to the lack of ground truth data. In this paper, we propose a modular and flexible framework that identifies the rail track, detects potential obstacles, and estimates their distance by integrating three neural networks for object detection, track segmentation, and monocular depth estimation with LiDAR point clouds. To enable a reliable and quantitative evaluation, the proposed framework is assessed using a synthetic dataset (SynDRA), which provides accurate ground truth annotations, allowing for direct performance comparison with existing methods. The proposed system achieves a mean absolute error (MAE) as low as 0.63 meters by integrating monocular depth maps with LiDAR, enabling not only accurate distance estimates but also spatial perception of the scene.
[129] One-shot Compositional 3D Head Avatars with Deformable Hair
Yuan Sun,Xuan Wang,WeiLi Zhang,Wenxuan Zhang,Yu Guo,Fei Wang
Main category: cs.CV
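TL;DR: This paper proposes a compositional method for building a complete 3D head avatar from a single frontal portrait. Its core idea is to explicitly decouple hair from the face and model each with a different deformation paradigm (FLAME skinning versus cage-based PBD physical simulation), combined with a 3D Gaussian Splatting (3DGS) representation and image-to-3D lifting, which substantially improves the realism of hair dynamics and the fidelity of facial detail during animation.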
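For readers unfamiliar with the Position-Based Dynamics (PBD) simulation named in the TL;DR, the sketch below shows the basic PBD loop on a 2-D chain of particles pinned at one end: predict positions under gravity, project distance constraints, then recover velocities from the position change. It is a textbook-style toy with hypothetical parameters; the paper's cage structure and its coupling to hair Gaussians are not modeled here.

```python
import math


def pbd_step(pos, vel, rest_len, root, dt=1.0 / 60.0, gravity=(0.0, -9.81), iters=10):
    """One PBD step for a 2-D particle chain whose first particle is pinned at `root`."""
    # 1) Predict positions from current velocities plus gravity.
    pred = [(p[0] + dt * (v[0] + dt * gravity[0]),
             p[1] + dt * (v[1] + dt * gravity[1])) for p, v in zip(pos, vel)]
    pred[0] = root  # the root follows the head, not the simulation
    # 2) Iteratively project distance constraints between neighbors.
    for _ in range(iters):
        for i in range(len(pred) - 1):
            (ax, ay), (bx, by) = pred[i], pred[i + 1]
            dx, dy = bx - ax, by - ay
            dist = math.hypot(dx, dy)
            if dist < 1e-12:
                continue
            corr = (dist - rest_len) / dist
            if i == 0:
                # Pinned particle acts as infinite mass: move only the free one.
                pred[i + 1] = (bx - corr * dx, by - corr * dy)
            else:
                pred[i] = (ax + 0.5 * corr * dx, ay + 0.5 * corr * dy)
                pred[i + 1] = (bx - 0.5 * corr * dx, by - 0.5 * corr * dy)
    # 3) Velocities come from the corrected position change.
    new_vel = [((q[0] - p[0]) / dt, (q[1] - p[1]) / dt) for p, q in zip(pos, pred)]
    return pred, new_vel


# A horizontal 4-particle strand released under gravity for 10 frames.
pos = [(0.1 * i, 0.0) for i in range(4)]
vel = [(0.0, 0.0)] * 4
for _ in range(10):
    pos, vel = pbd_step(pos, vel, rest_len=0.1, root=(0.0, 0.0))
```

Moving `root` between steps makes the strand lag and swing behind it, which is the kind of inertial response to head motion that the abstract attributes to the hair component.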
Details
Motivation: Existing methods that generate 3D head avatars from a single image struggle to produce realistic hair animation, mainly because hair and facial geometry are entangled and insufficiently decoupled, which distorts deformation; generalized models also tend to lose high-frequency texture detail.
Method: 1) Given a single frontal portrait, segment the hair along the hairline and remove it to obtain a bald image; 2) lift both the original and the bald image to 3D, producing detail-rich 3D Gaussian Splatting (3DGS) representations; 3) rig the bald 3DGS to a FLAME mesh via non-rigid registration to support facial animation; 4) for the hair, use semantic label supervision and a boundary-aware reassignment strategy to extract clean hair Gaussians; 5) design a cage control structure that drives the hair Gaussians through Position-Based Dynamics (PBD) simulation in response to head motion, gravity, and inertia.
Result: In dynamic animation under varied head motion, expressions, and gravity, hair behaves more realistically and facial texture detail is well preserved; qualitative results are clearly better than the current best one-shot methods.
Conclusion: Explicit decoupled modeling, physics-driven hair deformation, and high-fidelity 3DGS reconstruction are a key route to more realistic animation of single-image 3D avatars; the compositional framework offers a new paradigm for high-quality, controllable, animatable personalized avatars.
Abstract: We propose a compositional method for constructing a complete 3D head avatar from a single image. Prior one-shot holistic approaches frequently fail to produce realistic hair dynamics during animation, largely due to inadequate decoupling of hair from the facial region, resulting in entangled geometry and unnatural deformations. Our method explicitly decouples hair from the face, modeling these components using distinct deformation paradigms while integrating them into a unified rendering pipeline. Furthermore, by leveraging image-to-3D lifting techniques, we preserve fine-grained textures from the input image to the greatest extent possible, effectively mitigating the common issue of high-frequency information loss in generalized models. Specifically, given a frontal portrait image, we first perform hair removal to obtain a bald image. Both the original image and the bald image are then lifted to dense, detail-rich 3D Gaussian Splatting (3DGS) representations. For the bald 3DGS, we rig it to a FLAME mesh via non-rigid registration with a prior model, enabling natural deformation that follows the mesh triangles during animation. For the hair component, we employ semantic label supervision combined with a boundary-aware reassignment strategy to extract a clean and isolated set of hair Gaussians. To control hair deformation, we introduce a cage structure that supports Position-Based Dynamics (PBD) simulation, allowing realistic and physically plausible transformations of the hair Gaussian primitives under head motion, gravity, and inertial effects. Striking qualitative results, including dynamic animations under diverse head motions, gravity effects, and expressions, showcase substantially more realistic hair behavior alongside faithfully preserved facial details, outperforming state-of-the-art one-shot methods in perceptual realism.
[130] From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation
Yili Ren,Shiqi Wen,Li Hou,Dingwen Xiao,Weiming Zhang,Caleb Chen Cao,Lin Wang,Zilu Zheng,Qianxiao Su,Mingjun Zhao,Lei Chen
Main category: cs.CV
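TL;DR: This paper proposes Petro-SAM, a two-stage multi-task framework for high-quality joint grain-edge segmentation (GES) and lithology semantic segmentation (LSS) of petrographic images; it introduces a Merge Block that fuses seven polarized views, multi-scale feature fusion, and color-entropy priors to overcome the domain gap and ultra-fine boundaries.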
Details
Motivation: Existing methods usually handle GES and LSS separately, with poor results; expert-annotated data exist but are costly and time-consuming to produce; general foundation models (such as SAM) suffer a severe domain shift on petrographic images because of extinction-dependent color changes and ultra-fine grain boundaries, so they are hard to transfer directly.
Method: Propose the two-stage multi-task Petro-SAM framework: 1) building on SAM, design a Merge Block that fuses seven polarized images to mitigate the extinction effect; 2) introduce multi-scale feature fusion and color-entropy priors to improve boundary detection.
Result: High-quality joint GES and LSS segmentation on petrographic images, with clearly better boundary alignment and semantic consistency than directly fine-tuned SAM and other mainstream methods.
Conclusion: Petro-SAM bridges the domain gap between foundation models and petrographic image analysis and provides a scalable, robust approach to joint segmentation of multi-angle polarized images.
Abstract: Grain-edge segmentation (GES) and lithology semantic segmentation (LSS) are two pivotal tasks for quantifying rock fabric and composition. However, these two tasks are often treated separately, and the segmentation quality remains unsatisfactory even though expensive, time-consuming, expert-annotated datasets have been used. Recently, foundation models, especially the Segment Anything Model (SAM), have demonstrated impressive robustness for boundary alignment. However, directly adapting SAM to joint GES and LSS is nontrivial due to 1) a severe domain gap induced by extinction-dependent color variations and ultra-fine grain boundaries, and 2) the lack of modules for joint learning on multi-angle petrographic image stacks. In this paper, we propose Petro-SAM, a novel two-stage, multi-task framework that can achieve high-quality joint GES and LSS on petrographic images. Specifically, based on SAM, we introduce a Merge Block to integrate seven polarized views, effectively solving the extinction issue. Moreover, we introduce multi-scale feature fusion and color-entropy priors to refine the detection.
[131] NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results
Andrey Moskalenko,Alexey Bryncev,Ivan Kosmynin,Kira Shilovskaya,Mikhail Erofeev,Dmitry Vatolin,Radu Timofte,Kun Wang,Yupeng Hu,Zhiran Li,Hao Liu,Qianlong Xiang,Liqiang Nie,Konstantinos Chaldaiopoulos,Niki Efthymiou,Athanasia Zlatintsi,Panagiotis Filntisis,Katerina Pastra,Petros Maragos,Li Yang,Gen Zhan,Yiting Liao,Yabin Zhang,Yuxin Liu,Xu Wu,Yunheng Zheng,Linze Li,Kun He,Cong Wu,Xuefeng Zhu,Tianyang Xu,Xiaojun Wu,Wenzhuo Zhao,Keren Fu,Gongyang Li,Shixiang Shi,Jianlin Chen,Haibin Ling,Yaoxin Jiang,Guoyi Xu,Jiajia Liu,Yaokun Shi,Jiachen Tu
Main category: cs.CV
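TL;DR: This paper presents the NTIRE 2026 Challenge on Video Saliency Prediction, covering a newly built, openly licensed dataset of 2,000 videos, saliency maps and fixations collected through crowdsourced mouse tracking, and an evaluation of the methods submitted by more than 20 teams.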
Details
Motivation: To advance video saliency prediction by providing a large-scale, high-quality, open benchmark dataset and a fair evaluation platform.
Method: The organizers ran an international challenge on a new dataset of 2,000 videos, generated saliency maps from fixation data collected from more than 5,000 assessors via crowdsourced mouse tracking, and evaluated submissions with generally accepted metrics on 800 test videos.
Result: More than 20 teams participated and 7 passed the final code review; all data and code have been released publicly.
Conclusion: The challenge advanced research on video saliency prediction and provides a comparatively large, richly annotated, fully open data resource.
Abstract: This paper presents an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. The goal of the challenge participants was to develop automatic saliency map prediction methods for the provided video sequences. A novel dataset of 2,000 diverse videos with an open license was prepared for this challenge. The fixations and corresponding saliency maps were collected using crowdsourced mouse tracking and contain viewing data from over 5,000 assessors. Evaluation was performed on a subset of 800 test videos using generally accepted quality metrics. The challenge attracted over 20 teams making submissions, and 7 teams passed the final phase with code review. All data used in this challenge is made publicly available at https://github.com/msu-video-group/NTIRE26_Saliency_Prediction.
[132] Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimer's Disease Neurodegeneration
Geonwoo Baek,David H. Salat,Ikbeom Jang
Main category: cs.CV
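TL;DR: This paper proposes MSSM+, which combines surface supervertex mapping (SSVM) with a Supervertex Vision Transformer (SV-ViT) to extract multiscale cortical structural features (such as cortical thickness, sulcal depth, and curvature) from a single T1-weighted MRI scan. It improves classification of Alzheimer's disease (AD) versus cognitively normal (CN) subjects over the existing MSSM and conventional biomarkers.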
Details
Motivation: Confirming AD currently relies on costly and invasive PET or CSF testing, so better non-invasive MRI biomarkers are needed.
Method: MSSM+ extends MSSM with vertex-level sulcal depth and cortical curvature; SSVM partitions the cortical surface into supervertices that capture spatial relationships; SV-ViT performs anatomically informed Transformer learning over these supervertices.
Result: MSSM+ identifies more extensive and more significant AD/CN group differences than MSSM; in AD vs. CN classification, the area under the precision-recall curve rises by 3 percentage points; vendor-specific analyses show lower signal variability and more stable classification performance.
Conclusion: MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for early AD detection and can serve as a reliable screening tool ahead of PET/CSF confirmation.
Abstract: Alzheimer's disease (AD) confirmation often relies on positron emission tomography (PET) or cerebrospinal fluid (CSF) analysis, which are costly and invasive. Consequently, structural MRI biomarkers such as cortical thickness (CT) are widely used for non-invasive AD screening. Multiscale structural mapping (MSSM) was recently proposed to integrate gray-white matter contrasts (GWCs) with CT from a single T1-weighted MRI (T1w) scan. Building on this framework, we propose MSSM+, together with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT). 3D T1w images from individuals with AD and cognitively normal (CN) controls were analyzed. MSSM+ extends MSSM by incorporating sulcal depth and cortical curvature at the vertex level. SSVM partitions the cortical surface into supervertices (surface patches) that effectively represent inter- and intra-regional spatial relationships. SV-ViT is a Vision Transformer architecture operating on these supervertices, enabling anatomically informed learning from surface mesh representations. Compared with MSSM, MSSM+ identified more spatially extensive and statistically significant group differences between AD and CN. In AD vs. CN classification, MSSM+ achieved an area under the precision-recall curve 3 percentage points higher than MSSM. Vendor-specific analyses further demonstrated reduced signal variability and consistently improved classification performance across MR manufacturers relative to CT, GWCs, and MSSM.
These findings suggest that MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection prior to CSF/PET confirmation.
[133] Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems
Haileab Yagersew
Main category: cs.CV
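TL;DR: Paza is a zero-shot retail theft detection framework that uses layered model orchestration (lightweight models running continuously, with a vision-language model invoked on demand) to detect concealment at low cost and high precision without custom training. The VLM can be swapped freely, a single GPU can serve 10-20 stores, the cost falls to $50-100 per store per month, and privacy is protected.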
Details
Motivation: Existing AI theft-prevention systems depend on expensive custom model training and high subscription fees ($200-500 per store per month), which keeps them out of reach for small and mid-sized retailers; a training-free, low-cost, privacy-preserving, easy-to-deploy alternative is needed.
Method: A zero-shot layered pipeline: 1) lightweight models (object detection and pose estimation) run continuously; 2) a multi-signal suspicion pre-filter (dwell time plus at least one behavioral signal) triggers the expensive VLM call; 3) the VLM component accepts any OpenAI-compatible endpoint, so models can be swapped without code changes; 4) faces are obfuscated throughout the pipeline to protect privacy.
Result: Zero-shot on the DCSASS dataset: 89.5% precision, 92.8% specificity, and 59.3% recall; the pre-filter cuts VLM invocations by 240x (at most 10 per minute); one GPU can serve 10-20 stores; the cost model puts the price at $50-100 per store per month, 3-10x cheaper than commercial offerings.
Conclusion: Paza shows that zero-shot operation, model orchestration, and smart triggering can substantially lower the deployment barrier, cost, and privacy risk of AI theft-prevention systems without sacrificing the key operational metrics (precision and specificity), making it practical to deploy.
Abstract: Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline - cheap object detection and pose estimation running continuously, with an expensive vision-language model (VLM) invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to at most 10 per minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any OpenAI-compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5-Omni, GPT-4o, or future releases without code changes - ensuring the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall zero-shot. The authors attribute the recall gap to sparse frame sampling in offline evaluation rather than to VLM reasoning failures, and note that precision and specificity are the operationally critical metrics because they determine false alarm rates.
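The pre-filter rule stated above (dwell time plus at least one behavioral signal, with VLM calls capped per minute) can be sketched in a few lines. This is a minimal illustration and not Paza's actual code: the signal names (`hand_near_pocket`, `item_disappeared`, `crouching`), the dwell threshold, and the sliding-window rate limiter are all assumptions chosen for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TrackState:
    """Per-person signals from the cheap always-on models (hypothetical fields)."""
    dwell_seconds: float
    hand_near_pocket: bool = False
    item_disappeared: bool = False
    crouching: bool = False


class SuspicionPreFilter:
    """Gate VLM calls: dwell time AND at least one behavioral signal, rate-limited."""

    def __init__(self, min_dwell_seconds=8.0, max_calls_per_minute=10):
        self.min_dwell_seconds = min_dwell_seconds  # assumed threshold, not from the paper
        self.max_calls_per_minute = max_calls_per_minute
        self._call_times = deque()  # timestamps of VLM calls in the last 60 s

    def should_invoke_vlm(self, state, now):
        # Forget calls that have left the 60-second sliding window.
        while self._call_times and now - self._call_times[0] >= 60.0:
            self._call_times.popleft()
        dwell_ok = state.dwell_seconds >= self.min_dwell_seconds
        signal_ok = state.hand_near_pocket or state.item_disappeared or state.crouching
        if not (dwell_ok and signal_ok):
            return False
        # Enforce the per-minute budget on expensive VLM invocations.
        if len(self._call_times) >= self.max_calls_per_minute:
            return False
        self._call_times.append(now)
        return True
```

Requiring both conditions keeps long-dwelling but otherwise unremarkable shoppers from triggering the expensive model, while the cap bounds worst-case GPU load per store.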
We present a detailed cost model showing viability at $50-100/month per store (3-10x cheaper than commercial alternatives), and introduce a privacy-preserving design that obfuscates faces in the detection pipeline. The source code is available at https://github.com/xHaileab/Paza-AI.
[134] Efficient Search of Implantable Adaptive Cells for Medical Image Segmentation
Emil Benedykciuk,Marcin Denkowski,Grzegorz M. Wójcik
Main category: cs.CV
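TL;DR: This paper proposes IAC-LTH, which uses a Jensen-Shannon-divergence stability criterion to prune low-importance operations early in the differentiable IAC search, greatly accelerating neural architecture search (NAS) for adaptive skip modules. On several medical image segmentation benchmarks it matches or slightly exceeds the original performance while cutting search cost by 3.7x to 16x.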
Details
Motivation: The existing IAC framework narrows the search space but still needs a 200-epoch differentiable search, which is computationally expensive and hard to deploy in real medical settings.
Method: Analyze how the importance of operations and edges evolves during IAC search, observe that key operations stabilize early, and design a Jensen-Shannon-divergence stability criterion that progressively prunes low-importance operations, yielding an accelerated search (IAC-LTH).
Result: On four public datasets (ACDC, BraTS, KiTS, AMOS) and several 2-D U-Net and nnU-Net backbones, the cells found by IAC-LTH match or slightly exceed those from the original IAC search while reducing search time by 3.7x to 16x, with robust and generalizable results.
Conclusion: IAC architectures can be identified early in the search from the stability of operation importance, without running the full search, which makes adaptive skip modules considerably more practical for medical image segmentation.
Abstract: Purpose: Adaptive skip modules can improve medical image segmentation, but searching for them is computationally costly. Implantable Adaptive Cells (IACs) are compact NAS modules inserted into U-Net skip connections, reducing the search space compared with full-network NAS. However, the original IAC framework still requires a 200-epoch differentiable search for each backbone and dataset. Methods: We analyzed the temporal behavior of operations and edges within IAC cells during differentiable search on public medical image segmentation benchmarks. We found that operations selected in the final discrete cell typically emerge among the strongest candidates early in training, and their architecture parameters stabilize well before the final epoch. Based on this, we propose a Jensen-Shannon-divergence-based stability criterion that tracks per-edge operation-importance distributions and progressively prunes low-importance operations during search. The accelerated framework is called IAC-LTH. Results: Across four public benchmarks (ACDC, BraTS, KiTS, AMOS), several 2-D U-Net backbones, and a 2-D nnU-Net pipeline, IAC-LTH discovers IAC cells whose patient-level segmentation performance matches and sometimes slightly exceeds that of cells found by the original full-length search, while reducing wall-clock NAS cost by 3.7x to 16x across datasets and backbones. These results are consistent across architectures, benchmarks, and both non-augmented and augmented training settings, while preserving the gains of IAC-equipped U-Nets over strong attention-based and dense-skip baselines.
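The stability criterion described above can be illustrated with a short sketch. It assumes, hypothetically, that each edge holds one architecture logit per candidate operation, that importance is the softmax of those logits, and that an edge drops its single weakest operation once the Jensen-Shannon divergence between consecutive epochs falls below a tolerance. The threshold `js_tol`, the `keep` floor, and the one-at-a-time pruning rule are illustrative choices, not values from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def prune_if_stable(prev_logits, curr_logits, active, js_tol=1e-3, keep=2):
    """Return the operation indices that stay active on one edge.

    The importance distribution over the active operations is compared across
    two consecutive epochs; if it has stabilized, the weakest operation is dropped.
    """
    idx = sorted(active)
    p = softmax([prev_logits[i] for i in idx])
    q = softmax([curr_logits[i] for i in idx])
    if len(idx) <= keep or js_divergence(p, q) > js_tol:
        return set(idx)  # still changing, or already at the minimum size
    weakest = min(idx, key=lambda i: curr_logits[i])
    return set(idx) - {weakest}
```

Because the Jensen-Shannon divergence is symmetric and bounded, a single tolerance can be applied to every edge regardless of how many candidate operations remain.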
Conclusion: Competitive IAC architectures can be identified from early-stabilizing operations without running the full search, making adaptive skip-module design more practical for medical image segmentation under realistic computational constraints.
[135] MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry
Meng-Xun Li,Wen-Hui Deng,Zhi-Xing Wu,Chun-Xiao Jin,Jia-Min Wu,Yue Han,James Kit Hon Tsoi,Gui-Song Xia,Cui Huang
Main category: cs.CV
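TL;DR: This paper presents MetaDent, a comprehensive vision-language model (VLM) benchmark resource for intraoral photography. It comprises a large-scale dental image dataset, a semi-structured annotation framework, and standardized evaluation tasks (VQA, multi-label classification, image captioning), with high-quality annotations derived using LLMs. Experiments show that current state-of-the-art VLMs still have clear limitations in fine-grained understanding of intraoral scenes.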
Details
Motivation: Applying vision-language models (VLMs) to intraoral photography is limited by the lack of finely annotated datasets and systematic benchmarks, so a dedicated resource is needed to advance clinical image understanding.
Method: Build the MetaDent resource: 1) gather 60,669 dental images from clinical, public, and web sources; 2) design a semi-structured meta-annotation framework that captures both hierarchy and clinical detail (an image-level summary plus free-text descriptions of each abnormality); 3) use large language models (LLMs) to derive about 15K VQA pairs and an 18-class multi-label classification dataset, validated by human review for semantic fidelity; 4) evaluate mainstream VLMs on VQA, classification, and image captioning.
Result: State-of-the-art VLMs show limited fine-grained understanding of intraoral images: VQA and classification accuracy is moderate, and captions are often incomplete or inconsistent. Human review confirms that the LLM-derived annotations are faithful and semantically accurate.
Conclusion: MetaDent fills a key resource gap in dental vision-language research and exposes the bottleneck of current VLMs in fine-grained clinical understanding; the released data, annotations, and tools should support reproducible research and the development of dental AI systems.
Abstract: Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to confirm that the LLM-driven conversion reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks.
Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.
[136] Open-Set Vein Biometric Recognition with Deep Metric Learning
Paweł Pilarek,Marcel Musiałek,Anna Górska
Main category: cs.CV
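TL;DR: This paper proposes a vein recognition method for open-set scenarios. It learns L2-normalised embeddings through deep metric learning and combines prototype matching with a calibrated similarity threshold, achieving high identification accuracy and rejection of unknown users across several vein datasets.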
Details
Motivation: Most existing vein recognition methods use closed-set classification, which scales poorly and cannot enroll new users without retraining the model; this work targets scalability and robustness in the open-set setting.
Method: Deep metric learning (DML) learns discriminative L2-normalised embeddings, combined with prototype matching and a calibrated similarity threshold for open-set recognition. Evaluation follows a strict subject-disjoint protocol on four vein datasets (MMCBNU 6000, UTFVP, FYO, and a dorsal hand-vein dataset), using backbones such as ResNet50-CBAM with triplet loss and a 1-NN classifier.
Result: On MMCBNU 6000 the method reaches an OSCR of 0.9945, AUROC of 0.9974, EER of 1.57%, and Rank-1 identification of 99.6%. Cross-dataset experiments show the model is robust on large-scale data but sensitive to domain shift in low-data regimes. Ablations confirm that triplet loss with 1-NN gives the best trade-off between accuracy and efficiency, supporting real-time deployment on commodity hardware.
Conclusion: The method makes vein recognition considerably more practical and flexible to deploy in open-set scenarios, offering a scalable, adaptive, and lightweight solution for biometric systems.
Abstract: Most state-of-the-art vein recognition methods rely on closed-set classification, which inherently limits their scalability and prevents the adaptive enrollment of new users without complete model retraining. We rigorously evaluate the computational boundaries of Deep Metric Learning (DML) under strict open-set constraints. Unlike standard closed-set approaches, we analyze the impact of data scarcity and domain shift on recognition performance. Our approach learns discriminative L2-normalised embeddings and employs prototype-based matching with a calibrated similarity threshold to effectively distinguish between enrolled users and unseen impostors. We evaluate the framework under a strict subject-disjoint protocol across four diverse datasets covering finger, wrist, and dorsal hand veins (MMCBNU 6000, UTFVP, FYO, and a dorsal hand-vein dataset). On the large-scale MMCBNU 6000 benchmark, our best model (ResNet50-CBAM) achieves an OSCR of 0.9945, AUROC of 0.9974, and EER of 1.57%, maintaining high identification accuracy (99.6% Rank-1) while robustly rejecting unknown subjects. Cross-dataset experiments evaluate the framework's generalisation across different acquisition setups, confirming that while the model handles large-scale data robustly, performance remains sensitive to domain shifts in low-data regimes.
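Prototype-based matching with a similarity threshold, as described above, reduces to a nearest-prototype lookup followed by a reject test. The sketch below is an illustration under stated assumptions rather than the authors' implementation: prototypes are taken to be the re-normalised mean of a subject's enrollment embeddings, similarity is cosine, and the threshold value `0.7` is a placeholder that would in practice be calibrated on held-out data. Embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_prototype(embeddings):
    """Average a subject's normalised enrollment embeddings and re-normalise."""
    unit = [l2_normalize(e) for e in embeddings]
    dim = len(unit[0])
    mean = [sum(e[d] for e in unit) / len(unit) for d in range(dim)]
    return l2_normalize(mean)


def identify(probe, prototypes, threshold=0.7):
    """Return (subject_id, similarity), or (None, similarity) if rejected as unknown."""
    probe = l2_normalize(probe)
    best_id, best_sim = None, -1.0
    for subject_id, proto in prototypes.items():
        # Dot product equals cosine similarity because both vectors are unit-norm.
        sim = sum(a * b for a, b in zip(probe, proto))
        if sim > best_sim:
            best_id, best_sim = subject_id, sim
    if best_sim < threshold:
        return None, best_sim  # open-set rejection: no enrolled subject is close enough
    return best_id, best_sim
```

Enrolling a new user only adds one entry to `prototypes`; the embedding network is untouched, which is the scalability property the abstract contrasts with closed-set classifiers.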
Ablation studies demonstrate that triplet-based objectives combined with a simple 1-NN classifier offer an optimal trade-off between accuracy and efficiency, enabling real-time deployment on commodity hardware.
[137] FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection
Jianchao Huang,Fengming Zhang,Haibo Zhu,Tao Yan
Main category: cs.CV
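TL;DR: This paper proposes FSDETR, a frequency-spatial feature enhancement framework built on RT-DETR that improves small object detection through spatial hierarchical attention, deformable-attention feature interaction, and a frequency-spatial feature pyramid network.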
Details
Motivation: Small object detection suffers from feature degradation caused by downsampling, mutual occlusion in dense clusters, and complex background interference.
Method: The FSDETR framework comprises: 1) a Spatial Hierarchical Attention Block (SHAB) that captures local details and global dependencies; 2) Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) that mitigates dense occlusion; 3) a Frequency-Spatial Feature Pyramid Network (FSFPN) that combines frequency filtering with spatial edge extraction through the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained information.
Result: With only 14.7M parameters, FSDETR reaches 13.9% AP_S on VisDrone 2019 and 48.95% AP_50^tiny on TinyPerson, outperforming existing small object detection methods.
Conclusion: By fusing frequency-domain and spatial-domain information with multi-level attention, FSDETR improves the accuracy and robustness of small object detection.
Abstract: Small object detection remains a significant challenge due to feature degradation from downsampling, mutual occlusion in dense clusters, and complex background interference. To address these issues, this paper proposes FSDETR, a frequency-spatial feature enhancement framework built upon the RT-DETR baseline. By establishing a collaborative modeling mechanism, the method effectively leverages complementary structural information. Specifically, a Spatial Hierarchical Attention Block (SHAB) captures both local details and global dependencies to strengthen semantic representation. Furthermore, to mitigate occlusion in dense scenes, the Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) focuses on informative regions via dynamic sampling. Finally, the Frequency-Spatial Feature Pyramid Network (FSFPN) integrates frequency filtering with spatial edge extraction via the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained details. Experimental results show that with only 14.7M parameters, FSDETR achieves 13.9% AP_S on VisDrone 2019 and 48.95% AP_50^tiny on TinyPerson, showing strong performance on small-object benchmarks. The code and models are available at https://github.com/YT3DVision/FSDETR.
[138] Reward-Aware Trajectory Shaping for Few-step Visual Generation
Rui Li,Bingyu Li,Yuanzhi Liang,HuangHai Bin,Chi Zhang,XueLong Li
Main category: cs.CV
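TL;DR: This paper proposes the Reward-Aware Trajectory Shaping (RATS) framework, which introduces preference alignment awareness and a reward-aware gating mechanism to achieve high-quality generation in a few sampling steps, allowing the student model to potentially surpass the teacher rather than being limited to imitating it.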
Details
Motivation: Existing distillation methods confine the student to imitating a strong multi-step teacher, which makes the teacher an upper bound on performance; this work introduces preference alignment awareness so that the student can optimize directly toward reward and exceed the teacher's generation quality.
Method: The RATS framework: 1) align teacher and student latent trajectories at key denoising stages through horizon matching; 2) introduce a reward-aware gate that adjusts the strength of teacher guidance according to the relative rewards of teacher and student; 3) relax the constraint when the student's reward approaches or exceeds the teacher's, allowing continued improvement.
Result: RATS substantially improves the efficiency-quality trade-off in few-step generation and considerably narrows the gap between few-step students and strong multi-step teachers on visual generation tasks.
Conclusion: By combining trajectory distillation, reward-aware gating, and preference alignment, RATS transfers preference-relevant knowledge efficiently with no extra inference cost, indicating that few-step generators can exceed the teacher ceiling of conventional distillation.
Abstract: Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing preference alignment awareness enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose Reward-Aware Trajectory Shaping (RATS), a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead.
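The gating behaviour described above (stronger trajectory shaping when the teacher's reward is higher, weaker when the student catches up) can be illustrated with a toy scalar gate. The abstract does not give the gate's functional form, so the sigmoid of the reward gap, the `temperature` parameter, and the mean-squared trajectory distance below are all assumptions made for illustration only.

```python
import math


def reward_aware_gate(teacher_reward, student_reward, temperature=0.1):
    """Hypothetical gate in (0, 1): near 1 when the teacher is clearly better,
    near 0 when the student is clearly better, exactly 0.5 at parity."""
    x = (teacher_reward - student_reward) / temperature
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # stable branch for negative x
    return z / (1.0 + z)


def trajectory_distance(student_latents, teacher_latents):
    """Mean squared difference over matched denoising stages (lists of vectors)."""
    total, count = 0.0, 0
    for s, t in zip(student_latents, teacher_latents):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def shaped_loss(student_latents, teacher_latents, teacher_reward, student_reward, reward_loss):
    """Reward objective plus gated trajectory-matching term."""
    gate = reward_aware_gate(teacher_reward, student_reward)
    return reward_loss + gate * trajectory_distance(student_latents, teacher_latents)
```

With this form, the imitation term fades smoothly as the student's reward overtakes the teacher's, so the reward objective dominates and the teacher stops acting as a ceiling.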
Experimental results demonstrate that RATS substantially improves the efficiency-quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.
[139] Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
Victoria Yue Chen,Emery Pierson,Léopold Maillard,Maks Ovsjanikov
Main category: cs.CV
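TL;DR: This paper shows that current state-of-the-art text-to-3D generative models suffer from latent "sink traps" in text-driven inversion: the model becomes insensitive to the text prompt, so semantic editing fails. By analyzing generation trajectories, the authors propose a framework that decouples geometric representation power from linguistic sensitivity, enabling high-fidelity semantic editing of out-of-distribution 3D shapes.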
Details
Motivation: Existing text-driven inversion methods for 3D generative models assume the model stays sensitive to natural language prompts, but this assumption often fails in practice and limits text-guided editing.
Method: Analyze the sampling trajectories of the generative model to identify and verify the latent "sink trap" phenomenon; propose a framework that uses the model's unconditional prior to produce complex geometry and decouples its geometric expressivity from its text sensitivity.
Result: State-of-the-art text-to-3D models are found to be completely insensitive to text modifications in certain latent regions, while their geometric generation ability remains strong; the proposed method bypasses the sink traps and achieves robust, high-fidelity semantic editing of out-of-distribution 3D shapes.
Conclusion: The bottleneck in text-guided 3D editing is not the model's geometric expressivity but the fragility of its text-to-latent mapping; decoupling geometric and linguistic modeling is key to more robust text-driven 3D manipulation.
Abstract: Text-driven inversion of generative models is a core paradigm for manipulating 2D or 3D content, unlocking numerous applications such as text-based editing, style transfer, or inverse problems. However, it relies on the assumption that generative models remain sensitive to natural language prompts. We demonstrate that for state-of-the-art native text-to-3D generative models, this assumption often collapses. We identify a critical failure mode where generation trajectories are drawn into latent "sink traps": regions where the model becomes insensitive to prompt modifications. In these regimes, changes to the input text fail to alter internal representations in a way that alters the output geometry. Crucially, we observe that this is not a limitation of the model's geometric expressivity; the same generative models possess the ability to produce a vast diversity of shapes but, as we demonstrate, become insensitive to out-of-distribution text guidance. We investigate this behavior by analyzing the sampling trajectories of the generative model, and find that complex geometries can still be represented and produced by leveraging the model's unconditional generative prior. This leads to a more robust framework for text-based 3D shape editing that bypasses latent sinks by decoupling a model's geometric representation power from its linguistic sensitivity. Our approach addresses the limitations of current 3D pipelines and enables high-fidelity semantic manipulation of out-of-distribution 3D shapes.
Project webpage: https://daidedou.sorpi.fr/publication/beyondprompts
[140] Hybrid Latents -- Geometry-Appearance-Aware Surfel Splatting
Neel Kelkar,Simon Niedermayr,Klaus Engel,Rüdiger Westermann
Main category: cs.CV
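TL;DR: This paper proposes a hybrid Gaussian-hash-grid radiance representation for reconstructing 2D Gaussian scene models from multi-view images. Through explicit frequency decomposition, hard opacity falloffs, and probabilistic pruning, it improves geometric reconstruction accuracy and rendering efficiency while using far fewer Gaussian primitives.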
Details
Motivation: To address the entanglement of geometry and appearance in NeRF-style models, reduce the tendency of high-frequency texture to compensate for geometric errors, and improve reconstruction fidelity and rendering efficiency.
Method: Introduce a hybrid Gaussian-hash-grid radiance representation; add per-Gaussian latent features alongside hash-grid features to separate low- and high-frequency components; use hard opacity falloffs to strengthen the decoupling of geometry and appearance; combine probabilistic pruning with a sparsity-inducing BCE opacity loss to remove redundant Gaussians.
Result: On synthetic and real-world datasets, the method outperforms the state of the art in Gaussian-based novel-view synthesis, with higher reconstruction fidelity and only about one tenth as many Gaussian primitives.
Conclusion: Explicit frequency decomposition and structured regularization yield a more compact, more accurate, and more efficient Gaussian scene representation.
Abstract: We introduce a hybrid Gaussian-hash-grid radiance representation for reconstructing 2D Gaussian scene models from multi-view images. Similar to NeST splatting, our approach reduces the entanglement between geometry and appearance common in NeRF-based models, but adds per-Gaussian latent features alongside hash-grid features to bias the optimizer toward a separation of low- and high-frequency scene components. This explicit frequency-based decomposition reduces the tendency of high-frequency texture to compensate for geometric errors. Encouraging Gaussians with hard opacity falloffs further strengthens the separation between geometry and appearance, improving both geometry reconstruction and rendering efficiency. Finally, probabilistic pruning combined with a sparsity-inducing BCE opacity loss allows redundant Gaussians to be turned off, yielding a minimal set of Gaussians sufficient to represent the scene. Using both synthetic and real-world datasets, we compare against the state of the art in Gaussian-based novel-view synthesis and demonstrate superior reconstruction fidelity with an order of magnitude fewer primitives.
[141] Generative Data Augmentation for Skeleton Action Recognition
Xu Dong,Wanqing Li,Anthony Adeyemi-Ejeye,Andrew Gilbert
Main category: cs.CV
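TL;DR: This paper proposes a conditional-generation data augmentation method for skeleton action recognition. Using a Transformer encoder-decoder architecture and a generative refinement module, it improves recognition accuracy in both few-shot and full-data settings.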
Details
Motivation: Collecting large-scale, diverse, well-annotated 3D skeleton datasets is expensive and time-consuming, which holds back skeleton action recognition.
Method: A conditional generative pipeline that combines a Transformer encoder-decoder architecture, a generative refinement module, and a dropout mechanism to learn the distribution of real skeleton sequences under action-label constraints and synthesize diverse, high-fidelity data.
Result: Experiments on HumanAct12 and NTU-VIBE show that the method improves the accuracy of multiple skeleton action recognition models, with especially strong results in low-data scenarios.
Conclusion: The conditional generative method effectively alleviates data scarcity, generalizes well, and suits downstream tasks in both few-shot and full-data settings.
Abstract: Skeleton-based human action recognition is a powerful approach for understanding human behaviour from pose data, but collecting large-scale, diverse, and well-annotated 3D skeleton datasets is both expensive and labor-intensive. To address this challenge, we propose a conditional generative pipeline for data augmentation in skeleton action recognition. Our method learns the distribution of real skeleton sequences under the constraint of action labels, enabling the synthesis of diverse and high-fidelity data. Even with limited training samples, it can effectively generate skeleton sequences and achieve competitive recognition performance in low-data scenarios, demonstrating strong generalisation in downstream tasks. Specifically, we introduce a Transformer-based encoder-decoder architecture, combined with a generative refinement module and a dropout mechanism, to balance fidelity and diversity during sampling. Experiments on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset show that our approach consistently improves the accuracy of multiple skeleton-based action recognition models, validating its effectiveness in both few-shot and full-data settings. The source code can be found here.
[142] RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
Gabriele Mattioli,Evelyn Turri,Sara Sarto,Lorenzo Baraldi,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Main category: cs.CV
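TL;DR: This paper proposes RaTA-Tool, a framework that converts multimodal user queries into structured task descriptions and retrieves a suitable tool by semantic matching, enabling open-world multimodal tool selection with zero-shot extension to new tools and preference optimization.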
Details
Motivation: Existing tool-learning methods are limited to text-only inputs and closed-world settings, so they struggle with multimodal instructions and cannot generalize to tools unseen during training.
Method: The RaTA-Tool framework: 1) an MLLM converts the multimodal query into a structured task description; 2) the best-fitting tool is retrieved by semantic matching against a library of machine-readable, standardized tool descriptions; 3) DPO-based preference optimization improves task-tool alignment; 4) the first open-world multimodal tool-use dataset is built from Hugging Face model cards.
Result: Tool-selection performance improves significantly in open-world, multimodal scenarios, and new tools can be added without retraining.
Conclusion: RaTA-Tool offers an extensible and generalizable approach to open-world tool learning for multimodal foundation models.
Abstract: Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards.
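The retrieval step described above, matching a task description against tool descriptions rather than predicting a fixed tool ID, can be illustrated with a deliberately simple stand-in. The sketch below uses bag-of-words cosine similarity in place of whatever learned text representation the authors use, and the tool names and descriptions are invented for the example; only the overall shape (describe the task, score every tool description, take the best) follows the abstract.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)  # Counter returns 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_tool(task_description, tool_descriptions):
    """Return the name of the tool whose description best matches the task."""
    query = bag_of_words(task_description)
    return max(
        tool_descriptions,
        key=lambda name: cosine(query, bag_of_words(tool_descriptions[name])),
    )


# Hypothetical tool library; in the paper, descriptions come from Hugging Face model cards.
tools = {
    "image-captioner": "generate a text caption describing an image",
    "speech-recognizer": "transcribe spoken audio into text",
}
```

Because selection depends only on the description text, registering a new tool is a dictionary insert: `tools["depth-estimator"] = "estimate a depth map from an image"` makes it retrievable immediately, with no retraining.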
Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
[143] Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation
Hassan Ali,Doreen Jirak,Luca Müller,Stefan Wermter
Main category: cs.CV
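TL;DR: This paper proposes a prompt-based video generation method that uses image-to-video foundation models to produce realistic, semantically rich synthetic data of deictic gestures, and validates its usefulness for downstream tasks.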
Details
Motivation: Gesture recognition research has long been constrained by scarce real data: human recordings are costly, and existing image processing approaches cannot produce realistic, varied gestures. Emerging image-to-video generation models open a new route to zero-shot, low-cost synthesis of high-quality gesture data.
Method: Build a prompt-driven video generation pipeline that starts from a small number of reference samples from human participants, generate a realistic and diverse dataset of deictic gesture videos, and mix it with real data to train and evaluate several deep learning models.
Result: The synthetic gestures are close to real data in visual fidelity and add meaningful variability and novelty; mixing synthetic and real data improves downstream model performance.
Conclusion: Even at this early stage, image-to-video generation shows strong potential as a zero-shot gesture synthesis tool that complements real data and improves downstream tasks.
Abstract: Gesture recognition research, unlike NLP, continues to face acute data scarcity, with progress constrained by the need for costly human recordings or image processing approaches that cannot generate authentic variability in the gestures themselves. Recent advancements in image-to-video foundation models have enabled the generation of photorealistic, semantically rich videos guided by natural language. These capabilities open up new possibilities for creating effort-free synthetic data, raising the critical question of whether video Generative AI models can augment and complement traditional human-generated gesture data. In this paper, we introduce and analyze prompt-based video generation to construct a realistic deictic gestures dataset and rigorously evaluate its effectiveness for downstream tasks. We propose a data generation pipeline that produces deictic gestures from a small number of reference samples collected from human participants, providing an accessible approach that can be leveraged both within and beyond the machine learning community. Our results demonstrate that the synthetic gestures not only align closely with real ones in terms of visual fidelity but also introduce meaningful variability and novelty that enrich the original data, further supported by superior performance of various deep models using a mixed dataset. These findings highlight that image-to-video techniques, even in their early stages, offer a powerful zero-shot approach to gesture synthesis with clear benefits for downstream tasks.
[144] Frequency-Enhanced Dual-Subspace Networks for Few-Shot Fine-Grained Image Classification
Meijia Wang,Guochao Wang,Haozhen Chu,Bin Yao,Weichuan Zhang,Yuan Wang,Junpo Yang
Main category: cs.CV
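TL;DR: This paper proposes FEDSNet, which uses frequency-domain enhancement and dual-subspace modeling to address the sensitivity of spatial features to texture noise and their structural instability in few-shot fine-grained image classification. It extracts structural information with DCT low-pass filtering, builds spatial and frequency low-rank subspaces with SVD, and fuses their distances through a gating mechanism, improving accuracy and robustness.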
Details
Motivation: Existing metric-learning methods rely only on spatial-domain features, which are vulnerable to texture bias and high-frequency background noise, and they lack cross-view geometric constraints, leading to structural instability and severe overfitting in few-shot settings.
Method: The Frequency-Enhanced Dual-Subspace Network (FEDSNet): 1) DCT plus low-pass filtering isolates low-frequency global structural features; 2) truncated SVD builds independent low-rank subspaces for spatial texture and frequency structure; 3) an adaptive gating mechanism dynamically fuses the projection distances from the two subspaces.
Result: Highly competitive classification performance on four benchmarks (CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft), combining high accuracy with computational efficiency.
Conclusion: By adding a frequency-domain view and modeling two subspaces jointly, FEDSNet mitigates the structural instability and noise sensitivity of spatial features in few-shot fine-grained visual recognition.
Abstract: Few-shot fine-grained image classification aims to recognize subcategories with high visual similarity using only a limited number of annotated samples. Existing metric learning-based methods typically rely solely on spatial domain features. Confined to this single perspective, models inevitably suffer from inherent texture biases, entangling essential structural details with high-frequency background noise. Furthermore, lacking cross-view geometric constraints, single-view metrics tend to overfit this noise, resulting in structural instability under few-shot conditions. To address these issues, this paper proposes the Frequency-Enhanced Dual-Subspace Network (FEDSNet). Specifically, FEDSNet utilizes the Discrete Cosine Transform (DCT) and a low-pass filtering mechanism to explicitly isolate low-frequency global structural components from spatial features, thereby suppressing background interference. Truncated Singular Value Decomposition (SVD) is employed to construct independent, low-rank linear subspaces for both spatial texture and frequency structural features. An adaptive gating mechanism is designed to dynamically fuse the projection distances from these dual views. This strategy leverages the structural stability of the frequency subspace to prevent the spatial subspace from overfitting to background features.
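The DCT low-pass step mentioned above can be shown on a one-dimensional signal. This sketch covers only that single step, in 1-D and pure Python for brevity; FEDSNet applies it to feature maps, and the cutoff index used here is an arbitrary illustrative parameter. The transform is the orthonormal DCT-II and its inverse, so keeping every coefficient reproduces the input.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)))
    return out


def idct_ii(coeffs):
    """Inverse of the orthonormal DCT-II (an orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        value = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            value += scale * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(value)
    return out


def low_pass(x, cutoff):
    """Keep the first `cutoff` DCT coefficients and zero the rest."""
    coeffs = dct_ii(x)
    kept = [c if k < cutoff else 0.0 for k, c in enumerate(coeffs)]
    return idct_ii(kept)
```

For example, `low_pass([1.0, -1.0, 1.0, -1.0], cutoff=1)` returns values close to zero because the alternating pattern has no low-frequency content, while a constant signal passes through unchanged. This is the sense in which the filter keeps global structure and suppresses fine texture.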
Extensive experiments on four benchmark datasets - CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft - demonstrate that FEDSNet exhibits excellent classification performance and robustness, achieving highly competitive results compared to existing metric learning algorithms. Complexity analysis further confirms that the proposed network achieves a favorable balance between high accuracy and computational efficiency, providing an effective new paradigm for few-shot fine-grained visual recognition.
[145] UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
Jun Wang,Shuo Tan,Zelong Sun,Tiancheng Gu,Yongle Zhao,Ziyong Feng,Kaicheng Yang,Cewu Lu
Main category: cs.CV
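TL;DR: This paper proposes UniDoc-RL, a unified reinforcement learning framework in which a large vision-language model jointly performs retrieval, reranking, active visual perception, and reasoning. A hierarchical action space and a multi-reward scheme substantially improve performance on complex visual reasoning.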
Details
Motivation: Existing visual RAG systems rely on generic retrieval signals and overlook fine-grained visual semantics, which makes them ill-suited to complex reasoning tasks.
Method: UniDoc-RL models visual information acquisition as a sequential decision problem with a hierarchical action space covering document retrieval, image reranking, and active region cropping. It is trained end to end with GRPO using a dense multi-reward scheme, supported by a curated dataset of high-quality reasoning trajectories with fine-grained action annotations.
Result: Consistently surpasses state-of-the-art methods on three benchmarks, with gains of up to 17.7% over prior RL-based methods.
Conclusion: UniDoc-RL shows that jointly optimizing retrieval and active perception improves the visual reasoning ability of LVLMs.
Abstract: Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
[146] Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
Yuzhuo Chen,Zehua Ma,Han Fang,Hengyi Wang,Guanjie Wang,Weiming Zhang
Main category: cs.CV
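TL;DR: This paper proposes Flow of Truth, the first temporal forensics framework for image-to-video generation; it tracks generation traces by modeling how pixels move over time, enabling temporal forgery detection that generalizes across models.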
Details
Motivation: Rapid progress in image-to-video (I2V) generation introduces new deepfake risks; traditional 2D pixel-level forensics designed for static images cannot handle pixels that flow, deform, and drift over time in video, so proactive forensics along the temporal dimension is needed. Method: Redefines video generation as 'the motion of pixels through time' rather than 'the synthesis of frames'; on this basis it designs a learnable forensic template and a template-guided optical flow module that decouples pixel motion from image content, enabling robust tracking of generation traces along the time axis. Result: Flow of Truth generalizes well across multiple commercial and open-source I2V models and substantially improves temporal forensics performance, supporting the effectiveness of modeling dynamic generation traces. Conclusion: The work extends digital forensics from the spatial domain to the temporal domain and offers the first systematic, proactive solution for verifying the authenticity of I2V content. Abstract: The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present **Flow of Truth**, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as *the motion of pixels through time rather than the synthesis of frames*. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.
[147] Quality-Aware Calibration for AI-Generated Image Detection in the Wild
Fabrizio Guillaro,Vincenzo De Rosa,Davide Cozzolino,Luisa Verdoliva
Main category: cs.CV
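TL;DR: This paper proposes QuAD, a framework that improves synthetic image detection through quality-aware fusion of near-duplicate images, addressing the inconsistent predictions that single-image detection produces when image quality degrades during real-world dissemination.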
Details
Motivation: Most existing synthetic image detectors operate on a single image and overlook a key property of real-world web dissemination: the same image exists in multiple quality-degraded near-duplicate versions, which leads to inconsistent detection results. Method: Proposes QuAD (Quality-Aware calibration with near-Duplicates): for a query image, retrieve its online near-duplicates, feed them to a detector to obtain a score for each version, and aggregate the scores weighted by estimated quality; two new datasets, AncesTree (simulated degradation trees) and ReWIND (real-world web near-duplicates), support large-scale evaluation. Result: Validated on several state-of-the-art detectors, QuAD's quality-aware fusion improves balanced accuracy by about 8% over plain averaging and makes detection of AI-generated content more robust and reliable in realistic settings. Conclusion: Jointly processing all online near-duplicates of an image while accounting for their quality differences is a key route to practical synthetic image detection; QuAD offers a scalable, reproducible paradigm for real-world AI content forensics. Abstract: Significant progress has been made in detecting synthetic images; however, most existing approaches operate on a single image instance and overlook a key characteristic of real-world dissemination: as viral images circulate on the web, multiple near-duplicate versions appear and lose quality due to repeated operations like recompression, resizing and cropping. As a consequence, the same image may yield inconsistent forensic predictions depending on which version is analyzed. In this work, to address this issue we propose QuAD (Quality-Aware calibration with near-Duplicates), a novel framework that makes decisions based on all available near-duplicates of the same image. Given a query, we retrieve its online near-duplicates and feed them to a detector: the resulting scores are then aggregated based on the estimated quality of the corresponding instance. By doing so, we take advantage of all pieces of information while accounting for the reduced reliability of images impaired by multiple processing steps. To support large-scale evaluation, we introduce two datasets: AncesTree, an in-lab dataset of 136k images organized in stochastic degradation trees that simulate online reposting dynamics, and ReWIND, a real-world dataset of nearly 10k near-duplicate images collected from viral web content. Experiments on several state-of-the-art detectors show that our quality-aware fusion improves their performance consistently, with an average gain of around 8% in terms of balanced accuracy compared to plain averaging.
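The aggregation step described above can be sketched as a quality-weighted average of per-version detector scores. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice to use normalized quality estimates directly as linear weights, and the example numbers are all hypothetical.

```python
from typing import Sequence


def plain_average(scores: Sequence[float]) -> float:
    """Baseline fusion: every near-duplicate counts equally."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


def quality_weighted_score(scores: Sequence[float],
                           qualities: Sequence[float]) -> float:
    """Fuse detector scores, weighting each near-duplicate by estimated quality.

    Assumed weighting rule (not taken from the paper): weights are the
    non-negative quality estimates, normalized to sum to 1.
    """
    if len(scores) != len(qualities) or not scores:
        raise ValueError("scores and qualities must be non-empty and equal length")
    if any(q < 0 for q in qualities):
        raise ValueError("quality estimates must be non-negative")
    total = sum(qualities)
    if total == 0:
        # No usable quality information: fall back to the plain average.
        return plain_average(scores)
    return sum(s * q for s, q in zip(scores, qualities)) / total


# Illustrative numbers only: three versions of one image, from a clean copy
# to a heavily recompressed one.
scores = [0.9, 0.6, 0.2]      # hypothetical detector outputs
qualities = [1.0, 0.5, 0.1]   # hypothetical quality estimates
fused = quality_weighted_score(scores, qualities)
baseline = plain_average(scores)
```

In this toy case the fused score leans toward the high-quality copy, whereas the plain average lets the degraded copies pull the decision down equally.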
Our results highlight the importance of jointly processing all the images available online to achieve reliable detection of AI-generated content in real-world applications. Code and data are publicly available at https://grip-unina.github.io/QuAD/
[148] Implicit Neural Representations: A Signal Processing Perspective
Dhananjaya Jayasundara,Vishal M. Patel
Main category: cs.CV
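TL;DR: This article surveys the development of implicit neural representations (INRs) from a signal processing perspective, focusing on spectral behavior, sampling theory, and multiscale representation. It examines how network designs (such as periodic or localized activations and hash grid encodings) improve approximation power and efficiency, discusses applications in medical and radar imaging, compression, and 3D scene representation, and identifies open problems in theoretical stability, weight-space interpretability, and large-scale generalization.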
Details
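Motivation: Modeling signals as discrete samples has limitations; INRs offer a unified framework by modeling signals as continuous functions and support operations such as analytic differentiation. A signal processing view is needed to understand their spectral behavior, approximation mechanisms, and practical applicability systematically. Method: Using signal processing theory as the organizing thread, the article analyzes the spectral bias of INRs (such as their preference for low frequencies), their sampling properties, and their multiscale representation ability, and compares how different network designs (coordinate networks; periodic, localized, and adaptive activations; hash grids; hierarchical decompositions) reshape the approximation space. Result: Clarifies that INRs are, in essence, data-adaptive learnable signal models; shows the key role of structural design (activation functions, encodings) in spectral control and computational efficiency; and reviews successful applications in inverse problems, compression, and 3D representation. Conclusion: INRs are not merely a parameterization trick but a new signal modeling paradigm; future work should strengthen the theoretical foundations (such as stability), improve interpretability, and address large-scale generalization. Abstract: Implicit neural representations (INRs) mark a fundamental shift in signal modeling, moving from discrete sampled data to continuous functional representations. By parameterizing signals as neural networks, INRs provide a unified framework for representing images, audio, video, 3D geometry, and beyond as continuous functions of their coordinates. This functional viewpoint enables signal operations such as differentiation to be carried out analytically through automatic differentiation rather than through discrete approximations. In this article, we examine the evolution of INRs from a signal processing perspective, emphasizing spectral behavior, sampling theory, and multiscale representation. We trace the progression from standard coordinate-based networks, which exhibit a spectral bias toward low-frequency components, to more advanced designs that reshape the approximation space through specialized activations, including periodic, localized, and adaptive functions. We also discuss structured representations, such as hierarchical decompositions and hash grid encodings, that improve spatial adaptivity and computational efficiency. We further highlight the utility of INRs across a broad range of applications, including inverse problems in medical and radar imaging, compression, and 3D scene representation.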
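To make the point about analytic differentiation concrete, the sketch below represents a 1D signal as a small sum of sinusoids of the coordinate (a stand-in for a single sine-activated layer) and compares its closed-form derivative with a finite-difference approximation. The class name `SineINR` and the fixed parameters are illustrative assumptions; a real INR would learn its weights and obtain derivatives through automatic differentiation.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SineINR:
    """Toy continuous signal model: f(x) = sum_i a_i * sin(w_i * x + p_i)."""
    amplitudes: List[float]
    frequencies: List[float]
    phases: List[float]

    def _terms(self):
        return zip(self.amplitudes, self.frequencies, self.phases)

    def __call__(self, x: float) -> float:
        # Evaluate the signal at any real-valued coordinate, not only on a grid.
        return sum(a * math.sin(w * x + p) for a, w, p in self._terms())

    def derivative(self, x: float) -> float:
        # Closed-form derivative: d/dx sin(w x + p) = w cos(w x + p).
        return sum(a * w * math.cos(w * x + p) for a, w, p in self._terms())


def finite_difference(f: Callable[[float], float], x: float, h: float = 1e-3) -> float:
    """Central-difference approximation, the discrete alternative."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Hypothetical parameters: one low-frequency and one high-frequency component.
inr = SineINR(amplitudes=[1.0, 0.5], frequencies=[1.0, 8.0], phases=[0.0, 0.3])
exact = inr.derivative(0.7)
approx = finite_difference(inr, 0.7)
```

The exact derivative needs no step size, while the finite-difference error grows with the frequency content of the signal, which is one reason the functional viewpoint is attractive for high-frequency signals.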
By interpreting INRs as learned signal models whose approximation spaces adapt to the underlying data, this article clarifies the field's core conceptual developments and outlines open challenges in theoretical stability, weight space interpretability, and large scale generalization.
[149] Attention-Gated Convolutional Networks for Scanner-Agnostic Quality Assessment
Chinmay Bakhale,Anil Sao
Main category: cs.CV
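TL;DR: This paper proposes a hybrid framework combining a CNN with an attention mechanism for robust, site-invariant MRI quality assessment; it identifies motion artifacts effectively and performs well on both seen and unseen sites.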
Details
Motivation: Motion artifacts seriously degrade structural MRI quality, and manual quality control does not scale to large longitudinal studies, so an automatic, robust quality assessment method that generalizes across sites is needed. Method: Proposes a hybrid CNN-Attention framework: a hierarchical 2D CNN encoder extracts local spatial features, and a multi-head cross-attention mechanism models global dependencies to focus on motion-related artifacts (such as ringing and blurring) while suppressing site-specific intensity variations and background noise; the model is trained end-to-end on the MR-ART dataset (200 subjects). Result: On seen sites (a held-out MR-ART partition) it reaches a scan-level accuracy of 0.9920 and an F1-score of 0.9919; on unseen multi-center ABIDE data (17 heterogeneous sites, 200 subjects) it reaches an accuracy of 0.755 without fine-tuning, showing strong transfer across domains. Conclusion: Attention-driven feature re-weighting captures general artifact representations and narrows the performance gap across imaging environments and scanner manufacturers, offering a reliable automated option for large-scale MRI quality control. Abstract: Motion artifacts present a significant challenge in structural MRI (sMRI), often compromising clinical diagnostics and large-scale automated analysis. While manual quality control (QC) remains the gold standard, it is increasingly unscalable for massive longitudinal studies. To address this, we propose a hybrid CNN-Attention framework designed for robust, site-invariant MRI quality assessment. Our architecture integrates a hierarchical 2D CNN encoder for local spatial feature extraction with a multi-head cross-attention mechanism to model global dependencies. This synergy enables the model to prioritize motion-relevant artifact signatures, such as ringing and blurring, while dynamically filtering out site-specific intensity variations and background noise. The framework was trained end-to-end on the MR-ART dataset using a balanced cohort of 200 subjects. Performance was evaluated across two tiers: Seen Site Evaluation on a held-out MR-ART partition and Unseen Site Evaluation using 200 subjects from 17 heterogeneous sites in the ABIDE archive. On seen sites, the model achieved a scan-level accuracy of 0.9920 and an F1-score of 0.9919. Crucially, it maintained strong generalization across unseen ABIDE sites (Acc = 0.755) without any retraining or fine-tuning, demonstrating high resilience to domain shift.
These results indicate that attention-based feature re-weighting successfully captures universal artifact descriptors, bridging the performance gap between diverse imaging environments and scanner manufacturers.
[150] Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection
Yangchen Zeng,Zhenyu Yu,Dongming Jiang,Wenbo Zhang,Yifan Hong,Zhanhua Hu,Jiao Luo,Kangning Cui
Main category: cs.CV
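TL;DR: This paper proposes the HELP framework, which uses Heatmap-guided Positional Embedding (HPE) to suppress background noise and strengthen foreground positional information, combined with gradient-based mask filtering and Linear-Snake Convolution to improve small-object detection; it maintains accuracy while cutting parameters by 59.4% and decoder layers from 8 to 3.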
Details
Motivation: Transformer-based detectors remain inefficient in small-object detection and suffer from background-induced query noise, so they depend on deep decoders to refine low-quality queries. Method: Proposes the Heatmap-guided Embedding Learning Paradigm (HELP), whose core is Heatmap-guided Positional Embedding (HPE): heatmap-aware positional encoding is injected in the encoder, and a gradient-based mask filters out background-dominant embeddings before decoding; Linear-Snake Convolution is introduced to alleviate the feature sparsity of small targets; heatmap supervision is used only during training and adds no inference overhead. Result: Decoder layers are reduced from 8 to 3 and parameters by 59.4% (66.3M vs. 163M), with consistent accuracy gains across benchmarks under a reduced compute budget. Conclusion: HELP is a noise-aware positional-semantic fusion framework that improves the efficiency and robustness of small-object detection through an interpretable HPE mechanism and a lightweight design. Abstract: Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. Code Repository: https://github.com/yidimopozhibai/Noise-Suppressed-Query-Retrieval
[151] Building Extraction from Remote Sensing Imagery under Hazy and Low-light Conditions: Benchmark and Baseline
Feifei Sang,Wei Lu,Hongruixuan Chen,Sibao Chen,Bin Luo
Main category: cs.CV
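TL;DR: This paper proposes the HaLoBuilding benchmark and the HaLoBuild-Net model to address building extraction from remote sensing imagery under hazy and low-light conditions, using several cooperating modules to improve robustness and accuracy.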
Details
Motivation: Existing optical remote sensing methods for building extraction degrade under real-world haze and low light, and no dedicated benchmark exists; SAR works in all weather but suffers from geometric distortions. Method: Builds HaLoBuilding, the first optical remote sensing benchmark for building extraction under hazy and low-light conditions, and proposes the end-to-end network HaLoBuild-Net, which comprises a Spatial-Frequency Focus Module (SFFM), a Global Multi-scale Guidance Module (GMGM), and a Mutual-Guided Fusion Module (MGFM). Result: HaLoBuild-Net significantly outperforms state-of-the-art methods and conventional cascaded restoration-segmentation pipelines on HaLoBuilding, while generalizing well to the general-purpose WHU, INRIA, and LoveDA datasets. Conclusion: The proposed benchmark and method mitigate meteorological interference and improve the accuracy and robustness of building extraction in adverse weather, moving remote sensing interpretation closer to practical use. Abstract: Building extraction from optical Remote Sensing (RS) imagery suffers from performance degradation under real-world hazy and low-light conditions. However, existing optical methods and benchmarks focus primarily on ideal clear-weather conditions. While SAR offers all-weather sensing, its side-looking geometry causes geometric distortions. To address these challenges, we introduce HaLoBuilding, the first optical benchmark specifically designed for building extraction under hazy and low-light conditions. By leveraging a same-scene multitemporal pairing strategy, we ensure pixel-level label alignment and high fidelity even under extreme degradation. Building upon this benchmark, we propose HaLoBuild-Net, a novel end-to-end framework for building extraction in adverse RS scenarios. At its core, we develop a Spatial-Frequency Focus Module (SFFM) to effectively mitigate meteorological interference on building features by coupling large receptive field attention with frequency-aware channel reweighting guided by stable low-frequency anchors. Additionally, a Global Multi-scale Guidance Module (GMGM) provides global semantic constraints to anchor building topologies, while a Mutual-Guided Fusion Module (MGFM) implements bidirectional semantic-spatial calibration to suppress shallow noise and sharpen weather-induced blurred boundaries. Extensive experiments demonstrate that HaLoBuild-Net significantly outperforms state-of-the-art methods and conventional cascaded restoration-segmentation paradigms on the HaLoBuilding dataset, while maintaining robust generalization on WHU, INRIA, and LoveDA datasets.
The source code and datasets are publicly available at: https://github.com/AeroVILab-AHU/HaLoBuilding.
[152] Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID
Jiaxuan Li,Xin Wen,Zhihang Li
Main category: cs.CV
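TL;DR: This paper proposes STFER, a framework that uses a large vision-language model (LVLM) to generate identity-consistent semantic text, improving the robustness and generalization of Any-Time person re-identification (AT-ReID) under cross-modality (RGB/IR) and clothing-change conditions.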
Details
Motivation: Existing methods rely heavily on purely visual features that are sensitive to ambient illumination and clothing changes, so performance drops sharply in any-time scenarios such as day-night modality shifts and short-term to long-term clothing changes. Method: Proposes the Semantic-driven Token Filtering and Expert Routing (STFER) framework: (1) instructions guide an LVLM to generate identity-intrinsic semantic text that describes biometric constants; (2) this text drives Semantic-driven Visual Token Filtering (SVTF), which enhances key regions and suppresses background noise; (3) the text is integrated into Semantic-driven Expert Routing (SER) for more robust multi-scenario gating. Result: Achieves state-of-the-art results on the AT-USTC dataset; when transferred to five mainstream ReID benchmarks it remains highly competitive, indicating strong generalization. Conclusion: Semantic text can serve as a stable identity representation that decouples identity from visual variation; through text-guided visual filtering and expert routing, STFER markedly improves the robustness and generalization of AT-ReID in complex real-world scenarios. Abstract: Any-Time Person Re-identification (AT-ReID) necessitates the robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on purely visual features, which are prone to change with environmental and temporal factors, resulting in significant performance deterioration in scenarios involving illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants for semantic-driven modeling. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results.
Moreover, the model trained on AT-USTC was evaluated across 5 widely-used ReID benchmarks demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.
[153] Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography
Simon Böhi,Irene Cannistraci,Sergio Muñoz Gonzalez,Moritz Vandenhirtz,Sonia Laguna,Samuel Ruiperez-Campillo,Max Krähenmann,Andrea Agostini,Ece Ozkan,Thomas M. Sutter,Julia E. Vogt
Main category: cs.CV
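TL;DR: This paper proposes the Latent Attention Masked Autoencoder (LAMAE), a model designed for the multi-view, sparse, heterogeneous spatiotemporal data of echocardiography. By introducing cross-frame and cross-view attention in latent space, it builds a holistic representation of cardiac function. Pretrained on the real-world clinical dataset MIMIC-IV-ECHO, it gives the first ICD-10 code predictions from this dataset and transfers well across age groups (adult to pediatric).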
Details
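Motivation: Existing masked autoencoder (MAE) methods usually process images or short clips independently and struggle to model the inherent multi-view structure of echocardiography; because this modality is sparse, heterogeneous, and multi-view in clinical practice, a foundation model that fuses information across views is needed. Method: Proposes LAMAE, which adds a latent attention module in latent space to the standard MAE so that information can be exchanged directly across frames and views; it is pretrained with self-supervision on the large, uncurated clinical dataset MIMIC-IV-ECHO and evaluated on ICD-10 code prediction and on cross-population (adult to pediatric) transfer. Result: Gives the first ICD-10 code prediction results from MIMIC-IV-ECHO videos; representations pretrained on adult data transfer effectively to pediatric cohorts; and incorporating multi-view structural priors (such as latent attention) is shown to improve the robustness and transferability of the representations. Conclusion: By modeling multi-view spatiotemporal dependencies, LAMAE offers an architecture for medical imaging foundation models that better matches the characteristics of clinical data; explicitly modeling structural priors is a key route to better and more generalizable medical visual representations. Abstract: Echocardiography is a widely used modality for cardiac assessment due to its non-invasive and cost-effective nature, but the sparse and heterogeneous spatiotemporal views of the heart pose distinct challenges. Existing masked autoencoder (MAE) approaches typically process images or short clips independently, failing to capture the inherent multi-view structure required for coherent cardiac representation. We introduce Latent Attention Masked Autoencoder (LAMAE), a foundation model architecture tailored to the multi-view nature of medical imaging. LAMAE augments the standard MAE with a latent attention module that enables information exchange across frames and views directly in latent space. This allows the model to aggregate variable-length sequences and distinct views, reconstructing a holistic representation of cardiac function from partial observations. We pretrain LAMAE on MIMIC-IV-ECHO, a large-scale, uncurated dataset reflecting real-world clinical variability. To the best of our knowledge, we present the first results for predicting ICD-10 codes from MIMIC-IV-ECHO videos. Furthermore, we empirically demonstrate that representations learned from adult data transfer effectively to pediatric cohorts despite substantial anatomical differences.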
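The idea of exchanging information across frames and views in latent space, and then aggregating a variable number of inputs, can be illustrated with plain scaled dot-product self-attention over a set of latent vectors. This is only a sketch under strong assumptions: it uses a single head with no learned projections, and the function names are hypothetical; the abstract does not specify the internals of LAMAE's latent attention module.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(latents: List[Vector]) -> List[Vector]:
    """Row i holds how much latent i attends to every latent (all frames and views)."""
    scale = 1.0 / math.sqrt(len(latents[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(query, key)) for key in latents])
        for query in latents
    ]


def latent_attention(latents: List[Vector]) -> List[Vector]:
    """Mix each latent with all others; the number of latents may vary."""
    dim = len(latents[0])
    return [
        [sum(w * value[j] for w, value in zip(row, latents)) for j in range(dim)]
        for row in attention_weights(latents)
    ]


def pooled_representation(latents: List[Vector]) -> Vector:
    """Mean-pool the mixed latents into one fixed-size vector per study."""
    mixed = latent_attention(latents)
    dim = len(mixed[0])
    return [sum(v[j] for v in mixed) / len(mixed) for j in range(dim)]


# Hypothetical latents: a study with two views and a study with five frames.
two_views = [[1.0, 0.0], [0.0, 1.0]]
five_frames = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.4, 0.4], [0.2, 0.2]]
study_a = pooled_representation(two_views)
study_b = pooled_representation(five_frames)
```

Both studies end up with a representation of the same size even though they contain different numbers of latents, which is the property that lets one model handle sparse and variable view sets.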
These results provide evidence that incorporating structural priors, such as multi-view attention, yields significantly more robust and transferable representations.
[154] How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos
Olga Loginova,Frank Keller
Main category: cs.CV
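TL;DR: This paper proposes the PIE-V framework, which uses psychologically inspired error injection to introduce human-plausible mistakes and recovery behavior into egocentric videos in a controlled way, and builds a unified evaluation scheme to support benchmarking of procedural mistake detection and correction.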
Details
Motivation: Existing procedural video datasets lack natural, consistent traces of mistakes and corrections, and in egocentric views mistakes are often occluded by hands and visible only through subtle object state changes, which makes mistake monitoring unreliable. Method: PIE-V combines a psychology-driven error planner (conditioned on procedure phase and semantic step load), a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence; for video segments, it synthesizes replacement clips with text-guided video generation and stitches them into the original video. Result: Across 17 tasks and 50 Ego-Exo4D scenarios, it injects 102 mistakes and generates 27 recovery corrections; it introduces a unified taxonomy and a human rubric covering nine metrics, audits several existing resources, and compares against a freeform LLM baseline. Conclusion: PIE-V offers a scalable construction framework and a reproducible evaluation protocol for egocentric procedural mistake detection and correction, supporting post-completion verification. Abstract: Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified taxonomy and a human rubric with nine metrics that cover step-level and procedure-level quality, including plausibility, procedure logic with annotator confidence, state change coherence, and grounding between text and video. Using this protocol, we audit several existing resources and compare PIE-V against a freeform LLM generation baseline under the same criteria.
Together, the framework and rubric support post-completion verification for egocentric procedural mistake detection and correction.
[155] KVNN: Learnable Multi-Kernel Volterra Neural Networks
Haoyu Yun,Hamid Krim,Yufang Bao
Main category: cs.CV
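TL;DR: This paper proposes a kernelized Volterra Neural Network (kVNN) that models feature interactions of different orders through a learnable multi-kernel representation, substantially reducing parameter count and computation while maintaining strong performance.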
Details
Motivation: Higher-order learning depends on compositional feature interactions in the data, but conventional deep models tend to grow sharply in complexity as their expressivity increases, so a structure that balances expressive power with computational efficiency is needed. Method: Proposes the kernelized Volterra Neural Network (kVNN), which models interactions of different orders with polynomial-kernel components that have compact, learnable centers; each layer consists of parallel branches of different orders, and its filters can directly replace standard convolutional kernels. Result: On video action recognition and image denoising, kVNN matches or exceeds baseline performance with substantially fewer parameters and GFLOPs, without large-scale pretraining. Conclusion: Structured kernelized higher-order layers offer a practical way for modern deep networks to balance expressivity and computational cost. Abstract: Higher-order learning is fundamentally rooted in exploiting compositional features. It clearly hinges on enriching the representation by more elaborate interactions of the data which, in turn, tends to increase the model complexity of conventional large-scale deep learning models. In this paper, a kernelized Volterra Neural Network (kVNN) is proposed. The key to the achieved efficiency lies in using a learnable multi-kernel representation, where different interaction orders are modeled by distinct polynomial-kernel components with compact, learnable centers, yielding an order-adaptive parameterization. Features are learned by the composition of layers, each of which consists of parallel branches of different polynomial orders, enabling kVNN filters to directly replace standard convolutional kernels within existing architectures. The theoretical results are substantiated by experiments on two representative tasks: video action recognition and image denoising. The results demonstrate favorable performance-efficiency trade-offs: kVNN consistently yields reduced model (parameters) and computational (GFLOPs) complexity with competitive and often improved performance. These results are maintained even when trained from scratch without large-scale pretraining. In summary, we substantiate that structured kernelized higher-order layers offer a practical path to balancing expressivity and computational cost in modern deep networks.
[156] Class Unlearning via Depth-Aware Removal of Forget-Specific Directions
Arman Hatami,Romina Aalishah,Ilya E. Monosov
Main category: cs.CV
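TL;DR: This paper proposes DAMP, which performs class unlearning through projection-based weight surgery without gradient-based optimization, removing forget-class structure from deep representations more effectively while preserving retain-class performance.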
Details
Motivation: Existing class-unlearning methods show weak selectivity, leave forget-class structure in deep representations, or depend too heavily on bias shifts in the classifier head, so they do not achieve genuine knowledge removal. Method: DAMP is a one-shot, closed-form, depth-aware, projection-based weight-surgery method: at each network stage it computes class prototypes in the input space of the next learnable operator, models forget-class directions as residuals relative to retain-class prototypes, and applies a projection update that reduces downstream sensitivity to those directions; a parameter-free depth-aware scaling rule based on probe separability applies small edits in shallow layers and larger edits in deep layers; multi-class forgetting is supported through low-rank subspace removal. Result: On MNIST, CIFAR-10/100, and Tiny ImageNet with CNN and Transformer architectures, DAMP is closer to the retraining reference than existing methods and performs better on selective forgetting, retain-class performance, and removal of forget-class structure in deep layers. Conclusion: DAMP offers an efficient, training-free, principled approach to class unlearning and shows that removing forget directions from representation space yields more fundamental knowledge removal. Abstract: Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal.
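The projection-based update described above can be sketched for a single linear operator. The sketch assumes a simplified reading of the abstract: the forget direction is the forget-class prototype minus the mean of the retain-class prototypes, and the edit is W' = W - s (W d) dᵀ for a unit direction d. The function names and the `strength` parameter (a stand-in for the depth-aware scale) are hypothetical and not taken from the paper.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def forget_direction(forget_proto: Vector, retain_protos: List[Vector]) -> Vector:
    """Unit residual of the forget prototype relative to the retain prototypes.

    Assumption: 'relative to' is taken as subtracting the retain-class mean.
    """
    n = len(retain_protos)
    retain_mean = [sum(p[j] for p in retain_protos) / n for j in range(len(forget_proto))]
    residual = [f - r for f, r in zip(forget_proto, retain_mean)]
    norm = math.sqrt(sum(x * x for x in residual))
    if norm == 0.0:
        raise ValueError("forget prototype coincides with the retain mean")
    return [x / norm for x in residual]


def project_out(weight: Matrix, direction: Vector, strength: float = 1.0) -> Matrix:
    """Return W - strength * (W d) d^T, reducing sensitivity to inputs along d."""
    edited = []
    for row in weight:
        coeff = sum(w * d for w, d in zip(row, direction))  # this row's response to d
        edited.append([w - strength * coeff * d for w, d in zip(row, direction)])
    return edited


def matvec(weight: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Toy example with hypothetical prototypes in a 2D input space.
direction = forget_direction([2.0, 0.0], [[0.0, 1.0], [0.0, -1.0]])
weight = [[1.0, 2.0], [3.0, 4.0]]
full_edit = project_out(weight, direction, strength=1.0)
half_edit = project_out(weight, direction, strength=0.5)
```

With `strength=1.0` the operator no longer responds to the forget direction at all, while inputs orthogonal to it pass through unchanged; a smaller strength, as might be used in early layers, only attenuates the response.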
Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.
[157] OmniLight: One Model to Rule All Lighting Conditions
Youngjin Oh,Junyoung Park,Junhyeong Kwon,Nam Ik Cho
Main category: cs.CV
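TL;DR: This paper presents two strategies for lighting-related image restoration: the specialized model DINOLight and the generalized model OmniLight (with a WD-MoE module). Both achieved top-tier rankings in all three lighting-related tracks of the NTIRE 2026 Challenge, showing the effectiveness and generalization of specialized and unified architectures under different data distributions.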
Details
Motivation: Real-world applications need models that are robust across domains, whereas existing methods are mostly optimized for specific benchmarks and generalize poorly to diverse lighting degradations. Method: Builds the specialized baseline DINOLight and extends it to OmniLight, a generalized model trained across datasets that introduces a Wavelet Domain Mixture-of-Experts (WD-MoE) structure to improve multi-domain adaptability. Result: DINOLight and OmniLight both achieved top-tier rankings in all three lighting-related tracks of the NTIRE 2026 Challenge, showing strong perceptual quality and cross-domain generalization. Conclusion: Specialized and unified architectures each have advantages, and the data distribution strongly influences which to choose; WD-MoE supports the generalized model's performance on multi-domain lighting restoration. Abstract: Adverse lighting conditions, such as cast shadows and irregular illumination, pose significant challenges to computer vision systems by degrading visibility and color fidelity. Consequently, effective shadow removal and ambient lighting normalization (ALN) are critical for restoring underlying image content, improving perceptual quality, and facilitating robust performance in downstream tasks. However, while achieving state-of-the-art results on specific benchmarks is a primary goal in image restoration challenges, real-world applications often demand robust models capable of handling diverse domains. To address this, we present a comprehensive study on lighting-related image restoration by exploring two contrasting strategies. We leverage a robust framework for ALN, DINOLight, as a specialized baseline to exploit the characteristics of each individual dataset, and extend it to OmniLight, a generalized alternative incorporating our proposed Wavelet Domain Mixture-of-Experts (WD-MoE) that is trained across all provided datasets. Through a comparative analysis of these two methods, we discuss the impact of data distribution on the performance of specialized and unified architectures in lighting-related image restoration. Notably, both approaches secured top-tier rankings across all three lighting-related tracks in the NTIRE 2026 Challenge, demonstrating their outstanding perceptual quality and generalization capabilities. Our codes are available at https://github.com/OBAKSA/Lighting-Restoration.
[158] An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation
Onno Niemann,Gonzalo Martínez Muñoz,Alberto Suárez Gonzalez
Main category: cs.CV
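TL;DR: This paper investigates whether lighter-weight regularizers can replace the computationally expensive Fokker-Planck (FP) equation constraint in diffusion model training, reducing computational cost while maintaining generation quality.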
Details
Motivation: Diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker-Planck (FP) equation that describes the evolution of the true data density; directly penalizing this deviation is effective but computationally expensive, and strong FP regularization does not necessarily improve generation quality. Method: Empirically analyzes several lightweight regularizers, evaluates their effect on FP residuals and generation quality, and compares them with standard FP regularization. Result: Lightweight regularizers deliver benefits comparable to FP regularization at substantially lower computational cost. Conclusion: The benefits of FP regularization can be obtained with simpler, more efficient alternatives, without the high computational cost. Abstract: Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker-Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at https://github.com/OnnoNiemann/fp_diffusion_analysis.
[159] Boundary-Centric Active Learning for Temporal Action Segmentation
Halil Ismail Helvaci,Sen-ching Samson Cheung
Main category: cs.CV
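TL;DR: This paper proposes B-ACT, an active learning framework that focuses supervision on action boundaries, substantially improving annotation efficiency and performance for temporal action segmentation (TAS) under limited labeling budgets.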
Details
Motivation: Temporal action segmentation requires dense temporal supervision, yet most of the annotation cost goes into identifying and refining action boundaries; these boundary regions are exactly where segmentation errors concentrate and where small temporal shifts severely affect evaluation metrics. Method: B-ACT is a clip-budgeted active learning framework. In the first stage it ranks and queries unlabeled videos by predictive uncertainty; in the second stage it detects candidate boundaries within the selected videos and selects the top-K boundary frames for annotation using a new boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Annotation requires labels only for boundary frames, while training uses boundary-centered clips to exploit the temporal context within the model's receptive field. Result: Extensive experiments on GTEA, 50Salads, and Breakfast show that under sparse budgets B-ACT significantly outperforms existing TAS active learning methods and the prior state of the art, with the largest gains on datasets where boundary placement dominates the F1 metrics. Conclusion: Focusing annotation on boundary regions greatly improves label efficiency and model performance for TAS, supporting the practice of labeling high-leverage regions first. Abstract: Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Importantly, our annotation protocol requests labels for only the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets, with the largest gains on datasets where boundary placement dominates edit and overlap-based F1 scores.
[160] VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
Huawei Ji,Yuanhao Sun,Yuan Jin,Cheng Deng,Jiaxin Ding,Luoyi Fu,Xinbing Wang
Main category: cs.CV
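TL;DR: This paper proposes VisPCO, a framework that formulates visual token pruning as a Pareto configuration optimization problem and searches for optimal pruning configurations automatically using continuous relaxation, straight-through estimators, and the Augmented Lagrangian method; it also shows that multi-step progressive pruning better matches the hierarchical compression structure of VLMs.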
Details
Motivation: Existing visual token pruning methods rely on predefined configurations and cannot guarantee an optimal computation-performance trade-off. Method: Formulates visual token pruning as a Pareto configuration optimization problem; uses continuous relaxation and straight-through estimators to enable gradient-based search; solves the problem with the Augmented Lagrangian method; and introduces learnable kernel functions to analyze layer-wise pruning patterns. Result: Experiments on 8 visual benchmarks show that the method closely approximates the Pareto frontier obtained by grid search and generalizes well across pruning methods and VLM architectures; multi-step progressive pruning is found to outperform single-layer pruning. Conclusion: An automated Pareto optimization framework balances accuracy and efficiency better, and multi-step progressive pruning better fits the intrinsic hierarchical compression of VLMs. Abstract: Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
[161] Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization
Umer Ahmed,Syed Ahmed Mahmood,Fawad Javed Fateh,M. Shaheer Luqman,M. Zeeshan Zia,Quoc-Huy Tran
Main category: cs.CV
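TL;DR: This paper proposes a hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. Two levels of vector quantization model subactions and actions respectively and fuse spatial and temporal information, reaching state of the art on multiple benchmarks while reducing segment length bias.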
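The two-level assignment described in the summary above can be sketched as two nearest-codebook lookups. This is only an illustration of the assignment step under simplifying assumptions: the codebooks here are passed in by hand, whereas the paper learns them and additionally reconstructs skeletons and timestamps. The names `nearest_code` and `hierarchical_quantize` are hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vector, code)) for code in codebook]
    return dists.index(min(dists))


def hierarchical_quantize(frames, subaction_codebook, action_codebook):
    """Assign each frame a (subaction, action) label pair.

    Lower level: frame feature -> nearest subaction code.
    Higher level: that subaction's code vector -> nearest action code.
    """
    labels = []
    for frame in frames:
        sub = nearest_code(frame, subaction_codebook)
        act = nearest_code(subaction_codebook[sub], action_codebook)
        labels.append((sub, act))
    return labels
```

For example, with four subaction codes clustered around two action codes, frames that fall near different subactions can still share one action label, which is the grouping behavior the higher level is meant to provide.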
Details
Motivation: Addresses two limitations in unsupervised skeleton-based temporal action segmentation: the lack of modeling of the hierarchical structure of actions, and the performance ceiling imposed by using spatial information alone. Method: Proposes a hierarchical vector quantization framework in which the lower level maps skeletons to fine-grained subactions and the higher level aggregates subactions into action-level representations; this is extended to hierarchical spatiotemporal vector quantization, which jointly reconstructs skeletons and their timestamps to perform multi-level spatiotemporal clustering. Result: Sets new state-of-the-art performance on multiple benchmarks, including HuGaDB, LARa, and BABEL, and effectively reduces segment length bias. Conclusion: Hierarchical spatiotemporal vector quantization models the hierarchical and spatiotemporal structure of actions more effectively and substantially improves unsupervised skeleton-based action segmentation. Abstract: We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes a new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
[162] MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Yan Li,Zezi Zeng,Yifan Yang,Yuqing Yang,Ning Liao,Weiwei Guo,Lili Qiu,Mingxi Cheng,Qi Dai,Zhendong Wang,Zhengyuan Yang,Xue Yang,Ji Li,Lijuan Wang,Chong Luo
Main category: cs.CV
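TL;DR: This paper proposes MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC element generation through hierarchical planning and self-reflection to improve global coherence and visual consistency, and introduces a new benchmark and a multi-level evaluation protocol.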
Details
Motivation: Integrating existing AIGC tools directly into automated webpage generation tends to cause style inconsistency and poor global coherence, because each element is generated in isolation. Method: Proposes MM-WebAgent, a hierarchical agentic framework that combines hierarchical planning with iterative self-reflection to jointly optimize global layout, local multimodal content, and their integration; also builds a benchmark for multimodal webpage generation and a multi-level evaluation protocol. Result: Experiments show that MM-WebAgent significantly outperforms code-generation and agent-based baselines on multimodal element generation and integration. Conclusion: MM-WebAgent effectively improves the visual consistency and global coherence of generated webpages, validating hierarchical coordination and self-reflection for multimodal webpage generation. Abstract: The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.
[163] StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
Xuanyi Liu,Deyi Ji,Chunan Yu,Qi Zhu,Xuanfu Li,Jin Ma,Tianrun Chen,Lanyun Zhu
Main category: cs.CV
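TL;DR: This paper proposes StreamCacheVGGT, a training-free cache management framework for streaming 3D reconstruction that improves the retention of geometric information through Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC), achieving state-of-the-art performance on multiple benchmarks.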
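The keep / merge / evict idea behind the two modules named above can be sketched in a few lines. This is a hedged illustration, not the paper's algorithm: the cross-layer score is assumed to be a per-token median over layers (one possible order statistic), merging is assumed to be a running mean into the nearest retained anchor in key space, and the function name `hybrid_compress` and its parameters are hypothetical.

```python
from statistics import median


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hybrid_compress(keys, values, layer_scores, n_keep, n_merge):
    """keys/values: per-token vectors; layer_scores[l][t]: score of token t at layer l."""
    n_tokens = len(keys)
    # Cross-layer score: the median ignores a spike confined to a single
    # layer (assumed choice, not taken from the paper).
    score = [median(layer[t] for layer in layer_scores) for t in range(n_tokens)]
    order = sorted(range(n_tokens), key=lambda t: score[t], reverse=True)

    keep_idx = sorted(order[:n_keep])            # tier 1: retained anchors
    merge_idx = order[n_keep:n_keep + n_merge]   # tier 2: merged into anchors
    # tier 3: all remaining tokens are evicted

    anchor_keys = [list(keys[i]) for i in keep_idx]
    new_keys = [list(keys[i]) for i in keep_idx]
    new_values = [list(values[i]) for i in keep_idx]
    counts = [1] * len(keep_idx)

    for t in merge_idx:
        # Nearest anchor in key space (squared Euclidean distance).
        dists = [squared_distance(anchor, keys[t]) for anchor in anchor_keys]
        a = dists.index(min(dists))
        counts[a] += 1
        # Running mean of the anchor and the tokens merged into it.
        new_keys[a] = [m + (x - m) / counts[a] for m, x in zip(new_keys[a], keys[t])]
        new_values[a] = [m + (x - m) / counts[a] for m, x in zip(new_values[a], values[t])]
    return new_keys, new_values, keep_idx
```

The cache size after compression is always `n_keep`, which is what keeps memory constant; the merge tier only changes what the retained entries contain.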
Details
Motivation: Existing O(1) cache frameworks rely on pure eviction, which suffers from severe information loss due to binary deletion and from activation noise introduced by local single-layer scoring, making it hard to reconstruct dense 3D geometry stably under constant memory. Method: Proposes StreamCacheVGGT: 1) the CLCES module uses token activation trajectories across Transformer layers together with order-statistical analysis to identify tokens with sustained geometric salience and suppress noise; 2) the HCC module builds on these robust scores with a three-tier triage strategy that merges moderately important tokens into their nearest-neighbor anchors on the key-vector manifold instead of simply deleting them. Result: On five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI), StreamCacheVGGT significantly improves reconstruction accuracy and long-term stability under strictly constant compute and memory cost, setting a new state of the art. Conclusion: Through the co-designed CLCES and HCC modules, StreamCacheVGGT overcomes the inherent shortcomings of the pure-eviction paradigm and shows that information retention and computational efficiency can be achieved together in streaming 3D reconstruction. Abstract: Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
[164] TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
Jiawei Ren,Michal Jan Tyszkiewicz,Jiahui Huang,Zan Gojcic
Main category: cs.CV
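TL;DR: This paper proposes TokenGS, a new Transformer-based method for 3D Gaussian Splatting (3DGS) prediction that directly regresses 3D mean coordinates and adopts an encoder-decoder architecture with learnable Gaussian tokens, improving robustness, generalization, and reconstruction quality.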
Details
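Motivation: Existing methods regress Gaussian means as depths along camera rays, which ties them to the input image resolution and number of views and makes them sensitive to pose noise and multiview inconsistencies; a more flexible, robust, and scalable formulation is needed. Method: Drops depth regression and directly regresses 3D mean coordinates using only a self-supervised rendering loss; introduces an encoder-decoder architecture with learnable Gaussian tokens representing the 3D primitives, decoupling the number of primitives from input resolution and number of views. Result: TokenGS achieves state-of-the-art feed-forward reconstruction on static and dynamic scenes, with more regularized geometry and a more balanced 3DGS distribution, and naturally recovers emergent attributes such as static-dynamic decomposition and scene flow; it is more robust to pose noise and multiview inconsistencies and supports efficient test-time optimization in token space. Conclusion: Directly regressing 3D coordinates and tokenizing the Gaussian representation are key design choices for improving the performance and generalization of feed-forward 3DGS prediction; the encoder-decoder design with learnable tokens offers a more flexible, decoupled, and optimizable paradigm for 3D generative modeling. Abstract: In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.
[165] SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation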
Tianhao Fu,Austin Wang,Charles Chen,Roby Aldave-Garza,Yucheng Chen
Main category: cs.CV
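TL;DR: This paper proposes SegWithU, a lightweight, single-forward-pass, post-hoc uncertainty estimation framework for medical image segmentation. It attaches an uncertainty head to a frozen pretrained segmentation backbone and models perturbation energy to produce two kinds of voxel-wise uncertainty maps, achieving the best performance on multiple datasets.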
Details
Motivation: Reliable uncertainty estimation is critical for medical image segmentation, but existing methods often require repeated inference (inefficient), while single-pass methods are limited in failure-ranking ability or by feature-space assumptions. Method: SegWithU is a post-hoc framework that taps intermediate features of a frozen pretrained segmentation backbone and models perturbation energy in a compact probe space using rank-1 posterior probes, producing two voxel-wise uncertainty maps: a calibration-oriented map (for probability tempering) and a ranking-oriented map (for error detection and selective prediction). Result: On ACDC, BraTS2024, and LiTS, SegWithU obtains the best and most consistent AUROC/AURC among single-forward-pass methods (0.9838/2.4885, 0.9946/0.2660, and 0.9925/0.8193, respectively) while preserving segmentation accuracy. Conclusion: Perturbation-based uncertainty modeling is an effective and practical route to highly reliable medical image segmentation. Abstract: Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present SegWithU, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of $0.9838/2.4885$, $0.9946/0.2660$, and $0.9925/0.8193$, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.
[166] Why Do Vision Language Models Struggle To Recognize Human Emotions?
Madhav Agarwal,Sotirios A. Tsaftaris,Laura Sevilla-Lara,Steven McDonagh
Main category: cs.CV
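TL;DR: This paper investigates why vision-language models (VLMs) perform poorly at human emotion recognition, identifying two key weaknesses: a bias induced by long-tailed emotion data and insufficient modeling of temporal information. It proposes alternative sampling strategies and a multi-stage context enrichment method to improve DFER performance.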
Details
Motivation: Although VLMs have made notable progress on many visual tasks, they still lag behind specialized vision models on human emotion recognition, especially dynamic facial expression recognition (DFER); this paper aims to uncover the root causes. Method: Analyzes the failure mechanisms of VLMs on DFER and proposes two remedies: 1) alternative sampling strategies for the long-tailed distribution problem; 2) a multi-stage context enrichment method for the temporal modeling deficiency, which converts dense frame sequences into natural language summaries and feeds them to the VLM together with sparse keyframes. Result: Confirms that data long-tailedness and temporal modeling limits make it hard for VLMs to pick up key emotional signals such as micro-expressions; the proposed context enrichment strategy effectively mitigates attentional dilution and improves emotion recognition accuracy. Conclusion: The current VLM architecture is fundamentally mismatched with the emotion recognition task; data sampling and temporal modeling must be improved together to unlock its potential for affective understanding. Abstract: Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous and dynamic task of facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal.
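The mismatch between sparse sampling and 0.25-0.5 second micro-expressions can be made concrete with a back-of-the-envelope calculation. The sketch below assumes uniformly spaced frames whose phase is random relative to the event, and an event that lies fully inside the clip; the clip length of 10 seconds and the budget of 8 frames are hypothetical numbers chosen for illustration, not figures from the paper.

```python
def capture_probability(event_duration_s, clip_length_s, num_frames):
    """Probability that at least one uniformly spaced frame lands inside
    an event of the given duration (random phase, event inside the clip)."""
    interval = clip_length_s / num_frames  # time between sampled frames
    return min(1.0, event_duration_s / interval)


# Hypothetical setting: 8 frames sampled from a 10-second clip.
for duration in (0.25, 0.5):
    p = capture_probability(duration, clip_length_s=10.0, num_frames=8)
    print(f"{duration:.2f} s event: capture probability {p:.2f}")
```

Under these assumptions the frame interval is 1.25 seconds, so even the longest micro-expression in the quoted range is more likely to be missed than seen, which is consistent with the argument that in-between frames carry information the keyframes do not.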
As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
[167] R3D: Revisiting 3D Policy Learning
Zhengdong Hong,Shenrui Wu,Haozhe Cui,Boyi Zhao,Ran Ji,Yiyang He,Hangxing Zhang,Zundong Ke,Jun Wang,Guofeng Zhang,Jiayuan Gu
Main category: cs.CV
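TL;DR: This paper proposes a new architecture that couples a scalable Transformer-based 3D encoder with a diffusion decoder, and addresses training instability and overfitting in 3D policy learning by introducing 3D data augmentation and removing Batch Normalization, significantly outperforming existing methods on manipulation benchmarks.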
Details
Motivation: Training instability and severe overfitting have kept 3D policy learning from adopting powerful 3D perception models, holding back generalization and cross-embodiment transfer. Method: Systematically diagnoses the failures, identifying missing 3D data augmentation and the adverse effects of Batch Normalization as primary causes; proposes a new architecture coupling a scalable Transformer-based 3D encoder with a diffusion decoder, designed for stability and for leveraging large-scale pre-training. Result: Significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks. Conclusion: Establishes a new, robust foundation for scalable 3D imitation learning. Abstract: 3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/
[168] GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
Roni Itkin,Noam Issachar,Yehonatan Keypur,Anpei Chen,Sagie Benaim
Main category: cs.CV
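TL;DR: This paper proposes GlobalSplat, a framework that follows an "align first, decode later" strategy to learn a compact, globally consistent latent scene representation, avoiding the redundancy and inconsistency caused by pixel or voxel alignment in prior methods. It maintains high-quality novel-view synthesis while greatly reducing the number of Gaussians (as few as 16K) and the model footprint (4MB), with a single forward pass under 78 ms.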
Details
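Motivation: Existing approaches to allocating spatial primitives in 3D Gaussian Splatting (iterative optimization or feed-forward inference) trade off representation compactness, reconstruction speed, and rendering fidelity, mainly because they rely on local heuristic strategies that lack global scene awareness; feed-forward methods in particular are mostly pixel-aligned or voxel-aligned, which produces redundant 3D assets, representations that grow with the number of views, and poor global consistency. Method: Proposes GlobalSplat, built on an "align first, decode later" paradigm: it first learns a compact, global, latent scene representation that encodes multi-view input and resolves cross-view correspondences, then decodes explicit 3D geometry; a coarse-to-fine training curriculum gradually increases decoded capacity and natively prevents representation bloat; it does not rely on pretrained pixel-prediction backbones or on reusing latent features from dense baselines. Result: Achieves competitive novel-view synthesis on RealEstate10K and ACID with as few as 16K Gaussians and a 4MB footprint; a single forward pass takes under 78 milliseconds, significantly faster than the baselines. Conclusion: Through a global latent representation and curriculum learning, GlobalSplat unifies compactness, consistency, and efficiency, offering a better paradigm for spatial primitive allocation in 3D Gaussian Splatting. Abstract: The efficient spatial allocation of primitives serves as the foundation of 3D Gaussian Splatting, as it directly dictates the synergy between representation compactness, reconstruction speed, and rendering fidelity. Previous solutions, whether based on iterative optimization or feed-forward inference, suffer from significant trade-offs between these goals, mainly due to the reliance on local, heuristic-driven allocation strategies that lack global scene awareness. Specifically, current feed-forward methods are largely pixel-aligned or voxel-aligned. By unprojecting pixels into dense, view-aligned primitives, they bake redundancy into the 3D asset. As more input views are added, the representation size increases and global consistency becomes fragile. To this end, we introduce GlobalSplat, a framework built on the principle of align first, decode later. Our approach learns a compact, global, latent scene representation that encodes multi-view input and resolves cross-view correspondences before decoding any explicit 3D geometry. Crucially, this formulation enables compact, globally consistent reconstructions without relying on pretrained pixel-prediction backbones or reusing latent features from dense baselines. Utilizing a coarse-to-fine training curriculum that gradually increases decoded capacity, GlobalSplat natively prevents representation bloat.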
On RealEstate10K and ACID, our model achieves competitive novel-view synthesis performance while utilizing as few as 16K Gaussians, significantly fewer than required by dense pipelines, obtaining a light 4MB footprint. Further, GlobalSplat enables significantly faster inference than the baselines, operating under 78 milliseconds in a single forward pass. Project page is available at https://r-itk.github.io/globalsplat/
[169] AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving
Fabrizio Genilotti,Arianna Stropeni,Gionata Grotto,Francesco Borsatti,Manuel Barusco,Davide Dalle Pezze,Gian Antonio Susto
Main category: cs.CV
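TL;DR: This paper studies visual anomaly detection (VAD) for autonomous driving. Evaluating eight state-of-the-art VAD methods on the large synthetic dataset AnoVox confirms that VAD is effective in road scenes and finds that Tiny-Dinomaly offers the best accuracy-efficiency trade-off for edge deployment.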
Details
Motivation: The perception of autonomous driving systems tends to degrade in anomalous scenes outside the training data distribution (such as atypical obstacles), and such failures translate directly into serious physical risk, so a reliable mechanism is needed to identify unknown anomalies and direct attention to them. Method: Benchmarks eight state-of-the-art VAD methods on the AnoVox dataset across four backbones ranging from large networks to lightweight ones (such as MobileNet and DeiT-Tiny), with a focus on pixel-level anomaly localization and suitability for edge deployment. Result: VAD methods transfer effectively to road scenes; Tiny-Dinomaly matches full-scale localization accuracy at a fraction of the memory cost, giving the best accuracy-efficiency trade-off. Conclusion: VAD is a viable path to safer and more robust autonomous driving systems; lightweight VAD models such as Tiny-Dinomaly offer a practical solution for edge deployment and support safer, more responsible deployment of autonomous driving. Abstract: The reliability of a machine vision system for autonomous driving depends heavily on its training data distribution. When a vehicle encounters significantly different conditions, such as atypical obstacles, its perceptual capabilities can degrade substantially. Unlike many domains where errors carry limited consequences, failures in autonomous driving translate directly into physical risk for passengers, pedestrians, and other road users. To address this challenge, we explore Visual Anomaly Detection (VAD) as a solution. VAD enables the identification of anomalous objects not present during training, allowing the system to alert the driver when an unfamiliar situation is detected. Crucially, VAD models produce pixel-level anomaly maps that can guide driver attention to specific regions of concern without requiring any prior assumptions about the nature or form of the hazard. We benchmark eight state-of-the-art VAD methods on AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving. In particular, we evaluate performance across four backbone architectures spanning from large networks to lightweight ones such as MobileNet and DeiT-Tiny. Our results demonstrate that VAD transfers effectively to road scenes. Notably, Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching full-scale localization performance at a fraction of the memory cost.
This study represents a concrete step toward safer, more responsible deployment of autonomous vehicles, ultimately improving protection for passengers, pedestrians, and all road users.
[170] AnimationBench: Are Video Models Good at Character-Centric Animation?
Leyi Wu,Pengjun Fang,Kai Sun,Yazhou Xing,Yinwei Wu,Songsong Wang,Ziqi Huang,Dan Zhou,Yingqing He,Ying-Cong Chen,Qifeng Chen
Main category: cs.CV
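TL;DR: This paper proposes AnimationBench, the first systematic benchmark for animation-style image-to-video (I2V) generation. It builds quantifiable metrics along dimensions such as the Twelve Basic Principles of Animation and IP consistency, and supports both closed-set and open-set evaluation, markedly improving the ability to discriminate animation generation quality.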
Details
Motivation: Existing video generation benchmarks mainly target realistic video and struggle to assess animation-style generation in terms of stylized appearance, exaggerated motion, and character consistency; their fixed prompt sets and rigid pipelines also limit open-domain content and custom evaluation. Method: Proposes the AnimationBench benchmark, which turns the Twelve Basic Principles of Animation and IP preservation into measurable evaluation dimensions and adds broader quality dimensions (such as semantic consistency, motion rationality, and camera motion consistency); it supports standardized closed-set evaluation and flexible open-set diagnostic evaluation, and uses vision-language models for scalable automatic assessment. Result: Experiments show that AnimationBench agrees closely with human judgment and exposes animation-specific quality differences that realism-oriented benchmarks overlook, giving a more informative and discriminative evaluation of state-of-the-art I2V models. Conclusion: AnimationBench fills the gap in evaluating animation-style I2V generation, establishing the first systematic, scalable benchmark aligned with human judgment for the field and pushing animation generation toward higher artistic and technical standards. Abstract: Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. Moreover, they also rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized closed-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlooked by realism-oriented benchmarks, leading to more informative and discriminative evaluation of state-of-the-art I2V models.
[171] Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation
Yiyang Jiang,Li Zhang,Xiao-Yong Wei,Li Qing
Main category: cs.CV
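TL;DR: This paper proposes a reasoning-based sign language translation (SLT) framework that introduces an explicit sequence of latent thoughts as an intermediate layer between video and text and uses a plan-then-ground decoding strategy, markedly improving the coherence and faithfulness of translations; it also releases a larger, more realistic gloss-free SLT dataset.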
Details
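Motivation: Existing SLT systems implicitly assume that segments of signing map directly to spoken-language words, but this assumption ignores that sign language builds meaning dynamically from context, space, and movement, which distorts the modeling. Method: Proposes a reasoning-driven SLT framework: 1) an ordered sequence of latent thoughts serves as an explicit intermediate representation between video and text; 2) a plan-then-ground decoding mechanism first generates a semantic plan and then looks back at the video for supporting evidence; 3) a large-scale gloss-free SLT dataset with strong context dependencies is built and released. Result: Consistently outperforms existing gloss-free SLT methods on several benchmarks, validating reasoning-based modeling and the explicit intermediate representation; the new dataset strengthens the modeling of realistic sign language semantics. Conclusion: SLT is fundamentally a cross-modal reasoning task rather than simple video-to-text conversion; explicitly modeling the reasoning process (for example, with a latent thought chain and staged decoding) can substantially improve translation quality and interpretability. Abstract: Many SLT systems quietly assume that brief chunks of signing map directly to spoken-language words. That assumption breaks down because signers often create meaning on the fly using context, space, and movement. We revisit SLT and argue that it is mainly a cross-modal reasoning task, not just a straightforward video-to-text conversion. We thus introduce a reasoning-driven SLT framework that uses an ordered sequence of latent thoughts as an explicit middle layer between the video and the generated text. These latent thoughts gradually extract and organize meaning over time. On top of this, we use a plan-then-ground decoding method: the model first decides what it wants to say, and then looks back at the video to find the evidence. This separation improves coherence and faithfulness. We also built and released a new large-scale gloss-free SLT dataset with stronger context dependencies and more realistic meanings. Experiments across several benchmarks show consistent gains over existing gloss-free methods. Code and data will be released upon acceptance at https://github.com/fletcherjiang/SignThought.
[172] RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework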
Hao Gao,Shaoyu Chen,Yifan Zhu,Yuehao Song,Wenyu Liu,Qian Zhang,Xinggang Wang
Main category: cs.CV
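TL;DR: This paper proposes RAD-2, a closed-loop motion planning framework that combines a diffusion-based generator with a reinforcement-learning-optimized discriminator. By decoupling generation from evaluation and introducing temporally consistent policy optimization and on-policy generator optimization, it significantly improves planning robustness and safety.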
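The decoupling of generation from evaluation summarized above reduces, at inference time, to "sample candidates, score them, pick the best", and group-relative policy optimization methods typically normalize rewards within a group of samples. The sketch below shows both pieces under loud assumptions: trajectories are plain lists of lateral offsets, `score_fn` is a hypothetical stand-in for the learned discriminator, and the advantage formula is the generic group-relative normalization (population standard deviation), not the temporally consistent variant the paper introduces.

```python
from statistics import mean, pstdev


def rerank(candidates, score_fn):
    """Order candidate trajectories from best to worst by a scalar score."""
    return sorted(candidates, key=score_fn, reverse=True)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical usage: prefer the trajectory with the smallest peak lateral offset.
candidates = [[0.0, 2.0], [0.0, 0.5], [0.0, 1.0]]
best = rerank(candidates, lambda traj: -max(abs(x) for x in traj))[0]
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because the advantages are centered within the group, above-average samples are reinforced and below-average ones are penalized without needing a separate value network, which is the property group-relative methods rely on.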
Details
Motivation: Existing diffusion-based planners suffer from stochastic instabilities in closed-loop interaction and lack corrective negative feedback, making it hard to combine multimodal uncertainty modeling with robustness. Method: Proposes the RAD-2 framework: 1) a diffusion-based generator samples diverse trajectories; 2) an RL-optimized discriminator reranks them by long-term driving quality; 3) Temporally Consistent Group Relative Policy Optimization (TC-GRO) alleviates the credit assignment problem; 4) On-policy Generator Optimization (OGO) converts closed-loop feedback into structured longitudinal optimization signals; 5) the BEV-Warp simulation environment enables efficient closed-loop evaluation in Bird's-Eye View feature space. Result: Reduces the collision rate by 56% compared with strong diffusion-based planners; real-world deployment confirms improved perceived safety and driving smoothness in complex urban traffic. Conclusion: Through generator-discriminator cooperation, temporally guided RL optimization, and efficient simulation, RAD-2 addresses the instability and missing feedback of diffusion planners in closed-loop driving, moving high-level autonomous driving motion planning toward greater robustness and deployability. Abstract: High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners.
Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
[173] TokenLight: Precise Lighting Control in Images using Attribute Tokens
Sumit Chaturvedi,Yannick Hold-Geoffroy,Mengwei Ren,Jingyuan Liu,He Zhang,Yiqun Mei,Julie Dorsey,Zhixin Shu
Main category: cs.CV
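TL;DR: This paper proposes an attribute-token-based image relighting method that gives fine-grained, continuous control over multiple illumination attributes (such as intensity, color, ambient illumination, diffuse level, and 3D light position), and that learns how light interacts with geometry, occlusion, and materials without explicit inverse rendering supervision.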
Details
Motivation: Existing methods struggle to control multiple illumination attributes precisely at the same time and depend on explicit inverse rendering supervision, which limits generalization and realism. Method: Formulates relighting as a conditional image generation task and introduces attribute tokens to encode distinct lighting factors; trains on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real images to improve realism and generalization. Result: Achieves state-of-the-art quantitative and qualitative results on both synthetic and real images; the model implicitly learns how light interacts with scene geometry, occlusion, and materials, and remains convincing in challenging cases such as placing lights inside objects or relighting transparent materials. Conclusion: Attribute tokens enable flexible, controllable, and highly realistic image relighting, with strong generalization and implicit physical understanding. Abstract: This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly. Project page: vrroom.github.io/tokenlight/
[174] LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
Zhanhao Liang,Tao Yang,Jie Wu,Chengjian Feng,Liang Zheng
Main category: cs.CV
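TL;DR: This paper proposes LeapAlign, which shortens the generation trajectory of flow matching models by designing two leap steps, reducing computational cost and enabling direct gradient propagation from the reward to early generation steps, and significantly improving image quality and image-text alignment.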
Details
Motivation: Existing preference alignment methods for flow matching models that backpropagate gradients directly are limited by the memory cost and gradient explosion of long trajectories, and struggle to update the early generation steps that determine the global structure of the image. Method: Proposes LeapAlign: the long ODE sampling trajectory is compressed into a short trajectory of only two leaps, each of which skips multiple steps and predicts a future latent; randomizing the start and end timesteps of the leaps allows efficient, stable updates at any generation step; trajectories more consistent with the long path receive higher weight, and gradient terms with large magnitude are down-weighted to improve training stability. Result: When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across multiple metrics, markedly improving image quality and image-text alignment. Conclusion: Through trajectory compression and gradient weighting strategies, LeapAlign resolves the difficulty of updating early steps in preference alignment of flow matching models, and is an efficient, stable, and high-performing fine-tuning method. Abstract: This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.
[175] Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
Ninghui Xu,Fabio Tosi,Lihui Wang,Jiawei Han,Luca Bartolomei,Zhiting Yao,Matteo Poggi,Stefano Mattoccia
Main category: cs.CV
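TL;DR: This paper proposes Bi-CMPStereo, a bidirectional cross-modal prompting framework for asymmetric stereo matching between an event camera and a frame camera. It aligns representations in a unified canonical space and projects modality information in both directions, improving the robustness and accuracy of 3D perception in dynamic scenes.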