Table of Contents
cs.CL
[1] Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models
Andrew Kiruluta
Main category: cs.CL
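TL;DR: This paper proposes a compressed-sensing-based framework for dynamic large language model (LLM) execution. Through random measurements, sparse recovery, and compilation into hardware-friendly sparse execution paths, it jointly compresses the model and the prompt in a task-conditioned, token-adaptive way, balancing accuracy, speed, and deployment efficiency.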
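For readers unfamiliar with the setup, the standard compressed-sensing formulation that this line of work builds on can be written as follows. This is the generic textbook form, not the paper's specific operators or bounds; the symbols $\Phi$, $x$, $y$, $e$, $s$, $n$, $m$, $\delta_s$, and $\varepsilon$ are notation assumed here for illustration.

```latex
% Generic compressed sensing (textbook form; notation assumed, not taken from the paper)
\begin{align}
  y &= \Phi x + e, \qquad \Phi \in \mathbb{R}^{m \times n},\ \|x\|_0 \le s,\ m \ll n \\
  \hat{x} &= \arg\min_{z}\ \|z\|_1 \quad \text{s.t.}\quad \|y - \Phi z\|_2 \le \varepsilon \\
  (1-\delta_s)\,\|x\|_2^2 &\le \|\Phi x\|_2^2 \le (1+\delta_s)\,\|x\|_2^2 \quad \text{for all $s$-sparse } x
\end{align}
```

In the framing summarized here, $x$ would plausibly play the role of a usage indicator over blocks, heads, or channels. For random Gaussian $\Phi$, on the order of $m = O(s \log(n/s))$ measurements typically suffices for the restricted isometry property to hold with high probability.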
Details
Motivation: Most existing model-compression methods are static and optimized offline, and do not exploit the fact that different prompts and decoding steps activate different computational pathways. Prompt-compression methods shorten the sequence but do not adjust the model subnetwork that is actually executed. The two directions lack a unified, coordinated treatment.
Method: A unified compressed-sensing-guided framework: random measurement operators probe the model's latent usage patterns, sparse recovery estimates task-relevant, token-adaptive structured sparse support sets, and the supports are compiled into hardware-efficient sparse execution paths over blocks, attention heads, channels, and feed-forward substructures.
Result: The framework provides task-conditioned measurements, token-adaptive recovery, theoretical sample-complexity bounds (based on RIP or mutual incoherence), GPU-oriented compilation constraints, and a joint objective for prompt compression and model pruning; it speeds up inference while preserving approximation accuracy.
Conclusion: The framework recasts LLM inference as a measurement-and-recovery problem with explicit approximation guarantees and deployment-oriented speedup constraints, offering a new paradigm for dynamic, adaptive, hardware-software co-designed efficient LLM execution.
Abstract: Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial compression, while prompt-compression methods reduce latency by removing redundant input tokens. However, these two directions remain largely separate. Most model-compression methods are static and optimized offline, and they do not exploit the fact that different prompts and decoding steps activate different latent computational pathways. Prompt-compression methods reduce sequence length, but they do not adapt the executed model subnetwork. We propose a unified compressed-sensing-guided framework for dynamic LLM execution. Random measurement operators probe latent model usage, sparse recovery estimates task-conditioned and token-adaptive support sets, and the recovered supports are compiled into hardware-efficient sparse execution paths over blocks, attention heads, channels, and feed-forward substructures. The framework introduces five key contributions: task-conditioned measurements, so different prompts induce different sparse supports; token-adaptive recovery, so active substructures are re-estimated during decoding; formal sample-complexity bounds under restricted isometry or mutual incoherence assumptions; compile-to-hardware constraints that restrict recovery to GPU-efficient structures; and a joint objective that unifies prompt compression with model reduction. Together, these components recast LLM inference as a measurement-and-recovery problem with explicit approximation guarantees and deployment-oriented speedup constraints.
[2] MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios
Yihang Ding,Wanke Xia,Yiting Zhao,Jinbo Su,Jialiang Yang,Zhengbo Zhang,Ke Wang,Wenming Yang
Main category: cs.CL
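TL;DR: This paper proposes MemGround, a long-term memory benchmark grounded in rich, gamified interactive scenarios. It evaluates surface state memory, temporal associative memory, and reasoning-based memory through a three-tier hierarchical framework, and introduces multi-dimensional metrics that quantify memory utilization and behavioral trajectories. Experiments show that current state-of-the-art LLMs and memory agents still struggle with sustained dynamic tracking, temporal event association, and complex reasoning over evidence accumulated in the long term.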
Details
Motivation: Existing evaluations of long-term memory in LLMs are too static: they focus on simple retrieval and short-context inference and neglect the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning.
Method: The MemGround benchmark builds a three-tier hierarchical framework (surface state memory, temporal associative memory, reasoning-based memory), designs specialized interactive tasks, and introduces a multi-dimensional metric suite (QA Overall, MFU, MFCO, ETD) for comprehensive evaluation.
Result: Current SOTA LLMs and memory agents perform poorly on sustained dynamic tracking, temporal event association, and complex reasoning based on long-term memory.
Conclusion: MemGround provides a more comprehensive, dynamic, and interactive benchmark for evaluating LLM long-term memory and exposes key weaknesses of existing models on complex memory tasks.
Abstract: Current evaluations of long-term memory in LLMs are fundamentally static. By fixating on simple retrieval and short-context inference, they neglect the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning in continuous interactions. To overcome these limitations, we propose MemGround, a rigorous long-term memory benchmark natively grounded in rich, gamified interactive scenarios. To systematically assess these capabilities, MemGround introduces a three-tier hierarchical framework that evaluates Surface State Memory, Temporal Associative Memory, and Reasoning-Based Memory through specialized interactive tasks. Furthermore, to comprehensively quantify both memory utilization and behavioral trajectories, we propose a multi-dimensional metric suite comprising Question-Answer Score (QA Overall), Memory Fragments Unlocked (MFU), Memory Fragments with Correct Order (MFCO), and Exploration Trajectory Diagrams (ETD). Extensive experiments reveal that state-of-the-art LLMs and memory agents still struggle with sustained dynamic tracking, temporal event association, and complex reasoning derived from long-term accumulated evidence in interactive environments.
[3] HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization
Baocai Shan,Yuzhuang Xu,Wanxiang Che
Main category: cs.CL
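TL;DR: HUOZIIME is a personalized, privacy-preserving, real-time mobile input method built on a lightweight large language model (LLM). It models user history through fine-tuning on synthetic data and a hierarchical memory mechanism, and is systematically optimized for mobile devices.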
Details
Motivation: Existing mobile input methods are limited to manual typing and struggle to generate personalized text. Lightweight LLMs make on-device generation feasible, but combining personalization, privacy, and real-time responsiveness remains a challenge.
Method: HUOZIIME (1) post-trains a base LLM on synthesized personalization data; (2) designs a hierarchical memory mechanism that continually models the user's input history; and (3) applies system-level optimizations for mobile constraints (such as inference efficiency and memory footprint).
Result: Experiments show that it runs efficiently on mobile devices and delivers high-fidelity, memory-driven personalized prediction.
Conclusion: HUOZIIME demonstrates the feasibility of deploying an LLM input method with long-term memory and personalization on resource-constrained devices, offering a new paradigm for privacy-first on-device generative interaction.
Abstract: Mobile input method editors (IMEs) are the primary interface for text input, yet they remain constrained to manual typing and struggle to produce personalized text. While lightweight large language models (LLMs) make on-device auxiliary generation feasible, enabling deeply personalized, privacy-preserving, and real-time generative IMEs poses fundamental challenges. To this end, we present HUOZIIME, a personalized on-device IME powered by an LLM. We endow HUOZIIME with initial human-like prediction ability by post-training a base LLM on synthesized personalization data. Notably, a hierarchical memory mechanism is designed to continually capture and leverage user-specific input history. Furthermore, we perform systemic optimizations tailored to on-device LLM-based IME deployment, ensuring efficient and responsive operation under mobile constraints. Experiments demonstrate efficient on-device execution and high-fidelity memory-driven personalization. Code and package are available at https://github.com/Shan-HIT/HuoziIME.
[4] Can Large Language Models Detect Methodological Flaws? Evidence from Gesture Recognition for UAV-Based Rescue Operation Based on Deep Learning
Domonkos Varga
Main category: cs.CL
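TL;DR: This paper examines whether large language models (LLMs) can act as independent analytical agents that identify common methodological flaws in machine learning papers (such as data leakage) and thereby improve reproducibility. In a case study of a gesture-recognition paper, all six mainstream LLMs consistently identified subject-level data leakage caused by non-independent training/test splits.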
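To make the flaw concrete, here is a minimal sketch of a subject-level split, the kind of partitioning that avoids the leakage this entry describes. The function name, the toy data, and the 25% test fraction are assumptions for illustration and are not taken from the paper.

```python
import random


def subject_level_split(samples, test_fraction=0.25, seed=0):
    """Split (subject_id, item) pairs so that no subject appears in both sets."""
    subjects = sorted({subject for subject, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    # Hold out whole subjects, never individual recordings.
    n_test = max(1, round(len(subjects) * test_fraction))
    held_out = set(subjects[:n_test])
    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test


# Hypothetical toy data: 8 subjects with 5 gesture recordings each.
samples = [(f"subject_{i}", f"recording_{i}_{j}") for i in range(8) for j in range(5)]
train, test = subject_level_split(samples)
train_subjects = {subject for subject, _ in train}
test_subjects = {subject for subject, _ in test}
print(sorted(test_subjects), len(train), len(test))
```

A random split over individual recordings would instead let the same person contribute to both sets, which is one way near-perfect test accuracy and overlapping learning curves can arise.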
Details
Motivation: Reliable evaluation is essential to machine learning research, yet methodological flaws, especially data leakage, keep undermining the validity of results; automated, objective tools for scientific auditing are needed.
Method: Using a gesture-recognition paper with potential data leakage as a case study, the author gives six frontier LLMs an identical prompt, has each analyze the paper's methods and results independently, and compares their diagnoses.
Result: All six LLMs consistently identify subject-level data leakage in the study, attribute it to non-independent training/test splits, and support the judgment with cues such as overlapping learning curves, a small generalization gap, and near-100% classification accuracy.
Conclusion: LLMs can independently detect common methodological problems from published paper content. They cannot replace human review, but they are promising complementary tools for improving reproducibility and supporting scientific auditing.
Abstract: Reliable evaluation is essential in machine learning research, yet methodological flaws, particularly data leakage, continue to undermine the validity of reported results. In this work, we investigate whether large language models (LLMs) can act as independent analytical agents capable of identifying such issues in published studies. As a case study, we analyze a gesture-recognition paper reporting near-perfect accuracy on a small, human-centered dataset. We first show that the evaluation protocol is consistent with subject-level data leakage due to non-independent training and test splits. We then assess whether this flaw can be detected independently by six state-of-the-art LLMs, each analyzing the original paper without prior context using an identical prompt. All models consistently identify the evaluation as flawed and attribute the reported performance to non-independent data partitioning, supported by indicators such as overlapping learning curves, minimal generalization gap, and near-perfect classification results. These findings suggest that LLMs can detect common methodological issues based solely on published artifacts. While not definitive, their consistent agreement highlights their potential as complementary tools for improving reproducibility and supporting scientific auditing.
[5] Decoupling Scores and Text: The Politeness Principle in Peer Review
Yingxuan Wen
Main category: cs.CL
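TL;DR: This paper studies how authors interpret peer review feedback and finds that numerical scores predict paper acceptance more accurately than review text. It shows that a pervasive Politeness Principle in review text makes it hard for authors to judge the real outcome from the wording.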
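The score-distribution statistics this entry relies on (skewness and kurtosis) can be illustrated with a small worked example. The score set below is hypothetical and uses plain population moments; the paper's exact estimator is not stated here.

```python
def score_moments(scores):
    """Return (mean, skewness, kurtosis) using population central moments."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m3 = sum((s - mean) ** 3 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a Gaussian
    return mean, skewness, kurtosis


# Hypothetical submission: four borderline-positive scores and one low outlier.
scores = [6, 6, 6, 6, 3]
mean, skewness, kurtosis = score_moments(scores)
print(f"mean={mean:.2f} skewness={skewness:.2f} kurtosis={kurtosis:.2f}")
```

For this toy case the mean sits near the borderline (5.4), the skewness is negative (-1.5), and the kurtosis (3.25) exceeds the Gaussian value of 3, which is the pattern the abstract associates with a single decisive low score.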
Details
Motivation: Authors often struggle to interpret peer review feedback correctly: polite wording can raise false hope, and specific low scores can cause confusion.
Method: The author builds a dataset of more than 30,000 ICLR submissions from 2021-2025, compares acceptance prediction from numerical scores against prediction from review text, and analyzes the gap from two angles: statistical features of the score distributions and the sentiment of the reviews.
Result: Score-based models reach 91% accuracy, whereas text-based models reach only 81% even with large language models; failure cases show that low scores are decisive; and review text broadly follows the "Politeness Principle", with reviews of rejected papers still containing more positive words and masking the true signal.
Conclusion: Numerical scores reflect review outcomes more reliably than text; polite phrasing weakens the discriminative power of the text, which suggests that review feedback needs to be clearer and more consistent.
Abstract: Authors often struggle to interpret peer review feedback, deriving false hope from polite comments or feeling confused by specific low scores. To investigate this, we construct a dataset of over 30,000 ICLR 2021-2025 submissions and compare acceptance prediction performance using numerical scores versus text reviews. Our experiments reveal a significant performance gap: score-based models achieve 91% accuracy, while text-based models reach only 81% even with large language models, indicating that textual information is considerably less reliable. To explain this phenomenon, we first analyze the 9% of samples that score-based models fail to predict, finding their score distributions exhibit high kurtosis and negative skewness, which suggests that individual low scores play a decisive role in rejection even when the average score falls near the borderline. We then examine why text-based accuracy significantly lags behind scores from a review sentiment perspective, revealing the prevalence of the Politeness Principle: reviews of rejected papers still contain more positive than negative sentiment words, masking the true rejection signal and making it difficult for authors to judge outcomes from text alone.
[6] SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models
Tomer Atia,Yehudit Aperstein,Alexander Apartsin
Main category: cs.CL
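TL;DR: This paper proposes SeaAlert, an LLM-based framework for robust analysis of maritime distress voice communications. It addresses the scarcity of labeled real-world data with a pipeline that generates diverse, realistic, noisy synthetic data, and improves parsing under ASR errors and speech distortion.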
Details
Motivation: Maritime VHF distress voice communications follow the GMDSS standard, but in practice they are hard to analyze automatically because they are brief, noisy, spoken under stress, deviate from the prescribed format, and suffer from ASR errors.
Method: SeaAlert uses an LLM-based synthetic data pipeline: an LLM generates diverse distress utterances (including variants that omit or replace standard terms), TTS synthesizes them into speech, simulated VHF channel noise is added, and an ASR system transcribes the result into noisy text close to real conditions.
Result: The method significantly improves accurate extraction of distress information (vessel name, position, nature of distress, and so on) under noise and ASR errors, validating synthetic data for low-resource robust speech understanding.
Conclusion: SeaAlert shows that controllable LLM generation of high-quality synthetic speech-text pairs can effectively ease the scarcity of real maritime distress data, offering a new paradigm for safety-critical speech understanding.
Abstract: Maritime distress communications transmitted over very high frequency (VHF) radio are safety-critical voice messages used to report emergencies at sea. Under the Global Maritime Distress and Safety System (GMDSS), such messages follow standardized procedures and are expected to convey essential details, including vessel identity, position, nature of the distress, and required assistance. In practice, however, automatic analysis remains difficult because distress messages are often brief, noisy, and produced under stress, may deviate from the prescribed format, and are further degraded by automatic speech recognition (ASR) errors caused by channel noise and speaker stress. This paper presents SeaAlert, an LLM-based framework for robust analysis of maritime distress communications. To address the scarcity of labeled real-world data, we develop a synthetic data generation pipeline in which an LLM produces realistic and diverse maritime messages, including challenging variants in which standard distress codewords are omitted or replaced with less explicit expressions. The generated utterances are synthesized into speech, degraded with simulated VHF noise, and transcribed by an ASR system to obtain realistic noisy transcripts.
[7] How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Zixian Huang,Kaichen Yang,Xu Huang,Feiyang Hao,Qiming Ge,Bowen Li,He Du,Kai Chen,Qipeng Guo
Main category: cs.CL
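TL;DR: This paper proposes TESSY, a framework in which teacher and student models cooperate to generate data. It addresses the performance drop that stylistically mismatched synthetic data causes when fine-tuning reasoning models, and significantly improves Qwen3-8B on code generation tasks.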
Details
Motivation: Supervised fine-tuning (SFT) on synthetic data generated by a stronger teacher model often fails to improve emerging reasoning models such as Qwen3-8B and can even hurt them, mainly because of a substantial stylistic gap between the teacher-generated data and the student model's data distribution.
Method: The Teacher-Student Cooperation Data Synthesis framework (TESSY) has the teacher and student models alternately generate style and non-style tokens, producing synthetic sequences that carry the teacher's advanced reasoning while matching the student's stylistic distribution.
Result: In code generation experiments with GPT-OSS-120B as teacher and Qwen3-8B as student, fine-tuning on conventional teacher-generated data lowers performance by 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY improves it by 11.25% and 6.68%, respectively.
Conclusion: Style alignment is key to effective SFT of reasoning models; TESSY's cooperative generation bridges the teacher-student style gap and significantly improves downstream reasoning performance.
Abstract: A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher-generated data and the student's distribution as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the distribution of the student. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.
[8] EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
Naman Ahuja,Saniya Mulla,Muhammad Ali Khan,Zaryab Bin Riaz,Kaneez Zahra Rubab Khakwani,Mohamad Bassam Sonbol,Irbaz Bin Riaz,Vivek Gupta
Main category: cs.CL
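TL;DR: EviSearch is a multi-agent extraction system that automatically builds ontology-aligned evidence tables from native clinical trial PDFs and guarantees per-cell provenance to support auditing and human verification.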
Details
Motivation: To accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM-based extraction into evidence synthesis pipelines.
Method: The system combines a PDF-query agent (which preserves layout and figures), a retrieval-guided search agent, and a reconciliation module that forces page-level verification, enabling high-precision extraction across multimodal evidence sources (text, tables, figures); it logs reconciler decisions and reviewer edits to produce supervision signals.
Result: On a clinician-curated benchmark of oncology trial papers, EviSearch substantially outperforms strong parsed-text baselines while providing comprehensive attribution coverage.
Conclusion: EviSearch improves the accuracy and traceability of clinical evidence extraction, supports iterative model improvement, and advances automated, trustworthy evidence synthesis.
Abstract: We present EviSearch, a multi-agent extraction system that automates the creation of ontology-aligned clinical evidence tables directly from native trial PDFs while guaranteeing per-cell provenance for audit and human verification. EviSearch pairs a PDF-query agent (which preserves rendered layout and figures) with a retrieval-guided search agent and a reconciliation module that forces page-level verification when agents disagree. The pipeline is designed for high-precision extraction across multimodal evidence sources (text, tables, figures) and for generating reviewer-actionable provenance that clinicians can inspect and correct. On a clinician-curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy relative to strong parsed-text baselines while providing comprehensive attribution coverage. By logging reconciler decisions and reviewer edits, the system produces structured preference and supervision signals that bootstrap iterative model improvement. EviSearch is intended to accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM-based extraction into evidence synthesis pipelines.
[9] Hierarchical Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text
Filippo Morbiato,Markus Keller,Priya Nair,Luca Romano
Main category: cs.CL
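TL;DR: This paper proposes H-TechniqueRAG, a hierarchical retrieval-augmented generation framework that incorporates the ATT&CK tactic-technique hierarchy and significantly improves the accuracy, efficiency, and interpretability of mapping CTI text to ATT&CK technique IDs.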
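The two-stage idea (pick tactics first, then search only the techniques under them) can be sketched as below. This is a toy illustration, not the paper's implementation: token overlap stands in for whatever retriever the authors use, the keyword strings are invented, and only a handful of ATT&CK entries are included.

```python
def overlap(query_tokens, text):
    """Toy relevance score: number of query tokens that appear in the text."""
    return len(query_tokens & set(text.lower().split()))


# Tiny hand-written slice of the tactic -> technique hierarchy (keywords are hypothetical).
TAXONOMY = {
    "TA0001 Initial Access": {
        "keywords": "gain entry foothold phishing email exploit public-facing",
        "techniques": {
            "T1566 Phishing": "phishing email malicious attachment link",
            "T1190 Exploit Public-Facing Application": "exploit public-facing web server vulnerability",
        },
    },
    "TA0006 Credential Access": {
        "keywords": "steal credentials passwords hashes dump brute",
        "techniques": {
            "T1003 OS Credential Dumping": "dump credentials hashes lsass memory",
            "T1110 Brute Force": "brute force guess passwords repeated login",
        },
    },
}


def hierarchical_retrieve(query, top_tactics=1, top_techniques=1):
    tokens = set(query.lower().split())
    # Stage 1: rank tactics and keep only the best few.
    tactics = sorted(
        TAXONOMY, key=lambda t: overlap(tokens, TAXONOMY[t]["keywords"]), reverse=True
    )[:top_tactics]
    # Stage 2: rank techniques, restricted to the selected tactics.
    candidates = [
        (overlap(tokens, description), technique, tactic)
        for tactic in tactics
        for technique, description in TAXONOMY[tactic]["techniques"].items()
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(technique, tactic) for _, technique, tactic in candidates[:top_techniques]]


result = hierarchical_retrieve("The actor sent a phishing email with a malicious attachment")
print(result)
```

Restricting stage 2 to the selected tactics is what shrinks the candidate search space; the size of that reduction in practice depends on how many tactics are kept.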
Details
Motivation: Existing RAG methods for mapping CTI text to ATT&CK technique IDs ignore the framework's inherent tactic-technique hierarchy, which leads to inefficient retrieval and poor interpretability.
Method: H-TechniqueRAG uses two-stage hierarchical retrieval (tactics first, then the techniques under them) and adds a tactic-aware reranking module and a hierarchy-constrained context organization strategy to mitigate LLM context overload and improve reasoning precision.
Result: On three CTI datasets, F1 is 3.8% higher than the SOTA TechniqueRAG, inference latency drops by 62.4%, and LLM API calls fall by 60%; the method also shows stronger cross-domain generalization and interpretable decision paths.
Conclusion: Injecting the ATT&CK hierarchy into the RAG framework as a strong inductive bias balances performance, efficiency, and interpretability, offering a new paradigm for automated CTI analysis.
Abstract: Mapping Cyber Threat Intelligence (CTI) text to MITRE ATT&CK technique IDs is a critical task for understanding adversary behaviors and automating threat defense. While recent Retrieval-Augmented Generation (RAG) approaches have demonstrated promising capabilities in this domain, they fundamentally rely on a flat retrieval paradigm. By treating all techniques uniformly, these methods overlook the inherent taxonomy of the ATT&CK framework, where techniques are structurally organized under high-level tactics. In this paper, we propose H-TechniqueRAG, a novel hierarchical RAG framework that injects this tactic-technique taxonomy as a strong inductive bias to achieve highly efficient and accurate annotation. Our approach introduces a two-stage hierarchical retrieval mechanism: it first identifies the macro-level tactics (the adversary's technical goals) and subsequently narrows the search to techniques within those tactics, effectively reducing the candidate search space by 77.5%. To further bridge the gap between retrieval and generation, we design a tactic-aware reranking module and a hierarchy-constrained context organization strategy that mitigates LLM context overload and improves reasoning precision. Comprehensive experiments across three diverse CTI datasets demonstrate that H-TechniqueRAG not only outperforms the state-of-the-art TechniqueRAG by 3.8% in F1 score, but also achieves a 62.4% reduction in inference latency and a 60% decrease in LLM API calls. Further analysis reveals that our hierarchical structural priors equip the model with superior cross-domain generalization and provide security analysts with highly interpretable, step-by-step decision paths.
[10] Chinese Essay Rhetoric Recognition Using LoRA, In-context Learning and Model Ensemble
Yuxuan Lai,Xiajing Wang,Chen Zheng
Main category: cs.CL
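TL;DR: This paper uses large language models (LLMs) with LoRA fine-tuning and in-context learning to perform Chinese essay rhetoric recognition with structured JSON output (keys translated into Chinese), and further improves performance with model ensembling, ranking first on all three tracks of the CCL 2025 evaluation.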
Details
Motivation: Rhetoric recognition is a key part of automated essay scoring and helps assess students' linguistic ability and higher-order thinking, but Chinese rhetoric recognition is under-studied and needs effective methods.
Method: LoRA-based fine-tuning and in-context learning inject rhetoric knowledge into LLMs; outputs are formatted as structured JSON with keys translated into Chinese; several model ensemble methods are also explored.
Result: The method achieves the best performance on all three tracks of the CCL 2025 Chinese essay rhetoric recognition evaluation, winning first prize.
Conclusion: Structured output and lightweight fine-tuning of LLMs can effectively improve Chinese rhetoric recognition, confirming the approach's feasibility and competitiveness for AI-in-education applications.
Abstract: Rhetoric recognition is a critical component in automated essay scoring. By identifying rhetorical elements in student writing, AI systems can better assess linguistic and higher-order thinking skills, making it an essential task in the area of AI for education. In this paper, we leverage Large Language Models (LLMs) for the Chinese rhetoric recognition task. Specifically, we explore Low-Rank Adaptation (LoRA) based fine-tuning and in-context learning to integrate rhetoric knowledge into LLMs. We formulate the outputs as JSON to obtain structural outputs and translate keys to Chinese. To further enhance the performance, we also investigate several model ensemble methods. Our method achieves the best performance on all three tracks of CCL 2025 Chinese essay rhetoric recognition evaluation task, winning the first prize.
[11] SAGE Celer 2.6 Technical Card
SAGEA Research Team,Basab Jha,Firoj Paudel,Ujjwal Puri,Adrian Liu,Ethan Henkel,Zhang Yuting,Mateusz Kowalczyk,Mei Huang,Choi Donghyuk,Wang Junhao
Main category: cs.CL
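TL;DR: SAGE Celer 2.6 is SAGEA's new general-purpose large model. It comes in several parameter sizes (5B/10B/27B), is trained with an Inverse Reasoning (IR) mechanism, has native multimodal capability (an end-to-end vision encoder), and performs well on mathematics, coding, general intelligence, and South Asian languages (such as Nepali and Hindi), while keeping latency low and English reasoning strong.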
Details
Motivation: To address cascading errors and hallucination in complex reasoning, overcome common pitfalls of adapter-based multimodal approaches, and strengthen support for South Asian languages (especially the Devanagari script), balancing multilingual and high-performance needs.
Method: An Inverse Reasoning (IR) pipeline trains the model to validate its own logic paths; an end-to-end vision encoder provides native multimodality; a custom tokenizer targets the Devanagari script; the model is further pre-trained from an undisclosed model with architectural modifications.
Result: Highly competitive results on mathematics, coding, and general intelligence benchmarks such as ACUMEN; low inference latency; strong performance on Nepali and Hindi tasks without harming English reasoning.
Conclusion: Celer 2.6 is an advanced general-purpose model optimized for South Asian languages that combines strong reasoning, low latency, and native multimodal support, representing a new direction for regional and specialized large models.
Abstract: We introduce SAGE Celer 2.6, the latest in our line of general-purpose Celer models from SAGEA. Celer 2.6 is available in 5B, 10B, and 27B parameter sizes and benefits from extensive architectural modifications and further pre-training on an undisclosed model. Using our Inverse Reasoning (IR) pipeline, SAGEA natively trains Celer 2.6 to validate its own logic paths, minimizing cascading error and hallucination in complex reasoning tasks. Celer 2.6 also boasts natively integrated multimodal functionality with an end-to-end vision encoder to avoid common pitfalls in adapter-based approaches. Celer 2.6 provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN), along with low latency. Most importantly, Celer 2.6 is specifically optimized for South Asian language support, with a custom tokenizer for the Devanagari script and strong performance in both Nepali and Hindi without sacrificing English reasoning ability.
[12] Chronological Knowledge Retrieval: A Retrieval-Augmented Generation Approach to Construction Project Documentation
Ioannis-Aris Kostis,Natalia Sanchiz,Steeve De Schryver,François Denis,Pierre Schaus
Main category: cs.CL
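TL;DR: This paper proposes a RAG-based conversational system for retrieving time-annotated decision histories from construction project meeting minutes; it supports natural-language queries and returns semantically relevant, timestamped answers.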
Details
Motivation: Decisions evolve continuously in large construction projects and meeting minutes are voluminous; tracing the history of a specific decision by hand is slow and error-prone, so an efficient, accurate retrieval method that preserves the timeline is needed.
Method: A Retrieval-Augmented Generation (RAG) framework combines semantic search with large language models to give conversational access to meeting minutes; it is validated on a real industry dataset (meeting minutes from a completed project by a large Belgian company), supplemented by expert-annotated queries for system evaluation.
Result: The authors build a publicly available annotated dataset and an open-source implementation, and show that the approach supports semantically relevant, explicitly time-annotated question answering over decision histories.
Conclusion: The solution significantly improves the efficiency and accuracy with which engineering professionals can interactively search chronological project documentation, provides a reusable technical path for knowledge management in construction, and promotes open collaboration in related research.
Abstract: In large-scale construction projects, the continuous evolution of decisions generates extensive records, most often captured in meeting minutes. Since decisions may override previous ones, professionals often need to reconstruct the history of specific choices. Retrieving such information manually from raw archives is both labor-intensive and error-prone. From a user perspective, we address this challenge by enabling conversational access to the whole set of project meeting minutes. Professionals can pose natural-language questions and receive answers that are both semantically relevant and explicitly time-annotated, allowing them to follow the chronology of decisions. From a technical perspective, our solution employs a Retrieval-Augmented Generation (RAG) framework that integrates semantic search with large language models to ensure accurate and context-aware responses. We demonstrate the approach using an anonymized, industry-sourced dataset of meeting minutes from a completed construction project by a large company in Belgium. The dataset is annotated and enriched with expert-defined queries to support systematic evaluation. Both the dataset and the open-source implementation are made available to the community to foster further research on conversational access to time-annotated project documentation.
[13] Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning
Qi Dong,Ziheng Lin,Ning Ding
Main category: cs.CL
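TL;DR: This paper proposes a stateful, evidence-driven, iterative RAG framework that improves the robustness and stability of question answering by maintaining a persistent evidence pool and iteratively refining queries.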
Details
Motivation: Conventional RAG suffers from flat context representations and stateless retrieval, which lead to unstable performance.
Method: Question answering is modeled as progressive evidence accumulation: retrieved documents are converted into structured reasoning units with relevance and confidence signals and stored in a persistent evidence pool; evidence-driven deficiency analysis identifies information gaps and conflicts, and queries are refined iteratively to guide subsequent retrieval.
Result: The framework consistently outperforms standard RAG and multi-step baselines on multiple QA benchmarks, stays stable under substantial retrieval noise, and effectively accumulates high-quality evidence.
Conclusion: Statefulness and iterative reasoning significantly improve RAG's robustness, stability, and use of evidence.
Abstract: Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) in external knowledge but often suffers from flat context representations and stateless retrieval, leading to unstable performance. We propose Stateful Evidence-Driven RAG with Iterative Reasoning, a framework that models question answering as a progressive evidence accumulation process. Retrieved documents are converted into structured reasoning units with explicit relevance and confidence signals and maintained in a persistent evidence pool capturing both supportive and non-supportive information. The framework performs evidence-driven deficiency analysis to identify gaps and conflicts and iteratively refines queries to guide subsequent retrieval. This iterative reasoning process enables stable evidence aggregation and improves robustness to noisy retrieval. Experiments on multiple question answering benchmarks demonstrate consistent improvements over standard RAG and multi-step baselines, while effectively accumulating high-quality evidence and maintaining stable performance under substantial retrieval noise.
[14] Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali
Ananda Rimal,Adarsha Rimal
Main category: cs.CL
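TL;DR: This study systematically evaluates how well three open-weight LLMs, Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B, adapt to Romanized Nepali. Comparing zero-shot with fine-tuning (QLoRA/rsLoRA), it finds that Qwen3-8B has the best zero-shot behavior while Llama-3.1-8B gains the most from fine-tuning, providing the first rigorous baseline for adapting to this low-resource language.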
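As background on the rsLoRA setting mentioned in this entry, the sketch below contrasts the adapter scaling factor of standard LoRA (alpha / r) with the rank-stabilized variant (alpha / sqrt(r)). The alpha value is hypothetical; the entry states only the rank r = 32.

```python
import math


def lora_scale(alpha, rank):
    """Standard LoRA multiplies the low-rank update by alpha / r."""
    return alpha / rank


def rslora_scale(alpha, rank):
    """Rank-stabilized LoRA multiplies the update by alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)


alpha = 16  # hypothetical value, not reported in this entry
for rank in (8, 32, 128):
    print(rank, lora_scale(alpha, rank), round(rslora_scale(alpha, rank), 3))
```

With standard scaling the update shrinks quickly as the rank grows, while the square-root form decays more slowly, which is the motivation usually given for rsLoRA at higher ranks.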
Details
Motivation: Romanized Nepali is the main medium of informal digital communication in Nepal but is severely under-resourced in the LLM landscape; a comparable, reproducible adaptation baseline is needed.
Method: Three open-weight models of comparable size are evaluated zero-shot and after QLoRA/rsLoRA fine-tuning (r=32, training only about 1% of parameters) on a 10,000-sample bilingual instruction dataset, using five metrics across seven dimensions: PPL, BERTScore, chrF++, the ROUGE family, and BLEU.
Result: All three models fail at zero-shot; after fine-tuning they reach a BERTScore of about 0.75 and chrF++ above 23. Qwen3-8B is the only model to produce semantically relevant zero-shot output and leads the structural alignment metrics; Llama-3.1-8B shows the largest gains, with PPL dropping by 49.77 and BERTScore rising by 0.3287.
Conclusion: Qwen3-8B is the overall recommended architecture, and Llama-3.1-8B is best suited to iterative low-resource development; the study establishes the first rigorous adaptation baseline for Romanized Nepali among comparable open-weight models.
Abstract: Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically underresourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. We evaluate these architectures under zero-shot and fine-tuned settings using a curated bilingual dataset of 10,000 transliterated instruction-following samples. Performance is quantified across five metrics spanning seven measurement dimensions: Perplexity (PPL), BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, capturing fluency, phonetic consistency, and semantic integrity. Models were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) with Rank-Stabilized LoRA (rsLoRA) at rank r=32 on dual NVIDIA Tesla T4 GPUs, training only approximately 1% of each model's parameters in under 27 total GPU-hours. At zero-shot, all three models fail to generate Romanized Nepali, each exhibiting a distinct architecture-specific failure mode. Following fine-tuning, all three resolve these failures and converge to BERTScore approximately 0.75 and chrF++ greater than 23. Overall dimension-wise assessment across ten criteria identifies Qwen3-8B as the overall recommended architecture, being the only model to produce semantically relevant zero-shot output and leading all structural alignment metrics post-SFT. The adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline, achieves the largest absolute fine-tuning gains in PPL (Delta = -49.77) and BERTScore (Delta = +0.3287), making it the preferred choice for iterative low-resource development pipelines. This work establishes the first rigorous baseline for Romanized Nepali adaptation in comparable-sized open-weight LLMs.
[15] Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations
Ziyin Zhou,Jianyi Zhang,Xu ji,Yilong Li,Jiameng Han,Zhangchi Zhao
Main category: cs.CL
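TL;DR: This paper proposes CRVA-TGRAG, a two-stage framework (improved retrieval plus teacher-guided preference-optimization fine-tuning) that addresses the knowledge conflicts and hallucinations caused by outdated LLM knowledge in CVE vulnerability analysis, and significantly improves retrieval accuracy for the latest CVE information and the reliability of answers.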
Details
Motivation: LLMs used for vulnerability analysis suffer from outdated knowledge: more than 30,000 CVEs have been modified or updated in the past decade, so training data diverges from current knowledge, causing knowledge conflicts, factual errors, and hallucinated output.
Method: The two-stage CRVA-TGRAG framework (1) improves CVE document retrieval with Parent Document Segmentation and an ensemble retriever that fuses semantic similarity with inverted indexing, and (2) fine-tunes the LLM with teacher-guided preference optimization so that it answers more precisely from the retrieved results.
Result: Experiments show higher accuracy in retrieving the latest CVEs than external knowledge bases, and the method mitigates the knowledge conflicts and inconsistencies that arise from relying on the LLM alone.
Conclusion: By combining improved RAG with preference fine-tuning, CRVA-TGRAG significantly improves LLM reliability and consistency when CVE knowledge changes over time, offering a scalable conflict-mitigation approach for LLM applications in security.
Abstract: Large Language Models (LLMs) are essential for analyzing and addressing vulnerabilities in cybersecurity. However, of the more than 200,000 vulnerabilities discovered in the past decade, more than 30,000 have been changed or updated. This necessitates frequent updates to the training datasets and internal knowledge bases of LLMs to maintain knowledge consistency. In this paper, we focus on the problem of knowledge discrepancy and conflict within CVE (Common Vulnerabilities and Exposures) detection and analysis. This problem hinders LLMs' ability to retrieve the latest knowledge from original training datasets, leading to knowledge conflicts, fabrications of factually incorrect results, and generation hallucinations. To address this problem, we propose an innovative two-stage framework called CRVA-TGRAG (Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generation). First, to improve document retrieval accuracy during the retrieval stage, we utilize Parent Document Segmentation and an ensemble retrieval scheme based on semantic similarity and inverted indexing. Second, to enhance LLMs' capabilities based on the retrieval of the CVE dataset in the generation stage, we employ a teacher-guided preference optimization technique to fine-tune LLMs. Our framework not only enhances the quality of content retrieval through RAG but also leverages the advantages of preference fine-tuning in LLMs to answer questions more effectively and precisely. Experiments demonstrate our method achieves higher accuracy in retrieving the latest CVEs compared to external knowledge bases. In conclusion, our framework significantly mitigates potential knowledge conflicts and inconsistencies that may arise from relying solely on LLMs for knowledge retrieval.
[16] Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters
Bryan Sanchez
Main category: cs.CL
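TL;DR: This paper proposes a lightweight post-transformer adapter with only 786K parameters, trained on frozen hidden states, that effectively mitigates the suppression of factual probabilities on politically sensitive topics in alignment-tuned language models. It validates generalization and generation coherence across several Qwen3 model sizes and reveals a silent gradient bug in the MLX framework that affected adapter training.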
Details
Motivation: Alignment-tuned language models often suppress factual log-probabilities on politically sensitive topics even though the knowledge remains in their hidden representations; a low-overhead, high-fidelity intervention is needed to restore factual output.
Method: Two lightweight post-transformer adapter designs (SwiGLU-gated and linear bottleneck, about 0.02% of parameters) are trained with supervision on frozen hidden states; anchored training prevents knowledge forgetting; different application positions (all positions vs. last position only) and a logit-space adapter are compared; the silent gradient bug in MLX is reproduced and located.
Result: The adapter corrects suppression on 31 ideology-discriminating facts and generalizes to 11-39% of 16 held-out facts (5 random splits) with no knowledge regressions; last-position application yields coherent, less censored text; the logit-space adapter fails; and a silent zero-gradient bug involving nn.value_and_grad in MLX is found and fixed.
Conclusion: Lightweight hidden-state adapters are an effective and practical way to correct factual suppression in aligned models; the application position is critical; and framework-level implementation details such as gradient computation materially affect adapter research and must be verified carefully.
Abstract: Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11-39% of 16 held-out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log-probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last-position-only), the adapter produces coherent, less censored text. A logit-space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden-state intervention is the correct level for generation correction. A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern nn.value_and_grad(model, fn)(model.parameters()) returns zero gradients without error; the correct pattern nn.value_and_grad(model, fn)(model, data) resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.
[17] QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment
Mohammad AL-Smadi
Main category: cs.CL
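TL;DR: This paper presents a unified system for Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA shared task, using two-stage QLoRA fine-tuning of Qwen3-4B and a weighted retrieval ensemble, respectively. It achieves good results on the test set, identifies distinguishing relevant from irrelevant clinical sentences with very few examples as the core challenge, and recommends data augmentation as the priority for future work.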
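The Subtask 4 recipe (combine several retrievers with weights, select evidence sentences, score with micro-F1) can be sketched as follows. The weights, the per-sentence scores, and the fixed 0.5 cutoff are all hypothetical; in particular, the cutoff is a simplification and not the relative thresholding the system applies to BM25.

```python
def ensemble_scores(per_method_scores, weights):
    """Weighted sum of per-sentence scores from several retrievers."""
    n = len(next(iter(per_method_scores.values())))
    return [
        sum(weights[m] * per_method_scores[m][i] for m in weights) for i in range(n)
    ]


def micro_f1(predicted_sets, gold_sets):
    """Micro-averaged F1: pool true/false positives and negatives over all cases."""
    tp = fp = fn = 0
    for predicted, gold in zip(predicted_sets, gold_sets):
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical normalized scores for the 4 sentences of one clinical note.
scores = {
    "bm25": [0.9, 0.1, 0.4, 0.0],
    "tfidf": [0.8, 0.2, 0.5, 0.1],
    "cross_encoder": [0.7, 0.1, 0.9, 0.2],
}
weights = {"bm25": 0.3, "tfidf": 0.2, "cross_encoder": 0.5}  # assumed weights
combined = ensemble_scores(scores, weights)
predicted = {i for i, score in enumerate(combined) if score >= 0.5}

# Second case is hard-coded to show how errors pool across cases.
f1 = micro_f1([predicted, {1, 3}], [{0, 2}, {1}])
print(sorted(predicted), round(f1, 4))
```

Because counts are pooled before computing precision and recall, cases with many sentences weigh more than short ones, which differs from averaging per-case F1.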
Details
Motivation: Subtasks 3 and 4 of the ArchEHR-QA shared task both suffer from poor generalization due to extremely small annotated data (only 20 cases), calling for a unified solution that balances domain adaptation with task-specific requirements.
Method: Subtask 3: two-stage QLoRA fine-tuning of a 4-bit quantized Qwen3-4B, first for clinical domain adaptation on emrQA-MedSQuAD and then for task style on the 20 annotated cases. Subtask 4: a weighted retrieval ensemble of BM25 (with relative thresholding), TF-IDF cosine similarity, and a fine-tuned cross-encoder.
Result: Subtask 3 scores 32.87 overall on test-2026 (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04); Subtask 4 reaches a micro-F1 of 67.16 on the 100-case test set.
Conclusion: Both subtasks expose the same bottleneck: with only 20 cases, models cannot reliably separate relevant from irrelevant sentences in clinical text, and data augmentation is the most promising next step.
Abstract: We present a unified system addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA Shared Task. For Subtask 3, we apply two-stage Quantised Low-Rank Adaptation (QLoRA) to Qwen3-4B loaded in 4-bit NF4 quantisation: first on 30,000 samples from the emrQA-MedSQuAD corpus to establish clinical domain competence, then on the 20 annotated development cases to learn the task-specific output style. Our system achieves an overall score of 32.87 on the official test-2026 split (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04). For Subtask 4, we develop a weighted ensemble of three retrieval methods (BM25 with relative thresholding, TF-IDF cosine similarity, and a fine-tuned cross-encoder) to identify note sentences supporting a given gold answer, achieving a micro-F1 of 67.16 on the 100-case test set. Experiments reveal that both subtasks expose the same fundamental challenge: 20 annotated training cases are insufficient to distinguish relevant from irrelevant clinical sentences, pointing to data augmentation as the highest-leverage future direction.
[18] Listen, Correct, and Feed Back: Spoken Pedagogical Feedback Generation
Junhong Liang,Yifan Lu,Ekaterina Kochmar,Fajri Koto
Main category: cs.CL
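TL;DR: This paper introduces the SPFG dataset for generating spoken, pedagogically friendly grammatical error correction and feedback, and compares supervised fine-tuning (SFT) with preference alignment (DPO/KTO) on jointly generating corrections and feedback. SFT proves more stable and effective, and correction quality is only weakly correlated with feedback quality.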
Details
Motivation: Although GEC and GEE research has progressed rapidly, it lacks learner-friendly pedagogical feedback suited to real teaching scenarios, that is, feedback that is actionable, matched to the learner's level, and encouraging. Method: Build the SPFG dataset (based on the Speak & Improve Challenge 2025 corpus), containing spoken transcriptions with GEC targets and human-verified teacher-style feedback (including preference pairs); in a transcript-based SGEC setting, use three instruction-tuned LLMs (Qwen2.5, Llama-3.1, and GLM-4) to compare SFT with DPO/KTO preference alignment on jointly generating corrections and feedback. Result: SFT yields the most consistent improvements in both correction and feedback generation; DPO/KTO gains are smaller or unstable; correction quality and feedback quality are weakly correlated. Conclusion: Pedagogical spoken feedback generation must balance linguistic accuracy with educational appropriateness; SFT currently remains the more reliable base training paradigm; SPFG offers a new benchmark and resource for follow-up research. Abstract: Grammatical error correction (GEC) and explanation (GEE) have made rapid progress, but real teaching scenarios also require learner-friendly pedagogical feedback that is actionable, level-appropriate, and encouraging. We introduce SPFG (Spoken Pedagogical Feedback Generation), a dataset built on the Speak & Improve Challenge 2025 corpus, pairing fluency-oriented transcriptions with GEC targets and human-verified teacher-style feedback, including preferred/rejected feedback pairs for preference learning. We study a transcript-based Spoken Grammatical Error Correction (SGEC) setting and evaluate three instruction-tuned LLMs (Qwen2.5, Llama-3.1, and GLM-4), comparing supervised fine-tuning (SFT) with preference-based alignment (using DPO and KTO) for jointly generating corrections and feedback. Results show that SFT provides the most consistent improvements, while DPO/KTO yield smaller or mixed gains, and that correction quality and feedback quality are weakly coupled. Our implementation is available at https://github.com/Skywalker-Harrison/spfg.
[19] An Underexplored Frontier: Large Language Models for Rare Disease Patient Education and Communication -- A scoping review
Zaifu Zhan,Yu Hou,Kai Yu,Min Zeng,Anita Burgun,Xiaoyi Chen,Rui Zhang
Main category: cs.CL
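TL;DR: This scoping review analyzes 12 studies published between January 2022 and March 2026 on the use of large language models (LLMs) for rare disease patient education and communication. Current work relies mostly on general-purpose models (such as ChatGPT), focuses on static question answering, and evaluates accuracy while neglecting patient-centered dimensions such as empathy and readability. The field is at an early stage and needs patient-facing, domain-adapted, real-world research.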
Details
Motivation: Patients with rare diseases face complex care pathways, scarce clinical expertise, and unmet long-term communication needs; although LLMs are emerging in health care, their specific value and current state of application in rare diseases remain unclear. Method: A scoping review that systematically searched major databases for literature from January 2022 to March 2026, selected 12 studies on LLM use for rare disease patient education and communication, and extracted and qualitatively analyzed their study characteristics, application scenarios, model types, and evaluation methods. Result: Existing studies are highly recent and dominated by general-purpose LLMs (especially ChatGPT); application scenarios are narrow (mostly question answering over manually curated question sets); real-world data and longitudinal communication modeling are lacking; evaluation emphasizes accuracy and neglects patient-centered metrics such as readability, empathy, and communication quality; multilingual support is almost absent. Conclusion: The application of LLMs to rare disease patient communication is still in its infancy; future research should emphasize patient-centered design, domain-adapted methods, and real-world deployment to achieve safe, adaptive, and effective communication support. Abstract: Rare diseases affect over 300 million people worldwide and are characterized by complex care pathways, limited clinical expertise, and substantial unmet communication needs throughout the long patient journey. Recent advances in large language models (LLMs) offer new opportunities to support patient education and communication, yet their application in rare diseases remains unclear. We conducted a scoping review of studies published between January 2022 and March 2026 across major databases, identifying 12 studies on LLM-based rare disease patient education and communication. Data were extracted on study characteristics, application scenarios, model usage, and evaluation methods, and synthesized using descriptive and qualitative analyses. The literature is highly recent and dominated by general-purpose models, particularly ChatGPT. Most studies focus on patient question answering using curated question sets, with limited use of real-world data or longitudinal communication scenarios. Evaluations are primarily centered on accuracy, with limited attention to patient-centered dimensions such as readability, empathy, and communication quality. Multilingual communication is rarely addressed. Overall, the field remains at an early stage. Future research should prioritize patient-centered design, domain-adapted methods, and real-world deployment to support safe, adaptive, and effective communication in rare diseases.
[20] Internal Knowledge Without External Expression: Probing the Generalization Boundary of a Classical Chinese Language Model
Jiuting Chen,Yuan Lian,Hao Wu,Tianqi Huang,Hiroshi Sasaki,Makoto Kouno,Jongil Choi
Main category: cs.CL
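TL;DR: This paper trains a 318M-parameter Transformer language model purely on Classical Chinese and finds through OOD testing that the model internally distinguishes real from fabricated historical events (showing genuine factual encoding) but never expresses uncertainty in its generated text. This dissociation between internal knowledge and external expression holds across languages and model scales, indicating that language modeling alone does not spontaneously produce metacognitive expression (such as "I am not sure") and that explicit training signals such as RLHF are required.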
Details
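Motivation: To investigate whether large language models can spontaneously develop metacognitive ability, that is, express awareness of their own knowledge boundaries in generated text (such as "I don't know") rather than relying only on internal statistical uncertainty. Method: Train a Transformer from scratch on a pure Classical Chinese corpus (1.56 billion tokens); design systematic OOD tests (real / fabricated / semi-fabricated historical events) to measure internal perplexity; analyze how often Classical Chinese epistemic markers of uncertainty (such as '或曰' and '盖') appear in generated text; replicate across languages (Chinese/English/Japanese) and model scales (110M-1.56B). Result: Internally, perplexity rises significantly for fabricated events (2.39×) and is highest for semi-fabricated events (4.24×), demonstrating factual encoding; externally, the model uses uncertainty markers less often on OOD questions (3.5% vs 8.3%), and this pattern is determined by training-data conventions (for example, Classical Chinese models show a "humility paradox", while Japanese models almost never hedge). Conclusion: Metacognitive expression (such as actively declaring ignorance) does not emerge spontaneously from pure language modeling and must be trained with explicit supervision signals such as reinforcement learning from human feedback (RLHF); a language model's "knowing" and its "saying whether it knows" are decoupled abilities. Abstract: We train a 318M-parameter Transformer language model from scratch on a curated corpus of 1.56 billion tokens of pure Classical Chinese, with zero English characters or Arabic numerals. Through systematic out-of-distribution (OOD) testing, we investigate whether the model can distinguish known from unknown inputs, and crucially, whether it can express this distinction in its generated text. We find a clear dissociation between internal and external uncertainty. Internally, the model exhibits a perplexity jump ratio of 2.39x between real and fabricated historical events (p = 8.9e-11, n = 92 per group), with semi-fabricated events (real figures + fictional events) showing the highest perplexity (4.24x, p = 1.1e-16), demonstrating genuine factual encoding beyond syntactic pattern matching. Externally, however, the model never learns to express uncertainty: classical Chinese epistemic markers appear at lower rates for OOD questions (3.5%) than for in-distribution questions (8.3%, p = 0.023), reflecting rhetorical conventions rather than genuine metacognition. We replicate both findings across three languages (Classical Chinese, English, Japanese), three writing systems, and eight models from 110M to 1.56B parameters. We further show that uncertainty expression frequency is determined entirely by training data conventions, with Classical Chinese models showing a "humility paradox" (more hedging for known topics), while Japanese models almost never hedge.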
We argue that metacognitive expression -- the ability to say "I don't know" -- does not emerge from language modeling alone and requires explicit training signals such as RLHF.
[21] Attention to Mamba: A Recipe for Cross-Architecture Distillation
Abhinav Moudgil,Ningyuan Huang,Eeshan Gunesh Dhekane,Pau Rodríguez,Luca Zappella,Federico Danieli
Main category: cs.CL
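TL;DR: This paper proposes a two-stage knowledge distillation method that effectively distills a Transformer (such as Pythia-1B) into a pure SSM architecture (Mamba). The key is a principled initialization for Mamba based on the linear-attention kernel trick, which avoids hybrid architectures and recovers performance in a pure SSM model at a similar perplexity (14.11 vs. 13.86).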
Details
Motivation: SSMs (such as Mamba) offer inference-efficiency advantages but lack a mature pretraining ecosystem, while Transformers have abundant pretrained models but distill poorly into Mamba when done directly across architectures. An efficient distillation scheme is therefore needed that does not depend on Attention modules yet makes full use of existing Transformer knowledge. Method: A two-stage distillation framework: the first stage uses a kernel trick to distill the Transformer into a linearized Attention model; the second stage distills that linearized model into an adapted Mamba architecture (with no Attention blocks), for which a principled initialization strategy is designed. Result: The distilled pure-Mamba model matches the Pythia-1B teacher on downstream tasks, reaching a perplexity of 14.11 (teacher: 13.86); ablation, scaling, and sensitivity analyses at 1B scale with 10B tokens support the robustness and effectiveness of the method. Conclusion: With principled initialization and staged linearized distillation, knowledge can be transferred efficiently from a Transformer to a pure SSM architecture without hybrid modules, offering a viable path toward practical SSM deployment and a pretraining ecosystem for SSMs. Abstract: State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block. Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11 close to the teacher's 13.86.
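To illustrate why a linearized form of Attention is a natural stepping stone toward an SSM, the following is a minimal sketch of causal linear attention computed two ways: as an explicit sum over past tokens, and as a recurrence over a fixed-size state. The abstract does not specify the authors' adaptation of the kernel trick, so the feature map used here (elu(x) + 1, a common choice in the linear-attention literature) and all function names are assumptions for illustration only, not the paper's method.

```python
import math


def phi(x):
    """Assumed feature map: elu(x) + 1, which keeps every feature positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_linear_attention_quadratic(Q, K, V):
    """Explicit form: each position t sums over all positions s <= t."""
    out = []
    for t in range(len(Q)):
        q = phi(Q[t])
        weights = [dot(q, phi(K[s])) for s in range(t + 1)]
        total = sum(weights)
        out.append([
            sum(w * V[s][j] for s, w in enumerate(weights)) / total
            for j in range(len(V[0]))
        ])
    return out


def causal_linear_attention_recurrent(Q, K, V):
    """Recurrent form: carries a fixed-size state (S, z) across positions."""
    d, dv = len(Q[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k) v^T
    z = [0.0] * d                       # running sum of phi(k)
    out = []
    for q_raw, k_raw, v in zip(Q, K, V):
        q, k = phi(q_raw), phi(k_raw)
        for i in range(d):
            z[i] += k[i]
            for j in range(dv):
                S[i][j] += k[i] * v[j]
        denom = dot(q, z)
        out.append([
            sum(q[i] * S[i][j] for i in range(d)) / denom for j in range(dv)
        ])
    return out
```

Both functions return the same outputs up to floating-point error. The recurrent version updates a constant-size state per token, which is the property that makes a linearized Attention layer structurally closer to an SSM than softmax Attention is.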
To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens varying sequence mixer architecture, scaling analysis on model sizes and total distillation tokens, and a sensitivity analysis on tokens allocation between stages.
[22] The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure
David A. Cook
Main category: cs.CL
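TL;DR: This paper proposes the PICCO framework for systematizing large language model (LLM) prompt design, covering five core elements (Persona, Instructions, Context, Constraints, and Output), and clarifies the hierarchy among related concepts.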
Details
Motivation: Existing prompt design lacks consistency and structure, so a unified, reusable reference framework is needed to improve conceptual clarity and make design more systematic. Method: Through a multi-database search, 11 existing prompting frameworks are analyzed together; a rigorous conceptual synthesis is used to build a taxonomy and distill the five-element PICCO reference architecture, supplemented by implementation notes and responsible-use considerations. Result: A taxonomy that clearly distinguishes prompt frameworks, elements, generation, techniques, and engineering; the five-element PICCO architecture with each element's function, scope, and interrelations; and a systematic overview of mainstream prompting techniques and of human and automated paths to iterative prompt refinement. Conclusion: PICCO is a conceptual and methodological contribution that formalizes a structure for prompt specification and comparison; it does not empirically validate performance gains. Abstract: Large language model (LLM) performance depends heavily on prompt design, yet prompt construction is often described and applied inconsistently. Our purpose was to derive a reference framework for structuring LLM prompts. This paper presents PICCO, a framework derived through a rigorous synthesis of 11 previously published prompting frameworks identified through a multi-database search. The analysis yields two main contributions. First, it proposes a taxonomy that distinguishes prompt frameworks, prompt elements, prompt generation, prompting techniques, and prompt engineering as related but non-equivalent concepts. Second, it derives a five-element reference architecture for prompt generation: Persona, Instructions, Context, Constraints, and Output (PICCO). For each element, we define its function, scope, and relationship to other elements, with the goal of improving conceptual clarity and supporting more systematic prompt design. Finally, to support application of the framework, we outline key concepts relevant to implementation, including prompting techniques (e.g., zero-shot, few-shot, chain-of-thought, ensembling, decomposition, and self-critique, with selected variants), human and automated approaches to iterative prompt engineering, responsible prompting considerations such as security, privacy, bias, and trust, and priorities for future research.
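As a small illustration of the five-element structure named above, a prompt could be represented as a record with one field per PICCO element and rendered in a fixed order. Only the five element names come from the abstract; the class name, the `render` method, the section layout, and the choice to omit empty sections are hypothetical conveniences, not part of the framework as published.

```python
from dataclasses import dataclass


@dataclass
class PiccoPrompt:
    """Hypothetical container with one field per PICCO element."""
    persona: str
    instructions: str
    context: str
    constraints: str
    output: str

    def render(self) -> str:
        # Emit the elements in P-I-C-C-O order, skipping any left empty.
        sections = [
            ("Persona", self.persona),
            ("Instructions", self.instructions),
            ("Context", self.context),
            ("Constraints", self.constraints),
            ("Output", self.output),
        ]
        return "\n\n".join(
            f"{name}:\n{text}" for name, text in sections if text
        )
```

Keeping the elements as separate fields, rather than one free-form string, is one way to make two prompts comparable element by element, which is the kind of structured comparison the framework is meant to support.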
This work is a conceptual and methodological contribution: it formalizes a common structure for prompt specification and comparison, but does not claim empirical validation of PICCO as an optimization method.
[23] Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate
Simiao Ren,Xingyu Shen,Yuchen Zhou,Dennis,Ng,Ankit Raj
Main category: cs.CL
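TL;DR: This paper empirically tests the popular claim that Chinese prompts are more token-efficient for LLM coding tasks and finds that it does not hold: Chinese prompts do not generally improve token efficiency, their success rates are generally lower than English, token cost varies by model, and cost efficiency (expected cost per successful task) does not improve overall.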
Details
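Motivation: To verify the claim circulating on social media and developer forums that Chinese prompts use fewer tokens than English and can cut costs by 40%, and to assess its value as guidance for real development (such as vibe coding). Method: Using the SWE-bench Lite benchmark of software engineering tasks, run a systematic comparison across several mainstream open- and closed-source LLMs (such as MiniMax-2.7 and GLM-5), measuring token consumption, task success rate, and overall cost efficiency under Chinese and English prompts. Result: 1) Chinese prompts show no universal token-efficiency advantage; 2) token cost is model-dependent (MiniMax-2.7 costs 1.28x more with Chinese, whereas GLM-5 costs less); 3) the success rate with Chinese prompts is lower than with English on all tested models; 4) overall cost efficiency (expected token cost per successful task) does not improve. Conclusion: Current evidence indicates that simply switching the prompt language to Chinese does not deliver reliable cost savings or performance gains; the effect of language on token cost depends heavily on the specific model, and practitioners should not switch to Chinese prompts blindly. Abstract: A claim has been circulating on social media and practitioner forums that Chinese prompts are more token-efficient than English for LLM coding tasks, potentially reducing costs by up to 40%. This claim has influenced developers to consider switching to Chinese for "vibe coding" to save on API costs. In this paper, we conduct a rigorous empirical study using SWE-bench Lite, a benchmark of software engineering tasks, to evaluate whether this claim of Chinese token efficiency holds up to scrutiny. Our results reveal three key findings: First, the efficiency advantage of Chinese is not observed. Second, token cost varies by model architecture in ways that defy simple assumptions: while MiniMax-2.7 shows 1.28x higher token costs for Chinese, GLM-5 actually consumes fewer tokens with Chinese prompts. Third, and most importantly, we found that the success rate when prompting in Chinese is generally lower than in English across all models we tested. We also measure cost efficiency as expected cost per successful task -- jointly accounting for token consumption and task resolution rate.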
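The idea of a cost measure that jointly accounts for token consumption and resolution rate can be sketched as total tokens spent divided by the number of resolved tasks. The abstract does not give the exact formula, so this definition, the function name, and the numbers in the usage note are assumptions chosen only to show how lower token use can still lead to a higher cost per success.

```python
def cost_per_successful_task(token_counts, resolved):
    """Total tokens spent across all attempts, divided by resolved tasks.

    token_counts: tokens consumed on each task (hypothetical measurements).
    resolved:     whether each task was solved, in the same order.
    Returns float('inf') when nothing was resolved.
    """
    if len(token_counts) != len(resolved):
        raise ValueError("token_counts and resolved must have equal length")
    successes = sum(1 for ok in resolved if ok)
    if successes == 0:
        return float("inf")
    return sum(token_counts) / successes
```

With made-up figures, a run that spends 4000 tokens and resolves 3 of 4 tasks costs about 1333 tokens per success, while a run that spends only 3600 tokens but resolves 2 of 4 costs 1800. A lower raw token count therefore does not by itself imply better cost efficiency.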
These findings should be interpreted as preliminary evidence rather than a definitive conclusion, given the limited number of models evaluated and the narrow set of benchmarks tested due to resource constraints; they indicate that language effects on token cost are model-dependent, and that practitioners should not expect cost savings or performance gains just by switching their prompt language to Chinese.
[24] CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization
Deep Shah,Sanket Badhe,Nehal Kathrotia,Priyanka Tiwari
Main category: cs.CL
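TL;DR: This paper proposes CROP, which adds response-length regularization to automatic prompt optimization in order to produce concise, effective reasoning prompts, substantially reducing token consumption while keeping accuracy high.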
Details
Motivation: Existing automatic prompt optimization frameworks target task accuracy alone, which leads to verbose reasoning traces and therefore high latency and token cost. Method: Cost-Regularized Optimization of Prompts (CROP) adds textual length feedback on top of the standard accuracy feedback, steering the optimization toward prompts that are more concise and concentrate on the critical information. Result: On complex reasoning datasets including GSM8K, LogiQA, and BIG-Bench Hard, CROP reduces token consumption by 80.6% with only a negligible drop in performance. Conclusion: CROP offers a practical solution for deploying efficient, low-cost agentic AI systems in production. Abstract: Large Language Models utilizing reasoning techniques improve task performance but incur significant latency and token costs due to verbose generation. Existing automatic prompt optimization (APO) frameworks target task accuracy exclusively at the expense of generating long reasoning traces. We propose Cost-Regularized Optimization of Prompts (CROP), an APO method that introduces regularization on response length by generating textual feedback in addition to standard accuracy feedback. This forces the optimization process to produce prompts that elicit concise responses containing only critical information and reasoning. We evaluate our approach on complex reasoning datasets, specifically GSM8K, LogiQA and BIG-Bench Hard. We achieved an 80.6% reduction in token consumption while maintaining competitive accuracy, seeing only a nominal decline in performance. This presents a pragmatic solution for deploying token-efficient and cost-effective agentic AI systems in production pipelines.
[25] MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes
Samir Wagle,Reewaj Khanal,Abiral Adhikari
Main category: cs.CL
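TL;DR: This paper proposes a multimodal hate speech detection system for Devanagari-script social media memes that combines CLIP with BGE-M3 and introduces a dynamically gated cross-attention mechanism. It improves performance under data scarcity and reveals both the failure of English-centric vision models on this script and the degradation of standard ensemble methods.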
Details
Motivation: To address hate speech detection in Devanagari-script memes, which is made difficult by multimodal structure, linguistic complexity, and data scarcity in an extremely low-resource setting. Method: A hybrid cross-modal attention fusion architecture: CLIP (ViT-B/32) encodes images, BGE-M3 encodes multilingual text, and 4-head self-attention with a learnable gating network dynamically weights each modality's contribution. Result: A 5.9% F1-macro improvement over the text-only baseline on Subtask A; English-centric vision models perform near randomly on Devanagari script, and standard ensembling degrades severely with small samples (N≈850 per fold) because of correlated overfitting. Conclusion: Explicit cross-modal reasoning is essential for low-resource multimodal hate detection; model selection and ensembling strategies must be adapted to the target language and data scale rather than transferred directly from English-dominated solutions. Abstract: Hate speech detection in Devanagari-scripted social media memes presents compounded challenges: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three-class sentiment classification: positive, neutral, negative). We propose a hybrid cross-modal attention fusion architecture that combines CLIP (ViT-B/32) for visual encoding with BGE-M3 for multilingual text representation, connected through 4-head self-attention and a learnable gating network that dynamically weights modality contributions on a per-sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines on Subtask A, while uncovering two unexpected but critical findings: English-centric vision models exhibit near-random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N ≈ 850 per fold) due to correlated overfitting. The code can be accessed at https://github.com/Tri-Yantra-Technologies/MEME-Fusion/
[26] ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
Zhuofeng Li,Yi Lu,Dongfu Jiang,Haoxiang Zhang,Yuyang Bai,Chuan Li,Yu Wang,Shuiwang Ji,Jianwen Xie,Yu Zhang
Main category: cs.CL
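TL;DR: This paper introduces the REVIEWBENCH benchmark and the REVIEWGROUNDER multi-agent framework, which bring explicit rubrics and contextual evidence consolidation into LLM-based review of academic papers, markedly improving feedback quality and agreement with human judgments.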
Details
Motivation: Existing LLM reviewers often produce superficial, formulaic comments that lack substantive, evidence-based feedback, mainly because they underuse two components of human reviewing: explicit rubrics and contextual grounding in existing work. Method: Build the REVIEWBENCH benchmark (paper-specific rubrics generated from official guidelines, paper content, and human-written reviews); propose the REVIEWGROUNDER multi-agent framework, which decomposes reviewing into a drafting stage and an evidence-grounding stage and integrates tools for targeted evidence consolidation. Result: On REVIEWBENCH, REVIEWGROUNDER (Phi-4-14B drafting plus GPT-OSS-120B grounding) outperforms stronger or larger baselines (such as GPT-4.1 and DeepSeek-R1-670B) across 8 dimensions, particularly in alignment with human judgments and adherence to the rubrics. Conclusion: Explicit rubric guidance and contextual evidence grounding are key to improving LLM review quality; REVIEWGROUNDER offers a more reliable and interpretable approach to AI-assisted reviewing. Abstract: The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper's content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at https://github.com/EigenTom/ReviewGrounder.
[27] EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation
Francesco Andrea Causio,Vittorio De Vita,Olivia Riccomi,Michele Ferramola,Federico Felizzi,Antonio Cristiano,Lorenzo De Mori,Chiara Battipaglia,Melissa Sawaya,Luigi De Angelis,Marcello Di Pumpo,Alessandra Piscitelli,Pietro Eric Risuleo,Alessia Longo,Giulia Vojvodic,Mariapia Vassalli,Bianca Destro Castaniti,Nicolò Scarsi,Manuel Del Medico
Main category: cs.CL
TL;DR: This paper introduces EuropeMedQA, the first multilingual and multimodal medical examination dataset from official European regulatory exams (Italy, France, Spain, Portugal), designed to evaluate multimodal LLMs on cross-lingual and visual reasoning tasks under strict zero-shot conditions.
Details
Motivation: LLMs perform well on English medical exams but struggle with non-English languages and multimodal diagnostic tasks; there is a need for a contamination-resistant, clinically realistic, multilingual benchmark aligned with European regulatory standards. Method: Development of EuropeMedQA following FAIR principles and SPIRIT-AI guidelines, including rigorous manual curation and automated translation; evaluation of multimodal LLMs via zero-shot, strictly constrained prompting for cross-lingual transfer and visual reasoning. Result: EuropeMedQA is established as the first comprehensive, multilingual, multimodal medical exam dataset from official European sources, enabling robust and fair assessment of multimodal LLMs in clinical contexts. Conclusion: EuropeMedQA fills a critical gap by providing a realistic, contamination-resistant benchmark that supports the development of more generalizable, clinically relevant, and multilingual medical AI systems. Abstract: While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. 
EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.
[28] Tracking the Temporal Dynamics of News Coverage of Catastrophic and Violent Events
Emily Lugos,Maurício Gruppi
Main category: cs.CL
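TL;DR: This study analyzes 126,602 online news articles to quantify the temporal and semantic dynamics of reporting on violent and catastrophic events. Sudden high-impact events follow a predictable news-cycle pattern: coverage surges rapidly, semantic drift is pronounced early on, and coverage then gradually returns to baseline. The study also identifies the key terms driving these temporal patterns.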
Details
Motivation: Understanding how narratives form, spread, and evolve in public discourse during moments of crisis is essential for interpreting how media framing changes over time. Method: Using a large-scale corpus of online news (126,602 articles), narrative change is quantified through publication volume, semantic drift, semantic dispersion, and term relevance. Result: Sudden high-impact events show a structured, predictable news-cycle pattern: a rapid surge in coverage, marked early semantic drift, and a gradual decline toward baseline; the key terms driving this temporal pattern are also identified. Conclusion: News coverage of sudden events does not fluctuate randomly but follows dynamics that can be modeled; semantic evolution is closely tied to the rhythm of coverage, providing an empirical basis for understanding media narrative mechanisms. Abstract: The modern news cycle has been fundamentally reshaped by the rapid exchange of information online. As a result, media framing shifts dynamically as new information, political responses, and social reactions emerge. Understanding how these narratives form, propagate, and evolve is essential for interpreting public discourse during moments of crisis. In this study, we examine the temporal and semantic dynamics of reporting for violent and catastrophic events using a large-scale corpus of 126,602 news articles collected from online publishers. We quantify narrative change through publication volume, semantic drift, semantic dispersion, and term relevance. Our results show that sudden events of impact exhibit structured and predictable news-cycle patterns characterized by rapid surges in coverage, early semantic drift, and gradual declines toward the baseline. In addition, our results indicate the terms that are driving the temporal patterns.
[29] LLM Predictive Scoring and Validation: Inferring Experience Ratings from Unstructured Text
Jason Potteiger,Andrew Hong,Ito Zapata
Main category: cs.CL
TL;DR: 本研究使用GPT-4.1基于棒球迷开放性文本反馈预测其0-10分的整体观赛体验评分,发现AI预测与真实评分高度相关(r=0.82),但系统性偏低约1分;该偏差反映两种测量本质差异:真实评分是整合性主观判断,而AI预测更侧重突出、情绪强烈或可行动的体验时刻。
Details
Motivation: To examine whether a large language model can reliably predict users' overall experience ratings from open-ended text feedback alone, and to understand the nature of the systematic gap between predicted and self-reported values. Method: Use GPT-4.1 with a single prompt to predict a 0-10 rating for each of roughly 10,000 open-ended responses from fans of five MLB teams; compare predictions with the actual survey ratings and analyze agreement, correlation, and sources of bias. Result: 67% of predictions fall within ±1 point of the true value and 36% match exactly; across three independent runs, 87% agree exactly and 99.9% fall within ±1; correlation is highest with the overall experience rating (r=0.82), but predictions are systematically about one point lower, and this gap cannot be attributed to any single aspect of the experience. Conclusion: A simple, unoptimized prompt is enough for directional prediction; the gap between predicted and self-reported values is not an error but reflects two different psychological constructs (an integrated subjective judgment versus a quantification of salient moments) and should be preserved and interpreted. Abstract: We tasked GPT-4.1 to read what baseball fans wrote about their game-day experience and predict the overall experience rating each fan gave on a 0-10 survey scale. The model received only the text of a single open-ended response. These AI predictions were compared with the actual experience ratings captured by the survey instrument across approximately 10,000 fan responses from five Major League Baseball teams. In total, two-thirds of predicted ratings fell within one point of self-reported fan ratings (67% within +/-1, 36% exact match), and the predicted measurement was near-deterministic across three independent scoring runs (87% exact agreement, 99.9% within +/-1). Predicted ratings aligned most strongly with the overall experience rating (r = 0.82) rather than with any specific aspect of the game-day experience such as parking, concessions, staff, etc. However, predictions were systematically lower than self-reported ratings by approximately one point, and this gap was not driven by any single aspect. Rather, our analysis shows that self-reported ratings capture the fan's verdict, an overall evaluative judgment that integrates the entire experience, whereas predicted ratings quantify the impact of salient moments characterized as memorable, emotionally intense, unusual, or actionable. Each measure contains information the other misses.
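The agreement statistics quoted above (exact match, within one point, correlation, and systematic offset) can be computed from paired ratings as in the following minimal sketch. The function name and the returned keys are hypothetical, and the abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here.

```python
import math


def agreement_summary(predicted, actual):
    """Summarize agreement between predicted and self-reported ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(predicted)
    diffs = [p - a for p, a in zip(predicted, actual)]
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return {
        "exact_match": sum(1 for d in diffs if d == 0) / n,
        "within_one": sum(1 for d in diffs if abs(d) <= 1) / n,
        # Negative bias means predictions run below self-reports.
        "mean_bias": sum(diffs) / n,
        "pearson_r": cov / math.sqrt(var_p * var_a),
    }
```

Reporting the mean bias separately from the correlation matters for the argument in the abstract: predictions can track self-reports closely (high r) while still sitting consistently below them.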
These baseline results establish that a simple, unoptimized prompt can directionally predict how fans rate their experience from the text a fan wrote and that a gap between the two numbers can be interpreted as a construct difference worth preserving rather than an error to eliminate.
[30] Purging the Gray Zone: Latent-Geometric Denoising for Precise Knowledge Boundary Awareness
Hao An,Yibin Lou,Jiayi Guo,Yang Xu
Main category: cs.CL
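TL;DR: This paper proposes the GeoDe framework, which performs "geometric denoising" using geometric distance as a confidence signal. It addresses the hallucination and over-abstention that arise when a model's internal beliefs are ambiguous near the decision boundary, and it markedly improves truthfulness and OOD generalization.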
Details
Motivation: Existing abstention fine-tuning methods partition datasets directly by response accuracy, which introduces severe label noise near the decision boundary and leads to high abstention rates or hallucinations; ambiguity in the model's internal beliefs (the "gray zone") is the core performance bottleneck. Method: Taking a latent-space representation perspective, build linear probes to define a "truth hyperplane", use each sample's geometric distance to that hyperplane as a confidence signal, filter out ambiguous boundary samples (geometric denoising), and keep high-fidelity samples for abstention fine-tuning. Result: Across multiple models and datasets, including Llama3, Qwen3, TriviaQA, NQ, SciQ, and SimpleQA, GeoDe significantly improves model truthfulness and generalizes strongly in out-of-distribution (OOD) scenarios. Conclusion: By introducing geometric distance as a more robust confidence measure, GeoDe mitigates the abstention and hallucination problems caused by label noise and belief ambiguity, offering a new approach to building trustworthy LLMs. Abstract: Large language models (LLMs) often exhibit hallucinations due to their inability to accurately perceive their own knowledge boundaries. Existing abstention fine-tuning methods typically partition datasets directly based on response accuracy, causing models to suffer from severe label noise near the decision boundaries and consequently exhibit high rates of abstentions or hallucinations. This paper adopts a latent space representation perspective, revealing a "gray zone" near the decision hyperplane where internal belief ambiguity constitutes the core performance bottleneck. Based on this insight, we propose the GeoDe (Geometric Denoising) framework for abstention fine-tuning. This method constructs a truth hyperplane using linear probes and performs "geometric denoising" by employing geometric distance as a confidence signal for abstention decisions. This approach filters out ambiguous boundary samples while retaining high-fidelity signals for fine-tuning. Experiments across multiple models (Llama3, Qwen3) and benchmark datasets (TriviaQA, NQ, SciQ, SimpleQA) demonstrate that GeoDe significantly enhances model truthfulness and demonstrates strong generalization in out-of-distribution (OOD) scenarios. Code is available at https://github.com/Notbesidemoon/GeoDe.
[31] Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance
Bar Alon,Itamar Zimerman,Lior Wolf
Main category: cs.CL
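TL;DR: This paper evaluates the epistemic faithfulness of post-hoc textual explanations generated by large language models (LLMs) and finds that they are often unfaithful. It then proposes a training-free explanation-enhancement method based on attention interventions, which uses token-level heatmaps extracted by a faithful attribution method to guide explanation generation, significantly improving epistemic faithfulness across multiple models, benchmarks, and prompts.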
Details
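Motivation: LLMs lack explainability, which limits their use in critical domains that require transparency and trust; existing post-hoc textual explanations are persuasive, but it is unclear whether they truly reflect the basis of the model's internal decision (that is, whether they are epistemically faithful). Method: First assess the epistemic faithfulness of LLM explanations through counterfactual analysis; then propose a training-free method that uses a faithful attribution method (such as gradient-based or perturbation-based methods) to produce token-level heatmaps and applies targeted interventions to the attention mechanism to guide more faithful explanation generation. Result: Experiments show that existing LLM explanations are commonly epistemically unfaithful; the proposed attention intervention significantly improves epistemic faithfulness across multiple LLMs (such as the Llama and GPT families), multiple benchmarks (such as ERASER and BoolQ), and different prompts. Conclusion: Epistemic faithfulness is a key dimension for assessing explanation quality; training-free attention intervention offers an effective and general new path to more faithful LLM explanations. Abstract: Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.
[32] Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation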
Zichong Li,Chen Liang,Liliang Ren,Tuo Zhao,Yelong Shen,Weizhu Chen
Main category: cs.CL
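TL;DR: This paper proposes RoPE-Perturbed Self-Distillation, which perturbs RoPE position encodings to create different positional views of the same sequence and uses self-distillation to make the model's outputs consistent across views, improving the positional robustness of large language models on long-context tasks.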
Details
Motivation: Standard long-context fine-tuning is sensitive to the absolute position of evidence within the context, showing high positional variance that undermines model reliability. Method: RoPE-Perturbed Self-Distillation: during training, perturb RoPE indices to produce context views with different positional layouts, and use self-distillation to constrain the model to predict consistently across views. Result: Validated on Llama-3-8B and Qwen-3-4B: Llama-3-8B improves by 12.04% on RULER-64K and Qwen-3-4B by 2.71% on RULER-256K; length extrapolation also improves. Conclusion: Combining RoPE perturbation with self-distillation substantially strengthens robustness to positional changes, reduces reliance on absolute position, and improves the stability and generalization of long-context understanding. Abstract: Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
[33] When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden
Apoorv Prasad,Susan McRoy
Main category: cs.CL
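TL;DR: This paper proposes a method based on small open-source language models for automatically detecting, in social media posts, the triple burden faced by women with polycystic ovary syndrome (PCOS): body image distress, disordered eating, and metabolic problems. The models produce explainable, structured output.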
Details
Motivation: Women with PCOS face a triple burden of body image distress, disordered eating, and metabolic challenges, but existing NLP methods lack transparency and cannot identify co-occurring presentations. Method: Collect 1,000 PCOS-related Reddit posts, labeled by two annotators following the Lee et al. (2017) clinical framework; fine-tune three small open-source language models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) with LoRA to generate structured explanations with textual evidence. Result: The best model reaches 75.3% exact-match accuracy on 150 test samples, with robust comorbidity detection and strong explainability; performance declines as diagnostic complexity increases. Conclusion: The method is suited to preliminary screening for PCOS-related psychological and metabolic risks rather than autonomous diagnosis; it highlights the practical value of small models with explainable generation in sensitive health settings. Abstract: Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines operationalizing the Lee et al. (2017) clinical framework. Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating that the models are best used for screening rather than autonomous diagnosis.
[34] APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI
Pratyay Banerjee,Masud Moshtaghi,Shivashankar Subramanian,Amita Misra,Ankit Chadha
Main category: cs.CL
TL;DR: APEX-MEM is a conversational memory system using a property graph, append-only storage, and a multi-tool retrieval agent to improve long-term memory reliability in LLMs.
Details
Motivation: Large language models struggle with reliable long-term conversational memory due to noise from enlarged context windows or naive retrieval. Method: APEX-MEM introduces (1) a domain-agnostic property graph to structure conversations as temporally grounded, entity-centric events; (2) append-only storage to preserve full temporal evolution; and (3) a multi-tool retrieval agent that resolves conflicting or evolving information at query time to generate compact, relevant memory summaries. Result: APEX-MEM achieves 88.88% accuracy on LOCOMO's QA task and 86.2% on LongMemEval, outperforming state-of-the-art session-aware methods. Conclusion: Structured property graphs enable more temporally coherent long-term conversational reasoning in LLMs. Abstract: Large language models still struggle with reliable long-term conversational memory: simply enlarging context windows or applying naive retrieval often introduces noise and destabilizes responses. We present APEX-MEM, a conversational memory system that combines three key innovations: (1) a property graph which uses domain-agnostic ontology to structure conversations as temporally grounded events in an entity-centric framework, (2) append-only storage that preserves the full temporal evolution of information, and (3) a multi-tool retrieval agent that understands and resolves conflicting or evolving information at query time, producing a compact and contextually relevant memory summary. This retrieval-time resolution preserves the full interaction history while suppressing irrelevant details. APEX-MEM achieves 88.88% accuracy on LOCOMO's Question Answering task and 86.2% on LongMemEval, outperforming state-of-the-art session-aware approaches and demonstrating that structured property graphs enable more temporally coherent long-term conversational reasoning.
[35] The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
Akshay Paruchuri,Ishan Chatterjee,Henry Fuchs,Ehsan Adeli,Piotr Didyk
Main category: cs.CL
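TL;DR: This paper proposes centroid replacement, a controlled probe for analyzing how multimodal language models depend on the language and vision modalities. Language representations are found to dominate overwhelmingly across multimodal tasks, even those that demand strong visual reasoning. Text centroid contrastive decoding raises accuracy without retraining (up to +16.9%), and the size of the gain differs by training approach (fine-tuning vs. preference optimization).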
Details
Motivation: Multimodal language models systematically underperform on visual perception tasks, but the underlying structural cause of this failure remains unclear. Method: Proposes centroid replacement, which maps each token to its nearest K-means centroid to erase the structural information of text or visual tokens in a controlled way; then designs text centroid contrastive decoding, which contrasts the original output against a text-centroid-erased reference at inference time to recover performance. Result: Across 7 models spanning multiple architectures, erasing text centroid structure costs 4 times more accuracy than erasing visual centroid structure; text centroid contrastive decoding improves single-task accuracy by up to 16.9%; standard fine-tuned models gain 5.6% on average, preference-optimized models only 1.5%. Conclusion: Modal competition is structural and localized and can be corrected at inference time without retraining; the centroid perturbation effect can serve as a quantitative diagnostic signal to guide future multimodal training design. Abstract: Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
[36] BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking
Hyunkyung Park,Arkaitz Zubiaga
Main category: cs.CL
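TL;DR: This paper proposes a conservative rewriting method for colloquial claims in dialogue (staged de-colloquialisation) and a semantics-aware consistency gate (BiCon-Gate) to make automated fact-checking in multi-turn dialogue more robust and accurate.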
Details
Motivation: Existing automated fact-checking research does not handle colloquial language well, even though it is frequent in dialogue and remains understudied. Method: Generates conservative rewrite candidates through staged de-colloquialisation and introduces the BiCon-Gate mechanism, which adopts a rewrite candidate only when it is semantically supported by the dialogue context and otherwise falls back to the original claim. Result: On the DialFact benchmark, the method improves evidence retrieval and fact verification, with particularly clear gains on the SUPPORTS class, and outperforms several strong baselines, including a one-shot LLM rewrite. Conclusion: Staged, lightweight de-colloquialisation combined with a semantic consistency gate is an effective way to improve the stability and accuracy of dialogue fact-checking. Abstract: Automated fact-checking in dialogue involves multi-turn conversations where colloquial language is frequent yet understudied. To address this gap, we propose a conservative rewrite candidate for each response claim via staged de-colloquialisation, combining lightweight surface normalisation with scoped in-claim coreference resolution. We then introduce BiCon-Gate, a semantics-aware consistency gate that selects the rewrite candidate only when it is semantically supported by the dialogue context, otherwise falling back to the original claim. This gated selection stabilises downstream fact-checking and yields gains in both evidence retrieval and fact verification. On the DialFact benchmark, our approach improves retrieval and verification, with particularly strong gains on SUPPORTS, and outperforms competitive baselines, including a decoder-based one-shot LLM rewrite that attempts to perform all de-colloquialisation steps in a single pass.
[37] Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection
David Basil,Chirooth Girigowda,Bradley Hauer,Sahir Momin,Ning Shi,Grzegorz Kondrak
Main category: cs.CL
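TL;DR: This paper proposes a method for automatically generating WordNet-style sense resources for new languages via semantic projection. It combines a pre-trained aligner with a bilingual dictionary for projection and filtering, and evaluation on multiple languages shows high precision, interpretability, and low resource requirements.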
Details
Motivation: Automatically extending sense resources such as WordNet to new languages is hampered by scarce annotations and the difficulty of cross-lingual semantic alignment. Method: Using sense-tagged English text and its translation, English synsets are projected onto aligned target-language lemmas; a pre-trained aligner augmented with a bilingual dictionary provides high-quality alignments and filters out incorrect projections. Result: Projection precision improves across multiple languages, surpassing prior methods as well as dictionary-based and large language model baselines, while remaining interpretable and depending on few external resources. Conclusion: The project-and-filter strategy is an efficient, robust, and practical way to build sense resources for new languages; the code, documentation, and generated sense inventories are to be released. Abstract: We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects English synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate these alignments and ensure their quality, we augment a pre-trained base aligner with a bilingual dictionary, which is also used to filter out incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and requiring few external resources. We plan to make our code, documentation, and generated sense inventories accessible.
[38] The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious
Ferdinand M. Schessl
Main category: cs.CL
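TL;DR: This paper shows that current evaluation of multi-turn human-LLM conversations ignores autocorrelation between turns, so standard pooled analysis severely overstates statistical significance. The author characterizes the autocorrelation structure of 66 turn-level metrics, proposes a two-stage correction framework combining Chelton effective degrees of freedom with a conversation-level block bootstrap, and shows a clear improvement in replication rate. A literature survey finds that most recent NLP/AI papers do not correct for the problem.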
Details
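Motivation: Conversation evaluation commonly relies on turn-level metrics but ignores the statistical dependence between consecutive turns of the same conversation (autocorrelation), which makes standard statistical inference such as pooled testing unreliable. Method: Systematically characterizes the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations; proposes a two-stage correction framework combining Chelton (1983) effective degrees of freedom with a conversation-level block bootstrap; validates the correction on a pre-registered hold-out split; and surveys how recent papers handle temporal dependence. Result: 42% of associations that are significant under standard pooled testing are no longer significant after cluster-robust correction; inflation varies widely across metric families (14% for memoryless families, 33% for non-memoryless families, up to 100% for individual categories); corrected metrics replicate at 57%, versus 30% for pooled-only metrics; of roughly 30 recent papers at major venues, only 4 address temporal dependence and 26 do not correct for it. Conclusion: Ignoring turn-level autocorrelation seriously biases conversation evaluation conclusions; cluster-robust methods (block bootstrap plus effective degrees of freedom) should be used for inference; the author provides reproducible open-source tooling, design principles, and a publication checklist, and calls on the field to adopt the correction as standard practice. Abstract: Turn-level metrics are widely used to evaluate properties of multi-turn human-LLM conversations, from safety and sycophancy to dialogue quality. However, consecutive turns within a conversation are not statistically independent -- a fact that virtually all current evaluation pipelines fail to correct for in their statistical inference. We systematically characterize the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations (11,639 turn pairs, 5 German-speaking users, 4 LLM platforms) and demonstrate that naive pooled analysis produces severely inflated significance estimates: 42% of associations that appear significant under standard pooled testing fail to survive cluster-robust correction. The inflation varies substantially across categories rather than scaling linearly with autocorrelation: three memoryless families (embedding velocity, directional, differential) aggregate to 14%, while the seven non-memoryless families (thermo-cycle, frame distance, lexical/structural, rolling windows, cumulative, interaction, timestamp) aggregate to 33%, with individual category rates ranging from 0% to 100% depending on per-family effect size. We present a two-stage correction framework combining Chelton (1983) effective degrees of freedom with conversation-level block bootstrap, and validate it on a pre-registered hold-out split where cluster-robust metrics replicate at 57% versus 30% for pooled-only metrics.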
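The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the paper's released pipeline: the effective sample size below uses a common AR(1)-style approximation, n_eff = n(1 - r_x r_y) / (1 + r_x r_y), and the abstract does not state whether this matches the exact Chelton (1983) estimator the author uses. The function names and the data layout (a conversation as a list of `(x, y)` turn-level metric pairs) are hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def lag1_autocorr(xs):
    """Lag-1 autocorrelation: correlation of the series with itself shifted by one turn."""
    return pearson(xs[:-1], xs[1:])


def effective_n(n, r1x, r1y):
    """Assumed AR(1)-style effective sample size; shrinks n when both series are autocorrelated."""
    prod = r1x * r1y
    return n * (1 - prod) / (1 + prod)


def block_bootstrap_ci(conversations, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for the pooled correlation, resampling whole conversations.

    Resampling at the conversation level keeps within-conversation
    dependence intact inside each bootstrap replicate.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(conversations) for _ in conversations]
        xs = [x for conv in sample for x, _ in conv]
        ys = [y for conv in sample for _, y in conv]
        stats.append(pearson(xs, ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Under this approximation, 100 turn pairs with lag-1 autocorrelation 0.5 in both metrics carry about as much information as 60 independent pairs, which is why pooled tests that assume 100 independent observations overstate significance.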
We provide concrete design principles, a publication checklist, and open-source code for the correction pipeline. A survey of ~30 recent papers at major NLP and AI venues that compute turn-level statistics in LLM evaluations finds that only 4 address temporal dependence at all, and 26 do not correct for it.
[39] Three-Phase Transformer
Mohammad R. Abu Ayyash
Main category: cs.CL
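TL;DR: This paper proposes the Three-Phase Transformer (3PT), an architecture that adds a cyclic channel partition of the residual stream, phase-respecting operations (such as Givens rotations), a GQA head-count constraint, and a Gabriel's horn positional encoding to improve the training stability and efficiency of decoder-only Transformers. On WikiText-103 it lowers perplexity and speeds up convergence.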
Details
Motivation: Address unstable training and slow convergence in decoder-only Transformers, and explore how a structural prior on the residual stream can implicitly regulate model dynamics. Method: The hidden vector is partitioned into N equally sized cyclic channels; each channel has its own RMSNorm; a 2D Givens rotation between attention and FFN rotates each channel by an angle that includes a phase offset; the GQA head count is aligned with the number of channels; a Gabriel's horn absolute-position profile is injected into an orthogonal DC subspace; the whole acts as a self-stabilizing equilibrium, not an add-on module. Result: At 123M parameters, relative to a RoPE baseline, perplexity falls by 7.20% and bits-per-byte by 2.62%, with convergence 1.93x faster in steps and 1.64x faster in wall-clock time; N=3 is the canonical configuration, and the effect of N depends on model scale; the study also documents self-stabilizing geometry, a U-shaped depth profile of rotation-angle drift, and orthogonal composition with RoPE, attention, and FFN. Conclusion: 3PT shows that a structural prior on the residual stream can serve as a lightweight, self-stabilizing, interpretable inductive bias that steers network dynamics toward a favorable equilibrium without extra supervision, offering a geometry- and physics-inspired direction for Transformer design. Abstract: We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable.
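The per-channel rotation and the horn profile described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the abstract gives only the angle theta + i*(2*pi/N) and the profile r(p) = 1/(p+1), so rotating adjacent coordinate pairs inside each channel is a guess at how the 2D Givens rotation is applied, and all function names are hypothetical.

```python
import math


def phase_angles(theta, n_channels):
    """Rotation angle for channel i: theta + i * (2*pi / N)."""
    return [theta + i * (2 * math.pi / n_channels) for i in range(n_channels)]


def rotate_channels(hidden, theta, n_channels):
    """Apply a 2D Givens rotation to each adjacent coordinate pair of every channel.

    Assumption: the hidden vector splits into n_channels contiguous,
    equally sized channels of even width.
    """
    d = len(hidden)
    assert d % n_channels == 0, "hidden size must divide evenly into channels"
    width = d // n_channels
    assert width % 2 == 0, "channel width must be even to form 2D pairs"
    out = list(hidden)
    for i, angle in enumerate(phase_angles(theta, n_channels)):
        c, s = math.cos(angle), math.sin(angle)
        start = i * width
        for j in range(start, start + width, 2):
            a, b = hidden[j], hidden[j + 1]
            out[j] = c * a - s * b
            out[j + 1] = s * a + c * b
    return out


def horn_profile(position):
    """Gabriel's horn profile r(p) = 1 / (p + 1) from the abstract."""
    return 1.0 / (position + 1)
```

Because each Givens rotation is orthogonal, the vector norm is unchanged, and for N=3 the cosines of the three phase angles sum to zero for any theta, which is the balanced three-phase property the abstract uses as its metaphor.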
The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
[40] Hierarchical vs. Flat Iteration in Shared-Weight Transformers
Sang-Il Han
Main category: cs.CL
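TL;DR: This empirical study compares hierarchically structured, shared-weight recurrence (HRM-LM) with conventional stacks of independent Transformer layers for language modeling and finds a substantial performance gap for the recurrent design.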
Details
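Motivation: Investigate whether hierarchical, shared-weight recurrence can match the representational quality of independently stacked Transformer layers in a language model. Method: Proposes HRM-LM, which replaces L independent Transformer layers with a two-speed recurrent pair (a Fast module that runs at every step and a Slow module that runs every T steps), unrolled for M = N x T steps with shared parameters; the comparison is supported by an ablation against a parameter-matched Universal Transformer (UniTF, 1.2B) across five independent runs. Result: Under parameter matching, a clear and robust performance gap separates the two approaches. Conclusion: In the current setting, hierarchical shared-weight recurrence cannot match the representational capacity of independent layer stacking, which challenges the assumption that such structures are effective in Transformer language models. Abstract: We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module operating at every step for local refinement, and a Slow module operating every T steps for global compression. This recurrent hierarchy is unrolled for M = N x T steps with shared parameters. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.
[41] MARCA: A Checklist-Based Benchmark for Multilingual Web Search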
Thales Sales Almeida,Giovana Kerche Bonás,Ramon Pires,Celio Larcher,Hugo Abonizio,Marcos Piau,Roseval Malaquias Junior,Rodrigo Nogueira,Thiago Laitz
Main category: cs.CL
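TL;DR: This paper presents MARCA, a bilingual (English and Portuguese) benchmark for evaluating large language models on web-based information seeking. It contains 52 manually authored multi-entity questions with checklist-style rubrics, and 14 models are evaluated under two interaction frameworks (Basic and Orchestrator), highlighting differences in answer completeness, correctness, and cross-lingual transfer.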
Details
Motivation: Existing benchmarks under-evaluate web browsing and agentic tool use in multilingual settings, Portuguese in particular, so a bilingual benchmark is needed that measures how reliably LLMs perform real web information seeking (search, evidence selection, and answer synthesis). Method: Builds the bilingual (English/Portuguese) benchmark MARCA with 52 manually authored multi-entity questions and manually validated checklist-style rubrics; evaluates 14 models over multiple runs under a Basic framework (direct web search and scraping) and an Orchestrator framework (task decomposition with delegated subagents), reporting run-level uncertainty. Result: Performance differs widely across models; the Orchestrator framework often improves answer coverage; transfer from English to Portuguese varies substantially between models; all results reflect run-level stochasticity. Conclusion: MARCA fills a gap in multilingual (especially Portuguese) evaluation of web information seeking, exposes challenges for current LLMs in cross-lingual transfer, answer completeness, and systematic tool use, and provides a reproducible, fine-grained evaluation tool for future work. Abstract: Large language models (LLMs) are increasingly used as sources of information, yet their reliability depends on the ability to search the web, select relevant evidence, and synthesize complete answers. While recent benchmarks evaluate web-browsing and agentic tool use, multilingual settings, and Portuguese in particular, remain underexplored. We present MARCA, a bilingual (English and Portuguese) benchmark for evaluating LLMs on web-based information seeking. MARCA consists of 52 manually authored multi-entity questions, paired with manually validated checklist-style rubrics that explicitly measure answer completeness and correctness. We evaluate 14 models under two interaction settings: a Basic framework with direct web search and scraping, and an Orchestrator framework that enables task decomposition via delegated subagents. To capture stochasticity, each question is executed multiple times and performance is reported with run-level uncertainty. Across models, we observe large performance differences, find that orchestration often improves coverage, and identify substantial variability in how models transfer from English to Portuguese. The benchmark is available at https://github.com/maritaca-ai/MARCA
[42] Filling in the Mechanisms: How do LMs Learn Filler-Gap Dependencies under Developmental Constraints?
Atrey Desai,Sathvik Nair
Main category: cs.CL
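TL;DR: This paper studies whether language models trained on limited data form, as humans do, representations of filler-gap dependencies that are shared across syntactic constructions. Shared mechanisms do appear, but models need far more data than humans to generalize comparably, which suggests that models of language acquisition need language-specific biases.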
Details
Motivation: Examine whether language models trained on developmentally feasible amounts of data form, like humans, a shared representation of filler-gap dependencies across different syntactic constructions (such as wh-questions and topicalization). Method: Applies Distributed Alignment Search (DAS) to language models trained on varying amounts of BabyLM challenge data and tests whether their representations of filler-gap dependencies are shared between wh-questions and topicalization. Result: Even with limited training data, language models may develop shared but item-sensitive mechanisms; compared with humans, however, they still need far more data to reach comparable generalization. Conclusion: Language models may develop cross-construction representations of filler-gap dependencies, but they are far less data-efficient than humans, so computational models of language acquisition need to incorporate language-specific priors or biases. Abstract: For humans, filler-gap dependencies require a shared representation across different syntactic constructions. Although causal analyses suggest this may also be true for LLMs (Boguraev et al., 2025), it is still unclear if such a representation also exists for language models trained on developmentally feasible quantities of data. We applied Distributed Alignment Search (DAS, Geiger et al. (2024)) to LMs trained on varying amounts of data from the BabyLM challenge (Warstadt et al., 2023), to evaluate whether representations of filler-gap dependencies transfer between wh-questions and topicalization, which greatly vary in terms of their input frequency. Our results suggest shared, yet item-sensitive mechanisms may develop with limited training data. More importantly, LMs still require far more data than humans to learn comparable generalizations, highlighting the need for language-specific biases in models of language acquisition.
[43] Psychological Steering of Large Language Models
Leonardo Blas,Robin Jia,Emilio Ferrara
Main category: cs.CL
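TL;DR: This paper proposes a psychology-based framework for steering LLM behavior that uses the IPIP-NEO-120 inventory to calibrate residual-stream injections, in particular the mean-difference (MD) method. MD outperforms conventional Personality Prompting (P²) in most of the 14 LLMs tested, and a hybrid of MD and P² performs best. The work supports the Linear Representation Hypothesis but also reveals a mismatch between model representations and human psychology (for example, the Big Two model).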
Details
Motivation: Existing residual-stream intervention methods are limited by restricted search spaces and uncalibrated activation units, which makes optimal intervention settings hard to find; a semantically interpretable, psychologically aligned intervention paradigm is needed. Method: Proposes a psychological steering framework that uses the IPIP-NEO-120 inventory to semantically calibrate the OCEAN personality dimensions and performs unbounded, fluency-constrained residual-stream injection; compares six injection methods, focusing on mean-difference (MD) injection and its hybrid with Personality Prompting (P²). Result: MD outperforms P² in 11 of 14 LLMs (gains of 3.6% to 16.4%); the MD+P² hybrid outperforms both in 13 models (gains of 5.6% to 21.9% over P² and 3.3% to 26.7% over MD); MD is consistent with the Linear Representation Hypothesis, but the OCEAN covariance it induces departs from the Big Two model. Conclusion: Representation engineering, not prompting alone, is a new frontier for open-ended psychological steering; semantically calibrated residual injection is more effective and controllable, but the personality representations of current LLMs do not yet fully match the structure of human psychology. Abstract: Large language models (LLMs) emulate a consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P²), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P² and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P² ranging from 5.6% to 21.9% and from 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering.
Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.
[44] CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling
Karthik Singaravadivelan,Anant Gupta,Zekun Wang,Christopher MacLellan
Main category: cs.CL
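TL;DR: CobwebTM is a low-parameter, lifelong hierarchical topic model based on incremental probabilistic concept formation. It builds semantic hierarchies online and supports unsupervised topic discovery, dynamic topic creation, and hierarchical organization without a preset number of topics.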
Details
Motivation: Neural topic models require extensive tuning and cope poorly with lifelong learning (catastrophic forgetting, fixed capacity), while classical probabilistic models lack flexibility and adaptability to streaming data. Method: Adapts the Cobweb algorithm to the continuous document-embedding space, enabling incremental symbolic concept formation over pretrained representations and building semantic hierarchies online. Result: Across multiple datasets, the model shows strong topic coherence, topics that are stable over time, and high-quality hierarchies. Conclusion: Combining incremental symbolic concept formation with pretrained representations is an efficient route to topic modeling. Abstract: Topic modeling seeks to uncover latent semantic structure in text corpora with minimal supervision. Neural approaches achieve strong performance but require extensive tuning and struggle with lifelong learning due to catastrophic forgetting and fixed capacity, while classical probabilistic models lack flexibility and adaptability to streaming data. We introduce CobwebTM, a low-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation. By adapting the Cobweb algorithm to continuous document embeddings, CobwebTM constructs semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without predefining the number of topics. Across diverse datasets, CobwebTM achieves strong topic coherence, stable topics over time, and high-quality hierarchies, demonstrating that incremental symbolic concept formation combined with pretrained representations is an efficient approach to topic modeling.
[45] PeerPrism: Peer Evaluation Expertise vs Review-writing AI
Soroush Sadeghian,Alireza Daqiq,Radin Cheraghi,Sajad Ebrahimi,Negar Arabzadeh,Ebrahim Bagheri
Main category: cs.CL
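TL;DR: This paper introduces PeerPrism, a benchmark for evaluating detection of human-AI collaboration in scientific peer review. It shows that current LLM detection methods conflate surface text generation with the source of the ideas and cannot handle hybrid authorship.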
Details
Motivation: Existing LLM detection methods reduce authorship to a binary human-vs-AI question and overlook the hybrid reality of peer review, where ideas and text may come from different sources. Method: Builds PeerPrism, a large-scale benchmark of 20,690 reviews covering fully human, fully synthetic, and several hybrid generation regimes, separating idea provenance from text provenance; systematically evaluates mainstream LLM detection methods, supported by stylometric and semantic analyses. Result: Mainstream detectors perform well on the binary task, but their predictions are highly inconsistent in hybrid settings (such as human ideas expressed in AI-generated text), indicating that they rely on surface text features and not on the origin of the reasoning. Conclusion: LLM detection in peer review should not be reduced to binary attribution; it should be modeled as a multidimensional authorship problem spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark for detecting human-AI collaboration in peer review. Abstract: Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI) without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration. In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state-of-the-art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI-generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution. Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem.
Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark evaluating human-AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at https://github.com/Reviewerly-Inc/PeerPrism.
[46] Mechanistic Decoding of Cognitive Constructs in LLMs
Yitong Shou,Manhao Guan
Main category: cs.CL
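TL;DR: This paper proposes a cognitive reverse-engineering framework based on representation engineering to analyze the internal cognitive structure of social-comparison jealousy in large language models. Models are found to encode jealousy as a linear combination of two psychological antecedents, superiority of the comparison person and domain self-definitional relevance, in a way consistent with human psychology. The framework can also detect and precisely suppress toxic emotional states, offering a route toward AI safety in multi-agent settings.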
Details
Motivation: Existing interpretability methods treat models as black boxes or focus only on basic emotions, so they struggle to reveal the internal cognitive mechanisms of complex emotions such as social-comparison jealousy. Method: Combines appraisal theory with Representation Engineering (RepE), using subspace orthogonalization, regression-based weighting, and bidirectional causal steering to isolate and quantify two psychological antecedents of jealousy (superiority of the comparison person and domain self-definitional relevance) and to test their causal effect on model judgments. Result: Experiments on 8 LLMs from the Llama, Qwen, and Gemma families indicate that models natively encode jealousy in a structured linear form; their internal representations are consistent with human psychology, with superiority as the foundational trigger and relevance as the intensity modulator; the framework can mechanically detect and precisely suppress toxic emotional states. Conclusion: Complex emotions in LLMs have an interpretable, intervenable cognitive structure; the framework both shows the psychological plausibility of emotion representations and offers a feasible technical path for representational monitoring and intervention in AI safety. Abstract: While Large Language Models (LLMs) demonstrate increasingly sophisticated affective capabilities, the internal mechanisms by which they process complex emotions remain unclear. Existing interpretability approaches often treat models as black boxes or focus on coarse-grained basic emotions, leaving the cognitive structure of more complex affective states underexplored. To bridge this gap, we propose a Cognitive Reverse-Engineering framework based on Representation Engineering (RepE) to analyze social-comparison jealousy. By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance, and examine their causal effects on model judgments. Experiments on eight LLMs from the Llama, Qwen, and Gemma families suggest that models natively encode jealousy as a structured linear combination of these constituent factors. Their internal representations are broadly consistent with the human psychological construct, treating Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier. Our framework also demonstrates that toxic emotional states can be mechanically detected and surgically suppressed, suggesting a possible route toward representational monitoring and intervention for AI safety in multi-agent environments.
[47] NLP needs Diversity outside of 'Diversity'
Joshua Tint
Main category: cs.CL
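TL;DR: This paper argues that progress on diversity in NLP is concentrated in fairness-related research while other subfields are neglected. The imbalance stems from institutional incentives, biases, and barriers that make it hard for marginalized researchers to take part in non-fairness research. Based on an analysis of researcher demographics by NLP subfield, the author recommends breaking the feedback loops that reinforce disparities and removing geographical and linguistic barriers.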
Details
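Motivation: Correct the over-concentration of diversity efforts in NLP on fairness topics, and expose the systemic mechanisms that exclude marginalized researchers from non-fairness subfields. Method: Conducts an empirical investigation of researcher demographics across NLP subfields, combined with an institutional analysis that identifies the incentive structures, biases, and barriers affecting diversity. Result: Progress on diversity in NLP is heavily skewed toward fairness-related subfields; marginalized researchers face multiple exclusionary barriers in non-fairness areas (geographical, linguistic, and academic feedback loops, among others). Conclusion: Funding, reviewing, and collaboration mechanisms in NLP need systematic reform to support inclusive development in every subfield, in particular by breaking structural feedback loops that reinforce disparities and by lowering geographical and linguistic barriers. Abstract: This position paper argues that recent progress with diversity in NLP is disproportionately concentrated on a small number of areas surrounding fairness. We further argue that this is the result of a number of incentives, biases, and barriers which come together to disenfranchise marginalized researchers in non-fairness fields, or to move them into fairness-related fields. We substantiate our claims with an investigation into the demographics of NLP researchers by subfield, using our research to support a number of recommendations for ensuring that all areas within NLP can become more inclusive and equitable. In particular, we highlight the importance of breaking down feedback loops that reinforce disparities, and the need to address geographical and linguistic barriers that hinder participation in NLP research.
[48] CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification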
Yian Wang,Yuen Chen,Agam Goyal,Hari Sundaram
Main category: cs.CL
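TL;DR: This paper proposes CAUSALDETOX, a framework that uses causal analysis to identify and intervene on the attention heads responsible for toxic generation. It combines dynamic inference-time intervention with PNS-guided fine-tuning to reduce toxicity while preserving fluency, and introduces PARATOX, a new benchmark for controlled counterfactual evaluation.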
Details
Motivation: Large language models often generate toxic content, and existing mitigation strategies tend to degrade generation quality or depend on costly human annotation. Method: Proposes the CAUSALDETOX framework: (1) use the Probability of Necessity and Sufficiency (PNS) to identify a minimal set of attention heads that are causally responsible for toxic generation; (2) Local Inference-Time Intervention, which builds dynamic, input-specific steering vectors for context-aware detoxification; (3) PNS-Guided Fine-Tuning, which permanently removes toxic representations; the PARATOX benchmark is built for counterfactual evaluation. Result: Experiments on ToxiGen, ImplicitHate, and ParaDetox show up to 5.34% greater toxicity reduction than baselines, preserved linguistic fluency, and a 7x speedup in head selection. Conclusion: CAUSALDETOX offers an efficient, interpretable, high-quality approach to toxicity mitigation that balances effectiveness, efficiency, and controllability. Abstract: Large language models (LLMs) frequently generate toxic content, posing significant risks for safe deployment. Current mitigation strategies often degrade generation quality or require costly human annotation. We propose CAUSALDETOX, a framework that identifies and intervenes on the specific attention heads causally responsible for toxic generation. Using the Probability of Necessity and Sufficiency (PNS), we isolate a minimal set of heads that are necessary and sufficient for toxicity. We utilize these components via two complementary strategies: (1) Local Inference-Time Intervention, which constructs dynamic, input-specific steering vectors for context-aware detoxification, and (2) PNS-Guided Fine-Tuning, which permanently unlearns toxic representations. We also introduce PARATOX, a novel benchmark of aligned toxic/non-toxic sentence pairs enabling controlled counterfactual evaluation. Experiments on ToxiGen, ImplicitHate, and ParaDetox show that CAUSALDETOX achieves up to 5.34% greater toxicity reduction compared to baselines while preserving linguistic fluency, and offers a 7x speedup in head selection.
[49] Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring
Sumit Mukherjee,Juan Shu,Nairwita Mazumder,Tate Kernell,Celena Wheeler,Shannon Hastings,Chris Sidey-Gibbons
Main category: cs.CL
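TL;DR: This paper proposes Retrieval-Augmented Set Completion (RASC) for clinical value set authoring. By retrieving similar existing value sets and classifying the candidate codes, it substantially improves the accuracy and efficiency of code identification.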
Details
Motivation: Clinical value set authoring is a common bottleneck in clinical quality measurement and phenotyping, and prompting a large language model to generate standardized codes directly is limited by large vocabularies, strict version control, and unreliable memorization during pretraining. Method: Proposes Retrieval-Augmented Set Completion (RASC): retrieve the K most similar value sets from a corpus of existing value sets to form a candidate pool, then score and filter each candidate code with a classifier; a cross-encoder is fine-tuned on SAPBert and compared with MLP and LightGBM models, among others. Result: On a benchmark of 11,803 VSAC value sets, RASC reaches AUROC 0.852 and value-set-level F1 0.298, outperforming an MLP (F1 0.250) and zero-shot GPT-4o (F1 0.105); it reduces the number of irrelevant candidates per true positive from 12.3 to roughly 3.2 to 4.4; the advantage grows with value set size. Conclusion: RASC lowers statistical complexity by shrinking the output space, its advantage holds across several model types, and it offers a scalable, robust paradigm for clinical value set authoring. Abstract: Clinical value set authoring -- the task of identifying all codes in a standardized vocabulary that define a clinical concept -- is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the K most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves AUROC 0.852 and value-set-level F1 0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC 0.799, F1 0.250); both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4, respectively. Zero-shot GPT-4o achieves value-set-level F1 0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage.
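The retrieve-then-classify pattern described above can be sketched with toy stand-ins. Everything here is an assumption made for illustration: token-overlap (Jaccard) similarity replaces whatever retriever the paper uses, a vote-fraction score replaces the fine-tuned cross-encoder, and the value sets, titles, and single-letter codes are invented placeholders, not real VSAC content.

```python
def jaccard(a, b):
    """Token-overlap similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def retrieve(query_tokens, corpus, k):
    """Return the k value sets whose titles are most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda vs: jaccard(query_tokens, vs["title"].lower().split()),
        reverse=True,
    )
    return ranked[:k]


def candidate_pool(retrieved):
    """Union of all codes in the retrieved value sets."""
    pool = set()
    for vs in retrieved:
        pool.update(vs["codes"])
    return pool


def score(code, retrieved):
    """Toy classifier: fraction of retrieved value sets that contain the code."""
    return sum(code in vs["codes"] for vs in retrieved) / len(retrieved)


def complete_set(query, corpus, k=2, threshold=0.5):
    """Retrieve k similar value sets, then keep candidate codes scoring at or above threshold."""
    retrieved = retrieve(query.lower().split(), corpus, k)
    pool = candidate_pool(retrieved)
    return sorted(c for c in pool if score(c, retrieved) >= threshold)


# Invented placeholder corpus for demonstration only.
corpus = [
    {"title": "type 2 diabetes", "codes": {"A", "B"}},
    {"title": "diabetes mellitus", "codes": {"B", "C"}},
    {"title": "asthma", "codes": {"D"}},
]
```

With this corpus, the query "diabetes type 2" retrieves the two diabetes value sets, so the classifier only has to judge three candidate codes instead of the whole vocabulary; that shrinking of the output space is the effect the abstract attributes to retrieve-and-select.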
We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code, is available at: https://github.com/mukhes3/RASC.
[50] StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation
Geonhui Jang,Dongyoon Han,YoungJoon Yoo
Main category: cs.CL
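TL;DR: This paper proposes StoryCoder, a framework that reformulates code generation problems as natural language narratives containing a task overview, constraints, and example test cases. Experiments show clear gains in zero-shot pass@10 on several benchmarks, and the narratives guide models toward more correct algorithmic strategies and more modular code.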
Details
Motivation: Existing code generation methods augment reasoning steps or inject structure into how models think, but they do not systematically organize scattered problem conditions; inspired by how humans organize fragmented information into coherent explanations, a problem representation with richer contextual structure is needed. Method: Proposes the StoryCoder narrative reformulation framework, which turns the original coding problem into a natural language narrative with three parts (task overview, constraints, and example test cases), guided by the selected algorithm and genre. Result: Experiments with 11 models on HumanEval, LiveCodeBench, and CodeForces show an average gain of 18.7% in zero-shot pass@10; analyses indicate that the method guides models toward correct algorithmic strategies, reduces implementation errors, and encourages modular code structure, and that the effect depends on narrative coherence and genre alignment. Conclusion: Structured, narrative problem representation matters for code generation, with benefits that do not depend on model scale or architecture, offering a new direction for improving the programming ability of large models. Abstract: Effective code generation requires both model capability and a problem representation that carefully structures how models reason and plan. Existing approaches augment reasoning steps or inject specific structure into how models think, but leave scattered problem conditions unchanged. Inspired by the way humans organize fragmented information into coherent explanations, we propose StoryCoder, a narrative reformulation framework that transforms code generation questions into coherent natural language narratives, providing richer contextual structure than simple rephrasings. Each narrative consists of three components: a task overview, constraints, and example test cases, guided by the selected algorithm and genre. Experiments across 11 models on HumanEval, LiveCodeBench, and CodeForces demonstrate consistent improvements, with an average gain of 18.7% in zero-shot pass@10. Beyond accuracy, our analyses reveal that narrative reformulation guides models toward correct algorithmic strategies, reduces implementation errors, and induces a more modular code structure. The analyses further show that these benefits depend on narrative coherence and genre alignment, suggesting that structured problem representation is important for code generation regardless of model scale or architecture. Our code is available at https://github.com/gu-ni/StoryCoder.
[51] Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
Nahyun Lee,Guijin Son
Main category: cs.CL
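TL;DR: This paper proposes a massive-option (100-option) multiple-choice evaluation protocol to test more rigorously the true ability of large language models on Korean orthography error detection, exposing model weaknesses that conventional low-option settings tend to mask, such as semantic confusion and position bias.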
Details
Motivation: With few options, conventional multiple-choice evaluation easily reaches near-ceiling accuracy, but that accuracy may rest on shortcut strategies rather than genuine language understanding, overstating model performance. Method: A massive-option (N=100) evaluation protocol that combines fixed targets with repeated resampling and shuffling, and adds context-length controls and padding-controlled experiments to separate semantic competence from confounds such as position bias. Result: Strong models do well in low-option settings but degrade markedly at high N. Two main failure modes emerge: semantic confusion and a positional preference for early options. Candidate ranking, not context length, is the main bottleneck. Conclusion: Massive-option evaluation is an effective stress-testing framework that reveals model reliability under high distractor density more dependably and compensates for the limits of conventional benchmarks. Abstract: Multiple choice evaluation is widely used for benchmarking large language models, yet near ceiling accuracy in low option settings can be sustained by shortcut strategies that obscure true competence. Therefore, we propose a massive option evaluation protocol that scales the candidate set to one hundred options and sharply reduces the impact of chance performance. We apply this framework to a Korean orthography error detection task where models must pick the single incorrect sentence from a large candidate set. With fixed targets and repeated resampling and shuffling, we obtain stable estimates while separating content driven failures from positional artifacts. Across experiments, results indicate that strong performance in low option settings can overstate model competence. This apparent advantage often weakens under dense interference at high $N$, revealing gaps that conventional benchmarks tend to obscure. We identify two failure modes, semantic confusion and position bias toward early options under uncertainty. To isolate the effect of context length, we run padding controlled and length matched tests, which suggest that the main bottleneck is candidate ranking rather than context length. Together, these findings support massive option evaluation as a general framework for stress testing model reliability under extreme distractor density, beyond what low option benchmarks can reveal.
[52] Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models
Cuong Hoang,Le-Minh Nguyen
Main category: cs.CL
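TL;DR: This paper presents a reference-free financial misinformation detection method that combines zero-shot/few-shot prompting with LoRA fine-tuning of large language models, ranking first in the shared task (95.4% accuracy on the public test set and 96.3% on the private test set).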
Details
Motivation: The spread of financial misinformation threatens market stability and investor trust, and in practice external evidence for cross-verification is often unavailable, so detection methods that do not rely on references are needed. Method: Building on the RFC-BENCH framework, the approach combines in-context learning with large language models (zero-shot and few-shot prompting) and parameter-efficient fine-tuning (LoRA) to capture subtle semantic cues of financial manipulation. Result: First place in the "Reference-Free Financial Misinformation Detection" shared task, with 95.4% accuracy on the public test set and 96.3% on the private test set; the 14B and 32B models are released. Conclusion: The method shows that financial misinformation can be identified effectively from internal semantics and contextual consistency alone, advancing context-aware misinformation detection in financial NLP. Abstract: The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. 
Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.
[53] CURaTE: Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge
Seyun Bae,Seokhan Lee,Eunho Yang
Main category: cs.CL
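TL;DR: This paper proposes CURaTE, which trains a sentence embedding model to detect, in real time, inputs similar to stored forget requests and refuse them, enabling continual real-time unlearning for large language models while leaving the model's existing knowledge almost entirely intact.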
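The gating idea summarized here (compare an input with stored forget requests, refuse when it is too similar, otherwise answer with the unmodified model) can be sketched as follows. This is an illustration under stated assumptions, not CURaTE's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the trained sentence embedding model, and the `ForgetGate` class, the 0.8 threshold, and the refusal string are all invented for the example.

```python
import math
from collections import Counter

REFUSAL = "I'm sorry, I can't help with that."  # hypothetical refusal text


def toy_embed(text):
    # Hypothetical stand-in for the trained sentence-embedding model:
    # a sparse bag-of-words vector keyed by lowercase token.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ForgetGate:
    """Wraps a frozen LLM callable; the LLM itself is never modified."""

    def __init__(self, llm, threshold=0.8):
        self.llm = llm
        self.threshold = threshold  # assumed value, would be tuned in practice
        self.forget_embeddings = []

    def add_forget_request(self, text):
        # A new forget request only appends an embedding, so updates are immediate.
        self.forget_embeddings.append(toy_embed(text))

    def answer(self, prompt):
        query = toy_embed(prompt)
        score = max((cosine(query, f) for f in self.forget_embeddings), default=0.0)
        if score >= self.threshold:
            return REFUSAL
        return self.llm(prompt)
```

Because each forget request only extends a list of embeddings, the wrapped model's outputs on unrelated prompts are unchanged no matter how many requests accumulate, which is the property the summary attributes to leaving the parameters untouched.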
Details
Motivation: It is not possible to filter all potentially harmful data out of LLM pre-training in advance, so specific knowledge must be unlearned after training. Existing methods do not support continuous, immediate unlearning, which leads to degraded utility and prolonged exposure of sensitive information. Method: CURaTE first trains a sentence embedding model on a purpose-built dataset to form sharp decision boundaries for forget requests. At inference time, it computes the similarity between the input prompt and the stored forget requests, refuses to answer if the similarity exceeds a threshold, and otherwise responds normally; the language model's parameters are never modified. Result: CURaTE forgets more effectively than existing methods; because model parameters are not updated, knowledge preservation is near perfect; and it is the only method that supports real-time, continual unlearning over any number of updates. Conclusion: CURaTE provides an efficient, safe, and sustainable post-training unlearning mechanism that balances forgetting effectiveness with knowledge integrity, opening a new path for privacy and compliance applications of LLMs. Abstract: The inability to filter out in advance all potentially problematic data from the pre-training of large language models has given rise to the need for methods for unlearning specific pieces of knowledge after training. Existing techniques overlook the need for continuous and immediate action, causing them to suffer from degraded utility as updates accumulate and protracted exposure of sensitive information. To address these issues, we propose Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge (CURaTE). Our method begins by training a sentence embedding model on a dataset designed to enable the formation of sharp decision boundaries for determining whether a given input prompt corresponds to any stored forget requests. The similarity of a given input to the forget requests is then used to determine whether to answer or return a refusal response. We show that even with such a simple approach, not only does CURaTE achieve more effective forgetting than existing methods, but by avoiding modification of the language model parameters, it also maintains near perfect knowledge preservation over any number of updates and is the only method capable of continual unlearning in real-time.
[54] CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction
Sizhe Wang,Ziqi Xu,Claire Najjuuko,Charles Alba,Chenyang Lu
Main category: cs.CL
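TL;DR: This paper proposes CURA, a framework that fine-tunes clinical language models with a bi-level uncertainty objective to improve the uncertainty calibration of risk predictions, accounting for both individual error likelihood and cohort-level ambiguity. On MIMIC-IV data, it substantially improves calibration while preserving discrimination.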
Details
Motivation: Uncertainty estimates from clinical language models used for risk prediction are often poorly calibrated and clinically unreliable. Method: The Clinical Uncertainty Risk Alignment (CURA) framework first fine-tunes a domain-specific clinical LM to obtain patient embeddings, then performs uncertainty fine-tuning of a multi-head classifier with a bi-level objective: an individual-level calibration term aligns predictive uncertainty with each patient's likelihood of error, and a cohort-aware regularizer pulls risk estimates toward the event rate in the local neighborhood of the embedding space while placing extra weight on ambiguous cohorts near the decision boundary. The regularizer can be interpreted as a cross-entropy loss with neighborhood-informed soft labels. Result: Across multiple MIMIC-IV clinical risk prediction tasks and different clinical LMs, CURA consistently improves calibration metrics (such as ECE) without substantially harming discrimination (such as AUC), reduces overconfident false reassurance, and yields more trustworthy uncertainty estimates. Conclusion: CURA improves the uncertainty calibration and clinical trustworthiness of clinical LM risk prediction, providing a more reliable uncertainty quantification tool for downstream clinical decision support. Abstract: Clinical language models (LMs) are increasingly applied to support clinical risk prediction from free-text notes, yet their uncertainty estimates often remain poorly calibrated and clinically unreliable. In this work, we propose Clinical Uncertainty Risk Alignment (CURA), a framework that aligns clinical LM-based risk estimates and uncertainty with both individual error likelihoods and cohort-level ambiguities. CURA first fine-tunes domain-specific clinical LMs to obtain task-adapted patient embeddings, and then performs uncertainty fine-tuning of a multi-head classifier using a bi-level uncertainty objective. Specifically, an individual-level calibration term aligns predictive uncertainty with each patient's likelihood of error, while a cohort-aware regularizer pulls risk estimates toward event rates in their local neighborhoods in the embedding space and places extra weight on ambiguous cohorts near the decision boundary. We further show that this cohort-aware term can be interpreted as a cross-entropy loss with neighborhood-informed soft labels, providing a label-smoothing view of our method. Extensive experiments on MIMIC-IV clinical risk prediction tasks across various clinical LMs show that CURA consistently improves calibration metrics without substantially compromising discrimination. Further analysis illustrates that CURA reduces overconfident false reassurance and yields more trustworthy uncertainty estimates for downstream clinical decision support.
[55] SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models
Binxian Su,Haoye Lou,Shucheng Zhu,Weikang Wang,Ying Liu,Dong Yu,Pengyuan Liu
Main category: cs.CL
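TL;DR: This paper proposes SPAGBias, the first framework to systematically evaluate spatial gender bias in large language models (LLMs). Combining a taxonomy of urban micro-spaces, a prompt library, and three diagnostic layers, it finds structured gender-space associations that go beyond the public-private divide and shows that they are embedded and reinforced during pre-training, instruction tuning, and reward modeling, leading to failures in downstream applications.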
Details
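Motivation: Gendered space theory holds that gender hierarchies are deeply embedded in spatial organization, and LLMs are increasingly used in urban planning, so the spatial gender biases they may reproduce or amplify need systematic evaluation. Method: The SPAGBias framework comprises a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis). Six representative models are tested in multi-dimensional experiments covering story generation, the effects of prompt design, temperature, and model scale, bias tracing, and downstream task validation. Result: The models exhibit fine-grained, structured gender-space mappings beyond the public-private divide; story generation shows how emotion, wording, and social roles jointly shape "spatial gender narratives"; the bias runs through the whole model pipeline and deviates substantially from real-world distributions; and in downstream tasks it causes failures in both normative and descriptive applications. Conclusion: LLMs not only reflect linguistic bias but also encode the spatial dimension of social gender cognition. The study combines sociological theory with computational analysis, opens bias research in the spatial domain, and provides a theoretical and practical basis for responsible urban AI applications. Abstract: Large language models (LLMs) are being increasingly used in urban planning, but since gendered space theory highlights how gender hierarchies are embedded in spatial organization, there is concern that LLMs may reproduce or amplify such biases. We introduce SPAGBias - the first systematic framework to evaluate spatial gender bias in LLMs. It combines a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis). Testing six representative models, we identify structured gender-space associations that go beyond the public-private divide, forming nuanced micro-level mappings. Story generation reveals how emotion, wording, and social roles jointly shape "spatial gender narratives". We also examine how prompt design, temperature, and model scale influence bias expression. Tracing experiments indicate that these patterns are embedded and reinforced across the model pipeline (pre-training, instruction tuning, and reward modeling), with model associations found to substantially exceed real-world distributions. Downstream experiments further reveal that such biases produce concrete failures in both normative and descriptive application settings. 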
This work connects sociological theory with computational analysis, extending bias research into the spatial domain and uncovering how LLMs encode social gender cognition through language.
[56] Which bird does not have wings: Negative-constrained KGQA with Schema-guided Semantic Matching and Self-directed Refinement
Midan Shim,Seokju Hwang,Kaehyun Um,Kyong-Ho Lee
Main category: cs.CL
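TL;DR: This paper introduces NEST KGQA, a new task focused on KGQA questions with negative constraints, and builds the NestKGQA dataset. It designs PyLF, a readable logical form that expresses negation clearly, and proposes the CUCKOO framework, which uses constraint-aware logical form generation and self-directed refinement to improve semantic executability and robustness on multi-constraint questions.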
Details
Motivation: Existing KGQA methods and benchmarks are biased toward positive and calculation constraints and neglect the negative constraints that appear frequently in real-world questions, so models handle questions with negation poorly. Method: The paper introduces the NEST KGQA task and the NestKGQA dataset; designs PyLF, a Python-formatted logical form that expresses negation clearly; and builds the CUCKOO framework, which comprises constraint-aware logical form draft generation, schema-guided semantic matching, and a self-directed refinement mechanism triggered by empty results. Result: Under few-shot settings, CUCKOO consistently outperforms baselines on both conventional KGQA and NEST-KGQA benchmarks. Conclusion: Modeling negative constraints makes KGQA more practical and robust, and PyLF and CUCKOO offer an effective, extensible solution for complex multi-constraint semantic parsing. Abstract: Large language models still struggle with faithfulness and hallucinations despite their remarkable reasoning abilities. In Knowledge Graph Question Answering (KGQA), semantic parsing-based approaches address the limitations by understanding constraints in a user's question and converting them into a logical form to execute on a knowledge graph. However, existing KGQA benchmarks and methods are biased toward positive and calculation constraints. Negative constraints are neglected, although they frequently appear in real-world questions. In this paper, we introduce a new task, NEgative-conSTrained (NEST) KGQA, where each question contains at least one negative constraint, and a corresponding dataset, NestKGQA. We also design PyLF, a Python-formatted logical form, since existing logical forms are hardly suitable to express negation clearly while maintaining readability. Furthermore, NEST questions naturally contain multiple constraints. To mitigate their semantic complexity, we present a novel framework named CUCKOO, specialized to multiple-constrained questions and ensuring semantic executability. CUCKOO first generates a constraint-aware logical form draft and performs schema-guided semantic matching. It then selectively applies self-directed refinement only when executing improper logical forms yields an empty result, reducing cost while improving robustness. Experimental results demonstrate that CUCKOO consistently outperforms baselines on both conventional KGQA and NEST-KGQA benchmarks under few-shot settings.
[57] CoPA: Benchmarking Personalized Question Answering with Data-Informed Cognitive Factors
Hang Su,Zequn Liu,Chen Hu,Xuesong Lu,Yingce Xia,Zhen Liu
Main category: cs.CL
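TL;DR: This paper proposes the CoPA benchmark, which mines Community-Individual Preference Divergence (CIPD) to identify six personalization dimensions for fine-grained evaluation of the personalization ability of large language models in question answering.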
Details
Motivation: Existing evaluation of personalized QA relies on lexical similarity or manual heuristics and lacks sufficient data-driven validation. Method: The authors mine Community-Individual Preference Divergence (CIPD), distill six key personalization factors as evaluative dimensions, and build the CoPA benchmark with 1,985 user profiles; cognitive preferences are inferred from user interaction patterns, and the alignment of model outputs with them is quantified. Result: CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. Conclusion: CoPA gives personalized QA systems an interpretable, quantifiable, fine-grained evaluation framework and advances data-driven personalization evaluation. Abstract: While LLMs have demonstrated remarkable potential in Question Answering (QA), evaluating personalization remains a critical bottleneck. Existing paradigms predominantly rely on lexical-level similarity or manual heuristics, often lacking sufficient data-driven validation. We address this by mining Community-Individual Preference Divergence (CIPD), where individual choices override consensus, to distill six key personalization factors as evaluative dimensions. Accordingly, we introduce CoPA, a benchmark with 1,985 user profiles for fine-grained, factor-level assessment. By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The code is available at https://github.com/bjzgcai/CoPA.
[58] Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems
Nishanth Madhusudhan,Vikas Yadav,Alexandre Lacoste
Main category: cs.CL
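TL;DR: This paper introduces the MM-AQA benchmark for evaluating whether multimodal systems abstain effectively when evidence is insufficient. Existing vision-language models and multi-agent systems abstain poorly, and the authors conclude that abstention-aware training is needed rather than better prompting or more agents alone.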
Details
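Motivation: Existing evaluation paradigms for vision-language models and multi-agent systems assume that questions are always answerable, overlooking a key reliability requirement: abstaining when evidence is insufficient. Abstention has been studied in text-only settings, but multimodal settings still lack fine-grained benchmarks and methods that reflect realistic failure modes. Method: The MM-AQA benchmark generates unanswerable instances by transforming answerable ones along two axes, visual modality dependency and evidence sufficiency. Three frontier vision-language models and two multi-agent architectures are evaluated on 2079 samples, with analysis of abstention behavior across prompting strategies, architecture designs (sequential vs. iterative), and missing, degraded, or contradictory evidence. Result: (1) Under standard prompting, VLMs rarely abstain, and simple confidence baselines outperform them; (2) multi-agent systems raise the abstention rate at the cost of accuracy; (3) sequential architectures are no worse than iterative ones, indicating that the problem is miscalibration rather than reasoning depth; (4) models abstain only when image or text evidence is entirely absent and still force an answer when evidence is degraded or contradictory. Conclusion: Effective multimodal abstention cannot be achieved through prompt engineering or stacking more agents alone; it requires dedicated abstention-aware training. Abstract: Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.
[59] Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem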
Zeguan Xiao,Siqing Li,Yong Wang,Xuetao Wei,Jian Yang,Yun Chen,Guanhua Chen
Main category: cs.CL
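TL;DR: This paper proposes a retention-prioritized gradient synthesis framework for machine unlearning in large language models. By decoupling task-specific gradient extraction from conflict-aware combination, it improves how well a model retains general capability while forgetting targeted knowledge.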
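To make "conflict-aware combination" concrete, here is a minimal sketch that applies the standard PCGrad projection rule asymmetrically: only the forget gradient is projected, and only when it conflicts with the retain gradient. This is an illustration of the general idea, assuming plain Python lists as gradients; it is not the paper's exact PCGrad adaptation, and SAGO is not sketched because its construction is not described in this entry. The function name is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def retain_prioritized_pcgrad(g_retain, g_forget):
    """Combine gradients so the result never opposes the retain gradient.

    If the forget gradient conflicts with the retain gradient (negative
    inner product), remove its component along the retain direction
    before summing. The retain gradient itself is left untouched.
    """
    inner = dot(g_forget, g_retain)
    sq_norm = dot(g_retain, g_retain)
    if inner < 0 and sq_norm > 0:
        coef = inner / sq_norm
        g_forget = [f - coef * r for f, r in zip(g_forget, g_retain)]
    return [r + f for r, f in zip(g_retain, g_forget)]
```

After projection the forget gradient has a non-negative inner product with the retain gradient, so the combined update has non-negative cosine similarity with the retain gradient, the property this entry's abstract states for both variants.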
Details
Motivation: Address the trade-off between forgetting targeted knowledge and retaining general capability when performing machine unlearning on large language models (LLMs). Method: LLM unlearning is modeled as an asymmetric two-task problem (retention primary, forgetting auxiliary) and handled with a retention-prioritized gradient synthesis framework; the authors adapt PCGrad and propose a new method, SAGO, that performs constructive sign-constrained gradient synthesis. Result: On the WMDP Bio/Cyber and RWKU benchmarks, SAGO markedly improves retention (for example, on WMDP Bio the recovery of the target model's MMLU performance rises from 44.6% to 96.0%) while maintaining comparable forgetting strength. Conclusion: Reshaping gradient geometry, rather than rebalancing losses, is the key to mitigating the unlearning-retention trade-off. Abstract: Machine unlearning for large language models (LLMs) aims to remove targeted knowledge while preserving general capability. In this paper, we recast LLM unlearning as an asymmetric two-task problem: retention is the primary objective and forgetting is auxiliary. From this perspective, we propose a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from conflict-aware combination. Instantiating the framework, we adapt established PCGrad to resolve gradient conflicts, and introduce SAGO, a novel retention-prioritized gradient synthesis method. Theoretically, both variants ensure non-negative cosine similarity with the retain gradient, while SAGO achieves strictly tighter alignment through constructive sign-constrained synthesis. Empirically, on WMDP Bio/Cyber and RWKU benchmarks, SAGO consistently pushes the Pareto frontier: e.g., on WMDP Bio (SimNPO+GD), recovery of target model MMLU performance progresses from 44.6% (naive) to 94.0% (+PCGrad) and further to 96.0% (+SAGO), while maintaining comparable forgetting strength. Our results show that re-shaping gradient geometry, rather than re-balancing losses, is the key to mitigating unlearning-retention trade-offs.
[60] Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations
Rami Luisto,Liisa Petäinen,Tommi Grönholm,Jan Böhm,Maarit Ahtiainen,Tomi Lilja,Ilkka Pölönen,Sami Äyrämö
Main category: cs.CL
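TL;DR: This paper examines domain fine-tuning of the Finnish BERT model (particularly on medical text) for NLP classification tasks where labeled data is scarce, and attempts to predict the benefit of domain pre-training by analyzing geometric changes in the embedding space.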
Details
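Motivation: Healthcare AI often faces long delays in acquiring labeled data, so model performance needs to be improved with few labels. Method: Unsupervised domain fine-tuning of the Finnish BERT model on Finnish medical text, with analysis of how the geometry of the embedding space changes before and after fine-tuning in order to predict the benefit of domain pre-training. Result: The paper reports observations from fine-tuning Finnish BERT on medical text and a preliminary exploration of the link between embedding-geometry changes and the benefit of domain pre-training. Conclusion: Domain fine-tuning is reported to help, and embedding-geometry changes may become an indicator of the value of domain pre-training, though further validation is needed. Abstract: In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.
[61] Pangu-ACE: Adaptive Cascaded Experts for Educational Response Generation on EduBench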
Dinghao Li,Wenlong Zhou,Zhimin Chen,Yuehan Peng,Hong Ni,Chengfu Zou,Guoyu Shi,Yaochen Li
Main category: cs.CL
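TL;DR: This paper presents Pangu-ACE, a cascaded educational-assistant system that dynamically chooses between a 1B and a 7B model according to task complexity. On the EduBench benchmark it improves quality and format validity while allocating computation on demand.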
Details
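Motivation: Educational assistants should allocate computation according to task needs and avoid overspending on simple tasks; the paper also corrects performance overestimation in an earlier offline evaluation caused by superficial format checks. Method: A sample-level 1B-to-7B cascade (Pangu-ACE): a 1B tutor-router produces a draft and routing signals that decide whether the sample is passed to a 7B specialist prompt for refinement; saved predictions are rescored on the CPU side to correct the evaluation bias; and a reproducible artifact-first paper workflow is provided. Result: On the 7013-sample Chinese test set, cascade_final improves deterministic quality from 0.457 to 0.538 and format validity from 0.707 to 0.866 over the legacy rule_v2 system; the 1B model directly accepts 19.7% of requests, with a 78.0% acceptance rate on IP tasks, while QG and EC are escalated almost always; the current deployment does not yet show a latency advantage, so the efficiency gain comes from routing selectivity rather than actual speedup. Conclusion: Pangu-ACE demonstrates the feasibility of fine-grained, task-aware model cascading for improving quality and resource efficiency in educational AI; the main open issue is repairing the infrastructure needed for alignment with the external baseline (GPT-5.4). Abstract: Educational assistants should spend more computation only when the task needs it. This paper rewrites our earlier draft around the system that was actually implemented and archived in the repository: a sample-level 1B to 7B cascade for the shared-8 EduBench benchmark. The final system, Pangu-ACE, uses a 1B tutor-router to produce a draft answer plus routing signals, then either accepts the draft or escalates the sample to a 7B specialist prompt. We also correct a major offline evaluation bug: earlier summaries over-credited some open-form outputs that only satisfied superficial format checks. After CPU-side rescoring from saved prediction JSONL, the full Chinese test archive (7013 samples) shows that cascade_final improves deterministic quality from 0.457 to 0.538 and format validity from 0.707 to 0.866 over the legacy rule_v2 system while accepting 19.7% of requests directly at 1B. Routing is strongly task dependent: IP is accepted by 1B 78.0% of the time, while QG and EC still escalate almost always. The current archived deployment does not yet show latency gains, so the defensible efficiency story is routing selectivity rather than wall-clock speedup. 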
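The accept-or-escalate step of the cascade can be sketched as below. This is a minimal illustration, assuming the routing signals are a confidence score and a format check; the abstract does not say which signals Pangu-ACE actually uses, so `Draft`, `cascade_answer`, and the 0.7 threshold are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    answer: str
    confidence: float  # hypothetical routing signal in [0, 1]
    format_ok: bool    # hypothetical routing signal: passes format checks


def cascade_answer(sample, small_model, large_model, min_confidence=0.7):
    """Return (answer, tier). The small model always runs first; the
    large model runs only when the draft is rejected by the router."""
    draft = small_model(sample)
    if draft.format_ok and draft.confidence >= min_confidence:
        return draft.answer, "1B"
    # Escalate: the specialist sees the sample plus the rejected draft.
    return large_model(sample, draft.answer), "7B"
```

Because the small model runs on every sample, escalated requests pay for both models, which is consistent with the abstract's caution that routing selectivity does not by itself imply a wall-clock speedup.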
We also package a reproducible artifact-first paper workflow and clarify the remaining external-baseline gap: GPT-5.4 re-judging is implemented locally, but the configured provider endpoint and key are invalid, so final sampled-baseline alignment with GPT-5.4 remains pending infrastructure repair.
[62] Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution
Yufeng Wu
Main category: cs.CL
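TL;DR: This paper treats Behavioral Profile (BP) annotation as a bundle of separate skills rather than a single task and uses a skill-file-driven pipeline to evaluate large language models (LLMs) on BP annotation of Chinese metaphorical color-term derivatives. BP skills turn out to be highly heterogeneous, and only some can be executed reliably by GPT-5.4; human and model difficulty agree strongly at the skill level but are uncorrelated at the instance level, suggesting that automatic annotation should be assessed by "skill feasibility" rather than "task automation".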
Details
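Motivation: BP annotation is hard to automate because it requires handling several linguistic dimensions at once; existing approaches treat BP as a single task and overlook its internal skill heterogeneity, so the feasibility of LLM-assisted annotation needs to be reassessed from a skill-decomposition perspective. Method: The 14-feature BP annotation is decomposed into separate skills, each driven by externally defined schema files, decision rules, and examples; a 300-instance validation set undergoes two rounds of human annotation, and each skill is classified as "directly operable", "recoverable under focused re-annotation", or "structurally underspecified"; GPT-5.4 and three open-source models are tested under the same setup, and human-model difficulty correlation and agreement are analyzed. Result: Of the 14 BP skills, 5 are directly operable, 4 are recoverable after focused re-annotation, and 5 are structurally underspecified; GPT-5.4 is reliable on the retained skills (accuracy 0.678, κ = 0.665, weighted F1 = 0.695), but its competence is selective; human and GPT skill difficulty correlate strongly (r = 0.881) but show no correlation at the instance level (r = 0.016) or lexical-item level (r = -0.142); GPT is better viewed as an independent "third skill voice" than as a human replacement; open-source models fail mainly at schema-to-skill execution. Conclusion: Automatic BP annotation should not aim at automating the task as a whole but should be evaluated at fine granularity by skill feasibility; human-model collaboration should focus on identifying and completing structurally underspecified skills rather than forcing a unified model; the framework can be extended to other multi-dimensional linguistic annotation tasks. Abstract: Behavioral Profile (BP) annotation is difficult to automate because it requires simultaneous coding across multiple linguistic dimensions. We treat BP annotation as a bundle of annotation skills rather than a single task and evaluate LLM-assisted BP annotation from this perspective. Using 3,134 concordance lines of 30 Chinese metaphorical color-term derivatives and a 14-feature BP schema, we implement a skill-file-driven pipeline in which each feature is externally defined through schema files, decision rules, and examples. Two human annotators completed a two-round schema-only protocol on a 300-instance validation subset, enabling BP skills to be classified as directly operable, recoverable under focused re-annotation, or structurally underspecified. GPT-5.4 and three locally deployable open-source models were then evaluated under the same setup. Results show that BP annotation is highly heterogeneous at the skill level: 5 skills are directly operable, 4 are recoverable after focused re-annotation, and 5 remain structurally underspecified. GPT-5.4 executes the retained skills with substantial reliability (accuracy = 0.678, κ = 0.665, weighted F1 = 0.695), but this feasibility is selective rather than global. 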
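For readers unfamiliar with the agreement statistic reported alongside accuracy, a minimal sketch of Cohen's kappa for two label sequences follows. It assumes that κ in this entry denotes Cohen's kappa between model and reference labels, which the abstract does not state explicitly; the function name is illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently
    # according to their own marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Unlike raw accuracy, kappa discounts agreement that would occur by chance given each annotator's label distribution, which is why the two numbers can differ.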
Human and GPT difficulty profiles are strongly aligned at the skill level (r = 0.881), but not at the instance level (r = 0.016) or lexical-item level (r = -0.142), a pattern we describe as shared taxonomy, independent execution. Pairwise agreement further suggests that GPT is better understood as an independent third skill voice than as a direct human substitute. Open-source failures are concentrated in schema-to-skill execution problems. These findings suggest that automatic annotation should be evaluated in terms of skill feasibility rather than task-level automation.
[63] ClimateCause: Complex and Implicit Causal Structures in Climate Reports
Liesbeth Allein,Nataly Pineda-Castañeda,Andrea Rocci,Marie-Francine Moens
Main category: cs.CL
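TL;DR: This paper introduces ClimateCause, a dataset of climate reports manually annotated by experts with higher-order causal structures, intended to support modeling of and reasoning over complex causal networks.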
Details
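Motivation: Existing causal discovery datasets focus mainly on explicit, direct causal relations and do little to support understanding and modeling of the implicit, nested causal structures found in complex systems such as climate change. Method: The ClimateCause dataset is built from science-for-policy climate reports, with higher-order causal structures annotated manually by domain experts; cause-effect expressions are normalized and disentangled to support causal graph construction, and annotated with cause-effect correlation, relation type, and spatiotemporal context; the dataset is then used to quantify the semantic complexity (that is, readability) of the causal graph underlying a statement; finally, large language models are benchmarked on correlation inference and causal chain reasoning. Result: ClimateCause is presented as the first expert-annotated causal dataset for climate policy text that covers implicit and nested causality; it proves useful for quantifying readability from causal graphs; and LLM experiments show that causal chain reasoning is harder than correlation inference. Conclusion: ClimateCause fills a data gap for complex, real-world causal modeling and provides a key resource and evaluation benchmark for improving how models understand and reason over higher-order causal structures. Abstract: Understanding climate change requires reasoning over complex causal networks. Yet, existing causal discovery datasets predominantly capture explicit, direct causal relations. We introduce ClimateCause, a manually expert-annotated dataset of higher-order causal structures from science-for-policy climate reports, including implicit and nested causality. Cause-effect expressions are normalized and disentangled into individual causal relations to facilitate graph construction, with unique annotations for cause-effect correlation, relation type, and spatiotemporal context. We further demonstrate ClimateCause's value for quantifying readability based on the semantic complexity of causal graphs underlying a statement. Finally, large language model benchmarking on correlation inference and causal chain reasoning highlights the latter as a key challenge.
[64] Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding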
Yifan Le
Main category: cs.CL
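TL;DR: This paper studies how the wording of schema keys acts as an implicit instruction that affects model performance in structured generation with large language models, and proposes viewing structured generation as a multi-channel instruction problem.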
Details
Motivation: Existing constrained decoding methods treat the schema as a purely structural constraint and overlook the possibility that its linguistic formulation affects model behavior. Method: The wording of schema keys is varied (without changing the prompt or model parameters) and its effect on performance is analyzed systematically; structured generation is recast as a multi-channel instruction problem with explicit prompt instructions and implicit schema-key instructions. Result: Model families differ in their sensitivity to the instruction channels: Qwen models benefit from schema-level instructions, whereas LLaMA models rely more on prompt-level guidance; the channels also interact non-additively. Conclusion: Schema design not only determines output structure but also carries instruction signals, offering a new perspective on structured generation with LLMs. Abstract: Constrained decoding has been widely adopted for structured generation with large language models (LLMs), ensuring that outputs satisfy predefined formats such as JSON and XML. However, existing approaches largely treat schemas as purely structural constraints and overlook the possibility that their linguistic formulation may affect model behavior. In this work, we study how instruction placement influences model performance in structured generation and show that merely changing the wording of schema keys, without modifying the prompt or model parameters, can significantly alter model performance under constrained decoding. Based on this observation, we propose to reinterpret structured generation as a multi-channel instruction problem, where instructions can be conveyed explicitly through prompts and implicitly through schema keys during decoding. To the best of our knowledge, this is the first work to systematically study how schema key formulation acts as an implicit instruction channel and affects model performance under constrained decoding. Experiments on multiple mathematical reasoning benchmarks show that different model families exhibit distinct sensitivities to these instruction channels: Qwen models consistently benefit from schema-level instructions, while LLaMA models rely more heavily on prompt-level guidance. We further observe non-additive interaction effects between instruction channels, showing that combining multiple channels does not always lead to further improvement. 
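To illustrate what "changing only the wording of schema keys" means, the sketch below builds two JSON Schemas that are structurally identical and differ only in the key name. The key names and helper functions are hypothetical examples, not the wordings tested in the paper.

```python
import json


def answer_schema(key_name):
    """A one-field JSON Schema; only the key wording varies between variants."""
    return {
        "type": "object",
        "properties": {key_name: {"type": "string"}},
        "required": [key_name],
        "additionalProperties": False,
    }


def structure_of(schema):
    """Schema shape with key names erased, for checking structural equality."""
    return (
        schema["type"],
        sorted(json.dumps(spec, sort_keys=True) for spec in schema["properties"].values()),
        len(schema["required"]),
        schema["additionalProperties"],
    )


# Hypothetical key wordings: a neutral key and one that reads like an instruction.
neutral = answer_schema("answer")
instructive = answer_schema("step_by_step_reasoning_then_final_answer")
```

Under constrained decoding both variants force the same output shape, so any performance difference between them would have to come from the key text itself, which is the implicit instruction channel described above.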
These findings suggest that schema design not only determines output structure, but also carries instruction signals, offering a new perspective on structured generation in LLMs.
[65] Segment-Level Coherence for Robust Harmful Intent Probing in LLMs
Xuanli He,Bilgehan Sel,Faizan Ali,Jenny Bao,Hoagy Cunningham,Jerry Wei
Main category: cs.CL
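TL;DR: This paper proposes a new streaming probing objective that requires multiple evidence tokens to consistently support a prediction instead of relying on isolated high-scoring tokens, improving the robustness and accuracy of adversarial jailbreak detection for large language models in CBRN domains.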
Details
Motivation: In high-risk domains such as CBRN, existing streaming probes are prone to false alarms when sensitive terms appear in benign contexts, because they do not model contextual consistency. Method: A streaming probing objective that requires multiple evidence tokens to jointly support a prediction; probes on Attention, MLP, and residual-stream features are compared, and generalization to character-level obfuscation attacks is tested. Result: At a 1% false-positive rate, the true-positive rate improves by 35.55% relative to strong baselines; AUROC improves even from a near-saturated baseline of 97.40%, and stays above 98.85% on obfuscated attacks; probing Attention or MLP activations consistently outperforms residual-stream features. Conclusion: Streaming probes based on multi-evidence aggregation are more robust, generalize well, work plug-and-play, and suit real-time safety monitoring in high-risk settings. Abstract: Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from near-saturated baseline performance (AUROC = 97.40%). We also show that probing Attention or MLP activations consistently outperforms residual-stream features. Finally, even when adversarial fine-tuning enables novel character-level ciphers, harmful intent remains detectable: probes developed for the base LLMs can be applied "plug-and-play" to these obfuscated attacks, achieving an AUROC of over 98.85%.
[66] RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
Zihong Zhang,Zuchao Li,Lefei Zhang,Ping Wang,Hai Zhao
Main category: cs.CL
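TL;DR: RACER is a lightweight, training-free speculative decoding method that combines retrieved exact patterns with logit-driven future cues, accelerating large language model inference by more than 2x.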
Details
Motivation: Autoregressive decoding in large language models has high inference latency, and existing training-free speculative decoding methods have complementary limitations: retrieval-based methods fail when no match exists, and logits-based methods lack structural guidance. Method: RACER fuses retrieved exact patterns (reliable anchors) with logit-predicted future cues (flexible extrapolation) to build richer speculative drafts, with no additional training. Result: On Spec-Bench, HumanEval, and MGSM-ZH, RACER achieves more than 2x speedup over autoregressive decoding and outperforms other training-free speculative decoding methods. Conclusion: RACER is a scalable, plug-and-play solution for efficient LLM decoding that balances reliability and flexibility at zero training cost. Abstract: Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than $2\times$ speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at https://github.com/hkr04/RACER.
[67] Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
Danae Sánchez Villegas,Samuel Lewis-Lim,Nikolaos Aletras,Desmond Elliott
Main category: cs.CL
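TL;DR: This paper analyzes the reasoning dynamics of 18 vision-language models (VLMs) during Chain-of-Thought (CoT) reasoning and finds widespread "answer inertia", in which early predictions are reinforced rather than revised. Reasoning-trained models correct themselves more, but their gains depend on modality conditions. Misleading textual cues keep influencing decisions, and this influence is hard to detect reliably in the CoT, which weakens CoT as a guarantee of transparency for multimodal decisions.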
Details
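Motivation: Understand how vision-language models (VLMs) integrate visual and textual information during Chain-of-Thought (CoT) reasoning, and characterize their reasoning dynamics, modality reliance, and interpretability limits. Method: A systematic analysis of 18 VLMs covering instruction-tuned and reasoning-trained models: tracking confidence over the CoT, quantifying the corrective effect of reasoning, and evaluating the contribution of intermediate steps; controlled interventions with misleading textual cues; and analysis of how often and how explicitly the CoT mentions the cues and how consistent it is with the visual input. Result: Answer inertia is widespread; reasoning-trained models correct more but depend on modality conditions; misleading textual cues persistently influence decisions, and their detectability in the CoT varies with model type and with what is monitored; reasoning-trained models mention the cues explicitly more often, yet their CoTs can pass as visually grounded, whereas instruction-tuned models mention the cues less but their shorter CoTs expose inconsistencies with the visual input. Conclusion: CoT only partially reflects the multimodal decision process of VLMs and reveals modality reliance to a limited degree, which poses challenges for the transparency and safety of multimodal systems. Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. 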
Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
[68] Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Evaldas Vaiciukynas,Paulius Danenas,Linas Ablonskis,Algirdas Sukys,Edgaras Dambrauskas,Voldemaras Zitkus,Rita Butkiene,Rimantas Butleris
Main category: cs.CL
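TL;DR: This paper studies how well modern multilingual sentence embedding models detect hate speech in Lithuanian, Russian, and English. It introduces LtHate, a new Lithuanian dataset, and compares six embedding models within a unified framework across downstream models (an HBOS anomaly detector and a CatBoost classifier), with and without PCA dimensionality reduction. Supervised two-class classification substantially outperforms unsupervised anomaly detection, and PCA costs almost no performance in the supervised setting.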
Details
Motivation: Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian; how well existing models apply to these languages is unclear, which calls for systematic evaluation and high-quality local corpora. Method: Build LtHate, a new Lithuanian hate speech corpus; evaluate six multilingual sentence embedding models (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset; for each embedding, train a one-class HBOS anomaly detector and a two-class CatBoost classifier, each with and without PCA compression to 64 dimensions; run all experiments in a unified Python pipeline. Result: Two-class supervised models consistently and substantially outperform one-class anomaly detection. The best results are 80.96% accuracy and AUC 0.887 for Lithuanian (jina), 92.19% accuracy and AUC 0.978 for Russian (e5), and 77.21% accuracy and AUC 0.859 for English (e5 with PCA). PCA is nearly lossless in the supervised setting and slightly harmful in the unsupervised one. Conclusion: Modern multilingual sentence embeddings combined with gradient-boosted decision trees (such as CatBoost) provide a robust, practical soft-computing solution for multilingual hate speech detection; the supervised paradigm suits the task better, and dimensionality reduction such as PCA is highly viable in the supervised setting. Abstract: Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding, we train both a one-class HBOS anomaly detector and a two-class CatBoost classifier, with and without principal component analysis (PCA) compression to 64-dimensional feature vectors. Across all datasets, two-class supervised models consistently and substantially outperform one-class anomaly detection, with the best configurations achieving up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian (jina), 92.19% accuracy and AUC ROC of 0.978 in Russian (e5), and 77.21% accuracy and AUC ROC of 0.859 in English (e5 with PCA). PCA compression preserves almost all discriminative power in the supervised setting, while showing some negative impact for the unsupervised anomaly detection case.
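The one-class baseline mentioned in this abstract can be illustrated with a minimal sketch of a histogram-based outlier score (HBOS). This is a simplified variant that assumes equal-width bins, independent features, and relative bin frequencies; the function names `fit_hbos` and `hbos_score`, the bin count, and the toy vectors are illustrative assumptions, not the authors' pipeline. In the setup described above, the inputs would be sentence embeddings, optionally PCA-compressed to 64 dimensions.

```python
import math

def fit_hbos(train, n_bins=5):
    """Fit one equal-width histogram per feature on 'normal' training vectors."""
    model = []
    for j in range(len(train[0])):
        column = [row[j] for row in train]
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins or 1.0  # constant feature: fall back to width 1
        counts = [0] * n_bins
        for value in column:
            idx = min(int((value - lo) / width), n_bins - 1)
            counts[idx] += 1
        # Relative frequency per bin, floored so empty bins stay finite in log space.
        freqs = [max(c / len(column), 1e-6) for c in counts]
        model.append((lo, hi, width, freqs))
    return model

def hbos_score(model, x):
    """Sum of negative log bin frequencies; a higher score is more anomalous."""
    score = 0.0
    for (lo, hi, width, freqs), value in zip(model, x):
        if value < lo or value > hi:
            freq = 1e-6  # outside the range seen in training (assumed floor)
        else:
            idx = min(int((value - lo) / width), len(freqs) - 1)
            freq = freqs[idx]
        score -= math.log(freq)
    return score

# Toy 2-dimensional "embeddings" standing in for real sentence vectors.
train = [[0.0, 1.0], [0.1, 1.1], [0.2, 0.9], [0.1, 1.0], [0.15, 1.05], [0.05, 0.95]]
model = fit_hbos(train, n_bins=3)
inlier_score = hbos_score(model, [0.1, 1.0])
outlier_score = hbos_score(model, [5.0, -3.0])
```

Because the detector only sees one class, it can rank inputs by how unusual they are but never learns a decision boundary between hateful and neutral text, which is one plausible reading of why the two-class models do better here.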
These results demonstrate how modern multilingual sentence embeddings combined with gradient boosted decision trees provide robust soft-computing solutions for multilingual hate speech detection applications.
[69] IE as Cache: Information Extraction Enhanced Agentic Reasoning
Hang Lv,Sheng Liang,Hongchao Gu,Wei Guo,Defu Lian,Yong Liu,Hao Wang,Enhong Chen
Main category: cs.CL
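TL;DR: This paper proposes IE-as-Cache, a framework that treats information extraction (IE) as a reusable cognitive cache to strengthen agentic reasoning; experiments show significant gains in multi-step reasoning accuracy.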
Details
Motivation: Information extraction is traditionally treated as a terminal objective, and its results are often used in isolation rather than maintained and reused during multi-step reasoning; this work aims to overcome that limitation and make IE a dynamic cognitive resource that supports reasoning. Method: Inspired by hierarchical computer memory, the IE-as-Cache framework combines query-driven information extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise. Result: Experiments on multiple challenging benchmarks and across different large language models show significant improvements in reasoning accuracy. Conclusion: Information extraction can be effectively recast as a reusable cognitive cache, which improves reasoning performance and offers a new paradigm for integrating IE deeply into downstream tasks. Abstract: Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose IE-as-Cache, a framework that repurposes IE as a cognitive cache to enhance agentic reasoning. Drawing inspiration from hierarchical computer memory, our approach combines query-driven extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise. Experiments on challenging benchmarks across diverse LLMs demonstrate significant improvements in reasoning accuracy, indicating that IE can be effectively repurposed as a reusable cognitive resource and offering a promising direction for future research on downstream uses of IE.
[70] XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics
Jingxuan Liu,Zhi Qu,Jin Tei,Hidetaka Kamigaito,Lemao Liu,Taro Watanabe
Main category: cs.CL
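TL;DR: This paper presents XQ-MEval, a dataset for systematically evaluating automatic metrics for multilingual machine translation. It exposes and verifies cross-lingual scoring bias and proposes a normalization strategy derived from the dataset to make multilingual evaluation fairer and more reliable.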
Details
Motivation: Existing automatic metrics may suffer from cross-lingual scoring bias in multilingual settings (translations of equal quality receive different scores in different languages), but no benchmark with parallel-quality annotations exists for studying the problem systematically. Method: Build XQ-MEval, a semi-automatic multilingual benchmark: MQM-defined errors are injected automatically into high-quality translations, native speakers filter them, and errors are merged to produce pseudo translations of controllable quality, which form source-translation-reference triplets; nine mainstream metrics are evaluated on this basis, and a normalization strategy that aligns score distributions across languages is proposed. Result: Experiments show that averaging metric scores across languages is inconsistent with human judgment, providing the first empirical evidence of cross-lingual scoring bias; the proposed normalization strategy improves the fairness and reliability of multilingual evaluation. Conclusion: Cross-lingual scoring bias is real and significant; XQ-MEval provides a reliable benchmark for multilingual translation evaluation, and the proposed normalization method is broadly applicable to multilingual automatic evaluation practice. Abstract: Automatic evaluation metrics are essential for building multilingual translation systems. The common practice of evaluating these systems is averaging metric scores across languages, yet this is questionable, since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, have native speakers filter them for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are then paired with corresponding sources and references to form triplets used in assessing the qualities of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal the inconsistency between averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.
[71] Text2Arch: A Dataset for Generating Scientific Architecture Diagrams from Natural Language Descriptions
Shivank Garg,Sankalp Mittal,Manish Gupta
Main category: cs.CL
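TL;DR: This paper proposes a method that uses language models to generate high-fidelity scientific architecture diagrams from text. It builds Text2Arch, an open dataset of images, textual descriptions, and DOT code, and fine-tunes small language models on it that match the in-context learning performance of GPT-4o.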
Details
Motivation: Describing complex system designs or scientific processes in text alone is inefficient and prone to ambiguity. A system that converts text into architecture diagrams with high fidelity is needed, but no large-scale public dataset or effective open model exists. Method: Build Text2Arch, a comprehensive dataset of scientific architecture images, their textual descriptions, and DOT code; fine-tune several small language models on it and run in-context learning experiments with GPT-4o. Result: Text2Arch models significantly outperform baselines such as DiagramAgent on architecture diagram generation and perform on par with GPT-4o in-context learning. Conclusion: Fine-tuning small language models on a dedicated dataset enables efficient, high-fidelity text-to-architecture-diagram generation, advancing AI-driven visual modeling and automated educational content creation. Abstract: Communicating complex system designs or scientific processes through text alone is inefficient and prone to ambiguity. A system that automatically generates scientific architecture diagrams from text with high semantic fidelity can be useful in multiple applications like enterprise architecture visualization, AI-driven software design, and educational content creation. Hence, in this paper, we focus on leveraging language models to perform semantic understanding of the input text description to generate intermediate code that can be processed to generate high-fidelity architecture diagrams. Unfortunately, no clean large-scale open-access dataset exists, implying a lack of effective open models for this task. Hence, we contribute a comprehensive dataset, Text2Arch, comprising scientific architecture images, their corresponding textual descriptions, and associated DOT code representations. Leveraging this resource, we fine-tune a suite of small language models, and also perform in-context learning using GPT-4o. Through extensive experimentation, we show that Text2Arch models significantly outperform existing baseline models like DiagramAgent and perform at par with in-context learning-based generations from GPT-4o. We make the code, data and models publicly available.
[72] Explain the Flag: Contextualizing Hate Speech Beyond Censorship
Jason Liartis,Eirini Kaldeli,Lambrini Gyftokosta,Eleftherios Chelioudakis,Orfeas Menis Mastromichalakis
Main category: cs.CL
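TL;DR: This paper proposes a hybrid approach that combines large language models (LLMs) with three newly built vocabularies (English, French, and Greek) to detect hate speech and explain the detection, balancing accuracy with transparency.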
Details
Motivation: Existing automated hate speech detection systems mostly focus on content removal and lack transparency and explainability, which makes it hard to balance content governance with freedom of expression. Method: Build a hybrid system with two pipelines: one detects and disambiguates offensive terms using human-verified multilingual vocabularies; the other uses an LLM as a context-aware evaluator to identify group-targeting content; the outputs are fused into grounded explanations. Result: In human evaluation, the hybrid approach outperforms LLM-only baselines in both detection accuracy and explanation quality. Conclusion: A hybrid paradigm that combines curated, rule-based vocabulary resources with LLM contextual understanding enables more reliable, transparent, and explainable multilingual hate speech detection. Abstract: Hate, derogatory, and offensive speech remains a persistent challenge on online platforms and in public discourse. While automated detection systems are widely used, most focus on censorship or removal, raising concerns about transparency and freedom of expression, and limiting opportunities to explain why content is harmful. To address these issues, explanatory approaches have emerged as a promising solution, aiming to make hate speech detection more transparent, accountable, and informative. In this paper, we present a hybrid approach that combines Large Language Models (LLMs) with three newly created and curated vocabularies to detect and explain hate speech in English, French, and Greek. Our system captures both inherently derogatory expressions tied to identity characteristics and direct group-targeted content through two complementary pipelines: one that detects and disambiguates problematic terms using the curated vocabularies, and one that leverages LLMs as context-aware evaluators of group-targeting content. The outputs are fused into grounded explanations that clarify why content is flagged. Human evaluation shows that our hybrid approach is accurate, with high-quality explanations, outperforming LLM-only baselines.
[73] IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation
Haozhi Fan,Jinhao Duan,Kaidi Xu
Main category: cs.CL
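TL;DR: This paper proposes Interrogative Uncertainty Quantification (IUQ), a framework that quantifies uncertainty in long-form large language model (LLM) generation through inter-sample consistency and intra-sample faithfulness, and that provides reliable claim-level uncertainty estimates, particularly when outputs are semantically coherent but factually wrong.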
Details
Motivation: Existing methods work well for short or constrained outputs but struggle with the long-form, free-form generation that real applications need; LLMs often produce content that is semantically coherent yet factually wrong, and the multifaceted semantics and complex linguistic structure make uncertainty hard to quantify. Method: The Interrogative Uncertainty Quantification (IUQ) framework follows an interrogate-then-respond paradigm and combines inter-sample consistency with intra-sample faithfulness to quantify uncertainty in long-form generation. Result: In experiments across multiple model families and sizes, IUQ significantly outperforms existing baselines on two mainstream long-form generation datasets. Conclusion: IUQ offers an interpretable, reliable route to uncertainty quantification for long-form, free-form LLM generation that covers both semantic and factual trustworthiness. Abstract: Despite the rapid advancement of Large Language Models (LLMs), uncertainty quantification in LLM generation is a persistent challenge. Although recent approaches have achieved strong performance by restricting LLMs to produce short or constrained answer sets, many real-world applications require long-form and free-form text generation. A key difficulty in this setting is that LLMs often produce responses that are semantically coherent yet factually inaccurate, while the underlying semantics are multifaceted and the linguistic structure is complex. To tackle this challenge, this paper introduces Interrogative Uncertainty Quantification (IUQ), a novel framework that leverages inter-sample consistency and intra-sample faithfulness to quantify the uncertainty in long-form LLM outputs. By utilizing an interrogate-then-respond paradigm, our method provides reliable measures of claim-level uncertainty and the model's faithfulness. Experimental results across diverse model families and model sizes demonstrate the superior performance of IUQ on two widely used long-form generation datasets. The code is available at https://github.com/louisfanhz/IUQ.
[74] Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling
Zhijun Guo,Alvina Lai,Emmanouil Korakas,Aristeidis Vagenas,Irshad Ahamed,Christo Albor,Hengrui Zhang,Justin Healy,Kezhi Li
Main category: cs.CL
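TL;DR: This study develops and evaluates a retrieval-grounded large language model (LLM) conversational agent (CA) that helps people with diabetes understand continuous glucose monitoring (CGM) data and prepare for consultations. The CA's responses were rated significantly higher in quality than clinicians' responses (especially for empathy and actionability) with comparable safety, but the findings support only adjunct use, not autonomous decision-making.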
Details
Motivation: Interpreting CGM data is central to diabetes management, but clinician explanations are time-intensive, and efficient, empathetic, scalable tools are lacking; empirical evidence for retrieval-grounded LLM systems in CGM-informed counseling is limited. Method: Build a retrieval-grounded LLM conversational agent that generates non-individualized, plain-language CGM explanations and counseling support; construct 12 CGM cases from public datasets; have 6 senior UK diabetes clinicians write reference answers; in a blinded multi-rater design, 3 clinicians independently rate each response on 6 dimensions (144 CA responses and 144 clinician responses); primary analyses use linear mixed-effects models. Result: CA responses score significantly higher in overall quality than clinician responses (mean 4.37 vs 3.58, P<.001), with the largest differences in empathy (+1.062) and actionability (+0.992); major safety concerns are rare and comparable in both groups (0.7% each). Conclusion: Retrieval-grounded LLM systems can be a valuable adjunct for CGM review, patient education, and preconsultation preparation, but they cannot replace clinical judgment and are not suitable for unsupervised real-world care or autonomous treatment decisions. Abstract: Continuous glucose monitoring (CGM) is central to diabetes care, but explaining CGM patterns clearly and empathetically remains time-intensive. Evidence for retrieval-grounded large language model (LLM) systems in CGM-informed counseling remains limited. This study evaluated whether a retrieval-grounded LLM-based conversational agent (CA) could support patient understanding of CGM data and preparation for routine diabetes consultations. We developed a retrieval-grounded LLM-based CA for CGM interpretation and diabetes counseling support. The system generated plain-language responses while avoiding individualized therapeutic advice. Twelve CGM-informed cases were constructed from publicly available datasets. Between Oct 2025 and Feb 2026, 6 senior UK diabetes clinicians each reviewed 2 assigned cases and answered 24 questions. In a blinded multi-rater evaluation, each CA-generated and clinician-authored response was independently rated by 3 clinicians on 6 quality dimensions. Safety flags and perceived source labels were also recorded. Primary analyses used linear mixed-effects models. A total of 288 unique responses (144 CA and 144 clinician) generated 864 ratings. The CA received higher quality scores than clinician responses (mean 4.37 vs 3.58), with an estimated mean difference of 0.782 points (95% CI 0.692-0.872; P<.001). The largest differences were for empathy (1.062, 95% CI 0.948-1.177) and actionability (0.992, 95% CI 0.877-1.106). Safety flag distributions were similar, with major concerns rare in both groups (3/432, 0.7% each).
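The counts reported in this abstract fit together arithmetically, as the short check below suggests. It assumes one CA response per clinician-answered question, which is an inference from the 144/144 split rather than something the abstract states; the variable names are illustrative.

```python
clinicians = 6
questions_per_clinician = 24
raters_per_response = 3

clinician_responses = clinicians * questions_per_clinician  # 144
ca_responses = clinician_responses  # assumed: one CA answer per question
total_responses = clinician_responses + ca_responses        # 288
total_ratings = total_responses * raters_per_response       # 864
ratings_per_group = total_ratings // 2                      # 432

# Major safety concerns: 3 flagged ratings out of 432 in each group.
major_concern_pct = round(3 / ratings_per_group * 100, 1)

# Raw gap between the reported means; the mixed-effects estimate (0.782)
# differs slightly because it adjusts for rater and case effects.
raw_mean_gap = round(4.37 - 3.58, 2)
```

The small difference between the raw gap and the model-based estimate is expected when ratings are clustered by rater and case.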
Retrieval-grounded LLM systems may have value as adjunct tools for CGM review, patient education, and preconsultation preparation. However, these findings do not support autonomous therapeutic decision-making or unsupervised real-world use.
[75] DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering
Neha Srikanth,Jordan Boyd-Graber,Rachel Rudinger
Main category: cs.CL
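TL;DR: DiscoTrace is a method for identifying the rhetorical strategies that answerers use when responding to information-seeking questions. It finds that human communities construct answers in rhetorically diverse ways, whereas large language models (LLMs) lack this diversity, favor breadth over depth, and address question interpretations that human answerers often choose not to address.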
Details
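Motivation: To understand the diverse rhetorical strategies humans use in question answering, in order to improve the pragmatic competence of LLMs on QA tasks. Method: DiscoTrace represents an answer as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on top of rhetorical structure theory (RST) parses; the method is applied to answers from nine human communities and compared against LLM-generated answers. Result: Human communities differ markedly in their rhetorical preferences for answer construction; LLMs lack rhetorical diversity and fail to reproduce it even when prompted to mimic specific community guidelines; LLMs tend to cover more interpretations of a question (breadth), whereas humans selectively leave some interpretations unaddressed. Conclusion: The rhetorical strategies LLMs currently use in QA are overly uniform and detached from context; models should draw on human rhetorical diversity to become more pragmatically adaptive. Abstract: We introduce DiscoTrace, a method to identify the rhetorical strategies that answerers use when responding to information-seeking questions. DiscoTrace represents answers as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on top of rhetorical structure theory parses. Applying DiscoTrace to answers from nine different human communities reveals that communities have diverse preferences for answer construction. In contrast, LLMs do not exhibit rhetorical diversity in their answers, even when prompted to mimic specific human community answering guidelines. LLMs also systematically opt for breadth, addressing interpretations of questions that human answerers choose not to address. Our findings can guide the development of pragmatic LLM answerers that consider a range of strategies informed by context in QA.
[76] QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies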
Alexey Khoroshilov,Alexey Chernysh,Orkhan Ekhtibarov,Nini Kamkia,Dmitry Zmitrovich
Main category: cs.CL
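TL;DR: This paper presents QuantCode-Bench, a benchmark for systematically evaluating how well large language models generate trading strategies for the Backtrader framework from English descriptions. The benchmark contains 400 tasks from multiple sources; evaluation covers syntactic correctness, backtest execution, the presence of actual trades, and semantic alignment. The main bottlenecks of current models are financial logic modeling, API usage, and consistency with task semantics rather than syntax errors.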
Details
Motivation: LLMs perform well on general-purpose programming tasks, but their ability to generate executable algorithmic trading strategies is unclear. Trading-strategy generation requires financial domain logic, knowledge of a specialized API, and code that is not only syntactically correct but also produces actual trades on historical data, so a systematic benchmark is needed. Method: Build QuantCode-Bench with 400 trading-strategy generation tasks of varying difficulty (from sources including Reddit and TradingView); design a multi-stage evaluation pipeline that checks syntactic correctness, successful backtest execution, whether actual trades occur, and semantic alignment as assessed by an LLM judge; compare state-of-the-art models in a single-turn setting and an agentic multi-turn setting with feedback. Result: Current models fail mainly not because of syntax errors but because they model trading logic inaccurately, misuse the API, or drift from task semantics; multi-turn interaction improves performance, but semantic and behavioral alignment remains the core challenge. Conclusion: Trading-strategy generation is a distinct class of domain-specific code generation in which success depends not only on technical correctness but, more importantly, on strict agreement among the natural-language description, the financial logic, and the strategy's actual behavior on data. Abstract: Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs in generating strategies for the Backtrader framework from textual descriptions in English. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Evaluation is conducted through a multi-stage pipeline that checks syntactic correctness, successful backtest execution, the presence of trades, and semantic alignment with the task description using an LLM judge. We compare state-of-the-art models in two settings: single-turn, where the strategy must be generated correctly on the first attempt, and agentic multi-turn, where the model receives iterative feedback and may repair its errors. We analyze the failure modes across different stages of the pipeline and show that the main limitations of current models are not related to syntax, but rather to the correct operationalization of trading logic, proper API usage, and adherence to task semantics.
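The multi-stage pipeline described in this abstract can be sketched as a chain of checks that stops at the first failure. This is a minimal illustration under stated assumptions, not the benchmark's implementation: `run_backtest` and `judge_alignment` are hypothetical callables standing in for a Backtrader run (returning the number of trades) and an LLM judge (returning a boolean), and only the syntax check uses a real library call (`ast.parse`).

```python
import ast
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class StageResult:
    passed_stages: List[str] = field(default_factory=list)
    failed_stage: Optional[str] = None  # None means every stage passed

def evaluate_strategy(
    code: str,
    run_backtest: Callable[[str], int],      # hypothetical: returns trade count
    judge_alignment: Callable[[str], bool],  # hypothetical: LLM-judge verdict
) -> StageResult:
    """Run the four checks in order and record where a strategy fails."""
    result = StageResult()

    # Stage 1: the generated code must at least parse as Python.
    try:
        ast.parse(code)
    except SyntaxError:
        result.failed_stage = "syntax"
        return result
    result.passed_stages.append("syntax")

    # Stage 2: the backtest must run to completion without raising.
    try:
        n_trades = run_backtest(code)
    except Exception:
        result.failed_stage = "backtest"
        return result
    result.passed_stages.append("backtest")

    # Stage 3: a strategy that never trades is counted as a failure.
    if n_trades <= 0:
        result.failed_stage = "trades"
        return result
    result.passed_stages.append("trades")

    # Stage 4: the behavior must match the task description.
    if not judge_alignment(code):
        result.failed_stage = "alignment"
        return result
    result.passed_stages.append("alignment")
    return result

# Usage with stub callables: valid code that runs but places no trades.
no_trades = evaluate_strategy("x = 1", lambda c: 0, lambda c: True)
```

Recording the first failing stage, rather than a single pass/fail bit, is what makes it possible to say that failures cluster after the syntax stage.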
These findings suggest that trading strategy generation constitutes a distinct class of domain-specific code generation tasks in which success requires not only technical correctness, but also alignment between natural-language descriptions, financial logic, and the observable behavior of the strategy on data.
[77] Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models
Zihao Xu,John Harvill,Ziwei Fan,Yizhou Sun,Hao Ding,Hao Wang
Main category: cs.CL
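TL;DR: This paper proposes K-Token Merging, a lightweight method that compresses long prompts in the latent embedding space by merging every K consecutive token embeddings into one, substantially reducing compute and memory cost while preserving generation ability.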
Details
Motivation: Existing prompt-compression methods operate mainly in token space and overlook redundancy and inefficiency in the latent embedding space, while LLMs incur high compute and memory costs on long prompts because self-attention scales quadratically. Method: The K-Token Merging framework merges every K consecutive token embeddings into a single embedding in the latent space using a lightweight encoder; the compressed sequence is fed to a LoRA-adapted LLM, and decoding still uses the original vocabulary. Result: On Textualized Tree (structural reasoning), Amazon Reviews (sentiment classification), and CommitPackFT (code editing), the method reduces input length by up to 75% with minimal performance loss and lies on the Pareto frontier of performance versus compression. Conclusion: K-Token Merging is an efficient, general latent-space prompt-compression method that greatly shortens inputs while maintaining generation quality, offering a practical route to long-context inference. Abstract: Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
[78] Fabricator or dynamic translator?
Lisa Vasileva,Karin Sim
Main category: cs.CL
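TL;DR: This paper examines overgeneration by large language models (LLMs) in machine translation, analyzes its types (such as self-explanations, risky confabulations, and appropriate explanations) and detection methods, and reports strategies and results from a commercial setting.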
Details
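Motivation: LLMs perform well at machine translation, but their generative nature makes them prone to various kinds of overgeneration that differ from the "neurobabble" seen in traditional neural machine translation (NMT) and need to be identified and classified systematically. Method: Explore and compare several strategies for detecting and classifying overgeneration in LLM translation, based on empirical work in a commercial setting. Result: Detection methods that distinguish overgeneration types (such as self-explanations, risky confabulations, and helpful explanations) are presented, along with results from a real commercial setting. Conclusion: Overgeneration in LLM translation is diverse and context-dependent; effective detection strategies must balance accuracy and practicality to support reliable deployment of LLMs as human-like translators. Abstract: LLMs are proving to be adept at machine translation, although, due to their generative nature, they may at times overgenerate in various ways. These overgenerations are different from the neurobabble seen in NMT and range from LLM self-explanations, to risky confabulations, to appropriate explanations, where the LLM is able to act as a human translator would, enabling greater comprehension for the target audience. Detecting and determining the exact nature of the overgenerations is a challenging task. We detail different strategies we have explored for our work in a commercial setting, and present our results.
[79] MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events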
Raunak Agarwal,Markus Wenzel,Simon Baur,Jonas Zimmer,George Harvey,Jackie Ma
Main category: cs.CL
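TL;DR: This paper presents MADE, a continuously updated multi-label text classification (MLTC) benchmark derived from medical device adverse event reports. It targets problems in high-stakes medical settings such as the difficulty of evaluating uncertainty quantification (UQ), data contamination, and long-tailed label distributions, and it systematically evaluates the accuracy and UQ of more than 20 models under different training paradigms.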
Details
Motivation: Existing MLTC benchmarks are increasingly saturated and prone to training data contamination, which makes it hard to separate genuine reasoning from memorization; at the same time, high-stakes medical settings need models with both strong predictive performance and reliable uncertainty quantification (UQ). Method: Build MADE, a continuously updated benchmark with strict temporal splits and a hierarchical, long-tailed label distribution; systematically evaluate more than 20 encoder and decoder models under fine-tuning and few-shot settings (including instruction-tuned and reasoning variants); compare entropy-based, consistency-based, and self-verbalized UQ methods. Result: Smaller discriminatively fine-tuned decoders strike the best balance between head-to-tail accuracy and UQ; generative fine-tuning gives the most reliable UQ; large reasoning models improve performance on rare labels but show surprisingly weak UQ; self-verbalized confidence is unreliable. Conclusion: MADE provides a contamination-resistant, reproducible, UQ-oriented benchmark for medical MLTC; model selection must weigh accuracy against UQ reliability rather than rely on scale or self-verbalized outputs; UQ evaluation should be a standard step in high-stakes ML applications. Abstract: Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty.
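The two families of UQ signals named in this abstract, entropy-based and consistency-based, can be illustrated for the multi-label case with a short sketch. The formulas below are generic textbook choices (mean binary entropy over labels, and one minus the mean pairwise Jaccard overlap between repeated predictions); they are assumptions for illustration and not necessarily the scores used in MADE.

```python
import math
from itertools import combinations

def binary_entropy(p):
    """Entropy in bits of one label's predicted probability."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def entropy_uncertainty(label_probs):
    """Mean per-label entropy: 0 is fully confident, 1 is maximally unsure."""
    return sum(binary_entropy(p) for p in label_probs) / len(label_probs)

def consistency_uncertainty(sampled_label_sets):
    """1 minus the mean pairwise Jaccard overlap between sampled predictions."""
    pairs = list(combinations(sampled_label_sets, 2))
    if not pairs:
        return 0.0  # a single sample carries no disagreement signal

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Entropy signal: one coin-flip label and one certain label.
u_entropy = entropy_uncertainty([0.5, 1.0])
# Consistency signal: three generations (hypothetical label names) that agree.
u_consistent = consistency_uncertainty([{"burn", "leak"}] * 3)
```

The entropy score needs per-label probabilities, which discriminative heads expose directly, whereas the consistency score only needs repeated generations and so also applies to API-only models.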
Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.
[80] From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
Kiran Purohit,Ramasuri Narayanam,Soumyabrata Pal
Main category: cs.CL
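TL;DR: This paper proposes SpecGuard, a speculative decoding framework that verifies at the step level using model-internal signals. Two lightweight signals, one attention-based and one log-probability-based, jointly decide whether each step is accepted, improving accuracy while reducing latency.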
Details
Motivation: Existing speculative decoding methods suffer from error propagation, and relying on external reward models adds latency and compute overhead and limits generalizability. Method: At each step, SpecGuard samples multiple draft candidates, selects the most consistent one, and verifies it with two model-internal signals: (i) an attention-based grounding score that measures reliance on the input and previously accepted steps, and (ii) a log-probability-based confidence score. Together they decide whether the step is accepted or recomputed with the target model. Result: Across several reasoning benchmarks, SpecGuard improves accuracy by 3.6% and reduces latency by about 11% relative to standard speculative decoding, and it also outperforms reward-guided speculative decoding. Conclusion: By using only model-internal signals, SpecGuard achieves efficient, accurate, and general step-level verification and offers a better trade-off for speculative decoding. Abstract: Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed using the target, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.
cs.CV [Back]
[81] QualiaNet: An Experience-Before-Inference Network
Paul Linton
Main category: cs.CV
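TL;DR: This paper proposes QualiaNet, a two-stage computational model of human 3D vision. The first stage (Experience Module) extracts stereo depth relative to fixation; the second stage (Inference Module) uses a CNN to estimate distance from disparity gradients, exploiting the natural scene statistic that near scenes have pronounced disparity gradients while far scenes are comparatively flat.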
Details
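Motivation: Although human stereo vision does not itself provide absolute distance information, it still affects our inferences about visual scale; the author aims to explain this apparent paradox and explore the natural scene statistics behind it. Method: Build the two-stage QualiaNet model: first generate disparity maps that simulate human stereo experience (Experience Module), then feed them to a CNN trained to estimate distance (Inference Module); the core hypothesis is that the model exploits the natural scene statistic that near scenes have strong disparity gradients and far scenes weak ones. Result: QualiaNet recovers distance from disparity gradients alone, supporting the hypothesis and the approach. Conclusion: The Inference Module of human 3D vision may rely on learning the statistical relationship between disparity gradients and scene distance in natural scenes rather than on decoding raw stereo signals directly; this offers a new perspective on how subjective visual experience shapes objective perception. Abstract: Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although our experience of stereo vision does not provide us with distance information, it does affect our inferences about visual scale. We propose that the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.
[82] HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds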
Team HY-World,Chenjie Cao,Xuhui Zuo,Zhenwei Wang,Yisu Zhang,Junta Wu,Zhenyang Liu,Yuning Gong,Yang Liu,Bo Yuan,Chao Zhang,Coopers Li,Dongyuan Guo,Fan Yang,Haiyu Zhang,Hang Cao,Jianchen Zhu,Jiaxin Lin,Jie Xiao,Jihong Zhang,Junlin Yu,Lei Wang,Lifu Wang,Lilin Wang,Linus,Minghui Chen,Peng He,Penghao Zhao,Qi Chen,Rui Chen,Rui Shao,Sicong Liu,Wangchen Qin,Xiaochuan Niu,Xiang Yuan,Yi Sun,Yifei Tang,Yifu Sun,Yihang Lian,Yonghao Tan,Yuhong Liu,Yuyang Yin,Zhiyuan Min,Tengfei Wang,Chunchao Guo
Main category: cs.CV
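TL;DR: HY-World 2.0 is a world model framework that accepts multi-modal inputs (text, single image, multiple images, video) and generates high-quality, navigable 3D Gaussian Splatting (3DGS) scenes. It includes several technical upgrades and new modules (such as the WorldLens rendering platform), achieves state-of-the-art performance among open-source approaches, is comparable to the closed-source model Marble, and is fully open-sourced.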
Details
Motivation: To improve multi-modal 3D world modeling, address key challenges in 3D scene generation from single- and multi-view inputs (fidelity, consistency, navigability, and interactivity), and advance the open-source ecosystem for 3D world models. Method: A four-stage generation pipeline: a) panorama generation with HY-Pano 2.0; b) trajectory planning with WorldNav; c) view expansion with WorldStereo 2.0, based on keyframes and consistent memory; d) universal 3D prediction with WorldMirror 2.0, with a refined architecture and learning strategy; plus the new WorldLens high-performance 3DGS rendering platform, which supports IBL lighting, collision detection, and training-rendering co-design. Result: State-of-the-art performance among open-source approaches on several benchmarks, comparable to the closed-source model Marble; support for high-fidelity 3DGS generation from text or images, reconstruction from multi-view images or videos, and real-time interactive exploration (with characters); all models, code, and technical details are open-sourced. Conclusion: HY-World 2.0 marks an important advance in multi-modal 3D world modeling; through systematic architectural innovation and engineering optimization it unifies generation quality, generalization, and usability, providing a solid foundation for open, reproducible research on 3D foundation models. Abstract: We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. In addition, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble.
We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.
[83] Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
Ahmed Bourouis,Savas Ozkan,Andrea Maracani,Yi-Zhe Song,Mete Ozay
Main category: cs.CV
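TL;DR: This paper proposes a new method for generating geometrically consistent multi-view scenes from a single freehand sketch. By building a new dataset and introducing the geometry-aware attention adapters CA3 and the sparse correspondence supervision loss CSL, it achieves high-quality, consistent, and fast multi-view synthesis without reference images or iterative optimization.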
Details
Motivation: Existing methods cannot handle freehand sketches as input, which carry very little geometric information and contain spatial distortions; prior multi-view generation relies on photographs or text, while sketch-to-3D methods need multiple views or per-scene optimization, so an end-to-end, consistent sketch-to-multi-view solution is missing. Method: Three contributions: (i) an automatically generated and filtered sketch-to-multiview dataset of about 9,000 samples; (ii) Parallel Camera-Aware Attention Adapters (CA3), which inject camera geometry priors into the video transformer; (iii) a Sparse Correspondence Supervision Loss (CSL) derived from SfM reconstructions to strengthen cross-view consistency. The framework generates all views in a single denoising process. Result: Compared with two-stage state-of-the-art baselines, FID improves by over 60%, Corr-Acc (the geometric consistency metric) improves by 23%, and inference is up to 3.7× faster, with no reference images, iterative refinement, or per-scene tuning. Conclusion: This work is the first to generate geometrically consistent multi-view content end to end from a single freehand sketch; it shows that explicit geometric inductive biases and a new supervision signal help with the highly ill-posed problem of sketch understanding, and it opens a new route to sketch-driven 3D content creation. Abstract: We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of ~9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization.
Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7× inference speedup.
[84] DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines
Gabriel Pimenta de Freitas Cardoso,Caio Lucas da Silva Chacon,Jonas Felipe da Fonseca Oliveira,Paulo Henrique de Medeiros Araujo
Main category: cs.CV
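TL;DR: This paper presents DharmaOCR Full and Lite, two small language models specialized for structured OCR, together with DharmaOCR-Benchmark, a benchmark covering several document types, and applies Direct Preference Optimization (DPO) to OCR for the first time to suppress text degeneration.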
Details
Motivation: Address poor generation stability, severe text degeneration, high inference cost, and the lack of a unified evaluation standard in OCR, with particular attention to how degeneration harms deployment performance (response time, throughput, compute cost).
Method: Proposes two SSLMs, DharmaOCR Full (7B) and Lite (3B); builds DharmaOCR-Benchmark with a unified evaluation protocol covering fidelity, structural accuracy, and degeneration rate (as a first-class metric); applies DPO to OCR for the first time, using degenerate samples as the rejected examples, combined with SFT to enforce JSON-structured output; uses AWQ quantization to lower inference cost.
Result: DharmaOCR Full and Lite reach extraction quality scores of 0.925 and 0.911 on DharmaOCR-Benchmark, with degeneration rates as low as 0.40% and 0.20%; DPO lowers the degeneration rate by up to 87.6%; AWQ quantization cuts per-page cost by up to 22% with negligible quality loss.
Conclusion: The DharmaOCR models clearly beat existing open-source and commercial OCR solutions on the quality-cost trade-off. Treating degeneration rate explicitly as a key metric and optimizing it with DPO is an important methodological contribution for OCR.
Abstract: This manuscript introduces DharmaOCR Full and Lite, a pair of specialized small language models (SSLMs) for structured OCR that jointly optimize transcription quality, generation stability, and inference cost. It also presents DharmaOCR-Benchmark, a benchmark that covers printed, handwritten, and legal/administrative documents, and proposes a unified evaluation protocol that measures fidelity and structure while explicitly tracking text degeneration as a first-class benchmark metric (alongside unit cost). Beyond reporting degeneration rates, the manuscript empirically shows degeneration is not merely a quality failure, since it materially worsens production performance by increasing response time, reducing throughput, and inflating computational cost due to abnormally long generations. To the best of the authors' knowledge, as a methodological contribution, this is the first application of Direct Preference Optimization (DPO) for OCR, explicitly using degenerate generations as rejected examples to penalize looping behavior. Combined with Supervised Fine-Tuning (SFT) for enforcing a strict JSON schema (header, margin, footer, and text), DPO consistently reduces degeneration rate across model families (up to 87.6% relative) while preserving or improving extraction quality.
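The preference step described above can be illustrated with the standard DPO objective for a single preference pair. This is a minimal sketch rather than the authors' training code: the function name `dpo_loss`, the default `beta`, and any log-probabilities passed in are hypothetical, and the "rejected" response stands in for a degenerate (looping) transcription.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed token log-probability of a full response
    under the trainable policy or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    # How much more (or less) likely each response became relative to the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls when the policy raises the likelihood of the clean transcription relative to the reference model and lowers that of the looping one, which is the behavior the abstract attributes to DPO.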
The resulting models, DharmaOCR Full (7B) and DharmaOCR Lite (3B), set a new state of the art on DharmaOCR-Benchmark, outperforming every evaluated open-source and commercial baseline in extraction quality, with scores of 0.925 and 0.911 and degeneration rates of 0.40% and 0.20%. AWQ quantization reduced per-page cost by up to 22% with negligible quality loss, enabling a strong quality-cost trade-off in comparison to proprietary OCR APIs and open-source alternatives.
[85] Interpretable Human Activity Recognition for Subtle Robbery Detection in Surveillance Videos
Bryan Jhoan Cazáres Leyva,Ulises Gachuz Davila,José Juan González Fonseca,Juan Irving Vasquez,Vanessa A. Camacho-Vázquez,Sergio Isahí Garrido-Castañeda
Main category: cs.CV
TL;DR: This paper proposes a real-time, edge-deployable, pose-driven method for detecting non-violent snatch-and-run street robberies in surveillance video using YOLO-based pose estimation, kinematic/interaction features, and a Random Forest classifier with temporal filtering.
Details
Motivation: Non-violent street robberies (snatch-and-run) are hard to detect automatically due to their brevity, subtlety, and visual similarity to normal human interactions in unconstrained surveillance footage.
Method: A hybrid, pose-driven approach: (1) a YOLO-based pose estimator extracts body keypoints per tracked person; (2) kinematic features (e.g., hand speed, arm extension) and interaction features (e.g., proximity, relative motion) are computed for aggressor-victim pairs; (3) a Random Forest trained on these features classifies each frame; (4) a temporal hysteresis filter stabilizes predictions and reduces false alarms.
Result: The method shows promising generalization on both staged and internet-collected disjoint test sets across varied scenes and camera viewpoints; the full pipeline runs in real time on an NVIDIA Jetson Nano.
Conclusion: The proposed system enables proactive, on-device detection of snatch-and-run events, demonstrating feasibility for real-world edge deployment in surveillance systems.
Abstract: Non-violent street robberies (snatch-and-run) are difficult to detect automatically because they are brief, subtle, and often indistinguishable from benign human interactions in unconstrained surveillance footage. This paper presents a hybrid, pose-driven approach for detecting snatch-and-run events that combines real-time perception with an interpretable classification stage suitable for edge deployment. The system uses a YOLO-based pose estimator to extract body keypoints for each tracked person and computes kinematic and interaction features describing hand speed, arm extension, proximity, and relative motion between an aggressor-victim pair. A Random Forest classifier is trained on these descriptors, and a temporal hysteresis filter is applied to stabilize frame-level predictions and reduce spurious alarms.
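The temporal hysteresis filter mentioned above could look roughly like the sketch below. The abstract does not give the filter's design, so the two thresholds, the `min_frames` requirement, and the example scores are all illustrative assumptions.

```python
def hysteresis_filter(probs, on_threshold=0.7, off_threshold=0.4, min_frames=3):
    """Turn noisy per-frame probabilities into a stable alarm signal.

    The alarm switches on only after `min_frames` consecutive frames at or
    above `on_threshold`, and switches off only after `min_frames`
    consecutive frames at or below `off_threshold`. All three values are
    assumed for illustration.
    """
    alarm = False
    streak = 0
    states = []
    for p in probs:
        if not alarm:
            # Count consecutive high-confidence frames while the alarm is off.
            streak = streak + 1 if p >= on_threshold else 0
            if streak >= min_frames:
                alarm, streak = True, 0
        else:
            # Count consecutive low-confidence frames while the alarm is on.
            streak = streak + 1 if p <= off_threshold else 0
            if streak >= min_frames:
                alarm, streak = False, 0
        states.append(alarm)
    return states


# Hypothetical frame-level classifier scores: one isolated spike, then a sustained event.
scores = [0.1, 0.9, 0.2, 0.8, 0.85, 0.9, 0.6, 0.5, 0.3, 0.2, 0.1]
print(hysteresis_filter(scores))
```

Using a higher threshold to switch on than to switch off keeps a single spurious frame from raising an alarm and keeps a brief dip from cancelling one.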
We evaluate the method on a staged dataset and on a disjoint test set collected from internet videos, demonstrating promising generalization across different scenes and camera viewpoints. Finally, we implement the complete pipeline on an NVIDIA Jetson Nano and report real-time performance, supporting the feasibility of proactive, on-device robbery detection.
[86] SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning
Xue Wu,Shengting Cao,Jiaqi Gong
Main category: cs.CV
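TL;DR: This paper proposes SatBLIP, a framework that combines satellite imagery with language models to improve the accuracy and interpretability of Social Vulnerability Index (SVI) prediction in rural areas.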
Details
Motivation: Existing rural environmental risk assessment is limited by coarse vulnerability indices and traditional remote-sensing pipelines (handcrafted features, manual virtual audits, vision-language models trained on generic images), which struggle to capture place-based risk context.
Method: Builds SatBLIP, a vision-language model for satellite imagery: GPT-4o generates structured descriptions of satellite images (roof type/condition, house size, yard attributes, vegetation, road context); a BLIP model adapted to satellite semantics is fine-tuned to generate captions; the captions are encoded with CLIP and fused with LLM embeddings via attention to predict county-level SVI; SHAP is then used to analyze the key driving attributes.
Result: SatBLIP significantly improves county-level SVI prediction and identifies stable, interpretable risk attributes such as roof form/condition, street width, vegetation cover, and vehicles/open space, supporting interpretable mapping of rural risk environments.
Conclusion: SatBLIP offers a new paradigm for remote-sensing-based social vulnerability modeling that balances accuracy, interpretability, and local adaptation, and it advances environmental-justice-oriented rural risk governance.
Abstract: Rural environmental risks are shaped by place-based conditions (e.g., housing quality, road access, land-surface patterns), yet standard vulnerability indices are coarse and provide limited insight into risk contexts. We propose SatBLIP, a satellite-specific vision-language framework for rural context understanding and feature identification that predicts county-level Social Vulnerability Index (SVI). SatBLIP addresses limitations of prior remote sensing pipelines (handcrafted features, manual virtual audits, and natural-image-trained VLMs) by coupling contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics. We use GPT-4o to generate structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, and road context), then fine-tune a satellite-adapted BLIP model to generate captions for unseen images. Captions are encoded with CLIP and fused with LLM-derived embeddings via attention for SVI estimation under spatial aggregation. Using SHAP, we identify salient attributes (e.g., roof form/condition, street width, vegetation, cars/open space) that consistently drive robust predictions, enabling interpretable mapping of rural risk environments.
[87] FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images
Sabab Ishraq,Aarushi Aarushi,Juncai Jiang,Chen Chen
Main category: cs.CV
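TL;DR: This paper introduces the FoodSense dataset and the FoodSense-VL model for predicting multisensory experience (taste, smell, texture, and sound) from food images and generating image-grounded explanations, bridging cognitive science and multimodal AI.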
Details
Motivation: Humans can infer several sensory experiences from food images, but existing vision-language research focuses mainly on recognition tasks and offers little support for cross-sensory reasoning.
Method: Builds the FoodSense dataset of 66,842 participant-image pairs with ratings and descriptions across four sensory dimensions; uses a large language model to generate image-grounded reasoning traces; trains the FoodSense-VL model to predict multisensory ratings and produce explanations.
Result: Delivers the FoodSense dataset and the FoodSense-VL model, shows that cross-sensory inference is feasible, and notes that commonly used evaluation metrics have limitations on this task.
Conclusion: The work combines cognitive-science findings on cross-sensory perception with instruction tuning of modern multimodal models, advancing embodied, interpretable modeling of visual perception.
Abstract: Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision-language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision-language benchmark model, to produce both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visual sensory inference.
[88] Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers
Felipe Parodi,Jordan Matelsky,Melanie Segado
Main category: cs.CV
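TL;DR: Using several replacement controls (mean substitution, noise substitution, and cross-image register shuffling), this paper finds that zero-ablation severely overestimates the functional importance of registers in vision transformers; task performance depends on plausible register-like activations rather than on the exact image-specific values stored in the registers.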
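To make the four interventions concrete, here is a rough sketch of how register activations for a batch of images might be replaced. It is not the authors' code: the function name `replace_registers`, the nested-list layout, and the choice to match the noise to batch mean and standard deviation are assumptions made for illustration.

```python
import random


def replace_registers(registers, mode, rng):
    """Return modified register tokens for a batch of images.

    `registers[i][r]` is the activation vector (list of floats) of register
    r for image i. Modes: "zero", "mean", "noise", "shuffle".
    """
    n_images = len(registers)
    n_regs = len(registers[0])
    dim = len(registers[0][0])

    if mode == "zero":
        return [[[0.0] * dim for _ in range(n_regs)] for _ in range(n_images)]

    if mode == "shuffle":
        # Permute whole register sets across images (an image may keep its own).
        order = list(range(n_images))
        rng.shuffle(order)
        return [[list(vec) for vec in registers[j]] for j in order]

    # Per-register, per-dimension mean over the batch.
    mean = [[sum(registers[i][r][d] for i in range(n_images)) / n_images
             for d in range(dim)] for r in range(n_regs)]
    if mode == "mean":
        return [[list(mean[r]) for r in range(n_regs)] for _ in range(n_images)]

    if mode == "noise":
        # Assumed design: Gaussian noise matched to the batch mean and std.
        std = [[(sum((registers[i][r][d] - mean[r][d]) ** 2
                     for i in range(n_images)) / n_images) ** 0.5
                for d in range(dim)] for r in range(n_regs)]
        return [[[rng.gauss(mean[r][d], std[r][d]) for d in range(dim)]
                 for r in range(n_regs)] for _ in range(n_images)]

    raise ValueError(f"unknown mode: {mode}")
```

Only the "zero" mode pushes the registers far outside their usual activation range; the other three keep them plausible while discarding image-specific content, which is the contrast the paper draws.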
Details
Motivation: Zero-ablation is widely used to analyze the role of tokens in vision transformers, but it is unclear whether it faithfully reflects register function; this paper tests the reliability of zero-ablation results and probes how registers actually contribute.
Method: On DINOv2+registers and DINOv3, compares zero-ablation with three replacement controls (mean substitution, noise substitution, cross-image register shuffling) on classification, correspondence, and segmentation; also measures how strongly each method perturbs internal representations, using per-patch cosine similarity.
Result: Zero-ablation causes large drops (up to -36.6 pp classification, -30.9 pp segmentation), whereas all three replacement controls leave performance almost unchanged (within about 1 pp); cosine-similarity analysis shows that zeroing perturbs representations far more than the other methods; the findings replicate at ViT-B scale.
Conclusion: Zero-ablation overstates how necessary register content is. In frozen-feature evaluation, task performance depends on plausible register-like activations rather than exact image-specific values; registers buffer dense features from [CLS] dependence and are associated with compressed patch geometry.
Abstract: Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to $-36.6$\,pp classification, $-30.9$\,pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within ${\sim}1$\,pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from \texttt{[CLS]} dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
[89] Crowdsourcing of Real-world Image Annotation via Visual Properties
Xiaolei Diao,Fausto Giunchiglia
Main category: cs.CV
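TL;DR: This paper proposes an image annotation method that integrates knowledge representation, natural language processing, and computer vision; by introducing visual-property constraints and an interactive crowdsourcing framework based on a category hierarchy, it reduces annotation subjectivity and mitigates the semantic gap problem.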
Details
Motivation: Address the complex many-to-many mappings between visual data and linguistic descriptions that the semantic gap causes in object recognition datasets, and reduce the negative effect of annotation subjectivity on computer vision performance.
Method: Proposes an image annotation method that integrates knowledge representation, NLP, and CV; designs a dynamic, interactive crowdsourcing framework driven by a predefined object category hierarchy and annotator feedback, using visual-property constraints to guide annotation.
Result: Experiments confirm the method's effectiveness, and analysis of annotator feedback is used to optimize the crowdsourcing setup.
Conclusion: The method effectively mitigates bias from the semantic gap, improves annotation quality and consistency, and offers a new approach to building more reliable datasets.
Abstract: Recent advances in data-centric artificial intelligence highlight inherent limitations in object recognition datasets. One of the primary issues stems from the semantic gap problem, which results in complex many-to-many mappings between visual data and linguistic descriptions. This bias adversely affects performance in computer vision tasks. This paper proposes an image annotation methodology that integrates knowledge representation, natural language processing, and computer vision techniques, aiming to reduce annotator subjectivity by applying visual property constraints. We introduce an interactive crowdsourcing framework that dynamically asks questions based on a predefined object category hierarchy and annotator feedback, guiding image annotation by visual properties. Experiments demonstrate the effectiveness of this methodology, and annotator feedback is discussed to optimize the crowdsourcing setup.
[90] Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images
Jue Jiang,Aneesh Rangnekar,Harini Veeraraghavan
Main category: cs.CV
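TL;DR: This paper proposes DAGMaN, a framework that combines attention-guided masking with co-distillation from a noisy teacher to improve Swin Transformer self-supervised pretraining on medical images, reducing information leakage while preserving attention-head diversity.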
Details
Motivation: Random masking on medical images easily leaks information between contextually similar patches, which weakens self-supervised learning; the Swin Transformer lacks a global [CLS] token, so advanced masking strategies are hard to apply.
Method: Proposes an attention-guided masking mechanism embedded in a co-distillation framework, and introduces a noisy teacher model for the first time, which performs attentive masking while maintaining high attention-head diversity.
Result: DAGMaN is validated on several downstream tasks: lung nodule classification (full and few-shot), immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering.
Conclusion: DAGMaN significantly improves the representations learned by the Swin Transformer in medical-image self-supervised learning, balancing masking difficulty against attention diversity and offering a new paradigm for medical-imaging SSL.
Abstract: Masked image modeling (MIM) is a highly effective self-supervised learning (SSL) approach to extract useful feature representations from unannotated data. Predominantly used random masking methods make SSL less effective for medical images due to the contextual similarity of neighboring patches, leading to information leakage and SSL simplification. The hierarchical shifted window (Swin) transformer, a highly effective approach for medical images, cannot use advanced masking methods as it lacks a global [CLS] token. Hence, we introduce an attention guided masking mechanism for Swin within a co-distillation learning framework to selectively mask semantically co-occurring and discriminative patches, to reduce information leakage and increase the difficulty of SSL pretraining. However, attention guided masking inevitably reduces the diversity of attention heads, which negatively impacts downstream task performance. To address this, we, for the first time, integrate a noisy teacher into the co-distillation framework (termed DAGMaN) that performs attentive masking while preserving high attention head diversity. We demonstrate the capability of DAGMaN on multiple tasks including full- and few-shot lung nodule classification, immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering.
[91] H2VLR: Heterogeneous Hypergraph Vision-Language Reasoning for Few-Shot Anomaly Detection
Jianghong Huang,Luping Ji,Weiwei Duan,Mao Ye
Main category: cs.CV
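TL;DR: This paper proposes H2VLR, a heterogeneous-hypergraph vision-language reasoning framework that addresses the neglect of structural dependencies and global consistency in few-shot anomaly detection (FSAD); by casting FSAD as high-order inference over visual-semantic relations, it reaches SOTA performance on industrial and medical benchmarks.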
Details
Motivation: Most existing few-shot anomaly detection methods based on vision-language models (VLMs) rely only on pairwise feature matching and ignore structural dependencies and global consistency between visual regions and semantic concepts, which limits performance.
Method: Proposes the Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework, which casts FSAD as a high-order inference problem, jointly models visual regions and semantic concepts in a unified hypergraph, and explicitly captures high-order relations among heterogeneous nodes.
Result: On several representative industrial and medical anomaly detection benchmarks, H2VLR significantly outperforms existing methods and often reaches SOTA performance.
Conclusion: By modeling high-order vision-language relations with a heterogeneous hypergraph, H2VLR improves the robustness and generalization of few-shot anomaly detection and confirms the value of structured reasoning for VLM-based FSAD.
Abstract: As a classic vision task, anomaly detection has been widely applied in industrial inspection and medical imaging. In this task, data scarcity is a frequent issue. To address it, the few-shot anomaly detection (FSAD) scheme is attracting increasing attention. In recent years, beyond the traditional visual paradigm, Vision-Language Models (VLMs) have been extensively explored to boost this field. However, almost all existing VLM-based FSAD schemes perform anomaly inference only by pairwise feature matching, ignoring structural dependencies and global consistency. To further advance FSAD with VLMs, we propose a Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework. It reformulates FSAD as a high-order inference problem of visual-semantic relations, by jointly modeling visual regions and semantic concepts in a unified hypergraph. Experimental comparisons verify the effectiveness and advantages of H2VLR. It often achieves state-of-the-art (SOTA) performance on representative industrial and medical benchmarks. Our code will be released upon acceptance.
[92] Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Ziyang Luo,Nian Liu,Junwei Han
Main category: cs.CV
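TL;DR: This paper proposes Chain of Modality (CoM), a framework that dynamically selects the input modality topology and uses a dual-path cognitive execution mechanism to address the perceptual fragility caused by static fusion structures in current Omni-MLLMs, achieving robust generalization across multiple benchmarks.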
Details
Motivation: Although Omni-MLLMs aim at unified multimodal integration, in practice they often underperform unimodal baselines; the root cause is the positional bias and alignment traps induced by static fusion structures.
Method: Proposes Chain of Modality (CoM): (1) it dynamically switches among parallel, sequential, and interleaved input topologies to remove structural bias; (2) it defines two cognitive paths, Direct-Decide and Reason-Decide, suited to direct perception and analytical reasoning respectively. It supports training-free use or data-efficient supervised fine-tuning.
Result: CoM generalizes robustly and consistently across diverse benchmarks and substantially eases the problem of joint multimodal inference underperforming unimodal inference.
Conclusion: A dynamic, task-adaptive modality fusion paradigm outperforms static fusion, and CoM offers a new architectural direction for building reliable Omni-MLLMs.
Abstract: Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined ``Direct-Decide'' path for direct perception and a structured ``Reason-Decide'' path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.
[93] FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking
Jinlin You,Muyu Li,Xudong Zhao
Main category: cs.CV
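TL;DR: This paper proposes FreqTrack, a frequency-aware RGB-event (RGBE) tracking framework that builds complementary inter-modal correlations through frequency-domain transforms, and designs a Spectral Enhancement Transformer (SET) layer and a Wavelet Edge Refinement (WER) module to improve tracking in complex dynamic scenes.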
Details
Motivation: Single-modal RGB trackers are limited in complex dynamic scenes, and current RGB-event fusion methods do not fully exploit the temporal response and high-frequency characteristics of event data.
Method: Proposes the FreqTrack framework, comprising a Spectral Enhancement Transformer (SET) layer (multi-head dynamic Fourier filtering) and a Wavelet Edge Refinement (WER) module (learnable wavelet transforms that extract multi-scale edge structures), to fuse RGB and event features robustly in the frequency domain.
Result: Experiments on the COESOT and FE108 datasets show highly competitive performance, with a leading precision of 76.6% on the COESOT benchmark.
Conclusion: Frequency-domain modeling effectively improves RGB-event fusion tracking, especially in challenging high-speed and low-light scenes.
Abstract: Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive performance, particularly attaining leading precision of 76.6\% on the COESOT benchmark, validating the effectiveness of frequency-domain modeling for RGBE tracking.
[94] Design and Validation of a Low-Cost Smartphone Based Fluorescence Detection Platform Compared with Conventional Microplate Readers
Zhendong Cao,Katrina G. Salvante,Ash Parameswaran,Pablo A. Nepomnaschy,Hongji Dai
Main category: cs.CV
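TL;DR: This paper presents a low-cost fluorescence optical detection system that uses a smartphone camera in place of conventional, expensive microplate readers (such as the Perkin Elmer Victor), detecting microorganisms and molecules by analyzing the relationship between the RGB image color of a sample and the molar concentration of the fluorescent substance.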
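One simple way to relate image color to concentration is a calibration curve fitted on wells of known concentration. The sketch below assumes a linear relationship and uses only the green channel; both choices, the helper names, and all numbers are illustrative assumptions, since the abstract does not specify the fitted model.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_green(pixels):
    """Average green-channel value of a well's pixels, given as (R, G, B) tuples."""
    return sum(p[1] for p in pixels) / len(pixels)


# Made-up calibration standards: known concentration and mean green intensity per well.
known_conc = [0.0, 1.0, 2.0, 4.0]
green = [10.0, 30.0, 50.0, 90.0]
slope, intercept = fit_line(green, known_conc)

# Estimate the concentration of an unknown well from its (made-up) pixels.
unknown_pixels = [(5, 68, 12), (6, 72, 11)]
estimate = slope * mean_green(unknown_pixels) + intercept
print(round(estimate, 3))
```

In practice the curve would be fitted on measured standards, and a nonlinear model may be needed if the camera response saturates at high concentrations.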
Details
Motivation: Lower the cost of fluorescence detection equipment so that the technique becomes more accessible and portable, especially in resource-limited settings.
Method: Designs an optical setup compatible with standard 96-well plates, uses a smartphone camera as the fluorescence detector, and establishes a quantitative relationship between the RGB image color of a sample and the molar concentration of the fluorescent substance.
Result: Builds a workable detection system that needs no expensive components such as exciter filters, barrier filters, or photomultipliers.
Conclusion: A smartphone camera can effectively replace conventional high-end optical detection components, offering a practical approach to low-cost, portable biological fluorescence detection.
Abstract: A low-cost fluorescence-based optical system is developed for detecting the presence of certain microorganisms and molecules within a diluted sample. A specifically designed device setup compatible with conventional 96-well plates is chosen to create an ideal environment in which a smartphone camera can be used as the optical detector. In comparison with conventional microplate reading machines such as the Perkin Elmer Victor Machine, the device presented in this paper is not equipped with expensive elements such as an exciter filter, barrier filter, and photomultiplier; instead, a phone camera is all that is needed to detect fluorescence within the sample. The strategy is to determine the relationship between the image color of the sample in RGB color space and the molar concentration of the fluorescent specimen in that sample. This manuscript is a preprint version of work related to a publication in IEEE. The final version may differ from this manuscript.
[95] WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms
Yucheng Pan,Heping Li,Zhangle Liu,Sajid Hussain,Bin Pan
Main category: cs.CV
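TL;DR: This paper proposes WILD-SAM, a framework that uses a Phase-Aware Mixture-of-Experts (PA-MoE) adapter and a Wavelet-Guided Subband Enhancement (WGSE) strategy to improve the accuracy and boundary integrity of SAM when detecting slow-moving landslides in wrapped-phase InSAR interferograms.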
Details
Motivation: Detecting slow-moving landslides directly from wrapped InSAR interferograms is vital for geohazard monitoring but is hampered by severe phase ambiguity and complex coherence noise; SAM is hard to transfer directly to wrapped-phase data because of the spectral domain shift.
Method: Proposes WILD-SAM: (1) a Phase-Aware Mixture-of-Experts (PA-MoE) Adapter embedded in the frozen encoder dynamically aggregates multi-scale spectral-textural priors to align the spectral distributions of natural images and phase data; (2) a Wavelet-Guided Subband Enhancement (WGSE) strategy uses discrete wavelet transforms to disentangle high-frequency subbands and generate frequency-aware dense prompts, preserving the topological integrity of landslide boundaries.
Result: Significantly outperforms existing methods on the ISSLIDE and ISSLIDE+ benchmarks, reaching SOTA performance in both target completeness and contour fidelity.
Conclusion: WILD-SAM bridges the spectral gap between general-purpose vision models and InSAR phase data, offering a new approach to accurate, robust automatic identification of slow-moving landslides.
Abstract: Detecting slow-moving landslides directly from wrapped Interferometric Synthetic Aperture Radar (InSAR) interferograms is crucial for efficient geohazard monitoring, yet it remains fundamentally challenged by severe phase ambiguity and complex coherence noise. While the Segment Anything Model (SAM) offers a powerful foundation for segmentation, its direct transfer to wrapped phase data is hindered by a profound spectral domain shift, which suppresses the high-frequency fringes essential for boundary delineation. To bridge this gap, we propose WILD-SAM, a novel parameter-efficient fine-tuning framework specifically designed to adapt SAM for high-precision landslide detection on wrapped interferograms. Specifically, the architecture integrates a Phase-Aware Mixture-of-Experts (PA-MoE) Adapter into the frozen encoder to align spectral distributions and introduces a Wavelet-Guided Subband Enhancement (WGSE) strategy to generate frequency-aware dense prompts. The PA-MoE Adapter exploits a dynamic routing mechanism across heterogeneous convolutional experts to adaptively aggregate multi-scale spectral-textural priors, effectively aligning the distribution discrepancy between natural images and interferometric phase data. Meanwhile, the WGSE strategy leverages discrete wavelet transforms to explicitly disentangle high-frequency subbands and refine directional phase textures, injecting these structural cues as dense prompts to ensure topological integrity along sharp landslide boundaries.
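As a small illustration of what separating high-frequency subbands with a discrete wavelet transform means, the sketch below applies a single-level 2D Haar transform to a tiny image. The Haar wavelet and the function name `haar_dwt2` are assumptions; the abstract does not say which wavelet WGSE uses, and this is not the authors' module.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of an image with even height and width.

    Returns four subbands (LL, LH, HL, HH), each half the input size.
    LL is the low-frequency approximation; the other three hold detail.
    """
    rows, cols = len(image), len(image[0])
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2)  # local average (scaled)
            lh_row.append((a + b - d - e) / 2)  # change between the two rows
            hl_row.append((a - b + d - e) / 2)  # change between the two columns
            hh_row.append((a - b - d + e) / 2)  # diagonal change
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


# A 2x2 patch whose values change only from left to right.
ll, lh, hl, hh = haar_dwt2([[1, 3], [1, 3]])
print(ll, lh, hl, hh)
```

Sharp phase fringes show up as large values in the three detail subbands, which is why such subbands are a natural source of boundary cues.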
Extensive experiments on the ISSLIDE and ISSLIDE+ benchmarks demonstrate that WILD-SAM achieves state-of-the-art performance, significantly outperforming existing methods in both target completeness and contour fidelity.
[96] Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars
Yicheng Gong,Jiawei Zhang,Liqiang Liu,Yanwen Wang,Lei Chu,Jiahao Li,Hao Pan,Hao Zhu,Yan Lu
Main category: cs.CV
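TL;DR: This paper proposes a framework for explicit emotion control in feed-forward single-image 3D head avatar reconstruction; a dual-path modulation mechanism injects emotion into existing architectures as an independent, controllable signal, enabling emotion transfer, disentangled manipulation, and smooth interpolation while preserving reconstruction quality.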
Details
Motivation: 现有方法中情绪常与几何或外观隐式耦合,缺乏对情绪的显式、一致且跨身份的独立控制能力。 Method: 提出双路径调制机制:几何调制在参数空间中进行情绪条件归一化,解耦情绪与语音驱动形变;外观调制捕获身份感知的情绪相关视觉线索;并构建了时序对齐、情绪一致的多身份数据集以支持训练。 Result: 该框架可集成到多种SOTA骨干网络中,在保持高保真重建与重演能力的同时,实现了可控情绪迁移、情绪-身份解耦操控及平滑情绪插值。 Conclusion: 本工作推动了具表现力与可扩展性的3D头像建模,确立了情绪作为第一类控制信号的可行性与有效性。 Abstract: We present a framework for explicit emotion control in feed-forward, single-image 3D head avatar reconstruction. Unlike existing pipelines where emotion is implicitly entangled with geometry or appearance, we treat emotion as a first-class control signal that can be manipulated independently and consistently across identities. Our method injects emotion into existing feed-forward architectures via a dual-path modulation mechanism without modifying their core design. Geometry modulation performs emotion-conditioned normalization in the original parametric space, disentangling emotional state from speech-driven articulation, while appearance modulation captures identity-aware, emotion-dependent visual cues beyond geometry. To enable learning under this setting, we construct a time-synchronized, emotion-consistent multi-identity dataset by transferring aligned emotional dynamics across identities. Integrated into multiple state-of-the-art backbones, our framework preserves reconstruction and reenactment fidelity while enabling controllable emotion transfer, disentangled manipulation, and smooth emotion interpolation, advancing expressive and scalable 3D head avatars.[97] Controllable Video Object Insertion via Multiview Priors
Xia Qi,Peishan Cong,Yichen Yao,Ziyi Wang,Yaoqin Ye,Yuexin Ma
Main category: cs.CV
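TL;DR: This paper proposes a new video object insertion method that introduces multi-view object priors and a dual-path view-consistent conditioning mechanism to address appearance inconsistency, occlusion handling, and temporal coherence.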
Details
Motivation: 现有视频生成方法在将新对象插入到已有视频时,难以保证对象外观一致性、空间对齐和时间连贯性。 Method: 利用2D参考图像构建多视角表示,结合双路径视角一致性条件机制和质量感知加权机制,并设计集成感知一致性模块以提升空间真实性和时间连续性。 Result: 实验表明该方法显著提升了视频对象插入的质量,实现了稳定且逼真的对象集成效果。 Conclusion: 所提框架有效缓解了动态环境中外观不一致与遮挡处理难题,为高质量视频对象插入提供了新思路。 Abstract: Video object insertion is a critical task for dynamically inserting new objects into existing environments. Previous video generation methods focus primarily on synthesizing entire scenes while struggling with ensuring consistent object appearance, spatial alignment, and temporal coherence when inserting objects into existing videos. In this paper, we propose a novel solution for Video Object Insertion, which integrates multi-view object priors to address the common challenges of appearance inconsistency and occlusion handling in dynamic environments. By lifting 2D reference images into multi-view representations and leveraging a dual-path view-consistent conditioning mechanism, our framework ensures stable identity guidance and robust integration across diverse viewpoints. A quality-aware weighting mechanism is also employed to adaptively handle noisy or imperfect inputs. Additionally, we introduce an Integration-Aware Consistency Module that guarantees spatial realism, effectively resolving occlusion and boundary artifacts while maintaining temporal continuity across frames. Experimental results show that our solution significantly improves the quality of video object insertion, providing stable and realistic integration.[98] The Fourth Challenge on Image Super-Resolution ($\times$4) at NTIRE 2026: Benchmark Results and Method Overview
Zheng Chen,Kai Liu,Jingkai Wang,Xianglong Yan,Jianze Li,Ziqing Zhang,Jue Gong,Jiatong Li,Lei Sun,Xiaoyang Liu,Radu Timofte,Yulun Zhang,Jihye Park,Yoonjin Im,Hyungju Chun,Hyunhee Park,MinKyu Park,Zheng Xie,Xiangyu Kong,Weijun Yuan,Zhan Li,Qiurong Song,Luen Zhu,Fengkai Zhang,Xinzhe Zhu,Junyang Chen,Congyu Wang,Yixin Yang,Zhaorun Zhou,Jiangxin Dong,Jinshan Pan,Shengwei Wang,Jiajie Ou,Baiang Li,Sizhuo Ma,Qiang Gao,Jusheng Zhang,Jian Wang,Keze Wang,Yijiao Liu,Yingsi Chen,Hui Li,Yu Wang,Congchao Zhu,Saeed Ahmad,Ik Hyun Lee,Jun Young Park,Ji Hwan Yoon,Kainan Yan,Zian Wang,Weibo Wang,Shihao Zou,Chao Dong,Wei Zhou,Linfeng Li,Jaeseong Lee,Jaeho Chae,Jinwoo Kim,Seonjoo Kim,Yucong Hong,Zhenming Yan,Junye Chen,Ruize Han,Song Wang,Yuxuan Jiang,Chengxi Zeng,Tianhao Peng,Fan Zhang,David Bull,Tongyao Mu,Qiong Cao,Yifan Wang,Youwei Pan,Leilei Cao,Xiaoping Peng,Wei Deng,Yifei Chen,Wenbo Xiong,Xian Hu,Yuxin Zhang,Xiaoyun Cheng,Yang Ji,Zonghao Chen,Zhihao Xue,Junqin Hu,Nihal Kumar,Snehal Singh Tomar,Klaus Mueller,Surya Vashisth,Prateek Shaily,Jayant Kumar,Hardik Sharma,Ashish Negi,Sachin Chaudhary,Akshay Dudhane,Praful Hambarde,Amit Shukla,Shijun Shi,Jiangning Zhang,Yong Liu,Kai Hu,Jing Xu,Xianfang Zeng,Amitesh M,Hariharan S,Chia-Ming Lee,Yu-Fan Lin,Chih-Chung Hsu,Nishalini K,Sreenath K A,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Shuling Zheng,Zhiheng Fu,Feng Zhang,Zhanglu Chen,Boyang Yao,Nikhil Pathak,Aagam Jain,Milan Kumar,Kishor Upla,Vivek Chavda,Sarang N S,Raghavendra Ramachandra,Zhipeng Zhang,Qi Wang,Shiyu Wang,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi,Yuqi Li,Chuanguang Yang,Weilun Feng,Zhuzhi Hong,Hao Wu,Junming Liu,Yingli Tian,Amish Bhushan Kulkarni,Tejas R R Shet,Saakshi M Vernekar,Nikhil Akalwadi,Kaushik Mallibhat,Ramesh Ashok Tabib,Uma Mudenagudi,Yuwen Pan,Tianrun Chen,Deyi Ji,Qi Zhu,Lanyun Zhu,Heyan Zhangyi
Main category: cs.CV
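TL;DR: This paper presents the NTIRE 2026 image super-resolution (×4) challenge, which has a restoration track and a perceptual track and aims to advance super-resolution techniques and provide a unified benchmark.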
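Since the restoration track ranks submissions by PSNR, a minimal reference computation may help. The sketch below uses the standard PSNR definition on plain nested lists; the challenge's exact protocol (color space, border cropping, bit depth) is not stated in the abstract, so `max_value=255.0` and the whole-image averaging are assumptions.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are nested lists of pixel values (rows of numbers).
    """
    squared_error = 0.0
    count = 0
    for ref_row, res_row in zip(reference, restored):
        for ref_px, res_px in zip(ref_row, res_row):
            squared_error += (ref_px - res_px) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means lower mean squared error against the ground-truth HR image, which is why it rewards pixel-wise fidelity rather than perceived realism.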
Details
Motivation: 反映图像超分辨率领域不断演进的目标,推动在像素保真度与视觉真实感两方面的技术进步。 Method: 组织包含两个赛道(恢复轨以PSNR为指标,感知轨以感知评分为指标)的竞赛,使用bicubic ×4下采样生成LR图像,并对194名注册者中31个有效提交团队的方法进行评估与分析。 Result: 共31支队伍提交有效结果;报告总结了数据集、评估协议、主要结果及各队方法。 Conclusion: 该挑战赛提供了图像超分辨率领域的统一基准,揭示了当前进展与未来方向。 Abstract: This paper presents the NTIRE 2026 image super-resolution ($\times$4) challenge, one of the associated competitions of the NTIRE 2026 Workshop at CVPR 2026. The challenge aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective super-resolution solutions and analyze recent advances in the field. To reflect the evolving objectives of image super-resolution, the challenge includes two tracks: (1) a restoration track, which emphasizes pixel-wise fidelity and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and evaluates results using a perceptual score. A total of 194 participants registered for the challenge, with 31 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and methods of participating teams. The challenge provides a unified benchmark and offers insights into current progress and future directions in image super-resolution.[99] DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration
Zheng Chen,Bowen Chai,Rongjun Gao,Mingtao Nie,Xi Li,Bingnan Duan,Jianping Fang,Xiaohong Liu,Linghe Kong,Yulun Zhang
Main category: cs.CV
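TL;DR: This paper proposes DVFace, a one-step diffusion framework for real-world video face restoration; a spatio-temporal dual-codebook design and an asymmetric spatio-temporal fusion module deliver high-quality, temporally stable, identity-preserving restoration.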
Details
Motivation: 现有基于扩散模型的视频人脸修复方法依赖通用扩散先验和多步采样,限制了面部适应性和推理效率,难以兼顾保真度与时间稳定性。 Method: 提出DVFace框架,包含时空双码本设计以提取空间与时间面部先验,并引入非对称时空融合模块将先验注入扩散主干网络。 Result: 在多个基准上验证,DVFace在修复质量、时间一致性及身份保持方面优于近期方法。 Conclusion: DVFace通过一步扩散与定制化时空建模,有效提升了视频人脸修复的效率与效果,为真实场景应用提供了新思路。 Abstract: Video face restoration aims to enhance degraded face videos into high-quality results with realistic facial details, stable identity, and temporal coherence. Recent diffusion-based methods have brought strong generative priors to restoration and enabled more realistic detail synthesis. However, existing approaches for face videos still rely heavily on generic diffusion priors and multi-step sampling, which limit both facial adaptation and inference efficiency. These limitations motivate the use of one-step diffusion for video face restoration, yet achieving faithful facial recovery alongside temporally stable outputs remains challenging. In this paper, we propose, DVFace, a one-step diffusion framework for real-world video face restoration. Specifically, we introduce a spatio-temporal dual-codebook design to extract complementary spatial and temporal facial priors from degraded videos. We further propose an asymmetric spatio-temporal fusion module to inject these priors into the diffusion backbone according to their distinct roles. Evaluation on various benchmarks shows that DVFace delivers superior restoration quality, temporal consistency, and identity preservation compared to recent methods. Code: https://github.com/zhengchen1999/DVFace.[100] Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Mingqian Ji,Shanshan Zhang,Jian Yang
Main category: cs.CV
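TL;DR: This paper proposes SEPatch3D, a framework that dynamically adjusts patch sizes, selects informative patches, and enhances features across granularities, substantially speeding up inference for ViT-based sparse multi-view 3D detectors while preserving 3D detection accuracy.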
Details
Motivation: Existing token compression methods (pruning, merging, patch size enlargement) discard background cues, disrupt contextual consistency, and lose fine-grained semantics, which hurts 3D detection. Method: SEPatch3D comprises (1) Spatiotemporal-aware Patch Size Selection (SPSS), which dynamically assigns small or large patches according to whether a scene contains nearby objects or is background-dominated; (2) Informative Patch Selection (IPS), which picks key patches for refinement; and (3) Cross-Granularity Feature Enhancement (CGFE), which injects fine-grained details into coarse patches. Result: On the nuScenes and Argoverse 2 validation sets, inference is up to 57% faster than StreamPETR and 20% more efficient than the state-of-the-art ToC3D-faster, with comparable detection accuracy. Conclusion: Dynamic patch size adjustment combined with informative patch selection and cross-granularity enhancement balances computational efficiency against semantic completeness, giving ViT-based 3D detectors an efficient, lightweight alternative. Abstract: Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at https://github.com/Mingqj/SEPatch3D.
[101] Learning Adaptive Reasoning Paths for Efficient Visual Reasoning
Yixu Huang,Tinghui Zhu,Muhao Chen
Main category: cs.CV
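TL;DR: This paper proposes AVR, a framework that adaptively selects a reasoning format (full, perception-only, or direct answer) to cut redundant reasoning paths in visual reasoning models, reducing token usage by 50-90% while maintaining accuracy.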
Details
Motivation: Visual reasoning models often overthink because of reasoning path redundancy, generating overly long reasoning chains for simple tasks and hurting efficiency. Method: AVR decomposes visual reasoning into three cognitive functions (visual perception, logical reasoning, and answer application) and supports three dynamically chosen response formats; the model is trained with FS-GRPO, an adaptation of GRPO, to trade off efficiency against correctness. Result: Across multiple vision-language benchmarks, AVR reduces token usage by 50-90% while maintaining overall accuracy, with particularly strong results on perception-intensive tasks. Conclusion: Adaptive visual reasoning effectively mitigates overthinking in visual reasoning models, improving efficiency without sacrificing performance. Abstract: Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for any task. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50-90% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.
[102] Deepfake Detection Generalization with Diffusion Noise
Hongyuan Qi,Wenjin Hou,Hehe Fan,Jun Xiao
Main category: cs.CV
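TL;DR: This paper proposes Attention-guided Noise Learning (ANL), a framework built on diffusion noise characteristics that improves how well deepfake detectors generalize to new generation techniques, especially diffusion-generated forgeries. It uses the denoising process of a pre-trained diffusion model to guide the detector toward more robust features and an attention mechanism to focus on global discrepancies, markedly improving cross-model generalization without adding inference overhead.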
Details
Motivation: Existing deepfake detectors generalize poorly to the high-fidelity forgeries produced by emerging diffusion models and struggle to recognize forgery types beyond GANs. Method: The Attention-guided Noise Learning (ANL) framework uses a frozen pre-trained diffusion model to guide the detector in predicting the noise contained in an input image at a given diffusion step; an attention map derived from the predicted noise steers the network toward globally distributed forgery traces rather than local patterns; and the diffusion model's natural-image prior acts as a regularizer. Result: ANL significantly outperforms existing methods on multiple benchmarks and reaches state-of-the-art accuracy in detecting diffusion-generated forgeries; ACC/AP improves by a substantial margin on unseen forgery models in particular, with no extra computation at inference. Conclusion: Diffusion noise is a powerful, generalizable signal, and ANL exploits it to improve a detector's adaptability to unknown forgery techniques, offering a new approach to general-purpose deepfake detection. Abstract: Deepfake detectors face growing challenges in generalization as new image synthesis techniques emerge. In particular, deepfakes generated by diffusion models are highly photorealistic and often evade detectors trained on GAN-based forgeries. This paper addresses the generalization problem in deepfake detection by leveraging diffusion noise characteristics. We propose an Attention-guided Noise Learning (ANL) framework that integrates a pre-trained diffusion model into the deepfake detection pipeline to guide the learning of more robust features. Specifically, our method uses the diffusion model's denoising process to expose subtle artifacts: the detector is trained to predict the noise contained in an input image at a given diffusion step, forcing it to capture discrepancies between real and synthetic images, while an attention-guided mechanism derived from the predicted noise is introduced to encourage the model to focus on globally distributed discrepancies rather than local patterns. By harnessing the frozen diffusion model's learned distribution of natural images, the ANL method acts as a form of regularization, improving the detector's generalization to unseen forgery types. Extensive experiments demonstrate that ANL significantly outperforms existing methods on multiple benchmarks, achieving state-of-the-art accuracy in detecting diffusion-generated deepfakes. Notably, the proposed framework boosts generalization performance (e.g., improving ACC/AP by a substantial margin on unseen models) without introducing additional overhead during inference. Our results highlight that diffusion noise provides a powerful signal for generalizable deepfake detection.
[103] M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection
Haotian Wu,Yue Cheng,Shan Bian
Main category: cs.CV
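TL;DR: This paper proposes M3D-Net, a multi-modal 3D facial feature reconstruction network for deepfake detection that improves detection accuracy and generalization through dual-stream self-supervised 3D reconstruction, feature pre-fusion, and an attention-based multi-modal fusion module.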
Details
Motivation: Most existing deepfake detection methods rely on reconstructing isolated facial attributes, fail to exploit the complementarity of multi-modal features, and struggle against increasingly realistic forgeries. Method: M3D-Net comprises a self-supervised 3D facial reconstruction module (recovering geometry and reflectance), a 3D Feature Pre-fusion Module (PFM), and a Multi-modal Fusion Module (MFM); a dual-stream architecture fuses RGB and 3D-reconstructed features using attention mechanisms. Result: The method achieves state-of-the-art detection accuracy and robustness on multiple public datasets and generalizes better than existing methods. Conclusion: Jointly modeling multi-modal 3D features effectively improves deepfake detection, and M3D-Net offers a new, practical framework for the task. Abstract: With the rapid advancement of deep learning in image generation, facial forgery techniques have achieved unprecedented realism, posing serious threats to cybersecurity and information authenticity. Most existing deepfake detection approaches rely on the reconstruction of isolated facial attributes without fully exploiting the complementary nature of multi-modal feature representations. To address these challenges, this paper proposes a novel Multi-Modal 3D Facial Feature Reconstruction Network (M3D-Net) for deepfake detection. Our method leverages an end-to-end dual-stream architecture that reconstructs fine-grained facial geometry and reflectance properties from single-view RGB images via a self-supervised 3D facial reconstruction module. The network further enhances detection performance through a 3D Feature Pre-fusion Module (PFM), which adaptively adjusts multi-scale features, and a Multi-modal Fusion Module (MFM) that effectively integrates RGB and 3D-reconstructed features using attention mechanisms. Extensive experiments on multiple public datasets demonstrate that our approach achieves state-of-the-art performance in terms of detection accuracy and robustness, significantly outperforming existing methods while exhibiting strong generalization across diverse scenarios.
[104] TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
Xiangyu Liu,Feng Gao,Xiaomei Zhang,Yong Zhang,Xiaoming Wei,Zhen Lei,Xiangyu Zhu
Main category: cs.CV
TL;DR: TurboTalk is a two-stage progressive distillation framework that compresses multi-step audio-driven video diffusion models into a single-step generator, achieving 120x faster inference while preserving quality.
Details
Motivation: Existing audio-driven video digital human generation models suffer from high computational overhead due to multi-step denoising, limiting real-world deployment; one-step distillation methods are fast but unstable during training. Method: TurboTalk uses a two-stage progressive distillation: first, Distribution Matching Distillation to train a stable 4-step student model; second, adversarial distillation with progressive timestep sampling and a self-compare adversarial objective to reduce steps from 4 to 1. Result: Achieves single-step video talking avatar generation, improving inference speed by 120 times while maintaining high visual quality. Conclusion: TurboTalk successfully balances speed and stability in audio-driven video generation, enabling practical real-time deployment without sacrificing fidelity. Abstract: Existing audio-driven video digital human generation models rely on multi-step denoising, resulting in substantial computational overhead that severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that effectively compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective that provides an intermediate adversarial reference to stabilize progressive distillation. Our method achieves single-step generation of video talking avatars, boosting inference speed by 120 times while maintaining high generation quality.
[105] MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models
Ruiqi Wang,Qi Yu,Jie Ma,Hanlin Wu
Main category: cs.CV
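TL;DR: This paper proposes MapSR, a prompt-driven framework for land-cover map super-resolution. It uses low-resolution labels only once to extract class prompts, produces high-resolution predictions without further training, and refines them spatially with graph propagation, reaching 59.64% mIoU on the Chesapeake Bay dataset while sharply reducing parameter count and training time.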
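The prompt-extraction and matching steps summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the confidence threshold of 0.9, and the fallback for classes with no confident pixel are all hypothetical, and the graph-based refinement step is omitted.

```python
import numpy as np


def l2_normalise(x, eps=1e-12):
    """Scale each row to unit length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def estimate_class_prompts(features, probe_probs, threshold=0.9):
    """Average the features that a (hypothetical) linear probe assigns
    to each class with high confidence; one prompt vector per class.

    features:    (num_pixels, dim) frozen foundation-model features
    probe_probs: (num_pixels, num_classes) probe class probabilities
    """
    num_classes = probe_probs.shape[1]
    labels = probe_probs.argmax(axis=1)
    confidence = probe_probs.max(axis=1)
    prompts = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        mask = (labels == c) & (confidence >= threshold)
        if not mask.any():
            # Assumed fallback: use every pixel predicted as class c.
            mask = labels == c
        if mask.any():
            prompts[c] = features[mask].mean(axis=0)
    return l2_normalise(prompts)


def predict_by_cosine(features, prompts):
    """Assign each pixel to the class prompt with highest cosine similarity."""
    similarities = l2_normalise(features) @ prompts.T
    return similarities.argmax(axis=1)


if __name__ == "__main__":
    feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    probs = np.array([[0.95, 0.05], [0.6, 0.4], [0.05, 0.95], [0.3, 0.7]])
    class_prompts = estimate_class_prompts(feats, probs)
    print(predict_by_cosine(feats, class_prompts))
```

Because the prompts are fixed once estimated, inference on new imagery only needs the frozen features and one matrix product, which is consistent with the training-free inference the entry describes.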
Details
Motivation: High-resolution land-cover mapping is limited by the high cost of dense high-resolution annotation, and existing weakly supervised methods retrain models on low-resolution labels at substantial computational cost. Method: MapSR is prompt-driven: it uses frozen vision foundation model features and a lightweight linear probe to extract class prompts from low-resolution labels in a single pass, then performs training-free inference by cosine-similarity matching and refines predictions spatially with graph propagation. Result: On the Chesapeake Bay dataset, MapSR reaches 59.64% mIoU with zero high-resolution labels, competitive with the strongest weakly supervised baseline and ahead of a fully supervised baseline, while reducing trainable parameters by four orders of magnitude and cutting training time from hours to minutes. Conclusion: MapSR enables efficient, low-cost, scalable high-resolution land-cover mapping and offers a new approach to remote sensing semantic mapping under limited resources. Abstract: High-resolution (HR) land-cover mapping is often constrained by the high cost of dense HR annotations. We revisit this problem from the perspective of map super-resolution, which enhances coarse low-resolution (LR) land-cover products into HR maps at the resolution of the input imagery. Existing weakly supervised methods can leverage LR labels, but they typically use them to retrain dense predictors with substantial computational cost. We propose MapSR, a prompt-driven framework that decouples supervision from model training. MapSR uses LR labels once to extract class prompts from frozen vision foundation model features through a lightweight linear probe, after which HR mapping proceeds via training-free metric inference and graph-based prediction refinement. Specifically, class prompts are estimated by aggregating high-confidence HR features identified by the linear probe, and HR predictions are obtained by cosine-similarity matching followed by graph-based propagation for spatial refinement. Experiments on the Chesapeake Bay dataset show that MapSR achieves 59.64% mIoU without any HR labels, remaining competitive with the strongest weakly supervised baseline and surpassing a fully supervised baseline. Notably, MapSR reduces trainable parameters by four orders of magnitude and shortens training time from hours to minutes, enabling scalable HR mapping under limited annotation and compute budgets. The code is available at https://github.com/rikirikirikiriki/MapSR.
[106] Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models
Amir El-Ghoussani,Marc Hölle,Gustavo Carneiro,Vasileios Belagiannis
Main category: cs.CV
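TL;DR: This paper proposes Masked Logit Nudging (MLN) for prompt-guided image editing in visual autoregressive models. It uses source image token maps and a semantic trajectory to guide logit updates, combined with spatial masks and quantization error correction, to deliver high-quality, high-fidelity editing and reconstruction, reaching state-of-the-art results on several benchmarks with faster inference than diffusion models.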
Details
Motivation: To solve prompt-guided image editing in visual autoregressive models: given a source image and a target text prompt, modify only the regions relevant to the prompt and leave everything else unchanged. Method: Masked Logit Nudging (1) uses the source image token maps as guidance; (2) converts the source encodings into logits via the VAR encoding and nudges the predicted logits along a semantic trajectory defined by the source and target prompts; (3) builds spatial masks from cross-attention differences between the source and edited prompts to confine the edit region; and (4) adds a refinement step that corrects quantization errors and improves reconstruction quality. Result: The method achieves the best image editing performance on the PIE benchmark at 512px and 1024px; its reconstructions outperform previous methods on COCO (512px) and OpenImages (1024px); overall it outperforms VAR-related approaches and matches or exceeds diffusion models while being much faster. Conclusion: Masked Logit Nudging is an efficient, precise image editing method for visual autoregressive models that combines edit fidelity with reconstruction quality and strikes a strong balance between speed and performance. Abstract: We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions which are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct quantization errors and improve reconstruction quality. Our approach achieves the best image editing performance on the PIE benchmark at 512px and 1024px resolutions. Beyond editing, our method delivers faithful reconstructions and outperforms previous methods on COCO at 512px and OpenImages at 1024px. Overall, our method outperforms VAR-related approaches and achieves comparable or even better performance than diffusion models, while being much faster. Code is available at 'https://github.com/AmirMaEl/MLN'.
[107] Towards Design Compositing
Abhinav Mahajan,Abhikhya Tripathy,Sudeeksha Reddy Pala,Vaibhav Methi,K J Joseph,Balaji Vasan Srinivasan
Main category: cs.CV
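TL;DR: This paper proposes GIST, a training-free, identity-preserving image compositor that improves the stylistic consistency and harmony of multi-source visual elements in graphic design; it plugs into existing design generation pipelines and significantly improves aesthetic quality.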
Details
Motivation: Existing methods assume that the input multimodal design elements (images, text, logos) are already stylistically consistent, but in practice they often come from different sources and mismatch visually, so a compositing method is needed that harmonizes style while preserving element identity. Method: GIST is a training-free, identity-preserving image compositor that sits between layout prediction and typography generation and can be plugged into existing components-to-design or design-refining pipelines. Result: GIST is validated on two substantially different existing methods, LaDeCo and Design-o-meter; as evaluated by LLaVA-OV and GPT-4V, it significantly outperforms naive pasting in visual harmony and aesthetic quality. Conclusion: Identity-preserving stylization and compositing is a key step toward truly harmonized design generation pipelines; as a lightweight, general, training-free module, GIST fills this gap and makes end-to-end design systems more practical and expressive. Abstract: Graphic design creation involves harmoniously assembling multimodal components such as images, text, logos, and other visual assets collected from diverse sources, into a visually-appealing and cohesive design. Recent methods have largely focused on layout prediction or complementary element generation, while retaining input elements exactly, implicitly assuming that provided components are already stylistically harmonious. In practice, inputs often come from disparate sources and exhibit visual mismatch, making this assumption limiting. We argue that identity-preserving stylization and compositing of input elements is a critical missing ingredient for truly harmonized components-to-design pipelines. To this end, we propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation, and can be plugged into any existing components-to-design or design-refining pipeline without modification. We demonstrate this by integrating GIST with two substantially different existing methods, LaDeCo and Design-o-meter. GIST shows significant improvements in visual harmony and aesthetic quality across both pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting. Project Page: abhinav-mahajan10.github.io/GIST/.
[108] Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening
Junfeng Li,Wenyang Zhou,Xueheng Li,Xuanhua He,Jianhou Gan,Wenqi Ren
Main category: cs.CV
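TL;DR: This paper proposes a multigrain-aware semantic prototype scanning paradigm for pan-sharpening, built on a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering; it improves performance through semantic-driven scanning, tri-token prompt learning, and an invertible Q-Shift operation.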
Details
Motivation: The bidirectional raster scanning of conventional RWKV is semantic-agnostic and prone to positional bias, and existing methods struggle to combine global semantic consistency with reconstruction of high-frequency spatial detail. Method: (1) Multigrain-aware semantic prototype scanning: locality-sensitive hashing clusters semantically related regions and builds multi-grain semantic prototypes for context-aware token reordering; (2) tri-token prompt learning: a global token, cluster-derived prototype tokens, and a learnable register token jointly guide RWKV modeling; (3) invertible Q-Shift: center difference convolution on the value pathway injects high-frequency information, and an invertible multi-scale Q-shift provides lossless feature transformation. Result: Experiments show the method outperforms existing mainstream methods on several standard datasets, with notable gains in structural fidelity and artifact suppression. Conclusion: A semantic-driven scanning strategy combined with lightweight, efficient modules improves RWKV's modeling capacity for remote sensing image fusion and suggests a new direction for generalizable pan-sharpening. Abstract: In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering. Specifically, our method contains three key components: 1) Multigrain-aware Semantic Prototype Scanning. Although RWKV offers an efficient linear-complexity alternative to Transformers, its conventional bidirectional raster scanning is still semantic-agnostic and prone to positional bias. To address this issue, we introduce a semantic-driven scanning strategy that leverages locality-sensitive hashing to group semantically related regions and construct multi-grain semantic prototypes, enabling context-aware token reordering and more coherent global interaction. 2) Tri-token Prompt Learning. We design a tri-token prompting mechanism consisting of a global token, cluster-derived prototype tokens, and a learnable register token. The global and prototype tokens provide complementary semantic priors for RWKV modeling, while the register token helps suppress noisy and artifact-prone intermediate representations. 3) Invertible Q-Shift. To counteract the loss of spatial details, we apply center difference convolution on the value pathway to inject high-frequency information, and introduce an invertible multi-scale Q-shift operation for efficient and lossless feature transformation without parameter-heavy receptive field expansion. Experimental results demonstrate the superiority of our method.
[109] Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
Haoyi Sun,Xiaoxiao Wang,Ning Mao,Qian Wang,Lifu Mu,Wen Zheng,Tao Wei,Wei Chen
Main category: cs.CV
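TL;DR: This paper proposes Switch-KD, a framework that unifies vision-language knowledge transfer within a shared text-probability space through visual-switch distillation and a dynamic bi-directional logits difference loss, significantly improving small VLMs on multimodal tasks.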
Details
Motivation: Existing knowledge distillation methods for VLMs do not explicitly model multimodal alignment, which leads to inconsistent cross-modal knowledge transfer, and large models are hard to deploy in resource-constrained settings. Method: Switch-KD comprises (1) Visual-Switch Distillation, which routes the student's visual outputs into the teacher's language pathway to build cross-modal probabilistic references, and (2) the Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving distributional structure. Result: A 0.5B TinyLLaVA distilled with Switch-KD improves by an average of 3.6 points across 10 multimodal benchmarks without any architectural modification. Conclusion: Switch-KD addresses the modality split in VLM distillation and achieves efficient, aligned cross-modal knowledge transfer, offering a new approach to deploying lightweight multimodal models. Abstract: Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
[110] CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
Inseok Jeon,Suhwan Cho,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Jungho Lee,Chaewon Park,Donghyeong Kim,Sangyoun Lee
Main category: cs.CV
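TL;DR: This paper proposes cross-modal token modulation, which strengthens the interaction between appearance and motion cues through relation transformer blocks and adds a token masking strategy to improve learning efficiency, reaching state-of-the-art performance in unsupervised video object segmentation.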
Details
Motivation: Existing two-stream architectures can fuse appearance and motion cues but struggle to model the interdependencies between them. Method: A cross-modal token modulation mechanism establishes dense connections between the tokens of the two modalities and uses relation transformer blocks for intra-modal and inter-modal information propagation; a token masking strategy improves training efficiency. Result: The method achieves state-of-the-art performance on all public benchmarks, surpassing existing methods. Conclusion: Combining cross-modal token modulation with token masking fuses appearance and motion cues more efficiently and markedly improves unsupervised video object segmentation. Abstract: Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
[111] High-Speed Full-Color HDR Imaging via Unwrapping Modulo-Encoded Spike Streams
Chu Zhou,Siqi Yang,Kailong Zhang,Heng Guo,Zhaofei Yu,Boxin Shi,Imari Sato
Main category: cs.CV
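TL;DR: This paper presents a full-color, high-speed HDR imaging system based on modulo sensors. An exposure-decoupled formulation and an iteration-free, diffusion-prior unwrapping algorithm overcome the speed, color, and hardware limits of conventional modulo sensors, delivering HDR imaging at 1000 FPS with substantially reduced bandwidth.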
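For readers unfamiliar with modulo encoding, the toy NumPy sketch below shows how a wrapped measurement discards the multiples of the sensor range and how a smooth 1D signal can be recovered from it. It uses the classic neighbour-difference (Itoh-style) unwrapping, not the diffusion-based algorithm of this paper; the function names and the range `max_value=256.0` are illustrative assumptions, and recovery only holds when the first sample is below `max_value` and adjacent samples differ by less than half the range.

```python
import numpy as np


def modulo_wrap(radiance, max_value=256.0):
    """Simulate a modulo measurement: values fold back into [0, max_value)."""
    return np.mod(radiance, max_value)


def unwrap_1d(wrapped, max_value=256.0):
    """Recover a smooth 1D signal from its wrapped version.

    Each neighbour difference is mapped to its representative of smallest
    magnitude in [-max_value / 2, max_value / 2), then the differences are
    summed back up starting from the first sample.
    """
    diffs = np.diff(wrapped)
    diffs = (diffs + max_value / 2.0) % max_value - max_value / 2.0
    return np.concatenate(([wrapped[0]], wrapped[0] + np.cumsum(diffs)))


if __name__ == "__main__":
    signal = np.linspace(0.0, 1000.0, 201)   # smooth ramp, step of 5
    wrapped = modulo_wrap(signal)            # folds three times
    recovered = unwrap_1d(wrapped)
    print(np.allclose(recovered, signal))
```

The smoothness assumption is exactly what fails at sharp edges in real images, which is one reason learned or generative priors are attractive for this problem.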
Details
Motivation: Conventional RGB HDR imaging faces a fundamental trade-off between multi-exposure capture (motion artifacts) and single-shot capture (irreversible information loss), while existing modulo sensor solutions are held back by iterative unwrapping overhead and low-speed grayscale acquisition. Method: The paper proposes an exposure-decoupled formulation of modulo imaging that supports temporally interleaved multi-frame capture, an iteration-free unwrapping algorithm that combines diffusion-based generative priors with the physical least absolute remainder property of modulo images, and a hardware prototype based on modulo-encoded spike streams. Result: The system achieves full-color HDR imaging at 1000 FPS and reduces data bandwidth from about 20 Gbps to 6 Gbps; the algorithm is efficient and physics-consistent, and the system is validated in dynamic scenes. Conclusion: By advancing the sensing model and the algorithm together, the paper delivers a high-speed, full-color, low-bandwidth modulo HDR imaging system and addresses key systemic bottlenecks to deploying modulo imaging. Abstract: Conventional RGB-based high dynamic range (HDR) imaging faces a fundamental trade-off between motion artifacts in multi-exposure captures and irreversible information loss in single-shot techniques. Modulo sensors offer a promising alternative by encoding theoretically unbounded dynamic range into wrapped measurements. However, existing modulo solutions remain bottlenecked by iterative unwrapping overhead and hardware constraints limiting them to low-speed, grayscale capture. In this work, we present a complete modulo-based HDR imaging system that enables high-speed, full-color HDR acquisition by synergistically advancing both the sensing formulation and the unwrapping algorithm. At the core of our approach is an exposure-decoupled formulation of modulo imaging that allows multiple measurements to be interleaved in time, preserving a clean, observation-wise measurement model. Building upon this, we introduce an iteration-free unwrapping algorithm that integrates diffusion-based generative priors with the physical least absolute remainder property of modulo images, supporting highly efficient, physics-consistent HDR reconstruction. Finally, to validate the practical viability of our system, we demonstrate a proof-of-concept hardware implementation based on modulo-encoded spike streams. This setup preserves the native high temporal resolution of spike cameras, achieving 1000 FPS full-color imaging while reducing output data bandwidth from approximately 20 Gbps to 6 Gbps. Extensive evaluations indicate that our coordinated approach successfully overcomes key systemic bottlenecks, demonstrating the feasibility of deploying modulo imaging in dynamic scenarios.
[112] Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification
Weiwei Zhuang,Wangze Xie,Qi Zhang,Xia Du,Zihan Lin,Zheng Lin,Hanlin Cai,Jizhe Zhou,Zihan Fang,Chi-man Pun,Wei Ni,Jun Luo
Main category: cs.CV
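TL;DR: This paper proposes FogFool, a physically plausible fog-based adversarial attack framework that models atmospheric structure with Perlin noise to generate visually realistic, robust, and highly transferable adversarial examples, posing a significant threat to the reliability of remote sensing image classification systems.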
Details
Motivation: Most existing adversarial attacks on remote sensing images rely on direct pixel-wise perturbations, ignore the atmospheric characteristics inherent to such imagery, and do not survive real-world degradations such as compression and filtering, so they lack physical plausibility and practical threat. Method: FogFool generates fog-like perturbations by iteratively optimizing atmospheric patterns based on Perlin noise; it exploits the natural irregularity, spatial coherence, and mid-to-low-frequency nature of fog to embed adversarial information into structural features shared across models. Result: On two remote sensing benchmark datasets, FogFool performs strongly in white-box attacks, reaches a black-box transfer success rate of 83.74% TASR, and is robust to preprocessing defenses such as JPEG compression and filtering; CAM analysis shows that it induces a universal shift in model attention. Conclusion: FogFool is a practical, stealthy, and highly persistent threat to remote sensing image classification systems and provides a robust benchmark for evaluating model reliability in complex environments. Abstract: Adversarial attacks pose a severe threat to the reliability of deep learning models in remote sensing (RS) image classification. Most existing methods rely on direct pixel-wise perturbations, failing to exploit the inherent atmospheric characteristics of RS imagery or survive real-world image degradations. In this paper, we propose FogFool, a physically plausible adversarial framework that generates fog-based perturbations by iteratively optimizing atmospheric patterns based on Perlin noise. By modeling fog formations with natural, irregular structures, FogFool generates adversarial examples that are not only visually consistent with authentic RS scenes but also deceptive. By leveraging the spatial coherence and mid-to-low-frequency nature of atmospheric phenomena, FogFool embeds adversarial information into structural features shared across diverse architectures. Extensive experiments on two benchmark RS datasets demonstrate that FogFool achieves superior performance: it not only excels in white-box settings but also exhibits exceptional black-box transferability (reaching 83.74% TASR) and robustness against common preprocessing-based defenses such as JPEG compression and filtering. Detailed analyses, including confusion matrices and Class Activation Map (CAM) visualizations, reveal that our atmospheric-driven perturbations induce a universal shift in model attention. These results indicate that FogFool represents a practical, stealthy, and highly persistent threat to RS classification systems, providing a robust benchmark for evaluating model reliability in complex environments.
[113] Chaotic CNN for Limited Data Image Classification
Anusree M,Akhila Henry,Pramod P Nair
Main category: cs.CV
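TL;DR: This paper proposes a chaos-based feature transformation that applies nonlinear chaotic maps (logistic, skew tent, and sine) to normalised feature vectors before the CNN classification layer, improving image classification accuracy in limited-data settings without adding model parameters.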
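The transformation summarized above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the per-sample min-max normalisation, the map parameters (`r=4.0`, `b=0.499`, `a=1.0`), the single iteration, and all function names are hypothetical choices that the digest entry does not specify.

```python
import numpy as np


def logistic_map(x, r=4.0):
    """Logistic map x -> r * x * (1 - x); maps [0, 1] into [0, 1] for r <= 4."""
    return r * x * (1.0 - x)


def skew_tent_map(x, b=0.499):
    """Skew tent map with peak at x = b (b is an assumed value)."""
    return np.where(x < b, x / b, (1.0 - x) / (1.0 - b))


def sine_map(x, a=1.0):
    """Sine map x -> a * sin(pi * x); maps [0, 1] into [0, a]."""
    return a * np.sin(np.pi * x)


def min_max_normalise(features, eps=1e-12):
    """Rescale each feature vector (row) to the unit interval."""
    lo = features.min(axis=1, keepdims=True)
    hi = features.max(axis=1, keepdims=True)
    return (features - lo) / (hi - lo + eps)


def chaotic_transform(features, chaotic_map, iterations=1):
    """Normalise CNN features, then apply a chaotic map element-wise.

    The output has the same shape as the input and adds no trainable
    parameters; it would be fed to the classification layer.
    """
    x = min_max_normalise(features)
    for _ in range(iterations):
        x = chaotic_map(x)
    return x


if __name__ == "__main__":
    penultimate = np.array([[0.0, 1.0, 2.0, 4.0], [3.0, -1.0, 0.5, 2.0]])
    for name, fn in [("logistic", logistic_map),
                     ("skew tent", skew_tent_map),
                     ("sine", sine_map)]:
        print(name, chaotic_transform(penultimate, fn).round(3))
```

All three maps send the unit interval back into itself, so the transformed features stay bounded regardless of which map is chosen.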
Details
Motivation: CNNs trained on limited data tend to overfit and produce insufficiently diverse features, so they generalise poorly. Method: Before the CNN classification layer, normalised feature vectors are passed through one of three chaotic maps (logistic, skew tent, or sine) as a nonlinear transformation that reshapes the feature space and improves class separability. Result: Consistent gains are observed in limited-data settings on MNIST (+5.43%), Fashion-MNIST (+9.11%), and CIFAR-10 (+7.47%); the gains are attributed to the shared nonlinear and dynamical properties of chaotic systems. Conclusion: The chaotic feature transformation is computationally efficient, adds no parameters, and is plug-and-play, making it a practical enhancement for data-scarce image classification. Abstract: Convolutional neural networks (CNNs) often exhibit poor generalisation in limited training data scenarios due to overfitting and insufficient feature diversity. In this work, a simple and effective chaos-based feature transformation is proposed to enhance CNN performance without increasing model complexity. The method applies nonlinear transformations using logistic, skew tent, and sine maps to normalised feature vectors before the classification layer, thereby reshaping the feature space and improving class separability. The approach is evaluated on greyscale datasets (MNIST and Fashion-MNIST) and an RGB dataset (CIFAR-10) using CNN architectures of varying depth under limited data conditions. The results show consistent improvement over the standalone (SA) CNN across all datasets. Notably, a maximum performance gain of 5.43% is achieved on MNIST using the skew tent map with a 3-layer CNN at 40 samples per class. A higher gain of 9.11% is observed on Fashion-MNIST using the sine map with a 3-layer CNN at 50 samples per class. Additionally, a strong gain of 7.47% is obtained on CIFAR-10 using the skew tent map at 200 samples per class. The consistent improvements across different chaotic maps indicate that the performance gain is driven by the shared nonlinear and dynamical properties of chaotic systems. The proposed method is computationally efficient, requires no additional trainable parameters, and can be easily integrated into existing CNN architectures, making it a practical solution for data-scarce image classification tasks.
[114] Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting
Inseok Jeon,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Suhwan Cho,Sangyoun Lee
Main category: cs.CV
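TL;DR: This paper proposes Seen-to-Scene, a framework that unifies propagation-based and generation-based paradigms and improves the spatio-temporal consistency and efficiency of video outpainting through flow completion and reference-guided latent propagation.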
Details
Motivation: Existing methods based on generative models (such as diffusion models) suffer from implicit temporal modeling and limited spatial context in video outpainting, causing intra-frame and inter-frame inconsistencies that are most visible in dynamic scenes and large outpainting regions. Method: Seen-to-Scene (1) takes a flow completion network pre-trained for video inpainting and fine-tunes it end-to-end to bridge the domain gap and reconstruct coherent motion fields, and (2) introduces reference-guided latent propagation to carry source content across frames efficiently. Result: In extensive experiments, the method surpasses prior state-of-the-art methods in temporal coherence and visual realism, with efficient inference and no input-specific adaptation. Conclusion: Unifying propagation and generation overcomes spatio-temporal inconsistency in video outpainting, and Seen-to-Scene offers a new approach to efficient, high-quality video outpainting. Abstract: Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.
[115] DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
Bo Qian,Dahu Shi,Xing Wei
Main category: cs.CV
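TL;DR: This paper proposes DETR-ViP, a framework that improves the discriminability and robustness of visual prompts in open-vocabulary object detection through global prompt integration, visual-textual prompt relation distillation, and a selective fusion strategy.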
Details
Motivation: Visual-prompted object detection currently underperforms, mainly because visual prompts lack global discriminability; the direction has also long been overlooked and is usually treated as a byproduct of training text-prompted detectors. Method: On top of basic image-text contrastive learning, DETR-ViP adds global prompt integration and visual-textual prompt relation distillation, and uses a selective fusion strategy. Result: On COCO, LVIS, ODinW, and Roboflow100, DETR-ViP substantially outperforms existing state-of-the-art methods in visual prompt detection; ablation studies confirm the contribution of each module. Conclusion: Improving the global discriminability of visual prompts is key; DETR-ViP provides a more robust, scalable solution for visual-prompted detection and helps the direction develop in its own right. Abstract: Visual prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.
[116] Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
Zhixuan Wu,Quanxing Zha,Teng Wang,Genbao Xu,Wenyuan Gu,Wei Rao,Nan Ma,Bo Cheng,Soujanya Poria
Main category: cs.CV
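TL;DR: This paper proposes Chain-of-Glimpse, a framework that improves the modeling of temporal object changes in video understanding through search-guided, progressive visual object grounding and multi-step reasoning.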
Details
Motivation: Existing video understanding methods are mostly object-agnostic; they struggle with the substantial changes objects undergo over time and lack explicit modeling and spatial localization of key visual objects. Method: Chain-of-Glimpse is a search-guided, object-anchored progressive reasoning framework. It includes a search controller optimized with reinforcement learning, where a format reward encourages grounding capability, and it incrementally builds spatially grounded traces around task-relevant objects. Result: Consistent gains on several video reasoning benchmarks, including NExTQA (in-domain) and Video-Holmes, CG-Bench Reasoning, and VRBench (out-of-domain), indicating robustness and generalization. Conclusion: Through explicit spatial grounding and multi-step compositional reasoning, Chain-of-Glimpse mitigates over-reliance on saliency cues and provides a more accurate and interpretable reasoning paradigm for video understanding. Abstract: Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both in-domain NExTQA and out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness and generalization of Chain-of-Glimpse across diverse video reasoning tasks.
[117] The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment
Songlin Li,Zhiqing Guo,Dan Ma,Changtao Miao,Gaobo Yang
Main category: cs.CV
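TL;DR: This paper proposes a courtroom-style adjudication framework (a prosecution stream, a defense stream, and a judge model) that casts image manipulation localization (IML) as a confrontation between evidence of manipulation and evidence of authenticity, followed by a judgment. Using dual-hypothesis segmentation, edge-prior-guided multi-level fusion with dynamic debate refinement, and a reinforcement-learning judge model that refines uncertain regions, it improves localization robustness and accuracy under weak traces and noise.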
Details
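Motivation: Existing IML methods do introduce authenticity supervision, but only as an auxiliary training signal; they do not explicitly model the opposition between authentic and manipulated evidence, so they cannot reliably resolve ambiguous regions when traces are faint or degraded by post-processing and noise. Method: A courtroom-style adjudication framework: 1) a dual-hypothesis segmentation architecture on a shared multi-scale encoder (the prosecution stream predicts manipulation, the defense stream predicts authenticity); 2) edge priors that guide cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement to produce the two kinds of evidence; 3) a reinforcement-learning judge model trained with advantage-based rewards and a soft-IoU objective, with reliability calibrated via entropy and cross-hypothesis consistency, which performs strategic re-inference and refinement on uncertain regions. Result: Higher average performance than current SOTA IML methods on several standard datasets, with more robust and accurate localization under weak manipulation traces, post-processing, and noise. Conclusion: Formalizing IML as evidence confrontation followed by judgment is an effective modeling paradigm; explicitly modeling the authentic and manipulated hypotheses and their interaction improves localization reliability and generalization under complex real-world conditions. Abstract: Although some existing image manipulation localization (IML) methods incorporate authenticity-related supervision, this information is typically utilized merely as an auxiliary training signal to enhance the model's sensitivity to manipulation artifacts, rather than being explicitly modeled as localization evidence opposing the manipulated regions. Consequently, when manipulation traces are subtle or degraded by post-processing and noise, these methods struggle to explicitly compare manipulated and authentic evidence, resulting in unreliable predictions in ambiguous areas. To address these issues, we propose a courtroom-style adjudication framework that regards the IML task as the confrontation of evidence followed by judgment. The framework comprises a prosecution stream, a defense stream, and a judge model. We first build a dual-hypothesis segmentation architecture on a shared multi-scale encoder, in which the prosecution stream asserts manipulation and the defense stream asserts authenticity. Guided by edge priors, it produces evidence for manipulated and authentic regions through cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. We further develop a reinforcement learning judge model that performs strategic re-inference and refinement on uncertain regions, yielding a manipulated-region mask. The judge model is trained with advantage-based rewards and a soft-IoU objective, and reliability is calibrated via entropy and cross-hypothesis consistency.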
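As an illustration of two ingredients named in the abstract, the sketch below shows a common form of the soft-IoU objective and one possible way to flag uncertain pixels from entropy and cross-hypothesis consistency. The abstract does not give the exact formulas, so the function names, the thresholds, and the consistency test (the prosecution and defense probabilities should sum to roughly 1) are assumptions made only for this example.

```python
import math

def soft_iou_loss(probs, targets, eps=1e-6):
    """1 - soft IoU between predicted probabilities and a binary mask (flat lists)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(p + t - p * t for p, t in zip(probs, targets))
    return 1.0 - (inter + eps) / (union + eps)

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def uncertain_pixels(p_manip, p_auth, entropy_thresh=0.5, consistency_thresh=0.3):
    """Flag pixels that a judge stage might revisit (assumed criterion).

    p_manip: per-pixel manipulation probability from the prosecution stream.
    p_auth:  per-pixel authenticity probability from the defense stream.
    A pixel is flagged when the prosecution prediction has high entropy, or
    when the two hypotheses are inconsistent (p_manip + p_auth far from 1).
    """
    flags = []
    for pm, pa in zip(p_manip, p_auth):
        inconsistency = abs(pm + pa - 1.0)
        flags.append(binary_entropy(pm) > entropy_thresh
                     or inconsistency > consistency_thresh)
    return flags
```

With these assumed thresholds, a confident and consistent pixel (0.99 vs. 0.01) is left alone, while a pixel at 0.5 (high entropy) or one where both streams claim 0.9 (inconsistent) would be passed on for re-inference.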
Experimental results show that our model achieves superior average performance compared with SOTA IML methods.
[118] NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation
Yi He,Tao Wang,Yi Jin,Congyan Lang,Yidong Li,Haibin Ling
Main category: cs.CV
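TL;DR: This paper proposes NG-GS, a framework that improves segmentation quality at object boundaries in 3D Gaussian Splatting by identifying ambiguous Gaussians, applying RBF interpolation and multi-resolution hash encoding, and jointly optimizing with a lightweight NeRF module.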
Details
Motivation: 3D Gaussian Splatting (3DGS) is efficient and photorealistic for novel view synthesis, but its discrete Gaussian representation causes aliasing and artifacts at object boundaries, which makes precise segmentation difficult. Method: 1) use mask variance analysis to automatically identify ambiguous Gaussians at boundaries; 2) build a spatially continuous feature field with radial basis function (RBF) interpolation, enhanced by multi-resolution hash encoding for multi-scale representation; 3) jointly optimize with a lightweight NeRF module, applying alignment and spatial continuity losses to keep boundaries smooth and consistent. Result: State-of-the-art results on NVOS, LERF-OVS, and ScanNet, with significant gains in boundary mIoU. Conclusion: NG-GS addresses the boundary segmentation problem caused by the discrete representation in 3DGS and achieves high-quality, accurate 3D object segmentation. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS, LERF-OVS, and ScanNet benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU. Code is available at https://github.com/BJTU-KD3D/NG-GS.
[119] G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
Jiyoung Lim,Heejae Yang,Jee-Hyong Lee
Main category: cs.CV
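TL;DR: This paper proposes G-MIXER, which expands implicit semantics through geodesic mixup and re-ranks candidates with explicit semantics from a multimodal large language model, yielding training-free zero-shot composed image retrieval (ZS-CIR) that reaches SOTA in both diversity and accuracy.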
Details
Motivation: Existing zero-shot CIR methods rely too heavily on the textual modality and cannot model the candidate diversity that fuzzy retrieval requires, which reduces both the diversity and the accuracy of retrieval results. Method: G-MIXER: 1) apply geodesic mixup at multiple ratios between the features of the reference image-text pair to build composed query features that reflect implicit semantics and to generate a diverse candidate set; 2) re-rank the candidates using explicit semantics extracted by an MLLM. No training is required at any stage. Result: SOTA performance on multiple ZS-CIR benchmarks, with clear gains in retrieval diversity and accuracy and no additional training. Conclusion: G-MIXER models implicit and explicit semantics together and provides an efficient, training-free approach to zero-shot composed image retrieval. Abstract: Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at https://github.com/maya0395/gmixer.
[120] MS-SSE-Net: A Multi-Scale Spatial Squeeze-and-Excitation Network for Structural Damage Detection in Civil and Geotechnical Engineering
Saif ur Rehman Khan,Imad Ahmed Waqar,Arooj Zaib,Saad Ahmed,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim
Main category: cs.CV
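TL;DR: This paper proposes MS-SSE-Net, a deep learning framework for structural damage classification that combines multi-scale feature extraction with channel and spatial attention and outperforms baselines such as DenseNet201 on the StructDamage dataset.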
Details
Motivation: Accurately identifying different types of structural damage in images is challenging, mainly because damage patterns and environmental conditions vary widely. Method: Built on a DenseNet201 backbone, the model adds multi-scale feature extraction, depthwise separable convolutions, squeeze-and-excitation style channel attention, and spatial attention, followed by global average pooling and a fully connected classification layer. Result: On the StructDamage dataset it reaches 99.31% precision, 99.25% recall, 99.27% F1-score, and 99.26% accuracy, outperforming the DenseNet201 baseline. Conclusion: MS-SSE-Net improves the accuracy and robustness of structural damage image classification and supports the value of combining multi-scale features with dual attention. Abstract: Structural damage detection is essential for maintaining the safety and reliability of civil infrastructure. However, accurately identifying different types of structural damage from images remains challenging due to variations in damage patterns and environmental conditions. To address these challenges, this paper proposes MS-SSE-Net, a novel deep learning (DL) framework for structural damage classification. The proposed model is built upon the DenseNet201 backbone and integrates novel multi-scale feature extraction with channel and spatial attention mechanisms (MS-SSE-Net). Specifically, parallel depthwise convolutions capture both local and contextual features, while squeeze-and-excitation style channel attention and spatial attention emphasize informative regions and suppress irrelevant noise. The refined features are then processed through global average pooling and a fully connected classification layer to generate the final predictions. Experiments are conducted on the StructDamage dataset containing multiple structural damage categories. The proposed MS-SSE-Net demonstrates superior performance compared with the baseline DenseNet201 and other comparative approaches. Specifically, the proposed method achieves 99.31% precision, 99.25% recall, 99.27% F1-score, and 99.26% accuracy, outperforming the baseline model which achieved 98.62% precision, 98.53% recall, 98.58% F1-score, and 98.53% accuracy.
[121] Data Synthesis Improves 3D Myotube Instance Segmentation
David Exler,Nils Friederich,Martin Krüger,John Jbeily,Mario Vitacolonna,Rüdiger Rudolf,Ralf Mikut,Markus Reischl
Main category: cs.CV
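TL;DR: This paper proposes a geometry-based synthetic data generation method to address the scarcity of annotated real data for 3D myotube instance segmentation; combining a biophysically inspired synthesis strategy with a self-supervised pretrained 3D U-Net, it achieves leading performance on real data while training only on synthetic data.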
Details
Motivation: Existing pretrained biomedical segmentation models do not generalize to myotubes because no large annotated myotube dataset exists; a method is needed that achieves accurate 3D instance segmentation without real annotations. Method: A geometry-driven synthesis pipeline that models, from microscopy observations, myotube centerlines (polynomials), locally varying radii, branching structures, and ellipsoidal end caps; synthetic volumes are rendered with realistic noise, optical artifacts, and CycleGAN-based domain adaptation; a compact 3D U-Net with self-supervised encoder pretraining is trained on synthetic data only. Result: A mean IPQ of 0.22 on real data, significantly better than three zero-shot segmentation models. Conclusion: Biophysics-driven synthetic data can compensate for missing annotations and offers a route to few-shot and zero-shot biomedical image segmentation. Abstract: Myotubes are multinucleated muscle fibers serving as key model systems for studying muscle physiology, disease mechanisms, and drug responses. Mechanistic studies and drug screening thereby rely on quantitative morphological readouts such as diameter, length, and branching degree, which in turn require precise three-dimensional instance segmentation. Yet established pretrained biomedical segmentation models fail to generalize to this domain due to the absence of large annotated myotube datasets. We introduce a geometry-driven synthesis pipeline that models individual myotubes via polynomial centerlines, locally varying radii, branching structures, and ellipsoidal end caps derived from real microscopy observations. Synthetic volumes are rendered with realistic noise, optical artifacts, and CycleGAN-based Domain Adaptation (DA). A compact 3D U-Net with self-supervised encoder pretraining, trained exclusively on synthetic data, achieves a mean IPQ of 0.22 on real data, significantly outperforming three established zero-shot segmentation models, demonstrating that biophysics-driven synthesis enables effective instance segmentation in annotation-scarce biomedical domains.
[122] HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
Badri N. Patro,Vijay S. Agneeswaran
Main category: cs.CV
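TL;DR: HAMSA is a scanning-free vision state space model that operates directly in the spectral domain. With a simplified kernel parameterization, SpectralPulseNet (SPN), and a Spectral Adaptive Gating Unit (SAGU), it reaches 85.7% top-1 accuracy on ImageNet with faster inference and lower memory and energy use.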
Details
Motivation: Existing vision SSMs (such as Vim, VMamba, and SiMBA) depend on complex scanning strategies to process 2D images, which adds computational overhead and architectural complexity. Method: HAMSA uses FFT-based convolution to avoid scanning; replaces the traditional (A, B, C) matrices with a single Gaussian-initialized complex kernel; introduces SpectralPulseNet (SPN), an input-dependent frequency gating mechanism; and adds the magnitude-based Spectral Adaptive Gating Unit (SAGU) to keep gradients stable in the frequency domain. Result: 85.7% top-1 accuracy on ImageNet-1K (SOTA among SSMs); inference 2.2x faster than DeiT-S (4.2ms vs 9.2ms) and 1.4-1.9x faster than scanning-based SSMs; lower memory (2.1GB vs 3.2-4.5GB) and energy use (12.5J vs 18-25J); strong generalization to transfer learning and dense prediction tasks. Conclusion: By modeling in the frequency domain and simplifying the architecture, HAMSA improves efficiency and stability while keeping high accuracy, offering a new design for vision SSMs. Abstract: Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization: a single Gaussian-initialized complex kernel replacing traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN): an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU): magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2x faster inference than transformers (4.2ms vs 9.2ms for DeiT-S) and 1.4-1.9x speedup over scanning-based SSMs, while using less memory (2.1GB vs 3.2-4.5GB) and energy (12.5J vs 18-25J). HAMSA demonstrates strong generalization across transfer learning and dense prediction tasks.
[123] Find the Differences: Differential Morphing Attack Detection vs Face Recognition
Una M. Kelly,Luuk J. Spreeuwers,Raymond N. J. Veldhuis
Main category: cs.CV
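TL;DR: This paper examines the vulnerability of face recognition (FR) systems to morphing attacks, argues that FR and differential morphing attack detection (D-MAD) are essentially similar tasks, and proposes using existing FR systems for morphing detection together with a new evaluation threshold that bounds the vulnerability to morphing attacks, including unknown types.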
Details
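Motivation: Existing FR systems are vulnerable to morphing attacks, and current decision thresholds force a tradeoff between recognition performance and attack resistance; an approach that serves both is needed. Method: Compare an FR system with two existing D-MAD methods to analyze how similar the tasks are; analyze why current thresholds make FR vulnerable to morphing attacks; propose morphing detection based on off-the-shelf FR systems together with a new evaluation threshold. Result: The comparison supports the strong similarity between FR and D-MAD; threshold selection is identified as the underlying cause of FR vulnerability to morphing; the new threshold places an upper limit on the vulnerability to morphing attacks, even of unknown types. Conclusion: Existing FR systems can be used for morphing detection without deploying a separate dedicated D-MAD model, and the new evaluation threshold gives a guaranteed bound for practical deployment. Abstract: Morphing is a challenge to face recognition (FR) for which several morphing attack detection solutions have been proposed. We argue that face recognition and differential morphing attack detection (D-MAD) in principle perform very similar tasks, which we support by comparing an FR system with two existing D-MAD approaches. We also show that currently used decision thresholds inherently lead to FR systems being vulnerable to morphing attacks and that this explains the tradeoff between performance on normal images and vulnerability to morphing attacks. We propose using FR systems that are already in place for morphing detection and introduce a new evaluation threshold that guarantees an upper limit to the vulnerability to morphing attacks - even of unknown types.
[124] Efficient closed-form approaches for pose estimation using Sylvester forms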
Jana Vráblíková,Ezio Malis,Laurent Busé
Main category: cs.CV
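TL;DR: This paper proposes a new class of resultant-based solvers built on Sylvester forms for non-linear least-squares pose estimation, reducing computational cost while preserving accuracy.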
Details
Motivation: Non-linear least-squares pose estimation (rotation and translation) is time-consuming yet essential in real-time computer vision; existing resultant-based solvers have made progress but leave room for better computational efficiency. Method: A new class of resultant-based solvers that exploit Sylvester forms; the pose estimation problem is reduced to a system of polynomial equations and solved in closed form; the approach applies to both 3D-3D and 3D-2D point correspondences. Result: Numerical accuracy comparable to state-of-the-art solvers, with shorter computation time. Conclusion: Resultant-based solvers using Sylvester forms are an effective way to lower the computational complexity of pose estimation while keeping accuracy, which makes them practical. Abstract: Solving the non-linear least-squares problem for pose estimation (rotation and translation) is often a time-consuming yet fundamental problem in several real-time computer vision applications. With an adequate rotation parametrization, the optimization problem can be reduced to the solution of a system of polynomial equations and solved in closed form. Recent advances in efficient closed-form solvers utilizing resultant matrices have shown a promising research direction to decrease the computation time while preserving the estimation accuracy. In this paper, we propose a new class of resultant-based solvers that exploit Sylvester forms to further reduce the complexity of the resolution. We demonstrate that our proposed methods are numerically as accurate as the state-of-the-art solvers, and outperform them in terms of computational time. We show that this approach can be applied for pose estimation in two different types of problems: estimating a pose from 3D to 3D correspondences, and estimating a pose from 3D point to 2D point correspondences.
[125] ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation
Yanguang Sun,Hengmin Zhang,Jianjun Qian,Jian Yang,Lei Luo
Main category: cs.CV
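TL;DR: This paper proposes ASGNet, an adaptive spectrum guidance network that combines spectral features with global attributes to address incomplete polyp structures caused by locally biased perception in colonoscopy image segmentation. With a spectrum-guided non-local perception module, a multi-source semantic extractor, and a dense cross-layer interaction decoder, it improves segmentation accuracy and outperforms 21 SOTA methods on five benchmarks.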
Details
Motivation: Existing deep learning methods for polyp segmentation are limited by strong correlations between neighboring pixels in the spatial domain, which biases perception toward local regions, makes it hard to capture complete polyp structures, and degrades segmentation. Method: ASGNet: 1) a spectrum-guided non-local perception module that jointly aggregates local and global information; 2) a multi-source semantic extractor that assists preliminary polyp localization; 3) a dense cross-layer interaction decoder that fuses and strengthens multi-layer features into high-quality segmentation representations. Result: Quantitative and qualitative results on five widely used polyp segmentation benchmarks are better than those of 21 state-of-the-art methods. Conclusion: By adding spectral-domain modeling and global perception, ASGNet reduces the local bias and improves the completeness and boundary accuracy of polyp segmentation, suggesting a direction for medical image segmentation more broadly. Abstract: Early identification and removal of polyps can reduce the risk of developing colorectal cancer. However, the diverse morphologies, complex backgrounds and often concealed nature of polyps make polyp segmentation in colonoscopy images highly challenging. Despite the promising performance of existing deep learning-based polyp segmentation methods, their perceptual capabilities remain biased toward local regions, mainly because of the strong spatial correlations between neighboring pixels in the spatial domain. This limitation makes it difficult to capture the complete polyp structures, ultimately leading to sub-optimal segmentation results. In this paper, we propose a novel adaptive spectrum guidance network, called ASGNet, which addresses the limitations of spatial perception by integrating spectral features with global attributes. Specifically, we first design a spectrum-guided non-local perception module that jointly aggregates local and global information, thereby enhancing the discriminability of polyp structures and refining their boundaries. Moreover, we introduce a multi-source semantic extractor that integrates rich high-level semantic information to assist in the preliminary localization of polyps. Furthermore, we construct a dense cross-layer interaction decoder that effectively integrates diverse information from different layers and strengthens it to generate high-quality representations for accurate polyp segmentation. Extensive quantitative and qualitative results demonstrate the superiority of our ASGNet approach over 21 state-of-the-art methods across five widely-used polyp segmentation benchmarks.
The code will be publicly available at: https://github.com/CSYSI/ASGNet.
[126] OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism
Jordan Shipard,Arnold Wiliem,Kien Nguyen Thanh,Wei Xiang,Clinton Fookes
Main category: cs.CV
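TL;DR: This paper proposes OmniGCD, a modality-agnostic method for Generalized Category Discovery (GCD) that uses modality-specific encoders and a synthetically trained Transformer model to perform zero-shot category discovery across 4 modalities and 16 datasets, improving classification accuracy for known and novel classes.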
Details
Motivation: Existing GCD methods are confined to a single modality and need dataset-specific fine-tuning, whereas humans form categories by abstracting across modalities; the goal is a more general, fine-tuning-free, brain-inspired modality-agnostic GCD framework. Method: OmniGCD extracts features with modality-specific encoders, applies dimension reduction to build a GCD latent space, and at test time uses a Transformer model trained on synthetic data to transform the representation into one better suited for clustering; it also introduces a zero-shot GCD evaluation setting in which the model is trained once on synthetic data. Result: OmniGCD performs zero-shot GCD on 16 datasets across four modalities, improving classification accuracy for known and novel classes over baselines by an average of +6.2, +17.9, +1.5, and +12.7 percentage points for vision, text, audio, and remote sensing, respectively. Conclusion: Strong encoders matter, and category discovery should be decoupled from representation learning; OmniGCD sets a benchmark for modality-agnostic GCD and supports scalable, human-inspired category discovery. Abstract: Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our $\textbf{OmniGCD}$ leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a $\textbf{GCD latent space}$, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a $\textbf{zero-shot GCD setting}$ where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. $\textbf{Trained once on synthetic data}$, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of $\textbf{+6.2}$, $\textbf{+17.9}$, $\textbf{+1.5}$ and $\textbf{+12.7}$ for vision, text, audio and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD.
Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available $\href{https://github.com/Jordan-HS/OmniGCD}{here}$.
[127] AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
Peifeng Zhang,Zice Qiu,Donghua Yu,Shilei Cao,Juepeng Zheng,Yutong Lu,Haohuan Fu
Main category: cs.CV
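TL;DR: This paper proposes Asymmetric Information Masking (AIM) to address catastrophic forgetting caused by structural asymmetry when vision-language models (VLMs) are trained on continual visual question answering (VQA); modality-sensitive targeted masking balances stability and plasticity and improves performance and generalization.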
Details
Motivation: Existing continual learning methods mainly target symmetric, unimodal architectures, whereas modern vision-language models (VLMs) are inherently asymmetric; this makes them prone to catastrophic forgetting in continual learning, and the visual projection layers in particular are easily disturbed, which harms compositional reasoning. Method: Asymmetric Information Masking (AIM) applies targeted masks based on modality-specific sensitivity to balance stability and plasticity and to reduce the optimization bias introduced by the asymmetric structure. Result: Under continual VQA settings on VQA v2 and GQA, AIM reaches SOTA in both Average Performance (AP) and Average Forgetting (AF) and better preserves generalization to novel skill-concept compositions. Conclusion: AIM mitigates catastrophic forgetting of VLMs in continual VQA and supports the case for continual learning strategies designed specifically for asymmetric architectures. Abstract: In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.
[128] Integrating Object Detection, LiDAR-Enhanced Depth Estimation, and Segmentation Models for Railway Environments
Enrico Francesco Giannico,Federico Nesti,Gianluca D'Amico,Mauro Marinoni,Edoardo Carosio,Filippo Salotti,Salvatore Sabina,Giorgio Buttazzo
Main category: cs.CV
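TL;DR: This paper proposes a modular, flexible framework for railway obstacle detection and distance estimation that combines object detection, track segmentation, and monocular depth estimation with LiDAR point clouds, reaching a mean absolute distance error of 0.63 meters on the synthetic dataset SynDRA.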
Details
Motivation: Obstacle detection in railway environments is critical for safety, but existing methods mostly address only detection or track identification and lack a complete, modular system that both detects obstacles and estimates their distance; in addition, real scenes lack reliable ground truth, which makes evaluation difficult. Method: A modular framework of three neural networks, for object detection, track segmentation, and monocular depth estimation, in which monocular depth maps are fused with LiDAR point clouds to improve distance estimates; quantitative evaluation uses the synthetic dataset SynDRA. Result: On SynDRA, the mean absolute error (MAE) of obstacle distance estimation is as low as 0.63 meters, together with improved spatial perception of the scene. Conclusion: The framework is modular, flexible, and accurate; by combining multi-task neural networks with multi-modal (monocular plus LiDAR) fusion, it addresses railway obstacle detection and distance estimation and provides a reproducible evaluation basis for later work. Abstract: Obstacle detection in railway environments is crucial for ensuring safety. However, very few studies address the problem using a complete, modular, and flexible system that can both detect objects in the scene and estimate their distance from the vehicle. Most works focus solely on detection, others attempt to identify the track, and only a few estimate obstacle distances. Additionally, evaluating these systems is challenging due to the lack of ground truth data. In this paper, we propose a modular and flexible framework that identifies the rail track, detects potential obstacles, and estimates their distance by integrating three neural networks for object detection, track segmentation, and monocular depth estimation with LiDAR point clouds. To enable a reliable and quantitative evaluation, the proposed framework is assessed using a synthetic dataset (SynDRA), which provides accurate ground truth annotations, allowing for direct performance comparison with existing methods. The proposed system achieves a mean absolute error (MAE) as low as 0.63 meters by integrating monocular depth maps with LiDAR, enabling not only accurate distance estimates but also spatial perception of the scene.
[129] One-shot Compositional 3D Head Avatars with Deformable Hair
Yuan Sun,Xuan Wang,WeiLi Zhang,Wenxuan Zhang,Yu Guo,Fei Wang
Main category: cs.CV
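TL;DR: This paper proposes a compositional method for building a complete 3D head avatar from a single frontal portrait. Hair and face are modeled as explicitly decoupled components: the face deforms through rigging to a FLAME mesh, and the hair is simulated with a cage structure and Position-Based Dynamics (PBD), all under a 3D Gaussian Splatting (3DGS) representation that gives high-fidelity texture reconstruction and natural animation.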
Details
Motivation: Existing single-image 3D avatar methods often entangle hair and face geometry, which makes hair dynamics unrealistic during animation; generalized models also tend to lose the high-frequency texture detail of the input image. Method: 1) remove the hair from the input image to obtain a bald image; 2) lift both the original and the bald image to detail-rich 3D Gaussian Splatting (3DGS) representations; 3) rig the bald 3DGS to a FLAME mesh via non-rigid registration to support natural facial animation; 4) extract a clean, isolated set of hair Gaussians using semantic label supervision and a boundary-aware reassignment strategy; 5) add a drivable cage structure for the hair that supports Position-Based Dynamics (PBD), simulating realistic deformation under gravity, inertia, and head motion. Result: In dynamic animations under varied head motions, expressions, and gravity conditions, hair behavior is markedly more realistic while facial detail is faithfully preserved; qualitative results surpass current state-of-the-art one-shot methods. Conclusion: Explicitly decoupled modeling and physics-driven hair deformation, combined with the fine-grained texture reconstruction of 3DGS, address the two main bottlenecks of single-image 3D avatar generation: hair distortion and texture loss. Abstract: We propose a compositional method for constructing a complete 3D head avatar from a single image. Prior one-shot holistic approaches frequently fail to produce realistic hair dynamics during animation, largely due to inadequate decoupling of hair from the facial region, resulting in entangled geometry and unnatural deformations. Our method explicitly decouples hair from the face, modeling these components using distinct deformation paradigms while integrating them into a unified rendering pipeline. Furthermore, by leveraging image-to-3D lifting techniques, we preserve fine-grained textures from the input image to the greatest extent possible, effectively mitigating the common issue of high-frequency information loss in generalized models. Specifically, given a frontal portrait image, we first perform hair removal to obtain a bald image. Both the original image and the bald image are then lifted to dense, detail-rich 3D Gaussian Splatting (3DGS) representations. For the bald 3DGS, we rig it to a FLAME mesh via non-rigid registration with a prior model, enabling natural deformation that follows the mesh triangles during animation. For the hair component, we employ semantic label supervision combined with a boundary-aware reassignment strategy to extract a clean and isolated set of hair Gaussians. To control hair deformation, we introduce a cage structure that supports Position-Based Dynamics (PBD) simulation, allowing realistic and physically plausible transformations of the hair Gaussian primitives under head motion, gravity, and inertial effects.
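For readers unfamiliar with Position-Based Dynamics, the following is a minimal sketch of one generic PBD time step with distance constraints under gravity. It is not the authors' cage simulation: the particle layout, the function name `pbd_step`, and all parameters are assumptions for illustration, and the binding of hair Gaussians to cage vertices is not shown. Particles with zero inverse mass stand in for cage vertices that follow the head rigidly.

```python
import math

def pbd_step(pos, vel, inv_mass, edges, rest_len,
             dt=1.0 / 60.0, gravity=(0.0, -9.81, 0.0), iters=10):
    """Advance a particle system by one PBD step.

    pos, vel:  lists of [x, y, z] per particle.
    inv_mass:  inverse mass per particle; 0.0 pins a particle in place.
    edges:     list of (i, j) index pairs joined by a distance constraint.
    rest_len:  rest length for each edge.
    Returns (new_pos, new_vel).
    """
    # 1) Predict positions from velocities and external forces (gravity).
    pred = []
    for p, v, w in zip(pos, vel, inv_mass):
        if w == 0.0:
            pred.append(list(p))  # pinned particle: no free motion
        else:
            pred.append([p[k] + dt * (v[k] + dt * gravity[k]) for k in range(3)])

    # 2) Iteratively project the predictions onto the distance constraints.
    for _ in range(iters):
        for (i, j), d0 in zip(edges, rest_len):
            wi, wj = inv_mass[i], inv_mass[j]
            if wi + wj == 0.0:
                continue
            delta = [pred[i][k] - pred[j][k] for k in range(3)]
            dist = math.sqrt(sum(c * c for c in delta))
            if dist < 1e-12:
                continue
            corr = (dist - d0) / (dist * (wi + wj))
            for k in range(3):
                pred[i][k] -= wi * corr * delta[k]
                pred[j][k] += wj * corr * delta[k]

    # 3) Derive velocities from the corrected positions.
    new_vel = [[(pred[i][k] - pos[i][k]) / dt for k in range(3)]
               for i in range(len(pos))]
    return pred, new_vel
```

Because velocities are derived from the corrected positions, inertia and gravity enter through the prediction step while the constraints keep the structure from stretching, which is the property that makes PBD attractive for stable, controllable deformation.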
Striking qualitative results, including dynamic animations under diverse head motions, gravity effects, and expressions, showcase substantially more realistic hair behavior alongside faithfully preserved facial details, outperforming state-of-the-art one-shot methods in perceptual realism.
[130] From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation
Yili Ren,Shiqi Wen,Li Hou,Dingwen Xiao,Weiming Zhang,Caleb Chen Cao,Lin Wang,Zilu Zheng,Qianxiao Su,Mingjun Zhao,Lei Chen
Main category: cs.CV
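TL;DR: This paper proposes Petro-SAM, a two-stage multi-task framework that combines a Merge Block, multi-scale feature fusion, and color-entropy priors to handle the domain gap and fine-grained boundary alignment in joint grain-edge and lithology segmentation of polarized petrographic images.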
Details
Motivation: Grain-edge segmentation (GES) and lithology semantic segmentation (LSS) are usually handled separately, with unsatisfactory results; foundation models such as SAM align boundaries well, but transferring them directly to petrographic images faces a severe domain gap (extinction-dependent color variation, ultra-fine grain boundaries) and the lack of a joint learning module for multi-angle petrographic image stacks. Method: Petro-SAM, a two-stage multi-task framework: 1) on top of SAM, a Merge Block integrates seven polarized views to mitigate the extinction effect; 2) multi-scale feature fusion and color-entropy priors refine the detection. Result: High-quality joint grain-edge and lithology segmentation on petrographic images, with improved segmentation despite the large domain gap and extremely fine boundaries. Conclusion: Petro-SAM bridges foundation models and the practical needs of petrographic image analysis, offering a scalable and robust approach to automated interpretation of geological images. Abstract: Grain-edge segmentation (GES) and lithology semantic segmentation (LSS) are two pivotal tasks for quantifying rock fabric and composition. However, these two tasks are often treated separately, and the segmentation quality remains unsatisfactory even though expensive, time-consuming, expert-annotated datasets have been used. Recently, foundation models, especially the Segment Anything Model (SAM), have demonstrated impressive robustness for boundary alignment. However, directly adapting SAM to joint GES and LSS is nontrivial due to 1) the severe domain gap induced by extinction-dependent color variations and ultra-fine grain boundaries, and 2) the lack of modules for joint learning on multi-angle petrographic image stacks. In this paper, we propose Petro-SAM, a novel two-stage, multi-task framework that can achieve high-quality joint GES and LSS on petrographic images. Specifically, based on SAM, we introduce a Merge Block to integrate seven polarized views, effectively solving the extinction issue. Moreover, we introduce multi-scale feature fusion and color-entropy priors to refine the detection.
[131] NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results
Andrey Moskalenko,Alexey Bryncev,Ivan Kosmynin,Kira Shilovskaya,Mikhail Erofeev,Dmitry Vatolin,Radu Timofte,Kun Wang,Yupeng Hu,Zhiran Li,Hao Liu,Qianlong Xiang,Liqiang Nie,Konstantinos Chaldaiopoulos,Niki Efthymiou,Athanasia Zlatintsi,Panagiotis Filntisis,Katerina Pastra,Petros Maragos,Li Yang,Gen Zhan,Yiting Liao,Yabin Zhang,Yuxin Liu,Xu Wu,Yunheng Zheng,Linze Li,Kun He,Cong Wu,Xuefeng Zhu,Tianyang Xu,Xiaojun Wu,Wenzhuo Zhao,Keren Fu,Gongyang Li,Shixiang Shi,Jianlin Chen,Haibin Ling,Yaoxin Jiang,Guoyi Xu,Jiajia Liu,Yaokun Shi,Jiachen Tu
Main category: cs.CV
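TL;DR: This paper presents the NTIRE 2026 Challenge on Video Saliency Prediction, including a newly built dataset of 2,000 openly licensed videos, saliency maps and fixations collected through crowdsourced mouse tracking, and an evaluation of the methods submitted by more than 20 teams.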
Details
Motivation: Advance video saliency prediction methods by providing a large, high-quality, open benchmark dataset and a unified evaluation platform. Method: Organize an international challenge; build a new dataset of 2,000 videos; collect fixation data through crowdsourced mouse tracking and generate saliency maps; evaluate algorithms on 800 test videos using generally accepted quality metrics. Result: More than 20 teams submitted entries, and 7 teams passed the final phase with code review; all data used in the challenge is publicly available. Conclusion: The challenge advanced research on video saliency prediction and provides reproducible, extensible benchmark resources. Abstract: This paper presents an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. The goal of the challenge participants was to develop automatic saliency map prediction methods for the provided video sequences. The novel dataset of 2,000 diverse videos with an open license was prepared for this challenge. The fixations and corresponding saliency maps were collected using crowdsourced mouse tracking and contain viewing data from over 5,000 assessors. Evaluation was performed on a subset of 800 test videos using generally accepted quality metrics. The challenge attracted over 20 teams making submissions, and 7 teams passed the final phase with code review. All data used in this challenge is made publicly available at https://github.com/msu-video-group/NTIRE26_Saliency_Prediction.
[132] Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimer's Disease Neurodegeneration
Geonwoo Baek,David H. Salat,Ikbeom Jang
Main category: cs.CV
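TL;DR: This paper proposes MSSM+, combined with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT), which extracts richer cortical morphology features (such as sulcal depth and curvature) from a single T1-weighted MRI scan and improves classification of Alzheimer's disease (AD) versus cognitively normal (CN) controls over existing structural MRI biomarkers.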
Details
Motivation: AD confirmation currently relies on costly and invasive PET or CSF testing; better non-invasive MRI biomarkers are needed. Method: Building on the MSSM framework, the paper proposes MSSM+ (adding vertex-level sulcal depth and cortical curvature), SSVM (partitioning the cortical surface into supervertices that represent spatial relationships), and SV-ViT (an anatomically informed Vision Transformer operating on supervertices). Result: MSSM+ detects more spatially extensive and statistically significant group differences between AD and CN than MSSM; in AD vs. CN classification its area under the precision-recall curve (AUPRC) is 3 percentage points higher; across MR manufacturers it shows reduced signal variability and consistently better classification than CT, GWCs, and MSSM. Conclusion: MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection and could serve as a screening tool before PET/CSF confirmation. Abstract: Alzheimer's disease (AD) confirmation often relies on positron emission tomography (PET) or cerebrospinal fluid (CSF) analysis, which are costly and invasive. Consequently, structural MRI biomarkers such as cortical thickness (CT) are widely used for non-invasive AD screening. Multiscale structural mapping (MSSM) was recently proposed to integrate gray-white matter contrasts (GWCs) with CT from a single T1-weighted MRI (T1w) scan. Building on this framework, we propose MSSM+, together with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT). 3D T1w images from individuals with AD and cognitively normal (CN) controls were analyzed. MSSM+ extends MSSM by incorporating sulcal depth and cortical curvature at the vertex level. SSVM partitions the cortical surface into supervertices (surface patches) that effectively represent inter- and intra-regional spatial relationships. SV-ViT is a Vision Transformer architecture operating on these supervertices, enabling anatomically informed learning from surface mesh representations. Compared with MSSM, MSSM+ identified more spatially extensive and statistically significant group differences between AD and CN. In AD vs. CN classification, MSSM+ achieved a 3 percentage point higher area under the precision-recall curve than MSSM. Vendor-specific analyses further demonstrated reduced signal variability and consistently improved classification performance across MR manufacturers relative to CT, GWCs, and MSSM.
These findings suggest that MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection prior to CSF/PET confirmation.
[133] Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems
Haileab Yagersew
Main category: cs.CV
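TL;DR: Paza is a zero-shot retail theft detection framework that uses layered model orchestration and a multi-signal pre-filter to detect concealment behavior efficiently, cheaply, and with privacy protection, without training a dedicated model.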
Details
Motivation: Existing AI anti-theft systems depend on expensive custom model training and high subscription fees ($200-500 per store per month), while retail theft costs more than $100 billion a year worldwide, so a cheaper, easier-to-deploy, privacy-friendly solution is needed. Method: A zero-shot layered pipeline runs lightweight object detection and pose estimation continuously and invokes the more expensive vision-language model (VLM) only when a multi-signal pre-filter (dwell time plus at least one behavioral signal) fires; any OpenAI-compatible VLM endpoint is supported, which makes the system model-agnostic and able to evolve; faces are blurred to protect privacy. Result: On the DCSASS dataset the system reaches 89.5% precision, 92.8% specificity, and 59.3% recall zero-shot; the pre-filter cuts VLM calls by 240x (<=10 per minute), so one GPU can serve 10-20 stores; cost drops to $50-100 per store per month, one third to one tenth of the price of commercial offerings. Conclusion: Paza shows that a zero-shot, plug-and-play, privacy-first design is feasible for real-world retail security and offers a clear cost advantage, suggesting a new architecture for edge AI security systems. Abstract: Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline: cheap object detection and pose estimation run continuously, and an expensive vision-language model (VLM) is invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to <=10/minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any OpenAI-compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5-Omni, GPT-4o, or future releases without code changes, so the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall zero-shot. The authors attribute the recall gap to sparse frame sampling in offline evaluation rather than to VLM reasoning failures, and regard precision and specificity as the operationally critical metrics because they determine false alarm rates.
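To make the gating rule in this abstract concrete, the following is a minimal sketch of a pre-filter that requires dwell time plus at least one behavioral signal and caps VLM calls per minute. The class name, the signal names, and the 8-second dwell threshold are hypothetical; only the "dwell plus one signal" rule and the <=10 calls/minute bound come from the abstract, and the actual Paza implementation may differ.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SuspicionPreFilter:
    """Gate expensive VLM calls behind cheap per-track signals (illustrative only)."""
    min_dwell_s: float = 8.0       # hypothetical dwell-time threshold, not from the paper
    max_calls_per_min: int = 10    # bound stated in the abstract (<=10/minute)
    _calls: deque = field(default_factory=deque)  # timestamps of recent VLM calls

    def should_invoke_vlm(self, now_s: float, dwell_s: float, signals: dict) -> bool:
        # Rule from the abstract: dwell time AND at least one behavioral signal.
        if dwell_s < self.min_dwell_s or not any(signals.values()):
            return False
        # Drop call timestamps that have left the sliding one-minute window.
        while self._calls and now_s - self._calls[0] >= 60.0:
            self._calls.popleft()
        # Refuse the call if the per-minute budget is already spent.
        if len(self._calls) >= self.max_calls_per_min:
            return False
        self._calls.append(now_s)
        return True
```

A caller would feed it per-track outputs from the detector and pose estimator, for example `pre_filter.should_invoke_vlm(now_s, dwell_s, {"hand_to_bag": True})`, and send a frame to the VLM endpoint only when it returns `True`.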
We present a detailed cost model showing viability at $50-100/month per store (3-10x cheaper than commercial alternatives), and introduce a privacy-preserving design that obfuscates faces in the detection pipeline. The source code is available at https://github.com/xHaileab/Paza-AI.
[134] Efficient Search of Implantable Adaptive Cells for Medical Image Segmentation
Emil Benedykciuk,Marcin Denkowski,Grzegorz M. Wójcik
Main category: cs.CV
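TL;DR: This paper proposes IAC-LTH, which uses a Jensen-Shannon-divergence stability criterion to prune low-importance operations early in differentiable search, greatly accelerating neural architecture search for Implantable Adaptive Cells (IACs) in U-Net skip connections and cutting compute cost by 3.7x-16x while matching or slightly improving segmentation performance.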
Details
Motivation: The original IAC framework needs a 200-epoch differentiable search, which is computationally expensive and limits its practical use in medical image segmentation; the authors observe that the key operations in an IAC emerge and stabilize early in the search, which justifies terminating it early. Method: The authors analyze how per-edge operation importance evolves over time during differentiable search on several medical datasets, and propose a Jensen-Shannon-divergence-based stability criterion that tracks these distributions and progressively prunes low-importance operations, giving an efficient early-stopping search (IAC-LTH). Result: On four public datasets (ACDC, BraTS, KiTS, AMOS) and several 2-D U-Net/nnU-Net backbones, IAC-LTH reduces NAS time by 3.7x-16x; the resulting cells match or slightly outperform those from the full-length search and remain robust with and without data augmentation. Conclusion: IAC architectures can be identified early in the search from the stability of operation importance, without running the full search, which makes adaptive skip-module design more practical for resource-constrained medical image segmentation. Abstract: Purpose: Adaptive skip modules can improve medical image segmentation, but searching for them is computationally costly. Implantable Adaptive Cells (IACs) are compact NAS modules inserted into U-Net skip connections, reducing the search space compared with full-network NAS. However, the original IAC framework still requires a 200-epoch differentiable search for each backbone and dataset. Methods: We analyzed the temporal behavior of operations and edges within IAC cells during differentiable search on public medical image segmentation benchmarks. We found that operations selected in the final discrete cell typically emerge among the strongest candidates early in training, and their architecture parameters stabilize well before the final epoch. Based on this, we propose a Jensen-Shannon-divergence-based stability criterion that tracks per-edge operation-importance distributions and progressively prunes low-importance operations during search. The accelerated framework is called IAC-LTH. Results: Across four public benchmarks (ACDC, BraTS, KiTS, AMOS), several 2-D U-Net backbones, and a 2-D nnU-Net pipeline, IAC-LTH discovers IAC cells whose patient-level segmentation performance matches and sometimes slightly exceeds that of cells found by the original full-length search, while reducing wall-clock NAS cost by 3.7x to 16x across datasets and backbones. These results are consistent across architectures, benchmarks, and both non-augmented and augmented training settings, while preserving the gains of IAC-equipped U-Nets over strong attention-based and dense-skip baselines.
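The stability criterion described in this abstract can be illustrated with a small sketch: compute the Jensen-Shannon divergence between an edge's operation-importance distributions at consecutive epochs, and treat the edge as stable once the divergence stays small. The window length, tolerance, and pruning threshold below are hypothetical placeholders, and the function names are not taken from the paper.

```python
import math


def softmax(logits):
    """Turn one edge's architecture parameters into a distribution over operations."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical inputs, at most 1."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def edge_is_stable(history, window=3, tol=1e-3):
    """True when the last `window` epoch-to-epoch divergences are all below `tol`."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return all(js_divergence(a, b) < tol for a, b in zip(recent, recent[1:]))


def ops_to_prune(probs, min_prob=0.01):
    """Indices of operations whose importance is below `min_prob` (assumed rule)."""
    return [i for i, p in enumerate(probs) if p < min_prob]
```

In a search loop, `history` would hold one distribution per epoch for a given edge; once `edge_is_stable` returns `True`, the indices from `ops_to_prune` could be removed from that edge's candidate set.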
Conclusion: Competitive IAC architectures can be identified from early-stabilizing operations without running the full search, making adaptive skip-module design more practical for medical image segmentation under realistic computational constraints.
[135] MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry
Meng-Xun Li,Wen-Hui Deng,Zhi-Xing Wu,Chun-Xiao Jin,Jia-Min Wu,Yue Han,James Kit Hon Tsoi,Gui-Song Xia,Cui Huang
Main category: cs.CV
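TL;DR: This paper presents MetaDent, a comprehensive vision-language model (VLM) benchmark resource for dental image analysis, comprising a large multi-source dental image dataset, a semi-structured annotation framework, and standardized evaluation tasks (VQA, multi-label classification, and image captioning) derived with LLMs; experiments show that current state-of-the-art VLMs still fall short on fine-grained understanding of intraoral scenes.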
Details
Motivation: VLMs show great potential in medical image analysis, but their use in intraoral photography is limited, mainly by the lack of finely annotated datasets and systematic benchmarks. Method: MetaDent is built by (1) collecting 60,669 dental images from clinical, public, and web sources; (2) designing a semi-structured meta-annotation framework that captures both clinical hierarchy and fine-grained abnormality descriptions (an image summary plus point-by-point free-text descriptions); (3) using LLMs to derive about 15K VQA samples and an 18-class multi-label classification dataset, with fidelity checked by human review; and (4) systematically evaluating mainstream VLMs on VQA, classification, and image captioning. Result: State-of-the-art VLMs perform only moderately on fine-grained intraoral understanding: VQA accuracy is moderate and captions are often incomplete or inconsistent; human review confirms that the LLM-derived annotations are semantically accurate and trustworthy. Conclusion: MetaDent fills a gap in data and benchmarks for dental vision-language research, exposes the bottleneck of existing VLMs in fine-grained clinical understanding, and releases data, annotations, and tools to advance dental AI. Abstract: Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to verify that the LLM-driven conversion reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks.
Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.
[136] Open-Set Vein Biometric Recognition with Deep Metric Learning
Paweł Pilarek,Marcel Musiałek,Anna Górska
Main category: cs.CV
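TL;DR: This paper proposes an open-set vein recognition method that uses deep metric learning to learn L2-normalized embeddings and combines prototype matching with a calibrated similarity threshold, achieving high-accuracy identification and robust rejection of unknown users on several vein datasets.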
Details
Motivation: Most existing vein recognition methods use closed-set classification, which scales poorly, cannot enroll new users dynamically, and lacks generalization and robustness in open-set conditions. Method: Deep metric learning (DML) is used to learn discriminative L2-normalized embeddings, with prototype matching and a calibrated similarity threshold for open-set recognition; evaluation follows a strict subject-disjoint protocol on four vein datasets spanning finger, wrist, and dorsal hand; the backbone is ResNet50-CBAM, the loss is primarily a triplet loss, and the classifier is 1-NN. Result: On MMCBNU 6000 the method reaches an OSCR of 0.9945, AUROC of 0.9974, EER of 1.57%, and 99.6% Rank-1 identification; cross-dataset experiments show robustness on large-scale data but degraded performance under domain shift with little data; ablations indicate that triplet loss with 1-NN gives the best balance of accuracy and efficiency and supports real-time deployment on commodity hardware. Conclusion: The method addresses the scalability and security of vein recognition under open-set conditions while remaining practical, a step toward dynamic, incremental deployment of biometric systems. Abstract: Most state-of-the-art vein recognition methods rely on closed-set classification, which inherently limits their scalability and prevents the adaptive enrollment of new users without complete model retraining. We rigorously evaluate the computational boundaries of Deep Metric Learning (DML) under strict open-set constraints. Unlike standard closed-set approaches, we analyze the impact of data scarcity and domain shift on recognition performance. Our approach learns discriminative L2-normalised embeddings and employs prototype-based matching with a calibrated similarity threshold to effectively distinguish between enrolled users and unseen impostors. We evaluate the framework under a strict subject-disjoint protocol across four diverse datasets covering finger, wrist, and dorsal hand veins (MMCBNU 6000, UTFVP, FYO, and a dorsal hand-vein dataset). On the large-scale MMCBNU 6000 benchmark, our best model (ResNet50-CBAM) achieves an OSCR of 0.9945, AUROC of 0.9974, and EER of 1.57%, maintaining high identification accuracy (99.6% Rank-1) while robustly rejecting unknown subjects. Cross-dataset experiments evaluate the framework's generalisation across different acquisition setups, confirming that while the model handles large-scale data robustly, performance remains sensitive to domain shifts in low-data regimes.
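As an illustration of the prototype-based matching with a similarity threshold described in this abstract, the sketch below enrolls subjects by averaging their L2-normalized embeddings and rejects a query when its best cosine similarity falls under the threshold. The embeddings are assumed to come from some trained network; the function names and the threshold value are hypothetical, and the paper's calibration procedure is not reproduced.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_prototypes(enrolled):
    """Average each subject's normalized embeddings, then renormalize the mean."""
    prototypes = {}
    for subject, embeddings in enrolled.items():
        unit = [l2_normalize(e) for e in embeddings]
        mean = [sum(col) / len(unit) for col in zip(*unit)]
        prototypes[subject] = l2_normalize(mean)
    return prototypes


def identify(query, prototypes, threshold):
    """Return the best-matching subject, or None to reject an unknown subject."""
    q = l2_normalize(query)
    best_id, best_sim = None, -1.0
    for subject, proto in prototypes.items():
        sim = sum(a * b for a, b in zip(q, proto))  # cosine similarity of unit vectors
        if sim > best_sim:
            best_id, best_sim = subject, sim
    return best_id if best_sim >= threshold else None
```

Because enrollment only adds an entry to `prototypes`, a new user can be registered without retraining the embedding network, which is the scalability argument the abstract makes for open-set matching.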
Ablation studies demonstrate that triplet-based objectives combined with a simple 1-NN classifier offer an optimal trade-off between accuracy and efficiency, enabling real-time deployment on commodity hardware.
[137] FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection
Jianchao Huang,Fengming Zhang,Haibo Zhu,Tao Yan
Main category: cs.CV
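TL;DR: This paper proposes FSDETR, a frequency-spatial feature enhancement framework built on RT-DETR that improves small-object detection through spatial hierarchical attention, deformable-attention feature interaction, and a frequency-spatial feature pyramid network.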
Details
Motivation: Small-object detection suffers from feature degradation caused by downsampling, mutual occlusion in dense clusters, and complex background interference. Method: FSDETR comprises a Spatial Hierarchical Attention Block (SHAB), Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI), and a Frequency-Spatial Feature Pyramid Network (FSFPN), combining frequency-domain filtering with spatial edge extraction. Result: FSDETR reaches 13.9% AP_S on VisDrone 2019 and 48.95% AP50^tiny on TinyPerson with only 14.7M parameters. Conclusion: FSDETR mitigates feature degradation, occlusion, and background interference in small-object detection and performs strongly on several small-object benchmarks. Abstract: Small object detection remains a significant challenge due to feature degradation from downsampling, mutual occlusion in dense clusters, and complex background interference. To address these issues, this paper proposes FSDETR, a frequency-spatial feature enhancement framework built upon the RT-DETR baseline. By establishing a collaborative modeling mechanism, the method effectively leverages complementary structural information. Specifically, a Spatial Hierarchical Attention Block (SHAB) captures both local details and global dependencies to strengthen semantic representation. Furthermore, to mitigate occlusion in dense scenes, the Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) focuses on informative regions via dynamic sampling. Finally, the Frequency-Spatial Feature Pyramid Network (FSFPN) integrates frequency filtering with spatial edge extraction via the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained details. Experimental results show that with only 14.7M parameters, FSDETR achieves 13.9% AP_S on VisDrone 2019 and 48.95% AP50^tiny on TinyPerson, showing strong performance on small-object benchmarks. The code and models are available at https://github.com/YT3DVision/FSDETR.
[138] Reward-Aware Trajectory Shaping for Few-step Visual Generation
Rui Li,Bingyu Li,Yuanzhi Liang,HuangHai Bin,Chi Zhang,XueLong Li
Main category: cs.CV
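TL;DR: This paper proposes Reward-Aware Trajectory Shaping (RATS), a framework that uses reward-aware trajectory alignment and gating to produce high-quality generations in very few sampling steps that can surpass the teacher, removing the ceiling that conventional distillation places on the student.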
Details
Motivation: Existing few-step generation methods rely on knowledge distillation to compress a multi-step teacher into a few-step student, which caps student performance at the teacher's level; this work aims to lift that cap so the student can optimize toward preference rewards and even surpass the teacher. Method: RATS (1) aligns teacher and student latent trajectories at key denoising stages via horizon matching; (2) introduces a reward-aware gate that adjusts teacher guidance according to the relative rewards of teacher and student, strengthening guidance when the teacher's reward is higher and relaxing it when the student matches or exceeds the teacher; and (3) combines trajectory distillation, reward gating, and preference alignment without adding inference cost. Result: RATS substantially improves the efficiency-quality trade-off of few-step generation and significantly narrows the gap between few-step students and strong multi-step teachers on visual generation tasks. Conclusion: Preference-alignment awareness frees few-step generators from dependence on teacher performance, and reward-driven trajectory shaping yields better generation quality; RATS offers a new paradigm for efficient, high-quality generation. Abstract: Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing preference alignment awareness enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose Reward-Aware Trajectory Shaping (RATS), a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead.
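The abstract does not give the functional form of the reward-aware gate, so the following is only a sketch of one gate with the stated behavior: it grows with the teacher's reward advantage and drops to zero once the student matches or surpasses the teacher. The clamped-tanh form, the temperature parameter, and the additive loss combination are all assumptions made for illustration.

```python
import math


def reward_gate(teacher_reward, student_reward, temperature=1.0):
    """Gate in [0, 1): positive only while the teacher out-scores the student."""
    gap = (teacher_reward - student_reward) / temperature
    return max(0.0, math.tanh(gap))


def shaped_loss(trajectory_loss, reward_loss, teacher_reward, student_reward):
    """Reward objective plus a gated trajectory-matching term (assumed combination)."""
    gate = reward_gate(teacher_reward, student_reward)
    return reward_loss + gate * trajectory_loss
```

With this form, the trajectory-matching term vanishes once the student's reward reaches the teacher's, so further updates are driven by the reward objective alone. Because the gate only affects the training loss, it adds nothing at inference time, consistent with the abstract's claim of no extra test-time overhead.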
Experimental results demonstrate that RATS substantially improves the efficiency-quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.
[139] Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
Victoria Yue Chen,Emery Pierson,Léopold Maillard,Maks Ovsjanikov
Main category: cs.CV
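TL;DR: This paper shows that state-of-the-art text-to-3D generative models suffer from latent "sink traps" during text-driven inversion, where the model becomes insensitive to the prompt and text edits fail; by analyzing sampling trajectories, the authors find the models retain strong unconditional geometric generation ability, and they propose a framework that decouples geometric representation from linguistic sensitivity to make text-driven 3D editing more robust and faithful.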
Details
Motivation: Text-driven 3D generation assumes the model stays sensitive to natural-language prompts, but this assumption often fails in practice, which limits text editing, style transfer, and similar applications. Method: The authors identify the latent sink-trap phenomenon by analyzing the generative model's sampling trajectories, verify that geometric expressivity is intact while language guidance fails, and propose an editing framework that bypasses the sinks by decoupling geometric representation from linguistic sensitivity. Result: The work identifies and confirms that latent sink traps are widespread in text-to-3D models, shows that the models can still generate complex geometry through their unconditional prior, and reports that the proposed method markedly improves high-fidelity semantic editing of out-of-distribution 3D shapes. Conclusion: The bottleneck in text-driven 3D editing is not geometric capacity but degraded sensitivity to text prompts; decoupling geometry generation from language guidance is a key route to more robust 3D editing systems. Abstract: Text-driven inversion of generative models is a core paradigm for manipulating 2D or 3D content, unlocking numerous applications such as text-based editing, style transfer, or inverse problems. However, it relies on the assumption that generative models remain sensitive to natural language prompts. We demonstrate that for state-of-the-art native text-to-3D generative models, this assumption often collapses. We identify a critical failure mode where generation trajectories are drawn into latent "sink traps": regions where the model becomes insensitive to prompt modifications. In these regimes, changes to the input text fail to alter internal representations in a way that alters the output geometry. Crucially, we observe that this is not a limitation of the model's geometric expressivity; the same generative models possess the ability to produce a vast diversity of shapes but, as we demonstrate, become insensitive to out-of-distribution text guidance. We investigate this behavior by analyzing the sampling trajectories of the generative model, and find that complex geometries can still be represented and produced by leveraging the model's unconditional generative prior. This leads to a more robust framework for text-based 3D shape editing that bypasses latent sinks by decoupling a model's geometric representation power from its linguistic sensitivity. Our approach addresses the limitations of current 3D pipelines and enables high-fidelity semantic manipulation of out-of-distribution 3D shapes.
Project webpage: https://daidedou.sorpi.fr/publication/beyondprompts
[140] Hybrid Latents -- Geometry-Appearance-Aware Surfel Splatting
Neel Kelkar,Simon Niedermayr,Klaus Engel,Rüdiger Westermann
Main category: cs.CV
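TL;DR: This paper proposes a hybrid Gaussian-hash-grid radiance representation for reconstructing 2D Gaussian scene models from multi-view images; through explicit frequency decomposition, hard opacity falloffs, and probabilistic pruning, it substantially reduces the number of Gaussian primitives while improving geometric reconstruction accuracy and rendering efficiency.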
Details
Motivation: Geometry and appearance are strongly entangled in NeRF-style models; the goal is to stop high-frequency texture from compensating for geometric errors and to improve reconstruction fidelity and efficiency. Method: A hybrid Gaussian-hash-grid radiance representation adds per-Gaussian latent features alongside hash-grid features to explicitly separate low-frequency (geometry) from high-frequency (texture) components; hard opacity falloffs strengthen the geometry-appearance separation; probabilistic pruning with a sparsity-inducing BCE opacity loss removes redundant Gaussians. Result: On synthetic and real-world datasets the method outperforms existing Gaussian-based novel-view synthesis methods in reconstruction fidelity while using about one tenth as many Gaussian primitives. Conclusion: Frequency decomposition, geometry-appearance decoupling, and efficient pruning yield a more compact and more accurate Gaussian scene representation. Abstract: We introduce a hybrid Gaussian-hash-grid radiance representation for reconstructing 2D Gaussian scene models from multi-view images. Similar to NeST splatting, our approach reduces the entanglement between geometry and appearance common in NeRF-based models, but adds per-Gaussian latent features alongside hash-grid features to bias the optimizer toward a separation of low- and high-frequency scene components. This explicit frequency-based decomposition reduces the tendency of high-frequency texture to compensate for geometric errors. Encouraging Gaussians with hard opacity falloffs further strengthens the separation between geometry and appearance, improving both geometry reconstruction and rendering efficiency. Finally, probabilistic pruning combined with a sparsity-inducing BCE opacity loss allows redundant Gaussians to be turned off, yielding a minimal set of Gaussians sufficient to represent the scene. Using both synthetic and real-world datasets, we compare against the state of the art in Gaussian-based novel-view synthesis and demonstrate superior reconstruction fidelity with an order of magnitude fewer primitives.
[141] Generative Data Augmentation for Skeleton Action Recognition
Xu Dong,Wanqing Li,Anthony Adeyemi-Ejeye,Andrew Gilbert
Main category: cs.CV
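TL;DR: This paper proposes a conditional generative data augmentation method for skeleton action recognition, using a Transformer encoder-decoder with a generative refinement module and a dropout mechanism, and improves recognition accuracy in both few-shot and full-data settings.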
Details
Motivation: Skeleton action recognition is held back by 3D skeleton datasets that are small, lack diversity, and are costly to annotate, so effective data augmentation is needed. Method: A conditional generative augmentation pipeline uses a Transformer encoder-decoder with a generative refinement module and a dropout mechanism to balance the fidelity and diversity of generated sequences. Result: On HumanAct12 and NTU-VIBE the method improves the accuracy of several skeleton action recognition models, most notably in low-data settings. Conclusion: The conditional generative framework synthesizes high-quality, diverse skeleton sequences efficiently, generalizes well in both few-shot and standard settings, and provides a reliable augmentation approach for skeleton action recognition. Abstract: Skeleton-based human action recognition is a powerful approach for understanding human behaviour from pose data, but collecting large-scale, diverse, and well-annotated 3D skeleton datasets is both expensive and labor-intensive. To address this challenge, we propose a conditional generative pipeline for data augmentation in skeleton action recognition. Our method learns the distribution of real skeleton sequences under the constraint of action labels, enabling the synthesis of diverse and high-fidelity data. Even with limited training samples, it can effectively generate skeleton sequences and achieve competitive recognition performance in low-data scenarios, demonstrating strong generalisation in downstream tasks. Specifically, we introduce a Transformer-based encoder-decoder architecture, combined with a generative refinement module and a dropout mechanism, to balance fidelity and diversity during sampling. Experiments on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset show that our approach consistently improves the accuracy of multiple skeleton-based action recognition models, validating its effectiveness in both few-shot and full-data settings. The source code can be found here.
[142] RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
Gabriele Mattioli,Evelyn Turri,Sara Sarto,Lorenzo Baraldi,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Main category: cs.CV
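TL;DR: This paper proposes RaTA-Tool, a framework that converts multimodal user queries into structured task descriptions and retrieves suitable tools by semantic matching, enabling open-world multimodal tool selection with zero-shot extension to new tools and preference optimization.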
Details
Motivation: Existing tool-calling methods are limited to text-only inputs and closed-world settings, so they struggle with multimodal instructions and cannot generalize to tools unseen during training. Method: RaTA-Tool (1) uses an MLLM to convert a multimodal query into a structured task description; (2) retrieves the best-matching tool from a library of machine-readable tool descriptions by semantic matching; (3) applies DPO preference optimization to improve task-tool alignment; and (4) builds the first open-world multimodal tool-use dataset, derived from Hugging Face model cards. Result: Tool-selection performance improves significantly in open-world, multimodal scenarios, and new tools can be added without retraining. Conclusion: RaTA-Tool offers multimodal foundation models an extensible, well-generalizing paradigm for open-world tool learning. Abstract: Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources (such as APIs, computational utilities, and specialized models) to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards.
Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
[143] Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation
Hassan Ali,Doreen Jirak,Luca Müller,Stefan Wermter
Main category: cs.CV
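TL;DR: This paper proposes a prompt-based video generation method that uses image-to-video foundation models to synthesize a realistic deictic gesture dataset and validates its usefulness on downstream tasks.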
Details
Motivation: Gesture recognition research faces severe data scarcity; conventional approaches depend on costly human recordings or on image processing techniques that cannot produce authentic gesture variability. Method: A prompt-driven video generation pipeline, seeded with a small number of reference samples from human participants, uses image-to-video foundation models to synthesize deictic gesture videos. Result: The synthetic gestures are close to real data in visual fidelity and add meaningful variability and novelty; mixing real and synthetic data improves the downstream performance of several deep models. Conclusion: Even at an early stage, image-to-video generation is already an effective tool for zero-shot gesture synthesis and strengthens both the data basis and model performance of gesture recognition. Abstract: Gesture recognition research, unlike NLP, continues to face acute data scarcity, with progress constrained by the need for costly human recordings or image processing approaches that cannot generate authentic variability in the gestures themselves. Recent advancements in image-to-video foundation models have enabled the generation of photorealistic, semantically rich videos guided by natural language. These capabilities open up new possibilities for creating effort-free synthetic data, raising the critical question of whether video Generative AI models can augment and complement traditional human-generated gesture data. In this paper, we introduce and analyze prompt-based video generation to construct a realistic deictic gestures dataset and rigorously evaluate its effectiveness for downstream tasks. We propose a data generation pipeline that produces deictic gestures from a small number of reference samples collected from human participants, providing an accessible approach that can be leveraged both within and beyond the machine learning community. Our results demonstrate that the synthetic gestures not only align closely with real ones in terms of visual fidelity but also introduce meaningful variability and novelty that enrich the original data, further supported by superior performance of various deep models using a mixed dataset. These findings highlight that image-to-video techniques, even in their early stages, offer a powerful zero-shot approach to gesture synthesis with clear benefits for downstream tasks.
[144] Frequency-Enhanced Dual-Subspace Networks for Few-Shot Fine-Grained Image Classification
Meijia Wang,Guochao Wang,Haozhen Chu,Bin Yao,Weichuan Zhang,Yuan Wang,Junpo Yang
Main category: cs.CV
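TL;DR: This paper proposes FEDSNet, which uses frequency-domain enhancement and dual-subspace modeling to decouple structure from texture features, mitigating texture bias and structural instability in few-shot fine-grained classification.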
Details
Motivation: Existing metric-learning methods for few-shot fine-grained image classification rely only on spatial-domain features, making them vulnerable to texture bias and high-frequency background noise; without cross-view geometric constraints they become structurally unstable and overfit in few-shot conditions. Method: The Frequency-Enhanced Dual-Subspace Network (FEDSNet) uses the DCT and low-pass filtering to extract low-frequency global structure; builds independent low-rank subspaces for spatial texture and frequency structure with truncated SVD; and fuses the projection distances of the two views with an adaptive gate, so that the structural stability of the frequency subspace restrains the spatial subspace from overfitting to background. Result: FEDSNet achieves highly competitive classification performance and robustness on four benchmarks (CUB-200-2011, Stanford Cars, Stanford Dogs, FGVC-Aircraft) while remaining computationally efficient. Conclusion: FEDSNet provides a new paradigm for few-shot fine-grained visual recognition that balances accuracy, robustness, and efficiency. Abstract: Few-shot fine-grained image classification aims to recognize subcategories with high visual similarity using only a limited number of annotated samples. Existing metric learning-based methods typically rely solely on spatial domain features. Confined to this single perspective, models inevitably suffer from inherent texture biases, entangling essential structural details with high-frequency background noise. Furthermore, lacking cross-view geometric constraints, single-view metrics tend to overfit this noise, resulting in structural instability under few-shot conditions. To address these issues, this paper proposes the Frequency-Enhanced Dual-Subspace Network (FEDSNet). Specifically, FEDSNet utilizes the Discrete Cosine Transform (DCT) and a low-pass filtering mechanism to explicitly isolate low-frequency global structural components from spatial features, thereby suppressing background interference. Truncated Singular Value Decomposition (SVD) is employed to construct independent, low-rank linear subspaces for both spatial texture and frequency structural features. An adaptive gating mechanism is designed to dynamically fuse the projection distances from these dual views. This strategy leverages the structural stability of the frequency subspace to prevent the spatial subspace from overfitting to background features.
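The DCT low-pass step mentioned in this abstract can be illustrated on a 1-D signal: transform, zero the high-frequency coefficients, and transform back. This is a simplified sketch (FEDSNet operates on spatial feature maps, and its filter design is not specified in the abstract); the function names and the `keep` parameter are hypothetical.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out


def idct(coeffs):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)))
    return out


def low_pass(x, keep):
    """Keep the `keep` lowest-frequency DCT coefficients and zero the rest."""
    coeffs = dct(x)
    return idct([c if k < keep else 0.0 for k, c in enumerate(coeffs)])
```

A constant signal passes through `low_pass(x, keep=1)` unchanged, while a rapidly alternating signal with zero mean is suppressed entirely, which is the sense in which low-pass filtering retains global structure and discards high-frequency texture or noise.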
Extensive experiments on four benchmark datasets (CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft) demonstrate that FEDSNet exhibits excellent classification performance and robustness, achieving highly competitive results compared to existing metric learning algorithms. Complexity analysis further confirms that the proposed network achieves a favorable balance between high accuracy and computational efficiency, providing an effective new paradigm for few-shot fine-grained visual recognition.
[145] UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
Jun Wang,Shuo Tan,Zelong Sun,Tiancheng Gu,Yongle Zhao,Ziyong Feng,Kaicheng Yang,Cewu Lu
Main category: cs.CV
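TL;DR: This paper proposes UniDoc-RL, a unified reinforcement learning framework that strengthens the visual reasoning of large vision-language models in retrieval-augmented generation (RAG); it jointly optimizes document retrieval, reranking, active visual perception, and reasoning over a hierarchical action space, uses a dense multi-reward scheme with GRPO training, and outperforms existing methods on several benchmarks.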
Details
Motivation: Existing visual RAG systems rely on generic retrieval signals and overlook the fine-grained visual semantics needed for complex reasoning, which limits performance. Method: UniDoc-RL models visual information acquisition as a sequential decision problem with a hierarchical action space (document retrieval, then image selection, then region cropping); trains end to end with reinforcement learning based on Group Relative Policy Optimization (GRPO); designs a task-aware dense multi-reward scheme; and curates a dataset of high-quality reasoning trajectories with fine-grained action annotations. Result: UniDoc-RL consistently surpasses state-of-the-art baselines on three benchmarks, with gains of up to 17.7% over prior RL-based methods. Conclusion: UniDoc-RL shows that jointly optimizing retrieval, reranking, and active visual perception improves LVLM visual reasoning, and offers a scalable, trainable paradigm for visual RAG. Abstract: Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
[146] Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
Yuzhuo Chen,Zehua Ma,Han Fang,Hengyi Wang,Guanjie Wang,Weiming Zhang
Main category: cs.CV
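TL;DR: This paper proposes Flow of Truth, the first proactive framework for temporal forensics in image-to-video (I2V) generation; by recasting video generation as the motion of pixels through time, it designs a learnable forensic template and a template-guided flow module that robustly trace pixel trajectories and substantially improve temporal forensics across models.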
Details
Motivation: I2V content evolves over time, so conventional tampering localization in 2D pixel space breaks down; forensic methods are needed that trace how pixels flow and deform over time. Method: Taking the view that generation is the motion of pixels through time, the authors design a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content for robust temporal tracing. Result: Flow of Truth generalizes well across commercial and open-source I2V models and substantially improves temporal forensics performance. Conclusion: Flow of Truth is the first proactive temporal forensics framework for I2V generation; it validates modeling the temporal evolution of pixels and opens a new direction for forensics on dynamic content. Abstract: The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present Flow of Truth, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as the motion of pixels through time rather than the synthesis of frames. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.
[147] Quality-Aware Calibration for AI-Generated Image Detection in the Wild
Fabrizio Guillaro,Vincenzo De Rosa,Davide Cozzolino,Luisa Verdoliva
Main category: cs.CV
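TL;DR: This paper proposes QuAD, a framework that makes synthetic image detection more reliable through quality-aware fusion over near-duplicate images, addressing the inconsistent predictions single-image detectors give when image quality degrades during real-world dissemination.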
Details
Motivation: Most synthetic image detectors operate on a single image and ignore the quality degradation of near-duplicates circulating online, so different versions of the same image can receive inconsistent predictions. Method: QuAD (Quality-Aware calibration with near-Duplicates) retrieves the online near-duplicates of a query image, scores each with a detector, and fuses the scores with weights based on each image's estimated quality. Result: On two new datasets, AncesTree (136k lab-degraded images) and ReWIND (about 10k real-world near-duplicates), QuAD improves the balanced accuracy of several state-of-the-art detectors by about 8% on average. Conclusion: Jointly processing all available near-duplicates while accounting for their quality differences is key to robust, practical detection of AI-generated content. Abstract: Significant progress has been made in detecting synthetic images; however, most existing approaches operate on a single image instance and overlook a key characteristic of real-world dissemination: as viral images circulate on the web, multiple near-duplicate versions appear and lose quality due to repeated operations like recompression, resizing and cropping. As a consequence, the same image may yield inconsistent forensic predictions depending on which version is analyzed. In this work, to address this issue, we propose QuAD (Quality-Aware calibration with near-Duplicates), a novel framework that makes decisions based on all available near-duplicates of the same image. Given a query, we retrieve its online near-duplicates and feed them to a detector: the resulting scores are then aggregated based on the estimated quality of the corresponding instance. By doing so, we take advantage of all pieces of information while accounting for the reduced reliability of images impaired by multiple processing steps. To support large-scale evaluation, we introduce two datasets: AncesTree, an in-lab dataset of 136k images organized in stochastic degradation trees that simulate online reposting dynamics, and ReWIND, a real-world dataset of nearly 10k near-duplicate images collected from viral web content. Experiments on several state-of-the-art detectors show that our quality-aware fusion improves their performance consistently, with an average gain of around 8% in terms of balanced accuracy compared to plain average.
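The quality-aware aggregation described in this abstract can be sketched as a weighted average in which higher-quality near-duplicates get more weight. The softmax weighting and the temperature parameter are assumptions chosen for illustration; the abstract does not state how QuAD maps estimated quality to weights, and the quality estimator itself is outside this sketch.

```python
import math


def quality_weights(qualities, temperature=1.0):
    """Softmax over estimated quality scores (assumed weighting rule)."""
    m = max(qualities)
    exps = [math.exp((q - m) / temperature) for q in qualities]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(scores, qualities, temperature=1.0):
    """Quality-weighted average of per-image detector scores."""
    weights = quality_weights(qualities, temperature)
    return sum(w * s for w, s in zip(weights, scores))
```

When all near-duplicates have equal estimated quality, `fuse_scores` reduces to the plain average that the abstract uses as its comparison baseline; as quality estimates diverge, heavily degraded copies contribute less to the final decision.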
Our results highlight the importance of jointly processing all the images available online to achieve reliable detection of AI-generated content in real-world applications. Code and data are publicly available at https://grip-unina.github.io/QuAD/
[148] Implicit Neural Representations: A Signal Processing Perspective
Dhananjaya Jayasundara,Vishal M. Patel
Main category: cs.CV
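TL;DR: This paper surveys the development of implicit neural representations (INRs) from a signal processing perspective, focusing on their spectral properties, sampling theory, and multiscale representation. It examines how network design choices (coordinate inputs, activation function design, hash grid encodings, and others) affect approximation capability, and discusses applications in medical and radar imaging, compression, and 3D scene representation, along with open theoretical challenges.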
Details
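Motivation: Traditional modeling based on discrete samples has limitations, whereas INRs represent many kinds of signals uniformly as continuous functions; their spectral behavior, approximation mechanisms, and theoretical foundations therefore need to be understood from a signal processing viewpoint. Method: The article uses a signal processing framework to analyze the spectral bias, sampling properties, and multiscale representation of INRs; systematically traces the architectural progression through coordinate networks, periodic, localized, and adaptive activation functions, hierarchical decompositions, and hash grid encodings; and draws on concrete application cases. Result: It highlights the inherent low-frequency spectral bias of INRs, shows that specialized activations and structured encodings can improve high-frequency detail reconstruction and computational efficiency, and reports strong performance in inverse problems, signal compression, and 3D representation. Conclusion: INRs are in essence data-adaptive, learned signal models; future work must address open challenges in theoretical stability, weight space interpretability, and large scale generalization. Abstract: Implicit neural representations (INRs) mark a fundamental shift in signal modeling, moving from discrete sampled data to continuous functional representations. By parameterizing signals as neural networks, INRs provide a unified framework for representing images, audio, video, 3D geometry, and beyond as continuous functions of their coordinates. This functional viewpoint enables signal operations such as differentiation to be carried out analytically through automatic differentiation rather than through discrete approximations. In this article, we examine the evolution of INRs from a signal processing perspective, emphasizing spectral behavior, sampling theory, and multiscale representation. We trace the progression from standard coordinate-based networks, which exhibit a spectral bias toward low-frequency components, to more advanced designs that reshape the approximation space through specialized activations, including periodic, localized, and adaptive functions. We also discuss structured representations, such as hierarchical decompositions and hash grid encodings, that improve spatial adaptivity and computational efficiency. We further highlight the utility of INRs across a broad range of applications, including inverse problems in medical and radar imaging, compression, and 3D scene representation.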
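The point about analytic differentiation can be illustrated with a tiny coordinate network that uses a periodic (sine) activation. This is a minimal sketch under assumptions: the layer sizes, the random weights, and the frequency scale `omega0` are arbitrary choices, and the derivative is written out by hand with the chain rule (standing in for what an automatic differentiation framework would produce) so the example needs only NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
omega0 = 30.0                              # frequency scale of the sine activation (assumed value)
W1 = rng.uniform(-1, 1, size=(16, 1))      # first-layer weights: 1-D coordinate -> 16 features
b1 = rng.uniform(-1, 1, size=16)
W2 = rng.normal(size=(1, 16)) / 4.0        # linear read-out

def inr(x):
    """Evaluate the represented signal at continuous coordinates x (shape [N])."""
    hidden = np.sin(omega0 * (W1 @ x[None, :] + b1[:, None]))   # shape [16, N]
    return (W2 @ hidden)[0]

def inr_grad(x):
    """Exact derivative of inr with respect to x, obtained via the chain rule."""
    pre = omega0 * (W1 @ x[None, :] + b1[:, None])
    return (W2 @ (np.cos(pre) * omega0 * W1))[0]

x = np.linspace(-1.0, 1.0, 5)
eps = 1e-6
finite_diff = (inr(x + eps) - inr(x - eps)) / (2 * eps)   # discrete approximation
max_gap = np.max(np.abs(inr_grad(x) - finite_diff))       # should be tiny
```

Because the signal is a closed-form function of its coordinate, the derivative is available at any point without a sampling grid; the finite-difference value is computed only as a sanity check.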
By interpreting INRs as learned signal models whose approximation spaces adapt to the underlying data, this article clarifies the field's core conceptual developments and outlines open challenges in theoretical stability, weight space interpretability, and large scale generalization.
[149] Attention-Gated Convolutional Networks for Scanner-Agnostic Quality Assessment
Chinmay Bakhale,Anil Sao
Main category: cs.CV
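TL;DR: This paper proposes a hybrid CNN-Attention framework for robust, cross-site MRI quality assessment that identifies motion artifacts effectively and performs well on both seen and unseen sites.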
Details
Motivation: Motion artifacts severely degrade the quality of structural MRI, and manual quality control does not scale to large longitudinal studies, so an automatic, robust quality assessment method that generalizes across sites is needed. Method: A hybrid framework combines a hierarchical 2D CNN encoder, which extracts local spatial features, with a multi-head cross-attention mechanism, which models global dependencies and dynamically suppresses site-specific interference; it is trained end-to-end on the MR-ART dataset (200 subjects). Result: On seen sites the model reaches a scan-level accuracy of 0.9920 and an F1-score of 0.9919; on 17 heterogeneous unseen sites from ABIDE it reaches an accuracy of 0.755 without fine-tuning, indicating strong generalization. Conclusion: Attention-driven feature re-weighting captures general artifact representations, substantially mitigates domain shift, and makes MRI quality control more practical in multi-center, multi-vendor settings. Abstract: Motion artifacts present a significant challenge in structural MRI (sMRI), often compromising clinical diagnostics and large-scale automated analysis. While manual quality control (QC) remains the gold standard, it is increasingly unscalable for massive longitudinal studies. To address this, we propose a hybrid CNN-Attention framework designed for robust, site-invariant MRI quality assessment. Our architecture integrates a hierarchical 2D CNN encoder for local spatial feature extraction with a multi-head cross-attention mechanism to model global dependencies. This synergy enables the model to prioritize motion-relevant artifact signatures, such as ringing and blurring, while dynamically filtering out site-specific intensity variations and background noise. The framework was trained end-to-end on the MR-ART dataset using a balanced cohort of 200 subjects. Performance was evaluated across two tiers: Seen Site Evaluation on a held-out MR-ART partition and Unseen Site Evaluation using 200 subjects from 17 heterogeneous sites in the ABIDE archive. On seen sites, the model achieved a scan-level accuracy of 0.9920 and an F1-score of 0.9919. Crucially, it maintained strong generalization across unseen ABIDE sites (Acc = 0.755) without any retraining or fine-tuning, demonstrating high resilience to domain shift. These results indicate that attention-based feature re-weighting successfully captures universal artifact descriptors, bridging the performance gap between diverse imaging environments and scanner manufacturers.
[150] Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection
Yangchen Zeng,Zhenyu Yu,Dongming Jiang,Wenbo Zhang,Yifan Hong,Zhanhua Hu,Jiao Luo,Kangning Cui
Main category: cs.CV
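TL;DR: This paper proposes the HELP framework, which uses heatmap-guided positional embedding (HPE) to suppress background noise and strengthen the fusion of foreground positional and semantic information. Combined with a gradient-based mask filter and Linear-Snake Convolution, it substantially reduces the parameter count and the number of decoder layers while maintaining small-object detection accuracy.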
Details
Motivation: Transformer-based detectors remain inefficient in small-object detection and vulnerable to background noise, which forces them to rely on deep decoders to refine query quality. Method: The Heatmap-guided Embedding Learning Paradigm (HELP) is built around Heatmap-guided Positional Embedding (HPE): it injects heatmap-aware positional encoding in the encoder and filters background-dominant embeddings with a gradient-based mask before decoding. Linear-Snake Convolution is introduced to alleviate the feature sparsity of small objects. Gradient-based heatmap supervision is used only during training and adds no inference overhead. Result: Decoder layers are reduced from eight to three and parameters by 59.4% (66.3M vs. 163M), while detection accuracy is maintained or improved on multiple benchmarks with less computation. Conclusion: Through noise-aware positional-semantic fusion, HELP improves the efficiency and robustness of small-object detection and suggests a new direction for designing lightweight Transformer detectors. Abstract: Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. Code Repository: https://github.com/yidimopozhibai/Noise-Suppressed-Query-Retrieval
[151] Building Extraction from Remote Sensing Imagery under Hazy and Low-light Conditions: Benchmark and Baseline
Feifei Sang,Wei Lu,Hongruixuan Chen,Sibao Chen,Bin Luo
Main category: cs.CV
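TL;DR: This paper proposes the HaLoBuilding benchmark and the HaLoBuild-Net framework for building extraction from hazy and low-light remote sensing imagery, using several cooperating modules to improve robustness and accuracy.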
Details
Motivation: Existing building extraction methods for optical remote sensing degrade markedly under real-world adverse conditions such as haze and low light, and no corresponding benchmark exists; SAR can image in all weather but suffers from geometric distortions. Method: The authors build HaLoBuilding, the first optical remote sensing benchmark for building extraction under hazy and low-light conditions, and propose the end-to-end network HaLoBuild-Net, which comprises a Spatial-Frequency Focus Module (SFFM), a Global Multi-scale Guidance Module (GMGM), and a Mutual-Guided Fusion Module (MGFM). Result: On the HaLoBuilding dataset, HaLoBuild-Net significantly outperforms existing state-of-the-art methods and conventional cascaded restoration-segmentation pipelines, while generalizing well to the WHU, INRIA, and LoveDA datasets. Conclusion: The benchmark and method address building extraction from remote sensing imagery under adverse weather and lighting, moving the field toward more realistic and robust application scenarios. Abstract: Building extraction from optical Remote Sensing (RS) imagery suffers from performance degradation under real-world hazy and low-light conditions. However, existing optical methods and benchmarks focus primarily on ideal clear-weather conditions. While SAR offers all-weather sensing, its side-looking geometry causes geometric distortions. To address these challenges, we introduce HaLoBuilding, the first optical benchmark specifically designed for building extraction under hazy and low-light conditions. By leveraging a same-scene multitemporal pairing strategy, we ensure pixel-level label alignment and high fidelity even under extreme degradation. Building upon this benchmark, we propose HaLoBuild-Net, a novel end-to-end framework for building extraction in adverse RS scenarios. At its core, we develop a Spatial-Frequency Focus Module (SFFM) to effectively mitigate meteorological interference on building features by coupling large receptive field attention with frequency-aware channel reweighting guided by stable low-frequency anchors. Additionally, a Global Multi-scale Guidance Module (GMGM) provides global semantic constraints to anchor building topologies, while a Mutual-Guided Fusion Module (MGFM) implements bidirectional semantic-spatial calibration to suppress shallow noise and sharpen weather-induced blurred boundaries. Extensive experiments demonstrate that HaLoBuild-Net significantly outperforms state-of-the-art methods and conventional cascaded restoration-segmentation paradigms on the HaLoBuilding dataset, while maintaining robust generalization on WHU, INRIA, and LoveDA datasets.
The source code and datasets are publicly available at: https://github.com/AeroVILab-AHU/HaLoBuilding.
[152] Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID
Jiaxuan Li,Xin Wen,Zhihang Li
Main category: cs.CV
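TL;DR: This paper proposes STFER, a framework that uses a large vision-language model (LVLM) to generate identity-consistent text descriptions, making person re-identification more robust to modality shifts (RGB/IR) and clothing changes. Semantic-driven visual token filtering and expert routing improve feature discriminability and scene adaptability; the method achieves state-of-the-art results on the AT-USTC dataset and strong generalization on several mainstream ReID benchmarks.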
Details
Motivation: Existing methods rely on purely visual features, and their performance drops sharply under illumination-induced modality shifts (such as day versus night) and long-term clothing changes, so a more robust identity representation is needed. Method: The Semantic-driven Token Filtering and Expert Routing (STFER) framework uses instructions to guide an LVLM to generate identity-intrinsic semantic text; applies Semantic-driven Visual Token Filtering (SVTF) based on this text to enhance key regions and suppress background noise; and feeds the text into Semantic-driven Expert Routing (SER) for adaptive multi-scenario gating. Result: STFER achieves state-of-the-art results on the Any-Time ReID dataset AT-USTC and remains highly competitive when transferred to 5 mainstream ReID benchmarks (such as Market-1501 and DukeMTMC), indicating strong generalization. Conclusion: Semantic text can serve as a stable, identity-discriminative auxiliary signal that decouples identity from appearance changes; STFER shows that LVLM-generated semantic priors are effective and practical in complex, dynamic ReID tasks. Abstract: Any-Time Person Re-identification (AT-ReID) necessitates the robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on pure visual features, which are prone to change due to environmental and time factors, resulting in significant performance deterioration under scenarios involving illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants and serves as the semantic driver for the model. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results.
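One way to picture text-driven visual token filtering is to score each visual token by its similarity to the text token and keep the most relevant ones. The abstract does not describe how SVTF scores or selects tokens, so the cosine-similarity scoring, the top-k rule, the `keep_ratio` parameter, and the function name below are all hypothetical and serve only to illustrate the general idea.

```python
import numpy as np

def filter_visual_tokens(visual_tokens, text_token, keep_ratio=0.5):
    """Keep the visual tokens most similar to a text embedding.

    visual_tokens: array of shape [num_tokens, dim]
    text_token:    array of shape [dim]
    Cosine similarity and top-k selection are assumptions, not the published SVTF design.
    """
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = text_token / np.linalg.norm(text_token)
    relevance = v @ t                                    # cosine similarity per token
    k = max(1, int(round(keep_ratio * len(visual_tokens))))
    keep = np.sort(np.argsort(-relevance)[:k])           # top-k, restored to original order
    return visual_tokens[keep], keep

# Toy example: tokens 0 and 2 align with the text direction, tokens 1 and 3 do not.
text = np.array([1.0, 0.0, 0.0])
tokens = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.0]])
kept_tokens, kept_idx = filter_visual_tokens(tokens, text, keep_ratio=0.5)
```

Hard top-k selection is the simplest variant; a soft re-weighting of tokens by the same relevance scores would be closer to "enhancing" informative regions rather than discarding the rest.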
Moreover, the model trained on AT-USTC was evaluated across 5 widely used ReID benchmarks, demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.
[153] Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography
Simon Böhi,Irene Cannistraci,Sergio Muñoz Gonzalez,Moritz Vandenhirtz,Sonia Laguna,Samuel Ruiperez-Campillo,Max Krähenmann,Andrea Agostini,Ece Ozkan,Thomas M. Sutter,Julia E. Vogt
Main category: cs.CV
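TL;DR: This paper proposes the Latent Attention Masked Autoencoder (LAMAE), a model designed for the multi-view, sparse, and heterogeneous spatiotemporal data of echocardiography. A latent attention mechanism fuses information across frames and views; pretrained on MIMIC-IV-ECHO, the model yields the first ICD-10 code predictions from that dataset and transfers well from adult to pediatric populations.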
Details
Motivation: Existing masked autoencoder (MAE) methods usually process images or short clips independently and cannot model the inherent multi-view structure of echocardiography, which makes coherent cardiac representations hard to obtain. Method: LAMAE adds a latent attention module to the standard MAE, enabling information exchange across frames and views in latent space; it can aggregate variable-length sequences and distinct views to reconstruct a holistic representation of cardiac function, and is pretrained on MIMIC-IV-ECHO, a large-scale, uncurated real-world dataset. Result: The paper reports the first ICD-10 code prediction results from MIMIC-IV-ECHO videos, shows that the learned representations transfer from adults to pediatric cohorts with substantial anatomical differences, and finds that structural priors such as latent attention improve robustness and transferability. Conclusion: By modeling multi-view structural priors, LAMAE improves the robustness and generalization of medical video representation learning and offers a new paradigm for multi-view medical imaging foundation models. Abstract: Echocardiography is a widely used modality for cardiac assessment due to its non-invasive and cost-effective nature, but the sparse and heterogeneous spatiotemporal views of the heart pose distinct challenges. Existing masked autoencoder (MAE) approaches typically process images or short clips independently, failing to capture the inherent multi-view structure required for coherent cardiac representation. We introduce Latent Attention Masked Autoencoder (LAMAE), a foundation model architecture tailored to the multi-view nature of medical imaging. LAMAE augments the standard MAE with a latent attention module that enables information exchange across frames and views directly in latent space. This allows the model to aggregate variable-length sequences and distinct views, reconstructing a holistic representation of cardiac function from partial observations. We pretrain LAMAE on MIMIC-IV-ECHO, a large-scale, uncurated dataset reflecting real-world clinical variability. To the best of our knowledge, we present the first results for predicting ICD-10 codes from MIMIC-IV-ECHO videos. Furthermore, we empirically demonstrate that representations learned from adult data transfer effectively to pediatric cohorts despite substantial anatomical differences. These results provide evidence that incorporating structural priors, such as multi-view attention, yields significantly more robust and transferable representations.
[154] How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos
Olga Loginova,Frank Keller
Main category: cs.CV
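TL;DR: This paper proposes PIE-V, a framework that uses psychologically inspired error injection and correction to generate controlled, human-plausible procedural mistakes and recoveries in egocentric video, and builds a unified evaluation scheme to support verification of mistake detection and correction.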
Details
Motivation: Existing procedural datasets lack natural, consistent, and interpretable human mistakes and recovery traces. In egocentric video in particular, mistakes are often occluded by hands and revealed only through subtle object state changes, which makes them hard to model and evaluate. Method: PIE-V combines a psychology-informed error planner (conditioned on procedure phase and semantic step load), a correction planner, an LLM writer that performs cascade-consistent rewrites, an LLM judge, and text-guided video segment synthesis and stitching to inject controlled mistakes into real egocentric videos and generate plausible recovery segments. Result: Across 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections; the authors introduce a unified taxonomy and a human rubric with nine metrics, audit several existing resources, and compare against a freeform LLM baseline. Conclusion: PIE-V provides a scalable construction framework and a reproducible evaluation benchmark for mistake-aware modeling of egocentric procedural video, supporting post-completion verification for mistake detection and correction. Abstract: Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified taxonomy and a human rubric with nine metrics that cover step-level and procedure-level quality, including plausibility, procedure logic with annotator confidence, state change coherence, and grounding between text and video. Using this protocol, we audit several existing resources and compare PIE-V against a freeform LLM generation baseline under the same criteria.
Together, the framework and rubric support post-completion verification for egocentric procedural mistake detection and correction.
[155] KVNN: Learnable Multi-Kernel Volterra Neural Networks
Haoyu Yun,Hamid Krim,Yufang Bao
Main category: cs.CV
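TL;DR: This paper proposes the kernelized Volterra Neural Network (kVNN), which achieves efficient higher-order learning through a learnable multi-kernel representation, balancing expressivity and computational efficiency.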
Details
Motivation: Higher-order learning depends on compositional features, but enriching interaction representations in conventional deep models tends to increase model complexity substantially, so an approach that balances expressivity and computational efficiency is needed. Method: kVNN uses a learnable multi-kernel representation in which polynomial-kernel components of different orders model the corresponding interaction orders, with compact, learnable centers giving an order-adaptive parameterization; each layer consists of parallel branches of different orders, and its filters can directly replace standard convolutional kernels. Result: On video action recognition and image denoising, kVNN delivers competitive or better performance with markedly fewer parameters and GFLOPs, and achieves good results when trained from scratch without large-scale pretraining. Conclusion: Structured kernelized higher-order layers offer a practical path to balancing expressivity and computational cost in modern deep networks. Abstract: Higher-order learning is fundamentally rooted in exploiting compositional features. It clearly hinges on enriching the representation by more elaborate interactions of the data which, in turn, tends to increase the model complexity of conventional large-scale deep learning models. In this paper, a kernelized Volterra Neural Network (kVNN) is proposed. The key to the achieved efficiency lies in using a learnable multi-kernel representation, where different interaction orders are modeled by distinct polynomial-kernel components with compact, learnable centers, yielding an order-adaptive parameterization. Features are learned by the composition of layers, each of which consists of parallel branches of different polynomial orders, enabling kVNN filters to directly replace standard convolutional kernels within existing architectures. The theoretical results are substantiated by experiments on two representative tasks: video action recognition and image denoising. The results demonstrate favorable performance-efficiency trade-offs: kVNN consistently yields reduced model (parameters) and computational (GFLOPs) complexity with competitive and often improved performance. These results are maintained even when trained from scratch without large-scale pretraining. In summary, we substantiate that structured kernelized higher-order layers offer a practical path to balancing expressivity and computational cost in modern deep networks.
[156] Class Unlearning via Depth-Aware Removal of Forget-Specific Directions
Arman Hatami,Romina Aalishah,Ilya E. Monosov
Main category: cs.CV
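TL;DR: This paper proposes DAMP, a one-shot weight-surgery technique that requires no gradient-based optimization. Using depth-aware projection modulation, it removes forget-class-specific directions from a pretrained network, achieving more precise class forgetting while preserving performance on retained classes.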
Details
Motivation: Existing class-unlearning methods show weak selectivity, leave forget-class structure in deep representations, or rely too heavily on final-layer bias shifts, and thus do not truly erase the knowledge. Method: At each network stage, DAMP computes class prototypes, defines forget-class directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. A parameter-free, depth-aware scaling rule derived from probe separability applies smaller edits in early layers and larger edits in deeper layers; multi-class forgetting is supported through low-rank subspace removal. Result: On MNIST, CIFAR-10/100, and Tiny ImageNet, with both convolutional and transformer architectures, DAMP comes closer to the retraining gold standard than some prior methods, improves selective forgetting, better preserves retain-class accuracy, and reduces residual forget-class structure in deep layers. Conclusion: DAMP offers an efficient, general, and theoretically motivated class-unlearning approach and shows that intervening at the representation level achieves more genuine machine unlearning than tuning only the classifier head. Abstract: Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal.
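The projection-based edit described in this abstract can be sketched for a single linear operator. The sketch makes several assumptions that the abstract does not confirm: the "residual relative to retain-class prototypes" is taken to be the part of the forget prototype orthogonal to the span of the retain prototypes, the update has the form W - alpha * (W d) d^T, and `alpha` is a plain scalar standing in for the depth-aware scaling rule. Function names are hypothetical.

```python
import numpy as np

def forget_direction(forget_proto, retain_protos):
    """Unit residual of the forget prototype after removing its component
    in the span of the retain-class prototypes (rows of retain_protos)."""
    coeffs, *_ = np.linalg.lstsq(retain_protos.T, forget_proto, rcond=None)
    residual = forget_proto - retain_protos.T @ coeffs
    return residual / np.linalg.norm(residual)

def project_out(W, direction, alpha=1.0):
    """Reduce the sensitivity of a linear map W (out x in) to `direction`
    in its input space; alpha=1 removes it completely."""
    return W - alpha * np.outer(W @ direction, direction)

rng = np.random.default_rng(0)
retain_protos = rng.normal(size=(3, 8))   # 3 retain classes, 8-dim features (toy sizes)
forget_proto = rng.normal(size=8)
W = rng.normal(size=(4, 8))               # weights of the next learnable operator

d = forget_direction(forget_proto, retain_protos)
W_edit = project_out(W, d, alpha=1.0)
```

Because `d` is orthogonal to every retain prototype in this construction, the edited operator responds to the retain prototypes exactly as before, while its response along `d` is removed; a value of `alpha` below 1 would give the smaller edits the abstract describes for early layers.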
Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.
[157] OmniLight: One Model to Rule All Lighting Conditions
Youngjin Oh,Junyoung Park,Junhyeong Kwon,Nam Ik Cho
Main category: cs.CV
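TL;DR: This paper presents two strategies for lighting-related image restoration: the specialized DINOLight framework and the general-purpose OmniLight framework (with a WD-MoE module). Both achieved top-tier rankings in all three lighting-related tracks of the NTIRE 2026 Challenge.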
Details
Motivation: Real-world applications require models that handle diverse lighting domains, whereas existing methods often excel only on specific benchmarks and lack cross-domain robustness. Method: The authors build DINOLight as a specialized baseline and extend it to OmniLight, a general model trained across datasets that incorporates a Wavelet Domain Mixture-of-Experts (WD-MoE) structure. Result: Both methods achieved top-tier rankings in all three lighting-related tracks of the NTIRE 2026 Challenge, supporting their perceptual quality and generalization. Conclusion: Specialized and general architectures each have their strengths, and the data distribution strongly affects their relative performance; WD-MoE improves the generalization of cross-domain lighting restoration. Abstract: Adverse lighting conditions, such as cast shadows and irregular illumination, pose significant challenges to computer vision systems by degrading visibility and color fidelity. Consequently, effective shadow removal and ambient lighting normalization (ALN) are critical for restoring underlying image content, improving perceptual quality, and facilitating robust performance in downstream tasks. However, while achieving state-of-the-art results on specific benchmarks is a primary goal in image restoration challenges, real-world applications often demand robust models capable of handling diverse domains. To address this, we present a comprehensive study on lighting-related image restoration by exploring two contrasting strategies. We leverage a robust framework for ALN, DINOLight, as a specialized baseline to exploit the characteristics of each individual dataset, and extend it to OmniLight, a generalized alternative incorporating our proposed Wavelet Domain Mixture-of-Experts (WD-MoE) that is trained across all provided datasets. Through a comparative analysis of these two methods, we discuss the impact of data distribution on the performance of specialized and unified architectures in lighting-related image restoration. Notably, both approaches secured top-tier rankings across all three lighting-related tracks in the NTIRE 2026 Challenge, demonstrating their outstanding perceptual quality and generalization capabilities. Our codes are available at https://github.com/OBAKSA/Lighting-Restoration.
[158] An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation
Onno Niemann,Gonzalo Martínez Muñoz,Alberto Suárez Gonzalez
Main category: cs.CV
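TL;DR: This paper investigates how to obtain the benefits of Fokker-Planck (FP) equation regularization in diffusion model training at lower computational cost. An empirical analysis of several lightweight regularizers shows that they retain the generation-quality gains of FP regularization while substantially reducing compute overhead.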
Details
Motivation: Diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker-Planck (FP) equation that governs the evolution of the true data density. Penalizing this deviation directly in the objective is effective but computationally expensive, and overly strong FP constraints do not necessarily improve generation quality. Method: The authors empirically analyze several lightweight regularizers, evaluate their effect on FP residuals and generation quality, and compare them with standard FP regularization. Result: Lightweight regularizers provide FP-residual reduction and generation-quality gains comparable to standard FP regularization at substantially lower computational cost. Conclusion: The benefits of FP regularization can be obtained with simpler, more efficient penalty terms, without high computational cost, giving a more practical regularization strategy for diffusion model training. Abstract: Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker-Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at https://github.com/OnnoNiemann/fp_diffusion_analysis.
[159] Boundary-Centric Active Learning for Temporal Action Segmentation
Halil Ismail Helvaci,Sen-ching Samson Cheung
Main category: cs.CV
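TL;DR: This paper proposes B-ACT, an active learning framework that concentrates on action boundary regions and substantially improves label efficiency and performance in temporal action segmentation (TAS) under limited annotation budgets.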
Details
Motivation: Temporal action segmentation requires dense temporal annotation, but most of the annotation cost goes into identifying and refining action boundaries, which are exactly where segmentation errors concentrate and where small temporal shifts affect evaluation metrics the most. Method: B-ACT is a clip-budgeted active learning framework with a two-stage hierarchical strategy: (i) it ranks and queries unlabeled videos by predictive uncertainty; (ii) within each selected video, it detects candidate boundaries and selects the top-K for annotation using a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Only boundary frames are labeled, while training uses boundary-centered clips to exploit temporal context within the model's receptive field. Result: Extensive experiments on GTEA, 50Salads, and Breakfast show that B-ACT clearly outperforms existing TAS active learning methods and the prior state of the art under sparse annotation budgets, with the largest gains on datasets where boundary placement dominates the F1 score. Conclusion: Concentrating annotation on boundary regions greatly improves labeling efficiency and model performance, indicating that sparse but precise boundary supervision is more effective than uniform or random dense annotation. Abstract: Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Importantly, our annotation protocol requests labels for only the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets, with the largest gains on datasets where boundary placement dominates edit and overlap-based F1 scores.
[160] VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
Huawei Ji,Yuanhao Sun,Yuan Jin,Cheng Deng,Jiaxin Ding,Luoyi Fu,Xinbing Wang
Main category: cs.CV
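TL;DR: This paper proposes VisPCO, a framework that formulates visual token pruning as a Pareto configuration optimization problem, uses continuous relaxation and straight-through estimators to enable gradient-based search, and solves it with the Augmented Lagrangian method. Experiments show that it approximates the empirical Pareto frontier well, generalizes across benchmarks and models, and reveals that multi-step progressive pruning better matches the hierarchical compression structure of VLMs.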
Details
Motivation: Existing visual token pruning methods rely on predefined configurations and cannot guarantee an optimal computation-performance trade-off. Method: Visual token pruning is formulated as a Pareto configuration optimization problem; continuous relaxation and straight-through estimators make the search differentiable, and the problem is solved with the Augmented Lagrangian method; learnable kernel functions are introduced to analyze layer-wise pruning patterns. Result: On 8 visual benchmarks, the method approximates the empirical Pareto frontier obtained by grid search and generalizes well across pruning methods and VLM architectures; multi-step progressive pruning outperforms single-layer pruning. Conclusion: An automated Pareto optimization framework better balances accuracy and efficiency in visual token pruning for VLMs and shows that pruning strategies matching the model's hierarchical structure are advantageous. Abstract: Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
[161] Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization
Umer Ahmed,Syed Ahmed Mahmood,Fawad Javed Fateh,M. Shaheer Luqman,M. Zeeshan Zia,Quoc-Huy Tran
Main category: cs.CV
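TL;DR: This paper proposes a hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. Two levels of vector quantization model subaction-level and action-level representations while fusing spatial and temporal information; the method reaches state-of-the-art performance on multiple benchmarks and reduces segment length bias.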
Details
Motivation: Unsupervised skeleton-based temporal action segmentation has been limited by the lack of modeling of the hierarchical structure of actions and by reliance on spatial information alone. Method: A hierarchical vector quantization framework is proposed in which the lower level models fine-grained subactions and the higher level aggregates them into action-level representations; it is then extended to hierarchical spatiotemporal vector quantization, which jointly reconstructs skeleton poses and timestamps while performing multi-level clustering. Result: The method sets a new state of the art on multiple benchmarks, including HuGaDB, LARa, and BABEL, and reduces segment length bias. Conclusion: Hierarchical spatiotemporal vector quantization models the hierarchical structure and spatiotemporal dynamics of actions more effectively and offers clear advantages for unsupervised skeleton-based action segmentation. Abstract: We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
[162] StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
Xuanyi Liu,Deyi Ji,Chunan Yu,Qi Zhu,Xuanfu Li,Jin Ma,Tianrun Chen,Lanyun Zhu
Main category: cs.CV
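TL;DR: This paper proposes StreamCacheVGGT, a training-free cache management framework for streaming 3D reconstruction. Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC) improve the retention of geometric information, and the method reaches state-of-the-art performance on multiple benchmarks.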
Details
Motivation: Existing O(1)-memory frameworks rely on a "pure eviction" strategy, which suffers from severe information loss due to binary token deletion and from activation noise introduced by localized, single-layer scoring, making it hard to reconstruct dense 3D geometry stably under constant memory. Method: StreamCacheVGGT has two modules: (1) CLCES tracks token importance trajectories across Transformer layers and uses order-statistical analysis to identify sustained geometric salience and suppress activation noise; (2) HCC uses the CLCES scores in a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold, achieving non-destructive compression. Result: On five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI), StreamCacheVGGT clearly improves reconstruction accuracy and long-term stability under strictly constant compute and memory cost, setting a new state of the art. Conclusion: Through cooperating cross-layer scoring and hybrid compression, StreamCacheVGGT mitigates information loss and noise in streaming 3D reconstruction and offers a scalable approach to visual geometry modeling under constant resource constraints. Abstract: Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
[163] TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
Jiawei Ren,Michal Jan Tyszkiewicz,Jiahui Huang,Zan Gojcic
Main category: cs.CV
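TL;DR: This paper proposes TokenGS, a Transformer-based method for feed-forward prediction of 3D Gaussian Splatting (3DGS). It regresses 3D mean coordinates directly and uses an encoder-decoder architecture with learnable Gaussian tokens, improving robustness, flexibility, and reconstruction quality.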
Details
Motivation: Existing methods regress Gaussian means as depths along camera rays, which ties them to the input image resolution and number of views and makes them sensitive to pose noise and multiview inconsistencies. Method: Depth regression is dropped in favor of directly regressing Gaussian mean coordinates in 3D space; an encoder-decoder Transformer architecture represents the scene with learnable Gaussian tokens and is trained with only a self-supervised rendering loss. Result: TokenGS achieves state-of-the-art feed-forward reconstruction on both static and dynamic scenes, with more regular geometry and a more balanced 3DGS distribution, and it naturally recovers emergent attributes such as static-dynamic decomposition and scene flow. Conclusion: Direct 3D coordinate regression and a tokenized Gaussian representation substantially improve the expressiveness, robustness, and generalization of feed-forward 3DGS modeling, pointing to a new approach for efficient and flexible neural scene representation. Abstract: In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.
[164] SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation
Tianhao Fu,Austin Wang,Charles Chen,Roby Aldave-Garza,Yucheng Chen
Main category: cs.CV
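TL;DR: This paper proposes SegWithU, a lightweight, single-forward-pass, post-hoc uncertainty estimation framework that improves the reliability of medical image segmentation without repeated inference and without harming segmentation performance.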
Details
Motivation: Reliable uncertainty estimation in medical image segmentation is essential for downstream quantification and clinical decision support, but existing methods either need multiple inference passes or, in the single-pass case, perform weakly or depend on strong assumptions. Method: SegWithU attaches a lightweight uncertainty head to a frozen pretrained segmentation backbone, taps intermediate features, and models perturbation energy in a compact probe space with rank-1 posterior probes, producing two voxel-wise uncertainty maps: a calibration-oriented map for probability calibration and a ranking-oriented map for error detection and selective prediction. Result: On ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most stable single-forward-pass method, with AUROC/AURC of 0.9838/2.4885, 0.9946/0.2660, and 0.9925/0.8193, respectively, while preserving segmentation quality. Conclusion: Perturbation-based uncertainty modeling is an effective and practical route to reliable medical image segmentation. Abstract: Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present SegWithU, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of 0.9838/2.4885, 0.9946/0.2660, and 0.9925/0.8193, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.
[165] Why Do Vision Language Models Struggle To Recognize Human Emotions?
Madhav Agarwal,Sotirios A. Tsaftaris,Laura Sevilla-Lara,Steven McDonagh
Main category: cs.CV
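TL;DR: This paper examines why vision-language models (VLMs) perform poorly at human emotion recognition and identifies two weaknesses: bias from long-tailed emotion data and an inability to model the fine-grained temporal information that micro-expressions require. It proposes improved sampling strategies and a multi-stage context enrichment method to improve VLM performance on dynamic facial expression recognition (DFER).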
Details
Motivation: Although VLMs have made notable progress on many visual tasks, they still lag behind specialized vision models on emotion recognition, especially dynamic facial expression recognition (DFER); this paper investigates the underlying causes. Method: Two key failure causes are analyzed: 1) long-tailed data distributions produce a head-class bias, which alternative sampling strategies are proposed to mitigate; 2) VLMs struggle with temporal information over dense frame sequences, so a multi-stage context enrichment strategy converts the "in-between" frames into natural language summaries and feeds them to the VLM together with sparse keyframes. Result: The paper exposes structural reasons for the limited performance of VLMs on emotion recognition and shows on DFER that the context enrichment strategy preserves the emotional trajectory and avoids attentional dilution, improving recognition. Conclusion: Current VLM architectures have inherent limitations for emotion recognition and need targeted changes to data sampling and temporal modeling; supplying temporal context in language form is a feasible and effective enhancement. Abstract: Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous and dynamic task of dynamic facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal.
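The mismatch between sparse sampling and micro-expressions can be made concrete with a small calculation. The sketch below is an illustration only: it assumes frames are taken at a fixed interval and that the onset of the expression is uniformly random relative to the sampling grid, and the clip length and frame count are hypothetical values, not settings from the paper.

```python
def hit_probability(clip_seconds, num_frames, expr_seconds):
    """Chance that at least one evenly spaced frame falls inside a brief expression.

    Assumes a fixed sampling interval and a uniformly random onset,
    ignoring edge effects at the start and end of the clip.
    """
    if clip_seconds <= 0 or num_frames <= 0 or expr_seconds < 0:
        raise ValueError("durations must be non-negative and counts positive")
    interval = clip_seconds / num_frames  # seconds between sampled frames
    return min(1.0, expr_seconds / interval)


if __name__ == "__main__":
    # Hypothetical setting: a 10-second clip reduced to 8 keyframes.
    for duration in (0.25, 0.5):
        p = hit_probability(clip_seconds=10.0, num_frames=8, expr_seconds=duration)
        print(f"{duration:.2f}s expression: hit probability {p:.2f}")
```

Under these assumptions the sampling interval is 1.25 seconds, so most micro-expressions in the 0.25-0.5 second range fall entirely between keyframes.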
As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
[166] R3D: Revisiting 3D Policy Learning
Zhengdong Hong,Shenrui Wu,Haozhe Cui,Boyi Zhao,Ran Ji,Yiyang He,Hangxing Zhang,Zundong Ke,Jun Wang,Guofeng Zhang,Jiayuan Gu
Main category: cs.CV
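TL;DR: This paper proposes a new architecture that couples a scalable Transformer-based 3D encoder with a diffusion decoder. By introducing 3D data augmentation and avoiding the adverse effects of Batch Normalization, it addresses training instability and overfitting in 3D policy learning and significantly outperforms existing methods on manipulation benchmarks.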
Details
Motivation: Training instability and severe overfitting make it hard for 3D policy learning to adopt powerful 3D perception models, holding back its generalization and cross-embodiment transfer. Method: The failures are diagnosed systematically, with the lack of 3D data augmentation and the adverse effects of Batch Normalization identified as the main causes; a new architecture is proposed that combines a scalable Transformer-based 3D encoder with a diffusion decoder, designed for stability and for use of large-scale pre-training. Result: The approach significantly outperforms existing 3D baselines on challenging manipulation benchmarks. Conclusion: The method establishes a new and robust foundation for scalable 3D imitation learning. Abstract: 3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/
[167] GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
Roni Itkin,Noam Issachar,Yehonatan Keypur,Anpei Chen,Sagie Benaim
Main category: cs.CV
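TL;DR: This paper proposes GlobalSplat, a framework that follows an "align first, decode later" strategy to learn a compact, globally consistent latent scene representation. It avoids the redundancy and inconsistency that pixel or voxel alignment causes in earlier methods, and it keeps high-quality novel-view synthesis while sharply reducing the number of Gaussians (as few as 16K) and the model footprint (4MB) and speeding up inference (under 78 ms).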
Details
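Motivation: Existing approaches to allocating spatial primitives in 3D Gaussian Splatting, whether iterative optimization or feed-forward inference, face sharp trade-offs among representation compactness, reconstruction speed, and rendering fidelity, mainly because they rely on local heuristics with no global scene awareness; feed-forward methods in particular accumulate redundancy and have fragile cross-view consistency because they are pixel-aligned or voxel-aligned. Method: GlobalSplat follows an "align first, decode later" paradigm: it first learns a compact, global latent scene representation that encodes the multi-view input and resolves cross-view correspondences, and only then decodes explicit 3D geometry; a coarse-to-fine training curriculum gradually increases decoded capacity and natively prevents representation bloat; the method does not depend on pretrained pixel-prediction backbones or on latent features from dense baselines. Result: On RealEstate10K and ACID the model reaches competitive novel-view synthesis performance with only about 16K Gaussians and a footprint as low as 4MB; a single forward pass takes under 78 milliseconds, significantly faster than the baselines. Conclusion: With a global latent representation and progressive decoding, GlobalSplat addresses the redundancy and inconsistency of primitive allocation in 3D Gaussian Splatting and achieves a better balance of compactness, efficiency, and quality, offering a new approach to lightweight, efficient neural rendering. Abstract: The efficient spatial allocation of primitives serves as the foundation of 3D Gaussian Splatting, as it directly dictates the synergy between representation compactness, reconstruction speed, and rendering fidelity. Previous solutions, whether based on iterative optimization or feed-forward inference, suffer from significant trade-offs between these goals, mainly due to the reliance on local, heuristic-driven allocation strategies that lack global scene awareness. Specifically, current feed-forward methods are largely pixel-aligned or voxel-aligned. By unprojecting pixels into dense, view-aligned primitives, they bake redundancy into the 3D asset. As more input views are added, the representation size increases and global consistency becomes fragile. To this end, we introduce GlobalSplat, a framework built on the principle of align first, decode later. Our approach learns a compact, global, latent scene representation that encodes multi-view input and resolves cross-view correspondences before decoding any explicit 3D geometry. Crucially, this formulation enables compact, globally consistent reconstructions without relying on pretrained pixel-prediction backbones or reusing latent features from dense baselines. Utilizing a coarse-to-fine training curriculum that gradually increases decoded capacity, GlobalSplat natively prevents representation bloat.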
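The growth of pixel-aligned representations with the number of input views can be illustrated with simple arithmetic. The sketch below assumes one Gaussian per input pixel and a hypothetical 256x256 input resolution; neither figure is taken from the paper, and the fixed budget of 16,000 is only a rounded reading of the "16K Gaussians" that the entry reports.

```python
def pixel_aligned_gaussians(num_views, height, width, per_pixel=1):
    """Primitive count when every input pixel is unprojected into Gaussians.

    per_pixel=1 is an assumption made for illustration.
    """
    return num_views * height * width * per_pixel


# Rounded reading of the "16K Gaussians" figure; the exact value is assumed.
FIXED_BUDGET = 16_000

if __name__ == "__main__":
    # Hypothetical 256x256 inputs; the count grows linearly with the views.
    for views in (2, 4, 8):
        dense = pixel_aligned_gaussians(views, 256, 256)
        ratio = dense / FIXED_BUDGET
        print(f"{views} views: {dense} pixel-aligned vs {FIXED_BUDGET} fixed "
              f"({ratio:.1f}x)")
```

The pixel-aligned count scales linearly with the number of views, while a token-based budget stays constant regardless of how many views are supplied.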
On RealEstate10K and ACID, our model achieves competitive novel-view synthesis performance while utilizing as few as 16K Gaussians, significantly fewer than required by dense pipelines, obtaining a light 4MB footprint. Further, GlobalSplat enables significantly faster inference than the baselines, operating under 78 milliseconds in a single forward pass. Project page is available at https://r-itk.github.io/globalsplat/
[168] AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving
Fabrizio Genilotti,Arianna Stropeni,Gionata Grotto,Francesco Borsatti,Manuel Barusco,Davide Dalle Pezze,Gian Antonio Susto
Main category: cs.CV
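TL;DR: This paper explores visual anomaly detection (VAD) for autonomous driving as a way to improve reliability and safety in situations not covered by the training data. Eight state-of-the-art methods are benchmarked on the AnoVox dataset, and Tiny-Dinomaly is found to offer the best balance of accuracy and efficiency, making it suitable for edge deployment.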
Details
Motivation: The perception of autonomous driving systems tends to degrade when they meet anomalous obstacles unseen in training, and such failures translate directly into physical safety risk, so techniques are needed that can detect unknown anomalies and localize the regions of concern. Method: Eight recent visual anomaly detection (VAD) methods are benchmarked on AnoVox, a large-scale synthetic dataset; the evaluation covers four backbone architectures, including MobileNet and DeiT-Tiny, and focuses on pixel-level anomaly localization and feasibility of edge deployment. Result: VAD methods transfer well to road scenes; Tiny-Dinomaly keeps full-scale localization accuracy while substantially reducing memory cost, giving the best accuracy-efficiency trade-off. Conclusion: VAD is an effective means of improving the robustness and safety of autonomous driving systems; Tiny-Dinomaly is a practical option for real-time anomaly perception on edge devices and supports more responsible deployment of autonomous driving. Abstract: The reliability of a machine vision system for autonomous driving depends heavily on its training data distribution. When a vehicle encounters significantly different conditions, such as atypical obstacles, its perceptual capabilities can degrade substantially. Unlike many domains where errors carry limited consequences, failures in autonomous driving translate directly into physical risk for passengers, pedestrians, and other road users. To address this challenge, we explore Visual Anomaly Detection (VAD) as a solution. VAD enables the identification of anomalous objects not present during training, allowing the system to alert the driver when an unfamiliar situation is detected. Crucially, VAD models produce pixel-level anomaly maps that can guide driver attention to specific regions of concern without requiring any prior assumptions about the nature or form of the hazard. We benchmark eight state-of-the-art VAD methods on AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving. In particular, we evaluate performance across four backbone architectures spanning from large networks to lightweight ones such as MobileNet and DeiT-Tiny. Our results demonstrate that VAD transfers effectively to road scenes. Notably, Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching full-scale localization performance at a fraction of the memory cost.
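As a rough illustration of what a pixel-level anomaly map is, the sketch below scores each spatial location by the cosine distance between a reference feature vector and a reconstructed one. This is one common construction in feature-reconstruction approaches and is assumed here for illustration; the abstract does not specify how the benchmarked methods compute their maps, and the function names and toy features are hypothetical.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """1 - cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def anomaly_map(reference, reconstructed):
    """Per-location anomaly scores for two H x W grids of feature vectors.

    Locations where the reconstruction departs from the reference
    receive higher scores.
    """
    return [
        [cosine_distance(u, v) for u, v in zip(ref_row, rec_row)]
        for ref_row, rec_row in zip(reference, reconstructed)
    ]


if __name__ == "__main__":
    # Toy 1 x 2 feature grid with 2-dimensional features (hypothetical data).
    reference = [[[1.0, 0.0], [0.0, 1.0]]]
    reconstructed = [[[1.0, 0.0], [1.0, 0.0]]]  # second location is off
    print(anomaly_map(reference, reconstructed))
```

Thresholding such a map yields the regions that would be flagged for attention; the threshold itself would have to be chosen per deployment.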
This study represents a concrete step toward safer, more responsible deployment of autonomous vehicles, ultimately improving protection for passengers, pedestrians, and all road users.
[169] MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Yan Li,Zezi Zeng,Yifan Yang,Yuqing Yang,Ning Liao,Weiwei Guo,Lili Qiu,Mingxi Cheng,Qi Dai,Zhendong Wang,Zhengyuan Yang,Xue Yang,Ji Li,Lijuan Wang,Chong Luo
Main category: cs.CV
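TL;DR: This paper proposes MM-WebAgent, a hierarchical agent framework for multimodal webpage generation that coordinates AIGC element generation through hierarchical planning and self-reflection to improve global consistency and visual coherence, and it introduces a new benchmark and evaluation protocol.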
Details
Motivation: Integrating existing AIGC tools directly into automated webpage generation tends to produce inconsistent styles and poor global coherence, because each element is generated in isolation. Method: MM-WebAgent, a hierarchical agentic framework, combines hierarchical planning with iterative self-reflection to jointly optimize global layout, local multimodal content, and their integration; a benchmark for multimodal webpage generation and a multi-level evaluation protocol are also introduced. Result: MM-WebAgent clearly outperforms code-generation and agent-based baselines on multimodal element generation and integration. Conclusion: MM-WebAgent improves the visual consistency and global coherence of generated webpages, supporting the value of hierarchical coordination and self-reflection for multimodal webpage generation. Abstract: The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.
[170] AnimationBench: Are Video Models Good at Character-Centric Animation?
Leyi Wu,Pengjun Fang,Kai Sun,Yazhou Xing,Yinwei Wu,Songsong Wang,Ziqi Huang,Dan Zhou,Yingqing He,Ying-Cong Chen,Qifeng Chen
Main category: cs.CV
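TL;DR: This paper proposes AnimationBench, the first systematic benchmark for animation-style image-to-video generation. It builds quantifiable metrics from the Twelve Basic Principles of Animation and IP preservation, supports both closed-set and open-set evaluation, and markedly improves the ability to discriminate animation generation quality.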
Details
Motivation: Existing video generation benchmarks are designed mainly for realistic video and struggle to evaluate animation-style generation (exaggerated motion, character consistency, stylized appearance); they also rely on fixed prompt sets and rigid pipelines, leaving little flexibility for open-domain content and custom evaluation. Method: The AnimationBench benchmark turns the Twelve Basic Principles of Animation and IP preservation into measurable dimensions and adds Broader Quality Dimensions (semantic consistency, motion rationality, camera motion consistency); it supports standardized closed-set evaluation and flexible open-set diagnostic evaluation, and uses vision-language models for scalable automated assessment. Result: Experiments show that AnimationBench agrees closely with human judgment and reveals animation-specific quality problems missed by realism-oriented benchmarks, making the evaluation of state-of-the-art I2V models more informative and discriminative. Conclusion: AnimationBench fills the gap in evaluating animation-style video generation and gives the field a more specialized, flexible evaluation framework that matches human perception. Abstract: Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. Moreover, they also rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized close-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlooked by realism-oriented benchmarks, leading to more informative and discriminative evaluation of state-of-the-art I2V models.
[171] Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation
Yiyang Jiang,Li Zhang,Xiao-Yong Wei,Li Qing
Main category: cs.CV
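TL;DR: This paper proposes a reasoning-driven sign language translation (SLT) framework that introduces an explicit sequence of latent thoughts as a middle layer between video and text and decodes with a plan-then-ground strategy, markedly improving the coherence and faithfulness of translations. It also releases a larger gloss-free SLT dataset with stronger context dependencies.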
Details
Motivation: Existing SLT systems implicitly assume that chunks of signing map directly to spoken-language words, but signers often build meaning on the fly from context, space, and movement, so the assumption breaks down; SLT therefore needs to be reframed as a cross-modal reasoning task. Method: A reasoning-driven SLT framework is proposed: 1) an ordered sequence of latent thoughts serves as an explicit intermediate representation between video and text; 2) a plan-then-ground decoding scheme first produces a semantic plan and then looks back at the video for supporting evidence. Result: The method consistently outperforms existing gloss-free SLT methods on several benchmarks; a new large-scale gloss-free SLT dataset with strong context dependencies and realistic meanings is built and released. Conclusion: SLT is fundamentally cross-modal reasoning rather than plain video-to-text conversion; an explicit intermediate reasoning layer and staged decoding improve translation quality and interpretability. Abstract: Many SLT systems quietly assume that brief chunks of signing map directly to spoken-language words. That assumption breaks down because signers often create meaning on the fly using context, space, and movement. We revisit SLT and argue that it is mainly a cross-modal reasoning task, not just a straightforward video-to-text conversion. We thus introduce a reasoning-driven SLT framework that uses an ordered sequence of latent thoughts as an explicit middle layer between the video and the generated text. These latent thoughts gradually extract and organize meaning over time. On top of this, we use a plan-then-ground decoding method: the model first decides what it wants to say, and then looks back at the video to find the evidence. This separation improves coherence and faithfulness. We also built and released a new large-scale gloss-free SLT dataset with stronger context dependencies and more realistic meanings. Experiments across several benchmarks show consistent gains over existing gloss-free methods. Code and data will be released upon acceptance at https://github.com/fletcherjiang/SignThought.
[172] RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
Hao Gao,Shaoyu Chen,Yifan Zhu,Yuehao Song,Wenyu Liu,Qian Zhang,Xinggang Wang
Main category: cs.CV
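TL;DR: This paper proposes RAD-2, a closed-loop motion planning framework that pairs a diffusion generator with a discriminator optimized by reinforcement learning. By decoupling generation from evaluation, introducing temporally consistent policy optimization and on-policy generator optimization, and using the BEV-Warp simulation environment, it markedly improves planning robustness and safety.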
Details
Motivation: Existing diffusion-based planners suffer from stochastic instability in closed-loop interaction and lack corrective negative feedback, making it hard to combine multimodal modeling with robustness. Method: The RAD-2 framework comprises: 1) a diffusion generator that produces diverse trajectories; 2) an RL-optimized discriminator that reranks them by long-term driving quality; 3) Temporally Consistent Group Relative Policy Optimization (TC-GRPO), which eases the credit assignment problem; 4) On-policy Generator Optimization (OGO), which converts closed-loop feedback into longitudinal optimization signals; 5) the BEV-Warp simulation environment, which enables efficient closed-loop evaluation in Bird's-Eye View feature space. Result: The collision rate drops by 56% relative to strong diffusion-based planners; real-vehicle deployment confirms improved perceived safety and driving smoothness. Conclusion: Through generator-discriminator cooperation, temporally guided RL optimization, and efficient simulation, RAD-2 addresses the stability and performance bottlenecks of diffusion planners in closed-loop driving and provides a more robust motion planning solution for high-level autonomous driving. Abstract: High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners.
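The generate-then-rerank pattern and the group-relative normalization that GRPO-style methods build on can be sketched in a few lines. This is a generic illustration under stated assumptions, not the RAD-2 implementation: the scoring function stands in for the learned discriminator, the reward values are made up, and the temporal-consistency mechanism of TC-GRPO is not modeled because the abstract does not describe it in detail.

```python
from statistics import mean, pstdev


def select_best(candidates, score_fn):
    """Rerank step: return the candidate with the highest discriminator score.

    score_fn is a placeholder for a learned discriminator.
    """
    return max(candidates, key=score_fn)


def group_relative_advantages(rewards, eps=1e-8):
    """Standard group-relative advantage: centre and scale rewards within a group.

    Each rollout is compared against the other rollouts in its group,
    so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical trajectory candidates with made-up quality scores.
    scores = {"keep_lane": 0.7, "overtake": 0.9, "hard_brake": 0.2}
    print(select_best(list(scores), scores.get))

    # Hypothetical closed-loop rewards for a group of four rollouts.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Rollouts that beat the group average receive positive advantages and the rest negative ones, which is the signal a policy-gradient update would then use.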
Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
[173] TokenLight: Precise Lighting Control in Images using Attribute Tokens
Sumit Chaturvedi,Yannick Hold-Geoffroy,Mengwei Ren,Jingyuan Liu,He Zhang,Yiqun Mei,Julie Dorsey,Zhixin Shu
Main category: cs.CV
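TL;DR: This paper proposes an image relighting method based on attribute tokens that gives continuous, precise control over multiple lighting attributes (such as intensity, color, ambient illumination, diffuse level, and 3D light position). Without explicit inverse rendering supervision, it shows an implicit understanding of how light interacts with geometry and materials and reaches state-of-the-art results.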
Details
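Motivation: Existing image relighting methods struggle to control several lighting attributes precisely and continuously at the same time, and they depend on inverse rendering supervision or generalize poorly; a more flexible, robust, end-to-end approach that avoids complex physical modeling is needed. Method: Relighting is formulated as conditional image generation, with learnable attribute tokens that encode each lighting factor separately; the model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real images to improve realism and generalization. Result: The method achieves state-of-the-art quantitative and qualitative results on both synthetic and real images; it handles difficult cases plausibly, such as lights placed inside objects and relighting of transparent materials; the model implicitly learns how light interacts with geometry, occlusion, and materials. Conclusion: Attribute tokens effectively disentangle and control multiple lighting factors, and combined with training on synthetic plus real data they give the model strong physical consistency and generalization without inverse rendering supervision, offering a new approach to controllable image relighting. Abstract: This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly. Project page: vrroom.github.io/tokenlight/
[174] LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories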
Zhanhao Liang,Tao Yang,Jie Wu,Chengjian Feng,Liang Zheng
Main category: cs.CV
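TL;DR: This paper proposes LeapAlign, which shortens the generation trajectory of flow matching models with two designed leaps, lowering memory cost and stably propagating reward gradients to early generation steps, and markedly improving image quality and image-text alignment.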
Details
Motivation: Methods that backpropagate reward gradients directly through the differentiable generation process incur high memory cost and gradient explosion, so they cannot effectively update the early generation steps that determine an image's global structure. Method: LeapAlign compresses the long ODE sampling trajectory into a two-step path of "leaps", each skipping multiple steps and predicting future latents in a single step; randomizing the start and end timesteps of the leaps allows efficient, stable updates at any generation step; consistency-based trajectory weighting and down-weighting of large-magnitude gradient terms improve training stability. Result: When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across multiple metrics, markedly improving image quality and image-text alignment. Conclusion: LeapAlign is an efficient, stable, and scalable alignment method for flow matching models that removes a key bottleneck in propagating gradients through long trajectories and offers a new approach to preference-based optimization of generative models. Abstract: This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.
[175] Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
Ninghui Xu,Fabio Tosi,Lihui Wang,Jiawei Han,Luca Bartolomei,Zhiting Yao,Matteo Poggi,Stefano Mattoccia
Main category: cs.CV
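TL;DR: This paper proposes Bi-CMPStereo, a bidirectional cross-modal prompting framework for asymmetric stereo matching between an event camera and a frame camera; it aligns representations in a unified canonical space and projects modality information in both directions, improving the robustness of 3D perception in dynamic scenes.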