cs.CL
[1] Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models
Andrew Kiruluta
Main category: cs.CL
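TL;DR: This paper proposes a compressed-sensing-based framework for dynamic large language model (LLM) execution. Through random measurements, sparse recovery, and compilation into hardware-friendly sparse execution paths, it jointly compresses the model and the prompt in a task-conditioned, token-adaptive way, balancing accuracy, speed, and deployment efficiency.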
Details
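Motivation: Existing model-compression methods are mostly static and optimized offline, and do not exploit the fact that different prompts and decoding steps activate different computational pathways. Prompt-compression methods shorten the sequence but do not adjust the model subnetwork that is actually executed. Because the two lines of work are disconnected, it is hard to achieve both accuracy and real-time performance.
Method: Builds a unified compressed-sensing-guided framework: random measurement operators probe the model's latent computation usage; sparse recovery estimates task-conditioned, token-adaptive support sets; the recovered supports are compiled into hardware-efficient (e.g., GPU-friendly) sparse execution paths covering blocks, attention heads, channels, and feed-forward substructures. The framework introduces task-conditioned measurements, token-adaptive recovery, theoretical sample-complexity bounds, hardware compilation constraints, and a joint prompt-model optimization objective.
Result: Enables dynamic, fine-grained (block/head/channel-level), hardware-friendly sparse inference that markedly reduces memory use and decoding latency while preserving accuracy; offers a new LLM inference paradigm with explicit approximation guarantees and deployment-oriented speedup constraints.
Conclusion: The framework recasts LLM inference as a measurement-and-recovery problem, is the first to systematically combine prompt compression with model pruning, supports runtime-adaptive sparse execution, and provides theoretical support and a practical engineering path for efficient, deployable LLM inference.
Abstract: Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial compression, while prompt-compression methods reduce latency by removing redundant input tokens. However, these two directions remain largely separate. Most model-compression methods are static and optimized offline, and they do not exploit the fact that different prompts and decoding steps activate different latent computational pathways. Prompt-compression methods reduce sequence length, but they do not adapt the executed model subnetwork. We propose a unified compressed-sensing-guided framework for dynamic LLM execution. Random measurement operators probe latent model usage, sparse recovery estimates task-conditioned and token-adaptive support sets, and the recovered supports are compiled into hardware-efficient sparse execution paths over blocks, attention heads, channels, and feed-forward substructures. The framework introduces five key contributions: task-conditioned measurements, so different prompts induce different sparse supports; token-adaptive recovery, so active substructures are re-estimated during decoding; formal sample-complexity bounds under restricted isometry or mutual incoherence assumptions; compile-to-hardware constraints that restrict recovery to GPU-efficient structures; and a joint objective that unifies prompt compression with model reduction. Together, these components recast LLM inference as a measurement-and-recovery problem with explicit approximation guarantees and deployment-oriented speedup constraints.
[2] MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios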
Yihang Ding,Wanke Xia,Yiting Zhao,Jinbo Su,Jialiang Yang,Zhengbo Zhang,Ke Wang,Wenming Yang
Main category: cs.CL
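TL;DR: This paper proposes MemGround, a long-term memory benchmark built on rich, gamified interactive scenarios. It uses a three-tier framework to evaluate surface state memory, temporal associative memory, and reasoning-based memory, and introduces multi-dimensional metrics to quantify memory utilization and behavioral trajectories. Experiments show that current LLMs still struggle with dynamic tracking, temporal event association, and reasoning over long-term evidence.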
Details
Motivation: Existing evaluations of long-term memory in LLMs are too static: they focus on simple retrieval and short-context inference and neglect the dynamic state tracking and hierarchical reasoning that complex memory systems require.
Method: Proposes the MemGround benchmark, comprising a three-tier evaluation framework (Surface State Memory, Temporal Associative Memory, Reasoning-Based Memory) and multi-dimensional metrics (QA Overall, MFU, MFCO, ETD), for systematic evaluation in gamified interactive scenarios.
Result: Experiments show that state-of-the-art LLMs and memory agents perform poorly at sustained dynamic tracking, temporal event association, and complex reasoning over long-term accumulated evidence.
Conclusion: MemGround provides a more comprehensive, dynamic, and interactive benchmark for long-term memory evaluation and exposes key shortcomings of existing models on realistic, complex memory tasks.
Abstract: Current evaluations of long-term memory in LLMs are fundamentally static. By fixating on simple retrieval and short-context inference, they neglect the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning in continuous interactions. To overcome these limitations, we propose MemGround, a rigorous long-term memory benchmark natively grounded in rich, gamified interactive scenarios. To systematically assess these capabilities, MemGround introduces a three-tier hierarchical framework that evaluates Surface State Memory, Temporal Associative Memory, and Reasoning-Based Memory through specialized interactive tasks. Furthermore, to comprehensively quantify both memory utilization and behavioral trajectories, we propose a multi-dimensional metric suite comprising Question-Answer Score (QA Overall), Memory Fragments Unlocked (MFU), Memory Fragments with Correct Order (MFCO), and Exploration Trajectory Diagrams (ETD). Extensive experiments reveal that state-of-the-art LLMs and memory agents still struggle with sustained dynamic tracking, temporal event association, and complex reasoning derived from long-term accumulated evidence in interactive environments.
[3] HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization
Baocai Shan,Yuzhuang Xu,Wanxiang Che
Main category: cs.CL
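TL;DR: This paper proposes HUOZIIME, a personalized, privacy-preserving, real-time mobile input method editor (IME) built on a lightweight large language model (LLM); it achieves efficient personalized text generation through fine-tuning on synthetic data and a hierarchical memory mechanism.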
Details
Motivation: Existing mobile IMEs are limited to manual typing and struggle to generate personalized text; lightweight LLMs make on-device generation feasible, but combining personalization, privacy protection, and real-time performance remains a challenge.
Method: (1) Post-train a base LLM on synthesized personalization data to give it initial human-like prediction ability; (2) design a hierarchical memory mechanism that continually captures and uses the user's input history; (3) apply systematic optimizations for mobile deployment to ensure low latency and high responsiveness.
Result: Experiments show that HUOZIIME runs efficiently on device and achieves high-fidelity, memory-driven personalized text generation.
Conclusion: HUOZIIME offers a feasible technical path and a practical example for building truly personalized, privacy-first, real-time on-device generative IMEs.
Abstract: Mobile input method editors (IMEs) are the primary interface for text input, yet they remain constrained to manual typing and struggle to produce personalized text. While lightweight large language models (LLMs) make on-device auxiliary generation feasible, enabling deeply personalized, privacy-preserving, and real-time generative IMEs poses fundamental challenges. To this end, we present HUOZIIME, a personalized on-device IME powered by an LLM. We endow HUOZIIME with initial human-like prediction ability by post-training a base LLM on synthesized personalization data. Notably, a hierarchical memory mechanism is designed to continually capture and leverage user-specific input history. Furthermore, we perform systemic optimizations tailored to on-device LLM-based IME deployment, ensuring efficient and responsive operation under mobile constraints. Experiments demonstrate efficient on-device execution and high-fidelity memory-driven personalization. Code and package are available at https://github.com/Shan-HIT/HuoziIME.
[4] Can Large Language Models Detect Methodological Flaws? Evidence from Gesture Recognition for UAV-Based Rescue Operation Based on Deep Learning
Domonkos Varga
Main category: cs.CL
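TL;DR: This paper examines whether large language models (LLMs) can act as independent analytical agents that identify common methodological flaws in machine learning papers, such as data leakage, in order to improve reproducibility and scientific auditing.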
Details
Motivation: Reliable evaluation is essential to machine learning research, yet methodological flaws, especially data leakage, keep undermining the validity of results; the paper tests whether LLMs can identify such problems on their own from the published paper alone.
Method: Using a gesture-recognition paper as a case study, the author first manually analyzes the subject-level data leakage in its evaluation protocol, then has six state-of-the-art LLMs analyze the paper independently, without prior context and with an identical prompt, and assesses whether they consistently identify the flaw.
Result: All six LLMs consistently flag an evaluation flaw caused by non-independent training and test sets, citing overlapping learning curves, a minimal generalization gap, and near-100% accuracy as evidence.
Conclusion: LLMs can identify common methodological problems from public paper content alone; they cannot replace human review, but they can serve as a useful complement for improving reproducibility and supporting scientific auditing.
Abstract: Reliable evaluation is essential in machine learning research, yet methodological flaws, particularly data leakage, continue to undermine the validity of reported results. In this work, we investigate whether large language models (LLMs) can act as independent analytical agents capable of identifying such issues in published studies. As a case study, we analyze a gesture-recognition paper reporting near-perfect accuracy on a small, human-centered dataset. We first show that the evaluation protocol is consistent with subject-level data leakage due to non-independent training and test splits. We then assess whether this flaw can be detected independently by six state-of-the-art LLMs, each analyzing the original paper without prior context using an identical prompt. All models consistently identify the evaluation as flawed and attribute the reported performance to non-independent data partitioning, supported by indicators such as overlapping learning curves, minimal generalization gap, and near-perfect classification results. These findings suggest that LLMs can detect common methodological issues based solely on published artifacts. While not definitive, their consistent agreement highlights their potential as complementary tools for improving reproducibility and supporting scientific auditing.
[5] Decoupling Scores and Text: The Politeness Principle in Peer Review
Yingxuan Wen
Main category: cs.CL
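TL;DR: This paper studies how authors interpret peer review feedback and finds that numerical scores predict paper acceptance more accurately than review text, revealing that a pervasive Politeness Principle in review text makes it hard for authors to infer the actual outcome from the wording.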
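This entry's abstract characterizes the misclassified submissions by the kurtosis and skewness of their review scores. As a rough illustration only (not the authors' code), the sketch below computes both shape statistics from population moments; the function name `score_shape` and the example scores are assumptions made here.

```python
def score_shape(scores):
    """Return (mean, skewness, excess kurtosis) from population moments."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    if m2 == 0:
        raise ValueError("all scores are identical; shape is undefined")
    m3 = sum((s - mean) ** 3 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return mean, skew, excess_kurt

# Hypothetical borderline submission: four reviewers agree, one gives a low score.
scores = [6, 6, 6, 6, 3]
mean, skew, kurt = score_shape(scores)
print(f"mean={mean:.2f} skew={skew:.2f} excess_kurtosis={kurt:.2f}")
```

A single low outlier pulls the skewness negative while the mean stays near the other scores, which is the pattern the abstract associates with rejections that the average score alone does not predict.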
Details
Motivation: Authors often struggle to interpret peer review feedback correctly, drawing false hope from polite comments or feeling confused by specific low scores.
Method: Builds a dataset of more than 30,000 ICLR submissions from 2021 to 2025, compares acceptance prediction from numerical scores against prediction from review text, and investigates the gap through the statistical properties of score distributions and sentiment analysis of reviews.
Result: Score-based models reach 91% accuracy, while text-based models reach only 81%; in the failure cases, score distributions show high kurtosis and negative skewness; reviews of rejected papers still contain more positive words, reflecting the Politeness Principle.
Conclusion: Numerical scores reflect reviewer opinion more reliably than review text; polite wording weakens the rejection signal and leads authors to misjudge the outcome.
Abstract: Authors often struggle to interpret peer review feedback, deriving false hope from polite comments or feeling confused by specific low scores. To investigate this, we construct a dataset of over 30,000 ICLR 2021-2025 submissions and compare acceptance prediction performance using numerical scores versus text reviews. Our experiments reveal a significant performance gap: score-based models achieve 91% accuracy, while text-based models reach only 81% even with large language models, indicating that textual information is considerably less reliable. To explain this phenomenon, we first analyze the 9% of samples that score-based models fail to predict, finding their score distributions exhibit high kurtosis and negative skewness, which suggests that individual low scores play a decisive role in rejection even when the average score falls near the borderline. We then examine why text-based accuracy significantly lags behind scores from a review sentiment perspective, revealing the prevalence of the Politeness Principle: reviews of rejected papers still contain more positive than negative sentiment words, masking the true rejection signal and making it difficult for authors to judge outcomes from text alone.
[6] SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models
Tomer Atia,Yehudit Aperstein,Alexander Apartsin
Main category: cs.CL
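TL;DR: This paper proposes SeaAlert, a large language model (LLM)-based framework for robust analysis of maritime distress voice communications; it builds a synthetic data generation pipeline to offset the scarcity of real labeled data and improves parsing performance under noise and ASR errors.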
Details
Motivation: Maritime VHF distress voice communications are safety-critical, but automatic analysis is difficult because messages are brief, the channel is noisy, speakers are under stress, and ASR errors are frequent; real labeled data is also scarce.
Method: Proposes the SeaAlert framework: an LLM generates diverse, realistic synthetic distress messages (including challenging variants that omit or replace standard terms), which are then synthesized into speech, degraded with simulated VHF channel noise, and transcribed by ASR to build a realistic noisy-text dataset on which a robust analysis model is trained.
Result: Builds a synthetic data generation pipeline for maritime distress communications; the generated data simulates the noise, format deviations, and ASR errors of real scenarios and provides high-quality training resources for downstream robust analysis models.
Conclusion: Through LLM-driven synthetic data generation, SeaAlert substantially alleviates the shortage of real labeled data and improves understanding and analysis of non-standard, stress-affected maritime distress speech under noise and ASR errors.
Abstract: Maritime distress communications transmitted over very high frequency (VHF) radio are safety-critical voice messages used to report emergencies at sea. Under the Global Maritime Distress and Safety System (GMDSS), such messages follow standardized procedures and are expected to convey essential details, including vessel identity, position, nature of the distress, and required assistance. In practice, however, automatic analysis remains difficult because distress messages are often brief, noisy, and produced under stress, may deviate from the prescribed format, and are further degraded by automatic speech recognition (ASR) errors caused by channel noise and speaker stress. This paper presents SeaAlert, an LLM-based framework for robust analysis of maritime distress communications. To address the scarcity of labeled real-world data, we develop a synthetic data generation pipeline in which an LLM produces realistic and diverse maritime messages, including challenging variants in which standard distress codewords are omitted or replaced with less explicit expressions. The generated utterances are synthesized into speech, degraded with simulated VHF noise, and transcribed by an ASR system to obtain realistic noisy transcripts.
[7] How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Zixian Huang,Kaichen Yang,Xu Huang,Feiyang Hao,Qiming Ge,Bowen Li,He Du,Kai Chen,Qipeng Guo
Main category: cs.CL
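TL;DR: This paper proposes the TESSY framework, in which teacher and student models cooperate to generate data, addressing the drop in fine-tuning performance caused by the stylistic mismatch between teacher-generated data and the student model; it significantly improves student performance on code generation tasks.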
Details
Motivation: Existing approaches that use synthetic data from a stronger teacher model for supervised fine-tuning (SFT) often fail, or even cause large performance drops, when applied to emerging reasoning models such as Qwen3-8B, mainly because of substantial stylistic divergence between teacher-generated data and the student model's data distribution.
Method: Proposes the Teacher-Student Cooperation Data Synthesis framework (TESSY), in which the teacher and student models alternately generate style and non-style tokens, yielding synthetic sequences that carry the teacher's advanced reasoning ability while matching the student's style distribution.
Result: In code generation experiments with GPT-OSS-120B as teacher and Qwen3-8B as student, fine-tuning directly on teacher data lowers LiveCodeBench-Pro and OJBench by 3.25% and 10.02% respectively, whereas TESSY raises them by 11.25% and 6.68%.
Conclusion: Stylistic consistency is a key factor in SFT effectiveness; TESSY's cooperative generation bridges the style gap between teacher and student and significantly improves the student's reasoning ability.
Abstract: A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher-generated data and the distribution of the student as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the distribution of the student. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.
[8] EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
Naman Ahuja,Saniya Mulla,Muhammad Ali Khan,Zaryab Bin Riaz,Kaneez Zahra Rubab Khakwani,Mohamad Bassam Sonbol,Irbaz Bin Riaz,Vivek Gupta
Main category: cs.CL
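TL;DR: EviSearch is a multi-agent system that automatically extracts ontology-aligned evidence tables from clinical trial PDFs while guaranteeing per-cell provenance, so that clinicians can review and correct the output.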
Details
Motivation: Manual construction of clinical evidence tables is time-consuming, error-prone, and lacks traceability; the work aims to improve the efficiency and trustworthiness of systematic reviews.
Method: Proposes a multi-agent architecture: a PDF-query agent (which preserves layout and figures), a retrieval-guided search agent, and a reconciliation module that forces page-level verification; combines high-precision multimodal extraction (text, tables, figures) with reviewable provenance generation.
Result: On an oncology clinical trial dataset, substantially improves extraction accuracy over strong parsed-text baselines and provides comprehensive provenance coverage; logs reconciliation decisions and human edits to produce structured supervision signals for iterative model improvement.
Conclusion: EviSearch offers a safe, auditable, low-manual-burden LLM-driven extraction approach for living systematic reviews in evidence-based medicine, advancing its practical adoption in evidence synthesis workflows.
Abstract: We present EviSearch, a multi-agent extraction system that automates the creation of ontology-aligned clinical evidence tables directly from native trial PDFs while guaranteeing per-cell provenance for audit and human verification. EviSearch pairs a PDF-query agent (which preserves rendered layout and figures) with a retrieval-guided search agent and a reconciliation module that forces page-level verification when agents disagree. The pipeline is designed for high-precision extraction across multimodal evidence sources (text, tables, figures) and for generating reviewer-actionable provenance that clinicians can inspect and correct. On a clinician-curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy relative to strong parsed-text baselines while providing comprehensive attribution coverage. By logging reconciler decisions and reviewer edits, the system produces structured preference and supervision signals that bootstrap iterative model improvement. EviSearch is intended to accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM-based extraction into evidence synthesis pipelines.
[9] Hierarchical Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text
Filippo Morbiato,Markus Keller,Priya Nair,Luca Romano
Main category: cs.CL
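TL;DR: This paper proposes H-TechniqueRAG, a hierarchical retrieval-augmented generation framework that incorporates the MITRE ATT&CK tactic-technique hierarchy and significantly improves the accuracy, efficiency, and interpretability of mapping CTI text to ATT&CK technique IDs.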
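The two-stage retrieval idea in this entry (pick tactics first, then search only the techniques under them) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the token-overlap scorer stands in for whatever retriever the authors use, the descriptions are paraphrased for the demo, and `hierarchical_retrieve` is a hypothetical name. Only the tactic and technique IDs are taken from the public ATT&CK matrix.

```python
import re

# Tiny hand-written slice of the ATT&CK taxonomy (descriptions paraphrased for this demo).
TAXONOMY = {
    "TA0001 Initial Access": {
        "description": "adversary gains entry foothold into network phishing email exploit",
        "techniques": {
            "T1566 Phishing": "phishing email malicious attachment link",
            "T1190 Exploit Public-Facing Application": "exploit vulnerability internet facing web server",
        },
    },
    "TA0006 Credential Access": {
        "description": "adversary steals credentials passwords account names hashes",
        "techniques": {
            "T1110 Brute Force": "guess passwords repeatedly brute force login attempts",
            "T1003 OS Credential Dumping": "dump credentials hashes from operating system memory",
        },
    },
}

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query, doc):
    """Toy relevance score: number of shared tokens (assumed stand-in for a real retriever)."""
    return len(tokens(query) & tokens(doc))

def hierarchical_retrieve(query, top_tactics=1, top_techniques=1):
    # Stage 1: rank tactics only, which is a much smaller search space.
    tactics = sorted(
        TAXONOMY, key=lambda t: overlap(query, TAXONOMY[t]["description"]), reverse=True
    )[:top_tactics]
    # Stage 2: rank techniques, restricted to the tactics selected above.
    candidates = [
        (tactic, tech, desc)
        for tactic in tactics
        for tech, desc in TAXONOMY[tactic]["techniques"].items()
    ]
    candidates.sort(key=lambda c: overlap(query, c[2]), reverse=True)
    return [(tactic, tech) for tactic, tech, _ in candidates[:top_techniques]]

result = hierarchical_retrieve("The actor sent a phishing email with a malicious attachment")
print(result)
```

Restricting stage 2 to the selected tactics is what shrinks the candidate set; in this toy taxonomy it halves the techniques that need scoring.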
Details
Motivation: Existing RAG-based methods treat ATT&CK techniques as a flat set and ignore the tactic-technique hierarchy, which leads to inefficient retrieval, imprecise reasoning, and poor interpretability.
Method: Proposes a two-stage hierarchical retrieval mechanism: first retrieve high-level tactics, then retrieve specific techniques under those tactics; adds a tactic-aware reranking module and a hierarchy-constrained context organization strategy to mitigate LLM context overload and improve reasoning precision.
Result: On three CTI datasets, F1 improves by 3.8% over the state-of-the-art TechniqueRAG, inference latency drops by 62.4%, and LLM API calls fall by 60%; the method also shows stronger cross-domain generalization and interpretable decision paths.
Conclusion: Injecting the ATT&CK hierarchy into the RAG framework as a strong inductive bias delivers performance, efficiency, and interpretability together, offering a new paradigm for automated CTI analysis.
Abstract: Mapping Cyber Threat Intelligence (CTI) text to MITRE ATT&CK technique IDs is a critical task for understanding adversary behaviors and automating threat defense. While recent Retrieval-Augmented Generation (RAG) approaches have demonstrated promising capabilities in this domain, they fundamentally rely on a flat retrieval paradigm. By treating all techniques uniformly, these methods overlook the inherent taxonomy of the ATT&CK framework, where techniques are structurally organized under high-level tactics. In this paper, we propose H-TechniqueRAG, a novel hierarchical RAG framework that injects this tactic-technique taxonomy as a strong inductive bias to achieve highly efficient and accurate annotation. Our approach introduces a two-stage hierarchical retrieval mechanism: it first identifies the macro-level tactics (the adversary's technical goals) and subsequently narrows the search to techniques within those tactics, effectively reducing the candidate search space by 77.5%. To further bridge the gap between retrieval and generation, we design a tactic-aware reranking module and a hierarchy-constrained context organization strategy that mitigates LLM context overload and improves reasoning precision. Comprehensive experiments across three diverse CTI datasets demonstrate that H-TechniqueRAG not only outperforms the state-of-the-art TechniqueRAG by 3.8% in F1 score, but also achieves a 62.4% reduction in inference latency and a 60% decrease in LLM API calls. Further analysis reveals that our hierarchical structural priors equip the model with superior cross-domain generalization and provide security analysts with highly interpretable, step-by-step decision paths.
[10] Chinese Essay Rhetoric Recognition Using LoRA, In-context Learning and Model Ensemble
Yuxuan Lai,Xiajing Wang,Chen Zheng
Main category: cs.CL
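TL;DR: This paper uses large language models (LLMs) with LoRA fine-tuning and in-context learning to perform Chinese essay rhetoric recognition with structured JSON output (keys translated into Chinese), and further improves performance through model ensembling; it ranked first on all three tracks of the CCL 2025 evaluation.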
Details
Motivation: Rhetoric recognition is a key component of automated essay scoring and helps assess students' language ability and higher-order thinking; efficient rhetoric recognition methods suited to Chinese are needed.
Method: Fine-tunes LLMs with LoRA and uses in-context learning to inject rhetoric knowledge; standardizes the output as JSON with Chinese keys; explores several model ensemble strategies.
Result: Achieves the best performance on all three tracks of the CCL 2025 Chinese essay rhetoric recognition evaluation, winning first prize.
Conclusion: LLM-based LoRA fine-tuning with structured output effectively improves Chinese rhetoric recognition, supporting its feasibility for AI-in-education applications.
Abstract: Rhetoric recognition is a critical component in automated essay scoring. By identifying rhetorical elements in student writing, AI systems can better assess linguistic and higher-order thinking skills, making it an essential task in the area of AI for education. In this paper, we leverage Large Language Models (LLMs) for the Chinese rhetoric recognition task. Specifically, we explore Low-Rank Adaptation (LoRA) based fine-tuning and in-context learning to integrate rhetoric knowledge into LLMs. We formulate the outputs as JSON to obtain structured outputs and translate keys to Chinese. To further enhance the performance, we also investigate several model ensemble methods. Our method achieves the best performance on all three tracks of the CCL 2025 Chinese essay rhetoric recognition evaluation task, winning the first prize.
[11] SAGE Celer 2.6 Technical Card
SAGEA Research Team,Basab Jha,Firoj Paudel,Ujjwal Puri,Adrian Liu,Ethan Henkel,Zhang Yuting,Mateusz Kowalczyk,Mei Huang,Choi Donghyuk,Wang Junhao
Main category: cs.CL
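TL;DR: SAGEA has released the Celer 2.6 model series in 5B, 10B, and 27B sizes. It uses an Inverse Reasoning (IR) training pipeline and a natively multimodal architecture to improve logical consistency and low-latency performance, and is specifically optimized for South Asian languages such as Nepali and Hindi while retaining English reasoning ability.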
Details
Motivation: Address cascading errors and hallucination in complex reasoning, and strengthen support for South Asian languages (especially the Devanagari script), where existing models underperform.
Method: Uses an Inverse Reasoning (IR) training pipeline so that the model validates its own logic paths; adds an end-to-end vision encoder for native multimodality; designs a custom Devanagari tokenizer with targeted pre-training and architectural optimization.
Result: Performs strongly on mathematics, coding, and general intelligence benchmarks (ACUMEN) with low latency; delivers strong performance on Nepali and Hindi tasks without harming English reasoning.
Conclusion: Celer 2.6 is a high-performance, low-hallucination, natively multimodal general-purpose model optimized for South Asian language scenarios, marking significant progress by SAGEA in jointly optimizing regional support and reliability.
Abstract: We introduce SAGE Celer 2.6, the latest in our line of general-purpose Celer models from SAGEA. Celer 2.6 is available in 5B, 10B, and 27B parameter sizes and benefits from extensive architectural modifications and further pre-training on an undisclosed model. Using our Inverse Reasoning (IR) pipeline, SAGEA natively trains Celer 2.6 to validate its own logic paths, minimizing cascading error and hallucination in complex reasoning tasks. Celer 2.6 also boasts natively integrated multimodal functionality with an end-to-end vision encoder to avoid common pitfalls in adapter-based approaches. Celer 2.6 provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN), along with low latency. Most importantly, Celer 2.6 is specifically optimized for South Asian language support, with a custom tokenizer for the Devanagari script and strong performance in both Nepali and Hindi without sacrificing English reasoning ability.
[12] Chronological Knowledge Retrieval: A Retrieval-Augmented Generation Approach to Construction Project Documentation
Ioannis-Aris Kostis,Natalia Sanchiz,Steeve De Schryver,François Denis,Pierre Schaus
Main category: cs.CL
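TL;DR: This paper proposes a conversational system based on retrieval-augmented generation (RAG) for automatically and accurately retrieving the history of decisions, with time annotations, from construction project meeting minutes.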
Details
Motivation: In large construction projects, decisions change frequently and records are voluminous; tracing decision history by hand is slow and error-prone, so automated, semantic, time-aware question answering is needed.
Method: Builds a RAG-based conversational system that combines semantic search with large language models, accepts natural-language questions, and returns semantically relevant, explicitly time-annotated answers; validated on an anonymized real-world dataset of meeting minutes from a large Belgian company, with expert-annotated queries.
Result: Delivers efficient, accurate, and interpretable question answering over chronological decision information in meeting minutes, and releases the dataset and implementation as open source.
Conclusion: The approach makes chronological decision information in engineering documents markedly more accessible and usable, and provides a reusable technical path and open resources for project knowledge management.
Abstract: In large-scale construction projects, the continuous evolution of decisions generates extensive records, most often captured in meeting minutes. Since decisions may override previous ones, professionals often need to reconstruct the history of specific choices. Retrieving such information manually from raw archives is both labor-intensive and error-prone. From a user perspective, we address this challenge by enabling conversational access to the whole set of project meeting minutes. Professionals can pose natural-language questions and receive answers that are both semantically relevant and explicitly time-annotated, allowing them to follow the chronology of decisions. From a technical perspective, our solution employs a Retrieval-Augmented Generation (RAG) framework that integrates semantic search with large language models to ensure accurate and context-aware responses. We demonstrate the approach using an anonymized, industry-sourced dataset of meeting minutes from a completed construction project by a large company in Belgium. The dataset is annotated and enriched with expert-defined queries to support systematic evaluation. Both the dataset and the open-source implementation are made available to the community to foster further research on conversational access to time-annotated project documentation.
[13] Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning
Qi Dong,Ziheng Lin,Ning Ding
Main category: cs.CL
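TL;DR: This paper proposes a stateful, evidence-driven iterative RAG framework that improves the stability and robustness of question answering by building a structured evidence pool, performing deficiency analysis, and iteratively refining queries.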
Details
Motivation: Existing RAG methods suffer from flat context representations and stateless retrieval, which leads to unstable performance.
Method: Models question answering as a progressive evidence accumulation process; converts retrieved documents into structured reasoning units with relevance and confidence signals; maintains a persistent evidence pool; performs evidence-driven deficiency analysis and iteratively refines queries to guide subsequent retrieval.
Result: Consistently outperforms standard RAG and multi-step baselines on multiple question answering benchmarks, accumulates high-quality evidence effectively, and stays stable under heavy retrieval noise.
Conclusion: Statefulness and iterative reasoning significantly improve the robustness of RAG and the stability of evidence aggregation.
Abstract: Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) in external knowledge but often suffers from flat context representations and stateless retrieval, leading to unstable performance. We propose Stateful Evidence-Driven RAG with Iterative Reasoning, a framework that models question answering as a progressive evidence accumulation process. Retrieved documents are converted into structured reasoning units with explicit relevance and confidence signals and maintained in a persistent evidence pool capturing both supportive and non-supportive information. The framework performs evidence-driven deficiency analysis to identify gaps and conflicts and iteratively refines queries to guide subsequent retrieval. This iterative reasoning process enables stable evidence aggregation and improves robustness to noisy retrieval. Experiments on multiple question answering benchmarks demonstrate consistent improvements over standard RAG and multi-step baselines, while effectively accumulating high-quality evidence and maintaining stable performance under substantial retrieval noise.
[14] Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali
Ananda Rimal,Adarsha Rimal
Main category: cs.CL
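TL;DR: This study systematically evaluates the linguistic adaptation of three open-weight models, Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B, to Romanized Nepali. Comparing zero-shot and fine-tuned (QLoRA/rsLoRA) settings, it finds that all models fail at zero-shot and improve markedly after fine-tuning; Qwen3-8B is the best overall, while Llama-3.1-8B gains the most from fine-tuning and is the preferred choice for iterative low-resource development.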
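For readers unfamiliar with the rsLoRA variant named in this entry, the difference from standard LoRA is the scaling applied to the low-rank update: alpha / r in standard LoRA versus alpha / sqrt(r) in Rank-Stabilized LoRA. The sketch below works through the numbers at the rank the abstract reports (r = 32). The value of alpha and the 4096 x 4096 projection size are assumptions made here for illustration; the abstract states neither.

```python
import math

def lora_scale(alpha, r, rank_stabilized=False):
    """Scaling factor applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r

def lora_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

r = 32       # rank reported in the abstract
alpha = 32   # hypothetical; not stated in the abstract

standard = lora_scale(alpha, r)
stabilized = lora_scale(alpha, r, rank_stabilized=True)

# Hypothetical 4096 x 4096 projection matrix, for a sense of scale only.
added = lora_params(4096, 4096, r)
fraction = added / (4096 * 4096)
print(standard, round(stabilized, 3), added, f"{fraction:.2%}")
```

The per-matrix fraction here is only a ballpark; the overall share of trainable parameters depends on which modules receive adapters.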
Details
Motivation: Romanized Nepali is the main form of informal digital communication in Nepal but is severely under-resourced for large language models; a comparable, reproducible benchmark is needed.
Method: On a 10,000-sample bilingual instruction dataset, evaluates three comparable-sized open-weight models under zero-shot and QLoRA/rsLoRA fine-tuning (r=32, training only about 1% of parameters), using five metrics across seven dimensions: PPL, BERTScore, chrF++, the ROUGE family, and BLEU.
Result: At zero-shot, none of the three models generates valid Romanized Nepali; after fine-tuning, BERTScore is about 0.75 and chrF++ exceeds 23; Qwen3-8B already produces semantically relevant output at zero-shot and leads the structural alignment metrics; Llama-3.1-8B shows the largest gains from fine-tuning, with PPL down by 49.77 and BERTScore up by 0.3287.
Conclusion: This work establishes the first rigorous baseline for Romanized Nepali in comparable-sized open-weight LLMs, confirms the adaptation headroom hypothesis, and recommends Qwen3-8B for out-of-the-box deployment and Llama-3.1-8B for continued low-resource optimization.
Abstract: Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically underresourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. We evaluate these architectures under zero-shot and fine-tuned settings using a curated bilingual dataset of 10,000 transliterated instruction-following samples. Performance is quantified across five metrics spanning seven measurement dimensions: Perplexity (PPL), BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, capturing fluency, phonetic consistency, and semantic integrity. Models were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) with Rank-Stabilized LoRA (rsLoRA) at rank r=32 on dual NVIDIA Tesla T4 GPUs, training only approximately 1% of each model's parameters in under 27 total GPU-hours. At zero-shot, all three models fail to generate Romanized Nepali, each exhibiting a distinct architecture-specific failure mode. Following fine-tuning, all three resolve these failures and converge to BERTScore approximately 0.75 and chrF++ greater than 23. Overall dimension-wise assessment across ten criteria identifies Qwen3-8B as the overall recommended architecture, being the only model to produce semantically relevant zero-shot output and leading all structural alignment metrics post-SFT. The adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline, achieves the largest absolute fine-tuning gains in PPL (Delta = -49.77) and BERTScore (Delta = +0.3287), making it the preferred choice for iterative low-resource development pipelines. This work establishes the first rigorous baseline for Romanized Nepali adaptation in comparable-sized open-weight LLMs.
[15] Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations
Ziyin Zhou,Jianyi Zhang,Xu ji,Yilong Li,Jiameng Han,Zhangchi Zhao
Main category: cs.CL
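TL;DR: This paper proposes the CRVA-TGRAG framework, a two-stage approach (teacher-guided retrieval-augmented generation) that addresses the knowledge conflicts and hallucinations LLMs exhibit in CVE vulnerability analysis because of stale knowledge. It combines parent document segmentation, hybrid retrieval, and preference-optimization fine-tuning to significantly improve retrieval accuracy for the latest CVEs and the reliability of answers.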
Details
Motivation: LLMs used for vulnerability analysis suffer from stale knowledge: more than 30,000 CVEs have been updated over the past decade, so training data diverges from current facts, causing factual errors, hallucinations, and knowledge conflicts.
Method: Proposes the two-stage CRVA-TGRAG framework: (1) in the retrieval stage, Parent Document Segmentation and hybrid retrieval combining semantic similarity with inverted indexing improve the accuracy of CVE document retrieval; (2) in the generation stage, teacher-guided preference optimization fine-tunes the LLM to answer precisely from the retrieved results.
Result: Experiments show higher accuracy in retrieving the latest CVEs than external knowledge bases, and effective mitigation of the knowledge conflicts and inconsistencies that arise when an LLM relies on internal knowledge alone.
Conclusion: By combining high-quality RAG with preference fine-tuning, CRVA-TGRAG significantly improves LLM reliability and factual consistency when vulnerability knowledge changes over time, and offers a scalable conflict-resolution paradigm for LLM applications in security.
Abstract: Large Language Models (LLMs) are essential for analyzing and addressing vulnerabilities in cybersecurity. However, of the more than 200,000 vulnerabilities discovered in the past decade, more than 30,000 have been changed or updated. This necessitates frequent updates to the training datasets and internal knowledge bases of LLMs to maintain knowledge consistency. In this paper, we focus on the problem of knowledge discrepancy and conflict within CVE (Common Vulnerabilities and Exposures) detection and analysis. This problem hinders LLMs' ability to retrieve the latest knowledge from original training datasets, leading to knowledge conflicts, fabrications of factually incorrect results, and generation hallucinations. To address this problem, we propose an innovative two-stage framework called CRVA-TGRAG (Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generation). First, to improve document retrieval accuracy during the retrieval stage, we utilize Parent Document Segmentation and an ensemble retrieval scheme based on semantic similarity and inverted indexing. Second, to enhance LLMs' capabilities based on the retrieved CVE data in the generation stage, we employ a teacher-guided preference optimization technique to fine-tune LLMs. Our framework not only enhances the quality of content retrieval through RAG but also leverages the advantages of preference fine-tuning in LLMs to answer questions more effectively and precisely. Experiments demonstrate that our method achieves higher accuracy in retrieving the latest CVEs compared to external knowledge bases. In conclusion, our framework significantly mitigates potential knowledge conflicts and inconsistencies that may arise from relying solely on LLMs for knowledge retrieval.
[16] Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters
Bryan Sanchez
Main category: cs.CL
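TL;DR: This paper proposes a lightweight post-transformer adapter with only 786K parameters, trained on frozen hidden states, that mitigates the suppression of factual log-probabilities on politically sensitive topics in alignment-tuned language models without harming existing knowledge. The adapter's generalization and generation coherence are validated across several Qwen3 model scales, and the work uncovers a previously undocumented silent gradient bug in the MLX framework.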
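The "last-position-only" application described in this entry's abstract can be pictured with a small framework-free sketch. Everything here is an assumption for illustration: the abstract does not give the adapter equations, so the block assumes a residual, ungated linear bottleneck (down-projection followed by up-projection, no nonlinearity), and the function names and toy weights are hypothetical.

```python
def linear_bottleneck_adapter(h, w_down, w_up):
    """Assumed form: h + W_up @ (W_down @ h), with no gate or nonlinearity."""
    z = [sum(w * x for w, x in zip(row, h)) for row in w_down]    # d -> k
    delta = [sum(w * x for w, x in zip(row, z)) for row in w_up]  # k -> d
    return [x + d for x, d in zip(h, delta)]

def apply_last_position_only(hidden_states, w_down, w_up):
    """Adapt only the final position's hidden state; earlier positions pass through."""
    out = [list(h) for h in hidden_states]
    out[-1] = linear_bottleneck_adapter(hidden_states[-1], w_down, w_up)
    return out

# Toy example: 2 positions, hidden size d = 3, bottleneck size k = 1.
hidden = [[1, 2, 3], [4, 5, 6]]
w_down = [[1, 0, 0]]          # shape (k, d)
w_up = [[0.5], [0], [0]]      # shape (d, k)
adapted = apply_last_position_only(hidden, w_down, w_up)
print(adapted)
```

Only the hidden state that feeds the next-token prediction is modified, which matches the abstract's observation that adapting every position degrades generation while adapting the current prediction position does not.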
Details
Motivation: Alignment-tuned language models often suppress factual log-probabilities on politically sensitive topics even though their hidden layers retain the relevant knowledge; a low-overhead, deployable correction mechanism that does not damage existing knowledge is needed.
Method: Designs and trains a lightweight (786K-parameter) post-transformer adapter that operates only on frozen hidden states; compares gated (SwiGLU) and ungated (linear bottleneck) variants in fact-correction experiments on Qwen3-4B/8B/14B; applies the adapter at the last position only to keep generation coherent; diagnoses and locates a silent gradient bug in MLX.
Result: The adapter corrects log-probability suppression on 31 ideology-discriminating facts; it memorizes 100% of the 15 training facts and generalizes to 11-39% of the 16 held-out facts (5 random splits); the two variants do not differ significantly (Fisher exact test, p > 0.09); applying the adapter only at the last token position yields coherent, less censored text; a logit-space adapter fails; a silent gradient bug in MLX's nn.value_and_grad usage is found and fixed.
Conclusion: A lightweight hidden-state adapter is an effective and practical way to correct factual suppression in aligned models; last-position-only intervention is key to generation quality; the work both advances factual calibration methods and flags a gradient-computation pitfall in the MLX ecosystem, with methodological and engineering lessons for later adapter research.
Abstract: Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11-39% of 16 held-out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log-probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last-position-only), the adapter produces coherent, less censored text. A logit-space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden-state intervention is the correct level for generation correction. A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern nn.value_and_grad(model, fn)(model.parameters()) returns zero gradients without error; the correct pattern nn.value_and_grad(model, fn)(model, data) resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.
[17] QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment
Mohammad AL-Smadi
Main category: cs.CL
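TL;DR: This paper presents a unified system that addresses both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA shared task. For answer generation it applies two-stage QLoRA fine-tuning to Qwen3-4B; for evidence alignment it builds a weighted ensemble of three retrieval methods. Both tasks are limited by having only 20 annotated samples, which underlines the need for data augmentation.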
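The weighted retrieval ensemble mentioned in this entry can be illustrated with a short sketch. The abstract does not define "relative thresholding" or give the ensemble weights, so the block below makes assumptions and labels them: relative thresholding is taken to mean "keep sentences scoring at least a fixed fraction of the top BM25 score", and the weights, cutoff, scores, and function names are all hypothetical.

```python
def relative_threshold(scores, ratio):
    """Assumed meaning: vote 1.0 for sentences scoring >= ratio * best score, else 0.0."""
    top = max(scores)
    return [1.0 if top > 0 and s >= ratio * top else 0.0 for s in scores]

def weighted_ensemble(signals, weights, cutoff):
    """Combine per-sentence signals with normalized weights; return selected indices."""
    n = len(next(iter(signals.values())))
    total = sum(weights.values())
    combined = [
        sum(weights[name] * signals[name][i] for name in weights) / total
        for i in range(n)
    ]
    return [i for i, score in enumerate(combined) if score >= cutoff]

# Hypothetical scores for four note sentences against one gold answer.
bm25_raw = [8.0, 2.0, 6.5, 0.0]
signals = {
    "bm25": relative_threshold(bm25_raw, ratio=0.6),  # -> votes for sentences 0 and 2
    "tfidf": [0.70, 0.10, 0.20, 0.05],                # cosine similarities
    "cross_encoder": [0.90, 0.20, 0.80, 0.10],        # relevance probabilities
}
weights = {"bm25": 0.3, "tfidf": 0.2, "cross_encoder": 0.5}  # hypothetical weights

selected = weighted_ensemble(signals, weights, cutoff=0.5)
print(selected)
```

A relative threshold adapts to each query's score range, which matters for BM25 because its raw scores are not comparable across queries.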
Details
Motivation: Subtasks 3 and 4 of the ArchEHR-QA shared task come with very little annotated data (only 20 cases), so models struggle to separate relevant from irrelevant clinical sentences; a unified approach and efficient adaptation methods are needed. Method: Subtask 3: two-stage QLoRA fine-tuning of Qwen3-4B in 4-bit NF4 quantisation, first on emrQA-MedSQuAD (30,000 samples) to build clinical domain competence, then on the 20 development cases to learn the task's output style. Subtask 4: a weighted retrieval ensemble combining BM25 (with relative thresholding), TF-IDF cosine similarity, and a fine-tuned cross-encoder. Result: Subtask 3 reaches an overall score of 32.87 on test-2026 (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04); Subtask 4 reaches a micro-F1 of 67.16 on the 100-case test set. Conclusion: The fundamental bottleneck in both subtasks is the extreme scarcity of annotated data (20 cases); the unified framework works, but data augmentation for clinical question answering is the most important future direction. Abstract: We present a unified system addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA Shared Task. For Subtask 3, we apply two-stage Quantised Low-Rank Adaptation (QLoRA) to Qwen3-4B loaded in 4-bit NF4 quantisation: first on 30,000 samples from the emrQA-MedSQuAD corpus to establish clinical domain competence, then on the 20 annotated development cases to learn the task-specific output style. Our system achieves an overall score of 32.87 on the official test-2026 split (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04). For Subtask 4, we develop a weighted ensemble of three retrieval methods - BM25 with relative thresholding, TF-IDF cosine similarity, and a fine-tuned cross-encoder - to identify note sentences supporting a given gold answer, achieving a micro-F1 of 67.16 on the 100-case test set. Experiments reveal that both subtasks expose the same fundamental challenge: 20 annotated training cases are insufficient to distinguish relevant from irrelevant clinical sentences, pointing to data augmentation as the highest-leverage future direction.
[18] Listen, Correct, and Feed Back: Spoken Pedagogical Feedback Generation
Junhong Liang,Yifan Lu,Ekaterina Kochmar,Fajri Koto
Main category: cs.CL
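TL;DR: This paper introduces the SPFG dataset for spoken grammatical error correction with pedagogical feedback, and compares supervised fine-tuning with preference alignment for jointly generating corrections and feedback, finding supervised fine-tuning more stable and effective.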
Details
Motivation: Existing work on grammatical error correction (GEC) and explanation (GEE) lacks pedagogical feedback suited to real teaching: actionable, matched to the learner's level, and encouraging. Method: Build the SPFG dataset from the Speak & Improve Challenge 2025 corpus, pairing fluency-oriented transcriptions with GEC targets and human-verified teacher-style feedback (including preference pairs), and compare SFT with preference alignment (DPO/KTO) on three instruction-tuned LLMs in a spoken GEC setting. Result: SFT gives the most consistent gains in both correction and feedback generation; DPO/KTO gains are smaller or unstable; correction quality and feedback quality are only weakly correlated. Conclusion: Supervised fine-tuning remains the more reliable way to generate high-quality, learner-friendly spoken corrections and feedback; SPFG provides the first speech-oriented benchmark dataset with human preference feedback for this direction. Abstract: Grammatical error correction (GEC) and explanation (GEE) have made rapid progress, but real teaching scenarios also require learner-friendly pedagogical feedback that is actionable, level-appropriate, and encouraging. We introduce SPFG (Spoken Pedagogical Feedback Generation), a dataset built based on the Speak & Improve Challenge 2025 corpus, pairing fluency-oriented transcriptions with GEC targets and human-verified teacher-style feedback, including preferred/rejected feedback pairs for preference learning. We study a transcript-based Spoken Grammatical Error Correction (SGEC) setting and evaluate three instruction-tuned LLMs (Qwen2.5, Llama-3.1, and GLM-4), comparing supervised fine-tuning (SFT) with preference-based alignment (using DPO and KTO) for jointly generating corrections and feedback. Results show that SFT provides the most consistent improvements, while DPO/KTO yield smaller or mixed gains, and that correction quality and feedback quality are weakly coupled. Our implementation is available at https://github.com/Skywalker-Harrison/spfg.
[19] An Underexplored Frontier: Large Language Models for Rare Disease Patient Education and Communication -- A scoping review
Zaifu Zhan,Yu Hou,Kai Yu,Min Zeng,Anita Burgun,Xiaoyi Chen,Rui Zhang
Main category: cs.CL
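TL;DR: This scoping review analyzes 12 studies published between January 2022 and March 2026 on large language models (LLMs) for rare disease patient education and communication. Current work relies mostly on general-purpose models such as ChatGPT, focuses on static question answering, and lacks real-world settings, multilingual support, and patient-centered evaluation; the field is at an early stage.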
Details
Motivation: Rare disease patients face complex care pathways, scarce clinical expertise, and long-standing unmet communication needs; LLMs show promise for patient education, but their actual use in rare diseases is unclear. Method: A scoping review of studies published between January 2022 and March 2026 in major databases, including 12 studies; study characteristics, application scenarios, model usage, and evaluation methods were extracted and synthesized with descriptive and qualitative analyses. Result: Existing studies are recent and concentrate on general-purpose LLMs (especially ChatGPT), mostly for question answering over curated question sets; real-world data and longitudinal communication scenarios are rarely used; evaluation emphasizes accuracy and neglects patient-centered measures such as readability, empathy, and communication quality; multilingual support is almost absent. Conclusion: LLM applications in rare diseases are still at an early stage; future work should strengthen patient-centered design, domain-adapted methods, and real-world deployment to deliver safe, adaptive, and effective communication support. Abstract: Rare diseases affect over 300 million people worldwide and are characterized by complex care pathways, limited clinical expertise, and substantial unmet communication needs throughout the long patient journey. Recent advances in large language models (LLMs) offer new opportunities to support patient education and communication, yet their application in rare diseases remains unclear. We conducted a scoping review of studies published between January 2022 and March 2026 across major databases, identifying 12 studies on LLM-based rare disease patient education and communication. Data were extracted on study characteristics, application scenarios, model usage, and evaluation methods, and synthesized using descriptive and qualitative analyses. The literature is highly recent and dominated by general-purpose models, particularly ChatGPT. Most studies focus on patient question answering using curated question sets, with limited use of real-world data or longitudinal communication scenarios. Evaluations are primarily centered on accuracy, with limited attention to patient-centered dimensions such as readability, empathy, and communication quality. Multilingual communication is rarely addressed. Overall, the field remains at an early stage. Future research should prioritize patient-centered design, domain-adapted methods, and real-world deployment to support safe, adaptive, and effective communication in rare diseases.
[20] Internal Knowledge Without External Expression: Probing the Generalization Boundary of a Classical Chinese Language Model
Jiuting Chen,Yuan Lian,Hao Wu,Tianqi Huang,Hiroshi Sasaki,Makoto Kouno,Jongil Choi
Main category: cs.CL
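TL;DR: The authors train a 318M-parameter Transformer language model on pure Classical Chinese and find that it internally distinguishes real from fabricated historical events (visible as perplexity differences) but cannot express uncertainty in its output (for example, "I do not know"). That expressive ability depends on rhetorical conventions in the training data rather than on metacognition in the model, and it requires explicit training signals such as RLHF.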
Details
Motivation: Investigate whether language models spontaneously develop metacognitive ability, that is, expressing uncertainty such as "I don't know" in generated text, in a pure Classical Chinese setting free of English characters and Arabic numerals. Method: Train a 318M-parameter Transformer from scratch on a 1.56-billion-token pure Classical Chinese corpus; design systematic OOD tests (real, fabricated, and semi-fabricated historical events) to measure internal perplexity; analyze how often epistemic markers (such as 或曰 and 未详) appear in generations; replicate across languages (Classical Chinese, English, Japanese) and model sizes (110M-1.56B). Result: Internally the model shows clear factual encoding (perplexity rises 2.39x for fabricated events and 4.24x for semi-fabricated ones), but externally it expresses uncertainty less often for OOD questions than for in-distribution ones (3.5% vs 8.3%); the pattern holds across languages and models; the frequency of uncertainty expression is set by rhetorical conventions in the training data (Classical Chinese models show a "humility paradox", and Japanese models almost never hedge). Conclusion: Autoregressive language modeling alone does not give rise to metacognitive expression; saying "I don't know" needs explicit supervision such as RLHF and cannot rely on learning statistical patterns. Abstract: We train a 318M-parameter Transformer language model from scratch on a curated corpus of 1.56 billion tokens of pure Classical Chinese, with zero English characters or Arabic numerals. Through systematic out-of-distribution (OOD) testing, we investigate whether the model can distinguish known from unknown inputs, and crucially, whether it can express this distinction in its generated text. We find a clear dissociation between internal and external uncertainty. Internally, the model exhibits a perplexity jump ratio of 2.39x between real and fabricated historical events (p = 8.9e-11, n = 92 per group), with semi-fabricated events (real figures + fictional events) showing the highest perplexity (4.24x, p = 1.1e-16), demonstrating genuine factual encoding beyond syntactic pattern matching. Externally, however, the model never learns to express uncertainty: classical Chinese epistemic markers appear at lower rates for OOD questions (3.5%) than for in-distribution questions (8.3%, p = 0.023), reflecting rhetorical conventions rather than genuine metacognition. We replicate both findings across three languages (Classical Chinese, English, Japanese), three writing systems, and eight models from 110M to 1.56B parameters. We further show that uncertainty expression frequency is determined entirely by training data conventions, with Classical Chinese models showing a "humility paradox" (more hedging for known topics), while Japanese models almost never hedge.
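A perplexity jump ratio of the kind reported above could be computed roughly as follows. This is a sketch under assumptions: the abstract does not say how per-event perplexities are aggregated, so the ratio of mean perplexities is used here, and the function names are illustrative.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean log-probability), using natural-log token log-probs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def jump_ratio(real: Sequence[Sequence[float]],
               fabricated: Sequence[Sequence[float]]) -> float:
    """Mean perplexity on fabricated inputs divided by mean perplexity on real inputs.

    Assumption: groups are aggregated by arithmetic mean of per-item perplexity.
    """
    mean_real = sum(perplexity(s) for s in real) / len(real)
    mean_fab = sum(perplexity(s) for s in fabricated) / len(fabricated)
    return mean_fab / mean_real


if __name__ == "__main__":
    # Toy inputs: each inner list holds per-token log-probabilities for one event.
    real_events = [[math.log(0.5)] * 4]
    fabricated_events = [[math.log(0.25)] * 4]
    print(jump_ratio(real_events, fabricated_events))
```

A ratio above 1 means the model is more surprised by the fabricated group; the significance tests quoted in the abstract are a separate step not shown here.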
We argue that metacognitive expression -- the ability to say "I don't know" -- does not emerge from language modeling alone and requires explicit training signals such as RLHF.
[21] Attention to Mamba: A Recipe for Cross-Architecture Distillation
Abhinav Moudgil,Ningyuan Huang,Eeshan Gunesh Dhekane,Pau Rodríguez,Luca Zappella,Federico Danieli
Main category: cs.CL
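TL;DR: This paper proposes a two-stage knowledge distillation method that distills a Transformer (Pythia-1B) into a pure SSM architecture (Mamba). Using linearized attention based on the kernel trick as an intermediate representation, together with a principled initialization for Mamba, it improves cross-architecture distillation and nearly preserves teacher performance (perplexity 14.11 vs. 13.86) while keeping memory overhead low.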
Details
Motivation: SSMs such as Mamba are efficient at inference but lack a mature pretraining ecosystem, whereas Transformers have many pretrained models but suffer from high memory use and low throughput. The core motivation is to transfer Transformer knowledge efficiently into a pure SSM architecture without introducing Attention modules. Method: A two-stage distillation framework: the first stage distills the Transformer into a linearized-attention model using the kernel trick; the second stage distills that linearized model into an adapted Mamba architecture with a principled initialization; Attention and SSM blocks are never mixed. Result: The distilled pure Mamba model comes close to the Pythia-1B teacher on downstream tasks (perplexity 14.11 vs. 13.86); ablations and scaling analyses at 1B parameters with 10B training tokens support the method's effectiveness and robustness. Conclusion: Principled initialization combined with two-stage distillation through an intermediate representation is key to high-quality knowledge transfer into a pure SSM architecture; the recipe offers a path to deploying SSMs that needs no Attention modules yet reuses pretrained Transformers. Abstract: State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block.
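To make the "linearized version of Attention" more concrete, here is a minimal sketch of generic kernelized causal attention. It is not the paper's adaptation: the feature map elu(x) + 1 is an assumed, commonly used choice, and all names are illustrative. The sketch shows that the quadratic form and a recurrent running-state form give the same output, which is the property that makes linear attention a plausible bridge toward recurrent models.

```python
import math
from typing import List

Vec = List[float]


def phi(x: Vec) -> Vec:
    """Positive feature map elu(x) + 1 (an assumed choice, not taken from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def linear_attention_quadratic(Q: List[Vec], K: List[Vec], V: List[Vec]) -> List[Vec]:
    """Causal kernel attention computed directly: O(T^2) pairwise weights."""
    out = []
    for t in range(len(Q)):
        q = phi(Q[t])
        weights = [dot(q, phi(K[s])) for s in range(t + 1)]
        norm = sum(weights)
        out.append([sum(w * V[s][d] for s, w in enumerate(weights)) / norm
                    for d in range(len(V[0]))])
    return out


def linear_attention_recurrent(Q: List[Vec], K: List[Vec], V: List[Vec]) -> List[Vec]:
    """Same computation with a running state: S accumulates phi(k) v^T, z accumulates phi(k)."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]
    z = [0.0] * dk
    out = []
    for q_raw, k_raw, v in zip(Q, K, V):
        q, k = phi(q_raw), phi(k_raw)
        for i in range(dk):
            z[i] += k[i]
            for d in range(dv):
                S[i][d] += k[i] * v[d]
        norm = dot(q, z)
        out.append([sum(q[i] * S[i][d] for i in range(dk)) / norm for d in range(dv)])
    return out


if __name__ == "__main__":
    Q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
    K = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
    V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(linear_attention_quadratic(Q, K, V))
    print(linear_attention_recurrent(Q, K, V))
```

The recurrent form keeps a fixed-size state per step instead of a growing key-value cache; how the paper maps such a state onto Mamba's parameters is not described in the abstract and is not attempted here.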
Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11 close to the teacher's 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens varying sequence mixer architecture, scaling analysis on model sizes and total distillation tokens, and a sensitivity analysis on tokens allocation between stages.
[22] The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure
David A. Cook
Main category: cs.CL
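TL;DR: This paper proposes the PICCO framework for systematic large language model (LLM) prompt design, with five core elements (Persona, Instructions, Context, Constraints, Output), and clarifies how related concepts are layered.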
Details
Motivation: Prompt engineering lacks a consistent, structured framework; prompt design practice and its theoretical description are muddled, so a unified, reusable reference architecture is needed. Method: Through a multi-database search, systematically review 11 existing prompting frameworks, integrate and abstract their concepts, and construct the five-element PICCO reference architecture with an accompanying conceptual scheme. Result: A clear taxonomy (distinguishing frameworks, elements, generation, techniques, and engineering); the five-element PICCO architecture with each element's definition, role, and relationships; and an overview of key techniques, iteration methods, ethical considerations, and research directions. Conclusion: PICCO offers a formal, structured theoretical basis and practical guide for prompt design; it is a conceptual and methodological contribution that has not yet been empirically validated as an optimization method. Abstract: Large language model (LLM) performance depends heavily on prompt design, yet prompt construction is often described and applied inconsistently. Our purpose was to derive a reference framework for structuring LLM prompts. This paper presents PICCO, a framework derived through a rigorous synthesis of 11 previously published prompting frameworks identified through a multi-database search. The analysis yields two main contributions. First, it proposes a taxonomy that distinguishes prompt frameworks, prompt elements, prompt generation, prompting techniques, and prompt engineering as related but non-equivalent concepts. Second, it derives a five-element reference architecture for prompt generation: Persona, Instructions, Context, Constraints, and Output (PICCO). For each element, we define its function, scope, and relationship to other elements, with the goal of improving conceptual clarity and supporting more systematic prompt design. Finally, to support application of the framework, we outline key concepts relevant to implementation, including prompting techniques (e.g., zero-shot, few-shot, chain-of-thought, ensembling, decomposition, and self-critique, with selected variants), human and automated approaches to iterative prompt engineering, responsible prompting considerations such as security, privacy, bias, and trust, and priorities for future research.
This work is a conceptual and methodological contribution: it formalizes a common structure for prompt specification and comparison, but does not claim empirical validation of PICCO as an optimization method.
[23] Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate
Simiao Ren,Xingyu Shen,Yuchen Zhou,Dennis Ng,Ankit Raj
Main category: cs.CL
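TL;DR: Using the SWE-bench Lite benchmark, this paper tests the popular claim that Chinese prompts use fewer tokens than English ones and finds it does not hold. Chinese does not consistently reduce token consumption, models behave in opposite ways (token cost is higher for Chinese on MiniMax-2.7 but lower on GLM-5), and success rates with Chinese prompts are generally lower than with English. The expected cost per successful task, which combines token cost and success rate, shows no advantage for Chinese.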
Details
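Motivation: A claim circulating on social media and in developer communities holds that Chinese prompts use fewer tokens in LLM coding tasks and can cut costs by up to 40%, which is influencing practice; the claim needs rigorous verification. Method: On the SWE-bench Lite software-engineering benchmark, run controlled experiments on several models (such as MiniMax-2.7 and GLM-5), compare token consumption and task success rate under Chinese and English prompts, and compute a combined cost-efficiency measure (expected cost per successful task). Result: (1) Chinese shows no consistent token-efficiency advantage; (2) the change in token cost depends on the model (+28% for Chinese on MiniMax-2.7, slightly lower for Chinese on GLM-5); (3) the success rate with Chinese prompts is generally lower than with English across the tested models; (4) the combined cost-efficiency measure shows no advantage for Chinese. Conclusion: The effect of language on token cost depends strongly on the model; switching to Chinese prompts neither reliably lowers cost nor improves performance, and current evidence does not support Chinese as a general cost-saving strategy. Abstract: A claim has been circulating on social media and practitioner forums that Chinese prompts are more token-efficient than English for LLM coding tasks, potentially reducing costs by up to 40%. This claim has influenced developers to consider switching to Chinese for "vibe coding" to save on API costs. In this paper, we conduct a rigorous empirical study using SWE-bench Lite, a benchmark of software engineering tasks, to evaluate whether this claim of Chinese token efficiency holds up to scrutiny. Our results reveal three key findings: First, the efficiency advantage of Chinese is not observed. Second, token cost varies by model architecture in ways that defy simple assumptions: while MiniMax-2.7 shows 1.28x higher token costs for Chinese, GLM-5 actually consumes fewer tokens with Chinese prompts. Third, and most importantly, we found that the success rate when prompting in Chinese is generally lower than in English across all models we tested. We also measure cost efficiency as expected cost per successful task -- jointly accounting for token consumption and task resolution rate.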
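One plausible reading of "expected cost per successful task" is the cost of an attempt divided by the probability that an attempt succeeds. The sketch below uses that reading as an assumption, since the abstract does not give the exact formula; the token counts, price, and success rates in the example are made up for illustration and are not the paper's data.

```python
def expected_cost_per_success(mean_tokens_per_attempt: float,
                              price_per_token: float,
                              success_rate: float) -> float:
    """Cost of one attempt divided by the chance it succeeds.

    Assumption: attempts are independent with a fixed success rate, so the
    expected number of attempts per success is 1 / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return mean_tokens_per_attempt * price_per_token / success_rate


if __name__ == "__main__":
    price = 1e-6  # hypothetical price per token
    # Hypothetical model: the second prompt language uses 1.28x the tokens
    # and resolves fewer tasks.
    cost_a = expected_cost_per_success(100_000, price, 0.30)
    cost_b = expected_cost_per_success(128_000, price, 0.25)
    print(f"language A: {cost_a:.4f}  language B: {cost_b:.4f}")
```

Under this reading a language can lose on cost per success even when it uses fewer tokens, provided its success rate drops by a larger factor.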
These findings should be interpreted as preliminary evidence rather than a definitive conclusion, given the limited number of models evaluated and the narrow set of benchmarks tested due to resource constraints; they indicate that language effects on token cost are model-dependent, and that practitioners should not expect cost savings or performance gains just by switching their prompt language to Chinese.
[24] CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization
Deep Shah,Sanket Badhe,Nehal Kathrotia,Priyanka Tiwari
Main category: cs.CL
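TL;DR: This paper proposes CROP, which adds response-length regularization to automatic prompt optimization in order to cut token consumption and latency during LLM reasoning while keeping task accuracy high.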
Details
Motivation: Existing automatic prompt optimization frameworks target task accuracy only, which leads to verbose reasoning and therefore high latency and token cost. Method: Cost-Regularized Optimization of Prompts (CROP) adds textual length feedback to the standard accuracy feedback, steering optimization toward prompts that elicit concise responses containing only the critical information. Result: On GSM8K, LogiQA, and BIG-Bench Hard, token consumption falls by 80.6% with only a slight drop in accuracy. Conclusion: CROP is a practical option for deploying efficient, low-cost agentic AI systems in production. Abstract: Large Language Models utilizing reasoning techniques improve task performance but incur significant latency and token costs due to verbose generation. Existing automatic prompt optimization (APO) frameworks target task accuracy exclusively at the expense of generating long reasoning traces. We propose Cost-Regularized Optimization of Prompts (CROP), an APO method that introduces regularization on response length by generating textual feedback in addition to standard accuracy feedback. This forces the optimization process to produce prompts that elicit concise responses containing only critical information and reasoning. We evaluate our approach on complex reasoning datasets, specifically GSM8K, LogiQA and BIG-Bench Hard. We achieved an 80.6% reduction in token consumption while maintaining competitive accuracy, seeing only a nominal decline in performance. This presents a pragmatic solution for deploying token-efficient and cost-effective agentic AI systems in production pipelines.
[25] MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes
Samir Wagle,Reewaj Khanal,Abiral Adhikari
Main category: cs.CL
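TL;DR: This paper presents a multimodal hate speech detection system for Devanagari-script social media memes that combines CLIP with BGE-M3 and adds dynamically gated cross-modal attention. It improves performance in a low-resource setting and shows that English-centric vision models fail on Devanagari script and that standard ensembling degrades with small samples.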
Details
Motivation: Hate speech detection in Devanagari-script social media memes faces compounded challenges: multimodal structure, linguistic complexity, and extreme data scarcity. Method: A hybrid cross-modal attention fusion architecture: CLIP (ViT-B/32) encodes images, BGE-M3 encodes multilingual text, and 4-head self-attention with a learnable gating network dynamically weights each modality's contribution. Result: A 5.9% F1-macro gain over text-only baselines on Subtask A; English-centric vision models predict near randomly on Devanagari script, and standard ensembling degrades severely with only about 850 samples per fold because of correlated overfitting. Conclusion: Explicit cross-modal modeling is essential for low-resource multimodal hate detection; model choice and ensembling strategy must fit the target script and data scale, and English-dominant recipes cannot be transferred directly. Abstract: Hate speech detection in Devanagari-scripted social media memes presents compounded challenges: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three-class sentiment classification: positive, neutral, negative). We propose a hybrid cross-modal attention fusion architecture that combines CLIP (ViT-B/32) for visual encoding with BGE-M3 for multilingual text representation, connected through 4-head self-attention and a learnable gating network that dynamically weights modality contributions on a per-sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines on Subtask A, while uncovering two unexpected but critical findings: English-centric vision models exhibit near-random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N ≈ 850 per fold) due to correlated overfitting. The code can be accessed at https://github.com/Tri-Yantra-Technologies/MEME-Fusion/
[26] ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
Zhuofeng Li,Yi Lu,Dongfu Jiang,Haoxiang Zhang,Yuyang Bai,Chuan Li,Yu Wang,Shuiwang Ji,Jianwen Xie,Yu Zhang
Main category: cs.CL
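TL;DR: This paper introduces the REVIEWBENCH benchmark and the REVIEWGROUNDER multi-agent framework, which use explicit rubrics and contextual evidence to improve the quality of LLM peer-review feedback and its agreement with human judgments.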
Details
Motivation: LLM reviewers often produce superficial, formulaic comments without substantive, evidence-based feedback, mainly because they underuse two elements of human reviewing: explicit rubrics and contextual grounding in existing work. Method: Build REVIEWBENCH, a benchmark with paper-specific rubrics derived from official guidelines, paper content, and human-written reviews; propose REVIEWGROUNDER, a tool-integrated multi-agent framework that splits reviewing into a drafting stage and an evidence-grounding stage. Result: On REVIEWBENCH, REVIEWGROUNDER (Phi-4-14B drafter plus GPT-OSS-120B grounding) outperforms baselines with stronger or larger backbones (such as GPT-4.1 and DeepSeek-R1-670B) across 8 dimensions and aligns more closely with human judgments. Conclusion: Rubric guidance and contextual evidence grounding are key to better LLM reviews; REVIEWGROUNDER offers a reproducible, measurable approach to AI-assisted reviewing. Abstract: The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper's content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at https://github.com/EigenTom/ReviewGrounder.
[27] EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation
Francesco Andrea Causio,Vittorio De Vita,Olivia Riccomi,Michele Ferramola,Federico Felizzi,Antonio Cristiano,Lorenzo De Mori,Chiara Battipaglia,Melissa Sawaya,Luigi De Angelis,Marcello Di Pumpo,Alessandra Piscitelli,Pietro Eric Risuleo,Alessia Longo,Giulia Vojvodic,Mariapia Vassalli,Bianca Destro Castaniti,Nicolò Scarsi,Manuel Del Medico
Main category: cs.CL
TL;DR: This paper introduces EuropeMedQA, the first multilingual and multimodal medical examination dataset from official European regulatory exams (Italy, France, Spain, Portugal), designed to evaluate LLMs’ cross-lingual and visual reasoning abilities under strict zero-shot conditions.
Details
Motivation: LLMs show strong performance on English medical exams but suffer in non-English and multimodal diagnostic settings; there is a lack of contamination-resistant, clinically representative benchmarks for European languages and modalities. Method: Curated EuropeMedQA following FAIR and SPIRIT-AI guidelines; built an automated translation pipeline; evaluated multimodal LLMs using zero-shot, strictly constrained prompting for cross-lingual transfer and visual reasoning. Result: EuropeMedQA is established as a comprehensive, multilingual, multimodal, contamination-resistant benchmark reflecting real European clinical practices. Conclusion: EuropeMedQA fills a critical gap by enabling fair, rigorous evaluation of medical AI across languages and modalities, thereby promoting more generalizable and clinically relevant models. Abstract: While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.
[28] Tracking the Temporal Dynamics of News Coverage of Catastrophic and Violent Events
Emily Lugos,Maurício Gruppi
Main category: cs.CL
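TL;DR: Analyzing 126,602 online news articles, this study quantifies the temporal and semantic dynamics of reporting on violent and catastrophic events and shows that sudden events follow structured, predictable news-cycle patterns.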
Details
Motivation: Understanding how narratives form, spread, and evolve during crises is essential for interpreting public discourse. Method: On a large online news corpus (126,602 articles), quantify narrative change through publication volume, semantic drift, semantic dispersion, and term relevance. Result: Sudden events show structured, predictable news-cycle patterns: a rapid surge in coverage, early semantic drift, and a gradual return to baseline; the key terms driving the temporal patterns are also identified. Conclusion: News cycles have dynamics that can be modeled, and semantic analysis can reveal how narratives evolve and what drives them. Abstract: The modern news cycle has been fundamentally reshaped by the rapid exchange of information online. As a result, media framing shifts dynamically as new information, political responses, and social reactions emerge. Understanding how these narratives form, propagate, and evolve is essential for interpreting public discourse during moments of crisis. In this study, we examine the temporal and semantic dynamics of reporting for violent and catastrophic events using a large-scale corpus of 126,602 news articles collected from online publishers. We quantify narrative change through publication volume, semantic drift, semantic dispersion, and term relevance. Our results show that sudden events of impact exhibit structured and predictable news-cycle patterns characterized by rapid surges in coverage, early semantic drift, and gradual declines toward the baseline. In addition, our results indicate the terms that are driving the temporal patterns.
[29] LLM Predictive Scoring and Validation: Inferring Experience Ratings from Unstructured Text
Jason Potteiger,Andrew Hong,Ito Zapata
Main category: cs.CL
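TL;DR: This study uses GPT-4.1 to predict fans' 0-10 overall game-day experience ratings from their open-ended text feedback. Predictions agree closely with actual ratings (67% within 1 point, r = 0.82) but run about 1 point lower. The gap reflects a difference in what is measured: self-reported ratings are an overall verdict, whereas the AI predictions weight salient, emotionally intense, or actionable moments.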
Details
Motivation: Examine whether an LLM can reliably predict subjective experience ratings from fans' open-ended text alone, and understand the nature of the systematic gap between predicted and actual ratings. Method: Use GPT-4.1 with a single prompt to predict a 0-10 rating for each of about 10,000 open-ended responses from fans of five MLB teams, compare predictions with the survey ratings, and analyze agreement, bias patterns, and correlations with sub-dimensions. Result: 67% of predictions fall within 1 point of the actual rating and 36% match exactly; across three independent runs, 87% agree exactly and 99.9% fall within 1 point; predictions correlate most strongly with the overall rating (r = 0.82) but run about 1 point lower, and the gap cannot be attributed to any single aspect of the experience. Conclusion: A simple, unoptimized prompt yields directionally valid predictions of fan experience ratings; the stable gap between predicted and actual ratings is not an error but reflects two different constructs (overall verdict vs. intensity of salient moments) and should be preserved and interpreted rather than eliminated. Abstract: We tasked GPT-4.1 to read what baseball fans wrote about their game-day experience and predict the overall experience rating each fan gave on a 0-10 survey scale. The model received only the text of a single open-ended response. These AI predictions were compared with the actual experience ratings captured by the survey instrument across approximately 10,000 fan responses from five Major League Baseball teams. In total two-thirds of predicted ratings fell within one point of self-reported fan ratings (67% within +/-1, 36% exact match), and the predicted measurement was near-deterministic across three independent scoring runs (87% exact agreement, 99.9% within +/-1). Predicted ratings aligned most strongly with the overall experience rating (r = 0.82) rather than with any specific aspect of the game-day experience such as parking, concessions, staff, etc. However, predictions were systematically lower than self-reported ratings by approximately one point, and this gap was not driven by any single aspect. Rather, our analysis shows that self-reported ratings capture the fan's verdict, an overall evaluative judgment that integrates the entire experience, while predicted ratings quantify the impact of salient moments characterized as memorable, emotionally intense, unusual, or actionable. Each measure contains information the other misses.
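The agreement statistics quoted above (exact match, within +/-1, systematic gap, Pearson r) can be computed from paired ratings as in the following sketch. The function name and the toy ratings are illustrative assumptions, not the study's code or data.

```python
import math
from typing import Dict, Sequence


def agreement_summary(predicted: Sequence[int], actual: Sequence[int]) -> Dict[str, float]:
    """Summarize agreement between predicted and self-reported 0-10 ratings."""
    n = len(predicted)
    if n == 0 or n != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - a for p, a in zip(predicted, actual)]
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("Pearson r is undefined when one series is constant")
    return {
        "exact_match": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        # Negative mean gap means predictions run lower than self-reports.
        "mean_gap": sum(diffs) / n,
        "pearson_r": cov / math.sqrt(var_p * var_a),
    }


if __name__ == "__main__":
    # Toy ratings for illustration only.
    summary = agreement_summary([7, 8, 5, 9, 3], [8, 8, 7, 10, 4])
    for name, value in summary.items():
        print(f"{name}: {value:.3f}")
```

Reporting the signed mean gap alongside the correlation matters here because a high r is compatible with a constant offset, which is the pattern the abstract describes.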
These baseline results establish that a simple, unoptimized prompt can directionally predict how fans rate their experience from the text a fan wrote and that a gap between the two numbers can be interpreted as a construct difference worth preserving rather than an error to eliminate.
[30] Purging the Gray Zone: Latent-Geometric Denoising for Precise Knowledge Boundary Awareness
Hao An,Yibin Lou,Jiayi Guo,Yang Xu
Main category: cs.CL
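TL;DR: This paper proposes the GeoDe framework, which uses geometric distance as a confidence signal for "geometric denoising", addressing hallucination and over-abstention caused by ambiguous internal beliefs near the decision boundary.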
Details
Motivation: Existing abstention fine-tuning methods partition datasets directly by response accuracy, which produces heavy label noise near the decision boundary and leads to high abstention rates or hallucinations; the authors find a "gray zone" of ambiguous internal belief in latent space that is the performance bottleneck. Method: From a latent-representation perspective, fit linear probes to determine a "truth hyperplane" and use each sample's geometric distance to that hyperplane as a confidence signal, filtering out ambiguous boundary samples while keeping high-fidelity signals. Result: Across several models and datasets (Llama3, Qwen3; TriviaQA, NQ, SciQ, SimpleQA), GeoDe improves model truthfulness and generalizes well in out-of-distribution (OOD) scenarios. Conclusion: By modeling the model's knowledge boundary geometrically, GeoDe mitigates hallucination and over-abstention and offers a new approach to building trustworthy LLMs. Abstract: Large language models (LLMs) often exhibit hallucinations due to their inability to accurately perceive their own knowledge boundaries. Existing abstention fine-tuning methods typically partition datasets directly based on response accuracy, causing models to suffer from severe label noise near the decision boundaries and consequently exhibit high rates of abstentions or hallucinations. This paper adopts a latent space representation perspective, revealing a "gray zone" near the decision hyperplane where internal belief ambiguity constitutes the core performance bottleneck. Based on this insight, we propose the GeoDe (Geometric Denoising) framework for abstention fine-tuning. This method constructs a truth hyperplane using linear probes and performs "geometric denoising" by employing geometric distance as a confidence signal for abstention decisions. This approach filters out ambiguous boundary samples while retaining high-fidelity signals for fine-tuning. Experiments across multiple models (Llama3, Qwen3) and benchmark datasets (TriviaQA, NQ, SciQ, SimpleQA) demonstrate that GeoDe significantly enhances model truthfulness and demonstrates strong generalization in out-of-distribution (OOD) scenarios. Code is available at https://github.com/Notbesidemoon/GeoDe.
[31] Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance
Bar Alon,Itamar Zimerman,Lior Wolf
Main category: cs.CL
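TL;DR: This paper evaluates the epistemic faithfulness of post-hoc textual explanations generated by large language models (LLMs) and finds them often unfaithful. It then proposes a training-free attention-level intervention, guided by token-level heatmaps from a faithful attribution method, that substantially improves explanation faithfulness.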
Details
Motivation: LLMs lack explainability and are treated as black boxes, which limits their use where transparency and trust are required; post-hoc textual explanations are persuasive, but it is unclear whether they reflect the evidence the model actually relied on (epistemic faithfulness). Method: First assess the epistemic faithfulness of LLM-generated explanations through counterfactual analysis; then propose a training-free method that intervenes at the attention level, guided by token-level heatmaps from a faithful attribution method, to steer explanation generation. Result: Existing explanations are often unfaithful; the proposed method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts. Conclusion: Epistemic faithfulness can be measured and improved; training-free attention intervention is an effective new route to more reliable LLM explanations. Abstract: Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.
[32] Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation
Zichong Li,Chen Liang,Liliang Ren,Tuo Zhao,Yelong Shen,Weizhu Chen
Main category: cs.CL
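TL;DR: This paper proposes RoPE-Perturbed Self-Distillation, which perturbs RoPE position encodings to create different views of the same sequence and uses self-distillation to make the model robust to position changes, improving positional invariance and generalization on long-context tasks.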
Details
Motivation: Existing long-context fine-tuning is sensitive to the absolute position of the evidence, showing high positional variance and poor robustness. Method: RoPE-Perturbed Self-Distillation: during training, perturb RoPE position indices to create views of the sequence with different position layouts, and use self-distillation to constrain the model to produce consistent outputs across views, weakening reliance on position and strengthening reliance on semantics. Result: Validated on Llama-3-8B and Qwen-3-4B: up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B, along with better length extrapolation. Conclusion: RoPE perturbation combined with self-distillation is a simple, effective regularizer that improves positional robustness and generalization of long-context models. Abstract: Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
[33] When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden
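Apoorv Prasad, Susan McRoy
Main category: cs.CL
TL;DR: This paper presents an approach based on small open-source language models that automatically detects, in social media posts, the triple burden common among people with polycystic ovary syndrome (PCOS), namely body image distress, disordered eating, and metabolic problems, and produces explainable structured output.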
Details
Motivation: Women with PCOS face elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing NLP approaches lack transparency and cannot identify co-occurring conditions.
Method: Collects 1,000 PCOS-related Reddit posts, labeled by two annotators following the clinical framework of Lee et al. (2017); fine-tunes three small models (Gemma-2-2B, Qwen3-1.7B, and DeepSeek-R1-Distill-Qwen-1.5B) with LoRA to generate structured explanations with textual evidence.
Result: The best model reaches 75.3% exact-match accuracy on a 150-post test set, with robust comorbidity detection and strong explainability; performance declines as diagnostic complexity increases.
Conclusion: The approach is suited to preliminary screening for PCOS-related psychological and metabolic problems rather than autonomous diagnosis, and highlights the potential of small models combined with explainable AI for clinical support.
Abstract: Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines operationalizing the Lee et al. (2017) clinical framework. Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating their best use is for screening rather than autonomous diagnosis.
[34] APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI
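Pratyay Banerjee, Masud Moshtaghi, Shivashankar Subramanian, Amita Misra, Ankit Chadha
Main category: cs.CL
TL;DR: This paper proposes APEX-MEM, a property-graph-based conversational memory system that markedly improves accuracy on long-term conversational memory tasks through an entity-centric design, temporal modeling, and a multi-tool retrieval agent.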
Details
Motivation: LLMs remain unreliable at long-term conversational memory; enlarging the context window or applying naive retrieval tends to introduce noise and destabilize responses.
Method: Proposes the APEX-MEM system with three core innovations: (1) a property graph built on a domain-agnostic ontology that models conversations as temporally grounded events; (2) append-only storage that preserves the full temporal evolution of information; (3) a multi-tool retrieval agent that understands and resolves conflicting or evolving information at query time, producing a compact, contextually relevant memory summary.
Result: Achieves 88.88% accuracy on the LOCOMO question answering task and 86.2% on LongMemEval, surpassing existing session-aware approaches.
Conclusion: Structured property graphs support more temporally coherent long-term conversational reasoning, confirming that explicit knowledge structure helps improve LLM long-term memory.
Abstract: Large language models still struggle with reliable long-term conversational memory: simply enlarging context windows or applying naive retrieval often introduces noise and destabilizes responses. We present APEX-MEM, a conversational memory system that combines three key innovations: (1) a property graph that uses a domain-agnostic ontology to structure conversations as temporally grounded events in an entity-centric framework, (2) append-only storage that preserves the full temporal evolution of information, and (3) a multi-tool retrieval agent that understands and resolves conflicting or evolving information at query time, producing a compact and contextually relevant memory summary. This retrieval-time resolution preserves the full interaction history while suppressing irrelevant details. APEX-MEM achieves 88.88% accuracy on LOCOMO's Question Answering task and 86.2% on LongMemEval, outperforming state-of-the-art session-aware approaches and demonstrating that structured property graphs enable more temporally coherent long-term conversational reasoning.
[35] The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
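Akshay Paruchuri, Ishan Chatterjee, Henry Fuchs, Ehsan Adeli, Piotr Didyk
Main category: cs.CL
TL;DR: This paper proposes centroid replacement to probe modal dependence in multimodal language models and finds that text representations generally overshadow visual representations on visual reasoning tasks; text centroid contrastive decoding markedly improves accuracy at inference time, and the effect varies with the training approach, providing a quantifiable diagnostic signal for multimodal training.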
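A minimal sketch of centroid replacement follows, assuming the centroids have already been fitted with K-means and that "nearest" means squared Euclidean distance. The `contrastive_logits` helper uses one common contrastive-decoding form, `(1 + alpha) * full - alpha * reference`; that formula is an assumption for illustration, since this entry does not state the paper's exact decoding rule. All function names are hypothetical.

```python
def nearest_centroid(vector, centroids):
    """Index of the centroid closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(vector, centroids[k]))

def centroid_replace(token_vectors, centroids):
    """Collapse every token vector to its nearest centroid."""
    return [list(centroids[nearest_centroid(v, centroids)]) for v in token_vectors]

def contrastive_logits(full, reference, alpha):
    """Assumed contrastive form: amplify what the erased reference lacks."""
    return [(1 + alpha) * f - alpha * r for f, r in zip(full, reference)]

centroids = [[0.0, 0.0], [1.0, 1.0]]             # toy, pre-fitted centroids
tokens = [[0.1, -0.2], [0.9, 1.2], [0.4, 0.4]]   # toy token representations
erased = centroid_replace(tokens, centroids)

# Toy logits from the intact model and from the text-centroid-erased reference.
adjusted = contrastive_logits([2.0, 1.0], [1.0, 1.0], alpha=0.5)
```

Applying `centroid_replace` only to text tokens or only to visual tokens gives the two probing conditions that the entry compares.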
Details
Motivation: Multimodal language models systematically underperform on visual perception tasks, but the structural cause of this failure is unclear.
Method: Proposes centroid replacement (collapsing each token to its nearest K-means centroid) as a controlled probe of modal dependence; further designs text centroid contrastive decoding, which contrasts against a text-centroid-erased reference output at inference time to improve performance.
Result: Across seven models spanning three architecture families, erasing text centroid structure costs 4x more accuracy than erasing visual centroid structure; text centroid contrastive decoding recovers up to +16.9% accuracy; standard fine-tuned models gain +5.6% on average, versus only +1.5% for preference-optimized models.
Conclusion: Modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
Abstract: Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4x more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
[36] BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking
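Hyunkyung Park, Arkaitz Zubiaga
Main category: cs.CL
TL;DR: This paper proposes a conservative rewriting approach for fact-checking colloquial claims in dialogue, combining staged de-colloquialisation with a semantics-aware consistency gate (BiCon-Gate); it markedly improves evidence retrieval and fact verification on the DialFact benchmark, with especially strong gains on the SUPPORTS class.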
Details
Motivation: Existing automated fact-checking research lacks effective handling of colloquial language, which is frequent in multi-turn dialogue yet understudied.
Method: Generates a conservative rewrite candidate through staged de-colloquialisation, and designs BiCon-Gate, a semantics-aware consistency gate that adopts the rewrite candidate only when it is semantically supported by the dialogue context and otherwise falls back to the original claim.
Result: On the DialFact benchmark, the approach outperforms strong baselines (including a one-shot LLM rewrite) on both evidence retrieval and fact verification, with particularly strong gains on the SUPPORTS label.
Conclusion: Staged, lightweight de-colloquialisation combined with semantic gating stabilizes downstream fact-checking, showing that conservative, context-aware rewriting is more effective than aggressive end-to-end rewriting.
Abstract: Automated fact-checking in dialogue involves multi-turn conversations where colloquial language is frequent yet understudied. To address this gap, we propose a conservative rewrite candidate for each response claim via staged de-colloquialisation, combining lightweight surface normalisation with scoped in-claim coreference resolution. We then introduce BiCon-Gate, a semantics-aware consistency gate that selects the rewrite candidate only when it is semantically supported by the dialogue context, otherwise falling back to the original claim. This gated selection stabilises downstream fact-checking and yields gains in both evidence retrieval and fact verification. On the DialFact benchmark, our approach improves retrieval and verification, with particularly strong gains on SUPPORTS, and outperforms competitive baselines, including a decoder-based one-shot LLM rewrite that attempts to perform all de-colloquialisation steps in a single pass.
[37] Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection
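David Basil, Chirooth Girigowda, Bradley Hauer, Sahir Momin, Ning Shi, Grzegorz Kondrak
Main category: cs.CL
TL;DR: This paper proposes a method that automatically generates multilingual WordNet-style lexical resources through semantic projection and dictionary-based filtering, improving precision and interpretability while reducing reliance on external resources.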
Details
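Motivation: Automatically extending lexical resources such as WordNet to new languages is hampered by the difficulty of cross-lingual sense alignment and annotation; existing methods fall short in precision, interpretability, and resource requirements.
Method: Based on bilingual alignment and semantic projection: using a sense-tagged English corpus and its translation, English synsets are projected onto aligned target-language tokens, and a bilingual dictionary is used to augment the aligner and filter out incorrect projections.
Result: Experiments on multiple languages show that this project-and-filter strategy markedly improves precision over prior methods and over dictionary-based and LLM baselines, while remaining highly interpretable and requiring few resources.
Conclusion: Semantic projection combined with dictionary augmentation and filtering is an efficient, transparent, and lightweight paradigm for building cross-lingual sense resources, with value for practical deployment and open sharing.
Abstract: We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects English synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate these alignments and ensure their quality, we augment a pre-trained base aligner with a bilingual dictionary, which is also used to filter out incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and requiring few external resources. We plan to make our code, documentation, and generated sense inventories accessible.
[38] The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious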
Ferdinand M. Schessl
Main category: cs.CL
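TL;DR: This paper shows that current evaluations of multi-turn human-LLM conversations ignore autocorrelation between turns, and that standard pooled tests severely overstate significance; the author systematically analyzes the autocorrelation structure of 66 turn-level metrics and proposes a two-stage correction framework combining effective degrees of freedom with a conversation-level block bootstrap, which is validated to markedly improve replicability; the study also finds that most existing papers do not correct for this problem.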
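The two stages can be illustrated with a small sketch. Two caveats apply. First, the effective sample size below uses the simple lag-1 AR(1) approximation `n * (1 - r1) / (1 + r1)` as a stand-in; it is not the Chelton (1983) estimator the paper uses, which is not reproduced here. Second, the bootstrap resamples whole conversations so that within-conversation dependence is preserved; the function names and the toy data are hypothetical.

```python
import random

def lag1_autocorr(xs):
    """Lag-1 autocorrelation of a single turn-level series."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var

def effective_n(xs):
    """AR(1) stand-in: n_eff = n * (1 - r1) / (1 + r1), clipped to [1, n]."""
    n = len(xs)
    r1 = lag1_autocorr(xs)
    n_eff = n * (1 - r1) / (1 + r1) if r1 > -1 else n
    return max(1.0, min(float(n), n_eff))

def conversation_bootstrap(conversations, stat, n_boot=1000, seed=0):
    """95% percentile interval for `stat`, resampling whole conversations."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(conversations) for _ in conversations]
        pooled = [turn for conv in sample for turn in conv]
        estimates.append(stat(pooled))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]

def mean(xs):
    return sum(xs) / len(xs)

# Toy metric values: three conversations of four turns each.
convs = [[1.0, 1.1, 1.2, 1.3], [3.0, 3.1, 2.9, 3.0], [2.0, 2.2, 2.1, 2.3]]
low, high = conversation_bootstrap(convs, mean, n_boot=200, seed=1)

# A smoothly drifting series carries far fewer independent observations.
smooth = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
n_eff = effective_n(smooth)
```

Resampling at the conversation level is what makes the interval cluster-robust: a pooled bootstrap over individual turns would treat correlated turns as independent draws.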
Details
Motivation: Turn-level metrics are widely used to evaluate multi-turn human-LLM conversations, yet virtually all evaluation pipelines ignore the statistical dependence between turns (autocorrelation), making statistical inference unreliable.
Method: Systematically characterizes the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations; proposes a two-stage correction framework combining the Chelton (1983) effective degrees of freedom estimate with a conversation-level block bootstrap; validates the correction on a pre-registered hold-out set.
Result: 42% of associations that are significant under standard pooled testing are no longer significant after cluster-robust correction; the three memoryless metric families aggregate to a 14% loss of significance, versus 33% for the seven non-memoryless families; corrected metrics replicate at 57% on the hold-out set, compared with 30% for uncorrected metrics; a survey finds that of about 30 recent papers at major venues, only 4 address temporal dependence and 26 do not correct for it at all.
Conclusion: Ignoring autocorrelation between turns leads to serious statistical misjudgment; cluster-robust correction should be used in multi-turn conversation evaluation; the paper provides practical design principles, a publication checklist, and an open-source toolchain.
Abstract: Turn-level metrics are widely used to evaluate properties of multi-turn human-LLM conversations, from safety and sycophancy to dialogue quality. However, consecutive turns within a conversation are not statistically independent -- a fact that virtually all current evaluation pipelines fail to correct for in their statistical inference. We systematically characterize the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations (11,639 turn pairs, 5 German-speaking users, 4 LLM platforms) and demonstrate that naive pooled analysis produces severely inflated significance estimates: 42% of associations that appear significant under standard pooled testing fail to survive cluster-robust correction. The inflation varies substantially across categories rather than scaling linearly with autocorrelation: three memoryless families (embedding velocity, directional, differential) aggregate to 14%, while the seven non-memoryless families (thermo-cycle, frame distance, lexical/structural, rolling windows, cumulative, interaction, timestamp) aggregate to 33%, with individual category rates ranging from 0% to 100% depending on per-family effect size. We present a two-stage correction framework combining Chelton (1983) effective degrees of freedom with conversation-level block bootstrap, and validate it on a pre-registered hold-out split where cluster-robust metrics replicate at 57% versus 30% for pooled-only metrics. We provide concrete design principles, a publication checklist, and open-source code for the correction pipeline.
A survey of ~30 recent papers at major NLP and AI venues that compute turn-level statistics in LLM evaluations finds that only 4 address temporal dependence at all, and 26 do not correct for it.
[39] Three-Phase Transformer
Mohammad R. Abu Ayyash
Main category: cs.CL
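TL;DR: This paper proposes a new architecture called Three-Phase Transformer (3PT), which improves the training stability and efficiency of decoder-only Transformers by introducing into the residual stream a cyclic channel partition, phase-respecting operations (such as Givens rotations), per-channel normalization, and a Gabriel's horn positional encoding in the DC subspace. It markedly lowers perplexity and speeds up convergence on WikiText-103.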
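The geometric facts behind the three-phase metaphor can be checked numerically. The sketch below assumes channel `i` is rotated by `theta + i * (2 * pi / N)`, as the abstract states, and rotates a single 2D pair per channel; how the rotation is laid out across the dimensions inside a channel is not specified in this entry, so that part is an illustrative assumption, as are the function names.

```python
import math

def givens_rotate(x, y, angle):
    """Rotate the 2D pair (x, y) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x - s * y, s * x + c * y

def channel_angles(theta, n_channels):
    """Angle for channel i: theta + i * (2 * pi / n_channels)."""
    return [theta + i * (2 * math.pi / n_channels) for i in range(n_channels)]

def horn_profile(num_positions):
    """Gabriel's horn profile r(p) = 1 / (p + 1) for positions 0..num_positions-1."""
    return [1.0 / (p + 1) for p in range(num_positions)]

theta = 0.3
angles = channel_angles(theta, 3)

# Balanced three-phase property: sinusoids 120 degrees apart sum to zero.
phase_sum = sum(math.cos(a) for a in angles)

# A Givens rotation preserves the norm of the pair it rotates.
x, y = givens_rotate(3.0, 4.0, angles[1])
norm_after = math.hypot(x, y)

horn = horn_profile(4)
```

Because each rotation is norm-preserving, it reshapes the direction of the residual stream within a channel without changing its scale, which leaves scale control to the per-channel RMSNorm.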
Details
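Motivation: Addresses the training instability and slow convergence that arise because the residual stream of decoder-only Transformers lacks a structural prior, drawing on the balance and self-stabilizing properties of physical systems (such as three-phase AC) to design intrinsic structural constraints.
Method: The hidden vector is partitioned into N equally sized cyclic channels; each channel has its own RMSNorm; a 2D Givens rotation between attention and FFN rotates channel i by theta + i*(2*pi/N); the number of GQA heads is matched to the number of channels; a Gabriel's horn absolute-position profile r(p) = 1/(p+1) is injected into the DC subspace orthogonal to the channels, composing orthogonally with RoPE.
Result: At 123M parameters, relative to the RoPE baseline, perplexity drops by 7.20% (-2.62% bpb) with a parameter overhead of only +1,536 (0.00124%), and step-count convergence is 1.93x faster (1.64x wall-clock); N=3 is the canonical configuration, and the effect of N is scale-dependent; the study also confirms the self-stabilizing geometry, a U-shaped depth profile of rotation-angle drift, and orthogonal composability with RoPE, attention, and FFN.
Conclusion: 3PT shows that embedding a lightweight but structurally rigorous, physics-inspired prior in the residual stream can improve Transformer training dynamics and generalization; its core mechanisms (channel partition, per-phase normalization, rotation, DC injection) together form a new self-stabilizing neural network paradigm.
Abstract: We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable.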
The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
[40] Hierarchical vs. Flat Iteration in Shared-Weight Transformers
Sang-Il Han
Main category: cs.CL
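TL;DR: This paper empirically compares the representational capacity of hierarchical, shared-weight recurrence with that of independently stacked Transformer layers in language modeling and finds a marked performance gap for the former.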
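The two-speed unrolling described in this entry, a Fast module at every step and a Slow module every T steps over M = N x T steps with shared parameters, can be sketched as a schedule. The details here are assumptions made for illustration: the Slow module is taken to fire after every T-th Fast step, and the separate states of the two modules are collapsed into one toy state. `unroll_schedule` and `run` are hypothetical names.

```python
def unroll_schedule(n_slow, t):
    """Module order for M = n_slow * t steps.

    Assumption: Slow fires right after every t-th Fast step; the source only
    says Fast runs at every step and Slow runs every T steps.
    """
    schedule = []
    for step in range(1, n_slow * t + 1):
        schedule.append("fast")
        if step % t == 0:
            schedule.append("slow")
    return schedule

def run(state, fast_fn, slow_fn, n_slow, t):
    """Apply the same two functions (shared weights) along the schedule."""
    for name in unroll_schedule(n_slow, t):
        state = fast_fn(state) if name == "fast" else slow_fn(state)
    return state

schedule = unroll_schedule(n_slow=2, t=3)

# Toy modules: Fast increments, Slow multiplies, so the order is visible.
final = run(0, fast_fn=lambda s: s + 1, slow_fn=lambda s: s * 10, n_slow=2, t=3)
```

The point of the sketch is the parameter count: however long the schedule, only two functions are ever applied, in contrast to L independently parameterized layers.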
Details
Motivation: Investigates whether a hierarchical, shared-weight recurrent structure (HRM-LM) can match the representational quality of independently stacked Transformer layers.
Method: Proposes HRM-LM, which replaces L independent Transformer layers with a two-speed recurrent pair (a Fast module that runs at every step and a Slow module that runs every T steps), unrolled for M = N x T steps with shared parameters; compares it against a parameter-matched Universal Transformer (UniTF, 1.2B) across five independent runs.
Result: Under parameter-matched conditions, there is a marked and robust performance gap between HRM-LM and UniTF.
Conclusion: In the current setting, hierarchical shared-weight recurrence cannot match independently stacked Transformer layers and is inherently limited in representational quality.
Abstract: We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module operating at every step for local refinement, and a Slow module operating every T steps for global compression. This recurrent hierarchy is unrolled for M = N x T steps with shared parameters. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.
[41] MARCA: A Checklist-Based Benchmark for Multilingual Web Search
Thales Sales Almeida, Giovana Kerche Bonás, Ramon Pires, Celio Larcher, Hugo Abonizio, Marcos Piau, Roseval Malaquias Junior, Rodrigo Nogueira, Thiago Laitz
Main category: cs.CL
TL;DR: This paper presents MARCA, a bilingual (English and Portuguese) benchmark for evaluating large language models on web-based information-seeking tasks.
Details
Motivation: Existing benchmarks offer little support for multilingual settings, Portuguese in particular, while LLMs used in practice need to search the web, select evidence, and synthesize answers reliably.
Method: Builds the bilingual benchmark MARCA, comprising 52 manually authored multi-entity questions with corresponding checklist-style rubrics, and evaluates 14 models under two interaction frameworks, Basic and Orchestrator, running each question multiple times to quantify uncertainty.
Result: Performance differs widely across models; the Orchestrator framework often improves answer coverage; models vary substantially in how well they transfer from English to Portuguese.
Conclusion: MARCA fills a gap in multilingual, and especially Portuguese, information-seeking evaluation and exposes the limits of current models in cross-lingual transfer and task decomposition.
Abstract: Large language models (LLMs) are increasingly used as sources of information, yet their reliability depends on the ability to search the web, select relevant evidence, and synthesize complete answers. While recent benchmarks evaluate web-browsing and agentic tool use, multilingual settings, and Portuguese in particular, remain underexplored. We present MARCA, a bilingual (English and Portuguese) benchmark for evaluating LLMs on web-based information seeking. MARCA consists of 52 manually authored multi-entity questions, paired with manually validated checklist-style rubrics that explicitly measure answer completeness and correctness. We evaluate 14 models under two interaction settings: a Basic framework with direct web search and scraping, and an Orchestrator framework that enables task decomposition via delegated subagents. To capture stochasticity, each question is executed multiple times and performance is reported with run-level uncertainty. Across models, we observe large performance differences, find that orchestration often improves coverage, and identify substantial variability in how models transfer from English to Portuguese. The benchmark is available at https://github.com/maritaca-ai/MARCA
[42] Filling in the Mechanisms: How do LMs Learn Filler-Gap Dependencies under Developmental Constraints?
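Atrey Desai, Sathvik Nair
Main category: cs.CL
TL;DR: This paper studies whether language models trained on limited data can, like humans, form shared representations of filler-gap dependencies across syntactic constructions. Using DAS to analyze models from the BabyLM challenge, the authors find that models can develop shared but item-sensitive mechanisms even with limited data, yet remain far less data-efficient than humans, suggesting that current models need language-specific biases.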
Details
Motivation: Examines whether language models trained on developmentally feasible amounts of data have human-like shared representations of filler-gap dependencies across syntactic constructions.
Method: Applies Distributed Alignment Search (DAS) to language models trained on varying amounts of BabyLM challenge data, analyzing whether representations of filler-gap dependencies transfer between wh-questions and topicalization.
Result: Models can develop shared but item-sensitive mechanisms with limited data, but still need far more data than humans to reach comparable generalization.
Conclusion: Current language models lack the language-specific prior biases that humans have, and these need to be introduced in models of language acquisition.
Abstract: For humans, filler-gap dependencies require a shared representation across different syntactic constructions. Although causal analyses suggest this may also be true for LLMs (Boguraev et al., 2025), it is still unclear if such a representation also exists for language models trained on developmentally feasible quantities of data. We applied Distributed Alignment Search (DAS; Geiger et al., 2024) to LMs trained on varying amounts of data from the BabyLM challenge (Warstadt et al., 2023), to evaluate whether representations of filler-gap dependencies transfer between wh-questions and topicalization, which greatly vary in terms of their input frequency. Our results suggest shared, yet item-sensitive mechanisms may develop with limited training data. More importantly, LMs still require far more data than humans to learn comparable generalizations, highlighting the need for language-specific biases in models of language acquisition.
[43] Psychological Steering of Large Language Models
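Leonardo Blas, Robin Jia, Emilio Ferrara
Main category: cs.CL
TL;DR: This paper proposes a psychology-based framework for steering LLM behavior that calibrates residual-stream injections with the IPIP-NEO-120 inventory and performs unbounded sweeps in semantically calibrated units; mean-difference (MD) injections outperform conventional Personality Prompting (P²) on most models, and a hybrid of MD and P² performs best; the work also supports the Linear Representation Hypothesis while revealing a gap between model representations and the structure of human Big Five personality.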
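A mean-difference injection can be sketched in a few lines. This is a generic illustration of the technique as it is usually described, under stated assumptions: the steering direction is the per-dimension difference between mean activations on trait-high and trait-low prompts, and it is added to the hidden state with a scalar strength. The paper's calibration to IPIP-NEO-120 units and its fluency-constrained sweep are not reproduced, and the function names and toy activations are hypothetical.

```python
def mean_vector(rows):
    """Per-dimension mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_difference(pos_acts, neg_acts):
    """Steering direction: mean(pos) - mean(neg), per hidden dimension."""
    mean_pos, mean_neg = mean_vector(pos_acts), mean_vector(neg_acts)
    return [a - b for a, b in zip(mean_pos, mean_neg)]

def inject(hidden, direction, strength):
    """Additive residual-stream injection: h + strength * direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]

high_trait = [[1.0, 2.0], [3.0, 2.0]]  # toy activations on trait-high prompts
low_trait = [[0.0, 1.0], [0.0, 1.0]]   # toy activations on trait-low prompts

direction = mean_difference(high_trait, low_trait)
steered = inject([0.5, 0.5], direction, strength=2.0)
```

Sweeping `strength` and scoring the resulting generations is what an injection-strength sweep amounts to; the entry's contribution lies in how that sweep is calibrated and bounded, which this sketch leaves out.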
Details
Motivation: Existing activation-intervention methods are limited by restricted search spaces and uncalibrated activation units, making optimal intervention settings hard to find; a semantically interpretable, psychologically aligned intervention paradigm is needed.
Method: Proposes a psychological steering framework that calibrates residual-stream injections with the IPIP-NEO-120 inventory; compares six injection methods, focusing on mean-difference (MD) injections and their hybrid with Personality Prompting (P²); runs open-ended generation experiments on 14 LLMs and analyzes OCEAN trait covariance patterns.
Result: MD injections outperform P² in 11 of 14 LLMs (gains of 3.6% to 16.4%); the MD+P² hybrid outperforms both in 13 models (gains of 5.6% to 21.9% over P² and 3.3% to 26.7% over MD); MD injections support the Linear Representation Hypothesis, but the induced OCEAN covariance departs from the human Big Two structure.
Conclusion: Residual-stream representation engineering is a new frontier for psychological steering in open-ended generation; semantically calibrated injections markedly outperform prompt engineering; linear control is feasible, but the models' internal personality representations are not yet fully aligned with the structure of human psychology.
Abstract: Large language models (LLMs) emulate a consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P²), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P² and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P² ranging from 5.6% to 21.9% and from 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering.
Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.
[44] CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling
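Karthik Singaravadivelan, Anant Gupta, Zekun Wang, Christopher MacLellan
Main category: cs.CL
TL;DR: This paper proposes CobwebTM, a low-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation that builds semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without a predefined number of topics.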
Details
Motivation: Neural topic models perform well but require extensive tuning and struggle with lifelong learning (catastrophic forgetting, fixed capacity), while classical probabilistic models lack flexibility and adaptability to streaming data.
Method: Adapts the Cobweb algorithm to the space of continuous document embeddings, building semantic hierarchies online through incremental symbolic concept formation on top of pretrained representations.
Result: Shows strong topic coherence, temporal stability, and high-quality hierarchies across multiple datasets.
Conclusion: Combining incremental symbolic concept formation with pretrained representations is an effective route to efficient topic modeling.
Abstract: Topic modeling seeks to uncover latent semantic structure in text corpora with minimal supervision. Neural approaches achieve strong performance but require extensive tuning and struggle with lifelong learning due to catastrophic forgetting and fixed capacity, while classical probabilistic models lack flexibility and adaptability to streaming data. We introduce CobwebTM, a low-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation. By adapting the Cobweb algorithm to continuous document embeddings, CobwebTM constructs semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without predefining the number of topics. Across diverse datasets, CobwebTM achieves strong topic coherence, stable topics over time, and high-quality hierarchies, demonstrating that incremental symbolic concept formation combined with pretrained representations is an efficient approach to topic modeling.
[45] PeerPrism: Peer Evaluation Expertise vs Review-writing AI
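Soroush Sadeghian, Alireza Daqiq, Radin Cheraghi, Sajad Ebrahimi, Negar Arabzadeh, Ebrahim Bagheri
Main category: cs.CL
TL;DR: This paper introduces the PeerPrism benchmark for evaluating detection of human-AI collaboration in scientific peer review, and shows that current detection methods conflate surface text generation with the origin of ideas and cannot reliably attribute ideas in hybrid authorship.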
Details
Motivation: Existing LLM detection methods reduce authorship to a binary human-vs-AI question, overlooking the hybrid reality of actual reviewing, where ideas and text may come from different sources.
Method: Builds PeerPrism, a large-scale benchmark of 20,690 reviews covering fully human, fully synthetic, and multiple hybrid generation regimes; systematically evaluates mainstream LLM text detection methods, supplemented by stylometric and semantic analyses.
Result: Mainstream detectors perform well on the binary task, but under hybrid regimes (such as human ideas with AI-written text) their predictions diverge sharply and often yield contradictory classifications; the analysis shows that they wrongly equate surface style with the origin of ideas.
Conclusion: LLM detection in peer review cannot be reduced to a binary attribution problem and should instead be modeled as a multidimensional authorship problem spanning semantic reasoning and stylistic realization; PeerPrism is the first public benchmark for human-AI collaborative reviewing.
Abstract: Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI) without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration. In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state-of-the-art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI-generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution. Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem. Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization.
PeerPrism is the first benchmark evaluating human-AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at https://github.com/Reviewerly-Inc/PeerPrism.
[46] Mechanistic Decoding of Cognitive Constructs in LLMs
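Yitong Shou, Manhao Guan
Main category: cs.CL
TL;DR: This paper proposes a cognitive reverse-engineering framework based on representation engineering to dissect the internal structure of social-comparison jealousy in large language models (LLMs); it finds that models encode jealousy as a linear combination of two psychological antecedents, the superiority of the comparison person and domain self-definitional relevance, that these representations are consistent with human psychology, and that toxic emotional states can be detected and intervened on.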
Details
Motivation: Existing interpretability methods mostly treat models as black boxes or focus only on basic emotions, and so struggle to reveal the cognitive structure of complex emotions such as jealousy.
Method: Combines appraisal theory with Representation Engineering (RepE), using subspace orthogonalization, regression-based weighting, and bidirectional causal steering to isolate and quantify the two psychological antecedents of jealousy and to test their causal effects on model judgments.
Result: Validated on eight LLMs from the Llama, Qwen, and Gemma families: models natively encode jealousy in a structured, linear form; Superiority is the foundational trigger and Relevance modulates intensity; toxic emotional representations can be mechanically detected and precisely suppressed.
Conclusion: Emotional representations in LLMs have a decomposable psychological structure; the framework both reveals their affective-cognitive mechanisms and offers a new route to AI safety in multi-agent environments based on representational monitoring and intervention.
Abstract: While Large Language Models (LLMs) demonstrate increasingly sophisticated affective capabilities, the internal mechanisms by which they process complex emotions remain unclear. Existing interpretability approaches often treat models as black boxes or focus on coarse-grained basic emotions, leaving the cognitive structure of more complex affective states underexplored. To bridge this gap, we propose a Cognitive Reverse-Engineering framework based on Representation Engineering (RepE) to analyze social-comparison jealousy. By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance, and examine their causal effects on model judgments. Experiments on eight LLMs from the Llama, Qwen, and Gemma families suggest that models natively encode jealousy as a structured linear combination of these constituent factors. Their internal representations are broadly consistent with the human psychological construct, treating Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier. Our framework also demonstrates that toxic emotional states can be mechanically detected and surgically suppressed, suggesting a possible route toward representational monitoring and intervention for AI safety in multi-agent environments.
[47] NLP needs Diversity outside of 'Diversity'
Joshua Tint
Main category: cs.CL
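TL;DR: This paper argues that progress on diversity in NLP is concentrated mainly in fairness-related research while other subfields are neglected; the author attributes this to the combined effect of incentives, biases, and barriers and, based on an analysis of the makeup of NLP researchers by subfield, recommends breaking the feedback loops that reinforce disparities and removing geographical and linguistic barriers.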
Details
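Motivation: Recent progress on diversity in NLP is overly concentrated in fairness-related research and neglects other subfields; this imbalance stems from systemic factors such as incentives, biases, and structural barriers that make it hard for marginalized researchers to take part in non-fairness research.
Method: Surveys the demographics of researchers across NLP subfields (such as region and language background), combined with qualitative analysis, to identify key factors affecting diversity and derive recommendations.
Result: Finds marked differences in researcher diversity across NLP subfields: fairness research is comparatively more inclusive, while non-fairness fields show clear geographical and linguistic barriers to participation; identifies feedback loops that reinforce inequality.
Conclusion: Advancing diversity across NLP requires going beyond a sole focus on fairness: geographical, linguistic, and institutional barriers should be removed systematically so that marginalized researchers can participate and develop equally in every subfield.
Abstract: This position paper argues that recent progress with diversity in NLP is disproportionately concentrated on a small number of areas surrounding fairness. We further argue that this is the result of a number of incentives, biases, and barriers which come together to disenfranchise marginalized researchers in non-fairness fields, or to move them into fairness-related fields. We substantiate our claims with an investigation into the demographics of NLP researchers by subfield, using our research to support a number of recommendations for ensuring that all areas within NLP can become more inclusive and equitable. In particular, we highlight the importance of breaking down feedback loops that reinforce disparities, and the need to address geographical and linguistic barriers that hinder participation in NLP research.
[48] CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification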
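Yian Wang, Yuen Chen, Agam Goyal, Hari Sundaram
Main category: cs.CL
TL;DR: This paper proposes the CAUSALDETOX framework, which uses causal analysis to identify and intervene on the key attention heads responsible for toxic generation, combining inference-time intervention with fine-tuning to markedly reduce toxicity while preserving fluency, and introduces PARATOX, a new benchmark for controlled counterfactual evaluation.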
Details
Motivation: LLMs frequently generate toxic content, and existing mitigation methods often degrade generation quality or depend on costly human annotation.
Method: Proposes the CAUSALDETOX framework, which identifies key attention heads using the Probability of Necessity and Sufficiency (PNS) and applies two strategies: local inference-time intervention (constructing dynamic, input-specific steering vectors) and PNS-guided fine-tuning; also builds PARATOX, a benchmark of toxic/non-toxic sentence pairs that supports counterfactual evaluation.
Result: Experiments on ToxiGen, ImplicitHate, and ParaDetox show that CAUSALDETOX achieves up to 5.34% greater toxicity reduction than baselines, preserves linguistic fluency, and delivers a 7x speedup in head selection.
Conclusion: CAUSALDETOX offers an efficient, precise toxicity-mitigation method that does not require extensive human annotation and retains both inference efficiency and model performance.
Abstract: Large language models (LLMs) frequently generate toxic content, posing significant risks for safe deployment. Current mitigation strategies often degrade generation quality or require costly human annotation. We propose CAUSALDETOX, a framework that identifies and intervenes on the specific attention heads causally responsible for toxic generation. Using the Probability of Necessity and Sufficiency (PNS), we isolate a minimal set of heads that are necessary and sufficient for toxicity. We utilize these components via two complementary strategies: (1) Local Inference-Time Intervention, which constructs dynamic, input-specific steering vectors for context-aware detoxification, and (2) PNS-Guided Fine-Tuning, which permanently unlearns toxic representations. We also introduce PARATOX, a novel benchmark of aligned toxic/non-toxic sentence pairs enabling controlled counterfactual evaluation. Experiments on ToxiGen, ImplicitHate, and ParaDetox show that CAUSALDETOX achieves up to 5.34% greater toxicity reduction compared to baselines while preserving linguistic fluency, and offers a 7x speedup in head selection.
[49] Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring
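Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons
Main category: cs.CL
TL;DR: This paper proposes Retrieval-Augmented Set Completion (RASC), a new method for generating the codes of clinical value sets that first retrieves similar existing value sets and then classifies candidate codes, markedly improving accuracy and efficiency, and builds the first large-scale benchmark dataset for the task.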
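The retrieve-then-classify pattern can be sketched on toy data. Everything concrete here is an assumption made for illustration: Jaccard overlap of title tokens stands in for whatever retriever the paper uses, the `keep` callable stands in for the fine-tuned classifier, and the codes are placeholders rather than real clinical codes.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_candidates(query_terms, corpus, k):
    """Pool the codes of the k value sets most similar to the query."""
    ranked = sorted(corpus, key=lambda vs: jaccard(query_terms, vs["terms"]), reverse=True)
    pool = []
    for value_set in ranked[:k]:
        for code in value_set["codes"]:
            if code not in pool:  # keep first occurrence, preserve order
                pool.append(code)
    return pool

def complete_set(query_terms, corpus, k, keep):
    """Retrieve a candidate pool, then keep what the classifier accepts."""
    return [c for c in retrieve_candidates(query_terms, corpus, k) if keep(query_terms, c)]

corpus = [
    {"terms": ["type", "2", "diabetes"], "codes": ["D1", "D2"]},
    {"terms": ["diabetes", "screening"], "codes": ["D2", "S1"]},
    {"terms": ["asthma"], "codes": ["A1"]},
]
query = ["diabetes", "type", "1"]

pool = retrieve_candidates(query, corpus, k=2)
selected = complete_set(query, corpus, k=2, keep=lambda q, c: c.startswith("D"))
```

The classifier only ever sees the retrieved pool, never the full vocabulary, which is the output-space reduction the entry credits for the method's advantage.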
Details
Motivation: Authoring clinical value sets is a key bottleneck in clinical quality measurement and phenotyping, and generating standardized codes directly with an LLM is limited by large vocabularies, strict version control, and unreliable memorization during pretraining.
Method: Proposes Retrieval-Augmented Set Completion (RASC): first retrieve the K most similar value sets from a curated corpus of existing value sets to form a candidate code pool, then apply a binary classifier to each candidate code; fine-tunes a cross-encoder on SAPBert and compares it with MLP, LightGBM, and other models.
Result: On a benchmark of 11,803 VSAC value sets, RASC reaches AUROC 0.852 and value-set-level F1 0.298, outperforming the MLP (0.799/0.250) and zero-shot GPT-4o (F1 0.105); the number of irrelevant candidates per true positive falls from 12.3 to roughly 3.2 to 4.4.
Conclusion: By shrinking the output space, RASC reduces statistical complexity; its advantage grows with value set size and holds across several classifiers, offering a scalable and robust new paradigm for automated clinical value set construction.
Abstract: Clinical value set authoring -- the task of identifying all codes in a standardized vocabulary that define a clinical concept -- is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the K most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves AUROC 0.852 and value-set-level F1 0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC 0.799, F1 0.250); both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4 respectively. Zero-shot GPT-4o achieves value-set-level F1 0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage.
We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code, is available at: https://github.com/mukhes3/RASC.
[50] StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation
Geonhui Jang,Dongyoon Han,YoungJoon Yoo
Main category: cs.CL
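TL;DR: This paper proposes StoryCoder, a framework that improves code generation by recasting programming problems as natural-language narratives with a task overview, constraints, and example test cases. Experiments show an average 18.7% gain in zero-shot pass@10 across several benchmarks, and the narratives steer models toward more correct algorithmic strategies, fewer implementation errors, and more modular code.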
Details
Motivation: Existing code generation methods add reasoning steps or inject structure into how models think, but do not systematically integrate scattered problem conditions; inspired by how humans organize fragmented information into coherent explanations, the paper argues for a more context-rich problem representation. Method: Proposes StoryCoder, a narrative reformulation framework that turns code generation problems into natural-language narratives with three parts (task overview, constraints, and example test cases), guided by the selected algorithm and genre. Result: Experiments with 11 models on HumanEval, LiveCodeBench, and CodeForces show an average 18.7% gain in zero-shot pass@10; analyses also show that narrative coherence and genre alignment significantly affect the outcome, and that the method guides correct algorithmic strategies, reduces implementation errors, and promotes modular code. Conclusion: Structured problem representation such as narrative reformulation matters for code generation, with benefits that do not depend on model scale or architecture, pointing to a new direction for improving code generation. Abstract: Effective code generation requires both model capability and a problem representation that carefully structures how models reason and plan. Existing approaches augment reasoning steps or inject specific structure into how models think, but leave scattered problem conditions unchanged. Inspired by the way humans organize fragmented information into coherent explanations, we propose StoryCoder, a narrative reformulation framework that transforms code generation questions into coherent natural language narratives, providing richer contextual structure than simple rephrasings. Each narrative consists of three components: a task overview, constraints, and example test cases, guided by the selected algorithm and genre. Experiments across 11 models on HumanEval, LiveCodeBench, and CodeForces demonstrate consistent improvements, with an average gain of 18.7% in zero-shot pass@10. Beyond accuracy, our analyses reveal that narrative reformulation guides models toward correct algorithmic strategies, reduces implementation errors, and induces a more modular code structure. The analyses further show that these benefits depend on narrative coherence and genre alignment, suggesting that structured problem representation is important for code generation regardless of model scale or architecture. Our code is available at https://github.com/gu-ni/StoryCoder.
[51] Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
Nahyun Lee,Guijin Son
Main category: cs.CL
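TL;DR: This paper proposes a massive-option (100-option) multiple-choice evaluation protocol for more rigorously testing the true competence of large language models on a Korean orthography error detection task, exposing model weaknesses that conventional low-option settings can mask, such as semantic confusion and position bias.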
Details
Motivation: Conventional multiple-choice evaluation easily reaches near-ceiling accuracy when options are few, but that accuracy may rest on shortcut strategies rather than real competence, so a stricter evaluation is needed to reveal a model's true level. Method: Designs a massive-option protocol with 100 options and applies it to Korean orthography error detection; obtains stable estimates through fixed targets with repeated resampling and shuffling; runs padding-controlled and length-matched experiments to isolate the effect of context length. Result: Strong models do well in low-option settings but degrade clearly under dense interference at high N; two main failure modes are identified, semantic confusion and a positional preference for early options; candidate ranking, not context length, is the main bottleneck. Conclusion: Massive-option evaluation is a general stress-testing framework that exposes reliability problems conventional low-option benchmarks cannot, especially under high distractor density. Abstract: Multiple choice evaluation is widely used for benchmarking large language models, yet near ceiling accuracy in low option settings can be sustained by shortcut strategies that obscure true competence. Therefore, we propose a massive option evaluation protocol that scales the candidate set to one hundred options and sharply reduces the impact of chance performance. We apply this framework to a Korean orthography error detection task where models must pick the single incorrect sentence from a large candidate set. With fixed targets and repeated resampling and shuffling, we obtain stable estimates while separating content driven failures from positional artifacts. Across experiments, results indicate that strong performance in low option settings can overstate model competence. This apparent advantage often weakens under dense interference at high $N$, revealing gaps that conventional benchmarks tend to obscure. We identify two failure modes, semantic confusion and position bias toward early options under uncertainty. To isolate the effect of context length, we run padding controlled and length matched tests, which suggest that the main bottleneck is candidate ranking rather than context length. Together, these findings support massive option evaluation as a general framework for stress testing model reliability under extreme distractor density, beyond what low option benchmarks can reveal.
[52] Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models
Cuong Hoang,Le-Minh Nguyen
Main category: cs.CL
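TL;DR: This paper presents a reference-free financial misinformation detection method that combines zero-shot/few-shot prompting with LoRA fine-tuning of large language models, ranking first in the shared task (95.4% and 96.3% accuracy on the public and private test sets, respectively).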
Details
Motivation: Financial misinformation threatens market stability and investor trust, and in practice external evidence for cross-verification is often unavailable, so detection methods that do not rely on references are needed. Method: Building on the RFC-BENCH framework, combines LLM in-context learning (zero-shot and few-shot prompting) with parameter-efficient fine-tuning (LoRA) to improve recognition of the linguistic cues of financial manipulation. Result: Ranked first in the "Reference-Free Financial Misinformation Detection" shared task, with 95.4% accuracy on the public test set and 96.3% on the private test set; the 14B and 32B models are released. Conclusion: The method shows that analysis of semantics and contextual consistency alone is effective for financial misinformation detection, and it advances context-aware misinformation detection in financial NLP. Abstract: The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing.
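For readers unfamiliar with Low-Rank Adaptation, the following is a minimal sketch of the general technique, assuming the standard formulation in which a frozen weight matrix W is augmented by a scaled low-rank product (alpha / r) * B A. The shapes, values, and helper names are illustrative assumptions; the abstract does not state the rank, scaling, or target layers the authors used.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(x, w, a, b, alpha):
    """Compute (W + (alpha / r) * B A) x without forming the full update.

    w: frozen weights, shape d_out x d_in
    a: trainable down-projection, shape r x d_in
    b: trainable up-projection, shape d_out x r
    """
    r = len(a)
    scale = alpha / r
    base = matvec(w, x)            # frozen path: W x
    low = matvec(b, matvec(a, x))  # trainable path: B (A x)
    return [y + scale * z for y, z in zip(base, low)]

def trainable_params(d_out, d_in, r):
    """Number of trainable parameters in one LoRA-adapted matrix."""
    return r * (d_in + d_out)

x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 1.0, 1.0]]

# With B initialised to zero, the adapted layer reproduces the frozen layer.
y_init = lora_forward(x, w, a, [[0.0], [0.0]], alpha=2.0)
# After (hypothetical) training, B is non-zero and shifts the output.
y_tuned = lora_forward(x, w, a, [[0.5], [-0.5]], alpha=2.0)
print(y_init, y_tuned, trainable_params(4096, 4096, 8))
```

For a 4096 x 4096 matrix, rank 8 trains 65,536 parameters instead of roughly 16.8 million, which is what makes fine-tuning 14B and 32B models practical on modest hardware.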
Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.
[53] CURaTE: Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge
Seyun Bae,Seokhan Lee,Eunho Yang
Main category: cs.CL
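TL;DR: This paper proposes CURaTE, which trains a sentence embedding model to detect and refuse, in real time, inputs similar to stored forget requests, enabling continual real-time unlearning for large language models while fully preserving the model's existing knowledge.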
Details
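Motivation: Not all potentially problematic data can be filtered from LLM pre-training in advance, so specific knowledge must be unlearned after training; current methods do not support continual, immediate unlearning, which degrades utility and leaves sensitive information exposed for long periods. Method: Proposes CURaTE: first train a sentence embedding model on a purpose-built dataset to form sharp decision boundaries for forget requests; at inference time, compute the similarity between the input prompt and the stored forget requests, refuse if it exceeds a threshold, and otherwise answer normally; the language model parameters are never modified. Result: CURaTE forgets more effectively than existing methods; because model parameters are not updated, knowledge preservation is near perfect; it supports continual real-time unlearning over any number of updates and is reported to be the only method with that capability. Conclusion: CURaTE offers a lightweight, safe, and sustainable post-training unlearning approach that combines effective forgetting with essentially no loss of retained knowledge, opening a path for privacy-preserving and compliant LLM deployment. Abstract: The inability to filter out in advance all potentially problematic data from the pre-training of large language models has given rise to the need for methods for unlearning specific pieces of knowledge after training. Existing techniques overlook the need for continuous and immediate action, causing them to suffer from degraded utility as updates accumulate and protracted exposure of sensitive information. To address these issues, we propose Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge (CURaTE). Our method begins by training a sentence embedding model on a dataset designed to enable the formation of sharp decision boundaries for determining whether a given input prompt corresponds to any stored forget requests. The similarity of a given input to the forget requests is then used to determine whether to answer or return a refusal response. We show that even with such a simple approach, not only does CURaTE achieve more effective forgetting than existing methods, but by avoiding modification of the language model parameters, it also maintains near perfect knowledge preservation over any number of updates and is the only method capable of continual unlearning in real-time.
[54] CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction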
Sizhe Wang,Ziqi Xu,Claire Najjuuko,Charles Alba,Chenyang Lu
Main category: cs.CL
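TL;DR: This paper proposes CURA, a framework that fine-tunes clinical language models with a bi-level uncertainty objective to improve the calibration of risk-prediction uncertainty, aligning it with individual error likelihood and cohort-level ambiguity and thereby making clinical decision support more trustworthy.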
Details
Motivation: Clinical language models are widely used for risk prediction, but their uncertainty estimates are often poorly calibrated and clinically unreliable. Method: Proposes the Clinical Uncertainty Risk Alignment (CURA) framework: first fine-tune a domain-specific clinical LM to obtain patient embeddings, then perform uncertainty fine-tuning of a multi-head classifier with a bi-level uncertainty objective. An individual-level calibration term aligns predictive uncertainty with each patient's probability of error, and a cohort-aware regularizer pulls risk estimates toward the event rate of the local neighborhood in embedding space while up-weighting ambiguous cohorts near the decision boundary; the regularizer can be interpreted as a cross-entropy loss with neighborhood-based soft labels. Result: Validated on several MIMIC-IV clinical risk prediction tasks and different clinical LMs, CURA consistently improves calibration metrics (such as ECE) without substantially harming discrimination; analysis shows it reduces overconfident false reassurance and makes the uncertainty estimates more clinically trustworthy. Conclusion: CURA improves both the calibration and the clinical reliability of uncertainty in clinical LM risk prediction, providing more trustworthy uncertainty estimates for downstream clinical decision support. Abstract: Clinical language models (LMs) are increasingly applied to support clinical risk prediction from free-text notes, yet their uncertainty estimates often remain poorly calibrated and clinically unreliable. In this work, we propose Clinical Uncertainty Risk Alignment (CURA), a framework that aligns clinical LM-based risk estimates and uncertainty with both individual error likelihoods and cohort-level ambiguities. CURA first fine-tunes domain-specific clinical LMs to obtain task-adapted patient embeddings, and then performs uncertainty fine-tuning of a multi-head classifier using a bi-level uncertainty objective. Specifically, an individual-level calibration term aligns predictive uncertainty with each patient's likelihood of error, while a cohort-aware regularizer pulls risk estimates toward event rates in their local neighborhoods in the embedding space and places extra weight on ambiguous cohorts near the decision boundary. We further show that this cohort-aware term can be interpreted as a cross-entropy loss with neighborhood-informed soft labels, providing a label-smoothing view of our method. Extensive experiments on MIMIC-IV clinical risk prediction tasks across various clinical LMs show that CURA consistently improves calibration metrics without substantially compromising discrimination. Further analysis illustrates that CURA reduces overconfident false reassurance and yields more trustworthy uncertainty estimates for downstream clinical decision support.
[55] SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models
Binxian Su,Haoye Lou,Shucheng Zhu,Weikang Wang,Ying Liu,Dong Yu,Pengyuan Liu
Main category: cs.CL
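TL;DR: This paper proposes SPAGBias, the first framework to systematically evaluate the gender bias of large language models (LLMs) in urban space. Combining a taxonomy of 62 urban micro-spaces, a prompt library, and a three-layer diagnostic method, it finds structured space-gender associations that go beyond the public-private divide, shows that they are embedded and reinforced across pre-training, instruction tuning, and reward modeling, and shows that they cause substantive failures in downstream applications.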
Details
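Motivation: Gendered space theory holds that gender hierarchies are deeply embedded in spatial organization, and LLMs are increasingly used in urban planning, so it is important to assess systematically whether they reproduce or amplify gender bias in space. Method: Proposes SPAGBias, comprising a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis); runs multi-dimensional experiments on six representative models covering story generation, the effects of prompt design, temperature, and model scale, tracing across training stages, and downstream task validation. Result: LLMs show fine-grained, structured space-gender mappings that go beyond the public-private dichotomy; story generation reveals how emotion, wording, and social roles jointly shape "spatial gender narratives"; the bias runs through pre-training, instruction tuning, and reward modeling, with model association strength substantially exceeding real-world distributions; downstream experiments show concrete failures in both normative and descriptive application settings. Conclusion: LLMs not only reflect gender bias in language but also encode social gender cognition as spatial semantic structure; the study bridges sociological theory and computational analysis, extends bias research to the spatial dimension for the first time, and provides a theoretical basis and evaluation tool for responsible urban AI applications. Abstract: Large language models (LLMs) are being increasingly used in urban planning, but since gendered space theory highlights how gender hierarchies are embedded in spatial organization, there is concern that LLMs may reproduce or amplify such biases. We introduce SPAGBias - the first systematic framework to evaluate spatial gender bias in LLMs. It combines a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis). Testing six representative models, we identify structured gender-space associations that go beyond the public-private divide, forming nuanced micro-level mappings. Story generation reveals how emotion, wording, and social roles jointly shape "spatial gender narratives". We also examine how prompt design, temperature, and model scale influence bias expression. Tracing experiments indicate that these patterns are embedded and reinforced across the model pipeline (pre-training, instruction tuning, and reward modeling), with model associations found to substantially exceed real-world distributions. Downstream experiments further reveal that such biases produce concrete failures in both normative and descriptive application settings.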
This work connects sociological theory with computational analysis, extending bias research into the spatial domain and uncovering how LLMs encode social gender cognition through language.
[56] Which bird does not have wings: Negative-constrained KGQA with Schema-guided Semantic Matching and Self-directed Refinement
Midan Shim,Seokju Hwang,Kaehyun Um,Kyong-Ho Lee
Main category: cs.CL
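TL;DR: This paper introduces the NEST KGQA task and the NestKGQA dataset, which focus on questions with negative constraints; designs PyLF, a Python-formatted logical form suited to expressing negation; and proposes the CUCKOO framework, which uses constraint-aware logical form generation, schema-guided semantic matching, and self-directed refinement to improve semantic executability and robustness on multi-constraint questions, clearly outperforming baselines in few-shot settings.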
Details
Motivation: Existing KGQA benchmarks and methods are biased toward positive and calculation constraints and neglect the negative constraints that occur frequently in real questions, so models handle questions with negative constraints poorly. Method: Introduces the NEST KGQA task and the NestKGQA dataset; designs PyLF, a Python-formatted logical form that expresses negation clearly; builds the CUCKOO framework, which consists of constraint-aware logical form drafting, schema-guided semantic matching, and a self-directed refinement step triggered only when execution returns an empty result. Result: CUCKOO consistently outperforms baselines in few-shot settings on both conventional KGQA and NEST-KGQA benchmarks. Conclusion: Negative constraints are a long-neglected but important class of semantics in KGQA; the NEST task, the PyLF logical form, and the CUCKOO framework together improve faithfulness and executability on multi-constraint questions, especially those with negative constraints. Abstract: Large language models still struggle with faithfulness and hallucinations despite their remarkable reasoning abilities. In Knowledge Graph Question Answering (KGQA), semantic parsing-based approaches address the limitations by understanding constraints in a user's question and converting them into a logical form to execute on a knowledge graph. However, existing KGQA benchmarks and methods are biased toward positive and calculation constraints. Negative constraints are neglected, although they frequently appear in real-world questions. In this paper, we introduce a new task, NEgative-conSTrained (NEST) KGQA, where each question contains at least one negative constraint, and a corresponding dataset, NestKGQA. We also design PyLF, a Python-formatted logical form, since existing logical forms are hardly suitable to express negation clearly while maintaining readability. Furthermore, NEST questions naturally contain multiple constraints. To mitigate their semantic complexity, we present a novel framework named CUCKOO, specialized to multiple-constrained questions and ensuring semantic executability. CUCKOO first generates a constraint-aware logical form draft and performs schema-guided semantic matching. It then selectively applies self-directed refinement only when executing improper logical forms yields an empty result, reducing cost while improving robustness. Experimental results demonstrate that CUCKOO consistently outperforms baselines on both conventional KGQA and NEST-KGQA benchmarks under few-shot settings.
[57] CoPA: Benchmarking Personalized Question Answering with Data-Informed Cognitive Factors
Hang Su,Zequn Liu,Chen Hu,Xuesong Lu,Yingce Xia,Zhen Liu
Main category: cs.CL
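TL;DR: This paper proposes the CoPA benchmark, which mines Community-Individual Preference Divergence (CIPD) to identify six personalization dimensions for fine-grained evaluation of how well large language models personalize in question answering.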
Details
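Motivation: Existing evaluation of personalized QA mostly relies on lexical similarity or hand-crafted heuristics and lacks sufficient data-driven validation, making it hard to measure accurately how well a model captures user preferences. Method: Mines Community-Individual Preference Divergence (CIPD) from user interaction data, distills six key personalization factors, and builds the CoPA benchmark with 1,985 user profiles; personalization is evaluated by quantifying the alignment between model outputs and the cognitive preferences inferred from user interactions. Result: CoPA provides a more comprehensive and more discriminative standard for evaluating personalized QA than generic metrics, and the code is open source. Conclusion: CoPA is presented as the first data-driven evaluation framework for personalized QA built on cognitive preferences and factor-level analysis, moving personalization evaluation from surface matching toward deeper cognitive alignment. Abstract: While LLMs have demonstrated remarkable potential in Question Answering (QA), evaluating personalization remains a critical bottleneck. Existing paradigms predominantly rely on lexical-level similarity or manual heuristics, often lacking sufficient data-driven validation. We address this by mining Community-Individual Preference Divergence (CIPD), where individual choices override consensus, to distill six key personalization factors as evaluative dimensions. Accordingly, we introduce CoPA, a benchmark with 1,985 user profiles for fine-grained, factor-level assessment. By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The code is available at https://github.com/bjzgcai/CoPA.
[58] Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems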
Nishanth Madhusudhan,Vikas Yadav,Alexandre Lacoste
Main category: cs.CL
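TL;DR: This paper proposes MM-AQA, a benchmark for evaluating whether multimodal systems abstain effectively when evidence is insufficient; it finds that current vision-language models and multi-agent systems abstain poorly and need abstention-aware training rather than only better prompts or more agents.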
Details
Motivation: Existing evaluation paradigms for vision-language models and multi-agent systems assume every question is answerable and ignore the key reliability requirement of abstaining when evidence is insufficient; abstention has been studied for text-only settings, but multimodal settings still lack a systematic benchmark and in-depth analysis. Method: Builds the MM-AQA benchmark, which generates unanswerable samples by transforming answerable ones along two axes (visual modality dependency and evidence sufficiency); evaluates three frontier VLMs and two multi-agent architectures on 2,079 samples and analyzes how prompting strategies and architectural design relate to abstention behavior. Result: (1) Under standard prompting, VLMs rarely abstain, and a simple confidence baseline outperforms them; (2) MAS raises the abstention rate but at a cost in accuracy; (3) sequential MAS is no worse than iterative MAS, indicating that the problem is miscalibration rather than reasoning depth; (4) models can abstain when image or text evidence is missing, but still force an answer when evidence is degraded or contradictory. Conclusion: Effective multimodal abstention depends on dedicated abstention-aware training rather than on better prompt engineering or larger multi-agent setups. Abstract: Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.
[59] Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem
Zeguan Xiao,Siqing Li,Yong Wang,Xuetao Wei,Jian Yang,Yun Chen,Guanhua Chen
Main category: cs.CL
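TL;DR: This paper proposes a retention-prioritized gradient synthesis framework for machine unlearning in large language models that decouples task-specific gradient extraction from conflict-aware combination, improving retention while maintaining forgetting strength.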
Details
Motivation: Unlearning specific knowledge in LLMs harms general capability; the paper addresses this by recasting unlearning as an asymmetric two-task problem in which retention is primary and forgetting is auxiliary. Method: Proposes a retention-prioritized gradient synthesis framework, adapts PCGrad to it, and designs a new method, SAGO, that achieves better gradient alignment through constructive sign-constrained gradient synthesis. Result: On the WMDP Bio/Cyber and RWKU benchmarks, SAGO markedly improves the recovery of MMLU performance (for example, 96.0% on WMDP Bio) while keeping comparable forgetting strength. Conclusion: Re-shaping gradient geometry mitigates the unlearning-retention trade-off better than re-balancing losses. Abstract: Machine unlearning for large language models (LLMs) aims to remove targeted knowledge while preserving general capability. In this paper, we recast LLM unlearning as an asymmetric two-task problem: retention is the primary objective and forgetting is auxiliary. From this perspective, we propose a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from conflict-aware combination. Instantiating the framework, we adapt the established PCGrad to resolve gradient conflicts, and introduce SAGO, a novel retention-prioritized gradient synthesis method. Theoretically, both variants ensure non-negative cosine similarity with the retain gradient, while SAGO achieves strictly tighter alignment through constructive sign-constrained synthesis. Empirically, on WMDP Bio/Cyber and RWKU benchmarks, SAGO consistently pushes the Pareto frontier: e.g., on WMDP Bio (SimNPO+GD), recovery of target model MMLU performance progresses from 44.6% (naive) to 94.0% (+PCGrad) and further to 96.0% (+SAGO), while maintaining comparable forgetting strength. Our results show that re-shaping gradient geometry, rather than re-balancing losses, is the key to mitigating unlearning-retention trade-offs.
[60] Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations
Rami Luisto,Liisa Petäinen,Tommi Grönholm,Jan Böhm,Maarit Ahtiainen,Tomi Lilja,Ilkka Pölönen,Sami Äyrämö
Main category: cs.CL
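TL;DR: This paper examines medical-domain fine-tuning of a Finnish BERT model for NLP classification tasks with scarce labeled data, and attempts to predict the benefit of domain pre-training by analyzing geometric changes in the embedding space.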
Details
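Motivation: Healthcare AI often faces long delays in obtaining data, especially labeled data, so model performance needs to be improved with few labels. Method: Performs unsupervised domain fine-tuning of the Finnish BERT model on Finnish medical text and analyzes geometric changes in its embedding space to predict the benefit of domain pre-training. Result: Reports observations from fine-tuning Finnish BERT on medical text and a preliminary exploration of the link between embedding-geometry changes and the benefit of domain pre-training. Conclusion: Domain fine-tuning is effective, and geometric features of the embedding space may serve as an indicator for predicting the benefit of domain pre-training, though this needs further validation. Abstract: In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.
[61] Pangu-ACE: Adaptive Cascaded Experts for Educational Response Generation on EduBench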
Dinghao Li,Wenlong Zhou,Zhimin Chen,Yuehan Peng,Hong Ni,Chengfu Zou,Guoyu Shi,Yaochen Li
Main category: cs.CL
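TL;DR: This paper presents Pangu-ACE, an educational-assistant cascade that dynamically chooses between a 1B and a 7B model according to task complexity, improving quality while allocating compute on demand.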
Details
Motivation: Educational assistants should allocate computation dynamically according to task needs, avoiding excess compute on simple tasks and improving the balance between efficiency and quality. Method: Builds a sample-level 1B-to-7B cascade (Pangu-ACE): a 1B tutor-router produces a draft and routing signals that decide whether the sample is passed to a 7B specialist prompt for refinement; corrects over-optimistic scores in earlier offline evaluation, which were caused by superficial format checks, by rescoring the saved JSONL predictions on CPU. Result: On the 7,013-sample Chinese test set, cascade_final improves deterministic quality from 0.457 to 0.538 and format validity from 0.707 to 0.866 over the legacy rule_v2 system; 19.7% of requests are handled directly by the 1B model; routing differs sharply by task (for example, 78.0% of IP is completed at 1B, while QG and EC are almost always escalated). Conclusion: Pangu-ACE jointly improves quality and compute allocation through task-aware routing; its current advantage lies in routing selectivity rather than lower end-to-end latency; the paper stresses reproducibility and notes that comparison with a GPT-5.4 baseline awaits an infrastructure fix. Abstract: Educational assistants should spend more computation only when the task needs it. This paper rewrites our earlier draft around the system that was actually implemented and archived in the repository: a sample-level 1B to 7B cascade for the shared-8 EduBench benchmark. The final system, Pangu-ACE, uses a 1B tutor-router to produce a draft answer plus routing signals, then either accepts the draft or escalates the sample to a 7B specialist prompt. We also correct a major offline evaluation bug: earlier summaries over-credited some open-form outputs that only satisfied superficial format checks. After CPU-side rescoring from saved prediction JSONL, the full Chinese test archive (7013 samples) shows that cascade_final improves deterministic quality from 0.457 to 0.538 and format validity from 0.707 to 0.866 over the legacy rule_v2 system while accepting 19.7% of requests directly at 1B. Routing is strongly task dependent: IP is accepted by 1B 78.0% of the time, while QG and EC still escalate almost always. The current archived deployment does not yet show latency gains, so the defensible efficiency story is routing selectivity rather than wall-clock speedup.
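The accept-or-escalate control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the archived Pangu-ACE code: both models are replaced by stubs, and the scalar `confidence` signal, the per-task values in `STUB_CONFIDENCE`, and the 0.7 threshold are hypothetical (the abstract only says the 1B model emits "routing signals"). The task labels IP, QG, and EC are taken from the abstract.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    confidence: float  # hypothetical routing signal in [0, 1]

# Hypothetical per-task confidence, only to make the stub deterministic.
STUB_CONFIDENCE = {"IP": 0.9, "QG": 0.2, "EC": 0.3}

def draft_fn(sample):
    """Stub for the 1B tutor-router: returns a draft plus a routing signal."""
    return Draft(text=f"draft for {sample['id']}",
                 confidence=STUB_CONFIDENCE[sample["task"]])

def specialist_fn(sample, draft_text):
    """Stub for the 7B specialist prompt, which here refines the draft."""
    return f"refined({draft_text})"

def cascade(sample, threshold=0.7):
    """Accept the small-model draft or escalate the sample to the large model."""
    draft = draft_fn(sample)
    if draft.confidence >= threshold:
        return draft.text, "accepted_1b"
    return specialist_fn(sample, draft.text), "escalated_7b"

samples = [{"id": i, "task": t} for i, t in enumerate(["IP", "IP", "QG", "EC"])]
results = [cascade(s) for s in samples]
accept_rate = sum(route == "accepted_1b" for _, route in results) / len(results)
print(f"accepted at 1B: {accept_rate:.1%}")
```

Logging the route alongside each answer is what makes a task-level breakdown such as the reported 19.7% overall acceptance and 78.0% for IP straightforward to compute from saved predictions.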
We also package a reproducible artifact-first paper workflow and clarify the remaining external-baseline gap: GPT-5.4 re-judging is implemented locally, but the configured provider endpoint and key are invalid, so final sampled-baseline alignment with GPT-5.4 remains pending infrastructure repair.
[62] Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution
Yufeng Wu
Main category: cs.CL
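TL;DR: This paper treats Behavioral Profile (BP) annotation as a bundle of annotation skills rather than a single task, proposes a skill-file-based pipeline, and evaluates GPT-5.4 and several open-source models on 14 BP features. BP annotation turns out to be highly heterogeneous: only some skills can be executed reliably by LLMs, and humans and GPT agree strongly on skill difficulty but show no correlation at the instance or lexical-item level, a relationship described as "shared taxonomy, independent execution".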
Details
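Motivation: BP annotation is hard to automate because it requires handling several linguistic dimensions at once; existing work mostly treats annotation as a single task and overlooks the heterogeneity of the underlying skills, so the feasibility of LLM-assisted annotation needs to be re-evaluated from a skill-decomposition perspective. Method: Builds a skill-file-driven annotation pipeline in which each BP feature is implemented through externally defined schema files, decision rules, and examples; uses a 300-instance validation set with two rounds of human annotation to sort skills into three classes (directly operable, recoverable under focused re-annotation, and structurally underspecified); compares GPT-5.4 with three local open-source models under the same setup. Result: Of the 14 BP skills, 5 are directly operable, 4 are recoverable after focused re-annotation, and 5 remain structurally underspecified; GPT-5.4 is reliable on the retained skills (accuracy 0.678, kappa 0.665, weighted F1 0.695), but its ability is selective; human and GPT skill difficulty are strongly correlated (r = 0.881) but uncorrelated at the instance level (r = 0.016) and lexical-item level (r = -0.142); GPT is better viewed as an independent "third skill voice" than as a human substitute; open-source models fail mainly at schema-to-skill execution. Conclusion: Automatic annotation should be evaluated in terms of skill feasibility rather than task-level automation; in BP annotation LLMs do not fully replace humans but form a complementary division of skills with them; future work should focus on modeling structurally underspecified skills and improving interpretability. Abstract: Behavioral Profile (BP) annotation is difficult to automate because it requires simultaneous coding across multiple linguistic dimensions. We treat BP annotation as a bundle of annotation skills rather than a single task and evaluate LLM-assisted BP annotation from this perspective. Using 3,134 concordance lines of 30 Chinese metaphorical color-term derivatives and a 14-feature BP schema, we implement a skill-file-driven pipeline in which each feature is externally defined through schema files, decision rules, and examples. Two human annotators completed a two-round schema-only protocol on a 300-instance validation subset, enabling BP skills to be classified as directly operable, recoverable under focused re-annotation, or structurally underspecified. GPT-5.4 and three locally deployable open-source models were then evaluated under the same setup. Results show that BP annotation is highly heterogeneous at the skill level: 5 skills are directly operable, 4 are recoverable after focused re-annotation, and 5 remain structurally underspecified. GPT-5.4 executes the retained skills with substantial reliability (accuracy = 0.678, κ = 0.665, weighted F1 = 0.695), but this feasibility is selective rather than global. Human and GPT difficulty profiles are strongly aligned at the skill level (r = 0.881), but not at the instance level (r = 0.016) or lexical-item level (r = -0.142), a pattern we describe as shared taxonomy, independent execution.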
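As a reminder of how the reported agreement figures are defined, the sketch below computes raw accuracy and a chance-corrected kappa for two label sequences. It assumes Cohen's kappa, (p_o - p_e) / (1 - p_e); the abstract does not say which kappa variant was used, and the label sequences here are made-up toy data, not the paper's annotations.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of positions where the two label sequences agree."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two annotators' marginal label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

human = [1, 1, 0, 0, 1, 0]  # toy labels
model = [1, 0, 0, 0, 1, 1]  # toy labels
acc = accuracy(human, model)
kappa = cohen_kappa(human, model)
print(round(acc, 3), round(kappa, 3))
```

In this toy case agreement is 4 of 6 while chance agreement is 0.5, so kappa is about 0.33, noticeably lower than the raw accuracy. That gap is why both numbers are usually reported together.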
Pairwise agreement further suggests that GPT is better understood as an independent third skill voice than as a direct human substitute. Open-source failures are concentrated in schema-to-skill execution problems. These findings suggest that automatic annotation should be evaluated in terms of skill feasibility rather than task-level automation.
[63] ClimateCause: Complex and Implicit Causal Structures in Climate Reports
Liesbeth Allein,Nataly Pineda-Castañeda,Andrea Rocci,Marie-Francine Moens
Main category: cs.CL
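TL;DR: This paper introduces ClimateCause, an expert-annotated dataset of climate reports containing higher-order causal structures (including implicit and nested causality), intended to support modeling of and reasoning over complex causal networks; it also exposes the weakness of large language models in causal chain reasoning.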
Details
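Motivation: Existing causal discovery datasets mainly cover explicit, direct causal relations and cannot support understanding and modeling of the higher-order, implicit, and nested causal structures found in complex systems such as climate change. Method: Builds the expert-annotated ClimateCause dataset: causal expressions in science-for-policy climate reports are normalized and disentangled into individual causal relations, which are annotated with cause-effect correlation, relation type, and spatiotemporal context; also proposes a readability measure based on the semantic complexity of causal graphs and benchmarks large language models on correlation inference and causal chain reasoning. Result: ClimateCause is described as the first dataset to systematically provide high-quality annotations that include implicit and nested causality; experiments show that large language models are markedly weaker at causal chain reasoning than at correlation inference, underlining the difficulty of that task. Conclusion: ClimateCause fills the gap in datasets of higher-order causal structure and provides a new benchmark and tool for causal reasoning, model evaluation, and readability analysis in the climate domain. Abstract: Understanding climate change requires reasoning over complex causal networks. Yet, existing causal discovery datasets predominantly capture explicit, direct causal relations. We introduce ClimateCause, a manually expert-annotated dataset of higher-order causal structures from science-for-policy climate reports, including implicit and nested causality. Cause-effect expressions are normalized and disentangled into individual causal relations to facilitate graph construction, with unique annotations for cause-effect correlation, relation type, and spatiotemporal context. We further demonstrate ClimateCause's value for quantifying readability based on the semantic complexity of causal graphs underlying a statement. Finally, large language model benchmarking on correlation inference and causal chain reasoning highlights the latter as a key challenge.
[64] Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding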
Yifan Le
Main category: cs.CL
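TL;DR: This paper studies how the wording of schema keys acts as an implicit instruction that affects model performance in structured generation with large language models, and proposes reinterpreting structured generation as a multi-channel instruction problem.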
Details
Motivation: Existing methods treat schemas as purely structural constraints and overlook that their wording may affect model behavior, so the paper studies how instruction placement affects structured generation performance. Method: Varies the wording of schema keys (without changing the prompt or model parameters), systematically analyzes the effect on model performance, and models structured generation as a multi-channel instruction problem with explicit prompt instructions and implicit schema-key instructions. Result: Model families differ in their sensitivity to the instruction channels: Qwen models benefit from schema-level instructions, LLaMA models rely more on prompt-level guidance, and the channels interact non-additively. Conclusion: Schema design determines not only output structure but also carries instruction signals, offering a new perspective on structured generation with large language models. Abstract: Constrained decoding has been widely adopted for structured generation with large language models (LLMs), ensuring that outputs satisfy predefined formats such as JSON and XML. However, existing approaches largely treat schemas as purely structural constraints and overlook the possibility that their linguistic formulation may affect model behavior. In this work, we study how instruction placement influences model performance in structured generation and show that merely changing the wording of schema keys, without modifying the prompt or model parameters, can significantly alter model performance under constrained decoding. Based on this observation, we propose to reinterpret structured generation as a multi-channel instruction problem, where instructions can be conveyed explicitly through prompts and implicitly through schema keys during decoding. To the best of our knowledge, this is the first work to systematically study how schema key formulation acts as an implicit instruction channel and affects model performance under constrained decoding. Experiments on multiple mathematical reasoning benchmarks show that different model families exhibit distinct sensitivities to these instruction channels: Qwen models consistently benefit from schema-level instructions, while LLaMA models rely more heavily on prompt-level guidance. We further observe non-additive interaction effects between instruction channels, showing that combining multiple channels does not always lead to further improvement.
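To make "changing the wording of schema keys" concrete, the sketch below builds two JSON Schemas that impose the same structural constraint but differ only in their key names. The key names (`field_1`, `step_by_step_reasoning`, and so on) and the two-field layout are hypothetical illustrations; the abstract does not list the schemas or keys the paper actually tested.

```python
import json

def make_schema(reasoning_key, answer_key):
    """Build a two-field JSON Schema whose structure is independent of key names."""
    return {
        "type": "object",
        "properties": {
            reasoning_key: {"type": "string"},
            answer_key: {"type": "string"},
        },
        "required": [reasoning_key, answer_key],
        "additionalProperties": False,
    }

def shape(schema):
    """Key-agnostic view of a schema: the ordered list of property types."""
    return [spec["type"] for spec in schema["properties"].values()]

# Same structure, different wording (all key names are hypothetical).
neutral = make_schema("field_1", "field_2")
instructive = make_schema("step_by_step_reasoning", "final_numeric_answer")

print(json.dumps(instructive, indent=2))
print(shape(neutral) == shape(instructive))
```

A constrained decoder would accept exactly the same set of value strings under either schema, so any performance difference between the two must come from the key tokens the model is forced to emit, which is the implicit instruction channel the paper studies.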
These findings suggest that schema design not only determines output structure, but also carries instruction signals, offering a new perspective on structured generation in LLMs.
[65] Segment-Level Coherence for Robust Harmful Intent Probing in LLMs
Xuanli He,Bilgehan Sel,Faizan Ali,Jenny Bao,Hoagy Cunningham,Jerry Wei
Main category: cs.CL
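TL;DR: This paper proposes a new streaming probing objective that requires multiple evidence tokens to consistently support a prediction instead of relying on isolated high-scoring tokens, improving the robustness and accuracy of adversarial jailbreak detection for LLMs in the CBRN domain.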
Details
Motivation: Existing streaming probes in high-risk domains such as CBRN are prone to false alarms when sensitive terms appear in benign contexts; the core problem is over-reliance on a few high-scoring tokens. Method: Designs a new streaming probing objective that emphasizes consistent support from multiple evidence tokens; compares probes on attention, MLP, and residual-stream features; tests how well the probes generalize to novel adversarial attacks such as character-level ciphers. Result: At a 1% false-positive rate, the true-positive rate improves by 35.55% relative to strong baselines; AUROC improves substantially (from a baseline already at 97.40%); probing attention or MLP activations outperforms the residual stream; AUROC stays above 98.85% on adversarially fine-tuned models. Conclusion: Streaming probes based on aggregating multiple pieces of evidence are more robust and generalize across models and attack forms, offering a practical approach to LLM safety monitoring in high-risk domains. Abstract: Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from near-saturated baseline performance (AUROC = 97.40%). We also show that probing Attention or MLP activations consistently outperforms residual-stream features. Finally, even when adversarial fine-tuning enables novel character-level ciphers, harmful intent remains detectable: probes developed for the base LLMs can be applied "plug-and-play" to these obfuscated attacks, achieving an AUROC of over 98.85%.
[66] RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
Zihong Zhang,Zuchao Li,Lefei Zhang,Ping Wang,Hai Zhao
Main category: cs.CL
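TL;DR: RACER is a training-free fast speculative decoding method that combines retrieved exact patterns with logit-based future cues, substantially speeding up large language model inference with more than 2x acceleration.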
Details
Motivation: Autoregressive decoding in large language models has high inference latency, and existing training-free speculative decoding methods each have a weakness: retrieval-based drafts fail when no exact match exists, and logits-based drafts lack structural guidance. Method: Proposes RACER, which fuses retrieved exact patterns with logit-driven cues about future tokens, balancing reliability and flexibility to produce richer speculative drafts. Result: On Spec-Bench, HumanEval, and MGSM-ZH, RACER consistently achieves more than 2x speedup, outperforms existing training-free methods, and is plug-and-play and scalable. Conclusion: RACER is a lightweight, training-free, plug-and-play decoding scheme for LLMs that balances draft quality against inference speed. Abstract: Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than $2\times$ speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at https://github.com/hkr04/RACER.
[67] Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
Danae Sánchez Villegas,Samuel Lewis-Lim,Nikolaos Aletras,Desmond Elliott
Main category: cs.CL
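TL;DR: This paper analyzes the reasoning dynamics of 18 vision-language models (VLMs) and finds "answer inertia": early predictions tend to be reinforced rather than revised. Reasoning-trained models correct themselves more strongly, but their gains depend on modality conditions. Models are easily swayed by misleading textual cues, and this influence is hard to detect consistently in the chain of thought (CoT), which suggests that CoT reveals only part of the multimodal decision process.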
Details
Motivation: To investigate how vision-language models (VLMs) integrate visual and textual information during reasoning, and whether their reasoning traces (Chain-of-Thought in particular) faithfully reflect the multimodal decision process.
Method: A systematic analysis of 18 VLMs spanning instruction-tuned and reasoning-trained models: track confidence over the CoT, quantify the corrective effect of reasoning, and evaluate the contribution of intermediate steps; run controlled interventions with misleading textual cues, and analyze cue mentions in the CoT and consistency with the visual input.
Result: Answer inertia is widespread; reasoning-trained models correct themselves more strongly but depend on modality conditions; models are easily influenced by textual cues, and that influence can stay hidden in the CoT (especially in long, fluent CoTs); instruction-tuned models mention the cues less explicitly, but their shorter CoTs more readily expose inconsistencies with the visual input.
Conclusion: Chain-of-Thought only partially reveals how VLMs make multimodal decisions. Apparent visual grounding can mask actual reliance on text, which poses an important challenge for the transparency and safety of multimodal systems.
Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input.
Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
[68] Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Evaldas Vaiciukynas,Paulius Danenas,Linas Ablonskis,Algirdas Sukys,Edgaras Dambrauskas,Voldemaras Zitkus,Rita Butkiene,Rimantas Butleris
Main category: cs.CL
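TL;DR: This paper studies how well modern multilingual sentence embedding models support hate speech detection in Lithuanian, Russian, and English. It introduces a new Lithuanian dataset, LtHate, and compares six embedding models with different downstream models (HBOS anomaly detection and CatBoost binary classification), with and without PCA dimensionality reduction, in a unified framework. Supervised binary classification on multilingual embeddings performs best, reaching 92.19% accuracy on Russian.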
Details
Motivation: Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. The suitability of modern multilingual sentence embedding models for such languages needs to be evaluated.
Method: Build a new Lithuanian hate speech dataset, LtHate; on LtHate, RuToxic, and EnSuperset, use a unified Python pipeline to evaluate six multilingual sentence embedding models (potion, gemma, bge, snow, jina, e5); for each embedding, train a one-class HBOS anomaly detector and a two-class CatBoost classifier, each with and without PCA compression to 64-dimensional feature vectors.
Result: Two-class supervised models clearly outperform one-class anomaly detection. Best results: Lithuanian (jina) 80.96% accuracy, AUC 0.887; Russian (e5) 92.19% accuracy, AUC 0.978; English (e5 with PCA) 77.21% accuracy, AUC 0.859. PCA costs almost nothing in the supervised setting but has a slight negative effect in the unsupervised one.
Conclusion: Modern multilingual sentence embeddings combined with gradient-boosted decision trees (such as CatBoost) provide a robust, practical soft-computing solution for multilingual hate speech detection, particularly for low-resource languages.
Abstract: Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding, we train both a one-class HBOS anomaly detector and a two-class CatBoost classifier, with and without principal component analysis (PCA) compression to 64-dimensional feature vectors. Across all datasets, two-class supervised models consistently and substantially outperform one-class anomaly detection, with the best configurations achieving up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian (jina), 92.19% accuracy and AUC ROC of 0.978 in Russian (e5), and 77.21% accuracy and AUC ROC of 0.859 in English (e5 with PCA). PCA compression preserves almost all discriminative power in the supervised setting, while showing some negative impact for the unsupervised anomaly detection case.
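For readers unfamiliar with the one-class baseline named above, the following is a minimal sketch of a histogram-based outlier score (HBOS) in plain Python. It is not the authors' pipeline: the class name `SimpleHBOS`, the equal-width binning, the add-one smoothing, and the random 2-D vectors standing in for sentence embeddings are all assumptions made for illustration.

```python
import math
import random


class SimpleHBOS:
    """Histogram-based outlier score with equal-width bins (illustrative)."""

    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.hists = []   # per feature: (low, high, bin_width, densities)
        self.floor = 1.0  # density assigned to values outside the training range

    def fit(self, rows):
        n = len(rows)
        self.floor = 1.0 / (n + self.n_bins)
        self.hists = []
        for col in zip(*rows):
            lo, hi = min(col), max(col)
            width = (hi - lo) / self.n_bins or 1.0  # guard constant features
            counts = [0] * self.n_bins
            for v in col:
                counts[min(int((v - lo) / width), self.n_bins - 1)] += 1
            # Add-one smoothing so empty bins never yield log(0).
            dens = [(c + 1) / (n + self.n_bins) for c in counts]
            self.hists.append((lo, hi, width, dens))
        return self

    def score(self, row):
        """Sum of negative log densities; higher means more anomalous."""
        total = 0.0
        for v, (lo, hi, width, dens) in zip(row, self.hists):
            if lo <= v <= hi:
                p = dens[min(int((v - lo) / width), self.n_bins - 1)]
            else:
                p = self.floor
            total -= math.log(p)
        return total


random.seed(0)
# Hypothetical stand-in for embeddings of the single training class.
normal = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
detector = SimpleHBOS().fit(normal)
inlier_score = detector.score([0.0, 0.0])
outlier_score = detector.score([8.0, 8.0])
print(inlier_score < outlier_score)
```

HBOS treats features as independent, so its quality depends on how the embedding dimensions are decorrelated; that may be one reason dimensionality reduction interacts with it differently than with a supervised classifier, though the abstract does not say so.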
These results demonstrate how modern multilingual sentence embeddings combined with gradient-boosted decision trees provide robust soft-computing solutions for multilingual hate speech detection applications.
[69] IE as Cache: Information Extraction Enhanced Agentic Reasoning
Hang Lv,Sheng Liang,Hongchao Gu,Wei Guo,Defu Lian,Yong Liu,Hao Wang,Enhong Chen
Main category: cs.CL
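TL;DR: This paper proposes the IE-as-Cache framework, which treats information extraction (IE) as a reusable cognitive cache to strengthen agentic reasoning. Experiments show that it significantly improves multi-step reasoning accuracy.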
Details
Motivation: Information extraction is traditionally treated only as a terminal objective: extracted results are used in isolation and are not maintained or reused across multi-step reasoning. This work aims to overcome that limitation and make IE a dynamic cognitive resource that supports the reasoning process.
Method: Inspired by hierarchical computer memory, the IE-as-Cache framework combines query-driven information extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise.
Result: Experiments on several challenging benchmarks and different large language models show significant gains in reasoning accuracy.
Conclusion: Information extraction can be effectively recast as a reusable cognitive cache, offering a new paradigm and research direction for integrating IE into downstream reasoning tasks.
Abstract: Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose IE-as-Cache, a framework that repurposes IE as a cognitive cache to enhance agentic reasoning. Drawing inspiration from hierarchical computer memory, our approach combines query-driven extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise. Experiments on challenging benchmarks across diverse LLMs demonstrate significant improvements in reasoning accuracy, indicating that IE can be effectively repurposed as a reusable cognitive resource and offering a promising direction for future research on downstream uses of IE.
[70] XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics
Jingxuan Liu,Zhi Qu,Jin Tei,Hidetaka Kamigaito,Lemao Liu,Taro Watanabe
Main category: cs.CL
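TL;DR: This paper introduces the XQ-MEval dataset for systematically evaluating cross-lingual scoring bias in automatic metrics for multilingual machine translation, and proposes a normalization strategy based on the dataset to make multilingual evaluation fairer and more reliable.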
Details
Motivation: Existing automatic metrics exhibit cross-lingual scoring bias in multilingual settings (translations of equal quality receive different scores in different languages), but no benchmark with parallel quality annotations exists to study the problem systematically.
Method: Build the semi-automatic XQ-MEval dataset: automatically inject MQM-defined error types into high-quality translations, have native speakers filter them, and merge errors to generate pseudo translations of controllable quality, forming source / pseudo translation / reference triplets; evaluate nine mainstream metrics on this dataset and propose a strategy that normalizes score distributions across languages.
Result: Experiments show that averaging metric scores across languages is inconsistent with human judgment, providing the first empirical evidence of cross-lingual scoring bias; the proposed normalization strategy significantly improves the fairness and reliability of multilingual evaluation.
Conclusion: Cross-lingual scoring bias is a real and serious problem in multilingual translation evaluation. XQ-MEval provides the first reproducible benchmark for it, and the proposed normalization offers a path toward more robust multilingual evaluation.
Abstract: Automatic evaluation metrics are essential for building multilingual translation systems. The common practice of evaluating these systems is averaging metric scores across languages, yet this is suspicious since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, filter them by native speakers for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are then paired with corresponding sources and references to form triplets used in assessing the qualities of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal the inconsistency between averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.
[71] Text2Arch: A Dataset for Generating Scientific Architecture Diagrams from Natural Language Descriptions
Shivank Garg,Sankalp Mittal,Manish Gupta
Main category: cs.CL
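TL;DR: This paper presents a method that uses language models to automatically generate high-fidelity scientific architecture diagrams from text. It builds the first large-scale open dataset for the task, Text2Arch, containing architecture images, textual descriptions, and DOT code, and validates it by fine-tuning small language models and by in-context learning with GPT-4o.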
Details
Motivation: Describing complex system designs or scientific processes in text alone is inefficient and ambiguous, so a system that automatically converts text into architecture diagrams with high semantic fidelity is needed.
Method: Build the multimodal Text2Arch dataset of images, textual descriptions, and DOT code; fine-tune several small language models on it, and also use in-context learning with GPT-4o to generate DOT code, which is then rendered into architecture diagrams.
Result: The Text2Arch models significantly outperform baselines such as DiagramAgent and are on par with GPT-4o in-context learning.
Conclusion: Fine-tuning small language models on a dedicated dataset is an efficient way to generate architecture diagrams from text and a feasible path toward AI-driven visual modeling; all resources are released openly.
Abstract: Communicating complex system designs or scientific processes through text alone is inefficient and prone to ambiguity. A system that automatically generates scientific architecture diagrams from text with high semantic fidelity can be useful in multiple applications like enterprise architecture visualization, AI-driven software design, and educational content creation. Hence, in this paper, we focus on leveraging language models to perform semantic understanding of the input text description to generate intermediate code that can be processed to generate high-fidelity architecture diagrams. Unfortunately, no clean large-scale open-access dataset exists, implying lack of any effective open models for this task. Hence, we contribute a comprehensive dataset, Text2Arch, comprising scientific architecture images, their corresponding textual descriptions, and associated DOT code representations. Leveraging this resource, we fine-tune a suite of small language models, and also perform in-context learning using GPT-4o. Through extensive experimentation, we show that Text2Arch models significantly outperform existing baseline models like DiagramAgent and perform at par with in-context learning-based generations from GPT-4o. We make the code, data and models publicly available.
[72] Explain the Flag: Contextualizing Hate Speech Beyond Censorship
Jason Liartis,Eirini Kaldeli,Lambrini Gyftokosta,Eleftherios Chelioudakis,Orfeas Menis Mastromichalakis
Main category: cs.CL
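TL;DR: This paper proposes a hybrid approach that combines large language models (LLMs) with three newly built vocabularies to detect and explain hate speech in English, French, and Greek, balancing accuracy with explainability.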
Details
Motivation: Most existing hate speech detection systems focus on content removal and lack transparency and explainability, making it hard to balance content governance with freedom of expression.
Method: Build a hybrid system with two pipelines: one detects and disambiguates offensive terms using manually verified multilingual vocabularies; the other uses an LLM as a context-aware evaluator of group-targeting content. Their outputs are fused into grounded explanations.
Result: Human evaluation shows that the hybrid approach beats LLM-only baselines in both detection accuracy and explanation quality.
Conclusion: Combining rule-based vocabulary matching with LLM contextual understanding detects multilingual hate speech more reliably, transparently, and explainably, giving content moderation a path that balances accountability with freedom of expression.
Abstract: Hate, derogatory, and offensive speech remains a persistent challenge in online platforms and public discourse. While automated detection systems are widely used, most focus on censorship or removal, raising concerns for transparency and freedom of expression, and limiting opportunities to explain why content is harmful. To address these issues, explanatory approaches have emerged as a promising solution, aiming to make hate speech detection more transparent, accountable, and informative. In this paper, we present a hybrid approach that combines Large Language Models (LLMs) with three newly created and curated vocabularies to detect and explain hate speech in English, French, and Greek. Our system captures both inherently derogatory expressions tied to identity characteristics and direct group-targeted content through two complementary pipelines: one that detects and disambiguates problematic terms using the curated vocabularies, and one that leverages LLMs as context-aware evaluators of group-targeting content. The outputs are fused into grounded explanations that clarify why content is flagged. Human evaluation shows that our hybrid approach is accurate, with high-quality explanations, outperforming LLM-only baselines.
[73] IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation
Haozhi Fan,Jinhao Duan,Kaidi Xu
Main category: cs.CL
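TL;DR: This paper proposes Interrogative Uncertainty Quantification (IUQ), a new framework for quantifying the uncertainty of large language models (LLMs) in long-form generation. It uses inter-sample consistency and intra-sample faithfulness to assess claim-level uncertainty and model faithfulness.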
Details
Motivation: Existing methods work well on short or constrained outputs but struggle with the long-form, free-form generation that real applications require. LLMs also often produce content that is semantically coherent yet factually wrong, and the multifaceted semantics and complex linguistic structure make uncertainty hard to quantify.
Method: The IUQ framework adopts an interrogate-then-respond paradigm and combines inter-sample consistency with intra-sample faithfulness to quantify uncertainty in long-form generation.
Result: Experiments across multiple model families and sizes show strong performance for IUQ on two widely used long-form generation datasets.
Conclusion: IUQ provides a reliable, interpretable way to quantify uncertainty and faithfulness in long-form generation, with good generality and practical value.
Abstract: Despite the rapid advancement of Large Language Models (LLMs), uncertainty quantification in LLM generation is a persistent challenge. Although recent approaches have achieved strong performance by restricting LLMs to produce short or constrained answer sets, many real-world applications require long-form and free-form text generation. A key difficulty in this setting is that LLMs often produce responses that are semantically coherent yet factually inaccurate, while the underlying semantics are multifaceted and the linguistic structure is complex. To tackle this challenge, this paper introduces Interrogative Uncertainty Quantification (IUQ), a novel framework that leverages inter-sample consistency and intra-sample faithfulness to quantify the uncertainty in long-form LLM outputs. By utilizing an interrogate-then-respond paradigm, our method provides reliable measures of claim-level uncertainty and the model's faithfulness. Experimental results across diverse model families and model sizes demonstrate the superior performance of IUQ on two widely used long-form generation datasets. The code is available at https://github.com/louisfanhz/IUQ.
[74] Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling
Zhijun Guo,Alvina Lai,Emmanouil Korakas,Aristeidis Vagenas,Irshad Ahamed,Christo Albor,Hengrui Zhang,Justin Healy,Kezhi Li
Main category: cs.CL
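TL;DR: This study develops and evaluates a retrieval-grounded large language model (LLM) conversational agent (CA) that helps people with diabetes understand continuous glucose monitoring (CGM) data and prepare for consultations. The CA's responses were rated significantly higher in quality than clinicians' (especially for empathy and actionability) with comparable safety, but it is suitable only as an adjunct tool and cannot replace clinical decision-making.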
Details
Motivation: Interpreting CGM data is time-consuming and calls for clear, empathetic communication, and empirical evidence for retrieval-grounded LLM systems in CGM-informed counseling is limited.
Method: Build a retrieval-grounded LLM conversational agent that produces plain-language responses without individualized therapeutic advice; construct 12 cases from public CGM datasets, with 6 senior UK diabetes clinicians providing reference answers; in a blinded multi-rater design, 3 clinicians independently rate each CA and clinician response on 6 dimensions; analyze differences with linear mixed-effects models.
Result: The CA's mean quality score was significantly higher than the clinicians' (4.37 vs 3.58, P<.001), with the largest gains in empathy (+1.062) and actionability (+0.992); major safety concerns were rare and similar in both groups (0.7% each).
Conclusion: Retrieval-grounded LLM systems can be useful adjunct tools for CGM review, patient education, and preconsultation preparation, but the findings do not support autonomous therapeutic decision-making or unsupervised real-world use.
Abstract: Continuous glucose monitoring (CGM) is central to diabetes care, but explaining CGM patterns clearly and empathetically remains time-intensive. Evidence for retrieval-grounded large language model (LLM) systems in CGM-informed counseling remains limited. This study evaluated whether a retrieval-grounded LLM-based conversational agent (CA) could support patient understanding of CGM data and preparation for routine diabetes consultations. We developed a retrieval-grounded LLM-based CA for CGM interpretation and diabetes counseling support. The system generated plain-language responses while avoiding individualized therapeutic advice. Twelve CGM-informed cases were constructed from publicly available datasets. Between Oct 2025 and Feb 2026, 6 senior UK diabetes clinicians each reviewed 2 assigned cases and answered 24 questions. In a blinded multi-rater evaluation, each CA-generated and clinician-authored response was independently rated by 3 clinicians on 6 quality dimensions. Safety flags and perceived source labels were also recorded. Primary analyses used linear mixed-effects models. A total of 288 unique responses (144 CA and 144 clinician) generated 864 ratings. The CA received higher quality scores than clinician responses (mean 4.37 vs 3.58), with an estimated mean difference of 0.782 points (95% CI 0.692-0.872; P<.001). The largest differences were for empathy (1.062, 95% CI 0.948-1.177) and actionability (0.992, 95% CI 0.877-1.106). Safety flag distributions were similar, with major concerns rare in both groups (3/432, 0.7% each).
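The counts reported above fit together as simple arithmetic. The short check below is only a reader's sanity check, not part of the study; it assumes one CA response per clinician-answered question, which is consistent with the stated 144/144 split.

```python
clinicians = 6
questions_per_clinician = 24
raters_per_response = 3

clinician_responses = clinicians * questions_per_clinician  # 144
ca_responses = clinician_responses        # assumed: one CA answer per question
total_responses = clinician_responses + ca_responses        # 288
total_ratings = total_responses * raters_per_response       # 864
ratings_per_group = total_ratings // 2                      # 432
major_flag_rate = 3 / ratings_per_group                     # 3 major flags per group

print(total_responses, total_ratings, ratings_per_group,
      round(100 * major_flag_rate, 1))
```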
Retrieval-grounded LLM systems may have value as adjunct tools for CGM review, patient education, and preconsultation preparation. However, these findings do not support autonomous therapeutic decision-making or unsupervised real-world use.
[75] DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering
Neha Srikanth,Jordan Boyd-Graber,Rachel Rudinger
Main category: cs.CL
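TL;DR: DiscoTrace is a method for identifying the rhetorical strategies that answerers use when responding to information-seeking questions. The study finds that human communities construct answers in diverse ways, whereas large language models (LLMs) lack this rhetorical diversity and tend to cover a broader range of question interpretations.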
Details
Motivation: To understand the diverse rhetorical strategies humans use in question answering, in order to improve the pragmatic competence of large language models on QA tasks.
Method: DiscoTrace represents an answer as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on top of rhetorical structure theory (RST) parses. The method is applied to answers from nine human communities and compared with LLM-generated answers.
Result: Human communities show marked diversity in rhetorical strategy, while LLMs do not, even when prompted to imitate a specific community's style. LLMs also systematically address more interpretations of a question, whereas human answerers selectively leave some interpretations unaddressed.
Conclusion: DiscoTrace exposes pragmatic shortcomings of current LLMs and points toward LLM question-answering systems that flexibly adopt different rhetorical strategies depending on context.
Abstract: We introduce DiscoTrace, a method to identify the rhetorical strategies that answerers use when responding to information-seeking questions. DiscoTrace represents answers as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on top of rhetorical structure theory parses. Applying DiscoTrace to answers from nine different human communities reveals that communities have diverse preferences for answer construction. In contrast, LLMs do not exhibit rhetorical diversity in their answers, even when prompted to mimic specific human community answering guidelines. LLMs also systematically opt for breadth, addressing interpretations of questions that human answerers choose not to address. Our findings can guide the development of pragmatic LLM answerers that consider a range of strategies informed by context in QA.
[76] QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies
Alexey Khoroshilov,Alexey Chernysh,Orkhan Ekhtibarov,Nini Kamkia,Dmitry Zmitrovich
Main category: cs.CL
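TL;DR: This paper presents QuantCode-Bench, a benchmark for systematically evaluating how well large language models generate trading strategies for the Backtrader framework from English descriptions. It contains 400 tasks from multiple sources, and evaluation covers syntactic correctness, backtest execution, whether trades actually occur, and semantic alignment. The main bottlenecks of current models are financial-logic modeling, API usage, and consistency with task semantics, not syntax errors.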
Details
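Motivation: Large language models perform well on general-purpose programming tasks, but their ability to generate executable algorithmic trading strategies is unclear. Strategy generation requires financial domain logic and knowledge of a specialized API at the same time, and the generated code must be syntactically correct and also trigger actual trades on historical data, so a systematic benchmark is needed.
Method: Build QuantCode-Bench with 400 strategy-generation tasks of varying difficulty (sources include Reddit, TradingView, and others); design a multi-stage evaluation pipeline that checks, in order, syntactic correctness, backtest executability, whether real trades occur, and semantic alignment as assessed by an LLM judge; compare state-of-the-art models in single-turn generation and in a feedback-based multi-turn agentic mode.
Result: Failures are driven mainly by errors in modeling financial logic, misuse of the Backtrader API, and insufficient semantic understanding of the natural-language task description, not by syntax errors. Multi-turn interaction markedly raises the success rate, but semantic alignment remains the biggest challenge.
Conclusion: Generating algorithmic trading strategies is a distinct class of domain-specific code generation. Success depends on technical correctness and also on strict consistency among the natural-language description, the financial logic, and the strategy's actual behavior on data. QuantCode-Bench provides the first systematic evaluation standard for this direction.
Abstract: Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs in generating strategies for the Backtrader framework from textual descriptions in English. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Evaluation is conducted through a multi-stage pipeline that checks syntactic correctness, successful backtest execution, the presence of trades, and semantic alignment with the task description using an LLM judge. We compare state-of-the-art models in two settings: single-turn, where the strategy must be generated correctly on the first attempt, and agentic multi-turn, where the model receives iterative feedback and may repair its errors.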
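The multi-stage pipeline described above can be pictured as a chain of gates, each of which must pass before the next runs. The sketch below is a simplified illustration under stated assumptions, not the benchmark's harness: the entry point `run_backtest`, the `judge` callable, and the toy strategy strings are hypothetical, and no real Backtrader API is used. A real harness would also sandbox the generated code instead of calling `exec` directly.

```python
import ast


def evaluate_strategy(source, judge):
    """Return the name of the first failed stage, or 'passed'."""
    # Stage 1: syntactic correctness.
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return "syntax"
    # Stage 2: the code must load and run without raising.
    namespace = {}
    try:
        exec(compile(tree, "<strategy>", "exec"), namespace)
        trades = namespace["run_backtest"]()  # hypothetical entry point
    except Exception:
        return "execution"
    # Stage 3: the backtest must produce at least one trade.
    if not trades:
        return "no_trades"
    # Stage 4: semantic alignment, delegated to a judge callable
    # (an LLM judge in the paper; any callable here).
    if not judge(source, trades):
        return "semantic"
    return "passed"


good = "def run_backtest():\n    return [('BUY', 100.0), ('SELL', 105.0)]\n"
idle = "def run_backtest():\n    return []\n"
crash = "def run_backtest():\n    return 1 / 0\n"
broken = "def run_backtest(:\n    pass\n"

accept_all = lambda src, trades: True
reject_all = lambda src, trades: False
outcomes = [evaluate_strategy(s, accept_all) for s in (broken, crash, idle, good)]
print(outcomes, evaluate_strategy(good, reject_all))
```

Reporting the first failed stage is what makes it possible to say where models break down, as opposed to a single pass/fail number.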
We analyze the failure modes across different stages of the pipeline and show that the main limitations of current models are not related to syntax, but rather to the correct operationalization of trading logic, proper API usage, and adherence to task semantics. These findings suggest that trading strategy generation constitutes a distinct class of domain-specific code generation tasks in which success requires not only technical correctness, but also alignment between natural-language descriptions, financial logic, and the observable behavior of the strategy on data.
[77] Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models
Zihao Xu,John Harvill,Ziwei Fan,Yizhou Sun,Hao Ding,Hao Wang
Main category: cs.CL
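TL;DR: This paper proposes K-Token Merging, a lightweight framework that compresses long prompts in the latent embedding space. By merging every K consecutive token embeddings, it makes LLMs more efficient on long inputs, achieving up to 75% length reduction with minimal performance loss across several tasks.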
Details
Motivation: Existing prompt-compression methods operate mainly in token space and overlook redundancy and inefficiency in the latent embedding space. Meanwhile, full self-attention in LLMs scales quadratically with input length, which brings high compute and memory costs.
Method: The K-Token Merging framework merges every K consecutive token embeddings into a single embedding in latent space using a lightweight encoder; the compressed sequence is fed to a LoRA-adapted LLM, and decoding still uses the original vocabulary.
Result: On Textualized Tree (structural reasoning), Amazon Reviews (sentiment classification), and CommitPackFT (code editing), K-Token Merging lies on the Pareto frontier of performance versus compression, reaching up to 75% input length reduction with minimal performance degradation.
Conclusion: Merging tokens in latent space is an efficient and practical approach to long-prompt compression that combines substantial compression with preserved model performance, improving on conventional token-level compression.
Abstract: Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
[78] Fabricator or dynamic translator?
Lisa Vasileva,Karin Sim
Main category: cs.CL
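TL;DR: This paper examines overgeneration by large language models (LLMs) in machine translation, analyzes its types (such as self-explanations, risky confabulations, and appropriate explanations) and how to detect them, and reports strategies and results from a commercial setting.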
Details
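Motivation: LLMs perform well at machine translation, but their generative nature leads to various kinds of overgeneration. These differ from the neurobabble of traditional NMT and must be accurately identified and classified to make translations more reliable and understandable.
Method: Explore and compare several strategies for detecting and classifying overgeneration in LLM translation, with an empirical study in a commercial setting.
Result: The paper presents overgeneration detection strategies suited to commercial environments together with experimental results, offering a practical way to understand and control generative behavior in LLM translation.
Conclusion: Overgeneration in LLM translation is diverse and context-dependent, so detection mechanisms should be designed around the specific application goal. Appropriate overgeneration (such as explanatory output) can improve comprehension for the target audience and has potential value.
Abstract: LLMs are proving to be adept at machine translation although due to their generative nature they may at times overgenerate in various ways. These overgenerations are different from the neurobabble seen in NMT and range from LLM self-explanations, to risky confabulations, to appropriate explanations, where the LLM is able to act as a human translator would, enabling greater comprehension for the target audience. Detecting and determining the exact nature of the overgenerations is a challenging task. We detail different strategies we have explored for our work in a commercial setting, and present our results.
[79] MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events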
Raunak Agarwal,Markus Wenzel,Simon Baur,Jonas Zimmer,George Harvey,Jackie Ma
Main category: cs.CL
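TL;DR: This paper introduces MADE, a living multi-label text classification benchmark built from medical device adverse event reports. It targets the saturation and data contamination of existing benchmarks and systematically evaluates many models on uncertainty quantification (UQ).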
Details
Motivation: Existing multi-label text classification (MLTC) benchmarks are becoming saturated and are vulnerable to training data contamination, which makes it hard to separate genuine reasoning ability from memorization. High-stakes domains such as healthcare also need models that combine strong performance with reliable uncertainty quantification (UQ).
Method: Build MADE, a continuously updated benchmark with a long-tailed hierarchical label distribution and strict temporal splits; establish baselines on more than 20 encoder and decoder models (fine-tuned, few-shot, instruction-tuned, and reasoning variants); systematically evaluate entropy-/consistency-based and self-verbalized UQ methods.
Result: Smaller discriminatively fine-tuned decoders give the best balance of head-to-tail accuracy and UQ; generative fine-tuning gives the most reliable UQ; large reasoning models improve performance on rare labels but have weak UQ; self-verbalized confidence is unreliable.
Conclusion: MADE provides a contamination-resistant, reproducible evaluation framework for medical MLTC. It reveals complex trade-offs among model scale, training paradigm, and UQ ability, and shows that UQ methods should be chosen carefully instead of defaulting to self-verbalized confidence.
Abstract: Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty.
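To make the two non-verbalized UQ families mentioned above concrete, here is a minimal sketch of one plausible estimator from each. The abstract does not specify the benchmark's exact estimators, so the choices below (mean per-label Bernoulli entropy, and one minus mean pairwise Jaccard similarity over sampled label sets) and the label names are assumptions for illustration.

```python
import math
from itertools import combinations


def mean_binary_entropy(probs):
    """Average Bernoulli entropy (in bits) over per-label probabilities."""
    def entropy(p):
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return sum(entropy(p) for p in probs) / len(probs)


def consistency_uncertainty(label_sets):
    """1 minus the mean pairwise Jaccard similarity of sampled label sets.

    Expects at least two sampled predictions for the same input.
    """
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 1.0
    pairs = list(combinations(label_sets, 2))
    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Entropy-based: confident vs. undecided per-label probabilities.
confident = mean_binary_entropy([0.99, 0.01, 0.98])
undecided = mean_binary_entropy([0.5, 0.5])

# Consistency-based: identical samples vs. disagreeing samples
# (hypothetical label names).
stable = consistency_uncertainty([{"burn", "leak"}] * 3)
unstable = consistency_uncertainty([{"burn"}, {"leak"}])
print(confident < undecided, stable, unstable)
```

The entropy score needs access to label probabilities, while the consistency score only needs repeated sampled outputs, which is why the latter also applies to API-only generative models.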
Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.
[80] From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
Kiran Purohit,Ramasuri Narayanam,Soumyabrata Pal
Main category: cs.CL
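TL;DR: This paper proposes SpecGuard, a verification-aware speculative decoding framework based on model-internal signals, which uses step-level verification to improve both the accuracy and the efficiency of large language model reasoning.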
Details
Motivation: Existing speculative decoding methods operate at the token level, which lets errors propagate, while adding external reward models increases latency and compute overhead and limits generalizability.
Method: At each step, SpecGuard samples multiple draft candidates, selects the most consistent step, and validates it jointly with two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps; (ii) a log-probability-based score that reflects token-level confidence.
Result: On multiple reasoning benchmarks, SpecGuard improves accuracy by 3.6% and reduces latency by about 11% compared with standard speculative decoding and reward-guided speculative decoding.
Conclusion: By relying only on model-internal signals, SpecGuard achieves efficient and accurate step-level verification, offering a more general, lower-overhead way to improve speculative decoding.
Abstract: Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed using the target, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.
cs.CV [Back]
[81] QualiaNet: An Experience-Before-Inference Network
Paul Linton
Main category: cs.CV
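TL;DR: This paper proposes QualiaNet, a two-stage computational model of human 3D vision. It combines an Experience Module (which extracts stereo depth relative to fixation) with an Inference Module (which estimates distance from a natural scene statistic: near scenes have pronounced disparity gradients, while far scenes are flatter), and shows that distance can be recovered from disparity gradients alone.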
Details
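Motivation: Although the human experience of stereo vision does not directly provide absolute distance, it still affects our inferences about visual scale. The author seeks to explain the mechanism behind this and to model it using the natural scene statistic that near scenes have large disparity gradients while far scenes are flatter.
Method: Build the two-stage model QualiaNet: the first stage simulates human stereo vision to produce disparity maps relative to fixation (Experience Module); the second stage feeds the disparity maps into a CNN trained to estimate scene distance (Inference Module).
Result: QualiaNet recovers distance from disparity gradients alone, supporting both the proposed natural scene statistic and the two-stage architecture.
Conclusion: Human stereo experience, although it carries no distance information, can still support distance inference because the Inference Module exploits the natural statistical link between disparity gradients and scene distance. This offers a new computational framework for understanding human 3D perception.
Abstract: Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although our experience of stereo vision does not provide us with distance information, it does affect our inferences about visual scale. We propose the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.
[82] HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds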
Team HY-World,Chenjie Cao,Xuhui Zuo,Zhenwei Wang,Yisu Zhang,Junta Wu,Zhenyang Liu,Yuning Gong,Yang Liu,Bo Yuan,Chao Zhang,Coopers Li,Dongyuan Guo,Fan Yang,Haiyu Zhang,Hang Cao,Jianchen Zhu,Jiaxin Lin,Jie Xiao,Jihong Zhang,Junlin Yu,Lei Wang,Lifu Wang,Lilin Wang,Linus,Minghui Chen,Peng He,Penghao Zhao,Qi Chen,Rui Chen,Rui Shao,Sicong Liu,Wangchen Qin,Xiaochuan Niu,Xiang Yuan,Yi Sun,Yifei Tang,Yifu Sun,Yihang Lian,Yonghao Tan,Yuhong Liu,Yuyang Yin,Zhiyuan Min,Tengfei Wang,Chunchao Guo
Main category: cs.CV
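TL;DR: HY-World 2.0 is a world model framework that accepts multi-modal inputs (text, single-view or multi-view images, and video) and generates high-quality, navigable 3D Gaussian Splatting (3DGS) scenes. It includes several new modules (HY-Pano 2.0, WorldNav, WorldStereo 2.0, WorldMirror 2.0, and WorldLens), reaches state-of-the-art results among open-source methods, and is comparable to the closed-source model Marble.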
Details
Motivation: To advance open, general-purpose, multi-modal 3D world modeling and address the shortcomings of existing methods in input diversity, 3D generation quality, navigability, and system-level support for interaction.
Method: A four-stage generation pipeline: a) panorama generation with HY-Pano 2.0; b) trajectory planning with WorldNav; c) keyframe-based view expansion with consistent memory using WorldStereo 2.0; d) world composition with WorldMirror 2.0, whose architecture and training strategy are upgraded to support multi-view and video reconstruction. The work also builds WorldLens, a high-performance 3DGS rendering platform.
Result: State-of-the-art results among open-source methods on several benchmarks, with performance close to the closed-source model Marble; generation of high-fidelity, navigable 3DGS scenes from text or a single image, plus reconstruction from multi-view images or video; models, code, and technical details are fully open-sourced.
Conclusion: HY-World 2.0 is a significant step for multi-modal 3D world modeling. Through systematic module innovations and engineering optimization, it improves generation quality, generalization, and interactive usability, and provides a solid open-source foundation for future research and applications.
Abstract: We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Also, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble.
We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.
[83] Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
Ahmed Bourouis,Savas Ozkan,Andrea Maracani,Yi-Zhe Song,Mete Ozay
Main category: cs.CV
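TL;DR: This paper proposes a new method for generating geometrically consistent multi-view scenes from a single freehand sketch. By building a new dataset and introducing an attention mechanism that carries geometric priors (CA3) and a sparse correspondence supervision loss (CSL), it generates all views in a single denoising process, with no reference images or iterative optimization, and markedly improves realism and geometric consistency.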
Details
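Motivation: Existing methods cannot handle freehand sketches, which carry very little geometric information and contain spatial distortions. Prior multi-view generation relies on photographs or text, while sketch-to-3D methods need multi-view input or per-scene optimization, so there is no end-to-end way to synthesize geometrically consistent views from a single sketch. Method: Three contributions: (i) a dataset of about 9,000 paired sketch and multi-view samples, built by automated synthesis and filtering; (ii) Parallel Camera-Aware Attention Adapters (CA3), which inject geometric inductive biases into a video Transformer; (iii) a Sparse Correspondence Supervision Loss (CSL) based on Structure-from-Motion (SfM) reconstructions. The framework generates all views together in a single denoising process. Result: Compared with two-stage state-of-the-art baselines, FID improves by over 60%, Corr-Acc (a geometric consistency metric) improves by 23%, and inference is up to 3.7 times faster, without reference images, iterative refinement, or per-scene optimization. Conclusion: This work is the first to generate geometrically consistent multi-view content end to end from a single freehand sketch. It shows that strong geometric reasoning and cross-view consistency modeling are feasible with heavily distorted 2D input, and opens a new path for sketch-driven 3D content creation. Abstract: We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of $\sim$9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization.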
Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7$\times$ inference speedup.
[84] DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines
Gabriel Pimenta de Freitas Cardoso,Caio Lucas da Silva Chacon,Jonas Felipe da Fonseca Oliveira,Paulo Henrique de Medeiros Araujo
Main category: cs.CV
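TL;DR: This paper presents DharmaOCR Full and Lite, two small language models specialized for structured OCR, together with DharmaOCR-Benchmark, a benchmark covering multiple document types. It applies Direct Preference Optimization (DPO) to OCR for the first time, using degenerate generations as negative examples to suppress looping, and combines it with supervised fine-tuning (SFT) to enforce JSON-structured output, which markedly lowers the degeneration rate and improves extraction quality. The models set a new state of the art on the benchmark, and AWQ quantization further improves cost-effectiveness.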
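As a rough illustration of the DPO step summarized above, the sketch below computes the standard DPO loss for a single preference pair in which a looping (degenerate) transcription is the rejected example. The repeated n-gram check `looks_degenerate`, its thresholds, and the `beta` value are hypothetical stand-ins; the source does not say how degenerate generations are detected or which hyperparameters are used.

```python
import math
from collections import Counter


def looks_degenerate(text, n=3, max_repeats=3):
    """Hypothetical heuristic: flag text in which any n-gram repeats too often."""
    tokens = text.split()
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(count > max_repeats for count in grams.values())


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed as a numerically stable softplus(-x)
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Assumed usage: the clean transcription is "chosen", the looping one "rejected".
chosen = "Invoice total due on receipt"
rejected = "total due total due total due total due total due"
if looks_degenerate(rejected) and not looks_degenerate(chosen):
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0, beta=0.5))
```

The loss falls as the policy assigns relatively more probability to the clean transcription than the reference model does, which is the mechanism that penalizes looping outputs.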
Details
Motivation: To address the difficulty of balancing transcription quality, generation stability, and inference cost in structured OCR; to show that text degeneration not only hurts quality but also materially worsens deployment performance (response time, throughput, and compute cost); and to fill the gap left by the absence of DPO in OCR. Method: Introduces two SSLMs, DharmaOCR Full (7B) and Lite (3B); builds DharmaOCR-Benchmark, a benchmark spanning multiple document types with a unified evaluation protocol that treats degeneration rate as a core metric; applies DPO to OCR for the first time, using degenerate generations as rejected examples to penalize looping; combines it with SFT to enforce a JSON schema (header/margin/footer/text); and uses AWQ quantization to reduce inference cost. Result: DPO plus SFT reduces the degeneration rate by up to 87.6% in relative terms; DharmaOCR Full and Lite reach extraction-quality scores of 0.925 and 0.911 on DharmaOCR-Benchmark with degeneration rates of only 0.40% and 0.20%, outperforming all open-source and commercial baselines; AWQ quantization cuts per-page cost by up to 22% with negligible quality loss. Conclusion: DPO effectively mitigates text degeneration in OCR, and the DharmaOCR models strike a better balance among quality, stability, and cost, offering a new approach and practical tools for structured OCR. Abstract: This manuscript introduces DharmaOCR Full and Lite, a pair of specialized small language models (SSLMs) for structured OCR that jointly optimize transcription quality, generation stability, and inference cost. It also presents DharmaOCR-Benchmark, a benchmark that covers printed, handwritten, and legal/administrative documents, and proposes a unified evaluation protocol that measures fidelity and structure while explicitly tracking text degeneration as a first-class benchmark metric (alongside unit cost). Beyond reporting degeneration rates, the manuscript empirically shows degeneration is not merely a quality failure, since it materially worsens production performance by increasing response time, reducing throughput, and inflating computational cost due to abnormally long generations. To the best of the authors' knowledge, as a methodological contribution, this is the first application of Direct Preference Optimization (DPO) for OCR, explicitly using degenerate generations as rejected examples to penalize looping behavior. Combined with Supervised Fine-Tuning (SFT) for enforcing a strict JSON schema (header, margin, footer, and text), DPO consistently reduces degeneration rate across model families (up to 87.6% relative) while preserving or improving extraction quality.
The resulting models, DharmaOCR Full (7B) and DharmaOCR Lite (3B), set a new state of the art on DharmaOCR-Benchmark, outperforming every evaluated open-source and commercial baseline in extraction quality, with scores of 0.925 and 0.911 and degeneration rates of 0.40% and 0.20%, respectively. AWQ quantization reduces per-page cost by up to 22% with negligible quality loss, enabling a strong quality-cost trade-off in comparison to proprietary OCR APIs and open-source alternatives.
[85] Interpretable Human Activity Recognition for Subtle Robbery Detection in Surveillance Videos
Bryan Jhoan Cazáres Leyva,Ulises Gachuz Davila,José Juan González Fonseca,Juan Irving Vasquez,Vanessa A. Camacho-Vázquez,Sergio Isahí Garrido-Castañeda
Main category: cs.CV
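TL;DR: This paper proposes a lightweight method for detecting non-violent street robberies (snatch-and-run), based on pose estimation and interpretable machine learning and suitable for real-time deployment on edge devices.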
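The abstract for this entry mentions a temporal hysteresis filter that stabilizes frame-level predictions. A minimal sketch of one such filter follows; the two thresholds and the `min_frames` persistence window are hypothetical values, since the source does not give the paper's settings.

```python
def hysteresis_filter(scores, on_threshold=0.7, off_threshold=0.4, min_frames=3):
    """Turn noisy per-frame classifier scores into a stable alarm signal.

    The alarm switches on only after `min_frames` consecutive scores at or
    above `on_threshold`, and switches off only after `min_frames`
    consecutive scores at or below `off_threshold`.
    """
    alarm = False
    streak = 0  # consecutive frames that argue for flipping the state
    stable = []
    for score in scores:
        if alarm:
            flips = score <= off_threshold
        else:
            flips = score >= on_threshold
        streak = streak + 1 if flips else 0
        if streak >= min_frames:
            alarm = not alarm
            streak = 0
        stable.append(alarm)
    return stable


frame_scores = [0.1, 0.9, 0.2, 0.8, 0.9, 0.95, 0.5, 0.3, 0.6, 0.2, 0.1, 0.3]
print(hysteresis_filter(frame_scores))
```

Using separate on and off thresholds means a single spurious high score (the second frame here) does not raise an alarm, and a brief dip does not clear one.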
Details
Motivation: Non-violent street robberies (snatch-and-run) are brief and subtle, and in unconstrained surveillance video they are hard to tell apart from normal interactions between people. Existing methods lack real-time performance and interpretability and are difficult to deploy on edge devices. Method: A YOLO-based pose estimator extracts pedestrian keypoints; descriptors are built from kinematic features (hand speed, arm extension) and interaction features (aggressor-victim distance, relative motion); a Random Forest classifier makes frame-level decisions; and a temporal hysteresis filter stabilizes the predictions. Result: The method generalizes well on both a self-built staged dataset and a cross-domain test set of internet videos. The full system was deployed on an NVIDIA Jetson Nano and runs in real time on the device. Conclusion: The hybrid pose-driven approach balances accuracy, interpretability, and real-time edge performance, and offers a workable route to proactive detection of low-intensity anomalous events in urban security. Abstract: Non-violent street robberies (snatch-and-run) are difficult to detect automatically because they are brief, subtle, and often indistinguishable from benign human interactions in unconstrained surveillance footage. This paper presents a hybrid, pose-driven approach for detecting snatch-and-run events that combines real-time perception with an interpretable classification stage suitable for edge deployment. The system uses a YOLO-based pose estimator to extract body keypoints for each tracked person and computes kinematic and interaction features describing hand speed, arm extension, proximity, and relative motion between an aggressor-victim pair. A Random Forest classifier is trained on these descriptors, and a temporal hysteresis filter is applied to stabilize frame-level predictions and reduce spurious alarms. We evaluate the method on a staged dataset and on a disjoint test set collected from internet videos, demonstrating promising generalization across different scenes and camera viewpoints. Finally, we implement the complete pipeline on an NVIDIA Jetson Nano and report real-time performance, supporting the feasibility of proactive, on-device robbery detection.
[86] SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning
Xue Wu,Shengting Cao,Jiaqi Gong
Main category: cs.CV
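TL;DR: This paper proposes SatBLIP, a framework that combines satellite imagery with language models to improve the accuracy and interpretability of Social Vulnerability Index (SVI) prediction for rural areas.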
Details
Motivation: Existing rural environmental risk assessments are limited by coarse vulnerability indices and traditional remote sensing pipelines (handcrafted features, manual virtual audits, generic image-pretrained models), which struggle to capture place-based risk context. Method: Builds SatBLIP, a vision-language model for satellite imagery. GPT-4o generates structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, and so on); a BLIP model adapted to satellite semantics is fine-tuned to generate image captions; the captions are encoded with CLIP and fused with LLM embeddings through attention to predict county-level SVI; and SHAP is used to analyze the key drivers. Result: SatBLIP significantly improves county-level SVI prediction and identifies stable, interpretable risk attributes such as roof form/condition, street width, vegetation cover, and vehicles/open space. Conclusion: SatBLIP offers a finer-grained, interpretable, and data-efficient remote sensing approach to rural environmental risk modeling. Abstract: Rural environmental risks are shaped by place-based conditions (e.g., housing quality, road access, land-surface patterns), yet standard vulnerability indices are coarse and provide limited insight into risk contexts. We propose SatBLIP, a satellite-specific vision-language framework for rural context understanding and feature identification that predicts county-level Social Vulnerability Index (SVI). SatBLIP addresses limitations of prior remote sensing pipelines (handcrafted features, manual virtual audits, and natural-image-trained VLMs) by coupling contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics. We use GPT-4o to generate structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, and road context), then fine-tune a satellite-adapted BLIP model to generate captions for unseen images. Captions are encoded with CLIP and fused with LLM-derived embeddings via attention for SVI estimation under spatial aggregation. Using SHAP, we identify salient attributes (e.g., roof form/condition, street width, vegetation, cars/open space) that consistently drive robust predictions, enabling interpretable mapping of rural risk environments.
[87] FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images
Sabab Ishraq,Aarushi Aarushi,Juncai Jiang,Chen Chen
Main category: cs.CV
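TL;DR: This paper introduces FoodSense, a dataset for cross-sensory inference containing 66,842 participant-image pairs with ratings and descriptions across four sensory dimensions (taste, smell, texture, and sound), and builds FoodSense-VL, a model that predicts multisensory ratings from food images and generates image-grounded explanations.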
Details
Motivation: People can infer several sensory experiences (taste, smell, and so on) from food images, but existing vision-language research focuses on recognition tasks and does not model image-driven multisensory perception. This paper aims to fill that gap and to connect cognitive science with multimodal AI. Method: Builds FoodSense, a large human-annotated dataset with ratings and text descriptions for four sensory dimensions; uses a large language model to generate image-grounded reasoning traces as explanations; and trains the multimodal model FoodSense-VL on this data to predict ratings and generate explanations jointly. Result: Releases the FoodSense dataset and the FoodSense-VL model; experiments show that mainstream evaluation metrics are insufficient for measuring sensory inference; and the work confirms that cross-sensory modeling is feasible and that the generated explanations are effective. Conclusion: Images can support rich modeling of sensory expectations, which calls for dedicated datasets, evaluation methods, and model designs; this work offers a new approach for multisensory AI and embodied cognition modeling. Abstract: Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision language benchmark model to produce both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visual sensory inference.
[88] Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers
Felipe Parodi,Jordan Matelsky,Melanie Segado
Main category: cs.CV
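TL;DR: Using several replacement controls (mean substitution, noise substitution, and cross-image register shuffling), this paper finds that zero-ablation overstates the functional importance of registers in vision Transformers. Task performance depends on plausible register-like activation patterns rather than on exact register content.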
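To make the four interventions concrete, the sketch below applies zero-ablation and the three replacement controls to a toy set of register tokens stored as nested lists (`registers[image][register][dimension]`). This layout, the Gaussian form of the noise control, and the rotation used for shuffling are all assumptions made for illustration; the source does not specify the paper's exact procedures.

```python
import random


def zero_ablate(registers):
    """Replace every register token with a zero vector."""
    return [[[0.0] * len(tok) for tok in image] for image in registers]


def _register_stats(registers):
    """Per-register, per-dimension mean and standard deviation over images."""
    n = len(registers)
    num_reg, dim = len(registers[0]), len(registers[0][0])
    means = [[sum(img[r][d] for img in registers) / n for d in range(dim)]
             for r in range(num_reg)]
    stds = [[(sum((img[r][d] - means[r][d]) ** 2 for img in registers) / n) ** 0.5
             for d in range(dim)] for r in range(num_reg)]
    return means, stds


def mean_substitute(registers):
    """Give every image the dataset-mean value of each register."""
    means, _ = _register_stats(registers)
    return [[list(m) for m in means] for _ in registers]


def noise_substitute(registers, rng):
    """Assumed form: Gaussian noise matched to each register's mean and std."""
    means, stds = _register_stats(registers)
    return [[[rng.gauss(m, s) for m, s in zip(mrow, srow)]
             for mrow, srow in zip(means, stds)] for _ in registers]


def shuffle_across_images(registers, rng):
    """Give each image the registers of a different image (needs >= 2 images)."""
    n = len(registers)
    offset = rng.randrange(1, n)  # a non-zero rotation, so no image keeps its own
    return [[list(tok) for tok in registers[(i + offset) % n]] for i in range(n)]


toy = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]  # 3 images, 1 register, dim 2
print(zero_ablate(toy)[0], mean_substitute(toy)[0])
```

Only `zero_ablate` moves the tokens far outside their usual range; the other three keep them register-like while removing image-specific content, which is the contrast the entry draws.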
Details
Motivation: Zero-ablation is widely used to probe token function in vision Transformers, but whether it faithfully reflects a module's function is in doubt. In particular, zeroing the registers in DINOv2/v3 causes a sharp performance drop, and it is necessary to check whether this comes from the abnormal perturbation introduced by the zero vector itself. Method: On DINOv2+registers and DINOv3, compares zero-ablation with three replacement controls (mean substitution, noise substitution, cross-image register shuffling) on classification, segmentation, and correspondence tasks, and uses per-patch cosine similarity to quantify how much the representations are perturbed. Result: Zero-ablation causes drops of up to -36.6 pp in classification and -30.9 pp in segmentation, whereas all three replacement controls keep performance stable (deviation of about 1 pp or less). The cosine similarity analysis shows that the controls do perturb the representations, while zero-ablation perturbs them disproportionately. Conclusion: Zero-ablation overestimates dependence on exact register content. Task performance depends on plausible register-like activations rather than exact image-specific values. Registers mainly buffer dense features from dependence on the [CLS] token and encode compressed patch geometry. Abstract: Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to $-36.6$\,pp classification, $-30.9$\,pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within ${\sim}1$\,pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from \texttt{[CLS]} dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
[89] Crowdsourcing of Real-world Image Annotation via Visual Properties
Xiaolei Diao,Fausto Giunchiglia
Main category: cs.CV
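TL;DR: This paper proposes an image annotation method that combines knowledge representation, natural language processing, and computer vision. By introducing visual property constraints and an interactive crowdsourcing framework based on a category hierarchy, it reduces annotation subjectivity and mitigates the semantic gap problem.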
Details
Motivation: To address the complex many-to-many mappings between visual data and linguistic descriptions that the semantic gap causes in object recognition datasets, and to reduce the negative effect of annotation subjectivity on computer vision performance. Method: Proposes an image annotation method that integrates knowledge representation, NLP, and CV, and designs a dynamic, interactive crowdsourcing framework, driven by a predefined object category hierarchy and annotator feedback, that uses visual properties as constraints to guide annotation. Result: Experiments confirm the effectiveness of the method, and the crowdsourcing setup is refined by analyzing annotator feedback. Conclusion: The method effectively mitigates the bias introduced by the semantic gap, improves annotation quality and consistency, and offers a new approach to building more reliable datasets. Abstract: Recent advances in data-centric artificial intelligence highlight inherent limitations in object recognition datasets. One of the primary issues stems from the semantic gap problem, which results in complex many-to-many mappings between visual data and linguistic descriptions. This bias adversely affects performance in computer vision tasks. This paper proposes an image annotation methodology that integrates knowledge representation, natural language processing, and computer vision techniques, aiming to reduce annotator subjectivity by applying visual property constraints. We introduce an interactive crowdsourcing framework that dynamically asks questions based on a predefined object category hierarchy and annotator feedback, guiding image annotation by visual properties. Experiments demonstrate the effectiveness of this methodology, and annotator feedback is discussed to optimize the crowdsourcing setup.
[90] Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images
Jue Jiang,Aneesh Rangnekar,Harini Veeraraghavan
Main category: cs.CV
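TL;DR: This paper proposes DAGMaN, a framework that improves self-supervised pretraining on medical images through attention-guided masking and co-distillation with a noisy teacher, reducing information leakage while preserving attention head diversity.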
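The attention-guided masking idea summarized above can be sketched in a few lines: instead of masking patches at random, mask the ones that receive the most attention. The selection rule here (mask the top-scoring fraction) and the `mask_ratio` value are assumptions for illustration; the source does not describe how DAGMaN derives or thresholds its attention scores.

```python
def attention_guided_mask(attn_scores, mask_ratio=0.5):
    """Return a per-patch mask that hides the most-attended patches.

    attn_scores: one attention score per patch (higher = more attended).
    mask_ratio: fraction of patches to mask (hypothetical default).
    """
    n = len(attn_scores)
    k = int(n * mask_ratio)  # number of patches to mask, rounded down
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    masked = set(ranked[:k])
    return [i in masked for i in range(n)]


patch_attention = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
print(attention_guided_mask(patch_attention))
```

Masking the highly attended patches removes the regions a model would otherwise copy from neighboring context, which makes the reconstruction task harder than random masking does.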
Details
Motivation: Random masking in medical images tends to leak information between contextually similar patches, which weakens self-supervised learning; the Swin Transformer lacks a global [CLS] token, which makes advanced masking strategies hard to apply. Method: Proposes an attention-guided masking mechanism and embeds it in a co-distillation framework; introduces, for the first time, a noisy teacher model that performs attentive masking while maintaining high attention head diversity. Result: DAGMaN performs strongly on several downstream tasks, including lung nodule classification (full-data and few-shot), immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering. Conclusion: DAGMaN effectively mitigates information leakage in self-supervised learning on medical images while balancing masking difficulty and attention diversity, and significantly improves generalization to downstream tasks. Abstract: Masked image modeling (MIM) is a highly effective self-supervised learning (SSL) approach to extract useful feature representations from unannotated data. Predominantly used random masking methods make SSL less effective for medical images due to the contextual similarity of neighboring patches, leading to information leakage and SSL simplification. The hierarchical shifted window (Swin) transformer, a highly effective approach for medical images, cannot use advanced masking methods as it lacks a global [CLS] token. Hence, we introduced an attention guided masking mechanism for Swin within a co-distillation learning framework to selectively mask semantically co-occurring and discriminative patches, to reduce information leakage and increase the difficulty of SSL pretraining. However, attention guided masking inevitably reduces the diversity of attention heads, which negatively impacts downstream task performance. To address this, we, for the first time, integrate a noisy teacher into the co-distillation framework (termed DAGMaN) that performs attentive masking while preserving high attention head diversity. We demonstrate the capability of DAGMaN on multiple tasks including full- and few-shot lung nodule classification, immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering.
[91] H2VLR: Heterogeneous Hypergraph Vision-Language Reasoning for Few-Shot Anomaly Detection
Jianghong Huang,Luping Ji,Weiwei Duan,Mao Ye
Main category: cs.CV
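TL;DR: This paper proposes H2VLR, a vision-language reasoning framework based on heterogeneous hypergraphs for few-shot anomaly detection (FSAD). By jointly modeling visual regions and semantic concepts, it overcomes the weakness of existing VLM methods, which rely only on pairwise feature matching and ignore structural dependencies and global consistency.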
Details
Motivation: Most existing few-shot anomaly detection methods based on vision-language models (VLMs) perform only pairwise feature matching and ignore the structural dependencies and global consistency of visual-semantic relations, which limits performance. Method: Proposes the Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework, which treats FSAD as a high-order inference problem over visual-semantic relations and jointly models visual regions and semantic concepts in a unified heterogeneous hypergraph. Result: In experiments on representative industrial and medical benchmarks, H2VLR often reaches state-of-the-art (SOTA) performance. Conclusion: H2VLR effectively improves few-shot anomaly detection and shows the value of high-order, structured vision-language reasoning for FSAD. Abstract: As a classic vision task, anomaly detection has been widely applied in industrial inspection and medical imaging. In this task, data scarcity is a frequent issue. To address it, the few-shot anomaly detection (FSAD) scheme is attracting increasing attention. In recent years, beyond the traditional visual paradigm, Vision-Language Models (VLMs) have been extensively explored to boost this field. However, almost all existing VLM-based FSAD schemes perform anomaly inference only by pairwise feature matching, ignoring structural dependencies and global consistency. To further advance VLM-based FSAD, we propose a Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework. It reformulates FSAD as a high-order inference problem of visual-semantic relations, by jointly modeling visual regions and semantic concepts in a unified hypergraph. Experimental comparisons verify the effectiveness and advantages of H2VLR. It often achieves state-of-the-art (SOTA) performance on representative industrial and medical benchmarks. Our code will be released upon acceptance.
[92] Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Ziyang Luo,Nian Liu,Junwei Han
Main category: cs.CV
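TL;DR: This paper proposes Chain of Modality (CoM), a framework that dynamically selects the input topology and uses a dual-path cognitive execution mechanism to address the perceptual fragility that static fusion structures cause in current Omni-MLLMs, significantly improving multimodal reasoning performance.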
Details
Motivation: Existing Omni-MLLMs aim for unified multisensory modeling, yet in practice they are often outperformed by unimodal baselines. The root cause is their static fusion structure (sequential or interleaved inputs), which introduces positional bias and alignment traps and distorts the attention mechanism. Method: Proposes Chain of Modality (CoM): 1) dynamic switching among parallel, sequential, and interleaved input topologies to remove structural bias; 2) dual-path cognitive execution, with "Direct-Decide" for direct perception and "Reason-Decide" for analytical auditing. CoM works either training-free or with data-efficient supervised fine-tuning (SFT). Result: CoM achieves robust and consistent generalization across a range of benchmarks, significantly outperforming existing Omni-MLLMs with static fusion, both training-free and with a small amount of fine-tuning. Conclusion: Static fusion is the structural cause of the current performance bottleneck in Omni-MLLMs; dynamic, task-adaptive fusion such as CoM is a key route to more robust and better-generalizing multimodal large models. Abstract: Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined ``Direct-Decide'' path for direct perception and a structured ``Reason-Decide'' path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.
[93] FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking
Jinlin You,Muyu Li,Xudong Zhao
Main category: cs.CV
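TL;DR: This paper proposes FreqTrack, a frequency-aware RGB-Event (RGBE) tracking framework that uses frequency-domain transforms to establish complementary correlations between modalities. It introduces a Spectral Enhancement Transformer (SET) layer and a Wavelet Edge Refinement (WER) module, and achieves leading performance on the COESOT and FE108 datasets.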
Details
Motivation: Existing single-modal RGB trackers are limited in complex dynamic scenes, and current RGB-event fusion methods mostly operate in the spatial domain and fail to fully exploit the temporal response and high-frequency characteristics of event data. Method: Proposes the FreqTrack framework: 1) frequency-domain transforms to model complementary inter-modal correlations; 2) a Spectral Enhancement Transformer (SET) layer with multi-head dynamic Fourier filtering that adaptively enhances and selects frequency-domain features; 3) a Wavelet Edge Refinement (WER) module that uses learnable wavelet transforms to explicitly extract multi-scale edge structures from event data. Result: Experiments on the COESOT and FE108 datasets show that FreqTrack performs strongly, reaching a leading precision of 76.6% on the COESOT benchmark. Conclusion: Frequency-domain modeling effectively improves RGB-event fusion tracking, and FreqTrack demonstrates robustness and effectiveness in challenging high-speed and low-light scenarios. Abstract: Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive performance, particularly attaining leading precision of 76.6\% on the COESOT benchmark, validating the effectiveness of frequency-domain modeling for RGBE tracking.
[94] Design and Validation of a Low-Cost Smartphone Based Fluorescence Detection Platform Compared with Conventional Microplate Readers
Zhendong Cao,Katrina G. Salvante,Ash Parameswaran,Pablo A. Nepomnaschy,Hongji Dai
Main category: cs.CV
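TL;DR: This paper presents a low-cost fluorescence optical detection system that uses a smartphone camera in place of an expensive conventional microplate reader (such as the Perkin Elmer Victor). It detects microorganisms and molecules in diluted samples by analyzing the relationship between the sample's image color in RGB color space and the molar concentration of the fluorescent substance.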
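One simple way to relate image color to concentration, as described above, is a calibration curve fitted on wells of known concentration. The sketch below assumes a linear relation between the mean green-channel intensity of a well and concentration; the choice of channel, the linear form, and the calibration numbers are all illustrative assumptions, since the source does not state the fitted relationship.

```python
def mean_channel(pixels, channel):
    """Average one RGB channel (0=R, 1=G, 2=B) over a well's pixels."""
    return sum(p[channel] for p in pixels) / len(pixels)


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_concentration(intensity, slope, intercept):
    """Invert the calibration line to read a concentration from an intensity."""
    return (intensity - intercept) / slope


# Synthetic calibration points (not measured data): concentration vs. intensity.
known_concentrations = [0.0, 1.0, 2.0, 3.0]
green_intensities = [10.0, 30.0, 50.0, 70.0]
slope, intercept = fit_line(known_concentrations, green_intensities)

well_pixels = [(5, 48, 3), (6, 52, 4)]  # hypothetical RGB pixels from one well
print(estimate_concentration(mean_channel(well_pixels, 1), slope, intercept))
```

In practice the response may saturate or be nonlinear at high concentrations, in which case a different curve family would be needed.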
Details
Motivation: To lower the cost of fluorescence detection equipment, improve portability and accessibility, and avoid expensive conventional optical components such as exciter filters, barrier filters, and photomultiplier tubes. Method: Designs a device compatible with standard 96-well plates, uses a smartphone camera as the optical detector, and establishes a quantitative relationship between image color in RGB color space and the concentration of the fluorescent substance. Result: Builds a low-cost system that performs fluorescence detection with only a smartphone and no expensive optical components, and shows that it is suitable for detecting specific microorganisms and molecules in diluted samples. Conclusion: The system is a workable option for fast, portable, low-cost fluorescence detection in resource-limited settings, with good practical potential. Abstract: A low-cost fluorescence-based optical system is developed for detecting the presence of certain microorganisms and molecules within a diluted sample. A specifically designed device setup compatible with conventional 96-well plates is chosen to create an ideal environment in which a smartphone camera can be used as the optical detector. In comparison with conventional microplate reading machines such as the Perkin Elmer Victor, the device presented in this paper is not equipped with expensive elements such as an exciter filter, barrier filter, and photomultiplier; instead, a phone camera is all that is needed to detect fluorescence within the sample. The strategy is to determine the relationship between the image color of the sample in RGB color space and the molar concentration of the fluorescent specimen in that sample. This manuscript is a preprint version of work related to a publication in IEEE. The final version may differ from this manuscript.
[95] WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms
Yucheng Pan,Heping Li,Zhangle Liu,Sajid Hussain,Bin Pan
Main category: cs.CV
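TL;DR: This paper proposes WILD-SAM, a framework that uses a spectrum-aware mixture-of-experts adapter (PA-MoE) and a wavelet-guided subband enhancement (WGSE) strategy to improve the accuracy and boundary fidelity of the Segment Anything Model (SAM) when detecting slow-moving landslides in wrapped-phase InSAR interferograms.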
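The wavelet-guided step above rests on splitting an image into low- and high-frequency subbands with a discrete wavelet transform. As a minimal sketch, the function below performs one level of a 2D Haar transform on a nested-list image; the Haar wavelet and the subband names are assumptions for illustration, since the source does not say which wavelet WGSE uses.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform of an even-sized image (list of rows).

    Returns four half-resolution subbands:
      approx   - local average (low frequency)
      col_diff - difference between adjacent columns (responds to vertical edges)
      row_diff - difference between adjacent rows (responds to horizontal edges)
      diag     - diagonal difference
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    bands = {"approx": [], "col_diff": [], "row_diff": [], "diag": []}
    for i in range(0, rows, 2):
        out = {name: [] for name in bands}
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            out["approx"].append((a + b + c + d) / 2.0)
            out["col_diff"].append((a - b + c - d) / 2.0)
            out["row_diff"].append((a + b - c - d) / 2.0)
            out["diag"].append((a - b - c + d) / 2.0)
        for name in bands:
            bands[name].append(out[name])
    return bands


fringe = [[0.0, 1.0], [0.0, 1.0]]  # toy patch with a vertical intensity step
print(haar_dwt2(fringe))
```

The three detail subbands isolate sharp transitions such as interferometric fringes, which is the kind of high-frequency cue the entry says is injected as dense prompts.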
Details
Motivation: Detecting slow-moving landslides directly from wrapped InSAR interferograms is essential for geohazard monitoring but is hampered by severe phase ambiguity and complex coherence noise. SAM performs well on natural images, but a spectral domain shift makes it hard to transfer directly to wrapped-phase data. Method: Proposes WILD-SAM: 1) a Phase-Aware Mixture-of-Experts (PA-MoE) Adapter embedded in the frozen encoder, which uses dynamic routing to aggregate multi-scale spectral-textural priors and align the distributions; 2) a Wavelet-Guided Subband Enhancement (WGSE) strategy, which uses discrete wavelet transforms to separate high-frequency subbands and generate frequency-aware dense prompts that preserve the topological integrity of landslide boundaries. Result: Reaches SOTA performance on the ISSLIDE and ISSLIDE+ benchmarks, significantly outperforming existing methods, with large gains in both target completeness and contour fidelity. Conclusion: WILD-SAM effectively resolves the spectral domain shift that arises when transferring SAM to wrapped-phase InSAR data, and offers a new approach to accurate and robust automatic landslide detection. Abstract: Detecting slow-moving landslides directly from wrapped Interferometric Synthetic Aperture Radar (InSAR) interferograms is crucial for efficient geohazard monitoring, yet it remains fundamentally challenged by severe phase ambiguity and complex coherence noise. While the Segment Anything Model (SAM) offers a powerful foundation for segmentation, its direct transfer to wrapped phase data is hindered by a profound spectral domain shift, which suppresses the high-frequency fringes essential for boundary delineation. To bridge this gap, we propose WILD-SAM, a novel parameter-efficient fine-tuning framework specifically designed to adapt SAM for high-precision landslide detection on wrapped interferograms. Specifically, the architecture integrates a Phase-Aware Mixture-of-Experts (PA-MoE) Adapter into the frozen encoder to align spectral distributions and introduces a Wavelet-Guided Subband Enhancement (WGSE) strategy to generate frequency-aware dense prompts. The PA-MoE Adapter exploits a dynamic routing mechanism across heterogeneous convolutional experts to adaptively aggregate multi-scale spectral-textural priors, effectively aligning the distribution discrepancy between natural images and interferometric phase data. Meanwhile, the WGSE strategy leverages discrete wavelet transforms to explicitly disentangle high-frequency subbands and refine directional phase textures, injecting these structural cues as dense prompts to ensure topological integrity along sharp landslide boundaries.
Extensive experiments on the ISSLIDE and ISSLIDE+ benchmarks demonstrate that WILD-SAM achieves state-of-the-art performance, significantly outperforming existing methods in both target completeness and contour fidelity.
[96] Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars
Yicheng Gong,Jiawei Zhang,Liqiang Liu,Yanwen Wang,Lei Chu,Jiahao Li,Hao Pan,Hao Zhu,Yan Lu
Main category: cs.CV
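TL;DR: This paper proposes an explicit emotion control framework for feed-forward, single-image 3D head avatar reconstruction. A dual-path modulation mechanism manipulates emotion independently and consistently, enabling emotion disentanglement and smooth interpolation across identities.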
Details
Motivation: In existing methods, emotion is usually implicitly entangled with geometry or appearance, and emotion is not modeled explicitly as an independent, controllable signal. Method: Proposes a dual-path modulation mechanism: geometry modulation performs emotion-conditioned normalization in the parametric space, separating emotion from speech-driven articulation; appearance modulation captures identity-aware, emotion-related visual cues. A time-aligned, emotion-consistent multi-identity dataset is built to support training. Result: Validated on several SOTA backbones, the framework preserves reconstruction and reenactment fidelity while enabling controllable emotion transfer, disentangled manipulation, and smooth emotion interpolation. Conclusion: The framework improves the expressiveness and scalability of 3D head avatars and offers a new approach to explicit, disentangled emotion control. Abstract: We present a framework for explicit emotion control in feed-forward, single-image 3D head avatar reconstruction. Unlike existing pipelines where emotion is implicitly entangled with geometry or appearance, we treat emotion as a first-class control signal that can be manipulated independently and consistently across identities. Our method injects emotion into existing feed-forward architectures via a dual-path modulation mechanism without modifying their core design. Geometry modulation performs emotion-conditioned normalization in the original parametric space, disentangling emotional state from speech-driven articulation, while appearance modulation captures identity-aware, emotion-dependent visual cues beyond geometry. To enable learning under this setting, we construct a time-synchronized, emotion-consistent multi-identity dataset by transferring aligned emotional dynamics across identities. Integrated into multiple state-of-the-art backbones, our framework preserves reconstruction and reenactment fidelity while enabling controllable emotion transfer, disentangled manipulation, and smooth emotion interpolation, advancing expressive and scalable 3D head avatars.
[97] Controllable Video Object Insertion via Multiview Priors
Xia Qi,Peishan Cong,Yichen Yao,Ziyi Wang,Yaoqin Ye,Yuexin Ma
Main category: cs.CV
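TL;DR: This paper proposes a new video object insertion method that uses multi-view object priors, a dual-path view-consistent conditioning mechanism, and a quality-aware weighting mechanism to address appearance inconsistency, occlusion handling, and temporal coherence.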
Details
Motivation: When inserting a new object into an existing video, current video generation methods struggle to keep the object's appearance consistent, spatially aligned, and temporally coherent. Method: Introduces multi-view object priors by lifting 2D reference images into multi-view representations; designs a dual-path view-consistent conditioning mechanism to provide stable identity guidance; uses a quality-aware weighting mechanism to handle noisy inputs; and proposes an Integration-Aware Consistency Module to resolve occlusion and boundary artifacts while maintaining spatio-temporal continuity. Result: Experiments show that the method significantly improves the quality of video object insertion and achieves stable, realistic object integration. Conclusion: The framework addresses the key challenges of video object insertion in dynamic environments, with clear gains in appearance consistency, occlusion handling, and spatio-temporal coherence. Abstract: Video object insertion is a critical task for dynamically inserting new objects into existing environments. Previous video generation methods focus primarily on synthesizing entire scenes while struggling with ensuring consistent object appearance, spatial alignment, and temporal coherence when inserting objects into existing videos. In this paper, we propose a novel solution for Video Object Insertion, which integrates multi-view object priors to address the common challenges of appearance inconsistency and occlusion handling in dynamic environments. By lifting 2D reference images into multi-view representations and leveraging a dual-path view-consistent conditioning mechanism, our framework ensures stable identity guidance and robust integration across diverse viewpoints. A quality-aware weighting mechanism is also employed to adaptively handle noisy or imperfect inputs. Additionally, we introduce an Integration-Aware Consistency Module that guarantees spatial realism, effectively resolving occlusion and boundary artifacts while maintaining temporal continuity across frames. Experimental results show that our solution significantly improves the quality of video object insertion, providing stable and realistic integration.
[98] The Fourth Challenge on Image Super-Resolution ($\times$4) at NTIRE 2026: Benchmark Results and Method Overview
Zheng Chen,Kai Liu,Jingkai Wang,Xianglong Yan,Jianze Li,Ziqing Zhang,Jue Gong,Jiatong Li,Lei Sun,Xiaoyang Liu,Radu Timofte,Yulun Zhang,Jihye Park,Yoonjin Im,Hyungju Chun,Hyunhee Park,MinKyu Park,Zheng Xie,Xiangyu Kong,Weijun Yuan,Zhan Li,Qiurong Song,Luen Zhu,Fengkai Zhang,Xinzhe Zhu,Junyang Chen,Congyu Wang,Yixin Yang,Zhaorun Zhou,Jiangxin Dong,Jinshan Pan,Shengwei Wang,Jiajie Ou,Baiang Li,Sizhuo Ma,Qiang Gao,Jusheng Zhang,Jian Wang,Keze Wang,Yijiao Liu,Yingsi Chen,Hui Li,Yu Wang,Congchao Zhu,Saeed Ahmad,Ik Hyun Lee,Jun Young Park,Ji Hwan Yoon,Kainan Yan,Zian Wang,Weibo Wang,Shihao Zou,Chao Dong,Wei Zhou,Linfeng Li,Jaeseong Lee,Jaeho Chae,Jinwoo Kim,Seonjoo Kim,Yucong Hong,Zhenming Yan,Junye Chen,Ruize Han,Song Wang,Yuxuan Jiang,Chengxi Zeng,Tianhao Peng,Fan Zhang,David Bull,Tongyao Mu,Qiong Cao,Yifan Wang,Youwei Pan,Leilei Cao,Xiaoping Peng,Wei Deng,Yifei Chen,Wenbo Xiong,Xian Hu,Yuxin Zhang,Xiaoyun Cheng,Yang Ji,Zonghao Chen,Zhihao Xue,Junqin Hu,Nihal Kumar,Snehal Singh Tomar,Klaus Mueller,Surya Vashisth,Prateek Shaily,Jayant Kumar,Hardik Sharma,Ashish Negi,Sachin Chaudhary,Akshay Dudhane,Praful Hambarde,Amit Shukla,Shijun Shi,Jiangning Zhang,Yong Liu,Kai Hu,Jing Xu,Xianfang Zeng,Amitesh M,Hariharan S,Chia-Ming Lee,Yu-Fan Lin,Chih-Chung Hsu,Nishalini K,Sreenath K A,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Shuling Zheng,Zhiheng Fu,Feng Zhang,Zhanglu Chen,Boyang Yao,Nikhil Pathak,Aagam Jain,Milan Kumar,Kishor Upla,Vivek Chavda,Sarang N S,Raghavendra Ramachandra,Zhipeng Zhang,Qi Wang,Shiyu Wang,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi,Yuqi Li,Chuanguang Yang,Weilun Feng,Zhuzhi Hong,Hao Wu,Junming Liu,Yingli Tian,Amish Bhushan Kulkarni,Tejas R R Shet,Saakshi M Vernekar,Nikhil Akalwadi,Kaushik Mallibhat,Ramesh Ashok Tabib,Uma Mudenagudi,Yuwen Pan,Tianrun Chen,Deyi Ji,Qi Zhu,Lanyun Zhu,Heyan Zhangyi
Main category: cs.CV
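TL;DR: This paper presents the NTIRE 2026 image super-resolution ($\times$4) challenge, which has a restoration track and a perceptual track and aims to advance super-resolution techniques and provide a unified benchmark.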
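According to the abstract for this entry, the restoration track ranks submissions by PSNR. For reference, here is a minimal sketch of PSNR over flattened pixel lists. It assumes 8-bit images (`max_value=255`) and compares all pixels directly; the challenge's exact protocol (color space, border cropping) is not given in the source, so those details are left out.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


ground_truth = [52, 55, 61, 59]
super_resolved = [50, 56, 60, 62]
print(psnr(ground_truth, super_resolved))
```

Higher PSNR means lower mean squared error against the ground truth, which is why it measures pixel-wise fidelity rather than the visual realism targeted by the perceptual track.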
Details
Motivation: To reflect the evolving objectives of image super-resolution and to advance techniques that balance pixel fidelity with visual realism. Method: Organizes the NTIRE 2026 super-resolution challenge with a PSNR-based restoration track and a perceptual track based on a perceptual score; the $\times$4 low-resolution inputs are generated by bicubic downsampling. Result: 194 participants registered and 31 teams submitted valid entries; the report summarizes the datasets, evaluation protocol, main results, and participating methods. Conclusion: The challenge provides a unified benchmark for image super-resolution and highlights current progress and future directions. Abstract: This paper presents the NTIRE 2026 image super-resolution ($\times$4) challenge, one of the associated competitions of the NTIRE 2026 Workshop at CVPR 2026. The challenge aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective super-resolution solutions and analyze recent advances in the field. To reflect the evolving objectives of image super-resolution, the challenge includes two tracks: (1) a restoration track, which emphasizes pixel-wise fidelity and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and evaluates results using a perceptual score. A total of 194 participants registered for the challenge, with 31 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and methods of participating teams. The challenge provides a unified benchmark and offers insights into current progress and future directions in image super-resolution.
[99] DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration
Zheng Chen,Bowen Chai,Rongjun Gao,Mingtao Nie,Xi Li,Bingnan Duan,Jianping Fang,Xiaohong Liu,Linghe Kong,Yulun Zhang
Main category: cs.CV
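TL;DR: This paper proposes DVFace, a one-step diffusion framework for real-world video face restoration that uses a spatio-temporal dual-codebook design and an asymmetric spatio-temporal fusion module to achieve high-quality, temporally stable, identity-preserving restoration.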
Details
Motivation: Existing diffusion-based methods rely on generic diffusion priors and multi-step sampling, which limits facial adaptation and inference efficiency; this motivates a one-step diffusion approach that improves the fidelity and temporal stability of face restoration.
Method: Propose the DVFace framework, which combines a spatio-temporal dual-codebook design that extracts spatial and temporal facial priors with an asymmetric spatio-temporal fusion module that injects these priors into the diffusion backbone.
Result: On multiple benchmarks, DVFace outperforms recent methods in restoration quality, temporal consistency, and identity preservation.
Conclusion: By combining one-step diffusion with dedicated spatio-temporal modeling, DVFace improves both the performance and the efficiency of video face restoration and offers a new direction for real-world applications.
Abstract: Video face restoration aims to enhance degraded face videos into high-quality results with realistic facial details, stable identity, and temporal coherence. Recent diffusion-based methods have brought strong generative priors to restoration and enabled more realistic detail synthesis. However, existing approaches for face videos still rely heavily on generic diffusion priors and multi-step sampling, which limit both facial adaptation and inference efficiency. These limitations motivate the use of one-step diffusion for video face restoration, yet achieving faithful facial recovery alongside temporally stable outputs remains challenging. In this paper, we propose DVFace, a one-step diffusion framework for real-world video face restoration. Specifically, we introduce a spatio-temporal dual-codebook design to extract complementary spatial and temporal facial priors from degraded videos. We further propose an asymmetric spatio-temporal fusion module to inject these priors into the diffusion backbone according to their distinct roles. Evaluation on various benchmarks shows that DVFace delivers superior restoration quality, temporal consistency, and identity preservation compared to recent methods. Code: https://github.com/zhengchen1999/DVFace.
[100] Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Mingqian Ji,Shanshan Zhang,Jian Yang
Main category: cs.CV
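TL;DR: This paper proposes SEPatch3D, a framework that dynamically adjusts patch sizes, selects informative patches, and applies cross-granularity feature enhancement, substantially speeding up inference for ViT-based sparse multi-view 3D detectors while preserving 3D detection accuracy.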
Details
Motivation: Existing token-compression methods (such as pruning, merging, and patch size enlargement) discard background cues, disrupt contextual consistency, and lose fine-grained semantics, which harms 3D detection performance.
Method: Propose SEPatch3D: 1) Spatiotemporal-aware Patch Size Selection (SPSS), which dynamically assigns small or large patches depending on whether a scene is dominated by nearby objects or by background; 2) Informative Patch Selection (IPS), which picks the patches that need refinement; 3) Cross-Granularity Feature Enhancement (CGFE), which injects fine-grained details into coarse patches.
Result: On the nuScenes and Argoverse 2 validation sets, inference is up to 57% faster than StreamPETR and 20% more efficient than the state-of-the-art ToC3D-faster, with comparable detection accuracy.
Conclusion: Dynamic multi-granularity patch processing can substantially improve the inference efficiency of ViT-based 3D detectors without sacrificing accuracy, offering a new direction for real-time multi-view 3D perception.
Abstract: Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our re-examination of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at https://github.com/Mingqj/SEPatch3D.
[101] Learning Adaptive Reasoning Paths for Efficient Visual Reasoning
Yixu Huang,Tinghui Zhu,Muhao Chen
Main category: cs.CV
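TL;DR: This paper proposes AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions and dynamically selects the response format, mitigating overthinking in visual reasoning models and sharply reducing token usage while maintaining accuracy.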
Details
Motivation: Visual reasoning models (VRMs) suffer from overthinking, generating long reasoning chains even for simple tasks; the authors attribute this to reasoning path redundancy in visual reasoning.
Method: Propose the AVR framework, which decomposes visual reasoning into three cognitive functions (visual perception, logical reasoning, and answer application) and supports three response formats (Full Format, Perception-Only Format, and Direct Answer); train it with FS-GRPO, an adaptation of GRPO that encourages the model to select the most efficient format that is still correct.
Result: Across multiple vision-language benchmarks, AVR reduces token usage by 50-90% while maintaining overall accuracy, with the strongest results on perception-intensive tasks.
Conclusion: Adaptive visual reasoning effectively mitigates overthinking in VRMs and improves reasoning efficiency without sacrificing performance.
Abstract: Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for any task. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50-90% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.
[102] Deepfake Detection Generalization with Diffusion Noise
Hongyuan Qi,Wenjin Hou,Hehe Fan,Jun Xiao
Main category: cs.CV
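TL;DR: This paper proposes Attention-guided Noise Learning (ANL), a framework built on diffusion noise characteristics that improves the generalization of deepfake detectors to new synthetic images, especially diffusion-generated deepfakes. The method uses the denoising process of a pre-trained diffusion model to expose subtle forgery artifacts and an attention mechanism to focus on global inconsistencies, significantly improving cross-model generalization without adding inference overhead.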
Details
Motivation: Existing deepfake detectors generalize poorly to the high-fidelity forgeries produced by emerging diffusion models and struggle to detect forgery types beyond GAN-based ones.
Method: Propose the Attention-guided Noise Learning (ANL) framework: a frozen pre-trained diffusion model guides the detector to learn the noise contained in an input image at a given diffusion step; an attention mechanism derived from the predicted noise makes the model focus on globally distributed forgery discrepancies rather than local textures; the diffusion model's prior over natural images serves as regularization.
Result: ANL significantly outperforms existing methods on multiple benchmarks and reaches state-of-the-art accuracy on diffusion-generated deepfakes; ACC/AP improves by a substantial margin on unseen forgery models; no extra computation is added at inference.
Conclusion: Diffusion noise is a strongly generalizable signal, and the ANL framework exploits it to improve the robustness and generalization of deepfake detectors to unknown forgery techniques.
Abstract: Deepfake detectors face growing challenges in generalization as new image synthesis techniques emerge. In particular, deepfakes generated by diffusion models are highly photorealistic and often evade detectors trained on GAN-based forgeries. This paper addresses the generalization problem in deepfake detection by leveraging diffusion noise characteristics. We propose an Attention-guided Noise Learning (ANL) framework that integrates a pre-trained diffusion model into the deepfake detection pipeline to guide the learning of more robust features. Specifically, our method uses the diffusion model's denoising process to expose subtle artifacts: the detector is trained to predict the noise contained in an input image at a given diffusion step, forcing it to capture discrepancies between real and synthetic images, while an attention-guided mechanism derived from the predicted noise is introduced to encourage the model to focus on globally distributed discrepancies rather than local patterns. By harnessing the frozen diffusion model's learned distribution of natural images, the ANL method acts as a form of regularization, improving the detector's generalization to unseen forgery types. Extensive experiments demonstrate that ANL significantly outperforms existing methods on multiple benchmarks, achieving state-of-the-art accuracy in detecting diffusion-generated deepfakes. Notably, the proposed framework boosts generalization performance (e.g., improving ACC/AP by a substantial margin on unseen models) without introducing additional overhead during inference. Our results highlight that diffusion noise provides a powerful signal for generalizable deepfake detection.
[103] M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection
Haotian Wu,Yue Cheng,Shan Bian
Main category: cs.CV
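TL;DR: This paper proposes M3D-Net, a multi-modal 3D facial feature reconstruction network for deepfake detection that improves detection accuracy and robustness through self-supervised 3D face reconstruction and multi-modal feature fusion.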
Details
Motivation: Most existing deepfake detection methods reconstruct facial attributes in isolation, fail to exploit the complementarity of multi-modal features, and struggle with the security threat posed by increasingly realistic forgeries.
Method: Propose M3D-Net, an end-to-end dual-stream architecture comprising a self-supervised 3D facial reconstruction module (recovering geometry and reflectance), a 3D Feature Pre-fusion Module (PFM), and a Multi-modal Fusion Module (MFM) that uses attention mechanisms to fuse RGB and 3D-reconstructed features.
Result: Experiments on multiple public datasets show state-of-the-art detection accuracy, robustness, and cross-scenario generalization.
Conclusion: Jointly modeling multi-modal 3D features effectively improves deepfake detection, and M3D-Net provides a new direction and an effective tool for the field.
Abstract: With the rapid advancement of deep learning in image generation, facial forgery techniques have achieved unprecedented realism, posing serious threats to cybersecurity and information authenticity. Most existing deepfake detection approaches rely on the reconstruction of isolated facial attributes without fully exploiting the complementary nature of multi-modal feature representations. To address these challenges, this paper proposes a novel Multi-Modal 3D Facial Feature Reconstruction Network (M3D-Net) for deepfake detection. Our method leverages an end-to-end dual-stream architecture that reconstructs fine-grained facial geometry and reflectance properties from single-view RGB images via a self-supervised 3D facial reconstruction module. The network further enhances detection performance through a 3D Feature Pre-fusion Module (PFM), which adaptively adjusts multi-scale features, and a Multi-modal Fusion Module (MFM) that effectively integrates RGB and 3D-reconstructed features using attention mechanisms. Extensive experiments on multiple public datasets demonstrate that our approach achieves state-of-the-art performance in terms of detection accuracy and robustness, significantly outperforming existing methods while exhibiting strong generalization across diverse scenarios.
[104] TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
Xiangyu Liu,Feng Gao,Xiaomei Zhang,Yong Zhang,Xiaoming Wei,Zhen Lei,Xiangyu Zhu
Main category: cs.CV
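TL;DR: This paper proposes TurboTalk, a two-stage progressive distillation framework that compresses a multi-step audio-driven video diffusion model into a single-step generator, speeding up inference by 120 times while maintaining high quality.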
Details
Motivation: Existing audio-driven video digital human generation models rely on multi-step denoising, which is computationally expensive and hard to deploy; one-step distillation is fast but unstable to train.
Method: Propose a two-stage progressive distillation: the first stage uses Distribution Matching Distillation to obtain a stable 4-step student; the second stage progressively reduces the steps from 4 to 1 through adversarial distillation, with a progressive timestep sampling strategy and a self-compare adversarial objective to stabilize training.
Result: Achieves single-step generation of video talking avatars, with a 120-fold inference speedup while maintaining high generation quality.
Conclusion: TurboTalk effectively resolves the trade-off between speed and training stability in audio-driven video generation and offers a viable solution for real-time applications.
Abstract: Existing audio-driven video digital human generation models rely on multi-step denoising, resulting in substantial computational overhead that severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that effectively compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective that provides an intermediate adversarial reference that stabilizes progressive distillation. Our method achieves single-step generation of video talking avatars, boosting inference speed by 120 times while maintaining high generation quality.
[105] MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models
Ruiqi Wang,Qi Yu,Jie Ma,Hanlin Wu
Main category: cs.CV
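TL;DR: This paper proposes MapSR, a prompt-driven framework for land-cover map super-resolution that generates high-resolution maps from low-resolution labels alone, greatly reducing computational cost and parameter count.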
Details
Motivation: High-resolution land-cover mapping is constrained by the high cost of dense high-resolution annotations; existing weakly supervised methods can use low-resolution labels but require computationally expensive retraining.
Method: MapSR uses a frozen vision foundation model to extract class prompts from the low-resolution labels: a lightweight linear probe identifies high-confidence high-resolution features, which are aggregated into the prompts. Inference is then training-free, using cosine-similarity matching followed by graph-based propagation for spatial refinement.
Result: Reaches 59.64% mIoU on the Chesapeake Bay dataset without any high-resolution labels, remaining competitive with the strongest weakly supervised baseline and surpassing a fully supervised baseline, while reducing trainable parameters by four orders of magnitude and shortening training time from hours to minutes.
Conclusion: MapSR enables efficient, scalable high-resolution land-cover mapping and is particularly advantageous when annotations and compute are limited.
Abstract: High-resolution (HR) land-cover mapping is often constrained by the high cost of dense HR annotations. We revisit this problem from the perspective of map super-resolution, which enhances coarse low-resolution (LR) land-cover products into HR maps at the resolution of the input imagery. Existing weakly supervised methods can leverage LR labels, but they typically use them to retrain dense predictors with substantial computational cost. We propose MapSR, a prompt-driven framework that decouples supervision from model training. MapSR uses LR labels once to extract class prompts from frozen vision foundation model features through a lightweight linear probe, after which HR mapping proceeds via training-free metric inference and graph-based prediction refinement. Specifically, class prompts are estimated by aggregating high-confidence HR features identified by the linear probe, and HR predictions are obtained by cosine-similarity matching followed by graph-based propagation for spatial refinement. Experiments on the Chesapeake Bay dataset show that MapSR achieves 59.64% mIoU without any HR labels, remaining competitive with the strongest weakly supervised baseline and surpassing a fully supervised baseline. Notably, MapSR reduces trainable parameters by four orders of magnitude and shortens training time from hours to minutes, enabling scalable HR mapping under limited annotation and compute budgets. The code is available at https://github.com/rikirikirikiriki/MapSR.
[106] Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models
Amir El-Ghoussani,Marc Hölle,Gustavo Carneiro,Vasileios Belagiannis
Main category: cs.CV
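TL;DR: This paper proposes Masked Logit Nudging (MLN) for prompt-guided image editing in visual autoregressive models. It edits images precisely according to a text prompt while leaving unrelated regions unchanged, achieves the best performance on multiple benchmarks, and is faster than diffusion models.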
Details
Motivation: Address prompt-guided image editing in visual autoregressive models, where the source image must be modified according to a target text prompt while regions unrelated to the edit stay unchanged.
Method: Propose Masked Logit Nudging: convert the source image token maps into target logits using the VAR encoding; nudge the model's predicted logits toward these targets along the semantic trajectory defined by the source and target prompts; restrict edits to spatial masks derived from cross-attention differences between the source and edited prompts; add a refinement step that corrects quantization errors and improves reconstruction quality.
Result: Achieves the best image editing performance on the PIE benchmark at 512px and 1024px; delivers more faithful reconstructions on COCO (512px) and OpenImages (1024px); overall outperforms VAR-related methods and matches or exceeds diffusion models while being much faster at inference.
Conclusion: Masked Logit Nudging is an efficient, precise, and faithful prompt-driven image editing method that balances editing quality with inference efficiency and offers a new paradigm for editing with visual autoregressive models.
Abstract: We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions that are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct quantization errors and improve reconstruction quality. Our approach achieves the best image editing performance on the PIE benchmark at 512px and 1024px resolutions. Beyond editing, our method delivers faithful reconstructions and outperforms previous methods on COCO at 512px and OpenImages at 1024px. Overall, our method outperforms VAR-related approaches and achieves comparable or even better performance than diffusion models, while being much faster. Code is available at https://github.com/AmirMaEl/MLN.
[107] Towards Design Compositing
Abhinav Mahajan,Abhikhya Tripathy,Sudeeksha Reddy Pala,Vaibhav Methi,K J Joseph,Balaji Vasan Srinivasan
Main category: cs.CV
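TL;DR: This paper proposes GIST, a training-free, identity-preserving image compositor that addresses stylistic inconsistency among multi-source components in graphic design; it plugs into existing design generation pipelines and significantly improves visual harmony and aesthetic quality.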
Details
Motivation: Existing graphic design methods assume that input components are already stylistically harmonious, but in practice inputs from multiple sources are often visually mismatched, so identity-preserving stylization and compositing are needed.
Method: Propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation and can be plugged into different pipelines such as LaDeCo and Design-o-meter.
Result: GIST significantly improves visual harmony and aesthetic quality in both the LaDeCo and Design-o-meter pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting.
Conclusion: Identity-preserving stylization and compositing are a key step toward truly harmonized components-to-design conversion, and GIST offers a general, lightweight, plug-and-play solution for the task.
Abstract: Graphic design creation involves harmoniously assembling multimodal components such as images, text, logos, and other visual assets collected from diverse sources, into a visually-appealing and cohesive design. Recent methods have largely focused on layout prediction or complementary element generation, while retaining input elements exactly, implicitly assuming that provided components are already stylistically harmonious. In practice, inputs often come from disparate sources and exhibit visual mismatch, making this assumption limiting. We argue that identity-preserving stylization and compositing of input elements is a critical missing ingredient for truly harmonized components-to-design pipelines. To this end, we propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation, and can be plugged into any existing components-to-design or design-refining pipeline without modification. We demonstrate this by integrating GIST with two substantially different existing methods, LaDeCo and Design-o-meter. GIST shows significant improvements in visual harmony and aesthetic quality across both pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting. Project Page: abhinav-mahajan10.github.io/GIST/.
[108] Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening
Junfeng Li,Wenyang Zhou,Xueheng Li,Xuanhua He,Jianhou Gan,Wenqi Ren
Main category: cs.CV
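TL;DR: This paper proposes a multigrain-aware semantic prototype scanning paradigm for pan-sharpening, combining a high-order RWKV architecture with a tri-token prompting mechanism based on semantic clustering; it innovates in scanning strategy, prompt learning, and feature transformation to improve performance significantly.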
Details
Motivation: The bidirectional raster scanning of conventional RWKV is semantic-agnostic and prone to positional bias; existing methods face a trade-off between modeling global semantics and preserving spatial details.
Method: 1) Multigrain-aware Semantic Prototype Scanning: use locality-sensitive hashing to group semantically related regions and build multi-grain semantic prototypes, enabling semantic-driven, context-aware token reordering; 2) Tri-token Prompt Learning: introduce a global token, cluster-derived prototype tokens, and a learnable register token, providing complementary semantic priors and suppressing noise; 3) Invertible Q-Shift: apply center difference convolution on the value pathway to inject high-frequency information, and design an invertible multi-scale Q-shift for lossless, efficient feature transformation.
Result: Experiments show that the method outperforms existing state-of-the-art methods on multiple datasets, particularly in detail preservation and semantic consistency.
Conclusion: The combined design of multi-grain semantic modeling and invertible high-frequency enhancement improves the joint optimization of global semantic understanding and local detail reconstruction in pan-sharpening.
Abstract: In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering. Specifically, our method contains three key components: 1) Multigrain-aware Semantic Prototype Scanning. Although RWKV offers an efficient linear-complexity alternative to Transformers, its conventional bidirectional raster scanning is still semantic-agnostic and prone to positional bias. To address this issue, we introduce a semantic-driven scanning strategy that leverages locality-sensitive hashing to group semantically related regions and construct multi-grain semantic prototypes, enabling context-aware token reordering and more coherent global interaction. 2) Tri-token Prompt Learning. We design a tri-token prompting mechanism consisting of a global token, cluster-derived prototype tokens, and a learnable register token. The global and prototype tokens provide complementary semantic priors for RWKV modeling, while the register token helps suppress noisy and artifact-prone intermediate representations. 3) Invertible Q-Shift. To counteract the loss of spatial details, we apply center difference convolution on the value pathway to inject high-frequency information, and introduce an invertible multi-scale Q-shift operation for efficient and lossless feature transformation without parameter-heavy receptive field expansion. Experimental results demonstrate the superiority of our method.
[109] Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
Haoyi Sun,Xiaoxiao Wang,Ning Mao,Qian Wang,Lifu Mu,Wen Zheng,Tao Wei,Wei Chen
Main category: cs.CV
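TL;DR: This paper proposes Switch-KD, a framework that unifies vision-language knowledge transfer within a shared text-probability space to resolve inconsistent modality supervision when distilling vision-language models (VLMs). It combines Visual-Switch Distillation with a Dynamic Bi-directional Logits Difference (DBiLD) loss, and significantly improves the multimodal performance of a 0.5B TinyLLaVA without architectural modification.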
Details
Motivation: Existing VLM knowledge distillation methods supervise each modality separately without explicitly handling multimodal alignment, which leads to inconsistent multimodal knowledge transfer and makes it hard to deploy large models efficiently in resource-constrained settings.
Method: Propose the Switch-KD framework: (1) Visual-Switch Distillation, which routes the student's visual outputs into the teacher's language pathway to build cross-modal probabilistic references for implicit visual knowledge transfer; (2) a Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student.
Result: A 0.5B TinyLLaVA gains an average of 3.6 points across 10 multimodal benchmarks, distilling rich multimodal knowledge from a 3B teacher without any architectural modification.
Conclusion: Switch-KD addresses the missing modality alignment in VLM distillation, achieving efficient and consistent cross-modal knowledge transfer within a unified text-probability space and offering a new paradigm for deploying lightweight VLMs.
Abstract: Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
[110] CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
Inseok Jeon,Suhwan Cho,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Jungho Lee,Chaewon Park,Donghyeong Kim,Sangyoun Lee
Main category: cs.CV
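TL;DR: This paper proposes a cross-modal token modulation method that strengthens the interaction between appearance and motion cues through relation transformer blocks and adds a token masking strategy to improve learning efficiency, achieving state-of-the-art performance on unsupervised video object segmentation.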
Details
Motivation: Existing two-stream architectures can fuse appearance and motion cues but struggle to model the dependencies between them, which limits performance.
Method: Propose a cross-modal token modulation mechanism that establishes dense connections between the tokens of the two modalities and uses relation transformer blocks for intra-modal and inter-modal information propagation; introduce a token masking strategy to improve learning efficiency rather than relying solely on increased model complexity.
Result: Achieves state-of-the-art performance on all public benchmarks, outperforming existing methods.
Conclusion: Cross-modal token modulation strengthens the joint modeling of appearance and motion cues, and combining it with token masking efficiently improves unsupervised video object segmentation.
Abstract: Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
[111] High-Speed Full-Color HDR Imaging via Unwrapping Modulo-Encoded Spike Streams
Chu Zhou,Siqi Yang,Kailong Zhang,Heng Guo,Zhaofei Yu,Boxin Shi,Imari Sato
Main category: cs.CV
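TL;DR: This paper presents a full-color, high-speed HDR imaging system based on modulo sensors. It pairs an exposure-decoupled sensing formulation with an iteration-free, physics-consistent unwrapping algorithm and spike-camera hardware to achieve 1000 FPS HDR imaging at greatly reduced bandwidth.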
Details
Motivation: Conventional RGB HDR imaging faces a fundamental trade-off between multi-exposure capture (motion artifacts) and single-shot capture (irreversible information loss); existing modulo-sensor solutions are limited by iterative unwrapping overhead and low-speed grayscale acquisition.
Method: Propose an exposure-decoupled formulation of modulo imaging that supports temporally interleaved multi-frame acquisition; design an iteration-free unwrapping algorithm that combines diffusion-based generative priors with the least absolute remainder property of modulo images; build a hardware prototype based on modulo-encoded spike streams.
Result: Achieves 1000 FPS full-color HDR imaging while reducing output bandwidth from approximately 20 Gbps to 6 Gbps; the system's feasibility and performance advantages are validated in dynamic scenes.
Conclusion: The coordinated hardware-software approach overcomes the system-level bottlenecks of modulo imaging in speed, color, and efficiency, providing a viable path for dynamic HDR applications.
Abstract: Conventional RGB-based high dynamic range (HDR) imaging faces a fundamental trade-off between motion artifacts in multi-exposure captures and irreversible information loss in single-shot techniques. Modulo sensors offer a promising alternative by encoding theoretically unbounded dynamic range into wrapped measurements. However, existing modulo solutions remain bottlenecked by iterative unwrapping overhead and hardware constraints limiting them to low-speed, grayscale capture. In this work, we present a complete modulo-based HDR imaging system that enables high-speed, full-color HDR acquisition by synergistically advancing both the sensing formulation and the unwrapping algorithm. At the core of our approach is an exposure-decoupled formulation of modulo imaging that allows multiple measurements to be interleaved in time, preserving a clean, observation-wise measurement model. Building upon this, we introduce an iteration-free unwrapping algorithm that integrates diffusion-based generative priors with the physical least absolute remainder property of modulo images, supporting highly efficient, physics-consistent HDR reconstruction. Finally, to validate the practical viability of our system, we demonstrate a proof-of-concept hardware implementation based on modulo-encoded spike streams. This setup preserves the native high temporal resolution of spike cameras, achieving 1000 FPS full-color imaging while reducing output data bandwidth from approximately 20 Gbps to 6 Gbps. Extensive evaluations indicate that our coordinated approach successfully overcomes key systemic bottlenecks, demonstrating the feasibility of deploying modulo imaging in dynamic scenarios.
[112] Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification
Weiwei Zhuang,Wangze Xie,Qi Zhang,Xia Du,Zihan Lin,Zheng Lin,Hanlin Cai,Jizhe Zhou,Zihan Fang,Chi-man Pun,Wei Ni,Jun Luo
Main category: cs.CV
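TL;DR: This paper proposes FogFool, a physically realizable fog-based adversarial attack that models atmospheric patterns with Perlin noise to generate visually realistic, robust, and highly transferable adversarial examples for remote sensing image classification.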
Details
Motivation: Most existing adversarial attacks on remote sensing images use direct pixel-wise perturbations, which ignore atmospheric characteristics and do not survive real-world degradations; a more physically plausible and robust attack paradigm is needed.
Method: Propose the FogFool framework, which generates adversarial perturbations by iteratively optimizing Perlin-noise-based atmospheric patterns (fog), using the structural coherence and mid-to-low-frequency nature of fog to embed adversarial information into structural features shared across architectures.
Result: On two remote sensing benchmark datasets, FogFool performs strongly in white-box attacks, reaches a black-box transfer success rate of 83.74% TASR, and stays robust against preprocessing-based defenses such as JPEG compression and filtering; CAM analysis shows that it induces a universal shift in model attention.
Conclusion: FogFool is a practical, stealthy, and highly persistent threat to remote sensing classification systems and provides a solid benchmark for evaluating model reliability in complex environments.
Abstract: Adversarial attacks pose a severe threat to the reliability of deep learning models in remote sensing (RS) image classification. Most existing methods rely on direct pixel-wise perturbations, failing to exploit the inherent atmospheric characteristics of RS imagery or survive real-world image degradations. In this paper, we propose FogFool, a physically plausible adversarial framework that generates fog-based perturbations by iteratively optimizing atmospheric patterns based on Perlin noise. By modeling fog formations with natural, irregular structures, FogFool generates adversarial examples that are not only visually consistent with authentic RS scenes but also deceptive. By leveraging the spatial coherence and mid-to-low-frequency nature of atmospheric phenomena, FogFool embeds adversarial information into structural features shared across diverse architectures. Extensive experiments on two benchmark RS datasets demonstrate that FogFool achieves superior performance: not only does it excel in white-box settings, but it also exhibits exceptional black-box transferability (reaching 83.74% TASR) and robustness against common preprocessing-based defenses such as JPEG compression and filtering. Detailed analyses, including confusion matrices and Class Activation Map (CAM) visualizations, reveal that our atmospheric-driven perturbations induce a universal shift in model attention. These results indicate that FogFool represents a practical, stealthy, and highly persistent threat to RS classification systems, providing a robust benchmark for evaluating model reliability in complex environments.
[113] Chaotic CNN for Limited Data Image Classification
Anusree M,Akhila Henry,Pramod P Nair
Main category: cs.CV
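TL;DR: This paper proposes a chaotic-map-based feature transformation that applies nonlinear chaotic maps (logistic, skew tent, and sine) to normalized feature vectors before the CNN classification layer, significantly improving accuracy in limited-data image classification without adding model parameters.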
Details
Motivation: With limited training data, CNNs tend to overfit, lack feature diversity, and generalize poorly.
Method: Before the CNN classification layer, apply nonlinear transformations to the normalized feature vectors using three chaotic maps (logistic, skew tent, and sine), reshaping the feature space to improve class separability.
Result: Consistent gains under limited-data settings on MNIST (+5.43%), Fashion-MNIST (+9.11%), and CIFAR-10 (+7.47%); the gains are attributed to the shared nonlinear and dynamical properties of chaotic systems.
Conclusion: The chaotic feature transformation is computationally efficient, adds no parameters, and is plug-and-play, making it a practical enhancement for data-scarce image classification.
Abstract: Convolutional neural networks (CNNs) often exhibit poor generalisation in limited training data scenarios due to overfitting and insufficient feature diversity. In this work, a simple and effective chaos-based feature transformation is proposed to enhance CNN performance without increasing model complexity. The method applies nonlinear transformations using logistic, skew tent, and sine maps to normalised feature vectors before the classification layer, thereby reshaping the feature space and improving class separability. The approach is evaluated on greyscale datasets (MNIST and Fashion-MNIST) and an RGB dataset (CIFAR-10) using CNN architectures of varying depth under limited data conditions. The results show consistent improvement over the standalone (SA) CNN across all datasets. Notably, a maximum performance gain of 5.43% is achieved on MNIST using the skew tent map with a 3-layer CNN at 40 samples per class. A higher gain of 9.11% is observed on Fashion-MNIST using the sine map with a 3-layer CNN at 50 samples per class. Additionally, a strong gain of 7.47% is obtained on CIFAR-10 using the skew tent map at 200 samples per class. The consistent improvements across different chaotic maps indicate that the performance gain is driven by the shared nonlinear and dynamical properties of chaotic systems. The proposed method is computationally efficient, requires no additional trainable parameters, and can be easily integrated into existing CNN architectures, making it a practical solution for data-scarce image classification tasks.
[114] Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting
Inseok Jeon,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Suhwan Cho,Sangyoun Lee
Main category: cs.CV
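TL;DR: This paper proposes Seen-to-Scene, a framework that unifies propagation-based and generation-based paradigms, using a flow completion network and reference-guided latent propagation to improve the spatio-temporal consistency and efficiency of video outpainting.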
Details
Motivation: Existing methods based on generative models (such as diffusion models) suffer from implicit temporal modeling and limited spatial context in video outpainting, causing intra-frame and inter-frame inconsistencies that are most pronounced in dynamic scenes and large outpainting regions.
Method: Propose the Seen-to-Scene framework: 1) use a flow completion network pre-trained for video inpainting and fine-tune it end-to-end to bridge the domain gap and reconstruct coherent motion fields; 2) introduce a reference-guided latent propagation mechanism that efficiently propagates source content across frames.
Result: In extensive experiments, the method surpasses prior state-of-the-art methods in temporal coherence and visual realism, with efficient inference and no input-specific adaptation.
Conclusion: By unifying the propagation and generation paradigms, Seen-to-Scene offers a more robust, efficient, and higher-quality solution for video outpainting.
Abstract: Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.
[115] DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
Bo Qian,Dahu Shi,Xing Wei
Main category: cs.CV
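TL;DR: This paper proposes DETR-ViP, a framework that improves the discriminability and robustness of visual prompts in open-vocabulary object detection through global prompt integration, visual-textual prompt relation distillation, and a selective fusion strategy.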
Details
Motivation: Visual-prompted object detection currently underperforms, mainly because visual prompts lack global discriminability; the direction has also long been neglected and is usually treated as a byproduct of training text-prompted detectors. Method: On top of basic image-text contrastive learning, DETR-ViP adds global prompt integration and visual-textual prompt relation distillation, together with a selective fusion strategy. Result: On COCO, LVIS, ODinW, and Roboflow100, DETR-ViP achieves substantially higher visual-prompt detection performance than existing state-of-the-art methods. Conclusion: Improving the global discriminability of visual prompts is key to better visual-prompted detection; DETR-ViP addresses this and advances visual-prompted detection as a research direction in its own right. Abstract: Visual-prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual-prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text-prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.
[116] Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
Zhixuan Wu,Quanxing Zha,Teng Wang,Genbao Xu,Wenyuan Gu,Wei Rao,Nan Ma,Bo Cheng,Soujanya Poria
Main category: cs.CV
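TL;DR: This paper proposes Chain-of-Glimpse, a framework that uses search-guided progressive visual object grounding and multi-step reasoning to better model temporal object variation in video understanding.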
Details
Motivation: Most existing video understanding methods are object-agnostic; they struggle with the substantial changes objects undergo over time and lack explicit modeling and spatial grounding of key visual objects. Method: Chain-of-Glimpse casts video reasoning as a step-by-step process that incrementally builds spatially grounded traces. A search-guided controller, optimized via reinforcement learning with a format reward that strengthens grounding, anchors each reasoning step to specific visual evidence regions, yielding interpretable, compositional multi-step decisions. Result: The method delivers consistent gains on in-domain NExTQA and on the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks, indicating robustness and generalization. Conclusion: Through explicit object anchoring and progressive spatial tracing, Chain-of-Glimpse mitigates over-reliance on saliency-driven cues and offers a more reliable, interpretable multi-step reasoning paradigm for video understanding. Abstract: Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both in-domain NExTQA and out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness and generalization of Chain-of-Glimpse across diverse video reasoning tasks.
[117] The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment
Songlin Li,Zhiqing Guo,Dan Ma,Changtao Miao,Gaobo Yang
Main category: cs.CV
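TL;DR: This paper proposes a courtroom-style adjudication framework for image manipulation localization (IML): a prosecution stream (asserting manipulation), a defense stream (asserting authenticity), and a judge model (dynamic decision-making and calibration) jointly model the confrontation between manipulated and authentic evidence, markedly improving localization robustness under weak traces and noise.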
Details
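Motivation: Existing IML methods that use authenticity supervision treat it only as an auxiliary training signal and do not explicitly model the opposition between manipulated and authentic evidence, so localization becomes unreliable when traces are subtle or degraded by post-processing and noise. Method: A dual-hypothesis segmentation architecture runs a prosecution stream (producing manipulation evidence) and a defense stream (producing authenticity evidence) in parallel on a shared multi-scale encoder, combined with edge priors, cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. A reinforcement learning judge model, trained with advantage-based rewards and a soft-IoU objective, performs strategic re-inference and refinement on uncertain regions; reliability is calibrated jointly via entropy and cross-hypothesis consistency. Result: The model outperforms current SOTA methods in average performance on multiple benchmark datasets, with notably better robustness under degradations such as blur, compression, and noise. Conclusion: Formalizing IML as evidence confrontation followed by adjudication, explicitly modeling the opposition between manipulation and authenticity, and using a learnable judge for dynamic, trustworthy decisions provides a new paradigm for fine-grained forgery detection. Abstract: Although some existing image manipulation localization (IML) methods incorporate authenticity-related supervision, this information is typically utilized merely as an auxiliary training signal to enhance the model's sensitivity to manipulation artifacts, rather than being explicitly modeled as localization evidence opposing the manipulated regions. Consequently, when manipulation traces are subtle or degraded by post-processing and noise, these methods struggle to explicitly compare manipulated and authentic evidence, resulting in unreliable predictions in ambiguous areas. To address these issues, we propose a courtroom-style adjudication framework that regards the IML task as the confrontation of evidence followed by judgment. The framework comprises a prosecution stream, a defense stream, and a judge model. We first build a dual-hypothesis segmentation architecture on a shared multi-scale encoder, in which the prosecution stream asserts manipulation and the defense stream asserts authenticity. Guided by edge priors, it produces evidence for manipulated and authentic regions through cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. We further develop a reinforcement learning judge model that performs strategic re-inference and refinement on uncertain regions, yielding a manipulated-region mask. The judge model is trained with advantage-based rewards and a soft-IoU objective, and reliability is calibrated via entropy and cross-hypothesis consistency.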
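The soft-IoU objective and the entropy-based reliability signal named in this abstract can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the abstract does not give the exact formulas, and the function names `soft_iou`, `binary_entropy`, and `reliability`, as well as the way entropy and cross-hypothesis consistency are combined, are hypothetical rather than the authors' implementation.

```python
import math
from typing import Sequence


def soft_iou(pred: Sequence[float], target: Sequence[float], eps: float = 1e-6) -> float:
    """Differentiable-style IoU between predicted probabilities and a binary mask.

    Both inputs are flattened per-pixel values in [0, 1].
    """
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return (inter + eps) / (union + eps)


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction: 0 when certain, 1 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def reliability(p_manip: float, p_auth: float) -> float:
    """Hypothetical per-pixel reliability score (an assumption, not the paper's rule).

    The two streams are treated as consistent when their probabilities sum to 1,
    and the pixel is treated as certain when the manipulation entropy is low.
    """
    consistency = 1.0 - abs((p_manip + p_auth) - 1.0)
    certainty = 1.0 - binary_entropy(p_manip)
    return max(0.0, consistency) * certainty


if __name__ == "__main__":
    mask = [1.0, 1.0, 0.0, 0.0]
    probs = [0.9, 0.6, 0.2, 0.1]
    print("soft IoU:", soft_iou(probs, mask))
    print("reliability of a confident pixel:", reliability(0.95, 0.05))
    print("reliability of an ambiguous pixel:", reliability(0.5, 0.5))
```

Under this sketch, pixels with low reliability would be the natural candidates for a judge model to revisit, which matches the abstract's description of re-inference on uncertain regions.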
Experimental results show that our model achieves superior average performance compared with SOTA IML methods.
[118] NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation
Yi He,Tao Wang,Yi Jin,Congyan Lang,Yidong Li,Haibin Ling
Main category: cs.CV
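TL;DR: This paper proposes NG-GS, a framework that identifies ambiguous boundary Gaussians, applies RBF interpolation with multi-resolution hash encoding, and jointly optimizes with a lightweight NeRF module to markedly improve segmentation quality at object boundaries in 3D Gaussian Splatting.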
Details
Motivation: 3D Gaussian Splatting (3DGS) is efficient and photorealistic for novel view synthesis, but its discrete Gaussian representation causes aliasing and artifacts at object boundaries, which makes accurate segmentation difficult. Method: Mask variance analysis first identifies ambiguous boundary Gaussians automatically; radial basis function (RBF) interpolation then builds a spatially continuous feature field, with multi-resolution hash encoding for multi-scale representation; finally, joint optimization with a lightweight NeRF module (using alignment and spatial continuity losses) keeps segmentation boundaries smooth and consistent. Result: The method reaches state-of-the-art performance on NVOS, LERF-OVS, and ScanNet, with significant gains in boundary mIoU. Conclusion: NG-GS addresses the boundary segmentation problem caused by the discrete representation in 3DGS and offers a new approach to high-quality semantic understanding of 3D scenes. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS, LERF-OVS, and ScanNet benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU. Code is available at https://github.com/BJTU-KD3D/NG-GS.
[119] G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
Jiyoung Lim,Heejae Yang,Jee-Hyong Lee
Main category: cs.CV
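TL;DR: This paper proposes G-MIXER, a training-free method that expands implicit semantics via geodesic mixup and re-ranks candidates with explicit semantics generated by a multimodal large language model, improving the diversity and accuracy of zero-shot composed image retrieval (ZS-CIR).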
Details
Motivation: Existing zero-shot CIR methods rely heavily on the textual modality and cannot model the candidate diversity that fuzzy retrieval requires, which reduces both the diversity and the accuracy of retrieval results. Method: G-MIXER is training-free. Geodesic mixup over a range of mixup ratios builds composed query features that reflect the implicit semantics of the reference image-text pair and produces a diverse candidate set, which is then re-ranked using explicit semantics derived from an MLLM. Result: The method reaches state-of-the-art performance on multiple ZS-CIR benchmarks, improving retrieval diversity and accuracy without additional training. Conclusion: G-MIXER models implicit and explicit semantics jointly and offers an efficient, training-free approach to zero-shot composed image retrieval. Abstract: Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at https://github.com/maya0395/gmixer.
[120] MS-SSE-Net: A Multi-Scale Spatial Squeeze-and-Excitation Network for Structural Damage Detection in Civil and Geotechnical Engineering
Saif ur Rehman Khan,Imad Ahmed Waqar,Arooj Zaib,Saad Ahmed,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim
Main category: cs.CV
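TL;DR: This paper proposes MS-SSE-Net, a deep learning framework for structural damage classification that combines multi-scale feature extraction with channel and spatial attention and outperforms DenseNet201 and other baselines on the StructDamage dataset.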
Details
Motivation: Accurately identifying different types of structural damage from images remains challenging, mainly because damage patterns and environmental conditions vary widely. Method: Built on a DenseNet201 backbone, the model adds parallel depthwise convolutions for multi-scale feature extraction and combines squeeze-and-excitation style channel attention with spatial attention, followed by global average pooling and a fully connected classification layer. Result: On the StructDamage dataset, MS-SSE-Net reaches 99.31% precision, 99.25% recall, 99.27% F1-score, and 99.26% accuracy, outperforming the DenseNet201 baseline on every metric. Conclusion: MS-SSE-Net improves the accuracy and robustness of structural damage image classification and supports the strategy of fusing multi-scale features with attention mechanisms. Abstract: Structural damage detection is essential for maintaining the safety and reliability of civil infrastructure. However, accurately identifying different types of structural damage from images remains challenging due to variations in damage patterns and environmental conditions. To address these challenges, this paper proposes MS-SSE-Net, a novel deep learning (DL) framework for structural damage classification. The proposed model is built upon the DenseNet201 backbone and integrates novel multi-scale feature extraction with channel and spatial attention mechanisms (MS-SSE-Net). Specifically, parallel depthwise convolutions capture both local and contextual features, while squeeze-and-excitation style channel attention and spatial attention emphasize informative regions and suppress irrelevant noise. The refined features are then processed through global average pooling and a fully connected classification layer to generate the final predictions. Experiments are conducted on the StructDamage dataset containing multiple structural damage categories. The proposed MS-SSE-Net demonstrates superior performance compared with the baseline DenseNet201 and other comparative approaches. Specifically, the proposed method achieves 99.31% precision, 99.25% recall, 99.27% F1-score, and 99.26% accuracy, outperforming the baseline model which achieved 98.62% precision, 98.53% recall, 98.58% F1-score, and 98.53% accuracy.
[121] Data Synthesis Improves 3D Myotube Instance Segmentation
David Exler,Nils Friederich,Martin Krüger,John Jbeily,Mario Vitacolonna,Rüdiger Rudolf,Ralf Mikut,Markus Reischl
Main category: cs.CV
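TL;DR: This paper proposes a geometry-driven synthetic data generation method to address the scarcity of annotated real data for 3D myotube instance segmentation; a lightweight 3D U-Net with self-supervised pretraining, trained only on synthetic data, outperforms zero-shot models on real data.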
Details
Motivation: Existing pretrained biomedical segmentation models fail to generalize to myotubes because no large annotated myotube datasets exist; high-quality synthetic data is needed to support quantitative morphological analysis. Method: A geometry-driven synthesis pipeline models myotubes with polynomial centerlines, varying radii, branching structures, and ellipsoidal end caps, then adds realistic noise, optical artifacts, and CycleGAN-based domain adaptation; a compact 3D U-Net with self-supervised pretraining is trained on synthetic data only. Result: The model reaches a mean IPQ of 0.22 on real data, significantly outperforming three zero-shot segmentation models. Conclusion: Biophysics-driven synthetic data can support accurate 3D instance segmentation when annotations are scarce. Abstract: Myotubes are multinucleated muscle fibers serving as key model systems for studying muscle physiology, disease mechanisms, and drug responses. Mechanistic studies and drug screening thereby rely on quantitative morphological readouts such as diameter, length, and branching degree, which in turn require precise three-dimensional instance segmentation. Yet established pretrained biomedical segmentation models fail to generalize to this domain due to the absence of large annotated myotube datasets. We introduce a geometry-driven synthesis pipeline that models individual myotubes via polynomial centerlines, locally varying radii, branching structures, and ellipsoidal end caps derived from real microscopy observations. Synthetic volumes are rendered with realistic noise, optical artifacts, and CycleGAN-based Domain Adaptation (DA). A compact 3D U-Net with self-supervised encoder pretraining, trained exclusively on synthetic data, achieves a mean IPQ of 0.22 on real data, significantly outperforming three established zero-shot segmentation models, demonstrating that biophysics-driven synthesis enables effective instance segmentation in annotation-scarce biomedical domains.
[122] HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
Badri N. Patro,Vijay S. Agneeswaran
Main category: cs.CV
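TL;DR: HAMSA is a scanning-free vision state space model operating in the spectral domain. A simplified kernel parameterization, SpectralPulseNet (SPN), and a Spectral Adaptive Gating Unit (SAGU) improve efficiency and stability; HAMSA reaches the best ImageNet accuracy among SSMs and improves on existing models in speed, memory, and energy use.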
Details
Motivation: Existing vision state space models (such as Vim, VMamba, and SiMBA) rely on complex scanning strategies to process 2D images, which adds computational overhead and architectural complexity. Method: HAMSA is an FFT-based SSM in the spectral domain. It replaces the traditional (A, B, C) matrices with a single Gaussian-initialized complex kernel, introduces SpectralPulseNet (SPN) as an input-dependent spectral gating mechanism, and uses a magnitude-based Spectral Adaptive Gating Unit (SAGU) to stabilize gradient flow in the frequency domain. Result: On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state of the art among SSMs); inference is 2.2x faster than DeiT-S (4.2ms vs 9.2ms) and 1.4-1.9x faster than scanning-based SSMs, with lower memory use (2.1GB vs 3.2-4.5GB) and lower energy use (12.5J vs 18-25J); it also generalizes well to transfer learning and dense prediction tasks. Conclusion: By dropping scanning in favor of spectral-domain modeling, HAMSA keeps high accuracy while making SSMs simpler, more efficient, and more stable, offering a new approach for vision SSMs. Abstract: Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization, in which a single Gaussian-initialized complex kernel replaces the traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN), an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU), a magnitude-based gate for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2x faster inference than transformers (4.2ms vs 9.2ms for DeiT-S) and 1.4-1.9x speedup over scanning-based SSMs, while using less memory (2.1GB vs 3.2-4.5GB) and energy (12.5J vs 18-25J). HAMSA demonstrates strong generalization across transfer learning and dense prediction tasks.
[123] Find the Differences: Differential Morphing Attack Detection vs Face Recognition
Una M. Kelly,Luuk J. Spreeuwers,Raymond N. J. Veldhuis
Main category: cs.CV
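TL;DR: This paper examines the vulnerability of face recognition (FR) systems to morphing attacks, argues that FR and differential morphing attack detection (D-MAD) are essentially similar tasks, proposes using existing FR systems for morphing detection, and introduces a new evaluation threshold that bounds vulnerability to morphing attacks, including unknown types.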
Details
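Motivation: Many existing face recognition systems are vulnerable to morphing attacks, and current detection methods are intrinsically related to the FR task; a unified view is needed to understand this relationship and improve robustness. Method: The paper compares an FR system with two existing D-MAD approaches to analyze how similar the tasks are, identifies currently used decision thresholds as the root cause of FR vulnerability to morphing attacks, and proposes reusing deployed FR systems for morphing detection with a new evaluation threshold that guarantees an upper limit on vulnerability. Result: The analysis confirms that FR and D-MAD are closely related, explains the inherent tradeoff between performance on normal images and resistance to morphing, and shows that the new threshold bounds vulnerability to unknown morphing attacks without prior knowledge of the attack. Conclusion: FR systems themselves can be used effectively for morphing attack detection, provided the evaluation threshold is set appropriately; this points toward lightweight, compatible, and robust defenses. Abstract: Morphing is a challenge to face recognition (FR) for which several morphing attack detection solutions have been proposed. We argue that face recognition and differential morphing attack detection (D-MAD) in principle perform very similar tasks, which we support by comparing an FR system with two existing D-MAD approaches. We also show that currently used decision thresholds inherently lead to FR systems being vulnerable to morphing attacks and that this explains the tradeoff between performance on normal images and vulnerability to morphing attacks. We propose using FR systems that are already in place for morphing detection and introduce a new evaluation threshold that guarantees an upper limit to the vulnerability to morphing attacks - even of unknown types.
[124] Efficient closed-form approaches for pose estimation using Sylvester forms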
Jana Vráblíková,Ezio Malis,Laurent Busé
Main category: cs.CV
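TL;DR: This paper proposes a new class of resultant-based solvers built on Sylvester forms that speed up closed-form solutions of non-linear least-squares pose estimation problems, reducing computation time while preserving accuracy.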
Details
Motivation: Non-linear least-squares pose estimation (rotation and translation) is time-consuming yet essential in real-time vision applications; existing resultant-matrix methods have made progress but still leave room to reduce computational complexity. Method: The paper proposes new resultant-based solvers built on Sylvester forms, modeling pose estimation as a system of polynomial equations solved in closed form; the approach applies to both 3D-3D and 3D-2D point correspondences. Result: The proposed solvers match state-of-the-art solvers in numerical accuracy and outperform them in computation time. Conclusion: Sylvester forms reduce the complexity of resultant-based solvers and provide an efficient, accurate option for real-time pose estimation. Abstract: Solving the non-linear least-squares problem for pose estimation (rotation and translation) is often a time-consuming yet fundamental problem in several real-time computer vision applications. With an adequate rotation parametrization, the optimization problem can be reduced to the solution of a system of polynomial equations and solved in closed form. Recent advances in efficient closed-form solvers utilizing resultant matrices have shown a promising research direction to decrease the computation time while preserving the estimation accuracy. In this paper, we propose a new class of resultant-based solvers that exploit Sylvester forms to further reduce the complexity of the resolution. We demonstrate that our proposed methods are numerically as accurate as the state-of-the-art solvers, and outperform them in terms of computational time. We show that this approach can be applied for pose estimation in two different types of problems: estimating a pose from 3D-to-3D point correspondences, and estimating a pose from 3D-to-2D point correspondences.
[125] ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation
Yanguang Sun,Hengmin Zhang,Jianjun Qian,Jian Yang,Lei Luo
Main category: cs.CV
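TL;DR: This paper proposes ASGNet, an adaptive spectrum guidance network that integrates spectral features with global attributes to improve the accuracy of polyp segmentation in colonoscopy images.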
Details
Motivation: Existing deep learning-based polyp segmentation methods are biased toward local regions in their spatial perception and struggle to capture complete polyp structures, which leads to sub-optimal segmentation. Method: ASGNet comprises a spectrum-guided non-local perception module, a multi-source semantic extractor, and a dense cross-layer interaction decoder, which together combine local and global information, high-level semantics, and multi-layer features. Result: ASGNet clearly outperforms 21 state-of-the-art methods on five widely used polyp segmentation benchmarks, as shown by both quantitative and qualitative experiments. Conclusion: By introducing spectral-domain modeling and global perception, ASGNet mitigates the spatial locality limitation of earlier methods and improves the completeness of polyp structures and the precision of their boundaries. Abstract: Early identification and removal of polyps can reduce the risk of developing colorectal cancer. However, the diverse morphologies, complex backgrounds and often concealed nature of polyps make polyp segmentation in colonoscopy images highly challenging. Despite the promising performance of existing deep learning-based polyp segmentation methods, their perceptual capabilities remain biased toward local regions, mainly because of the strong spatial correlations between neighboring pixels in the spatial domain. This limitation makes it difficult to capture the complete polyp structures, ultimately leading to sub-optimal segmentation results. In this paper, we propose a novel adaptive spectrum guidance network, called ASGNet, which addresses the limitations of spatial perception by integrating spectral features with global attributes. Specifically, we first design a spectrum-guided non-local perception module that jointly aggregates local and global information, thereby enhancing the discriminability of polyp structures and refining their boundaries. Moreover, we introduce a multi-source semantic extractor that integrates rich high-level semantic information to assist in the preliminary localization of polyps. Furthermore, we construct a dense cross-layer interaction decoder that effectively integrates diverse information from different layers and strengthens it to generate high-quality representations for accurate polyp segmentation. Extensive quantitative and qualitative results demonstrate the superiority of our ASGNet approach over 21 state-of-the-art methods across five widely-used polyp segmentation benchmarks.
The code will be publicly available at: https://github.com/CSYSI/ASGNet.
[126] OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism
Jordan Shipard,Arnold Wiliem,Kien Nguyen Thanh,Wei Xiang,Clinton Fookes
Main category: cs.CV
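TL;DR: This paper proposes OmniGCD, a modality-agnostic generalized category discovery (GCD) method that uses modality-specific encoders and a synthetically trained Transformer to improve zero-shot performance across four modalities and 16 datasets without dataset-specific fine-tuning.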
Details
Motivation: Existing GCD methods are limited to a single modality and need dataset-specific fine-tuning, whereas humans form abstract categories across modalities; the paper aims for a more general, brain-inspired, modality-agnostic GCD framework. Method: OmniGCD extracts features with modality-specific encoders, reduces their dimension to build a GCD latent space, and at test time transforms that space into a representation better suited to clustering using a Transformer trained on synthetic data; a zero-shot GCD evaluation setting is also introduced. Result: On zero-shot GCD across 16 datasets spanning four modalities, OmniGCD improves classification accuracy for known and novel classes over baselines, with average gains of +6.2 (vision), +17.9 (text), +1.5 (audio), and +12.7 (remote sensing) percentage points; the results support decoupling strong encoders from category discovery. Conclusion: OmniGCD shows that modality-agnostic, zero-shot GCD is feasible, underlines the value of decoupling representation learning from category discovery, and provides a benchmark and direction for scalable, human-inspired cross-modal GCD. Abstract: Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our OmniGCD leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a GCD latent space, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a zero-shot GCD setting where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. Trained once on synthetic data, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of +6.2, +17.9, +1.5 and +12.7 for vision, text, audio and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery.
All code is available at https://github.com/Jordan-HS/OmniGCD.
[127] AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
Peifeng Zhang,Zice Qiu,Donghua Yu,Shilei Cao,Juepeng Zheng,Yutong Lu,Haohuan Fu
Main category: cs.CV
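TL;DR: This paper proposes Asymmetric Information Masking (AIM) to address catastrophic forgetting caused by the structural asymmetry of vision-language models (VLMs) in continual visual question answering (VQA). Modality-sensitive targeted masks balance stability and plasticity, improving average performance and forgetting and better preserving generalization to novel skill-concept compositions.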
Details
Motivation: Modern vision-language models (VLMs) are inherently asymmetric, while existing continual learning methods are mostly designed for symmetric, unimodal architectures; as a result, VLMs are prone to catastrophic forgetting in continual learning, which particularly damages the visual projection layers and compositional reasoning. Method: Asymmetric Information Masking (AIM) applies targeted information masks to key VLM components (especially the visual projection layers) according to each modality's sensitivity to interference, balancing stability and plasticity during training. Result: Under continual VQA settings on VQA v2 and GQA, AIM reaches state-of-the-art Average Performance (AP) and Average Forgetting (AF) while better preserving generalization to novel skill-concept compositions. Conclusion: AIM mitigates structural forgetting in continual VQA with VLMs and supports the case for continual learning strategies tailored to asymmetric multimodal architectures. Abstract: In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.
[128] Integrating Object Detection, LiDAR-Enhanced Depth Estimation, and Segmentation Models for Railway Environments
Enrico Francesco Giannico,Federico Nesti,Gianluca D'Amico,Mauro Marinoni,Edoardo Carosio,Filippo Salotti,Salvatore Sabina,Giorgio Buttazzo
Main category: cs.CV
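TL;DR: This paper proposes a modular, flexible framework for obstacle detection and distance estimation in railway environments that integrates object detection, track segmentation, and monocular depth estimation with LiDAR point clouds, achieving a mean absolute error as low as 0.63 meters on the synthetic SynDRA dataset.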
Details
Motivation: Obstacle detection is critical for railway safety, but existing methods mostly address only detection or only track identification and lack a complete, modular system that can be evaluated quantitatively; real scenes also lack reliable ground truth data. Method: The framework integrates three neural networks (object detection, track segmentation, and monocular depth estimation) and fuses LiDAR point clouds to improve distance estimates; it is evaluated quantitatively on the synthetic SynDRA dataset. Result: On SynDRA, the mean absolute error (MAE) of obstacle distance estimation is as low as 0.63 meters, with improved spatial perception and ranging accuracy. Conclusion: The framework is modular, flexible, and accurate, offering a reliable and reproducible solution for railway perception systems and showing that synthetic data is useful when real annotations are unavailable. Abstract: Obstacle detection in railway environments is crucial for ensuring safety. However, very few studies address the problem using a complete, modular, and flexible system that can both detect objects in the scene and estimate their distance from the vehicle. Most works focus solely on detection, others attempt to identify the track, and only a few estimate obstacle distances. Additionally, evaluating these systems is challenging due to the lack of ground truth data. In this paper, we propose a modular and flexible framework that identifies the rail track, detects potential obstacles, and estimates their distance by integrating three neural networks for object detection, track segmentation, and monocular depth estimation with LiDAR point clouds. To enable a reliable and quantitative evaluation, the proposed framework is assessed using a synthetic dataset (SynDRA), which provides accurate ground truth annotations, allowing for direct performance comparison with existing methods. The proposed system achieves a mean absolute error (MAE) as low as 0.63 meters by integrating monocular depth maps with LiDAR, enabling not only accurate distance estimates but also spatial perception of the scene.
[129] One-shot Compositional 3D Head Avatars with Deformable Hair
Yuan Sun,Xuan Wang,WeiLi Zhang,Wenxuan Zhang,Yu Guo,Fei Wang
Main category: cs.CV
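TL;DR: This paper proposes a compositional method for building a complete 3D head avatar from a single frontal portrait. Hair and face are explicitly decoupled: the face deforms through rigging to a FLAME mesh, and the hair is simulated with a cage structure and Position-Based Dynamics (PBD), all on a 3D Gaussian Splatting (3DGS) representation that preserves fine texture and produces natural dynamics.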
Details
Motivation: Existing single-image 3D avatar methods often entangle hair and face geometry, which makes hair dynamics unrealistic during animation; generalized models also tend to lose high-frequency texture detail from the input image. Method: (1) Remove the hair from the input image to obtain a bald image; (2) lift both the original and the bald image to dense, detail-rich 3D Gaussian Splatting (3DGS) representations; (3) rig the bald 3DGS to a FLAME mesh via non-rigid registration to support natural facial animation; (4) extract a clean, isolated set of hair Gaussians using semantic label supervision and a boundary-aware reassignment strategy; (5) build a cage structure and use Position-Based Dynamics (PBD) to simulate physical hair deformation under head motion, gravity, and inertia. Result: The method produces high-quality dynamic animations across varied head motions, expressions, and gravity conditions, with more realistic hair behavior and well-preserved facial detail; qualitative results clearly surpass state-of-the-art one-shot methods. Conclusion: Explicit decoupling and physics-driven hair deformation, combined with the fine-grained texture preservation of 3DGS, address the hair distortion and texture blur that limit single-image 3D avatar generation and improve perceived realism. Abstract: We propose a compositional method for constructing a complete 3D head avatar from a single image. Prior one-shot holistic approaches frequently fail to produce realistic hair dynamics during animation, largely due to inadequate decoupling of hair from the facial region, resulting in entangled geometry and unnatural deformations. Our method explicitly decouples hair from the face, modeling these components using distinct deformation paradigms while integrating them into a unified rendering pipeline. Furthermore, by leveraging image-to-3D lifting techniques, we preserve fine-grained textures from the input image to the greatest extent possible, effectively mitigating the common issue of high-frequency information loss in generalized models. Specifically, given a frontal portrait image, we first perform hair removal to obtain a bald image. Both the original image and the bald image are then lifted to dense, detail-rich 3D Gaussian Splatting (3DGS) representations. For the bald 3DGS, we rig it to a FLAME mesh via non-rigid registration with a prior model, enabling natural deformation that follows the mesh triangles during animation. For the hair component, we employ semantic label supervision combined with a boundary-aware reassignment strategy to extract a clean and isolated set of hair Gaussians. To control hair deformation, we introduce a cage structure that supports Position-Based Dynamics (PBD) simulation, allowing realistic and physically plausible transformations of the hair Gaussian primitives under head motion, gravity, and inertial effects.
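For readers unfamiliar with Position-Based Dynamics, the core loop named in this abstract (predict positions, project constraints, derive velocities) can be sketched in a few lines of pure Python. This is a generic textbook-style PBD step on a 2D chain of unit-mass particles, not the authors' cage-based hair simulation; the function name `pbd_step`, the chain layout, and all parameter values are assumptions made for illustration.

```python
from typing import List, Tuple

Vec = Tuple[float, float]


def pbd_step(pos: List[Vec], vel: List[Vec], rest: float, pinned: int = 0,
             dt: float = 1.0 / 60.0, gravity: float = -9.81,
             iters: int = 10) -> Tuple[List[Vec], List[Vec]]:
    """One PBD step for a chain of particles joined by distance constraints.

    The particle at index `pinned` is held fixed (inverse mass 0), standing in
    for an attachment point that follows the head.
    """
    inv_mass = [0.0 if i == pinned else 1.0 for i in range(len(pos))]

    # 1) Predict positions from current velocities plus gravity.
    pred: List[Vec] = []
    for (x, y), (vx, vy), w in zip(pos, vel, inv_mass):
        if w == 0.0:
            pred.append((x, y))
        else:
            vy_new = vy + gravity * dt
            pred.append((x + vx * dt, y + vy_new * dt))

    # 2) Iteratively project each constraint |p_i - p_{i+1}| = rest.
    for _ in range(iters):
        for i in range(len(pred) - 1):
            (x0, y0), (x1, y1) = pred[i], pred[i + 1]
            dx, dy = x1 - x0, y1 - y0
            dist = (dx * dx + dy * dy) ** 0.5
            w_sum = inv_mass[i] + inv_mass[i + 1]
            if dist == 0.0 or w_sum == 0.0:
                continue
            # Share the length error between both ends, weighted by inverse mass.
            corr = (dist - rest) / (dist * w_sum)
            pred[i] = (x0 + inv_mass[i] * corr * dx,
                       y0 + inv_mass[i] * corr * dy)
            pred[i + 1] = (x1 - inv_mass[i + 1] * corr * dx,
                           y1 - inv_mass[i + 1] * corr * dy)

    # 3) Velocities follow from the corrected position change.
    new_vel = [((px - x) / dt, (py - y) / dt)
               for (px, py), (x, y) in zip(pred, pos)]
    return pred, new_vel


if __name__ == "__main__":
    # A horizontal three-particle strand pinned at its first particle.
    positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    velocities = [(0.0, 0.0)] * 3
    for _ in range(60):
        positions, velocities = pbd_step(positions, velocities, rest=1.0)
    print(positions)
```

Because PBD works directly on positions, moving the pinned particle between steps is enough to drive the rest of the chain, which is one reason the technique suits secondary motion such as hair following a head.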
Striking qualitative results, including dynamic animations under diverse head motions, gravity effects, and expressions, showcase substantially more realistic hair behavior alongside faithfully preserved facial details, outperforming state-of-the-art one-shot methods in perceptual realism.
[130] From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation
Yili Ren,Shiqi Wen,Li Hou,Dingwen Xiao,Weiming Zhang,Caleb Chen Cao,Lin Wang,Zilu Zheng,Qianxiao Su,Mingjun Zhao,Lei Chen
Main category: cs.CV
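TL;DR: This paper proposes Petro-SAM, a two-stage multi-task framework for joint grain-edge segmentation (GES) and lithology semantic segmentation (LSS) of petrographic images. A Merge Block fuses seven polarized views, and multi-scale feature fusion with color-entropy priors addresses the domain gap and ultra-fine boundaries.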
Details
Motivation: Existing methods handle grain-edge segmentation (GES) and lithology semantic segmentation (LSS) separately and perform poorly; expert-annotated data exists but is costly and slow to produce; and foundation models such as SAM do not transfer directly to petrographic images because of a severe domain gap caused by extinction-dependent color variation and ultra-fine grain boundaries. Method: Petro-SAM is a two-stage multi-task framework that (1) adds a Merge Block on top of SAM to fuse seven polarized images and handle extinction, and (2) adds multi-scale feature fusion and color-entropy priors to refine detection. Result: The framework achieves high-quality joint GES and LSS on petrographic images, with better boundary alignment and more robust semantic segmentation. Conclusion: Petro-SAM narrows the domain gap between foundation models and petrographic image analysis and offers a practical, extensible approach to joint segmentation of multi-angle polarized images. Abstract: Grain-edge segmentation (GES) and lithology semantic segmentation (LSS) are two pivotal tasks for quantifying rock fabric and composition. However, these two tasks are often treated separately, and the segmentation quality remains unsatisfactory even though expensive, time-consuming, expert-annotated datasets have been used. Recently, foundation models, especially the Segment Anything Model (SAM), have demonstrated impressive robustness for boundary alignment. However, directly adapting SAM to joint GES and LSS is nontrivial due to 1) the severe domain gap induced by extinction-dependent color variations and ultra-fine grain boundaries, and 2) the lack of modules for joint learning on multi-angle petrographic image stacks. In this paper, we propose Petro-SAM, a novel two-stage, multi-task framework that can achieve high-quality joint GES and LSS on petrographic images. Specifically, based on SAM, we introduce a Merge Block to integrate seven polarized views, effectively solving the extinction issue. Moreover, we introduce multi-scale feature fusion and color-entropy priors to refine the detection.
[131] NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results
Andrey Moskalenko,Alexey Bryncev,Ivan Kosmynin,Kira Shilovskaya,Mikhail Erofeev,Dmitry Vatolin,Radu Timofte,Kun Wang,Yupeng Hu,Zhiran Li,Hao Liu,Qianlong Xiang,Liqiang Nie,Konstantinos Chaldaiopoulos,Niki Efthymiou,Athanasia Zlatintsi,Panagiotis Filntisis,Katerina Pastra,Petros Maragos,Li Yang,Gen Zhan,Yiting Liao,Yabin Zhang,Yuxin Liu,Xu Wu,Yunheng Zheng,Linze Li,Kun He,Cong Wu,Xuefeng Zhu,Tianyang Xu,Xiaojun Wu,Wenzhuo Zhao,Keren Fu,Gongyang Li,Shixiang Shi,Jianlin Chen,Haibin Ling,Yaoxin Jiang,Guoyi Xu,Jiajia Liu,Yaokun Shi,Jiachen Tu
Main category: cs.CV
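TL;DR: This paper presents the NTIRE 2026 Challenge on Video Saliency Prediction, covering a newly built dataset of 2,000 openly licensed videos, saliency maps and fixation data collected through crowdsourced mouse tracking, and an evaluation of the methods submitted by more than 20 teams.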
Details
Motivation: To advance video saliency prediction by providing a large-scale, high-quality, open benchmark dataset and a fair evaluation platform.
Method: An international challenge was organized around a new dataset of 2,000 diverse videos; fixation data from more than 5,000 assessors were collected through crowdsourced mouse tracking to generate saliency maps, and submissions were evaluated on 800 test videos with generally accepted metrics.
Result: More than 20 teams submitted entries and 7 teams passed the final phase with code review; all data used in the challenge are publicly available.
Conclusion: The challenge advanced research on video saliency prediction and provides a large-scale open video saliency dataset together with a standardized evaluation protocol.
Abstract: This paper presents an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. The goal for challenge participants was to develop automatic saliency map prediction methods for the provided video sequences. A novel dataset of 2,000 diverse videos with an open license was prepared for this challenge. The fixations and corresponding saliency maps were collected using crowdsourced mouse tracking and contain viewing data from over 5,000 assessors. Evaluation was performed on a subset of 800 test videos using generally accepted quality metrics. The challenge attracted over 20 teams making submissions, and 7 teams passed the final phase with code review. All data used in this challenge are publicly available at https://github.com/msu-video-group/NTIRE26_Saliency_Prediction.
[132] Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimer's Disease Neurodegeneration
Geonwoo Baek,David H. Salat,Ikbeom Jang
Main category: cs.CV
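TL;DR: This paper proposes the MSSM+ framework, which combines surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT) to extract multiscale structural features (cortical thickness, gray-white matter contrast, sulcal depth, and cortical curvature) from a single T1-weighted MRI scan. It improves classification of Alzheimer's disease (AD) versus cognitively normal (CN) controls over existing methods and is a promising non-invasive imaging marker for early AD.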
Details
Motivation: Confirming Alzheimer's disease relies on costly and invasive PET or CSF tests, so better non-invasive MRI biomarkers are needed; the existing MSSM method leaves room for improvement.
Method: Extends MSSM to MSSM+ by adding vertex-level sulcal depth and cortical curvature features; designs surface supervertex mapping (SSVM), which partitions the cortical surface into supervertices that represent inter- and intra-regional spatial relationships; builds a Supervertex Vision Transformer (SV-ViT) for anatomically informed learning on surface meshes.
Result: MSSM+ detects more spatially extensive and more significant AD vs. CN group differences than MSSM; in AD vs. CN classification, the area under the precision-recall curve (AUPRC) rises by 3 percentage points; vendor-specific analyses show lower signal variability and more stable classification performance than cortical thickness alone, gray-white matter contrast, and the original MSSM.
Conclusion: MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection and could serve as a screening tool before PET/CSF confirmation.
Abstract: Alzheimer's disease (AD) confirmation often relies on positron emission tomography (PET) or cerebrospinal fluid (CSF) analysis, which are costly and invasive. Consequently, structural MRI biomarkers such as cortical thickness (CT) are widely used for non-invasive AD screening. Multiscale structural mapping (MSSM) was recently proposed to integrate gray-white matter contrasts (GWCs) with CT from a single T1-weighted MRI (T1w) scan. Building on this framework, we propose MSSM+, together with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT). 3D T1w images from individuals with AD and cognitively normal (CN) controls were analyzed. MSSM+ extends MSSM by incorporating sulcal depth and cortical curvature at the vertex level. SSVM partitions the cortical surface into supervertices (surface patches) that effectively represent inter- and intra-regional spatial relationships. SV-ViT is a Vision Transformer architecture operating on these supervertices, enabling anatomically informed learning from surface mesh representations. Compared with MSSM, MSSM+ identified more spatially extensive and statistically significant group differences between AD and CN. In AD vs. CN classification, MSSM+ achieved an area under the precision-recall curve 3 percentage points higher than MSSM. Vendor-specific analyses further demonstrated reduced signal variability and consistently improved classification performance across MR manufacturers relative to CT, GWCs, and MSSM. These findings suggest that MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection prior to CSF/PET confirmation.
[133] Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems
Haileab Yagersew
Main category: cs.CV
Details
Abstract: Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline: cheap object detection and pose estimation run continuously, and an expensive vision-language model (VLM) is invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to at most 10 per minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any OpenAI-compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5-Omni, GPT-4o, or future releases without code changes, so the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall zero-shot. We attribute the recall gap to sparse frame sampling in offline evaluation rather than to VLM reasoning failures, and note that precision and specificity are the operationally critical metrics because they determine false alarm rates. We present a detailed cost model showing viability at $50-100/month per store (3-10x cheaper than commercial alternatives), and introduce a privacy-preserving design that obfuscates faces in the detection pipeline. The source code is available at https://github.com/xHaileab/Paza-AI.
[134] Efficient Search of Implantable Adaptive Cells for Medical Image Segmentation
Emil Benedykciuk,Marcin Denkowski,Grzegorz M. Wójcik
Main category: cs.CV
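TL;DR: This paper proposes IAC-LTH, which uses a Jensen-Shannon-divergence stability criterion to prune low-importance operations early in the differentiable IAC search, greatly accelerating neural architecture search (NAS) for adaptive skip modules. Across several medical image segmentation benchmarks it gives a 3.7x to 16x speedup with matching or slightly better performance.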
Details
Motivation: Although the existing IAC framework shrinks the search space, it still requires a 200-epoch differentiable search, and this computational cost limits practical use in resource-constrained medical settings.
Method: Analyzes how the importance of operations and edges evolves during IAC search and finds that the best operations tend to emerge early and stabilize quickly; on this basis, designs a Jensen-Shannon-divergence-based stability criterion that progressively prunes low-importance operations, giving an efficient early-stopping search (IAC-LTH).
Result: On four public datasets (ACDC, BraTS, KiTS, AMOS) and several 2-D U-Net and nnU-Net backbones, models found by IAC-LTH match or slightly exceed the segmentation performance of the full-length search while cutting NAS time by 3.7x to 16x; the results hold across augmentation settings and baseline models.
Conclusion: IAC architectures can be identified early in the search from stabilized operations, without running the full search, which makes adaptive skip-module design more feasible under realistic medical computing constraints.
Abstract: Purpose: Adaptive skip modules can improve medical image segmentation, but searching for them is computationally costly. Implantable Adaptive Cells (IACs) are compact NAS modules inserted into U-Net skip connections, reducing the search space compared with full-network NAS. However, the original IAC framework still requires a 200-epoch differentiable search for each backbone and dataset.
Methods: We analyzed the temporal behavior of operations and edges within IAC cells during differentiable search on public medical image segmentation benchmarks. We found that operations selected in the final discrete cell typically emerge among the strongest candidates early in training, and their architecture parameters stabilize well before the final epoch. Based on this, we propose a Jensen-Shannon-divergence-based stability criterion that tracks per-edge operation-importance distributions and progressively prunes low-importance operations during search. The accelerated framework is called IAC-LTH.
Results: Across four public benchmarks (ACDC, BraTS, KiTS, AMOS), several 2-D U-Net backbones, and a 2-D nnU-Net pipeline, IAC-LTH discovers IAC cells whose patient-level segmentation performance matches and sometimes slightly exceeds that of cells found by the original full-length search, while reducing wall-clock NAS cost by 3.7x to 16x across datasets and backbones. These results are consistent across architectures, benchmarks, and both non-augmented and augmented training settings, while preserving the gains of IAC-equipped U-Nets over strong attention-based and dense-skip baselines.
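The stability criterion described for IAC-LTH can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `stability_eps` threshold, and the `keep_ratio` pruning rule are all assumptions chosen for illustration. It only shows the general idea of comparing per-edge operation-importance distributions between two checkpoints with the Jensen-Shannon divergence and pruning once the distribution has stopped moving.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def softmax(logits):
    """Turn architecture parameters into an importance distribution."""
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prune_if_stable(prev_logits, curr_logits, stability_eps=1e-3, keep_ratio=0.5):
    """Return the indices of operations kept on one edge.

    Hypothetical rule: if the importance distribution barely moved since the
    previous check (JS divergence below stability_eps), keep only the top
    keep_ratio fraction of operations; otherwise keep all of them.
    """
    p, q = softmax(prev_logits), softmax(curr_logits)
    n = len(curr_logits)
    if js_divergence(p, q) >= stability_eps:
        return list(range(n))
    n_keep = max(1, int(n * keep_ratio))
    ranked = sorted(range(n), key=lambda i: q[i], reverse=True)
    return sorted(ranked[:n_keep])
```

In a real search loop a check like this would run per edge every few epochs, so that edges whose distributions settle early are thinned out first.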
Conclusion: Competitive IAC architectures can be identified from early-stabilizing operations without running the full search, making adaptive skip-module design more practical for medical image segmentation under realistic computational constraints.
[135] MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry
Meng-Xun Li,Wen-Hui Deng,Zhi-Xing Wu,Chun-Xiao Jin,Jia-Min Wu,Yue Han,James Kit Hon Tsoi,Gui-Song Xia,Cui Huang
Main category: cs.CV
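TL;DR: This paper presents MetaDent, a comprehensive vision-language model (VLM) resource for intraoral photography, comprising a large-scale dental image dataset, a semi-structured annotation framework, and a multi-task benchmark suite, with VQA and classification data derived using LLMs. Experiments show that current state-of-the-art VLMs still fall clearly short on fine-grained understanding of intraoral images.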
Details
Motivation: Vision-language models (VLMs) have great potential in medical image analysis, but their use in intraoral photography is limited, mainly by the lack of finely annotated datasets and a unified evaluation benchmark.
Method: Builds the MetaDent resource: (1) 60,669 dental images from clinical, public, and web sources; (2) a semi-structured annotation framework that captures hierarchy and clinical detail, combining a high-level image summary with free-text descriptions of each abnormality; (3) about 15K VQA samples and an 18-class multi-label classification dataset generated with large language models (LLMs) and checked for fidelity and semantic accuracy through human review and error analysis.
Result: Evaluating mainstream VLMs on VQA, classification, and image captioning shows that existing models have limited fine-grained understanding of intraoral scenes: accuracy is moderate, and captions are inconsistent or incomplete.
Conclusion: MetaDent fills the gap in data and benchmarks for VLM research on intraoral photography; the evidence indicates that current VLMs are not yet capable of clinical-grade fine-grained understanding; the authors release the dataset, annotations, and tools to support reproducible research.
Abstract: Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to confirm that the LLM-driven conversion reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.
[136] Open-Set Vein Biometric Recognition with Deep Metric Learning
Paweł Pilarek,Marcel Musiałek,Anna Górska
Main category: cs.CV
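TL;DR: This paper proposes a vein recognition method for open-set scenarios that learns L2-normalized embeddings through deep metric learning and combines prototype matching with a calibrated similarity threshold, achieving high-accuracy identification and robust rejection of unknown users on several vein datasets.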
Details
Motivation: Most existing vein recognition methods use closed-set classification, which scales poorly and cannot adaptively enroll new users, and the limits of such systems under open-set constraints have not been rigorously evaluated.
Method: Learns discriminative L2-normalized embeddings with deep metric learning (DML) and performs open-set recognition through prototype matching with a calibrated similarity threshold; evaluates under a strict subject-disjoint protocol on four vein datasets covering finger, wrist, and dorsal hand; uses models such as ResNet50-CBAM with triplet loss and a 1-NN classifier.
Result: On MMCBNU 6000 the method reaches an OSCR of 0.9945, AUROC of 0.9974, EER of 1.57%, and Rank-1 accuracy of 99.6%; cross-dataset experiments show robustness on large-scale data but degraded performance under domain shift in low-data regimes; ablations show that triplet loss with 1-NN gives the best balance of accuracy and efficiency, supporting real-time deployment on commodity hardware.
Conclusion: The method addresses scalability and robust rejection in open-set vein recognition and offers a practical, efficient solution for real biometric systems.
Abstract: Most state-of-the-art vein recognition methods rely on closed-set classification, which inherently limits their scalability and prevents the adaptive enrollment of new users without complete model retraining. We rigorously evaluate the computational boundaries of Deep Metric Learning (DML) under strict open-set constraints. Unlike standard closed-set approaches, we analyze the impact of data scarcity and domain shift on recognition performance. Our approach learns discriminative L2-normalised embeddings and employs prototype-based matching with a calibrated similarity threshold to effectively distinguish between enrolled users and unseen impostors. We evaluate the framework under a strict subject-disjoint protocol across four diverse datasets covering finger, wrist, and dorsal hand veins (MMCBNU 6000, UTFVP, FYO, and a dorsal hand-vein dataset). On the large-scale MMCBNU 6000 benchmark, our best model (ResNet50-CBAM) achieves an OSCR of 0.9945, AUROC of 0.9974, and EER of 1.57%, maintaining high identification accuracy (99.6% Rank-1) while robustly rejecting unknown subjects. Cross-dataset experiments evaluate the framework's generalisation across different acquisition setups, confirming that while the model handles large-scale data robustly, performance remains sensitive to domain shifts in low-data regimes.
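As a rough illustration of prototype-based matching with a similarity threshold, the sketch below enrolls each subject as the mean of their L2-normalised embeddings and rejects a query whose best cosine similarity falls below a threshold. It is a minimal sketch under stated assumptions, not the paper's pipeline: the function names, the mean-prototype choice, and the `threshold=0.8` default are hypothetical, and in practice the embeddings would come from a trained network and the threshold from calibration on held-out data.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_prototypes(enrolled):
    """enrolled: dict mapping subject id -> list of embedding vectors."""
    protos = {}
    for sid, embs in enrolled.items():
        normed = [l2_normalize(e) for e in embs]
        mean = [sum(col) / len(normed) for col in zip(*normed)]
        protos[sid] = l2_normalize(mean)
    return protos


def identify(query, protos, threshold=0.8):
    """Return (subject id or None, best cosine similarity).

    None means the query is rejected as an unknown (unenrolled) subject.
    """
    q = l2_normalize(query)
    best_id, best_sim = None, -1.0
    for sid, proto in protos.items():
        sim = sum(a * b for a, b in zip(q, proto))  # cosine, since both are unit norm
        if sim > best_sim:
            best_id, best_sim = sid, sim
    if best_sim < threshold:
        return None, best_sim
    return best_id, best_sim
```

Enrolling a new user in this scheme only means adding one entry to the prototype dictionary; no model retraining is involved, which is the scalability argument the abstract makes.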
Ablation studies demonstrate that triplet-based objectives combined with a simple 1-NN classifier offer an optimal trade-off between accuracy and efficiency, enabling real-time deployment on commodity hardware.
[137] FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection
Jianchao Huang,Fengming Zhang,Haibo Zhu,Tao Yan
Main category: cs.CV
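TL;DR: This paper proposes FSDETR, a frequency-spatial feature enhancement framework based on RT-DETR that improves small object detection through spatial hierarchical attention, deformable-attention-based intra-scale interaction, and a frequency-spatial feature pyramid network.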
Details
Motivation: Small object detection is hampered by feature degradation from downsampling, mutual occlusion in dense clusters, and interference from complex backgrounds.
Method: Proposes the FSDETR framework, comprising: 1) a Spatial Hierarchical Attention Block (SHAB) that captures local details and global dependencies; 2) Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI), which focuses dynamically on key regions to mitigate occlusion; 3) a Frequency-Spatial Feature Pyramid Network (FSFPN), which combines frequency filtering and spatial edge extraction through the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained details.
Result: With only 14.7M parameters, FSDETR reaches 13.9% AP_S on VisDrone 2019 and 48.95% AP_50^tiny on TinyPerson.
Conclusion: By fusing frequency-domain and spatial-domain information and adding multi-level attention and feature interaction, FSDETR improves the accuracy and robustness of small object detection.
Abstract: Small object detection remains a significant challenge due to feature degradation from downsampling, mutual occlusion in dense clusters, and complex background interference. To address these issues, this paper proposes FSDETR, a frequency-spatial feature enhancement framework built upon the RT-DETR baseline. By establishing a collaborative modeling mechanism, the method effectively leverages complementary structural information. Specifically, a Spatial Hierarchical Attention Block (SHAB) captures both local details and global dependencies to strengthen semantic representation. Furthermore, to mitigate occlusion in dense scenes, the Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) focuses on informative regions via dynamic sampling. Finally, the Frequency-Spatial Feature Pyramid Network (FSFPN) integrates frequency filtering with spatial edge extraction via the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained details. Experimental results show that with only 14.7M parameters, FSDETR achieves 13.9% AP_S on VisDrone 2019 and 48.95% AP_50^tiny on TinyPerson, showing strong performance on small-object benchmarks. The code and models are available at https://github.com/YT3DVision/FSDETR.
[138] Reward-Aware Trajectory Shaping for Few-step Visual Generation
Rui Li,Bingyu Li,Yuanzhi Liang,HuangHai Bin,Chi Zhang,XueLong Li
Main category: cs.CV
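TL;DR: This paper proposes the Reward-Aware Trajectory Shaping (RATS) framework, which uses reward-aware gating and trajectory alignment to achieve high-quality generation in very few sampling steps, allowing the student model to surpass the teacher rather than merely imitate it.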
Details
Motivation: Existing distillation methods compress multi-step denoising into few-step generation but are capped by the teacher's performance; this work aims to lift that cap so that the student can optimize toward reward preferences and even surpass the teacher.
Method: Proposes the RATS framework: 1) horizon matching aligns teacher and student latent trajectories at key denoising stages; 2) a reward-aware gate adjusts the strength of teacher guidance according to the relative rewards of teacher and student; 3) trajectory distillation, reward-aware gating, and preference alignment are combined.
Result: RATS improves the efficiency-quality trade-off of few-step generation and substantially narrows the gap between few-step students and multi-step teachers on visual generation tasks.
Conclusion: Preference-alignment awareness frees few-step generators from the teacher's performance ceiling and allows continued reward-driven improvement; RATS offers a new approach to efficient, high-fidelity generation.
Abstract: Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing preference alignment awareness enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose Reward-Aware Trajectory Shaping (RATS), a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead.
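The gating behaviour described above (stronger teacher guidance when the teacher earns the higher reward, relaxed guidance once the student matches or surpasses it) can be sketched with a toy scalar gate. The abstract does not give the gate's functional form, so the clipped `tanh` used here, the `temperature` parameter, and the way the gate multiplies a trajectory loss are all assumptions made purely for illustration.

```python
import math


def reward_aware_gate(teacher_reward, student_reward, temperature=1.0):
    """Hypothetical gate in [0, 1): grows with the teacher's reward advantage
    and is exactly 0 when the student matches or surpasses the teacher."""
    gap = teacher_reward - student_reward
    return max(0.0, math.tanh(gap / temperature))


def shaped_loss(trajectory_loss, preference_loss, teacher_reward, student_reward,
                temperature=1.0):
    """Combine a preference (reward) objective with gated trajectory distillation."""
    gate = reward_aware_gate(teacher_reward, student_reward, temperature)
    return preference_loss + gate * trajectory_loss
```

With a gate of this shape, the distillation term vanishes as soon as the student's reward reaches the teacher's, so only the preference objective keeps driving the student; that is one way to read "continued reward-driven improvement".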
Experimental results demonstrate that RATS substantially improves the efficiency-quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.
[139] Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
Victoria Yue Chen,Emery Pierson,Léopold Maillard,Maks Ovsjanikov
Main category: cs.CV
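TL;DR: This paper shows that state-of-the-art text-to-3D generative models suffer from latent "sink traps" in text-driven inversion: the model becomes insensitive to the text prompt, so text-driven 3D editing fails. By analyzing generation trajectories, the authors propose a framework that decouples geometric expressivity from linguistic sensitivity, enabling high-fidelity semantic editing of out-of-distribution 3D shapes.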
Details
Motivation: Text-driven inversion of generative models assumes that the model stays sensitive to natural language prompts, but the authors find that this assumption often fails for current state-of-the-art text-to-3D models, which limits text-guided editing in practice.
Method: Analyzes the sampling trajectories of the generative model to identify and verify the latent "sink trap" phenomenon; proposes a framework that uses the model's unconditional generative prior to bypass these traps, decoupling geometric representation power from text sensitivity.
Result: Shows that the failure of text guidance does not stem from limited geometric expressivity, which remains strong, but from a loss of linguistic sensitivity; the proposed method achieves robust, high-fidelity text-based editing of out-of-distribution 3D shapes.
Conclusion: The bottleneck of text-driven inversion in text-to-3D models is degraded linguistic sensitivity rather than limited geometric capacity; decoupling the two improves the robustness and scope of semantic editing.
Abstract: Text-driven inversion of generative models is a core paradigm for manipulating 2D or 3D content, unlocking numerous applications such as text-based editing, style transfer, or inverse problems. However, it relies on the assumption that generative models remain sensitive to natural language prompts. We demonstrate that for state-of-the-art native text-to-3D generative models, this assumption often collapses. We identify a critical failure mode where generation trajectories are drawn into latent "sink traps": regions where the model becomes insensitive to prompt modifications. In these regimes, changes to the input text fail to alter internal representations in a way that alters the output geometry. Crucially, we observe that this is not a limitation of the model's geometric expressivity; the same generative models possess the ability to produce a vast diversity of shapes but, as we demonstrate, become insensitive to out-of-distribution text guidance. We investigate this behavior by analyzing the sampling trajectories of the generative model, and find that complex geometries can still be represented and produced by leveraging the model's unconditional generative prior. This leads to a more robust framework for text-based 3D shape editing that bypasses latent sinks by decoupling a model's geometric representation power from its linguistic sensitivity. Our approach addresses the limitations of current 3D pipelines and enables high-fidelity semantic manipulation of out-of-distribution 3D shapes. Project webpage: https://daidedou.sorpi.fr/publication/beyondprompts
[140] Hybrid Latents -- Geometry-Appearance-Aware Surfel Splatting
Neel Kelkar,Simon Niedermayr,Klaus Engel,Rüdiger Westermann
Main category: cs.CV
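TL;DR: This paper proposes a hybrid Gaussian-hash-grid radiance representation for reconstructing 2D Gaussian scene models from multi-view images. Through explicit frequency decomposition, hard opacity falloffs, and probabilistic pruning, it improves geometric reconstruction accuracy and rendering efficiency while greatly reducing the number of Gaussians.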
Details
Motivation: To reduce the entanglement of geometry and appearance in NeRF-style models, curb the tendency of high-frequency texture to compensate for geometric errors, and improve reconstruction fidelity and rendering efficiency.
Method: Introduces a hybrid Gaussian-hash-grid radiance representation that combines per-Gaussian latent features with hash-grid features to separate low-frequency and high-frequency scene components explicitly; uses hard opacity falloffs to strengthen the separation of geometry and appearance; applies probabilistic pruning with a sparsity-inducing BCE opacity loss to remove redundant Gaussians.
Result: On synthetic and real-world datasets the method outperforms existing Gaussian-based novel-view synthesis methods in reconstruction fidelity while using an order of magnitude fewer Gaussian primitives.
Conclusion: Through frequency decomposition, geometry-appearance decoupling, and adaptive sparsification, the method yields a more compact, accurate, and efficient Gaussian scene representation.
Abstract: We introduce a hybrid Gaussian-hash-grid radiance representation for reconstructing 2D Gaussian scene models from multi-view images. Similar to NeST splatting, our approach reduces the entanglement between geometry and appearance common in NeRF-based models, but adds per-Gaussian latent features alongside hash-grid features to bias the optimizer toward a separation of low- and high-frequency scene components. This explicit frequency-based decomposition reduces the tendency of high-frequency texture to compensate for geometric errors. Encouraging Gaussians with hard opacity falloffs further strengthens the separation between geometry and appearance, improving both geometry reconstruction and rendering efficiency. Finally, probabilistic pruning combined with a sparsity-inducing BCE opacity loss allows redundant Gaussians to be turned off, yielding a minimal set of Gaussians sufficient to represent the scene. Using both synthetic and real-world datasets, we compare against the state of the art in Gaussian-based novel-view synthesis and demonstrate superior reconstruction fidelity with an order of magnitude fewer primitives.
[141] Generative Data Augmentation for Skeleton Action Recognition
Xu Dong,Wanqing Li,Anthony Adeyemi-Ejeye,Andrew Gilbert
Main category: cs.CV
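TL;DR: This paper proposes a conditional-generation data augmentation method for skeleton action recognition that uses a Transformer encoder-decoder architecture and a generative refinement module, improving recognition accuracy in both few-shot and full-data settings.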
Details
Motivation: Collecting large-scale, diverse, well-annotated 3D skeleton datasets is costly and slow, so effective data augmentation methods are needed.
Method: Proposes a conditional generative pipeline built on a Transformer encoder-decoder, combined with a generative refinement module and a dropout mechanism, to balance fidelity and diversity during sampling.
Result: On the HumanAct12 and NTU-VIBE datasets, the method improves the accuracy of several skeleton-based action recognition models, most notably in low-data scenarios.
Conclusion: The method generates high-quality, diverse skeleton sequences from limited real samples and generalizes well to downstream tasks, offering an effective approach to few-shot skeleton action recognition.
Abstract: Skeleton-based human action recognition is a powerful approach for understanding human behaviour from pose data, but collecting large-scale, diverse, and well-annotated 3D skeleton datasets is both expensive and labor-intensive. To address this challenge, we propose a conditional generative pipeline for data augmentation in skeleton action recognition. Our method learns the distribution of real skeleton sequences under the constraint of action labels, enabling the synthesis of diverse and high-fidelity data. Even with limited training samples, it can effectively generate skeleton sequences and achieve competitive recognition performance in low-data scenarios, demonstrating strong generalisation in downstream tasks. Specifically, we introduce a Transformer-based encoder-decoder architecture, combined with a generative refinement module and a dropout mechanism, to balance fidelity and diversity during sampling. Experiments on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset show that our approach consistently improves the accuracy of multiple skeleton-based action recognition models, validating its effectiveness in both few-shot and full-data settings. The source code is publicly available.
[142] RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
Gabriele Mattioli,Evelyn Turri,Sara Sarto,Lorenzo Baraldi,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Main category: cs.CV
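TL;DR: This paper proposes the RaTA-Tool framework, which converts multimodal user queries into structured task descriptions and retrieves a suitable tool by semantic matching, enabling open-world multimodal tool selection with zero-shot extension to new tools and preference optimization.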
Details
Motivation: Existing tool-use methods are limited to text-only inputs and closed-world settings, so they struggle to interpret multimodal instructions and cannot generalize to tools unseen during training.
Method: Proposes the RaTA-Tool framework: 1) an MLLM converts the multimodal query into a structured task description; 2) the most suitable tool is retrieved by semantic matching against a library of machine-readable tool descriptions; 3) DPO-based preference optimization improves the alignment between task descriptions and tools; 4) the first dataset for open-world multimodal tool use is built from Hugging Face model cards.
Result: Improves tool-selection performance in open-world, multimodal scenarios and supports adding new tools without retraining.
Conclusion: RaTA-Tool offers an extensible, generalizable approach to tool learning for multimodal foundation models.
Abstract: Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources (such as APIs, computational utilities, and specialized models) to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards.
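The retrieval step can be pictured with a deliberately simple stand-in. The sketch below matches a task description against tool descriptions using bag-of-words cosine similarity; the real system presumably uses learned semantic embeddings, so the `embed` function, the tool names, and the descriptions here are all hypothetical. The point it illustrates is only that a new tool becomes selectable by adding its description to the library, with no retraining.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic encoder: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_tool(task_description, tool_descriptions):
    """Return the name of the tool whose description best matches the task."""
    query = embed(task_description)
    return max(tool_descriptions,
               key=lambda name: cosine(query, embed(tool_descriptions[name])))


# Hypothetical tool library keyed by tool name.
tools = {
    "depth-estimator": "estimate depth map from a single image",
    "speech-to-text": "transcribe speech audio into text",
}
```

Calling `select_tool("estimate the depth of this image", tools)` picks the depth tool, and assigning a new entry such as `tools["ocr"] = "read printed text in a scanned document image"` makes that tool retrievable immediately.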
Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
[143] Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation
Hassan Ali,Doreen Jirak,Luca Müller,Stefan Wermter
Main category: cs.CV
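TL;DR: This paper proposes a prompt-based video generation method that uses image-to-video foundation models to synthesize realistic, semantically rich deictic gesture data, easing the long-standing data scarcity in gesture recognition. Experiments show that the synthetic data complement real data in visual fidelity, diversity, and downstream task performance.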
Details
Motivation: Gesture recognition faces severe data scarcity; traditional approaches based on human recordings or image processing are costly and cannot produce authentic gesture variability, whereas emerging image-to-video generation models open the possibility of zero-shot, low-cost synthesis of high-quality gesture videos.
Method: Designs a prompt-driven video generation pipeline that starts from a small number of reference samples from human participants, uses natural language prompts to generate semantically consistent and realistic deictic gesture videos, and builds a synthetic dataset; trains several deep models on mixed (real plus synthetic) data for downstream evaluation.
Result: The synthetic gestures are close to real data in visual quality while adding meaningful variability and novelty; models trained on mixed data outperform models trained on real data alone on downstream tasks.
Conclusion: Even at an early stage, image-to-video generation shows strong potential as a zero-shot gesture synthesis tool that augments real data, lowering data collection costs while improving model generalization and performance.
Abstract: Gesture recognition research, unlike NLP, continues to face acute data scarcity, with progress constrained by the need for costly human recordings or image processing approaches that cannot generate authentic variability in the gestures themselves. Recent advancements in image-to-video foundation models have enabled the generation of photorealistic, semantically rich videos guided by natural language. These capabilities open up new possibilities for creating low-effort synthetic data, raising the critical question of whether video Generative AI models can augment and complement traditional human-generated gesture data. In this paper, we introduce and analyze prompt-based video generation to construct a realistic deictic gesture dataset and rigorously evaluate its effectiveness for downstream tasks. We propose a data generation pipeline that produces deictic gestures from a small number of reference samples collected from human participants, providing an accessible approach that can be leveraged both within and beyond the machine learning community. Our results demonstrate that the synthetic gestures not only align closely with real ones in terms of visual fidelity but also introduce meaningful variability and novelty that enrich the original data, further supported by superior performance of various deep models using a mixed dataset. These findings highlight that image-to-video techniques, even in their early stages, offer a powerful zero-shot approach to gesture synthesis with clear benefits for downstream tasks.
[144] Frequency-Enhanced Dual-Subspace Networks for Few-Shot Fine-Grained Image Classification
Meijia Wang,Guochao Wang,Haozhen Chu,Bin Yao,Weichuan Zhang,Yuan Wang,Junpo Yang
Main category: cs.CV
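TL;DR: This paper proposes FEDSNet, which uses frequency-domain enhancement and dual-subspace modeling to mitigate the texture bias and structural instability of spatial features in few-shot fine-grained classification.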
Details
Motivation: Existing metric-learning methods rely only on spatial-domain features, which are prone to texture bias and high-frequency background noise, and they lack cross-view geometric constraints, leading to structural instability in few-shot settings.
Method: Proposes the Frequency-Enhanced Dual-Subspace Network (FEDSNet): extracts low-frequency structural features with the DCT and low-pass filtering; builds two independent low-rank subspaces, for spatial texture and frequency structure, with truncated SVD; designs an adaptive gating mechanism that fuses the projection distances from the two views.
Result: Achieves competitive classification performance and robustness on four benchmarks (CUB, Stanford Cars, Stanford Dogs, FGVC-Aircraft) while remaining computationally efficient.
Conclusion: FEDSNet offers an approach to few-shot fine-grained visual recognition that balances structural stability and feature discriminability.
Abstract: Few-shot fine-grained image classification aims to recognize subcategories with high visual similarity using only a limited number of annotated samples. Existing metric learning-based methods typically rely solely on spatial domain features. Confined to this single perspective, models inevitably suffer from inherent texture biases, entangling essential structural details with high-frequency background noise. Furthermore, lacking cross-view geometric constraints, single-view metrics tend to overfit this noise, resulting in structural instability under few-shot conditions. To address these issues, this paper proposes the Frequency-Enhanced Dual-Subspace Network (FEDSNet). Specifically, FEDSNet utilizes the Discrete Cosine Transform (DCT) and a low-pass filtering mechanism to explicitly isolate low-frequency global structural components from spatial features, thereby suppressing background interference. Truncated Singular Value Decomposition (SVD) is employed to construct independent, low-rank linear subspaces for both spatial texture and frequency structural features. An adaptive gating mechanism is designed to dynamically fuse the projection distances from these dual views. This strategy leverages the structural stability of the frequency subspace to prevent the spatial subspace from overfitting to background features. Extensive experiments on four benchmark datasets (CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft) demonstrate that FEDSNet exhibits excellent classification performance and robustness, achieving highly competitive results compared to existing metric learning algorithms.
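The DCT low-pass step can be illustrated on a one-dimensional signal. The sketch below applies an orthonormal DCT-II, zeroes all coefficients above a cutoff, and inverts the transform. It is a minimal sketch under simplifying assumptions: FEDSNet operates on spatial feature maps rather than 1-D lists, the `keep` cutoff is a hypothetical parameter, and the truncated-SVD subspaces and adaptive gating are not shown.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct2(c):
    """Inverse of dct2 (an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out


def low_pass(x, keep):
    """Keep the `keep` lowest-frequency DCT coefficients and zero the rest."""
    coeffs = dct2(x)
    filtered = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    return idct2(filtered)
```

Keeping only the first coefficient returns the signal's mean at every position, which is the extreme case of retaining global structure and discarding fine texture; larger `keep` values retain progressively more detail.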
Complexity analysis further confirms that the proposed network achieves a favorable balance between high accuracy and computational efficiency, providing an effective new paradigm for few-shot fine-grained visual recognition.
[145] UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
Jun Wang,Shuo Tan,Zelong Sun,Tiancheng Gu,Yongle Zhao,Ziyong Feng,Kaicheng Yang,Cewu Lu
Main category: cs.CV
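TL;DR: This paper proposes UniDoc-RL, a unified reinforcement learning framework that improves the fine-grained visual-semantic understanding and reasoning of LVLMs in visual retrieval-augmented generation (RAG). A hierarchical action space jointly optimizes document retrieval, reranking, active visual perception, and reasoning, and the method outperforms existing approaches on several benchmarks.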
Details
Motivation: Existing visual RAG systems rely on generic retrieval signals and overlook the fine-grained visual semantics needed for complex reasoning, so retrieved information is used inefficiently.
Method: Proposes UniDoc-RL, a reinforcement learning framework with a hierarchical action space that models visual information acquisition as a sequential decision problem; uses a dense multi-reward scheme for task-aware supervision; applies Group Relative Policy Optimization (GRPO) to align multiple objectives without a separate value network; builds a dataset of high-quality reasoning trajectories with fine-grained action annotations.
Result: Consistently surpasses state-of-the-art methods on three benchmarks, with gains of up to 17.7% over prior RL-based methods.
Conclusion: UniDoc-RL supports the value of jointly modeling retrieval, perception, and reasoning, and offers a scalable, end-to-end trainable approach to visual RAG.
Abstract: Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
[146] Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
Yuzhuo Chen,Zehua Ma,Han Fang,Hengyi Wang,Guanjie Wang,Weiming Zhang
Main category: cs.CV
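TL;DR: This paper proposes the Flow of Truth framework, the first study of proactive temporal forensics for image-to-video (I2V) generated content. It models video generation as the motion of pixels through time and designs a learnable forensic template and a template-guided flow module that robustly trace pixel trajectories, substantially improving cross-model I2V forensics performance.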
Details
Motivation: I2V生成技术快速发展带来新型深度伪造威胁,其动态特性使传统基于静态图像的2D像素级鉴伪方法失效,亟需探索能跟踪像素时序演化规律的时间域鉴伪新范式。 Method: 提出‘像素随时间运动’的新视角,设计可学习的forensic template以匹配生成过程中的像素流形变化,并构建template-guided flow模块解耦运动与图像内容,从而实现对生成痕迹在时间轴上的稳定追踪。 Result: Flow of Truth在多个商用及开源I2V模型上展现出强泛化能力,显著优于现有方法,在时间维度鉴伪任务中取得大幅提升。 Conclusion: Flow of Truth是首个面向I2V生成内容的时间域主动鉴伪框架,验证了建模像素时序运动路径的有效性,为动态内容鉴伪开辟了新方向。 Abstract: The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present **Flow of Truth**, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as *the motion of pixels through time rather than the synthesis of frames*. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.[147] Quality-Aware Calibration for AI-Generated Image Detection in the Wild
Fabrizio Guillaro,Vincenzo De Rosa,Davide Cozzolino,Luisa Verdoliva
Main category: cs.CV
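TL;DR: This paper proposes QuAD, a framework that improves synthetic image detection through quality-aware fusion of near-duplicate images. Detector scores from multiple versions of the same image are aggregated with weights that reflect each version's quality, improving balanced accuracy by about 8% on average over plain averaging across several SOTA detectors.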
Details
Motivation: Most existing synthetic image detection methods operate on a single image and ignore the fact that, in real-world dissemination, the same image exists in multiple quality-degraded near-duplicate versions, which leads to inconsistent detection results.
Method: QuAD (Quality-Aware calibration with near-Duplicates) retrieves the online near-duplicates of a query image, feeds them to a detector to obtain prediction scores, and aggregates the scores with weights based on the estimated quality of each version. Two new datasets are built for evaluation: AncesTree (136k in-lab images organized in degradation trees) and ReWIND (~10k real-world near-duplicate web images).
Result: Validated on several SOTA detectors, QuAD's quality-aware fusion improves balanced accuracy by about 8% over plain averaging and markedly improves the reliability of AI-generated content detection in real-world settings.
Conclusion: Jointly analyzing all available near-duplicate versions of an image, while accounting for their degree of quality degradation, is key to more robust and practical synthetic image detection.
Abstract: Significant progress has been made in detecting synthetic images; however, most existing approaches operate on a single image instance and overlook a key characteristic of real-world dissemination: as viral images circulate on the web, multiple near-duplicate versions appear and lose quality due to repeated operations like recompression, resizing and cropping. As a consequence, the same image may yield inconsistent forensic predictions depending on which version has been analyzed. In this work, to address this issue we propose QuAD (Quality-Aware calibration with near-Duplicates), a novel framework that makes decisions based on all available near-duplicates of the same image. Given a query, we retrieve its online near-duplicates and feed them to a detector; the resulting scores are then aggregated based on the estimated quality of the corresponding instance. By doing so, we take advantage of all pieces of information while accounting for the reduced reliability of images impaired by multiple processing steps. To support large-scale evaluation, we introduce two datasets: AncesTree, an in-lab dataset of 136k images organized in stochastic degradation trees that simulate online reposting dynamics, and ReWIND, a real-world dataset of nearly 10k near-duplicate images collected from viral web content. Experiments on several state-of-the-art detectors show that our quality-aware fusion improves their performance consistently, with an average gain of around 8% in terms of balanced accuracy compared to plain averaging.
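The quality-aware fusion described in the abstract above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state how quality estimates are turned into weights, so the softmax weighting, the `temperature` parameter, and the function names here are all assumptions made for the example.

```python
import math

def quality_aware_score(scores, qualities, temperature=1.0):
    """Fuse per-copy detector scores, weighting each copy by its estimated quality.

    scores      : detector outputs for the near-duplicates (higher = more likely synthetic)
    qualities   : estimated quality of each copy (higher = less degraded)
    temperature : hypothetical knob; very large values approach the plain average
    """
    if not scores or len(scores) != len(qualities):
        raise ValueError("need exactly one quality estimate per score")
    # Softmax over qualities (an assumed weighting, not taken from the paper).
    top = max(qualities)
    weights = [math.exp((q - top) / temperature) for q in qualities]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

def plain_average(scores):
    """Baseline fusion: every near-duplicate counts equally."""
    return sum(scores) / len(scores)

# Three copies of one image: the cleanest copy gives the most confident score.
scores = [0.9, 0.6, 0.2]
qualities = [0.95, 0.50, 0.10]
fused = quality_aware_score(scores, qualities, temperature=0.2)
baseline = plain_average(scores)
```

With these made-up numbers the fused score is pulled toward the high-quality copy, whereas the plain average lets the heavily degraded copy drag the score down. When all quality estimates are equal, the two fusions coincide.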
Our results highlight the importance of jointly processing all the images available online to achieve reliable detection of AI-generated content in real-world applications. Code and data are publicly available at https://grip-unina.github.io/QuAD/
[148] Implicit Neural Representations: A Signal Processing Perspective
Dhananjaya Jayasundara,Vishal M. Patel
Main category: cs.CV
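TL;DR: This article surveys the development of implicit neural representations (INRs) from a signal processing perspective, focusing on spectral behavior, sampling theory, and multiscale representation. It examines how network design (coordinate inputs, activation function design, hash grid encodings, and so on) affects approximation capability, and discusses applications in medical and radar imaging, compression, and 3D scene representation, along with open theoretical challenges.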
Details
Motivation: 传统离散采样建模存在局限,而INRs以连续函数形式统一表征各类信号,亟需从信号处理视角理解其频谱行为、逼近机制与理论基础。 Method: 采用信号处理框架分析INRs的频谱偏差、采样特性与多尺度表示;系统梳理坐标网络、周期/局部/自适应激活函数、分层分解与哈希网格编码等结构演进;结合逆问题、压缩、3D重建等应用验证其有效性。 Result: 揭示了INRs固有的低频谱偏置现象,提出通过激活函数设计和结构化编码提升高频细节建模与空间自适应性;验证其在多类逆问题与表示任务中的优越性;指出理论稳定性、权重可解释性与大模型泛化等开放问题。 Conclusion: INRs本质是数据驱动的可学习信号模型,其逼近空间随数据自适应演化;未来需融合经典信号理论与深度学习,构建更鲁棒、可解释、可扩展的连续表示框架。 Abstract: Implicit neural representations (INRs) mark a fundamental shift in signal modeling, moving from discrete sampled data to continuous functional representations. By parameterizing signals as neural networks, INRs provide a unified framework for representing images, audio, video, 3D geometry, and beyond as continuous functions of their coordinates. This functional viewpoint enables signal operations such as differentiation to be carried out analytically through automatic differentiation rather than through discrete approximations. In this article, we examine the evolution of INRs from a signal processing perspective, emphasizing spectral behavior, sampling theory, and multiscale representation. We trace the progression from standard coordinate based networks, which exhibit a spectral bias toward low frequency components, to more advanced designs that reshape the approximation space through specialized activations, including periodic, localized, and adaptive functions. We also discuss structured representations, such as hierarchical decompositions and hash grid encodings, that improve spatial adaptivity and computational efficiency. We further highlight the utility of INRs across a broad range of applications, including inverse problems in medical and radar imaging, compression, and 3D scene representation. 
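The point about analytic differentiation can be illustrated with a toy coordinate network. The sketch below is an assumption-laden example rather than anything from the article: it uses a single hidden layer with sine activations (one of the periodic activations the abstract mentions), hand-picked hypothetical weights, and a hand-written chain rule standing in for automatic differentiation.

```python
import math

# Hypothetical, hand-picked parameters for a one-hidden-layer coordinate network.
FREQS = [1.0, 3.0, 7.0]     # w_j: input frequencies
PHASES = [0.0, 0.5, -1.0]   # b_j: phase offsets
AMPS = [1.0, 0.4, 0.1]      # a_j: output weights

def inr(x):
    """Signal value at coordinate x: f(x) = sum_j a_j * sin(w_j * x + b_j)."""
    return sum(a * math.sin(w * x + b) for a, w, b in zip(AMPS, FREQS, PHASES))

def inr_derivative(x):
    """Exact derivative by the chain rule: f'(x) = sum_j a_j * w_j * cos(w_j * x + b_j)."""
    return sum(a * w * math.cos(w * x + b) for a, w, b in zip(AMPS, FREQS, PHASES))

def finite_difference(f, x, h=1e-5):
    """Central difference: the discrete approximation that the functional view avoids."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.37
exact = inr_derivative(x0)
approx = finite_difference(inr, x0)
```

Because the signal is a continuous function of its coordinate, `inr_derivative` can be evaluated at any point without a sampling grid; the finite-difference value is included only as a sanity check and should agree closely with it.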
By interpreting INRs as learned signal models whose approximation spaces adapt to the underlying data, this article clarifies the field's core conceptual developments and outlines open challenges in theoretical stability, weight space interpretability, and large scale generalization.
[149] Attention-Gated Convolutional Networks for Scanner-Agnostic Quality Assessment
Chinmay Bakhale,Anil Sao
Main category: cs.CV
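TL;DR: This paper proposes a hybrid framework combining a CNN with an attention mechanism for robust, cross-site MRI quality assessment, targeting motion artifacts in particular. The model reaches 0.992 accuracy on seen sites and 0.755 accuracy on unseen multi-site ABIDE data without fine-tuning, showing strong generalization.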
Details
Motivation: 运动伪影严重影响sMRI诊断与自动化分析,而人工质控难以扩展至大规模纵向研究,亟需自动、跨站点鲁棒的质量评估方法。 Method: 提出一种混合CNN-Attention框架:采用分层2D CNN编码器提取局部空间特征,并引入多头交叉注意力机制建模全局依赖关系,以增强对运动伪影(如振铃、模糊)的敏感性,同时抑制站点特异性强度变化和背景噪声。 Result: 在MR-ART数据集上训练后,模型在已见站点测试中达到扫描级准确率0.9920、F1-score 0.9919;在未见的ABIDE多中心(17个异构站点)数据上无需微调即获准确率0.755,验证了其跨域泛化能力。 Conclusion: 注意力驱动的特征重加权能有效捕获通用伪影表征,显著缩小不同成像环境与设备厂商间的性能差距,为大规模多中心MRI质控提供可靠解决方案。 Abstract: Motion artifacts present a significant challenge in structural MRI (sMRI), often compromising clinical diagnostics and large-scale automated analysis. While manual quality control (QC) remains the gold standard, it is increasingly unscalable for massive longitudinal studies. To address this, we propose a hybrid CNN-Attention framework designed for robust, site-invariant MRI quality assessment. Our architecture integrates a hierarchical 2D CNN encoder for local spatial feature extraction with a multi-head cross-attention mechanism to model global dependencies. This synergy enables the model to prioritize motion relevant artifact signatures, such as ringing and blurring, while dynamically filtering out site-specific intensity variations and background noise. The framework was trained end-to-end on the MR-ART dataset using a balanced cohort of 200 subjects. Performance was evaluated across two tiers: Seen Site Evaluation on a held-out MR-ART partition and Unseen Site Evaluation using 200 subjects from 17 heterogeneous sites in the ABIDE archive. On seen sites, the model achieved a scan-level accuracy of 0.9920 and an F1-score of 0.9919. Crucially, it maintained strong generalization across unseen ABIDE sites (Acc = 0.755) without any retraining or fine-tuning, demonstrating high resilience to domain shift. 
These results indicate that attention-based feature re-weighting successfully captures universal artifact descriptors, bridging the performance gap between diverse imaging environments and scanner manufacturers.
[150] Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection
Yangchen Zeng,Zhenyu Yu,Dongming Jiang,Wenbo Zhang,Yifan Hong,Zhanhua Hu,Jiao Luo,Kangning Cui
Main category: cs.CV
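TL;DR: This paper proposes the HELP framework, which uses heatmap-guided positional embedding (HPE) to suppress background noise and strengthen foreground positional information, combined with gradient-based mask filtering and Linear-Snake Convolution to improve small-object detection. It maintains accuracy while cutting parameters by 59.4% and decoder layers from 8 to 3.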
Details
Motivation: Transformer检测器在小目标检测中仍存在效率低和易受背景噪声干扰的问题,需深度解码器优化低质量查询。 Method: 提出热图引导的嵌入学习范式(HELP),核心为热图引导位置嵌入(HPE),在编码器中注入热图感知位置编码,在解码器前用梯度掩码滤除背景主导嵌入;引入Linear-Snake卷积缓解小目标特征稀疏;热图监督仅用于训练,不增加推理开销。 Result: 解码器层数从8减至3,参数量降低59.4%(66.3M vs. 163M),在减少计算预算下仍保持跨基准的一致精度提升。 Conclusion: HELP是一种噪声感知的位置-语义融合框架,有效提升小目标检测效率与鲁棒性,兼顾可解释性(热条可视化)与实用性(零推理开销)。 Abstract: Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. Code Repository: https://github.com/yidimopozhibai/Noise-Suppressed-Query-Retrieval[151] Building Extraction from Remote Sensing Imagery under Hazy and Low-light Conditions: Benchmark and Baseline
Feifei Sang,Wei Lu,Hongruixuan Chen,Sibao Chen,Bin Luo
Main category: cs.CV
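TL;DR: This paper presents the HaLoBuilding benchmark dataset and the HaLoBuild-Net model, both designed for building extraction from hazy and low-light remote sensing imagery. Several cooperating modules suppress meteorological interference and improve robustness.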
Details
Motivation: 现有光学遥感建筑物提取方法在雾天和低光照等真实恶劣天气条件下性能显著下降,且缺乏相应基准;SAR虽可全天候成像,但存在几何畸变问题。 Method: 构建首个面向雾天与低光照条件的光学遥感建筑物提取基准HaLoBuilding;提出端到端网络HaLoBuild-Net,包含空间-频率聚焦模块(SFFM)、全局多尺度引导模块(GMGM)和互导融合模块(MGFM)。 Result: HaLoBuild-Net在HaLoBuilding数据集上显著优于SOTA方法及传统恢复-分割级联范式,并在WHU、INRIA和LoveDA数据集上展现出强泛化能力。 Conclusion: 所提基准与方法为恶劣天气下遥感建筑物提取提供了新标准与有效解决方案,推动该方向向更实用、鲁棒的方向发展。 Abstract: Building extraction from optical Remote Sensing (RS) imagery suffers from performance degradation under real-world hazy and low-light conditions. However, existing optical methods and benchmarks focus primarily on ideal clear-weather conditions. While SAR offers all-weather sensing, its side-looking geometry causes geometric distortions. To address these challenges, we introduce HaLoBuilding, the first optical benchmark specifically designed for building extraction under hazy and low-light conditions. By leveraging a same-scene multitemporal pairing strategy, we ensure pixel-level label alignment and high fidelity even under extreme degradation. Building upon this benchmark, we propose HaLoBuild-Net, a novel end-to-end framework for building extraction in adverse RS scenarios. At its core, we develop a Spatial-Frequency Focus Module (SFFM) to effectively mitigate meteorological interference on building features by coupling large receptive field attention with frequency-aware channel reweighting guided by stable low-frequency anchors. Additionally, a Global Multi-scale Guidance Module (GMGM) provides global semantic constraints to anchor building topologies, while a Mutual-Guided Fusion Module (MGFM) implements bidirectional semantic-spatial calibration to suppress shallow noise and sharpen weather-induced blurred boundaries. Extensive experiments demonstrate that HaLoBuild-Net significantly outperforms state-of-the-art methods and conventional cascaded restoration-segmentation paradigms on the HaLoBuilding dataset, while maintaining robust generalization on WHU, INRIA, and LoveDA datasets. 
The source code and datasets are publicly available at: https://github.com/AeroVILab-AHU/HaLoBuilding.
[152] Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID
Jiaxuan Li,Xin Wen,Zhihang Li
Main category: cs.CV
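TL;DR: This paper proposes STFER, a new framework that uses a large vision-language model (LVLM) to generate identity-consistent semantic text, improving the robustness of Any-Time person re-identification (AT-ReID) under cross-modality (RGB/IR) and clothing-change conditions.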
Details
Motivation: Existing methods depend on purely visual features that are easily affected by environmental and temporal factors, so performance drops sharply under illumination-induced modality shifts or clothing changes. A more robust identity representation is needed.
Method: The Semantic-driven Token Filtering and Expert Routing (STFER) framework: (1) instructions guide an LVLM to generate identity-intrinsic semantic text describing biometric constants; (2) this text drives Semantic-driven Visual Token Filtering (SVTF), which enhances key regions and suppresses background noise; (3) the text is integrated into Semantic-driven Expert Routing (SER) for robust multi-scenario gating.
Result: SOTA results on the Any-Time ReID dataset AT-USTC; when transferred to 5 mainstream ReID benchmarks, the model remains highly competitive, confirming strong generalization.
Conclusion: Semantic text serves as a stable, identity-discriminative complementary feature that reduces the sensitivity of visual features to modality and appearance changes, offering a new paradigm for AT-ReID.
Abstract: Any-Time Person Re-identification (AT-ReID) necessitates the robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on purely visual features, which are prone to change due to environmental and temporal factors, resulting in significant performance deterioration in scenarios involving illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants and serves as the semantic driver for the model. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results.
Moreover, the model trained on AT-USTC was evaluated across 5 widely used ReID benchmarks, demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.
[153] Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography
Simon Böhi,Irene Cannistraci,Sergio Muñoz Gonzalez,Moritz Vandenhirtz,Sonia Laguna,Samuel Ruiperez-Campillo,Max Krähenmann,Andrea Agostini,Ece Ozkan,Thomas M. Sutter,Julia E. Vogt
Main category: cs.CV
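TL;DR: This paper proposes the Latent Attention Masked Autoencoder (LAMAE), a model designed for the multi-view, sparse, and heterogeneous spatiotemporal data of echocardiography. Attention across frames and views in latent space yields a holistic representation of cardiac function. The model is pretrained on the real-world clinical dataset MIMIC-IV-ECHO, gives the first results for predicting ICD-10 codes from that dataset's videos, and its representations learned from adult data transfer effectively to pediatric cohorts.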
Details
Motivation: 现有掩码自编码器(MAE)方法通常独立处理图像或短片段,无法建模超声心动图固有的多视角结构,而临床中该模态数据又具有稀疏性与异构性,亟需能融合多视角信息的表示学习方法。 Method: 提出LAMAE模型,在标准MAE基础上增加潜空间中的‘潜注意力模块’(latent attention module),支持跨帧与跨视角的信息交互;在大规模未筛选临床数据集MIMIC-IV-ECHO上进行自监督预训练;并评估其在ICD-10编码预测及跨年龄组(成人→儿童)迁移任务上的性能。 Result: 首次在MIMIC-IV-ECHO视频上实现ICD-10代码预测;验证了LAMAE学习到的表征在儿科数据上具备强迁移能力;实证表明引入多视角结构先验(如潜注意力)显著提升了表征鲁棒性与可迁移性。 Conclusion: 将结构先验(如多视角注意力)融入基础模型设计,是提升医学影像表征质量、泛化性与临床适用性的关键路径;LAMAE为多视角动态医学影像建模提供了新范式。 Abstract: Echocardiography is a widely used modality for cardiac assessment due to its non-invasive and cost-effective nature, but the sparse and heterogeneous spatiotemporal views of the heart pose distinct challenges. Existing masked autoencoder (MAE) approaches typically process images or short clips independently, failing to capture the inherent multi-view structure required for coherent cardiac representation. We introduce Latent Attention Masked Autoencoder (LAMAE), a foundation model architecture tailored to the multi-view nature of medical imaging. LAMAE augments the standard MAE with a latent attention module that enables information exchange across frames and views directly in latent space. This allows the model to aggregate variable-length sequences and distinct views, reconstructing a holistic representation of cardiac function from partial observations. We pretrain LAMAE on MIMIC-IV-ECHO, a large-scale, uncurated dataset reflecting real-world clinical variability. To the best of our knowledge, we present the first results for predicting ICD-10 codes from MIMIC-IV-ECHO videos. Furthermore, we empirically demonstrate that representations learned from adult data transfer effectively to pediatric cohorts despite substantial anatomical differences. 
These results provide evidence that incorporating structural priors, such as multi-view attention, yields significantly more robust and transferable representations.
[154] How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos
Olga Loginova,Frank Keller
Main category: cs.CV
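TL;DR: This paper proposes PIE-V, a framework that uses psychologically inspired error injection and repair to construct and benchmark natural mistakes and recovery behavior in egocentric procedural videos, with the aim of improving mistake awareness and correction.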
Details
Motivation: 现有程序性视频数据集缺乏自然、一致且可追踪的人类错误及恢复过程;而真实场景中错误常被手部遮挡,仅通过细微物体状态变化体现,亟需更可靠的过程监控方法。 Method: PIE-V结合心理学驱动的错误规划器(基于步骤阶段与语义负载)、恢复规划器、级联一致的LLM重写器、LLM判别器,以及文本引导的视频片段合成与拼接技术,在17项任务和50个Ego-Exo4D场景中注入102个错误并生成27个恢复修正。 Result: 构建了统一错误分类体系与含9项指标的人类评估量表(涵盖合理性、逻辑性、状态一致性、图文对齐等),并用其审计现有资源、对比自由式LLM基线,验证PIE-V在错误检测与纠正评测上的有效性。 Conclusion: PIE-V为第一人称程序性视频中的错误感知与后处理验证提供了可扩展的生成-评测协同框架,推动面向真实失误的鲁棒过程理解研究。 Abstract: Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified taxonomy and a human rubric with nine metrics that cover step-level and procedure-level quality, including plausibility, procedure logic with annotator confidence, state change coherence, and grounding between text and video. Using this protocol, we audit several existing resources and compare PIE-V against a freeform LLM generation baseline under the same criteria. 
Together, the framework and rubric support post-completion verification for egocentric procedural mistake detection and correction.
[155] KVNN: Learnable Multi-Kernel Volterra Neural Networks
Haoyu Yun,Hamid Krim,Yufang Bao
Main category: cs.CV
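TL;DR: This paper proposes a kernelized Volterra Neural Network (kVNN) that models feature interactions of different orders with a learnable multi-kernel representation, substantially reducing parameter count and computation while maintaining high performance.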
Details
Motivation: 高阶学习依赖于组合特征,但传统深度模型在增强表征能力时往往导致复杂度剧增,亟需兼顾表达力与计算效率的新结构。 Method: 提出核化Volterra神经网络(kVNN),采用带紧凑可学习中心的多项式核组件建模不同阶次交互;每层由多个不同阶次的并行分支组成,可直接替换标准卷积核。 Result: 在视频动作识别与图像去噪任务上,kVNN在参数量和GFLOPs显著降低的同时,性能达到或超过基线模型,且无需大规模预训练。 Conclusion: 结构化的核化高阶层为现代深度网络提供了兼顾表达力与计算成本的实用解决方案。 Abstract: Higher-order learning is fundamentally rooted in exploiting compositional features. It clearly hinges on enriching the representation by more elaborate interactions of the data which, in turn, tends to increase the model complexity of conventional large-scale deep learning models. In this paper, a kernelized Volterra Neural Network (kVNN) is proposed. The key to the achieved efficiency lies in using a learnable multi-kernel representation, where different interaction orders are modeled by distinct polynomial-kernel components with compact, learnable centers, yielding an order-adaptive parameterization. Features are learned by the composition of layers, each of which consists of parallel branches of different polynomial orders, enabling kVNN filters to directly replace standard convolutional kernels within existing architectures. The theoretical results are substantiated by experiments on two representative tasks: video action recognition and image denoising. The results demonstrate favorable performance-efficiency trade-offs: kVNN consistently yields reduced model (parameters) and computational (GFLOPs) complexity with competitive and often improved performance. These results are maintained even when trained from scratch without large-scale pretraining. In summary, we substantiate that structured kernelized higher-order layers offer a practical path to balancing expressivity and computational cost in modern deep networks.[156] Class Unlearning via Depth-Aware Removal of Forget-Specific Directions
Arman Hatami,Romina Aalishah,Ilya E. Monosov
Main category: cs.CV
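TL;DR: This paper proposes DAMP, which performs class unlearning as one-shot, closed-form weight surgery via depth-aware projection modulation. It removes the structural information of target classes from deep representations while preserving performance on retained classes.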
Details
Motivation: 现有类别遗忘方法存在选择性弱、深层表示中仍保留遗忘类别结构、或过度依赖最终层偏置调整等问题,无法真正实现知识擦除。 Method: DAMP是一种无需梯度优化的单次闭式权重手术方法:在每一网络阶段,基于输入空间中的类别原型计算遗忘方向(相对于保留类原型的残差),并通过投影更新降低下游对这些方向的敏感性;采用基于探针可分性的无参深度感知缩放规则,使浅层编辑小、深层编辑大;支持多类别遗忘的低秩子空间移除。 Result: 在MNIST、CIFAR-10/100和Tiny ImageNet数据集及CNN/Transformer架构上,DAMP比现有方法更接近重训练的黄金标准,在选择性遗忘、保留类性能维持及深层遗忘类结构消除方面均有提升。 Conclusion: DAMP提供了一种高效、可解释、架构通用的类遗忘方案,验证了通过定向移除表征空间中的类别特定方向可实现更本质的知识擦除。 Abstract: Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal. 
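The projection-based edit described in the abstract above can be sketched for a single linear layer. This is only an illustration under stated assumptions, not the authors' procedure: the forget direction is taken here as the unit residual of the forget-class prototype relative to the mean retain-class prototype, the update is W' = W - s (W d) d^T, and the scalar `strength` stands in for the depth-aware scale that the paper derives from probe separability. All function names and numbers are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def forget_direction(forget_proto, retain_protos):
    """Unit residual of the forget prototype relative to the mean retain prototype (assumed form)."""
    center = mean_vector(retain_protos)
    residual = [f - c for f, c in zip(forget_proto, center)]
    norm = math.sqrt(sum(r * r for r in residual))
    if norm == 0.0:
        raise ValueError("forget prototype coincides with the retain mean")
    return [r / norm for r in residual]

def matvec(weight, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def damp_edit(weight, direction, strength):
    """Return W' = W - strength * (W d) d^T for a unit direction d.

    strength = 1 makes the layer blind to d; smaller values only shrink its response.
    """
    w_d = matvec(weight, direction)
    return [[w - strength * wd * d for w, d in zip(row, direction)]
            for row, wd in zip(weight, w_d)]

# Toy layer with 2-D inputs; prototypes live in the layer's input space.
W = [[1.0, 2.0], [3.0, 4.0]]
d = forget_direction([2.0, 0.0], [[0.0, 0.0], [0.0, 2.0]])
edited = damp_edit(W, d, strength=1.0)      # full removal, e.g. a deep layer
softened = damp_edit(W, d, strength=0.5)    # partial removal, e.g. an early layer
```

After the full edit the layer no longer responds to the forget direction, while inputs orthogonal to it pass through unchanged; using a smaller `strength` in early layers and a larger one in deep layers mimics the depth-aware scaling in spirit only.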
Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.
[157] OmniLight: One Model to Rule All Lighting Conditions
Youngjin Oh,Junyoung Park,Junhyeong Kwon,Nam Ik Cho
Main category: cs.CV
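TL;DR: This paper presents two strategies for lighting-related image restoration: the specialized DINOLight framework and the generalized OmniLight framework (with a WD-MoE module). Both achieved top-tier rankings in all three lighting-related tracks of the NTIRE 2026 Challenge.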
Details
Motivation: 现实应用中模型需应对多样光照域,而现有方法常仅在特定基准上表现优异,缺乏跨域鲁棒性。 Method: 构建专用基线DINOLight,并扩展为跨数据集训练的通用模型OmniLight,引入小波域混合专家(WD-MoE)结构。 Result: 两种方法在NTIRE 2026挑战赛全部三个光照相关赛道均取得顶尖排名,验证了其感知质量与泛化能力。 Conclusion: 专用与通用架构各有优势,数据分布特性显著影响二者性能;WD-MoE有效提升跨域光照恢复能力。 Abstract: Adverse lighting conditions, such as cast shadows and irregular illumination, pose significant challenges to computer vision systems by degrading visibility and color fidelity. Consequently, effective shadow removal and ALN are critical for restoring underlying image content, improving perceptual quality, and facilitating robust performance in downstream tasks. However, while achieving state-of-the-art results on specific benchmarks is a primary goal in image restoration challenges, real-world applications often demand robust models capable of handling diverse domains. To address this, we present a comprehensive study on lighting-related image restoration by exploring two contrasting strategies. We leverage a robust framework for ALN, DINOLight, as a specialized baseline to exploit the characteristics of each individual dataset, and extend it to OmniLight, a generalized alternative incorporating our proposed Wavelet Domain Mixture-of-Experts (WD-MoE) that is trained across all provided datasets. Through a comparative analysis of these two methods, we discuss the impact of data distribution on the performance of specialized and unified architectures in lighting-related image restoration. Notably, both approaches secured top-tier rankings across all three lighting-related tracks in the NTIRE 2026 Challenge, demonstrating their outstanding perceptual quality and generalization capabilities. Our codes are available at https://github.com/OBAKSA/Lighting-Restoration.[158] An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation
Onno Niemann,Gonzalo Martínez Muñoz,Alberto Suárez Gonzalez
Main category: cs.CV
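TL;DR: This paper investigates how to obtain the benefits of Fokker-Planck (FP) equation regularization in diffusion model training at lower computational cost. An empirical analysis of several lightweight regularizers finds that they keep FP residuals suppressed and improve generation quality while substantially reducing computational overhead.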
Details
Motivation: 扩散模型常违反描述真实数据密度演化的福克-普朗克(FP)方程;直接在目标函数中惩罚该偏差虽有效但计算开销大;而强FP正则未必提升生成质量,故需探索更高效替代方案。 Method: 实证分析多种轻量级正则化项,评估其对FP残差和生成质量的影响,并与标准FP正则方法对比计算效率与性能。 Result: 轻量级正则项可在显著降低计算成本的前提下,达到与传统FP正则相似的FP残差抑制效果和生成质量。 Conclusion: FP正则化带来的收益可通过更简单的正则项以更低代价实现,为扩散模型训练提供了高效实用的新思路。 Abstract: Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker--Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at https://github.com/OnnoNiemann/fp_diffusion_analysis.[159] Boundary-Centric Active Learning for Temporal Action Segmentation
Halil Ismail Helvaci,Sen-ching Samson Cheung
Main category: cs.CV
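TL;DR: This paper proposes B-ACT, a framework that improves annotation efficiency for temporal action segmentation (TAS) by actively allocating labeling effort to high-leverage boundary regions. It uses a two-stage active learning strategy: videos are first selected by predictive uncertainty, and within each selected video key boundary frames are scored and chosen for annotation by fusing neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Only boundary frames are labeled, but training uses boundary-centered clips to exploit temporal context within the model's receptive field. Experiments show that the method significantly outperforms existing TAS active learning methods on several benchmark datasets, with the largest gains on boundary-sensitive metrics.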
Details
Motivation: 时序动作分割(TAS)需要密集的时间标注,但大部分标注成本耗费在识别和精调动作边界上,而这些边界区域正是分割错误集中、微小时间偏移对评估指标影响最大的地方。 Method: 提出B-ACT——一种基于剪辑预算的主动学习框架:(i)第一阶段基于预测不确定性对未标注视频排序并查询;(ii)第二阶段在选定视频中检测候选动作边界,并通过融合邻域不确定性、类别歧义性和时间预测动态性的新型边界得分,选取Top-K边界帧进行标注;标注协议仅要求标注边界帧,但训练时使用以边界为中心的剪辑片段,以保留时序上下文。 Result: 在GTEA、50Salads和Breakfast数据集上的大量实验表明,B-ACT在稀疏标注预算下显著优于代表性TAS主动学习基线及先前SOTA,尤其在边界定位主导的编辑距离与重叠F1分数上提升最大。 Conclusion: 聚焦边界的主动监督策略能大幅提升TAS任务的标注效率与性能,验证了将有限标注资源精准投向高影响力边界区域的有效性与必要性。 Abstract: Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Importantly, our annotation protocol requests labels for only the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets, with the largest gains on datasets where boundary placement dominates edit and overlap-based F1 scores.[160] VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
Huawei Ji,Yuanhao Sun,Yuan Jin,Cheng Deng,Jiaxin Ding,Luoyi Fu,Xinbing Wang
Main category: cs.CV
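TL;DR: This paper proposes VisPCO, a new framework that formulates visual token pruning as a Pareto configuration optimization problem and automatically searches for optimal pruning configurations using continuous relaxation, straight-through estimators, and the Augmented Lagrangian method. Its effectiveness and generalization are validated on multiple benchmarks, and the analysis shows that multi-step progressive pruning better matches the hierarchical compression structure of VLMs.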
Details
Motivation: Existing visual token pruning methods rely on predefined configurations and cannot guarantee an optimal computation-performance trade-off.
Method: Visual token pruning is formulated as a Pareto configuration optimization problem. Continuous relaxation and straight-through estimators enable gradient-based search, which is solved with the Augmented Lagrangian method. Learnable kernel functions are introduced to analyze layer-wise pruning patterns.
Result: On 8 visual benchmarks, the approach effectively approximates the Pareto frontier obtained by grid search and generalizes well. Multi-step progressive pruning achieves better accuracy-efficiency trade-offs than single-layer pruning.
Conclusion: An automated Pareto optimization framework determines visual token pruning configurations in a more principled way, and a multi-step progressive strategy better matches the intrinsic hierarchical structure of VLMs.
Abstract: Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
[161] Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization
Umer Ahmed,Syed Ahmed Mahmood,Fawad Javed Fateh,M. Shaheer Luqman,M. Zeeshan Zia,Quoc-Huy Tran
Main category: cs.CV
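TL;DR: This paper proposes a hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. Two levels of vector quantization model subactions and actions respectively and fuse spatial and temporal information, achieving state-of-the-art performance on multiple benchmarks while mitigating segment length bias.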
Details
Motivation: Addresses the lack of hierarchical action structure modeling and the neglect of temporal information in unsupervised skeleton-based temporal action segmentation, while mitigating segment length bias. Method: Proposes a hierarchical spatiotemporal vector quantization framework: the first level maps skeletons to fine-grained subactions, and the second level aggregates subactions into action-level representations; it further introduces joint spatiotemporal reconstruction (skeletons plus timestamps) to perform multi-level clustering. Result: Achieves new state-of-the-art performance on multiple benchmarks, including HuGaDB, LARa, and BABEL, and effectively reduces segment length bias. Conclusion: The hierarchical design and joint spatiotemporal modeling significantly improve the performance and robustness of unsupervised skeleton-based action segmentation, validating the effectiveness of vector quantization for this task. Abstract: We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
[162] MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Yan Li,Zezi Zeng,Yifan Yang,Yuqing Yang,Ning Liao,Weiwei Guo,Lili Qiu,Mingxi Cheng,Qi Dai,Zhendong Wang,Zhengyuan Yang,Xue Yang,Ji Li,Lijuan Wang,Chong Luo
Main category: cs.CV
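TL;DR: This paper proposes MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC element generation through hierarchical planning and self-reflection to improve global coherence and visual consistency, and introduces a new benchmark and evaluation protocol.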
Details
Motivation: Directly integrating existing AIGC tools into automated webpage generation tends to cause style inconsistency and poor global coherence, because elements are generated in isolation. Method: Proposes the hierarchical agentic framework MM-WebAgent, which combines hierarchical planning with iterative self-reflection to jointly optimize global layout, local multimodal content, and their integration; also builds a benchmark for multimodal webpage generation and a multi-level evaluation protocol. Result: Experiments show that MM-WebAgent significantly outperforms code-generation and agent-based baselines in multimodal element generation and integration. Conclusion: MM-WebAgent effectively improves visual consistency and global coherence in automated webpage generation and offers a new paradigm for multimodal UI generation. Abstract: The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.
[163] StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
Xuanyi Liu,Deyi Ji,Chunan Yu,Qi Zhu,Xuanfu Li,Jin Ma,Tianrun Chen,Lanyun Zhu
Main category: cs.CV
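TL;DR: This paper proposes StreamCacheVGGT, a training-free cache management framework for streaming 3D reconstruction. Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC) improve the retention of geometric information, achieving state-of-the-art performance on multiple benchmarks.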
Details
Motivation: Existing streaming 3D reconstruction methods under an O(1) memory constraint rely on simple per-token eviction, which causes severe information loss and suffers from activation noise introduced by single-layer scoring. Method: Proposes StreamCacheVGGT: 1) the CLCES module uses token activation trajectories across Transformer layers together with order-statistical analysis to robustly assess the geometric salience of tokens; 2) the HCC module applies a three-tier triage strategy on the key-vector manifold, merging moderately important tokens into anchors instead of deleting them outright. Result: On five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI), it significantly improves reconstruction accuracy and long-term stability over existing methods while strictly satisfying the constant-memory constraint. Conclusion: Through the co-designed CLCES and HCC modules, StreamCacheVGGT effectively mitigates the information destruction of the pure-eviction paradigm, validating the effectiveness and generality of geometry- and semantics-aware cache compression for streaming dense reconstruction. Abstract: Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
[164] TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
Jiawei Ren,Michal Jan Tyszkiewicz,Jiahui Huang,Zan Gojcic
Main category: cs.CV
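TL;DR: This paper proposes TokenGS, a new Transformer-based method for 3D Gaussian Splatting (3DGS) that directly regresses 3D Gaussian mean coordinates and adopts an encoder-decoder architecture with learnable Gaussian tokens, improving robustness, generalization, and reconstruction quality.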
Details
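Motivation: Existing methods regress depth along camera rays as the Gaussian means, which is suboptimal; in addition, encoder-only architectures tie the number of predicted primitives to the input image resolution and number of views, making it hard to handle pose noise and multiview inconsistencies. Method: Abandons regressing depth along rays and instead directly regresses 3D Gaussian mean coordinates using only a self-supervised rendering loss; introduces an encoder-decoder architecture with learnable Gaussian tokens to decouple the number of predicted primitives from the input conditions; supports efficient test-time optimization in token space without degrading learned priors. Result: TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, with more regularized geometry and a more balanced 3DGS distribution, and naturally recovers emergent attributes such as static-dynamic decomposition and scene flow. Conclusion: Directly regressing 3D coordinates and tokenizing the Gaussian representation are key design choices for improving feed-forward 3DGS modeling, and TokenGS offers a new paradigm for efficient, robust, and scalable neural scene representation. Abstract: In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.
[165] SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation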
Tianhao Fu,Austin Wang,Charles Chen,Roby Aldave-Garza,Yucheng Chen
Main category: cs.CV
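TL;DR: This paper proposes SegWithU, a lightweight, single-forward-pass, post-hoc uncertainty estimation framework for medical image segmentation. It attaches an uncertainty head to a frozen pretrained segmentation backbone and models perturbation energy to produce two kinds of voxel-wise uncertainty maps, achieving the best single-forward-pass performance on multiple datasets.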
Details
Motivation: Reliable uncertainty estimation is critical in medical image segmentation, but strong existing methods mostly require repeated inference, while efficient single-forward-pass methods are often limited in failure-ranking ability or by feature-space assumptions. Method: SegWithU is a post-hoc framework that attaches a lightweight uncertainty head to a frozen pretrained segmentation backbone; using intermediate features, it models perturbation energy in a compact probe space with rank-1 posterior probes and outputs two voxel-wise uncertainty maps: a calibration-oriented map (for probability tempering) and a ranking-oriented map (for error detection and selective prediction). Result: On ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, reaching AUROC/AURC of 0.9838/2.4885, 0.9946/0.2660, and 0.9925/0.8193, respectively, while preserving segmentation quality. Conclusion: Perturbation-based uncertainty modeling is an effective and practical route to highly reliable medical image segmentation. Abstract: Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present SegWithU, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of $0.9838/2.4885$, $0.9946/0.2660$, and $0.9925/0.8193$, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.
[166] Why Do Vision Language Models Struggle To Recognize Human Emotions?
Madhav Agarwal,Sotirios A. Tsaftaris,Laura Sevilla-Lara,Steven McDonagh
Main category: cs.CV
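TL;DR: This paper investigates why vision-language models (VLMs) perform poorly at human emotion recognition and identifies two weaknesses: bias on long-tailed emotion data and an inability to model the temporal information of dynamic micro-expressions. It proposes improved sampling strategies and a multi-stage context enrichment method to improve VLM performance on dynamic facial expression recognition (DFER).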
Details
Motivation: Although vision-language models (VLMs) have made remarkable progress on many visual tasks, they struggle to outperform specialized vision models on human emotion recognition; this paper seeks the root causes. Method: Analyzes why VLMs fail at emotion recognition, identifies two problems (long-tailed data bias and insufficient temporal modeling), and proposes alternative sampling strategies for the first and a multi-stage context enrichment method based on natural language summaries for the second. Result: Finds that VLMs confuse rare emotions because pretraining data exacerbates long-tail bias; sparse sampling cannot capture micro-expressions lasting 0.25-0.5 seconds; the proposed context enrichment strategy preserves the emotional trajectory and mitigates attentional dilution. Conclusion: The current VLM architecture is ill-suited to emotion recognition, a continuous, dynamic, long-tailed task; targeted improvements to data sampling and temporal information fusion are needed to unlock its cross-modal potential. Abstract: Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous and dynamic task of dynamic facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal.
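The mismatch between sparse sampling and micro-expression duration can be checked with simple arithmetic. The sketch below assumes uniformly spaced keyframes and an event that starts at a random offset relative to the sampling grid; the clip length and keyframe count are hypothetical values chosen for illustration, not figures from the paper.

```python
def sampling_interval(clip_seconds: float, num_keyframes: int) -> float:
    """Seconds between consecutive uniformly spaced keyframes."""
    return clip_seconds / num_keyframes


def hit_probability(event_seconds: float, interval: float) -> float:
    """Probability that at least one keyframe lands inside an event of the
    given duration, assuming the event starts at a uniformly random offset
    relative to the sampling grid."""
    return min(1.0, event_seconds / interval)


# Hypothetical setting: a 10-second clip summarized by 8 keyframes.
interval = sampling_interval(10.0, 8)
for duration in (0.25, 0.5):  # micro-expression range quoted in the abstract
    p = hit_probability(duration, interval)
    print(f"{duration:.2f} s event: captured with probability {p:.2f}")
```

Under these assumed numbers the keyframes are 1.25 seconds apart, so most micro-expressions fall entirely between two sampled frames, which is the gap the context enrichment strategy targets.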
As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
[167] R3D: Revisiting 3D Policy Learning
Zhengdong Hong,Shenrui Wu,Haozhe Cui,Boyi Zhao,Ran Ji,Yiyang He,Hangxing Zhang,Zundong Ke,Jun Wang,Guofeng Zhang,Jiayuan Gu
Main category: cs.CV
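TL;DR: This paper proposes a new architecture that couples a scalable Transformer-based 3D encoder with a diffusion decoder. By introducing 3D data augmentation and avoiding the adverse effects of Batch Normalization, it resolves the training instability and severe overfitting of 3D policy learning and significantly improves performance on manipulation benchmarks.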
Details
Motivation: 3D policy learning promises better generalization and cross-embodiment transfer, but training instability and severe overfitting have made it hard to adopt powerful 3D perception models. Method: Systematically diagnoses the failures and identifies the omission of 3D data augmentation and the adverse effects of Batch Normalization as the primary causes; proposes a new architecture that couples a scalable Transformer-based 3D encoder with a diffusion decoder, designed for stability at scale and able to leverage large-scale pre-training. Result: Significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks. Conclusion: Establishes a new, robust foundation for scalable 3D imitation learning. Abstract: 3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/
[168] GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
Roni Itkin,Noam Issachar,Yehonatan Keypur,Anpei Chen,Sagie Benaim
Main category: cs.CV
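TL;DR: This paper proposes the GlobalSplat framework, which uses an "align first, decode later" strategy to learn a compact, globally consistent latent scene representation, avoiding the redundancy and inconsistency caused by pixel/voxel alignment in conventional methods. It maintains high-quality novel-view synthesis while greatly reducing the number of Gaussians (as few as 16K) and the model size (4MB), with fast inference (<78ms).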
Details
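Motivation: Existing approaches to spatial primitive allocation in 3D Gaussian Splatting (iterative optimization or feed-forward inference) face significant trade-offs among representation compactness, reconstruction speed, and rendering fidelity, mainly because they rely on local heuristic strategies that lack global scene awareness; feed-forward methods in particular suffer from redundant 3D assets, representation bloat, and poor cross-view consistency because of pixel/voxel alignment. Method: Proposes the GlobalSplat framework, which follows an "align first, decode later" paradigm: it first learns a compact, global latent scene representation that encodes multi-view input and resolves cross-view correspondences; it does not rely on pretrained pixel-prediction backbones or on reusing dense features from dense baselines; a coarse-to-fine training curriculum gradually increases decoding capacity and natively prevents representation bloat. Result: Achieves competitive novel-view synthesis performance on RealEstate10K and ACID with only about 16K Gaussians and a model footprint as low as 4MB; a single forward pass takes under 78 milliseconds, significantly faster than the baselines. Conclusion: Through a global latent representation and a progressive decoding mechanism, GlobalSplat addresses the redundancy and inconsistency of spatial allocation in 3D Gaussian Splatting, strikes a better balance among compactness, efficiency, and quality, and offers a new paradigm for efficient neural radiance field modeling. Abstract: The efficient spatial allocation of primitives serves as the foundation of 3D Gaussian Splatting, as it directly dictates the synergy between representation compactness, reconstruction speed, and rendering fidelity. Previous solutions, whether based on iterative optimization or feed-forward inference, suffer from significant trade-offs between these goals, mainly due to the reliance on local, heuristic-driven allocation strategies that lack global scene awareness. Specifically, current feed-forward methods are largely pixel-aligned or voxel-aligned. By unprojecting pixels into dense, view-aligned primitives, they bake redundancy into the 3D asset. As more input views are added, the representation size increases and global consistency becomes fragile. To this end, we introduce GlobalSplat, a framework built on the principle of align first, decode later. Our approach learns a compact, global, latent scene representation that encodes multi-view input and resolves cross-view correspondences before decoding any explicit 3D geometry. Crucially, this formulation enables compact, globally consistent reconstructions without relying on pretrained pixel-prediction backbones or reusing latent features from dense baselines. Utilizing a coarse-to-fine training curriculum that gradually increases decoded capacity, GlobalSplat natively prevents representation bloat.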
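The abstract does not specify the curriculum itself, so the following is only one possible reading of "gradually increases decoded capacity": a staged schedule that doubles the number of decoded Gaussians at evenly spaced points in training. The function name, the starting and final capacities, and the doubling rule are all assumptions made for illustration.

```python
def decoded_capacity(step: int, total_steps: int,
                     start: int = 1024, final: int = 16384) -> int:
    """Number of Gaussians decoded at a given training step.

    Capacity doubles at evenly spaced stage boundaries, from `start` up to
    `final`. Both values are hypothetical, and `final / start` is assumed
    to be a power of two.
    """
    num_stages = (final // start).bit_length()  # e.g. 16 -> 5 stages
    stage = min(num_stages - 1, step * num_stages // total_steps)
    return start << stage


# Capacity at a few points of a hypothetical 1000-step run.
for step in (0, 250, 500, 750, 999):
    print(step, decoded_capacity(step, 1000))
```

A staged schedule like this keeps early training cheap and coarse, and only later lets the decoder spend its full primitive budget.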
On RealEstate10K and ACID, our model achieves competitive novel-view synthesis performance while utilizing as few as 16K Gaussians, significantly fewer than required by dense pipelines, obtaining a light 4MB footprint. Further, GlobalSplat enables significantly faster inference than the baselines, operating under 78 milliseconds in a single forward pass. Project page is available at https://r-itk.github.io/globalsplat/
[169] AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving
Fabrizio Genilotti,Arianna Stropeni,Gionata Grotto,Francesco Borsatti,Manuel Barusco,Davide Dalle Pezze,Gian Antonio Susto
Main category: cs.CV
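TL;DR: This paper explores visual anomaly detection (VAD) for autonomous driving. It benchmarks 8 state-of-the-art VAD methods on the AnoVox dataset, validates their effectiveness in road scenes, and finds that Tiny-Dinomaly offers the best balance between accuracy and efficiency, making it suitable for edge deployment.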
Details
Motivation: Autonomous driving systems tend to degrade when they encounter anomalous scenes outside the training data distribution (such as atypical obstacles), and such failures can directly create physical safety risks, so reliable methods for identifying unknown anomalies are urgently needed. Method: Systematically benchmarks 8 state-of-the-art visual anomaly detection (VAD) methods on AnoVox (currently the largest synthetic anomaly detection dataset for autonomous driving), covering four backbone networks from large to lightweight (such as MobileNet and DeiT-Tiny), and analyzes their pixel-level anomaly localization ability and suitability for edge deployment. Result: VAD methods transfer effectively to road scenes; Tiny-Dinomaly matches full-scale localization accuracy while markedly reducing memory cost, giving the best accuracy-efficiency trade-off. Conclusion: VAD is a feasible and practical way to enhance autonomous driving safety, and lightweight options such as Tiny-Dinomaly make edge deployment realistic, helping to better protect passengers, pedestrians, and all other road users. Abstract: The reliability of a machine vision system for autonomous driving depends heavily on its training data distribution. When a vehicle encounters significantly different conditions, such as atypical obstacles, its perceptual capabilities can degrade substantially. Unlike many domains where errors carry limited consequences, failures in autonomous driving translate directly into physical risk for passengers, pedestrians, and other road users. To address this challenge, we explore Visual Anomaly Detection (VAD) as a solution. VAD enables the identification of anomalous objects not present during training, allowing the system to alert the driver when an unfamiliar situation is detected. Crucially, VAD models produce pixel-level anomaly maps that can guide driver attention to specific regions of concern without requiring any prior assumptions about the nature or form of the hazard. We benchmark eight state-of-the-art VAD methods on AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving. In particular, we evaluate performance across four backbone architectures spanning from large networks to lightweight ones such as MobileNet and DeiT-Tiny. Our results demonstrate that VAD transfers effectively to road scenes. Notably, Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching full-scale localization performance at a fraction of the memory cost.
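To make concrete how a pixel-level anomaly map could guide driver attention, the sketch below reduces a map to an image-level score and a bounding box around the high-scoring pixels. It is an illustrative post-processing step, not part of the benchmark; the toy map, the threshold of 0.5, and both function names are assumptions.

```python
from typing import List, Optional, Tuple

AnomalyMap = List[List[float]]  # per-pixel anomaly scores, row-major


def image_score(anomaly_map: AnomalyMap) -> float:
    """Image-level score taken as the maximum pixel score."""
    return max(max(row) for row in anomaly_map)


def anomalous_region(anomaly_map: AnomalyMap,
                     threshold: float) -> Optional[Tuple[int, int, int, int]]:
    """Bounding box (top, left, bottom, right) around all pixels whose score
    exceeds `threshold`, or None if no pixel does."""
    hits = [(r, c)
            for r, row in enumerate(anomaly_map)
            for c, value in enumerate(row)
            if value > threshold]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), min(cols), max(rows), max(cols)


# Toy 4x4 map with a small high-scoring patch (hypothetical values).
toy_map = [
    [0.1, 0.1, 0.2, 0.1],
    [0.1, 0.2, 0.8, 0.9],
    [0.1, 0.1, 0.7, 0.3],
    [0.0, 0.1, 0.1, 0.1],
]
print(image_score(toy_map), anomalous_region(toy_map, threshold=0.5))
```

Because the box is derived purely from scores, it makes no assumption about what kind of object triggered the alert, which matches the hazard-agnostic use described in the abstract.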
This study represents a concrete step toward safer, more responsible deployment of autonomous vehicles, ultimately improving protection for passengers, pedestrians, and all road users.
[170] AnimationBench: Are Video Models Good at Character-Centric Animation?
Leyi Wu,Pengjun Fang,Kai Sun,Yazhou Xing,Yinwei Wu,Songsong Wang,Ziqi Huang,Dan Zhou,Yingqing He,Ying-Cong Chen,Qifeng Chen
Main category: cs.CV
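TL;DR: This paper presents AnimationBench, the first systematic benchmark for evaluating animation-style image-to-video generation. It incorporates the Twelve Basic Principles of Animation and IP preservation, supports both closed-set and open-set evaluation, and markedly improves the ability to discriminate animation generation quality.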
Details
Motivation: Existing video generation benchmarks are designed mainly for realistic video and struggle to evaluate animation-style generation (stylized appearance, exaggerated motion, character consistency); they also depend on fixed prompts and rigid pipelines, leaving little flexibility for open-domain content and custom evaluation. Method: Proposes the AnimationBench benchmark, which turns the Twelve Basic Principles of Animation and IP preservation into quantifiable evaluation dimensions and adds broader quality dimensions (semantic consistency, motion rationality, camera motion consistency); it supports standardized closed-set evaluation and flexible open-set diagnostic evaluation, and uses vision-language models for scalable automatic assessment. Result: Experiments show that AnimationBench agrees closely with human judgment, exposes animation-specific quality issues that realism-oriented benchmarks overlook, and gives a more informative and discriminative evaluation of current state-of-the-art image-to-video models. Conclusion: AnimationBench fills the gap in evaluating animation-style I2V generation, establishing the first systematic, scalable, human-aligned benchmark framework for the field and pushing animation generation toward higher quality and controllability. Abstract: Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. Moreover, they rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized closed-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlooked by realism-oriented benchmarks, leading to more informative and discriminative evaluation of state-of-the-art I2V models.
[171] Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation
Yiyang Jiang,Li Zhang,Xiao-Yong Wei,Li Qing
Main category: cs.CV
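TL;DR: This paper proposes a reasoning-based sign language translation (SLT) framework that treats SLT as a cross-modal reasoning task. It introduces an explicit, ordered sequence of latent thoughts as an intermediate layer between video and text and adopts a plan-then-ground decoding strategy, improving the coherence and faithfulness of translation; it also builds a large-scale gloss-free SLT dataset with strong context dependencies.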
Details
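Motivation: Existing SLT systems implicitly assume that short chunks of signing map directly to spoken-language words, but this assumption does not hold in practice: signers often build meaning from context, space, and movement dynamics, so SLT needs to be remodeled as a cross-modal reasoning problem. Method: Proposes a reasoning-driven SLT framework: 1) introduces an ordered sequence of latent thoughts as an explicit intermediate representation between video and text; 2) adopts a plan-then-ground decoding strategy that first generates a semantic plan and then looks back at the video for supporting evidence; 3) builds a new large-scale, gloss-free SLT dataset with strong context dependencies. Result: Consistently outperforms existing gloss-free SLT methods on multiple benchmarks, validating the reasoning formulation and the new dataset; code and data are to be released. Conclusion: SLT is fundamentally a cross-modal reasoning task; explicitly modeling the thinking process and separating planning from evidence retrieval can significantly improve translation quality, and the new dataset supports more realistic and more challenging SLT research. Abstract: Many SLT systems quietly assume that brief chunks of signing map directly to spoken-language words. That assumption breaks down because signers often create meaning on the fly using context, space, and movement. We revisit SLT and argue that it is mainly a cross-modal reasoning task, not just a straightforward video-to-text conversion. We thus introduce a reasoning-driven SLT framework that uses an ordered sequence of latent thoughts as an explicit middle layer between the video and the generated text. These latent thoughts gradually extract and organize meaning over time. On top of this, we use a plan-then-ground decoding method: the model first decides what it wants to say, and then looks back at the video to find the evidence. This separation improves coherence and faithfulness. We also built and released a new large-scale gloss-free SLT dataset with stronger context dependencies and more realistic meanings. Experiments across several benchmarks show consistent gains over existing gloss-free methods. Code and data will be released upon acceptance at https://github.com/fletcherjiang/SignThought.
[172] RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework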
Hao Gao,Shaoyu Chen,Yifan Zhu,Yuehao Song,Wenyu Liu,Qian Zhang,Xinggang Wang
Main category: cs.CV
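TL;DR: This paper proposes RAD-2, a closed-loop planning framework that combines a diffusion generator with an RL-optimized discriminator. By decoupling generation from evaluation, introducing temporally consistent policy optimization and on-policy generator optimization, and pairing these with the BEV-Warp simulation environment, it significantly improves planning robustness and safety.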
Details
Motivation: Existing diffusion-based motion planners suffer from stochastic instability in closed-loop interaction and lack corrective negative feedback, making it hard to combine multimodal uncertainty modeling with robustness. Method: Proposes the RAD-2 framework: 1) a diffusion generator samples diverse trajectories; 2) an RL-optimized discriminator reranks them by long-term driving quality; 3) Temporally Consistent Group Relative Policy Optimization alleviates the credit assignment problem; 4) On-policy Generator Optimization converts closed-loop feedback into structured longitudinal optimization signals; 5) the BEV-Warp simulation environment enables efficient closed-loop evaluation in Bird's-Eye View feature space. Result: Reduces the collision rate by 56% compared with strong diffusion-based planners; real-world deployment confirms improved perceived safety and driving smoothness. Conclusion: Through generator-discriminator cooperation, temporally aware reinforcement learning, and efficient simulation, RAD-2 addresses the stability and optimization difficulties of diffusion planners in closed-loop driving and moves high-level autonomous driving planning toward more robust, deployable systems. Abstract: High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners.
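The decoupled generate-then-rerank design can be illustrated with a toy example. The sketch below is not the RAD-2 implementation: a random-walk sampler stands in for the diffusion generator, a hand-written cost stands in for the RL-optimized discriminator, and every name and constant is a hypothetical placeholder.

```python
import random
from typing import Callable, List

Trajectory = List[float]  # lateral offset in metres at each future timestep


def sample_candidates(rng: random.Random, num: int, horizon: int) -> List[Trajectory]:
    """Toy stand-in for the diffusion generator: random-walk trajectories."""
    candidates = []
    for _ in range(num):
        offset, trajectory = 0.0, []
        for _ in range(horizon):
            offset += rng.uniform(-0.5, 0.5)
            trajectory.append(offset)
        candidates.append(trajectory)
    return candidates


def toy_score(trajectory: Trajectory, obstacle: float = 1.0, gap: float = 0.5) -> float:
    """Toy stand-in for the learned discriminator: rewards smooth motion and
    penalizes ending within `gap` metres of a hypothetical obstacle."""
    previous = [0.0] + trajectory[:-1]
    smoothness_cost = sum((b - a) ** 2 for a, b in zip(previous, trajectory))
    collision_cost = 10.0 if abs(trajectory[-1] - obstacle) < gap else 0.0
    return -(smoothness_cost + collision_cost)


def rerank(candidates: List[Trajectory],
           score: Callable[[Trajectory], float]) -> List[Trajectory]:
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


rng = random.Random(0)
candidates = sample_candidates(rng, num=16, horizon=8)
best = rerank(candidates, toy_score)[0]
print(f"best score: {toy_score(best):.3f}")
```

The point of the split is visible even in this toy: the scorer only has to compare a handful of complete candidates, so a scalar reward never has to be pushed directly through the full trajectory space.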
Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
[173] TokenLight: Precise Lighting Control in Images using Attribute Tokens
Sumit Chaturvedi,Yannick Hold-Geoffroy,Mengwei Ren,Jingyuan Liu,He Zhang,Yiqun Mei,Julie Dorsey,Zhixin Shu
Main category: cs.CV
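TL;DR: This paper proposes an attribute-token-based image relighting method that enables continuous, precise control over multiple lighting attributes (such as intensity, color, ambient illumination, diffuse level, and 3D light position). Without explicit inverse rendering supervision, it exhibits an implicit understanding of light-geometry-material interactions and achieves state-of-the-art results.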
Details
Motivation: Existing image relighting methods struggle to provide fine-grained, continuous, disentangled control over multiple lighting attributes, and they often depend on inverse rendering supervision or generalize poorly. Method: Formulates relighting as a conditional image generation task and introduces attribute tokens that separately encode lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light position; trains on a large-scale synthetic dataset with lighting annotations, supplemented by a small set of real images to improve realism and generalization. Result: Achieves state-of-the-art quantitative and qualitative performance on both synthetic and real images; handles complex cases plausibly (such as placing lights inside objects or relighting transparent materials); the model implicitly learns how light interacts with geometry, occlusion, and materials. Conclusion: The attribute-token mechanism enables disentangled, controllable generation across multiple lighting dimensions and yields physically plausible relighting without inverse rendering supervision, markedly improving controllability, realism, and generalization. Abstract: This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly. Project page: vrroom.github.io/tokenlight/
[174] LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
Zhanhao Liang,Tao Yang,Jie Wu,Chengjian Feng,Liang Zheng
Main category: cs.CV
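TL;DR: This paper proposes LeapAlign, which shortens the generation trajectory of flow matching models by designing two leap steps, reducing memory cost and stably backpropagating reward gradients to early generation steps, and significantly improving image quality and image-text alignment.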
Details
Motivation: Methods that backpropagate reward gradients directly through the differentiable generation process incur high memory cost and gradient explosion on long trajectories, so they struggle to update the early generation steps that determine the global structure of the image. Method: Proposes LeapAlign: the long ODE sampling trajectory is compressed into a short trajectory of only two leaps, each skipping multiple steps and predicting a future latent in a single step; randomizing the start and end timesteps of the leaps enables efficient, stable updates at any generation step; trajectory-consistency weighting and down-weighting of large-magnitude gradient terms improve training stability. Result: When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across multiple metrics, significantly improving image quality and image-text alignment. Conclusion: LeapAlign is an efficient, stable, and scalable alignment method for flow matching models that removes the memory and stability bottlenecks of reward gradient backpropagation through long trajectories, offering a new paradigm for preference-based optimization of generative models. Abstract: This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.
[175] Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
Ninghui Xu,Fabio Tosi,Lihui Wang,Jiawei Han,Luca Bartolomei,Zhiting Yao,Matteo Poggi,Stefano Mattoccia
Main category: cs.CV
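TL;DR: This paper proposes the Bi-CMPStereo framework, which fuses event-camera and frame-camera data through a bidirectional cross-modal prompting mechanism to improve the robustness and accuracy of stereo matching in dynamic scenes.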