cs.CL
[1] Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries
Yuchen Zhang, Ravi Shekhar, Haralambos Mouratidis
Main category: cs.CL
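TL;DR: This paper proposes a connector-sharing strategy based on language families for multilingual ASR systems, reducing parameter count and improving cross-domain generalization.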
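The sharing idea can be illustrated with a small sketch: every language looks up its family, and all languages in a family receive the same connector object. The language-to-family table, the function name `build_family_connectors`, and the `make_connector` factory are hypothetical placeholders for illustration; the paper's actual grouping and connector architecture are not given in this digest.

```python
from typing import Callable, Dict, Iterable, TypeVar

C = TypeVar("C")

# Hypothetical language-to-family table (an assumption, not the paper's grouping).
LANGUAGE_FAMILY: Dict[str, str] = {
    "es": "romance", "it": "romance", "pt": "romance",
    "de": "germanic", "nl": "germanic",
}


def build_family_connectors(
    languages: Iterable[str],
    make_connector: Callable[[str], C],
) -> Dict[str, C]:
    """Create one connector per family and map each language to its family's connector."""
    by_family: Dict[str, C] = {}
    by_language: Dict[str, C] = {}
    for lang in languages:
        family = LANGUAGE_FAMILY[lang]
        if family not in by_family:
            # The connector is built once per family, not once per language.
            by_family[family] = make_connector(family)
        by_language[lang] = by_family[family]
    return by_language
```

With three languages from two families, only two connectors are created, which is where the parameter saving would come from under this assumption.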
Details
Motivation: Prior work trains a separate connector for each language and ignores relatedness between languages, which leads to redundant parameters and limited generalization.
Method: Connectors are shared by language family, with one connector per family, and the approach is validated on two multilingual LLMs and two real-world speech corpora.
Result: Family-based connectors reduce parameter count while generalizing better across data domains (for example, curated versus crowd-sourced speech).
Conclusion: Family-based connector sharing is an efficient, practical, and scalable approach to multilingual ASR deployment.
Abstract: Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.
[2] Self-Aware Knowledge Probing: Evaluating Language Models' Relational Knowledge through Confidence Calibration
Christopher Kissling, Elena Merdjanovska, Alan Akbik
Main category: cs.CL
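TL;DR: This paper proposes a calibration probing framework for relational knowledge that covers three modalities of model confidence (intrinsic confidence, structural consistency, and semantic grounding). It finds that most language models, especially masked pre-trained models, are overconfident, and that even large models fail to accurately encode the semantics of linguistic confidence expressions.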
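Overconfidence of this kind is commonly quantified with expected calibration error (ECE), which compares average confidence with empirical accuracy inside confidence bins. The sketch below is a generic ECE implementation, given only to make the notion of calibration concrete; it is an assumption that a metric of this family is relevant here, and the digest does not say which calibration metric the paper uses. The function name and the bin count are illustrative.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that answers with 0.9 confidence but is right only a quarter of the time gets an ECE of about 0.65 in this sketch, while a model whose confidence matches its accuracy scores near zero.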
Details
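Motivation: Existing knowledge probes focus only on metrics such as prediction accuracy and ignore the calibration (that is, the reliability) of model confidence, so they cannot fully assess the quality of the relational knowledge a language model has learned.
Method: A calibration probing framework quantifies model confidence in relational knowledge along three dimensions (intrinsic confidence, structural consistency, and semantic grounding), with a large-scale empirical analysis of ten causal and six masked language models.
Result: Most models, especially masked pre-trained ones, are markedly overconfident; confidence estimates that account for inconsistencies across rephrased statements are the best calibrated; even the largest pre-trained models fail to model the semantics of linguistic confidence expressions accurately.
Conclusion: Calibration is a key additional dimension for evaluating relational knowledge in language models; accuracy alone is insufficient, and analysis across several confidence modalities gives a fuller picture of what a model knows and where it falls short.
Abstract: Knowledge probing quantifies how much relational knowledge a language model (LM) has acquired during pre-training. Existing knowledge probes evaluate model capabilities through metrics like prediction accuracy and precision. Such evaluations fail to account for the model's reliability, reflected in the calibration of its confidence scores. In this paper, we propose a novel calibration probing framework for relational knowledge, covering three modalities of model confidence: (1) intrinsic confidence, (2) structural consistency and (3) semantic grounding. Our extensive analysis of ten causal and six masked language models reveals that most models, especially those pre-trained with the masking objective, are overconfident. The best-calibrated scores come from confidence estimates that account for inconsistencies due to statement rephrasing. Moreover, even the largest pre-trained models fail to encode the semantics of linguistic confidence expressions accurately.
[3] Flatter Tokens are More Valuable for Speculative Draft Model Training
Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu, Xu Yang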
Main category: cs.CL
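TL;DR: This paper proposes Sample-level-flatness-based Dataset Distillation (SFDD) to make draft-model training for speculative decoding (SD) more efficient: with only 50% of the data it achieves over 2x training speedup while keeping inference speedup within 4% of the full-dataset baseline.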
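As a rough illustration of filtering by flatness, the sketch below scores each token's predictive distribution by its normalized entropy, averages the scores over a sample, and keeps the flattest half of the samples. Normalized entropy is used here only as a stand-in: the paper defines its own flatness metric, which this digest does not spell out, and the function names and the `keep_ratio` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Entropy divided by log(vocabulary size): 0 for one-hot, 1 for uniform."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def sample_flatness(token_distributions: Sequence[Sequence[float]]) -> float:
    """Mean per-token flatness over one training sample."""
    if not token_distributions:
        return 0.0
    scores = [normalized_entropy(dist) for dist in token_distributions]
    return sum(scores) / len(scores)


def keep_flattest(
    samples: Sequence[Sequence[Sequence[float]]],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return the indices of the flattest samples, in their original order."""
    scores = [sample_flatness(sample) for sample in samples]
    keep = max(1, round(len(samples) * keep_ratio))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

In this sketch the distributions would come from the target model, in line with the abstract's statement that tokens inducing flatter target-model distributions are the more valuable ones.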
Details
Motivation: SD accelerates LLM inference, but training its draft model requires a large dataset; the authors find that training samples do not contribute equally to the SD acceptance rate, which motivates improving training efficiency from the data side.
Method: Based on theoretical analysis and empirical validation, the paper proposes a flatness metric that measures how flat the target model's predictive distribution is at each token, and builds SFDD on it to filter the training data down to the most valuable samples.
Result: On the EAGLE framework, SFDD achieves over 2x training speedup with 50% of the data, and the final model's inference speedup stays within 4% of the full-dataset baseline.
Conclusion: SFDD is an effective data-centric method that substantially improves training efficiency for speculative-decoding draft models, offering a lightweight, practical training optimization for SD.
Abstract: Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that not all training samples contribute equally to the SD acceptance rate. Specifically, our theoretical analysis and empirical validation reveal that tokens inducing flatter predictive distributions from the target model are more valuable than those yielding sharply peaked distributions. Based on this insight, we propose flatness, a new metric to quantify this property, and develop the Sample-level-flatness-based Dataset Distillation (SFDD) approach, which filters the training data to retain only the most valuable samples. Experiments on the EAGLE framework demonstrate that SFDD can achieve over 2$\times$ training speedup using only 50% of the data, while keeping the final model's inference speedup within 4% of the full-dataset baseline. This work introduces an effective, data-centric approach that substantially improves the training efficiency for Speculative Decoding. Our code is available at https://anonymous.4open.science/r/Flatness.
[4] BabyReasoningBench: Generating Developmentally-Inspired Reasoning Tasks for Evaluating Baby Language Models
Kaustubh D. Dhole
Main category: cs.CL
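TL;DR: This paper presents BabyReasoningBench, a benchmark of 19 reasoning tasks grounded in developmental psychology paradigms, for evaluating the reasoning abilities of language models trained on child-directed text ("baby language models"). Results show that causal and physical reasoning improve with scale, while theory-of-mind and pragmatics-sensitive tasks remain weak.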
Details
Motivation: Traditional reasoning benchmarks presuppose adult knowledge and abilities, so they are ill-suited to baby language models trained on child-directed corpora and cannot reveal how such models' reasoning abilities actually develop.
Method: BabyReasoningBench is generated with GPT-5.2 and covers theory of mind, analogical and relational reasoning, causal inference and intervention selection, and core reasoning primitives; two GPT-2-based baby language models (pretrained on 10M and 100M of child-directed speech text) are evaluated.
Result: Overall performance is low but uneven across task families: several causal and physical reasoning tasks improve with scale, while belief attribution and pragmatics-sensitive tasks remain difficult.
Conclusion: BabyReasoningBench offers a developmentally grounded lens for analyzing which kinds of reasoning emerge under child-like training distributions and for testing mechanistic hypotheses about how they emerge.
Abstract: Traditional evaluations of reasoning capabilities of language models are dominated by adult-centric benchmarks that presuppose broad world knowledge, complex instruction following, and mature pragmatic competence. These assumptions are mismatched to baby language models trained on developmentally plausible input such as child-directed speech and early-childhood narratives, and they obscure which reasoning abilities (if any) emerge under such constraints. We introduce BabyReasoningBench, a GPT-5.2 generated benchmark of 19 reasoning tasks grounded in classic paradigms from developmental psychology, spanning theory of mind, analogical and relational reasoning, causal inference and intervention selection, and core reasoning primitives that are known to be confounded by memory and pragmatics. We find that two GPT-2 based baby language models (pretrained on 10M and 100M of child-directed speech text) show overall low but uneven performance, with dissociations across task families: scaling improves several causal and physical reasoning tasks, while belief attribution and pragmatics-sensitive tasks remain challenging. BabyReasoningBench provides a developmentally grounded lens for analyzing what kinds of reasoning are supported by child-like training distributions, and for testing mechanistic hypotheses about how such abilities emerge.
[5] LLMs versus the Halting Problem: Revisiting Program Termination Prediction
Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O'Hearn
Main category: cs.CL
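TL;DR: This paper evaluates LLMs on predicting the termination of C programs. Models such as GPT-5 and Claude Sonnet-4.5 perform close to the top traditional verification tools, but they often fail to provide a valid termination witness, and their performance drops on longer programs.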
Details
Motivation: Prompted by the success of LLMs across many tasks, the paper asks whether they can reliably predict program termination, a problem that is undecidable in general.
Method: Several mainstream LLMs are evaluated on a diverse set of C programs from the SV-Comp 2025 Termination category and compared against the ranking of traditional verification tools; the analysis also covers their ability to provide a witness as proof and their sensitivity to program length.
Result: GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked SV-Comp 2025 tool (using test-time scaling), and CWM just behind the second-ranked tool; however, LLMs often fail to produce a valid witness, and performance drops as program length increases.
Conclusion: LLMs are surprisingly effective at predicting program termination, suggesting they could help with certain classically undecidable problems, but they cannot yet replace formal verification tools, particularly with respect to interpretability and robustness.
Abstract: Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raise the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLM performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
[6] Malicious Repurposing of Open Science Artefacts by Using Large Language Models
Zahra Hashemi, Zhiqiang Zhong, Jun Pang, Wei Zhao
Main category: cs.CL
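TL;DR: This paper presents an end-to-end pipeline that uses persuasion-based jailbreaking to bypass LLM safeguards, then identifies and maliciously repurposes open science artefacts (datasets, methods, and tools) from NLP papers, and assesses the resulting risk with a three-dimensional evaluation framework (harmfulness, feasibility of misuse, and soundness of technicality). LLMs can generate harmful proposals, but LLM evaluators disagree strongly with one another, so they cannot replace humans for credible dual-use risk assessment.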
Details
Motivation: Existing work focuses on the positive role of LLMs in scientific discovery and overlooks the possibility that they could be misused to produce harmful research by repurposing open science artefacts; this paper addresses that gap.
Method: An end-to-end pipeline: (1) persuasion-based jailbreaking to bypass LLM safeguards; (2) reinterpreting NLP papers to identify and maliciously repurpose their open artefacts (datasets, methods, and tools); (3) an evaluation framework covering harmfulness, feasibility of misuse, and soundness of technicality, with GPT-4.1, Gemini-2.5-pro, and Grok-3 as evaluators.
Result: LLMs can generate harmful research proposals by repurposing ethically designed open artefacts, but the LLM evaluators differ markedly: GPT-4.1 assigns the highest scores, Gemini-2.5-pro is the strictest, and Grok-3 falls in between, indicating that LLMs cannot yet reliably assess dual-use risk on their own.
Conclusion: Current LLMs are capable of generating harmful content but are unreliable as evaluators, so human expert evaluation remains essential to dual-use risk governance.
Abstract: The rapid evolution of large language models (LLMs) has fuelled enthusiasm about their role in advancing scientific discovery, with studies exploring LLMs that autonomously generate and evaluate novel research ideas. However, little attention has been given to the possibility that such models could be exploited to produce harmful research by repurposing open science artefacts for malicious ends. We fill the gap by introducing an end-to-end pipeline that first bypasses LLM safeguards through persuasion-based jailbreaking, then reinterprets NLP papers to identify and repurpose their artefacts (datasets, methods, and tools) by exploiting their vulnerabilities, and finally assesses the safety of these proposals using our evaluation framework across three dimensions: harmfulness, feasibility of misuse, and soundness of technicality. Overall, our findings demonstrate that LLMs can generate harmful proposals by repurposing ethically designed open artefacts; however, we find that LLMs acting as evaluators strongly disagree with one another on evaluation outcomes: GPT-4.1 assigns higher scores (indicating greater potential harms, higher soundness and feasibility of misuse), Gemini-2.5-pro is markedly stricter, and Grok-3 falls between these extremes. This indicates that LLMs cannot yet serve as reliable judges in a malicious evaluation setup, making human evaluation essential for credible dual-use risk assessment.
[7] FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar
Main category: cs.CL
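TL;DR: FROST is an attention-aware method for efficient reasoning that uses attention weights to prune uncritical reasoning paths, substantially reducing token usage and improving accuracy.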
Details
Motivation: Traditional reasoning approaches are inefficient and produce redundant paths; efficiency and reliability need to improve while reasoning ability is preserved.
Method: The paper introduces the concept of reasoning outliers and designs an attention-based mechanism that identifies and removes them at the sentence level.
Result: FROST outperforms state-of-the-art methods such as TALE and ThinkLess on four benchmarks; it reduces token usage by 69.68% on average and improves accuracy by 26.70% over the base model; it also lowers attention outlier metrics (maximum infinity norm down 15.97%, average kurtosis down 91.09%).
Conclusion: FROST yields shorter, more reliable reasoning trajectories while preserving or even enhancing reasoning capacity, supporting the effectiveness of attention-guided pruning.
Abstract: We propose FROST, an attention-aware method for efficient reasoning. Unlike traditional approaches, FROST leverages attention weights to prune uncritical reasoning paths, yielding shorter and more reliable reasoning trajectories. Methodologically, we introduce the concept of reasoning outliers and design an attention-based mechanism to remove them. Theoretically, FROST preserves and enhances the model's reasoning capacity while eliminating outliers at the sentence level. Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-OSS-20B), outperforming state-of-the-art methods such as TALE and ThinkLess. Notably, FROST achieves an average 69.68% reduction in token usage and a 26.70% improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm by 15.97% and the average kurtosis by 91.09% compared to the base model. Code is available at https://github.com/robinzixuan/FROST
[8] Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback
Siddhant Arora, Jinchuan Tian, Jiatong Shi, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Main category: cs.CL
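TL;DR: This paper proposes the first multi-reward RLAIF framework for spoken dialogue systems (SDS), combining semantic, audio-quality, and emotion-consistency rewards, and uses turn-level preference sampling with blockwise DPO optimization to improve conversational quality along multiple dimensions.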
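One way to read "aggregate per-block log-probabilities within a single DPO objective" is that the log-probability of a whole turn is the sum of its block log-probabilities, which is then plugged into the standard DPO loss. The sketch below follows that reading; the summation, the function name, and the default `beta` are assumptions for illustration, and the paper may aggregate blocks differently.

```python
import math
from typing import Sequence


def blockwise_dpo_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Standard DPO loss on turn-level log-probs obtained by summing per-block log-probs."""
    # Turn-level log-ratio of policy to reference for each response.
    chosen_margin = sum(policy_chosen) - sum(ref_chosen)
    rejected_margin = sum(policy_rejected) - sum(ref_rejected)
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), evaluated in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy equals the reference model the loss is log 2, and it falls below that value once the policy assigns relatively more probability to the preferred turn than the reference does.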
Details
Motivation: Existing RLHF/RLAIF methods for SDS use a single semantic reward, overlooking the multi-dimensional, multi-modal nature of conversational quality (semantic coherence, audio naturalness, speaker consistency, emotion alignment, and turn-taking behavior), and they are mismatched with incremental duplex decoding.
Method: A multi-reward RLAIF framework for SDS combines semantic, audio-quality, and emotion-consistency rewards; it applies turn-level preference sampling and aggregates per-block log-probabilities within a single DPO objective to fit incremental duplex models; a multi-reward DPO dataset is built and released.
Result: Single-reward RLAIF improves only its targeted metric, whereas joint multi-reward training yields consistent gains in both semantic quality and audio naturalness.
Conclusion: Holistic multi-reward alignment matters for practical spoken dialogue systems; the work provides a systematic study of preference learning for SDS together with reproducible resources.
Abstract: Reinforcement learning from human or AI feedback (RLHF/RLAIF) for speech-in/speech-out dialogue systems (SDS) remains underexplored, with prior work largely limited to single semantic rewards applied at the utterance level. Such setups overlook the multi-dimensional and multi-modal nature of conversational quality, which encompasses semantic coherence, audio naturalness, speaker consistency, emotion alignment, and turn-taking behavior. Moreover, they are fundamentally mismatched with duplex spoken dialogue systems that generate responses incrementally, where agents must make decisions based on partial utterances. We address these limitations with the first multi-reward RLAIF framework for SDS, combining semantic, audio-quality, and emotion-consistency rewards. To align utterance-level preferences with incremental, blockwise decoding in duplex models, we apply turn-level preference sampling and aggregate per-block log-probabilities within a single DPO objective. We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models, and release a multi-reward DPO dataset to support reproducible research. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness. These results highlight the importance of holistic, multi-reward alignment for practical conversational SDS.
[9] PsyProbe: Proactive and Interpretable Dialogue through User State Modeling for Exploratory Counseling
Sohhyung Park, Hyunji Kang, Sungzoon Cho, Dongil Kim
Main category: cs.CL
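TL;DR: This paper presents PsyProbe, a mental health dialogue system for the exploration phase of counseling, built on the PPPPPI framework and cognitive error detection. It asks proactive questions through modules for state building, memory construction, strategy planning, and response generation, and is validated in real-world Korean counseling scenarios.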
Details
Motivation: Existing mental health dialogue systems are mostly reactive and lack systematic modeling of the user's psychological state, making it hard to support proactive therapeutic exploration.
Method: PsyProbe integrates the PPPPPI framework with cognitive error detection and has four components: State Builder (extracting structured psychological profiles), Memory Construction (tracking information gaps), Strategy Planner (Motivational Interviewing behavioral codes), and Response Generator (with Question Ideation and Critic/Revision modules).
Result: With 27 participants in real-world Korean counseling scenarios, PsyProbe outperforms baseline and ablation modes in automatic evaluation; user evaluation shows increased engagement intention and improved naturalness; expert evaluation by a certified counselor shows improved core issue understanding and question rates comparable to professional counselors.
Conclusion: Systematic user state modeling combined with proactive questioning is effective for therapeutic exploration in counseling dialogue.
Abstract: Recent advances in large language models have enabled mental health dialogue systems, yet existing approaches remain predominantly reactive, lacking systematic user state modeling for proactive therapeutic exploration. We introduce PsyProbe, a dialogue system designed for the exploration phase of counseling that systematically tracks user psychological states through the PPPPPI framework (Presenting, Predisposing, Precipitating, Perpetuating, Protective, Impact) augmented with cognitive error detection. PsyProbe combines State Builder for extracting structured psychological profiles, Memory Construction for tracking information gaps, Strategy Planner for Motivational Interviewing behavioral codes, and Response Generator with Question Ideation and Critic/Revision modules to generate contextually appropriate, proactive questions. We evaluate PsyProbe with 27 participants in real-world Korean counseling scenarios, including automatic evaluation across ablation modes, user evaluation, and expert evaluation by a certified counselor. The full PsyProbe model consistently outperforms baseline and ablation modes in automatic evaluation. User evaluation demonstrates significantly increased engagement intention and improved naturalness compared to baseline. Expert evaluation shows that PsyProbe substantially improves core issue understanding and achieves question rates comparable to professional counselors, validating the effectiveness of systematic state modeling and proactive questioning for therapeutic exploration.
[10] Leveraging Sentence-oriented Augmentation and Transformer-Based Architecture for Vietnamese-Bahnaric Translation
Tan Sang Nguyen, Quoc Nguyen Pham, Tho Quan
Main category: cs.CL
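TL;DR: This paper proposes two lightweight data augmentation strategies to improve Vietnamese-to-Bahnaric neural machine translation without extra data or complex preprocessing.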
Details
Motivation: Bahnaric, a minority language of Vietnam, is hard to translate automatically because resources are scarce; low-cost, easily deployed translation technology is needed to support its digital preservation and transmission across generations.
Method: State-of-the-art NMT techniques are combined with two flexible augmentation strategies for domain-specific translation that require no additional data or complex preprocessing and work with the existing parallel corpora.
Result: The proposed methods are reported to improve translation quality on the Vietnamese-to-Bahnaric task and are compatible with a range of NMT models.
Conclusion: The study offers a practical, scalable path for machine translation of low-resource minority languages, supporting the digital survival and everyday use of endangered languages.
Abstract: The Bahnar people, an ethnic minority in Vietnam with a rich ancestral heritage, possess a language of immense cultural and historical significance. The government places a strong emphasis on preserving and promoting the Bahnaric language by making it accessible online and encouraging communication across generations. Recent advancements in artificial intelligence, such as Neural Machine Translation (NMT), have brought about a transformation in translation by improving accuracy and fluency. This, in turn, contributes to the revival of the language through educational efforts, communication, and documentation. Specifically, NMT is pivotal in enhancing accessibility for Bahnaric speakers, making information and content more readily available. Nevertheless, the translation of Vietnamese into Bahnaric faces practical challenges due to resource constraints, especially given the limited resources available for the Bahnaric language. To address this, we employ state-of-the-art techniques in NMT along with two augmentation strategies for the domain-specific Vietnamese-Bahnaric translation task. Importantly, both approaches are flexible and can be used with various neural machine translation models. Additionally, they do not require complex data preprocessing steps, the training of additional systems, or the acquisition of extra data beyond the existing training parallel corpora.
[11] Transparency-First Medical Language Models: Datasheets, Model Cards, and End-to-End Data Provenance for Clinical NLP
Olaf Yunus Laitinen Imanov, Taner Yilmaz, Ayse Tuba Tugrul, Melike Nesrin Zaman, Ozkan Gunalp, Duygu Erisken, Sila Burde Dulger, Rana Irem Turhan, Izzet Ozdemir, Derya Umut Kulali, Ozan Akbulut, Harun Demircioglu, Hasan Basri Kara, Berfin Tavan
Main category: cs.CL
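TL;DR: This paper introduces TeMLM, a transparency-first set of release artifacts for clinical language models covering provenance, data transparency, modeling transparency, and governance, instantiated and benchmarked on the synthetic clinical dataset Technetium-I.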
Details
Motivation: Releases of clinical language models need better traceability, data and modeling transparency, and governance; medical AI currently lacks a standardized transparent release mechanism.
Method: The TeMLM artifact suite (TeMLM-Card, TeMLM-Datasheet, TeMLM-Provenance) and a lightweight conformance checklist are defined; ProtactiniumBERT is evaluated on PHI de-identification and ICD-9 code extraction using the synthetic dataset Technetium-I (498,000 clinical notes, 7.74M PHI entity annotations, and ICD-9-CM diagnosis labels).
Result: The TeMLM artifacts are instantiated, and reference results are reported for ProtactiniumBERT on PHI de-identification (token classification) and top-50 ICD-9 code extraction (multi-label classification); synthetic data is useful for validating tooling and process, but models should be validated on real clinical data before deployment.
Conclusion: TeMLM provides a unified, machine-checkable transparent release framework for clinical language models that supports trustworthy development and auditing of medical AI, though its effectiveness ultimately depends on validation with real-world data.
Abstract: We introduce TeMLM, a set of transparency-first release artifacts for clinical language models. TeMLM unifies provenance, data transparency, modeling transparency, and governance into a single, machine-checkable release bundle. We define an artifact suite (TeMLM-Card, TeMLM-Datasheet, TeMLM-Provenance) and a lightweight conformance checklist for repeatable auditing. We instantiate the artifacts on Technetium-I, a large-scale synthetic clinical NLP dataset with 498,000 notes, 7.74M PHI entity annotations across 10 types, and ICD-9-CM diagnosis labels, and report reference results for ProtactiniumBERT (about 100 million parameters) on PHI de-identification (token classification) and top-50 ICD-9 code extraction (multi-label classification). We emphasize that synthetic benchmarks are valuable for tooling and process validation, but models should be validated on real clinical data prior to deployment.
[12] Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs
Chi Zhang, Wenxuan Ding, Jiale Liu, Mingrui Wu, Qingyun Wu, Ray Mooney
Main category: cs.CL
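TL;DR: This paper proposes the CONTEXT-VQA dataset for evaluating how robust vision-language models (VLMs) are to misleading textual prompts that contradict image evidence. Experiments on 11 state-of-the-art VLMs show an average performance drop of over 48.2%, exposing a serious vulnerability to textual misinformation.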
Details
Motivation: Existing research mostly studies misinformation in text-only settings, while it remains unclear how VLMs arbitrate between contradictory information across modalities (for example, text that conflicts with an image); a systematic evaluation is needed.
Method: The CONTEXT-VQA dataset combines image-question pairs with systematically generated persuasive prompts that conflict with the visual evidence; an evaluation framework is designed and run to test the robustness of 11 state-of-the-art VLMs.
Result: All tested VLMs are vulnerable to misleading text, often overriding clear visual evidence in favor of the conflicting text; a single round of persuasive conversation causes an average performance drop of over 48.2%.
Conclusion: Current VLMs have a critical robustness limitation when modalities conflict and need stronger resistance to textual manipulation.
Abstract: Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge the gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. Then, a thorough evaluation framework is designed and executed to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and show an average performance drop of over 48.2% after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.
[13] How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability
Shawn Im, Changdae Oh, Zhen Fang, Sharon Li
Main category: cs.CL
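TL;DR: Through the lens of training dynamics, this paper analyzes how attention-based language models learn semantic associations from natural language data. It derives closed-form expressions for the weights at early stages of training, showing that each set of transformer weights can be written as a composition of three basis functions (bigram, token-interchangeability, and context mappings).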
Details
Motivation: Understanding how semantic associations are learned and represented in language models helps connect deep learning with linguistic theory and build a mechanistic foundation for LLMs.
Method: Using a leading-term approximation of the gradients, the paper derives closed-form expressions for the weights early in training and decomposes each set of transformer weights into compositions of three basis functions that reflect corpus statistics.
Result: Each set of transformer weights can be expressed as a simple composition of the bigram, token-interchangeability, and context-mapping basis functions; experiments show that the theoretical characterizations closely match the weights learned by real-world LLMs.
Conclusion: Semantic associations arise from the statistics of the text corpus and can be explained by compositions of the basis functions; the theory offers a mechanistic perspective for interpreting semantic representations in transformers.
Abstract: Semantic associations such as the link between "bird" and "flew" are foundational for language modeling as they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Through our analysis, we reveal that each set of weights of the transformer has closed-form expressions as simple compositions of three basis functions (bigram, token-interchangeability, and context mappings), reflecting the statistics of the text corpus and uncovering how each component of the transformer captures semantic associations based on these compositions. Experiments on real-world LLMs demonstrate that our theoretical weight characterizations closely match the learned weights, and qualitative analyses further show how our theorem sheds light on interpreting the learned associations in transformers.
[14] A Hybrid Supervised-LLM Pipeline for Actionable Suggestion Mining in Unstructured Customer Reviews
Aakash Trivedi, Aniket Upadhyay, Pratik Narang, Dhruv Kumar, Praveen Kumar
Main category: cs.CL
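TL;DR: This paper proposes a hybrid pipeline combining a high-recall RoBERTa classifier with an instruction-tuned LLM to extract, categorize, cluster, and summarize actionable suggestions from customer reviews; it outperforms several baselines on real-world hospitality and food datasets.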
Details
Motivation: Existing approaches struggle to isolate the precise, fine-grained improvement instructions businesses need from mixed-intent, unstructured customer reviews.
Method: A hybrid pipeline: a RoBERTa classifier trained with a precision-recall surrogate loss first selects suggestion-bearing sentences with high recall, and an instruction-tuned LLM then performs suggestion extraction, categorization, clustering, and summarization.
Result: On real-world hospitality and food datasets, the system outperforms prompt-only, rule-based, and classifier-only baselines in extraction accuracy and cluster coherence; human evaluation confirms that its outputs are clear, faithful, and interpretable.
Conclusion: The hybrid architecture meaningfully improves fine-grained actionable suggestion mining, while highlighting challenges in domain adaptation and efficient local deployment.
Abstract: Extracting actionable suggestions from customer reviews is essential for operational decision-making, yet these directives are often embedded within mixed-intent, unstructured text. Existing approaches either classify suggestion-bearing sentences or generate high-level summaries, but rarely isolate the precise improvement instructions businesses need. We evaluate a hybrid pipeline combining a high-recall RoBERTa classifier, trained with a precision-recall surrogate to reduce unrecoverable false negatives, with a controlled, instruction-tuned LLM for suggestion extraction, categorization, clustering, and summarization. Across real-world hospitality and food datasets, the hybrid system outperforms prompt-only, rule-based, and classifier-only baselines in extraction accuracy and cluster coherence. Human evaluations further confirm that the resulting suggestions and summaries are clear, faithful, and interpretable. Overall, our results show that hybrid reasoning architectures achieve meaningful improvements in fine-grained actionable suggestion mining while highlighting challenges in domain adaptation and efficient local deployment.
[15] DREAMSTATE: Diffusing States and Parameters for Recurrent Large Language Models
Liu Xiao
Main category: cs.CL
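TL;DR: This paper proposes DREAMSTATE, a framework that uses a conditional Diffusion Transformer (DiT) to model the probability manifold of the RWKV model's internal state, enabling state generation and editing. It further designs a hybrid architecture that combines the local modeling strengths of RNNs with global context adaptability: a parallel DiT dynamically adjusts the WKV parameters, turning the fixed recurrence mechanism into a context-aware dynamic function.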
Details
Motivation: Modern RNNs such as RWKV offer efficient short-range modeling and fixed-size states, but their internal state has barely been studied as an editable knowledge representation.
Method: (1) The DREAMSTATE framework models the probability manifold of the RWKV state with a conditional Diffusion Transformer, supporting generation and editing; (2) t-SNE visualizations and controlled generation experiments validate the structure of the state representation; (3) a hybrid architecture uses a parallel DiT to process variable-length global context and dynamically generate and adjust RWKV's WKV parameters, making the recurrence context-aware; (4) a multi-objective loss stabilizes training.
Result: The representational potential of the RWKV state is uncovered and modeled; the hybrid architecture can be trained stably with the multi-objective loss; t-SNE and controlled generation experiments indicate that the state is structured; the code is publicly available.
Conclusion: The work opens a research direction on RNN state as an editable knowledge representation and provides a concrete architectural reference for models that combine local recurrence with global context modeling.
Abstract: Modern Recurrent Neural Networks (RNNs), such as RWKV, are distinguished by their powerful short-range modeling capabilities and efficient fixed-size states, which constitute a core advantage over standard Transformers. However, there is a significant lack of research into their internal state as an editable knowledge representation. To fill this gap, we first explore the representational properties of the RWKV state by proposing the DREAMSTATE framework. This framework utilizes a conditional Diffusion Transformer (DiT) to directly model the probability manifold of the state, enabling its generation and editing. The structural nature of this representation is validated through t-SNE visualizations and controlled generation experiments. After successfully uncovering and modeling the state's representational potential, we further propose a novel hybrid architecture that combines the local advantages of RNNs with global context adaptability. This architecture features a parallel DiT that processes a variable-length global context to dynamically generate and adjust the core recurrent module's WKV parameters, transforming the fixed recurrence mechanism into a context-aware dynamic function. Experiments demonstrate that this hybrid model can be trained stably via a multi-objective loss, validating its design feasibility. Our work not only opens a new research direction for RNN state representation but also provides a concrete architectural reference for future model design. The code is publicly available at: https://huggingface.co/2dgx41s/DreamState.
[16] RPO-RAG: Aligning Small LLMs with Relation-aware Preference Optimization for Knowledge Graph Question Answering
Kaehyun Um, KyuHwan Yeom, Haerim Yang, Minyoung Choi, Hyeongjun Yang, Kyong-Ho Lee
Main category: cs.CL
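TL;DR: This paper proposes RPO-RAG, a knowledge-graph-based retrieval-augmented generation framework designed for small language models. Through semantics-aware path sampling, relation-aware preference optimization, and answer-centered prompt design, it substantially improves small models on knowledge graph question answering.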
Details
Motivation: Existing KG-based RAG methods use semantics-unaware path sampling, are weakly aligned with KG reasoning objectives, and do not organize retrieved paths into answer-centered reasoning paths; they also mostly rely on large models, leaving sub-7B models underexplored.
Method: RPO-RAG comprises (1) a query-path semantic sampling strategy; (2) relation-aware preference optimization that aligns training with intermediate KG reasoning signals; and (3) an answer-centered prompt design that organizes entities and reasoning paths in an interpretable format.
Result: On the KGQA benchmarks WebQSP and CWQ: F1 improves by up to 8.8% on WebQSP; on CWQ, the method achieves new state-of-the-art Hit and F1 among models under 8B parameters; reasoning capability also improves substantially for models under 3B parameters.
Conclusion: RPO-RAG narrows the gap between small and large models on KGQA and highlights the potential of small models for resource-constrained and on-device KGQA applications.
Abstract: Large Language Models (LLMs) have recently demonstrated remarkable reasoning abilities, yet hallucinate on knowledge-intensive tasks. Retrieval-augmented generation (RAG) mitigates this issue by grounding answers in external sources, e.g., knowledge graphs (KGs). However, existing KG-based RAG approaches rely on semantics-unaware path sampling and are weakly aligned with KG reasoning objectives, which limits further accuracy gains. They also feed retrieved paths directly into the reasoner without organizing them into answer-centered reasoning paths, hindering small LLMs' ability to leverage the retrieved knowledge. Furthermore, prior works predominantly rely on large LLMs (e.g., ChatGPT/GPT-4) or assume backbones above 7B parameters, leaving sub-7B models underexplored. We address this gap with RPO-RAG, the first KG-based RAG framework specifically designed for small LLMs, to the best of our knowledge. RPO-RAG introduces three key innovations: (1) a query-path semantic sampling strategy that provides informative supervisory signals; (2) a relation-aware preference optimization that aligns training with intermediate KG reasoning signals (e.g., relation); and (3) an answer-centered prompt design that organizes entities and reasoning paths in an interpretable format. Extensive experiments on two benchmark Knowledge Graph Question Answering (KGQA) datasets, WebQSP and CWQ, demonstrate that RPO-RAG effectively bridges the performance gap between small and large language models. On WebQSP, it improves F1 by up to 8.8%, reflecting enhanced answer precision, while on CWQ it achieves new state-of-the-art results among models under 8B parameters in both Hit and F1. Overall, RPO-RAG substantially improves the reasoning capability of small LLMs, even under 3B parameters, highlighting their potential for resource-efficient and practical on-device KGQA applications.
[17] DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models
Xinlong Chen, Weihong Lin, Jingyun Hua, Linli Yao, Yue Ding, Bozhou Li, Bohan Zeng, Yang Shi, Qiang Liu, Yuanxing Zhang, Pengfei Wan, Liang Wang, Tieniu Tan
Main category: cs.CL
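TL;DR: This paper proposes DiaDem, a model that improves the accuracy of dialogue descriptions in audiovisual video captions through a synthesized high-quality dataset and a two-stage GRPO strategy, and builds the DiaDemBench benchmark for systematically evaluating dialogue description capability.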
Details
Motivation: Existing audiovisual captioning models struggle to produce faithful, accurate dialogue descriptions, which limits downstream understanding and generation tasks.
Method: DiaDem is trained on a synthesized high-quality supervised fine-tuning (SFT) dataset and then refined with a difficulty-partitioned two-stage GRPO strategy; the DiaDemBench benchmark emphasizes speaker attribution accuracy and utterance transcription fidelity.
Result: Experiments on DiaDemBench show that even commercial models leave substantial room for improvement; DiaDem outperforms the Gemini series in dialogue description accuracy and is competitive on general audiovisual captioning benchmarks.
Conclusion: DiaDem improves the precision of dialogue descriptions in audiovisual captions while maintaining overall performance, and DiaDemBench provides a systematic evaluation standard for the task.
Abstract: Accurate dialogue description in audiovisual video captioning is crucial for downstream understanding and generation tasks. However, existing models generally struggle to produce faithful dialogue descriptions within audiovisual captions. To mitigate this limitation, we propose DiaDem, a powerful audiovisual video captioning model capable of generating captions with more precise dialogue descriptions while maintaining strong overall performance. We first synthesize a high-quality dataset for SFT, then employ a difficulty-partitioned two-stage GRPO strategy to further enhance dialogue descriptions. To enable systematic evaluation of dialogue description capabilities, we introduce DiaDemBench, a comprehensive benchmark designed to evaluate models across diverse dialogue scenarios, emphasizing both speaker attribution accuracy and utterance transcription fidelity in audiovisual captions. Extensive experiments on DiaDemBench reveal that even commercial models still exhibit substantial room for improvement in dialogue-aware captioning. Notably, DiaDem not only outperforms the Gemini series in dialogue description accuracy but also achieves competitive performance on general audiovisual captioning benchmarks, demonstrating its overall effectiveness.
[18] Riddle Quest : The Enigma of Words
Niharika Sri Parasa, Chaitali Diwan, Srinath Srinivasa
Main category: cs.CL
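TL;DR: This paper proposes a pipeline for generating and evaluating analogy-based riddles, built from four modules (concept triple construction, semantic mapping, stylized generation, and answer validation), and uses it to examine whether large language models can identify all valid answers to a riddle. Models often overlook non-primary answers, which reveals limits in reasoning coverage and ambiguity handling.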
Details
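Motivation: To examine the reasoning coverage and ambiguity handling of large language models when understanding and generating analogy-based riddles with multiple plausible interpretations, using riddles as a lightweight probe of a model's cognitive breadth. Method: Builds a four-stage pipeline: 1) a triples creator extracts structured facts about a concept; 2) a semantic mapper selects attributes suited to analogy; 3) a stylized generator turns the attributes into riddle clues; 4) a validator enumerates all possible answers, which are then used to evaluate model performance. Result: Experiments show that large language models often guess the main answer but frequently miss other semantically plausible alternatives, indicating incomplete reasoning coverage and limited sensitivity to semantic ambiguity. Conclusion: Analogy-based riddles are an effective, lightweight evaluation tool that exposes key weaknesses of current large language models in open-ended reasoning and in handling multiple meanings, and they point to directions for improving robustness and cognitive flexibility. Abstract: Riddles are concise linguistic puzzles that describe an object or idea through indirect, figurative, or playful clues. They are a longstanding form of creative expression, requiring the solver to interpret hints, recognize patterns, and draw inferences to identify the answers. In this work, we introduce a simple pipeline for creating and evaluating analogy-based riddles. The system includes a triples creator that builds structured facts about a concept, a semantic mapper that selects attributes useful for analogy, a stylized generator that turns them into riddle clues, and a validator that collects all possible answers the riddle could point to. We use this validator to study whether large language models can recover the full answer set for different riddle types. Our case study shows that while models often guess the main intended answer, they frequently miss other valid interpretations. This highlights the value of riddles as a lightweight tool for examining reasoning coverage and ambiguity handling in language models.
[19] DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference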
Fuliang Liu,Xue Li,Ketai Zhao,Yinxi Gao,Ziyan Zhou,Zhonghui Zhang,Zhibin Wang,Wanchun Dou,Sheng Zhong,Chen Tian
Main category: cs.CL
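TL;DR: DART is a diffusion-inspired parallel speculative decoding method. It predicts the logits of multiple future tokens in parallel in a single forward pass, removing autoregressive inference from the draft model, and combines this with an N-gram-constrained tree pruning algorithm. This substantially reduces draft-stage overhead while keeping draft accuracy high, yielding a 2x to 3.4x end-to-end speedup.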
Details
Motivation: Existing model-based drafting mechanisms such as EAGLE3 improve accuracy but rely on multi-step autoregressive inference, so the drafting stage has high latency and becomes a performance bottleneck. Method: Proposes DART, which uses the target model's hidden states to predict logits for multiple masked positions in parallel in a single forward pass, and designs an efficient N-gram-constrained tree pruning algorithm to build high-quality draft token trees with semantic continuity. Result: Achieves a 2.03x to 3.44x wall-clock speedup across multiple datasets, on average 30% faster than EAGLE3, with substantially lower draft-stage overhead and high draft accuracy. Conclusion: DART offers a lightweight, efficient, and practical approach to speculative decoding that combines low latency with high accuracy and supports practical LLM inference acceleration. Abstract: Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference, resulting in high drafting latency and ultimately rendering the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity. DART substantially reduces draft-stage overhead while preserving high draft accuracy, leading to significantly improved end-to-end decoding speed. Experimental results demonstrate that DART achieves a 2.03x to 3.44x wall-clock time speedup across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at https://github.com/fvliang/DART.
[20] ReToP: Learning to Rewrite Electronic Health Records for Clinical Prediction
Jesus Lovon-Melgarejo,Jose G. Moreno,Christine Damase-Michel,Lynda Tamine
Main category: cs.CL
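TL;DR: This paper proposes Rewrite-To-Predict (ReToP), a framework that trains an EHR rewriter and a clinical predictor end to end. It generates pseudo-labels through clinically driven feature selection and introduces a Classifier Supervised Contribution (CSC) score so that rewrites align with the prediction task, improving clinical prediction performance.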
Details
Motivation: Existing LLM-based EHR modeling approaches are mostly task-agnostic: they use the LLM only as an encoder or completion module and do not fully integrate signals from the prediction task, which limits prediction accuracy. Method: Proposes the ReToP framework, consisting of an EHR rewriter and a clinical predictor trained jointly end to end; uses clinically driven feature selection to generate diverse synthetic pseudo-labels for fine-tuning the rewriter; designs the Classifier Supervised Contribution (CSC) score to steer rewrites toward task-relevant clinical features. Result: Substantially outperforms strong baselines on three clinical tasks on MIMIC-IV; generalizes across datasets and tasks with minimal fine-tuning; rewrites remain clinically faithful and highlight prediction-critical features. Conclusion: Through task-aligned EHR rewriting, ReToP bridges the gap between the general representational ability of LLMs and the needs of specific clinical prediction tasks, offering a new approach to interpretable, accurate LLM-enhanced clinical prediction. Abstract: Electronic Health Records (EHRs) provide crucial information for clinical decision-making. However, their high dimensionality, heterogeneity, and sparsity make clinical prediction challenging. Large Language Models (LLMs) have enabled progress on this challenge by leveraging parametric medical knowledge to enhance EHR data for clinical prediction tasks. Despite the significant achievements made so far, most of the existing approaches are fundamentally task-agnostic in the sense that they deploy LLMs as EHR encoders or EHR completion modules without fully integrating signals from the prediction tasks. This naturally hinders task performance accuracy. In this work, we propose Rewrite-To-Predict (ReToP), an LLM-based framework that addresses this limitation through an end-to-end training of an EHR rewriter and a clinical predictor. To cope with the lack of EHR rewrite training data, we generate synthetic pseudo-labels using clinical-driven feature selection strategies to create diverse patient rewrites for fine-tuning the EHR rewriter. ReToP aligns the rewriter with prediction objectives using a novel Classifier Supervised Contribution (CSC) score that enables the EHR rewriter to generate clinically relevant rewrites that directly enhance prediction. Our ReToP framework surpasses strong baseline models across three clinical tasks on MIMIC-IV. Moreover, the analysis of ReToP shows its generalizability to unseen datasets and tasks with minimal fine-tuning while preserving faithful rewrites and emphasizing task-relevant predictive features.
[21] MetaGen: Self-Evolving Roles and Topologies for Multi-Agent LLM Reasoning
Yimeng Wang,Jiaxing Zhao,Hongbin Xie,Hexing Ma,Yuzhen Lei,Shuangxue Liu,Xuan Song,Zichen Zhang,Haoran Zhang
Main category: cs.CL
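TL;DR: This paper proposes MetaGen, a training-free multi-agent framework that dynamically adapts the role space and collaboration topology at inference time, improving the trade-off between accuracy and inference cost.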
Details
Motivation: Existing multi-agent systems rely on a fixed role library and a frozen interaction topology, which leads to task mismatch, difficulty adapting to new evidence, and high inference cost. Method: At inference time, MetaGen generates and rewrites query-conditioned role specifications to build a controllable dynamic role pool, and instantiates a constrained execution graph around a minimal backbone; during execution it uses lightweight feedback signals to iteratively update role prompts and adjust the structure. Result: On code generation and multi-step reasoning benchmarks, MetaGen improves the accuracy-cost trade-off over strong multi-agent baselines. Conclusion: MetaGen provides an efficient, flexible, training-free approach to dynamic multi-agent adaptation that overcomes the limits of static designs. Abstract: Large language models are increasingly deployed as multi-agent systems, where specialized roles communicate and collaborate through structured interactions to solve complex tasks that often exceed the capacity of a single agent. However, most existing systems still rely on a fixed role library and an execution-frozen interaction topology, a rigid design choice that frequently leads to task mismatch, prevents timely adaptation when new evidence emerges during reasoning, and further inflates inference cost. We introduce MetaGen, a training-free framework that adapts both the role space and the collaboration topology at inference time, without updating base model weights. MetaGen generates and rewrites query-conditioned role specifications to maintain a controllable dynamic role pool, then instantiates a constrained execution graph around a minimal backbone. During execution, it iteratively updates role prompts and adjusts structural decisions using lightweight feedback signals. Experiments on code generation and multi-step reasoning benchmarks show that MetaGen improves the accuracy-cost trade-off over strong multi-agent baselines.
[22] Formula-One Prompting: Adaptive Reasoning Through Equations For Applied Mathematics
Natapong Nitarach,Pittawat Taveekitworachai,Kunat Pipatanakul
Main category: cs.CL
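TL;DR: This paper proposes Formula-One Prompting (F-1), a two-phase prompting method that first extracts the governing equations of a problem and then adaptively selects a solving strategy (CoT, PoT, or direct computation) based on those equations, substantially improving LLM reasoning on applied mathematics problems in areas such as finance and physics.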
Details
Motivation: Existing Chain-of-Thought (CoT) and Program-of-Thought (PoT) prompting methods do not explicitly exploit the step of deriving or recalling governing equations, which is key in applied mathematics problems, and this limits their performance in domains such as finance, physics, and cryptography. Method: F-1 runs two phases within a single LLM call: the first formulates governing equations from the problem description; the second adaptively selects the most suitable solving approach (CoT, PoT, or direct computation) based on the generated equations. Result: Across 5 models and 4 benchmarks, F-1 outperforms CoT by 5.76% and PoT by 8.42% on average; on FinanceMath it improves over CoT by 13.30%, and within OlympiadBench the gain on physics problems (+2.55%) is clearly larger than on pure math problems (+0.44%). Conclusion: Using governing equations as an intermediate representation and choosing the solving strategy from them improves LLM reasoning on applied mathematics problems more effectively; F-1 outperforms conventional CoT and PoT. Abstract: Prompting techniques such as Chain-of-Thought (CoT) and Program-of-Thought (PoT) improve LLM mathematical reasoning by structuring intermediate steps in natural language or code. However, applied mathematics problems in domains like finance, physics, and cryptography often require recalling or deriving governing equations, a step that current approaches do not explicitly leverage. We propose Formula-One Prompting (F-1), a two-phase approach that uses mathematical equations as an intermediate representation before adaptive solving. F-1 first formulates governing equations from problem descriptions, then selects a solving strategy among CoT, PoT, or direct computation based on the generated equations, all within a single LLM call. Results across five models and four benchmarks show F-1 outperforms CoT by +5.76% and PoT by +8.42% on average. Crucially, gains are largest in applied domains: +13.30% on FinanceMath over CoT, and within OlympiadBench, larger gains on physics (+2.55%) than pure math (+0.44%). This demonstrates that F-1 is more effective than CoT in applied mathematics problems.
[23] When Benchmarks Leak: Inference-Time Decontamination for LLMs
Jianzhe Chai,Yu Zhe,Jun Sakuma
Main category: cs.CL
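TL;DR: This paper proposes DeconIEP, a framework that mitigates test-set contamination in LLM evaluation by applying small, bounded perturbations in the input embedding space at evaluation time, without modifying the benchmark dataset or interfering with normal inference.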
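To make "small, bounded perturbations in the input embedding space" concrete, here is a minimal sketch. It assumes the bound is a per-component (L-infinity) limit `epsilon`; the entry does not say which norm DeconIEP uses, and the function name and the perturbation values are hypothetical. In the framework itself the perturbation would come from a learned, instance-adaptive generator, which is not modelled here.

```python
def apply_bounded_perturbation(embedding, delta, epsilon):
    """Add `delta` to `embedding` after clipping each component to [-epsilon, epsilon].

    Assumption: the bound is an L-infinity bound; the source only says
    the perturbation is "small" and "bounded".
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if len(embedding) != len(delta):
        raise ValueError("embedding and delta must have the same length")
    clipped = [max(-epsilon, min(epsilon, d)) for d in delta]
    return [e + c for e, c in zip(embedding, clipped)]


# Hypothetical values: the first two components exceed the bound and are clipped.
perturbed = apply_bounded_perturbation([1.0, 2.0, 3.0], [0.5, -0.5, 0.01], epsilon=0.1)
print(perturbed)
```

Clipping guarantees that no component of the embedding moves by more than `epsilon`, which is one simple way to keep an intervention from disturbing inference on clean inputs.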
Details
Motivation: Test-set contamination seriously threatens the reliability of benchmark-based LLM evaluation, and existing decontamination methods either modify the evaluation set (unreliable) or degrade performance on clean samples (a significant side effect). Method: DeconIEP is a decontamination framework that operates purely at evaluation time: guided by a less-contaminated reference model, it learns an instance-adaptive generator of input-embedding perturbations that steers the target model's inference away from memorization shortcut pathways. Result: Validated on multiple open-weight LLMs and benchmarks, DeconIEP substantially reduces the effect of contamination with almost no loss on clean samples. Conclusion: DeconIEP offers an efficient, low-intrusion approach to evaluation-time decontamination that balances decontamination effectiveness with preservation of the model's original capability. Abstract: Benchmark-based evaluation is the de facto standard for comparing large language models (LLMs). However, its reliability is increasingly threatened by test set contamination, where test samples or their close variants leak into training data and artificially inflate reported performance. To address this issue, prior work has explored two main lines of mitigation. One line attempts to identify and remove contaminated benchmark items before evaluation, but this inevitably alters the evaluation set itself and becomes unreliable when contamination is moderate or severe. The other line preserves the benchmark and instead suppresses contaminated behavior at evaluation time; however, such interventions often interfere with normal inference and lead to noticeable performance degradation on clean inputs. We propose DeconIEP, a decontamination framework that operates entirely during evaluation by applying small, bounded perturbations in the input embedding space. Guided by a relatively less-contaminated reference model, DeconIEP learns an instance-adaptive perturbation generator that steers the evaluated model away from memorization-driven shortcut pathways. Across multiple open-weight LLMs and benchmarks, extensive empirical results show that DeconIEP achieves strong decontamination effectiveness while incurring only minimal degradation in benign utility.
[24] Cross-Examination Framework: A Task-Agnostic Diagnostic for Information Fidelity in Text-to-Text Generation
Tathagata Raha,Clement Christophe,Nada Saadi,Hamza A Javed,Marco AF Pimentel,Ronnie Rajan,Praveenkumar Kanithi
Main category: cs.CL
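TL;DR: This paper proposes CEF, a reference-free, multi-dimensional framework for evaluating text generation quality. It treats the source and candidate texts as independent knowledge bases, generates verifiable questions, and cross-examines them to produce three interpretable scores: Coverage, Conformity, and Consistency. CEF outperforms BLEU and BERTScore on several tasks, and empirical validation shows that it is robust and closely aligned with human judgment.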
Details
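Motivation: Traditional metrics such as BLEU and BERTScore struggle to measure the semantic fidelity of generated text, especially for critical errors such as content omissions and factual contradictions. Method: Based on the Cross-Examination Framework (CEF), treats the source and candidate texts as separate knowledge bases, generates verifiable questions from each, and cross-examines them in both directions to compute three scores: Coverage, Conformity, and Consistency; also conducts a systematic robustness analysis to select a stable judge model and checks the correlation between the reference-free and with-reference modes. Result: On translation, summarization, and clinical note generation, CEF identifies content omissions and factual contradictions missed by BLEU and BERTScore; its reference-free mode correlates strongly with the with-reference mode; human expert validation shows that the mismatches CEF detects concentrate on semantic errors, especially entity-level and relational distortions. Conclusion: CEF is a reliable, interpretable, multi-dimensional evaluation framework that needs no gold reference and markedly improves discrimination of semantic fidelity, making it especially suitable for quality assessment in high-stakes domains such as clinical text. Abstract: Traditional metrics like BLEU and BERTScore fail to capture semantic fidelity in generative text-to-text tasks. We adapt the Cross-Examination Framework (CEF) for a reference-free, multi-dimensional evaluation by treating the source and candidate as independent knowledge bases. CEF generates verifiable questions from each text and performs a cross-examination to derive three interpretable scores: Coverage, Conformity, and Consistency. Validated across translation, summarization and clinical note-generation, our framework identifies critical errors, such as content omissions and factual contradictions, missed by standard metrics. A key contribution is a systematic robustness analysis to select a stable judge model. Crucially, the strong correlation between our reference-free and with-reference modes validates CEF's reliability without gold references. Furthermore, human expert validation demonstrates that CEF's mismatching questions align more strongly with meaning-altering semantic errors than with non-semantic errors, and that CEF is particularly good at identifying entity-based and relational distortions.
[25] Binary Token-Level Classification with DeBERTa for All-Type MWE Identification: A Lightweight Approach with Linguistic Enhancement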
Diego Rossini,Lonneke van der Plas
Main category: cs.CL
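TL;DR: This paper proposes a multiword expression (MWE) identification method that combines binary token-level classification, linguistic feature integration, and data augmentation. It reaches 69.8% and 78.9% F1 on the CoAM and STREUSLE datasets respectively, substantially outperforming large language models with far fewer parameters.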
Details
Motivation: To address the challenges that class imbalance, discontinuous structures, and noun-type MWEs pose for multiword expression (MWE) identification, and to explore whether small models can replace large language models on structured NLP tasks. Method: Uses DeBERTa-v3-large and reformulates MWE detection as binary token-level classification over START/END/INSIDE labels; adds noun-phrase (NP) chunking and dependency features; applies oversampling to mitigate severe class imbalance in the training data. Result: Achieves 69.8% F1 on CoAM, 12 points higher than Qwen-72B with 165x fewer parameters; achieves 78.9% F1 on STREUSLE, confirming that the method generalizes. Conclusion: Carefully designed small models can substantially outperform large language models on structured NLP tasks, which has practical value for resource-constrained settings. Abstract: We present a comprehensive approach for multiword expression (MWE) identification that combines binary token-level classification, linguistic feature integration, and data augmentation. Our DeBERTa-v3-large model achieves 69.8% F1 on the CoAM dataset, surpassing the best results (Qwen-72B, 57.8% F1) on this dataset by 12 points while using 165x fewer parameters. We achieve this performance by (1) reformulating detection as binary token-level START/END/INSIDE classification rather than span-based prediction, (2) incorporating NP chunking and dependency features that help identify discontinuous and NOUN-type MWEs, and (3) applying oversampling that addresses severe class imbalance in the training data. We confirm the generalization of our method on the STREUSLE dataset, achieving 78.9% F1. These results demonstrate that carefully designed smaller models can substantially outperform LLMs on structured NLP tasks, with important implications for resource-constrained deployments.
[26] Do LLMs Truly Benefit from Longer Context in Automatic Post-Editing?
Ahrii Kim,Seong-heum Kim
Main category: cs.CL
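TL;DR: This paper systematically compares proprietary and open-weight large language models (LLMs) on document-level automatic post-editing (APE). Proprietary models reach near human-level correction quality and are robust, but they struggle to use document context effectively and incur excessive cost and latency. Automatic metrics are unreliable, so human evaluation is still needed, and the study points to the need for efficient long-context modeling.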
Details
Motivation: Although LLMs translate well, their effectiveness on document-level automatic post-editing (APE) is still poorly understood, particularly the trade-offs among context use, robustness, and practicality. Method: Runs a systematic comparison of proprietary and open-weight LLMs under a naive document-level (one-shot) prompting setup, evaluating APE quality, contextual behavior, robustness to data poisoning attacks, and inference efficiency. Result: Proprietary LLMs reach near human-level APE quality with simple prompting and are more robust than open-weight models, but make almost no use of document context; automatic metrics do not reliably reflect quality improvements; high cost and latency hinder practical deployment. Conclusion: LLM-based document-level APE is promising but currently limited by insufficient context use, unreliable evaluation, and impractical deployment, and it needs more efficient long-context modeling. Abstract: Automatic post-editing (APE) aims to refine machine translations by correcting residual errors. Although recent large language models (LLMs) demonstrate strong translation capabilities, their effectiveness for APE, especially under document-level context, remains insufficiently understood. We present a systematic comparison of proprietary and open-weight LLMs under a naive document-level prompting setup, analyzing APE quality, contextual behavior, robustness, and efficiency. Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided. While these models exhibit higher robustness to data poisoning attacks than open-weight counterparts, this robustness also reveals a limitation: they largely fail to exploit document-level context for contextual error correction. Furthermore, standard automatic metrics do not reliably reflect these qualitative improvements, highlighting the continued necessity of human evaluation. Despite their strong performance, the substantial cost and latency overheads of proprietary LLMs render them impractical for real-world APE deployment. Overall, our findings elucidate both the promise and current limitations of LLM-based document-aware APE, and point toward the need for more efficient long-context modeling approaches for translation refinement.
[27] KG-CRAFT: Knowledge Graph-based Contrastive Reasoning with LLMs for Enhancing Automated Fact-checking
Vítor N. Lourenço,Aline Paes,Tillman Weyde,Audrey Depeige,Mohnish Dubey
Main category: cs.CL
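TL;DR: KG-CRAFT is a new method for automated claim verification that guides large language models with contrastive questions grounded in a knowledge graph, achieving state-of-the-art performance on the LIAR-RAW and RAWFC datasets.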
Details
Motivation: To improve the accuracy of automated claim verification and overcome the weaknesses of existing methods in evidence extraction and reasoning. Method: Builds a knowledge graph from claims and reports, generates contextually relevant contrastive questions from the graph structure, uses them to guide evidence distillation and summary generation, and passes the result to a large language model for veracity assessment. Result: Achieves new state-of-the-art performance on two real-world datasets, LIAR-RAW and RAWFC, and detailed analyses confirm that knowledge graph-driven contrastive reasoning improves LLM fact-checking. Conclusion: By integrating a knowledge graph and contrastive questioning into the LLM reasoning pipeline, KG-CRAFT markedly improves the interpretability and accuracy of automated fact-checking. Abstract: Claim verification is a core component of automated fact-checking systems, aimed at determining the truthfulness of a statement by assessing it against reliable evidence sources such as documents or knowledge bases. This work presents KG-CRAFT, a method that improves automatic claim verification by leveraging large language models (LLMs) augmented with contrastive questions grounded in a knowledge graph. KG-CRAFT first constructs a knowledge graph from claims and associated reports, then formulates contextually relevant contrastive questions based on the knowledge graph structure. These questions guide the distillation of evidence-based reports, which are synthesised into a concise summary that is used for veracity assessment by LLMs. Extensive evaluations on two real-world datasets (LIAR-RAW and RAWFC) demonstrate that our method achieves a new state-of-the-art in predictive performance. Comprehensive analyses validate in detail the effectiveness of our knowledge graph-based contrastive reasoning approach in improving LLMs' fact-checking capabilities.
[28] Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition
Isha Pandey,Ashish Mittal,Vartul Bahuguna,Ganesh Ramakrishnan
Main category: cs.CL
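TL;DR: This paper proposes SMEAR-MoE, a stabilized Mixture-of-Experts (MoE) projector for multilingual automatic speech recognition (ASR) that substantially lowers word error rate (WER) while maintaining runtime efficiency.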
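This entry does not spell out how the stabilized routing gives every expert a dense gradient. One known way to get that property is to softly merge the experts' parameters using the router probabilities instead of selecting a top-k subset. The sketch below illustrates that idea only; it is an assumption about the general technique, not a confirmed description of SMEAR-MoE, and all names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_merge(expert_weights, router_logits):
    """Return one weight matrix: the router-probability-weighted average of all experts.

    Every expert has a non-zero weight in the merged matrix, so a gradient
    taken through the merged projector reaches every expert.
    """
    probs = softmax(router_logits)
    rows, cols = len(expert_weights[0]), len(expert_weights[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for prob, weights in zip(probs, expert_weights):
        for r in range(rows):
            for c in range(cols):
                merged[r][c] += prob * weights[r][c]
    return merged


# Two toy 1x2 "projectors" with equal router logits.
experts = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(soft_merge(experts, router_logits=[0.0, 0.0]))
```

With hard top-k routing, experts that are never selected receive no gradient and can collapse; averaging parameters avoids that while still applying only a single projector per input.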
Details
Motivation: A single lightweight projector struggles to model the diverse speech-to-semantic mappings across languages, which limits the scalability of architectures that pair a frozen speech encoder with a large language model (LLM) for multilingual ASR. Method: Proposes SMEAR-MoE, an improved Mixture-of-Experts (MoE) projector whose stabilized training ensures dense gradient updates for all experts, prevents expert collapse, and supports cross-lingual sharing; systematically compares single-projector, static multi-projector, and dynamic MoE designs on four Indic languages (Hindi, Marathi, Tamil, Telugu). Result: SMEAR-MoE achieves up to a 7.6% relative WER reduction over the single-projector baseline at comparable runtime efficiency; expert routing analysis shows linguistically meaningful specialization, with related languages tending to share experts. Conclusion: Stable multi-expert projectors are key to building scalable and robust multilingual ASR systems. Abstract: Recent advances in LLM-based ASR connect frozen speech encoders with Large Language Models (LLMs) via lightweight projectors. While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. We systematically compare monolithic, static multi-projector, and dynamic MoE designs across four Indic languages (Hindi, Marathi, Tamil, Telugu). Our SMEAR-MoE achieves strong performance, delivering up to a 7.6% relative WER reduction over the single-projector baseline, while maintaining comparable runtime efficiency. Analysis of expert routing further shows linguistically meaningful specialization, with related languages sharing experts. These results demonstrate that stable multi-expert projectors are key to scalable and robust multilingual ASR.
[29] ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles
Ricardo Campos,Raquel Sequeira,Sara Nerea,Inês Cantante,Diogo Folques,Luís Filipe Cunha,João Canavilhas,António Branco,Alípio Jorge,Sérgio Nunes,Nuno Guimarães,Purificação Silvano
Main category: cs.CL
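TL;DR: This paper introduces ClaimPT, a dataset of European Portuguese news articles annotated for factual claims, intended to advance fact-checking research for low-resource languages.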
Details
Motivation: Languages other than English, such as Portuguese, lack accessible, licensed annotated datasets, which limits NLP research and fact-checking applications. Method: Builds the ClaimPT dataset of 1,308 news articles and 6,875 annotations, focused on news media content, labeled by two trained annotators and validated by a curator according to a newly proposed annotation scheme; also provides baseline models for claim detection. Result: Releases ClaimPT, the first public, licensed dataset of factual claim annotations for European Portuguese news text, and establishes initial benchmark performance for claim detection. Conclusion: ClaimPT fills a gap in Portuguese fact-checking data resources, supports NLP and information retrieval research in low-resource languages, and helps improve the detection of and response to misinformation in news media. Abstract: Fact-checking remains a demanding and time-consuming task, still largely dependent on manual verification and unable to match the rapid spread of misinformation online. This is particularly important because corrections of false information typically take longer to reach consumers than the misinformation itself; accelerating corrections through automation can therefore help counter it more effectively. Although many organizations perform manual fact-checking, this approach is difficult to scale given the growing volume of digital content. These limitations have motivated interest in automating fact-checking, where identifying claims is a crucial first step. However, progress has been uneven across languages, with English dominating due to abundant annotated data. Portuguese, like other languages, still lacks accessible, licensed datasets, limiting research, NLP developments and applications. In this paper, we introduce ClaimPT, a dataset of European Portuguese news articles annotated for factual claims, comprising 1,308 articles and 6,875 individual annotations. Unlike most existing resources based on social media or parliamentary transcripts, ClaimPT focuses on journalistic content, collected through a partnership with LUSA, the Portuguese News Agency. To ensure annotation quality, two trained annotators labeled each article, with a curator validating all annotations according to a newly proposed scheme. We also provide baseline models for claim detection, establishing initial benchmarks and enabling future NLP and IR applications. By releasing ClaimPT, we aim to advance research on low-resource fact-checking and enhance understanding of misinformation in news media.
[30] GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs
Wei Huang,Anda Cheng,Yinggui Wang
Main category: cs.CL
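TL;DR: This paper proposes GradPruner, which uses gradient information (the IGIA-Matrix) from the early stage of downstream fine-tuning to guide layer pruning in LLMs, improving both training and inference efficiency; with 40% of parameters removed, accuracy drops by only 0.99%.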
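GradPruner's merge step, in which only same-sign elements of a pruned layer are folded into a remaining layer, can be sketched as follows. The entry states only that same-sign elements are merged; element-wise addition as the merge operator, the flat-list layout, and the function name are assumptions for illustration. The sparsification that GradPruner applies to pruned layers before merging is omitted.

```python
def merge_same_sign(kept, pruned):
    """Fold `pruned` into `kept` element-wise, but only where both entries share a sign.

    Entries with opposite signs (or a zero on either side) leave the kept
    value unchanged, so the merge never flips or cancels a kept weight.
    Assumption: the merge operator is plain addition.
    """
    if len(kept) != len(pruned):
        raise ValueError("layers must have the same number of elements")
    merged = []
    for k, p in zip(kept, pruned):
        same_sign = (k > 0 and p > 0) or (k < 0 and p < 0)
        merged.append(k + p if same_sign else k)
    return merged


# Hypothetical flattened weights: only the first pair shares a sign.
print(merge_same_sign([1.0, -2.0, 3.0, 0.0], [0.5, 1.0, -1.0, 2.0]))
```

Skipping opposite-sign pairs is what "reducing interference from sign variations" would amount to under this reading: a merged-in value can reinforce a kept weight but cannot partially cancel it.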
Details
Motivation: Fine-tuning LLMs is time-consuming and expensive; structured pruning improves inference efficiency but often requires extra training, knowledge distillation, or structure search, making it hard to achieve both training and inference efficiency. Method: In the early stage of fine-tuning, GradPruner accumulates each parameter's gradients to build the Initial Gradient Information Accumulation Matrix (IGIA-Matrix), uses it to assess layer importance, and prunes accordingly; it sparsifies the pruned layers and merges them into the remaining layers, merging only same-sign elements to reduce sign interference. Result: Experiments on two LLMs and eight downstream datasets (covering medical, financial, and general benchmarks) show a 40% parameter reduction with only a 0.99% drop in accuracy. Conclusion: GradPruner balances training and inference efficiency during fine-tuning, compressing the model substantially while maintaining performance, and offers a new approach to efficient LLM adaptation. Abstract: Fine-tuning Large Language Models (LLMs) with downstream data is often considered time-consuming and expensive. Structured pruning methods are primarily employed to improve the inference efficiency of pre-trained models. However, they often require additional time and memory for training, knowledge distillation, structure search, and other strategies, making efficient model fine-tuning challenging to achieve. To simultaneously enhance the training and inference efficiency of downstream task fine-tuning, we introduce GradPruner, which can prune layers of LLMs guided by gradients in the early stages of fine-tuning. GradPruner uses the cumulative gradients of each parameter during the initial phase of fine-tuning to compute the Initial Gradient Information Accumulation Matrix (IGIA-Matrix) to assess the importance of layers and perform pruning. We sparsify the pruned layers based on the IGIA-Matrix and merge them with the remaining layers. Only elements with the same sign are merged to reduce interference from sign variations. We conducted extensive experiments on two LLMs across eight downstream datasets, including medical, financial, and general benchmark tasks. The results demonstrate that GradPruner has achieved a parameter reduction of 40% with only a 0.99% decrease in accuracy. Our code is publicly available.
[31] Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs
Xiangyang Zhu,Yuan Tian,Zicheng Zhang,Qi Jia,Chunyi Li,Renrui Zhang,Heng Li,Zongrui Wang,Wei Sun
Main category: cs.CL
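TL;DR: This paper proposes VLSafetyBencher, the first automated benchmark construction system for safety evaluation of large vision-language models (LVLMs), which uses four collaborating agents to efficiently generate high-quality, highly discriminative safety evaluation datasets.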
Details
Motivation: Existing LVLM safety benchmarks are costly to build by hand, static in complexity, and weak in discriminative power, so they cannot keep pace with rapidly evolving models and emerging risks. Method: Proposes VLSafetyBencher, an automated system with four collaborating agents for data preprocessing, generation, augmentation, and selection, enabling end-to-end safety benchmark construction and sample selection. Result: Experiments show the system can build a high-quality safety benchmark within one week at minimal cost, and the generated benchmark discriminates strongly between models (a 70% gap in safety rate between the safest and least safe models). Conclusion: VLSafetyBencher provides a scalable, dynamic, low-cost automated solution for LVLM safety evaluation that markedly improves the efficiency and practicality of benchmark construction. Abstract: Large vision-language models (LVLMs) exhibit remarkable capabilities in cross-modal tasks but face significant safety challenges, which undermine their reliability in real-world applications. Efforts have been made to build LVLM safety evaluation benchmarks to uncover their vulnerability. However, existing benchmarks are hindered by their labor-intensive construction process, static complexity, and limited discriminative power. Thus, they may fail to keep pace with rapidly evolving models and emerging risks. To address these limitations, we propose VLSafetyBencher, the first automated system for LVLM safety benchmarking. VLSafetyBencher introduces four collaborative agents: Data Preprocessing, Generation, Augmentation, and Selection agents to construct and select high-quality samples. Experiments validate that VLSafetyBencher can construct high-quality safety benchmarks within one week at a minimal cost. The generated benchmark effectively distinguishes model safety, with a safety rate disparity of 70% between the most and least safe models.
[32] Yunque DeepResearch Technical Report
Yuxuan Cai,Xinyi Lai,Peng Yuan,Weiting Liu,Huajian Li,Mingda Li,Xinghua Wang,Shengxie Zheng,Yanchao Hao,Yuyang Yin,Zheng Wei
Main category: cs.CL
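TL;DR: This paper proposes Yunque DeepResearch, a hierarchical, modular framework that addresses contextual noise, error propagation, and poor extensibility in deep research, substantially improving autonomous agents on several benchmarks.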
Details
Motivation: Current deep research capability is limited by accumulating contextual noise in long-horizon tasks, cascading error propagation, and a lack of modular extensibility. Method: Proposes the Yunque DeepResearch framework with three core components: (1) a centralized multi-agent orchestration system; (2) a dynamic context management mechanism; (3) a proactive supervisor module that performs anomaly detection and context pruning. Result: Achieves state-of-the-art performance on several deep research benchmarks, including GAIA, BrowseComp, BrowseComp-ZH, and Humanity's Last Exam. Conclusion: Through modular, hierarchical, and robust design, Yunque DeepResearch substantially improves the deep research capability of autonomous agents on complex open-ended tasks, and it has been open-sourced to support the community. Abstract: Deep research has emerged as a transformative capability for autonomous agents, empowering Large Language Models to navigate complex, open-ended tasks. However, realizing its full potential is hindered by critical limitations, including escalating contextual noise in long-horizon tasks, fragility leading to cascading errors, and a lack of modular extensibility. To address these challenges, we introduce Yunque DeepResearch, a hierarchical, modular, and robust framework. The architecture is characterized by three key components: (1) a centralized Multi-Agent Orchestration System that routes subtasks to an Atomic Capability Pool of tools and specialized sub-agents; (2) a Dynamic Context Management mechanism that structures completed sub-goals into semantic summaries to mitigate information overload; and (3) a proactive Supervisor Module that ensures resilience through active anomaly detection and context pruning. Yunque DeepResearch achieves state-of-the-art performance across a range of agentic deep research benchmarks, including GAIA, BrowseComp, BrowseComp-ZH, and Humanity's Last Exam. We open-source the framework, reproducible implementations, and application cases to empower the community.
[33] Decompose-and-Formalise: Recursively Verifiable Natural Language Inference
Xin Quan,Marco Valentino,Louise A. Dennis,André Freitas
Main category: cs.CL
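TL;DR: This paper proposes a decompose-and-formalise framework that builds entailment trees, verifies them bottom-up, and refines locally under diagnostic guidance. It addresses the frequent autoformalisation errors and hard-to-localise failures that arise when LLMs are combined with theorem provers for natural language inference, and it introduces θ-substitution to improve the faithfulness of formalisation.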
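Bottom-up verification that isolates failures to specific nodes can be sketched as a post-order traversal of the entailment tree. The tree encoding (a dict from node to its child nodes) and the `verify` callback are hypothetical stand-ins; in the framework itself, verification would be a theorem-prover call on the formalised step.

```python
def find_failed_nodes(tree, node, verify):
    """Visit children before parents and return every node whose step fails to verify.

    `tree` maps a node name to the list of its child (premise) nodes.
    `verify(node, children)` is a stand-in for the prover check of one atomic step.
    """
    failed = []
    children = tree.get(node, [])
    for child in children:
        failed.extend(find_failed_nodes(tree, child, verify))
    if not verify(node, children):
        failed.append(node)
    return failed


# Toy tree: hypothesis "h" follows from "a" and "b"; "b" follows from "c".
toy_tree = {"h": ["a", "b"], "a": [], "b": ["c"], "c": []}


def toy_verify(node, children):
    # Hypothetical checker: pretend only the step deriving "b" fails.
    return node != "b"


print(find_failed_nodes(toy_tree, "h", toy_verify))
```

Because each step is checked separately, refinement can target only the returned nodes rather than regenerating the whole explanation.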
Details
Motivation: Existing neuro-symbolic methods for naturalistic NLI suffer from amplified autoformalisation errors, failures that are hard to localise, and reliance on costly global regeneration. Method: Proposes a decompose-and-formalise framework that (i) decomposes premise-hypothesis pairs into an entailment tree; (ii) verifies the tree bottom-up to pinpoint failing nodes; (iii) performs local refinement based on diagnostics; and introduces θ-substitution over an event-based logical form to keep argument-role bindings consistent. Result: Across five LLM backbones, explanation verification rates improve substantially (by up to 48.9%), with fewer refinement iterations and lower runtime, while strong NLI accuracy is preserved. Conclusion: The framework improves the verifiability, efficiency, and faithfulness of explanations in neuro-symbolic reasoning, offering a new path toward trustworthy NLI. Abstract: Recent work has shown that integrating large language models (LLMs) with theorem provers (TPs) in neuro-symbolic pipelines helps with entailment verification and proof-guided refinement of explanations for natural language inference (NLI). However, scaling such refinement to naturalistic NLI remains difficult: long, syntactically rich inputs and deep multi-step arguments amplify autoformalisation errors, where a single local mismatch can invalidate the proof. Moreover, current methods often handle failures via costly global regeneration due to the difficulty of localising the responsible span or step from prover diagnostics. Aiming to address these problems, we propose a decompose-and-formalise framework that (i) decomposes premise-hypothesis pairs into an entailment tree of atomic steps, (ii) verifies the tree bottom-up to isolate failures to specific nodes, and (iii) performs local diagnostic-guided refinement instead of regenerating the whole explanation. Moreover, to improve faithfulness of autoformalisation, we introduce $\theta$-substitution in an event-based logical form to enforce consistent argument-role bindings. Across a range of reasoning tasks using five LLM backbones, our method achieves the highest explanation verification rates, improving over the state-of-the-art by 26.2%, 21.7%, 21.6% and 48.9%, while reducing refinement iterations and runtime and preserving strong NLI accuracy.
[34] Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs
Xinzhong Wang,Ya Guo,Jing Li,Huan Chen,Yi Tu,Yijie Hong,Gongshen Liu,Huijia Zhu
Main category: cs.CL
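TL;DR: This paper proposes PIP, a parallel inference paradigm for key information extraction (KIE) from visually rich documents. It generates multiple target fields simultaneously using [mask] tokens, achieving a 5x to 36x inference speedup while maintaining high accuracy.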
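To make the placeholder idea concrete, the sketch below builds an output template in which every requested field is represented by a `[mask]` token, then pairs predicted values back with their fields. The template layout (one `name: [mask]` line per field, one mask per field), the field names, and both helper functions are hypothetical; the entry does not specify how PIP lays out its placeholders or how many mask tokens each value uses.

```python
def build_masked_template(field_names, mask_token="[mask]"):
    """Return one 'name: [mask]' line per requested field (hypothetical layout)."""
    return "\n".join(f"{name}: {mask_token}" for name in field_names)


def fill_template(field_names, predicted_values):
    """Pair each field with the value predicted at its placeholder position."""
    if len(field_names) != len(predicted_values):
        raise ValueError("expected exactly one predicted value per field")
    return dict(zip(field_names, predicted_values))


fields = ["invoice_number", "total", "date"]
print(build_masked_template(fields))
# The values below stand in for what a model would predict in one forward pass.
print(fill_template(fields, ["INV-001", "42.00", "2025-01-01"]))
```

Since all placeholders are present in the input at once, their values can be predicted together, whereas autoregressive decoding would emit the fields one token after another.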
Details
Motivation: Existing KIE methods based on large language models (LLMs) and multimodal large language models (MLLMs) rely on autoregressive inference, which is inefficient and poorly suited to extracting multiple fields in parallel. Method: Proposes the Parallel Inference Paradigm (PIP): all fields to be extracted are replaced by [mask] tokens so that they are generated simultaneously in a single forward pass, supported by a tailored mask pre-training strategy and large-scale supervised datasets. Result: PIP models achieve a 5x to 36x inference speedup on multiple KIE tasks with negligible performance loss. Conclusion: PIP substantially improves the inference efficiency and practicality of KIE and provides a feasible path to real-world deployment. Abstract: Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using "[mask]" tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP-models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.
[35] RATE: Reviewer Profiling and Annotation-free Training for Expertise Ranking in Peer Review Systems
Weicong Liu,Zixuan Yang,Yibo Zhao,Xiang Li
Main category: cs.CL
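TL;DR: This paper proposes LR-bench, a high-quality reviewer assignment benchmark built from 2024-2025 AI/NLP papers and expert self-assessed familiarity ratings, and designs RATE, a reviewer-centric framework that substantially improves reviewer matching through keyword-based profiles and an embedding model fine-tuned with weak supervision.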
Details
Motivation: Rapid topic shifts in the LLM era have made traditional benchmarks outdated, and existing proxy signals poorly reflect reviewers' true familiarity, so a more realistic and timely means of evaluation is needed. Method: Builds the LR-bench benchmark (1,055 annotations with five-level familiarity ratings); proposes the RATE framework, which generates keyword-based profiles from reviewers' recent papers and constructs weak preference supervision from heuristic retrieval signals to fine-tune an embedding model for direct manuscript-reviewer matching. Result: On LR-bench and the CMU gold-standard dataset, RATE substantially outperforms strong embedding baselines and achieves state-of-the-art performance. Conclusion: LR-bench provides a more realistic evaluation basis for reviewer assignment, and RATE demonstrates the effectiveness of reviewer-centric modeling; together they move the task toward more reliable and interpretable solutions. Abstract: Reviewer assignment is increasingly critical yet challenging in the LLM era, where rapid topic shifts render many pre-2023 benchmarks outdated and where proxy signals poorly reflect true reviewer familiarity. We address this evaluation bottleneck by introducing LR-bench, a high-fidelity, up-to-date benchmark curated from 2024-2025 AI/NLP manuscripts with five-level self-assessed familiarity ratings collected via a large-scale email survey, yielding 1,055 expert-annotated paper-reviewer-score annotations. We further propose RATE, a reviewer-centric ranking framework that distills each reviewer's recent publications into compact keyword-based profiles and fine-tunes an embedding model with weak preference supervision constructed from heuristic retrieval signals, enabling matching each manuscript against a reviewer profile directly. Across LR-bench and the CMU gold-standard dataset, our approach consistently achieves state-of-the-art performance, outperforming strong embedding baselines by a clear margin. We release LR-bench at https://huggingface.co/datasets/Gnociew/LR-bench, and a GitHub repository at https://github.com/Gnociew/RATE-Reviewer-Assign.
[36] One Token Is Enough: Improving Diffusion Language Models with a Sink Token
Zihou Zhang,Zheyong Xie,Li Zhong,Haifeng Liu,Shaosheng Cao
Main category: cs.CL
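TL;DR: This paper proposes stabilizing the position of attention sinks in diffusion language models (DLMs) by adding a single special extra sink token, mitigating the inference instability caused by the moving sink phenomenon.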
Details
Motivation: Diffusion language models (DLMs) exhibit a moving sink phenomenon: sink tokens appear at unpredictable positions across diffusion steps, which undermines inference robustness. Method: Introduces an extra sink token that attends only to itself yet is visible to all other tokens, implemented through a modified attention mask. Result: A single extra sink token substantially improves model performance and the stability of attention sinks; its effect is independent of its position, and its semantic content is negligible. Conclusion: The method is a simple, effective, and robust structural sink mechanism that addresses the moving sink problem in DLMs at its root. Abstract: Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer's value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.
[37] SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking
Adam Remaki,Christel Gérardin,Eulàlia Farré-Maduell,Martin Krallinger,Xavier Tannier
Main category: cs.CL
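TL;DR: SynCABEL is a new framework that uses large language models to generate context-rich synthetic data, easing the scarcity of expert-annotated data in biomedical entity linking; it sets a new state of the art on multilingual benchmarks and raises the rate of clinically valid predictions.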
Details
Motivation: Addresses the core bottleneck in biomedical entity linking (BEL): the scarcity of expert-annotated training data. Method: Proposes the SynCABEL framework, which uses large language models to generate context-rich synthetic training examples for all candidate concepts in a knowledge base, combines them with decoder-only models and guided inference, and introduces an LLM-as-a-judge protocol to assess clinical validity. Result: Sets a new state of the art on three multilingual benchmarks, MedMentions (English), QUAERO (French), and SPACCC (Spanish); matches full human supervision with up to 60% less annotated data; and significantly raises the rate of clinically valid predictions. Conclusion: SynCABEL reduces reliance on annotation and improves both model performance and clinical usefulness; the released data, models, and code support reproducible research. Abstract: We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research.
[38] Component-Level Lesioning of Language Models Reveals Clinically Aligned Aphasia Phenotypes
Yifan Wang,Jichen Zheng,Jingyuan Sun,Yunhao Zhang,Chunyu Ye,Jixing Li,Chengqing Zong,Shaonan Wang
Main category: cs.CL
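TL;DR: This paper proposes a clinically grounded, component-level framework that simulates aphasia (such as Broca's and Wernicke's aphasia) by selectively perturbing functional components of large language models (LLMs), and validates it on Mixture-of-Experts (MoE) and dense Transformer models. Perturbing subtype-linked components reproduces aphasic features more systematically than random perturbation, and MoE architectures make language-function impairments easier to localize and interpret.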
Details
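Motivation: Explores whether large language models can serve as computational simulators of language cognition, and in particular whether they can be systematically manipulated to reproduce the language-production impairments of aphasia caused by focal brain lesions, providing a scalable, controllable platform for testing rehabilitation hypotheses and studying the functional organization of language. Method: Builds a clinically grounded, component-level aphasia simulation framework with a unified intervention interface for MoE and dense Transformer models: (i) identify model components linked to Broca's and Wernicke's aphasia; (ii) interpret these components with linguistic probing tasks; (iii) progressively perturb the top-k subtype-linked components and assess the damage with Western Aphasia Battery (WAB) subtests and the Aphasia Quotient (AQ). Result: Across architectures and lesioning strategies, subtype-targeted perturbations produce more systematic, more aphasia-like language degradation than size-matched random perturbations; MoE models support more localized and interpretable phenotype-to-component mappings. Conclusion: Modular LLMs combined with clinically informed component perturbations are a powerful computational platform for simulating aphasic language production and studying how specific language functions degrade under targeted disruption. Abstract: Large language models (LLMs) increasingly exhibit human-like linguistic behaviors and internal representations, suggesting that they could serve as computational simulators of language cognition. We ask whether LLMs can be systematically manipulated to reproduce language-production impairments characteristic of aphasia following focal brain lesions. Such models could provide scalable proxies for testing rehabilitation hypotheses, and offer a controlled framework for probing the functional organization of language. We introduce a clinically grounded, component-level framework that simulates aphasia by selectively perturbing functional components in LLMs, and apply it to both modular Mixture-of-Experts models and dense Transformers using a unified intervention interface. Our pipeline (i) identifies subtype-linked components for Broca's and Wernicke's aphasia, (ii) interprets these components with linguistic probing tasks, and (iii) induces graded impairments by progressively perturbing the top-k subtype-linked components, evaluating outcomes with Western Aphasia Battery (WAB) subtests summarized by Aphasia Quotient (AQ). Across architectures and lesioning strategies, subtype-targeted perturbations yield more systematic, aphasia-like regressions than size-matched random perturbations, and MoE modularity supports more localized and interpretable phenotype-to-component mappings.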
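Step (iii) of the pipeline, progressively perturbing the top-k subtype-linked components, can be illustrated with a minimal sketch. The component names, the scores, and the choice of zeroing activations as the perturbation are hypothetical assumptions made for illustration; the abstract does not specify the authors' scoring or lesioning procedure.

```python
from typing import Dict, List


def lesion_schedule(scores: Dict[str, float], ks: List[int]) -> Dict[int, List[str]]:
    """For each k, return the k components with the highest subtype-link score."""
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {k: ranked[:k] for k in ks}


def apply_lesion(activations: Dict[str, float], lesioned: List[str]) -> Dict[str, float]:
    """Zero the activations of lesioned components (one assumed perturbation)."""
    targets = set(lesioned)
    return {name: (0.0 if name in targets else value)
            for name, value in activations.items()}


# Hypothetical subtype-link scores for three experts.
scores = {"expert_0": 0.1, "expert_1": 0.9, "expert_2": 0.5}
schedule = lesion_schedule(scores, ks=[1, 2])

# Lesion the two most strongly linked experts; expert_0 is left intact.
lesioned_acts = apply_lesion(
    {"expert_0": 1.0, "expert_1": 2.0, "expert_2": 3.0}, schedule[2]
)
```

A size-matched random baseline would instead draw the same number of components uniformly at random for each k.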
These findings suggest that modular LLMs, combined with clinically informed component perturbations, provide a promising platform for simulating aphasic language production and studying how distinct language functions degrade under targeted disruptions.
[39] TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching
Runjia Zeng,Qifan Wang,Qiang Guan,Ruixiang Tang,Lifu Huang,Zhenting Wang,Xueling Zhang,Cheng Han,Dongfang Liu
Main category: cs.CL
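TL;DR: This paper proposes TokenSeek, an instance-aware token seeking and ditching method for Transformer models that sharply reduces memory consumption during LLM fine-tuning (for example, requiring only 14.8% of the memory on Llama3.2 1B) while matching or improving performance, and that offers an interpretable analysis of token efficiency.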
Details
Motivation: Existing activation optimization methods are mostly data-agnostic, which makes fine-tuning ineffective and unstable; because activation memory dominates the overall cost, a more efficient, stable, and interpretable memory optimization is needed. Method: Proposes TokenSeek, a plug-in method that uses instance-aware token seeking (selecting important tokens) and ditching (discarding redundant tokens) to dynamically adjust the activations stored during the forward pass, and that applies to a wide range of Transformer models. Result: On models such as Llama3.2 1B it saves up to 85.2% of memory (using only 14.8%), keeps fine-tuning accuracy on par or better, and reveals how key tokens are distributed, which makes the method interpretable. Conclusion: TokenSeek is a general, efficient, stable, and interpretable memory optimization for fine-tuning, offering a new paradigm for lightweight fine-tuning of large models. Abstract: Fine tuning has been regarded as a de facto approach for adapting large language models (LLMs) to downstream tasks, but the high training memory consumption inherited from LLMs makes this process inefficient. Among existing memory efficient approaches, activation-related optimization has proven particularly effective, as activations consistently dominate overall memory consumption. Although prior work offers various activation optimization strategies, their data-agnostic nature ultimately results in ineffective and unstable fine tuning. In this paper, we propose TokenSeek, a universal plugin solution for various transformer-based models through instance-aware token seeking and ditching, achieving significant fine-tuning memory savings (e.g., requiring only 14.8% of the memory on Llama3.2 1B) with on-par or even better performance. Furthermore, our interpretable token seeking process reveals the underlying reasons for its effectiveness, offering valuable insights for future research on token efficiency. Homepage: https://runjia.tech/iclr_tokenseek/
[40] Strong Reasoning Isn't Enough: Evaluating Evidence Elicitation in Interactive Diagnosis
Zhuohan Long,Zhijie Bao,Zhongyu Wei
Main category: cs.CL
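TL;DR: This paper proposes an interactive evaluation framework for medical consultation, introduces the Information Coverage Rate (ICR) metric, and builds the evidence-based benchmark EviMed. It finds that strong diagnostic reasoning does not imply strong information gathering, and proposes the REFINE strategy to improve proactive evidence elicitation under uncertainty.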
Details
Motivation: Existing evaluations of medical consultation are mostly static or outcome-oriented; they neglect the evidence-gathering process and so cannot reflect an agent's ability to proactively obtain clinical evidence in real interactions. Method: Builds a simulated patient and a simulated reporter grounded in atomic evidence, proposes the Information Coverage Rate (ICR) to quantify how completely evidence is gathered, establishes the EviMed benchmark and evaluates 10 models on it, and proposes the REFINE strategy, which uses diagnostic verification to guide the resolution of uncertainty. Result: Experiments show that strong diagnostic reasoning does not imply strong information gathering; REFINE consistently outperforms baselines across multiple datasets and enables smaller models to achieve better performance under strong reasoning supervision. Conclusion: Information gathering is a key bottleneck in interactive medical consultation; REFINE effectively mitigates it and offers a new path toward more reliable medical AI agents. Abstract: Interactive medical consultation requires an agent to proactively elicit missing clinical evidence under uncertainty. Yet existing evaluations largely remain static or outcome-centric, neglecting the evidence-gathering process. In this work, we propose an interactive evaluation framework that explicitly models the consultation process using a simulated patient and a simulated reporter grounded in atomic evidence. Based on this representation, we introduce Information Coverage Rate (ICR) to quantify how completely an agent uncovers necessary evidence during interaction. To support systematic study, we build EviMed, an evidence-based benchmark spanning diverse conditions from common complaints to rare diseases, and evaluate 10 models with varying reasoning abilities. We find that strong diagnostic reasoning does not guarantee effective information collection, and this insufficiency acts as a primary bottleneck limiting performance in interactive settings. To address this, we propose REFINE, a strategy that leverages diagnostic verification to guide the agent in proactively resolving uncertainties. Extensive experiments demonstrate that REFINE consistently outperforms baselines across diverse datasets and facilitates effective model collaboration, enabling smaller agents to achieve superior performance under strong reasoning supervision. Our code can be found at https://github.com/NanshineLoong/EID-Benchmark.
[41] LVLMs and Humans Ground Differently in Referential Communication
Peter Zeng,Weiling Li,Amie Paige,Zhengxiang Wang,Panagiotis Kaliosis,Dimitris Samaras,Gregory Zelinsky,Susan Brennan,Owen Rambow
Main category: cs.CL
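TL;DR: Through a referential communication experiment, this paper reveals the limitations of large vision-language models (LVLMs) in interactively resolving referring expressions, highlighting that their inability to model common ground hampers human-AI collaboration.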
Details
Motivation: Generative AI agents must accurately predict human intent to collaborate effectively with people, but they are currently limited by an inability to model common ground. Method: Designs a factorial experiment with director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact over multiple turns in repeated rounds to match pictures of objects that have no obvious lexicalized labels; releases the data collection pipeline, analysis tools, and a corpus of 356 dialogues. Result: LVLMs perform markedly worse than humans at interactively resolving referring expressions, with clear deficits in accuracy, efficiency, and lexical overlap. Conclusion: The lack of dynamic common-ground modeling is a key bottleneck for natural, robust human-AI collaboration with LVLMs; common-ground mechanisms need to be built explicitly into model architectures and training paradigms. Abstract: For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs' limitations in interactively resolving referring expressions, a crucial skill that underlies human language use.
[42] Zero-Shot Stance Detection in the Wild: Dynamic Target Generation and Multi-Target Adaptation
Aohua Li,Yuanshuo Zhang,Ge Gao,Bo Chen,Xiaobing Zhao
Main category: cs.CL
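TL;DR: This paper proposes DGTA, a new zero-shot stance detection task for real-world social media that automatically identifies multiple target-stance pairs in text without predefined targets. It builds a Chinese social media dataset with multi-dimensional evaluation metrics and shows that fine-tuned large language models (in particular the two-stage fine-tuned Qwen2.5-7B and the integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B) are effective on the task.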
Details
Motivation: On real social platforms the targets of stance detection are complex and change dynamically, whereas existing methods rely on predefined, static targets and struggle to meet practical needs. Method: Proposes the new DGTA task (Dynamic Target Generation and Multi-Target Adaptation), builds a Chinese social media stance detection dataset, designs multi-dimensional evaluation metrics, and explores integrated and two-stage fine-tuning strategies for large language models. Result: The two-stage fine-tuned Qwen2.5-7B reaches a comprehensive target recognition score of 66.99%; the integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B reaches a stance detection F1 of 79.26%. Conclusion: Fine-tuned large language models can handle zero-shot stance detection with multiple, dynamic targets, and the DGTA task offers a new paradigm for stance analysis in real-world settings. Abstract: Current stance detection research typically relies on predicting stance based on given targets and text. However, in real-world social media scenarios, targets are neither predefined nor static but rather complex and dynamic. To address this challenge, we propose a novel task: zero-shot stance detection in the wild with Dynamic Target Generation and Multi-Target Adaptation (DGTA), which aims to automatically identify multiple target-stance pairs from text without prior target knowledge. We construct a Chinese social media stance detection dataset and design multi-dimensional evaluation metrics. We explore both integrated and two-stage fine-tuning strategies for large language models (LLMs) and evaluate various baseline models. Experimental results demonstrate that fine-tuned LLMs achieve superior performance on this task: the two-stage fine-tuned Qwen2.5-7B attains the highest comprehensive target recognition score of 66.99%, while the integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B achieves a stance detection F1 score of 79.26%.
[43] When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering
Mahdi Astaraki,Mohammad Arshi Saloot,Ali Shiraee Kasmaee,Hamidreza Mahyar,Soheila Samiee
Main category: cs.CL
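TL;DR: Through controlled experiments, this paper provides the first systematic evaluation of whether iterative retrieval-reasoning loops (Iterative RAG) can surpass idealized static RAG (Gold Context). On chemistry multi-hop question answering, Iterative RAG consistently outperforms Gold Context, with gains of up to 25.6 percentage points, especially for models without reasoning fine-tuning. The advantage comes from staged retrieval, which reduces late-hop failures, context overload, and hypothesis drift, although incomplete coverage, distractor latching, miscalibrated early stopping, and composition failures remain.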
Details
Motivation: Existing RAG faces challenges in scientific domains, including multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence, and it is unclear whether iterative retrieval-reasoning can truly surpass an idealized static RAG that supplies all the correct evidence at once. Method: Sets up three regimes (No Context, Gold Context, and a training-free Iterative RAG controller) and runs mechanism-level diagnostics on 11 state-of-the-art LLMs on the ChemKGMultiHopQA dataset, analyzing coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Result: Iterative RAG consistently outperforms Gold Context across models, by up to 25.6 percentage points; staged retrieval reduces late-hop failures, mitigates context overload, and corrects early hypothesis drift, but incomplete coverage, distractor latching, miscalibrated early stopping, and high composition failure rates remain. Conclusion: Staged retrieval is often more influential than merely supplying ideal evidence; the study offers practical guidance for deploying and diagnosing RAG in scientific domains and lays a foundation for more reliable, controllable iterative retrieval-reasoning frameworks. Abstract: Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval.
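The Iterative RAG regime, a controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping, can be pictured with a minimal sketch. Everything here is a hypothetical illustration rather than the authors' controller: the toy fact table, the `toy_retrieve` function, and the rule of stopping as soon as a hop returns no evidence are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy knowledge graph: (entity, relation) -> entity.
TOY_FACTS: Dict[Tuple[str, str], str] = {
    ("aspirin", "contains_group"): "ester",
    ("ester", "hydrolyzes_to"): "carboxylic acid",
}


def toy_retrieve(entity: str, relation: str) -> Optional[str]:
    """Stand-in retriever: look up one hop in the toy fact table."""
    return TOY_FACTS.get((entity, relation))


def iterative_rag(
    start: str,
    relations: List[str],
    retrieve: Callable[[str, str], Optional[str]],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[Tuple[str, str, str]]]:
    """Alternate retrieval and hypothesis refinement, one hop at a time."""
    hypothesis = start
    evidence: List[Tuple[str, str, str]] = []
    for relation in relations[:max_steps]:
        fact = retrieve(hypothesis, relation)  # staged retrieval for this hop
        if fact is None:
            break  # evidence-aware stop: a coverage gap, so do not guess
        evidence.append((hypothesis, relation, fact))
        hypothesis = fact  # refine the working hypothesis with new evidence
    # Answer only if every hop was supported by retrieved evidence.
    complete = len(evidence) == len(relations)
    return (hypothesis if complete else None), evidence


answer, trace = iterative_rag(
    "aspirin", ["contains_group", "hydrolyzes_to"], toy_retrieve
)
```

A Gold Context baseline would instead hand the model every entry of the fact table at once and ask for the answer in a single pass.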
Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
[44] Identifying and Transferring Reasoning-Critical Neurons: Improving LLM Inference Reliability via Activation Steering
Fangan Dong,Zuming Yan,Xuri Ge,Zhiwei Xu,Mengqi Zhang,Xuanang Chen,Ben He,Xin Xin,Zhumin Chen,Ying Zhou
Main category: cs.CL
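TL;DR: This paper proposes AdaRAS, a lightweight test-time intervention that identifies and steers Reasoning-Critical Neurons (RCNs), the neurons most predictive of reasoning correctness, to improve LLM reasoning reliability on mathematics and coding tasks without additional training or sampling cost.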
Details
Motivation: Although current large language models reason well, they need post-training or expensive sampling strategies to perform reliably, which limits their practicality. Method: Building on the finding that a small number of neurons correlate strongly with reasoning correctness, proposes the AdaRAS framework, which identifies Reasoning-Critical Neurons (RCNs) with a polarity-aware mean-difference criterion and adaptively steers their activations at inference time. Result: Improves performance on 10 mathematics and coding benchmarks, including gains of over 13% on AIME-24 and AIME-25; transfers across datasets, scales to stronger models, and outperforms post-training methods. Conclusion: AdaRAS shows that fine-grained neuron-level intervention at test time can efficiently improve reasoning reliability, offering a new paradigm for training-free model enhancement. Abstract: Despite the strong reasoning capabilities of recent large language models (LLMs), achieving reliable performance on challenging tasks often requires post-training or computationally expensive sampling strategies, limiting their practical efficiency. In this work, we first show that a small subset of neurons in LLMs exhibits strong predictive correlations with reasoning correctness. Based on this observation, we propose AdaRAS (Adaptive Reasoning Activation Steering), a lightweight test-time framework that improves reasoning reliability by selectively intervening on neuron activations. AdaRAS identifies Reasoning-Critical Neurons (RCNs) via a polarity-aware mean-difference criterion and adaptively steers their activations during inference, enhancing incorrect reasoning traces while avoiding degradation on already-correct cases. Experiments on 10 mathematics and coding benchmarks demonstrate consistent improvements, including over 13% gains on AIME-24 and AIME-25. Moreover, AdaRAS exhibits strong transferability across datasets and scalability to stronger models, outperforming post-training methods without additional training or sampling cost.
[45] Reflective Translation: Improving Low-Resource Machine Translation via Structured Self-Reflection
Nicholas Cheng
Main category: cs.CL
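TL;DR: This paper proposes Reflective Translation, a prompting framework in which a large language model produces an initial translation, writes a structured self-critique, and then refines the translation accordingly. On low-resource English-isiZulu and English-isiXhosa translation it consistently improves BLEU and COMET scores, requires no fine-tuning, and is model-agnostic.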
Details
Motivation: Low-resource languages such as isiZulu and isiXhosa have long been difficult for machine translation because parallel corpora and linguistic resources are scarce; the self-reflection ability of large models has been shown to improve reasoning quality and factual consistency and is worth transferring to translation. Method: Proposes the Reflective Translation framework, in which the model first generates an initial translation, then a structured self-critique, and finally a refined translation based on that reflection; evaluated on OPUS-100 and NTREX-African across multiple prompting strategies and confidence thresholds. Result: On English-isiZulu and English-isiXhosa translation, second-pass translations improve over first-pass ones by up to +0.22 BLEU and +0.18 COMET on average, and paired nonparametric tests show the improvements are statistically significant; the method is model-agnostic, needs no fine-tuning, and yields a reflection-augmented dataset. Conclusion: Structured self-reflection is a practical and effective way to improve translation quality in low-resource settings and provides new data for later supervised training or analysis. Abstract: Low-resource languages such as isiZulu and isiXhosa face persistent challenges in machine translation due to limited parallel data and linguistic resources. Recent advances in large language models suggest that self-reflection, prompting a model to critique and revise its own outputs, can improve reasoning quality and factual consistency. Building on this idea, this paper introduces Reflective Translation, a prompt-based framework in which a model generates an initial translation, produces a structured self-critique, and then uses this reflection to generate a refined translation. The approach is evaluated on English-isiZulu and English-isiXhosa translation using OPUS-100 and NTREX-African, across multiple prompting strategies and confidence thresholds. Results show consistent improvements in both BLEU and COMET scores between first- and second-pass translations, with average gains of up to +0.22 BLEU and +0.18 COMET. Statistical significance testing using paired nonparametric tests confirms that these improvements are robust. The proposed method is model-agnostic, requires no fine-tuning, and introduces a reflection-augmented dataset that can support future supervised or analysis-driven work. These findings demonstrate that structured self-reflection is a practical and effective mechanism for improving translation quality in low-resource settings.
[46] Evaluation of Oncotimia: An LLM based system for supporting tumour boards
Luis Lorenzo,Marcos Montana-Mendez,Sergio Figueiras,Miguel Boubeta,Cristobal Bernardo-Castineira
Main category: cs.CL
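TL;DR: This paper presents ONCOTIMIA, a system that uses generative AI (specifically large language models) to automatically complete multidisciplinary tumour board forms for lung cancer. Combining a multi-layer data lake, hybrid storage, RAG, and a rule-driven adaptive form model, it reaches 80% correct field completion with clinically acceptable latency, supporting its feasibility for reducing documentation burden while preserving data quality.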
Details
Motivation: MDTBs are central to oncology decision-making, but manually handling large volumes of heterogeneous clinical information imposes a heavy documentation burden. Method: Builds the modular, secure ONCOTIMIA system, which combines a multi-layer data lake, hybrid relational and vector storage, retrieval-augmented generation (RAG), and a rule-driven adaptive form model, and evaluates automatic form completion on 10 lung cancer cases with 6 LLMs deployed through AWS Bedrock. Result: The best configuration completes 80% of fields correctly, and most LLMs respond within clinically acceptable times; larger and more recent models are more accurate without a prohibitive increase in latency. Conclusion: LLM-assisted form autocompletion is technically feasible and operationally viable in multidisciplinary lung cancer workflows, and could substantially reduce documentation burden while preserving data quality. Abstract: Multidisciplinary tumour boards (MDTBs) play a central role in oncology decision-making but require manually processing and structuring large volumes of heterogeneous clinical information, resulting in a substantial documentation burden. In this work, we present ONCOTIMIA, a modular and secure clinical tool designed to integrate generative artificial intelligence (GenAI) into oncology workflows and evaluate its application to the automatic completion of lung cancer tumour board forms using large language models (LLMs). The system combines a multi-layer data lake, hybrid relational and vector storage, retrieval-augmented generation (RAG) and a rule-driven adaptive form model to transform unstructured clinical documentation into structured and standardised tumour board records. We assess the performance of six LLMs deployed through AWS Bedrock on ten lung cancer cases, measuring both form completion accuracy and end-to-end latency. The results demonstrate high performance across models, with the best performing configuration achieving 80% correct field completion and clinically acceptable response time for most LLMs. Larger and more recent models achieve the highest accuracy without incurring prohibitive latency. These findings provide empirical evidence that LLM-assisted form autocompletion is technically feasible and operationally viable in multidisciplinary lung cancer workflows and support its potential to significantly reduce documentation burden while preserving data quality.
cs.CV [Back]
[47] Dynamic Mask-Based Backdoor Attack Against Vision AI Models: A Case Study on Mushroom Detection
Zeineb Dridi,Jihen Bennaceur,Amine Ben Hassouna
Main category: cs.CV
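TL;DR: This paper proposes a new dynamic mask-based backdoor attack on object detection models that uses the SAM model to generate masks for dynamic trigger placement, and demonstrates high stealthiness and a high attack success rate on a mushroom detection dataset.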
Details
Motivation: Deployed deep learning models face many adversarial threats, backdoor attacks in particular, and outsourcing training data introduces security risks, so realistic and detailed attack scenarios are needed to expose them. Method: Proposes a backdoor attack that uses SAM (Segment Anything Model) to generate dynamic masks and embeds a malicious trigger through dataset poisoning, enabling a targeted attack on object detection models such as YOLOv7. Result: The attack keeps detection accuracy high on clean samples while achieving a high attack success rate on poisoned samples, surpassing traditional backdoor injection methods based on static patterns. Conclusion: The dynamic mask-based backdoor attack is stealthier and more practical, underscoring the urgency of developing robust defenses against emerging adversarial threats. Abstract: Deep learning has revolutionized numerous tasks within the computer vision field, including image classification, image segmentation, and object detection. However, the increasing deployment of deep learning models has exposed them to various adversarial attacks, including backdoor attacks. This paper presents a novel dynamic mask-based backdoor attack method, specifically designed for object detection models. We exploit a dataset poisoning technique to embed a malicious trigger, rendering any models trained on this compromised dataset vulnerable to our backdoor attack. We particularly focus on a mushroom detection dataset to demonstrate the practical risks posed by such attacks on critical real-life domains. Our work also emphasizes the importance of creating a detailed backdoor attack scenario to illustrate the significant risks associated with the outsourcing practice. Our approach leverages SAM, a recent and powerful image segmentation AI model, to create masks for dynamic trigger placement, introducing a new and stealthy attack method. Through extensive experimentation, we show that our sophisticated attack scenario maintains high accuracy on clean data with the YOLOv7 object detection model while achieving high attack success rates on poisoned samples. Our approach surpasses traditional methods for backdoor injection, which are based on static and consistent patterns. Our findings underscore the urgent need for robust countermeasures to protect deep learning models from these evolving adversarial threats.
[48] Audio-Driven Talking Face Generation with Blink Embedding and Hash Grid Landmarks Encoding
Yuhui Zhang,Hui Yu,Wei Liang,Sunjie Zhang
Main category: cs.CV
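TL;DR: This paper proposes an automatic method based on blink embedding and hash grid landmarks encoding which, combined with a Dynamic Landmark Transformer and NeRF, substantially improves the fidelity of talking-face dynamics, especially mouth movements.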
Details
Motivation: Existing dynamic NeRF methods for talking portraits still struggle to capture mouth movements accurately and efficiently. Method: Proposes a method based on blink embedding and hash grid landmarks encoding; facial features serve as conditional inputs and audio features as residual terms, fused through a Dynamic Landmark Transformer, and NeRF models the entire face. Result: Experiments show that the method generates higher-quality talking faces than existing methods. Conclusion: The method substantially improves the fidelity of dynamic talking-face modeling, particularly the detail of mouth movements. Abstract: Dynamic Neural Radiance Fields (NeRF) have demonstrated considerable success in generating high-fidelity 3D models of talking portraits. Despite significant advancements in the rendering speed and generation quality, challenges persist in accurately and efficiently capturing mouth movements in talking portraits. To tackle this challenge, we propose an automatic method based on blink embedding and hash grid landmarks encoding in this study, which can substantially enhance the fidelity of talking faces. Specifically, we leverage facial features encoded as conditional features and integrate audio features as residual terms into our model through a Dynamic Landmark Transformer. Furthermore, we employ neural radiance fields to model the entire face, resulting in a lifelike face representation. Experimental evaluations have validated the superiority of our approach over existing methods.
[49] SelfieAvatar: Real-time Head Avatar reenactment from a Selfie Video
Wei Liang,Hui Yu,Derui Ding,Rachael E. Jack,Philippe G. Schyns
Main category: cs.CV
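TL;DR: This paper proposes a highly detailed head avatar reenactment method that needs only a single selfie video. It combines a 3DMM with StyleGAN and uses mixed loss functions to improve high-frequency detail reconstruction, outperforming existing methods on both self-reenactment and cross-reenactment.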
Details
Motivation: Existing methods struggle to combine real-time full-head modeling (including non-facial regions and background) with fine-grained details such as wrinkles and hair strands, and they mostly depend on large amounts of training data; there is no approach that achieves high-quality reenactment from a simple selfie video alone. Method: Combines a 3DMM with StyleGAN to build a detailed reconstruction model, and introduces mixed loss functions for foreground reconstruction and avatar image generation during adversarial training to recover high-frequency details. Result: Qualitative and quantitative evaluations on self-reenactment and cross-reenactment show that the method outperforms existing ones in head avatar reconstruction quality, texture richness, and detail. Conclusion: The method addresses the problem of generating high-fidelity, highly detailed head avatars from a single selfie video, combining full-head modeling with fine-grained texture reconstruction and offering a new approach for lightweight personalized avatar applications. Abstract: Head avatar reenactment focuses on creating animatable personal avatars from monocular videos, serving as a foundational element for applications like social signal understanding, gaming, human-machine interaction, and computer vision. Recent advances in 3D Morphable Model (3DMM)-based facial reconstruction methods have achieved remarkable high-fidelity face estimation. However, on the one hand, they struggle to capture the entire head, including non-facial regions and background details in real time, which is an essential aspect for producing realistic, high-fidelity head avatars. On the other hand, recent approaches leveraging generative adversarial networks (GANs) for head avatar generation from videos can achieve high-quality reenactments but encounter limitations in reproducing fine-grained head details, such as wrinkles and hair textures. In addition, existing methods generally rely on a large amount of training data, and rarely focus on using only a simple selfie video to achieve avatar reenactment. To address these challenges, this study introduces a method for detailed head avatar reenactment using a selfie video. The approach combines 3DMMs with a StyleGAN-based generator. A detailed reconstruction model is proposed, incorporating mixed loss functions for foreground reconstruction and avatar image generation during adversarial training to recover high-frequency details.
Qualitative and quantitative evaluations on self-reenactment and cross-reenactment tasks demonstrate that the proposed method achieves superior head avatar reconstruction with rich and intricate textures compared to existing approaches.
[50] Weakly supervised framework for wildlife detection and counting in challenging Arctic environments: a case study on caribou (Rangifer tarandus)
Ghazaleh Serati,Samuel Foucher,Jerome Theau
Main category: cs.CV
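TL;DR: This paper proposes weakly supervised patch-level pretraining to make the HerdNet detection model more robust for automatic Arctic caribou detection. The pretrained network reaches F1 scores of 93.7% and 92.6% on multi-herd imagery and an independent-year test set, and initializing detection from it outperforms ImageNet initialization when labeled data are limited.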
Details
Motivation: Arctic caribou numbers have been declining, creating a need for scalable, accurate monitoring; manual interpretation is slow and error-prone, while automatic detection must cope with heterogeneous backgrounds, extreme class imbalance, small and heavily occluded targets, and large variation in density and scale. Method: Proposes weakly supervised patch-level pretraining based on the detection network's architecture, trained with only coarse empty vs. non-empty labels on a dataset covering five caribou herds in Alaska and then transferred to the fine-grained detection task in place of conventional ImageNet initialization. Result: The weakly supervised pretraining reaches F1 scores of 93.7% on multi-herd imagery (2017) and 92.6% on an independent-year test set (2019); after transfer to detection, it consistently outperforms ImageNet initialization on both positive patches and full-image counting, by roughly 3 to 5 F1 points. Conclusion: When labeled data are limited, weakly supervised pretraining on coarse labels effectively improves detection, with results comparable to, and in the reported experiments better than, generic pretrained weights, offering a practical new approach to wildlife monitoring. Abstract: Caribou populations across the Arctic have declined in recent decades, motivating scalable and accurate monitoring approaches to guide evidence-based conservation actions and policy decisions. Manual interpretation of aerial imagery is labor-intensive and error-prone, underscoring the need for automatic and reliable detection across varying scenes. Yet, such automatic detection is challenging due to severe background heterogeneity, dominant empty terrain (class imbalance), small or occluded targets, and wide variation in density and scale. To make the detection model (HerdNet) more robust to these challenges, a weakly supervised patch-level pretraining based on a detection network's architecture is proposed. The detection dataset includes five caribou herds distributed across Alaska. By learning from empty vs. non-empty labels in this dataset, the approach produces early weakly supervised knowledge for enhanced detection compared to HerdNet, which is initialized from generic weights. Accordingly, the patch-based pretrain network attained high accuracy on multi-herd imagery (2017) and on an independent year's (2019) test sets (F1: 93.7%/92.6%, respectively), enabling reliable mapping of regions containing animals to facilitate manual counting on large aerial imagery. Transferred to detection, initialization from weakly supervised pretraining yielded consistent gains over ImageNet weights on both positive patches (F1: 92.6%/93.5% vs. 89.3%/88.6%), and full-image counting (F1: 95.5%/93.3% vs. 91.5%/90.4%).
Remaining limitations are false positives from animal-like background clutter and false negatives related to low animal density occlusions. Overall, pretraining on coarse labels prior to detection makes it possible to rely on weakly-supervised pretrained weights even when labeled data are limited, achieving results comparable to generic-weight initialization.
[51] RealStats: A Rigorous Real-Only Statistical Framework for Fake Image Detection
Haim Zisman,Uri Shaham
Main category: cs.CV
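TL;DR: This paper proposes a general, statistically grounded, training-free framework for interpretable detection of AI-generated images, which aggregates p-values from multiple detectors to assess how consistent an image is with the real-image distribution.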
Details
Motivation: Existing AI image detectors lack formal interpretability and rely on implicit assumptions about fake content, which limits their robustness under distribution shift. Method: Takes training-free statistics from multiple existing detectors, computes p-values over a range of test statistics, and aggregates them with classical statistical ensembling to assess how well an image aligns with the unified real-image distribution. Result: The framework is generic, flexible, and training-free, and detects AI-generated images robustly across diverse and evolving settings. Conclusion: The statistical framework provides a formal, interpretable, and robust approach to detecting AI-generated images and makes detection results more reliable under distribution shift. Abstract: As generative models continue to evolve, detecting AI-generated images remains a critical challenge. While effective detection methods exist, they often lack formal interpretability and may rely on implicit assumptions about fake content, potentially limiting robustness to distributional shifts. In this work, we introduce a rigorous, statistically grounded framework for fake image detection that focuses on producing a probability score interpretable with respect to the real-image population. Our method leverages the strengths of multiple existing detectors by combining training-free statistics. We compute p-values over a range of test statistics and aggregate them using classical statistical ensembling to assess alignment with the unified real-image distribution. This framework is generic, flexible, and training-free, making it well-suited for robust fake image detection across diverse and evolving settings.
[52] On the Role of Depth in Surgical Vision Foundation Models: An Empirical Study of RGB-D Pre-training
John J. Han,Adam Schmidt,Muhammad Abdullah Jamal,Chinedu Nwoye,Anita Rau,Jie Ying Wu,Omid Mohareri
Main category: cs.CV
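TL;DR: Through a large-scale empirical study, this paper compares eight ViT-based vision foundation models in surgical settings and finds that geometry-aware pre-training with depth information (RGB-D), as in MultiMAE, substantially improves a range of downstream tasks and greatly increases data efficiency, with no change to the inference architecture.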
Details
Motivation: Existing surgical vision foundation models rely mainly on unimodal RGB pre-training and overlook the complex 3D geometry inherent to surgical environments; the value of depth information in surgical settings remains underexplored. Method: Compares eight ViT-based vision foundation models that differ in pre-training domain, learning objective, and input modality (RGB vs. RGB-D); pre-trains on 1.4 million paired RGB and depth images, with depth maps generated by an off-the-shelf network; evaluates frozen-backbone and end-to-end fine-tuning protocols on eight surgical datasets covering detection, segmentation, depth estimation, and pose estimation. Result: Models with explicit geometric tokenization, such as MultiMAE, substantially outperform unimodal baselines on all tasks; geometry-aware pre-training greatly improves data efficiency, with models fine-tuned on only 25% of the labeled data surpassing RGB-only models trained on the full dataset; the gains do not require depth maps at inference, only during pre-training. Conclusion: Multimodal (RGB-D) pre-training is a viable and efficient route to more capable surgical vision systems, and is especially valuable in data-limited medical settings. Abstract: Vision foundation models (VFMs) have emerged as powerful tools for surgical scene understanding. However, current approaches predominantly rely on unimodal RGB pre-training, overlooking the complex 3D geometry inherent to surgical environments. Although several architectures support multimodal or geometry-aware inputs in general computer vision, the benefits of incorporating depth information in surgical settings remain underexplored. We conduct a large-scale empirical study comparing eight ViT-based VFMs that differ in pre-training domain, learning objective, and input modality (RGB vs. RGB-D). For pre-training, we use a curated dataset of 1.4 million robotic surgical images paired with depth maps generated from an off-the-shelf network. We evaluate these models under both frozen-backbone and end-to-end fine-tuning protocols across eight surgical datasets spanning object detection, segmentation, depth estimation, and pose estimation. Our experiments yield several consistent findings. Models incorporating explicit geometric tokenization, such as MultiMAE, substantially outperform unimodal baselines across all tasks. Notably, geometry-aware pre-training enables remarkable data efficiency: models fine-tuned on just 25% of labeled data consistently surpass RGB-only models trained on the full dataset. Importantly, these gains require no architectural or runtime changes at inference; depth is used only during pre-training, making adoption straightforward.
These findings suggest that multimodal pre-training offers a viable path towards building more capable surgical vision systems.[53] Smart Split-Federated Learning over Noisy Channels for Embryo Image Segmentation
Zahra Hafezi Kafshgari,Ivan V. Bajic,Parvaneh Saeedi
Main category: cs.CV
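TL;DR: This paper studies how communication-channel noise affects model performance in Split-Federated learning and proposes a smart averaging strategy to improve noise resilience, validated on an embryo image segmentation task.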
Details
Motivation: Noise in the communication channels of SplitFed learning degrades both training and the quality of the final model, so robustness needs to improve.
Method: Proposes a smart averaging strategy for SplitFed learning to mitigate the adverse effects of channel noise.
Result: Experiments on an embryo image segmentation model show that the strategy tolerates channel noise two orders of magnitude stronger than conventional averaging while maintaining final model accuracy.
Conclusion: The proposed smart averaging strategy markedly improves the robustness and practicality of SplitFed learning over noisy channels.
Abstract: Split-Federated (SplitFed) learning is an extension of federated learning that places minimal requirements on the clients' computing infrastructure, since only a small portion of the overall model is deployed on the clients' hardware. In SplitFed learning, feature values, gradient updates, and model updates are transferred across communication channels. In this paper, we study the effects of noise in the communication channels on the learning process and the quality of the final model. We propose a smart averaging strategy for SplitFed learning with the goal of improving resilience against channel noise. Experiments on a segmentation model for embryo images show that the proposed smart averaging strategy is able to tolerate two orders of magnitude stronger noise in the communication channels compared to conventional averaging, while still maintaining the accuracy of the final model.
[54] Pay Attention to Where You Look
Alex Beriand,JhihYang Wu,Daniel Brignac,Natnael Daba,Abhijit Mahalanobis
Main category: cs.CV
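TL;DR: This paper proposes a camera-weighting mechanism that adjusts the importance of each input view relative to the target view, either from geometric distance and angular difference (deterministic) or via cross-attention (learned), improving the quality and realism of few-shot novel view synthesis (NVS).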
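As a rough illustration of the deterministic weighting idea in this entry, the sketch below turns each source view's Euclidean distance and angular difference to the target camera into a normalized weight. The softmax form, the `alpha` and `beta` coefficients, and the name `camera_weights` are assumptions made for illustration; the abstract does not give the authors' exact formula.

```python
import math


def camera_weights(source_positions, source_dirs, target_position, target_dir,
                   alpha=1.0, beta=1.0):
    """Return one weight per source view; closer, better-aligned views weigh more.

    alpha and beta are hypothetical trade-off coefficients (not from the paper).
    """
    costs = []
    for pos, direction in zip(source_positions, source_dirs):
        # Euclidean distance between camera centres.
        dist = math.dist(pos, target_position)
        # Angle between viewing directions, from the normalized dot product.
        dot = sum(a * b for a, b in zip(direction, target_dir))
        cos = dot / (math.hypot(*direction) * math.hypot(*target_dir))
        angle = math.acos(max(-1.0, min(1.0, cos)))
        costs.append(alpha * dist + beta * angle)
    # Softmax over negative costs, shifted by the minimum for numerical stability.
    shift = min(costs)
    exps = [math.exp(-(c - shift)) for c in costs]
    total = sum(exps)
    return [e / total for e in exps]
```

The resulting weights sum to one, so they could scale per-view features or attention logits before aggregation; a learned scheme would replace the hand-set cost with cross-attention scores.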
Details
Motivation: Existing few-shot novel view synthesis methods assume all input views matter equally to the target view, which leads to suboptimal results.
Method: Proposes two camera-weighting schemes: a deterministic scheme based on Euclidean distance and angular difference, and a learned scheme based on cross-attention; the mechanism plugs into various NVS models and can also be used for fine-tuning.
Result: Validation on multiple benchmarks shows that the method markedly improves the accuracy and realism of synthesized images.
Conclusion: Adaptive view weighting is an effective and general improvement strategy that offers a new direction for few-shot NVS.
Abstract: Novel view synthesis (NVS) has advanced with generative modeling, enabling photorealistic image generation. In few-shot NVS, where only a few input views are available, existing methods often assume equal importance for all input views relative to the target, leading to suboptimal results. We address this limitation by introducing a camera-weighting mechanism that adjusts the importance of source views based on their relevance to the target. We propose two approaches: a deterministic weighting scheme leveraging geometric properties like Euclidean distance and angular differences, and a cross-attention-based learning scheme that optimizes view weighting. Additionally, models can be further trained with our camera-weighting scheme to refine their understanding of view relevance and enhance synthesis quality. This mechanism is adaptable and can be integrated into various NVS algorithms, improving their ability to synthesize high-quality novel views. Our results demonstrate that adaptive view weighting enhances accuracy and realism, offering a promising direction for improving NVS.
[55] FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Geometry-Complete 4D Reconstruction
Wei Cao,Hao Zhang,Fengrui Tian,Yulun Wu,Yingying Li,Shenlong Wang,Ning Yu,Yaoyao Liu
Main category: cs.CV
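TL;DR: FreeOrbit4D is a training-free framework that builds a geometry-complete 4D proxy (a static background combined with foreground point clouds reconstructed by object-centric multi-view diffusion) as structural grounding for large-angle camera redirection of monocular videos, improving the geometric consistency and temporal coherence of the redirected videos.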
Details
Motivation: A monocular video offers only a narrow spatio-temporal view of a dynamic 3D scene, so large-angle camera redirection suffers from severe geometric ambiguity and temporal inconsistency, and existing diffusion methods degrade noticeably in this regime.
Method: FreeOrbit4D decouples foreground and background reconstruction: it unprojects the monocular video into a static background and geometry-incomplete foreground point clouds in a unified global space; uses an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct geometry-complete foreground point clouds in canonical object space; aligns the canonical foreground to the global scene space via dense pixel-synchronized 3D-3D correspondences to form a geometry-complete 4D proxy; and finally uses this proxy as a geometric scaffold to guide a conditional video diffusion model in generating the target-view video.
Result: On large-angle redirection, FreeOrbit4D produces redirected videos that are more faithful, more geometrically consistent, and more temporally coherent; the geometry-complete 4D proxy also supports potential applications such as edit propagation and 4D data generation.
Conclusion: By introducing a training-free, geometry-complete 4D proxy as structural grounding, FreeOrbit4D mitigates the fundamental geometric ambiguity of monocular large-angle camera redirection and offers a new paradigm for high-quality 4D content generation.
Abstract: Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing highly partial observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive results, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. To address this, we present FreeOrbit4D, an effective training-free framework that tackles this geometric ambiguity by recovering a geometry-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and geometry-incomplete foreground point clouds in a unified global space, then leverage an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct geometry-complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the geometry-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful redirected videos under challenging large-angle trajectories, and our geometry-complete 4D proxy further opens a potential avenue for practical applications such as edit propagation and 4D data generation. Project page and code will be released soon.
[56] Anatomically-aware conformal prediction for medical image segmentation with random walks
Mélanie Gaillochet,Christian Desrosiers,Hervé Lombaert
Main category: cs.CV
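TL;DR: This paper proposes Random-Walk Conformal Prediction (RW-CP), a model-agnostic framework for uncertainty quantification in medical image segmentation. It diffuses uncertainty with a random walk on a k-nearest-neighbour graph built from pre-trained vision foundation model features, producing anatomically plausible, spatially coherent prediction sets that preserve statistical validity while improving segmentation quality.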
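To make the graph-diffusion idea in this entry concrete, the sketch below builds a k-nearest-neighbour graph over feature vectors and smooths per-node non-conformity scores with a random walk with restart. The restart formulation, the `restart` and `steps` parameters, and both function names are assumptions for illustration; the abstract does not specify the transition matrix or the diffusion schedule that RW-CP uses.

```python
import math


def knn_graph(features, k):
    """For each node, list the indices of its k nearest neighbours (Euclidean)."""
    n = len(features)
    neighbours = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(features[i], features[j]))
        neighbours.append(others[:k])
    return neighbours


def random_walk_smooth(scores, neighbours, steps=10, restart=0.5):
    """Diffuse scores over the graph; `restart` pulls each node back to its own score."""
    current = list(scores)
    for _ in range(steps):
        current = [
            restart * scores[i]
            + (1.0 - restart) * sum(current[j] for j in neighbours[i]) / len(neighbours[i])
            for i in range(len(scores))
        ]
    return current
```

Each update is a convex combination of a node's own score and its neighbours' mean, so an isolated high score inside a homogeneous region is pulled towards that region's values. In a conformal pipeline the smoothed scores would then be thresholded at a quantile calibrated on held-out data.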
Details
Motivation: Standard conformal prediction in medical image segmentation often ignores anatomical context, yielding fragmented, spatially incoherent, over-segmented predictions of limited clinical utility; a method is needed that provides rigorous error guarantees while remaining anatomically plausible.
Method: Proposes Random-Walk Conformal Prediction (RW-CP): it extracts features with a pre-trained vision foundation model, builds a k-nearest-neighbour graph, and uses a random walk to apply diffusion-style regularization to the non-conformity scores, improving spatial consistency and robustness to the calibration parameter λ. The method plugs into any segmentation model.
Result: Evaluations on multi-modal public datasets show improvements of up to 35.4% over standard conformal prediction baselines at an allowable error rate of α=0.1, while preserving marginal coverage and producing more continuous, anatomically plausible prediction sets.
Conclusion: RW-CP is a general, effective, and anatomically credible uncertainty quantification method that balances statistical rigor with clinical usability, offering a new path to the safe deployment of deep learning in medical imaging.
Abstract: The reliable deployment of deep learning in medical imaging requires uncertainty quantification that provides rigorous error guarantees while remaining anatomically meaningful. Conformal prediction (CP) is a powerful distribution-free framework for constructing statistically valid prediction intervals. However, standard applications in segmentation often ignore anatomical context, resulting in fragmented, spatially incoherent, and over-segmented prediction sets that limit clinical utility. To bridge this gap, this paper proposes Random-Walk Conformal Prediction (RW-CP), a model-agnostic framework which can be added on top of any segmentation method. RW-CP enforces spatial coherence to generate anatomically valid sets. Our method constructs a k-nearest neighbour graph from pre-trained vision foundation model features and applies a random walk to diffuse uncertainty. The random walk diffusion regularizes the non-conformity scores, making the prediction sets less sensitive to the conformal calibration parameter $λ$, ensuring more stable and continuous anatomical boundaries. RW-CP maintains rigorous marginal coverage while significantly improving segmentation quality. Evaluations on multi-modal public datasets show improvements of up to $35.4\%$ compared to standard CP baselines, given an allowable error rate of $α=0.1$.
[57] Non-Invasive 3D Wound Measurement with RGB-D Imaging
Lena Harkämper,Leo Lebrat,David Ahmedt-Aristizabal,Olivier Salvado,Mattias Heinrich,Rodrigo Santa Cruz
Main category: cs.CV
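TL;DR: This paper presents a fast, non-invasive 3D wound measurement algorithm based on RGB-D imaging that combines RGB-D odometry with B-spline surface reconstruction to build accurate 3D wound models and compute clinical parameters automatically.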
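The clinical measurements mentioned in this entry (surface area, perimeter) can be read off a triangle mesh with elementary geometry. The sketch below is a generic illustration of that final step only, not the authors' pipeline: it assumes the wound has already been reconstructed as an open triangle mesh, and the function names are hypothetical.

```python
import math


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the cross product of two edges."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * math.hypot(*cross)


def mesh_surface_area(vertices, faces):
    """Sum of triangle areas over all faces (faces are triples of vertex indices)."""
    return sum(triangle_area(*(vertices[i] for i in face)) for face in faces)


def boundary_perimeter(vertices, faces):
    """Total length of edges that belong to exactly one triangle (the open boundary)."""
    edge_counts = {}
    for face in faces:
        for i in range(3):
            edge = tuple(sorted((face[i], face[(i + 1) % 3])))
            edge_counts[edge] = edge_counts.get(edge, 0) + 1
    return sum(
        math.dist(vertices[a], vertices[b])
        for (a, b), count in edge_counts.items()
        if count == 1
    )
```

For a unit square split into two triangles, these functions return an area of 1.0 and a perimeter of 4.0, in whatever unit the vertex coordinates use.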
Details
Motivation: Monitoring and managing chronic wounds requires accurate and efficient wound measurement methods.
Method: Combines RGB-D odometry with B-spline surface reconstruction to generate 3D wound meshes and automatically computes clinical measurements such as perimeter, surface area, and dimensions.
Result: On silicone wound phantoms, the method reaches sub-millimetre reconstruction accuracy; measurements are repeatable and agree closely with manual assessment, the pipeline outperforms a state-of-the-art RGB-D reconstruction method, and runtimes suit real-time clinical deployment.
Conclusion: The method is a promising tool for automated wound assessment in clinical and remote healthcare settings.
Abstract: Chronic wound monitoring and management require accurate and efficient wound measurement methods. This paper presents a fast, non-invasive 3D wound measurement algorithm based on RGB-D imaging. The method combines RGB-D odometry with B-spline surface reconstruction to generate detailed 3D wound meshes, enabling automatic computation of clinically relevant wound measurements such as perimeter, surface area, and dimensions. We evaluated our system on realistic silicone wound phantoms and measured sub-millimetre 3D reconstruction accuracy compared with high-resolution ground-truth scans. The extracted measurements demonstrated low variability across repeated captures and strong agreement with manual assessments. The proposed pipeline also outperformed a state-of-the-art object-centric RGB-D reconstruction method while maintaining runtimes suitable for real-time clinical deployment. Our approach offers a promising tool for automated wound assessment in both clinical and remote healthcare settings.
[58] NC-Reg: Neural Cortical Maps for Rigid Registration
Ines Vati,Pierrick Bourgeat,Rodrigo Santa Cruz,Vincent Dore,Olivier Salvado,Clinton Fookes,Léo Lebrat
Main category: cs.CV
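TL;DR: This paper proposes neural cortical maps, a continuous and compact neural representation of cortical feature maps that replaces traditional discrete structures such as grids and triangle meshes. The representation learns from meshes of arbitrary size, outputs features at any resolution, and makes optimization on the sphere efficient (up to 30 times faster than classic barycentric interpolation). Built on it, the NC-Reg algorithm performs rigid registration of cortical surfaces with sub-degree accuracy (below 1°), making it suitable as a robust clinical pre-alignment strategy.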
Details
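Motivation: Traditional cortical feature maps rely on discrete structures (such as grids and triangle meshes) that generalize poorly across resolutions and topologies and are inefficient to optimize on the sphere; a continuous, compact, differentiable, resolution-independent representation is needed to make cortical analysis more flexible and efficient.
Method: Proposes neural cortical maps, implicit neural functions parameterized on the sphere that map 2D spherical coordinates to feature vectors; builds the NC-Reg algorithm on this representation, combining neural feature extraction, gradient descent optimization, and a simulated annealing strategy for rigid registration of cortical surfaces.
Result: Neural cortical maps produce features at any resolution, and optimization on the sphere runs up to 30 times faster than classic barycentric interpolation; NC-Reg reaches sub-degree accuracy (below 1°) in subject-to-template registration, and ablation studies confirm the contribution of each component.
Conclusion: Neural cortical maps are an efficient, flexible, and differentiable way to represent cortical features; NC-Reg, a representative application, shows strong registration accuracy and robustness and could become a key pre-alignment tool in clinical neuroimaging preprocessing.
Abstract: We introduce neural cortical maps, a continuous and compact neural representation for cortical feature maps, as an alternative to traditional discrete structures such as grids and meshes. It can learn from meshes of arbitrary size and provide learnt features at any resolution. Neural cortical maps enable efficient optimization on the sphere and achieve runtimes up to 30 times faster than classic barycentric interpolation (for the same number of iterations). As a proof of concept, we investigate rigid registration of cortical surfaces and propose NC-Reg, a novel iterative algorithm that involves the use of neural cortical feature maps, gradient descent optimization and a simulated annealing strategy. Through ablation studies and subject-to-template experiments, our method demonstrates sub-degree accuracy ($<1^\circ$ from the global optimum), and serves as a promising robust pre-alignment strategy, which is critical in clinical settings.
[59] NuiWorld: Exploring a Scalable Framework for End-to-End Controllable World Generation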
Han-Hung Lee,Cheng-Yu Yang,Yu-Lun Liu,Angel X. Chang
Main category: cs.CV
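TL;DR: This paper proposes NuiWorld, a framework that addresses data scarcity in world generation through a generative bootstrapping strategy and uses variable scene chunks with a flattened vector-set representation to improve controllability, scalability, and efficiency.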
Details
Motivation: Existing world generation methods face three obstacles: controllability, scalability, and efficiency. End-to-end models are limited by data scarcity; object-centric generation loses fidelity on large scenes because of fixed-resolution representations; training-free methods are flexible but slow and computationally expensive at inference.
Method: Proposes a generative bootstrapping strategy that starts from a few input images and combines 3D reconstruction with expandable scene generation to synthesize multi-scale scenes, easing data scarcity; introduces pseudo sketch labels for controllable generation; represents a scene as a collection of variable-size scene chunks compressed into a flattened vector set to shorten the token length.
Result: NuiWorld substantially reduces token length while preserving geometric fidelity for large scenes, improving training and inference efficiency; it supports pseudo-sketch control and shows some generalization to unseen sketches.
Conclusion: NuiWorld balances controllability, scalability, and efficiency in world generation, providing a more practical generation framework for applications such as video games, simulation, and robotics.
Abstract: World generation is a fundamental capability for applications like video games, simulation, and robotics. However, existing approaches face three main obstacles: controllability, scalability, and efficiency. End-to-end scene generation models have been limited by data scarcity. Object-centric generation approaches rely on fixed-resolution representations, which degrades fidelity for larger scenes. Training-free approaches, while flexible, are often slow and computationally expensive at inference time. We present NuiWorld, a framework that attempts to address these challenges. To overcome data scarcity, we propose a generative bootstrapping strategy that starts from a few input images. Leveraging recent 3D reconstruction and expandable scene generation techniques, we synthesize scenes of varying sizes and layouts, producing enough data to train an end-to-end model. Furthermore, our framework enables controllability through pseudo sketch labels, and demonstrates a degree of generalization to previously unseen sketches. Our approach represents scenes as a collection of variable scene chunks, which are compressed into a flattened vector-set representation. This significantly reduces the token length for large scenes, enabling consistent geometric fidelity across scene sizes while improving training and inference efficiency.
[60] Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models
Jeonghwan Kim,Renjie Tao,Sanat Sharma,Jiaqi Wang,Kai Sun,Zhaojiang Lin,Seungwhan Moon,Lambert Mathias,Anuj Kumar,Heng Ji,Xin Luna Dong
Main category: cs.CV
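TL;DR: PixSearch is the first end-to-end segmenting large vision-language model that unifies region-level perception with retrieval-augmented reasoning, by generating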
Details
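Motivation: Existing VQA methods fall short in combining fine-grained perception with external knowledge, and MM-RAG systems lack an internal policy for when and how to retrieve.
Method: Proposes the PixSearch model, which during the encoding stage dynamically generates
[61] m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning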
Yosub Shin,Michael Buriek,Igor Molybog
Main category: cs.CV
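TL;DR: This paper introduces the m2sv benchmark for evaluating vision-language models on map-to-street-view spatial reasoning, finds that existing models fall far below human performance on the task, and analyzes why they fail.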
Details
Motivation: Existing vision-language models do well on multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead views with first-person views.
Method: Builds the m2sv benchmark (with the m2sv-20k and m2sv-sft-11k datasets), which evaluates spatial alignment by asking models to infer the camera heading of a Street View image from a north-up overhead map; uses supervised fine-tuning and reinforcement learning to adapt and analyze models.
Result: The best VLM reaches only 65.2% accuracy on m2sv, far below the 95% human baseline; fine-tuning and reinforcement learning bring consistent gains, but cross-benchmark transfer is limited; a systematic failure analysis reveals core deficits in geometric alignment, evidence aggregation, and reasoning consistency.
Conclusion: Current VLMs have fundamental limitations in cross-view spatial reasoning and need more grounded reasoning based on real-world geometric constraints.
Abstract: Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, far below the human baseline of 95%. While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.
[62] Glance and Focus Reinforcement for Pan-cancer Screening
Linshan Wu,Jiaxin Zhuang,Hao Chen
Main category: cs.CV
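TL;DR: This paper proposes GF-Screen, a framework that uses a glance-and-focus reinforcement learning strategy to tackle difficult lesion localization and extreme foreground-background imbalance in pan-cancer CT screening: a Glance model localizes coarsely, a Focus model segments precisely, and group relative learning improves efficiency and accuracy.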
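The group relative comparison in this entry can be pictured as scoring each sub-volume against the others cropped from the same CT volume. The sketch below normalizes rewards within a group by their mean and standard deviation and keeps only above-average candidates. This particular normalization, the `eps` constant, and both function names are assumptions for illustration; the abstract does not state how GF-Screen computes advantages.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each reward relative to its own group (zero mean, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def keep_high_advantage(sub_volumes, rewards):
    """Keep sub-volumes whose reward beats the group average; discard the rest."""
    advantages = group_relative_advantages(rewards)
    return [v for v, a in zip(sub_volumes, advantages) if a > 0]
```

In such a scheme, `rewards` would come from the Focus model's segmentation quality on each sub-volume, and the positive advantages would weight the Glance model's policy update.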
Details
Motivation: Existing AI methods struggle to localize diverse tiny lesions in large CT volumes, and extreme foreground-background imbalance keeps models from focusing on lesion regions; redundant attention to healthy regions lowers efficiency and raises false positives.
Method: Proposes GF-Screen: a Glance model crops sub-volumes likely to contain lesions from the full CT volume and a Focus model segments them precisely; the segmentation results reward the Glance model through reinforcement learning; a group relative learning paradigm optimizes the Glance model's selection policy by comparing advantages within groups of sub-volumes.
Result: Validated on 16 internal and 7 external datasets covering 9 lesion types; ranks first on the public validation leaderboard of the MICCAI FLARE25 pan-cancer challenge, improving on the FLARE24 champion by 25.6% DSC and 28.2% NSD.
Conclusion: GF-Screen adapts cutting-edge reinforcement learning techniques to pan-cancer CT screening for the first time, with notable gains in localization accuracy, efficiency, and false-positive control, offering a new paradigm for large-scale clinical cancer screening.
Abstract: Pan-cancer screening in large-scale CT scans remains challenging for existing AI methods, primarily due to the difficulty of localizing diverse types of tiny lesions in large CT volumes. The extreme foreground-background imbalance significantly hinders models from focusing on diseased regions, while redundant focus on healthy regions not only decreases the efficiency but also increases false positives. Inspired by radiologists' glance and focus diagnostic strategy, we introduce GF-Screen, a Glance and Focus reinforcement learning framework for pan-cancer screening. GF-Screen employs a Glance model to localize the diseased regions and a Focus model to precisely segment the lesions, where segmentation results of the Focus model are leveraged to reward the Glance model via Reinforcement Learning (RL). Specifically, the Glance model crops a group of sub-volumes from the entire CT volume and learns to select the sub-volumes with lesions for the Focus model to segment. Given that the selecting operation is non-differentiable for segmentation training, we propose to employ the segmentation results to reward the Glance model. To optimize the Glance model, we introduce a novel group relative learning paradigm, which employs group relative comparison to prioritize high-advantage predictions and discard low-advantage predictions within sub-volume groups, not only improving efficiency but also reducing false positives. In this way, for the first time, we effectively extend cutting-edge RL techniques to tackle the specific challenges in pan-cancer screening. Extensive experiments on 16 internal and 7 external datasets across 9 lesion types demonstrated the effectiveness of GF-Screen. Notably, GF-Screen leads the public validation leaderboard of MICCAI FLARE25 pan-cancer challenge, surpassing the FLARE24 champion solution by a large margin (+25.6% DSC and +28.2% NSD).
[63] Reg-TTR, Test-Time Refinement for Fast, Robust and Accurate Image Registration
Lin Chen,Yue He,Fengting Zhang,Yaonan Wang,Fengming Lin,Xiang Chen,Min Liu
Main category: cs.CV
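TL;DR: This paper proposes Reg-TTR, a test-time refinement framework that combines the strengths of deep learning and conventional registration methods, substantially improving registration accuracy and reaching SOTA performance at a cost of only 21% additional inference time (0.56 s).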
Details
Motivation: Registration foundation models balance speed and robustness but struggle to match the peak accuracy of specialized models trained on specific datasets.
Method: Reg-TTR refines the predictions of a pre-trained model at inference time, combining the efficiency of deep learning with the robustness of conventional methods.
Result: Reaches SOTA performance on two distinct tasks with inference speed close to previous deep learning methods, adding only 0.56 s (21%) of extra time.
Conclusion: Reg-TTR offers an efficient and practical strategy for narrowing the performance gap between registration foundation models and specialized SOTA models.
Abstract: Traditional image registration methods are robust but slow due to their iterative nature. While deep learning has accelerated inference, it often struggles with domain shifts. Emerging registration foundation models offer a balance of speed and robustness, yet typically cannot match the peak accuracy of specialized models trained on specific datasets. To mitigate this limitation, we propose Reg-TTR, a test-time refinement framework that synergizes the complementary strengths of both deep learning and conventional registration techniques. By refining the predictions of pre-trained models at inference, our method delivers significantly improved registration accuracy at a modest computational cost, requiring only 21% additional inference time (0.56s). We evaluate Reg-TTR on two distinct tasks and show that it achieves state-of-the-art (SOTA) performance while maintaining inference speeds close to previous deep learning methods. As foundation models continue to emerge, our framework offers an efficient strategy to narrow the performance gap between registration foundation models and SOTA methods trained on specialized datasets. The source code will be publicly available following the acceptance of this work.
[64] FBSDiff++: Improved Frequency Band Substitution of Diffusion Features for Efficient and Highly Controllable Text-Driven Image-to-Image Translation
Xiang Gao,Yunpeng Jia
Main category: cs.CV
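TL;DR: This paper proposes FBSDiff and FBSDiff++, a plug-and-play framework for text-driven image-to-image (I2I) translation built on a frequency-domain perspective. It needs no training or fine-tuning: by dynamically substituting frequency components of the diffusion model's latent features, it supports appearance-, layout-, and contour-guided controllable image editing, with continuous intensity control and inputs of arbitrary resolution.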
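The core operation in this entry, swapping one frequency band of a feature for the same band of another, can be shown on a 1D signal. The sketch below uses a naive discrete Fourier transform and replaces the low-frequency band of `target` with that of `source`. The choice of transform, the 1D setting, the `cutoff` parameter, and the function names are all assumptions for illustration; the method itself operates on latent diffusion features and the abstract does not name the transform it uses.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def substitute_low_band(target, source, cutoff):
    """Replace target's frequencies up to `cutoff` with the source's, keep the rest."""
    target_spec, source_spec = dft(target), dft(source)
    n = len(target_spec)
    for k in range(n):
        # Bins k and n - k hold the same physical frequency in a DFT.
        if min(k, n - k) <= cutoff:
            target_spec[k] = source_spec[k]
    return idft(target_spec)
```

Widening `cutoff` transfers more of the source's coarse structure, which mirrors the entry's point that the bandwidth of the substituted band controls how strongly the output follows the source image.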
Details
Motivation: Large models for text-to-image generation are mature, but extending them to text-driven image-to-image translation efficiently, controllably, and without training remains challenging; frequency-domain modeling offers a new way to decouple image content from structure.
Method: Proposes Frequency Band Substitution, which dynamically substitutes the low-, mid-, and high-frequency bands of features in the diffusion model's latent space to control images at different granularities; FBSDiff++ further refines the architecture to accelerate inference, supports inputs of arbitrary resolution, and extends to localized editing and style-specific generation.
Result: FBSDiff++ outperforms existing advanced methods in visual quality, efficiency (8.9× speedup), controllability, and versatility; it supports appearance-, layout-, and contour-guided translation with continuous intensity control, and handles arbitrary resolutions and localized editing.
Conclusion: The frequency-domain perspective gives plug-and-play I2I translation a simple yet powerful modeling paradigm, and the FBSDiff family demonstrates its advantages in controllability, flexibility, and practicality.
Abstract: With large-scale text-to-image (T2I) diffusion models achieving significant advancements in open-domain image creation, increasing attention has been focused on their natural extension to the realm of text-driven image-to-image (I2I) translation, where a source image acts as visual guidance to the generated image in addition to the textual guidance provided by the text prompt. We propose FBSDiff, a novel framework adapting an off-the-shelf T2I diffusion model into the I2I paradigm from a fresh frequency-domain perspective. Through dynamic frequency band substitution of diffusion features, FBSDiff realizes versatile and highly controllable text-driven I2I in a plug-and-play manner (without need for model training, fine-tuning, or online optimization), allowing appearance-guided, layout-guided, and contour-guided I2I translation by progressively substituting low-frequency band, mid-frequency band, and high-frequency band of latent diffusion features, respectively. In addition, FBSDiff flexibly enables continuous control over I2I correlation intensity simply by tuning the bandwidth of the substituted frequency band. To further promote image translation efficiency, flexibility, and functionality, we propose FBSDiff++ which improves upon FBSDiff mainly in three aspects: (1) accelerate inference speed by a large margin (8.9$\times$ speedup in inference) with refined model architecture; (2) improve the Frequency Band Substitution module to allow for input source images of arbitrary resolution and aspect ratio; (3) extend model functionality to enable localized image manipulation and style-specific content creation with only subtle adjustments to the core method. Extensive qualitative and quantitative experiments verify the superiority of FBSDiff++ in I2I translation visual quality, efficiency, versatility, and controllability compared to related advanced approaches.
[65] Implicit Non-Causal Factors are Out via Dataset Splitting for Domain Generalization Object Detection
Zhilong Zhang,Lei Zhang,Qing He,Shuyin Xia,Guoyin Wang,Fuxiang Huang
Main category: cs.CV
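TL;DR: This paper proposes GB-DAL, which uses Prototype-based Granular Ball Splitting (PGBS) and Simulated Non-causal Factors (SNF) modules to mitigate the non-causal factors that sparse domain labels and implicit data bias introduce in open-world object detection, improving domain generalization.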
Details
Motivation: In open-world object detection, existing domain generalization methods based on domain adversarial learning (DAL) struggle with implicit non-causal factors, mainly because domain labels are extremely sparse and the implicit non-causal factors induced by data bias are hard to identify.
Method: Proposes GB-DAL: (1) a Prototype-based Granular Ball Splitting (PGBS) module generates denser pseudo-domains from limited datasets, improving the modeling of potential non-causal factors; (2) a Simulated Non-causal Factors (SNF) module augments data with adversarial-like perturbations to reduce the implicitness of non-causal factors.
Result: Experiments on multiple benchmarks show that GB-DAL generalizes better to novel scenarios than existing methods.
Conclusion: Improving the DAL paradigm from the granular-ball perspective and adding mechanisms that explicitly model and augment implicit non-causal factors improves domain-invariant representation and generalization in open-world object detection.
Abstract: Open world object detection faces a significant challenge in domain-invariant representation, i.e., implicit non-causal factors. Most domain generalization (DG) methods based on domain adversarial learning (DAL) pay much attention to learning domain-invariant information, but often overlook the potential non-causal factors. We unveil two critical causes: 1) The domain discriminator-based DAL method is subject to extremely sparse domain labels, i.e., only one domain label is assigned to each dataset, and thus can only associate explicit non-causal factors, which is highly limiting. 2) The non-causal factors, induced by unidentified data bias, are excessively implicit and cannot be discerned by the conventional DAL paradigm alone. Based on these key findings, inspired by the Granular-Ball perspective, we propose an improved DAL method, i.e., GB-DAL. The proposed GB-DAL utilizes a Prototype-based Granular Ball Splitting (PGBS) module to generate denser domains from limited datasets, akin to more fine-grained granular balls, indicating more potential non-causal factors. Inspired by adversarial perturbations akin to non-causal factors, we propose a Simulated Non-causal Factors (SNF) module as a means of data augmentation to reduce the implicitness of non-causal factors, and facilitate the training of GB-DAL. Comparative experiments on numerous benchmarks demonstrate that our method achieves better generalization performance in novel circumstances.
[66] Resolving Primitive-Sharing Ambiguity in Long-Tailed Industrial Point Cloud Segmentation via Spatial Context Constraints
Chao Yin,Qing Han,Zhiwei Hou,Yue Liu,Anjin Dai,Hongda Hu,Ji Yang,Wei Yao
Main category: cs.CV
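TL;DR: This paper proposes improved variants of the Class-Balanced Loss that add spatial context constraints (Boundary-CB and Density-CB). In industrial point cloud segmentation they mitigate the twin challenges of extreme class imbalance (215:1) and geometric ambiguity (for example, valves and reducers sharing cylindrical local structure with pipes), substantially improving recognition of safety-critical components without degrading head-class performance.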
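For readers unfamiliar with the Class-Balanced Loss that this entry extends, the sketch below computes the standard effective-number class weights, (1 - beta) / (1 - beta^n), and then scales a per-point cross-entropy by the entropy of the predictions in the point's neighbourhood. The class weights follow the published Class-Balanced Loss formula; the entropy scaling, the `lam` coefficient, and the name `boundary_cb_loss` are hypothetical stand-ins for the Boundary-CB idea, whose exact form the abstract does not give.

```python
import math


def class_balanced_weights(counts, beta=0.999):
    """Effective-number weights, normalized so they sum to the number of classes."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def boundary_cb_loss(probs, label, neighbour_probs, weights, lam=1.0):
    """Class-balanced cross-entropy, up-weighted where the neighbourhood is uncertain.

    neighbour_probs is the mean predicted class distribution around the point;
    lam is a hypothetical strength for the boundary emphasis.
    """
    cross_entropy = -math.log(probs[label])
    # Normalize entropy to [0, 1] by the maximum possible value, log(num_classes).
    uncertainty = entropy(neighbour_probs) / math.log(len(neighbour_probs))
    return weights[label] * (1.0 + lam * uncertainty) * cross_entropy
```

With counts of 215,000 and 1,000 points, the rarer class receives the larger weight, and a point whose neighbourhood is split evenly between classes incurs a larger loss than one deep inside a confidently predicted region.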
Details
Motivation: In industrial point cloud segmentation, safety-critical components (such as reducers and valves) are systematically misclassified because they are extremely rare in the data and share local geometry with dominant structures such as pipes; existing frequency-based re-weighting methods cannot resolve geometric ambiguity.
Method: Extends the Class-Balanced Loss framework with two plug-and-play spatial context constraints: (1) Boundary-CB, an entropy-based mechanism that emphasizes ambiguous boundary regions; (2) Density-CB, a density-based constraint that compensates for variations in scan density. Neither requires network changes, only replacing the loss function.
Result: On the Industrial3D dataset (610 million points), the method reaches 55.74% mIoU with a 21.7% relative improvement on tail classes (29.59% vs. the 24.32% baseline); reducer IoU rises from 0% to 21.12%, valve improves by 24.3% relative, and head-class accuracy holds at 88.14%.
Conclusion: The method resolves geometric ambiguity without sacrificing head-class performance, removing a bottleneck in reliably identifying safety-critical components for industrial digital twins and supporting automated knowledge extraction.
Abstract: Industrial point cloud segmentation for Digital Twin construction faces a persistent challenge: safety-critical components such as reducers and valves are systematically misclassified. These failures stem from two compounding factors: such components are rare in training data, yet they share identical local geometry with dominant structures like pipes. This work identifies a dual crisis unique to industrial 3D data: extreme class imbalance (a 215:1 ratio) compounded by geometric ambiguity, where most tail classes share cylindrical primitives with head classes. Existing frequency-based re-weighting methods address statistical imbalance but cannot resolve geometric ambiguity. We propose spatial context constraints that leverage neighborhood prediction consistency to disambiguate locally similar structures. Our approach extends the Class-Balanced (CB) Loss framework with two architecture-agnostic mechanisms: (1) Boundary-CB, an entropy-based constraint that emphasizes ambiguous boundaries, and (2) Density-CB, a density-based constraint that compensates for scan-dependent variations. Both integrate as plug-and-play modules without network modifications, requiring only loss function replacement. On the Industrial3D dataset (610M points from water treatment facilities), our method achieves 55.74% mIoU with 21.7% relative improvement on tail-class performance (29.59% vs. 24.32% baseline) while preserving head-class accuracy (88.14%). Components with primitive-sharing ambiguity show dramatic gains: reducer improves from 0% to 21.12% IoU; valve improves by 24.3% relative. This resolves geometric ambiguity without the typical head-tail trade-off, enabling reliable identification of safety-critical components for automated knowledge extraction in Digital Twin applications.
[67] CLIP-Guided Unsupervised Semantic-Aware Exposure Correction
Puzhen Wu,Han Weng,Quan Zheng,Yi Zhan,Hewei Wang,Yiming Li,Jiahui Han,Rui Xu
Main category: cs.CV
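TL;DR: This paper proposes an unsupervised semantic-aware exposure correction network that fuses semantic information extracted by FastSAM with image features and combines a CLIP-guided pseudo-ground-truth generator with a semantic-prompt consistency loss, addressing exposure correction for unlabeled real-world images.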
Details
Motivation: Existing methods ignore object-level regional semantic information, which causes color distortion, and real-world exposure images lack ground-truth labels, which are costly to annotate manually.
Method: Proposes an adaptive semantic-aware fusion module that injects semantic information extracted by FastSAM into a shared image feature space; designs a multi-scale residual spatial Mamba group for detail restoration and exposure adjustment; builds a pseudo-ground-truth generator by fine-tuning CLIP and adds a semantic-prompt consistency loss for unsupervised training.
Result: On real-world exposure correction, the method outperforms current state-of-the-art unsupervised methods in both numerical metrics and visual quality.
Conclusion: Combining semantic guidance with multimodal priors (FastSAM + CLIP) markedly improves the performance and robustness of unsupervised exposure correction.
Abstract: Improper exposure often leads to severe loss of details, color distortion, and reduced contrast. Exposure correction still faces two critical challenges: (1) ignoring object-wise regional semantic information causes color shift artifacts; (2) real-world exposure images generally have no ground-truth labels, and labeling them entails massive manual editing. To tackle these challenges, we propose a new unsupervised semantic-aware exposure correction network. It contains an adaptive semantic-aware fusion module, which effectively fuses the semantic information extracted from a pre-trained Fast Segment Anything Model into a shared image feature space. Then the fused features are used by our multi-scale residual spatial mamba group to restore the details and adjust the exposure. To avoid manual editing, we propose a pseudo-ground truth generator guided by CLIP, which is fine-tuned to automatically identify exposure situations and instruct the tailored corrections. Also, we leverage the rich priors from the FastSAM and CLIP to develop a semantic-prompt consistency loss to enforce semantic consistency and image-prompt alignment for unsupervised training. Comprehensive experimental results show that our method corrects real-world exposure images effectively and outperforms state-of-the-art unsupervised methods both numerically and visually.
[68] QA-ReID: Quality-Aware Query-Adaptive Convolution Leveraging Fused Global and Structural Cues for Clothes-Changing ReID
Yuxiang Wang,Kunming Jiang,Tianxiang Zhang,Ke Tian,Gaozhe Jiang
Main category: cs.CV
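TL;DR: This paper proposes QA-ReID, which models RGB and parsing features in two branches and matches them with quality-aware adaptive convolution, substantially improving person re-identification under clothing changes.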
Details
Motivation: Conventional person re-identification degrades sharply when clothing changes, and CC-ReID must cope with drastic appearance variation.
Method: Proposes the Quality-Aware Dual-Branch Matching framework (QA-ReID), which fuses RGB features with human parsing features, and designs the Quality-Aware Query Adaptive Convolution (QAConv-QA) with pixel-level weighting and bidirectional consistency constraints.
Result: Reaches SOTA performance on multiple benchmarks, including PRCC, LTCC, and VC-Clothes, and clearly outperforms existing methods in cross-clothing scenarios.
Conclusion: Heterogeneous feature fusion and quality-aware matching improve the robustness and accuracy of person re-identification under clothing changes.
Abstract: Unlike conventional person re-identification (ReID), clothes-changing ReID (CC-ReID) presents severe challenges due to substantial appearance variations introduced by clothing changes. In this work, we propose the Quality-Aware Dual-Branch Matching (QA-ReID), which jointly leverages RGB-based features and parsing-based representations to model both global appearance and clothing-invariant structural cues. These heterogeneous features are adaptively fused through a multi-modal attention module. At the matching stage, we further design the Quality-Aware Query Adaptive Convolution (QAConv-QA), which incorporates pixel-level importance weighting and bidirectional consistency constraints to enhance robustness against clothing variations. Extensive experiments demonstrate that QA-ReID achieves state-of-the-art performance on multiple benchmarks, including PRCC, LTCC, and VC-Clothes, and significantly outperforms existing approaches under cross-clothing scenarios.
[69] TFFM: Topology-Aware Feature Fusion Module via Latent Graph Reasoning for Retinal Vessel Segmentation
Iftekhar Ahmed,Shakib Absar,Aftar Ahmad Sami,Shadman Sakib,Debojyoti Biswas,Seraj Al Mahmud Mostafa
Main category: cs.CV
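TL;DR: This paper proposes a topology-aware framework for retinal artery-vein segmentation that uses Graph Attention Networks and a hybrid loss (Tversky + soft clDice) to markedly improve vessel connectivity and reduce fragmentation, reaching SOTA performance on the Fundus-AVSeg dataset; the code is open source.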
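The Tversky half of the hybrid loss in this entry has a simple closed form: one minus TP / (TP + alpha * FP + beta * FN), computed on soft predictions. The sketch below implements that standard definition on flat lists of probabilities and binary labels. The default `alpha` and `beta` values are assumptions chosen for illustration (the abstract does not report the authors' settings), and the soft clDice term is omitted because it needs a differentiable skeletonization step.

```python
def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-7):
    """Soft Tversky loss for one class.

    pred holds probabilities in [0, 1]; target holds 0/1 labels.
    alpha weights false positives and beta weights false negatives.
    """
    true_pos = sum(p * t for p, t in zip(pred, target))
    false_pos = sum(p * (1 - t) for p, t in zip(pred, target))
    false_neg = sum((1 - p) * t for p, t in zip(pred, target))
    index = (true_pos + eps) / (true_pos + alpha * false_pos + beta * false_neg + eps)
    return 1.0 - index
```

Setting `alpha = beta = 0.5` recovers the Dice loss, while `beta > alpha` penalizes a missed vessel pixel more than a spurious one, which is why the loss suits thin, sparse structures.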
Details
Motivation: Standard convolutional networks reach high pixel-level accuracy but often produce topological breaks (gaps and discontinuities) that hinder graph-based clinical analysis; vessel connectivity must be preserved to support automated biomarker quantification.
Method: Introduces a Topological Feature Fusion Module (TFFM) that maps local features into a latent graph space and uses Graph Attention Networks to model global structural dependencies; uses Tversky loss to handle class imbalance and soft clDice loss to explicitly penalize topological disconnects.
Result: On the Fundus-AVSeg dataset, the method achieves a combined Dice score of 90.97% and a 95% Hausdorff Distance of 3.50 pixels; vessel fragmentation drops by about 38% relative to baselines, yielding topologically coherent vascular trees.
Conclusion: The topology-aware framework improves the structural plausibility and clinical usability of segmentation results, providing a more reliable image-analysis basis for automated diagnosis of cardiovascular disease.
Abstract: Precise segmentation of retinal arteries and veins supports the diagnosis of systemic cardiovascular conditions. However, standard convolutional architectures often yield topologically disjointed segmentations, characterized by gaps and discontinuities that render reliable graph-based clinical analysis impossible despite high pixel-level accuracy. To address this, we introduce a topology-aware framework engineered to maintain vascular connectivity. Our architecture integrates a Topological Feature Fusion Module (TFFM) that maps local feature representations into a latent graph space, deploying Graph Attention Networks to capture global structural dependencies often missed by fixed receptive fields. Furthermore, we drive the learning process with a hybrid objective function, coupling Tversky loss for class imbalance with soft clDice loss to explicitly penalize topological disconnects. Evaluation on the Fundus-AVSeg dataset reveals state-of-the-art performance, achieving a combined Dice score of 90.97% and a 95% Hausdorff Distance of 3.50 pixels. Notably, our method decreases vessel fragmentation by approximately 38% relative to baselines, yielding topologically coherent vascular trees viable for automated biomarker quantification. We open-source our code at https://tffm-module.github.io/.
[70] GTFMN: Guided Texture and Feature Modulation Network for Low-Light Image Enhancement and Super-Resolution
Yongsong Huang,Tzu-Hsuan Peng,Tomo Miyazaki,Xiaofeng Liu,Chun-Ting Chou,Ai-Chun Pang,Shinichiro Omachi
Main category: cs.CV
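TL;DR: This paper proposes GTFMN, a network that tackles low-light image super-resolution by decoupling it into illumination estimation and texture restoration; an illumination map dynamically modulates texture features for spatially adaptive enhancement, and the method achieves the best performance on multiple datasets.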
Details
Motivation: Low-light image super-resolution (LLSR) is challenging because of the coupled degradation of low resolution and poor illumination.
Method: Proposes the Guided Texture and Feature Modulation Network (GTFMN), in which a dedicated Illumination Stream predicts a spatially varying illumination map that the Illumination Guided Modulation Block (IGM Block) uses to dynamically modulate Texture Stream features for spatially adaptive restoration.
Result: On the OmniNormal5 and OmniNormal15 datasets, GTFMN outperforms existing methods in both quantitative metrics and visual quality, achieving the best performance.
Conclusion: By decoupling illumination estimation from texture restoration and introducing illumination-map-guided feature modulation, GTFMN effectively improves low-light image super-resolution.
Abstract: Low-light image super-resolution (LLSR) is a challenging task due to the coupled degradation of low resolution and poor illumination. To address this, we propose the Guided Texture and Feature Modulation Network (GTFMN), a novel framework that decouples the LLSR task into two sub-problems: illumination estimation and texture restoration. First, our network employs a dedicated Illumination Stream whose purpose is to predict a spatially varying illumination map that accurately captures lighting distribution. Further, this map is utilized as an explicit guide within our novel Illumination Guided Modulation Block (IGM Block) to dynamically modulate features in the Texture Stream. This mechanism achieves spatially adaptive restoration, enabling the network to intensify enhancement in poorly lit regions while preserving details in well-exposed areas. Extensive experiments demonstrate that GTFMN achieves the best performance among competing methods on the OmniNormal5 and OmniNormal15 datasets, outperforming them in both quantitative metrics and visual quality.
[71] SNR-Edit: Structure-Aware Noise Rectification for Inversion-Free Flow-Based Editing
Lifan Jiang,Boxi Wu,Yuhang Pei,Tianrun Wu,Yongyuan Chen,Yan Zhao,Shiyu Yu,Deng Cai
Main category: cs.CV
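TL;DR: SNR-Edit is a training-free, inversion-free image editing method for flow-based generative models; it applies adaptive, structure-aware noise rectification (SNR) to correct trajectories in latent space, markedly improving editing fidelity and structural consistency.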
Details
Motivation: Existing inversion-free, flow-based image editing methods rely on fixed Gaussian noise to construct the source trajectory, which biases the trajectory dynamics and causes structural degradation or quality loss.
Method: Proposes SNR-Edit, which uses structure-aware noise rectification to inject segmentation constraints into the initial noise, anchoring the stochastic component of the source trajectory to the real image's implicit inversion position and thereby suppressing trajectory drift during source-target transport.
Result: On SD3 and FLUX, in evaluations on PIE-Bench and SNR-Bench, SNR-Edit performs strongly on pixel-level metrics and VLM-based scoring while adding only about 1 second of overhead per image.
Conclusion: SNR-Edit achieves high-fidelity, structure-preserving latent trajectory correction in a lightweight, training-free, inversion-free manner, offering a new paradigm for inversion-free editing.
Abstract: Inversion-free image editing using flow-based generative models challenges the prevailing inversion-based pipelines. However, existing approaches rely on fixed Gaussian noise to construct the source trajectory, leading to biased trajectory dynamics and causing structural degradation or quality loss. To address this, we introduce SNR-Edit, a training-free framework achieving faithful Latent Trajectory Correction via adaptive noise control. Mechanistically, SNR-Edit uses structure-aware noise rectification to inject segmentation constraints into the initial noise, anchoring the stochastic component of the source trajectory to the real image's implicit inversion position and reducing trajectory drift during source-target transport. This lightweight modification yields smoother latent trajectories and ensures high-fidelity structural preservation without requiring model tuning or inversion. Across SD3 and FLUX, evaluations on PIE-Bench and SNR-Bench show that SNR-Edit delivers strong performance on pixel-level metrics and VLM-based scoring, while adding only about 1s overhead per image.
[72] Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP
Sen Nie,Jie Zhang,Zhuo Wang,Shiguang Shan,Xilin Chen
Main category: cs.CV
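TL;DR: This paper proposes Contrastive Spectral Rectification (CSR), an efficient test-time defense that exploits the inherent spectral bias of vision-language models (VLMs) in the frequency domain; it optimizes a rectification perturbation under a spectral-guided contrastive objective, substantially improving adversarial robustness with low overhead and cross-task generalization.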
Details
Motivation: VLMs such as CLIP generalize zero-shot but are vulnerable to adversarial examples; current test-time defenses offer insufficient robustness, incur high inference latency, and apply only to specific tasks.
Method: Building on the observation that adversarial examples show feature inconsistency under progressive frequency attenuation, and on its link to the model's spectral bias, CSR adaptively optimizes a rectification perturbation at test time under a spectral-guided contrastive loss, realigning the input with the natural manifold.
Result: Across 16 classification benchmarks, CSR outperforms the SOTA by an average of 18.1% under strong AutoAttack, with small inference overhead, and applies to a variety of visual tasks.
Conclusion: CSR is an efficient, general, and robust test-time defense that reveals and exploits the spectral bias of VLMs, offering a new approach to improving VLM robustness.
Abstract: Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial examples (AEs). While test-time defenses are promising, existing methods fail to provide sufficient robustness against strong attacks and are often hampered by high inference latency and task-specific applicability. To address these limitations, we start by investigating the intrinsic properties of AEs, which reveals that AEs exhibit severe feature inconsistency under progressive frequency attenuation. We further attribute this to the model's inherent spectral bias. Leveraging this insight, we propose an efficient test-time defense named Contrastive Spectral Rectification (CSR). CSR optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective, which is applied input-adaptively. Extensive experiments across 16 classification benchmarks demonstrate that CSR outperforms the SOTA by an average of 18.1% against strong AutoAttack with modest inference overhead. Furthermore, CSR exhibits broad applicability across diverse visual tasks. Code is available at https://github.com/Summu77/CSR.
[73] UniPCB: A Unified Vision-Language Benchmark for Open-Ended PCB Quality Inspection
Fuxiang Sun,Xi Jiang,Jiansheng Wu,Haigang Zhang,Feng Zheng,Jinfeng Yang
Main category: cs.CV
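TL;DR: This paper presents UniPCB, the first unified vision-language benchmark for printed circuit board (PCB) quality inspection, and PCB-GPT, a dedicated multimodal large language model; through systematic data construction and a progressive curriculum learning strategy, PCB-GPT substantially outperforms existing MLLMs on tasks such as fine-grained defect localization.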
Details
Motivation: Existing multimodal large language models (MLLMs) fall short in complex industrial inspection such as PCB inspection, and no high-quality, unified vision-language benchmark exists, mainly because data are scarce, sources are fragmented, and standards are inconsistent.
Method: Builds UniPCB, the first unified vision-language benchmark, covering three annotated scenarios; generates instruction data with the same pipeline and trains a dedicated MLLM, PCB-GPT, with a progressive curriculum learning strategy.
Result: PCB-GPT substantially outperforms existing MLLMs on the UniPCB benchmark, more than doubling performance on fine-grained defect localization, with significant advantages in localization and analysis.
Conclusion: UniPCB and PCB-GPT fill the gap in unified evaluation and dedicated models for PCB inspection, offering a new paradigm and open resources for multimodal industrial quality inspection.
Abstract: Multimodal Large Language Models (MLLMs) show promise for general industrial quality inspection, but fall short in complex scenarios, such as Printed Circuit Board (PCB) inspection. PCB inspection poses unique challenges due to densely packed components, complex wiring structures, and subtle defect patterns that require specialized domain expertise. However, a high-quality, unified vision-language benchmark for quantitatively evaluating MLLMs across PCB inspection tasks remains absent, stemming not only from limited data availability but also from fragmented datasets and inconsistent standardization. To fill this gap, we propose UniPCB, the first unified vision-language benchmark for open-ended PCB quality inspection. UniPCB is built via a systematic pipeline that curates and standardizes data from disparate sources across three annotated scenarios. Furthermore, we introduce PCB-GPT, an MLLM trained on a new instruction dataset generated by this pipeline, utilizing a novel progressive curriculum that mimics the learning process of human experts. Evaluations on the UniPCB benchmark show that while existing MLLMs falter on domain-specific tasks, PCB-GPT establishes a new baseline. Notably, it more than doubles the performance on fine-grained defect localization compared to the strongest competitors, with significant advantages in localization and analysis. We will release the instruction data, benchmark, and model to facilitate future research.
[74] Towards Pixel-Level VLM Perception via Simple Points Prediction
Tianhui Song,Haoyu Lu,Hao Yang,Lin Sui,Haoning Wu,Zaida Zhou,Zhiqi Huang,Yiping Bao,Y. Charles,Xinyu Zhou,Limin Wang
Main category: cs.CV
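TL;DR: SimpleSeg is a simple yet effective end-to-end approach that gives multimodal large language models (MLLMs) native pixel-level segmentation ability by casting segmentation as generation of a sequence of textual coordinate points; combined with two-stage SF→RL training, it matches or surpasses complex specialized models on multiple benchmarks.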
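To make the idea of scoring a textual point sequence with an IoU-based reward more concrete, here is a small standard-library sketch. Everything in it is an assumption for illustration: the `"(x,y),(x,y),..."` text format, the helper names, and the pixel-grid rasterization are hypothetical and are not taken from SimpleSeg.

```python
import re


def parse_points(text):
    """Parse a string such as "(1,1),(5,1),(5,5)" into a list of (x, y) tuples."""
    nums = [float(v) for v in re.findall(r"-?\d+(?:\.\d+)?", text)]
    return list(zip(nums[0::2], nums[1::2]))


def point_in_polygon(x, y, poly):
    """Even-odd ray casting: count polygon edges crossed by a ray going right."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(poly, width, height):
    """Return the set of pixels whose centers fall inside the polygon."""
    return {(c, r) for r in range(height) for c in range(width)
            if point_in_polygon(c + 0.5, r + 0.5, poly)}


def iou_reward(pred_text, gt_poly, width, height):
    """IoU between a predicted textual polygon and a ground-truth polygon."""
    pred = parse_points(pred_text)
    if len(pred) < 3:  # fewer than three points cannot enclose an area
        return 0.0
    a = rasterize(pred, width, height)
    b = rasterize(gt_poly, width, height)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


if __name__ == "__main__":
    gt = [(0, 0), (4, 0), (4, 4), (0, 4)]
    print(iou_reward("(2,0),(6,0),(6,4),(2,4)", gt, 8, 8))
```

A reward of this kind depends only on the decoded text, which is why it can be attached to a language model without changing its architecture; a real pipeline would presumably rasterize at image resolution with an optimized routine.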
Details
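Motivation: Existing methods often depend on complex, task-specific architectures or auxiliary modules for pixel-level perception; the authors test whether a standard MLLM has untapped low-level perception ability and explore a more unified, simpler design path for vision-language models.
Method: Recasts image segmentation as point-sequence generation in pure text space (the model outputs boundary point coordinates) and proposes a two-stage training pipeline: supervised fine-tuning (SF) followed by reinforcement learning (RL) with IoU as the reward signal to refine the point sequences. The base MLLM architecture is left unchanged.
Result: On multiple segmentation benchmarks, performance is comparable to or better than complex specialized methods, confirming that a standard MLLM with simple point prediction can achieve high-precision spatial understanding.
Conclusion: Standard MLLMs have strong inherent potential for low-level perception without extra architectural design; precise spatial understanding can be achieved through minimal point-sequence prediction, which challenges the current reliance on auxiliary modules and specialized structures and moves VLMs toward more unified, general designs.
Abstract: We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points (textual coordinates) delineating object boundaries, entirely within its language space. To achieve high fidelity, we introduce a two-stage SF$\to$RL training pipeline, where Reinforcement Learning with an IoU-based reward refines the point sequences to accurately match ground-truth contours. We find that the standard MLLM architecture possesses a strong, inherent capacity for low-level perception that can be unlocked without any specialized architecture. On segmentation benchmarks, SimpleSeg achieves performance that is comparable to, and often surpasses, methods relying on complex, task-specific designs. This work shows that precise spatial understanding can emerge from simple point prediction, challenging the prevailing need for auxiliary components and paving the way for more unified and capable VLMs. Homepage: https://simpleseg.github.io/
[75] VC-Bench: Pioneering the Video Connecting Benchmark with a Dataset and Evaluation Metrics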
Zhiyu Yin,Zhipeng Liu,Kehai Chen,Lemao Liu,Jin Liu,Hong-Dong Li,Yang Xiang,Min Zhang
Main category: cs.CV
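TL;DR: This paper introduces Video Connecting, a new task that generates smooth transitional content between given start and end clips, and builds VC-Bench, the first dedicated benchmark, with 1,579 high-quality videos and three core metrics (VQS, SECS, TSS); it reveals marked shortcomings of existing video generation models in consistency and transition smoothness.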
Details
Motivation: Practical applications such as video editing and vlogging need to connect separate clips seamlessly, but current video generation research focuses on text or image conditions and lacks a standardized benchmark, which holds back the video connecting task.
Method: Introduces the Video Connecting task; builds the VC-Bench benchmark of 1,579 high-quality videos across 15 main categories and 72 subcategories; designs three metrics: Video Quality Score (VQS), Start-End Consistency Score (SECS), and Transition Smoothness Score (TSS).
Result: Evaluating multiple SOTA video generation models on VC-Bench reveals significant deficiencies in start-end consistency and transition smoothness, leading to lower overall coherence and fluidity.
Conclusion: VC-Bench is the first comprehensive benchmark for video connecting and is expected to advance research in this direction; the data and metrics are publicly released.
Abstract: While current video generation focuses on text or image conditions, practical applications like video editing and vlogging often need to seamlessly connect separate clips. In our work, we introduce Video Connecting, an innovative task that aims to generate smooth intermediate video content between given start and end clips. However, the absence of standardized evaluation benchmarks has hindered the development of this task. To bridge this gap, we propose VC-Bench, a novel benchmark specifically designed for video connecting. It includes 1,579 high-quality videos collected from public platforms, covering 15 main categories and 72 subcategories to ensure diversity and structure. VC-Bench focuses on three core aspects: Video Quality Score (VQS), Start-End Consistency Score (SECS), and Transition Smoothness Score (TSS). Together, they form a comprehensive framework that moves beyond conventional quality-only metrics. We evaluated multiple state-of-the-art video generation models on VC-Bench. Experimental results reveal significant limitations in maintaining start-end consistency and transition smoothness, leading to lower overall coherence and fluidity. We expect that VC-Bench will serve as a pioneering benchmark to inspire and guide future research in video connecting. The evaluation metrics and dataset are publicly available at: https://anonymous.4open.science/r/VC-Bench-1B67/.
[76] TIGaussian: Disentangle Gaussians for Spatial-Awared Text-Image-3D Alignment
Jiarun Liu,Qifeng Chen,Yiru Zhao,Minghua Liu,Baorui Ma,Sheng Yang
Main category: cs.CV
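TL;DR: This paper proposes TIGaussian, a framework that exploits the characteristics of 3D Gaussian Splatting (3DGS) and strengthens vision-language-3D cross-modal alignment through a multi-branch 3DGS tokenizer and modality-specific 3D feature alignment strategies, reaching state-of-the-art performance on tasks such as cross-modal retrieval and zero-shot classification.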
Details
Motivation: Existing vision-language models struggle to extract 3D modal features and to bridge the semantic gap between modalities; a cross-modal pretraining method is needed that incorporates 3D data (such as point clouds and 3D Gaussians) to support 3D-related downstream tasks.
Method: Proposes TIGaussian: (1) a multi-branch 3DGS tokenizer that decouples the intrinsic properties of 3DGS into compact latent representations; (2) a bidirectional cross-modal alignment strategy in which image-3D alignment uses a diffusion-prior-based multi-view feature fusion mechanism to resolve perspective ambiguity, and text-3D alignment uses an adaptive text-3D projection module that maps 3D features into the text embedding space.
Result: Extensive experiments on multiple datasets show that TIGaussian achieves state-of-the-art performance in cross-modal retrieval, zero-shot classification, and scene recognition.
Conclusion: TIGaussian improves cross-modal alignment between 3D data and images and text, demonstrating the feasibility and advantage of exploiting the structure of 3D Gaussian Splatting for joint multimodal modeling and offering a new paradigm for 3D perception and understanding.
Abstract: While visual-language models have profoundly linked features between texts and images, the incorporation of 3D modality data, such as point clouds and 3D Gaussians, further enables pretraining for 3D-related tasks, e.g., cross-modal retrieval, zero-shot classification, and scene recognition. As challenges remain in extracting 3D modal features and bridging the gap between different modalities, we propose TIGaussian, a framework that harnesses 3D Gaussian Splatting (3DGS) characteristics to strengthen cross-modality alignment through a multi-branch 3DGS tokenizer and modality-specific 3D feature alignment strategies. Specifically, our multi-branch 3DGS tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations, enabling more generalizable feature extraction. To further bridge the modality gap, we develop a bidirectional cross-modal alignment strategy: a multi-view feature fusion mechanism leverages diffusion priors to resolve perspective ambiguity in image-3D alignment, while a text-3D projection module adaptively maps 3D features to the text embedding space for better text-3D alignment. Extensive experiments on various datasets demonstrate the state-of-the-art performance of TIGaussian in multiple tasks.
[77] Handcrafted Feature Fusion for Reliable Detection of AI-Generated Images
Syed Mehedi Hasan Nirob,Moqsadur Rahman,Shamim Ehsan,Summit Haque
Main category: cs.CV
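TL;DR: This paper systematically evaluates handcrafted features (such as DCT, HOG, and LBP) for detecting synthetic images on the CIFAKE dataset and finds that LightGBM with mixed features performs best, reaching a PR-AUC of 0.9879; this confirms the continued value of handcrafted features and ensemble learning where interpretability and efficiency matter.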
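Two of the metrics reported for this paper, the Brier score (calibration) and F1 (discrimination at a threshold), are simple enough to sketch directly. The function names, the 0.5 threshold, and the toy inputs below are assumptions for illustration and are unrelated to the paper's CIFAKE results.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted probability and the 0/1 label.

    Lower is better; 0.0 means perfectly confident and correct predictions.
    """
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def f1_score(probs, labels, threshold=0.5):
    """F1 of the positive class after thresholding the probabilities."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:  # no true positives: precision or recall is zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    probs = [0.9, 0.2, 0.6, 0.4]   # toy "probability of synthetic" scores
    labels = [1, 0, 0, 1]          # toy ground truth (1 = synthetic)
    print(brier_score(probs, labels), f1_score(probs, labels))
```

Reporting both is informative because a classifier can rank images well, and so score a high F1 or AUC, while still producing poorly calibrated probabilities.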
Details
Motivation: Rapid progress in generative models has flooded digital media with highly realistic synthetic images, so reliable detection is urgently needed; handcrafted features remain attractive for their interpretability, efficiency, and generalizability.
Method: On the CIFAKE dataset (50,000 training and 10,000 test samples), systematically evaluates seven handcrafted features (raw pixels, color histograms, DCT, HOG, LBP, GLCM, wavelets) with seven classifiers (from Logistic Regression to LightGBM, XGBoost, and CatBoost) and compares three feature configurations (baseline, advanced, mixed).
Result: LightGBM with mixed features performs best: PR-AUC 0.9879, ROC-AUC 0.9878, F1 0.9447, Brier score 0.0414; performance improves monotonically as the feature combination grows richer.
Conclusion: Carefully designed handcrafted features combined with ensemble learning (especially LightGBM) remain highly competitive for synthetic image detection, particularly where interpretability and computational efficiency are priorities.
Abstract: The rapid progress of generative models has enabled the creation of highly realistic synthetic images, raising concerns about authenticity and trust in digital media. Detecting such fake content reliably is an urgent challenge. While deep learning approaches dominate current literature, handcrafted features remain attractive for their interpretability, efficiency, and generalizability. In this paper, we conduct a systematic evaluation of handcrafted descriptors, including raw pixels, color histograms, Discrete Cosine Transform (DCT), Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gray-Level Co-occurrence Matrix (GLCM), and wavelet features, on the CIFAKE dataset of real versus synthetic images. Using 50,000 training and 10,000 test samples, we benchmark seven classifiers ranging from Logistic Regression to advanced gradient-boosted ensembles (LightGBM, XGBoost, CatBoost). Results demonstrate that LightGBM consistently outperforms alternatives, achieving PR-AUC 0.9879, ROC-AUC 0.9878, F1 0.9447, and a Brier score of 0.0414 with mixed features, representing strong gains in calibration and discrimination over simpler descriptors. Across three configurations (baseline, advanced, mixed), performance improves monotonically, confirming that combining diverse handcrafted features yields substantial benefit.
These findings highlight the continued relevance of carefully engineered features and ensemble learning for detecting synthetic images, particularly in contexts where interpretability and computational efficiency are critical.
[78] A Multi-View Consistency Framework with Semi-Supervised Domain Adaptation
Yuting Hong,Li Dong,Xiaojie Qiu,Hui Xiao,Baochen Yao,Siming Zheng,Chengbin Peng
Main category: cs.CV
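TL;DR: This paper proposes a multi-view consistency framework for semi-supervised domain adaptation (SSDA) that improves target-domain classification by debiasing prediction probabilities, using pseudo-negative labels, and learning cross-domain affinity.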
Details
Motivation: Because labeled target-domain samples are limited, the model tends to make class-biased predictions in feature space, even when the source data are balanced.
Method: Builds a two-view consistency training framework with (1) a class-wise debiasing strategy based on the model's prediction performance, (2) pseudo-negative labels derived from model predictions, and (3) cross-domain affinity learning to align same-class features across domains.
Result: On two standard domain adaptation datasets, DomainNet and Office-Home, the method outperforms competing methods.
Conclusion: Combining unsupervised domain adaptation with semi-supervised learning can markedly improve model generalization, reduce annotation costs, and facilitate industrial deployment.
Abstract: Semi-Supervised Domain Adaptation (SSDA) leverages knowledge from a fully labeled source domain to classify data in a partially labeled target domain. Due to the limited number of labeled samples in the target domain, there can be intrinsic similarity of classes in the feature space, which may result in biased predictions, even when the model is trained on a balanced dataset. To overcome this limitation, we introduce a multi-view consistency framework, which includes two views for training strongly augmented data. One is a debiasing strategy for correcting class-wise prediction probabilities according to the prediction performance of the model. The other involves leveraging pseudo-negative labels derived from the model predictions. Furthermore, we introduce a cross-domain affinity learning aimed at aligning features of the same class across different domains, thereby enhancing overall performance. Experimental results demonstrate that our method outperforms the competing methods on two standard domain adaptation datasets, DomainNet and Office-Home. Combining unsupervised domain adaptation and semi-supervised learning offers indispensable contributions to the industrial sector by enhancing model adaptability, reducing annotation costs, and improving performance.
[79] ProMist-5K: A Comprehensive Dataset for Digital Emulation of Cinematic Pro-Mist Filter Effects
Yingtie Lei,Zimeng Li,Chi-Man Pun,Wangyu Wu,Junke Yang,Xuhang Chen
Main category: cs.CV
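TL;DR: This paper introduces ProMist-5K, a dataset supporting digital emulation of cinematic styles, specifically Pro-Mist filter effects; built with a physically inspired pipeline, it contains 20,000 high-resolution image pairs covering different filter densities and focal lengths, with an emphasis on realistic modeling of glow and highlight diffusion.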
Details
Motivation: Pro-Mist filters are widely used in cinematography, but their complex optical diffusion is hard to reproduce digitally; existing general style datasets do not specifically model real optical effects such as soft halation and reduced contrast.
Method: Builds a physically inspired image generation pipeline in a scene-referred linear space and produces ProMist-5K, with 20,000 high-resolution image pairs covering two filter densities (1/2 and 1/8) and two focal lengths (20mm and 50mm); multiple blur layers with tuned weights model the varying intensity and spread of optical diffusion.
Result: ProMist-5K supports a range of image translation models and learning paradigms, works well across training settings, and captures cinematic appearances from subtle to strong.
Conclusion: ProMist-5K is a practical, physically grounded dataset that bridges the gap between the flexibility of digital image processing and traditional lens aesthetics, providing a controllable and consistent target domain for cinematic style transfer.
Abstract: Pro-Mist filters are widely used in cinematography for their ability to create soft halation, lower contrast, and produce a distinctive, atmospheric style. These effects are difficult to reproduce digitally due to the complex behavior of light diffusion. We present ProMist-5K, a dataset designed to support cinematic style emulation. It is built using a physically inspired pipeline in a scene-referred linear space and includes 20,000 high-resolution image pairs across four configurations, covering two filter densities (1/2 and 1/8) and two focal lengths (20mm and 50mm). Unlike general style datasets, ProMist-5K focuses on realistic glow and highlight diffusion effects. Multiple blur layers and carefully tuned weighting are used to model the varying intensity and spread of optical diffusion. The dataset provides a consistent and controllable target domain that supports various image translation models and learning paradigms. Experiments show that the dataset works well across different training settings and helps capture both subtle and strong cinematic appearances. ProMist-5K offers a practical and physically grounded resource for film-inspired image transformation, bridging the gap between digital flexibility and traditional lens aesthetics. The dataset is available at https://www.kaggle.com/datasets/yingtielei/promist5k.
[80] Beyond Shadows: A Large-Scale Benchmark and Multi-Stage Framework for High-Fidelity Facial Shadow Removal
Tailong Luo,Jiesong Bai,Jinyang Huang,Junyu Xia,Wangyu Wu,Xuhang Chen
Main category: cs.CV
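TL;DR: This paper presents ASFW, the first large-scale real-world dataset for facial shadow removal, and the Face Shadow Eraser (FSE) method, which substantially improve shadow removal and texture preservation under complex lighting.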
Details
Motivation: Existing methods struggle to remove shadows while preserving texture under complex lighting, and real-world paired datasets for training are lacking.
Method: Builds ASFW, a real-world paired dataset of 1,081 image pairs created through a professional Photoshop workflow, and proposes the Face Shadow Eraser (FSE) model to demonstrate the dataset's effectiveness.
Result: Deep models trained on ASFW show improved shadow removal in real-world conditions, setting a new standard for the task.
Conclusion: ASFW bridges the gap between synthetic and real domains, and FSE showcases its effectiveness, advancing real-world facial shadow removal.
Abstract: Facial shadows often degrade image quality and the performance of vision algorithms. Existing methods struggle to remove shadows while preserving texture, especially under complex lighting conditions, and they lack real-world paired datasets for training. We present the Augmented Shadow Face in the Wild (ASFW) dataset, the first large-scale real-world dataset for facial shadow removal, containing 1,081 paired shadow and shadow-free images created via a professional Photoshop workflow. ASFW offers photorealistic shadow variations and accurate ground truths, bridging the gap between synthetic and real domains. Deep models trained on ASFW demonstrate improved shadow removal in real-world conditions. We also introduce the Face Shadow Eraser (FSE) method to showcase the effectiveness of the dataset. Experiments demonstrate that ASFW enhances the performance of facial shadow removal models, setting new standards for this task.
[81] Instance-Guided Radar Depth Estimation for 3D Object Detection
Chen-Chou Lo,Patrick Vandewalle
Main category: cs.CV
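TL;DR: This paper proposes an end-to-end Radar-camera fusion framework that uses InstaRadar to increase Radar density and semantic alignment and embeds a pre-trained RCDPT module into BEVDepth to improve monocular 3D object detection, yielding steady gains in depth estimation and detection accuracy.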
Details
Motivation: Monocular camera depth estimation suffers from ambiguity and poor robustness; Radar is resilient to adverse conditions but sparse and low-resolution, so more effective Radar-camera fusion and depth estimation strategies are needed.
Method: (1) InstaRadar, a Radar point expansion method guided by instance segmentation masks that increases Radar density and semantic alignment; (2) replacing the depth module in BEVDepth with a pre-trained RCDPT module and training end to end with InstaRadar-enhanced inputs.
Result: InstaRadar achieves SOTA in Radar-guided depth estimation; the overall framework steadily improves 3D detection over the BEVDepth baseline, but lags behind Radar-camera fusion models that extract BEV features directly.
Conclusion: InstaRadar and explicit depth supervision are effective for monocular 3D detection; Radar currently serves only as depth guidance rather than an independent feature stream, and future work will extend InstaRadar to point cloud-like representations and add a temporal Radar branch to strengthen BEV fusion.
Abstract: Accurate depth estimation is fundamental to 3D perception in autonomous driving, supporting tasks such as detection, tracking, and motion planning. However, monocular camera-based 3D detection suffers from depth ambiguity and reduced robustness under challenging conditions. Radar provides complementary advantages such as resilience to poor lighting and adverse weather, but its sparsity and low resolution limit its direct use in detection frameworks. This motivates the need for effective Radar-camera fusion with improved preprocessing and depth estimation strategies. We propose an end-to-end framework that enhances monocular 3D object detection through two key components. First, we introduce InstaRadar, an instance segmentation-guided expansion method that leverages pre-trained segmentation masks to enhance Radar density and semantic alignment, producing a more structured representation. InstaRadar achieves state-of-the-art results in Radar-guided depth estimation, showing its effectiveness in generating high-quality depth features. Second, we integrate the pre-trained RCDPT into the BEVDepth framework as a replacement for its depth module. With InstaRadar-enhanced inputs, the RCDPT integration consistently improves 3D detection performance. Overall, these components yield steady gains over the baseline BEVDepth model, demonstrating the effectiveness of InstaRadar and the advantage of explicit depth supervision in 3D object detection.
Although the framework lags behind Radar-camera fusion models that directly extract BEV features, since Radar serves only as guidance rather than an independent feature stream, this limitation highlights potential for improvement. Future work will extend InstaRadar to point cloud-like representations and integrate a dedicated Radar branch with temporal cues for enhanced BEV fusion.
[82] Innovator-VL: A Multimodal Large Language Model for Scientific Discovery
Zichen Wen,Boxue Yang,Shuang Chen,Yaojie Zhang,Yuhang Han,Junlong Ke,Cong Wang,Yicheng Fu,Jiawang Zhao,Jiangchao Yao,Xi Fang,Zhen Wang,Henxing Cai,Lin Yao,Zhifeng Gao,Yanhui Hong,Nang Yuan,Yixuan Li,Guojiang Zhao,Haoyi Tao,Nan Wang,Han Lyu,Guolin Ke,Ning Liao,Xiaoxing Wang,Kai Chen,Zhiyu Li,Feiyu Xiong,Sihan Hu,Kun Chen,Yanfeng Wang,Weinan E,Linfeng Zhang,Linfeng Zhang
Main category: cs.CV
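TL;DR: Innovator-VL is an efficient, transparent, and reproducible scientific multimodal large model that achieves strong performance on both scientific and general vision tasks through careful training design rather than massive data.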
Details
Motivation: Existing approaches rely on massive domain-specific pretraining and opaque pipelines; this work explores whether strong scientific reasoning can be achieved with less data and a more transparent methodology.
Method: Builds an end-to-end reproducible training pipeline (data collection, cleaning, SFT, reinforcement learning, and more) and trains on fewer than five million curated samples, emphasizing principled data selection and optimization.
Result: Achieves competitive performance on scientific tasks, general vision, and multimodal reasoning benchmarks, showing that data efficiency and generalization can be combined.
Conclusion: Scientific multimodal models need not depend on very large-scale data; transparent, efficient, reproducible design can deliver high performance and provides a practical paradigm for future research.
Abstract: We present Innovator-VL, a scientific multimodal large language model designed to advance understanding and reasoning across diverse scientific domains while maintaining excellent performance on general vision tasks. Contrary to the trend of relying on massive domain-specific pretraining and opaque pipelines, our work demonstrates that principled training design and transparent methodology can yield strong scientific intelligence with substantially reduced data requirements. (i) First, we provide a fully transparent, end-to-end reproducible training pipeline, covering data collection, cleaning, preprocessing, supervised fine-tuning, reinforcement learning, and evaluation, along with detailed optimization recipes. This facilitates systematic extension by the community. (ii) Second, Innovator-VL exhibits remarkable data efficiency, achieving competitive performance on various scientific tasks using fewer than five million curated samples without large-scale pretraining. These results highlight that effective reasoning can be achieved through principled data selection rather than indiscriminate scaling. (iii) Third, Innovator-VL demonstrates strong generalization, achieving competitive performance on general vision, multimodal reasoning, and scientific benchmarks. This indicates that scientific alignment can be integrated into a unified model without compromising general-purpose capabilities. Our practices suggest that efficient, reproducible, and high-performing scientific multimodal models can be built even without large-scale data, providing a practical foundation for future research.
[83] Pareto-Guided Optimization for Uncertainty-Aware Medical Image Segmentation
Jinming Zhang,Xi Yang,Youpeng Yang,Haosen Shi,Yuyao Yan,Qiufeng Wang,Guangliang Cheng,Kaizhu Huang
Main category: cs.CV
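TL;DR: This paper proposes a region-wise curriculum learning strategy and a Pareto-consistent loss, combined with a fuzzy labeling mechanism, to mitigate the training instability caused by high uncertainty in boundary regions in medical image segmentation and to improve segmentation of tumor subregions.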
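The fuzzy labeling idea (crisp labels away from boundaries, a smooth transition near them) can be pictured with a one-dimensional toy. The sketch below is an assumption-laden illustration, not the paper's mechanism: the linear ramp, the `band` parameter, and the function name are hypothetical, and real masks would be 2D or 3D.

```python
def fuzzy_labels(binary, band=2):
    """Soften a 1D binary mask near foreground/background transitions.

    Positions at least `band` samples away from every transition keep their
    crisp 0/1 label; closer positions follow a linear ramp that would reach
    0.5 exactly at the transition (hypothetical choice of ramp).
    """
    n = len(binary)
    # A transition sits halfway between two neighbors with different labels.
    edges = [i + 0.5 for i in range(n - 1) if binary[i] != binary[i + 1]]
    out = []
    for i, b in enumerate(binary):
        if not edges:
            out.append(float(b))
            continue
        d = min(abs(i - e) for e in edges)  # distance to nearest transition
        if d >= band:
            out.append(float(b))            # interior: keep binary confidence
        else:
            ramp = 0.5 + 0.5 * d / band     # grows from 0.5 toward 1.0
            out.append(ramp if b == 1 else 1.0 - ramp)
    return out


if __name__ == "__main__":
    print(fuzzy_labels([0, 0, 0, 0, 1, 1, 1, 1], band=2))
    # [0.0, 0.0, 0.125, 0.375, 0.625, 0.875, 1.0, 1.0]
```

Soft targets of this kind shrink the loss gradient on ambiguous boundary pixels, which matches the stabilizing effect the abstract attributes to fuzzy labels.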
Details
Motivation: Boundary regions in medical image segmentation carry high uncertainty, and conventional training that weights all pixels equally makes early optimization unstable, hindering convergence toward Pareto-optimal solutions.
Method: Proposes a region-wise curriculum learning strategy, a Pareto-consistent loss (which adaptively reshapes the loss landscape and constrains convergence dynamics between interior and boundary regions), and a fuzzy labeling mechanism (binary confidence in non-boundary areas, smooth transitions near boundaries).
Result: On brain metastasis and non-metastatic tumor segmentation, the method outperforms traditional crisp-set approaches in all tumor subregions and improves consistently across multiple configurations.
Conclusion: Modeling uncertainty differently per region improves training stability and final performance in medical image segmentation and offers a new perspective on Pareto optimization and fuzzy modeling.
Abstract: Uncertainty in medical image segmentation is inherently non-uniform, with boundary regions exhibiting substantially higher ambiguity than interior areas. Conventional training treats all pixels equally, leading to unstable optimization during early epochs when predictions are unreliable. We argue that this instability hinders convergence toward Pareto-optimal solutions and propose a region-wise curriculum strategy that prioritizes learning from certain regions and gradually incorporates uncertain ones, reducing gradient variance. Methodologically, we introduce a Pareto-consistent loss that balances trade-offs between regional uncertainties by adaptively reshaping the loss landscape and constraining convergence dynamics between interior and boundary regions; this guides the model toward Pareto-approximate solutions. To address boundary ambiguity, we further develop a fuzzy labeling mechanism that maintains binary confidence in non-boundary areas while enabling smooth transitions near boundaries, stabilizing gradients, and expanding flat regions in the loss surface. Experiments on brain metastasis and non-metastatic tumor segmentation show consistent improvements across multiple configurations, with our method outperforming traditional crisp-set approaches in all tumor subregions.
[84] Establishing dermatopathology encyclopedia DermpathNet with Artificial Intelligence-Based Workflow
Ziyang Xu,Mingquan Lin,Yiliang Zhou,Zihan Xu,Seth J. Orlow,Shane A. Meehan,Alexandra Flamm,Ata S. Moshiri,Yifan Peng
Main category: cs.CV
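TL;DR: This paper proposes a hybrid workflow that automatically collects and classifies dermatopathology images from PubMed Central to build DermpathNet, a large-scale, open-access, expert-reviewed image dataset, and demonstrates its value for education, cross-referencing, and machine learning.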
Details
Motivation: Clinicians and dermatopathology trainees have difficulty accessing high-quality, open-access dermatopathology image datasets.
Method: A hybrid approach combining deep learning-based image modality classification with figure caption analysis retrieves and classifies dermatopathology images from PMC, followed by manual review by board-certified dermatopathologists.
Result: Builds DermpathNet, with 7,772 images covering 166 diagnoses; the hybrid method reaches an F-score of 90.4% on 651 manually annotated images; the current OpenAI image analysis algorithm proves inadequate for dermatopathology images.
Conclusion: The work delivers DermpathNet, a large-scale, peer-reviewed, open-access dermatopathology image dataset, together with a semi-automated curation workflow, providing a resource for education, clinical cross-referencing, and AI research.
Abstract: Accessing high-quality, open-access dermatopathology image datasets for learning and cross-referencing is a common challenge for clinicians and dermatopathology trainees. To establish a comprehensive open-access dermatopathology dataset for educational, cross-referencing, and machine-learning purposes, we employed a hybrid workflow to curate and categorize images from the PubMed Central (PMC) repository. We used specific keywords to extract relevant images, and classified them using a novel hybrid method that combined deep learning-based image modality classification with figure caption analyses. Validation on 651 manually annotated images demonstrated the robustness of our workflow, with an F-score of 89.6% for the deep learning approach, 61.0% for the keyword-based retrieval method, and 90.4% for the hybrid approach. We retrieved 7,772 images across 166 diagnoses and released this fully annotated dataset, reviewed by board-certified dermatopathologists. Using our dataset as a challenging task, we found the current image analysis algorithm from OpenAI inadequate for analyzing dermatopathology images. In conclusion, we have developed a large, peer-reviewed, open-access dermatopathology image dataset, DermpathNet, which features a semi-automated curation workflow.
[85] Tri-Reader: An Open-Access, Multi-Stage AI Pipeline for First-Pass Lung Nodule Annotation in Screening CT
Fakrul Islam Tushar,Joseph Y. Lo
Main category: cs.CV
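TL;DR: This paper presents Tri-Reader, a freely available lung CT analysis pipeline built on multiple open-source models that integrates three stages (lung segmentation, nodule detection, and malignancy classification); it aims to increase sensitivity while easing the annotator burden, and its accuracy and generalizability are evaluated on multi-center data.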
Details
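Motivation: To increase the sensitivity of lung nodule analysis while reducing the manual annotation burden and improving generalization across clinical settings.
Method: Builds Tri-Reader, a unified three-stage pipeline integrating lung segmentation, nodule detection, and malignancy classification, using multiple open-source models trained on public datasets, and evaluates it on multiple internal and external datasets against expert annotations and reference standards.
Result: Tri-Reader shows good accuracy and generalizability on multiple internal and external datasets, balancing high sensitivity against the number of candidate nodules.
Conclusion: Tri-Reader is a publicly available, highly sensitive, generalizable end-to-end lung nodule analysis tool suited to diverse clinical practice settings.
Abstract: Using multiple open-access models trained on public datasets, we developed Tri-Reader, a comprehensive, freely available pipeline that integrates lung segmentation, nodule detection, and malignancy classification into a unified tri-stage workflow. The pipeline is designed to prioritize sensitivity while reducing the candidate burden for annotators. To ensure accuracy and generalizability across diverse practices, we evaluated Tri-Reader on multiple internal and external datasets as compared with expert annotations and dataset-provided reference standards.
[86] Unveiling Perceptual Artifacts: A Fine-Grained Benchmark for Interpretable AI-Generated Image Detection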
Yao Xiao,Weiyan Chen,Jiahao Chen,Zijie Cao,Weijian Deng,Binbin Yang,Ziyi Dong,Xiangyang Ji,Wei Ke,Pengxu Wei,Liang Lin
Main category: cs.CV
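TL;DR: This paper presents X-AIGD, a fine-grained benchmark for explainable AI-generated image detection with pixel-level, categorized annotations of perceptual artifacts (low-level, high-level, and cognitive-level); it shows that existing detectors rely little on artifacts and still depend on uninterpretable features, and that explicitly aligning model attention with artifact regions improves interpretability and generalization.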
Details
Motivation: Existing AI-generated image detection methods are mostly binary classifiers that lack interpretable, convincing evidence for their decisions; existing benchmarks cover too little artifact diversity and lack fine-grained localized annotations, so they cannot support interpretability research.
Method: Builds the fine-grained X-AIGD benchmark with pixel-level, multi-level annotations of perceptual artifacts (low-level distortions, high-level semantics, cognitive-level counterfactuals); uses it to systematically evaluate mainstream detectors' reliance on artifacts, interpretability, and generalization, and explores improvements such as attention alignment.
Result: (1) Existing detectors barely rely on perceptual artifacts of any kind; (2) even when trained to identify specific artifacts, they still depend mainly on uninterpretable features; (3) explicitly aligning model attention with artifact regions improves interpretability and generalization.
Conclusion: Fine-grained, multi-level artifact annotation is key to advancing interpretable AIGI detection; raising detection accuracy alone does not guarantee interpretability, which requires coordinated design of both benchmarks and model mechanisms.
Abstract: Current AI-Generated Image (AIGI) detection approaches predominantly rely on binary classification to distinguish real from synthetic images, often lacking interpretable or convincing evidence to substantiate their decisions. This limitation stems from existing AIGI detection benchmarks, which, despite featuring a broad collection of synthetic images, remain restricted in their coverage of artifact diversity and lack detailed, localized annotations. To bridge this gap, we introduce a fine-grained benchmark towards eXplainable AI-Generated image Detection, named X-AIGD, which provides pixel-level, categorized annotations of perceptual artifacts, spanning low-level distortions, high-level semantics, and cognitive-level counterfactuals. These comprehensive annotations facilitate fine-grained interpretability evaluation and deeper insight into model decision-making processes. Our extensive investigation using X-AIGD provides several key insights: (1) Existing AIGI detectors demonstrate negligible reliance on perceptual artifacts, even at the most basic distortion level. (2) While AIGI detectors can be trained to identify specific artifacts, they still substantially base their judgment on uninterpretable features. (3) Explicitly aligning model attention with artifact regions can increase the interpretability and generalization of detectors. The data and code are available at: https://github.com/Coxy7/X-AIGD.
[87] RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming
Jisheng Chu,Wenrui Li,Rui Zhao,Wangmeng Zuo,Shifeng Chen,Xiaopeng Fan
Main category: cs.CV
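TL;DR: This paper proposes RoamScene3D, a framework that uses scene-graph-guided adaptive camera trajectories and a motion-injected inpainting model to achieve semantically aware, geometrically consistent text-to-3D scene generation.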
Details
Motivation: Existing text-to-3D scene generation methods have weak spatial awareness, rely on predefined trajectories, cannot understand semantic layout, and struggle to fill holes caused by camera motion.
Method: Uses a vision-language model to build a scene graph that guides an adaptive camera roaming trajectory; proposes a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset so that it adapts to real camera motion.
Result: Significantly outperforms state-of-the-art methods on multiple metrics, producing more consistent and more photorealistic 3D scenes.
Conclusion: Combining semantic reasoning with geometric constraints effectively improves the quality and plausibility of text-driven 3D scene generation.
Abstract: Generating immersive 3D scenes from text is a core task in computer vision, crucial for applications in virtual reality and game development. Despite the promise of leveraging 2D diffusion priors, existing methods suffer from spatial blindness and rely on predefined trajectories that fail to exploit the inner relationships among salient objects. Consequently, these approaches are unable to comprehend the semantic layout, preventing them from exploring the scene adaptively to infer occluded content. Moreover, current inpainting models operate in 2D image space, struggling to plausibly fill holes caused by camera motion. To address these limitations, we propose RoamScene3D, a novel framework that bridges the gap between semantic guidance and spatial generation. Our method reasons about the semantic relations among objects and produces consistent and photorealistic scenes. Specifically, we employ a vision-language model (VLM) to construct a scene graph that encodes object relations, guiding the camera to perceive salient object boundaries and plan an adaptive roaming trajectory. Furthermore, to mitigate the limitations of static 2D priors, we introduce a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset integrating authentic camera trajectories, making it adaptive to camera motion. Extensive experiments demonstrate that with semantic reasoning and geometric constraints, our method significantly outperforms state-of-the-art approaches in producing consistent and photorealistic scenes. Our code is available at https://github.com/JS-CHU/RoamScene3D.
[88] DSTCS: Dual-Student Teacher Framework with Segment Anything Model for Semi-Supervised Pubic Symphysis Fetal Head Segmentation
Yalin Luo,Shun Long,Huijin Wang,Jieyun Bai
Main category: cs.CV
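TL;DR: This paper proposes DSTCS, a dual-student teacher framework that combines a CNN with the Segment Anything Model (SAM) for accurate segmentation of the pubic symphysis and fetal head (PSFH) in ultrasound images. With cooperative learning, boundary-optimized data augmentation, and a new loss function, it significantly outperforms existing methods on the MICCAI 2023/2024 datasets.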
Details
Motivation: PSFH segmentation is essential for monitoring labor, but it faces class imbalance, ambiguous boundaries, noise interference, and scarce high-quality annotations; existing methods mostly rely on CNNs or Transformers and have not fully explored more powerful models such as SAM.
Method: Proposes the Dual-Student and Teacher framework (DSTCS), which combines CNN and SAM branches with a cooperative learning mechanism between them; introduces a boundary-oriented data augmentation strategy and a novel loss function.
Result: On the MICCAI 2023 and 2024 PSFH segmentation benchmarks, the method shows stronger robustness and significantly outperforms existing techniques.
Conclusion: DSTCS offers a reliable, accurate, and robust tool for clinical PSFH segmentation and advances intelligent ultrasound image analysis in obstetric practice.
Abstract: Segmentation of the pubic symphysis and fetal head (PSFH) is a critical procedure in intrapartum monitoring and is essential for evaluating labor progression and identifying potential delivery complications. However, achieving accurate segmentation remains a significant challenge due to class imbalance, ambiguous boundaries, and noise interference in ultrasound images, compounded by the scarcity of high-quality annotated data. Current research on PSFH segmentation predominantly relies on CNN and Transformer architectures, leaving the potential of more powerful models underexplored. In this work, we propose a Dual-Student and Teacher framework combining CNN and SAM (DSTCS), which integrates the Segment Anything Model (SAM) into a dual student-teacher architecture. A cooperative learning mechanism between the CNN and SAM branches significantly improves segmentation accuracy. The proposed scheme also incorporates a specialized data augmentation strategy optimized for boundary processing and a novel loss function. Extensive experiments on the MICCAI 2023 and 2024 PSFH segmentation benchmarks demonstrate that our method exhibits superior robustness and significantly outperforms existing techniques, providing a reliable segmentation tool for clinical practice.
[89] Towards Gold-Standard Depth Estimation for Tree Branches in UAV Forestry: Benchmarking Deep Stereo Matching Methods
Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
Main category: cs.CV
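TL;DR: This paper systematically evaluates the zero-shot depth estimation performance of eight stereo matching methods in vegetation-dense environments, introduces the novel Canterbury Tree Branches dataset dedicated to tree-branch scenes, and finds that DEFOM has the best cross-domain consistency, making it a suitable baseline for vegetation depth estimation.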
Details
Motivation: Evaluations of depth estimation methods focus on urban and indoor scenes and do not verify robustness or cross-domain generalization in vegetation-dense environments such as forests, which holds back autonomous UAV forestry operations.
Method: Zero-shot transfer evaluation of eight representative stereo matching methods (covering iterative refinement, foundation model, diffusion-based, and 3D CNN paradigms), all using officially released weights pretrained on Scene Flow; uniform evaluation on ETH3D, KITTI 2012/2015, Middlebury, and the newly built 5,313-pair Canterbury Tree Branches dataset ($1920 \times 1080$).
Result: Performance is strongly scene-dependent: foundation models have a clear advantage on structured scenes (BridgeDepth: 0.23 px on ETH3D; DEFOM: 4.65 px on Middlebury), while iterative methods such as IGEV++ vary widely across benchmarks; DEFOM is qualitatively best on the Tree Branches dataset and consistently ranks 1st-2nd across benchmarks, with an average rank of 1.75.
Conclusion: DEFOM is currently the most robust zero-shot baseline for vegetation depth estimation; its predictions can serve as pseudo-ground-truth for future work, filling the gap in depth estimation evaluation for forestry scenes.
Abstract: Autonomous UAV forestry operations require robust depth estimation with strong cross-domain generalization, yet existing evaluations focus on urban and indoor scenarios, leaving a critical gap for vegetation-dense environments. We present the first systematic zero-shot evaluation of eight stereo methods spanning iterative refinement, foundation model, diffusion-based, and 3D CNN paradigms. All methods use officially released pretrained weights (trained on Scene Flow) and are evaluated on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) plus a novel 5,313-pair Canterbury Tree Branches dataset ($1920 \times 1080$). Results reveal scene-dependent patterns: foundation models excel on structured scenes (BridgeDepth: 0.23 px on ETH3D; DEFOM: 4.65 px on Middlebury), while iterative methods show variable cross-benchmark performance (IGEV++: 0.36 px on ETH3D but 6.77 px on Middlebury; IGEV: 0.33 px on ETH3D but 4.99 px on Middlebury). Qualitative evaluation on the Tree Branches dataset establishes DEFOM as the gold-standard baseline for vegetation depth estimation, with superior cross-domain consistency (consistently ranking 1st-2nd across benchmarks, average rank 1.75). DEFOM predictions will serve as pseudo-ground-truth for future benchmarking.
[90] Dynamic Worlds, Dynamic Humans: Generating Virtual Human-Scene Interaction Motion in Dynamic Scenes
Yin Wang,Zhiying Leng,Haitian Liu,Frederick W. B. Li,Mu Li,Xiaohui Liang
Main category: cs.CV
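TL;DR: This paper proposes Dyn-HSI, the first cognitive architecture for dynamic human-scene interaction. Through three human-like components, namely vision (Dynamic Scene-Aware Navigation), memory (Hierarchical Experience Memory), and control (Human-Scene Interaction Diffusion Model), it models dynamic real-world scenes and generates high-quality interaction motions; a dynamic benchmark, Dyn-Scenes, is built for validation.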
Details
Motivation: Existing methods treat the scene as static, which conflicts with the fact that real-world scenes change continuously, so a method that can model dynamic human-scene interaction is needed.
Method: Proposes the Dyn-HSI cognitive architecture with three core modules: (1) Dynamic Scene-Aware Navigation (vision), (2) Hierarchical Experience Memory (memory), and (3) a Human-Scene Interaction Diffusion Model (control); builds the dynamic benchmark Dyn-Scenes for evaluation.
Result: Generates high-quality human-scene interaction motions in both static and dynamic scenes; quantitative and qualitative experiments show that it consistently outperforms existing methods.
Conclusion: Dyn-HSI brings the world-model idea into human-scene interaction generation, using human-like cognitive components to model and generalize to dynamic environments, and offers a new paradigm for intelligent virtual human research.
Abstract: Scenes are continuously undergoing dynamic changes in the real world. However, existing human-scene interaction generation methods typically treat the scene as static, which deviates from reality. Inspired by world models, we introduce Dyn-HSI, the first cognitive architecture for dynamic human-scene interaction, which endows virtual humans with three humanoid components. (1) Vision (human eyes): we equip the virtual human with a Dynamic Scene-Aware Navigation, which continuously perceives changes in the surrounding environment and adaptively predicts the next waypoint. (2) Memory (human brain): we equip the virtual human with a Hierarchical Experience Memory, which stores and updates experiential data accumulated during training. This allows the model to leverage prior knowledge during inference for context-aware motion priming, thereby enhancing both motion quality and generalization. (3) Control (human body): we equip the virtual human with a Human-Scene Interaction Diffusion Model, which generates high-fidelity interaction motions conditioned on multimodal inputs. To evaluate performance in dynamic scenes, we extend the existing static human-scene interaction datasets to construct a dynamic benchmark, Dyn-Scenes. We conduct extensive qualitative and quantitative experiments to validate Dyn-HSI, showing that our method consistently outperforms existing approaches and generates high-quality human-scene interaction motions in both static and dynamic settings.
[91] Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation
Yizhao Han,Tianxing Shi,Zhao Wang,Zifan Xu,Zhiyuan Pu,Mingxiao Li,Qian Zhang,Wei Yin,Xiao-Xiao Long
Main category: cs.CV
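TL;DR: This paper proposes Entropy-Guided k-Guard (ENkG) sampling to improve sampling in autoregressive video generation models. It adapts the candidate set size to the entropy of each token's predicted distribution, improving the quality and stability of long-horizon video generation.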
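The entropy-adaptive candidate size summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the linear entropy-to-k mapping, and the `k_min`/`k_max` defaults are all assumptions made for illustration.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_k(probs, k_min=1, k_max=8):
    """Map normalized entropy in [0, 1] to a candidate count.

    The linear mapping and the k_min/k_max defaults are illustrative
    assumptions, not values taken from the paper.
    """
    max_entropy = math.log(len(probs))  # entropy of the uniform distribution
    h_norm = entropy(probs) / max_entropy if max_entropy > 0 else 0.0
    k = k_min + round(h_norm * (k_max - k_min))
    return min(len(probs), k)  # never ask for more candidates than tokens

def sample_entropy_guided(probs, rng, k_min=1, k_max=8):
    """Sample a token index from the top-k candidates, with k set by entropy."""
    k = adaptive_k(probs, k_min, k_max)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0], k
```

With this mapping a near-deterministic token (a static background, say) is sampled from very few candidates, while a flat distribution keeps many candidates in play, which is the behaviour the summary describes for low- and high-entropy regions.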
Details
Motivation: Static top-k/top-p sampling works poorly for video generation: video tokens have low semantic density and high spatio-temporal redundancy, so static sampling injects noise in low-uncertainty regions and accumulates errors in high-uncertainty regions.
Method: Proposes ENkG sampling, which adjusts the number of candidate tokens according to the entropy of each token's predicted distribution: fewer candidates in low-entropy regions to suppress noise, more candidates in high-entropy regions to mitigate error propagation. The method is training-free, model-agnostic, and adds negligible overhead.
Result: Experiments show that ENkG consistently outperforms static top-k/top-p strategies in perceptual quality and structural stability.
Conclusion: ENkG is a simple, effective, plug-and-play sampling improvement that markedly reduces error accumulation in autoregressive video generation and improves long-horizon quality.
Abstract: Autoregressive (AR) architectures have achieved significant successes in LLMs, inspiring explorations for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed size of token candidates already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness for low-uncertainty regions (static backgrounds) or get stuck in early errors for high-uncertainty regions (foreground objects). Prediction errors will accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a simple yet effective strategy that adapts sampling to token-wise dispersion, quantified by the entropy of each token's predicted distribution. ENkG uses adaptive token candidate sizes: for low-entropy regions, it employs fewer candidates to suppress redundant noise and preserve structural integrity; for high-entropy regions, it uses more candidates to mitigate error compounding. ENkG is model-agnostic, training-free, and adds negligible overhead. Experiments demonstrate consistent improvements in perceptual quality and structural stability compared to static top-k/top-p strategies.
[92] Fast Converging 3D Gaussian Splatting for 1-Minute Reconstruction
Ziyu Zhang,Tianle Liu,Diantao Tu,Shuhan Shen
Main category: cs.CV
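TL;DR: This paper presents a fast 3D Gaussian Splatting (3DGS) reconstruction pipeline that converges within one minute. It designs a two-stage optimization strategy for two heterogeneous settings, SLAM poses (noisy trajectories) and COLMAP poses (highly accurate), and ranked first in the SIGGRAPH Asia 3DGS Fast Reconstruction Challenge with a PSNR of 28.43.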
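For readers less familiar with the metric quoted above, the following is a minimal sketch of the standard PSNR definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The function name, the flat-list image representation, and the `max_value=1.0` intensity range are assumptions for illustration; the challenge's exact evaluation protocol is not described in this entry.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equal-length sequences of pixel intensities.

    max_value is the peak intensity (1.0 assumes images scaled to [0, 1]).
    """
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under the same [0, 1] assumption, a PSNR of 28.43 dB corresponds to a mean squared error of about $10^{-2.843} \approx 0.0014$, that is, a root-mean-square error of roughly 0.038 per pixel.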
Details
Motivation: To meet the strict one-minute limit of the SIGGRAPH Asia 3DGS Fast Reconstruction Challenge and to handle the different optimization settings created by noisy SLAM poses and accurate COLMAP poses, while balancing speed and reconstruction quality.
Method: A two-stage solution. The first round, for SLAM poses, uses reverse per-Gaussian parallel optimization, compact forward splatting, load-balanced tiling, an anchor-based Neural-Gaussian representation, initialization from monocular depth and feed-forward 3DGS models, and global pose refinement. The final round, for COLMAP poses, disables pose refinement, reverts to standard 3DGS, adds multi-view consistency-guided Gaussian splitting, and supervises the rendered depth with a depth estimator.
Result: High-quality reconstruction within the one-minute limit, with a PSNR of 28.43 and first place in the challenge.
Conclusion: Tailoring the optimization strategy and representation to pose quality can substantially speed up 3DGS training without sacrificing reconstruction quality, demonstrating that fast, robust, high-fidelity 3D reconstruction is feasible.
Abstract: We present a fast 3DGS reconstruction pipeline designed to converge within one minute, developed for the SIGGRAPH Asia 3DGS Fast Reconstruction Challenge. The challenge consists of an initial round using SLAM-generated camera poses (with noisy trajectories) and a final round using COLMAP poses (highly accurate). To robustly handle these heterogeneous settings, we develop a two-stage solution. In the first round, we use reverse per-Gaussian parallel optimization and compact forward splatting based on Taming-GS and Speedy-splat, load-balanced tiling, an anchor-based Neural-Gaussian representation enabling rapid convergence with fewer learnable parameters, initialization from monocular depth and partially from feed-forward 3DGS models, and a global pose refinement module for noisy SLAM trajectories. In the final round, the accurate COLMAP poses change the optimization landscape; we disable pose refinement, revert from Neural-Gaussians back to standard 3DGS to eliminate MLP inference overhead, introduce multi-view consistency-guided Gaussian splitting inspired by Fast-GS, and introduce a depth estimator to supervise the rendered depth. Together, these techniques enable high-fidelity reconstruction under a strict one-minute budget. Our method achieved the top performance with a PSNR of 28.43 and ranked first in the competition.
[93] Cortex-Grounded Diffusion Models for Brain Image Generation
Fabian Bongratz,Yitong Li,Sama Elbaroudy,Christian Wachinger
Main category: cs.CV
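TL;DR: This paper proposes Cor2Vox, a 3D diffusion generative model guided by cerebral cortex structure. It uses high-resolution cortical surfaces to guide MRI synthesis and, combined with a large-scale statistical shape model, generates brain images that are anatomically consistent, topologically faithful, and accurate at the sub-voxel level.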
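The abstract for this entry refers to a Brownian bridge diffusion process. As a rough illustration only, the sketch below samples from a textbook Brownian bridge pinned at two endpoints, $x_t = (1 - t)\,x_0 + t\,y + \sigma\sqrt{t(1 - t)}\,\epsilon$. It is not Cor2Vox's noise schedule or network; the function name, the flat-list data layout, and `sigma` are assumptions.

```python
import math
import random

def brownian_bridge_sample(x0, y, t, rng, sigma=1.0):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and y (t=1).

    x0 and y are equal-length lists of floats, e.g. flattened voxel values
    of a target image and of its conditioning input (assumed layout).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    std = sigma * math.sqrt(t * (1.0 - t))  # zero at both endpoints
    return [(1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, y)]
```

Because the noise scale vanishes at t = 0 and t = 1, the process starts exactly at one endpoint and ends exactly at the other, which is what makes a bridge a natural fit for translating between two paired domains.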
Details
Motivation: Existing generative models rely on weak conditioning signals such as labels or text, which lack anatomical grounding and often produce biologically implausible outputs; real neuroimaging data suffer from scarce rare phenotypes, domain shifts across scanners, and insufficient longitudinal coverage.
Method: Proposes the Cor2Vox framework: builds a large-scale statistical shape model of the cortex from over 33,000 UK Biobank scans; uses high-resolution cortical surfaces as continuous structural priors to drive a 3D shape-to-image Brownian bridge diffusion process for cortex-guided MRI synthesis.
Result: Outperforms many baselines on image quality, cortical surface reconstruction, and whole-brain segmentation; across three applications (anatomically consistent synthesis, simulation of progressive gray matter atrophy, and harmonization of FTD scans) it preserves sub-voxel cortical morphology and is robust to variations in cortical geometry and disease phenotype without retraining.
Conclusion: By explicitly modeling cortical structure, Cor2Vox improves the biological plausibility and anatomical accuracy of synthetic neuroimages and offers a new paradigm for data augmentation, disease modeling, and multi-center harmonization.
Abstract: Synthetic neuroimaging data can mitigate critical limitations of real-world datasets, including the scarcity of rare phenotypes, domain shifts across scanners, and insufficient longitudinal coverage. However, existing generative models largely rely on weak conditioning signals, such as labels or text, which lack anatomical grounding and often produce biologically implausible outputs. To this end, we introduce Cor2Vox, a cortex-grounded generative framework for brain magnetic resonance image (MRI) synthesis that ties image generation to continuous structural priors of the cerebral cortex. It leverages high-resolution cortical surfaces to guide a 3D shape-to-image Brownian bridge diffusion process, enabling topologically faithful synthesis and precise control over underlying anatomies. To support the generation of new, realistic brain shapes, we developed a large-scale statistical shape model of cortical morphology derived from over 33,000 UK Biobank scans. We validated the fidelity of Cor2Vox based on traditional image quality metrics, advanced cortical surface reconstruction, and whole-brain segmentation quality, outperforming many baseline methods. Across three applications, namely (i) anatomically consistent synthesis, (ii) simulation of progressive gray matter atrophy, and (iii) harmonization of in-house frontotemporal dementia scans with public datasets, Cor2Vox preserved fine-grained cortical morphology at the sub-voxel level, exhibiting remarkable robustness to variations in cortical geometry and disease phenotype without retraining.
[94] Bridging Information Asymmetry: A Hierarchical Framework for Deterministic Blind Face Restoration
Zhengjian Yao,Jiakui Hu,Kaiwen Li,Hangzhou He,Xinliang Zhang,Shuang Zeng,Lei Zhu,Yanye Lu
Main category: cs.CV
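TL;DR: This paper proposes Pref-Restore, a hierarchical blind face restoration framework that combines discrete semantic logic with continuous texture generation. It mitigates information asymmetry by augmenting input density and pruning the output distribution, achieving deterministic, preference-aligned restoration.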
Details
Motivation: Blind face restoration is hard because it is severely ill-posed; existing generative methods suffer from information asymmetry between information-sparse inputs and information-dense outputs, which leads to a one-to-many mapping, uncertainty, and hallucinated artifacts.
Method: Proposes the Pref-Restore framework: (1) an auto-regressive integrator turns textual instructions into dense latent queries to strengthen the semantic stability of the input; (2) on-policy reinforcement learning is embedded in the diffusion restoration loop, turning human preferences into differentiable constraints that suppress stochastic deviations.
Result: Achieves state-of-the-art performance on synthetic and real-world benchmarks; empirical analysis shows that the method significantly reduces solution entropy and improves the reliability and determinism of restoration.
Conclusion: Through joint semantic-texture modeling and preference-driven distribution refinement, Pref-Restore provides a more robust, deterministic paradigm for blind face restoration.
Abstract: Blind face restoration remains a persistent challenge due to the inherent ill-posedness of reconstructing holistic structures from severely constrained observations. Current generative approaches, while capable of synthesizing realistic textures, often suffer from information asymmetry: the intrinsic disparity between the information-sparse low-quality inputs and the information-dense high-quality outputs. This imbalance leads to a one-to-many mapping, where insufficient constraints result in stochastic uncertainty and hallucinatory artifacts. To bridge this gap, we present Pref-Restore, a hierarchical framework that integrates discrete semantic logic with continuous texture generation to achieve deterministic, preference-aligned restoration. Our methodology fundamentally addresses this information disparity through two complementary strategies: (1) Augmenting Input Density: We employ an auto-regressive integrator to reformulate textual instructions into dense latent queries, injecting high-level semantic stability to constrain the degraded signals; (2) Pruning Output Distribution: We pioneer the integration of on-policy reinforcement learning directly into the diffusion restoration loop. By transforming human preferences into differentiable constraints, we explicitly penalize stochastic deviations, thereby sharpening the posterior distribution toward the desired high-fidelity outcomes. Extensive experiments demonstrate that Pref-Restore achieves state-of-the-art performance across synthetic and real-world benchmarks. Furthermore, empirical analysis confirms that our preference-aligned strategy significantly reduces solution entropy, establishing a robust pathway toward reliable and deterministic blind restoration.
[95] Mocap Anywhere: Towards Pairwise-Distance based Motion Capture in the Wild (for the Wild)
Ofir Abramovich,Ariel Shamir,Andreas Aristidou
Main category: cs.CV
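TL;DR: This paper proposes a new system for full-body 3D motion capture that relies only on sparse pairwise distances (PWD) measured by body-mounted ultra-wideband (UWB) sensors. Its core is Wild-Poser (WiP), a lightweight real-time Transformer that predicts 3D joint positions directly from noisy PWD data; it generalizes across body shapes (including non-human species), needs no per-subject calibration, and resists environmental interference.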
Details
Motivation: Traditional optical or inertial motion capture systems are limited in outdoor, uncontrolled environments by lighting, magnetic interference, and the need for external equipment; the goal is a low-cost, robust, general solution for motion capture in the wild.
Method: Uses UWB sensors to obtain time-of-flight distances (PWD) between body-mounted nodes as sparse distance inputs; designs the lightweight real-time Transformer Wild-Poser (WiP) to predict 3D joint coordinates directly from noisy PWD, with joint rotations reconstructed afterwards by learned methods.
Result: WiP runs in real time in real-world scenes, achieves low joint position error for both humans and animals, generalizes well without individual body measurements or shape fitting, and stays stable under lighting changes and magnetic interference.
Conclusion: The system offers a new paradigm for low-cost, general-purpose motion capture in the wild, with good scalability and potential for practical deployment.
Abstract: We introduce a novel motion capture system that reconstructs full-body 3D motion using only sparse pairwise distance (PWD) measurements from body-mounted ultra-wideband (UWB) sensors. Using time-of-flight ranging between wireless nodes, our method eliminates the need for external cameras, enabling robust operation in uncontrolled and outdoor environments. Unlike traditional optical or inertial systems, our approach is shape-invariant and resilient to environmental constraints such as lighting and magnetic interference. At the core of our system is Wild-Poser (WiP for short), a compact, real-time Transformer-based architecture that directly predicts 3D joint positions from noisy or corrupted PWD measurements, which can later be used for joint rotation reconstruction via learned methods. WiP generalizes across subjects of varying morphologies, including non-human species, without requiring individual body measurements or shape fitting. Operating in real time, WiP achieves low joint position error and demonstrates accurate 3D motion reconstruction for both human and animal subjects in-the-wild. Our empirical analysis highlights its potential for scalable, low-cost, and general purpose motion capture in real-world settings.
[96] A Non-Invasive 3D Gait Analysis Framework for Quantifying Psychomotor Retardation in Major Depressive Disorder
Fouad Boutaleb,Emery Pierson,Mohamed Daoudi,Clémence Nineuil,Ali Amad,Fabien D'Hondt
Main category: cs.CV
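TL;DR: This paper proposes a non-invasive computational framework that extracts 3D gait kinematic features from monocular RGB video to objectively assess psychomotor retardation (PMR) in patients with depression. On a small clinical dataset it achieves 83.3% PMR detection accuracy and explains 64% of the variance in depression severity ($R^2 = 0.64$).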
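The "variance explained" figure quoted above is the coefficient of determination, $R^2 = 1 - \mathrm{SS}_{\mathrm{res}} / \mathrm{SS}_{\mathrm{tot}}$. A minimal sketch of that standard definition follows; the function name is an assumption, and the paper's own evaluation protocol (for example, how predictions are cross-validated) is not described in this entry.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot
```

A perfect predictor scores 1.0 and a predictor that always outputs the mean of the observed scores gives 0.0, so 0.64 means the model removes 64% of the squared error of that mean-only baseline.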
Details
Motivation: Existing methods for predicting MDD status lack objective, interpretable motor features that are easy to deploy clinically; clinical PMR assessment is highly subjective, and 3D motion capture is hard to use routinely because it depends on specialized hardware.
Method: A monocular-RGB computational framework that introduces Gravity-View Coordinates and a novel trajectory-correction algorithm (using the closed-loop topology of an adapted TUG protocol to mitigate monocular depth errors) to extract 297 explicit gait biomechanical biomarkers; combined with a stability-based machine learning framework to counter overfitting on small samples.
Result: On the CALYPSO dataset, 83.3% accuracy in PMR detection and $R^2 = 0.64$ for overall depression severity; reduced ankle propulsion and restricted pelvic mobility emerge as key indicators of the depressive motor phenotype.
Conclusion: Physical movement can serve as a robust proxy for cognitive state; the method provides a transparent, scalable tool for objective depression monitoring in standard clinical environments.
Abstract: Predicting the status of Major Depressive Disorder (MDD) from objective, non-invasive methods is an active research field. Yet, automatically extracting objective, interpretable features for a detailed analysis of the patient state remains largely unexplored. Among MDD's symptoms, psychomotor retardation (PMR) is a core item, yet its clinical assessment remains largely subjective. While 3D motion capture offers an objective alternative, its reliance on specialized hardware often precludes routine clinical use. In this paper, we propose a non-invasive computational framework that transforms monocular RGB video into clinically relevant 3D gait kinematics. Our pipeline uses Gravity-View Coordinates along with a novel trajectory-correction algorithm that leverages the closed-loop topology of our adapted Timed Up and Go (TUG) protocol to mitigate monocular depth errors. This novel pipeline enables the extraction of 297 explicit gait biomechanical biomarkers from a single camera capture. To address the challenges of small clinical datasets, we introduce a stability-based machine learning framework that identifies robust motor signatures while preventing overfitting. Validated on the CALYPSO dataset, our method achieves an 83.3% accuracy in detecting PMR and explains 64% of the variance in overall depression severity ($R^2 = 0.64$). Notably, our study reveals a strong link from reduced ankle propulsion and restricted pelvic mobility to the depressive motor phenotype. These results demonstrate that physical movement serves as a robust proxy for the cognitive state, offering a transparent and scalable tool for the objective monitoring of depression in standard clinical environments.
[97] The S3LI Vulcano Dataset: A Dataset for Multi-Modal SLAM in Unstructured Planetary Environments
Riccardo Giubilato,Marcus Gerhard Müller,Marco Sewtz,Laura Alejandra Encinar Gonzalez,John Folkesson,Rudolph Triebel
Main category: cs.CV
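TL;DR: This paper releases S3LI Vulcano, a multi-modal dataset for developing and benchmarking SLAM and place recognition algorithms that combine visual and LiDAR data. The data were recorded on the volcanic island of Vulcano in Sicily, Italy, and come with an open-source toolkit for generating ground-truth poses and labelling place recognition samples.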
Details
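Motivation: Existing SLAM and place recognition algorithms lack sufficient validation data for complex, diverse natural environments such as volcanic terrain; a multi-modal benchmark dataset with rich textures, terrains, and environmental variation is needed.
Method: Builds and releases the S3LI Vulcano multi-modal dataset, covering several synchronized visual and LiDAR sequences; provides an open-source toolkit for generating ground-truth poses and preparing labelled samples for place recognition.
Result: A public multi-modal dataset covering diverse volcanic environments (basaltic and iron-rich rocks, old lava channels, dry vegetation, water), released together with the open-source toolkit.
Conclusion: The S3LI Vulcano dataset fills the data gap for benchmarking multi-modal SLAM and place recognition in complex natural scenes and is a valuable resource for evaluating and improving algorithm robustness.
Abstract: We release the S3LI Vulcano dataset, a multi-modal dataset for the development and benchmarking of Simultaneous Localization and Mapping (SLAM) and place recognition algorithms that rely on visual and LiDAR modalities. Several sequences are recorded on the volcanic island of Vulcano, in the Aeolian Islands, Sicily, Italy. The sequences provide users with data from a variety of environments, textures and terrains, including basaltic or iron-rich rocks, geological formations from old lava channels, as well as dry vegetation and water. The data (rmc.dlr.de/s3li_dataset) is accompanied by an open source toolkit (github.com/DLR-RM/s3li-toolkit) providing tools for generating ground truth poses as well as preparation of labelled samples for place recognition tasks.
[98] MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation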
Ronglai Zuo,Rolandos Alexandros Potamias,Qi Sun,Evangelos Ververas,Jiankang Deng,Stefanos Zafeiriou
Main category: cs.CV
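TL;DR: This paper proposes MaDiS, a masked-diffusion-based language model for sign language generation (SLG). With bidirectional modeling, parallel multi-token generation, tri-level cross-modal pretraining, a novel unmasking strategy with temporal checkpoints, and a mixture-of-parts embedding layer, it significantly improves performance and inference efficiency.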
Details
Motivation: Autoregressive language models for SLG suffer from unidirectional context modeling and slow token-by-token inference, which makes efficient, expressive sign motion generation difficult.
Method: Proposes the masked-diffusion-based MaDiS model; designs tri-level cross-modal pretraining (token, latent, and 3D physical space); introduces an unmasking strategy with temporal checkpoints to accelerate convergence; develops a mixture-of-parts embedding layer to fuse information from part-wise sign tokens.
Result: On CSL-Daily, Phoenix-2014T, and How2Sign, MaDiS outperforms existing methods on DTW error and the new SiBLEU and SiCLIP metrics, and reduces inference latency by nearly 30%.
Conclusion: Through diffusion-based modeling and multi-level joint learning, MaDiS overcomes the limitations of autoregressive SLG models and offers a new paradigm for high-quality, efficient sign language generation.
Abstract: Sign language generation (SLG) aims to translate written texts into expressive sign motions, bridging communication barriers for the Deaf and Hard-of-Hearing communities. Recent studies formulate SLG within the language modeling framework using autoregressive language models, which suffer from unidirectional context modeling and slow token-by-token inference. To address these limitations, we present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional dependencies and supports efficient parallel multi-token generation. We further introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-, and 3D physical-space objectives, leading to richer and more grounded sign representations. To accelerate model convergence in the fine-tuning stage, we design a novel unmasking strategy with temporal checkpoints, reducing the combinatorial complexity of unmasking orders by over $10^{41}$ times. In addition, a mixture-of-parts embedding layer is developed to effectively fuse information stored in different part-wise sign tokens through learnable gates and well-optimized codebooks. Extensive experiments on CSL-Daily, Phoenix-2014T, and How2Sign demonstrate that MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while reducing inference latency by nearly 30%. Code and models will be released on our project page.
[99] QuaMo: Quaternion Motions for Vision-based 3D Human Kinematics Capture
Cuong Le,Pavlo Melnyk,Urs Waldmann,Mårten Wadenbäck,Bastian Wandt
Main category: cs.CV
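TL;DR: This paper proposes QuaMo, which models human kinematics with quaternion differential equations (QDE) and combines a unit-sphere constraint with a meta-PD controller featuring adaptive acceleration enhancement, improving 3D human pose estimation accuracy while keeping motion continuous.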
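To make the ingredients named above concrete, here is a minimal sketch of textbook quaternion kinematics, $\dot{q} = \tfrac{1}{2}\, q \otimes (0, \omega)$, integrated with the exponential map so that the state stays on the unit sphere, and driven by a plain fixed-gain PD law. It is not QuaMo itself: the paper's meta-PD controller and acceleration enhancement are not reproduced, and the function names, gains `kp`/`kd`, and step scheme are assumptions.

```python
import math

def quat_mul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def quat_exp(v):
    """Exponential of the pure quaternion (0, v); the result has unit norm."""
    angle = math.sqrt(sum(c * c for c in v))
    if angle < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(angle) / angle
    return (math.cos(angle), s * v[0], s * v[1], s * v[2])

def rotation_error(q, q_target):
    """Rotation vector (axis * angle) taking q to q_target, in q's body frame."""
    w, x, y, z = quat_mul((q[0], -q[1], -q[2], -q[3]), q_target)
    if w < 0.0:  # q and -q encode the same rotation; take the shorter arc
        w, x, y, z = -w, -x, -y, -z
    norm = math.sqrt(x*x + y*y + z*z)
    if norm < 1e-12:
        return (0.0, 0.0, 0.0)
    scale = 2.0 * math.atan2(norm, w) / norm
    return (scale * x, scale * y, scale * z)

def pd_step(q, omega, q_target, dt, kp=50.0, kd=10.0):
    """One step: PD angular acceleration, velocity update, unit-sphere update."""
    err = rotation_error(q, q_target)
    alpha = tuple(kp * e - kd * w for e, w in zip(err, omega))
    omega = tuple(w + a * dt for w, a in zip(omega, alpha))
    # q_dot = 0.5 * q (x) (0, omega): for constant omega over dt the
    # exponential map integrates this exactly and preserves the unit norm.
    q = quat_mul(q, quat_exp(tuple(0.5 * w * dt for w in omega)))
    return q, omega
```

Multiplying by a unit quaternion at every step is what keeps the estimate on the unit sphere without a separate renormalization pass, in contrast to a naive Euler update of the four components.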
Details
Motivation: Traditional Euler-angle-based motion capture methods suffer from discontinuities that make motion reconstruction unstable; quaternions guarantee continuous pose transitions and suit real-time, online motion estimation.
Method: Proposes a QDE approach built on a quaternion state-space model, with quaternions as state variables and the QDE describing their rate of change; a meta-PD controller computes the angular acceleration, with an adaptive acceleration enhancement mechanism; the QDE is solved on the unit quaternion sphere to satisfy the geometric constraint.
Result: On Human3.6M, Fit3D, SportsPose, and AIST, QuaMo outperforms comparable state-of-the-art methods, with motion estimates free of jitter and discontinuities and with few implausible poses.
Conclusion: With a rigorous quaternion formulation and control enhancement, QuaMo addresses temporal consistency and motion continuity and offers a new paradigm for real-time, robust vision-based 3D human motion capture.
Abstract: Vision-based 3D human motion capture from videos remains a challenge in computer vision. Traditional 3D pose estimation approaches often ignore the temporal consistency between frames, causing implausible and jittery motion. The emerging field of kinematics-based 3D motion capture addresses these issues by estimating the temporal transitioning between poses instead. A major drawback in current kinematics approaches is their reliance on Euler angles. Despite their simplicity, Euler angles suffer from discontinuity that leads to unstable motion reconstructions, especially in online settings where trajectory refinement is unavailable. In contrast, quaternions have no discontinuity and can produce continuous transitions between poses. In this paper, we propose QuaMo, a novel Quaternion Motions method using quaternion differential equations (QDE) for human kinematics capture. We utilize the state-space model, an effective system for describing real-time kinematics estimations, with quaternion state and the QDE describing quaternion velocity. The corresponding angular acceleration is computed from a meta-PD controller with a novel acceleration enhancement that adaptively regulates the control signals as the human quickly changes to a new pose. Unlike previous work, our QDE is solved under the quaternion unit-sphere constraint that results in more accurate estimations. Experimental results show that our novel formulation of the QDE with acceleration enhancement accurately estimates 3D human kinematics with no discontinuity and minimal implausibilities. QuaMo outperforms comparable state-of-the-art methods on multiple datasets, namely Human3.6M, Fit3D, SportsPose and AIST. The code is available at https://github.com/cuongle1206/QuaMo
[100] ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving
Yujin Wang,Yutong Zheng,Wenxian Fan,Tianyi Wang,Hongqing Chu,Daxin Tian,Bingzhao Gao,Jianqiang Wang,Hong Chen
Main category: cs.CV
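TL;DR: This paper proposes ScenePilot-Bench, a large-scale first-person driving benchmark for autonomous driving that evaluates vision-language models (VLMs) on scene understanding, spatial perception, motion planning, and more, with multi-granularity annotations and safety-aware metrics built on real driving data.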
Details
Motivation: Existing VLMs lack a systematic, multi-dimensional evaluation benchmark for safety-critical autonomous driving tasks, in particular one that combines a first-person view, multi-granularity semantics, and safety constraints.
Method: Builds ScenePilot-Bench on ScenePilot-4K, a dataset of 3,847 hours of first-person driving video; designs a four-axis evaluation suite (scene understanding, spatial perception, motion planning, GPT-Score), introduces safety-aware metrics and cross-region generalization settings, and benchmarks mainstream VLMs.
Result: A systematic evaluation of representative VLMs reveals current performance boundaries and key capability gaps in driving reasoning, such as weak risk identification, trajectory prediction, and cross-region robustness.
Conclusion: ScenePilot-Bench provides a comprehensive, extensible, safety-oriented benchmark for evaluating and advancing VLMs in autonomous driving, pushing models toward the reliability and interpretability that real deployment requires.
Abstract: In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos, annotated with multi-granularity information including scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and GPT-Score, with safety-aware metrics and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps for driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.
[101] Localized Latent Editing for Dose-Response Modeling in Botulinum Toxin Injection Planning
Estèphe Arnaud,Mohamed Daoudi,Pierre Guerreschi
Main category: cs.CV
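TL;DR: This paper proposes a StyleGAN2-based localized latent editing framework for simulating the effects of botulinum toxin injections. A region-specific latent axis discovery method models the dose-response relationship to support clinical injection planning, and interactive clinician feedback yields a human-in-the-loop workflow.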
Details
Motivation: Botulinum toxin injection is the gold standard for correcting facial asymmetry and for aesthetic rejuvenation, but dose selection relies on experience and intuition, which often leads to suboptimal outcomes.
Method: Proposes a Region-Specific Latent Axis Discovery method that learns localized muscle relaxation trajectories in StyleGAN2's latent space and correlates them with injected units to build a dose-response prediction model; compares direct metric regression with image-based generative simulation and designs a human-in-the-loop workflow.
Result: Validated on a clinical dataset of N=360 images, the generative model shows moderate-to-strong structural correlations for geometric asymmetry metrics, confirming that it captures the direction of morphological change; biological variability limits absolute precision, but the human-in-the-loop workflow improves practicality.
Conclusion: The framework offers an interpretable, controllable, clinically feasible AI-assisted planning tool for botulinum toxin injection, bridging pathological reconstruction and cosmetic planning.
Abstract: Botulinum toxin (Botox) injections are the gold standard for managing facial asymmetry and aesthetic rejuvenation, yet determining the optimal dosage remains largely intuitive, often leading to suboptimal outcomes. We propose a localized latent editing framework that simulates botulinum toxin injection effects for injection planning through dose-response modeling. Our key contribution is a Region-Specific Latent Axis Discovery method that learns localized muscle relaxation trajectories in StyleGAN2's latent space, enabling precise control over specific facial regions without global side effects. By correlating these localized latent trajectories with injected toxin units, we learn a predictive dose-response model. We rigorously compare two approaches: direct metric regression versus image-based generative simulation on a clinical dataset of N=360 images from 46 patients. On a hold-out test set, our framework demonstrates moderate-to-strong structural correlations for geometric asymmetry metrics, confirming that the generative model correctly captures the direction of morphological changes. While biological variability limits absolute precision, we introduce a hybrid "Human-in-the-Loop" workflow where clinicians interactively refine simulations, bridging the gap between pathological reconstruction and cosmetic planning.
[102] GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
Shentong Mo,Zehua Chen,Jun Zhu
Main category: cs.CV
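TL;DR: This paper proposes the GMS-CAVP framework, which improves video-audio cross-modal modeling through multi-scale video-audio alignment and multi-scale spatial-temporal diffusion pretraining objectives, achieving state-of-the-art performance on both retrieval and generation.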
Details
Motivation: Existing methods such as CAVP fall short in modeling the dense, multi-scale semantic and temporal correspondences between video and audio, and in particular struggle to capture spatial-temporal structure from fine to coarse granularity.
Method: Proposes GMS-CAVP: (1) a multi-scale contrastive learning strategy that models video-audio semantic and temporal relations at different granularities; (2) a diffusion-based generative objective that supports cross-modal translation and synthesis between video and audio, giving a unified discriminative-generative formulation.
Result: Extensive experiments on VGGSound, AudioSet, and Panda70M show that GMS-CAVP outperforms previous methods on both video-audio retrieval and generation.
Conclusion: Combining multi-scale alignment with a diffusion-driven generative objective models video-audio relations more completely and improves understanding and generation together.
Abstract: Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is the insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures, which are underutilized in existing frameworks. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio. This unified discriminative-generative formulation facilitates deeper cross-modal understanding and paves the way for high-fidelity generation. Extensive experiments on VGGSound, AudioSet, and Panda70M demonstrate that GMS-CAVP outperforms previous methods in generation and retrieval.
[103] The role of self-supervised pretraining in differentially private medical image analysis
Soroosh Tayebi Arasteh,Mina Farajiamiri,Mahshad Lotfinia,Behrus Hinrichs-Puladi,Jonas Bienzeisler,Mohamed Alhaskir,Mirabela Rusu,Christiane Kuhl,Sven Nebelung,Daniel Truhn
Main category: cs.CV
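TL;DR: This paper systematically evaluates how the initialization strategy affects medical image analysis under differential privacy (DP). Domain-specific supervised pretraining (on MIMIC-CXR) performs best, self-supervised DINOv3 initialization beats ImageNet initialization, and the choice of initialization markedly affects fairness, generalization, and robustness.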
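For readers unfamiliar with DP-SGD, the training procedure this entry relies on, the sketch below shows the standard per-sample clipping and Gaussian noise step on plain Python lists. It is a generic illustration, not the paper's training code; `clip_norm` and `noise_multiplier` are example parameter names, and privacy accounting (tracking the spent budget) is omitted.

```python
import math
import random

def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Return the noisy mean gradient for one DP-SGD step.

    Each per-sample gradient is rescaled to have L2 norm at most `clip_norm`,
    the clipped gradients are summed, Gaussian noise with standard deviation
    `noise_multiplier * clip_norm` is added, and the result is averaged.
    """
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for i, x in enumerate(grad):
            summed[i] += x * scale
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]
```

Because clipping and noise distort the gradient signal, a starting point that already encodes useful features needs fewer and smaller updates, which is one plausible reading of why initialization matters so much in the results above.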
Details
Motivation: Differential privacy provides formal data protection but often causes a large drop in diagnostic performance. Model initialization has been shown to be a key mitigating factor, yet the role of modern self-supervised learning under full-model DP remains unclear.
Method: On chest radiograph classification (more than 800,000 images), state-of-the-art ConvNeXt models are trained with DP-SGD under realistic privacy budgets, comparing three initializations: supervised ImageNet, self-supervised DINOv3, and in-domain supervised MIMIC-CXR. Performance, fairness, generalization, and robustness are evaluated on five external datasets.
Result: DINOv3 initialization consistently outperforms ImageNet initialization under DP but remains inferior to in-domain supervised pretraining (MIMIC-CXR), which comes closest to the non-private baselines. The choice of initialization strongly affects demographic fairness, cross-dataset generalization, and robustness under limited data and model capacity.
Conclusion: Initialization strategy is a central determinant of utility, fairness, and generalization in differentially private medical image analysis, and in-domain supervised pretraining currently remains the best choice.
Abstract: Differential privacy (DP) provides formal protection for sensitive data but typically incurs substantial losses in diagnostic performance. Model initialization has emerged as a critical factor in mitigating this degradation, yet the role of modern self-supervised learning under full-model DP remains poorly understood. Here, we present a large-scale evaluation of initialization strategies for differentially private medical image analysis, using chest radiograph classification as a representative benchmark with more than 800,000 images. Using state-of-the-art ConvNeXt models trained with DP-SGD across realistic privacy regimes, we compare non-domain-specific supervised ImageNet initialization, non-domain-specific self-supervised DINOv3 initialization, and domain-specific supervised pretraining on MIMIC-CXR, the largest publicly available chest radiograph dataset. Evaluations are conducted across five external datasets spanning diverse institutions and acquisition settings. We show that DINOv3 initialization consistently improves diagnostic utility relative to ImageNet initialization under DP, but remains inferior to domain-specific supervised pretraining, which achieves performance closest to non-private baselines. We further demonstrate that initialization choice strongly influences demographic fairness, cross-dataset generalization, and robustness to data scale and model capacity under privacy constraints. The results establish initialization strategy as a central determinant of utility, fairness, and generalization in differentially private medical imaging.
[104] Towards Governance-Oriented Low-Altitude Intelligence: A Management-Centric Multi-Modal Benchmark With Implicitly Coordinated Vision-Language Reasoning Framework
Hao Chang,Zhihui Wang,Lingxiang Wu,Peijin Wang,Wenhui Diao,Jinqiao Wang
Main category: cs.CV
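TL;DR: This paper introduces GovLA-10K, the first management-oriented low-altitude multi-modal benchmark for urban governance, and GovLA-Reasoner, a unified vision-language reasoning framework, to support understanding of and response to anomalous events in city management.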
Details
Motivation: Existing object-centric perception paradigms and loosely coupled vision-language pipelines struggle to meet the management-oriented anomaly understanding required in real-world urban governance.
Method: GovLA-10K is built around functionally salient targets rather than all visible objects and provides actionable management suggestions. GovLA-Reasoner uses an efficient feature adapter to implicitly coordinate discriminative representation sharing between the visual detector and the large language model.
Result: Experiments show that the method significantly improves performance without fine-tuning any task-specific component.
Conclusion: The work offers a new perspective and a foundation for management-aware low-altitude vision-language systems.
Abstract: Low-altitude vision systems are becoming critical infrastructure for smart city governance. However, existing object-centric perception paradigms and loosely coupled vision-language pipelines still struggle to support the management-oriented anomaly understanding required in real-world urban governance. To bridge this gap, we introduce GovLA-10K, the first management-oriented multi-modal benchmark for low-altitude intelligence, along with GovLA-Reasoner, a unified vision-language reasoning framework tailored for governance-aware aerial perception. Unlike existing studies that aim to exhaustively annotate all visible objects, GovLA-10K is deliberately designed around functionally salient targets that directly correspond to practical management needs, and further provides actionable management suggestions grounded in these observations. To effectively coordinate fine-grained visual grounding with high-level contextual language reasoning, GovLA-Reasoner introduces an efficient feature adapter that implicitly coordinates discriminative representation sharing between the visual detector and the large language model (LLM). Extensive experiments show that our method significantly improves performance while avoiding the need to fine-tune any task-specific component. We believe our work offers a new perspective and foundation for future studies on management-aware low-altitude vision-language systems.
[105] KeepLoRA: Continual Learning with Residual Gradient Adaptation
Mao-Lin Luo,Zi-Hao Zhou,Yi-Lin Zhang,Yuanyu Wan,Tong Wei,Min-Ling Zhang
Main category: cs.CV
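TL;DR: This paper proposes KeepLoRA, which restricts LoRA parameter updates to the residual subspace during continual learning of pre-trained vision-language models so that updates do not interfere with previously learned capabilities, balancing retention of pre-trained knowledge, memory of learned tasks, and acquisition of new knowledge.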
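The core operation described in this entry, removing from a gradient every component that lies in a protected subspace, can be sketched as follows. This is a vector-valued toy version under stated assumptions: the real method operates on LoRA weight matrices, and how the protected directions are obtained (principal directions of the pre-trained weights and of earlier task features) is not reproduced here. All names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(vectors, eps=1e-10):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        n = math.sqrt(dot(w, w))
        if n > eps:  # skip directions already covered by the basis
            basis.append([wi / n for wi in w])
    return basis

def project_to_residual(grad, protected_directions):
    """Project `grad` onto the orthogonal complement of the protected span."""
    out = list(grad)
    for u in orthonormalize(protected_directions):
        c = dot(out, u)
        out = [o - c * ui for o, ui in zip(out, u)]
    return out
```

An update built from the projected gradient has zero inner product with every protected direction, so to first order it leaves the behavior encoded along those directions unchanged.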
Details
Motivation: Continual learning for pre-trained vision-language models must simultaneously retain pre-trained knowledge, preserve knowledge of previously learned tasks, and keep the plasticity to learn new tasks; these three goals conflict.
Method: Based on the finding that general knowledge is mainly encoded in the principal subspace and task-specific knowledge in the residual subspace, KeepLoRA performs LoRA updates by projecting the new task's gradient onto a subspace orthogonal to both the pre-trained principal subspace and the dominant feature directions of previous tasks.
Result: Theoretical and empirical analyses confirm that KeepLoRA balances the three objectives and achieves state-of-the-art performance on multiple benchmarks.
Conclusion: KeepLoRA is a simple yet effective continual learning method that decouples and jointly optimizes knowledge through subspace constraints, offering a new direction for continual learning of vision-language models.
Abstract: Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents a simple but effective approach called KeepLoRA to balance these objectives. We first analyze the knowledge retention mechanism within the model parameter space and find that general knowledge is mainly encoded in the principal subspace, while task-specific knowledge is encoded in the residual subspace. Motivated by this finding, KeepLoRA learns new tasks by restricting LoRA parameter updates to the residual subspace to prevent interfering with previously learned capabilities. Specifically, we infuse knowledge for a new task by projecting its gradient onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features. Our theoretical and empirical analyses confirm that KeepLoRA balances the three objectives and achieves state-of-the-art performance. The implementation code is available at https://github.com/MaolinLuo/KeepLoRA.
[106] A new Image Similarity Metric for a Perceptual and Transparent Geometric and Chromatic Assessment
Antonio Di Marino,Vincenzo Bevilacqua,Emanuel Di Nardo,Angelo Ciaramella,Ivanoe De Falco,Giovanna Sannino
Main category: cs.CV
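TL;DR: This paper proposes a new perceptual image similarity metric that combines a texture term (using Earth Mover's Distance) with a chromatic term (computed in the Oklab color space). It outperforms existing methods on images with complex shape and color distortions and provides interpretable visual evidence for its scores.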
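The two ingredients named in this entry can be illustrated with standard building blocks. The sketch below computes the Earth Mover's Distance between two 1D histograms and the Euclidean distance between two colors in Oklab, using the published sRGB-to-Oklab conversion. How the paper extracts texture descriptors, and how it weights and combines the two terms, is not stated here, so this is only an assumed minimal version of each term.

```python
import math

def emd_1d(hist_p, hist_q):
    """EMD between two 1D histograms over the same unit-spaced bins.

    In 1D the optimal transport cost equals the summed absolute difference
    of the two cumulative distributions. Histograms are normalized first.
    """
    sp, sq = sum(hist_p), sum(hist_q)
    cum, total = 0.0, 0.0
    for p, q in zip(hist_p, hist_q):
        cum += p / sp - q / sq
        total += abs(cum)
    return total

def srgb_to_oklab(r, g, b):
    """Convert sRGB components in [0, 1] to Oklab (L, a, b)."""
    def linearize(c):
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = linearize(r), linearize(g), linearize(b)
    l = 0.4122214708 * r + 0.5363325363 * g + 0.0514459929 * b
    m = 0.2119034982 * r + 0.6806995451 * g + 0.1073969566 * b
    s = 0.0883024619 * r + 0.2817188376 * g + 0.6299787005 * b
    l, m, s = l ** (1 / 3), m ** (1 / 3), s ** (1 / 3)
    return (
        0.2104542553 * l + 0.7936177850 * m - 0.0040720468 * s,
        1.9779984951 * l - 2.4285922050 * m + 0.4505937099 * s,
        0.0259040371 * l + 0.7827717662 * m - 0.8086757660 * s,
    )

def oklab_distance(rgb1, rgb2):
    """Euclidean distance between two sRGB colors measured in Oklab."""
    return math.dist(srgb_to_oklab(*rgb1), srgb_to_oklab(*rgb2))
```

Both terms are directly inspectable: the EMD says how much histogram mass had to move and how far, and the Oklab distance is a per-color difference, which fits the transparency goal the entry emphasizes.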
Details
Motivation: Existing image similarity metrics are not perceptually oriented and perform poorly especially under texture distortion, and black-box deep models lack interpretability.
Method: A two-term perceptual metric: (1) Earth Mover's Distance measures the texture difference between two images; (2) chromatic difference is computed in the Oklab perceptual color space. The metric is evaluated on the Berkeley-Adobe Perceptual Patch Similarity dataset.
Result: The proposed metric clearly outperforms the state of the art on images with shape distortions, supporting its stronger perceptual consistency. It also produces visual explanations that make the assessment more transparent.
Conclusion: The two-component perceptual metric combines strong performance with interpretability, offering an approach to image similarity assessment that is closer to human perception and easier to trust.
Abstract: In the literature, several studies have shown that state-of-the-art image similarity metrics are not perceptual metrics; moreover, they have difficulty evaluating images, especially when texture distortion is also present. In this work, we propose a new perceptual metric composed of two terms. The first term evaluates the dissimilarity between the textures of two images using Earth Mover's Distance. The second term evaluates the chromatic dissimilarity between two images in the Oklab perceptual color space. We evaluated the performance of our metric on a non-traditional dataset, called Berkeley-Adobe Perceptual Patch Similarity, which contains a wide range of complex distortions in shapes and colors. We have shown that our metric outperforms the state of the art, especially when images contain shape distortions, confirming also its greater perceptiveness. Furthermore, although deep black-box metrics could be very accurate, they only provide similarity scores between two images, without explaining their main differences and similarities. Our metric, on the other hand, provides visual explanations to support the calculated score, making the similarity assessment transparent and justified.
[107] SharpNet: Enhancing MLPs to Represent Functions with Controlled Non-differentiability
Hanting Niu,Junkai Deng,Fei Hou,Wencheng Wang,Ying He
Main category: cs.CV
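TL;DR: This paper proposes SharpNet, a modified MLP architecture that adds an auxiliary feature function, defined as the solution of a Poisson equation with jump Neumann boundary conditions, to accurately model functions with user-specified $C^0$ sharp features (such as edges and corners). It supports end-to-end joint optimization of feature locations and network parameters, and the output is only $C^0$ at the features and smooth elsewhere.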
Details
Motivation: Standard MLP outputs are globally smooth, so they cannot accurately represent functions that are continuous overall but have prescribed $C^0$ sharp features (gradient discontinuities). Existing approaches often need post-processing or smooth out the key discontinuities.
Method: SharpNet embeds in the MLP an auxiliary feature function defined by a Poisson equation with jump Neumann boundary conditions. The function is evaluated through an efficient, differentiable local integral, which supports joint end-to-end optimization of the feature locations and the MLP parameters. The construction guarantees that the output is only $C^0$ at the specified feature locations and smooth elsewhere.
Result: On 2D function fitting and 3D CAD model reconstruction, SharpNet recovers sharp edges and corners with high accuracy while staying smooth in non-feature regions. Compared with state-of-the-art baselines (such as SIREN and Fourier Features), it markedly reduces the smoothing of gradient discontinuities, as supported by both qualitative and quantitative results.
Conclusion: SharpNet offers a differentiable, controllable, post-processing-free approach to implicit function modeling, and is presented as the first explicit, jointly learnable encoding of $C^0$ sharp geometric features within the MLP framework, with applications in geometric reconstruction and physical modeling.
Abstract: Multi-layer perceptrons (MLPs) are a standard tool for learning and function approximation, but they inherently yield outputs that are globally smooth. As a result, they struggle to represent functions that are continuous yet deliberately non-differentiable (i.e., with prescribed $C^0$ sharp features) without relying on ad hoc post-processing. We present SharpNet, a modified MLP architecture capable of encoding functions with user-defined sharp features by enriching the network with an auxiliary feature function, which is defined as the solution to a Poisson equation with jump Neumann boundary conditions. It is evaluated via an efficient local integral that is fully differentiable with respect to the feature locations, enabling our method to jointly optimize both the feature locations and the MLP parameters to recover the target functions/models. The $C^0$-continuity of SharpNet is precisely controllable, ensuring $C^0$-continuity at the feature locations and smoothness elsewhere. We validate SharpNet on 2D problems and 3D CAD model reconstruction, and compare it against several state-of-the-art baselines. In both types of tasks, SharpNet accurately recovers sharp edges and corners while maintaining smooth behavior away from those features, whereas existing methods tend to smooth out gradient discontinuities. Both qualitative and quantitative evaluations highlight the benefits of our approach.
[108] Video-KTR: Reinforcing Video Reasoning via Key Token Attribution
Ziyue Wang,Sheng Jin,Zhongrong Zuo,Jiawei Wu,Han Qiu,Qi She,Hao Zhang,Xudong Jiang
Main category: cs.CV
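TL;DR: This paper proposes Video-KTR, a modality-aware policy shaping framework for video reasoning. It combines three attribution signals (visual awareness, temporal sensitivity, and predictive uncertainty) to apply reinforcement learning selectively at the token level, improving the accuracy and interpretability of multimodal large language models on video understanding and reasoning tasks.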
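One of the three signals in this entry, predictive uncertainty, is easy to illustrate: compute the entropy of each token's output distribution, keep the highest-entropy tokens, and let only those contribute to the policy-gradient loss. The sketch below covers only that signal; the counterfactual-masking and frame-shuffling signals, the way the paper combines them, and its exact RL objective are not reproduced. `keep_ratio` and the function names are assumptions for this example.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one token's output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_mask(token_probs, keep_ratio):
    """Return a 0/1 mask selecting the top `keep_ratio` fraction of tokens by entropy."""
    ents = [entropy(p) for p in token_probs]
    k = max(1, round(keep_ratio * len(ents)))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    top = set(ranked[:k])
    return [1.0 if i in top else 0.0 for i in range(len(ents))]

def masked_policy_loss(log_probs, advantages, mask):
    """Policy-gradient loss averaged over the selected tokens only."""
    kept = sum(mask)
    weighted = sum(m * a * lp for m, a, lp in zip(mask, advantages, log_probs))
    return -weighted / kept
```

Tokens with a mask value of zero receive no gradient from this loss, which is the sense in which low-value tokens are filtered out.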
Details
Motivation: Existing video reasoning methods rely on coarse sequence-level rewards or single-factor token selection and neglect the fine-grained links among visual inputs, temporal dynamics, and language outputs, which limits accuracy and interpretability.
Method: Video-KTR identifies visual-aware tokens via counterfactual masking, detects temporal-aware tokens via frame shuffling, locates predictively uncertain tokens via high entropy, and applies token-level RL updates only to these three kinds of key tokens.
Result: The method reaches state-of-the-art or highly competitive results on five challenging benchmarks, for example 42.7% on Video-Holmes, surpassing GPT-4o. Ablations show that the three attribution signals are complementary and that the targeted updates are robust.
Conclusion: Through modality-aware, token-level policy shaping, Video-KTR improves video reasoning performance and interpretability and serves as a simple, drop-in RL extension.
Abstract: Reinforcement learning (RL) has shown strong potential for enhancing reasoning in multimodal large language models, yet existing video reasoning methods often rely on coarse sequence-level rewards or single-factor token selection, neglecting fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, limiting both accuracy and interpretability. We propose Video-KTR, a modality-aware policy shaping framework that performs selective, token-level RL by combining three attribution signals: (1) visual-aware tokens identified via counterfactual masking to reveal perceptual dependence; (2) temporal-aware tokens detected through frame shuffling to expose temporal sensitivity; and (3) high-entropy tokens signaling predictive uncertainty. By reinforcing only these key tokens, Video-KTR focuses learning on semantically informative, modality-sensitive content while filtering out low-value tokens. Across five challenging benchmarks, Video-KTR achieves state-of-the-art or highly competitive results, reaching 42.7% on Video-Holmes (surpassing GPT-4o) with consistent gains on both reasoning and general video understanding tasks. Ablation studies verify the complementary roles of the attribution signals and the robustness of targeted token-level updates. Overall, Video-KTR improves accuracy and interpretability, offering a simple, drop-in extension to RL for complex video reasoning. Our code and models are available at https://github.com/zywang0104/Video-KTR.
[109] DSVM-UNet: Enhancing VM-UNet with Dual Self-distillation for Medical Image Segmentation
Renrong Shao,Dongyang Li,Dong Xia,Lin Shao,Jiangdong Lu,Fen Zheng,Lulu Zhang
Main category: cs.CV
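TL;DR: This paper proposes a dual self-distillation method (DSVM-UNet) that improves VM-UNet for medical image segmentation without complex architectural design, achieving efficient and accurate segmentation by aligning global and local features.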
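Self-distillation generally means that one part of a network (often a deeper layer) acts as the teacher for another part of the same network. The sketch below shows the generic temperature-scaled KL distillation term on logits. It is an assumed stand-in: this entry says that DSVM-UNet aligns features at global and local levels, but the exact losses are not given here, and the temperature value is only an example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distill_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2
```

In a self-distillation setup the teacher logits would come from the same model, for example from its final decoder stage, with gradients stopped on the teacher side; no additional network or parameters are required.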
Details
Motivation: Existing Vision Mamba-based medical image segmentation methods (such as VM-UNet) rely on complex architectures to improve semantic perception, which adds computational burden. This paper aims to improve performance in a simpler way.
Method: A dual self-distillation mechanism applies self-supervised knowledge distillation to both global and local features within VM-UNet, without extra parameters or complex modules.
Result: The method achieves state-of-the-art performance on the ISIC2017, ISIC2018, and Synapse datasets while remaining computationally efficient.
Conclusion: Dual self-distillation is a lightweight, effective, and general improvement strategy that clearly improves Vision Mamba on medical image segmentation.
Abstract: Vision Mamba models have been extensively studied in various fields; they address the limitations of previous models by handling long-range dependencies with linear-time overhead. Several prospective studies have further designed Vision Mamba based on UNet (VM-UNet) for medical image segmentation. These approaches primarily focus on optimizing architectural designs by creating more complex structures to enhance the model's ability to perceive semantic features. In this paper, we propose a simple yet effective approach to improve the model by Dual Self-distillation for VM-UNet (DSVM-UNet) without any complex architectural designs. To achieve this goal, we develop dual self-distillation methods to align the features at both the global and local levels. Extensive experiments conducted on the ISIC2017, ISIC2018, and Synapse benchmarks demonstrate that our approach achieves state-of-the-art performance while maintaining computational efficiency. Code is available at https://github.com/RoryShao/DSVM-UNet.git.
[110] Self-Supervised Weight Templates for Scalable Vision Model Initialization
Yucheng Xie,Fu Feng,Ruixiao Shi,Jing Wang,Yong Rui,Xin Geng
Main category: cs.CV
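TL;DR: SWEET is a self-supervised, constraint-based pre-training framework that learns a shared weight template and size-specific weight scalers through Tucker factorization, supports flexible initialization of vision models of different sizes, and introduces width-wise stochastic scaling to improve cross-width generalization.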
Details
Motivation: Model parameter scale and complexity keep growing, but deployment often requires architectures of different sizes, which conventional pre-training and fine-tuning handle poorly.
Method: SWEET learns a shared weight template and size-specific weight scalers under Tucker factorization. Width-wise stochastic scaling regularizes the template to strengthen width-invariant representations. Target models are initialized by composing the template with lightweight scalers.
Result: On classification, detection, segmentation, and generation tasks, SWEET achieves state-of-the-art performance for initializing variable-sized vision models.
Conclusion: Through a modular, scalable initialization mechanism, SWEET addresses pre-training adaptation for vision models of different sizes and improves generalization across tasks and sizes.
Abstract: The increasing scale and complexity of modern model parameters underscore the importance of pre-trained models. However, deployment often demands architectures of varying sizes, exposing limitations of conventional pre-training and fine-tuning. To address this, we propose SWEET, a self-supervised framework that performs constraint-based pre-training to enable scalable initialization in vision tasks. Instead of pre-training a fixed-size model, we learn a shared weight template and size-specific weight scalers under Tucker-based factorization, which promotes modularity and supports flexible adaptation to architectures with varying depths and widths. Target models are subsequently initialized by composing and reweighting the template through lightweight weight scalers, whose parameters can be efficiently learned from minimal training data. To further enhance flexibility in width expansion, we introduce width-wise stochastic scaling, which regularizes the template along width-related dimensions and encourages robust, width-invariant representations for improved cross-width generalization. Extensive experiments on classification, detection, segmentation, and generation tasks demonstrate the state-of-the-art performance of SWEET for initializing variable-sized vision models.
[111] DiffStyle3D: Consistent 3D Gaussian Stylization via Attention Optimization
Yitong Yang,Xuexin Liu,Yinglin Wang,Jing Wang,Hao Dou,Changshuo Wang,Shuting He
Main category: cs.CV
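TL;DR: This paper proposes DiffStyle3D, a new diffusion-based paradigm for 3D Gaussian Splatting (3DGS) style transfer. It optimizes directly in latent space and introduces an attention-aware loss and a geometry-guided multi-view consistency method, improving both stylization quality and multi-view consistency.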
Details
Motivation: Existing VGG- and CLIP-based methods struggle to model intrinsic multi-view consistency, while diffusion-based methods capture such consistency but depend on denoising directions, which makes training unstable.
Method: DiffStyle3D (1) optimizes directly in latent space; (2) uses an Attention-Aware Loss that aligns style and content features in the self-attention space; (3) proposes Geometry-Guided Multi-View Consistency, which injects geometric information into self-attention to model cross-view correspondence; and (4) builds a geometry-aware mask to suppress redundant optimization in overlapping regions.
Result: Experiments on multiple datasets show that DiffStyle3D outperforms state-of-the-art methods in stylization quality and visual realism.
Conclusion: By combining attention mechanisms, geometric priors, and diffusion modeling, DiffStyle3D addresses multi-view inconsistency and unstable training in 3D style transfer and offers a new direction for high-quality 3D content generation.
Abstract: 3D style transfer enables the creation of visually expressive 3D content, enriching the visual appearance of 3D scenes and objects. However, existing VGG- and CLIP-based methods struggle to model multi-view consistency within the model itself, while diffusion-based approaches can capture such consistency but rely on denoising directions, leading to unstable training. To address these limitations, we propose DiffStyle3D, a novel diffusion-based paradigm for 3DGS style transfer that directly optimizes in the latent space. Specifically, we introduce an Attention-Aware Loss that performs style transfer by aligning style features in the self-attention space, while preserving original content through content feature alignment. Inspired by the geometric invariance of 3D stylization, we propose a Geometry-Guided Multi-View Consistency method that integrates geometric information into self-attention to enable cross-view correspondence modeling. Based on geometric information, we additionally construct a geometry-aware mask to prevent redundant optimization in overlapping regions across views, which further improves multi-view consistency. Extensive experiments show that DiffStyle3D outperforms state-of-the-art methods, achieving higher stylization quality and visual realism.
[112] WaterClear-GS: Optical-Aware Gaussian Splatting for Underwater Reconstruction and Restoration
Xinrui Zhang,Yufeng Wang,Shuangkang Fang,Zesheng Wang,Dacheng Qi,Wenrui Ding
Main category: cs.CV
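TL;DR: This paper proposes WaterClear-GS, the first pure 3D Gaussian Splatting (3DGS) framework that explicitly models underwater optical properties (attenuation and scattering) without an auxiliary network, enabling efficient novel view synthesis and underwater image restoration with real-time rendering.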
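To make "attenuation and scattering" concrete, the sketch below uses a widely used simplified underwater image formation model, in which the observed color is an attenuated direct signal plus depth-dependent backscatter, and then inverts it. This is an assumption chosen for illustration, not the formulation WaterClear-GS attaches to its Gaussian primitives; the per-channel coefficients and all names are hypothetical.

```python
import math

def underwater_pixel(clear_rgb, depth, beta_atten, beta_scatter, backscatter_rgb):
    """Simulate the observed color of a surface seen through `depth` units of water.

    Per channel: I = J * exp(-beta_atten * d) + B_inf * (1 - exp(-beta_scatter * d))
    """
    observed = []
    for j, ba, bs, b_inf in zip(clear_rgb, beta_atten, beta_scatter, backscatter_rgb):
        direct = j * math.exp(-ba * depth)             # attenuated scene radiance
        veil = b_inf * (1.0 - math.exp(-bs * depth))   # backscattered ambient light
        observed.append(direct + veil)
    return observed

def restore_pixel(observed_rgb, depth, beta_atten, beta_scatter, backscatter_rgb):
    """Invert the model above to recover the water-free color."""
    restored = []
    for i, ba, bs, b_inf in zip(observed_rgb, beta_atten, beta_scatter, backscatter_rgb):
        veil = b_inf * (1.0 - math.exp(-bs * depth))
        restored.append((i - veil) * math.exp(ba * depth))
    return restored
```

Larger coefficients for the red channel than for blue reproduce the familiar blue-green cast of distant underwater objects, which is the wavelength dependence the entry refers to.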
Details
Motivation: Underwater 3D reconstruction and appearance restoration are limited by complex optical properties of water such as wavelength-dependent attenuation and scattering. Existing NeRF-based methods render slowly and restore color poorly, and 3DGS by itself cannot easily model volumetric scattering.
Method: WaterClear-GS embeds physical models of local attenuation and scattering into Gaussian primitives and uses a dual-branch optimization strategy combined with depth-guided geometry regularization, a perception-driven image loss, exposure constraints, spatially adaptive regularization, and physically guided spectral regularization.
Result: On standard benchmarks and a newly collected dataset, WaterClear-GS achieves the best performance on both novel view synthesis (NVS) and underwater image restoration (UIR) while supporting real-time rendering.
Conclusion: WaterClear-GS is the first pure 3DGS framework to explicitly model underwater optical physics, improving reconstruction quality and appearance fidelity for underwater scenes while keeping efficient real-time rendering.
Abstract: Underwater 3D reconstruction and appearance restoration are hindered by the complex optical properties of water, such as wavelength-dependent attenuation and scattering. Existing Neural Radiance Fields (NeRF)-based methods struggle with slow rendering speeds and suboptimal color restoration, while 3D Gaussian Splatting (3DGS) inherently lacks the capability to model complex volumetric scattering effects. To address these issues, we introduce WaterClear-GS, the first pure 3DGS-based framework that explicitly integrates underwater optical properties of local attenuation and scattering into Gaussian primitives, eliminating the need for an auxiliary medium network. Our method employs a dual-branch optimization strategy to ensure underwater photometric consistency while naturally recovering water-free appearances. This strategy is enhanced by depth-guided geometry regularization and perception-driven image loss, together with exposure constraints, spatially-adaptive regularization, and physically guided spectral regularization, which collectively enforce local 3D coherence and maintain natural visual perception. Experiments on standard benchmarks and our newly collected dataset demonstrate that WaterClear-GS achieves outstanding performance on both novel view synthesis (NVS) and underwater image restoration (UIR) tasks, while maintaining real-time rendering. The code will be available at https://buaaxrzhang.github.io/WaterClear-GS/.
[113] PaW-ViT: A Patch-based Warping Vision Transformer for Robust Ear Verification
Deeksha Arun,Kevin W. Bowyer,Patrick Flynn
Main category: cs.CV
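TL;DR: This paper proposes PaW-ViT, an anatomy-informed preprocessing method that normalizes ear images and aligns token boundaries with ear feature boundaries, improving the robustness and performance of ViTs on ear biometric recognition.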
Details
Motivation: The rectangular tokens used by standard ViTs bring in irrelevant information from outside the target object, and ViTs are position-sensitive, while ear morphology varies in shape, size, and pose, which degrades performance.
Method: The Patch-based Warping Vision Transformer (PaW-ViT) uses anatomical priors to normalize ear images and aligns token boundaries to detected ear feature boundaries and to the natural curvature of the ear.
Result: The approach is effective across several ViT models (ViT-T, ViT-S, ViT-B, ViT-L) and improves robustness to variation in ear shape, size, and pose.
Conclusion: PaW-ViT bridges the gap between the morphological diversity of ear biometrics and the positional sensitivity of transformers, offering a possible avenue for ear-based authentication.
Abstract: The rectangular tokens common to vision transformer methods for visual recognition can strongly affect performance of these methods due to incorporation of information outside the objects to be recognized. This paper introduces PaW-ViT, Patch-based Warping Vision Transformer, a preprocessing approach rooted in anatomical knowledge that normalizes ear images to enhance the efficacy of ViT. By accurately aligning token boundaries to detected ear feature boundaries, PaW-ViT obtains greater robustness to shape, size, and pose variation. By aligning feature boundaries to natural ear curvature, it produces more consistent token representations for various morphologies. Experiments confirm the effectiveness of PaW-ViT on various ViT models (ViT-T, ViT-S, ViT-B, ViT-L) and yield reasonable alignment robustness to variation in shape, size, and pose. Our work aims to solve the disconnect between ear biometric morphological variation and transformer architecture positional sensitivity, presenting a possible avenue for authentication schemes.
[114] GeoDiff3D: Self-Supervised 3D Scene Generation with Geometry-Constrained 2D Diffusion Guidance
Haozhi Zhu,Miaomiao Zhao,Dingyao Liu,Runze Tian,Yan Zhang,Jie Guo,Fenggen Yu
Main category: cs.CV
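TL;DR: This paper proposes GeoDiff3D, an efficient self-supervised 3D scene generation framework that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to produce texture-rich reference images, generating high-quality, consistent 3D scenes under weak supervision or even without labeled data.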
Details
Motivation: Existing 3D scene generation methods are limited by weak structural modeling and heavy reliance on labeled data, which leads to structural artifacts, geometric inconsistency, and degraded high-frequency detail, and makes it hard to meet the demand for rapid iteration and high fidelity.
Method: GeoDiff3D (1) uses coarse geometry as a structural anchor; (2) introduces a geometry-constrained 2D diffusion model to generate texture references; (3) applies voxel-aligned 3D feature aggregation; (4) designs a dual self-supervision mechanism; and (5) supports training and inference at low computational cost.
Result: On complex scenes, GeoDiff3D outperforms existing baselines in generation quality, structural consistency, and generalization while greatly reducing dependence on labeled data.
Conclusion: GeoDiff3D offers a practical approach to accessible, efficient 3D scene construction that balances structural robustness, texture richness, and training cost.
Abstract: 3D scene generation is a core technology for gaming, film/VFX, and VR/AR. Growing demand for rapid iteration, high-fidelity detail, and accessible content creation has further increased interest in this area. Existing methods broadly follow two paradigms, indirect 2D-to-3D reconstruction and direct 3D generation, but both are limited by weak structural modeling and heavy reliance on large-scale ground-truth supervision, often producing structural artifacts, geometric inconsistencies, and degraded high-frequency details in complex scenes. We propose GeoDiff3D, an efficient self-supervised framework that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to provide texture-rich reference images. Importantly, GeoDiff3D does not require strict multi-view consistency of the diffusion-generated references and remains robust to the resulting noisy, inconsistent guidance. We further introduce voxel-aligned 3D feature aggregation and dual self-supervision to maintain scene coherence and fine details while substantially reducing dependence on labeled data. GeoDiff3D also trains with low computational cost and enables fast, high-quality 3D scene generation. Extensive experiments on challenging scenes show improved generalization and generation quality over existing baselines, offering a practical solution for accessible and efficient 3D scene construction.
[115] Diffusion for De-Occlusion: Accessory-Aware Diffusion Inpainting for Robust Ear Biometric Recognition
Deeksha Arun,Kevin W. Bowyer,Patrick Flynn
Main category: cs.CV
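TL;DR: This paper proposes a diffusion-based ear image inpainting method that removes accessory occlusions as a preprocessing step, improving the performance of transformer-based ear biometric recognition systems.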
Details
Motivation: Ear occlusions caused by accessories (such as earrings and earphones) can markedly degrade ear biometric recognition in unconstrained settings, so an effective mitigation is needed.
Method: A diffusion model, combined with an automatically extracted accessory mask, performs structure-preserving pixel-level inpainting of the occluded ear regions, with emphasis on keeping the geometry of key anatomical structures (helix, antihelix, concha, and lobule) coherent.
Result: Experiments across several Vision Transformer models, different patch sizes, and multiple benchmark datasets show that diffusion-based inpainting as preprocessing can alleviate accessory occlusion and improve overall recognition performance.
Conclusion: Diffusion-driven ear image inpainting is an effective preprocessing step that can improve the robustness and accuracy of ear biometric recognition in real-world conditions.
Abstract: Ear occlusions (arising from the presence of ear accessories such as earrings and earphones) can negatively impact performance in ear-based biometric recognition systems, especially in unconstrained imaging circumstances. In this study, we assess the effectiveness of a diffusion-based ear inpainting technique as a pre-processing aid to mitigate the issues of ear accessory occlusions in transformer-based ear recognition systems. Given an input ear image and an automatically derived accessory mask, the inpainting model reconstructs clean and anatomically plausible ear regions by synthesizing missing pixels while preserving local geometric coherence along key ear structures, including the helix, antihelix, concha, and lobule. We evaluate the effectiveness of this pre-processing aid in transformer-based recognition systems for several vision transformer models and different patch sizes for a range of benchmark datasets. Experiments show that diffusion-based inpainting can be a useful pre-processing aid to alleviate ear accessory occlusions to improve overall recognition performance.
[116] Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
Zhixiang Wei,Yi Li,Zhehan Kan,Xinghua Jiang,Zuwei Long,Shifeng Liu,Hongze Shen,Wei Liu,Xiaoyu Tan,Haojia Lin,Yubo Zhu,Qianyu Li,Di Yin,Haoyu Cao,Weibo Gu,Xin Li,Yinsong Liu,Deqiang Jiang,Xing Sun,Yunsheng Wu,Mingkong Tang,Shuangyin Liu,Lexiang Tang,Haodong Lin,Junru Lu,Jiarui Qin,Lingfeng Qiao,Ruizhi Qiao,Bo Ke,Jianfeng He,Ke Li,Yangning Li,Yunhang Shen,Mengdan Zhang,Peixian Chen,Kun Yin,Bing Liu,Yunfei Wu,Huang Chen,Zhongpeng Cai,Xiaotian Li
Main category: cs.CV
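TL;DR: This paper proposes the Youtu-VL framework, which uses the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm to turn visual signals from "inputs" into "prediction targets", improving retention of fine-grained visual information and supporting both general multimodal and vision-centric tasks.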
Details
Motivation: Existing vision-language models (VLMs) have a text-dominant optimization bias in training: visual signals are treated only as passive conditional inputs rather than supervisory targets, which causes loss of fine-grained visual information and coarse-grained understanding.
Method: The Youtu-VL framework and the VLUAS paradigm place visual tokens directly in the autoregressive prediction stream, applying unified autoregressive supervision to both visual details and linguistic content. The paradigm is extended to support vision-centric tasks without additional modules.
Result: The model achieves competitive performance on both general multimodal tasks and vision-centric tasks.
Conclusion: Youtu-VL provides a solid foundation for building comprehensive generalist visual agents and mitigates the degradation of visual information in VLMs.
Abstract: Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
[117] Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering
Kun Li,Michael Ying Yang,Sami Sebastian Brandt
Main category: cs.CV
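TL;DR: This paper proposes a Query-guided Spatial-Temporal-Frequency (QSTar) interaction method and a Query Context Reasoning (QCR) block to improve joint understanding of audio, video, and the text question in Audio-Visual Question Answering (AVQA), clearly outperforming existing methods.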
Details
Motivation: Existing AVQA methods rely too heavily on visual information, treat audio only as a complement, and do not use the question text early enough to guide audio-visual understanding.
Method: QSTar integrates question-guided interaction across the spatial, temporal, and frequency domains. A QCR block inspired by prompting strengthens the focus on semantically relevant audio and visual features.
Result: The method brings significant improvements on several AVQA benchmarks, surpassing existing Audio QA, Visual QA, Video QA, and AVQA approaches.
Conclusion: Question-guided multimodal interaction and context reasoning effectively improve cross-modal understanding in AVQA.
Abstract: Audio-Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. Inspired by recent advances in Video QA, many existing AVQA approaches primarily focus on visual information processing, leveraging pre-trained models to extract object-level and motion-level representations. However, in those methods, the audio input is primarily treated as complementary to video analysis, and the textual question information contributes minimally to audio-visual understanding, as it is typically integrated only in the final stages of reasoning. To address these limitations, we propose a novel Query-guided Spatial-Temporal-Frequency (QSTar) interaction method, which effectively incorporates question-guided clues and exploits the distinctive frequency-domain characteristics of audio signals, alongside spatial and temporal perception, to enhance audio-visual understanding. Furthermore, we introduce a Query Context Reasoning (QCR) block inspired by prompting, which guides the model to focus more precisely on semantically relevant audio and visual features. Extensive experiments conducted on several AVQA benchmarks demonstrate the effectiveness of our proposed method, achieving significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches. The code and pretrained models will be released after publication.
[118] HexFormer: Hyperbolic Vision Transformer with Exponential Map Aggregation
Haya Alyoussef,Ahmad Bdeir,Diego Coello de Portugal Mecke,Tom Hanika,Niels Landwehr,Lars Schmidt-Thieme
Main category: cs.CV
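TL;DR: This paper proposes HexFormer, a hyperbolic vision transformer for image classification that improves the attention mechanism with exponential map aggregation. It explores a fully hyperbolic design and a hybrid design (hyperbolic encoder plus Euclidean classification head), outperforms Euclidean baselines and prior hyperbolic ViTs on several datasets, and finds that hyperbolic models have more stable gradients and lower sensitivity to warmup.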
Details
Motivation: Multimodal data such as images, text, and graphs often have hierarchical and relational structure that Euclidean geometry models poorly. Hyperbolic geometry is naturally suited to such structure, which motivates exploring it in vision transformers.
Method: The paper proposes HexFormer (a hyperbolic ViT) and its hybrid variant HexFormer-Hybrid. The core change is to use exponential map aggregation in the attention mechanism in place of conventional centroid averaging, to improve the accuracy and stability of the representations. It also analyzes gradient stability in hyperbolic transformers.
Result: Across multiple datasets, HexFormer and especially HexFormer-Hybrid clearly outperform Euclidean baselines and prior hyperbolic ViTs. Experiments show that hyperbolic models have more stable gradients and are more robust to the warmup strategy.
Conclusion: Hyperbolic geometry can strengthen vision transformers, improving classification accuracy as well as training stability and efficiency. The relatively simple exponential map aggregation mechanism is already of strong practical value.
Abstract: Data across modalities such as images, text, and graphs often contains hierarchical and relational structures, which are challenging to model within Euclidean geometry. Hyperbolic geometry provides a natural framework for representing such structures. Building on this property, this work introduces HexFormer, a hyperbolic vision transformer for image classification that incorporates exponential map aggregation within its attention mechanism. Two designs are explored: a hyperbolic ViT (HexFormer) and a hybrid variant (HexFormer-Hybrid) that combines a hyperbolic encoder with a Euclidean linear classification head. HexFormer incorporates a novel attention mechanism based on exponential map aggregation, which yields more accurate and stable aggregated representations than standard centroid-based averaging, showing that simpler approaches retain competitive merit. Experiments across multiple datasets demonstrate consistent performance improvements over Euclidean baselines and prior hyperbolic ViTs, with the hybrid variant achieving the strongest overall results. Additionally, this study provides an analysis of gradient stability in hyperbolic transformers. The results reveal that hyperbolic models exhibit more stable gradients and reduced sensitivity to warmup strategies compared to Euclidean architectures, highlighting their robustness and efficiency in training. Overall, these findings indicate that hyperbolic geometry can enhance vision transformer architectures by improving gradient stability and accuracy. In addition, relatively simple mechanisms such as exponential map aggregation can provide strong practical benefits.
[119] EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning
Binzhu Xie,Shi Qiu,Sicheng Zhang,Yinqiao Wang,Hao Xu,Muzammal Naseer,Chi-Wing Fu,Pheng-Ann Heng
Main category: cs.CV
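TL;DR: This paper proposes EgoHandICL, the first in-context learning (ICL) framework for egocentric 3D hand reconstruction. Using exemplar retrieval guided by vision-language models, a multimodal tokenizer designed for ICL, and an MAE architecture trained with hand-guided geometric and perceptual objectives, it significantly improves semantic alignment, visual consistency, and robustness.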
Details
Motivation: Address the challenges of egocentric 3D hand reconstruction, including depth ambiguity, self-occlusion, and complex hand-object interactions, and in particular the weak generalization of existing methods to unseen scenes.
Method: Proposes the EgoHandICL framework, comprising: 1) complementary exemplar retrieval guided by vision-language models (VLMs); 2) an ICL-oriented multimodal context tokenizer; 3) a masked autoencoder (MAE)-based architecture trained with hand-guided geometric and perceptual objectives.
Result: Consistently outperforms state-of-the-art methods on the ARCTIC and EgoExo4D datasets; demonstrates real-world generalization; and improves EgoVLM's hand-object interaction reasoning by using reconstructed hands as visual prompts.
Conclusion: EgoHandICL is the first to bring in-context learning to egocentric 3D hand reconstruction, effectively strengthening robustness and semantic understanding in complex, unseen scenes and offering a new paradigm for hand perception in embodied AI.
Abstract: Robust 3D hand reconstruction in egocentric vision is challenging due to depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods mitigate these issues by scaling training data or adding auxiliary cues, but they often struggle in unseen contexts. We present EgoHandICL, the first in-context learning (ICL) framework for 3D hand reconstruction that improves semantic alignment, visual consistency, and robustness under challenging egocentric conditions. EgoHandICL introduces complementary exemplar retrieval guided by vision-language models (VLMs), an ICL-tailored tokenizer for multimodal context, and a masked autoencoder (MAE)-based architecture trained with hand-guided geometric and perceptual objectives. Experiments on ARCTIC and EgoExo4D show consistent gains over state-of-the-art methods. We also demonstrate real-world generalization and improve EgoVLM hand-object interaction reasoning by using reconstructed hands as visual prompts. Code and data: https://github.com/Nicous20/EgoHandICL
[120] SONIC: Spectral Oriented Neural Invariant Convolutions
Gijs Joppe Moens,Regina Beets-Tan,Eduardo H. P. Pooch
Main category: cs.CV
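TL;DR: This paper proposes SONIC, a neural network architecture based on a continuous spectral parameterisation that models convolutional operators with shared, orientation-selective components. It keeps a structured representation while providing global receptive fields and adaptation across resolutions, markedly improving robustness to geometric transformations, noise, and resolution changes, and matching or exceeding existing methods with fewer parameters.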
Details
Motivation: CNNs lack the ability to model global context, while ViTs lack spatial inductive bias and depend on a fixed patch size; a new representation that is both structured and global is needed.
Method: Proposes SONIC (Spectral Oriented Neural Invariant Convolutions), which uses a continuous spectral parameterisation in which a small number of shared, orientation-selective components define smooth responses covering the full frequency domain, giving global receptive fields and resolution-adaptive convolutions.
Result: On synthetic data, large-scale image classification, and 3D medical datasets, SONIC shows stronger robustness to geometric transformations, noise, and resolution changes, with an order of magnitude fewer parameters and performance on par with or better than CNNs, ViTs, and prior spectral methods.
Conclusion: Continuous, orientation-aware spectral parameterisation is a principled and scalable alternative that overcomes the inherent limitations of conventional spatial and spectral operators.
Abstract: Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches, which limits their ability to capture global context or long-range dependencies without very deep architectures. Vision Transformers (ViTs), in turn, provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Bridging these limitations requires a representation that is both structured and global. We introduce SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters. These results demonstrate that continuous, orientation-aware spectral parameterisations provide a principled and scalable alternative to conventional spatial and spectral operators.
[121] VGGT-SLAM 2.0: Real time Dense Feed-forward Scene Reconstruction
Dominic Maggio,Luca Carlone
Main category: cs.CV
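TL;DR: VGGT-SLAM 2.0 is a real-time RGB feed-forward SLAM system that, through an improved factor graph design and the use of a VGGT attention layer to assist image retrieval verification, substantially improves submap alignment accuracy and robustness, reduces pose error on the TUM dataset by about 23%, and supports real-time onboard operation and open-set object detection.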
Details
Motivation: Address the high-dimensional (15-degree-of-freedom) drift and planar degeneracy in VGGT-SLAM, as well as the reconstruction ambiguity of VGGT under unknown camera intrinsics; also improve the accuracy and robustness of loop closure detection.
Method: 1) Design a new factor graph that removes high-dimensional drift and planar degeneracy while handling the reconstruction ambiguity under unknown intrinsics; 2) use one of VGGT's attention layers for training-free image retrieval verification, rejecting false matches and strengthening loop closure; 3) run real-time online SLAM on a Jetson Thor platform and extend the system to open-set object detection.
Result: About 23% lower pose error than VGGT-SLAM on the TUM dataset; supports real-time onboard operation on a ground robot; validated in varied environments including indoor apartments, offices, and a 4,200 square foot barn; adapts readily to open-set object detection.
Conclusion: VGGT-SLAM 2.0 surpasses the original in accuracy, robustness, and practicality, and is an efficient, lightweight, and scalable visual SLAM system for real-world deployment.
Abstract: We present VGGT-SLAM 2.0, a real-time RGB feed-forward SLAM system which substantially improves upon VGGT-SLAM for incrementally aligning submaps created from VGGT. Firstly, we remove high-dimensional 15-degree-of-freedom drift and planar degeneracy from VGGT-SLAM by creating a new factor graph design while still addressing the reconstruction ambiguity of VGGT given unknown camera intrinsics. Secondly, by studying the attention layers of VGGT, we show that one of the layers is well suited to assist in image retrieval verification for free without additional training, which enables both rejecting false positive matches and completing more loop closures. Finally, we conduct a suite of experiments which includes showing VGGT-SLAM 2.0 can easily be adapted for open-set object detection and demonstrating real-time performance while running online onboard a ground robot using a Jetson Thor. We also test in environments ranging from cluttered indoor apartments and office scenes to a 4,200 square foot barn, and we also demonstrate VGGT-SLAM 2.0 achieves the highest accuracy on the TUM dataset with about 23 percent less pose error than VGGT-SLAM. Code will be released upon publication.
[122] DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding
Shubham Patle,Sara Ghaboura,Hania Tariq,Mohammad Usman Khan,Omkar Thawakar,Rao Muhammad Anwer,Salman Khan
Main category: cs.CV
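TL;DR: This paper presents DuwatBench, the first multimodal benchmark dataset for Arabic calligraphy, with 1,272 samples across 6 script styles, for evaluating how well models understand artistically rendered Arabic text. Experiments show that existing models perform poorly on calligraphy recognition and image-text alignment, and the authors open-source the dataset and evaluation tools to advance Arabic vision-language research.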