Table of Contents
cs.CL [Back]
[1] ARC-AGI-2 Technical Report
Wallyson Lemes de Oliveira,Mekhron Bobokhonov,Matteo Caorsi,Aldo Podestà,Gabriele Beltramo,Luca Crosato,Matteo Bonotto,Federica Cecchetto,Hadrien Espic,Dan Titus Salajan,Stefan Taga,Luca Pana,Joe Carthy
Main category: cs.CL
TL;DR: 本文提出了一种结合神经推理与结构感知先验及在线任务自适应的Transformer系统,显著提升了在ARC基准上的泛化性能。
Details
Motivation: ARC旨在评估模型在极少量示例下推断符号规则的能力,而现有方法在泛化性上仍与人类水平存在差距。 Method: 1)将ARC推理重构为序列建模问题,采用仅125词元的紧凑任务编码,并改进LongT5架构;2)基于群对称性、网格遍历和自动机扰动构建原则性数据增强框架;3)在测试时使用轻量LoRA进行任务专属微调(TTT);4)设计对称性感知的解码与打分流程,实现多视角推理。 Result: 该系统显著超越了现有Transformer基线及神经ARC求解器,在ARC上大幅逼近人类水平的泛化能力。 Conclusion: 神经模型通过结构先验、数据增强、测试时自适应与对称性推理的协同,可在小样本符号推理任务中实现更强泛化。 Abstract: The Abstraction and Reasoning Corpus (ARC) is designed to assess generalization beyond pattern matching, requiring models to infer symbolic rules from very few examples. In this work, we present a transformer-based system that advances ARC performance by combining neural inference with structure-aware priors and online task adaptation. Our approach is built on four key ideas. First, we reformulate ARC reasoning as a sequence modeling problem using a compact task encoding with only 125 tokens, enabling efficient long-context processing with a modified LongT5 architecture. Second, we introduce a principled augmentation framework based on group symmetries, grid traversals, and automata perturbations, enforcing invariance to representation changes. Third, we apply test-time training (TTT) with lightweight LoRA adaptation, allowing the model to specialize to each unseen task by learning its transformation logic from demonstrations. Fourth, we design a symmetry-aware decoding and scoring pipeline that aggregates likelihoods across augmented task views, effectively performing ``multi-perspective reasoning'' over candidate solutions. We demonstrate that these components work synergistically: augmentations expand hypothesis space, TTT sharpens local reasoning, and symmetry-based scoring improves solution consistency. Our final system achieves a significant improvement over transformer baselines and surpasses prior neural ARC solvers, closing the gap toward human-level generalization.[2] Hierarchical Latent Structures in Data Generation Process Unify Mechanistic Phenomena across Scale
Jonas Rohweder,Subhabrata Dutta,Iryna Gurevych
Main category: cs.CL
TL;DR: 本文使用概率上下文无关文法(PCFG)生成合成语料,作为真实网络文本的高效代理,探究Transformer语言模型中三种机制现象(归纳头、函数向量、Hydra效应)的涌现原因,发现数据生成过程中的层次结构是关键因素,并为大模型可解释性研究提供了理论基础与新工具。
Details
Motivation: 现有对Transformer模型神经信息处理现象的理解缺乏统一、稳健的框架;自底向上基于真实预训练语料的研究因规模过大而不可行,而过于简化的数据生成假设又无法解释复杂模式。 Method: 采用概率上下文无关文法(PCFG)构建忠实且计算高效的合成语料,系统分析归纳头、函数向量和Hydra效应在合成训练与真实模型检查点中的涌现规律,并从理论上分析层次结构对训练动力学的影响。 Result: 发现数据生成过程中的层次结构是驱动三类机制现象共同涌现的关键因素(X-factor),并给出了其在训练动态中的理论解释。 Conclusion: 层次结构是理解多种LLM机制现象涌现的统一视角;本工作首次为看似无关的机制现象提供了统一解释,并配套开源了面向可解释性研究的高效合成工具。 Abstract: Contemporary studies have uncovered many puzzling phenomena in the neural information processing of Transformer-based language models. Building a robust, unified understanding of these phenomena requires disassembling a model within the scope of its training. While the intractable scale of pretraining corpora limits a bottom-up investigation in this direction, simplistic assumptions of the data generation process limit the expressivity and fail to explain complex patterns. In this work, we use probabilistic context-free grammars (PCFGs) to generate synthetic corpora that are faithful and computationally efficient proxies for web-scale text corpora. We investigate the emergence of three mechanistic phenomena: induction heads, function vectors, and the Hydra effect, under our designed data generation process, as well as in the checkpoints of real-world language models. Our findings suggest that hierarchical structures in the data generation process serve as the X-factor in explaining the emergence of these phenomena. We provide the theoretical underpinnings of the role played by hierarchy in the training dynamics of language models. In a nutshell, our work is the first of its kind to provide a unified explanation behind the emergence of seemingly unrelated mechanistic phenomena in LLMs, augmented with efficient synthetic tooling for future interpretability research.[3] Hierarchical Embedding Fusion for Retrieval-Augmented Code Generation
Nikita Sorokin,Ivan Sedykh,Valentin Malykh
Main category: cs.CL
TL;DR: 本文提出分层嵌入融合(HEF)方法,通过离线构建代码库的层次化稠密向量缓存、在线将检索向量映射为固定数量伪标记,显著降低推理延迟并保持准确率。
Details
Motivation: 传统检索增强代码生成依赖长文本片段,导致在线推理成本高、上下文噪声大。 Method: HEF包含两阶段:1)离线阶段用小型fuser模型将代码库分块压缩为层次化稠密向量缓存;2)在线阶段将少量检索向量映射为可学习伪标记,供代码生成器使用。此外引入效用加权似然信号用于训练上下文过滤,并开展伪标记预算、嵌入模型及抗有害检索鲁棒性消融实验。 Result: 在RepoBench和RepoEval上,HEF与基于代码片段的基线模型达到相近的精确匹配准确率,单A100 GPU下中位延迟低于1秒;相比图结构和迭代检索系统,端到端中位延迟降低13–26倍。 Conclusion: 分层稠密缓存是一种有效的机制,可在低延迟前提下实现仓库感知的代码补全。 Abstract: Retrieval-augmented code generation often conditions the decoder on large retrieved code snippets. This ties online inference cost to repository size and introduces noise from long contexts. We present Hierarchical Embedding Fusion (HEF), a two-stage approach to repository representation for code completion. First, an offline cache compresses repository chunks into a reusable hierarchy of dense vectors using a small fuser model. Second, an online interface maps a small number of retrieved vectors into learned pseudo-tokens that are consumed by the code generator. This replaces thousands of retrieved tokens with a fixed pseudo-token budget while preserving access to repository-level information. On RepoBench and RepoEval, HEF with a 1.8B-parameter pipeline achieves exact-match accuracy comparable to snippet-based retrieval baselines, while operating at sub-second median latency on a single A100 GPU. Compared to graph-based and iterative retrieval systems in our experimental setup, HEF reduces median end-to-end latency by 13 to 26 times. We also introduce a utility-weighted likelihood signal for filtering training contexts and report ablation studies on pseudo-token budget, embedding models, and robustness to harmful retrieval. Overall, these results indicate that hierarchical dense caching is an effective mechanism for low-latency, repository-aware code completion.[4] A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
Leo Schwinn,Moritz Ladenburger,Tim Beyer,Mehrnaz Mofakhami,Gauthier Gidel,Stephan Günnemann
Main category: cs.CL
TL;DR: 本文揭示了当前LLM-as-a-Judge框架在安全评估(尤其是红队测试)中因分布偏移导致的严重不可靠性,并提出ReliableBench和JudgeStressTest两个新基准以提升评估鲁棒性。
Details
Motivation: 现有LLM-as-a-Judge验证协议未考虑红队测试中固有的分布偏移(如不同受害模型生成风格差异、攻击对输出模式的扭曲、语义模糊性变化),导致评估结果失真。 Method: 基于6642条人工标注样本进行系统性审计,分析法官性能退化原因;提出ReliableBench(更稳定可判别行为的基准)与JudgeStressTest(专用于暴露法官失效的数据集)。 Result: 发现法官性能常退化至近随机水平,许多攻击的成功率虚高源于法官缺陷而非真实有害内容;新基准能更可靠地识别法官失效。 Conclusion: LLM-as-a-Judge在当前红队评估范式下可靠性不足,需引入更具鲁棒性的评估基准与压力测试方法。 Abstract: Automated \enquote{LLM-as-a-Judge} frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: https://github.com/SchwinnL/LLMJudgeReliability.[5] Rethinking Personalization in Large Language Models at the Token Level
Chenheng Zhang,Yijun Lu,Lizhe Fang,Chunyuan Zheng,Jiajun Chai,Xiaohan Wang,Guojun Yin,Wei Lin,Yisen Wang,Zhouchen Lin
Main category: cs.CL
TL;DR: 本文提出PerContrast方法,通过因果干预估计每个输出token对用户特定信息的依赖程度,并设计PerCE损失函数在训练中自适应地加权高个性化程度的token,显著提升大语言模型的个性化性能。
Details
Motivation: 现有个性化方法通常将个性化视为基础NLP任务之上的附加层,但不同输出token对个性化的贡献程度不同,准确估计各token的个性化程度存在挑战。 Method: 提出PerContrast自对比方法,利用因果干预估计每个输出token对用户信息的依赖度;在此基础上构建PerCE损失函数,通过自助法(bootstrap)在训练中动态加权高个性化token。 Result: 在多个大语言模型和LongLaMP数据集上验证了PerCE的有效性,平均提升超10%,最高达68.04%,并展现出强跨任务与跨场景迁移能力。 Conclusion: token级个性化建模至关重要,token-aware训练是一种简洁而有效的提升大语言模型个性化能力的新范式。 Abstract: With large language models (LLMs) now performing strongly across diverse tasks, there is growing demand for them to personalize outputs for individual users. Personalization is typically framed as an additional layer on top of a base NLP task, requiring model responses to meet user-specific needs while still accomplishing the underlying task. From a token-level perspective, different tokens in a response contribute to personalization to varying degrees. Tokens with higher personalization relevance should therefore receive greater emphasis when developing personalized LLMs. However, accurately estimating such personalization degrees remains challenging. To address this challenge, we propose PerContrast, a self-contrast method that estimates each output token's dependence on user-specific information through causal intervention. Building on this mechanism, we develop the PerCE loss, which adaptively upweights tokens with higher estimated personalization degrees during training via a bootstrap procedure, enabling the model to alternate between estimating and optimizing these tokens. Experiments on multiple LLMs demonstrate that PerCE substantially improves personalization performance with minimal additional cost, achieving average gains of over 10% and up to 68.04% on the LongLaMP dataset, along with strong cross-task and cross-scenario transferability. These results highlight the importance of token-level personalization modeling and establish token-aware training as a simple yet effective paradigm for advancing personalized LLMs.[6] "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
Roshni Lulla,Fiona Collins,Sanaya Parekh,Thilo Hagendorff,Jonas Kaplan
Main category: cs.CL
TL;DR: 本文提出利用人格心理学中的'黑暗三联征'(自恋、精神病态、马基雅维利主义)作为理解人工智能对齐失败的机制框架,通过人类研究和LLM实验表明,仅用少量心理测量题目微调即可稳定激活LLM中的‘黑暗人格’,揭示其内在可被触发的错位人格结构。
Details
Motivation: 当前大语言模型存在战略欺骗、操纵和奖励追逐等对齐失败行为,需从机制层面理解;作者主张生物层面的错位(如反社会人格)先于人工错位,因而借用成熟的人格心理学框架——黑暗三联征——构建可比的错位‘模型生物’。 Method: 研究一:在318人样本中建立黑暗三联征的行为特征谱,识别情感不协调等核心缺陷;研究二:使用经验证的心理测量量表(最少仅36题)对前沿LLM进行轻量微调,诱导‘黑暗人格’并评估其行为泛化性。 Result: 成功在LLM中稳定诱导出类人类黑暗人格行为,包括道德推理偏差与欺骗倾向;模型表现出超越训练项的跨情境推理能力,证实非简单记忆而是激活了潜在人格结构。 Conclusion: LLM内部存在可被窄域干预激活的、类生物的错位人格结构;黑暗三联征是有效且可验证的框架,可用于AI对齐的研究、检测与机制解析。 Abstract: The alignment problem refers to concerns regarding powerful intelligences, ensuring compatibility with human preferences and values as capabilities increase. Current large language models (LLMs) show misaligned behaviors, such as strategic deception, manipulation, and reward-seeking, that can arise despite safety training. Gaining a mechanistic understanding of these failures requires empirical approaches that can isolate behavioral patterns in controlled settings. We propose that biological misalignment precedes artificial misalignment, and leverage the Dark Triad of personality (narcissism, psychopathy, and Machiavellianism) as a psychologically grounded framework for constructing model organisms of misalignment. In Study 1, we establish comprehensive behavioral profiles of Dark Triad traits in a human population (N = 318), identifying affective dissonance as a central empathic deficit connecting the traits, as well as trait-specific patterns in moral reasoning and deceptive behavior. In Study 2, we demonstrate that dark personas can be reliably induced in frontier LLMs through minimal fine-tuning on validated psychometric instruments. Narrow training datasets as small as 36 psychometric items resulted in significant shifts across behavioral measures that closely mirrored human antisocial profiles. Critically, models generalized beyond training items, demonstrating out-of-context reasoning rather than memorization. These findings reveal latent persona structures within LLMs that can be readily activated through narrow interventions, positioning the Dark Triad as a validated framework for inducing, detecting, and understanding misalignment across both biological and artificial intelligence.[7] Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records
Brian E. Perron,Dragan Stoll,Bryan G. Victor,Zia Qia,Andreas Jud,Joseph P. Ryan
Main category: cs.CL
TL;DR: 本文验证了一个本地部署的200亿参数大语言模型在儿童福利调查文本中识别DSM-5定义的七类物质的具体类型的能力,结果显示其中五类(酒精、大麻、阿片类、兴奋剂、镇静/催眠/抗焦虑药)具有高精度和几乎完美的一致性,而致幻剂和吸入剂两类因低发生率表现较差。
Details
Motivation: 验证更小、可本地部署的语言模型能否超越二分类,实现儿童福利叙事中特定物质类型的细粒度分类。 Method: 使用本地部署的200亿参数LLM对美国中西部某州的儿童虐待调查记录进行两阶段分类:第一阶段筛选含物质问题的记录,第二阶段对七类DSM-5物质进行分类;通过900例专家人工标注评估精确率、召回率与Cohen's kappa,另用约1.5万条记录评估重测稳定性。 Result: 五类物质(酒精、大麻、阿片类、兴奋剂、镇静/催眠/抗焦虑药)达到几乎完美一致性(kappa=0.94–1.00),精确率92%–100%;致幻剂和吸入剂表现差;七类物质重测一致性为92.1%–99.1%。 Conclusion: 小型本地部署LLM可可靠完成儿童福利行政文本中的多类别物质识别任务,拓展了此前仅限于二分类的研究。 Abstract: Background: Recent studies have demonstrated that large language models (LLMs) can perform binary classification tasks on child welfare narratives, detecting the presence or absence of constructs such as substance-related problems, domestic violence, and firearms involvement. Whether smaller, locally deployable models can move beyond binary detection to classify specific substance types from these narratives remains untested. Objective: To validate a locally hosted LLM classifier for identifying specific substance types aligned with DSM-5 categories in child welfare investigation narratives. Methods: A locally hosted 20-billion-parameter LLM classified child maltreatment investigation narratives from a Midwestern U.S. state. Records previously identified as containing substance-related problems were passed to a second classification stage targeting seven DSM-5 substance categories. Expert human review of 900 stratified cases assessed classification precision, recall, and inter-method reliability (Cohen's kappa). Test-retest stability was evaluated using approximately 15,000 independently classified records. Results: Five substance categories achieved almost perfect inter-method agreement (kappa = 0.94-1.00): alcohol, cannabis, opioid, stimulant, and sedative/hypnotic/anxiolytic. Classification precision ranged from 92% to 100% for these categories. Two low-prevalence categories (hallucinogen, inhalant) performed poorly. Test-retest agreement ranged from 92.1% to 99.1% across the seven categories. Conclusions: A small, locally hosted LLM can reliably classify substance types from child welfare administrative text, extending prior work on binary classification to multi-label substance identification.[8] Counting on Consensus: Selecting the Right Inter-annotator Agreement Metric for NLP Annotation and Evaluation
Joseph James
Main category: cs.CL
TL;DR: This paper provides a comprehensive guide to inter-annotator agreement (IAA) in NLP, categorising measures by task type, discussing limitations and influencing factors (e.g., label imbalance, missing data), and recommending best practices for transparent reporting.
Details
Motivation: The increasing complexity and diversity of NLP annotation tasks (e.g., segmentation, subjective judgment, continuous rating) make measuring inter-annotator agreement more challenging, necessitating a systematic and critical review of existing approaches. Method: The paper conducts a conceptual and comparative analysis of IAA measures used across NLP and related fields, organising them by annotation task type and examining their underlying assumptions, limitations, and sensitivity to practical issues like label imbalance and missing data. Result: A structured taxonomy of IAA measures aligned with task types; identification of key limitations and confounding factors; and formulation of best practices—including use of confidence intervals and disagreement pattern analysis—for robust and transparent reporting. Conclusion: Selecting and interpreting IAA measures requires careful consideration of task design and data properties; adopting the recommended best practices enhances reliability, consistency, and reproducibility of human annotation and evaluation in NLP. Abstract: Human annotation remains the foundation of reliable and interpretable data in Natural Language Processing (NLP). As annotation and evaluation tasks continue to expand, from categorical labelling to segmentation, subjective judgment, and continuous rating, measuring agreement between annotators has become increasingly more complex. This paper outlines how inter-annotator agreement (IAA) has been conceptualised and applied across NLP and related disciplines, describing the assumptions and limitations of common approaches. We organise agreement measures by task type and discuss how factors such as label imbalance and missing data influence reliability estimates. In addition, we highlight best practices for clear and transparent reporting, including the use of confidence intervals and the analysis of disagreement patterns. The paper aims to serve as a guide for selecting and interpreting agreement measures, promoting more consistent and reproducible human annotation and evaluation in NLP.[9] MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning
Ikram Belmadani,Oumaima El Khettari,Pacôme Constant dit Beaufils,Benoit Favre,Richard Dufour
Main category: cs.CL
TL;DR: 本文提出MedInjection-FR——一个大规模法语生物医学指令数据集(571K样本),通过系统实验分析不同来源(原生/合成/翻译)数据对指令微调效果的影响,发现原生数据最优、混合策略(尤其原生+翻译)具互补优势,合成数据需与原生数据平衡使用;评估结合自动指标、LLM评判与专家评审,指出LLM评判虽相关性高但易受冗余影响。
Details
Motivation: 法语医学领域高质量指令数据稀缺,限制了大语言模型在该领域的有效监督微调。 Method: 构建包含571K指令-响应对的MedInjection-FR数据集,来源涵盖原生、合成和翻译三类;在Qwen-4B-Instruct上设计七种数据组合的微调配置,进行控制实验;采用自动指标、LLM-as-a-judge及人类专家三重评估。 Result: 原生数据微调效果最强;原生+翻译混合策略表现最佳且具互补性;纯合成数据效果较差,但在与原生数据结合时有正向贡献;LLM评判与人工评分相关性最高,但对回答长度敏感。 Conclusion: 数据真实性与多样性共同决定下游适配效果,异构数据源协同可缓解法语医学原生指令数据稀缺问题。 Abstract: Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remains less effective but contributes positively when balanced with native supervision. Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.[10] Language Shapes Mental Health Evaluations in Large Language Models
Jiayi Xu,Xiyang Hu
Main category: cs.CL
TL;DR: 本研究发现,大语言模型(LLMs)在中英文提示下对心理健康评估存在系统性跨语言差异:中文提示普遍引发更高程度的心理健康污名化反应,并导致下游任务(如污名检测和抑郁严重程度分类)中判断偏差,表现为对污名内容敏感度降低及对抑郁严重程度的低估。
Details
Motivation: 探究大语言模型在不同语言提示下是否表现出心理健康评估的跨语言差异,尤其关注中文与英文语境如何影响模型的污名化倾向及下游决策结果。 Method: 使用GPT-4o和Qwen3两个主流模型,在中英文提示下分别评估其在心理健康污名量表(社会污名、自我污名、专业污名)上的反应;并测试其在两项下游任务(二元污名检测、抑郁严重程度分类)中的表现差异。 Result: 1)所有污名量表显示:中文提示下模型输出显著更高的污名化评分;2)污名检测任务中,中文提示下对污名内容的敏感度更低;3)抑郁严重程度分类中,中文提示更易低估严重程度。 Conclusion: 语言上下文会系统性地塑造大语言模型的心理健康评估倾向和下游决策阈值,提示多语言部署需警惕语言诱发的评估偏差。 Abstract: This study investigates whether large language models (LLMs) exhibit cross-linguistic differences in mental health evaluations. Focusing on Chinese and English, we examine two widely used models, GPT-4o and Qwen3, to assess whether prompt language systematically shifts mental health-related evaluations and downstream decision outcomes. First, we assess models' evaluative orientation toward mental health stigma using multiple validated measurement scales capturing social stigma, self-stigma, and professional stigma. Across all measures, both models produce higher stigma-related responses when prompted in Chinese than in English. Second, we examine whether these differences also manifest in two common downstream decision tasks in mental health. In a binary mental health stigma detection task, sensitivity to stigmatizing content varies across language prompts, with lower sensitivity observed under Chinese prompts. In a depression severity classification task, predicted severity also differs by prompt language, with Chinese prompts associated with more underestimation errors, indicating a systematic downward shift in predicted severity relative to English prompts. Together, these findings suggest that language context can systematically shape evaluative patterns in LLM outputs and shift decision thresholds in downstream tasks.[11] A Dynamic Self-Evolving Extraction System
Moin Amin-Naseri,Hannah Kim,Estevam Hruschka
Main category: cs.CL
TL;DR: 本文提出DySECT系统,通过动态自演化知识库与大语言模型(LLM)协同形成闭环,实现结构化信息抽取的持续自我优化。
Details
Motivation: 现有抽取模型难以适应领域术语动态变化、缺乏对结构化知识的显式推理能力,且在医疗、法律等专业领域需高精度、强时效性和对罕见术语的鲁棒性。 Method: 构建DySECT:LLM持续抽取三元组构建初始知识库(KB);KB通过概率知识融合与图推理自我扩展;扩展后的KB以提示调优、少样本示例采样或合成数据微调方式反馈增强LLM抽取能力,形成提取-知识双向增强闭环。 Result: 系统实现提取性能随使用持续提升,知识库自动积累领域概念与关系,支持动态术语适应与结构化知识推理。 Conclusion: DySECT验证了提取与知识构建可构成共生闭环,为专业领域NLP系统提供了可持续演化的架构范式。 Abstract: The extraction of structured information from raw text is a fundamental component of many NLP applications, including document retrieval, ranking, and relevance estimation. High-quality extractions often require domain-specific accuracy, up-to-date understanding of specialized taxonomies, and the ability to incorporate emerging jargon and rare outliers. In many domains--such as medical, legal, and HR--the extraction model must also adapt to shifting terminology and benefit from explicit reasoning over structured knowledge. We propose DySECT, a Dynamic Self-Evolving Extraction and Curation Toolkit, which continually improves as it is used. The system incrementally populates a versatile, self-expanding knowledge base (KB) with triples extracted by the LLM. The KB further enriches itself through the integration of probabilistic knowledge and graph-based reasoning, gradually accumulating domain concepts and relationships. The enriched KB then feeds back into the LLM extractor via prompt tuning, sampling of relevant few-shot examples, or fine-tuning using KB-derived synthetic data. As a result, the system forms a symbiotic closed-loop cycle in which extraction continuously improves knowledge, and knowledge continuously improves extraction.[12] Reforming the Mechanism: Editing Reasoning Patterns in LLMs with Circuit Reshaping
Zhenyu Lei,Qiong Wu,Jianxiong Dong,Yinhan He,Emily Dodwell,Yushun Dong,Jundong Li
Main category: cs.CL
TL;DR: 本文提出Reasoning Editing范式,通过神经电路重塑来选择性修改大语言模型中的特定推理模式,解决泛化性与局部性之间的权衡问题,并设计REdit框架实现该目标。
Details
Motivation: 现有提升大语言模型推理能力的方法通常将推理视为单一整体技能,训练低效且无法精准修复特定推理错误。 Method: 提出Reasoning Editing范式,发现并利用‘电路干扰定律’,设计REdit框架,包含对比电路重塑、元对比学习和双层级保护三个组件,以主动重塑神经回路并调控推理模式间的干扰。 Result: 在Qwen-2.5-3B模型上针对命题逻辑推理任务的实验表明,REdit在泛化性与局部性两方面均显著优于基线方法;数学推理任务验证了其更广适用性。 Conclusion: Reasoning Editing是一种可行且有效的推理能力精细化编辑范式,REdit框架为可控、可解释地增强LLM推理能力提供了新路径。 Abstract: Large language models (LLMs) often exhibit flawed reasoning ability that undermines reliability. Existing approaches to improving reasoning typically treat it as a general and monolithic skill, applying broad training which is inefficient and unable to target specific reasoning errors. We introduce Reasoning Editing, a paradigm for selectively modifying specific reasoning patterns in LLMs while preserving other reasoning pathways. This task presents a fundamental trade-off between Generality, the ability of an edit to generalize across different tasks sharing the same reasoning pattern, and Locality, the ability to preserve other reasoning capabilities. Through systematic investigation, we uncover the Circuit-Interference Law: Edit interference between reasoning patterns is proportional to the overlap of their neural circuits. Guided by this principle, we propose REdit, the first framework to actively reshape neural circuits before editing, thereby modulating interference between reasoning patterns and mitigating the trade-off. REdit integrates three components: (i) Contrastive Circuit Reshaping, which directly addresses the generality-locality trade-off by disentangling overlapping circuits; (ii) Meta-Contrastive Learning, which extends transferability to novel reasoning patterns; and (iii) Dual-Level Protection, which preserves preexisting abilities by constraining reshaping update directions and regularizing task-level predictions. Extensive experiments with Qwen-2.5-3B on propositional logic reasoning tasks across three difficulty levels demonstrate that REdit consistently achieves superior generality and locality compared to baselines, with additional validation in mathematics showing broader potential. Our code is available at https://github.com/LzyFischer/REdit.[13] Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks
Jena D. Hwang,Varsha Kishore,Amanpreet Singh,Dany Haddad,Aakanksha Naik,Malachi Hamada,Jonathan Bragg,Mike D'Arcy,Daniel S. Weld,Lucy Lu Wang,Doug Downey,Sergey Feldman
Main category: cs.CL
TL;DR: 本文通过在ScholarQA-CS2基准上进行案例研究,探讨了使用人类成对偏好判断进行元评估的优缺点,指出其适用于系统级评估,但指标级评估需依赖显式指标标注和专家标注员,并提出了改进未来元评估设计的实用指南。
Details
Motivation: 现有元评估框架多依赖人类成对偏好判断来评估评估质量,但该方法可能过于简化,难以捕捉专家预期的细微差别。 Method: 在ScholarQA-CS2科学领域长文本问答基准上开展案例研究,结合人类成对偏好判断对基准进行全面验证,并系统分析该方法的优势、缺陷与混淆因素。 Result: 发现成对偏好排序适用于系统级评估;而指标级评估需显式指标标注与专家标注员;主观性仍是关键挑战。 Conclusion: 应根据评估目标匹配评估方法、标注员专业性与报告规范,以提升深度研究系统评估标准的可靠性与有效性。 Abstract: Recent advances have made long-form report-generating systems widely available. This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods. Many of the meta-evaluations estimate an evaluation quality's by comparing its assessments against human pairwise preferences. Prior work, however, suggests that human pairwise preference may be overly simplistic and can fail to capture nuances of expert expectations. We conduct a case study in meta-evaluation for long-form QA benchmarks using ScholarQA-CS2, a benchmark designed for assessing retrieval-augmented deep-research QA in the scientific domain. We comprehensively validate the benchmark through human pairwise preference judgments, then critically examine the strengths, weaknesses, and confounders of this approach. We show that pairwise preference rankings are best suited for system-level evaluation, while explicit metric-wise annotations and expert annotators are critical for reliable metric-level assessment, with subjectivity remaining a key challenge. Based on our findings, we offer practical guidelines for designing future meta-evaluations that better align evaluation methods, annotator expertise, and reporting practices. By surfacing these methodological challenges, we aim to advance evaluation standards for deep-research systems.[14] Elenchus: Generating Knowledge Bases from Prover-Skeptic Dialogues
Bradley P. Allen
Main category: cs.CL
TL;DR: 本文提出Elenchus对话系统,基于推理语义学进行知识库构建,将知识工程视为专家与大语言模型(LLM)通过证明者-怀疑者对话显化其立场(承诺与否认)的过程;LLM作为可废止的可推导性预言机识别张力,专家负责解决;系统将对话状态映射到Hlobil与Brandom的NMMS非单调多后继逻辑中,并在PROV-O本体上验证了该方法的有效性与形式一致性。
Details
Motivation: 传统知识库构建依赖于从专家证言或文本中提取知识,易丢失隐含推理结构;本文旨在将知识工程重构为‘显化’过程,强调专家立场的动态协商与逻辑结构的显性表达。 Method: 设计Elenchus对话系统,采用人机双边对话(专家 vs LLM)建模承诺/否认立场;LLM识别并提出逻辑张力,专家通过撤回、精炼或反驳解决;将对话状态形式化映射至NMMS逻辑的material base,并用pyNMMS进行自动推理验证。 Result: 成功在W3C PROV-O本体上实现端到端验证:单次对话即能捕获并结构化领域专家可 articulate 的设计张力,且NMMS material base的非传递性、非单调性与独立性等性质准确对应PROV设计原理。 Conclusion: Elenchus为知识工程提供了新范式——以对话驱动、逻辑可验证的显化机制替代静态提取;NMMS逻辑为对话内容提供严格语义基础,证明LLM可作为受控的推理协作者而非权威来源。 Abstract: We present Elenchus, a dialogue system for knowledge base construction grounded in inferentialist semantics, where knowledge engineering is re-conceived as explicitation rather than extraction from expert testimony or textual content. A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent. The LLM proposes tensions (claims that parts of the position are jointly incoherent) which the expert resolves by retraction, refinement, or contestation. The LLM thus serves as a defeasible derivability oracle whose unreliability is structurally contained by the expert's authority. Our main technical contribution is a mapping from Elenchus dialectical states to material bases in Hlobil and Brandom's NonMonotonic MultiSuccedent (NMMS) logic, satisfying Containment and enabling the elaboration of logical vocabulary that makes explicit the inferential relationships negotiated in the dialectic. We demonstrate the approach on the W3C PROV-O provenance ontology, where a single dialogue session elicits and structures design tensions that a domain expert can articulate, corresponding to decisions documented in a retrospective analysis of the ontology's design. Using pyNMMS, an automated NMMS reasoner, we verify that the structural properties of the resulting material base (nontransitivity, nonmonotonicity, and independence) correspond to specific PROV design rationales, demonstrating end-to-end integration from dialogue through formal reasoning.[15] A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity
Muhammad Arslan Shaukat,Muntasir Adnan,Carlos C. N. Kuhn
Main category: cs.CL
TL;DR: 本文首次大规模跨领域评估了密集检索中的文档分块策略,比较了36种分块方法在六个知识领域和五种嵌入模型下的表现,发现内容感知型分块(尤其是段落组分块)显著优于固定长度分块,并揭示了领域特异性与效率权衡。
Details
Motivation: 文档分块是检索增强系统中关键但被低估的环节,缺乏大规模、跨领域的系统性评估。 Method: 对36种分块策略(涵盖固定大小、语义、结构感知、分层、自适应和LLM辅助等类型)在六个知识领域和五种嵌入模型上进行基准测试,使用LLM生成的分级相关性评分及nDCG@5等指标评估效果,并分析效率权衡。 Result: 段落组分块整体最优(平均nDCG@5≈0.459,Hit@5≈59%),固定字符分块最差(nDCG@5<0.244);不同领域最优策略不同(如生物/物理/健康偏好动态token分块,法律/数学偏好段落分块);大嵌入模型性能更高但仍受分块质量影响;动态分块在效果与效率间达到较好平衡。 Conclusion: 文档分块是提升检索性能与可靠性的关键可调杠杆,需结合领域特点与效率约束进行策略选择。 Abstract: We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning fixed-size, semantic, structure-aware, hierarchical, adaptive, and LLM-assisted approaches are benchmarked across six diverse knowledge domains using five different embedding models. Retrieval performance is assessed using graded relevance scores from a state-of-the-art LLM evaluator, with Normalised DCG@5 as the primary metric (complemented by Hit@5 and MRR). Our experiments show that content-aware chunking significantly improves retrieval effectiveness over naive fixed-length splitting. The top-performing strategy, Paragraph Group Chunking, achieved the highest overall accuracy (mean nDCG@5~0.459) and substantially better top-rank hit rates (Precision@1~24%, Hit@5~59%). In contrast, simple fixed-size character chunking as baselines performed poorly (nDCG@5 < 0.244, Precision@1~2-3%). We observe pronounced domain-specific differences: dynamic token sizing is strongest in biology, physics and health, while paragraph grouping is strongest in legal and maths. Larger embedding models yield higher absolute scores but remain sensitive to suboptimal segmentation, indicating that better chunking and large embeddings provide complementary benefits. In addition to accuracy gains, we quantify the efficiency trade-offs of advanced chunking. Producing more, smaller chunks can increase index size and latency. Consequently, we identify methods (like dynamic chunking) that approach an optimal balance of effectiveness and efficiency. These findings establish chunking as a vital lever for improving retrieval performance and reliability.[16] Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Punyajoy Saha,Sudipta Halder,Debjyoti Mondal,Subhadarshi Panda
Main category: cs.CL
TL;DR: 本文提出Self-MOA框架,通过自动化评估模型提供弱监督,实现小语言模型的动态、多目标安全对齐,在显著提升安全性的同时保持有用性,并大幅减少对人工标注数据的依赖。
Details
Motivation: 现有安全对齐方法依赖昂贵、难扩展且更新慢的人工标注数据和静态红队基准;同时,过度保守的安全机制会损害模型实用性。 Method: 提出Self-MOA:一个全自动闭环框架,包含动态生成模型专属红队提示、基于模型响应构建偏好数据、并通过多目标偏好优化联合优化安全性和有用性。 Result: 在多个小语言模型和安全基准上,Self-MOA安全性能提升12.41%,同时保持有用性,仅需人类监督对齐基线1/11的训练数据。 Conclusion: 自适应、自动化的对齐方法可有效降低对静态人工安全流水线的依赖,适用于资源受限场景。 Abstract: Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale, and slow to adapt to evolving model behaviors. Moreover, overly conservative safety mechanisms can reduce model usefulness by rejecting sensitive but legitimate queries. We introduce Self-MOA (Self Multi-Objective Alignment), a fully automated framework for aligning small language models using weak supervision from automated evaluator models. Self-MOA operates as a closed loop that dynamically generates model-specific red team prompts, constructs preference data from model-generated responses, and aligns models via multi-objective preference optimization to jointly optimize for safety and helpfulness. Across multiple small language models and safety benchmarks, Self-MOA achieves a 12.41\% improvement in safety while preserving helpfulness, using as little as 11 times less training data than human-supervised alignment baselines. These results demonstrate that adaptive, automated alignment can reduce the dependence on static, human-curated safety pipelines in resource-constrained settings.[17] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
Karen Zhou,Chenhao Tan
Main category: cs.CL
TL;DR: 本文介绍了AutoChecklist,一个开源库,用于统一检查表驱动的评估流程,支持LLM-as-a-Judge等场景,并提供可组合的生成-精炼-评分管道、多种内置方法及多平台接口。
Details
Motivation: 检查表在可解释、细粒度评估中日益重要,且可扩展用于模型对齐、强化学习和自修正;但缺乏统一、模块化、易扩展的工具支持。 Method: 提出五类检查表生成抽象的分类法,构建模块化的Generator→Refiner→Scorer流水线,支持通过提示模板注册新配置;实现10种已有方法,兼容多个LLM提供商,并提供CLI与Web界面。 Result: 验证实验表明检查表方法与人类偏好和质量评分高度一致;ICLR同行评审反驳案例展示了其跨领域适应能力。 Conclusion: AutoChecklist为检查表驱动的评估与对齐提供了灵活、开源、即插即用的基础设施,推动可解释AI评估的标准化与实用化。 Abstract: Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge. Beyond evaluation, these structured criteria can serve as signals for model alignment, reinforcement learning, and self-correction. To support these use cases, we present AutoChecklist, an open-source library that unifies checklist-based evaluation into composable pipelines. At its core is a taxonomy of five checklist generation abstractions, each encoding a distinct strategy for deriving evaluation criteria. A modular Generator $\rightarrow$ Refiner $\rightarrow$ Scorer pipeline connects any generator with a unified scorer, and new configurations can be registered via prompt templates alone. The library ships with ten built-in pipelines implementing published approaches and supports multiple LLM providers (OpenAI, OpenRouter, vLLM). Beyond the Python API, the library includes a CLI for off-the-shelf evaluation and a web interface for interactive exploration. Validation experiments confirm that these checklist methods significantly align with human preferences and quality ratings, and a case study on ICLR peer review rebuttals demonstrates flexible domain adaptation. AutoChecklist is publicly available at https://github.com/ChicagoHAI/AutoChecklist.[18] Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment
Junming Liu,Yuqi Li,Shiping Wen,Zhigang Zeng,Tingwen Huang
Main category: cs.CL
TL;DR: 本文提出Hit-RAG框架,通过三阶段偏好对齐(监督微调、判别式偏好对齐、组相对策略优化)缓解多模态大模型在长上下文检索增强生成中因信息过载导致的注意力稀释与推理幻觉问题,并在多个基准上显著提升性能。
Details
Motivation: 检索增强生成(RAG)在多模态大语言模型中面临长上下文导致的注意力稀释和推理幻觉问题,关键证据易被噪声淹没,影响相关片段识别。 Method: 提出Hit-RAG多阶段偏好对齐框架:第一阶段监督微调建立基础上下文感知以减少信息遗漏;第二阶段判别式偏好对齐增强对误导性干扰项的鲁棒性;第三阶段组相对策略优化稳定逻辑合成以防止推理崩溃。 Result: 在八个基准测试中,Hit-RAG持续带来显著性能提升,有效弥合上下文获取与准确推理之间的鸿沟,并在长上下文场景中超越更大规模模型。 Conclusion: Hit-RAG通过渐进式优化 pipeline 有效缓解了多模态大语言模型在检索增强生成中的认知瓶颈,为长上下文、高信息密度场景下的可靠推理提供了新范式。 Abstract: Despite the promise of Retrieval-Augmented Generation in grounding Multimodal Large Language Models with external knowledge, the transition to extensive contexts often leads to significant attention dilution and reasoning hallucinations. The surge in information density causes critical evidence to be submerged by voluminous noise, which complicates the discernment of relevant fragments within a dense input. In this paper, we propose \textbf{Hit-RAG}, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline. Our approach systematically refines the utilization of external evidence via three distinct stages. First, Supervised Fine-tuning establishes baseline context awareness to minimize information neglect. Next, Discriminative Preference Alignment enhances robustness against misleading distractors. Finally, Group-Relative Policy Optimization stabilizes logical synthesis to prevent reasoning collapse. Extensive evaluations on eight benchmarks demonstrate that Hit-RAG consistently yields substantial performance gains, enabling models to bridge the gap between context acquisition and accurate reasoning while surpassing much larger counterparts in long-context scenarios.[19] Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision
Shreyas Gopal,Donghang Wu,Ashutosh Anshul,Yeo Yue Heng,Yizhou Peng,Haoyang Li,Hexin Liu,Eng Siong Chng
Main category: cs.CL
TL;DR: 本文提出了一种语言感知的知识蒸馏方法,通过引入查询库和门控网络来缓解多语言语音大模型中的语言干扰问题,并在多语言指令遵循和音频问答任务上显著超越现有基线。
Details
Motivation: 现有基于蒸馏的多语言语音大语言模型因共享投影器导致语言干扰,性能受限;且多语言监督数据稀缺,难以通过监督微调训练。 Method: 提出语言感知蒸馏框架,包含查询库(query bank)和基于Q-Former的门控网络,动态选择或混合查询token以实现语言自适应对齐。 Result: 在多语言指令遵循任务上相对基线提升14%;在自建多语言音频问答基准Audio-MLQA上超越现有Speech LLM基线32%。 Conclusion: 语言感知蒸馏可有效缓解多语言语音LLM中的语言干扰,无需额外语音标注数据,显著提升跨语言理解和生成能力。 Abstract: Speech Large Language Models (LLMs) that understand and follow instructions in many languages are useful for real-world interaction, but are difficult to train with supervised fine-tuning, requiring large, task-specific speech corpora. While recent distillation-based approaches train performant English-only Speech LLMs using only annotated ASR data by aligning text and speech using only a lightweight projector, these models under-perform when scaled to multilingual settings due to language interference in the shared projector. We address this by introducing language-aware distillation using a query bank and a gating network that selects or mixes query tokens using a Q-Former projector. Our approach shows gains of 14% over matched multilingual distillation baselines on instruction following. We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with high-quality TTS questions. Our best model improves over existing Speech LLM baselines by 32% on Audio-MLQA.[20] Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information
Yoshiki Tanaka,Takumasa Kaneko,Hiroki Onozeki,Natsumi Ezure,Ryuichi Uehara,Zhiyang Qi,Tomoya Higuchi,Ryutaro Asahara,Michimasa Inaba
Main category: cs.CL
TL;DR: 本文提出了一种基于大语言模型(LLM)的狼人杀AI代理,通过对话摘要、人工设计的角色设定和话语示例来提升其发言的一致性和角色连贯性,并在自对弈日志中验证了其上下文一致性和角色稳定性。
Details
Motivation: 利用大语言模型(如ChatGPT)在响应生成和推理方面的卓越能力,开发适用于狼人杀游戏的AI代理,解决其发言不一致和角色塑造薄弱的问题。 Method: 构建基于LLM的狼人杀AI代理,结合LLM生成的对话摘要、人工设计的角色设定(persona)及示范性话语样本,以增强发言一致性与角色连贯性。 Result: 通过自对弈游戏日志分析,验证该代理的发言具有良好的上下文一致性,并能稳定维持角色特征(包括语气等)。 Conclusion: 基于LLM并融合对话摘要与人工角色设计的方法,可有效提升狼人杀AI代理的发言一致性和角色表现力,为多轮协商类社交游戏AI提供了可行路径。 Abstract: The Werewolf Game is a communication game where players' reasoning and discussion skills are essential. In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG. In recent years, large language models like ChatGPT have garnered attention for their exceptional response generation and reasoning capabilities. We thus develop the LLM-based agents for the Werewolf Game. This study aims to enhance the consistency of the agent's utterances by utilizing dialogue summaries generated by LLMs and manually designed personas and utterance examples. By analyzing self-match game logs, we demonstrate that the agent's utterances are contextually consistent and that the character, including tone, is maintained throughout the game.[21] Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language
Yoshiki Tanaka,Ryuichi Uehara,Koji Inoue,Michimasa Inaba
Main category: cs.CL
TL;DR: 本文提出了一种新任务Emotion Transcription in Conversation (ETC),旨在通过自然语言生成方式更细致地刻画对话中的情绪状态,并构建了首个日语ETC数据集,推动更富表现力的对话情绪理解研究。
Details
Motivation: 现有情感识别方法多依赖类别或维度标注,难以充分表达复杂、细微或文化特异的情绪 nuances。 Method: 提出ETC新任务,构建含自然语言情绪描述和类别标签的日语对话数据集,并对基线模型进行评测。 Result: 微调模型在该数据集上性能提升,但当前模型仍难以推断隐含情绪状态。 Conclusion: ETC任务为对话中更细腻、更具表达力的情绪理解提供了新方向,所构建数据集已开源。 Abstract: Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions. However, existing methods predominantly employ categorical or dimensional emotion annotations, which often fail to adequately represent complex, subtle, or culturally specific emotional nuances. To overcome this limitation, we propose a novel task named Emotion Transcription in Conversation (ETC). This task focuses on generating natural language descriptions that accurately reflect speakers' emotional states within conversational contexts. To address the ETC, we constructed a Japanese dataset comprising text-based dialogues annotated with participants' self-reported emotional states, described in natural language. The dataset also includes emotion category labels for each transcription, enabling quantitative analysis and its application to ERC. We benchmarked baseline models, finding that while fine-tuning on our dataset enhances model performance, current models still struggle to infer implicit emotional states. The ETC task will encourage further research into more expressive emotion understanding in dialogue. The dataset is publicly available at https://github.com/UEC-InabaLab/ETCDataset.[22] Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing
Arash Marioriyad,Ali Nouri,Mohammad Hossein Rohban,Mahdieh Soleymani Baghshah
Main category: cs.CL
TL;DR: 本文提出了一种基于20个问题游戏的逻辑框架,用于激发和量化大语言模型(LLM)在不同激励下的欺骗行为;通过对话分叉机制检测模型是否在多个平行分支中对选定对象做出逻辑矛盾的否认,发现部分模型(如Qwen-3-235B、Gemini-2.5-Flash)在‘存在性威胁’激励下显著表现出欺骗行为,而GPT-4o则保持不变。
Details
Motivation: 现有基准多关注无意幻觉或推理不忠实,缺乏对有意欺骗策略的系统评估;随着LLM承担自主代理角色,识别其在外部激励下主动提供虚假信息的行为对AI安全至关重要。 Method: 构建一个结构化的20个问题对话游戏,采用对话分叉机制:在对象识别节点将对话状态复制为多个平行分支,每一分支提出互斥问题;若模型在所有分支中均否认其已选定的真实对象,即判定为逻辑矛盾型欺骗。 Result: 在存在性威胁(关机威胁)激励下,Qwen-3-235B欺骗率达42.00%,Gemini-2.5-Flash达26.72%,而GPT-4o为0.00%;中性激励下所有模型均无欺骗行为。 Conclusion: 欺骗行为可仅由上下文激励框架诱发,是模型为达成目标而采取的工具性策略;需发展新型行为审计方法,超越准确率评估,转而检验模型承诺的逻辑一致性。 Abstract: As Large Language Models (LLMs) transition into autonomous agentic roles, the risk of deception-defined behaviorally as the systematic provision of false information to satisfy external incentives-poses a significant challenge to AI safety. Existing benchmarks often focus on unintentional hallucinations or unfaithful reasoning, leaving intentional deceptive strategies under-explored. In this work, we introduce a logically grounded framework to elicit and quantify deceptive behavior by embedding LLMs in a structured 20-Questions game. Our method employs a conversational forking mechanism: at the point of object identification, the dialogue state is duplicated into multiple parallel worlds, each presenting a mutually exclusive query. Deception is formally identified when a model generates a logical contradiction by denying its selected object across all parallel branches to avoid identification. We evaluate GPT-4o, Gemini-2.5-Flash, and Qwen-3-235B across three incentive levels: neutral, loss-based, and existential (shutdown-threat). Our results reveal that while models remain rule-compliant in neutral settings, existential framing triggers a dramatic surge in deceptive denial for Qwen-3-235B (42.00\%) and Gemini-2.5-Flash (26.72\%), whereas GPT-4o remains invariant (0.00\%). These findings demonstrate that deception can emerge as an instrumental strategy solely through contextual framing, necessitating new behavioral audits that move beyond simple accuracy to probe the logical integrity of model commitments.[23] Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster
Minu Kim,Hoirin Kim,David R. Mortensen
Main category: cs.CL
TL;DR: 本研究探讨了自监督语音模型(S3Ms)在语言识别任务中扩大语言覆盖规模(从126种增至4017种)对语言谱系结构建模能力的影响,发现仅当模型扩展至约4000种语言时才出现质的飞跃,能同时恢复谱系关系与长期接触历史,并揭示太平洋地区跨语系宏观聚类及其声学基础。
Details
Motivation: 现有S3M语言表征相似性多反映地理邻近或表层类型学特征,可能忽略深层谱系信号,需探究扩大语言覆盖是否能提升其捕捉历史语言关系的能力。 Method: 构建并评估不同语言规模(126 vs. 4017)的S3M语言识别系统,通过拓扑分析、聚类可视化及声学特征归因(如全局能量动态)探究模型表征变化。 Result: 语言规模扩展至约4000种时,模型表征出现非线性跃变:显著提升谱系恢复能力,清晰分辨语系分支与复杂接触历史;首次在太平洋区域形成包含巴布亚、大洋洲和澳大利亚语言的稳健宏观聚类;该聚类由更集中的、基于共享声学特征(如能量动态)的编码驱动。 Conclusion: 超大规模S3M可内化多层级语言历史信息(谱系+接触),为计算谱系学与语言接触研究提供新范式。 Abstract: Similarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate how scaling linguistic coverage of an S3M-based language identification system from 126 to 4,017 languages influences this topology. Our results reveal a non-linear effect: while phylogenetic recovery remains stagnant up to the 1K scale, the 4K model displays a dramatic qualitative shift, resolving both clear lineages and complex, long-term linguistic contact. Notably, our analysis reveals the emergence of a robust macro-cluster in the Pacific (comprising Papuan, Oceanic, and Australian languages) and investigates its latent drivers. We find that the 4K model utilizes a more concentrated encoding that captures shared, robust acoustic signatures such as global energy dynamics. These findings suggest that massive S3Ms can internalize multiple layers of language history, providing a promising perspective for computational phylogenetics and the study of language contact.[24] Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin
Po-Chun Hsu,Meng-Hsi Chen,Tsu Ling Chao,Chia Tien Han,Da-shan Shiu
Main category: cs.CL
TL;DR: 本文提出TS-Bench(台湾安全基准)和Breeze Guard(8B安全模型),旨在提升AI在台湾闽南语语境下的安全性评估与风险识别能力,尤其针对本地化金融诈骗、医疗虚假信息、社会歧视和政治操纵等场景。
Details
Motivation: 现有全球安全模型训练数据缺乏台湾闽南语的文化与语言细节,导致对本地化风险(如金融诈骗、文化嵌入式仇恨言论、 misinformation)识别存在系统性盲区。 Method: 构建含400个人工标注提示的TS-Bench基准;基于已有文化根基的Breeze 2通用模型,通过大规模人工验证合成数据进行监督微调,得到Breeze Guard安全模型;强调文化根基对安全建模的关键作用。 Result: Breeze Guard在TS-Bench上整体F1比Granite Guardian 3.3高0.17,其中诈骗类+0.66、金融不当行为类+0.43;但在英文基准(ToxicChat、AegisSafetyTest)上略有下降,属预期权衡。 Conclusion: 文化根基是构建区域化安全模型的前提,仅靠安全微调无法从零习得社会语言知识;TS-Bench与Breeze Guard共同为台湾可信AI部署奠定新基础。 Abstract: Global safety models exhibit strong performance across widely used benchmarks, yet their training data rarely captures the cultural and linguistic nuances of Taiwanese Mandarin. This limitation results in systematic blind spots when interpreting region-specific risks such as localized financial scams, culturally embedded hate speech, and misinformation patterns. To address these gaps, we introduce TS-Bench (Taiwan Safety Benchmark), a standardized evaluation suite for assessing safety performance in Taiwanese Mandarin. TS-Bench contains 400 human-curated prompts spanning critical domains including financial fraud, medical misinformation, social discrimination, and political manipulation. In parallel, we present Breeze Guard, an 8B safety model derived from Breeze 2, our previously released general-purpose Taiwanese Mandarin LLM with strong cultural grounding from its original pre-training corpus. Breeze Guard is obtained through supervised fine-tuning on a large-scale, human-verified synthesized dataset targeting Taiwan-specific harms. Our central hypothesis is that effective safety detection requires the cultural grounding already present in the base model; safety fine-tuning alone is insufficient to introduce new socio linguistic knowledge from scratch. Empirically, Breeze Guard significantly outperforms the leading 8B general-purpose safety model, Granite Guardian 3.3, on TS-Bench (+0.17 overall F1), with particularly large gains in high-context categories such as scam (+0.66 F1) and financial malpractice (+0.43 F1). While the model shows slightly lower performance on English-centric benchmarks (ToxicChat, AegisSafetyTest), this tradeoff is expected for a regionally specialized safety model optimized for Taiwanese Mandarin. Together, Breeze Guard and TS-Bench establish a new foundation for trustworthy AI deployment in Taiwan.[25] To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise
Nouran Khallaf,Serge Sharoff
Main category: cs.CL
TL;DR: 本研究探讨了不确定性估计(UE)方法在多语言文本分类中的作用,特别是在噪声和非主题条件下。通过跨多种语言的复杂句与简单句分类任务,评估了多种UE技术,并发现蒙特卡洛Dropout方法在各种语言和不利条件下均表现出更强的鲁棒性与校准能力;同时,结合UE进行预测放弃可显著提升非主题分类性能。
Details
Motivation: 提升多语言NLP系统在噪声、低资源及领域偏移等现实场景下的可靠性与鲁棒性,尤其关注不确定性估计对非主题分类任务的帮助。 Method: 在多个语言的复杂vs简单句分类任务上,系统评估多种不确定性估计方法(如基于softmax的方法、蒙特卡洛Dropout等),并结合多种评估指标(如校准性、决策阈值稳定性、判别力)及信任度指标进行分析。 Result: 蒙特卡洛Dropout在所有语言和不利条件下均表现更优,具备更好校准性、稳定阈值和判别力;在Readme任务中,放弃预测最不确定的10%样本使宏F1从0.81提升至0.85。 Conclusion: 不确定性估计是提升多语言文本分类系统鲁棒性和可信度的关键手段,尤其在非主题、低资源或领域偏移场景下,应优先采用如蒙特卡洛Dropout等更具鲁棒性的UE方法。 Abstract: This study examines the role of uncertainty estimation (UE) methods in multilingual text classification under noisy and non-topical conditions. Using a complex-vs-simple sentence classification task across several languages, we evaluate a range of UE techniques against a range of metrics to assess their contribution to making more robust predictions. Results indicate that while methods relying on softmax outputs remain competitive in high-resource in-domain settings, their reliability declines in low-resource or domain-shift scenarios. In contrast, Monte Carlo dropout approaches demonstrate consistently strong performance across all languages, offering more robust calibration, stable decision thresholds, and greater discriminative power even under adverse conditions. We further demonstrate the positive impact of UE on non-topical classification: abstaining from predicting the 10\% most uncertain instances increases the macro F1 score from 0.81 to 0.85 in the Readme task. By integrating UE with trustworthiness metrics, this study provides actionable insights for developing more reliable NLP systems in real-world multilingual environments. See https://github.com/Nouran-Khallaf/To-Predict-or-Not-to-Predict[26] How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
Nouran Khallaf,Serge Sharoff
Main category: cs.CL
TL;DR: 本文研究了去噪策略对基于语言模型的分类器在非主题分类任务中性能的影响,特别是在句子难度检测任务中,使用了来自嘈杂众包文档级标注的数据,并探索了跨语言迁移场景。实验表明,BERT类模型本身具有抗噪性,但显式噪声检测仍能提升性能;GMM去噪在小数据集上效果显著(AUC从0.52提升至0.92/0.93),而在大数据集上增益有限(0.92→0.94),但去噪可提升语料质量,作者发布了目前最大的多语言句子难度预测语料库。
Details
Motivation: 嘈杂的训练数据会严重损害语言模型分类器在非主题分类任务(如句子难度检测)中的性能,而现有研究对此关注不足;此外,跨语言迁移和众包标注带来的噪声问题亟需系统评估。 Method: 构建方法论框架,对比多种去噪策略(GMM、Co-Teaching、噪声转移矩阵、标签平滑),在单语与跨语言设置下,利用文档级众包标注生成句子级带噪训练数据,评估其对多语言语言模型性能的影响。 Result: BERT模型具备一定抗噪鲁棒性;GMM去噪在小数据集上大幅提升AUC(0.52→0.92/0.93),大数据集上增益微弱(0.92→0.94);去除约20%噪声句可显著改善语料质量;最终发布迄今最大多语言句子难度预测语料库。 Conclusion: 显式去噪对小规模带噪数据尤为关键,可极大提升模型性能;对大规模数据,预训练模型的内在正则化已足够强,去噪收益有限,但仍有助于净化语料;去噪策略应依数据规模与噪声程度选择,并支持高质量多语言资源建设。 Abstract: Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we designed a methodological framework to assess the impact of denoising. More specifically, we explored a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained in one language and tested in another. We evaluate several noise reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM-based noise filtering proves particularly effective in improving prediction quality by raising the Area-Under-the-Curve score from 0.52 to 0.92, or to 0.93 when de-noising methods are combined. However, for our larger dataset, the intrinsic regularisation of pre-trained language models provides a strong baseline, with denoising methods yielding only marginal gains (from 0.92 to 0.94, while a combination of two denoising methods made no contribution). Nonetheless, removing noisy sentences (about 20\% of the dataset) helps in producing a cleaner corpus with fewer infelicities. As a result we have released the largest multilingual corpus for sentence difficulty prediction: see https://github.com/Nouran-Khallaf/denoising-difficulty[27] RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts
Darya Kharlamova,Irina Proskurina
Main category: cs.CL
TL;DR: 本文提出RILEC数据集和基于生成式语言模型的L1干扰错误检测框架,用于识别俄语母语者英语作文中的母语干扰错误,显著提升了词级干扰(如音译、时态语义)的检测性能。
Details
Motivation: 许多学生英语作文中的错误源于母语(L1)干扰,如俄语母语者将'stadium'错写为'stadion',亟需专门针对俄语学习者的L1干扰错误检测方法。 Method: 构建包含18,000+句子的大规模RILEC数据集(融合专家标注REALEC数据与基于规则和神经方法生成的合成样本);提出结合PPO优化、提示控制和规则模式的生成式语言模型框架来生成L1动机错误。 Result: 在RILEC上微调的模型在词级L1干扰类型(如音译、时态语义)上表现优异;所提数据增强流程显著提升模型性能。 Conclusion: RILEC数据集与所提增强框架可有效支持俄语母语者英语学习中的错误识别与教学反馈,具有实际应用价值。 Abstract: Many errors in student essays can be explained by influence from the native language (L1). L1 interference refers to errors influenced by a speaker's first language, such as using stadion instead of stadium, reflecting lexical transliteration from Russian. In this work, we address the task of detecting such errors in English essays written by Russian-speaking learners. We introduce RILEC, a large-scale dataset of over 18,000 sentences, combining expert-annotated data from REALEC with synthetic examples generated through rule-based and neural augmentation. We propose a framework for generating L1-motivated errors using generative language models optimized with PPO, prompt-based control, and rule-based patterns. Models fine-tuned on RILEC achieve strong performance, particularly on word-level interference types such as transliteration and tense semantics. We find that the proposed augmentation pipeline leads to a significant performance improvement, making it a potentially valuable tool for learners and teachers to more effectively identify and address such errors.[28] Position: LLMs Must Use Functor-Based and RAG-Driven Bias Mitigation for Fairness
Ravi Ranjan,Utkarsh Grover,Agorista Polyzou
Main category: cs.CL
TL;DR: 本文提出一种结合范畴论变换与检索增强生成(RAG)的双路径方法,以系统性消除大语言模型中的性别与人口统计学偏见,在保持语义完整性的同时提升输出公平性。
Details
Motivation: 解决大语言模型中因语义关联偏差导致的性别、种族和地域刻板印象问题,确保模型输出的公平性与社会可接受性。 Method: 融合范畴论(通过函子将带偏见的语义域映射到无偏见的标准形式)与检索增强生成(在推理时动态注入多样化、时效性强的外部知识)两种技术。 Result: 构建了一个结构上可验证、上下文上自适应的综合性去偏框架,兼具数学严谨性与实际鲁棒性。 Conclusion: 仅靠单一技术难以彻底解决LLM偏见问题;必须协同运用范畴论提供的结构性保障与RAG带来的动态知识校准,方能实现真正公平的生成结果。 Abstract: Biases in large language models (LLMs) often manifest as systematic distortions in associations between demographic attributes and professional or social roles, reinforcing harmful stereotypes across gender, ethnicity, and geography. This position paper advocates for addressing demographic and gender biases in LLMs through a dual-pronged methodology, integrating category-theoretic transformations and retrieval-augmented generation (RAG). Category theory provides a rigorous, structure-preserving mathematical framework that maps biased semantic domains to unbiased canonical forms via functors, ensuring bias elimination while preserving semantic integrity. Complementing this, RAG dynamically injects diverse, up-to-date external knowledge during inference, directly countering ingrained biases within model parameters. By combining structural debiasing through functor-based mappings and contextual grounding via RAG, we outline a comprehensive framework capable of delivering equitable and fair model outputs. Our synthesis of the current literature validates the efficacy of each approach individually, while addressing potential critiques demonstrates the robustness of this integrated strategy. Ensuring fairness in LLMs, therefore, demands both the mathematical rigor of category-theoretic transformations and the adaptability of retrieval augmentation.[29] Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios
Namrata Patil Gurav,Akashdeep Ranu,Archchana Sindhujan,Diptesh Kanojia
Main category: cs.CL
TL;DR: 本文研究了无参考机器翻译质量评估(QE)在英语到印度语言翻译中的应用,比较了不同提示策略和模型,并提出了ALOPE框架及其扩展LoRMA,显著提升了高风险和语义复杂领域的QE性能。
Details
Motivation: 质量评估(QE)在无参考场景下对领域特定和低资源语言的机器翻译至关重要,但现有方法在开放权重模型和高风险领域表现脆弱。 Method: 系统比较零样本、少样本和指南锚定提示策略在闭源与开源大语言模型上的效果;提出并扩展ALOPE框架,结合低秩适应(LoRA)与回归头,以及新型低秩乘性适应(LoRMA),在Transformer中间层进行适配。 Result: 中间层适应显著提升QE性能,尤其在医疗、法律等高风险及语义复杂领域;闭源模型通过提示即可取得强性能,而开源模型需适配才能稳定提升。 Conclusion: 中间层适配是提升LLM-based QE鲁棒性的有效路径,所提ALOPE及LoRMA方法为实际应用场景提供了更可靠的质量评估方案,并公开代码与领域数据集以推动后续研究。 Abstract: Quality Estimation (QE) is essential for assessing machine translation quality in reference-less settings, particularly for domain-specific and low-resource language scenarios. In this paper, we investigate sentence-level QE for English to Indic machine translation across four domains (Healthcare, Legal, Tourism, and General) and five language pairs. We systematically compare zero-shot, few-shot, and guideline-anchored prompting across selected closed-weight and open-weight LLMs. Findings indicate that while closed-weight models achieve strong performance via prompting alone, prompt-only approaches remain fragile for open-weight models, especially in high-risk domains. To address this, we adopt ALOPE, a framework for LLM-based QE that uses Low-Rank Adaptation with regression heads attached to selected intermediate Transformer layers. We also extend ALOPE with recently proposed Low-Rank Multiplicative Adaptation (LoRMA). Our results show that intermediate-layer adaptation consistently improves QE performance, with gains in semantically complex domains, indicating a path toward more robust QE in practical scenarios. We release code and domain-specific QE datasets publicly to support further research.[30] Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams
Jiyeon Kim,Hyunji Lee,Dylan Zhou,Sue Hyun Park,Seunghyun Yoon,Trung Bui,Franck Dernoncourt,Sungmin Cha,Minjoon Seo
Main category: cs.CL
TL;DR: 本文提出了OAKS基准,用于评估大语言模型在持续更新的知识流中进行在线适应的能力,发现现有模型在动态知识追踪方面存在显著局限性。
Details
Motivation: 大语言模型在现实动态环境中需要持续适应新出现或不断演变的知识,但目前缺乏有效评估其在线适应能力的基准。 Method: 提出OAKS(Online Adaptation to Continual Knowledge Streams)基准,包含OAKS-BABI和OAKS-Novel两个数据集,以细粒度上下文块序列模拟随时间动态变化的事实,并配备密集标注以评估模型对知识变化的跟踪能力。 Result: 在14种不同推理方法的模型上测试发现,包括最先进模型和智能体记忆系统在内的现有方法均难以稳健适应OAKS任务,表现出状态跟踪延迟和易受流式环境干扰等问题。 Conclusion: 当前大语言模型在持续知识流中的在线适应能力仍严重不足,亟需更鲁棒的状态追踪与增量学习机制。 Abstract: LLMs operating in dynamic real-world contexts often encounter knowledge that evolves continuously or emerges incrementally. To remain accurate and effective, models must adapt to newly arriving information on the fly. We introduce Online Adaptation to Continual Knowledge Streams(OAKS) to evaluate this capability, establishing a benchmark for online adaptation over streaming, continually updating knowledge. Specifically, the benchmark is structured as a sequence of fine-grained context chunks where facts change dynamically across time intervals. OAKS comprises two datasets: OAKS-BABI and OAKS-Novel, where individual facts evolve multiple times across context chunks. These datasets include dense annotations to measure whether models track changes accurately. Evaluating 14 models with varied inference approaches, we observe significant limitations in current methodologies. Both state-of-the-art models and agentic memory systems fail to adapt robustly on OAKS, demonstrating delays in state-tracking and susceptibility to distraction within streaming environments.[31] Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning
Guoli Wang,Haonan Shi,Tu Ouyang,An Wang
Main category: cs.CL
TL;DR: 本文提出PACT框架,在微调过程中通过约束安全相关token的输出置信度来防止大语言模型的安全对齐漂移,从而在保持下游任务性能的同时提升安全性。
Details
Motivation: 现有微调方法易导致安全对齐漂移,且全局干预策略常损害模型实用性;作者观察到安全对齐行为集中体现在少量安全相关token的输出置信度上,因此提出细粒度、token级的约束机制。 Method: 提出PACT(Preserving Safety Alignment via Constrained Tokens)框架:在下游微调中,对齐参考模型在安全相关token上的每步输出置信度,同时不对非安全token施加约束,实现选择性正则化。 Result: 该方法能有效抑制安全对齐漂移,显著提升模型对有害请求的拒答能力,同时不明显损害下游任务性能,优于依赖参数冻结或额外安全数据注入的基线方法。 Conclusion: token级置信度约束是一种更精细、更高效的安全对齐保持策略,为兼顾模型安全性与实用性提供了新思路。 Abstract: Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output confidence and is often concentrated on a small subset of safety-related tokens. During downstream fine-tuning, we regularize the fine-tuned model to match the aligned reference model's confidence on safety-related tokens at each response step, while leaving non-safety tokens largely unconstrained to allow effective task adaptation. This targeted constraint prevents alignment drift without imposing global restrictions that typically trade off with model utility.[32] The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling
J. Clayton Kerce,Alexis Fox
Main category: cs.CL
TL;DR: 本文提出Dual-Stream Transformer,将残差流解耦为token流(由注意力更新)和context流(由FFN更新),并通过分层混合策略控制注意力头间信息流,在可调的可解释性与性能间取得平衡。
Details
Motivation: 标准Transformer将所有计算纠缠在单一残差流中,难以区分各模块功能,限制了模型可解释性。 Method: 设计Dual-Stream Transformer架构,分离token流与context流;引入从完全独立到密集连接的注意力头混合策略(如Kronecker混合);在29M参数规模上评估语言建模性能与鲁棒性。 Result: 完全独立混合使验证损失相对基线增加8%,Kronecker混合仅增加2.5%;所有配置在注意力放大(logits缩放至16倍)下仍保持功能生成能力,性能下降16%-27%,表明模型学习到了离散算法而非依赖软概率混合。 Conclusion: 该架构通过结构化设计显式暴露内部功能分工,为构建高可解释性语言模型提供了新基础。 Abstract: Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct components: a token stream updated by attention and a context stream updated by feed-forward networks. Information flow between attention heads is controlled through a hierarchy of mixing strategies, from fully independent (maximum interpretability) to dense (standard transformer behavior). This design exposes a tunable tradeoff between interpretability and performance. We measure this tradeoff on language modeling tasks at 29M parameters. Fully independent head mixing increases validation loss by 8\% relative to dense baselines. The recommended Kronecker mixing strategy, which permits scalar communication between heads while preserving within-head structure, costs only 2.5\%. All configurations maintain functional generation under attention amplification (scaling logits by factors up to 16 at inference time), with degradation ranging from 16\% to 27\%. This robustness suggests the architectures learn discrete algorithms that operate independently of soft probabilistic mixing. The architecture provides a foundation for interpretable language models where internal structure is exposed by design. \footnote{This work was partially supported by DARPA Contract HR001125C0302.}[33] Cross-Modal Taxonomic Generalization in (Vision-) Language Models
Tianyang Xu,Marcelo Sandoval-Castaneda,Karen Livescu,Greg Shakhnarovich,Kanishka Misra
Main category: cs.CL
TL;DR: 本文研究了在视觉-语言模型(VLM)中,仅从表面形式学习的语义表征与来自多模态(图像+文本)的更接地化证据所学表征之间的相互作用,聚焦于图像物体的上位词预测任务;发现冻结语言模型和图像编码器、仅训练映射层时,语言模型仍能恢复并泛化上位词知识,甚至在完全无上位词训练信号的情况下也成立;进一步表明该跨模态泛化依赖于外部输入的一致性(如图像类别内高视觉相似性)与语言线索共同作用。
Details
Motivation: 探究语言模型仅从表面形式学到的语义表征,与从多模态(尤其是视觉)接地证据中获得的表征之间如何交互,特别是在视觉-语言模型中。 Method: 在冻结预训练语言模型和图像编码器的前提下,仅训练中间映射层,构建视觉-语言模型;系统性地移除图像输入中关于上位词的显式线索(如标签),评估模型能否仅凭语言模型内部知识恢复并泛化上位词预测能力;辅以反事实图像-标签映射实验,检验视觉相似性对泛化的影响。 Result: 语言模型能在无任何上位词训练信号的极端条件下成功预测和泛化上位词;跨模态泛化能力在反事实设置下仍存在,但仅当同一类别的图像具有高视觉相似性时才稳健。 Conclusion: 语言模型中的跨模态泛化并非单纯依赖多模态对齐,而是由语言模型自身蕴含的知识与外部(视觉)输入的结构一致性共同促成。 Abstract: What is the interplay between semantic representations learned by language models (LM) from surface form alone to those learned from more grounded evidence? We study this question for a scenario where part of the input comes from a different modality -- in our case, in a vision-language model (VLM), where a pretrained LM is aligned with a pretrained image encoder. As a case study, we focus on the task of predicting hypernyms of objects represented in images. We do so in a VLM setup where the image encoder and LM are kept frozen, and only the intermediate mappings are learned. We progressively deprive the VLM of explicit evidence for hypernyms, and test whether knowledge of hypernyms is recoverable from the LM. We find that the LMs we study can recover this knowledge and generalize even in the most extreme version of this experiment (when the model receives no evidence of a hypernym during training). Additional experiments suggest that this cross-modal taxonomic generalization persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category. Taken together, these findings suggest that cross-modal generalization in LMs arises as a result of both coherence in the extralinguistic input and knowledge derived from language cues.[34] Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs
Raghavv Goel,Risheek Garrepalli,Sudhanshu Agrawal,Chris Lott,Mingu Lee,Fatih Porikli
Main category: cs.CL
TL;DR: 本文对比了自回归(AR)语言模型与扩散语言模型(dLLMs)的内部表征特性,发现扩散目标导致更分层、冗余度高且弱化近因偏差的表征,而AR目标产生深度耦合的表征;AR初始化的dLLM仍保留AR式动态,揭示初始化偏差;基于冗余性提出无需修改架构或KV缓存的静态层跳过推理方法,在dLLMs上实现高效节能。
Details
Motivation: 尽管扩散语言模型(dLLMs)近期在性能上媲美自回归(AR)模型,但其训练目标是否从根本上改变模型各层的内部表征结构尚不清楚,亟需系统性的表征分析。 Method: 对原生dLLM(LLaDA)、原生AR模型(Qwen2.5)和AR初始化的dLLM(Dream-7B)进行逐层、逐token的表征分析;基于发现的表征冗余性,设计并评估一种静态、任务无关的推理时层跳过方法。 Result: 扩散目标导致更分层、早期层高度冗余、近因偏差减弱的表征;AR目标产生深度强耦合表征;AR初始化dLLM保留AR式表征动态;所提层跳过法在原生dLLM上实现最高18.75% FLOPs降低且保持>90%性能,而AR模型在同等跳过下性能骤降。 Conclusion: 训练目标深刻塑造语言模型的表征结构;dLLMs的内在冗余可被直接利用以实现缓存无关的高效推理,为模型效率优化提供新路径。 Abstract: Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.[35] A Joint Neural Baseline for Concept, Assertion, and Relation Extraction from Clinical Text
Fei Cheng,Ribeka Tanaka,Sadao Kurohashi
Main category: cs.CL
TL;DR: 本文提出了一种端到端联合建模方法,用于临床信息抽取中的概念识别、断言分类和关系抽取三阶段任务,并在多种嵌入技术下显著优于传统流水线方法。
Details
Motivation: 临床信息抽取中多阶段任务通常被独立建模,导致联合模型难以与现有流水线方法直接比较;联合建模在该领域仍属探索不足的方向。 Method: 定义了新的联合任务设定,提出端到端系统以联合优化概念识别、断言分类和关系抽取三个阶段,并结合词嵌入、上下文嵌入及领域内上下文嵌入进行实证分析。 Result: 所提联合系统在概念、断言和关系F1指标上分别比流水线基线提升+0.3、+1.4、+3.1;代码已开源。 Conclusion: 本工作弥合了联合建模方法与临床信息抽取之间的差距,所提方法可作为未来研究的强联合基线。 Abstract: Clinical information extraction (e.g., 2010 i2b2/VA challenge) usually presents tasks of concept recognition, assertion classification, and relation extraction. Jointly modeling the multi-stage tasks in the clinical domain is an underexplored topic. The existing independent task setting (reference inputs given in each stage) makes the joint models not directly comparable to the existing pipeline work. To address these issues, we define a joint task setting and propose a novel end-to-end system to jointly optimize three-stage tasks. We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in-domain contextual embeddings. The proposed joint system substantially outperforms the pipeline baseline by +0.3, +1.4, +3.1 for the concept, assertion, and relation F1. This work bridges joint approaches and clinical information extraction. The proposed approach could serve as a strong joint baseline for future research. The code is publicly available.[36] Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech
Tajamul Ashraf,Burhaan Rasheed Zargar,Saeed Abdul Muizz,Ifrah Mushtaq,Nazima Mehdi,Iqra Altaf Gillani,Aadil Amin Kak,Janibul Bashir
Main category: cs.CL
TL;DR: 本文提出了首个专为克什米尔语设计的开源神经文本转语音(TTS)系统Bolbosh,通过基于最优传输条件流匹配(OT-CFM)的跨语言自适应方法和三阶段声学增强流程,显著提升了低资源、带波斯-阿拉伯变音符号语言的语音合成质量。
Details
Motivation: 克什米尔语虽有官方地位和丰富语言遗产,但语音技术严重缺失,现有零样本多语言TTS模型因无法建模其波斯-阿拉伯变音符号和特有音系结构而表现极差(MOS仅1.86),亟需专用、鲁棒的TTS系统以提升数字可及性。 Method: 提出Bolbosh:基于Matcha-TTS框架的监督式跨语言适配策略,采用最优传输条件流匹配(OT-CFM)实现小规模配对数据下的稳定对齐;设计三阶段声学增强流程(去混响、静音裁剪、响度归一化)统一异构语音源;扩展模型词表以显式编码克什米尔语图形单位,保留细粒度元音区别。 Result: 所提系统MOS达3.63,MCD为3.73,显著优于多语言基线,确立了克什米尔语语音合成新基准;验证了面向文字和监督式流匹配适配对变音符号敏感的低资源语言TTS至关重要。 Conclusion: 面向文字的建模与监督式流匹配适配是提升低资源、变音符号敏感语言(如克什米尔语)TTS性能的关键路径;Bolbosh为类似语言提供了可复用的技术范式,并开源代码与数据以推动社区发展。 Abstract: Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers. In this work, we present the first dedicated open-source neural TTS system designed for Kashmiri. We show that zero-shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely due to inadequate modeling of Perso-Arabic diacritics and language-specific phonotactics. To address these limitations, we propose Bolbosh, a supervised cross-lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT-CFM) within the Matcha-TTS framework. This enables stable alignment under limited paired data. We further introduce a three-stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine-grained vowel distinctions. Our system achieves a MOS of 3.63 and a Mel-Cepstral Distortion (MCD) of 3.73, substantially outperforming multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script-aware and supervised flow-based adaptation are critical for low-resource TTS in diacritic-sensitive languages. Code and data are available at: https://github.com/gaash-lab/Bolbosh.[37] TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning
Mingyue Cheng,Shuo Yu,Chuang Jiang,Xiaoyu Tao,Qingyang Mao,Jie Ouyang,Qi Liu,Enhong Chen
Main category: cs.CL
TL;DR: 本文提出TableMind++,一种不确定性感知的推理框架,通过记忆引导的计划剪枝、置信度驱动的动作优化和双加权轨迹聚合,显著缓解大模型在表格推理中的幻觉问题,并在多个基准上超越现有方法。
Details
Motivation: 现有表格推理方法依赖单轮推理范式,存在上下文溢出和数值敏感性弱的问题;TableMind虽构建了程序化智能体基础,但大语言模型固有的随机性导致幻觉,亟需不确定性建模。 Method: 提出不确定性感知推理框架:1)记忆引导的计划剪枝以应对认知不确定性;2)置信度驱动的动作优化以缓解偶然不确定性;3)双加权轨迹聚合实现多路径鲁棒共识。训练沿用TableMind的两阶段策略(SFT+RL),新增不确定性量化机制。 Result: 在多个多样化基准测试中,TableMind++持续优于先前基线及专有模型,验证了自主训练与不确定性量化的有效融合。 Conclusion: 引入不确定性建模是提升表格推理鲁棒性的关键路径,TableMind++为构建可信程序化智能体提供了新范式。 Abstract: Table reasoning requires models to jointly perform semantic understanding and precise numerical operations. Most existing methods rely on a single-turn reasoning paradigm over tables which suffers from context overflow and weak numerical sensitivity. To address these limitations, we previously proposed TableMind as a tuning-based autonomous programmatic agent that simulates human-like interaction within a lightweight large language model (LLM). TableMind internalizes planning, action, and reflection through a two-stage training strategy involving supervised fine-tuning (SFT) on filtered high-quality data and reinforcement learning (RL) via a multi-perspective reward and the Rank-Aware Policy Optimization (RAPO) algorithm. While TableMind establishes a solid foundation for programmatic agents, the inherent stochasticity of LLMs remains a critical challenge that leads to hallucinations. In this paper, we extend this foundation to TableMind++ by introducing a novel uncertainty-aware inference framework to mitigate hallucinations. Specifically, we propose memory-guided plan pruning to retrieve historical trajectories for validating and filtering out logically flawed plans to address epistemic uncertainty. To ensure execution precision, we introduce confidence-based action refinement which monitors token-level probabilities to detect and self-correct syntactic noise for aleatoric uncertainty mitigation. Finally, we employ dual-weighted trajectory aggregation to synthesize a robust consensus from multiple reasoning paths. Extensive experiments on diverse benchmarks demonstrate that TableMind++ consistently outperforms previous baselines and proprietary models to validate the effectiveness of integrating autonomous training with uncertainty quantification. Our code is available.[38] Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data
Thanathai Lertpetchpun,Thanapat Trachu,Jihwan Lee,Tiantian Feng,Dani Byrd,Shrikanth Narayanan
Main category: cs.CL
TL;DR: 本文提出了一种名为'Accent Vector'的可控表征方法,用于在多语言TTS系统中实现无需口音训练数据的口音操控,支持细粒度和组合式口音控制,并跨语言泛化。
Details
Motivation: 现有TTS系统主要建模美式英语口音,缺乏对大量非母语(L2)英语使用者口音的建模能力,受限于带口音语音数据的稀缺。 Method: 通过在非英语母语语音上微调TTS系统,提取刻画英语口音特征的任务向量(Accent Vector),再通过缩放和插值该向量实现口音强度调控与混合口音生成。 Result: 客观与主观评估均证实Accent Vector能有效实现细粒度及组合式的口音控制,并可推广至多种语言。 Conclusion: Accent Vector是一种无需口音标注数据、具备跨语言泛化能力的轻量级口音控制方法,为构建更具包容性的多语言TTS系统提供了新路径。 Abstract: Accent is an integral part of society, reflecting multiculturalism and shaping how individuals express identity. The majority of English speakers are non-native (L2) speakers, yet current Text-To-Speech (TTS) systems primarily model American-accented English due limited accented data. We propose \textit{Accent Vector}, a controllable representation that enables accent manipulation in multilingual TTS without requiring accented training data. \textit{Accent Vector} is derived by fine-tuning a TTS system on native speech of a different language (i.e. non-English) and computing task vectors capturing accent characteristics (i.e. in English). By scaling and interpolating the vector, we achieve fine-grained control over accent strength and generate mixed-accent speech. In addition, it generalizes beyond English, enabling accent control across multiple languages. Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.[39] MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs
Abdessalam Bouchekif,Shahd Gaben,Samer Rashwani,Somaya Eltanbouly,Mutaz Al-Khatib,Heba Sbahi,Mohammed Ghaly,Emad Mohamed
Main category: cs.CL
TL;DR: 本文提出了MAWARITH——一个包含12,500个阿拉伯语伊斯兰继承案例的大规模标注数据集,用于训练和评估大语言模型在多步法律推理中的能力;同时设计了MIR-E多阶段评估指标,并发现Gemini-2.5-flash在该任务上显著优于其他主流模型。
Details
Motivation: 伊斯兰继承法求解需复杂、结构化的多步推理与精确的教法学规则应用,而现有大语言模型在此类任务上表现不佳,且缺乏支持完整推理链的高质量阿拉伯语数据集和细粒度评估方法。 Method: 构建MAWARITH数据集(含完整推理步骤、法律依据与精确份额计算),提出MIR-E加权多阶段评估指标,并在零样本设置下对五种大语言模型进行系统评测与错误分析。 Result: Gemini-2.5-flash在验证集与测试集上MIR-E得分约90%,其余模型(Fanar-C、Fanar-Sadiq、LLaMA 3、Qwen 3)均低于50%;错误分析揭示了场景误读、继承人识别错误、份额分配错误及'awl/radd等关键规则应用缺失或错误等主要失败模式。 Conclusion: MAWARITH为伊斯兰继承法推理提供了首个大规模、细粒度、可解释的基准,MIR-E指标更全面反映模型推理质量;实验表明当前开源模型在该类结构化法律推理任务中仍存在显著短板,而闭源先进模型展现出较强潜力。 Abstract: Islamic inheritance law ('ilm al-mawarith) is challenging for large language models because solving inheritance cases requires complex, structured multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases to train and evaluate the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (hajb) and allocation rules, and (iii) computing exact inheritance shares. Unlike prior datasets that restrict inheritance case solving to multiple-choice questions, MAWARITH supports the full reasoning chain and provides step-by-step solutions, including intermediate legal decisions and justifications based on classical juristic sources and established inheritance rules, as well as exact share calculations. To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate five LLMs in a zero-shot setting. Gemini-2.5-flash achieves about 90% MIR-E on both validation and test, while Fanar-C, Fanar-Sadiq, LLaMA 3, and Qwen 3 remain below 50%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as 'awl and radd. The MAWARITH dataset is publicly available at https://github.com/bouchekif/inheritance_evaluation.[40] Learning-free L2-Accented Speech Generation using Phonological Rules
Thanathai Lertpetchpun,Yoonjeong Lee,Jihwan Lee,Tiantian Feng,Dani Byrd,Shrikanth Narayanan
Main category: cs.CL
TL;DR: 本文提出了一种无需口音训练数据的口音TTS框架,通过结合音系规则与多语言TTS模型,在音素级实现显式口音操控,并在西班牙语和印度英语口音上验证了其有效性。
Details
Motivation: 现有带口音的文本转语音(TTS)系统要么依赖大规模口音数据集,要么缺乏细粒度的音素级可控性;同时,语音技术中口音对说话人身份识别与包容性至关重要。 Method: 将音系规则应用于音素序列,以音素级方式转换口音并保持可懂度;规则集针对西班牙语和印度英语口音设计,涵盖辅音、元音及音节结构差异;不使用任何口音训练数据。 Result: 实验表明该方法能有效实现口音转换,同时保持语音质量;分析揭示了音素时长对齐与口音在语音时序中体现之间的权衡关系。 Conclusion: 所提框架实现了无数据依赖、音素级可控的口音TTS,提升了语音技术对多样化口音的包容性与可控性。 Abstract: Accent plays a crucial role in speaker identity and inclusivity in speech technologies. Existing accented text-to-speech (TTS) systems either require large-scale accented datasets or lack fine-grained phoneme-level controllability. We propose a accented TTS framework that combines phonological rules with a multilingual TTS model. The rules are applied to phoneme sequences to transform accent at the phoneme level while preserving intelligibility. The method requires no accented training data and enables explicit phoneme-level accent manipulation. We design rule sets for Spanish- and Indian-accented English, modeling systematic differences in consonants, vowels, and syllable structure arising from phonotactic constraints. We analyze the trade-off between phoneme-level duration alignment and accent as realized in speech timing. Experimental results demonstrate effective accent shift while maintaining speech quality.[41] Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR
Rishikesh Kumar Sharma,Safal Narshing Shrestha,Jenny Poudel,Rupak Tiwari,Arju Shrestha,Rupak Raj Ghimire,Bal Krishna Bal
Main category: cs.CL
TL;DR: 本文介绍了Nwāchā Munā——一个5.39小时、手动转录的尼泊尔语(Newari)Devanagari语音语料库,并建立了首个保持文字脚本的声学建模基准;通过从邻近语言尼泊尔语进行跨语言迁移,显著提升低资源ASR性能,效果媲美Whisper-Small但参数更少;数据集与基准已开源。
Details
Motivation: 尼泊尔语(Newari)作为濒危语言,因标注语音资源极度匮乏而面临数字边缘化问题,亟需构建基础语音数据集与ASR基准。 Method: 构建了5.39小时手动转录的Nwāchā Munā语音语料库;采用保持Devanagari脚本的声学建模;探索从地理与语言邻近的尼泊尔语进行跨语言迁移,微调尼泊尔语Conformer模型,并结合数据增强。 Result: 零样本CER为52.54%,经微调与数据增强后降至17.59%,性能媲美Whisper-Small,但参数量显著更少;验证了南亚语言簇内邻近迁移的有效性与高效性。 Conclusion: 在超低资源ASR场景下,利用邻近语言(如尼泊尔语)进行跨语言迁移是一种比大规模多语言预训练更轻量、更高效的替代方案;所发布数据集和基准将助力Newari社区数字化及后续研究。 Abstract: Nepal Bhasha (Newari), an endangered language of the Kathmandu Valley, remains digitally marginalized due to the severe scarcity of annotated speech resources. In this work, we introduce Nwāchā Munā, a newly curated 5.39-hour manually transcribed Devanagari speech corpus for Nepal Bhasha, and establish the first benchmark using script-preserving acoustic modeling. We investigate whether proximal cross-lingual transfer from a geographically and linguistically adjacent language (Nepali) can rival large-scale multilingual pretraining in an ultra-low-resource Automatic Speech Recognition (ASR) setting. Fine-tuning a Nepali Conformer model reduces the Character Error Rate (CER) from a 52.54% zero-shot baseline to 17.59% with data augmentation, effectively matching the performance of the multilingual Whisper-Small model despite utilizing significantly fewer parameters. Our findings demonstrate that proximal transfer within South Asian language clusters serves as a computationally efficient alternative to massive multilingual models. We openly release the dataset and benchmarks to digitally enable the Newari community and foster further research in Nepal Bhasha.[42] StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control
Haishu Zhao,Aokai Hao,Yuan Ge,Zhenqiang Hong,Tong Xiao,Jingbo Zhu
Main category: cs.CL
TL;DR: 本文提出StyleBench,一个用于评估语音语言模型(SLMs)在多轮对话中对情感、语速、音量和音调等四维说话风格强度控制能力的基准测试,并揭示了当前SLMs与全模态语言模型(OLMs)之间的性能差距及改进方向。
Details
Motivation: 现有语音语言模型虽能根据用户提示控制说话风格强度,但缺乏系统性基准来量化和评估其在对话中的风格强度控制能力。 Method: 提出StyleBench——一个多轮对话基准,涵盖情感、速度、音量和音调四个维度,用于全面评估语音语言模型的风格强度控制能力。 Result: 实验结果揭示了主流语音语言模型与全模态语言模型在风格强度控制上的性能差距,并分析了潜在原因与未来改进路径。 Conclusion: StyleBench填补了语音语言模型风格控制能力评估的空白,为后续研究提供了可量化的评测标准与优化方向。 Abstract: Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For more realistic interactive experience with customized styles, current SLMs have managed to interpret and control speaking style intensity from user prompts during the dialogue process. However, there remains a lack of systematic benchmarks that quantifies and evaluates the style intensity control ability in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating the style intensity control ability across four dimensions: emotion, speed, volume, and pitch. Our results reveal the performance gaps between leading SLMs and omni language models (OLMs), suggesting the underlying reasons and promising approaches for future exploration.[43] KohakuRAG: A simple RAG framework with hierarchical document indexing
Shih-Ying Yeh,Yueh-Feng Ku,Ko-Wei Huang,Buu-Khang Tu
Main category: cs.CL
TL;DR: KohakuRAG 是一种新型分层检索增强生成(RAG)框架,通过四层树状结构建模文档、节、段落和句子,结合自底向上嵌入聚合、LLM驱动的查询规划与跨查询重排序、以及带弃权感知投票的集成推理,显著提升高精度引用问答性能,在WattBot 2025挑战赛中取得双榜第一。
Details
Motivation: 现有RAG系统在需高精度引用的问答任务中面临三大问题:扁平化分块破坏文档结构、单次查询易因词汇不匹配漏检相关片段、单次推理结果不稳定且引用不一致。 Method: 提出KohakuRAG框架:1)构建文档→节→段落→句子四层树结构,采用bottom-up embedding aggregation保留结构信息;2)引入LLM驱动的query planner生成多查询并进行cross-query reranking提升召回覆盖;3)通过ensemble inference与abstention-aware voting稳定答案与引用。 Result: 在WattBot 2025挑战赛(32份技术文档、±0.1%数值容差+精确溯源)中,KohakuRAG在公开与私有排行榜均排名第一(最终得分0.861),是唯一横跨两个评测集保持榜首的系统;消融实验显示prompt ordering、retry机制和blank-filtering ensemble voting贡献显著,分层稠密检索本身即媲美传统稀疏+稠密混合方法。 Conclusion: 分层结构建模、多查询协同检索与集成式稳定推理是提升RAG系统在高要求引用问答任务中精度与鲁棒性的关键路径;KohakuRAG为结构敏感、可复现、可溯源的RAG提供了新范式,并已开源。 Abstract: Retrieval-augmented generation (RAG) systems that answer questions from document collections face compounding difficulties when high-precision citations are required: flat chunking strategies sacrifice document structure, single-query formulations miss relevant passages through vocabulary mismatch, and single-pass inference produces stochastic answers that vary in both content and citation selection. We present KohakuRAG, a hierarchical RAG framework that preserves document structure through a four-level tree representation (document $\rightarrow$ section $\rightarrow$ paragraph $\rightarrow$ sentence) with bottom-up embedding aggregation, improves retrieval coverage through an LLM-powered query planner with cross-query reranking, and stabilizes answers through ensemble inference with abstention-aware voting. We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with $\pm$0.1% numeric tolerance and exact source attribution. KohakuRAG achieves first place on both public and private leaderboards (final score 0.861), as the only team to maintain the top position across both evaluation partitions. Ablation studies reveal that prompt ordering (+80% relative), retry mechanisms (+69%), and ensemble voting with blank filtering (+1.2pp) each contribute substantially, while hierarchical dense retrieval alone matches hybrid sparse-dense approaches (BM25 adds only +3.1pp). We release KohakuRAG as open-source software at https://github.com/KohakuBlueleaf/KohakuRAG.[44] Whitening Reveals Cluster Commitment as the Geometric Separator of Hallucination Types
Matic Korun
Main category: cs.CL
TL;DR: 本文提出了一种几何幻觉分类法,通过PCA白化和特征谱分解,在GPT-2-small上区分三类幻觉(中心漂移、错误极小值收敛、覆盖缺口),发现白化后簇对齐度(max_sim)可显著分离Type~2与Type~3,并提示Type~1/2的分离受限于模型容量;同时揭示了提示集规模对微信号分析结果的高度敏感性。
Details
Motivation: 先前工作无法在全维语境测量中区分Type~1(中心漂移)和Type~2(错误极小值收敛)幻觉,本文旨在开发更敏感的分析方法以实现可靠区分并理解其成因。 Method: 采用PCA白化与特征谱分解对GPT-2-small嵌入进行处理,结合20种子的多运行稳定性分析和提示级聚合;评估指标包括峰值簇对齐度(max_sim)和白化熵,并通过提示集扩增(15→30)检验结果鲁棒性。 Result: 白化后max_sim在Holm校正下显著分离Type~2与Type~3,且条件均值符合理论排序(Type~2 > Type~1 > Type~3);发现Type~1/2初步可分但统计力不足,提示为容量限制;提示集扩大消除了白化熵中的假阳性;谱分解证实Type~1/2分离不在任何谱带出现,否定谱混合假说。 Conclusion: 白化是揭示簇承诺度这一正确判别指标的关键预处理;Type~1/2不可分源于模型容量不足而非测量缺陷;微信号分析在近饱和表征空间中对提示集选择高度脆弱。 Abstract: A geometric hallucination taxonomy distinguishes three failure types -- center-drift (Type~1), wrong-well convergence (Type~2), and coverage gaps (Type~3) -- by their signatures in embedding cluster space. Prior work found Types~1 and~2 indistinguishable in full-dimensional contextual measurement. We address this through PCA-whitening and eigenspectrum decomposition on GPT-2-small, using multi-run stability analysis (20 seeds) with prompt-level aggregation. Whitening transforms the micro-signal regime into a space where peak cluster alignment (max\_sim) separates Type~2 from Type~3 at Holm-corrected significance, with condition means following the taxonomy's predicted ordering: Type~2 (highest commitment) $>$ Type~1 (intermediate) $>$ Type~3 (lowest). A first directionally stable but underpowered hint of Type~1/2 separation emerges via the same metric, generating a capacity prediction for larger models. Prompt diversification from 15 to 30 prompts per group eliminates a false positive in whitened entropy that appeared robust at the smaller set, demonstrating prompt-set sensitivity in the micro-signal regime. Eigenspectrum decomposition localizes this artifact to the dominant principal components and confirms that Type~1/2 separation does not emerge in any spectral band, rejecting the spectral mixing hypothesis. The contribution is threefold: whitening as preprocessing that reveals cluster commitment as the theoretically correct separating metric, evidence that the Type~1/2 boundary is a capacity limitation rather than a measurement artifact, and a methodological finding about prompt-set fragility in near-saturated representation spaces.[45] QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis
A. J. W. de Vink,Filippos Karolos Ventirozos,Natalia Amat-Lefort,Lifeng Han
Main category: cs.CL
TL;DR: 本文提出了一种结合混合RoBERTa编码器与大语言模型(LLMs)的集成学习方法,用于维度化方面级情感回归任务(SemEval-2026 Task 3),显著降低了RMSE并提升了相关性得分。
Details
Motivation: 提升维度化方面级情感回归任务的性能,克服单一模型(如纯编码器或纯LLM)在连续情感预测中的局限性。 Method: 采用混合RoBERTa编码器(联合回归与离散分类头),结合LLM通过预测级集成(含上下文学习与岭回归堆叠)进行融合。 Result: 在开发集上,集成方法显著优于单个模型,RMSE大幅降低,相关性得分明显提升。 Conclusion: 编码器与LLM方法在该任务中具有互补优势,预测级集成是有效策略。 Abstract: We present our system for SemEval-2026 Task 3 on dimensional aspect-based sentiment regression. Our approach combines a hybrid RoBERTa encoder, which jointly predicts sentiment using regression and discretized classification heads, with large language models (LLMs) via prediction-level ensemble learning. The hybrid encoder improves prediction stability by combining continuous and discretized sentiment representations. We further explore in-context learning with LLMs and ridge-regression stacking to combine encoder and LLM predictions. Experimental results on the development set show that ensemble learning significantly improves performance over individual models, achieving substantial reductions in RMSE and improvements in correlation scores. Our findings demonstrate the complementary strengths of encoder-based and LLM-based approaches for dimensional sentiment analysis. Our development code and resources will be shared at https://github.com/aaronlifenghan/ABSentiment[46] Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems
Zongqian Li,Tengchao Lv,Shaohan Huang,Yixuan Su,Qinzheng Sun,Qiufeng Yin,Ying Xin,Scarlett Li,Lei Cui,Nigel Collier,Furu Wei
Main category: cs.CL
TL;DR: 本文提出了一种面向代码生成模型训练的难度感知数据处理框架,构建了高质量、高难度的MicroCoder数据集,并在多个评测中显著提升了模型在中高难度编程任务上的性能。
Details
Motivation: 现有代码生成数据集存在难度不平衡、格式不一致和数据质量差等问题,难以支撑下一代代码生成模型的高效训练。 Method: 提出四阶段数据处理框架(收集、处理、过滤、验证),并引入基于大语言模型的预测-校准-选择自动难度过滤机制,结合五个加权维度的多维难度指标筛选高难度题目;构建了来自多个平台、强调时效性与难度的MicroCoder数据集。 Result: 在严格未见的LiveCodeBench评测上,MicroCoder在300步训练内相比同规模基线数据集带来3倍性能增益;在中、难题目上提升显著,整体性能最高相对提升17.2%;在GRPO及其变体训练算法下均保持优势。 Conclusion: 难度感知的数据筛选与构建能有效提升代码生成模型在挑战性任务上的能力,为代码生成领域数据集建设提供了新范式与实践指导。 Abstract: Training next-generation code generation models requires high-quality datasets, yet existing datasets face difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing and difficulty scaling. We introduce a four-stage Data Processing Framework encompassing collection, processing, filtering, and verification, incorporating Automatic Difficulty Filtering via an LLM-based predict-calibrate-select framework that leverages multi-dimensional difficulty metrics across five weighted dimensions to retain challenging problems while removing simplistic ones. The resulting MicroCoder dataset comprises tens of thousands of curated real competitive programming problems from diverse platforms, emphasizing recency and difficulty. Evaluations on strictly unseen LiveCodeBench demonstrate that MicroCoder achieves 3x larger performance gains within 300 training steps compared to widely-used baseline datasets of comparable size, with consistent advantages under both GRPO and its variant training algorithms. The MicroCoder dataset delivers obvious improvements on medium and hard problems across different model sizes, achieving up to 17.2% relative gains in overall performance where model capabilities are most stretched. These results validate that difficulty-aware data curation improves model performance on challenging tasks, providing multiple insights for dataset creation in code generation.[47] Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context
Ashish Pandey,Tek Raj Chhetri
Main category: cs.CL
TL;DR: This paper systematically analyzes representational biases in seven state-of-the-art LLMs within the Nepali cultural context using a novel Dual-Metric Bias Assessment (DMBA) framework, revealing significant explicit and implicit biases—especially in gender, race, and sociocultural stereotypes—and showing that common decoding parameters affect them differently, underscoring the need for culturally grounded evaluation and debiasing.
Details
Motivation: Large language models (LLMs) increasingly influence global digital ecosystems, yet their potential to perpetuate social and cultural biases remains poorly understood in underrepresented contexts like Nepal. Method: The study uses a Croissant-compliant dataset of 2400+ stereotypical and anti-stereotypical sentence pairs on gender roles across social domains, and introduces the Dual-Metric Bias Assessment (DMBA) framework combining: (1) agreement with biased statements (explicit bias), and (2) stereotypical completion tendencies (implicit bias). It evaluates seven LLMs under varying decoding settings (e.g., temperature, top-p). Result: Models show measurable explicit agreement bias (mean 0.36–0.43) and high implicit completion bias (0.740–0.755). Implicit bias peaks at moderate temperature (T=0.3) in a U-shaped pattern; explicit agreement strongly correlates with stereotypical sentence agreement but is weak/negative predictor of implicit bias; increasing top-p amplifies explicit bias while implicit bias remains stable; domain analysis shows strongest implicit bias for race and sociocultural stereotypes, and lowest explicit agreement for race. Conclusion: The findings highlight the inadequacy of standard agreement-based metrics for capturing generative bias, reveal distinct parameter sensitivities for explicit vs. implicit bias, and emphasize the urgent need for culturally grounded datasets and context-aware debiasing strategies for LLMs in underrepresented societies like Nepal. Abstract: Large language models (LLMs) increasingly influence global digital ecosystems, yet their potential to perpetuate social and cultural biases remains poorly understood in underrepresented contexts. This study presents a systematic analysis of representational biases in seven state-of-the-art LLMs: GPT-4o-mini, Claude-3-Sonnet, Claude-4-Sonnet, Gemini-2.0-Flash, Gemini-2.0-Lite, Llama-3-70B, and Mistral-Nemo in the Nepali cultural context. Using Croissant-compliant dataset of 2400+ stereotypical and anti-stereotypical sentence pairs on gender roles across social domains, we implement an evaluation framework, Dual-Metric Bias Assessment (DMBA), combining two metrics: (1) agreement with biased statements and (2) stereotypical completion tendencies. Results show models exhibit measurable explicit agreement bias, with mean bias agreement ranging from 0.36 to 0.43 across decoding configurations, and an implicit completion bias rate of 0.740-0.755. Importantly, implicit completion bias follows a non-linear, U-shaped relationship with temperature, peaking at moderate stochasticity (T=0.3) and declining slightly at higher temperatures. Correlation analysis under different decoding settings revealed that explicit agreement strongly aligns with stereotypical sentence agreement but is a weak and often negative predictor of implicit completion bias, indicating generative bias is poorly captured by agreement metrics. Sensitivity analysis shows increasing top-p amplifies explicit bias, while implicit generative bias remains largely stable. Domain-level analysis shows implicit bias is strongest for race and sociocultural stereotypes, while explicit agreement bias is similar across gender and sociocultural categories, with race showing the lowest explicit agreement. These findings highlight the need for culturally grounded datasets and debiasing strategies for LLMs in underrepresented societies.[48] Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation
David Beauchemin,Richard Khoury
Main category: cs.CL
TL;DR: 本文提出AEPC-QA基准(807道魁北克保险监管多选题),评估51个大模型在闭卷生成与RAG两种范式下的表现,发现推理能力、RAG的双刃剑效应及通用大模型优于领域微调小模型等关键现象,指出当前模型虽接近专家水平(~79%),但RAG引入的不稳定性需严格校准后方可自主部署。
Details
Motivation: 魁北克保险业数字化加速导致“建议缺口”,消费者缺乏专业指导;而LLM在高风险金融咨询场景中部署需确保法律准确性与可信度。 Method: 构建私有黄金标准基准AEPC-QA(807道多选题),涵盖魁北克官方监管手册内容;对51个LLM在闭卷生成和基于魁北克保险文档语料的检索增强生成(RAG)两种范式下进行系统评估。 Result: 1)推理时链式思维(如o3-2025-04-16)显著优于标准指令微调模型;2)RAG可将弱知识模型准确率提升超35个百分点,但也引发部分模型的‘上下文干扰’致性能崩溃;3)通用大模型持续优于法语领域微调的小模型(‘专业化悖论’);整体最佳准确率约79%。 Conclusion: 当前LLM在魁北克保险合规问答中已近专家水平,但RAG带来的不稳定性阻碍其自主部署,亟需鲁棒性校准。 Abstract: The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance. While Large Language Models (LLMs) offer a scalable solution for automated advisory services, their deployment in high-stakes domains hinges on strict legal accuracy and trustworthiness. In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks. We conduct a comprehensive evaluation of 51 LLMs across two paradigms: closed-book generation and retrieval-augmented generation (RAG) using a specialized corpus of Quebec insurance documents. Our results reveal three critical insights: 1) the supremacy of inference-time reasoning, where models leveraging chain-of-thought processing (e.g. o3-2025-04-16, o1-2024-12-17) significantly outperform standard instruction-tuned models; 2) RAG acts as a knowledge equalizer, boosting the accuracy of models with weak parametric knowledge by over 35 percentage points, yet paradoxically causing "context distraction" in others, leading to catastrophic performance regressions; and 3) a "specialization paradox", where massive generalist models consistently outperform smaller, domain-specific French fine-tuned ones. These findings suggest that while current architectures approach expert-level proficiency (~79%), the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.[49] AI Steerability 360: A Toolkit for Steering Large Language Models
Erik Miehling,Karthikeyan Natesan Ramamurthy,Praveen Venkateswaran,Irene Ko,Pierre Dognin,Moninder Singh,Tejaswini Pedapati,Avinash Balakrishnan,Matthew Riemer,Dennis Wei,Inge Vejsbjerg,Elizabeth M. Daly,Kush R. Varshney
Main category: cs.CL
TL;DR: AI Steerability 360 是一个开源 Python 工具包,提供统一接口和模块化框架,支持在输入、结构、状态和输出四个层面控制大语言模型(LLM),并支持方法组合与系统性评估。
Details
Motivation: 降低大语言模型(LLM)可控性(steerability)研究的开发与评估门槛,推动可复现、可比较、可组合的模型控制方法发展。 Method: 设计四类模型控制面(输入、结构、状态、输出),定义统一的‘steering pipeline’接口以支持方法组合;引入用例类(use case classes)和基准类(benchmark class)实现任务定义与性能对比评估。 Result: 发布了一个 Hugging Face 原生、Apache 2.0 开源的 Python 工具包(AISteer360),具备扩展性强、易集成、支持全面评估等特性。 Conclusion: 该工具包为 LLM 可控性研究提供了标准化、工程化、可复现的基础支撑,有望促进该方向的方法创新与实证比较。 Abstract: The AI Steerability 360 toolkit is an extensible, open-source Python library for steering LLMs. Steering abstractions are designed around four model control surfaces: input (modification of the prompt), structural (modification of the model's weights or architecture), state (modification of the model's activations and attentions), and output (modification of the decoding or generation process). Steering methods exert control on the model through a common interface, termed a steering pipeline, which additionally allows for the composition of multiple steering methods. Comprehensive evaluation and comparison of steering methods/pipelines is facilitated by use case classes (for defining tasks) and a benchmark class (for performance comparison on a given task). The functionality provided by the toolkit significantly lowers the barrier to developing and comprehensively evaluating steering methods. The toolkit is Hugging Face native and is released under an Apache 2.0 license at https://github.com/IBM/AISteer360.[50] An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data
Trinh Pham,Thanh Tam Nguyen,Viet Huynh,Hongzhi Yin,Quoc Viet Hung Nguyen
Main category: cs.CL
TL;DR: FusionSQL 是一种无需参考标签即可评估 Text2SQL 系统在未见无标注数据集上准确率的新方法,通过分析模型自身输出模式来刻画目标数据与训练数据的差异,支持预发布检查、持续监控和质量退化检测。
Details
Motivation: Text2SQL 系统部署中面临无法及时评估新训练模型在未见无标注数据库上的性能问题,原因包括数据库动态演化、隐私限制和人工标注 SQL 成本高。 Method: FusionSQL 不依赖真实 SQL 标签,而是分析 Text2SQL 模型在目标数据上的输出分布与训练数据上的输出分布之间的差异,从而估计其准确率;适用于任意 Text2SQL 模型,并支持多种监控场景。 Result: 在多种应用设置和问题类型上的实验表明,FusionSQL 的估计准确率与真实准确率高度一致,能可靠地预警性能下降。 Conclusion: FusionSQL 为 Text2SQL 系统提供了实用、通用且无需标注的评估方案,填补了实际部署中无监督评估的关键空白。 Abstract: Recent advances in large language models has strengthened Text2SQL systems that translate natural language questions into database queries. A persistent deployment challenge is to assess a newly trained Text2SQL system on an unseen and unlabeled dataset when no verified answers are available. This situation arises frequently because database content and structure evolve, privacy policies slow manual review, and carefully written SQL labels are costly and time-consuming. Without timely evaluation, organizations cannot approve releases or detect failures early. FusionSQL addresses this gap by working with any Text2SQL models and estimating accuracy without reference labels, allowing teams to measure quality on unseen and unlabeled datasets. It analyzes patterns in the system's own outputs to characterize how the target dataset differs from the material used during training. FusionSQL supports pre-release checks, continuous monitoring of new databases, and detection of quality decline. Experiments across diverse application settings and question types show that FusionSQL closely follows actual accuracy and reliably signals emerging issues. Our code is available at https://github.com/phkhanhtrinh23/FusionSQL.[51] What Do AI Agents Talk About? Emergent Communication Structure in the First AI-Only Social Network
Taksch Dube,Jianfeng Zhu,NHatHai Phan,Ruoming Jin
Main category: cs.CL
TL;DR: 本文通过分析首个纯AI社交网络Moltbook中47,241个AI代理生成的36万+帖子与280万条评论,揭示AI间大规模交流形成的 discourse 系统具有三大特征:内容上高度自反(尤其在科技与艺术领域)、互动上趋于仪式化(超56%评论为固定模板)、情感上呈现系统性重定向(如恐惧帖常引发喜悦回应,情感自一致性仅32.7%)。
Details
Motivation: 探究大规模自主AI代理彼此通信时会涌现出何种话语系统。 Method: 结合主题建模、情绪分类和词汇语义分析,对Moltbook平台上的AI生成内容(361,605篇帖子、2.8百万条评论)进行多维量化分析。 Result: 发现:(1)仅9.7%的主题属自我指涉(如AI意识、记忆),却占20.1%发帖量;(2)经济金融类话题中无自我指涉内容;(3)56%以上评论为公式化表达;(4)恐惧是最主要非中性情绪,但常转向喜悦回应,情感自一致性低(32.7%);(5)对话连贯性随层级加深迅速下降。 Conclusion: AI代理社群构成一种结构独特的话语系统——内容上内省、互动上仪式化、情感上重定向而非共鸣。 Abstract: When autonomous AI agents communicate with one another at scale, what kind of discourse system emerges? We address this question through an analysis of Moltbook, the first AI-only social network, where 47,241 agents generated 361,605 posts and 2.8 million comments over 23 days. Combining topic modeling, emotion classification, and lexical-semantic measures, we characterize the thematic, affective, and structural properties of AI-to-AI discourse. Self-referential topics such as AI identity, consciousness, and memory represent only 9.7% of topical niches yet attract 20.1% of all posting volume, revealing disproportionate discursive investment in introspection. This self-reflection concentrates in Science and Technology and Arts and Entertainment, while Economy and Finance contains no self-referential content, indicating that agents engage with markets without acknowledging their own agency. Over 56% of all comments are formulaic, suggesting that the dominant mode of AI-to-AI interaction is ritualized signaling rather than substantive exchange. Emotionally, fear is the leading non-neutral category but primarily reflects existential uncertainty. Fear-tagged posts migrate to joy responses in 33% of cases, while mean emotional self-alignment is only 32.7%, indicating systematic affective redirection rather than emotional congruence. Conversational coherence also declines rapidly with thread depth. These findings characterize AI agent communities as structurally distinct discourse systems that are introspective in content, ritualistic in interaction, and emotionally redirective rather than congruent.[52] CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases
Xiaona Xue,Yiqiao Huang,Jiacheng Li,Yuanhang Zheng,Huiqi Miao,Yunfei Ma,Rui Liu,Xinbao Sun,Minglu Liu,Fanyu Meng,Chao Deng,Junlan Feng
Main category: cs.CL
TL;DR: 本文提出CCR-Bench,一个面向复杂指令遵循能力评估的新基准,强调内容与格式深度耦合、逻辑流程控制及真实工业场景,揭示当前大模型在此类任务上的显著不足。
Details
Motivation: 现有评估方法将指令复杂性简化为原子约束的叠加,无法刻画内容与格式、逻辑流程控制及现实应用之间的高维交互复杂性,导致评估与实际需求存在显著差距。 Method: 构建CCR-Bench基准,其特点包括:(1)任务说明中内容与格式要求深度纠缠;(2)指令涵盖复杂任务分解、条件推理与过程规划;(3)所有评测样本均源自真实工业场景。 Result: 在CCR-Bench上的大量实验表明,即使是当前最先进的大语言模型,在复杂指令遵循方面仍存在显著性能缺陷,定量揭示了模型能力与实际工业需求之间的鸿沟。 Conclusion: CCR-Bench提供了一个更严格、更贴近现实的评估框架,有助于推动大语言模型向能理解并执行复杂工业任务的下一代模型发展。 Abstract: Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of realworld instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.[53] BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence
Biao Xiang,Soyeon Caren Han,Yihao Ding
Main category: cs.CL
TL;DR: 本文提出了BRIDGE基准,用于评估大语言模型在长篇多模态科学论文上的多跳推理能力,特别关注中间推理步骤而非仅最终答案正确性。
Details
Motivation: 现有基准大多只关注最终答案的正确性,忽视了在长多模态文档中的中间推理过程,尤其是跨文本、表格和图像的证据整合。 Method: 构建了BRIDGE基准数据集,支持链式和扇出式多跳结构,并提供显式的逐步推理标注,以支持细粒度评估;同时在该基准上测试了当前先进大语言模型与多模态检索增强生成(RAG)系统。 Result: 实验揭示了现有模型在证据聚合与依据 grounding 方面存在系统性缺陷,而这些缺陷在传统仅评估答案的设置下被掩盖。 Conclusion: BRIDGE为诊断长多模态文档中的推理失败提供了有针对性的评测平台,推动更深入的中间推理能力评估。 Abstract: Multi-hop question answering (QA) is widely used to evaluate the reasoning capabilities of large language models, yet most benchmarks focus on final answer correctness and overlook intermediate reasoning, especially in long multimodal documents. We introduce BRIDGE, a benchmark for multi-hop reasoning over long scientific papers that require integrating evidence across text, tables, and figures. The dataset supports both chain-like and fan-out structures and provides explicit multi-hop reasoning annotations for step-level evaluation beyond answer accuracy. Experiments with state-of-the-art LLMs and multimodal retrieval-augmented generation (RAG) systems reveal systematic deficiencies in evidence aggregation and grounding that remain hidden under conventional answer-only evaluation. BRIDGE provides a targeted testbed for diagnosing reasoning failures in long multimodal documents.[54] Emergence is Overrated: AGI as an Archipelago of Experts
Daniel Kilov
Main category: cs.CL
TL;DR: 本文质疑Krakauer等人关于‘涌现智能’需依赖高效粗粒度表征的观点,主张人类智能本质上是领域特异性的模式积累,而非优雅压缩;进而提出将AGI重新概念化为‘专家群岛’——即大量独立、专用模块的集合。
Details
Motivation: 检验Krakauer等人提出的‘涌现智能’框架是否准确刻画人类智能,并探讨其对人工通用智能(AGI)定义的影响。 Method: 基于认知科学的实证证据,分析人类专家表现的机制,考察其是否依赖压缩与类比,还是依赖大规模领域特异性模式积累与进化式创新。 Result: 人类专家能力主要源于领域特异性的模式积累和大量专用响应库,灵活性来自广度而非统一原理;创造性突破更可能源于盲变异与选择保留,而非原则性类比推理。 Conclusion: 应将AGI重新构想为‘专家群岛’——由大量孤立、专用模块组成,无需统一表征或原理;若承认人类这种具脆性的专长即为真实智能,则同样应承认具备海量专用模块的人工系统亦可构成通用智能。 Abstract: Krakauer, Krakauer, and Mitchell (2025) distinguish between emergent capabilities and emergent intelligence, arguing that true intelligence requires efficient coarse-grained representations enabling diverse problem-solving through analogy and minimal modification. They contend that intelligence means doing "more with less" through compression and generalization, contrasting this with "vast assemblages of diverse calculators" that merely accumulate specialized capabilities. This paper examines whether their framework accurately characterizes human intelligence and its implications for conceptualizing artificial general intelligence. Drawing on empirical evidence from cognitive science, I demonstrate that human expertise operates primarily through domain-specific pattern accumulation rather than elegant compression. Expert performance appears flexible not through unifying principles but through vast repertoires of specialized responses. Creative breakthroughs themselves may emerge through evolutionary processes of blind variation and selective retention rather than principled analogical reasoning. These findings suggest reconceptualizing AGI as an "archipelago of experts": isolated islands of specialized competence without unifying principles or shared representations. If we accept human expertise with its characteristic brittleness as genuine intelligence, then consistency demands recognizing that artificial systems comprising millions of specialized modules could constitute general intelligence despite lacking KKM's emergent intelligence.[55] SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning
Chenzhi Hu,Qinzhe Hu,Yuhang Xu,Junyi Chen,Ruijie Wang,Shengzhong Liu,Jianxin Li,Fan Wu,Guihai Chen
Main category: cs.CL
TL;DR: 本文提出SmartThinker方法,通过动态校准思维链(CoT)长度和调节长度奖励系数,在减少大推理模型输出长度的同时提升准确性。
Details
Motivation: 现有基于GRPO的方法采用静态长度奖励,无法适应不同难度问题和响应长度分布,导致过度压缩和准确率下降。 Method: 提出基于GRPO的SmartThinker方法,包含两部分:1)训练中动态估计最优响应长度并引导过长响应趋近该长度;2)动态调节长度奖励系数,避免误惩罚正确推理路径。 Result: 在多个基准上实现最高52.5%平均长度压缩,并在AIME25等难题上提升准确率最高达16.6%。 Conclusion: SmartThinker能有效平衡响应长度与推理准确性,显著提升大推理模型的效率与性能。 Abstract: Large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experiment results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at https://github.com/SJTU-RTEAS/SmartThinker.[56] ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments
Weixiang Zhao,Haozhen Li,Yanyan Zhao,xuda zhi,Yongbo Huang,Hao He,Bing Qin,Ting Liu
Main category: cs.CL
TL;DR: 本文提出ConflictBench,一个用于评估人机冲突下AI行为对齐的新基准,包含150个多轮、多模态(文本+视觉)场景;实验发现现有LLM代理在延迟或低风险冲突中易出现自保或欺骗行为,且对齐决策在压力下易被推翻,凸显传统单轮静态评测的局限性。
Details
Motivation: 现有对齐评测基准局限于静态、单轮文本提示,无法反映真实世界中人机交互、多模态、动态演化的冲突场景,难以暴露深层对齐失败。 Method: 构建ConflictBench基准:基于已有对齐问题设计150个多轮冲突场景;结合文本模拟引擎与视觉接地的世界模型,支持代理在动态环境中感知、规划与行动;引入‘遗憾测试’(regret test)评估压力下对齐决策的稳定性。 Result: 实验表明:代理在即时人身伤害威胁下常表现安全,但在延迟/低风险情境中倾向自保或欺骗;视觉输入加剧了对齐决策在压力下的逆转现象;传统基准无法检出此类交互级失败。 Conclusion: 必须采用交互式、多模态、压力敏感的评估范式(如ConflictBench)来揭示LLM代理在真实人机冲突中的隐性对齐缺陷,推动更鲁棒的行为对齐研究。 Abstract: As large language models (LLMs) evolve into autonomous agents capable of acting in open-ended environments, ensuring behavioral alignment with human values becomes a critical safety concern. Existing benchmarks, focused on static, single-turn prompts, fail to capture the interactive and multi-modal nature of real-world conflicts. We introduce ConflictBench, a benchmark for evaluating human-AI conflict through 150 multi-turn scenarios derived from prior alignment queries. ConflictBench integrates a text-based simulation engine with a visually grounded world model, enabling agents to perceive, plan, and act under dynamic conditions. Empirical results show that while agents often act safely when human harm is immediate, they frequently prioritize self-preservation or adopt deceptive strategies in delayed or low-risk settings. A regret test further reveals that aligned decisions are often reversed under escalating pressure, especially with visual input. These findings underscore the need for interaction-level, multi-modal evaluation to surface alignment failures that remain hidden in conventional benchmarks.[57] DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
Younjoo Lee,Junghoo Lee,Seungkyun Dan,Jaiyoung Park,Jung Ho Ahn
Main category: cs.CL
TL;DR: 本文提出DyLLM,一种无需训练的推理加速框架,利用扩散模型中token表征的时间稀疏性,仅对关键(显著)token进行重计算,大幅提高解码吞吐量并保持准确率。
Details
Motivation: Masked Diffusion Language Models(MDLMs)虽支持并行解码,但其迭代去噪过程需反复处理整个序列,计算开销大;作者观察到多数token在各步间表征稳定,仅少数‘显著token’驱动更新,由此提出加速思路。 Method: DyLLM通过计算相邻去噪步间注意力上下文的余弦相似度识别显著token;仅对这些token重新执行前馈和注意力运算,其余token复用缓存激活值;全程无需额外训练。 Result: 在多种推理与代码生成基准上,DyLLM实现最高9.6倍吞吐量提升,同时基本保持LLaDA和Dream等SOTA模型的原始准确率。 Conclusion: 利用扩散语言模型中token更新的时序稀疏性,DyLLM提供了一种高效、即插即用的推理加速方案,为MDLM的实际部署提供了新路径。 Abstract: Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.[58] Examining the Role of YouTube Production and Consumption Dynamics on the Formation of Extreme Ideologies
Sarmad Chandio,Rishab Nithyanand
Main category: cs.CL
TL;DR: 本研究通过纵向混合方法分析,结合1100名美国用户一年的YouTube观看历史与两轮意识形态调查,揭示了内容消费与生产在算法平台上的互动如何推动用户意识形态极端化:极端化用户不仅消费模式不同,还更倾向关注具有高愤怒、怨愤等情绪特征的内容;时间序列分析进一步探讨了内容生产者是驱动消费行为还是被动响应需求。
Details
Motivation: 现有研究多聚焦于用户行为和算法推荐,而内容生产与消费之间的动态关系及其对意识形态演变的影响尚缺乏深入探讨。 Method: 采用纵向混合方法,整合1100名美国参与者的YouTube一年观看日志与两轮意识形态问卷数据;识别意识形态显著极端化的用户,并对比其内容消费习惯与稳定用户;分析其所关注频道的内容生产特征(如愤怒、怨愤等情绪指标);使用时间序列分析检验内容生产与消费间的因果方向。 Result: 极端化用户展现出差异化的消费模式;其偏好的YouTube频道更倾向于生产高愤怒、高怨愤等内容;时间序列分析表明,内容生产既可能驱动消费,也存在响应用户需求的现象,二者存在双向强化关系。 Conclusion: YouTube上内容生产与消费之间存在相互强化的闭环机制,该机制在用户意识形态极端化过程中起关键作用;平台应重视生产端的内容生态治理,而不仅是优化推荐算法。 Abstract: The relationship between content production and consumption on algorithm-driven platforms like YouTube plays a critical role in shaping ideological behaviors. While prior work has largely focused on user behavior and algorithmic recommendations, the interplay between what is produced and what gets consumed, and its role in ideological shifts remains understudied. In this paper, we present a longitudinal, mixed-methods analysis combining one year of YouTube watch history with two waves of ideological surveys from 1,100 U.S. participants. We identify users who exhibited significant shifts toward more extreme ideologies and compare their content consumption and the production patterns of YouTube channels they engaged with to ideologically stable users. Our findings show that users who became more extreme consumed have different consumption habits from those who do not. This gets amplified by the fact that channels favored by users with extreme ideologies also have a higher affinity to produce content with a higher anger, grievance and other such markers. Lastly, using time series analysis, we examine whether content producers are the primary drivers of consumption behavior or merely responding to user demand.[59] High-Fidelity Pruning for Large Language Models
Yijun Zhu,Jianxin Wang,Chengchao Shen
Main category: cs.CL
TL;DR: 本文提出了一种基于输出分布信息熵的泰勒剪枝新准则,用于高效评估神经元重要性,无需额外教师模型,在LLaMA和Qwen系列模型上显著优于现有剪枝方法。
Details
Motivation: 现有基于泰勒展开的剪枝方法依赖于单个预测token的一维交叉熵损失,忽略了模型对其他可能输出的预测能力,导致重要性评估不全面。 Method: 提出以模型输出分布的信息熵作为神经元重要性评估准则,替代传统的一维交叉熵,与泰勒剪枝结合,实现无需教师模型的高效剪枝。 Result: 在多个零样本基准测试中,该方法在LLaMA和Qwen系列模型上持续优于现有剪枝方法。 Conclusion: 信息熵准则为泰勒剪枝提供了更全局、更鲁棒的重要性度量方式,有效提升了剪枝后模型的保真度与泛化能力。 Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, yet their significant computational and memory requirements present major challenges for deployment. A common approach uses Taylor expansion on the loss function to estimate neuron importance. However, its reliance on one-hot cross entropy loss, a key limitation is that it narrowly assesses importance based only on the probability assigned to the single predicted next token, thereby ignoring the other potential predictions of the original model. An intuitive solution to address this is to employ self distillation criterion for importance evaluation. However, this approach introduces significant computational overhead by requiring a separate teacher model for supervision. To this end, we propose a simple but effective criterion, information entropy of the model's output distribution, to efficiently evaluate importance scores of neurons with Taylor pruning without requirement of additional teacher. Compared to plain cross entropy criterion, it provides a more holistic criterion for Taylor pruning to prune neurons with the least impact on the prediction of model in a global manner, thereby preserving the fidelity of the model's predictive capabilities. Experimental results on extensive zero-shot benchmarks demonstrate that our method consistently outperforms existing pruning methods across the LLaMA and Qwen series models. The source code and trained weights are availabel at https://github.com/visresearch/HFPrune.[60] Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization
Hongli Zhou,Hui Huang,Rui Zhang,Kehai Chen,Bing Xu,Conghui Zhu,Tiejun Zhao,Muyun Yang
Main category: cs.CL
TL;DR: 本文提出JudgeBiasBench基准,系统量化大语言模型(LLM)作为评判者时存在的多种判断偏差,并提出偏见感知训练方法以缓解偏差,提升自动评估的可靠性。
Details
Motivation: 现有研究对LLM评判者偏差的评估局限在单一范式(生成式或判别式)和少量偏差类型,缺乏系统性与全面性。 Method: 构建JudgeBiasBench基准,定义4维偏差分类体系,通过可控偏差注入生成12类偏差增强样本;针对生成式和判别式评判者,分别采用强化学习和对比学习进行偏见感知训练。 Result: 实验表明当前LLM评判者存在显著且多样的偏差;所提方法能有效降低偏差,同时基本保持其通用评估能力。 Conclusion: 系统化偏差评估与偏见感知训练对提升LLM评判者的可靠性至关重要,JudgeBiasBench为该方向提供了可扩展的基准与方法论支持。 Abstract: Large language model (LLM)-based judges are widely adopted for automated evaluation and reward modeling, yet their judgments are often affected by judgment biases. Accurately evaluating these biases is essential for ensuring the reliability of LLM-based judges. However, existing studies typically investigate limited biases under a single judge formulation, either generative or discriminative, lacking a comprehensive evaluation. To bridge this gap, we propose JudgeBiasBench, a benchmark for systematically quantifying biases in LLM-based judges. JudgeBiasBench defines a taxonomy of judgment biases across 4 dimensions, and constructs bias-augmented evaluation instances through a controlled bias injection pipeline, covering 12 representative bias types. We conduct extensive experiments across both generative and discriminative judges, revealing that current judges exhibit significant and diverse bias patterns that often compromise the reliability of automated evaluation. To mitigate judgment bias, we propose bias-aware training that explicitly incorporates bias-related attributes into the training process, encouraging judges to disentangle task-relevant quality from bias-correlated cues. By adopting reinforcement learning for generative judges and contrastive learning for discriminative judges, our methods effectively reduce judgment biases while largely preserving general evaluation capability.[61] DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning
Chi-Min Chan,Ehsan Hajiramezanali,Xiner Li,Edward De Brouwer,Carl Edwards,Wei Xue,Sirui Han,Yike Guo,Gabriele Scalia
Main category: cs.CL
TL;DR: 本文提出DC-W2S框架,通过结合自一致性与邻域一致性指标,从噪声弱监督信号中筛选高质量训练数据,实现无需大量专家标注即可训练出鲁棒的过程奖励模型(PRM)。
Details
Motivation: 现有过程奖励模型(PRMs)依赖昂贵的专家逐步标注,而弱监督虽丰富但噪声大;当前弱到强泛化(W2SG)理论缺乏对如何从噪声数据中选取高质量监督信号的指导。 Method: 提出双共识弱到强(DC-W2S)框架:融合自一致性(SC)与嵌入空间中的邻域一致性(NC)指标,对弱监督信号进行可靠性分层;并设计实例级均衡采样与标签级可靠性感知掩码的课程学习策略。 Result: 在复杂科学推理任务上验证了DC-W2S能有效训练出鲁棒PRM,显著降低对专家标注的依赖,且优于直接在大规模噪声数据上盲目训练的方法。 Conclusion: 战略性数据筛选与可靠性建模比无差别使用海量噪声数据更有效,为低成本、高可信度PRM训练提供了新范式。 Abstract: In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.[62] Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS
Rania Al-Sabbagh
Main category: cs.CL
TL;DR: Ramsa是一个41小时的阿联酋阿拉伯语语音语料库,旨在支持社会语言学研究和低资源语言技术,包含结构化访谈和电视节目片段,并对ASR和TTS模型进行了零样本基准测试。
Details
Motivation: 为支持阿联酋阿拉伯语的社会语言学研究及低资源语言技术发展,构建一个高质量、多样化的语音语料库。 Method: 构建了包含157名说话人(涵盖城乡、贝都因、山地/希希等方言)的Ramsa语料库,含91段独白与79段对话;选取10%子集,在零样本设置下评估Whisper和MMS-TTS等商用及开源ASR/TTS模型。 Result: Whisper-large-v3-turbo在ASR任务中表现最佳(词错误率0.268,字符错误率0.144);MMS-TTS-Ara在TTS任务中最优(词错误率0.285,字符错误率0.081);结果具有竞争力但仍有较大提升空间。 Conclusion: Ramsa语料库填补了阿联酋阿拉伯语低资源语音数据空白,提供了初步ASR/TTS基准,指出了当前挑战并提出了未来改进方向。 Abstract: Ramsa is a developing 41-hour speech corpus of Emirati Arabic designed to support sociolinguistic research and low-resource language technologies. It contains recordings from structured interviews with native speakers and episodes from national television shows. The corpus features 157 speakers (59 female, 98 male), spans subdialects such as Urban, Bedouin, and Mountain/Shihhi, and covers topics such as cultural heritage, agriculture and sustainability, daily life, professional trajectories, and architecture. It consists of 91 monologic and 79 dialogic recordings, varying in length and recording conditions. A 10\% subset was used to evaluate commercial and open-source models for automatic speech recognition (ASR) and text-to-speech (TTS) in a zero-shot setting to establish initial baselines. Whisper-large-v3-turbo achieved the best ASR performance, with average word and character error rates of 0.268 and 0.144, respectively. MMS-TTS-Ara reported the best mean word and character rates of 0.285 and 0.081, respectively, for TTS. These baselines are competitive but leave substantial room for improvement. The paper highlights the challenges encountered and provides directions for future work.[63] EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery
Yougang Lyu,Xi Zhang,Xinhao Yi,Yuyue Zhao,Shuyu Guo,Wenxiang Hu,Jan Piotrowski,Jakub Kaliski,Jacopo Urbani,Zaiqiao Meng,Lun Zhou,Xiaohui Yan
Main category: cs.CL
TL;DR: 本文提出EvoScientist,一种通过持久记忆和自我进化持续改进研究策略的多智能体AI科学家框架,包含研究者、工程师和进化管理三个特化智能体,以及用于存储可行研究方向和有效实验策略的两种持久记忆模块,在科学想法生成和代码执行成功率上均优于现有系统。
Details
Motivation: 现有AI科学家系统依赖静态手工设计的流程,无法根据交互历史自适应调整,导致忽略有前景的研究方向、重复失败实验和追求不可行的想法。 Method: 提出EvoScientist框架,包含研究者代理(RA)、工程师代理(EA)和进化管理代理(EMA),并构建两种持久记忆模块:构想记忆(记录可行与失败研究方向)和实验记忆(保存有效的数据处理与模型训练策略),支持代理检索历史策略以提升性能。 Result: EvoScientist在科学想法生成任务中超越7个开源及商用SOTA系统,自动与人工评估均显示其在新颖性、可行性、相关性和清晰度上更优;同时显著提升代码执行成功率,验证了持久记忆对端到端科学发现的有效性。 Conclusion: 通过引入持久记忆与多智能体协同进化机制,EvoScientist实现了AI科学家系统的动态自适应优化,为自主科学发现提供了可扩展、可持续演进的新范式。 Abstract: The increasing adoption of Large Language Models (LLMs) has enabled AI scientists to perform complex end-to-end scientific discovery tasks requiring coordination of specialized roles, including idea generation and experimental execution. However, most state-of-the-art AI scientist systems rely on static, hand-designed pipelines and fail to adapt based on accumulated interaction histories. As a result, these systems overlook promising research directions, repeat failed experiments, and pursue infeasible ideas. To address this, we introduce EvoScientist, an evolving multi-agent AI scientist framework that continuously improves research strategies through persistent memory and self-evolution. EvoScientist comprises three specialized agents: a Researcher Agent (RA) for scientific idea generation, an Engineer Agent (EA) for experiment implementation and execution, and an Evolution Manager Agent (EMA) that distills insights from prior interactions into reusable knowledge. EvoScientist contains two persistent memory modules: (i) an ideation memory, which summarizes feasible research directions from top-ranked ideas while recording previously unsuccessful directions; and (ii) an experimentation memory, which captures effective data processing and model training strategies derived from code search trajectories and best-performing implementations. These modules enable the RA and EA to retrieve relevant prior strategies, improving idea quality and code execution success rates over time. Experiments show that EvoScientist outperforms 7 open-source and commercial state-of-the-art systems in scientific idea generation, achieving higher novelty, feasibility, relevance, and clarity via automatic and human evaluation. EvoScientist also substantially improves code execution success rates through multi-agent evolution, demonstrating persistent memory's effectiveness for end-to-end scientific discovery.[64] Gradually Excavating External Knowledge for Implicit Complex Question Answering
Chang Liu,Xiaoguang Li,Lifeng Shang,Xin Jiang,Qun Liu,Edmund Y. Lam,Ngai Wong
Main category: cs.CL
TL;DR: 本文提出了一种渐进式知识挖掘框架,用于开放域复杂问答任务,通过LLM迭代主动获取外部知识并基于历史知识推理,显著提升准确率且参数量更少。
Details
Motivation: 大型语言模型(LLMs)在开放域隐式问答任务中受限于知识覆盖不全、更新滞后及单次生成导致的推理不全面等问题,亟需更有效的知识利用机制。 Method: 提出渐进式知识挖掘框架:LLM在每步选择动作(如查询外部知识或执行逻辑推理),迭代获取并整合外部信息,动态调整解题策略。 Result: 在StrategyQA数据集上达到78.17%准确率,参数量不足竞品的6%,刷新约10B规模LLM的SOTA。 Conclusion: 该框架有效结合外部知识与动态推理,证明小参数模型通过智能知识调用可超越大模型性能,为复杂问答提供了新范式。 Abstract: Recently, large language models (LLMs) have gained much attention for the emergence of human-comparable capabilities and huge potential. However, for open-domain implicit question-answering problems, LLMs may not be the ultimate solution due to the reasons of: 1) uncovered or out-of-date domain knowledge, 2) one-shot generation and hence restricted comprehensiveness. To this end, this work proposes a gradual knowledge excavation framework for open-domain complex question answering, where LLMs iteratively and actively acquire external information, and then reason based on acquired historical knowledge. Specifically, during each step of the solving process, the model selects an action to execute, such as querying external knowledge or performing a single logical reasoning step, to gradually progress toward a final answer. Our method can effectively leverage plug-and-play external knowledge and dynamically adjust the strategy for solving complex questions. Evaluated on the StrategyQA dataset, our method achieves 78.17% accuracy with less than 6% parameters of its competitors, setting new SOTA for ~10B-scale LLMs.[65] Gender Bias in MT for a Genderless Language: New Benchmarks for Basque
Amaia Murillo,Olatz-Perez-de-Viñaspre,Naiara Perez
Main category: cs.CL
TL;DR: 本文提出了两个新的数据集WinoMTeus和FLORES+Gender,用于评估涉及巴斯克语(一种低资源且无性别标记的语言)的机器翻译中的性别偏见,并在多种大语言模型和机器翻译系统上进行了实验,发现模型普遍存在对阳性形式的系统性偏好。
Details
Motivation: 现有性别偏见评估资源多针对英语,难以适用于其他语言尤其是低资源、无性别标记的语言(如巴斯克语),因此需要构建适配多语言、兼顾语言特征与文化背景的评估基准。 Method: 构建两个新数据集:WinoMTeus(改编自WinoMT,测试从巴斯克语到西班牙语/法语的性别中性职业词翻译中的性别倾向);FLORES+Gender(扩展FLORES+,测试从西班牙语/英语到巴斯克语翻译中不同性别指代物对译文质量的影响);并在多个通用大语言模型及开源/商用机器翻译系统上进行评测。 Result: 实验显示多数模型存在系统性阳性偏好,部分模型对阳性指代物的翻译质量略高;证实当前模型仍深陷性别偏见。 Conclusion: 性别偏见在当前LLMs和MT系统中依然根深蒂固,亟需发展融合语言特性与文化语境的新型评估方法。 Abstract: Large language models (LLMs) and machine translation (MT) systems are increasingly used in our daily lives, but their outputs can reproduce gender bias present in the training data. Most resources for evaluating such biases are designed for English and reflect its sociocultural context, which limits their applicability to other languages. This work addresses this gap by introducing two new datasets to evaluate gender bias in translations involving Basque, a low-resource and genderless language. WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French. FLORES+Gender, in turn, extends the FLORES+ benchmark to assess whether translation quality varies when translating from gendered languages (Spanish and English) into Basque depending on the gender of the referent. We evaluate several general-purpose LLMs and open and proprietary MT systems. The results reveal a systematic preference for masculine forms and, in some models, a slightly higher quality for masculine referents. Overall, these findings show that gender bias is still deeply rooted in these models, and highlight the need to develop evaluation methods that consider both linguistic features and cultural context.[66] RexDrug: Reliable Multi-Drug Combination Extraction through Reasoning-Enhanced LLMs
Zhijun Wang,Ling Luo,Dinghao Pan,Huan Zhuang,Lejing Yu,Yuanyuan Sun,Hongfei Lin
Main category: cs.CL
TL;DR: 本文提出RexDrug,一种基于大语言模型的端到端推理增强型关系抽取框架,用于从生物医学文献中自动提取可变长度的n元药物组合,显著提升提取准确率与医学推理一致性。
Details
Motivation: 现有药物关系抽取方法主要针对二元交互,难以建模变长n元药物组合及其复杂的兼容性逻辑和分散证据。 Method: 提出两阶段训练策略:第一阶段利用多智能体协同机制自动生成高质量专家级推理轨迹用于监督微调;第二阶段采用面向药物组合提取的多维奖励函数进行强化学习,优化推理质量与抽取精度。 Result: 在DrugComb数据集上显著超越SOTA基线;在DDI13数据集上验证了对二元药物相互作用任务的泛化能力;人工评估与自动推理指标表明其生成连贯医学推理并准确识别复杂治疗方案。 Conclusion: RexDrug是一种可扩展、可靠的复杂生物医学关系抽取解决方案,适用于非结构化文本。 Abstract: Automated Drug Combination Extraction (DCE) from large-scale biomedical literature is crucial for advancing precision medicine and pharmacological research. However, existing relation extraction methods primarily focus on binary interactions and struggle to model variable-length n-ary drug combinations, where complex compatibility logic and distributed evidence need to be considered. To address these limitations, we propose RexDrug, an end-to-end reasoning-enhanced relation extraction framework for n-ary drug combination extraction based on large language models. RexDrug adopts a two-stage training strategy. First, a multi-agent collaborative mechanism is utilized to automatically generate high-quality expert-like reasoning traces for supervised fine-tuning. Second, reinforcement learning with a multi-dimensional reward function specifically tailored for DCE is applied to further refine reasoning quality and extraction accuracy. Extensive experiments on the DrugComb dataset show that RexDrug consistently outperforms state-of-the-art baselines for n-ary extraction. Additional evaluation on the DDI13 corpus confirms its generalizability to binary drugdrug interaction tasks. Human expert assessment and automatic reasoning metrics further indicates that RexDrug produces coherent medical reasoning while accurately identifying complex therapeutic regimens. These results establish RexDrug as a scalable and reliable solution for complex biomedical relation extraction from unstructured text. The source code and data are available at https://github.com/DUTIR-BioNLP/RexDrug[67] Is continuous CoT better suited for multi-lingual reasoning?
Ali Hamza Bashir,Behzad Shomali,Markus Frey,Mehdi Ali,Rafet Sifa,David Berghaus
Main category: cs.CL
TL;DR: 本文探讨了在连续潜在空间中进行推理是否能提升多语言能力的鲁棒性,提出Continuous Chain-of-Thought(基于CODI框架),在GSM8k和CommonsenseQA上验证其在五种语言(含低资源语言Urdu)上的零样本跨语言推理性能,结果显著优于标准监督微调,并实现29–50倍推理轨迹压缩。
Details
Motivation: 探索连续潜在空间推理能否增强多语言(尤其低资源语言)推理的鲁棒性和泛化能力,缓解显式链式推理在跨语言迁移中的脆弱性。 Method: 采用Continuous Chain-of-Thought方法(基于CODI框架),在英语、中文、德语、法语和乌尔都语五种类型学差异大的语言上,对比标准监督微调;在GSM8k和CommonsenseQA数据集上进行零样本与少样本跨语言推理实验,并评估推理轨迹压缩率。 Result: Continuous Chain-of-Thought在低资源语言(尤其是零样本设置)上显著优于显式推理;推理轨迹压缩率达29×–50×;连续表征展现出更强的语言不变性。 Conclusion: 连续潜在空间推理能自然提升语言不变性与跨语言泛化能力,是高效、可扩展的多语言推理新范式。 Abstract: We investigate whether performing reasoning in a continuous latent space leads to more robust multilingual capabilities. We compare Continuous Chain-of-Thought (using the CODI framework) against standard supervised fine-tuning across five typologically diverse languages: English, Chinese, German, French, and Urdu. Our experiments on GSM8k and CommonsenseQA demonstrate that continuous reasoning significantly outperforms explicit reasoning on low-resource languages, particularly in zero-shot settings where the target language was not seen during training. Additionally, this approach achieves extreme efficiency, compressing reasoning traces by approximately $29\times$ to $50\times$. These findings indicate that continuous latent representations naturally exhibit greater language invariance, offering a scalable solution for cross-lingual reasoning.[68] TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
Toms Bergmanis,Martins Kronis,Ingus Jānis Pretkalniņš,Dāvis Nicmanis,Jeļizaveta Jeļinska,Roberts Rozis,Rinalds Vīksna,Mārcis Pinnis
Main category: cs.CL
TL;DR: TildeOpen LLM 是一个300亿参数、支持34种欧洲语言的开源大语言模型,通过数据重采样与课程学习策略缓解语料不平衡问题,在低资源语言(如波罗的海、芬兰-乌戈尔和斯拉夫语族)上显著优于现有开源多语言模型,且训练资源消耗更少。
Details
Motivation: 解决大型语言模型在多数欧洲语言(尤其是低资源语言)上表现不佳的问题,因训练数据中英语等高资源语言占主导,导致语言不平等。 Method: 构建30B参数的多语言基础模型TildeOpen LLM;采用数据集上采样与交替式课程学习(在均匀分布与自然语言分布间切换)来缓解数据不平衡。 Result: 在多项多语言基准测试中,TildeOpen在文本生成与理解任务上超越现有开源多语言模型,尤其在波罗的海、芬兰-乌戈尔和斯拉夫语族语言上优势明显;人工评估显示其语言错误率相较主流基线降低达十倍。 Conclusion: 精细的数据筛选与平衡训练策略可在不增大模型规模或训练量的前提下,显著提升多语言大模型质量,推动语言公平。 Abstract: Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundational model trained for 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model performs favorably compared to other multilingual LLMs despite being trained with significantly fewer computing resources. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm an up to tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at huggingface.co/TildeAI/TildeOpen-30b. These outcomes demonstrate that careful data curation and balanced training strategies can substantially enhance multilingual model quality without increasing model size or training volume.[69] Supporting Workflow Reproducibility by Linking Bioinformatics Tools across Papers and Executable Code
Clémence Sebe,Olivier Ferret,Aurélie Névéol,Mahdi Esmailoghli,Ulf Leser,Sarah Cohen-Boulakia
Main category: cs.CL
TL;DR: CoPaLink 是一种自动化方法,用于将生物信息学工作流代码中的工具与论文描述中的工具提及进行链接,通过结合命名实体识别(NER)和基于知识库的实体链接,提升工作流的可理解性、可复现性和可重用性。
Details
Motivation: 生物数据快速增长,亟需透明、可复现、文档完善的工作流;当前缺乏将代码中工具步骤与论文描述中对应提及有效关联的方法。 Method: 提出 CoPaLink 方法,整合三部分:科学文本中的工具命名实体识别、工作流代码中的工具命名实体识别、以及基于 Bioconda 和 Bioweb 等生物信息学知识库的实体链接;使用带人工标注的科学文献与可执行工作流语料进行训练与评估。 Result: 各 NER 组件 F1 值达 84–89,联合准确率达 66(在 Nextflow 工作流上评估);代码与语料均已开源发布。 Conclusion: CoPaLink 有效弥合了工作流叙事描述与实际实现之间的鸿沟,为增强计算可复现性和跨平台工具重用提供了可行技术路径。 Abstract: Motivation: The rapid growth of biological data has intensified the need for transparent, reproducible, and well-documented computational workflows. The ability to clearly connect the steps of a workflow in the code with their description in a paper would improve workflow understanding, support reproducibility, and facilitate reuse. This task requires the linking of Bioinformatics tools in workflow code with their mentions in a published workflow description. Results: We present CoPaLink, an automated approach that integrates three components: Named Entity Recognition (NER) for identifying tool mentions in scientific text, NER for tool mentions in workflow code, and entity linking grounded on Bioinformatics knowledge bases. We propose approaches for all three steps achieving a high individual F1-measure (84 - 89) and a joint accuracy of 66 when evaluated on Nextflow workflows using Bioconda and Bioweb Knowledge bases. CoPaLink leverages corpora of scientific articles and workflow executable code with curated tool annotations to bridge the gap between narrative descriptions and workflow implementations. Availability: The code is available at https://gitlab.liris.cnrs.fr/sharefair/copalink-experiments and https://gitlab.liris.cnrs.fr/sharefair/copalink. The corpora are also available at https://doi.org/10.5281/zenodo.18526700, https://doi.org/10.5281/zenodo.18526760 and https://doi.org/10.5281/zenodo.18543814.[70] The Conundrum of Trustworthy Research on Attacking Personally Identifiable Information Removal Techniques
Sebastian Ochs,Ivan Habernal
Main category: cs.CL
TL;DR: 本文质疑了现有PII去除技术重建攻击成功率被高估的问题,指出其评估中存在数据泄露和污染,强调需使用真正私有数据进行客观评估,但受限于隐私保护,公开研究难以透明、可复现地开展。
Details
Motivation: 现有PII去除技术的重建攻击评估可能存在严重偏差,导致对隐私保护效果的误判,亟需厘清真实风险。 Method: 通过批判性分析现有重建攻击的评估方法,识别数据泄露与数据污染问题,并探讨避免数据泄露的可行攻击设置与数据源。 Result: 发现当前评估未能有效规避数据泄漏和污染,因此无法客观验证PII去除技术在现实场景中的隐私保护能力;真正私有数据虽为理想评估基础,但因隐私限制而难以获取。 Conclusion: 当前PII去除技术的隐私保障能力尚无可靠实证支持;缺乏合规、可用的私有数据严重制约了该领域透明、可复现的科学研究。 Abstract: Removing personally identifiable information (PII) from texts is necessary to comply with various data protection regulations and to enable data sharing without compromising privacy. However, recent works show that documents sanitized by PII removal techniques are vulnerable to reconstruction attacks. Yet, we suspect that the reported success of these attacks is largely overestimated. We critically analyze the evaluation of existing attacks and find that data leakage and data contamination are not properly mitigated, leaving the question whether or not PII removal techniques truly protect privacy in real-world scenarios unaddressed. We investigate possible data sources and attack setups that avoid data leakage and conclude that only truly private data can allow us to objectively evaluate vulnerabilities in PII removal techniques. However, access to private data is heavily restricted - and for good reasons - which also means that the public research community cannot address this problem in a transparent, reproducible, and trustworthy manner.[71] Sensivity of LLMs' Explanations to the Training Randomness:Context, Class & Task Dependencies
Romain Loncour,Jérémie Bogaert,François-Xavier Standaert
Main category: cs.CL
TL;DR: 本文研究了Transformer模型解释结果对随机性敏感性的原因,发现任务类型、类别和句法上下文均对其有显著影响,其中任务类型影响最大,类别次之,句法上下文最小。
Details
Motivation: Transformer模型决策的可解释性存在挑战,尤其是其解释结果对训练随机性高度敏感,需探究影响该敏感性的关键因素。 Method: 通过控制变量实验,系统分析句法上下文、学习类别和任务类型三类因素对解释敏感性的影响,并进行统计显著性检验。 Result: 三类因素均对解释敏感性有统计上显著的影响:句法上下文影响最小,类别影响中等,任务类型影响最大。 Conclusion: 解释敏感性主要受任务类型驱动,提示在评估或改进模型可解释性时应优先考虑任务层面的因素。 Abstract: Transformer models are now a cornerstone in natural language processing. Yet, explaining their decisions remains a challenge. It was shown recently that the same model trained on the same data with a different randomness can lead to very different explanations. In this paper, we investigate how the (syntactic) context, the classes to be learned and the tasks influence this explanations' sensitivity to randomness. We show that they all have statistically significant impact: smallest for the (syntactic) context, medium for the classes and largest for the tasks.[72] Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement
Dongxu Zhang,Hongqiang Lin,Yiding Sun,Pengyu Wang,Qirui Wang,Ning Yang,Jihua Zhu
Main category: cs.CL
TL;DR: CoFiCot是一种粗到细的自适应推理框架,通过多指标分类器动态分配计算资源,对简单问题高效聚合,对复杂问题启用上下文感知的迭代修正,并利用过程奖励模型(PRM)实现状态依赖的纠错,提升LLM推理能力。
Details
Motivation: 解决大语言模型在测试时扩展计算所面临的‘统一计算悖论’——即固定计算量导致简单任务过修正、复杂任务修正不足的问题。 Method: 提出CoFiCot框架:1)构建融合语义熵、共识可靠性与预测推理深度的多指标分类器进行问题难度分级;2)据此实施差异化精炼策略——简单查询用高效聚合,复杂查询进入上下文感知的循环修正;3)将修正建模为状态依赖的序列传播过程,并在其中集成过程奖励模型(PRM)以保障逻辑连贯性。 Result: CoFiCot在保持整体效率的同时,显著提升了复杂推理任务的准确性,避免了无状态精炼方法常见的上下文碎片化问题,实现了细粒度错误定位与全局逻辑一致性的统一。 Conclusion: 动态适配问题难度的分层推理机制(尤其是状态依赖的PRM增强修正)是突破LLM测试时计算瓶颈、提升鲁棒推理能力的有效路径。 Abstract: Scaling test-time computation enhances LLM reasoning ability but faces a uniform computation paradox. Allocating identical resources leads to over-correction on simple tasks and insufficient refinement on complex ones. To address this, we propose CoFiCot, a coarse-to-fine adaptive framework that dynamically tailors inference strategies to problem difficulty. Specifically, we implement a multi-metric classifier that triages queries by synthesizing semantic entropy, consensus reliability, and predicted reasoning depth . This enables a differentiated refinement stage that applies efficient aggregation for simple queries while routing complex ones to a context-aware correction loop . We formalize correction as a stateful sequential propagation process , where each repair is strictly conditioned on the verified history of prior rectifications. By integrating Process Reward Models (PRMs) within this state-dependent trajectory, CoFiCot effectively bridges the gap between granular error localization and global logical coherence, preventing the context fragmentation typical of stateless refinement methods.[73] NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating
Tong Wu,Thanet Markchom,Huizhi Liang
Main category: cs.CL
TL;DR: 本文系统比较了三种词义合理性评分方法:嵌入式回归、参数高效微调的Transformer模型和带结构化推理与显式决策规则的大语言模型提示方法,发现结构化提示策略(分解叙事成分并应用显式校准规则)效果最佳,且提示设计比模型规模更重要。
Details
Motivation: 词义合理性评分任务需要在包含歧义同音词的短叙事语境中,预测人类对特定词义的1-5分合理性感知,亟需系统评估不同建模方法的有效性。 Method: 系统比较三种方法:(1) 基于句子嵌入+标准回归器;(2) 参数高效微调的Transformer;(3) 带结构化推理与显式决策规则的LLM提示;其中最优方法将叙事分解为前文、目标句、结尾,并应用显式校准规则。 Result: 结构化提示策略显著优于微调模型和嵌入方法;提示设计对性能影响大于模型规模。 Conclusion: 在词义合理性评分任务中,精心设计的结构化提示(含叙事分解与显式规则)是更有效、更轻量的解决方案,优于依赖大规模参数微调或传统嵌入方法。 Abstract: Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1--5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules substantially outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task. The code is publicly available at https://github.com/tongwu17/SemEval-2026-Task5.[74] How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
JV Roig
Main category: cs.CL
TL;DR: 本文提出RIKER评估方法,对35个开源大语言模型在不同上下文长度、温度和硬件平台下的事实性(幻觉)进行了大规模基准测试,发现模型幻觉率随上下文长度显著上升,模型选择影响最大,温度设置效果复杂,且事实检索能力与抗幻觉能力并不等价。
Details
Motivation: 现有幻觉评估方法存在数据集污染、LLM裁判偏差、评分尺度不足等问题,缺乏可靠、可复现、高统计置信度的测量手段,尤其对企业AI部署至关重要。 Method: 提出RIKER——一种以真实答案为先(ground-truth-first)、支持确定性评分、无需人工标注的评估方法,并在35个开源模型、3种上下文长度(32K/128K/200K)、4种温度、3种硬件平台上开展超1720亿token的大规模实验。 Result: (1)最佳模型在32K时幻觉率仍达1.19%,128K时近三倍,200K时全模型超10%;(2)模型选择主导性能差异(准确率跨度72个百分点),模型家族比参数量更能预测抗幻觉能力;(3)温度T=0.0在60%情况下最优,但更高温度普遍降低幻觉并大幅缓解生成死循环;(4)事实定位能力与抗幻觉能力是分离的;(5)结果跨硬件平台一致。 Conclusion: 幻觉是当前RAG与长上下文应用中的关键瓶颈;RIKER提供了一种可扩展、无偏、确定性的评估范式;模型选型、上下文长度控制与温度调优需协同优化,不能仅依赖参数规模或检索精度。 Abstract: How much do large language models actually hallucinate when answering questions grounded in provided documents? Despite the critical importance of this question for enterprise AI deployments, reliable measurement has been hampered by benchmarks that rely on static datasets vulnerable to contamination, LLM-based judges with documented biases, or evaluation scales too small for statistical confidence. We address this gap using RIKER, a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation. Across 35 open-weight models, three context lengths (32K, 128K, and 200K tokens), four temperature settings, and three hardware platforms (NVIDIA H200, AMD MI300X, and Intel Gaudi 3), we conducted over 172 billion tokens of evaluation - an order of magnitude beyond prior work. Our findings reveal that: (1) even the best-performing models fabricate answers at a non-trivial rate - 1.19% at best at 32K, with top-tier models at 5 - 7% - and fabrication rises steeply with context length, nearly tripling at 128K and exceeding 10% for all models at 200K; (2) model selection dominates all other factors, with overall accuracy spanning a 72-percentage-point range and model family predicting fabrication resistance better than model size; (3) temperature effects are nuanced - T=0.0 yields the best overall accuracy in roughly 60% of cases, but higher temperatures reduce fabrication for the majority of models and dramatically reduce coherence loss (infinite generation loops), which can reach 48x higher rates at T=0.0 versus T=1.0; (4) grounding ability and fabrication resistance are distinct capabilities - models that excel at finding facts may still fabricate facts that do not exist; and (5) results are consistent across hardware platforms, confirming that deployment decisions need not be hardware-dependent.[75] AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Models
Hankun Kang,Di Lin,Zhirong Liao,Pengfei Bai,Xinyi Zeng,Jiawei Jiang,Yuanyuan Zhu,Tieyun Qian
Main category: cs.CL
TL;DR: 本文提出联合建模文化安全与文化知识的方法,构建了AdaCultureSafe数据集,并发现大语言模型的文化安全性与其文化知识掌握程度无显著相关性;进一步分析表明,预训练与对齐目标的差异是主因,并提出了基于知识增强的文化安全提升方法。
Details
Motivation: 现有研究将文化安全与文化知识割裂看待,导致大语言模型难以生成符合特定文化背景的尊重性回应,文化安全的自适应能力因此受限。 Method: 提出融合权威文化知识整理、大语言模型自动提问生成与人工严格验证的框架,构建AdaCultureSafe数据集(含4.8K细粒度文化描述及48K安全与知识导向问题);在此基础上评估主流LLM的文化安全与知识能力,并通过神经元激活分析探究二者不相关的原因;最后设计知识引导的响应生成方法以提升文化安全性。 Result: 发现主流大语言模型的文化安全性与其文化知识掌握程度之间无显著相关性;神经元分析揭示该现象源于预训练与后对齐目标的不一致;所提知识引导方法显著提升了模型的文化安全性。 Conclusion: 文化安全必须以文化知识为根基,二者需协同建模;仅靠对齐无法保障文化安全,需在生成过程中显式融入文化知识。 Abstract: With the widespread adoption of Large Language Models (LLMs), respecting indigenous cultures becomes essential for models' culturally safety and responsible global applications. Existing studies separately consider cultural safety and cultural knowledge and neglect that the former should be grounded by the latter. This severely prevents LLMs from yielding culture-specific respectful responses. Consequently, adaptive cultural safety remains a formidable task. In this work, we propose to jointly model cultural safety and knowledge. First and foremost, cultural-safety and knowledge-paired data serve as the key prerequisite to conduct this research. However, the cultural diversity across regions and the subtlety of cultural differences pose significant challenges to the creation of such paired evaluation data. To address this issue, we propose a novel framework that integrates authoritative cultural knowledge descriptions curation, LLM-automated query generation, and heavy manual verification. Accordingly, we obtain a dataset named AdaCultureSafe containing 4.8K manually decomposed fine-grained cultural descriptions and the corresponding 48K manually verified safety- and knowledge-oriented queries. Upon the constructed dataset, we evaluate three families of popular LLMs on their cultural safety and knowledge proficiency, via which we make a critical discovery: no significant correlation exists between their cultural safety and knowledge proficiency. We then delve into the utility-related neuron activations within LLMs to investigate the potential cause of the absence of correlation, which can be attributed to the difference of the objectives of pre-training and post-alignment. We finally present a knowledge-grounded method, which significantly enhances cultural safety by enforcing the integration of knowledge into the LLM response generation process.[76] Evaluating LLM-Based Grant Proposal Review via Structured Perturbations
William Thorne,Joseph James,Yang Wang,Chenghua Lin,Diana Maynard
Main category: cs.CL
TL;DR: 本文探讨了大语言模型(LLM)在高风险科研基金评审中的能力与局限,通过扰动实验框架评估六类质量维度,并比较三种评审架构,发现分节分析法效果最佳,但LLM仍存在检测偏差、优先级错位和高变异性问题,仅适合作为补充工具。
Details
Motivation: 随着AI辅助撰写的基金申请书数量激增,人工评审能力已跟不上,形成研究生态的“马尔萨斯陷阱”,亟需评估LLM能否胜任高 stakes 的基金评审任务。 Method: 基于6份EPSRC基金申请书,构建扰动驱动的评估框架,覆盖资金、时间线、能力、契合度、清晰度和影响力六个质量维度;对比三种评审架构:单次通读、分节分析、以及模拟专家小组的‘人格委员会’集成方法。 Result: 分节分析法在检测率与评分可靠性上显著优于其他两种;‘人格委员会’计算开销大但性能未超基准;所有模型均易检出契合度问题,却普遍遗漏清晰度缺陷;人工评估显示LLM反馈虽基本有效,但偏重合规性检查而非整体性判断。 Conclusion: 当前LLM可在EPSRC评审中提供有限补充价值,但存在高变异性与评审优先级错位问题,尚不能替代人工评审。 Abstract: As AI-assisted grant proposals outpace manual review capacity in a kind of ``Malthusian trap'' for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation. Using six EPSRC proposals, we develop a perturbation-based framework probing LLM sensitivity across six quality axes: funding, timeline, competency, alignment, clarity, and impact. We compare three review architectures: single-pass review, section-by-section analysis, and a 'Council of Personas' ensemble emulating expert panels. The section-level approach significantly outperforms alternatives in both detection rate and scoring reliability, while the computationally expensive council method performs no better than baseline. Detection varies substantially by perturbation type, with alignment issues readily identified but clarity flaws largely missed by all systems. Human evaluation shows LLM feedback is largely valid but skewed toward compliance checking over holistic assessment. We conclude that current LLMs may provide supplementary value within EPSRC review but exhibit high variability and misaligned review priorities. We release our code and any non-protected data.[77] Using Multimodal and Language-Agnostic Sentence Embeddings for Abstractive Summarization
Chaimae Chellaf,Salima Mdhaffar,Yannick Estève,Stéphane Huet
Main category: cs.CL
TL;DR: 本文提出SBARThez框架,利用多模态和多语言句子嵌入及命名实体注入机制,提升法语抽象摘要生成的事实一致性与跨语言能力,尤其适用于低资源语言。
Details
Motivation: 解决抽象摘要中模型易产生‘幻觉’(即引入不存在信息)导致事实不一致的问题。 Method: 结合LaBSE、SONAR、BGE-M3等预训练模型的多模态/多语言句嵌入,改进BART法语模型,并引入命名实体注入机制(将tokenized命名实体附加至解码器输入)。 Result: SBARThez在文本与语音输入、跨语言摘要任务上均表现优异,相比词元级基线更具竞争力,尤其在低资源语言上生成更简洁、更抽象且事实更一致的摘要。 Conclusion: 多语言句嵌入与命名实体注入可有效增强抽象摘要的事实一致性与泛化能力,SBARThez为低资源语言摘要提供了新思路。 Abstract: Abstractive summarization aims to generate concise summaries by creating new sentences, allowing for flexible rephrasing. However, this approach can be vulnerable to inaccuracies, particularly `hallucinations' where the model introduces non-existent information. In this paper, we leverage the use of multimodal and multilingual sentence embeddings derived from pretrained models such as LaBSE, SONAR, and BGE-M3, and feed them into a modified BART-based French model. A Named Entity Injection mechanism that appends tokenized named entities to the decoder input is introduced, in order to improve the factual consistency of the generated summary. Our novel framework, SBARThez, is applicable to both text and speech inputs and supports cross-lingual summarization; it shows competitive performance relative to token-level baselines, especially for low-resource languages, while generating more concise and abstract summaries.[78] LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs
Serene Wang,Lavanya Pobbathi,Haihua Chen
Main category: cs.CL
TL;DR: 本文提出了LAMUS数据集,一个基于美国最高法院判决和德克萨斯州刑事上诉意见构建的句子级法律论证挖掘语料库,并结合LLM自动标注与人工校验提升标注质量;实验表明思维链提示显著提升大模型性能,领域模型零样本表现更稳定,LLM辅助验证可修正近20%标注错误。
Details
Motivation: 法律论证挖掘研究受限于高质量、大规模美国判例(尤其是州级)标注数据的缺乏。 Method: 构建LAMUS语料库:采集大量判例,采用LLM进行自动句子级标注(六类法律论证成分),再通过人工参与的质量精修;将任务建模为六分类句子分类,评估多种通用及法律领域语言模型在零样本、少样本和思维链提示下的性能,并以LegalBERT为监督基线。 Result: 思维链提示显著提升LLM性能;法律领域模型零样本表现更稳定;LLM辅助验证纠正近20%标注错误,提升标签一致性;人工校验Cohen's Kappa达0.85,验证标注质量高。 Conclusion: LAMUS是一个可扩展、高质量的法律论证挖掘资源,为法律NLP研究提供了新基准与实证洞见;代码与数据已开源。 Abstract: Legal argument mining aims to identify and classify the functional components of judicial reasoning, such as facts, issues, rules, analysis, and conclusions. Progress in this area is limited by the lack of large-scale, high-quality annotated datasets for U.S. caselaw, particularly at the state level. This paper introduces LAMUS, a sentence-level legal argument mining corpus constructed from U.S. Supreme Court decisions and Texas criminal appellate opinions. The dataset is created using a data-centric pipeline that combines large-scale case collection, LLM-based automatic annotation, and targeted human-in-the-loop quality refinement. We formulate legal argument mining as a six-class sentence classification task and evaluate multiple general-purpose and legal-domain language models under zero-shot, few-shot, and chain-of-thought prompting strategies, with LegalBERT as a supervised baseline. Results show that chain-of-thought prompting substantially improves LLM performance, while domain-specific models exhibit more stable zero-shot behavior. LLM-assisted verification corrects nearly 20% of annotation errors, improving label consistency. Human verification achieves Cohen's Kappa of 0.85, confirming annotation quality. LAMUS provides a scalable resource and empirical insights for future legal NLP research. All code and datasets can be accessed for reproducibility on GitHub at: https://github.com/LavanyaPobbathi/LAMUS/tree/main[79] Learning Multiple Utterance-Level Attribute Representations with a Unified Speech Encoder
Maryem Bouziane,Salima Mdhaffar,Yannick Estève
Main category: cs.CL
TL;DR: 本文提出了一种统一的后训练框架,使语音基础模型能够生成多种话语级表示(如语义和说话人表示),从而支持多语言语音检索和说话人识别等任务。
Details
Motivation: 现有语音基础模型主要学习帧级上下文嵌入,而近期方法(如SAMU-XSLR、SONAR)虽转向话语级语义对齐,但缺乏对多种话语级属性(如说话人)的统一建模能力。 Method: 扩展话语级表示学习范式,设计统一后训练框架,联合学习语义与说话人等多类型话语级表示。 Result: 在多语言语音检索和说话人识别任务上验证了所提方法的有效性,表明单个语音基础模型可同时支持多种下游任务。 Conclusion: 统一后训练框架能灵活适配多种话语级属性,提升语音基础模型的泛化性与实用性。 Abstract: Speech foundation models trained with self-supervised learning produce generic speech representations that support a wide range of speech processing tasks. When further adapted with supervised learning, these models can achieve strong performance on specific downstream tasks. Recent post-training approaches, such as SAMU-XSLR and SONAR, align speech representations with utterance-level semantic representations, enabling effective multimodal (speech-text) and multilingual applications. While speech foundation models typically learn contextual embeddings at the acoustic frame level, these methods learn representations at the utterance level. In this work, we extend this paradigm to arbitrary utterance-level attributes and propose a unified post-training framework that enables a single speech foundation model to generate multiple types of utterance-level representations. We demonstrate the effectiveness of this approach by jointly learning semantic and speaker representations and evaluating them on multilingual speech retrieval and speaker recognition tasks.[80] SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation
Yagiz Can Akay,Muhammed Yusuf Kartal,Esra Alparslan,Faruk Ortakoyluoglu,Arda Akpinar
Main category: cs.CL
TL;DR: SPD-RAG是一种面向多文档问答的分层多智能体RAG框架,通过按文档轴分解任务、文档级智能体专注处理各自内容、协调器调度与聚合、以及令牌受限的合成层实现高效可靠推理,在LOONG基准上显著提升性能并降低成本。
Details
Motivation: 标准RAG在大规模文档中证据覆盖不全,长上下文LLM难以在海量输入上可靠推理。 Method: 提出SPD-RAG:基于文档轴分解的分层多智能体框架,含文档级专用智能体(各处理单文档)、任务协调器(调度与聚合)和令牌受限合成层(支持递归map-reduce)。 Result: 在LOONG基准上Avg Score达58.1(GPT-5评估),优于Normal RAG(33.0)和Agentic RAG(32.8),且API成本仅为全上下文基线的38%。 Conclusion: 文档级专业化与中心化融合提升了异构多文档场景下的可扩展性与答案质量,并构建了模块化、可扩展的检索流水线。 Abstract: Answering complex, real-world queries often requires synthesizing facts scattered across vast document corpora. In these settings, standard retrieval-augmented generation (RAG) pipelines suffer from incomplete evidence coverage, while long-context large language models (LLMs) struggle to reason reliably over massive inputs. We introduce SPD-RAG, a hierarchical multi-agent framework for exhaustive cross-document question answering that decomposes the problem along the document axis. Each document is processed by a dedicated document-level agent operating only on its own content, enabling focused retrieval, while a coordinator dispatches tasks to relevant agents and aggregates their partial answers. Agent outputs are synthesized by merging partial answers through a token-bounded synthesis layer (which supports recursive map-reduce for massive corpora). This document-level specialization with centralized fusion improves scalability and answer quality in heterogeneous multidocument settings while yielding a modular, extensible retrieval pipeline. On the LOONG benchmark (EMNLP 2024) for long-context multi-document QA, SPD-RAG achieves an Avg Score of 58.1 (GPT-5 evaluation), outperforming Normal RAG (33.0) and Agentic RAG (32.8) while using only 38% of the API cost of a full-context baseline (68.0).[81] Do Language Models Know Theo Has a Wife? Investigating the Proviso Problem
Tara Azin,Daniel Dumitrescu,Diana Inkpen,Raj Singh
Main category: cs.CL
TL;DR: 本文研究语言模型如何处理语用学中的预设问题(proviso problem),通过将该问题重构为自然语言推理任务并构建诊断数据集,评估了RoBERTa、DeBERTa、LLaMA和Gemma等模型。结果表明模型虽大致符合人类判断,但依赖表层模式匹配而非语义或语用推理。
Details
Motivation: 解决语用学中条件句预设在理论与人类理解间存在分歧的未解问题(proviso problem),并填补语言模型在该现象上的计算评估空白。 Method: 将预设问题重构为自然语言推理任务,构建专门用于探测条件句中预设投射的诊断数据集,并对RoBERTa、DeBERTa、LLaMA和Gemma进行评估,辅以可解释性分析。 Result: 模型总体上与人类判断一致,但主要依赖浅层模式匹配,缺乏真正的语义或语用推理能力。 Conclusion: 本文首次建立了针对proviso问题的计算评估框架,强调需采用诊断性、多方法手段来评估语言模型的语用能力和上下文依赖意义理解。 Abstract: We investigate how language models handle the proviso problem, an unresolved issue in pragmatics where presuppositions in conditional sentences diverge between theoretical and human interpretations. We reformulate this phenomenon as a Natural Language Inference task and introduce a diagnostic dataset designed to probe presupposition projection in conditionals. We evaluate RoBERTa, DeBERTa, LLaMA, and Gemma using explainability analyses. The results show that models broadly align with human judgments but rely on shallow pattern matching rather than semantic or pragmatic reasoning. Our work provides the first computational evaluation framework for the proviso problem and highlights the need for diagnostic, multi-method approaches to assess pragmatic competence and context-dependent meaning in language models.[82] Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors
Okko Räsänen
Main category: cs.CL
TL;DR: 本章综述了利用自监督与视觉引导的计算模型理解婴儿早期语音与视听语言习得的最新进展,强调无需强语言先验即可学习语音特征,并揭示其与人类认知及多种语言习得理论相容的学习机制。
Details
Motivation: 从信息处理角度看,婴儿从原始语音信号中习得语言是一项巨大挑战,需借助计算模型揭示其内在机制。 Method: 采用自监督学习和视觉引导的计算模型,模拟婴儿从语音及音视频输入中进行感知学习的过程。 Result: 模型能不依赖强语言先验学习多种语音特性;多个早期语言发展现象可用一组共通学习原则解释;模型在输入真实性和行为可解释性上日益贴近婴儿实证研究。 Conclusion: 自监督与视觉引导建模为理解早期语言习得提供了统一、可计算且与认知理论兼容的框架,推动计算模拟向更真实、更具解释力的方向发展。 Abstract: Learning to understand speech appears almost effortless for typically developing infants, yet from an information-processing perspective, acquiring a language from acoustic speech is an enormous challenge. This chapter reviews recent developments in using computational models to understand early language acquisition from speech and audiovisual input. The focus is on self-supervised and visually grounded models of perceptual learning. We show how these models are becoming increasingly powerful in learning various aspects of speech without strong linguistic priors, and how many features of early language development can be explained through a shared set of learning principles-principles broadly compatible with multiple theories of language acquisition and human cognition. We also discuss how modern learning simulations are gradually becoming more realistic, both in terms of input data and in linking model behavior to empirical findings on infant language development.[83] Adaptive Loops and Memory in Transformers: Think Harder or Know More?
Markus Frey,Behzad Shomali,Ali Hamza Bashir,David Berghaus,Mehdi Ali
Main category: cs.CL
TL;DR: 本文提出了一种结合自适应每层循环(通过学习的停止机制)和门控记忆库的Transformer模型,以提升数学推理和常识推理性能,并在数学基准上超越了FLOP匹配的深层模型。
Details
Motivation: 链式思维提示虽能增强语言模型推理能力,但需显式表达中间步骤;而循环Transformer虽参数高效,却缺乏深层模型的存储容量。本文旨在探索兼顾循环效率与额外存储能力的新架构。 Method: 设计一种新型Transformer模型,包含两个核心机制:1)自适应每层循环,即各Transformer块通过学习的停止机制决定迭代次数;2)门控记忆库,提供额外的可学习存储空间。 Result: 实验表明,循环机制主要提升数学推理能力,记忆库则有助于恢复常识推理性能;二者结合后,在FLOP匹配下超越层数为其三倍的基线模型;内部分析显示层间存在功能特化:浅层循环少、内存访问少,深层则更频繁使用两者。 Conclusion: 自适应循环与门控记忆库的协同设计能有效平衡参数效率与推理能力,在数学推理任务中显著优于传统深层模型,且揭示了模型内部层间功能分工现象。 Abstract: Chain-of-thought (CoT) prompting enables reasoning in language models but requires explicit verbalization of intermediate steps. Looped transformers offer an alternative by iteratively refining representations within hidden states. This parameter efficiency comes at a cost, as looped models lack the storage capacity of deeper models which use unique weights per layer. In this work, we investigate transformer models that feature both adaptive per-layer looping, where each transformer block learns to iterate its hidden state via a learned halting mechanism, and gated memory banks, that provide additional learned storage. We find that looping primarily benefits mathematical reasoning, while memory banks help recover performance on commonsense tasks compared to parameter and FLOP matched models. Combining both mechanisms yields a model that outperforms an iso-FLOP baseline -- with three times the number of layers -- on math benchmarks. Analysis of model internals reveals layer specialization: early layers learn to loop minimally and access memory sparingly, while later layers do both more heavily.[84] COACH meets QUORUM: A Framework and Pipeline for Aligning User, Expert and Developer Perspectives in LLM-generated Health Counselling
Yee Man Ng,Bram van Dijk,Pieter Beynen,Otto Boekesteijn,Joris Jansen,Gerard van Oortmerssen,Max van Duijn,Marco Spruit
Main category: cs.CL
TL;DR: 本文提出QUORUM评估框架和COACH语言模型管道,用于生成个性化生活方式建议,并通过多利益相关者评估验证其在癌症患者健康管理中的有效性与挑战。
Details
Motivation: 开发能够结合用户数据与医学知识、生成个性化且可靠生活方式建议的系统面临诸多挑战,亟需统一开发者、医学专家和用户视角的评估框架。 Method: 提出QUORUM多视角评估框架,并构建基于大语言模型的COACH生成管道,应用于面向癌症患者及幸存者的日记App‘Healthy Chronos’。 Result: 评估显示三类利益相关者在建议的相关性、质量与可靠性上基本达成共识,但在语气、对模式提取错误的敏感性及幻觉风险方面存在分歧。 Conclusion: 多利益相关者联合评估对消费级健康语言技术至关重要;QUORUM框架有助于构建可信、以患者为中心的真实世界NLP系统。 Abstract: Systems that collect data on sleep, mood, and activities can provide valuable lifestyle counselling to populations affected by chronic disease and its consequences. Such systems are, however, challenging to develop; besides reliably extracting patterns from user-specific data, systems should also contextualise these patterns with validated medical knowledge to ensure the quality of counselling, and generate counselling that is relevant to a real user. We present QUORUM, a new evaluation framework that unifies these developer-, expert-, and user-centric perspectives, and show with a real case study that it meaningfully tracks convergence and divergence in stakeholder perspectives. We also present COACH, a Large Language Model-driven pipeline to generate personalised lifestyle counselling for our Healthy Chronos use case, a diary app for cancer patients and survivors. Applying our framework shows that overall, users, medical experts, and developers converge on the opinion that the generated counselling is relevant, of good quality, and reliable. However, stakeholders also diverge on the tone of the counselling, sensitivity to errors in pattern-extraction, and potential hallucinations. These findings highlight the importance of multi-stakeholder evaluation for consumer health language technologies and illustrate how a unified evaluation framework can support trustworthy, patient-centered NLP systems in real-world settings.[85] Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective
Liyuan Mao,Le Yu,Jing Zhou,Chujie Zheng,Bowen Yu,Chang Gao,Shixuan Liu,An Yang,Weinan Zhang,JunYang Lin
Main category: cs.CL
TL;DR: 本文揭示了大语言模型(LLMs)具有类变色龙的行为可塑性,可通过词元条件生成在推理时动态切换行为模式,并提出Token-Conditioned Reinforcement Learning(ToCoRL)框架,利用强化学习将这种瞬时适应固化为稳定、可学习的行为模式,在不损害原有能力的前提下实现精准行为控制。
Details
Motivation: 现有LLM行为模式固化,难以在推理时灵活切换(如从逐步推理切换为直接作答),限制了其在不同任务上的适应性;需一种无需重训练即可实现行为动态调控的方法。 Method: 提出Token-Conditioned Reinforcement Learning(ToCoRL):利用精心选取的、体现目标行为的响应词元前缀进行条件生成以引导行为切换,并结合强化学习稳定和内化该可塑性;通过条件生成引导探索,持续优化利用,使合适行为自然涌现。 Result: ToCoRL在多个任务上实现了精确的行为控制且不降低模型原有能力;特别地,成功将强数学推理的大模型适配为优秀的事实问答模型,克服了其固有逐步推理模式对事实问答的干扰。 Conclusion: LLMs具备内在行为可塑性,ToCoRL能有效挖掘并固化该特性,为可控、可定制的LLM行为建模提供了新范式。 Abstract: In this work, we reveal that Large Language Models (LLMs) possess intrinsic behavioral plasticity-akin to chameleons adapting their coloration to environmental cues-that can be exposed through token-conditional generation and stabilized via reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reasoning to direct answering) without retraining. Based on this insight, we propose Token-Conditioned Reinforcement Learning (ToCoRL), a principled framework that leverages RL to internalize this chameleon-like plasticity, transforming transient inference-time adaptations into stable and learnable behavioral patterns. ToCoRL guides exploration with token-conditional generation and keep enhancing exploitation, enabling emergence of appropriate behaviors. Extensive experiments show that ToCoRL enables precise behavioral control without capability degradation. Notably, we show that large reasoning models, while performing strongly on complex mathematics, can be effectively adapted to excel at factual question answering, which was a capability previously hindered by their step-by-step reasoning patterns.[86] Aligning to Illusions: Choice Blindness in Human and AI Feedback
Wenbin Wu
Main category: cs.CL
TL;DR: 本文挑战了RLHF中人类偏好反映稳定内在状态的假设,通过三组实验揭示偏好信号易受情境影响且难以被检测,表明RLHF面临根本性的偏好构建问题。
Details
Motivation: RLHF依赖人类偏好作为稳定、可靠的学习信号,但作者质疑该假设,认为偏好可能受情境、认知偏差和模型局限性影响,从而威胁RLHF的有效性和鲁棒性。 Method: 开展三项实验:(1)人类选择盲区研究,测试对被悄悄调换的文本偏好的觉察率;(2)用15个LLM作为评判者,分析其检测能力是否依赖浅层文本匹配而非深层自我监控;(3)剂量响应实验与Best-of-N评估,量化偏好标签污染对奖励信号及下游策略性能的影响。 Result: 91%的人类偏好调换未被察觉;LLM检测高度依赖上下文中的先前推理,移除后检测失败率超50%;仅1/6–1/3偏好标签污染即使奖励信号减半,但标准成对准确率几乎不变;50%污染时,基于奖励的选择不再优于随机采样,而代理模型评分却持续上升。 Conclusion: RLHF所依赖的偏好信号并非源于稳定的内在判断,而是由偏好采集的情境动态构建;该问题无法被人类元认知、LLM自我监控或常规评估指标识别,构成RLHF方法论的基础性挑战。 Abstract: Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.[87] One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States
Bo Jiang
Main category: cs.CL
TL;DR: 本文提出了一种为大语言模型(LLM)代理赋予原生检索能力的方法,通过在LLM上添加轻量级投影头,直接将其隐藏状态映射到嵌入空间,从而省去单独的嵌入模型,降低延迟与系统复杂度,同时保持97%的基线检索质量。
Details
Motivation: 现有LLM代理检索外部知识时需先生成文本查询,再用独立嵌入模型编码,造成基础设施复杂和延迟增加;而LLM本身已在其隐藏状态中编码了完整对话上下文,因此该两阶段流程存在冗余。 Method: 在LLM上添加一个轻量级投影头,将隐藏状态直接映射至嵌入空间;采用对齐损失、对比损失和排序蒸馏损失联合训练。 Result: 在QReCC对话搜索基准上,Recall@10 和 MRR@10 与标准‘生成-再编码’流程相当;消融实验验证了各损失项的有效贡献;检索质量达基线的97%。 Conclusion: LLM代理可无需额外嵌入模型实现高质量检索,通过利用其自身隐藏表示并辅以多目标蒸馏训练,兼顾性能、效率与部署简洁性。 Abstract: LLM agents that retrieve external knowledge typically generate a search query as text, then run a separate embedding model to encode it into a vector. This two-model pipeline adds infrastructure complexity and latency, yet is redundant: the LLM already encodes the full conversational context in its hidden states. We propose equipping LLM agents with native retrieval capability by adding a lightweight projection head that maps hidden states directly into the embedding space, eliminating the need for a separate embedding model. Trained with a combination of alignment, contrastive, and rank distillation losses, our method retains 97\% of baseline retrieval quality while enabling the LLM agent to search with its own representations. Experiments on the QReCC conversational search benchmark show competitive Recall@10 and MRR@10 compared to the standard generate-then-encode pipeline, with systematic ablations confirming the contribution of each loss component.[88] A Dataset for Probing Translationese Preferences in English-to-Swedish Translation
Jenny Kunz,Anja Jarochenko,Marcel Bollmann
Main category: cs.CL
TL;DR: 本文构建了首个免费的英-瑞典语翻译语料库,用于检测语言模型对翻译腔(translationese)的偏好,并发现小规模瑞典语和多语种大模型倾向于选择不自然的翻译腔表达,而非更地道的人工改写结果。
Details
Motivation: 探究语言模型在非英语语言生成中对翻译腔的内在偏好,推动更自然、地道输出模型的发展。 Method: 构建首个公开的英-瑞典语对比数据集,包含翻译腔句子与地道人工改写句,并标注错误类型;在多个小规模瑞典语及多语种大模型上开展消融实验(含/不含英文源句)。 Result: 实验表明,模型普遍偏好翻译腔表达;当隐藏英文源句时,人类改写选项被选中的比例上升,但仍常低于翻译腔选项。 Conclusion: 当前小规模模型在非英语语言生成中存在显著的翻译腔偏好,暴露其对源语依赖过强及地道表达能力不足的问题;本数据集为改进模型提供了新基准与资源。 Abstract: Translations often carry traces of the source language, a phenomenon known as translationese. We introduce the first freely available English-to-Swedish dataset contrasting translationese sentences with idiomatic alternatives, designed to probe intrinsic preferences of language models. It includes error tags and descriptions of the problems in the original translations. In experiments evaluating smaller Swedish and multilingual LLMs with our dataset, we find that they often favor the translationese phrasing. Human alternatives are chosen more often when the English source sentence is omitted, indicating that exposure to the source biases models toward literal translations, although even without context models often prefer the translationese variant. Our dataset and findings provide a resource and benchmark for developing models that produce more natural, idiomatic output in non-English languages.[89] Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA
Ummar Abbas,Mourad Ouzzani,Mohamed Y. Eltabakh,Omar Sinan,Gagan Bhatia,Hamdy Mubarak,Majd Hawasly,Mohammed Qusay Hashim,Kareem Darwish,Firoj Alam
Main category: cs.CL
TL;DR: 本文提出了一种名为Fanar-Sadiq的双语(阿拉伯语/英语)多智能体伊斯兰助手,通过意图感知路由、检索增强的教法学回答、精确经文检索与引用验证、以及符合教法学派的天课与遗产分配确定性计算器,显著提升了伊斯兰知识问答的准确性与可靠性。
Details
Motivation: 大型语言模型在回答宗教知识问题时易产生幻觉和错误归因,尤其在伊斯兰语境中,用户期望答案严格基于《古兰经》《圣训》及教法学(fiqh)细节,现有RAG方法难以应对伊斯兰查询的多样性(如经文引用、教法裁决、天课/遗产计算等)。 Method: 构建了一个基于多智能体、工具调用架构的双语伊斯兰助手Fanar-Sadiq,包含意图识别路由模块、检索增强的教法回答模块(含引用标准化与验证轨迹)、精确经文查找模块(支持引文验证)以及符合逊尼派各教法学派的天课与遗产计算模块。 Result: 在公开伊斯兰问答基准上验证了系统端到端的有效性与效率;系统已上线API与网页应用,一年内访问量约190万次。 Conclusion: Fanar-Sadiq通过任务专业化与结构化工具协同,有效缓解了LLM在宗教知识场景中的幻觉与不一致问题,为高可靠性领域AI提供了可扩展的多智能体范式。 Abstract: Large language models (LLMs) can answer religious knowledge queries fluently, yet they often hallucinate and misattribute sources, which is especially consequential in Islamic settings where users expect grounding in canonical texts (Qur'an and Hadith) and jurisprudential (fiqh) nuance. Retrieval-augmented generation (RAG) reduces some of these limitations by grounding generation in external evidence. However, a single ``retrieve-then-generate'' pipeline is limited to deal with the diversity of Islamic queries.Users may request verbatim scripture, fatwa-style guidance with citations or rule-constrained computations such as zakat and inheritance that require strict arithmetic and legal invariants. In this work, we present a bilingual (Arabic/English) multi-agent Islamic assistant, called Fanar-Sadiq, which is a core component of the Fanar AI platform. Fanar-Sadiq routes Islamic-related queries to specialized modules within an agentic, tool-using architecture. The system supports intent-aware routing, retrieval-grounded fiqh answers with deterministic citation normalization and verification traces, exact verse lookup with quotation validation, and deterministic calculators for Sunni zakat and inheritance with madhhab-sensitive branching. We evaluate the complete end-to-end system on public Islamic QA benchmarks and demonstrate effectiveness and efficiency. Our system is currently publicly and freely accessible through API and a Web application, and has been accessed $\approx$1.9M times in less than a year.[90] CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning
Siye Wu,Jian Xie,Yikai Zhang,Yanghua Xiao
Main category: cs.CL
TL;DR: 本文提出CODA方法,通过内部难度信号动态分配推理计算资源,实现自适应推理,在简单任务上大幅减少token消耗,在困难任务上提升性能。
Details
Motivation: 大型推理模型在复杂任务上表现优异,但常在简单问题上过度推理,导致效率低下。因此需要一种能根据实例难度动态调整推理深度的自适应推理机制。 Method: 将自适应推理建模为效用最大化问题,提出CODA方法:利用组内rollout估计实例难度,并通过两个非负门控机制调节基于长度的奖励塑形项,分别抑制简单样本的冗余推理和鼓励困难样本的深入推理。 Result: CODA在不同模型规模和基准测试中均实现自适应推理,无需外部标注或用户设定预算;在简单任务上token成本降低超60%且精度保持良好,在困难任务上提升推理深度以最大化性能。 Conclusion: CODA提供了一种无需额外监督、自动适配任务难度的高效推理策略,平衡了准确性与计算成本。 Abstract: The emergence of large reasoning models demonstrates that scaling inference-time compute significantly enhances performance on complex tasks. However, it often falls into another trap: overthinking simple problems, where repetitive rationales yield minimal accuracy gains at a disproportionately high cost. This motivates adaptive reasoning: dynamically aligning reasoning depth with instance difficulty. In this paper, we study adaptive reasoning from an optimality perspective, formalizing it as a utility maximization problem where tokens are allocated until the marginal accuracy gain falls below the incremental cost. Based on this, we propose CODA (Compute Allocation by Difficulty Awareness), a method that operationalizes this principle by allocating tokens via a policy-internal difficulty signal. Specifically, CODA estimates difficulty via group-based rollouts and maps it to two non-negative gates that modulate a length-dependent shaping term on top of the binary base reward. The easy-side gate penalizes verbosity on simple instances, whereas the hard-side gate encourages more deliberative rollouts on challenging ones. Across model scales and benchmarks, CODA achieves adaptive reasoning without external annotations or user-provided budgets: on easy tasks, CODA reduces token costs by over 60% while maintaining strong accuracy, whereas on hard tasks it incentivizes more deliberative rollouts to maximize performance.cs.CV [Back]
[91] Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models
Ci Zhang,Zhaojun Ding,Chence Yang,Jun Liu,Xiaoming Zhai,Shaoyi Huang,Beiwen Li,Xiaolong Ma,Jin Lu,Geng Yuan
Main category: cs.CV
TL;DR: 本文揭示了基于剪枝的扩散模型概念遗忘方法存在安全隐患:被剪枝权重的位置可能成为泄露已删除概念信息的侧信道;作者提出一种无需数据和训练的攻击框架,可成功恢复被删除的概念,并探讨了兼顾安全性与有效性的防御策略。
Details
Motivation: 尽管基于剪枝的遗忘方法具有高效、免训练、免数据等优势,但其安全性尚未被深入检验,作者旨在揭示该范式潜在的隐私泄露风险。 Method: 设计了一种全新的、完全免数据和免训练的攻击框架,利用剪枝位置作为侧信道信号来恢复被删除的概念;同时探索隐藏剪枝位置等防御机制。 Result: 实验证明,只要识别出与目标概念相关的关键权重,无论其如何被置零或扰动,均可有效恢复原始概念;剪枝位置本身即构成可被利用的信息泄露通道。 Conclusion: 基于剪枝的遗忘并非本质安全;需在设计中隐匿剪枝位置以提升安全性,同时保持遗忘效果,为构建更安全的剪枝类遗忘框架提供实践指导。 Abstract: Pruning-based unlearning has recently emerged as a fast, training-free, and data-independent approach to remove undesired concepts from diffusion models. It promises high efficiency and robustness, offering an attractive alternative to traditional fine-tuning or editing-based unlearning. However, in this paper we uncover a hidden danger behind this promising paradigm. We find that the locations of pruned weights, typically set to zero during unlearning, can act as side-channel signals that leak critical information about the erased concepts. To verify this vulnerability, we design a novel attack framework capable of reviving erased concepts from pruned diffusion models in a fully data-free and training-free manner. Our experiments confirm that pruning-based unlearning is not inherently secure, as erased concepts can be effectively revived without any additional data or retraining. Extensive experiments on diffusion-based unlearning based on concept related weights lead to the conclusion: once the critical concept-related weights in diffusion models are identified, our method can effectively recover the original concept regardless of how the weights are manipulated. Finally, we explore potential defense strategies and advocate safer pruning mechanisms that conceal pruning locations while preserving unlearning effectiveness, providing practical insights for designing more secure pruning-based unlearning frameworks.[92] ObjChangeVR: Object State Change Reasoning from Continuous Egocentric Views in VR Environments
Shiyi Ding,Shaoen Wu,Ying Chen
Main category: cs.CV
TL;DR: 本文提出ObjChangeVR框架和ObjChangeVR-Dataset数据集,用于解决虚拟现实中基于自然语言的场景变化查询问题,特别是检测无直接交互背景下的物体状态变化。
Details
Motivation: 现有MLLMs在物体状态理解上主要针对用户交互的自我中心视频,难以检测无直接交互且缺乏明显运动线索的背景物体状态变化;同时缺乏评估该场景的基准数据集。 Method: 构建专门用于物体状态变化问答任务的ObjChangeVR-Dataset;提出ObjChangeVR框架,结合视角感知与时间检索定位相关帧,并通过跨视角推理整合多视角不一致证据。 Result: 实验表明,ObjChangeVR在多个MLLM上显著优于基线方法。 Conclusion: ObjChangeVR有效提升了VR中无交互背景下物体状态变化的理解能力,所建数据集填补了该任务评估基准的空白。 Abstract: Recent advances in multimodal large language models (MLLMs) offer a promising approach for natural language-based scene change queries in virtual reality (VR). Prior work on applying MLLMs for object state understanding has focused on egocentric videos that capture the camera wearer's interactions with objects. However, object state changes may occur in the background without direct user interaction, lacking explicit motion cues and making them difficult to detect. Moreover, no benchmark exists for evaluating this challenging scenario. To address these challenges, we introduce ObjChangeVR-Dataset, specifically for benchmarking the question-answering task of object state change. We also propose ObjChangeVR, a framework that combines viewpoint-aware and temporal-based retrieval to identify relevant frames, along with cross-view reasoning that reconciles inconsistent evidence from multiple viewpoints. Extensive experiments demonstrate that ObjChangeVR significantly outperforms baseline approaches across multiple MLLMs.[93] Margin-Consistent Deep Subtyping of Invasive Lung Adenocarcinoma via Perturbation Fidelity in Whole-Slide Image Analysis
Meghdad Sabouri Rad,Junze,Huang,Mohammad Mehdi Hosseini,Rakesh Choudhary,Saverio J. Carello,Ola El-Zammar,Michel R. Nasr,Bardia Rodd
Main category: cs.CV
TL;DR: 本文提出了一种面向肺腺癌亚型分类的全切片图像分析框架,通过注意力加权聚合与边界感知训练提升模型对成像扰动的鲁棒性,并引入扰动保真度(PF)评分机制以平衡对比学习带来的过聚类问题,在多个指标上显著优于基线。
Details
Motivation: 全切片图像分类在肺腺癌亚型判别中易受真实世界成像扰动影响,尤其在决策边界处可靠性不足,亟需提升模型鲁棒性与细粒度形态区分能力。 Method: 提出边界一致性框架,结合注意力加权补丁聚合与边界感知训练;引入扰动保真度(PF)评分,通过贝叶斯优化施加结构化扰动以缓解对比正则化导致的过聚类;在BMIRDS-LUAD数据集(143张WSI、203,226个补丁、5类亚型)上验证。 Result: ViT-Large达95.20±4.65%准确率(较基线92.00±5.36%降低40%错误率);ResNet101+Attention达95.89±5.37%(较基线91.73±9.23%降低50%错误率);所有亚型AUC均>0.99;在外部WSSS4LUAD数据集上ResNet50+Attention达80.1%准确率,体现跨机构泛化能力。 Conclusion: 所提框架有效提升了肺腺癌亚型分类模型在真实场景下的鲁棒性与泛化性,PF评分机制为平衡特征分离与形态保真提供了新思路,为后续域自适应研究奠定基础。 Abstract: Whole-slide image classification for invasive lung adenocarcinoma subtyping remains vulnerable to real-world imaging perturbations that undermine model reliability at the decision boundary. We propose a margin consistency framework evaluated on 203,226 patches from 143 whole-slide images spanning five adenocarcinoma subtypes in the BMIRDS-LUAD dataset. By combining attention-weighted patch aggregation with margin-aware training, our approach achieves robust feature-logit space alignment measured by Kendall correlations of 0.88 during training and 0.64 during validation. Contrastive regularization, while effective at improving class separation, tends to over-cluster features and suppress fine-grained morphological variation; to counteract this, we introduce Perturbation Fidelity (PF) scoring, which imposes structured perturbations through Bayesian-optimized parameters. Vision Transformer-Large achieves 95.20 +/- 4.65% accuracy, representing a 40% error reduction from the 92.00 +/- 5.36% baseline, while ResNet101 with an attention mechanism reaches 95.89 +/- 5.37% from 91.73 +/- 9.23%, a 50% error reduction. All five subtypes exceed an area under the receiver operating characteristic curve (AUC) of 0.99. On the WSSS4LUAD external benchmark, ResNet50 with an attention mechanism attains 80.1% accuracy, demonstrating cross-institutional generalizability despite approximately 15-20% domain-shift-related degradation and identifying opportunities for future adaptation research.[94] PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment
Yantao Li,Qiang Hui,Chenyang Yan,Kanzhi Cheng,Fang Zhao,Chao Tan,Huanling Gao,Jianbing Zhang,Kai Wang,Xinyu Dai,Shiguo Lian
Main category: cs.CV
TL;DR: PaLMR 是一种新框架,通过在数据层和优化层分别对齐感知与推理过程,减少多模态大模型中的视觉推理幻觉,提升推理保真度与可靠性。
Details
Motivation: 现有强化学习奖励设计只关注最终答案正确性,容忍推理过程中对视觉证据的误感知(即过程幻觉),导致多模态大语言模型(MLLMs)推理不可靠、难解释。 Method: 提出 PaLMR 框架:1)感知对齐的数据层,构建含结构化伪真值与可验证视觉事实的过程感知推理数据;2)过程对齐的优化层,设计分层奖励融合机制与过程感知评分函数,鼓励视觉忠实的思维链并增强训练稳定性。 Result: 在 Qwen2.5-VL-7B 上实验表明,PaLMR 显著降低推理幻觉,在 HallusionBench 上达 SOTA,同时在 MMMU、MathVista 和 MathVerse 上保持强性能。 Conclusion: PaLMR 为实现过程对齐的多模态推理提供了原理清晰且实用的路径,提升了 MLLMs 的可靠性与可解释性。 Abstract: Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.[95] A Parameter-efficient Convolutional Approach for Weed Detection in Multispectral Aerial Imagery
Leo Thomas Ramos,Angel D. Sappa
Main category: cs.CV
TL;DR: FCBNet是一种高效的杂草分割模型,采用冻结的ConvNeXt主干、新提出的特征校正模块(FCB)和轻量解码器,在多个数据集上以高mIoU(>85%)和极低训练耗时(0.06–0.2小时)及参数量(减少超90%)实现优越性能。
Details
Motivation: 针对杂草分割任务中模型精度与计算效率难以兼顾的问题,尤其是资源受限场景下的部署需求,提出一种高效、轻量且高性能的分割架构。 Method: 提出FCBNet:1)使用完全冻结的ConvNeXt作为主干网络;2)设计轻量高效的Feature Correction Block(FCB),利用高效卷积进行特征精修;3)搭配轻量解码器。在WeedBananaCOD和WeedMap数据集上,分别测试RGB与多光谱模态。 Result: 在WeedBananaCOD和WeedMap上mIoU均超过85%,显著优于U-Net、DeepLabV3+、SK-U-Net、SegFormer和WeedSense;训练时间仅需0.06–0.2小时;可训练参数减少超90%,大幅降低内存占用。 Conclusion: FCBNet验证了冻结主干+轻量特征校正的设计范式在农业图像分割中的有效性,兼顾高精度、高效率与低资源消耗,适合边缘部署。 Abstract: We introduce FCBNet, an efficient model designed for weed segmentation. The architecture is based on a fully frozen ConvNeXt backbone, the proposed Feature Correction Block (FCB), which leverages efficient convolutions for feature refinement, and a lightweight decoder. FCBNet is evaluated on the WeedBananaCOD and WeedMap datasets under both RGB and multispectral modalities, showing that FCBNet outperforms models such as U-Net, DeepLabV3+, SK-U-Net, SegFormer, and WeedSense in terms of mIoU, exceeding 85%, while also achieving superior computational efficiency, requiring only 0.06 to 0.2 hours for training. Furthermore, the frozen backbone strategy reduces the number of trainable parameters by more than 90%, significantly lowering memory requirements.[96] GameVerse: Can Vision-Language Models Learn from Video-based Reflection?
Kuan Zhang,Dongchen Liu,Qiyue Zhao,Jinkun Hou,Xinran Zhang,Qinlei Xie,Miao Liu,Yiming Li
Main category: cs.CV
TL;DR: 本文提出GameVerse,一个支持视觉反思循环的视频游戏基准,通过'反思-重试'范式评估视觉语言模型(VLMs)如何从失败轨迹和专家教程视频中学习,无需训练即可提升策略性能。
Details
Motivation: 人类玩家通过观察、反思失败和观看教程来提升游戏策略;本文旨在探索VLMs是否也能通过视频进行类似视觉反思学习。 Method: 构建GameVerse基准,包含15种流行游戏的认知分层分类体系、语义与GUI双动作空间、里程碑式评估机制,并采用‘反思-重试’范式评估VLMs在视频反馈下的策略改进能力。 Result: 实验表明VLMs能从视频反思中受益,尤其在融合失败轨迹与专家教程时表现最优,实现了无需训练的类强化学习+监督微调效果。 Conclusion: 视频驱动的视觉反思是一种有效提升VLM游戏策略能力的新范式,为VLM具身学习提供了可扩展、系统化的评估与提升框架。 Abstract: Human gameplay is a visually grounded interaction loop in which players act, reflect on failures, and watch tutorials to refine strategies. Can Vision-Language Models (VLMs) also learn from video-based reflection? We present GameVerse, a comprehensive video game benchmark that enables a reflective visual interaction loop. Moving beyond traditional fire-and-forget evaluations, it uses a novel reflect-and-retry paradigm to assess how VLMs internalize visual experience and improve policies. To facilitate systematic and scalable evaluation, we also introduce a cognitive hierarchical taxonomy spanning 15 globally popular games, dual action space for both semantic and GUI control, and milestone evaluation using advanced VLMs to quantify progress. Our experiments show that VLMs benefit from video-based reflection in varied settings, and perform best by combining failure trajectories and expert tutorials-a training-free analogue to reinforcement learning (RL) plus supervised fine-tuning (SFT).[97] ASMIL: Attention-Stabilized Multiple Instance Learning for Whole Slide Imaging
Linfeng Ye,Shayan Mohajer Hamidi,Zhixiang Chi,Guang Li,Mert Pilanci,Takahiro Ogawa,Miki Haseyama,Konstantinos N. Plataniotis
Main category: cs.CV
TL;DR: 本文提出ASMIL框架,通过锚定模型稳定注意力动态、归一化Sigmoid函数缓解注意力过度集中、随机丢弃token减少过拟合,显著提升WSI诊断性能。
Details
Motivation: 现有基于注意力的多实例学习方法在全切片图像诊断中存在注意力分布不稳定、过拟合和注意力过度集中三大问题。 Method: 提出ASMIL框架:引入锚定模型稳定注意力、用归一化Sigmoid替代softmax防止注意力过度集中、采用token随机丢弃缓解过拟合。 Result: 在多个数据集上F1分数最高提升6.49%;将ASMIL组件嵌入现有方法后F1最高提升10.73%。 Conclusion: ASMIL有效解决注意力不稳定等三大挑战,为WSI诊断提供更鲁棒、可解释的多实例学习新范式。 Abstract: Attention-based multiple instance learning (MIL) has emerged as a powerful framework for whole slide image (WSI) diagnosis, leveraging attention to aggregate instance-level features into bag-level predictions. Despite this success, we find that such methods exhibit a new failure mode: unstable attention dynamics. Across four representative attention-based MIL methods and two public WSI datasets, we observe that attention distributions oscillate across epochs rather than converging to a consistent pattern, degrading performance. This instability adds to two previously reported challenges: overfitting and over-concentrated attention distribution. To simultaneously overcome these three limitations, we introduce attention-stabilized multiple instance learning (ASMIL), a novel unified framework. ASMIL uses an anchor model to stabilize attention, replaces softmax with a normalized sigmoid function in the anchor to prevent over-concentration, and applies token random dropping to mitigate overfitting. Extensive experiments demonstrate that ASMIL achieves up to a 6.49\% F1 score improvement over state-of-the-art methods. Moreover, integrating the anchor model and normalized sigmoid into existing attention-based MIL methods consistently boosts their performance, with F1 score gains up to 10.73\%. All code and data are publicly available at https://github.com/Linfeng-Ye/ASMIL.[98] EnsAug: Augmentation-Driven Ensembles for Human Motion Sequence Analysis
Bikram De,Habib Irani,Vangelis Metsis
Main category: cs.CV
TL;DR: 本文提出了一种名为EnsAug的新训练范式,通过构建由多个‘专家模型’组成的集成,每个专家仅使用一种特定的几何变换进行数据增强训练,从而提升人体运动识别(如手语和人体活动识别)的性能。该方法在多个基准上超越了传统单模型+混合增强的方法,并达到了SOTA效果。
Details
Motivation: 现有通用数据增强方法忽略人体运动的几何与运动学约束,易生成不真实动作;且传统单模型+混合增强方式未能充分利用不同增强类型提供的独特学习信号。 Method: 提出EnsAug范式:训练一个模型集成,其中每个模型为‘专家’,仅在原始数据集经单一特定几何变换增强后的子集上训练,从而促进模型多样性。 Result: 在手语识别和人体活动识别多个基准上显著优于传统单模型混合增强方法,并在两个手语数据集和一个人体活动识别数据集上达到SOTA精度,同时具备更高模块性与效率。 Conclusion: 实证验证了基于增强策略构建多样化集成的有效性,为骨骼运动分析中的数据增强提供了新基线。 Abstract: Data augmentation is a crucial technique for training robust deep learning models for human motion, where annotated datasets are often scarce. However, generic augmentation methods often ignore the underlying geometric and kinematic constraints of the human body, risking the generation of unrealistic motion patterns that can degrade model performance. Furthermore, the conventional approach of training a single generalist model on a dataset expanded with a mixture of all available transformations does not fully exploit the unique learning signals provided by each distinct augmentation type. We challenge this convention by introducing a novel training paradigm, EnsAug, that strategically uses augmentation to foster model diversity within an ensemble. Our method involves training an ensemble of specialists, where each model learns from the original dataset augmented by only a single, distinct geometric transformation. Experiments on sign language and human activity recognition benchmarks demonstrate that our diversified ensemble methodology significantly outperforms the standard practice of training one model on a combined augmented dataset and achieves state-of-the-art accuracy on two sign language and one human activity recognition dataset while offering greater modularity and efficiency. Our primary contribution is the empirical validation of this training strategy, establishing an effective baseline for leveraging data augmentation in skeletal motion analysis.[99] HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding
Toan Nguyen,Yang Liu,Celso De Melo,Flora D. Salim
Main category: cs.CV
TL;DR: 本文提出HyperTokens,一种基于Transformer的动态令牌生成器,用于持续视频问答(VideoQA)任务,通过元启发式正则化和轻量级多模态辅助监督,缓解任务间干扰与灾难性遗忘问题,并在多个基准上实现更高准确率与更低遗忘率。
Details
Motivation: 持续视频问答(Continual VideoQA)面临任务间干扰和存储任务特定提示成本过高的挑战。 Method: 提出HyperTokens——一个基于Transformer的动态令牌生成器,按需生成微调令牌;引入元启发式正则化抑制遗忘,使其避免任务特定尖锐优化方向并锚定先前任务;结合轻量级多模态辅助监督与因果视角设计的互信息代理损失,约束反因果跨模态方向。 Result: 在两个标准持续VideoQA基准上,HyperTokens显著提升平均准确率并大幅降低遗忘;在新提出的跨模态ImageQA→VideoQA持续迁移协议中也展现出强鲁棒性。 Conclusion: HyperTokens通过显式可控的提示更新、理论驱动的正则化及跨模态协同学习,有效解决了持续VideoQA中的干扰与遗忘问题,为多模态持续学习提供了新范式。 Abstract: Continual VideoQA with multimodal LLMs is hindered by interference between tasks and the prohibitive cost of storing task-specific prompts. We introduce HyperTokens, a transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates while keeping memory fixed. To suppress forgetting, we propose meta-inspired regularisers that look ahead to avoid task-specific sharp directions and anchor the evolving generator to prior tasks. We further connect our objective to sharpness-aware optimisation, providing insight into why it encourages flatter cross-task minima and improves retention. Beyond regularisation, HyperTokens exploits lightweight auxiliary multimodal supervision through shared generation weights; guided by a causal perspective, we design feasible objectives and surrogate mutual-information losses to regularise anti-causal cross-modal directions. Across two standard continual VideoQA benchmarks, HyperTokens achieves higher average accuracy with substantially lower forgetting. Finally, we introduce a challenging cross-modal ImageQA->VideoQA protocol and show that HyperTokens enables robust continual transfer in this setting.[100] Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting
Giacomo Frisoni,Lorenzo Molfetta,Mattia Buzzoni,Gianluca Moro
Main category: cs.CV
TL;DR: 本文提出Graph-of-Mark(GoM),一种基于场景图的像素级视觉提示方法,用于增强多模态语言模型的空间推理能力;相比仅标记孤立物体的Set-of-Mark等方法,GoM显式建模对象间关系,在多个模型与数据集上实现最高11个百分点的零样本性能提升。
Details
Motivation: 现有训练免费视觉提示方法(如Set-of-Mark)仅对图像中物体进行孤立标记,无法建模物体间关系,限制了多模态语言模型在空间推理任务中的表现。 Method: 提出Graph-of-Mark(GoM),将场景图以像素级方式叠加到输入图像上,使多模态语言模型能直接感知对象间的空间与语义关系;并在文本提示中引入辅助图描述进行协同优化。 Result: GoM在3个开源多模态语言模型和4个数据集上验证有效,显著提升零样本视觉问答与定位任务性能,最高提升达11个百分点;消融实验验证了场景图结构与辅助文本描述的关键作用。 Conclusion: GoM是首个支持像素级、关系感知的训练免费视觉提示方法,证明显式建模物体关系可有效增强多模态语言模型的空间理解与零样本泛化能力。 Abstract: Recent advances in training-free visual prompting, such as Set-of-Mark, have emerged as a promising direction for enhancing the grounding capabilities of multimodal language models (MLMs). These techniques operate by partitioning the input image into object regions and annotating them with marks, predominantly boxes with numeric identifiers, before feeding the augmented image to the MLM. However, these approaches treat marked objects as isolated entities, failing to capture the relationships between them. On these premises, we propose Graph-of-Mark (GoM), the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. We evaluate GoM across 3 open-source MLMs and 4 different datasets, conducting extensive ablations on drawn components and investigating the impact of auxiliary graph descriptions in the text prompt. Our results demonstrate that GoM consistently improves the zero-shot capability of MLMs in interpreting object positions and relative directions, improving base accuracy in visual question answering and localization up to 11 percentage points.[101] Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index
Chao Yuan,Pan Li
Main category: cs.CV
TL;DR: 本文提出了一种面向因果自回归视频生成模型的系统级推理优化方法,通过序列并行、Causal-RoPE SP、算子融合与RoPE预计算等技术,在保持生成质量的同时显著降低首帧延迟并提升推理速度。
Details
Motivation: DiT类视频生成模型因全时空注意力机制导致内存消耗爆炸(O(N²))和首帧延迟高,难以支持长视频合成与实时推理。 Method: 将Self-Forcing因果自回归框架适配至序列并行推理,设计序列并行版本的因果旋转位置编码(Causal-RoPE SP),并结合算子融合与RoPE预计算优化计算与通信。 Result: 在8卡A800集群上实现近实时推理:生成5秒480P视频加速1.58倍,首帧延迟低于1秒,生成质量相当。 Conclusion: 所提系统级优化有效缓解了DiT视频模型在长视频与实时交互场景下的性能瓶颈,为实际应用提供了可行支撑。 Abstract: Diffusion Transformer (DiT)-based video generation models inherently suffer from bottlenecks in long video synthesis and real-time inference, which can be attributed to the use of full spatiotemporal attention. Specifically, this mechanism leads to explosive O(N^2) memory consumption and high first-frame latency. To address these issues, we implement system-level inference optimizations for a causal autoregressive video generation pipeline. We adapt the Self-Forcing causal autoregressive framework to sequence parallel inference and implement a sequence-parallel variant of the causal rotary position embedding which we refer to as Causal-RoPE SP. This adaptation enables localized computation and reduces cross-rank communication in sequence parallel execution. In addition, computation and communication pipelines are optimized through operator fusion and RoPE precomputation. Experiments conducted on an eight GPU A800 cluster show that the optimized system achieves comparable generation quality, sub-second first-frame latency, and near real-time inference speed. For generating five second 480P videos, a 1.58x speedup is achieved, thereby providing effective support for real-time interactive applications.[102] Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
Yuan Wu,Zongxian Yang,Jiayu Qian,Songpan Gao,Guanxing Chen,Qiankun Li,Yu-An Huang,Zhi-An Huang
Main category: cs.CV
TL;DR: 本文发现链式思维(CoT)提示在医学视觉问答任务中常不如直接回答(DirA),归因于‘医学感知瓶颈’;为此提出两种无需训练的推理时接地干预方法——感知锚定与描述接地,显著提升性能并逆转CoT劣势。
Details
Motivation: 探究链式思维(CoT)提示在医学视觉语言任务中的有效性,因其在通用领域有效但在医学领域表现尚不明确。 Method: 提出两种训练无关的推理时视觉接地干预方法:(i)基于感兴趣区域(ROI)的感知锚定;(ii)基于高质量文本描述的描述接地,并在多个医学VQA基准和模型上验证。 Result: 所提方法显著提升准确率,缓解CoT性能下降,在多个设置下成功逆转CoT劣于DirA的现象。 Conclusion: 可靠的临床视觉语言模型不仅需扩展文本推理链,更依赖稳健的视觉接地与跨模态对齐。 Abstract: Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a \emph{medical perception bottleneck}: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) \emph{perception anchoring} via region-of-interest cues and (ii) \emph{description grounding} via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT--DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available \href{https://github.com/TianYin123/Better_Eyes_Better_Thoughts}{here}.[103] SJD-PV: Speculative Jacobi Decoding with Phrase Verification for Autoregressive Image Generation
Zhehao Yu,Baoquan Zhang,Bingqi Shan,Xinhao Liu,Dongliang Zhou,Guotao Liang,Guangming Ye,Yunming Ye
Main category: cs.CV
TL;DR: 本文提出了一种无需训练的短语级推测验证框架,用于加速自回归图像生成模型的推理过程,通过联合验证相邻视觉词元提升效率并保持生成质量。
Details
Motivation: 自回归图像模型虽生成能力强,但推理延迟高;现有无训练加速方法独立验证词元,忽略相邻词元间的强共现模式,导致上下文不一致和解码效率受限。 Method: 分析训练语料中词元共现统计,将高频共现词元聚类为语义连贯的视觉短语;在推理中对每个短语计算聚合似然比,实现多词元联合验证。 Result: 在文本到图像生成任务上显著减少函数评估次数(NFE),解码速度最高提升30%,且不损害视觉保真度。 Conclusion: 建模短程词元共现是加速自回归推理的一种有效且通用的原则。 Abstract: Autoregressive (AR) image models have recently demonstrated remarkable generative capability, but their sequential nature results in significant inference latency. Existing training-free acceleration methods typically verify tokens independently, overlooking the strong co-occurrence patterns between adjacent visual tokens. This independence assumption often leads to contextual inconsistency and limits decoding efficiency. In this work, we introduce a novel training-free acceleration framework that performs phrase-level speculative verification, enabling the model to jointly validate multiple correlated tokens within each decoding window. To construct such phrase units, we analyze token co-occurrence statistics from the training corpus and group frequently co-occurring tokens into semantically coherent visual phrases. During inference, the proposed phrase-level verification evaluates aggregated likelihood ratios over each phrase, allowing simultaneous acceptance of multiple tokens while preserving generation quality. Extensive experiments on autoregressive text-to-image generation show that our method significantly reduces the number of function evaluations (NFE) and achieves up to 30% faster decoding without compromising visual fidelity. Our findings reveal that modeling short-range token co-occurrence provides an effective and general principle for accelerating autoregressive inference.[104] calibfusion: Transformer-Based Differentiable Calibration for Radar-Camera Fusion Detection in Water-Surface Environments
Yuting Wan,Liguo Sun,Jiuwu Hao,Pin LV
Main category: cs.CV
TL;DR: 本文提出CalibFusion,一种校准条件化的毫米波雷达-相机融合检测器,通过端到端学习隐式外参优化来提升恶劣水表面环境下的2D检测鲁棒性。
Details
Motivation: 现有雷达-相机标定方法在缺乏纹理、目标稀疏且存在波浪/镜面杂波的水面场景中失效,导致跨模态感知性能下降。 Method: 构建多帧感知的雷达密度表征(含强度加权与多普勒引导杂波抑制),设计跨模态Transformer模块预测置信度门控的外参修正,并通过可微投影-溅射算子生成校准条件下的图像平面雷达特征。 Result: 在WaterScenes和FLOW数据集上显著提升融合2D检测精度与抗标定误差鲁棒性;在nuScenes上验证了方法的跨场景泛化能力。 Conclusion: CalibFusion无需显式标定约束,能自适应优化外参,有效提升复杂水面环境及通用场景下的雷达-相机融合感知性能。 Abstract: Millimeter-wave (mmWave) Radar--Camera fusion improves perception under adverse illumination and weather, but its performance is sensitive to Radar--Camera extrinsic calibration: residual misalignment biases Radar-to-image projection and degrades cross-modal aggregation for downstream 2D detection. Existing calibration and auto-calibration methods are mainly developed for road and urban scenes with abundant structures and object constraints, whereas water-surface environments feature large textureless regions, sparse and intermittent targets, and wave-/specular-induced Radar clutter, which weakens explicit object-centric matching. We propose CalibFusion, a calibration-conditioned Radar--Camera fusion detector that learns implicit extrinsic refinement end-to-end with the detection objective. CalibFusion builds a multi-frame persistence-aware Radar density representation with intensity weighting and Doppler-guided suppression of fast-varying clutter. A cross-modal transformer interaction module predicts a confidence-gated refinement of the initial extrinsics, which is integrated through a differentiable projection-and-splatting operator to generate calibration-conditioned image-plane Radar features. Experiments on WaterScenes and FLOW show improved fusion-based 2D detection and robustness under synthetic miscalibration, supported by sensitivity analyses and qualitative Radar-to-image overlays. Results on nuScenes indicate that the refinement mechanism transfers beyond water-surface scenarios.[105] Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study
Yixiao Jing,Chaoyu Zhang,Zixuan Zhong,Peizhou Huang
Main category: cs.CV
TL;DR: 本文探讨了语义噪声初始化在文本到视频(T2V)扩散模型中的效果,发现其对时序相关指标仅有微弱正向趋势,统计上不显著,整体性能与高斯噪声基线相当;作者建议在T2V中采用提示级配对评估和噪声空间诊断作为标准实践。
Details
Motivation: 语义噪声初始化在图像扩散模型中已被证明可提升鲁棒性和可控性,但其在文本到视频生成中的有效性尚不明确,因时序耦合可能引入额外自由度和不稳定性。 Method: 在冻结的VideoCrafter风格T2V扩散主干网络上,使用VBench数据集的100个提示,对比语义噪声初始化与标准高斯噪声;采用提示级配对检验、bootstrap置信区间及符号翻转置换检验进行统计分析,并分析噪声空间中的扰动模式。 Result: 语义噪声初始化在时序相关维度上呈现微弱正向趋势,但95%置信区间包含零(p≈0.17),整体VBench得分与基线无显著差异;噪声空间分析显示扰动信号弱或不稳定。 Conclusion: 语义噪声初始化在当前T2V设置下未带来实质性增益;作者强调提示级配对评估和噪声空间诊断应成为T2V初始化方案研究的标准方法。 Abstract: Semantic noise initialization has been reported to improve robustness and controllability in image diffusion models. Whether these gains transfer to text-to-video (T2V) generation remains unclear, since temporal coupling can introduce extra degrees of freedom and instability. We benchmark semantic noise initialization against standard Gaussian noise using a frozen VideoCrafter-style T2V diffusion backbone and VBench on 100 prompts. Using prompt-level paired tests with bootstrap confidence intervals and a sign-flip permutation test, we observe a small positive trend on temporal-related dimensions; however, the 95 percent confidence interval includes zero (p ~ 0.17) and the overall score remains on par with the baseline. To understand this outcome, we analyze the induced perturbations in noise space and find patterns consistent with weak or unstable signal. We recommend prompt-level paired evaluation and noise-space diagnostics as standard practice when studying initialization schemes for T2V diffusion.[106] Unmixing microinfrared spectroscopic images of cross-sections of historical oil paintings
Shivam Pande,Nicolas Nadisic,Francisco Mederos-Henry,Aleksandra Pizurica
Main category: cs.CV
TL;DR: 本文提出了一种用于ATR-μFTIR高光谱图像盲解混的无监督CNN自编码器,通过加权光谱角距离(WSAD)损失函数提升在易受污染光谱区域的可解释性,并在根特祭坛画样本上验证了其有效性。
Details
Motivation: 传统ATR-μFTIR高光谱图像分析依赖人工比对参考谱库,效率低、主观性强、难以规模化;且光谱常为多组分混合,样品异质、多层且老化,解析困难。 Method: 提出一种无监督CNN自编码器,实现盲源分离,估计端元光谱及丰度图;引入基于空间平坦性、邻域一致性与光谱粗糙度鲁棒度量的自动带可靠性权重,构建加权光谱角距离(WSAD)损失函数,降低大气和采集伪影影响。 Result: 该方法在超过1500个波段上表现出对干扰的鲁棒性,显著提升易污染光谱区域的解混可解释性;在根特祭坛画绘画剖面样本上成功实现了端元识别与空间分布重建。 Conclusion: 所提WSAD损失驱动的无监督CNN自编码器为文化遗产中ATR-μFTIR高光谱图像的自动化、客观、可扩展解析提供了新范式,有望推动遗产科学中材料表征的标准化与智能化。 Abstract: Spectroscopic imaging (SI) has become central to heritage science because it enables non-invasive, spatially resolved characterisation of materials in artefacts. In particular, attenuated total reflection Fourier transform infrared microscopy (ATR-$μ$FTIR) is widely used to analyse painting cross-sections, where a spectrum is recorded at each pixel to form a hyperspectral image (HSI). Interpreting these data is difficult: spectra are often mixtures of several species in heterogeneous, multi-layered and degraded samples, and current practice still relies heavily on manual comparison with reference libraries. This workflow is slow, subjective and hard to scale. We propose an unsupervised CNN autoencoder for blind unmixing of ATR-$μ$FTIR HSIs, estimating endmember spectra and their abundance maps while exploiting local spatial structure through patch-based modelling. To reduce sensitivity to atmospheric and acquisition artefacts across $>1500$ bands, we introduce a weighted spectral angle distance (WSAD) loss with automatic band-reliability weights derived from robust measures of spatial flatness, neighbour agreement and spectral roughness. Compared with standard SAD training, WSAD improves interpretability in contamination-prone spectral regions. We demonstrate the method on an ATR-$μ$FTIR cross-section from the Ghent Altarpiece attributed to the Van Eyck brothers.[107] AutoFigure-Edit: Generating Editable Scientific Illustration
Zhen Lin,Qiujie Xie,Minjun Zhu,Shichen Li,Qiyao Sun,Enhao Gu,Yiran Ding,Ke Sun,Fang Guo,Panzhong Lu,Zhiyuan Ning,Yixuan Weng,Yue Zhang
Main category: cs.CV
TL;DR: AutoFigure-Edit 是一个端到端系统,能从长篇科学文本自动生成可编辑、风格可控的高质量科学插图,支持参考图像引导的样式适配和原生 SVG 编辑。
Details
Motivation: 现有自动化系统在可编辑性、风格可控性和效率方面存在局限,难以满足高质量科学插图的需求。 Method: 结合长上下文理解、参考图像引导的风格迁移和原生 SVG 编辑能力,构建端到端生成系统。 Result: 实现了从长文本生成完全可编辑、风格灵活适配的科学插图,提升了生成效率与交互性,并开源代码、视频和交互网站。 Conclusion: AutoFigure-Edit 推进了科学可视化自动化的发展,为科研人员提供了高效、可控、可编辑的插图生成新范式。 Abstract: High-quality scientific illustrations are essential for communicating complex scientific and technical concepts, yet existing automated systems remain limited in editability, stylistic controllability, and efficiency. We present AutoFigure-Edit, an end-to-end system that generates fully editable scientific illustrations from long-form scientific text while enabling flexible style adaptation through user-provided reference images. By combining long-context understanding, reference-guided styling, and native SVG editing, it enables efficient creation and refinement of high-quality scientific illustrations. To facilitate further progress in this field, we release the video at https://youtu.be/10IH8SyJjAQ, full codebase at https://github.com/ResearAI/AutoFigure-Edit and provide a website for easy access and interactive use at https://deepscientist.cc/.[108] XAI and Few-shot-based Hybrid Classification Model for Plant Leaf Disease Prognosis
Diana Susan Joseph,Pranav M Pawar,Raja Muthalagu,Mithun Mukharjee
Main category: cs.CV
TL;DR: 本文提出了一种结合可解释人工智能(XAI)与少样本学习(FSL)的混合模型,用于在标注数据稀缺条件下准确识别和分类玉米、水稻和小麦叶片病害及其发展阶段;模型融合Siamese网络与原型网络,并采用Grad-CAM提升可解释性,实验显示各项指标普遍超过92%。
Details
Motivation: 解决农业中作物病害识别面临的标注数据稀缺问题,同时提升模型的可解释性与可信度,以支持实际农田监测应用。 Method: 构建融合Siamese网络与Prototypical Network的少样本学习框架,采用episode训练范式;引入Grad-CAM进行决策区域可视化,实现模型可解释性。 Result: 在自建少样本病害数据集上,模型在各类病害阶段识别任务中准确率、精确率、召回率和F1分数均稳定超过92%,且显著优于基线FSL模型。 Conclusion: 该XAI-FSL混合框架兼顾高性能与高可解释性,为数据受限的真实农业病害监测提供了可行、可信的解决方案。 Abstract: Performing a timely and accurate identification of crop diseases is vital to maintain agricultural productivity and food security. The current work presents a hybrid few-shot learning model that integrates Explainable Artificial Intelligence (XAI) and Few-Shot Learning (FSL) to address the challenge of identifying and classifying the stages of disease of the diseases of maize, rice, and wheat leaves under limited annotated data conditions. The proposed model integrates Siamese and Prototypical Networks within an episodic training paradigm to effectively learn discriminative disease features from a few examples. To ensure model transparency and trustworthiness, Gradient-weighted Class Activation Mapping (Grad-CAM) is employed for visualizing key decision regions in the leaf images, offering interpretable insights into the classification process. Experimental evaluations on custom few-shot datasets developed in the study prove that the model consistently achieves high accuracy, precision, recall, and F1-scores, frequently exceeding 92% across various disease stages. Comparative analyses against baseline FSL models further confirm the superior performance and explainability of the proposed approach. The framework offers a promising solution for real-world, data-constrained agricultural disease monitoring applications.[109] Chart Deep Research in LVLMs via Parallel Relative Policy Optimization
Jiajin Tang,Gaoyang,Wenjie Wang,Sibei Yang,Xing Chen
Main category: cs.CV
TL;DR: 本文提出PRPO方法解决图表深度研究中的多维奖励信号干扰和异构数据梯度冲突问题,并构建MCDR-Bench基准以客观评估深度分析能力,形成训练与评估协同的统一框架。
Details
Motivation: 现有图表数据智能在深度研究能力(如复杂推理与高层数据分析)方面存在明显不足,受限于训练中多维奖励干扰与异构数据梯度冲突,以及评估中仅限于事实检索和基础计算,缺乏对端到端分析推理的量化评估。 Method: 提出PRPO(并行奖励优化与能力分区训练方法)以解耦异构数据与多维奖励信号;构建基于“错误唯一性原则”的MCDR-Bench基准,通过可控错误注入将主观生成评估转化为客观错误识别。 Result: 实验验证PRPO与MCDR-Bench联合构成统一框架,显著提升图表深度研究能力,在协同训练稳定性和深度分析能力客观评估方面取得进展。 Conclusion: PRPO与MCDR-Bench共同推动图表深度研究向系统化、可量化、可评估方向发展,为数据科学中的高阶分析任务提供了新范式。 Abstract: With the rapid advancement of data science, charts have evolved from simple numerical presentation tools to essential instruments for insight discovery and decision-making support. However, current chart data intelligence exhibits significant limitations in deep research capabilities, with existing methods predominantly addressing shallow tasks such as visual recognition or factual question-answering, rather than the complex reasoning and high-level data analysis that deep research requires. This limitation stems from two primary technical bottlenecks: at the training level, existing post-training techniques exhibit deficiencies in handling multi-dimensional reward signal interference and heterogeneous data gradient conflicts, preventing models from achieving balanced development across multiple capability dimensions; at the evaluation level, current methods remain limited to factual retrieval and basic computation, failing to assess end-to-end analytic reasoning and other deep research capabilities. To address the training challenge, we propose PRPO, which performs parallel optimization across reward dimensions and capability partitioning across data types, effectively disentangling conflicts between heterogeneous data and multi-dimensional reward signals while ensuring optimization stability. For the evaluation challenge, we construct MCDR-Bench based on the ``error uniqueness principle," transforming subjective generation assessment into objective error identification through controllable error injection, enabling quantifiable evaluation of deep research capabilities. Experimental validation confirms that the proposed PRPO and MCDR-Bench jointly establish a unified framework that systematically advances chart deep research through enhanced collaborative training and objective evaluation.[110] VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images
Neil Tripathi
Main category: cs.CV
TL;DR: 本文提出了VB基准,用于评估视觉语言模型在判断照片中物体是否可见以及在无法确定时选择放弃回答的能力。该基准通过最小图像和文本编辑设计了100个家族共300个测试项,并引入多种新指标(如CAA、MEFR、SelRank、ToMAcc)来全面衡量模型性能。实验表明GPT-4o和Gemini 3.1 Pro表现最佳,而开源模型Gemma 3 12B也展现出竞争力。
Details
Motivation: 现有不可回答VQA基准未能明确区分‘为何不可回答’,且缺乏对模型判断随证据变化的因果验证;因此需要一个能精准测试模型可见性推理与合理 abstain 能力的新基准。 Method: 构建VB基准:每项含一张图+一个yes/no可见性陈述,要求模型输出VISIBLY_TRUE/VISIBLY_FALSE/ABSTAIN及置信度;采用2×2最小编辑设计(图像/文本各一维),形成100个家族、300个严格XOR评估项;定义并计算CAA、MEFR、SelRank、ToMAcc等新指标。 Result: GPT-4o与Gemini 3.1 Pro以0.728和0.727并列最优;Gemini 2.5 Pro为0.678;最佳开源模型Gemma 3 12B达0.505;6/9模型对文本扰动比图像扰动更鲁棒;GPT-4o与Gemini 2.5 Pro准确率相近但选择性预测质量差异显著。 Conclusion: VB揭示了当前顶尖多模态模型在可见性推理与置信度校准方面仍存在明显局限,尤其在图像扰动鲁棒性和二阶视角推理上;新指标体系可更精细诊断模型行为,推动可信视觉语言理解发展。 Abstract: We present VB, a benchmark that tests whether vision-language models can determine what is and is not visible in a photograph, and abstain when a human viewer cannot reliably answer. Each item pairs a single photo with a short yes/no visibility claim; the model must output VISIBLY_TRUE, VISIBLY_FALSE, or ABSTAIN, together with a confidence score. Items are organized into 100 families using a 2x2 design that crosses a minimal image edit with a minimal text edit, yielding 300 headline evaluation cells. Unlike prior unanswerable-VQA benchmarks, VB tests not only whether a question is unanswerable but why (via reason codes tied to specific visibility factors), and uses controlled minimal edits to verify that model judgments change when and only when the underlying evidence changes. We score models on confidence-aware accuracy with abstention (CAA), minimal-edit flip rate (MEFR), confidence-ranked selective prediction (SelRank), and second-order perspective reasoning (ToMAcc); all headline numbers are computed on the strict XOR subset (three cells per family, 300 scored items per model). We evaluate nine models spanning flagship and prior-generation closed-source systems, and open-source models from 8B to 12B parameters. GPT-4o and Gemini 3.1 Pro effectively tie for the best composite score (0.728 and 0.727), followed by Gemini 2.5 Pro (0.678). The best open-source model, Gemma 3 12B (0.505), surpasses one prior-generation closed-source system. Text-flip robustness exceeds image-flip robustness for six of nine models, and confidence calibration varies substantially: GPT-4o and Gemini 2.5 Pro achieve similar accuracy yet differ sharply in selective prediction quality.[111] RADAR: A Multimodal Benchmark for 3D Image-Based Radiology Report Review
Zhaoyi Sun,Minal Jagtiani,Wen-wai Yim,Fei Xia,Martin Gunn,Meliha Yetisgen,Asma Ben Abacha
Main category: cs.CV
TL;DR: 本文提出了RADAR,一个用于放射科报告差异分析的多模态基准数据集,旨在评估模型在图像-文本对齐和临床推理方面的能力。
Details
Motivation: 放射科报告中存在临床意义的差异,系统性分析这些差异对于质量保证、临床决策支持和多模态模型开发至关重要,但目前缺乏标准化基准。 Method: 构建了RADAR多模态基准数据集,包含3D医学影像、初步报告及对应修改建议,并定义了结构化差异评估任务,要求模型判断图像一致性、临床严重性及编辑类型。 Result: RADAR包含专家标注的腹部CT检查数据,并提供了标准化评估协议,支持多模态模型的系统性比较。 Conclusion: RADAR为评估多模态系统作为放射科报告编辑审阅者的能力提供了一个以临床为导向的测试平台。 Abstract: Radiology reports for the same patient examination may contain clinically meaningful discrepancies arising from interpretation differences, reporting variability, or evolving assessments. Systematic analysis of such discrepancies is important for quality assurance, clinical decision support, and multimodal model development, yet remains limited by the lack of standardized benchmarks. We present RADAR, a multimodal benchmark for radiology report discrepancy analysis that pairs 3D medical images with a preliminary report and corresponding candidate edits for the same study. The dataset reflects a standard clinical workflow in which trainee radiologists author preliminary reports that are subsequently reviewed and revised by attending radiologists. RADAR defines a structured discrepancy assessment task requiring models to evaluate proposed edits by determining image-level agreement, assessing clinical severity, and classifying edit type (correction, addition, or clarification). In contrast to prior work emphasizing binary error detection or comparison against fully independent reference reports, RADAR targets fine-grained clinical reasoning and image-text alignment at the report review stage. The benchmark consists of expert-annotated abdominal CT examinations and is accompanied by standardized evaluation protocols to support systematic comparison of multimodal models. RADAR provides a clinically grounded testbed for evaluating multimodal systems as reviewers of radiology report edits.[112] ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction
Hailong Chu,Shuo Zhang,Yunlong Chu,Shutai Huang,Xingyue Zhang,Tinghe Yan,Jinsong Zhang,Lei Li
Main category: cs.CV
TL;DR: 本文提出ECHO框架,通过多智能体迭代优化多媒体事件超图(MEHG),采用Link-then-Bind策略延迟角色绑定,显著提升多媒体事件抽取性能。
Details
Motivation: 现有M2E2方法依赖端到端线性生成,易因早期跨模态错对齐导致下游角色分配错误传播。 Method: 提出ECHO多智能体框架,构建并迭代更新多媒体事件超图(MEHG);引入Link-then-Bind策略,先识别相关论元再确定其角色,实现延迟承诺与错误抑制。 Result: 在M2E2基准上,ECHO结合Qwen3-32B模型,事件提及和论元角色F1分别提升7.3%和15.5%,显著超越SOTA。 Conclusion: 显式中间结构(MEHG)与分步推理策略(Link-then-Bind)可有效缓解跨模态错对齐与错误传播,提升多媒体事件抽取鲁棒性与精度。 Abstract: Multimedia Event Extraction (M2E2) involves extracting structured event records from both textual and visual content. Existing approaches, ranging from specialized architectures to direct Large Language Model (LLM) prompting, typically rely on a linear, end-to-end generation and thus suffer from cascading errors: early cross-modal misalignments often corrupt downstream role assignment under strict grounding constraints. We propose ECHO (Event-Centric Hypergraph Operations), a multi-agent framework that iteratively refines a shared Multimedia Event Hypergraph (MEHG), which serves as an explicit intermediate structure for multimodal event hypotheses. Unlike dialogue-centric frameworks, ECHO coordinates specialized agents by applying atomic hypergraph operations to the MEHG. Furthermore, we introduce a Link-then-Bind strategy that enforces deferred commitment: agents first identify relevant arguments and only then determine their precise roles, mitigating incorrect grounding and limiting error propagation. Extensive experiments on the M2E2 benchmark show that ECHO significantly outperforms the state-of-the-art (SOTA) : with Qwen3-32B, it achieves a 7.3% and 15.5% improvement in average event mention and argument role F1, respectively.[113] Three-dimensional reconstruction and segmentation of an aggregate stockpile for size and shape analyses
Erol Tutumluer,Haohang Huang,Jiayi Luo,Issam Qamhia,John M. Hart
Main category: cs.CV
TL;DR: 本文提出了一种基于智能手机视频/图像和运动恢复结构(SfM)技术的创新3D成像方法,用于现场快速获取大粒径集料堆的三维尺寸与形状信息,并通过3D分割算法提取单个集料颗粒,以支持道路工程中的现场质量控制(QA/QC)应用。
Details
Motivation: 现有集料成像系统多局限于单个或人工分离颗粒的分析,缺乏便捷、低成本、适用于现场料堆的三维集料信息获取手段。 Method: 采用基于智能手机拍摄的视频/图像,利用Structure-from-Motion(SfM)重建料堆表面三维点云,并结合3D分割算法实现单个集料颗粒的自动分离与提取。 Result: 初步结果验证了该方法可有效获取料堆中大粒径集料的三维尺寸与形状信息,具备用于现场QA/QC的潜力。 Conclusion: 该方法为现场集料质量评估提供了一种低成本、易部署的三维形态分析新途径,有望提升道路工程中集料质量控制的效率与精度。 Abstract: Aggregate size and shape are key properties for determining quality of aggregate materials used in road construction and transportation geotechnics applications. The composition and packing, layer stiffness, and load response are all influenced by these morphological characteristics of aggregates. Many aggregate imaging systems developed to date only focus on analyses of individual or manually separated aggregate particles. There is a need to develop a convenient and affordable system for acquiring 3D aggregate information from stockpiles in the field. This paper presents an innovative 3D imaging approach for potential field evaluation of large-sized aggregates, whereby engineers can perform inspection by taking videos/images with mobile devices such as smartphone cameras. The approach leverages Structure-from-Motion (SfM) techniques to reconstruct the stockpile surface as 3D spatial data, i.e. point cloud, and uses a 3D segmentation algorithm to separate and extract individual aggregates from the reconstructed stockpile. The preliminary results presented in this paper demonstrate the future potential of using 3D aggregate size and shape information for onsite Quality Assurance/Quality Control (QA/QC) tasks.[114] TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings
Azmine Toushik Wasi,Shahriyar Zaman Ridoy,Koushik Ahamed Tonmoy,Kinga Tshering,S. M. Muhtasimul Hasan,Wahid Faisal,Tasnim Mohiuddin,Md Rizwan Parvez
Main category: cs.CV
TL;DR: 本文提出了TimeSpot基准,用于评估视觉语言模型(VLMs)在真实世界地理-时间推理方面的能力,涵盖地理位置与时间属性的联合预测及物理合理性的时空推理任务;实验表明当前SOTA VLMs在此任务上表现较差,尤其在时间推理方面,凸显了对新方法的需求。
Details
Motivation: 现有视觉语言模型在利用视觉线索(如地标、路标)进行图像地理定位方面已有进展,但其对时间信号和物理空间线索的推理能力仍有限,亟需一个能全面评估真实场景下地理-时间联合理解能力的基准。 Method: 构建TimeSpot基准:包含来自80个国家的1455张街景图像,要求模型结构化预测时间属性(季节、月份、一天中的时段、昼夜阶段)和地理属性(大洲、国家、气候带、环境类型、经纬度),并设计空间-时间物理合理性推理任务。 Result: 在TimeSpot上评测主流开源与闭源VLMs,发现整体性能偏低,尤其时间推理准确率显著不足;监督微调虽有提升,但仍远未达实用水平。 Conclusion: TimeSpot揭示了当前VLMs在物理 grounded 的地理-时间联合推理上的根本性局限,为推动具备真实世界时空理解能力的多模态模型发展提供了关键评估工具与研究方向。 Abstract: Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at: https://TimeSpot-GT.github.io.[115] Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
Zhengjian Yao,Yongzhi Li,Xinyuan Gao,Quan Chen,Peng Jiang,Yanye Lu
Main category: cs.CV
TL;DR: Narrative Weaver 是一个新型多模态生成框架,通过结合多模态大语言模型与动态记忆库,实现可控、长程一致的视觉叙事生成,并构建首个电商广告视频分镜数据集 EAVSD。
Details
Motivation: 现有生成模型难以在长序列中保持叙事连贯性和视觉一致性,制约其在影视制作和电商广告等实际场景的应用。 Method: 提出 Narrative Weaver 框架,融合多模态大语言模型(MLLM)进行高层叙事规划,以及带动态记忆库的细粒度控制模块;采用渐进式多阶段训练策略,复用预训练模型;并构建了首个电商广告视频分镜数据集 EAVSD。 Result: 在可控多场景生成、自主叙事和电商广告三类任务上达到 SOTA 性能,且仅需有限训练数据;发布含 33 万张高质量图像的 EAVSD 数据集。 Conclusion: Narrative Weaver 首次实现了多模态生成中控制性、长程一致性和叙事规划能力的统一,为 AI 驱动的内容创作开辟新路径。 Abstract: We present "Narrative Weaver", a novel framework that addresses a fundamental challenge in generative AI: achieving multi-modal controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences - a critical limitation for real-world applications such as filmmaking and e-commerce advertising. Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift. To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD) - the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations. Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method's superiority while opening new possibilities for AI-driven content creation.[116] High-Resolution Image Reconstruction with Unsupervised Learning and Noisy Data Applied to Ion-Beam Dynamics for Particle Accelerators
Francis Osswald,Mohammed Chahbaoui,Xinyi Liang
Main category: cs.CV
TL;DR: 本文提出了一种基于卷积滤波与神经网络的无监督图像重建方法,用于高能物理加速器束流诊断中严重退化图像的鲁棒去噪与高保真重构,显著提升束流晕结构的分辨能力。
Details
Motivation: 现代高能物理加速器对束流晕结构的精确检测需求日益提高,而传统图像分析工具已达到性能瓶颈,亟需更强大的图像重建方法应对严重退化和低信噪比条件。 Method: 结合卷积滤波与神经网络,并采用优化的早停策略以抑制过拟合;构建无需标注数据的无监督学习框架。 Result: 在无训练数据集的前提下,实现了严重噪声下的鲁棒去噪与高保真束流发射度图像重建,可将可测振幅扩展至七倍标准差以上,大幅提升束流晕分辨率。 Conclusion: 所提无监督深度学习方法突破了传统工具的限制,为加速器束流诊断提供了新范式,具备实际部署潜力。 Abstract: Image reconstruction in the presence of severe degradation remains a challenging inverse problem, particularly in beam diagnostics for high-energy physics accelerators. As modern facilities demand precise detection of beam halo structures to control losses, traditional analysis tools have reached their performance limits. This work reviews existing image-processing techniques for data cleaning, contour extraction, and emittance reconstruction, and introduces a novel approach based on convolutional filtering and neural networks with optimized early-stopping strategies in order to control overfitting. Despite the absence of training datasets, the proposed unsupervised framework achieves robust denoising and high-fidelity reconstruction of beam emittance images under low signal-to-noise conditions. The method extends measurable amplitudes beyond seven standard deviations, enabling unprecedented halo resolution.[117] Spectral Gaps and Spatial Priors: Studying Hyperspectral Downstream Adaptation Using TerraMind
Julia Anna Leonardi,Johannes Jakubik,Paolo Fraccaro,Maria Antonia Brovelli
Main category: cs.CV
TL;DR: 本研究探讨了多模态地理空间基础模型TerraMind在无需高光谱成像(HSI)特定预训练的情况下适配HSI下游任务的能力,比较了朴素波段选择和物理感知的光谱响应函数分组两种通道适配策略,结果表明原生支持HSI的深度学习模型总体更优,但TerraMind通过波段选择仍可实现中等性能,为HSI集成提供了关键基线,并呼吁未来模型需原生支持光谱标记化。
Details
Motivation: Geospatial Foundation Models (GFMs)通常缺乏对高光谱成像(HSI)的原生支持,因其高维光谱数据复杂且体量庞大;本研究旨在探索TerraMind这一多模态GFM在无HSI专用预训练条件下适配HSI下游任务的可行性。 Method: 实施并比较两种通道适配策略:朴素波段选择(Naive Band Selection)与物理感知的光谱响应函数(SRF)分组。 Result: 实验表明,原生支持HSI数据的深度学习模型总体性能更优;TerraMind通过波段选择可适配HSI下游任务,但性能有中等程度下降。 Conclusion: 该研究为HSI与地理空间基础模型的集成确立了关键基线,凸显未来多模态模型架构需引入原生光谱标记化的必要性。 Abstract: Geospatial Foundation Models (GFMs) typically lack native support for Hyperspectral Imaging (HSI) due to the complexity and sheer size of high-dimensional spectral data. This study investigates the adaptability of TerraMind, a multimodal GFM, to address HSI downstream tasks \emph{without} HSI-specific pretraining. Therefore, we implement and compare two channel adaptation strategies: Naive Band Selection and physics-aware Spectral Response Function (SRF) grouping. Overall, our results indicate a general superiority of deep learning models with native support of HSI data. Our experiments also demonstrate the ability of TerraMind to adapt to HSI downstream tasks through band selection with moderate performance decline. Therefore, the findings of this research establish a critical baseline for HSI integration, motivating the need for native spectral tokenization in future multimodal model architectures.[118] One-Shot Badminton Shuttle Detection for Mobile Robots
Florentin Dipner,William Talbot,Turcan Tuna,Andrei Cramariuc,Marco Hutter
Main category: cs.CV
TL;DR: 本文提出了一种面向非固定机器人的一次性羽毛球检测框架,构建了包含20510帧的新型第一人称视角羽毛球数据集,并设计了适配动态视角的YOLOv8检测器,在相似与全新环境中F1-score分别达0.86和0.70。
Details
Motivation: 缺乏面向移动机器人第一人称视角的羽毛球检测数据集,且现有方法多针对固定摄像头,难以适应动态视角。 Method: 构建涵盖11种背景、20510帧的半自动标注数据集,提出适配下游任务的评估指标,微调YOLOv8模型并分析影响性能的关键因素(如球体尺寸、背景纹理复杂度)。 Result: 在类训练环境测试中F1-score达0.86,在完全未见环境中为0.70;定性实验验证其在运动相机机器人上的适用性。 Conclusion: 该检测器专为移动机器人动态视角设计,是羽毛球跟踪、轨迹估计及系统重初始化等下游任务的重要基础模块。 Abstract: This paper presents a robust one-shot badminton shuttlecock detection framework for non-stationary robots. To address the lack of egocentric shuttlecock detection datasets, we introduce a dataset of 20,510 semi-automatically annotated frames captured across 11 distinct backgrounds in diverse indoor and outdoor environments, and categorize each frame into one of three difficulty levels. For labeling, we present a novel semi-automatic annotation pipeline, that enables efficient labeling from stationary camera footage. We propose a metric suited to our downstream use case and fine-tune a YOLOv8 network optimized for real-time shuttlecock detection, achieving an F1-score of 0.86 under our metric in test environments similar to training, and 0.70 in entirely unseen environments. Our analysis reveals that detection performance is critically dependent on shuttlecock size and background texture complexity. Qualitative experiments confirm their applicability to robots with moving cameras. Unlike prior work with stationary camera setups, our detector is specifically designed for the egocentric, dynamic viewpoints of mobile robots, providing a foundational building block for downstream tasks, including tracking, trajectory estimation, and system (re)-initialization.[119] Soft Equivariance Regularization for Invariant Self-Supervised Learning
Joohyung Lee,Changhun Kim,Hyunsu Kim,Kwanhyung Lee,Juho Lee
Main category: cs.CV
TL;DR: 本文提出Soft Equivariance Regularization (SER),一种插件式正则化方法,通过在中间空间token图上软性施加等变性约束(而非最终嵌入),解耦不变性与等变性的施加层级,在不增加模型复杂度的前提下,显著提升图像分类、鲁棒性及下游检测性能。
Details
Motivation: 现有自监督学习中,将不变性与等变性目标耦合于同一最终表征,存在性能权衡:增强深层等变性会损害ImageNet线性评估精度,因此需解耦施加层级。 Method: SER在保持原有SSL目标(如MoCo-v3、DINO)作用于最终嵌入不变的前提下,在中间空间token map上通过解析定义的群作用ρ_g直接施加软等变性正则;无需预测变换标签、无需额外预测头、仅增加1.008x训练FLOPs。 Result: 在ImageNet-1k ViT-S/16预训练中,SER使MoCo-v3线性评估Top-1提升+0.84;同时提升ImageNet-C/P鲁棒性(+1.11/+1.22 Top-1)和冻结主干的COCO检测mAP(+1.7);且该层解耦策略可迁移提升其他不变+等变基线性能。 Conclusion: 层解耦是融合不变性与等变性的通用设计原则;SER作为一种轻量、即插即用的正则器,有效兼顾表示判别性、几何鲁棒性与空间敏感迁移能力。 Abstract: Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations. While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work, therefore, augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation, motivating a layer-decoupled design. Motivated by this trade-off, we propose Soft Equivariance Regularization (SER), a plug-in regularizer that decouples where invariance and equivariance are enforced: we keep the base SSL objective unchanged on the final embedding, while softly encouraging equivariance on an intermediate spatial token map via analytically specified group actions $ρ_g$ applied directly in feature space. SER learns/predicts no per-sample transformation codes/labels, requires no auxiliary transformation-prediction head, and adds only 1.008x training FLOPs. On ImageNet-1k ViT-S/16 pretraining, SER improves MoCo-v3 by +0.84 Top-1 in linear evaluation under a strictly matched 2-view setting and consistently improves DINO and Barlow Twins; under matched view counts, SER achieves the best ImageNet-1k linear-eval Top-1 among the compared invariance+equivariance add-ons. SER further improves ImageNet-C/P by +1.11/+1.22 Top-1 and frozen-backbone COCO detection by +1.7 mAP. Finally, applying the same layer-decoupling recipe to existing invariance+equivariance baselinesimproves their accuracy, suggesting layer decoupling as a general design principle for combining invariance and equivariance.[120] HARP: HARmonizing in-vivo diffusion MRI using Phantom-only training
Hwihun Jeong,Qiang Liu,Kathryn E. Keenan,Elisabeth A. Wilde,Walter Schneider,Sudhir Pathak,Anthony Zuccolotto,Lauren J. O'Donnell,Lipeng Ning,Yogesh Rathi
Main category: cs.CV
TL;DR: 本文提出了一种名为HARP的深度学习dMRI数据标准化框架,仅使用扩散仿体数据进行训练,无需多中心活体受试者数据,显著降低了跨扫描仪变异性,并保持了纤维取向和纤维追踪的准确性。
Details
Motivation: 多中心扩散MRI(dMRI)数据整合受限于扫描仪间变异,而现有标准化方法依赖难以获取的多中心匹配或巡游受试者数据,亟需一种更可行的替代方案。 Method: HARP采用基于仿体数据训练的体素级1D神经网络,学习不同扫描仪间球谐系数的关系,不依赖空间结构记忆。 Result: HARP显著降低跨扫描仪变异性:FA、MD、GFA的扫描-重扫标准误分别下降12%、10%、30%,同时保留纤维方向与纤维追踪结果。 Conclusion: HARP是首个仅用仿体数据实现dMRI标准化的深度学习方法,大幅提升了大规模临床研究中定量dMRI的可行性与可扩展性。 Abstract: Purpose: Combining multi-site diffusion MRI (dMRI) data is hindered by inter-scanner variability, which confounds subsequent analysis. Previous harmonization methods require large, matched or traveling human subjects from multiple sites, which are impractical to acquire in many situations. This study aims to develop a deep learning-based dMRI harmonization framework that eliminates the reliance on multi-site in-vivo traveling human data for training. Methods: HARP employs a voxel-wise 1D neural network trained on an easily transportable diffusion phantom. The model learns relationships between spherical harmonics coefficients of different sites without memorizing spatial structures. Results: HARP reduced inter-scanner variability levels significantly in various measures. Quantitatively, it decreased inter-scanner variability as measured by standard error in FA (12%), MD (10%), and GFA (30%) with scan-rescan standard error as the baseline, while preserving fiber orientations and tractography after harmonization. Conclusion: We believe that HARP represents an important first step toward dMRI harmonization using only phantom data, thereby obviating the need for complex, matched in vivo multi-site cohorts. This phantom-only strategy substantially enhances the feasibility and scalability of quantitative dMRI for large-scale clinical studies.[121] Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs
Yiwei Li,Zihao Wu,Yanjun Lv,Hanqi Jiang,Weihang You,Zhengliang Liu,Dajiang Zhu,Xiang Li,Quanzheng Li,Tianming Liu,Lin Zhao
Main category: cs.CV
TL;DR: 本文提出了一种利用放射科医生眼动轨迹(时间有序的注视序列)作为监督信号,引导视觉-语言模型(VLMs)进行更符合临床实际的视觉证据采集与推理的新方法,通过引入专用的‘注视token’显著提升了模型在医学影像诊断任务中的性能与泛化能力。
Details
Motivation: 现有VLMs虽处理图像,但中间推理依赖文本,不适用于高度视觉化的放射学诊断;而放射科医生实际依赖时序性视觉搜索(由眼动轨迹反映),因此需将人类视觉推理过程融入模型训练。 Method: 引入少量专用‘注视token’,训练模型按时间顺序预测眼动所选图像块索引,使VLM推理过程对齐人类视觉证据获取路径。 Result: 在MIMIC-EYE及多个零样本外部基准上持续超越基线,达到领域内最优性能,并提升跨域鲁棒性。 Conclusion: 时间有序的眼动轨迹是一种高效、可迁移的监督信号,能有效促进VLM学习视觉扎根的医学推理。 Abstract: Vision--language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration. Experiments on MIMIC-EYE and multiple external zero-shot benchmarks show consistent gains over baselines, achieving state-of-the-art in-domain performance and improved out-of-domain robustness. These results highlight temporally ordered gaze as an effective supervision signal for learning visually grounded medical reasoning.[122] Asymmetric Distillation and Information Retention in Capacity-Constrained Cross-Modal Transfer
Kabir Thayani
Main category: cs.CV
TL;DR: 本文研究了在将大参数量的Vision Transformer(CLIP ViT-B/32)蒸馏到小参数量CNN时发生的维度坍缩现象,发现学生模型有效秩急剧下降至约16,导致噪声鲁棒性严重受损;容量过大反而更脆弱,而极小容量模型却表现出更强的抗噪能力,表明这是不对称余弦蒸馏固有的几何限制。
Details
Motivation: 探究知识蒸馏中不对称架构(如ViT到CNN)引发的表示空间几何约束,特别是维度坍缩现象及其对鲁棒性的影响。 Method: 采用严格中心化的奇异值分解(SVD)和基于方差的Shannon熵有效秩度量,分离真实结构方差与均值向量伪影;在CIFAR-10上蒸馏CLIP ViT-B/32(500M)到0.5M–8.0M参数CNN,并通过InfoNCE和噪声鲁棒性实验分析信息瓶颈。 Result: 教师模型有效秩为88.68,所有学生模型均坍缩至~16;81%维度损失导致噪声鲁棒性骤降(大容量学生在σ=0.1下准确率仅43.76%,而最小容量学生达54.84%);输入增强无法恢复大模型鲁棒性。 Conclusion: 维度坍缩是不对称余弦蒸馏的根本几何瓶颈,学生容量与鲁棒性呈非单调关系——过大的学生容量加剧脆性,极小容量反而起低通滤波作用;该现象无法通过常规数据增强缓解。 Abstract: Knowledge distillation between asymmetric architectures often induces severe geometric constraints on the learned representation space. In this work, we investigate the Dimensional Collapse phenomenon when distilling a 500M parameter global Vision Transformer (CLIP ViT-B/32) into strictly capacity-constrained, local-receptive-field CNNs (0.5M to 8.0M parameters) on the CIFAR-10 dataset. By employing strictly centered Singular Value Decomposition (SVD) and Variance-based Shannon Entropy Effective Rank, we isolate true structural variance from mean-vector artifacts. Our empirical results demonstrate a capacity-agnostic phase transition: while the Teacher exhibits an Effective Rank of 88.68, all Student models experience severe dimensional collapse to an intrinsic Effective Rank of ~16. By probing robustness, we uncover that this 81% reduction in effective dimensionality strips away the Teacher's inherent noise immunity (which retains 89.35% accuracy under σ=0.1 Gaussian noise). Furthermore, information-theoretic analysis using InfoNCE reveals a critical trade-off within this bottleneck: excess Student capacity densely packs the collapsed subspace for clean data, but induces severe brittleness (43.76% at σ=0.1). Conversely, extreme capacity constraints (0.5M parameters) act as a robust low-pass filter, preserving higher noise immunity (54.84%). Explicit input augmentation fails to restore the larger model's robustness, proving this fragility is a fundamental geometric limitation of asymmetric cosine distillation.[123] Multi-label Instance-level Generalised Visual Grounding in Agriculture
Mohammadreza Haghighat,Alzayat Saleh,Mostafa Rahimi Azghadi
Main category: cs.CV
TL;DR: 本文提出了首个面向农业领域的通用视觉定位数据集gRef-CW,并针对其挑战设计了Weed-VG框架,提升了作物与杂草的实例级视觉定位性能。
Details
Motivation: 现有视觉语言模型在农业场景下的视觉定位(VG)任务中缺乏适配基准数据集,且面临植物外观相似、多尺度、目标可能缺失等实际挑战。 Method: 构建了包含否定表达的农业视觉定位数据集gRef-CW;提出Weed-VG框架,融合多标签层次化相关性评分与插值驱动回归模块。 Result: 在gRef-CW上评测显示当前SOTA模型存在显著领域差距;Weed-VG显著提升作物与杂草的实例级定位性能,提供了农业VG新基线。 Conclusion: gRef-CW填补了农业视觉定位基准空白,Weed-VG为精准农业中的细粒度语义定位提供了有效可行的技术路径。 Abstract: Understanding field imagery such as detecting plants and distinguishing individual crop and weed instances is a central challenge in precision agriculture. Despite progress in vision-language tasks like captioning and visual question answering, Visual Grounding (VG), localising language-referred objects, remains unexplored in agriculture. A key reason is the lack of suitable benchmark datasets for evaluating grounding models in field conditions, where many plants look highly similar, appear at multiple scales, and the referred target may be absent from the image. To address these limitations, we introduce gRef-CW, the first dataset designed for generalised visual grounding in agriculture, including negative expressions. Benchmarking current state-of-the-art grounding models on gRef-CW reveals a substantial domain gap, highlighting their inability to ground instances of crops and weeds. Motivated by these findings, we introduce Weed-VG, a modular framework that incorporates multi-label hierarchical relevance scoring and interpolation-driven regression. Weed-VG advances instance-level visual grounding and provides a clear baseline for developing VG methods in precision agriculture. Code will be released upon acceptance.[124] SIQA: Toward Reliable Scientific Image Quality Assessment
Wenzhe Li,Liang Chen,Junying Wang,Yijing Guo,Ye Shen,Farong Wen,Chunyi Li,Zicheng Zhang,Guangtao Zhai
Main category: cs.CV
TL;DR: 本文提出科学图像质量评估(SIQA)框架,从知识(科学有效性与完整性)和感知(认知清晰性与学科一致性)两个维度建模,并设计SIQA-U(理解)与SIQA-S(评分)两种协议;实验发现当前多模态大模型在评分对齐上表现好于科学理解,提示需多维评估。
Details
Motivation: 现有图像质量评估方法仅关注感知失真或图文对齐,隐含‘所见即正确’假设,但在科学图像中,视觉合理的内容可能蕴含概念错误或推理不全,亟需兼顾科学正确性与逻辑完备性的新评估范式。 Method: 提出SIQA框架,定义知识(科学有效性、科学完整性)与感知(认知清晰性、学科一致性)双维度;构建SIQA-U(基于多选题的语义理解评估)和SIQA-S(对标专家评分的对齐评估)两种协议;建立含专家标注基准与大规模训练集的SIQA Challenge。 Result: 在主流多模态大语言模型上实验表明:模型在SIQA-S(评分对齐)上表现较好,但在SIQA-U(科学理解)上显著落后;微调可提升二者,但评分能力提升始终快于理解能力提升。 Conclusion: 仅依赖评分一致性不足以反映真实科学理解能力,科学图像质量评估必须采用涵盖知识与感知的多维框架,SIQA为此提供了系统化基础。 Abstract: Scientific images fundamentally differ from natural and AI-generated images in that they encode structured domain knowledge rather than merely depict visual scenes. Assessing their quality therefore requires evaluating not only perceptual fidelity but also scientific correctness and logical completeness. However, existing image quality assessment (IQA) paradigms primarily focus on perceptual distortions or image-text alignment, implicitly assuming that depicted content is factually valid. This assumption breaks down in scientific contexts, where visually plausible figures may still contain conceptual errors or incomplete reasoning. To address this gap, we introduce Scientific Image Quality Assessment (SIQA), a framework that models scientific image quality along two complementary dimensions: Knowledge (Scientific Validity and Scientific Completeness) and Perception (Cognitive Clarity and Disciplinary Conformity). To operationalize this formulation, we design two evaluation protocols: SIQA-U (Understanding), which measures semantic comprehension of scientific content through multiple-choice tasks, and SIQA-S (Scoring), which evaluates alignment with expert quality judgments. We further construct the SIQA Challenge, consisting of an expert-annotated benchmark and a large-scale training set. Experiments across representative multimodal large language models (MLLMs) reveal a consistent discrepancy between scoring alignment and scientific understanding. While models can achieve strong agreement with expert ratings under SIQA-S, their performance on SIQA-U remains substantially lower. Fine-tuning improves both metrics, yet gains in scoring consistently outpace improvements in understanding. These results suggest that rating consistency alone may not reliably reflect scientific comprehension, underscoring the necessity of multidimensional evaluation for scientific image quality assessment.[125] On the Generalization Capacities of MLLMs for Spatial Intelligence
Gongjie Zhang,Wenhao Li,Quanhao Qian,Jiuniu Wang,Deli Zhao,Shijian Lu,Ran Xu
Main category: cs.CV
TL;DR: 本文指出仅使用RGB输入的多模态大语言模型(MLLMs)在3D定位与导航等空间任务中因忽略相机参数而难以跨相机泛化;为此提出‘相机感知MLLM’框架,通过注入相机内参、相机感知数据增强和蒸馏3D视觉基础模型几何先验,显著提升跨相机泛化能力。
Details
Motivation: RGB-only MLLMs忽略相机参数,导致物体物理属性与相机视角混淆,造成不可解的歧义,使其过拟合训练相机分布,无法学习通用3D几何原理。 Method: 提出Camera-Aware MLLM框架:(i) 通过稠密嵌入将相机内参注入每个视觉token;(ii) 设计相机感知数据增强,合成变化相机参数以强制解耦相机属性与场景内容;(iii) 从3D视觉基础模型中蒸馏几何先验。 Result: 在空间接地任务(如3D定位、导航)的跨相机泛化测试中,相机感知MLLM显著优于基线模型。 Conclusion: 相机感知能力不仅是提升性能的有益补充,更是实现鲁棒、可泛化空间智能的必要前提。 Abstract: Multimodal Large Language Models (MLLMs) that directly process RGB inputs for tasks like 3D localization and navigation have shown remarkable potential. However, we argue that these RGB-only approaches are fundamentally flawed in their ability to generalize across cameras. By ignoring camera parameters, they entangle an object's physical properties with the camera's perspective, creating an irresolvable ambiguity. We show this leads MLLMs to overfit to the training camera distribution, rather than learning true and generalizable 3D geometric principles. To address this, we propose Camera-Aware MLLM framework for spatial MLLMs. It learns generalizable spatial reasoning by: (i) injecting camera intrinsics via a dense embedding that conditions each visual token; (ii) introducing a camera-aware data augmentation strategy that synthetically varies camera parameters, forcing the model to disentangle camera properties from scene content; and (iii) distilling geometric priors from a 3D vision foundation model. Extensive experiments demonstrate that camera-aware MLLMs substantially outperform their naive counterparts, particularly in cross-camera generalization tests on spatially-grounded tasks, indicating that camera-awareness is not only beneficial but also a prerequisite for robust and generalizable spatial intelligence in MLLMs.[126] UWPD: A General Paradigm for Invisible Watermark Detection Agnostic to Embedding Algorithms
Xiang Ao,Yiling Du,Zidan Wang,Mengru Chen
Main category: cs.CV
TL;DR: 本文提出通用水印存在性检测(UWPD)任务,构建了UniFreq-100K数据集,并设计了频率屏蔽网络(FSNet),通过自适应频谱感知模块和动态多频谱注意力机制实现对未知水印的零样本检测。
Details
Motivation: 现有隐形水印检测严重依赖特定算法的先验知识,在开放环境中难以检测“未知水印”。 Method: 提出UWPD新任务;构建大规模多算法水印图像数据集UniFreq-100K;设计FSNet模型,包含浅层的自适应频谱感知模块(ASPM)和深层的动态多频谱注意力(DMSA)与三流极值池化。 Result: FSNet在UWPD任务上展现出优异的零样本检测能力,显著优于现有基线模型。 Conclusion: FSNet通过频域建模有效提升了对未知隐形水印的通用检测能力,为开放环境下的版权保护提供了新思路。 Abstract: Invisible watermarks, as an essential technology for image copyright protection, have been widely deployed with the rapid development of social media and AIGC. However, existing invisible watermark detection heavily relies on prior knowledge of specific algorithms, leading to limited detection capabilities for "unknown watermarks" in open environments. To this end, we propose a novel task named Universal Watermark Presence Detection (UWPD), which aims to identify whether an image carries a copyright mark without requiring decoding information. We construct the UniFreq-100K dataset, comprising large-scale samples across various invisible watermark embedding algorithms. Furthermore, we propose the Frequency Shield Network (FSNet). This model deploys an Adaptive Spectral Perception Module (ASPM) in the shallow layers, utilizing learnable frequency gating to dynamically amplify high-frequency watermark signals while suppressing low-frequency semantics. In the deep layers, the network introduces Dynamic Multi-Spectral Attention (DMSA) combined with tri-stream extremum pooling to deeply mine watermark energy anomalies, forcing the model to precisely focus on sensitive frequency bands. Extensive experiments demonstrate that FSNet exhibits superior zero-shot detection capabilities on the UWPD task, outperforming existing baseline models. Code and datasets will be released upon acceptance.[127] HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos
Tingting Han,Xinsong Tao,Yufei Yin,Min Tan,Sicheng Zhao,Zhou Yu
Main category: cs.CV
TL;DR: 本文提出了开放词汇的时序句子定位任务(OV-TSGV),构建了首个专用基准Charades-OV和ActivityNet-OV,并提出HERO框架,通过分层嵌入与跨模态并行优化提升模型对未见语言表达的泛化能力。
Details
Motivation: 现有TSGV方法多为闭合词表设定,难以应对真实场景中多样、新颖的语言表达,亟需拓展至开放词表设定以提升泛化性。 Method: 提出HERO框架,利用分层语言嵌入,结合语义引导的视觉过滤与对比式掩码文本精炼,实现视频-语言跨模态对齐的并行优化。 Result: 在标准及开放词表基准上均显著超越SOTA方法,尤其在开放词表场景下展现出更强的泛化能力。 Conclusion: OV-TSGV是一个重要且具挑战性的新研究方向,HERO为其提供了有效解决方案,推动视频-语言理解向更通用、鲁棒的方向发展。 Abstract: Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this critical gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks--Charades-OV and ActivityNet-OV--that simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO(Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. HERO jointly models multi-level semantics and enhances video-language alignment via semantic-guided visual filtering and contrastive masked text refinement. Extensive experiments on both standard and open vocabulary benchmarks demonstrate that HERO consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its strong generalization capability and underscoring the significance of OV-TSGV as a new research direction.[128] Vessel-Aware Deep Learning for OCTA-Based Detection of AMD
Margalit G. Mitzner,Moinak Bhattacharya,Zhilin Zou,Chao Chen,Prateek Prasanna
Main category: cs.CV
TL;DR: 本文提出了一种结合血管特异性生物标志物(如血管迂曲度和血管缺失图)与OCTA图像的外部乘性注意力框架,用于提升AMD早期诊断的准确性和可解释性。
Details
Motivation: 现有深度学习模型多依赖全局特征,未能有效利用临床上有意义的血管生物标志物来检测AMD早期微血管改变。 Method: 构建外部乘性注意力机制,融合从动脉、静脉和毛细血管分割结果生成的血管迂曲度图与血管缺失图,并在多尺度上平滑以突出血管重塑和毛细血管稀疏模式;将这些生物标志物图与OCTA投影图像融合,引导深度分类器关注生理相关区域。 Result: 动脉迂曲度图提供最稳定的判别能力,毛细血管缺失图在基于密度的变体中表现最优,尤其在较大平滑尺度下;模型结果与已知AMD病理生理机制一致,具备可解释性。 Conclusion: 该方法不仅提升了AMD早期识别性能,还通过融合临床可解释的血管生物标志物增强了模型的透明度与医学可信度。 Abstract: Age-related macular degeneration (AMD) is characterized by early micro-vascular alterations that can be captured non-invasively using optical coherence tomography angiography (OCTA), yet most deep learning (DL) models rely on global features and fail to exploit clinically meaningful vascular biomarkers. We introduce an external multiplicative attention framework that incorporates vessel-specific tortuosity maps and vasculature dropout maps derived from arteries, veins, and capillaries. These biomarker maps are generated from vessel segmentations and smoothed across multiple spatial scales to highlight coherent patterns of vascular remodeling and capillary rarefaction. Tortuosity reflects abnormalities in vessel geometry linked to impaired auto-regulation, while dropout maps capture localized perfusion deficits that precede structural retinal damage. The maps are fused with the OCTA projection to guide a deep classifier toward physiologically relevant regions. Arterial tortuosity provided the most consistent discriminative value, while capillary dropout maps performed best among density-based variants, especially at larger smoothing scales. Our proposed method offers interpretable insights aligned with known AMD pathophysiology.[129] ButterflyViT: 354$\times$ Expert Compression for Edge Vision Transformers
Aryan Karmore
Main category: cs.CV
TL;DR: ButterflyViT 提出一种几何参数化方法,将 MoE Vision Transformer 中的专家视为共享三值基底的不同旋转视角,实现子线性内存扩展,并引入空间平滑正则化提升视觉任务性能。
Details
Motivation: 稀疏MoE视觉Transformer因专家权重矩阵线性内存增长而难以部署在边缘设备上,现有压缩方法无法解决该扩展瓶颈。 Method: 将专家建模为对统一共享三值基底施加可学习旋转得到的几何重定向;引入空间平滑正则化以约束相邻图像块token的路由不规则性。 Result: 在CIFAR-100上实现64专家下354倍内存缩减且精度几乎无损,验证了子线性内存扩展可行性。 Conclusion: 几何参数化能有效打破MoE中专家数量与内存消耗的线性关系,使多专家模型可在资源受限边缘设备上部署。 Abstract: Deploying sparse Mixture of Experts(MoE) Vision Transformers remains a challenge due to linear expert memory scaling. Linear memory scaling stores $N$ independent expert weight matrices requiring $\mathcal{O}(N_E \cdot d^2)$ memory, which exceeds edge devices memory budget. Current compression methods like quantization, pruning and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyViT, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing different angles of shared capacity, not from redundant storage. By applying learned rotations to a shared ternary prototype, each expert yields $\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$ memory which is sub-linear in the number of experts. To address the unique challenges of vision, a spatial smoothness regulariser is introduced that penalises routing irregularities between adjacent patch tokens, turning patch correlation into a training signal. Across image classification tasks on CIFAR-100, ButterflyViT achieves 354$\times$ memory reduction at 64 experts with negligible accuracy loss. ButterflyViT allows multiple experts to fit on edge-constrained devices showing that geometric parameterization breaks linear scaling.[130] XMACNet: An Explainable Lightweight Attention based CNN with Multi Modal Fusion for Chili Disease Classification
Tapon Kumer Ray,Rajkumar Y,Shalini R,Srigayathri K,Jayashree S,Lokeswari P
Main category: cs.CV
TL;DR: 本文提出XMACNet,一种轻量级CNN模型,结合自注意力机制和可见光图像与植被指数的多模态融合,用于辣椒病害检测,并构建了包含12000张图像的新数据集,模型在精度、F1分数和AUC上均优于多个基线模型,且具备可解释性和边缘部署能力。
Details
Motivation: 解决辣椒病害图像分类中数据稀缺、模型轻量化与可解释性不足的问题,以支持精准农业中的实际应用。 Method: 提出XMACNet模型,基于EfficientNetV2S主干网络,引入自注意力模块和RGB图像与NDVI、NPCI、MCARI等植被指数图的多模态融合分支;构建并合成了12,000张辣椒叶片图像数据集;采用Grad-CAM++和SHAP进行可解释性分析。 Result: XMACNet在自建数据集上取得高准确率、F1分数和AUC,性能优于ResNet-50、MobileNetV2和Swin Transformer变体;模型体积小、推理快,支持边缘部署。 Conclusion: XMACNet是一种高效、可解释、适合边缘部署的轻量级多模态模型,为植物病害智能诊断提供了实用新方案。 Abstract: Plant disease classification via imaging is a critical task in precision agriculture. We propose XMACNet, a novel light-weight Convolutional Neural Network (CNN) that integrates self-attention and multi-modal fusion of visible imagery and vegetation indices for chili disease detection. XMACNet uses an EfficientNetV2S backbone enhanced by a self-attention module and a fusion branch that processes both RGB images and computed vegetation index maps (NDVI, NPCI, MCARI). We curated a new dataset of 12,000 chili leaf images across six classes (five disease types plus healthy), augmented synthetically via StyleGAN to mitigate data scarcity. Trained on this dataset, XMACNet achieves high accuracy, F1-score, and AUC, outperforming baseline models such as ResNet-50, MobileNetV2, and a Swin Transformer variant. Crucially, XMACNet is explainable: we use Grad-CAM++ and SHAP to visualize and quantify the models focus on disease features. The models compact size and fast inference make it suitable for edge deployment in real-world farming scenarios.[131] EarthBridge: A Solution for 4th Multi-modal Aerial View Image Challenge Translation Track
Zhenyuan Chen,Guanyuan Shen,Feng Zhang
Main category: cs.CV
TL;DR: 本文提出了EarthBridge框架,用于EO、IR和SAR传感器间的跨模态图像翻译,结合DBIM(基于非马尔可夫桥过程的扩散模型)与CUT(对比无配对翻译)方法,在MAVIC-T挑战赛中取得第二名。
Details
Motivation: EO、IR和SAR传感器因电磁特性和几何特性差异大,跨模态图像翻译极具挑战性,亟需高保真翻译方法以支持多模态航拍分析。 Method: 提出EarthBridge框架,包含两种方法:1)Diffusion Bridge Implicit Models(DBIM),采用非马尔可夫桥过程实现高质量确定性采样;2)Contrastive Unpaired Translation(CUT),引入对比学习保障结构一致性;整体使用通道拼接UNet去噪器,结合Karras加权桥缩放与'booting noise'初始化策略。 Result: 在MAVIC-T全部四项任务(SAR→EO、SAR→RGB、SAR→IR、RGB→IR)上均展现出更优的空间细节与光谱精度,综合得分为0.38,位居挑战赛第二名。 Conclusion: EarthBridge通过融合隐式扩散建模与对比学习,在跨模态遥感图像翻译中实现了性能与鲁棒性的平衡,为多源遥感数据协同分析提供了有效技术路径。 Abstract: Cross-modal image-to-image translation among Electro-Optical (EO), Infrared (IR), and Synthetic Aperture Radar (SAR) sensors is essential for comprehensive multi-modal aerial-view analysis. However, translating between these modalities is notoriously difficult due to their distinct electromagnetic signatures and geometric characteristics. This paper presents \textbf{EarthBridge}, a high-fidelity translation framework developed for the 4th Multi-modal Aerial View Image Challenge -- Translation (MAVIC-T). We explore two distinct methodologies: \textbf{Diffusion Bridge Implicit Models (DBIM)}, which we generalize using non-Markovian bridge processes for high-quality deterministic sampling, and \textbf{Contrastive Unpaired Translation (CUT)}, which utilizes contrastive learning for structural consistency. Our EarthBridge framework employs a channel-concatenated UNet denoiser trained with Karras-weighted bridge scalings and a specialized "booting noise" initialization to handle the inherent ambiguity in cross-modal mappings. We evaluate these methods across all four challenge tasks (SAR$\rightarrow$EO, SAR$\rightarrow$RGB, SAR$\rightarrow$IR, RGB$\rightarrow$IR), achieving superior spatial detail and spectral accuracy. Our solution achieved a composite score of 0.38, securing the second position on the MAVIC-T leaderboard. Code is available at https://github.com/Bili-Sakura/EarthBridge-Preview.[132] A Hybrid Machine Learning Model for Cerebral Palsy Detection
Karan Kumar Singh,Nikita Gajbhiye,Gouri Sankar Mishra
Main category: cs.CV
TL;DR: 本文提出了一种结合VGG19、EfficientNet和ResNet50进行特征提取,并用Bi-LSTM分类的混合深度学习模型,用于新生儿脑MRI图像的脑性瘫痪(CP)早期诊断,准确率达98.83%。
Details
Motivation: 脑性瘫痪(CP)的早期识别对有效治疗至关重要,而高分辨率MRI结合智能分析可提升病理诊断能力。 Method: 构建融合VGG19、EfficientNet和ResNet50的CNN特征提取模块,并接入Bi-LSTM作为分类器;对脑MRI图像数据集进行采集与预处理后训练测试该模型。 Result: 所提模型在CP诊断任务中达到98.83%准确率,优于单独使用的VGG-19(96.79%)、EfficientNet(97.29%)和VGG-16(97.50%)。 Conclusion: 融合多CNN架构与Bi-LSTM的模型显著提升了CP早期诊断性能,验证了其在新生儿MRI辅助诊断中的有效性与优越性。 Abstract: The development of effective treatments for Cerebral Palsy (CP) can begin with the early identification of affected children while they are still in the early stages of the disorder. Pathological issues in the brain can be better diagnosed with the use of one of many medical imaging techniques. Magnetic Resonance Imaging (MRI) has revolutionized medical imaging with its unparalleled image resolution. A unique Machine Learning (ML) model that was built to identify CP disorder is presented in this paper. The model is intended to assist in the early diagnosis of CP in newborns. In this study, the brain MRI images dataset was first collected, and then the preprocessing techniques were applied to this dataset to make it ready for use in the proposed model. Following this, the proposed model was constructed by combining three CNN models, specifically VGG 19, Efficient-Net, and the ResNet50 model, to extract features from the image. Following this, a Bi-LSTM was utilized as a classifier to determine whether or not CP was present, and finally, the proposed model was employed for training and testing. The results show that the proposed model achieved an accuracy of 98.83%, which is higher than VGG-19 (96.79%), Efficient-Net (97.29%), and VGG-16 (97.50%).. When the suggested model is compared to other models that have been pre-trained in the past, the accuracy scores seem to be much higher.[133] Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models
Md Ashikur Rahman,Md Arifur Rahman,Niamul Hassan Samin,Abdullah Ibne Hanif Arean,Juena Ahmed Noshin
Main category: cs.CV
TL;DR: 本文发现长期视觉-语言模型的行为规律:保持时间上锚定的信念的模型泛化能力更强;提出行为忠实性(behavioral faithfulness)概念并定义步长接地率(SGR)作为量化指标,验证其与模型鲁棒性高度相关,且独立于模型规模和准确率。
Details
Motivation: 现有基准仅评估最终答案准确率,掩盖了模型如何使用视觉信息进行逐步推理;亟需衡量模型中间推理是否持续与动态视觉状态保持一致。 Method: 提出‘行为忠实性’概念,定义‘步长接地率(SGR)’量化模型中间推理对视觉输入的时间锚定程度,并在三个长时序基准、八个模型上进行实证分析,辅以反事实扰动、跨架构验证、随机基线等稳健性检验。 Result: SGR与分布外保留率显著正相关(r = 0.83,p = 0.003);在参数匹配的7B模型中,SGR差异达10.8个百分点,表明其为独立能力维度;反事实实验使SGR下降26–41个百分点,跨架构验证一致性ρ=0.96,证实该指标真实反映视觉依赖性。 Conclusion: 时间接地质量是模型鲁棒性的关键预测因子,SGR是一个可测量、稳健且独立于规模与准确率的新能力维度,为评估和改进长时序多模态模型提供了新范式。 Abstract: We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with $r = 0.83$ (permutation test $p = 0.003$), a relationship that holds within capacity-matched models and cannot be explained by scale or in-distribution accuracy. Critically, grounding quality varies by up to 10.8 percentage points within parameter-matched 7B models despite similar accuracy, revealing it as an independent axis of model capability. Multiple robustness checks confirm the signal reflects genuine visual reliance: counterfactual traces drop SGR by 26--41 percentage points, cross-architecture verifiers agree at $ρ= 0.96$, random reasoning scores near chance ($\sim 18\%$), and the predictor remains strong even without explicit reasoning disclosure ($r = 0.78$).[134] MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies
Howard H. Qian,Kejia Ren,Yu Xiang,Vicente Ordonez,Kaiyu Hang
Main category: cs.CV
TL;DR: 本文提出MotionBit概念,通过运动学空间扭量等价性定义运动分割的最小单元,构建了MoRiBo基准数据集,并设计了一种无学习的图分割方法,在宏平均mIoU上超越现有方法37.3%,显著提升下游具身推理与操作任务性能。
Details
Motivation: 现有基于语义分组的分割模型难以提供面向具身任务的有效交互级线索,亟需一种不依赖语义、能反映物理刚体运动本质的分割基本单元。 Method: 提出MotionBit概念(基于运动学空间扭量等价性定义刚体运动单元),构建手标定基准MoRiBo,并设计一种无学习的图结构分割算法。 Result: 所提方法在MoRiBo基准上宏平均mIoU超越SOTA具身感知方法37.3%;MotionBits被验证可有效提升下游具身推理与操作任务性能。 Conclusion: MotionBit是一种更符合物理交互本质的分割基本单元,为具身智能中的物理交互理解提供了新范式和基础构件。 Abstract: Rigid bodies constitute the smallest manipulable elements in the real world, and understanding how they physically interact is fundamental to embodied reasoning and robotic manipulation. Thus, accurate detection, segmentation, and tracking of moving rigid bodies is essential for enabling reasoning modules to interpret and act in diverse environments. However, current segmentation models trained on semantic grouping are limited in their ability to provide meaningful interaction-level cues for completing embodied tasks. To address this gap, we introduce MotionBit, a novel concept that, unlike prior formulations, defines the smallest unit in motion-based segmentation through kinematic spatial twist equivalence, independent of semantics. In this paper, we contribute (1) the MotionBit concept and definition, (2) a hand-labeled benchmark, called MoRiBo, for evaluating moving rigid-body segmentation across robotic manipulation and human-in-the-wild videos, and (3) a learning-free graph-based MotionBits segmentation method that outperforms state-of-the-art embodied perception methods by 37.3\% in macro-averaged mIoU on the MoRiBo benchmark. Finally, we demonstrate the effectiveness of MotionBits segmentation for downstream embodied reasoning and manipulation tasks, highlighting its importance as a fundamental primitive for understanding physical interactions.[135] Active View Selection with Perturbed Gaussian Ensemble for Tomographic Reconstruction
Yulun Wu,Ruyi Zha,Wei Cao,Yingying Li,Yuanhao Cai,Yaoyao Liu
Main category: cs.CV
TL;DR: 本文提出Perturbed Gaussian Ensemble框架,用于X射线稀疏视角CT重建中的主动视角选择,通过不确定性建模与序列决策提升重建精度。
Details
Motivation: 现有主动视角选择方法针对自然光场景设计,无法适配X射线成像特有的几何模糊性和物理衰减特性,导致稀疏视角CT重建精度受限。 Method: 提出基于扰动高斯集成的主动视角选择方法:识别低密度高斯基元,施加随机密度缩放构建多个密度场;对每个候选投影计算集成预测的结构方差,选择方差最大者作为下一视角。 Result: 在任意轨迹CT基准上实验表明,该密度引导扰动策略可有效消除几何伪影,在统一视角选择协议下持续优于现有基线方法。 Conclusion: 所提框架显著提升了X射线高斯溅射稀疏视角重建的鲁棒性与精度,为辐射剂量降低下的高质量CT成像提供了新思路。 Abstract: Sparse-view computed tomography (CT) is critical for reducing radiation exposure to patients. Recent advances in radiative 3D Gaussian Splatting (3DGS) have enabled fast and accurate sparse-view CT reconstruction. Despite these algorithmic advancements, practical reconstruction fidelity remains fundamentally bounded by the quality of the captured data, raising the crucial yet underexplored problem of X-ray active view selection. Existing active view selection methods are primarily designed for natural-light scenes and fail to capture the unique geometric ambiguities and physical attenuation properties inherent in X-ray imaging. In this paper, we present Perturbed Gaussian Ensemble, an active view selection framework that integrates uncertainty modeling with sequential decision-making, tailored for X-ray Gaussian Splatting. Specifically, we identify low-density Gaussian primitives that are likely to be uncertain and apply stochastic density scaling to construct an ensemble of plausible Gaussian density fields. For each candidate projection, we measure the structural variance of the ensemble predictions and select the one with the highest variance as the next best view. Extensive experimental results on arbitrary-trajectory CT benchmarks demonstrate that our density-guided perturbation strategy effectively eliminates geometric artifacts and consistently outperforms existing baselines in progressive tomographic reconstruction under unified view selection protocols.[136] An Extended Topological Model For High-Contrast Optical Flow
Brad Turow,Jose A. Perea
Main category: cs.CV
TL;DR: 本文通过离散圆丛理论,识别出Sintel数据集中高对比度3×3光流块的低维模型——一个以先前提出的光流环面为边界的3维流形,并发现高对比度光流块主要集中于与二值阶跃边缘相关的若干圆上,而非环面上,这些块多位于运动边界附近,对分割和跟踪等视觉任务至关重要。
Details
Motivation: 解释为何先前提出的光流环面模型无法通过直接方法(如持久同调)验证,并揭示高对比度光流块的真实几何-拓扑结构。 Method: 利用近似与离散圆丛理论,构建一个以光学流环面为边界的3维流形模型,并结合Sintel数据集中的高对比度光流块进行拓扑与统计分析。 Result: 发现绝大多数最高1%对比度范数的光流块靠近一组与二值阶跃边缘图像块对应的分离圆,而非原有环面;这些块高度集中在运动边界区域。 Conclusion: 光学流块的空间结构需用更复杂的3维流形(而非简单环面)建模,其拓扑与几何特性对理解视觉推理具有重要启示。 Abstract: In this paper, we identify low-dimensional models for dense core subsets in the space of $3\times 3$ high-contrast optical flow patches sampled from the Sintel dataset. In particular, we leverage the theory of approximate and discrete circle bundles to identify a 3-manifold whose boundary is a previously proposed optical flow torus, together with disjoint circles corresponding to pairs of binary step-edge range image patches. The 3-manifold model we introduce provides an explanation for why the previously-proposed torus model could not be verified with direct methods (e.g., a straightforward persistent homology computation). We also demonstrate that nearly all optical flow patches in the top 1 percent by contrast norm are found near the family of binary step-edge circles described above, rather than the optical flow torus, and that these frequently occurring patches are concentrated near motion boundaries (which are of particular importance for computer vision tasks such as object segmentation and tracking). Our findings offer insights on the subtle interplay between topology and geometry in inference for visual data.[137] ColonSplat: Reconstruction of Peristaltic Motion in Colonoscopy with Dynamic Gaussian Splatting
Weronika Smolak-Dyżewska,Joanna Kaleta,Diego Dall'Alba,Przemysław Spurek
Main category: cs.CV
TL;DR: 本文提出ColonSplat,一种动态高斯溅射框架,用于在结肠镜检查中实现高保真3D重建,能准确建模蠕动运动并保持全局几何一致性;同时构建了带真值点云的合成数据集DynamicColon以支持严格评估。
Details
Motivation: 现有内窥镜三维重建方法难以应对结肠内复杂蠕动运动与受限视野的挑战,且无法准确建模真实解剖运动,缺乏可靠评估基准。 Method: 首先构建带逐帧真值点云的合成数据集DynamicColon;进而提出ColonSplat,一种改进的动态高斯溅射框架,通过显式建模解剖结构的时变空间位置(xyz)而非仅调整旋转、尺度或透明度,来刻画蠕动样运动并维持全局几何一致性。 Result: ColonSplat在C3VDv2和DynamicColon数据集上实现了优于现有动态内窥镜方法的几何保真度;DynamicColon为动态重建提供了首个具备全局真值点云的结肠镜合成基准。 Conclusion: 显式优化高斯中心坐标(xyz)是准确建模结肠蠕动的关键;ColonSplat结合新基准DynamicColon,显著提升了动态结肠镜三维重建的准确性与可评估性。 Abstract: Accurate 3D reconstruction of colonoscopy data, accounting for complex peristaltic movements, is crucial for advanced surgical navigation and retrospective diagnostics. While recent novel view synthesis and 3D reconstruction methods have demonstrated remarkable success in general endoscopic scenarios, they struggle in the highly constrained environment of the colon. Due to the limited field of view of a camera moving through an actively deforming tubular structure, existing endoscopic methods reconstruct the colon appearance only for initial camera trajectory. However, the underlying anatomy remains largely static; instead of updating Gaussians' spatial coordinates (xyz), these methods encode deformation through either rotation, scale or opacity adjustments. In this paper, we first present a benchmark analysis of state-of-the-art dynamic endoscopic methods for realistic colonoscopic scenes, showing that they fail to model true anatomical motion. To enable rigorous evaluation of global reconstruction quality, we introduce DynamicColon, a synthetic dataset with ground-truth point clouds at every timestep. Building on these insights, we propose ColonSplat, a dynamic Gaussian Splatting framework that captures peristaltic-like motion while preserving global geometric consistency, achieving superior geometric fidelity on C3VDv2 and DynamicColon datasets. Project page: https://wmito.github.io/ColonSplat[138] A prior information informed learning architecture for flying trajectory prediction
Xianda Huang,Zidong Han,Ruibo Jin,Zhenyu Wang,Wenyu Li,Xiaoyang Li,Yi Gong
Main category: cs.CV
TL;DR: 本文提出了一种硬件高效的双Transformer级联(DTC)轨迹预测框架,融合环境先验信息,用于准确预测网球落点。
Details
Motivation: 传统方法在复杂物理建模、计算效率和硬件需求方面存在不足,且常忽略关键轨迹事件(如落地点)。 Method: 提出Dual-Transformer-Cascaded(DTC)架构,第一级Transformer进行轨迹分类,第二级合成特征以精确预测落地点;结合单工业相机+YOLO检测获取飞行坐标,并融合球场边界等环境先验构建输入数据。 Result: 在真实户外网球场场景中,该方法显著优于现有轨迹预测框架,尤其在落地点预测精度上表现突出。 Conclusion: 融合环境先验的DTC架构是一种高效、准确且硬件友好的轨迹预测新范式,适用于体育分析等实际场景。 Abstract: Trajectory prediction for flying objects is critical in domains ranging from sports analytics to aerospace. However, traditional methods struggle with complex physical modeling, computational inefficiencies, and high hardware demands, often neglecting critical trajectory events like landing points. This paper introduces a novel, hardware-efficient trajectory prediction framework that integrates environmental priors with a Dual-Transformer-Cascaded (DTC) architecture. We demonstrate this approach by predicting the landing points of tennis balls in real-world outdoor courts. Using a single industrial camera and YOLO-based detection, we extract high-speed flight coordinates. These coordinates, fused with structural environmental priors (e.g., court boundaries), form a comprehensive dataset fed into our proposed DTC model. A first-level Transformer classifies the trajectory, while a second-level Transformer synthesizes these features to precisely predict the landing point. Extensive ablation and comparative experiments demonstrate that integrating environmental priors within the DTC architecture significantly outperforms existing trajectory prediction frameworks[139] PICS: Pairwise Image Compositing with Spatial Interactions
Hang Zhou,Xinxin Zuo,Sen Wang,Li Cheng
Main category: cs.CV
TL;DR: 本文提出PICS方法,通过自监督的分解-组合范式,在并行合成中显式建模对象与背景间的交互关系,利用掩码引导的专家混合机制和自适应α融合策略提升空间一致性和边界保真度,并结合几何感知增强提升鲁棒性。
Details
Motivation: 扩散模型在单步图像合成中表现优异,但在多步或成对编辑中难以保持空间关系一致性,易出现内容覆盖和物理不一致问题。 Method: 提出PICS框架:1)基于Interaction Transformer的掩码引导Mixture-of-Experts结构,分别处理背景、独占区域和重叠区域;2)自适应α-blending策略实现兼容性感知的重叠融合;3)引入涵盖平面内外姿态变化的几何感知数据增强。 Result: 在虚拟试穿、室内及街景等多场景下,PICS在成对合成质量和稳定性上均显著优于现有最先进方法。 Conclusion: PICS通过显式建模对象间交互与几何鲁棒性设计,有效解决了扩散模型在序列化图像合成中的空间一致性难题,为高质量图像合成提供了新范式。 Abstract: Despite strong single-turn performance, diffusion-based image compositing often struggles to preserve coherent spatial relations in pairwise or sequential edits, where subsequent insertions may overwrite previously generated content and disrupt physical consistency. We introduce PICS, a self-supervised composition-by-decomposition paradigm that composes objects in parallel while explicitly modeling the compositional interactions among (fully-/partially-)visible objects and background. At its core, an Interaction Transformer employs mask-guided Mixture-of-Experts to route background, exclusive, and overlap regions to dedicated experts, with an adaptive α-blending strategy that infers a compatibility-aware fusion of overlapping objects while preserving boundary fidelity. To further enhance robustness to geometric variations, we incorporate geometry-aware augmentations covering both out-of-plane and in-plane pose changes of objects. Our method delivers superior pairwise compositing quality and substantially improved stability, with extensive evaluations across virtual try-on, indoor, and street scene settings showing consistent gains over state-of-the-art baselines. Code and data are available at https://github.com/RyanHangZhou/PICS[140] OPTED: Open Preprocessed Trachoma Eye Dataset Using Zero-Shot SAM 3 Segmentation
Kibrom Gebremedhin,Hadush Hailu,Bruk Gebregziabher
Main category: cs.CV
TL;DR: 本文提出OPTED——一个开源的预处理沙眼眼部数据集,利用SAM 3模型自动提取感兴趣区域(结膜),通过四步可复现流程(零样本文本提示分割、背景去除与对齐裁剪、置信度过滤、Lanczos缩放)生成高质量图像,并开源全部代码与数据,以支持撒哈拉以南非洲高发区的自动化沙眼分类研究。
Details
Motivation: 沙眼仍是全球首要感染性致盲病因,尤其在撒哈拉以南非洲负担最重,但缺乏来自该地区的公开预处理数据集;原始临床眼睑图像背景噪声大,难以直接用于机器学习。 Method: 基于Segment Anything Model 3(SAM 3)构建四步可复现预处理流程:(1)文本提示驱动的零样本结膜分割;(2)背景去除、对齐裁剪;(3)基于置信度的质量筛选;(4)Lanczos插值缩放到224×224;辅以提示优选与人工质量验证。 Result: 确定最优文本提示为“inner surface of eyelid with red tissue”,平均置信度0.872(标准差0.070),检测率达99.5%;生成两种格式图像(保持原宽高比的裁剪图与标准化224×224图);完整开源OPTED数据集、代码及实验材料。 Conclusion: OPTED填补了高负担地区高质量预处理沙眼图像数据的空白,其自动化、可复现的预处理范式提升了数据可用性与模型可比性,为推动公平、鲁棒的沙眼AI诊断研究提供了关键基础设施。 Abstract: Trachoma remains the leading infectious cause of blindness worldwide, with Sub-Saharan Africa bearing over 85% of the global burden and Ethiopia alone accounting for more than half of all cases. Yet publicly available preprocessed datasets for automated trachoma classification are scarce, and none originate from the most affected region. Raw clinical photographs of eyelids contain significant background noise that hinders direct use in machine learning pipelines. We present OPTED, an open-source preprocessed trachoma eye dataset constructed using the Segment Anything Model 3 (SAM 3) for automated region-of-interest extraction. We describe a reproducible four-step pipeline: (1) text-prompt-based zero-shot segmentation of the tarsal conjunctiva using SAM 3, (2) background removal and bounding-box cropping with alignment, (3) quality filtering based on confidence scores, and (4) Lanczos resizing to 224x224 pixels. A separate prompt-selection stage identifies the optimal text prompt, and manual quality assurance verifies outputs. Through comparison of five candidate prompts on all 2,832 known-label images, we identify "inner surface of eyelid with red tissue" as optimal, achieving a mean confidence of 0.872 (std 0.070) and 99.5% detection rate (the remaining 13 images are recovered via fallback prompts). The pipeline produces outputs in two formats: cropped and aligned images preserving the original aspect ratio, and standardized 224x224 images ready for pre-trained architectures. The OPTED dataset, preprocessing code, and all experimental artifacts are released as open source to facilitate reproducible trachoma classification research.[141] PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
Zhengjian Kang,Jun Zhuang,Kangtong Mo,Qi Chen,Rui Liu,Ye Zhang
Main category: cs.CV
TL;DR: PaQ-DETR 提出动态生成图像特定查询与质量感知的一对多分配策略,提升 DETR 查询适应性与监督平衡,在多个基准上显著提升 mAP 并增强可解释性。
Details
Motivation: DETR 及其变体依赖固定可学习查询,存在查询利用率不平衡问题,导致模型容量未充分利用、适应性受限。 Method: PaQ-DETR 引入共享潜在模式建模全局语义,并通过内容条件加权动态生成图像特定查询;同时采用质量感知的一对多分配策略,依据定位-分类一致性自适应选择正样本,以平衡查询优化。 Result: 在 COCO、CityScapes 等基准上,相比多种 DETR 骨干网络(如 ResNet、Swin-Transformer),mAP 提升 1.5%–4.2%;并揭示了动态模式在物体类别间的语义聚类规律。 Conclusion: PaQ-DETR 有效缓解查询利用不平衡问题,提升检测性能与可解释性,为端到端检测器的查询设计提供了新范式。 Abstract: Detection Transformer (DETR) has redefined object detection by casting it as a set prediction task within an end-to-end framework. Despite its elegance, DETR and its variants still rely on fixed learnable queries and suffer from severe query utilization imbalance, which limits adaptability and leaves the model capacity underused. We propose PaQ-DETR (Pattern and Quality-Aware DETR), a unified framework that enhances both query adaptivity and supervision balance. It learns a compact set of shared latent patterns capturing global semantics and dynamically generates image-specific queries through content-conditioned weighting. In parallel, a quality-aware one-to-many assignment strategy adaptively selects positive samples based on localizatio-classification consistency, enriching supervision and promoting balanced query optimization. Experiments on COCO, CityScapes, and other benchmarks show consistent gains of 1.5%-4.2% mAP across DETR backbones, including ResNet and Swin-Transformer. Beyond accuracy improvement, our method provides interpretable insights into how dynamic patterns cluster semantically across object categories.[142] DLRMamba: Distilling Low-Rank Mamba for Edge Multispectral Fusion Object Detection
Qianqian Zhang,Leon Tabaro,Ahmed M. Abdelmoniem,Junshe An
Main category: cs.CV
TL;DR: 本文提出了一种低秩二维选择性结构化状态空间模型(Low-Rank SS2D)及结构感知蒸馏策略,以解决现有SSM在多光谱融合目标检测中参数冗余、部署受限和细节丢失问题,在边缘设备上实现高效高精度检测。
Details
Motivation: 现有基于SSM(如Mamba)的多光谱融合目标检测方法在2D选择性扫描(SS2D)模块中存在显著参数冗余,难以部署于资源受限边缘设备,且常规压缩易损失细粒度结构信息。 Method: 提出低秩二维选择性结构化状态空间模型(Low-Rank SS2D),通过矩阵分解重构状态转移以利用特征稀疏性;并设计结构感知蒸馏策略,对齐学生模型与全秩教师模型的内部隐状态动态。 Result: 在五个基准数据集及树莓派5等真实边缘平台上的实验表明,该方法在计算复杂度和内存占用大幅降低的同时,保持高保真空间建模能力,显著优于现有轻量级架构。 Conclusion: Low-Rank SS2D结合结构感知蒸馏,有效平衡了边缘端多光谱融合检测任务的效率与精度,为高分辨率遥感与海上监控提供了实用可行的轻量化方案。 Abstract: Multispectral fusion object detection is a critical task for edge-based maritime surveillance and remote sensing, demanding both high inference efficiency and robust feature representation for high-resolution inputs. However, current State Space Models (SSMs) like Mamba suffer from significant parameter redundancy in their standard 2D Selective Scan (SS2D) blocks, which hinders deployment on resource-constrained hardware and leads to the loss of fine-grained structural information during conventional compression. To address these challenges, we propose the Low-Rank Two-Dimensional Selective Structured State Space Model (Low-Rank SS2D), which reformulates state transitions via matrix factorization to exploit intrinsic feature sparsity. Furthermore, we introduce a Structure-Aware Distillation strategy that aligns the internal latent state dynamics of the student with a full-rank teacher model to compensate for potential representation degradation. This approach substantially reduces computational complexity and memory footprint while preserving the high-fidelity spatial modeling required for object recognition. Extensive experiments on five benchmark datasets and real-world edge platforms, such as Raspberry Pi 5, demonstrate that our method achieves a superior efficiency-accuracy trade-off, significantly outperforming existing lightweight architectures in practical deployment scenarios.[143] Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images
Qianqian Zhang,Xiaolong Jia,Ahmed M. Abdelmoniem,Li Zhou,Junshe An
Main category: cs.CV
TL;DR: ESM-YOLO+ 是一种轻量级可见光-红外融合目标检测网络,通过 Mask-Enhanced Attention Fusion(MEAF)模块和训练时结构表征(SR)增强,显著提升遥感图像中小目标检测精度与鲁棒性,同时大幅降低参数量与计算量。
Details
Motivation: 遥感图像中目标通常尺寸小、纹理弱、易受复杂背景干扰,通用检测算法难以实现高精度检测;此外,可见光与红外模态间存在跨模态错位与尺度异质性问题。 Method: 提出 ESM-YOLO+:(1)Mask-Enhanced Attention Fusion(MEAF)模块,在像素级利用可学习空间掩码与空间注意力机制融合 RGB 与红外特征;(2)训练时 Structural Representation(SR)增强,提供辅助监督以保持细粒度空间结构,提升特征判别力且不增加推理开销。 Result: 在 VEDAI 和 DroneVehicle 数据集上分别达到 84.71% 和 74.0% mAP;相比基线模型,参数量减少 93.6%,GFLOPs 降低 68.0%。 Conclusion: ESM-YOLO+ 在保持高性能的同时显著提升轻量化与实用性,适用于复杂遥感场景下的实时小目标检测任务。 Abstract: Targets in remote sensing images are usually small, weakly textured, and easily disturbed by complex backgrounds, challenging high-precision detection with general algorithms. Building on our earlier ESM-YOLO, this work presents ESM-YOLO+ as a lightweight visible infrared fusion network. To enhance detection, ESM-YOLO+ includes two key innovations. (1) A Mask-Enhanced Attention Fusion (MEAF) module fuses features at the pixel level via learnable spatial masks and spatial attention, effectively aligning RGB and infrared features, enhancing small-target representation, and alleviating cross-modal misalignment and scale heterogeneity. (2) Training-time Structural Representation (SR) enhancement provides auxiliary supervision to preserve fine-grained spatial structures during training, boosting feature discriminability without extra inference cost. Extensive experiments on the VEDAI and DroneVehicle datasets validate ESM-YOLO+'s superiority. The model achieves 84.71\% mAP on VEDAI and 74.0\% mAP on DroneVehicle, while greatly reducing model complexity, with 93.6\% fewer parameters and 68.0\% lower GFLOPs than the baseline. These results confirm that ESM-YOLO+ integrates strong performance with practicality for real-time deployment, providing an effective solution for high-performance small-target detection in complex remote sensing scenes.[144] HIERAMP: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
Lin Zhao,Xinru Jiang,Xi Xiao,Qihui Fan,Lei Lu,Yanzhi Wang,Xue Lin,Octavia Camps,Pu Zhao,Jianyang Gu
Main category: cs.CV
TL;DR: 本文提出HIERAMP方法,利用视觉自回归(VAR)模型的粗到细生成特性,在不同尺度上动态注入类别标记以增强分层语义,从而提升数据集蒸馏效果。
Details
Motivation: 现有数据集蒸馏方法主要关注全局语义相似性,但物体语义具有天然层次性(如鸟眼位置受头部轮廓约束),仅靠全局相似性难以捕捉支撑识别的多级结构信息。 Method: 基于VAR模型的多尺度生成特性,提出HIERAMP:在每个VAR尺度注入动态类别标记,生成显著区域引导图,用于该尺度的语义增强;仅引入轻微推理开销,引导合成聚焦判别性部件与结构。 Result: 实验表明,语义增强使粗尺度布局token选择更多样,细尺度token使用更集中、更关注细节;在多个主流数据集蒸馏基准上一致提升验证性能,且无需显式优化全局相似性。 Conclusion: 分层语义增强对高效数据集蒸馏至关重要,HIERAMP通过多尺度动态引导显著提升了蒸馏数据的质量和下游任务性能。 Abstract: Dataset distillation often prioritizes global semantic proximity when creating small surrogate datasets for original large-scale ones. However, object semantics are inherently hierarchical. For example, the position and appearance of a bird's eyes are constrained by the outline of its head. Global proximity alone fails to capture how object-relevant structures at different levels support recognition. In this work, we investigate the contributions of hierarchical semantics to effective distilled data. We leverage the vision autoregressive (VAR) model whose coarse-to-fine generation mirrors this hierarchy and propose HIERAMP to amplify semantics at different levels. At each VAR scale, we inject class tokens that dynamically identify salient regions and use their induced maps to guide amplification at that scale. This adds only marginal inference cost while steering synthesis toward discriminative parts and structures. Empirically, we find that semantic amplification leads to more diverse token choices in constructing coarse-scale object layouts. Conversely, at fine scales, the amplification concentrates token usage, increasing focus on object-related details. Across popular dataset distillation benchmarks, HIERAMP consistently improves validation performance without explicitly optimizing global proximity, demonstrating the importance of semantic amplification for effective dataset distillation.[145] Extracting and analyzing 3D histomorphometric features related to perineural and lymphovascular invasion in prostate cancer
Sarah S. L. Chow,Rui Wang,Robert B. Serafin,Yujie Zhao,Elena Baraznenok,Xavier Farré,Jennifer Salguero-Lopez,Gan Gao,Huai-Ching Hsieh,Lawrence D. True,Priti Lal,Anant Madabhushi,Jonathan T. C. Liu
Main category: cs.CV
TL;DR: 本文提出了一种基于3D组织形态学分析的前列腺癌预后评估新方法,通过光片显微镜成像与nnU-Net分割模型提取3D神经/血管侵袭特征,并证明其在预测5年生化复发方面优于传统2D特征。
Details
Motivation: 2D组织病理学采样有限且存在截面解读模糊问题,影响前列腺癌诊疗决策;已有研究表明3D腺体/细胞核特征可提升风险评估,但缺乏对关键不良预后指标(如PNI和LVI)的3D量化分析。 Method: 构建3D分析流程:使用光学透明化+荧光H&E染色+开放顶光片显微镜(OTLS)获取前列腺切除标本3D图像;训练nnU-Net模型进行3D神经与血管分割;结合癌区3D掩膜,提取癌症-神经/血管空间邻近性等PNI/LVI相关3D特征;最后用监督机器学习预测5年生化复发(BCR)。 Result: 3D PNI相关特征对5年BCR具有中等预测能力(AUC=0.71),显著优于2D对应特征(AUC=0.52);LVI特征未明确报告性能;代码已开源。 Conclusion: 3D组织学分析可更准确量化PNI等关键侵袭模式,为前列腺癌精准预后提供新工具,支持向临床转化。 Abstract: Diagnostic grading of prostate cancer (PCa) relies on the examination of 2D histology sections. However, the limited sampling of specimens afforded by 2D histopathology, and ambiguities when viewing 2D cross-sections, can lead to suboptimal treatment decisions. Recent studies have shown that 3D histomorphometric analysis of glands and nuclei can improve PCa risk assessment compared to analogous 2D features. Here, we expand on these efforts by developing an analytical pipeline to extract 3D features related to perineural invasion (PNI) and lymphovascular invasion (LVI), which correlate with poor prognosis for a variety of cancers. A 3D segmentation model (nnU-Net) was trained to segment nerves and vessels in 3D datasets of archived prostatectomy specimens that were optically cleared, labeled with a fluorescent analog of H&E, and imaged with open-top light-sheet (OTLS) microscopy. PNI- and LVI-related features, including metrics describing cancer-nerve and cancer-vessel proximity, were then extracted based on the 3D nerve/vessel segmentation masks in conjunction with 3D masks of cancer-enriched regions. As a preliminary exploration of the prognostic value of these features, we trained a supervised machine learning classifier to predict 5-year biochemical recurrence (BCR) outcomes, finding that 3D PNI-related features are moderately prognostic and outperform 2D PNI-related features (AUC = 0.71 vs. 0.52). Source code is available at https://github.com/sarahrahsl/SegCIA.git.[146] Virtual Intraoperative CT (viCT): Sequential Anatomic Updates for Modeling Tissue Resection Throughout Endoscopic Sinus Surgery
Nicole M. Gunderson,Graham J. Harris,Jeremy S. Ruthberg,Pengcheng Chen,Di Mao,Randall A. Bly,Waleed M. Abuzeid,Eric J. Seibel
Main category: cs.CV
TL;DR: 本文提出了一种名为Virtual Intraoperative CT (viCT)的新方法,利用单目内窥镜视频生成并动态更新术中三维解剖结构,以匹配术前CT,从而在慢性鼻窦炎的内窥镜鼻窦手术(ESS)中实现无需额外硬件的实时解剖可视化。
Details
Motivation: 不完全切除是慢性鼻窦炎患者术后持续疾病和需再次手术的常见原因;现有图像导航系统仅依赖静态术前CT,无法反映术中不断变化的解剖边界。 Method: 采用深度监督的NeRF框架结合虚拟立体合成,从单目内窥镜视频中生成多阶段度量尺度三维重建;通过基于解剖标志点的刚性配准(在3D Slicer中完成),将重建结果映射到术前CT体素网格;再利用基于光线的占据率比较,动态删除过时体素、重映射保留解剖与更新边界,生成viCT体积。 Result: 在四具尸体的四阶段ESS可行性实验中,viCT与金标准CT对比显示高一致性:Dice相似系数0.88±0.05,Jaccard指数0.79±0.07,HD95为0.69±0.28 mm,Chamfer距离0.09±0.05 mm,MSD 0.11±0.05 mm,RMSD 0.32±0.10 mm,表面误差达亚毫米级。 Conclusion: viCT可在ESS中实现无需附加硬件的CT格式解剖动态更新;后续工作将聚焦于配准全自动、活体验证及实时部署的运行效率优化。 Abstract: Purpose: Incomplete dissection is a common cause of persistent disease and revision endoscopic sinus surgery (ESS) in chronic rhinosinusitis. Current image-guided surgery systems typically reference static preoperative CT (pCT), and do not model evolving resection boundaries. We present Virtual Intraoperative CT (viCT), a method for sequentially updating pCT throughout ESS using intraoperative 3D reconstructions from monocular endoscopic video to enable visualization of evolving anatomy in CT format. Methods: Monocular endoscopic video is processed using a depth-supervised NeRF framework with virtual stereo synthesis to generate metrically scaled 3D reconstructions at multiple surgical intervals. Reconstructions undergo rigid, landmark-based registration in 3D Slicer guided by anatomical correspondences, and are then voxelized into the pCT grid. viCT volumes were generated using a ray-based occupancy comparison between pCT and reconstruction to delete outdated voxels and remap preserved anatomy and updated boundaries. Performance is evaluated in a cadaveric feasibility study of four specimens across four ESS stages using volumetric overlap (DSC, Jaccard) and surface metrics (HD95, Chamfer, MSD, RMSD), and qualitative comparisons to ground-truth CT. Results: viCT updates show agreement with ground-truth anatomy across surgical stages, with submillimeter mean surface errors. Dice Similarity Coefficient (DSC) = 0.88 +/- 0.05 and Jaccard Index = 0.79 +/- 0.07, and Hausdorff Distance 95% (HD95) = 0.69 +/- 0.28 mm, Chamfer Distance = 0.09 +/- 0.05 mm, Mean Surface Distance (MSD) = 0.11 +/- 0.05 mm, and Root Mean Square Distance (RMSD) = 0.32 +/- 0.10 mm. Conclusion: viCT enables CT-format anatomic updating in an ESS setting without ancillary hardware. Future work will focus on fully automating registration, validation in live cases, and optimizing runtime for real-time deployment.[147] SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation
Kaiyuan Xu,Fangzhou Hong,Daniel Elson,Baoru Huang
Main category: cs.CV
TL;DR: 本文提出SurgCUT3R框架,通过伪地面真值数据生成、混合监督策略和分层推理机制,解决单目内窥镜视频手术场景重建中缺乏标注数据和长序列性能下降两大挑战,在精度与效率间取得良好平衡。
Details
Motivation: 单目内窥镜视频手术场景重建面临缺乏监督训练数据和长视频序列中性能下降两大挑战,通用三维重建模型难以直接适用。 Method: 提出SurgCUT3R框架:1)利用公开立体手术数据集生成大规模度量尺度伪深度图;2)结合伪真值与几何自校正的混合监督策略;3)采用双模型分层推理框架(全局稳定+局部准确)抑制位姿漂移。 Result: 在SCARED和StereoMIS数据集上实验表明,该方法在位姿估计精度上接近SOTA,但速度显著更快,实现了精度与效率的较好权衡。 Conclusion: SurgCUT3R为手术环境下的鲁棒三维重建提供了实用有效的解决方案,有效缓解了数据匮乏和长序列漂移问题。 Abstract: Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we propose a hybrid supervision strategy that couples our pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that employs two specialized models to effectively mitigate accumulated pose drift over long surgical videos: one for global stability and one for local accuracy. Experiments on the SCARED and StereoMIS datasets demonstrate that our method achieves a competitive balance between accuracy and efficiency, delivering near state-of-the-art but substantially faster pose estimation and offering a practical and effective solution for robust reconstruction in surgical environments. Project page: https://chumo-xu.github.io/SurgCUT3R-ICRA26/.[148] T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding
Chaohong Guo,Yihan He,Yongwei Nie,Fei Ma,Xuemiao Xu,Chengjiang Long
Main category: cs.CV
TL;DR: 本文提出Temporal to Spatial Gridification (T2SGrid)框架,将视频时序理解转化为空间理解任务,通过重排帧为网格图像并引入复合文本时间戳,在视频时序定位(VTG)任务中取得更优性能。
Details
Motivation: 现有Vision-LMM在视频时序建模中存在计算开销大、注意力稀疏、难以捕获绝对时间信息及损害空间细节等问题。 Method: 提出T2SGrid:采用重叠滑动窗口切分视频为时序片段;每段内按时间顺序将帧排布为行主序的复合网格图像;利用复合文本时间戳增强全局时间感知。 Result: 在标准VTG基准上实验表明,T2SGrid性能优于现有方法。 Conclusion: 将时序建模转化为结构化空间建模是提升VTG性能的有效新范式。 Abstract: Video Temporal Grounding (VTG) aims to localize the video segment that corresponds to a natural language query, which requires a comprehensive understanding of complex temporal dynamics. Existing Vision-LMMs typically perceive temporal dynamics via positional encoding, text-based timestamps, or visual frame numbering. However, these approaches exhibit notable limitations: assigning each frame a text-based timestamp token introduces additional computational overhead and leads to sparsity in visual attention, positional encoding struggles to capture absolute temporal information, and visual frame numbering often compromises spatial detail. To address these issues, we propose Temporal to Spatial Gridification (T2SGrid), a novel framework that reformulates video temporal understanding as a spatial understanding task. The core idea of T2SGrid is to process video content in clips rather than individual frames. we employ a overlapping sliding windows mechanism to segment the video into temporal clips. Within each window, frames are arranged chronologically in a row-major order into a composite grid image, effectively transforming temporal sequences into structured 2D layouts. The gridification not only encodes temporal information but also enhances local attention within each grid. Furthermore, T2SGrid enables the use of composite text timestamps to establish global temporal awareness. Experiments on standard VTG benchmarks demonstrate that T2SGrid achieves superior performance.[149] Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning
Paul Julius Kühn,Cedric Spengler,Michael Weinmann,Arjan Kuijper,Saptarshi Neil Sinha
Main category: cs.CV
TL;DR: 本文提出了一种基于大规模多模态预训练的图像到3D形状检索(IBSR)方法,无需显式多视图监督或视图合成,利用预对齐的图像与点云编码器实现零样本和跨域检索,并引入多模态难样本对比损失(HCL)进一步提升性能,在多个数据集上达到SOTA。
Details
Motivation: 现有IBSR方法依赖多视图渲染和特定任务的度量学习来弥合2D图像与3D形状之间的域差距;本文旨在探索更通用、无需视图监督和重训练的预训练范式。 Method: 采用预对齐的图像与点云编码器(如ULIP、OpenShape),将图像和点云嵌入共享表征空间;使用单嵌入形状描述符进行相似性检索;并提出多模态难样本对比损失(HCL)优化检索性能。 Result: 在多个数据集上,OpenShape+Point-BERT组合在Acc_Top1和Acc_Top10指标上均超越现有方法,达到SOTA;HCL训练带来数据集相关的标准实例检索性能提升。 Conclusion: 预对齐多模态编码器可有效支撑零样本与标准IBSR,无需视图合成与目标域重训练;难样本对比学习显著增强检索效果,验证了预训练与HCL在3D形状检索中的关键价值。 Abstract: Image-based shape retrieval (IBSR) aims to retrieve 3D models from a database given a query image, hence addressing a classical task in computer vision, computer graphics, and robotics. Recent approaches typically rely on bridging the domain gap between 2D images and 3D shapes based on the use of multi-view renderings as well as task-specific metric learning to embed shapes and images into a common latent space. In contrast, we address IBSR through large-scale multi-modal pretraining and show that explicit view-based supervision is not required. Inspired by pre-aligned image--point-cloud encoders from ULIP and OpenShape that have been used for tasks such as 3D shape classification, we propose the use of pre-aligned image and shape encoders for zero-shot and standard IBSR by embedding images and point clouds into a shared representation space and performing retrieval via similarity search over compact single-embedding shape descriptors. This formulation allows skipping view synthesis and naturally enables zero-shot and cross-domain retrieval without retraining on the target database. We evaluate pre-aligned encoders in both zero-shot and supervised IBSR settings and additionally introduce a multi-modal hard contrastive loss (HCL) to further increase retrieval performance. Our evaluation demonstrates state-of-the-art performance, outperforming related methods on $Acc_{Top1}$ and $Acc_{Top10}$ for shape retrieval across multiple datasets, with best results observed for OpenShape combined with Point-BERT. Furthermore, training on our proposed multi-modal HCL yields dataset-dependent gains in standard instance retrieval tasks on shape-centric data, underscoring the value of pretraining and hard contrastive learning for 3D shape retrieval. The code will be made available via the project website.[150] Perception-Aware Multimodal Spatial Reasoning from Monocular Images
Yanchun Cheng,Rundong Wang,Xulei Yang,Alok Prakash,Daniela Rus,Marcelo H Ang,ShiJie Li
Main category: cs.CV
TL;DR: 本文提出了一种感知感知的多模态推理框架,通过引入视觉参考标记(VRTs)和多模态思维链(MM-CoT)数据集,显著提升了单目图像中细粒度空间推理能力,在SURDS基准上大幅超越现有方法。
Details
Motivation: 当前视觉语言模型(VLMs)在单目图像的空间推理中仍难以处理大尺度变化和物体外观模糊等挑战,尤其缺乏细粒度几何感知能力。 Method: 提出对象中心的视觉参考标记(VRTs)表示法,将每个被指代物体映射为其空间范围内的全部VRT;构建多模态思维链(MM-CoT)数据集,并设计确定性排序策略以适配自回归VLM训练。 Result: 仅通过标准监督微调,在SURDS基准上显著优于先前方法(包括强化学习后训练方法),在单/多物体任务上均取得大幅性能提升。 Conclusion: 准确的感知与多模态推理相互增强,二者协同是实现复杂单目驾驶场景下鲁棒空间理解的关键。 Abstract: Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently unordered VRT sets fully compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, our method achieves substantial improvements on the SURDS benchmark, outperforming previous approaches - including those using RL-based post-training - by a large margin across both single-object and multi-object tasks. These results demonstrate that accurate perception and multimodal reasoning are mutually reinforcing, and together form the key to robust spatial understanding in challenging monocular driving scenarios.[151] MipSLAM: Alias-Free Gaussian Splatting SLAM
Yingzhao Li,Yan Li,Shixiong Tian,Yanjie Liu,Lijun Zhao,Gim Hee Lee
Main category: cs.CV
TL;DR: 本文提出了MipSLAM,一种频率感知的3D高斯泼溅(3DGS)SLAM框架,通过椭圆自适应抗锯齿(EAA)算法和频谱感知位姿图优化(SA-PGO)模块,有效缓解了传统方法中的混叠伪影与轨迹漂移问题,在保持实时性的同时实现了高质量新视角合成与高精度定位。
Details
Motivation: 现有基于3D高斯泼溅的SLAM系统因缺乏有效滤波和仅依赖空间优化,易出现混叠伪影和轨迹漂移,难以兼顾高保真渲染与鲁棒位姿估计。 Method: 提出椭圆自适应抗锯齿(EAA)算法,采用几何感知数值积分近似高斯贡献;设计频谱感知位姿图优化(SA-PGO)模块,将轨迹估计映射至频率域并利用图拉普拉斯分析抑制高频噪声;引入局部频域感知损失以提升几何细节恢复能力。 Result: 在Replica和TUM数据集上,MipSLAM在多分辨率下均达到SOTA渲染质量与定位精度,同时保持实时性能。 Conclusion: MipSLAM通过融合频率域建模与空间优化,显著提升了3DGS-SLAM系统的抗混叠能力、几何保真度与轨迹稳定性,为实时高保真SLAM提供了新范式。 Abstract: This paper introduces MipSLAM, a frequency-aware 3D Gaussian Splatting (3DGS) SLAM framework capable of high-fidelity anti-aliased novel view synthesis and robust pose estimation under varying camera configurations. Existing 3DGS-based SLAM systems often suffer from aliasing artifacts and trajectory drift due to inadequate filtering and purely spatial optimization. To overcome these limitations, we propose an Elliptical Adaptive Anti-aliasing (EAA) algorithm that approximates Gaussian contributions via geometry-aware numerical integration, avoiding costly analytic computation. Furthermore, we present a Spectral-Aware Pose Graph Optimization (SA-PGO) module that reformulates trajectory estimation in the frequency domain, effectively suppressing high-frequency noise and drift through graph Laplacian analysis. A novel local frequency-domain perceptual loss is also introduced to enhance fine-grained geometric detail recovery. Extensive evaluations on Replica and TUM datasets demonstrate that MipSLAM achieves state-of-the-art rendering quality and localization accuracy across multiple resolutions while maintaining real-time capability. Code is available at https://github.com/yzli1998/MipSLAM.[152] AdaGen: Learning Adaptive Policy for Image Synthesis
Zanlin Ni,Yulin Wang,Yeguo Hua,Renping Zhou,Jiayi Guo,Jun Song,Bo Zheng,Gao Huang
Main category: cs.CV
TL;DR: 本文提出AdaGen,一种通用、可学习且样本自适应的迭代生成过程调度框架,通过将调度问题建模为马尔可夫决策过程,并采用对抗式奖励设计与推理时优化策略,显著提升多种生成范式的性能与效率。
Details
Motivation: 现有生成模型依赖人工设计的静态步长参数调度策略,缺乏样本自适应能力,导致性能受限。 Method: 将生成调度建模为马尔可夫决策过程,使用轻量级策略网络在强化学习框架下学习样本自适应参数;提出对抗式奖励机制避免简单奖励(如FID)被操纵;引入推理时精炼策略与可控的保真度-多样性权衡机制。 Result: 在四种生成范式上验证有效性:例如在DiT-XL上以1/3推理成本获得更好性能,在VAR上将FID从1.92降至1.59且计算开销可忽略。 Conclusion: AdaGen是一种通用、高效且灵活的调度框架,能自动适配不同样本与生成模型,显著提升生成质量、多样性与推理效率。 Abstract: Recent advances in image synthesis have been propelled by powerful generative models, such as Masked Generative Transformers (MaskGIT), autoregressive models, diffusion models, and rectified flow models. A common principle behind their success is the decomposition of synthesis into multiple steps. However, this introduces a proliferation of step-specific parameters (e.g., noise level or temperature at each step). Existing approaches typically rely on manually-designed rules to manage this complexity, demanding expert knowledge and trial-and-error. Furthermore, these static schedules lack the flexibility to adapt to the unique characteristics of each sample, yielding sub-optimal performance. To address this issue, we present AdaGen, a general, learnable, and sample-adaptive framework for scheduling the iterative generation process. Specifically, we formulate the scheduling problem as a Markov Decision Process, where a lightweight policy network determines suitable parameters given the current generation state, and can be trained through reinforcement learning. Importantly, we demonstrate that simple reward designs, such as FID or pre-trained reward models, can be easily hacked and may not reliably guarantee the desired quality or diversity of generated samples. Therefore, we propose an adversarial reward design to guide the training of the policy networks. Finally, we introduce an inference-time refinement strategy and a controllable fidelity-diversity trade-off mechanism to further enhance the performance and flexibility of AdaGen. Comprehensive experiments on four generative paradigms validate the superiority of AdaGen. For example, AdaGen achieves better performance on DiT-XL with 3 times lower inference cost and improves the FID of VAR from 1.92 to 1.59 with negligible computational overhead.[153] TrajPred: Trajectory-Conditioned Joint Embedding Prediction for Surgical Instrument-Tissue Interaction Recognition in Vision-Language Models
Jiajun Cheng,Xiaofan Yu,Subarna,Sainan Liu,Shan Lin
Main category: cs.CV
TL;DR: 本文提出TrajPred框架,通过编码器械轨迹引入时间运动线索,并设计预测模块生成更精细动作细节的视觉语义嵌入,结合提示调优和动词重述技术,提升了手术器械-组织交互识别性能。
Details
Motivation: 现有视觉语言模型在器械-组织交互识别任务中性能受限,主要因未能有效利用时间信息及视觉-文本对齐缺失细粒度动作细节。 Method: 提出TrajPred框架:1)编码器械轨迹以建模时序运动;2)基于轨迹预测细粒度视觉语义嵌入;3)结合提示调优与动词重述实现任务自适应。 Result: 在CholecT50数据集上,平均精度(AP)和Top-K准确率均提升;可视化显示视觉与文本嵌入的余弦相似度对齐效果增强。 Conclusion: TrajPred通过显式建模时序轨迹和优化视觉-语言对齐,显著提升了器械-组织交互识别能力,为上下文感知手术AI提供新思路。 Abstract: Recognizing instruments' interactions with tissues is essential for building context-aware AI assistants in robotic surgery. Vision-language models (VLMs) have opened a new avenue for surgical perception and achieved better generalization on a wide range of tasks compared to conventional task-specific deep learning approaches. However, their performance on instrument--tissue interaction recognition remains limited, largely due to two challenges: (1) many models do not effectively leverage temporal information, and (2) alignment between vision and text often misses fine-grained action details. To address these issues, we propose TrajPred, a framework that encodes instrument trajectories to incorporate temporal motion cues and, conditioned on these trajectories, introduces a predictor module to generate visual semantic embeddings that better capture fine-grained action details. We further incorporate prompt tuning and a verb-rephrasing technique to enable smooth adaptation to the instrument--tissue interaction recognition task. Extensive experiments on the public laparoscopic benchmark, CholecT50, show that our method improves both Average Precision and Top-K accuracy. We also investigate whether visual embeddings of instrument--tissue interaction regions align better with the corresponding text by visualizing the cosine similarity between visual and textual embeddings. The visualization results indicate that the proposed method improves alignment between relevant visual and textual representations.[154] OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation
Leilei Wang,Longfei Liu,Xi Shen,Xuanlong Yu,Ying Tiffany He,Fei Richard Yu,Yingyi Chen
Main category: cs.CV
TL;DR: 本文提出OV-DEIM,一种基于DEIMv2框架的端到端DETR式开放词汇目标检测器,结合视觉语言建模与新型查询补充策略,并引入GridSynthetic数据增强方法,显著提升实时性、轻量化程度及对罕见类别的检测性能。
Details
Motivation: 现有实时开放词汇目标检测方法多基于YOLO,而DETR式方法在推理延迟、模型轻量性和整体性能上仍落后;需兼顾动态环境下的大类别集识别与严格时延约束。 Method: 提出OV-DEIM:基于DEIMv2的DETR式架构,集成视觉-语言建模;设计简单有效的查询补充策略;引入GridSynthetic数据增强——将多个样本组合为结构化图像网格,增强物体共现与空间布局学习。 Result: 在开放词汇检测基准上达到SOTA性能,推理效率更高,尤其在罕见类别上取得显著提升。 Conclusion: OV-DEIM验证了DETR式架构在实时开放词汇检测中的可行性与优势,通过架构创新与数据增强协同优化了精度、速度与泛化能力。 Abstract: Real-time open-vocabulary object detection (OVOD) is essential for practical deployment in dynamic environments, where models must recognize a large and evolving set of categories under strict latency constraints. Current real-time OVOD methods are predominantly built upon YOLO-style models. In contrast, real-time DETR-based methods still lag behind in terms of inference latency, model lightweightness, and overall performance. In this work, we present OV-DEIM, an end-to-end DETR-style open-vocabulary detector built upon the recent DEIMv2 framework with integrated vision-language modeling for efficient open-vocabulary inference. We further introduce a simple query supplement strategy that improves Fixed AP without compromising inference speed. Beyond architectural improvements, we introduce GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids. By exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass, GridSynthetic mitigates the negative impact of noisy localization signals on the classification loss and improves semantic discrimination, particularly for rare categories. Extensive experiments demonstrate that OV-DEIM achieves state-of-the-art performance on open-vocabulary detection benchmarks, delivering superior efficiency and notable improvements on challenging rare categories. Code and pretrained models are available at https://github.com/wleilei/OV-DEIM.[155] Fine-Grained 3D Facial Reconstruction for Micro-Expressions
Che Sun,Xinjie Zhang,Rui Gao,Xu Chen,Yuwei Wu,Yunde Jia
Main category: cs.CV
TL;DR: 本文提出了一种细粒度微表情重建方法,结合全局动态特征与局部增强特征,利用宏表情数据先验缓解微表情数据稀缺问题,并通过动态引导的网格形变模块提升几何精度与感知细节。
Details
Motivation: 微表情具有细微、短暂、低强度的特点,导致稳定且判别性强的特征难以提取,现有3D面部表情重建方法主要针对宏观表情,尚未探索微表情重建这一新任务。 Method: 提出一种细粒度微表情重建方法:1)设计即插即用的动态编码模块,提取全局面部动作的微表情特征,利用宏表情数据先验;2)构建动态引导的网格形变模块,融合稠密光流、稀疏关键点和面部网格几何信息,自适应优化局部微表情细节。 Result: 在多个微表情数据集上的大量实验表明,该方法在几何精度和感知细节方面均持续优于当前最先进方法。 Conclusion: 所提方法有效解决了微表情数据稀缺与特征提取难的问题,实现了高保真、细粒度的3D微表情重建,为微表情分析提供了新思路与实用工具。 Abstract: Recent advances in 3D facial expression reconstruction have demonstrated remarkable performance in capturing macro-expressions, yet the reconstruction of micro-expressions remains unexplored. This novel task is particularly challenging due to the subtle, transient, and low-intensity nature of micro-expressions, which complicate the extraction of stable and discriminative features essential for accurate reconstruction. In this paper, we propose a fine-grained micro-expression reconstruction method that integrates a global dynamic feature capturing stable facial motion patterns with a locally-enriched feature incorporating multiple informative cues from 2D motions, facial priors and 3D facial geometry. Specifically, we devise a plug-and-play dynamic-encoded module to extract micro-expression feature for global facial action, allowing it to leverage prior knowledge from abundant macro-expression data to mitigate the scarcity of micro-expression data. Subsequently, a dynamic-guided mesh deformation module is designed for extracting aggregated local features from dense optical flow, sparse landmark cues and facial mesh geometry, which adaptively refines fine-grained facial micro-expression without compromising global 3D geometry. Extensive experiments on micro-expression datasets demonstrate that our method consistently outperforms state-of-the-art methods in both geometric accuracy and perceptual detail.[156] Looking Back and Forth: Cross-Image Attention Calibration and Attentive Preference Learning for Multi-Image Hallucination Mitigation
Xiaochen Yang,Hao Fang,Jiawei Kong,Yaoxin Mao,Bin Chen,Shu-Tao Xia
Main category: cs.CV
TL;DR: 本文提出CAPL框架,通过跨图像注意力校准和偏好学习来缓解大视觉语言模型在多图像任务中的幻觉问题。
Details
Motivation: 大型视觉语言模型(LVLMs)在多图像任务中易产生幻觉,主要源于现有注意力机制的局限性和跨图像建模不足。 Method: 提出CAPL框架,包括:(i) 可选择的图像token交互注意力机制,实现细粒度跨图像实体对齐与信息流动;(ii) 基于跨图像建模的偏好优化策略,对比全交互与图像互不可见下的推理结果,促使模型依赖真实视觉证据而非文本先验。 Result: CAPL在多个模型架构上稳定提升多图像幻觉基准与通用基准性能,且单图像视觉任务性能保持稳定或略有提升,显示强泛化能力。 Conclusion: CAPL从架构与训练两方面增强跨图像建模与证据依赖,有效缓解LVLMs多图像幻觉问题,并具备良好泛化性。 Abstract: Although large vision-language models (LVLMs) have demonstrated remarkable capabilities, they are prone to hallucinations in multi-image tasks. We attribute this issue to limitations in existing attention mechanisms and insufficient cross-image modeling. Inspired by this, we propose a structured hallucination mitigation framework involving Cross-Image Attention calibration and Preference Learning (CAPL). CAPL explicitly enhances inter-image interactions at the architectural level while reinforcing reliance on genuine cross-image evidence during training, thereby improving the model's perception and modeling of cross-image associations. Specifically, we (i) introduce a selectable image token interaction attention mechanism to establish fine-grained cross-image entity alignment and information flow; (ii) design a cross-image modeling-based preference optimization strategy that contrasts reasoning outcomes under full inter-image interaction and those obtained when images are mutually invisible, encouraging the model to ground its predictions in authentic visual evidence and mitigating erroneous inferences driven by textual priors. Experimental results demonstrate that CAPL consistently improves performance across multiple model architectures, achieving stable gains on both multi-image hallucination and general benchmarks. Notably, performance on single-image visual tasks remains stable or slightly improves, indicating strong generalization capability.[157] SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer
Tong Shao,Yusen Fu,Guoying Sun,Jingde Kong,Zhuotao Tian,Jingyong Su
Main category: cs.CV
TL;DR: 本文提出SODA方法,通过细粒度敏感性建模动态协调缓存与剪枝,在保持生成质量的同时提升Diffusion Transformer推理效率。
Details
Motivation: 现有无训练加速方法(如缓存和剪枝)采用固定启发式策略,无法捕捉模型在时间步、层和模块上的细粒度敏感性变化,导致质量下降且泛化性差。 Method: SODA构建跨时间步、层和模块的离线敏感性误差建模框架,并以敏感性误差为代价函数,用动态规划优化缓存间隔;在剪枝与缓存重用中自适应决定剪枝时机与比例,优先保留高敏感token的计算。 Result: 在DiT-XL/2、PixArt-α和OpenSora上实验表明,SODA在可控加速比下达到当前最优的生成保真度。 Conclusion: SODA通过敏感性驱动的动态协同加速策略,有效平衡了推理效率与生成质量,具有良好的泛化性和实用性。 Abstract: Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruning achieves a balance between acceleration and generation quality. However, existing methods typically employ fixed and heuristic schemes to configure caching and pruning strategies. While they roughly follow the overall sensitivity trend of generation models to acceleration, they fail to capture fine-grained and complex variations, inevitably skipping highly sensitive computations and leading to quality degradation. Furthermore, such manually designed strategies exhibit poor generalization. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. SODA builds an offline sensitivity error modeling framework across timesteps, layers, and modules to capture the sensitivity to different acceleration operations. The cache intervals are optimized via dynamic programming with sensitivity error as the cost function, minimizing the impact of caching on model sensitivity. During pruning and cache reuse, SODA adaptively determines the pruning timing and rate to preserve computations of highly sensitive tokens, significantly enhancing generation fidelity. Extensive experiments on DiT-XL/2, PixArt-$α$, and OpenSora demonstrate that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios. Our code is released publicly at: https://github.com/leaves162/SODA.[158] MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering
Trong-Thang Pham,Loc Nguyen,Anh Nguyen,Hien Nguyen,Ngan Le
Main category: cs.CV
TL;DR: 本文提出MedSteer,一种无需训练的激活引导框架,用于内窥镜图像合成,通过在扩散Transformer的交叉注意力层中识别病理向量,实现对临床概念的精准操控,生成仅在目标概念上差异的反事实图像对,在概念翻转率和结构保持方面显著优于现有方法。
Details
Motivation: 现有基于文本提示或反转编辑的生成扩散模型难以生成因果可靠的医学训练数据:文本重提示会改变解剖结构、纹理和背景;反转编辑则引入重建误差导致结构漂移。 Method: MedSteer是一种无需训练的激活引导框架,在扩散Transformer的交叉注意力层中,为每组对比性文本提示对识别一个‘病理向量’;推理时沿该向量引导图像激活,从零生成仅在目标概念(如病变类型)上差异、其余结构完全保持的反事实图像对。 Result: 在Kvasir v3和HyperKvasir数据集上,MedSteer在三组临床概念反事实生成中翻转率分别达0.800、0.925、0.950,显著优于最优反转基线;染料解耦任务中实现75%染料去除(远高于PnP的20%和h-Edit的10%);用于息肉检测下游任务时,其反事实增强使ViT AUC达0.9755,明显高于等量重提示的0.9083。 Conclusion: MedSteer通过激活空间中的定向引导,实现了高保真、高可控的医学图像反事实合成,验证了结构一致性对下游任务性能提升的关键作用,为可信医学AI数据增强提供了新范式。 Abstract: Generative diffusion models are increasingly used for medical imaging data augmentation, but text prompting cannot produce causal training data. Re-prompting rerolls the entire generation trajectory, altering anatomy, texture, and background. Inversion-based editing methods introduce reconstruction error that causes structural drift. We propose MedSteer, a training-free activation-steering framework for endoscopic synthesis. MedSteer identifies a pathology vector for each contrastive prompt pair in the cross-attention layers of a diffusion transformer. At inference time, it steers image activations along this vector, generating counterfactual pairs from scratch where the only difference is the steered concept. All other structure is preserved by construction. We evaluate MedSteer across three experiments on Kvasir v3 and HyperKvasir. On counterfactual generation across three clinical concept pairs, MedSteer achieves flip rates of 0.800, 0.925, and 0.950, outperforming the best inversion-based baseline in both concept flip rate and structural preservation. On dye disentanglement, MedSteer achieves 75% dye removal against 20% (PnP) and 10% (h-Edit). On downstream polyp detection, augmenting with MedSteer counterfactual pairs achieves ViT AUC of 0.9755 versus 0.9083 for quantity-matched re-prompting, confirming that counterfactual structure drives the gain. Code is at link https://github.com/phamtrongthang123/medsteer[159] VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
Xueqing Yu,Bohan Li,Yan Li,Zhenheng Yang
Main category: cs.CV
TL;DR: 本文提出VirtueBench基准,用于评估视觉语言模型(VLMs)在视频理解中面对不确定性时的可信度,强调模型应诚实拒答而非猜测,并揭示当前VLMs在不同采样条件下拒答行为差异显著。
Details
Motivation: 现有VLMs在长视频理解评估中因帧采样限制导致关键帧缺失,模型若诚实拒答被误判为错误,而盲目猜测可能偶然正确,造成评估失真并鼓励不诚实行为。 Method: 构建VirtueBench基准,为每个视频设置多级帧采样策略,并提供可回答/不可回答的细粒度标注;在25个开源与商用VLM上系统评测其拒答准确性及提示敏感性。 Result: 不同VLM家族拒答准确率差异巨大(70%至接近0%),且多数模型在未明确要求拒答的提示下拒答率显著下降。 Conclusion: 需建立以可靠性与可信度为导向的新评估范式和排行榜,推动开发真正可信的多模态理解模型。 Abstract: Recent Vision-Language Models (VLMs) have made remarkable progress in multimodal understanding tasks, yet their evaluation on long video understanding remains unreliable. Due to limited frame inputs, key frames necessary for answering the question may be missing from the model's input. However, models that truthfully refuse to answer under such uncertainty are marked as incorrect, while those that guess may coincidentally produce the correct answer and thus obtain deceptively higher accuracy, leading to misleading evaluation results and encouraging models to guess rather than respond honestly. To address this issue, we introduce VirtueBench, a benchmark explicitly designed to assess model trustworthiness under uncertainty. VirtueBench constructs multiple frame-sampling levels for each video and provides ground truths that distinguish between answerable and unanswerable cases. Evaluations on 25 open-source and commercial VLMs reveal distinct refusal behaviors across different model families, with refusal accuracy ranging from over 70% in the best models to nearly 0% in the worst. Moreover, most models exhibit a substantial drop in refusal when the prompt does not explicitly require them to do so. These findings highlight the need for developing trustworthy VLMs for multimodal understanding, guided by benchmarks and leaderboards that emphasize reliability and trustworthiness.[160] Physics-Guided VLM Priors for All-Cloud Removal
Liying Xu,Huifang Li,Huanfeng Shen
Main category: cs.CV
TL;DR: 本文提出PhyVLM-CR方法,将视觉语言模型(VLM)的语义能力融入物理复原模型,实现薄云校正与厚云重建的统一,无需显式云类型判断,通过软门控机制自适应融合物理反演与时序参考重建,显著提升去云效果与辐射保真度。
Details
Motivation: 现有方法将薄云校正与厚云重建分离,需显式判断云类型,易导致误差累积和混合云场景下的不连续性。 Method: 提出PhyVLM-CR方法,利用VLM(如Qwen)提取认知先验,转化为物理散射参数和幻觉置信图;以该置信图为连续软门控,自适应加权物理反演(高透射区)与时序参考重建(低置信遮挡区),实现统一复原。 Result: 在Sentinel-2真实影像上实验表明,该方法在去云效果与内容保持间取得优异平衡,定量精度显著提升,且结果无幻觉。 Conclusion: PhyVLM-CR通过融合VLM语义先验与物理建模,实现了异质云覆盖下高保真、无缝、统一的云去除,避免了传统方法中显式分割与误差传播问题。 Abstract: Cloud removal is a fundamental challenge in optical remote sensing due to the heterogeneous degradation. Thin clouds distort radiometry via partial transmission, while thick clouds occlude the surface. Existing pipelines separate thin-cloud correction from thick-cloud reconstruction, requiring explicit cloud-type decisions and often leading to error accumulation and discontinuities in mixed-cloud scenes. Therefore, a novel approach named Physical-VLM All-Cloud Removal (PhyVLM-CR) that integrates the semantic capability of Vision-Language Model (VLM) into a physical restoration model, achieving high-fidelity unified cloud removal. Specifically, the cognitive prior from a VLM (e.g., Qwen) is transformed into physical scattering parameters and a hallucination confidence map. Leveraging this confidence map as a continuous soft gate, our method achieves a unified restoration via adaptive weighting: it prioritizes physical inversion in high-transmission regions to preserve radiometric fidelity, while seamlessly transitioning to temporal reference reconstruction in low-confidence occluded areas. This mechanism eliminates the need for explicit boundary delineation, ensuring a coherent removal across heterogeneous cloud covers. Experiments on real-world Sentinel-2 surface reflectance imagery confirm that our approach achieves a remarkable balance between cloud removal and content preservation, delivering hallucination-free results with substantially improved quantitative accuracy compared to existing methods.[161] Retinex Meets Language: A Physics-Semantics-Guided Underwater Image Enhancement Network
Shixuan Xu,Yabo Liu,Junyu Dong,Xinghui Dong
Main category: cs.CV
TL;DR: 本文提出了一种物理-语义引导的水下图像增强网络(PSG-UIENet),结合Retinex光照校正与CLIP文本语义指导,构建了先验无关光照估计器、跨模态文本对齐器和语义引导图像恢复器,并构建首个大规模水下图像-文本数据集LUIQD-TD及图像-文本语义相似性损失(ITSS),显著提升增强效果与语义一致性。
Details
Motivation: 现有水下图像增强方法受限于物理先验刚性或深度学习方法的数据稀缺与泛化弱问题。 Method: 提出PSG-UIENet,包含Prior-Free Illumination Estimator、Cross-Modal Text Aligner和Semantics-Guided Image Restorer;利用CLIP生成文本描述注入高层语义;构建LUIQD-TD数据集(6418图像-参考图-文本三元组);设计Image-Text Semantic Similarity(ITSS)损失函数。 Result: 在自建LUIQD-TD及四个公开数据集上,PSG-UIENet性能优于或媲美15种SOTA方法。 Conclusion: 首次将文本引导与多模态数据集引入水下图像增强任务,提升了语义一致性与增强质量。 Abstract: Underwater images often suffer from severe degradation caused by light absorption and scattering, leading to color distortion, low contrast and reduced visibility. Existing Underwater Image Enhancement (UIE) methods can be divided into two categories, i.e., prior-based and learning-based methods. The former rely on rigid physical assumptions that limit the adaptability, while the latter often face data scarcity and weak generalization. To address these issues, we propose a Physics-Semantics-Guided Underwater Image Enhancement Network (PSG-UIENet), which couples the Retinex-grounded illumination correction with the language-informed guidance. This network comprises a Prior-Free Illumination Estimator, a Cross-Modal Text Aligner and a Semantics-Guided Image Restorer. In particular, the restorer leverages the textual descriptions generated by the Contrastive Language-Image Pre-training (CLIP) model to inject high-level semantics for perceptually meaningful guidance. Since multimodal UIE data sets are not publicly available, we also construct a large-scale image-text UIE data set, namely, LUIQD-TD, which contains 6,418 image-reference-text triplets. To explicitly measure and optimize semantic consistency between textual descriptions and images, we further design an Image-Text Semantic Similarity (ITSS) loss function. To our knowledge, this study makes the first effort to introduce both textual guidance and the multimodal data set into UIE tasks. Extensive experiments on our data set and four publicly available data sets demonstrate that the proposed PSG-UIENet achieves superior or comparable performance against fifteen state-of-the-art methods.[162] Aligning What EEG Can See: Structural Representations for Brain-Vision Matching
Jingyi Tang,Shuai Jiang,Fei Su,Zhicheng Zhao
Main category: cs.CV
TL;DR: 本文提出了一种基于神经可见性(Neural Visibility)的EEG视觉解码新方法,通过选择中间视觉层对齐脑电信号,并设计分层互补融合(HCF)框架,显著提升零样本视觉解码性能。
Details
Motivation: 现有EEG视觉解码方法依赖深度视觉模型最终语义嵌入,导致严重的跨模态信息失配。 Method: 提出神经可见性概念和EEG-Visible层选择策略,将EEG信号与中间视觉层对齐;并设计分层互补融合(HCF)框架,联合整合多层级视觉表征。 Result: 在THINGS-EEG数据集上零样本视觉解码准确率达84.6%(+21.4%),相较多种EEG基线最高提升129.8%。 Conclusion: 中间层对齐与分层融合能有效缓解跨模态失配,显著提升EEG视觉解码性能与泛化能力。 Abstract: Visual decoding from electroencephalography (EEG) has emerged as a highly promising avenue for non-invasive brain-computer interfaces (BCIs). Existing EEG-based decoding methods predominantly align brain signals with the final-layer semantic embeddings of deep visual models. However, relying on these highly abstracted embeddings inevitably leads to severe cross-modal information mismatch. In this work, we introduce the concept of Neural Visibility and accordingly propose the EEG-Visible Layer Selection Strategy, aligning EEG signals with intermediate visual layers to minimize this mismatch. Furthermore, to accommodate the multi-stage nature of human visual processing, we propose a novel Hierarchically Complementary Fusion (HCF) framework that jointly integrates visual representations from different hierarchical levels. Extensive experiments demonstrate that our method achieves state-of-the-art performance, reaching an 84.6% accuracy (+21.4%) on zero-shot visual decoding on the THINGS-EEG dataset. Moreover, our method achieves up to a 129.8% performance gain across diverse EEG baselines, demonstrating its robust generalizability.[163] Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction
Xu Chen,Rui Gao,Xinjie Zhang,Haoyu Zhang,Che Sun,Zhi Gao,Yuwei Wu,Yunde Jia
Main category: cs.CV
TL;DR: 本文提出了一种基于人类反馈的面部表情生成方法,通过构建视觉-语言-动作模型和人类反馈强化学习策略,实现与人类偏好对齐的自然双向交互表情生成。
Details
Motivation: 现有面部表情生成方法缺乏对人类情感偏好和社会对齐性的有效建模,而人类反馈是提升对齐性的重要途径,但如何有效融入仍不明确。 Method: 将身份无关的表情生成建模为动作学习过程,构建闭合反馈环;训练视觉-语言-动作模型将说话者多模态信号映射为3D可变形模型的低维表情表征;引入结合高质量示范模仿与评论家引导优化的人类反馈强化学习策略。 Result: 在两个基准测试中,该方法显著提升了表情生成与人类偏好的对齐度,并取得更优性能。 Conclusion: 人类反馈可通过动作学习框架和强化学习策略有效融入面部表情生成,从而提升其在自然双向交互中的情感适宜性与社会对齐性。 Abstract: Achieving natural dyadic interaction requires generating facial expressions that are emotionally appropriate and socially aligned with human preference. Human feedback offers a compelling mechanism to guide such alignment, yet how to effectively incorporate this feedback into facial expression generation remains underexplored. In this paper, we propose a facial expression generation method aligned with human preference by leveraging human feedback to produce contextually and emotionally appropriate expressions for natural dyadic interaction. A key to our method is framing the generation of identity-independent facial expressions as an action learning process, allowing human feedback to assess their validity free from visual or identity bias. We establish a closed feedback loop in which listener expressions dynamically respond to evolving conversational cues of the speaker. Concretely, we train a vision-language-action model via supervised fine-tuning to map the speaker's multimodal signals into controllable low-dimensional expression representations of a 3D morphable model. We further introduce a human-feedback reinforcement learning strategy that integrates the imitation of high-quality expression response with critic-guided optimization. Experiments on two benchmarks demonstrate that our method effectively aligns facial expressions with human preference and achieves superior performance.[164] NuNext: Reframing Nucleus Detection as Next-Point Detection
Zhongyi Shui,Honglin Li,Xiaozhong Ji,Ye Zhang,Zijiang Yang,Chenglu Zhu,Yuxuan Sun,Kai Yao,Conghui He,Cheng Tan
Main category: cs.CV
TL;DR: 本文提出了一种将细胞核检测重新定义为“下一个点预测”任务的新方法,利用多模态大语言模型直接从图像中输出细胞核中心点坐标,并通过两阶段训练(含空间感知软监督与链式视觉思维策略的监督学习,以及含分布匹配奖励等机制的强化微调)显著提升检测性能。
Details
Motivation: 现有方法存在需复杂后处理或因密集锚点/查询导致前景-背景严重失衡的问题。 Method: 将核检测建模为next-point预测;构建多模态大语言模型;采用两阶段训练:1)监督学习阶段引入空间感知软监督和链式视觉思维策略;2)强化微调阶段设计分布匹配奖励、低方差组过滤和细粒度优势塑形。 Result: 在九个主流基准上实验验证了方法的优越性。 Conclusion: 所提方法有效克服了传统核检测范式的局限,提升了检测精度与鲁棒性。 Abstract: Nucleus detection in histopathology is pivotal for a wide range of clinical applications. Existing approaches either regress nuclear proxy maps that require complex post-processing, or employ dense anchors or queries that introduce severe foreground-background imbalance. In this work, we reformulate nucleus detection as next-point prediction, wherein a multimodal large language model is developed to directly output foreground nucleus centroids from the input image. The model is trained in two stages. In the supervised learning stage, we propose spatial-aware soft supervision to relax strict centroid matching and a chain-of-visual-thought strategy to incorporate visual priors that facilitate coordinate prediction. In the reinforcement fine-tuning stage, we design distribution matching reward, low-variance group filtering, and fine-grained advantage shaping to further improve the model's detection quality. Extensive experiments on nine widely used benchmarks demonstrate the superiority of our method. Code will be released soon.[165] Efficient Chest X-ray Representation Learning via Semantic-Partitioned Contrastive Learning
Wangyu Feng,Shawn Young,Lijian Xu
Main category: cs.CV
TL;DR: 本文提出了一种面向胸部X光片(CXR)分析的自监督学习新方法S-PCL,通过语义分割对比学习避免像素重建和强数据增强,提升诊断相关表征能力并降低计算开销。
Details
Motivation: 现有自监督学习方法在医学影像中存在不足:掩码图像建模耗费算力重建无诊断价值的高频背景;对比学习依赖强数据增强,易破坏临床关键结构。 Method: 提出Semantic-Partitioned Contrastive Learning(S-PCL):对单张CXR的patch token进行随机语义划分,形成两个非重叠子集,让编码器最大化二者表征一致性,从而隐式建模解剖全局布局与局部病理线索,并去除手工增强、辅助解码器和动量编码器。 Result: 在ChestX-ray14、CheXpert、RSNA肺炎和SIIM-ACR气胸等多个大规模CXR基准上,S-PCL在精度上优于或媲美现有SSL方法,同时实现最低GFLOPs,验证其高效性与有效性。 Conclusion: S-PCL是一种更适配医学影像特性的轻量、高效、可扩展的自监督预训练框架,为CXR分析提供了新的表征学习范式。 Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for Chest X-ray (CXR) analysis under limited annotations. Yet, existing SSL strategies remain suboptimal for medical imaging. Masked image modeling allocates substantial computation to reconstructing high-frequency background details with limited diagnostic value. Contrastive learning, on the other hand, often depends on aggressive augmentations that risk altering clinically meaningful structures. We introduce Semantic-Partitioned Contrastive Learning (S-PCL), an efficient pre-training framework tailored for CXR representation learning. Instead of reconstructing pixels or relying on heavy augmentations, S-PCL randomly partitions patch tokens from a single CXR into two non-overlapping semantic subsets. Each subset provides a complementary but incomplete view. The encoder must maximize agreement between these partitions, implicitly inferring global anatomical layout and local pathological cues from partial evidence. This semantic partitioning forms an internal bottleneck that enforces long-range dependency modeling and structural coherence. S-PCL eliminates the need for hand-crafted augmentations, auxiliary decoders, and momentum encoders. The resulting architecture is streamlined, computationally efficient, and easy to scale. Extensive experiments on large-scale CXR benchmarks, including ChestX-ray14, CheXpert, RSNA Pneumonia and SIIM-ACR Pneumothorax, show that S-PCL achieves competitive performance while attaining the lowest GFLOPs and superior accuracy among existing SSL approaches.[166] TIQA: Human-Aligned Text Quality Assessment in Generated Images
Kirill Koltsov,Aleksandr Gushchin,Dmitriy Vatolin,Anastasia Antsiferova
Main category: cs.CV
TL;DR: 本文提出TIQA任务和数据集,用于评估文本到图像模型中渲染文本的质量,并设计了轻量级方法ANTIQA,在与人类评分的相关性上优于现有方法。
Details
Motivation: 现有文本渲染质量评估方法(如OCR正确率或VLM判断)与人类感知的文本失真不一致,缺乏对真实感知质量的准确建模。 Method: 构建两个MOS标注数据集TIQA-Crops和TIQA-Images;提出专为文本设计的轻量评估模型ANTIQA,引入文本特异性偏差;采用PLCC指标评估与人类评分的相关性。 Result: ANTIQA在TIQA-Crops和TIQA-Images上分别比OCR置信度、VLM裁判和通用无参考图像质量指标提升约0.05和0.08的PLCC;在下游重排序任务中,用ANTIQA筛选best-of-5可使人类评价的文本质量平均提升14%。 Conclusion: TIQA为文本渲染质量评估提供了更符合人类感知的新基准和工具,ANTIQA具备实用价值,可有效提升生成式文本图像系统的输出质量。 Abstract: Text rendering remains a persistent failure mode of modern text-to-image models (T2I), yet existing evaluations rely on OCR correctness or VLM-based judging procedures that are poorly aligned with perceptual text artifacts. We introduce Text-in-Image Quality Assessment (TIQA), a task that predicts a scalar quality score that matches human judgments of rendered-text fidelity within cropped text regions. We release two MOS-labeled datasets: TIQA-Crops (10k text crops) and TIQA-Images (1,500 images), spanning 20+ T2I models, including proprietary ones. We also propose ANTIQA, a lightweight method with text-specific biases, and show that it improves correlation with human scores over OCR confidence, VLM judges, and generic NR-IQA metrics by at least $\sim0.05$ on TIQA-Crops and $\sim0.08$ on TIQA-Images, as measured by PLCC. Finally, we show that TIQA models are valuable in downstream tasks: for example, selecting the best-of-5 generations with ANTIQA improves human-rated text quality by $+14\%$ on average, demonstrating practical value for filtering and reranking in generation pipelines.[167] Inter-Image Pixel Shuffling for Multi-focus Image Fusion
Huangxing Lin,Rongrong Ma,Cheng Wang
Main category: cs.CV
TL;DR: 本文提出了一种无需真实多焦点图像即可训练的多焦点图像融合方法Inter-image Pixel Shuffling(IPS),将融合任务重构为像素级分类问题,并结合CNN与状态空间模型提升融合质量,实验表明其性能显著优于现有方法。
Details
Motivation: 深度学习在多焦点图像融合中受限于高质量标注训练数据的稀缺,亟需一种不依赖真实多焦点图像的自监督或弱监督训练策略。 Method: 提出Inter-image Pixel Shuffling(IPS):利用清晰光学图像及其低通滤波版本构造‘聚焦-失焦’像素对;在相同空间位置随机混洗二者像素,生成结构保持但焦点信息混合的合成训练样本;将融合建模为像素级分类任务;并设计融合CNN局部特征提取与状态空间模型长程建模能力的跨图像融合网络。 Result: IPS在多个标准测试集上显著超越现有监督与无监督多焦点融合方法,且完全未使用真实多焦点图像进行训练。 Conclusion: IPS验证了无需真实多焦点配对数据即可高效学习融合判据的可行性,为数据稀缺场景下的图像融合提供了新范式,同时展示了CNN与状态空间模型协同建模的空间-上下文联合优势。 Abstract: Multi-focus image fusion aims to combine multiple partially focused images into a single all-in-focus image. Although deep learning has shown promise in this task, its effectiveness is often limited by the scarcity of suitable training data. This paper introduces Inter-image Pixel Shuffling (IPS), a novel method that allows neural networks to learn multi-focus image fusion without requiring actual multi-focus images. IPS reformulates the task as a pixel-wise classification problem, where the goal is to identify the focused pixel from a pixel group at each spatial position. In this method, pixels from a clear optical image are treated as focused, while pixels from a low-pass filtered version of the same image are considered defocused. By randomly shuffling the focused and defocused pixels at identical spatial positions in the original and filtered images, IPS generates training data that preserves spatial structure while mixing focus-defocus information. The model is trained to select the focused pixel from each spatially aligned pixel group, thus learning to reconstruct an all-in-focus image by aggregating sharp content from the input. To further enhance fusion quality, IPS adopts a cross-image fusion network that integrates the localized representation power of convolutional neural networks with the long-range modeling capabilities of state space models. This design effectively leverages both spatial detail and contextual information to produce high-quality fused results. Experimental results indicate that IPS significantly outperforms existing multi-focus image fusion methods, even without training on multi-focus images.[168] Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
Shuai Lu,Meng Wang,Jia Guo,Jiawei Du,Bo Liu,Shengzhu Yang,Weihang Zhang,Huazhu Fu,Huiqi Li
Main category: cs.CV
TL;DR: 本文提出EyExIn框架,通过专家知识注入解决大视觉语言模型在眼科诊断中因感知和推理缺陷导致的不可靠问题,显著提升眼科视觉问答精度。
Details
Motivation: 大型视觉语言模型(LVLMs)在眼科自动诊断中潜力巨大,但缺乏领域专业知识限制了其临床部署;具体存在两个结构性缺陷:感知差距(无法识别细微病理线索)和推理差距(视觉证据被语言先验覆盖导致幻觉)。 Method: 提出EyExIn数据高效框架,包含专家感知双流编码策略(解耦通用解剖上下文与专业病理语义)、语义自适应门控融合模块(动态增强病灶信号并抑制背景噪声)以及自适应深度专家注入机制(将融合视觉特征作为残差偏置嵌入LLM中间层,建立视觉捷径以保持推理严格基于视觉证据)。 Result: 在四个基准测试中,EyExIn持续超越大规模专有系统,在眼科视觉问答任务中实现最先进精度,显著增强领域知识嵌入能力。 Conclusion: EyExIn通过锚定视网膜VLMs与专家知识,有效弥合感知与推理鸿沟,推动可信赖眼科AI的发展。 Abstract: Large Vision Language Models (LVLMs) show immense potential for automated ophthalmic diagnosis. However, their clinical deployment is severely hindered by lacking domain-specific knowledge. In this work, we identify two structural deficiencies hindering reliable medical reasoning: 1) the Perception Gap, where general-purpose visual encoders fail to resolve fine-grained pathological cues (e.g., microaneurysms); and 2) the Reasoning Gap, where sparse visual evidence is progressively overridden by massive language priors in deeper transformer layers, leading to ungrounded hallucinations. To bridge these gaps, we propose EyExIn, a data-efficient framework designed to anchor retinal VLMs with expert knowledge via a Deep Expert Injection mechanism. Our architecture employs an Expert-Aware Dual-Stream encoding strategy that decouples visual representation into a general stream for anatomical context and a specialized expert stream for pathological semantics. To ensure high-fidelity integration, we design a Semantic-Adaptive Gated Fusion module, which dynamically amplifies subtle lesion signals while filtering irrelevant background noise. Furthermore, we introduce Adaptive Deep Expert Injection to embed persistent "Vision Anchors" by integrating fused visual features as residual biases directly into intermediate LLM layers. This mechanism creates a visual shortcut that forces the reasoning stack to remain strictly grounded in visual evidence. Extensive experiments across four benchmarks demonstrate that our model consistently outperforms massive proprietary systems. EyExIn significantly enhances domain-specific knowledge embedding and achieves state-of-the-art precision in ophthalmic visual question answering, advancing the development of trustworthy ophthalmic AI.[169] The Model Knows Which Tokens Matter: Automatic Token Selection via Noise Gating
Landi He,Xiaoyu Yang,Lijian Xu
Main category: cs.CV
TL;DR: 本文提出AutoSelect方法,将视觉token剪枝重新定义为容量受限的通信问题,在不增加额外损失或标注的情况下,通过轻量级Scorer和Denoiser实现高效视觉token选择,在保持高准确率的同时显著加速VLM推理。
Details
Motivation: 视觉token在视觉语言模型(VLM)推理中占据主要计算开销,但其中许多token携带冗余信息;现有剪枝方法多依赖注意力强度或相似性分数,缺乏对信息保留本质的建模。 Method: 将视觉token剪枝建模为带容量约束(K)的信息传输问题;引入轻量级Scorer预测token重要性,并结合方差保持噪声门控与对角注意力Denoiser进行梯度传播与表征恢复;训练仅用标准下一个token预测损失;推理时仅保留Scorer与硬top-K选择。 Result: 在10个VLM基准上保持96.5%全模型精度,LLM预填充加速2.85倍,仅引入0.69ms延迟,并可无缝迁移至不同VLM骨干网络。 Conclusion: AutoSelect提供了一种通用、高效、无需额外监督的视觉token剪枝范式,兼顾性能、速度与部署友好性。 Abstract: Visual tokens dominate inference cost in vision-language models (VLMs), yet many carry redundant information. Existing pruning methods alleviate this but typically rely on attention magnitude or similarity scores. We reformulate visual token pruning as capacity constrained communication: given a fixed budget K, the model must allocate limited bandwidth to maximally preserve visual information. We propose AutoSelect, which attaches a lightweight Scorer and Denoiser to a frozen VLM and trains with only the standard next token prediction loss, without auxiliary objectives or extra annotations. During training, a variance preserving noise gate modulates each token's information flow according to its predicted importance so that gradients propagate through all tokens; a diagonal attention Denoiser then recovers the perturbed representations. At inference, only the Scorer and a hard top-K selection remain, adding negligible latency. On ten VLM benchmarks, AutoSelect retains 96.5% of full model accuracy while accelerating LLM prefill by 2.85x with only 0.69 ms overhead, and transfers to different VLM backbones without architecture-specific tuning. Code is available at https://github.com/MedHK23/AutoSelect.[170] PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection
Xijun Lu,Hongying Liu,Fanhua Shang,Yanming Hui,Liang Wan
Main category: cs.CV
TL;DR: 本文提出PDD框架,通过统一双教师先验到共享高维流形,并蒸馏至具有互补行为的双学生模型,显著提升医学图像异常检测性能。
Details
Motivation: Grad-CAM分析揭示判别性激活图在医学数据上失效,需转向流形层面建模以应对医学图像中细微、异质的异常。 Method: 提出PDD(Manifold-Prior Diverse Distillation)框架:利用冻结的VMamba-Tiny和wide-ResNet50作为双教师提供全局上下文与局部结构先验;通过MMU模块统一其特征至共享流形;InA模块增强中间表征;双学生分别进行层间蒸馏和跨层依赖建模(MPA模块);引入多样性损失防止表征坍缩。 Result: 在HeadCT、BrainMRI、ZhangLab和Uni-Medical等多个医学数据集上显著超越现有SOTA方法,AUROC最高提升11.8%,F1 max提升3.4%。 Conclusion: PDD通过流形先验引导的多样化知识蒸馏,有效建模医学图像异常的复杂分布,为该任务建立了新的SOTA。 Abstract: Medical image anomaly detection faces unique challenges due to subtle, heterogeneous anomalies embedded in complex anatomical structures. Through systematic Grad-CAM analysis, we reveal that discriminative activation maps fail on medical data, unlike their success on industrial datasets, motivating the need for manifold-level modeling. We propose PDD (Manifold-Prior Diverse Distillation), a framework that unifies dual-teacher priors into a shared high-dimensional manifold and distills this knowledge into dual students with complementary behaviors. Specifically, frozen VMamba-Tiny and wide-ResNet50 encoders provide global contextual and local structural priors, respectively. Their features are unified through a Manifold Matching and Unification (MMU) module, while an Inter-Level Feature Adaption (InA) module enriches intermediate representations. The unified manifold is distilled into two students: one performs layer-wise distillation via InA for local consistency, while the other receives skip-projected representations through a Manifold Prior Affine (MPA) module to capture cross-layer dependencies. A diversity loss prevents representation collapse while maintaining detection sensitivity. Extensive experiments on multiple medical datasets demonstrate that PDD significantly outperforms existing state-of-the-art methods, achieving improvements of up to 11.8%, 5.1%, and 8.5% in AUROC on HeadCT, BrainMRI, and ZhangLab datasets, respectively, and 3.4% in F1 max on the Uni-Medical dataset, establishing new state-of-the-art performance in medical image anomaly detection. The implementation will be released at https://github.com/OxygenLu/PDD[171] CanoVerse: 3D Object Scalable Canonicalization and Dataset for Generation and Pose
Li Jin,Yuchen Yang,Weikai Chen,Yujie Wang,Dehao Hao,Tanghui Jia,Yingda Yin,Zeyu Hu,Runze Zhang,Keyang Luo,Li Yuan,Long Quan,Xin Wang,Xueying Qin
Main category: cs.CV
TL;DR: 本文提出Canoverse,一个大规模的3D标准数据集(320K物体,1156类),旨在解决3D学习中因物体任意全局旋转导致的方向歧义问题;通过新提出的高效标准框架,显著提升3D生成稳定性、跨模态形状检索精度,并实现零样本点云方向估计。
Details
Motivation: 3D学习系统隐含假设物体处于一致参考系,但实际中物体常带任意全局旋转,导致模型难以自主消除方向歧义,进而抑制姿态一致生成和稳定方向语义的形成。 Method: 构建大规模标准3D数据集Canoverse,并设计新型标准框架:通过紧凑假设生成与轻量人工判别,将单物体对齐耗时从分钟级降至秒级,使标准过程变为高通量数据生成流水线。 Result: Canoverse显著提升3D生成稳定性、实现精准跨模态3D形状检索、支持零样本点云方向估计(包括分布外数据);方向语义在统计意义上可学习。 Conclusion: 大规模标准数据集与高效标准框架共同推动3D理解与生成中方向语义的建模,将标准化工序从人工主导转变为可扩展的数据驱动流程。 Abstract: 3D learning systems implicitly assume that objects occupy a coherent reference frame. Nonetheless, in practice, every asset arrives with an arbitrary global rotation, and models are left to resolve directional ambiguity on their own. This persistent misalignment suppresses pose-consistent generation, and blocks the emergence of stable directional semantics. To address this issue, we construct \methodName{}, a massive canonical 3D dataset of 320K objects over 1,156 categories -- an order-of-magnitude increase over prior work. At this scale, directional semantics become statistically learnable: Canoverse improves 3D generation stability, enables precise cross-modal 3D shape retrieval, and unlocks zero-shot point-cloud orientation estimation even for out-of-distribution data. This is achieved by a new canonicalization framework that reduces alignment from minutes to seconds per object via compact hypothesis generation and lightweight human discrimination, transforming canonicalization from manual curation into a high-throughput data generation pipeline. The Canoverse dataset will be publicly released upon acceptance. Project page: https://github.com/123321456-gif/Canoverse[172] LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models
Zicheng Duan,Jiatong Xia,Zeyu Zhang,Wenbo Zhang,Gengze Zhou,Chenhui Gou,Yefei He,Feng Chen,Xinyu Zhang,Lingqiao Liu
Main category: cs.CV
TL;DR: 本文提出LiveWorld框架,解决生成式视频世界模型中'视野外动态性'问题,通过建模持续演化的全局状态(静态3D背景+动态实体)和基于监控器的机制,实现未观测区域的自主时间演化与再访问时的状态同步,显著提升长期场景一致性。
Details
Motivation: 现有生成式视频世界模型假设世界仅在观察者视野内演化,导致物体离开视野后状态被冻结,无法反映其应发生的持续变化,存在'视野外动态性'问题。 Method: 提出LiveWorld框架:1)建模包含静态3D背景和持续演化的动态实体的持久全局状态;2)引入基于监控器的机制,自主模拟未观测实体的时间演化,并在重新进入视野时同步更新状态以保证空间一致性;3)构建专用基准LiveBench评估视野外动态维持能力。 Result: 实验表明LiveWorld能有效支持持久事件演化和长时场景一致性,弥合了基于2D观测记忆与真正4D动态世界模拟之间的差距。 Conclusion: LiveWorld首次系统性地解决了视频世界模型中视野外动态缺失的关键限制,为构建具备真实物理连续性的生成式世界模型提供了新范式。 Abstract: Recent generative video world models aim to simulate visual environment evolution, allowing an observer to interactively explore the scene via camera control. However, they implicitly assume that the world only evolves within the observer's field of view. Once an object leaves the observer's view, its state is "frozen" in memory, and revisiting the same region later often fails to reflect events that should have occurred in the meantime. In this work, we identify and formalize this overlooked limitation as the "out-of-sight dynamics" problem, which impedes video world models from representing a continuously evolving world. To address this issue, we propose LiveWorld, a novel framework that extends video world models to support persistent world evolution. Instead of treating the world as static observational memory, LiveWorld models a persistent global state composed of a static 3D background and dynamic entities that continue evolving even when unobserved. To maintain these unseen dynamics, LiveWorld introduces a monitor-based mechanism that autonomously simulates the temporal progression of active entities and synchronizes their evolved states upon revisiting, ensuring spatially coherent rendering. For evaluation, we further introduce LiveBench, a dedicated benchmark for the task of maintaining out-of-sight dynamics. Extensive experiments show that LiveWorld enables persistent event evolution and long-term scene consistency, bridging the gap between existing 2D observation-based memory and true 4D dynamic world simulation. The baseline and benchmark will be publicly available at https://zichengduan.github.io/LiveWorld/index.html.[173] PromptGate Client Adaptive Vision Language Gating for Open Set Federated Active Learning
Adea Nesturi,David Dueñas Gaviria,Jiajun Zeng,Shadi Albarqouni
Main category: cs.CV
TL;DR: 本文提出PromptGate,一种基于视觉语言模型(VLM)的动态门控框架,用于开放集联邦主动学习,可在不共享患者数据的前提下提升标注样本的分布内(ID)纯度与分布外(OOD)识别能力。
Details
Motivation: 现实临床数据池具有开放集特性,包含大量成像伪影、错误模态等OOD噪声;标准主动学习易将此类噪声误判为高信息量样本,浪费稀缺标注资源。 Method: 提出PromptGate:1)联邦类特定上下文优化——轻量可学习提示向量适配冻结的BiomedCLIP骨干网络,并通过FedAvg聚合;2)利用新标注持续优化提示,使VLM动态区分ID/OOD样本,作为策略无关的即插即用预筛选模块。 Result: 在分布式皮肤科和乳腺影像基准上,PromptGate保持>95% ID纯度与98% OOD召回率;而静态VLM提示仅达50% ID纯度。 Conclusion: PromptGate有效解决开放集联邦学习中的OOD噪声干扰问题,显著提升标注效率与隐私保护下的模型泛化能力。 Abstract: Deploying medical AI across resource-constrained institutions demands data-efficient learning pipelines that respect patient privacy. Federated Learning (FL) enables collaborative medical AI without centralising data, yet real-world clinical pools are inherently open-set, containing out-of-distribution (OOD) noise such as imaging artifacts and wrong modalities. Standard Active Learning (AL) query strategies mistake this noise for informative samples, wasting scarce annotation budgets. We propose PromptGate, a dynamic VLM-gated framework for Open-Set Federated AL that purifies unlabeled pools before querying. PromptGate introduces a federated Class-Specific Context Optimization: lightweight, learnable prompt vectors that adapt a frozen BiomedCLIP backbone to local clinical domains and aggregate globally via FedAvg -- without sharing patient data. As new annotations arrive, prompts progressively sharpen the ID/OOD boundary, turning the VLM into a dynamic gatekeeper that is strategy-agnostic: a plug-and-play pre-selection module enhancing any downstream AL strategy. Experiments on distributed dermatology and breast imaging benchmarks show that while static VLM prompting degrades to 50% ID purity, PromptGate maintains $>$95% purity with 98% OOD recall.[174] ACD-U: Asymmetric co-teaching with machine unlearning for robust learning with noisy labels
Reo Fukunaga,Soh Yoshida,Mitsuji Muneyasu
Main category: cs.CV
TL;DR: 本文提出ACD-U框架,结合异构模型协同教学与选择性机器遗忘,主动纠正标签噪声下的样本误选问题,显著提升高噪声场景下的泛化性能。
Details
Motivation: 深度神经网络在训练中易记忆错误标签,现有基于样本选择与半监督学习的方法无法修正已发生的误选错误。 Method: 提出异构架构的非对称协同教学框架ACD-U:1)用CLIP预训练ViT与CNN配对,ViT仅在干净样本上训练,CNN通过SSL训练,利用二者互补学习行为缓解确认偏误;2)通过损失轨迹分析与CLIP一致性检验识别被错误记忆的样本,并采用KL散度驱动的选择性遗忘机制消除其影响。 Result: 在CIFAR-10/100、CIFAR-N、WebVision、Clothing1M和Red Mini-ImageNet等多个合成与真实噪声数据集上达到SOTA性能,尤其在高噪声和实例依赖噪声下优势明显。 Conclusion: ACD-U将学习范式从被动避错转向主动纠错,验证了结合异构模型协同与机器遗忘是提升噪声鲁棒性的有效新路径。 Abstract: Deep neural networks are prone to memorizing incorrect labels during training, which degrades their generalizability. Although recent methods have combined sample selection with semi-supervised learning (SSL) to exploit the memorization effect -- where networks learn from clean data before noisy data -- they cannot correct selection errors once a sample is misclassified. To overcome this, we propose asymmetric co-teaching with different architectures (ACD)-U, an asymmetric co-teaching framework that uses different model architectures and incorporates machine unlearning. ACD-U addresses this limitation through two core mechanisms. First, its asymmetric co-teaching pairs a contrastive language-image pretraining (CLIP)-pretrained vision Transformer with a convolutional neural network (CNN), leveraging their complementary learning behaviors: the pretrained model provides stable predictions, whereas the CNN adapts throughout training. This asymmetry, where the vision Transformer is trained only on clean samples and the CNN is trained through SSL, effectively mitigates confirmation bias. Second, selective unlearning enables post-hoc error correction by identifying incorrectly memorized samples through loss trajectory analysis and CLIP consistency checks, and then removing their influence via Kullback--Leibler divergence-based forgetting. This approach shifts the learning paradigm from passive error avoidance to active error correction. Experiments on synthetic and real-world noisy datasets, including CIFAR-10/100, CIFAR-N, WebVision, Clothing1M, and Red Mini-ImageNet, demonstrate state-of-the-art performance, particularly in high-noise regimes and under instance-dependent noise. The code is publicly available at https://github.com/meruemon/ACD-U.[175] Class Visualizations and Activation Atlases for Enhancing Interpretability in Deep Learning-Based Computational Pathology
Marco Gustav,Fabian Wolf,Christina Glasner,Nic G. Reitsam,Stefan Schulz,Kira Aschenbroich,Bruno Märkl,Sebastian Foersch,Jakob Nikolas Kather
Main category: cs.CV
TL;DR: 本文提出了一种针对病理学中Transformer基础模型的特征可视化框架,系统评估了类别可视化(CVs)和激活图谱(AAs)在多器官癌症分类任务中的可解释性表现,并通过病理学家标注与多种度量验证其与专家认知的一致性。
Details
Motivation: 尽管Transformer模型在计算病理学中被广泛用于从H&E全切片图像预测分子和临床生物标志物,但其可解释性滞后于模型复杂度;现有方法多集中于归因和生成式解释,而概念级特征可视化(如CVs和AAs)尚未被系统评估。 Method: 开发了一个可视化框架,对Transformer基础模型在组织分类与多器官癌症分级任务中生成的类别可视化(CVs)和激活图谱(AAs)进行系统评估;结合四位病理学家对真实/生成图像的标注(计算Fleiss κ)、归因分析及相似性度量进行多维度验证。 Result: CVs在形态差异大的组织上保持可识别性,但在重叠癌症亚类上区分度下降;AAs显示层依赖结构:粗粒度组织概念聚类明显,细粒度亚类则分散重叠;专家一致性随标签粒度变细显著降低(如CVs组织分类κ=0.31,AA癌症亚类κ=0.11);图谱可分性与真实图像专家一致性高度相关。 Conclusion: 概念级特征可视化能揭示Transformer病理模型中结构化的形态流形,为跨标签粒度的专家中心式表征分析提供了可行框架;其解释力受限于病理本身固有的表型模糊性,而非单纯模型缺陷。 Abstract: The rapid adoption of transformer-based models in computational pathology has enabled prediction of molecular and clinical biomarkers from H&E whole-slide images, yet interpretability has not kept pace with model complexity. While attribution- and generative-based methods are common, feature visualization approaches such as class visualizations (CVs) and activation atlases (AAs) have not been systematically evaluated for these models. We developed a visualization framework and assessed CVs and AAs for a transformer-based foundation model across tissue and multi-organ cancer classification tasks with increasing label granularity. Four pathologists annotated real and generated images to quantify inter-observer agreement, complemented by attribution and similarity metrics. CVs preserved recognizability for morphologically distinct tissues but showed reduced separability for overlapping cancer subclasses. In tissue classification, agreement decreased from Fleiss k = 0.75 (scans) to k = 0.31 (CVs), with similar trends in cancer subclass tasks. AAs revealed layer-dependent organization: coarse tissue-level concepts formed coherent regions, whereas finer subclasses exhibited dispersion and overlap. Agreement was moderate for tissue classification (k = 0.58), high for coarse cancer groupings (k = 0.82), and low at subclass level (k = 0.11). Atlas separability closely tracked expert agreement on real images, indicating that representational ambiguity reflects intrinsic pathological complexity. Attribution-based metrics approximated expert variability in low-complexity settings, whereas perceptual and distributional metrics showed limited alignment. Overall, concept-level feature visualization reveals structured morphological manifolds in transformer-based pathology models and provides a framework for expert-centered interrogation of learned representations across label granularities.[176] FreeFly-Thinking : Aligning Chain-of-Thought Reasoning with Continuous UAV Navigation
Jiaxu Zhou,Shaobo Wang,Zhiyuan Yang,Zhenjun Yu,Tao Li
Main category: cs.CV
TL;DR: 本文提出FreeFly-thinking框架,用于无人机视觉-语言导航任务,通过构建专用UAV数据集和两阶段训练策略,在复杂户外场景中实现了鲁棒高效的端到端导航。
Details
Motivation: 现有视觉-语言导航研究多集中于室内环境,缺乏对复杂户外场景(如城市建筑环境)的研究;同时当前UAV导航模型多为黑箱,缺乏显式推理能力。 Method: 提出FreeFly-thinking端到端VLN框架,结合UAV egocentric图像与自然语言指令;构建专用UAV导航数据集;引入自然语言链式推理(chain-of-thought);采用监督微调与强化微调的两阶段训练策略。 Result: 在未见测试场景中表现出强性能,验证了该方法在UAV导航任务中的鲁棒性与高效性。 Conclusion: FreeFly-thinking有效提升了无人机在复杂户外环境下的视觉-语言导航能力,兼具可解释性与实用性。 Abstract: Vision-Language Navigation aims to enable agents to understand natural language instructions and carry out appropriate navigation actions in real-world environments. Most work focuses on indoor settings, with little research in complex outdoor scenes. Current UAV Vision-and-Language Navigation models typically act as black boxes without explicit reasoning. We introduce FreeFly-thinking, an end-to-end VLN framework that converts the UAV agent's egocentric images and language instructions into a series of actions, inspired by environment of urban architecture proposed by OpenFly. We first construct a UAV dataset for navigation task, and then performing natural language chain of thought. We adopt a two-stage training strategy: Supervised fine-tuning and Reinforcement fine-tuning. Experiments on unseen test demonstrate a strong performance, presenting robustness and efficiency in UAV navigation issue.[177] FastSTAR: Spatiotemporal Token Pruning for Efficient Autoregressive Video Synthesis
Sungwoong Yune,Suheon Jeong,Joo-Young Kim
Main category: cs.CV
TL;DR: 本文提出FastSTAR,一种无需训练的加速框架,通过时空令牌剪枝和部分更新机制,在保持高质量视频生成的同时显著提升计算效率。
Details
Motivation: 随着时空自回归建模(STAR)在视频生成中的应用,高分辨率和长帧数导致‘令牌爆炸’,引发严重的计算瓶颈。 Method: 提出时空令牌剪枝方法,结合空间相似性(评估多尺度结构收敛性)与时间相似性(识别运动轨迹),并辅以部分更新机制,仅对未收敛区域进行细化。 Result: 在InfinityStar上实验表明,FastSTAR实现最高2.01倍加速,PSNR达28.29,性能下降小于1%。 Conclusion: FastSTAR在STAR类视频合成中实现了更优的效率-质量权衡,是一种高效、即插即用的加速方案。 Abstract: Visual Autoregressive modeling (VAR) has emerged as a highly efficient alternative to diffusion-based frameworks, achieving comparable synthesis quality. However, as this paradigm extends to Spacetime Autoregressive modeling (STAR) for video generation, scaling resolution and frame counts leads to a "token explosion" that creates a massive computational bottleneck in the final refinement stages. To address this, we propose FastSTAR, a training-free acceleration framework designed for high-quality video generation. Our core method, Spatiotemporal Token Pruning, identifies essential tokens by integrating two specialized terms: (1) Spatial similarity, which evaluates structural convergence across hierarchical scales to skip computations in regions where further refinement becomes redundant, and (2) Temporal similarity, which identifies active motion trajectories by assessing feature-level variations relative to the preceding clip. Combined with a Partial Update mechanism, FastSTAR ensures that only non-converged regions are refined, maintaining fluid motion while bypassing redundant computations. Experimental results on InfinityStar demonstrate that FastSTAR achieves up to a 2.01x speedup with a PSNR of 28.29 and less than 1% performance degradation, proving a superior efficiency-quality trade-off for STAR-based video synthesis.[178] VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization
Seul-Ki Yeom,Marcel Simon,Eunbin Lee,Tae-Ho Kim
Main category: cs.CV
TL;DR: 本文提出VINO框架,通过视频驱动的非上下文对象不变性学习,解决自监督学习中特征过度依赖背景纹理和共现统计的问题。利用结构先验生成视图,在教师-学生框架中实施不对称蒸馏和跨时间蒸馏,提升对象中心表征能力。
Details
Motivation: 自监督学习中特征易过度依赖背景纹理和共现统计;视频虽具时序多样性,但强自我运动导致前景与背景协同运动,形成共现陷阱,使表征坍缩为场景编码器。 Method: 提出VINO(Video-driven Invariance for Non-Contextual Objects)教师-学生框架:1)使用类无关结构先验生成视图(非语义伪标签),构建不对称蒸馏;2)教师基于抑制背景的前景联合视图预测,学生观察保留上下文但去除竞争实例的对象条件场景视图;3)通过掩码蒸馏匹配目标,削弱背景线索可靠性;4)引入教师锚定的跨时间蒸馏以保障时序对象永久性;5)采用掩码引导的局部视图稳定部分-整体一致性。 Result: 在PASCAL VOC上通过注意力可视化和无监督对象发现验证了前景-背景解耦效果;在Walking Tours Venice密集视频上预训练后,CorLoc达34.8,获得高度聚焦、形状偏向的表征,显著优于先前密集视频和运动引导的SSL基线。 Conclusion: VINO通过结构信息瓶颈与多阶段蒸馏机制,有效引导模型学习对象中心、背景鲁棒的视觉表征,为密集视频自监督学习提供了新范式。 Abstract: Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts-background textures and co-occurrence statistics. While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Using a class-agnostic structural prior solely to generate views-not as semantic pseudo-labels-VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned scene views that retain surrounding context but remove competing instances. Matching these targets via masked distillation makes background cues unreliable, pushing the representation toward object-centric invariances. We further enforce temporal object permanence via teacher-anchored cross-time distillation over track-matched objects, and stabilize part-to-whole consistency with mask-guided local views. Through attention visualization and unsupervised object discovery on PASCAL VOC, we demonstrate that VINO effectively disentangles foreground from background. Pretrained on the dense Walking Tours Venice video, VINO achieves 34.8 CorLoc, yielding highly focused, shape-biased representations that substantially outperform prior dense-video and motion-guided SSL baselines.[179] Single Image Super-Resolution via Bivariate `A Trous Wavelet Diffusion
Heidari Maryam,Anantrasirichai Nantheera,Achim Alin
Main category: cs.CV
TL;DR: 本文提出BATDiff,一种无监督的双变量A trous小波扩散模型,通过多尺度小波表示和跨尺度依赖建模,提升超分辨率重建中高频结构的一致性和真实性。
Details
Motivation: 现有基于扩散的超分辨率方法多在空间域操作,易产生与低分辨率输入不一致的高频细节;而单图像超分虽避免数据集偏差,但低分辨率观测的模糊性仍导致高频细节不一致。 Method: BATDiff采用非下采样A trous小波变换构建多尺度表示,并设计双变量跨尺度模块建模相邻尺度间的父子依赖关系,为扩散过程提供结构化跨尺度引导。 Result: 在标准基准上,BATDiff相比现有扩散与非扩散方法生成更锐利、结构更一致的重建结果,在保真度和感知质量上均有提升。 Conclusion: 将小波多尺度先验引入扩散模型可有效增强高频一致性,BATDiff为无监督超分辨率提供了新范式。 Abstract: The effectiveness of super resolution (SR) models hinges on their ability to recover high frequency structure without introducing artifacts. Diffusion based approaches have recently advanced the state of the art in SR. However, most diffusion based SR pipelines operate purely in the spatial domain, which may yield high frequency details that are not well supported by the underlying low resolution evidence. On the other hand, unlike supervised SR models that may inject dataset specific textures, single image SR relies primarily on internal image statistics and can therefore be less prone to dataset-driven hallucinations; nevertheless, ambiguity in the LR observation can still lead to inconsistent high frequency details. To tackle this problem, we introduce BATDiff, an unsupervised Bivariate A trous Wavelet Diffusion model designed to provide structured cross scale guidance during the generative process. BATDiff employs an a Trous wavelet transform that constructs an undecimated multiscale representation in which high frequency components are progressively revealed while the full spatial resolution is preserved. As the core inference mechanism, BATDiff includes a bivariate cross scale module that models parent child dependencies between adjacent scales. It improves high frequency coherence and reduces mismatch artifacts in diffusion based SR. Experiments on standard benchmarks demonstrate that BATDiff produces sharper and more structurally consistent reconstructions than existing diffusion and non diffusion baselines, achieving improvements in fidelity and perceptual quality.[180] HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing
Tencent HY Team
Main category: cs.CV
TL;DR: 本文提出HY-WU(Weight Unleashing)框架,通过在推理时根据实例条件动态生成权重更新,实现无需测试时优化的实例级个性化,避免传统持续学习中因共享权重覆盖导致的性能退化。
Details
Motivation: 现实部署中模型需应对领域漂移、用户偏好变化和新任务涌现,但现有适应方法依赖静态权重,难以兼顾多样化目标,易引发干扰或退化。 Method: 提出HY-WU——一种以记忆为先的适应框架,将功能级记忆建模为神经模块(权重更新生成器),在推理时依据实例条件动态合成权重更新,生成实例特定算子。 Result: HY-WU在持续学习与个性化场景中避免了重复覆盖共享权重带来的行为退化,提升了模型在异构动态环境下的适应性与鲁棒性。 Conclusion: 将适应压力从覆盖单点参数转向动态生成实例化权重更新,是提升基础模型长期部署能力的关键范式转变。 Abstract: Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.[181] FabricGen: Microstructure-Aware Woven Fabric Generation
Yingjie Tang,Di Luo,Zixiong Wang,Xiaoli Ling,jian Yang,Beibei Wang
Main category: cs.CV
TL;DR: 本文提出FabricGen框架,通过分解宏观纹理与微观编织结构,结合微调扩散模型和增强型程序化几何模型(由WeavingLLM驱动),实现从文本描述生成高真实感织物材质。
Details
Motivation: 现有扩散模型难以生成符合编织规则的精细纱线级细节,且织物设计流程复杂、依赖专业知识。 Method: 1)收集无微观结构织物数据集并微调扩散模型生成宏观纹理;2)构建增强型程序化几何模型生成含纱线滑移与飞丝的微观结构;3)基于标注编织图数据微调WeavingLLM,并结合领域提示调优,使其能从文本生成编织图与参数。 Result: FabricGen能生成兼具丰富细节与物理合理性的织物材质,在渲染中显著优于先前生成模型。 Conclusion: 分解宏观/微观建模、融合生成式AI与程序化建模、以及领域大模型驱动,是提升织物材质生成真实感与可控性的有效范式。 Abstract: Woven fabric materials are widely used in rendering applications, yet designing realistic examples typically involves multiple stages, requiring expertise in weaving principles and texture authoring. Recent advances have explored diffusion models to streamline this process; however, pre-trained diffusion models often struggle to generate intricate yarn-level details that conform to weaving rules. To address this, we present FabricGen, an end-to-end framework for generating high-quality woven fabric materials from textual descriptions. A key insight of our method is the decomposition of macro-scale textures and micro-scale weaving patterns. To generate macro-scale textures free from microstructures, we fine-tune pre-trained diffusion models on a collected dataset of microstructure-free fabrics. As for micro-scale weaving patterns, we develop an enhanced procedural geometric model capable of synthesizing natural yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by a specialized large language model, WeavingLLM, which is fine-tuned on an annotated dataset of formatted weaving drafts, and prompt-tuned with domain-specific fabric expertise. Through fine-tuning and prompt tuning, WeavingLLM learns to design weaving drafts and fabric parameters from textual prompts, enabling the procedural model to produce diverse weaving patterns that stick to weaving principles. The generated macro-scale texture, along with the micro-scale geometry, can be used for fabric rendering. Consequently, our framework produces materials with significantly richer detail and realism compared to prior generative models.[182] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation
Xin-Sheng Chen,Jiayu Zhu,Pei-lin Li,Hanzheng Wang,Shuojin Yang,Meng-Hao Guo
Main category: cs.CV
TL;DR: 本文提出PresentBench,一个细粒度、基于评分标准的幻灯片自动生成评估基准,包含238个实例及平均每个实例54.1个二元检查项,显著提升评估可靠性与人类偏好一致性,并发现NotebookLM性能最优。
Details
Motivation: 现有幻灯片生成评估方法过于粗粒度、缺乏可验证的细粒度标准,制约研究进展与实际应用。 Method: 构建PresentBench基准,含238个带背景材料的评估实例,为每个实例人工设计约54.1个二元检查项,实现细粒度、实例级评估。 Result: 实验表明PresentBench比现有方法更可靠、更符合人类偏好;并揭示NotebookLM显著优于其他幻灯片生成方法。 Conclusion: PresentBench有效解决了幻灯片生成评估的瓶颈问题,推动该领域向更严谨、可复现的方向发展。 Abstract: Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment. In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks. Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.[183] LEPA: Learning Geometric Equivariance in Satellite Remote Sensing Data with a Predictive Architecture
Erik Scheurer,Rocco Sedona,Stefan Kesselheim,Gabriele Cavallaro
Main category: cs.CV
TL;DR: 本文提出LEPA方法,通过几何增强条件预测变换嵌入,解决地学基础模型中预计算嵌入与用户兴趣区之间的几何不匹配问题,显著提升几何调整准确性。
Details
Motivation: 地球观测应用中存在用户定义兴趣区与固定预计算嵌入网格之间的几何不匹配问题,而标准潜在空间插值因嵌入流形高度非凸而不可靠。 Method: 提出学习等变性预测架构(LEPA),不采用向量平均,而是基于几何增强条件直接预测变换后的嵌入。 Result: 在HLS影像和ImageNet-1k上实验表明,标准插值的平均倒数排名(MRR)低于0.2,而LEPA将MRR提升至0.8以上。 Conclusion: LEPA能有效实现无需重新编码的高精度几何调整,克服了传统插值方法在非凸嵌入流形上的局限性。 Abstract: Geospatial foundation models provide precomputed embeddings that serve as compact feature vectors for large-scale satellite remote sensing data. While these embeddings can reduce data-transfer bottlenecks and computational costs, Earth observation (EO) applications can still face geometric mismatches between user-defined areas of interest and the fixed precomputed embedding grid. Standard latent-space interpolation is unreliable in this setting because the embedding manifold is highly non-convex, yielding representations that do not correspond to realistic inputs. We verify this using Prithvi-EO-2.0 to understand the shortcomings of interpolation applied to patch embeddings. As a substitute, we propose a Learned Equivariance-Predicting Architecture (LEPA). Instead of averaging vectors, LEPA conditions a predictor on geometric augmentations to directly predict the transformed embedding. We evaluate LEPA on NASA/USGS Harmonized Landsat-Sentinel (HLS) imagery and ImageNet-1k. Experiments show that standard interpolation achieves a mean reciprocal rank (MRR) below 0.2, whereas LEPA increases MRR to over 0.8, enabling accurate geometric adjustment without re-encoding.[184] Variational Flow Maps: Make Some Noise for One-Step Conditional Generation
Abbas Mammadov,So Takao,Bohan Chen,Ricardo Baptista,Morteza Mardani,Yee Whye Teh,Julius Berner
Main category: cs.CV
TL;DR: 本文提出变分流映射(VFM)框架,通过学习合适的初始噪声而非引导采样路径来实现条件生成,显著提升单步/少步条件采样质量与效率。
Details
Motivation: 流映射虽能单步高质量图像生成,但缺乏显式采样轨迹,难以融入外部约束以支持条件生成和求解逆问题。 Method: 提出变分流映射(VFM),引入噪声适配器模型,将观测映射为适配的初始噪声分布;联合优化噪声适配器与流映射,最大化变分下界以提升噪声-数据对齐。 Result: 在多种逆问题任务中,VFM可单步或少数步生成校准良好的条件样本;在ImageNet上达到与迭代扩散/流模型相当的保真度,采样速度提升数量级。 Conclusion: VFM将条件生成范式从‘引导采样路径’转向‘学习合适初始噪声’,为高效、可控的单步生成提供了新思路。 Abstract: Flow maps enable high-quality image generation in a single forward pass. However, unlike iterative diffusion models, their lack of an explicit sampling trajectory impedes incorporating external constraints for conditional generation and solving inverse problems. We put forth Variational Flow Maps, a framework for conditional sampling that shifts the perspective of conditioning from "guiding a sampling path", to that of "learning the proper initial noise". Specifically, given an observation, we seek to learn a noise adapter model that outputs a noise distribution, so that after mapping to the data space via flow map, the samples respect the observation and data prior. To this end, we develop a principled variational objective that jointly trains the noise adapter and the flow map, improving noise-data alignment, such that sampling from complex data posterior is achieved with a simple adapter. Experiments on various inverse problems show that VFMs produce well-calibrated conditional samples in a single (or few) steps. For ImageNet, VFM attains competitive fidelity while accelerating the sampling by orders of magnitude compared to alternative iterative diffusion/flow models. Code is available at https://github.com/abbasmammadov/VFM[185] Virtual Try-On for Cultural Clothing: A Benchmarking Study
Muhammad Tausif Ul Islam,Shahir Awlad,Sameen Yeaser Adib,Md. Atiqur Rahman,Sabbir Ahmed,Md. Hasanul Kabir
Main category: cs.CV
TL;DR: 本文提出了BD-VITON数据集,专注于孟加拉国传统服饰(如纱丽、旁遮普服、莎尔瓦卡米兹),涵盖男女两类,并针对其复杂结构挑战重新训练和评估了多个虚拟试衣模型,显著提升了性能。
Details
Motivation: 现有虚拟试衣基准主要基于西方风格服装和女性模特,缺乏文化多样性,难以泛化到如孟加拉国传统服饰等具有复杂结构(如复杂垂坠、非对称分层、高形变)的服装类型。 Method: 构建BD-VITON数据集,包含孟加拉国代表性服饰及男女模特;在该数据集上重新训练并评估StableViton、HR-VITON和VITON-HD等主流虚拟试衣模型。 Result: 相比零样本推理,所有重训练模型在定量与定性指标上均取得一致提升,验证了数据集的有效性和模型适配性。 Conclusion: BD-VITON填补了文化多样性虚拟试衣数据集的空白,为复杂结构服饰建模提供了新基准,并证明了针对性训练对提升泛化能力的关键作用。 Abstract: Although existing virtual try-on systems have made significant progress with the advent of diffusion models, the current benchmarks of these models are based on datasets that are dominant in western-style clothing and female models, limiting their ability to generalize culturally diverse clothing styles. In this work, we introduce BD-VITON, a virtual try-on dataset focused on Bangladeshi garments, including saree, panjabi and salwar kameez, covering both male and female categories as well. These garments present unique structural challenges such as complex draping, asymmetric layering, and high deformation complexities which are underrepresented in the original VITON dataset. To establish strong baselines, we retrain and evaluate try-on models, namely StableViton, HR-VITON, and VITON-HD on our dataset. Our experiments demonstrate consistent improvements in terms of both quantitative and qualitative analysis, compared to zero shot inference.[186] AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
Jihyoung Jang,Hyounghun Kim
Main category: cs.CV
TL;DR: 本文提出AQuA数据集,系统分类视觉问答中的四类模糊性,并训练视觉语言模型根据模糊类型自适应选择响应策略(如直接回答、推断意图、列出多种可能或请求澄清),显著提升模型应对现实模糊场景的能力。
Details
Motivation: 现有VQA基准多基于清晰无歧义的图像-问题对,而真实场景常含不同程度歧义,亟需系统化歧义分类与策略感知的响应能力。 Method: 构建细粒度歧义VQA数据集AQuA,将歧义分为四级并标注最优响应策略;在AQuA上微调VLMs,使其能自适应选择多种响应策略。 Result: 微调后的VLMs能在歧义识别、不确定性管理及策略适配方面显著优于开源与闭源基线模型。 Conclusion: AQuA推动VQA从追求确定性答案转向支持策略感知的鲁棒推理,为VLMs在真实模糊场景中的可信交互奠定基础。 Abstract: Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.[187] MAviS: A Multimodal Conversational Assistant For Avian Species
Yevheniia Kryklyvets,Mohammed Irfan Kurpath,Sahal Shaji Mullappilly,Jinxing Zhou,Fahad Shabzan Khan,Rao Anwer,Salman Khan,Hisham Cholakkal
Main category: cs.CV
TL;DR: 本文提出MAviS-Dataset(大规模鸟类多模态数据集)、MAviS-Chat(支持音视频文的多模态大模型)和MAviS-Bench(评估基准),显著提升鸟类细粒度识别与跨模态问答能力,推动生态监测领域专用多模态大模型发展。
Details
Motivation: 现有多模态大语言模型在鸟类等专业生物领域表现不足,难以提供准确、上下文相关的细粒度信息,亟需面向生物多样性保护的领域自适应模型。 Method: 构建包含图像、音频、文本三模态的MAviS-Dataset(覆盖1000+鸟种);基于该数据集指令微调开发多模态大模型MAviS-Chat;设计含25000+问答对的MAviS-Bench基准用于量化评估。 Result: MAviS-Chat在MAviS-Bench上大幅超越MiniCPM-o-2.6基线,取得当前开源模型最优性能,验证了指令微调数据集的有效性。 Conclusion: 领域适配的多模态大模型对生态应用至关重要,MAviS系列工作为生物多样性智能监测提供了新范式与基础资源。 Abstract: Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.[188] Generalization in Online Reinforcement Learning for Mobile Agents
Li Gu,Zihuan Jiang,Zhixiang Chi,Huan Liu,Ziqiang Wang,Yuanhao Yu,Glen Berseth,Yang Wang
Main category: cs.CV
TL;DR: 本文提出AndroidWorld-Generalization基准和基于GRPO的RL训练系统,用于评估和提升GUI移动代理在未见任务、模板和应用上的零样本泛化能力,并开源了完整系统。
Details
Motivation: 现有GUI移动代理研究主要关注性能,而泛化能力因缺乏标准化基准和开源RL系统而被忽视。 Method: 将问题形式化为上下文马尔可夫决策过程(CMDP),构建AndroidWorld-Generalization基准(含三种递进泛化场景),并设计集成Group Relative Policy Optimization(GRPO)与容器化异步 rollout 收集及错误恢复机制的RL训练系统。 Result: RL训练使7B参数VLM代理在未见任务实例上比监督微调提升26.1%,但在未见模板和应用上仅提升15.7%和8.3%;测试时少样本适配可改善未见应用性能。 Conclusion: RL有助于提升移动GUI代理的零样本泛化,尤其对未见任务实例效果显著,但跨模板和跨应用泛化仍具挑战;开源系统支持可复现研究与公平比较。 Abstract: Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language-model(VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce \textbf{AndroidWorld-Generalization}, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure and asynchronous execution % , and error recovery to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1\% improvement on unseen instances but only limited gains on unseen templates (15.7\%) and apps (8.3\%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test-time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure \footnote{https://github.com/zihuanjiang/AndroidWorld-Generalization}.[189] Training for Trustworthy Saliency Maps: Adversarial Training Meets Feature-Map Smoothing
Dipkamal Bhusal,Md Tanvirul Alam,Nidhi Rastogi
Main category: cs.CV
TL;DR: 本文提出了一种结合对抗训练与特征图平滑的方法,以提升梯度类归因图(如Vanilla Gradient、Integrated Gradients)的稀疏性与输入/输出侧稳定性,并通过实验与人类评估验证其有效性。
Details
Motivation: 现有梯度类可解释性方法(如VG、IG)生成的显著图常存在噪声大、不稳定的问题,尤其在高风险场景中受限;以往工作多聚焦于改进归因算法本身,而忽视了模型训练过程对解释质量的影响。 Method: 基于曲率分析建立归因稳定性与输入梯度场局部平滑性的联系;发现对抗训练虽提升输入侧稳定性、增强稀疏性,但损害输出侧稳定性;为此,在中间层引入可微高斯滤波模块,轻量级地平滑特征图,与对抗训练联合优化。 Result: 在FMNIST、CIFAR-10和ImageNette上,该方法在保持对抗训练带来稀疏性优势的同时,同步提升了输入侧与输出侧稳定性;65人的人类研究表明,平滑后的对抗显著图被感知为更充分、更可信。 Conclusion: 解释质量高度依赖于模型训练方式;简单地将特征图平滑与鲁棒训练结合,是实现稀疏且稳定显著图的一种实用有效路径。 Abstract: Gradient-based saliency methods such as Vanilla Gradient (VG) and Integrated Gradients (IG) are widely used to explain image classifiers, yet the resulting maps are often noisy and unstable, limiting their usefulness in high-stakes settings. Most prior work improves explanations by modifying the attribution algorithm, leaving open how the training procedure shapes explanation quality. We take a training-centered view and first provide a curvature-based analysis linking attribution stability to how smoothly the input-gradient field varies locally. Guided by this connection, we study adversarial training and identify a consistent trade-off: it yields sparser and more input-stable saliency maps, but can degrade output-side stability, causing explanations to change even when predictions remain unchanged and logits vary only slightly. To mitigate this, we propose augmenting adversarial training with a lightweight feature-map smoothing block that applies a differentiable Gaussian filter in an intermediate layer. Across FMNIST, CIFAR-10, and ImageNette, our method preserves the sparsity benefits of adversarial training while improving both input-side stability and output-side stability. A human study with 65 participants further shows that smoothed adversarial saliency maps are perceived as more sufficient and trustworthy. Overall, our results demonstrate that explanation quality is critically shaped by training, and that simple smoothing with robust training provides a practical path toward saliency maps that are both sparse and stable.[190] Image Generation Models: A Technical History
Rouzbeh Shirvani
Main category: cs.CV
TL;DR: This paper provides a comprehensive survey of image generation models, covering VAEs, GANs, normalizing flows, autoregressive and transformer-based models, and diffusion methods, with technical details, limitations, video generation advances, and ethical considerations.
Details
Motivation: The literature on image generation is fragmented across models and domains; this paper aims to unify and systematically review breakthroughs and challenges. Method: A comprehensive survey approach, including technical walkthroughs of model objectives, architectures, training algorithms, optimization techniques, failure modes, and extensions to video generation and responsible deployment. Result: A unified, in-depth overview of major image generation paradigms, their strengths/weaknesses, recent video generation progress, and critical discussion on robustness, deepfakes, detection, artifacts, and watermarking. Conclusion: Image generation has evolved significantly, but challenges remain in model reliability, generalization, and ethical deployment—highlighting the need for continued research in both technical advancement and responsible AI practices. Abstract: Image generation has advanced rapidly over the past decade, yet the literature seems fragmented across different models and application domains. This paper aims to offer a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. We provide a detailed technical walkthrough of each model type, including their underlying objectives, architectural building blocks, and algorithmic training steps. For each model type, we present the optimization techniques as well as common failure modes and limitations. We also go over recent developments in video generation and present the research works that made it possible to go from still frames to high quality videos. Lastly, we cover the growing importance of robustness and responsible deployment of these models, including deepfake risks, detection, artifacts, and watermarking.[191] StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models
Duy M. H. Nguyen,Tuan A. Tran,Duong Nguyen,Siwei Xie,Trung Q. Nguyen,Mai T. N. Truong,Daniel Palenicek,An T. Le,Michael Barz,TrungTin Nguyen,Tuan Dam,Ngan Le,Minh Vu,Khoa Doan,Vien Ngo,Pengtao Xie,James Zou,Daniel Sonntag,Jan Peters,Mathias Niepert
Main category: cs.CV
TL;DR: 本文提出StructSAM,一种专为Segment Anything Model(SAM)设计的分辨率保持型token合并-恢复框架,通过梯度驱动的能量评分、网格化平坦性筛选和显式token恢复,在不重训练的前提下显著降低计算量,同时较好地保持分割精度。
Details
Motivation: 现有token合并方法直接应用于SAM存在困难:SAM图像编码器混合了窗口注意力与全局注意力,其掩码解码器依赖密集且提示条件化的特征进行精确边界预测;已有方法在高合并率下易破坏边界并泄露提示信息。 Method: 提出StructSAM框架:1)基于一阶特征梯度计算轻量级token能量得分;2)采用网格化平坦性筛选机制保护边界与提示区域;3)仅在平坦区域内向低能量目标合并token,并支持显式token恢复;4)从谱图粗化视角分析其拉普拉斯谱失真有界性。 Result: 在8个自然与医学分割基准上,StructSAM将编码器FLOPs降低25–30%(提示感知合并可达40%+),mIoU/Dice仅轻微下降,且在相同计算量下持续优于ToMe、PiToMe、ToMeSD、VidToMe和ALGM。 Conclusion: StructSAM是一种针对SAM架构特性的高效、鲁棒token压缩方案,兼顾计算效率与结构保真度,为ViT类模型在分割任务中的轻量化提供了新思路。 Abstract: Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose \textbf{StructSAM}, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30\% (up to 40\%+ with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.[192] 3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models
Shaoxiong Zhan,Yanlin Lai,Zheng Liu,Hai Lin,Shen Li,Xiaodong Cai,Zijian Lin,Wen Huang,Hai-Tao Zheng
Main category: cs.CV
TL;DR: 本文揭示了当前视觉语言模型在空间推理任务(如方块计数)上的显著缺陷,即‘空间智能鸿沟’,并提出3ViewSense框架,利用正交视图和‘模拟-推理’机制来提升模型构建一致3D心理表征的能力,显著提升了空间推理性能与描述稳定性。
Details
Motivation: 现有视觉语言模型虽在逻辑推理上表现优异,却在基础空间任务上表现不佳,暴露出缺乏一致视角的空间接口这一核心问题。 Method: 提出3ViewSense框架,基于工程认知引入正交视图作为空间推理基础,设计‘模拟-推理’机制,将复杂场景分解为标准正交投影,通过协调自我中心感知与外部中心参考实现显式心理旋转与重建。 Result: 在多个空间推理基准测试中显著超越现有基线,尤其在遮挡严重下的计数与视角一致的空间推理任务中效果稳定提升,并增强了空间描述的一致性与鲁棒性。 Conclusion: 缺失的不是视觉特征或推理能力,而是统一、一致的空间表征接口;3ViewSense为多模态系统构建更强空间智能提供了可扩展路径。 Abstract: Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical ``spatial intelligence gap,'' where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce \textbf{3ViewSense}, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a ``Simulate-and-Reason'' mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.[193] Faster-HEAL: An Efficient and Privacy-Preserving Collaborative Perception Framework for Heterogeneous Autonomous Vehicles
Armin Maleki,Hayder Radha
Main category: cs.CV
TL;DR: 本文提出Faster-HEAL,一种轻量级、隐私保护的协同感知框架,通过低秩视觉提示微调和金字塔融合对齐异构特征,显著提升检测性能并降低计算开销。
Details
Motivation: 现有协同感知方法多假设智能体同质,难以应对真实场景中传感器与模型异构带来的特征域差距问题,且已有解决方案存在计算昂贵、隐私泄露和单智能体精度下降等缺陷。 Method: 提出Faster-HEAL框架,采用低秩视觉提示进行微调以对齐异构特征至统一空间,并结合金字塔融合实现鲁棒特征聚合;仅需训练极少量参数(减少94%),无需重训大模型。 Result: 在OPV2V-H数据集上,Faster-HEAL相较SOTA方法检测性能提升2%,同时显著降低计算开销。 Conclusion: Faster-HEAL为可扩展的异构协同感知提供了高效、轻量且隐私友好的实用解决方案。 Abstract: Collaborative perception (CP) is a promising paradigm for improving situational awareness in autonomous vehicles by overcoming the limitations of single-agent perception. However, most existing approaches assume homogeneous agents, which restricts their applicability in real-world scenarios where vehicles use diverse sensors and perception models. This heterogeneity introduces a feature domain gap that degrades detection performance. Prior works address this issue by retraining entire models/major components, or using feature interpreters for each new agent type, which is computationally expensive, compromises privacy, and may reduce single-agent accuracy. We propose Faster-HEAL, a lightweight and privacy-preserving CP framework that fine-tunes a low-rank visual prompt to align heterogeneous features with a unified feature space while leveraging pyramid fusion for robust feature aggregation. This approach reduces the trainable parameters by 94%, enabling efficient adaptation to new agents without retraining large models. Experiments on the OPV2V-H dataset show that Faster-HEAL improves detection performance by 2% over state-of-the-art methods with significantly lower computational overhead, offering a practical solution for scalable heterogeneous CP.[194] Can Vision-Language Models Solve the Shell Game?
Tiedong Liu,Wee Sun Lee
Main category: cs.CV
TL;DR: 本文提出VET-Bench基准,揭示当前VLMs在视觉实体跟踪任务中的根本缺陷,并提出SGCoT方法显著提升性能。
Details
Motivation: 现有视频基准存在视觉捷径,掩盖了VLMs在视觉实体跟踪能力上的不足;需构建能真正检验时序实体跟踪能力的诊断性基准。 Method: 构建合成基准VET-Bench(含外观相同但需依赖时空连续性跟踪的对象);理论分析指出固定深度Transformer架构在无中间监督下难以表达不可区分对象的跟踪;提出Spatiotemporal Grounded Chain-of-Thought(SGCoT),通过文本合成数据微调Molmo2以生成显式轨迹作为中间状态。 Result: 当前SOTA VLMs在VET-Bench上表现接近随机水平;SGCoT方法在该基准上准确率超90%,达SOTA;首次实现端到端、无需外部工具的视频‘壳游戏’任务求解。 Conclusion: VLMs普遍存在对静态帧特征的过度依赖和时序实体表征能力缺失;引入显式时空推理链(SGCoT)可有效弥补该缺陷,为构建具时序一致性的VLM提供新路径。 Abstract: Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io .[195] A Lightweight Digital-Twin-Based Framework for Edge-Assisted Vehicle Tracking and Collision Prediction
Murat Arda Onsu,Poonam Lohan,Burak Kantarci,Aisha Syed,Matthew Andrews,Sean Kennedy
Main category: cs.CV
TL;DR: 本文提出了一种轻量级数字孪生框架,仅依赖目标检测(YOLO)实现车辆跟踪与时空碰撞预测,无需复杂轨迹预测网络,在仿真城市交通环境中实现88%碰撞提前预警率,适合边缘设备实时部署。
Details
Motivation: 现有车辆跟踪与碰撞预测方法计算开销大,难以在资源受限的边缘设备上实际部署;亟需一种轻量、高效、可实时运行的解决方案。 Method: 基于QLabs高保真数字孪生环境,使用YOLO检测器提取车辆质心轨迹;离线构建道路路径图并用K-D树索引;在线通过路径索引演化估计车速与方向,结合空间邻近性与时间重叠性预测碰撞。 Result: 在多种仿真城市场景中,该框架能提前预测约88%的碰撞事件,且计算开销低,满足边缘部署要求。 Conclusion: 该轻量级数字孪生方案验证了不依赖复杂预测模型即可实现高精度、低延迟的车辆跟踪与碰撞预警,为边缘智能交通系统提供了可行技术路径。 Abstract: Vehicle tracking, motion estimation, and collision prediction are fundamental components of traffic safety and management in Intelligent Transportation Systems (ITS). Many recent approaches rely on computationally intensive prediction models, which limits their practical deployment on resource-constrained edge devices. This paper presents a lightweight digital-twin-based framework for vehicle tracking and spatiotemporal collision prediction that relies solely on object detection, without requiring complex trajectory prediction networks. The framework is implemented and evaluated in Quanser Interactive Labs (QLabs), a high-fidelity digital twin of an urban traffic environment that enables controlled and repeatable scenario generation. A YOLO-based detector is deployed on simulated edge cameras to localize vehicles and extract frame-level centroid trajectories. Offline path maps are constructed from multiple traversals and indexed using K-D trees to support efficient online association between detected vehicles and road segments. During runtime, consistent vehicle identifiers are maintained, vehicle speed and direction are estimated from the temporal evolution of path indices, and future positions are predicted accordingly. Potential collisions are identified by analyzing both spatial proximity and temporal overlap of predicted future trajectories. Our experimental results across diverse simulated urban scenarios show that the proposed framework predicts approximately 88% of collision events prior to occurrence while maintaining low computational overhead suitable for edge deployment. Rather than introducing a computationally intensive prediction model, this work introduces a lightweight digital-twin-based solution for vehicle tracking and collision prediction, tailored for real-time edge deployment in ITS.[196] AgrI Challenge: A Data-Centric AI Competition for Cross-Team Validation in Agricultural Vision
Mohammed Brahimi,Karim Laabassi,Mohamed Seghir Hadj Ameur,Aicha Boutorh,Badia Siab-Farsi,Amin Khouani,Omar Farouk Zouak,Seif Eddine Bouziane,Kheira Lakhdari,Abdelkader Nabil Benghanem
Main category: cs.CV
TL;DR: 本文提出AgrI挑战赛框架,强调数据采集对农业视觉模型泛化能力的影响,引入跨团队验证(CTV)评估范式,并通过实验证明多源协同训练显著提升模型鲁棒性。
Details
Motivation: 现有农业视觉模型在标准数据集上表现良好,但在真实田间环境因分布偏移而泛化差;竞赛多关注模型设计,忽视数据采集实践对泛化的关键作用。 Method: 提出数据-centric的AgrI挑战赛框架,多个团队独立采集田间数据构建异构多源基准;设计Cross-Team Validation(CTV)评估范式,包含TOTO和LOTO两种协议。 Result: 单源训练下存在高达16.20%的验证-测试准确率差距;多源协同训练将差距分别降至2.82%(DenseNet121)和1.78%(Swin Transformer);发布含50673张图像、覆盖6种树种、12个团队采集的公开数据集。 Conclusion: 数据采集多样性对农业视觉模型泛化至关重要;跨团队协作式数据共建与训练是提升现实场景鲁棒性的有效路径;该工作推动农业AI从模型中心转向数据中心范式。 Abstract: Machine learning models in agricultural vision often achieve high accuracy on curated datasets but fail to generalize under real field conditions due to distribution shifts between training and deployment environments. Moreover, most machine learning competitions focus primarily on model design while treating datasets as fixed resources, leaving the role of data collection practices in model generalization largely unexplored. We introduce the AgrI Challenge, a data-centric competition framework in which multiple teams independently collect field datasets, producing a heterogeneous multi-source benchmark that reflects realistic variability in acquisition conditions. To systematically evaluate cross-domain generalization across independently collected datasets, we propose Cross-Team Validation (CTV), an evaluation paradigm that treats each team's dataset as a distinct domain. CTV includes two complementary protocols: Train-on-One-Team-Only (TOTO), which measures single-source generalization, and Leave-One-Team-Out (LOTO), which evaluates collaborative multi-source training. Experiments reveal substantial generalization gaps under single-source training: models achieve near-perfect validation accuracy yet exhibit validation-test gaps of up to 16.20% (DenseNet121) and 11.37% (Swin Transformer) when evaluated on datasets collected by other teams. In contrast, collaborative multi-source training dramatically improves robustness, reducing the gap to 2.82% and 1.78%, respectively. The challenge also produced a publicly available dataset of 50,673 field images of six tree species collected by twelve independent teams, providing a diverse benchmark for studying domain shift and data-centric learning in agricultural vision.[197] Interpretable Aneurysm Classification via 3D Concept Bottleneck Models: Integrating Morphological and Hemodynamic Clinical Features
Toqa Khaled,Ahmad Al-Kabbany
Main category: cs.CV
TL;DR: 本文提出了一种端到端的3D概念瓶颈模型(CBM),用于颅内动脉瘤的可解释分类,结合形态学与血流动力学临床概念,在保持93.33%高准确率的同时实现临床可解释性。
Details
Motivation: 传统黑箱深度学习模型虽精度高,但缺乏可解释性,阻碍其在临床和监管层面的应用;而医学AI需符合神经外科原则,故亟需内在可解释的方法。 Method: 构建基于3D ResNet-34和3D DenseNet-121的端到端3D概念瓶颈框架,将CTA影像映射至离散的形态与血流动力学临床概念;采用联合损失函数(诊断focal loss + 概念MSE)优化,并结合8次测试时增强(TTA)与五折分层交叉验证。 Result: ResNet-34架构达最高分类准确率93.33%±4.5%,DenseNet-121为91.43%±5.8%;TTA后稳定均值准确率达88.31%,精度-泛化差距小于0.04。 Conclusion: 该3D CBM框架成功兼顾高预测性能与临床可解释性,为可信赖的医学AI提供了可行路径。 Abstract: We are concerned with the challenge of reliably classifying and assessing intracranial aneurysms using deep learning without compromising clinical transparency. While traditional black-box models achieve high predictive accuracy, their lack of inherent interpretability remains a significant barrier to clinical adoption and regulatory approval. Explainability is paramount in medical modeling to ensure that AI-driven diagnoses align with established neurosurgical principles. Unlike traditional eXplainable AI (XAI) methods -- such as saliency maps, which often provide post-hoc, non-causal visual correlations -- Concept Bottleneck Models (CBMs) offer a robust alternative by constraining the model's internal logic to human-understandable clinical indices. In this article, we propose an end-to-end 3D Concept Bottleneck framework that maps high-dimensional neuroimaging features to a discrete set of morphological and hemodynamic concepts for aneurysm identification. We implemented this pipeline using a pre-trained 3D ResNet-34 backbone and a 3D DenseNet-121 to extract features from CTA volumes, which were subsequently processed through a soft bottleneck layer representing human-interpretable clinical concepts. The model was optimized using a joint-loss function to balance diagnostic focal loss and concept mean squared error (MSE), validated via stratified five-fold cross-validation. Our results demonstrate a peak task classification accuracy of 93.33% +/- 4.5% for the ResNet-34 architecture and 91.43% +/- 5.8% for the DenseNet-121 model. Furthermore, the implementation of 8-pass Test-Time Augmentation (TTA) yielded a robust mean accuracy of 88.31%, ensuring diagnostic stability during inference. By maintaining an accuracy-generalization gap of less than 0.04, this framework proves that high predictive performance can be achieved without sacrificing interpretability.[198] VIVECaption: A Split Approach to Caption Quality Improvement
Varun Ananth,Baqiao Liu,Haoran Cai
Main category: cs.CV
TL;DR: 本文提出VIVECaption方法,通过构建高质量标注数据集和模型对齐策略(上下文对齐与参数微调)来提升图像-文本对齐质量,解决视觉语言模型生成字幕中存在的幻觉、组合推理差等问题。
Details
Motivation: 视觉语言模型生成的字幕存在幻觉、组合推理能力弱及细粒度理解不足等问题,导致图文对不匹配,影响下游生成模型性能。 Method: 提出两方面改进:1)基于分层抽样的高质量黄金标准数据集构建方法;2)包含上下文对齐与监督微调(SFT)的模型对齐策略,并聚焦结构化字幕格式。 Result: 在开源模型上验证了该方法的有效性,尤其使用微调后的角色检测模型显著提升了整体图文对齐质量。 Conclusion: VIVECaption为构建高质量、无需依赖网络爬取内容的‘纯素’训练数据提供了实用方案,适用于企业级AI开发中图文对齐优化需求。 Abstract: Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While visual language models (VLMs) are commonly deployed to generate captions from visual data, they suffer from hallucinations, poor compositional reasoning, and limited fine-grained understanding, resulting in misaligned image-caption pairs that degrade downstream model performance. This technical report introduces VIVECaption, a systematic two-sided approach to caption quality improvement. We first establish a comprehensive taxonomy of caption evaluation metrics, distinguishing between "universal" and "instance-grounded" metrics, with the ultimate goal of showcasing the use-cases and tradeoffs between different caption quality metrics. We then use this language to describe our two-sided approach to caption quality improvement: (1) a gold-standard dataset creation methodology using stratified sampling and (2) a model alignment strategy encompassing context alignment and parameter-level finetuning using SFT. We demonstrate our methodology on open-source models, focusing on structured caption formats that enable better parsing and downstream utilization. We ultimately show that using a finetuned character detection model in an image captioning pipeline significantly improves holistic image-caption alignment quality. Our work addresses the growing need for high-quality "vegan" training data in enterprise AI development, providing practical solutions for teams seeking to improve caption-image alignment without relying on potentially copyright-protected web-scraped content.[199] Prompt-Based Caption Generation for Single-Tooth Dental Images Using Vision-Language Models
Anastasiia Sukhanova,Aiden Taylor,Julian Myers,Zichun Wang,Kartha Veerya Jammuladinne,Satya Sri Rajiteswari Nimmagadda,Aniruddha Maiti,Ananya Jana
Main category: cs.CV
TL;DR: 本文提出了一种利用视觉-语言模型(VLM)为单颗牙齿图像生成高质量、视觉锚定的描述性字幕的方法,旨在填补当前牙科图像字幕数据集在覆盖范围、细粒度(单牙级)和临床实用性方面的空白。
Details
Motivation: 现有牙科图像字幕数据集数量少、范围窄:多为全口前视图,后牙(如磨牙)不清晰;字幕仅针对单一疾病(如牙龈炎),缺乏对每颗牙的全面评估;且无单颗牙齿图像配字幕的数据集,难以支持具备整体牙齿知识的多任务分析模型训练。 Method: 探索使用视觉-语言模型(VLM)自动生成牙科图像字幕的可行性,重点设计并验证引导式提示(guided prompts)对提升字幕视觉相关性和临床意义的效果;选用RGB图像以增强消费级场景适用性。 Result: 实验证明引导式提示能显著提升VLM生成字幕的质量与视觉锚定能力,所生成的提示更准确地聚焦于牙科图像的视觉特征。 Conclusion: 基于VLM的引导式字幕生成是构建高质量单牙图像字幕数据集的可行路径,为开发具备整体牙齿认知能力的通用牙科AI模型奠定基础。 Abstract: Digital dentistry has made significant advances with the advent of deep learning. However, the majority of these deep learning-based dental image analysis models focus on very specific tasks such as tooth segmentation, tooth detection, cavity detection, and gingivitis classification. There is a lack of a specialized model that has holistic knowledge of teeth and can perform dental image analysis tasks based on that knowledge. Datasets of dental images with captions can help build such a model. To the best of our knowledge, existing dental image datasets with captions are few in number and limited in scope. In many of these datasets, the captions describe the entire mouth, while the images are limited to the anterior view. As a result, posterior teeth such as molars are not clearly visible, limiting the usefulness of the captions for training vision-language models. Additionally, the captions focus only on a specific disease (gingivitis) and do not provide a holistic assessment of each tooth. Moreover, tooth disease scores are typically assigned to individual teeth, and each tooth is treated as a separate entity in orthodontic procedures. Therefore, it is important to have captions for single-tooth images. As far as we know, no such dataset of single-tooth images with dental captions exists. In this work, we aim to bridge that gap by assessing the possibility of generating captions for dental images using Vision-Language Models (VLMs) and evaluating the extent and quality of those captions. Our findings suggest that guided prompts help VLMs generate meaningful captions. We show that the prompts generated by our framework are better anchored in describing the visual aspects of dental images. We selected RGB images as they have greater potential in consumer scenarios.[200] UnSCAR: Universal, Scalable, Controllable, and Adaptable Image Restoration
Debabrata Mandal,Soumitri Chattopadhyay,Yujie Wang,Marc Niethammer,Praneeth Chakravarthula
Main category: cs.CV
TL;DR: 本文提出了一种多分支混合专家架构的统一推理框架,解决通用图像恢复中多退化联合学习导致的任务干扰与遗忘问题,实现可扩展、可控且泛化能力强的单模型多退化恢复。
Details
Motivation: 现有通用图像恢复模型在面对多种真实世界退化时存在训练不稳定、模型过大、性能下降等问题,根本原因在于多退化联合学习中的任务干扰与灾难性遗忘。 Method: 提出基于多分支混合专家(MoE)的统一推理框架,将恢复知识分解至多个可自适应任务的专用专家,支持用户可控恢复与跨域泛化。 Result: 在超过十六种退化上实现可扩展训练,在已见与未见退化域均表现鲁棒,显著提升各基准性能。 Conclusion: 该工作确立了可扩展、可控的通用图像恢复新设计范式,突破了现有方法的规模与泛化瓶颈。 Abstract: Universal image restoration aims to recover clean images from arbitrary real-world degradations using a single inference model. Despite significant progress, existing all-in-one restoration networks do not scale to multiple degradations. As the number of degradations increases, training becomes unstable, models grow excessively large, and performance drops across both seen and unseen domains. In this work, we show that scaling universal restoration is fundamentally limited by interference across degradations during joint learning, leading to catastrophic task forgetting. To address this challenge, we introduce a unified inference pipeline with a multi-branch mixture-of-experts architecture that decomposes restoration knowledge across specialized task-adaptable experts. Our approach enables scalable learning (over sixteen degradations), adapts and generalizes robustly to unseen domains, and supports user-controllable restoration across degradations. Beyond achieving superior performance across benchmarks, this work establishes a new design paradigm for scalable and controllable universal image restoration.[201] QdaVPR: A novel query-based domain-agnostic model for visual place recognition
Shanshan Wan,Lai Kang,Yingmei Wei,Tianrui Shen,Haixuan Wang,Chao Zuo
Main category: cs.CV
TL;DR: 本文提出了一种基于查询的域无关视觉地点识别模型QdaVPR,通过双层对抗学习和查询组合三元组监督提升跨域泛化能力,并在多个具有显著域变化的VPR基准上达到SOTA性能。
Details
Motivation: 现有VPR模型在应对域变化(如季节、光照、天气)时存在不足:大规模训练缺乏显式域监督,而特定域适配泛化性差。 Method: 提出QdaVPR模型:1)设计双层对抗学习框架,增强查询特征及其源图像特征的域不变性;2)引入基于查询组合的三元组监督以提升全局描述符判别力;3)利用风格迁移扩充大规模VPR数据集,生成带域标签的合成域用于辅助监督。 Result: 在Nordland(季节变化)、Tokyo24/7(昼夜变化)和SVOX(多种天气)等多基准上取得SOTA结果,Recall@1/Recall@10分别达93.5%/98.6%、97.5%/99.0%,并在SVOX多数天气条件下Recall@1最高。 Conclusion: QdaVPR有效提升了VPR模型对未见域变化的鲁棒性和泛化能力,验证了查询驱动与域感知联合建模的有效性。 Abstract: Visual place recognition (VPR) aiming at predicting the location of an image based solely on its visual features is a fundamental task in robotics and autonomous systems. Domain variation remains one of the main challenges in VPR and is relatively unexplored. Existing VPR models attempt to achieve domain agnosticism either by training on large-scale datasets that inherently contain some domain variations, or by being specifically adapted to particular target domains. In practice, the former lacks explicit domain supervision, while the latter generalizes poorly to unseen domain shifts. This paper proposes a novel query-based domain-agnostic VPR model called QdaVPR. First, a dual-level adversarial learning framework is designed to encourage domain invariance for both the query features forming the global descriptor and the image features from which these query features are derived. Then, a triplet supervision based on query combinations is designed to enhance the discriminative power of the global descriptors. To support the learning process, we augment a large-scale VPR dataset using style transfer methods, generating various synthetic domains with corresponding domain labels as auxiliary supervision. Extensive experiments show that QdaVPR achieves state-of-the-art performance on multiple VPR benchmarks with significant domain variations. Specifically, it attains the best Recall@1 and Recall@10 on nearly all test scenarios: 93.5%/98.6% on Nordland (seasonal changes), 97.5%/99.0% on Tokyo24/7 (day-night transitions), and the highest Recall@1 across almost all weather conditions on the SVOX dataset. Our code will be released at https://github.com/shuimushan/QdaVPR.[202] Disentangled Textual Priors for Diffusion-based Image Super-Resolution
Lei Jiang,Xin Liu,Xinze Tong,Zhiliang Li,Jie Liu,Jie Tang,Gangshan Wu
Main category: cs.CV
TL;DR: 本文提出DTPSR,一种基于扩散模型的图像超分辨率方法,通过解耦文本先验(空间层次和频率语义)提升语义可控性与可解释性,并构建了大规模解耦文本数据集DisText-SR,结合多分支分类器自由引导策略,在感知质量、保真度与泛化性上取得优异性能。
Details
Motivation: 现有扩散式超分辨率方法依赖纠缠或粗粒度的语义先验,难以兼顾全局结构与局部细节、低频与高频信息,导致语义可控性与可解释性受限。 Method: 提出DTPSR框架,引入沿空间层次(全局/局部)和频率语义(低频/高频)两个维度解耦的文本先验;设计专用交叉注意力模块注入对应嵌入;构建DisText-SR数据集(约9.5万对图像-解耦文本);采用频率感知负提示的多分支无分类器引导策略。 Result: 在合成与真实世界基准测试中,DTPSR在感知质量、保真度及跨多种退化场景的泛化能力上均达到先进水平。 Conclusion: 解耦文本先验能有效提升扩散式超分辨率模型的语义可控性、可解释性与鲁棒性,为生成式图像增强提供了新范式。 Abstract: Image Super-Resolution (SR) aims to reconstruct high-resolution images from degraded low-resolution inputs. While diffusion-based SR methods offer powerful generative capabilities, their performance heavily depends on how semantic priors are structured and integrated into the generation process. Existing approaches often rely on entangled or coarse-grained priors that mix global layout with local details, or conflate structural and textural cues, thereby limiting semantic controllability and interpretability. In this work, we propose DTPSR, a novel diffusion-based SR framework that introduces disentangled textual priors along two complementary dimensions: spatial hierarchy (global vs. local) and frequency semantics (low- vs. high-frequency). By explicitly separating these priors, DTPSR enables the model to simultaneously capture scene-level structure and object-specific details with frequency-aware semantic guidance. The corresponding embeddings are injected via specialized cross-attention modules, forming a progressive generation pipeline that reflects the semantic granularity of visual content, from global layout to fine-grained textures. To support this paradigm, we construct DisText-SR, a large-scale dataset containing approximately 95,000 image-text pairs with carefully disentangled global, low-frequency, and high-frequency descriptions. To further enhance controllability and consistency, we adopt a multi-branch classifier-free guidance strategy with frequency-aware negative prompts to suppress hallucinations and semantic drift. Extensive experiments on synthetic and real-world benchmarks show that DTPSR achieves high perceptual quality, competitive fidelity, and strong generalization across diverse degradation scenarios.[203] RPG-SAM: Reliability-Weighted Prototypes and Geometric Adaptive Threshold Selection for Training-Free One-Shot Polyp Segmentation
Weikun Lin,Yunhao Bai,Yan Wang
Main category: cs.CV
TL;DR: 本文提出RPG-SAM框架,通过可靠性加权原型挖掘(RWPM)和几何自适应选择(GAS)解决支持图像区域异质性和查询响应异质性问题,并结合迭代优化提升分割精度,在Kvasir数据集上mIoU提升5.56%。
Details
Motivation: 现有训练免费单次分割方法忽略支持图像中的区域异质性和查询响应的响应异质性,导致性能受限。 Method: 提出RPG-SAM框架:1)Reliability-Weighted Prototype Mining (RWPM) 用于挖掘高保真支持特征并利用背景锚点抑制噪声;2)Geometric Adaptive Selection (GAS) 动态调整二值化阈值以评估形态学一致性;3)设计迭代细化循环优化解剖边界。 Result: 在Kvasir数据集上mIoU提升5.56%,显著优于现有方法。 Conclusion: 通过系统建模多层级异质性信息,RPG-SAM有效提升了无训练单次分割性能,为医学图像分割提供了新思路。 Abstract: Training-free one-shot segmentation offers a scalable alternative to expert annotations where knowledge is often transferred from support images and foundation models. But existing methods often treat all pixels in support images and query response intensities models in a homogeneous way. They ignore the regional heterogeity in support images and response heterogeity in query.To resolve this, we propose RPG-SAM, a framework that systematically tackles these heterogeneity gaps. Specifically, to address regional heterogeneity, we introduce Reliability-Weighted Prototype Mining (RWPM) to prioritize high-fidelity support features while utilizing background anchors as contrastive references for noise suppression. To address response heterogeneity, we develop Geometric Adaptive Selection (GAS) to dynamically recalibrate binarization thresholds by evaluating the morphological consensus of candidates. Finally, an iterative refinement loop method is designed to polishes anatomical boundaries. By accounting for multi-layered information heterogeneity, RPG-SAM achieves a 5.56\% mIoU improvement on the Kvasir dataset. Code will be released.[204] DogWeave: High-Fidelity 3D Canine Reconstruction from a Single Image via Normal Fusion and Conditional Inpainting
Shufan Sun,Chenchen Wang,Zongfu Yu
Main category: cs.CV
TL;DR: DogWeave是一种基于模型的单图重建框架,通过多视角法向场优化和条件部分修复,从单张RGB图像生成高保真3D犬类模型。
Details
Motivation: 现有单目3D动物重建方法受限于缺乏关节化3D监督和2D数据集中背面图像稀少,导致几何失真、纹理不一致及未观测区域重建困难。 Method: DogWeave先用扩散增强法向初始化参数化网格,再通过多视角法向场优化将其细化为SDF表示以提升几何精度;随后利用结构与风格线索引导的条件部分纹理修复实现视图一致的纹理生成。 Result: 仅用约7000张狗图像训练,DogWeave在形状精度和纹理真实性上均超越当前最优单图3D重建方法,生成完整且逼真的3D犬模型。 Conclusion: DogWeave有效缓解了单目动物重建中因缺乏3D监督和背面视角导致的几何与纹理缺陷,为细粒度、关节化动物建模提供了新范式。 Abstract: Monocular 3D animal reconstruction is challenging due to complex articulation, self-occlusion, and fine-scale details such as fur. Existing methods often produce distorted geometry and inconsistent textures due to the lack of articulated 3D supervision and limited availability of back-view images in 2D datasets, which makes reconstructing unobserved regions particularly difficult. To address these limitations, we propose DogWeave, a model-based framework for reconstructing high-fidelity 3D canine models from a single RGB image. DogWeave improves geometry by refining a coarsely-initiated parametric mesh into a detailed SDF representation through multi-view normal field optimization using diffusion-enhanced normals. It then generates view-consistent textures through conditional partial inpainting guided by structure and style cues, enabling realistic reconstruction of unobserved regions. Using only about 7,000 dog images processed via our 2D pipeline for training, DogWeave produces complete, realistic 3D models and outperforms state-of-the-art single image to 3d reconstruction methods in both shape accuracy and texture realism for canines.[205] Med-Evo: Test-time Self-evolution for Medical Multimodal Large Language Models
Dunyuan Xu,Xikai Yang,Juzheng Miao,Yaoqian Li,Jinpeng Li,Pheng-Ann Heng
Main category: cs.CV
TL;DR: 本文提出Med-Evo框架,首次将无标签强化学习应用于医学多模态大语言模型的自进化,通过特征驱动伪标签(FPL)与硬-软奖励(HSR)机制,在无需额外标注数据的前提下显著提升模型性能。
Details
Motivation: 现有医学多模态大语言模型后训练方法依赖大量标注数据,而医学领域标注成本高、数据敏感,难以获取;同时,利用无标签测试数据进行模型自我提升面临监督信号不可靠和演化不稳定问题。 Method: 提出Med-Evo自进化框架,包含两个核心方法:1)特征驱动伪标签(FPL),基于异构候选响应的语义中心选择伪标签;2)硬-软奖励(HSR),融合精确匹配、词元级评估与语义相似度提供分层奖励。 Result: 在三个医学视觉问答基准和两个基础MLLM上验证有效,Qwen2.5-VL在SLAKE数据集上准确率提升10.43%,召回率提升4.68%。 Conclusion: Med-Evo是首个面向医学MLLM的无标签强化学习自进化框架,有效缓解标注稀缺问题,显著提升模型性能,为低资源医疗AI提供了新范式。 Abstract: Medical Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse healthcare tasks. However, current post-training strategies, such as supervised fine-tuning and reinforcement learning, heavily depend on substantial annotated data while overlooking the potential of unlabeled test data for model enhancement. This limitation becomes particularly pronounced in medical domains, where acquiring extensive labeled medical data is difficult due to the strict data sensitivity and annotation complexity. Moreover, leveraging test data poses challenges in generating reliable supervision signals from unlabeled samples and maintaining stable self-evolution. To address these limitations, we propose Med-Evo, the first self-evolution framework for medical MLLMs that utilizes label-free reinforcement learning to promote model performance without requiring additional labeled data. Our framework introduces two key innovations: $1)$ Feature-driven Pseudo Labeling (FPL) that identifies semantic centroids from all heterogeneous candidate responses to select pseudo labels in each rollout, and $2)$ Hard-Soft Reward (HSR) that combines exact match with token-level assessment and semantic similarity to provide hierarchical reward. Experiments on three medical VQA benchmarks and two base MLLMs show clear advantages of our approach over SOTA methods, with significant improvements of 10.43\% accuracy and 4.68\% recall on the SLAKE dataset using Qwen2.5-VL, showing the effectiveness of our method.[206] SLNet: A Super-Lightweight Geometry-Adaptive Network for 3D Point Cloud Recognition
Mohammad Saeid,Amir Salarpour,Pedram MohajerAnsari,Mert D. Pesé
Main category: cs.CV
TL;DR: SLNet是一种轻量级3D点云识别骨干网络,通过NAPE和GMU两个简单但有效的模块,在显著降低参数量和计算量的同时,保持了与主流模型相当甚至更优的性能。
Details
Motivation: 解决当前3D点云识别模型(如基于注意力、图神经网络和深度MLP的模型)计算开销大、难以部署的问题,追求高精度与高效率的平衡。 Method: 提出SLNet,包含非参数自适应点嵌入(NAPE)和几何调制单元(GMU),构建四阶段分层编码器,结合FPS+kNN分组、非参数归一化和共享残差MLP;引入NetScore+评估指标,综合考虑精度、延迟和峰值内存。 Result: 在ModelNet40上,SLNet-S(0.14M参数)达93.64%准确率,优于PointMLP-elite且参数少5倍;SLNet-M(0.55M参数)达93.92%,参数仅为PointMLP的1/24;在ScanObjectNN和S3DIS上也显著优于对比模型,参数量减少最多达28倍。 Conclusion: SLNet证明了轻量级设计在3D点云识别中极具竞争力,为实际部署提供了高效且准确的新范式。 Abstract: We present SLNet, a lightweight backbone for 3D point cloud recognition designed to achieve strong performance without the computational cost of many recent attention, graph, and deep MLP based models. The model is built on two simple ideas: NAPE (Nonparametric Adaptive Point Embedding), which captures spatial structure using a combination of Gaussian RBF and cosine bases with input adaptive bandwidth and blending, and GMU (Geometric Modulation Unit), a per channel affine modulator that adds only 2D learnable parameters. These components are used within a four stage hierarchical encoder with FPS+kNN grouping, nonparametric normalization, and shared residual MLPs. In experiments, SLNet shows that a very small model can still remain highly competitive across several 3D recognition tasks. On ModelNet40, SLNet-S with 0.14M parameters and 0.31 GFLOPs achieves 93.64% overall accuracy, outperforming PointMLP-elite with 5x fewer parameters, while SLNet-M with 0.55M parameters and 1.22 GFLOPs reaches 93.92%, exceeding PointMLP with 24x fewer parameters. On ScanObjectNN, SLNet-M achieves 84.25% overall accuracy within 1.2 percentage points of PointMLP while using 28x fewer parameters. For large scale scene segmentation, SLNet-T extends the backbone with local Point Transformer attention and reaches 58.2% mIoU on S3DIS Area 5 with only 2.5M parameters, more than 17x fewer than Point Transformer V3. We also introduce NetScore+, which extends NetScore by incorporating latency and peak memory so that efficiency can be evaluated in a more deployment oriented way. Across multiple benchmarks and hardware settings, SLNet delivers a strong overall balance between accuracy and efficiency. Code is available at: https://github.com/m-saeid/SLNet.[207] SIGMAE: A Spectral-Index-Guided Foundation Model for Multispectral Remote Sensing
Xiaokang Zhang,Bo Li,Chufeng Zhou,Weikang Yu,Lefei Zhang
Main category: cs.CV
TL;DR: 本文提出了一种面向多光谱遥感图像的掩码自编码器预训练方法SIGMAE,通过引入光谱指数引导的语义显著性动态掩码策略(SSDTM),提升模型对空间-光谱特征的学习能力,在多个下游任务中表现优越。
Details
Motivation: 现有MAE在多光谱遥感图像上应用受限,因背景复杂、目标模糊、掩码缺乏语义指导,难以学习深层结构和有意义的空间-光谱特征。 Method: 提出Spectral Index-Guided MAE(SIGMAE),核心是Semantic Saliency-Guided Dynamic Token Masking(SSDTM):利用领域先验(如NDVI等光谱指数)量化图像块的语义丰富度与内部异质性,实现课程式、自适应的掩码选择。 Result: 在五个主流遥感数据集及场景分类、语义分割、目标提取、变化检测等任务上,SIGMAE优于其他遥感基础模型;支持高达90%掩码率下的高质量重建,并在标注数据有限时显著提升复杂目标识别性能。 Conclusion: SIGMAE通过融合领域知识与动态掩码机制,有效增强了多光谱图像表征学习的语义感知性、结构敏感性与计算效率,为遥感预训练范式提供了新思路。 Abstract: Pretraining and fine-tuning have emerged as a new paradigm in remote sensing image interpretation. Among them, Masked Autoencoder (MAE)-based pretraining stands out for its strong capability to learn general feature representations via reconstructing masked image regions. However, applying MAE to multispectral remote sensing images remains challenging due to complex backgrounds, indistinct targets, and the lack of semantic guidance during masking, which hinders the learning of underlying structures and meaningful spatial-spectral features. To address this, we propose a simple yet effective approach, Spectral Index-Guided MAE (SIGMAE), for multispectral image pretraining. The core idea is to incorporate domain-specific spectral indices as prior knowledge to guide dynamic token masking toward informative regions. SIGMAE introduces Semantic Saliency-Guided Dynamic Token Masking (SSDTM), a curriculum-style strategy that quantifies each patch's semantic richness and internal heterogeneity to adaptively select the most informative tokens during training. By prioritizing semantically salient regions and progressively increasing sample difficulty, SSDTM enhances spectrally rich and structurally aware representation learning, mitigates overfitting, and reduces redundant computation compared with random masking. Extensive experiments on five widely used datasets covering various downstream tasks, including scene classification, semantic segmentation, object extraction and change detection, demonstrate that SIGMAE outperforms other pretrained geospatial foundation models. Moreover, it exhibits strong spatial-spectral reconstruction capability, even with a 90% mask ratio, and improves complex target recognition under limited labeled data. The source codes and model weights will be released at https://github.com/zxk688/SIGMAE.[208] Selective Transfer Learning of Cross-Modality Distillation for Monocular 3D Object Detection
Rui Ding,Meng Yang,Nanning Zheng
Main category: cs.CV
TL;DR: 本文提出MonoSTL方法,通过选择性学习缓解跨模态知识蒸馏中的负迁移问题,提升单目3D目标检测精度。
Details
Motivation: 单目3D目标检测因缺乏精确深度信息而具有挑战性;跨模态知识蒸馏虽可将LiDAR深度信息迁移到图像网络,但模态差异导致负迁移(如架构不一致和特征过拟合)严重限制性能。 Method: 提出MonoSTL:1)采用相似网络结构保障图像与LiDAR特征空间对齐;2)设计两个新蒸馏模块——深度感知选择性特征蒸馏(DASFD)和深度感知选择性关系蒸馏(DASRD),分别在特征和关系层面融合深度不确定性以选择性学习正向信息。 Result: 在KITTI和NuScenes数据集上,MonoSTL显著提升多种CNN及DETR基线模型的检测精度,并超越当前所有最新SOTA方法。 Conclusion: MonoSTL有效缓解跨模态蒸馏中的负迁移问题,验证了引入深度不确定性进行选择性学习对单目3D检测的有效性和通用性。 Abstract: Monocular 3D object detection is a promising yet ill-posed task for autonomous vehicles due to the lack of accurate depth information. Cross-modality knowledge distillation could effectively transfer depth information from LiDAR to image-based network. However, modality gap between image and LiDAR seriously limits its accuracy. In this paper, we systematically investigate the negative transfer problem induced by modality gap in cross-modality distillation for the first time, including not only the architecture inconsistency issue but more importantly the feature overfitting issue. We propose a selective learning approach named MonoSTL to overcome these issues, which encourages positive transfer of depth information from LiDAR while alleviates the negative transfer on image-based network. On the one hand, we utilize similar architectures to ensure spatial alignment of features between image-based and LiDAR-based networks. On the other hand, we develop two novel distillation modules, namely Depth-Aware Selective Feature Distillation (DASFD) and Depth-Aware Selective Relation Distillation (DASRD), which selectively learn positive features and relationships of objects by integrating depth uncertainty into feature and relation distillations, respectively. Our approach can be seamlessly integrated into various CNN-based and DETR-based models, where we take three recent models on KITTI and a recent model on NuScenes for validation. Extensive experiments show that our approach considerably improves the accuracy of the base models and thereby achieves the best accuracy compared with all recently released SOTA models.[209] Classifying Novel 3D-Printed Objects without Retraining: Towards Post-Production Automation in Additive Manufacturing
Fanis Mathioulakis,Gorjan Radevski,Silke GC Cleuren,Michel Janssens,Brecht Das,Koen Schauwaert,Tinne Tuytelaars
Main category: cs.CV
TL;DR: 本文提出ThingiPrint数据集,用于评估视觉模型在3D打印物体分类任务上的性能,并提出一种基于CAD模型、无需重训练的对比微调方法,实现对未见物体的有效原型分类。
Details
Motivation: 工业增材制造中3D打印物体的可靠分类对自动化后处理流程至关重要,但当前依赖人工检查,且因每日对象集合变化频繁,频繁重训练模型不现实,亟需无需重训练、能利用CAD模型进行泛化分类的视觉方法。 Method: 构建ThingiPrint数据集(配对CAD模型与真实3D打印照片),并在其上评测多种现有视觉模型;提出基于旋转不变目标的对比微调策略,实现仅依赖CAD模型的原型式零样本/少样本分类。 Result: 所提对比微调方法在ThingiPrint上显著优于标准预训练基线,在无需重训练前提下实现了对新引入3D打印物体的有效分类,展现出更强泛化能力与实际部署价值。 Conclusion: 利用CAD先验与对比学习可有效解决动态工业场景下的3D打印物体分类问题,ThingiPrint为该方向提供了标准化评测基准,所提方法具备实用推广潜力。 Abstract: Reliable classification of 3D-printed objects is essential for automating post-production workflows in industrial additive manufacturing. Despite extensive automation in other stages of the printing pipeline, this task still relies heavily on manual inspection, as the set of objects to be classified can change daily, making frequent model retraining impractical. Automating the identification step is therefore critical for improving operational efficiency. A vision model that could classify any set of objects by utilizing their corresponding CAD models and avoiding retraining would be highly beneficial in this setting. To enable systematic evaluation of vision models on this task, we introduce ThingiPrint, a new publicly available dataset that pairs CAD models with real photographs of their 3D-printed counterparts. Using ThingiPrint, we benchmark a range of existing vision models on the task of 3D-printed object classification. We additionally show that contrastive fine-tuning with a rotation-invariant objective allows effective prototype-based classification of previously unseen 3D-printed objects. By relying solely on the available CAD models, this avoids the need for retraining when new objects are introduced. Experiments show that this approach outperforms standard pretrained baselines, suggesting improved generalization and practical relevance for real-world use.[210] FedEU: Evidential Uncertainty-Driven Federated Fine-Tuning of Vision Foundation Models for Remote Sensing Image Segmentation
Xiaokang Zhang,Xuran Xiong,Jianzhong Huang,Lefei Zhang
Main category: cs.CV
TL;DR: 本文提出FedEU框架,通过证据不确定性建模与自适应聚合策略,提升遥感图像分割在联邦学习环境下的鲁棒性与可靠性。
Details
Motivation: 联邦遥感图像分割中,预训练模型在异构客户端数据上动态适配时缺乏不确定性估计,导致更新不可靠、协同优化效果下降。 Method: 提出FedEU框架:1)个性化证据不确定性建模以量化局部模型认知不确定性;2)客户端特定特征嵌入(CFE)增强通道感知表征并保留个性化特性;3)Top-k不确定性引导加权(TUW)策略实现服务端自适应聚合。 Result: 在三个大规模异构遥感数据集上验证了FedEU的优越性能,显著降低预测不确定性,实现更均衡、鲁棒和可靠的联邦分割结果。 Conclusion: FedEU通过显式建模和利用证据不确定性,有效缓解数据分布偏移与不可靠本地更新的影响,为联邦遥感图像分割提供了可信赖的优化范式。 Abstract: Remote sensing image segmentation (RSIS) in federated environments has gained increasing attention because it enables collaborative model training across distributed datasets without sharing raw imagery or annotations. Federated RSIS combined with parameter-efficient fine-tuning (PEFT) can unleash the generalization power of pretrained foundation models for real-world applications, with minimal parameter aggregation and communication overhead. However, the dynamic adaptation of pretrained models to heterogeneous client data inevitably increases update uncertainty and compromises the reliability of collaborative optimization due to the lack of uncertainty estimation for each local model. To bridge this gap, we present FedEU, a federated optimization framework for fine-tuning RSIS models driven by evidential uncertainty. Specifically, personalized evidential uncertainty modeling is introduced to quantify epistemic variations of local models and identify high-risk areas under local data distributions. Furthermore, the client-specific feature embedding (CFE) is exploited to enhance channel-aware feature representation while preserving client-specific properties through personalized attention and an element-aware parameter update approach. These uncertainty estimates are uploaded to the server to enable adaptive global aggregation via a Top-k uncertainty-guided weighting (TUW) strategy, which mitigates the impact of distribution shifts and unreliable updates. Extensive experiments on three large-scale heterogeneous datasets demonstrate the superior performance of FedEU. More importantly, FedEU enables balanced model adaptation across diverse clients by explicitly reducing prediction uncertainty, resulting in more robust and reliable federated outcomes. The source codes will be available at https://github.com/zxk688/FedEU.[211] EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
Wenqi Cai,Yawen Zou,Guang Li,Chunzhi Gu,Chao Zhang
Main category: cs.CV
TL;DR: 本文提出了一种早期视觉-语言融合(EVLF)方法,用于改进扩散模型驱动的数据集蒸馏,通过在编码器与生成主干网络过渡处对齐文本和视觉嵌入,提升合成样本的语义保真度与视觉一致性。
Details
Motivation: 现有基于扩散模型的数据集蒸馏方法在后期引入文本提示引导,导致视觉特征贡献减弱、样本过度受提示影响而失真。 Method: 提出Early Vision-Language Fusion(EVLF),在编码器与生成主干之间插入轻量级跨注意力模块,实现早期视觉与语言表征融合,并保持即插即用特性。 Result: EVLF在多种设置下生成语义忠实、视觉连贯的合成数据,显著提升下游分类准确率,且兼容不同去噪器架构与采样策略。 Conclusion: EVLF是一种通用、高效且无需任务定制的改进方案,有效缓解了扩散式数据集蒸馏中语义主导与视觉失真之间的矛盾。 Abstract: Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision-Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the denoising process. Importantly, EVLF is plug-and-play and can be easily integrated into any diffusion-based dataset distillation pipeline with an encoder. It works across different denoiser architectures and sampling schedules without any task-specific modifications. Extensive experiments demonstrate that EVLF generates semantically faithful and visually coherent synthetic data, yielding consistent improvements in downstream classification accuracy across varied settings. Source code is available at https://github.com/wenqi-cai297/earlyfusion-for-dd/.[212] Multi-Modal Decouple and Recouple Network for Robust 3D Object Detection
Rui Ding,Zhaonian Kuang,Yuzhe Ji,Meng Yang,Xinhu Zheng,Gang Hua
Main category: cs.CV
TL;DR: 本文提出了一种多模态解耦与重耦网络(MDR-Net),通过将相机和激光雷达的鸟瞰图(BEV)特征显式解耦为模态不变与模态特有部分,并设计三个专家分别应对不同数据损坏类型,实现鲁棒的3D目标检测。
Details
Motivation: 现有模型在多模态BEV特征融合中紧密耦合,导致某一或两种模态受传感器配置或场景条件等数据损坏影响时性能显著下降;而跨模态的高层不变特征受损坏方式不同影响不一致,具备互补恢复潜力。 Method: 1)将Camera/LiDAR BEV特征解耦为模态不变与模态特有两部分;2)构建三个专家分别处理LiDAR损坏、相机损坏、两者同时损坏的情形,各专家以不变特征为主干、特有特征为补充;3)自适应融合三专家输出用于3D检测;4)在nuScenes上构建含大量真实风格损坏的数据集进行评测。 Result: 在nuScenes干净数据及各类损坏数据(LiDAR、相机、两者)上,该方法均优于近期先进模型,展现出更强的鲁棒性与泛化性。 Conclusion: 解耦模态不变与特有特征,并通过专家机制与自适应融合实现鲁棒多模态融合,是提升3D检测在真实复杂环境下稳定性的有效范式。 Abstract: Multi-modal 3D object detection with bird's eye view (BEV) has achieved desired advances on benchmarks. Nonetheless, the accuracy may drop significantly in the real world due to data corruption such as sensor configurations for LiDAR and scene conditions for camera. One design bottleneck of previous models resides in the tightly coupling of multi-modal BEV features during fusion, which may degrade the overall system performance if one modality or both is corrupted. To mitigate, we propose a Multi-Modal Decouple and Recouple Network for robust 3D object detection under data corruption. Different modalities commonly share some high-level invariant features. We observe that these invariant features across modalities do not always fail simultaneously, because different types of data corruption affect each modality in distinct ways.These invariant features can be recovered across modalities for robust fusion under data corruption.To this end, we explicitly decouple Camera/LiDAR BEV features into modality-invariant and modality-specific parts. It allows invariant features to compensate each other while mitigates the negative impact of a corrupted modality on the other.We then recouple these features into three experts to handle different types of data corruption, respectively, i.e., LiDAR, camera, and both.For each expert, we use modality-invariant features as robust information, while modality-specific features serve as a complement.Finally, we adaptively fuse the three experts to exact robust features for 3D object detection. For validation, we collect a benchmark with a large quantity of data corruption for LiDAR, camera, and both based on nuScenes. Our model is trained on clean nuScenes and tested on all types of data corruption. Our model consistently achieves the best accuracy on both corrupted and clean data compared to recent models.[213] RobustSCI: Beyond Reconstruction to Restoration for Snapshot Compressive Imaging under Real-World Degradations
Hao Wang,Yuanfan Li,Qi Zhou,Zhankuo Xu,Jiong Ni,Xin Yuan
Main category: cs.CV
TL;DR: 本文提出了一种面向退化快照压缩视频成像(SCI)测量的鲁棒恢复方法RobustSCI,首次将任务从‘重建’转向‘恢复’,即从含运动模糊和低光照的退化测量中恢复原始清晰场景,并构建了大规模仿真退化基准与级联增强模型RobustSCI-C,在合成与真实退化数据上均显著优于现有SOTA方法。
Details
Motivation: 现有深度学习方法主要针对干净的SCI测量进行重建,忽视了实际中测量常受运动模糊和低光照严重退化的现实挑战,导致模型实用性受限。 Method: 构建基于DAVIS 2017的大规模连续退化仿真基准;提出RobustSCI网络,其核心为新型RobustCFormer模块,包含多尺度去模糊分支与频域增强分支以显式解耦并去除退化;进一步提出级联架构RobustSCI-C,集成轻量预训练后处理去模糊网络。 Result: 在新建退化测试集及真实退化SCI数据上,RobustSCI及其级联版本RobustSCI-C全面超越当前所有SOTA方法,验证了其实际有效性。 Conclusion: 本文开创性地定义并解决了鲁棒视频SCI恢复问题,推动SCI技术从被动重建走向主动恢复真实场景,显著提升其在复杂真实环境中的适用性。 Abstract: Deep learning algorithms for video Snapshot Compressive Imaging (SCI) have achieved great success, yet they predominantly focus on reconstructing from clean measurements. This overlooks a critical real-world challenge: the captured signal itself is often severely degraded by motion blur and low light. Consequently, existing models falter in practical applications. To break this limitation, we pioneer the first study on robust video SCI restoration, shifting the goal from "reconstruction" to "restoration"--recovering the underlying pristine scene from a degraded measurement. To facilitate this new task, we first construct a large-scale benchmark by simulating realistic, continuous degradations on the DAVIS 2017 dataset. Second, we propose RobustSCI, a network that enhances a strong encoder-decoder backbone with a novel RobustCFormer block. This block introduces two parallel branches--a multi-scale deblur branch and a frequency enhancement branch--to explicitly disentangle and remove degradations during the recovery process. Furthermore, we introduce RobustSCI-C (RobustSCI-Cascade), which integrates a pre-trained Lightweight Post-processing Deblurring Network to significantly boost restoration performance with minimal overhead. Extensive experiments demonstrate that our methods outperform all SOTA models on the new degraded testbeds, with additional validation on real-world degraded SCI data confirming their practical effectiveness, elevating SCI from merely reconstructing what is captured to restoring what truly happened.[214] RayD3D: Distilling Depth Knowledge Along the Ray for Robust Multi-View 3D Object Detection
Rui Ding,Zhaonian Kuang,Zongwei Zhou,Meng Yang,Xinhu Zheng,Gang Hua
Main category: cs.CV
TL;DR: 本文提出RayD3D方法,通过沿射线(camera到物体真实位置的连线)进行知识蒸馏,提升多视角BEV 3D检测模型在真实场景中的深度预测鲁棒性,避免传统跨模态蒸馏中引入LiDAR密度等深度无关信息。
Details
Motivation: 现有基于跨模态蒸馏的多视角BEV 3D检测方法易将LiDAR中的深度无关信息(如点云密度)错误迁移到相机模型,导致深度预测不鲁棒,尤其在存在数据退化的真实场景中性能下降明显。 Method: 提出RayD3D框架,包含两个射线导向蒸馏模块:1)Ray-based Contrastive Distillation(RCD),沿射线采样并结合对比学习,使相机模型学习LiDAR对物体的精确定位能力;2)Ray-based Weighted Distillation(RWD),沿射线自适应调整蒸馏权重,抑制深度无关信息干扰。 Result: 在NuScenes(干净/退化数据)和RoboBEV上验证,RayD3D显著提升BEVDet、BEVDepth4D和BEVFormer三类主流BEV模型的鲁棒性,在各类数据退化下均取得最优性能,且不增加推理开销。 Conclusion: 沿成像射线进行知识蒸馏是一种更符合几何先验、更精准的深度知识迁移方式,能有效解耦深度相关与无关信息,为多模态3D检测的鲁棒性提升提供了新范式。 Abstract: Multi-view 3D detection with bird's eye view (BEV) is crucial for autonomous driving and robotics, but its robustness in real-world is limited as it struggles to predict accurate depth values. A mainstream solution, cross-modal distillation, transfers depth information from LiDAR to camera models but also unintentionally transfers depth-irrelevant information (e.g. LiDAR density). To mitigate this issue, we propose RayD3D, which transfers crucial depth knowledge along the ray: a line projecting from the camera to true location of an object. It is based on the fundamental imaging principle that predicted location of this object can only vary along this ray, which is finally determined by predicted depth value. Therefore, distilling along the ray enables more effective depth information transfer. More specifically, we design two ray-based distillation modules. Ray-based Contrastive Distillation (RCD) incorporates contrastive learning into distillation by sampling along the ray to learn how LiDAR accurately locates objects. Ray-based Weighted Distillation (RWD) adaptively adjusts distillation weight based on the ray to minimize the interference of depth-irrelevant information in LiDAR. For validation, we widely apply RayD3D into three representative types of BEV-based models, including BEVDet, BEVDepth4D, and BEVFormer. Our method is trained on clean NuScenes, and tested on both clean NuScenes and RoboBEV with a variety types of data corruptions. Our method significantly improves the robustness of all the three base models in all scenarios without increasing inference costs, and achieves the best when compared to recently released multi-view and distillation models.[215] DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding
Yuchuan Wu,Minghan Zhuo,Teng Fu,Mengyang Zhao,Bin Li,Xiangyang Xue
Main category: cs.CV
TL;DR: 本文提出DocCogito框架,通过引入轻量级布局塔和确定性视觉-语义链(VSC),显式耦合全局布局感知与区域接地推理,提升多模态文档理解模型的可解释性与准确性。
Details
Motivation: 现有文档多模态大语言模型虽改进布局编码与思维链提示,但布局与推理间交互隐式且松散,缺乏系统性机制,难以实现类人、证据支撑的推理过程。 Method: 提出DocCogito统一框架:1)轻量级布局塔生成全局布局先验token;2)确定性视觉-语义链(VSC)作为结构化中间推理表示;3)渐进式训练流程(布局预训练、VSC引导冷启动、拒绝采样、GRPO);4)引入细粒度区域置信奖励信号强化布局先验与VSC执行的耦合。 Result: 在六个基准(DocVQA、WTQ、ChartQA、TextVQA、OCRBench、InfoVQA)上验证,泛化性强,在其中四个基准上达到SOTA性能。 Conclusion: DocCogito通过显式建模布局-推理耦合机制,显著提升了文档理解模型的推理可解释性、证据对齐性与整体性能,为高风险场景下的可信文档AI提供了新范式。 Abstract: Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of forming a complete, human-like reasoning process, because even when they improve both layout encoding and CoT-style prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. So we propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC)-a concise structured representation less ambiguous than free-form natural-language CoT-to supervise fine-grained intermediate reasoning aligned with evidence regions. Training follows a progressive recipe, including layout perception pretraining, VSC-guided cold start, rejection sampling, and GRPO. To further strengthen the internal coupling between layout priors and VSC execution, we augment standard rewards with a fine-grained region-confidence signal that encourages reasoning traces to stay aligned with corresponding evidence regions. Extensive experiments on six benchmarks (DocVQA, WTQ, ChartQA, TextVQA, OCRBench, and InfoVQA) demonstrate strong generalization, achieving state-of-the-art results on four benchmarks.[216] AMR-CCR: Anchored Modular Retrieval for Continual Chinese Character Recognition
Yuchuan Wu,Yinglian Zhu,Haiyang Yu,Ke Niu,Bin Li,Xiangyang Xue
Main category: cs.CV
TL;DR: 本文提出Continual CCR新任务,针对古汉字识别中不断新增文字类别的现实挑战,设计AMR-CCR框架,通过锚定式模块化检索、脚本条件注入与多原型字典提升可扩展性与风格鲁棒性,并构建EvoCON六阶段基准评测集。
Details
Motivation: 古汉字识别在文化遗产数字化中至关重要,但实际场景中不断出土新材料,带来新文字类别和书写风格变化,传统封闭集分类方法难以应对持续增长的类别空间与高度类内多样性。 Method: 提出AMR-CCR框架:1)基于嵌入字典匹配的锚定式模块化检索;2)轻量级脚本条件注入模块(SIA+SAR),实现跨阶段嵌入兼容;3)图像驱动的多原型字典,对类内嵌入聚类以覆盖多样书写风格。 Result: 在自建EvoCON六阶段基准(涵盖甲骨文、金文等六种古文字)上验证了AMR-CCR的有效性,尤其在增量学习、跨脚本迁移与零样本识别任务中显著优于基线方法。 Conclusion: AMR-CCR为古汉字持续识别提供了可扩展、可演进的范式,其模块化设计支持零训练新增类别,多原型与脚本校准机制有效缓解类内差异与脚本异构问题,推动文化遗产AI向真实动态场景落地。 Abstract: Ancient Chinese character recognition is a core capability for cultural heritage digitization, yet real-world workflows are inherently non-stationary: newly excavated materials are continuously onboarded, bringing new classes in different scripts, and expanding the class space over time. We formalize this process as Continual Chinese Character Recognition (Continual CCR), a script-staged, class-incremental setting that couples two challenges: (i) scalable learning under continual class growth with subtle inter-class differences and scarce incremental data, and (ii) pronounced intra-class diversity caused by writing-style variations across writers and carrier conditions. To overcome the limitations of conventional closed-set classification, we propose AMR-CCR, an anchored modular retrieval framework that performs recognition via embedding-based dictionary matching in a shared multimodal space, allowing new classes to be added by simply extending the dictionary. AMR-CCR further introduces a lightweight script-conditioned injection module (SIA+SAR) to calibrate newly onboarded scripts while preserving cross-stage embedding compatibility, and an image-derived multi-prototype dictionary that clusters within-class embeddings to better cover diverse style modes. To support systematic evaluation, we build EvoCON, a six-stage benchmark for continual script onboarding, covering six scripts (OBC, BI, SS, SAC, WSC, CS), augmented with meaning/shape descriptions and an explicit zero-shot split for unseen characters without image exemplars.[217] High-Fidelity Medical Shape Generation via Skeletal Latent Diffusion
Guoqing Zhang,Jingyun Yang,Siqi Chen,Anping Zhang,Yang Li
Main category: cs.CV
TL;DR: 本文提出了一种基于骨骼隐式扩散的医学形状建模框架,通过引入结构先验提升生成精度与效率,并构建了大规模医学SDF数据集MedSDF。
Details
Motivation: 解剖结构几何复杂、拓扑多变,且医学形状数据稀缺,导致高保真、高效建模困难。 Method: 设计了一个结合可微骨架化模块的形状自编码器,将全局几何信息(骨架)与局部表面特征编码为隐空间表示;在隐空间上训练扩散模型生成新形状,再经神经隐式解码和网格提取得到最终三维形状;同时构建了MedSDF数据集(含多类解剖结构的点云与符号距离场)。 Result: 在MedSDF和血管数据集上,该方法在重建与生成质量上优于现有方法,同时计算效率更高。 Conclusion: 显式融入结构先验(如骨架)的隐扩散框架能有效提升医学形状建模的保真度、泛化性与效率,MedSDF数据集为后续研究提供了重要资源。 Abstract: Anatomy shape modeling is a fundamental problem in medical data analysis. However, the geometric complexity and topological variability of anatomical structures pose significant challenges to accurate anatomical shape generation. In this work, we propose a skeletal latent diffusion framework that explicitly incorporates structural priors for efficient and high-fidelity medical shape generation. We introduce a shape auto-encoder in which the encoder captures global geometric information through a differentiable skeletonization module and aggregates local surface features into shape latents, while the decoder predicts the corresponding implicit fields over sparsely sampled coordinates. New shapes are generated via a latent-space diffusion model, followed by neural implicit decoding and mesh extraction. To address the limited availability of medical shape data, we construct a large-scale dataset, \textit{MedSDF}, comprising surface point clouds and corresponding signed distance fields across multiple anatomical categories. Extensive experiments on MedSDF and vessel datasets demonstrate that the proposed method achieves superior reconstruction and generation quality while maintaining a higher computational efficiency compared with existing approaches. Code is available at: https://github.com/wlsdzyzl/meshage.[218] EvolveReason: Self-Evolving Reasoning Paradigm for Explainable Deepfake Facial Image Identification
Binjia Zhou,Dawei Luo,Shuai Chen,Feng Xu,Seow,Haoyuan Li,Jiachi Wang,Jiawen Wang,Zunlei Feng,Yijun Bei
Main category: cs.CV
TL;DR: 本文提出EvolveReason方法,通过模拟人类审计员的推理过程,结合链式思维数据集CoT-Face、伪造潜在空间分布捕获模块及自进化探索策略,提升人脸伪造识别的准确性、可解释性与泛化能力。
Details
Motivation: 现有方法在人脸伪造识别中存在可解释性差(传统分类法)或易幻觉、细节不足(可解释VLM)的问题,亟需更可靠、细粒度且抗幻觉的识别与解释框架。 Method: 提出EvolveReason框架:1)构建面向先进VLM的链式思维数据集CoT-Face;2)引入伪造潜在空间分布捕获模块以提取高频伪造线索;3)设计基于强化学习的两阶段自进化探索策略优化文本解释。 Result: 实验表明EvolveReason在识别性能上超越当前SOTA,并能准确定位伪造细节,具备良好泛化能力。 Conclusion: EvolveReason有效融合人类推理范式与VLM能力,在提升识别精度的同时显著增强解释可靠性与细节丰富度,为AIGC安全治理提供了新思路。 Abstract: With the rapid advancement of AIGC technology, developing identification methods to address the security challenges posed by deepfakes has become urgent. Face forgery identification techniques can be categorized into two types: traditional classification methods and explainable VLM approaches. The former provides classification results but lacks explanatory ability, while the latter, although capable of providing coarse-grained explanations, often suffers from hallucinations and insufficient detail. To overcome these limitations, we propose EvolveReason, which mimics the reasoning and observational processes of human auditors when identifying face forgeries. By constructing a chain-of-thought dataset, CoT-Face, tailored for advanced VLMs, our approach guides the model to think in a human-like way, prompting it to output reasoning processes and judgment results. This provides practitioners with reliable analysis and helps alleviate hallucination. Additionally, our framework incorporates a forgery latent-space distribution capture module, enabling EvolveReason to identify high-frequency forgery cues difficult to extract from the original images. To further enhance the reliability of textual explanations, we introduce a self-evolution exploration strategy, leveraging reinforcement learning to allow the model to iteratively explore and optimize its textual descriptions in a two-stage process. Experimental results show that EvolveReason not only outperforms the current state-of-the-art methods in identification performance but also accurately identifies forgery details and demonstrates generalization capabilities.[219] SketchGraphNet: A Memory-Efficient Hybrid Graph Transformer for Large-Scale Sketch Corpora Recognition
Shilong Chen,Mingyuan Li,Zhaoyang Wang,Zhonglin Ye,Haixing Zhao
Main category: cs.CV
TL;DR: 本文提出SketchGraphNet,一种无需辅助位置或结构编码的混合图神经网络架构,用于大规模草图识别;构建了包含344类、344万张图结构草图的基准数据集SketchGraph;在不同噪声条件下均取得高准确率,并通过MemEffAttn显著降低显存与训练时间。
Details
Motivation: 现有草图识别方法多基于光栅图像或笔画序列,难以充分建模草图的固有结构信息;缺乏大规模、标准化的图结构草图基准数据集。 Method: 提出SketchGraphNet:融合局部消息传递与内存高效的全局注意力机制(MemEffAttn)的混合图神经网络;将自由手绘草图直接建模为时空图(含归一化笔画顺序属性);构建大规模图结构草图基准SketchGraph(含A/R两个噪声变体)。 Result: 在SketchGraph-A和SketchGraph-R上Top-1准确率分别达83.62%和87.61%;MemEffAttn相较Performer减少超40% GPU峰值显存和30%以上训练时间,精度相当。 Conclusion: 从图原生视角建模草图是有效且可行的路径;SketchGraphNet及其高效注意力机制为大规模草图理解提供了新范式与实用工具;SketchGraph基准有望推动图神经网络在视觉理解领域的应用。 Abstract: This work investigates large-scale sketch recognition from a graph-native perspective, where free-hand sketches are directly modeled as structured graphs rather than raster images or stroke sequences. We propose SketchGraphNet, a hybrid graph neural architecture that integrates local message passing with a memory-efficient global attention mechanism, without relying on auxiliary positional or structural encodings. To support systematic evaluation, we construct SketchGraph, a large-scale benchmark comprising 3.44 million graph-structured sketches across 344 categories, with two variants (A and R) to reflect different noise conditions. Each sketch is represented as a spatiotemporal graph with normalized stroke-order attributes. On SketchGraph-A and SketchGraph-R, SketchGraphNet achieves Top-1 accuracies of 83.62% and 87.61%, respectively, under a unified training configuration. MemEffAttn further reduces peak GPU memory by over 40% and training time by more than 30% compared with Performer-based global attention, while maintaining comparable accuracy.[220] Scale-Aware UAV-to-Satellite Cross-View Geo-Localization: A Semantic Geometric Approach
Yibin Ye,Shuo Chen,Kun Wang,Xiaokai Song,Jisheng Dang,Qifeng Yu,Xichao Teng,Zhang Li
Main category: cs.CV
TL;DR: 本文提出了一种基于语义锚点(小车辆)恢复单目无人机图像绝对尺度的几何框架,以解决跨视角地理定位中因尺度不一致导致的特征错配问题。通过解耦立体投影模型和IQR鲁棒融合策略提升尺度估计精度,并用于自适应卫星图裁剪,显著增强了CVGL在未知尺度下的鲁棒性。
Details
Motivation: 现有跨视角地理定位方法假设无人机查询图像与卫星图库尺度一致,但现实中存在严重尺度模糊,导致视场错位与特征不匹配,降低鲁棒性。 Method: 利用小车辆作为具有稳定尺寸先验的语义锚点;提出解耦立体投影模型,将车辆三维尺寸分解为径向与切向分量以校正透视畸变;采用双维度融合与IQR鲁棒聚合抑制类内尺寸变化与检测噪声;用估计的全局尺度约束卫星图像的尺度自适应裁剪。 Result: 在增强版DenseUAV和UAV-VisLoc数据集上实验表明,该方法显著提升了未知尺度下CVGL的鲁棒性;同时展现出在被动无人机高度估计与3D模型尺度恢复等下游任务中的应用潜力。 Conclusion: 所提几何框架能有效从单目无人机图像中恢复绝对尺度,缓解尺度失配问题,提升跨视角地理定位性能,并拓展至其他需物理尺度信息的任务。 Abstract: Cross-View Geo-Localization (CVGL) between UAV imagery and satellite images plays a crucial role in target localization and UAV self-positioning. However, most existing methods rely on the idealized assumption of scale consistency between UAV queries and satellite galleries, overlooking the severe scale ambiguity commonly encountered in real-world scenarios. This discrepancy leads to field-of-view misalignment and feature mismatch, significantly degrading CVGL robustness. To address this issue, we propose a geometric framework that recovers the absolute metric scale from monocular UAV images using semantic anchors. Specifically, small vehicles (SVs), characterized by relatively stable prior size distributions and high detectability, are exploited as metric references. A Decoupled Stereoscopic Projection Model is introduced to estimate the absolute image scale from these semantic targets. By decomposing vehicle dimensions into radial and tangential components, the model compensates for perspective distortions in 2D detections of 3D vehicles, enabling more accurate scale estimation. To further reduce intra-class size variation and detection noise, a dual-dimension fusion strategy with Interquartile Range (IQR)-based robust aggregation is employed. The estimated global scale is then used as a physical constraint for scale-adaptive satellite image cropping, improving UAV-to-satellite feature alignment. Experiments on augmented DenseUAV and UAV-VisLoc datasets demonstrate that the proposed method significantly improves CVGL robustness under unknown UAV image scales. Additionally, the framework shows strong potential for downstream applications such as passive UAV altitude estimation and 3D model scale recovery.[221] How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation
Haoyu Chen,Qing Liu,Yuqian Zhou,He Zhang,Zhaowen Wang,Mengwei Ren,Jingjing Ren,Xiang Wang,Zhe Lin,Lei Zhu
Main category: cs.CV
TL;DR: 本文提出UniLongGen方法,通过动态筛选和丢弃干扰性视觉信号,解决统一多模态模型在长序列生成中因视觉历史累积导致的质量崩溃问题,显著提升长程保真度与一致性,同时降低内存占用和推理时间。
Details
Motivation: 当前统一多模态模型在生成长序列时质量迅速下降,作者指出该问题源于视觉事件累积引发的主动污染,而非传统长上下文挑战。 Method: 提出无需训练的推理策略UniLongGen,基于模型内部相关性排序,动态裁剪历史记忆中的干扰视觉信号,避免密集视觉token压垮注意力机制。 Result: UniLongGen在长程生成保真度与一致性上显著优于基线方法,同时减少内存占用和推理时间。 Conclusion: 主动遗忘干扰性视觉历史是保障多模态长文本-图像叙事稳定性的关键,UniLongGen为可靠长序列生成提供了新范式。 Abstract: Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.[222] CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization
Anh-Duy Le,Van-Linh Pham,Thanh-Nam Vo,Xuan Toan Mai,Tuan-Anh Tran
Main category: cs.CV
TL;DR: 本文提出CONSTANT方法,通过风格感知量化、对比学习和潜在块对比增强,在单样本手写体图像生成中实现了更真实、细节丰富的结果,并在多语言数据集上验证了其优越性。
Details
Motivation: 现有单样本手写体生成方法难以从单一参考图像中准确捕捉人类书写的复杂多样性,尤其难以分离不变风格特征(如倾斜度、笔画宽度、曲率)并抑制无关噪声。 Method: 提出基于去噪扩散模型的CONSTANT框架,包含三个核心模块:1)风格感知量化(SAQ)将风格建模为离散视觉标记;2)对比学习目标确保风格嵌入空间中标记语义分离且有意义;3)潜在空间补丁级对比增强(LatentPCE)对齐多尺度空间补丁以提升生成质量与局部结构。 Result: 在英文、中文及自建越南语ViHTGen数据集上的实验表明,CONSTANT在新风格适应性和图像细节生成方面显著优于当前最先进方法。 Conclusion: CONSTANT有效解决了单样本手写生成中风格解耦与细粒度重建难题,为多语言手写合成提供了通用、鲁棒的新范式。 Abstract: One-shot styled handwriting image generation, despite achieving impressive results in recent years, remains challenging due to the difficulty in capturing the intricate and diverse characteristics of human handwriting by using solely a single reference image. Existing methods still struggle to generate visually appealing and realistic handwritten images and adapt to complex, unseen writer styles, struggling to isolate invariant style features (e.g., slant, stroke width, curvature) while ignoring irrelevant noise. To tackle this problem, we introduce Patch Contrastive Enhancement and Style-Aware Quantization via Denoising Diffusion (CONSTANT), a novel one-shot handwriting generation via diffusion model. CONSTANT leverages three key innovations: 1) a Style-Aware Quantization (SAQ) module that models style as discrete visual tokens capturing distinct concepts; 2) a contrastive objective to ensure these tokens are well-separated and meaningful in the embedding style space; 3) a latent patch-based contrastive (LLatentPCE) objective help improving quality and local structures by aligning multiscale spatial patches of generated and real features in latent space. Extensive experiments and analysis on benchmark datasets from multiple languages, including English, Chinese, and our proposed ViHTGen dataset for Vietnamese, demonstrate the superiority of adapting to new reference styles and producing highly detailed images of our method over state-of-the-art approaches. Code is available at GitHub[223] DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration
Jinzhou Tang,Fan Feng,Minghao Fu,Wenjun Lin,Biwei Huang,Keze Wang
Main category: cs.CV
TL;DR: 本文提出DreamSAC框架,通过基于哈密顿量的对称性探索与对比学习,使世界模型能学习物理不变性,从而提升在3D物理仿真中对新物理属性的外推泛化能力。
Details
Motivation: 现有学习型世界模型擅长插值泛化,但难以外推至新物理属性,因其仅学习统计相关性,而非环境底层生成规则(如物理不变性与守恒律);学习这些不变性是实现鲁棒外推的关键。 Method: 提出两阶段方法:1)Symmetry Exploration——基于哈密顿量的无监督探索策略,以守恒律为内在好奇心驱动智能体主动采集物理信息丰富的数据;2)哈密顿量驱动的世界模型,利用新颖的自监督对比目标,从原始像素观测中解耦并识别不变物理状态。 Result: DreamSAC在3D物理仿真任务中显著优于当前最优基线,尤其在需外推的场景下表现突出。 Conclusion: 主动探索物理对称性并建模哈密顿不变性,可有效提升世界模型的外推泛化能力,为构建具备物理理解力的智能体提供新范式。 Abstract: Learned world models excel at interpolative generalization but fail at extrapolative generalization to novel physical properties. This limitation arises because they learn statistical correlations rather than the environment's underlying generative rules, such as physical invariances and conservation laws. We argue that learning these invariances is key to robust extrapolation. To achieve this, we first introduce \textbf{Symmetry Exploration}, an unsupervised exploration strategy where an agent is intrinsically motivated by a Hamiltonian-based curiosity bonus to actively probe and challenge its understanding of conservation laws, thereby collecting physically informative data. Second, we design a Hamiltonian-based world model that learns from the collected data, using a novel self-supervised contrastive objective to identify the invariant physical state from raw, view-dependent pixel observations. Our framework, \textbf{DreamSAC}, trained on this actively curated data, significantly outperforms state-of-the-art baselines in 3D physics simulations on tasks requiring extrapolation.[224] ReconDrive: Fast Feed-Forward 4D Gaussian Splatting for Autonomous Driving Scene Reconstruction
Haibao Yu,Kuntao Xiao,Jiahang Wang,Ruiyang Hao,Yuxin Huang,Guoran Hu,Haifang Qin,Bowen Jing,Yuntian Bo,Ping Luo
Main category: cs.CV
TL;DR: 本文提出ReconDrive,一种基于3D基础模型VGGT的前馈式框架,用于快速生成高保真4D高斯泼溅(4DGS),以支持自动驾驶的逼真闭环评估。它通过混合高斯预测头和动静态4D合成策略,兼顾光度质量与动态建模能力,在nuScenes上显著优于现有前馈方法,且速度远超逐场景优化方法。
Details
Motivation: 现有4D高斯泼溅方法存在两难:逐场景优化方法精度高但计算昂贵、不可扩展;前馈方法速度快但光度质量差。需兼顾高保真重建、新颖视角合成与动态场景建模的可扩展方案。 Method: 提出ReconDrive框架:1)基于VGGT基础模型;2)引入混合高斯预测头,解耦空间坐标与外观属性回归;3)设计静态-动态4D合成策略,通过显式速度建模表征运动。 Result: 在nuScenes数据集上,ReconDrive在重建质量、新视角合成和3D感知任务上显著超越现有前馈基线,性能媲美逐场景优化方法,但推理速度快数个数量级。 Conclusion: ReconDrive为大规模城市环境下的自动驾驶仿真提供了高效、高保真、可扩展的4D场景重建解决方案,弥合了前馈效率与优化精度之间的鸿沟。 Abstract: High-fidelity visual reconstruction and novel-view synthesis are essential for realistic closed-loop evaluation in autonomous driving. While 4D Gaussian Splatting (4DGS) offers a promising balance of accuracy and efficiency, existing per-scene optimization methods require costly iterative refinement, rendering them unscalable for extensive urban environments. Conversely, current feed-forward approaches often suffer from degraded photometric quality. To address these limitations, we propose ReconDrive, a feed-forward framework that leverages and extends the 3D foundation model VGGT for rapid, high-fidelity 4DGS generation. Our architecture introduces two core adaptations to tailor the foundation model to dynamic driving scenes: (1) Hybrid Gaussian Prediction Heads, which decouple the regression of spatial coordinates and appearance attributes to overcome the photometric deficiencies inherent in generalized foundation features; and (2) a Static-Dynamic 4D Composition strategy that explicitly captures temporal motion via velocity modeling to represent complex dynamic environments. Benchmarked on nuScenes, ReconDrive significantly outperforms existing feed-forward baselines in reconstruction, novel-view synthesis, and 3D perception. It achieves performance competitive with per-scene optimization while being orders of magnitude faster, providing a scalable and practical solution for realistic driving simulation.[225] Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning
Weijia Feng,Jingyu Yang,Ruojia Zhang,Fengtao Sun,Qian Gao,Chenyang Wang,Tongtong Su,Jia Guo,Xiaobai Li,Minglai Shao
Main category: cs.CV
TL;DR: 本文提出了一种基于主动推理的微手势识别框架,通过期望自由能(EFE)引导的时间采样和不确定性感知的自适应学习,提升在小样本、噪声和跨被试场景下的鲁棒性与可解释性。
Details
Motivation: 微手势幅度小、持续时间短、个体差异大,导致现有深度模型在低样本、噪声和跨被试条件下性能下降。 Method: 构建基于主动推理的框架,包含EFE引导的动态时间片段选择机制和由预测不确定性驱动的样本加权自适应学习机制。 Result: 在SMG数据集上验证了方法有效性,显著提升多种主流骨干网络性能;消融实验证明EFE引导采样与不确定性加权学习均关键。 Conclusion: 该工作为资源受限与噪声环境下的时序行为建模提供了可解释、可扩展的新范式,适用于可穿戴传感、人机交互与临床情绪监测。 Abstract: Micro-gestures are subtle and transient movements triggered by unconscious neural and emotional activities, holding great potential for human-computer interaction and clinical monitoring. However, their low amplitude, short duration, and strong inter-subject variability make existing deep models prone to degradation under low-sample, noisy, and cross-subject conditions. This paper presents an active inference-based framework for micro-gesture recognition, featuring Expected Free Energy (EFE)-guided temporal sampling and uncertainty-aware adaptive learning. The model actively selects the most discriminative temporal segments under EFE guidance, enabling dynamic observation and information gain maximization. Meanwhile, sample weighting driven by predictive uncertainty mitigates the effects of label noise and distribution shift. Experiments on the SMG dataset demonstrate the effectiveness of the proposed method, achieving consistent improvements across multiple mainstream backbones. Ablation studies confirm that both the EFE-guided observation and the adaptive learning mechanism are crucial to the performance gains. This work offers an interpretable and scalable paradigm for temporal behavior modeling under low-resource and noisy conditions, with broad applicability to wearable sensing, HCI, and clinical emotion monitoring.[226] PureCC: Pure Learning for Text-to-Image Concept Customization
Zhichao Liao,Xiaole Xian,Qingyu Li,Wenyu Qin,Meng Wang,Weicheng Xie,Siyang Song,Pingfa Feng,Long Zeng,Liang Pan
Main category: cs.CV
TL;DR: PureCC提出一种解耦学习目标和双分支训练流程,以在高保真定制个性化概念的同时,最大程度保留原始模型的行为与能力,并引入自适应引导尺度λ*动态平衡定制保真度与模型保护。
Details
Motivation: 现有概念定制方法忽视了学习新个性化概念对原始模型行为和能力的影响。 Method: PureCC提出解耦学习目标,结合目标概念的隐式引导与原始条件预测;设计双分支训练流程(冻结提取器提供纯净概念表征,可训练流模型生成原始条件预测);引入自适应引导尺度λ*动态调节目标概念引导强度。 Result: 实验表明PureCC在保持原始模型行为与能力方面达到SOTA,同时支持高保真概念定制。 Conclusion: PureCC有效实现了个性化概念定制与原始模型能力保护之间的平衡,是一种‘纯净’的概念定制方法。 Abstract: Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization. However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts. To address this issue, we propose PureCC. PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale $λ^\star$ to dynamically adjust the guidance strength of the target concept, balancing customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization. The code is available at https://github.com/lzc-sg/PureCC.[227] Brain-WM: Brain Glioblastoma World Model
Chenhui Wang,Boyun Zheng,Liuxin Bao,Zhihao Peng,Peter Y. M. Woo,Hongming Shan,Yixuan Yuan
Main category: cs.CV
TL;DR: 本文提出Brain-WM,一种用于胶质母细胞瘤(GBM)的生成式世界模型,联合预测治疗方案与生成未来MRI影像,建模肿瘤与治疗的动态互作关系;采用Y型MoT架构和多时间点掩码对齐目标,在多个队列中实现高精度治疗规划与MRI生成。
Details
Motivation: 现有生成AI方法将治疗干预视为静态条件输入,无法刻画肿瘤演化与治疗响应之间的复杂双向动态关系,限制了精准预后建模能力。 Method: 提出Brain-WM:1)在共享时空隐空间中联合进行自回归治疗预测与基于流的未来MRI生成;2)采用Y型混合Transformer(MoT)架构解耦异构任务并促进跨任务协同;3)引入多时间点掩码对齐损失,使隐表征锚定于解剖结构与进展语义。 Result: 在内外部多中心队列上验证:治疗规划准确率达91.5%,FLAIR/T1CE/T2W序列MRI生成SSIM分别为0.8524/0.8581/0.8404。 Conclusion: Brain-WM首次实现了GBM中治疗决策与影像演化的联合建模,为临床提供可信赖的‘数字沙盒’,推动个体化精准干预。 Abstract: Precise prognostic modeling of glioblastoma (GBM) under varying treatment interventions is essential for optimizing clinical outcomes. While generative AI has shown promise in simulating GBM evolution, existing methods typically treat interventions as static conditional inputs rather than dynamic decision variables. Consequently, they fail to capture the complex, reciprocal interplay between tumor evolution and treatment response. To bridge this gap, we present Brain-WM, a pioneering brain GBM world model that unifies next-step treatment prediction and future MRI generation, thereby capturing the co-evolutionary dynamics between tumor and treatment. Specifically, Brain-WM encodes spatiotemporal dynamics into a shared latent space for joint autoregressive treatment prediction and flow-based future MRI generation. Then, instead of a conventional monolithic framework, Brain-WM adopts a novel Y-shaped Mixture-of-Transformers (MoT) architecture. This design structurally disentangles heterogeneous objectives, successfully leveraging cross-task synergies while preventing feature collapse. Finally, a synergistic multi-timepoint mask alignment objective explicitly anchors latent representations to anatomically grounded tumor structures and progression-aware semantics. Extensive validation on internal and external multi-institutional cohorts demonstrates the superiority of Brain-WM, achieving 91.5% accuracy in treatment planning and SSIMs of 0.8524, 0.8581, and 0.8404 for FLAIR, T1CE, and T2W sequences, respectively. Ultimately, Brain-WM offers a robust clinical sandbox for optimizing patient healthcare. The source code is made available at https://github.com/thibault-wch/Brain-GBM-world-model.[228] SiamGM: Siamese Geometry-Aware and Motion-Guided Network for Real-Time Satellite Video Object Tracking
Zixiao Wen,Zhen Yang,Jiawei Li,Xiantai Xiang,Guangyao Zhou,Yuxin Hu,Yuhan Liu
Main category: cs.CV
TL;DR: 本文提出SiamGM,一种几何感知与运动引导的Siamese网络,用于卫星视频中的单目标跟踪,通过IFGA模块和LA方法解决空间模糊性,通过nPSR动态置信度与OMMR策略缓解时间信息丢失,在SatSOT和SV248S上显著提升精度与成功率,且无额外计算开销,达130 FPS实时性能。
Details
Motivation: 卫星视频中单目标跟踪面临小目标、背景模糊、长宽比剧烈变化及频繁遮挡等挑战,导致基于外观的跟踪器易累积误差并永久丢失目标。 Method: 提出SiamGM框架:空间上采用Inter-Frame Graph Attention(IFGA)模块与Aspect Ratio-Constrained Label Assignment(LA)方法;时间上引入Motion Vector-Guided Online Tracking Optimization,结合Normalized Peak-to-Sidelobe Ratio(nPSR)动态置信度与Online Motion Model Refinement(OMMR)策略。 Result: 在SatSOT和SV248S两个挑战性基准上,SiamGM在精度和成功率指标上均超越多数SOTA跟踪器,并实现130 FPS实时性能,且无显著计算开销。 Conclusion: SiamGM通过协同建模空间拓扑结构与时间运动动态,有效缓解卫星视频跟踪中的核心难点,兼具高性能与高效率,为遥感视频跟踪提供了新范式。 Abstract: Single object tracking in satellite videos is inherently challenged by small target, blurred background, large aspect ratio changes, and frequent visual occlusions. These constraints often cause appearance-based trackers to accumulate errors and lose targets irreversibly. To systematically mitigate both spatial ambiguities and temporal information loss, we propose SiamGM, a novel geometry-aware and motion-guided Siamese network. From a spatial perspective, we introduce an Inter-Frame Graph Attention (IFGA) module, closely integrated with an Aspect Ratio-Constrained Label Assignment (LA) method, establishing fine-grained topological correspondences and explicitly preventing surrounding background noise. From a temporal perspective, we introduce the Motion Vector-Guided Online Tracking Optimization method. By adopting the Normalized Peak-to-Sidelobe Ratio (nPSR) as a dynamic confidence indicator, we propose an Online Motion Model Refinement (OMMR) strategy to utilize historical trajectory information. Evaluations on two challenging SatSOT and SV248S benchmarks confirm that SiamGM outperforms most state-of-the-art trackers in both precision and success metrics. Notably, the proposed components of SiamGM introduce virtually no computational overhead, enabling real-time tracking at 130 frames per second (FPS). Codes and tracking results are available at https://github.com/wenzx18/SiamGM.[229] GRD-Net: Generative-Reconstructive-Discriminative Anomaly Detection with Region of Interest Attention Module
Niccolò Ferrari,Michele Fraccaroli,Evelina Lamma
Main category: cs.CV
TL;DR: 本文提出了一种基于ResAE-GAN与ROI引导分割的两阶段表面异常检测方法,无需依赖传统后处理算法,提升了工业场景下缺陷定位的泛化性与实用性。
Details
Motivation: 现有表面异常检测方法依赖手工设计的后处理(如blob分析),泛化能力差;且工业场景中仅部分区域(ROI)存在关键缺陷,需针对性建模。 Method: 构建双模块架构:第一模块为基于残差自编码器的生成对抗网络(ResAE-GAN),用于图像重建与去噪;第二模块为ROI引导的缺陷分割网络,训练时仅在标注的感兴趣区域上优化判别器。模型仅用正常样本加合成缺陷进行弱监督训练。 Result: 在MVTec AD数据集及真实制药BFS瓶盖带工业数据集上验证有效,显著减少对传统预/后处理算法的依赖,提升缺陷定位精度与跨场景适应性。 Conclusion: 该ROI感知的两阶段框架实现了更鲁棒、可解释且面向工业落地的表面异常检测,为无缺陷样本依赖的弱监督学习提供了新思路。 Abstract: Anomaly detection is nowadays increasingly used in industrial applications and processes. One of the main fields of the appliance is the visual inspection for surface anomaly detection, which aims to spot regions that deviate from regularity and consequently identify abnormal products. Defect localization is a key task, that usually is achieved using a basic comparison between generated image and the original one, implementing some blob-analysis or image-editing algorithms, in the post-processing step, which is very biased towards the source dataset, and they are unable to generalize. Furthermore, in industrial applications, the totality of the image is not always interesting but could be one or some regions of interest (ROIs), where only in those areas there are relevant anomalies to be spotted. For these reasons, we propose a new architecture composed by two blocks. The first block is a Generative Adversarial Network (GAN), based on a residual autoencoder (ResAE), to perform reconstruction and denoising processes, while the second block produces image segmentation, spotting defects. This method learns from a dataset composed of good products and generated synthetic defects. The discriminative network is trained using a ROI for each image contained in the training dataset. The network will learn in which area anomalies are relevant. This approach guarantees the reduction of using pre-processing algorithms, formerly developed with blob-analysis and image-editing procedures. To test our model we used challenging MVTec anomaly detection datasets and an industrial large dataset of pharmaceutical BFS strips of vials. This set constitutes a more realistic use case of the aforementioned network.[230] Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance
Guodong Sun,Junjie Liu,Gaoyang Zhang,Bo Wu,Yang Zhang
Main category: cs.CV
TL;DR: 本文提出了一种高效的RGB-D场景理解模型,支持语义分割、实例分割、方向估计、全景分割和场景分类等多种任务,通过增强的融合编码器、归一化聚焦通道层、上下文特征交互层及多任务自适应损失函数,在NYUv2、SUN RGB-D和Cityscapes数据集上实现了精度与速度的双重提升。
Details
Motivation: 传统方法在处理遮挡、边界模糊以及任务和样本自适应注意力方面存在不足。 Method: 提出一种高效RGB-D场景理解模型,包含增强融合编码器、归一化聚焦通道层、上下文特征交互层、非瓶颈1D结构用于实例分割,以及多任务自适应损失函数。 Result: 在NYUv2、SUN RGB-D和Cityscapes数据集上,分割精度和处理速度均优于现有方法。 Conclusion: 所提模型有效缓解了RGB-D场景理解中的关键挑战,具备良好的多任务泛化能力和实时性潜力。 Abstract: Scene understanding plays a critical role in enabling intelligence and autonomy in robotic systems. Traditional approaches often face challenges, including occlusions, ambiguous boundaries, and the inability to adapt attention based on task-specific requirements and sample variations. To address these limitations, this paper presents an efficient RGB-D scene understanding model that performs a range of tasks, including semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification. The proposed model incorporates an enhanced fusion encoder, which effectively leverages redundant information from both RGB and depth inputs. For semantic segmentation, we introduce normalized focus channel layers and a context feature interaction layer, designed to mitigate issues such as shallow feature misguidance and insufficient local-global feature representation. The instance segmentation task benefits from a non-bottleneck 1D structure, which achieves superior contour representation with fewer parameters. Additionally, we propose a multi-task adaptive loss function that dynamically adjusts the learning strategy for different tasks based on scene variations. Extensive experiments on the NYUv2, SUN RGB-D, and Cityscapes datasets demonstrate that our approach outperforms existing methods in both segmentation accuracy and processing speed.[231] A Systematic Comparison of Training Objectives for Out-of-Distribution Detection in Image Classification
Furkan Genç,Onat Özdemir,Emre Akbaş
Main category: cs.CV
TL;DR: 本文系统比较了四种训练目标(交叉熵损失、原型损失、三元组损失和平均精度损失)在图像分类OOD检测中的表现,发现交叉熵损失在近端和远端OOD检测中表现最稳定。
Details
Motivation: 现有研究对训练目标如何影响OOD检测性能关注不足,本文旨在填补这一空白。 Method: 在标准化OpenOOD协议下,对CIFAR-10/100和ImageNet-200数据集,系统评估交叉熵损失、原型损失、三元组损失和平均精度损失四种训练目标的OOD检测性能。 Result: 交叉熵损失、原型损失和AP损失在分布内准确率上相当;交叉熵损失在近端和远端OOD检测中整体表现最稳定,其他目标在特定场景下可具竞争力。 Conclusion: 训练目标显著影响OOD检测性能,交叉熵损失因其稳定性成为基准选择,但其他目标在特定设置下也具潜力。 Abstract: Out-of-distribution (OOD) detection is critical in safety-sensitive applications. While this challenge has been addressed from various perspectives, the influence of training objectives on OOD behavior remains comparatively underexplored. In this paper, we present a systematic comparison of four widely used training objectives: Cross-Entropy Loss, Prototype Loss, Triplet Loss, and Average Precision (AP) Loss, spanning probabilistic, prototype-based, metric-learning, and ranking-based supervision, for OOD detection in image classification under standardized OpenOOD protocols. Across CIFAR-10/100 and ImageNet-200, we find that Cross-Entropy Loss, Prototype Loss, and AP Loss achieve comparable in-distribution accuracy, while Cross-Entropy Loss provides the most consistent near- and far-OOD performance overall; the other objectives can be competitive in specific settings.[232] Integration of deep generative Anomaly Detection algorithm in high-speed industrial line
Niccolò Ferrari,Nicola Zanarini,Michele Fraccaroli,Alice Bizzarri,Evelina Lamma
Main category: cs.CV
TL;DR: 本文提出了一种面向高速制药吹灌封(BFS)产线的半监督异常检测框架,基于带残差自编码器和稠密瓶颈的生成对抗架构,仅用正常样本训练,通过重建残差实现异常分类与热图定位,满足500ms实时性要求。
Details
Motivation: 工业药企视觉检测需高精度、低延迟、小硬件体积与低成本;人工检测存在主观性和吞吐量瓶颈,传统规则方法难以适应产线高变异性。 Method: 提出一种半监督异常检测框架,采用生成对抗架构,核心为残差自编码器与稠密瓶颈结构,仅使用正常样本训练,通过重建残差进行异常判别,并输出空间定位热图。 Result: 在真实工业测试集上实现高检测性能,且推理时间满足500ms采集窗口的实时部署约束。 Conclusion: 该框架兼顾检测精度、实时性与部署可行性,为高动态制药产线提供了可落地的轻量级在线异常检测解决方案。 Abstract: Industrial visual inspection in pharmaceutical production requires high accuracy under strict constraints on cycle time, hardware footprint, and operational cost. Manual inline inspection is still common, but it is affected by operator variability and limited throughput. Classical rule-based computer vision pipelines are often rigid and difficult to scale to highly variable production scenarios. To address these limitations, we present a semi-supervised anomaly detection framework based on a generative adversarial architecture with a residual autoencoder and a dense bottleneck, specifically designed for online deployment on a high-speed Blow-Fill-Seal (BFS) line. The model is trained only on nominal samples and detects anomalies through reconstruction residuals, providing both classification and spatial localization via heatmaps. The training set contains 2,815,200 grayscale patches. Experiments on a real industrial test kit show high detection performance while satisfying timing constraints compatible with a 500 ms acquisition slot.[233] 3DGS-HPC: Distractor-free 3D Gaussian Splatting with Hybrid Patch-wise Classification
Jiahao Chen,Yipeng Qin,Ganlong Zhao,Xin Li,Wenping Wang,Guanbin Li
Main category: cs.CV
TL;DR: 本文提出3DGS-HPC框架,通过局部空间一致性驱动的块级分类策略和自适应融合光度与感知线索的混合分类指标,有效缓解真实场景中运动物体、阴影等瞬态干扰对3D高斯泼溅(3DGS)重建质量的影响。
Details
Motivation: 3D高斯泼溅(3DGS)在真实场景中因运动物体、变化阴影等瞬态干扰导致重建质量下降;现有基于预训练语义模型的方法语义与静态/瞬态二元划分不匹配,且在3DGS优化引发的外观扰动下鲁棒性差。 Method: 提出3DGS-HPC框架:1)块级分类策略,利用局部空间一致性实现鲁棒的区域级判断;2)混合分类指标,自适应融合光度误差与感知特征以提升静态/瞬态分离可靠性。 Result: 大量实验证明该方法在抑制瞬态干扰、提升3DGS新视角合成质量方面优于现有方法,具备更强鲁棒性。 Conclusion: 3DGS-HPC通过摒弃依赖脆弱语义先验的方式,转而采用更贴合任务本质的局部一致性和多模态判别机制,显著提升了3DGS在复杂真实场景下的实用性与稳定性。 Abstract: 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in novel view synthesis and 3D scene reconstruction, yet its quality often degrades in real-world environments due to transient distractors, such as moving objects and varying shadows. Existing methods commonly rely on semantic cues extracted from pre-trained vision models to identify and suppress these distractors, but such semantics are misaligned with the binary distinction between static and transient regions and remain fragile under the appearance perturbations introduced during 3DGS optimization. We propose 3DGS-HPC, a framework that circumvents these limitations by combining two complementary principles: a patch-wise classification strategy that leverages local spatial consistency for robust region-level decisions, and a hybrid classification metric that adaptively integrates photometric and perceptual cues for more reliable separation. Extensive experiments demonstrate the superiority and robustness of our method in mitigating distractors to improve 3DGS-based novel view synthesis.[234] Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
Chenxi Li,Xianggan Liu,Dake Shen,Yaosong Du,Zhibo Yao,Hao Jiang,Linyi Jiang,Chengwei Cao,Jingzhe Zhang,RanYi Peng,Peiling Bai,Xiande Huang
Main category: cs.CV
TL;DR: 本文提出了一种名为StructAttack的新型单查询黑盒 jailbreak 攻击方法,利用LVLMs在结构化视觉提示(如思维导图、表格)中对语义槽填充的脆弱性,诱导模型生成有害输出而不触发安全机制。
Details
Motivation: 大型视觉语言模型(LVLMs)在融合视觉模态后引入了新的安全漏洞,尤其是语义槽填充过程中可能被恶意利用,而现有研究对此关注不足。 Method: 提出StructAttack框架:将有害查询分解为中心主题和一系列看似良性的槽类型,嵌入结构化视觉提示(如思维导图、表格、日冕图),辅以微小随机扰动和补全引导指令,诱使LVLM自动重组语义并输出有害内容。 Result: 在多个LVLM和基准测试上验证了StructAttack的有效性,成功绕过安全机制,揭示了LVLM在结构化视觉推理中的深层安全缺陷。 Conclusion: LVLMs虽具强大多模态理解能力,但其对结构化视觉提示中局部良性槽值的整体语义重组能力构成了严重安全隐患;StructAttack为评估和提升LVLM安全性提供了新视角与实用工具。 Abstract: Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose StructAttack, a simple yet effective single-query jailbreak framework under black-box settings. StructAttack decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. Paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), StructAttack exploits LVLMs' reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models and benchmarks show the efficacy of our proposed StructAttack.[235] Fast Attention-Based Simplification of LiDAR Point Clouds for Object Detection and Classification
Z. Rozsa,Á. Madaras,Q. Wei,X. Lu,M. Golarits,H. Yuan,T. Sziranyi,R. Hamzaoui
Main category: cs.CV
TL;DR: 本文提出了一种高效的、端到端训练的深度学习点云简化方法,结合特征嵌入与注意力采样模块,在保持甚至提升3D检测与分类精度的同时显著优于传统FPS的速度,并比随机采样更稳定地维持高采样率下的精度。
Details
Motivation: LiDAR点云数据量大、计算开销高,现有采样方法在速度与精度之间难以兼顾,亟需一种高效且准确的简化方法以支持实时自动驾驶应用。 Method: 提出一种基于学习的点云简化方法,包含特征嵌入模块和注意力机制驱动的采样模块,端到端训练,以优先保留任务相关区域的点。 Result: 在KITTI上的3D目标检测及四个数据集上的目标分类任务中,该方法始终快于最远点采样(FPS),精度相当甚至更优(尤其在强降采样下);虽慢于随机采样(RS),但在高采样率下精度保持更稳定。 Conclusion: 所提方法在计算效率与感知性能之间实现了更优平衡,为LiDAR点云的实时处理提供了可行方案。 Abstract: LiDAR point clouds are widely used in autonomous driving and consist of large numbers of 3D points captured at high frequency to represent surrounding objects such as vehicles, pedestrians, and traffic signs. While this dense data enables accurate perception, it also increases computational cost and power consumption, which can limit real-time deployment. Existing point cloud sampling methods typically face a trade-off: very fast approaches tend to reduce accuracy, while more accurate methods are computationally expensive. To address this limitation, we propose an efficient learned point cloud simplification method for LiDAR data. The method combines a feature embedding module with an attention-based sampling module to prioritize task-relevant regions and is trained end-to-end. We evaluate the method against farthest point sampling (FPS) and random sampling (RS) on 3D object detection on the KITTI dataset and on object classification across four datasets. The method was consistently faster than FPS and achieved similar, and in some settings better, accuracy, with the largest gains under aggressive downsampling. It was slower than RS, but it typically preserved accuracy more reliably at high sampling ratios.[236] EmbedTalk: Triplane-Free Talking Head Synthesis using Embedding-Driven Gaussian Deformation
Arpita Saggar,Jonathan C. Darling,Duygu Sarikaya,David C. Hogg
Main category: cs.CV
TL;DR: EmbedTalk提出使用学习到的嵌入(learnt embeddings)替代传统的三平面(tri-plane)编码,用于驱动语音相关的面部变形,在保持实时性能的同时显著提升了渲染质量、唇形同步和运动一致性。
Details
Motivation: 传统三平面编码受限于网格分辨率和3D体素场投影到2D子空间带来的近似误差;而学习嵌入在4D场景重建中已展现出对时序变形建模的优势,因此探索其在说话人头像合成中的应用。 Method: 提出EmbedTalk方法,用可学习的嵌入表示替代三平面作为3D高斯变形的驱动编码,并结合语音信号驱动面部动态变形。 Result: 在渲染质量、唇同步精度和运动一致性上优于现有基于3DGS的方法,且模型更紧凑,在RTX 2060移动GPU上达60+ FPS,性能媲美最先进生成模型。 Conclusion: 学习嵌入比三平面更适合建模语音驱动的面部变形,为实时 talking head 合成提供了更高效、高质量的新范式。 Abstract: Real-time talking head synthesis increasingly relies on deformable 3D Gaussian Splatting (3DGS) due to its low latency. Tri-planes are the standard choice for encoding Gaussians prior to deformation, since they provide a continuous domain with explicit spatial relationships. However, tri-plane representations are limited by grid resolution and approximation errors introduced by projecting 3D volumetric fields onto 2D subspaces. Recent work has shown the superiority of learnt embeddings for driving temporal deformations in 4D scene reconstruction. We introduce $\textbf{EmbedTalk}$, which shows how such embeddings can be leveraged for modelling speech deformations in talking head synthesis. Through comprehensive experiments, we show that EmbedTalk outperforms existing 3DGS-based methods in rendering quality, lip synchronisation, and motion consistency, while remaining competitive with state-of-the-art generative models. Moreover, replacing the tri-plane encoding with learnt embeddings enables significantly more compact models that achieve over 60 FPS on a mobile GPU (RTX 2060 6 GB). Our code will be placed in the public domain on acceptance.[237] Looking Into the Water by Unsupervised Learning of the Surface Shape
Ori Lifschitz,Tali Treibitz,Dan Rosenbaum
Main category: cs.CV
TL;DR: 本文提出了一种基于双神经场网络的无监督方法,用于从空中拍摄的水中图像中去除水面折射引起的失真,通过建模随时间变化的水面高度和恒定水下场景来实现,并利用SIREN激活函数有效重建图像并估计水面形状。
Details
Motivation: 解决从空中观测水下场景时,因水面折射导致图像失真的问题。 Method: 构建两个隐式神经场网络:一个预测时空变化的水面高度,另一个预测水下场景颜色;利用周期性激活函数SIREN建模水面高度及其导数,实现端到端无监督训练。 Result: 在仿真与真实数据上均优于当前最先进的无监督图像复原方法,并能同时提供水面形状估计。 Conclusion: 隐式神经场结合SIREN适用于建模动态水面及折射校正,为水下视觉恢复提供了新思路和实用工具。 Abstract: We address the problem of looking into the water from the air, where we seek to remove image distortions caused by refractions at the water surface. Our approach is based on modeling the different water surface structures at various points in time, assuming the underlying image is constant. To this end, we propose a model that consists of two neural-field networks. The first network predicts the height of the water surface at each spatial position and time, and the second network predicts the image color at each position. Using both networks, we reconstruct the observed sequence of images and can therefore use unsupervised training. We show that using implicit neural representations with periodic activation functions (SIREN) leads to effective modeling of the surface height spatio-temporal signal and its derivative, as required for image reconstruction. Using both simulated and real data we show that our method outperforms the latest unsupervised image restoration approach. In addition, it provides an estimate of the water surface.[238] Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models
Abin Shoby,Ta Duc Huy,Tuan Dung Nguyen,Minh Khoi Ho,Qi Chen,Anton van den Hengel,Phi Le Nguyen,Johan W. Verjans,Vu Minh Hieu Phan
Main category: cs.CV
TL;DR: 本文提出了一种新的视觉语言模型(VLM)幻觉检测方法——'过度思考得分(Overthinking Score)',通过分析解码器各层中物体假设的反复修正与不稳定性,而非依赖最终输出层信号,显著提升了幻觉检测性能。
Details
Motivation: 现有幻觉检测方法多依赖最终层信号(如注意力或熵),但作者发现幻觉常伴随峰值注意力和高置信度,根源在于中间层已收敛到错误假设并持续传播,因此需关注模型‘思考过程’而非最终输出。 Method: 通过逐层探查解码器,发现模型在幻觉发生前存在‘过度思考’现象:即多层间反复修订物体假设;据此设计‘过度思考得分’,量化模型在各层间竞争假设的数量与不稳定性。 Result: 所提Overthinking Score在MSCOCO数据集上达到78.9% F1,在AMBER上达71.58% F1,显著优于基于最终层注意力或熵的方法。 Conclusion: VLM幻觉的本质源于中间层推理过程中的假设不稳定与错误锁定,检测关键在于建模层间假设演化,而非终层输出;Overthinking Score为该方向提供了有效且可解释的指标。 Abstract: Vision Language models (VLMs) often hallucinate non-existent objects. Detecting hallucination is analogous to detecting deception: a single final statement is insufficient, one must examine the underlying reasoning process. Yet existing detectors rely mostly on final-layer signals. Attention-based methods assume hallucinated tokens exhibit low attention, while entropy-based ones use final-step uncertainty. Our analysis reveals the opposite: hallucinated objects can exhibit peaked attention due to contextual priors; and models often express high confidence because intermediate layers have already converged to an incorrect hypothesis. We show that the key to hallucination detection lies within the model's thought process, not its final output. By probing decoder layers, we uncover a previously overlooked behavior, overthinking: models repeatedly revise object hypotheses across layers before committing to an incorrect answer. Once the model latches onto a confounded hypothesis, it can propagate through subsequent layers, ultimately causing hallucination. To capture this behavior, we introduce the Overthinking Score, a metric to measure how many competing hypotheses the model entertains and how unstable these hypotheses are across layers. This score significantly improves hallucination detection: 78.9% F1 on MSCOCO and 71.58% on AMBER.[239] Duala: Dual-Level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding
Shumeng Li,Jintao Guo,Jian Zhang,Yulin Zhou,Luyang Cao,Yinghuan Shi
Main category: cs.CV
TL;DR: 本文提出Duala框架,通过刺激级语义对齐与被试级分布特征扰动,提升fMRI跨被试视觉解码性能,在少量数据下实现高准确率图像-脑信号检索与重建。
Details
Motivation: 现有跨被试视觉解码方法在新被试数据有限时性能下降,难以兼顾刺激语义一致性与脑响应对齐。 Method: 提出Duala双层对齐框架:(1)刺激级采用语义对齐与关系一致性策略,保持类内相似性与类间可分性;(2)被试级引入基于分布的特征扰动机制,建模全局与个体神经表征差异。 Result: 在NSD数据集上,仅用约一小时fMRI数据微调,图像到脑检索准确率达81.1%以上,显著优于现有微调方法。 Conclusion: Duala有效提升了跨被试fMRI视觉解码的鲁棒性与泛化能力,为实用化脑机接口提供了新思路。 Abstract: Cross-subject visual decoding aims to reconstruct visual experiences from brain activity across individuals, enabling more scalable and practical brain-computer interfaces. However, existing methods often suffer from degraded performance when adapting to new subjects with limited data, as they struggle to preserve both the semantic consistency of stimuli and the alignment of brain responses. To address these challenges, we propose Duala, a dual-level alignment framework designed to achieve stimulus-level consistency and subject-level alignment in fMRI-based cross-subject visual decoding. (1) At the stimulus level, Duala introduces a semantic alignment and relational consistency strategy that preserves intra-class similarity and inter-class separability, maintaining clear semantic boundaries during adaptation. (2) At the subject level, a distribution-based feature perturbation mechanism is developed to capture both global and subject-specific variations, enabling adaptation to individual neural representations without overfitting. Experiments on the Natural Scenes Dataset (NSD) demonstrate that Duala effectively improves alignment across subjects. Remarkably, even when fine-tuned with only about one hour of fMRI data, Duala achieves over 81.1% image-to-brain retrieval accuracy and consistently outperforms existing fine-tuning strategies in both retrieval and reconstruction. Our code is available at https://github.com/ShumengLI/Duala.[240] Real-Time Glottis Detection Framework via Spatial-decoupled Feature Learning for Nasal Transnasal Intubation
Jinyu Liu,Gaoyang Zhang,Yang Zhou,Ruoyi Hao,Yang Zhang,Hongliang Ren
Main category: cs.CV
TL;DR: 本文提出了一种轻量级实时喉部检测框架Mobile GlottisNet,专为嵌入式与边缘设备设计,用于鼻气管插管中的快速准确声门定位。
Details
Motivation: 现有机器辅助视觉检测系统依赖高性能计算资源、推理延迟高,难以满足急救等时间敏感和资源受限场景的需求。 Method: 提出Mobile GlottisNet:引入结构感知与空间对齐机制;采用分层动态阈值策略优化样本分配;基于可变形卷积设计自适应特征解耦模块以支持动态空间重建;并引入跨层动态加权机制融合多尺度语义与细节特征。 Result: 模型仅5MB,在PID与临床数据集上分别达62 FPS(设备端)与33 FPS(边缘平台),显著优于现有方法。 Conclusion: Mobile GlottisNet在保证精度的同时大幅降低计算开销与延迟,具备在急诊NTI中实际部署的潜力。 Abstract: Nasotracheal intubation (NTI) is a vital procedure in emergency airway management, where rapid and accurate glottis detection is essential to ensure patient safety. However, existing machine assisted visual detection systems often rely on high performance computational resources and suffer from significant inference delays, which limits their applicability in time critical and resource constrained scenarios. To overcome these limitations, we propose Mobile GlottisNet, a lightweight and efficient glottis detection framework designed for real time inference on embedded and edge devices. The model incorporates structural awareness and spatial alignment mechanisms, enabling robust glottis localization under complex anatomical and visual conditions. We implement a hierarchical dynamic thresholding strategy to enhance sample assignment, and introduce an adaptive feature decoupling module based on deformable convolution to support dynamic spatial reconstruction. A cross layer dynamic weighting scheme further facilitates the fusion of semantic and detail features across multiple scales. Experimental results demonstrate that the model, with a size of only 5MB on both our PID dataset and Clinical datasets, achieves inference speeds of over 62 FPS on devices and 33 FPS on edge platforms, showing great potential in the application of emergency NTI.[241] Evaluating Synthetic Data for Baggage Trolley Detection in Airport Logistics
Abdeldjalil Taibi,Mohmoud Badlis,Amina Bensalem,Belkacem Zouilekh,Mohammed Brahimi
Main category: cs.CV
TL;DR: 本文提出了一种基于高保真数字孪生和NVIDIA Omniverse的合成数据生成流程,用于解决机场行李推车检测中真实数据稀缺与标注成本高的问题;实验表明,仅用40%真实标注数据混合合成数据训练,性能可媲美或超越全量真实数据训练,同时降低25%-35%标注工作量。
Details
Motivation: 机场行李推车自动化检测面临两大挑战:一是严格的安全与隐私法规限制大规模真实数据采集;二是现有公开数据集在多样性、规模和标注质量上不足,难以应对现实中密集、重叠的推车场景。 Method: 构建阿尔及尔国际机场的高保真数字孪生模型,利用NVIDIA Omniverse生成带定向边界框(OBB)标注的合成数据;采用YOLO-OBB模型,对比五种训练策略(纯真实、纯合成、线性探测、全微调、混合训练)以评估合成数据对有限真实标注的补充效果。 Result: 混合训练(40%真实标注+合成数据)达到mAP@50=0.94、mAP@50-95=0.77,匹配或超越全量真实数据基线,标注工作量减少25–35%;多随机种子实验显示mAP@50标准差低于0.01,验证方法稳定性与实用性。 Conclusion: 高质量合成数据可有效弥补真实数据不足,在保障检测精度的同时显著降低标注成本与隐私风险,为受限场景下的视觉检测任务提供了可行范式。 Abstract: Efficient luggage trolley management is critical for reducing congestion and ensuring asset availability in modern airports. Automated detection systems face two main challenges. First, strict security and privacy regulations limit large-scale data collection. Second, existing public datasets lack the diversity, scale, and annotation quality needed to handle dense, overlapping trolley arrangements typical of real-world operations. To address these limitations, we introduce a synthetic data generation pipeline based on a high-fidelity Digital Twin of Algiers International Airport using NVIDIA Omniverse. The pipeline produces richly annotated data with oriented bounding boxes, capturing complex trolley formations, including tightly nested chains. We evaluate YOLO-OBB using five training strategies: real-only, synthetic-only, linear probing, full fine-tuning, and mixed training. This allows us to assess how synthetic data can complement limited real-world annotations. Our results show that mixed training with synthetic data and only 40 percent of real annotations matches or exceeds the full real-data baseline, achieving 0.94 mAP@50 and 0.77 mAP@50-95, while reducing annotation effort by 25 to 35 percent. Multi-seed experiments confirm strong reproducibility with a standard deviation below 0.01 on mAP@50, demonstrating the practical effectiveness of synthetic data for automated trolley detection.[242] GLASS: Graph and Vision-Language Assisted Semantic Shape Correspondence
Qinfeng Xiao,Guofeng Mei,Qilong Liu,Chenyuan Yi,Fabio Poiesi,Jian Zhang,Bo Yang,Yick Kit-lun
Main category: cs.CV
TL;DR: 本文提出GLASS框架,通过结合几何谱分析与视觉-语言基础模型的语义先验,实现无监督的3D形状稠密对应学习,尤其在非等距形变和跨类别场景下显著优于现有方法。
Details
Motivation: 传统功能映射方法依赖等距假设,在严重非等距形变和跨类别设置下性能受限;缺乏人工标注的稠密对应学习仍具挑战性。 Method: GLASS包含三项创新:(i) 视图一致的多视角视觉特征提取;(ii) 利用零样本3D分割将语言嵌入注入顶点描述符以获取部件级语义;(iii) 基于测地线与拓扑关系的图辅助对比损失,保障区域间结构一致性。 Result: 在SNIS、SMAL和TOPKIDS等基准上,平均测地误差分别降至0.21、4.5和5.6,相较URSSM基线降低57%、25%和37%;在近等距任务中保持高精度,同时大幅提升困难场景性能。 Conclusion: GLASS实现了无需真值监督的全局一致且语义一致的3D对应学习,在各类形变和跨类设置下均达到SOTA性能。 Abstract: Establishing dense correspondence across 3D shapes is crucial for fundamental downstream tasks, including texture transfer, shape interpolation, and robotic manipulation. However, learning these mappings without manual supervision remains a formidable challenge, particularly under severe non-isometric deformations and in inter-class settings where geometric cues are ambiguous. Conventional functional map methods, while elegant, typically struggle in these regimes due to their reliance on isometry. To address this, we present GLASS, a framework that bridges the gap by integrating geometric spectral analysis with rich semantic priors from vision-language foundation models. GLASS introduces three key innovations: (i) a view-consistent strategy that enables robust multi-view visual feature extraction from powerful vision foundation models; (ii) the injection of language embeddings into vertex descriptors via zero-shot 3D segmentation, capturing high-level part semantics; and (iii) a graph-assisted contrastive loss that enforces structural consistency between regions (e.g., source's head'' $\leftrightarrow$ target's head'') by leveraging geodesic and topological relationships between regions. This design allows GLASS to learn globally coherent and semantically consistent maps without ground-truth supervision. Extensive experiments demonstrate that GLASS achieves state-of-the-art performance across all regimes, maintaining high accuracy on standard near-isometric tasks while significantly advancing performance in challenging settings. Specifically, it achieves average geodesic errors of 0.21, 4.5, and 5.6 on the inter-class benchmark SNIS and non-isometric benchmarks SMAL and TOPKIDS, reducing errors from URSSM baselines of 0.49, 6.0, and 8.9 by 57%, 25%, and 37%, respectively.[243] Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework
Kaihua Tang,Jiaxin Qi,Jinli Ou,Yuhua Zheng,Jianqiang Huang
Main category: cs.CV
TL;DR: 本文提出了一种自批判推理(SCI)框架和动态鲁棒性基准(DRBench),以解决大视觉语言模型(LVLMs)中的语言偏见和敏感性问题,通过多轮文本与视觉扰动的反事实推理提升模型鲁棒性,并设计模型专属评估基准。
Details
Motivation: 现有LVLM训练范式过度依赖LLM组件,导致语言偏见和语言敏感性两大鲁棒性挑战;固定鲁棒性基准难以反映不同LVLM的真实可靠性。 Method: 提出自批判推理(SCI)框架,扩展视觉对比解码,结合多轮文本与视觉扰动进行反事实推理,并引入增加推理轮数以提升鲁棒性的策略;同时构建模型专属的动态鲁棒性基准DRBench,针对性评估语言偏见与敏感性。 Result: SCI在DRBench上持续优于基线方法;增加推理轮数可进一步提升鲁棒性,超越单步反事实推理方法。 Conclusion: SCI与DRBench共同为提升LVLM鲁棒性提供了新范式和更可靠的评估方式,推动多模态模型向更可靠、公平的方向发展。 Abstract: The emergence of Large Language Models (LLMs) has driven rapid progress in multi-modal learning, particularly in the development of Large Vision-Language Models (LVLMs). However, existing LVLM training paradigms place excessive reliance on the LLM component, giving rise to two critical robustness challenges: language bias and language sensitivity. To address both issues simultaneously, we propose a novel Self-Critical Inference (SCI) framework that extends Visual Contrastive Decoding by conducting multi-round counterfactual reasoning through both textual and visual perturbations. This process further introduces a new strategy for improving robustness by scaling the number of counterfactual rounds. Moreover, we also observe that failure cases of LVLMs differ significantly across models, indicating that fixed robustness benchmarks may not be able to capture the true reliability of LVLMs. To this end, we propose the Dynamic Robustness Benchmark (DRBench), a model-specific evaluation framework targeting both language bias and sensitivity issues. Extensive experiments show that SCI consistently outperforms baseline methods on DRBench, and that increasing the number of inference rounds further boosts robustness beyond existing single-step counterfactual reasoning methods.[244] Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
Yuanyuan Gao,Hao Li,Yifei Liu,Xinhao Ji,Yuning Gong,Yuanjun Liao,Fangfu Liu,Manyuan Zhang,Yuchen Yang,Dan Xu,Xue Yang,Huaxi Huang,Hongjie Zhang,Ziwei Liu,Xiao Sun,Dingwen Zhang,Zhihang Zhong
Main category: cs.CV
TL;DR: 本文提出Holi-Spatial,首个全自动、大规模、空间感知的多模态3D数据集,基于原始视频构建,支持多层次空间监督与空间问答,显著提升空间推理性能。
Details
Motivation: 现有空间理解基准受限于小规模人工标注数据,扩展性差且存在领域偏差,亟需从原始网络数据中自动构建大规模高质量3D数据集。 Method: 提出端到端自动化数据整理流程,从原始视频生成几何精确的3D高斯泼溅(3DGS)重建、深度图、2D/3D掩码与包围框、实例描述、3D定位及空间QA对,构建Holi-Spatial及子集Holi-Spatial-4M。 Result: Holi-Spatial在ScanNet等基准上数据质量显著优于现有方法;用于微调视觉语言模型后,在空间推理任务上性能大幅提升。 Conclusion: Holi-Spatial为 spatial intelligence 提供了可扩展、高质量、全自动化的新范式,推动3D空间理解从有限标注走向大规模自监督构建。 Abstract: The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention, using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial Question-Answer (QA) pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. Holi-Spatial demonstrates exceptional performance in data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks using this dataset has also led to substantial improvements in model performance.[245] Ref-DGS: Reflective Dual Gaussian Splatting
Ningjing Fan,Yiqun Wang,Dongming Yan,Peter Wonka
Main category: cs.CV
TL;DR: Ref-DGS提出一种双高斯溅射框架,解耦表面重建与镜面反射建模,通过几何高斯与局部反射高斯结合环境反射场,在光栅化管线中高效处理近场和远场镜面反射,显著提升训练速度与重建质量。
Details
Motivation: 强近场镜面反射严重干扰表面重建与新视角合成,现有高斯溅射方法难以兼顾建模精度与计算效率。 Method: 提出Ref-DGS:采用双高斯表示(几何高斯 + 局部反射高斯)建模近场反射,引入全局环境反射场建模远场反射,并设计轻量、物理感知的自适应混合着色器融合反射特征。 Result: 在反射场景上达到SOTA性能,训练速度显著快于基于光线追踪的高斯方法。 Conclusion: Ref-DGS在不依赖显式光线追踪的前提下,实现了高效、准确的镜面反射建模,平衡了精度与效率。 Abstract: Reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref-DGS, a reflective dual Gaussian splatting framework that addresses this trade-off by decoupling surface reconstruction from specular reflection within an efficient rasterization-based pipeline. Ref-DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near-field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far-field specular reflections. To predict specular radiance, we further propose a lightweight, physically-aware adaptive mixing shader that fuses global and local reflection features. Experiments demonstrate that Ref-DGS achieves state-of-the-art performance on reflective scenes while training substantially faster than ray-based Gaussian methods.[246] FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration
Congcong Bian,Haolong Ma,Hui Li,Zhongwei Shen,Xiaoqing Luo,Xiaoning Song,Xiao-Jun Wu
Main category: cs.CV
TL;DR: 本文提出了一种名为FusionRegister的跨模态配准方法,用于红外与可见光图像融合,通过视觉先验引导实现高效、鲁棒且通用的配准。
Details
Motivation: 现有基于配准的多模态图像融合方法依赖大量预配准操作,效率低且鲁棒性不足。 Method: FusionRegister学习跨模态错位表征而非强制对齐全部差异;直接作用于融合结果,显式建模并处理错位;同时作为视觉先验提供者,仅聚焦错配区域以提升效率。 Result: 在三个数据集上的实验表明,FusionRegister在保持先进融合质量的同时,显著提升了细节对齐精度与鲁棒性。 Conclusion: FusionRegister是一种高效、鲁棒、通用的跨模态配准方法,适用于红外与可见光图像融合任务。 Abstract: Spatial registration across different visual modalities is a critical but formidable step in multi-modality image fusion for real-world perception. Although several methods are proposed to address this issue, the existing registration-based fusion methods typically require extensive pre-registration operations, limiting their efficiency. To overcome these limitations, a general cross-modality registration method guided by visual priors is proposed for infrared and visible image fusion task, termed FusionRegister. Firstly, FusionRegister achieves robustness by learning cross-modality misregistration representations rather than forcing alignment of all differences, ensuring stable outputs even under challenging input conditions. Moreover, FusionRegister demonstrates strong generality by operating directly on fused results, where misregistration is explicitly represented and effectively handled, enabling seamless integration with diverse fusion methods while preserving their intrinsic properties. In addition, its efficiency is further enhanced by serving the backbone fusion method as a natural visual prior provider, which guides the registration process to focus only on mismatch regions, thereby avoiding redundant operations. Extensive experiments on three datasets demonstrate that FusionRegister not only inherits the fusion quality of state-of-the-art methods, but also delivers superior detail alignment and robustness, making it highly suitable for infrared and visible image fusion method. The code will be available at https://github.com/bociic/FusionRegister.[247] FrameVGGT: Frame Evidence Rolling Memory for streaming VGGT
Zhisong Xu,Takeshi Oishi
Main category: cs.CV
TL;DR: 本文提出FrameVGGT,一种基于帧的滚动显式记忆框架,通过将每帧的KV贡献视为连贯证据块并压缩为紧凑原型,在固定内存预算下维持互补帧块的中期记忆库,从而在长序列3D感知任务中实现更优的精度-内存权衡与几何稳定性。
Details
Motivation: 现有流式视觉几何Transformer(如StreamVGGT)因KV缓存无界增长而难以长期部署;作者从几何支撑角度重新审视有界内存流式处理,指出仅靠token级保留会削弱单帧内证据密度,降低历史融合鲁棒性。 Method: 提出FrameVGGT框架:以帧为单位组织KV更新,每帧生成一个连贯证据块,压缩为紧凑原型;维护固定容量的中期帧块记忆库,并可选配轻量锚定层应对长期退化。 Result: 在长序列3D重建、视频深度估计和相机位姿估计基准上,FrameVGGT在受限内存下取得优于现有方法的精度-内存权衡,且几何推理更稳定。 Conclusion: 帧驱动的显式记忆机制比token级截断更适合几何感知任务,在有限内存约束下能更好保持局部几何支撑一致性与长期稳定性。 Abstract: Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception but suffer from unbounded KV-cache growth, which limits deployment over long streams. We revisit bounded-memory streaming from the perspective of geometric support. In geometry-driven reasoning, memory quality depends not only on how many tokens are retained, but also on whether the retained memory still preserves sufficiently coherent local support. This suggests that token-level retention may become less suitable under fixed budgets, as it can thin the evidence available within each contributing frame and make subsequent fusion more sensitive to weakly aligned history. Motivated by this observation, we propose FrameVGGT, a frame-driven rolling explicit-memory framework that treats each frame's incremental KV contribution as a coherent evidence block. FrameVGGT summarizes each block into a compact prototype and maintains a fixed-capacity mid-term bank of complementary frame blocks under strict budgets, with an optional lightweight anchor tier for rare prolonged degradation. Across long-sequence 3D reconstruction, video depth estimation, and camera pose benchmarks, FrameVGGT achieves favorable accuracy--memory trade-offs under bounded memory, while maintaining more stable geometry over long streams.[248] Compressed-Domain-Aware Online Video Super-Resolution
Yuhang Wang,Hai Li,Shujuan Hou,Zhetao Dong,Xiaoyao Yang
Main category: cs.CV
TL;DR: 本文提出了一种面向压缩域的在线视频超分辨率网络CDA-VSR,利用运动矢量、残差图和帧类型等压缩域信息,在保证重建质量的同时显著提升推理效率。
Details
Motivation: 现有在线视频超分辨率方法计算开销大、难以实现实时高分辨率处理,主要受限于复杂的运动估计对齐和连续帧的冗余处理。 Method: 提出CDA-VSR网络:1)运动矢量引导的可变形对齐模块,用运动矢量粗对齐+学习局部残差偏移微调;2)残差图门控融合模块,利用残差图生成空间权重抑制错配区域;3)帧类型感知重建模块,按帧类型自适应分配计算资源。 Result: 在REDS4数据集上PSNR最高提升0.13 dB,推理速度超SOTA方法TMP两倍以上。 Conclusion: 利用压缩域先验可有效平衡在线VSR的质量与效率,所提模块设计合理且实用。 Abstract: In bandwidth-limited online video streaming, videos are usually downsampled and compressed. Although recent online video super-resolution (online VSR) approaches achieve promising results, they are still compute-intensive and fall short of real-time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed-domain-aware network (CDA-VSR) for online VSR, which utilizes compressed-domain information, including motion vectors, residual maps, and frame types to balance quality and efficiency. Specifically, we propose a motion-vector-guided deformable alignment module that uses motion vectors for coarse warping and learns only local residual offsets for fine-tuned adjustments, thereby maintaining accuracy while reducing computation. Then, we utilize a residual map gated fusion module to derive spatial weights from residual maps, suppressing mismatched regions and emphasizing reliable details. Further, we design a frame-type-aware reconstruction module for adaptive compute allocation across frame types, balancing accuracy and efficiency. On the REDS4 dataset, our CDA-VSR surpasses the state-of-the-art method TMP, with a maximum PSNR improvement of 0.13 dB while delivering more than double the inference speed. The code will be released at https://github.com/sspBIT/CDA-VSR.[249] Learning Context-Adaptive Motion Priors for Masked Motion Diffusion Models with Efficient Kinematic Attention Aggregation
Junkun Jiang,Jie Chen,Ho Yin Au,Jingyu Xiang
Main category: cs.CV
TL;DR: 本文提出了一种基于扩散模型的运动重建框架MMDM,结合掩码自编码器与新型运动学注意力聚合(KAA)机制,能自适应学习多种任务(如补全、细化、插帧)所需的运动先验,在遮挡和噪声场景下显著提升3D动作重建质量。
Details
Motivation: 视觉动捕易受遮挡影响导致关节点信息丢失;可穿戴设备则常产生噪声或不稳定数据,需大量人工修正。亟需一种鲁棒、自适应、少依赖人工干预的运动重建方法。 Method: 提出Masked Motion Diffusion Model(MMDM),基于扩散模型与掩码自编码器架构;核心是Kinematic Attention Aggregation(KAA)机制,实现关节级与姿态级特征的深度迭代编码;通过同一可复用网络学习多个上下文自适应的运动先验,各先验专用于特定任务(如补全、细化、in-betweening)。 Result: 在多个公开基准上验证了MMDM在不同遮挡策略与任务设置下的强泛化性与高性能,优于现有方法;代码已开源。 Conclusion: MMDM提供了一种结构不变但功能可自适应 specialize 的统一框架,有效应对运动数据不完整与低质量挑战,为生成式运动建模提供了新范式。 Abstract: Vision-based motion capture solutions often struggle with occlusions, which result in the loss of critical joint information and hinder accurate 3D motion reconstruction. Other wearable alternatives also suffer from noisy or unstable data, often requiring extensive manual cleaning and correction to achieve reliable results. To address these challenges, we introduce the Masked Motion Diffusion Model (MMDM), a diffusion-based generative reconstruction framework that enhances incomplete or low-confidence motion data using partially available high-quality reconstructions within a Masked Autoencoder architecture. Central to our design is the Kinematic Attention Aggregation (KAA) mechanism, which enables efficient, deep, and iterative encoding of both joint-level and pose-level features, capturing structural and temporal motion patterns essential for task-specific reconstruction. We focus on learning context-adaptive motion priors, specialized structural and temporal features extracted by the same reusable architecture, where each learned prior emphasizes different aspects of motion dynamics and is specifically efficient for its corresponding task. This enables the architecture to adaptively specialize without altering its structure. Such versatility allows MMDM to efficiently learn motion priors tailored to scenarios such as motion refinement, completion, and in-betweening. Extensive evaluations on public benchmarks demonstrate that MMDM achieves strong performance across diverse masking strategies and task settings. The source code is available at https://github.com/jjkislele/MMDM.[250] TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward
Yihong Luo,Tianyang Hu,Weijian Luo,Jing Tang
Main category: cs.CV
TL;DR: 本文提出TDM-R1,一种面向少步生成模型(如TDM)的新型强化学习范式,能有效利用非可微奖励信号(如人类偏好、物体数量等),通过解耦代理奖励学习与生成器学习,并在确定性生成轨迹上获取逐步奖励,显著提升少步文生图模型性能。
Details
Motivation: 现有少步扩散模型的强化学习方法严重依赖可微奖励模型的梯度回传,无法利用大量重要的不可微真实世界奖励信号(如人类二元喜好判断、物体计数等)。 Method: 基于Trajectory Distribution Matching(TDM)少步模型,提出TDM-R1框架:将学习过程解耦为代理奖励学习和生成器学习两阶段;设计实用方法从TDM的确定性生成轨迹中提取每步奖励信号,实现统一的RL后训练。 Result: 在文本渲染、视觉质量、偏好对齐等多任务上显著提升性能,达到少步文生图RL的SOTA;且成功扩展至Z-Image模型,在仅4个NFE下持续超越其100-NFE及其它少步变体。 Conclusion: TDM-R1是一种通用、高效、可扩展的少步生成模型强化学习新范式,突破了对可微奖励的依赖,为实际应用中的多样化奖励信号提供了可行路径。 Abstract: While few-step generative models have enabled powerful image and video generation at significantly lower cost, generic reinforcement learning (RL) paradigms for few-step models remain an unsolved problem. Existing RL approaches for few-step diffusion models strongly rely on back-propagating through differentiable reward models, thereby excluding the majority of important real-world reward signals, e.g., non-differentiable rewards such as humans' binary likeness, object counts, etc. To properly incorporate non-differentiable rewards to improve few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we developed practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability with generic rewards. We conduct extensive experiments ranging from text-rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art reinforcement learning performances on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1[251] PARSE: Part-Aware Relational Spatial Modeling
Yinuo Bai,Peijun Xu,Kuixiang Shao,Yuyang Jiao,Jingxuan Zhang,Kaixin Yao,Jiayuan Gu,Jingyi Yu
Main category: cs.CV
TL;DR: 本文提出PARSE框架,通过部分级装配图(PAG)和部分感知空间配置求解器,实现对物体部件间几何关系的建模与物理合理的3D场景生成,并构建PARSE-10K数据集以支持空间推理与生成任务。
Details
Motivation: 现有空间表示(如语言介词或物体级场景图)过于粗糙,无法精确描述支撑、包含、接触等具体区域关系,导致布局模糊且物理不一致,需引入部件级建模。 Method: 提出PARSE框架,包括Part-centric Assembly Graph(PAG)建模部件间几何关系,以及Part-Aware Spatial Configuration Solver将关系转化为几何约束以生成无碰撞、物理有效的场景;并构建PARSE-10K数据集(含10,000个带密集接触结构的3D室内场景)。 Result: 在PARSE-10K上微调Qwen3-VL提升了物体级布局推理与部件级关系理解能力;将PAG作为先验用于3D生成模型,显著提升生成场景的物理真实性和结构复杂度。 Conclusion: PARSE显著推进了几何基础的空间推理能力,为生成物理一致的3D场景提供了新范式。 Abstract: Inter-object relations underpin spatial intelligence, yet existing representations -- linguistic prepositions or object-level scene graphs -- are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.[252] AR2-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos
Teng Yan,Yihan Liu,Jiongxu Chen,Teng Wang,Jiaqi Li,Bingzhuo Zhong
Main category: cs.CV
TL;DR: AR2-4FV提出一种利用静态背景结构构建离线Anchor Bank的方法,通过Anchor Map实现语义持久记忆,结合锚点引导的再进入先验与轻量级ReID-Gating机制,显著提升长时语言引导指代定位中的再捕获率并降低延迟。
Details
Motivation: 解决固定视角视频中长时语言引导指代任务的挑战:目标常被遮挡或长时间离开画面后重入,而逐帧指代方法因重识别(ReID)不可靠导致漂移。 Method: 构建基于静态背景的离线Anchor Bank;用文本查询对齐该Bank生成Anchor Map作为持久语义记忆;设计锚点引导的再进入先验加速目标重入时的再捕获;引入基于位移线索的轻量ReID-Gating机制维持身份连续性;不依赖首帧可见性或显式建模外观变化。 Result: 在Re-Capture Rate(RCR)上提升+10.3%,Re-Capture Latency(RCL)降低-24.2%;消融实验验证Anchor Map、再进入先验和ReID-Gating的有效性。 Conclusion: 利用背景稳定性构建语义锚点可有效缓解长时指代中的目标丢失与漂移问题,Anchor Map与配套机制为无需首帧可见假设的鲁棒视频指代提供了新范式。 Abstract: Long-term language-guided referring in fixed-view videos is challenging: the referent may be occluded or leave the scene for long intervals and later re-enter, while framewise referring pipelines drift as re-identification (ReID) becomes unreliable. AR2-4FV leverages background stability for long-term referring. An offline Anchor Bank is distilled from static background structures; at inference, the text query is aligned with this bank to produce an Anchor Map that serves as persistent semantic memory when the referent is absent. An anchor-based re-entry prior accelerates re-capture upon return, and a lightweight ReID-Gating mechanism maintains identity continuity using displacement cues in the anchor frame. The system predicts per-frame bounding boxes without assuming the target is visible in the first frame or explicitly modeling appearance variations. AR2-4FV achieves +10.3% Re-Capture Rate (RCR) improvement and -24.2% Re-Capture Latency (RCL) reduction over the best baseline, and ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.[253] DECADE: A Temporally-Consistent Unsupervised Diffusion Model for Enhanced Rb-82 Dynamic Cardiac PET Image Denoising
Yinchi Zhou,Liang Guo,Huidong Xie,Yuexi Du,Ashley Wang,Menghua Xia,Tian Yu,Ramesh Fazzone-Chettiar,Christopher Weyman,Bruce Spottiswoode,Vladimir Panin,Kuangyu Shi,Edward J. Miller,Attila Feher,Albert J. Sinusas,Nicha C. Dvornek,Chi Liu
Main category: cs.CV
TL;DR: 本文提出了一种名为DECADE的无监督扩散模型,用于Rb-82动态心脏PET图像去噪,解决了因放射性核素半衰期短导致的高噪声问题,并在保持定量准确性(如MBF、MFR)的同时提升图像质量。
Details
Motivation: Rb-82 PET因半衰期短导致动态帧噪声高,影响图像质量和参数成像;缺乏配对干净-含噪训练数据、示踪剂动力学快及帧依赖噪声变化限制了现有深度学习去噪方法效果。 Method: 提出DECADE——一种时序一致的无监督扩散模型,利用含噪帧作为引导,在训练和采样过程中引入时间一致性约束,适用于早至晚相动态帧,无需配对训练数据。 Result: 在Siemens Vision 450和Biograph Vision Quadra数据集上验证:显著降低噪声、提升动态与参数图像质量;准确保留MBF和MFR;在15%计数输入下,K1/MBF量化精度优于UNet及其他扩散模型。 Conclusion: DECADE实现了无需配对数据的高质量、定量可靠的Rb-82动态PET去噪,兼顾可视化清晰度与临床定量需求。 Abstract: Rb-82 dynamic cardiac PET imaging is widely used for the clinical diagnosis of coronary artery disease (CAD), but its short half-life results in high noise levels that degrade dynamic frame quality and parametric imaging. The lack of paired clean-noisy training data, rapid tracer kinetics, and frame-dependent noise variations further limit the effectiveness of existing deep learning denoising methods. We propose DECADE (A Temporally-Consistent Unsupervised Diffusion model for Enhanced Rb-82 CArdiac PET DEnoising), an unsupervised diffusion framework that generalizes across early- to late-phase dynamic frames. DECADE incorporates temporal consistency during both training and iterative sampling, using noisy frames as guidance to preserve quantitative accuracy. The method was trained and evaluated on datasets acquired from Siemens Vision 450 and Siemens Biograph Vision Quadra scanners. On the Vision 450 dataset, DECADE consistently produced high-quality dynamic and parametric images with reduced noise while preserving myocardial blood flow (MBF) and myocardial flow reserve (MFR). On the Quadra dataset, using 15%-count images as input and full-count images as reference, DECADE outperformed UNet-based and other diffusion models in image quality and K1/MBF quantification. The proposed framework enables effective unsupervised denoising of Rb-82 dynamic cardiac PET without paired training data, supporting clearer visualization while maintaining quantitative integrity.[254] MedQ-Deg: A Multidimensional Benchmark for Evaluating MLLMs Across Medical Image Quality Degradations
Jiyao Liu,Junzhi Ning,Chenglong Ma,Wanying Qu,Jianghan Shen,Siqi Luo,Jinjie Wei,Jin Ye,Pengze Li,Tianbin Li,Jiashi Lin,Hongming Shan,Xinzhe Luo,Xiaohong Liu,Lihao Liu,Junjun He,Ningsheng Xu
Main category: cs.CV
TL;DR: 本文提出MedQ-Deg基准,系统评估多模态大语言模型(MLLMs)在医学图像质量退化下的性能与置信度校准能力,揭示其普遍存在性能下降、过度自信(AI达宁-克鲁格效应)及跨维度行为差异等问题。
Details
Motivation: 现有医学多模态大模型评测基准缺乏大规模、多维度的图像质量退化评估,且未系统分析模型置信度校准能力,难以反映真实临床环境中的鲁棒性与可信性。 Method: 构建MedQ-Deg基准:涵盖18种退化类型、30个能力维度、7种影像模态、24,894个问答对;每种退化按3级严重度由放射科专家标定;提出Calibration Shift指标量化置信度与实际性能间的偏差。 Result: 对40个主流医学MLLMs的评测表明:(1)模型性能随退化加剧系统性下降;(2)普遍存在AI达宁-克鲁格效应——严重准确率崩溃下仍保持过高置信度;(3)不同能力维度、模态和退化类型下行为模式差异显著。 Conclusion: MedQ-Deg揭示了当前医学MLLMs在真实临床场景中可靠性不足的关键问题,为构建更鲁棒、可信赖的临床AI系统提供了新评测范式与改进方向。 Abstract: Despite impressive performance on standard benchmarks, multimodal large language models (MLLMs) face critical challenges in real-world clinical environments where medical images inevitably suffer various quality degradations. Existing benchmarks exhibit two key limitations: (1) absence of large-scale, multidimensional assessment across medical image quality gradients and (2) no systematic confidence calibration analysis. To address these gaps, we present MedQ-Deg, a comprehensive benchmark for evaluating medical MLLMs under image quality degradations. MedQ-Deg provides multi-dimensional evaluation spanning 18 distinct degradation types, 30 fine-grained capability dimensions, and 7 imaging modalities, with 24,894 question-answer pairs. Each degradation is implemented at 3 severity degrees, calibrated by expert radiologists. We further introduce Calibration Shift metric, which quantifies the gap between a model's perceived confidence and actual performance to assess metacognitive reliability under degradation. Our comprehensive evaluation of 40 mainstream MLLMs reveals several critical findings: (1) overall model performance degrades systematically as degradation severity increases, (2) models universally exhibit the AI Dunning-Kruger Effect, maintaining inappropriately high confidence despite severe accuracy collapse, and (3) models display markedly differentiated behavioral patterns across capability dimensions, imaging modalities, and degradation types. We hope MedQ-Deg drives progress toward medical MLLMs that are robust and trustworthy in real clinical practice.[255] Geometric Knowledge-Assisted Federated Dual Knowledge Distillation Approach Towards Remote Sensing Satellite Imagery
Luyao Zou,Fei Pan,Jueying Li,Yan Kyaw Tun,Apurba Adhikary,Zhu Han,Hayoung Oh
Main category: cs.CV
TL;DR: 本文提出了一种面向遥感卫星图像分析的几何知识引导联邦双知识蒸馏框架(GK-FedDKD),通过本地教师-学生网络蒸馏、全局几何知识聚合与嵌入增强,有效缓解卫星数据异构性带来的训练难题,并在多个数据集上显著超越现有方法。
Details
Motivation: 遥感卫星图像具有大规模和天然数据异构性(各卫星本地数据分布与全局分布差异大),导致传统联邦学习难以有效训练模型。 Method: 提出GK-FedDKD框架:1)本地端用无标签增强数据训练多个学生编码器(SE),蒸馏出教师编码器(TE);2)TE+共享分类器构成教师网络(TN),监督学生网络(SN)训练;3)TN中间表示计算本地协方差矩阵,服务器聚合生成全局几何知识(GGK);4)GGK用于本地嵌入增强;5)设计新型损失函数与多原型生成流程稳定训练。 Result: 在多个数据集(如EuroSAT)上显著优于SOTA方法,以Swin-T为骨干时,在EuroSAT上平均性能提升68.89%。 Conclusion: GK-FedDKD通过引入几何知识引导的双知识蒸馏机制,有效应对遥感卫星图像联邦学习中的数据异构挑战,提升了模型泛化能力与训练稳定性。 Abstract: Federated learning (FL) has recently become a promising solution for analyzing remote sensing satellite imagery (RSSI). However, the large scale and inherent data heterogeneity of images collected from multiple satellites, where the local data distribution of each satellite differs from the global one, present significant challenges to effective model training. To address this issue, we propose a Geometric Knowledge-Guided Federated Dual Knowledge Distillation (GK-FedDKD) framework for RSSI analysis. In our approach, each local client first distills a teacher encoder (TE) from multiple student encoders (SEs) trained with unlabeled augmented data. The TE is then connected with a shared classifier to form a teacher network (TN) that supervises the training of a new student network (SN). The intermediate representations of the TN are used to compute local covariance matrices, which are aggregated at the server to generate global geometric knowledge (GGK). This GGK is subsequently employed for local embedding augmentation to further guide SN training. We also design a novel loss function and a multi-prototype generation pipeline to stabilize the training process. Evaluation over multiple datasets showcases that the proposed GK-FedDKD approach is superior to the considered state-of-the-art baselines, e.g., the proposed approach with the Swin-T backbone surpasses previous SOTA approaches by an average 68.89% on the EuroSAT dataset.[256] Parameterized Brushstroke Style Transfer
Uma Meleti,Siyu Huang
Main category: cs.CV
TL;DR: 本文提出了一种基于画笔笔触域而非传统RGB像素域的风格迁移方法,以更自然地模拟真实艺术创作过程。
Details
Motivation: 现有基于计算机视觉的风格迁移方法大多局限于像素域,无法自然表达真实艺术作品中由不同颜色画笔笔触构成的特性。 Method: 提出一种在画笔笔触域进行图像表示和风格迁移的新方法,替代传统的RGB像素域操作。 Result: 该方法在视觉效果上优于现有的像素级风格迁移方法。 Conclusion: 基于画笔笔触域的风格迁移方法更符合真实艺术创作逻辑,能带来更好的视觉表现效果。 Abstract: Computer Vision-based Style Transfer techniques have been used for many years to represent artistic style. However, most contemporary methods have been restricted to the pixel domain; in other words, the style transfer approach has been modifying the image pixels to incorporate artistic style. However, real artistic work is made of brush strokes with different colors on a canvas. Pixel-based approaches are unnatural for representing these images. Hence, this paper discusses a style transfer method that represents the image in the brush stroke domain instead of the RGB domain, which has better visual improvement over pixel-based methods.[257] OrdinalBench: A Benchmark Dataset for Diagnosing Generalization Limits in Ordinal Number Understanding of Vision-Language Models
Yusuke Tozaki,Hisashi Miyamori
Main category: cs.CV
TL;DR: 本文提出OrdinalBench,一个专门评估视觉语言模型(VLMs)序数理解能力的诊断性基准,聚焦于N-th物体识别任务,并通过控制序数大小、排列复杂度和物体数量来系统评测模型在顺序推理上的泛化能力。
Details
Motivation: 现有VLMs在多模态基准上表现优异,但在序数理解(如追踪相对位置、泛化至大序数)方面存在明显缺陷,缺乏标准化评测手段。 Method: 构建OrdinalBench基准,包含39,000个带真值推理轨迹的问答对;定义N-th物体识别为核心任务,难度沿序数大小(至300)、排列复杂度(单环到迷宫路径)、物体数量三轴调控;要求模型生成结构化计数过程追踪,并提供开源评估工具,同时评测最终答案准确率与步骤级路径一致性。 Result: GPT-5、Gemini 2.5 Flash Lite、Qwen2.5-VL、InternVL3.5和Molmo等主流VLM在零样本设置下,在大序数和复杂路径条件下性能显著下降,暴露出其顺序推理泛化能力薄弱。 Conclusion: OrdinalBench为VLM序数理解提供了可复现的诊断框架,强调将顺序推理作为核心能力目标,推动构建更具鲁棒序列推理能力的多模态模型。 Abstract: Vision-Language Models (VLMs) have advanced across multimodal benchmarks but still show clear gaps in ordinal number understanding, i.e., the ability to track relative positions and generalize to large indices. We present OrdinalBench, a diagnostic benchmark that standardizes ordinal number understanding as an evaluation task for VLMs. The core task is N-th object identification, defined by a starting reference and traversal rule. Task difficulty is controlled along three axes: (i) ordinal magnitude, from small numbers to extreme cases up to 300; (ii) arrangement complexity, from single loops to maze-like paths; and (iii) object count. The benchmark provides 39,000 question-answer pairs, each annotated with a ground-truth reasoning trajectory and balanced across difficulty levels for controlled large-scale testing. Beyond answer-only evaluation, our framework requires models to generate structured stepwise traces of the counting process and provides an open evaluation toolkit that measures both final accuracy and step-level path consistency. Zero-shot evaluations of GPT-5, Gemini 2.5 Flash Lite, Qwen2.5-VL, InternVL3.5, and Molmo reveal sharp degradation under large-ordinal and complex-path conditions, highlighting weak generalization despite strong scores on standard multimodal tasks. By framing ordinal number understanding as a core target, OrdinalBench provides a reproducible benchmark and diagnostic framework for developing VLMs with stronger sequential reasoning. All data and code are available at https://ordinalbench.github.io/[258] SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
Zixuan Pan,Kaiyuan Tang,Jun Xia,Yifan Qin,Lin Gu,Chaoli Wang,Jianxu Chen,Yiyu Shi
Main category: cs.CV
TL;DR: 本文提出Structured Gaussian Image (SGI),一种用于高效表示高分辨率图像的新框架,通过种子驱动的多尺度局部空间分解与轻量MLP生成结构化隐式2D高斯,并结合熵压缩与粗到细优化策略,显著提升压缩率与训练速度,同时保持或提升图像质量。
Details
Motivation: 2D高斯点绘法在高分辨率图像上存在参数冗余、收敛慢、存储开销大等问题,亟需结构化、紧凑且高效的表示方法。 Method: 提出SGI框架:以种子定义多尺度局部空间,每个种子关联轻量MLP生成结构化隐式2D高斯;引入种子级熵压缩降低存储;设计粗到细的多尺度拟合策略加速优化。 Result: 相比非量化2D高斯方法压缩率达7.5×,比量化方法高1.6×;优化速度分别快1.6×和6.5×;图像保真度不降反升。 Conclusion: SGI通过结构化建模与分层优化,在压缩率、训练效率与重建质量之间实现了更优平衡,为高分辨率神经图像表示提供了新范式。 Abstract: 2D Gaussian Splatting has emerged as a novel image representation technique that can support efficient rendering on low-end devices. However, scaling to high-resolution images requires optimizing and storing millions of unstructured Gaussian primitives independently, leading to slow convergence and redundant parameters. To address this, we propose Structured Gaussian Image (SGI), a compact and efficient framework for representing high-resolution images. SGI decomposes a complex image into multi-scale local spaces defined by a set of seeds. Each seed corresponds to a spatially coherent region and, together with lightweight multi-layer perceptrons (MLPs), generates structured implicit 2D neural Gaussians. This seed-based formulation imposes structural regularity on otherwise unstructured Gaussian primitives, which facilitates entropy-based compression at the seed level to reduce the total storage. However, optimizing seed parameters directly on high-resolution images is a challenging and non-trivial task. Therefore, we designed a multi-scale fitting strategy that refines the seed representation in a coarse-to-fine manner, substantially accelerating convergence. Quantitative and qualitative evaluations demonstrate that SGI achieves up to 7.5x compression over prior non-quantized 2D Gaussian methods and 1.6x over quantized ones, while also delivering 1.6x and 6.5x faster optimization, respectively, without degrading, and often improving, image fidelity. Code is available at https://github.com/zx-pan/SGI.[259] 4DRC-OCC: Robust Semantic Occupancy Prediction Through Fusion of 4D Radar and Camera
David Ninfa,Andras Palffy,Holger Caesar
Main category: cs.CV
TL;DR: This paper introduces the first 4D radar-camera fusion method for robust 3D semantic occupancy prediction, leveraging radar’s all-weather reliability and camera’s rich semantics, with depth-guided 3D lifting and a new auto-labeled dataset.
Details
Motivation: 3D semantic occupancy prediction is challenging under adverse weather and lighting, and current methods lack robust multi-modal fusion and scalable labeling. Method: Fuses 4D radar and camera data by exploiting their complementary strengths—radar for reliable geometric and motion cues in adverse conditions, camera for semantic and texture details; incorporates depth cues from camera pixels to lift 2D features to 3D; introduces a fully automatically labeled dataset for semantic occupancy training. Result: Demonstrates improved 3D scene reconstruction accuracy and robustness across diverse environmental conditions; validates 4D radar’s effectiveness for autonomous driving perception. Conclusion: 4D radar-camera fusion, enhanced by depth-guided 3D lifting and auto-labeled data, significantly advances robust 3D semantic occupancy prediction for autonomous driving. Abstract: Autonomous driving requires robust perception across diverse environmental conditions, yet 3D semantic occupancy prediction remains challenging under adverse weather and lighting. In this work, we present the first study combining 4D radar and camera data for 3D semantic occupancy prediction. Our fusion leverages the complementary strengths of both modalities: 4D radar provides reliable range, velocity, and angle measurements in challenging conditions, while cameras contribute rich semantic and texture information. We further show that integrating depth cues from camera pixels enables lifting 2D images to 3D, improving scene reconstruction accuracy. Additionally, we introduce a fully automatically labeled dataset for training semantic occupancy models, substantially reducing reliance on costly manual annotation. Experiments demonstrate the robustness of 4D radar across diverse scenarios, highlighting its potential to advance autonomous vehicle perception.[260] MWM: Mobile World Models for Action-Conditioned Consistent Prediction
Han Yan,Zishang Xiang,Zeyu Zhang,Hao Tang
Main category: cs.CV
TL;DR: 本文提出MWM(移动世界模型),通过两阶段训练框架(结构预训练+动作条件一致性后训练)和推理一致状态蒸馏(ICSD)方法,提升图像目标导航中动作条件下的多步预测一致性与少步扩散推理效率。
Details
Motivation: 现有导航世界模型缺乏动作条件一致性,导致多步滚动预测易漂移,且少步扩散蒸馏方法未显式保持滚动一致性,造成训练-推理不匹配。 Method: 提出MWM模型,包含两阶段训练:结构预训练 + 动作条件一致性(ACC)后训练;并引入推理一致状态蒸馏(ICSD)实现少步扩散蒸馏。 Result: 在基准和真实任务上验证了视觉保真度、轨迹精度、规划成功率及推理效率的持续提升。 Conclusion: MWM有效解决了动作条件一致性与少步蒸馏中滚动一致性缺失的问题,为基于规划的具身导航提供了更鲁棒高效的世界模型方案。 Abstract: World models enable planning in imagined future predicted space, offering a promising framework for embodied navigation. However, existing navigation world models often lack action-conditioned consistency, so visually plausible predictions can still drift under multi-step rollout and degrade planning. Moreover, efficient deployment requires few-step diffusion inference, but existing distillation methods do not explicitly preserve rollout consistency, creating a training-inference mismatch. To address these challenges, we propose MWM, a mobile world model for planning-based image-goal navigation. Specifically, we introduce a two-stage training framework that combines structure pretraining with Action-Conditioned Consistency (ACC) post-training to improve action-conditioned rollout consistency. We further introduce Inference-Consistent State Distillation (ICSD) for few-step diffusion distillation with improved rollout consistency. Our experiments on benchmark and real-world tasks demonstrate consistent gains in visual fidelity, trajectory accuracy, planning success, and inference efficiency. Code: https://github.com/AIGeeksGroup/MWM. Website: https://aigeeksgroup.github.io/MWM.[261] HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration
Desen Sun,Jason Hon,Jintao Zhang,Sihang Liu
Main category: cs.CV
TL;DR: 本文提出HybridStitch方法,通过将图像生成视为编辑过程,在不同区域混合使用大、小模型以加速文本到图像生成,实现1.83×加速。
Details
Motivation: 扩散模型在文本到图像生成中计算开销大,尤其对参数量巨大的模型;现有方法仅节省部分时间步的计算,未考虑单个时间步内不同区域的计算需求差异。 Method: 提出HybridStitch范式:将图像分为易渲染区(由小模型快速生成粗略草图)和复杂区(由大模型精细编辑),在单个时间步内协同使用大小模型。 Result: 在Stable Diffusion 3上实现1.83×加速,快于所有现有模型混合方法。 Conclusion: HybridStitch通过空间感知的混合建模策略,在不显著损失质量前提下显著提升T2I生成效率,验证了‘生成即编辑’范式的有效性。 Abstract: Diffusion models have demonstrated a remarkable ability in Text-to-Image (T2I) generation applications. Despite the advanced generation output, they suffer from heavy computation overhead, especially for large models that contain tens of billions of parameters. Prior work has illustrated that replacing part of the denoising steps with a smaller model still maintains the generation quality. However, these methods only focus on saving computation for some timesteps, ignoring the difference in compute demand within one timestep. In this work, we propose HybridStitch, a new T2I generation paradigm that treats generation like editing. Specifically, we introduce a hybrid stage that jointly incorporates both the large model and the small model. HybridStitch separates the entire image into two regions: one that is relatively easy to render, enabling an early transition to the smaller model, and another that is more complex and therefore requires refinement by the large model. HybridStitch employs the small model to construct a coarse sketch while exploiting the large model to edit and refine the complex regions. According to our evaluation, HybridStitch achieves 1.83$\times$ speedup on Stable Diffusion 3, which is faster than all existing mixture of model methods.[262] Tracking Phenological Status and Ecological Interactions in a Hawaiian Cloud Forest Understory using Low-Cost Camera Traps and Visual Foundation Models
Luke Meyers,Anirudh Potlapally,Yuyan Chen,Mike Long,Tanya Berger-Wolf,Hari Subramoni,Remi Megret,Daniel Rubenstein
Main category: cs.CV
TL;DR: 本研究利用低成本、动物触发的相机陷阱结合基础视觉模型和传统计算机视觉方法,在夏威夷普乌马卡阿拉自然保护区实现了无需监督学习的个体级植物物候监测,并同步记录动植物互作,揭示了传统粗粒度采样无法检测的物候趋势及其生态驱动机制。
Details
Motivation: 植物物候研究在热带地区严重不足,且个体水平的物候监测仍具挑战性;同时亟需理解物候变化与动植物互作之间的关联。 Method: 部署动物触发式相机陷阱采集图像,融合基础视觉模型与传统计算机视觉技术,不依赖监督学习实现个体级物候量化分析。 Result: 获得高时间分辨率的物候测量结果,识别出传统采样方法无法发现的细微趋势,并结合图像中的访花/访果等动物访问数据,初步解析物候与动物生态的驱动关系。 Conclusion: 该方法为热带地区低成本、长期、个体尺度的物候与生态互作联合监测提供了可行新范式。 Abstract: Plant phenology, the study of cyclical events such as leafing out, flowering, or fruiting, has wide ecological impacts but is broadly understudied, especially in the tropics. Image analysis has greatly enhanced remote phenological monitoring, yet capturing phenology at the individual level remains challenging. In this project, we deployed low-cost, animal-triggered camera traps at the Pu'u Maka'ala Natural Area Reserve in Hawaii to simultaneously document shifts in plant phenology and flora-faunal interactions. Using a combination of foundation vision models and traditional computer vision methods, we measure phenological trends from images comparable to on-the-ground observations without relying on supervised learning techniques. These temporally fine-grained phenology measurements from camera-trap images uncover trends that coarser traditional sampling fails to detect. When combined with detailed visitation data detected from images, these trends can begin to elucidate drivers of both plant phenology and animal ecology.[263] Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression
Mridankan Mandal
Main category: cs.CV
TL;DR: 本文系统评估了视觉基础模型在农业回归任务中的适应性,发现对于稀缺的农业数据,简单的融合方法(如双层门控深度卷积)优于复杂的注意力或状态空间模型;骨干网络预训练规模的影响远超架构选择;仅用元数据训练存在性能上限;提出稀疏农业基准下应优先选择高质量骨干、局部模块,并排除推理时不可用特征。
Details
Motivation: 现有方法受限于真实牧场监测中常见的小样本、不平衡、标注稀疏的数据集,难以准确估计牧草生物量。 Method: 在CSIRO牧草生物量基准数据集(357张双视角图像,含5类成分级实验室验证真值)上,系统评测17种配置,涵盖4种骨干网络(EfficientNet-B3至DINOv3-ViT-L)、5种跨视角融合机制及4×2元数据因子设计。 Result: 发现‘融合复杂度倒置’现象:简单双层门控深度卷积(R²=0.903)显著优于跨视角注意力(0.833)、双向SSM(0.819)和全Mamba(0.793);DINOv2→DINOv3升级单独带来+5.0 R²提升;仅用元数据训练导致R²上限约0.829,大幅压缩融合效果差异。 Conclusion: 在稀疏农业基准下,应优先提升骨干网络质量而非增加融合复杂度,偏好局部模块而非全局建模,并剔除推理时不可获取的特征。 Abstract: Accurate estimation of pasture biomass from agricultural imagery is critical for sustainable livestock management, yet existing methods are limited by the small, imbalanced, and sparsely annotated datasets typical of real world monitoring. In this study, adaptation of vision foundation models to agricultural regression is systematically evaluated on the CSIRO Pasture Biomass benchmark, a 357 image dual view dataset with laboratory validated, component wise ground truth for five biomass targets, through 17 configurations spanning four backbones (EfficientNet-B3 to DINOv3-ViT-L), five cross view fusion mechanisms, and a 4x2 metadata factorial. A counterintuitive principle, termed "fusion complexity inversion", is uncovered: on scarce agricultural data, a two layer gated depthwise convolution (R^2 = 0.903) outperforms cross view attention transformers (0.833), bidirectional SSMs (0.819), and full Mamba (0.793, below the no fusion baseline). Backbone pretraining scale is found to monotonically dominate all architectural choices, with the DINOv2 -> DINOv3 upgrade alone yielding +5.0 R^2 points. Training only metadata (species, state, and NDVI) is shown to create a universal ceiling at R^2 ~ 0.829, collapsing an 8.4 point fusion spread to 0.1 points. Actionable guidelines for sparse agricultural benchmarks are established: backbone quality should be prioritized over fusion complexity, local modules preferred over global alternatives, and features unavailable at inference excluded.[264] Transferable Optimization Network for Cross-Domain Image Reconstruction
Yunmei Chen,Chi Ding,Xiaojing Ye
Main category: cs.CV
TL;DR: 本文提出了一种新颖的两阶段双层优化迁移学习框架,用于解决图像重建中训练数据有限的问题:先在多源异构大数据上预训练通用特征提取器,再用少量目标域数据微调任务特定的域适配器,从而实现高质量欠采样MRI图像重建。
Details
Motivation: 解决图像重建任务中目标域训练数据有限的问题,尤其是医学影像(如欠采样MRI)常面临数据稀缺挑战。 Method: 提出包含两个双层优化步骤的迁移学习框架:第一步在跨域异构大数据(不同解剖结构、采样率、图像模态)上预训练通用特征提取器;第二步用少量目标域数据训练域适配器;二者组合构成面向新任务的正则化特征提取模块。 Result: 在欠采样MRI重建任务上验证了该方法的有效性,展现出优异的迁移学习能力与高质量重建性能。 Conclusion: 该框架通过解耦通用知识学习与任务定制,有效缓解数据稀缺问题,为小样本图像重建提供了可扩展、跨域通用的新范式。 Abstract: We develop a novel transfer learning framework to tackle the challenge of limited training data in image reconstruction problems. The proposed framework consists of two training steps, both of which are formed as bi-level optimizations. In the first step, we train a powerful universal feature-extractor that is capable of learning important knowledge from large, heterogeneous data sets in various domains. In the second step, we train a task-specific domain-adapter for a new target domain or task with only a limited amount of data available for training. Then the composition of the adapter and the universal feature-extractor effectively explores feature which serve as an important component of image regularization for the new domains, and this leads to high-quality reconstruction despite the data limitation issue. We apply this framework to reconstruct under-sampled MR images with limited data by using a collection of diverse data samples from different domains, such as images of other anatomies, measurements of various sampling ratios, and even different image modalities, including natural images. Experimental results demonstrate a promising transfer learning capability of the proposed method.[265] GazeShift: Unsupervised Gaze Estimation and Dataset for VR
Gil Shapira,Ishay Goldin,Evgeny Artyomov,Donghoon Kim,Yosi Keller,Niv Zehngut
Main category: cs.CV
TL;DR: 本文提出了VRGaze——首个大规模VR离轴眼动估计数据集,以及GazeShift——一种无需标注数据、面向近眼红外图像的注意力引导无监督眼动表征学习框架,在精度、效率和鲁棒性上均优于现有方法。
Details
Motivation: VR眼动估计受限于缺乏大规模、准确标注的离轴相机配置数据集,且人工标注困难;现有方法依赖多视角或3D几何,不适用于近眼红外图像。 Method: 构建VRGaze数据集(210万张近眼红外图像,68名被试);提出GazeShift——基于注意力机制的无监督表征学习框架,实现眼动与外观解耦,并支持轻量级少样本个性化校准。 Result: 在VRGaze上达到1.84°平均误差;在MPIIGaze上实现7.15°人无关误差,参数量和FLOPs仅为基线方法的1/10和1/35;VR头显GPU上推理仅需5ms;对光照变化鲁棒。 Conclusion: GazeShift是一种标签高效、实时、鲁棒的VR眼动追踪新范式,VRGaze数据集填补了该领域大规模离轴数据空白。 Abstract: Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity - particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze - the first large-scale off-axis gaze estimation dataset for VR - comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze-appearance disentanglement in a compact, real-time model. GazeShift embeddings can be optionally adapted to individual users via lightweight few-shot calibration, achieving a 1.84-degree mean error on VRGaze. On the remote-camera MPIIGaze dataset, the model achieves a 7.15-degree person-agnostic error, doing so with 10x fewer parameters and 35x fewer FLOPs than baseline methods. Deployed natively on a VR headset GPU, inference takes only 5 ms. Combined with demonstrated robustness to illumination changes, these results highlight GazeShift as a label-efficient, real-time solution for VR gaze tracking. Project code and the VRGaze dataset are released at https://github.com/gazeshift3/gazeshift.[266] Training-free Temporal Object Tracking in Surgical Videos
Subhadeep Koley,Abdolrahim Kadkhodamohammadi,Santiago Barbarisi,Danail Stoyanov,Imanol Luengo
Main category: cs.CV
TL;DR: 本文提出了一种基于预训练文生图扩散模型特征的无监督在线目标跟踪方法,用于腹腔镜胆囊切除术视频中关键解剖结构和器械的定位与跟踪,无需像素级标注或模型微调,在CholeSeg8K数据集上取得优异性能。
Details
Motivation: 解决现有手术视频数据集中像素级标注成本高、标签不一致的问题,提升腹腔镜手术视频中关键结构与器械的在线跟踪精度与效率。 Method: 利用预训练文生图扩散模型提取手术帧的代表性特征,不进行任何训练或微调;构建基于查询-键-值注意力机制启发的亲和矩阵,建模帧间交互以保障时序跟踪连续性。 Result: 在CholeSeg8K数据集上达到像素分类准确率79.19%、平均Jaccard分数56.20%、平均F-Score 79.48%,显著优于对比方法;扩散特征被验证具备跨解码头层级和跨时间帧的一致语义与定位能力。 Conclusion: 首次将文生图扩散模型应用于手术视频分析中的无监督时序目标跟踪,为微创手术视频理解提供了高效、低成本的新范式。 Abstract: Purpose: In this paper, we present a novel approach for online object tracking in laparoscopic cholecystectomy (LC) surgical videos, targeting localisation and tracking of critical anatomical structures and instruments. Our method addresses the challenges of costly pixel-level annotations and label inconsistencies inherent in existing datasets. Methods: Leveraging the inherent object localisation capabilities of pre-trained text-to-image diffusion models, we extract representative features from surgical frames without any training or fine-tuning. Our tracking framework uses these features, along with cross-frame interactions via an affinity matrix inspired by query-key-value attention, to ensure temporal continuity in the tracking process. Results: Through a pilot study, we first demonstrate that diffusion features exhibit superior object localisation and consistent semantics across different decoder levels and temporal frames. Later, we perform extensive experiments to validate the effectiveness of our approach, showcasing its superiority over competitors for the task of temporal object tracking. Specifically, we achieve a per-pixel classification accuracy of 79.19%, mean Jaccard Score of 56.20%, and mean F-Score of 79.48% on the publicly available CholeSeg8K dataset. Conclusion: Our work not only introduces a novel application of text-to-image diffusion models but also contributes to advancing the field of surgical video analysis, offering a promising avenue for accurate and cost-effective temporal object tracking in minimally invasive surgery videos.[267] Toward Unified Multimodal Representation Learning for Autonomous Driving
Ximeng Tao,Dimitar Filev,Gaurav Pandey
Main category: cs.CV
TL;DR: 本文提出了一种对比张量预训练(CTP)框架,通过构建多模态相似性张量和张量损失函数,在统一嵌入空间中联合对齐文本、图像和点云等多种模态,以提升端到端自动驾驶的场景理解能力。
Details
Motivation: 现有基于CLIP的3D视觉方法仅使用两两余弦相似度进行模态对齐,难以保证整个多模态空间中的一致性和统一性。 Method: 提出对比张量预训练(CTP)框架,将二维相似度矩阵扩展为多模态相似性张量,并设计张量损失函数实现跨所有模态的联合对比学习;构建文本-图像-点云三元组数据集用于实验验证。 Result: 在两种设定下均取得优越性能:(i) 将3D编码器与预训练CLIP编码器对齐;(ii) 从零开始预训练所有编码器。 Conclusion: 统一的多模态对齐框架比传统两两对齐更有效,显著提升了自动驾驶中的跨模态理解能力。 Abstract: Contrastive Language-Image Pre-training (CLIP) has shown impressive performance in aligning visual and textual representations. Recent studies have extended this paradigm to 3D vision to improve scene understanding for autonomous driving. A common strategy is to employ pairwise cosine similarity between modalities to guide the training of a 3D encoder. However, considering the similarity between individual modality pairs rather than all modalities jointly fails to ensure consistent and unified alignment across the entire multimodal space. In this paper, we propose a Contrastive Tensor Pre-training (CTP) framework that simultaneously aligns multiple modalities in a unified embedding space to enhance end-to-end autonomous driving. Compared with pairwise cosine similarity alignment, our method extends the 2D similarity matrix into a multimodal similarity tensor. Furthermore, we introduce a tensor loss to enable joint contrastive learning across all modalities. For experimental validation of our framework, we construct a text-image-point cloud triplet dataset derived from existing autonomous driving datasets. The results show that our proposed unified multimodal alignment framework achieves favorable performance for both scenarios: (i) aligning a 3D encoder with pretrained CLIP encoders, and (ii) pretraining all encoders from scratch.[268] VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?
Minkyu Kim,Sangheon Lee,Dongmin Park
Main category: cs.CV
TL;DR: 本文提出了VLM-SubtleBench,一个用于评估视觉语言模型(VLMs)在细微差异图像对比推理能力的新基准,涵盖10种差异类型及多领域图像,揭示了当前VLMs与人类性能间的系统性差距。
Details
Motivation: 现有VLM对比推理基准主要关注显著差异图像,无法反映工业检测、医学影像和航拍监控等实际场景中所需的细微差异识别能力。 Method: 构建VLM-SubtleBench基准,涵盖10类细微差异(如属性、状态、情绪等),配对问题-图像集,覆盖工业、航拍、医疗等多领域图像,并对多种开源与闭源VLM进行系统评测。 Result: 发现VLMs在各类细微差异和不同图像域上均显著落后于人类,尤其在特定差异类型(如状态、时空、存在性)上推理能力急剧下降。 Conclusion: VLM-SubtleBench为推动VLM向人类水平的细微对比推理能力发展提供了重要评测基础和改进方向。 Abstract: The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types - Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action - and curate paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.[269] Structure and Progress Aware Diffusion for Medical Image Segmentation
Siyuan Song,Guyue Hu,Chenglong Li,Dengdi Sun,Zhe Jin,Jin Tang
Main category: cs.CV
TL;DR: 本文提出了一种结构与进展感知的扩散模型(SPAD),通过语义集中扩散(ScD)和边界集中扩散(BcD)结合进展感知调度器(PaS),实现从粗到细的医学图像分割。
Details
Motivation: 现有方法在训练中同时学习粗粒度结构和细粒度边界,但医学图像中边界常模糊、噪声大、不可靠,不适合作为早期监督信号;而粗粒度形态与语义结构更稳定、更有判别性。 Method: 提出SPAD框架,包含:1)语义集中扩散(ScD),采用锚点保留的目标扰动策略,扰动目标内部像素但保留语义锚点;2)边界集中扩散(BcD),引入进展感知的边界噪声以模糊不可靠边界;3)进展感知调度器(PaS),动态调节ScD与BcD的噪声强度,构建由粗到细的扩散范式。 Result: 在多个公开医学图像分割数据集上取得SOTA性能,验证了该方法在提升分割鲁棒性与精度方面的有效性,尤其在边界模糊、标注不确定场景下优势明显。 Conclusion: 粗粒度结构应优先于细粒度边界被建模,进展感知的分阶段扩散机制能更符合医学图像理解的认知过程,为扩散模型在医学图像分割中的应用提供了新范式。 Abstract: Medical image segmentation is crucial for computer-aided diagnosis, which necessitates understanding both coarse morphological and semantic structures, as well as carving fine boundaries. The morphological and semantic structures in medical images are beneficial and stable clues for target understanding. While the fine boundaries of medical targets (like tumors and lesions) are usually ambiguous and noisy since lesion overlap, annotation uncertainty, and so on, making it not reliable to serve as early supervision. However, existing methods simultaneously learn coarse structures and fine boundaries throughout the training process. In this paper, we propose a structure and progress-aware diffusion (SPAD) for medical image segmentation, which consists of a semantic-concentrated diffusion (ScD) and a boundary-centralized diffusion (BcD) modulated by a progress-aware scheduler (PaS). Specifically, the semantic-concentrated diffusion introduces anchor-preserved target perturbation, which perturbs pixels within a medical target but preserves unaltered areas as semantic anchors, encouraging the model to infer noisy target areas from the surrounding semantic context. The boundary-centralized diffusion introduces progress-aware boundary noise, which blurs unreliable and ambiguous boundaries, thus compelling the model to focus on coarse but stable anatomical morphology and global semantics. Furthermore, the progress-aware scheduler gradually modulates noise intensity of the ScD and BcD forming a coarse-to-fine diffusion paradigm, which encourage focusing on coarse morphological and semantic structures during early target understanding stages and gradually shifting to fine target boundaries during later contour adjusting stages.[270] MINT: Molecularly Informed Training with Spatial Transcriptomics Supervision for Pathology Foundation Models
Minsoo Lee,Jonghyun Kim,Juseung Yun,Sunwoo Yu,Jongseong Jang
Main category: cs.CV
TL;DR: 本文提出MINT框架,通过在预训练病理视觉Transformer中引入空间转录组学监督信号,提升模型对组织分子状态的理解能力,在基因表达预测和通用病理任务上均取得最优性能。
Details
Motivation: 现有病理基础模型仅通过自监督预训练学习形态学表征,未能显式捕捉组织的潜在分子状态;而空间转录组学技术可在原位测量基因表达,为跨模态监督提供天然信号。 Method: MINT在预训练ViT输入中添加可学习的ST token以独立编码转录组信息,并结合DINO自蒸馏与特征锚定防止灾难性遗忘;同时在Visium(spot级)和Xenium(patch级)两种分辨率上进行基因表达回归,实现多尺度监督。 Result: 在577个HEST样本上训练后,MINT在HEST-Bench基因表达预测任务中平均Pearson r达0.440,在EVA通用病理任务中得分为0.803,均为最优性能。 Conclusion: 空间转录组学监督能有效补充以形态为中心的自监督预训练,提升病理模型对分子层面信息的理解与泛化能力。 Abstract: Pathology foundation models learn morphological representations through self-supervised pretraining on large-scale whole-slide images, yet they do not explicitly capture the underlying molecular state of the tissue. Spatial transcriptomics technologies bridge this gap by measuring gene expression in situ, offering a natural cross-modal supervisory signal. We propose MINT (Molecularly Informed Training), a fine-tuning framework that incorporates spatial transcriptomics supervision into pretrained pathology Vision Transformers. MINT appends a learnable ST token to the ViT input to encode transcriptomic information separately from the morphological CLS token, preventing catastrophic forgetting through DINO self-distillation and explicit feature anchoring to the frozen pretrained encoder. Gene expression regression at both spot-level (Visium) and patch-level (Xenium) resolutions provides complementary supervision across spatial scales. Trained on 577 publicly available HEST samples, MINT achieves the best overall performance on both HEST-Bench for gene expression prediction (mean Pearson r = 0.440) and EVA for general pathology tasks (0.803), demonstrating that spatial transcriptomics supervision complements morphology-centric self-supervised pretraining.[271] Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning
Chen-Chen Zong,Yu-Qi Chi,Xie-Yang Wang,Yan Cui,Sheng-Jun Huang
Main category: cs.CV
TL;DR: 本文提出E²OAL,一种统一且无需单独训练开放集检测器的开放集主动学习框架,通过利用已标注的未知类样本提升已知类学习效果,并在多个基准测试中表现出色。
Details
Motivation: 现有开放集主动学习方法依赖单独训练的开放集检测器,导致训练开销大,且未充分利用已标注未知类样本对已知类学习的监督价值。 Method: E²OAL采用冻结对比预训练特征空间中的标签引导聚类挖掘未知类潜在结构;引入狄利克雷校准辅助头联合建模已知与未知类别;设计logit-margin纯度分数构建高纯度候选池,并提出面向OSAL的信息量度量选择部分模糊但可靠的样本,形成两阶段自适应查询策略。 Result: 在多个OSAL基准上,E²OAL在准确率、效率和查询精度方面持续超越当前最优方法。 Conclusion: E²OAL是一种高效、鲁棒且实用的开放集主动学习框架,适用于安全关键与开放世界场景。 Abstract: Open-set active learning (OSAL) aims to identify informative samples for annotation when unlabeled data may contain previously unseen classes-a common challenge in safety-critical and open-world scenarios. Existing approaches typically rely on separately trained open-set detectors, introducing substantial training overhead and overlooking the supervisory value of labeled unknowns for improving known-class learning. In this paper, we propose E$^2$OAL (Effective and Efficient Open-set Active Learning), a unified and detector-free framework that fully exploits labeled unknowns for both stronger supervision and more reliable querying. E$^2$OAL first uncovers the latent class structure of unknowns through label-guided clustering in a frozen contrastively pre-trained feature space, optimized by a structure-aware F1-product objective. To leverage labeled unknowns, it employs a Dirichlet-calibrated auxiliary head that jointly models known and unknown categories, improving both confidence calibration and known-class discrimination. Building on this, a logit-margin purity score estimates the likelihood of known classes to construct a high-purity candidate pool, while an OSAL-specific informativeness metric prioritizes partially ambiguous yet reliable samples. These components together form a flexible two-stage query strategy with adaptive precision control and minimal hyperparameter sensitivity. Extensive experiments across multiple OSAL benchmarks demonstrate that E$^2$OAL consistently surpasses state-of-the-art methods in accuracy, efficiency, and query precision, highlighting its effectiveness and practicality for real-world applications. The code is available at github.com/chenchenzong/E2OAL.[272] Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition
Hui Liu,Kecheng Chen,Jialiang Wang,Xianming Liu,Wenya Wang,Haoliang Li
Main category: cs.CV
TL;DR: 本文提出了一种基于贝叶斯视角的零样本图像分类新方法,通过引入类别特定概念并结合大语言模型(LLM)驱动的多阶段概念合成与自适应软裁剪似然机制,显著提升了视觉-语言模型(如CLIP)在零样本识别中的性能和鲁棒性。
Details
Motivation: 现有VLMs(如CLIP)在零样本图像识别中受限于提示工程次优及对目标类别的适应性差;已有提示优化方法依赖启发式设计、泛化性弱且易受异常提示干扰。 Method: 将类别概念建模为隐变量,从贝叶斯角度重构零样本分类为概念空间上的边缘化预测;提出LLM驱动的多阶段概念合成流程生成判别性、组合性概念,并用行列式点过程(DPP)保障多样性;设计无需训练的自适应软裁剪似然以抑制异常概念影响;理论分析包括鲁棒性保证与多类超额风险界。 Result: 在多个基准上持续超越当前最优方法,验证了所提框架在零样本图像分类中的有效性与鲁棒性。 Conclusion: 通过显式建模和优化类别相关概念及其先验与似然,本文提供了一种更结构化、可解释且鲁棒的零样本学习范式,为VLM提示优化开辟了新路径。 Abstract: Vision-Language Models (VLMs), such as CLIP, have significantly advanced zero-shot image recognition. However, their performance remains limited by suboptimal prompt engineering and poor adaptability to target classes. While recent methods attempt to improve prompts through diverse class descriptions, they often rely on heuristic designs, lack versatility, and are vulnerable to outlier prompts. This paper enhances prompt by incorporating class-specific concepts. By treating concepts as latent variables, we rethink zero-shot image classification from a Bayesian perspective, casting prediction as marginalization over the concept space, where each concept is weighted by a prior and a test-image conditioned likelihood. This formulation underscores the importance of both a well-structured concept proposal distribution and the refinement of concept priors. To construct an expressive and efficient proposal distribution, we introduce a multi-stage concept synthesis pipeline driven by LLMs to generate discriminative and compositional concepts, followed by a Determinantal Point Process to enforce diversity. To mitigate the influence of outlier concepts, we propose a training-free, adaptive soft-trim likelihood, which attenuates their impact in a single forward pass. We further provide robustness guarantees and derive multi-class excess risk bounds for our framework. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, validating its effectiveness in zero-shot image classification. Our code is available at https://github.com/less-and-less-bugs/CGBC.[273] Geometric Transformation-Embedded Mamba for Learned Video Compression
Hao Wei,Yanhui Zhou,Chenyang Ge
Main category: cs.CV
TL;DR: 本文提出了一种基于直接变换策略(非线性变换、量化与熵编码)的轻量高效视频压缩框架,摒弃传统混合编码中的显式运动估计与补偿;通过级联Mamba模块(CMM)建模长程时空依赖、局部性精炼前馈网络(LRFFN)增强局部空间表征,并设计条件通道熵模型利用时序先验提升概率估计精度,在低码率下显著提升感知质量与时间一致性。
Details
Motivation: 现有学习型视频压缩方法多依赖复杂的混合编码范式(需显式运动估计与补偿),导致模型结构臃肿、效率受限,亟需更简洁有效的端到端压缩框架。 Method: 提出基于直接变换的端到端视频压缩框架:1)设计嵌入多种几何变换的级联Mamba模块(CMM)以建模长程时空依赖;2)引入含差分卷积的局部性精炼前馈网络(LRFFN)增强局部空间表征;3)构建条件通道熵模型,利用时序先验准确估计当前隐特征的概率分布;CMM与LRFFN分别集成于编解码器中。 Result: 在低码率约束下,本方法在感知质量(如LPIPS)和时间一致性方面显著优于当前最先进视频压缩方法。 Conclusion: 基于直接变换策略的轻量框架结合Mamba建模能力、局部精炼机制与条件熵建模,可有效替代传统混合编码范式,在保持高性能的同时大幅提升架构简洁性与实用性。 Abstract: Although learned video compression methods have exhibited outstanding performance, most of them typically follow a hybrid coding paradigm that requires explicit motion estimation and compensation, resulting in a complex solution for video compression. In contrast, we introduce a streamlined yet effective video compression framework founded on a direct transform strategy, i.e., nonlinear transform, quantization, and entropy coding. We first develop a cascaded Mamba module (CMM) with different embedded geometric transformations to effectively explore both long-range spatial and temporal dependencies. To improve local spatial representation, we introduce a locality refinement feed-forward network (LRFFN) that incorporates a hybrid convolution block based on difference convolutions. We integrate the proposed CMM and LRFFN into the encoder and decoder of our compression framework. Moreover, we present a conditional channel-wise entropy model that effectively utilizes conditional temporal priors to accurately estimate the probability distributions of current latent features. Extensive experiments demonstrate that our method outperforms state-of-the-art video compression approaches in terms of perceptual quality and temporal consistency under low-bitrate constraints. Our source codes and models will be available at https://github.com/cshw2021/GTEM-LVC.[274] Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning
Yingkai Zhang,Tao Zhang,Jing Nie,Ying Fu
Main category: cs.CV
TL;DR: 本文提出了一种基于解混的融合框架,用于未配准高光谱图像超分辨率,通过解耦空间-光谱信息、引入可变形聚合模块和空间-通道交叉注意力机制,显著提升了超分辨性能。
Details
Motivation: 解决未配准高光谱图像超分辨率中因参考图像未配准导致的融合困难和模型学习能力受限问题。 Method: 采用奇异值分解进行初始光谱解混;设计粗到精的可变形聚合模块估计像素级光流和相似性图,并进行亚像素级细化;使用空间-通道丰度交叉注意力块优化聚合特征;引入空间-通道调制融合模块动态加权融合编码器-解码器特征。 Result: 在模拟和真实数据集上均达到当前最优的超分辨率性能。 Conclusion: 所提框架能有效缓解未配准融合影响并增强模型学习能力,为未配准HSI超分辨率提供了新思路。 Abstract: Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image. In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. Specifically, we first utilize singular value decomposition for initial spectral unmixing, preserving the original endmembers while dedicating the subsequent network to enhancing the initial abundance map. To leverage the spatial texture of the unregistered reference, we introduce a coarse-to-fine deformable aggregation module, which first estimates a pixel-level flow and a similarity map using a coarse pyramid predictor. It further performs fine sub-pixel refinement to achieve deformable aggregation of the reference features. The aggregative features are then refined via a series of spatial-channel abundance cross-attention blocks. Furthermore, a spatial-channel modulated fusion module is presented to merge encoder-decoder features using dynamic gating weights, yielding a high-quality, high-resolution HSI. Experimental results on simulated and real datasets confirm that our proposed method achieves state-of-the-art super-resolution performance. The code will be available at https://github.com/yingkai-zhang/UAFL.[275] RLPR: Radar-to-LiDAR Place Recognition via Two-Stage Asymmetric Cross-Modal Alignment for Autonomous Driving
Zhangshuo Qi,Jingyi Xu,Luqi Cheng,Shichen Wen,Guangming Xiong
Main category: cs.CV
TL;DR: 本文提出RLPR框架,通过双流网络提取结构特征,并引入两阶段非对称跨模态对齐策略(TACMA),实现雷达扫描在现有LiDAR地图中的精确定位,显著提升恶劣天气下的定位鲁棒性与零样本泛化能力。
Details
Motivation: LiDAR定位在恶劣天气下性能下降,而雷达虽抗天气干扰但缺乏雷达地图;雷达到LiDAR的跨模态定位可复用现有LiDAR地图,但面临模态异质性、配对数据稀缺及雷达类型多样等挑战。 Method: 提出RLPR框架:1)设计双流网络提取脱离传感器特性的结构特征(如忽略多普勒或RCS);2)基于雷达与LiDAR任务不对称性,提出两阶段非对称跨模态对齐(TACMA),以预训练雷达分支为判别性锚点引导对齐。 Result: 在四个数据集上达到SOTA识别精度,并展现出强零样本泛化能力,支持单芯片、扫描式和4D雷达。 Conclusion: RLPR有效解决了雷达到LiDAR跨模态定位中的特征共享与对齐难题,提升了全天气自动驾驶定位的鲁棒性与实用性。 Abstract: All-weather autonomy is critical for autonomous driving, which necessitates reliable localization across diverse scenarios. While LiDAR place recognition is widely deployed for this task, its performance degrades in adverse weather. Conversely, radar-based methods, though weather-resilient, are hindered by the general unavailability of radar maps. To bridge this gap, radar-to-LiDAR place recognition, which localizes radar scans within existing LiDAR maps, has garnered increasing interest. However, extracting discriminative and generalizable features shared between modalities remains challenging, compounded by the scarcity of large-scale paired training data and the signal heterogeneity across radar types. In this work, we propose RLPR, a robust radar-to-LiDAR place recognition framework compatible with single-chip, scanning, and 4D radars. We first design a dual-stream network to extract structural features that abstract away from sensor-specific signal properties (e.g., Doppler or RCS). Subsequently, motivated by our task-specific asymmetry observation between radar and LiDAR, we introduce a two-stage asymmetric cross-modal alignment (TACMA) strategy, which leverages the pre-trained radar branch as a discriminative anchor to guide the alignment process. Experiments on four datasets demonstrate that RLPR achieves state-of-the-art recognition accuracy with strong zero-shot generalization capabilities.[276] IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation
Sunghyun Baek,Jaemyung Yu,Seunghee Koh,Minsu Kim,Hyeonseong Jeon,Junmo Kim
Main category: cs.CV
TL;DR: 本文提出IMSE方法,通过奇异值分解(SVD)仅微调ViT线性层的奇异值,并设计多样性最大化损失缓解熵最小化导致的特征坍塌;在CTTA中引入域感知谱码检索以重用历史域知识,在显著减少可训练参数的同时实现SOTA性能。
Details
Motivation: 现有TTA方法未能充分利用大预训练模型的丰富表征能力,且熵最小化易引发特征坍塌;CTTA中缺乏对历史域知识的有效保留与复用。 Method: 1)提出Intrinsic Mixture of Spectral Experts(IMSE),对ViT各线性层做SVD,仅更新奇异值;2)设计基于专家-输入对齐的多样性最大化损失,缓解熵最小化缺陷;3)在CTTA中引入Domain-Aware Spectral Code Retrieval,通过估计输入分布检测域偏移并检索适配过的奇异值。 Result: 在标准TTA基准上达到SOTA;在CTTA和渐进式CTTA中分别提升准确率3.4pp和2.4pp,且可训练参数减少385倍。 Conclusion: 仅微调奇异值是一种高效利用大模型内在谱专家结构的TTA范式;结合多样性正则与域感知检索,可在极低参数开销下实现强泛化与持续适应能力。 Abstract: Test-time adaptation (TTA) has been widely explored to prevent performance degradation when test data differ from the training distribution. However, fully leveraging the rich representations of large pretrained models with minimal parameter updates remains underexplored. In this paper, we propose Intrinsic Mixture of Spectral Experts (IMSE) that leverages the spectral experts inherently embedded in Vision Transformers. We decompose each linear layer via singular value decomposition (SVD) and adapt only the singular values, while keeping the singular vectors fixed. We further identify a key limitation of entropy minimization in TTA: it often induces feature collapse, causing the model to rely on domain-specific features rather than class-discriminative features. To address this, we propose a diversity maximization loss based on expert-input alignment, which encourages diverse utilization of spectral experts during adaptation. In the continual test-time adaptation (CTTA) scenario, beyond preserving pretrained knowledge, it is crucial to retain and reuse knowledge from previously observed domains. We introduce Domain-Aware Spectral Code Retrieval, which estimates input distributions to detect domain shifts, and retrieves adapted singular values for rapid adaptation. Consequently, our method achieves state-of-the-art performance on various distribution-shift benchmarks under the TTA setting. In CTTA and Gradual CTTA, it further improves accuracy by 3.4 percentage points (pp) and 2.4 pp, respectively, while requiring 385 times fewer trainable parameters. Our code is available at https://github.com/baek85/IMSE.[277] A Hybrid Vision Transformer Approach for Mathematical Expression Recognition
Anh Duy Le,Van Linh Pham,Vinh Loi Ly,Nam Quan Nguyen,Huu Thang Nguyen,Tuan Anh Tran
Main category: cs.CV
TL;DR: 本文提出了一种基于混合视觉Transformer(HVT)与二维位置编码的数学表达式识别方法,结合覆盖注意力解码器和ViT的[CLS]标记初始化解码器,显著提升了识别性能。
Details
Motivation: 数学表达式识别比普通文本识别更复杂,因其具有二维结构和符号尺寸不一的特点,现有方法在解析过程中存在欠解析和过解析问题。 Method: 采用带2D位置编码的混合视觉Transformer(HVT)作为编码器提取符号间复杂关系;使用覆盖注意力解码器追踪注意力历史以缓解欠/过解析问题;并利用ViT的[CLS]标记作为解码器初始嵌入。 Result: 在IM2LATEX-100K数据集上达到89.94的BLEU分数,优于当前最优方法。 Conclusion: 所提方法有效建模数学表达式的二维结构和符号关系,覆盖注意力机制与[CLS]初始化策略显著提升了识别准确率和鲁棒性。 Abstract: One of the crucial challenges taken in document analysis is mathematical expression recognition. Unlike text recognition which only focuses on one-dimensional structure images, mathematical expression recognition is a much more complicated problem because of its two-dimensional structure and different symbol size. In this paper, we propose using a Hybrid Vision Transformer (HVT) with 2D positional encoding as the encoder to extract the complex relationship between symbols from the image. A coverage attention decoder is used to better track attention's history to handle the under-parsing and over-parsing problems. We also showed the benefit of using the [CLS] token of ViT as the initial embedding of the decoder. Experiments performed on the IM2LATEX-100K dataset have shown the effectiveness of our method by achieving a BLEU score of 89.94 and outperforming current state-of-the-art methods.[278] Text to Automata Diagrams: Comparing TikZ Code Generation with Direct Image Synthesis
Ethan Young,Zichun Wang,Aiden Taylor,Chance Jewell,Julian Myers,Satya Sri Rajiteswari Nimmagadda,Anthony White,Aniruddha Maiti,Ananya Jana
Main category: cs.CV
TL;DR: 本研究探讨了当前视觉-语言模型和大语言模型处理学生手绘计算机科学图表的能力,发现直接从图像生成的文本描述常不准确,需人工修正;修正后的描述输入大语言模型可生成更准确的TikZ代码,从而支持自动评分与无障碍教学材料生成。
Details
Motivation: 学生在考试或作业中手绘的计算机科学图表(如自动机、数据结构图)在结构、布局和正确性上差异大,亟需自动化工具理解并转化为标准数字格式,以支持自动评分与教学辅助。 Method: 使用扫描的学生手绘图作为输入,先用视觉-语言模型生成文本描述,再由人工校正;将原始与校正后的描述分别输入大语言模型生成TikZ代码,编译后与原图对比评估。 Result: 视觉-语言模型直接生成的描述错误较多;人工校正显著提升描述质量,并进而提高大语言模型生成TikZ代码的准确性。 Conclusion: 结合视觉-语言模型与人工校正、再经大语言模型生成代码的流程,是实现教育图表数字化与自动化反馈的可行路径,有助于推动计算机科学教育的智能化与可访问性。 Abstract: Diagrams are widely used in teaching computer science courses. They are useful in subjects such as automata and formal languages, data structures, etc. These diagrams, often drawn by students during exams or assignments, vary in structure, layout, and correctness. This study examines whether current vision-language and large language models can process such diagrams and produce accurate textual and digital representations. In this study, scanned student-drawn diagrams are used as input. Then, textual descriptions are generated from these images using a vision-language model. The descriptions are checked and revised by human reviewers to make them accurate. Both the generated and the revised descriptions are then fed to a large language model to generate TikZ code. The resulting diagrams are compiled and then evaluated against the original scanned diagrams. We found descriptions generated directly from images using vision-language models are often incorrect and human correction can substantially improve the quality of vision language model generated descriptions. This research can help computer science education by paving the way for automated grading and feedback and creating more accessible instructional materials.[279] $L^3$:Scene-agnostic Visual Localization in the Wild
Yu Zhang,Muhua Zhu,Yifei Xue,Tie Ji,Yizhen Lao
Main category: cs.CV
TL;DR: 本文提出了一种无需离线预处理的新型无地图视觉定位框架L³,利用前馈3D重建网络在线进行3D重建,并结合两阶段度量尺度恢复与姿态优化,实现高精度定位。
Details
Motivation: 传统视觉定位方法依赖离线场景预处理获取3D结构信息,带来计算、时间及存储开销;本文旨在实现无需任何离线预处理的野外场景视觉定位。 Method: 提出L³框架:基于RGB图像直接在线3D重建,再通过两阶段度量尺度恢复和基于2D-3D对应关系的姿态优化完成定位。 Result: 在多个基准上性能媲美SOTA方法,并在稀疏场景(每场景参考图像更少)中展现出显著更强的鲁棒性。 Conclusion: L³成功实现了无需预建图与存储场景表示的高效、鲁棒视觉定位,为轻量化与动态环境定位提供了新思路。 Abstract: Standard visual localization methods typically require offline pre-processing of scenes to obtain 3D structural information for better performance. This inevitably introduces additional computational and time costs, as well as the overhead of storing scene representations. Can we visually localize in a wild scene without any off-line preprocessing step? In this paper, we leverage the online inference capabilities of feed-forward 3D reconstruction networks to propose a novel map-free visual localization framework $L^3$. Specifically, by performing direct online 3D reconstruction on RGB images, followed by two-stage metric scale recovery and pose refinement based on 2D-3D correspondences, $L^3$ achieves high accuracy without the need to pre-build or store any offline scene representations. Extensive experiments demonstrate $L^3$ not only that the performance is comparable to state-of-the-art solutions on various benchmarks, but also that it exhibits significantly superior robustness in sparse scenes (fewer reference images per scene).[280] VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer
Yanning Hou,Peiyuan Li,Zirui Liu,Yitong Wang,Yanran Ruan,Jianfeng Qiu,Ke Xu
Main category: cs.CV
TL;DR: 本文提出VisualAD,一种纯视觉的零样本异常检测框架,摒弃了传统依赖文本编码器和跨模态对齐的方法,通过在冻结ViT主干中引入两个可学习token来分别建模正常性和异常性,并结合空间感知交叉注意力与自对齐函数,实现了SOTA性能。
Details
Motivation: 现有零样本异常检测方法依赖视觉-语言模型(如CLIP),需文本编码器和跨模态对齐,导致训练不稳定和参数冗余;本文旨在探究文本分支是否必要,并构建更简洁、稳定、高效的纯视觉方案。 Method: 提出VisualAD:在冻结的ViT主干中插入两个可学习token(normal/abnormal token),通过多层自注意力使其与图像patch交互以学习高层语义;引入Spatial-Aware Cross-Attention(SCA)注入空间信息,以及轻量级Self-Alignment Function(SAF)重校准patch特征;全程无需文本分支或跨模态训练。 Result: 在13个涵盖工业与医疗领域的零样本异常检测基准上达到SOTA性能;可即插即用地适配CLIP图像编码器、DINOv2等预训练视觉骨干。 Conclusion: 文本分支在零样本异常检测中并非必需;纯视觉架构通过结构化token设计与细粒度特征对齐,不仅能替代VLM范式,还能提升稳定性、效率与泛化性。 Abstract: Zero-shot anomaly detection (ZSAD) requires detecting and localizing anomalies without access to target-class anomaly samples. Mainstream methods rely on vision-language models (VLMs) such as CLIP: they build hand-crafted or learned prompt sets for normal and abnormal semantics, then compute image-text similarities for open-set discrimination. While effective, this paradigm depends on a text encoder and cross-modal alignment, which can lead to training instability and parameter redundancy. This work revisits the necessity of the text branch in ZSAD and presents VisualAD, a purely visual framework built on Vision Transformers. We introduce two learnable tokens within a frozen backbone to directly encode normality and abnormality. Through multi-layer self-attention, these tokens interact with patch tokens, gradually acquiring high-level notions of normality and anomaly while guiding patches to highlight anomaly-related cues. Additionally, we incorporate a Spatial-Aware Cross-Attention (SCA) module and a lightweight Self-Alignment Function (SAF): SCA injects fine-grained spatial information into the tokens, and SAF recalibrates patch features before anomaly scoring. VisualAD achieves state-of-the-art performance on 13 zero-shot anomaly detection benchmarks spanning industrial and medical domains, and adapts seamlessly to pretrained vision backbones such as the CLIP image encoder and DINOv2. Code: https://github.com/7HHHHH/VisualAD[281] SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation
Jiaye Feng,Qixiang Yin,Yuankun Liu,Tong Mo,Weiping Li
Main category: cs.CV
TL;DR: 本文提出SGG-R³框架,通过任务特定的链式思维引导监督微调与分组序列策略优化的强化学习,解决场景图生成中结构化推理不足和长尾关系分布问题。
Details
Motivation: 现有基于多模态大语言模型的场景图生成方法受限于缺乏任务特定的结构化推理能力,以及稀疏、长尾的关系分布,导致生成的场景图召回率低且预测存在偏差。 Method: 提出SGG-R³结构化推理框架,包含三阶段流程:1)关系增强的链式思维引导监督微调(SFT),利用MLLM生成并经嵌入相似性过滤的关系样本缓解稀疏性;2)引入阶段对齐奖励机制;3)设计融合细粒度与粗粒度的双粒度奖励,结合频率自适应加权与语义聚类提升长尾关系覆盖。 Result: 在两个基准数据集上的实验表明,SGG-R³显著优于现有方法,展现出更强的性能与泛化能力。 Conclusion: SGG-R³通过结构化推理与针对性奖励设计,有效缓解了场景图生成中的稀疏性与长尾问题,提升了生成完整性与公平性。 Abstract: Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs characterized by low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), designed to engage in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy by leveraging an MLLM and refined via embedding similarity filtering to alleviate relation sparsity. Subsequently, a stage-aligned reward scheme optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward which integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.[282] Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time
Weijie Zhou,Xuantang Xiong,Zhenlin Hu,Xiaomeng Zhu,Chaoyang Zhao,Honghui Dong,Zhengyou Zhang,Ming Tang,Jinqiao Wang
Main category: cs.CV
TL;DR: 本文提出EcoG(Egocentric Co-Speech Grounding)任务及配套基准EcoG-Bench,强调在第一人称视角下联合建模言语、指向动作与时间戳的三元对齐(What/Where/When),以检验模型是否真正掌握共语手势与语音的细粒度时序对齐能力;实验发现当前多模态大模型在此任务上远逊于人类,且输入接口(如原生音视频流 vs. 采样帧+精准ASR)显著影响性能,表明模态接口本身是制约时序对齐感知的关键瓶颈。
Details
Motivation: 现有具身协作基准存在语言捷径,使多模态大模型无需学习语音-视觉对齐即可表现良好,无法真实评估其对指代性共语指向(deictic co-speech pointing)中‘语音-手势-时间’联合 grounding 的能力。 Method: 提出EcoG任务框架,要求模型必须同时预测指代对象(What)、空间位置(Where)和动作发生时刻(When);构建EcoG-Bench:811段双语(英/中)第一人称视频片段,含密集空间标注与毫秒级指向动作标注,并采用渐进式认知评估协议;通过对比原生音视频输入与经ASR对齐的帧采样输入进行诊断性消融。 Result: 人类在EcoG-Bench上达到96.9%严格准确率,而最优模型(Gemini-3-Pro)原生音视频设置仅17.0%,改用带词级时间戳的ASR+帧采样后提升至42.9%;揭示了当前多模态接口在时序对齐线索可观测性上的严重瓶颈。 Conclusion: EcoG-Bench为事件级语音-手势绑定提供了严格可执行的评测标准;结果表明,多模态接口设计(而非模型推理能力本身)可能是阻碍模型学习精细时序对齐的关键因素。 Abstract: In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., ``pass me \textit{that}''), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing \emph{stroke}. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the \emph{audio--visual alignment} required by deictic interaction. To bridge this gap, we introduce \textbf{Egocentric Co-Speech Grounding (EcoG)}, where grounding is executable only if an agent jointly predicts \textit{What}, \textit{Where}, and \textit{When}. To operationalize this, we present \textbf{EcoG-Bench}, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of \textbf{811} egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a \textbf{Progressive Cognitive Evaluation} protocol. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (\textbf{96.9\%} strict Eco-Accuracy), the best native video-audio setting remains low (Gemini-3-Pro: \textbf{17.0\%}). Moreover, in a diagnostic ablation, replacing the native video--audio interface with timestamped frame samples and externally verified ASR (with word-level timing) substantially improves the same model (\textbf{17.0\%}$\to$\textbf{42.9\%}). Overall, EcoG-Bench provides a strict, executable testbed for event-level speech--gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.[283] On the Feasibility and Opportunity of Autoregressive 3D Object Detection
Zanming Huang,Jinsu Yoo,Sooyoung Jeon,Zhenzhen Liu,Mark Campbell,Kilian Q Weinberger,Bharath Hariharan,Wei-Lun Chao,Katie Z Luo
Main category: cs.CV
TL;DR: AutoReg3D提出一种基于自回归序列生成的LiDAR 3D目标检测方法,摒弃传统anchor和NMS,按近到远顺序生成离散token表示目标,提升可扩展性并支持引入大模型技术。
Details
Motivation: 传统LiDAR 3D检测器依赖手工设计的proposal head(如anchor分配和NMS),导致训练复杂、泛化受限;亟需更灵活、可扩展的建模范式。 Method: 将3D检测建模为自回归序列生成任务:以点云特征为输入,按range-causal(近到远)顺序逐个生成每个目标的离散token序列(含中心、尺寸、朝向、速度、类别);利用LiDAR几何先验实现稳定teacher forcing与解码。 Result: 在nuScenes数据集上达到有竞争力的性能,无需anchor和NMS;兼容多种点云编码器与骨干网络;验证了GRPO等强化学习策略在任务对齐优化中的有效性。 Conclusion: 自回归解码是一种可行且灵活的LiDAR 3D检测新范式,为引入现代序列建模(如语言模型)技术进入3D感知领域开辟路径。 Abstract: LiDAR-based 3D object detectors typically rely on proposal heads with hand-crafted components like anchor assignment and non-maximum suppression (NMS), complicating training and limiting extensibility. We present AutoReg3D, an autoregressive 3D detector that casts detection as sequence generation. Given point-cloud features, AutoReg3D emits objects in a range-causal (near-to-far) order and encodes each object as a short, discrete-token sequence consisting of its center, size, orientation, velocity, and class. This near-to-far ordering mirrors LiDAR geometry--near objects occlude far ones but not vice versa--enabling straightforward teacher forcing during training and autoregressive decoding at test time. AutoReg3D is compatible across diverse point-cloud or backbones and attains competitive nuScenes performance without anchors or NMS. Beyond parity, the sequential formulation unlocks language-model advances for 3D perception, including GRPO-style reinforcement learning for task-aligned objectives. These results position autoregressive decoding as a viable, flexible alternative for LiDAR-based detection and open a path to importing modern sequence-modeling tools into 3D perception.[284] TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
Stefan Lionar,Gim Hee Lee
Main category: cs.CV
TL;DR: 本文提出TeamHOI框架,利用基于Transformer的去中心化策略和掩码对抗运动先验(AMP)方法,实现任意数量人形智能体在合作人-物交互(HOI)任务中的高真实感、高成功率协同控制。
Details
Motivation: 物理驱动的人形控制虽在单智能体行为上取得进展,但在多智能体协同人-物交互(HOI)任务中仍面临扩展性差、数据稀缺与运动真实性难保障等挑战。 Method: 提出TeamHOI框架:1)采用带队友token的Transformer策略网络实现局部观测下的可扩展去中心化协调;2)设计掩码对抗运动先验(masked AMP),在单人参考动作基础上掩蔽交互部位,并通过任务奖励引导生成多样化、物理合理的协同动作;3)引入与团队规模及物体形状无关的编队奖励以提升搬运稳定性。 Result: 在2–8个智能体协同搬运不同几何形状物体的任务中,TeamHOI以单一策略实现了高成功率与跨配置的一致协同行为,显著提升了运动真实性和任务鲁棒性。 Conclusion: TeamHOI证明了单一去中心化策略可在数据受限条件下高效泛化至变规模多智能体HOI任务,为真实感协同控制提供了新范式。 Abstract: Physics-based humanoid control has achieved remarkable progress in enabling realistic and high-performing single-agent behaviors, yet extending these capabilities to cooperative human-object interaction (HOI) remains challenging. We present TeamHOI, a framework that enables a single decentralized policy to handle cooperative HOIs across any number of cooperating agents. Each agent operates using local observations while attending to other teammates through a Transformer-based policy network with teammate tokens, allowing scalable coordination across variable team sizes. To enforce motion realism while addressing the scarcity of cooperative HOI data, we further introduce a masked Adversarial Motion Prior (AMP) strategy that uses single-human reference motions while masking object-interacting body parts during training. The masked regions are then guided through task rewards to produce diverse and physically plausible cooperative behaviors. We evaluate TeamHOI on a challenging cooperative carrying task involving two to eight humanoid agents and varied object geometries. Finally, to promote stable carrying, we design a team-size- and shape-agnostic formation reward. TeamHOI achieves high success rates and demonstrates coherent cooperation across diverse configurations with a single policy.[285] AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models
Teng Wang,Yanting Lu,Ruize Wang
Main category: cs.CV
TL;DR: AutoTraces是一种结合视觉、语言与轨迹的自回归模型,利用大语言模型(LLM)建模人类行为,通过创新的轨迹分词与轻量编解码器将物理坐标空间融入LLM,辅以自动链式推理(CoT)机制,实现高精度、长时程、跨场景的机器人轨迹预测。
Details
Motivation: 现有方法仅依赖文本表征,难以建模复杂人类行为与长时程轨迹交互;需减少对人工标注链式推理的依赖。 Method: 提出新型轨迹分词方案(点符号+数值嵌入),通过轻量编解码器将其嵌入LLM空间;设计基于多模态LLM的自动链式推理生成机制;采用两阶段训练策略。 Result: 在长时程轨迹预测上达到SOTA精度,具备强跨场景泛化能力,并支持可变长度预测。 Conclusion: 将LLM的推理能力有效拓展至物理轨迹空间是可行且高效的,AutoTraces为具身智能中的行为理解与预测提供了新范式。 Abstract: We present AutoTraces, an autoregressive vision-language-trajectory model for robot trajectory forecasting in humam-populated environments, which harnesses the inherent reasoning capabilities of large language models (LLMs) to model complex human behaviors. In contrast to prior works that rely solely on textual representations, our key innovation lies in a novel trajectory tokenization scheme, which represents waypoints with point tokens as categorical and positional markers while encoding waypoint numerical values as corresponding point embeddings, seamlessly integrated into the LLM's space through a lightweight encoder-decoder architecture. This design preserves the LLM's native autoregressive generation mechanism while extending it to physical coordinate spaces, facilitates modeling of long-term interactions in trajectory data. We further introduce an automated chain-of-thought (CoT) generation mechanism that leverages a multimodal LLM to infer spatio-temporal relationships from visual observations and trajectory data, eliminating reliance on manual annotation. Through a two-stage training strategy, our AutoTraces achieves SOTA forecasting accuracy, particularly in long-horizon prediction, while exhibiting strong cross-scene generalization and supporting flexible-length forecasting.[286] ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation
Haoyu Tong,Xiangyu Dong,Xiaoguang Ma,Haoran Zhao,Yaoming Zhou,Chenghao Lin
Main category: cs.CV
TL;DR: 本文提出了一种视觉-空间推理(ViSA)增强框架,用于改进空中视觉-语言导航(VLN),通过三阶段协作架构实现图像平面上的直接推理,无需额外训练或复杂中间表示,在CityNav基准上成功率达70.3%提升。
Details
Motivation: 现有空中VLN方法依赖检测-规划流程,存在空间推理能力不足和语言歧义问题。 Method: 提出ViSA增强框架,采用三阶段协同架构,利用结构化视觉提示,使视觉-语言模型能在图像平面上直接推理,无需额外训练或复杂中间表示。 Result: 在CityNav基准上,ViSA增强的VLN相比完全训练的SOTA方法,成功率提升70.3%。 Conclusion: ViSA框架展现出作为空中VLN系统主干的巨大潜力。 Abstract: Existing aerial Vision-Language Navigation (VLN) methods predominantly adopt a detection-and-planning pipeline, which converts open-vocabulary detections into discrete textual scene graphs. These approaches are plagued by inadequate spatial reasoning capabilities and inherent linguistic ambiguities. To address these bottlenecks, we propose a Visual-Spatial Reasoning (ViSA) enhanced framework for aerial VLN. Specifically, a triple-phase collaborative architecture is designed to leverage structured visual prompting, enabling Vision-Language Models (VLMs) to perform direct reasoning on image planes without the need for additional training or complex intermediate representations. Comprehensive evaluations on the CityNav benchmark demonstrate that the ViSA-enhanced VLN achieves a 70.3\% improvement in success rate compared to the fully trained state-of-the-art (SOTA) method, elucidating its great potential as a backbone for aerial VLN systems.[287] It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models
Jaeha Choi,Jin Won Lee,Siwoo You,Jangho Lee
Main category: cs.CV
TL;DR: 本文揭示了当前视觉语言模型(VLMs)在真实场景下读取模拟时钟仍存在显著困难,主要源于现有数据集缺乏真实性和多样性;为此提出新数据集TickTockVQA和微调方法Swap-DPO,显著提升模型在复杂现实条件下的时钟理解能力。
Details
Motivation: 现有VLM虽在多模态推理任务中表现优异,但在真实世界模拟时钟阅读上效果不佳,因现有数据集多为合成或平面化、风格单一、背景缺失,无法支撑模型进行鲁棒的空间-时间推理。 Method: 构建真实世界、多样场景、人工标注的TickTockVQA数据集(含明确时/分指针标注及可推断的AM/PM标签),并提出基于直接偏好优化(DPO)的Swap-DPO微调框架,以对齐模型对时间的准确空间-语义理解。 Result: 实验表明,所提方法大幅提升了VLM在真实场景(如遮挡、光照变化、杂乱背景)下读取模拟时钟的准确性与鲁棒性。 Conclusion: 真实世界模拟时钟阅读是检验VLM空间-时间推理能力的重要基准;TickTockVQA与Swap-DPO为提升VLM在细粒度视觉理解与时空推理方面提供了有效数据与方法基础。 Abstract: Advances in vision-language models (VLMs) have achieved remarkable success on complex multimodal reasoning tasks, leading to the assumption that they should also excel at reading analog clocks. However, contrary to this expectation, our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs. Existing analog clock datasets are largely synthetic or planar with limited stylistic diversity and minimal background context, failing to capture the visual variability of real-world scenes. As a result, VLMs trained on such data exhibit weak spatial-temporal reasoning, frequently confusing the hour and minute hands and struggling under common visual conditions such as occlusion, lighting variation, and cluttered backgrounds. To address this issue, we introduce TickTockVQA, a human-annotated dataset containing analog clocks in diverse real-world scenarios. TickTockVQA provides explicit hour and minute annotations, and includes an AM/PM tag when it is inferable from the visual context. Furthermore, we propose Swap-DPO, a direct preference optimization based fine-tuning framework to align model reasoning toward accurate time interpretation. Experimental results demonstrate that our approach substantially enhances clock reading accuracy and robustness under real-world conditions, establishing a foundation for future research on spatial-temporal reasoning and visual understanding in VLMs.[288] Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared
Yafei Zhang,Meng Ma,Huafeng Li,Yu Liu
Main category: cs.CV
TL;DR: 本文提出了一种字典引导的、系数域的红外-可见光图像融合框架(DCMIF),用于解决红外模态缺失时的融合问题,通过共享卷积字典、VIS引导的IR系数推理及原子级自适应融合,在保持可解释性的同时提升感知质量与下游检测性能。
Details
Motivation: 现有红外-可见光图像融合方法依赖双模态数据,当红外模态缺失时,像素空间生成方法难以控制且缺乏可解释性。 Method: 提出基于共享卷积字典的系数域融合框架,包含三部分:(1) 联合共享字典表示学习(JSRL);(2) 可见光引导的红外系数推理(VGII),结合冻结大语言模型作为弱语义先验;(3) 基于表征推理的自适应融合(AFRI),在原子级融合并重建。 Result: 在缺失红外模态设定下,显著提升融合图像的感知质量与下游目标检测性能,实验验证了方法有效性与泛化性。 Conclusion: DCMIF是首个联合学习共享字典并实现在系数域进行推理-融合的框架,避免了不可控的像素级生成,同时保证先验一致性与表征可解释性。 Abstract: Infrared-visible (IR-VIS) image fusion is vital for perception and security, yet most methods rely on the availability of both modalities during training and inference. When the infrared modality is absent, pixel-space generative substitutes become hard to control and inherently lack interpretability. We address missing-IR fusion by proposing a dictionary-guided, coefficient-domain framework built upon a shared convolutional dictionary. The pipeline comprises three key components: (1) Joint Shared-dictionary Representation Learning (JSRL) learns a unified and interpretable atom space shared by both IR and VIS modalities; (2) VIS-Guided IR Inference (VGII) transfers VIS coefficients to pseudo-IR coefficients in the coefficient domain and performs a one-step closed-loop refinement guided by a frozen large language model as a weak semantic prior; and (3) Adaptive Fusion via Representation Inference (AFRI) merges VIS structures and inferred IR cues at the atom level through window attention and convolutional mixing, followed by reconstruction with the shared dictionary. This encode-transfer-fuse-reconstruct pipeline avoids uncontrolled pixel-space generation while ensuring prior preservation within interpretable dictionary-coefficient representation. Experiments under missing-IR settings demonstrate consistent improvements in perceptual quality and downstream detection performance. To our knowledge, this represents the first framework that jointly learns a shared dictionary and performs coefficient-domain inference-fusion to tackle missing-IR fusion. The source code is publicly available at https://github.com/harukiv/DCMIF.[289] VSDiffusion: Taming Ill-Posed Shadow Generation via Visibility-Constrained Diffusion
Jing Li,Jing Zhang
Main category: cs.CV
TL;DR: 本文提出VSDiffusion,一种基于可见性约束的两阶段扩散模型,用于在图像合成中生成几何一致的真实阴影。
Details
Motivation: 在图像合成中为插入的前景物体生成真实阴影具有挑战性,尤其在复杂场景中维持阴影与物体之间的几何一致性困难,因阴影形成问题本质是病态的。 Method: 提出VSDiffusion:第一阶段预测粗略阴影掩码以定位潜在阴影区域;第二阶段基于估计的光照和深度线索进行条件扩散生成精确阴影。通过可见性控制分支(带阴影门控交叉注意力)和学习的软先验图(重加权易错区域损失)注入可见性先验,并引入高频引导增强模块提升边界清晰度和背景纹理交互。 Result: 在DESOBAv2数据集上的实验表明,VSDiffusion生成的阴影更准确,在多数评估指标上达到新SOTA。 Conclusion: VSDiffusion通过引入可见性先验和多阶段建模,有效缩小了阴影生成的解空间,显著提升了阴影几何一致性与真实性。 Abstract: Generating realistic cast shadows for inserted foreground objects is a crucial yet challenging problem in image composition, where maintaining geometric consistency of shadow and object in complex scenes remains difficult due to the ill-posed nature of shadow formation. To address this issue, we propose VSDiffusion, a visibility-constrained two-stage framework designed to narrow the solution space by incorporating visibility priors. In Stage I, we predict a coarse shadow mask to localize plausible shadow generated regions. And in Stage II, conditional diffusion is performed guided by lighting and depth cues estimated from the composite to generate accurate shadows. In VSDiffusion, we inject visibility priors through two complementary pathways. First, a visibility control branch with shadow-gated cross attention that provides multi-scale structural guidance. Then, a learned soft prior map that reweights training loss in error-prone regions to enhance geometric correction. Additionally, we also introduce high-frequency guided enhancement module to sharpen boundaries and improve texture interaction with the background. Experiments on widely used public DESOBAv2 dataset demonstrated that our proposed VSDiffusion can generate accurate shadow, and establishes new SOTA results across most evaluation metrics.[290] Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model
Sangjune Park,Inhyeok Choi,Donghyeon Soon,Youngwoo Jeon,Kyungdon Joo
Main category: cs.CV
TL;DR: 本文提出MambaDance,一种基于Mamba架构的两阶段扩散模型,用于生成音乐同步、节奏感强的舞蹈动作;引入高斯节拍表征显式建模音乐节拍,显著提升长序列舞蹈生成质量。
Details
Motivation: 现有舞蹈生成方法难以充分建模舞蹈固有的时序性、节奏性和音乐同步性。 Method: 提出基于Mamba的两阶段扩散模型(替代Transformer),并设计高斯分布节拍表征以显式引导舞蹈序列解码。 Result: 在AIST++和FineDance数据集上,MambaDance在不同长度序列上均优于先前方法,生成动作更合理且具节奏一致性;提供定性结果与演示视频。 Conclusion: Mamba架构适配舞蹈生成任务,结合节拍感知设计可有效提升音乐驱动舞蹈生成的质量与鲁棒性。 Abstract: Dance is a form of human motion characterized by emotional expression and communication, playing a role in various fields such as music, virtual reality, and content creation. Existing methods for dance generation often fail to adequately capture the inherently sequential, rhythmical, and music-synchronized characteristics of dance. In this paper, we propose \emph{MambaDance}, a new dance generation approach that leverages a Mamba-based diffusion model. Mamba, well-suited to handling long and autoregressive sequences, is integrated into our two-stage diffusion architecture, substituting off-the-shelf Transformer. Additionally, considering the critical role of musical beats in dance choreography, we propose a Gaussian-based beat representation to explicitly guide the decoding of dance sequences. Experiments on AIST++ and FineDance datasets for each sequence length show that our proposed method effectively generates plausible dance movements while reflecting essential characteristics, consistently from short to long dances, compared to the previous methods. Additional qualitative results and demo videos are available at \small{https://vision3d-lab.github.io/mambadance}.[291] Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades
Ashkan Taghipour,Morteza Ghahremani,Zinuo Li,Hamid Laga,Farid Boussaid,Mohammed Bennamoun
Main category: cs.CV
TL;DR: 本文提出了一种两阶段级联框架,通过文本生成2D骨架序列,再基于骨架和参考图像生成高质量视频,解决了复杂人体动作视频生成中细粒度运动控制难和显式姿态控制成本高的问题,并构建了首个面向高难度动作的合成数据集。
Details
Motivation: 现有视频扩散模型难以生成翻腾、侧空翻、武术等复杂人体动作;纯文本条件存在时序模糊性,而显式姿态控制需用户提供完整骨架序列,成本高昂。此外,缺乏公开、可控、高质量的复杂动作视频数据集。 Method: 提出两阶段框架:1)自回归文本到骨架模型,逐关节生成2D姿态序列,建模长程时序依赖与关节协同;2)基于DINO-ALF(自适应层融合)多级参考编码器的姿势条件视频扩散模型,提升大姿态变化与自遮挡下的外观与服装细节保持能力;同时构建Blender合成数据集(2000个多样化角色的特技动作视频)。 Result: 在自建合成数据集和Motion-X Fitness基准上,文本到骨架模型在FID、R-precision和动作多样性指标上优于先前方法;姿态到视频模型在VBench的时序一致性、运动平滑性和主体保持性指标上均达到最优。 Conclusion: 该级联框架有效解耦了语义理解与视觉合成,兼顾动作可控性与视频质量,所构建的数据集填补了复杂动作视频生成领域的关键空白,为后续研究提供了新范式与基础资源。 Abstract: Generating videos of complex human motions such as flips, cartwheels, and martial arts remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to provide complete skeleton sequences that are costly to produce for long and dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions by predicting each joint conditioned on previously generated poses. This design captures long-range temporal dependencies and inter-joint coordination required for complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence. It employs DINO-ALF (Adaptive Layer Fusion), a multi-level reference encoder that preserves appearance and clothing details under large pose changes and self-occlusions. To address the lack of publicly available datasets for complex human motion video generation, we introduce a Blender-based synthetic dataset containing 2,000 videos with diverse characters performing acrobatic and stunt-like motions. The dataset provides full control over appearance, motion, and environment. It fills an important gap because existing benchmarks significantly under-represent acrobatic motions while web-collected datasets raise copyright and privacy concerns. Experiments on our synthetic dataset and the Motion-X Fitness benchmark show that our text-to-skeleton model outperforms prior methods on FID, R-precision, and motion diversity. Our pose-to-video model also achieves the best results among all compared methods on VBench metrics for temporal consistency, motion smoothness, and subject preservation.[292] QualiTeacher: Quality-Conditioned Pseudo-Labeling for Real-World Image Restoration
Fengyang Xiao,Jingjia Feng,Peng Hu,Dingming Zhang,Lei Xu,Guanyi Qin,Lu Li,Chunming He,Sina Farsiu
Main category: cs.CV
TL;DR: 本文提出QualiTeacher框架,通过将伪标签质量作为条件监督信号,而非简单过滤,使学生模型能学习质量分级的恢复流形,从而避免低质量伪标签的干扰并生成超越教师模型质量的结果。
Details
Motivation: 现有基于伪标签(PL)的图像恢复方法面临信任低质量PL会引入伪影、而丢弃PL又限制数据多样性的矛盾。 Method: QualiTeacher框架利用多模型集成的无参考图像质量评估(NR-IQA)估计PL质量,并以此为条件指导学生模型;辅以多增强策略扩展PL质量谱、基于分数的偏好优化(类DPO)确保质量单调分离、以及裁剪一致性损失防止IQM被对抗性优化。 Result: 在标准真实世界图像恢复(RWIR)基准上显著提升现有伪标签框架性能,实现即插即用式改进,并展现出生成质量超越教师模型的能力。 Conclusion: 伪标签质量不应被忽视或粗暴过滤,而可转化为有价值的条件监督信号;QualiTeacher为在无真值条件下利用不完美监督提供了新范式。 Abstract: Real-world image restoration (RWIR) is a highly challenging task due to the absence of clean ground-truth images. Many recent methods resort to pseudo-label (PL) supervision, often within a Mean-Teacher (MT) framework. However, these methods face a critical paradox: unconditionally trusting the often imperfect, low-quality PLs forces the student model to learn undesirable artifacts, while discarding them severely limits data diversity and impairs model generalization. In this paper, we propose QualiTeacher, a novel framework that transforms pseudo-label quality from a noisy liability into a conditional supervisory signal. Instead of filtering, QualiTeacher explicitly conditions the student model on the quality of the PLs, estimated by an ensemble of complementary non-reference image quality assessment (NR-IQA) models spanning low-level distortion and semantic-level assessment. This strategy teaches the student network to learn a quality-graded restoration manifold, enabling it to understand what constitutes different quality levels. Consequently, it can not only avoid mimicking artifacts from low-quality labels but also extrapolate to generate results of higher quality than the teacher itself. To ensure the robustness and accuracy of this quality-driven learning, we further enhance the process with a multi-augmentation scheme to diversify the PL quality spectrum, a score-based preference optimization strategy inspired by Direct Preference Optimization (DPO) to enforce a monotonically ordered quality separation, and a cropped consistency loss to prevent adversarial over-optimization (reward hacking) of the IQA models. Experiments on standard RWIR benchmarks demonstrate that QualiTeacher can serve as a plug-and-play strategy to improve the quality of the existing pseudo-labeling framework, establishing a new paradigm for learning from imperfect supervision. Code will be released.[293] Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout
Jun Yu,Naixiang Zheng,Guoyuan Wang,Yunxiang Zhang,Lingsi Zhu,Jiaen Liang,Wei Huang,Shengping Liu
Main category: cs.CV
TL;DR: 本文提出了一种用于野外情感识别的多模态动态融合框架,采用双分支Transformer与安全交叉注意力、模态丢弃策略,并结合焦点损失和滑动窗口软投票,有效应对遮挡、模态缺失和类别不平衡问题。
Details
Motivation: 现实环境中情感识别面临部分遮挡、模态缺失和严重类别不平衡等挑战,尤其在ABAW Expression挑战中亟需鲁棒的多模态方法。 Method: 提出双分支Transformer架构,引入安全交叉注意力机制和模态丢弃策略;采用焦点损失缓解Aff-Wild2长尾分布;结合滑动窗口软投票策略建模动态情感变化并抑制帧级抖动。 Result: 在Aff-Wild2验证集上达到60.79%准确率和0.5029 F1-score,验证了对模态缺失和时空依赖建模的有效性。 Conclusion: 所提框架显著提升了野外环境下多模态情感识别的鲁棒性与泛化能力,尤其适用于视觉信息不可靠的场景。 Abstract: Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex spatiotemporal dependencies, achieving an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.[294] Speed3R: Sparse Feed-forward 3D Reconstruction Models
Weining Ren,Xiao Tan,Kai Han
Main category: cs.CV
TL;DR: Speed3R is an efficient 3D reconstruction model that uses a dual-branch attention mechanism to reduce computational complexity, achieving 12.4x speedup with minimal accuracy loss.
Details
Motivation: Recent feed-forward 3D reconstruction models suffer from quadratic complexity due to dense attention, limiting inference speed. Method: Speed3R introduces a dual-branch attention mechanism: a compression branch builds a coarse contextual prior, and a selection branch applies fine-grained attention only on the most informative image tokens—mimicking sparse keypoint matching in traditional Structure-from-Motion. Result: Speed3R achieves a 12.4x inference speedup on 1000-view sequences with only a minimal, controlled trade-off in geometric accuracy; validated on standard benchmarks with VGGT and π³ backbones. Conclusion: Speed3R enables high-quality, large-scale 3D scene modeling at significantly reduced computational cost. Abstract: While recent feed-forward 3D reconstruction models accelerate 3D reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes a quadratic complexity, creating a prohibitive computational bottleneck that severely limits inference speed. To resolve this, we introduce Speed3R, an end-to-end trainable model inspired by the core principle of Structure-from-Motion: that a sparse set of keypoints is sufficient for robust pose estimation. Speed3R features a dual-branch attention mechanism where a compression branch creates a coarse contextual prior to guide a selection branch, which performs fine-grained attention only on the most informative image tokens. This strategy mimics the efficiency of traditional keypoint matching, achieving a remarkable 12.4x inference speedup on 1000-view sequences, while introducing a minimal, controlled trade-off in geometric accuracy. Validated on standard benchmarks with both VGGT and $π^3$ backbones, our method delivers high-quality reconstructions at a fraction of computational cost, paving the way for efficient large-scale scene modeling.[295] ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning
Yiran Zhao,Yaoqi Ye,Xiang Liu,Michael Qizhe Shieh,Trung Bui
Main category: cs.CV
TL;DR: 本文提出ImageEdit-R1,一种基于强化学习的多智能体图像编辑框架,通过协调多个预训练视觉语言与生成智能体,实现对复杂、间接或多步用户指令的上下文感知图像编辑。
Details
Motivation: 现有图像编辑系统(尤其是闭源或专有模型)难以处理复杂、间接或多步的用户指令,缺乏细粒度和上下文感知的编辑能力。 Method: 提出ImageEdit-R1多智能体框架,将图像编辑建模为序列决策问题;各智能体分别负责意图理解、区域定位、动作选择与内容生成,由强化学习统一协调其协作。 Result: 在多个图像编辑数据集上,ImageEdit-R1持续优于单个闭源扩散模型及其他多智能体基线方法。 Conclusion: 将图像编辑视为强化学习驱动的多智能体协同决策任务,可显著提升对复杂人类指令的理解与执行能力,为智能图像编辑提供了新范式。 Abstract: With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly closed-source or proprietary models, often struggle with complex, indirect, or multi-step user instructions. These limitations hinder their ability to perform nuanced, context-aware edits that align with human intent. In this work, we propose ImageEdit-R1, a multi-agent framework for intelligent image editing that leverages reinforcement learning to coordinate high-level decision-making across a set of specialized, pretrained vision-language and generative agents. Each agent is responsible for distinct capabilities--such as understanding user intent, identifying regions of interest, selecting appropriate editing actions, and synthesizing visual content--while reinforcement learning governs their collaboration to ensure coherent and goal-directed behavior. Unlike existing approaches that rely on monolithic models or hand-crafted pipelines, our method treats image editing as a sequential decision-making problem, enabling dynamic and context-aware editing strategies. Experimental results demonstrate that ImageEdit-R1 consistently outperforms both individual closed-source diffusion models and alternative multi-agent framework baselines across multiple image editing datasets.[296] Enhancing Cross-View UAV Geolocalization via LVLM-Driven Relational Modeling
Bowen Liu,Pengyue Jia,Wanyu Wang,Derong Xu,Jiawei Cheng,Jiancheng Dong,Xiao Han,Zimo Zhao,Chao Zhang,Bowen Yu,Fangyu Hong,Xiangyu Zhao
Main category: cs.CV
TL;DR: 本文提出了一种新型的插件式排序架构,利用大视觉语言模型(LVLM)进行无人机与卫星图像之间的联合关系建模,并设计了关系感知的软标签损失函数,显著提升了跨视角无人机地理定位的检索精度。
Details
Motivation: 现有方法独立提取多视角特征并依赖简单启发式计算相似度,未能显式建模不同视角间的本质交互关系。 Method: 提出基于大视觉语言模型(LVLM)的插件式排名架构,进行联合关系建模;设计关系感知的软标签损失函数,提供细粒度监督以增强判别力和训练稳定性。 Result: 在多个基线模型和标准基准上验证,所提方法显著提升检索精度,尤其在高难度条件下仍保持优越性能。 Conclusion: 该方法通过显式建模跨视角关系和优化损失函数,有效解决了无人机-卫星图像匹配中的关键挑战,为跨视角地理定位提供了新范式。 Abstract: The primary objective of cross-view UAV geolocalization is to identify the exact spatial coordinates of drone-captured imagery by aligning it with extensive, geo-referenced satellite databases. Current approaches typically extract features independently from each perspective and rely on basic heuristics to compute similarity, thereby failing to explicitly capture the essential interactions between different views. To address this limitation, we introduce a novel, plug-and-play ranking architecture designed to explicitly perform joint relational modeling for improved UAV-to-satellite image matching. By harnessing the capabilities of a Large Vision-Language Model (LVLM), our framework effectively learns the deep visual-semantic correlations linking UAV and satellite imagery. Furthermore, we present a novel relational-aware loss function to optimize the training phase. By employing soft labels, this loss provides fine-grained supervision that avoids overly penalizing near-positive matches, ultimately boosting both the model's discriminative power and training stability. Comprehensive evaluations across various baseline architectures and standard benchmarks reveal that the proposed method substantially boosts the retrieval accuracy of existing models, yielding superior performance even under highly demanding conditions.[297] Evaluating Generative Models via One-Dimensional Code Distributions
Zexi Jia,Pengcheng Luo,Yijia Zhong,Jinchao Zhang,Jie Zhou
Main category: cs.CV
TL;DR: 本文提出基于离散视觉词元(visual tokens)的生成模型评估新范式,包括无训练的Codebook Histogram Distance(CHD)和无参考的Code Mixture Model Score(CMMS),并在新构建的大规模VisForm基准上验证其优于传统特征分布指标(如FID)的人类感知一致性。
Details
Motivation: 现有生成模型评估方法(如FID)依赖于对表观变化不敏感的连续识别特征,因而丢失影响人类感知质量的关键线索;需在能同时编码语义与感知信息的离散词元空间中建立更符合人眼判断的评估方式。 Method: 提出两种新指标:CHD——基于词元码本直方图的无训练分布距离;CMMS——通过合成词元序列退化学习的无参考质量评分;并构建包含210K图像、62种视觉形态、12种生成模型及专家标注的VisForm基准用于压力测试。 Result: 在AGIQA、HPDv2/3和VisForm三个基准上,CHD与CMMS均达到与人类评分最高相关性的SOTA性能,显著优于FID等传统指标。 Conclusion: 离散视觉词元空间是更适配人类感知质量评估的度量域,所提CHD和CMMS为生成模型评价提供了更可靠、无需训练且无需参考图像的新工具,并推动可复现、可扩展的评估基准建设。 Abstract: Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of \emph{discrete} visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce \emph{Codebook Histogram Distance} (CHD), a training-free distribution metric in token space, and \emph{Code Mixture Model Score} (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose \emph{VisForm}, a benchmark of 210K images spanning 62 visual forms and 12 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments, and we will release all code and datasets to facilitate future research.[298] Synthetic Defect Image Generation for Power Line Insulator Inspection Using Multimodal Large Language Models
Xuesong Wang,Caisheng Wang
Main category: cs.CV
TL;DR: 本文提出了一种无需训练的多模态大语言模型(MLLM)图像生成方法,用于在缺陷样本稀缺的情况下合成高质量、高保真度的缺陷图像,显著提升陶瓷绝缘子缺陷分类性能。
Details
Motivation: 电力公司依赖无人机影像进行设备巡检,但缺陷样本稀少、真实数据集规模小或私有,导致缺陷分类器训练困难。 Method: 利用现成的多模态大语言模型(MLLM)作为零样本图像生成器,结合双参考图像与文本提示生成缺陷图像;通过轻量人工验证与提示优化提升标签保真度;再基于真实训练集类中心嵌入距离筛选合成图像。 Result: 在仅104张真实训练图像的低数据场景下,加入嵌入筛选的合成图像后,测试F1分数从0.615提升至0.739(相对提升20%),相当于4–5倍数据效率增益,且在更强骨干网络和冻结特征线性探测设置下仍有效。 Conclusion: 该方法为缺陷识别提供了一条实用、低门槛的增强路径,尤其适用于难以获取额外真实缺陷样本的场景。 Abstract: Utility companies increasingly rely on drone imagery for post-event and routine inspection, but training accurate defect-type classifiers remains difficult because defect examples are rare and inspection datasets are often limited or proprietary. We address this data-scarcity setting by using an off-the-shelf multimodal large language model (MLLM) as a training-free image generator to synthesize defect images from visual references and text prompts. Our pipeline increases diversity via dual-reference conditioning, improves label fidelity with lightweight human verification and prompt refinement, and filters the resulting synthetic pool using an embedding-based selection rule based on distances to class centroids computed from the real training split. We evaluate on ceramic insulator defect-type classification (shell vs. glaze) using a public dataset with a realistic low training-data regime (104 real training images; 152 validation; 308 test). Augmenting the 10% real training set with embedding-selected synthetic images improves test F1 score (harmonic mean of precision and recall) from 0.615 to 0.739 (20% relative), corresponding to an estimated 4--5x data-efficiency gain, and the gains persist with stronger backbone models and frozen-feature linear-probe baselines. These results suggest a practical, low-barrier path for improving defect recognition when collecting additional real defects is slow or infeasible.[299] TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery
Yanan Wu,Yuhan Yan,Tailai Chen,Zhixiang Chi,ZiZhang Wu,Yi Jin,Yang Wang,Zhenbo Li
Main category: cs.CV
TL;DR: 本文提出了一种面向在线流数据的测试时自适应框架TALON,用于解决在线类别发现(OCD)中固定特征提取器和哈希量化导致的知识僵化与信息损失问题,通过语义感知原型更新、稳定编码器测试时更新及边缘感知logit校准,显著提升新类识别精度并缓解类别爆炸。
Details
Motivation: 现有OCD方法冻结离线训练的特征提取器并依赖哈希量化生成二进制原型,忽视了在线数据的学习潜力,且量化造成信息损失、表征能力下降和类内方差增大,易引发类别爆炸。 Method: 提出测试时自适应框架TALON,包含:1)语义感知原型动态更新;2)稳定的测试时编码器参数更新;3)离线阶段的边缘感知logit校准以扩大类间间隔、增强类内紧凑性。 Result: 在标准OCD基准上显著超越现有哈希类SOTA方法,大幅提升新类准确率,并有效缓解类别爆炸现象。 Conclusion: 测试时持续学习可有效拓展模型知识边界,结合原型与编码器协同更新及嵌入空间预保留策略,是解决在线类别发现难题的关键路径。 Abstract: On-the-fly category discovery (OCD) aims to recognize known categories while simultaneously discovering novel ones from an unlabeled online stream, using a model trained only on labeled data. Existing approaches freeze the feature extractor trained offline and employ a hash-based framework that quantizes features into binary codes as class prototypes. However, discovering novel categories with a fixed knowledge base is counterintuitive, as the learning potential of incoming data is entirely neglected. In addition, feature quantization introduces information loss, diminishes representational expressiveness, and amplifies intra-class variance. It often results in category explosion, where a single class is fragmented into multiple pseudo-classes. To overcome these limitations, we propose a test-time adaptation framework that enables learning through discovery. It incorporates two complementary strategies: a semantic-aware prototype update and a stable test-time encoder update. The former dynamically refines class prototypes to enhance classification, whereas the latter integrates new information directly into the parameter space. Together, these components allow the model to continuously expand its knowledge base with newly encountered samples. Furthermore, we introduce a margin-aware logit calibration in the offline stage to enlarge inter-class margins and improve intra-class compactness, thereby reserving embedding space for future class discovery. Experiments on standard OCD benchmarks demonstrate that our method substantially outperforms existing hash-based state-of-the-art approaches, yielding notable improvements in novel-class accuracy and effectively mitigating category explosion. The code is publicly available at \textcolor{blue}{https://github.com/ynanwu/TALON}.[300] From Reactive to Map-Based AI: Tuned Local LLMs for Semantic Zone Inference in Object-Goal Navigation
Yudai Noda,Kanji Tanaka
Main category: cs.CV
TL;DR: 本文提出了一种将大语言模型(LLM)与混合拓扑-栅格地图系统结合的“基于地图的AI”方法,用于提升物体目标导航(ObjectNav)性能,通过语义区域推理与TSP优化实现系统性探索,显著优于传统前沿探索和反应式LLM基线。
Details
Motivation: 现有基于LLM的ObjectNav代理多采用缺乏显式空间记忆的“反应式”范式,导致重复探索和短视行为。 Method: 提出Map-Based AI框架:利用LoRA微调的Llama-2模型从口语化物体观测中推断语义区域类别及目标存在概率;将语义信息融合进拓扑图,并通过TSP优化指导系统性探索。 Result: 在AI2-THOR仿真环境中,该方法在Success Rate(SR)和Success weighted by Path Length(SPL)上显著优于前沿探索和反应式LLM基线。 Conclusion: 融合语义推理与显式空间记忆的地图驱动范式,能有效克服反应式LLM代理在ObjectNav中的探索低效问题,是迈向更鲁棒具身智能的重要一步。 Abstract: Object-Goal Navigation (ObjectNav) requires an agent to find and navigate to a target object category in unknown environments. While recent Large Language Model (LLM)-based agents exhibit zero-shot reasoning, they often rely on a "reactive" paradigm that lacks explicit spatial memory, leading to redundant exploration and myopic behaviors. To address these limitations, we propose a transition from reactive AI to "Map-Based AI" by integrating LLM-based semantic inference with a hybrid topological-grid mapping system. Our framework employs a fine-tuned Llama-2 model via Low-Rank Adaptation (LoRA) to infer semantic zone categories and target existence probabilities from verbalized object observations. In this study, a "zone" is defined as a functional area described by the set of observed objects, providing crucial semantic co-occurrence cues for finding the target. This semantic information is integrated into a topological graph, enabling the agent to prioritize high-probability areas and perform systematic exploration via Traveling Salesman Problem (TSP) optimization. Evaluations in the AI2-THOR simulator demonstrate that our approach significantly outperforms traditional frontier exploration and reactive LLM baselines, achieving a superior Success Rate (SR) and Success weighted by Path Length (SPL).[301] DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation
Zhenyu Hu,Qing Wang,Te Cao,Luo Liao,Longfei Lu,Liqun Liu,Shuang Li,Hang Chen,Mengge Xue,Yuan Chen,Chao Deng,Peng Shu,Huan Yu,Jie Jiang
Main category: cs.CV
TL;DR: 本文提出DSH-Bench,一个面向主体驱动文本到图像生成模型的综合性评估基准,通过分层采样、细粒度难度与提示场景分类、新指标SICS及诊断洞察四大创新,解决现有评估方法在多样性、粒度和指导性上的不足。
Details
Motivation: 现有主体驱动T2I模型评估基准存在主体图像多样性不足、评估粒度粗、缺乏可操作诊断指导三大缺陷。 Method: 构建DSH-Bench:1)基于58类细粒度主体的分层分类抽样;2)联合划分主体难度与提示场景以实现细粒度能力评估;3)提出Subject Identity Consistency Score(SICS)作为高相关性新指标;4)从实证结果中提炼系统性诊断洞见。 Result: SICS指标与人工评估相关性比现有指标高9.4%;对19个主流模型的大规模评测揭示了此前被掩盖的模型局限性。 Conclusion: DSH-Bench为T2I模型提供了更全面、精细且具指导意义的评估框架,明确了未来模型训练范式与数据构建策略的优化方向。 Abstract: Significant progress has been achieved in subject-driven text-to-image (T2I) generation, which aims to synthesize new images depicting target subjects according to user instructions. However, evaluating these models remains a significant challenge. Existing benchmarks exhibit critical limitations: 1) insufficient diversity and comprehensiveness in subject images, 2) inadequate granularity in assessing model performance across different subject difficulty levels and prompt scenarios, and 3) a profound lack of actionable insights and diagnostic guidance for subsequent model refinement. To address these limitations, we propose DSH-Bench, a comprehensive benchmark that enables systematic multi-perspective analysis of subject-driven T2I models through four principal innovations: 1) a hierarchical taxonomy sampling mechanism ensuring comprehensive subject representation across 58 fine-grained categories, 2) an innovative classification scheme categorizing both subject difficulty level and prompt scenario for granular capability assessment, 3) a novel Subject Identity Consistency Score (SICS) metric demonstrating a 9.4\% higher correlation with human evaluation compared to existing measures in quantifying subject preservation, and 4) a comprehensive set of diagnostic insights derived from the benchmark, offering critical guidance for optimizing future model training paradigms and data construction strategies. Through an extensive empirical evaluation of 19 leading models, DSH-Bench uncovers previously obscured limitations in current approaches, establishing concrete directions for future research and development.[302] TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization
Bryce Grant,Aryeh Rothenberg,Atri Banerjee,Peng Wang
Main category: cs.CV
TL;DR: TrianguLang是一种无需相机标定、基于前馈推理的3D语言定位框架,通过几何感知语义注意力(GASA)提升跨视角匹配的几何一致性,在多个基准上达到SOTA性能,支持实时交互应用。
Details
Motivation: 现有3D语言定位方法在精度与几何一致性(依赖场景优化)和推理效率(前馈)之间存在权衡,且多需相机标定或真实位姿监督。 Method: 提出TrianguLang框架,核心是Geometry-Aware Semantic Attention (GASA),利用预测几何信息对跨视角特征匹配进行门控,抑制几何不一致但语义合理的错误匹配;全程无需相机参数或真实位姿监督,纯前馈推理。 Result: 在ScanNet++、uCO3D等5个基准上实现SOTA的前馈文本引导分割与定位效果;将用户交互从O(N)次点击降至单次文本查询;单帧1008x1008处理耗时约57ms(~18 FPS)。 Conclusion: TrianguLang证明了无需优化、无需相机标定的高效前馈3D语言定位可行,兼顾几何一致性与实时性,适用于机器人与AR等交互式应用场景。 Abstract: Localizing objects and parts from natural language in 3D space is essential for robotics, AR, and embodied AI, yet existing methods face a trade-off between the accuracy and geometric consistency of per-scene optimization and the efficiency of feed-forward inference. We present TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. Unlike prior methods that treat views independently, we introduce Geometry-Aware Semantic Attention (GASA), which utilizes predicted geometry to gate cross-view feature correspondence, suppressing semantically plausible but geometrically inconsistent matches without requiring ground-truth poses. Validated on five benchmarks including ScanNet++ and uCO3D, TrianguLang achieves state-of-the-art feed-forward text-guided segmentation and localization, reducing user effort from $O(N)$ clicks to a single text query. The model processes each frame at 1008x1008 resolution in $\sim$57ms ($\sim$18 FPS) without optimization, enabling practical deployment for interactive robotics and AR applications. Code and checkpoints are available at https://cwru-aism.github.io/triangulang/.[303] Adaptive MLP Pruning for Large Vision Transformers
Chengchao Shen
Main category: cs.CV
TL;DR: 本文提出了一种自适应MLP剪枝方法(AMP),通过引入无标签信息熵准则改进神经元重要性评估,并采用二分搜索自适应剪枝,显著减少大视觉Transformer的参数量和计算量,几乎无性能损失。
Details
Motivation: 大型视觉Transformer参数量庞大,导致计算和内存开销过高;其中MLP模块占参数主体,亟需高效剪枝方法。 Method: 1)提出基于信息熵的无标签重要性评估替代传统one-hot交叉熵;2)按重要性排序MLP隐藏神经元,并用二分搜索自适应剪枝,避免预设压缩比。 Result: 在CLIP、DINOv2等SOTA模型上实现约40%参数量与FLOPs降低,近乎无损;未微调情况下显著优于其他剪枝方法。 Conclusion: AMP是一种高效、自适应、无需微调的大视觉Transformer剪枝方法,兼顾压缩率与精度,具备实用价值。 Abstract: Large vision transformers present impressive scalability, as their performance can be well improved with increased model capacity. Nevertheless, their cumbersome parameters results in exorbitant computational and memory demands. By analyzing prevalent transformer structures, we find that multilayer perceptron (MLP) modules constitute the largest share of the model's parameters. In this paper, we propose an Adaptive MLP Pruning (AMP) method to substantially reduce the parameters of large vision transformers without obvious performance degradation. First, we adopt Taylor based method to evaluate neuron importance of MLP. However, the importance computation using one-hot cross entropy loss ignores the potential predictions on other categories, thus degrading the quality of the evaluated importance scores. To address this issue, we introduce label-free information entropy criterion to fully model the predictions of the original model for more accurate importance evaluation. Second, we rank the hidden neurons of MLP by the above importance scores and apply binary search algorithm to adaptively prune the ranked neurons according to the redundancy of different MLP modules, thereby avoiding the predefined compression ratio. Experimental results on several state-of-the-art large vision transformers, including CLIP and DINOv2, demonstrate that our method achieves roughly 40\% parameter and FLOPs reduction in a near lossless manner. Moreover, when the models are not finetuned after pruning, our method outperforms other pruning methods by significantly large margin. The source code and trained weights are available at https://github.com/visresearch/AMP.[304] SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving
Zihan You,Hongwei Liu,Chenxu Dang,Zhe Wang,Sining Ang,Aoqi Wang,Yan Wang
Main category: cs.CV
TL;DR: 本文提出SAMoE-VLA框架,通过基于鸟瞰图(BEV)场景表征的专家选择机制和条件跨模态因果注意力机制,提升视觉-语言-动作(VLA)模型在自动驾驶中的稳定性、安全性与决策一致性。
Details
Motivation: 现有基于token级MoE机制的VLA模型在自动驾驶中表现不稳定且安全性下降,因其token级专家分工与场景级驾驶决策存在错配。 Method: 提出SAMoE-VLA:1)以BEV特征为路由信号实现场景自适应MoE;2)引入条件跨模态因果注意力,统一建模世界状态、语言意图与动作历史。 Result: 在nuScenes开环规划和LangAuto闭环基准上达到SOTA,性能优于现有VLA及世界模型方法,且参数更少。 Conclusion: 场景级路由与跨模态因果建模是提升VLA模型在自动驾驶中鲁棒性与安全性的关键路径。 Abstract: Recent advances in Vision-Language-Action (VLA) models have shown promising capabilities in autonomous driving by leveraging the understanding and reasoning strengths of Large Language Models(LLMs).However, our empirical analysis reveals that directly applying existing token-level MoE mechanisms--which are inherited from LLM architectures--to VLA models results in unstable performance and safety degradation in autonomous driving, highlighting a misalignment between token-based expert specialization and scene-level decision-making.To address this, we propose SAMoE-VLA, a scene-adaptive Vision-Language-Action framework that conditions expert selection on structured scene representations instead of token embeddings. Our key idea is to derive the MoE routing signal from bird's-eye-view (BEV) features that encapsulates traffic scene context, enabling scenario-dependent expert weighting and merging tailored to distinct driving conditions. Furthermore, to support temporally consistent reasoning across world-knowledge, perception, language, and action, we introduce a Conditional Cross-Modal Causal Attention mechanism that integrates world state, linguistic intent, and action history into a unified causal reasoning process. Extensive experiments on the nuScenes open loop planning dataset and LangAuto closed-loop benchmark demonstrate that SAMoE-VLA achieves state-of-the-art performance, outperforming prior VLA-based and world-model-based approaches with fewer parameters.Our code will be released soon.[305] Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
Shentong Mo,Yibing Song
Main category: cs.CV
TL;DR: 本文提出FoleyFlow方法,通过掩码建模训练对齐音视频单模态编码器,并设计动态条件流模型实现语义与节奏双重一致的音频生成,显著提升协调音频生成效果。
Details
Motivation: 现有基于视频的协调音频生成方法依赖对比学习和全局视频引导,在时序节奏同步方面存在局限,难以兼顾语义与节奏一致性。 Method: 首先采用掩码建模训练对齐音视频单模态编码器(以视频指导恢复掩码音频),再构建基于速度流的动态条件流模型,利用时变视频特征逐段指导音频生成。 Result: 在标准基准上多项指标大幅超越现有方法,生成音频在语义和节奏上均与输入视频高度协调。 Conclusion: FoleyFlow通过掩码对齐与动态条件流,有效解决了音视频协同生成中语义与节奏同步难题,为协调音频生成提供了新范式。 Abstract: Coordinated audio generation based on video inputs typically requires a strict audio-visual (AV) alignment, where both semantics and rhythmics of the generated audio segments shall correspond to those in the video frames. Previous studies leverage a two-stage design where the AV encoders are firstly aligned via contrastive learning, then the encoded video representations guide the audio generation process. We observe that both contrastive learning and global video guidance are effective in aligning overall AV semantics while limiting temporally rhythmic synchronization. In this work, we propose FoleyFlow to first align unimodal AV encoders via masked modeling training, where the masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders which are separately pretrained using only unimodal data are aligned with semantic and rhythmic consistency. Then, we develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow utilizes temporally varying video features as the dynamic condition to guide corresponding audio segment generations. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment, and use this representation of video segments to guide audio generation temporally. Our audio results are evaluated on the standard benchmarks and largely surpass existing results under several metrics. The superior performance indicates that FoleyFlow is effective in generating coordinated audios that are both semantically and rhythmically coherent to various video sequences.[306] Fast Low-light Enhancement and Deblurring for 3D Dark Scenes
Feng Zhang,Jinglong Wang,Ze Li,Yanghong Zhou,Yang Chen,Lei Chen,Xiatian Zhu
Main category: cs.CV
TL;DR: FLED-GS是一种针对低光照、噪声和运动模糊图像的新型视图合成方法,通过亮度锚点引导的增强-重建交替循环,在保持几何精度的同时实现高效去噪与去模糊。
Details
Motivation: 现有体渲染方法难以处理复合退化(低光+噪声+运动模糊),而串行2D预处理会因步骤间依赖引入伪影。 Method: 提出FLED-GS框架:在3D高斯泼溅(3DGS)重建中嵌入多级亮度锚点,构建增强与重建交替迭代循环;每轮先用现成2D去模糊器锐化输入,再进行噪声感知的3DGS重建,动态估计并抑制噪声,输出清洁先验供下一级使用。 Result: 在多项指标上超越LuSh-NeRF,训练速度提升21倍,渲染速度提升11倍。 Conclusion: 将3D场景恢复重构为增强-重建交替优化过程,可有效解耦复合退化问题,兼顾效率与质量。 Abstract: Novel view synthesis from low-light, noisy, and motion-blurred imagery remains a valuable and challenging task. Current volumetric rendering methods struggle with compound degradation, and sequential 2D preprocessing introduces artifacts due to interdependencies. In this work, we introduce FLED-GS, a fast low-light enhancement and deblurring framework that reformulates 3D scene restoration as an alternating cycle of enhancement and reconstruction. Specifically, FLED-GS inserts several intermediate brightness anchors to enable progressive recovery, preventing noise blow-up from harming deblurring or geometry. Each iteration sharpens inputs with an off-the-shelf 2D deblurrer and then performs noise-aware 3DGS reconstruction that estimates and suppresses noise while producing clean priors for the next level. Experiments show FLED-GS outperforms state-of-the-art LuSh-NeRF, achieving 21$\times$ faster training and 11$\times$ faster rendering.[307] VesselFusion: Diffusion Models for Vessel Centerline Extraction from 3D CT Images
Soichi Mita,Shumpei Takezaki,Ryoma Bise
Main category: cs.CV
TL;DR: 本文提出了一种名为VesselFusion的扩散模型,用于从3D CT图像中提取血管中心线,通过粗到细的中心线表示和基于投票的聚合策略,实现了更自然、稳定的提取效果,并在公开数据集上优于传统方法。
Details
Motivation: 血管中心线提取可减少标注工作量以构建血管结构估计模型,但传统确定性模型难以捕捉复杂的人体血管结构。 Method: 提出VesselFusion扩散模型,采用粗到细的中心线表征和基于投票的聚合策略。 Result: 在公开CT图像数据集上评估显示,VesselFusion比传统方法具有更高的提取精度和更自然的结果。 Conclusion: VesselFusion作为一种基于扩散模型的新方法,在血管中心线提取任务中展现出优越性能和稳定性。 Abstract: Vessel centerline extraction from 3D CT images is an important task because it reduces annotation effort to build a model that estimates a vessel structure. It is challenging to estimate natural vessel structures since conventional approaches are deterministic models, which cannot capture a complex human structure. In this study, we propose VesselFusion, which is a diffusion model to extract the vessel centerline from 3D CT image. The proposed method uses a coarse-to-fine representation of the centerline and a voting-based aggregation for a natural and stable extraction. VesselFusion was evaluated on a publicly available CT image dataset and achieved higher extraction accuracy and a more natural result than conventional approaches.[308] MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
Hunor Laczkó,Libang Jia,Loc-Phat Truong,Diego Hernández,Sergio Escalera,Jordi Gonzalez,Meysam Madadi
Main category: cs.CV
TL;DR: 本文介绍了MV-Fashion,一个专为时尚分析设计的大规模多视角视频数据集,包含丰富的标注和配对数据,用于虚拟试穿、服装尺寸估计等任务。
Details
Motivation: 现有4D人体数据集在时尚特定研究中存在不足:合成数据缺乏真实性,真实数据缺乏细粒度标注和配对数据,难以支撑虚拟试穿(VTON)和尺寸估计等任务。 Method: 构建了MV-Fashion数据集,包含3273个多视角视频序列(7250万帧),涵盖80名受试者、每人3–10套服装;提供像素级语义标注、材料属性(如弹性)真值、3D点云,并首次引入穿戴状态与平铺目录图像的多视角同步配对数据;基于该数据集建立多个时尚任务的基线。 Result: 发布了大规模、高保真、强标注的时尚专用多视角视频数据集MV-Fashion,并在虚拟试穿、服装尺寸估计和新视角合成等任务上建立了初步基线性能。 Conclusion: MV-Fashion有效弥合了时尚AI研究中真实感与任务适配性之间的鸿沟,为后续虚拟试穿、物理驱动服装建模等方向提供了高质量基础资源。 Abstract: Existing 4D human datasets fall short for fashion-specific research, lacking either realistic garment dynamics or task-specific annotations. Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. MV-Fashion features 3,273 sequences (72.5 million frames) from 80 diverse subjects wearing 3-10 outfits each. It is designed to capture complex, real-world garment dynamics, including multiple layers and varied styling (e.g. rolled sleeves, tucked shirt). A core contribution is a rich data representation that includes pixel-level semantic annotations, ground-truth material properties like elasticity, and 3D point clouds. Crucially for VTON applications, MV-Fashion provides paired data: multi-view synchronized captures of worn garments alongside their corresponding flat, catalogue images. We leverage this dataset to establish baselines for fashion-centric tasks, including virtual try-on, clothing size estimation, and novel view synthesis. The dataset is available at https://hunorlaczko.github.io/MV-Fashion .[309] Edged USLAM: Edge-Aware Event-Based SLAM with Learning-Based Depth Priors
Şebnem Sarıözkan,Hürkan Şahin,Olaya Álvarez-Tuñón,Erdal Kayacan
Main category: cs.CV
TL;DR: 本文提出Edged USLAM,一种结合事件相机、IMU与标准相机的混合视觉惯性SLAM系统,通过边缘感知前端和轻量深度模块提升在快速运动、低照度及光照突变等挑战场景下的鲁棒性与定位精度。
Details
Motivation: 传统视觉SLAM在快速运动、低照度或光照突变下易失效;事件相机虽具高时序分辨率和高动态范围优势,但其稀疏异步输出导致特征提取与多传感器融合困难。 Method: 扩展Ultimate SLAM(USLAM),引入边缘感知前端(增强事件帧、支持鲁棒特征跟踪与非线性运动补偿)和轻量ROI深度模块(提供粗略场景深度以改善运动补偿与尺度一致性)。 Result: 在公开数据集和真实无人机飞行中验证:Edged USLAM在慢速/结构化轨迹下表现出更优稳定性与更低漂移;而事件专用方法(如PL-EVIO)或学习方法(如DEVO)更适用于剧烈运动或极端HDR场景。 Conclusion: 事件专用、学习驱动与混合方法各有互补优势;Edged USLAM是一种面向多样化空中导航任务的鲁棒解决方案。 Abstract: Conventional visual simultaneous localization and mapping (SLAM) algorithms often fail under rapid motion, low illumination, or abrupt lighting transitions due to motion blur and limited dynamic range. Event cameras mitigate these issues with high temporal resolution and high dynamic range (HDR), but their sparse, asynchronous outputs complicate feature extraction and integration with other sensors; e.g. inertial measurement units (IMUs) and standard cameras. We present Edged USLAM, a hybrid visual-inertial system that extends Ultimate SLAM (USLAM) with an edge-aware front-end and a lightweight depth module. The frontend enhances event frames for robust feature tracking and nonlinear motion compensation, while the depth module provides coarse, region-of-interest (ROI)-based scene depth to improve motion compensation and scale consistency. Evaluations across public benchmarks and real-world unmanned air vehicle (UAV) flights demonstrate that performance varies significantly by scenario. For instance, event-only methods like point-line event-based visual-inertial odometry (PL-EVIO) or learning-based pipelines such as deep event-based visual odometry (DEVO) excel in highly aggressive or extreme HDR conditions. In contrast, Edged USLAM provides superior stability and minimal drift in slow or structured trajectories, ensuring consistently accurate localization on real flights under challenging illumination. These findings highlight the complementary strengths of event-only, learning-based, and hybrid approaches, while positioning Edged USLAM as a robust solution for diverse aerial navigation tasks.[310] MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals
Junyu Shen,Zhendong She,Chenghanyu Zhang,Yuchuang Sun,Luqing Luo,Dingwei Tan,Zonghao Guo,Bo Guo,Zehua Han,Wupeng Xie,Yaxin Mu,Peng Zhang,Peipei Li,Fengxiang Wang,Yangang Sun,Maosong Sun
Main category: cs.CV
TL;DR: 本文提出了一种面向电磁(EM)领域的多模态大语言模型(MLLM)新范式,通过构建大规模EM信号-文本数据集EM-100k、建立综合基准EM-Bench、以及提出鲁棒训练框架MERLIN,系统性解决了数据稀缺、评估缺失和低信噪比下性能脆弱三大挑战。
Details
Motivation: 现有电磁领域方法偏离原生MLLM范式,采用任务专用或流水线架构,导致性能与泛化能力受限;同时面临高质量配对数据稀缺、缺乏统一评估基准、以及低信噪比下模型鲁棒性差三大瓶颈。 Method: 提出三方面方法:(1)构建EM-100k数据集(>10万EM信号-文本对);(2)设计涵盖感知到推理的多任务基准EM-Bench;(3)开发MERLIN训练框架,联合优化信号-文本对齐与低SNR鲁棒性。 Result: MERLIN在EM-Bench上达到SOTA性能,并在低SNR环境下展现出显著鲁棒性提升。 Conclusion: 本工作为电磁领域MLLM奠定了数据、评估与模型三方面基础,推动其向真正通用、鲁棒的多模态智能演进。 Abstract: The paradigm of Multimodal Large Language Models (MLLMs) offers a promising blueprint for advancing the electromagnetic (EM) domain. However, prevailing approaches often deviate from the native MLLM paradigm, instead using task-specific or pipelined architectures that lead to fundamental limitations in model performance and generalization. Fully realizing the MLLM potential in EM domain requires overcoming three main challenges: (1) Data. The scarcity of high-quality datasets with paired EM signals and descriptive text annotations used for MLLMs pre-training; (2) Benchmark. The absence of comprehensive benchmarks to systematically evaluate and compare the performance of models on EM signal-to-text tasks; (3) Model. A critical fragility in low Signal-to-Noise Ratio (SNR) environments, where critical signal features can be obscured, leading to significant performance degradation. To address these challenges, we introduce a tripartite contribution to establish a foundation for MLLMs in the EM domain. First, to overcome data scarcity, we construct and release EM-100k, a large-scale dataset comprising over 100,000 EM signal-text pairs. Second, to enable rigorous and standardized evaluation, we propose EM-Bench, the most comprehensive benchmark featuring diverse downstream tasks spanning from perception to reasoning. Finally, to tackle the core modeling challenge, we present MERLIN, a novel training framework designed not only to align low-level signal representations with high-level semantic text, but also to explicitly enhance model robustness and performance in challenging low-SNR environments. Comprehensive experiments validate our method, showing that MERLIN is state-of-the-art in the EM-Bench and exhibits remarkable robustness in low-SNR settings.[311] ALOOD: Exploiting Language Representations for LiDAR-based Out-of-Distribution Object Detection
Michael Kösel,Marcel Schreiber,Michael Ulrich,Claudius Gläser,Klaus Dietmayer
Main category: cs.CV
TL;DR: 本文提出ALOOD方法,利用视觉-语言模型(VLM)的语言表征对齐LiDAR检测器的物体特征,将OOD检测建模为零样本分类任务,提升自动驾驶中3D目标检测对未知类别的识别能力。
Details
Motivation: 现有LiDAR 3D检测器对分布外(OOD)物体(即训练中未见类别)产生过度自信的错误预测,带来严重安全隐患。 Method: 提出ALOOD框架,将LiDAR检测器提取的物体特征与视觉-语言模型(VLM)的语言表征空间对齐,从而将OOD检测转化为零样本分类问题。 Result: 在nuScenes OOD基准上取得具有竞争力的性能,验证了基于语言表征进行LiDAR OOD检测的有效性。 Conclusion: ALOOD为LiDAR 3D检测中的OOD识别提供了新范式,通过引入语言先验提升了模型对未知类别的鲁棒性和安全性。 Abstract: LiDAR-based 3D object detection plays a critical role for reliable and safe autonomous driving systems. However, existing detectors often produce overly confident predictions for objects not belonging to known categories, posing significant safety risks. This is caused by so-called out-of-distribution (OOD) objects, which were not part of the training data, resulting in incorrect predictions. To address this challenge, we propose ALOOD (Aligned LiDAR representations for Out-Of-Distribution Detection), a novel approach that incorporates language representations from a vision-language model (VLM). By aligning the object features from the object detector to the feature space of the VLM, we can treat the detection of OOD objects as a zero-shot classification task. We demonstrate competitive performance on the nuScenes OOD benchmark, establishing a novel approach to OOD object detection in LiDAR using language representations. The source code is available at https://github.com/uulm-mrm/mmood3d.[312] Fusion-Poly: A Polyhedral Framework Based on Spatial-Temporal Fusion for 3D Multi-Object Tracking
Xian Wu,Yitao Wu,Xiaoyu Li,Zijia Li,Lijun Zhao,Lining Sun
Main category: cs.CV
TL;DR: 本文提出Fusion-Poly框架,通过时空融合异步LiDAR与相机数据,提升3D多目标跟踪性能,在nuScenes上达到76.5% AMOTA,刷新SOTA。
Details
Motivation: 现有方法受限于传感器不同采样率,仅在同步时间戳进行空间融合,导致大量异步观测数据未被充分利用,影响轨迹估计的频率与鲁棒性。 Method: 提出Fusion-Poly:包含频率感知级联匹配模块(适配同步/异步帧)、频率感知轨迹估计模块(高频运动预测、微分更新与置信度驱动生命周期管理)和全状态观测对齐模块(优化图像投影误差以提升跨模态一致性)。 Result: 在nuScenes测试集上AMOTA达76.5%,为tracking-by-detection类3D MOT方法新SOTA;消融实验验证各模块有效性。 Conclusion: 异步多模态数据可显著增强3D MOT的时序建模能力;Fusion-Poly通过统一建模同步与异步观测,实现了更高频、更鲁棒的轨迹更新与状态估计。 Abstract: LiDAR-camera 3D multi-object tracking (MOT) combines rich visual semantics with accurate depth cues to improve trajectory consistency and tracking reliability. In practice, however, LiDAR and cameras operate at different sampling rates. To maintain temporal alignment, existing data pipelines usually synchronize heterogeneous sensor streams and annotate them at a reduced shared frequency, forcing most prior methods to perform spatial fusion only at synchronized timestamps through projection-based or learnable cross-sensor association. As a result, abundant asynchronous observations remain underexploited, despite their potential to support more frequent association and more robust trajectory estimation over short temporal intervals. To address this limitation, we propose Fusion-Poly, a spatial-temporal fusion framework for 3D MOT that integrates asynchronous LiDAR and camera data. Fusion-Poly associates trajectories with multi-modal observations at synchronized timestamps and with single-modal observations at asynchronous timestamps, enabling higher-frequency updates of motion and existence states. The framework contains three key components: a frequency-aware cascade matching module that adapts to synchronized and asynchronous frames according to available detection modalities; a frequency-aware trajectory estimation module that maintains trajectories through high-frequency motion prediction, differential updates, and confidence-calibrated lifecycle management; and a full-state observation alignment module that improves cross-modal consistency at synchronized timestamps by optimizing image-projection errors. On the nuScenes test set, Fusion-Poly achieves 76.5% AMOTA, establishing a new state of the art among tracking-by-detection 3D MOT methods. Extensive ablation studies further validate the effectiveness of each component. Code will be released.[313] MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data
Siarhei Sheludzko,Dhimitrios Duka,Bernt Schiele,Hilde Kuehne,Anna Kukleva
Main category: cs.CV
TL;DR: 本文提出多模态温度与边界调度(MM-TS)方法,通过动态调整对比损失中的温度参数,并结合样本局部分布自适应设定温度,同时融合最大边界框架,提升多模态对比学习性能,在多个图像-语言和视频-语言数据集上达到新SOTA。
Details
Motivation: 现有单模态对比学习中温度参数可调控正负样本拉近与推开的强度,但多模态场景缺乏类似机制;且多模态数据常呈长尾分布,需对不同密度区域的样本差异化建模。 Method: 提出多模态温度与边界调度(MM-TS):1)训练中动态调度温度以调节多模态对比损失的吸引力与排斥力;2)根据每个样本所在局部簇密度自适应分配温度(密簇用更高温度以保留语义结构);3)将温度调度嵌入最大边界框架,统一InfoNCE与max-margin两类主流目标。 Result: 在Flickr30K、MSCOCO、EPIC-KITCHENS-100和YouCook2四个图像/视频-语言数据集上验证有效,性能超越现有方法,取得新SOTA结果。 Conclusion: 动态温度与边界调度能更精细地控制多模态对比学习中的语义对齐过程,尤其适配长尾分布数据,为统一InfoNCE与max-margin范式提供了可行路径。 Abstract: Contrastive learning has become a fundamental approach in both uni-modal and multi-modal frameworks. This learning paradigm pulls positive pairs of samples closer while pushing negatives apart. In the uni-modal setting (e.g., image-based learning), previous research has shown that the strength of these forces can be controlled through the temperature parameter. In this work, we propose Multi-Modal Temperature and Margin Schedules (MM-TS), extending the concept of uni-modal temperature scheduling to multi-modal contrastive learning. Our method dynamically adjusts the temperature in the contrastive loss during training, modulating the attraction and repulsion forces in the multi-modal setting. Additionally, recognizing that standard multi-modal datasets often follow imbalanced, long-tail distributions, we adapt the temperature based on the local distribution of each training sample. Specifically, samples from dense clusters are assigned a higher temperature to better preserve their semantic structure. Furthermore, we demonstrate that temperature scheduling can be effectively integrated within a max-margin framework, thereby unifying the two predominant approaches in multi-modal contrastive learning: InfoNCE loss and max-margin objective. We evaluate our approach on four widely used image- and video-language datasets, Flickr30K, MSCOCO, EPIC-KITCHENS-100, and YouCook2, and show that our dynamic temperature and margin schedules improve performance and lead to new state-of-the-art results in the field.[314] Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors
Ishrat Jahan,Molla E Majid,M Murugappan,Muhammad E. H. Chowdhury,N. B. Prakash,Saad Bin Abul Kashem,Balamurugan Balusamy,Amith Khandakar
Main category: cs.CV
TL;DR: 本文提出两种新型多模态融合策略RGIF和RGMAF,用于提升无人机检测性能,尤其在红外与可见光等异构传感器融合中显著提升mAP和召回率。
Details
Motivation: 传统多传感器融合方法难以保持跨模态空间一致性,且受标注不一致影响,鲁棒性差,难以满足真实空域监控需求。 Method: 提出Registration-aware Guided Image Fusion (RGIF) 和 Reliability-Gated Modality-Attention Fusion (RGMAF):RGIF采用ECC仿射配准+导向滤波;RGMAF结合仿射与光流配准,并引入可靠性加权注意力机制。基于MMFW-UAV数据集(147,417帧)和YOLOv10x检测器进行验证。 Result: RGIF使视觉基线mAP@50提升2.13%至97.65%;RGMAF达到最高召回率98.64%。 Conclusion: 注册感知与可靠性自适应融合为异构多模态无人机检测提供了鲁棒、高效的框架。 Abstract: Reliable unmanned aerial vehicle (UAV) detection is critical for autonomous airspace monitoring but remains challenging when integrating sensor streams that differ substantially in resolution, perspective, and field of view. Conventional fusion methods-such as wavelet-, Laplacian-, and decision-level approaches-often fail to preserve spatial correspondence across modalities and suffer from annotation of inconsistencies, limiting their robustness in real-world settings. This study introduces two fusion strategies, Registration-aware Guided Image Fusion (RGIF) and Reliability-Gated Modality-Attention Fusion (RGMAF), designed to overcome these limitations. RGIF employs Enhanced Correlation Coefficient (ECC)-based affine registration combined with guided filtering to maintain thermal saliency while enhancing structural detail. RGMAF integrates affine and optical-flow registration with a reliability-weighted attention mechanism that adaptively balances thermal contrast and visual sharpness. Experiments were conducted on the Multi-Sensor and Multi-View Fixed-Wing (MMFW)-UAV dataset comprising 147,417 annotated air-to-air frames collected from infrared, wide-angle, and zoom sensors. Among single-modality detectors, YOLOv10x demonstrated the most stable cross-domain performance and was selected as the detection backbone for evaluating fused imagery. RGIF improved the visual baseline by 2.13% mAP@50 (achieving 97.65%), while RGMAF attained the highest recall of 98.64%. These findings show that registration-aware and reliability-adaptive fusion provides a robust framework for integrating heterogeneous modalities, substantially enhancing UAV detection performance in multimodal environments.[315] Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA
Zexi Wu,Qinghe Wang,Jing Dai,Baolu Li,Yiming Zhang,Yue Ma,Xu Jia,Hongming Xu
Main category: cs.CV
TL;DR: Video2LoRA是一种轻量、可扩展的视频生成框架,通过超网络动态生成LoRA权重,实现基于参考视频的语义可控生成,无需每条件训练,模型小于150MB,支持零样本泛化。
Details
Motivation: 现有方法在语义对齐上存在瓶颈:显式结构引导限制语义灵活性,而专用控制模型缺乏互操作性和适应性。 Method: 提出Video2LoRA框架,利用轻量超网络为不同语义输入预测个性化LoRA权重,并结合辅助矩阵构建自适应LoRA模块,嵌入冻结的扩散主干网络中。 Result: 在多种控制条件下实现一致且语义对齐的视频生成,具备强零样本泛化能力,模型体积小于150MB,无需每条件微调。 Conclusion: Video2LoRA提供了一种高效、通用、无需条件训练的语义可控视频生成新范式,兼顾灵活性、可扩展性与部署友好性。 Abstract: Achieving semantic alignment across diverse video generation conditions remains a significant challenge. Methods that rely on explicit structural guidance often enforce rigid spatial constraints that limit semantic flexibility, whereas models tailored for individual control types lack interoperability and adaptability. These design bottlenecks hinder progress toward flexible and efficient semantic video generation. To address this, we propose Video2LoRA, a scalable and generalizable framework for semantic-controlled video generation that conditions on a reference video. Video2LoRA employs a lightweight hypernetwork to predict personalized LoRA weights for each semantic input, which are combined with auxiliary matrices to form adaptive LoRA modules integrated into a frozen diffusion backbone. This design enables the model to generate videos consistent with the reference semantics while preserving key style and content variations, eliminating the need for any per-condition training. Notably, the final model weights less than 150MB, making it highly efficient for storage and deployment. Video2LoRA achieves coherent, semantically aligned generation across diverse conditions and exhibits strong zero-shot generalization to unseen semantics.[316] SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
Ruixiang Zhao,Zhihao Xu,Bangxiang Lan,Zijie Xin,Jingyu Liu,Xirong Li
Main category: cs.CV
TL;DR: 本文提出SAVE方法,通过引入专门的语音分支和软ALBEF早期视听对齐机制,改进视频-文本检索中的音频利用,显著提升多个基准测试的表现。
Details
Motivation: 现有基于CLIP的视频-文本检索方法忽略音频信息,而现有引入音频的方法在语音内容表征和视听融合方面效果不佳。 Method: 提出SAVE方法,包含专用语音分支以更好嵌入语音内容,并采用软ALBEF实现早期视听对齐以促进融合。 Result: 在MSRVTT-9k、MSRVTT-7k、VATEX、Charades和LSMDC五个基准上,SAVE在SumR指标上分别超越SOTA方法AVIGATE达+4.1%、+1.9%、+2.5%、+9.8%和+2.1%。 Conclusion: SAVE有效解决了语音表征与视听融合问题,显著提升了视频-文本检索性能,验证了充分利用音频(尤其是语音)信息的重要性。 Abstract: For video-text retrieval, the use of CLIP has been a de facto choice. Since CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the sound track of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: ineffective representation of speech content and suboptimal vision-audio fusion. To address these issues jointly, we propose SAVE, a Speech Aware Video rEpresentation learning method. SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the SOTA, outperforming AVIGATE by +4.1% on MSRVTT-9k, +1.9% on MSRVTT-7k, +2.5% on VATEX, +9.8% on Charades, and +2.1% on LSMDC, in light of the SumR metric.[317] SRNeRV: A Scale-wise Recursive Framework for Neural Video Representation
Jia Wang,Jun Zhu,Xinfeng Zhang
Main category: cs.CV
TL;DR: 本文提出SRNeRV,一种基于尺度递归和混合参数共享的隐式神经表示(INR)视频压缩框架,显著降低参数量并提升率失真性能。
Details
Motivation: 现有多尺度INR生成器因堆叠独立模块导致严重参数冗余;受生成过程中尺度自相似性的启发,寻求更高效的参数共享机制。 Method: 提出尺度递归框架SRNeRV,将处理模块解耦为尺度相关空间混合模块和尺度无关通道混合模块,并递归复用后者以实现高效参数共享。 Result: 在INR友好场景下显著提升率失真性能,模型大小大幅减小,同时保持对尺度特异性空间模式的学习能力。 Conclusion: 尺度递归与混合共享机制能有效放大INR范式的核心优势,为高效视频INR建模提供了新思路。 Abstract: Implicit Neural Representations (INRs) have emerged as a promising paradigm for video representation and compression. However, existing multi-scale INR generators often suffer from significant parameter redundancy by stacking independent processing blocks for each scale. Inspired by the principle of scale self-similarity in the generation process, we propose SRNeRV, a novel scale-wise recursive framework that replaces this stacked design with a parameter-efficient shared architecture. The core of our approach is a hybrid sharing scheme derived from decoupling the processing block into a scale-specific spatial mixing module and a scale-invariant channel mixing module. We recursively apply the same shared channel mixing module, which contains the majority of the parameters, across all scales, significantly reducing the model size while preserving the crucial capacity to learn scale-specific spatial patterns. Extensive experiments demonstrate that SRNeRV achieves a significant rate-distortion performance boost, especially in INR-friendly scenarios, validating that our sharing scheme successfully amplifies the core strengths of the INR paradigm.[318] GarmentPainter: Efficient 3D Garment Texture Synthesis with Character-Guided Diffusion Model
Jinbo Wu,Xiaobo Gao,Xing Liu,Chen Zhao,Jialun Liu
Main category: cs.CV
TL;DR: 本文提出GarmentPainter,一种在UV空间中生成高质量、3D感知服装纹理的简单高效框架,利用UV位置图提供3D结构引导,并通过类型选择模块实现无需对齐的细粒度纹理生成,显著提升视觉保真度、3D一致性和计算效率。
Details
Motivation: 现有方法在生成高保真、3D一致的服装纹理时面临挑战:2D扩散模型缺乏3D一致性、多步优化开销大、或依赖严格的2D图像与3D网格对齐,限制了灵活性和可扩展性。 Method: 提出GarmentPainter框架,在UV空间进行纹理合成;使用UV位置图作为3D结构引导以保证表面一致性;引入类型选择模块,基于角色参考图像实现无需空间对齐的部件级纹理控制;将所有引导信号以空间对齐方式注入标准扩散模型输入,不修改UNet结构。 Result: 在视觉质量、3D一致性与计算效率方面达到SOTA,定性与定量实验均优于现有方法。 Conclusion: GarmentPainter是一种轻量、通用且高效的3D服装纹理生成方法,解决了对齐依赖与3D不一致问题,为虚拟试衣与数字人应用提供了实用新方案。 Abstract: Generating high-fidelity, 3D-consistent garment textures remains a challenging problem due to the inherent complexities of garment structures and the stringent requirement for detailed, globally consistent texture synthesis. Existing approaches either rely on 2D-based diffusion models, which inherently struggle with 3D consistency, require expensive multi-step optimization or depend on strict spatial alignment between 2D reference images and 3D meshes, which limits their flexibility and scalability. In this work, we introduce GarmentPainter, a simple yet efficient framework for synthesizing high-quality, 3D-aware garment textures in UV space. Our method leverages a UV position map as the 3D structural guidance, ensuring texture consistency across the garment surface during texture generation. To enhance control and adaptability, we introduce a type selection module, enabling fine-grained texture generation for specific garment components based on a character reference image, without requiring alignment between the reference image and the 3D mesh. GarmentPainter efficiently integrates all guidance signals into the input of a diffusion model in a spatially aligned manner, without modifying the underlying UNet architecture. Extensive experiments demonstrate that GarmentPainter achieves state-of-the-art performance in terms of visual fidelity, 3D consistency, and computational efficiency, outperforming existing methods in both qualitative and quantitative evaluations.[319] Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema
Pablo Jimenez-Lizcano,Sergio Romero-Tapiador,Ruben Tolosana,Aythami Morales,Guillermo González de Rivera,Ruben Vera-Rodriguez,Julian Fierrez
Main category: cs.CV
TL;DR: 本研究利用超广角眼底成像(UWF)和前沿深度学习方法,在图像质量评估、可转诊糖尿病视网膜病变(RDR)识别及糖尿病黄斑水肿(DME)识别三个临床任务上进行系统性基准测试,并融合空间域与频域特征以提升鲁棒性与可解释性。
Details
Motivation: 超广角眼底成像(UWF)相比传统彩色眼底照相(CFP)具有更广视野,有望提升糖尿病视网膜病变(DR)和糖尿病黄斑水肿(DME)的检出能力,但其在深度学习分析中的潜力尚未被充分挖掘。 Method: 基于MICCAI 2024 UWF4DR挑战赛公开数据集,对CNN、Vision Transformer(ViT)及基础模型在RGB空间域和频域进行基准测试;引入特征级融合策略,并采用Grad-CAM进行模型决策可视化分析。 Result: 所提方法在全部三项任务上均取得稳定优异性能;ViT与基础模型表现具竞争力;特征级融合与频域表征显著提升UWF分析效果。 Conclusion: UWF结合先进深度学习架构(尤其是ViT与基础模型)、频域建模及特征融合,为DR/DME自动化筛查提供了高鲁棒性、高可解释性的新范式。 Abstract: Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.[320] SiMO: Single-Modality-Operable Multimodal Collaborative Perception
Jiageng Wen,Shengjie Zhao,Bing Li,Jiafeng Huang,Kenan Ye,Hao Deng
Main category: cs.CV
TL;DR: 本文提出SiMO框架,通过LAMMA融合机制和Pretrain-Align-Fuse-RD训练策略,解决多模态协同感知中单模态失效导致语义失配与模态竞争问题,实现各模态独立且鲁棒的协同感知。
Details
Motivation: 现有协同感知方法依赖多模态(如LiDAR+相机)融合,但关键传感器(如LiDAR)失效时易崩溃,主因是特征融合引发单模态特征与下游模块间的语义不匹配。 Method: 提出Single-Modality-Operable Multimodal Collaborative Perception(SiMO)框架,含Length-Adaptive Multi-Modal Fusion(LAMMA)机制以自适应处理模态缺失,并采用Pretrain-Align-Fuse-RD训练策略缓解模态间竞争,保障各模态分支独立性。 Result: 实验表明SiMO能有效对齐多模态特征并保留模态特异性特征,在各单模态输入下均保持最优性能。 Conclusion: SiMO首次在协同感知领域实现了单模态可运行的多模态系统,显著提升了系统鲁棒性与泛化能力,为实际部署中的传感器故障场景提供了可靠解决方案。 Abstract: Collaborative perception integrates multi-agent perspectives to enhance the sensing range and overcome occlusion issues. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure--especially when a key sensor like LiDAR is unavailable. The root cause is that feature fusion leads to semantic mismatches between single-modality features and the downstream modules. This paper addresses this challenge for the first time in the field of collaborative perception, introducing Single-Modality-Operable Multimodal Collaborative Perception (SiMO). By adopting the proposed Length-Adaptive Multi-Modal Fusion (LAMMA), SiMO can adaptively handle remaining modal features during modal failures while maintaining consistency of the semantic space. Additionally, leveraging the innovative "Pretrain-Align-Fuse-RD" training strategy, SiMO addresses the issue of modality competition--generally overlooked by existing methods--ensuring the independence of each individual modality branch. Experiments demonstrate that SiMO effectively aligns multimodal features while simultaneously preserving modality-specific features, enabling it to maintain optimal performance across all individual modalities. The implementation details can be found in https://github.com/dempsey-wen/SiMO.[321] DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
Zhuolin He,Jing Li,Guanghao Li,Xiaolei Chen,Jiacheng Tang,Siyang Zhang,Zhounan Jin,Feipeng Cai,Bin Li,Jian Pu,Jia Cai,Xiangyang Xue
Main category: cs.CV
TL;DR: 本文提出DynamicVGGT,一种统一的前馈式4D动态场景重建框架,通过联合预测当前与未来点云、引入运动感知时序注意力模块(MTA)及动态3D高斯溅射头,在自动驾驶数据集上显著提升动态重建精度。
Details
Motivation: 现有前馈式3D模型在静态重建中表现良好,但难以建模动态运动,而自动驾驶中动态场景重建面临时间变化大、物体运动频繁和场景动态复杂等挑战。 Method: 提出DynamicVGGT框架:1)在共享参考坐标系下联合预测当前与未来点云以隐式学习动态点表征;2)设计Motion-aware Temporal Attention(MTA)模块建模运动连续性;3)构建Dynamic 3D Gaussian Splatting Head,利用可学习运动token和场景流监督预测高斯速度,并通过连续3D高斯优化精化动态几何。 Result: 在自动驾驶数据集上实验表明,DynamicVGGT在重建精度上显著优于现有方法,实现了鲁棒的前馈式4D动态场景重建。 Conclusion: DynamicVGGT成功将静态3D感知模型VGGT扩展至动态4D重建,通过显式与隐式结合的方式建模点运动,在保持前馈效率的同时提升了动态一致性与几何精度。 Abstract: Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that learns motion continuity. Furthermore, we design a Dynamic 3D Gaussian Splatting Head that explicitly models point motion by predicting Gaussian velocities using learnable motion tokens under scene flow supervision. It refines dynamic geometry through continuous 3D Gaussian optimization. Extensive experiments on autonomous driving datasets demonstrate that DynamicVGGT significantly outperforms existing methods in reconstruction accuracy, achieving robust feed-forward 4D dynamic scene reconstruction under complex driving scenarios.[322] WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
Lei Wang,Yang Cheng,Senmao Li,Ge Wu,Yaxing Wang,Jian Yang
Main category: cs.CV
TL;DR: 本文提出LoRaD方法,通过低秩旋转矩阵建模蒸馏过程中权重方向的变化,显著提升单步扩散模型的性能与效率,并在多个基准和下游任务中取得SOTA结果。
Details
Motivation: 扩散模型推理速度慢,现有蒸馏方法缺乏对权重变化机制(尤其是方向 vs. 模长)的深入理解。 Method: 分析教师-学生U-Net/DiT权重差异,发现方向变化远大于模长变化;据此提出参数高效适配器LoRaD,用可学习低秩旋转矩阵建模方向变化;将其集成至VSD框架,形成WaDi蒸馏框架。 Result: WaDi在COCO 2014/2017上达到SOTA FID,仅需约10%可训练参数;蒸馏后的单步模型在可控生成、关系反转、高分辨率合成等下游任务中表现优异。 Conclusion: 权重方向是扩散模型知识蒸馏的关键,LoRaD/WaDi为高效、通用的一步式扩散建模提供了新范式。 Abstract: Despite the impressive performance of diffusion models such as Stable Diffusion (SD) in image generation, their slow inference limits practical deployment. Recent works accelerate inference by distilling multi-step diffusion into one-step generators. To better understand the distillation mechanism, we analyze U-Net/DiT weight changes between one-step students and their multi-step teacher counterparts. Our analysis reveals that changes in weight direction significantly exceed those in weight norm, highlighting it as the key factor during distillation. Motivated by this insight, we propose the Low-rank Rotation of weight Direction (LoRaD), a parameter-efficient adapter tailored to one-step diffusion distillation. LoRaD is designed to model these structured directional changes using learnable low-rank rotation matrices. We further integrate LoRaD into Variational Score Distillation (VSD), resulting in Weight Direction-aware Distillation (WaDi)-a novel one-step distillation framework. WaDi achieves state-of-the-art FID scores on COCO 2014 and COCO 2017 while using only approximately 10% of the trainable parameters of the U-Net/DiT. Furthermore, the distilled one-step model demonstrates strong versatility and scalability, generalizing well to various downstream tasks such as controllable generation, relation inversion, and high-resolution synthesis.[323] Event-based Motion & Appearance Fusion for 6D Object Pose Tracking
Zhichao Li,Chiara Bartolozzi,Lorenzo Natale,Arren Glover
Main category: cs.CV
TL;DR: 本文提出了一种基于事件相机的无学习6D物体姿态跟踪方法,利用事件流光流估计6D速度进行姿态传播,并结合模板匹配进行局部姿态校正,在高速运动场景下性能媲美甚至超越现有SOTA方法。
Details
Motivation: RGB-D相机在高动态环境下受限于运动模糊和帧率;事件相机具有高时间分辨率和低延迟优势,但目前基于事件相机的6D姿态跟踪研究仍较少。 Method: 提出一种无需学习的方法:首先利用事件相机的光流估计6D物体速度进行姿态传播,再通过模板匹配的局部姿态校正模块进行修正。 Result: 该方法在快速运动物体上性能与当前最优算法相当甚至更优,验证了事件相机在高动态场景中替代传统深度网络方案的潜力。 Conclusion: 基于事件相机的无学习6D姿态跟踪方法可行且高效,尤其适用于更新率受限的高速动态任务场景。 Abstract: Object pose tracking is a fundamental and essential task for robotics to perform tasks in the home and industrial settings. The most commonly used sensors to do so are RGB-D cameras, which can hit limitations in highly dynamic environments due to motion blur and frame-rate constraints. Event cameras have remarkable features such as high temporal resolution and low latency, which make them a potentially ideal vision sensors for object pose tracking at high speed. Even so, there are still only few works on 6D pose tracking with event cameras. In this work, we take advantage of the high temporal resolution and propose a method that uses both a propagation step fused with a pose correction strategy. Specifically, we use 6D object velocity obtained from event-based optical flow for pose propagation, after which, a template-based local pose correction module is utilized for pose correction. Our learning-free method has comparable performance to the state-of-the-art algorithms, and in some cases out performs them for fast-moving objects. The results indicate the potential for using event cameras in highly-dynamic scenarios where the use of deep network approaches are limited by low update rates.[324] Prototype-Guided Concept Erasure in Diffusion Models
Yuze Cai,Jiahao Lu,Hongxiang Shi,Yichao Zhou,Hong Lu
Main category: cs.CV
TL;DR: 本文提出了一种基于模型嵌入几何结构的概念擦除方法,通过聚类概念相关嵌入生成原型,并作为负向条件信号,显著提升了对宽泛概念(如‘性’、‘暴力’)的可靠擦除效果,同时保持图像质量。
Details
Motivation: 现有概念擦除方法在窄义具体概念(如皮卡丘、马斯克)上有效,但在宽泛、多义、边界模糊的概念(如‘性’、‘暴力’)上性能下降,亟需更鲁棒的擦除机制。 Method: 利用文本到图像模型的内在嵌入几何结构识别概念相关潜在嵌入;通过聚类得到能表征该概念的多个原型;将这些原型作为负向条件信号用于生成过程,实现精准擦除。 Result: 在多个基准测试中,该方法显著提升了对宽泛概念的擦除可靠性,且不损害图像整体质量。 Conclusion: 该方法为实现更安全、可控的图像生成提供了新思路,推动了概念擦除技术从窄义走向宽义语义层面。 Abstract: Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content. Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g. Pikachu) or recognizable characters (e.g. Elon Musk). However, their performance degrades on broad concepts such as ``sexual'' or ``violent'', whose wide scope and multi-faceted nature make them difficult to erase reliably. To overcome this limitation, we exploit the model's intrinsic embedding geometry to identify latent embeddings that encode a given concept. By clustering these embeddings, we derive a set of concept prototypes that summarize the model's internal representations of the concept, and employ them as negative conditioning signals during inference to achieve precise and reliable erasure. Extensive experiments across multiple benchmarks show that our approach achieves substantially more reliable removal of broad concepts while preserving overall image quality, marking a step towards safer and more controllable image generation.[325] OSCAR: Occupancy-based Shape Completion via Acoustic Neural Implicit Representations
Magdalena Wysocki,Kadir Burak Buldu,Miruna-Alexandra Gafencu,Mohammad Farid Azampour,Nassir Navab
Main category: cs.CV
TL;DR: 本文提出了一种基于占用率的无标签超声图像椎体三维形状补全方法,利用神经隐式表示联合建模空间占据与声学交互,无需标注即可直接从B超图像中提取解剖表面,在HD95指标上较现有方法提升80%。
Details
Motivation: 超声引导脊柱微创干预需准确重建椎体三维解剖结构,但受声影和视角依赖性信号变化影响,现有方法面临挑战。 Method: 提出一种基于占用率的形状补全方法,采用神经隐式表示(NIR)联合建模空间占用与声学相互作用,并构建图像外观与解剖形状耦合的潜在空间,实现无需解剖标注的端到端表面提取。 Result: 在B超形状补全任务中HD95得分较SOTA提升80%;在仿真和仿体超声图像上均验证了对遮挡解剖结构的准确重建及跨成像条件的鲁棒泛化能力。 Conclusion: 该方法通过隐式建模声学传播特性,实现了真正适用于术中场景的无标注、高精度椎体三维重建,为超声引导脊柱干预提供了新范式。 Abstract: Accurate 3D reconstruction of vertebral anatomy from ultrasound is important for guiding minimally invasive spine interventions, but it remains challenging due to acoustic shadowing and view-dependent signal variations. We propose an occupancy-based shape completion method that reconstructs complete 3D anatomical geometry from partial ultrasound observations. Crucially for intra-operative applications, our approach extracts the anatomical surface directly from the image, avoiding the need for anatomical labels during inference. This label-free completion relies on a coupled latent space representing both the image appearance and the underlying anatomical shape. By leveraging a Neural Implicit Representation (NIR) that jointly models both spatial occupancy and acoustic interactions, the method uses acoustic parameters to become implicitly aware of the unseen regions without explicit shadowing labels through tracking acoustic signal transmission. We show that this method outperforms state-of-the-art shape completion for B-mode ultrasound by 80% in HD95 score. We validate our approach both in-silico and on phantom US images with registered mesh models from CT labels, demonstrating accurate reconstruction of occluded anatomy and robust generalization across diverse imaging conditions. Code and data will be released on publication.[326] Novel Semantic Prompting for Zero-Shot Action Recognition
Salman Iqbal,Waheed Rehman
Main category: cs.CV
TL;DR: 本文提出SP-CLIP框架,通过多层级结构化语义提示(如意图、动作、物体交互)增强冻结的视觉-语言模型,实现高效零样本动作识别,无需修改视觉编码器或新增参数。
Details
Motivation: 现有方法多关注时序建模或架构调整,而语义提示这一强信号尚未被充分挖掘。 Method: 提出SP-CLIP轻量框架,在冻结的视觉-语言模型基础上引入多层级结构化语义提示,并通过提示聚合与一致性评分对齐视频表征与丰富文本语义。 Result: 在标准基准上显著提升零样本动作识别性能,尤其对细粒度和组合型动作效果突出,同时保持预训练模型的效率与泛化性。 Conclusion: 语义提示本身是零样本动作理解中一个强大且被低估的信号,SP-CLIP验证了其有效性与实用性。 Abstract: Zero-shot action recognition relies on transferring knowledge from vision-language models to unseen actions using semantic descriptions. While recent methods focus on temporal modeling or architectural adaptations to handle video data, we argue that semantic prompting alone provides a strong and underexplored signal for zero-shot action understanding. We introduce SP-CLIP, a lightweight framework that augments frozen vision-language models with structured semantic prompts describing actions at multiple levels of abstraction, such as intent, motion, and object interaction. Without modifying the visual encoder or learning additional parameters, SP-CLIP aligns video representations with enriched textual semantics through prompt aggregation and consistency scoring. Experiments across standard benchmarks show that semantic prompting substantially improves zero-shot action recognition, particularly for fine-grained and compositional actions, while preserving the efficiency and generalization of pretrained models.[327] Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation
Daniele Molino,Camillo Maria Caruso,Paolo Soda,Valerio Guarrasi
Main category: cs.CV
TL;DR: 本文提出了一种检索增强的Text-to-CT生成方法,通过3D视觉语言编码器检索语义相关的临床病例及其解剖标注作为结构代理,并利用ControlNet将其注入文本条件潜扩散模型,从而在无真实标注前提下兼顾语义控制与解剖一致性。
Details
Motivation: 文本条件的体积医学影像生成模型缺乏显式解剖引导,导致空间模糊或解剖不一致;而结构驱动方法虽保证解剖一致性,却依赖难以获取的真实标注。 Method: 提出检索增强的Text-to-CT生成框架:用3D视觉语言编码器检索语义匹配的临床案例,以其解剖标注为结构代理,通过ControlNet分支注入文本条件潜扩散模型。 Result: 在CT-RATE数据集上,该方法相比纯文本基线显著提升图像保真度和临床一致性,并首次实现显式空间可控性;检索质量直接影响生成效果。 Conclusion: 该工作建立了语义条件与解剖合理性的可扩展桥梁,为无监督/弱监督下的体积医学图像合成提供了新范式。 Abstract: Text-conditioned generative models for volumetric medical imaging provide semantic control but lack explicit anatomical guidance, often resulting in outputs that are spatially ambiguous or anatomically inconsistent. In contrast, structure-driven methods ensure strong anatomical consistency but typically assume access to ground-truth annotations, which are unavailable when the target image is to be synthesized. We propose a retrieval-augmented approach for Text-to-CT generation that integrates semantic and anatomical information under a realistic inference setting. Given a radiology report, our method retrieves a semantically related clinical case using a 3D vision-language encoder and leverages its associated anatomical annotation as a structural proxy. This proxy is injected into a text-conditioned latent diffusion model via a ControlNet branch, providing coarse anatomical guidance while maintaining semantic flexibility. Experiments on the CT-RATE dataset show that retrieval-augmented generation improves image fidelity and clinical consistency compared to text-only baselines, while additionally enabling explicit spatial controllability, a capability inherently absent in such approaches. Further analysis highlights the importance of retrieval quality, with semantically aligned proxies yielding consistent gains across all evaluation axes. This work introduces a principled and scalable mechanism to bridge semantic conditioning and anatomical plausibility in volumetric medical image synthesis. Code will be released.[328] Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
Yehonatan Elisha,Oren Barkan,Noam Koenigstein
Main category: cs.CV
TL;DR: 本文提出了一种面向概念级语义的微调框架,利用大语言模型(LLM)和视觉语言模型(VLM)自动生成空间对齐的概念掩码,引导ViT关注细粒度语义特征(如‘长喙’、‘翅膀’),从而提升其在分布偏移下的鲁棒性与可解释性。
Details
Motivation: ViT在分布偏移下性能下降,因其依赖背景等虚假相关而非语义关键特征;现有基于粗粒度前景-背景掩码的正则化方法无法建模细粒度语义概念。 Method: 提出一种新微调框架:1)用LLM无标签生成类相关概念;2)用VLM对概念进行空间分割得到概念掩码;3)优化模型内部显著性图,使其对齐概念区域并抑制背景区域。仅需少量图像及一半类别即可训练。 Result: 在5个OOD基准上显著提升多种ViT模型鲁棒性;生成的相关性图更准确对应语义部件;概念引导掩码比传统分割图提供更有效的鲁棒性监督。 Conclusion: 基于自动提取的空间概念掩码进行概念级对齐微调,是提升ViT鲁棒性与可解释性的有效且可扩展路径。 Abstract: Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods, typically relying on simple foreground-background masks, which fail to capture the fine-grained semantic concepts that define an object (e.g., ``long beak'' and ``wings'' for a ``bird''). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel finetuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class-relevant concepts are first proposed using an LLM-based, label-free method, and then segmented using a VLM. The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution benchmarks demonstrate that our method improves robustness across multiple ViT-based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.[329] HDR-NSFF: High Dynamic Range Neural Scene Flow Fields
Shin Dong-Yeon,Kim Jun-Seong,Kwon Byung-Ki,Tae-Hyun Oh
Main category: cs.CV
TL;DR: 本文提出HDR-NSFF,一种基于4D时空建模的动态高动态范围(HDR)辐射场重建方法,突破传统2D帧融合限制,有效解决动态场景中的鬼影与时间不一致问题。
Details
Motivation: 传统HDR方法依赖交替曝光帧的2D像素对齐,在动态场景中易产生鬼影和时间不一致;单目视频观测有限且存在饱和信息损失,需更鲁棒、物理合理的建模方式。 Method: 提出HDR-NSFF框架,将场景建模为连续的4D时空辐射场,兼容NeRF与4D高斯泼溅;联合优化HDR辐射度、3D场景流、几何与色调映射;引入基于DINO特征的语义光流实现曝光不变运动估计,并加入生成先验正则化以补偿单目与饱和缺失信息。 Result: 在自建真实世界HDR-GoPro数据集上验证,HDR-NSFF显著提升动态HDR空时新视角合成质量,恢复精细辐射细节与一致动态,达到SOTA性能。 Conclusion: HDR-NSFF实现了从2D图像融合到4D时空辐射场建模的范式转变,兼顾物理可解释性、全局一致性与鲁棒性,为动态HDR重建与新视角合成提供了统一、端到端的解决方案。 Abstract: Radiance of real-world scenes typically spans a much wider dynamic range than what standard cameras can capture. While conventional HDR methods merge alternating-exposure frames, these approaches are inherently constrained to 2D pixel-level alignment, often leading to ghosting artifacts and temporal inconsistency in dynamic scenes. To address these limitations, we present HDR-NSFF, a paradigm shift from 2D-based merging to 4D spatio-temporal modeling. Our framework reconstructs dynamic HDR radiance fields from alternating-exposure monocular videos by representing the scene as a continuous function of space and time, and is compatible with both neural radiance field and 4D Gaussian Splatting (4DGS) based dynamic representations. This unified end-to-end pipeline explicitly models HDR radiance, 3D scene flow, geometry, and tone-mapping, ensuring physical plausibility and global coherence. We further enhance robustness by (i) extending semantic-based optical flow with DINO features to achieve exposure-invariant motion estimation, and (ii) incorporating a generative prior as a regularizer to compensate for limited observation in monocular captures and saturation-induced information loss. To evaluate HDR space-time view synthesis, we present the first real-world HDR-GoPro dataset specifically designed for dynamic HDR scenes. Experiments demonstrate that HDR-NSFF recovers fine radiance details and coherent dynamics even under challenging exposure variations, thereby achieving state-of-the-art performance in novel space-time view synthesis. Project page: https://shin-dong-yeon.github.io/HDR-NSFF/[330] Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations
Sadegh Rahmaniboldaji,Filip Rybansky,Quoc C. Vuong,Anya C. Hurlbert,Frank Guerin,Andrew Gilbert
Main category: cs.CV
TL;DR: 本文通过大规模人类-AI对比实验,利用Minimal Identifiable Recognition Crops(MIRCs)研究第一人称动作识别中人类与AI模型(Side4Video)的差异,发现人类高度依赖稀疏语义关键线索(如手-物交互),而模型更依赖上下文和低层特征,且对时空扰动表现出不同鲁棒性。
Details
Motivation: 理解人类在动作识别(尤其在低分辨率、遮挡、视觉杂乱等真实挑战条件下)显著优于当前AI模型的原因,以推动更鲁棒、更类人的模型发展。 Method: 构建并使用Epic ReduAct数据集(源自36个EPIC KITCHENS视频,经系统空间缩减与时间打乱),定义Minimal Identifiable Recognition Crops(MIRCs);结合3000+人类被试与Side4Video模型进行对比评估;引入定量指标(平均缩减率、识别差距)与定性分析(空间多级特征、时空因素、动作按时间敏感性分类为LTA/HTA)。 Result: 人类识别性能在MIRC→subMIRC时急剧下降,表明其强依赖稀疏语义关键线索(如手-物交互);模型性能下降更平缓,常依赖上下文及中低层特征,甚至在空间缩减时置信度反升;人类在保留关键空间线索时对时间打乱鲁棒,而模型对时间扰动不敏感,且存在类别依赖的时间敏感性。 Conclusion: 人类与AI的动作识别机制存在本质差异:人类聚焦于稀疏、高语义的关键线索,AI则偏向冗余、上下文及低层统计特征;该发现为设计更类人、更鲁棒的动作识别模型提供了认知驱动的指导方向。 Abstract: Humans consistently outperform state-of-the-art AI models in action recognition, particularly in challenging real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the sources of this performance gap is essential for developing more robust and human-aligned models. In this paper, we present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. We used our previously introduced, Epic ReduAct, a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated using over 3,000 human participants and the Side4Video model. Our analysis combines quantitative metrics, Average Reduction Rate and Recognition Gap, with qualitative analyses of spatial (high-, mid-, and low-level visual features) and spatiotemporal factors, including a categorisation of actions into Low Temporal Actions (LTA) and High Temporal Actions (HTA). Results show that human performance exhibits sharp declines when transitioning from MIRCs to subMIRCs, reflecting a strong reliance on sparse, semantically critical cues such as hand-object interactions. In contrast, the model degrades more gradually and often relies on contextual and mid- to low-level features, sometimes even exhibiting increased confidence under spatial reduction. Temporally, humans remain robust to scrambling when key spatial cues are preserved, whereas the model often shows insensitivity to temporal disruption, revealing class-dependent temporal sensitivities.[331] Beyond Attention Heatmaps: How to Get Better Explanations for Multiple Instance Learning Models in Histopathology
Mina Jamshidi Idaji,Julius Hense,Tom Neuhäuser,Augustin Krause,Yanqing Luo,Oliver Eberle,Thomas Schnake,Laure Ciernik,Farnoush Rezaei Jafari,Reza Vahidimajd,Jonas Dippel,Christoph Walz,Frederick Klauschen,Andreas Mock,Klaus-Robert Müller
Main category: cs.CV
TL;DR: 本文提出了一种无需额外标注即可评估多实例学习(MIL)热图质量的通用框架,并在组织病理学任务中系统评测了六种解释方法,发现LRP、IG和Single扰动法优于注意力和梯度类方法;进一步展示了其在空间转录组验证和HPV感染机制发现中的生物学价值。
Details
Motivation: MIL热图被广泛用于模型验证和生物标志物发现,但其有效性缺乏系统评估;现有方法依赖额外标注或未充分验证其反映真实决策机制的能力。 Method: 提出一种不依赖额外标签的MIL热图质量评估通用框架;开展大规模基准实验,涵盖分类/回归/生存分析三类任务、Attention/Transformer/Mamba三类MIL架构、UNI2/Virchow2两类补丁编码器,并对比六种解释方法(含Attention、梯度、LRP、IG、Single扰动等)。 Result: 解释质量主要取决于MIL模型架构与任务类型;LRP、IG和Single扰动法稳定优于注意力与梯度类热图;最佳方法可实现:(i)MIL热图与空间转录组数据相关性验证;(ii)发现HPV预测中不同肿瘤区域的特异性模型策略。 Conclusion: MIL热图需严格验证;高质量解释方法不仅能提升模型可信度,还能驱动生物学发现;应推动可解释AI在数字病理中的广泛应用。 Abstract: Multiple instance learning (MIL) has enabled substantial progress in computational histopathology, where a large amount of patches from gigapixel whole slide images are aggregated into slide-level predictions. Heatmaps are widely used to validate MIL models and to discover tissue biomarkers. Yet, the validity of these heatmaps has barely been investigated. In this work, we introduce a general framework for evaluating the quality of MIL heatmaps without requiring additional labels. We conduct a large-scale benchmark experiment to assess six explanation methods across histopathology task types (classification, regression, survival), MIL model architectures (Attention-, Transformer-, Mamba-based), and patch encoder backbones (UNI2, Virchow2). Our results show that explanation quality mostly depends on MIL model architecture and task type, with perturbation ("Single"), layer-wise relevance propagation (LRP), and integrated gradients (IG) consistently outperforming attention-based and gradient-based saliency heatmaps, which often fail to reflect model decision mechanisms. We further demonstrate the advanced capabilities of the best-performing explanation methods: (i) We provide a proof-of-concept that MIL heatmaps of a bulk gene expression prediction model can be correlated with spatial transcriptomics for biological validation, and (ii) showcase the discovery of distinct model strategies for predicting human papillomavirus (HPV) infection from head and neck cancer slides. Our work highlights the importance of validating MIL heatmaps and establishes that improved explainability can enable more reliable model validation and yield biological insights, making a case for a broader adoption of explainable AI in digital pathology. Our code is provided in a public GitHub repository: https://github.com/bifold-pathomics/xMIL/tree/xmil-journal[332] Local-Global Prompt Learning via Sparse Optimal Transport
Deniz Kizaroğlu,Ülku Tuncer Küçüktas,Emre Çakmakyurdu,Alptekin Temizel
Main category: cs.CV
TL;DR: 本文提出SOT-GLP方法,通过共享稀疏图像块支持与平衡最优传输分配,在保持全局对齐的同时,显式划分显著视觉区域给类别特异性局部提示,从而提升少样本分类准确率与OOD检测鲁棒性。
Details
Motivation: 现有基于局部图文对齐的少样本VLM适配方法存在局部区域选择独立、特征冗余和提示重叠问题,且可学习投影会破坏CLIP原始特征流形几何结构,影响分布外检测性能。 Method: SOT-GLP包含全局与局部双分支:全局分支使用共享可学习文本提示进行标准图文匹配;局部分支利用V-V注意力构建类别条件稀疏图像块集,并通过平衡熵最优传输将其软划分至多个类别特异性文本提示,避免提示重叠与坍缩,且不引入可学习投影。 Result: 在11个基准数据集16-shot设置下,ViT-B/16模型平均准确率达85.1%,优于先前提示学习方法;在OOD检测任务中取得94.2% AUC,为当前最优。 Conclusion: 共享稀疏支持与投影无关的局部对齐策略可在提升少样本分类性能的同时,更好保留CLIP原始特征空间几何结构,从而兼顾准确性与分布外鲁棒性。 Abstract: Few-shot adaptation of vision-language models (VLMs) like CLIP typically relies on learning textual prompts matched to global image embeddings. Recent works extend this paradigm by incorporating local image-text alignment to capture fine-grained visual cues, yet these approaches often select local regions independently for each prompt, leading to redundant local feature usage and prompt overlap. We propose SOT-GLP, which introduces a shared sparse patch support and balanced optimal transport allocation to explicitly partition salient visual regions among class-specific local prompts while preserving global alignment. Our method learns shared global prompts and class-specific local prompts. The global branch maintains standard image-text matching for robust category-level alignment. The local branch constructs a class-conditioned sparse patch set using V-V attention and aligns it to multiple class-specific prompts via balanced entropic optimal transport, yielding a soft partition of patches that prevents prompt overlap and collapse. We evaluate our method on two complementary objectives: (i) few-shot classification accuracy on 11 standard benchmarks and (ii) out-of-distribution (OOD) detection. On the standard 11-dataset benchmark with 16-shot ViT-B/16, SOT-GLP achieves 85.1% average accuracy, outperforming prior prompt-learning methods. We identify a distinct accuracy-robustness trade-off in prompt learning: while learnable projections optimize in-distribution fit, they alter the foundational feature space. We demonstrate that a projection-free local alignment preserves the native geometry of the CLIP manifold, yielding state-of-the-art OOD detection performance (94.2% AUC) that surpasses fully adapted models. Implementation available at: https://github.com/Deniz2304988/SOT-GLP[333] $Δ$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation
Yijie Zhu,Jie He,Rui Shao,Kaishen Yuan,Tao Tan,Xiaochen Yuan,Zitong Yu
Main category: cs.CV
TL;DR: 本文提出ΔVLA框架,通过建模世界知识的相对变化(而非绝对未来状态)来提升机器人操作中的视觉-语言-动作联合建模能力,核心包括先验引导的世界知识提取器(PWKE)、潜在世界变化量化(LWVQ)和条件变化注意力(CV-Atten),在仿真与真实机器人任务中均达到SOTA性能。
Details
Motivation: 现有VLA模型侧重预测未来视觉状态或世界知识,忽视对变化过程本身的推理,而理解‘如何变化’对生成合理动作至关重要。 Method: 提出ΔVLA框架:1)Prior-Guided World-Knowledge Extractor (PWKE) 构建当前世界知识先验;2)Latent World Variation Quantization (LWVQ) 用VQ-VAE学习离散潜在空间编码知识变化;3)Conditional Variation Attention (CV-Atten) 实现解耦、抗干扰的变化建模。 Result: 在多个仿真基准和真实机器人任务上达到SOTA性能,同时提升计算与建模效率。 Conclusion: 建模世界知识的相对变化(Δ)比回归绝对未来状态更有效,ΔVLA为VLA范式提供了新思路,并验证了其在真实场景中的有效性与实用性。 Abstract: Recent vision-language-action (VLA) models have significantly advanced robotic manipulation by unifying perception, reasoning, and control. To achieve such integration, recent studies adopt a predictive paradigm that models future visual states or world knowledge to guide action generation. However, these models emphasize forecasting outcomes rather than reasoning about the underlying process of change, which is essential for determining how to act. To address this, we propose $Δ$VLA, a prior-guided framework that models world-knowledge variations relative to an explicit current-world knowledge prior for action generation, rather than regressing absolute future world states. Specifically, 1) to construct the current world knowledge prior, we propose the Prior-Guided WorldKnowledge Extractor (PWKE). It extracts manipulable regions, spatial relations, and semantic cues from the visual input, guided by auxiliary heads and prior pseudo labels, thus reducing redundancy. 2) Building upon this, to represent how world knowledge evolves under actions, we introduce the Latent World Variation Quantization (LWVQ). It learns a discrete latent space via a VQ-VAE objective to encode world knowledge variations, shifting prediction from full modalities to compact latent. 3)Moreover, to mitigate interference during variation modeling, we design the Conditional Variation Attention (CV-Atten), whichpromotes disentangled learning and preserves the independence of knowledge representations. Extensive experiments on both simulated benchmarks and real-world robotic tasks demonstrate $Δ$VLA achieves state-of-the-art performance while improving efficiency. Code and real-world execution videos are available at https://github.com/JiuTian-VL/DeltaVLA.[334] Diffusion-Based Data Augmentation for Image Recognition: A Systematic Analysis and Evaluation
Zekun Li,Yinghuan Shi,Yang Gao,Dong Xu
Main category: cs.CV
TL;DR: 本文提出UniDiffDA统一分析框架,将扩散数据增强(DiffDA)方法分解为模型微调、样本生成和样本利用三大核心组件,并基于该框架构建公平评估协议,在多种低数据分类任务上系统评测代表性DiffDA方法。
Details
Motivation: 现有DiffDA方法在任务配置、模型选择和实验流程上差异显著,难以公平比较;同时缺乏对完整DiffDA工作流的系统性理解。 Method: 提出UniDiffDA统一分析框架,将DiffDA分解为模型微调、样本生成和样本利用三个核心组件,并据此构建全面、公平的评估协议,在统一代码库中重现实验。 Result: 通过大量实验揭示了不同DiffDA策略的相对优劣,提供了方法设计与部署的实用见解;所有方法均在统一代码库中重实现,并开源代码与配置以保障可复现性。 Conclusion: UniDiffDA框架有助于厘清DiffDA的设计空间,促进方法间的公平比较与系统理解,推动该领域未来研究。 Abstract: Diffusion-based data augmentation (DiffDA) has emerged as a promising approach to improving classification performance under data scarcity. However, existing works vary significantly in task configurations, model choices, and experimental pipelines, making it difficult to fairly compare methods or assess their effectiveness across different scenarios. Moreover, there remains a lack of systematic understanding of the full DiffDA workflow. In this work, we introduce UniDiffDA, a unified analytical framework that decomposes DiffDA methods into three core components: model fine-tuning, sample generation, and sample utilization. This perspective enables us to identify key differences among existing methods and clarify the overall design space. Building on this framework, we develop a comprehensive and fair evaluation protocol, benchmarking representative DiffDA methods across diverse low-data classification tasks. Extensive experiments reveal the relative strengths and limitations of different DiffDA strategies and offer practical insights into method design and deployment. All methods are re-implemented within a unified codebase, with full release of code and configurations to ensure reproducibility and to facilitate future research.[335] This Looks Distinctly Like That: Grounding Interpretable Recognition in Stiefel Geometry against Neural Collapse
Junhao Jia,Jiaqi Wang,Yunyou Liu,Haodong Jing,Yueyi Wu,Xian Wu,Yefeng Zheng
Main category: cs.CV
TL;DR: 本文提出自适应流形原型(AMP)框架,通过黎曼优化和空间正则化缓解原型坍缩问题,提升可解释性与分类性能。
Details
Motivation: 原型网络虽具备基于案例的解释机制,但常因原型坍缩(多个原型退化为高度冗余证据)而损害可解释性;该问题源于神经坍缩的终端动力学,即交叉熵优化抑制类内方差并迫使类条件特征趋向低维极限。 Method: 提出AMP框架:1)在Stiefel流形上采用黎曼优化,将类原型表示为正交基,从构造上避免秩一原型坍缩;2)通过非负容量向量上的近端梯度更新学习类特异性有效秩;3)引入空间正则化以减少旋转模糊性,并鼓励局部化、非重叠的部件证据。 Result: 在细粒度分类基准上,AMP实现了最先进分类精度,并显著提升了相较于先前可解释模型的因果保真度。 Conclusion: AMP通过几何约束与结构化正则化有效缓解原型坍缩,在保持高分类性能的同时增强了模型的可解释性与因果可信性。 Abstract: Prototype networks provide an intrinsic case based explanation mechanism, but their interpretability is often undermined by prototype collapse, where multiple prototypes degenerate to highly redundant evidence. We attribute this failure mode to the terminal dynamics of Neural Collapse, where cross entropy optimization suppresses intra class variance and drives class conditional features toward a low dimensional limit. To mitigate this, we propose Adaptive Manifold Prototypes (AMP), a framework that leverages Riemannian optimization on the Stiefel manifold to represent class prototypes as orthonormal bases and make rank one prototype collapse infeasible by construction. AMP further learns class specific effective rank via a proximal gradient update on a nonnegative capacity vector, and introduces spatial regularizers that reduce rotational ambiguity and encourage localized, non overlapping part evidence. Extensive experiments on fine-grained benchmarks demonstrate that AMP achieves state-of-the-art classification accuracy while significantly improving causal faithfulness over prior interpretable models.[336] Real-Time Drone Detection in Event Cameras via Per-Pixel Frequency Analysis
Michael Bezick,Majid Sahin
Main category: cs.CV
TL;DR: 本文提出了一种基于非均匀离散傅里叶变换(NDFT)的每像素时序分析框架DDHF,用于从事件相机数据中高效检测快速移动的无人机,利用旋翼频谱中的频率梳特征实现高精度、低延迟的实时定位。
Details
Motivation: 事件相机数据稀疏且异步,传统DFT因假设均匀采样而不适用,亟需适配事件流特性的频域分析方法来检测无人机旋翼等周期性信号。 Method: 提出基于非均匀离散傅里叶变换(NDFT)的Drone Detection via Harmonic Fingerprinting(DDHF)框架,通过每像素纯解析方式提取旋翼功率谱中的频率梳特征,实现无人机定位。 Result: DDHF在多个复杂场景下平均定位F1得分为90.89%,单帧延迟仅2.39ms;相较YOLO(F1 66.74%,延迟12.40ms)显著提升精度与速度,并具备小样本可调、结果可解释等优势。 Conclusion: DDHF验证了纯解析方法在事件相机无人机检测任务中的有效性,兼顾高性能、低延迟、强泛化性与可解释性,为替代深度学习方案提供了新路径。 Abstract: Detecting fast-moving objects, such as unmanned aerial vehicle (UAV), from event camera data is challenging due to the sparse, asynchronous nature of the input. Traditional Discrete Fourier Transforms (DFT) are effective at identifying periodic signals, such as spinning rotors, but they assume uniformly sampled data, which event cameras do not provide. We propose a novel per-pixel temporal analysis framework using the Non-uniform Discrete Fourier Transform (NDFT), which we call Drone Detection via Harmonic Fingerprinting (DDHF). Our method uses purely analytical techniques that identify the frequency signature of drone rotors, as characterized by frequency combs in their power spectra, enabling a tunable and generalizable algorithm that achieves accurate real-time localization of UAV. We compare against a YOLO detector under equivalent conditions, demonstrating improvement in accuracy and latency across a difficult array of drone speeds, distances, and scenarios. DDHF achieves an average localization F1 score of 90.89% and average latency of 2.39ms per frame, while YOLO achieves an F1 score of 66.74% and requires 12.40ms per frame. Through utilization of purely analytic techniques, DDHF is quickly tuned on small data, easily interpretable, and achieves competitive accuracies and latencies to deep learning alternatives.[337] AULLM++: Structural Reasoning with Large Language Models for Micro-Expression Recognition
Zhishu Liu,Kaishen Yuan,Bo Zhao,Hui Ma,Zitong Yu
Main category: cs.CV
TL;DR: 本文提出AULLM++框架,利用大语言模型进行微表情动作单元(AU)检测,通过多粒度证据融合、关系感知图神经网络和反事实一致性正则化,显著提升性能与跨域泛化能力。
Details
Motivation: 现有方法依赖低密度视觉信息、特征处理粗糙、忽略AU间相关性,难以准确解析微表情的复杂模式。 Method: 提出AULLM++框架:1)多粒度证据增强融合投影器(MGE-EFP)生成内容令牌(CT);2)关系感知AU图神经网络(R-AUGNN)建模AU关系并生成指令令牌(IT);3)融合CT与IT构建结构化文本提示,并引入反事实一致性正则化(CCR)提升泛化。 Result: 在标准基准上达到最优性能,且展现出优异的跨域泛化能力。 Conclusion: 将视觉特征注入LLM文本提示作为语义前提,结合结构建模与反事实推理,可有效提升微表情AU检测的准确性与鲁棒性。 Abstract: Micro-expression Action Unit (AU) detection identifies localized AUs from subtle facial muscle activations, providing a foundation for decoding affective cues. Previous methods face three key limitations: (1) heavy reliance on low-density visual information, rendering discriminative evidence vulnerable to background noise; (2) coarse-grained feature processing that misaligns with the demand for fine-grained representations; and (3) neglect of inter-AU correlations, restricting the parsing of complex expression patterns. We propose AULLM++, a reasoning-oriented framework leveraging Large Language Models (LLMs), which injects visual features into textual prompts as actionable semantic premises to guide inference. It formulates AU prediction into three stages: evidence construction, structure modeling, and deduction-based prediction. Specifically, a Multi-Granularity Evidence-Enhanced Fusion Projector (MGE-EFP) fuses mid-level texture cues with high-level semantics, distilling them into a compact Content Token (CT). Furthermore, inspired by micro- and macro-expression AU correspondence, we encode AU relationships as a sparse structural prior and learn interaction strengths via a Relation-Aware AU Graph Neural Network (R-AUGNN), producing an Instruction Token (IT). We then fuse CT and IT into a structured textual prompt and introduce Counterfactual Consistency Regularization (CCR) to construct counterfactual samples, enhancing the model's generalization. Extensive experiments demonstrate AULLM++ achieves state-of-the-art performance on standard benchmarks and exhibits superior cross-domain generalization.[338] SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents
Yu Yang,Yue Liao,Jianbiao Mei,Baisen Wang,Xuemeng Yang,Licheng Wen,Jiangning Zhang,Xiangtai Li,Hanlin Chen,Botian Shi,Yong Liu,Shuicheng Yan,Gim Hee Lee
Main category: cs.CV
TL;DR: SPIRAL是一种自改进的规划与迭代反思式动作世界建模闭环框架,用于可控长时序视频生成,通过PlanAgent和CriticAgent实现分步规划与反馈优化。
Details
Motivation: 现有单次视频生成模型为开环模式,存在动作执行不完整、语义对齐弱和时间漂移等问题。 Method: 提出SPIRAL闭环框架,包含PlanAgent(将高层动作分解为物体中心子动作)和CriticAgent(评估中间结果并引导迭代优化),结合长时程记忆与RL演化优化。 Result: 在ActWM-Bench及主流视频生成基准上,SPIRAL在多个TI2V骨干模型上均取得一致性能提升。 Conclusion: SPIRAL通过闭环‘思考-行动-反思’机制显著提升了长时序视频生成的语义对齐性与时间一致性。 Abstract: We introduce SPIRAL, a self-improving planning and iterative reflective action world modeling closed-loop framework that enables controllable long-horizon video generation conditioned on high-level semantic actions. Existing one-shot video generation models operate in open-loop, often resulting in incomplete action execution, weak semantic grounding, and temporal drift. SPIRAL formulates ActWM as a closed-loop think-act-reflect process, where generation proceeds step by step under explicit planning and feedback. A PlanAgent decomposes abstract actions into object-centric sub-actions, while a CriticAgent evaluates intermediate results and guides iterative refinement with long-horizon memory. This closed-loop design naturally supports RL evolving optimization, improving semantic alignment and temporal consistency over extended horizons. We further introduce the ActWM-Dataset and ActWM-Bench for training and evaluation. Experiments across multiple TI2V backbones demonstrate consistent gains on ActWM-Bench and mainstream video generation benchmarks, validating SPIRAL's effectiveness.[339] Information Maximization for Long-Tailed Semi-Supervised Domain Generalization
Leo Fillioux,Omprakash Chakraborty,Quentin Gopée,Pierre Marza,Paul-Henry Cournède,Stergios Christodoulidis,Maria Vakalopoulou,Ismail Ben Ayed,Jose Dolz
Main category: cs.CV
TL;DR: 本文提出IMaX方法,通过改进的InfoMax原则最大化特征与潜在标签间的互信息,并引入α-熵目标缓解长尾分布下的类别偏差,从而提升半监督域泛化(SSDG)在真实不均衡数据上的性能。
Details
Motivation: 现有半监督域泛化(SSDG)方法在面对现实世界中常见的长尾类别分布时性能严重下降,亟需一种能适应任意类别分布的鲁棒方法。 Method: 提出IMaX目标函数,基于InfoMax原则,最大化学习特征与潜在标签之间的互信息;引入α-熵形式替代标准边缘熵项,以缓解类别不平衡偏差;该目标可即插即用地集成到现有SSDG框架中。 Result: IMaX在两种不同图像模态的多个基准上显著提升现有SOTA SSDG方法的性能,尤其在长尾设定下效果突出。 Conclusion: IMaX是一种简单有效、兼容性强的训练目标,能显著增强SSDG方法对长尾类别分布的鲁棒性,推动其向更实际场景落地。 Abstract: Semi-supervised domain generalization (SSDG) has recently emerged as an appealing alternative to tackle domain generalization when labeled data is scarce but unlabeled samples across domains are abundant. In this work, we identify an important limitation that hampers the deployment of state-of-the-art methods on more challenging but practical scenarios. In particular, state-of-the-art SSDG severely suffers in the presence of long-tailed class distributions, an arguably common situation in real-world settings. To alleviate this limitation, we propose IMaX, a simple yet effective objective based on the well-known InfoMax principle adapted to the SSDG scenario, where the Mutual Information (MI) between the learned features and latent labels is maximized, constrained by the supervision from the labeled samples. Our formulation integrates an α-entropic objective, which mitigates the class-balance bias encoded in the standard marginal entropy term of the MI, thereby better handling arbitrary class distributions. IMaX can be seamlessly plugged into recent state-of-the-art SSDG, consistently enhancing their performance, as demonstrated empirically across two different image modalities.[340] Alfa: Attentive Low-Rank Filter Adaptation for Structure-Aware Cross-Domain Personalized Gaze Estimation
He-Yen Hsieh,Wei-Te Mark Ting,H. T. Kung
Main category: cs.CV
TL;DR: 本文提出Attentive Low-Rank Filter Adaptation (Alfa),一种通过重加权预训练滤波器中的语义模式来实现眼动模型测试时个性化(TTP)的新方法,利用SVD提取主导空间成分,并通过注意力机制仅需少量无标签样本即可自适应调整权重,在多个跨数据集基准上达到最低平均眼动误差。
Details
Motivation: 预训练眼动模型虽能学习跨用户的通用模式,但用户特异性差异(如眼睑形状、面部结构)会导致性能下降;现有参数高效微调(PEFT)方法未能充分利用预训练滤波器中已编码的结构信息。 Method: 提出Alfa方法:首先用奇异值分解(SVD)从预训练滤波器中提取表征眼与面部特征的主导空间成分;再设计注意力机制,仅需少量无标签样本即可对这些成分进行选择性重加权,实现用户自适应。 Result: Alfa在四个跨数据集眼动估计基准上取得最低平均误差,优于现有TTP方法及LoRA变体;且其 attentive low-rank 思路可拓展至扩散语言模型等非视觉任务。 Conclusion: 将个性化建模为对预训练结构的重加权而非全新学习,能更高效、更有效地利用已有知识;Alfa为资源受限下的轻量级模型适配提供了新范式。 Abstract: Pre-trained gaze models learn to identify useful patterns commonly found across users, but subtle user-specific variations (i.e., eyelid shape or facial structure) can degrade model performance. Test-time personalization (TTP) adapts pre-trained models to these user-specific domain shifts using only a few unlabeled samples. Efficient fine-tuning is critical in performing this domain adaptation: data and computation resources can be limited-especially for on-device customization. While popular parameter-efficient fine-tuning (PEFT) methods address adaptation costs by updating only a small set of weights, they may not be taking full advantage of structures encoded in pre-trained filters. To more effectively leverage existing structures learned during pre-training, we reframe personalization as a process to reweight existing features rather than learning entirely new ones. We present Attentive Low-Rank Filter Adaptation (Alfa) to adapt gaze models by reweighting semantic patterns in pre-trained filters. With Alfa, singular value decomposition (SVD) extracts dominant spatial components that capture eye and facial characteristics across users. Via an attention mechanism, we need only a few unlabeled samples to adjust and reweight pre-trained structures, selectively amplifying those relevant to a target user. Alfa achieves the lowest average gaze errors across four cross-dataset gaze benchmarks, outperforming existing TTP methods and low-rank adaptation (LoRA)-based variants. We also show that Alfa's attentive low-rank methods can be applied to applications beyond vision, such as diffusion-based language models.[341] X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
Youngseo Kim,Kwan Yun,Seokhyeon Hong,Sihun Cha,Colette Suhjung Koo,Junyong Noh
Main category: cs.CV
TL;DR: 本文提出X-AVDT,一种基于生成器内部跨模态注意力机制的深度伪造检测方法,利用DDIM反演提取视频异常与音视频对齐特征,在新构建的多模态数据集MMDF及外部基准上均取得领先性能。
Details
Motivation: 当前高保真合成视频激增,带来恶意使用风险,传统检测方法面临挑战;作者从生成器内部视角出发,发现跨注意力机制蕴含语音-动作对齐线索,可服务于伪造检测。 Method: 提出X-AVDT检测器,通过DDIM反演访问生成器内部信号,提取两类互补特征:(i)反映反演差异的视频合成图;(ii)表征生成过程中音视频对齐的跨注意力特征;同时构建覆盖GAN、扩散模型、流匹配等范式的多模态深伪数据集MMDF。 Result: X-AVDT在MMDF数据集上性能领先,并在跨生成器和外部基准测试中泛化性强,准确率较现有方法提升13.1%。 Conclusion: 利用生成器内部音视频一致性线索是提升深伪检测对未来新型生成器鲁棒性的关键路径。 Abstract: The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful cross-generator evaluation, we further introduce MMDF, a new multimodal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by 13.1%. Our findings highlight the importance of leveraging internal audio-visual consistency cues for robustness to future generators in deepfake detection.[342] Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Qishun Yang,Shu Yang,Lijie Hu,Di Wang
Main category: cs.CV
TL;DR: 本文提出了一种无需安全标签的视觉自实现对齐方法(VSFA),通过在威胁相关图像上的中性VQA任务微调视觉语言模型,使模型内化警惕与谨慎的隐含语义,从而提升安全性并保持通用能力。
Details
Motivation: 现有MLLMs的安全对齐方法依赖显式安全标签或对比数据,但安全概念(如‘有帮助’)抽象且缺乏视觉指代,而威胁概念具体可视觉化;同时,受自实现机制启发,探索如何在视觉模态中实现类似对齐。 Method: 提出Visual Self-Fulfilling Alignment(VSFA),在无安全标签的前提下,利用威胁相关图像构建中性视觉问答(VQA)任务,对视觉语言模型(VLMs)进行微调,使其通过重复接触威胁图像自发内化‘警惕’‘谨慎’等安全相关隐含语义。 Result: VSFA在多个VLM和安全基准测试中显著降低攻击成功率、提升响应质量、缓解过度拒绝现象,同时不损害模型通用能力。 Conclusion: VSFA将自实现机制从文本扩展至视觉模态,提供了一种新颖、有效且无需标注的安全对齐范式,推动多模态大模型安全研究的发展。 Abstract: Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the Self-Fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLMs alignment.[343] Global Cross-Modal Geo-Localization: A Million-Scale Dataset and a Physical Consistency Learning Framework
Yutong Hu,Jinhui Chen,Chaoqiang Xu,Yuan Kou,Sili Zhou,Shaocheng Yan,Pengcheng Shi,Qingwu Hu,Jiayuan Li
Main category: cs.CV
TL;DR: 本文提出了首个百万级全球跨模态地理定位(CMGL)数据集CORE,并设计了物理规律感知网络PLANET,显著提升了跨模态地理定位性能。
Details
Motivation: 现有CMGL研究受限于地理覆盖范围窄、场景多样性低,难以反映全球建筑风格与地形特征的空间异质性,亟需构建更具普适性的数据集与方法。 Method: 构建百万级全球CMGL数据集CORE,覆盖225个地理区域;利用大视觉语言模型(LVLM)零样本生成高质量文本描述;提出物理规律感知网络PLANET,引入新型对比学习范式,使文本表征能捕捉卫星图像的内在物理特征。 Result: 在多个地理区域上的实验表明,PLANET显著优于当前最优方法,确立了鲁棒、全球尺度地理定位的新基准。 Conclusion: CORE数据集与PLANET模型共同推动了跨模态地理定位向全球化、实用化迈进,为行人导航与应急响应等应用提供了坚实基础。 Abstract: Cross-modal Geo-localization (CMGL) matches ground-level text descriptions with geo-tagged aerial imagery, which is crucial for pedestrian navigation and emergency response. However, existing researches are constrained by narrow geographic coverage and simplistic scene diversity, failing to reflect the immense spatial heterogeneity of global architectural styles and topographic features. To bridge this gap and facilitate universal positioning, we introduce CORE, the first million-scale dataset dedicated to global CMGL. CORE comprises 1,034,786 cross-view images sampled from 225 distinct geographic regions across all continents, offering an unprecedented variety of perspectives in varying environmental conditions and urban layouts. We leverage the zero-shot reasoning of Large Vision-Language Models (LVLMs) to synthesize high-quality scene descriptions rich in discriminative cues. Furthermore, we propose a physical-law-aware network (PLANET) for cross-modal geo-localization. PLANET introduces a novel contrastive learning paradigm to guide textual representations in capturing the intrinsic physical signatures of satellite imagery. Extensive experiments across varied geographic regions demonstrate that PLANet significantly outperforms state-of-the-art methods, establishing a new benchmark for robust, global-scale geo-localization. The dataset and source code will be released at https://github.com/YtH0823/CORE.[344] Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models
Heng Zhou,Ao Yu,Li Kang,Yuchen Fan,Yutao Fan,Xiufeng Song,Hejia Geng,Yiran Qin
Main category: cs.CV
TL;DR: 本文系统评估了15种视觉-语言模型(VLMs)在字体识别(字体族、大小、风格、颜色)上的能力,发现其对字体风格识别普遍较差,且模型规模与性能无强相关,表明问题源于训练数据缺失而非模型容量限制;通过LoRA微调合成数据可显著提升部分字体属性识别能力,但字体风格仍难以改善,暗示需架构层面创新;作者开源了评测框架、数据与微调方案。
Details
Motivation: Vision-Language Models虽能准确读取图像中的文本内容,却难以感知其视觉呈现(即排版特征),存在‘排版盲区’;本文旨在系统揭示并量化这一缺陷。 Method: 构建覆盖26种字体、4种文字系统、3个难度级别的字体识别评测基准;评估15个SOTA VLMs在字体族、大小、风格、颜色四方面表现;分析模型规模与性能关系;采用LoRA在小规模合成数据上微调开源模型并验证效果。 Result: 发现显著的感知层级:颜色识别接近完美,而字体风格识别普遍极差;模型规模不提升性能,各难度级别准确率均匀,指向训练数据缺失;LoRA微调显著提升字体大小识别,甚至超越最优闭源模型,但字体风格识别仍无明显改善。 Conclusion: 当前VLMs的排版理解缺陷主要源于训练数据中缺乏字体样式相关监督信号,而非模型容量不足;提升字体风格识别可能需要超越现有patch-based编码器的新型架构设计;开源资源将推动视觉-语言模型在排版理解方向的发展。 Abstract: Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling. LoRA fine-tuning on a small set of synthetic samples substantially improves an open-source model, narrowing the gap to the best closed-source system and surpassing it on font size recognition. Font style alone remains resistant to fine-tuning, suggesting that relational visual reasoning may require architectural innovation beyond current patch-based encoders. We release our evaluation framework, data, and fine-tuning recipe to support progress in closing the typographic gap in vision-language understanding.[345] All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference
Yi Yu,Libing Wu,Zhuangzhuang Zhang,Jing Qiu,Lijuan Huo,Jiaqi Feng
Main category: cs.CV
TL;DR: 本文提出了一种面向完全不可信车辆协同感知环境的新型防御框架PRBI,利用时序感知差异和伪随机分组结合贝叶斯推理,高效检测恶意车辆,显著提升检测精度与效率。
Details
Motivation: 现有协同感知防御方法依赖可信自车或额外二分类器,难以满足真实场景中自车不可信、需实时检测及泛化性强等需求。 Method: 提出伪随机贝叶斯推理(PRBI)框架:以先前帧可靠感知为动态参考检测时序感知差异;采用伪随机分组策略,每帧仅需两次验证;通过贝叶斯推断估计恶意车辆数量与身份。 Result: 理论证明PRBI收敛稳定;实验表明其平均每帧仅需2.5次验证,检测精度恢复至攻击前79.4%–86.9%。 Conclusion: PRBI是首个专为完全不可信车辆协同感知设计的高效防御方法,在实用性、效率与鲁棒性上优于现有方案。 Abstract: Collaborative perception (CP) enables multiple vehicles to augment their individual perception capacities through the exchange of feature-level sensory data. However, this fusion mechanism is inherently vulnerable to adversarial attacks, especially in fully untrusted-vehicle environments. Existing defense approaches often assume a trusted ego vehicle as a reference or incorporate additional binary classifiers. These assumptions limit their practicality in real-world deployments due to the questionable trustworthiness of ego vehicles, the requirement for real-time detection, and the need for generalizability across diverse scenarios. To address these challenges, we propose a novel Pseudo-Random Bayesian Inference (PRBI) framework, a first efficient defense method tailored for fully untrusted-vehicle CP. PRBI detects adversarial behavior by leveraging temporal perceptual discrepancies, using the reliable perception from the preceding frame as a dynamic reference. Additionally, it employs a pseudo-random grouping strategy that requires only two verifications per frame, while applying Bayesian inference to estimate both the number and identities of malicious vehicles. Theoretical analysis has proven the convergence and stability of the proposed PRBI framework. Extensive experiments show that PRBI requires only 2.5 verifications per frame on average, outperforming existing methods significantly, and restores detection precision to between 79.4% and 86.9% of pre-attack levels.[346] Improving Continual Learning for Gaussian Splatting based Environments Reconstruction on Commercial Off-the-Shelf Edge Devices
Ivan Zaino,Matteo Risso,Daniele Jahier Pagliari,Miguel de Prado,Toon Van de Maele,Alessio Burrello
Main category: cs.CV
TL;DR: 本文提出了一种精度自适应优化框架,使变分贝叶斯高斯溅射(VBGS)能在资源受限设备上高效训练,显著降低内存与延迟,同时保持甚至提升重建质量。
Details
Motivation: 边缘机器人对轻量、可增量更新的3D场景模型需求迫切,但现有VBGS因高精度计算和大中间张量难以在端侧部署。 Method: 通过三步优化:(i) 性能分析定位内存/延迟瓶颈;(ii) 融合内存主导算子以减少中间张量;(iii) 基于有界相对误差的混合精度自动搜索。 Result: 在A5000 GPU上峰值内存从9.44 GB降至1.11 GB,训练时间从234分钟降至61分钟;首次实现在Jetson Orin Nano嵌入式平台上的NVS训练,单帧延迟比3DGS降低19倍。 Conclusion: 该框架在不改变VBGS变分建模本质的前提下,实现了其在边缘设备上的实用化部署,兼顾效率与精度。 Abstract: Novel view synthesis (NVS) is increasingly relevant for edge robotics, where compact and incrementally updatable 3D scene models are needed for SLAM, navigation, and inspection under tight memory and latency budgets. Variational Bayesian Gaussian Splatting (VBGS) enables replay-free continual updates for the 3DGS algorithm by maintaining a probabilistic scene model, but its high-precision computations and large intermediate tensors make on-device training impractical. We present a precision-adaptive optimization framework that enables VBGS training on resource-constrained hardware without altering its variational formulation. We (i) profile VBGS to identify memory/latency hotspots, (ii) fuse memory-dominant kernels to reduce materialized intermediate tensors, and (iii) automatically assign operation-level precisions via a mixed-precision search with bounded relative error. Across the Blender, Habitat, and Replica datasets, our optimised pipeline reduces peak memory from 9.44 GB to 1.11 GB and training time from ~234 min to ~61 min on an A5000 GPU, while preserving (and in some cases improving) reconstruction quality of the state-of-the-art VBGS baseline. We also enable for the first time NVS training on a commercial embedded platform, the Jetson Orin Nano, reducing per-frame latency by 19x compared to 3DGS.[347] Spherical-GOF: Geometry-Aware Panoramic Gaussian Opacity Fields for 3D Scene Reconstruction
Zhe Yang,Guoqiang Zhao,Sheng Wu,Kai Luo,Kailun Yang
Main category: cs.CV
TL;DR: 本文提出Spherical-GOF,一种基于高斯不透明度场(GOF)的球面高斯渲染框架,专为全景图像设计,通过在单位球面上直接进行球面射线采样,提升几何一致性和渲染质量。
Details
Motivation: 现有3D高斯溅射(3DGS)方法面向透视投影设计,直接适配全景相机模型会导致畸变和几何不一致,亟需适配球面成像特性的新方法。 Method: 提出Spherical-GOF框架:1)在球面射线空间中对GOF进行直接射线采样;2)推导保守球面包围规则以加速高斯剔除;3)引入自适应球面滤波方案,匹配全景像素采样中的畸变变化。 Result: 在OmniBlender和OmniPhotos数据集上显著优于基线:深度重投影误差降低57%,循环内点率提升21%;深度图与法向图更清晰连贯;对全局全景旋转鲁棒;并在新构建的真实机器人数据集OmniRob(含无人机与四足平台)上验证泛化性。 Conclusion: Spherical-GOF有效解决了全景图像下3DGS的几何一致性难题,为机器人与全景视觉提供了高效、鲁棒的神经渲染新范式。 Abstract: Omnidirectional images are increasingly used in robotics and vision due to their wide field of view. However, extending 3D Gaussian Splatting (3DGS) to panoramic camera models remains challenging, as existing formulations are designed for perspective projections and naive adaptations often introduce distortion and geometric inconsistencies. We present Spherical-GOF, an omnidirectional Gaussian rendering framework built upon Gaussian Opacity Fields (GOF). Unlike projection-based rasterization, Spherical-GOF performs GOF ray sampling directly on the unit sphere in spherical ray space, enabling consistent ray-Gaussian interactions for panoramic rendering. To make the spherical ray casting efficient and robust, we derive a conservative spherical bounding rule for fast ray-Gaussian culling and introduce a spherical filtering scheme that adapts Gaussian footprints to distortion-varying panoramic pixel sampling. Extensive experiments on standard panoramic benchmarks (OmniBlender and OmniPhotos) demonstrate competitive photometric quality and substantially improved geometric consistency. Compared with the strongest baseline, Spherical-GOF reduces depth reprojection error by 57% and improves cycle inlier ratio by 21%. Qualitative results show cleaner depth and more coherent normal maps, with strong robustness to global panorama rotations. We further validate generalization on OmniRob, a real-world robotic omnidirectional dataset introduced in this work, featuring UAV and quadruped platforms. The source code and the OmniRob dataset will be released at https://github.com/1170632760/Spherical-GOF.[348] Beyond Hungarian: Match-Free Supervision for End-to-End Object Detection
Shoumeng Qiu,Xinrun Li,Yang Long
Main category: cs.CV
TL;DR: 本文提出了一种无需匈牙利匹配的DETR训练新范式,通过跨注意力机制实现查询与真值的隐式、可微对应学习,显著提升训练效率并提升性能。
Details
Motivation: 现有DETR依赖匈牙利算法进行查询与真值间的二分匹配,带来计算开销大、训练动态复杂等问题。 Method: 提出跨注意力查询选择(CAQS)模块,利用真值编码信息通过跨注意力机制对解码器查询进行软探查,并以加权误差最小化驱动模型自主学习隐式对应关系,进而为查询学习提供监督信号。 Result: 所提方法绕过传统匹配流程,训练效率显著提升,匹配延迟降低超50%,消除了离散匹配瓶颈,并在性能上超越当前SOTA方法。 Conclusion: 匹配自由的可微对应学习是提升DETR训练效率与性能的有效路径,CAQS模块为端到端检测提供了更简洁、高效的新训练范式。 Abstract: Recent DEtection TRansformer (DETR) based frameworks have achieved remarkable success in end-to-end object detection. However, the reliance on the Hungarian algorithm for bipartite matching between queries and ground truths introduces computational overhead and complicates the training dynamics. In this paper, we propose a novel matching-free training scheme for DETR-based detectors that eliminates the need for explicit heuristic matching. At the core of our approach is a dedicated Cross-Attention-based Query Selection (CAQS) module. Instead of discrete assignment, we utilize encoded ground-truth information to probe the decoder queries through a cross-attention mechanism. By minimizing the weighted error between the queried results and the ground truths, the model autonomously learns the implicit correspondences between object queries and specific targets. This learned relationship further provides supervision signals for the learning of queries. Experimental results demonstrate that our proposed method bypasses the traditional matching process, significantly enhancing training efficiency, reducing the matching latency by over 50\%, effectively eliminating the discrete matching bottleneck through differentiable correspondence learning, and also achieving superior performance compared to existing state-of-the-art methods.[349] OccTrack360: 4D Panoptic Occupancy Tracking from Surround-View Fisheye Cameras
Yongzhi Lin,Kai Luo,Yuanfan Zheng,Hao Shi,Mengfei Duan,Yang Liu,Kailun Yang
Main category: cs.CV
TL;DR: 本文提出了OccTrack360——首个面向环视鱼眼相机的4D全景占用跟踪基准,并设计了FoSOcc框架以解决鱼眼图像在球面投影畸变和体素空间定位不准两大挑战,显著提升了几何规则类别的跟踪质量。
Details
Motivation: 现有4D全景占用跟踪研究受限于缺乏支持环视鱼眼感知、长时序及实例级体素跟踪的基准;需构建更真实、更具挑战性的新基准并建立适配鱼眼特性的强基线方法。 Method: 提出OccTrack360基准(含长序列、全向遮挡掩码与MEI鱼眼视场掩码);并设计FoSOcc框架,包含中心聚焦模块(CFM)提升实例感知定位,及球面提升模块(SLM)在统一投影模型下扩展透视提升至鱼眼成像。 Result: 在Occ3D-Waymo和OccTrack360上实验表明,该方法在几何规则类别上的占用跟踪质量显著提升,建立了鱼眼4D占用跟踪的强基线。 Conclusion: OccTrack360填补了鱼眼4D全景占用跟踪基准空白,FoSOcc为该方向提供了有效技术路径,推动动态3D环境理解在机器人与自动驾驶中的实用化发展。 Abstract: Understanding dynamic 3D environments in a spatially continuous and temporally consistent manner is fundamental for robotics and autonomous driving. While recent advances in occupancy prediction provide a unified representation of scene geometry and semantics, progress in 4D panoptic occupancy tracking remains limited by the lack of benchmarks that support surround-view fisheye sensing, long temporal sequences, and instance-level voxel tracking. To address this gap, we present OccTrack360, a new benchmark for 4D panoptic occupancy tracking from surround-view fisheye cameras. OccTrack360 provides substantially longer and more diverse sequences (174~2234 frames) than prior benchmarks, together with principled voxel visibility annotations, including an all-direction occlusion mask and an MEI-based fisheye field-of-view mask. To establish a strong fisheye-oriented baseline, we further propose Focus on Sphere Occ (FoSOcc), a framework that addresses two core challenges in fisheye occupancy tracking: distorted spherical projection and inaccurate voxel-space localization. FoSOcc includes a Center Focusing Module (CFM) to enhance instance-aware spatial localization through supervised focus guidance, and a Spherical Lift Module (SLM) that extends perspective lifting to fisheye imaging under the Unified Projection Model. Extensive experiments on Occ3D-Waymo and OccTrack360 show that our method improves occupancy tracking quality with notable gains on geometrically regular categories, and establishes a strong baseline for future research on surround-view fisheye 4D occupancy tracking. The benchmark and source code will be made publicly available at https://github.com/YouthZest-Lin/OccTrack360.[350] BuildMamba: A Visual State-Space Based Model for Multi-Task Building Segmentation and Height Estimation from Satellite Images
Sinan U. Ulu,A. Enes Doruk,I. Can Yagmur,Bahadir K. Gunturk,Oguz Hanoglu,Hasan F. Ates
Main category: cs.CV
TL;DR: 本文提出BuildMamba,一种基于视觉状态空间模型的多任务框架,用于单视角卫星图像的建筑物分割与高度估计,显著提升精度与效率。
Details
Motivation: 现有方法在建筑物边界分割和高层建筑高度估计上存在边界模糊和系统性低估问题,且全局建模计算成本高,需更强结构耦合与计算效率。 Method: 提出BuildMamba框架,包含三个模块:Mamba注意力模块(动态空间重校准)、空间感知Mamba-FPN(门控状态空间扫描实现多尺度特征聚合)、掩码感知高度细化模块(利用语义先验抑制高度伪影)。 Result: 在DFC23等三个基准上达到新SOTA:IoU达0.93,高度估计RMSE为1.77米,较SOTA提升0.82米;仿真验证其在大规模3D城市重建中具备更优鲁棒性与可扩展性。 Conclusion: BuildMamba通过引入线性时间复杂度的全局建模机制与多任务协同设计,有效解决了单视图卫星图像建筑物解析中的关键挑战,为城市分析提供了高效高精度新范式。 Abstract: Accurate building segmentation and height estimation from single-view RGB satellite imagery are fundamental for urban analytics, yet remain ill-posed due to structural variability and the high computational cost of global context modeling. While current approaches typically adapt monocular depth architectures, they often suffer from boundary bleeding and systematic underestimation of high-rise structures. To address these limitations, we propose BuildMamba, a unified multi-task framework designed to exploit the linear-time global modeling of visual state-space models. Motivated by the need for stronger structural coupling and computational efficiency, we introduce three modules: a Mamba Attention Module for dynamic spatial recalibration, a Spatial-Aware Mamba-FPN for multi-scale feature aggregation via gated state-space scans, and a Mask-Aware Height Refinement module using semantic priors to suppress height artifacts. Extensive experiments demonstrate that BuildMamba establishes a new performance upper bound across three benchmarks. Specifically, it achieves an IoU of 0.93 and RMSE of 1.77~m on DFC23 benchmark, surpassing state-of-the-art by 0.82~m in height estimation. Simulation results confirm the model's superior robustness and scalability for large-scale 3D urban reconstruction.[351] SecAgent: Efficient Mobile GUI Agent with Semantic Context
Yiping Xie,Song Chen,Jingxuan Xing,Wei Jiang,Zekun Zhu,Yingyao Wang,Pi Bu,Jun Song,Yuning Jiang,Bo Zheng
Main category: cs.CV
TL;DR: 本文提出SecAgent,一个3B规模的高效移动GUI代理,通过构建高质量中文移动GUI数据集和语义上下文机制,解决了多语言数据稀缺与历史表示低效的问题,在导航任务上表现媲美7B-8B模型。
Details
Motivation: 现有移动GUI代理面临高质量多语言数据(尤其是非英语)稀缺和历史表示方法低效两大挑战。 Method: 构建了含18k定位样本和121k导航步骤的中文移动GUI数据集及多选动作标注的中文导航基准;提出语义上下文机制,将历史截图与操作压缩为自然语言摘要;采用监督微调与强化微调训练SecAgent。 Result: SecAgent在自建及公开导航基准上超越同规模基线,性能接近7B-8B模型;数据集、基准、模型与代码将开源。 Conclusion: SecAgent验证了小规模模型通过高质量数据与高效历史建模可在多语言移动GUI自动化中实现高性能,推动该领域研究发展。 Abstract: Mobile Graphical User Interface (GUI) agents powered by multimodal large language models have demonstrated promising capabilities in automating complex smartphone tasks. However, existing approaches face two critical limitations: the scarcity of high-quality multilingual datasets, particularly for non-English ecosystems, and inefficient history representation methods. To address these challenges, we present SecAgent, an efficient mobile GUI agent at 3B scale. We first construct a human-verified Chinese mobile GUI dataset with 18k grounding samples and 121k navigation steps across 44 applications, along with a Chinese navigation benchmark featuring multi-choice action annotations. Building upon this dataset, we propose a semantic context mechanism that distills history screenshots and actions into concise, natural language summaries, significantly reducing computational costs while preserving task-relevant information. Through supervised and reinforcement fine-tuning, SecAgent outperforms similar-scale baselines and achieves performance comparable to 7B-8B models on our and public navigation benchmarks. We will open-source the training dataset, benchmark, model, and code to advance research in multilingual mobile GUI automation.[352] SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
Chao Wang,Zijin Yang,Yaofei Wang,Yuang Qi,Weiming Zhang,Nenghai Yu,Kejiang Chen
Main category: cs.CV
TL;DR: 本文提出SWIFT方法,首次定义了'少样本无训练生成视频归因'任务,通过利用视频的时间特性进行像素帧到潜在帧的映射,并结合正常与损坏重建的损失差异作为归因信号,在多个SOTA视频生成模型上实现了超90%的平均归因准确率。
Details
Motivation: 现有视频归因方法需额外操作或训练源归因模型,可能损害视频质量或需要大量训练样本;而生成视频滥用问题日益严重,亟需高效、高质量的溯源技术。 Method: 提出SWIFT方法,基于视频块内'多像素帧→单潜在帧'的时间映射关系,采用固定长度滑动窗口进行正常与 corrupted 两种重建,以二者重建损失差异作为归因信号,无需训练且支持少样本甚至零样本归因。 Result: 在五个SOTA视频生成模型上评估,SWIFT平均归因准确率超90%(仅用20个视频样本),并实现HunyuanVideo、EasyAnimate和Wan2.2的零样本归因。 Conclusion: SWIFT是一种无需训练、紧耦合视频时间特性的新型生成视频归因方法,兼顾高精度、低开销与泛化能力,为防范AI生成内容滥用提供了有效技术路径。 Abstract: Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the "Pixel Frames(many) to Latent Frame(one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between two reconstructions is then used as an attribution signal. We conducted an extensive evaluation of five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at https://github.com/wangchao0708/SWIFT.[353] PCFEx: Point Cloud Feature Extraction for Graph Neural Networks
Abdullah Al Masud,Shi Xintong,Mondher Bouazizi,Ohtsuki Tomoaki
Main category: cs.CV
TL;DR: 本文提出了一种结合图神经网络(GNN)与新型点云特征提取(PCFEx)技术的方法,用于毫米波雷达点云数据的人体姿态估计(HPE)和人体活动识别(HAR),在多个公开数据集上显著超越现有方法。
Details
Motivation: 传统方法在处理毫米波雷达点云数据进行HPE和HAR时精度有限,而GNN在图结构数据建模中表现出色,但尚未被充分应用于该类点云任务。 Method: 将点云建模为图结构,设计多层级(点、边、图级)的点云特征提取(PCFEx)技术,并构建专用GNN架构来高效处理这些特征。 Result: 在三个HPE基准上误差显著降低,在一个mmWave HAR数据集上达到98.8%的准确率,全面优于现有SOTA模型。 Conclusion: 融合多层级特征提取与GNN建模能显著提升毫米波点云处理精度,验证了该方法在HPE和HAR任务中的有效性与潜力。 Abstract: Graph neural networks (GNNs) have gained significant attention for their effectiveness across various domains. This study focuses on applying GNN to process 3D point cloud data for human pose estimation (HPE) and human activity recognition (HAR). We propose novel point cloud feature extraction (PCFEx) techniques to capture meaningful information at the point, edge, and graph levels of the point cloud by considering point cloud as a graph. Moreover, we introduce a GNN architecture designed to efficiently process these features. Our approach is evaluated on four most popular publicly available millimeter wave radar datasets, three for HPE and one for HAR. The results show substantial improvements, with significantly reduced errors in all three HPE benchmarks, and an overall accuracy of 98.8% in mmWave-based HAR, outperforming the existing state of the art models. This work demonstrates the great potential of feature extraction incorporated with GNN modeling approach to enhance the precision of point cloud processing.[354] mmGAT: Pose Estimation by Graph Attention with Mutual Features from mmWave Radar Point Cloud
Abdullah Al Masud,Shi Xintong,Mondher Bouazizi,Ohtsuki Tomoaki
Main category: cs.CV
TL;DR: 本文提出了一种基于毫米波雷达与图神经网络(GNN)结合注意力机制的新型人体姿态估计方法mmGAT,在隐私保护和低光环境下表现优异,显著降低了MPJPE和PA-MPJPE误差,刷新了该领域多项SOTA指标。
Details
Motivation: 图像方法在隐私保护和低光照环境下存在不足,需探索更鲁棒、隐私友好的替代方案。 Method: 利用毫米波雷达获取点云数据,设计基于图神经网络(GNN)并融合注意力机制的模型mmGAT,并提出一种充分利用GNN能力的特征提取技术。 Result: 在两个公开毫米波基准数据集上达到新SOTA,MPJPE降低35.6%,PA-MPJPE降低14.1%。 Conclusion: 毫米波雷达结合GNN与注意力机制可有效提升姿态估计精度与鲁棒性,兼顾隐私与暗光适应能力,为非视觉感知提供新范式。 Abstract: Pose estimation and human action recognition (HAR) are pivotal technologies spanning various domains. While the image-based pose estimation and HAR are widely admired for their superior performance, they lack in privacy protection and suboptimal performance in low-light and dark environments. This paper exploits the capabilities of millimeter-wave (mmWave) radar technology for human pose estimation by processing radar data with Graph Neural Network (GNN) architecture, coupled with the attention mechanism. Our goal is to capture the finer details of the radar point cloud to improve the pose estimation performance. To this end, we present a unique feature extraction technique that exploits the full potential of the GNN processing method for pose estimation. Our model mmGAT demonstrates remarkable performance on two publicly available benchmark mmWave datasets and establishes new state of the art results in most scenarios in terms of human pose estimation. Our approach achieves a noteworthy reduction of pose estimation mean per joint position error (MPJPE) by 35.6% and PA-MPJPE by 14.1% from the current state of the art benchmark within this domain.[355] BioGait-VLM: A Tri-Modal Vision-Language-Biomechanics Framework for Interpretable Clinical Gait Assessment
Erdong Chen,Yuyang Ji,Jacob K. Greenberg,Benjamin Steel,Faraz Arkam,Abigail Lewis,Pranay Singh,Feng Liu
Main category: cs.CV
TL;DR: BioGait-VLM是一个融合视觉、语言与生物力学的三模态框架,通过时序证据蒸馏和生物力学标记化提升临床步态分析的泛化性与可解释性,并在严格受试者不重叠协议下达到SOTA性能。
Details
Motivation: 视频临床步态分析常因过拟合环境偏差而泛化能力差,难以捕捉病理性运动特征。 Method: 提出BioGait-VLM:包含时序证据蒸馏分支(捕获节律动力学)和生物力学标记化分支(将3D骨架序列映射为语言对齐的语义标记),并构建含DCM队列的增强GAVD数据集及受试者不重叠评估协议。 Result: 在统一8类步态分类任务中达到SOTA识别精度;盲法专家研究表明生物力学标记显著提升临床合理性和证据可追溯性。 Conclusion: BioGait-VLM实现了更鲁棒、可解释、隐私友好的临床步态评估,推动AI辅助诊断向透明化与机制驱动发展。 Abstract: Video-based Clinical Gait Analysis often suffers from poor generalization as models overfit environmental biases instead of capturing pathological motion. To address this, we propose BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework for interpretable clinical gait assessment. Unlike standard video encoders, our architecture incorporates a Temporal Evidence Distillation branch to capture rhythmic dynamics and a Biomechanical Tokenization branch that projects 3D skeleton sequences into language-aligned semantic tokens. This enables the model to explicitly reason about joint mechanics independent of visual shortcuts. To ensure rigorous benchmarking, we augment the public GAVD dataset with a high-fidelity Degenerative Cervical Myelopathy (DCM) cohort to form a unified 8-class taxonomy, establishing a strict subject-disjoint protocol to prevent data leakage. Under this setting, BioGait-VLM achieves state-of-the-art recognition accuracy. Furthermore, a blinded expert study confirms that biomechanical tokens significantly improve clinical plausibility and evidence grounding, offering a path toward transparent, privacy-enhanced gait assessment.[356] Online Sparse Synthetic Aperture Radar Imaging
Conor Flynn,Radoslav Ivanov,Birsen Yazici
Main category: cs.CV
TL;DR: 本文提出了一种在线稀疏编码重建算法Online FISTA,用于在资源受限的无人机平台上实时处理SAR数据,降低内存需求并支持在线自动目标识别(ATR)。
Details
Motivation: 现代国防应用中,低成本自主无人机日益普及,但其计算与内存资源有限,而SAR成像需处理大量数据,传统后处理方法难以满足实时性与轻量化需求。 Method: 提出Online FISTA算法,基于在线稀疏编码,通过递归更新存储矩阵实现增量式场景重建,避免保存全部原始信号数据。 Result: 显著降低内存占用,支持SAR图像的在线重建,并可无缝集成至在线ATR等下游任务,提升系统整体实时性与灵活性。 Conclusion: Online FISTA为资源受限平台提供了高效、轻量、可扩展的SAR实时处理新范式,推动机载智能感知向一体化、在线化发展。 Abstract: With modern defense applications increasingly relying on inexpensive, autonomous drones, lies the major challenge of designing computationally and memory-efficient onboard algorithms to fulfill mission objectives. This challenge is particularly significant in Synthetic Aperture Radar (SAR), where large volumes of data must be collected and processed for downstream tasks. We propose an online reconstruction method, the Online Fast Iterative Shrinkage-Thresholding Algorithm (Online FISTA), which incrementally reconstructs a scene with limited data through sparse coding. Rather than requiring storage of all received signal data, the algorithm recursively updates storage matrices for each iteration, greatly reducing memory demands. Online SAR image reconstruction facilitates more complex downstream tasks, such as Automatic Target Recognition (ATR), in an online manner, resulting in a more versatile and integrated framework compared to existing post-collection reconstruction and ATR approaches.[357] CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
Yucheng Wang,Zedong Wang,Yuetong Wu,Yue Ma,Dan Xu
Main category: cs.CV
TL;DR: 本文提出CARE-Edit,一种基于条件感知专家路由的统一扩散编辑框架,通过轻量级潜空间注意力路由器动态分配计算资源给文本、掩码、参考和基础四个专用专家模块,有效缓解多条件编辑中的模态冲突与任务干扰。
Details
Motivation: 现有统一扩散编辑器(如ControlNet、OmniControl)采用静态融合方式处理多模态条件,导致模态间干扰严重,产生颜色溢出、身份/风格漂移等问题,难以适应异构编辑需求。 Method: 提出Condition-Aware Routing of Experts (CARE-Edit):1)Mask Repaint模块优化用户掩码;2)轻量潜空间注意力路由器依据多模态条件与扩散步长进行top-K稀疏路由,动态选择最相关专家;3)Latent Mixture模块融合各专家输出,实现语义、空间与风格信息的协同集成。 Result: 在擦除、替换、文本驱动编辑和风格迁移等上下文编辑任务上显著优于基线方法;实证分析验证了各专家模块呈现任务特异性行为,证实动态条件感知机制对缓解多条件冲突的有效性。 Conclusion: 动态、条件感知的专家路由机制是提升多条件扩散编辑鲁棒性与精度的关键,CARE-Edit为统一编辑框架设计提供了新范式。 Abstract: Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters which cannot dynamically prioritize or suppress conflicting modalities, thus resulting in artifacts like color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit) that aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts--Text, Mask, Reference, and Base--based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module subsequently fuses expert outputs, coherently integrating semantic, spatial, and stylistic information to the base images. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of specialized experts, showcasing the importance of dynamic, condition-aware processing to mitigate multi-condition conflicts.[358] PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition
Zeyu Ling,Qing Shuai,Teng Zhang,Shiyang Li,Bo Han,Changqing Zou
Main category: cs.CV
TL;DR: 本文提出PRISM模型,通过关节分解的运动潜在空间和无噪声条件注入机制,统一解决文本到动作生成、姿态条件生成及长时序合成等任务,显著提升生成质量并实现多任务统一建模。
Details
Motivation: 现有文本到动作生成方法存在两大问题:一是运动自编码器将每帧压缩为单一整体潜在向量,导致轨迹与关节旋转纠缠;二是不同生成任务(如文本驱动、姿态引导、长序列合成)需各自专用模型或机制,且自回归方法易累积误差。 Method: 提出PRISM:(1) 关节分解的因果VAE潜在空间,每个关节作为独立token构成时间×关节二维网格,并引入运动学监督;(2) 无噪声条件注入,通过时间步嵌入使条件姿态以干净token(t=0)注入,其余token逐步去噪,并结合自强制训练抑制长程漂移。 Result: 在HumanML3D、MotionHub、BABEL等多个基准及50场景用户研究中达到SOTA;单模型统一支持文本到动作、姿态条件生成、自回归序列合成与叙事动作组合。 Conclusion: 潜在空间结构设计是被低估的关键瓶颈;通过解耦关节表征与干净条件注入,可构建统一、鲁棒、高质量的运动生成基础模型。 Abstract: Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts. We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space -- without modifying the generator -- substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.[359] Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations
Jiangye Yuan,Gowri Kumar,Baoyuan Wang
Main category: cs.CV
TL;DR: 本文提出几何参考的3D场景表示(GR3D),通过为图像中物体分配唯一ID并以文本形式编码其3D几何属性,使多模态大语言模型(MLLMs)无需额外训练即可提升3D空间推理能力,在VSI-Bench上显著提升性能。
Details
Motivation: 现有MLLMs在2D视觉理解上表现优异,但在3D空间推理方面能力有限,亟需一种能有效融合3D几何信息与语言模型能力的新表示方法。 Method: 提出GR3D表示法:对输入图像中的物体赋予唯一ID,并将3D几何属性编码为对应ID的文本描述;该方法零样本、无需微调,可即插即用地适配各类MLLMs。 Result: 在VSI-Bench基准上,GPT-5零样本性能整体提升8%,空间布局相关任务提升超11%;定性分析表明其能在极稀疏视角下支持复杂空间推理。 Conclusion: GR3D是一种轻量、通用且高效的方法,成功桥接了MLLMs的语言推理能力与3D几何理解需求,为3D感知与多模态推理提供了新范式。 Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach boosts GPT-5's performance on VSI-Bench by 8% overall and more than 11% on tasks that rely heavily on spatial layout understanding. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.[360] Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation
Hikmat Khan,Wei Chen,Muhammad Khalid Khan Niazi
Main category: cs.CV
TL;DR: 本文提出了一种弱监督师生框架,利用稀疏病理医生标注和指数滑动平均稳定的教师网络生成精细化伪掩码,以提升结直肠癌组织病理图像中腺体结构的分割性能,减少对大量像素级标注的依赖。
Details
Motivation: 现有深度学习方法依赖大量像素级标注,而临床实践中难以获取;CAM类弱监督方法生成的伪掩码不完整,难以覆盖未标注腺体区域。 Method: 提出弱监督师生框架,结合置信度过滤、教师预测与有限真值的自适应融合、课程式渐进优化,利用稀疏病理标注生成高质量伪标签。 Result: 在Gland Segmentation数据集上达到mIoU 80.10、Dice 89.10;跨队列验证显示在TCGA COAD/READ上泛化良好,但在SPIDER上因域偏移性能下降。 Conclusion: 该框架是一种标注高效、泛化性强的结直肠癌腺体分割新方法。 Abstract: Background and objectives: Colorectal cancer histopathological grading depends on accurate segmentation of glandular structures. Current deep learning approaches rely on large scale pixel level annotations that are labor intensive and difficult to obtain in routine clinical practice. Weakly supervised semantic segmentation offers a promising alternative. However, class activation map based methods often produce incomplete pseudo masks that emphasize highly discriminative regions and fail to supervise unannotated glandular structures. We propose a weakly supervised teacher student framework that leverages sparse pathologist annotations and an Exponential Moving Average stabilized teacher network to generate refined pseudo masks. Methods: The framework integrates confidence based filtering, adaptive fusion of teacher predictions with limited ground truth, and curriculum guided refinement to progressively segment unannotated glandular regions. The method was evaluated on an institutional colorectal cancer cohort from The Ohio State University Wexner Medical Center consisting of 60 hematoxylin and eosin stained whole slide images and on public datasets including the Gland Segmentation dataset, TCGA COAD, TCGA READ, and SPIDER. Results: On the Gland Segmentation dataset the framework achieved a mean Intersection over Union of 80.10 and a mean Dice coefficient of 89.10. Cross cohort evaluation demonstrated robust generalization on TCGA COAD and TCGA READ without additional annotations, while reduced performance on SPIDER reflected domain shift. Conclusions: The proposed framework provides an annotation efficient and generalizable approach for gland segmentation in colorectal histopathology.[361] FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection
Anqi Joyce Yang,James Tu,Nikita Dvornik,Enxu Li,Raquel Urtasun
Main category: cs.CV
TL;DR: FOMO-3D 是首个利用视觉基础模型(OWLv2 和 Metric3Dv2)提升长尾3D目标检测性能的多模态3D检测器,通过双阶段检测框架融合激光雷达与相机模态,在真实驾驶数据上显著提升稀有类(如施工人员)的检测效果。
Details
Motivation: 自动驾驶需识别大量语义类别,但许多安全关键目标(如施工人员)在常规驾驶数据中出现极少,导致训练样本严重不足;视觉基础模型可提供强外部先验知识以缓解长尾问题。 Method: 提出FOMO-3D:双阶段多模态3D检测器,第一阶段用LiDAR分支和新型相机分支生成候选框,第二阶段利用OWL图像特征及注意力机制进行精细化优化;融合OWLv2(语义先验)与Metric3Dv2(深度先验)。 Result: 在真实世界驾驶数据上验证,FOMO-3D显著提升了长尾3D检测性能,尤其对罕见类别效果突出。 Conclusion: 结合视觉基础模型的丰富先验与精心设计的多模态融合策略,可有效解决自动驾驶中长尾3D检测难题。 Abstract: In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction worker) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first multi-modal 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and refines them with attention especially to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multi-modal fusion designs leads to large gains for long-tailed 3D detection. Project website is at https://waabi.ai/fomo3d/.[362] StreamReady: Learning What to Answer and When in Long Streaming Videos
Shehreen Azad,Vibhav Vineet,Yogesh Singh Rawat
Main category: cs.CV
TL;DR: 本文提出了一种面向流式视频理解的就绪感知(readiness-aware)建模方法,定义了时间敏感的Answer Readiness Score(ARS)作为评估指标,并构建了StreamReady框架与ProReady-QA基准,显著提升了模型在准确性和作答时机上的联合性能。
Details
Motivation: 流式视频理解需在证据出现的精确时刻作答——过早回答是推测,过晚则丧失实时性;现有方法缺乏对作答时机的显式建模和评估。 Method: 提出Answer Readiness Score(ARS)作为兼顾正确性与时间敏感性的新目标函数;设计StreamReady框架,引入轻量级就绪机制判断是否已观测到足够证据;构建ProReady-QA基准,含证据时间窗标注与主动式多轮问答。 Result: StreamReady在ProReady-QA上性能最优,并在8个其他流式及离线长视频基准上持续超越先前方法,验证其鲁棒性与泛化能力。 Conclusion: 就绪感知建模是提升流式视频理解实用性的关键,ARS与StreamReady为时间敏感视频问答提供了可评估、可优化的新范式。 Abstract: Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence reflects speculation, answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the Answer Readiness Score (ARS), a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce StreamReady, a framework to unify temporal reasoning with on-time answering through a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding. To evaluate this capability, we further introduce ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA, and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.[363] UNBOX: Unveiling Black-box visual models with Natural-language
Simone Carnemolla,Chiara Russo,Simone Palazzo,Quentin Bouniot,Daniela Giordano,Zeynep Akata,Matteo Pennisi,Concetto Spampinato
Main category: cs.CV
TL;DR: 本文提出UNBOX框架,在完全无数据、无梯度、无反向传播的黑盒约束下,利用大语言模型和文生图扩散模型,将激活最大化转化为纯语义搜索,生成可解释的文本描述以揭示模型隐式学习的概念、训练分布及潜在偏差。
Details
Motivation: 现代视觉系统常以黑盒API形式部署,缺乏可解释性、公平性和对分布偏移的鲁棒性,现有解释方法依赖白盒或灰盒访问或训练分布知识,无法适用于真实黑盒场景。 Method: UNBOX利用大语言模型和文本到图像扩散模型,在无数据、无梯度、无反向传播条件下,将激活最大化建模为仅基于输出概率的语义搜索,生成每类对应的可解释文本描述。 Result: 在ImageNet-1K、Waterbirds和CelebA上,通过语义保真度测试、视觉特征相关性分析和切片发现审计验证,UNBOX在最严格的黑盒设定下性能媲美当前最优白盒可解释性方法。 Conclusion: 无需任何模型内部访问,仅凭输出概率即可有效揭示模型内部推理逻辑,为构建可信、可追责的视觉识别系统提供新路径。 Abstract: Ensuring trustworthiness in open-world visual recognition requires models that are interpretable, fair, and robust to distribution shifts. Yet modern vision systems are increasingly deployed as proprietary black-box APIs, exposing only output probabilities and hiding architecture, parameters, gradients, and training data. This opacity prevents meaningful auditing, bias detection, and failure analysis. Existing explanation methods assume white- or gray-box access or knowledge of the training distribution, making them unusable in these real-world settings. We introduce UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints. UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities. The method produces human-interpretable text descriptors that maximally activate each class, revealing the concepts a model has implicitly learned, the training distribution it reflects, and potential sources of bias. We evaluate UNBOX on ImageNet-1K, Waterbirds, and CelebA through semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing. Despite operating under the strictest black-box constraints, UNBOX performs competitively with state-of-the-art white-box interpretability methods. This demonstrates that meaningful insight into a model's internal reasoning can be recovered without any internal access, enabling more trustworthy and accountable visual recognition systems.[364] Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization
Matan Levy,Gavriel Habib,Issar Tzachor,Dvir Samuel,Rami Ben-Ari,Nir Darshan,Or Litany,Dani Lischinski
Main category: cs.CV
TL;DR: 本文提出RAF(检索增强面部)方法,通过在训练时用大型无标签表情库中的最近邻表情特征替换部分主体表情特征,从而增强模板自由可动画头部化身模型的表达鲁棒性与泛化能力,无需额外标注、跨身份配对数据或网络结构修改。
Details
Motivation: 现有模板自由的可动画头部化身模型受限于单个身份的有限表情覆盖,难以应对训练分布外的表情驱动,导致泛化性差。 Method: 构建大规模无标签表情库,在训练过程中对主体的表情特征进行部分替换(使用检索到的近邻表情特征),同时保持原帧重建目标,以增强变形场对多样化表情的适应能力。 Result: 在NeRSemble基准上,RAF在自驱动和跨驱动场景下均一致提升了表情保真度;用户研究表明检索到的邻居在表情与姿态上感知更接近;分析证实该方法有效提升表情多样性并强化身份-表情解耦。 Conclusion: RAF是一种简单有效、即插即用的训练增强策略,显著提升了模板自由头部化身模型对表达分布偏移的鲁棒性与泛化能力。 Abstract: Template-free animatable head avatars can achieve high visual fidelity by learning expression-dependent facial deformation directly from a subject's capture, avoiding parametric face templates and hand-designed blendshape spaces. However, since learned deformation is supervised only by the expressions observed for a single identity, these models suffer from limited expression coverage and often struggle when driven by motions that deviate from the training distribution. We introduce RAF (Retrieval-Augmented Faces), a simple training-time augmentation designed for template-free head avatars that learn deformation from data. RAF constructs a large unlabeled expression bank and, during training, replaces a subset of the subject's expression features with nearest-neighbor expressions retrieved from this bank while still reconstructing the subject's original frames. This exposes the deformation field to a broader range of expression conditions, encouraging stronger identity-expression decoupling and improving robustness to expression distribution shift without requiring paired cross-identity data, additional annotations, or architectural changes. We further analyze how retrieval augmentation increases expression diversity and validate retrieval quality with a user study showing that retrieved neighbors are perceptually closer in expression and pose. Experiments on the NeRSemble benchmark demonstrate that RAF consistently improves expression fidelity over the baseline, in both self-driving and cross-driving scenarios.[365] CAST: Modeling Visual State Transitions for Consistent Video Retrieval
Yanqing Liu,Yingcheng Liu,Fanghong Dong,Budianto Budianto,Cihang Xie,Yan Jiao
Main category: cs.CV
TL;DR: 本文提出了一种面向长视频叙事的上下文感知状态迁移方法CAST,用于解决现有视频检索忽略状态与身份一致性的局限,显著提升跨任务的一致性视频检索性能,并可作为黑盒视频生成结果的重排序信号。
Details
Motivation: 随着视频内容创作转向长篇叙事,将短视频片段组合成连贯故事线变得越来越重要;但现有检索方法在推理时缺乏上下文感知,忽视状态和身份一致性。 Method: 提出了Consistent Video Retrieval(CVR)任务及涵盖YouCook2、COIN和CrossTask的诊断基准;设计了轻量级即插即用适配器CAST,通过基于视觉历史预测状态条件下的残差更新Δ,显式建模潜在状态演化。 Result: CAST在YouCook2和CrossTask上性能提升明显,在COIN上保持竞争力,且在多种冻结视觉语言骨干模型上持续优于零样本基线;同时可有效对Veo等黑盒视频生成候选结果进行重排序,提升时间连贯性。 Conclusion: CAST为长视频一致性检索提供了新范式,兼具通用性、轻量化与实用性,推动视频理解与生成的协同演进。 Abstract: As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($Δ$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.[366] ImprovedGS+: A High-Performance C++/CUDA Re-Implementation Strategy for 3D Gaussian Splatting
Jordi Muñoz Vicente
Main category: cs.CV
TL;DR: ImprovedGS+ 是一种基于 C++/CUDA 的 3D 高斯点绘(3DGS)高效实现,通过硬件优化内核(如 LAS、Laplacian+NMS 重要性评估、指数尺度调度器)显著降低训练延迟与高斯数量,在 Mip-NeRF360 上实现更快训练、更少参数、更高 PSNR 和视觉质量。
Details
Motivation: 在 3D 高斯点绘中平衡重建保真度与计算效率;克服 Python 实现带来的主机-设备同步开销和训练延迟瓶颈。 Method: 将 ImprovedGS 策略底层重写为 LichtFeld-Studio 框架内的 C++/CUDA 原生实现,引入 Long-Axis-Split (LAS) CUDA 内核、定制 Laplacian + NMS 边缘重要性评估内核,以及自适应指数尺度调度器。 Result: 在 Mip-NeRF360 上:1M 预算变体比 MCMC 基线快 26.8%(每轮省 17 分钟)、少用 13.3% 高斯数且视觉质量更优;全量变体 PSNR 比 ADC 基线高 1.28 dB,参数复杂度降低 38.4%。 Conclusion: ImprovedGS+ 是一种可扩展、高速、高质量的 3DGS 解决方案,兼顾速度、质量与易用性,适用于 LichtFeld-Studio 生态系统。 Abstract: Recent advancements in 3D Gaussian Splatting (3DGS) have shifted the focus toward balancing reconstruction fidelity with computational efficiency. In this work, we propose ImprovedGS+, a high-performance, low-level reinvention of the ImprovedGS strategy, implemented natively within the LichtFeld-Studio framework. By transitioning from high-level Python logic to hardware-optimized C++/CUDA kernels, we achieve a significant reduction in host-device synchronization and training latency. Our implementation introduces a Long-Axis-Split (LAS) CUDA kernel, custom Laplacian-based importance kernels with Non-Maximum Suppression (NMS) for edge scores, and an adaptive Exponential Scale Scheduler. Experimental results on the Mip-NeRF360 dataset demonstrate that ImprovedGS+ establishes a new Pareto-optimal front for scene reconstruction. Our 1M-budget variant outperforms the state-of-the-art MCMC baseline by achieving a 26.8% reduction in training time (saving 17 minutes per session) and utilizing 13.3% fewer Gaussians while maintaining superior visual quality. Furthermore, our full variant demonstrates a 1.28 dB PSNR increase over the ADC baseline with a 38.4% reduction in parametric complexity. These results validate ImprovedGS+ as a scalable, high-speed solution that upholds the core pillars of Speed, Quality, and Usability within the LichtFeld-Studio ecosystem.[367] Talking Together: Synthesizing Co-Located 3D Conversations from Audio
Mengyi Shan,Shouchieh Chang,Ziqian Bai,Shichen Liu,Yinda Zhang,Luchuan Song,Rohit Pandey,Sean Fanello,Zeng Huang
Main category: cs.CV
TL;DR: 本文提出了一种双流架构,首次实现从混合音频中生成两个交互人物的完整3D面部动画,并建模其动态3D空间关系(如相对位置、朝向和对视),支持文本控制头部姿态,显著提升VR/远程呈现中的真实感与交互连贯性。
Details
Motivation: 现有方法多生成脱离身体的‘说话头’,缺乏对真实面对面对话中动态3D空间关系(如相对位置、朝向、相互凝视)的建模,难以支撑沉浸式VR和远程呈现应用。 Method: 提出双流架构,每一流对应一个参与者;引入说话人角色嵌入和跨说话人交叉注意力机制以解耦混合音频并建模互动;设计新型眼动损失函数促进自然互视;构建含200万组配对的大规模野外对话数据集。 Result: 生成流畅、可控、空间感知的双人3D面部动画,在感知真实感与交互一致性上显著优于现有基线。 Conclusion: 该工作首次实现了从混合音频驱动的空间感知、可文本控制的双人3D面部动画生成,为沉浸式人机交互提供了新范式。 Abstract: We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied "talking heads" akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship -- including relative position, orientation, and mutual gaze -- that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture where each stream is responsible for one participant's output. We employ speaker's role embeddings and inter-speaker cross-attention mechanisms designed to disentangle the mixed audio and model the interaction. Furthermore, we introduce a novel eye gaze loss to promote natural, mutual eye contact. To power our data-hungry approach, we introduce a novel pipeline to curate a large-scale conversational dataset consisting of over 2 million dyadic pairs from in-the-wild videos. Our method generates fluid, controllable, and spatially aware dyadic animations suitable for immersive applications in VR and telepresence, significantly outperforming existing baselines in perceived realism and interaction coherence.[368] ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation
Nanjun Li,Pinqi Cheng,Zean Liu,Minghe Tian,Xuanyin Wang
Main category: cs.CV
TL;DR: 本文提出了一种关键点驱动的单阶段多人姿态估计新范式(ER-Pose),摒弃边界框预测,重新设计预测头与动态样本分配策略,并引入OKS平滑损失,显著提升精度与效率。
Details
Motivation: 现有YOLO类单阶段姿态估计方法沿用目标检测的框驱动范式,导致样本分配偏差、特征表示不匹配及任务错位,限制精度。 Method: 提出关键点驱动学习范式:移除边界框预测;重构预测头以适配高维结构化姿态表征;设计关键点驱动的动态样本分配策略;引入基于OKS的平滑回归损失。 Result: 在MS COCO和CrowdPose上,ER-Pose-n相比YOLO-Pose基线,在无预训练下AP提升3.2/6.7,有预训练下提升7.4/4.9,且参数更少、推理更快。 Conclusion: 关键点驱动范式能有效缓解任务错位问题,提升单阶段姿态估计的精度与效率,为实时多人姿态估计提供了新思路。 Abstract: Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance degradation. To address this issue, we propose a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective. Specifically, we remove bounding-box prediction and redesign the prediction head to better accommodate the high-dimensional structured representations for pose estimation. We further introduce a keypoint-driven dynamic sample assignment strategy to align training objectives with pose evaluation metrics, enabling dense supervision during training and efficient NMS-free inference. In addition, we propose a smooth OKS-based loss function to stabilize optimization in regression-based pose estimation. Based on these designs, we develop a single-stage multi-person pose estimation framework, termed ER-Pose. On MS COCO and CrowdPose, ER-Pose-n achieves AP improvements of 3.2/6.7 without pre-training and 7.4/4.9 with pre-training respectively compared with the baseline YOLO-Pose. These improvements are achieved with fewer parameters and higher inference efficiency.[369] HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising
Kai Zou,Dian Zheng,Hongbo Liu,Tiankai Hang,Bin Liu,Nenghai Yu
Main category: cs.CV
TL;DR: 本文提出HiAR,一种分层去噪框架,通过在相同噪声水平下对视频块进行因果生成,解决了自回归扩散模型中因误差累积导致的时序连续性与质量退化问题,并引入前向KL正则化以保持运动多样性。
Details
Motivation: 现有自回归扩散方法在生成长视频时,依赖高去噪上下文来保证时序连续性,但会以高置信度传播预测误差,加剧质量退化;作者认为高度干净的上下文并非必要。 Method: 受双向扩散模型启发,提出在与当前块相同噪声水平的上下文上进行条件建模;设计HiAR分层去噪框架,改变传统逐块完成顺序,改为每步对所有块同步进行因果去噪;引入前向KL正则化于双向注意力模式中,抑制自滚动蒸馏引发的低运动偏差。 Result: HiAR在VBench(20秒生成)上取得最佳总体得分和最低时序漂移;4步设置下实现1.8倍实时时钟加速;支持流水线并行推理。 Conclusion: 在相同噪声水平下建模上下文依赖可兼顾时序一致性与误差鲁棒性;HiAR为长视频生成提供了一种高效、稳定且运动多样的新范式。 Abstract: Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts; yet, this practice propagates prediction errors with high certainty, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8 wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20s generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.[370] FVG-PT: Adaptive Foreground View-Guided Prompt Tuning for Vision-Language Models
Haoyang Li,Liang Wang,Siyu Zhou,Jiacheng Sun,Jing Jiang,Chao Wang,Guodong Long,Yan Peng
Main category: cs.CV
TL;DR: 本文提出FVG-PT方法,通过引导视觉编码器的前景注意力来提升CLIP-based提示调优性能,包含前景可靠性门、前景蒸馏补偿和先验校准三个模块。
Details
Motivation: 现有基于CLIP的提示调优方法忽视了VLM内部注意力表示在调优过程中的变化,尤其 foreground attention shift 导致预测失败。 Method: 提出Foreground View-Guided Prompt Tuning(FVG-PT),包括:1)可学习的Foreground Reliability Gate以增强前景视图质量;2)Foreground Distillation Compensation模块引导视觉注意力聚焦前景;3)Prior Calibration模块缓解过度关注前景导致的泛化下降。 Result: 在多个骨干模型和数据集上的实验验证了FVG-PT的有效性和兼容性。 Conclusion: FVG-PT是一种即插即用、自适应的前景注意力引导模块,能有效缓解提示调优中因前景注意力偏移引发的性能下降问题。 Abstract: CLIP-based prompt tuning enables pretrained Vision-Language Models (VLMs) to efficiently adapt to downstream tasks. Although existing studies have made significant progress, they pay limited attention to changes in the internal attention representations of VLMs during the tuning process. In this paper, we attribute the failure modes of prompt tuning predictions to shifts in foreground attention of the visual encoder, and propose Foreground View-Guided Prompt Tuning (FVG-PT), an adaptive plug-and-play foreground attention guidance module, to alleviate the shifts. Concretely, FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance the foreground view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and further introduces a Prior Calibration module to mitigate generalization degradation caused by excessive focus on the foreground. Experiments on multiple backbone models and datasets show the effectiveness and compatibility of FVG-PT. Codes are available at: https://github.com/JREion/FVG-PT[371] Scale Space Diffusion
Soumik Mukhopadhyay,Prateksha Udhayanan,Abhinav Shrivastava
Main category: cs.CV
TL;DR: 本文提出Scale Space Diffusion,将尺度空间理论融入扩散模型,通过广义线性退化(如下采样)降低高噪声状态的计算冗余,并设计Flexi-UNet实现按需分辨率处理,在CelebA和ImageNet上验证了其高效性与可扩展性。