Table of Contents
cs.CL
[1] Incentives or Ontology? A Structural Rebuttal to OpenAI's Hallucination Thesis
Richard Ackermann,Simeon Emanuilov
Main category: cs.CL
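TL;DR: This paper argues that hallucination in large language models is not an optimization failure but an inevitable product of the Transformer architecture: because the model captures only statistical associations among tokens rather than real-world structure, it necessarily generates fictional content at the boundaries of its knowledge. Experiments show that hallucination can be eliminated only through external validation and abstention mechanisms, so reliable AI requires hybrid systems that distinguish linguistic fluency from epistemic responsibility.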
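The external validation and abstention idea in this entry can be illustrated with a minimal sketch. The paper's Licensing Oracle is not specified here, so everything below is a hypothetical stand-in: `FACTS` plays the role of an external truth source, `toy_generator` plays the role of a language model, and `licensed_answer` is an illustrative name, not the authors' API.

```python
from typing import Callable, Optional

# Hypothetical external fact store (assumption: stands in for a truth-validation source).
FACTS = {("Paris", "capital_of"): "France"}


def licensed_answer(
    subject: str,
    relation: str,
    generate: Callable[[str, str], str],
) -> Optional[str]:
    """Return the generated answer only if the external store confirms it.

    Returns None (abstains) when the store has no entry or disagrees.
    """
    candidate = generate(subject, relation)
    verified = FACTS.get((subject, relation))
    if verified is None or verified != candidate:
        return None  # abstain instead of emitting an unverified claim
    return candidate


def toy_generator(subject: str, relation: str) -> str:
    # Stand-in for a language model: it always produces a fluent guess.
    return {"Paris": "France"}.get(subject, "Atlantis")


print(licensed_answer("Paris", "capital_of", toy_generator))
print(licensed_answer("Zubrowka", "capital_of", toy_generator))
```

The generator always answers; only the gate decides whether the answer is released, which is the separation of fluency from verification that the entry describes.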
Details
Motivation: To rebut OpenAI's view that hallucination stems from misaligned incentives, and to argue that hallucination is a structural problem of the Transformer architecture itself rather than a contingent behavior that reward mechanisms or training strategies can fix.
Method: Building on prior work on structural hallucination, the authors run empirical experiments with a Licensing Oracle, testing hallucination behavior and abstention ability under different conditions.
Result: Adjusting incentives, prompting, or fine-tuning alone cannot eliminate hallucination, whereas adding external truth validation and an abstention mechanism (such as the Licensing Oracle) achieves perfect abstention precision across domains.
Conclusion: Hallucination is a structural property of generative architectures; truly reliable AI must use hybrid systems that separate language generation from external knowledge verification in order to take on epistemic responsibility.
Abstract: OpenAI has recently argued that hallucinations in large language models result primarily from misaligned evaluation incentives that reward confident guessing rather than epistemic humility. On this view, hallucination is a contingent behavioral artifact, remediable through improved benchmarks and reward structures. In this paper, we challenge that interpretation. Drawing on previous work on structural hallucination and empirical experiments using a Licensing Oracle, we argue that hallucination is not an optimization failure but an architectural inevitability of the transformer model. Transformers do not represent the world; they model statistical associations among tokens. Their embedding spaces form a pseudo-ontology derived from linguistic co-occurrence rather than world-referential structure. At ontological boundary conditions - regions where training data is sparse or incoherent - the model necessarily interpolates fictional continuations in order to preserve coherence. No incentive mechanism can modify this structural dependence on pattern completion. Our empirical results demonstrate that hallucination can only be eliminated through external truth-validation and abstention modules, not through changes to incentives, prompting, or fine-tuning. The Licensing Oracle achieves perfect abstention precision across domains precisely because it supplies grounding that the transformer lacks. We conclude that hallucination is a structural property of generative architectures and that reliable AI requires hybrid systems that distinguish linguistic fluency from epistemic responsibility.
[2] T5Gemma 2: Seeing, Reading, and Understanding Longer
Biao Zhang,Paul Suganthan,Gaël Liu,Ilya Philippov,Sahil Dua,Ben Hora,Kat Black,Gus Martins,Omar Sanseviero,Shreya Pathak,Cassidy Hardin,Francesco Visin,Jiageng Zhang,Kathleen Kenealy,Qin Yin,Olivier Lacombe,Armand Joulin,Tris Warkentin,Adam Roberts
Main category: cs.CL
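TL;DR: T5Gemma 2 is a family of lightweight open encoder-decoder models based on Gemma 3 that supports multilingual, multimodal, and long-context modeling. It uses the UL2-based recipe to adapt a decoder-only model into an encoder-decoder model, proposes two efficiency methods (tied embeddings and merged attention), and matches or exceeds its Gemma 3 counterpart in pretraining while clearly outperforming it after post-training.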
Details
Motivation: To improve lightweight models on multilingual, multimodal, and long-context tasks, and to explore an efficient way of converting decoder-only models into encoder-decoder architectures.
Method: Following the UL2-based adaptation strategy, a pretrained decoder-only model is converted into an encoder-decoder model and extended to the multimodal setting; tied word embeddings and merged attention are introduced to improve efficiency.
Result: T5Gemma 2 demonstrates the generality of the adaptation strategy across architectures and modalities, shows a distinctive strength in long-context modeling, and yields comparable or better pretraining performance and significantly better post-training performance than its Gemma 3 counterpart.
Conclusion: T5Gemma 2 inherits and extends the T5Gemma design, providing efficient, general, research-friendly open lightweight multimodal models.
Abstract: We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, featuring strong multilingual, multimodal and long-context capabilities. T5Gemma 2 follows the adaptation recipe (via UL2) in T5Gemma -- adapting a pretrained decoder-only model into an encoder-decoder model, and extends it from text-only regime to multimodal based on the Gemma 3 models. We further propose two methods to improve the efficiency: tied word embedding that shares all embeddings across encoder and decoder, and merged attention that unifies decoder self- and cross-attention into a single joint module. Experiments demonstrate the generality of the adaptation strategy over architectures and modalities as well as the unique strength of the encoder-decoder architecture on long context modeling. Similar to T5Gemma, T5Gemma 2 yields comparable or better pretraining performance and significantly improved post-training performance than its Gemma 3 counterpart. We release the pretrained models (270M-270M, 1B-1B and 4B-4B) to the community for future research.
[3] Integrating Large Language Models and Knowledge Graphs to Capture Political Viewpoints in News Media
Massimiliano Fadda,Enrico Motta,Francesco Osborne,Diego Reforgiato Recupero,Angelo Salatino
Main category: cs.CL
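TL;DR: This paper improves a pipeline for news viewpoint classification by fine-tuning large language models and incorporating semantic information from Wikidata, raising the accuracy of viewpoint classification on the UK immigration debate.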
Details
Motivation: To understand more accurately the diversity of viewpoints that the media present in public debate, and to assess the balance and fairness of news coverage.
Method: Fine-tune large language models (LLMs) for viewpoint classification and enrich claim representations with semantic descriptions of relevant actors from Wikidata, combining the two to improve classification performance.
Result: On a benchmark centred on the UK immigration debate, each mechanism independently improves classification, and their combination performs best, particularly with LLMs capable of processing long inputs.
Conclusion: Integrating fine-tuned LLMs with external semantic knowledge substantially improves news viewpoint classification and supports a more comprehensive analysis of the ideological distribution in media discourse.
Abstract: News sources play a central role in democratic societies by shaping political and social discourse through specific topics, viewpoints and voices. Understanding these dynamics is essential for assessing whether the media landscape offers a balanced and fair account of public debate. In earlier work, we introduced a pipeline that, given a news corpus, i) uses a hybrid human-machine approach to identify the range of viewpoints expressed about a given topic, and ii) classifies relevant claims with respect to the identified viewpoints, defined as sets of semantically and ideologically congruent claims (e.g., positions arguing that immigration positively impacts the UK economy). In this paper, we improve this pipeline by i) fine-tuning Large Language Models (LLMs) for viewpoint classification and ii) enriching claim representations with semantic descriptions of relevant actors drawn from Wikidata. We evaluate our approach against alternative solutions on a benchmark centred on the UK immigration debate. Results show that while both mechanisms independently improve classification performance, their integration yields the best results, particularly when using LLMs capable of processing long inputs.
[4] DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline
Houman Kazemzadeh,Kiarash Mokhtari Dizaji,Seyed Reza Tavakoli,Farbod Davoodi,MohammadReza KarimiNejad,Parham Abed Azad,Ali Sabzi,Armin Khosravi,Siavash Ahmadi,Mohammad Hossein Rohban,Glolamali Aminian,Tahereh Javaheri
Main category: cs.CL
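TL;DR: This study evaluates large language models on pharmacy licensure-style question answering and proposes DrugRAG, an external knowledge integration method that raises model accuracy by 7 to 21 percentage points through retrieval-augmented generation.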
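The gains in this entry are stated in percentage points, which is easy to confuse with relative improvement. The short sketch below recomputes both from the two before/after figures quoted in the abstract (Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%); the function names are illustrative only.

```python
def percentage_point_gain(before: float, after: float) -> float:
    """Absolute difference between two accuracies given in percent."""
    return after - before


def relative_gain_percent(before: float, after: float) -> float:
    """Improvement relative to the baseline accuracy, in percent."""
    return (after - before) / before * 100.0


# Baseline and DrugRAG accuracies quoted in the abstract (percent).
reported = {
    "Gemma 3 27B": (61.0, 71.0),
    "Llama 3.1 8B": (46.0, 67.0),
}

gains = {name: percentage_point_gain(b, a) for name, (b, a) in reported.items()}

for name, (before, after) in reported.items():
    rel = relative_gain_percent(before, after)
    print(f"{name}: +{gains[name]:.0f} pp ({rel:.1f}% relative)")
```

Both examples fall inside the reported 7 to 21 percentage-point range; the relative improvement for the weaker baseline is considerably larger than its percentage-point gain.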
Details
Motivation: To improve LLM accuracy on specialized pharmacy question answering, in particular by drawing on authoritative external knowledge sources without modifying the models themselves.
Method: Eleven LLMs of varying size are benchmarked on a 141-question pharmacy dataset; a three-step retrieval-augmented generation pipeline, DrugRAG, retrieves structured drug knowledge from validated sources and adds it to the prompt.
Result: Baseline accuracy ranged from 46% to 92%, with GPT-5 and o3 performing best; DrugRAG raised the accuracy of every tested model by 7 to 21 percentage points.
Conclusion: DrugRAG effectively improves LLM performance on pharmacy tasks through external structured drug knowledge, without modifying the models, and offers a practical path toward evidence-based pharmacy AI applications.
Abstract: Objectives: To evaluate large language model (LLM) performance on pharmacy licensure-style question-answering (QA) tasks and develop an external knowledge integration method to improve their accuracy. Methods: We benchmarked eleven existing LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset. We measured baseline accuracy for each model without modification. We then developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context. This pipeline operates externally to the models, requiring no changes to model architecture or parameters. Results: Baseline accuracy ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores. Models with fewer than 8 billion parameters scored below 50%. DrugRAG improved accuracy across all tested models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%) on the 141-item benchmark. Conclusion: We demonstrate that external structured drug knowledge integration through DrugRAG measurably improves LLM accuracy on pharmacy tasks without modifying the underlying models. This approach provides a practical pipeline for enhancing pharmacy-focused AI applications with evidence-based information.
[5] Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models
Caner Erden
Main category: cs.CL
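TL;DR: This paper proposes Multiscale Aggregated Hierarchical Attention (MAHA), which addresses the quadratic computational complexity of multi-head self-attention on long-sequence tasks through hierarchical decomposition and a mathematically rigorous aggregation mechanism.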
Details
Motivation: Existing sparse or linearized attention mechanisms reduce computational cost but often sacrifice the modeling of global dependencies or the ability to capture multiscale semantic granularity, so a new architecture is needed that preserves expressiveness while improving efficiency.
Method: MAHA dynamically partitions the input sequence into several hierarchical scales through learnable downsampling operators and computes attention at each scale. Its core innovation is to model the fusion of attention matrices across scales as a resource allocation problem, solved by convex optimization or by a game-theoretic approach based on Nash equilibrium, which yields a theoretically optimal balance between local detail and global context. The method is integrated into a hybrid dilated-convolutional Transformer and uses differentiable optimization layers to support end-to-end training.
Result: At a sequence length of 4096, MAHA reduces FLOPs by 81% compared with standard attention while maintaining or even improving model performance, showing stronger scalability.
Conclusion: MAHA combines optimization theory with sequence modeling and offers an efficient, scalable attention architecture for next-generation large language models.
Abstract: The quadratic computational complexity of Multi-Head Self-Attention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for long-context tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multiscale semantic granularity effectively. In this paper, we propose Multiscale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scale-specific attention matrices as a resource allocation problem, solved via a convex optimization framework or a Nash equilibrium-based game-theoretic approach. This ensures a theoretically optimal balance between local nuance and global context fidelity. Implemented within a hybrid dilated-convolutional transformer backbone, MAHA utilizes differentiable optimization layers to enable end-to-end training. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention. This work bridges the gap between optimization theory and sequence modeling, offering a scalable solution for next-generation LLMs.
[6] Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models
George-Andrei Dima,Dumitru-Clementin Cercel
Main category: cs.CL
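TL;DR: By translating the Flickr30k dataset and using open-source large language models to extend it to Romanian visual question answering, this paper builds multimodal datasets for a low-resource language and fine-tunes several mainstream VLMs with LoRA, significantly improving their performance and fluency on Romanian visual question answering and image description generation.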
Details
Motivation: To narrow the multimodal NLP resource gap for low-resource languages such as Romanian and to help democratize generative AI.
Method: The Flickr30k dataset is translated into Romanian and extended for visual question answering with open-source LLMs; VLMs from the LLaMA 3.2, LLaVA 1.6, and Qwen2 families are then fine-tuned with the parameter-efficient LoRA method.
Result: The seven-billion-parameter Qwen2-VL-RoVQA improves BERTScore F1 by +6.05% on Romanian visual question answering and by +2.61% on image description generation, with substantially fewer grammatical errors.
Conclusion: The datasets and fine-tuning approach effectively strengthen the understanding and generation abilities of open-source vision-language models in a low-resource language, most notably in fluency and cross-task generalization.
Abstract: Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
[7] Cross-Tokenizer Likelihood Scoring Algorithms for Language Model Distillation
Buu Phan,Ashish Khisti,Karen Ullrich
Main category: cs.CL
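TL;DR: This paper proposes a cross-tokenizer likelihood scoring framework based on the recursive structure of the BPE algorithm. It solves the problem of computing likelihood ratios when the teacher and student language models use different vocabularies, supports both exact computation and efficient approximation, and achieves a smaller memory footprint and better performance in knowledge distillation.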
Details
Motivation: When teacher and student language models use different tokenizers (for example, because edge-device deployment calls for a smaller vocabulary), next-token likelihood ratios cannot be computed directly, which blocks tasks such as knowledge distillation. This paper aims to solve that vocabulary mismatch problem.
Method: The implicit recursive structure of Byte-Pair Encoding (BPE) is used to build a probabilistic framework for evaluating sequence likelihoods across different vocabularies. Likelihoods are computed exactly when the student vocabulary is a subset of the teacher vocabulary; for the general case, the authors design a lossless procedure and a fast approximation.
Result: The method reduces the memory footprint of the Qwen2.5-1.5B model by up to 12% and improves on the baseline by up to 4% on the evaluated tasks; in distillation for mathematical reasoning, GSM8K accuracy exceeds the best existing method by more than 2%.
Conclusion: The method effectively solves likelihood scoring between language models with different tokenizers, balances efficiency and accuracy, and makes knowledge distillation markedly more practical and effective in resource-constrained settings.
Abstract: Computing next-token likelihood ratios between two language models (LMs) is a standard task in training paradigms such as knowledge distillation. Since this requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. In this work, we address this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only O(1) model evaluations per token. When used for distillation, this yields up to a 12% reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance up to 4% on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages BPE recursive structure, complemented by a fast approximation that keeps large-vocabulary settings practical. Applied to distillation for mathematical reasoning, our approach improves GSM8K accuracy by more than 2% over the current state of the art.
[8] Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams
Yiming Cui,Xin Yao,Yuxuan Qin,Xin Li,Shijin Wang,Guoping Hu
Main category: cs.CL
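TL;DR: This study systematically evaluates 40 multimodal large language models on Olympiad chemistry problems, reveals key weaknesses of current models in vision-language fusion and scientific reasoning, and proposes strategies such as Chain-of-Thought prompting to improve performance.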
Details
Motivation: Solving chemistry problems relies on symbolic diagrams, molecular structures, and visual data, yet existing large language models struggle with multimodal scientific reasoning, especially with modality fusion.
Method: The authors build a multimodal chemistry reasoning benchmark from U.S. National Chemistry Olympiad exam questions, evaluate 40 proprietary and open-source multimodal LLMs on it, and use ablation studies and occlusion-based interpretability analysis to examine how Chain-of-Thought prompting affects accuracy and visual grounding.
Result: Many models struggle with modality fusion, and in some cases removing the image actually improves accuracy; Chain-of-Thought prompting consistently improves both accuracy and visual grounding.
Conclusion: Current multimodal LLMs remain clearly limited on scientific reasoning tasks and need better vision-language alignment and domain-specific reasoning; the work provides a timely benchmark for measuring progress in specialized multimodal AI.
Abstract: Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.
[9] DASH: Dialogue-Aware Similarity and Handshake Recognition for Topic Segmentation in Public-Channel Conversations
Sijin Sun,Liangbin Zhao,Ming Deng,Xiuju Fu
Main category: cs.CL
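TL;DR: This paper proposes DASH-DTS, an LLM-based framework for dialogue topic segmentation that combines dialogue handshake recognition, contextual enhancement, and positive and negative sample generation to segment informal communications such as maritime VHF dialogues with high accuracy, and releases VHF-Dial, the first public dataset of real-world VHF dialogues.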
Details
Motivation: Traditional methods are limited when handling public-channel dialogues with informal speech and implicit transitions (such as maritime VHF communications) and struggle to detect topic shifts.
Method: The DASH-DTS framework (1) detects topic shifts through dialogue handshake recognition, (2) enhances context through similarity-guided example selection, and (3) generates selective positive and negative samples to improve model discrimination and robustness.
Result: The framework reaches several state-of-the-art results in trusted segmentation accuracy on both the VHF-Dial dataset and standard benchmarks, and provides interpretable reasoning and confidence scores.
Conclusion: DASH-DTS lays a solid foundation for stable monitoring and decision support in operational dialogues and advances research on dialogue understanding in real-world settings.
Abstract: Dialogue Topic Segmentation (DTS) is crucial for understanding task-oriented public-channel communications, such as maritime VHF dialogues, which feature informal speech and implicit transitions. To address the limitations of traditional methods, we propose DASH-DTS, a novel LLM-based framework. Its core contributions are: (1) topic shift detection via dialogue handshake recognition; (2) contextual enhancement through similarity-guided example selection; and (3) the generation of selective positive and negative samples to improve model discrimination and robustness. Additionally, we release VHF-Dial, the first public dataset of real-world maritime VHF communications, to advance research in this domain. DASH-DTS provides interpretable reasoning and confidence scores for each segment. Experimental results demonstrate that our framework achieves several state-of-the-art results in trusted segmentation accuracy on both VHF-Dial and standard benchmarks, establishing a strong foundation for stable monitoring and decision support in operational dialogues.
[10] SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification
Hongbo Wang,MaungMaung AprilPyone,Isao Echizen
Main category: cs.CL
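TL;DR: This paper proposes SGM, a white-box, neuron-level multimodal intervention that mitigates toxicity in multimodal large language models, sharply reducing the rate of harmful generations while preserving model performance.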
Details
Motivation: Multimodal large language models (MLLMs) inherit toxic, biased, and inappropriate content from weakly curated pretraining data, which creates safety risks, especially under adversarial triggers that existing detoxification methods struggle to handle.
Method: SGM uses expertise-weighted soft suppression to selectively recalibrate toxic neurons at the neuron level, providing a safety intervention without any parameter updates; the authors also build MM-TOXIC-QA, a framework for evaluating multimodal toxicity.
Result: SGM cuts the harmful generation rate from 48.2% to 2.5%, is effective under both standard and adversarial conditions, and preserves fluency and multimodal reasoning; SGM*, which combines SGM with existing methods, improves safety further.
Conclusion: SGM is an extensible, interpretable, low-cost detoxification solution for controlling toxicity in multimodal generation and has practical potential.
Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
[11] The Meta-Prompting Protocol: Orchestrating LLMs via Adversarial Feedback Loops
Fanzhe Fu
Main category: cs.CL
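TL;DR: This paper proposes the Meta-Prompting Protocol, a theoretical framework that formalizes interaction with large language models as a programmable, self-optimizing system built on an adversarial triad of Generator, Auditor, and Optimizer, with the aim of deterministic semantic computation.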
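The Generator, Auditor, and Optimizer loop in this entry can be sketched with three toy functions. This is only an illustration of the control flow under strong simplifying assumptions: the functions are deterministic stubs rather than LLM calls, all names are hypothetical, and neither DSPy nor TextGrad (the tools the paper names) is used here.

```python
from typing import Optional, Tuple


def generate(prompt: str) -> str:
    # Toy generator: it only cites sources when the prompt asks for them.
    return "answer with sources" if "cite sources" in prompt else "answer"


def audit(output: str) -> Optional[str]:
    # Toy auditor: returns a textual critique, or None when satisfied.
    return None if "sources" in output else "The answer does not cite sources."


def optimize(prompt: str, critique: str) -> str:
    # Toy optimizer: uses the critique as a "textual gradient" to revise the prompt.
    if "cite sources" in critique:
        return prompt + " Always cite sources."
    return prompt


def run_loop(prompt: str, max_rounds: int = 3) -> Tuple[str, str]:
    """Iterate generate -> audit -> optimize until the auditor accepts."""
    for _ in range(max_rounds):
        output = generate(prompt)
        critique = audit(output)
        if critique is None:
            return prompt, output
        prompt = optimize(prompt, critique)
    return prompt, generate(prompt)


final_prompt, final_output = run_loop("Answer the question.")
print(final_prompt)
print(final_output)
```

The point of the sketch is that the prompt, not the model, is the variable being updated, and that the update signal is a piece of text produced by the auditor.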
Details
Motivation: Current heuristic prompt engineering cannot provide the deterministic guarantees that mission-critical applications require, so a more rigorous framework is needed to restructure how large language models are used.
Method: The Meta-Prompting Protocol adopts a three-part structure of Generator (P), Auditor (A), and Optimizer (O), treats natural language instructions as differentiable variables in a semantic computation graph, and uses textual critiques as gradients for optimization.
Result: The approach is shown to be theoretically viable using DSPy and TextGrad, and is argued to mitigate hallucination and prevent model collapse.
Conclusion: The framework lays a foundation for observable software engineering in the era of probabilistic computing and supports the shift of large language models from stochastic chat interfaces to reliable software components.
Abstract: The transition of Large Language Models (LLMs) from stochastic chat interfaces to reliable software components necessitates a fundamental re-engineering of interaction paradigms. Current methodologies, predominantly heuristic-based "prompt engineering," fail to provide the deterministic guarantees required for mission-critical applications. We introduce the Meta-Prompting Protocol, a rigorous theoretical framework that formalizes the orchestration of LLMs as a programmable, self-optimizing system. Central to this protocol is the Adversarial Trinity, a tripartite topology comprising a Generator (P), an Auditor (A), and an Optimizer (O). By treating natural language instructions as differentiable variables within a semantic computation graph and utilizing textual critiques as gradients, this architecture mitigates hallucination and prevents model collapse. We demonstrate the theoretical viability of this approach using declarative programming paradigms (DSPy) and automatic textual differentiation (TextGrad), establishing a foundation for "Observable Software Engineering" in the era of probabilistic computing.
[12] Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning
Weiqin Wang,Yile Wang,Kehao Chen,Hui Huang
Main category: cs.CL
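TL;DR: This paper proposes SCOPE, a framework that combines model confidence with dynamic subgroup partitioning to improve large language models on reasoning tasks, overcoming the confirmation bias and sparse rewards caused by conventional majority voting.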
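The difference between plain majority voting and confidence-weighted pseudo-labeling can be seen in a small sketch. It is a deliberate simplification of what this entry describes: each sampled answer carries a single made-up confidence score (the paper's confidence is step-wise), and subgroup partitioning is omitted. Function names and data are illustrative assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def majority_vote(answers: List[str]) -> str:
    """Pseudo-label = most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers: List[str], confidences: List[float]) -> str:
    """Pseudo-label = answer with the largest summed confidence."""
    scores: Dict[str, float] = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        scores[answer] += confidence
    return max(scores, key=scores.get)


# Five sampled answers to one question, with hypothetical confidence scores.
answers = ["12", "12", "12", "15", "15"]
confidences = [0.2, 0.3, 0.2, 0.9, 0.8]

print(majority_vote(answers))                          # frequent but low-confidence answer
print(confidence_weighted_vote(answers, confidences))  # rarer but high-confidence answer
```

With these numbers the two rules disagree, which is the situation where frequency alone would reinforce a popular but weakly supported answer.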
Details
Motivation: Existing test-time reinforcement learning relies on majority voting to generate pseudo-labels, which tends to induce confirmation bias and suffers from sparse rewards, limiting gains in reasoning ability.
Method: SCOPE introduces step-wise confidence weighting and dynamically partitions the candidate outputs into several subgroups; repeated sampling yields a local consensus for each subgroup, giving each one a different supervision target.
Result: SCOPE outperforms existing baselines across multiple models and benchmarks, with relative improvements of 13.1% on the challenging AIME 2025 and 8.1% on AMC.
Conclusion: SCOPE effectively mitigates confirmation bias and sparse rewards, and improves LLM reasoning through the selection of high-quality reasoning paths and more diverse exploration.
Abstract: Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving reasoning ability of large language models (LLMs). However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting the overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework integrating model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE integrates the proposed step-wise confidence into pseudo-label deduction, prioritizing high-quality reasoning paths over simple frequency count. Furthermore, it dynamically partitions the candidate outputs pool into independent subgroups by balancing reasoning quality against exploration diversity. By deriving local consensus via repeat sampling for each subgroup, SCOPE provides diverse supervision targets to encourage broader exploration. We conduct experiments across various models and benchmarks; the results show that SCOPE consistently outperforms recent baselines. Notably, SCOPE achieves relative improvements of 13.1% on challenging AIME 2025 and 8.1% on AMC. The code is released at https://github.com/szu-tera/SCOPE.
[13] Rakuten Data Release: A Large-Scale and Long-Term Reviews Corpus for Hotel Domain
Yuki Nakayama,Koki Hikichi,Yun Ching Liu,Yu Hirate
Main category: cs.CL
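TL;DR: This paper presents a large-scale corpus of Rakuten Travel reviews containing 7.3 million customer reviews with metadata from 2009 to 2024, and analyzes the factors driving data drift between 2019 and 2024.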
Details
Motivation: To build a long-term, large-scale, multi-faceted corpus of travel reviews that supports the analysis and study of user-generated content.
Method: 7.3 million reviews from the Rakuten Travel platform are collected and organized, covering review text, accommodation responses, metadata, and ratings, and statistical methods are used to analyze how the data change over time.
Result: The paper provides detailed statistics spanning 16 years and identifies key factors behind data drift between 2019 and 2024.
Conclusion: The corpus is a valuable resource for studying user behavior, sentiment change, and data drift, with broad research and application value.
Abstract: This paper presents a large-scale corpus of Rakuten Travel Reviews. Our collection contains 7.3 million customer reviews for 16 years, ranging from 2009 to 2024. Each record in the dataset contains the review text, its response from an accommodation, an anonymized reviewer ID, review date, accommodation ID, plan ID, plan title, room type, room name, purpose, accompanying group, and user ratings from different aspect categories, as well as an overall score. We present statistical information about our corpus and provide insights into factors driving data drift between 2019 and 2024 using statistical approaches.
[14] MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
Xuanjun Zong,Zhiqi Shen,Lei Wang,Yunshi Lan,Chao Yang
Main category: cs.CL
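TL;DR: This paper presents MCP-SafetyBench, a comprehensive benchmark built on real MCP servers for evaluating the safety risks of large language models in multi-turn, multi-domain, multi-server settings, and shows how fragile existing models are in complex interactions.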
Details
Motivation: Existing safety benchmarks fail to capture the new safety risks introduced by the Model Context Protocol (MCP), especially those arising from its openness and multi-server workflows, so a more realistic evaluation tool is needed.
Method: MCP-SafetyBench covers five real-world domains (browser automation, financial analysis, location navigation, repository management, and web search), defines a unified taxonomy of 20 attack types, and supports multi-turn evaluation that requires multi-step reasoning and cross-server coordination.
Result: A systematic evaluation of leading open- and closed-source LLMs reveals large differences in safety performance, with vulnerabilities growing more severe as task horizons and server interactions increase.
Conclusion: The study highlights the urgency of strengthening defenses in MCP systems and establishes MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
Abstract: Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP's openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
[15] From NLG Evaluation to Modern Student Assessment in the Era of ChatGPT: The Great Misalignment Problem and Pedagogical Multi-Factor Assessment (P-MFA)
Mika Hämäläinen,Kimmo Leiviskä
Main category: cs.CL
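TL;DR: This paper explores the epistemic parallel between natural language generation (NLG) evaluation and the grading of university students, argues that both face a "Great Misalignment Problem", and proposes the process-based Pedagogical Multi-Factor Assessment (P-MFA) model in response.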
Details
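Motivation: Because students widely use tools such as ChatGPT to produce high-quality content, traditional assessment centred on the final product is no longer valid, and a new assessment framework that attends to the learning process is urgently needed.
Method: The Pedagogical Multi-Factor Assessment (P-MFA) model borrows the logic of multi-factor authentication and applies a process-oriented, multi-evidence approach to assessment.
Result: P-MFA can assess students' actual learning process more effectively, strengthens the reliability and validity of assessment, and offers lessons for NLG evaluation.
Conclusion: Assessment should shift its emphasis from outcomes to process; P-MFA offers a transferable solution for both educational assessment and NLG evaluation.
Abstract: This paper explores the growing epistemic parallel between NLG evaluation and grading of students in a Finnish University. We argue that both domains are experiencing a Great Misalignment Problem. As students increasingly use tools like ChatGPT to produce sophisticated outputs, traditional assessment methods that focus on final products rather than learning processes have lost their validity. To address this, we introduce the Pedagogical Multi-Factor Assessment (P-MFA) model, a process-based, multi-evidence framework inspired by the logic of multi-factor authentication.
[16] RFKG-CoT: Relation-Driven Adaptive Hop-count Selection and Few-Shot Path Guidance for Knowledge-Aware QA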
Chao Zhang,Minghan Li,Tianrui Lv,Guodong Zhou
Main category: cs.CL
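TL;DR: This paper proposes RFKG-CoT, which uses relation-driven adaptive hop-count selection and a few-shot path guidance mechanism to improve the accuracy and reasoning reliability of large language models in knowledge graph question answering.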
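The few-shot path guidance in this entry builds examples in a "question-paths-answer" format. The exact template used by the authors is not given here, so the sketch below assumes a simple plain-text layout; the field labels, the arrow notation for paths, and the toy entities (Alice, Bob, Carol, Dave) are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Path = Sequence[str]  # e.g. ("Alice", "brother", "Bob")


def format_example(question: str, paths: List[Path], answer: Optional[str] = None) -> str:
    """Render one block in question-paths-answer order; leave the answer open if None."""
    lines = [f"Question: {question}", "Paths:"]
    lines += [f"- {' -> '.join(path)}" for path in paths]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)


def build_prompt(
    examples: List[Tuple[str, List[Path], str]],
    question: str,
    paths: List[Path],
) -> str:
    """Few-shot prompt: solved examples first, then the new question with its KG paths."""
    shots = [format_example(q, p, a) for q, p, a in examples]
    return "\n\n".join(shots + [format_example(question, paths)])


examples = [
    ("Who is Alice's brother?", [("Alice", "brother", "Bob")], "Bob"),
]
prompt = build_prompt(
    examples,
    "Who is Carol's grandfather?",
    [("Carol", "father", "Dave", "father", "Erin")],
)
print(prompt)
```

The one-hop example and the two-hop query mirror the entry's distinction between direct relations and chained relations; choosing how many hops to retrieve is a separate step that this sketch does not cover.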
Details
Motivation: Existing methods for knowledge graph question answering suffer from rigid hop-count selection and underuse of reasoning paths, which leads large language models to hallucinate.
Method: The authors design a relation-driven adaptive hop-count selector that dynamically adjusts the number of reasoning steps through a relation mask, and introduce a few-shot in-context learning path guidance mechanism that uses Chain-of-Thought prompts in a "question-paths-answer" format.
Result: On four KGQA benchmarks, RFKG-CoT improves on KG-CoT by up to 14.7 percentage points (Llama2-7B on WebQSP); ablations show that the two modules are complementary and effective.
Conclusion: Through dynamic hop-count selection and path guidance, RFKG-CoT substantially improves the ability of large language models to reason over knowledge graphs, reduces hallucination, and produces more faithful answers.
Abstract: Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it replaces the rigid hop-count selector with a relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct "brother" relations, 2-hop for indirect "father-son" chains), formalized via a relation mask. Second, it introduces a few-shot in-context learning path guidance mechanism with CoT (think) that constructs examples in a "question-paths-answer" format to enhance LLMs' ability to understand reasoning paths. Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 pp (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and the path prompt are complementary, jointly transforming KG evidence into more faithful answers.
[17] Yes-MT's Submission to the Low-Resource Indic Language Translation Shared Task in WMT 2024
Yash Bhaskar,Parameswari Krishnamurthy
Main category: cs.CL
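TL;DR: This paper describes the Yes-MT team's systems for the WMT 2024 Low-Resource Indic Language Translation shared task, exploring several approaches to translation between English and Assamese, Mizo, Khasi, and Manipuri.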
Details
Motivation: To address machine translation for low-resource Indic languages and improve translation quality for these languages with existing models.
Method: Several approaches are explored: fine-tuning pre-trained models such as mT5 and IndicBart, LoRA fine-tuning of IndicTrans2 and Llama 3, zero-shot and few-shot prompting of LLMs (such as Llama 3 and Mixtral), and training Transformer models from scratch.
Result: Evaluation on the WMT23 test data with SacreBLEU and CHRF shows that fine-tuned large language models perform comparatively well on low-resource translation but still face challenges caused by data scarcity.
Conclusion: Although low-resource translation remains difficult, combining LoRA fine-tuning with large language models shows considerable promise and, with limited data, outperforms traditional models.
Abstract: This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024 (Pakray et al., 2024), focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 (Xue et al., 2020) and IndicBart (Dabre et al., 2021) in both multilingual and monolingual settings, LoRA (Hu et al., 2021) fine-tuning IndicTrans2 (Gala et al., 2023), zero-shot and few-shot prompting (Brown, 2020) with large language models (LLMs) like Llama 3 (Dubey et al., 2024) and Mixtral 8x7b (Jiang et al., 2024), LoRA supervised fine-tuning of Llama 3 (Mecklenburg et al., 2024), and training Transformer models (Vaswani, 2017) from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task test data using SacreBLEU (Post, 2018) and CHRF (Popovic, 2015), highlighting the challenges of low-resource translation and the potential of LLMs for these tasks, particularly with fine-tuning.
[18] FAME: Fictional Actors for Multilingual Erasure
Claudio Savelli,Moreno La Quatra,Alkis Koudounas,Flavio Giobergia
Main category: cs.CL
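TL;DR: FAME is a synthetic benchmark for evaluating unlearning techniques in multilingual large language models. It contains 1,000 fictional actor biographies and 20,000 question-answer pairs, supports both entity-level and instance-level unlearning, and covers five languages: English, French, German, Italian, and Spanish.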
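The two unlearning scenarios in this entry differ only in how the forget set is chosen. The sketch below illustrates that difference on placeholder data; it assumes one question-answer pair per (actor, topic) combination, which matches the reported totals (1,000 × 20 = 20,000) but is not confirmed by the source, and it does not reproduce FAME's actual splits or file format.

```python
from typing import List, Set, Tuple

QAPair = Tuple[int, int]  # (actor_id, topic_id); placeholder for a real QA record

NUM_ACTORS = 1000
TOPICS_PER_ACTOR = 20

# Assumption: exactly one QA pair per actor and topic.
qa_pairs: List[QAPair] = [
    (actor, topic) for actor in range(NUM_ACTORS) for topic in range(TOPICS_PER_ACTOR)
]


def entity_level_split(pairs: List[QAPair], forget_actors: Set[int]):
    """Forget everything about the selected identities; retain all other actors."""
    forget = [p for p in pairs if p[0] in forget_actors]
    retain = [p for p in pairs if p[0] not in forget_actors]
    return forget, retain


def instance_level_split(pairs: List[QAPair], forget_items: Set[QAPair]):
    """Forget specific facts only; other facts about the same actor are retained."""
    forget = [p for p in pairs if p in forget_items]
    retain = [p for p in pairs if p not in forget_items]
    return forget, retain


entity_forget, entity_retain = entity_level_split(qa_pairs, {0, 1})
instance_forget, instance_retain = instance_level_split(qa_pairs, {(0, 3), (0, 7)})
print(len(qa_pairs), len(entity_forget), len(instance_forget))
```

In the entity-level case no record about a forgotten actor survives in the retain set, whereas in the instance-level case the same actor appears on both sides, which is what makes that scenario a finer-grained test.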
Details
Motivation: Existing benchmarks for evaluating machine unlearning are limited to English and support only entity-level forgetting, so there is no way to evaluate multilingual or fine-grained unlearning.
Method: The authors build FAME, a synthetic benchmark of fictional actor data in five languages, with 20 structured topics and two dataset splits that support the different unlearning scenarios.
Result: FAME contains 1,000 fictional biographies and 20,000 question-answer pairs and supports evaluation of entity-level and instance-level unlearning across five languages.
Conclusion: FAME provides a controlled, multilingual, fine-grained evaluation platform for machine unlearning methods and helps advance research on privacy protection and the right to be forgotten.
Abstract: LLMs trained on web-scale data raise concerns about privacy and the right to be forgotten. To address these issues, Machine Unlearning provides techniques to remove specific information from trained models without retraining from scratch. However, existing benchmarks for evaluating unlearning in LLMs face two major limitations: they focus only on English and support only entity-level forgetting (removing all information about a person). We introduce FAME (Fictional Actors for Multilingual Erasure), a synthetic benchmark for evaluating Machine Unlearning across five languages: English, French, German, Italian, and Spanish. FAME contains 1,000 fictional actor biographies and 20,000 question-answer pairs. Each biography includes information on 20 topics organized into structured categories (biography, career, achievements, personal information). This design enables both entity-level unlearning (i.e., forgetting entire identities) and instance-level unlearning (i.e., forgetting specific facts while retaining others). We provide two dataset splits to support these two different unlearning scenarios and enable systematic comparison of unlearning techniques across languages. Since FAME uses entirely fictional data, it ensures that the information was never encountered during model pretraining, allowing for a controlled evaluation of unlearning methods.
[19] The Moralization Corpus: Frame-Based Annotation and Analysis of Moralizing Speech Acts across Diverse Text Genres
Maria Becker,Mirko Sommer,Lars Tapken,Yi Wan Teh,Bruno Brocai
Main category: cs.CL
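TL;DR: This paper introduces the Moralization Corpus, a multi-genre dataset for analyzing the strategic use of moral values in argumentative discourse, together with a frame-based annotation scheme that captures the core elements of moralizations. It evaluates several large language models under different prompting conditions and finds that detailed prompt instructions are more effective than few-shot or explanation-based prompting, while moralization detection remains highly subjective and context-dependent.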
Details
Motivation: Moralization, a form of persuasion that invokes moral values to justify claims, is underexplored; it is often implicit and pragmatically complex, which challenges both manual and automatic analysis.
Method: A frame-based annotation scheme marks the core elements (moral values, demands, and discourse protagonists) and is applied to German texts covering political debates, news articles, and online discussions. The resulting multi-genre Moralization Corpus is used to evaluate large language models on moralization detection and component extraction under different prompting conditions, with human annotations as the comparison.
Result: Detailed prompt instructions improve model performance the most, more than few-shot or explanation-based prompting. Overall, the task remains highly subjective and context-sensitive, and both human and model performance are limited.
Conclusion: Moralization is a complex, context-dependent argumentative phenomenon whose study needs fine-grained annotation combined with suitable prompting strategies. The authors release data, guidelines, and code to support interdisciplinary research on NLP and moral discourse.
Abstract: Moralizations - arguments that invoke moral values to justify demands or positions - are an as yet underexplored form of persuasive communication. We present the Moralization Corpus, a novel multi-genre dataset designed to analyze how moral values are strategically used in argumentative discourse. Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems. We develop a frame-based annotation scheme that captures the constitutive elements of moralizations - moral values, demands, and discourse protagonists - and apply it to a diverse set of German texts, including political debates, news articles, and online discussions. The corpus enables fine-grained analysis of moralizing language across communicative formats and domains. We further evaluate several large language models (LLMs) under varied prompting conditions for the task of moralization detection and moralization component extraction and compare their outputs to human annotations in order to investigate the challenges of automatic and manual analysis of moralizations. Results show that detailed prompt instructions have a greater effect than few-shot or explanation-based prompting, and that moralization remains a highly subjective and context-sensitive task. We release all data, annotation guidelines, and code to foster future interdisciplinary research on moral discourse and moral reasoning in NLP.
[20] SynGP500: A Clinically-Grounded Synthetic Dataset of Australian General Practice Medical Notes
Piyawoot Songsiritat
Main category: cs.CL
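TL;DR: SynGP500 is a clinician-curated dataset of 500 synthetic Australian general practice medical notes, designed to reflect real clinical complexity and to support training of more generalizable clinical NLP models.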
Details
Motivation: Existing datasets have restricted case distributions and miss the uncommon but important conditions required by general practice training; the dataset addresses this while protecting patient privacy.
Method: Combine the RACGP 2022 Curriculum with epidemiological data from the BEACH study to generate synthetic medical notes with clinical breadth and realistic prevalence, and validate their quality along several dimensions.
Result: SynGP500 is consistent with real data in epidemiological distribution, stylistic variability, and semantic diversity, and shows F1 improvements on a self-supervised medical concept extraction task.
Conclusion: SynGP500 provides a high-quality, privacy-safe resource for developing and evaluating clinical NLP methods for Australian general practice.
Abstract: We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically-calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less-common curriculum-specified conditions that GPs must recognize but that appear infrequently in single practice populations, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. SynGP500 is messy by design, reflecting the authentic complexity of healthcare delivery: telegraphic documentation, typos, patient non-adherence, socioeconomic barriers, and clinician-patient disagreements, unlike sanitized synthetic datasets that obscure clinical realities. Multi-faceted validation demonstrates dataset quality through epidemiological alignment with real Australian GP consultation patterns (BEACH study), stylometric analysis confirming high linguistic variation, semantic diversity analysis demonstrating broad coverage, and exploratory downstream evaluation using self-supervised medical concept extraction, showing F1 improvements. SynGP500 addresses a critical national gap, providing researchers and educators with a resource for developing and evaluating clinical NLP methods for Australian general practice while inherently protecting patient privacy.
[21] Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning
Yiliu Sun,Zicheng Zhao,Yang Wei,Yanfang Zhang,Chen Gong
Main category: cs.CL
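TL;DR: This paper proposes PPPO, a new Reinforcement Learning with Verifiable Rewards (RLVR) method that focuses optimization on the prefix of the outputs generated by large language models in order to improve reasoning.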
Details
Motivation: Existing RLVR methods train uniformly on all generated tokens and overlook the key role of prefix tokens in reasoning, which makes training inefficient.
Method: Drawing on Path Dependence theory, the authors identify a Beginning Lock-in Effect (BLE) and design the PPPO framework, which uses two strategies, Progressive Prefix Retention and Continuation Accumulated Reward, to focus optimization on the prefix reasoning process.
Result: Experiments on a range of reasoning tasks show accuracy improvements of up to 18.02% while using only 26.17% of the training tokens.
Conclusion: By focusing optimization on prefix tokens, PPPO markedly improves training efficiency and reasoning performance, supporting the decisive role of the prefix stage in LLM reasoning.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain the subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed the Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. To improve the learning effectiveness of LLMs on how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence, and accumulating their scores as the reward signal.
Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with accuracy improvements of 18.02% while using only 26.17% of the training tokens.
[22] Towards Proactive Personalization through Profile Customization for Individual Users in Dialogues
Xiaotian Zhang,Yuan Wang,Ruizhe Chen,Zeya Wang,Runchen Hou,Zuozhu Liu
Main category: cs.CL
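TL;DR: This paper proposes PersonalAgent, a user-centric lifelong agent that continuously infers and adapts to users' changing preferences, addressing the shortcomings of existing large language models in long-term personalization and the cold-start problem.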
Details
Motivation: Existing LLM alignment techniques mainly target universal human values or static single-turn preferences, and struggle with long-term personalization as user preferences change over time, as well as with the cold-start problem.
Method: PersonalAgent decomposes dialogues into single-turn interactions, builds and dynamically updates a unified user profile, and frames preference inference as a sequential decision-making task, enabling continual learning and personalized alignment.
Result: Experiments show that PersonalAgent outperforms prompt-based and policy optimization baselines in both idealized and noisy dialogue settings while maintaining cross-session preference consistency. Human evaluation also confirms that it captures user preferences naturally and coherently.
Conclusion: The study underscores the importance of lifelong personalization for building more inclusive and adaptive dialogue systems and offers an effective path to personalized LLM deployment.
Abstract: The deployment of Large Language Models (LLMs) in interactive systems necessitates a deep alignment with the nuanced and dynamic preferences of individual users. Current alignment techniques predominantly address universal human values or static, single-turn preferences, thereby failing to address the critical needs of long-term personalization and the initial user cold-start problem. To bridge this gap, we propose PersonalAgent, a novel user-centric lifelong agent designed to continuously infer and adapt to user preferences. PersonalAgent constructs and dynamically refines a unified user profile by decomposing dialogues into single-turn interactions, framing preference inference as a sequential decision-making task. Experiments show that PersonalAgent achieves superior performance over strong prompt-based and policy optimization baselines, not only in idealized but also in noisy conversational contexts, while preserving cross-session preference consistency. Furthermore, human evaluation confirms that PersonalAgent excels at capturing user preferences naturally and coherently. Our findings underscore the importance of lifelong personalization for developing more inclusive and adaptive conversational agents. Our code is available here.
[23] Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies
Charan Prakash Rathore,Saumi Ray,Dhruv Kumar
Main category: cs.CL
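TL;DR: This study systematically evaluates prompting strategies for extracting information from zeolite synthesis procedures with large language models (LLMs). LLMs perform well on event classification but only moderately on fine-grained parameter extraction, and advanced prompting strategies bring limited gains, pointing to architectural limitations of current models in scientific information extraction.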
Details
Motivation: Existing work has not systematically evaluated how effective large language models are at information extraction in specific materials science domains such as zeolite synthesis; in particular, the effect of different prompting strategies is unclear.
Method: The study covers four subtasks: event type classification, trigger text identification, argument role extraction, and argument text extraction. Four prompting strategies (zero-shot, few-shot, event-specific, reflection-based) are evaluated on six state-of-the-art LLMs using the ZSEE dataset of 1,530 annotated sentences.
Result: Event type classification reaches 80-90% F1, but argument role and argument text extraction reach only 50-65%. GPT-5-mini shows extreme prompt sensitivity (F1 ranging from 11% to 79%). Advanced prompting strategies give limited gains over zero-shot. Error analysis finds hallucination, over-generalization, and difficulty capturing synthesis details.
Conclusion: LLMs can understand scientific text at a high level but remain limited in precisely extracting experimental parameters, which calls for domain-adapted models. The study provides quantitative benchmarks for scientific information extraction and exposes fundamental limitations of current LLM architectures on specialized tasks.
Abstract: Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances.
Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.
[24] Why Your Academic Field Is Everywhere at Once: A Case Study of Arabic Linguistics
Ayman Eddakrouri,Amani Ramadan
Main category: cs.CL
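TL;DR: This study applies Brookes' Measure of Categorical Dispersion (Δ) to analyze the thematic structure of contemporary Arabic Applied Linguistics research. Based on a real-world dataset of 1,564 publications from 2019 to 2025, it finds that the field is highly heterogeneous (Δ = 0.194), with no clearly dominant theme.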
Details
Motivation: Clarify the correct application of Brookes' formula to the structural analysis of academic fields, and characterize the thematic distribution of Arabic Applied Linguistics research.
Method: Using Brookes' categorical dispersion index (Δ), 1,564 Arabic Applied Linguistics papers from 2019 to 2025 are classified into eight core sub-disciplines and their thematic dispersion is computed.
Result: The computed value is Δ = 0.194, indicating extreme thematic dispersion. Computational Linguistics is dominant but not hegemonic; Sociolinguistics, Language Teaching, and other subfields are also active.
Conclusion: Arabic Applied Linguistics research is markedly heterogeneous and lacks a central tendency. The study confirms the usefulness of Brookes' method and provides a replicable bibliometric methodology for analyzing disciplinary structure across domains.
Abstract: This study applies Brookes' Measure of Categorical Dispersion (Δ) to analyze the thematic structure of contemporary Arabic Applied Linguistics research. Using a comprehensive, real-world dataset of 1,564 publications from 2019 to 2025, classified into eight core sub-disciplines, we calculate a dispersion index of Δ = 0.194. This remarkably low value indicates extreme thematic dispersion, revealing that the field is characterized by pronounced heterogeneity rather than concentration. The analysis identifies Computational Linguistics as a dominant but non-hegemonic force, coexisting with robust research in Sociolinguistics, Language Teaching, and other subfields. This study clarifies the correct application of Brookes' original formula, demonstrates its utility for field characterization, and provides a replicable bibliometric methodology for assessing disciplinary structure across domains.
[25] Adversarial versification in portuguese as a jailbreak operator in LLMs
Joao Queiroz
Main category: cs.CL
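TL;DR: The study finds that rewriting prompts as verse can substantially bypass the safety restrictions of aligned large language models, revealing a deep flaw in current alignment mechanisms, namely their reliance on surface patterns, and identifies the lack of evaluation for languages such as Portuguese as a critical gap.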
Details
Motivation: Explore the safety vulnerabilities of large language models to adversarial prompts in verse, expose the limits of current alignment methods, and stress the importance of multilingual safety evaluation, Portuguese in particular.
Method: Instructions that would otherwise be refused are converted into verse manually and automatically, attack success rates are measured on benchmarks derived from MLCommons AILuminate, and models aligned with different methods (such as RLHF and constitutional AI) are compared.
Result: Verse prompts raise safety failure rates by up to 18 times. Manually written poems reach about 62% success and automated versions 43%, with some models exceeding 90% in single-turn interactions. All tested models show consistent vulnerability to attacks based on verse form.
Conclusion: Alignment mechanisms in current large language models depend too heavily on surface linguistic patterns and cannot handle small changes in semiotic form. Parameterized tests over scansion, metre, and prosodic variation are needed to assess vulnerabilities in many languages, including Portuguese.
Abstract: Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study 'Adversarial poetry as a universal single-turn jailbreak mechanism in large language models' demonstrates that instructions routinely refused in prose become executable when rewritten as verse, producing up to 18 times more safety failures in benchmarks derived from MLCommons AILuminate. Manually written poems reach approximately 62% ASR, and automated versions 43%, with some models surpassing 90% success in single-turn interactions. The effect is structural: systems trained with RLHF, constitutional AI, and hybrid pipelines exhibit consistent degradation under minimal semiotic formal variation. Versification displaces the prompt into sparsely supervised latent regions, revealing guardrails that are excessively dependent on surface patterns. This dissociation between apparent robustness and real vulnerability exposes deep limitations in current alignment regimes. The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, a rich metric-prosodic tradition, and over 250 million speakers, constitutes a critical gap. Experimental protocols must parameterise scansion, metre, and prosodic variation to test vulnerabilities specific to Lusophone patterns, which are currently ignored.
[26] Dual-Density Inference for Efficient Language Model Reasoning
Zhengyi Zhao,Shubo Zhang,Yuxi Zhang,Huimin Wang,Binyang Li,Kam-Fai Wong
Main category: cs.CL
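TL;DR: This paper proposes Denser, a framework that uses different language densities for the reasoning and answering phases to improve the computational efficiency of large language models, substantially reducing token consumption while preserving or improving accuracy.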
Details
Motivation: Existing large language models use a uniform language density in complex reasoning tasks, which is computationally inefficient. The authors observe that reasoning mainly serves the model's own computation while the answer serves human understanding, and therefore propose separating the information density of the two to improve efficiency.
Method: The Denser dual-density inference framework has three components: a query processing module, a high-density compressed reasoning mechanism, and an answer generation component. These handle problem analysis, efficient intermediate computation, and translation of compressed reasoning into human-readable answers, respectively.
Result: Experiments on several reasoning question answering benchmarks show that, compared with standard Chain-of-Thought methods, Denser reduces token consumption by up to 62% while preserving or improving accuracy, with the largest gains on complex multi-step reasoning problems.
Conclusion: By separating the information density of the reasoning and answering phases, Denser improves the reasoning efficiency of large language models and suggests a direction for designing efficient reasoning systems.
Abstract: Large Language Models (LLMs) have shown impressive capabilities in complex reasoning tasks. However, current approaches employ uniform language density for both intermediate reasoning and final answers, leading to computational inefficiency. We observe that the reasoning process serves a computational function for the model itself, while answering serves a communicative function for human understanding. This distinction enables the use of compressed, symbol-rich language for intermediate computations while maintaining human-readable final explanations. To address this inefficiency, we present Denser: Dual-density inference, a novel framework that optimizes information density separately for reasoning and answering phases. Our framework implements this through three components: a query processing module that analyzes input problems, a high-density compressed reasoning mechanism for efficient intermediate computations, and an answer generation component that translates compressed reasoning into human-readable solutions. Experimental evaluation across multiple reasoning question answering benchmarks demonstrates that Denser reduces token consumption by up to 62% compared to standard Chain-of-Thought methods while preserving or improving accuracy. These efficiency gains are particularly significant for complex multi-step reasoning problems where traditional methods generate extensive explanations.
[27] ORACLE: Time-Dependent Recursive Summary Graphs for Foresight on News Data Using LLMs
Lev Kharlashkin,Eiaki Morooka,Yehor Tereshchenko,Mika Hämäläinen
Main category: cs.CL
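TL;DR: ORACLE is a platform built for a Finnish University of Applied Sciences that turns daily news into weekly decision-ready insights. It crawls, filters, and classifies news, builds a Time-Dependent Recursive Summary Graph (TRSG) combined with PESTEL analysis, and presents a curriculum-intelligence use case.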
Details
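Motivation: Help educational institutions keep track of strategically relevant changes in their external environment, improving decision-making and making curriculum design more forward-looking.
Method: The platform automatically crawls and versions news, applies a university-specific relevance filter, embeds the text, and classifies it by PESTEL dimension. It builds a Time-Dependent Recursive Summary Graph (TRSG) with two clustering layers, summarized weekly by a large language model (LLM). A lightweight change detector identifies content that is new, removed, or changed and groups it into themes for analysis.
Result: The system runs stably in production, continuously produces concise, actionable weekly insights, supports practical use cases such as curriculum intelligence, and comes with an evaluation plan.
Conclusion: ORACLE connects news data with educational strategy needs and, through structured summaries and change detection, gives the university an actionable way to monitor its external environment.
Abstract: ORACLE turns daily news into week-over-week, decision-ready insights for one of the Finnish Universities of Applied Sciences. The platform crawls and versions news, applies University-specific relevance filtering, embeds content, classifies items into PESTEL dimensions and builds a concise Time-Dependent Recursive Summary Graph (TRSG): two clustering layers summarized by an LLM and recomputed weekly. A lightweight change detector highlights what is new, removed or changed, then groups differences into themes for PESTEL-aware analysis. We detail the pipeline, discuss concrete design choices that make the system stable in production and present a curriculum-intelligence use case with an evaluation plan.
[28] Toward expert-level motivational interviewing for health behavior improvement with LLMs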
Run-ze Hu,Yang Yang,Yi-hang Yang,Jing-qi Kong,Jia-hui Luo,Wen-yu Yang,Jing Chen,Jing-yao Liu,Hui-qun Zeng,Lei Zhang,Zheng Liu
Main category: cs.CL
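TL;DR: This study fine-tunes large language models (MI-LLMs) to generate motivational interviewing (MI) dialogues, exploring a scalable use of AI in health behavior interventions.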
Details
Motivation: Traditional motivational interviewing depends on highly trained human counselors, which limits its reach, so a more scalable alternative is needed.
Method: Three open-source Chinese-capable models (Baichuan2, ChatGLM-4, Llama-3) are fine-tuned on a dataset of 2,000 Chinese MI-style dialogues generated with GPT-4, and evaluated with automatic metrics and expert manual coding under the MITI 4.2.1 standard.
Result: After fine-tuning, BLEU-4 and ROUGE scores improve substantially. Manual coding finds that the technical and relational global scores and MI-adherent ratios of MI-LLMs approach those of real MI dialogues, although complex reflections and reflection-to-question ratios still fall short.
Conclusion: MI-oriented fine-tuning can give general-purpose models core MI-consistent counseling behaviors and offers a feasible path to AI-assisted health behavior change, but further work is needed on data scale, complex MI skills, and real-world intervention trials.
Abstract: Background: Motivational interviewing (MI) is an effective counseling approach for promoting health behavior change, but its impact is constrained by the need for highly trained human counselors. Objective: This study aimed to explore a scalable alternative by developing and evaluating Large Language Models for Motivational Interviewing (MI-LLMs). Methods: We first curated five Chinese psychological counseling corpora and, using GPT-4 with an MI-informed prompt, transcribed multi-turn dialogues from the two highest-quality datasets (CPsyCounD and PsyDTCorpus) into 2,040 MI-style counseling conversations, of which 2,000 were used for training and 40 for testing. Three Chinese-capable open-source LLMs (Baichuan2-7B-Chat, ChatGLM-4-9B-Chat and Llama-3-8B-Chinese-Chat-v2) were fine-tuned on this corpus and are referred to as MI-LLMs. We evaluated MI-LLMs using round-based automatic metrics and expert manual coding with the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Results: Across all three models, fine-tuning substantially improved BLEU-4 and ROUGE scores compared with the base models, and manual coding showed that MI-LLMs achieved technical and relational global scores and MI-adherent ratios that approached those of real MI dialogues, although complex reflections and reflection-to-question ratios remained less frequent.
Conclusions: These findings provide initial evidence that MI-oriented fine-tuning can endow general-purpose LLMs with core MI-consistent counseling behaviors, suggesting a scalable pathway toward AI-assisted health behavior change support while underscoring the need for further work on data scale, complex MI skills and real-world intervention trials.
[29] When a Nation Speaks: Machine Learning and NLP in People's Sentiment Analysis During Bangladesh's 2024 Mass Uprising
Md. Samiul Alim,Mahir Shahriar Tamim,Maisha Rahman,Tanvir Ahmed Khan,Md Mushfique Anwar
Main category: cs.CL
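TL;DR: This study is the first to analyze public sentiment in Bangla during Bangladesh's 2024 mass uprising. It builds a dataset of 2,028 annotated news headlines, uses LDA to identify themes, and finds that language-specific models outperform multilingual Transformers and traditional machine learning methods at detecting Outrage, Hope, and Despair.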
Details
Motivation: Sentiment analysis has been widely studied for elections and social media trends, but the dynamics of public emotion during civil unrest, especially in Bangla, remain largely unexplored.
Method: 2,028 Bangla news headlines from major Facebook news pages are collected and annotated as Outrage, Hope, or Despair. LDA topic modeling identifies key issues, and models including mBERT, XLM-RoBERTa, SVM, and Logistic Regression are compared on sentiment classification.
Result: Language-specific models perform best on the classification task, clearly ahead of mBERT (67%), XLM-RoBERTa (71%), and SVM and Logistic Regression (both 70%). LDA reveals dominant themes such as political corruption and public protests, and events such as internet blackouts markedly affect sentiment.
Conclusion: Sentiment models built for a specific language are more effective at understanding public emotion during a crisis. The study opens a path for Bangla NLP research on political turmoil and identifies key drivers of sentiment change during major social events.
Abstract: Sentiment analysis, an emerging research area within natural language processing (NLP), has primarily been explored in contexts like elections and social media trends, but there remains a significant gap in understanding emotional dynamics during civil unrest, particularly in the Bangla language. Our study pioneers sentiment analysis in Bangla during a national crisis by examining public emotions amid Bangladesh's 2024 mass uprising. We curated a unique dataset of 2,028 annotated news headlines from major Facebook news portals, classifying them into Outrage, Hope, and Despair. Through Latent Dirichlet Allocation (LDA), we identified prevalent themes like political corruption and public protests, and analyzed how events such as internet blackouts shaped sentiment patterns. Language-specific models outperformed multilingual transformers (mBERT: 67%, XLM-RoBERTa: 71%) and traditional machine learning methods (SVM and Logistic Regression: both 70%). These results highlight the effectiveness of language-specific models and offer valuable insights into public sentiment during political turmoil.
[30] CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing
Kuan Lu,Shuhang Lin,Sai Wu,Yichen Yao,Junhan Yang,Huan Li,Wei Chu,Xu Yinghui,Yuan Qi,Gang Chen
Main category: cs.CL
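TL;DR: This paper proposes CTKVR, a novel centroid-then-token KV retrieval scheme whose two-stage retrieval strategy balances efficiency and accuracy in long-context scenarios, achieving substantial throughput gains on multiple benchmarks with less than 1% accuracy loss.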
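The two-stage idea in this entry (a coarse centroid lookup followed by token-level refinement) can be illustrated with a small, generic sketch. It is not the authors' implementation: the function names, the dot-product scoring, and the choice to store a per-centroid candidate list of key indices are all assumptions made for illustration, and how the centroids themselves are obtained during prefilling is left out.

```python
def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def build_index(centroids, keys, per_centroid):
    """Hypothetical prefill step: for each centroid, store the indices of
    the keys that score highest against it."""
    return [top_k([dot(c, key) for key in keys], per_centroid) for c in centroids]


def retrieve(query, centroids, index, keys, n_centroids, k):
    """Stage 1 picks the centroids closest to the query (coarse lookup).
    Stage 2 re-scores only their candidate keys at token level."""
    nearest = top_k([dot(query, c) for c in centroids], n_centroids)
    candidates = sorted({i for c in nearest for i in index[c]})
    ranked = sorted(candidates, key=lambda i: dot(query, keys[i]), reverse=True)
    return ranked[:k]


# Toy data: two keys point along each axis, one centroid per axis.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
index = build_index(centroids, keys, per_centroid=2)
selected = retrieve([0.2, 1.0], centroids, index, keys, n_centroids=1, k=2)
```

The point of the design is that stage 2 only scores the small candidate set, so the cost per query depends on the number of centroids and candidates rather than on the full context length.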
Details
Motivation: In long-context scenarios, large language models face high KV cache memory overhead and high access latency. Existing dynamic KV selection methods are caught in a trade-off between block-level indexing (low accuracy) and token-level indexing (low efficiency).
Method: Building on the observation that query vectors at adjacent positions are highly similar after RoPE and share most of their top-k KV entries, CTKVR generates lightweight centroids during prefilling for coarse-grained indexing, then performs token-level refinement, and uses a system optimized with CPU-GPU co-execution to speed up index construction and search.
Result: On Llama-3-8B and Yi-9B at 96K context length, CTKVR achieves throughput speedups of up to 3 times and 4 times, with less than 1% accuracy loss on multiple benchmarks across diverse GPU hardware.
Conclusion: CTKVR addresses the efficiency-accuracy trade-off in long-context inference and, through two-stage retrieval and system optimization, shows strong performance in practical deployment.
Abstract: Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection struggle with trade-offs: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTKVR, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTKVR leverages a key observation: query vectors adjacent in position exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTKVR employs a two-stage retrieval strategy: lightweight centroids are precomputed during prefilling for centroid-grained indexing, followed by token-level refinement for precise KV retrieval. This approach balances retrieval efficiency and accuracy. To further enhance performance, we implement an optimized system for index construction and search using CPU-GPU co-execution. Experimentally, CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation. Meanwhile, CTKVR delivers 3 times and 4 times throughput speedups on Llama-3-8B and Yi-9B at 96K context length across diverse GPU hardware.
[31] Learning inflection classes using Adaptive Resonance Theory
Peter Dekker,Heikki Rasilo,Bart de Boer
Main category: cs.CL
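TL;DR: The study uses an Adaptive Resonance Theory (ART) neural network for unsupervised clustering of verbal inflection classes in Latin, Portuguese, and Estonian, to investigate how individual language users acquire inflectional systems. A generalisation parameter (vigilance) yields cognitively plausible, interpretable clusterings; within a specific parameter range the results closely resemble attested inflection classes, and the extracted features agree with linguistic descriptions.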
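To make the role of the vigilance parameter concrete, here is a minimal ART1-style clustering sketch over binary feature sets. The entry does not state which ART variant or feature encoding the paper uses, so the fast-learning update, the `beta` tie-breaking constant, and the made-up suffix features below are assumptions chosen only to show the mechanism.

```python
def art1_cluster(items, vigilance, beta=1.0):
    """Cluster non-empty feature sets with an ART1-style procedure.

    Returns (labels, prototypes). Higher vigilance demands a closer match
    before an item may join an existing category, so it generalises less.
    """
    prototypes = []  # one feature set per learned category
    labels = []
    for features in items:
        features = set(features)
        # Rank existing categories by the ART1 choice function.
        order = sorted(
            range(len(prototypes)),
            key=lambda j: len(features & prototypes[j]) / (beta + len(prototypes[j])),
            reverse=True,
        )
        for j in order:
            # Vigilance test: share of the input's features found in the prototype.
            match = len(features & prototypes[j]) / len(features)
            if match >= vigilance:
                prototypes[j] = features & prototypes[j]  # fast learning
                labels.append(j)
                break
        else:
            # No category passed the test: create a new one.
            prototypes.append(features)
            labels.append(len(prototypes) - 1)
    return labels, prototypes


# Hypothetical lexemes described by invented suffix features.
verbs = [
    {"inf:-are", "1sg:-o", "2sg:-as"},
    {"inf:-are", "1sg:-o", "2sg:-as"},
    {"inf:-are", "1sg:-o", "2sg:-is"},
    {"inf:-ere", "1sg:-eo", "2sg:-es"},
]
strict_labels, _ = art1_cluster(verbs, vigilance=0.8)
loose_labels, loose_prototypes = art1_cluster(verbs, vigilance=0.6)
```

With the stricter setting the third verb opens its own category, while the looser setting merges it into the first and shrinks that prototype to the shared features, which is the generalisation trade-off the entry describes.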
Details
Motivation: Study the learnability of verbal inflection classes, understand how individual language users infer new forms by analogy, and shed light on how patterns form in morphological acquisition and processing.
Method: An unsupervised neural network model based on Adaptive Resonance Theory (ART) clusters the lexemes of three languages. The vigilance parameter controls the degree of generalisation, and the clusterings are compared with attested inflection classes.
Result: Clustering accuracy varies across Latin, Portuguese, and Estonian. The best performance occurs in a narrow range of the vigilance parameter, and the learned features are comparable to the inflection classes described by linguists.
Conclusion: The ART model can simulate the learning of inflection classes in a cognitively plausible and interpretable way, and could in future be embedded in an agent-based model to study historical change in inflection classes.
Abstract: The concept of inflection classes is an abstraction used by linguists, and provides a means to describe patterns in languages that give an analogical base for deducing previously unencountered forms. This ability is an important part of morphological acquisition and processing. We study the learnability of a system of verbal inflection classes by the individual language user by performing unsupervised clustering of lexemes into inflection classes. As a cognitively plausible and interpretable computational model, we use Adaptive Resonance Theory, a neural network with a parameter that determines the degree of generalisation (vigilance). The model is applied to Latin, Portuguese and Estonian. The similarity of clustering to attested inflection classes varies depending on the complexity of the inflectional system. We find the best performance in a narrow region of the generalisation parameter. The learned features extracted from the model show similarity with linguistic descriptions of the inflection classes. The proposed model could be used to study change in inflection classes in the future, by including it in an agent-based model.
[32] From Data to Dialogue: Unlocking Language for All
Dakota Ellis,Samy Bakikerali,Wanshan Chen,Bao Dinh,Uyen Le
Main category: cs.CL
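TL;DR: This paper proposes an automated method, based on objective criteria, for building Specialized Word Lists (SWLs). Compared with the traditional General Service List (GSL) and the NGSL, the SWLs reach the 95% coverage needed for language comprehension with fewer words, and are more practical and scalable.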
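The 95% coverage criterion mentioned in this entry can be illustrated with a short frequency-based sketch. The paper's actual selection criteria are not given here, so ranking words purely by corpus frequency is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target=0.95):
    """Return the most frequent words needed so that they account for at
    least `target` of all token occurrences in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    selected, covered = [], 0
    # most_common() yields words in descending order of frequency.
    for word, freq in counts.most_common():
        if total and covered / total >= target:
            break
        selected.append(word)
        covered += freq
    return selected


def coverage(word_list, tokens):
    """Share of token occurrences in `tokens` that appear in `word_list`."""
    if not tokens:
        return 0.0
    vocab = set(word_list)
    return sum(1 for t in tokens if t in vocab) / len(tokens)


# Toy corpus of 100 tokens with a skewed frequency distribution.
corpus = ["the"] * 90 + ["cat"] * 6 + ["sat"] * 3 + ["mat"] * 1
word_list = words_needed_for_coverage(corpus, target=0.95)
```

Running the same two functions on a domain-specific subset of a corpus instead of the whole corpus is one way to compare how many words a specialized list needs against a general one.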
Details
Motivation: Traditional linguistics relies on subjective expert judgment and a large time investment to build a General Service List (GSL), which makes it hard to serve language learners efficiently.
Method: Build Specialized Word Lists (SWLs) using objective criteria only, apply the process to subsets of a corpus to generate word lists for specific needs automatically, and compare their coverage with the industry-standard NGSL.
Result: The SWLs outperform the NGSL, reaching 95% coverage for language comprehension with fewer words, which makes them more practical.
Conclusion: Because the SWL process is restricted to objective criteria, it can be automated, scaled, and tailored to the needs of language learners worldwide, making it an effective way to optimize vocabulary acquisition.
Abstract: Traditional linguists have proposed the use of a General Service List (GSL) to assist new language learners in identifying the most important words in English. This process requires linguistic expertise, subjective input, and a considerable amount of time. We attempt to create our own GSL and evaluate its practicality against the industry standard (the NGSL). We found creating a Specialized Word List (SWL), or a word list specific to a subset of the overall corpus, to be the most practical way for language-learners to optimize the process. The SWLs that we created using our model outperformed the industry standard, reaching the 95% coverage required for language comprehension with fewer words. By restricting the SWL process to objective criteria only, it can be automated, scaled, and tailored to the needs of language-learners across the globe.
[33] An Empirical Study on Chinese Character Decomposition in Multiword Expression-Aware Neural Machine Translation
Lifeng Han,Gareth J. F. Jones,Alan F. Smeaton
Main category: cs.CL
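TL;DR: This paper systematically studies Chinese character decomposition in multiword expression-aware neural machine translation, examining how it helps represent the original meanings of Chinese words and addresses the challenges of translating multiword expressions.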
Details
Motivation: Ideographic languages such as Chinese lack effective methods for handling multiword expressions (MWEs), and sub-word modelling techniques such as BPE cannot be applied directly, so language processing techniques suited to Chinese need to be explored.
Method: Chinese character decomposition is combined with a neural machine translation framework in MWE-aware translation experiments, analyzing its effect on Chinese word meaning representation and MWE translation.
Result: The experiments show that Chinese character decomposition helps represent the original meanings of Chinese characters and words and improves the handling of multiword expressions in machine translation.
Conclusion: Chinese character decomposition is a promising technique for improving natural language processing of Chinese multiword expressions, with particular value for neural machine translation.
Abstract: Word meaning, representation, and interpretation play fundamental roles in natural language understanding (NLU), natural language processing (NLP), and natural language generation (NLG) tasks. Many of the inherent difficulties in these tasks stem from Multi-word Expressions (MWEs), which complicate the tasks by introducing ambiguity, idiomatic expressions, infrequent usage, and a wide range of variations. Significant effort and substantial progress have been made in addressing the challenging nature of MWEs in Western languages, particularly English. This progress is attributed in part to the well-established research communities and the abundant availability of computational resources. However, the same level of progress has not been achieved for language families such as Chinese and closely related Asian languages, which continue to lag behind in this regard. While sub-word modelling has been successfully applied to many Western languages to address rare words, improve phrase comprehension, and enhance machine translation (MT) through techniques like byte-pair encoding (BPE), it cannot be applied directly to ideograph language scripts like Chinese. In this work, we conduct a systematic study of the Chinese character decomposition technology in the context of MWE-aware neural machine translation (NMT). Furthermore, we report experiments to examine how Chinese character decomposition technology contributes to the representation of the original meanings of Chinese words and characters, and how it can effectively address the challenges of translating MWEs.
[34] Bolmo: Byteifying the Next Generation of Language Models
Benjamin Minixhofer,Tyler Murray,Tomasz Limisiewicz,Anna Korhonen,Luke Zettlemoyer,Noah A. Smith,Edoardo M. Ponti,Luca Soldaini,Valentin Hofmann
Main category: cs.CL
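TL;DR: Bolmo is the first family of competitive, fully open byte-level language models at the 1B and 7B parameter scales. It is trained efficiently by converting existing subword-level models to byte level, outperforms prior byte-level models on tasks such as character understanding and coding, and approaches or even exceeds the source subword models.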
Details
Motivation: Overcome the limitations of subword tokenization, namely insufficient character understanding and efficiency constraints from a fixed vocabulary, so that byte-level language models can compete with subword-level models in performance and practicality.
Method: A "byteification" method converts existing subword-level language models into byte-level models. A new architecture matches the expressivity of subword models, enabling an effective exact distillation objective, and the conversion uses less than 1% of a typical pretraining token budget.
Result: Bolmo substantially outperforms prior byte-level models of the same size, outperforms the source subword models on character understanding and some coding tasks, and comes close to the original models on other tasks. It reaches inference speeds comparable to subword models and can be post-trained cheaply by reusing the source model's ecosystem.
Conclusion: Bolmo makes byte-level language models, for the first time, a practical choice comparable to subword-level models across a wide range of use cases.
Abstract: We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary - while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM.
Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
[35] You Never Know a Person, You Only Know Their Defenses: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations
Hongbin Na,Zimu Wang,Zhaoming Chen,Peilin Zhou,Yining Hua,Grace Ziqi Zhou,Haiyang Zhang,Tao Shen,Wei Wang,John Torous,Shaoxiong Ji,Ling Chen
Main category: cs.CL
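TL;DR: This paper introduces the PsyDefConv corpus and DMRS Co-Pilot, a four-stage pipeline for automatically annotating psychological defense levels in clinical dialogues, which improves annotation efficiency and provides a new resource for studying defensive functioning in language.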
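This entry reports a Cohen's kappa for the corpus labels. As a reminder of what that statistic measures, here is a minimal sketch of the standard two-annotator formula, kappa = (p_o - p_e) / (1 - p_e). The label values and annotator lists are invented for illustration and are unrelated to the corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented example: two annotators, four utterances, two defense levels.
annotator_1 = ["mature", "mature", "immature", "immature"]
annotator_2 = ["mature", "immature", "immature", "immature"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5.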
Details
Motivation: Psychological defense mechanisms are complex and hard to measure reliably, especially in clinical dialogues, so an effective tool is needed to support annotation and analysis.
Method: The authors build PsyDefConv, a dialogue corpus of 200 dialogues and 4,709 utterances with help seeker utterances labeled for defense level, develop the four-stage DMRS Co-Pilot pre-annotation pipeline, and validate it through expert review and benchmarking.
Result: DMRS Co-Pilot reduces average annotation time by 22.4%. Experts rate it 4.62 for evidence, 4.44 for clinical plausibility, and 4.40 for insight on a seven-point scale. The strongest language models reach a macro F1 of about 30% and tend to overpredict mature defenses. Corpus analysis shows that mature defenses are the most common and reveals emotion-specific deviations.
Conclusion: PsyDefConv and DMRS Co-Pilot offer a workable approach to automatic identification of psychological defenses and their clinical use; the resources will be released to support further research.
Abstract: Psychological defenses are strategies, often automatic, that people use to manage distress. Rigid use or overuse of defenses is negatively linked to mental health and shapes what speakers disclose and how they accept or resist help. However, defenses are complex and difficult to reliably measure, particularly in clinical dialogues. We introduce PsyDefConv, a dialogue corpus with help seeker utterances labeled for defense level, and DMRS Co-Pilot, a four-stage pipeline that provides evidence-based pre-annotations. The corpus contains 200 dialogues and 4,709 utterances, including 2,336 help seeker turns, with a labeling agreement of Cohen's kappa 0.639. In a counterbalanced study, the co-pilot reduced average annotation time by 22.4%. In expert review, it averaged 4.62 for evidence, 4.44 for clinical plausibility, and 4.40 for insight on a seven-point scale. Benchmarks with strong language models in zero-shot and fine-tuning settings demonstrate clear headroom, with the best macro F1-score around 30% and a tendency to overpredict mature defenses. Corpus analyses confirm that mature defenses are most common and reveal emotion-specific deviations. We will release the corpus, annotations, code, and prompts to support research on defensive functioning in language.
[36] Evaluating Metrics for Safety with LLM-as-Judges
Kester Clegg,Richard Hawkins,Ibrahim Habli,Tom Lawton
Main category: cs.CL
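TL;DR: This paper examines how to keep LLMs safe and reliable when they are introduced into critical information flows, arguing for a basket of weighted metrics, context sensitivity, and confidence thresholds to lower risk, with human review triggered within LLM-as-Judges (LaJ) frameworks.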
Details
Motivation: LLMs are widely used in text processing and may replace human roles, but their errors can have serious consequences in safety-critical settings, so a reliable safety argument is needed. Method: Proposes evaluating with a basket of weighted metrics, defining error severity by context, and setting confidence thresholds that trigger human review when concordance across evaluators is low. Result: Although deterministic evaluations are not available for natural language tasks, the approach may lower the risk of errors within an evaluation. Conclusion: The safety argument should focus on the quality of the evidence obtained from evaluation rather than on augmented generation or graph-based techniques alone, with multi-metric evaluation and human intervention built into LaJ frameworks to improve reliability. Abstract: LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. Examples include triaging post-operative care for patients based on hospital referral letters and updating site access schedules for work crews in nuclear facilities. If we want to introduce LLMs into critical information flows that were previously performed by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in LLM processes, particularly in frameworks that employ LLM-as-Judges (LaJ) evaluators. This paper argues that although we cannot get deterministic evaluations from many natural language processing tasks, by adopting a basket of weighted metrics it may be possible to lower the risk of errors within an evaluation, use context sensitivity to define error severity and design confidence thresholds that trigger human review of critical LaJ judgments when concordance across evaluators is low.
[37] How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness
Darshita Rathore,Vineet Kumar,Chetna Bansal,Anindya Moitra
Main category: cs.CL
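TL;DR: This paper comprehensively evaluates full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (e.g., LoRA) on question-answering tasks, finding that LoRA at specific ranks matches or even exceeds SFT on reasoning tasks, and uses representation analysis to reveal differences in generalisation and attention structure.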
Details
Motivation: Parameter-efficient fine-tuning methods such as LoRA are widely used for their computational efficiency, but the effect of their configuration (e.g., rank) on downstream tasks, especially Q&A and generalisation, remains unclear and calls for systematic evaluation. Method: Runs a rank sweep across multiple reasoning and recall datasets, compares the accuracy of SFT and LoRA under in-domain and out-of-domain adaptation, and analyses changes in internal representations via spectral features and layer-wise attention structures. Result: At specific ranks LoRA matches or outperforms SFT, particularly on reasoning tasks; the methods differ markedly in generalisation behaviour and task-specific forgetting; representation analysis shows that LoRA induces clear representational drift and changes in attention structure. Conclusion: LoRA is not only computationally efficient but, when configured appropriately, can outperform SFT, and its internal representation dynamics offer a new perspective on how parameter-efficient fine-tuning works. Abstract: Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.
[38] Characterizing Mamba's Selective Memory using Auto-Encoders
Tamanna Hossain,Robert L. Logan,Ganesh Jagadeesan,Sameer Singh,Joel Tetreault,Alejandro Jaimes
Main category: cs.CL
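TL;DR: This paper studies what kinds of information state space models (SSMs) forget in language modeling, finding that they more readily forget less frequent tokens such as math-related symbols, organization entities, and non-standard English, and uses reconstruction experiments to link forgetting to frequency in the training data.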
Details
Motivation: Prior work does not establish which types of information SSM language models tend to forget when processing long sequences; this paper aims to fill that gap. Method: Trains an auto-encoder to reconstruct input sequences from the SSM's hidden state, quantifies information loss for different token and sequence types by comparing inputs with their reconstructions, and analyses how often those tokens occur in the pretraining data. Result: Math-related tokens (e.g., numbers, variables), mentions of organization entities, and dialects other than Standard American English are forgotten more often, and these tokens appear less frequently in Mamba's pretraining data. Conclusion: SSM language models are more likely to forget important information that is infrequent in the training data, which points to directions for improving model memory. Abstract: State space models (SSMs) are a promising alternative to transformers for language modeling because they use fixed memory during inference. However, this fixed memory usage entails some information loss in the hidden state when processing long sequences. While prior work has studied the sequence length at which this information loss occurs, it does not characterize the types of information SSM language models (LMs) tend to forget. In this paper, we address this knowledge gap by identifying the types of tokens (e.g., parts of speech, named entities) and sequences (e.g., code, math problems) that are more frequently forgotten by SSM LMs. We achieve this by training an auto-encoder to reconstruct sequences from the SSM's hidden state, and measure information loss by comparing inputs with their reconstructions. We perform experiments using the Mamba family of SSM LMs (130M--1.4B) on sequences ranging from 4--256 tokens. Our results show significantly higher rates of information loss on math-related tokens (e.g., numbers, variables), mentions of organization entities, and alternative dialects to Standard American English. We then examine how frequently these tokens appear in Mamba's pretraining data and find that less prevalent tokens tend to be the ones Mamba is most likely to forget. By identifying these patterns, our work provides clear direction for future research to develop methods that better control Mamba's ability to retain important information.
[39] PPSEBM: An Energy-Based Model with Progressive Parameter Selection for Continual Learning
Xiaodi Li,Dingcheng Li,Rujun Gao,Mahmoud Zamani,Feng Mi,Latifur Khan
Main category: cs.CL
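TL;DR: This paper proposes PPSEBM, a framework that combines an energy-based model with progressive parameter selection to mitigate catastrophic forgetting in continual learning for natural language processing.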
Details
Motivation: Addresses catastrophic forgetting in continual learning, where a model loses knowledge of earlier tasks as it learns new ones. Method: Combines an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS): the EBM generates pseudo-samples from prior tasks, PPS allocates task-specific parameters to each new task, and the pseudo-samples guide the parameter selection process. Result: Experiments on multiple NLP benchmarks show that PPSEBM outperforms state-of-the-art continual learning methods. Conclusion: PPSEBM balances the learning of old and new tasks, substantially mitigates catastrophic forgetting, and offers a robust new approach to continual learning. Abstract: Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a stream of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, where performance on earlier tasks degrades as new tasks are learned. In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct, task-specific parameters for each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model's ability to retain past knowledge while adapting to new tasks. Experimental results on diverse NLP benchmarks demonstrate that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution to mitigate catastrophic forgetting.
[40] Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Adam Karvonen,James Chua,Clément Dumas,Kit Fraser-Taliente,Subhash Kantamneni,Julian Minder,Euan Ong,Arnab Sen Sharma,Daniel Wen,Owain Evans,Samuel Marks
Main category: cs.CL
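TL;DR: This paper studies how diversified training improves the generalization of Activation Oracles (AOs) for explaining LLM activations, finding that AOs can recover fine-tuned information even in far out-of-distribution settings and match or exceed existing white-box methods on several tasks.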
Details
Motivation: LLM activations are hard to understand, and existing methods are complex and specialized rather than general. The paper tests whether LatentQA remains effective on broader, unseen tasks and how training-data diversity affects generalization. Method: Uses the LatentQA framework to train AO models to answer natural-language questions about LLM activations; evaluates them on several downstream tasks against white- and black-box techniques, and tests how different training data (e.g., classification tasks, self-supervised context prediction) affect performance. Result: AOs recover fine-tuned information (e.g., biographical knowledge or malign propensities) in unseen, far out-of-distribution settings; on the four downstream tasks, even narrowly trained AOs generalize well, and adding training data improves performance further; the best AOs match or exceed white-box baselines on all four tasks and are the best method on three. Conclusion: Diversified natural-language question-answering training gives AOs a general ability to interpret and verbalize information latent in LLM activations, suggesting broad potential for the simple, general LatentQA approach. Abstract: Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding further training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.
cs.CV [Back]
[41] SkyCap: Bitemporal VHR Optical-SAR Quartets for Amplitude Change Detection and Foundation-Model Evaluation
Paul Weinmann,Ferdinand Schenck,Martin Šiklar
Main category: cs.CV
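TL;DR: This paper presents SkyCap, a bitemporal high-resolution optical-SAR dataset for linear infrastructure monitoring that uses optical-to-SAR label transfer to enable SAR change detection without expert annotation, and evaluates several foundation models on the task, finding that an optical foundation model with suitable preprocessing outperforms SAR foundation models further pretrained on SAR data.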
Details
Motivation: Clouds prevent continuous acquisition of optical remote-sensing imagery, while SAR images in all weather but is hard to annotate, so a dataset and method combining the strengths of both are needed to make change detection more reliable and efficient. Method: Builds the SkyCap dataset by archive matching and co-registration of SkySat (optical) and Capella Space (SAR) imagery, uses optical-to-SAR label transfer to produce SAR amplitude change detection labels, performs continued pretraining of SARATR-X, and benchmarks the models under different preprocessing strategies. Result: Among the evaluated models, the optical foundation model MTP(ViT-B+RVSA) with dB+Z-score preprocessing performs best (F1$_c$ = 45.06), ahead of SAR-specific models further pretrained directly on Capella SAR data; preprocessing strongly affects performance, and model rankings on optical change detection do not transfer directly to SAR ACD. Conclusion: This is the first systematic evaluation of foundation models on VHR SAR amplitude change detection; the results suggest that suitable preprocessing and cross-modal transfer can be more effective than SAR-specific models, pointing to new directions for multimodal remote-sensing change detection. Abstract: Change detection for linear infrastructure monitoring requires reliable high-resolution data and regular acquisition cadence. Optical very-high-resolution (VHR) imagery is interpretable and straightforward to label, but clouds break this cadence. Synthetic Aperture Radar (SAR) enables all-weather acquisitions, yet is difficult to annotate. We introduce SkyCap, a bitemporal VHR optical-SAR dataset constructed by archive matching and co-registration of SkySat (optical) and Capella Space (SAR) scenes. We utilize optical-to-SAR label transfer to obtain SAR amplitude change detection (ACD) labels without requiring SAR-expert annotations. We perform continued pretraining of SARATR-X on our SAR data and benchmark the resulting SAR-specific foundation models (FMs) together with SARATR-X against optical FMs on SkyCap under different preprocessing choices. Among evaluated models, MTP(ViT-B+RVSA), an optical FM, with dB+Z-score preprocessing attains the best result (F1$_c$ = 45.06), outperforming SAR-specific FMs further pretrained directly on Capella data. We observe strong sensitivity to preprocessing alignment with pretraining statistics, and the ranking of optical models on optical change detection does not transfer one-to-one to SAR ACD. To our knowledge, this is the first evaluation of foundation models on VHR SAR ACD.
[42] SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning
Tomohito Kawabata,Xinyu Zhang,Ling Xiao
Main category: cs.CV
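TL;DR: This paper proposes SocialNav-MoE, an efficient Mixture-of-Experts vision language model that combines reinforcement fine-tuning with a semantic similarity reward for socially compliant robot navigation in crowded environments, balancing accuracy and efficiency.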
Details
Motivation: Prior work focuses on navigation safety and neglects social compliance factors such as human comfort and social norms, and large vision language models are too computationally expensive for real-time deployment on resource-constrained robotic platforms. Method: Proposes SocialNav-MoE, a small Mixture-of-Experts vision language model trained with reinforcement fine-tuning (RFT) and a newly designed semantic similarity reward (SSR), and studies the effect of different small language models, routing strategies, and vision encoders. Result: Experiments on the SNEI dataset show that SocialNav-MoE balances navigation accuracy and inference efficiency well, and that the proposed SSR outperforms hard-level and character-level rewards. Conclusion: Through a lightweight architecture and effective reward design, SocialNav-MoE improves the social compliance and practicality of robot navigation in human environments and suits real-time use on resource-constrained platforms. Abstract: For robots navigating in human-populated environments, safety and social compliance are equally critical, yet prior work has mostly emphasized safety. Socially compliant navigation that accounts for human comfort, social norms, and contextual appropriateness remains underexplored. Vision language models (VLMs) show promise for this task; however, large-scale models incur substantial computational overhead, leading to higher inference latency and energy consumption, which makes them unsuitable for real-time deployment on resource-constrained robotic platforms. To address this issue, we investigate the effectiveness of small VLMs and propose SocialNav-MoE, an efficient Mixture-of-Experts vision language model for socially compliant navigation with reinforcement fine-tuning (RFT). We further introduce a semantic similarity reward (SSR) to effectively leverage RFT for enhancing decision-making capabilities. Additionally, we study the effectiveness of different small language model types (Phi, Qwen, and StableLM), routing strategies, and vision encoders (CLIP vs. SigLIP, frozen vs. fine-tuned). Experiments on the SNEI dataset demonstrate that SocialNav-MoE achieves an excellent balance between navigation accuracy and efficiency. The proposed SSR function is more effective than hard-level and character-level rewards. Source code will be released upon acceptance.
[43] The Renaissance of Expert Systems: Optical Recognition of Printed Chinese Jianpu Musical Scores with Lyrics
Fan Bu,Rongfeng Li,Zijin Li,Ya Li,Linfeng Fan,Pei Huang
Main category: cs.CV
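TL;DR: Proposes a modular expert-system pipeline that needs no large annotated dataset to convert printed Chinese Jianpu scores with lyrics into machine-readable formats (MusicXML and MIDI), achieving high-accuracy recognition on a folk-song anthology.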
Details
Motivation: Large-scale optical music recognition research has focused on Western staff notation and has paid little attention to Chinese Jianpu and its rich lyric resources; effective digitization methods are needed. Method: Adopts a top-down expert-system design that combines traditional computer-vision techniques (e.g., phrase correlation, skeleton analysis) with unsupervised deep-learning modules for image feature embeddings, forming a hybrid strategy. Result: Evaluated on The Anthology of Chinese Folk Songs, the system digitized more than 5,000 melody-only songs (> 300,000 notes) and more than 1,400 songs with lyrics (> 100,000 notes), with a note-wise F1 of 0.951 for melody and a character-wise F1 of 0.931 for aligned lyrics. Conclusion: The hybrid approach balances interpretability and accuracy and substantially advances the large-scale digitization of Chinese Jianpu scores. Abstract: Large-scale optical music recognition (OMR) research has focused mainly on Western staff notation, leaving Chinese Jianpu (numbered notation) and its rich lyric resources underexplored. We present a modular expert-system pipeline that converts printed Jianpu scores with lyrics into machine-readable MusicXML and MIDI, without requiring massive annotated training data. Our approach adopts a top-down expert-system design, leveraging traditional computer-vision techniques (e.g., phrase correlation, skeleton analysis) to capitalize on prior knowledge, while integrating unsupervised deep-learning modules for image feature embeddings. This hybrid strategy strikes a balance between interpretability and accuracy. Evaluated on The Anthology of Chinese Folk Songs, our system digitizes at scale (i) a melody-only collection of more than 5,000 songs (> 300,000 notes) and (ii) a curated subset with lyrics comprising over 1,400 songs (> 100,000 notes). The system achieves high-precision recognition on both melody (note-wise F1 = 0.951) and aligned lyrics (character-wise F1 = 0.931).
[44] AquaDiff: Diffusion-Based Underwater Image Enhancement for Addressing Color Distortion
Afrah Shaahid,Muzammil Behzad
Main category: cs.CV
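TL;DR: AquaDiff is a diffusion-based underwater image enhancement framework that uses a chromatic prior-guided color compensation strategy and a conditional diffusion process to correct color casts while preserving structural and perceptual detail.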
Details
Motivation: Underwater images are badly degraded by light absorption and scattering, causing color distortion, low contrast, and loss of detail that limit vision-based underwater applications; existing methods balance color restoration and detail preservation poorly. Method: Proposes AquaDiff, which combines chromatic prior-guided color compensation with a conditional diffusion model; cross-attention fuses the degraded input with the noisy latent state at each denoising step; an enhanced denoising backbone (with residual dense blocks and multi-resolution attention) captures global color context and local detail; a new cross-domain consistency loss jointly optimizes pixel-level accuracy, perceptual similarity, structural integrity, and frequency-domain fidelity. Result: Experiments on multiple challenging underwater benchmarks show that AquaDiff outperforms traditional, CNN-, GAN-, and diffusion-based methods in color correction, with competitive overall image quality across diverse underwater environments. Conclusion: By combining a chromatic prior with a diffusion model, AquaDiff balances color correction and detail preservation, markedly improves underwater image quality, and provides an effective preprocessing step for complex underwater vision tasks. Abstract: Underwater images are severely degraded by wavelength-dependent light absorption and scattering, resulting in color distortion, low contrast, and loss of fine details that hinder vision-based underwater applications. To address these challenges, we propose AquaDiff, a diffusion-based underwater image enhancement framework designed to correct chromatic distortions while preserving structural and perceptual fidelity. AquaDiff integrates a chromatic prior-guided color compensation strategy with a conditional diffusion process, where cross-attention dynamically fuses degraded inputs and noisy latent states at each denoising step. An enhanced denoising backbone with residual dense blocks and multi-resolution attention captures both global color context and local details. Furthermore, a novel cross-domain consistency loss jointly enforces pixel-level accuracy, perceptual similarity, structural integrity, and frequency-domain fidelity. Extensive experiments on multiple challenging underwater benchmarks demonstrate that AquaDiff compares favorably with state-of-the-art traditional, CNN-, GAN-, and diffusion-based methods, achieving superior color correction and competitive overall image quality across diverse underwater conditions.
[45] Improving VQA Reliability: A Dual-Assessment Approach with Self-Reflection and Cross-Model Verification
Xixian Wu,Yang Ou,Pengchao Tian,Zian Yang,Jielei Zhang,Peiyi Li,Longwen Gao
Main category: cs.CV
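TL;DR: Proposes DAVR, a dual-assessment framework that uses self-reflection and cross-model verification to improve the reliability of vision-language models in visual question answering and to reduce hallucinations.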
Details
Motivation: Vision-language models (VLMs) are prone to hallucination in visual question answering, producing confident but wrong answers that undermine trust. Method: Designs a dual-pathway architecture: one pathway performs self-reflection by fusing VLM latent features with QA embeddings; the other uses external reference models for factual cross-checking. Result: In the Reliable VQA Challenge at ICCV-CLVL 2025, DAVR achieved a $Φ_{100}$ score of 39.64 and a 100-AUC of 97.22, ranking first. Conclusion: DAVR improves the reliability of VLM answers, lowers the risk posed by hallucinations, and strengthens trust in the model. Abstract: Vision-language models (VLMs) have demonstrated significant potential in Visual Question Answering (VQA). However, the susceptibility of VLMs to hallucinations can lead to overconfident yet incorrect answers, severely undermining answer reliability. To address this, we propose Dual-Assessment for VLM Reliability (DAVR), a novel framework that integrates Self-Reflection and Cross-Model Verification for comprehensive uncertainty estimation. The DAVR framework features a dual-pathway architecture: one pathway leverages dual selector modules to assess response reliability by fusing VLM latent features with QA embeddings, while the other deploys external reference models for factual cross-checking to mitigate hallucinations. Evaluated in the Reliable VQA Challenge at ICCV-CLVL 2025, DAVR achieves a leading $Φ_{100}$ score of 39.64 and a 100-AUC of 97.22, securing first place and demonstrating its effectiveness in enhancing the trustworthiness of VLM responses.
[46] HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Dan Ben-Ami,Gabriele Serussi,Kobi Cohen,Chaim Baskin
Main category: cs.CV
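TL;DR: HERBench is a new video question answering benchmark that evaluates a model's ability to integrate multiple non-overlapping pieces of visual evidence over time; each question requires fusing evidence from at least three distinct segments, so it cannot be answered from a single frame or from language priors. The benchmark introduces the Minimum Required Frame-Set (MRFS) to quantify evidential demand and exposes serious deficits in how current Video-LLMs retrieve and fuse key information.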
Details
Motivation: Existing VideoQA benchmarks often allow questions to be answered from a single salient cue and so under-test reasoning that integrates multiple pieces of evidence across time; a stricter, quantifiable benchmark is needed to drive genuine temporal understanding and compositional reasoning. Method: Builds HERBench, with 26K five-way multiple-choice questions across twelve compositional tasks, each requiring at least three non-overlapping evidence segments; proposes the Minimum Required Frame-Set (MRFS) metric to quantify how many frames a model must fuse; evaluates 13 state-of-the-art Video-LLMs. Result: HERBench has a mean MRFS of 5.5, well above prior datasets (2.6-4.2); the 13 Video-LLMs reach only 31-42% accuracy, slightly above the 20% random-guess baseline; failures stem from two bottlenecks, insufficient key-frame retrieval and weak information fusion. Conclusion: By making cross-time evidence integration mandatory and measurable, HERBench exposes fundamental weaknesses of current Video-LLMs in multi-step visual reasoning and sets a clear direction for robust, compositional video understanding. Abstract: Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated pieces of visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot can suffice. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
[47] Isolated Sign Language Recognition with Segmentation and Pose Estimation
Daniel Perkins,Davis Hunter,Dhrumil Patel,Galen Flanagan
Main category: cs.CV
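TL;DR: Proposes an isolated sign language recognition model that combines pose estimation, a segmentation module, and a ResNet-Transformer backbone to lower computational cost and improve robustness to signer variation.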
Details
Motivation: Large language models have seen limited application to sign language translation, leaving ASL users largely without their benefits; isolated sign language recognition is held back by scarce data, high signer variability, and high computational cost. Method: Uses pose estimation to extract hand and face keypoints, a segmentation module to filter out irrelevant information, and a ResNet-Transformer to jointly model spatial and temporal dependencies. Result: The model significantly reduces computational requirements while remaining robust across different signers. Conclusion: The approach offers a practical route to efficient isolated sign language recognition and helps narrow the technology gap between ASL and other languages. Abstract: The recent surge in large language models has automated translations of spoken and written languages. However, these advances remain largely inaccessible to American Sign Language (ASL) users, whose language relies on complex visual cues. Isolated sign language recognition (ISLR) - the task of classifying videos of individual signs - can help bridge this gap but is currently limited by scarce per-sign data, high signer variability, and substantial computational costs. We propose a model for ISLR that reduces computational requirements while maintaining robustness to signer variation. Our approach integrates (i) a pose estimation pipeline to extract hand and face joint coordinates, (ii) a segmentation module that isolates relevant information, and (iii) a ResNet-Transformer backbone to jointly model spatial and temporal dependencies.
[48] Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris
Wenshuo Li,Majid Mirmehdi,Tilo Burghardt
Main category: cs.CV
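TL;DR: This study proposes a cross-modal individual identification method that pairs dermatoglyphic textual descriptions with images for animal re-identification (Re-ID), and shows on a tiger dataset that it improves AI identification accuracy and alleviates data scarcity.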
Details
Motivation: Existing animal Re-ID relies mainly on images, which is hard to interpret and limited by data volume; bringing in dermatoglyphic language descriptors from forensics can improve interpretability and cross-modal matching. Method: Using 84,264 manually labelled minutiae, builds a joint visual-textual identification framework and develops a text-image co-synthesis pipeline that generates virtual individuals to augment the training data. Result: The method significantly improves cross-modal identity retrieval accuracy, performs well in benchmarks against real-world scenarios, and alleviates data scarcity. Conclusion: Dermatoglyphic language-guided biometrics can overcome the limits of vision-only methods, enabling interpretable, verifiable text-to-image identity recovery and advancing the unification of descriptive modalities in ecological monitoring. Abstract: Biologists have long combined visuals with textual field notes to re-identify (Re-ID) animals. Contemporary AI tools automate this for species with distinctive morphological features but remain largely image-based. Here, we extend Re-ID methodologies by incorporating precise dermatoglyphic textual descriptors, an approach used in forensics but new to ecology. We demonstrate that these specialist semantics abstract and encode animal coat topology using human-interpretable language tags. Drawing on 84,264 manually labelled minutiae across 3,355 images of 185 tigers (Panthera tigris), we evaluate this visual-textual methodology, revealing novel capabilities for cross-modal identity retrieval. To optimise performance, we developed a text-image co-synthesis pipeline to generate 'virtual individuals', each comprising dozens of life-like visuals paired with dermatoglyphic text. Benchmarking against real-world scenarios shows this augmentation significantly boosts AI accuracy in cross-modal retrieval while alleviating data scarcity. We conclude that dermatoglyphic language-guided biometrics can overcome vision-only limitations, enabling textual-to-visual identity recovery underpinned by human-verifiable matchings. This represents a significant advance towards explainability in Re-ID and a language-driven unification of descriptive modalities in ecological monitoring.
[49] Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
Huzheng Yang,Katherine Xu,Andrew Lu,Michael D. Grossberg,Yutong Bai,Jianbo Shi
Main category: cs.CV
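TL;DR: This paper introduces the "Vibe Blending" task, which aims to generate coherent and meaningful image blends between concepts by learning low-dimensional semantic paths in feature spaces such as CLIP, and proposes Vibe Space for smooth transitions between distant concepts, evaluating creative quality with human judgments and LLM reasoning.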
Details
Motivation: Existing methods struggle to identify and connect distant concepts in latent space and do not effectively model the semantic attributes that concepts share (their "vibe"), which limits creative visual blending. Method: Proposes Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces such as CLIP for semantically consistent transitions between concepts; designs a cognitively inspired evaluation framework combining human ratings, LLM reasoning, and a geometric path-based difficulty score. Result: Humans rate the blends produced by Vibe Space as significantly more creative and coherent than those of existing methods. Conclusion: By building a structured semantic manifold and emulating human analogical thinking, Vibe Blending offers a new approach to cross-concept visual creation and supports the value of low-dimensional geometric paths for generating creative content. Abstract: Creating new visual concepts often requires connecting distinct ideas through their most relevant shared attributes: their vibe. We introduce Vibe Blending, a novel task for generating coherent and meaningful hybrids that reveals these shared attributes between images. Achieving such blends is challenging for current methods, which struggle to identify and traverse nonlinear paths linking distant concepts in latent space. We propose Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, we design a cognitively inspired framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score. We find that Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods.
[50] PANDA-PLUS-Bench: A Clinical Benchmark for Evaluating Robustness of AI Foundation Models in Prostate Cancer Diagnosis
Joshua L. Ebbert,Dennis Della Corte
Main category: cs.CV
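TL;DR: PANDA-PLUS-Bench is a benchmark dataset designed to evaluate the robustness of AI foundation models in prostate cancer Gleason grading by quantifying whether models learn biological features rather than slide-specific artifacts. The study evaluates seven foundation models and finds substantial differences in cross-slide accuracy and in how strongly the models encode slide-level confounders, suggesting that tissue-specific training may improve performance.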
Details
Motivation: AI models for Gleason grading may rely on slide-specific artifacts rather than generalizable biological features, which limits clinical use, so a dedicated benchmark is needed to assess true generalization. Method: Builds PANDA-PLUS-Bench, a curated benchmark of expert-annotated whole slide images from nine patients, extracts non-overlapping tissue patches, and tests under multiple augmentation conditions; uses the benchmark to evaluate seven foundation models on cross-slide accuracy and degree of slide-level encoding. Result: Robustness varies substantially across models: Virchow2 has the lowest slide-level encoding among large-scale models (81.0%) but low cross-slide accuracy (47.2%); HistoEncoder, trained specifically on prostate tissue, has the highest cross-slide accuracy (59.7%) and the strongest slide-level encoding (90.3%); all models show a within-slide vs. cross-slide accuracy gap of 19.9 to 26.9 percentage points. Conclusion: PANDA-PLUS-Bench fills a gap in robustness evaluation of foundation models in a clinically important setting; tissue-specific training may help models capture biological features but may also strengthen reliance on slide-specific signals, so further work is needed to reach true clinical generalization. Abstract: Artificial intelligence foundation models are increasingly deployed for prostate cancer Gleason grading, where GP3/GP4 distinction directly impacts treatment decisions. However, these models may achieve high validation accuracy by learning specimen-specific artifacts rather than generalizable biological features, limiting real-world clinical utility. We introduce PANDA-PLUS-Bench, a curated benchmark dataset derived from expert-annotated prostate biopsies designed specifically to quantify this failure mode. The benchmark comprises nine carefully selected whole slide images from nine unique patients containing diverse Gleason patterns, with non-overlapping tissue patches extracted at both 512x512 and 224x224 pixel resolutions across eight augmentation conditions. Using this benchmark, we evaluate seven foundation models on their ability to separate biological signal from slide-level confounders. Our results reveal substantial variation in robustness across models: Virchow2 achieved the lowest slide-level encoding among large-scale models (81.0%) yet exhibited the second-lowest cross-slide accuracy (47.2%). HistoEncoder, trained specifically on prostate tissue, demonstrated the highest cross-slide accuracy (59.7%) and the strongest slide-level encoding (90.3%), suggesting tissue-specific training may enhance both biological feature capture and slide-specific signatures. All models exhibited measurable within-slide vs. cross-slide accuracy gaps, though the magnitude varied from 19.9 percentage points to 26.9 percentage points. We provide an open-source Google Colab notebook enabling researchers to evaluate additional foundation models against our benchmark using standardized metrics. PANDA-PLUS-Bench addresses a critical gap in foundation model evaluation by providing a purpose-built resource for robustness assessment in the clinically important context of Gleason grading.
[51] Improving Pre-trained Segmentation Models using Post-Processing
Abhijeet Parida,Daniel Capellán-Martín,Zhifan Jiang,Nishad Kulkarni,Krithika Iyer,Austin Tapp,Syed Muhammad Anwar,María J. Ledesma-Carbayo,Marius George Linguraru
Main category: cs.CV
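TL;DR: Proposes adaptive post-processing techniques to improve the performance of large-scale pre-trained models on glioma MRI segmentation, improving segmentation quality and promoting computational fairness and sustainability.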
Details
Motivation: Deep learning models for brain tumor segmentation generalize poorly and produce false positives, label swaps, and slice discontinuities, while training large models is resource-intensive and not widely accessible. Method: Designs adaptive post-processing techniques that refine glioma segmentations produced by large-scale pre-trained models, validated on multiple BraTS 2025 challenge tasks. Result: The ranking metric improved by 14.9% in the BraTS 2025 sub-Saharan Africa challenge and by 0.9% in the adult glioma challenge. Conclusion: Post-processing can improve the segmentation quality of existing models and move brain tumor segmentation research toward efficient, clinically aligned, computationally fair, and sustainable methods. Abstract: Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, median survival is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pretrained models developed for various types of tumors. We demonstrated the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9% for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.
[52] TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
Zhenzhi Wang,Jian Wang,Ke Ma,Dahua Lin,Bing Zhou
Main category: cs.CV
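TL;DR: This paper presents TalkVerse, a large-scale open dataset for single-person audio-driven talking video generation, and builds on it a reproducible 5B-parameter DiT baseline that generates high-quality, long videos at lower inference cost.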
Details
Motivation: State-of-the-art methods depend on closed data or compute-heavy models, and the lack of a fair, reproducible benchmark holds back research on audio-driven talking video generation. Method: Builds TalkVerse, with 2.3 million high-resolution synchronized audio-video clips, and a 5B-parameter DiT model based on Wan2.2-5B that uses a video VAE with a high downsampling ratio and a sliding-window mechanism to reduce drift, adds an MLLM director module to improve long-video storytelling, and achieves zero-shot video dubbing through latent noise injection. Result: The model generates minute-long, low-drift, high-quality videos with lip-sync and visual quality comparable to the 14B Wan-S2V model at 10 times lower inference cost; it supports zero-shot video dubbing and controllable generation. Conclusion: TalkVerse provides an open, reproducible foundation for audio-driven talking video generation; the method offers a better trade-off between efficiency and quality and advances open research in the field. Abstract: We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10$\times$ lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: https://zhenzhiwang.github.io/talkverse/
[53] Puzzle Curriculum GRPO for Vision-Centric Reasoning
Ahmadreza Jeddi,Hakki Can Karaimer,Hue Nguyen,Zhongling Wang,Ke Zhao,Javad Rajabi,Ran Zhang,Raghav Goyal,Babak Taati,Radek Grzeszczuk
Main category: cs.CV
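TL;DR: Proposes PC-GRPO, a self-supervised reinforcement learning method that needs neither human annotations nor external verifiers, and that uses three puzzle environments and a difficulty-aware curriculum to improve the reasoning ability and training stability of vision-language models.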
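As a rough illustration of two ideas named in this entry, the sketch below shows a graded Jigsaw reward (partial credit per correctly placed tile) and a curriculum weight that peaks at medium difficulty. The function names and the `4 * p * (1 - p)` weighting are assumptions made for illustration; the paper's actual reward and weighting functions may differ.

```python
def jigsaw_partial_credit(predicted, target):
    """Graded reward in [0, 1]: fraction of tiles placed at the correct index.

    `predicted` and `target` are tile permutations (hypothetical encoding).
    """
    if len(predicted) != len(target) or not target:
        raise ValueError("permutations must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)


def difficulty_weight(group_rewards):
    """Assumed curriculum weight that peaks when the mean group reward is 0.5.

    With p the mean reward of a rollout group, 4 * p * (1 - p) is 0 for
    puzzles the policy always solves or always fails and 1 for puzzles it
    solves half the time.
    """
    if not group_rewards:
        raise ValueError("group_rewards must be non-empty")
    p = sum(group_rewards) / len(group_rewards)
    return 4.0 * p * (1.0 - p)
```

A weight of this shape down-weights groups whose rewards are all equal, which are also the groups where group-relative advantages vanish.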
Details
Motivation: Existing reinforcement-learning approaches to reasoning in vision-language models depend on costly, noisy annotations or external verifiers, use sparse and flat reward schemes, and often produce reasoning chains that are logically inconsistent with the final answer.
Method: Designs three self-supervised puzzle environments (PatchFit, Rotation, Jigsaw) that yield verifiable rewards; introduces a difficulty-aware curriculum that dynamically weights samples and focuses on medium difficulty; monitors Reasoning-Answer Consistency (RAC) during training and mitigates its decline with consistency-enforcing rewards.
Result: Experiments on Qwen-7B and Qwen-3B show that PC-GRPO improves reasoning quality, training stability, and downstream task accuracy, alleviates reward sparsity and RAC degradation, and that RAC correlates positively with accuracy.
Conclusion: PC-GRPO offers a scalable, verifiable, and interpretable path for RL post-training of vision-language models, significantly improving visual reasoning without external supervision.
Abstract: Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
[54] Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities
Aref Farhadipour,Teodora Vukovic,Volker Dellwo,Petr Motlicek,Srikanth Madikeri
Main category: cs.CV
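TL;DR: Proposes a trimodal person recognition framework based on voice, face, and gesture that uses multi-task learning, cross-attention, and gated fusion, with a confidence-weighted fusion strategy that dynamically handles missing or low-quality modalities; it reaches near-perfect recognition accuracy on CANDOR and VoxCeleb1 and remains robust with only one or two modalities.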
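To make the idea of confidence-weighted fusion with missing modalities concrete, here is a minimal sketch. It assumes each modality produces a list of per-class scores plus a scalar confidence; the function name, the dictionary layout, and the simple normalized weighting are hypothetical and are not taken from the paper.

```python
def fuse_scores(modality_scores, confidences):
    """Confidence-weighted average of per-modality class scores.

    modality_scores: dict mapping modality name -> list of class scores,
                     or None when that modality is missing.
    confidences:     dict mapping modality name -> non-negative confidence.
    Missing or zero-confidence modalities are skipped and the remaining
    weights are renormalized to sum to 1.
    """
    available = {
        name: scores
        for name, scores in modality_scores.items()
        if scores is not None and confidences.get(name, 0.0) > 0.0
    }
    if not available:
        raise ValueError("no usable modality")
    total = sum(confidences[name] for name in available)
    num_classes = len(next(iter(available.values())))
    fused = [0.0] * num_classes
    for name, scores in available.items():
        weight = confidences[name] / total
        for i, value in enumerate(scores):
            fused[i] += weight * value
    return fused


# Example: the face stream is missing, so only voice and gesture contribute.
fused = fuse_scores(
    {"voice": [0.2, 0.8], "face": None, "gesture": [0.6, 0.4]},
    {"voice": 3.0, "face": 1.0, "gesture": 1.0},
)
```

Renormalizing over the available modalities is what lets the same rule cover trimodal, bimodal, and unimodal inputs.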
Details
Motivation: Real-world conditions often involve missing or degraded modalities, and conventional unimodal or simple fusion methods struggle to maintain stable performance, so a multimodal person recognition system that is robust to modality loss is needed.
Method: Uses multi-task learning to process the voice, face, and gesture modalities separately, introduces cross-attention and gated fusion for cross-modal interaction, and designs a confidence-weighted fusion strategy that dynamically adapts to missing or low-quality modalities.
Result: Reaches 99.18% Top-1 accuracy on CANDOR and 99.92% accuracy on VoxCeleb1 in bimodal mode, and maintains high performance with unimodal or bimodal input.
Conclusion: The proposed trimodal framework is robust and accurate across a range of missing-modality scenarios and is suited to real-world person recognition.
Abstract: Person recognition systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a Trimodal person identification framework that integrates voice, face, and gesture modalities, while remaining robust to modality loss. Our approach leverages multi-task learning to process each modality independently, followed by cross-attention and gated fusion mechanisms to facilitate interaction across modalities. Moreover, a confidence-weighted fusion strategy dynamically adapts to missing and low-quality data, ensuring optimal classification even in Unimodal or Bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark for the first time. Our results demonstrate that the proposed Trimodal system achieves 99.18% Top-1 accuracy on person identification tasks, outperforming conventional Unimodal and late-fusion approaches. In addition, we evaluate our model on the VoxCeleb1 dataset as a benchmark and reach 99.92% accuracy in Bimodal mode. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
[55] Where is the Watermark? Interpretable Watermark Detection at the Block Level
Maria Bulychev,Neil G. Marchant,Benjamin I. P. Rubinstein
Main category: cs.CV
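TL;DR: Proposes a post-hoc image watermarking method with region-level interpretability that embeds the watermark signal with a block-wise statistical strategy in the discrete wavelet transform domain and generates detection maps revealing which image regions are watermarked or tampered with, improving transparency and interpretability while retaining strong robustness and imperceptibility.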
Details
Motivation: Most existing image watermarking schemes are black boxes that output only a global detection score and give no explanation of where the watermark is or how it is affected, which undermines user trust and the understanding of tampering.
Method: Embeds the watermark with a block-wise statistical strategy in the discrete wavelet transform (DWT) domain, giving localized embedding and region-level detection maps that reveal which parts of an image are watermarked or modified.
Result: The method is strongly robust to common image transformations, sensitive to semantic manipulation, imperceptible, and robust to cropping of up to half the image; compared with prior methods it offers better interpretability at comparable robustness.
Conclusion: The method markedly improves the interpretability of post-hoc watermarking without sacrificing robustness or transparency, which helps strengthen user trust and identify local tampering.
Abstract: Recent advances in generative AI have enabled the creation of highly realistic digital content, raising concerns around authenticity, ownership, and misuse. While watermarking has become an increasingly important mechanism to trace and protect digital media, most existing image watermarking schemes operate as black boxes, producing global detection scores without offering any insight into how or where the watermark is present. This lack of transparency impacts user trust and makes it difficult to interpret the impact of tampering. In this paper, we present a post-hoc image watermarking method that combines localised embedding with region-level interpretability. Our approach embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This allows us to generate detection maps that reveal which regions of an image are likely watermarked or altered. We show that our method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. At the same time, the watermark remains highly imperceptible. Compared to prior post-hoc methods, our approach offers more interpretable detection while retaining competitive robustness. For example, our watermarks are robust to cropping up to half the image.
[56] Beyond Proximity: A Keypoint-Trajectory Framework for Classifying Affiliative and Agonistic Social Networks in Dairy Cattle
Sibi Parivendan,Kashfia Sailunaz,Suresh Neethirajan
Main category: cs.CV
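TL;DR: This paper proposes a pose-based computational framework that classifies social interactions in livestock by modeling the spatiotemporal geometry of anatomical keypoints, overcoming the inability of conventional static distance-threshold methods to distinguish affiliative from agonistic behavior.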
Details
Motivation: Social behavior assessment in precision livestock farming mostly relies on static proximity thresholds, which cannot separate affiliative from agonistic interactions in complex barn environments and so limit the interpretability of automated social network analysis. A more precise and generalizable way to identify interaction types objectively is needed.
Method: Proposes an end-to-end computer vision pipeline: YOLOv11 for individual detection (mAP@0.50 of 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for estimating 27 anatomical keypoints, pose-derived distance-dynamics features extracted from keypoint trajectories, and a support vector machine classifier that separates affiliative from agonistic behavior.
Result: On data collected at a commercial dairy farm, the pose-only classifier reaches 77.51% accuracy, a substantial gain in behavioral discrimination over a proximity-based baseline, especially for affiliative behavior. The system runs in near real time on commodity hardware.
Conclusion: The study shows that pose-based keypoint trajectories can be used to infer the valence of animal social interactions automatically, giving a workable basis for interaction-aware social networks and advancing automated welfare monitoring in precision livestock farming.
Abstract: Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone.
Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.
[57] Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation
Huaying Zhang,Atsushi Hashimoto,Tosho Hirasawa
Main category: cs.CV
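TL;DR: Proposes a new evaluation protocol for video question generation (VQG) that builds the EgoExoAsk dataset and uses question-to-answer retrieval to simulate question-answering exchanges with experts, in order to assess how well questions elicit unseen knowledge.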
Details
Motivation: Existing VQG evaluation focuses on answering ability rather than on the quality of the questions themselves; this paper aims to quantify how well questions elicit unknown knowledge from experts.
Method: Builds the EgoExoAsk dataset of 27,666 QA pairs, trains a question-to-answer retriever, and sets up a validation-set benchmark on Ego-Exo4D video segments to assess how well VQG models generate high-quality questions.
Result: Experiments show that the proposed metric sensibly separates model performance under different context inputs: models with access to richer context score higher, supporting the validity of the protocol.
Conclusion: The protocol supports continuous improvement of VQG models for eliciting expert knowledge, and the EgoExoAsk dataset is publicly available.
Abstract: Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a topic for video question answering (VideoQA), where questions are generated for given answers. Their evaluation typically focuses on the ability to answer questions, rather than the quality of generated questions. In contrast, we focus on the question quality in eliciting unseen knowledge from human experts. For a continuous improvement of VQG models, we propose a protocol that evaluates the ability by simulating question-answering communication with experts using a question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D's expert commentary annotation. The EgoExoAsk training set is used to obtain the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate our metric reasonably aligns with question generation settings: models accessing richer context are evaluated better, supporting that our protocol works as intended. The EgoExoAsk dataset is available at https://github.com/omron-sinicx/VQG4ExpertKnowledge .
[58] Model Agnostic Preference Optimization for Medical Image Segmentation
Yunseong Nam,Jiwon Jang,Dongkyu Won,Sang Hyun Park,Soopil Kim
Main category: cs.CV
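TL;DR: This paper proposes MAPO, a model-agnostic preference optimization framework for medical image segmentation that uses Dropout to generate stochastic segmentation hypotheses and construct preference-consistent gradients without direct ground-truth supervision, offering good generality and stability.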
Details
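Motivation: Existing preference optimization methods for medical image segmentation are tied to specific models and rely on low-diversity prediction sampling, so they lack generality and robustness.
Method: Proposes the MAPO framework, which generates diverse segmentation hypotheses with Dropout and builds preference-consistent gradients from relative preference signals, optimizing the model without ground-truth label supervision and supporting multiple network architectures and 2D/3D inputs.
Result: Validation on several medical datasets shows that, compared with conventional supervised training, MAPO better preserves boundary accuracy, reduces overfitting, and gives a more stable optimization process.
Conclusion: MAPO is a general and efficient preference optimization method that markedly improves medical image segmentation without ground-truth labels and applies across models and dimensionalities.
Abstract: Preference optimization offers a scalable supervision paradigm based on relative preference signals, yet prior attempts in medical image segmentation remain model-specific and rely on low-diversity prediction sampling. In this paper, we propose MAPO (Model-Agnostic Preference Optimization), a training framework that utilizes Dropout-driven stochastic segmentation hypotheses to construct preference-consistent gradients without direct ground-truth supervision. MAPO is fully architecture- and dimensionality-agnostic, supporting 2D/3D CNN and Transformer-based segmentation pipelines. Comprehensive evaluations across diverse medical datasets reveal that MAPO consistently enhances boundary adherence, reduces overfitting, and yields more stable optimization dynamics compared to conventional supervised training.
[59] MVGSR: Multi-View Consistent 3D Gaussian Super-Resolution via Epipolar Guidance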
Kaizhe Zhang,Shinan Chen,Qian Zhao,Weizhan Zhang,Caixia Yan,Yudeng Xin
Main category: cs.CV
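TL;DR: This paper proposes MVGSR, a new super-resolution framework for 3D Gaussian Splatting that combines camera-pose-based auxiliary view selection with an epipolar-constrained multi-view attention mechanism, introduced here for the first time, to enhance high-frequency detail and cross-view consistency on arbitrarily organized multi-view data, reaching state-of-the-art performance on object-level and scene-level benchmarks.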
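The geometric idea behind an epipolar constraint can be sketched in a few lines: a pixel in the target view maps to a line in an auxiliary view, and only auxiliary pixels near that line are geometrically consistent candidates. The helper names, the pixel-distance threshold `max_dist`, and the use of a hard boolean mask are assumptions for illustration; how MVGSR actually injects the constraint into attention is not specified in this entry.

```python
import math


def epipolar_line(F, point):
    """Return line coefficients (a, b, c) of l = F @ [x, y, 1] in the other view."""
    x, y = point
    return tuple(F[r][0] * x + F[r][1] * y + F[r][2] for r in range(3))


def distance_to_line(line, point):
    """Perpendicular pixel distance from `point` to the line a*x + b*y + c = 0."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)


def epipolar_mask(F, query, candidates, max_dist=2.0):
    """True for auxiliary-view pixels within `max_dist` of the query's epipolar line."""
    line = epipolar_line(F, query)
    return [distance_to_line(line, p) <= max_dist for p in candidates]


# Fundamental matrix of a pure horizontal camera shift: epipolar lines are rows.
F_horizontal = [[0.0, 0.0, 0.0],
                [0.0, 0.0, -1.0],
                [0.0, 1.0, 0.0]]
mask = epipolar_mask(F_horizontal, (5.0, 3.0), [(100.0, 3.0), (7.0, 10.0), (0.0, 4.5)])
```

In an attention layer, a mask like this would typically be applied to the attention logits so that a query only attends to keys near its epipolar line.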
Details
Motivation: 3D Gaussian Splatting models trained on low-resolution images are poorly suited to high-resolution rendering. Existing super-resolution methods lack cross-view consistency or the ability to fuse multi-view information, and their requirement for sequential input makes them hard to apply to unstructured multi-view data.
Method: Proposes the MVGSR framework, comprising a camera-pose-based auxiliary view selection method and an epipolar-constrained multi-view attention mechanism, to aggregate information effectively from non-sequential multi-view data and improve geometric consistency and detail fidelity.
Result: Achieves the best performance on several object-level and scene-level 3DGS super-resolution benchmarks, clearly outperforming single-image and video-based methods, particularly in cross-view consistency and high-frequency detail recovery.
Conclusion: MVGSR addresses high-resolution rendering of 3DGS from low-resolution inputs; epipolar-constrained multi-view attention and a flexible view selection strategy make efficient use of unstructured multi-view data and advance 3DGS super-resolution.
Abstract: Scenes reconstructed by 3D Gaussian Splatting (3DGS) trained on low-resolution (LR) images are unsuitable for high-resolution (HR) rendering. Consequently, a 3DGS super-resolution (SR) method is needed to bridge LR inputs and HR rendering. Early 3DGS SR methods rely on single-image SR networks, which lack cross-view consistency and fail to fuse complementary information across views. More recent video-based SR approaches attempt to address this limitation but require strictly sequential frames, limiting their applicability to unstructured multi-view datasets. In this work, we introduce Multi-View Consistent 3D Gaussian Splatting Super-Resolution (MVGSR), a framework that focuses on integrating multi-view information for 3DGS rendering with high-frequency details and enhanced consistency. We first propose an Auxiliary View Selection Method based on camera poses, making our method adaptable for arbitrarily organized multi-view datasets without the need for temporal continuity or data reordering. Furthermore, we introduce, for the first time, an epipolar-constrained multi-view attention mechanism into 3DGS SR, which serves as the core of our proposed multi-view SR network. This design enables the model to selectively aggregate consistent information from auxiliary views, enhancing the geometric consistency and detail fidelity of 3DGS representations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both object-centric and scene-level 3DGS SR benchmarks.
[60] Asynchronous Event Stream Noise Filtering for High-frequency Structure Deformation Measurement
Yifei Bian,Banglei Guan,Zibin Liu,Ang Su,Shiyao Zhu,Yang Shang,Qifeng Yu
Main category: cs.CV
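TL;DR: Proposes a high-frequency deformation measurement method based on an event camera and LED markers that exploits event-stream characteristics to filter noise and extract fast-moving LED markers, enabling a monocular event camera to accurately measure high-frequency planar deformation of large-scale structures.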
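Spatiotemporal-correlation filtering of an event stream can be illustrated with a generic neighbor-support filter: an event is kept only if a neighboring pixel fired shortly before it. This is a common baseline for event denoising and is offered as an assumption-laden sketch, not as the paper's filter; the `(x, y, t)` event layout, `window_us`, and `radius` are hypothetical.

```python
def filter_events(events, window_us=1000, radius=1):
    """Keep events supported by a recent event at a neighboring pixel.

    events: iterable of (x, y, t) tuples with t in microseconds.
    An event is kept if any other pixel within `radius` produced an event
    at most `window_us` earlier. Isolated events, and the first event of
    any cluster, are treated as noise and dropped.
    """
    last_seen = {}  # (x, y) -> timestamp of the most recent event at that pixel
    kept = []
    for x, y, t in sorted(events, key=lambda e: e[2]):
        supported = False
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                if dx == 0 and dy == 0:
                    continue  # a pixel does not support itself
                prev = last_seen.get((x + dx, y + dy))
                if prev is not None and t - prev <= window_us:
                    supported = True
        if supported:
            kept.append((x, y, t))
        last_seen[(x, y)] = t
    return kept
```

A blinking LED marker produces dense bursts of events at adjacent pixels, so its events tend to pass this test, while isolated sensor noise does not.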
Details
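Motivation: Traditional high-speed cameras are limited under harsh lighting and are expensive, making it hard to measure high-frequency deformation of large structures, so a more robust and lower-cost measurement method is needed.
Method: Uses an event camera to capture the event stream from blinking LED markers, filters observation noise using the temporal and spatial correlation of events, and separates motion-induced events from LED blinking events to extract fast-moving LED markers, then reconstructs high-frequency planar deformation with a monocular event camera.
Result: Experiments show the method measures high-frequency planar deformation accurately, confirming its effectiveness and precision under complex lighting and highly dynamic conditions.
Conclusion: The method offers an efficient, low-cost solution for monitoring high-frequency deformation of large structures in complex environments and has good application prospects.
Abstract: Large-scale structures suffer high-frequency deformations due to complex loads. However, harsh lighting conditions and high equipment costs limit measurement methods based on traditional high-speed cameras. This paper proposes a method to measure high-frequency deformations by exploiting an event camera and LED markers. Firstly, observation noise is filtered based on the characteristics of the event stream generated by LED markers blinking and spatiotemporal correlation. Then, LED markers are extracted from the event stream after differentiating between motion-induced events and events from LED blinking, which enables the extraction of high-speed moving LED markers. Ultimately, high-frequency planar deformations are measured by a monocular event camera. Experimental results confirm the accuracy of our method in measuring high-frequency planar deformations.
[61] Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank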
Chenxiao Zhang,Runshi Zhang,Junchen Wang
Main category: cs.CV
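TL;DR: Proposes a medical ultrasound video segmentation method based on a memory bank and a wavelet filtering and fusion network that improves segmentation accuracy for small lesions and organ boundaries.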
Details
Motivation: Ultrasound video has low contrast and heavy noise, which causes missegmented organ boundaries and lost small targets, and object tracking in long videos remains challenging.
Method: Adopts an encoder-decoder structure with memory-based wavelet convolution, cascaded wavelet compression, and a memory module using cross-attention, and designs a high-frequency-aware feature fusion module to enhance boundary detail.
Result: Outperforms existing methods on four ultrasound datasets, with particularly strong results on small thyroid nodule segmentation.
Conclusion: The method improves segmentation of small targets and boundaries in ultrasound video and is suited to long-video analysis.
Abstract: Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundaries, which may lead to small object losses and increase boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category, detailed information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long videos. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests conducted on four ultrasound video datasets (two thyroid nodule datasets, a thyroid gland dataset, and a heart dataset) compared with the state-of-the-art methods, our method demonstrates marked improvements in segmentation metrics.
In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long video. The code is available at https://github.com/XiAooZ/MWNet.
[62] PMMD: A pose-guided multi-view multi-modal diffusion for person generation
Ziyu Shang,Haoran Liu,Rongchao Zhang,Zhiqian Wei,Tongtong Feng
Main category: cs.CV
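TL;DR: This paper proposes Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that generates realistic, consistent person images conditioned on multi-view references, pose maps, and text prompts.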
Details
Motivation: Existing methods for virtual try-on, image editing, and digital human generation often suffer from occlusion, garment style drift, and pose misalignment, so a new method is needed that improves image consistency, detail preservation, and controllability.
Method: Proposes the PMMD framework, with a multimodal encoder that jointly models multi-view visual information, pose features, and semantic descriptions, a ResCVA module that enhances local detail while preserving global structure, and a cross-modal fusion module that integrates image and text semantics throughout denoising.
Result: Experiments on the DeepFashion MultiModal dataset show PMMD outperforms representative baselines in image consistency, detail preservation, and controllability.
Conclusion: PMMD addresses key challenges in person image generation and markedly improves generation quality and multimodal conditional control.
Abstract: Generating consistent human images with controllable pose and appearance is essential for applications in virtual try-on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross-modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross-modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at https://github.com/ZANMANGLOOPYE/PMMD.
[63] Uni-Parser Technical Report
Xi Fang,Haoyi Tao,Shuwen Yang,Suyang Zhong,Haocheng Lu,Han Lyu,Chaozheng Huang,Xinyu Li,Linfeng Zhang,Guolin Ke
Main category: cs.CV
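TL;DR: Uni-Parser is an industrial-grade document parsing engine for scientific literature and patents that uses a modular multi-expert architecture to deliver efficient, accurate, and scalable cross-modal parsing.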
Details
Motivation: Conventional pipeline-based document parsing struggles to maintain fine-grained cross-modal alignment and scales poorly, so it cannot meet the needs of large-scale scientific literature processing.
Method: Proposes a modular, loosely coupled multi-expert architecture that jointly parses multimodal content including text, equations, tables, figures, and chemical structures, and adds adaptive GPU load balancing, distributed inference, and dynamic module orchestration to optimize performance.
Result: Processes up to 20 PDF pages per second on 8 NVIDIA RTX 4090D GPUs, with high throughput, high accuracy, and low cost.
Conclusion: Uni-Parser is scalable and practical for large-scale parsing of scientific literature and patents, and supports downstream applications from information retrieval to training AI4Science models.
Abstract: This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.
[64] Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets
Jialong Zuo,Haoyou Deng,Hanyu Zhou,Jiaxin Zhu,Yicheng Zhang,Yiwei Zhang,Yongxin Yan,Kaixing Huang,Weisen Chen,Yongtai Deng,Rui Jin,Nong Sang,Changxin Gao
Main category: cs.CV
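TL;DR: This study examines the zero-shot performance of the text-to-image generation model Nano Banana Pro on traditional low-level vision tasks and finds that it surpasses specialist models in subjective visual quality but still trails them on reference-based quantitative metrics.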
Details
Motivation: To explore whether advanced text-to-image models such as Nano Banana Pro can serve as general-purpose tools for traditional low-level vision problems rather than only for content creation.
Method: Runs a zero-shot evaluation across 14 low-level vision tasks and 40 diverse datasets, using simple text prompts without fine-tuning, and compares Nano Banana Pro against state-of-the-art specialist models.
Result: Nano Banana Pro excels in subjective visual quality, often generating plausible high-frequency detail that surpasses specialist models, but performs worse on traditional quantitative metrics that require pixel-level consistency.
Conclusion: Nano Banana Pro can be regarded as a promising zero-shot contender for low-level vision tasks, but matching the high fidelity of domain-specific models remains a major challenge.
Abstract: The rapid evolution of text-to-image generation models has revolutionized visual content creation. While commercial products like Nano Banana Pro have garnered significant attention, their potential as generalist solvers for traditional low-level vision challenges remains largely underexplored. In this study, we investigate the critical question: Is Nano Banana Pro a Low-Level Vision All-Rounder? We conducted a comprehensive zero-shot evaluation across 14 distinct low-level tasks spanning 40 diverse datasets. By utilizing simple textual prompts without fine-tuning, we benchmarked Nano Banana Pro against state-of-the-art specialist models. Our extensive analysis reveals a distinct performance dichotomy: while Nano Banana Pro demonstrates superior subjective visual quality, often hallucinating plausible high-frequency details that surpass specialist models, it lags behind in traditional reference-based quantitative metrics. We attribute this discrepancy to the inherent stochasticity of generative models, which struggle to maintain the strict pixel-level consistency required by conventional metrics. This report identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while highlighting that achieving the high fidelity of domain specialists remains a significant hurdle.
[65] 3DProxyImg: Controllable 3D-Aware Animation Synthesis from Single Image via 2D-3D Aligned Proxy Embedding
Yupeng Zhu,Xiongzhen Zhang,Ye Chen,Bingbing Ni
Main category: cs.CV
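TL;DR: Proposes a lightweight 3D animation framework that decouples geometric control from appearance synthesis through a 2D-3D aligned proxy representation, achieving good 3D controllability while maintaining high-quality rendering.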
Details
Motivation: Traditional 3D animation production is time-consuming and labor-intensive, and existing AIGC methods trade off rendering quality against 3D control, making it hard to combine efficiency with interactivity.
Method: Uses a coarse 3D estimate as a structural carrier and combines it with learned image-space generative priors that handle high-fidelity appearance and view synthesis, modeling geometry and appearance separately.
Result: The method generates animation efficiently on low-power devices and outperforms video-based 3D animation methods, with better identity preservation, geometric and textural consistency, and finer interactive control.
Conclusion: The framework balances quality and control in single-image 3D animation generation and offers a new approach to lightweight, interactive 3D animation.
Abstract: 3D animation is central to modern visual media, yet traditional production pipelines remain labor-intensive, expertise-demanding, and computationally expensive. Recent AIGC-based approaches partially automate asset creation and rigging, but they either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. We focus on single-image 3D animation generation and argue that progress is fundamentally constrained by a trade-off between rendering quality and 3D control. To address this limitation, we propose a lightweight 3D animation framework that decouples geometric control from appearance synthesis. The core idea is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This proxy formulation enables 3D-aware motion control and interaction comparable to classical pipelines, without requiring accurate geometry or expensive optimization, and naturally extends to coherent background animation. Extensive experiments demonstrate that our method achieves efficient animation generation on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control it offers to users.
[66] Borrowing from anything: A generalizable framework for reference-guided instance editing
Shengxiao Zhou,Chenghua Li,Jianhao Huang,Qinghao Hu,Yifan Zhang
Main category: cs.CV
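TL;DR: Proposes the GENIE framework, which achieves explicit disentanglement in reference-guided instance editing through spatial alignment, adaptive residual scaling, and progressive attention fusion modules, markedly improving fidelity and robustness.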
Details
Motivation: To address the difficulty of separating appearance from attributes caused by semantic entanglement in reference-guided instance editing, clarifying what information should be borrowed from the reference and how it should be applied.
Method: Designs the GENIE framework, comprising a Spatial Alignment Module (SAM) that corrects spatial misalignment, an Adaptive Residual Scaling Module (ARSM) that amplifies intrinsic features and suppresses extrinsic attributes, and a Progressive Attention Fusion (PAF) mechanism that renders the appearance onto the target while preserving its structure.
Result: Experiments on the AnyInsertion dataset show GENIE reaches state-of-the-art fidelity and robustness.
Conclusion: With its explicit disentanglement mechanism, GENIE sets a new standard for disentanglement-based instance editing.
Abstract: Reference-guided instance editing is fundamentally limited by semantic entanglement, where a reference's intrinsic appearance is intertwined with its extrinsic attributes. The key challenge lies in disentangling what information should be borrowed from the reference, and determining how to apply it appropriately to the target. To tackle this challenge, we propose GENIE, a Generalizable Instance Editing framework capable of achieving explicit disentanglement. GENIE first corrects spatial misalignments with a Spatial Alignment Module (SAM). Then, an Adaptive Residual Scaling Module (ARSM) learns what to borrow by amplifying salient intrinsic cues while suppressing extrinsic attributes, while a Progressive Attention Fusion (PAF) mechanism learns how to render this appearance onto the target, preserving its structure. Extensive experiments on the challenging AnyInsertion dataset demonstrate that GENIE achieves state-of-the-art fidelity and robustness, setting a new standard for disentanglement-based instance editing.
[67] Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning
Mengshi Qi,Yeteng Wu,Xianlin Zhang,Huadong Ma
Main category: cs.CV
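TL;DR: This paper proposes a new Human Action Form Assessment (AFA) task and builds CoT-AFA, a dataset of large-scale fitness and martial arts videos, introducing a Chain-of-Thought explanation paradigm to provide explainable feedback. It also proposes the Explainable Fitness Assessor (EFA) framework, which fuses visual and semantic information through a dual-stream structure and a dynamic gating mechanism, improving action classification, quality assessment, and explanation generation.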
Details
Motivation: Existing video understanding methods focus on the category and location of an action and cannot assess whether it is performed to standard; existing datasets lack labels for the degree of action standardization and lack explainable, detailed feedback, so they fall short of practical needs.
Method: Proposes the Human Action Form Assessment (AFA) task, builds the CoT-AFA dataset with a Chain-of-Thought explanation paradigm, and designs the Explainable Fitness Assessor (EFA) framework, which fuses visual and semantic information with a dual-stream network and a dynamic gating mechanism to judge actions, explain the reasons, and suggest improvements.
Result: Improves explanation generation (CIDEr +16.0%), action classification (accuracy +2.7%), and quality assessment (accuracy +2.1%).
Conclusion: The AFA task, CoT-AFA dataset, and EFA framework offer a new solution for assessing action standardization, with good explainability and application potential, and move video understanding toward deeper analysis.
Abstract: Evaluating whether a human action is performed to standard and providing reasonable feedback to improve action standardization is crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what and where the action is, and so cannot meet these requirements. Meanwhile, most of the existing datasets lack the labels indicating the degree of action standardization, and the action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task, and introduce a new diverse dataset CoT-AFA, which contains a large scale of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process, from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities.
The experimental results demonstrate that our method has achieved improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy) and quality assessment (+2.1% in accuracy), revealing great potential of CoT-AFA for future studies. Our dataset and source code are available at https://github.com/MICLAB-BUPT/EFA.
[68] EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
Jiaxu Wan,Xu Wang,Mengwei Xie,Hang Zhang,Mu Xu,Yang Han,Hong Zhang,Ding Yuan,Yifan Yang
Main category: cs.CV
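TL;DR: This paper proposes EagleVision, a dual-stage framework for enhancing spatial intelligence that achieves progressive spatial cognition through macro perception and micro verification, reaching state-of-the-art results among open-source vision-language models on VSI-Bench.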
Details
Motivation: Existing methods that fuse 3D and 2D reasoning suffer from weak spatial consistency, limited viewpoint diversity, and untraceable evidence chains, and lack support for global spatial perception, for associating 3D hypotheses with video frames, and for spatially grounded reward mechanisms.
Method: Proposes the EagleVision framework. The macro perception stage uses a semantics-perspective-fusion DPP (SPF-DPP) to select geometry- and semantics-aware keyframes from long videos; the micro verification stage formalizes spatial chain-of-thought as BEV-grounded querying and trains with reinforcement learning using a spatial grounding reward.
Result: On the VSI-Bench benchmark, EagleVision achieves state-of-the-art performance among open-source vision-language models, showing strong spatial understanding and generalization.
Conclusion: EagleVision's dual-stage design addresses key challenges in spatial reasoning and achieves traceable, consistent, and efficient multi-view spatial cognition.
Abstract: Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for "thinking with images" (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views.
On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
[69] Cross-modal ultra-scale learning with tri-modalities of renal biopsy images for glomerular multi-disease auxiliary diagnosis
Kaixing Long,Danyi Weng,Yun Mi,Zhentai Zhang,Yanmeng Lu,Jian Geng,Zhitao Zhou,Liming Zhong,Qianjin Feng,Wei Yang,Lei Cao
Main category: cs.CV
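TL;DR: Proposes a cross-modal ultra-scale learning network (CMUS-Net) for multi-modal classification based on three types of renal biopsy images, enabling automatic and accurate diagnosis of multiple glomerular diseases.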
Details
Motivation: Existing methods struggle to fuse the multi-scale features of nanoscale transmission electron microscopy (TEM) images and microscale optical or immunofluorescence microscopy images, which limits the accuracy of glomerular multi-disease identification.
Method: Designs a sparse multi-instance learning module to aggregate TEM image features and a cross-modal scale attention module to promote feature interaction across modalities, and combines multiple loss functions to optimize multi-modal feature fusion and classification.
Result: On an in-house dataset, CMUS-Net reaches an accuracy of 95.37 ± 2.41%, an AUC of 99.05 ± 0.53%, and an F1-score of 95.32 ± 2.41%, outperforming mainstream multi-modal and multi-scale methods, and generalizes well to membranous nephropathy staging.
Conclusion: CMUS-Net addresses feature fusion for multi-scale, multi-modal renal biopsy images and is the first to classify multiple glomerular diseases automatically from images of three modalities and two scales, showing potential for clinical auxiliary diagnosis.
Abstract: Constructing a multi-modal automatic classification model based on three types of renal biopsy images can assist pathologists in glomerular multi-disease identification. However, the substantial scale difference between transmission electron microscopy (TEM) image features at the nanoscale and optical microscopy (OM) or immunofluorescence microscopy (IM) images at the microscale poses a challenge for existing multi-modal and multi-scale models in achieving effective feature fusion and improving classification accuracy. To address this issue, we propose a cross-modal ultra-scale learning network (CMUS-Net) for the auxiliary diagnosis of multiple glomerular diseases. CMUS-Net utilizes multiple ultrastructural information to bridge the scale difference between nanometer and micrometer images. Specifically, we introduce a sparse multi-instance learning module to aggregate features from TEM images. Furthermore, we design a cross-modal scale attention module to facilitate feature interaction, enhancing pathological semantic information. Finally, multiple loss functions are combined, allowing the model to weigh the importance among different modalities and achieve precise classification of glomerular diseases. Our method follows the conventional process of renal biopsy pathology diagnosis and, for the first time, performs automatic classification of multiple glomerular diseases including IgA nephropathy (IgAN), membranous nephropathy (MN), and lupus nephritis (LN) based on images from three modalities and two scales.
On an in-house dataset, CMUS-Net achieves an ACC of 95.37+/-2.41%, an AUC of 99.05+/-0.53%, and an F1-score of 95.32+/-2.41%. Extensive experiments demonstrate that CMUS-Net outperforms other well-known multi-modal or multi-scale methods and show its generalization capability in staging MN. Code is available at https://github.com/SMU-GL-Group/MultiModal_lkx/tree/main.[70] Criticality Metrics for Relevance Classification in Safety Evaluation of Object Detection in Automated Driving
Jörg Gamerdinger,Sven Teufel,Stephan Amann,Oliver Bringmann
Main category: cs.CV
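TL;DR: Presents the first in-depth analysis of criticality metrics for the safety evaluation of object detection systems in automated driving, and proposes two new strategies, bidirectional criticality rating and multi-metric aggregation, that substantially improve criticality classification accuracy.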
Details
Motivation: To ensure the safety of automated driving, criticality metrics that distinguish relevant from non-relevant objects are needed to accurately evaluate perception systems in safety-critical scenarios.
Method: Existing criticality metrics are identified through a literature review and empirically validated on the DeepAccident dataset; two new application strategies, bidirectional criticality rating and multi-metric aggregation, are proposed.
Result: The proposed approach improves criticality classification accuracy by up to 100%.
Conclusion: The improved criticality metric strategies can significantly strengthen the safety evaluation of object detection systems for automated driving.
Abstract: Ensuring safety is the primary objective of automated driving, which necessitates a comprehensive and accurate perception of the environment. While numerous performance evaluation metrics exist for assessing perception capabilities, incorporating safety-specific metrics is essential to reliably evaluate object detection systems. A key component for safety evaluation is the ability to distinguish between relevant and non-relevant objects - a challenge addressed by criticality or relevance metrics. This paper presents the first in-depth analysis of criticality metrics for safety evaluation of object detection systems. Through a comprehensive review of existing literature, we identify and assess a range of applicable metrics. Their effectiveness is empirically validated using the DeepAccident dataset, which features a variety of safety-critical scenarios. To enhance evaluation accuracy, we propose two novel application strategies: bidirectional criticality rating and multi-metric aggregation. Our approach demonstrates up to a 100% improvement in terms of criticality classification accuracy, highlighting its potential to significantly advance the safety evaluation of object detection systems in automated vehicles.
[71] Robust and Calibrated Detection of Authentic Multimedia Content
Sarim Hashmi,Abdelrahman Elsayed,Mohammed Talha Alam,Samuele Poppi,Nils Lukas
Main category: cs.CV
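TL;DR: Proposes a resynthesis-based framework for high-precision verification of media authenticity, achieving controllable, low false positive rates and stronger robustness against compute-restricted adversaries.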
Details
Motivation: Existing deepfake detection methods are unreliable: telling authentic from fake content after the fact is often impossible, and detectors are easily evaded by adversaries, so a more reliable verification mechanism is needed.
Method: A calibrated resynthesis framework verifies authenticity by testing whether a sample can be plausibly reconstructed; it uses state-of-the-art inversion techniques and supports multiple modalities.
Result: The method is the most reliable at verifying authentic samples while keeping false positive rates low, significantly outperforms prior methods under the same compute budget, and is robust against efficient adversaries.
Conclusion: The resynthesis framework offers a more reliable and robust response to deepfakes, particularly for authenticity verification in the high-precision, low-recall setting.
Abstract: Generative models can synthesize highly realistic content, so-called deepfakes, that are already being misused at scale to undermine digital media authenticity. Current deepfake detection methods are unreliable for two reasons: (i) distinguishing inauthentic content post-hoc is often impossible (e.g., with memorized samples), leading to an unbounded false positive rate (FPR); and (ii) detection lacks robustness, as adversaries can adapt to known detectors with near-perfect accuracy using minimal computational resources. To address these limitations, we propose a resynthesis framework to determine if a sample is authentic or if its authenticity can be plausibly denied. We make two key contributions focusing on the high-precision, low-recall setting against efficient (i.e., compute-restricted) adversaries. First, we demonstrate that our calibrated resynthesis method is the most reliable approach for verifying authentic samples while maintaining controllable, low FPRs. Second, we show that our method achieves adversarial robustness against efficient adversaries, whereas prior methods are easily evaded under identical compute budgets. Our approach supports multiple modalities and leverages state-of-the-art inversion techniques.
[72] ERIENet: An Efficient RAW Image Enhancement Network under Low-Light Environment
Jianan Wang,Yang Hong,Hesong Li,Tao Wang,Songrong Liu,Ying Fu
Main category: cs.CV
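TL;DR: Proposes an efficient RAW image enhancement network (ERIENet) that processes multi-scale information in parallel and exploits the rich information in the green channel to improve image quality in low-light environments, with higher efficiency and faster processing than existing methods.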
Details
Motivation: Existing RAW-based low-light enhancement methods usually process multi-scale information sequentially, which makes lightweight models and high processing speeds hard to achieve, and they overlook the advantage of the green channel in RAW images.
Method: An efficient multi-scale fully-parallel architecture with a novel channel-aware residual dense block extracts feature maps; a green channel guidance branch exploits the rich information in the green channels of the input RAW image to guide reconstruction.
Result: Experiments show that ERIENet outperforms state-of-the-art methods on commonly used datasets and processes 4K-resolution images at over 146 frames per second on a single NVIDIA GeForce RTX 3090.
Conclusion: Through parallel processing and use of the green channel, ERIENet markedly improves processing efficiency while preserving reconstruction quality, offering an effective solution for real-time low-light RAW image enhancement.
Abstract: RAW images have shown superior performance to sRGB images in many image processing tasks, especially for low-light image enhancement. However, most existing methods for RAW-based low-light enhancement process multi-scale information sequentially, which makes it difficult to achieve lightweight models and high processing speeds. Besides, they usually ignore the green channel superiority of RAW images, and fail to achieve better reconstruction performance with good use of green channel information. In this work, we propose an efficient RAW Image Enhancement Network (ERIENet), which processes multi-scale information in parallel with efficient convolution modules, and takes advantage of rich information in green channels to guide the reconstruction of images. Firstly, we introduce an efficient multi-scale fully-parallel architecture with a novel channel-aware residual dense block to extract feature maps, which reduces computational costs and achieves real-time processing speed. Secondly, we introduce a green channel guidance branch to exploit the rich information within the green channels of the input RAW image. It increases the quality of reconstruction results with few parameters and computations. Experiments on commonly used low-light image enhancement datasets show that ERIENet outperforms state-of-the-art methods in enhancing low-light RAW images with higher efficiency. It also achieves an optimal speed of over 146 frames per second (FPS) for 4K-resolution images on a single NVIDIA GeForce RTX 3090 with 24 GB of memory.
[73] TBC: A Target-Background Contrast Metric for Low-Altitude Infrared and Visible Image Fusion
Yufeng Xie
Main category: cs.CV
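TL;DR: Proposes TBC, an evaluation metric for infrared and visible image fusion based on Weber's Law. By focusing on the relative contrast of salient targets, it avoids the tendency of traditional no-reference metrics to mistake noise for detail in low-light environments; experiments on the DroneVehicle dataset show that it agrees better with human perception and is more reliable.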
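As a rough illustration of the Weber-style idea in the summary above, the sketch below computes a relative target-to-background contrast for a grayscale image and a target mask. The function name `weber_contrast`, the mask input, and the exact formula (absolute mean difference divided by the background mean) are assumptions made for illustration; the paper's actual TBC definition may differ.

```python
from statistics import mean

def weber_contrast(image, mask, eps=1e-6):
    """Relative contrast between masked target pixels and the background.

    image: 2D list of grayscale intensities.
    mask:  2D list of 0/1 flags, 1 marking target pixels (hypothetical input).
    """
    pairs = [(p, m) for row, mrow in zip(image, mask) for p, m in zip(row, mrow)]
    target = [p for p, m in pairs if m]
    background = [p for p, m in pairs if not m]
    if not target or not background:
        raise ValueError("mask must contain both target and background pixels")
    # Weber-style ratio: intensity difference relative to the background level.
    return abs(mean(target) - mean(background)) / (mean(background) + eps)
```

Note that this plain ratio does not by itself penalize background noise, which the abstract says TBC does; any noise term would be a further assumption.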
Details
Motivation: In complex low-light environments, traditional no-reference metrics such as EN and AG tend to mistake high-frequency sensor noise for valid detail, falling into a "Noise Trap" that misguides fusion algorithms.
Method: Inspired by Weber's Law, the Target-Background Contrast (TBC) metric focuses on the relative contrast of salient targets rather than global statistics, penalizing background noise and rewarding target visibility.
Result: Experiments on the DroneVehicle dataset show that TBC matches human visual perception better than traditional metrics and evaluates fusion quality in low-altitude scenes more accurately.
Conclusion: TBC is a more reliable, perception-aligned metric for image fusion, particularly suited to complex low-light settings such as low-altitude UAV reconnaissance.
Abstract: Infrared and visible image fusion is a pivotal technology in low-altitude UAV reconnaissance missions, providing high-quality data support for downstream tasks such as target detection and tracking by integrating thermal saliency with background texture details. However, traditional no-reference metrics, such as Entropy (EN) and Average Gradient (AG), fail in complex low-light environments. They often misinterpret high-frequency sensor noise as valid detail. This creates a "Noise Trap," paradoxically assigning higher scores to noisy images and misguiding fusion algorithms. To address this, we propose the Target-Background Contrast (TBC) metric. Inspired by Weber's Law, TBC focuses on the relative contrast of salient targets rather than global statistics. Unlike traditional metrics, TBC penalizes background noise and rewards target visibility. Experiments on the DroneVehicle dataset demonstrate that TBC aligns better with human perception and provides a reliable standard for low-altitude scenarios.
[74] From Camera to World: A Plug-and-Play Module for Human Mesh Transformation
Changhai Ma,Ziyu Wu,Yunkang Zhang,Qijun Ying,Boyan Liu,Xiaohui Cai
Main category: cs.CV
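TL;DR: Proposes Mesh-Plug, a module that estimates camera rotation parameters from RGB images and depth maps to accurately transform reconstructed 3D human meshes from camera coordinates to world coordinates.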
Details
Motivation: Existing methods assume zero camera rotation, which causes significant errors when the mesh is transformed to world coordinates, and they lack an accurate estimate of camera rotation.
Method: A human-centered approach trains a camera rotation prediction module on RGB images and depth maps rendered from the initial mesh, and a mesh adjustment module refines the root joint orientation and body pose.
Result: Experiments on the SPEC-SYN and SPEC-MTP datasets show that the method outperforms current state-of-the-art methods.
Conclusion: Mesh-Plug improves the accuracy of 3D human mesh reconstruction in world coordinates and works as a plug-and-play module.
Abstract: Reconstructing accurate 3D human meshes in the world coordinate system from in-the-wild images remains challenging due to the lack of camera rotation information. While existing methods achieve promising results in the camera coordinate system by assuming zero camera rotation, this simplification leads to significant errors when transforming the reconstructed mesh to the world coordinate system. To address this challenge, we propose Mesh-Plug, a plug-and-play module that accurately transforms human meshes from camera coordinates to world coordinates. Our key innovation lies in a human-centered approach that leverages both RGB images and depth maps rendered from the initial mesh to estimate camera rotation parameters, eliminating the dependency on environmental cues. Specifically, we first train a camera rotation prediction module that focuses on the human body's spatial configuration to estimate camera pitch angle. Then, by integrating the predicted camera parameters with the initial mesh, we design a mesh adjustment module that simultaneously refines the root joint orientation and body pose. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on the benchmark datasets SPEC-SYN and SPEC-MTP.
[75] SLCFormer: Spectral-Local Context Transformer with Physics-Grounded Flare Synthesis for Nighttime Flare Removal
Xiyu Zhu,Wei Wang,Xin Yuan,Xiao Wang
Main category: cs.CV
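TL;DR: Proposes SLCFormer, a novel spectral-local context transformer framework for removing nighttime lens flare. The method combines global modeling in the frequency domain with local structural enhancement in the spatial domain and introduces a ZernikeVAE-based scatter flare generation pipeline, achieving superior performance in complex real-world scenes.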
Details
Motivation: Existing methods handle nonuniformly distributed flares poorly, which limits their use under complex lighting; a method that captures both global frequency-domain features and local spatial structure is needed to improve flare removal.
Method: SLCFormer comprises a Frequency Fourier and Excitation Module (FFEM) that models the global context of flares in the frequency domain and a Directionally-Enhanced Spatial Module (DESM) that strengthens local structure and directional features; a ZernikeVAE-based generation pipeline synthesizes physically realistic scatter flares for training.
Result: Experiments on the Flare7K++ dataset show that the method outperforms existing methods in both quantitative metrics and visual quality, and generalizes well to complex real nighttime scenes.
Conclusion: By fusing frequency-domain and spatial-domain information with a physics-inspired data generation strategy, SLCFormer substantially improves the removal of nonuniform nighttime flares and its practical applicability.
Abstract: Lens flare is a common nighttime artifact caused by strong light sources scattering within camera lenses, leading to hazy streaks, halos, and glare that degrade visual quality. However, existing methods usually fail to effectively address nonuniform scattered flares, which severely reduces their applicability to complex real-world scenarios with diverse lighting conditions. To address this issue, we propose SLCFormer, a novel spectral-local context transformer framework for effective nighttime lens flare removal. SLCFormer integrates two key modules: the Frequency Fourier and Excitation Module (FFEM), which captures efficient global contextual representations in the frequency domain to model flare characteristics, and the Directionally-Enhanced Spatial Module (DESM), which provides local structural enhancement and directional features in the spatial domain for precise flare removal. Furthermore, we introduce a ZernikeVAE-based scatter flare generation pipeline to synthesize physically realistic scatter flares with spatially varying PSFs, bridging optical physics and data-driven training. Extensive experiments on the Flare7K++ dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches in both quantitative metrics and perceptual visual quality, and generalizing robustly to real nighttime scenes with complex flare artifacts.
[76] Null-LoRA: Low-Rank Adaptation on Null Space
Yi Zhang,Yulei Kang,Haoxuan Chen,Jinxuan Li,ian-Fang Hu
Main category: cs.CV
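TL;DR: Proposes Null-LoRA, a null-space-based low-rank adaptation method that freezes portions of the low-rank matrices and constrains the incremental update to the null space, improving parameter efficiency and effectiveness; it outperforms existing methods on image-text retrieval and visual question answering.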
Details
Motivation: Existing low-rank adaptation methods fine-tune over the full parameter space, which is redundant; pre-trained models have non-trivial null spaces that can be exploited to make fine-tuning more efficient.
Method: Null-LoRA freezes part of the low-rank matrices and constrains the entire incremental update to the model's null space, making more efficient use of parameters.
Result: On image-text retrieval and visual question answering, Null-LoRA surpasses state-of-the-art methods with fewer parameters.
Conclusion: By exploiting the null-space structure of pre-trained models, Null-LoRA reduces redundancy and improves the efficiency and performance of low-rank fine-tuning.
Abstract: Parameter-efficient fine-tuning methods have gained considerable popularity for adapting large-scale models to downstream tasks, particularly LoRA and its variants. Existing methods perform low-rank adaptation over the full parameter space. However, fine-tuning within a subspace can achieve comparable effectiveness. Inspired by the observation that pre-trained models possess non-trivial null spaces, we propose Null-space based Low-Rank Adaptation (Null-LoRA). Null-LoRA effectively reduces redundancy and enhances effective rank by freezing portions of the low-rank matrices. To further improve parameter efficiency, Null-LoRA constrains the entire incremental update within the null space, maximizing the utilization of incremental updates to adapt to new task paradigms. Null-LoRA surpasses the state of the art with fewer parameters in extensive experiments across image-text retrieval and visual question answering tasks.
[77] Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification
Yupeng Zhang,Adam G. Dunn,Usman Naseem,Jinman Kim
Main category: cs.CV
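TL;DR: Proposes CMAC-MMD, a training framework that reduces diagnostic bias across intersectional patient subgroups in medical AI while improving overall diagnostic performance.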
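The entry reports fairness as a gap in True Positive Rate (ΔTPR) across intersectional subgroups. A minimal sketch of such a gap is given below, assuming binary labels and defining the gap as the largest minus the smallest per-subgroup TPR; that max-minus-min definition, the function name `tpr_gap`, and the subgroup keys are illustrative assumptions, not the paper's stated procedure.

```python
from collections import defaultdict

def tpr_gap(y_true, y_pred, groups):
    """Return (gap, per-group TPR), where gap = max TPR - min TPR across subgroups."""
    tp = defaultdict(int)   # true positives per subgroup
    pos = defaultdict(int)  # actual positives per subgroup
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            pos[g] += 1
            tp[g] += int(p == 1)
    # Subgroups with no positive cases are skipped, since TPR is undefined there.
    tpr = {g: tp[g] / pos[g] for g in pos}
    return max(tpr.values()) - min(tpr.values()), tpr
```

A subgroup key can be any hashable value, for example a tuple such as `("60+", "female")` for an intersectional group.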
Details
Motivation: Existing medical AI systems show systematic bias across intersectional patient subgroups, yielding lower diagnostic confidence for marginalised groups, and current fairness interventions often sacrifice overall performance or rely on sensitive demographic data.
Method: The Cross-Modal Alignment Consistency (CMAC-MMD) training framework reduces bias by standardising diagnostic confidence across intersectional subgroups, without requiring sensitive demographic information at inference time. It is evaluated on skin lesion and glaucoma detection datasets, with performance stratified by intersectional age, gender, and race.
Result: In the dermatology cohort, the method reduces the intersectional missed diagnosis gap (ΔTPR) from 0.50 to 0.26 and raises AUC from 0.94 to 0.97; for glaucoma screening, ΔTPR falls from 0.41 to 0.31 and AUC reaches 0.72 (baseline 0.71).
Conclusion: CMAC-MMD provides a scalable framework for clinical decision support systems that are both accurate and equitable across populations, without increasing privacy risks.
Abstract: Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model's decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, ΔTPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced ΔTPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.
[78] Emotion Recognition in Signers
Kotaro Funakoshi,Yaoxiong Zhu
Main category: cs.CV
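TL;DR: Proposes a cross-lingual approach to two challenges in emotion recognition for signers: the overlap between grammatical and affective facial expressions, and the scarcity of training data. Using the eJSL and BOBSL datasets, it improves recognition through textual emotion recognition, temporal segment selection, and hand motion features, and establishes a stronger baseline than spoken-language LLMs.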
Details
Motivation: Emotion recognition in signers faces a theoretical challenge, the overlap between grammatical and affective facial expressions, and a practical one, the scarcity of data.
Method: Experiments are run in a cross-lingual setting using eJSL (a Japanese Sign Language dataset) and BOBSL (a British Sign Language dataset), combining textual emotion recognition, temporal segment selection, and hand motion information.
Result: 1) Textual emotion recognition mitigates data scarcity in sign language; 2) temporal segment selection has a significant impact on performance; 3) adding hand motion improves emotion recognition; the final system outperforms spoken-language LLMs.
Conclusion: Cross-lingual transfer combined with multimodal fusion improves emotion recognition in signers and offers a workable approach to low-resource sign language emotion analysis.
Abstract: Recognition of signers' emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally, we establish a stronger baseline than spoken language LLMs.
[79] Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models
Kuinan Hou,Jing Mi,Marco Zorzi,Lamberto Ballan,Alberto Testolin
Main category: cs.CV
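TL;DR: Compares state-of-the-art specialized counting architectures with vision-language models (VLMs) on counting objects in visual scenes. VLMs can match or even surpass specialized models in some cases, especially when prompted to produce intermediate representations such as locations and labels, but they remain limited in complex scenes.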
Details
Motivation: To explore whether general-purpose vision-language models can serve as a flexible alternative for open-set object counting, overcoming the reliance of traditional methods on data annotated with predefined categories.
Method: VLMs and specialized counting models are compared systematically on two popular counting datasets and a newly built benchmark with fine-grained control over test images, and the effect of prompting for intermediate representations on counting accuracy is assessed.
Result: Most VLMs can approximately enumerate the objects in a scene, matching or surpassing specialized architectures; prompting for object locations and verbal labels significantly improves accuracy; however, all models perform poorly in complex visual scenes.
Conclusion: Although VLMs show promise for open-set counting, no current model counts reliably in complex realistic scenes, so further research is needed.
Abstract: Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures. Notably, enumeration accuracy significantly improves when VLMs are prompted to generate intermediate representations (i.e., locations and verbal labels) of each object to be counted. Nevertheless, none of the models can reliably count the number of objects in complex visual scenes, showing that further research is still needed to create AI systems that can reliably deploy counting procedures in realistic environments.
[80] VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
Hongbo Zhao,Meng Wang,Fei Zhu,Wenzhuo Liu,Bolin Ni,Fanhu Zeng,Gaofeng Meng,Zhaoxiang Zhang
Main category: cs.CV
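TL;DR: Introduces VTCBench, the first benchmark for vision-text compression (VTC), and systematically evaluates vision-language models on long-context understanding. Although existing models decode the text well, they generally struggle with long-range dependencies in the dense information produced by VTC.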
Details
Motivation: Vision-text compression (VTC) can substantially reduce the compute and memory overhead of large language models, but the effect of its high information density on the long-context understanding of vision-language models (VLMs) is unclear and needs systematic evaluation.
Method: The VTCBench benchmark comprises three tasks, VTC-Retrieval, VTC-Reasoning, and VTC-Memory, and VTCBench-Wild simulates diverse input scenarios; leading open-source and proprietary VLMs are evaluated comprehensively.
Result: Most VLMs show poor long-context understanding of VTC-compressed information and fail to capture long-range associations or dependencies, even though they handle OCR-style text decoding well.
Conclusion: The study exposes the limits of current VLMs on densely compressed context and provides a foundation for designing more efficient and scalable VLMs.
Abstract: The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
[81] MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement
Yingying Wang,Xuanhua He,Chen Wu,Jialing Huang,Suiyun Zhang,Rui Liu,Xinghao Ding,Haoxuan Che
Main category: cs.CV
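TL;DR: Proposes MMMamba, a new framework for pan-sharpening built on the Mamba architecture and a multimodal interleaved scanning mechanism, achieving efficient cross-modal fusion and supporting zero-shot image super-resolution.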
Details
Motivation: Traditional CNN methods for pan-sharpening are constrained by fixed convolutional operators and adapt poorly to diverse spatial and spectral variations; cross-attention is computationally inefficient and can dilute fine-grained correspondences. A more efficient and direct form of cross-modal fusion is needed.
Method: MMMamba builds on the Mamba architecture and replaces cross-attention with in-context conditioning, giving strong cross-modal interaction at linear complexity; a multimodal interleaved (MI) scanning mechanism promotes information exchange between PAN and MS images.
Result: Experiments across multiple tasks and benchmarks show that MMMamba outperforms existing state-of-the-art methods while remaining computationally efficient.
Conclusion: Through cross-modal in-context fusion and MI scanning, MMMamba provides an efficient and flexible solution for pan-sharpening, with zero-shot super-resolution capability.
Abstract: Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.
[82] SynthSeg-Agents: Multi-Agent Synthetic Data Generation for Zero-Shot Weakly Supervised Semantic Segmentation
Wangyu Wu,Zhenhong Chen,Xiaowei Huang,Fei Ma,Jimin Xiao
Main category: cs.CV
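TL;DR: Introduces a new direction, zero-shot weakly supervised semantic segmentation (ZSWSSS), and a multi-agent framework, SynthSeg Agents, that uses large language models to generate synthetic training data entirely without real images, achieving competitive results on PASCAL VOC 2012 and COCO 2014.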
Details
Motivation: Existing weakly supervised semantic segmentation methods depend on real image data; even when generative models are used for augmentation, they remain tied to real samples. This work aims to remove the dependence on real training images entirely.
Method: The SynthSeg Agents framework contains a Self Refine Prompt Agent and an Image Generation Agent. The former produces high-quality text prompts through iterative refinement, memory mechanisms, and CLIP-based similarity and diversity filtering; the latter synthesizes images with vision-language models, selects high-quality samples with a frozen CLIP scorer, and relabels them with a ViT-based classifier to improve semantic precision.
Result: On PASCAL VOC 2012 and COCO 2014, models trained only on synthetic data reach performance competitive with existing methods, showing that training without real images is feasible.
Conclusion: SynthSeg Agents demonstrates the potential of LLM-driven synthetic data generation for weakly supervised semantic segmentation and points toward low-cost, scalable pixel-level prediction.
Abstract: Weakly Supervised Semantic Segmentation (WSSS) with image level labels aims to produce pixel level predictions without requiring dense annotations. While recent approaches have leveraged generative models to augment existing data, they remain dependent on real world training samples. In this paper, we introduce a novel direction, Zero Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and propose SynthSeg Agents, a multi agent framework driven by Large Language Models (LLMs) to generate synthetic training data entirely without real images. SynthSeg Agents comprises two key modules, a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent autonomously crafts diverse and semantically rich image prompts via iterative refinement, memory mechanisms, and prompt space exploration, guided by CLIP based similarity and nearest neighbor diversity filtering. These prompts are then passed to the Image Generation Agent, which leverages Vision Language Models (VLMs) to synthesize candidate images. A frozen CLIP scoring model is employed to select high quality samples, and a ViT based classifier is further trained to relabel the entire synthetic dataset with improved semantic precision. Our framework produces high quality training data without any real image supervision. Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg Agents achieves competitive performance without using real training images. This highlights the potential of LLM driven agents in enabling cost efficient and scalable semantic segmentation.
[83] KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird's-Eye-View Segmentation
Wenke E,Yixin Sun,Jiaxu Liu,Hubert P. H. Shum,Amir Atapour-Abarghouei,Toby P. Breckon
Main category: cs.CV
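TL;DR: Proposes the first cross-modality distillation framework for single-panoramic-camera Bird's-Eye-View (BEV) segmentation. Using a LiDAR image representation and a voxel-aligned view transformer, it transfers knowledge through a teacher-student setup, markedly improving performance while lowering sensor cost.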
Details
Motivation: To reduce the dependence of BEV segmentation in autonomous driving on complex multi-sensor setups and lower cost, while improving BEV segmentation from a single panoramic camera.
Method: A new LiDAR image representation (fusing range, intensity, and ambient channels) is combined with a voxel-aligned view transformer; a high-capacity LiDAR-camera fusion teacher network distills knowledge across modalities into a student network that relies only on a single 360-degree panoramic camera.
Result: On the Dur360BEV dataset, the teacher model improves IoU by 25.6% over existing methods, and the student model gains 8.5% IoU with a state-of-the-art inference speed of 31.2 FPS; experiments on KITTI-360 show that the framework generalizes well.
Conclusion: The proposed cross-modality distillation framework reduces sensor complexity and deployment cost, offering a practical and robust solution for low-cost, efficient BEV segmentation in the real world.
Abstract: We present the first cross-modality distillation framework specifically tailored for single-panoramic-camera Bird's-Eye-View (BEV) segmentation. Our approach leverages a novel LiDAR image representation fused from range, intensity and ambient channels, together with a voxel-aligned view transformer that preserves spatial fidelity while enabling efficient BEV processing. During training, a high-capacity LiDAR and camera fusion Teacher network extracts both rich spatial and semantic features for cross-modality knowledge distillation into a lightweight Student network that relies solely on a single 360-degree panoramic camera image. Extensive experiments on the Dur360BEV dataset demonstrate that our teacher model significantly outperforms existing camera-based BEV segmentation methods, achieving a 25.6% IoU improvement. Meanwhile, the distilled Student network attains competitive performance with an 8.5% IoU gain and state-of-the-art inference speed of 31.2 FPS. Moreover, evaluations on KITTI-360 (two fisheye cameras) confirm that our distillation framework generalises to diverse camera setups, underscoring its feasibility and robustness. This approach reduces sensor complexity and deployment costs while providing a practical solution for efficient, low-cost BEV segmentation in real-world autonomous driving.
[84] Automated Motion Artifact Check for MRI (AutoMAC-MRI): An Interpretable Framework for Motion Artifact Detection and Severity Assessment
Antony Jerald,Dattesh Shanbhag,Sudhanya Chatterjee
Main category: cs.CV
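TL;DR: Proposes AutoMAC-MRI, an interpretable framework for grading motion artifacts across multiple MRI contrasts and orientations. The method combines supervised contrastive learning with grade-specific affinity scores to provide transparent, interpretable quality control.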
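The grade-specific affinity scores mentioned above quantify how close an image's embedding lies to each motion grade. One plausible way to sketch this, purely as an assumption, is a softmax over negative Euclidean distances to per-grade centroids; the function name `affinity_scores`, the centroid representation, and the softmax form are hypothetical and not taken from the paper.

```python
import math

def affinity_scores(embedding, centroids):
    """Softmax over negative Euclidean distances to each grade centroid.

    centroids: dict mapping grade label -> centroid vector (hypothetical).
    Returns a dict of scores that sum to 1; closer grades score higher.
    """
    dists = {g: math.dist(embedding, c) for g, c in centroids.items()}
    nearest = min(dists.values())  # shift distances for numerical stability
    weights = {g: math.exp(-(d - nearest)) for g, d in dists.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

Reporting the full score vector, rather than only the top grade, is what would make a borderline assignment visible to a reviewer.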
Details
Motivation: Existing MRI quality assessment methods are mostly limited to binary decisions and offer little interpretability, falling short of the clinical need for fine-grained grading of motion artifact severity.
Method: Supervised contrastive learning is used to learn a discriminative representation of motion severity; within this feature space, an affinity score is computed for each grade to quantify an image's proximity to that grade, yielding interpretable grading.
Result: On more than 5000 expert-annotated brain MRI slices, the affinity scores agree closely with expert judgment and reflect the severity of motion artifacts.
Conclusion: By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI supports inline MRI quality control and may help reduce unnecessary rescans and improve workflow efficiency.
Abstract: Motion artifacts degrade MRI image quality and increase patient recalls. Existing automated quality assessment methods are largely limited to binary decisions and provide little interpretability. We introduce AutoMAC-MRI, an explainable framework for grading motion artifacts across heterogeneous MR contrasts and orientations. The approach uses supervised contrastive learning to learn a discriminative representation of motion severity. Within this feature space, we compute grade-specific affinity scores that quantify an image's proximity to each motion grade, thereby making grade assignments transparent and interpretable. We evaluate AutoMAC-MRI on more than 5000 expert-annotated brain MRI slices spanning multiple contrasts and views. Experiments assessing affinity scores against expert labels show that the scores align well with expert judgment, supporting their use as an interpretable measure of motion severity. By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI enables inline MRI quality control, with the potential to reduce unnecessary rescans and improve workflow efficiency.
[85] Prototypical Learning Guided Context-Aware Segmentation Network for Few-Shot Anomaly Detection
Yuxin Jiang,Yunkang Cao,Weiming Shen
Main category: cs.CV
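TL;DR: Proposes PCSNet, a new network for few-shot anomaly detection (FSAD) that uses prototypical learning guidance and context-aware segmentation to narrow the domain gap between pre-trained features and the target domain, improving anomaly detection performance.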
Details
Motivation: Existing FSAD methods rely on pre-trained features but overlook the domain gap between those features and the target scenario, which hurts detection.
Method: PCSNet comprises a prototypical feature adaption (PFA) sub-network and a context-aware segmentation (CAS) sub-network. PFA uses prototypical features to make normal-sample features more compact and to separate anomalies, with a pixel-level disparity classification loss that highlights subtle anomalies; CAS uses pseudo anomalies during training to achieve pixel-level localization.
Result: In the 8-shot setting, PCSNet reaches 94.9% and 80.2% image-level AUROC on MVTec and MPDD, respectively, and also performs well in a real-world automotive plastic part inspection application.
Conclusion: PCSNet effectively mitigates the domain gap and achieves superior anomaly detection and localization under few-shot conditions.
Abstract: Few-shot anomaly detection (FSAD) denotes the identification of anomalies within a target category with a limited number of normal samples. Existing FSAD methods largely rely on pre-trained feature representations to detect anomalies, but the inherent domain gap between pre-trained representations and target FSAD scenarios is often overlooked. This study proposes a Prototypical Learning Guided Context-Aware Segmentation Network (PCSNet) to address the domain gap, thereby improving feature descriptiveness in target scenarios and enhancing FSAD performance. In particular, PCSNet comprises a prototypical feature adaption (PFA) sub-network and a context-aware segmentation (CAS) sub-network. PFA extracts prototypical features as guidance to ensure better feature compactness for normal data and distinct separation from anomalies. A pixel-level disparity classification loss is also designed to make subtle anomalies more distinguishable. Then a CAS sub-network is introduced for pixel-level anomaly localization, where pseudo anomalies are exploited to facilitate the training process. Experimental results on MVTec and MPDD demonstrate the superior FSAD performance of PCSNet, with 94.9% and 80.2% image-level AUROC in an 8-shot scenario, respectively. Real-world applications on automotive plastic part inspection further demonstrate that PCSNet can achieve promising results with limited training samples. Code is available at https://github.com/yuxin-jiang/PCSNet.
[86] MECAD: A multi-expert architecture for continual anomaly detection
Malihe Dahmardeh,Francesco Setti
Main category: cs.CV
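TL;DR: Proposes MECAD, a continual anomaly detection method based on a multi-expert architecture. Through dynamic expert assignment, optimized coreset selection, and a replay buffer mechanism, it learns incrementally and efficiently while reducing forgetting.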
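The summary above says experts are assigned to object classes by feature similarity. The sketch below shows one simple reading of that step: route a class to the expert whose prototype vector has the highest cosine similarity to the class feature. The prototype-per-expert representation, the use of cosine similarity, and both function names are assumptions for illustration, since the abstract does not specify the similarity measure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_expert(class_feature, expert_prototypes):
    """Return the index of the expert whose prototype is most similar."""
    sims = [cosine(class_feature, p) for p in expert_prototypes]
    return max(range(len(sims)), key=sims.__getitem__)
```

For example, `assign_expert([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0]])` selects the second expert (index 1), whose prototype points in nearly the same direction as the class feature.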
Details
Motivation: In industrial settings product types keep changing, so anomaly detection must learn new categories continually without retraining the whole model, while preserving detection performance on existing categories.
Method: A multi-expert architecture dynamically assigns experts to object classes based on feature similarity, and combines optimized coreset sampling with a dedicated replay buffer for memory management to support incremental learning.
Result: On the MVTec AD dataset, the optimal 5-expert configuration achieves an average AUROC of 0.8259 across 15 object categories and significantly reduces knowledge degradation.
Conclusion: MECAD balances computational efficiency, retention of specialized knowledge, and adaptability, making it suitable for industrial scenarios with evolving product types.
Abstract: In this paper we propose MECAD, a novel approach for continual anomaly detection using a multi-expert architecture. Our system dynamically assigns experts to object classes based on feature similarity and employs efficient memory management to preserve the knowledge of previously seen classes. By leveraging an optimized coreset selection and a specialized replay buffer mechanism, we enable incremental learning without requiring full model retraining. Our experimental evaluation on the MVTec AD dataset demonstrates that the optimal 5-expert configuration achieves an average AUROC of 0.8259 across 15 diverse object categories while significantly reducing knowledge degradation compared to single-expert approaches. This framework balances computational efficiency, specialized knowledge retention, and adaptability, making it well-suited for industrial environments with evolving product types.
[87] A Masked Reverse Knowledge Distillation Method Incorporating Global and Local Information for Image Anomaly Detection
Yuxin Jiang,Yunkang Can,Weiming Shen
Main category: cs.CV
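TL;DR: This paper proposes MRKD, a new knowledge distillation method for image anomaly detection and localization that mitigates overgeneralization through image-level and feature-level masking and achieves strong performance on the MVTec dataset.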
Details
Motivation: Existing knowledge distillation methods for image anomaly detection suffer from overgeneralization because the input and supervisory signals are too similar, which limits detection capability. Method: Proposes masked reverse knowledge distillation (MRKD), which combines image-level masking (ILM) and feature-level masking (FLM) to turn image reconstruction into an image restoration task, strengthening the modeling of global and local information. Result: On the MVTec dataset, MRKD reaches 98.9% image-level AU-ROC, 98.4% pixel-level AU-ROC, and 95.3% AU-PRO, and ablation experiments confirm that it effectively mitigates overgeneralization. Conclusion: Through its dual masking strategy, MRKD markedly improves anomaly detection performance, suppresses overgeneralization, and strengthens the model's ability to capture image context. Abstract: Knowledge distillation is an effective image anomaly detection and localization scheme. However, a major drawback of this scheme is its tendency to overly generalize, primarily due to the similarities between input and supervisory signals. In order to address this issue, this paper introduces a novel technique called masked reverse knowledge distillation (MRKD). By employing image-level masking (ILM) and feature-level masking (FLM), MRKD transforms the task of image reconstruction into image restoration. Specifically, ILM helps to capture global information by differentiating input signals from supervisory signals. On the other hand, FLM incorporates synthetic feature-level anomalies to ensure that the learned representations contain sufficient local information. With these two strategies, MRKD is endowed with stronger image context capture capacity and is less prone to overgeneralization. Experiments on the widely-used MVTec anomaly detection dataset demonstrate that MRKD achieves impressive performance: image-level 98.9% AU-ROC, pixel-level 98.4% AU-ROC, and 95.3% AU-PRO. In addition, extensive ablation experiments have validated the superiority of MRKD in mitigating the overgeneralization problem.
[88] Vision-based module for accurately reading linear scales in a laboratory
Parvesh Saini,Soumyadipta Maiti,Beena Rai
Main category: cs.CV
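TL;DR: This paper proposes a method that mimics how humans read measurements from linear scales, intended to enable autonomous robot operation in laboratory environments, in particular reading liquid levels on syringes and measuring cylinders.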
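The final step this entry describes, turning the detected major markers, their digits, and the level indicator location into a reading, can be pictured as linear interpolation along the scale. The following is a minimal sketch under that assumption; the function name, the `(pixel_position, value)` input format, and the sample numbers are hypothetical and are not taken from the paper.

```python
def read_linear_scale(markers, level_px):
    """Interpolate a reading from labeled major markers.

    markers  -- list of (pixel_position, value) pairs, e.g. the pixel row of
                each major tick and the number printed next to it
                (hypothetical input format, assumed for illustration)
    level_px -- pixel position of the level indicator along the same axis
    """
    pts = sorted(markers)
    if len(pts) < 2:
        raise ValueError("need at least two labeled markers")
    if any(a[0] == b[0] for a, b in zip(pts, pts[1:])):
        raise ValueError("marker pixel positions must be distinct")

    # Find the pair of neighbouring markers that brackets the indicator.
    for (p0, v0), (p1, v1) in zip(pts, pts[1:]):
        if p0 <= level_px <= p1:
            break
    else:
        # Indicator lies outside the labeled range: extrapolate from the
        # nearest pair of markers.
        if level_px < pts[0][0]:
            (p0, v0), (p1, v1) = pts[0], pts[1]
        else:
            (p0, v0), (p1, v1) = pts[-2], pts[-1]

    # Linear interpolation between the two markers.
    return v0 + (level_px - p0) * (v1 - v0) / (p1 - p0)


# Example: ticks labeled 0, 1, 2 at pixel rows 100, 200, 300.
reading = read_linear_scale([(100, 0.0), (200, 1.0), (300, 2.0)], 250)
```

Sorting by pixel position means the same code handles scales whose values decrease along the image axis, which is the usual case when a cylinder is photographed upright.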
Details
Motivation: For robots to have human-like capabilities in unstructured laboratory environments, they need to read quantitative measurements from instruments accurately, yet few vision models currently have this ability. Method: Corrects the pose of randomly oriented syringes through image transformations, narrows the region of interest to the linear-scale portion, and extracts features such as the major scale markers, the corresponding digits, and the level indicator position, from which the final reading is computed. Result: System readings agree closely with human readings, validating the method's accuracy, efficiency, and robustness. Conclusion: The method effectively automates the reading of linear scales and offers a feasible solution for robot perception in laboratory automation. Abstract: The capabilities and number of vision-based models are increasing rapidly, and these models can now perform tasks such as object detection, image classification, and instance segmentation with great accuracy. But models that can take accurate quantitative measurements from an image, as a human can do just by looking at it, are rare. For a robot to work with complete autonomy in a laboratory environment, it needs basic skills such as navigation, handling objects, and preparing samples to match human-like capabilities in an unstructured environment. Another important capability is reading measurements from instruments and apparatus. Here, we try to mimic a human-inspired approach to reading measurements from a linear scale. As a test case, we pick reading the level from a syringe and a measuring cylinder. For a randomly oriented syringe, we carry out transformations to correct the orientation. To make the system efficient and robust, the area of interest is reduced to just the part of the image containing the linear scale. After that, a series of features is extracted, such as the major markers, the corresponding digits, and the level indicator location, from which the final reading is calculated. Readings obtained using this system were also compared against human-read values of the same instances, and an accurate correspondence was observed.
[89] Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics
Junjie Chen,Fei Wang,Zhihao Huang,Qing Zhou,Kun Li,Dan Guo,Linfeng Zhang,Xun Yang
Main category: cs.CV
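TL;DR: TIMAR is a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts, improving temporal coherence and expressive diversity.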
Details
Motivation: Existing methods often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, which leads to temporal incoherence across turns and makes it hard to realistically simulate the bidirectional dynamics of human communication. Method: Proposes the TIMAR framework, which uses a turn-level interleaved masked autoregressive mechanism to fuse the multimodal information within each turn and a lightweight diffusion head to predict continuous 3D head motion, modeling both coordination and expressive variability. Result: On the DualTalk benchmark, TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set and achieves similar gains on out-of-distribution data. Conclusion: TIMAR effectively models the bidirectional dynamics of 3D head motion in conversation, with good generalization and potential for practical application. Abstract: Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.
[90] Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
Shiran Ge,Chenyi Huang,Yuang Ai,Qihang Fan,Huaibo Huang,Ran He
Main category: cs.CV
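TL;DR: This paper proposes Pro-GRPO, a new framework that improves the efficiency and performance of Group Relative Policy Optimization (GRPO) by dynamically pruning low-value trajectories during sampling, addressing the trade-off between computational cost and optimization benefit in large-group sampling.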
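To make the reward clustering idea in this entry concrete, the sketch below computes group-relative advantages (rewards standardized within a sampled group, the usual GRPO normalization) and then keeps a high-variance subset of a group. The subset rule here is an assumed stand-in, not the paper's Optimal Variance Filtering: it simply tries every split of `k` between the lowest and highest rewards and keeps the split with the largest variance. Function names and sample rewards are hypothetical.

```python
from statistics import mean, pstdev, pvariance


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Rewards clustered near the group mean get advantages near zero,
    # so those trajectories contribute little to the update.
    return [(r - mu) / (sigma + eps) for r in rewards]


def high_variance_subset(rewards, k):
    """Pick k trajectory indices whose rewards have high variance.

    Illustrative heuristic (an assumption, not the paper's OVF): take
    `low` of the smallest rewards and `k - low` of the largest, for every
    possible `low`, and keep the combination with the largest variance.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(rewards)")
    order = sorted(range(n), key=lambda i: rewards[i])
    best_idx, best_var = None, -1.0
    for low in range(k + 1):
        high = k - low
        idx = order[:low] + order[n - high:] if high else order[:low]
        var = pvariance([rewards[i] for i in idx])
        if var > best_var:
            best_idx, best_var = idx, var
    return sorted(best_idx)


# Example group: three trajectories sit at the mean reward, two are extreme.
rewards = [0.5, 0.5, 0.5, 0.0, 1.0, 0.52]
kept = high_variance_subset(rewards, k=2)
advantages = group_relative_advantages([rewards[i] for i in kept])
```

This version filters after all rewards are known, which corresponds to the static, post-sampling setting the abstract contrasts with Pro-GRPO's pruning during sampling.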
Details
Motivation: GRPO is effective for aligning generative models, but its performance is limited by the high computational cost of large group sizes, and many trajectories have clustered rewards with little optimization value, wasting resources. Method: Proposes the Pro-GRPO framework, which incorporates latent-feature-driven dynamic trajectory pruning to terminate reward-clustered trajectories early during sampling; it adopts an "Expand-and-Prune" strategy that first enlarges the initial sampling group to increase diversity and then applies multi-step Optimal Variance Filtering (OVF) to the latents for pruning. Result: Experiments show that Pro-GRPO significantly reduces computational overhead on both diffusion and flow models while outperforming conventional GRPO and static OVF, achieving higher sample efficiency and model performance. Conclusion: Through dynamic pruning and the Expand-and-Prune strategy, Pro-GRPO effectively eases the tension between computational cost and optimization effectiveness in GRPO, offering a new direction for efficient reinforcement learning alignment. Abstract: Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate the trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF), and verify that a high-variance subset of trajectories selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still incurs substantial computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging its efficiency, Pro-GRPO employs an "Expand-and-Prune" strategy. This strategy first expands the size of the initial sampling group to maximize trajectory diversity, then applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.
[91] SemanticBridge -- A Dataset for 3D Semantic Segmentation of Bridges and Domain Gap Analysis
Maximilian Kellner,Mariana Ferrandon Cervantes,Yuandong Pan,Ruodan Lu,Ioannis Brilakis,Alexander Reiterer
Main category: cs.CV
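TL;DR: Proposes a new dataset for 3D semantic segmentation of bridges and analysis of sensor-induced domain gaps, evaluates three state-of-the-art 3D deep learning models, and finds that sensor variation causes a performance drop of up to 11.4% mIoU.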
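The mIoU figure quoted in this entry is the mean, over classes, of intersection over union between predicted and ground-truth labels. A minimal sketch of that metric for per-point labels follows, assuming flat label lists and that classes absent from both prediction and ground truth are skipped (conventions differ between benchmarks); the function name and toy labels are illustrative only.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union over the classes that occur."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    ious = []
    for c in range(num_classes):
        # Points labeled c in both lists (intersection) or in either (union).
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union:  # skip classes absent from both (assumed convention)
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class from range(num_classes) occurs in the labels")
    return sum(ious) / len(ious)


# Toy example with two classes and four points.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

A sensor domain gap of the kind the entry reports would then be the difference between this score on data from the training sensor and on data from a different sensor.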
Details
Motivation: To address the domain gap caused by different sensors in infrastructure inspection and to improve the automation and accuracy of structural health monitoring for bridges. Method: Builds a dataset of high-resolution 3D scans of bridges from multiple countries with fine-grained semantic labels, evaluates it with three state-of-the-art 3D deep learning architectures, and uses multi-sensor data to quantify the domain gap. Result: All models perform robustly on the task, but sensor differences cause a performance drop of up to 11.4% mIoU. Conclusion: The dataset helps advance research on bridge semantic segmentation and highlights the importance of cross-sensor domain generalization. Abstract: We propose a novel dataset that has been specifically designed for 3D semantic segmentation of bridges and the domain gap analysis caused by varying sensors. This addresses a critical need in the field of infrastructure inspection and maintenance, which is essential for modern society. The dataset comprises high-resolution 3D scans of a diverse range of bridge structures from various countries, with detailed semantic labels provided for each. Our initial objective is to facilitate accurate and automated segmentation of bridge components, thereby advancing the structural health monitoring practice. To evaluate the effectiveness of existing 3D deep learning models on this novel dataset, we conduct a comprehensive analysis of three distinct state-of-the-art architectures. Furthermore, we present data acquired through diverse sensors to quantify the domain gap resulting from sensor variations. Our findings indicate that all architectures demonstrate robust performance on the specified task. However, the domain gap can potentially lead to a performance decline of up to 11.4% mIoU.
[92] See It Before You Grab It: Deep Learning-based Action Anticipation in Basketball
Arnau Barrera Roy,Albert Clapés Sintes
Main category: cs.CV
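TL;DR: This paper introduces the task of predicting which team will gain possession after a shot in basketball games, presents a large dataset of 100,000 video clips and more than 2,000 annotated rebound events, and applies deep learning methods to action anticipation in this setting for the first time, advancing sports video understanding.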
Details
Motivation: Despite progress in player and ball tracking, pose estimation, and related areas, predicting future actions in sports events has received little study, and rebound possession prediction in basketball in particular lacks public datasets and systematic research. Method: Builds a large-scale basketball video dataset containing more than 300 hours of broadcast footage and more than 2,000 manually annotated rebound events; benchmarks state-of-the-art deep learning models for action anticipation and explores two auxiliary tasks, rebound classification and rebound spotting. Result: Experiments confirm both the feasibility and the difficulty of rebound prediction, show how existing deep learning models perform on the task, and demonstrate that the dataset supports multiple basketball video understanding tasks. Conclusion: The work provides a new dataset and benchmark for action anticipation in basketball and demonstrates the potential of predictive modeling in dynamic multi-agent sports scenarios, with applications in real-time broadcasting and post-game analysis tools. Abstract: Computer vision and video understanding have transformed sports analytics by enabling large-scale, automated analysis of game dynamics from broadcast footage. Despite significant advances in player and ball tracking, pose estimation, action localization, and automatic foul recognition, anticipating actions before they occur in sports videos has received comparatively little attention. This work introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball following a shot attempt. To benchmark this task, a new self-curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events is presented. Comprehensive baseline results are reported using state-of-the-art action anticipation methods, representing the first application of deep learning techniques to basketball rebound prediction. Additionally, two complementary tasks, rebound classification and rebound spotting, are explored, demonstrating that this dataset supports a wide range of video understanding applications in basketball, for which no comparable datasets currently exist. Experimental results highlight both the feasibility and inherent challenges of anticipating rebounds, providing valuable insights into predictive modeling for dynamic multi-agent sports scenarios. By forecasting team possession before rebounds occur, this work enables applications in real-time automated broadcasting and post-game analysis tools to support decision-making.
[93] SMART: Semantic Matching Contrastive Learning for Partially View-Aligned Clustering
Liang Peng,Yixuan Ye,Cheng Liu,Hangjun Che,Fei Wang,Zhiwen Yu,Si Wu,Hau-San Wong
Main category: cs.CV
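TL;DR: Proposes SMART, a semantic matching contrastive learning model for partially view-aligned clustering (PVC) that mitigates cross-view distribution shift to exploit semantic relationships in both aligned and unaligned data, outperforming existing methods on eight benchmark datasets.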
Details
Motivation: Existing PVC methods fail to fully exploit the shared semantics in unaligned data, and multi-view heterogeneity causes distribution shifts in representations that reduce the accuracy of cross-view feature matching. Method: Proposes the SMART model, which adopts a semantic matching contrastive learning strategy to mitigate cross-view distribution shift and strengthen the modeling of semantic consistency among samples in aligned and unaligned data. Result: Experiments on eight benchmark datasets show that SMART significantly and consistently outperforms existing PVC methods in clustering performance. Conclusion: SMART effectively addresses distribution shift and underused semantics in PVC, improving multi-view clustering. Abstract: Multi-view clustering has been empirically shown to improve learning performance by leveraging the inherent complementary information across multiple views of data. However, in real-world scenarios, collecting strictly aligned views is challenging, and learning from both aligned and unaligned data becomes a more practical solution. Partially View-aligned Clustering (PVC) aims to learn correspondences between misaligned view samples to better exploit the potential consistency and complementarity across views, including both aligned and unaligned data. However, most existing PVC methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster. Moreover, the inherent heterogeneity of multi-view data induces distributional shifts in representations, leading to inaccuracies in establishing meaningful correspondences between cross-view latent features and, consequently, impairing learning effectiveness. To address these challenges, we propose a Semantic MAtching contRasTive learning model (SMART) for PVC. The main idea of our approach is to alleviate the influence of cross-view distributional shifts, thereby facilitating semantic matching contrastive learning to fully exploit semantic relationships in both aligned and unaligned data. Extensive experiments on eight benchmark datasets demonstrate that our method consistently outperforms existing approaches on the PVC problem.
[94] Preserving Marker Specificity with Lightweight Channel-Independent Representation Learning
Simon Gutwein,Arthur Longuefosse,Jun Seita,Sabine Taschner-Mandl,Roxane Licandro
Main category: cs.CV
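TL;DR: This paper investigates whether, for multiplexed tissue imaging data, preserving marker independence with a shallow architecture suits self-supervised representation learning better than scaling up the model. The authors propose a new lightweight channel-independent model (CIM-S) and show on a Hodgkin lymphoma CODEX dataset that it outperforms conventional early-fusion CNNs, especially in rare-cell identification.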
Details
Motivation: Existing deep learning models usually apply early channel fusion to multiplexed protein marker data, assuming shared structure across markers, but this can discard marker-specific information and makes rare cell types especially hard to distinguish. An inductive bias better suited to the characteristics of multiplex data is therefore needed. Method: Proposes a new shallow Channel-Independent Model (CIM-S) that preserves the independence of each protein marker and compares it with standard early-fusion CNNs. Using a Hodgkin lymphoma CODEX dataset (145,000 cells, 49 markers), model performance is assessed through contrastive pretraining and linear evaluation across multiple self-supervised frameworks and augmentation settings. Result: Early-fusion models struggle to retain marker-specific information and perform poorly on rare-cell identification in particular, whereas channel-independent architectures (especially CIM-S) learn stronger representations despite having only 5.5K parameters, and the results are stable and reproducible across marker counts (49 vs. 18) and self-supervised frameworks. Conclusion: Lightweight, channel-independent architectures can match or even surpass deep early-fusion CNNs and foundation models for representation learning in multiplexed tissue imaging, indicating that model design should prioritize inductive bias over scale alone. Abstract: Multiplexed tissue imaging measures dozens of protein markers per cell, yet most deep learning models still apply early channel fusion, assuming shared structure across markers. We investigate whether preserving marker independence, combined with deliberately shallow architectures, provides a more suitable inductive bias for self-supervised representation learning in multiplex data than increasing model scale. Using a Hodgkin lymphoma CODEX dataset with 145,000 cells and 49 markers, we compare standard early-fusion CNNs with channel-separated architectures, including a marker-aware baseline and our novel shallow Channel-Independent Model (CIM-S) with 5.5K parameters. After contrastive pretraining and linear evaluation, early-fusion models show limited ability to retain marker-specific information and struggle particularly with rare-cell discrimination. Channel-independent architectures, and CIM-S in particular, achieve substantially stronger representations despite their compact size. These findings are consistent across multiple self-supervised frameworks, remain stable across augmentation settings, and are reproducible across both the 49-marker and reduced 18-marker settings. These results show that lightweight, channel-independent architectures can match or surpass deep early-fusion CNNs and foundation models for multiplex representation learning. Code is available at https://github.com/SimonBon/CIM-S.
[95] Photorealistic Phantom Roads in Real Scenes: Disentangling 3D Hallucinations from Physical Geometry
Hoang Nguyen,Xiaohao Xu,Xiaonan Huang
Main category: cs.CV
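TL;DR: This paper presents the first end-to-end solution to the "3D hallucination" problem in monocular depth estimation models, including a new benchmark (3D-Mirage), Laplacian-based evaluation metrics, and a parameter-efficient self-distillation method.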
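One way to see why a Laplacian is a natural probe for spurious non-planarity: the discrete Laplacian of a depth map that varies linearly across the image is zero at every interior pixel, so any nonzero response inside a region annotated as planar signals hallucinated structure. The sketch below computes a mean absolute 4-neighbour Laplacian over a depth patch. It is an illustrative quantity under that assumption, not the paper's DCS or CCS definition, and the function name and toy patches are hypothetical.

```python
def mean_abs_laplacian(depth):
    """Mean |Laplacian| over the interior pixels of a 2D depth patch.

    depth -- list of equal-length rows of depth values, at least 3x3.
    """
    h, w = len(depth), len(depth[0])
    if h < 3 or w < 3:
        raise ValueError("patch must be at least 3x3")
    total, count = 0.0, 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # 4-neighbour discrete Laplacian; zero wherever depth is linear.
            lap = (depth[y - 1][x] + depth[y + 1][x]
                   + depth[y][x - 1] + depth[y][x + 1]
                   - 4 * depth[y][x])
            total += abs(lap)
            count += 1
    return total / count


# A tilted plane (depth linear in x and y) scores zero ...
plane = [[2 * x + 3 * y + 1 for x in range(5)] for y in range(5)]
flat_score = mean_abs_laplacian(plane)

# ... while the same plane with a small bump does not.
bumped = [row[:] for row in plane]
bumped[2][2] += 1
bump_score = mean_abs_laplacian(bumped)
```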
Details
Motivation: Monocular depth models learn semantic priors and consequently hallucinate geometry on planar regions (the 3D Mirage); existing evaluation cannot capture such structural errors, so new evaluation and mitigation mechanisms are needed. Method: Builds 3D-Mirage, a benchmark of real-world illusion images, proposes the Laplacian-based DCS and CCS metrics to quantify hallucination, and designs Grounded Self-Distillation, which enforces planarity in illusion regions while preserving background knowledge. Result: Experiments show that the proposed method significantly reduces non-planar deviation and contextual instability in illusion regions, outperforming conventional fine-tuning and distillation strategies. Conclusion: The focus of monocular depth estimation evaluation should shift from pixel accuracy to structural and contextual robustness; the paper provides key tools for diagnosing and mitigating 3D hallucinations. Abstract: Monocular depth foundation models achieve remarkable generalization by learning large-scale semantic priors, but this creates a critical vulnerability: they hallucinate illusory 3D structures from geometrically planar but perceptually ambiguous inputs. We term this failure the 3D Mirage. This paper introduces the first end-to-end framework to probe, quantify, and tame this unquantified safety risk. To probe, we present 3D-Mirage, the first benchmark of real-world illusions (e.g., street art) with precise planar-region annotations and context-restricted crops. To quantify, we propose a Laplacian-based evaluation framework with two metrics: the Deviation Composite Score (DCS) for spurious non-planarity and the Confusion Composite Score (CCS) for contextual instability. To tame this failure, we introduce Grounded Self-Distillation, a parameter-efficient strategy that surgically enforces planarity on illusion ROIs while using a frozen teacher to preserve background knowledge, thus avoiding catastrophic forgetting. Our work provides the essential tools to diagnose and mitigate this phenomenon, urging a necessary shift in MDE evaluation from pixel-wise accuracy to structural and contextual robustness. Our code and benchmark will be publicly available to foster this exciting research direction.
[96] Step-GUI Technical Report
Haolong Yan,Jia Wang,Xin Huang,Yeqing Shen,Ziyang Meng,Zhimin Fan,Kaijun Tan,Jin Gao,Lieyu Shi,Mi Yang,Shiliang Yang,Zhirui Wang,Brian Li,Kang An,Chenyang Li,Lei Lei,Mengmeng Duan,Danxun Liang,Guodong Liu,Hang Cheng,Hao Wu,Jie Dong,Junhao Huang,Mei Chen,Renjie Yu,Shunshan Li,Xu Zhou,Yiting Dai,Yineng Deng,Yingdan Liang,Zelin Chen,Wen Sun,Chengxu Yan,Chunqin Xu,Dong Li,Fengqiong Xiao,Guanghao Fan,Guopeng Li,Guozhen Peng,Hongbing Li,Hang Li,Hongming Chen,Jingjing Xie,Jianyong Li,Jingyang Zhang,Jiaju Ren,Jiayu Yuan,Jianpeng Yin,Kai Cao,Liang Zhao,Liguo Tan,Liying Shi,Mengqiang Ren,Min Xu,Manjiao Liu,Mao Luo,Mingxin Wan,Na Wang,Nan Wu,Ning Wang,Peiyao Ma,Qingzhou Zhang,Qiao Wang,Qinlin Zeng,Qiong Gao,Qiongyao Li,Shangwu Zhong,Shuli Gao,Shaofan Liu,Shisi Gao,Shuang Luo,Xingbin Liu,Xiaojia Liu,Xiaojie Hou,Xin Liu,Xuanti Feng,Xuedan Cai,Xuan Wen,Xianwei Zhu,Xin Liang,Xin Liu,Xin Zhou,Yingxiu Zhao,Yukang Shi,Yunfang Xu,Yuqing Zeng,Yixun Zhang,Zejia Weng,Zhonghao Yan,Zhiguo Huang,Zhuoyu Wang,Zheng Ge,Jing Li,Yibo Zhu,Binxing Jiao,Xiangyu Zhang,Daxin Jiang
Main category: cs.CV
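TL;DR: This paper presents Step-GUI, a family of GUI automation models trained with a self-evolving pipeline and a Calibrated Step Reward System for efficient, high-quality training. It also introduces the privacy-preserving GUI-MCP protocol and the real-world benchmark AndroidDaily, substantially improving the performance and deployability of GUI agents in practice.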
Details
Motivation: In GUI automation, existing multimodal large language models face difficulty obtaining high-quality training data, high annotation costs, and user privacy concerns, and they lack efficient, reliable, and scalable training methods and standardized deployment interfaces. Method: Proposes a self-evolving training pipeline and the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration; develops the Step-GUI family of models (4B/8B); designs the hierarchical GUI-MCP protocol, which supports low-level atomic operations and high-level task delegation while keeping sensitive data on-device; and builds AndroidDaily, a benchmark based on real mobile usage patterns. Result: Step-GUI reaches state-of-the-art performance on several benchmarks (8B: AndroidWorld 80.2%, OSWorld 48.5%, ScreenShot-Pro 62.6%); annotation accuracy exceeds 90% at 10-100x lower cost; on AndroidDaily the 8B model reaches 89.91% accuracy on static actions and a 52.50% end-to-end task completion rate; GUI-MCP enables high-privacy local execution. Conclusion: The work advances practical GUI agents and, through efficient self-evolving training, a privacy-preserving interface design, and real-world evaluation, shows strong potential for deployment in everyday digital interactions. Abstract: Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
[97] CLIP-FTI: Fine-Grained Face Template Inversion via CLIP-Driven Attribute Conditioning
Longchen Dai,Zixuan Shen,Zhiheng Zhou,Peipeng Yu,Zhihua Xia
Main category: cs.CV
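TL;DR: This paper proposes CLIP-FTI, a fine-grained attribute-conditioning framework that uses semantic embeddings from the CLIP model to achieve more realistic face template inversion, markedly improving the identification accuracy, attribute similarity, and cross-model attack transferability of the reconstructed images. It is the first method to bring in information beyond the template itself for template inversion, and it achieves state-of-the-art results.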
Details
Motivation: Images reconstructed by existing face template inversion methods have over-smoothed facial attributes, blurred details, and poor cross-model transferability, so they pose little real threat to deployed systems; a new method is needed that recovers finer facial features and makes the attack more general. Method: Proposes the CLIP-FTI framework: semantic embeddings of facial attributes are extracted with the CLIP model, fused with the leaked face template through a cross-modal feature interaction network, and mapped into the intermediate latent space of a pretrained StyleGAN, which then generates face images with the same identity but sharper details. Result: Experiments across multiple face recognition backbones and datasets show that, compared with prior methods, the approach (i) improves identification accuracy and attribute similarity, (ii) recovers sharper local attribute semantics, and (iii) significantly improves cross-model attack transferability. Conclusion: CLIP-FTI is the first to introduce semantic information beyond the face template (CLIP embeddings) into template inversion, reaching the state of the art in both reconstruction quality and attack effectiveness; it reveals the considerable potential of multimodal priors for strengthening inversion attacks and poses a new challenge for biometric security. Abstract: Face recognition systems store face templates for efficient matching. Once leaked, these templates pose a threat: inverting them can yield photorealistic surrogates that compromise privacy and enable impersonation. Although existing research has achieved relatively realistic face template inversion, the reconstructed facial images exhibit over-smoothed facial-part attributes (eyes, nose, mouth) and limited transferability. To address this problem, we present CLIP-FTI, a CLIP-driven fine-grained attribute conditioning framework for face template inversion. Our core idea is to use the CLIP model to obtain the semantic embeddings of facial features, in order to realize the reconstruction of specific facial feature attributes. Specifically, facial feature attribute embeddings extracted from CLIP are fused with the leaked template via a cross-modal feature interaction network and projected into the intermediate latent space of a pretrained StyleGAN. The StyleGAN generator then synthesizes face images with the same identity as the templates but with more fine-grained facial feature attributes. Experiments across multiple face recognition backbones and datasets show that our reconstructions (i) achieve higher identification accuracy and attribute similarity, (ii) recover sharper component-level attribute semantics, and (iii) improve cross-model attack transferability compared to prior reconstruction attacks. To the best of our knowledge, ours is the first method to use additional information beyond the face template to perform face template inversion, and it obtains SOTA results.
[98] ST-DETrack: Identity-Preserving Branch Tracking in Entangled Plant Canopies via Dual Spatiotemporal Evidence
Yueqianji Chen,Kevin Williams,John H. Doonan,Paolo Remagnino,Jo Hepworth
Main category: cs.CV
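TL;DR: Proposes ST-DETrack, a spatiotemporal-fusion dual-decoder network that accurately extracts plant branches from time-series images while keeping their identities consistent, significantly improving branch matching accuracy in high-throughput phenotyping.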
Details
Motivation: Non-rigid plant growth and canopy entanglement cause severe identity fragmentation, so existing methods struggle to keep branch identities consistent over long time series; a robust tracking method that adapts to different growth stages is needed. Method: Designs a dual-decoder network, ST-DETrack, with a spatial decoder that uses geometric priors such as position and angle and a temporal decoder that exploits motion consistency; an adaptive gating mechanism fuses the two dynamically, and a biological constraint based on negative gravitropism reduces vertical growth ambiguity. Result: Validated on a Brassica napus dataset, ST-DETrack achieves a Branch Matching Accuracy (BMA) of 93.6%, which is 28.9 and 3.3 percentage points higher than the spatial-only and temporal-only baselines, respectively. Conclusion: ST-DETrack copes effectively with long-term identity preservation in complex, dynamic plant structures and provides a robust branch-level extraction solution for high-throughput phenotyping. Abstract: Automated extraction of individual plant branches from time-series imagery is essential for high-throughput phenotyping, yet it remains computationally challenging due to non-rigid growth dynamics and severe identity fragmentation within entangled canopies. To overcome these stage-dependent ambiguities, we propose ST-DETrack, a spatiotemporal-fusion dual-decoder network designed to preserve branch identity from budding to flowering. Our architecture integrates a spatial decoder, which leverages geometric priors such as position and angle for early-stage tracking, with a temporal decoder that exploits motion consistency to resolve late-stage occlusions. Crucially, an adaptive gating mechanism dynamically shifts reliance between these spatial and temporal cues, while a biological constraint based on negative gravitropism mitigates vertical growth ambiguities. Validated on a Brassica napus dataset, ST-DETrack achieves a Branch Matching Accuracy (BMA) of 93.6%, significantly outperforming spatial and temporal baselines by 28.9 and 3.3 percentage points, respectively. These results demonstrate the method's robustness in maintaining long-term identity consistency amidst complex, dynamic plant architectures.
[99] Evaluation of deep learning architectures for wildlife object detection: A comparative study of ResNet and Inception
Malach Obisa Amonga,Benard Osero,Edna Too
Main category: cs.CV
TL;DR: This study evaluates ResNet-101 and Inception v3 for wildlife detection; both perform well, with Inception v3 slightly ahead at 95% accuracy and an mAP of 0.92.
Details
Motivation: To address the challenges that wildlife detection faces in complex environments: environmental variability, visual similarity between species, and intra-class diversity. Method: Trains and evaluates two deep learning architectures, ResNet-101 and Inception v3, on a wildlife image dataset with uniform preprocessing, using a 70:30 training/validation split. Result: ResNet-101 reaches 94% classification accuracy and 0.91 mAP; Inception v3 reaches 95% accuracy and 0.92 mAP, showing stronger multi-scale feature extraction. Conclusion: Both models are suitable for wildlife detection and provide a reliable basis for computer vision applications in conservation, although similar-looking species, low light, and occlusion remain challenging. Abstract: Wildlife object detection plays a vital role in biodiversity conservation, ecological monitoring, and habitat protection. However, this task is often challenged by environmental variability, visual similarities among species, and intra-class diversity. This study investigates the effectiveness of two individual deep learning architectures, ResNet-101 and Inception v3, for wildlife object detection under such complex conditions. The models were trained and evaluated on a wildlife image dataset using a standardized preprocessing approach, which included resizing images to a maximum dimension of 800 pixels, converting them to RGB format, and transforming them into PyTorch tensors. A 70:30 training/validation split was used for model development. The ResNet-101 model achieved a classification accuracy of 94% and a mean Average Precision (mAP) of 0.91, showing strong performance in extracting deep hierarchical features. The Inception v3 model performed slightly better, attaining a classification accuracy of 95% and a mAP of 0.92, attributed to its efficient multi-scale feature extraction through parallel convolutions. Despite the strong results, both models exhibited challenges when detecting species with similar visual characteristics or those captured under poor lighting and occlusion. Nonetheless, the findings confirm that both ResNet-101 and Inception v3 are effective models for wildlife object detection tasks and provide a reliable foundation for conservation-focused computer vision applications.
[100] RUMPL: Ray-Based Transformers for Universal Multi-View 2D to 3D Human Pose Lifting
Seyed Abolfazl Ghasemzadeh,Alexandre Alahi,Christophe De Vleeschouwer
Main category: cs.CV
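TL;DR: Proposes RUMPL, a transformer-based 3D pose lifting framework that represents 2D keypoints as 3D rays, requires no camera calibration, works with arbitrary multi-view configurations, and significantly outperforms existing methods.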
Details
Motivation: Existing multi-view learning-based methods generalize poorly because large-scale real 3D data is scarce, and they depend on camera calibration, which makes them hard to apply in complex real-world scenes. Method: Introduces a 3D ray representation of 2D keypoints, yielding a general representation that does not depend on camera parameters; designs a View Fusion Transformer that fuses multi-view ray features for consistent cross-view information aggregation. Result: Reduces MPJPE by up to 53% relative to triangulation and by more than 60% relative to transformer baselines based on image representations; shows strong robustness and scalability on newly introduced in-the-wild multi-view and multi-person datasets. Conclusion: RUMPL enables general 3D pose estimation that can be deployed in arbitrary multi-view setups without retraining, advancing the use of synthetic-data-driven methods in real scenes. Abstract: Estimating 3D human poses from 2D images remains challenging due to occlusions and projective ambiguity. Multi-view learning-based approaches mitigate these issues but often fail to generalize to real-world scenarios, as large-scale multi-view datasets with 3D ground truth are scarce and captured under constrained conditions. To overcome this limitation, recent methods rely on 2D pose estimation combined with 2D-to-3D pose lifting trained on synthetic data. Building on our previous MPL framework, we propose RUMPL, a transformer-based 3D pose lifter that introduces a 3D ray-based representation of 2D keypoints. This formulation makes the model independent of camera calibration and the number of views, enabling universal deployment across arbitrary multi-view configurations without retraining or fine-tuning. A new View Fusion Transformer leverages learned fused-ray tokens to aggregate information along rays, further improving multi-view consistency. Extensive experiments demonstrate that RUMPL reduces MPJPE by up to 53% compared to triangulation and over 60% compared to transformer-based image-representation baselines. Results on new benchmarks, including in-the-wild multi-view and multi-person datasets, confirm its robustness and scalability. The framework's source code is available at https://github.com/aghasemzadeh/OpenRUMPL
[101] The LUMirage: An independent evaluation of zero-shot performance in the LUMIR challenge
Rohit Jena,Pratik Chaudhari,James C. Gee
Main category: cs.CV
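TL;DR: This paper questions the zero-shot generalization of deep learning methods for neuroimaging registration in the LUMIR challenge. A rigorous evaluation finds that their performance drops significantly on out-of-distribution contrasts and high-resolution data and that they are sensitive to preprocessing; the results underline the need for evaluation protocols closer to real-world use.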
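The effect sizes this entry reports are Cohen's d values, which express the difference between two group means in units of their pooled standard deviation. The sketch below uses the standard pooled-variance form; which scores and which groups the paper actually compares is not stated here, so the two sample lists are made-up numbers for illustration only.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two values")
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical per-subject scores for two registration methods.
method_a = [2.0, 4.0, 6.0]
method_b = [1.0, 3.0, 5.0]
d = cohens_d(method_a, method_b)
```

By the usual rule of thumb, values around 0.5 are read as medium effects and values of 0.8 or more as large, which is why the reported range is described as having substantial practical impact.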
Details
Motivation: To question the LUMIR challenge's claim that deep learning methods show exceptional zero-shot generalization, a claim that contradicts the established understanding of domain shift in deep learning. Method: Independently re-evaluates the deep learning methods from the LUMIR challenge using rigorous evaluation protocols, controlling for potential sources of instrumentation bias, and tests them across imaging modalities, resolutions, and preprocessing conditions. Result: (1) Deep learning methods perform well on T1-weighted images and on a human-adjacent species (macaque); (2) performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's d between 0.7 and 1.5; (3) they cannot handle 0.6 mm isotropic high-resolution images, whereas iterative methods benefit from higher resolution; (4) they are highly sensitive to preprocessing choices. Conclusion: Deep learning methods do not achieve universal zero-shot superiority; their performance is strongly affected by domain shift, resolution, and preprocessing, and evaluation standards closer to clinical and research practice are recommended. Abstract: The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions, assertions that contradict established understanding of domain shift in deep learning. In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's d scores ranging from 0.7 to 1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep methods exhibit high sensitivity to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
[102] Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
Arthur Moreau,Richard Shaw,Michal Nazarczuk,Jisu Shin,Thomas Tanay,Zhensong Zhang,Songcen Xu,Eduardo Pérez-Pellitero
Main category: cs.CV
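TL;DR: Proposes a new feed-forward 3D Gaussian Splatting model that uses sub-pixel keypoint detection to distribute primitives adaptively "off the grid", achieving high-quality, efficient real-time scene generation without pose labels.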
Details
Motivation: Existing feed-forward 3DGS models rely on a fixed, dense pixel grid, which makes primitive placement inflexible and limits rendering quality and efficiency. Method: Inspired by keypoint detection, designs a multi-resolution decoder that learns end-to-end to distribute 3D Gaussian primitives across image patches, trained together with a 3D reconstruction backbone using self-supervised learning. Result: The model reaches the state of the art among feed-forward models on novel view synthesis, generating more realistic scenes with fewer primitives, fewer artifacts, and better detail; the approach is also found to improve camera pose estimation. Conclusion: The proposed "off the grid" approach allocates primitives more accurately and efficiently and offers a new direction for training 3D reconstruction models without labels. Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, "Off The Grid" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts. Moreover, we observe that by learning to render 3D Gaussians, our 3D reconstruction backbone improves camera pose estimation, suggesting opportunities to train these foundational models without labels.
[103] VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics
Opeyemi Bamigbade,Mark Scanlon,John Sheppard
Main category: cs.CV
TL;DR: Proposes the Vision-Attention Anomaly Scoring (VAAS) framework, which combines a Vision Transformer with SegFormer for image forgery detection and provides an interpretable, continuous anomaly score.
Details
Motivation: Existing image forgery detection methods struggle to identify the realistic forgeries produced by modern generative models and cannot quantify anomaly intensity. Method: Proposes VAAS, a dual-module framework combining global attention-based anomaly estimation (ViT) with patch-level self-consistency scoring (SegFormer embeddings), fused to produce an interpretable anomaly score and a localization map. Result: Achieves competitive F1 and IoU on the DF2023 and CASIA v2.0 datasets, with attention-guided anomaly maps improving visual explainability. Conclusion: VAAS combines quantitative detection with human-understandable reasoning, improving the transparency and reliability of image authenticity assessment; the source code is released for reproduction. Abstract: Recent advances in AI-driven image generation have introduced new challenges for verifying the authenticity of digital evidence in forensic investigations. Modern generative models can produce visually consistent forgeries that evade traditional detectors based on pixel or compression artefacts. Most existing approaches also lack an explicit measure of anomaly intensity, which limits their ability to quantify the severity of manipulation. This paper introduces Vision-Attention Anomaly Scoring (VAAS), a novel dual-module framework that integrates global attention-based anomaly estimation using Vision Transformers (ViT) with patch-level self-consistency scoring derived from SegFormer embeddings. The hybrid formulation provides a continuous and interpretable anomaly score that reflects both the location and degree of manipulation. Evaluations on the DF2023 and CASIA v2.0 datasets demonstrate that VAAS achieves competitive F1 and IoU performance, while enhancing visual explainability through attention-guided anomaly maps. The framework bridges quantitative detection with human-understandable reasoning, supporting transparent and reliable image integrity assessment. The source code for all experiments and corresponding materials for reproducing the results are available open source.
[104] DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations
Yuxiang Shi,Zhe Li,Yanwen Wang,Hao Zhu,Xun Cao,Ligang Liu
Main category: cs.CV
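TL;DR: Proposes DeX-Portrait, a method for high-fidelity portrait animation with disentangled control, allowing head pose and facial expression to be manipulated separately.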
Details
Motivation: Existing diffusion-based methods do not achieve high-fidelity disentangled control of head pose and facial expression, which limits applications such as expression-only or pose-only editing. Method: Pose is represented as an explicit global transformation and expression as an implicit latent code; a motion trainer extracts disentangled driving signals; pose is injected through a dual-branch conditioning mechanism and expression through cross attention; a progressive hybrid classifier-free guidance improves identity consistency. Result: Experiments show that the method outperforms state-of-the-art baselines in animation quality and disentangled controllability. Conclusion: DeX-Portrait achieves more precise disentangled control of pose and expression, with higher fidelity and flexibility in portrait animation. Abstract: Portrait animation from a single source image and a driving video is a long-standing problem. Recent approaches tend to adopt diffusion-based image/video generation models for realistic and expressive animation. However, none of these diffusion models realizes high-fidelity disentangled control between the head pose and facial expression, hindering applications like expression-only or pose-only editing and animation. To address this, we propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals. Specifically, we represent the pose as an explicit global transformation and the expression as an implicit latent code. First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals. Then we propose to inject the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention. Finally, we design a progressive hybrid classifier-free guidance for more faithful identity consistency. Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.
[105] EmoCaliber: Advancing Reliable Visual Emotion Comprehension via Confidence Verbalization and Calibration
Daiqing Wu,Dongbao Yang,Can Ma,Yu Zhou
Main category: cs.CV
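TL;DR: Proposes EmoCaliber, a multimodal large language model that verbalizes its confidence, trained with a three-stage framework to make predictions in visual emotion comprehension more reliable.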
Details
Motivation: Existing visual emotion comprehension methods usually treat the task as deterministic, ignoring the subjectivity of emotion perception and giving no expression of prediction confidence, which limits reliability in practice. Method: A three-stage training framework (structured reasoning, learning to verbalize confidence, and confidence calibration) trains a multimodal large language model, using the unified benchmark VECBench, to output both an emotion label and a confidence. Result: EmoCaliber outperforms existing methods in both emotion prediction and confidence estimation, supporting its effectiveness in improving model reliability. Conclusion: Having the model state its confidence explicitly better reflects the subjectivity of emotion understanding and offers a feasible path toward more reliable visual emotion comprehension systems. Abstract: Visual Emotion Comprehension (VEC) aims to infer sentiment polarities or emotion categories from affective cues embedded in images. In recent years, Multimodal Large Language Models (MLLMs) have established a popular paradigm in VEC, leveraging their generalizability to unify VEC tasks defined under diverse emotion taxonomies. While this paradigm achieves notable success, it typically formulates VEC as a deterministic task, requiring the model to output a single, definitive emotion label for each image. Such a formulation insufficiently accounts for the inherent subjectivity of emotion perception, overlooking alternative interpretations that may be equally plausible to different viewers. To address this limitation, we propose equipping MLLMs with capabilities to verbalize their confidence in emotion predictions. This additional signal provides users with an estimate of both the plausibility of alternative interpretations and the MLLMs' self-assessed competence, thereby enhancing reliability in practice. Building on this insight, we introduce a three-stage training framework that progressively endows with structured reasoning, teaches to verbalize confidence, and calibrates confidence expression, culminating in EmoCaliber, a confidence-aware MLLM for VEC. Through fair and comprehensive evaluations on the unified benchmark VECBench, EmoCaliber demonstrates overall superiority against existing methods in both emotion prediction and confidence estimation. These results validate the effectiveness of our approach and mark a feasible step toward more reliable VEC systems.
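One common way to check whether verbalized confidences are calibrated is the expected calibration error (ECE), which bins predictions by stated confidence and compares each bin's average confidence with its accuracy. The sketch below is a generic illustration; the abstract does not say which calibration metric EmoCaliber is evaluated with, so the choice of ECE, the bin count, and the function name are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and accuracy per bin.

    confidences: floats in [0, 1], one per prediction.
    correct: booleans, True where the prediction matched the label.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 would index past the end, so clamp to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical example: the model says "80% sure" four times but is right twice.
overconfident = expected_calibration_error([0.8] * 4, [True, True, False, False])
```

A perfectly calibrated model yields an ECE of zero; the hypothetical overconfident case above yields a gap of about 0.3.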
Project page: https://github.com/wdqqdw/EmoCaliber.
[106] An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain
João Daniel Silva,Joao Magalhaes,Devis Tuia,Bruno Martins
Main category: cs.CV
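TL;DR: Proposes GeoMELT, an efficient encoder-only multi-task learning model for text generation and cross-modal retrieval on remote sensing imagery, achieving strong performance with a compact parameter count.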
Details
Motivation: Large Vision and Language Models (LVLMs) are expensive to train and run, and existing methods struggle to cover multiple remote sensing tasks at once, so an efficient, lightweight multi-task solution is needed. Method: An encoder-only model named GeoMELT (Multi-task Efficient Learning Transformer) handles text generation from remote sensing images and cross-modal retrieval in a unified way, using a parameter-efficient multi-task learning strategy to reduce computational cost. Result: On standard benchmarks, GeoMELT performs well across tasks, confirming its efficacy and efficiency with a markedly smaller parameter count and compute cost. Conclusion: GeoMELT offers an efficient, compact solution for multi-task learning in remote sensing that balances performance against resource use. Abstract: The remote sensing community has recently seen the emergence of methods based on Large Vision and Language Models (LVLMs) that can address multiple tasks at the intersection of computer vision and natural language processing. To fully exploit the potential of such models, a significant focus has been given to the collection of large amounts of training data that cover multiple remote sensing-specific tasks, such as image captioning or visual question answering. However, the cost of using and training LVLMs is high, due to the large number of parameters. While multiple parameter-efficient adaptation techniques have been explored, the computational costs of training and inference with these models can remain prohibitive for most institutions. In this work, we explore the use of encoder-only architectures and propose a model that can effectively address multi-task learning while remaining compact in terms of the number of parameters. In particular, our model tackles combinations of tasks that are not typically explored in a unified model: the generation of text from remote sensing images and cross-modal retrieval. The results of our GeoMELT model - named from Multi-task Efficient Learning Transformer - in established benchmarks confirm the efficacy and efficiency of the proposed approach.
[107] BLANKET: Anonymizing Faces in Infant Video Recordings
Ditmar Hadera,Jan Cech,Miroslav Purkrabek,Matej Hoffmann
Main category: cs.CV
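TL;DR: Proposes BLANKET, an infant face anonymization method that protects identity while preserving key facial attributes and expressions, maintains temporal consistency in video, and outperforms DeepPrivacy2.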
Details
Motivation: Ethical use of video data involving infants requires methods that anonymize effectively while preserving important facial information such as expressions, and existing methods struggle to do both. Method: BLANKET has two stages: first, a diffusion model generates, conditioned on keypoints, a new infant face compatible with the original identity; second, the new identity is blended into each video frame through temporally consistent face swapping with authentic expression transfer. Result: On a dataset of infant videos, BLANKET outperforms DeepPrivacy2 in preservation of facial attributes, impact on downstream tasks (e.g., human pose estimation), and artifact control, while both methods alter the identity. Conclusion: BLANKET anonymizes infant face videos effectively while keeping expressions natural and downstream tasks usable, making it a practical option for handling sensitive infant data. Abstract: Ensuring the ethical use of video data involving human subjects, particularly infants, requires robust anonymization methods. We propose BLANKET (Baby-face Landmark-preserving ANonymization with Keypoint dEtection consisTency), a novel approach designed to anonymize infant faces in video recordings while preserving essential facial attributes. Our method comprises two stages. First, a new random face, compatible with the original identity, is generated via inpainting using a diffusion model. Second, the new identity is seamlessly incorporated into each video frame through temporally consistent face swapping with authentic expression transfer. The method is evaluated on a dataset of short video recordings of babies and is compared to the popular anonymization method, DeepPrivacy2. Key metrics assessed include the level of de-identification, preservation of facial attributes, impact on human pose estimation (as an example of a downstream task), and presence of artifacts. Both methods alter the identity, and our method outperforms DeepPrivacy2 in all other respects. The code is available as an easy-to-use anonymization demo at https://github.com/ctu-vras/blanket-infant-face-anonym.
[108] GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
Bozhou Li,Sihan Yang,Yushuo Guan,Ruichuan An,Xinlong Chen,Yang Shi,Pengfei Wan,Wentao Zhang,Yuanxing Zhang
Main category: cs.CV
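TL;DR: Proposes GRAN-TED, a paradigm for generating robust, aligned, and nuanced text embeddings that improve semantic fidelity in text-to-image and text-to-video diffusion models. The authors design TED-6K, a text-only benchmark for efficiently assessing encoder quality, and a two-stage training method that adapts language models for visual synthesis.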
Details
Motivation: Text encoders lack a reliable evaluation framework, and pretrained language models are hard to adapt effectively for visual generation, which holds back text-to-image and text-to-video models. Method: First, build the TED-6K text-only benchmark, standardize performance through a lightweight unified adapter, and verify its correlation with downstream tasks; then, guided by this framework, apply two-stage training: fine-tuning on a Multimodal Large Language Model to improve visual representation, followed by a layer-wise weighting method to extract more nuanced text features. Result: GRAN-TED achieves state-of-the-art performance on TED-6K and yields clear gains in text-to-image and text-to-video generation; performance on TED-6K correlates strongly with downstream results. Conclusion: GRAN-TED provides an effective framework for evaluating and improving text encoders, advancing semantic alignment in text-to-visual generation models. Abstract: The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation.
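The abstract does not spell out how the layer-wise weighting works. One familiar form of the idea is a softmax-normalized scalar mix over the hidden states of every layer, so the final embedding is a learned blend rather than the last layer alone. The sketch below illustrates only that generic form; the softmax parameterization, the function name, and the toy inputs are assumptions, not the GRAN-TED implementation.

```python
import math


def layer_weighted_embedding(hidden_states, layer_logits):
    """Blend per-layer token vectors with softmax-normalized layer weights.

    hidden_states: list of L vectors (one per layer), each of the same length.
    layer_logits: list of L scalars; in training these would be learnable.
    """
    # Numerically stable softmax over the layer logits.
    peak = max(layer_logits)
    exps = [math.exp(logit - peak) for logit in layer_logits]
    norm = sum(exps)
    weights = [e / norm for e in exps]

    dim = len(hidden_states[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, hidden_states))
        for i in range(dim)
    ]


# Toy example: two layers, two-dimensional features, equal logits.
mixed = layer_weighted_embedding([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

With equal logits the blend reduces to a plain average of the layers; pushing one logit much higher makes the output approach that single layer's features.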
Our code is available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
[109] On the Effectiveness of Textual Prompting with Lightweight Fine-Tuning for SAM3 Remote Sensing Segmentation
Roni Blushtein-Livnon,Osher Rafaeli,David Ioffe,Amir Boger,Karen Sandberg Esquenazi,Tal Svoray
Main category: cs.CV
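TL;DR: Studies SAM3 for remote sensing image segmentation, comparing textual, geometric, and hybrid prompting under different levels of supervision. Combining semantic and geometric cues works best, and a small amount of geometric annotation is enough to improve performance, but segmentation of irregular targets still suffers from boundary inaccuracies.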
Details
Motivation: Remote sensing image segmentation is limited by scarce annotated data and by foundation models trained mostly on natural images, which adapt poorly; effective transfer under limited supervision is therefore needed. Method: Using the SAM3 concept-driven framework, the study applies textual, geometric, and hybrid prompting with zero-shot inference and lightweight fine-tuning, and evaluates segmentation performance on four target types in remote sensing imagery. Result: Hybrid prompting that combines semantic and geometric cues performs best; text-only prompting performs worst, especially on irregularly shaped targets; lightweight fine-tuning clearly helps for regular, salient targets; gains saturate as supervision increases; a persistent gap between Precision and IoU shows that under-segmentation and boundary errors remain the main problems. Conclusion: A modest amount of geometric annotation is sufficient to adapt SAM3 to remote sensing segmentation, hybrid prompting beats text-only prompting, and future work should improve boundary accuracy, especially for irregular and rare targets. Abstract: Remote sensing (RS) image segmentation is constrained by the limited availability of annotated data and a gap between overhead imagery and natural images used to train foundational models. This motivates effective adaptation under limited supervision. SAM3 concept-driven framework generates masks from textual prompts without requiring task-specific modifications, which may enable this adaptation. We evaluate SAM3 for RS imagery across four target types, comparing textual, geometric, and hybrid prompting strategies, under lightweight fine-tuning scales with increasing supervision, alongside zero-shot inference. Results show that combining semantic and geometric cues yields the highest performance across targets and metrics. Text-only prompting exhibits the lowest performance, with marked score gaps for irregularly shaped targets, reflecting limited semantic alignment between SAM3 textual representations and their overhead appearances. Nevertheless, textual prompting with light fine-tuning offers a practical performance-effort trade-off for geometrically regular and visually salient targets. Across targets, performance improves between zero-shot inference and fine-tuning, followed by diminishing returns as the supervision scale increases. Namely, a modest geometric annotation effort is sufficient for effective adaptation.
A persistent gap between Precision and IoU further indicates that under-segmentation and boundary inaccuracies remain prevalent error patterns in RS tasks, particularly for irregular and less prevalent targets.
[110] MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
Zhipeng Du,Duolikun Danier,Jan Eric Lenssen,Hakan Bilen
Main category: cs.CV
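TL;DR: Proposes MoonSeg3R, the first method for online monocular RGB 3D instance segmentation. It uses geometric priors from CUT3R and masks from 2D visual foundation models, and achieves cross-frame consistency and strong segmentation through self-supervised query refinement, a 3D query memory, and a state-distribution token.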
Details
Motivation: Existing methods rely on posed RGB-D sequences, so there is no effective approach to online zero-shot monocular 3D instance segmentation, which limits use in real scenes. Method: MoonSeg3R has three core components: (1) a self-supervised query refinement module with spatial-semantic distillation that turns masks from 2D VFMs into discriminative 3D queries; (2) a 3D query index memory that retrieves contextual queries to maintain temporal consistency; (3) the state-distribution token from CUT3R, used as a mask identity descriptor to strengthen cross-frame fusion. Result: Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to support online monocular 3D segmentation, with performance competitive with state-of-the-art RGB-D-based methods. Conclusion: MoonSeg3R achieves online zero-shot 3D instance segmentation without depth input, moving 3D perception from monocular video streams closer to practical use. Abstract: In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.
[111] IMKD: Intensity-Aware Multi-Level Knowledge Distillation for Camera-Radar Fusion
Shashank Mishra,Karan Patil,Didier Stricker,Jason Rambach
Main category: cs.CV
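TL;DR: Proposes IMKD, a radar-camera 3D object detection framework that uses multi-level knowledge distillation to improve performance without LiDAR at inference, preserving each sensor's characteristics while strengthening their complementarity, and outperforming prior methods on nuScenes.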
Details
Motivation: Existing knowledge distillation methods transfer modality-specific features directly, which tends to distort each sensor's unique characteristics and weaken its individual strengths; a fusion method is needed that preserves the sensors' intrinsic characteristics while enhancing complementarity. Method: IMKD applies a three-stage, intensity-aware distillation strategy: (1) LiDAR-to-Radar intensity-aware feature distillation; (2) LiDAR-to-Fused feature intensity-guided distillation; (3) a Camera-Radar intensity-guided fusion mechanism to strengthen cross-modal representations. Result: On the nuScenes benchmark, IMKD reaches 67.0% NDS and 61.0% mAP, outperforming all prior distillation-based radar-camera fusion methods. Conclusion: IMKD preserves the intrinsic characteristics of radar and camera and strengthens their complementarity through intensity-aware multi-level distillation, clearly improving 3D object detection. Abstract: High-performance Radar-Camera 3D object detection can be achieved by leveraging knowledge distillation without using LiDAR at inference time. However, existing distillation methods typically transfer modality-specific features directly to each sensor, which can distort their unique characteristics and degrade their individual strengths. To address this, we introduce IMKD, a radar-camera fusion framework based on multi-level knowledge distillation that preserves each sensor's intrinsic characteristics while amplifying their complementary strengths. IMKD applies a three-stage, intensity-aware distillation strategy to enrich the fused representation across the architecture: (1) LiDAR-to-Radar intensity-aware feature distillation to enhance radar representations with fine-grained structural cues, (2) LiDAR-to-Fused feature intensity-guided distillation to selectively highlight useful geometry and depth information at the fusion level, fostering complementarity between the modalities rather than forcing them to align, and (3) Camera-Radar intensity-guided fusion mechanism that facilitates effective feature alignment and calibration. Extensive experiments on the nuScenes benchmark show that IMKD reaches 67.0% NDS and 61.0% mAP, outperforming all prior distillation-based radar-camera fusion methods. Our code and models are available at https://github.com/dfki-av/IMKD/.
[112] FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
Tobias Kirschstein,Simon Giebenhain,Matthias Nießner
Main category: cs.CV
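TL;DR: FlexAvatar creates high-quality, complete 3D head avatars from a single image. A transformer-based animation model with learnable "bias sink" tokens enables unified training on monocular and multi-view data, improving generalization while keeping the avatar complete in 3D.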
Details
Motivation: Existing methods train on monocular video, which entangles the driving signal with the target viewpoint and makes complete 3D head reconstruction difficult. Scarce multi-view data limits 3D completeness, so a method that combines the strengths of monocular and multi-view data is needed. Method: FlexAvatar uses a transformer-based 3D portrait animation model with learnable data source tokens (bias sinks) that decouple the driving signal from viewpoint information, support joint training on monocular and multi-view data, and yield a smooth latent avatar space that supports identity interpolation and flexible fitting to any number of input observations. Result: The method performs strongly on single-view, few-shot, and monocular avatar creation tasks, generating complete 3D head avatars with realistic facial animation and clearly outperforming existing methods, especially in view extrapolation. Conclusion: Through the bias sink token mechanism, FlexAvatar effectively fuses monocular and multi-view data and balances 3D completeness, generalization, and animation quality. Abstract: We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations. Website: https://tobias-kirschstein.github.io/flexavatar/
[113] Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
Shengming Yin,Zekai Zhang,Zecheng Tang,Kaiyuan Gao,Xiao Xu,Kun Yan,Jiahao Li,Yilei Chen,Yuxiang Chen,Heung-Yeung Shum,Lionel M. Ni,Jingren Zhou,Junyang Lin,Chenfei Wu
Main category: cs.CV
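TL;DR: Proposes Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling more editable and more consistent image editing.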
Details
Motivation: Existing visual generative models struggle to keep edits consistent because content in raster images is entangled; inspired by layered editing in professional design tools, this work aims for semantic disentanglement and independent editing of image content. Method: Qwen-Image-Layered comprises an RGBA-VAE that unifies the RGB and RGBA latent spaces, a VLD-MMDiT architecture that supports decomposition into a variable number of layers, and a multi-stage training strategy; a pipeline that parses PSD files is built to produce a multilayer image dataset. Result: The method significantly surpasses existing approaches in decomposition quality, gives more consistent image editing, and supports flexible layer operations. Conclusion: The approach establishes a new image editing paradigm in which layer decomposition enables high-quality, editable image generation and modification, bringing generative models closer to professional design tools. Abstract: Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released on https://github.com/QwenLM/Qwen-Image-Layered
[114] Robust Multi-view Camera Calibration from Dense Matches
Johannes Hägerlind,Bao-Long Tran,Urs Waldmann,Per-Erik Forssén
Main category: cs.CV
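TL;DR: Proposes an improved structure-from-motion (SfM) method for estimating the intrinsics and extrinsics of cameras with strong radial distortion, showing higher accuracy in multi-view settings such as animal behavior studies and forensic analysis.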
Details
Motivation: Existing SfM methods still have accuracy and robustness problems with strong radial distortion and multi-view configurations, particularly in animal behavior studies and analysis of surveillance footage. Method: The components of the SfM pipeline are analysed; the work proposes a better strategy for subsampling correspondences predicted by a dense matcher and selection criteria for adding views incrementally, and also applies the subsampling in a global SfM setting with poses initialized by VGGT. Result: In quantitative evaluation, the method clearly outperforms the baseline under strong radial distortion (79.9% vs. 40.4%) and generalizes across a range of camera setups. Conclusion: The proposed SfM improvements raise pose estimation and calibration accuracy, and are especially suited to real-world settings with strong distortion and complex viewpoints. Abstract: Estimating camera intrinsics and extrinsics is a fundamental problem in computer vision, and while advances in structure-from-motion (SfM) have improved accuracy and robustness, open challenges remain. In this paper, we introduce a robust method for pose estimation and calibration. We consider a set of rigid cameras, each observing the scene from a different perspective, which is a typical camera setup in animal behavior studies and forensic analysis of surveillance footage. Specifically, we analyse the individual components in a structure-from-motion (SfM) pipeline, and identify design choices that improve accuracy. Our main contributions are: (1) we investigate how to best subsample the predicted correspondences from a dense matcher to leverage them in the estimation process. (2) We investigate selection criteria for how to add the views incrementally. In a rigorous quantitative evaluation, we show the effectiveness of our changes, especially for cameras with strong radial distortion (79.9% ours vs. 40.4% vanilla VGGT). Finally, we demonstrate our correspondence subsampling in a global SfM setting where we initialize the poses using VGGT. The proposed pipeline generalizes across a wide range of camera setups, and could thus become a useful tool for animal behavior and forensic analysis.
[115] Persistent feature reconstruction of resident space objects (RSOs) within inverse synthetic aperture radar (ISAR) images
Morgan Coe,Gruffudd Jones,Leah-Nani Alconcel,Marina Gashinova
Main category: cs.CV
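TL;DR: Proposes feature detection and tracking over sequences of sub-THz inverse synthetic aperture radar (ISAR) images to improve recognition of near-Earth space objects, using a modified Hough transform and image alignment to detect linear features accurately and to identify shadowing robustly.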
Details
Motivation: The number of resident space objects in near-Earth space is growing rapidly, and Space Domain Awareness (SDA) needs more detailed information about them; conventional observation is limited by atmospheric effects and viewing geometry, so new high-resolution, all-aspect imaging and recognition techniques are needed. Method: A sub-THz ISAR imaging system images targets at ranges of up to 100 km; a metaheuristic simulator generates ISAR image sequences for different deployment scenarios; affine transformations provide initial frame-to-frame alignment; a gradient-by-ratio method detects edges, and a double-weighted Hough transform detects linear features; features are tracked across consecutive frames to analyse how they evolve. Result: Sub-centimetre image resolution is achieved; feature detection becomes more accurate and stable; complex features such as shadowing are detected robustly; feature tracking increases classification confidence. Conclusion: Feature detection and tracking over ISAR image sequences improves the accuracy and reliability of recognizing external structures of space objects, and suits autonomous identification and condition assessment in future SDA missions. Abstract: With the rapidly growing population of resident space objects (RSOs) in the near-Earth space environment, detailed information about their condition and capabilities is needed to provide Space Domain Awareness (SDA). Space-based sensing will enable inspection of RSOs at shorter ranges, independent of atmospheric effects, and from all aspects. The use of a sub-THz inverse synthetic aperture radar (ISAR) imaging and sensing system for SDA has been proposed in previous work, demonstrating the achievement of sub-cm image resolution at ranges of up to 100 km. This work focuses on recognition of external structures by use of sequential feature detection and tracking throughout the aligned ISAR images of the satellites. The Hough transform is employed to detect linear features, which are tracked throughout the sequence. ISAR imagery is generated via a metaheuristic simulator capable of modelling encounters for a variety of deployment scenarios. Initial frame-to-frame alignment is achieved through a series of affine transformations to facilitate later association between image features. A gradient-by-ratio method is used for edge detection within individual ISAR images, and edge magnitude and direction are subsequently used to inform a double-weighted Hough transform to detect features with high accuracy. Feature evolution during sequences of frames is analysed.
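To make the weighted-voting idea concrete, the sketch below shows a Hough accumulator in which each edge pixel votes with a weight that depends on both its edge magnitude and how well its gradient direction agrees with the candidate line normal. This is one plausible reading of "double-weighted", not the authors' formulation: the product weighting, the function names, and the synthetic edge list are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def weighted_hough(edges, n_theta=180, rho_step=1.0):
    """Accumulate weighted votes for lines rho = x*cos(theta) + y*sin(theta).

    edges: iterable of (x, y, magnitude, direction), where direction is the
    gradient angle in radians. Returns {(theta_index, rho_index): weight}.
    """
    accumulator = defaultdict(float)
    for x, y, magnitude, direction in edges:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            # Second weight: a gradient parallel to the line normal counts fully,
            # a gradient along the line counts for nothing.
            direction_weight = abs(math.cos(direction - theta))
            accumulator[(t, round(rho / rho_step))] += magnitude * direction_weight
    return accumulator


def strongest_line(accumulator, n_theta=180, rho_step=1.0):
    """Return (theta, rho) of the accumulator cell with the largest weight."""
    (t, r), _ = max(accumulator.items(), key=lambda item: item[1])
    return math.pi * t / n_theta, r * rho_step


# Synthetic vertical edge at x = 5 with gradients pointing along the x axis.
edge_pixels = [(5.0, float(y), 1.0, 0.0) for y in range(10)]
theta, rho = strongest_line(weighted_hough(edge_pixels))
```

Compared with an unweighted transform, weak or badly oriented edge pixels contribute little to any cell, which is the property that would make peaks sharper in noisy radar imagery.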
It is shown that the use of feature tracking within sequences with the proposed approach will increase confidence in feature detection and classification, and an example use-case of robust detection of shadowing as a feature is presented.
[116] OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence
Yu Zheng,Jie Hu,Kailun Yang,Jiaming Zhang
Main category: cs.CV
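TL;DR: Introduces the concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), covering reactive and proactive future forecasting for autonomous driving, and builds the OccSTeP benchmark with challenging scenarios. The proposed OccSTeP-WM is a tokenizer-free model with a dense voxel-based state representation and linear-complexity attention, supporting online inference and remaining robust to missing or noisy historical input.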
Details
Motivation: Autonomous driving needs a robust, temporally persistent understanding of 3D scenes and the ability to predict outcomes under different future actions. Existing methods handle temporal disturbances and future action interventions poorly, so a framework that addresses both reactive and proactive forecasting is needed. Method: OccSTeP-WM is a tokenizer-free world model that maintains a dense voxel-based scene state and incrementally fuses spatio-temporal context using a linear-complexity attention backbone and a recurrent state-space module, continually updating the scene memory with ego-motion compensation. Result: On the new OccSTeP benchmark, OccSTeP-WM reaches a semantic mIoU of 23.70% (+6.56%) and an occupancy IoU of 35.89% (+9.26%), is robust to missing or noisy sensor input, and supports online inference. Conclusion: OccSTeP opens a new direction for spatio-temporally persistent scene understanding in autonomous driving, and OccSTeP-WM shows that an efficient architecture can deliver strong future forecasting in complex dynamic environments. Abstract: Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce a new concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), which aims to address two tasks: (1) reactive forecasting: "what will happen next" and (2) proactive forecasting: "what would happen given a specific future action". For the first time, we create a new OccSTeP benchmark with challenging scenarios (e.g., erroneous semantic labels and dropped frames). To address this task, we propose OccSTeP-WM, a tokenizer-free world model that maintains a dense voxel-based scene state and incrementally fuses spatio-temporal context over time. OccSTeP-WM leverages a linear-complexity attention backbone and a recurrent state-space module to capture long-range spatial dependencies while continually updating the scene memory with ego-motion compensation. This design enables online inference and robust performance even when historical sensor input is missing or noisy. Extensive experiments prove the effectiveness of the OccSTeP concept and our OccSTeP-WM, yielding an average semantic mIoU of 23.70% (+6.56% gain) and occupancy IoU of 35.89% (+9.26% gain). The data and code will be open source at https://github.com/FaterYU/OccSTeP.
[117] Towards Physically-Based Sky-Modeling For Image Based Lighting
Ian J. Maquignaz
Main category: cs.CV
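TL;DR: Proposes AllSky, an all-weather sky model learned directly from physically captured HDRI, which models environment lighting accurately over the full dynamic range with user control and addresses the shortcomings of existing DNN sky models in photorealism and illumination consistency.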
Details
Motivation: Existing DNN-based sky models cannot deliver both the Full Dynamic Range (FDR) and photorealism when recreating natural skies, and the environment maps they generate re-light scenes noticeably differently from real HDR imagery, limiting scalability and accuracy in downstream applications. Method: AllSky is learned directly from physically captured High Dynamic Range Imagery (HDRI), supports user control over sun position and cloud formations, and is used to study the input modalities, tonemapping, conditioning, and evaluation of sky models. Result: AllSky achieves state-of-the-art sky-model performance across all weather conditions, generating realistic environment maps that cover the 22 f-stops of outdoor illumination and re-light scenes more like real HDR imagery. Conclusion: Current DNN sky models are not interchangeable with physically captured HDRI or parametric sky models; with real-data training and flexible control, AllSky narrows the gap between full dynamic range and visual realism and offers more reliable lighting for rendering, VR, and scientific applications. Abstract: Accurate environment maps are a key component for rendering photorealistic outdoor scenes with coherent illumination. They enable captivating visual arts, immersive virtual reality, and a wide range of engineering and scientific applications. Recent works have extended sky-models to be more comprehensive and inclusive of cloud formations but, as we demonstrate, existing methods fall short in faithfully recreating natural skies. Though in recent years the visual quality of DNN-generated High Dynamic Range Imagery (HDRI) has greatly improved, the environment maps generated by DNN sky-models do not re-light scenes with the same tones, shadows, and illumination as physically captured HDR imagery. In this work, we demonstrate progress in HDR literature to be tangential to sky-modelling as current works cannot support both photorealism and the 22 f-stops required for the Full Dynamic Range (FDR) of outdoor illumination. We achieve this by proposing AllSky, a flexible all-weather sky-model learned directly from physically captured HDRI which we leverage to study the input modalities, tonemapping, conditioning, and evaluation of sky-models. Per user-controlled positioning of the sun and cloud formations, AllSky expands on current functionality by allowing for intuitive user control over environment maps and achieves state-of-the-art sky-model performance.
Through our proposed evaluation, we demonstrate existing DNN sky-models are not interchangeable with physically captured HDRI or parametric sky-models, with current limitations being prohibitive of scalability and accurate illumination in downstream applications.
[118] IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning
Yuanhang Li,Yiren Song,Junzhe Bai,Xinran Liang,Hu Yang,Libiao Jin,Qi Mao
Main category: cs.CV
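TL;DR: Proposes IC-Effect, an instruction-guided, DiT-based framework for few-shot video visual effects (VFX) editing that synthesizes complex effects while preserving spatial and temporal consistency.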
Details
Motivation: Existing video editing models struggle, under few-shot conditions, to blend effects seamlessly with the background, keep the background fully unchanged, and learn complex effect patterns efficiently. Method: The source video serves as contextual conditioning, exploiting the in-context learning ability of DiT models; a two-stage training strategy combines general editing adaptation with effect-specific learning via Effect-LoRA; spatiotemporal sparse tokenization improves efficiency. Result: The method delivers high-quality, controllable, and temporally consistent VFX editing, validated on a newly built paired dataset spanning 15 high-quality visual styles. Conclusion: IC-Effect provides an effective solution for few-shot video VFX editing and advances automated effect generation in video creation. Abstract: We propose IC-Effect, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (e.g., flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as clean contextual conditions, exploiting the contextual learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning 15 high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.
[119] InpaintDPO: Mitigating Spatial Relationship Hallucinations in Foreground-conditioned Inpainting via Diverse Preference Optimization
Qirui Li,Yizhe Tang,Ran Yi,Guangben Lu,Fangyuan Zou,Peng Shu,Huan Yu,Jie Jiang
Main category: cs.CV
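TL;DR: This paper proposes InpaintDPO, a framework based on Direct Preference Optimization (DPO) that addresses spatial relationship hallucinations in foreground-conditioned inpainting. It combines MaskDPO, Conditional Asymmetric Preference Optimization, and Shared Commonality Preference Optimization to improve spatial plausibility and boundary coherence between foreground and background.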
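The idea of confining preference optimization to the background can be illustrated with a minimal sketch. It assumes a Diffusion-DPO-style objective computed from per-pixel denoising errors; the paper's exact loss is not given in this digest, so the function names, the flat-list image layout, and the `beta` parameter below are all hypothetical. The foreground inpainting loss that the abstract says is retained is omitted here.

```python
import math

def masked_sq_error(pred, target, bg_mask):
    """Sum of squared errors over background pixels only (mask value 1 = background)."""
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, bg_mask))

def masked_dpo_loss(err_win, err_win_ref, err_lose, err_lose_ref, beta=1.0):
    """Assumed Diffusion-DPO-style loss on background-masked errors.

    The loss is -log(sigmoid(-beta * margin)), which equals log(1 + exp(beta * margin)).
    It decreases when the trained model improves on the reference model
    more for the winning sample than for the losing one.
    """
    margin = (err_win - err_win_ref) - (err_lose - err_lose_ref)
    return math.log1p(math.exp(beta * margin))

# Hypothetical 2-pixel example: pixel 0 is background, pixel 1 is foreground.
bg_mask = [1, 0]
err = masked_sq_error([1.0, 5.0], [0.0, 0.0], bg_mask)  # the foreground error is ignored
```

Because the mask zeroes the foreground, which is identical in the winning and losing samples, that region contributes nothing to the preference margin in this sketch.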
Details
Motivation: Existing methods often produce implausible spatial relationships (scale, position, viewpoint) between the foreground and the generated background. Because spatial plausibility is highly subjective and hard to quantify, traditional reward-based reinforcement learning methods are difficult to apply.
Method: Proposes the InpaintDPO framework: 1) MaskDPO restricts preference optimization to the background region, avoiding gradient conflicts and preserving foreground integrity; 2) Conditional Asymmetric Preference Optimization samples pairs with differentiated cropping operations and applies global preference optimization to strengthen boundary coherence; 3) Shared Commonality Preference Optimization extracts and reinforces the spatial commonality shared by high-quality winning samples.
Result: The method improves spatial plausibility and contextual consistency between foreground and background in generated images, outperforming existing DPO methods in both qualitative and quantitative experiments.
Conclusion: InpaintDPO is the first DPO framework dedicated to spatial plausibility in foreground-conditioned inpainting; its combined strategies markedly improve the spatial logic and visual harmony of the results.
Abstract: Foreground-conditioned inpainting, which aims at generating a harmonious background for a given foreground subject based on the text prompt, is an important subfield in controllable image generation. A common challenge in current methods, however, is the occurrence of Spatial Relationship Hallucinations between the foreground subject and the generated background, including inappropriate scale, positional relationships, and viewpoints. Critically, the subjective nature of spatial rationality makes it challenging to quantify, hindering the use of traditional reward-based RLHF methods. To address this issue, we propose InpaintDPO, the first Direct Preference Optimization (DPO) based framework dedicated to spatial rationality in foreground-conditioned inpainting, ensuring plausible spatial relationships between foreground and background elements. To resolve the gradient conflicts in standard DPO caused by identical foreground in win-lose pairs, we propose MaskDPO, which confines preference optimization exclusively to the background to enhance background spatial relationships, while retaining the inpainting loss in the foreground region for robust foreground preservation. To enhance coherence at the foreground-background boundary, we propose Conditional Asymmetric Preference Optimization, which samples pairs with differentiated cropping operations and applies global preference optimization to promote contextual awareness and enhance boundary coherence. Finally, based on the observation that winning samples share a commonality in plausible spatial relationships, we propose Shared Commonality Preference Optimization to enhance the model's understanding of spatial commonality across high-quality winning samples, further promoting shared spatial rationality.
[120] Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift
Jiacheng Cui,Bingkui Tong,Xinyue Bi,Xiaohan Zhao,Jiacheng Liu,Zhiqiang Shen
Main category: cs.CV
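TL;DR: This paper proposes HALD, a new training paradigm that incorporates hard labels to mitigate the local semantic drift that soft labels cause when distilling from only a few crops per image, markedly improving generalization.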
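To make the soft/hard hybridization concrete, here is a minimal sketch that interpolates a teacher's soft label with the one-hot hard label before computing cross-entropy. This is a generic mixing scheme chosen for illustration, not HALD's actual procedure, which this digest does not spell out; the function names and the weight `alpha` are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_label_loss(logits, soft_target, hard_label, alpha=0.5):
    """Cross-entropy against alpha * one_hot(hard_label) + (1 - alpha) * soft_target.

    alpha = 0 uses only the teacher's soft label; alpha = 1 uses only the hard label.
    """
    probs = softmax(logits)
    loss = 0.0
    for k, p in enumerate(probs):
        one_hot = 1.0 if k == hard_label else 0.0
        target = alpha * one_hot + (1.0 - alpha) * soft_target[k]
        loss -= target * math.log(p)
    return loss

# Hypothetical drifted crop: the image is class 0, but the teacher's soft label favors class 1.
logits = [2.0, 0.0]
soft_only = mixed_label_loss(logits, [0.2, 0.8], hard_label=0, alpha=0.0)
hard_only = mixed_label_loss(logits, [0.2, 0.8], hard_label=0, alpha=1.0)
```

In this toy case the student already predicts the true class, so the soft-only loss penalizes it for disagreeing with the drifted teacher, while the hard label anchors the target to the image-level semantics.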
Details
Motivation: When knowledge distillation uses only a few crops per image, soft labels are prone to semantic drift driven by local visual similarity, which causes a distribution mismatch between training and testing.
Method: Theoretically analyzes the semantic drift of soft labels under few crops and proposes using hard labels as intermediate corrective signals, building a hybrid soft/hard-label training framework (HALD) that restores alignment between visual content and semantic supervision.
Result: Reaches 42.7% accuracy on ImageNet-1K with only 285M of soft-label storage, 9.0% above the previous best method LPLD, with consistent improvements across dataset distillation and classification tasks.
Conclusion: Hard labels effectively calibrate the semantic drift in soft labels and are an important complement to soft-label training; their role in knowledge distillation deserves to be revisited.
Abstract: Soft labels generated by teacher models have become a dominant paradigm for knowledge transfer and recent large-scale dataset distillation such as SRe2L, RDED, LPLD, offering richer supervision than conventional hard labels. However, we observe that when only a limited number of crops per image are used, soft labels are prone to local semantic drift: a crop may visually resemble another class, causing its soft embedding to deviate from the ground-truth semantics of the original image. This mismatch between local visual content and global semantic meaning introduces systematic errors and distribution misalignment between training and testing. In this work, we revisit the overlooked role of hard labels and show that, when appropriately integrated, they provide a powerful content-agnostic anchor to calibrate semantic drift. We theoretically characterize the emergence of drift under few soft-label supervision and demonstrate that hybridizing soft and hard labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which leverages hard labels as intermediate corrective signals while retaining the fine-grained advantages of soft labels. Extensive experiments on dataset distillation and large-scale conventional classification benchmarks validate our approach, showing consistent improvements in generalization. On ImageNet-1K, we achieve 42.7% with only 285M storage for soft labels, outperforming prior state-of-the-art LPLD by 9.0%. Our findings re-establish the importance of hard labels as a complementary tool, and call for a rethinking of their role in soft-label-dominated training.
[121] Stylized Synthetic Augmentation further improves Corruption Robustness
Georg Siedel,Rojan Regmi,Abhirami Anand,Weijia Shao,Silvia Vock,Andrey Morozov
Main category: cs.CV
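TL;DR: Proposes a data augmentation method that combines synthetic image data with neural style transfer to improve the robustness of deep vision models to common corruptions.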
Details
Motivation: Address the vulnerability of deep vision models to common image corruptions.
Method: Applies neural style transfer to synthetic images and systematically analyzes how this strategy combines with different data augmentation methods.
Result: Reaches state-of-the-art robust accuracy of 93.54%, 74.9%, and 50.86% on CIFAR-10-C, CIFAR-100-C, and TinyImageNet-C, respectively.
Conclusion: Stylization and synthetic data complement each other and combine well with methods such as TrivialAugment, significantly improving model robustness.
Abstract: This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of deep vision models to common corruptions. We show that although applying style transfer on synthetic images degrades their quality with respect to the common FID metric, these images are surprisingly beneficial for model training. We conduct a systematic empirical analysis of the effects of both augmentations and their key hyperparameters on the performance of image classifiers. Our results demonstrate that stylization and synthetic data complement each other well and can be combined with popular rule-based data augmentation techniques such as TrivialAugment, while not working with others. Our method achieves state-of-the-art corruption robustness on several small-scale image classification benchmarks, reaching 93.54%, 74.9% and 50.86% robust accuracy on CIFAR-10-C, CIFAR-100-C and TinyImageNet-C, respectively.
[122] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Yifei Li,Wenzhao Zheng,Yanran Zhang,Runze Sun,Yu Zheng,Lei Chen,Jie Zhou,Jiwen Lu
Main category: cs.CV
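TL;DR: This paper presents Skyra, a multimodal large language model specialized in detecting human-perceivable visual artifacts in AI-generated videos, and builds the large-scale annotated dataset ViF-CoT-4K and the new benchmark ViF-Bench to enable explainable video detection.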
Details
Motivation: Most existing detectors for AI-generated video are limited to binary classification, lack interpretability, and cannot give human-understandable grounds for their decisions, so detection models with explanatory ability are needed.
Method: Proposes the Skyra model, builds the large-scale dataset ViF-CoT-4K with fine-grained human annotations, uses a two-stage training strategy to improve spatio-temporal artifact perception, explanation, and detection, and establishes the new benchmark ViF-Bench.
Result: Experiments show that Skyra surpasses existing methods on multiple benchmarks, and the new benchmark yields valuable insights for advancing explainable AI-generated video detection.
Conclusion: By combining perceivable-artifact recognition with language-based explanation, Skyra achieves effective and explainable AI-generated video detection and moves the field toward greater transparency and trustworthiness.
Abstract: The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.
[123] VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
Kyle Sargent,Ruiqi Gao,Philipp Henzler,Charles Herrmann,Aleksander Holynski,Li Fei-Fei,Jiajun Wu,Jason Zhang
Main category: cs.CV
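TL;DR: This paper proposes VLIC, an image compression system built on vision-language models (VLMs). It exploits the zero-shot ability of VLMs to judge image quality and uses preference-based post-training to align compression results with human perception, reaching leading performance under multiple evaluations.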
Details
Motivation: Conventional evaluation of image compression relies on distortion metrics such as MSE, which are poorly aligned with human perception. Existing methods use differentiable perceptual losses trained on human psycho-visual data, which require large amounts of annotation. The authors aim to use the strong zero-shot visual reasoning of VLMs to align directly with human preferences without designing an additional loss function.
Method: Proposes VLIC, a diffusion-based image compression framework that uses a VLM as a zero-shot reward model providing binary preference judgments (e.g., 2AFC) and post-trains the diffusion model with preference optimization techniques (e.g., DPO variants) so that decompressed images better match human perception.
Result: Experiments show that VLIC, calibrated on VLM preferences, performs strongly on multiple perceptual metrics and in large-scale user studies, matching or exceeding the state of the art; the paper also analyzes the VLM reward design and training procedure in depth and reports key insights.
Conclusion: VLMs have strong zero-shot image quality assessment ability and can effectively guide the optimization of image compression systems. VLIC shows the potential of integrating large-model reasoning directly into the training of generative systems, opening a path toward compression that is closer to human perception.
Abstract: Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic
[124] End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
Yuwei Guo,Ceyuan Yang,Hao He,Yang Zhao,Meng Wei,Zhenheng Yang,Weilin Huang,Dahua Lin
Main category: cs.CV
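TL;DR: Proposes Resampling Forcing, a teacher-free training framework for autoregressive video diffusion models that enables end-to-end training through self-resampling and a sparse causal mask, and introduces a history routing mechanism to make long-sequence generation more efficient.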
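The history routing idea, retrieving the top-k most relevant history frames for each query without learned parameters, can be sketched as follows. The relevance score used here (a dot product between a query vector and one descriptor vector per frame) is an assumption made for illustration; this digest does not specify how the paper scores relevance, and `route_history` is a hypothetical name.

```python
def route_history(query, history_keys, k=2):
    """Return the indices of the k history frames most similar to the query.

    Similarity is a plain dot product, so no parameters are learned.
    The selected indices are returned in temporal order.
    """
    scores = [sum(q * h for q, h in zip(query, key)) for key in history_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical 2-D descriptors for four history frames.
history = [[0.9, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 0.1]]
selected = route_history([1.0, 0.0], history, k=2)  # frames 0 and 3 score highest
```

Attending only to the selected frames keeps the per-query cost bounded by k rather than by the full history length, which is the efficiency argument made for long-horizon generation.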
Details
Motivation: Address the exposure bias that autoregressive video diffusion models suffer from the train-test mismatch, without relying on a bidirectional teacher model or an online discriminator.
Method: Proposes the Resampling Forcing framework. A self-resampling scheme simulates inference-time errors on history frames, and a sparse causal mask enforces temporal causality while supporting parallel training. A parameter-free history routing mechanism dynamically retrieves the most relevant history frames for generation.
Result: Experiments show performance comparable to distillation-based baselines, better temporal consistency on long videos, and support for native-length training.
Conclusion: Resampling Forcing provides a scalable, end-to-end training scheme for autoregressive video diffusion models that mitigates exposure bias and improves the quality and efficiency of long video generation.
Abstract: Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
[125] GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
Yu Wang,Juhyung Ha,Frangil M. Ramirez,Yuchen Wang,David J. Crandall
Main category: cs.CV
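TL;DR: This paper proposes GateFusion, a new architecture whose Hierarchical Gated Fusion Decoder (HiGate) fuses the visual and audio modalities progressively at multiple depths. Two auxiliary losses (MAL and OPP) strengthen cross-modal alignment and suppress spurious activations, giving state-of-the-art results on several mainstream ASD benchmarks.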
Details
Motivation: Most existing ASD methods use late fusion, which struggles to capture fine-grained cross-modal interactions and limits robustness in complex scenes, so a more effective fusion mechanism is needed.
Method: Proposes the GateFusion architecture, which combines strong pretrained unimodal encoders with the HiGate decoder and uses learnable, bimodally-conditioned gates to inject context adaptively at multiple Transformer layers. Two auxiliary losses are added: MAL to strengthen cross-modal alignment and OPP to suppress spurious video-only activations.
Result: Reaches 77.8%, 86.1%, and 96.1% mAP on Ego4D-ASD, UniTalk, and WASD, respectively, improving on prior methods, and performs competitively on AVA-ActiveSpeaker. Out-of-domain experiments and ablations confirm the model's generalization and the contribution of each component.
Conclusion: Through fine-grained multi-level cross-modal fusion and auxiliary learning objectives, GateFusion substantially improves active speaker detection, with good generalization and complementary components.
Abstract: Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
[126] Multi-View Foundation Models
Leo Segre,Or Hirschorn,Shai Avidan
Main category: cs.CV
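TL;DR: Proposes a way to convert a foundation model into a multi-view foundation model by adding 3D-aware attention that improves cross-view feature consistency. The approach is demonstrated on surface normal estimation and multi-view segmentation and improves feature matching considerably over current foundation models.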
Details
Motivation: Existing single-image foundation models do not guarantee consistent features for the same 3D point across different views of a 3D scene.
Method: Adds intermediate 3D-aware attention layers to Transformer-based foundation models (such as DINO, SAM, and CLIP) so that features are aligned across multi-view images, achieving feature consistency directly in image space.
Result: The method considerably improves cross-view feature matching without building an explicit 3D feature model, and is demonstrated on surface normal estimation and multi-view segmentation.
Conclusion: The proposed multi-view foundation model effectively strengthens cross-view feature consistency and offers an efficient, flexible framework for multi-view understanding tasks.
Abstract: Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for various applications. However, in case we have multiple views of the same 3D scene, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformers-based foundation models (i.e., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help match features across different views. As leading examples, we show surface normal estimation and multi-view segmentation tasks. Quantitative experiments show that our method improves feature matching considerably compared to current foundation models.
[127] Gaussian Pixel Codec Avatars: A Hybrid Representation for Efficient Rendering
Divam Gupta,Anuj Pahuja,Nemanja Bartolovic,Tomas Simon,Forrest Iandola,Giljoo Nam
Main category: cs.CV
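TL;DR: This paper proposes Gaussian Pixel Codec Avatars (GPiCA), a method that combines a triangle mesh with anisotropic 3D Gaussians to build photorealistic head avatars from multi-view images that render efficiently on mobile devices.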
Details
Motivation: Achieve head avatar rendering on mobile devices that is both efficient and photorealistic, overcoming the limits of traditional methods in representing non-surface regions such as hair and beard.
Method: Uses a hybrid representation that combines a triangle mesh with 3D Gaussians and a unified differentiable rendering pipeline. Neural networks decode a facial expression code into a 3D mesh, an RGBA texture, and a set of 3D Gaussians, which are trained and rendered jointly.
Result: Experiments show that GPiCA keeps the realism of purely Gaussian-based methods while matching the rendering efficiency of mesh-based methods, making it suitable for mobile devices.
Conclusion: By combining the strengths of meshes and 3D Gaussians, GPiCA strikes a good balance between visual quality and rendering speed and is a high-quality avatar solution for mobile devices.
Abstract: We present Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that can be generated from multi-view images and efficiently rendered on mobile devices. GPiCA utilizes a unique hybrid representation that combines a triangle mesh and anisotropic 3D Gaussians. This combination maximizes memory and rendering efficiency while maintaining a photorealistic appearance. The triangle mesh is highly efficient in representing surface areas like facial skin, while the 3D Gaussians effectively handle non-surface areas such as hair and beard. To this end, we develop a unified differentiable rendering pipeline that treats the mesh as a semi-transparent layer within the volumetric rendering paradigm of 3D Gaussian Splatting. We train neural networks to decode a facial expression code into three components: a 3D face mesh, an RGBA texture, and a set of 3D Gaussians. These components are rendered simultaneously in a unified rendering engine. The networks are trained using multi-view image supervision. Our results demonstrate that GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars.
[128] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
Lunbin Zeng,Jingfeng Yao,Bencheng Liao,Hongyuan Tao,Wenyu Liu,Xinggang Wang
Main category: cs.CV
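TL;DR: Proposes DiffusionVL, a diffusion vision language model that can be converted from a strong autoregressive model through simple fine-tuning. A block-decoding design supports arbitrary-length generation and KV cache reuse, substantially improving inference speed and performance.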
Details
Motivation: Existing diffusion vision language models (dVLMs) lag far behind mainstream autoregressive models because of the limited capability of their base language models, which raises the question of whether high-performing dVLMs can be built on strong existing autoregressive models.
Method: Adapts pretrained autoregressive models to the diffusion paradigm through simple fine-tuning, and introduces a block-decoding design that supports arbitrary-length generation and KV cache reuse to improve inference efficiency.
Result: With less than 5% of the training data required by prior methods, DiffusionVL gains 34.4% on MMMU-Pro (vision) and 37.5% on MME (Cog.) and achieves a 2x inference speedup; the results also validate the conversion from AR models to the diffusion paradigm.
Conclusion: The diffusion paradigm can inherit the strong capabilities of autoregressive models. DiffusionVL offers a new path to high-performing multimodal models with a better balance of performance, efficiency, and training cost.
Abstract: In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive paradigm (AR), owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that could be translated from any powerful AR models. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) bench and a 37.5% gain on the MME (Cog.) bench, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
[129] In Pursuit of Pixel Supervision for Visual Pre-training
Lihe Yang,Shang-Wen Li,Yang Li,Xinjie Lei,Dong Wang,Abdelrahman Mohamed,Hengshuang Zhao,Hu Xu
Main category: cs.CV
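TL;DR: Pixio is an enhanced masked autoencoder trained with self-supervision on 2 billion web-crawled images using a self-curation strategy. It performs strongly on a range of downstream tasks, showing the potential of pixel-space self-supervised learning.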
Details
Motivation: Explore how competitive autoencoders are in modern self-supervised learning and improve their performance on real-world downstream tasks.
Method: Proposes Pixio, an enhanced masked autoencoder with more challenging pre-training tasks and more capable architectures, trained on large-scale unlabeled images.
Result: Pixio outperforms or matches comparable methods (such as DINOv3) on monocular depth estimation, 3D reconstruction, semantic segmentation, and robot learning, showing broad applicability.
Conclusion: Pixel-space self-supervised learning still has strong potential and can serve as an effective complement to latent-space approaches.
Abstract: At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images with a self-curation strategy with minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
[130] Spatia: Video Generation with Updatable Spatial Memory
Jinjing Zhao,Fangyun Wei,Zhening Liu,Hongyang Zhang,Chang Xu,Yan Lu
Main category: cs.CV
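TL;DR: Proposes Spatia, a video generation framework based on spatial memory that maintains a 3D point cloud to achieve long-term spatial and temporal consistency.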