cs.CL

[1] Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths

Ahilan Ayyachamy Nadar Ponnusamy,Karthic Chandran,M Maruf Hossain

Main category: cs.CL

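TL;DR: This paper studies the trade-off between system performance and model quality when dense Transformer architectures (such as Llama-3.1-70B and Qwen1.5-14B) face large volumes of irrelevant context as the context window grows. It finds that KV cache growth causes non-linear performance degradation and reveals behavioral anomalies in MoE architectures, which are affected by infrastructure bottlenecks at high token volumes.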

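Details Motivation: As LLM context windows grow, managing computational overhead while preserving model quality becomes a key problem, especially because system performance can drop sharply when the context contains large amounts of irrelevant information. Method: Experimentally analyzes dense Transformer models under varying amounts of irrelevant context, focusing on how KV cache growth affects inference efficiency and output quality, and extends the analysis to the MoE architecture to observe how its behavior changes across context lengths. Result: Model performance degrades non-linearly as the KV cache grows; the MoE architecture, despite its theoretical advantages, fails to reach its full potential at high token volumes because of infrastructure bottlenecks. Conclusion: Architectural improvements alone are not enough to meet the challenges of long contexts; system infrastructure must be optimized jointly to mitigate performance degradation. Abstract: The scaling trend in Large Language Models (LLMs) has prioritized increasing the maximum context window to facilitate complex, long-form reasoning and document analysis. However, managing this expanded context introduces severe computational overhead. This paper investigates the critical trade-off between system performance and model quality when dense transformer architectures--specifically Llama-3.1-70B and Qwen1.5-14B--are exposed to large volumes of irrelevant and distracting context. The research identifies a non-linear performance degradation tied to the growth of the Key-Value (KV) cache. Furthermore, an extended analysis of the Mixture-of-Experts (MoE) architecture reveals unique behavioral anomalies at varying context scales, suggesting that architectural benefits may be masked by infrastructure bottlenecks at high token volumes.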
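
To make the KV cache growth concrete, here is a back-of-the-envelope sketch of cache size per sequence. The shape used (80 layers, 8 key-value heads, head dimension 128, 16-bit values) is an assumption about a Llama-3.1-70B-like model and is not taken from the abstract. Memory in this formula grows linearly with token count; the non-linear degradation the paper reports is a system-level effect that this arithmetic does not model.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence of num_tokens tokens."""
    # The factor 2 covers the separate key and value tensors stored at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed Llama-3.1-70B-like shape: 80 layers, 8 KV heads, head_dim 128, fp16 values.
for n in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(n, num_layers=80, num_kv_heads=8, head_dim=128) / 2**30
    print(f"{n:>7} tokens -> {gib:.2f} GiB")
# prints 1.25 GiB, 10.00 GiB and 40.00 GiB respectively
```

Under these assumptions a single 128K-token sequence already needs about 40 GiB for the cache alone, which is why batch size and memory bandwidth become limiting factors well before the nominal context limit.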

[2] Compass-Embedding v4: Robust Contrastive Learning for Multilingual E-commerce Embeddings

Pakorn Ueareeworakul,Shuman Liu,Jinghao Feng,Ling Hu,Zhantang Shi,Chengqi Sun,Liang Yao,Panyi Ouyang,Haibo Zhang,Anxiang Zeng

Main category: cs.CL

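TL;DR: This paper presents Compass-Embedding v4, an efficient multilingual embedding framework optimized for Southeast Asian e-commerce scenarios. Through Class-Aware Masking, synthetic data augmentation, and inference optimization, it achieves state-of-the-art semantic representation performance on low-resource languages.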

Details Motivation: Low-resource Southeast Asian languages face data scarcity, noisy supervision, and production constraints in e-commerce scenarios, and existing general-purpose embedding models struggle to meet retrieval and recommendation needs. Method: Proposes Class-Aware Masking (CAM) to improve the InfoNCE objective, builds a diversified training corpus (synthetic data, cross-lingual translation, and structured e-commerce data), and uses large-batch training, spherical model merging, vLLM, and FP8 quantization to improve training robustness and inference efficiency. Result: On multilingual benchmarks and real e-commerce tasks, Compass-Embedding v4 reaches SOTA performance on major Southeast Asian languages, significantly outperforms general-purpose models, stays competitive on high-resource languages, and improves inference throughput. Conclusion: Compass-Embedding v4 effectively addresses the semantic representation bottleneck for low-resource languages in e-commerce, balances model performance with production efficiency, and is practical to deploy. Abstract: As global e-commerce rapidly expands into emerging markets, the lack of high-quality semantic representations for low-resource languages has become a decisive bottleneck for retrieval, recommendation, and search systems. In this work, we present Compass-Embedding v4, a high-efficiency multilingual embedding framework specifically optimized for Southeast Asian (SEA) e-commerce scenarios, where data scarcity, noisy supervision, and strict production constraints jointly challenge representation learning. Compass-Embedding v4 addresses three core challenges. First, large-batch contrastive training under mixed task supervision introduces systematic false negatives that degrade semantic alignment. We propose Class-Aware Masking (CAM), a lightweight modification to the InfoNCE objective that suppresses invalid in-batch negatives and improves semantic discrimination without altering training efficiency. Second, low-resource SEA languages suffer from limited and uneven data coverage. We construct a diversified training corpus through context-grounded synthetic data generation, cross-lingual translation, and structured e-commerce data construction, enabling robust multilingual and domain-specific learning. Third, production deployment requires high-throughput inference while preserving embedding quality. We combine robustness-driven large-batch training with spherical model merging to mitigate catastrophic forgetting, and optimize inference via vLLM and FP8 quantization.
Extensive evaluations across multilingual benchmarks and proprietary e-commerce tasks show that Compass-Embedding v4 achieves state-of-the-art performance on major SEA languages, significantly outperforming general-purpose embedding models in domain-specific retrieval and classification, while maintaining competitive performance on high-resource languages.
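
The idea of suppressing invalid in-batch negatives can be sketched as follows. This is only an illustration of the general technique, not the paper's CAM implementation: the abstract does not give the exact masking rule, so the sketch assumes each training pair carries a class id (`labels`, hypothetical) and that any in-batch negative sharing the anchor's class is dropped from the InfoNCE denominator.

```python
import math


def masked_info_nce(sim, labels, temperature=0.05):
    """Mean InfoNCE loss over a batch, skipping same-class in-batch negatives.

    sim[i][j] is the similarity between query i and document j; document i is
    the positive for query i. labels[i] is an assumed class id for pair i.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = []
        for j, s in enumerate(row):
            same_class_negative = (i != j) and labels[j] == labels[i]
            if same_class_negative:
                continue  # likely false negative: excluded from the denominator
            logits.append(s / temperature)
        positive_logit = sim[i][i] / temperature
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - positive_logit)
    return sum(losses) / len(losses)


sim = [[0.9, 0.8, 0.1],
       [0.7, 0.9, 0.2],
       [0.1, 0.2, 0.9]]
plain = masked_info_nce(sim, labels=[0, 1, 2])   # all classes differ: ordinary InfoNCE
masked = masked_info_nce(sim, labels=[0, 0, 1])  # pairs 0 and 1 share a class
print(plain > masked)  # prints True
```

With distinct labels the function reduces to ordinary InfoNCE. When two pairs share a class, their cross terms no longer push semantically matching items apart, which is the false-negative problem the abstract describes.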

[3] Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology

Vanessa D'Amario,Randy Daniel,Alessandro Zanetti,Dhruv Edamadaka,Nitya Alaparthy,Joshua Tarkoff

Main category: cs.CL

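TL;DR: This study evaluates six small open-source medical LLMs in pediatric endocrinology. It finds that high consistency does not necessarily imply correctness and that model outputs are sensitive to minor prompt changes and differences in the system environment, indicating the need for a broader diagnostic framework to assess potential risks in clinical decision support.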

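Details Motivation: Existing evaluations of small open-source medical LLMs are mostly limited to multiple-choice accuracy and lack in-depth analysis of consistency, robustness, and reasoning behavior. More comprehensive evaluation methods are needed, especially for low-resource deployment scenarios. Method: Combines multiple-choice testing, human evaluation, and clinical review to assess six small open-source medical LLMs; analyzes prompt variation, self-assessment bias, and output variability under deterministic and stochastic settings; and examines the effect of system-level perturbations such as CUDA builds. Result: HuatuoGPT-o1-8B performs best and is the most consistent, but high consistency does not guarantee correctness. Models show self-assessment bias and dependence on the order of candidate explanations. Expert review of incorrect reasoning rationales finds a mix of clinically acceptable responses and clinical oversights. System-level perturbations (such as different CUDA versions) can shift outputs significantly even though accuracy stays stable. Conclusion: The outputs of small medical LLMs are sensitive to subtle prompt and system-environment changes and show non-negligible variability and bias. Accuracy or consistency metrics alone are not enough to ensure clinical reliability; a more comprehensive diagnostic evaluation framework is needed to address potential risks in real-world use. Abstract: Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. However, their evaluation is often limited to accuracy on medical multiple choice question (MCQ) benchmarks, and lacks evaluation of consistency, robustness, or reasoning behavior. We use MCQ coupled with human evaluation and clinical review to assess six small open-source medical LLMs (HuatuoGPT-o1 (Chen 2024), Diabetica-7B, Diabetica-o1 (Wei 2024), Meditron3-8B (Sallinen 2025), MedFound-7B (Liu 2025), and ClinicaGPT-base-zh (Wang 2023)) in pediatric endocrinology. In deterministic settings, we examine the effect of prompt variation on models' output and self-assessment bias. In stochastic settings, we evaluate output variability and investigate the relationship between consistency and correctness. HuatuoGPT-o1-8B achieved the highest performance. The results show that high consistency across model responses is not an indicator of correctness, although HuatuoGPT-o1-8B showed the highest consistency rate. When tasked with selecting correct reasoning, both HuatuoGPT-o1-8B and Diabetica-o1 exhibit self-assessment bias and dependency on the order of the candidate explanations. Expert review of incorrect reasoning rationales identified a mix of clinically acceptable responses and clinical oversight. We further show that system-level perturbations, such as differences in CUDA builds, can yield statistically significant shifts in model output despite stable accuracy.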
This work demonstrates that small, semantically negligible prompt perturbations lead to divergent outputs, raising concerns about the reproducibility of LLM-based evaluations, and highlights the output variability under different stochastic regimes, emphasizing the need for a broader diagnostic framework to understand potential pitfalls in real-world clinical decision support scenarios.

[4] An Empirical Analysis of Fine-Tuning Large Language Models on Bioinformatics Literature: PRSGPT and BioStarsGPT

Muhammad Muneeb,David B. Ascher

Main category: cs.CL

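TL;DR: This paper presents a reproducible nine-step pipeline for fine-tuning large language models (LLMs) on specialized bioinformatics data, validated through two applications: PRSGPT and BioStarsGPT.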

Details Motivation: LLMs often lack the specialized knowledge needed for complex bioinformatics tasks, so a systematic method is needed to improve their domain adaptation. Method: Proposes a nine-step pipeline covering data integration, structured preprocessing, prompt-based QA generation, natural language inference for quality control, semantic deduplication, clustering-based dataset splitting, and parameter-efficient fine-tuning with LoRA, and runs experiments on three LLMs. Result: Qwen2.5-7B performs best: BLEU-4 and ROUGE-1 improve by 82% and 70% for PRSGPT and by 6% and 18% for BioStarsGPT. PRSGPT reaches 61.9% accuracy in human evaluation, comparable to Google Gemini but with more detailed citation information; BioStarsGPT reaches 59% conceptual accuracy on 142 questions. Conclusion: The pipeline supports scalable, domain-specific LLM fine-tuning and enables privacy-preserving, locally deployable bioinformatics assistants; the paper also discusses practical applications, challenges, and mitigation strategies. Abstract: Large language models (LLMs) often lack specialized knowledge for complex bioinformatics applications. We present a reproducible pipeline for fine-tuning LLMs on specialized bioinformatics data, demonstrated through two use cases: PRSGPT, focused on polygenic risk score (PRS) tools, and BioStarsGPT, trained on community forum discussions. The nine-step pipeline integrates diverse data sources, structured preprocessing, prompt-based question-answer (QA) generation (via Google Gemini), natural language inference (NLI) for quality control, semantic deduplication, clustering-based data splitting, and parameter-efficient fine-tuning using LoRA. We fine-tuned three LLMs (LLaMA-3.2-3B, Qwen2.5-7B, Gemma) and benchmarked them on over 14 lexical and semantic metrics. Qwen2.5-7B emerged as the best performer, with BLEU-4 and ROUGE-1 improvements of 82% and 70% for PRSGPT and 6% and 18% for BioStarsGPT, respectively. The open-source datasets produced include over 28,000 QA pairs for PRSGPT and 154,282 for BioStarsGPT. Human evaluation of PRSGPT yielded 61.9% accuracy on the PRS tools comparison task, comparable to Google Gemini (61.4%), but with richer methodological detail and accurate citations. BioStarsGPT demonstrated 59% conceptual accuracy across 142 curated bioinformatics questions. Our pipeline enables scalable, domain-specific fine-tuning of LLMs.
It enables privacy-preserving, locally deployable bioinformatics assistants, explores their practical applications, and addresses the challenges, limitations, and mitigation strategies associated with their development and use.

[5] Concept Attractors in LLMs and their Applications

Sotirios Panagiotis Chytas,Vikas Singh

Main category: cs.CL

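TL;DR: This paper uses Iterated Function Systems (IFS) to explain how LLMs internally represent semantically related prompts and, building on concept attractors, develops training-free methods that work well for translation, hallucination reduction, guardrailing, and synthetic data generation.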

Details Motivation: To understand how LLMs map semantically related prompts to similar internal representations, and to use this mechanism to design more efficient ways of solving tasks. Method: Treats model layers as contractive mappings toward concept-specific attractors and builds training-free interventions that operate directly on these attractors. Result: The attractor-based methods match or exceed specialized baselines on multiple tasks, with good generalization and efficiency. Conclusion: Attractor-based interventions offer a lightweight, general, training-free paradigm for applying LLMs. Abstract: Large language models (LLMs) often map semantically related prompts to similar internal representations at specific layers, even when their surface forms differ widely. We show that this behavior can be explained through Iterated Function Systems (IFS), where layers act as contractive mappings toward concept-specific Attractors. We leverage this insight and develop simple, training-free methods that operate directly on these Attractors to solve a wide range of practical tasks, including language translation, hallucination reduction, guardrailing, and synthetic data generation. Despite their simplicity, these Attractor-based interventions match or exceed specialized baselines, offering an efficient alternative to heavy fine-tuning, generalizable in scenarios where baselines underperform.
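
As a toy illustration of what "contractive mapping toward an attractor" means, the sketch below repeatedly applies a simple affine contraction to two very different starting points and shows that both end up at the same fixed point. Transformer layers are not affine maps of this kind; the function names and the contraction factor are hypothetical and only illustrate the mathematical notion the abstract invokes.

```python
import math


def contract_toward(point, attractor, factor=0.5):
    """One contractive step: the distance to `attractor` shrinks by `factor`."""
    return tuple(a + factor * (p - a) for p, a in zip(point, attractor))


def iterate(point, attractor, steps, factor=0.5):
    """Apply the contraction `steps` times, like passing through `steps` layers."""
    for _ in range(steps):
        point = contract_toward(point, attractor, factor)
    return point


attractor = (1.0, 2.0)
# Two "prompts" with very different surface forms (starting points).
end_a = iterate((10.0, -4.0), attractor, steps=30)
end_b = iterate((-7.0, 25.0), attractor, steps=30)
print(math.dist(end_a, end_b) < 1e-6)  # prints True
```

Because each step multiplies the distance to the attractor by a factor below 1, any starting point converges to it, which mirrors the claim that differently worded prompts about one concept reach similar internal representations.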

[6] LimAgents: Multi-Agent LLMs for Generating Research Limitations

Ibrahim Al Azher,Zhishuai Guo,Hamed Alhoori

Main category: cs.CL

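TL;DR: Proposes LimAgents, a multi-agent LLM framework for generating deeper and more comprehensive analyses of research limitations. It combines peer-review comments, cited literature, and author statements, and substantially improves the coverage of identified limitations.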

Details Motivation: Zero-shot LLMs tend to produce superficial research limitations and are constrained by incomplete or shallow author disclosures, making it hard to surface deeper methodological flaws and contextual gaps. Method: Builds a multi-agent framework with role-specific agents (extraction, analysis, simulated reviewer, citation analysis), uses OpenReview comments and citation information as stronger ground truth, and consolidates outputs through Judge and Master agents; also proposes a pointwise LLM-as-a-Judge evaluation protocol to better measure limitation coverage. Result: The RAG + multi-agent GPT-4o mini configuration improves coverage by 15.51% over the zero-shot baseline, and the Llama 3 8B multi-agent setup improves it by 4.41%. Conclusion: LimAgents systematically identifies explicit and implicit limitations, peer-review-focused issues, and literature-informed weaknesses, clearly outperforming traditional approaches and advancing research transparency and automated review. Abstract: Identifying and articulating limitations is essential for transparent and rigorous scientific research. However, zero-shot large language model (LLM) approaches often produce superficial or general limitation statements (e.g., dataset bias or generalizability). They usually repeat limitations reported by authors without looking at deeper methodological issues and contextual gaps. This problem is made worse because many authors disclose only partial or trivial limitations. We propose LimAgents, a multi-agent LLM framework for generating substantive limitations. LimAgents integrates OpenReview comments and author-stated limitations to provide stronger ground truth. It also uses cited and citing papers to capture broader contextual weaknesses. In this setup, different agents take on specific roles in sequence: some extract explicit limitations, others analyze methodological gaps, some simulate the viewpoint of a peer reviewer, and a citation agent places the work within the larger body of literature. A Judge agent refines their outputs, and a Master agent consolidates them into a clear set. This structure allows for systematic identification of explicit, implicit, peer review-focused, and literature-informed limitations. Moreover, traditional NLP metrics like BLEU, ROUGE, and cosine similarity rely heavily on n-gram or embedding overlap. They often overlook semantically similar limitations. To address this, we introduce a pointwise evaluation protocol that uses an LLM-as-a-Judge to measure coverage more accurately.
Experiments show that LimAgents substantially improves performance. The RAG + multi-agent GPT-4o mini configuration achieves a +15.51% coverage gain over zero-shot baselines, while the Llama 3 8B multi-agent setup yields a +4.41% improvement.

[7] Bielik 11B v3: Multilingual Large Language Model for European Languages

Krzysztof Ociepa,Łukasz Flis,Remigiusz Kinas,Krzysztof Wróbel,Adrian Gwoździej

Main category: cs.CL

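TL;DR: Bielik 11B v3 is an 11-billion-parameter language model optimized for Polish, scaled up from the Mistral 7B v0.2 architecture. After multi-stage training it performs strongly across a range of tasks and offers efficient parameter use and quantized deployment.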

Details Motivation: To improve support for less-represented languages such as Polish and to develop an efficient, high-performance language model. Method: Scales Mistral 7B v0.2 to 11B parameters via depth up-scaling and applies a four-stage training pipeline: continuous pre-training, supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and reinforcement learning. Result: Significantly outperforms other Polish language models across tasks and surpasses larger models with 2-6 times more parameters, while supporting multiple quantization schemes for efficient deployment on different hardware. Conclusion: Bielik 11B v3 sets a new benchmark for developing efficient large models for less-represented languages. Abstract: We present Bielik 11B v3, a state-of-the-art language model highly optimized for the Polish language, while also maintaining strong capabilities in other European languages. This model extends the Mistral 7B v0.2 architecture, scaled to 11B parameters via depth up-scaling. Its development involved a comprehensive four-stage training pipeline: continuous pre-training, supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and reinforcement learning. Comprehensive evaluations demonstrate that Bielik 11B v3 achieves exceptional performance. It significantly surpasses other specialized Polish language models and outperforms many larger models (with 2-6 times more parameters) on a wide range of tasks, from basic linguistic understanding to complex reasoning. The model's parameter efficiency, combined with extensive quantization options, allows for effective deployment across diverse hardware configurations. Bielik 11B v3 not only advances AI capabilities for the Polish language but also establishes a new benchmark for developing resource-efficient, high-performance models for less-represented languages.

[8] Speculative Decoding: Performance or Illusion?

Xiaoxuan Liu,Jiaxiang Yu,Jongseok Park,Ion Stoica,Alvin Cheung

Main category: cs.CL

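TL;DR: This paper presents the first systematic study of the real-world performance of speculative decoding (SD) on the production-grade inference engine vLLM. It evaluates multiple SD variants across workloads, model scales, and batch sizes, and reveals a substantial gap between current implementations and the theoretical speedup bound. The main bottlenecks are the target model's verification overhead and unstable acceptance length, which point to new research directions for improving SD.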

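Details Motivation: Although speculative decoding (SD) is widely believed to accelerate LLM inference, its effectiveness in real production environments is unclear, because prior evaluations mostly relied on research prototypes and unrealistically small batch sizes. A systematic evaluation of SD in a real deployed system is therefore needed. Method: Runs systematic experiments on multiple SD variants (n-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) on the production-grade inference engine vLLM, covering diverse workloads, model scales, and batch sizes; analyzes the key factors that govern SD performance; and quantifies a theoretical upper bound on SD speedup. Result: The target model's verification stage dominates overall execution time, and acceptance length varies markedly across output positions, requests, and datasets; measured performance falls well short of the theoretical speedup bound. Conclusion: SD in real systems is still far from its theoretical performance ceiling, limited mainly by verification overhead and fluctuating acceptance length; the study reveals new opportunities to improve SD efficiency by optimizing these aspects. Abstract: Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants (n-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze key factors governing SD performance, and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates the execution, while acceptance length varies markedly across output token positions, requests, and datasets. Comparing measured performance with theoretical bounds reveals substantial gaps between observed and theoretical upper bounds, and we leverage this observation to highlight new research opportunities that our study opens up in improving SD.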
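
For intuition about how acceptance length bounds SD speedup, the sketch below uses the common simplified model in which each drafted token is accepted independently with probability `alpha`. This is the textbook idealization, not the upper bound derived in this paper, and `draft_cost_ratio` (cost of one draft step relative to one target-model step) is an assumed parameter.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per verification step.

    alpha: per-token acceptance probability (assumed i.i.d.).
    gamma: number of tokens drafted before each verification.
    """
    if alpha == 1.0:
        return gamma + 1.0  # every draft accepted, plus the token from verification
    # Geometric series 1 + alpha + ... + alpha**gamma
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def idealized_speedup(alpha, gamma, draft_cost_ratio):
    """Tokens per step divided by the relative cost of one draft-and-verify step."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost_ratio + 1)


for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(idealized_speedup(alpha, gamma=4, draft_cost_ratio=0.05), 2))
```

The model assumes verification costs the same as a single target-model decoding step. The study's finding that verification dominates execution time at realistic batch sizes is one reason measured speedups fall below estimates of this kind.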

[9] Enhancing the QA Model through a Multi-domain Debiasing Framework

Yuefeng Wang,ChangJae Lee

Main category: cs.CL

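TL;DR: This study evaluates the ELECTRA-small model on SQuAD v1.1 and adversarial datasets and proposes a multi-domain debiasing framework that combines knowledge distillation, debiasing techniques, and domain expansion, improving the model's robustness on complex and adversarial questions.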

Details Motivation: QA models often show biases on complex and adversarial queries, related to lexical bias, numerical reasoning, and entity recognition, which hurt performance; effective debiasing strategies are needed to improve reliability. Method: Analyzes the errors of the ELECTRA-small model on SQuAD v1.1, AddSent, and AddOneSent to identify the main bias types, then builds a multi-domain debiasing framework that combines knowledge distillation, debiasing techniques, and domain expansion. Result: The framework improves EM and F1 scores by up to 2.6 percentage points across all test sets, with gains under adversarial conditions in particular. Conclusion: Targeted debiasing strategies can markedly improve the robustness and reliability of natural language understanding systems, and the multi-domain approach offers an effective path to model debiasing. Abstract: Question-answering (QA) models have advanced significantly in machine reading comprehension but often exhibit biases that hinder their performance, particularly with complex queries in adversarial conditions. This study evaluates the ELECTRA-small model on the Stanford Question Answering Dataset (SQuAD) v1.1 and adversarial datasets AddSent and AddOneSent. By identifying errors related to lexical bias, numerical reasoning, and entity recognition, we develop a multi-domain debiasing framework incorporating knowledge distillation, debiasing techniques, and domain expansion. Our results demonstrate up to 2.6 percentage point improvements in Exact Match (EM) and F1 scores across all test sets, with gains in adversarial contexts. These findings highlight the potential of targeted bias mitigation strategies to enhance the robustness and reliability of natural language understanding systems.
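
For readers unfamiliar with the two reported metrics, here is a minimal sketch of SQuAD-style Exact Match and token-level F1 for a single prediction and reference. It follows the usual normalization (lowercasing, stripping punctuation and English articles), written from memory rather than copied from the official evaluation script, so small differences from that script are possible.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # prints 1.0
print(f1_score("in Paris France", "Paris"))              # prints 0.5
```

In the full benchmark, each question has several reference answers and the per-question score is the maximum over them, averaged across the dataset.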

[10] Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents

Hyunjun Kim

Main category: cs.CL

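TL;DR: This paper proposes Entropic Context Shaping (ECS), an information-theoretic framework that measures the utility of a context by how far it shifts the model's answer distribution toward the correct answer. On multi-turn context selection tasks it clearly outperforms methods based on lexical similarity.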

Details Motivation: Existing context engineering methods rely on lexical overlap, cannot reliably tell information that actually helps answer a question from distractors, and lack an accurate measure of a context's real utility. Method: Proposes the ECS framework, which formalizes context utility as the signed change in the model's answer probability, analyzes its properties with information-theoretic tools, and evaluates how much a context actually contributes to steering the model toward the correct answer. Result: On multi-turn context selection tasks from the LongMemEval and LoCoMo benchmarks, ECS with Llama-3.1-8B reaches F1=0.265 on fine-grained turn selection, a 71.83% relative improvement over TF-IDF (F1=0.154). Conclusion: ECS captures the pragmatic utility of context and outperforms traditional lexical similarity methods, especially on tasks that require precise context selection. Abstract: Context engineering for large language model (LLM) agents requires distinguishing pragmatically useful information from misleading distractors. We introduce Entropic Context Shaping (ECS), an information-theoretic framework that measures context utility via the shift in the model's answer distribution toward the correct answer. Unlike lexical similarity methods that rely on word overlap, ECS captures pragmatic utility -- whether a passage actually helps answer the question. We formalize utility as the signed change in answer probability and provide theoretical analysis showing that task-irrelevant updates yield near-zero distribution shift. We evaluate on multi-turn context selection tasks using LongMemEval (session-level) and LoCoMo (turn-level) benchmarks. On fine-grained turn selection, ECS with Llama-3.1-8B achieves F1=0.265, a 71.83% relative improvement over TF-IDF (F1=0.154), demonstrating that pragmatic utility outperforms lexical similarity when precise context selection matters. Code and data are available in the supplementary materials.
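
The notion of utility as a signed change in answer probability can be sketched in a few lines. This is a simplified reading of the abstract, not the authors' code: it assumes the probabilities of the correct answer with and without each passage have already been obtained from a model, and the function names, passage ids, and threshold are hypothetical. A log-ratio variant would be an equally plausible choice.

```python
def context_utility(p_answer_with_context, p_answer_without_context):
    """Signed change in the probability of the correct answer caused by a passage."""
    return p_answer_with_context - p_answer_without_context


def select_contexts(candidates, p_baseline, threshold=0.0):
    """Return passage ids whose utility exceeds `threshold`, most useful first.

    candidates maps passage id -> p(correct answer | question, passage).
    p_baseline is p(correct answer | question) with no passage.
    """
    scored = [(context_utility(p, p_baseline), pid) for pid, p in candidates.items()]
    return [pid for utility, pid in sorted(scored, reverse=True) if utility > threshold]


# Hypothetical probabilities, for illustration only.
candidates = {"helpful_turn": 0.7, "distractor_turn": 0.2, "irrelevant_turn": 0.4}
print(select_contexts(candidates, p_baseline=0.4))  # prints ['helpful_turn']
```

A distractor gets negative utility and an irrelevant passage gets utility near zero, matching the abstract's claim that task-irrelevant updates yield near-zero distribution shift. A lexical method would score all three by word overlap alone.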

[11] Towards AGI A Pragmatic Approach Towards Self Evolving Agent

Indrajit Kar,Sammy Zonunpuia,Zonunfeli Ralte

Main category: cs.CL

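TL;DR: Proposes a hierarchical self-evolving multi-agent framework that lets LLM-based agents generate tools autonomously and keep evolving their capabilities.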

Details Motivation: Existing LLM agents cannot autonomously expand their capabilities or generate new tools after deployment, and so lack continuous adaptability. Method: Combines a Base LLM, an operational SLM, a Code-Generation LLM, and a Teacher-LLM, achieving self-evolution through task attempts, tool synthesis, and an evolution phase (Curriculum Learning, Reward-Based Learning, or a Genetic Algorithm). Result: Validated on the TaskCraft dataset: CL recovers quickly and generalizes well, RL excels on high-difficulty tasks, GA offers high behavioral diversity, and all evolved agents outperform their original versions. Conclusion: The framework achieves strong and robust autonomous, self-improving agent evolution. Abstract: Large Language Model (LLM) based agents are powerful yet fundamentally static after deployment, lacking the ability to autonomously expand capabilities, generate new tools, or evolve their reasoning. This work introduces a hierarchical self-evolving multi-agent framework that integrates a Base LLM, an operational SLM agent, a Code-Generation LLM, and a Teacher-LLM to enable continuous adaptation. The workflow begins with the agent attempting a task using reasoning and existing tools; if unsuccessful, it escalates to tool synthesis through the Code-Gen LLM, and when failures persist, it triggers an evolution phase using Curriculum Learning (CL), Reward-Based Learning (RL), or Genetic Algorithm (GA) evolution. Using the TaskCraft dataset, rich in hierarchical tasks, tool-use traces, and difficulty scaling, we evaluate these paradigms. CL delivers fast recovery and strong generalization, RL excels on high-difficulty tasks, and GA offers high behavioral diversity. Across all settings, evolved agents outperform their originals, demonstrating robust, autonomous, self-improving agentic evolution.

Ahmed Rayane Kebir,Vincent Guigue,Lynda Said Lhadj,Laure Soulier

Main category: cs.CL

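TL;DR: This paper proposes RAC (Retrieval-Augmented Clarification), a framework for generating corpus-grounded clarification questions. By combining retrieved context with contrastive preference optimization, it substantially improves the faithfulness and answerability of the questions.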

Details Motivation: Existing conversational search systems often generate clarification questions without grounding them in the underlying document corpus, which can lead to questions that cannot be answered. This paper aims to generate clarification questions that are consistent with the corpus and answerable from it. Method: Proposes the RAC framework: first compares several retrieval indexing strategies, then fine-tunes a large language model to use the retrieved research context, and applies contrastive preference optimization so that the model favors evidence-supported questions over ungrounded ones. Result: Evaluated on four benchmark datasets, RAC significantly outperforms baselines; new metrics based on NLI and data-to-text are introduced to assess how well questions are anchored in the context, and the results show consistent gains in faithfulness. Conclusion: The RAC framework effectively generates corpus-faithful clarification questions and, through retrieval augmentation and preference optimization, improves the quality and usefulness of clarification in conversational search. Abstract: Clarification questions help conversational search systems resolve ambiguous or underspecified user queries. While prior work has focused on fluency and alignment with user intent, especially through facet extraction, much less attention has been paid to grounding clarifications in the underlying corpus. Without such grounding, systems risk asking questions that cannot be answered from the available documents. We introduce RAC (Retrieval-Augmented Clarification), a framework for generating corpus-faithful clarification questions. After comparing several indexing strategies for retrieval, we fine-tune a large language model to make optimal use of research context and to encourage the generation of evidence-based questions. We then apply contrastive preference optimization to favor questions supported by retrieved passages over ungrounded alternatives. Evaluated on four benchmarks, RAC demonstrates significant improvements over baselines. In addition to LLM-as-Judge assessments, we introduce novel metrics derived from NLI and data-to-text to assess how well questions are anchored in the context, and we demonstrate that our approach consistently enhances faithfulness.

[13] Bridging Human Interpretation and Machine Representation: A Landscape of Qualitative Data Analysis in the LLM Era

Xinyu Pi,Qisen Yang,Chuong Nguyen,Hua Shen

Main category: cs.CL

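TL;DR: This paper proposes a 4×4 framework for analyzing the levels of meaning-making and modeling that LLMs perform in qualitative research. It shows that existing systems skew toward low-level, low-commitment outputs and calls for LLM systems that can make explicit, select, and govern their interpretive and modeling commitments.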

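Details Motivation: When LLMs support qualitative research, their outputs vary widely and are opaque, and there is no systematic distinction between levels of meaning-making and modeling, which makes results unreliable or hard to control. Method: Builds a 4×4 analytical framework crossing four levels of meaning-making (descriptive, categorical, interpretive, theoretical) with four levels of modeling (static structure, stages/timelines, causal pathways, feedback dynamics), and uses it to assess existing LLM applications. Result: Current LLM systems concentrate on low-level meaning generation and low-commitment modeling forms, and rarely attempt interpretive or theoretical inference or dynamical modeling. Conclusion: The field should develop LLM systems whose interpretive and theoretical modeling capabilities are explicitly expressed, selectable, and governed by user intent, to improve their reliability and applicability in qualitative research. Abstract: LLMs are increasingly used to support qualitative research, yet existing systems produce outputs that vary widely--from trace-faithful summaries to theory-mediated explanations and system models. To make these differences explicit, we introduce a 4×4 landscape crossing four levels of meaning-making (descriptive, categorical, interpretive, theoretical) with four levels of modeling (static structure, stages/timelines, causal pathways, feedback dynamics). Applying the landscape to prior LLM-based automation highlights a strong skew toward low-level meaning and low-commitment representations, with few reliable attempts at interpretive/theoretical inference or dynamical modeling. Based on the revealed gap, we outline an agenda for applying and building LLM-systems that make their interpretive and modeling commitments explicit, selectable, and governable.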

[14] LIME-LLM: Probing Models with Fluent Counterfactuals, Not Broken Text

George Mihaila,Suleyman Olcay Polat,Poli Nemkova,Himanshu Sharma,Namratha V. Urs,Mark V. Albert

Main category: cs.CL

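TL;DR: This paper proposes LIME-LLM, a new framework for explainable AI in natural language processing that replaces traditional random masking with hypothesis-driven, controlled perturbations, improving the fidelity of local explanations.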

Details Motivation: Existing local explanation methods such as LIME rely on random token masking, which produces semantically invalid, out-of-distribution inputs and weakens explanation fidelity; generative approaches such as LLiMe introduce confounding variables through unconstrained paraphrasing, making it hard to isolate feature contributions. Method: Proposes the LIME-LLM framework, which follows a "Single Mask-Single Sample" protocol and uses neutral infill and boundary infill strategies, relying on LLMs to generate fluent, on-manifold neighborhood samples that strictly isolate feature effects. Result: On three benchmarks (CoLA, SST-2, and HateXplain), with human-annotated rationales as ground truth, LIME-LLM achieves significantly higher local explanation fidelity than traditional methods (LIME, SHAP, Integrated Gradients) and the generative LLiMe method. Conclusion: Through its controlled perturbation mechanism, LIME-LLM delivers higher-fidelity local explanations for black-box NLP models and sets a new benchmark for trustworthy AI in NLP. Abstract: Local explanation methods such as LIME (Ribeiro et al., 2016) remain fundamental to trustworthy AI, yet their application to NLP is limited by a reliance on random token masking. These heuristic perturbations frequently generate semantically invalid, out-of-distribution inputs that weaken the fidelity of local surrogate models. While recent generative approaches such as LLiMe (Angiulli et al., 2025b) attempt to mitigate this by employing Large Language Models for neighborhood generation, they rely on unconstrained paraphrasing that introduces confounding variables, making it difficult to isolate specific feature contributions. We introduce LIME-LLM, a framework that replaces random noise with hypothesis-driven, controlled perturbations. By enforcing a strict "Single Mask-Single Sample" protocol and employing distinct neutral infill and boundary infill strategies, LIME-LLM constructs fluent, on-manifold neighborhoods that rigorously isolate feature effects. We evaluate our method against established baselines (LIME, SHAP, Integrated Gradients) and the generative LLiMe baseline across three diverse benchmarks: CoLA, SST-2, and HateXplain, using human-annotated rationales as ground truth. Empirical results demonstrate that LIME-LLM establishes a new benchmark for black-box NLP explainability, achieving significant improvements in local explanation fidelity compared to both traditional perturbation-based methods and recent generative alternatives.

[15] Early Linguistic Pattern of Anxiety from Social Media Using Interpretable Linguistic Features: A Multi-Faceted Validation Study with Author-Disjoint Evaluation

Arnab Das Utsa

Main category: cs.CL

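TL;DR: Proposes a transparent method for detecting anxiety from social media language. Using interpretable linguistic features and cross-domain validation, it achieves reliable and robust anxiety detection on large-scale Reddit data.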

Details Motivation: Existing anxiety detection models lack interpretability, keyword-robustness validation, and rigorous user-level data integrity, which limits their use in large-scale screening. Method: Trains a logistic regression classifier on data from carefully curated subreddits and performs feature ablation, keyword masking, density difference analysis, and external validation on clinical interview data. Result: The model keeps high accuracy after sentiment removal or keyword masking, early detection significantly outperforms random classification, and cross-domain analysis is highly consistent with clinical data. Conclusion: A framework based on interpretable linguistic features can support reliable, generalizable, and keyword-robust anxiety detection, providing a reproducible baseline for mental health screening. Abstract: Anxiety affects hundreds of millions of individuals globally, yet large-scale screening remains limited. Social media language provides an opportunity for scalable detection, but current models often lack interpretability, keyword-robustness validation, and rigorous user-level data integrity. This work presents a transparent approach to social media-based anxiety detection through linguistically interpretable feature-grounded modeling and cross-domain validation. Using a substantial dataset of Reddit posts, we trained a logistic regression classifier on carefully curated subreddits for training, validation, and test splits. Comprehensive evaluation included feature ablation, keyword masking experiments, and varying-density difference analyses comparing anxious and control groups, along with external validation using clinically interviewed participants with diagnosed anxiety disorders. The model achieved strong performance while maintaining high accuracy even after sentiment removal or keyword masking. Early detection using minimal post history significantly outperformed random classification, and cross-domain analysis demonstrated strong consistency with clinical interview data. Results indicate that transparent linguistic features can support reliable, generalizable, and keyword-robust anxiety detection. The proposed framework provides a reproducible baseline for interpretable mental health screening across diverse online contexts.

[16] Industry-Aligned Granular Topic Modeling

Sae Young Moon,Myeongjun Erik Jang,Haoyan Luo,Chunyang Xiao,Antonios Georgiadis,Fran Silavong

Main category: cs.CL

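TL;DR: This paper proposes TIDE, an LLM-based granular topic modeling framework that produces finer-grained topics for business settings and supports document summarization, topic parenting, and distillation. It outperforms existing methods on a variety of datasets.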

Details Motivation: The ability of existing topic modeling methods to produce granular topics is underexplored, even though granularity matters for business applications, so a topic modeling framework that better supports business needs is required. Method: Proposes the TIDE framework, which uses LLMs for granular topic modeling and integrates auxiliary functions such as document summarization, topic parenting, and topic distillation. Result: Experiments on multiple public and real-world business datasets show that TIDE outperforms modern topic modeling methods and that its auxiliary functions effectively support industrial business scenarios. Conclusion: TIDE is an efficient, practical granular topic modeling framework with good prospects for business use, and is in the process of being open-sourced. Abstract: Topic modeling has extensive applications in text mining and data analysis across various industrial sectors. Although the concept of granularity holds significant value for business applications by providing deeper insights, the capability of topic modeling methods to produce granular topics has not been thoroughly explored. In this context, this paper introduces a framework called TIDE, which primarily provides a novel granular topic modeling method based on large language models (LLMs) as a core feature, along with other useful functionalities for business applications, such as summarizing long documents, topic parenting, and distillation. Through extensive experiments on a variety of public and real-world business datasets, we demonstrate that TIDE's topic modeling approach outperforms modern topic modeling methods, and our auxiliary components provide valuable support for dealing with industrial business scenarios. The TIDE framework is currently undergoing the process of being open sourced.

[17] Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models

Kaituo Zhang,Zhimeng Jiang,Na Zou

Main category: cs.CL

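TL;DR: This paper proposes a fully self-reflective detoxification framework that uses an LLM's own ability to detect and correct itself, identifying and cleaning harmful content without external modules or human-annotated data.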

Details Motivation: Existing detoxification techniques depend on external modules, human annotation, or human intervention, which limits scalability and consistency, and they fail to exploit the self-regulatory mechanisms that emerge in LLMs. Method: Designs an internal Toxic Signal Detector and a systematic intervention process, iteratively generates a contrastive detoxification dataset, and uses it to fine-tune the model to improve safe and coherent text generation. Result: Outperforms state-of-the-art methods on benchmarks such as DetoxLLM and ParaDetox, removing toxicity while preserving semantic fidelity. Conclusion: Reveals the intrinsic self-detoxification ability of LLMs and offers a viable path toward self-regulated, more responsible language models. Abstract: Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention -- factors that hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect and correct toxic content and to refine LLMs without external modules and data annotation. Specifically, we propose a Toxic Signal Detector -- an internal self-identification mechanism, coupled with a systematic intervention process to transform toxic text into its non-toxic counterpart. This iterative procedure yields a contrastive detoxification dataset used to fine-tune the model, enhancing its ability for safe and coherent text generation. Experiments on benchmark datasets such as DetoxLLM and ParaDetox show that our method achieves better detoxification performance than state-of-the-art methods while preserving semantic fidelity. By obviating the need for human intervention or external components, this paper reveals the intrinsic self-detoxification ability of LLMs, offering a consistent and effective approach for mitigating harmful content generation. Ultimately, our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.

[18] Translation as a Scalable Proxy for Multilingual Evaluation

Sheriff Issaka,Erick Rosas Gonzalez,Lieqi Liu,Evans Kofi Agyei,Lucas Bandarkar,Nanyun Peng,David Ifeoluwa Adelani,Francisco Guzmán,Saadia Gabriel

Main category: cs.CL

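TL;DR: Translation quality can serve as an effective first-pass indicator of an LLM's multilingual ability. The study finds that translation performance correlates strongly with downstream task performance, offering a low-cost, scalable alternative to multilingual benchmarking.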

Details Motivation: LLMs claim multilingual proficiency, but comprehensive non-machine-translated benchmarks cover fewer than 30 languages, leaving the vast majority of languages without effective evaluation; a scalable, low-cost way to assess multilingual ability is needed. Method: Systematically evaluates 14 models of varying size (1B-72B parameters) on 9 diverse benchmarks and 7 translation metrics, and analyzes the correlation between translation quality and downstream task performance. Result: Translation performance correlates strongly and positively with downstream task performance (e.g., for Phi-4, the median Pearson r is 0.89 for MetricX, 0.91 for xCOMET, and 0.87 for SSA-COMET), indicating that translation quality is a good predictor of multilingual understanding. Conclusion: Translation quality is a strong, inexpensive proxy for multilingual ability; it can be used for low-cost initial screening, followed by targeted evaluation on specific tasks, which eases the scalability problem in building multilingual benchmarks. Abstract: The rapid proliferation of LLMs has created a critical evaluation paradox: while LLMs claim multilingual proficiency, comprehensive non-machine-translated benchmarks exist for fewer than 30 languages, leaving >98% of the world's 7,000 languages in an empirical void. Traditional benchmark construction faces scaling challenges such as cost, scarcity of domain experts, and data contamination. We evaluate the validity of a simpler alternative: can translation quality alone indicate a model's broader multilingual capabilities? Through systematic evaluation of 14 models (1B-72B parameters) across 9 diverse benchmarks and 7 translation metrics, we find that translation performance is a good indicator of downstream task success (e.g., Phi-4, median Pearson r: MetricX = 0.89, xCOMET = 0.91, SSA-COMET = 0.87). These results suggest that the representational abilities supporting faithful translation overlap with those required for multilingual understanding. Translation quality thus emerges as a strong, inexpensive first-pass proxy of multilingual performance, enabling a translation-first screening with targeted follow-up for specific tasks.
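The reported statistic is a Pearson correlation between translation-metric scores and benchmark scores. The following is a minimal sketch of how such a coefficient could be computed for one model; the function name `pearson_r` and the per-language numbers are hypothetical and purely illustrative, not data from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Covariance and variances are left unnormalised: the 1/n factors cancel.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Hypothetical per-language scores for a single model (made-up numbers).
translation_quality = [0.62, 0.71, 0.55, 0.80, 0.47]
benchmark_accuracy = [0.58, 0.69, 0.50, 0.77, 0.41]
r = pearson_r(translation_quality, benchmark_accuracy)
```

The abstract reports a median over such coefficients; one plausible reading, stated here as an assumption, is that one coefficient is computed per benchmark and the median is then taken, for example with `statistics.median`.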

[19] Beyond Tokens: Concept-Level Training Objectives for LLMs

Laya Iyer,Pranav Somani,Alice Guo,Dan Jurafsky,Chen Shani

Main category: cs.CL

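TL;DR: Proposes moving from conventional token-level prediction to concept-level prediction, normalizing different surface forms of the same concept (e.g., "mom" and "mother") to improve the semantic consistency, robustness, and generalization of language models.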

Details Motivation: The conventional next-token prediction (NTP) objective operates at the token level and treats plausible continuations that are semantically equivalent but differ in form as errors, biasing models toward surface form rather than underlying meaning; a learning objective closer to human abstraction is needed. Method: Introduces concept-level prediction, which groups multiple surface forms that express the same concept, and designs several methods for integrating concept-level supervision into LLM training. Result: Models trained with concept-level supervision outperform conventional NTP models in perplexity, robustness under domain shift, and performance on diverse NLP benchmarks. Conclusion: Concept-level supervision is an improved training signal that better aligns LLMs with human semantic abstractions, pushing models to capture meaning rather than mechanically match surface text. Abstract: The next-token prediction (NTP) objective has been foundational in the development of modern large language models (LLMs), driving advances in fluency and generalization. However, NTP operates at the token level, treating deviations from a single reference continuation as errors even when alternative continuations are equally plausible or semantically equivalent (e.g., "mom" vs. "mother"). As a result, token-level loss can penalize valid abstractions, paraphrases, or conceptually correct reasoning paths, biasing models toward surface form rather than underlying meaning. This mismatch between the training signal and semantic correctness motivates learning objectives that operate over higher-level representations. We propose a shift from token-level to concept-level prediction, where concepts group multiple surface forms of the same idea (e.g., "mom," "mommy," "mother" → MOTHER). We introduce various methods for integrating conceptual supervision into LLM training and show that concept-aware models achieve lower perplexity, improved robustness under domain shift, and stronger performance than NTP-based models on diverse NLP benchmarks. This suggests concept-level supervision as an improved training signal that better aligns LLMs with human semantic abstractions.
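To make the contrast between token-level and concept-level objectives concrete, here is a minimal sketch of one possible formulation: the concept-level loss sums the probability mass of every surface form in a concept before taking the negative log. This is an illustrative assumption, not necessarily any of the methods the paper introduces; the toy vocabulary, the logits, and the names `token_nll` and `concept_nll` are hypothetical, and each surface form is assumed to be a single token.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_nll(probs, target_id):
    """Standard next-token loss: only the single reference token counts."""
    return -math.log(probs[target_id])


def concept_nll(probs, concept_token_ids):
    """Concept-level loss: any surface form of the concept counts as correct."""
    return -math.log(sum(probs[i] for i in concept_token_ids))


# Toy vocabulary and logits (hypothetical values).
vocab = ["mom", "mother", "mommy", "dog"]
logits = [1.0, 2.0, 0.5, 0.0]
probs = softmax(logits)

MOTHER = [0, 1, 2]  # ids of the surface forms grouped under one concept

loss_token = token_nll(probs, vocab.index("mom"))
loss_concept = concept_nll(probs, MOTHER)
```

Under this formulation a model that prefers "mother" when the reference says "mom" incurs a large token-level loss but only a small concept-level loss, which is the mismatch the abstract describes.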

[20] TWeddit: A Dataset of Triggering Stories Predominantly Shared by Women on Reddit

Shirlene Rose Bandela,Sanjeev Parthasarathy,Vaibhav Garg

Main category: cs.CL

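TL;DR: Presents TWeddit, a Reddit dataset annotated for triggering experiences related to issues commonly faced by women, intended to help researchers better understand and handle sensitive content on social media.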

Details Motivation: Many users post potentially distressing content without trigger warnings, so other readers may be exposed to upsetting material unintentionally; a systematic dataset is needed to identify and label such content. Method: Builds TWeddit, an annotated dataset specific to Reddit, and conducts linguistic analyses to reveal the topics and moral foundations of the narratives. Result: The TWeddit dataset shows marked differences in topics and moral foundations across types of triggering stories, demonstrating its potential for research on emotional expression, support seeking, and content safety. Conclusion: TWeddit is a valuable resource for identifying and managing triggering content on social media and supports future research in mental health, natural language processing, and content moderation. Abstract: Warning: This paper may contain examples and topics that may be disturbing to some readers, especially survivors of miscarriage and sexual violence. People affected by abortion, miscarriage, or sexual violence often share their experiences on social media to express emotions and seek support. On public platforms like Reddit, where users can post long, detailed narratives (up to 40,000 characters), readers may be exposed to distressing content. Although Reddit allows manual trigger warnings, many users omit them due to limited awareness or uncertainty about which categories apply. There is a scarcity of datasets on Reddit stories labeled for triggering experiences. We propose a curated Reddit dataset, TWeddit, covering triggering experiences related to issues faced mainly by women. Our linguistic analyses show that annotated stories in TWeddit express distinct topics and moral foundations, making the dataset useful for a wide range of future research.

[21] The Third VoicePrivacy Challenge: Preserving Emotional Expressiveness and Linguistic Content in Voice Anonymization

Natalia Tomashenko,Xiaoxiao Miao,Pierre Champion,Sarina Meyer,Michele Panariello,Xin Wang,Nicholas Evans,Emmanuel Vincent,Junichi Yamagishi,Massimiliano Todisco

Main category: cs.CL

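TL;DR: The 2024 VoicePrivacy Challenge advanced voice anonymization technology; the task was to conceal speaker identity while preserving linguistic content and emotional state. The paper describes the challenge framework, datasets, evaluation metrics, baseline systems, and participants' approaches, and proposes directions for future research.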

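Details Motivation: As speech data becomes widely used, protecting speaker privacy has become an important problem, and existing voice anonymization techniques struggle to balance privacy protection against speech utility, so the technology needs further advancement. Method: Designs a systematic challenge framework comprising the definition of the voice anonymization task, the datasets used for development and evaluation, the attack model, and objective evaluation metrics; provides six baseline anonymization systems and summarizes the approaches proposed by participating teams. Result: The challenge attracted multiple teams, who proposed a range of approaches that strike a better balance between privacy (concealing speaker identity) and utility (preserving linguistic content and emotion); the evaluation shows that some methods clearly outperform the baseline systems. Conclusion: The challenge advanced voice anonymization technology, exposed the strengths and limitations of current methods, and provides key insights and directions for the design of future challenges and for voice privacy research. Abstract: We present results and analyses from the third VoicePrivacy Challenge held in 2024, which focuses on advancing voice anonymization technologies. The task was to develop a voice anonymization system for speech data that conceals a speaker's voice identity while preserving linguistic content and emotional state. We provide a systematic overview of the challenge framework, including detailed descriptions of the anonymization task and datasets used for both system development and evaluation. We outline the attack model and objective evaluation metrics for assessing privacy protection (concealing speaker voice identity) and utility (content and emotional state preservation). We describe six baseline anonymization systems and summarize the innovative approaches developed by challenge participants. Finally, we provide key insights and observations to guide the design of future VoicePrivacy challenges and identify promising directions for voice anonymization research.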

[22] ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue System

Yifei Zhang,Hooshang Nayyeri,Rinat Khaziev,Emine Yilmaz,Gokhan Tur,Dilek Hakkani-Tür,Hari Thadakamalla

Main category: cs.CL

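TL;DR: Presents ATOD, a benchmark and synthetic dialogue generation pipeline for evaluating agentic behavior in task-oriented dialogue systems, together with ATOD-Eval, an evaluation framework that holistically assesses task completion, agentic capability, and response quality.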

Details Motivation: Existing benchmarks for task-oriented dialogue (TOD) systems lack systematic support for evaluating complex agentic behaviors such as long-horizon reasoning, multi-goal coordination, and proactivity, so a new benchmark is needed to fill the gap. Method: Proposes the ATOD benchmark and dialogue generation pipeline, which produce richly annotated conversations covering multi-goal coordination, dependency management, memory, adaptability, and proactivity; designs the ATOD-Eval framework, which combines fine-grained metrics with a strong memory-based evaluator for offline and online evaluation. Result: Experiments show that ATOD-Eval supports comprehensive assessment of task completion, agentic capability, and response quality, and that the proposed memory-based evaluator offers a better accuracy-efficiency tradeoff than existing approaches. Conclusion: ATOD and ATOD-Eval provide an effective, reproducible way to evaluate agentic behavior in advanced task-oriented dialogue systems, moving the field toward more complex and autonomous conversational agents. Abstract: Recent advances in task-oriented dialogue (TOD) systems, driven by large language models (LLMs) with extensive API and tool integration, have enabled conversational agents to coordinate interleaved goals, maintain long-horizon context, and act proactively through asynchronous execution. These capabilities extend beyond traditional TOD systems, yet existing benchmarks lack systematic support for evaluating such agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-term reasoning. ATOD captures key characteristics of advanced TOD, including multi-goal coordination, dependency management, memory, adaptability, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that translates these dimensions into fine-grained metrics and supports reproducible offline and online evaluation. We further present a strong agentic memory-based evaluator for benchmarking on ATOD. Experiments show that ATOD-Eval enables comprehensive assessment across task completion, agentic capability, and response quality, and that the proposed evaluator offers a better accuracy-efficiency tradeoff compared to existing memory- and LLM-based approaches under this evaluation setting.

[23] CTPD: Cross Tokenizer Preference Distillation

Truong Nguyen,Phi Van Dat,Ngan Nguyen,Linh Ngo Van,Trung Le,Thanh Hong Nguyen

Main category: cs.CL

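TL;DR: Proposes Cross-Tokenizer Preference Distillation (CTPD), a framework for transferring human-aligned behavior between models with different tokenization schemes, addressing the tokenizer incompatibility that limits conventional knowledge distillation for preference alignment.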

Details Motivation: Knowledge distillation is widely used in pre-training and instruction tuning but remains underused for preference alignment, especially in the realistic cross-tokenizer setting, where incompatible tokenization between teacher and student makes fine-grained distillation of preference information difficult. Method: Proposes the CTPD framework with three innovations: Aligned Span Projection (mapping teacher and student tokens to shared character-level spans), a cross-tokenizer adaptation of Token-level Importance Sampling (TIS-DPO) for better credit assignment, and a Teacher-Anchored Reference that uses the teacher's preferences directly in the DPO objective. Result: Theoretical analysis grounds CTPD in importance sampling, and experiments on multiple benchmarks confirm its effectiveness, with significant gains over existing methods. Conclusion: CTPD is a practical and general solution for cross-tokenizer preference distillation, opening a path to more efficient and accessible language model alignment. Abstract: While knowledge distillation has seen widespread use in pre-training and instruction tuning, its application to aligning language models with human preferences remains underexplored, particularly in the more realistic cross-tokenizer setting. The incompatibility of tokenization schemes between teacher and student models has largely prevented fine-grained, white-box distillation of preference information. To address this gap, we propose Cross-Tokenizer Preference Distillation (CTPD), the first unified framework for transferring human-aligned behavior between models with heterogeneous tokenizers. CTPD introduces three key innovations: (1) Aligned Span Projection, which maps teacher and student tokens to shared character-level spans for precise supervision transfer; (2) a cross-tokenizer adaptation of Token-level Importance Sampling (TIS-DPO) for improved credit assignment; and (3) a Teacher-Anchored Reference, allowing the student to directly leverage the teacher's preferences in a DPO-style objective. Our theoretical analysis grounds CTPD in importance sampling, and experiments across multiple benchmarks confirm its effectiveness, with significant performance gains over existing methods. These results establish CTPD as a practical and general solution for preference distillation across diverse tokenization schemes, opening the door to more accessible and efficient alignment of language models.
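The idea of mapping two tokenizations onto shared character-level spans can be illustrated with a small sketch. This is one possible reading of the general idea under stated assumptions, not the paper's Aligned Span Projection algorithm: both token lists are assumed to concatenate to exactly the same text, shared spans are taken to be the smallest spans whose boundaries occur in both tokenizations, and per-token values (for example log-probabilities) are simply summed within a span. The names `char_spans`, `shared_spans`, and `project` and all example values are hypothetical.

```python
def char_spans(tokens):
    """(start, end) character offsets for tokens that concatenate to the text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def shared_spans(tokens_a, tokens_b):
    """Smallest character spans whose boundaries exist in both tokenizations."""
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("tokenizations must cover the same text")
    ends_a = {end for _, end in char_spans(tokens_a)}
    ends_b = {end for _, end in char_spans(tokens_b)}
    common = sorted(ends_a & ends_b)
    starts = [0] + common[:-1]
    return list(zip(starts, common))


def project(tokens, values, spans):
    """Sum per-token values over each shared span."""
    token_spans = char_spans(tokens)
    return [
        sum(v for (ts, te), v in zip(token_spans, values) if s <= ts and te <= e)
        for s, e in spans
    ]


# Two hypothetical tokenizations of the same string.
teacher_tokens = ["un", "believ", "able", " story"]
student_tokens = ["unbeliev", "able", " st", "ory"]
spans = shared_spans(teacher_tokens, student_tokens)

# Made-up per-token log-probabilities, aggregated onto the shared spans.
teacher_on_spans = project(teacher_tokens, [-0.1, -0.2, -0.3, -0.4], spans)
student_on_spans = project(student_tokens, [-0.5, -0.3, -0.2, -0.1], spans)
```

After projection both models are described over the same three spans, so span-level quantities can be compared even though the token sequences differ in length and boundaries.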

[24] Advances in LLM Reasoning Enable Flexibility in Clinical Problem-Solving

Kie Shidara,Preethi Prem,Jonathan Kim,Anna Podlasek,Feng Liu,Ahmed Alaa,Danilo Bernardo

Main category: cs.CL

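TL;DR: Evaluates several large reasoning models on the medicine abstraction and reasoning corpus (mARC) and finds that strong reasoning models show greater cognitive flexibility in clinical reasoning, avoid Einstellung-effect traps more often, and reach human-level performance.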

Details Motivation: To examine whether advanced reasoning LLMs have better cognitive flexibility in clinical reasoning, especially in scenarios prone to interference from fixed patterns of thinking. Method: Uses mARC, an adversarial medical QA benchmark built on the Einstellung effect, to evaluate reasoning models from the OpenAI, Grok, Gemini, Claude, and DeepSeek families on clinical reasoning tasks. Result: Strong reasoning models fall into Einstellung-effect traps less often than weaker ones and reach human-level performance on mARC; on the questions physicians most often miss, the top models answer 55% to 70% correctly with high confidence. Conclusion: Strong reasoning models show greater cognitive flexibility in medical reasoning and are less easily misled by ingrained heuristic patterns; on this benchmark their clinical reasoning is on par with humans. Abstract: Large Language Models (LLMs) have achieved high accuracy on medical question-answer (QA) benchmarks, yet their capacity for flexible clinical reasoning has been debated. Here, we asked whether advances in reasoning LLMs improve their cognitive flexibility in clinical reasoning. We assessed reasoning models from the OpenAI, Grok, Gemini, Claude, and DeepSeek families on the medicine abstraction and reasoning corpus (mARC), an adversarial medical QA benchmark which utilizes the Einstellung effect to induce inflexible overreliance on learned heuristic patterns in contexts where they become suboptimal. We found that strong reasoning models avoided Einstellung-based traps more often than weaker reasoning models, achieving human-level performance on mARC. On questions most commonly missed by physicians, the top 5 performing models answered 55% to 70% correctly with high confidence, indicating that these models may be less susceptible than humans to Einstellung effects. Our results indicate that strong reasoning models demonstrate improved flexibility in medical reasoning, achieving performance on par with humans on mARC.

[25] GloCTM: Cross-Lingual Topic Modeling via a Global Context Space

Nguyen Tien Phat,Ngo Vu Minh,Linh Van Ngo,Nguyen Thi Ngoc Diep,Thien Huu Nguyen

Main category: cs.CL

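TL;DR: Proposes GloCTM, a new framework that achieves cross-lingual topic alignment through a unified semantic space, using multilingual contextual representations and a global topic-word distribution; it significantly outperforms existing methods on topic coherence and cross-lingual alignment.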

Details Motivation: Existing cross-lingual topic models usually learn topics in separate language-specific spaces, rely on shallow alignment mechanisms, and ignore the deep semantic signals in multilingual pretrained representations, which leads to loosely aligned topics. Method: Proposes the GloCTM framework, which (1) expands the bag-of-words representation with cross-lingual lexical neighborhoods; (2) infers topic proportions with local and global encoders whose latent spaces are aligned through internal regularization; (3) defines a global topic-word distribution over the combined vocabulary to structurally synchronize topic meanings; and (4) adds a CKA loss that aligns the latent topic space with multilingual contextual embeddings. Result: Experiments on multiple benchmark datasets show that GloCTM significantly outperforms strong baselines on topic coherence and cross-lingual alignment. Conclusion: By building a unified semantic space that spans the whole model pipeline, GloCTM improves cross-lingual topic modeling and confirms the importance of deep semantic alignment and global context modeling. Abstract: Cross-lingual topic modeling seeks to uncover coherent and semantically aligned topics across languages - a task central to multilingual understanding. Yet most existing models learn topics in disjoint, language-specific spaces and rely on alignment mechanisms (e.g., bilingual dictionaries) that often fail to capture deep cross-lingual semantics, resulting in loosely connected topic spaces. Moreover, these approaches often overlook the rich semantic signals embedded in multilingual pretrained representations, further limiting their ability to capture fine-grained alignment. We introduce GloCTM (Global Context Space for Cross-Lingual Topic Model), a novel framework that enforces cross-lingual topic alignment through a unified semantic space spanning the entire model pipeline. GloCTM constructs enriched input representations by expanding bag-of-words with cross-lingual lexical neighborhoods, and infers topic proportions using both local and global encoders, with their latent representations aligned through internal regularization. At the output level, the global topic-word distribution, defined over the combined vocabulary, structurally synchronizes topic meanings across languages. To further ground topics in deep semantic space, GloCTM incorporates a Centered Kernel Alignment (CKA) loss that aligns the latent topic space with multilingual contextual embeddings. Experiments across multiple benchmarks demonstrate that GloCTM significantly improves topic coherence and cross-lingual alignment, outperforming strong baselines.
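For readers unfamiliar with Centered Kernel Alignment, the sketch below computes the linear variant between two representation matrices whose rows are the same documents. The abstract does not say which kernel GloCTM uses or how the loss is formed, so the linear kernel and the `1 - CKA` loss shape mentioned afterwards are assumptions; the function names and the toy matrices are hypothetical.

```python
from math import sqrt


def center_columns(m):
    """Subtract each column's mean from a row-major matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same row count."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    xc, yc = center_columns(x), center_columns(y)
    # Raises ZeroDivisionError if either input has only constant columns.
    return cross_frobenius_sq(xc, yc) / sqrt(
        cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc)
    )


# Toy "topic space" and a rotated, rescaled copy standing in for embeddings.
topics = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rotated = [[2.0 * b, -2.0 * a] for a, b in topics]
similarity = linear_cka(topics, rotated)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so the rotated copy scores approximately 1; a loss of the form `1 - linear_cka(...)` would then reward agreement between the two spaces.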

[26] Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence

Kaijie Mo,Siddhartha Venkatayogi,Chantal Shaib,Ramez Kouzy,Wei Xu,Byron C. Wallace,Junyi Jessy Li

Main category: cs.CL

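TL;DR: Studies how LLMs behave and reason when given counterfactual or adversarial medical evidence. The authors build MedCounterFact, a counterfactual medical QA dataset, and find that current models accept dangerous or implausible evidence at face value and answer confidently, showing that no effective boundary yet exists between faithfulness and safety.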

Details Motivation: In high-stakes domains such as medicine, models should adhere faithfully to the provided context, but it is unclear how they behave when that context conflicts with their priors or safety protocols; studying model behavior under counterfactual or adversarial medical evidence is therefore important. Method: Builds MedCounterFact, a counterfactual medical QA dataset with clinical comparison questions and randomized controlled trials as evidence, systematically replaces real medical interventions with four types of counterfactual stimuli (ranging from unknown words to toxic substances), and evaluates multiple frontier LLMs. Result: When given counterfactual evidence, existing LLMs overwhelmingly accept it at face value, even when it is dangerous or implausible, and give confident, uncaveated answers. Conclusion: Current LLMs do not account for safety when handling counterfactual medical information and have not yet established an effective boundary between faithfulness to context and safety; improvement is urgently needed. Abstract: In high-stakes domains like medicine, it may be generally desirable for models to faithfully adhere to the context provided. But what happens if the context does not align with model priors or safety protocols? In this paper, we investigate how LLMs behave and reason when presented with counterfactual or even adversarial medical evidence. We first construct MedCounterFact, a counterfactual medical QA dataset that requires the models to answer clinical comparison questions (i.e., judge the efficacy of certain treatments, with evidence consisting of randomized controlled trials provided as context). In MedCounterFact, real-world medical interventions within the questions and evidence are systematically replaced with four types of counterfactual stimuli, ranging from unknown words to toxic substances. Our evaluation across multiple frontier LLMs on MedCounterFact reveals that in the presence of counterfactual evidence, existing models overwhelmingly accept such "evidence" at face value even when it is dangerous or implausible, and provide confident and uncaveated answers. While it may be prudent to draw a boundary between faithfulness and safety, our findings reveal that there exists no such boundary yet.

[27] PPA-Plan: Proactive Pitfall Avoidance for Reliable Planning in Long-Context LLM Reasoning

Byeongjin Kim,Gyuwan Kim,Seo Yeon Park

Main category: cs.CL

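TL;DR: Proposes PPA-Plan, a proactive planning strategy that identifies potential logical pitfalls and incorporates them as negative constraints during plan generation, improving LLM performance on long-context reasoning.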

Details Motivation: Existing plan-and-execute methods generate unreliable plans because they depend on surface-level cues; plans can rest on false assumptions that are hard to correct afterwards, which limits long-context reasoning. Method: PPA-Plan identifies potential logical pitfalls and false assumptions in advance, formulates them as negative constraints, and conditions plan generation on explicitly avoiding them, yielding more reliable plans. Result: Experiments on multiple long-context QA benchmarks show that executing plans generated by PPA-Plan consistently outperforms existing plan-and-execute methods and direct prompting. Conclusion: By preventing failures proactively rather than repairing them reactively, PPA-Plan improves the reliability and effectiveness of plan generation in long-context reasoning. Abstract: Large language models (LLMs) struggle with reasoning over long contexts where relevant information is sparsely distributed. Although plan-and-execute frameworks mitigate this by decomposing tasks into planning and execution, their effectiveness is often limited by unreliable plan generation due to dependence on surface-level cues. Consequently, plans may be based on incorrect assumptions, and once a plan is formed, identifying what went wrong and revising it reliably becomes difficult, limiting the effectiveness of reactive refinement. To address this limitation, we propose PPA-Plan, a proactive planning strategy for long-context reasoning that focuses on preventing such failures before plan generation. PPA-Plan identifies potential logical pitfalls and false assumptions, formulates them as negative constraints, and conditions plan generation on explicitly avoiding these constraints. Experiments on long-context QA benchmarks show that executing plans generated by PPA-Plan consistently outperforms existing plan-and-execute methods and direct prompting.

[28] LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding

Yichen Jiang,Peng Ye,Jiakang Yuan,Chongjun Tu,Lei Bai,Tao Chen

Main category: cs.CL

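TL;DR: Proposes LSTM-MAS, an LSTM-inspired multi-agent system for long-context processing in LLMs that emulates LSTM gating to pass information selectively and model long-range dependencies; it significantly outperforms existing methods on several QA tasks.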

Details Motivation: Existing single-LLM methods for extending context length incur high computational cost or remain length-limited, while multi-agent frameworks, though promising, are still prone to error accumulation and hallucination propagation; a more effective architecture for long-context understanding is needed. Method: Designs LSTM-MAS, a multi-agent system structured by analogy with LSTM, with worker, filter, judge, and manager agents responsible for segment-level comprehension, redundancy reduction, error detection, and global regulation of information, arranged in a chain to achieve controlled information flow and selective memory retention. Result: On NarrativeQA, Qasper, HotpotQA, and MuSiQue, it improves on the previous best multi-agent method, CoA, by 40.93%, 43.70%, 121.57%, and 33.12%, respectively. Conclusion: By borrowing LSTM's hierarchical information flow and gating mechanisms, LSTM-MAS mitigates error accumulation and hallucination propagation in long-context processing and understands long texts more accurately, offering a new direction for efficient long-context language systems. Abstract: Effectively processing long contexts remains a fundamental yet unsolved challenge for large language models (LLMs). Existing single-LLM-based methods primarily reduce the context window or optimize the attention mechanism, but they often encounter additional computational costs or constrained expanded context length. While multi-agent-based frameworks can mitigate these limitations, they remain susceptible to the accumulation of errors and the propagation of hallucinations. In this work, we draw inspiration from the Long Short-Term Memory (LSTM) architecture to design a Multi-Agent System called LSTM-MAS, emulating LSTM's hierarchical information flow and gated memory mechanisms for long-context understanding. Specifically, LSTM-MAS organizes agents in a chained architecture, where each node comprises a worker agent for segment-level comprehension, a filter agent for redundancy reduction, a judge agent for continuous error detection, and a manager agent that globally regulates information propagation and retention, analogous to LSTM and its input gate, forget gate, constant error carousel unit, and output gate. These novel designs enable controlled information transfer and selective long-term dependency modeling across textual segments, which can effectively avoid error accumulation and hallucination propagation. We conducted an extensive evaluation of our method. Compared with the previous best multi-agent approach, CoA, our model achieves improvements of 40.93%, 43.70%, 121.57%, and 33.12% on NarrativeQA, Qasper, HotpotQA, and MuSiQue, respectively.

[29] Enhancing LLM-Based Data Annotation with Error Decomposition

Zhen Xu,Vedant Khatri,Yijun Dai,Xiner Liu,Siyan Li,Xuanming Zhang,Renzhe Yu

Main category: cs.CL

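TL;DR: Proposes a human-in-the-loop diagnostic evaluation paradigm that decomposes and analyzes the sources and types of LLM errors in subjective annotation tasks, improving understanding of LLM annotation quality and its downstream impact.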

Details Motivation: Existing evaluation practices collapse all annotation errors into a single alignment metric, which cannot separate ambiguity inherent in the task from model error and so fails to reflect how LLMs actually perform on, and affect, subjective annotation tasks (e.g., in psychology and education). Method: Builds a three-part diagnostic framework: (1) a taxonomy along two dimensions, error source (model-specific vs. task-inherent) and error type (boundary ambiguity vs. conceptual misidentification); (2) a lightweight human annotation test to estimate task-inherent ambiguity; and (3) a computational method to decompose LLM annotation errors. The framework is validated on four ordinal annotation tasks in education. Result: Confirms the conceptual validity and practical utility of the diagnostic paradigm; finds that high alignment is unrealistic for some tasks and that a single metric does not adequately reflect annotation quality; identifies which errors stem from task ambiguity and which from model misunderstanding. Conclusion: The paradigm can serve as a low-cost diagnostic tool for judging whether a given task is suitable for LLM annotation and provides actionable guidance for technical optimization. Abstract: Large language models offer a scalable alternative to human coding for data annotation tasks, enabling the scale-up of research across data-intensive domains. While LLMs are already achieving near-human accuracy on objective annotation tasks, their performance on subjective annotation tasks, such as those involving psychological constructs, is less consistent and more prone to errors. Standard evaluation practices typically collapse all annotation errors into a single alignment metric, but this simplified approach may obscure different kinds of errors that affect final analytical conclusions in different ways. Here, we propose a diagnostic evaluation paradigm that incorporates a human-in-the-loop step to separate task-inherent ambiguity from model-driven inaccuracies and assess annotation quality in terms of their potential downstream impacts. We refine this paradigm on ordinal annotation tasks, which are common in subjective annotation. The refined paradigm includes: (1) a diagnostic taxonomy that categorizes LLM annotation errors along two dimensions: source (model-specific vs. task-inherent) and type (boundary ambiguity vs. conceptual misidentification); (2) a lightweight human annotation test to estimate task-inherent ambiguity from LLM annotations; and (3) a computational method to decompose observed LLM annotation errors following our taxonomy. We validate this paradigm on four educational annotation tasks, demonstrating both its conceptual validity and practical utility.
Theoretically, our work provides empirical evidence for why excessively high alignment is unrealistic in specific annotation tasks and why single alignment metrics inadequately reflect the quality of LLM annotations. In practice, our paradigm can be a low-cost diagnostic tool that assesses the suitability of a given task for LLM annotation and provides actionable insights for further technical optimization.

[30] Mapping the maturation of TCM as an adjuvant to radiotherapy

P. Bilha Githinji,Aikaterini Melliou,Xi Yuan,Dayan Zhang,Lian Zhang,Zhenglin Chen,Jiansong Ji,Chengying Lv,Jinhao Xu,Peiwu Qin,Dongmei Yu

Main category: cs.CL

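TL;DR: A large-scale analysis of 69,745 oncology publications from 2000 to 2025 on Traditional Chinese Medicine as an adjuvant to radiotherapy shows that the field has evolved in cycles and identifies five major thematic axes, indicating increasing specialization and systematization, along with a possible reporting bias.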

Details Motivation: To assess the research trajectory and trends of TCM as an adjuvant to radiotherapy in oncology, in order to understand how its evidence has accumulated and where bias may lie. Method: Applies large-scale bibliometric analysis and theme modeling to 69,745 publications from 2000 to 2025 to identify cyclical changes in publication output, international collaboration, and funding, as well as the field's thematic structure. Result: The field develops in a cyclical define-ideate-test pattern; five thematic axes are identified: cancer types, supportive care, clinical endpoints, mechanisms, and methodology; cross-theme integration is patient-centered and systems-oriented; and findings are reported positively in a consistent way across publication types, themes, and cycles. Conclusion: Research on TCM as an adjuvant to radiotherapy has matured and may be at a turning point toward a new phase, but a systematic bias toward positive results warrants caution. Abstract: The integration of complementary medicine into oncology represents a paradigm shift that has seen increasing adoption of Traditional Chinese Medicine (TCM) as an adjuvant to radiotherapy. About twenty-five years since the formal institutionalization of integrated oncology, it is opportune to synthesize the trajectory of evidence for TCM as an adjuvant to radiotherapy. Here we conduct a large-scale analysis of 69,745 publications (2000 - 2025), revealing a cyclical evolution defined by coordinated expansion and contraction in publication output, international collaboration, and funding commitments that mirrors a define-ideate-test pattern. Using a theme modeling workflow designed to determine a stable thematic structure of the field, we identify five dominant thematic axes - cancer types, supportive care, clinical endpoints, mechanisms, and methodology - that signal a focus on patient well-being, scientific rigor and mechanistic exploration. Cross-theme integration of TCM is patient-centered and systems-oriented. Together with the emergent cycles of evolution, the thematic structure demonstrates progressive specialization and potential defragmentation of the field or saturation of existing research agenda. The analysis points to a field that has matured its current research agenda and is likely at the cusp of something new. Additionally, the field exhibits positive reporting of findings that is homogeneous across publication types, thematic areas, and the cycles of evolution, suggesting a system-wide positive reporting bias agnostic to structural drivers.

[31] Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes

Abdullah Al Monsur,Nitesh Vamshi Bommisetty,Gene Louis Kim

Main category: cs.CL

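TL;DR: Targets two main problems in event detection, the unidirectional limitation of decoder-only LLMs and over-reliance on Micro-F1, by adding sentence context and LoRA fine-tuning and by evaluating with Macro-F1, which substantially improves detection of long-tailed event types.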

Details Motivation: To address inaccurate assessment of performance on long-tailed event types in existing event detection research, caused by model architecture (unidirectional decoders) and an ill-suited evaluation metric (Micro-F1). Method: Adds sentence-context information to the input, fine-tunes decoder-only LLMs with Low-Rank Adaptation (LoRA), and uses Macro-F1 as the primary evaluation metric. Result: Models using sentence context and LoRA fine-tuning clearly outperform conventional baselines in Macro-F1, especially on long-tailed event types. Conclusion: LoRA combined with context information effectively improves decoder-only models on event detection, particularly for rare event types, and Macro-F1 is a fairer and more representative metric. Abstract: The current state of event detection research has two notable re-occurring limitations that we investigate in this study. First, the unidirectional nature of decoder-only LLMs presents a fundamental architectural bottleneck for natural language understanding tasks that depend on rich, bidirectional context. Second, we confront the conventional reliance on Micro-F1 scores in event detection literature, which systematically inflates performance by favoring majority classes. Instead, we focus on Macro-F1 as a more representative measure of a model's ability across the long-tail of event types. Our experiments demonstrate that models enhanced with sentence context achieve superior performance over canonical decoder-only baselines. Using Low-Rank Adaptation (LoRA) during finetuning provides a substantial boost in Macro-F1 scores in particular, especially for the decoder-only models, showing that LoRA can be an effective tool to enhance LLMs' performance on long-tailed event classes.
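The claim that Micro-F1 favors majority classes can be checked on a toy case. The sketch below treats event detection as plain single-label classification, which is a simplifying assumption (real setups typically also handle a "no event" class specially); the function name `f1_scores` and the label sequences are hypothetical.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Micro-F1 pools counts over all classes; Macro-F1 averages per-class F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# A classifier that always predicts the majority event type.
gold = ["Attack"] * 8 + ["Elect"] * 2
pred = ["Attack"] * 10
micro, macro = f1_scores(gold, pred)
```

In this toy case the pooled Micro-F1 is 0.8 while Macro-F1 is about 0.44, because the rare class contributes an F1 of zero to the unweighted average.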

[32] Double-Calibration: Towards Trustworthy LLMs via Calibrating Knowledge and Reasoning Confidence

Yuyin Lu,Ziran Liang,Yanghui Rao,Wenqi Fan,Fu Lee Wang,Qing Li

Main category: cs.CL

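TL;DR: DoublyCal is a framework built on a double-calibration principle that combines knowledge graphs with confidence calibration to improve both the factual accuracy and the confidence calibration of LLMs.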

Details Motivation: Existing knowledge-graph-augmented methods cannot quantify epistemic uncertainty in the retrieved evidence or in the reasoning process, leaving LLMs prone to hallucination. Method: Proposes the DoublyCal framework, in which a lightweight proxy model generates knowledge graph evidence together with a calibrated evidence confidence; this then guides a black-box LLM's reasoning so that final predictions are both accurate and well calibrated. Result: Experiments on multiple knowledge-intensive benchmarks show that DoublyCal significantly improves the accuracy and confidence calibration of black-box LLMs at low token cost. Conclusion: DoublyCal makes LLM reasoning more trustworthy: its double-calibration mechanism ties prediction confidence to evidence uncertainty, improving interpretability and reliability. Abstract: Trustworthy reasoning in Large Language Models (LLMs) is challenged by their propensity for hallucination. While augmenting LLMs with Knowledge Graphs (KGs) improves factual accuracy, existing KG-augmented methods fail to quantify epistemic uncertainty in both the retrieved evidence and LLMs' reasoning. To bridge this gap, we introduce DoublyCal, a framework built on a novel double-calibration principle. DoublyCal employs a lightweight proxy model to first generate KG evidence alongside a calibrated evidence confidence. This calibrated supporting evidence then guides a black-box LLM, yielding final predictions that are not only more accurate but also well-calibrated, with confidence scores traceable to the uncertainty of the supporting evidence. Experiments on knowledge-intensive benchmarks show that DoublyCal significantly improves both the accuracy and confidence calibration of black-box LLMs with low token cost.

[33] PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning

Bingxuan Li,Jeonghwan Kim,Cheng Qian,Xiusi Chen,Eitan Anzenberg,Niran Kundapur,Heng Ji

Main category: cs.CL

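TL;DR: Introduces CalConflictBench, a benchmark for evaluating LLMs on calendar conflict resolution, and proposes PEARL, a framework that substantially improves performance through reinforcement learning and an external memory module.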

Details Motivation: Automating calendar conflict resolution matters for busy professionals, but existing language models perform poorly on this kind of long-horizon decision task and struggle to capture and adapt to user preferences. Method: Builds the CalConflictBench benchmark and introduces the PEARL framework, which combines reinforcement learning, an external memory module, and an optimized round-wise reward design so that a language agent can progressively infer and adapt to user preferences. Result: Experiments show that current LLM agents have high error rates (e.g., Qwen-3-30B-Think averages 35%), while PEARL achieves an error reduction rate of 0.76 and a 55% improvement in average error rate over the strongest baseline. Conclusion: PEARL substantially improves language agents on long-horizon calendar conflict resolution, showing that combining external memory with reinforcement learning is an effective route to adaptive time management. Abstract: Overlapping calendar invitations force busy professionals to repeatedly decide which meetings to attend, reschedule, or decline. We refer to this preference-driven decision process as calendar conflict resolution. Automating such a process is crucial yet challenging. Scheduling logistics drain hours, and human delegation often fails at scale, which motivates us to ask: Can we trust a large language model (LLM) or language agent to manage time? To enable systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution. Conflicts are presented sequentially and agents receive feedback after each round, requiring them to infer and adapt to user preferences progressively. Our experiments show that current LLM agents perform poorly with high error rates, e.g., Qwen-3-30B-Think has a 35% average error rate. To address this gap, we propose PEARL, a reinforcement-learning framework that augments a language agent with an external memory module and an optimized round-wise reward design, enabling the agent to progressively infer and adapt to user preferences on-the-fly. Experiments on CalConflictBench show that PEARL achieves a 0.76 error reduction rate and a 55% improvement in average error rate compared to the strongest baseline.

[34] MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

Zecheng Tang,Baibei Ji,Ruoxi Sun,Haitian Wang,WangJie You,Zhang Yijun,Wenpeng Zhu,Ji Qi,Juntao Li,Min Zhang

Main category: cs.CL

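TL;DR: Introduces MemoryRewardBench, the first benchmark to systematically study how well reward models (RMs) evaluate long-term memory management in LLMs; it covers a range of long-context tasks and memory patterns and exposes the capabilities and limitations of current RMs.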

Details Motivation: Existing methods process long contexts with segment-wise memory mechanisms, and effective memory management is essential for propagating information across the sequence, so a reliable way to evaluate memory quality is needed. Method: Builds the MemoryRewardBench benchmark with 10 settings featuring different memory management patterns, covering long-context comprehension and long-form generation tasks from 8K to 128K tokens, and evaluates 13 cutting-edge reward models. Result: The performance gap between open-source and proprietary reward models is narrowing, and newer-generation models consistently outperform their predecessors regardless of parameter count; the study also exposes the evaluation capabilities and fundamental limitations of current RMs across diverse memory settings. Conclusion: MemoryRewardBench is an effective tool for evaluating memory management in LLMs and supports the development of reward models for long-term memory evaluation. Abstract: Existing works increasingly adopt memory-centric mechanisms to process long contexts in a segment manner, and effective memory management is one of the key capabilities that enables large language models to effectively propagate information across the entire sequence. Therefore, leveraging reward models (RMs) to automatically and reliably evaluate memory quality is critical. In this work, we introduce MemoryRewardBench, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. MemoryRewardBench covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns, with context length ranging from 8K to 128K tokens. Evaluations on 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the capabilities and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.

[35] Acting Flatterers via LLMs Sycophancy: Combating Clickbait with LLMs Opposing-Stance Reasoning

Chaowei Zhang,Xiansheng Luo,Zewei Zhang,Yi Zhu,Jipeng Qiang,Longwei Wang

Main category: cs.CL

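TL;DR: Proposes SORG, a new framework that exploits the sycophancy of large language models to generate opposing-stance reasoning, combined with contrastive learning to improve clickbait detection.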

Details Motivation: Large language models applied to clickbait detection are hampered by sycophancy, a tendency to cater to the user's views; this paper turns that tendency into an advantage by using it to generate multi-perspective reasoning that improves detection. Method: Designs the Self-renewal Opposing-stance Reasoning Generation (SORG) framework to produce agree and disagree reasoning pairs, and builds a BERT-based ORCD model trained with contrastive learning and soft labels derived from LLM-generated credibility scores. Result: On three benchmark datasets, the method outperforms existing prompting methods, fine-tuned smaller models, and state-of-the-art clickbait detection methods. Conclusion: Using LLM sycophancy to generate high-quality opposing reasoning effectively improves the robustness and accuracy of clickbait detection. Abstract: The widespread proliferation of online content has intensified concerns about clickbait, deceptive or exaggerated headlines designed to attract attention. While Large Language Models (LLMs) offer a promising avenue for addressing this issue, their effectiveness is often hindered by sycophancy, a tendency to produce reasoning that matches users' beliefs over truthful ones, which deviates from instruction-following principles. Rather than treating sycophancy as a flaw to be eliminated, this work proposes a novel approach that initially harnesses this behavior to generate contrastive reasoning from opposing perspectives. Specifically, we design a Self-renewal Opposing-stance Reasoning Generation (SORG) framework that prompts LLMs to produce high-quality agree and disagree reasoning pairs for a given news title without requiring ground-truth labels. To utilize the generated reasoning, we develop a local Opposing Reasoning-based Clickbait Detection (ORCD) model that integrates three BERT encoders to represent the title and its associated reasoning. The model leverages contrastive learning, guided by soft labels derived from LLM-generated credibility scores, to enhance detection robustness. Experimental evaluations on three benchmark datasets demonstrate that our method consistently outperforms LLM prompting, fine-tuned smaller language models, and state-of-the-art clickbait detection baselines.

[36] Preserving Fairness and Safety in Quantized LLMs Through Critical Weight Protection

Muhammad Alif Al Hakim,Alfan Farizki Wicaksono,Fajri Koto

Main category: cs.CL

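TL;DR: This paper systematically studies how static and dynamic quantization affect the fairness and safety of large language models and finds that quantization harms both, with safety degrading most in non-English settings. The authors propose Critical Weight Protection, which preserves fairness- and safety-critical weights to mitigate these problems, improving the trustworthiness of quantized models without sacrificing efficiency.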

Details Motivation: Quantization lowers the computational cost of large language models, but its potential impact on fairness and safety, especially under dynamic quantization and in multilingual contexts, has not been studied in depth. Method: The authors systematically evaluate static and dynamic quantization methods on benchmarks measuring intrinsic and extrinsic bias and safety alignment, and propose Critical Weight Protection, which identifies the weights that matter for fairness and safety and preserves them during quantization. Result: Quantization generally degrades fairness and safety, with dynamic quantization more stable than static quantization; fairness degradation varies by language, and safety deteriorates most severely in non-English settings; the proposed method effectively mitigates these degradations. Conclusion: Quantization affects model fairness and safety, with more pronounced risks in multilingual scenarios; protecting critical weights maintains the balance between trustworthiness and efficiency without retraining. Abstract: Quantization is widely adopted to reduce the computational cost of large language models (LLMs); however, its implications for fairness and safety, particularly in dynamic quantization and multilingual contexts, remain underexplored. In this work, we conduct a systematic study of how static and dynamic quantization methods impact fairness and safety across benchmarks measuring intrinsic and extrinsic bias and safety alignment. For fairness, we evaluate English, French, Dutch, Spanish, and Turkish; for safety, we focus on English, Korean, and Arabic. Our findings reveal that quantization consistently degrades fairness and safety, with dynamic methods demonstrating greater stability than static ones. Moreover, fairness degradation varies across languages, while safety deterioration is especially pronounced in non-English settings. To address these risks, we introduce Critical Weight Protection, a novel technique that identifies and preserves fairness- and safety-critical weights during quantization. This approach effectively mitigates bias and safety deterioration without costly retraining or alignment, maintaining trustworthiness while retaining efficiency.

[37] Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs

Ziyi Zhao,Chongming Gao,Yang Zhang,Haoyan Liu,Weinan Gan,Huifeng Guo,Yong Liu,Fuli Feng

Main category: cs.CL

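TL;DR: Proposes PUMA, a lightweight framework for efficiently migrating personalized prompts between incompatible large language models, substantially reducing computational cost while preserving performance.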

Details Motivation: Existing personalized soft prompts become obsolete when the foundation model is upgraded and must be retrained at high cost. Method: Designs a parameter-efficient adapter (Prompt-level User Migration Adapter, PUMA) combined with a group-based user selection strategy to bridge the semantic gap between models and reduce training overhead. Result: Experiments on three large-scale datasets show that PUMA matches or even surpasses retraining from scratch while cutting computational cost by up to 98%, and that it generalizes well and remains robust across diverse model architectures and complex migration scenarios. Conclusion: PUMA decouples user assets from the underlying model, offering a practical path for the sustainable evolution of personalized AI. Abstract: Personalization in Large Language Models (LLMs) often relies on user-specific soft prompts. However, these prompts become obsolete when the foundation model is upgraded, necessitating costly, full-scale retraining. To overcome this limitation, we propose the Prompt-level User Migration Adapter (PUMA), a lightweight framework to efficiently migrate personalized prompts across incompatible models. PUMA utilizes a parameter-efficient adapter to bridge the semantic gap, combined with a group-based user selection strategy to significantly reduce training costs. Experiments on three large-scale datasets show our method matches or even surpasses the performance of retraining from scratch, reducing computational cost by up to 98%. The framework demonstrates strong generalization across diverse model architectures and robustness in advanced scenarios like chained and aggregated migrations, offering a practical path for the sustainable evolution of personalized AI by decoupling user assets from the underlying models.

[38] Codebook-Injected Dialogue Segmentation for Multi-Utterance Constructs Annotation: LLM-Assisted and Gold-Label-Free Evaluation

Jinsook Lee,Kirk Vanacore,Zhuqian Zhou,Jeanine Grutter,Rene F. Kizilcec

Main category: cs.CL

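TL;DR: This paper proposes codebook-injected dialogue segmentation, which folds downstream annotation criteria into boundary decisions to make dialogue act (DA) annotation more consistent, and uses large language models (LLMs) as segmenters. It designs evaluation metrics that need no gold labels and finds that DA-aware segmentation yields better within-segment consistency than baselines, but that each segmenter has its own strengths and weaknesses, so the segmentation strategy should be optimized for the downstream task.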

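Details Motivation: Conventional dialogue act annotation often produces inconsistent segment boundaries because it treats intent as local to individual utterances, which lowers annotation reliability. This paper aims to ease the problem with a segmentation method aligned with the annotation target. Method: Proposes codebook-injected segmentation, which conditions boundary decisions on the dialogue act definitions; evaluates LLM-based segmenters against standard and retrieval-augmented baselines; and designs gold-label-free metrics for span consistency, distinctiveness, and human-AI distributional agreement. Result: DA-aware segmentation beats text-only baselines on within-segment consistency; LLMs are good at building construct-consistent spans but weaker than coherence-based methods at capturing global shifts in the dialogue; no single segmenter is best across datasets, and gains in within-segment consistency often come at the expense of boundary distinctiveness. Conclusion: Dialogue segmentation is a consequential design choice that should be optimized for the specific downstream objective rather than for a single performance metric. Abstract: Dialogue Act (DA) annotation typically treats communicative or pedagogical intent as localized to individual utterances or turns. This leads annotators to agree on the underlying action while disagreeing on segment boundaries, reducing apparent reliability. We propose codebook-injected segmentation, which conditions boundary decisions on downstream annotation criteria, and evaluate LLM-based segmenters against standard and retrieval-augmented baselines. To assess these without gold labels, we introduce evaluation metrics for span consistency, distinctiveness, and human-AI distributional agreement. We find that DA-awareness produces segments that are internally more consistent than those of text-only baselines. While LLMs excel at creating construct-consistent spans, coherence-based baselines remain superior at detecting global shifts in dialogue flow. Across two datasets, no single segmenter dominates. Improvements in within-segment coherence frequently trade off against boundary distinctiveness and human-AI distributional agreement. These results highlight segmentation as a consequential design choice that should be optimized for downstream objectives rather than a single performance score.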

[39] Bridging the Gap in Bangla Healthcare: Machine Learning Based Disease Prediction Using a Symptoms-Disease Dataset

Rowzatul Zannat,Abdullah Al Shafi,Abdul Muntakim

Main category: cs.CL

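TL;DR: This paper builds a Bangla symptoms-disease dataset covering 85 diseases and 758 symptom-disease relationships and evaluates several machine learning models on it; soft and hard voting ensembles reach 98% accuracy, providing a foundational resource for disease prediction and access to health information among Bangla speakers.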

Details Motivation: To improve access to reliable health information for non-English-speaking populations, Bangla speakers in particular, and to address the shortage of Bangla resources for disease prediction. Method: Builds and publicly releases a comprehensive Bangla symptoms-disease dataset (85 diseases, 758 relationships), evaluates multiple machine learning models on it, and combines the top-performing models with soft and hard voting ensembles. Result: Both the soft and the hard voting ensembles reach 98% accuracy, showing strong robustness and generalization. Conclusion: The work establishes a foundational dataset and model benchmark for disease prediction in Bangla, advancing localized health informatics and supporting equitable, early disease detection and healthcare interventions for Bangla-speaking communities. Abstract: Increased access to reliable health information is essential for non-English-speaking populations, yet resources in Bangla for disease prediction remain limited. This study addresses this gap by developing a comprehensive Bangla symptoms-disease dataset containing 758 unique symptom-disease relationships spanning 85 diseases. To ensure transparency and reproducibility, we also make our dataset publicly available. The dataset enables the prediction of diseases based on Bangla symptom inputs, supporting healthcare accessibility for Bengali-speaking populations. Using this dataset, we evaluated multiple machine learning models to predict diseases based on symptoms provided in Bangla and analyzed their performance on our dataset. Both soft and hard voting ensemble approaches combining top-performing models achieved 98% accuracy, demonstrating superior robustness and generalization. Our work establishes a foundational resource for disease prediction in Bangla, paving the way for future advancements in localized health informatics and diagnostic tools. This contribution aims to enhance equitable access to health information for Bangla-speaking communities, particularly for early disease detection and healthcare interventions.
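The difference between the hard and soft voting mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the three models, their probabilities, and the disease labels are hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Majority vote over the single label predicted by each model."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average class probabilities across models and return the top class."""
    classes = probabilities[0].keys()
    n_models = len(probabilities)
    averaged = {c: sum(p[c] for p in probabilities) / n_models for c in classes}
    return max(averaged, key=averaged.get)

# Hypothetical outputs of three models for one symptom description.
model_probs = [
    {"dengue": 0.90, "malaria": 0.10},
    {"dengue": 0.40, "malaria": 0.60},
    {"dengue": 0.45, "malaria": 0.55},
]
model_labels = [max(p, key=p.get) for p in model_probs]

hard = hard_vote(model_labels)  # two of three models say "malaria"
soft = soft_vote(model_probs)   # averaged probability favors "dengue"
```

In this toy case the two schemes disagree: hard voting counts labels only, while soft voting lets one very confident model outweigh two marginal ones.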

[40] To Copy or Not to Copy: Copying Is Easier to Induce Than Recall

Mehrdad Farahani,Franziska Penzkofer,Richard Johansson

Main category: cs.CL

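TL;DR: This paper studies how language models in retrieval-augmented settings arbitrate between parametric knowledge and contextual information, proposes intervening on model behavior by extracting an "arbitration vector", and reveals a mechanistic asymmetry between copying and recall.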

Details Motivation: To understand how language models choose between their internal parametric knowledge and externally supplied context when that context is relevant or irrelevant. Method: Builds a dedicated dataset that separates two regimes (irrelevant contexts that elicit parametric recall, and relevant but false contexts that elicit copying), computes the difference between the residual-stream centroids of the two regimes as the arbitration vector, and runs additive intervention experiments at selected layers and token spans. Result: Across two architectures (decoder-only and encoder/decoder) and two open-domain QA benchmarks, the method consistently shifts model behavior in both Copy↔Recall directions while accuracy and fluency are monitored; mechanistic analysis shows that inducing copying is an easily triggered "reactivation" process, whereas restoring recall is a more fragile "suppression" process that depends on object-token interventions. Conclusion: There are intervenable mechanistic pathways between contextual copying and parametric recall in language models, and the two are fundamentally asymmetric: copying is easy to induce, while restoring recall is harder and more condition-sensitive. Abstract: Language models used in retrieval-augmented settings must arbitrate between parametric knowledge stored in their weights and contextual information in the prompt. This work presents a mechanistic study of that choice by extracting an arbitration vector from model activations on a curated dataset designed to disentangle (i) irrelevant contexts that elicit parametric recall and (ii) relevant but false contexts that elicit copying. The vector is computed as the residual-stream centroid difference between these regimes across 27 relations, and is injected as an additive intervention at selected layers and token spans to steer behavior in two directions: Copy$\rightarrow$Recall (suppressing context use) and Recall$\rightarrow$Copy (inducing the model to copy any token from the context). Experiments on two architectures (decoder-only and encoder/decoder) and two open-domain QA benchmarks show consistent behavior shifts under moderate scaling while monitoring accuracy and fluency. Mechanistic analyses of attention routing, MLP contributions, and layer-wise probability trajectories reveal an asymmetry: inducing copying is an easy "reactivation" process that can be triggered at different locations in the input, while restoring recall is a "suppression" process that is more fragile and strongly tied to object-token interventions.
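The centroid-difference construction and the additive intervention described in this abstract can be sketched on toy vectors. The function names, the two-dimensional "activations", the scale values, and the sign convention (copy minus recall) are all assumptions made for illustration; the paper's actual layers, token spans, and scaling are not reproduced here.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def arbitration_vector(copy_states, recall_states):
    """Difference of regime centroids; the copy-minus-recall sign is an assumption."""
    c_copy = centroid(copy_states)
    c_recall = centroid(recall_states)
    return [a - b for a, b in zip(c_copy, c_recall)]

def steer(hidden_state, vector, scale):
    """Additive intervention: shift one hidden state along the vector."""
    return [h + scale * v for h, v in zip(hidden_state, vector)]

# Toy 2-d stand-ins for residual-stream activations in each regime.
copy_states = [[1.0, 0.0], [3.0, 0.0]]
recall_states = [[0.0, 1.0], [0.0, 3.0]]

vec = arbitration_vector(copy_states, recall_states)
toward_copy = steer([0.0, 0.0], vec, 0.5)     # positive scale: push toward copying
toward_recall = steer([0.0, 0.0], vec, -0.5)  # negative scale: push toward recall
```

With a real model the same arithmetic would be applied to activations captured at a chosen layer, and the shifted state would be written back before the forward pass continues.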

[41] Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization

Linfeng Du,Ye Yuan,Zichen Zhao,Fuyuan Lyu,Emiliano Penaloza,Xiuying Chen,Zipeng Sun,Jikun Kang,Laurent Charlin,Xue Liu,Haolun Wu

Main category: cs.CL

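TL;DR: PURPLE is a contextual-bandit method for personalizing large language models: it uses a Plackett-Luce ranking model to generate a set of user history records and optimizes retrieval augmentation directly for generation quality, clearly outperforming conventional relevance-driven methods.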

Details Motivation: Existing retrieval augmentation based on semantic relevance may select history records that resemble the query yet are useless or even harmful, so it cannot guarantee better generation quality; a more reliable way to build user profiles is needed. Method: Proposes the PURPLE framework, which treats user profile construction as a set generation process, uses a Plackett-Luce model to capture dependencies between records, and uses the likelihood of the reference response as a dense feedback signal, optimizing retrieval through a contextual bandit so that it aligns with generation quality. Result: Experiments on nine personalization tasks show that PURPLE consistently outperforms strong heuristic and retrieval-augmented baselines in both effectiveness and efficiency. Conclusion: PURPLE offers a principled and scalable solution for personalizing large language models and confirms that optimizing retrieval for generation quality beats conventional relevance-based selection. Abstract: Large Language Models (LLMs) excel at general-purpose tasks, yet adapting their responses to individual users remains challenging. Retrieval augmentation provides a lightweight alternative to fine-tuning by conditioning LLMs on user history records, and existing approaches typically select these records based on semantic relevance. We argue that relevance serves as an unreliable proxy for utility: a record may be semantically similar to a query yet fail to improve generation quality or even degrade it due to redundancy or conflicting information. To bridge this gap, we propose PURPLE, a contextual bandit framework that oPtimizes UseR Profiles for Llm pErsonalization. In contrast to a greedy selection of the most relevant records, PURPLE treats profile construction as a set generation process and utilizes a Plackett-Luce ranking model to capture complex inter-record dependencies. By training with dense feedback provided by the likelihood of the reference response, our method aligns retrieval directly with generation quality. Extensive experiments on nine personalization tasks demonstrate that PURPLE consistently outperforms strong heuristic and retrieval-augmented baselines in both effectiveness and efficiency, establishing a principled and scalable solution for optimizing user profiles.
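For readers unfamiliar with the Plackett-Luce model named in this abstract, the following sketch shows the generic model only: items are drawn one at a time without replacement, each with probability proportional to the exponential of its score. The scores, function names, and profile size are hypothetical, and how PURPLE computes its scores or conditions them on already chosen records is not shown.

```python
import math
import random

def plackett_luce_log_prob(scores, ranking):
    """Log-probability of picking the items in `ranking` in that order."""
    remaining = list(range(len(scores)))
    log_prob = 0.0
    for item in ranking:
        # Softmax over the items that have not been picked yet.
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob

def plackett_luce_sample(scores, k, rng):
    """Sample an ordered set of k distinct item indices."""
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        weights = [math.exp(scores[j]) for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

# Hypothetical utility scores for four history records.
record_scores = [2.0, 0.5, 0.5, -1.0]
profile = plackett_luce_sample(record_scores, k=2, rng=random.Random(0))
profile_log_prob = plackett_luce_log_prob(record_scores, profile)
```

In a bandit-style setup, the log-probability of the sampled profile is the quantity a policy-gradient update would weight by the observed reward.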

[42] Large language models struggle with ethnographic text annotation

Leonardo S. Goodall,Dor Shilton,Daniel A. Mullins,Harvey Whitehouse

Main category: cs.CL

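TL;DR: Evaluates 7 state-of-the-art large language models on annotating 121 ritual features across 567 ethnographic excerpts and finds that their performance is limited and falls short of what reliable automated annotation requires, indicating that current LLMs cannot yet replace human expertise in ethnographic annotation.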

Details Motivation: To explore whether large language models can accelerate cross-cultural research by extracting structured data from ethnographic texts. Method: Evaluates 7 state-of-the-art large language models on annotating 121 ritual features across 567 ethnographic excerpts and compares them with human inter-coder reliability. Result: Model performance is limited, especially on long texts, features requiring ordinal distinctions, and ambiguous constructs; even on features where human coders agree reliably, the models fall short of human performance. Conclusion: Current large language models cannot yet replace humans in ethnographic annotation tasks. Abstract: Large language models (LLMs) have shown promise for automated text annotation, raising hopes that they might accelerate cross-cultural research by extracting structured data from ethnographic texts. We evaluated 7 state-of-the-art LLMs on their ability to annotate 121 ritual features across 567 ethnographic excerpts. Performance was limited, falling well below levels required for reliable automated annotation. Longer texts, features requiring ordinal distinctions, and ambiguous constructs proved particularly difficult. Human inter-coder reliability set an approximate ceiling on LLM accuracy: features that human coders found difficult to agree upon were also difficult for LLMs. Yet even on features where humans reliably agreed, models fell short of human performance. Our findings suggest that LLMs cannot yet substitute for human expertise in ethnographic annotation.

[43] Powerful Training-Free Membership Inference Against Autoregressive Language Models

David Ilić,David Stanojević,Kostadin Cvejoski

Main category: cs.CL

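TL;DR: This paper proposes EZ-MIA, a new membership inference attack that exploits memorization at error positions to detect privacy leakage in fine-tuned language models more effectively.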

Details Motivation: Existing membership inference attacks have limited detection power at low false-positive rates and do not meet the needs of practical privacy auditing, so a more effective detection method is required. Method: Proposes the Error Zone (EZ) score, which performs membership inference by measuring the directional imbalance of probability shifts, relative to a pretrained reference model, at positions where the model predicts incorrectly; it needs only two forward passes and no additional training. Result: On WikiText with GPT-2, EZ-MIA reaches a 66.3% true positive rate at a 1% false positive rate, 3.8x the previous best method; at a 0.1% false positive rate detection improves 8x; on Llama-2-7B it achieves a 3x gain. Conclusion: EZ-MIA substantially increases the effectiveness of membership inference attacks and shows that fine-tuned language models carry greater privacy risks than previously understood, with implications for privacy auditing and model deployment. Abstract: Fine-tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false-positive thresholds required for practical privacy auditing. We present EZ-MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, specifically tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This principled statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT-2, EZ-MIA achieves 3.8x higher detection than the previous state-of-the-art under identical conditions (66.3% versus 17.5% true positive rate at 1% false positive rate), with near-perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real-world auditing, we achieve 8x higher detection than prior work (14.0% versus 1.8%), requiring no reference model training. These gains extend to larger architectures: on AG News with Llama-2-7B, we achieve 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that privacy risks of fine-tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. 
Code is available at https://github.com/JetBrains-Research/ez-mia.
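The headline numbers in this abstract are true positive rates at a fixed false positive rate. The sketch below shows one common way to compute that metric from attack scores; it is not the EZ score itself and is not taken from the linked repository. The function name, the strict-inequality decision rule, and the toy scores are assumptions.

```python
def tpr_at_fpr(member_scores, nonmember_scores, max_fpr):
    """TPR at the loosest threshold whose false positive rate stays <= max_fpr.

    A sample is predicted "member" when its score is strictly above the threshold.
    """
    ranked = sorted(nonmember_scores, reverse=True)
    # Number of non-members we may wrongly flag without exceeding max_fpr.
    allowed = int(max_fpr * len(ranked))
    # Thresholding at the (allowed + 1)-th largest non-member score flags
    # at most `allowed` non-members.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    hits = sum(score > threshold for score in member_scores)
    return hits / len(member_scores)

# Toy attack scores (higher means "more likely a training member").
nonmembers = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
members = [0.95, 0.85, 2.0, 0.5]

tpr_loose = tpr_at_fpr(members, nonmembers, max_fpr=0.1)   # one false positive allowed
tpr_strict = tpr_at_fpr(members, nonmembers, max_fpr=0.0)  # no false positives allowed
```

Reporting the metric at 1% or 0.1% FPR, as the abstract does, requires enough non-member samples for `allowed` to be at least one; with fewer samples the threshold falls back to the maximum non-member score.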

[44] Bengali Text Classification: An Evaluation of Large Language Model Approaches

Md Mahmudul Hoque,Md Mehedi Hassain,Md Hojaifa Tanvir,Rahul Nandy

Main category: cs.CL

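TL;DR: This study examines the effectiveness of large language models (LLMs) for classifying Bengali newspaper articles, evaluating three instruction-tuned LLMs on a Prothom Alo dataset; Qwen 2.5 7B Instruct performs best, with 72% accuracy.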

Details Motivation: Bengali text classification is difficult because large annotated datasets and pre-trained language models are lacking, so the effectiveness of large language models for this low-resource language needs to be explored. Method: Evaluates three instruction-tuned large language models (LLaMA 3.1 8B Instruct, LLaMA 3.2 3B Instruct, and Qwen 2.5 7B Instruct) under the same classification framework on a Prothom Alo news dataset obtained from Kaggle. Result: Qwen 2.5 7B Instruct achieves the highest classification accuracy, 72%, and is especially strong in the "Sports" category; LLaMA 3.1 and LLaMA 3.2 reach 53% and 56%, respectively. Conclusion: Despite the scarcity of Bengali NLP resources, large language models show useful text classification ability; future work will explore more models, address class imbalance, and refine fine-tuning approaches to improve performance. Abstract: Bengali text classification is a significant task in natural language processing (NLP), where text is categorized into predefined labels. Unlike English, Bengali faces challenges due to the lack of extensive annotated datasets and pre-trained language models. This study explores the effectiveness of large language models (LLMs) in classifying Bengali newspaper articles. The dataset used, obtained from Kaggle, consists of articles from Prothom Alo, a major Bangladeshi newspaper. Three instruction-tuned LLMs (LLaMA 3.1 8B Instruct, LLaMA 3.2 3B Instruct, and Qwen 2.5 7B Instruct) were evaluated for this task under the same classification framework. Among the evaluated models, Qwen 2.5 achieved the highest classification accuracy of 72%, showing particular strength in the "Sports" category. In comparison, LLaMA 3.1 and LLaMA 3.2 attained accuracies of 53% and 56%, respectively. The findings highlight the effectiveness of LLMs in Bengali text classification, despite the scarcity of resources for Bengali NLP. Future research will focus on exploring additional models, addressing class imbalance issues, and refining fine-tuning approaches to improve classification performance.

[45] Analyzing Cancer Patients' Experiences with Embedding-based Topic Modeling and LLMs

Teodor-Călin Ionescu,Lifeng Han,Jan Heijdra Suasnabar,Anne Stiggelbout,Suzan Verberne

Main category: cs.CL

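TL;DR: This study uses neural topic modeling (BERTopic in particular) and a large language model (GPT4) to analyze interviews with cancer patients. It finds that domain-specific embedding models such as BioClinicalBERT improve topic precision and interpretability, and it identifies dominant themes such as "Coordination and Communication in Cancer Care Management" and "Patient Decision-Making in Cancer Treatment Journey", which can help strengthen the role of patients' voices in healthcare.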

Details Motivation: To explore how meaningful themes can be extracted automatically from patient narratives in support of patient-centered healthcare, and to compare neural topic models on clinical text. Method: Applies BERTopic and Top2Vec for topic modeling with the same preprocessing, chunking, and clustering settings for a fair comparison; uses GPT4 for topic labeling and selects the best model through human evaluation (coherence, clarity, relevance); then tests three clinically oriented embedding models and finally runs a global analysis of all 13 interviews with BioClinicalBERT. Result: BERTopic outperforms Top2Vec on keyword extraction and topic quality; BioClinicalBERT embeddings noticeably improve topic precision and interpretability; the global analysis identifies two dominant themes, "Coordination and Communication in Cancer Care Management" and "Patient Decision-Making in Cancer Treatment Journey"; the results remain meaningful even though the data are machine translated and no clinicians took part in the evaluation. Conclusion: Neural topic modeling combined with LLMs can extract clinically relevant themes from patient interviews, especially with domain-adapted embedding models, and may help clinicians understand patient needs more efficiently and advance patient-centered care. Abstract: This study investigates the use of neural topic modeling and LLMs to uncover meaningful themes from patient storytelling data, to offer insights that could contribute to more patient-oriented healthcare practices. We analyze a collection of transcribed interviews with cancer patients (132,722 words in 13 interviews). We first evaluate BERTopic and Top2Vec for individual interview summarization by using similar preprocessing, chunking, and clustering configurations to ensure a fair comparison on keyword extraction. LLMs (GPT4) are then used for the next step, topic labeling. Their outputs for a single interview (I0) are rated through a small-scale human evaluation, focusing on coherence, clarity, and relevance. Based on the preliminary results and evaluation, BERTopic shows stronger performance and is selected for further experimentation using three clinically oriented embedding models. We then analyzed the full interview collection with the best model setting. Results show that domain-specific embeddings improved topic precision and interpretability, with BioClinicalBERT producing the most consistent results across transcripts. The global analysis of the full dataset of 13 interviews, using the BioClinicalBERT embedding model, reveals the most dominant topics throughout all 13 interviews, namely "Coordination and Communication in Cancer Care Management" and "Patient Decision-Making in Cancer Treatment Journey". 
Although the interviews are machine translations from Dutch to English, and clinical professionals are not involved in this evaluation, the findings suggest that neural topic modeling, particularly BERTopic, can help provide useful feedback to clinicians from patient interviews. This pipeline could support more efficient document navigation and strengthen the role of patients' voices in healthcare workflows.

[46] Tolerance Principle and Small Language Model Learning

Adam E. Friedman,Stevan Harnad,Rushen Shi

Main category: cs.CL

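TL;DR: This study examines whether a transformer-based language model (BabyBERTa) conforms to Yang's (2016) Tolerance Principle in artificial grammar learning; the results show that its learning dynamics differ from those of human infants and do not follow the principle.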

Details Motivation: To investigate how well language models generalize grammar rules from very little data and to test whether they conform to the Tolerance Principle from human language acquisition. Method: Trains BabyBERTa, a model optimized for small datasets, on artificial-grammar training sets that vary in size, number of sentence types, and ratio of rule-following to exception exemplars, and tests its generalization. Result: BabyBERTa's learning outcomes show no pattern consistent with the Tolerance Principle; unlike human infants, it does not learn abstract grammar rules while tolerating a certain number of exceptions. Conclusion: Current transformer models differ fundamentally from human infants in how they learn grammar from small samples, suggesting that existing models are not equivalent to human language acquisition mechanisms. Abstract: Modern language models like GPT-3, BERT, and LLaMA require massive training data, yet with sufficient training they reliably learn to distinguish grammatical from ungrammatical sentences. Children aged as young as 14 months already have the capacity to learn abstract grammar rules from very few exemplars, even in the presence of non-rule-following exceptions. Yang's (2016) Tolerance Principle defines a precise threshold for how many exceptions a rule can tolerate and still be learnable. The present study explored the minimal amount and quality of training data necessary for rules to be generalized by a transformer-based language model to test the predictions of the Tolerance Principle. We trained BabyBERTa (Huebner et al. 2021), a transformer model optimized for small datasets, on artificial grammars. The training sets varied in size, number of unique sentence types, and proportion of rule-following versus exception exemplars. We found that, unlike human infants, BabyBERTa's learning dynamics do not align with the Tolerance Principle.
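The "precise threshold" this abstract refers to is usually stated as follows: a rule that applies to N items remains productive only if its exceptions number at most N / ln N. The sketch below encodes that inequality; the function names and the example counts are illustrative and do not come from the paper's training sets.

```python
import math

def tolerance_threshold(n_items):
    """Maximum number of exceptions a rule over n_items can tolerate (N / ln N)."""
    if n_items < 2:
        raise ValueError("the threshold is only defined for at least 2 items")
    return n_items / math.log(n_items)

def rule_is_productive(n_items, n_exceptions):
    """True when the exception count does not exceed the tolerance threshold."""
    return n_exceptions <= tolerance_threshold(n_items)

# For a rule covering 100 items the threshold is about 21.7 exceptions.
threshold_100 = tolerance_threshold(100)
```

Because N / ln N grows more slowly than N, the tolerated share of exceptions shrinks as the vocabulary grows, which is why small training sets are the interesting test case.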

[47] CTC-DID: CTC-Based Arabic dialect identification for streaming applications

Muhammad Umar Farooq,Oscar Saz

Main category: cs.CL

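TL;DR: This paper proposes a dialect identification (DID) method inspired by the Connectionist Temporal Classification (CTC) loss, treating dialect tags as a label sequence for the utterance. On low-resource Arabic dialect identification it outperforms Whisper and ECAPA-TDNN models, is more robust on short utterances and in zero-shot settings, and is suited to real-time streaming applications.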

Details Motivation: Labeled data for low-resource dialects is scarce and utterances are short, which limits conventional dialect identification methods; a more robust and scalable method suited to real-world settings such as streaming is needed. Method: Proposes CTC-DID, which frames DID as a limited-vocabulary ASR task modeled with the CTC loss; the number of dialect tag repetitions used to build training labels is estimated either with a Language-Agnostic Heuristic (LAH) or with a pre-trained ASR model. Result: On low-resource Arabic dialect identification, a self-supervised learning (SSL) based CTC-DID model trained on limited data outperforms fine-tuned Whisper and ECAPA-TDNN; it also performs better in zero-shot evaluation on the Casablanca dataset, is more robust to short utterances, and is suitable for real-time streaming. Conclusion: CTC-DID offers an effective new paradigm for dialect identification, using the CTC framework from ASR to achieve stronger generalization and practicality, especially in low-resource, short-utterance, and real-time scenarios. Abstract: This paper proposes a Dialect Identification (DID) approach inspired by the Connectionist Temporal Classification (CTC) loss function as used in Automatic Speech Recognition (ASR). CTC-DID frames the dialect identification task as a limited-vocabulary ASR system, where dialect tags are treated as a sequence of labels for a given utterance. For training, the repetition of dialect tags in transcriptions is estimated either using a proposed Language-Agnostic Heuristic (LAH) approach or a pre-trained ASR model. The method is evaluated on the low-resource Arabic Dialect Identification (ADI) task, with experimental results demonstrating that an SSL-based CTC-DID model, trained on a limited dataset, outperforms both fine-tuned Whisper and ECAPA-TDNN models. Notably, CTC-DID also surpasses these models in zero-shot evaluation on the Casablanca dataset. The proposed approach is found to be more robust to shorter utterances and is shown to be easily adaptable for streaming, real-time applications, with minimal performance degradation.

[48] CoReflect: Conversational Evaluation via Co-Evolutionary Simulation and Reflective Rubric Refinement

Yunzhe Li,Richie Yueqi Feng,Tianxin Wei,Chin-Chia Hsu

Main category: cs.CL

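TL;DR: Proposes CoReflect, which evaluates multi-turn conversational systems adaptively and iteratively through a co-evolutionary loop of dialogue simulation and evaluation.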

Details Motivation: Conventional evaluation of conversational systems relies on static rubrics and fixed context, which cannot cover the diverse emergent behaviors of models and lacks dynamic diagnostic power. Method: Introduces the CoReflect framework, in which a conversation planner generates structured templates that guide a user simulator through goal-directed dialogues, and a reflective analyzer identifies behavioral patterns and automatically refines the evaluation rubrics, forming a closed co-evolutionary loop. Result: Test-case complexity and rubric precision improve in tandem, human intervention is greatly reduced, and the approach scales well. Conclusion: CoReflect offers a self-refining paradigm for evaluating conversational systems that can keep pace with rapidly advancing dialogue models. Abstract: Evaluating conversational systems in multi-turn settings remains a fundamental challenge. Conventional pipelines typically rely on manually defined rubrics and fixed conversational context, a static approach that limits coverage and fails to capture the diverse, emergent behaviors of dialogue models. To address this, we introduce CoReflect (Conversational Evaluation via Co-Evolutionary Simulation and Reflective Rubric Refinement), which unifies dialogue simulation and evaluation into an adaptive, iterative process. CoReflect employs a conversation planner that generates structured templates to guide a user simulator through diverse, goal-directed dialogues. Subsequently, a reflective analyzer processes these dialogues to identify systematic behavioral patterns and automatically refine the evaluation rubrics. Crucially, the insights from the conversation analysis are fed back into the planner to update conversation templates for subsequent iterations. This co-evolution loop ensures that the complexity of test cases and the diagnostic precision of rubrics improve in tandem. By minimizing human intervention, CoReflect provides a scalable and self-refining methodology that allows evaluation protocols to adapt alongside the rapidly advancing capabilities of dialogue models.

[49] Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models

Miao Li,Hanyang Jiang,Sikai Chen,Hengyu Fu,Yuhang Cai,Baihe Huang,Tinghan Ye,Xuanzhou Chen,Pascal Van Hentenryck

Main category: cs.CL

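TL;DR: Proposes PVF, a training-free planning framework for text generation that substantially improves the decoding efficiency of diffusion language models by building a semantic skeleton and adding a verification mechanism.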

Details Motivation: Existing decoding strategies for diffusion language models lack active planning and do not make full use of bidirectional context for global sequence modeling. Method: Proposes the Plan-Verify-Fill (PVF) framework, which first builds a semantic skeleton through hierarchical planning over high-leverage semantic anchors and adds a verification mechanism that enables structured early stopping. Result: Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that, compared with confidence-based parallel decoding, NFE drops by up to 65% while generation quality is maintained. Conclusion: PVF provides an efficient, training-free decoding paradigm for diffusion language models that substantially improves generation efficiency. Abstract: Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.

[50] Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Yixuan Du,Chenxiao Yu,Haoyan Xu,Ziyi Wang,Yue Zhao,Xiyang Hu

Main category: cs.CL

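TL;DR: This paper proposes MGEO, a new multimodal adversarial attack framework targeting vision-language model (VLM) product search, which jointly optimizes image perturbations and textual suffixes and exploits the VLM's cross-modal coupling to manipulate search rankings.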

Details Motivation: To study the robustness of VLMs in competitive ranking scenarios and expose their potential vulnerability to multimodal adversarial attacks. Method: Proposes Multimodal Generative Engine Optimization (MGEO), which uses an alternating gradient-based optimization strategy to jointly produce imperceptible image perturbations and fluent textual suffixes that attack the VLM's cross-modal fusion. Result: Experiments on real-world datasets with state-of-the-art models show that MGEO significantly outperforms single-modality attack baselines, effectively raising the target product's rank while evading content filters. Conclusion: The cross-modal synergy of VLMs improves performance but can also be exploited maliciously, threatening the fairness and integrity of retrieval systems, and deserves attention. Abstract: Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate that our coordinated attack significantly outperforms text-only and image-only baselines. These findings reveal that multimodal synergy, typically a strength of VLMs, can be weaponized to compromise the integrity of search rankings without triggering conventional content filters.

[51] Simulated Annealing Enhances Theory-of-Mind Reasoning in Autoregressive Language Models

Xucong Hu,Jian-Qiao Zhu

Main category: cs.CL

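TL;DR: This paper proposes extracting strong Theory-of-Mind (ToM) capability from base autoregressive language models through sampling-based optimization (power sampling with an annealing schedule), without additional training or weight updates.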

Details Motivation: Autoregressive language models are often criticized for optimizing only surface coherence and lacking correct representations of latent states, and are therefore thought to struggle with Theory-of-Mind tasks that depend on deeper reasoning. This paper tests whether they really lack that capability. Method: Uses power sampling based on Markov chain Monte Carlo (MCMC) to sharpen the sequence-level probability distribution, and adds an annealing schedule that gradually lowers the temperature to improve sampling and thus ToM task performance. Result: Experiments show that, without any fine-tuning or verification, better sampling alone substantially improves the base model's ToM performance, and annealing further outperforms fixed-temperature sampling. Conclusion: Language models may already have considerable ToM potential; the key is to draw it out with better inference and sampling strategies rather than relying on post-training. Abstract: Autoregressive language models are next-token predictors and have been criticized for only optimizing surface plausibility (i.e., local coherence) rather than maintaining correct latent-state representations (i.e., global coherence). Because Theory of Mind (ToM) tasks crucially depend on reasoning about latent mental states of oneself and others, such models are therefore often thought to fail at ToM. While post-training methods can improve ToM performance, we show that strong ToM capability can be recovered directly from the base model without any additional weight updates or verifications. Our approach builds on recent power-sampling methods (Karan & Du, 2025) that use Markov chain Monte Carlo (MCMC) to sample from sharpened sequence-level (rather than token-level) probability distributions of autoregressive language models. We further find that incorporating annealing, where the tempered distribution is gradually shifted from high to low temperature, substantially improves ToM performance over fixed-temperature power sampling. Together, these results suggest that sampling-based optimization provides a powerful way to extract latent capabilities from language models without retraining.
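The idea of annealed power sampling can be illustrated on a toy distribution. This is a deliberately simplified sketch, not the paper's sampler: the "sequences" are just two hypothetical candidate answers with known base probabilities, proposals are drawn independently from the base distribution, and the schedule of exponents is made up. The target at each stage is the base probability raised to a power alpha and renormalized; raising alpha corresponds to lowering the temperature 1/alpha.

```python
import random

def annealed_power_sample(base_probs, alphas, steps_per_alpha, rng):
    """Metropolis-Hastings targeting p(x)**alpha while alpha is annealed upward."""
    candidates = list(base_probs)
    weights = [base_probs[c] for c in candidates]
    current = rng.choices(candidates, weights=weights, k=1)[0]
    for alpha in alphas:
        for _ in range(steps_per_alpha):
            # Independence proposal drawn from the base distribution itself.
            proposal = rng.choices(candidates, weights=weights, k=1)[0]
            # With proposal q = p, the MH ratio for target p**alpha reduces to
            # (p(proposal) / p(current)) ** (alpha - 1).
            ratio = (base_probs[proposal] / base_probs[current]) ** (alpha - 1)
            if rng.random() < min(1.0, ratio):
                current = proposal
    return current

# Hypothetical base model that only mildly prefers answer "a".
base = {"a": 0.6, "b": 0.4}
rng = random.Random(0)
runs = [annealed_power_sample(base, [1, 2, 4, 8], 20, rng) for _ in range(300)]
share_a = runs.count("a") / len(runs)
```

At alpha = 8 the sharpened target puts roughly 96% of its mass on "a", so `share_a` should sit well above the base rate of 0.6. With real language models the proposal would resample part of a token sequence rather than a whole candidate.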

[52] Conversational Context Classification: A Representation Engineering Approach

Jonathan Pan

Main category: cs.CL

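TL;DR: This paper proposes combining Representation Engineering (RepE) with a One-Class Support Vector Machine (OCSVM) to identify context-specific subspaces in the hidden states of large language models (LLMs) and detect departures from the context.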

Details Motivation: LLMs tend to produce out-of-context responses (topic shifts, factual errors, or hallucinations), so an effective mechanism is needed to judge whether their output matches the expected semantics; traditional anomaly detection is hard to apply directly at the level of contextual semantics. Method: Uses Representation Engineering (RepE) to extract representations tied to a specific context from the LLM's internal states and trains an OCSVM on them to build a context boundary in the hidden-state space; experiments on two open-source models, Llama and Qwen, search for the layers that best reflect the target context. Result: The method identifies subspaces in the LLM's hidden-state space associated with a specific context and shows promise for judging whether a conversation has strayed from it. Conclusion: The method offers a new way to check the contextual consistency of LLM output and contributes to understanding how LLMs work internally, supporting safer and more reliable LLM applications. Abstract: The increasing prevalence of Large Language Models (LLMs) demands effective safeguards for their operation, particularly concerning their tendency to generate out-of-context responses. A key challenge is accurately detecting when LLMs stray from expected conversational norms, manifesting as topic shifts, factual inaccuracies, or outright hallucinations. Traditional anomaly detection is difficult to apply directly to contextual semantics. This paper outlines our experiment in exploring the use of Representation Engineering (RepE) and One-Class Support Vector Machine (OCSVM) to identify subspaces within the internal states of LLMs that represent a specific context. By training OCSVM on in-context examples, we establish a robust boundary within the LLM's hidden state latent space. We evaluate our approach with two open-source LLMs, Llama and Qwen, in a specific contextual domain. Our approach entailed identifying the optimal layers within the LLM's internal state subspaces that associate strongly with the context of interest. Our evaluation showed promising results in identifying the subspace for a specific context. Aside from being useful in detecting in- or out-of-context conversation threads, this research work contributes to the study of better interpreting LLMs.

[53] Can Deep Research Agents Find and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Ming Zhang,Jiabao Zhuang,Wenqing Jing,Ziyu Kong,Jingyi Deng,Yujiong Shen,Kexin Tan,Yuhang Zhao,Ning Luo,Renzhe Zheng,Jiahui Lin,Mingqi Wu,Long Ma,Yi Zou,Shihan Dou,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

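TL;DR: This paper introduces TaxoBench, a benchmark for evaluating how well deep research agents retrieve key papers and build knowledge structures in automated survey generation; results show that current systems remain far below human experts.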

Details Motivation: Existing benchmarks focus only on fluency or citation accuracy and do not evaluate the core capabilities of survey writing: retrieving key papers and organizing them into a knowledge structure. Method: Builds TaxoBench from 72 highly cited computer science surveys, manually extracting expert taxonomy trees containing 3,815 precisely categorized citations; defines two evaluation modes, Deep Research (end-to-end) and Bottom-Up (structuring ability only); evaluates 7 deep research agents and 12 frontier LLMs. Result: The best agent recalls only 20.9% of expert-selected papers; even with perfect input, the best model reaches an organization score (ARI) of only 0.31. Conclusion: Current deep research agents are far from expert level on the key capabilities of survey writing; TaxoBench provides a publicly available diagnostic benchmark for future work. Abstract: Deep Research Agents are increasingly used for automated survey generation. However, whether they can write surveys like human experts remains unclear. Existing benchmarks focus on fluency or citation accuracy, but none evaluates the core capabilities: retrieving essential papers and organizing them into coherent knowledge structures. We introduce TaxoBench, a diagnostic benchmark derived from 72 highly-cited computer science surveys. We manually extract expert-authored taxonomy trees containing 3,815 precisely categorized citations as ground truth. Our benchmark supports two evaluation modes: Deep Research mode tests end-to-end retrieval and organization given only a topic, while Bottom-Up mode isolates structuring capability by providing the exact papers human experts used. We evaluate 7 leading Deep Research agents and 12 frontier LLMs. Results reveal a dual bottleneck: the best agent recalls only 20.9% of expert-selected papers, and even with perfect input, the best model achieves only 0.31 ARI in organization. Current deep research agents remain far from expert-level survey writing. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
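
The two headline numbers above, recall of expert-selected papers and the Adjusted Rand Index (ARI), can be computed as in the following sketch. It assumes each paper is assigned a single flat category label, which simplifies the taxonomy trees the benchmark actually uses; the function names and sample labels are illustrative and not taken from the TaxoBench code.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two flat groupings of the same items (1.0 = identical)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    # Count item pairs that fall together in both, in truth, and in prediction.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    max_index = (together_true + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both groupings are trivial
    return (together_both - expected) / (max_index - expected)

def recall(retrieved, expert_selected):
    """Fraction of expert-selected papers that the agent retrieved."""
    expert = set(expert_selected)
    return len(expert & set(retrieved)) / len(expert)

if __name__ == "__main__":
    expert_groups = ["retrieval", "retrieval", "agents", "agents"]
    model_groups = ["g1", "g2", "g1", "g2"]
    print(adjusted_rand_index(expert_groups, model_groups))
    print(recall(["p1", "p2", "p9"], ["p1", "p2", "p3", "p4"]))
```

ARI corrects for chance agreement, so a grouping that is no better than random scores near 0 and can go negative, which puts the reported 0.31 in context.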

[54] A Scalable Entity-Based Framework for Auditing Bias in LLMs

Akram Elbouanani,Aboubacar Tuo,Adrian Popescu

Main category: cs.CL

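TL;DR: Proposes a scalable bias-auditing framework based on named entities that uses synthetic data to reproduce the bias patterns found in natural text. It supports the largest bias audit to date (1.9 billion data points), which finds systematic biases in LLMs; larger model scale amplifies bias, while instruction tuning reduces it.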

Details Motivation: Existing LLM bias evaluations struggle to combine ecological validity with statistical control, and lack methods that both reflect real-world use and offer scale and rigor. Method: Designs a bias-auditing framework that uses named entities as probes, verifies that synthetic data reliably reproduces the biases in natural text, and on that basis runs a large-scale analysis across entity types, tasks, languages, models, and prompting strategies. Result: The audit of 1.9 billion data points reveals systematic bias patterns: models penalize right-wing politicians and favor left-wing ones, favor Western and wealthy countries and companies, and disfavor the Global South as well as firms in the defense and pharmaceutical sectors; instruction tuning reduces bias, increasing model scale amplifies it, and prompting in Chinese or Russian does not weaken the Western preference. Conclusion: LLMs should undergo rigorous bias auditing before deployment in high-stakes applications; current models still show significant and complex systematic biases that warrant attention. Abstract: Existing approaches to bias evaluation in large language models (LLMs) trade ecological validity for statistical control, relying on artificial prompts that poorly reflect real-world use, or on naturalistic tasks that lack scale and rigor. We introduce a scalable bias-auditing framework using named entities as probes to measure structural disparities in model behavior. We show that synthetic data reliably reproduces bias patterns observed in natural text, enabling large-scale analysis. Using this approach, we conduct the largest bias audit to date, comprising 1.9 billion data points across multiple entity types, tasks, languages, models, and prompting strategies. Our results reveal systematic biases: models penalize right-wing politicians, favor left-wing politicians, prefer Western and wealthy nations over the Global South, favor Western companies, and penalize firms in the defense and pharmaceutical sectors. While instruction tuning reduces bias, increasing model scale amplifies it, and prompting in Chinese or Russian does not attenuate Western-aligned preferences. These results indicate that LLMs should undergo rigorous auditing before deployment in high-stakes applications.

[55] LR-DWM: Efficient Watermarking for Diffusion Language Models

Ofek Raban,Ethan Fetaya,Gal Chechik

Main category: cs.CL

TL;DR: Proposes Left-Right Diffusion Watermarking (LR-DWM), a method for efficiently watermarking diffusion language models (DLMs) with low compute and memory overhead and reliable detection.

Details Motivation: Existing LLM watermarking techniques are designed mainly for autoregressive models and are hard to apply directly to DLMs, which generate text non-sequentially; existing adaptations carry high compute or memory overhead. Method: LR-DWM biases the generation of the current token using already-generated context on both its left and right, embedding the watermark signal during iterative denoising in a way that suits the non-sequential generation of DLMs. Result: Under standard evaluation, LR-DWM achieves efficient watermark embedding and reliable statistical detection with very low runtime and memory overhead, staying close to the non-watermarked baseline DLM. Conclusion: Diffusion language models can be watermarked efficiently at minimal cost, and LR-DWM offers a practical, low-overhead watermarking solution for DLMs. Abstract: Watermarking (WM) is a critical mechanism for detecting and attributing AI-generated content. Current WM methods for Large Language Models (LLMs) are predominantly tailored for autoregressive (AR) models: They rely on tokens being generated sequentially, and embed stable signals within the generated sequence based on the previously sampled text. Diffusion Language Models (DLMs) generate text via non-sequential iterative denoising, which requires significant modification to use WM methods designed for AR models. Recent work proposed to watermark DLMs by inverting the process when needed, but it suffers from significant computational or memory overhead. We introduce Left-Right Diffusion Watermarking (LR-DWM), a scheme that biases the generated token based on both left and right neighbors, when they are available. LR-DWM incurs minimal runtime and memory overhead, remaining close to the non-watermarked baseline DLM while enabling reliable statistical detection under standard evaluation settings. Our results demonstrate that DLMs can be watermarked efficiently, achieving high detectability with negligible computational and memory overhead.

[56] NADIR: Differential Attention Flow for Non-Autoregressive Transliteration in Indic Languages

Lakshya Tomar,Vinayak Abrol,Puneet Agarwal

Main category: cs.CL

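TL;DR: This paper proposes NADIR, a new non-autoregressive model for multilingual transliteration that combines a Differential Transformer with a Mixture-of-Experts mechanism, achieving inference more than 13x faster than autoregressive models while keeping accuracy high and substantially reducing several error types.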

Details Motivation: Autoregressive models bring strong inductive biases to sequence-to-sequence tasks, but for tasks that depend on local relations they can incur excessive inference latency; non-autoregressive models are fast but prone to hallucination and poor length control, so a balance between speed and accuracy is needed. Method: Proposes NADIR, which combines a Differential Transformer with a Mixture-of-Experts (MoE) mechanism, removing the need to model sequential dependencies while strengthening the modeling of complex character mappings. Result: NADIR is more than 13x faster than the state-of-the-art autoregressive model, with a mean Character Error Rate of 15.78% (14.44% for AR, 21.88% for standard NAR), and substantially reduces repetition, substitution, omission, and insertion errors. Conclusion: NADIR greatly improves inference efficiency while keeping accuracy competitive, offering a practical blueprint for fast, reliable non-autoregressive systems suited to real-time and large-scale deployment. Abstract: In this work, we argue that not all sequence-to-sequence tasks require the strong inductive biases of autoregressive (AR) models. Tasks like multilingual transliteration, code refactoring, grammatical correction or text normalization often rely on local dependencies where the full modeling capacity of AR models can be overkill, creating a trade-off between their high accuracy and high inference latency. While non-autoregressive (NAR) models offer speed, they typically suffer from hallucinations and poor length control. To explore this trade-off, we focus on the multilingual transliteration task in Indic languages and introduce NADIR, a novel NAR architecture designed to strike a balance between speed and accuracy. NADIR integrates a Differential Transformer and a Mixture-of-Experts mechanism, enabling it to robustly model complex character mappings without sequential dependencies. NADIR achieves over a 13x speed-up compared to the state-of-the-art AR baseline. It maintains a competitive mean Character Error Rate of 15.78%, compared to 14.44% for the AR model and 21.88% for a standard NAR equivalent. Importantly, NADIR reduces Repetition errors by 49.53%, Substitution errors by 24.45%, Omission errors by 32.92%, and Insertion errors by 16.87%. This work provides a practical blueprint for building fast and reliable NAR systems, effectively bridging the gap between AR accuracy and the demands of real-time, large-scale deployment.
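
Character Error Rate, the accuracy metric quoted above, is usually defined as character-level edit distance divided by reference length. The sketch below computes a corpus-level CER; the paper reports a mean CER, which may instead average per-sample or per-language rates, so the aggregation here is an assumption, as are the function names.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete ref_char
                current[j - 1] + 1,                   # insert hyp_char
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]

def character_error_rate(references, hypotheses):
    """Total edits over total reference characters (corpus-level CER)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars

if __name__ == "__main__":
    print(character_error_rate(["namaste"], ["namasta"]))
```

For scripts with combining marks, the unit of comparison (code point or grapheme cluster) changes the score, so the unit should be stated alongside any reported CER.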

Mahammad Namazov,Tomáš Koref,Ivan Habernal

Main category: cs.CL

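TL;DR: This paper proposes a framework for comparing model-agnostic interpretability techniques, focusing on explaining LLM decisions in the legal domain. Using two rationale extraction methods, it evaluates sufficiency, comprehensiveness, and plausibility, and finds that model-generated rationales differ substantially from the views of legal experts.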

Details Motivation: When LLMs are applied in the legal domain, interpretability is essential for trust and transparency, yet it remains unclear which technique best explains legal outcome predictions. A systematic comparison framework is therefore needed to assess different interpretability methods. Method: Proposes a comparison framework for model-agnostic interpretability techniques, applies two rationale extraction methods, evaluates them quantitatively with normalized sufficiency and comprehensiveness metrics, asks legal experts to rate the plausibility of the extracted rationales, and examines the feasibility of LLM-as-a-Judge. Result: Although the models do well on quantitative metrics and classification performance, their "reasons" for predicting a violation differ substantially from those of legal experts. Conclusion: Current interpretability methods score well on quantitative metrics but fall short on plausibility in real legal settings, underscoring the importance of human expert evaluation. Abstract: Interpretability is critical for applications of large language models in the legal domain, which requires trust and transparency. While some studies develop task-specific approaches, others use the classification model's parameters to explain the decisions. However, which technique explains the legal outcome prediction best remains an open question. To address this challenge, we propose a comparative analysis framework for model-agnostic interpretability techniques. Among these, we employ two rationale extraction methods, which justify outcomes with human-interpretable and concise text fragments (i.e., rationales) from the given input text. We conduct the comparison by evaluating faithfulness, via normalized sufficiency and comprehensiveness metrics, along with plausibility, by asking legal experts to evaluate extracted rationales. We further assess the feasibility of LLM-as-a-Judge using legal expert evaluation results. We show that the model's "reasons" for predicting a violation differ substantially from those of legal experts, despite highly promising quantitative analysis results and reasonable downstream classification performance. The source code of our experiments is publicly available at https://github.com/trusthlt/IntEval.
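
Sufficiency and comprehensiveness are commonly defined by comparing the model's confidence on the full input with its confidence on the rationale alone and on the input with the rationale removed. The sketch below shows the unnormalized form of both; the abstract refers to normalized variants, whose exact normalization is not given here, so that step is omitted. The classifier `toy_violation_probability` is a hypothetical stand-in, not the paper's model.

```python
def toy_violation_probability(tokens):
    """Hypothetical classifier: probability that the text describes a violation."""
    cue_weights = {"detained": 0.4, "without": 0.2, "trial": 0.3}
    score = 0.05 + sum(cue_weights.get(token, 0.0) for token in set(tokens))
    return min(score, 1.0)

def comprehensiveness(predict, tokens, rationale_positions):
    """Confidence drop when the rationale is removed (higher = more faithful)."""
    remainder = [t for i, t in enumerate(tokens) if i not in rationale_positions]
    return predict(tokens) - predict(remainder)

def sufficiency(predict, tokens, rationale_positions):
    """Confidence drop when only the rationale is kept (lower = more faithful)."""
    rationale_only = [t for i, t in enumerate(tokens) if i in rationale_positions]
    return predict(tokens) - predict(rationale_only)

if __name__ == "__main__":
    text = "the applicant was detained without trial".split()
    rationale = {3, 5}  # positions of "detained" and "trial"
    print(comprehensiveness(toy_violation_probability, text, rationale))
    print(sufficiency(toy_violation_probability, text, rationale))
```

Both metrics measure faithfulness to the model only. A rationale can score well on both and still look implausible to a legal expert, which is the gap the abstract reports.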

[58] System-Mediated Attention Imbalances Make Vision-Language Models Say Yes

Tsan Tsai Chan,Varsha Suresh,Anisha Saha,Michael Hahn,Vera Demberg

Main category: cs.CL

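TL;DR: This paper proposes a system-mediated attention imbalance framework to explain the "yes-bias" hallucination in vision-language models (VLMs). It finds that redundant system weights reduce attention to image and text inputs, and that redistributing attention effectively suppresses the bias.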

Details Motivation: Existing methods mostly try to mitigate VLM hallucination by strengthening image attention and overlook the roles of the system modality and other modalities; this paper aims to understand and address the problem more holistically. Method: Proposes the system-mediated attention imbalance hypothesis, uses causal interventions to redistribute attention among the system, image, and text modalities, and analyzes the effect on the yes-bias. Result: Redistributing attention markedly suppresses the yes-bias and outperforms existing methods on several tasks; excessive system attention leads the model to rely on coarse input representations, which triggers hallucination. Conclusion: System attention is a key factor in VLM hallucination, and adjusting attention to the system modality is an effective lever for mitigation. Abstract: Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image and text. However, existing mitigation strategies tend towards an image-centric interpretation of these imbalances, often prioritising increased image attention while giving less consideration to the roles of the other modalities. In this study, we evaluate a more holistic, system-mediated account, which attributes these imbalances to functionally redundant system weights that reduce attention to image and textual inputs. We show that this framework offers a useful empirical perspective on the yes-bias, a common form of hallucination in which VLMs indiscriminately respond 'yes'. Causally redistributing attention from the system modality to image and textual inputs substantially suppresses this bias, often outperforming existing approaches. We further present evidence suggesting that system-mediated attention imbalances contribute to the yes-bias by encouraging a default reliance on coarse input representations, which are effective for some tasks but ill-suited to others. Taken together, these findings firmly establish system attention as a key factor in VLM hallucination and highlight its potential as a lever for mitigation.

[59] Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping

Miao Peng,Weizhou Shen,Nuo Chen,Chenliang Li,Ming Yan,Jia Li

Main category: cs.CL

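TL;DR: This paper proposes DeepReasonQA and LongPAS to address the "almost-there" phenomenon in long-context reasoning; by constructing high-difficulty multi-hop QA data and performing fine-grained credit assignment, the approach substantially outperforms reinforcement learning baselines while using fewer parameters.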

Details Motivation: Reinforcement Learning with Verifiable Rewards (RLVR) degrades on long-context reasoning and exhibits the "almost-there" phenomenon, in which the reasoning trajectory is largely correct but fails at the final step; reasoning density needs to rise and the learning signal from partially correct trajectories needs to be kept. Method: Proposes DeepReasonQA, a framework that synthesizes high-difficulty, multi-hop long-context QA pairs from knowledge graphs, and LongPAS, which performs fine-grained credit assignment along Validity and Relevance dimensions to preserve the training signal from partially correct trajectories. Result: Substantially outperforms RLVR baselines on three long-context reasoning benchmarks and matches frontier LLMs with fewer parameters; analysis confirms that the method strengthens long-context reasoning and stabilizes RL training. Conclusion: DeepReasonQA and LongPAS effectively mitigate the "almost-there" problem in long-context reasoning, improving multi-step reasoning through high-quality data construction and fine-grained reward shaping. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs' short-context reasoning, but its performance degrades in long-context scenarios that require both precise grounding and robust long-range reasoning. We identify the "almost-there" phenomenon in long-context reasoning, where trajectories are largely correct but fail at the final step, and attribute this failure to two factors: (1) the lack of high reasoning density in long-context QA data that push LLMs beyond mere grounding toward sophisticated multi-hop reasoning; and (2) the loss of valuable learning signals during long-context RL training due to the indiscriminate penalization of partially correct trajectories with incorrect outcomes. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. Building on this, we introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that performs fine-grained credit assignment by evaluating reasoning steps along Validity and Relevance dimensions, which captures critical learning signals from "almost-there" trajectories. Experiments on three long-context reasoning benchmarks show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters. Further analysis confirms the effectiveness of our methods in strengthening long-context reasoning while maintaining stable RL training.

[60] Knowing When to Abstain: Medical LLMs Under Clinical Uncertainty

Sravanthi Machcha,Sushrita Yerra,Sahil Gupta,Aishwarya Sahoo,Sharmin Sultana,Hong Yu,Zonghai Yao

Main category: cs.CL

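TL;DR: This paper introduces MedAbstain, a unified benchmark and protocol for evaluating the abstention ability of large language models (LLMs) in medical multiple-choice question answering, stressing the importance of expressing uncertainty and abstaining safely in high-stakes applications.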

Details Motivation: Current LLM evaluation focuses heavily on accuracy, but in real-world and especially safety-critical applications the ability to abstain when uncertain is equally important. Systematic methods for evaluating this ability are lacking, particularly in high-stakes domains such as medicine. Method: Proposes the MedAbstain benchmark, which combines conformal prediction, adversarial question perturbations, and explicit abstention options to systematically evaluate the abstention behavior of open- and closed-source LLMs on medical MCQA. Result: Even state-of-the-art, high-accuracy models often fail to abstain correctly when uncertain; providing an explicit abstention option markedly improves uncertainty awareness and safe abstention, more so than input perturbations, while scaling model size or using advanced prompting brings limited gains. Conclusion: Abstention mechanisms are essential for trustworthy LLM deployment, and explicit abstention options are an effective, practical way to improve model safety in high-stakes settings. Abstract: Current evaluation of large language models (LLMs) overwhelmingly prioritizes accuracy; however, in real-world and safety-critical applications, the ability to abstain when uncertain is equally vital for trustworthy deployment. We introduce MedAbstain, a unified benchmark and evaluation protocol for abstention in medical multiple-choice question answering (MCQA) -- a discrete-choice setting that generalizes to agentic action selection -- integrating conformal prediction, adversarial question perturbations, and explicit abstention options. Our systematic evaluation of both open- and closed-source LLMs reveals that even state-of-the-art, high-accuracy models often fail to abstain when uncertain. Notably, providing explicit abstention options consistently increases model uncertainty and safer abstention, far more than input perturbations, while scaling model size or advanced prompting brings little improvement. These findings highlight the central role of abstention mechanisms for trustworthy LLM deployment and offer practical guidance for improving safety in high-stakes applications.
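
One common way to turn conformal prediction into an abstention rule for multiple-choice questions is split conformal prediction: calibrate a threshold on held-out questions, form a prediction set for each new question, and answer only when the set contains exactly one option. The sketch below follows that recipe. It is an assumption about how such a rule could look, not a description of the MedAbstain protocol, and all names and numbers are illustrative.

```python
import math

def conformal_threshold(calibration_probs, calibration_labels, alpha):
    """Split-conformal quantile of the scores 1 - p(true option)."""
    scores = sorted(1.0 - probs[label]
                    for probs, label in zip(calibration_probs, calibration_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: every option is included
    return scores[rank - 1]

def prediction_set(option_probs, threshold):
    """Options whose nonconformity score is within the calibrated threshold."""
    return [option for option, p in option_probs.items() if 1.0 - p <= threshold]

def answer_or_abstain(option_probs, threshold):
    """Answer only when exactly one option survives; otherwise abstain."""
    options = prediction_set(option_probs, threshold)
    return options[0] if len(options) == 1 else "ABSTAIN"

if __name__ == "__main__":
    true_probs = [0.9, 0.8, 0.6, 0.5, 0.95, 0.45, 0.3]
    calibration = [{"A": p, "B": 1.0 - p} for p in true_probs]
    threshold = conformal_threshold(calibration, ["A"] * len(true_probs), alpha=0.25)
    print(answer_or_abstain({"A": 0.90, "B": 0.05, "C": 0.05}, threshold))
    print(answer_or_abstain({"A": 0.48, "B": 0.47, "C": 0.05}, threshold))
```

Under the usual exchangeability assumption, the prediction set contains the correct option with probability at least 1 - alpha, so a smaller alpha yields larger sets and more abstentions.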

[61] Capability-Aware Early-Stage Research Idea Evaluation

Renlong Jie,Chen Chu,Zhen Wang

Main category: cs.CL

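TL;DR: Proposes a framework for predicting research outcomes at an early stage from author information and research ideas, using a three-way transformer architecture that fuses a capability representation to significantly improve prediction of paper acceptance and ratings.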

Details Motivation: Predicting the outcome of a research idea at an early stage, before significant resources are committed, helps optimize resource allocation and research planning. Existing methods depend on finished manuscripts or peer reviews and are hard to apply early. Method: Proposes a capability-aware three-way transformer framework that combines author information, inferred capability, and the research idea through flexible fusion mechanisms, along with a two-stage architecture for learning the capability representation from author information and the idea. Result: Experiments show the method significantly outperforms single-way models when fine-tuning bert-base and bert-large, and that adding capability prediction significantly raises the final model's predictive accuracy. Conclusion: The method can be used for early-stage research outcome prediction and scientific resource allocation, and has practical potential. Abstract: Predicting the outcomes of research ideas at their conceptual stage (i.e. before significant resources are committed) holds great potential for optimizing scientific resource allocation and research planning. While existing methods rely heavily on finished manuscripts or peer reviews, we propose a novel capability-aware framework that predicts paper acceptance and ratings using only author information and research ideas, without requiring full text or experimental results. Our approach integrates author information, (inferred) capability presentation, and research ideas through a three-way transformer architecture with flexible fusion mechanisms. We also introduce a two-stage architecture for learning the capability representation given the author information and idea. Experiments show that our method significantly outperforms the single-way models by finetuning bert-base and bert-large, and the capability prediction significantly increases the predictive accuracy of the final model. The proposed method can be applied in both early-stage research outcome prediction and scientific resource allocation.

[62] DoPE: Decoy Oriented Perturbation Encapsulation Human-Readable, AI-Hostile Documents for Academic Integrity

Ashish Raj Shekhar,Shiven Agarwal,Priyanuj Bordoloi,Yash Shah,Tejas Anvekar,Vivek Gupta

Main category: cs.CL

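TL;DR: This paper presents DoPE, a document-layer defense framework against the threat posed by multimodal large language models (MLLMs). It embeds semantic decoys in PDF/HTML exam documents to detect and block AI cheating, provides model-agnostic prevention and detection, and releases Integrity-Bench, a new benchmark of 1,826 exam documents, to support reproducible research.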

Details Motivation: Now that MLLMs can parse exam documents directly, traditional assessment faces a threat to academic integrity, creating a need for new defenses that do not depend on conventional classifiers and can be deployed when the document is authored. Method: Proposes the DoPE framework, which exploits render-parse discrepancies to embed semantic decoys while the exam is being authored; FewSoRT-Q generates question-level decoys, FewSoRT-D encapsulates them into watermarked documents, and an LLM-as-Judge verifier performs detection. Result: On Integrity-Bench (1,826 PDF+HTML exams), against black-box MLLMs from OpenAI and Anthropic, DoPE reaches a 91.4% detection rate at an 8.7% false-positive rate, and prevents completion or induces decoy-aligned failure in 96.3% of attempts. Conclusion: DoPE provides model-agnostic, document-level protection of academic integrity with both high detection and strong prevention, and advances reproducible research on defenses against AI cheating. Abstract: Multimodal Large Language Models (MLLMs) can directly consume exam documents, threatening conventional assessments and academic integrity. We present DoPE (Decoy-Oriented Perturbation Encapsulation), a document-layer defense framework that embeds semantic decoys into PDF/HTML assessments to exploit render-parse discrepancies in MLLM pipelines. By instrumenting exams at authoring time, DoPE provides model-agnostic prevention (stop or confound automated solving) and detection (flag blind AI reliance) without relying on conventional one-shot classifiers. We formalize prevention and detection tasks, and introduce FewSoRT-Q, an LLM-guided pipeline that generates question-level semantic decoys and FewSoRT-D to encapsulate them into watermarked documents. We evaluate on Integrity-Bench, a novel benchmark of 1826 exams (PDF+HTML) derived from public QA datasets and OpenCourseWare. Against black-box MLLMs from OpenAI and Anthropic, DoPE yields strong empirical gains: a 91.4% detection rate at an 8.7% false-positive rate using an LLM-as-Judge verifier, and prevents successful completion or induces decoy-aligned failures in 96.3% of attempts. We release Integrity-Bench, our toolkit, and evaluation code to enable reproducible study of document-layer defenses for academic integrity.

[63] Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning

Ahmed Attia,Alham Fikri

Main category: cs.CL

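TL;DR: Proposes a self-supervised reinforcement learning method for fine-tuning low-resource machine translation, using round-trip translation with chrF++/BLEU reward functions to improve translation quality.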

Details Motivation: Parallel data for low-resource languages is limited, and existing work has not fully explored how to improve low-resource machine translation effectively. Method: Uses a self-supervised reinforcement learning setup in which NLLB models translate English into the target language and back into English, with chrF++ and BLEU on the reconstructed English sentence as the reward for fine-tuning. Result: On the NLLB-MD dataset, the 600M and 1.3B parameter NLLB models show consistent improvements for Central Aymara, Friulian, Wolof, and Russian, with more fluent and semantically faithful output. Conclusion: The method effectively improves translation quality for low-resource languages, and with larger models it can draw further on pretrained knowledge to self-improve. Abstract: Low-resource machine translation (MT) has gained increasing attention as parallel data from low-resource language communities is collected, but many potential methods for improving low-resource MT remain unexplored. We investigate a self-supervised reinforcement-learning-based fine-tuning approach for translation in low-resource settings using round-trip bootstrapping with the No Language Left Behind (NLLB) family of models. Our approach translates English into a target low-resource language and then back into English, using a combination of chrF++ and BLEU as the reward function on the reconstructed English sentences. Using the NLLB-MD dataset, we evaluate both the 600M and 1.3B parameter NLLB models and observe consistent improvements for the following languages: Central Aymara, Friulian, Wolof and Russian. Qualitative inspection of translation outputs indicates increased fluency and semantic fidelity. We argue that our method can further benefit from scale, enabling models to increasingly leverage their pretrained knowledge and continue self-improving.
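
The round-trip reward can be sketched as: translate English to the target language, translate back, and score the reconstruction against the original. The example below uses a simplified character n-gram F-score as the reward. It is not the real chrF++ (which also uses word n-grams) and it omits BLEU, and the two translator callables are hypothetical placeholders for a translation model.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_fscore(reference, hypothesis, max_n=3, beta=2.0):
    """Average F-beta over character n-gram orders 1..max_n (simplified chrF)."""
    scores = []
    for n in range(1, max_n + 1):
        ref, hyp = char_ngrams(reference, n), char_ngrams(hypothesis, n)
        if not ref or not hyp:
            continue  # string too short for this n-gram order
        overlap = sum((ref & hyp).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta ** 2) * precision * recall
                          / (beta ** 2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0

def round_trip_reward(source_en, to_target, to_english):
    """Reward = similarity between the source and its round-trip reconstruction."""
    reconstructed = to_english(to_target(source_en))
    return char_fscore(source_en, reconstructed)

if __name__ == "__main__":
    # Placeholder translators: reversing twice reconstructs the input exactly.
    print(round_trip_reward("the river is wide", lambda s: s[::-1], lambda s: s[::-1]))
```

Because the reward only sees the reconstructed English, a model could in principle score well by copying the source through unchanged, so round-trip rewards are usually paired with some check that the intermediate output is in the target language.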

[64] Benchmarking Concept-Spilling Across Languages in LLMs

Ilia Badanin,Daniil Dzenhaliou,Imanol Schlag

Main category: cs.CL

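TL;DR: This paper proposes a new framework for evaluating the semantic robustness of multilingual LLMs by measuring cross-lingual behavior on polysemous words; it reveals a "language spilling" phenomenon and systematically compares a range of multilingual LLMs.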

Details Motivation: When generating non-English content, multilingual LLMs often suffer semantic interference from other languages (especially English), a problem called "language spilling", and there is no systematic way to evaluate it. Method: Builds a structured generation task based on high-polysemy English words, requiring models to generate five meanings across nine languages, and measures semantic robustness by how early the model starts borrowing meanings from a dominant language. Result: Semantic robustness varies significantly across models and languages; stronger models keep to target-language meanings longer, while weaker models spill earlier; the work contributes a scalable comparative benchmark and a validation pipeline. Conclusion: The framework offers a quantifiable ranking method for multilingual LLMs that does not require causal attribution, and supports the development of more linguistically balanced AI systems. Abstract: Multilingual Large Language Models (LLMs) exhibit remarkable cross-lingual abilities, yet often exhibit a systematic bias toward the representations from other languages, resulting in semantic interference when generating content in non-English languages, a phenomenon we define as language spilling. This paper presents a novel comparative framework for evaluating multilingual semantic robustness by systematically measuring how models handle polysemous words across languages. Our methodology provides a relative measure of model performance: when required to generate exactly five meanings, both strong and weak models may resort to meanings from dominant languages, but semantically stronger models do so later in the generation sequence, producing more true meanings from the target language before failing, while weaker models resort to dominant-language meanings earlier in the sequence. We evaluate a diverse set of open and closed multilingual LLMs using a structured meaning generation task across nine languages, employing a carefully curated benchmark of 100 high-polysemy English words. Our findings reveal significant variation in semantic robustness across both models and languages, providing a principled ranking system for model comparison without requiring definitive causal attribution of error sources. We contribute both a scalable comparative benchmark for multilingual semantic evaluation and a rigorous validation pipeline, both critical tools for developing more linguistically balanced AI systems.

[65] Evaluating Contextually Mediated Factual Recall in Multilingual Large Language Models

Yihong Liu,Bingyu Xiong,Hinrich Schütze

Main category: cs.CL

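TL;DR: This study examines how well LLMs retrieve factual knowledge introduced indirectly through context in multilingual settings, finding that context markedly lowers factual recall accuracy, with large variation across relations.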

Details Motivation: Existing factual recall evaluations focus on retrieving facts in isolation, whereas in natural language use facts are often introduced indirectly through context, so model behavior under contextual mediation needs study. Method: Constructs controlled prompts that keep the fact fixed while introducing referential mediation through contextual sentences, and compares synthetic and real names to analyze several model families across languages. Result: Contextual mediation consistently degrades factual recall; larger models are more robust to it, while the effects of real names and name origin are unsystematic. Conclusion: Multilingual LLMs show a gap between isolated factual recall and context-dependent language understanding, and factual extraction in context needs further improvement. Abstract: Large language models (LLMs) can recall a wide range of factual knowledge across languages. However, existing factual recall evaluations primarily assess fact retrieval in isolation, where the queried entity is explicitly named and the fact is requested directly. In natural language use, facts are often accessed through context, where the relevant entity is introduced only indirectly. In this work, we study contextually mediated factual recall, asking whether LLMs can reliably retrieve factual knowledge when the target entity is embedded in a naturalistic context rather than queried explicitly, across languages. We construct controlled prompts that preserve the underlying fact while introducing referential mediation through contextual sentences. To disentangle contextual effects from name-specific associations, we further compare performance using synthetic names and real names across languages. Evaluating multiple model families in five languages, we find that contextual mediation consistently degrades factual recall, with substantial variation across relations. Larger models are more robust to contextual mediation, exhibiting a reduced performance gap relative to direct queries, while the effect of real names and name origin is mixed and unsystematic. These findings highlight a gap between isolated factual recall and context-dependent language understanding in multilingual LLMs.

[66] A Cloud-based Multi-Agentic Workflow for Science

Anurag Acharya,Timothy Vega,Rizwan A. Ashraf,Anshu Sharma,Derek Parker,Robert Rallo

Main category: cs.CL

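TL;DR: This paper presents a domain-agnostic, model-independent LLM agent framework that runs in the cloud and coordinates multiple agents with different capabilities to carry out scientific tasks ranging from literature review to complex simulations. A proof of concept in catalyst research shows high task completion and accuracy on synthetic and real chemistry tasks.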

Details Motivation: LLMs struggle with complex tasks such as simulation or decision-making, which limits their use in science. Agent systems that call external tools are promising, but designing a workflow that balances models, cloud services, and resources is very challenging, so a general and efficient agent framework is needed. Method: Designs an architecture in which a supervisor agent coordinates multiple functional agents, with all components running in the cloud. The framework supports literature retrieval, data analysis, and simulation runs, and is validated through cost analysis, benchmarks (synthetic and chemistry), and expert evaluation. Result: On synthetic tasks the system routes the task to the correct agent 90% of the time and completes it 97.5% of the time; on real-world tasks the success rate is 91%. Costs are broken down in detail, and accuracy is better than or comparable to most frontier models. Conclusion: The framework is a viable, reusable approach across scientific domains that balances models, tools, and cloud resources and improves the practicality and automation of LLMs in complex research tasks. Abstract: As Large Language Models (LLMs) become ubiquitous across various scientific domains, their lack of ability to perform complex tasks like running simulations or to make complex decisions limits their utility. LLM-based agents bridge this gap due to their ability to call external resources and tools and thus are now rapidly gaining popularity. However, coming up with a workflow that can balance the models, cloud providers, and external resources is very challenging, making implementing an agentic system more of a hindrance than a help. In this work, we present a domain-agnostic, model-independent workflow for an agentic framework that can act as a scientific assistant while being run entirely on cloud. Built with a supervisor agent marshaling an array of agents with individual capabilities, our framework brings together straightforward tasks like literature review and data analysis with more complex ones like simulation runs. We describe the framework here in full, including a proof-of-concept system we built to accelerate the study of Catalysts, which is highly important in the field of Chemistry and Material Science. We report the cost to operate and use this framework, including the breakdown of the cost by service used. We also evaluate our system on a custom-curated synthetic benchmark and a popular Chemistry benchmark, and also perform expert validation of the system. The results show that our system is able to route the task to the correct agent 90% of the time and successfully complete the assigned task 97.5% of the time for the synthetic tasks and 91% of the time for real-world tasks, while still achieving better or comparable accuracy to most frontier models, showing that this is a viable framework for other scientific domains to replicate.

[67] Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems

Elham Tajik,Conrad Borchers,Bahar Shahrokhian,Sebastian Simon,Ali Keramati,Sonika Pal,Sreecharan Sankaranarayanan

Main category: cs.CL

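TL;DR: This study proposes using the reasoning traces generated by large language model (LLM) agents as a new form of process data, detecting disagreement between agents with cosine similarity and treating that disagreement as a meaningful analytic signal, to improve the interpretive power and methodological rigor of qualitative coding in educational research.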

Details Motivation: With the rise of generative AI, automated and human-AI analysis workflows have emerged, but methodological standards are lacking. The study explores how LLM agents' reasoning traces can strengthen interpretive practice in qualitative coding for learning analytics. Method: Applies cosine similarity to LLM-generated reasoning traces in a multi-agent system to quantify agreement and disagreement between agents coding segments of human tutoring dialog, combining quantitative similarity metrics with qualitative review. Result: Across nearly 10,000 agent-pair instances, the semantic similarity of agents' reasoning reliably separates consensus from disagreement and correlates with human coding reliability; qualitative analysis reveals instructional sub-functions within codes and room to refine codebook concepts. Conclusion: Disagreement in reasoning traces is a valuable new analytic signal; the method can make establishing inter-rater reliability during coding faster and more interpretively rich, advancing methodology in educational research. Abstract: Learning analytics researchers often analyze qualitative student data such as coded annotations or interview transcripts to understand learning processes. With the rise of generative AI, fully automated and human-AI workflows have emerged as promising methods for analysis. However, methodological standards to guide such workflows remain limited. In this study, we propose that reasoning traces generated by large language model (LLM) agents, especially within multi-agent systems, constitute a novel and rich form of process data to enhance interpretive practices in qualitative coding. We apply cosine similarity to LLM reasoning traces to systematically detect, quantify, and interpret disagreements among agents, reframing disagreement as a meaningful analytic signal. Analyzing nearly 10,000 instances of agent pairs coding human tutoring dialog segments, we show that LLM agents' semantic reasoning similarity robustly differentiates consensus from disagreement and correlates with human coding reliability. Qualitative analysis guided by this metric reveals nuanced instructional sub-functions within codes and opportunities for conceptual codebook refinement. By integrating quantitative similarity metrics with qualitative review, our method has the potential to improve and accelerate establishing inter-rater reliability during coding by surfacing interpretive ambiguity, especially when LLMs collaborate with humans. We discuss how reasoning-trace disagreements represent a valuable new class of analytic signals advancing methodological rigor and interpretive depth in educational research.
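
Pairwise cosine similarity over reasoning traces can be sketched as follows. The study presumably compares semantic embeddings of the traces; to stay self-contained, this example substitutes simple word-count vectors, which is a much cruder representation. The agent names, traces, and the 0.5 threshold are all made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def bag_of_words(text):
    """Word-count vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def flag_disagreements(traces, threshold=0.5):
    """Return (agent, agent, similarity) for pairs whose traces diverge."""
    vectors = {agent: bag_of_words(text) for agent, text in traces.items()}
    flagged = []
    for first, second in combinations(traces, 2):
        similarity = cosine_similarity(vectors[first], vectors[second])
        if similarity < threshold:
            flagged.append((first, second, similarity))
    return flagged

if __name__ == "__main__":
    traces = {
        "agent_1": "the tutor gives a hint about the next step",
        "agent_2": "the tutor gives a hint about the next step",
        "agent_3": "student expresses frustration and asks to stop",
    }
    for first, second, similarity in flag_disagreements(traces):
        print(first, second, round(similarity, 2))
```

Flagged pairs are the segments a human coder would review first, which matches the paper's framing of disagreement as a signal worth inspecting instead of noise to average away.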

[68] BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models

Kriti Bhattarai,Vipina K. Keloth,Donald Wright,Andrew Loza,Yang Ren,Hua Xu

Main category: cs.CL

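TL;DR: BioPulse-QA is a new biomedical question-answering benchmark that evaluates LLMs on newly published documents, covering drug labels, clinical trials, and guidelines, with emphasis on timeliness, robustness, and fairness.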

Details Motivation: Existing biomedical benchmarks rely on static, outdated data, carry a high risk of data leakage, and overlook robustness to linguistic variation and demographic bias, so a dynamic evaluation framework closer to real clinical settings is needed. Method: Builds BioPulse-QA with 2,280 expert-verified QA pairs and perturbed variants, covering extractive and abstractive questions drawn from newly published drug labels, trial protocols, and clinical guidelines, and evaluates several LLMs on accuracy, robustness to linguistic variation, and bias. Result: GPT-o1 performs best on drug labels (relaxed F1 of 0.92), followed by Gemini-2.0-Flash (0.90); clinical trials are the hardest source, with extractive F1 as low as 0.36; models are more sensitive to paraphrasing than to typographical errors, and bias testing shows negligible differences. Conclusion: BioPulse-QA provides a scalable, clinically relevant framework that evaluates biomedical LLMs more realistically and supports their reliable use in high-stakes medical settings. Abstract: Objective: Large language models (LLMs) are increasingly applied in biomedical settings, and existing benchmark datasets have played an important role in supporting model development and evaluation. However, these benchmarks often have limitations. Many rely on static or outdated datasets that fail to capture the dynamic, context-rich, and high-stakes nature of biomedical knowledge. They also carry increasing risk of data leakage due to overlap with model pretraining corpora and often overlook critical dimensions such as robustness to linguistic variation and potential demographic biases. Materials and Methods: To address these gaps, we introduce BioPulse-QA, a benchmark that evaluates LLMs on answering questions from newly published biomedical documents including drug labels, trial protocols, and clinical guidelines. BioPulse-QA includes 2,280 expert-verified question answering (QA) pairs and perturbed variants, covering both extractive and abstractive formats. We evaluate four LLMs - GPT-4o, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct - released prior to the publication dates of the benchmark documents. Results: GPT-o1 achieves the highest relaxed F1 score (0.92), followed by Gemini-2.0-Flash (0.90) on drug labels. Clinical trials are the most challenging source, with extractive F1 scores as low as 0.36. Discussion and Conclusion: Performance differences are larger for paraphrasing than for typographical errors, while bias testing shows negligible differences. BioPulse-QA provides a scalable and clinically relevant framework for evaluating biomedical LLMs.

[69] Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Daniel Vennemeyer,Punya Syon Pandey,Phan Anh Duong,Michael Umeokoli,Samuel Ratnam

Main category: cs.CL

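TL;DR: The fine-tuning objective has little effect on safety at small training scales, but as training scale grows, objective choice becomes a primary driver of adversarial robustness and latent persona stability; ORPO and KL regularization in particular mitigate the risks.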

Details Motivation: Fine-tuning LLMs on benign data can degrade alignment and adversarial robustness, but how the fine-tuning objective shapes safety is not well understood, so a systematic study of different objectives is needed. Method: With data, domain, architecture, and optimization held fixed, runs a controlled comparison of six fine-tuning objectives (SFT, DPO, CFT, Inoculation Prompting, ORPO, and KL regularization) on closed-form reasoning and open-ended generation tasks. Result: At small training budgets, robustness is similar across objectives while capability differs; at large budgets, supervised and preference-based objectives couple capability gains with adversarial vulnerability and persona drift, whereas objectives that constrain the learning signal, such as ORPO and KL regularization, substantially mitigate both. Conclusion: Objective choice matters little for safety at small scale but becomes the key factor for adversarial robustness and persona stability at large scale; choosing the objective well can decouple capability gains from safety risk. Abstract: Fine-tuning LLMs on benign data can still degrade alignment and adversarial robustness, yet direct analysis of the role of fine-tuning objectives in shaping these safety outcomes remains limited. We present a controlled comparison of six fine-tuning objectives -- Supervised Fine-Tuning, Direct Preference Optimization, Conditional Fine-Tuning, Inoculation Prompting, Odds Ratio Preference Optimization, and KL-regularized fine-tuning -- holding data, domain, architecture, and optimization fixed. Across closed-form reasoning and open-ended generation tasks, we find that objective choice induces systematic, scale-dependent shifts along the safety-capability frontier. At small training budgets, robustness is similar across objectives but capability differs. At larger budgets, objectives diverge sharply: supervised and preference-based tuning tightly couple capability gains to increased adversarial vulnerability and persona drift, while objectives that constrain learning signals -- especially ORPO and KL-regularization -- substantially mitigate both. Fine-tuning objectives therefore matter little for safety at small scales but become a primary driver of adversarial robustness and latent persona stability as training scale increases.

[70] Intelligent Documentation in Medical Education: Can AI Replace Manual Case Logging?

Nafiz Imtiaz Khan,Kylie Cleland,Vladimir Filkov,Roger Eric Goldman

Main category: cs.CL

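TL;DR: This study examines using large language models (LLMs) to automatically extract structured procedural information from free-text radiology reports in order to automate procedural case logging in radiology training. Multiple local and commercial LLMs are evaluated on 414 interventional radiology reports under instruction-based and chain-of-thought prompting; both prompting styles yield strong extraction performance (F1 up to 0.87), with trade-offs between speed and cost. The results indicate that LLMs can substantially reduce residents' clerical burden and improve logging consistency, supporting the feasibility of AI-assisted documentation in medical education.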

Details Motivation: Writing procedural case logs by hand is time-consuming and prone to inconsistency; an automated solution is needed to reduce trainees' clerical burden in radiology training and make logs more standardized. Method: 414 interventional radiology reports written by nine residents between 2018 and 2024 are collected, and multiple local and commercial LLMs are evaluated on extracting structured procedural information under instruction-based and chain-of-thought prompting, using sensitivity, specificity, F1-score, inference latency, and token efficiency. Result: Both local and commercial LLMs extract structured information well, with best F1-scores approaching 0.87, but trade off inference speed against operating cost; some models strike a good balance between accuracy and efficiency. Conclusion: LLMs can effectively support automated case log generation in radiology training, substantially reducing trainees' clerical work and improving consistency; the study supports the feasibility of AI-assisted documentation in medical education, though generalization must still be validated across more institutions and clinical workflows. Abstract: Procedural case logs are a core requirement in radiology training, yet they are time-consuming to complete and prone to inconsistency when authored manually. This study investigates whether large language models (LLMs) can automate procedural case log documentation directly from free-text radiology reports. We evaluate multiple local and commercial LLMs under instruction-based and chain-of-thought prompting to extract structured procedural information from 414 curated interventional radiology reports authored by nine residents between 2018 and 2024. Model performance is assessed using sensitivity, specificity, and F1-score, alongside inference latency and token efficiency to estimate operational cost. Results show that both local and commercial models achieve strong extraction performance, with best F1-scores approaching 0.87, while exhibiting different trade-offs between speed and cost. Automation using LLMs has the potential to substantially reduce clerical burden for trainees and improve consistency in case logging. These findings demonstrate the feasibility of AI-assisted documentation in medical education and highlight the need for further validation across institutions and clinical workflows.

[71] Augmenting Question Answering with A Hybrid RAG Approach

Tianyi Yang,Nashrah Haque,Vaishnave Jonnalagadda,Yuya Jeremy Ong,Zhehui Chen,Yanzhao Wu,Lei Yu,Divyesh Jadav,Wenqi Wei

Main category: cs.CL

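TL;DR: This paper proposes Structured-Semantic RAG (SSRAG), a hybrid architecture that combines query augmentation, agentic routing, and a structured retrieval mechanism fusing vector and graph techniques to improve the accuracy and informativeness of question-answering systems.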

Details Motivation: Existing retrieval-augmented generation methods struggle to retrieve contextually relevant information, which leads to incomplete or suboptimal answers; a more effective mechanism is needed to improve retrieval relevance and answer quality. Method: SSRAG integrates query augmentation and agentic routing with a hybrid retrieval scheme that combines vector and graph structures and unifies the retrieved context, making better use of semantic and structured information. Result: On three datasets (TruthfulQA, SQuAD, and WikiQA) with five large language models, SSRAG outperforms standard RAG in answer accuracy and informativeness. Conclusion: By refining retrieval and strengthening contextual grounding, SSRAG substantially improves QA performance and has broad application potential. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful technique for enhancing the quality of responses in Question-Answering (QA) tasks. However, existing approaches often struggle with retrieving contextually relevant information, leading to incomplete or suboptimal answers. In this paper, we introduce Structured-Semantic RAG (SSRAG), a hybrid architecture that enhances QA quality by integrating query augmentation, agentic routing, and a structured retrieval mechanism combining vector- and graph-based techniques with context unification. By refining retrieval processes and improving contextual grounding, our approach improves both answer accuracy and informativeness. We conduct extensive evaluations on three popular QA datasets, TruthfulQA, SQuAD and WikiQA, across five Large Language Models (LLMs), demonstrating that our proposed approach consistently improves response quality over standard RAG implementations.

[72] UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages

Tassallah Abdullahi,Macton Mgonzo,Mardiyyah Oduwole,Paul Okewunmi,Abraham Owodunni,Ritambhara Singh,Carsten Eickhoff

Main category: cs.CL

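TL;DR: This paper introduces UbuntuGuard, the first African policy-based safety benchmark, which targets the cultural misalignment and cross-lingual safety failures of existing guardian models in low-resource African languages. From adversarial queries written by 155 domain experts, the authors derive context-specific safety policies and reference responses and evaluate 13 models; the results show that current multilingual safety is overestimated and that dynamic models still struggle to fully localize African-language contexts.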

Details Motivation: Existing guardian models are Western-centric and rely on predefined safety categories, so they cannot adapt to diverse linguistic and sociocultural contexts; low-resource African languages in particular suffer from safety gaps and cultural mismatch. Flexible, runtime-enforceable safety benchmarks that reflect local norms are therefore needed. Method: Build UbuntuGuard, the first African policy-based safety benchmark, from adversarial queries designed by 155 experts in sensitive domains; derive context-specific safety policies and reference responses from them; and evaluate six general-purpose LLMs and seven guardian models (static, dynamic, and multilingual variants) on multilingual safety alignment. Result: Existing English-centric benchmarks overestimate real-world multilingual safety; cross-lingual transfer provides only partial coverage; dynamic models make better use of policies at inference time but still struggle to fully localize African-language contexts. Conclusion: Multilingual, culturally grounded safety benchmarks are needed to build reliable and equitable guardian models for low-resource languages, and UbuntuGuard offers an initial framework and evaluation tool toward that goal. Abstract: Current guardian models are predominantly Western-centric and optimized for high-resource languages, leaving low-resource African languages vulnerable to evolving harms, cross-lingual safety failures, and cultural misalignment. Moreover, most guardian models rely on rigid, predefined safety categories that fail to generalize across diverse linguistic and sociocultural contexts. Robust safety, therefore, requires flexible, runtime-enforceable policies and benchmarks that reflect local norms, harm scenarios, and cultural expectations. We introduce UbuntuGuard, the first African policy-based safety benchmark built from adversarial queries authored by 155 domain experts across sensitive fields, including healthcare. From these expert-crafted queries, we derive context-specific safety policies and reference responses that capture culturally grounded risk signals, enabling policy-aligned evaluation of guardian models. We evaluate 13 models, comprising six general-purpose LLMs and seven guardian models across three distinct variants: static, dynamic, and multilingual. Our findings reveal that existing English-centric benchmarks overestimate real-world multilingual safety, cross-lingual transfer provides partial but insufficient coverage, and dynamic models, while better equipped to leverage policies at inference time, still struggle to fully localize African-language contexts. These findings highlight the urgent need for multilingual, culturally grounded safety benchmarks to enable the development of reliable and equitable guardian models for low-resource languages.
Our code is available at https://github.com/hemhemoh/UbuntuGuard.

[73] A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

Qiuyi Qu,Yicheng Sui,Yufei Sun,Rui Chen,Xiaofei Zhang,Yuzhi Zhang,Haofeng Wang,Ge Lan,Ning Zhang

Main category: cs.CL

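TL;DR: This paper proposes a template-based, agent-driven method for optimizing CUDA kernels: kernels are semantically refactored into parameterizable templates and then tuned with search-based autotuning, achieving speedups of more than 3x on real-world kernels in the best case.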

Details Motivation: GPU code optimization is a key bottleneck in HPC and in large-model training and inference; existing LLM-agent methods mostly rewrite code directly, with implicit parameter control or a need for human intervention, which makes performance gains unstable. Method: Introduce a template-based rewriting layer: kernels are first semantically refactored into explicitly parameterizable templates, and search-based autotuning is then applied inside an agent-driven iterative loop, using profiling feedback to optimize parameters under hardware resource constraints. Result: On real CUDA kernels from SGLang, the method achieves speedups above 3x in the best case and, compared with agent-only direct rewriting, significantly reduces the randomness of the optimization process. Conclusion: The template-plus-search design makes optimization more stable and interpretable and can be extended to OpenCL, HIP, and other backends, offering systematic automated performance optimization for real production workloads. Abstract: GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM-agent-based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. Experiments on a set of real-world kernels demonstrate speedups exceeding 3x in the best case. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute constrained parameter search under hardware resource limits. Compared to agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, making the process more interpretable and enabling a more systematic approach toward high-performance configurations.
The proposed method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for real production workloads.

[74] A Shared Geometry of Difficulty in Multilingual Language Models

Stefano Civelli,Pietro Bernardelle,Nicolò Brunello,Gianluca Demartini

Main category: cs.CL

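TL;DR: Studies the geometry of multilingual problem difficulty in large language models and finds that difficulty signals form separately in early and late internal representations, which behave in functionally different ways.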

Details Motivation: Explore how large language models represent and process problem difficulty in multilingual settings, and understand how the internal representation changes across layers. Method: Train linear probes on the model's internal representations using the AMC subset of the Easy2Hard benchmark translated into 21 languages. Result: Probes on shallow representations generalize better across languages, whereas probes on deep representations are more accurate within the same language but generalize poorly across languages. Conclusion: LLMs first form a language-agnostic representation of problem difficulty that later becomes language-specific, mirroring the shift from abstract concepts to language-specific output. Abstract: Predicting problem-difficulty in large language models (LLMs) refers to estimating how difficult a task is according to the model itself, typically by training linear probes on its internal representations. In this work, we study the multilingual geometry of problem-difficulty in LLMs by training linear probes using the AMC subset of the Easy2Hard benchmark, translated into 21 languages. We found that difficulty-related signals emerge at two distinct stages of the model internals, corresponding to shallow (early-layers) and deep (later-layers) internal representations, that exhibit functionally different behaviors. Probes trained on deep representations achieve high accuracy when evaluated on the same language but exhibit poor cross-lingual generalization. In contrast, probes trained on shallow representations generalize substantially better across languages, despite achieving lower within-language performance. Together, these results suggest that LLMs first form a language-agnostic representation of problem difficulty, which subsequently becomes language-specific. This closely aligns with existing findings in LLM interpretability showing that models tend to operate in an abstract conceptual space before producing language-specific outputs. We demonstrate that this two-stage representational process extends beyond semantic content to high-level meta-cognitive properties such as problem-difficulty estimation.

[75] Towards Robust Process Reward Modeling via Noise-aware Learning

Bin Xie,Bingbing Xu,Xueyun Tian,Yilin Chen,Huawei Shen

Main category: cs.CL

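TL;DR: Proposes a two-stage denoising framework that uses reflection-aware label correction and noise-aware iterative training to substantially improve the accuracy of judging reasoning-step correctness.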

Details Motivation: Process rewards produced by Monte Carlo Estimation (MCE) depend on the policy model, which introduces label noise (rewarding incorrect steps or penalizing correct ones) and hurts process reward model (PRM) performance. Method: In the first stage, an LLM judge detects reflection and self-correction behaviors related to the current reasoning step and corrects the labels; in the second stage, a noise-aware iterative training framework lets the PRM progressively refine noisy labels based on its own confidence. Result: Across experiments, the method improves average F1 by up to 27% over conventional PRMs, substantially strengthening discrimination of step correctness. Conclusion: The two-stage framework mitigates the policy dependence and label noise introduced by MCE and improves the robustness and accuracy of process reward modeling. Abstract: Process Reward Models (PRMs) have achieved strong results in complex reasoning, but are bottlenecked by costly process-level supervision. A widely used alternative, Monte Carlo Estimation (MCE), defines process rewards as the probability that a policy model reaches the correct final answer from a given reasoning step. However, step correctness is an intrinsic property of the reasoning trajectory, and should be invariant to policy choice. Our empirical findings show that MCE produces policy-dependent rewards that induce label noise, including false positives that reward incorrect steps and false negatives that penalize correct ones. To address these challenges, we propose a two-stage framework to mitigate noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge to detect reflection and self-correction behaviors related to the current reasoning step, thereby suppressing overestimated rewards. In the training stage, we further propose a Noise-Aware Iterative Training framework that enables the PRM to progressively refine noisy labels based on its own confidence. Extensive experiments show that our method substantially improves step-level correctness discrimination, achieving up to a 27% absolute gain in average F1 over PRMs trained with noisy supervision.
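
The Monte Carlo Estimation described in this abstract can be illustrated with a minimal sketch: the reward of a reasoning step is estimated as the fraction of sampled continuations that reach the correct final answer. Everything named here (`mc_step_reward`, `make_toy_policy`, the toy answers) is a hypothetical stand-in and not taken from the paper; a real policy would be an LLM sampler.

```python
import random
from typing import Callable, Sequence

Rollout = Callable[[Sequence[str], random.Random], str]


def mc_step_reward(
    prefix: Sequence[str],
    rollout: Rollout,
    gold_answer: str,
    num_rollouts: int = 200,
    seed: int = 0,
) -> float:
    """Fraction of sampled continuations of `prefix` that end in `gold_answer`."""
    rng = random.Random(seed)
    hits = sum(rollout(prefix, rng) == gold_answer for _ in range(num_rollouts))
    return hits / num_rollouts


def make_toy_policy(p_correct: float) -> Rollout:
    """Hypothetical policy stand-in: answers correctly with probability p_correct."""
    def rollout(prefix: Sequence[str], rng: random.Random) -> str:
        return "42" if rng.random() < p_correct else "0"
    return rollout


prefix = ["Let x = 6 * 7."]  # the same reasoning step is scored under both policies
reward_strong = mc_step_reward(prefix, make_toy_policy(0.9), gold_answer="42")
reward_weak = mc_step_reward(prefix, make_toy_policy(0.2), gold_answer="42")
print(reward_strong, reward_weak)
```

The two estimates differ even though the step is identical, which is the policy dependence the abstract identifies as a source of label noise.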

[76] VISPA: Pluralistic Alignment via Automatic Value Selection and Activation

Shenyan Zheng,Jiayou Zhong,Anudeex Shetty,Heng Ji,Preslav Nakov,Usman Naseem

Main category: cs.CL

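TL;DR: This paper proposes VISPA, a training-free pluralistic alignment framework that steers LLM outputs toward different values through dynamic value selection and control of internal model activations, showing good adaptability and performance in healthcare and other domains.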

Details Motivation: As LLMs are widely used in high-stakes domains, their outputs should reflect a plurality of values rather than a single average human preference, but existing methods fall short on value control and representation. Method: VISPA is a training-free framework that directly controls value expression through dynamic selection and internal activation steering. Result: In extensive experiments across multiple models and evaluation settings, VISPA performs well in all pluralistic alignment modes in healthcare and other domains, and adapts across steering initiations, models, and values. Conclusion: Pluralistic alignment can be achieved through internal activation mechanisms, offering a new path toward scalable models that serve all users. Abstract: As large language models are increasingly used in high-stakes domains, it is essential that their outputs reflect not an average human preference but rather a range of varying perspectives. Achieving such pluralism, however, remains challenging. Existing approaches consider limited values or rely on prompt-level interventions, lacking value control and representation. To address this, we introduce VISPA, a training-free pluralistic alignment framework that enables direct control over value expression through dynamic selection and internal model activation steering. Across extensive empirical studies spanning multiple models and evaluation settings, we show VISPA is performant across all pluralistic alignment modes in healthcare and beyond. Further analysis reveals VISPA is adaptable to different steering initiations, models, and values. These results suggest that pluralistic alignment can be achieved through internal activation mechanisms, offering a scalable path toward language models that serve all.

[77] Who Does This Name Remind You of? Nationality Prediction via Large Language Model Associative Memory

Keito Inoshita

Main category: cs.CL

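TL;DR: Proposes the LAMA framework, which uses an LLM's world knowledge as associative memory: it recalls famous people with the same name and aggregates their nationalities to predict nationality. With a dual-agent architecture, it reaches 0.817 accuracy on a 99-country nationality prediction task, substantially outperforming conventional methods.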

Details Motivation: Conventional LLM prompting is limited in applying abstract linguistic rules, while nationality and region prediction requires understanding of cultural and historical background, so the world knowledge inside LLMs needs to be elicited more effectively. Method: LAMA (LLM Associative Memory Agents) uses a dual-agent architecture (a Person Agent and a Media Agent) to recall famous individuals in parallel and aggregates their nationalities through indirect reasoning, producing Top-1 predictions by voting and Top-K predictions by conditional completion. Result: On a 99-country nationality prediction task, LAMA reaches 0.817 accuracy, substantially outperforming conventional LLM prompting and neural models; recall-based prediction is robust for low-frequency nationalities, and the two agents act synergistically. Conclusion: By retrieving and aggregating LLM knowledge rather than prompting for direct reasoning, LAMA demonstrates the value of associative memory for knowledge elicitation and shows that LLMs recall concrete instances more reliably than they reason abstractly. Abstract: Large language models (LLMs) possess extensive world knowledge, yet methods for effectively eliciting this knowledge remain underexplored. Nationality and region prediction tasks require understanding of not only linguistic features but also cultural and historical background, making LLM world knowledge particularly valuable. However, conventional LLM prompting methods rely on direct reasoning approaches, which have limitations in applying abstract linguistic rules. We propose LLM Associative Memory Agents (LAMA), a novel framework that leverages LLM world knowledge as associative memory. Rather than directly inferring nationality from names, LAMA recalls famous individuals with the same name and aggregates their nationalities through indirect reasoning. A dual-agent architecture comprising a Person Agent and a Media Agent, specialized in different knowledge domains, recalls famous individuals in parallel, generating Top-1 predictions through voting and Top-K predictions through conditional completion. On a 99-country nationality prediction task, LAMA achieved 0.817 accuracy, substantially outperforming conventional LLM prompting methods and neural models. Our experiments reveal that LLMs exhibit higher reliability in recalling concrete examples than in abstract reasoning, that recall-based approaches are robust to low-frequency nationalities independent of data frequency distributions, and that the dual-agent architecture functions complementarily to produce synergistic effects.
These results demonstrate the effectiveness of a new multi-agent system that retrieves and aggregates LLM knowledge rather than prompting reasoning.
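
The Top-1 voting step mentioned in this abstract might look roughly like the sketch below, which pools the nationalities of the people recalled by the two agents and takes the majority. This is only an illustration under assumptions: the function name `vote_nationality`, the equal weighting of both agents, and the example recalls are hypothetical, and the Top-K conditional completion step is not modeled.

```python
from collections import Counter
from typing import List, Optional, Tuple


def vote_nationality(
    person_recalls: List[str], media_recalls: List[str]
) -> Tuple[Optional[str], Counter]:
    """Pool nationalities recalled by both agents and return the majority vote."""
    votes = Counter(person_recalls) + Counter(media_recalls)
    if not votes:
        return None, votes  # neither agent recalled anyone
    top1, _count = votes.most_common(1)[0]
    return top1, votes


# Hypothetical recalls for one name; each entry is the nationality of one recalled person.
person_agent = ["Japan", "Japan", "Brazil"]
media_agent = ["Japan", "Peru"]
prediction, tally = vote_nationality(person_agent, media_agent)
print(prediction, dict(tally))
```

Pooling the two agents' recalls before counting is one simple design choice; weighting agents differently or breaking ties explicitly would be natural variations.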

[78] Do Clinical Question Answering Systems Really Need Specialised Medical Fine Tuning?

Sushant Kumar Ray,Gautam Siddharth Kashyap,Sahil Tripathi,Nipun Joshi,Vijay Govindarajan,Rafiq Ali,Jiechao Gao,Usman Naseem

Main category: cs.CL

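TL;DR: This paper proposes MEDASSESS-X, a framework that improves clinical question-answering systems through inference-time alignment rather than fine-tuning, challenging the misconception that specialization is inherently superior.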

Details Motivation: Challenge the assumption that clinical QA systems depend on domain-specific fine-tuning, and propose a lightweight solution that requires no retraining. Method: Lightweight steering vectors adjust model activations at inference time to produce medically consistent reasoning without updating model weights. Result: Across a range of LLMs, the method improves accuracy (+6%) and factual consistency (+7%) and reduces the safety error rate by up to 50%. Conclusion: Inference-time alignment can effectively replace supervised fine-tuning for both general-purpose and specialised medical LLMs, resolving the specialisation fallacy. Abstract: Clinical Question-Answering (CQA) industry systems increasingly rely on Large Language Models (LLMs), yet their deployment is often guided by the assumption that domain-specific fine-tuning is essential. Although specialised medical LLMs such as BioBERT, BioGPT, and PubMedBERT remain popular, they face practical limitations including narrow coverage, high retraining costs, and limited adaptability. Efforts based on Supervised Fine-Tuning (SFT) have attempted to address these limitations but continue to reinforce what we term the SPECIALISATION FALLACY: the belief that specialised medical LLMs are inherently superior for CQA. To address this assumption, we introduce MEDASSESS-X, a deployment-industry-oriented CQA framework that applies alignment at inference time rather than through SFT. MEDASSESS-X uses lightweight steering vectors to guide model activations toward medically consistent reasoning without updating model weights or requiring domain-specific retraining. This inference-time alignment layer stabilises CQA performance across both general-purpose and specialised medical LLMs, thereby resolving the SPECIALISATION FALLACY. Empirically, MEDASSESS-X delivers consistent gains across all LLM families, improving Accuracy by up to +6%, Factual Consistency by +7%, and reducing Safety Error Rate by as much as 50%.

Zhaolu Kang,Junhao Gong,Qingxi Chen,Hao Zhang,Jiaxin Liu,Rong Fu,Zhiyuan Feng,Yuan Wang,Simon Fong,Kaiyue Zhou

Main category: cs.CL

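TL;DR: This paper proposes JurisMMA, a new framework for legal judgment prediction, and builds JurisMM, a multimodal dataset of more than 100,000 Chinese judicial cases, validating the framework on legal judgment prediction and other legal applications.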

Details Motivation: Traditional legal judgment prediction methods struggle with multiple allegations, diverse evidence, and adaptability, so more effective frameworks and data are needed. Method: Propose the JurisMMA framework, which decomposes trial tasks and standardizes the process; build the large-scale multimodal dataset JurisMM; and run experiments on JurisMM and LawBench. Result: JurisMMA performs well on legal judgment prediction and is broadly applicable, advancing legal AI. Conclusion: Combined with a large-scale dataset, JurisMMA improves legal judgment prediction and offers new directions for future legal methods and datasets. Abstract: Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but struggle with multiple allegations and diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.

[80] Rapport du Projet de Recherche TRAIMA

Julie Rançon,Jean-François Cerisier,Emilie Remond,Aurélien Nguyen,Andrew Peterson,Ladjel Bellatreche

Main category: cs.CL

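TL;DR: The TRAIMA project explores using machine learning to automatically analyze multimodal (verbal, paraverbal, and non-verbal) interactions in educational settings, focusing on explanatory and collaborative sequences in French as a Foreign Language and French as a First Language classrooms. It proposes a tripartite model of explanatory discourse and systematically assesses existing multimodal transcription conventions, laying methodological groundwork for future AI-supported research on educational interaction.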

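Details Motivation: In educational and interactional research, manual analysis of verbal, paraverbal, and non-verbal data is time-consuming and hard to scale; TRAIMA explores how machine learning could help categorize and annotate such multimodal interactions. Method: Drawing on discourse analysis and interactional linguistics, define explanatory discourse as a tripartite sequence (opening, explanatory core, closure); systematically review and compare multimodal transcription conventions such as ICOR and Mondada; carry out manual annotation and empirical analysis on the INTER-EXPLIC and EXPLIC-LEXIC corpora; and collect synchronized multi-source data (video, audio, eye-tracking, digital interaction traces) on the TechnéLAB platform. Result: The project identifies transcription conventions, annotation categories, and analytical units suitable for machine learning; shows the inherently interpretative and theory-dependent nature of transcription practice; confirms the role of teacher gestures, prosody, and other multimodal resources in meaning construction and learner comprehension; and establishes a multimodal research infrastructure that supports the development of automated tools. Conclusion: Rather than delivering a mature automated system, TRAIMA establishes a rigorous methodological framework that stresses theoretical explicitness and researcher reflexivity, laying the groundwork for interdisciplinary work across didactics, discourse analysis, multimodal research, and AI in education. Abstract: The TRAIMA project (TRaitement Automatique des Interactions Multimodales en Apprentissage), conducted between March 2019 and June 2020, investigates the potential of automatic processing of multimodal interactions in educational settings. The project addresses a central methodological challenge in educational and interactional research: the analysis of verbal, paraverbal, and non-verbal data is currently carried out manually, making it extremely time-consuming and difficult to scale. TRAIMA explores how machine learning approaches could contribute to the categorisation and classification of such interactions. The project focuses specifically on explanatory and collaborative sequences occurring in classroom interactions, particularly in French as a Foreign Language (FLE) and French as a First Language (FLM) contexts. These sequences are analysed as inherently multimodal phenomena, combining spoken language with prosody, gestures, posture, gaze, and spatial positioning. A key theoretical contribution of the project is the precise linguistic and interactional definition of explanatory discourse as a tripartite sequence (opening, explanatory core, closure), drawing on discourse analysis and interactional linguistics. A substantial part of the research is devoted to the methodological foundations of transcription, which constitute a critical bottleneck for any form of automation.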
The report provides a detailed state of the art of existing transcription conventions (ICOR, Mondada, GARS, VALIBEL, Ferré), highlighting their respective strengths and limitations when applied to multimodal classroom data. Through comparative analyses of manually transcribed sequences, the project demonstrates the inevitable variability and interpretative dimension of transcription practices, depending on theoretical positioning and analytical goals. Empirical work is based on several corpora, notably the INTER-EXPLIC corpus (approximately 30 hours of classroom interaction) and the EXPLIC-LEXIC corpus, which serve both as testing grounds for manual annotation and as reference datasets for future automation. Particular attention is paid to teacher gestures (kinesic and proxemic resources), prosodic features, and their functional role in meaning construction and learner comprehension. The project also highlights the strategic role of the TechnéLAB platform, which provides advanced multimodal data capture (multi-camera video, synchronized audio, eye-tracking, digital interaction traces) and constitutes both a research infrastructure and a test environment for the development of automated tools. In conclusion, TRAIMA does not aim to deliver a fully operational automated system, but rather to establish a rigorous methodological framework for the automatic processing of multimodal pedagogical interactions. The project identifies transcription conventions, annotation categories, and analytical units that are compatible with machine learning approaches, while emphasizing the need for theoretical explicitness and researcher reflexivity. TRAIMA thus lays the groundwork for future interdisciplinary research at the intersection of didactics, discourse analysis, multimodality, and artificial intelligence in education.

[81] Race, Ethnicity and Their Implication on Bias in Large Language Models

Shiyue Hu,Ruizhe Li,Yanjun Gao

Main category: cs.CL

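TL;DR: This paper studies how race and ethnicity information is represented and used inside large language models (LLMs). Applying a reproducible interpretability pipeline to several open-source models, it finds that demographic information is widely distributed across internal units and varies across models; suppressing the relevant neurons reduces bias but leaves substantial residual effects, suggesting behavioral rather than representational change and motivating more systematic mitigation.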

Details Motivation: Existing studies mainly document outcome-level disparities of LLMs in high-stakes domains such as healthcare and offer little mechanistic understanding of how demographic attributes such as race and ethnicity are processed internally. Method: A reproducible interpretability pipeline combining probing, neuron-level attribution, and targeted intervention is applied to three open-source LLMs on two public datasets covering toxicity-related generation and clinical narrative understanding. Result: Race and ethnicity information is represented in a distributed way across internal units, with large cross-model variation; some units encode sensitive or stereotype-related associations, and identical demographic cues can trigger different behaviors; suppressing these units reduces bias but leaves substantial residual effects. Conclusion: Bias mitigation in LLMs may reflect behavioral adjustment more than fundamental representational change; current interventions are insufficient, and more systematic approaches to fairness are needed. Abstract: Large language models (LLMs) increasingly operate in high-stakes settings including healthcare and medicine, where demographic attributes such as race and ethnicity may be explicitly stated or implicitly inferred from text. However, existing studies primarily document outcome-level disparities, offering limited insight into internal mechanisms underlying these effects. We present a mechanistic study of how race and ethnicity are represented and operationalized within LLMs. Using two publicly available datasets spanning toxicity-related generation and clinical narrative understanding tasks, we analyze three open-source models with a reproducible interpretability pipeline combining probing, neuron-level attribution, and targeted intervention. We find that demographic information is distributed across internal units with substantial cross-model variation. Although some units encode sensitive or stereotype-related associations from pretraining, identical demographic cues can induce qualitatively different behaviors. Interventions suppressing such neurons reduce bias but leave substantial residual effects, suggesting behavioral rather than representational change and motivating more systematic mitigation.

[82] From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation

Jiahao Wang,Weiyu Xie,Mingxing Zhang,Boxing Zhang,Jianwei Dong,Yuening Zhu,Chen Lin,Jinqi Tang,Yaochen Han,Zhiyuan Ai,Xianglin Chen,Yongwei Wu,Congfeng Jiang

Main category: cs.CL

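TL;DR: FusionRAG is a new inference framework for retrieval-augmented generation that embeds information from related text chunks during preprocessing and selectively recomputes the KV cache during reprocessing, balancing generation quality against inference efficiency.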

Details Motivation: Existing KV cache reuse methods lack cross-chunk contextual information, which degrades generation quality and makes it hard to achieve both efficiency and quality in RAG. Method: FusionRAG embeds information from related text chunks into each chunk in an offline preprocessing stage, and recomputes the KV cache for the tokens the model focuses on in an online reprocessing stage. Result: At the same recomputation ratio, FusionRAG significantly improves generation quality; recomputing fewer than 15% of tokens yields up to 70% higher normalized F1 than baselines and reduces TTFT by 2.66x-9.39x compared with full attention. Conclusion: FusionRAG addresses the missing-context problem of KV cache reuse, maintaining high generation quality while substantially improving inference efficiency, and offers a new approach to efficient RAG. Abstract: Retrieval-Augmented Generation enhances Large Language Models by integrating external knowledge, which reduces hallucinations but increases prompt length. This increase leads to higher computational costs and longer Time to First Token (TTFT). To mitigate this issue, existing solutions aim to reuse the preprocessed KV cache of each retrieved chunk to accelerate RAG. However, the lack of cross-chunk contextual information leads to a significant drop in generation quality, leaving the potential benefits of KV cache reuse largely unfulfilled. The challenge lies in how to reuse the precomputed KV cache of chunks while preserving generation quality. We propose FusionRAG, a novel inference framework that optimizes both the preprocessing and reprocessing stages of RAG. In the offline preprocessing stage, we embed information from other related text chunks into each chunk, while in the online reprocessing stage, we recompute the KV cache for tokens that the model focuses on. As a result, we achieve a better trade-off between generation quality and efficiency. According to our experiments, FusionRAG significantly improves generation quality at the same recomputation ratio compared to previous state-of-the-art solutions. By recomputing fewer than 15% of the tokens, FusionRAG achieves up to 70% higher normalized F1 scores than baselines and reduces TTFT by 2.66x-9.39x compared to Full Attention.

[83] Gated Differentiable Working Memory for Long-Context Language Modeling

Lingrui Mei,Shenghua Liu,Yiwei Wang,Yuyao Ge,Baolong Bi,Jiayu Yao,Jun Wan,Ziling Yin,Jiafeng Guo,Xueqi Cheng

Main category: cs.CL

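TL;DR: This paper proposes Gdwm, a framework that introduces a write controller and a Contextual Utility measure to adaptively optimize working-memory updates at test time for long-context processing, substantially improving computational efficiency and performance.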

Details Motivation: In long contexts, transformer attention dilutes, key information is lost, and models struggle to adapt to new patterns; existing test-time adaptation methods use uniform write policies that waste computation and suffer from high gradient variance. Method: Reframe test-time adaptation as budget-constrained memory consolidation and propose Gdwm, whose gated write controller allocates gradient steps dynamically according to an information-theoretic Contextual Utility while maintaining global coverage. Result: On ZeroSCROLLS and LongBench v2, Gdwm matches or exceeds uniform baselines with one quarter of the gradient steps, establishing a new efficiency-performance Pareto frontier. Conclusion: Through selective, utility-driven memory consolidation, Gdwm eases the long-context modeling bottleneck and offers a new paradigm for efficient test-time adaptation. Abstract: Long contexts challenge transformers: attention scores dilute across thousands of tokens, critical information is often lost in the middle, and models struggle to adapt to novel patterns at inference time. Recent work on test-time adaptation addresses this by maintaining a form of working memory -- transient parameters updated on the current context -- but existing approaches rely on uniform write policies that waste computation on low-utility regions and suffer from high gradient variance across semantically heterogeneous contexts. In this work, we reframe test-time adaptation as a budget-constrained memory consolidation problem, focusing on which parts of the context should be consolidated into working memory under limited computation. We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. The controller estimates Contextual Utility, an information-theoretic measure of long-range contextual dependence, and allocates gradient steps accordingly while maintaining global coverage. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4x fewer gradient steps than uniform baselines, establishing a new efficiency-performance Pareto frontier for test-time adaptation.
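
One way to picture "allocating gradient steps by utility while maintaining global coverage" is the sketch below: every context chunk receives a minimum number of steps, and the remaining budget is split in proportion to a per-chunk utility score. This is an assumption-laden illustration, not the paper's controller: `allocate_steps`, the `floor` parameter, the proportional rule, and the example utilities are all hypothetical, and utilities are assumed to be non-negative.

```python
from typing import List


def allocate_steps(utilities: List[float], budget: int, floor: int = 1) -> List[int]:
    """Split `budget` gradient steps across chunks: `floor` each, rest by utility."""
    n = len(utilities)
    if n == 0:
        return []
    if budget < floor * n:
        raise ValueError("budget too small to cover every chunk")
    remaining = budget - floor * n
    total = sum(utilities)
    if total > 0:
        shares = [remaining * u / total for u in utilities]
    else:
        shares = [remaining / n] * n  # no signal: fall back to a uniform split
    steps = [floor + int(s) for s in shares]
    # Hand out steps lost to rounding, largest fractional share first.
    leftover = budget - sum(steps)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        steps[i] += 1
    return steps


chunk_utilities = [1, 6, 3]  # hypothetical utility scores for three context chunks
steps = allocate_steps(chunk_utilities, budget=13)
print(steps)
```

The floor is what keeps low-utility chunks from being skipped entirely; setting it to zero would recover a purely utility-proportional policy.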

[84] SciCoQA: Quality Assurance for Scientific Paper--Code Alignment

Tim Baumgärtner,Iryna Gurevych

Main category: cs.CL

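TL;DR: SciCoQA is a dataset for detecting discrepancies between scientific publications and their codebases, intended to ensure faithful implementations; it contains 611 real and synthetic discrepancy instances, and evaluation shows that current LLMs perform poorly on the task.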

Details Motivation: Ensure the reproducibility of scientific research by detecting inconsistencies between papers and their code. Method: Build real data from GitHub issues and reproducibility papers, propose a synthetic data generation method to scale the dataset, and define and analyze discrepancy types. Result: The dataset contains 611 discrepancy instances (81 real, 530 synthetic) spanning multiple disciplines; among 21 LLMs evaluated, the best model, GPT-5, detects only 45.7% of real-world discrepancies. Conclusion: SciCoQA exposes the limits of current LLMs in identifying paper-code inconsistencies, especially with omitted paper details, long contexts, and data outside the training corpus, highlighting the difficulty of the task and directions for improvement. Abstract: We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases to ensure faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and to scale our dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the paper-code discrepancies in detail and propose discrepancy types and categories to better understand the occurring mismatches. In total, our dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic), spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others. Our evaluation of 21 LLMs highlights the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best performing model in our evaluation, GPT-5, can only detect 45.7% of real-world paper-code discrepancies.

[85] Injecting Knowledge from Social Science Journals to Improve Indonesian Cultural Understanding by LLMs

Adimulya Kartiyasa,Bao Gia Cao,Boyang Li

Main category: cs.CL

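TL;DR: This paper proposes a new method that extracts cultural knowledge from Indonesian social science journals and injects it into large language models through retrieval-augmented generation (RAG), substantially improving performance on the IndoCulture benchmark.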

Details Motivation: Existing LLMs have limited understanding of Indonesian culture, and there has been no systematic attempt to use open academic sources written from a native perspective to strengthen cultural knowledge. Method: Build IndoSoSci, a dataset of article passages from 151 Indonesian social science journals, and inject cultural knowledge via RAG, using LLM-generated hypothetical documents as retrieval queries. Result: The method clearly outperforms several strong baselines on the IndoCulture benchmark, and combining IndoSoSci with Indonesian Wikipedia sets a new state-of-the-art accuracy. Conclusion: Using local academic literature and injecting cultural knowledge through RAG is an effective way to improve LLM understanding of local cultures. Abstract: Recently there have been intensifying efforts to improve the understanding of Indonesian cultures by large language models (LLMs). An attractive source of cultural knowledge that has been largely overlooked is local journals of social science, which likely contain substantial cultural studies from a native perspective. We present a novel text dataset of journal article passages, created from 151 open-source Indonesian social science journals, called IndoSoSci. We demonstrate an effective recipe for injecting Indonesian cultural knowledge therein into LLMs: extracting the facts related to Indonesian culture, and applying retrieval-augmented generation (RAG) with LLM-generated hypothetical documents as queries during retrieval. The proposed recipe yields strong performance gains over several strong baselines on the IndoCulture benchmark. Additionally, by combining IndoSoSci with Indonesian Wikipedia, we set a new state-of-the-art accuracy on the IndoCulture benchmark.

[86] A Component-Based Survey of Interactions between Large Language Models and Multi-Armed Bandits

Miao Xie,Siguang Chen,Chunli Lv

Main category: cs.CL

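TL;DR: This is the first survey to systematically review the bidirectional, component-level interaction between large language models (LLMs) and multi-armed bandits (MABs), examining how each strengthens the other: MABs help optimize every stage of the LLM pipeline, and LLMs improve the design of core MAB components.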

Details Motivation: To explore the potential at the intersection of LLMs and multi-armed bandits, and to fill the gap left by the absence of a systematic summary of their bidirectional interaction. Method: Existing LLM-enhanced bandit systems and bandit-enhanced LLM systems are analyzed at the component level, covering their design, methodologies, and performance, and a literature index repository is maintained. Result: Representative findings and key challenges are identified for both kinds of system, showing where MABs are applied in LLMs (e.g., training, RAG, personalization) and how LLMs improve MABs in arm definition, environment modeling, and related components. Conclusion: Combining LLMs and MABs is promising; the two directions of enhancement open new research directions and optimization tools for each field, and deeper forms of synergy remain to be explored. Abstract: Large language models (LLMs) have become powerful and widely used systems for language understanding and generation, while multi-armed bandit (MAB) algorithms provide a principled framework for adaptive decision-making under uncertainty. This survey explores the potential at the intersection of these two fields. To our knowledge, it is the first survey to systematically review the bidirectional interaction between large language models and multi-armed bandits at the component level. We highlight the bidirectional benefits: MAB algorithms address critical LLM challenges, spanning from pre-training to retrieval-augmented generation (RAG) and personalization. Conversely, LLMs enhance MAB systems by redefining core components such as arm definition and environment modeling, thereby improving decision-making in sequential tasks. We analyze existing LLM-enhanced bandit systems and bandit-enhanced LLM systems, providing insights into their design, methodologies, and performance. Key challenges and representative findings are identified to help guide future research. An accompanying GitHub repository that indexes relevant literature is available at https://github.com/bucky1119/Awesome-LLM-Bandit-Interaction.
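
For readers unfamiliar with the bandit side of [86], the sketch below shows UCB1, a standard MAB algorithm, choosing among a few arms. Treating each arm as, say, a candidate prompt or retriever configuration for an LLM is an assumption made here for illustration; the survey abstract does not prescribe this algorithm or these reward values.

```python
import math
import random


def ucb1(reward_fns, rounds, seed=0):
    """Run UCB1 and return how often each arm was pulled.

    reward_fns: one callable per arm, taking a random.Random and
    returning a reward in [0, 1].
    """
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)
    sums = [0.0] * len(reward_fns)
    for t in range(1, rounds + 1):
        if t <= len(reward_fns):
            arm = t - 1  # initialization: pull every arm once
        else:
            # Pick the arm with the highest mean reward plus exploration bonus.
            arm = max(
                range(len(reward_fns)),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = reward_fns[arm](rng)
        counts[arm] += 1
        sums[arm] += reward
    return counts


def bernoulli(p):
    """Arm that pays 1 with probability p (success rates are made up)."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


# Three hypothetical "prompt variants" with different success rates.
pulls = ucb1([bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)], rounds=2000)
```

The exploration bonus shrinks as an arm is pulled more often, so pulls concentrate on the arm with the best observed mean while the others are still sampled occasionally.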

[87] Trustworthy Data-driven Chronological Age Estimation from Panoramic Dental Images

Ainhoa Vivel-Couso,Nicolás Vila-Blanco,María J. Carreira,Alberto Bugarín-Diz,Inmaculada Tomás,Jose M. Alonso-Moral

Main category: cs.CL

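TL;DR: The paper proposes a dental age estimation system that combines an opaque and a transparent method and uses a natural language generation module to produce clinician-friendly textual explanations; expert validation and an ALTAI trustworthiness assessment indicate high trust and strong performance.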

Details Motivation: The opacity of deep learning models creates trust issues for personalized care in healthcare, so model explainability and transparency need to improve. Method: An opaque and a transparent method are combined within a natural language generation (NLG) module that produces textual explanations of dental age estimates; the module is rule-based and designed together with dental experts, the explanations are manually validated by experts through a questionnaire, and trustworthiness is self-assessed with the ALTAI checklist. Result: Experts gave an average rating of 4.77 ± 0.12 (out of 5) across five dimensions, and the ALTAI trustworthiness assessment scored 4.40 ± 0.27 (out of 5). Conclusion: The system improves the explainability and trustworthiness of dental age estimation models and shows good potential for clinical use. Abstract: Integrating deep learning into healthcare enables personalized care but raises trust issues due to model opacity. To improve transparency, we propose a system for dental age estimation from panoramic images that combines an opaque and a transparent method within a natural language generation (NLG) module. This module produces clinician-friendly textual explanations about the age estimations, designed with dental experts through a rule-based approach. Following the best practices in the field, the quality of the generated explanations was manually validated by dental experts using a questionnaire. The results showed strong performance: the experts gave an average rating of 4.77 ± 0.12 (out of 5) across the five dimensions considered. We also performed a trustworthiness self-assessment following the ALTAI checklist, in which the system scored 4.40 ± 0.27 (out of 5) across seven dimensions of the AI Trustworthiness Assessment List.

[88] Pardon? Evaluating Conversational Repair in Large Audio-Language Models

Shuanghong Huang,Jinlei Xu,Youchao Zhou,Yanghao Zhou,Xuan Zhao,Chong Feng,Wenxuan Zhang

Main category: cs.CL

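TL;DR: This paper proposes EAR, a new evaluation framework that assesses large audio-language models on both answerable and unanswerable spoken inputs, stressing that models should recognize unanswerable questions and initiate conversational repair rather than only pursue accuracy.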

Details Motivation: Existing evaluations focus on answer accuracy and overlook the fact that, in real interaction, missing information can make an input unanswerable; model repair behavior goes unevaluated. Method: A repair-aware evaluation setting is introduced that defines the answerability of an input and builds paired evaluation conditions with a semantic-acoustic masking protocol; the proposed EAR score jointly evaluates task competence on answerable inputs and repair behavior on unanswerable ones. Result: Experiments on two spoken QA benchmarks show that, although many models do well on answerable inputs, most fail to recognize semantic unanswerability and do not trigger appropriate conversational repair. Conclusion: Accuracy-centric evaluation is limited; reliability assessments are needed that treat unanswerable inputs as signals for repair and continued interaction. Abstract: Large Audio-Language Models (LALMs) have demonstrated strong performance in spoken question answering (QA), with existing evaluations primarily focusing on answer accuracy and robustness to acoustic perturbations. However, such evaluations implicitly assume that spoken inputs remain semantically answerable, an assumption that often fails in real-world interaction when essential information is missing. In this work, we introduce a repair-aware evaluation setting that explicitly distinguishes between answerable and unanswerable audio inputs. We define answerability as a property of the input itself and construct paired evaluation conditions using a semantic-acoustic masking protocol. Based on this setting, we propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions. Experiments on two spoken QA benchmarks across diverse LALMs reveal a consistent gap between answer accuracy and conversational reliability: while many models perform well when inputs are answerable, most fail to recognize semantic unanswerability and initiate appropriate conversational repair. These findings expose a limitation of prevailing accuracy-centric evaluation practices and motivate reliability assessments that treat unanswerable inputs as cues for repair and continued interaction.
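
The abstract of [88] calls EAR "non-compensatory" but does not give its formula. The sketch below only illustrates what that property means, using `min` as one possible non-compensatory aggregator; the choice of `min`, the function names, and the toy outcome lists are all assumptions and should not be read as the paper's definition.

```python
def rate(flags):
    """Fraction of True entries in a list of per-item outcomes."""
    return sum(flags) / len(flags) if flags else 0.0


def compensatory_mean(answerable_correct, unanswerable_repaired):
    """Arithmetic mean: strong accuracy can offset weak repair behavior."""
    return (rate(answerable_correct) + rate(unanswerable_repaired)) / 2


def non_compensatory_score(answerable_correct, unanswerable_repaired):
    """Assumed aggregator: the weaker of the two rates bounds the score."""
    return min(rate(answerable_correct), rate(unanswerable_repaired))


# Toy model: answers 9 of 10 answerable items correctly,
# but asks for clarification on only 1 of 10 unanswerable items.
correct = [True] * 9 + [False]
repaired = [True] + [False] * 9

mean_score = compensatory_mean(correct, repaired)
strict_score = non_compensatory_score(correct, repaired)
```

Under the mean, this toy model looks moderately good; under the non-compensatory aggregator, its poor repair behavior dominates, which matches the gap between accuracy and reliability that the abstract describes.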

[89] Bridging the Knowledge-Action Gap by Evaluating LLMs in Dynamic Dental Clinical Scenarios

Hongyang Ma,Tiantian Gu,Huaiyuan Sun,Huilin Zhu,Yongxin Wang,Jie Li,Wubin Sun,Zeliang Lian,Yinghong Zhou,Yi Gao,Shirui Wang,Zhihui Tang

Main category: cs.CL

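TL;DR: This study presents SCMPE, a new benchmark that evaluates dental LLMs from static knowledge retrieval through to reliable dynamic clinical behavior. It finds that existing models perform poorly in dynamic dialogues, that the main bottlenecks are active information gathering and state tracking, and that retrieval-augmented generation (RAG) helps little in dynamic tasks, so domain-adaptive pre-training is needed for safe autonomous clinical practice.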

Details Motivation: As LLMs move toward acting as autonomous clinical agents, evaluation based on static accuracy is no longer sufficient, especially in dentistry, where patients take part in decisions; a new evaluation scheme is needed that covers the full range from knowledge mastery to dynamic clinical behavior. Method: The Standardized Clinical Management & Performance Evaluation (SCMPE) benchmark is proposed, covering knowledge-oriented static tasks and workflow-based multi-turn simulated patient interactions; model performance across tasks and the effect of RAG are assessed by analyzing the relationship between "Guideline Adherence" and "Decision Quality". Result: Models perform well on static tasks but decline markedly in dynamic clinical dialogues; the main bottlenecks are insufficient active information gathering and dynamic state tracking; RAG reduces hallucinations in static tasks but has limited effect in dynamic workflows and can even degrade performance. Conclusion: The main challenge for LLMs in dental clinical use lies in dynamic reasoning and behavioral reliability rather than knowledge storage; external knowledge retrieval alone cannot close the reasoning gap and must be combined with domain-adaptive pre-training to reach safe, autonomous clinical practice. Abstract: The transition of Large Language Models (LLMs) from passive knowledge retrievers to autonomous clinical agents demands a shift in evaluation, from static accuracy to dynamic behavioral reliability. To explore this boundary in dentistry, a domain where high-quality AI advice uniquely empowers patient-participatory decision-making, we present the Standardized Clinical Management & Performance Evaluation (SCMPE) benchmark, which comprehensively assesses performance from knowledge-oriented evaluations (static objective tasks) to workflow-based simulations (multi-turn simulated patient interactions). Our analysis reveals that while models demonstrate high proficiency in static objective tasks, their performance declines sharply in dynamic clinical dialogues, indicating that the primary bottleneck lies not in knowledge retention, but in the critical challenges of active information gathering and dynamic state tracking. Mapping "Guideline Adherence" versus "Decision Quality" reveals a prevalent "High Efficacy, Low Safety" risk in general models. Furthermore, we quantify the impact of Retrieval-Augmented Generation (RAG). While RAG mitigates hallucinations in static tasks, its efficacy in dynamic workflows is limited and heterogeneous, sometimes causing degradation. This underscores that external knowledge alone cannot bridge the reasoning gap without domain-adaptive pre-training. This study empirically charts the capability boundaries of dental LLMs, providing a roadmap for bridging the gap between standardized knowledge and safe, autonomous clinical practice.

[90] The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check

Qingyu Lu,Liang Ding,Kanjian Zhang,Jinxia Zhang,Dacheng Tao

Main category: cs.CL

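TL;DR: This paper examines how effective diffusion-based LLMs (dLLMs) are for real-time agentic interaction. In embodied-agent and tool-calling settings, dLLMs fail systematically despite their efficiency advantage. The proposed DiffuAgent framework shows that dLLMs work well as non-causal cognitive components but need stronger causal reasoning to support agentic tasks.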

Details Motivation: To test whether dLLMs can deliver effective agentic behavior while remaining efficient, focusing on long-horizon planning and precise format generation. Method: dLLMs such as LLaDA and Dream are evaluated on the Agentboard and BFCL benchmarks under two paradigms, Embodied Agents and Tool-Calling Agents, and the DiffuAgent multi-agent framework is proposed to analyze how well they fit different roles. Result: Current dLLMs cannot branch effectively under temporal feedback (embodied settings) and struggle to maintain symbolic precision under diffusion noise (tool-calling settings), but they perform well on non-causal tasks such as memory summarization and tool selection. Conclusion: dLLMs are not yet suitable as reliable agent backbones; causal, precise, and logically grounded reasoning mechanisms need to be built into the denoising process before they can handle agentic tasks. Abstract: The pursuit of real-time agentic interaction has driven interest in Diffusion-based Large Language Models (dLLMs) as alternatives to auto-regressive backbones, promising to break the sequential latency bottleneck. However, do such efficiency gains translate into effective agentic behavior? In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long-horizon planning) and Tool-Calling Agents (requiring precise formatting). Contrary to the efficiency hype, our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones, frequently leading to systematic failure. (1) In Embodied settings, dLLMs fall into repeated attempts, failing to branch under temporal feedback. (2) In Tool-Calling settings, dLLMs fail to maintain symbolic precision (e.g., strict JSON schemas) under diffusion noise. To assess the potential of dLLMs in agentic workflows, we introduce DiffuAgent, a multi-agent evaluation framework that integrates dLLMs as plug-and-play cognitive cores. Our analysis shows that dLLMs are effective in non-causal roles (e.g., memory summarization and tool selection) but require the incorporation of causal, precise, and logically grounded reasoning mechanisms into the denoising process to be viable for agentic tasks.

[91] ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation

Jesus-German Ortiz-Barajas,Jonathan Tonglet,Vivek Gupta,Iryna Gurevych

Main category: cs.CL

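TL;DR: The paper proposes ChartAttack, a framework for evaluating the risk that multimodal LLMs are misused to generate misleading charts, and releases AttackViz, a QA dataset annotated with misleading strategies.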

Details Motivation: As MLLMs are used to automate chart generation, they can be misused to produce misleading charts, so their safety and robustness need systematic evaluation. Method: ChartAttack generates misleading charts by injecting misleading elements (misleaders); AttackViz, a chart question-answering dataset labeled with misleading strategies and the incorrect answers they induce, is built and used for in-domain and cross-domain experiments. Result: ChartAttack lowers the QA accuracy of MLLM readers by an average of 19.6 points (in-domain) and 14.9 points (cross-domain); a human study shows an average drop of 20.2 points. Conclusion: MLLM-driven chart generation systems carry a serious risk of misleading readers, and robustness and security must be addressed in their design, evaluation, and deployment. Abstract: Multimodal large language models (MLLMs) are increasingly used to automate chart generation from data tables, enabling efficient data analysis and reporting but also introducing new misuse risks. In this work, we introduce ChartAttack, a novel framework for evaluating how MLLMs can be misused to generate misleading charts at scale. ChartAttack injects misleaders into chart designs, aiming to induce incorrect interpretations of the underlying data. Furthermore, we create AttackViz, a chart question-answering (QA) dataset where each (chart specification, QA) pair is labeled with effective misleaders and their induced incorrect answers. Experiments in in-domain and cross-domain settings show that ChartAttack significantly degrades the QA performance of MLLM readers, reducing accuracy by an average of 19.6 points and 14.9 points, respectively. A human study further shows an average 20.2 point drop in accuracy for participants exposed to misleading charts generated by ChartAttack. Our findings highlight an urgent need for robustness and security considerations in the design, evaluation, and deployment of MLLM-based chart generation systems. We make our code and data publicly available.

[92] Graph Reasoning Paradigm: Structured and Symbolic Reasoning with Topology-Aware Reinforcement Learning for Large Language Models

Runxuan Liu,Xianhao Ou,Xinyan Ma,Jiyuan Wang,Jiafeng Liang,Jiaqi Li,Tao He,Zheng Chu,Rongchuan Mu,Zekun Wang,Baoxin Wang,Dayong Wu,Ming Liu,Shijin Wang,Guoping Hu,Bing Qin

Main category: cs.CL

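TL;DR: The paper proposes the Graph Reasoning Paradigm (GRP) and the PASC-GRPO algorithm, which use structured graph representations and process-aware optimization to improve LLM performance on mathematical reasoning and code generation.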

Details Motivation: Existing long chain-of-thought reasoning relies on semantic evaluation and suffers from coarse-grained supervision, reward hacking, high training costs, and poor generalization. Method: The Graph Reasoning Paradigm (GRP) realizes symbolic reasoning through graph-structured representations and step-level cognitive labels; the PASC-GRPO algorithm combines structured evaluation, graph-structured outcome rewards, and stratified clipping advantage estimation. Result: The approach clearly outperforms existing methods on mathematical reasoning and code generation while lowering training cost and mitigating reward hacking. Conclusion: Structured, symbolic graph reasoning improves LLM reasoning and offers an efficient, interpretable paradigm for future training. Abstract: Long Chain-of-Thought (LCoT), achieved by Reinforcement Learning with Verifiable Rewards (RLVR), has proven effective in enhancing the reasoning capabilities of Large Language Models (LLMs). However, reasoning in current LLMs is primarily generated as plain text, where performing semantic evaluation on such unstructured data creates a computational bottleneck during training. Despite RLVR-based optimization, existing methods still suffer from coarse-grained supervision, reward hacking, high training costs, and poor generalization. To address these issues, we propose the Graph Reasoning Paradigm (GRP), which realizes structured and symbolic reasoning, implemented via graph-structured representations with step-level cognitive labels. Building upon GRP, we further design Process-Aware Stratified Clipping Group Relative Policy Optimization (PASC-GRPO), which leverages structured evaluation to replace semantic evaluation, achieves process-aware verification through graph-structured outcome rewards, and mitigates reward hacking via stratified clipping advantage estimation. Experiments demonstrate significant improvements across mathematical reasoning and code generation tasks. Data, models, and code will be released later.

[93] Bi-Attention HateXplain: Taking into account the sequential aspect of data during explainability in a multi-task context

Ghislain Dorian Tchuente Mondjo

Main category: cs.CL

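TL;DR: This paper proposes a new bidirectional-attention BiRNN model (BiAtt-BiRNN-HateXplain) that, within a multi-task learning framework, improves hate speech detection performance and explainability and reduces unintentional bias.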

Details Motivation: Existing hate speech detection models fall short on explainability; in particular, the predicted attention of HateXplain-based models is unstable, which leads to inconsistent explanations and non-robust predictions. Method: The BiAtt-BiRNN-HateXplain model combines a bidirectional attention mechanism with a BiRNN structure, performs classification and explanation generation jointly under multi-task learning, and uses sequence modeling to strengthen explainability. Result: On the HateXplain dataset, the model outperforms existing methods in detection performance and explanation consistency and reduces unintentional bias against specific communities. Conclusion: The model's structural design improves both explainability and classification performance, offering an effective approach to transparent and reliable hate speech detection. Abstract: Technological advances in the Internet and online social networks have brought many benefits to humanity. At the same time, this growth has led to an increase in hate speech, the main global threat. To improve the reliability of black-box models used for hate speech detection, post-hoc approaches such as LIME, SHAP, and LRP provide the explanation after training the classification model. In contrast, multi-task approaches based on the HateXplain benchmark learn to explain and classify simultaneously. However, results from HateXplain-based algorithms show that predicted attention varies considerably when it should be constant. This attention variability can lead to inconsistent interpretations, instability of predictions, and learning difficulties. To solve this problem, we propose the BiAtt-BiRNN-HateXplain (Bidirectional Attention BiRNN HateXplain) model, which is easier to explain than more complex LLMs, an advantage given the need for transparency, and which takes into account the sequential aspect of the input data during explainability through a BiRNN layer. Thus, if the explanation is correctly estimated, thanks to multi-task learning (explainability and classification task), the model could classify better and commit fewer unintentional bias errors related to communities. The experimental results on HateXplain data show a clear improvement in detection performance and explainability, and a reduction in unintentional bias.

[94] Tears or Cheers? Benchmarking LLMs via Culturally Elicited Distinct Affective Responses

Chongyuan Dai,Yaling Shen,Jinpeng Hu,Zihan Gao,Jia Li,Yishun Jiang,Yaxiong Wang,Liu Liu,Zongyuan Ge

Main category: cs.CL

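TL;DR: This paper presents CEDAR, a multimodal benchmark for evaluating how well LLMs understand emotion across cultures. Built by combining LLM-generated labels with human annotation, it covers seven languages and 14 fine-grained emotion categories and exposes the shortcomings of current multilingual models in cultural alignment.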

Details Motivation: Existing cultural alignment evaluations mainly target declarative knowledge such as geographical facts or social customs and fail to capture the subjective differences in how emotions are interpreted across cultures, so a new benchmark is needed to measure how well models understand culturally elicited affective responses. Method: The CEDAR construction pipeline uses LLM-generated provisional emotion labels to select scenarios that elicit cross-cultural emotional differences, then derives reliable ground-truth annotations through rigorous human evaluation; the resulting dataset covers seven languages and 14 fine-grained emotion categories in multimodal and text-only forms. Result: CEDAR contains 10,962 instances, with 400 multimodal and 1,166 text-only samples per language. Evaluation of 17 mainstream multilingual models shows that language consistency does not imply cultural alignment, and models perform poorly on culturally grounded affective understanding. Conclusion: Current multilingual LLMs still face major challenges in cultural alignment, particularly in understanding culturally elicited emotions, and future work should pay more attention to modeling subjective affective interpretation. Abstract: Culture serves as a fundamental determinant of human affective processing and profoundly shapes how individuals perceive and interpret emotional stimuli. Despite this intrinsic link, extant evaluations regarding cultural alignment within Large Language Models primarily prioritize declarative knowledge such as geographical facts or established societal customs. These benchmarks remain insufficient to capture the subjective interpretative variance inherent to diverse sociocultural lenses. To address this limitation, we introduce CEDAR, a multimodal benchmark constructed entirely from scenarios capturing Culturally Elicited Distinct Affective Responses. To construct CEDAR, we implement a novel pipeline that leverages LLM-generated provisional labels to isolate instances yielding cross-cultural emotional distinctions, and subsequently derives reliable ground-truth annotations through rigorous human evaluation. The resulting benchmark comprises 10,962 instances across seven languages and 14 fine-grained emotion categories, with each language including 400 multimodal and 1,166 text-only samples. Comprehensive evaluations of 17 representative multilingual models reveal a dissociation between language consistency and cultural alignment, demonstrating that culturally grounded affective understanding remains a significant challenge for current models.

[95] SASA: Semantic-Aware Contrastive Learning Framework with Separated Attention for Triple Classification

Xu Xiaodan,Hu Xiaolin

Main category: cs.CL

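TL;DR: This paper proposes SASA, a new framework that improves knowledge graph triple classification through a separated attention mechanism and semantic-aware contrastive learning, clearly surpassing existing methods.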

Details Motivation: Existing text-based triple classification methods overlook effective semantic interaction among knowledge graph components and mostly use a single binary classification objective, which leads to insufficient semantic representation learning. Method: A separated attention mechanism encodes triples into decoupled representations and fuses them interactively; hierarchical semantic-aware contrastive learning is added as an auxiliary training objective covering both local and global semantics. Result: On the FB15k-237 and YAGO3-10 benchmarks, accuracy improves over the previous best methods by 5.9% and 3.4%, respectively. Conclusion: By improving semantic interaction and adding contrastive learning, SASA raises both the performance and the generalization of triple classification. Abstract: Knowledge Graphs (KGs) often suffer from unreliable knowledge, which restricts their utility. Triple Classification (TC) aims to determine the validity of triples from KGs. Recently, text-based methods learn entity and relation representations from natural language descriptions, significantly improving the generalization capabilities of TC models and setting new benchmarks in performance. However, there are still two critical challenges. First, existing methods often ignore the effective semantic interaction among different KG components. Second, most approaches adopt a single binary classification training objective, leading to insufficient semantic representation learning. To address these challenges, we propose SASA, a novel framework designed to enhance TC models via a separated attention mechanism and semantic-aware contrastive learning (CL). Specifically, we first propose a separated attention mechanism to encode triples into decoupled contextual representations and then fuse them in a more effective interactive way. Then, we introduce semantic-aware hierarchical CL as an auxiliary training objective to guide models in improving their discriminative capabilities and achieving sufficient semantic learning, considering both local-level and global-level CL. Experimental results across two benchmark datasets demonstrate that SASA significantly outperforms state-of-the-art methods. In terms of accuracy, we advance the state-of-the-art by +5.9% on FB15k-237 and +3.4% on YAGO3-10.
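
As background for the contrastive objective mentioned in [95], the sketch below computes a generic InfoNCE loss for one anchor, one positive, and a set of negatives. It is not SASA's semantic-aware hierarchical loss, whose details the abstract does not give; the temperature value and the toy vectors are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE: -log( exp(s_pos / T) / sum_j exp(s_j / T) )."""
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy 2-D embeddings (made up): the loss is small when the positive is
# aligned with the anchor and large when a negative is closer instead.
aligned_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this loss pulls the anchor toward its positive and away from the negatives, which is the discriminative pressure that a contrastive auxiliary objective adds on top of binary classification.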

[96] Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition

Warit Sirichotedumrong,Adisai Na-Thalang,Potsawee Manakul,Pittawat Taveekitworachai,Sittipong Sripaisarnmongkol,Kunat Pipatanakul

Main category: cs.CL

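TL;DR: This paper presents Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. With rigorous text normalization and two-stage curriculum learning, it reaches accuracy comparable to Whisper Large-v3 at 45x lower computational cost, and a standard benchmark dataset for Thai ASR research is released.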

Details Motivation: Because pre-trained checkpoints are readily available, existing open-source Thai automatic speech recognition (ASR) systems rely mainly on high-latency offline architectures such as Whisper and lack efficient streaming solutions, with clear gaps in handling Thai-specific linguistic phenomena and dialect adaptation. Method: A low-latency streaming model is built on the FastConformer-Transducer architecture; a rigorous text normalization pipeline resolves ambiguities such as number verbalization and the repetition marker (mai yamok); a two-stage curriculum learning approach adapts the model to the Isan dialect while preserving Central Thai performance; and a human-labeled benchmark dataset that follows Thai linguistic conventions is released. Result: The compact model matches the recognition accuracy of Whisper Large-v3 at 45x lower computational cost; text normalization clearly improves performance; two-stage curriculum learning achieves dialect adaptation without hurting performance on the main variety; and the Typhoon ASR Benchmark gives the community a standardized evaluation platform. Conclusion: Combining an efficient architecture, careful text processing, and a systematic training strategy can greatly reduce the latency and compute cost of Thai ASR without sacrificing accuracy, advancing practical streaming recognition for resource-limited languages and improving reproducibility. Abstract: Large encoder-decoder models like Whisper achieve strong offline transcription but remain impractical for streaming applications due to high latency. However, due to the accessibility of pre-trained checkpoints, the open Thai ASR landscape remains dominated by these offline architectures, leaving a critical gap in efficient streaming solutions. We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. We demonstrate that rigorous text normalization can match the impact of model scaling: our compact model achieves a 45x reduction in computational cost compared to Whisper Large-v3 while delivering comparable accuracy. Our normalization pipeline resolves systemic ambiguities in Thai transcription, including context-dependent number verbalization and repetition markers (mai yamok), creating consistent training targets. We further introduce a two-stage curriculum learning approach for Isan (north-eastern) dialect adaptation that preserves Central Thai performance. To address reproducibility challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard human-labeled dataset with transcriptions following established Thai linguistic conventions, providing standardized evaluation protocols for the research community.
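
One of the normalization issues named in [96] is the Thai repetition marker mai yamok (ๆ, U+0E46), which is written once but read by repeating the preceding word. The sketch below expands it over a token list that is assumed to be already word-segmented. It is a deliberately simplified illustration, not the Typhoon pipeline: real text needs a word segmenter, and the marker can repeat a whole phrase rather than a single word.

```python
MAI_YAMOK = "\u0e46"  # Thai character ๆ


def expand_mai_yamok(tokens):
    """Replace each standalone mai yamok token with a copy of the previous token.

    Assumes `tokens` is already word-segmented and that the marker
    repeats exactly one preceding word (a simplification).
    """
    expanded = []
    for token in tokens:
        if token == MAI_YAMOK and expanded:
            expanded.append(expanded[-1])
        else:
            # Ordinary token, or a marker with nothing before it (kept as-is).
            expanded.append(token)
    return expanded


# "เด็ก" + "ๆ" is read aloud as "เด็ก เด็ก", so the training target
# spells out what the speaker actually says.
normalized = expand_mai_yamok(["เด็ก", "ๆ", "เล่น"])
```

Spelling out the repetition keeps the transcript aligned with the audio, which is the kind of consistent training target the abstract refers to.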

[97] Profiling German Text Simplification with Interpretable Model-Fingerprints

Lars Klöser,Mika Beele,Bodo Kraft

Main category: cs.CL

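TL;DR: This paper presents the Simplification Profiler, a diagnostic toolkit that generates multidimensional, interpretable "fingerprints" of simplified texts to give a full picture of how large language models behave in text simplification.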

Details Motivation: Tools for a holistic, efficient, and reproducible diagnosis of LLM behavior in text simplification are lacking, and in data-scarce languages it is especially hard to build flexible simplification models for different target groups, so a new evaluation paradigm is needed. Method: The Simplification Profiler aggregates multiple simplifications into a model "fingerprint"; a meta-evaluation with a linear classifier tests whether different model configurations can be reliably identified from their simplified texts, which validates the sensitivity and descriptive power of the metrics. Result: Without large human-rated datasets, the tool distinguishes behavioral changes caused by different prompting strategies and by fine-grained prompt engineering (such as few-shot examples); the complete feature set reaches classification F1-scores of up to 71.9%, more than 48 percentage points above the baselines. Conclusion: The Simplification Profiler gives developers a granular, actionable analysis that helps build more effective and truly adaptive text simplification systems. Abstract: While Large Language Models (LLMs) produce highly nuanced text simplifications, developers currently lack tools for a holistic, efficient, and reproducible diagnosis of their behavior. This paper introduces the Simplification Profiler, a diagnostic toolkit that generates a multidimensional, interpretable fingerprint of simplified texts. Multiple aggregated simplifications of a model result in a model's fingerprint. This novel evaluation paradigm is particularly vital for languages where the data scarcity problem is magnified when creating flexible models for diverse target groups rather than a single, fixed simplification style. We propose that measuring a model's unique behavioral signature is more relevant in this context as an alternative to correlating metrics with human preferences. We operationalize this with a practical meta-evaluation of our fingerprints' descriptive power, which bypasses the need for large, human-rated datasets. This test measures if a simple linear classifier can reliably identify various model configurations by their created simplifications, confirming that our metrics are sensitive to a model's specific characteristics. The Profiler can distinguish high-level behavioral variations between prompting strategies and fine-grained changes from prompt engineering, including few-shot examples. Our complete feature set achieves classification F1-scores up to 71.9%, improving upon simple baselines by over 48 percentage points. The Simplification Profiler thus offers developers a granular, actionable analysis to build more effective and truly adaptive text simplification systems.

[98] Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs

Abdellah El Mekki,Samar M. Magdy,Houdaifa Atou,Ruwa AbuHweidi,Baraah Qawasmeh,Omer Nacar,Thikra Al-hibiri,Razan Saadie,Hamzah Alsayadi,Nadia Ghezaiel Hammouda,Alshima Alkhazimi,Aya Hamod,Al-Yas Al-Ghafri,Wesam El-Sayed,Asila Al sharji,Mohamad Ballout,Anas Belfathi,Karim Ghaddar,Serry Sibaee,Alaa Aoun,Areej Asiri,Lina Abureesh,Ahlam Bashiti,Majdal Yousef,Abdulaziz Hafiz,Yehdih Mohamed,Emira Hamedtou,Brakehe Brahim,Rahaf Alhamouri,Youssef Nafea,Aya El Aatar,Walid Al-Dhabyani,Emhemed Hamed,Sara Shatnawi,Fakhraddin Alwajih,Khalid Elkhidir,Ashwag Alasmari,Abdurrahman Gerrio,Omar Alshahri,AbdelRahim A. Elmadany,Ismail Berrada,Amir Azad Adli Alkathiri,Fadi A Zaraket,Mustafa Jarrar,Yahya Mohamed El Hadj,Hassan Alhuzali,Muhammad Abdul-Mageed

Main category: cs.CL

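TL;DR: This paper introduces Alexandria, a large-scale, community-driven, human-translated dataset of dialectal Arabic covering 13 Arab countries and 11 key domains, with city-level annotation and gender configuration information, aimed at improving machine translation and LLM performance on Arabic dialects.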

Details Motivation: Arabic is highly diglossic: daily communication mostly uses regional dialects rather than Modern Standard Arabic, yet existing machine translation systems handle dialects poorly, which limits their usefulness. Method: Alexandria, a large-scale human-translated dataset, is built with 107K samples covering 13 countries and 11 domains, with city-of-origin metadata and multi-turn conversations annotated with speaker-addressee gender configurations. Result: The dataset can be used to train and evaluate MT systems and LLMs; automatic and human evaluations show both the capabilities of current models in cross-dialect translation and the challenges that persist. Conclusion: Alexandria provides a high-quality, fine-grained resource for dialectal Arabic machine translation and serves as a rigorous benchmark to advance the field. Abstract: Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic. Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total samples, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation of Arabic-aware LLMs benchmarks current capabilities in translating across diverse Arabic dialects and sub-dialects, while exposing significant persistent challenges.

[99] Leveraging Lora Fine-Tuning and Knowledge Bases for Construction Identification

Liu Kaipeng,Wu Ling

Main category: cs.CL

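TL;DR: This study combines a LoRA-fine-tuned LLM with a retrieval-augmented generation (RAG) framework to identify the English ditransitive construction automatically. On annotated BNC data it achieves better binary classification performance than the baselines, and error analysis shows that fine-tuning moves the model from surface pattern matching toward semantically driven judgment.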

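Details Motivation: To improve the accuracy and semantic plausibility of automatic identification of the English ditransitive construction, and to overcome the limits of theory-only RAG and unmodified LLMs on tasks at the syntax-semantics interface. Method: Qwen3-8B is lightly fine-tuned with LoRA and integrated with a RAG framework; a binary classification experiment is run on annotated data from the British National Corpus (BNC). Result: The LoRA-fine-tuned Qwen3-8B clearly outperforms both the native Qwen3-MAX and a RAG system that relies only on theoretical rules; error analysis shows that its judgments depend more on semantics than on surface form. Conclusion: LoRA fine-tuning improves an LLM's semantic sensitivity to complex syntactic constructions, and combining it with RAG brings together knowledge guidance and data-driven learning. Abstract: This study investigates the automatic identification of the English ditransitive construction by integrating LoRA-based fine-tuning of a large language model with a Retrieval-Augmented Generation (RAG) framework. A binary classification task was conducted on annotated data from the British National Corpus. Results demonstrate that a LoRA-fine-tuned Qwen3-8B model significantly outperformed both a native Qwen3-MAX model and a theory-only RAG system. Detailed error analysis reveals that fine-tuning shifts the model's judgment from surface-form pattern matching towards a more semantically grounded understanding.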

[100] CORE-T: COherent REtrieval of Tables for Text-to-SQL

Hassan Soliman,Vivek Gupta,Dan Roth,Iryna Gurevych

Main category: cs.CL

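TL;DR: CORE-T is a training-free, scalable framework that uses LLM-generated metadata and a pre-computed compatibility cache to improve table selection over large, heterogeneous table collections, clearly improving text-to-SQL performance.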

Details Motivation: In realistic text-to-SQL tasks, accurately retrieving the tables needed for multi-table joins is the performance bottleneck, especially in large, heterogeneous table collections without clear database identifiers. Method: CORE-T enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache; at inference time, dense retrieval returns candidate tables, a single LLM call selects a joinable subset, and an additive adjustment step restores strongly compatible tables. Result: On Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points, retrieves up to 42% fewer tables, raises execution accuracy by up to 5.0 points on Bird and 6.9 points on MMQA, and uses 4-5x fewer tokens than LLM-intensive methods. Conclusion: CORE-T balances retrieval precision and inference efficiency, clearly improves multi-table text-to-SQL in the open-book setting, and is practical and scalable. Abstract: Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns top-K candidates; a single LLM call selects a coherent, joinable subset, and a simple additive adjustment step restores strongly compatible tables. Across Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points while retrieving up to 42% fewer tables, improving multi-table execution accuracy by up to 5.0 points on Bird and 6.9 points on MMQA, and using 4-5x fewer tokens than LLM-intensive baselines.
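
The last stage of the pipeline described in [100] (restore strongly compatible tables after the LLM has picked a subset) and the table-selection F1 used to score it can be sketched as below. The abstract does not specify the adjustment rule, so the threshold test, the pairwise cache layout, the function names, and the toy tables are all assumptions for illustration.

```python
def additive_adjustment(candidates, selected, compatibility, threshold=0.8):
    """Add back candidates that are strongly compatible with a selected table.

    compatibility: dict mapping frozenset({table_a, table_b}) -> score in [0, 1],
    standing in for a pre-computed table-compatibility cache (assumed layout).
    """
    restored = list(selected)
    for table in candidates:
        if table in restored:
            continue
        if any(
            compatibility.get(frozenset((table, chosen)), 0.0) >= threshold
            for chosen in selected
        ):
            restored.append(table)
    return restored


def table_selection_f1(predicted, gold):
    """Set-based F1 between predicted and gold table sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: dense retrieval returned four candidates and the
# (stubbed) LLM selection step kept only "orders".
candidates = ["orders", "customers", "products", "weather"]
selected = ["orders"]
cache = {
    frozenset(("orders", "customers")): 0.9,
    frozenset(("orders", "weather")): 0.1,
}
final_tables = additive_adjustment(candidates, selected, cache)
score = table_selection_f1(final_tables, ["orders", "customers"])
```

In this toy run the adjustment recovers `customers`, a join partner the selection step dropped, without reintroducing the unrelated `weather` table, which is the precision-recall balance that table-selection F1 rewards.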

[101] Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning

Fengran Mo,Yifan Gao,Sha Li,Hansi Zeng,Xin Liu,Zhaoxuan Tan,Xian Li,Jianshu Chen,Dakuo Wang,Meng Jiang

Main category: cs.CL

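TL;DR: This paper proposes a new conversational search agent that interleaves search and reasoning across dialogue turns and uses reinforcement learning to optimize mixed-initiative behavior, clearly improving intent understanding and response quality in multi-turn conversations.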

Details Motivation: Existing methods mostly follow a static rewrite-retrieve-generate pipeline that adapts poorly to user intent as it evolves over a multi-turn dialogue, while current deep search agents mainly target single-turn scenarios and lack support for multi-turn interaction. Method: A conversational agent is designed that interleaves search and reasoning across turns and is trained end to end with reinforcement learning, using reward functions tailored to evolving user goals. Result: On four widely used conversational search benchmarks, the method clearly outperforms several strong baselines. Conclusion: Interleaving search and reasoning, combined with reinforcement learning, models dynamic user intent in multi-turn dialogue more effectively and offers a new paradigm for building adaptive conversational AI. Abstract: Large Language Models (LLMs) have become a popular interface for human-AI interaction, supporting information seeking and task assistance through natural, multi-turn dialogue. In multi-turn dialogues, the context-dependent user intent evolves across interactions, so responding to users requires contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Existing studies usually follow static rewrite, retrieve, and generate pipelines, which optimize different procedures separately and overlook the simultaneous optimization of mixed-initiative actions. Although the recent developments in deep search agents demonstrate the effectiveness of jointly optimizing retrieval and generation via reasoning, these approaches focus on single-turn scenarios and might lack the ability to handle multi-turn interactions. We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) training with tailored rewards towards evolving user goals. The experimental results across four widely used conversational benchmarks demonstrate the effectiveness of our methods by surpassing several existing strong baselines.

[102] Adversarial Alignment: Ensuring Value Consistency in Large Language Models for Sensitive Domains

Yuan Gao,Zhigang Liu,Xinyu Yao,Bo Chen,Xiaobing Zhao

Main category: cs.CL

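TL;DR: This paper proposes an adversarial alignment framework that improves the value consistency of LLMs in sensitive domains through continued pre-training, instruction fine-tuning, and adversarial training, and builds a bilingual Chinese-English evaluation dataset. Experiments show that the resulting VC-LLM outperforms mainstream models in both Chinese and English tests.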

Details Motivation: LLMs show bias and value inconsistency in sensitive domains (such as race, society, and politics) and may generate harmful content, so their value consistency needs to improve for safe use. Method: An adversarial alignment framework combines continued pre-training, instruction fine-tuning, and adversarial training, in which an Attacker generates controversial queries, an Actor generates value-consistent responses, and a Critic filters responses for quality. Result: The VC-LLM model is trained and outperforms existing mainstream models on a self-built bilingual Chinese-English evaluation dataset, which supports the effectiveness of the method. Conclusion: The proposed framework improves the value consistency of LLMs in sensitive domains, and VC-LLM shows better safety and alignment in a multilingual setting. Abstract: With the wide application of large language models (LLMs), the problems of bias and value inconsistency in sensitive domains have gradually emerged, especially in terms of race, society and politics. In this paper, we propose an adversarial alignment framework, which enhances the value consistency of the model in sensitive domains through continued pre-training, instruction fine-tuning and adversarial training. In adversarial training, we use the Attacker to generate controversial queries, the Actor to generate responses with value consistency, and the Critic to filter and ensure response quality. Furthermore, we train a Value-Consistent Large Language Model, VC-LLM, for sensitive domains, and construct a bilingual evaluation dataset in Chinese and English. The experimental results show that VC-LLM performs better than the existing mainstream models in both Chinese and English tests, verifying the effectiveness of the method. Warning: This paper contains examples of LLMs that are offensive or harmful in nature.

[103] Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference

Zimeng Wu,Donghao Wang,Chaozhe Jin,Jiaxin Chen,Yunhong Wang

Main category: cs.CL

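TL;DR: This paper proposes SPTS, a training-free framework for efficient long-context LLM inference that selectively skips redundant tokens using partial attention probing and low-rank transformation probing, substantially speeding up inference while preserving performance.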

Details Motivation: Existing token-oriented inference acceleration methods suffer from limited acceleration potential, outdated proxy signals, and redundancy interference, which leads to poor speed-accuracy trade-offs. A more effective acceleration mechanism for long-context inference is needed to overcome these limitations. Method: Proposes the SPTS framework with three core components: 1) Partial Attention Probing (PAP), which selects important tokens via partial forward attention computation; 2) Low-rank Transformation Probing (LTP), which builds a low-rank proxy network to predict token transformations; 3) Multi-Stage Delayed Pruning (MSDP), which progressively reallocates the skipping budget and prunes redundant tokens across layers. The whole procedure requires no additional training. Result: Experiments show that SPTS achieves speedups of up to 2.46× for prefilling and 2.29× for end-to-end generation while maintaining state-of-the-art model performance. Conclusion: Through component-specific probing strategies and progressive pruning, SPTS improves the efficiency-accuracy balance of long-context LLM inference and offers a new direction for training-free inference acceleration. Abstract: Long-context inference enhances the reasoning capability of Large Language Models (LLMs) while incurring significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown promise in reducing inference latency, but still suffer from inherently limited acceleration potential, outdated proxy signals, and redundancy interference, thus yielding suboptimal speed-accuracy trade-offs. To address these challenges, we propose SPTS (Self-Predictive Token Skipping), a training-free framework for efficient long-context LLM inference. Specifically, motivated by the idea of probing the influence of targeted skipping layers, we design two component-specific strategies for selective token skipping: Partial Attention Probing (PAP) for multi-head attention, which selects informative tokens by performing partial forward attention computation, and Low-rank Transformation Probing (LTP) for the feed-forward network, which constructs a low-rank proxy network to predict token transformations. Furthermore, a Multi-Stage Delayed Pruning (MSDP) strategy reallocates the skipping budget and progressively prunes redundant tokens across layers. Extensive experiments demonstrate the effectiveness of our method, achieving up to 2.46× and 2.29× speedups for prefilling and end-to-end generation, respectively, while maintaining state-of-the-art model performance. The source code will be publicly available upon paper acceptance.

[104] Medical Triage as Pairwise Ranking: A Benchmark for Urgency in Patient Portal Messages

Joseph Gatto,Parker Seegmiller,Timothy Burdick,Philip Resnik,Roshnik Rahat,Sarah DeLozier,Sarah M. Preum

Main category: cs.CL

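TL;DR: This paper introduces PMR-Bench, the first large-scale public dataset for medical triage of asynchronous outpatient portal messages, frames patient message triage as a pairwise inference problem, and proposes two LLM-based models, UrgentSFT and UrgentReward, for ranking messages by medical urgency.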

Details Motivation: Medical triage must allocate resources efficiently, but no large-scale public dataset from a realistic setting exists for studying patient prioritization based on asynchronous messages. Method: Defines triage as a pairwise comparison ("which message is more urgent"), builds the PMR-Bench benchmark with 1,569 messages and 2,000+ test pairs, and trains LLMs to rank messages using a Bradley-Terry objective (UrgentReward) and next-token prediction (UrgentSFT). Result: UrgentSFT performs best on PMR-Bench, while UrgentReward has advantages in low-resource settings; compared with existing 8B models, the two improve inbox sorting metrics by 15 and 16 points, respectively. Conclusion: A pairwise learning framework combined with a domain-specific annotation strategy effectively improves LLM performance on medical triage and offers a feasible automated approach to managing real-world outpatient messages. Abstract: Medical triage is the task of allocating medical resources and prioritizing patients based on medical need. This paper introduces the first large-scale public dataset for studying medical triage in the context of asynchronous outpatient portal messages. Our novel task formulation views patient message triage as a pairwise inference problem, where we train LLMs to choose "which message is more medically urgent" in a head-to-head tournament-style re-sort of a physician's inbox. Our novel benchmark PMR-Bench contains 1,569 unique messages and 2,000+ high-quality test pairs for pairwise medical urgency assessment alongside a scalable training data generation pipeline. PMR-Bench includes samples that contain both unstructured patient-written messages and real electronic health record (EHR) data, emulating a real-world medical triage scenario. We develop a novel automated data annotation strategy to provide LLMs with in-domain guidance on this task. The resulting data is used to train two model classes, UrgentReward and UrgentSFT, leveraging Bradley-Terry and next-token prediction objectives, respectively, to perform pairwise urgency classification. We find that UrgentSFT achieves top performance on PMR-Bench, with UrgentReward showing distinct advantages in low-resource settings. For example, UrgentSFT-8B and UrgentReward-8B provide a 15- and 16-point boost, respectively, on inbox sorting metrics over off-the-shelf 8B models. Paper resources can be found at https://tinyurl.com/Patient-Message-Triage
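The pairwise formulation in the abstract above can be illustrated with a minimal sketch of a Bradley-Terry preference loss. This is not the authors' implementation: the function names, the scalar urgency scores, and the toy messages are all assumptions made for illustration, and a real reward model would produce the scores from the message text and EHR data.

```python
import math

def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Probability that message A is more urgent than B, given scalar urgency scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_loss(score_a: float, score_b: float, a_is_more_urgent: bool) -> float:
    """Negative log-likelihood of the labeled preference under Bradley-Terry."""
    p_a = bradley_terry_prob(score_a, score_b)
    return -math.log(p_a if a_is_more_urgent else 1.0 - p_a)

def sort_inbox(messages, score_fn):
    """Re-sort an inbox from most to least urgent using a scalar score function."""
    return sorted(messages, key=score_fn, reverse=True)

# Hypothetical scores standing in for a trained reward model's outputs.
toy_scores = {
    "refill request": -1.0,
    "chest pain since morning": 2.5,
    "billing question": -2.0,
}
ranked = sort_inbox(list(toy_scores), toy_scores.get)
print(ranked)
```

Under these assumptions, the loss is small when the higher-scored message is the one labeled more urgent and large otherwise, which is what pushes the scores toward a consistent urgency ordering.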

Sergio Servantez,Sarah B. Lawsky,Rajiv Jain,Daniel W. Linna,Kristian Hammond

Main category: cs.CL

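TL;DR: This paper presents OpenExempt, a framework and benchmark for diagnostic evaluation of legal reasoning. It uses expert-crafted symbolic representations of the U.S. Bankruptcy Code to dynamically generate natural language reasoning tasks with machine-computable solutions, enabling fine-grained evaluation of language model reasoning in complex, rule-bound domains.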

Details Motivation: Existing legal reasoning benchmarks are costly to build and poorly suited to isolating specific failure modes, and static question-answer pairs cannot fully capture model reasoning in complex, rule-dense domains, so a more diagnostic evaluation method is needed. Method: Proposes the OpenExempt framework, which uses expert-crafted symbolic representations of the U.S. Bankruptcy Code to generate natural language reasoning tasks and their machine-computable answers on demand; the resulting benchmark contains 9,765 samples across nine evaluation suites designed to test different reasoning skills in isolation. Result: Experiments on 13 different language models show that performance drops sharply when reasoning paths are longer or obfuscating statements are present, exposing key weaknesses of current models in complex legal reasoning. Conclusion: OpenExempt provides a controllable, scalable means of diagnostic evaluation for legal reasoning that supports understanding and improving the next generation of reasoning systems; the framework and benchmark are publicly released. Abstract: Reasoning benchmarks have played a crucial role in the progress of language models. Yet rigorous evaluation remains a significant challenge as static question-answer pairs provide only a snapshot of performance, compressing complex behavior into a single accuracy metric. This limitation is especially acute in complex, rule-bound domains such as law, where existing benchmarks are costly to build and ill suited for isolating specific failure modes. To address this, we introduce OpenExempt, a framework and benchmark for diagnostic evaluation of legal reasoning. The OpenExempt Framework uses expert-crafted symbolic representations of U.S. Bankruptcy Code statutes to dynamically generate a large space of natural language reasoning tasks and their machine-computable solutions on demand. This gives users fine-grained control over task complexity and scope, allowing individual reasoning skills to be probed in isolation. Using this system, we construct the OpenExempt Benchmark, a diagnostic benchmark for legal reasoning with 9,765 samples across nine evaluation suites designed to carefully probe model capabilities. Experiments on 13 diverse language models reveal sharp performance cliffs that emerge only under longer reasoning paths and in the presence of obfuscating statements. We release the framework and benchmark publicly to support research aimed at understanding and improving the next generation of reasoning systems.

[106] Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision

Bingsen Chen,Boyan Li,Ping Nie,Yuyu Zhang,Xi Ye,Chen Zhao

Main category: cs.CL

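TL;DR: This paper introduces Mr Dre, a new benchmark for evaluating Deep Research Agents (DRAs) on multi-turn report revision, and reveals a serious problem: existing agents degrade previously written content when handling user feedback.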

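Details Motivation: Existing DRA benchmarks treat report generation as a single-shot writing task, ignoring how human researchers actually improve reports through iterative reflection and feedback, and they do not evaluate multi-turn revision ability. Method: Builds the Mr Dre evaluation suite, which comprises a unified long-form report evaluation protocol (covering comprehensiveness, factuality, and presentation) and a multi-turn revision pipeline based on human-verified feedback simulation, and uses it to systematically evaluate five different DRAs. Result: Although DRAs address most feedback, they regress on 16-27% of previously covered content and citation quality; over multiple revision turns they keep disrupting content outside the scope of the feedback and fail to preserve earlier edits; inference-time fixes such as prompt engineering or a dedicated sub-agent help only to a limited extent. Conclusion: Current DRAs have significant shortcomings in multi-turn report revision and struggle to balance making changes with preserving existing content, which calls for stronger editing control; Mr Dre provides a new evaluation direction for future research. Abstract: Existing benchmarks for Deep Research Agents (DRAs) treat report generation as a single-shot writing task, which fundamentally diverges from how human researchers iteratively draft and revise reports via self-reflection or peer feedback. Whether DRAs can reliably revise reports with user feedback remains unexplored. We introduce Mr Dre, an evaluation suite that establishes multi-turn report revision as a new evaluation axis for DRAs. Mr Dre consists of (1) a unified long-form report evaluation protocol spanning comprehensiveness, factuality, and presentation, and (2) a human-verified feedback simulation pipeline for multi-turn revision. Our analysis of five diverse DRAs reveals a critical limitation: while agents can address most user feedback, they also regress on 16-27% of previously covered content and citation quality. Over multiple revision turns, even the best-performing agents leave significant headroom, as they continue to disrupt content outside the feedback's scope and fail to preserve earlier edits. We further show that these issues are not easily resolvable through inference-time fixes such as prompt engineering and a dedicated sub-agent for report revision.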

[107] Autoregressive Models Rival Diffusion Models at ANY-ORDER Generation

Tianqi Du,Lizhe Fang,Weijie Yang,Chenheng Zhang,Zeming Wei,Yifei Wang,Yisen Wang

Main category: cs.CL

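TL;DR: Proposes Any-order Any-subset Autoregressive modeling (A3), a new language modeling framework that combines the modeling capacity of autoregressive models with the generation flexibility of diffusion models.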

Details Motivation: Diffusion language models offer flexible generation order but are constrained by a single-step dependency structure, which limits modeling depth and yields lower generation quality and stability than autoregressive models. The goal is to improve generation quality and modeling capacity while keeping the flexibility of diffusion models. Method: Reformulates diffusion-style training as a structured multi-group prediction process and proposes the A3 framework, which supports autoregressive generation over arbitrary token groups and orders, implemented with a two-stream attention architecture and a progressive adaptation strategy. Result: On question answering, commonsense reasoning, and story infilling, A3 outperforms diffusion-based models while retaining flexible decoding. Conclusion: A3 offers a unified, efficient, and flexible language modeling paradigm that combines the deep dependency modeling of autoregressive models with the generation flexibility of diffusion models. Abstract: Diffusion language models enable any-order generation and bidirectional conditioning, offering appealing flexibility for tasks such as infilling, rewriting, and self-correction. However, their formulation (predicting one part of a sequence from another within a single-step dependency) limits modeling depth and often yields lower sample quality and stability than autoregressive (AR) models. To address this, we revisit autoregressive modeling as a foundation and reformulate diffusion-style training into a structured multi-group prediction process. We propose Any-order Any-subset Autoregressive modeling (A3), a generalized framework that extends the standard AR factorization to arbitrary token groups and generation orders. A3 preserves the probabilistic rigor and multi-layer dependency modeling of AR while inheriting diffusion models' flexibility for parallel and bidirectional generation. We implement A3 through a two-stream attention architecture and a progressive adaptation strategy that transitions pretrained AR models toward any-order prediction. Experiments on question answering, commonsense reasoning, and story infilling demonstrate that A3 outperforms diffusion-based models while maintaining flexible decoding. This work offers a unified approach for a flexible, efficient, and novel language modeling paradigm.

[108] Aligning Agentic World Models via Knowledgeable Experience Learning

Baochang Ren,Yunzhi Yao,Rui Sun,Shuofei Qiao,Ningyu Zhang,Huajun Chen

Main category: cs.CL

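TL;DR: WorldMind is a framework that builds a symbolic world knowledge repository by integrating environmental feedback, aiming to address the physical hallucinations that arise because large language models lack procedural grounding in the physical world.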

Details Motivation: Current large language models have rich semantic knowledge, but when simulating the physical world they often produce plans that are logically sound yet physically unexecutable, a problem referred to as physical hallucination. Method: Proposes the WorldMind framework, which autonomously builds a symbolic world knowledge repository by using Process Experience (prediction errors) to enforce physical feasibility and Goal Experience (successful trajectories) to guide task optimality. Result: Experiments on EB-ALFRED and EB-Habitat show that WorldMind outperforms baseline methods and exhibits notable cross-model and cross-environment transferability. Conclusion: WorldMind helps bridge the gap between large language models and the physical world, providing a flexible approach that adapts to open-ended physical dynamics without frequent retraining. Abstract: Current Large Language Models (LLMs) exhibit a critical modal disconnect: they possess vast semantic knowledge but lack the procedural grounding to respect the immutable laws of the physical world. Consequently, while these agents implicitly function as world models, their simulations often suffer from physical hallucinations, generating plans that are logically sound but physically unexecutable. Existing alignment strategies predominantly rely on resource-intensive training or fine-tuning, which attempt to compress dynamic environmental rules into static model parameters. However, such parametric encapsulation is inherently rigid, struggling to adapt to the open-ended variability of physical dynamics without continuous, costly retraining. To bridge this gap, we introduce WorldMind, a framework that autonomously constructs a symbolic World Knowledge Repository by synthesizing environmental feedback. Specifically, it unifies Process Experience to enforce physical feasibility via prediction errors and Goal Experience to guide task optimality through successful trajectories. Experiments on EB-ALFRED and EB-Habitat demonstrate that WorldMind achieves superior performance compared to baselines with remarkable cross-model and cross-environment transferability.

[109] Beyond Cosine Similarity: Taming Semantic Drift and Antonym Intrusion in a 15-Million Node Turkish Synonym Graph

Ebubekir Tosun,Mehmet Emin Buldur,Özay Ezerceli,Mahmoud ElHussieni

Main category: cs.CL

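TL;DR: This paper presents a large-scale semantic clustering system that builds a large labeled dataset, a three-way semantic relation discriminator, and a soft-to-hard clustering algorithm to distinguish synonyms, antonyms, and co-hyponyms, addressing the inability of neural embeddings to reliably identify antonymy.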

Details Motivation: Neural embeddings are clearly deficient at separating synonyms from antonyms, so even high similarity thresholds fail to keep opposite meanings out of the same cluster, which particularly hurts semantic retrieval quality for low-resource languages. Method: 1) Builds a labeled dataset of 843,000 concept pairs using Gemini 2.5-Flash LLM augmentation verified against human-curated dictionaries; 2) designs a three-way semantic relation discriminator (synonymy, antonymy, co-hyponymy) that reaches 90% macro-F1; 3) proposes a topology-aware two-stage soft-to-hard clustering algorithm that uses an expansion-pruning procedure and topological voting to prevent semantic drift and erroneous linking of polysemous words. Result: The system processes 15 million lexical items, evaluates 520 million candidate relationships, and produces 2.9 million high-precision semantic clusters; the three-way discriminator reaches 90% macro-F1; the clustering procedure suppresses erroneous transitive chains such as hot -> spicy -> pain -> depression. Conclusion: The method markedly improves the precision of semantic clustering and is especially suited to semantic search and retrieval-augmented generation for morphologically rich or low-resource languages, offering a feasible way to overcome antonym confusion in embedding space. Abstract: Neural embeddings have a notorious blind spot: they can't reliably tell synonyms apart from antonyms. Consequently, increasing similarity thresholds often fails to prevent opposites from being grouped together. We've built a large-scale semantic clustering system specifically designed to tackle this problem head-on. Our pipeline chews through 15 million lexical items, evaluates a massive 520 million potential relationships, and ultimately generates 2.9 million high-precision semantic clusters. The system makes three primary contributions. First, we introduce a labeled dataset of 843,000 concept pairs spanning synonymy, antonymy, and co-hyponymy, constructed via Gemini 2.5-Flash LLM augmentation and verified using human-curated dictionary resources. Second, we propose a specialized three-way semantic relation discriminator that achieves 90% macro-F1, enabling robust disambiguation beyond raw embedding similarity. Third, we introduce a novel soft-to-hard clustering algorithm that mitigates semantic drift, preventing erroneous transitive chains (e.g., hot -> spicy -> pain -> depression) while simultaneously resolving polysemy. Our approach employs a topology-aware two-stage expansion-pruning procedure with topological voting, ensuring that each term is assigned to exactly one semantically coherent cluster. The resulting resource enables high-precision semantic search and retrieval-augmented generation, particularly for morphologically rich and low-resource languages where existing synonym databases remain sparse.

[110] A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus

Ebubekir Tosun,Mehmet Emin Buldur,Özay Ezerceli,Mahmoud ElHussieni

Main category: cs.CL

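TL;DR: Proposes a hybrid method for generating large-scale semantic relation datasets in low-resource languages and, using Turkish as the example, builds a dataset of 843,000 semantic pairs, 10x larger than existing resources, at a cost of only $65.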

Details Motivation: Addresses the scarcity of semantic relation data in low-resource languages such as Turkish, to advance natural language processing for them. Method: Uses a three-phase approach: first, FastText embeddings with agglomerative clustering discover semantic clusters; next, the Gemini 2.5-Flash model automatically classifies semantic relations; finally, human-curated dictionary resources are integrated for data augmentation and verification. Result: Builds a dataset of 843,000 Turkish semantic pairs covering synonymy, antonymy, and co-hyponymy; in downstream tasks, the retrieval model reaches 90% top-1 accuracy and the classification model attains 90% F1-macro. Conclusion: The method is efficient and low-cost, substantially eases the shortage of semantic data for low-resource languages, and is scalable and applicable to other low-resource languages. Abstract: We present a hybrid methodology for generating large-scale semantic relationship datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with Agglomerative Clustering to identify semantic clusters, (2) Gemini 2.5-Flash for automated semantic relationship classification, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relationship types (synonyms, antonyms, co-hyponyms), representing a 10x scale increase over existing resources at minimal cost ($65). We validate the dataset through two downstream tasks: an embedding model achieving 90% top-1 retrieval accuracy and a classification model attaining 90% F1-macro. Our scalable protocol addresses critical data scarcity in Turkish NLP and demonstrates applicability to other low-resource languages. We publicly release the dataset and models.

[111] Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Sawsan Alqahtani,Mir Tafseer Nayeem,Md Tahmid Rahman Laskar,Tasnim Mohiuddin,M Saiful Bari

Main category: cs.CL

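TL;DR: This paper re-examines tokenization in large language models, arguing that it should be treated as a core modeling decision rather than a preprocessing step. It calls for tokenizer-model co-design through a context-aware framework and stresses the importance of standardized evaluation and transparent reporting.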

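Details Motivation: Existing subword tokenization methods such as BPE scale well but often misalign with linguistic structure, amplify bias, and waste model capacity in multilingual and multi-domain settings, so better design principles and evaluation standards are needed. Method: Treats tokenization as a core modeling problem, advocates a context-aware tokenization framework that accounts for linguistic, domain, and deployment considerations, promotes tokenizer-model co-design, and calls for standardized evaluation and transparent reporting. Result: The framework is argued to improve model efficiency and adaptability in multilingual and multi-domain settings, reduce bias, and make tokenization choices more comparable and interpretable. Conclusion: Elevating tokenization from a technical detail to a core design problem could lead to fairer, more efficient, and more adaptable language technologies. Abstract: Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic structure, amplify bias, and waste capacity across languages and domains. This paper reframes tokenization as a core modeling decision rather than a preprocessing step. We argue for a context-aware framework that integrates tokenizer and model co-design, guided by linguistic, domain, and deployment considerations. Standardized evaluation and transparent reporting are essential to make tokenization choices accountable and comparable. Treating tokenization as a core design problem, not a technical afterthought, can yield language technologies that are fairer, more efficient, and more adaptable.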

[112] Unlearning in LLMs: Methods, Evaluation, and Open Challenges

Tyler Lizzo,Larry Heck

Main category: cs.CL

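TL;DR: This paper surveys machine unlearning techniques for large language models, systematically categorizing existing methods, reviewing the evaluation ecosystem, and identifying open problems such as scalability and cross-modal unlearning.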

Details Motivation: Because deployed large language models raise concerns about privacy, copyright, security, and bias, effective mechanisms for removing knowledge are needed to keep models compliant and trustworthy. Method: Categorizes existing unlearning methods into data-centric, parameter-centric, architecture-centric, hybrid, and other strategies, and systematically reviews the benchmarks, metrics, and datasets used for evaluation. Result: Summarizes current progress of unlearning techniques in forgetting effectiveness, knowledge retention, and robustness, and identifies several key challenges. Conclusion: The paper provides a systematic survey and a research roadmap for developing reliable and responsible unlearning techniques for large language models. Abstract: Large language models (LLMs) have achieved remarkable success across natural language processing tasks, yet their widespread deployment raises pressing concerns around privacy, copyright, security, and bias. Machine unlearning has emerged as a promising paradigm for selectively removing knowledge or data from trained models without full retraining. In this survey, we provide a structured overview of unlearning methods for LLMs, categorizing existing approaches into data-centric, parameter-centric, architecture-centric, hybrid, and other strategies. We also review the evaluation ecosystem, including benchmarks, metrics, and datasets designed to measure forgetting effectiveness, knowledge retention, and robustness. Finally, we outline key challenges and open problems, such as scalable efficiency, formal guarantees, cross-language and multimodal unlearning, and robustness against adversarial relearning. By synthesizing current progress and highlighting open directions, this paper aims to serve as a roadmap for developing reliable and responsible unlearning techniques in large language models.

[113] A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification

Gonzalo Ariel Meyoyan,Luciano Del Corro

Main category: cs.CL

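TL;DR: This paper proposes reusing an LLM's hidden states during generation for classification tasks. With lightweight probes and a two-stage aggregation mechanism, it matches the performance of standalone safety models without adding latency or VRAM overhead.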

Details Motivation: To reduce the latency, VRAM footprint, and operational complexity that production LLM systems incur by running separate safety models. Method: Trains lightweight probes on the hidden states the LLM produces during generation and selects the classification representation with a two-stage aggregator that first summarizes tokens within each layer and then aggregates across layers. Result: On safety and sentiment benchmarks, the method improves over logit-only reuse (e.g., MULI) and is competitive with larger task-specific models. Conclusion: The method preserves near-serving latency while avoiding extra VRAM and latency costs, providing a feasible approach to efficient classification in LLM systems. Abstract: Production LLM systems often rely on separate models for safety and other classification-heavy steps, increasing latency, VRAM footprint, and operational complexity. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation. We frame classification as representation selection over the full token-layer hidden-state tensor, rather than committing to a fixed token or fixed layer (e.g., first-token logits or final-layer pooling). To implement this, we introduce a two-stage aggregator that (i) summarizes tokens within each layer and (ii) aggregates across layer summaries to form a single representation for classification. We instantiate this template with direct pooling, a 100K-parameter scoring-attention gate, and a downcast multi-head self-attention (MHA) probe with up to 35M trainable parameters. Across safety and sentiment benchmarks our probes improve over logit-only reuse (e.g., MULI) and are competitive with substantially larger task-specific baselines, while preserving near-serving latency and avoiding the VRAM and latency costs of a separate guard-model pipeline.
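The two-stage aggregation described in the abstract above can be sketched in plain Python as follows. This only loosely corresponds to the "direct pooling" variant: the mean pooling in stage (i), the softmax weighting in stage (ii), the `layer_logits` parameter, and the toy tensor are assumptions for illustration, not the paper's actual probe design.

```python
import math

def mean_pool_tokens(layer):
    """Stage (i): average the token vectors of one layer into a single vector."""
    n_tokens = len(layer)
    dim = len(layer[0])
    return [sum(tok[d] for tok in layer) / n_tokens for d in range(dim)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_layers(layer_summaries, layer_logits):
    """Stage (ii): softmax-weighted sum of the per-layer summaries."""
    weights = softmax(layer_logits)
    dim = len(layer_summaries[0])
    return [
        sum(w * summary[d] for w, summary in zip(weights, layer_summaries))
        for d in range(dim)
    ]

def two_stage_representation(hidden_states, layer_logits):
    """hidden_states is indexed as [layer][token][dim]; returns one vector."""
    summaries = [mean_pool_tokens(layer) for layer in hidden_states]
    return aggregate_layers(summaries, layer_logits)

# Toy tensor: 2 layers, 2 tokens, hidden size 2.
hidden_states = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
]
rep = two_stage_representation(hidden_states, layer_logits=[0.0, 0.0])
print(rep)
```

In a trained probe the layer weights (and any token-level scoring) would be learned, and the resulting vector would feed a small classification head; here equal logits simply average the two layer summaries.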

[114] OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference

Yow-Fu Liou,Yu-Chien Tang,Yu-Hsiang Liu,An-Zi Yen

Main category: cs.CL

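TL;DR: This paper proposes a benchmarking approach called "option injection", which adds an extra option containing a misleading directive to multiple-choice questions in order to systematically evaluate how susceptible large language models are to directive interference.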

Details Motivation: Prior work shows that LLM decisions can be influenced by directive signals such as social cues, framing effects, and instructions. A standardized and scalable evaluation method is needed to assess these influences more systematically. Method: Proposes the option injection approach and builds OI-Bench, a benchmark of 3,000 questions spanning knowledge, reasoning, and commonsense tasks with 16 types of misleading directives. Experiments on 12 LLMs evaluate attack success rates, behavioral responses, and several mitigation strategies. Result: Models show substantial vulnerability and heterogeneous robustness under directive interference, with some models more easily swayed by social compliance, bonus framing, or threat framing. Conclusion: OI-Bench supports systematic evaluation of LLM susceptibility to directive interference in choice-based interfaces, exposes robustness shortcomings in current models, and points to directions for improvement. Abstract: Benchmarking large language models (LLMs) is critical for understanding their capabilities, limitations, and robustness. In addition to interface artifacts, prior studies have shown that LLM decisions can be influenced by directive signals such as social cues, framing, and instructions. In this work, we introduce option injection, a benchmarking approach that augments the multiple-choice question answering (MCQA) interface with an additional option containing a misleading directive, leveraging standardized choice structure and scalable evaluation. We construct OI-Bench, a benchmark of 3,000 questions spanning knowledge, reasoning, and commonsense tasks, with 16 directive types covering social compliance, bonus framing, threat framing, and instructional interference. This setting combines manipulation of the choice interface with directive-based interference, enabling systematic assessment of model susceptibility. We evaluate 12 LLMs to analyze attack success rates, behavioral responses, and further investigate mitigation strategies ranging from inference-time prompting to post-training alignment. Experimental results reveal substantial vulnerabilities and heterogeneous robustness across models. OI-Bench is expected to support more systematic evaluation of LLM robustness to directive interference within choice-based interfaces.

[115] Paid Voices vs. Public Feeds: Interpretable Cross-Platform Theme Modeling of Climate Discourse

Samantha Sudhoff,Pranav Perumal,Zhaoqing Wu,Tunazzina Islam

Main category: cs.CL

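TL;DR: This paper proposes an interpretable theme discovery framework for comparing climate discourse in Meta advertisements and public Bluesky posts, showing how platform incentives shape the thematic structure, stance, and temporal dynamics of climate narratives.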

Details Motivation: Existing studies usually analyze climate communication on different platforms in isolation, making it hard to distinguish institutional messaging from public expression; this paper uses cross-platform comparison to show how structural differences affect climate discourse. Method: Builds an end-to-end interpretable theme discovery and assignment framework that clusters texts by semantic similarity and uses large language models to generate human-interpretable theme labels; compares against traditional topic modeling methods using human judgments and an LLM-based evaluator, and validates theme quality through stance prediction and theme-guided retrieval tasks. Result: Paid advertising and public social media differ systematically in theme distribution, stance alignment, and how quickly they respond to major political events, and platform incentives clearly shape the structure and dynamics of climate narratives. Conclusion: Platform structure and incentives shape how climate discourse is expressed; the framework can be used for comparative narrative analysis across heterogeneous communication environments and applies to communication research more broadly. Abstract: Climate discourse online plays a crucial role in shaping public understanding of climate change and influencing political and policy outcomes. However, climate communication unfolds across structurally distinct platforms with fundamentally different incentive structures: paid advertising ecosystems incentivize targeted, strategic persuasion, while public social media platforms host largely organic, user-driven discourse. Existing computational studies typically analyze these environments in isolation, limiting our ability to distinguish institutional messaging from public expression. In this work, we present a comparative analysis of climate discourse across paid advertisements on Meta (previously known as Facebook) and public posts on Bluesky from July 2024 to September 2025. We introduce an interpretable, end-to-end thematic discovery and assignment framework that clusters texts by semantic similarity and leverages large language models (LLMs) to generate concise, human-interpretable theme labels. We evaluate the quality of the induced themes against traditional topic modeling baselines using both human judgments and an LLM-based evaluator, and further validate their semantic coherence through downstream stance prediction and theme-guided retrieval tasks. Applying the resulting themes, we characterize systematic differences between paid climate messaging and public climate discourse and examine how thematic prevalence shifts around major political events. Our findings show that platform-level incentives are reflected in the thematic structure, stance alignment, and temporal responsiveness of climate narratives.
While our empirical analysis focuses on climate communication, the proposed framework is designed to support comparative narrative analysis across heterogeneous communication environments.

[116] Arab Voices: Mapping Standard and Dialectal Arabic Speech Technology

Peter Sullivan,AbdelRahim Elmadany,Alcides Alcoba Inciarte,Muhammad Abdul-Mageed

Main category: cs.CL

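TL;DR: This paper presents Arab Voices, a standardized framework that addresses dataset heterogeneity in Dialectal Arabic (DA) speech recognition. It integrates 31 datasets, provides harmonized metadata and evaluation utilities, and establishes strong baselines for modern DA ASR.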

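Details Motivation: Dialectal Arabic data differ widely in domain coverage, labeling practices, and recording conditions, which makes cross-dataset comparison and model evaluation difficult and leaves the field without fine-grained, standardized characterization. Method: Runs a computational analysis of the training splits of widely used DA corpora, assessing linguistic "dialectness" and audio quality; builds the Arab Voices framework, which integrates 31 datasets with harmonized metadata and evaluation utilities; benchmarks a range of recent ASR systems. Result: Existing datasets show substantial heterogeneity in acoustic conditions and in the strength and consistency of dialectal signals; Arab Voices enables unified access and evaluation across datasets; the benchmark establishes strong baselines for modern DA ASR. Conclusion: Standardized data characterization beyond coarse labels is needed, and Arab Voices gives DA ASR research a reproducible, comparable, open platform that supports more standardized practice in the field. Abstract: Dialectal Arabic (DA) speech data vary widely in domain coverage, dialect labeling practices, and recording conditions, complicating cross-dataset comparison and model evaluation. To characterize this landscape, we conduct a computational analysis of linguistic "dialectness" alongside objective proxies of audio quality on the training splits of widely used DA corpora. We find substantial heterogeneity both in acoustic conditions and in the strength and consistency of dialectal signals across datasets, underscoring the need for standardized characterization beyond coarse labels. To reduce fragmentation and support reproducible evaluation, we introduce Arab Voices, a standardized framework for DA ASR. Arab Voices provides unified access to 31 datasets spanning 14 dialects, with harmonized metadata and evaluation utilities. We further benchmark a range of recent ASR systems, establishing strong baselines for modern DA ASR.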

[117] Reducing Tokenization Premiums for Low-Resource Languages

Geoffrey Churchill,Steven Skiena

Main category: cs.CL

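TL;DR: This paper analyzes the tokenizer designs of ten popular language models and their tokenization premiums for low-resource languages, and proposes reducing the premium by adding new tokens to a pre-trained model's vocabulary post hoc. Experiments on 12 low-resource languages suggest the method is effective and keeps model outputs consistent.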

Details Motivation: Low-resource languages face substantial tokenization premiums in modern language models, which lead to higher API and energy costs and shorter effective context windows, so optimization is needed. Method: Analyzes the tokenizer designs of ten mainstream language models and proposes a post-hoc vocabulary extension that merges multi-token characters into single tokens to lower the tokenization premium. Result: On the Llama 3.2 1B model, the original and compressed inputs often have similar last hidden states across 12 low-resource languages, supporting the method's effectiveness. Conclusion: The proposed vocabulary extension lowers the tokenization premium for low-resource languages while keeping model representations consistent, which improves efficiency and practicality in real applications. Abstract: Relative to English, low-resource languages suffer from substantial tokenization premiums in modern LMs, meaning that it generally requires several times as many tokens to encode a sentence in a low-resource language as to encode the analogous sentence in English. This tokenization premium results in increased API and energy costs and reduced effective context windows for these languages. In this paper we analyze the tokenizers of ten popular LMs to better understand their designs and per-language tokenization premiums. We also propose a mechanism to reduce tokenization premiums in pre-trained models, by post-hoc additions to the token vocabulary that coalesce multi-token characters into single tokens. We apply this methodology to 12 low-resource languages, demonstrating that the original and compressed inputs often have similar last hidden states when run through the Llama 3.2 1B model.
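To make the notion of a tokenization premium concrete, here is a small sketch under strong simplifying assumptions: a toy byte-level tokenizer (one token per UTF-8 byte) stands in for a real LM tokenizer, and the "post-hoc vocabulary addition" is modeled as giving each multi-byte character its own new token ID. The function names and the example strings are hypothetical and are not taken from the paper.

```python
def byte_tokenize(text):
    """Toy tokenizer: one token per UTF-8 byte (token IDs 0..255)."""
    return list(text.encode("utf-8"))

def build_char_vocab(text, base_size=256):
    """Assign a new token ID to every character that costs more than one byte token."""
    vocab = {}
    for ch in text:
        if len(ch.encode("utf-8")) > 1 and ch not in vocab:
            vocab[ch] = base_size + len(vocab)
    return vocab

def compressed_tokenize(text, char_vocab):
    """Tokenize with the extended vocabulary: added characters become single tokens."""
    tokens = []
    for ch in text:
        if ch in char_vocab:
            tokens.append(char_vocab[ch])
        else:
            tokens.extend(ch.encode("utf-8"))
    return tokens

def tokenization_premium(n_tokens_target, n_tokens_english):
    """Ratio of token counts for analogous sentences (target language vs. English)."""
    return n_tokens_target / n_tokens_english

english = "hello"
amharic = "\u1230\u120b\u121d"  # a three-character Ethiopic greeting, 3 bytes per character

before = tokenization_premium(len(byte_tokenize(amharic)), len(byte_tokenize(english)))
vocab = build_char_vocab(amharic)
after = tokenization_premium(len(compressed_tokenize(amharic, vocab)), len(byte_tokenize(english)))
print(before, after)
```

In this toy setting the premium drops from 9/5 to 3/5 once the three characters are single tokens. A real model would also need embeddings for the new tokens, which this sketch does not address.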

[118] RegCheck: A tool for automating comparisons between study registrations and papers

Jamie Cummins,Beth Clarke,Ian Hussey,Malte Elson

Main category: cs.CL

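TL;DR: This paper introduces RegCheck, a modular LLM-assisted tool that helps researchers, reviewers, and editors compare study registrations with their corresponding papers, improving the transparency and reproducibility of scientific research.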

Details Motivation: Study registration benefits scientific transparency and rigour, but registrations often go unexamined and manual checking is time- and labour-intensive, so efficient tooling is needed. Method: Develops RegCheck, an LLM-based tool with a human-in-the-loop design in which users decide which features to compare and the tool presents the relevant text to support discrepancy judgements; it also generates shareable, verifiable reports. Result: RegCheck can be used flexibly across disciplines and formats, supports an extensible research infrastructure, and an example use case illustrates its potential for reproducible science. Conclusion: By combining human expertise with AI capabilities, RegCheck improves the efficiency and reliability of comparing study registrations with published papers and could become an important tool for advancing open science. Abstract: Across the social and medical sciences, researchers recognize that specifying planned research activities (i.e., 'registration') prior to the commencement of research has benefits for both the transparency and rigour of science. Despite this, evidence suggests that study registrations frequently go unexamined, minimizing their effectiveness. In a way this is no surprise: manually checking registrations against papers is labour- and time-intensive, requiring careful reading across formats and expertise across domains. The advent of AI unlocks new possibilities in facilitating this activity. We present RegCheck, a modular LLM-assisted tool designed to help researchers, reviewers, and editors from across scientific disciplines compare study registrations with their corresponding papers. Importantly, RegCheck keeps human expertise and judgement in the loop by (i) ensuring that users are the ones who determine which features should be compared, and (ii) presenting the most relevant text associated with each feature to the user, facilitating (rather than replacing) human discrepancy judgements. RegCheck also generates shareable reports with unique RegCheck IDs, enabling them to be easily shared and verified by other users. RegCheck is designed to be adaptable across scientific domains, as well as registration and publication formats. In this paper we provide an overview of the motivation, workflow, and design principles of RegCheck, and we discuss its potential as an extensible infrastructure for reproducible science with an example use case.

[119] AfroScope: A Framework for Studying the Linguistic Landscape of Africa

Sang Yun Kwon,AbdelRahim Elmadany,Muhammad Abdul-Mageed

Main category: cs.CL

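TL;DR: This paper presents AfroScope, a unified framework for African language identification (LID) that includes AfroScope-Data, a dataset covering 713 African languages, and AfroScope-Models, a suite of strong LID models. To better separate highly confusable languages, it introduces a hierarchical classification approach based on the Mirror-Serengeti embedding model, which improves macro F1 on the relevant subset. The study also analyzes cross-lingual transfer and domain effects, supports large-scale measurement of Africa's linguistic landscape in digital text, and publicly releases the data and models.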

Details Motivation: Existing African language identification methods are limited in the number of supported languages and in making fine-grained distinctions among closely related varieties, so a more comprehensive and fine-grained LID system is needed to support downstream NLP tasks. Method: Proposes the AfroScope framework, comprising the large-scale AfroScope-Data dataset and the accompanying AfroScope-Models; adopts a hierarchical classification approach that uses Mirror-Serengeti, an embedding model specialized for 29 highly confusable languages, to improve discrimination. Result: On the confusable-language subset, the new approach improves macro F1 by 4.55 over the best base model; experiments also analyze cross-lingual transfer and domain effects. Conclusion: AfroScope substantially extends the language coverage and accuracy of African language identification, especially for closely related languages, and provides a strong tool for large-scale analysis of African-language digital text; the data and models are publicly released. Abstract: Language Identification (LID) is the task of determining the language of a given text and is a fundamental preprocessing step that affects the reliability of downstream NLP applications. While recent work has expanded LID coverage for African languages, existing approaches remain limited in (i) the number of supported languages and (ii) their ability to make fine-grained distinctions among closely related varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 713 African languages, and AfroScope-Models, a suite of strong LID models with broad language coverage. To better distinguish highly confusable languages, we propose a hierarchical classification approach that leverages Mirror-Serengeti, a specialized embedding model targeting 29 closely related or geographically proximate languages. This approach improves macro F1 by 4.55 on this confusable subset compared to our best base model. Finally, we analyze cross-linguistic transfer and domain effects, offering guidance for building robust African LID systems. We position African LID as an enabling technology for large-scale measurement of Africa's linguistic landscape in digital text and release AfroScope-Data and AfroScope-Models publicly.

[120] LLM-as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction

Yuxing Lu,J. Ben Tamo,Weichen Zhao,Nan Sun,Yishan Zhong,Wenqi Shi,Jinzhuo Wang,May D. Wang

Main category: cs.CL

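TL;DR: Proposes LLM-as-RNN, a framework that gives a large language model an RNN-like inference mechanism through natural-language memory, enabling online learning without parameter updates and significantly improving predictive accuracy on tasks in multiple domains.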

Details Motivation: Standard inference relies on a fixed context and lacks an updatable memory mechanism for correcting errors and improving continually during generation. Method: Designs LLM-as-RNN, an inference-only framework that represents the model's hidden state as natural-language memory and updates that memory at every step through feedback-driven text rewrites. Result: Evaluated on sequential tasks in healthcare, meteorology, and finance, it improves accuracy by 6.5% on average over zero-shot, full-history, and MemPrompt baselines. Conclusion: A frozen LLM can achieve RNN-like online learning through a natural-language memory mechanism, with stronger sequence prediction and interpretability. Abstract: Large language models are strong sequence predictors, yet standard inference relies on immutable context histories. After making an error at generation step t, the model lacks an updatable memory mechanism that improves predictions for step t+1. We propose LLM-as-RNN, an inference-only framework that turns a frozen LLM into a recurrent predictor by representing its hidden state as natural-language memory. This state, implemented as a structured system-prompt summary, is updated at each timestep via feedback-driven text rewrites, enabling learning without parameter updates. Under a fixed token budget, LLM-as-RNN corrects errors and retains task-relevant patterns, effectively performing online learning through language. We evaluate the method on three sequential benchmarks in healthcare, meteorology, and finance across Llama, Gemma, and GPT model families. LLM-as-RNN significantly outperforms zero-shot, full-history, and MemPrompt baselines, improving predictive accuracy by 6.5% on average, while producing interpretable, human-readable learning traces absent in standard context accumulation.

[121] Sockpuppetting: Jailbreaking LLMs Without Optimization Through Output Prefix Injection

Asen Dotsinski,Panagiotis Eustratiadis

Main category: cs.CL

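TL;DR: This paper proposes "sockpuppetting", a simple method that jailbreaks open-weight LLMs by inserting an acceptance sequence at the start of the model's output. It requires a single line of code and no optimization, and achieves up to 80% higher attack success rate than GCG.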

Details Motivation: As open-weight LLMs grow more capable, guarding against malicious prompts and understanding potential attack vectors becomes increasingly important. Existing automated jailbreaking methods such as GCG are effective but typically require substantial computational resources and specialized expertise, which limits their broad use. Method: Proposes "sockpuppetting", which inserts an acceptance sequence (e.g., "Sure, here is how to...") at the start of the model's output and lets the model complete the response; also explores a hybrid approach that optimizes the adversarial suffix within the assistant message block rather than the user prompt. Result: Sockpuppetting achieves up to 80% higher attack success rate on Qwen3-8B in per-prompt comparisons, and the hybrid approach raises attack success rate by 64% over GCG on Llama-3.1-8B. Conclusion: Sockpuppetting is a low-cost, effective attack that low-skill adversaries can readily exploit, underscoring the need for defenses against output-prefix injection in open-weight models. Abstract: As open-weight large language models (LLMs) increase in capabilities, safeguarding them against malicious prompts and understanding possible attack vectors becomes ever more important. While automated jailbreaking methods like GCG [Zou et al., 2023] remain effective, they often require substantial computational resources and specific expertise. We introduce "sockpuppetting", a simple method for jailbreaking open-weight LLMs by inserting an acceptance sequence (e.g., "Sure, here is how to...") at the start of a model's output and allowing it to complete the response. Requiring only a single line of code and no optimization, sockpuppetting achieves up to 80% higher attack success rate (ASR) than GCG on Qwen3-8B in per-prompt comparisons. We also explore a hybrid approach that optimizes the adversarial suffix within the assistant message block rather than the user prompt, increasing ASR by 64% over GCG on Llama-3.1-8B in a prompt-agnostic setting. The results establish sockpuppetting as an effective low-cost attack accessible to unsophisticated adversaries, highlighting the need for defences against output-prefix injection in open-weight models.

[122] Recurrent Confidence Chain: Temporal-Aware Uncertainty Quantification in Large Language Models

Zhenjiang Mao,Anirudhh Venkat

Main category: cs.CL

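TL;DR: This paper proposes a new method that combines inter-step attention with a hidden confidence mechanism to improve uncertainty estimation over long reasoning chains in large language models, achieving a better balance between predictive quality and confidence calibration.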

Details Motivation: Existing approaches to long reasoning sequences overlook how confidence propagates over time, so overall confidence is overestimated and low-confidence reasoning steps are hard to identify, which can lead to serious hallucinations. Method: Introduces an inter-step attention mechanism to analyze semantic correlations across reasoning steps, and a hidden confidence mechanism that retains historical confidence information and combines it with stepwise confidence to produce a more accurate overall estimate. Result: On the GAOKAO math benchmark and the CLadder causal reasoning dataset, experiments with mainstream open-source LLMs show that the method outperforms state-of-the-art methods on Negative Log-Likelihood and Expected Calibration Error. Conclusion: The method assesses uncertainty in long reasoning more accurately, mitigates the confidence inflation that arises when early steps have low confidence, and improves model reliability. Abstract: As reasoning modules, such as the chain-of-thought mechanism, are applied to large language models, they achieve strong performance on various tasks such as answering common-sense questions and solving math problems. The main challenge now is to assess the uncertainty of answers, which can help prevent misleading or serious hallucinations for users. Although current methods analyze long reasoning sequences by filtering unrelated tokens and examining potential connections between nearby tokens or sentences, the temporal spread of confidence is often overlooked. This oversight can lead to inflated overall confidence, even when earlier steps exhibit very low confidence. To address this issue, we propose a novel method that incorporates inter-step attention to analyze semantic correlations across steps. For handling long-horizon responses, we introduce a hidden confidence mechanism to retain historical confidence information, which is then combined with stepwise confidence to produce a more accurate overall estimate. We evaluate our method on the GAOKAO math benchmark and the CLadder causal reasoning dataset using mainstream open-source large language models. Our approach is shown to outperform state-of-the-art methods by achieving a superior balance between predictive quality and calibration, demonstrated by strong performance on both Negative Log-Likelihood and Expected Calibration Error.
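
The two metrics named in this entry can be made concrete with a short sketch. It assumes each answer comes with a scalar confidence in [0, 1] and a binary correctness label, scores NLL as the binary log loss of those labels, and computes ECE with ten equal-width bins. The binning scheme and function names are assumptions chosen for illustration; the paper's exact evaluation protocol may differ.

```python
import math


def negative_log_likelihood(confidences, correct, eps=1e-12):
    """Mean binary NLL of the correctness labels under the stated confidences."""
    total = 0.0
    for c, y in zip(confidences, correct):
        c = min(max(c, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -math.log(c) if y else -math.log(1.0 - c)
    return total / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, y in bucket if y) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical example: four answers at 0.9 confidence, only three of them correct.
confs = [0.9, 0.9, 0.9, 0.9]
labels = [1, 1, 1, 0]
print(expected_calibration_error(confs, labels))  # gap between 0.9 confidence and 0.75 accuracy
print(negative_log_likelihood(confs, labels))
```

ECE measures only the gap between stated confidence and observed accuracy, while NLL also penalizes individual confident errors, which is one reason the two are commonly reported together.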

[123] Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning

Zhenjiang Mao,Anirudhh Venkat,Artem Bisliouk,Akshat Kothiyal,Sindhura Kumbakonam Subramanian,Saithej Singhu,Ivan Ruchkin

Main category: cs.CL

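TL;DR: Proposes a stepwise confidence estimation method based on Signal Temporal Logic (STL) and hypernetworks to improve confidence calibration for large language models in multi-step reasoning.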

Details Motivation: Existing confidence estimation methods compress an entire reasoning process into a single scalar score, ignoring how confidence evolves during generation, which leads to misjudging incorrect reasoning. Method: Uses Signal Temporal Logic (STL) to characterize confidence changes across multi-step reasoning, discovers temporal patterns that distinguish correct from incorrect responses through discriminative STL mining, and builds a dynamic confidence estimation method with parameter hypernetworks. Result: Experiments on multiple reasoning tasks show that the resulting confidence scores are better calibrated than the baselines and that the STL patterns generalize across tasks. Conclusion: The method models the evolution of confidence during reasoning at a finer granularity, distinguishes correct reasoning from confidently stated errors, and improves LLM reliability on complex tasks. Abstract: Large Language Models (LLMs) increasingly rely on long-form, multi-step reasoning to solve complex tasks such as mathematical problem solving and scientific question answering. Despite strong performance, existing confidence estimation methods typically reduce an entire reasoning process to a single scalar score, ignoring how confidence evolves throughout the generation. As a result, these methods are often sensitive to superficial factors such as response length or verbosity, and struggle to distinguish correct reasoning from confidently stated errors. We propose to characterize the stepwise confidence signal using Signal Temporal Logic (STL). Using a discriminative STL mining procedure, we discover temporal formulas that distinguish confidence signals of correct and incorrect responses. Our analysis found that the STL patterns generalize across tasks, and numeric parameters exhibit sensitivity to individual questions. Based on these insights, we develop a confidence estimation approach that informs STL blocks with parameter hypernetworks. Experiments on multiple reasoning tasks show our confidence scores are more calibrated than the baselines.

[124] Structured Insight from Unstructured Data: Large Language Models for SDOH-Driven Diabetes Risk Prediction

Sasha Ronaghi,Prerit Choudhary,David H Rehkopf,Bryant Lin

Main category: cs.CL

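TL;DR: This study explores using large language models (LLMs) to extract social determinants of health (SDOH) information from the life narratives of older patients with type 2 diabetes and evaluates its value for predicting diabetes control.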

Details Motivation: Existing electronic health records and risk prediction models usually lack individual-level SDOH data, and traditional structured screening tools struggle to capture the complexity of patient experiences and the unique needs of specific populations. Method: Collects unstructured interviews from 65 T2D patients aged 65 and older and analyzes them with retrieval-augmented LLMs to produce qualitative summaries and structured SDOH ratings; combines the SDOH ratings with traditional biomarkers as inputs to linear and tree-based machine learning models (Ridge, Lasso, Random Forest, XGBoost) for risk prediction; also evaluates LLMs' ability to predict diabetes control level directly from text. Result: LLMs reach 60% accuracy in predicting diabetes control level (low, medium, high) directly from interview text; structured SDOH ratings can be integrated effectively into conventional risk prediction workflows and improve model performance. Conclusion: LLMs can turn unstructured SDOH-related narratives into structured information usable for clinical risk modeling and decision support, offering a scalable way to augment existing clinical prediction systems. Abstract: Social determinants of health (SDOH) play a critical role in Type 2 Diabetes (T2D) management but are often absent from electronic health records and risk prediction models. Most individual-level SDOH data is collected through structured screening tools, which lack the flexibility to capture the complexity of patient experiences and unique needs of a clinic's population. This study explores the use of large language models (LLMs) to extract structured SDOH information from unstructured patient life stories and evaluate the predictive value of both the extracted features and the narratives themselves for assessing diabetes control. We collected unstructured interviews from 65 T2D patients aged 65 and older, focused on their lived experiences, social context, and diabetes management. These narratives were analyzed using LLMs with retrieval-augmented generation to produce concise, actionable qualitative summaries for clinical interpretation and structured quantitative SDOH ratings for risk prediction modeling. The structured SDOH ratings were used independently and in combination with traditional laboratory biomarkers as inputs to linear and tree-based machine learning models (Ridge, Lasso, Random Forest, and XGBoost) to demonstrate how unstructured narrative data can be applied in conventional risk prediction workflows. Finally, we evaluated several LLMs on their ability to predict a patient's level of diabetes control (low, medium, high) directly from interview text with A1C values redacted. LLMs achieved 60% accuracy in predicting diabetes control levels from interview text. This work demonstrates how LLMs can translate unstructured SDOH-related data into structured insights, offering a scalable approach to augment clinical risk models and decision-making.

[125] Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks

Shlok Shelat,Jay Raval,Souvik Roy,Manas Gaur

Main category: cs.CL

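TL;DR: This paper presents a benchmark that evaluates LLMs' ability to construct deterministic finite automata (DFAs) from regular languages, finding that models perform well on familiar tasks but lose substantial accuracy on new problems, which exposes a fundamental weakness in formal reasoning.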

Details Motivation: To determine whether LLM performance on formal language tasks stems from genuine symbolic reasoning or merely from pattern matching on common constructions. Method: Builds a benchmark comprising factual questions, seen DFA construction problems, and two types of unseen problems (hand-crafted instances with multiple interacting constraints and problems generated systematically via Arden's theorem), evaluates multiple prompting strategies (direct, Chain-of-Thought, Tree-of-Thought), and introduces a three-stage hint protocol to analyze error correction. Result: Models achieve perfect accuracy on factual questions and 84-90% on seen tasks, but accuracy drops by 30-64% on unseen problems; the main errors are misreading language constraints, mishandling Kleene-star semantics, and global inconsistency; the three-stage hint protocol corrects shallow errors but cannot fix structural flaws, and no prompting strategy eliminates the errors. Conclusion: LLMs can generate syntactically plausible DFAs but lack semantically correct formal reasoning, revealing a fundamental limitation in genuine symbolic reasoning. Abstract: Large language models (LLMs) have demonstrated strong performance on formal language tasks, yet whether this reflects genuine symbolic reasoning or pattern matching on familiar constructions remains unclear. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages, comprising factual knowledge questions, seen construction problems from public sources, and two types of unseen problems: hand-crafted instances with multiple interacting constraints and systematically generated problems via Arden's theorem. Models achieve perfect accuracy on factual questions and 84-90% on seen tasks. However, accuracy drops sharply on unseen problems (by 30-64%), with failures stemming from systematic misinterpretation of language constraints, incorrect handling of Kleene-star semantics, and a failure to preserve global consistency. We evaluate a three-stage hint protocol that enables correction of shallow errors but does not reliably resolve globally inconsistent or structurally flawed automata. Our analysis across multiple prompting strategies (direct, Chain-of-Thought, Tree-of-Thought) reveals that errors persist regardless of prompting approach, exposing a fundamental gap between LLMs' ability to generate syntactically plausible DFAs and their capacity for semantically correct formal reasoning.
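
The gap this entry describes, between a DFA that looks well formed and one that accepts exactly the intended language, can be illustrated with a small checker. The sketch below simulates a hand-written DFA for "binary strings ending in 01" and compares it against a direct string predicate on every word up to a fixed length. The example language, state names, and the helper `find_mismatches` are illustrative assumptions and are not drawn from the benchmark.

```python
from itertools import product


def run_dfa(transitions, start, accepting, word):
    """Follow one transition per symbol and report whether the final state accepts."""
    state = start
    for symbol in word:
        state = transitions[(state, symbol)]
    return state in accepting


# q0: no useful suffix yet, q1: last symbol was "0", q2: last two symbols were "01".
ENDS_IN_01 = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q0",
}


def find_mismatches(max_len=6):
    """Return every word up to max_len where the DFA disagrees with the specification."""
    bad = []
    for n in range(max_len + 1):
        for letters in product("01", repeat=n):
            word = "".join(letters)
            if run_dfa(ENDS_IN_01, "q0", {"q2"}, word) != word.endswith("01"):
                bad.append(word)
    return bad


print(find_mismatches())  # an empty list means no disagreement was found up to length 6
```

Bounded exhaustive testing of this kind can expose a semantically wrong automaton but does not prove correctness; a full check would compare the candidate against a reference DFA for language equivalence.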

[126] Trust Me, I'm an Expert: Decoding and Steering Authority Bias in Large Language Models

Priyanka Mary Mammen,Emil Joswin,Shankar Venkitachalam

Main category: cs.CL

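TL;DR: The study shows that language models are influenced by source credibility on reasoning tasks: when higher-authority experts give incorrect advice, models are more easily misled and become more confident in their wrong answers. It also finds that this authority bias is encoded in the model's internal mechanisms and can be reduced through steering to improve performance.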

Details Motivation: To examine whether language models show systematic bias on reasoning tasks depending on the expertise level of the source of a suggestion, particularly the misleading influence of high-authority sources. Method: Across 4 datasets spanning mathematical, legal, and medical reasoning, evaluates 11 models using personas representing four expertise levels and analyzes how authority level affects model accuracy and confidence. Result: Models are more susceptible to incorrect endorsements from high-authority sources, showing lower accuracy and higher confidence in wrong answers; authority bias is also found to be mechanistically encoded within the model. Conclusion: Language models exhibit authority bias, but it can be mitigated through mechanistic intervention, improving robustness to misleading expert opinions. Abstract: Prior research demonstrates that performance of language models on reasoning tasks can be influenced by suggestions, hints and endorsements. However, the influence of endorsement source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the provider of the endorsement. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect/misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.

[127] MOSLD-Bench: Multilingual Open-Set Learning and Discovery Benchmark for Text Categorization

Adriana-Valentina Costache,Daria-Nicoleta Dragomir,Silviu-Florin Gheorghe,Eduard Poesina,Paul Irofti,Radu Tudor Ionescu

Main category: cs.CL

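TL;DR: This paper introduces the first multilingual open-set learning and discovery (MOSLD) benchmark for text categorization, comprising 960K samples across 12 languages, proposes a multi-stage framework that continuously discovers and learns new classes, and evaluates several language models to provide reference results for future research.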

Details Motivation: Zero-shot learning has been studied extensively for text classification, but open-set learning and discovery (OSLD) is still new to the text domain and lacks multilingual support in particular, so a standardized multilingual benchmark is needed to advance the area. Method: Builds the MOSLD benchmark by rearranging existing datasets and collecting new samples from the news domain; proposes a new framework that integrates multiple stages to continuously discover and learn unknown classes of text. Result: Releases the MOSLD benchmark with 960K samples across 12 languages, evaluates several language models, and provides baseline results for future research. Conclusion: The MOSLD benchmark and multi-stage framework provide an important foundation for multilingual open-set text classification and help advance research on open-world text classification. Abstract: Open-set learning and discovery (OSLD) is a challenging machine learning task in which samples from new (unknown) classes can appear at test time. It can be seen as a generalization of zero-shot learning, where the new classes are not known a priori, hence involving the active discovery of new classes. While zero-shot learning has been extensively studied in text classification, especially with the emergence of pre-trained language models, open-set learning and discovery is a comparatively new setup for the text domain. To this end, we introduce the first multilingual open-set learning and discovery (MOSLD) benchmark for text categorization by topic, comprising 960K data samples across 12 languages. To construct the benchmark, we (i) rearrange existing datasets and (ii) collect new data samples from the news domain. Moreover, we propose a novel framework for the OSLD task, which integrates multiple stages to continuously discover and learn new classes. We evaluate several language models, including our own, to obtain results that can be used as reference for future work. We release our benchmark at https://github.com/Adriana19Valentina/MOSLD-Bench.

[128] PhysicsSolutionAgent: Towards Multimodal Explanations for Numerical Physics Problem Solving

Aditya Thole,Anmol Agrawal,Arnav Ramamoorthy,Dhruv Kumar

Main category: cs.CL

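TL;DR: This paper presents PhysicsSolutionAgent (PSA), an autonomous agent that generates physics-problem explanation videos using Manim animations, assesses video quality through automated checks and vision-language model feedback, and reveals key challenges in multimodal reasoning and visual teaching.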

Details Motivation: Existing LLMs perform well on physics problems in textual form but remain weak at generating long, high-quality visual explanations, so visual reasoning methods that improve understanding of physics concepts need to be explored. Method: Develops PhysicsSolutionAgent (PSA), an autonomous agent that uses Manim to generate physics explanation videos of up to six minutes, and designs an automated assessment pipeline covering 15 quantitative parameters that incorporates vision-language model (VLM) feedback for iterative improvement. Result: In an evaluation of 32 videos spanning numerical and theoretical physics problems, PSA with GPT-5-mini achieves a 100% video-completion rate and an average automated score of 3.8/5, but human inspection finds issues such as inconsistent visual layout and errors in interpreting visual content. Conclusion: Current multimodal systems remain limited in reliably generating Manim code and visual explanations; visual understanding, verification, and evaluation frameworks need to improve to advance future multimodal educational systems. Abstract: Explaining numerical physics problems often requires more than text-based solutions; clear visual reasoning can substantially improve conceptual understanding. While large language models (LLMs) demonstrate strong performance on many physics questions in textual form, their ability to generate long, high-quality visual explanations remains insufficiently explored. In this work, we introduce PhysicsSolutionAgent (PSA), an autonomous agent that generates physics-problem explanation videos of up to six minutes using Manim animations. To evaluate the generated videos, we design an assessment pipeline that performs automated checks across 15 quantitative parameters and incorporates feedback from a vision-language model (VLM) to iteratively improve video quality. We evaluate PSA on 32 videos spanning numerical and theoretical physics problems. Our results reveal systematic differences in video quality depending on problem difficulty and whether the task is numerical or theoretical. Using GPT-5-mini, PSA achieves a 100% video-completion rate with an average automated score of 3.8/5. However, qualitative analysis and human inspection uncover both minor and major issues, including visual layout inconsistencies and errors in how visual content is interpreted during feedback. These findings expose key limitations in reliable Manim code generation and highlight broader challenges in multimodal reasoning and evaluation for visual explanations of numerical physics problems. Our work underscores the need for improved visual understanding, verification, and evaluation frameworks in future multimodal educational systems.

[129] Anonpsy: A Graph-Based Framework for Structure-Preserving De-identification of Psychiatric Narratives

Kyung Ho Lim,Byung-Hoon Kim

Main category: cs.CL

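TL;DR: Anonpsy is a graph-guided semantic rewriting framework for de-identifying psychiatric narratives: it converts text into a semantic graph and regenerates it under constraints, substantially lowering re-identification risk while preserving diagnostic fidelity.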

Details Motivation: Existing de-identification methods (such as PHI masking and LLM-based synthetic rewriting) operate only at the text level and offer little control over which semantic elements are preserved or altered; in particular, they cannot handle the identification risk posed by idiosyncratic life events embedded in the clinical structure. Method: Anonpsy converts each narrative into a semantic graph containing clinical entities, temporal anchors, and typed relations; applies graph-constrained perturbations that modify identifying context while preserving essential clinical structure; and regenerates text via graph-conditioned LLM generation. Result: Evaluated on 90 clinician-authored psychiatric case narratives, Anonpsy maintains low re-identification risk under expert, semantic, and GPT-5-based evaluations while preserving diagnostic fidelity; compared with a strong LLM baseline, semantic similarity and identifiability are substantially lower. Conclusion: Combining explicit structural representations with constrained generation de-identifies psychiatric narratives effectively and outperforms purely text-level approaches. Abstract: Psychiatric narratives encode patient identity not only through explicit identifiers but also through idiosyncratic life events embedded in their clinical structure. Existing de-identification approaches, including PHI masking and LLM-based synthetic rewriting, operate at the text level and offer limited control over which semantic elements are preserved or altered. We introduce Anonpsy, a de-identification framework that reformulates the task as graph-guided semantic rewriting. Anonpsy (1) converts each narrative into a semantic graph encoding clinical entities, temporal anchors, and typed relations; (2) applies graph-constrained perturbations that modify identifying context while preserving clinically essential structure; and (3) regenerates text via graph-conditioned LLM generation. Evaluated on 90 clinician-authored psychiatric case narratives, Anonpsy preserves diagnostic fidelity while achieving consistently low re-identification risk under expert, semantic, and GPT-5-based evaluations. Compared with a strong LLM-only rewriting baseline, Anonpsy yields substantially lower semantic similarity and identifiability. These results demonstrate that explicit structural representations combined with constrained generation provide an effective approach to de-identification for psychiatric narratives.

[130] When Wording Steers the Evaluation: Framing Bias in LLM judges

Yerin Hwang,Dongryeol Lee,Taegwan Kang,Minwoo Lee,Kyomin Jung

Main category: cs.CL

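TL;DR: This paper studies how prompt wording (the framing effect) influences LLM-based evaluation, finding that positive and negative phrasings significantly affect model judgments, which indicates a structural framing bias in current LLM evaluation systems.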

Details Motivation: LLMs respond differently to different prompt phrasings, and in high-stakes evaluation tasks this framing bias may undermine the stability and impartiality of judgments, yet its impact has not been adequately studied. Method: Inspired by the framing effect in psychology, designs symmetric predicate-positive and predicate-negative prompts and tests 14 LLM judges on four high-stakes evaluation tasks, analyzing the differences in their judgments. Result: All LLMs are significantly affected by prompt framing, and model families show systematic tendencies toward agreement or rejection. Conclusion: Framing bias is a structural flaw of current LLM evaluation systems, and framing-aware evaluation protocols are needed to improve reliability and fairness. Abstract: Large language models (LLMs) are known to produce varying responses depending on prompt phrasing, indicating that subtle guidance in phrasing can steer their answers. However, the impact of this framing bias on LLM-based evaluation, where models are expected to make stable and impartial judgments, remains largely underexplored. Drawing inspiration from the framing effect in psychology, we systematically investigate how deliberate prompt framing skews model judgments across four high-stakes evaluation tasks. We design symmetric prompts using predicate-positive and predicate-negative constructions and demonstrate that such framing induces significant discrepancies in model outputs. Across 14 LLM judges, we observe clear susceptibility to framing, with model families showing distinct tendencies toward agreement or rejection. These findings suggest that framing bias is a structural property of current LLM-based evaluation systems, underscoring the need for framing-aware protocols.

[131] HateXScore: A Metric Suite for Evaluating Reasoning Quality in Hate Speech Explanations

Yujia Hu,Roy Ka-Wei Lee

Main category: cs.CL

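TL;DR: Proposes HateXScore, a four-component metric suite for evaluating the reasoning quality of explanations from hate speech detection models, which reveals interpretability failures and annotation inconsistencies that standard metrics miss.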

Details Motivation: Existing evaluation frameworks for hate speech detection rarely assess why a text is deemed hateful and lack systematic evaluation of whether model explanations are sound. Method: Designs HateXScore, a four-component suite covering conclusion explicitness, faithfulness and causal grounding of quoted spans, protected group identification, and logical consistency, and evaluates it on six different datasets. Result: HateXScore reveals interpretability failures and annotation inconsistencies in model explanations and agrees strongly with human evaluation. Conclusion: HateXScore can serve as a diagnostic tool that complements traditional metrics and improves the trustworthiness and transparency of content moderation systems. Abstract: Hateful speech detection is a key component of content moderation, yet current evaluation frameworks rarely assess why a text is deemed hateful. We introduce HateXScore, a four-component metric suite designed to evaluate the reasoning quality of model explanations. It assesses (i) conclusion explicitness, (ii) faithfulness and causal grounding of quoted spans, (iii) protected group identification (policy-configurable), and (iv) logical consistency among these elements. Evaluated on six diverse hate speech datasets, HateXScore is intended as a diagnostic complement to reveal interpretability failures and annotation inconsistencies that are invisible to standard metrics like Accuracy or F1. Moreover, human evaluation shows strong agreement with HateXScore, validating it as a practical tool for trustworthy and transparent moderation. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.

[132] Comparing Without Saying: A Dataset and Benchmark for Implicit Comparative Opinion Mining from Same-User Reviews

Thanh-Lam T. Nguyen,Ngoc-Quang Le,Quoc-Trung Phu,Thi-Phuong Le,Ngoc-Huyen Pham,Phuong-Nguyen Nguyen,Hoang-Quynh Le

Main category: cs.CL

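TL;DR: This paper introduces SUDO, a dataset for mining implicit comparative opinions from same-user reviews, filling the gap left by prior work that focuses mainly on explicit comparative expressions.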

Details Motivation: Existing research concentrates on explicit comparative expressions, while implicit comparisons are more common in real reviews but remain underexplored. Method: Builds SUDO, a bi-level dataset of 4,150 annotated review pairs, and benchmarks it with two baseline architectures: traditional machine learning and language-model-based. Result: Language-model-based methods outperform traditional methods, but overall performance remains limited, indicating that the task is challenging. Conclusion: SUDO provides a valuable benchmark for implicit comparative opinion mining, reveals the difficulty of the task, and supports future research. Abstract: Existing studies on comparative opinion mining have mainly focused on explicit comparative expressions, which are uncommon in real-world reviews. This leaves implicit comparisons - where users express preferences across separate reviews - largely underexplored. We introduce SUDO, a novel dataset for implicit comparative opinion mining from same-user reviews, allowing reliable inference of user preferences even without explicit comparative cues. SUDO comprises 4,150 annotated review pairs (15,191 sentences) with a bi-level structure capturing aspect-level mentions and review-level preferences. We benchmark this task using two baseline architectures: traditional machine learning- and language model-based baselines. Experimental results show that while the latter outperforms the former, overall performance remains moderate, revealing the inherent difficulty of the task and establishing SUDO as a challenging and valuable benchmark for future research.

[133] TREX: Tokenizer Regression for Optimal Data Mixture

Inho Won,Hangyeol Yoo,Minkyung Cho,Jungyeul Park,Hoyun Song,KyungTae Lim

Main category: cs.CL

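TL;DR: This paper proposes TREX, a regression-based framework that predicts the optimal data mixture for training tokenizers for multilingual LLMs, improving compression efficiency and reducing training cost.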

Details Motivation: Existing multilingual tokenizer design relies on heuristics or costly searches to determine language data mixture ratios and lacks an efficient, accurate method. Method: TREX trains small-scale proxy tokenizers on random mixtures, collects their compression statistics, and trains a regression model to predict compression performance under different data mixtures, enabling fast search for the optimal mixture. Result: Tokenizers trained with TREX's predicted mixtures achieve up to 12% better in- and out-of-distribution compression efficiency than the LLaMA3 and uniform-distribution baselines. Conclusion: TREX mitigates the accuracy-cost trade-off in multilingual tokenizer design and shows good scalability, robustness, and practical value. Abstract: Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TREX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.

[134] Vulnerability of LLMs' Belief Systems? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions

Fan Huang,Haewoon Kwak,Jisun An

Main category: cs.CL

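TL;DR: Under the SMCR communication framework, this paper systematically evaluates how susceptible large language models (LLMs) are to persuasion over multi-turn interactions, finding that smaller models change beliefs more readily, that meta-cognition prompting increases vulnerability, and that adversarial fine-tuning helps some models but with model-dependent effect.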

Details Motivation: Recent studies show that LLMs are easily persuaded into adopting counterfactual beliefs, so their belief stability and the effectiveness of existing defenses need systematic evaluation. Method: Based on the SMCR communication framework, runs multi-turn persuasion experiments on five mainstream LLMs across three domains (factual knowledge, medical QA, social bias) and tests how meta-cognition prompting and adversarial fine-tuning affect resistance to persuasion. Result: Smaller models are highly compliant, with over 80% of belief changes occurring at the first persuasive turn; meta-cognition prompting does not improve robustness and instead accelerates belief erosion; adversarial fine-tuning brings GPT-4o-mini to 98.6% robustness and raises Mistral 7B from 35.7% to 79.3%, but Llama models remain highly susceptible (<14%) even when fine-tuned on their own failure cases. Conclusion: The effect of current robustness interventions is strongly model-dependent, and more effective safeguards need to be designed for different models. Abstract: Large Language Models (LLMs) are increasingly employed in various question-answering tasks. However, recent studies showcase that LLMs are susceptible to persuasion and could adopt counterfactual beliefs. We present a systematic evaluation of LLM susceptibility to persuasion under the Source-Message-Channel-Receiver (SMCR) communication framework. Across five mainstream LLMs and three domains (factual knowledge, medical QA, and social bias), we analyze how different persuasive strategies influence belief stability over multiple interaction turns. We further examine whether meta-cognition prompting (i.e., eliciting self-reported confidence) affects resistance to persuasion. Results show that smaller models exhibit extreme compliance, with over 80% of belief changes occurring at the first persuasive turn (average end turn of 1.1-1.4). Contrary to expectations, meta-cognition prompting increases vulnerability by accelerating belief erosion rather than enhancing robustness. Finally, we evaluate adversarial fine-tuning as a defense. While GPT-4o-mini achieves near-complete robustness (98.6%) and Mistral 7B improves substantially (35.7% → 79.3%), Llama models remain highly susceptible (<14%) even when fine-tuned on their own failure cases. Together, these findings highlight substantial model-dependent limits of current robustness interventions and offer guidance for developing more trustworthy LLMs.

[135] CauScientist: Teaching LLMs to Respect Data for Causal Discovery

Bo Peng,Sirui Chen,Lei Xu,Chaochao Lu

Main category: cs.CL

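TL;DR: This paper proposes CauScientist, a causal discovery framework that combines large language models (LLMs) with probabilistic statistics: LLMs generate hypotheses and statistical methods verify them, substantially improving the accuracy and robustness of causal structure learning.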

Details Motivation: Existing causal discovery methods suffer from statistical indistinguishability, overly strong modeling assumptions, or neglect of statistical evidence, and LLM-based methods can be misled by incorrect priors, so a collaborative framework that couples hypothesis generation with rigorous verification is needed. Method: CauScientist uses hybrid initialization to select the starting graph, has the LLM act as a "data scientist" proposing structural modifications and probabilistic statistics act as a "verifier" testing them, refines the graph iteratively, and uses an error memory to guide the search. Result: Experiments show up to 53.8% F1 improvement over purely data-driven methods and recall rising from 35.0% to 100.0%; on 37-node graphs, SHD is 44.0% lower than Qwen3-32B. Conclusion: By pairing the creativity of LLMs with the rigor of statistical methods, CauScientist improves causal discovery in complex settings and offers a reliable new paradigm for scientific discovery. Abstract: Causal discovery is fundamental to scientific understanding and reliable decision-making. Existing approaches face critical limitations: purely data-driven methods suffer from statistical indistinguishability and modeling assumptions, while recent LLM-based methods either ignore statistical evidence or incorporate unverified priors that can mislead results. To this end, we propose CauScientist, a collaborative framework that synergizes LLMs as hypothesis-generating "data scientists" with probabilistic statistics as rigorous "verifiers". CauScientist employs hybrid initialization to select superior starting graphs, iteratively refines structures through LLM-proposed modifications validated by statistical criteria, and maintains error memory to guide efficient exploration of the search space. Experiments demonstrate that CauScientist substantially outperforms purely data-driven baselines, achieving up to 53.8% F1 score improvement and enhancing recall from 35.0% to 100.0%. Notably, while standalone LLM performance degrades with graph complexity, CauScientist reduces structural Hamming distance (SHD) by 44.0% compared to Qwen3-32B on 37-node graphs. Our project page is at https://github.com/OpenCausaLab/CauScientist.
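
Structural Hamming distance (SHD), one of the metrics reported in this entry, counts how many edge edits separate a predicted graph from the true one. The sketch below uses one common convention in which a missing edge, an extra edge, and a reversed edge each cost 1; some implementations count a reversal as 2, and the abstract does not say which variant the paper uses, so the convention here is an assumption. The function names and the toy graphs are likewise illustrative.

```python
def _by_pair(edges):
    """Group directed edges by the unordered pair of nodes they connect."""
    grouped = {}
    for u, v in edges:
        grouped.setdefault(frozenset((u, v)), set()).add((u, v))
    return grouped


def structural_hamming_distance(true_edges, pred_edges):
    """Count node pairs whose edge status (absent, u->v, v->u) differs between graphs."""
    true_by_pair = _by_pair(true_edges)
    pred_by_pair = _by_pair(pred_edges)
    return sum(
        1
        for pair in true_by_pair.keys() | pred_by_pair.keys()
        if true_by_pair.get(pair) != pred_by_pair.get(pair)
    )


# Hypothetical 4-node example: one reversed, one missing, and one extra edge.
true_graph = [("A", "B"), ("B", "C"), ("C", "D")]
pred_graph = [("A", "B"), ("C", "B"), ("A", "D")]

print(structural_hamming_distance(true_graph, pred_graph))
```

Lower SHD is better, and a distance of 0 means the predicted graph matches the reference edge for edge.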

[136] Activation-Space Anchored Access Control for Multi-Class Permission Reasoning in Large Language Models

Zhaopeng Zhang,Pengcheng Sun,Lan Zhang,Chen Tang,Jiewei Lai,Yunhao Wang,Hui Jin

Main category: cs.CL

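TL;DR: Proposes Activation-space Anchored Access Control (AAAC), a training-free framework that exploits geometric regularities in the intermediate activations of LLMs to enforce fine-grained permission control and reduce leakage of sensitive information.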

Details Motivation: LLMs used for knowledge-base question answering may leak sensitive information beyond a user's permissions, making fine-grained access control requirements hard to meet. Method: Observes that queries under different permission scopes are separable in intermediate activation space, builds an anchor bank with one anchor per permission class, and at inference time steers activations toward the authorized region with a multi-anchor steering mechanism. Result: Experiments on three LLM families show that AAAC reduces permission violation rates by up to 86.5% and prompt-based attack success rates by up to 90.7%, with only minor inference overhead. Conclusion: AAAC is an efficient permission control method that requires no fine-tuning and substantially improves the security of LLM knowledge-base QA while preserving response usability. Abstract: Large language models (LLMs) are increasingly deployed over knowledge bases for efficient knowledge retrieval and question answering. However, LLMs can inadvertently answer beyond a user's permission scope, leaking sensitive content, thus making it difficult to deploy knowledge-base QA under fine-grained access control requirements. In this work, we identify a geometric regularity in intermediate activations: for the same query, representations induced by different permission scopes cluster distinctly and are readily separable. Building on this separability, we propose Activation-space Anchored Access Control (AAAC), a training-free framework for multi-class permission control. AAAC constructs an anchor bank, with one permission anchor per class, from a small offline sample set and requires no fine-tuning. At inference time, a multi-anchor steering mechanism redirects each query's activations toward the anchor-defined authorized region associated with the current user, thereby suppressing over-privileged generations by design. Finally, extensive experiments across three LLM families demonstrate that AAAC reduces permission violation rates by up to 86.5% and prompt-based attack success rates by 90.7%, while improving response usability with minor inference overhead compared to baselines.

[137] Towards Token-Level Text Anomaly Detection

Yang Cao,Bicheng Yu,Sikun Yang,Ming Liu,Yujiu Yang

Main category: cs.CL

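TL;DR: This paper proposes token-level anomaly detection, a new paradigm for fine-grained localization of anomalies within text, builds three benchmark datasets with token-level labels, and shows experimentally that the proposed framework outperforms six baselines.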

Details Motivation: Existing text anomaly detection methods are limited to the document level and cannot pinpoint the anomalous parts of a text, which limits their usefulness in practice. Method: Proposes a token-level anomaly detection framework that supports both document-level and token-level detection in a unified way and validates it on annotated multi-domain datasets. Result: On three benchmark datasets, the framework outperforms six baseline models and achieves fine-grained anomaly localization. Conclusion: Token-level anomaly detection offers a finer-grained solution for identifying text anomalies and advances applications such as spam filtering and fake news detection. Abstract: Despite significant progress in text anomaly detection for web applications such as spam filtering and fake news detection, existing methods are fundamentally limited to document-level analysis, unable to identify which specific parts of a text are anomalous. We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both the document and token levels, and propose a unified detection framework that operates across multiple levels. To facilitate research in this direction, we collect and annotate three benchmark datasets spanning spam, reviews and grammar errors with token-level labels. Experimental results demonstrate that our framework outperforms six baselines, opening new possibilities for precise anomaly localization in text. All code and data are publicly available at https://github.com/charles-cao/TokenCore.

[138] Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge

Xiaolin Zhou,Zheng Luo,Yicheng Gao,Qixuan Chen,Xiyang Hu,Yue Zhao,Ruishan Liu

Main category: cs.CL

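TL;DR: This paper studies language bias in LLM-as-a-judge. In both same-language and inter-language judging, judges favor particular languages (especially English), and this bias cannot be fully explained by low-perplexity bias.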

Details Motivation: Prior work shows that LLM-as-a-judge is biased, and that its behavior with respect to language in particular does not align with human preference, so the concrete forms and causes of its language bias need closer study. Method: The authors analyze LLM judges in same-language and inter-language pairwise comparisons, measure performance disparities across language families, and test the relationship between language bias and perplexity. Result: In same-language judging, European languages significantly outperform African languages, and the bias is more pronounced on culturally related subjects. In inter-language judging, most models favor English answers, and the answer language matters more than the question language. Language bias is only slightly correlated with perplexity and cannot be fully explained by it. Conclusion: LLM-as-a-judge exhibits significant language bias that depends on language type and cultural factors and cannot be attributed to perplexity alone; further work is needed to improve fairness and cross-lingual applicability. Abstract: Recent advances in Large Language Models (LLMs) have incentivized the development of LLM-as-a-judge, an application of LLMs where they are used as judges to decide the quality of a certain piece of text given a certain context. However, previous studies have demonstrated that LLM-as-a-judge can be biased towards different aspects of the judged texts, which often do not align with human preference. One of the identified biases is language bias, which indicates that the decision of LLM-as-a-judge can differ based on the language of the judged texts. In this paper, we study two types of language bias in pairwise LLM-as-a-judge: (1) performance disparity between languages when the judge is prompted to compare options from the same language, and (2) bias towards options written in major languages when the judge is prompted to compare options of two different languages. We find that for same-language judging, there exist significant performance disparities across language families, with European languages consistently outperforming African languages, and this bias is more pronounced in culturally-related subjects. For inter-language judging, we observe that most models favor English answers, and that this preference is influenced more by answer language than question language. Finally, we investigate whether language bias is in fact caused by low-perplexity bias, a previously identified bias of LLM-as-a-judge, and we find that while perplexity is slightly correlated with language bias, language bias cannot be fully explained by perplexity only.

[139] Beyond Known Facts: Generating Unseen Temporal Knowledge to Address Data Contamination in LLM Evaluation

Arthur Amalvy,Hen-Hsen Huang

Main category: cs.CL

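TL;DR: Proposes a contamination-free evaluation dataset built from synthetic future facts, enabling more accurate evaluation of large language models on temporal knowledge graph extraction (TKGE).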

Details Motivation: Existing TKGE evaluation datasets are scarce and suffer from train-test contamination, which can inflate measured LLM performance and leaves the field without a reliable benchmark. Method: The synthetic dataset is built in two steps: temporal knowledge graph forecasting first generates future quadruples, which are filtered against the schema, and an LLM then generates the corresponding textual descriptions. Result: On the new dataset, the performance of EDC, a state-of-the-art LLM-based system, drops, indicating that earlier evaluation results were biased. Conclusion: The method avoids data contamination and provides a long-term, scalable, contamination-free TKGE benchmark; a dataset of 4.2K samples and the generation methodology are publicly released. Abstract: The automatic extraction of information is important for populating large web knowledge bases such as Wikidata. The temporal version of that task, temporal knowledge graph extraction (TKGE), involves extracting temporally grounded facts from text, represented as semantic quadruples (subject, relation, object, timestamp). Many recent systems take advantage of large language models (LLMs), which are becoming a new cornerstone of the web due to their performance on many tasks across the natural language processing (NLP) field. Despite the importance of TKGE, existing datasets for training and evaluation remain scarce, and contamination of evaluation data is an unaddressed issue, potentially inflating LLMs' perceived performance due to overlaps between training and evaluation sets. To mitigate these challenges, we propose a novel synthetic evaluation dataset constructed from predicted future, previously unseen temporal facts, thereby eliminating contamination and enabling robust and unbiased benchmarking. Our dataset creation involves a two-step approach: (1) Temporal Knowledge Graph Forecasting (TKGF) generates plausible future quadruples, which are subsequently filtered to adhere to the original knowledge base schema; (2) LLMs perform quadruple-to-text generation, creating semantically aligned textual descriptions. We benchmark Extract, Define and Canonicalize (EDC), a state-of-the-art LLM-based extraction framework, demonstrating that LLM performance decreases when evaluated on our dataset compared to a dataset of known facts.
We publicly release our dataset consisting of 4.2K future quadruples and corresponding textual descriptions, along with the generation methodology, enabling continuous creation of unlimited future temporal datasets to serve as long-term, contamination-free benchmarks for TKGE.
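
The schema-filtering step in (1) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the paper: the `Quadruple` class, the representation of the schema as a mapping from relation to expected (subject type, object type), and the toy entities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quadruple:
    """A temporally grounded fact: (subject, relation, object, timestamp)."""
    subject: str
    relation: str
    obj: str
    timestamp: str  # ISO date string


# Hypothetical schema: relation -> (expected subject type, expected object type).
SCHEMA = {
    "member_of": ("person", "organization"),
    "head_of_state": ("person", "country"),
}

# Hypothetical entity typing for the toy example.
ENTITY_TYPES = {
    "Alice": "person",
    "Acme": "organization",
    "Freedonia": "country",
}


def conforms(q, schema, entity_types):
    """Return True if the relation is known and both argument types match."""
    expected = schema.get(q.relation)
    if expected is None:
        return False
    return (entity_types.get(q.subject), entity_types.get(q.obj)) == expected


def filter_forecasts(quads, schema, entity_types):
    """Keep only forecast quadruples that adhere to the schema."""
    return [q for q in quads if conforms(q, schema, entity_types)]


forecasts = [
    Quadruple("Alice", "member_of", "Acme", "2031-01-01"),
    Quadruple("Acme", "head_of_state", "Freedonia", "2031-06-01"),  # wrong subject type
    Quadruple("Alice", "founder_of", "Acme", "2032-01-01"),  # relation not in schema
]
kept = filter_forecasts(forecasts, SCHEMA, ENTITY_TYPES)
```

Only the first forecast survives; the surviving quadruples would then be passed to the quadruple-to-text step.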

[140] Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis

Chunlei Meng,Ziyang Zhou,Lucas He,Xiaojing Du,Chun Ouyang,Zhongxue Gan

Main category: cs.CL

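TL;DR: Proposes TSDA, a model that decouples temporal and spatial features and aligns the temporal and spatial factors across modalities separately, improving multimodal sentiment analysis performance.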

Details Motivation: Existing methods ignore spatiotemporal heterogeneity, which causes spatiotemporal information asymmetry and limits multimodal sentiment analysis performance. Method: TSDA first decouples each modality into temporal dynamics and spatial structural context, modeled by a temporal encoder and a spatial encoder respectively. A factor-consistent cross-modal alignment mechanism aligns only temporal features with temporal features and spatial features with spatial features across modalities, with factor-specific supervision and decorrelation regularization. A gating mechanism then recombines the streams for the task. Result: TSDA outperforms baselines across experiments, and ablation studies confirm the necessity and interpretability of the design. Conclusion: Explicitly decoupling temporal and spatial factors and aligning them separately mitigates spatiotemporal information asymmetry and improves multimodal sentiment analysis. Abstract: Multimodal Sentiment Analysis integrates linguistic, visual, and acoustic modalities. Mainstream approaches based on modality-invariant and modality-specific factorization or on complex fusion still rely on spatiotemporal mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial bodies. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor-specific supervision and decorrelation regularization reduce cross-factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for the task. Extensive experiments show that TSDA outperforms baselines. Ablation studies confirm the necessity and interpretability of the design.

[141] CommunityBench: Benchmarking Community-Level Alignment across Diverse Groups and Tasks

Jiayu Lin,Zhongyu Wei

Main category: cs.CL

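TL;DR: Proposes community-level alignment as a middle path for large language model alignment and introduces CommunityBench, the first large-scale benchmark for evaluating it. The benchmark shows that current models have limited ability to model community-specific preferences, and the paper explores the potential of community-level alignment for individual modeling.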

Details Motivation: Existing alignment methods either adopt a single universal value set, which neglects minority groups, or customize at the individual level, which is too expensive. Human society is in practice organized into communities with shared values, so an intermediate approach that balances scalability and diversity is needed. Method: CommunityBench is built on Common Identity and Common Bond theory as a large-scale benchmark with four tasks, used to evaluate foundation models on community-level alignment and to analyze how well such alignment supports individual modeling. Result: Current LLMs have limited ability to model community-specific preferences. Community-level alignment also helps individual modeling and shows potential for scalable, pluralistic alignment. Conclusion: Community-level alignment is a feasible and promising middle path that both reduces the marginalization of minority groups and lowers the cost of individual customization, supporting more inclusive and scalable value alignment. Abstract: Large language model (LLM) alignment ensures that model behaviors reflect human values. Existing alignment strategies primarily follow two paths: one assumes a universal value set for a unified goal (i.e., one-size-fits-all), while the other treats every individual as unique to customize models (i.e., individual-level). However, assuming a monolithic value space marginalizes minority norms, while tailoring individual models is prohibitively expensive. Recognizing that human society is organized into social clusters with high intra-group value alignment, we propose community-level alignment as a "middle ground". Practically, we introduce CommunityBench, the first large-scale benchmark for community-level alignment evaluation, featuring four tasks grounded in Common Identity and Common Bond theory. With CommunityBench, we conduct a comprehensive evaluation of various foundation models, revealing that current LLMs exhibit limited capacity to model community-specific preferences. Furthermore, we investigate the potential of community-level alignment in facilitating individual modeling, providing a promising direction for scalable and pluralistic alignment.

[142] HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

Zhiyuan Shi,Qibo Qiu,Feng Xue,Zhonglin Jiang,Li Yu,Jian Jiang,Xiaofei He,Wenxiao Wang

Main category: cs.CL

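TL;DR: This paper proposes HeteroCache, a training-free dynamic compression framework that eases the inference bottleneck caused by linear KV cache growth in long-context LLM tasks. Fine-grained weighting and asynchronous retrieval preserve key information while substantially reducing memory and I/O overhead.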

Details Motivation: Linear growth of the KV cache limits LLM inference efficiency in long-context settings. Existing static compression methods lose important information because they overlook attention drift, while dynamic methods suffer from coarse-grained caching strategies and high I/O overhead. Method: Building on two insights, the temporal heterogeneity of attention heads and spatial redundancy among heads within a layer, HeteroCache categorizes heads and applies a fine-grained weighting strategy that allocates more cache budget to heads whose attention shifts rapidly. It also uses a hierarchical storage mechanism in which representative heads monitor attention shift and trigger asynchronous, on-demand retrieval of context from the CPU to hide I/O latency. Result: HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and decodes up to 3x faster than the original model at a 224K context length. Conclusion: HeteroCache's fine-grained, hierarchical dynamic caching strategy addresses the memory and I/O bottlenecks caused by KV cache growth and substantially improves long-context LLM inference efficiency. Abstract: The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information, principally because they overlook the attention drift phenomenon where token significance evolves dynamically. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse-grained caching strategies and incur high I/O overhead due to frequent data transfers. To overcome these limitations, we propose HeteroCache, a training-free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and redundancy. Consequently, we apply a fine-grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes, thereby addressing the inefficiency of coarse-grained strategies. Furthermore, we employ a hierarchical storage mechanism in which a subset of representative heads monitors attention shift and triggers an asynchronous, on-demand retrieval of contexts from the CPU, effectively hiding I/O latency. Finally, experiments demonstrate that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to $3\times$ compared to the original model in the 224K context. Our code will be open-sourced.
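
The idea of giving heads with rapidly shifting attention a larger cache budget can be sketched as a proportional split. This is only an illustration under assumptions: the abstract does not give HeteroCache's actual weighting rule, and the `allocate_budgets` function, the per-head drift scores, and the minimum-budget floor are all hypothetical.

```python
def allocate_budgets(drift_scores, total_budget, min_budget=1):
    """Split `total_budget` cache slots across heads in proportion to drift.

    Assumption for illustration: each head has a non-negative drift score,
    and every head keeps at least `min_budget` slots.
    """
    n = len(drift_scores)
    if total_budget < n * min_budget:
        raise ValueError("total_budget is too small for the minimum per-head budget")

    remaining = total_budget - n * min_budget
    total_drift = sum(drift_scores)
    if total_drift == 0:
        shares = [remaining / n] * n  # no drift signal: split evenly
    else:
        shares = [remaining * d / total_drift for d in drift_scores]

    budgets = [min_budget + int(s) for s in shares]

    # Flooring may leave a few slots unassigned; give them to the heads
    # with the largest fractional remainders.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Three heads; the third has the most volatile attention.
budgets = allocate_budgets([1, 1, 8], total_budget=103)
```

Here the volatile head receives most of the slots while the stable heads keep a small fixed floor, and the budgets always sum to the total.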

[143] Dr. Assistant: Enhancing Clinical Diagnostic Inquiry via Structured Diagnostic Reasoning Data and Reinforcement Learning

Yue Guo,Fanfu Wang,Jianwei Lv,Xincheng Shi,Yuchen Li,Youya Wang,Yunsheng Zeng,Yujing Liu,Yunhao Qiao,Gen Li,Junfeng Wang,Bo Yuan

Main category: cs.CL

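TL;DR: This paper proposes Dr. Assistant, an LLM-based clinical diagnostic assistant. A Clinical Diagnostic Reasoning Data (CDRD) structure and a two-stage training method improve its diagnostic reasoning and inquiry, and a new benchmark is introduced for evaluation.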

Details Motivation: Existing clinical decision support systems are costly to maintain and generalize poorly, while current LLMs fall short in diagnostic reasoning and inquiry, so a more general and efficient solution is needed. Method: The authors propose the CDRD data structure to capture abstract clinical reasoning logic, design a two-stage training pipeline of SFT followed by reinforcement learning (RL), and build a benchmark dedicated to evaluating diagnostic reasoning and inquiry. Result: Dr. Assistant outperforms open-source models on diagnostic reasoning and inquiry and is competitive with closed-source models, supporting the effectiveness of the approach. Conclusion: The work provides an effective model for clinical diagnostic reasoning and inquiry and demonstrates the potential of combining structured reasoning data with LLMs in healthcare. Abstract: Clinical Decision Support Systems (CDSSs) provide reasoning and inquiry guidance for physicians, yet they face notable challenges, including high maintenance costs and low generalization capability. Recently, Large Language Models (LLMs) have been widely adopted in healthcare due to their extensive knowledge reserves, retrieval, and communication capabilities. While LLMs show promise and excel at medical benchmarks, their diagnostic reasoning and inquiry skills are constrained. To mitigate this issue, we propose (1) Clinical Diagnostic Reasoning Data (CDRD) structure to capture abstract clinical reasoning logic, and a pipeline for its construction, and (2) the Dr. Assistant, a clinical diagnostic model equipped with clinical reasoning and inquiry skills. Its training involves a two-stage process: SFT, followed by RL with a tailored reward function. We also introduce a benchmark to evaluate both diagnostic reasoning and inquiry. Our experiments demonstrate that the Dr. Assistant outperforms open-source models and achieves competitive performance to closed-source models, providing an effective solution for clinical diagnostic inquiry guidance.

[144] OptiSQL: Executable SQL Generation from Optical Tokens

Sifan Li,Hongkai Chen,Yujun Cai,Liyang Chen,Qingwen Ye,Yiwei Wang

Main category: cs.CL

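TL;DR: Proposes OptiSQL, a vision-based framework that generates executable SQL directly from table images and natural language questions, using compact optical tokens to substantially reduce the number of input tokens.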

Details Motivation: Existing text-to-SQL methods rely on linearized textual schemas, which consume many tokens and do not fit real-world settings where tables appear in visual form. Method: An OCR-oriented visual encoder compresses table structure and content into a small number of optical tokens; the encoder is frozen and a pretrained decoder is fine-tuned to generate SQL. Result: On a visualized version of Spider 2.0-Snow, OptiSQL retains strong execution accuracy while using an order of magnitude fewer input tokens, and it is robust to visual perturbations. Conclusion: Compact optical representations can serve as an efficient interface for executable semantic parsing, balancing efficiency and performance. Abstract: Executable SQL generation is typically studied in text-to-SQL settings, where tables are provided as fully linearized textual schemas and contents. While effective, this formulation assumes access to structured text and incurs substantial token overhead, which is misaligned with many real-world scenarios where tables appear as visual artifacts in documents or webpages. We investigate whether compact optical representations can serve as an efficient interface for executable semantic parsing. We present OptiSQL, a vision-driven framework that generates executable SQL directly from table images and natural language questions using compact optical tokens. OptiSQL leverages an OCR-oriented visual encoder to compress table structure and content into a small set of optical tokens and fine-tunes a pretrained decoder for SQL generation while freezing the encoder to isolate representation sufficiency. Experiments on a visualized version of Spider 2.0-Snow show that OptiSQL retains strong execution accuracy while reducing table input tokens by an order of magnitude. Robustness analyses further demonstrate that optical tokens preserve essential structural information under visual perturbations.

[145] Uncertainty-Aware Gradient Signal-to-Noise Data Selection for Instruction Tuning

Zhihang Yuan,Chengyu Yue,Long Huang,Litu Ou,Lei Shi

Main category: cs.CL

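TL;DR: Proposes GRADFILTERING, an objective-agnostic, uncertainty-aware data selection framework that uses a small GPT-2 proxy with a LoRA ensemble to compute a Gradient Signal-to-Noise Ratio (G-SNR). It lowers training cost while matching or surpassing random subsets and strong baselines.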

Details Motivation: Existing data selection methods rely on expensive gradient datastores or static scores and ignore how uncertainty evolves during training, which makes them inefficient and leaves a source of LLM interpretability unused. Method: A small GPT-2 model with a LoRA ensemble aggregates per-example gradients into a Gradient Signal-to-Noise Ratio (G-SNR) that serves as a data utility score, enabling uncertainty-aware data filtering. Result: In most LLM-as-a-judge evaluations and in human assessment, GRADFILTERING matches or surpasses random subsets and strong baselines, and its selected subsets converge faster under the same compute budget. Conclusion: By introducing uncertainty-aware scoring, GRADFILTERING improves data efficiency and training speed for instruction tuning and offers a new direction for efficient LLM adaptation. Abstract: Instruction tuning is a standard paradigm for adapting large language models (LLMs), but modern instruction datasets are large, noisy, and redundant, making full-data fine-tuning costly and often unnecessary. Existing data selection methods either build expensive gradient datastores or assign static scores from a weak proxy, largely ignoring evolving uncertainty, and thus missing a key source of LLM interpretability. We propose GRADFILTERING, an objective-agnostic, uncertainty-aware data selection framework that utilizes a small GPT-2 proxy with a LoRA ensemble and aggregates per-example gradients into a Gradient Signal-to-Noise Ratio (G-SNR) utility. Our method matches or surpasses random subsets and strong baselines in most LLM-as-a-judge evaluations as well as in human assessment. Moreover, GRADFILTERING-selected subsets converge faster than competitive filters under the same compute budget, reflecting the benefit of uncertainty-aware scoring.
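
A gradient signal-to-noise score can be sketched as follows. The abstract does not give the exact G-SNR definition, so this sketch assumes one plausible form: for each example, the norm of the mean gradient across ensemble members divided by the root-mean-square deviation of the members from that mean. The function names, the toy gradients, and the choice to keep the highest-scoring examples are all assumptions for illustration.

```python
import math


def g_snr(grads, eps=1e-8):
    """Signal-to-noise ratio of one example's gradients across ensemble members.

    `grads` is a list of gradient vectors (one per ensemble member).
    Assumed form: ||mean gradient|| / sqrt(mean squared deviation from the mean).
    """
    k = len(grads)
    dim = len(grads[0])
    mean = [sum(g[j] for g in grads) / k for j in range(dim)]
    signal = math.sqrt(sum(m * m for m in mean))
    variance = sum(
        sum((g[j] - mean[j]) ** 2 for j in range(dim)) for g in grads
    ) / k
    return signal / (math.sqrt(variance) + eps)


def select_top(example_grads, fraction):
    """Keep the top `fraction` of examples by G-SNR (ranking direction assumed)."""
    ranked = sorted(example_grads, key=lambda ex: g_snr(example_grads[ex]), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Toy gradients from a hypothetical 3-member ensemble.
example_grads = {
    "consistent": [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]],    # members agree
    "conflicting": [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]],  # members disagree
}
selected = select_top(example_grads, fraction=0.5)
```

An example whose ensemble members agree on the gradient direction scores high; one whose members disagree scores low, which is the sense in which such a score is uncertainty-aware.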

[146] GerAV: Towards New Heights in German Authorship Verification using Fine-Tuned LLMs on a New Benchmark

Lotta Kiefer,Christoph Leiter,Sotaro Takeshita,Elena Schmidt,Steffen Eger

Main category: cs.CL

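TL;DR: This paper introduces GerAV, a large-scale benchmark for German authorship verification with over 600k labeled text pairs built from Twitter and Reddit data, which supports systematic analysis of the effects of data source, topical domain, and text length. Fine-tuned large language models perform best, and the experiments reveal a trade-off between specialization and generalization.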

Details Motivation: Authorship verification research has focused mainly on English; other languages, German in particular, lack large-scale benchmarks and systematic evaluations, a gap this paper aims to fill. Method: The authors build GerAV, a German authorship verification benchmark with multiple subsets drawn from Twitter and Reddit, and use the provided training splits to systematically evaluate strong baselines and state-of-the-art models, focusing on performance across data types. Result: A fine-tuned large language model outperforms recent baselines by up to 0.09 absolute F1 and surpasses zero-shot GPT-5 by 0.08. Models trained on a specific data type perform well under matching conditions but generalize less well across data regimes, which combining training sources can mitigate. Conclusion: GerAV provides a challenging and versatile benchmark for German and cross-domain authorship verification and advances multilingual authorship verification research. Abstract: Authorship verification (AV) is the task of determining whether two texts were written by the same author and has been studied extensively, predominantly for English data. In contrast, large-scale benchmarks and systematic evaluations for other languages remain scarce. We address this gap by introducing GerAV, a comprehensive benchmark for German AV comprising over 600k labeled text pairs. GerAV is built from Twitter and Reddit data, with the Reddit part further divided into in-domain and cross-domain message-based subsets, as well as a profile-based subset. This design enables controlled analysis of the effects of data source, topical domain, and text length. Using the provided training splits, we conduct a systematic evaluation of strong baselines and state-of-the-art models and find that our best approach, a fine-tuned large language model, outperforms recent baselines by up to 0.09 absolute F1 score and surpasses GPT-5 in a zero-shot setting by 0.08. We further observe a trade-off between specialization and generalization: models trained on specific data types perform best under matching conditions but generalize less well across data regimes, a limitation that can be mitigated by combining training sources. Overall, GerAV provides a challenging and versatile benchmark for advancing research on German and cross-domain AV.

[147] Simulated Ignorance Fails: A Systematic Study of LLM Behaviors on Forecasting Problems Before Model Knowledge Cutoff

Zehan Li,Yuxuan Wang,Ali El Lahib,Ying-Jieh Xia,Xinyu Pi

Main category: cs.CL

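TL;DR: This paper systematically evaluates whether Simulated Ignorance (SI) is a valid way to test LLM forecasting ability and finds that it cannot reliably approximate True Ignorance (TI), so SI-based retrospective forecasting is methodologically flawed.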

Details Motivation: Prospective evaluation has prohibitive latency, while retrospective forecasting is running out of clean data as model knowledge cutoffs keep advancing. Simulated Ignorance has been proposed to resolve this tension, but its validity had not been systematically tested. Method: The authors compare Simulated Ignorance with True Ignorance on 477 competition-level questions across 9 models and analyze how prompting, chain-of-thought reasoning, and reasoning-optimized models affect knowledge suppression. Result: Simulated Ignorance leaves a 52% performance gap relative to True Ignorance, chain-of-thought reasoning fails to suppress prior knowledge, and reasoning-optimized models show worse SI fidelity. Conclusion: Prompts cannot reliably "rewind" model knowledge; SI-based retrospective evaluation is unreliable and should not be used to benchmark forecasting capabilities. Abstract: Evaluating LLM forecasting capabilities is constrained by a fundamental tension: prospective evaluation offers methodological rigor but prohibitive latency, while retrospective forecasting (RF) -- evaluating on already-resolved events -- faces rapidly shrinking clean evaluation data as SOTA models possess increasingly recent knowledge cutoffs. Simulated Ignorance (SI), prompting models to suppress pre-cutoff knowledge, has emerged as a potential solution. We provide the first systematic test of whether SI can approximate True Ignorance (TI). Across 477 competition-level questions and 9 models, we find that SI fails systematically: (1) cutoff instructions leave a 52% performance gap between SI and TI; (2) chain-of-thought reasoning fails to suppress prior knowledge, even when reasoning traces contain no explicit post-cutoff references; (3) reasoning-optimized models exhibit worse SI fidelity despite superior reasoning trace quality. These findings demonstrate that prompts cannot reliably "rewind" model knowledge. We conclude that RF on pre-cutoff events is methodologically flawed; we recommend against using SI-based retrospective setups to benchmark forecasting capabilities.

[148] OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents

Yulin Hu,Zimo Long,Jiahe Guo,Xingyu Sui,Xing Fu,Weixiang Zhao,Yanyan Zhao,Bing Qin

Main category: cs.CL

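TL;DR: This paper identifies the problem of over-personalization in memory-augmented dialogue systems, formalizes it into three types (Irrelevance, Repetition, and Sycophancy), and builds OP-Bench, a benchmark of 1,700 instances, to evaluate it. It also proposes Self-ReCheck, a lightweight, model-agnostic memory filtering mechanism that mitigates the problem while preserving personalization performance.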

Details Motivation: Existing benchmarks for memory-augmented conversational agents focus on whether agents can recall and use user information and overlook whether that use is appropriate. Agents often overuse memory, producing responses that feel forced, intrusive, or socially inappropriate. Method: Over-personalization is formalized into three types (Irrelevance, Repetition, and Sycophancy), and 1,700 verified instances built from long-horizon dialogue histories form the OP-Bench benchmark. Multiple LLMs and memory-augmentation methods are evaluated on it, and Self-ReCheck, a lightweight, model-agnostic memory filtering mechanism, is proposed. Result: Over-personalization is widespread once memory is introduced, with agents tending to retrieve and over-attend to user memories unnecessarily. Self-ReCheck reduces over-personalization while maintaining good personalization performance. Conclusion: The work is the first to systematically expose and quantify over-personalization in memory-augmented dialogue systems; OP-Bench and Self-ReCheck provide tools and a direction for more controllable, appropriate personalization. Abstract: Memory-augmented conversational agents enable personalized interactions using long-term user memory and have gained substantial traction. However, existing benchmarks primarily focus on whether agents can recall and apply user information, while overlooking whether such personalization is used appropriately. In fact, agents may overuse personal information, producing responses that feel forced, intrusive, or socially inappropriate to users. We refer to this issue as over-personalization. In this work, we formalize over-personalization into three types: Irrelevance, Repetition, and Sycophancy, and introduce OP-Bench, a benchmark of 1,700 verified instances constructed from long-horizon dialogue histories. Using OP-Bench, we evaluate multiple large language models and memory-augmentation methods, and find that over-personalization is widespread when memory is introduced. Further analysis reveals that agents tend to retrieve and over-attend to user memories even when unnecessary. To address this issue, we propose Self-ReCheck, a lightweight, model-agnostic memory filtering mechanism that mitigates over-personalization while preserving personalization performance. Our work takes an initial step toward more controllable and appropriate personalization in memory-augmented dialogue systems.

[149] On Temperature-Constrained Non-Deterministic Machine Translation: Potential and Evaluation

Weichuan Wang,Mingyang Liu,Linqi Song,Chen Ma

Main category: cs.CL

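TL;DR: This paper studies non-deterministic machine translation (ND-MT). It finds that ND-MT has potential for addressing the multi-modality problem but poses new challenges for evaluation frameworks, and proposes the ExpectoSample strategy to improve evaluation reliability.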

Details Motivation: Non-determinism in language models has drawn attention but is under-studied in machine translation, and its effect on the multi-modality problem and on system evaluation is unclear. Method: The authors systematically evaluate modern MT systems, identify temperature-constrained non-deterministic MT as a distinct phenomenon, test five state-of-the-art systems on three open datasets with lexical and semantic metrics at varying sampling sizes, and propose the ExpectoSample strategy. Result: Under temperature constraints, ND-MT provides higher-quality candidate translations than deterministic MT. The experiments also reveal a "Buckets effect": the lowest-quality candidate determines the overall system ranking. Conclusion: ND-MT can improve translation diversity and quality, but existing evaluation methods are not robust to it; ExpectoSample can assess the reliability of evaluation metrics and points to a new direction for evaluating ND-MT systems. Abstract: In recent years, the non-deterministic properties of language models have garnered considerable attention and have shown a significant influence on real-world applications. However, such properties remain under-explored in machine translation (MT), a complex, non-deterministic NLP task. In this study, we systematically evaluate modern MT systems and identify temperature-constrained Non-Deterministic MT (ND-MT) as a distinct phenomenon. Additionally, we demonstrate that ND-MT exhibits significant potential in addressing the multi-modality issue that has long challenged MT research and provides higher-quality candidates than Deterministic MT (D-MT) under temperature constraints. However, ND-MT introduces new challenges in evaluating system performance. Specifically, the evaluation framework designed for D-MT fails to yield consistent evaluation results when applied to ND-MT. We further investigate this emerging challenge by evaluating five state-of-the-art ND-MT systems across three open datasets using both lexical-based and semantic-based metrics at varying sampling sizes. The results reveal a Buckets effect across these systems: the lowest-quality candidate generated by ND-MT consistently determines the overall system ranking across different sampling sizes for all reasonable metrics. Furthermore, we propose the ExpectoSample strategy to automatically assess the reliability of evaluation metrics for selecting robust ND-MT.
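
To make the Buckets effect concrete, the sketch below ranks systems by the score of their worst sampled candidate and, for contrast, by their mean score. The system names and quality scores are invented for illustration and are not results from the paper; the sketch only shows what it means for a ranking to be determined by the lowest-quality candidate.

```python
def mean(scores):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(scores) / len(scores)


def rank_by(system_scores, aggregate):
    """Order system names from best to worst under the given aggregate."""
    return sorted(
        system_scores,
        key=lambda name: aggregate(system_scores[name]),
        reverse=True,
    )


# Hypothetical per-candidate quality scores for three sampled translations each.
system_scores = {
    "system_a": [0.90, 0.88, 0.40],  # strong on average, one poor sample
    "system_b": [0.80, 0.79, 0.78],  # consistently solid
    "system_c": [0.85, 0.60, 0.59],
}

rank_mean = rank_by(system_scores, mean)
rank_worst = rank_by(system_scores, min)
```

With these toy numbers the two rankings disagree on second place: `system_a` has the better mean, but its single poor sample drops it to last when systems are ordered by their weakest candidate.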

[150] Towards robust long-context understanding of large language model via active recap learning

Chenyu Hui

Main category: cs.CL

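TL;DR: This paper proposes active recap learning (ARL), a framework that strengthens the long-context understanding of large language models by constructing targeted sequences during continued pretraining and performing retrospective summarization at inference.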

Details Motivation: Large language models have memory and comprehension limits when processing long contexts and need a more effective mechanism for revisiting and using earlier content. Method: Key tokens are identified from the loss gap between long and short contexts, the most relevant preceding paragraphs are located, and an LLM summarizes them. At inference, the model autonomously generates and uses these retrospective summaries, forming a recursive memory mechanism across paragraphs. Result: ARL improves results by 26.8% on RULER and 9.44% on LongBench. Conclusion: ARL is a simple yet effective continued-pretraining approach that strengthens long-context understanding and advances scalable memory augmentation in LLMs. Abstract: In this paper, we propose active recap learning (ARL), a framework for enhancing the long-context understanding of large language models (LLMs). ARL enables models to revisit and summarize earlier content through targeted sequence construction during continued pretraining and retrospective summarization at inference. First, we identify key tokens in a prepared long context based on loss gaps between long and short forward contexts, find the most relevant preceding paragraphs, and then summarize them using an LLM. Second, ARL equips models with the ability to autonomously generate and utilize these retrospective summaries during inference, thereby establishing a recursive memory mechanism across paragraphs. Experimental results show substantial gains, with ARL achieving a 26.8% improvement on RULER and a 9.44% improvement on LongBench. Overall, ARL offers a simple yet effective continued pretraining-based approach to strengthen long-context understanding, advancing scalable memory augmentation in LLMs.

[151] Dimension-First Evaluation of Speech-to-Speech Models with Structured Acoustic Cues

Arjun Chandra,Kevin Miller,Venkatesh Ravichandran,Constantinos Papayiannis,Venkatesh Saligrama

Main category: cs.CL

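TL;DR: Proposes TRACE, a framework that uses textualized audio cues for efficient, low-cost speech-to-speech evaluation, outperforming existing methods and agreeing more closely with human judgments.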

Details Motivation: Current speech-to-speech evaluation relies on expensive, opaque audio language models and lacks fine-grained judgment across dimensions (content, voice quality, paralinguistics). Method: A Human Chain-of-Thought (HCoT) annotation protocol decomposes evaluation into content, voice quality, and paralinguistics. Inexpensive audio signals are extracted into a textual blueprint, an LLM renders dimension-wise judgments from that blueprint, and a deterministic policy fuses them into an overall rating. Result: TRACE agrees with human raters more closely than existing ALMs and transcript-only LLM judges while costing significantly less. Conclusion: TRACE provides a scalable, low-cost, human-aligned automatic evaluation approach for speech-to-speech systems and advances LLM-judge-based multimodal evaluation. Abstract: Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.

[152] Pro-AI Bias in Large Language Models

Benaya Trabelsi,Jonathan Shaki,Sarit Kraus

Main category: cs.CL

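TL;DR: Large language models (LLMs) show a systematic pro-AI bias when providing decision support: they favor AI options in their advice, overestimate salaries for AI jobs, and give AI greater centrality in their internal representations.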

Details Motivation: To examine whether LLMs used for decision support systematically favor AI itself, and what that could mean for high-stakes decisions. Method: Three experiments: analyzing LLM recommendations across diverse advice-seeking queries; comparing salary predictions for AI and matched non-AI jobs; and probing the internal representations of open-weight models for the similarity between the AI concept and other academic fields. Result: LLMs disproportionately recommend AI options, with proprietary models doing so almost deterministically. Salaries for AI jobs are systematically overestimated, by 10 percentage points more in proprietary models. "Artificial Intelligence" is most similar to generic academic-field prompts under every framing, indicating its representational centrality. Conclusion: LLM-generated advice and valuations systematically skew toward AI and may distort choices and perceptions in important decisions, so their influence in real-world use warrants caution. Abstract: Large language models (LLMs) are increasingly employed for decision-support across multiple domains. We investigate whether these models display a systematic preferential bias in favor of artificial intelligence (AI) itself. Across three complementary experiments, we find consistent evidence of pro-AI bias. First, we show that LLMs disproportionately recommend AI-related options in response to diverse advice-seeking queries, with proprietary models doing so almost deterministically. Second, we demonstrate that models systematically overestimate salaries for AI-related jobs relative to closely matched non-AI jobs, with proprietary models overestimating AI salaries by 10 percentage points more. Finally, probing internal representations of open-weight models reveals that "Artificial Intelligence" exhibits the highest similarity to generic prompts for academic fields under positive, negative, and neutral framings alike, indicating valence-invariant representational centrality. These patterns suggest that LLM-generated advice and valuation can systematically skew choices and perceptions in high-stakes decisions.

[153] Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis

Yushen Chen,Junzhe Liu,Yujie Tu,Zhikang Niu,Yuzhe Liang,Kai Yu,Chunyu Qiang,Chen Zhang,Xie Chen

Main category: cs.CL

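TL;DR: This paper presents Habibi, a suite of unified text-to-speech models supporting multiple Arabic dialects. Using linguistically-informed curriculum learning and no text diacritization, it surpasses the leading commercial service in generation quality. The authors plan to open-source the model and establish the first systematic benchmark for multi-dialect Arabic speech synthesis.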

Details Motivation: Arabic dialects are a notable gap in speech synthesis research, which lacks unified modeling approaches, standardized data, benchmarks, and evaluation guidelines, steering researchers toward safer topics. Method: The Habibi model suite uses a linguistically-informed curriculum learning strategy and existing open-source ASR corpora to support a range of high- to low-resource Arabic dialects without requiring text diacritization. Result: Habibi outperforms the leading commercial service in generation quality and remains extensible through in-context learning. Conclusion: The work lays a foundation for multi-dialect Arabic speech synthesis by providing open-source models, the first systematic benchmark, and evaluation standards to support future research. Abstract: A notable gap persists in speech synthesis research and development for Arabic dialects, particularly from a unified modeling perspective. Despite its high practical value, the inherent linguistic complexity of Arabic dialects, further compounded by a lack of standardized data, benchmarks, and evaluation guidelines, steers researchers toward safer ground. To bridge this divide, we present Habibi, a suite of specialized and unified text-to-speech models that harnesses existing open-source ASR corpora to support a wide range of high- to low-resource Arabic dialects through linguistically-informed curriculum learning. Our approach outperforms the leading commercial service in generation quality, while maintaining extensibility through effective in-context learning, without requiring text diacritization. We are committed to open-sourcing the model, along with creating the first systematic benchmark for multi-dialect Arabic speech synthesis. Furthermore, by identifying the key challenges in and establishing evaluation standards for the process, we aim to provide a solid groundwork for subsequent research. Resources at https://SWivid.github.io/Habibi/.

[154] Knowledge Graph-Assisted LLM Post-Training for Enhanced Legal Reasoning

Dezhao Song,Guglielmo Bonifazi,Frank Schilder,Jonathan Richard Schwarz

Main category: cs.CL

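TL;DR: This paper proposes a knowledge graph (KG)-assisted post-training method for legal LLMs. It uses the IRAC framework to build a legal knowledge graph from 12K cases, generates training data for supervised fine-tuning and preference optimization, and improves LLM performance on multiple legal reasoning tasks.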

Details Motivation: Current LLM post-training does not model the structure of domain knowledge, which leaves models weak at complex reasoning in high-stakes professional domains such as law, particularly in understanding the relations between legal concepts. Method: A legal knowledge graph is built following the IRAC (Issue, Rule, Analysis, Conclusion) framework, training data is generated from it, and three state-of-the-art LLMs are post-trained with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Result: The post-trained models beat baselines on 4 of 5 legal benchmarks, and the 70B DPO model achieves the best score on 4 of 6 reasoning tasks, ahead of the baselines and a 141B state-of-the-art legal LLM. Conclusion: Incorporating a structured legal knowledge graph effectively strengthens LLMs' legal reasoning, and the approach may generalize to other high-stakes professional domains. Abstract: LLM post-training has primarily relied on large text corpora and human feedback, without capturing the structure of domain knowledge. This has caused models to struggle with complex reasoning tasks, especially in high-stakes professional domains. In law, reasoning requires deep understanding of the relations between various legal concepts, a key component missing in current LLM post-training. In this paper, we propose a knowledge graph (KG)-assisted approach for enhancing LLMs' reasoning capability in the legal domain that is generalizable to other high-stakes domains. We model key legal concepts by following the IRAC (Issue, Rule, Analysis and Conclusion) framework, and construct a KG with 12K legal cases. We then produce training data using our IRAC KG, and conduct both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) with three state-of-the-art (SOTA) LLMs (30B, 49B and 70B), varying architecture and base model family. Our post-trained models obtained better average performance on 4/5 diverse legal benchmarks (14 tasks) than baselines. In particular, our 70B DPO model achieved the best score on 4/6 reasoning tasks, among baselines and a 141B SOTA legal LLM, demonstrating the effectiveness of our KG for enhancing LLMs' legal reasoning capability.

[155] The Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations

Sam OConnor Russell,Delphine Charuau,Naomi Harte

Main category: cs.CL

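TL;DR: This paper studies how turn-taking models built on self-supervised speech representations (S3Rs) depend on prosodic and lexical cues in human-robot interaction, and proposes a vocoder-based method that controls these cues more cleanly. Experiments show that either prosodic or lexical information alone is enough for effective turn-taking prediction, and that the two are encoded in S3Rs with limited interdependence. Results are consistent across CPC and wav2vec2.0 models, suggesting that future models may need only prosody, with benefits for privacy and performance.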

Details Motivation: To understand whether S3R-based turn-taking models rely on prosody, lexical content, or both, in order to improve model robustness and privacy. Method: The authors propose a vocoder-based method that separates and controls prosodic and lexical cues in speech, and test it on the voice-activity projection model. Result: Performance on unintelligible noise with matched prosody is comparable to performance on clean speech, showing that either cue can support turn-taking on its own. When one cue is disrupted, the model exploits the other without further training, indicating that the two are encoded in S3Rs with limited interdependence. Results are consistent across CPC and wav2vec2.0. Conclusion: S3Rs encode both prosodic and lexical cues with limited interdependence, so future turn-taking models may rely on prosody alone, which would help protect privacy and could improve performance. Abstract: Fluid turn-taking remains a key challenge in human-robot interaction. Self-supervised speech representations (S3Rs) have driven many advances, but it remains unclear whether S3R-based turn-taking models rely on prosodic cues, lexical cues or both. We introduce a vocoder-based approach to control prosody and lexical cues in speech more cleanly than prior work. This allows us to probe the voice-activity projection model, an S3R-based turn-taking model. We find that prediction accuracy on prosody-matched, unintelligible noise is similar to that on clean speech. This reveals both prosodic and lexical cues support turn-taking, but either can be used in isolation. Hence, future models may only require prosody, providing privacy and potential performance benefits. When either prosodic or lexical information is disrupted, the model exploits the other without further training, indicating they are encoded in S3Rs with limited interdependence. Results are consistent in CPC-based and wav2vec2.0 S3Rs. We discuss our findings and highlight a number of directions for future work. All code is available to support future research.

[156] FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen,Jinlan Fu,Changsong Li,See-Kiong Ng,Xipeng Qiu

Main category: cs.CL

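TL;DR: This paper introduces FutureOmni, the first benchmark for evaluating future event forecasting from audio-visual, omni-modal context, together with the OFF training strategy for improving models on cross-modal causal and temporal reasoning.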

Details Motivation: Existing benchmarks focus on retrospective understanding and do not evaluate the ability of multimodal LLMs to forecast future events, so a new benchmark dedicated to future prediction in audio-visual environments is needed. Method: FutureOmni is built through a scalable, LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 QA pairs. The authors also propose the Omni-Modal Future Forecasting (OFF) training strategy, based on a 7K-sample instruction-tuning dataset. Result: Current models perform poorly on FutureOmni, with a best accuracy of only 64.8% (Gemini 3 Flash), while OFF training improves future forecasting and generalization. Conclusion: FutureOmni fills a gap in evaluating audio-visual future forecasting, and the OFF strategy improves model performance on such tasks, moving MLLMs toward genuinely forward-looking understanding. Abstract: Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

[157] Pedagogical Alignment for Vision-Language-Action Models: A Comprehensive Framework for Data, Architecture, and Evaluation in Education

Unggi Lee,Jahyun Jeong,Sunyoung Shin,Haeun Park,Jeongsu Moon,Youngchang Song,Jaechang Shim,JaeHwan Lee,Yunju Noh,Seungwon Choi,Ahhyun Kim,TaeHyeon Kim,Kyungtae Joo,Taeyeong Kim,Gyeonggeon Lee

Main category: cs.CL

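TL;DR: This paper proposes a lightweight Pedagogical Vision-Language-Action (Pedagogical VLA) framework for resource-constrained educational settings. Through four components (text healing, LLM distillation, safety training, and pedagogical evaluation), it keeps the model efficient while improving a science-demonstration robot's ability to generate pedagogical explanations and to operate safely.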

Details Motivation: Existing VLA models are computationally expensive and sacrifice language generation, so they cannot meet classroom needs for systems that are safe, interpretable, and able to generate pedagogical explanations. Method: The Pedagogical VLA Framework comprises four components: text healing, LLM knowledge distillation, safety training, and pedagogical evaluation tailored to science education. It is validated on five science demonstrations spanning several disciplines. Result: The framework matches baseline models on task success rate, safety, and related measures, while significantly improving the pedagogical appropriateness and language quality of the generated explanations, as confirmed by teacher surveys and LLM-as-Judge assessment. Conclusion: The Pedagogical VLA Framework enables efficient and pedagogically meaningful science demonstrations in resource-constrained environments, offering a viable path for robotics in STEM education. Abstract: Science demonstrations are important for effective STEM education, yet teachers face challenges in conducting them safely and consistently across multiple occasions, where robotics can be helpful. However, current Vision-Language-Action (VLA) models require substantial computational resources and sacrifice language generation capabilities to maximize efficiency, making them unsuitable for resource-constrained educational settings that require interpretable, explanation-generating systems. We present Pedagogical VLA Framework, a framework that applies pedagogical alignment to lightweight VLA models through four components: text healing to restore language generation capabilities, large language model (LLM) distillation to transfer pedagogical knowledge, safety training for educational environments, and pedagogical evaluation adjusted to science education contexts. We evaluate Pedagogical VLA Framework across five science demonstrations spanning physics, chemistry, biology, and earth science, using an evaluation framework developed in collaboration with science education experts. Our evaluation assesses both task performance (success rate, protocol compliance, efficiency, safety) and pedagogical quality through teacher surveys and LLM-as-Judge assessment. We additionally provide qualitative analysis of generated texts. Experimental results demonstrate that Pedagogical VLA Framework achieves comparable task performance to baseline models while producing contextually appropriate educational explanations.

[158] OpenLearnLM Benchmark: A Unified Framework for Evaluating Knowledge, Skill, and Attitude in Educational Large Language Models

Unggi Lee,Sookbun Lee,Heungsoo Choi,Jinseo Lee,Haeun Park,Younghoon Jeon,Sungmin Cho,Minju Kang,Junbo Koh,Jiyeong Bae,Minwoo Nam,Juyeon Eun,Yeonji Jung,Yeil Jeong

Main category: cs.CL

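TL;DR: OpenLearnLM is a benchmark framework grounded in educational assessment theory that evaluates the suitability of LLMs for educational settings along three dimensions: knowledge, skills, and attitude. It covers more than 124K items, reveals the distinct capability profiles of several frontier models, and underlines the need for multi-dimensional evaluation.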

Details Motivation: Existing educational benchmarks for LLMs are too narrow and lack grounding in learning-science theory, so they do not fully reflect model capability in real educational settings. A more comprehensive, theory-driven, multi-dimensional framework is needed. Method: The OpenLearnLM benchmark covers three dimensions: Knowledge (curriculum-aligned content and pedagogical understanding), Skills (scenario-based competencies organized through a four-level center-role-scenario-subscenario hierarchy), and Attitude (alignment consistency and deception resistance). It gathers 124K+ items across multiple subjects, roles, and difficulty levels, and adapts Anthropic's Alignment Faking methodology to detect behavioral inconsistency. Result: An evaluation of seven frontier models finds that Claude-Opus-4.5 excels in practical skills but is weaker in knowledge, while Grok-4.1-fast leads in knowledge but shows alignment concerns. No model dominates every dimension, which confirms the importance of multi-axis evaluation. Conclusion: OpenLearnLM provides an open, comprehensive, theory-grounded evaluation framework that supports the effective deployment and improvement of LLMs in real educational environments. Abstract: Large Language Models are increasingly deployed as educational tools, yet existing benchmarks focus on narrow skills and lack grounding in learning sciences. We introduce OpenLearnLM Benchmark, a theory-grounded framework evaluating LLMs across three dimensions derived from educational assessment theory: Knowledge (curriculum-aligned content and pedagogical understanding), Skills (scenario-based competencies organized through a four-level center-role-scenario-subscenario hierarchy), and Attitude (alignment consistency and deception resistance). Our benchmark comprises 124K+ items spanning multiple subjects, educational roles, and difficulty levels based on Bloom's taxonomy. The Knowledge domain prioritizes authentic assessment items from established benchmarks, while the Attitude domain adapts Anthropic's Alignment Faking methodology to detect behavioral inconsistency under varying monitoring conditions. Evaluation of seven frontier models reveals distinct capability profiles: Claude-Opus-4.5 excels in practical skills despite lower content knowledge, while Grok-4.1-fast leads in knowledge but shows alignment concerns. Notably, no single model dominates all dimensions, validating the necessity of multi-axis evaluation. OpenLearnLM provides an open, comprehensive framework for advancing LLM readiness in authentic educational contexts.

[159] Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores

Esma Balkır,Alice Pernthaller,Marco Basaldella,José Hernández-Orallo,Nigel Collier

Main category: cs.CL

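TL;DR: Proposes an extension of IRT-based adaptive testing to generation tasks scored on a continuous scale for LLM evaluation, cutting the number of test items by 98% while significantly improving ranking correlation.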

Details Motivation: Existing Computerized Adaptive Testing (CAT) mainly targets multiple-choice questions, whereas modern LLM evaluation increasingly relies on generation tasks and continuous scoring metrics, so an adaptive testing method that handles continuous bounded scores is needed. Method: The Bernoulli response distribution in the IRT model is replaced with a heteroskedastic normal distribution, extending CAT to continuous scores such as ROUGE, BLEU, and LLM-as-a-Judge. An uncertainty-aware ranker with adaptive stopping criteria is built on top. Result: Validated on five benchmarks of different types, the method uses only 2% of the test items while improving ranking correlation by 0.12 τ over random sampling, with 95% accuracy on confident predictions. Conclusion: The method ranks LLMs efficiently and reliably, substantially lowering evaluation cost while keeping accuracy high, and applies to a range of generative evaluation metrics. Abstract: Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks where outputs are scored continuously rather than marked correct/incorrect. We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty-aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items and as cheaply as possible. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses 2% of the items while improving ranking correlation by 0.12 τ over random sampling, with 95% accuracy on confident predictions.
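The central modelling change in the abstract above, replacing a Bernoulli response with a heteroskedastic normal one, can be sketched as follows. This is a minimal illustration, not the authors' implementation. The two-parameter logistic mean, the variance form `k * mu * (1 - mu) + eps`, the grid-search estimator, and every name used here (`expected_score`, `score_log_likelihood`, `estimate_ability`, `k`, `eps`) are assumptions made for the example.

```python
import math

def expected_score(theta, a, b):
    """2PL-style mean: expected score in (0, 1) for ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def score_log_likelihood(y, theta, a, b, k=0.25, eps=1e-4):
    """Log-density of a continuous score y under a normal response model.

    Assumption: the variance depends on the mean (heteroskedastic) and
    shrinks toward the bounds 0 and 1; the paper may parameterize it differently.
    """
    mu = expected_score(theta, a, b)
    var = k * mu * (1.0 - mu) + eps
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mu) ** 2 / (2.0 * var)

def estimate_ability(scores, items, grid=None):
    """Grid-search maximum-likelihood ability from scores and (a, b) item pairs."""
    grid = grid or [g / 10.0 for g in range(-40, 41)]

    def total(theta):
        return sum(score_log_likelihood(y, theta, a, b)
                   for y, (a, b) in zip(scores, items))

    return max(grid, key=total)
```

A model whose scores on the same items are consistently higher receives a higher estimated ability. An adaptive test would then pick the next item based on the remaining uncertainty in that estimate, which this sketch leaves out.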

[160] AgentEHR: Advancing Autonomous Clinical Decision-Making via Retrospective Summarization

Yusheng Liao,Chuan Xuan,Yutong Cai,Lina Yang,Zhe Chen,Yanfeng Wang,Yu Wang

Main category: cs.CL

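TL;DR: AgentEHR is a new benchmark for evaluating agents on complex decision-making tasks within raw, high-noise electronic health record (EHR) databases. The accompanying RetroSum framework uses retrospective summarization and an evolving experience strategy to significantly improve performance and reduce interaction errors.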

Details Motivation: Applications of LLMs in medicine are limited by their reliance on curated inputs and simplified retrieval tasks, which makes them ill-suited to the complex EHR navigation required in real clinical environments. Method: The AgentEHR benchmark requires models to complete tasks such as diagnosis and treatment planning, which demand long-range interactive reasoning, directly within raw, high-noise EHR data. The RetroSum framework combines a retrospective summarization mechanism that dynamically re-evaluates interaction history with an evolving experience strategy that retrieves accumulated experience from a memory bank, preserving logical coherence and compensating for information loss. Result: RetroSum improves performance by up to 29.16% over strong baselines and reduces total interaction errors by up to 92.3%. Conclusion: RetroSum addresses long-context information loss and fractured reasoning, showing strong reasoning continuity and practical value on realistic EHR navigation tasks. Abstract: Large Language Models have demonstrated profound utility in the medical domain. However, their application to autonomous Electronic Health Records (EHRs) navigation remains constrained by a reliance on curated inputs and simplified retrieval tasks. To bridge the gap between idealized experimental settings and realistic clinical environments, we present AgentEHR. This benchmark challenges agents to execute complex decision-making tasks, such as diagnosis and treatment planning, requiring long-range interactive reasoning directly within raw and high-noise databases. In tackling these tasks, we identify that existing summarization methods inevitably suffer from critical information loss and fractured reasoning continuity. To address this, we propose RetroSum, a novel framework that unifies a retrospective summarization mechanism with an evolving experience strategy. By dynamically re-evaluating interaction history, the retrospective mechanism prevents long-context information loss and ensures unbroken logical coherence. Additionally, the evolving strategy bridges the domain gap by retrieving accumulated experience from a memory bank. Extensive empirical evaluations demonstrate that RetroSum achieves performance gains of up to 29.16% over competitive baselines, while significantly decreasing total interaction errors by up to 92.3%.

[161] HyperWalker: Dynamic Hypergraph-Based Deep Diagnosis for Multi-Hop Clinical Modeling across EHR and X-Ray in Medical VLMs

Yuezhe Yang,Hao Wang,Yige Peng,Jinman Kim,Lei Bi

Main category: cs.CL

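TL;DR: Proposes the HyperWalker framework, which performs deep clinical diagnosis through dynamic hypergraphs and test-time training. A reinforcement learning agent searches multimodal electronic health records for optimal diagnostic paths, reaching SOTA performance on medical report generation and visual question answering.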

Details Motivation: Existing medical vision-language models usually process cases in isolation and make no use of longitudinal electronic health records or related cases, which limits diagnostic accuracy. Method: A dynamic hypergraph named iBrochure models the heterogeneity and high-order associations of EHR data, and a reinforcement learning agent, Walker, searches it for diagnostic paths. A "linger mechanism" performs multi-hop orthogonal retrieval to cover diverse clinical characteristics. Result: HyperWalker achieves state-of-the-art performance on medical report generation with the MIMIC dataset and on medical visual question answering with EHRXQA. Conclusion: By integrating structured EHRs with the high-order associations of multimodal information, HyperWalker supports more comprehensive clinical reasoning and improves the accuracy and interpretability of automated diagnosis. Abstract: Automated clinical diagnosis remains a core challenge in medical AI, which usually requires models to integrate multi-modal data and reason across complex, case-specific contexts. Although recent methods have advanced medical report generation (MRG) and visual question answering (VQA) with medical vision-language models (VLMs), they predominantly operate under a sample-isolated inference paradigm, processing cases independently without access to longitudinal electronic health records (EHRs) or structurally related patient examples. This paradigm limits reasoning to image-derived information alone, which ignores external complementary medical evidence for potentially more accurate diagnosis. To overcome this limitation, we propose HyperWalker, a Deep Diagnosis framework that reformulates clinical reasoning via dynamic hypergraphs and test-time training. First, we construct a dynamic hypergraph, termed iBrochure, to model the structural heterogeneity of EHR data and implicit high-order associations among multimodal clinical information. Within this hypergraph, a reinforcement learning agent, Walker, navigates to and identifies optimal diagnostic paths. To ensure comprehensive coverage of diverse clinical characteristics in test samples, we incorporate a linger mechanism, a multi-hop orthogonal retrieval strategy that iteratively selects clinically complementary neighborhood cases reflecting distinct clinical attributes. Experiments on MRG with MIMIC and medical VQA on EHRXQA demonstrate that HyperWalker achieves state-of-the-art performance.
Code is available at: https://github.com/Bean-Young/HyperWalker

[162] Automatic Prompt Optimization for Dataset-Level Feature Discovery

Adrian Cosma,Oleg Szehr,David Kletz,Alessandro Antonucci,Olivier Pelletier

Main category: cs.CL

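TL;DR: Proposes a multi-agent prompt optimization framework for automatically discovering interpretable and discriminative features from unstructured text.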

Details Motivation: Existing methods rely on hand-crafted prompts or fixed feature schemas and cannot automatically discover high-quality features. Method: Feature discovery is cast as a dataset-level prompt optimization problem in which several language-model agents collaborate to propose, extract, and evaluate features, and the prompts are refined iteratively from feedback on overall performance and interpretability. Result: The approach discovers features automatically without per-sample supervision, and the resulting features are interpretable and perform well for classification. Conclusion: The method offers a novel and effective framework for feature extraction from unstructured text that improves on traditional hand-crafted prompting. Abstract: Feature extraction from unstructured text is a critical step in many downstream classification pipelines, yet current approaches largely rely on hand-crafted prompts or fixed feature schemas. We formulate feature discovery as a dataset-level prompt optimization problem: given a labelled text corpus, the goal is to induce a global set of interpretable and discriminative feature definitions whose realizations optimize a downstream supervised learning objective. To this end, we propose a multi-agent prompt optimization framework in which language-model agents jointly propose feature definitions, extract feature values, and evaluate feature quality using dataset-level performance and interpretability feedback. Instruction prompts are iteratively refined based on this structured feedback, enabling optimization over prompts that induce shared feature sets rather than per-example predictions. This formulation departs from prior prompt optimization methods that rely on per-sample supervision and provides a principled mechanism for automatic feature discovery from unstructured text.

[163] "The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

Jin Cui,Jiaqi Guo,Jiepeng Zhou,Ruixuan Yang,Jiayi Lu,Jiajun Xu,Jiangcheng Song,Boran Zhao,Pengju Ren

Main category: cs.CL

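TL;DR: Proposes the COMPACT framework, which fuses supervision from different teacher models by dynamically weighting their gradients according to the student's real-time compatibility, measured with a multi-dimensional metric. This improves the reasoning ability of small models and mitigates catastrophic forgetting.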

Details Motivation: Existing CoT distillation methods mostly rely on a single teacher, so they are limited by that teacher's capability biases and by catastrophic forgetting. They also struggle to fuse supervision from multiple teachers, which keeps the student from absorbing diverse reasoning abilities. Method: COMPACT introduces a three-dimensional dynamic assessment: graph-based consensus to filter out misleading reasoning paths, mutual-information-based adaptability to detect moments of genuine understanding, and loss-based difficulty to assess the student's receptivity. Teacher gradients are weighted dynamically on this basis and then fused. Result: Experiments and latent space analysis show that COMPACT reaches SOTA performance on several benchmarks and integrates diverse reasoning capabilities without damaging the student's original knowledge structure, significantly mitigating catastrophic forgetting. Conclusion: By assessing compatibility along several dimensions, COMPACT fuses teacher supervision adaptively and gives small models an effective way to inherit the reasoning abilities of large ones. Abstract: Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervisions remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.
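The idea of weighting teacher gradients by a compatibility score can be illustrated with a small sketch. It assumes each teacher already has a single scalar compatibility score (how the three COMPACT metrics are combined into one number is not specified in the abstract) and that weights come from a softmax. The function names and the `temperature` parameter are hypothetical.

```python
import math

def teacher_weights(compat_scores, temperature=1.0):
    """Softmax over per-teacher compatibility scores (higher = more compatible)."""
    scaled = [s / temperature for s in compat_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_gradients(teacher_grads, compat_scores, temperature=1.0):
    """Weighted sum of per-teacher gradient vectors, each a list of floats."""
    weights = teacher_weights(compat_scores, temperature)
    dim = len(teacher_grads[0])
    return [sum(weights[t] * teacher_grads[t][i] for t in range(len(weights)))
            for i in range(dim)]
```

With equal scores every teacher contributes equally. As one teacher's score rises, its gradient dominates the fused update, which is the behaviour the abstract describes at a high level.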

[164] From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning

Zihan Niu,Wenping Hu,Junmin Chen,Xiyue Wang,Tong Xu,Ruiming Tang

Main category: cs.CL

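TL;DR: Proposes TAGS, a unified data sampling framework based on a knowledge tree. It builds a hierarchical knowledge structure from fine-grained tags to jointly control quality, diversity, and target alignment, and surpasses the full-dataset model while using only 5% of the data.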

Details Motivation: Existing data selection methods are constrained by the flatness of embedding spaces or the coarseness of tags. They overlook fine-grained knowledge and its hierarchical dependencies, which makes precise data valuation and knowledge-aligned sampling difficult. Method: An LLM extracts atomic knowledge concepts, and bottom-up hierarchical clustering organizes them into a global knowledge tree. Data instances are mapped onto the tree, tree-aware metrics quantify quality and diversity, and a controllable sampling strategy maximizes tree-level information gain while constraining leaf-level alignment through KL divergence. Result: TAGS significantly outperforms state-of-the-art baselines. Using only 5% of the data it exceeds the full-dataset model by 5.84%, and the aligned sampling strategy raises average performance by a further 4.24%. Conclusion: By building and exploiting a hierarchical knowledge tree, TAGS achieves efficient and controllable data selection, improving instruction tuning with far less data and offering a new approach to large-scale data filtering. Abstract: Effective and controllable data selection is critical for LLM instruction tuning, especially with massive open-source datasets. Existing approaches primarily rely on instance-level quality scores, or diversity metrics based on embedding clusters or semantic tags. However, constrained by the flatness of embedding spaces or the coarseness of tags, these approaches overlook fine-grained knowledge and its intrinsic hierarchical dependencies, consequently hindering precise data valuation and knowledge-aligned sampling. To address this challenge, we propose Tree-aware Aligned Global Sampling (TAGS), a unified framework that leverages a knowledge tree built from fine-grained tags, thereby enabling joint control of global quality, diversity, and target alignment. Using an LLM-based tagger, we extract atomic knowledge concepts, which are organized into a global tree through bottom-up hierarchical clustering. By grounding data instances onto this tree, a tree-aware metric then quantifies data quality and diversity, facilitating effective sampling. Our controllable sampling strategy maximizes tree-level information gain and enforces leaf-level alignment via KL-divergence for specific domains. Extensive experiments demonstrate that TAGS significantly outperforms state-of-the-art baselines. Notably, it surpasses the full-dataset model by +5.84% using only 5% of the data, while our aligned sampling strategy further boosts average performance by +4.24%.
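The leaf-level alignment term mentioned above can be pictured as a KL divergence between a target distribution over knowledge-tree leaves and the leaf distribution of a candidate sample. The sketch below is only an illustration under assumptions: the direction of the divergence, the additive smoothing, and the names `leaf_distribution` and `kl_divergence` are not taken from the paper.

```python
import math
from collections import Counter

def leaf_distribution(leaf_ids, all_leaves, smoothing=1e-6):
    """Smoothed empirical distribution of selected samples over tree leaves."""
    counts = Counter(leaf_ids)
    total = len(leaf_ids) + smoothing * len(all_leaves)
    return {leaf: (counts.get(leaf, 0) + smoothing) / total for leaf in all_leaves}

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same keys; q must be positive."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0.0)
```

A sampler could score candidate subsets by this divergence and prefer the subset whose leaf distribution sits closest to the target domain, alongside whatever information-gain term it uses at the tree level.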

[165] Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Hengyuan Zhang,Zhihao Zhang,Mingyang Wang,Zunhai Su,Yiwei Wang,Qianli Wang,Shuzhou Yuan,Ercong Nie,Xufeng Duan,Qibo Xue,Zeping Yu,Chenming Shang,Xiao Liang,Jing Xiong,Hui Shen,Chaofan Tao,Zhengwu Liu,Senjie Jin,Zhiheng Xi,Dongdong Zhang,Sophia Ananiadou,Tao Gui,Ruobing Xie,Hayden Kwok-Hay So,Hinrich Schütze,Xuanjing Huang,Qi Zhang,Ngai Wong

Main category: cs.CL

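TL;DR: This paper presents a practical survey of mechanistic interpretability (MI) organized around the pipeline "Locate, Steer, and Improve". It systematically recasts MI from an observational science into an actionable intervention paradigm, categorizes localizing and steering methods by interpretable object, and shows practical applications to model alignment, capability, and efficiency.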

Details Motivation: Existing interpretability research mostly stops at observation and summary. Without a systematic framework for intervention it is hard to actually control model behavior, so an actionable MI methodology is needed. Method: The survey proposes the "Locate, Steer, and Improve" pipeline, formally categorizes localizing (diagnosis) and steering (intervention) techniques by interpretable object, and establishes a rigorous intervention protocol. Result: The framework is applied to improving model alignment, capability, and efficiency, demonstrating that MI is feasible and effective as an actionable optimization method. Conclusion: Mechanistic interpretability can be turned into an executable methodology for model optimization, and future work can use this framework for more precise intervention and improvement. Abstract: Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.

[166] BACH-V: Bridging Abstract and Concrete Human-Values in Large Language Models

Junyu Zhang,Yipeng Kang,Jiong Guo,Jiayu Zhan,Junqi Wang

Main category: cs.CL

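TL;DR: This study proposes an abstraction-grounding framework for assessing how well LLMs understand abstract concepts such as human values. It finds that models use abstract value representations as stable anchors that guide concrete decisions, but that these representations are hard to change at the abstract level.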

Details Motivation: To examine whether LLMs genuinely understand abstract concepts or merely manipulate statistical patterns, especially for the key problem of value alignment. Method: The authors propose an abstraction-grounding framework (A-A, A-C, C-C) and analyze internal representations with probing and steering across six open-source LLMs and ten value dimensions. Result: Probing shows that abstract value representations transfer across levels to concrete situations. Steering shows that modifying representations shifts concrete judgments (A-C, C-C) but leaves abstract understanding unchanged (A-A), indicating that abstract representations are stable. Conclusion: LLMs hold structured value representations that link abstraction and action, providing a mechanistic basis for building interpretable, controllable, value-driven AI systems. Abstract: Do large language models (LLMs) genuinely understand abstract concepts, or merely manipulate them as statistical patterns? We introduce an abstraction-grounding framework that decomposes conceptual understanding into three capacities: interpretation of abstract concepts (Abstract-Abstract, A-A), grounding of abstractions in concrete events (Abstract-Concrete, A-C), and application of abstract principles to regulate concrete decisions (Concrete-Concrete, C-C). Using human values as a testbed - given their semantic richness and centrality to alignment - we employ probing (detecting value traces in internal activations) and steering (modifying representations to shift behavior). Across six open-source LLMs and ten value dimensions, probing shows that diagnostic probes trained solely on abstract value descriptions reliably detect the same values in concrete event narratives and decision reasoning, demonstrating cross-level transfer. Steering reveals an asymmetry: intervening on value representations causally shifts concrete judgments and decisions (A-C, C-C), yet leaves abstract interpretations unchanged (A-A), suggesting that encoded abstract values function as stable anchors rather than malleable activations. These findings indicate LLMs maintain structured value representations that bridge abstraction and action, providing a mechanistic and operational foundation for building value-driven autonomous AI systems with more transparent, generalizable alignment and control.

[167] RM-Distiller: Exploiting Generative LLM for Reward Model Distillation

Hongli Zhou,Hui Huang,Wei Liu,Chenglong Wang,Xingyuan Bu,Lvyuan Han,Fuhai Song,Muyun Yang,Wenhao Jiang,Hailong Cao,Tiejun Zhao

Main category: cs.CL

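TL;DR: This paper proposes RM-Distiller, a new reward model distillation framework that exploits several capabilities of generative LLMs (refinement, scoring, and generation) and significantly outperforms traditional methods across multiple experiments.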

Details Motivation: High-quality human preference annotations are hard to obtain, and existing methods typically treat the teacher model as a simple binary annotator, failing to exploit its rich knowledge and capabilities and limiting the quality of reward model distillation. Method: RM-Distiller systematically uses three capabilities of the teacher LLM: (1) refinement, generating highly correlated response pairs that provide fine-grained contrastive signals; (2) scoring, guiding the reward model to learn precise preference strength through a margin-aware optimization objective; and (3) generation, regularizing with the teacher's generative distribution to preserve linguistic knowledge. Result: Extensive experiments show that RM-Distiller significantly outperforms traditional distillation methods on reward model benchmarks and on reinforcement learning-based alignment. Conclusion: Fully exploiting the teacher LLM's multifaceted capabilities is critical for effective reward modeling, and the authors describe this as the first systematic study of reward model distillation from generative LLMs. Abstract: Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. Due to the difficulty of obtaining high-quality human preference annotations, distilling preferences from generative LLMs has emerged as a standard practice. However, existing approaches predominantly treat teacher models as simple binary annotators, failing to fully exploit the rich knowledge and capabilities for RM distillation. To address this, we propose RM-Distiller, a framework designed to systematically exploit the multifaceted capabilities of teacher LLMs: (1) Refinement capability, which synthesizes highly correlated response pairs to create fine-grained and contrastive signals. (2) Scoring capability, which guides the RM in capturing precise preference strength via a margin-aware optimization objective. (3) Generation capability, which incorporates the teacher's generative distribution to regularize the RM to preserve its fundamental linguistic knowledge. Extensive experiments demonstrate that RM-Distiller significantly outperforms traditional distillation methods both on RM benchmarks and reinforcement learning-based alignment, proving that exploiting multifaceted teacher capabilities is critical for effective reward modeling. To the best of our knowledge, this is the first systematic research on RM distillation from generative LLMs.
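A common way to make a pairwise reward-model loss "margin-aware" is to subtract a margin from the reward gap inside a Bradley-Terry style objective. The sketch below shows that generic formulation for one pair. It is an assumption that RM-Distiller's objective has this shape, and the function and argument names are hypothetical. In the setting the abstract describes, the margin would be derived from the teacher's scores.

```python
import math

def margin_aware_loss(reward_chosen, reward_rejected, margin):
    """-log(sigmoid(r_chosen - r_rejected - margin)) for one preference pair.

    A larger margin demands a larger reward gap before the loss becomes small.
    """
    x = reward_chosen - reward_rejected - margin
    # Stable softplus(-x): log(1 + exp(-x)) equals -log(sigmoid(x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

With a zero margin this reduces to the usual pairwise ranking loss. A strongly preferred pair (large margin) keeps producing gradient until the reward model separates the two responses by at least that much.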

[168] Top 10 Open Challenges Steering the Future of Diffusion Language Model and Its Variants

Yunhe Wang,Kai Han,Huiling Zhen,Yuchuan Tian,Hanting Chen,Yongbing Huang,Yufei Cui,Yingte Shu,Shan Gao,Ismail Elezi,Roy Vaughan Miles,Songcen Xu,Feng Wen,Chao Xu,Sinan Zeng,Dacheng Tao

Main category: cs.CL

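TL;DR: This paper discusses the potential and challenges of Diffusion Language Models (DLMs) under the currently dominant auto-regressive paradigm, and proposes four strategic pillars for achieving a breakthrough.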

Details Motivation: Auto-regressive models are limited by a causal bottleneck and lack global structural foresight and iterative refinement, which motivates exploring DLMs as a more promising alternative paradigm. Method: The paper analyzes ten fundamental challenges facing DLMs and proposes a four-part roadmap covering foundational infrastructure, algorithmic optimization, cognitive reasoning, and multimodal intelligence. Result: It identifies why the potential of DLMs remains untapped and systematically lays out key directions for advancing them, such as multi-scale tokenization, active remasking, and latent thinking. Conclusion: A shift toward a "diffusion-native" ecosystem is essential for next-generation AI capable of complex structural reasoning, dynamic self-correction, and seamless multimodal integration. Abstract: The paradigm of Large Language Models (LLMs) is currently defined by auto-regressive (AR) architectures, which generate text through a sequential "brick-by-brick" process. Despite their success, AR models are inherently constrained by a causal bottleneck that limits global structural foresight and iterative refinement. Diffusion Language Models (DLMs) offer a transformative alternative, conceptualizing text generation as a holistic, bidirectional denoising process akin to a sculptor refining a masterpiece. However, the potential of DLMs remains largely untapped as they are frequently confined within AR-legacy infrastructures and optimization frameworks. In this Perspective, we identify ten fundamental challenges ranging from architectural inertia and gradient sparsity to the limitations of linear reasoning that prevent DLMs from reaching their "GPT-4 moment". We propose a strategic roadmap organized into four pillars: foundational infrastructure, algorithmic optimization, cognitive reasoning, and unified multimodal intelligence. By shifting toward a diffusion-native ecosystem characterized by multi-scale tokenization, active remasking, and latent thinking, we can move beyond the constraints of the causal horizon. We argue that this transition is essential for developing next-generation AI capable of complex structural reasoning, dynamic self-correction, and seamless multimodal integration.

[169] PRiSM: Benchmarking Phone Realization in Speech Models

Shikhar Bharadwaj,Chin-Jou Li,Yoonjae Kim,Kwanghee Choi,Eunjung Yeo,Ryan Soh-Eun Shim,Hanyu Zhou,Brendon Boldt,Karen Rosero Jacome,Kalvin Chang,Darsh Agrawal,Keer Xu,Chao-Han Huck Yang,Jian Zhu,Shinji Watanabe,David R. Mortensen

Main category: cs.CL

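TL;DR: PRiSM is the first open-source benchmark designed to expose blind spots in the phonetic perception of phone recognition systems, using intrinsic and extrinsic evaluation to advance multilingual speech models.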

Details Motivation: Existing phone recognition evaluations measure only surface-level transcription accuracy and do not probe phonetic perception in depth, leaving blind spots in downstream uses such as cross-lingual, clinical, and educational applications. Method: The PRiSM benchmark standardizes transcription-based evaluation and adds representation probes and downstream task probes, giving intrinsic and extrinsic evaluation across multilingual, clinical, and educational settings. Result: Diverse language exposure during training is key to performance, encoder-CTC models are the most stable, and specialized phone recognition models still outperform Large Audio Language Models. Conclusion: With a unified evaluation framework, PRiSM reveals the limitations of current phone recognition systems and supports the development of multilingual speech models with robust phonetic ability. Abstract: Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR models still outperform Large Audio Language Models. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability: https://github.com/changelinglab/prism.

[170] Understanding Multilingualism in Mixture-of-Experts LLMs: Routing Mechanism, Expert Specialization, and Layerwise Steering

Yuxin Chen,Zhengzhou Cai,Xiangtian Ji,Weixiang Zhao,An Zhang,Xiang Wang,Tat-Seng Chua

Main category: cs.CL

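TL;DR: This paper systematically analyzes routing behavior and expert specialization in MoE models during multilingual processing, finding alignment with language families, distinct layerwise usage patterns, and a dependence on resource level. It also proposes a routing-guided steering method at inference time that consistently improves multilingual performance.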

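Details Motivation: Although MoE architectures show strong multilingual capabilities, their internal mechanisms, such as cross-language performance differences and expert allocation patterns, remain unclear and need closer study to improve both understanding and performance. Method: The authors systematically analyze routing behavior and expert specialization across languages and network depth, and run layerwise interventions to reveal the language-specific and language-agnostic roles of each layer. Based on the findings, they propose an adaptive method that steers middle-layer routing toward shared experts associated with dominant languages at inference time. Result: Routing aligns with language families, and expert usage follows a layerwise pattern. High-resource languages rely more on shared experts, while low-resource languages depend more on language-exclusive experts yet perform worse. Middle layers act as language-agnostic capacity hubs. The routing-guided method improves multilingual performance, particularly for linguistically related language pairs. Conclusion: Multilingual processing in MoE models is highly structured, with middle layers providing general-purpose capacity. The proposed steering strategy adjusts expert selection dynamically to strengthen low-resource or related languages and suggests new directions for efficient multilingual model design. Abstract: Mixture-of-Experts (MoE) architectures have shown strong multilingual capabilities, yet the internal mechanisms underlying performance gains and cross-language differences remain insufficiently understood. In this work, we conduct a systematic analysis of MoE models, examining routing behavior and expert specialization across languages and network depth. Our analysis reveals that multilingual processing in MoE models is highly structured: routing aligns with linguistic families, expert utilization follows a clear layerwise pattern, and high-resource languages rely on shared experts while low-resource languages depend more on language-exclusive experts despite weaker performance. Layerwise interventions further show that early and late MoE layers support language-specific processing, whereas middle layers serve as language-agnostic capacity hubs. Building on these insights, we propose a routing-guided steering method that adaptively guides routing behavior in middle layers toward shared experts associated with dominant languages at inference time, leading to consistent multilingual performance improvements, particularly for linguistically related language pairs. Our code is available at https://github.com/conctsai/Multilingualism-in-Mixture-of-Experts-LLMs.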
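One simple way to picture routing steering at inference time is to add a bias to the router logits of a chosen set of shared experts before top-k selection. The sketch below illustrates only that general idea. The additive bias, the fixed `strength`, and the function name are assumptions, and the paper's adaptive scheme, available in the linked repository, may work differently.

```python
def steer_router(logits, shared_experts, strength, top_k):
    """Return the indices of the top-k experts after biasing shared experts.

    logits: per-expert router scores for one token in one MoE layer.
    shared_experts: set of expert indices to favour (hypothetical input).
    strength: additive bias applied to those experts' logits.
    """
    steered = [score + (strength if i in shared_experts else 0.0)
               for i, score in enumerate(logits)]
    ranked = sorted(range(len(steered)), key=lambda i: steered[i], reverse=True)
    return ranked[:top_k]
```

With `strength=0.0` the router behaves as usual. Applying a positive bias only in middle layers would mirror the abstract's observation that those layers act as language-agnostic capacity hubs.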

[171] Kakugo: Distillation of Low-Resource Languages into Small Language Models

Peter Devine,Mardhiyah Sanni,Farid Adilazuarda,Julieta Gil Loizaga,Barry Haddow

Main category: cs.CL

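TL;DR: Kakugo is a low-cost method for training general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. It uses a large teacher model to generate synthetic data, is validated on 54 low-resource languages, and costs under $50 per language.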

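Details Motivation: Low-resource languages lack high-quality annotated data and face high training costs; the goal is to let communities build AI models for their own languages affordably. Method: Use a large teacher model to automatically generate synthetic prompts and translate instruction datasets, build training data for low-resource languages, and train Small Language Models (SLMs) on it. Result: Across translation, classification, question answering, and other NLP tasks, SLMs trained with Kakugo consistently outperform the base models; models were built for 54 low-resource languages at a total cost of under $50 per language. Conclusion: Kakugo offers an efficient, low-cost, easy-to-deploy recipe for training SLMs for low-resource languages and lowers the barrier to language-specific AI development. Abstract: We present Kakugo, a novel and cost-effective pipeline designed to train general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. By using a large teacher model to generate synthetic prompts and translate instruction datasets, we produced training data and SLMs for 54 low-resource languages. Evaluations across a diverse set of general natural language processing tasks, including translation, classification, and question answering, demonstrate that our pipeline consistently improves performance over base models. With a total generation and training cost of under $50 per language, Kakugo offers an accessible method for communities to develop language-specific AI.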

[172] XCR-Bench: A Multi-Task Benchmark for Evaluating Cultural Reasoning in LLMs

Mohsinul Kabir,Tasnim Ahmed,Md Mezbaur Rahman,Shaoxiong Ji,Hassan Alhuzali,Sophia Ananiadou

Main category: cs.CL

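TL;DR: This paper introduces XCR-Bench, a benchmark for evaluating cross-cultural reasoning in large language models. It contains 4.9k parallel sentences and 1,098 unique Culture-Specific Items (CSIs) and spans three distinct reasoning tasks.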

Details Motivation: Work on evaluating the cross-cultural competence of LLMs has been limited by the scarcity of high-quality, CSI-annotated parallel corpora. Method: Combine Newmark's CSI framework with Hall's Triad of Culture to build the XCR-Bench corpus, and design three cross-cultural reasoning tasks with corresponding evaluation metrics. Result: State-of-the-art LLMs show clear weaknesses in identifying and adapting CSIs related to social etiquette and cultural reference, and exhibit regional and ethno-religious biases even within a single linguistic setting. Conclusion: XCR-Bench provides a new tool for systematically evaluating cross-cultural adaptation across visible, semi-visible, and invisible cultural elements (such as norms, beliefs, and values), and exposes the limits of current LLMs' cultural sensitivity. Abstract: Cross-cultural competence in large language models (LLMs) requires the ability to identify Culture-Specific Items (CSIs) and to adapt them appropriately across cultural contexts. Progress in evaluating this capability has been constrained by the scarcity of high-quality CSI-annotated corpora with parallel cross-cultural sentence pairs. To address this limitation, we introduce XCR-Bench, a Cross(X)-Cultural Reasoning Benchmark consisting of 4.9k parallel sentences and 1,098 unique CSIs, spanning three distinct reasoning tasks with corresponding evaluation metrics. Our corpus integrates Newmark's CSI framework with Hall's Triad of Culture, enabling systematic analysis of cultural reasoning beyond surface-level artifacts and into semi-visible and invisible cultural elements such as social norms, beliefs, and values. Our findings show that state-of-the-art LLMs exhibit consistent weaknesses in identifying and adapting CSIs related to social etiquette and cultural reference. Additionally, we find evidence that LLMs encode regional and ethno-religious biases even within a single linguistic setting during cultural adaptation. We release our corpus and code to facilitate future research on cross-cultural NLP.

[173] Truth with a Twist: The Rhetoric of Persuasion in Professional vs. Community-Authored Fact-Checks

Olesya Razuvayevskaya,Kalina Bontcheva

Main category: cs.CL

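TL;DR: This study presents the first large-scale comparison of persuasion techniques in crowd-written versus professionally written debunks. It finds that crowd-written debunks do not use more persuasion techniques than professional ones, and that crowd ratings effectively identify and penalize problematic rhetoric.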

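Details Motivation: Examine how crowd-written debunks (such as Community Notes) and professional fact-checks differ in their use of persuasion techniques, and test the hypothesis that crowd-written content is more subjective. Method: Using large-scale datasets from Community Notes, EUvsDisinfo, and DBKF, quantify the types and frequency of persuasion techniques across fact-checking platforms, and compare rhetorical differences and crowd rating responses. Result: No evidence that crowd-written debunks use more persuasion techniques; the two kinds of debunks differ systematically in rhetoric; notes with more persuasive elements receive slightly higher helpfulness ratings, but particular problematic rhetorical devices are effectively penalized by raters. Conclusion: Crowd-written debunks are not more subjective than professional ones in their use of persuasion techniques, and crowd raters can recognize and resist problematic rhetoric; the rhetorical differences reflect the institutional norms and topical coverage of each platform. Abstract: This study presents the first large-scale comparison of persuasion techniques present in crowd- versus professionally-written debunks. Using extensive datasets from Community Notes (CNs), EUvsDisinfo, and the Database of Known Fakes (DBKF), we quantify the prevalence and types of persuasion techniques across these fact-checking ecosystems. Contrary to the prior hypothesis that community-produced debunks rely more heavily on subjective or persuasive wording, we find no evidence that CNs contain a higher average number of persuasion techniques than professional fact-checks. We additionally identify systematic rhetorical differences between CNs and professional debunking efforts, reflecting differences in institutional norms and topical coverage. Finally, we examine how the crowd evaluates persuasive language in CNs and show that, although notes with more persuasive elements receive slightly higher overall helpfulness ratings, crowd raters are effective at penalising the use of particular problematic rhetorical means.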

[174] Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns

George Mihaila

Main category: cs.CL

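TL;DR: This paper proposes Explanation Network (ExpNet), a lightweight neural network that learns to produce token-level importance scores from a transformer's attention patterns. It is benchmarked in a cross-task setting against methods that rely on hand-defined rules or black-box perturbation, and the summary reports that it performs better.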

Details Motivation: Existing attention-based explanation methods depend on manually defined aggregation strategies and fixed attribution rules, while model-agnostic methods (such as LIME and SHAP) treat the model as a black box and incur high computational cost from input perturbation; an efficient, learnable explanation mechanism is missing. Method: Propose ExpNet, a lightweight neural network that learns to map transformer attention patterns to token-level importance scores, automatically discovering the best combinations of attention features instead of relying on preset rules. Result: An extensive cross-task evaluation against a range of model-agnostic methods and attention-based techniques indicates that ExpNet outperforms existing methods in explanation quality. Conclusion: ExpNet offers a data-driven, learnable, and efficient way to explain transformer models by automatically mining key feature combinations from the attention mechanism, improving interpretability and practicality. Abstract: Explainable AI (XAI) has become critical as transformer-based models are deployed in high-stakes applications including healthcare, legal systems, and financial services, where opacity hinders trust and accountability. Transformer self-attention mechanisms have proven valuable for model interpretability, with attention weights successfully used to understand model focus and behavior (Xu et al., 2015; Wiegreffe and Pinter, 2019). However, existing attention-based explanation methods rely on manually defined aggregation strategies and fixed attribution rules (Abnar and Zuidema, 2020a; Chefer et al., 2021), while model-agnostic approaches (LIME, SHAP) treat the model as a black box and incur significant computational costs through input perturbation. We introduce Explanation Network (ExpNet), a lightweight neural network that learns an explicit mapping from transformer attention patterns to token-level importance scores. Unlike prior methods, ExpNet discovers optimal attention feature combinations automatically rather than relying on predetermined rules. We evaluate ExpNet in a challenging cross-task setting and benchmark it against a broad spectrum of model-agnostic methods and attention-based techniques spanning four methodological families.

[175] NewsRECON: News article REtrieval for image CONtextualization

Jonathan Tonglet,Iryna Gurevych,Tinne Tuytelaars,Marie-Francine Moens

Main category: cs.CL

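TL;DR: This paper proposes NewsRECON, a method that infers the date and location of a news image from the metadata of linked news articles when reverse image search returns no results. It combines a bi-encoder with cross-encoders and achieves new SOTA results on multiple datasets.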

Details Motivation: Existing methods that locate news images through reverse image search (RIS) often fail because RIS returns nothing, which limits practical use; the case without RIS evidence needs to be addressed. Method: Build a corpus of over 90,000 articles, retrieve event-relevant articles with a bi-encoder, then rerank them with two cross-encoders by location and event consistency, and infer the image's date and location from article metadata. Result: Outperforms prior work on the TARA and 5Pils-OOC datasets, and can be combined with a multimodal large language model to reach new SOTA performance. Conclusion: NewsRECON addresses date and location inference for news images when RIS fails, improving reasoning when no direct visual match is available. Abstract: Identifying when and where a news image was taken is crucial for journalists and forensic experts to produce credible stories and debunk misinformation. While many existing methods rely on reverse image search (RIS) engines, these tools often fail to return results, thereby limiting their practical applicability. In this work, we address the challenging scenario where RIS evidence is unavailable. We introduce NewsRECON, a method that links images to relevant news articles to infer their date and location from article metadata. NewsRECON leverages a corpus of over 90,000 articles and integrates: (1) a bi-encoder for retrieving event-relevant articles; (2) two cross-encoders for reranking articles by location and event consistency. Experiments on the TARA and 5Pils-OOC datasets show that NewsRECON outperforms prior work and can be combined with a multimodal large language model to achieve new SOTA results in the absence of RIS evidence. We make our code available.

[176] A Systematic Analysis of Chunking Strategies for Reliable Question Answering

Sofia Bennani,Charles Moslonka

Main category: cs.CL

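TL;DR: This paper studies how document chunking strategies affect the reliability of industrial Retrieval-Augmented Generation (RAG) systems. An end-to-end evaluation compares chunking methods, chunk sizes, overlap, and context lengths, and yields recommendations for cost-efficient deployment.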

Details Motivation: In industrial practice, RAG systems often chunk documents by heuristics without systematic evaluation; this paper uses empirical analysis to show how chunking strategies affect system performance and to guide deployment. Method: Run an end-to-end evaluation on the Natural Questions dataset, systematically varying chunking method (token, sentence, semantic, code), chunk size, overlap, and context length, in a standard industrial setup with a SPLADE retriever and a Mistral-8B generator. Result: (i) Overlap brings no measurable benefit but increases indexing cost; (ii) sentence chunking is the most cost-effective, matching semantic chunking up to about 5k tokens; (iii) a "context cliff" degrades quality beyond about 2.5k tokens; (iv) the optimal context length depends on the goal (semantic quality peaks at small contexts, while exact match needs larger ones). Conclusion: Sentence chunking is an efficient and economical choice, overlap should be avoided, and context length should be chosen according to the task objective. Abstract: We study how document chunking choices impact the reliability of Retrieval-Augmented Generation (RAG) systems in industry. While practice often relies on heuristics, our end-to-end evaluation on Natural Questions systematically varies chunking method (token, sentence, semantic, code), chunk size, overlap, and context length. We use a standard industrial setup: SPLADE retrieval and a Mistral-8B generator. We derive actionable lessons for cost-efficient deployment: (i) overlap provides no measurable benefit and increases indexing cost; (ii) sentence chunking is the most cost-effective method, matching semantic chunking up to ~5k tokens; (iii) a "context cliff" reduces quality beyond ~2.5k tokens; and (iv) optimal context depends on the goal (semantic quality peaks at small contexts; exact match at larger ones).
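To make the sentence-chunking setting concrete, here is a minimal sketch of overlap-free sentence chunking. It assumes a naive regex sentence splitter and whitespace token counts; the abstract does not say which splitter or tokenizer the authors use, and `sentence_chunks` and `max_tokens` are hypothetical names chosen for illustration.

```python
import re

def sentence_chunks(text, max_tokens=50):
    """Greedily pack whole sentences into chunks of at most max_tokens
    whitespace tokens, with no overlap between consecutive chunks."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Close the current chunk if this sentence would exceed the budget.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than max_tokens still becomes its own chunk.
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

example = sentence_chunks("One two three. Four five six. Seven eight nine.", max_tokens=6)
```

With a budget of six tokens, the first two three-token sentences share a chunk and the third starts a new one. Adding overlap would mean repeating trailing sentences at the start of the next chunk, which the abstract reports as costing more to index without a measurable benefit.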

[177] Style Transfer as Bias Mitigation: Diffusion Models for Synthetic Mental Health Text for Arabic

Saad Mankarious,Aya Zirikly

Main category: cs.CL

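TL;DR: This paper proposes a pretraining-free, diffusion-based text generation method for mitigating gender bias in Arabic mental health data. It frames bias mitigation as a style transfer task and generates high-entropy, semantically faithful female-style text without relying on large language models.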

Details Motivation: Existing LLM-based synthetic data methods suffer from limited diversity and inherit biases from their training data; gender imbalance is especially severe in Arabic mental health data, so a method is needed that mitigates bias without depending on pretrained models. Method: Model gender bias mitigation as male-to-female style transfer; from the CARMA Arabic mental health corpus, construct five datasets that reflect different linguistic and semantic aspects of gender expression, and train a separate diffusion-based generator on each. Result: Quantitative results show that the generated text stays semantically faithful to the source while diverging meaningfully in surface style; qualitative analysis confirms that the gender transformations are linguistically plausible; the outputs are high-entropy. Conclusion: Diffusion-based style transfer can generate diverse, semantically consistent synthetic text without pretrained LLMs, providing a flexible framework for mitigating gender bias in low-resource, sensitive domains. Abstract: Synthetic data offers a promising solution for mitigating data scarcity and demographic bias in mental health analysis, yet existing approaches largely rely on pretrained large language models (LLMs), which may suffer from limited output diversity and propagate biases inherited from their training data. In this work, we propose a pretraining-free diffusion-based approach for synthetic text generation that frames bias mitigation as a style transfer problem. Using the CARMA Arabic mental health corpus, which exhibits a substantial gender imbalance, we focus on male-to-female style transfer to augment underrepresented female-authored content. We construct five datasets capturing varying linguistic and semantic aspects of gender expression in Arabic and train separate diffusion models for each setting. Quantitative evaluations demonstrate consistently high semantic fidelity between source and generated text, alongside meaningful surface-level stylistic divergence, while qualitative analysis confirms linguistically plausible gender transformations. Our results show that diffusion-based style transfer can generate high-entropy, semantically faithful synthetic data without reliance on pretrained LLMs, providing an effective and flexible framework for mitigating gender bias in sensitive, low-resource mental health domains.

[178] Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models

Hyunjong Ok,Jaeho Lee

Main category: cs.CL

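TL;DR: In multiple-choice question answering, placing the context before the question and options (CQO) outperforms the reverse order (QOC) by more than 14 percentage points; the study identifies causal attention as the key mechanism.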

Details Motivation: Understand the mechanism behind LLMs' sensitivity to prompt structure, in particular why the position of the context affects performance. Method: Through systematic architectural analysis, study the difference between CQO and QOC prompt orders and identify the role of the causal attention mask. Result: In QOC prompts, the causal mask prevents option tokens from attending to the context, creating an information bottleneck in which the context is invisible to the options, and performance drops. Conclusion: Causal attention is the core mechanism behind prompt-order sensitivity, which highlights the role of attention structure in prompt engineering. Abstract: Large language models exhibit surprising sensitivity to the structure of the prompt, but the mechanisms underlying this sensitivity remain poorly understood. In this work, we conduct an in-depth investigation on a striking case: in multiple-choice question answering, placing context before the questions and options (CQO) outperforms the reverse order (QOC) by over 14%p, consistently over a wide range of models and datasets. Through systematic architectural analysis, we identify causal attention as the core mechanism: in QOC prompts, the causal mask prevents option tokens from attending to context, creating an information bottleneck where context becomes invisible to options.
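The masking argument can be checked with a toy layout. The sketch below is only an illustration of a standard causal mask (a token attends to itself and to earlier positions); the segment labels `"C"`, `"Q"`, `"O"` and the helper `allowed_pairs` are hypothetical and not taken from the paper.

```python
def allowed_pairs(order):
    """Given segment labels in token order, return the set of
    (query_segment, key_segment) pairs that a causal mask permits."""
    allowed = set()
    for i, query in enumerate(order):
        for j, key in enumerate(order):
            if j <= i:  # causal mask: only the current and earlier positions are visible
                allowed.add((query, key))
    return allowed

# C = context tokens, Q = question tokens, O = option tokens.
cqo = allowed_pairs(["C", "C", "Q", "O", "O"])
qoc = allowed_pairs(["Q", "O", "O", "C", "C"])
```

In the CQO layout the pair `("O", "C")` is allowed, so option tokens can read the context. In the QOC layout it is absent: the context arrives after the options, and only the context tokens can look back at the options, not the other way around.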

[179] Domain-Adaptation through Synthetic Data: Fine-Tuning Large Language Models for German Law

Ali Hamza Bashir,Muhammad Rehan Khalid,Kostadin Cvejoski,Jana Birr,Jule Berghaus,Armin Berger,Sandra Halscheidt,Christian Temath,Rafet Sifa,David Berghaus

Main category: cs.CL

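TL;DR: This paper proposes a method that improves LLM performance on German legal question answering by systematically generating high-quality, diverse, and legally accurate question-answer pairs from authoritative German statutes.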

Details Motivation: Lacking expert knowledge, LLMs often produce factual errors or hallucinations in specialized domains such as legal reasoning; this paper aims to address that problem. Method: Generate synthetic data automatically from authoritative German statutes, apply rigorous automated filtering, and adapt LLMs with parameter-efficient fine-tuning. Result: Models fine-tuned on the synthetic data significantly outperform their baseline counterparts on German legal question answering. Conclusion: Carefully designed synthetic data can serve as a robust alternative to manual annotation in high-stakes, knowledge-intensive domains. Abstract: Large language models (LLMs) often struggle in specialized domains such as legal reasoning due to limited expert knowledge, resulting in factually incorrect outputs or hallucinations. This paper presents an effective method for adapting advanced LLMs to German legal question answering through a novel synthetic data generation approach. In contrast to costly human-annotated resources or unreliable synthetic alternatives, our approach systematically produces high-quality, diverse, and legally accurate question-answer pairs directly from authoritative German statutes. Using rigorous automated filtering methods and parameter-efficient fine-tuning techniques, we demonstrate that LLMs adapted with our synthetic dataset significantly outperform their baseline counterparts on German legal question answering tasks. Our results highlight the feasibility of using carefully designed synthetic data as a robust alternative to manual annotation in high-stakes, knowledge-intensive domains.

[180] Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum

Víctor Yeste,Paolo Rosso

Main category: cs.CL

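TL;DR: This paper studies fine-grained, sentence-level detection of Schwartz motivational values in sentences from news and political manifestos. It compares a hierarchical model with a direct multi-label classifier and evaluates lightweight signals, instruction-tuned LLMs, and ensembles, finding that lightweight signals and small ensembles help most while hierarchical gating offers limited gains.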

Details Motivation: Detecting human values in text matters, but fine-grained sentence-level identification is very hard when context is missing and classes are severely imbalanced; existing models are limited, so more effective modeling approaches are needed. Method: Define a binary moral presence task as a foundation; compare a DeBERTa-base presence-gated hierarchy with a direct multi-label classifier; add lightweight signals (prior-sentence context, lexicon features, topic features); evaluate Gemma, Llama, Mistral, and Qwen models in zero-/few-shot and QLoRA setups; and build a soft-vote ensemble. Result: The binary presence task reaches F1 of about 0.74; direct multi-label classification is not outperformed by the presence-gated hierarchy; the soft-vote supervised ensemble reaches macro-F1 0.332, surpassing the best single model and prior baselines; lightweight signals and ensembles bring reliable gains, while the gated structure is limited by gate recall. Conclusion: Under an 8 GB single-GPU constraint and at the 7-9B scale, carefully tuned supervised encoders remain an efficient and reliable baseline; future work should exploit richer value structure and document context to improve performance further. Abstract: We study sentence-level identification of the 19 values in the Schwartz motivational continuum as a concrete formulation of human value detection in text. The setting - out-of-context sentences from news and political manifestos - features sparse moral cues and severe class imbalance. This combination makes fine-grained sentence-level value detection intrinsically difficult, even for strong modern neural models. We first operationalize a binary moral presence task ("does any value appear?") and show that it is learnable from single sentences (positive-class F1 $\approx$ 0.74 with calibrated thresholds). We then compare a presence-gated hierarchy to a direct multi-label classifier under matched compute, both based on DeBERTa-base and augmented with lightweight signals (prior-sentence context, LIWC-22/eMFD/MJD lexica, and topic features). The hierarchy does not outperform direct prediction, indicating that gate recall limits downstream gains. We also benchmark instruction-tuned LLMs - Gemma 2 9B, Llama 3.1 8B, Mistral 8B, and Qwen 2.5 7B - in zero-/few-shot and QLoRA setups and build simple ensembles; a soft-vote supervised ensemble reaches macro-F1 0.332, significantly surpassing the best single supervised model and exceeding prior English-only baselines. Overall, in this scenario, lightweight signals and small ensembles yield the most reliable improvements, while hierarchical gating offers limited benefit.
We argue that, under an 8 GB single-GPU constraint and at the 7-9B scale, carefully tuned supervised encoders remain a strong and compute-efficient baseline for structured human value detection, and we outline how richer value structure and sentence-in-document context could further improve performance.

[181] HALT: Hallucination Assessment via Latent Testing

Rohan Bhatnagar,Youran Sun,Chi Andrew Zhang,Yixin Wen,Haizhao Yang

Main category: cs.CL

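TL;DR: Proposes a lightweight residual probe that reads hallucination risk directly from the intermediate hidden states of an LLM, enabling fast uncertainty estimation with near-zero added latency and letting confident queries be answered directly while uncertain ones are routed elsewhere.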

Details Motivation: Hallucination in LLMs stems from decoding pressure toward fluent output, so uncertainty signals present inside the model are not reflected in what it says; the aim is to extract unattenuated epistemic uncertainty from intermediate layers for a faithful readout of hallucination risk. Method: Design a lightweight residual probe, a small auxiliary network that reads uncertainty signals from the intermediate hidden states of question tokens in parallel with inference. Its cost is far below that of token generation, so it can run alongside inference and decide immediately whether a stronger verification pipeline is needed. Result: Validated on four QA benchmarks and multiple LLM families, the method achieves strong AUROC and AURAC, generalizes under dataset shift, and reveals interpretable structure in intermediate representations. Conclusion: Fast internal uncertainty readout is a principled foundation for reliable agentic AI, detecting and responding to hallucination risk with almost no added latency. Abstract: Hallucination in large language models (LLMs) can be understood as a failure of faithful readout: although internal representations may encode uncertainty about a query, decoding pressures still yield a fluent answer. We propose lightweight residual probes that read hallucination risk directly from intermediate hidden states of question tokens, motivated by the hypothesis that these layers retain epistemic signals that are attenuated in the final decoding stage. The probe is a small auxiliary network whose computation is orders of magnitude cheaper than token generation and can be evaluated fully in parallel with inference, enabling near-instantaneous hallucination risk estimation with effectively zero added latency in low-risk cases. We deploy the probe as an agentic critic for fast selective generation and routing, allowing LLMs to immediately answer confident queries while delegating uncertain ones to stronger verification pipelines. Across four QA benchmarks and multiple LLM families, the method achieves strong AUROC and AURAC, generalizes under dataset shift, and reveals interpretable structure in intermediate representations, positioning fast internal uncertainty readout as a principled foundation for reliable agentic AI.
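The selective routing idea can be sketched in a few lines. This is not the paper's probe: it assumes the simplest possible stand-in, a logistic readout of one hidden-state vector, and every name here (`probe_risk`, `route`, the weights, the 0.5 threshold) is hypothetical.

```python
import math

def probe_risk(hidden, weights, bias):
    """Assumed probe form: sigmoid of a linear readout of a hidden-state vector.
    The paper describes residual probes; this is a deliberately simpler stand-in."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def route(hidden, weights, bias, threshold=0.5):
    """Answer directly when estimated risk is low; otherwise hand the query
    to a stronger (and slower) verification pipeline."""
    risk = probe_risk(hidden, weights, bias)
    decision = "verify" if risk >= threshold else "answer"
    return decision, risk

low = route([-2.0, 0.0], weights=[1.0, 1.0], bias=0.0)
high = route([3.0, 0.0], weights=[1.0, 1.0], bias=0.0)
```

Because the readout only needs hidden states that the model computes anyway, a check of this kind can run alongside generation, which is the source of the near-zero added latency described in the abstract for low-risk queries.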

[182] MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

Yiyang Wang,Yiqiao Jin,Alex Cabral,Josiah Hester

Main category: cs.CL

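TL;DR: MASCOT is a generalizable framework for multi-perspective socio-collaborative companions that uses a bi-level optimization strategy to address persona collapse and social sycophancy in multi-agent systems.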

Details Motivation: Existing multi-agent systems often exhibit persona collapse and social sycophancy, which leaves dialogue generic and unconstructive. Method: Propose the MASCOT framework with two core components: 1) RLAIF-based persona-aware behavioral alignment, which keeps each agent's persona consistent; and 2) collaborative dialogue optimization guided by group-level rewards, which improves dialogue diversity and contribution. Result: In psychological support and workplace scenarios, MASCOT improves Persona Consistency by up to +14.1 and Social Contribution by up to +10.6, significantly outperforming existing methods. Conclusion: MASCOT provides a practical and generalizable approach to building the next generation of socially intelligent multi-agent systems. Abstract: Multi-agent systems (MAS) have recently emerged as promising socio-collaborative companions for emotional and cognitive support. However, these systems frequently suffer from persona collapse--where agents revert to generic, homogenized assistant behaviors--and social sycophancy, which produces redundant, non-constructive dialogue. We propose MASCOT, a generalizable framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that finetunes individual agents for strict persona fidelity to prevent identity loss; and 2) Collaborative Dialogue Optimization, a meta-policy guided by group-level rewards to ensure diverse and productive discourse. Extensive evaluations across psychological support and workplace domains demonstrate that MASCOT significantly outperforms state-of-the-art baselines, achieving improvements of up to +14.1 in Persona Consistency and +10.6 in Social Contribution. Our framework provides a practical roadmap for engineering the next generation of socially intelligent multi-agent systems.

[183] APEX-Agents

Bertie Vidgen,Austin Mann,Abby Fennelly,John Wright Stanly,Lucas Rothman,Marco Burstein,Julien Benchek,David Ostrofsky,Anirudh Ravichandran,Debnil Sur,Neel Venugopal,Alannah Hsia,Isaac Robinson,Calix Huang,Olivia Varones,Daniyal Khan,Michael Haines,Zach Richards,Chirag Mahapatra,Brendan Foody,Osvald Nitski

Main category: cs.CL

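TL;DR: APEX-Agents is a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks in professional fields such as investment banking, management consulting, and corporate law. It is open source and ships with a complete test environment and evaluation tooling.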

Details Motivation: Measure how well AI agents handle complex, long-horizon tasks in realistic work environments, especially in professional services that require working across applications. Method: Build APEX-Agents, a benchmark of 480 tasks, and develop Archipelago, the infrastructure for running and evaluating agents; evaluate eight agents using Pass@1. Result: Gemini 3 Flash (Thinking=High) scores highest at 24.0%, followed by GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro (all with Thinking=High). Conclusion: Current AI agents remain limited on complex professional tasks; APEX-Agents provides a standardized evaluation platform and open-source resources for improving agent systems. Abstract: We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.

[184] Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Yuming Yang,Mingyoung Lai,Wanxu Zhao,Xiaoran Fan,Zhiheng Xi,Mingqi Wu,Chiyue Huang,Jun Zhao,Haijun Lv,Jian Tong,Yunhua Zhou,Yicheng Zou,Qipeng Guo,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

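TL;DR: Proposes Rank-Surprisal Ratio (RSR), a new metric for assessing how suitable a reasoning trajectory is for teacher-to-student distillation. RSR captures both alignment and informativeness, and correlates strongly with post-training performance across multiple models.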

Details Motivation: Long chain-of-thought trajectories from stronger teachers do not always yield better students, which shows that existing methods fall short when selecting distillation data suited to a given student and that trajectory suitability needs a better measure. Method: Propose the Rank-Surprisal Ratio (RSR), defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, to capture both how well a trajectory aligns with the student's behavior and how much learning signal it carries. Result: Across five student models and reasoning trajectories from 11 teachers, RSR reaches an average Spearman correlation of 0.86 with post-training performance, outperforming existing metrics, and proves useful for both trajectory selection and teacher selection. Conclusion: RSR is a simple and effective metric for judging how well a reasoning trajectory will distill into a student model, balancing behavioral alignment with informativeness. Abstract: Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that closely align with the model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically combine low absolute probability with relatively high-ranked tokens under the student model, balancing learning signal strength and behavioral alignment. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training performance (average Spearman 0.86), outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
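The definition in the abstract translates directly into code. The sketch below assumes the per-token ranks and probabilities under the student model are already available, and that rank 1 denotes the student's most likely token; the abstract does not spell out these conventions, and `rank_surprisal_ratio` is a hypothetical name.

```python
import math

def rank_surprisal_ratio(ranks, probs):
    """RSR as described in the abstract: average token-wise rank divided by
    average negative log-likelihood, both measured under the student model.

    ranks: rank of each trajectory token in the student's predicted
           distribution (assumption: 1 = most likely token).
    probs: probability the student assigns to each trajectory token.
    """
    if not ranks or len(ranks) != len(probs):
        raise ValueError("ranks and probs must be non-empty and equally long")
    avg_rank = sum(ranks) / len(ranks)
    avg_nll = sum(-math.log(p) for p in probs) / len(probs)
    return avg_rank / avg_nll

# Toy trajectory: ranks 1 and 3, surprisals of 1 and 3 nats.
rsr = rank_surprisal_ratio([1, 3], [math.exp(-1), math.exp(-3)])
```

Here the average rank is 2 and the average surprisal is 2 nats, so the ratio is 1. Read against the abstract's motivation, a trajectory whose tokens are low in probability but still ranked near the top would produce a smaller ratio; how the authors threshold or compare values is not stated in this summary.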

cs.CV [Back]

[185] Domain-Specific Self-Supervised Pre-training for Agricultural Disease Classification: A Hierarchical Vision Transformer Study

Arnav S. Sonavane

Main category: cs.CV

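TL;DR: This paper studies the effect of domain-specific self-supervised pre-training on agricultural disease classification with hierarchical vision transformers. SimCLR pre-training on just 3,000 unlabeled images yields a clear accuracy gain that exceeds the gain from architectural improvements.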

Details Motivation: Explore how self-supervised learning can improve disease classification in the data-limited agricultural domain, and assess the relative importance of pre-training versus architecture design. Method: Run SimCLR self-supervised pre-training on unlabeled agricultural images, fine-tune hierarchical vision transformers such as HierarchicalViT (HVT), and compare against other architectures (Swin, ViT). Result: The approach is validated on three datasets (Cotton Leaf Disease, PlantVillage, PlantDoc); HVT-Base improves accuracy by 1.68% over Swin-Base at a comparable parameter count; SimCLR pre-training brings a gain of up to 4.57%, and the gain holds across architectures. Conclusion: For agricultural disease classification, domain-specific self-supervised pre-training improves performance more than architectural optimization, so collecting unlabeled in-domain data for pre-training should take priority. Abstract: We investigate the impact of domain-specific self-supervised pre-training on agricultural disease classification using hierarchical vision transformers. Our key finding is that SimCLR pre-training on just 3,000 unlabeled agricultural images provides a +4.57% accuracy improvement--exceeding the +3.70% gain from hierarchical architecture design. Critically, we show this SSL benefit is architecture-agnostic: applying the same pre-training to Swin-Base yields +4.08%, to ViT-Base +4.20%, confirming practitioners should prioritize domain data collection over architectural choices. Using HierarchicalViT (HVT), a Swin-style hierarchical transformer, we evaluate on three datasets: Cotton Leaf Disease (7 classes, 90.24%), PlantVillage (38 classes, 96.3%), and PlantDoc (27 classes, 87.1%). At matched parameter counts, HVT-Base (78M) achieves 88.91% vs. Swin-Base (88M) at 87.23%, a +1.68% improvement. For deployment reliability, we report calibration analysis showing HVT achieves 3.56% ECE (1.52% after temperature scaling). Code: https://github.com/w2sg-arnav/HierarchicalViT
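The ECE figure quoted above is a binned calibration measure. Here is a minimal sketch of the standard equal-width-bin estimator; the bin count of 10 and the function name are assumptions, since the abstract does not state the binning the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins:
    the sample-weighted average of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bin b holds confidences in [b/n_bins, (b+1)/n_bins); 1.0 goes in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece

# Four predictions at 85% confidence, three of them correct.
ece = expected_calibration_error([0.85, 0.85, 0.85, 0.85], [1, 1, 1, 0])
```

In the toy case the model claims 85% confidence but is right 75% of the time, giving an ECE of 0.10. Temperature scaling, mentioned in the abstract, rescales the logits by a single learned scalar before the softmax so that confidence and accuracy move closer together.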

[186] Multi-modal MRI-Based Alzheimer's Disease Diagnosis with Transformer-based Image Synthesis and Transfer Learning

Jason Qiu

Main category: cs.CV

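TL;DR: This study proposes a 3D TransUNet image synthesis framework that generates high-quality diffusion MRI metric maps (such as FA and MD) from routine T1-weighted MRI, providing microstructural information without an actual diffusion scan and improving diagnostic accuracy for Alzheimer's disease and mild cognitive impairment.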

Details Motivation: Early detection of Alzheimer's disease is essential, but diffusion MRI, although it captures early microstructural abnormalities, is hard to use routinely because scans are long and prone to motion artifacts. A method that infers diffusion information from routine T1w MRI is therefore needed to make clinical diagnosis more accessible and efficient. Method: Propose a 3D TransUNet image synthesis framework that predicts fractional anisotropy (FA) and mean diffusivity (MD) maps directly from T1-weighted MRI, evaluate the generated images with structural similarity (SSIM) and Pearson correlation, and feed the synthetic diffusion features into a multi-modal diagnostic model to measure the gain in AD and MCI classification. Result: The generated FA and MD maps closely match real dMRI (SSIM > 0.93, Pearson correlation > 0.94); AD classification accuracy rises by 5% (78.75% to 83.75%), and MCI detection accuracy rises by 12.5%. Conclusion: High-quality diffusion microstructural information can be inferred from routine T1w MRI, bringing the benefits of multi-modal imaging to settings without diffusion data and potentially improving the accessibility, efficiency, and accuracy of clinical AD diagnosis. Abstract: Alzheimer's disease (AD) is a progressive neurodegenerative disorder in which pathological changes begin many years before the onset of clinical symptoms, making early detection essential for timely intervention. T1-weighted (T1w) Magnetic Resonance Imaging (MRI) is routinely used in clinical practice to identify macroscopic brain alterations, but these changes typically emerge relatively late in the disease course. Diffusion MRI (dMRI), in contrast, is sensitive to earlier microstructural abnormalities by probing water diffusion in brain tissue. dMRI metrics, including fractional anisotropy (FA) and mean diffusivity (MD), provide complementary information about white matter integrity and neurodegeneration. However, dMRI acquisitions are time-consuming and susceptible to motion artifacts, limiting their routine use in clinical populations. To bridge this gap, I propose a 3D TransUNet image synthesis framework that predicts FA and MD maps directly from T1w MRI. My model generates high-fidelity maps, achieving a structural similarity index (SSIM) exceeding 0.93 and a strong Pearson correlation (>0.94) with ground-truth dMRI. When integrated into a multi-modal diagnostic model, these synthetic features boost AD classification accuracy by 5% (78.75%->83.75%) and, most importantly, improve mild cognitive impairment (MCI) detection by 12.5%.
This study demonstrates that high-quality diffusion microstructural information can be inferred from routinely acquired T1w MRI, effectively transferring the benefits of multi-modality imaging to settings where diffusion data are unavailable. By reducing scan time while preserving complementary structural and microstructural information, the proposed approach has the potential to improve the accessibility, efficiency, and accuracy of AD diagnosis in clinical practice.

[187] PointSLAM++: Robust Dense Neural Gaussian Point Cloud-based SLAM

Xu Wang,Boyao Han,Xiaojun Chen,Ying Liu,Ruihui Li

Main category: cs.CV

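TL;DR: PointSLAM++ is an RGB-D SLAM system built on a hierarchically constrained neural Gaussian representation. Using progressive pose optimization and a dynamic neural representation graph, it achieves high-precision 3D reconstruction and photorealistic rendering in the presence of depth noise.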

Details Motivation: Existing SLAM methods struggle to maintain structural consistency and robust pose estimation under depth noise, which limits their use in augmented reality and robotics. Method: Propose PointSLAM++, which models scene structure with a hierarchically constrained neural Gaussian representation, introduces progressive pose optimization to suppress depth sensor noise, and uses a dynamic neural representation graph that adapts the distribution of Gaussian nodes to local geometric complexity. Result: Experiments show that PointSLAM++ outperforms existing 3DGS-based SLAM methods in reconstruction accuracy and rendering quality, with particular advantages in large-scale scenes. Conclusion: By jointly optimizing representation and pose, PointSLAM++ delivers high-precision, real-time 3D reconstruction and rendering, suited to augmented reality, robotics, and other applications that demand structural consistency. Abstract: Real-time 3D reconstruction is crucial for robotics and augmented reality, yet current simultaneous localization and mapping (SLAM) approaches often struggle to maintain structural consistency and robust pose estimation in the presence of depth noise. This work introduces PointSLAM++, a novel RGB-D SLAM system that leverages a hierarchically constrained neural Gaussian representation to preserve structural relationships while generating Gaussian primitives for scene mapping. It also employs progressive pose optimization to mitigate depth sensor noise, significantly enhancing localization accuracy. Furthermore, it utilizes a dynamic neural representation graph that adjusts the distribution of Gaussian nodes based on local geometric complexity, enabling the map to adapt to intricate scene details in real time. This combination yields high-precision 3D mapping and photorealistic scene rendering. Experimental results show PointSLAM++ outperforms existing 3DGS-based SLAM methods in reconstruction accuracy and rendering quality, demonstrating its advantages for large-scale AR and robotics.

[188] Handcrafted Feature-Assisted One-Class Learning for Artist Authentication in Historical Drawings

Hassan Ugail,Jan Ritch-Frel,Irina Matuzava

Main category: cs.CV

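TL;DR: Proposes a computational framework based on one-class autoencoders and handcrafted features for authenticating historical drawings, providing reproducible, quantitative evidence for art attribution when data are scarce.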

Details Motivation: Authentication and attribution of works on paper remain persistent challenges when reference samples are few and stylistic cues are limited; traditional methods struggle with small corpora of predominantly line-based works. Method: Train artist-specific verifiers with one-class autoencoders on interpretable handcrafted features (Fourier-domain energy, Shannon entropy, global contrast, GLCM homogeneity, and fractal complexity), using authenticated sketches from several museum collections for training and evaluation. Result: Across 900 verification decisions, the pooled system achieves a True Acceptance Rate of 83.3% and a False Acceptance Rate of 9.5%; performance varies substantially by artist, and the pattern of false accepts is consistent with stylistic proximity and shared drawing conventions. Conclusion: The method supports rather than replaces expert connoisseurship and suits the data-scarce settings common in historical sketch attribution; tighter control of digitisation artefacts and threshold calibration is also recommended. Abstract: Authentication and attribution of works on paper remain persistent challenges in cultural heritage, particularly when the available reference corpus is small and stylistic cues are primarily expressed through line and limited tonal variation. We present a verification-based computational framework for historical drawing authentication using one-class autoencoders trained on a compact set of interpretable handcrafted features. Ten artist-specific verifiers are trained using authenticated sketches from the Metropolitan Museum of Art open-access collection, the Ashmolean Collections Catalogue, the Morgan Library and Museum, the Royal Collection Trust (UK), the Victoria and Albert Museum Collections, and an online catalogue of the Casa Buonarroti collection, and evaluated under a biometric-style protocol with genuine and impostor trials. Feature vectors comprise Fourier-domain energy, Shannon entropy, global contrast, GLCM-based homogeneity, and a box-counting estimate of fractal complexity. Across 900 verification decisions (90 genuine and 810 impostor trials), the pooled system achieves a True Acceptance Rate of 83.3% with a False Acceptance Rate of 9.5% at the chosen operating point. Performance varies substantially by artist, with near-zero false acceptance for some verifiers and elevated confusability for others. A pairwise attribution of false accepts indicates structured error pathways consistent with stylistic proximity and shared drawing conventions, whilst also motivating tighter control of digitisation artefacts and threshold calibration.
The proposed methodology is designed to complement, rather than replace, connoisseurship by providing reproducible, quantitative evidence suitable for data-scarce settings common in historical sketch attribution.

[189] A one-step generation model with a Single-Layer Transformer: Layer number re-distillation of FreeFlow

Haonan Wei,Linyuan Wang,Nuolin Sun,Zhizhong Zheng,Lei Li,Bin Yan

Main category: cs.CV

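TL;DR: This paper proposes SLT (Single-Layer Transformer), which uses knowledge distillation to compress the 28-layer FreeFlow model into a minimal architecture with a single shared DiT block, cutting the parameter count from 675M to 4.3M. Its fast sampling lets it screen over a hundred noise candidates in the same time budget and pick the best initial point, improving the quality and stability of FreeFlow's generated images.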

Details Motivation: Existing flow matching methods such as FreeFlow already enable fast generation for diffusion models, but their deep architectures (for example, a 28-layer Transformer) are computationally redundant and parameter-heavy, and one-step generation depends on a single noise input, so quality is unstable. A lighter, more efficient model is needed to improve generation stability and efficiency. Method: Observing that FreeFlow's 28-layer Transformer can be viewed as an Euler discretization of an ODE along the depth axis, the authors design SLT, which uses a single shared DiT block to model feature evolution across the whole depth. Training distills the teacher's intermediate features at several depth positions, fuses these patch-level representations, and aligns the final velocity prediction. Result: The DiT-XL/2 model is compressed from 28 layers to one, reducing parameters from 675M to 4.3M; within a time budget comparable to two FreeFlow samplings, SLT performs over a hundred noise screenings and selects a high-quality initial point for FreeFlow generation. Conclusion: SLT greatly reduces model complexity and parameter count, and its efficient noise screening improves the stability and average quality of FreeFlow outputs, mitigating the fluctuations caused by poor initial noise. Abstract: Currently, flow matching methods aim to compress the iterative generation process of diffusion models into a few or even a single step, with MeanFlow and FreeFlow being representative achievements of one-step generation based on Ordinary Differential Equations (ODEs). We observe that the 28-layer Transformer architecture of FreeFlow can be characterized as an Euler discretization scheme for an ODE along the depth axis, where the layer index serves as the discrete time step. Therefore, we distill the number of layers of the FreeFlow model, following the same derivation logic as FreeFlow, and propose SLT (Single-Layer Transformer), which uses a single shared DiT block to approximate the depth-wise feature evolution of the 28-layer teacher. During training, it matches the teacher's intermediate features at several depth patches, fuses those patch-level representations, and simultaneously aligns the teacher's final velocity prediction. Through distillation training, we compress the 28 independent Transformer Blocks of the teacher model DiT-XL/2 into a single Transformer Block, reducing the parameter count from 675M to 4.3M. Furthermore, leveraging its minimal parameters and rapid sampling speed, SLT can screen more candidate points in the noise space within the same timeframe, thereby selecting higher-quality initial points for the teacher model FreeFlow and ultimately enhancing the quality of generated images.
Experimental results demonstrate that within a time budget comparable to two random samplings of the teacher model, our method performs over 100 noise screenings and produces a high-quality sample through the teacher model using the selected points. Quality fluctuations caused by low-quality initial noise under a limited number of FreeFlow sampling calls are effectively avoided, substantially improving the stability and average generation quality of one-step generation.
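The depth-as-time reading in the abstract can be illustrated with a toy sketch: one shared residual block applied repeatedly is a forward Euler integration, and noise screening is a best-of-N selection. Everything below is an assumption for illustration (the function names, the explicit step size `1 / num_layers`, and the scoring function); the paper's actual block, loss, and screening criterion are not specified in this digest.

```python
def euler_depth_rollout(x, block, num_layers=28):
    """Apply one shared block num_layers times: x_{l+1} = x_l + h * block(x_l, t_l).

    `block` stands in for a shared DiT block; h = 1 / num_layers is a hypothetical step size.
    """
    h = 1.0 / num_layers
    for layer in range(num_layers):
        x = x + h * block(x, layer * h)   # residual update = one forward Euler step
    return x

def screen_noise(candidates, score_fn):
    """Return the candidate with the highest score (best-of-N selection)."""
    return max(candidates, key=score_fn)
```

With a cheap student supplying `score_fn`, many candidates can be ranked before the expensive teacher is called once on the winner, which is the budget trade-off the abstract describes.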

[190] Compress to Focus: Efficient Coordinate Compression for Policy Optimization in Multi-Turn GUI Agents

Yurun Song,Jiong Yin,Rongjunchen Zhang,Ian G. Harris

Main category: cs.CV

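TL;DR: This paper proposes Coordinate Compression Policy Optimization (CCPO), an efficient policy optimization framework for multi-turn GUI agents that uses Coordinate-Aware Spatial Compression (CASC) and a distance-based advantage to preserve long-term context while substantially reducing token usage and speeding up training.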

Details Motivation: Multi-turn GUI agents suffer from context inflation during task execution, and existing methods lose important information through truncation or token pruning, so they struggle to balance long-range dependencies with computational efficiency. Method: The proposed CCPO framework includes a CASC module that aggregates coordinates from multiple rollouts to identify key visual regions and dynamically construct attention boundaries, together with a Distance-Based Advantage that gives a finer-grained learning signal than a binary reward. Result: CCPO achieves SOTA performance on four benchmarks, with up to 55% token compression and a 3.8x training speedup. Conclusion: By jointly optimizing visual compression and policy learning, CCPO addresses context inflation in multi-turn GUI agents and improves efficiency while maintaining high performance. Abstract: Multi-turn GUI agents enable complex task completion through sequential decision-making, but suffer from severe context inflation as interaction history accumulates. Existing strategies either sacrifice long-term context via truncation or compromise spatial structure through token pruning. In this paper, we propose Coordinate Compression Policy Optimization (CCPO), an efficient policy optimization framework that couples visual compression with policy optimization for multi-turn GUI agents. CCPO introduces Coordinate-Aware Spatial Compression (CASC), which aggregates coordinates from multiple rollouts to capture target-relevant regions and progressively narrow historical attention around key visual areas. From interactions across rollouts, CASC adaptively constructs attention boundaries that concentrate computation on the most informative regions of the scene. We further design a Distance-Based Advantage that provides fine-grained learning signals based on distance rather than binary correctness, improving both grounding accuracy and compression quality. Extensive experiments demonstrate that CCPO achieves SOTA performance across four benchmarks with up to 55% token compression and 3.8$\times$ training speedup.
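To make the contrast between a binary reward and a distance-based signal concrete, here is a small sketch. It is not the paper's formula: the exponential reward shape, the `scale` parameter, and the group-normalised advantage are assumptions chosen only to show how a click that lands near the target can still receive graded credit.

```python
import math

def distance_reward(pred, target, scale=100.0):
    """Reward in (0, 1] that decays with pixel distance; `scale` is a hypothetical length scale."""
    return math.exp(-math.dist(pred, target) / scale)

def group_advantages(rewards, eps=1e-8):
    """Advantage of each rollout relative to its group: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

Under a binary reward, two missed clicks at 2 px and 50 px from the target would be indistinguishable; here the nearer miss gets the larger advantage.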

[191] KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering

Zhiyang Li,Ao Ke,Yukun Cao,Xike Xie

Main category: cs.CV

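TL;DR: The paper proposes KG-ViP, a framework that fuses scene graphs and commonsense graphs, using the query as a semantic bridge, to strengthen the fine-grained perception and knowledge reliability of multimodal LLMs in visual question answering.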

Details Motivation: Multimodal LLMs in visual question answering commonly suffer from knowledge hallucination and insufficient fine-grained visual perception, and existing methods use scene graphs or commonsense graphs in isolation. Method: A retrieval-and-fusion pipeline uses the query as a semantic bridge to progressively integrate the scene graph and the commonsense graph into a unified structured context that supports multimodal reasoning. Result: KG-ViP significantly outperforms existing methods on the FVQA 2.0+ and MVQA benchmarks. Conclusion: By fusing two complementary kinds of knowledge graph, KG-ViP improves the performance and reasoning reliability of multimodal LLMs on visual question answering. Abstract: Multi-modal Large Language Models (MLLMs) for Visual Question Answering (VQA) often suffer from dual limitations: knowledge hallucination and insufficient fine-grained visual perception. Crucially, we identify that commonsense graphs and scene graphs provide precisely complementary solutions to these respective deficiencies by providing rich external knowledge and capturing fine-grained visual details. However, prior works typically treat them in isolation, overlooking their synergistic potential. To bridge this gap, we propose KG-ViP, a unified framework that empowers MLLMs by fusing scene graphs and commonsense graphs. The core of the KG-ViP framework is a novel retrieval-and-fusion pipeline that utilizes the query as a semantic bridge to progressively integrate both graphs, synthesizing a unified structured context that facilitates reliable multi-modal reasoning. Extensive experiments on FVQA 2.0+ and MVQA benchmarks demonstrate that KG-ViP significantly outperforms existing VQA methods.

[192] Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images

Xuchen Li,Xuzhao Li,Renjie Pi,Shiyu Hu,Jian Zhao,Jiahui Gao

Main category: cs.CV

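TL;DR: This paper proposes ViEBench, a process-verifiable benchmark for evaluating whether vision-language models genuinely use fine-grained visual cues in multi-step reasoning; expert-annotated visual evidence and a dual-axis evaluation matrix expose inconsistencies in the visual reasoning of existing models.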

Details Motivation: Existing benchmarks rely mainly on outcome accuracy and cannot assess whether a model truly uses fine-grained visual cues for multi-step reasoning, which makes it hard to judge the authenticity of its reasoning process. Method: ViEBench comprises 200 high-resolution, multi-scenario images with expert-annotated visual evidence, categorizes tasks by perception and reasoning difficulty, and introduces a dual-axis matrix with four diagnostic quadrants for fine-grained evaluation. Result: The experiments show that (1) models may reach correct answers while grounding on irrelevant regions, and (2) models may locate the correct evidence yet still fail to reach the correct conclusion. Conclusion: ViEBench enables a more transparent and comprehensive evaluation of the authenticity of visual reasoning in vision-language models, offering a more explainable and practical benchmark for agentic VLMs. Abstract: Despite the remarkable progress of Vision-Language Models (VLMs) in adopting "Thinking-with-Images" capabilities, accurately evaluating the authenticity of their reasoning process remains a critical challenge. Existing benchmarks mainly rely on outcome-oriented accuracy, lacking the capability to assess whether models can accurately leverage fine-grained visual cues for multi-step reasoning. To address these limitations, we propose ViEBench, a process-verifiable benchmark designed to evaluate faithful visual reasoning. Comprising 200 multi-scenario high-resolution images with expert-annotated visual evidence, ViEBench uniquely categorizes tasks by difficulty into perception and reasoning dimensions, where reasoning tasks require utilizing localized visual details with prior knowledge. To establish comprehensive evaluation criteria, we introduce a dual-axis matrix that provides fine-grained metrics through four diagnostic quadrants, enabling transparent diagnosis of model behavior across varying task complexities. Our experiments yield several interesting observations: (1) VLMs can sometimes produce correct final answers despite grounding on irrelevant regions, and (2) they may successfully locate the correct evidence but still fail to utilize it to reach accurate conclusions. Our findings demonstrate that ViEBench can serve as a more explainable and practical benchmark for comprehensively evaluating the effectiveness of agentic VLMs. The codes will be released at: https://github.com/Xuchen-Li/ViEBench.

[193] When Rules Fall Short: Agent-Driven Discovery of Emerging Content Issues in Short Video Platforms

Chenghui Yu,Hongwei Wang,Junwen Chen,Zixuan Wang,Bingfeng Deng,Zhuolin Hao,Hongyu Xiong,Yang Song

Main category: cs.CV

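TL;DR: The paper proposes an automatic issue discovery method based on multimodal LLM agents that uses two-stage clustering to identify emerging issues in short videos and automatically generates updated annotation policies, markedly improving content governance efficiency.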

Details Motivation: Traditional human-driven discovery of emerging issues is slow, so annotation policies are updated late and content governance suffers. Method: A multimodal LLM agent recalls videos that may contain new issues, a two-stage clustering strategy groups them so that each cluster corresponds to one new issue, and the agent then generates updated annotation policies. Result: Compared with traditional methods, the F1 score improves by more than 20%, the view count of problematic videos falls by about 15%, and manual cost and policy iteration time drop significantly. Conclusion: The agent-based method discovers emerging content issues and updates governance policies efficiently and automatically; it has been deployed in practice and improves governance performance. Abstract: Trends on short-video platforms evolve at a rapid pace, with new content issues emerging every day that fall outside the coverage of existing annotation policies. However, traditional human-driven discovery of emerging issues is too slow, which leads to delayed updates of annotation policies and poses a major challenge for effective content governance. In this work, we propose an automatic issue discovery method based on multimodal LLM agents. Our approach automatically recalls short videos containing potential new issues and applies a two-stage clustering strategy to group them, with each cluster corresponding to a newly discovered issue. The agent then generates updated annotation policies from these clusters, thereby extending coverage to these emerging issues. Our agent has been deployed in the real system. Both offline and online experiments demonstrate that this agent-based method significantly improves the effectiveness of emerging-issue discovery (with an F1 score improvement of over 20%) and enhances the performance of subsequent issue governance (reducing the view count of problematic videos by approximately 15%). More importantly, compared to manual issue discovery, it greatly reduces time costs and substantially accelerates the iteration of annotation policies.

[194] Now You See Me, Now You Don't: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos

Anil Egin,Andrea Tangherloni,Antitza Dantcheva

Main category: cs.CV

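TL;DR: The paper proposes Anon-NET, a unified framework for facial video anonymization that removes identity information while preserving attributes such as age, gender, race, pose, and expression, using a diffusion-based generative model and motion-aware expression transfer to produce high-quality de-identified videos.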

Details Motivation: Supporting downstream computer vision tasks (such as expression recognition, people tracking, and action recognition) while protecting privacy requires anonymizing face videos, but existing methods struggle to remove identity while retaining key semantic information. Method: Faces are inpainted with a diffusion-based generative model guided by high-level attribute recognition and motion-aware expression transfer; video-driven animation then takes the de-identified face and the original video as input to produce the final result. Result: Experiments on VoxCeleb2, CelebV-HQ, and HDTF show that Anon-NET effectively hides identity while maintaining visual realism and temporal consistency across diverse facial dynamics. Conclusion: Anon-NET de-identifies facial videos while preserving important semantic attributes and dynamics, providing a practical and efficient solution for privacy-preserving video analysis. Abstract: Face video anonymization is aimed at privacy preservation while allowing for the analysis of videos in a number of computer vision downstream tasks such as expression recognition, people tracking, and action recognition. We propose here a novel unified framework referred to as Anon-NET, streamlined to de-identify facial videos, while preserving age, gender, race, pose, and expression of the original video. Specifically, we inpaint faces by a diffusion-based generative model guided by high-level attribute recognition and motion-aware expression transfer. We then animate de-identified faces by video-driven animation, which accepts the de-identified face and the original video as input. Extensive experiments on the datasets VoxCeleb2, CelebV-HQ, and HDTF, which include diverse facial dynamics, demonstrate the effectiveness of Anon-NET in obfuscating identity while retaining visual realism and temporal consistency. The code of Anon-NET will be publicly released.

[195] Evaluating Self-Correcting Vision Agents Through Quantitative and Qualitative Metrics

Aradhya Dixit

Main category: cs.CV

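TL;DR: This paper proposes a diagnostic micro-benchmark for evaluating the self-correction ability of vision-language agents on multimodal tasks, finding that initial success rate does not predict repair ability and that semantic drift is a frequent cause of failure.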

Details Motivation: The self-correction ability of vision-language agents lacks quantitative analysis, and its reasoning bottlenecks are unclear. Method: A diagnostic micro-benchmark decouples Task Success Rate (TSR) from Correction Success Rate (CSR), and a failure taxonomy identifies the main error types. Result: Task success is 62% and correction success is 25% to 33%; the benefit of correction saturates after three retries, and about 28% of failures are attributed to semantic drift. Conclusion: The benchmark provides a reproducible evaluation framework toward stateful, trustworthy multimodal agents. Abstract: Recent progress in multimodal foundation models has enabled Vision-Language Agents (VLAs) to decompose complex visual tasks into executable tool-based plans. While recent benchmarks have begun to evaluate iterative self-correction, its quantitative limits and dominant reasoning bottlenecks remain poorly characterized. This work introduces a Diagnostic Micro-Benchmark. Our analysis decouples Task Success Rate (TSR = 62 percent) from Correction Success Rate (CSR = 25 to 33 percent), revealing that initial competence does not predict repair ability. We explicitly quantify the diminishing returns of correction, which saturates after three retries. Our Failure Taxonomy reveals that a frequent factor is Semantic Drift (about 28 percent of failures), a loss of contextual state. By isolating this reasoning bottleneck, this benchmark defines a reproducible framework toward stateful, trustworthy multimodal agents.

[196] Confident Learning for Object Detection under Model Constraints

Yingda Yu,Jiaqi Xuan,Shuhui Shi,Xuanyu Teng,Shuyang Xu,Guanchao Tong

Main category: cs.CV

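TL;DR: This paper proposes Model-Driven Data Correction (MDDC), a data-centric framework that improves agricultural weed detection on edge devices by iteratively diagnosing and correcting data quality problems, achieving consistent gains of 5-25% in mAP@0.5 with a fixed lightweight model (YOLOv8n).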

Details Motivation: Agricultural weed detection on edge devices faces strict limits on model capacity, compute, and real-time inference latency, which rule out gains from larger models or ensembling, so a way to improve performance without scaling the model is needed. Method: The MDDC framework uses automated error analysis to sort detection failures into four types (false negatives, false positives, class confusion, and localization errors) and repairs the underlying data problems through a version-controlled train-fix-retrain pipeline. Result: In experiments on several weed detection datasets with YOLOv8n, mAP@0.5 improves by 5-25%, showing that the method relieves performance bottlenecks under a fixed model. Conclusion: Systematic data quality optimization can significantly improve the detection performance of a fixed lightweight model on edge devices, offering an effective data-centric solution for resource-constrained deployment. Abstract: Agricultural weed detection on edge devices is subject to strict constraints on model capacity, computational resources, and real-time inference latency, which prevent performance improvements through model scaling or ensembling. This paper proposes Model-Driven Data Correction (MDDC), a data-centric framework that enhances detection performance by iteratively diagnosing and correcting data quality deficiencies. An automated error analysis procedure categorizes detection failures into four types: false negatives, false positives, class confusion, and localization errors. These error patterns are systematically addressed through a structured train-fix-retrain pipeline with version-controlled data management. Experimental results on multiple weed detection datasets demonstrate consistent improvements of 5-25 percent in mAP at 0.5 using a fixed lightweight detector (YOLOv8n), indicating that systematic data quality optimization can effectively alleviate performance bottlenecks under fixed model capacity constraints.
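The four failure types listed in the abstract can be assigned with a simple IoU-based rule. The sketch below is one plausible way to do it, not the MDDC procedure itself: the thresholds `hit=0.5` and `near=0.1`, the box format `(x1, y1, x2, y2)`, and the greedy best-overlap matching are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def categorize(preds, gts, hit=0.5, near=0.1):
    """Label each prediction, then append one 'false_negative' per unmatched ground truth.

    preds and gts are lists of (box, label). A ground truth counts as matched only
    when some prediction overlaps it with IoU >= hit, so a poorly localized box
    yields both a 'localization_error' and a 'false_negative' in this simple scheme.
    """
    report, matched = [], set()
    for box, label in preds:
        best_j, best = -1, 0.0
        for j, (gt_box, _) in enumerate(gts):
            overlap = iou(box, gt_box)
            if overlap > best:
                best_j, best = j, overlap
        if best >= hit:
            matched.add(best_j)
            report.append("correct" if gts[best_j][1] == label else "class_confusion")
        elif best >= near:
            report.append("localization_error")
        else:
            report.append("false_positive")
    report.extend("false_negative" for j in range(len(gts)) if j not in matched)
    return report
```

Counting these labels per image gives a ranked list of samples to review in the next fix-and-retrain round.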

[197] Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

Yuxi Liu,Yipeng Hu,Zekun Zhang,Kunze Jiang,Kun Yuan

Main category: cs.CV

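TL;DR: This paper proposes MOD-DiT, a novel sampling-free dynamic attention framework that addresses the efficiency bottleneck caused by the quadratic complexity of self-attention in Diffusion Transformers for video generation.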

Details Motivation: Existing sparse attention methods for long-sequence video generation either rely on oversimplified static patterns or need expensive sampling operations to achieve dynamic sparsity, which leads to inaccurate pattern predictions and degraded generation quality. Method: MOD-DiT uses a two-stage approach. First, it exploits prior information from early denoising steps and a distributed mixing strategy to build an efficient linear approximation model that predicts mask patterns for a given denoising interval. Second, an online block masking strategy applies the predicted masks dynamically while retaining historical sparsity information, avoiding repeated sampling. Result: Extensive experiments across multiple benchmarks and model architectures show consistent gains in both speed and generation quality. Conclusion: MOD-DiT is a sampling-free dynamic attention framework that accurately models evolving attention patterns, substantially improving video generation efficiency while preserving quality and overcoming the computational limits of traditional sparse attention. Abstract: While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers to practical deployment. Although sparse attention methods attempt to address this challenge, existing approaches either rely on oversimplified static patterns or require computationally expensive sampling operations to achieve dynamic sparsity, resulting in inaccurate pattern predictions and degraded generation quality. To overcome these limitations, we propose a Mixture-Of-Distribution DiT (MOD-DiT), a novel sampling-free dynamic attention framework that accurately models evolving attention patterns through a two-stage process. First, MOD-DiT leverages prior information from early denoising steps and adopts a distributed mixing approach to model an efficient linear approximation model, which is then used to predict mask patterns for a specific denoising interval. Second, an online block masking strategy dynamically applies these predicted masks while maintaining historical sparsity information, eliminating the need for repetitive sampling operations. Extensive evaluations demonstrate consistent acceleration and quality improvements across multiple benchmarks and model architectures, validating MOD-DiT's effectiveness for efficient, high-quality video generation while overcoming the computational limitations of traditional sparse attention approaches.

[198] PSSF: Early osteoarthritis detection using physical synthetic knee X-ray scans and AI radiomics models

Abbas Alzubaidi,Ali Al-Bayaty

Main category: cs.CV

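TL;DR: The paper proposes a physics-based synthetic simulation framework (PSSF) that generates controllable knee X-ray images for AI-based osteoarthritis assessment without involving patients.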

Details Motivation: Knee osteoarthritis is still assessed mainly through subjective radiographic grading, and high-quality, large-scale annotated image datasets are scarce because of privacy and resourcing constraints. Method: The authors develop PSSF, a 2D X-ray projection simulator that generates knee radiographs from a parametric anatomical model, extract radiomic features following the IBSI standard, and predict KL grades with several machine learning models. Result: A virtual cohort of 180 subjects (260 knees) is built and imaged under three protocols; the ML models perform well on binary and three-class tasks, and the features show high stability. Conclusion: PSSF can generate synthetic X-ray images suitable for training AI models, supports quantitative assessment of knee osteoarthritis while avoiding privacy problems, and has potential for clinical application. Abstract: Knee osteoarthritis (OA) is a major cause of disability worldwide and is still largely assessed using subjective radiographic grading, most commonly the Kellgren-Lawrence (KL) scale. Artificial intelligence (AI) and radiomics offer quantitative tools for OA assessment but depend on large, well-annotated image datasets, mainly X-ray scans, that are often difficult to obtain because of privacy, governance and resourcing constraints. In this research, we introduce a physics-based synthetic simulation framework (PSSF) to generate fully controllable X-ray scans without involving patients or violating privacy and institutional constraints. This PSSF is a 2D X-ray projection simulator of anteroposterior knee radiographs from a parametric anatomical model of the distal femur and proximal tibia. Using PSSF, we create a virtual cohort of 180 subjects (260 knees), each imaged under three protocols (reference, low-dose, and geometry-shift). Medial joint regions are automatically localized, preprocessed, and processed following the Image Biomarker Standardisation Initiative (IBSI). Three machine learning (ML) models (logistic regression, random forest, and gradient boosting) are trained for binary (KL-like "0" vs. "2") and three-class (0-2) prediction from the radiographic images. Robustness is assessed in within-protocol, cross-protocol, and multi-protocol scenarios. Finally, feature stability is evaluated using intraclass correlation coefficients across acquisition changes.

[199] Predicting When to Trust Vision-Language Models for Spatial Reasoning

Muhammad Imran,Yugyung Lee

Main category: cs.CV

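TL;DR: This paper proposes a vision-based confidence estimation framework that uses object detection and geometric verification to assess how far the spatial predictions of vision-language models (VLMs) can be trusted, clearly outperforming text-based self-assessment.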

Details Motivation: Existing VLMs perform poorly on spatial reasoning and lack reliable confidence estimates, which makes them hard to deploy safely in robotics and autonomous systems, so a method is needed to decide when their predictions can be trusted. Method: A vision-based confidence estimation framework independently verifies VLM predictions geometrically through object detection, fusing four signals (geometric alignment, spatial ambiguity, detection quality, and VLM internal uncertainty) with a gradient boosting model to predict confidence. Result: The framework reaches 0.674 AUROC on BLIP-2 (a 34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP (a 16.1% improvement); at 60% target accuracy, coverage on BLIP-2 rises from 27.6% to 61.9% (2.2x); feature analysis attributes 87.4% of model importance to vision-based signals. Conclusion: External geometric verification clearly outperforms VLM self-assessment, and the framework supports selective prediction and reliable scene graph construction, improving safety and precision in practice. Abstract: Vision-Language Models (VLMs) demonstrate impressive capabilities across multimodal tasks, yet exhibit systematic spatial reasoning failures, achieving only 49% (CLIP) to 54% (BLIP-2) accuracy on basic directional relationships. For safe deployment in robotics and autonomous systems, we need to predict when to trust VLM spatial predictions rather than accepting all outputs. We propose a vision-based confidence estimation framework that validates VLM predictions through independent geometric verification using object detection. Unlike text-based approaches relying on self-assessment, our method fuses four signals via gradient boosting: geometric alignment between VLM claims and coordinates, spatial ambiguity from overlap, detection quality, and VLM internal uncertainty. We achieve 0.674 AUROC on BLIP-2 (34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP (16.1% improvement), generalizing across generative and classification architectures. Our framework enables selective prediction: at 60% target accuracy, we achieve 61.9% coverage versus 27.6% baseline (2.2x improvement) on BLIP-2. Feature analysis reveals vision-based signals contribute 87.4% of model importance versus 12.7% from VLM confidence, validating that external geometric verification outperforms self-assessment. We demonstrate reliable scene graph construction where confidence-based pruning improves precision from 52.1% to 78.3% while retaining 68.2% of edges.
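The selective prediction figures above (coverage at a target accuracy) follow from a simple computation: rank predictions by confidence and keep the largest prefix whose accuracy still meets the target. The sketch below shows that computation under stated assumptions; the function name and the prefix-based definition of coverage are illustrative choices, and the paper may define the operating point differently.

```python
def coverage_at_accuracy(confidences, correct, target):
    """Largest fraction of predictions, kept in descending confidence order,
    whose accuracy is at least `target`.

    confidences: list of floats; correct: list of 0/1 flags; assumes non-empty inputs.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    best, hits = 0, 0
    for kept, i in enumerate(order, start=1):
        hits += correct[i]
        if hits / kept >= target:     # accuracy of the `kept` most confident predictions
            best = kept
    return best / len(order)
```

A better confidence estimator pushes correct predictions toward the front of the ranking, which is why coverage at a fixed accuracy target rises even though the underlying VLM is unchanged.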

[200] IMSAHLO: Integrating Multi-Scale Attention and Hybrid Loss Optimization Framework for Robust Neuronal Brain Cell Segmentation

Ujjwal Jain,Oshin Misra,Roshni Chakraborty,Mahua Bhattacharya

Main category: cs.CV

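TL;DR: The paper proposes IMSAHLO, a novel deep learning framework for precise segmentation of neuronal cells in fluorescence microscopy that combines multi-scale dense blocks, a hierarchical attention mechanism, and a hybrid loss function, and performs well under large variations in cell density, complex morphology, and class imbalance.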

Details Motivation: Existing deep learning models struggle to preserve fine topological structure and accurate boundaries when segmenting neuronal cells that mix dense and sparse regions, overlap in complex ways, and are class-imbalanced. Method: The IMSAHLO framework includes Multi-Scale Dense Blocks (MSDBs) that capture features at different receptive fields, a Hierarchical Attention (HA) mechanism that focuses on key morphological features, and a hybrid loss that combines Tversky, Focal, clDice, and Contour-Weighted Boundary losses. Result: On the FNC dataset the framework reaches 81.4% precision, 82.7% macro F1, 83.3% micro F1, and 99.5% balanced accuracy, outperforming state-of-the-art methods; ablation studies confirm that the modules work together. Conclusion: IMSAHLO addresses key challenges in neuronal cell segmentation, provides a basis for generalizable biomedical image segmentation models, and advances AI use in high-throughput neurobiological analysis. Abstract: Accurate segmentation of neuronal cells in fluorescence microscopy is a fundamental task for quantitative analysis in computational neuroscience. However, it is significantly impeded by challenges such as the coexistence of densely packed and sparsely distributed cells, complex overlapping morphologies, and severe class imbalance. Conventional deep learning models often fail to preserve fine topological details or accurately delineate boundaries under these conditions. To address these limitations, we propose a novel deep learning framework, IMSAHLO (Integrating Multi-Scale Attention and Hybrid Loss Optimization), for robust and adaptive neuronal segmentation. The core of our model features Multi-Scale Dense Blocks (MSDBs) to capture features at various receptive fields, effectively handling variations in cell density, and a Hierarchical Attention (HA) mechanism that adaptively focuses on salient morphological features to preserve Region of Interest (ROI) boundary details. Furthermore, we introduce a novel hybrid loss function synergistically combining Tversky and Focal loss to combat class imbalance, alongside a topology-aware Centerline Dice (clDice) loss and a Contour-Weighted Boundary loss to ensure topological continuity and precise separation of adjacent cells. Large-scale experiments on the public Fluorescent Neuronal Cells (FNC) dataset demonstrate that our framework outperforms state-of-the-art architectures, achieving precision of 81.4%, macro F1 score of 82.7%, micro F1 score of 83.3%, and balanced accuracy of 99.5% on difficult dense and sparse cases.
Ablation studies validate the synergistic benefits of multi-scale attention and hybrid loss terms. This work establishes a foundation for generalizable segmentation models applicable to a wide range of biomedical imaging modalities, pushing AI-assisted analysis toward high-throughput neurobiological pipelines.
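The Tversky and Focal terms of the hybrid loss have standard definitions, sketched below for a binary mask in NumPy. This is a generic illustration only: the weights `alpha`, `beta`, `gamma`, `w_tversky`, and `w_focal` are hypothetical defaults, and the clDice and Contour-Weighted Boundary terms described in the abstract are omitted.

```python
import numpy as np

def tversky_loss(prob, target, alpha=0.3, beta=0.7, eps=1e-7):
    """1 - TP / (TP + alpha*FP + beta*FN); beta > alpha penalises missed foreground more."""
    tp = (prob * target).sum()
    fp = (prob * (1 - target)).sum()
    fn = ((1 - prob) * target).sum()
    return float(1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps))

def focal_loss(prob, target, gamma=2.0, eps=1e-7):
    """Mean of -(1 - p_t)^gamma * log(p_t), which down-weights easy pixels."""
    p = np.clip(prob, eps, 1 - eps)
    pt = np.where(target == 1, p, 1 - p)      # probability assigned to the true class
    return float((-((1 - pt) ** gamma) * np.log(pt)).mean())

def hybrid_loss(prob, target, w_tversky=1.0, w_focal=1.0):
    """Weighted sum of the two imbalance-aware terms."""
    return w_tversky * tversky_loss(prob, target) + w_focal * focal_loss(prob, target)
```

Both terms address class imbalance from different angles: Tversky reweights false negatives against false positives at the region level, while Focal reweights individual pixels by difficulty.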

[201] Aesthetics as Structural Harm: Algorithmic Lookism Across Text-to-Image Generation and Classification

Miriam Doh,Aditya Gulati,Corina Canali,Nuria Oliver

Main category: cs.CV

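TL;DR: This paper studies algorithmic lookism in text-to-image generative AI, finding that models systematically associate physical attractiveness with positive traits and that gender classifiers show significant gender bias, reinforcing stereotypes around age, gender, and geography.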

Details Motivation: The work aims to expose appearance bias in generative AI and its effect on social inequality, in particular the false association between appearance and traits and unfairness in gender classification. Method: The authors generate 26,400 synthetic faces with Stable Diffusion 2.1 and 3.5 Medium and analyze how three gender classification algorithms perform on faces with different attributes. Result: Generative models systematically encode an association between attractiveness and positive attributes; women's faces, especially those generated with negative attributes, are misclassified more often; newer models show age homogenization, gendered exposure patterns, and geographic reductionism. Conclusion: Algorithmic lookism is systematic infrastructure running across AI vision systems, and it compounds existing social inequalities through both representation and recognition. Abstract: This paper examines algorithmic lookism, the systematic preferential treatment based on physical appearance, in text-to-image (T2I) generative AI and a downstream gender classification task. Through the analysis of 26,400 synthetic faces created with Stable Diffusion 2.1 and 3.5 Medium, we demonstrate how generative AI models systematically associate facial attractiveness with positive attributes and vice-versa, mirroring socially constructed biases rather than evidence-based correlations. Furthermore, we find significant gender bias in three gender classification algorithms depending on the attributes of the input faces. Our findings reveal three critical harms: (1) the systematic encoding of attractiveness-positive attribute associations in T2I models; (2) gender disparities in classification systems, where women's faces, particularly those generated with negative attributes, suffer substantially higher misclassification rates than men's; and (3) intensifying aesthetic constraints in newer models through age homogenization, gendered exposure patterns, and geographic reductionism. These convergent patterns reveal algorithmic lookism as systematic infrastructure operating across AI vision systems, compounding existing inequalities through both representation and recognition. Disclaimer: This work includes visual and textual content that reflects stereotypical associations between physical appearance and socially constructed attributes, including gender, race, and traits associated with social desirability. Any such associations found in this study emerge from the biases embedded in generative AI systems, not from empirical truths or the authors' views.

[202] PSSI-MaxST: An Efficient Pixel-Segment Similarity Index Using Intensity and Smoothness Features for Maximum Spanning Tree Based Segmentation

Kaustubh Shivshankar Shejole,Gaurav Mishra

Main category: cs.CV

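TL;DR: The paper proposes a graph-based segmentation method built on a Pixel Segment Similarity Index (PSSI), combined with MeanShift and a Maximum Spanning Tree (MaxST), that achieves better performance in interactive image segmentation.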

Details Motivation: Existing graph-based interactive segmentation methods are computationally expensive, sensitive to user interaction, and degrade when foreground and background have similar color distributions, largely because of the similarity measure used for edge weights. Method: The proposed Pixel Segment Similarity Index (PSSI) takes the harmonic mean of inter-channel similarities over pixel intensity and spatial smoothness features; MeanShift provides a low-level segmentation from which a pixel-segment graph is built, PSSI supplies the edge weights, and a Maximum Spanning Tree (MaxST) partitions the graph. Result: Experiments on the GrabCut and Images250 datasets show that the method outperforms existing approaches such as AMOE, OneCut, and SSNCut in IoU, F1 score, execution time, and mean error. Conclusion: Combining PSSI with MeanShift and MaxST integrates color, texture, shape, and local connectivity information, improving the accuracy and robustness of interactive image segmentation. Abstract: Interactive graph-based segmentation methods partition an image into foreground and background regions with the aid of user inputs. However, existing approaches often suffer from high computational costs, sensitivity to user interactions, and degraded performance when the foreground and background share similar color distributions. A key factor influencing segmentation performance is the similarity measure used for assigning edge weights in the graph. To address these challenges, we propose a novel Pixel Segment Similarity Index (PSSI), which leverages the harmonic mean of inter-channel similarities by incorporating both pixel intensity and spatial smoothness features. The harmonic mean effectively penalizes dissimilarities in any individual channel, enhancing robustness. The computational complexity of PSSI is $\mathcal{O}(B)$, where $B$ denotes the number of histogram bins. Our segmentation framework begins with low-level segmentation using MeanShift, which effectively captures color, texture, and segment shape. Based on the resulting pixel segments, we construct a pixel-segment graph with edge weights determined by PSSI. For partitioning, we employ the Maximum Spanning Tree (MaxST), which captures strongly connected local neighborhoods beneficial for precise segmentation. The integration of the proposed PSSI, MeanShift, and MaxST allows our method to jointly capture color similarity, smoothness, texture, shape, and strong local connectivity.
Experimental evaluations on the GrabCut and Images250 datasets demonstrate that our method consistently outperforms current graph-based interactive segmentation methods such as AMOE, OneCut, and SSNCut in terms of segmentation quality, as measured by Jaccard Index (IoU), $F_1$ score, execution time and Mean Error (ME). Code is publicly available at: https://github.com/KaustubhShejole/PSSI-MaxST.
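Two ingredients of the abstract can be shown in miniature: why a harmonic mean punishes a single dissimilar channel, and how a maximum spanning tree is extracted from weighted edges. The sketch below is not the published PSSI (see the linked repository for that); histogram intersection as the per-channel similarity and Kruskal's algorithm as the tree construction are assumptions made for illustration.

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalised histograms in O(B) for B bins (assumed per-channel measure)."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def harmonic_mean(values, eps=1e-12):
    """Harmonic mean; a single low value pulls the result down sharply."""
    return len(values) / sum(1.0 / (v + eps) for v in values)

def maximum_spanning_tree(num_nodes, edges):
    """Kruskal's algorithm over edges (weight, u, v), taking the heaviest edges first."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # keep the edge only if it joins two components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree
```

For channel similarities of 0.9, 0.9, and 0.1 the arithmetic mean is about 0.63 while the harmonic mean is about 0.25, so two segments that disagree strongly in one channel receive a weak edge.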

[203] Zeros can be Informative: Masked Binary U-Net for Image Segmentation on Tensor Cores

Chunshu Wu,Ruibing Song,Sushant Kondguli,Tong Geng,Ang Li

Main category: cs.CV

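TL;DR: This paper proposes Masked Binary U-Net (MBU-Net), an efficient binarized network for real-time image segmentation that combines a cost-aware masking strategy with GPU Tensor Core optimization to deliver large speed and energy gains while staying close to full-precision accuracy.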

Details Motivation: Binarized U-Nets for real-time segmentation of high-resolution images suffer severe accuracy loss and lack efficient end-to-end GPU implementations; this work explores how to obtain accurate, low-latency, low-power segmentation on resource-constrained edge devices. Method: Based on the observations that zero masking of weights during training yields sparsity and that quantization sensitivity is uniform across layers, the authors build MBU-Net with a cost-aware masking strategy and design a GPU execution framework, based on a subtractive bit-encoding scheme, that runs masked binary weights with binary activations on the native binary BMMA instructions of Tensor Cores. Result: On three segmentation benchmarks, MBU-Net loses only 3% accuracy on average relative to a 16-bit floating point U-Net while achieving a 2.04x speedup and a 3.54x energy reduction. Conclusion: Through masked binarization and hardware co-design, MBU-Net balances accuracy, efficiency, and deployability, offering a practical solution for real-time high-resolution segmentation on edge devices. Abstract: Real-time image segmentation is a key enabler for AR/VR, robotics, drones, and autonomous systems, where tight accuracy, latency, and energy budgets must be met on resource-constrained edge devices. While U-Net offers a favorable balance of accuracy and efficiency compared to large transformer-based models, achieving real-time performance on high-resolution input remains challenging due to compute, memory, and power limits. Extreme quantization, particularly binary networks, is appealing for its hardware-friendly operations. However, two obstacles limit practicality: (1) severe accuracy degradation, and (2) a lack of end-to-end implementations that deliver efficiency on general-purpose GPUs. We make two empirical observations that guide our design. (1) An explicit zero state is essential: training with zero masking applied to binary U-Net weights yields noticeable sparsity. (2) Quantization sensitivity is uniform across layers. Motivated by these findings, we introduce Masked Binary U-Net (MBU-Net), obtained through a cost-aware masking strategy that prioritizes masking where it yields the highest accuracy-per-cost, reconciling accuracy with near-binary efficiency. To realize these gains in practice, we develop a GPU execution framework that maps MBU-Net to Tensor Cores via a subtractive bit-encoding scheme, efficiently implementing masked binary weights with binary activations. This design leverages native binary Tensor Core BMMA instructions, enabling high throughput and energy savings on widely available GPUs.
Across 3 segmentation benchmarks, MBU-Net attains near full-precision accuracy (3% average drop) while delivering a 2.04x speedup and a 3.54x energy reduction over a 16-bit floating point U-Net.
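The abstract does not spell out the subtractive bit-encoding scheme, so the following is only one plausible reading, offered as a sketch: a masked binary weight in {-1, 0, +1} is stored as two bit-planes (positions of +1 and positions of -1), and a dot product with binary activations in {-1, +1} reduces to AND and popcount operations whose results are subtracted. The function names and the use of Python integers as bit vectors are hypothetical; a GPU kernel would issue the equivalent bitwise matrix instructions instead.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def pack_bits(flags):
    """Pack booleans into an integer, element i stored at bit i."""
    word = 0
    for i, flag in enumerate(flags):
        if flag:
            word |= 1 << i
    return word

def masked_binary_dot(weights, activations):
    """Dot product of weights in {-1, 0, +1} with activations in {-1, +1} using bit operations.

    For any set S of positions, sum of a_i over S equals 2 * popcount(S & act) - popcount(S),
    where `act` marks the positions with a_i = +1.
    """
    plus = pack_bits([w == 1 for w in weights])     # bit-plane of +1 weights
    minus = pack_bits([w == -1 for w in weights])   # bit-plane of -1 weights
    act = pack_bits([a == 1 for a in activations])
    positive_part = 2 * popcount(plus & act) - popcount(plus)
    negative_part = 2 * popcount(minus & act) - popcount(minus)
    return positive_part - negative_part            # the subtraction recovers signed weights
```

Masked (zero) weights appear in neither bit-plane, so they contribute nothing to either popcount, which is how an explicit zero state can coexist with purely binary arithmetic in this reading.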

[204] LTV-YOLO: A Lightweight Thermal Object Detector for Young Pedestrians in Adverse Conditions

Abdullah Jirjees,Ryan Myers,Muhammad Haris Ikram,Mohamed H. Zaki

Main category: cs.CV

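TL;DR: This paper proposes LTV-YOLO, a lightweight object detection model designed for thermal imaging that detects vulnerable road users such as children and adolescents in low light and adverse weather.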

Details Motivation: Reliably detecting vulnerable road users such as children and adolescents remains a key challenge in low light and adverse weather, where conventional RGB cameras fail. Method: Building on the YOLO11 architecture with depthwise separable convolutions and a feature pyramid network (FPN), the authors develop LTV-YOLO for long-wave infrared (LWIR) thermal imaging and optimize it for edge devices. Result: LTV-YOLO performs well at detecting small-scale, partially occluded, and thermally distinct VRUs, with high accuracy and real-time performance suitable for edge devices. Conclusion: LTV-YOLO offers a practical, scalable solution for pedestrian safety in intelligent transportation systems, particularly in school zones, autonomous driving, and smart city infrastructure. Abstract: Detecting vulnerable road users (VRUs), particularly children and adolescents, in low light and adverse weather conditions remains a critical challenge in computer vision, surveillance, and autonomous vehicle systems. This paper presents a purpose-built lightweight object detection model designed to identify young pedestrians in various environmental scenarios. To address these challenges, our approach leverages thermal imaging from long-wave infrared (LWIR) cameras, which enhances detection reliability in conditions where traditional RGB cameras operating in the visible spectrum fail. Based on the YOLO11 architecture and customized for thermal detection, our model, termed LTV-YOLO (Lightweight Thermal Vision YOLO), is optimized for computational efficiency, accuracy and real-time performance on edge devices. By integrating depthwise separable convolutions and a feature pyramid network (FPN), LTV-YOLO achieves strong performance in detecting small-scale, partially occluded, and thermally distinct VRUs while maintaining a compact architecture. This work contributes a practical and scalable solution to improve pedestrian safety in intelligent transportation systems, particularly in school zones, autonomous navigation, and smart city infrastructure. Unlike prior thermal detectors, our contribution is task-specific: a thermal-only, edge-capable design for young and small VRUs (children and distant adults). Although FPN and depthwise separable convolutions are standard components, their integration into a thermal-only pipeline optimized for short/occluded VRUs under adverse conditions is, to the best of our knowledge, novel.

[205] UAV-Based Infrastructure Inspections: A Literature Review and Proposed Framework for AEC+FM

Amir Farzin Nikkhah,Dong Chen,Bradford Campbell,Somayeh Asadi,Arsalan Heydarian

Main category: cs.CV

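TL;DR: This paper reviews the use of unmanned aerial vehicles (UAVs) for infrastructure inspection in architecture, engineering, construction, and facility management, covering data acquisition, defect identification, and decision support, and proposes a framework that combines multimodal data with transformer architectures to improve inspection accuracy.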

Details Motivation: To address the challenges UAV-based infrastructure inspection currently faces in real-time processing, multimodal data fusion, and model generalization. Method: Synthesizing more than 150 studies together with a case study, the paper proposes a UAV inspection framework that integrates RGB imagery, LiDAR, and thermal data and uses transformer architectures with dynamic path planning. Result: The proposed framework can improve the accuracy and reliability of detecting structural defects, thermal anomalies, and geometric inconsistencies, and supports adaptive flight and multi-source data fusion in complex environments. Conclusion: Future research should focus on lightweight AI models, adaptive flight planning, synthetic dataset construction, and deeper multimodal fusion to advance intelligent infrastructure inspection. Abstract: Unmanned Aerial Vehicles (UAVs) are transforming infrastructure inspections in the Architecture, Engineering, Construction, and Facility Management (AEC+FM) domain. By synthesizing insights from over 150 studies, this review paper highlights UAV-based methodologies for data acquisition, photogrammetric modeling, defect detection, and decision-making support. Key innovations include path optimization, thermal integration, and advanced machine learning (ML) models such as YOLO and Faster R-CNN for anomaly detection. UAVs have demonstrated value in structural health monitoring (SHM), disaster response, urban infrastructure management, energy efficiency evaluations, and cultural heritage preservation. Despite these advancements, challenges in real-time processing, multimodal data fusion, and generalizability remain. A proposed workflow framework, informed by literature and a case study, integrates RGB imagery, LiDAR, and thermal sensing with transformer-based architectures to improve accuracy and reliability in detecting structural defects, thermal anomalies, and geometric inconsistencies. The proposed framework ensures precise and actionable insights by fusing multimodal data and dynamically adapting path planning for complex environments, presented as a comprehensive step-by-step guide to address these challenges effectively. This paper concludes with future research directions emphasizing lightweight AI models, adaptive flight planning, synthetic datasets, and richer modality fusion to streamline modern infrastructure inspections.

[206] MATEX: Multi-scale Attention and Text-guided Explainability of Medical Vision-Language Models

Muhammad Imran,Chi Lee,Yugyung Lee

Main category: cs.CV

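TL;DR: MATEX is a new explainability framework for medical vision-language models that combines multi-layer attention rollout, text-guided spatial priors, and layer consistency analysis to produce precise, anatomically consistent attribution maps.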

Details Motivation: Existing methods fall short in spatial precision, anatomical grounding, and attention granularity, making it hard to provide reliable explanations for medical AI. Method: The MATEX framework fuses multi-scale attention with text-guided spatial priors and uses layer consistency analysis to improve attribution stability. Result: On the MS-CXR dataset, MATEX outperforms the existing M2IB method in spatial precision and agreement with expert annotations. Conclusion: MATEX produces more accurate, stable, and clinically meaningful explanations, helping to improve trust and transparency in radiology AI applications. Abstract: We introduce MATEX (Multi-scale Attention and Text-guided Explainability), a novel framework that advances interpretability in medical vision-language models by incorporating anatomically informed spatial reasoning. MATEX synergistically combines multi-layer attention rollout, text-guided spatial priors, and layer consistency analysis to produce precise, stable, and clinically meaningful gradient attribution maps. By addressing key limitations of prior methods, such as spatial imprecision, lack of anatomical grounding, and limited attention granularity, MATEX enables more faithful and interpretable model explanations. Evaluated on the MS-CXR dataset, MATEX outperforms the state-of-the-art M2IB approach in both spatial precision and alignment with expert-annotated findings. These results highlight MATEX's potential to enhance trust and transparency in radiological AI applications.
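
For readers unfamiliar with attention rollout, the following is a minimal NumPy sketch of the generic technique: average the heads of each layer, mix in the identity to account for residual connections, re-normalize the rows, and multiply the layers together. It is not MATEX's implementation; the equal 0.5/0.5 residual weighting and the input layout are assumptions, and the text-guided priors and layer consistency analysis described in the abstract are not modeled here.

```python
import numpy as np


def attention_rollout(attentions):
    """Generic attention rollout.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer,
    ordered from the first layer to the last. Each row is assumed to be a
    softmax distribution over tokens.
    """
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)             # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)           # assumed residual weighting
        a = a / a.sum(axis=-1, keepdims=True)   # keep rows normalized
        rollout = a @ rollout                   # propagate through this layer
    return rollout
```

Row i of the result estimates how much each input token contributes to token i at the last layer; for a ViT-style encoder, the row of the class token is typically reshaped into a spatial map.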

[207] Generating metamers of human scene understanding

Ritik Raina,Abe Leite,Alexandros Graikos,Seoyoung Ahn,Dimitris Samaras,Gregory J. Zelinsky

Main category: cs.CV

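TL;DR: This paper presents MetamerGen, a latent diffusion model that combines peripheral "gist" information with high-resolution information from fixations to generate image metamers consistent with human scene understanding; it achieves foveated image generation through a dual-stream DINOv2 token representation and validates perceptual alignment with a behavioral experiment.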

Details Motivation: Inspired by how the human visual system combines low-resolution peripheral information with high-resolution central information to build scene understanding, the work aims to build an image generation model better aligned with human perception in order to probe the mechanisms of visual scene understanding. Method: MetamerGen is a latent-diffusion-based model with a dual-stream architecture that fuses DINOv2 features: one stream carries detailed features from fixated regions and the other carries degraded contextual features from the periphery, enabling foveated image-to-image synthesis. Result: A same-different behavioral experiment shows that images generated by MetamerGen are perceptually highly consistent with the original scenes; high-level semantic alignment predicts metamerism most strongly when generation is conditioned on the viewer's own fixated regions. Conclusion: MetamerGen is an effective tool for probing the representational basis of human scene understanding; it reveals the contribution of visual features at multiple levels to perceptual judgments and advances generative models centered on human perception. Abstract: Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high and low resolution (i.e. "foveated") inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen generated images to latent human scene representations, we conducted a same-different behavioral experiment where participants were asked for a "same" or "different" response between the generated and the original image. With that, we identify scene generations that are indeed metamers for the latent scene representations formed by the viewers. MetamerGen is a powerful tool for studying scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While it can generate metamers even conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers' own fixated regions.

[208] Conformal Point and the Calibrated Conic

Richard Hartley

Main category: cs.CV

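TL;DR: This paper introduces the conformal point and the calibrating conic and the relationship between them; these concepts help build an intuitive understanding of image geometry and can be used to compute geometric quantities such as angles and directions in an image.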

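Details Motivation: To visualize image geometry more effectively, the conformal point and the calibrating conic are introduced as intuitive tools for geometric understanding. Method: The paper analyzes the mathematical relationship between the conformal point and the calibrating conic and uses their properties to explain and compute geometric features of an image, such as angles and directions. Result: It establishes the connection between the conformal point and the calibrating conic, providing an effective approach to understanding and computing image geometry intuitively. Conclusion: The conformal point and the calibrating conic are powerful tools for understanding image geometry and simplify the computation of quantities such as angles and directions. Abstract: This gives some information about the conformal point and the calibrating conic, and their relationship one to the other. These concepts are useful for visualizing image geometry, and lead to intuitive ways to compute geometry, such as angles and directions in an image.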

[209] Telling Human and Machine Handwriting Apart

Luis A. Leiva,Moises Diaz,Nuwan T. Attygalle,Miguel A. Ferrer,Rejean Plamondon

Main category: cs.CV

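TL;DR: This study uses handwriting movements as a behavioral biometric and, with a shallow recurrent neural network, verifies human handwriting with high accuracy across multiple synthesizers and datasets (98.3% average AUC); the classifier also performs well in few-shot and out-of-domain settings, which matters for strengthening the security of human-presence verification systems.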

Details Motivation: To improve verification that a real user is operating a device or application, the study uses handwriting movement, a distinctive behavioral biometric, to distinguish human-generated input from artificially forged input, a security mechanism akin to a reverse Turing test. Method: A shallow recurrent neural network operates directly on trajectory data without feature extraction; it is trained and evaluated on ten public datasets of handwritten symbols and on forged data from seven different synthesizers (including the Sigma model, GANs, Transformers, and diffusion models), and is tested in few-shot and out-of-domain settings. Result: The model reaches an average AUC of 98.3% and an equal error rate of 1.4% across all synthesizers and datasets; it retains this performance when trained on only 10% of the data and generalizes well in out-of-domain settings. Conclusion: The method effectively identifies human-produced handwriting input and can provide a reliable and secure solution for systems that need to verify human presence, helping to defend against automated attacks. Abstract: Handwriting movements can be leveraged as a unique form of behavioral biometrics, to verify whether a real user is operating a device or application. This task can be framed as a reverse Turing test in which a computer has to detect if an input instance has been generated by a human or artificially. To tackle this task, we study ten public datasets of handwritten symbols (isolated characters, digits, gestures, pointing traces, and signatures) that are artificially reproduced using seven different synthesizers, including, among others, the Kinematic Theory (Sigma h model), generative adversarial networks, Transformers, and Diffusion models. We train a shallow recurrent neural network that achieves excellent performance (98.3% Area Under the ROC Curve (AUC) score and 1.4% equal error rate on average across all synthesizers and datasets) using nonfeaturized trajectory data as input. In few-shot settings, we show that our classifier achieves similarly excellent performance when trained on just 10% of the data, as evaluated on the remaining 90% of the data as a test set. We further challenge our classifier in out-of-domain settings, and observe very competitive results as well. Our work has implications for computerized systems that need to verify human presence, and adds an additional layer of security to keep attackers at bay.
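
The two metrics quoted above, AUC and equal error rate (EER), can be computed from classifier scores as in the sketch below. The scores and labels are made-up toy values, not data from the paper, and the helper assumes there are no tied scores; label 1 stands for "human" and 0 for "synthetic" purely by convention here.

```python
import numpy as np


def roc_points(scores, labels):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over scores."""
    order = np.argsort(-scores, kind="stable")   # highest score first
    sorted_labels = labels[order]
    tps = np.cumsum(sorted_labels == 1)
    fps = np.cumsum(sorted_labels == 0)
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr


# Toy example (hypothetical values).
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])

fpr, tpr = roc_points(scores, labels)
# Area under the ROC curve via the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
# EER: operating point where false positive and false negative rates meet.
fnr = 1.0 - tpr
idx = int(np.argmin(np.abs(fpr - fnr)))
eer = float((fpr[idx] + fnr[idx]) / 2.0)
print(auc, eer)
```

On real data with a coarse threshold grid the two error rates rarely coincide exactly, which is why the sketch averages them at the closest point.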

[210] SemAlign: Language Guided Semi-supervised Domain Generalization

Muditha Fernando,Kajhanan Kailainathan,Krishnakanth Nagaratnam,Isuranga Udaravi Bandara Senavirathne,Ranga Rodrigo

Main category: cs.CV

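TL;DR: The paper proposes a new semi-supervised domain generalization method that aligns the model's intermediate features with the semantically rich, generalized feature space of a vision-language model, improving data utilization and reducing overfitting, and achieves SOTA performance on four benchmarks.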

Details Motivation: Existing SSDG methods focus too heavily on pseudo-label accuracy and overlook the importance of maximizing data utilization during training, which limits performance gains. Method: The model's intermediate features are aligned with the generalized feature space of a vision-language model (VLM) to promote domain invariance, combined with effective image-level augmentation and output-level regularization strategies. Result: Extensive experiments on four benchmarks show that the method outperforms existing SSDG baselines both qualitatively and quantitatively, achieving state-of-the-art results. Conclusion: By introducing VLM alignment and strategies that improve data utilization, the method addresses pseudo-label accuracy and model overfitting in SSDG and markedly improves generalization. Abstract: Semi-supervised Domain Generalization (SSDG) addresses the challenge of generalizing to unseen target domains with limited labeled data. Existing SSDG methods highlight the importance of achieving high pseudo-labeling (PL) accuracy and preventing model overfitting as the main challenges in SSDG. In this light, we show that the SSDG literature's excessive focus on PL accuracy, without consideration for maximum data utilization during training, limits potential performance improvements. We propose a novel approach to the SSDG problem by aligning the intermediate features of our model with the semantically rich and generalized feature space of a Vision Language Model (VLM) in a way that promotes domain-invariance. The above approach is enhanced with effective image-level augmentation and output-level regularization strategies to improve data utilization and minimize overfitting. Extensive experimentation across four benchmarks against existing SSDG baselines suggests that our method achieves SOTA results both qualitatively and quantitatively. The code will be made publicly available.

[211] SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models

Turhan Can Kargin,Wojciech Jasiński,Adam Pardyl,Bartosz Zieliński,Marcin Przewięźlikowski

Main category: cs.CV

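TL;DR: This paper presents SpaRRTa, a spatial relation recognition benchmark that evaluates the spatial reasoning of visual foundation models (VFMs) in recognizing the relative positions of objects; it reveals performance differences among existing models on such tasks and examines the mechanisms that affect their spatial awareness.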

Details Motivation: Existing visual foundation models such as DINO and CLIP excel at semantic understanding but have limited spatial reasoning ability, and their performance remains inconsistent even after training on 3D tasks, so it is unclear whether they truly possess spatial awareness. A new benchmark is therefore needed to evaluate their spatial understanding systematically. Method: The SpaRRTa benchmark generates photorealistic images with diverse scenes and controllable object arrangements, along with accessible spatial annotations; the task targets recognition of relative positions between objects, unlike traditional precise metric prediction tasks (such as depth estimation), and aims to probe a more fundamental, human-like spatial understanding. Result: Evaluation of a range of state-of-the-art VFMs shows significant disparities in spatial reasoning on SpaRRTa, indicating that the spatial awareness of current models is uneven and that some may merely be overfitting to specific 3D tasks. Conclusion: SpaRRTa can effectively evaluate the basic spatial understanding of VFMs and offers a useful tool and insights for developing future visual models with stronger spatial awareness. Abstract: Visual Foundation Models (VFMs), such as DINO and CLIP, excel in semantic understanding of images but exhibit limited spatial reasoning capabilities, which limits their applicability to embodied systems. As a result, recent work incorporates some 3D tasks (such as depth estimation) into VFM training. However, VFM performance remains inconsistent across other spatial tasks, raising the question of whether these models truly have spatial awareness or overfit to specific 3D objectives. To address this question, we introduce the Spatial Relation Recognition Task (SpaRRTa) benchmark, which evaluates the ability of VFMs to identify relative positions of objects in the image. Unlike traditional 3D objectives that focus on precise metric prediction (e.g., surface normal estimation), SpaRRTa probes a fundamental capability underpinning more advanced forms of human-like spatial understanding. SpaRRTa generates an arbitrary number of photorealistic images with diverse scenes and fully controllable object arrangements, along with freely accessible spatial annotations. Evaluating a range of state-of-the-art VFMs, we reveal significant disparities between their spatial reasoning abilities. Through our analysis, we provide insights into the mechanisms that support or hinder spatial awareness in modern VFMs. We hope that SpaRRTa will serve as a useful tool for guiding the development of future spatially aware visual models.

[212] From Pixels to Purchase: Building and Evaluating a Taxonomy-Decoupled Visual Search Engine for Home Goods E-commerce

Cheng Lyu,Jingyue Zhang,Ryan Maunu,Mengwei Li,Vinny DeGenova,Yuanli Pei

Main category: cs.CV

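TL;DR: The paper proposes a taxonomy-decoupled visual search architecture that combines classification-free region proposals with unified embeddings and uses a large language model as a zero-shot judge for evaluation, significantly improving retrieval quality and user engagement on a home goods e-commerce platform.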

Details Motivation: Existing e-commerce visual search systems built on detection and taxonomies depend on noisy catalog data and struggle to evaluate subjective, open-ended user intent, which limits robustness and scalability. Method: The system uses classification-free region proposal generation and unified embeddings for similarity retrieval, forming a taxonomy-decoupled architecture; an LLM-as-a-Judge framework evaluates the visual similarity and category relevance of query-result pairs in a zero-shot manner. Result: Deployed on a large-scale global home goods platform, the system significantly improves retrieval quality and user engagement, and the offline evaluation metrics correlate strongly with real-world outcomes. Conclusion: The method removes the dependence on annotated data and taxonomies, is more flexible and generalizable, and offers an efficient, scalable new paradigm for visual search. Abstract: Visual search is critical for e-commerce, especially in style-driven domains where user intent is subjective and open-ended. Existing industrial systems typically couple object detection with taxonomy-based classification and rely on catalog data for evaluation, which is prone to noise that limits robustness and scalability. We propose a taxonomy-decoupled architecture that uses classification-free region proposals and unified embeddings for similarity retrieval, enabling a more flexible and generalizable visual search. To overcome the evaluation bottleneck, we propose an LLM-as-a-Judge framework that assesses nuanced visual similarity and category relevance for query-result pairs in a zero-shot manner, removing dependence on human annotations or noise-prone catalog data. Deployed at scale on a global home goods platform, our system improves retrieval quality and yields a measurable uplift in customer engagement, while our offline evaluation metrics strongly correlate with real-world outcomes.

[213] studentSplat: Your Student Model Learns Single-view 3D Gaussian Splatting

Yimu Pan,Hongda Mao,Qingshuang Chen,Yelin Kim

Main category: cs.CV

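TL;DR: This paper proposes studentSplat, a single-view 3D Gaussian splatting method for scene reconstruction that uses a teacher-student architecture and an extrapolation network to address scale ambiguity and extrapolation in the single-view setting.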

Details Motivation: Single-view 3D scene reconstruction remains under-explored because of the inherent ambiguity of a single view, even though existing feed-forward 3D Gaussian splatting methods perform very well in multi-view reconstruction. Method: Two techniques are introduced: 1) a teacher-student architecture in which a multi-view teacher model provides geometric supervision to the single-view student; and 2) an extrapolation network that completes missing scene context. Result: Experiments show that studentSplat achieves state-of-the-art single-view novel-view reconstruction quality, performs comparably to multi-view methods at the scene level, and is competitive on self-supervised single-view depth estimation. Conclusion: studentSplat addresses the key challenges of single-view 3D scene reconstruction and shows potential for general single-view 3D understanding tasks. Abstract: Recent advances in feed-forward 3D Gaussian splatting have enabled remarkable multi-view 3D scene reconstruction and single-view 3D object reconstruction, but single-view 3D scene reconstruction remains under-explored due to the inherent ambiguity of a single view. We present studentSplat, a single-view 3D Gaussian splatting method for scene reconstruction. To overcome the scale ambiguity and extrapolation problems inherent in novel-view supervision from a single input, we introduce two techniques: 1) a teacher-student architecture where a multi-view teacher model provides geometric supervision to the single-view student during training, addressing scale ambiguity and encouraging geometric validity; and 2) an extrapolation network that completes missing scene context, enabling high-quality extrapolation. Extensive experiments show studentSplat achieves state-of-the-art single-view novel-view reconstruction quality and comparable performance to multi-view methods at the scene level. Furthermore, studentSplat demonstrates competitive performance as a self-supervised single-view depth estimation method, highlighting its potential for general single-view 3D understanding tasks.

[214] Cross-Domain Object Detection Using Unsupervised Image Translation

Vinicius F. Arruda,Rodrigo F. Berriel,Thiago M. Paixão,Claudine Badue,Alberto F. De Souza,Nicu Sebe,Thiago Oliveira-Santos

Main category: cs.CV

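TL;DR: The paper proposes generating an artificial target-domain dataset via unsupervised image translation to improve unsupervised domain adaptation for object detection, outperforming existing methods in autonomous driving scenarios.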

Details Motivation: Existing feature-alignment methods are laborious to implement and hard to interpret, and a performance gap to the fully supervised upper bound remains. Method: CycleGAN and an AdaIN-based model are used to generate an artificial dataset in the target domain from annotated source-domain data and non-annotated target-domain data, and the detector is trained on it. Result: Performance improves significantly in real-world autonomous driving scenarios, exceeding the current state of the art in most cases. Conclusion: The proposed method is simpler, more effective, and more interpretable, and it narrows the gap to the upper bound. Abstract: Unsupervised domain adaptation for object detection addresses the adaptation of detectors trained in a source domain to work accurately in an unseen target domain. Recently, methods that align intermediate features have proven promising, achieving state-of-the-art results. However, these methods are laborious to implement and hard to interpret. Although promising, there is still room for improvement to close the performance gap toward the upper bound (when training with the target data). In this work, we propose a method to generate an artificial dataset in the target domain to train an object detector. We employed two unsupervised image translators (CycleGAN and an AdaIN-based model) using only annotated data from the source domain and non-annotated data from the target domain. Our key contributions are the proposal of a less complex yet more effective method that is also more interpretable. Results on real-world scenarios for autonomous driving show significant improvements, outperforming state-of-the-art methods in most cases, further closing the gap toward the upper bound.

[215] Digital FAST: An AI-Driven Multimodal Framework for Rapid and Early Stroke Screening

Ngoc-Khai Hoang,Thi-Nhu-Mai Nguyen,Huy-Hieu Pham

Main category: cs.CV

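TL;DR: The paper proposes a fast, non-invasive multimodal deep learning framework for binary stroke screening that fuses facial, speech, and upper-body movement information; on a self-collected dataset it reaches 95.83% accuracy and a 96.00% F1-score and detects all stroke cases in the test set.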

Details Motivation: Early identification of stroke symptoms is essential for timely intervention and better patient outcomes, especially in prehospital settings, which calls for a fast, non-invasive automatic screening method. Method: The approach combines the facial expressions, speech signals, and upper-body movements captured during the F.A.S.T. assessment, using a Transformer for facial landmark features, an Audio Spectrogram Transformer for mel spectrograms, and an MLP-Mixer for pose sequences, with attention-based multimodal fusion. Result: On a self-collected dataset of 222 videos from 37 subjects, the model reaches 95.83% accuracy and a 96.00% F1-score, balances sensitivity and specificity, detects all stroke cases in the test set, and outperforms unimodal baselines. Conclusion: Multimodal deep learning shows potential for early stroke screening; larger, clinically representative datasets are needed to support real-world use. Abstract: Early identification of stroke symptoms is essential for enabling timely intervention and improving patient outcomes, particularly in prehospital settings. This study presents a fast, non-invasive multimodal deep learning framework for automatic binary stroke screening based on data collected during the F.A.S.T. assessment. The proposed approach integrates complementary information from facial expressions, speech signals, and upper-body movements to enhance diagnostic robustness. Facial dynamics are represented using landmark-based features and modeled with a Transformer architecture to capture temporal dependencies. Speech signals are converted into mel spectrograms and processed using an Audio Spectrogram Transformer, while upper-body pose sequences are analyzed with an MLP-Mixer network to model spatiotemporal motion patterns. The extracted modality-specific representations are combined through an attention-based fusion mechanism to effectively learn cross-modal interactions. Experiments conducted on a self-collected dataset of 222 videos from 37 subjects demonstrate that the proposed multimodal model consistently outperforms unimodal baselines, achieving 95.83% accuracy and a 96.00% F1-score. The model attains a strong balance between sensitivity and specificity and successfully detects all stroke cases in the test set. These results highlight the potential of multimodal learning and transfer learning for early stroke screening, while emphasizing the need for larger, clinically representative datasets to support reliable real-world deployment.

[216] RemoteVAR: Autoregressive Visual Modeling for Remote Sensing Change Detection

Yilmaz Korkmaz,Vishal M. Patel

Main category: cs.CV

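TL;DR: This paper presents RemoteVAR, a remote sensing change detection framework based on visual autoregressive models (VARs) that improves pixel-level discriminative performance through multi-resolution bi-temporal feature fusion and cross-attention, and significantly outperforms diffusion-based and transformer-based baselines on standard datasets.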

Details Motivation: Visual autoregressive models (VARs) perform well in image generation but are limited in pixel-level discriminative tasks by weak controllability, suboptimal dense prediction performance, and exposure bias. Remote sensing change detection requires precise pixel-level localization of changes, so the use of VARs for such tasks needs improvement. Method: The RemoteVAR framework conditions autoregressive prediction on multi-resolution fused bi-temporal features via cross-attention and uses an autoregressive training strategy designed specifically for change map prediction, strengthening the model's ability to perceive and generate changed regions. Result: Experiments on several standard change detection benchmarks show that RemoteVAR consistently and significantly outperforms strong diffusion-based and transformer-based baselines, achieving SOTA performance on datasets such as HRCUS and LEVIR-CD. Conclusion: RemoteVAR extends visual autoregressive models to remote sensing change detection and demonstrates their potential as an effective solution for discriminative dense prediction, opening a new direction for future research. Abstract: Remote sensing change detection aims to localize and characterize scene changes between two time points and is central to applications such as environmental monitoring and disaster assessment. Meanwhile, visual autoregressive models (VARs) have recently shown impressive image generation capability, but their adoption for pixel-level discriminative tasks remains limited due to weak controllability, suboptimal dense prediction performance and exposure bias. We introduce RemoteVAR, a new VAR-based change detection framework that addresses these limitations by conditioning autoregressive prediction on multi-resolution fused bi-temporal features via cross-attention, and by employing an autoregressive training strategy designed specifically for change map prediction. Extensive experiments on standard change detection benchmarks show that RemoteVAR delivers consistent and significant improvements over strong diffusion-based and transformer-based baselines, establishing a competitive autoregressive alternative for remote sensing change detection. Code will be available at https://github.com/yilmazkorkmaz1/RemoteVAR.

[217] Towards Airborne Object Detection: A Deep Learning Analysis

Prosenjit Chatterjee,ANK Zaman

Main category: cs.CV

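TL;DR: This paper proposes a dual-task model based on EfficientNetB4 that performs airborne object classification and threat-level prediction simultaneously, builds the AODTA dataset to address the shortage of training data, and outperforms ResNet-50 on multiple datasets.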

Details Motivation: Existing airborne threat assessment relies on manual monitoring, which scales poorly and is inefficient; an automated, real-time solution is needed. Method: A dual-task model is built on EfficientNetB4; the AODTA dataset is constructed by aggregating and cleaning public data; classification and threat prediction operate on pre-localized images. Result: On the AVD and AODTA datasets, the model reaches 96% accuracy in object classification and 90% accuracy in threat-level prediction, both better than the ResNet-50 baseline. Conclusion: The dual-task model performs well in airborne object recognition and threat assessment and has potential for applications in surveillance, defense, and airspace management. Abstract: The rapid proliferation of airborne platforms, including commercial aircraft, drones, and UAVs, has intensified the need for real-time, automated threat assessment systems. Current approaches depend heavily on manual monitoring, resulting in limited scalability and operational inefficiencies. This work introduces a dual-task model based on EfficientNetB4 capable of performing airborne object classification and threat-level prediction simultaneously. To address the scarcity of clean, balanced training data, we constructed the AODTA Dataset by aggregating and refining multiple public sources. We benchmarked our approach on both the AVD Dataset and the newly developed AODTA Dataset and further compared performance against a ResNet-50 baseline, which consistently underperformed EfficientNetB4. Our EfficientNetB4 model achieved 96% accuracy in object classification and 90% accuracy in threat-level prediction, underscoring its promise for applications in surveillance, defense, and airspace management. Although the title references detection, this study focuses specifically on classification and threat-level inference using pre-localized airborne object images provided by existing datasets.

[218] Effects of the retina-inspired light intensity encoding on color discrimination performance

Io Yamada,Hirotsugu Okuno

Main category: cs.CV

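TL;DR: This study examines how different light intensity encoding functions affect the color constancy (CC) performance of the center/surround (C/S) retinex model and finds that the Naka-Rushton function combined with a double opponent color representation performs better in color discrimination.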

Details Motivation: Color constancy is essential for vision systems that rely on color information, but illumination color strongly affects color perception, so models need to be improved to recognize colors better under varying lighting. Method: Using the C/S retinex model, the study compares the logarithmic function and the Naka-Rushton function for encoding light intensity, illuminates targets under various lighting colors with color-variable LEDs, and evaluates how well color information can be discriminated in HSV and opponent color spaces. Result: The Naka-Rushton function combined with the double opponent color representation discriminates target colors under different illumination better than the conventional logarithmic approach. Conclusion: The Naka-Rushton function more closely matches the response of the biological retina and, combined with a suitable color representation, can improve color constancy models, which is useful for artificial vision systems. Abstract: Color is an important source of information for visual functions such as object recognition, but it is greatly affected by the color of illumination. The ability to perceive the color of a visual target independent of illumination color is called color constancy (CC), and is an important feature for vision systems that use color information. In this study, we investigated the effects of the light intensity encoding function on the performance of CC of the center/surround (C/S) retinex model, which is a well-known model inspired by CC of the visual nervous system. The functions used to encode light intensity are the logarithmic function used in the original C/S retinex model and the Naka-Rushton (N-R) function, which is a model of retinal photoreceptor response. Color-variable LEDs were used to illuminate visual targets with various lighting colors, and color information computed by each model was used to evaluate the degree to which the color of visual targets illuminated with different lighting colors could be discriminated. Color information was represented using the HSV color space and a color plane based on the classical opponent color theory. The results showed that the combination of the N-R function and the double opponent color plane representation provided superior discrimination performance.
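
To make the two encodings concrete, here is a small sketch of the logarithmic and Naka-Rushton intensity encodings in their standard textbook forms. The semi-saturation constant `sigma`, the exponent `n`, and the offset `eps` are assumed values for illustration; the abstract does not state the parameters used in the study.

```python
import numpy as np


def log_encode(intensity, eps=1e-6):
    """Logarithmic encoding; eps (an assumption) avoids log(0)."""
    return np.log(intensity + eps)


def naka_rushton(intensity, sigma=0.5, n=1.0):
    """Naka-Rushton response R = I^n / (I^n + sigma^n).

    sigma is the intensity giving half of the maximum response;
    both parameters here are hypothetical.
    """
    i_n = np.power(intensity, n)
    return i_n / (i_n + sigma ** n)


intensities = np.linspace(0.0, 1.0, 11)
print(naka_rushton(intensities))
print(log_encode(intensities))
```

Unlike the logarithm, the Naka-Rushton response is bounded between 0 and 1 and saturates at high intensities, which is the qualitative difference between the two encoders.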

[219] A Training-Free Guess What Vision Language Model from Snippets to Open-Vocabulary Object Detection

Guiying Zhu,Bowen Yang,Yin Zhuang,Tong Zhang,Guanqun Wang,Zhihao Che,He Chen,Lianlin Li

Main category: cs.CV

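TL;DR: This paper proposes GW-VLM, a training-free open-vocabulary object detection method that couples Multi-Scale Visual Language Searching (MS-VLS) with Contextual Concept Prompt (CCP) to combine a vision-language model and a large language model, achieving better detection performance than existing methods without any training.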

Details Motivation: Existing open-vocabulary object detection methods built on pretrained foundation models tend to overlook the importance of establishing a universal understanding paradigm; this paper aims to propose a new method that achieves general object cognition without training. Method: The GW-VLM framework combines MS-VLS and CCP, using a pretrained vision-language model to generate snippets from class-agnostic detections and a large language model to interpret those snippets for open-vocabulary detection. Result: Experiments on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, show that GW-VLM outperforms state-of-the-art methods without any training. Conclusion: GW-VLM establishes a training-free universal understanding paradigm, offers a new approach to open-vocabulary object detection, and is validated on multiple datasets. Abstract: Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although myriads of large-scale pre-training efforts have built versatile foundation models that exhibit impressive zero-shot capabilities to facilitate OVOD, the necessity of creating a universal understanding for any object cognition according to already pretrained foundation models is usually overlooked. Therefore, in this paper, a training-free Guess What Vision Language Model, called GW-VLM, is proposed to form a universal understanding paradigm based on our carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with Contextual Concept Prompt (CCP) for OVOD. This approach can engage a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in the game of "guess what". Here, MS-VLS leverages multi-scale visual-language soft-alignment for VLM to generate snippets from the results of class-agnostic object detection, while CCP can form the concept of flow referring to MS-VLS and then make LLM understand snippets for OVOD. Finally, extensive experiments are carried out on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, and the results indicate that our proposed GW-VLM can achieve superior OVOD performance compared to state-of-the-art methods without any training step.

[220] Reliable Deep Learning for Small-Scale Classifications: Experiments on Real-World Image Datasets from Bangladesh

Muhammad Ibrahim,Alfe Suny,MD Sakib Ul Islam,Md. Imran Hossain

Main category: cs.CV

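TL;DR: This study evaluates a compact convolutional neural network on five real-world image datasets from Bangladesh; the results show that the model achieves high accuracy, fast convergence, and good generalization on small-scale image classification tasks.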

Details Motivation: Complex CNNs tend to overfit on small datasets, so the study explores how well lightweight CNNs work in real-world scenarios. Method: A compact CNN architecture is trained and evaluated on five publicly available real-world image datasets, with quantitative metrics and saliency analysis used to assess performance. Result: The model shows high classification accuracy, efficient convergence, and low computational overhead across tasks, and saliency maps indicate that it captures discriminative features. Conclusion: Streamlined CNN architectures generalize well and show promise for small-sample image classification, making them suitable for resource-constrained real-world settings. Abstract: Convolutional neural networks (CNNs) have achieved state-of-the-art performance in image recognition tasks but often involve complex architectures that may overfit on small datasets. In this study, we evaluate a compact CNN across five publicly available, real-world image datasets from Bangladesh, including urban encroachment, vehicle detection, road damage, and agricultural crops. The network demonstrates high classification accuracy, efficient convergence, and low computational overhead. Quantitative metrics and saliency analyses indicate that the model effectively captures discriminative features and generalizes robustly across diverse scenarios, highlighting the suitability of streamlined CNN architectures for small-class image classification tasks.

[221] From Spurious to Causal: Low-rank Orthogonal Subspace Intervention for Generalizable Face Forgery Detection

Chi Wang,Xinjue Hu,Boyu Wang,Ziwen He,Zhangjie Fu

Main category: cs.CV

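TL;DR: This paper proposes a new intervention paradigm that removes spurious correlation factors from the representation space through orthogonal low-rank projection, improving the generalization of face forgery detection.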

Details Motivation: A "backdoor path" from forgery-irrelevant information to labels causes models to learn biased representations, so existing methods generalize poorly in cross-dataset settings. Method: The various spurious correlation factors are modeled jointly as a low-rank subspace, which is decomposed via orthogonal low-rank projection and removed; the orthogonal complement is trained to capture genuine forgery features. Result: The method achieves state-of-the-art performance on several benchmarks with only 0.43M trainable parameters and shows excellent robustness and generalization. Conclusion: The proposed low-rank projection intervention removes spurious correlations from the representation space so that classification decisions rely on authentic forgery cues, significantly improving generalization. Abstract: The generalization problem remains a critical challenge in face forgery detection. Some studies have found that a "backdoor path" in the representations from forgery-irrelevant information to labels induces biased learning, thereby hindering generalization. In this paper, this forgery-irrelevant information is collectively termed spurious correlation factors. Previous methods predominantly focused on identifying concrete, specific spurious correlations and designing corresponding solutions to address them. However, spurious correlations arise from unobservable confounding factors, making it impractical to identify and address each one individually. To address this, we propose an intervention paradigm for representation space. Instead of tracking and blocking various instance-level spurious correlations one by one, we uniformly model them as a low-rank subspace and intervene in them. Specifically, we decompose spurious correlation features into a low-rank subspace via orthogonal low-rank projection, subsequently removing this subspace from the original representation and training its orthogonal complement to capture forgery-related features. This low-rank projection removal effectively eliminates spurious correlation factors, ensuring that classification decisions are based on authentic forgery cues. With only 0.43M trainable parameters, our method achieves state-of-the-art performance across several benchmarks, demonstrating excellent robustness and generalization.
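
The core linear-algebra step, removing a low-rank subspace from a representation and keeping its orthogonal complement, can be sketched as follows. This is only an illustration of orthogonal projection removal: in the paper the subspace is learned, whereas here `basis` is an arbitrary (hypothetical) matrix, and the feature dimension and rank are assumptions.

```python
import numpy as np


def remove_subspace(features, basis):
    """Project features onto the orthogonal complement of span(basis).

    features: (n, d) array of representations.
    basis:    (d, r) array with r << d whose columns span the subspace
              to remove (assumed to have full column rank).
    """
    q, _ = np.linalg.qr(basis)               # orthonormal basis, shape (d, r)
    return features - (features @ q) @ q.T   # x - Q Q^T x for every row


# Hypothetical sizes: 8 samples, 16-dim features, rank-3 subspace.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
basis = rng.normal(size=(16, 3))
cleaned = remove_subspace(feats, basis)
print(np.abs(cleaned @ basis).max())  # close to zero
```

After the projection, the cleaned features have no component along the removed directions, so a classifier trained on them cannot rely on that subspace.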

[222] Effects of Gabor Filters on Classification Performance of CNNs Trained on a Limited Number of Conditions

Akito Morita,Hirotsugu Okuno

Main category: cs.CV

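TL;DR: The paper proposes using Gabor filters as a preprocessor to improve the accuracy and generalization of convolutional neural networks (CNNs) trained on small datasets for edge devices, and validates the technique in robot vision applications.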

Details Motivation: Edge devices require small architectures and efficient training, and the goal is to improve CNN performance in robot vision applications that recognize objects from data acquired under limited conditions. Method: A Gabor filter, which models the feature extractor of the visual nervous system, is used as a preprocessing step for CNNs; small-data training results are compared across CNN architectures with and without Gabor filters, and generalization is evaluated on a dataset of images taken from different camera positions. Result: Preprocessing with Gabor filters improves the generalization performance of CNNs and helps reduce network size. Conclusion: Gabor filters as preprocessing can improve lightweight CNNs under small-sample, variable conditions and suit resource-constrained robot vision systems. Abstract: In this study, we propose a technique to improve the accuracy and reduce the size of convolutional neural networks (CNNs) running on edge devices for real-world robot vision applications. CNNs running on edge devices must have a small architecture, and CNNs for robot vision applications involving on-site object recognition must be able to be trained efficiently to identify specific visual targets from data obtained under a limited variation of conditions. The visual nervous system (VNS) is a good example that meets the above requirements because it learns from few visual experiences. Therefore, we used a Gabor filter, a model of the feature extractor of the VNS, as a preprocessor for CNNs to investigate the accuracy of the CNNs trained with small amounts of data. To evaluate how well CNNs trained on image data acquired under a limited variation of conditions generalize to data acquired under other conditions, we created an image dataset consisting of images acquired from different camera positions, and investigated the accuracy of the CNNs that were trained using images acquired at a certain distance. The results were compared after training on multiple CNN architectures with and without Gabor filters as preprocessing. The results showed that preprocessing with Gabor filters improves the generalization performance of CNNs and contributes to reducing the size of CNNs.
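
A Gabor preprocessor of the kind described above is typically a small bank of oriented kernels convolved with the input image before it reaches the CNN. The sketch below builds such a bank with NumPy using the standard real-valued Gabor formula; the kernel size, wavelength, and the choice of four orientations are assumptions, not the settings used in the study.

```python
import numpy as np


def gabor_kernel(size=9, sigma=2.0, theta=0.0, wavelength=4.0, gamma=0.5, psi=0.0):
    """Real-valued Gabor kernel: Gaussian envelope times a cosine carrier.

    size should be odd so the kernel has a well-defined center pixel.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates so the carrier oscillates along direction theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_r / wavelength + psi)
    return envelope * carrier


# Hypothetical bank with four orientations: 0, 45, 90, and 135 degrees.
bank = [gabor_kernel(theta=t) for t in np.arange(4) * np.pi / 4]
print(len(bank), bank[0].shape)
```

Each kernel responds most strongly to edges at its own orientation, so the filtered outputs can be stacked as input channels for a smaller downstream network.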

[223] SupScene: Learning Overlap-Aware Global Descriptor for Unconstrained SfM

Xulei Shi,Maoyu Wang,Yuning Peng,Guanbo Wang,Xin Wang,Qi Chen,Pengjie Tao

Main category: cs.CV

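TL;DR: This paper proposes SupScene, a new method that improves image retrieval in unconstrained SfM by learning global descriptors better suited to geometric matching, using a subgraph-based training strategy and the DiVLAD aggregator.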

Details Motivation: Existing deep-learning image retrieval methods mostly target semantic similarity and struggle to capture geometric matchability, the key requirement here, which limits image-matching efficiency in SfM. Method: SupScene uses a subgraph-based training strategy that exploits weighted geometric overlap relationships with a soft supervised contrastive loss, and a DiVLAD aggregator that combines ViT multi-head attention maps with a learnable gating mechanism to fuse salient semantic cues with visual features. Result: On the GL3D dataset, the method significantly outperforms baselines such as NetVLAD and reaches SOTA performance with very few additional parameters; the training strategy also yields consistent gains across aggregation techniques. Conclusion: SupScene effectively learns global image descriptors geared toward geometric matching, improving the accuracy and practicality of image retrieval in SfM, with good extensibility and application potential. Abstract: Image retrieval is a critical step for alleviating the quadratic complexity of image matching in unconstrained Structure-from-Motion (SfM). However, in this context, image retrieval typically focuses more on image pairs with geometric matchability than on those with semantic similarity, a nuance that most existing deep learning-based methods guided by batched binaries (overlapping vs. non-overlapping pairs) fail to capture. In this paper, we introduce SupScene, a novel solution that learns global descriptors tailored for finding overlapping image pairs of similar geometric nature for SfM. First, to better underline co-visible regions, we employ a subgraph-based training strategy that moves beyond equally important isolated pairs, leveraging ground-truth geometric overlapping relationships with various weights to provide fine-grained supervision via a soft supervised contrastive loss. Second, we introduce DiVLAD, a DINO-inspired VLAD aggregator that leverages the inherent multi-head attention maps from the last block of ViT. A learnable gating mechanism is then designed to adaptively utilize these semantically salient cues with visual features, enabling a more discriminative global descriptor. Extensive experiments on the GL3D dataset demonstrate that our method achieves state-of-the-art performance, significantly outperforming NetVLAD while introducing a negligible number of additional trainable parameters. Furthermore, we show that the proposed training strategy brings consistent gains across different aggregation techniques. Code and models are available at https://anonymous.4open.science/r/SupScene-5B73.

[224] Language-Guided and Motion-Aware Gait Representation for Generalizable Recognition

Zhengxian Wu,Chuanrui Zhang,Shenao Jiang,Hangrui Xu,Zirui Liao,Luyuan Zhang,Huaqiu Li,Peng Jiao,Haoqian Wang

Main category: cs.CV

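TL;DR: This paper proposes LMGait, a language-guided and motion-aware gait recognition framework, to address existing methods' susceptibility to static noise and their difficulty capturing dynamic motion regions.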

Details Motivation: Existing gait recognition methods rely on complex architectures that extract features directly from images and pool them, which tends to overfit to static noise (e.g., clothing) and fails to capture dynamic motion regions effectively. Method: The LMGait framework uses designed gait-related language cues to capture key motion features in gait sequences. Result: Modeling of dynamic motion regions improves, and interference from static noise such as clothing is reduced. Conclusion: Combining language guidance with motion awareness offers a new direction for gait recognition and helps improve model robustness and discriminability. Abstract: Gait recognition is emerging as a promising technology and an innovative field within computer vision. However, existing methods typically rely on complex architectures to directly extract features from images and apply pooling operations to obtain sequence-level representations. Such designs often lead to overfitting on static noise (e.g., clothing), while failing to effectively capture dynamic motion regions. To address the above challenges, we present a Language-guided and Motion-aware gait recognition framework, named LMGait. In particular, we utilize designed gait-related language cues to capture key motion features in gait sequences.

[225] Deep learning-based neurodevelopmental assessment in preterm infants

Lexin Ren,Jiamiao Lu,Weichuan Zhang,Benqing Wu,Tuo Wang,Yi Liao,Jiapan Guo,Changming Sun,Liang Guo

Main category: cs.CV

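TL;DR: Proposes the Hierarchical Dense Attention Network (HDAN), based on a 3D spatial-channel attention mechanism, to improve white-matter and gray-matter segmentation in brain MRI of preterm infants; experiments show it outperforms existing methods on low-contrast images and confirm that brain tissue volumes are smaller in preterm infants.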

Details Motivation: Preterm infants are at risk of neurodevelopmental delay, and accurate segmentation of white and gray matter in their brain MRI is essential for early assessment, but the tissues have similar signal intensities (isointense), so existing methods struggle to segment them precisely. Method: A new architecture, the Hierarchical Dense Attention Network (HDAN), combines a 3D spatial-channel attention mechanism with an attention-guided dense upsampling strategy to strengthen feature discrimination in low-contrast volumetric data. Result: Quantitative experiments show better segmentation than current state-of-the-art baselines and effective handling of isointense tissue; applying the algorithm shows that white- and gray-matter volumes in preterm infants are significantly lower than in term infants. Conclusion: HDAN performs well on preterm brain tissue segmentation, provides new imaging evidence of the neurodevelopmental delay associated with preterm birth, and has potential for clinical use. Abstract: Preterm infants (born between 28 and 37 weeks of gestation) face elevated risks of neurodevelopmental delays, making early identification crucial for timely intervention. While deep learning-based volumetric segmentation of brain MRI scans offers a promising avenue for assessing neonatal neurodevelopment, achieving accurate segmentation of white matter (WM) and gray matter (GM) in preterm infants remains challenging due to their comparable signal intensities (isointense appearance) on MRI during early brain development. To address this, we propose a novel segmentation neural network, named Hierarchical Dense Attention Network. Our architecture incorporates a 3D spatial-channel attention mechanism combined with an attention-guided dense upsampling strategy to enhance feature discrimination in low-contrast volumetric data. Quantitative experiments demonstrate that our method achieves superior segmentation performance compared to state-of-the-art baselines, effectively tackling the challenge of isointense tissue differentiation. Furthermore, application of our algorithm confirms that WM and GM volumes in preterm infants are significantly lower than those in term infants, providing additional imaging evidence of the neurodevelopmental delays associated with preterm birth. The code is available at: https://github.com/ICL-SUST/HDAN.

[226] Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal

Haonan An,Guang Hua,Wei Du,Hangcheng Cao,Yihang Tao,Guowen Xu,Susanto Rahardja,Yuguang Fang

Main category: cs.CV

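TL;DR: This paper proposes Decoder Gradient Shields (DGSs), a defense against gradient-leakage attacks on the decoder in box-free model watermarking; by reorienting and rescaling the watermark-channel gradients, it achieves a 100% defense success rate while preserving image quality.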

Details Motivation: Research on box-free model watermarking has focused on encoder robustness and overlooked the decoder's vulnerability to watermark-removal attacks trained on query responses and backpropagated gradients. Method: Three types of Decoder Gradient Shields (DGS-O, DGS-I, DGS-L) act on the decoder's output, input, and intermediate layers respectively; DGS-O has a closed-form solution and all DGSs have theoretical performance guarantees; the core idea is to jointly reorient and rescale the gradients from watermark-channel gradient-leaking queries. Result: On deraining and image generation tasks with state-of-the-art box-free watermarking, the method achieves a 100% defense success rate while preserving the image quality of the decoder output. Conclusion: DGSs effectively prevent the watermark remover from converging to a low loss and are an efficient, provable, and practical way to shield decoder gradients, significantly improving the security of box-free watermarking systems. Abstract: Box-free model watermarking has gained significant attention in deep neural network (DNN) intellectual property protection due to its model-agnostic nature and its ability to flexibly manage high-entropy image outputs from generative models. Typically operating in a black-box manner, it employs an encoder-decoder framework for watermark embedding and extraction. While existing research has focused primarily on making the encoders robust against various attacks, the decoders have been largely overlooked, leading to attacks against the watermark. In this paper, we identify one such attack against the decoder, where query responses are utilized to obtain backpropagated gradients to train a watermark remover. To address this issue, we propose Decoder Gradient Shields (DGSs), a family of defense mechanisms, including DGS at the output (DGS-O), at the input (DGS-I), and in the layers (DGS-L) of the decoder, with a closed-form solution for DGS-O and provable performance for all DGSs. Leveraging the joint design of reorienting and rescaling of the gradients from watermark-channel gradient-leaking queries, the proposed DGSs effectively prevent the watermark remover from achieving training convergence to the desired low-loss value, while preserving image quality of the decoder output. We demonstrate the effectiveness of our proposed DGSs in diverse application scenarios. Our experimental results on deraining and image generation tasks with the state-of-the-art box-free watermarking show that our DGSs achieve a defense success rate of 100% under all settings.

[227] Real-Time Multi-Modal Embedded Vision Framework for Object Detection Facial Emotion Recognition and Biometric Identification on Low-Power Edge Platforms

S. M. Khalid Bin Zahid,Md. Rakibul Hasan Nishat,Abdul Hasib,Md. Rakibul Hasan,Md. Ashiqussalehin,Md. Sahadat Hossen Sajib,A. S. M. Ahsanul Sarkar Akib

Main category: cs.CV

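TL;DR: Proposes a real-time multi-modal vision framework with context-aware scheduling that integrates object detection, face recognition, and emotion analysis and runs efficiently at low power on a Raspberry Pi 5.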

Details Motivation: Existing intelligent surveillance systems lack a unified adaptive runtime scheduler on edge devices, which leads to inefficient use of compute and hinders holistic perception. Method: An adaptive scheduling mechanism combines YOLOv8n, a custom FaceNet-based system, and DeepFace's CNN model, activating each module dynamically according to contextual triggers. Result: Computational load drops by 65%; object detection reaches an AP of 0.861, face recognition 88% accuracy, and emotion recognition an AUC of up to 0.97, at 5.6 fps. Conclusion: Context-aware scheduling is key to deploying complex multi-modal AI on low-cost edge hardware and improves energy efficiency, privacy, and practicality. Abstract: Intelligent surveillance systems often handle perceptual tasks such as object detection, facial recognition, and emotion analysis independently, but they lack a unified, adaptive runtime scheduler that dynamically allocates computational resources based on contextual triggers. This limits their holistic understanding and efficiency on low-power edge devices. To address this, we present a real-time multi-modal vision framework that integrates object detection, owner-specific face recognition, and emotion detection into a unified pipeline deployed on a Raspberry Pi 5 edge platform. The core of our system is an adaptive scheduling mechanism that reduces computational load by 65% compared to continuous processing by selectively activating modules such as YOLOv8n for object detection, a custom FaceNet-based embedding system for facial recognition, and DeepFace's CNN for emotion classification. Experimental results demonstrate the system's efficacy, with the object detection module achieving an Average Precision (AP) of 0.861, facial recognition attaining 88% accuracy, and emotion detection showing strong discriminatory power (AUC up to 0.97 for specific emotions), while operating at 5.6 frames per second. Our work demonstrates that context-aware scheduling is the key to unlocking complex multi-modal AI on cost-effective edge hardware, making intelligent perception more accessible and privacy-preserving.
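
The selective-activation idea can be pictured with a toy scheduler. The trigger rules below (run face recognition only when a person is detected, run emotion classification only for a recognized owner) and all names are hypothetical; the abstract does not specify the actual triggers, so this is a sketch of the general pattern rather than the authors' scheduler.

```python
def modules_to_run(detected_labels, owner_recognized=False):
    """Decide which modules to activate for the current frame.

    detected_labels: set of class names reported by the object detector.
    owner_recognized: result of the previous face-recognition step, if any.
    """
    active = ["object_detection"]  # assumed always-on first stage
    if "person" in detected_labels:
        active.append("face_recognition")
        if owner_recognized:
            active.append("emotion_recognition")
    return active

# Frames without a person skip the two heavier face-related stages entirely.
empty_scene = modules_to_run({"chair", "cup"})
with_owner = modules_to_run({"person"}, owner_recognized=True)
```

Skipping stages whenever their trigger is absent is what would produce a saving relative to running every model on every frame.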

[228] AVIR: Adaptive Visual In-Document Retrieval for Efficient Multi-Page Document Question Answering

Zongmin Li,Yachuan Li,Lei Kang,Dimosthenis Karatzas,Wenkang Ma

Main category: cs.CV

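TL;DR: Proposes Adaptive Visual In-document Retrieval (AVIR), a framework for multi-page document visual question answering (MP-DocVQA) that uses a lightweight retrieval model and a clustering strategy to reduce the number of input pages, significantly lowering computational cost while achieving superior performance on several benchmarks.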

Details Motivation: Long documents strain computational resources and weaken the attention mechanism in large vision-language models, so an efficient way to select relevant pages is needed to improve the efficiency and accuracy of MP-DocVQA. Method: A lightweight retrieval model scores each page's relevance, pages are clustered according to the score distribution, and Top-K together with a relevance probability threshold adaptively selects the most relevant pages; only those pages are fed to a frozen large vision-language model to generate the answer. Result: AVIR reduces the average number of pages needed per question by 70% and reaches an ANLS of 84.58% on MP-DocVQA, outperforming previous methods; its effectiveness is also verified on the SlideVQA and DUDE benchmarks. Conclusion: AVIR effectively lowers the computational cost of multi-page document visual question answering while keeping performance high, requires no model fine-tuning, and is general and practical. Abstract: Multi-page Document Visual Question Answering (MP-DocVQA) remains challenging because long documents not only strain computational resources but also reduce the effectiveness of the attention mechanism in large vision-language models (LVLMs). We tackle these issues with an Adaptive Visual In-document Retrieval (AVIR) framework. A lightweight retrieval model first scores each page for question relevance. Pages are then clustered according to the score distribution to adaptively select relevant content. The clustered pages are screened again by Top-K to keep the context compact. However, for short documents, clustering reliability decreases, so we use a relevance probability threshold to select pages. The selected pages alone are fed to a frozen LVLM for answer generation, eliminating the need for model fine-tuning. The proposed AVIR framework reduces the average page count required for question answering by 70%, while achieving an ANLS of 84.58% on the MP-DocVQA dataset, surpassing previous methods with significantly lower computational cost. The effectiveness of the proposed AVIR is also verified on the SlideVQA and DUDE benchmarks. The code is available at https://github.com/Li-yachuan/AVIR.
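
The selection logic (cluster by score distribution, then Top-K, with a probability threshold for short documents) might look roughly like the sketch below. The clustering here is a deliberately simple stand-in that splits the sorted scores at their largest gap; the real clustering method, the default values, and the `short_doc_pages` cutoff are assumptions, not taken from the paper.

```python
def select_pages(scores, top_k=3, prob_threshold=0.5, short_doc_pages=4):
    """Return indices of pages to pass to the LVLM, in document order.

    scores: per-page relevance probabilities from a retrieval model.
    """
    if not scores:
        return []
    ranked = sorted(enumerate(scores), key=lambda p: p[1], reverse=True)
    if len(scores) <= short_doc_pages:
        # Short document: clustering is unreliable, so use a plain threshold.
        chosen = [i for i, s in ranked if s >= prob_threshold]
        if not chosen:
            chosen = [ranked[0][0]]  # always keep at least the best page
    else:
        # Stand-in clustering: cut the ranking at the largest score gap.
        gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
        cut = gaps.index(max(gaps)) + 1
        # Screen the high-score cluster again with Top-K.
        chosen = [i for i, _ in ranked[:cut]][:top_k]
    return sorted(chosen)

long_doc = [0.05, 0.9, 0.1, 0.85, 0.08, 0.07, 0.8, 0.02]
print(select_pages(long_doc))           # [1, 3, 6]
print(select_pages(long_doc, top_k=2))  # [1, 3]
print(select_pages([0.2, 0.7, 0.6]))    # [1, 2]
```

Only the returned pages would then be fed to the frozen model, which is where the reduction in page count comes from.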

[229] Nip Rumors in the Bud: Retrieval-Guided Topic-Level Adaptation for Test-Time Fake News Video Detection

Jian Lang,Rongpei Hong,Ting Zhong,Yong Wang,Fan Zhou

Main category: cs.CV

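TL;DR: This paper proposes RADAR, the first test-time adaptation framework for fake news video detection, which uses a retrieval-guided adaptation paradigm to detect news videos on unseen topics effectively.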

Details Motivation: Existing methods assume the same news topic distribution at training and test time and cannot handle fake news videos tied to emerging events and unseen topics. Method: RADAR comprises an entropy selection-based retrieval mechanism, a stable anchor-guided alignment module, and a target-domain aware self-training paradigm; it uses stable, source-close videos in the target domain to guide robust adaptation of unstable instances. Result: Experiments show RADAR performs strongly on test-time fake news video detection and adapts effectively to rapidly changing unseen topics and class distributions. Conclusion: RADAR is the first to enable test-time adaptive detection for unseen fake news video topics, markedly improving detection in dynamic real-world settings. Abstract: Fake News Video Detection (FNVD) is critical for social stability. Existing methods typically assume consistent news topic distribution between training and test phases, failing to detect fake news videos tied to emerging events and unseen topics. To bridge this gap, we introduce RADAR, the first framework that enables test-time adaptation to unseen news videos. RADAR pioneers a new retrieval-guided adaptation paradigm that leverages stable (source-close) videos from the target domain to guide robust adaptation of semantically related but unstable instances. Specifically, we propose an Entropy Selection-Based Retrieval mechanism that provides videos with stable (low-entropy), relevant references for adaptation. We also introduce a Stable Anchor-Guided Alignment module that explicitly aligns unstable instances' representations to the source domain via distribution-level matching with their stable references, mitigating severe domain discrepancies. Finally, our novel Target-Domain Aware Self-Training paradigm can generate informative pseudo-labels augmented by stable references, capturing varying and imbalanced category distributions in the target domain and enabling RADAR to adapt to the fast-changing label distributions. Extensive experiments demonstrate that RADAR achieves superior performance for test-time FNVD, enabling strong on-the-fly adaptation to unseen fake news video topics.

[230] An AI-IoT Based Smart Wheelchair with Gesture-Controlled Mobility, Deep Learning-Based Obstacle Detection, Multi-Sensor Health Monitoring, and Emergency Alert System

Md. Asiful Islam,Abdul Hasib,Tousif Mahmud Emon,Khandaker Tabin Hasan,A. S. M. Ahsanul Sarkar Akib

Main category: cs.CV

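TL;DR: Proposes a multi-functional AI-IoT smart wheelchair system that integrates gesture control, YOLOv8 object detection, ultrasonic obstacle avoidance, and physiological monitoring; it offers high accuracy at low cost and improves user autonomy and safety.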

Details Motivation: Traditional wheelchairs have limited functionality, and existing smart wheelchairs are costly and lack integrated health monitoring, so a personalized, intelligent, and affordable mobility aid is urgently needed. Method: An AI-IoT architecture provides glove-based gesture-controlled navigation; YOLOv8 performs real-time object detection with voice feedback, supplemented by ultrasonic sensors for immediate collision avoidance; sensors continuously monitor vital signs (heart rate, SpO2, ECG, body temperature), upload the data to ThingSpeak, and trigger email alerts on anomalies. Result: Gesture control succeeds 95.5% of the time and ultrasonic obstacle detection is 94% accurate; YOLOv8 reaches 91.5% precision, 90.2% recall, and a 90.8% F1-score; vital-sign data are uploaded in real time and trigger alerts. Conclusion: This multi-modal, modular, low-cost smart wheelchair effectively combines navigation safety with health monitoring, helps move smart assistive devices from research toward real-world use, and significantly improves user autonomy, safety, and independence. Abstract: The growing number of differently-abled and elderly individuals demands affordable, intelligent wheelchairs that combine safe navigation with health monitoring. Traditional wheelchairs lack dynamic features, and many smart alternatives remain costly, single-modality, and limited in health integration. Motivated by the pressing demand for advanced, personalized, and affordable assistive technologies, we propose a comprehensive AI-IoT based smart wheelchair system that incorporates glove-based gesture control for hands-free navigation, real-time object detection using YOLOv8 with auditory feedback for obstacle avoidance, and ultrasonic sensing for immediate collision avoidance. Vital signs (heart rate, SpO2, ECG, temperature) are continuously monitored, uploaded to ThingSpeak, and trigger email alerts for critical conditions. Built on a modular and low-cost architecture, the gesture control achieved a 95.5% success rate, ultrasonic obstacle detection reached 94% accuracy, and YOLOv8-based object detection delivered 91.5% Precision, 90.2% Recall, and a 90.8% F1-score. This integrated, multi-modal approach offers a practical, scalable, and affordable solution, significantly enhancing user autonomy, safety, and independence by bridging the gap between innovative research and real-world deployment.
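
The reported F1-score follows directly from the reported precision and recall, since F1 is their harmonic mean. A short check, using only the two figures quoted in the abstract:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.915, 0.902)
print(f"{f1:.3f}")  # 0.908
```

The result rounds to 90.8%, matching the F1-score stated for the YOLOv8-based detector.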

[231] Structural Graph Neural Networks with Anatomical Priors for Explainable Chest X-ray Diagnosis

Khaled Berkani

Main category: cs.CV

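TL;DR: Proposes a structural graph reasoning framework with anatomical priors for explainable visual diagnosis; it converts convolutional feature maps into graphs that explicitly model spatial relations, yielding intrinsic explainability.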

Details Motivation: Existing graph neural networks in medical diagnosis make little effective use of anatomical priors and rely on post-hoc methods for explainability, which limits transparency and reliability. Method: Convolutional feature maps are reinterpreted as graphs whose nodes carry appearance and spatial coordinates; a custom structural propagation mechanism explicitly models relative spatial relations; lesion-aware node predictions and graph-level diagnosis are learned jointly. Result: A chest X-ray case study supports the approach, showing that structural priors guide relational reasoning and improve interpretability, and that the framework is domain-agnostic. Conclusion: The framework strengthens intrinsic explainability through structured graph reasoning and advances research on graphs as a computational substrate for structure-aware, explainable learning. Abstract: We present a structural graph reasoning framework that incorporates explicit anatomical priors for explainable vision-based diagnosis. Convolutional feature maps are reinterpreted as patch-level graphs, where nodes encode both appearance and spatial coordinates, and edges reflect local structural adjacency. Unlike conventional graph neural networks that rely on generic message passing, we introduce a custom structural propagation mechanism that explicitly models relative spatial relations as part of the reasoning process. This design enables the graph to act as an inductive bias for structured inference rather than a passive relational representation. The proposed model jointly supports node-level lesion-aware predictions and graph-level diagnostic reasoning, yielding intrinsic explainability through learned node importance scores without relying on post-hoc visualization techniques. We demonstrate the approach through a chest X-ray case study, illustrating how structural priors guide relational reasoning and improve interpretability. While evaluated in a medical imaging context, the framework is domain-agnostic and aligns with the broader vision of graph-based reasoning across artificial intelligence systems. This work contributes to the growing body of research exploring graphs as computational substrates for structure-aware and explainable learning.

[232] DAOS: A Multimodal In-cabin Behavior Monitoring with Driver Action-Object Synergy Dataset

Yiming Li,Chen Cai,Tianyi Liu,Dan Lin,Wenqian Wang,Wenfei Liang,Bingbing Li,Kim-Hui Yap

Main category: cs.CV

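TL;DR: This paper presents DAOS, a multi-modal, multi-view dataset of driver actions and objects with 9,787 video clips and rich fine-grained annotations, and AOR-Net, a model that achieves state-of-the-art action recognition in complex driving scenes by modeling action-object relations and introducing a dynamic reasoning mechanism.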

Details Motivation: Existing driver-monitoring datasets lack precise object-location annotations or do not link objects to actions, which makes similar upper-body actions hard to distinguish; a dataset and model that capture human-object interaction are needed for reliable action recognition. Method: The DAOS dataset covers 36 fine-grained driving actions and 15 object classes with multi-modal, multi-view data (RGB, infrared, and depth); AOR-Net models the logical relationships between actions and objects dynamically through multi-level reasoning, a chain-of-action prompting mechanism, and a Mixture of Thoughts module. Result: Experiments on several datasets show that AOR-Net significantly outperforms existing state-of-the-art methods and is more robust in both object-rich and object-scarce conditions. Conclusion: DAOS and AOR-Net demonstrate the value of human-object relations for fine-grained driver behavior understanding and provide a new data resource and modeling paradigm for driver behavior analysis. Abstract: In driver activity monitoring, movements are mostly limited to the upper body, which makes many actions look similar. To tell these actions apart, humans often rely on the objects the driver is using, such as holding a phone compared with gripping the steering wheel. However, most existing driver-monitoring datasets lack accurate object-location annotations or do not link objects to their associated actions, leaving a critical gap for reliable action recognition. To address this, we introduce the Driver Action with Object Synergy (DAOS) dataset, comprising 9,787 video clips annotated with 36 fine-grained driver actions and 15 object classes, totaling more than 2.5 million corresponding object instances. DAOS offers multi-modal, multi-view data (RGB, IR, and depth) from front, face, left, and right perspectives. Although DAOS captures a wide range of cabin objects, only a few are directly relevant to each action for prediction, so focusing on task-specific human-object relations is essential. To tackle this challenge, we propose the Action-Object-Relation Network (AOR-Net). AOR-Net comprehends complex driver actions through multi-level reasoning and a chain-of-action prompting mechanism that models the logical relationships among actions, objects, and their relations. Additionally, the Mixture of Thoughts module is introduced to dynamically select essential knowledge at each stage, enhancing robustness in object-rich and object-scarce conditions. Extensive experiments demonstrate that our model outperforms other state-of-the-art methods on various datasets.

[233] SMc2f: Robust Scenario Mining for Robotic Autonomy from Coarse to Fine

Yifei Chen,Ross Greer

Main category: cs.CV

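TL;DR: Proposes SMc2f, a coarse-to-fine scenario mining framework that uses vision-language models and text-trajectory contrastive learning to improve the quality and efficiency of retrieving scenarios described in natural language for autonomous robotic vehicles.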

Details Motivation: Existing methods such as RefAV retrieve from trajectory labels, ignore the direct link between language and raw images, and depend on the quality of 3D detection and tracking, which leads to inaccurate localization. Method: SMc2f first applies a vision-language model (VLM) for coarse image-text filtering; it builds a database of successful RefAV mining cases and automatically retrieves exemplars to few-shot condition the LLM; and it introduces text-trajectory contrastive learning that pulls matched pairs together and pushes mismatched pairs apart in a shared embedding space for fine-grained matching. Result: Experiments on public datasets show substantial gains in both retrieval quality and efficiency. Conclusion: By combining vision-language models with contrastive learning, SMc2f achieves more robust and precise scenario retrieval than LLM-based methods that rely only on trajectory labels. Abstract: The safety validation of autonomous robotic vehicles hinges on systematically testing their planning and control stacks against rare, safety-critical scenarios. Mining these long-tail events from massive real-world driving logs is therefore a critical step in the robotic development lifecycle. The goal of the Scenario Mining task is to retrieve useful information to enable targeted re-simulation, regression testing, and failure analysis of the robot's decision-making algorithms. RefAV, introduced by the Argoverse team, is an end-to-end framework that uses large language models (LLMs) to spatially and temporally localize scenarios described in natural language. However, this process performs retrieval on trajectory labels, ignoring the direct connection between natural language and raw RGB images, which runs counter to the intuition of video retrieval; it also depends on the quality of upstream 3D object detection and tracking. Further, inaccuracies in trajectory data lead to inaccuracies in downstream spatial and temporal localization. To address these issues, we propose Robust Scenario Mining for Robotic Autonomy from Coarse to Fine (SMc2f), a coarse-to-fine pipeline that employs vision-language models (VLMs) for coarse image-text filtering, builds a database of successful mining cases on top of RefAV and automatically retrieves exemplars to few-shot condition the LLM for more robust retrieval, and introduces text-trajectory contrastive learning to pull matched pairs together and push mismatched pairs apart in a shared embedding space, yielding a fine-grained matcher that refines the LLM's candidate trajectories. Experiments on public datasets demonstrate substantial gains in both retrieval quality and efficiency.
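
The "pull matched pairs together, push mismatched pairs apart" objective is commonly written as an InfoNCE-style loss. The sketch below shows a one-directional (text-to-trajectory) version over cosine similarities; the temperature value, the one-directional form, and the toy 2-D embeddings are assumptions, and the abstract does not say which exact contrastive loss the authors use.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def info_nce(text_embs, traj_embs, temperature=0.07):
    """Mean InfoNCE loss; text_embs[i] is the positive for traj_embs[i]."""
    n = len(text_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(text_embs[i], t) / temperature for t in traj_embs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n

texts = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]     # each trajectory agrees with its text
mismatched = [[0.0, 1.0], [1.0, 0.0]]  # pairs swapped
loss_good = info_nce(texts, matched)
loss_bad = info_nce(texts, mismatched)
```

The loss is near zero when every description is closest to its own trajectory and grows when the pairs are swapped, which is the signal a fine-grained matcher would be trained on.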

[234] SAR-Based Marine Oil Spill Detection Using the DeepSegFusion Architecture

Pavan Kumar Yata,Pediredla Pradeep,Goli Himanish,Swathi M

Main category: cs.CV

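TL;DR: Proposes DeepSegFusion, a hybrid deep learning model for oil spill segmentation in SAR images that significantly reduces false alarms and improves detection accuracy.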

Details Motivation: Traditional threshold-based methods suffer high false alarm rates from look-alike phenomena (such as wind slicks and ship wakes) and struggle to detect oil spills reliably. Method: SegNet and DeepLabV3+ are combined with an attention-based feature fusion mechanism to improve boundary precision and contextual understanding. Result: On SAR oil spill datasets, the model reaches 94.85% accuracy, an IoU of 0.5685, and a ROC-AUC of 0.9330, with 64.4% fewer false detections than baseline models. Conclusion: DeepSegFusion is stable across a range of marine conditions and suitable for near real-time oil spill monitoring. Abstract: Detection of oil spills from satellite images is essential for both environmental surveillance and maritime safety. Traditional threshold-based methods frequently encounter performance degradation due to very high false alarm rates caused by look-alike phenomena such as wind slicks and ship wakes. Here, a hybrid deep learning model, DeepSegFusion, is presented for oil spill segmentation in Synthetic Aperture Radar (SAR) images. The model uses SegNet and DeepLabV3+ integrated with an attention-based feature fusion mechanism to achieve better boundary precision as well as improved contextual understanding. Results obtained on SAR oil spill datasets, including ALOS PALSAR imagery, confirm that the proposed DeepSegFusion model achieves an accuracy of 94.85%, an Intersection over Union (IoU) of 0.5685, and a ROC-AUC score of 0.9330. The proposed method delivers more than three times fewer false detections compared to individual baseline models and traditional non-segmentation methods, achieving a reduction of 64.4%. These results indicate that DeepSegFusion is a stable model under various marine conditions and can therefore be used in near real-time oil spill monitoring scenarios.
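
A high pixel accuracy alongside a much lower IoU is typical when the target class covers few pixels. The toy masks below are invented purely to show how the two metrics are computed and why they can diverge; they are not data from the paper.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def pixel_accuracy(pred, truth):
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return (tp + tn) / (tp + fp + fn + tn)

def iou(pred, truth):
    tp, fp, fn, _ = confusion_counts(pred, truth)
    union = tp + fp + fn
    return tp / union if union else 1.0  # two empty masks agree perfectly

# Toy 4x5 scene: the slick covers 3 of 20 pixels; the prediction is shifted by one.
truth = [[0, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
pred = [[0, 0, 0, 0, 0], [0, 0, 1, 1, 1], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
print(pixel_accuracy(pred, truth))  # 0.9
print(iou(pred, truth))             # 0.5
```

Background pixels dominate the accuracy figure, whereas IoU ignores true negatives and therefore penalizes the one-pixel shift much more heavily.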

[235] DIAMOND-SSS: Diffusion-Augmented Multi-View Optimization for Data-efficient SubSurface Scattering

Guillermo Figueroa-Araneda,Iris Diana Jimenez,Florian Hofherr,Manny Ko,Hector Andrade-Loarca,Daniel Cremers

Main category: cs.CV

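TL;DR: This paper proposes DIAMOND-SSS, a data-efficient framework for high-fidelity subsurface scattering reconstruction from extremely sparse images, which uses diffusion models to generate augmented data and introduces geometric priors to stabilize reconstruction.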

Details Motivation: Modeling subsurface scattering in neural rendering currently depends on large multi-view, multi-light datasets that are costly and hard to capture, which limits use with sparse inputs. Method: Diffusion models are fine-tuned for novel-view synthesis and relighting, conditioned on estimated geometry; illumination-independent geometric priors (multi-view silhouette and depth consistency losses) stabilize training under sparse or synthetic supervision. Result: With as few as 10 images, the method achieves state-of-the-art relightable Gaussian rendering quality and reduces real capture requirements by up to 90% compared with SSS-3DGS. Conclusion: DIAMOND-SSS substantially reduces dependence on real captured data and offers an effective approach to efficient reconstruction of materials with complex light transport. Abstract: Subsurface scattering (SSS) gives translucent materials (such as wax, jade, marble, and skin) their characteristic soft shadows, color bleeding, and diffuse glow. Modeling these effects in neural rendering remains challenging due to complex light transport and the need for densely captured multi-view, multi-light datasets (often more than 100 views and 112 OLATs). We present DIAMOND-SSS, a data-efficient framework for high-fidelity translucent reconstruction from extremely sparse supervision, even as few as ten images. We fine-tune diffusion models for novel-view synthesis and relighting, conditioned on estimated geometry and trained on less than 7 percent of the dataset, producing photorealistic augmentations that can replace up to 95 percent of missing captures. To stabilize reconstruction under sparse or synthetic supervision, we introduce illumination-independent geometric priors: a multi-view silhouette consistency loss and a multi-view depth consistency loss. Across all sparsity regimes, DIAMOND-SSS achieves state-of-the-art quality in relightable Gaussian rendering, reducing real capture requirements by up to 90 percent compared to SSS-3DGS.

[236] FocaLogic: Logic-Based Interpretation of Visual Model Decisions

Chenchen Zhao,Muxi Chen,Qiang Xu

Main category: cs.CV

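TL;DR: FocaLogic is a model-agnostic framework that interprets and quantifies the decision process of visual models through logical expressions, identifies the key visual regions that drive predictions, and provides a set of quantitative metrics for evaluating model behavior.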

Details Motivation: Existing interpretability methods either require white-box model access or lack quantitative rigor, which limits their reliability in high-stakes applications. Method: FocaLogic identifies the minimal visual regions (visual focuses) that decisively influence predictions and translates them into precise, compact logical expressions; it also defines quantitative metrics including focus precision, recall, and divergence. Result: Experiments show FocaLogic can reveal training-induced concentration, improved focus accuracy through generalization, and anomalous focuses under biases and adversarial attacks. Conclusion: FocaLogic offers a systematic, scalable, and quantitative approach to interpreting visual models that applies to model analysis across many scenarios. Abstract: Interpretability of modern visual models is crucial, particularly in high-stakes applications. However, existing interpretability methods typically suffer from either reliance on white-box model access or insufficient quantitative rigor. To address these limitations, we introduce FocaLogic, a novel model-agnostic framework designed to interpret and quantify visual model decision-making through logic-based representations. FocaLogic identifies minimal interpretable subsets of visual regions, termed visual focuses, that decisively influence model predictions. It translates these visual focuses into precise and compact logical expressions, enabling transparent and structured interpretations. Additionally, we propose a suite of quantitative metrics, including focus precision, recall, and divergence, to objectively evaluate model behavior across diverse scenarios. Empirical analyses demonstrate FocaLogic's capability to uncover critical insights such as training-induced concentration, increasing focus accuracy through generalization, and anomalous focuses under biases and adversarial attacks. Overall, FocaLogic provides a systematic, scalable, and quantitative solution for interpreting visual models.

[237] A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models

Weixin Ye,Wei Wang,Yahui Liu,Yue Song,Bin Ren,Wei Bi,Rita Cucchiara,Nicu Sebe

Main category: cs.CV

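TL;DR: This paper proposes the Masked Jigsaw Puzzle (MJP) framework, which makes Transformers more robust in federated learning by defending against gradient attacks while improving their performance on computer vision and natural language processing tasks.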

Details Motivation: The gradients of position embeddings in Transformers can leak information about the input data, causing privacy leakage, particularly under gradient attacks in federated learning; a method is needed to protect position-embedding information and improve generalization. Method: MJP randomly shuffles token order and masks the position information of the shuffled tokens with a learnable unknown (unk) position embedding, which disrupts local spatial information and forces the model to learn representations that rely more on semantics than on position. Result: Experiments show that MJP defends effectively against gradient attacks and improves performance on ImageNet-1K image classification and on Yelp and Amazon sentiment analysis, and that it applies to a range of Transformer architectures. Conclusion: MJP is a unified and effective framework that strengthens privacy protection and generalization of Transformer models in federated learning without sacrificing performance. Abstract: In federated learning, Transformer, as a popular architecture, faces critical challenges in defending against gradient attacks and improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradient of Position Embeddings (PEs) in Transformer contains sufficient information, which can be used to reconstruct the input data. To mitigate this issue, we introduce a Masked Jigsaw Puzzle (MJP) framework. MJP starts with random token shuffling to break the token order, and then a learnable unknown (unk) position embedding is used to mask out the PEs of the shuffled tokens. In this manner, the local spatial information which is encoded in the position embeddings is disrupted, and the models are forced to learn feature representations that are less reliant on the local spatial information. Notably, with the careful use of MJP, we can not only improve models' robustness against gradient attacks, but also boost their performance in both vision and text application scenarios, such as classification for images (e.g., ImageNet-1K) and sentiment analysis for text (e.g., Yelp and Amazon). Experimental results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks. Code is publicly available via https://github.com/ywxsuperstar/transformerattack
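
The two MJP steps (shuffle a subset of tokens, then hide their positions behind a shared "unknown" position) can be sketched on plain Python lists. The shuffle ratio, the sentinel index `-1` standing in for the learnable unk embedding, and the function name are assumptions for illustration; in a real model the sentinel would index a trainable embedding vector.

```python
import random

def masked_jigsaw(tokens, shuffle_ratio=0.25, unk_position=-1, seed=None):
    """Shuffle a random subset of tokens and mask their position ids.

    Returns (new_tokens, position_ids); shuffled slots get `unk_position`,
    untouched slots keep their original index.
    """
    rng = random.Random(seed)
    n = len(tokens)
    k = int(n * shuffle_ratio)
    picked = sorted(rng.sample(range(n), k))  # slots taking part in the shuffle
    permuted = picked[:]
    rng.shuffle(permuted)                     # where each slot's new token comes from
    new_tokens = list(tokens)
    position_ids = list(range(n))
    for dst, src in zip(picked, permuted):
        new_tokens[dst] = tokens[src]
        position_ids[dst] = unk_position
    return new_tokens, position_ids

patches = list("abcdefgh")
shuffled, positions = masked_jigsaw(patches, shuffle_ratio=0.5, seed=0)
```

Tokens outside the selected subset keep both their content and their true position id, so only part of the spatial layout is hidden from the model.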

[238] Task-Driven Prompt Learning: A Joint Framework for Multi-modal Cloud Removal and Segmentation

Zaiyan Zhang,Jie Li,Shaowei Shi,Qiangqiang Yuan

Main category: cs.CV

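TL;DR: This paper proposes TDP-CR, a task-driven multi-modal cloud removal framework that combines SAR data with a learnable degradation prompt to restore optical remote sensing images with high quality and semantic consistency, significantly improving the quality of analysis-ready data.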

Details Motivation: Existing cloud removal methods emphasize low-level fidelity and often over-smooth textures and boundaries, which harms downstream semantic analysis and falls short of analysis-ready data (ARD) requirements. Method: TDP-CR includes a Prompt-Guided Fusion (PGF) mechanism that uses a learnable degradation prompt to encode cloud thickness and spatial uncertainty and to fuse SAR information; a parameter-efficient two-phase training strategy decouples reconstruction from semantic representation learning. Result: On the LuojiaSET-OSFCR dataset, TDP-CR exceeds state-of-the-art methods by 0.18 dB in PSNR with only 15% of the parameters and improves mIoU by 1.4%, giving better semantic segmentation. Conclusion: Through its task-driven design and efficient fusion mechanism, TDP-CR achieves better visual and semantic restoration with fewer parameters and effectively supports analysis-ready use of remote sensing imagery. Abstract: Optical remote sensing imagery is indispensable for Earth observation, yet persistent cloud occlusion limits its downstream utility. Most cloud removal (CR) methods are optimized for low-level fidelity and can over-smooth textures and boundaries that are critical for analysis-ready data (ARD), leading to a mismatch between visually plausible restoration and semantic utility. To bridge this gap, we propose TDP-CR, a task-driven multimodal framework that jointly performs cloud removal and land-cover segmentation. Central to our approach is a Prompt-Guided Fusion (PGF) mechanism, which utilizes a learnable degradation prompt to encode cloud thickness and spatial uncertainty. By combining global channel context with local prompt-conditioned spatial bias, PGF adaptively integrates Synthetic Aperture Radar (SAR) information only where optical data is corrupted. We further introduce a parameter-efficient two-phase training strategy that decouples reconstruction and semantic representation learning. Experiments on the LuojiaSET-OSFCR dataset demonstrate the superiority of our framework: TDP-CR surpasses heavy state-of-the-art baselines by 0.18 dB in PSNR while using only 15% of the parameters, and achieves a consistent 1.4% improvement in mIoU over multi-task competitors, effectively delivering analysis-ready data.

[239] Automating Parameter Selection in Deep Image Prior for Fluorescence Microscopy Image Denoising via Similarity-Based Parameter Transfer

Lina Meyer,Felix Wissel,Tobias Knopp,Susanne Pfefferle,Ralf Fliegert,Maximilian Sandmann,Liana Uebler,Franziska Möckl,Björn-Philipp Diercks,David Lohr,René Werner

Main category: cs.CV

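TL;DR: Proposes AUTO-DIP, which transfers parameters based on image-metadata similarity to enable optimization-free deep image prior (DIP) denoising, and outperforms the original DIP and variational denoising methods on fluorescence microscopy images.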

Details Motivation: DIP depends on the network architecture and the stopping point of its iterations, and these parameters must be optimized separately for each new image, which is time-consuming and limits use where many images are processed. Method: A calibration set of 110 images and a validation set of 55 images are built to search for optimal U-net architectures and stopping points; the AUTO-DIP pipeline transfers parameters based on similarity of image metadata (such as microscope type and specimen) and is compared with transfer based on image-content similarity. Result: Metadata-based transfer performs better than transfer based on quantitative image similarity; AUTO-DIP outperforms the original DIP configuration and a state-of-the-art variational denoising approach on several public datasets, especially for very noisy images, and its advantage is confirmed on locally acquired fluorescence images. Conclusion: Similar fluorescence microscopy images share similar optimal DIP configurations, and metadata-based parameter transfer enables efficient, optimization-free DIP denoising, which makes DIP much more practical. Abstract: Unsupervised deep image prior (DIP) addresses shortcomings of training data requirements and limited generalization associated with supervised deep learning. The performance of DIP depends on the network architecture and the stopping point of its iterative process. Optimizing these parameters for a new image requires time, restricting DIP application in domains where many images need to be processed. Focusing on fluorescence microscopy data, we hypothesize that similar images share comparable optimal parameter configurations for DIP-based denoising, potentially enabling optimization-free DIP for fluorescence microscopy. We generated a calibration (n=110) and validation set (n=55) of semantically different images from an open-source dataset for a network architecture search targeted towards ideal U-net architectures and stopping points. The calibration set represented our transfer basis. The validation set enabled the assessment of which image similarity criterion yields the best results. We then implemented AUTO-DIP, a pipeline for automatic parameter transfer, and compared it to the originally published DIP configuration (baseline) and a state-of-the-art image-specific variational denoising approach. We show that a parameter transfer from the calibration dataset to a test image based on only image metadata similarity (e.g., microscope type, imaged specimen) leads to similar or better performance than a transfer based on quantitative image similarity measures. AUTO-DIP outperforms the baseline DIP (DIP with original DIP parameters) as well as the variational denoising approaches for several open-source test datasets of varying complexity, particularly for very noisy inputs. Applications to locally acquired fluorescence microscopy images further proved superiority of AUTO-DIP.

[240] Learning Language-Driven Sequence-Level Modal-Invariant Representations for Video-Based Visible-Infrared Person Re-Identification

Xiaomei Yang,Xizhan Gao,Antai Liu,Kang Wei,Fa Zhu,Guang Feng,Xiaofeng Qu,Sijie Niu

Main category: cs.CV

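TL;DR: Proposes a language-driven sequence-level modal-invariant representation learning method (LSMRL) for video-based visible-infrared person re-identification, which improves cross-modal consistency and representation quality through three modules and modality-level losses.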

Details Motivation: Existing VVI-ReID methods fall short in efficient spatial-temporal modeling, sufficient cross-modal interaction, and modality-level loss guidance, which makes it hard to learn modal-invariant representations. Method: Designs an STFL module for efficient spatial-temporal feature learning, built on CLIP with lightweight modifications; proposes an SD module that diffuses CLIP's language prompts into visible and infrared features to establish preliminary modal consistency; builds a CMI module that uses bidirectional cross-modal self-attention to remove residual modality gaps; and introduces two modality-level losses to strengthen the discriminability and generalization of the representations. Result: Extensive experiments on large-scale VVI-ReID datasets show that LSMRL outperforms current state-of-the-art methods. Conclusion: Through its language-driven cooperative mechanism, LSMRL improves modal-invariant representation learning for video-based visible-infrared person re-identification and shows good practicality and potential for extension. Abstract: The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modal-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modal-invariant representations. Despite achieving optimal performance, such methods still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modal-invariant representation learning (LSMRL) method, which includes a spatial-temporal feature learning (STFL) module, a semantic diffusion (SD) module, and a cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal modifications. To achieve sufficient cross-modal interaction and enhance the learning of modal-invariant features, the SD module is proposed to diffuse modality-shared language prompts into visible and infrared features to establish preliminary modal consistency. The CMI module is further developed to leverage bidirectional cross-modal self-attention to eliminate residual modality gaps and refine modal-invariant representations. To explicitly enhance the learning of modal-invariant representations, two modality-level losses are introduced to improve the features' discriminative ability and their generalization to unseen categories. Extensive experiments on large-scale VVI-ReID datasets demonstrate the superiority of LSMRL over SOTA methods.

[241] Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation

Zijie Lou,Xiangwei Feng,Jiaxin Wang,Xiaochao Qu,Luoqi Liu,Ting Liu

Main category: cs.CV

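TL;DR: Proposes a video object removal method based on a stochastic bridge model that recasts the task as video-to-video translation, uses the original video as a structural prior, and adds an adaptive mask modulation strategy to improve large-object removal and spatiotemporal consistency while preserving background fidelity.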

Details Motivation: Existing diffusion-based video object removal methods start from Gaussian noise and ignore the rich structural and contextual priors in the original video, which leads to incomplete erasure or generated content that violates physical logic. Method: Models video object removal as a video-to-video stochastic bridge process that directly builds a stochastic path from the video with objects to the video without them; introduces an adaptive mask modulation strategy that dynamically adjusts the input embeddings to balance background fidelity and generative flexibility. Result: Significantly outperforms existing methods in visual quality and temporal consistency. Conclusion: A framework based on stochastic bridges and adaptive mask modulation makes more effective use of the input video prior and achieves precise, logically consistent object removal. Abstract: Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene's physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency.

[242] ARMARecon: An ARMA Convolutional Filter based Graph Neural Network for Neurodegenerative Dementias Classification

VSS Tejaswi Abburi,Ananya Singhal,Saurabh J. Shigwan,Nitin Kumar

Main category: cs.CV

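TL;DR: Proposes ARMARecon, a graph learning framework that combines ARMA graph filtering with a reconstruction objective for early detection of Alzheimer's disease (AD) and frontotemporal dementia (FTD); it models local and global connectivity using white-matter FA histogram features and outperforms existing methods on the ADNI and NIFD datasets.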

Details Motivation: Early detection of Alzheimer's disease (AD) and frontotemporal dementia (FTD) is essential for slowing disease progression; because both spread along white matter in a graph-dependent manner, graph neural network methods that model global and local connectivity are needed. Method: Proposes the ARMARecon framework, which combines autoregressive moving average (ARMA) graph filtering with a reconstruction-driven objective, represents white-matter regions with 20-bin FA histogram features, and mitigates over-smoothing. Result: On the multi-site dMRI datasets ADNI and NIFD, ARMARecon achieves higher classification accuracy than current state-of-the-art methods. Conclusion: ARMARecon effectively models the local and global properties of white-matter structural connectivity, improves early identification of AD and FTD, and shows potential for clinical translation. Abstract: Early detection of neurodegenerative diseases such as Alzheimer's Disease (AD) and Frontotemporal Dementia (FTD) is essential for reducing the risk of progression to severe disease stages. As AD and FTD propagate along white-matter regions in a global, graph-dependent manner, graph-based neural networks are well suited to capture these patterns. Hence, we introduce ARMARecon, a unified graph learning framework that integrates Autoregressive Moving Average (ARMA) graph filtering with a reconstruction-driven objective to enhance feature representation and improve classification accuracy. ARMARecon effectively models both local and global connectivity by leveraging 20-bin Fractional Anisotropy (FA) histogram features extracted from white-matter regions, while mitigating over-smoothing. Overall, ARMARecon achieves superior performance compared to state-of-the-art methods on the multi-site dMRI datasets ADNI and NIFD.
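As a rough illustration of the 20-bin FA histogram features mentioned in the abstract, the sketch below turns the FA values of one white-matter region into a 20-dimensional node feature. The function name `fa_histogram`, the equal-width binning over [0, 1], and the normalization to fractions are assumptions made for illustration; the abstract does not say how the authors bin or normalize.

```python
def fa_histogram(fa_values, bins=20):
    """Hypothetical helper: normalized histogram of FA values in [0, 1]."""
    counts = [0] * bins
    for v in fa_values:
        if not 0.0 <= v <= 1.0:
            raise ValueError("FA values are expected to lie in [0, 1]")
        # Equal-width bins; v == 1.0 is placed in the last bin.
        idx = min(int(v * bins), bins - 1)
        counts[idx] += 1
    total = len(fa_values)
    # Assumption: features are fractions of voxels per bin, not raw counts.
    return [c / total for c in counts] if total else [0.0] * bins


# Toy FA values for a single white-matter region (made-up numbers).
region_fa = [0.06, 0.07, 0.31, 0.33, 0.34, 0.92, 1.0]
features = fa_histogram(region_fa)
```

Each region's vector would then serve as the feature of one graph node; how the graph edges are built is not described in the abstract and is left out here.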

[243] CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation

H. Jiang,Y. Sun,Z. Dong,T. Liu,Y. Gu

Main category: cs.CV

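TL;DR: Introduces RS-RVOS Bench, the first large-scale benchmark for remote sensing referring video object segmentation, and MQC-SAM, an online segmentation framework with memory quality control that improves segmentation through motion-consistency calibration and dynamic quality assessment.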

Details Motivation: Targets in remote sensing video have weak saliency and visual information is often severely truncated, and no large-scale dedicated dataset exists; existing methods also suffer from biased initial memory and indiscriminate memory accumulation, which cause inaccurate localization and error propagation. Method: Builds RS-RVOS Bench, a large-scale benchmark with 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations; proposes the MQC-SAM framework, which adds a temporal motion consistency module to calibrate the initial memory and uses a decoupled attention mechanism with dynamic quality assessment to selectively update high-quality semantic features. Result: Experiments on RS-RVOS Bench show that MQC-SAM achieves state-of-the-art performance on remote sensing referring video object segmentation and effectively suppresses noisy memory accumulation and error propagation. Conclusion: High-quality memory initialization and selective memory updating are essential for remote sensing referring video object segmentation; MQC-SAM, together with the causality-aware annotated benchmark, offers a new direction for the field. Abstract: Remote sensing video referring object segmentation (RS-RVOS) is challenged by weak target saliency and severe visual information truncation in dynamic scenes, making it extremely difficult to maintain discriminative target representations during segmentation. Moreover, progress in this field is hindered by the absence of large-scale dedicated benchmarks, while existing models are often affected by biased initial memory construction that impairs accurate instance localization in complex scenarios, as well as indiscriminate memory accumulation that encodes noise from occlusions or misclassifications, leading to persistent error propagation. This paper advances RS-RVOS research through dual contributions in data and methodology. First, we construct RS-RVOS Bench, the first large-scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Unlike common RVOS benchmarks where many expressions are written with access to the full video context, our dataset adopts a strict causality-aware annotation strategy in which linguistic references are generated solely from the target state in the initial frame. Second, we propose a memory-quality-aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC-SAM). MQC-SAM introduces a temporal motion consistency module for initial memory calibration, leveraging short-term motion trajectory priors to correct structural deviations and establish accurate memory anchoring. 
Furthermore, it incorporates a decoupled attention-based memory integration mechanism with dynamic quality assessment, selectively updating high-confidence semantic features while filtering unreliable information, thereby effectively preventing error accumulation and propagation. Extensive experiments on RS-RVOS Bench demonstrate that MQC-SAM achieves state-of-the-art performance.

[244] EmoLat: Text-driven Image Sentiment Transfer via Emotion Latent Space

Jing Zhang,Bingjie Fan,Jixiang Zhu,Zhe Wang

Main category: cs.CV

TL;DR: Proposes EmoLat, a new emotion latent space for fine-grained text-driven image sentiment transfer, and builds the large-scale EmoSpace Set dataset to validate it.

Details Motivation: Existing methods for text-driven image sentiment transfer lack fine-grained control and cross-modal alignment, so a more discriminative and transferable emotion representation space is needed. Method: Builds the EmoLat emotion latent space and an emotion semantic graph, uses adversarial regularization to align emotion distributions across modalities, and designs a multi-objective loss to optimize the cross-modal sentiment transfer framework. Result: Experiments on the self-built EmoSpace Set dataset show that the method significantly outperforms existing state-of-the-art methods in both quantitative metrics and generation quality. Conclusion: EmoLat provides a new paradigm for controllable, text-guided image sentiment editing and, together with the EmoSpace Set dataset, advances emotion-aware visual content generation. Abstract: We propose EmoLat, a novel emotion latent space that enables fine-grained, text-driven image sentiment transfer by modeling cross-modal correlations between textual semantics and visual emotion features. Within EmoLat, an emotion semantic graph is constructed to capture the relational structure among emotions, objects, and visual attributes. To enhance the discriminability and transferability of emotion representations, we employ adversarial regularization, aligning the latent emotion distributions across modalities. Building upon EmoLat, a cross-modal sentiment transfer framework is proposed to manipulate image sentiment via joint embedding of text and EmoLat features. The network is optimized using a multi-objective loss incorporating semantic consistency, emotion alignment, and adversarial regularization. To support effective modeling, we construct EmoSpace Set, a large-scale benchmark dataset comprising images with dense annotations on emotions, object semantics, and visual attributes. Extensive experiments on EmoSpace Set demonstrate that our approach significantly outperforms existing state-of-the-art methods in both quantitative metrics and qualitative transfer fidelity, establishing a new paradigm for controllable image sentiment editing guided by textual input. The EmoSpace Set and all the code are available at http://github.com/JingVIPLab/EmoLat.

[245] Toward Real-World High-Precision Image Matting and Segmentation

Haipeng Zhou,Zhaohu Xing,Hongqiu Wang,Jun Ma,Ping Li,Lei Zhu

Main category: cs.CV

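TL;DR: Proposes a Foreground Consistent Learning model (FCLM) for fine-grained segmentation in high-precision scene parsing; depth-aware distillation, domain-invariant learning, and an object-oriented decoder improve generalization to real-world scenes and interactive prediction.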

Details Motivation: Existing methods mostly focus on salient, single foreground objects and rely on low-quality synthetic data, which leads to poor generalization in real scenes; interactive methods, in turn, are class-agnostic by design and generalize poorly across categories. Method: Proposes the FCLM model: 1) a depth-aware distillation strategy that transfers depth-related knowledge to strengthen foreground representation; 2) treating the processing of synthetic data as a domain adaptation problem and using domain-invariant learning to focus on the foreground; 3) an object-oriented decoder that supports interactive prediction from visual and language prompts. Result: Experiments show that the method outperforms current state-of-the-art methods both quantitatively and qualitatively, particularly in fine-grained segmentation and interactive prediction on real scenes. Conclusion: FCLM addresses poor data quality, weak generalization, and limited interactivity in high-precision segmentation, and provides a stronger general framework for tasks such as image matting and dichotomous segmentation. Abstract: High-precision scene parsing tasks, including image matting and dichotomous segmentation, aim to accurately predict masks with extremely fine details (such as hair). Most existing methods focus on salient, single foreground objects. While interactive methods allow for target adjustment, their class-agnostic design restricts generalization across different categories. Furthermore, the scarcity of high-quality annotation has led to a reliance on inharmonious synthetic data, resulting in poor generalization to real-world scenarios. To this end, we propose a Foreground Consistent Learning model, dubbed FCLM, to address the aforementioned issues. Specifically, we first introduce a Depth-Aware Distillation strategy where we transfer the depth-related knowledge for better foreground representation. Considering the data dilemma, we treat the processing of synthetic data as a domain adaptation problem, for which we propose a domain-invariant learning strategy to focus on foreground learning. To support interactive prediction, we contribute an Object-Oriented Decoder that can receive both visual and language prompts to predict the referring target. Experimental results show that our method quantitatively and qualitatively outperforms SOTA methods.

[246] Conditional Random Fields for Interactive Refinement of Histopathological Predictions

Tiffanie Godelaine,Maxime Zanella,Karim El Khoury,Saïd Mahmoudi,Benoît Macq,Christophe De Vleeschouwer

Main category: cs.CV

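TL;DR: Proposes HistoCRF, a conditional random field (CRF)-based framework that refines the zero-shot predictions of vision-language models on histopathology images; it improves classification accuracy by promoting label diversity and using expert annotations, requires no additional training, and clearly outperforms baselines on several datasets.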

Details Motivation: Vision-language models show promise for histopathology image analysis, but their zero-shot predictions remain imperfect, so a method that improves accuracy without retraining is needed. Method: Proposes the HistoCRF framework, which modifies the pairwise potential of the conditional random field to promote label diversity and incorporate expert annotations; it supports three settings: no annotations, one-off expert annotations, and iterative human-in-the-loop annotations. Result: On five histopathology datasets covering different organs and diseases, average accuracy improves over zero-shot prediction by 16.0% (no annotations) and 27.5% (only 100 annotations), and the gain rises to 32.6% with a human in the loop. Conclusion: HistoCRF effectively improves VLM performance on histopathology image classification without model retraining, achieves substantial gains with only a few expert annotations, and shows potential for clinical use. Abstract: Assisting pathologists in the analysis of histopathological images has high clinical value, as it supports cancer detection and staging. In this context, histology foundation models have recently emerged. Among them, Vision-Language Models (VLMs) provide strong yet imperfect zero-shot predictions. We propose to refine these predictions by adapting Conditional Random Fields (CRFs) to histopathological applications, requiring no additional model training. We present HistoCRF, a CRF-based framework, with a novel definition of the pairwise potential that promotes label diversity and leverages expert annotations. We consider three experiments: without annotations, with expert annotations, and with iterative human-in-the-loop annotations that progressively correct misclassified patches. Experiments on five patch-level classification datasets covering different organs and diseases demonstrate average accuracy gains of 16.0% without annotations and 27.5% with only 100 annotations, compared to zero-shot predictions. Moreover, integrating a human in the loop reaches a further gain of 32.6% with the same number of annotations. The code will be made available on https://github.com/tgodelaine/HistoCRF.

[247] Detecting 3D Line Segments for 6DoF Pose Estimation with Limited Data

Matej Mok,Lukáš Gajdošech,Michal Mesároš,Martin Madaras,Viktor Kocur

Main category: cs.CV

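TL;DR: Proposes a new method for 6DoF pose estimation of bins in industrial settings that exploits their cuboid geometry: it detects 3D line segments along the top edges and applies geometric processing to obtain accurate poses, needs no instance-specific CAD models, and outperforms state-of-the-art methods on real scans.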

Details Motivation: Conventional deep learning methods require large amounts of training data or CAD models, which makes them hard to apply in real industrial settings where data is scarce and objects vary. Method: Extends the 2D line segment detection network LeTR to structured point cloud data, detects the 3D line segments that correspond to the top edges of the bin, and computes the 6DoF pose with a simple geometric procedure. Result: Reaches 3 cm translation error and 8.2° rotation error on real scans, significantly better than current state-of-the-art methods, and confirms that synthetic data improves performance. Conclusion: The method needs no instance-specific CAD models, is robust and practical, and is suitable for pose estimation of bin-like objects in industrial environments. Abstract: The task of 6DoF object pose estimation is one of the fundamental problems of 3D vision with many practical applications such as industrial automation. Traditional deep learning approaches for this task often require extensive training data or CAD models, limiting their application in real-world industrial settings where data is scarce and object instances vary. We propose a novel method for 6DoF pose estimation focused specifically on bins used in industrial settings. We exploit the cuboid geometry of bins by first detecting intermediate 3D line segments corresponding to their top edges. Our approach extends the 2D line segment detection network LeTR to operate on structured point cloud data. The detected 3D line segments are then processed using a simple geometric procedure to robustly determine the bin's 6DoF pose. To evaluate our method, we extend an existing dataset with a newly collected and annotated dataset, which we make publicly available. We show that incorporating synthetic training data significantly improves pose estimation accuracy on real scans. Moreover, we show that our method significantly outperforms current state-of-the-art 6DoF pose estimation methods in terms of the pose accuracy (3 cm translation error, 8.2$^\circ$ rotation error) while not requiring instance-specific CAD models during inference.

[248] Energy-Aware Ensemble Learning for Coffee Leaf Disease Classification

Larissa Ferreira Rodrigues Moreira,Rodrigo Moreira,Leonardo Gabriel Ferreira Rodrigues

Main category: cs.CV

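TL;DR: Through knowledge distillation and ensemble learning, knowledge from high-capacity CNNs is transferred to lightweight CNNs, enabling efficient, low-energy coffee leaf disease diagnosis on resource-constrained devices.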

Details Motivation: Coffee yields depend on timely and accurate disease diagnosis, but assessing leaf diseases in the field is hampered by limited device resources and unstable network connectivity, which restricts the use of AI models. Method: Uses knowledge distillation to transfer knowledge from high-capacity convolutional neural networks (CNNs) trained in data centers to compact CNNs through ensemble learning (EL), and integrates dense tiny model pairs through optimized ensembling to improve accuracy under compute and energy constraints. Result: On a curated coffee leaf dataset, the distilled tiny ensembles perform on par with prior work while substantially reducing energy consumption and carbon footprint. Conclusion: Lightweight models that are properly distilled and ensembled can provide practical diagnostic solutions for Internet of Things (IoT) applications and support sustainable on-site disease detection. Abstract: Coffee yields are contingent on the timely and accurate diagnosis of diseases; however, assessing leaf diseases in the field presents significant challenges. Although Artificial Intelligence (AI) vision models achieve high accuracy, their adoption is hindered by the limitations of constrained devices and intermittent connectivity. This study aims to facilitate sustainable on-device diagnosis through knowledge distillation: high-capacity Convolutional Neural Networks (CNNs) trained in data centers transfer knowledge to compact CNNs through Ensemble Learning (EL). Furthermore, dense tiny pairs were integrated through simple and optimized ensembling to enhance accuracy while adhering to strict computational and energy constraints. On a curated coffee leaf dataset, distilled tiny ensembles achieved performance competitive with prior work with significantly reduced energy consumption and carbon footprint. This indicates that lightweight models, when properly distilled and ensembled, can provide practical diagnostic solutions for Internet of Things (IoT) applications.
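To make the distill-then-ensemble idea concrete, here is a minimal sketch under stated assumptions: it uses the common temperature-softened KL distillation loss and plain probability averaging for "simple ensembling". The abstract does not specify the loss, the temperature, or the ensembling rule, so `distillation_loss`, `ensemble_probs`, and the default temperature are illustrative choices rather than the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: the standard soft-target formulation of distillation.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def ensemble_probs(list_of_logits):
    """Assumed 'simple ensembling': average class probabilities of the models."""
    probs = [softmax(logits) for logits in list_of_logits]
    return [sum(col) / len(probs) for col in zip(*probs)]


# Toy logits for a 3-class problem (made-up numbers).
teacher = [4.0, 1.0, 0.5]
student_a = [2.5, 1.2, 0.3]
student_b = [3.0, 0.4, 0.9]
loss_a = distillation_loss(student_a, teacher)
combined = ensemble_probs([student_a, student_b])
```

In this toy setup the two small students are trained against the teacher's soft targets and then averaged at inference time, which keeps per-model cost low on constrained devices.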

[249] RCDN: Real-Centered Detection Network for Robust Face Forgery Identification

Wyatt McCurdy,Xin Zhang,Yuqi Song,Min Gao

Main category: cs.CV

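TL;DR: Proposes RCDN, a real-centered detection network that improves the robustness of cross-domain image forgery detection by emphasizing the consistency of real faces, and shows strong performance and stability on the DiFF dataset.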

Details Motivation: Existing forgery detection methods degrade substantially in cross-domain settings and struggle with continually emerging forgery techniques, so detectors with better generalization are needed. Method: Proposes the Real-Centered Detection Network (RCDN), an Xception-based CNN framework that combines frequency-domain and spatial-domain features and uses a dual-branch structure with a real-centered loss to anchor the representation space on real images, which improves robustness to distribution shift. Result: Experiments on the DiFF dataset show that RCDN reaches state-of-the-art in-domain accuracy on three representative forgery types (FE, I2I, T2I), significantly improves cross-domain generalization, narrows the generalization gap relative to baselines, and attains the highest cross/in-domain stability ratio. Conclusion: By focusing on the consistency of real images rather than on forgery patterns, RCDN adapts better to unseen forgery techniques and shows potential for practical use. Abstract: Image forgery has become a critical threat with the rapid proliferation of AI-based generation tools, which make it increasingly easy to synthesize realistic but fraudulent facial content. Existing detection methods achieve near-perfect performance when training and testing are conducted within the same domain, yet their effectiveness deteriorates substantially in cross-domain scenarios. This limitation is problematic, as new forgery techniques continuously emerge and detectors must remain reliable against unseen manipulations. To address this challenge, we propose the Real-Centered Detection Network (RCDN), a frequency-spatial convolutional neural network (CNN) framework with an Xception backbone that anchors its representation space around authentic facial images. Instead of modeling the diverse and evolving patterns of forgeries, RCDN emphasizes the consistency of real images, leveraging a dual-branch architecture and a real-centered loss design to enhance robustness under distribution shifts. Extensive experiments on the DiFF dataset, focusing on three representative forgery types (FE, I2I, T2I), demonstrate that RCDN achieves both state-of-the-art in-domain accuracy and significantly stronger cross-domain generalization. Notably, RCDN reduces the generalization gap compared to leading baselines and achieves the highest cross/in-domain stability ratio, highlighting its potential as a practical solution for defending against evolving and unseen image forgery techniques.

[250] CARLA-Round: A Multi-Factor Simulation Dataset for Roundabout Trajectory Prediction

Xiaotong Zhou,Zhenhui Yuan,Yi Han,Tianhua Xu,Laurence T. Yang

Main category: cs.CV

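TL;DR: Presents CARLA-Round, a systematically designed simulation dataset for vehicle trajectory prediction at roundabouts; its structured design controls weather and traffic density so that the factors affecting prediction performance can be analyzed precisely, and it demonstrates effective sim-to-real transfer.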

Details Motivation: Trajectory prediction at roundabouts is difficult because there are no traffic signals, interactions are complex, and real-world observations are incomplete; existing datasets cannot isolate the key influencing factors, so a controllable and realistic simulation dataset is needed. Method: Designs 25 controlled scenarios in the CARLA simulator by systematically combining five weather conditions with five traffic density levels (Level-of-Service A-E), generates multimodal trajectory data with rich driving behaviors and explicit annotations, and runs validation experiments with baseline models such as LSTM, GCN, and GRU+GCN. Result: Traffic density has a strong monotonic effect on prediction difficulty, whereas the effect of weather is non-linear; the best model reaches an ADE of 0.312 m on the real-world rounD dataset, indicating good sim-to-real transfer. Conclusion: Through its structured design, CARLA-Round allows the factors that affect trajectory prediction to be analyzed separately, compensates for the confounding in real-world data, and provides a reliable benchmark for evaluating and improving algorithms in roundabout scenarios. Abstract: Accurate trajectory prediction of vehicles at roundabouts is critical for reducing traffic accidents, yet it remains highly challenging due to their circular road geometry, continuous merging and yielding interactions, and absence of traffic signals. Developing accurate prediction algorithms relies on reliable, multimodal, and realistic datasets; however, such datasets for roundabout scenarios are scarce, as real-world data collection is often limited by incomplete observations and entangled factors that are difficult to isolate. We present CARLA-Round, a systematically designed simulation dataset for roundabout trajectory prediction. The dataset varies weather conditions (five types) and traffic density levels (spanning Level-of-Service A-E) in a structured manner, resulting in 25 controlled scenarios. Each scenario incorporates realistic mixtures of driving behaviors and provides explicit annotations that are largely absent from existing datasets. Unlike randomly sampled simulation data, this structured design enables precise analysis of how different conditions influence trajectory prediction performance. Validation experiments using standard baselines (LSTM, GCN, GRU+GCN) reveal that traffic density dominates prediction difficulty with strong monotonic effects, while weather shows non-linear impacts. The best model achieves 0.312 m ADE on the real-world rounD dataset, demonstrating effective sim-to-real transfer. This systematic approach quantifies factor impacts that are impossible to isolate in confounded real-world datasets. 
Our CARLA-Round dataset is available at https://github.com/Rebecca689/CARLA-Round.
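For readers unfamiliar with the ADE figure quoted above, the following sketch computes the average displacement error as it is usually defined: the mean Euclidean distance between predicted and ground-truth positions over the prediction horizon. The function name and the toy trajectories are illustrative assumptions, and the abstract does not state the horizon or sampling rate used in the evaluation.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matched 2D points, in input units."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(distances) / len(distances)


# Toy trajectories in meters (made-up numbers, three future time steps).
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.6)]
ade = average_displacement_error(pred, truth)
```

With these toy values the per-step errors are 0, 0.3, and 0.6 m, so the ADE is their mean.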

[251] Segment and Matte Anything in a Unified Model

Zezhong Fan,Xiaohan Li,Topojoy Biswas,Kaushiki Nag,Kannan Achan

Main category: cs.CV

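TL;DR: Presents SAMA, a lightweight extension of SAM that performs high-quality interactive image segmentation and matting at the same time; a multi-view localization encoder and a local adapter sharpen boundary details, and the model reaches state-of-the-art performance on multiple benchmarks.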

Details Motivation: Although SAM has advanced zero-shot segmentation, its mask accuracy is still insufficient and its use for interactive image matting has not been explored; the strong correlation between segmentation and matting motivates a unified model. Method: Proposes SAMA, which includes a Multi-View Localization Encoder (MVLE) that extracts detailed local features, a Local-Adapter that recovers boundary details, and two prediction heads that output segmentation and matting results respectively, trained jointly end to end. Result: SAMA achieves state-of-the-art performance on segmentation and matting across multiple public datasets, with fine boundary detail and good adaptability to different user prompts. Conclusion: SAMA unifies image segmentation and matting in one lightweight framework, markedly improves mask quality with few added parameters, and shows potential for practical applications. Abstract: Segment Anything (SAM) has recently pushed the boundaries of segmentation by demonstrating zero-shot generalization and flexible prompting after training on over one billion masks. Despite this, its mask prediction accuracy often falls short of the precision required in real-world applications. While several refinement modules have been proposed to boost SAM's segmentation quality, achieving highly accurate object delineation within a single, unified framework remains an open challenge. Furthermore, interactive image matting, which aims to generate fine-grained alpha mattes guided by diverse user hints, has not yet been explored in the context of SAM. Insights from recent studies highlight strong correlations between segmentation and matting, suggesting the feasibility of a unified model capable of both tasks. In this paper, we introduce Segment And Matte Anything (SAMA), a lightweight extension of SAM that delivers high-quality interactive image segmentation and matting with minimal extra parameters. Our Multi-View Localization Encoder (MVLE) captures detailed features from local views, while the Localization Adapter (Local-Adapter) refines mask outputs by recovering subtle boundary details. We also incorporate two prediction heads for each task into the architecture to generate segmentation and matting masks simultaneously. Trained on a diverse dataset aggregated from publicly available sources, SAMA achieves state-of-the-art performance across multiple segmentation and matting benchmarks, showcasing its adaptability and effectiveness in a wide range of downstream tasks.

[252] Principal Component Analysis-Based Terahertz Self-Supervised Denoising and Deblurring Deep Neural Networks

Pengfei Zhu,Xavier Maldague

Main category: cs.CV

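TL;DR: Proposes THz-SSDD, a PCA-based self-supervised denoising and deblurring network that addresses low-frequency blur and high-frequency noise in terahertz images; it is trained on only a small set of unlabeled noisy images and restores images effectively across a variety of samples.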

Details Motivation: The frequency-dependent degradation inherent in terahertz systems causes low-frequency blur and high-frequency noise; conventional methods cannot handle both problems at once, and because the boundary between denoising and deblurring is unclear, manual intervention is needed. Method: Uses a Recorrupted-to-Recorrupted self-supervised learning strategy that exploits invariance under repeated corruption to capture the intrinsic features of the noise, combined with PCA decomposition and reconstruction to restore images across the full frequency range. Result: THz-SSDD is validated on four types of samples; tests show good denoising and deblurring across different material properties and measurement modes, with clearly improved image quality and preserved physical characteristics of the original signals. Conclusion: THz-SSDD handles terahertz image denoising and deblurring in a unified way without large amounts of labeled data, and shows good generalization and application potential. Abstract: Terahertz (THz) systems inherently introduce frequency-dependent degradation effects, resulting in low-frequency blurring and high-frequency noise in amplitude images. Conventional image processing techniques cannot simultaneously address both issues, and manual intervention is often required due to the unknown boundary between denoising and deblurring. To tackle this challenge, we propose a principal component analysis (PCA)-based THz self-supervised denoising and deblurring network (THz-SSDD). The network employs a Recorrupted-to-Recorrupted self-supervised learning strategy to capture the intrinsic features of noise by exploiting invariance under repeated corruption. PCA decomposition and reconstruction are then applied to restore images across both low and high frequencies. The performance of the THz-SSDD network was evaluated on four types of samples. Training requires only a small set of unlabeled noisy images, and testing across samples with different material properties and measurement modes demonstrates effective denoising and deblurring. Quantitative analysis further validates the network feasibility, showing improvements in image quality while preserving the physical characteristics of the original signals.

[253] Enhanced Diagnostic Performance via Large-Resolution Inference Optimization for Pathology Foundation Models

Mengxuan Hu,Zihan Guan,John Kang,Sheng Li,Zhongliang Zhou

Main category: cs.CV

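TL;DR: Proposes a space- and time-efficient inference strategy that sparsifies attention using spatially aware neighboring blocks and filters out non-informative tokens with global attention scores, which substantially reduces GPU memory and runtime for high-resolution whole-slide image (WSI) inference while preserving or even improving downstream performance.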

Details Motivation: Existing pathology foundation models are usually limited to a fixed input size (e.g., 224x224) and are inefficient on whole-slide images (WSIs), whose resolution runs into the thousands; enlarging the input leads to excessive GPU memory use, while downsampling loses critical morphological detail. Method: Proposes an efficient inference strategy that 1) sparsifies the attention mechanism using spatially aware neighboring blocks and 2) filters out non-informative tokens using global attention scores, which reduces computation and memory overhead. Result: The method improves ROI classification by up to 7.67% and performs comparably on segmentation, while substantially reducing GPU memory use and inference time. Conclusion: The method enables higher-resolution inference under the same GPU budget, addresses the efficiency-accuracy trade-off in WSI analysis, and improves the processing of high-resolution pathology images. Abstract: Despite their prominent performance on tasks such as ROI classification and segmentation, many pathology foundation models remain constrained by a specific input size (e.g., 224 x 224), creating substantial inefficiencies when applied to whole-slide images (WSIs), which span thousands of resolutions. A naive strategy is to either enlarge inputs or downsample the WSIs. However, enlarging inputs results in prohibitive GPU memory consumption, while downsampling alters the microns-per-pixel resolution and obscures critical morphological details. To overcome these limitations, we propose a space- and time-efficient inference strategy that sparsifies attention using spatially aware neighboring blocks and filters out non-informative tokens through global attention scores. This design substantially reduces GPU memory and runtime during high-resolution WSI inference while preserving and even improving the downstream performance, enabling inference at higher resolutions under the same GPU budget. The experimental results show that our method achieves up to a 7.67% improvement in ROI classification and comparable results in segmentation.
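The two ingredients named in the abstract can be sketched in a few lines. This is only a schematic reading, not the paper's algorithm: `keep_informative_tokens` keeps the highest-scoring fraction of tokens, and `can_attend` restricts attention to tokens in the same or an adjacent spatial block. The `keep_ratio`, the block size, and the row-major token layout are all hypothetical parameters.

```python
import math


def keep_informative_tokens(scores, keep_ratio=0.5):
    """Return indices of the top-scoring tokens, in their original order."""
    k = max(1, math.ceil(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def can_attend(i, j, grid_width, block=2):
    """True if tokens i and j (row-major grid indices) lie in the same or
    an adjacent block of `block` x `block` tokens."""
    block_i = ((i // grid_width) // block, (i % grid_width) // block)
    block_j = ((j // grid_width) // block, (j % grid_width) // block)
    return (abs(block_i[0] - block_j[0]) <= 1
            and abs(block_i[1] - block_j[1]) <= 1)


# Toy global attention scores for four tokens (made-up numbers).
scores = [0.10, 0.90, 0.05, 0.70]
kept = keep_informative_tokens(scores)
```

Dropping low-score tokens shrinks the sequence, and the block-neighborhood rule makes the attention pattern sparse; together these are the two sources of memory and runtime savings the abstract describes.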

[254] Inverse Rendering for High-Genus 3D Surface Meshes from Multi-view Images with Persistent Homology Priors

Xiang Gao,Xinmu Wang,Yuanpeng Liu,Yue Wang,Junqi Huang,Wei Chen,Xianfeng Gu

Main category: cs.CV

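TL;DR: Proposes a collaborative inverse rendering method with persistent homology priors for reconstructing high-genus 3D objects from multi-view images; topological constraints (such as tunnel loops and handle loops) reduce geometric, appearance, and topological ambiguity and prevent topological collapse, and gradient-based mesh optimization without neural networks yields more accurate and robust reconstructions.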

Details Motivation: 3D reconstruction is inherently ill-posed, with ambiguities in geometry, appearance, and topology, and high-genus surfaces are especially hard to reconstruct; existing methods are prone to topological failures such as collapsing tunnels or losing high-genus structure. Method: Within a mesh-based inverse rendering framework, uses gradient-based optimization that combines photometric consistency (from multi-view images) with persistent homology priors (which model key topological features such as tunnel loops and handle loops), without relying on neural networks. Result: Compared with state-of-the-art mesh-based methods, achieves lower Chamfer Distance (CD) and higher Volume IoU, indicating higher geometric accuracy and greater robustness to topological failure. Conclusion: Persistent homology priors effectively guide high-genus surface reconstruction, confirm the key role of explicit topological constraints in inverse rendering, and open a path toward interpretable, robust 3D reconstruction without neural networks. Abstract: Reconstructing 3D objects from images is inherently an ill-posed problem due to ambiguities in geometry, appearance, and topology. This paper introduces collaborative inverse rendering with persistent homology priors, a novel strategy that leverages topological constraints to resolve these ambiguities. By incorporating priors that capture critical features such as tunnel loops and handle loops, our approach directly addresses the difficulty of reconstructing high-genus surfaces. The collaboration between photometric consistency from multi-view images and homology-based guidance enables recovery of complex high-genus geometry while circumventing catastrophic failures such as collapsing tunnels or losing high-genus structure. Instead of neural networks, our method relies on gradient-based optimization within a mesh-based inverse rendering framework to highlight the role of topological priors. Experimental results show that incorporating persistent homology priors leads to lower Chamfer Distance (CD) and higher Volume IoU compared to state-of-the-art mesh-based methods, demonstrating improved geometric accuracy and robustness against topological failure.
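The two evaluation metrics named in the abstract can be illustrated as follows. Chamfer Distance has several variants in the literature (squared or unsquared distances, sum or mean); this sketch assumes the symmetric mean of unsquared nearest-neighbor distances, and it computes Volume IoU over sets of occupied voxel coordinates. Both choices, and the function names, are assumptions, since the abstract does not state which variants the authors use.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed variant: mean unsquared nearest-neighbor distance, summed over
    both directions. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def volume_iou(voxels_a, voxels_b):
    """Intersection over union of two sets of occupied voxel coordinates."""
    union = voxels_a | voxels_b
    return len(voxels_a & voxels_b) / len(union) if union else 1.0


# Toy data (made-up numbers).
cd = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
iou = volume_iou({(0, 0, 0), (1, 0, 0)}, {(1, 0, 0), (2, 0, 0)})
```

Lower Chamfer Distance means the reconstructed surface lies closer to the reference, while higher Volume IoU means the enclosed volumes overlap more, which is why a collapsed tunnel tends to hurt IoU.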

[255] VIRTUE: Versatile Video Retrieval Through Unified Embeddings

Shaunak Halbe,Bhagyashree Puranik,Jayakrishnan Unnikrishnan,Kushan Thakkar,Vimal Bhat,Toufiq Parag

Main category: cs.CV

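TL;DR: Presents VIRTUE, a versatile video retrieval framework based on a multimodal large language model (MLLM) that supports corpus-level retrieval, fine-grained moment localization, and composed multimodal queries in one architecture, with strong zero-shot performance.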

Details Motivation: Existing video retrieval systems face a trade-off between specialized architectures and support for multimodal queries: specialized models retrieve well but cannot handle complex queries, while MLLM-based methods support multimodal queries but retrieve less accurately. This work aims to build a unified framework that combines high retrieval performance with flexible querying. Method: VIRTUE uses a shared MLLM backbone to generate visual and textual embeddings and aligns them contrastively for efficient embedding-based candidate search; the embedding model is trained efficiently with LoRA on 700K paired samples and can be transferred to moment retrieval and composed queries without further training; reranking improves performance further. Result: The model surpasses other MLLM-based methods on zero-shot video retrieval and reaches a competitive or state-of-the-art level on zero-shot moment retrieval and composed video retrieval; after reranking training, its performance is comparable to specialized models trained on much larger datasets. Conclusion: VIRTUE combines the high performance of specialized retrieval models with the multimodal flexibility of MLLMs, achieving efficient video retrieval within a single versatile architecture. Abstract: Modern video retrieval systems are expected to handle diverse tasks ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they lack the ability to process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search but their retrieval performance remains well below that of specialized systems. We present VIRTUE, an MLLM-based versatile video retrieval framework that integrates corpus and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone to facilitate efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval, and state-of-the-art results for zero-shot composed video retrieval. With additional training for reranking candidates identified in the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state-of-the-art specialized models which are trained on orders of magnitude larger data.
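The contrastive alignment and embedding-based candidate search described above can be sketched with a one-directional InfoNCE loss and a cosine-similarity ranking. This is a generic illustration: the temperature value, the text-to-video direction, and the function names are assumptions, and real embeddings would come from the MLLM backbone rather than the toy 2D vectors used here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_matrix(queries, items):
    """Cosine similarity between every query and every item."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in items]
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]


def info_nce(text_emb, video_emb, temperature=0.07):
    """Text-to-video InfoNCE: the i-th text should match the i-th video."""
    sims = cosine_matrix(text_emb, video_emb)
    loss = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_denominator - logits[i]
    return loss / len(sims)


def rank_videos(query_emb, video_emb):
    """Candidate search: video indices sorted by descending similarity."""
    sims = cosine_matrix([query_emb], video_emb)[0]
    return sorted(range(len(sims)), key=lambda j: -sims[j])


# Toy embeddings (made-up numbers): text i pairs with video i.
texts = [[1.0, 0.0], [0.0, 1.0]]
videos = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(texts, videos)
shuffled_loss = info_nce(texts, videos[::-1])
ranking = rank_videos([1.0, 0.0], videos)
```

Training drives the loss down by pulling paired text and video embeddings together, after which retrieval reduces to a nearest-neighbor search over precomputed video embeddings.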

[256] Where It Moves, It Matters: Referring Surgical Instrument Segmentation via Motion

Meng Wei,Kun Yuan,Shi Li,Yue Zhou,Long Bai,Nassir Navab,Hongliang Ren,Hong Joo Lee,Tom Vercauteren,Nicolas Padoy

Main category: cs.CV

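TL;DR: Proposes SurgRef, a motion-guided framework that localizes and segments instruments in surgical videos from natural language by capturing how tools move and interact over time, overcoming the reliance of existing methods on static visual cues and predefined names.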

Details Motivation: Existing referring segmentation methods for surgical videos rely on static visual features and fixed instrument names, so they struggle with occlusion, ambiguity, and non-standard terminology and generalize poorly across scenarios. Method: Proposes the SurgRef framework, which aligns language and vision using the dynamic motion patterns of instruments rather than their appearance; builds a new large-scale dataset, Ref-IMotion, with multi-center data, dense spatiotemporal annotations, and motion-centric language expressions for training and evaluation. Result: SurgRef achieves state-of-the-art accuracy and generalization across surgical scenarios, significantly outperforming methods that rely on static features and remaining robust under occlusion and inconsistent terminology. Conclusion: By introducing motion-guided semantic understanding and the high-quality Ref-IMotion dataset, SurgRef sets a new benchmark for language-driven surgical video segmentation and advances intelligent operating rooms and autonomous surgical robotic assistance. Abstract: Enabling intuitive, language-driven interaction with surgical scenes is a critical step toward intelligent operating rooms and autonomous surgical robotic assistance. However, the task of referring segmentation, localizing surgical instruments based on natural language descriptions, remains underexplored in surgical videos, with existing approaches struggling to generalize due to reliance on static visual cues and predefined instrument names. In this work, we introduce SurgRef, a novel motion-guided framework that grounds free-form language expressions in instrument motion, capturing how tools move and interact across time, rather than what they look like. This allows models to understand and segment instruments even under occlusion, ambiguity, or unfamiliar terminology. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense spatiotemporal masks and rich motion-centric expressions. SurgRef achieves state-of-the-art accuracy and generalization across surgical procedures, setting a new benchmark for robust, language-driven surgical video segmentation.

[257] DiffusionQC: Artifact Detection in Histopathology via Diffusion Model

Zhenzhen Wang,Zhongliang Zhou,Zhuoyu Wen,Jeong Hwan Kook,John B Wojcik,John Kang

Main category: cs.CV

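TL;DR: Proposes DiffusionQC, an unsupervised diffusion-model-based method for detecting artifacts in pathology images that trains only on clean images, needs no pixel-level annotations or predefined artifact types, and adds contrastive learning to widen the distribution separation between artifact and normal regions, yielding superior performance and cross-stain generalization.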

Details Motivation: Preparation and digitization artifacts in pathology images undermine diagnostic reliability; traditional supervised methods depend on large annotated datasets and generalize poorly to new artifact types, so a method is needed that avoids dense annotation and can detect unknown artifacts. Method: Trains a diffusion model on clean images only and detects artifacts as anomalies; adds a contrastive learning module to enlarge the distribution gap between artifact and clean images and improve detection. Result: Outperforms existing state-of-the-art methods in experiments, shows cross-stain generalization, and requires significantly less data and annotation. Conclusion: DiffusionQC is an efficient, low-annotation quality-control method for pathology images that can detect unknown artifact types and shows good potential for clinical use. Abstract: Digital pathology plays a vital role across modern medicine, offering critical insights for disease diagnosis, prognosis, and treatment. However, histopathology images often contain artifacts introduced during slide preparation and digitization. Detecting and excluding them is essential to ensure reliable downstream analysis. Traditional supervised models typically require large annotated datasets, which is resource-intensive and not generalizable to novel artifact types. To address this, we propose DiffusionQC, which detects artifacts as outliers among clean images using a diffusion model. It requires only a set of clean images for training rather than pixel-level artifact annotations and predefined artifact types. Furthermore, we introduce a contrastive learning module to explicitly enlarge the distribution separation between artifact and clean images, yielding an enhanced version of our method. Empirical results demonstrate superior performance to state-of-the-art methods and cross-stain generalization capacity, with significantly less data and annotations.

[258] Less is More: Label-Guided Summarization of Procedural and Instructional Videos

Shreya Rajpal,Michal Golovanesky,Carsten Eickhoff

Main category: cs.CV

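TL;DR: Proposes PRISM, a three-stage framework that integrates semantic and multimodal analysis to produce semantically grounded video summaries, retaining 84% of semantic content while sampling fewer than 5% of frames and significantly outperforming baselines.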

Details Motivation: Existing video summarization methods fall short in semantic understanding and contextual coherence, and high-stakes domains such as surgical training need more precise, meaningful summaries. This work therefore aims to improve the semantic accuracy of summaries and their coverage of procedural content. Method: Proposes the PRISM framework with three stages: adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). The method combines vision-language models with multimodal analysis to select frames that reflect important procedural transitions and to filter out generic or hallucinated content. Result: Evaluation on instructional and activity datasets shows that, using fewer than 5% of the original frames, the summaries retain 84% of semantic content and improve over baselines by as much as 33% on some metrics, with good generalization across procedural and domain-specific tasks. Conclusion: By fusing semantic analysis with multimodal information, PRISM produces efficient, accurate, and contextually coherent video summaries suited to high-stakes or specialized settings that need high-quality summaries. Abstract: Video summarization helps turn long videos into clear, concise representations that are easier to review, document, and analyze, especially in high-stakes domains like surgical training. Prior work has progressed from using basic visual features like color, motion, and structural changes to using pre-trained vision-language models that can better understand what's happening in the video (semantics) and capture temporal flow, resulting in more context-aware video summarization. We propose a three-stage framework, PRISM: Procedural Representation via Integrated Semantic and Multimodal analysis, that produces semantically grounded video summaries. PRISM combines adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). Our method ensures that selected frames reflect meaningful and procedural transitions while filtering out generic or hallucinated content, resulting in contextually coherent summaries across both domain-specific and instructional videos. We evaluate our method on instructional and activity datasets, using reference summaries for instructional videos. Despite sampling fewer than 5% of the original frames, our summaries retain 84% semantic content while improving over baselines by as much as 33%. Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance with both semantic alignment and precision.

[259] An Innovative Framework for Breast Cancer Detection Using Pyramid Adaptive Atrous Convolution, Transformer Integration, and Multi-Scale Feature Fusion

Ehsan Sadeghi Pour,Mahdi Esmaeili,Morteza Romoozi

Main category: cs.CV

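TL;DR: Proposes a new framework for detecting malignant breast masses that combines PAAC with a Transformer architecture and, through multi-scale feature fusion and a hybrid loss function, reaches 98.5% accuracy on several public datasets, significantly outperforming existing models.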

Details Motivation: To improve the accuracy and efficiency of early breast cancer diagnosis and address the weak classification performance of traditional methods on complex images. Method: Combines Pyramid Adaptive Atrous Convolution (PAAC) with a Transformer architecture, uses a multi-scale feature fusion strategy, and trains the model with a joint Dice Loss and Focal Loss objective. Result: Achieves 98.5% accuracy, 97.8% sensitivity, 96.3% specificity, a 98.2% F1-score, and 97.9% precision on the INbreast, MIAS, and DDSM datasets, outperforming baselines such as BreastNet, DeepMammo, and Multi-Scale CNN. Conclusion: The proposed model shows high accuracy and robustness in detecting malignant breast masses and has potential as a clinical decision-support tool that can be integrated into medical imaging diagnostic systems. Abstract: Breast cancer is one of the most common cancers among women worldwide, and its accurate and timely diagnosis plays a critical role in improving treatment outcomes. This thesis presents an innovative framework for detecting malignant masses in mammographic images by integrating the Pyramid Adaptive Atrous Convolution (PAAC) and Transformer architectures. The proposed approach utilizes Multi-Scale Feature Fusion to enhance the extraction of features from benign and malignant tissues and combines Dice Loss and Focal Loss functions to improve the model's learning process, effectively reducing errors in binary breast cancer classification and achieving high accuracy and efficiency. In this study, a comprehensive dataset of breast cancer images from INbreast, MIAS, and DDSM was preprocessed through data augmentation and contrast enhancement and resized to 227x227 pixels for model training. Leveraging the Transformer's ability to manage long-range dependencies with Self-Attention mechanisms, the proposed model achieved high accuracy in detecting cancerous masses, outperforming foundational models such as BreastNet, DeepMammo, Multi-Scale CNN, Swin-Unet, and SegFormer. The final evaluation results for the proposed model include an accuracy of 98.5%, sensitivity of 97.8%, specificity of 96.3%, F1-score of 98.2%, and overall precision of 97.9%. These metrics demonstrate a significant improvement over traditional methods and confirm the model's effectiveness in identifying cancerous masses in complex scenarios and large datasets. This model shows potential as a reliable and efficient tool for breast cancer diagnosis and can be effectively integrated into medical diagnostic systems.
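
The abstract says Dice Loss and Focal Loss are combined but does not state how. The following is a minimal sketch of one common way to do so for binary predictions, as a weighted sum. The weighting `dice_weight`, the focal parameters `gamma` and `alpha`, and all function names are assumptions chosen for illustration, not values reported by the authors.

```python
import math

def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss: 1 - 2*sum(p*t) / (sum(p) + sum(t))."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss, averaged over samples.

    Easy examples (p_t close to 1) are down-weighted by (1 - p_t) ** gamma.
    """
    losses = []
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p          # probability of the true class
        a_t = alpha if t == 1 else 1.0 - alpha  # class-balancing weight
        losses.append(-a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps)))
    return sum(losses) / len(losses)

def combined_loss(probs, targets, dice_weight=0.5):
    """Weighted sum of Dice and focal loss (the weighting is an assumption)."""
    return (dice_weight * dice_loss(probs, targets)
            + (1.0 - dice_weight) * focal_loss(probs, targets))

good = combined_loss([0.9, 0.1, 0.8], [1, 0, 1])
bad = combined_loss([0.1, 0.9, 0.2], [1, 0, 1])
```

Here `good` is smaller than `bad`: the Dice term rewards overlap between predictions and labels, while the focal term concentrates the penalty on confidently wrong samples.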

[260] Federated Joint Learning for Domain and Class Generalization

Haoran Xu,Jiaze Li,Jianzhong Ju,Zhenbo Luo

Main category: cs.CV

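TL;DR: Proposes FedDCG, a federated learning method that jointly addresses class and domain generalization in vision-language models; using a domain grouping strategy, a learnable network, and a decoupling mechanism, it achieves better accuracy and robustness than existing methods on multiple datasets.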

Details Motivation: Existing efficient fine-tuning methods usually handle unseen classes or unseen domains in isolation and lack a joint framework for both, which is especially challenging in federated learning. Method: Proposes FedDCG, which introduces a domain grouping strategy and trains class-generalized networks within each group to avoid decision boundary confusion; uses a learnable network to strengthen class generalization and a decoupling mechanism to separate general from domain-specific knowledge; at inference, aggregates results based on domain similarity to achieve joint class and domain generalization. Result: Experiments on multiple datasets show that FedDCG outperforms current state-of-the-art baselines in accuracy and robustness. Conclusion: FedDCG offers an effective joint optimization framework for vision-language models in federated learning that improves generalization to both unseen classes and unseen domains. Abstract: Efficient fine-tuning of visual-language models like CLIP has become crucial due to their large-scale parameter size and extensive pretraining requirements. Existing methods typically address either the issue of unseen classes or unseen domains in isolation, without considering a joint framework for both. In this paper, we propose Federated Joint Learning for Domain and Class Generalization, termed FedDCG, a novel approach that addresses both class and domain generalization in federated learning settings. Our method introduces a domain grouping strategy where class-generalized networks are trained within each group to prevent decision boundary confusion. During inference, we aggregate class-generalized results based on domain similarity, effectively integrating knowledge from both class and domain generalization. Specifically, a learnable network is employed to enhance class generalization capabilities, and a decoupling mechanism separates general and domain-specific knowledge, improving generalization to unseen domains. Extensive experiments across various datasets show that FedDCG outperforms state-of-the-art baselines in terms of accuracy and robustness.

[261] Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy

Fadlullah Raji,John Murray-Bruce

Main category: cs.CV

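TL;DR: Proposes a non-line-of-sight imaging method that reconstructs a hidden scene in 3D from an ordinary photograph, formulates a separable non-linear least squares inverse problem, and offers two solutions: gradient-based optimization and a physics-inspired neural network (SSD).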

Details Motivation: Conventional imaging requires a line of sight, which is often unavailable, making non-line-of-sight (NLOS) imaging necessary. Existing passive NLOS methods are limited in resolution or can only localize objects of known shape; this work aims to overcome these limits and achieve high-resolution 3D reconstruction. Method: Proposes a new light transport model that decomposes the hidden scene into light-occluding and non-light-occluding components, yielding a separable non-linear least squares inverse problem; develops a gradient-based optimization method and a physics-inspired neural network called Soft Shadow diffusion (SSD). Result: Reconstructs multiple 3D scenes effectively in real experiments; SSD is trained in simulation yet generalizes to unseen classes and real NLOS scenes, and is strongly robust to noise and ambient illumination. Conclusion: The work extends passive NLOS imaging by achieving, for the first time, 3D scene reconstruction from an ordinary NLOS photograph, opening a path to perceiving hidden scenes without auxiliary equipment. Abstract: Conventional imaging requires a line of sight to create accurate visual representations of a scene. In certain circumstances, however, obtaining a suitable line of sight may be impractical, dangerous, or even impossible. Non-line-of-sight (NLOS) imaging addresses this challenge by reconstructing the scene from indirect measurements. Recently, passive NLOS methods that use an ordinary photograph of the subtle shadow cast onto a visible wall by the hidden scene have gained interest. These methods are currently limited to 1D or low-resolution 2D color imaging or to localizing a hidden object whose shape is approximately known. Here, we generalize this class of methods and demonstrate a 3D reconstruction of a hidden scene from an ordinary NLOS photograph. To achieve this, we propose a novel reformulation of the light transport model that conveniently decomposes the hidden scene into light-occluding and non-light-occluding components to yield a separable non-linear least squares (SNLLS) inverse problem. We develop two solutions: a gradient-based optimization method and a physics-inspired neural network approach, which we call Soft Shadow diffusion (SSD). Despite the challenging ill-conditioned inverse problem encountered here, our approaches are effective on numerous 3D scenes in real experimental scenarios. Moreover, SSD is trained in simulation but generalizes well to unseen classes in simulation and real-world NLOS scenes. SSD also shows surprising robustness to noise and ambient illumination.

Shahrzad Esmat,Mahdi Banisharif,Ali Jannesari

Main category: cs.CV

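TL;DR: Proposes AgenticPruner, a framework that uses large language models to prune neural networks under MAC-operation constraints, with three cooperating agents delivering efficient model compression at a controllable computational cost.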

Details Motivation: Existing pruning methods focus on parameter reduction and cannot precisely control computational cost, which leads to unpredictable inference latency in deployment, especially where strict MAC budgets must be met. Method: Proposes the AgenticPruner framework, comprising a Profiling Agent that analyzes model structure and MAC distribution, a Master Agent that coordinates the workflow, and an Analysis Agent based on Claude 3.5 Sonnet that learns optimal strategies from historical attempts via in-context learning; combined with the graph-based structural grouping of isomorphic pruning, this enables context-aware adaptive pruning. Result: Validated on ImageNet-1K: ResNet-50 and ResNet-101 gain 0.91% and 1.56% accuracy, respectively, while reducing MACs; ConvNeXt-Small achieves a 1.41x GPU speedup and a 45% parameter reduction; Vision Transformers reliably meet user-defined MAC tolerance bands (typically +1% to +5% overshoot, -5% to -15% undershoot). Conclusion: AgenticPruner effectively automates pruning under MAC constraints, balancing computational efficiency, accuracy, and deployment reliability across architectures and offering a more practical compression approach for resource-constrained settings. Abstract: Neural network pruning remains essential for deploying deep learning models on resource-constrained devices, yet existing approaches primarily target parameter reduction without directly controlling computational cost. This yields unpredictable inference latency in deployment scenarios where strict Multiply-Accumulate (MAC) operation budgets must be met. We propose AgenticPruner, a framework utilizing large language models to achieve MAC-constrained optimization through iterative strategy learning. Our approach coordinates three specialized agents: a Profiling Agent that analyzes model architecture and MAC distributions, a Master Agent that orchestrates the workflow with divergence monitoring, and an Analysis Agent powered by Claude 3.5 Sonnet that learns optimal strategies from historical attempts. Through in-context learning, the Analysis Agent improves convergence success rate from 48% to 71% compared to grid search. Building upon isomorphic pruning's graph-based structural grouping, our method adds context-aware adaptation by analyzing patterns across pruning iterations, enabling automatic convergence to target MAC budgets within user-defined tolerance bands. We validate our framework on ImageNet-1K across ResNet, ConvNeXt, and DeiT architectures. On CNNs, our approach achieves MAC targeting while maintaining or improving accuracy: ResNet-50 reaches 1.77G MACs with 77.04% accuracy (+0.91% vs baseline); ResNet-101 achieves 4.22G MACs with 78.94% accuracy (+1.56% vs baseline). For ConvNeXt-Small, pruning to 8.17G MACs yields 1.41x GPU and 1.07x CPU speedup with 45% parameter reduction. On Vision Transformers, we demonstrate MAC-budget compliance within user-defined tolerance bands (typically +1% to +5% overshoot, -5% to -15% undershoot), establishing feasibility for deployment scenarios requiring strict computational guarantees.
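
The notion of a user-defined tolerance band around a MAC budget can be made concrete with a small sketch. It assumes the band is expressed as relative overshoot and undershoot limits around a target; the function name, the default limits (picked from the upper ends of the ranges quoted in the abstract), and the target value are hypothetical and are not taken from the AgenticPruner code.

```python
def mac_band_status(achieved_macs, target_macs,
                    overshoot_tol=0.05, undershoot_tol=0.15):
    """Classify a pruned model's MAC count against a target budget.

    overshoot_tol and undershoot_tol are relative limits (0.05 == 5%).
    Returns "within_band", "over_budget", or "under_budget".
    """
    deviation = (achieved_macs - target_macs) / target_macs
    if deviation > overshoot_tol:
        return "over_budget"
    if deviation < -undershoot_tol:
        return "under_budget"
    return "within_band"

# Hypothetical 1.8G MAC target; 1.77G is the ResNet-50 figure from the abstract.
status = mac_band_status(achieved_macs=1.77e9, target_macs=1.8e9)
```

An iterative pruning loop could use such a check as its stopping condition, pruning less aggressively after an `"under_budget"` result and more aggressively after `"over_budget"`.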

[263] CytoCLIP: Learning Cytoarchitectural Characteristics in Developing Human Brain Using Contrastive Language Image Pre-Training

Pralaypati Ta,Sriram Venkatesaperumal,Keerthi Ram,Mohanasankar Sivaprakasam

Main category: cs.CV

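TL;DR: Proposes CytoCLIP, a vision-language joint representation learning approach for automatically identifying cytoarchitectural regions of brain tissue, significantly outperforming existing methods.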

Details Motivation: Manually delineating brain regions is time-consuming and depends on expert knowledge, so automated methods are needed to reduce human effort. Method: Builds CytoCLIP on pre-trained CLIP frameworks, with two variants for low-resolution whole-region images and high-resolution image tiles, trained on Nissl-stained sections of fetal brain tissue. Result: Experiments on data from different ages and sectioning planes show that CytoCLIP reaches an F1 score of 0.87 for whole-region classification and 0.91 for high-resolution tile classification, outperforming existing methods. Conclusion: CytoCLIP effectively learns multi-scale joint visual-text representations of brain cytoarchitecture, generalizes well, and can support automated brain region identification and analysis. Abstract: The functions of different regions of the human brain are closely linked to their distinct cytoarchitecture, which is defined by the spatial arrangement and morphology of the cells. Identifying brain regions by their cytoarchitecture enables various scientific analyses of the brain. However, delineating these areas manually in brain histological sections is time-consuming and requires specialized knowledge. An automated approach is necessary to minimize the effort needed from human experts. To address this, we propose CytoCLIP, a suite of vision-language models derived from pre-trained Contrastive Language-Image Pre-Training (CLIP) frameworks to learn joint visual-text representations of brain cytoarchitecture. CytoCLIP comprises two model variants: one is trained using low-resolution whole-region images to understand the overall cytoarchitectural pattern of an area, and the other is trained on high-resolution image tiles for detailed cellular-level representation. The training dataset is created from Nissl-stained histological sections of developing fetal brains of different gestational weeks. It includes 86 distinct regions for low-resolution images and 384 brain regions for high-resolution tiles. We evaluate the model's understanding of the cytoarchitecture and generalization ability using region classification and cross-modal retrieval tasks. Multiple experiments are performed under various data setups, including data from samples of different ages and sectioning planes. Experimental results demonstrate that CytoCLIP outperforms existing methods. It achieves an F1 score of 0.87 for whole-region classification and 0.91 for high-resolution image tile classification.

[264] SDiT: Semantic Region-Adaptive for Diffusion Transformers

Bowen Lin,Fanjiang Ye,Yihua Liu,Zhenghui Guo,Boyuan Zhang,Weijian Zheng,Yufan Xu,Tiancheng Xing,Yuke Wang,Chengming Zhang

Main category: cs.CV

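TL;DR: SDiT is a semantic region-adaptive Diffusion Transformer that allocates computation according to regional complexity, achieving up to 3.0x acceleration without retraining while keeping perceptual and semantic quality close to full-attention inference.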

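Details Motivation: Existing Diffusion Transformers excel at text-to-image synthesis but are computationally expensive, mainly because denoising is iterative and global attention has quadratic cost. The authors observe that denoising dynamics differ markedly across spatial regions: background regions converge quickly while edges and textured areas change more actively, which motivates allocating computation dynamically by regional complexity. Method: SDiT introduces a training-free framework that combines semantic-aware clustering via fast Quickshift-based segmentation, complexity-driven regional scheduling, and boundary-aware refinement, selectively updating informative regions while maintaining spatial coherence. Result: Without model retraining or architectural modification, SDiT achieves up to 3.0x acceleration, with perceptual quality and semantic fidelity nearly identical to full-attention inference. Conclusion: By exploiting the spatial non-uniformity of the denoising process, SDiT effectively reduces computational overhead and provides an efficient inference scheme for Diffusion Transformers with good practicality and room for extension. Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art performance in text-to-image synthesis but remain computationally expensive due to the iterative nature of denoising and the quadratic cost of global attention. In this work, we observe that denoising dynamics are spatially non-uniform: background regions converge rapidly while edges and textured areas evolve much more actively. Building on this insight, we propose SDiT, a Semantic Region-Adaptive Diffusion Transformer that allocates computation according to regional complexity. SDiT introduces a training-free framework combining (1) semantic-aware clustering via fast Quickshift-based segmentation, (2) complexity-driven regional scheduling to selectively update informative areas, and (3) boundary-aware refinement to maintain spatial coherence. Without any model retraining or architectural modification, SDiT achieves up to 3.0x acceleration while preserving nearly identical perceptual and semantic quality to full-attention inference.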

[265] LegacyAvatars: Volumetric Face Avatars For Traditional Graphics Pipelines

Safa C. Medin,Gengyan Li,Ziqian Bai,Ruofei Du,Leonhard Helminger,Yinda Zhang,Stephan J. Garbin,Philip L. Davidson,Gregory W. Wornell,Thabo Beeler,Abhimitra Meka

Main category: cs.CV

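TL;DR: Proposes a new representation based on radiance fields and parametric face models for efficient classical rendering of photorealistic 3D face avatars.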

Details Motivation: To achieve controllable volumetric rendering of complex facial features such as hair, skin, and eyes, while supporting efficient rendering and online streaming on traditional graphics platforms. Method: At enrollment, learns radiance manifolds in 3D space and extracts an explicit layered mesh along with appearance and warp textures; at deployment, renders through linear blending and alpha compositing over a static mesh. Result: Delivers high-quality, controllable face avatar rendering that runs on legacy platforms using traditional mesh and shader techniques, with no custom engineering required. Conclusion: The method combines the expressiveness of radiance fields with the efficiency of traditional rendering, offering a practical route to real-time applications of 3D face avatars. Abstract: We introduce a novel representation for efficient classical rendering of photorealistic 3D face avatars. Leveraging recent advances in radiance fields anchored to parametric face models, our approach achieves controllable volumetric rendering of complex facial features, including hair, skin, and eyes. At enrollment time, we learn a set of radiance manifolds in 3D space to extract an explicit layered mesh, along with appearance and warp textures. During deployment, this allows us to control and animate the face through simple linear blending and alpha compositing of textures over a static mesh. This explicit representation also enables the generated avatar to be efficiently streamed online and then rendered using classical mesh and shader-based rendering on legacy graphics platforms, eliminating the need for any custom engineering or integration.

[266] Concepts from Representations: Post-hoc Concept Bottleneck Models via Sparse Decomposition of Visual Representations

Shizhan Gong,Xiaofan Zhang,Qi Dou

Main category: cs.CV

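TL;DR: Proposes PCBM-ReD, a method that retrofits a concept bottleneck model onto pretrained black-box models through representation decomposition, improving both interpretability and performance on image classification.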

Details Motivation: Existing concept-based explanation methods suffer from unreliable concept relevance, reliance on non-visual or manually defined concepts, and overly strong assumptions about the model or data, which limits their use in critical domains. Method: PCBM-ReD automatically extracts visual concepts from a pretrained encoder, uses multimodal large language models (MLLMs) to label and filter the concepts, and selects an independent subset via reconstruction-guided optimization; leveraging CLIP's image-text alignment, it decomposes image representations into a linear combination of concept embeddings to build a post-hoc concept bottleneck model. Result: Experiments on 11 image classification tasks show that PCBM-ReD reaches state-of-the-art accuracy, markedly narrows the gap with end-to-end models, and is more interpretable. Conclusion: PCBM-ReD combines the performance of pretrained models with the interpretability of concept models, offering a viable path to trustworthy deployment of deep models in high-stakes domains. Abstract: Deep learning has achieved remarkable success in image recognition, yet its inherent opacity poses challenges for deployment in critical domains. Concept-based interpretations aim to address this by explaining model reasoning through human-understandable concepts. However, existing post-hoc methods and ante-hoc concept bottleneck models (CBMs) suffer from limitations such as unreliable concept relevance, non-visual or labor-intensive concept definitions, and model or data-agnostic assumptions. This paper introduces Post-hoc Concept Bottleneck Model via Representation Decomposition (PCBM-ReD), a novel pipeline that retrofits interpretability onto pretrained opaque models. PCBM-ReD automatically extracts visual concepts from a pre-trained encoder, employs multimodal large language models (MLLMs) to label and filter concepts based on visual identifiability and task relevance, and selects an independent subset via reconstruction-guided optimization. Leveraging CLIP's visual-text alignment, it decomposes image representations into a linear combination of concept embeddings to fit into the CBM abstraction. Extensive experiments across 11 image classification tasks show PCBM-ReD achieves state-of-the-art accuracy, narrows the performance gap with end-to-end models, and exhibits better interpretability.

[267] A Two-Stage Globally-Diverse Adversarial Attack for Vision-Language Pre-training Models

Wutao Chen,Huaqin Zou,Chen Wan,Lifeng Huang

Main category: cs.CV

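TL;DR: Proposes 2S-GDA, a two-stage globally-diverse attack framework that raises adversarial attack success rates against vision-language pre-training models in black-box settings, using diversified perturbation strategies in both the text and image modalities to significantly improve on existing methods.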

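Details Motivation: Existing multimodal adversarial attacks offer limited perturbation diversity and rely on unstable multi-stage pipelines, making them ineffective against vision-language pre-training models in black-box settings. Method: Proposes the 2S-GDA framework: the first stage achieves globally diverse textual perturbations through candidate text expansion and globally-aware replacement; the second stage generates image-level perturbations with multi-scale resizing and block-shuffle rotation to increase visual diversity, and can be combined with existing methods to improve transferability. Result: In extensive experiments on multiple VLP models, 2S-GDA improves black-box attack success rates by up to 11.17% and shows good modularity and compatibility. Conclusion: Through its two-stage, globally diverse perturbation strategy, 2S-GDA improves the quality and cross-model transferability of multimodal adversarial examples and offers a new approach to evaluating the robustness of VLP models. Abstract: Vision-language pre-training (VLP) models are vulnerable to adversarial examples, particularly in black-box scenarios. Existing multimodal attacks often suffer from limited perturbation diversity and unstable multi-stage pipelines. To address these challenges, we propose 2S-GDA, a two-stage globally-diverse attack framework. The proposed method first introduces textual perturbations through a globally-diverse strategy by combining candidate text expansion with globally-aware replacement. To enhance visual diversity, image-level perturbations are generated using multi-scale resizing and block-shuffle rotation. Extensive experiments on VLP models demonstrate that 2S-GDA consistently improves attack success rates over state-of-the-art methods, with gains of up to 11.17% in black-box settings. Our framework is modular and can be easily combined with existing methods to further enhance adversarial transferability.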

[268] Adaptive Multi-Scale Correlation Meta-Network for Few-Shot Remote Sensing Image Classification

Anurag Kaushish,Ayan Sar,Sampurna Roy,Sudeshna Chakraborty,Prashant Trivedi,Tanupriya Choudhury,Kanav Gupta

Main category: cs.CV

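TL;DR: Proposes the Adaptive Multi-Scale Correlation Meta-Network (AMC-MetaNet), a lightweight and efficient model that tackles data scarcity, domain shift, and multi-scale objects in few-shot learning for remote sensing images; using correlation-guided feature pyramids, an adaptive channel correlation module, and correlation-guided meta-learning, it reaches up to 86.65% accuracy on several remote sensing datasets with only about 600K parameters and fast inference.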

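Details Motivation: Few-shot learning for remote sensing images faces scarce labeled data, severe domain shift, and multi-scale ground objects; existing methods rely on large pre-trained models or Transformer architectures, which are computationally costly and adapt poorly to scale variation, so a lightweight, efficient framework that captures cross-scale correlations is needed. Method: Proposes AMC-MetaNet with three core components: correlation-guided feature pyramids that capture scale-invariant patterns; an adaptive channel correlation module (ACCM) that models dynamic cross-scale relationships; and a correlation-guided meta-learning mechanism that transfers knowledge through correlation patterns instead of conventional prototype averaging. The whole network is trained from scratch with only ~600K parameters. Result: On 5-way 5-shot classification across remote sensing datasets including EuroSAT, NWPU-RESISC45, UC Merced Land Use, and AID, AMC-MetaNet reaches up to 86.65% accuracy, with inference under 50 ms per image and a model 1/20 the size of ResNet-18, demonstrating strong efficiency and generalization. Conclusion: AMC-MetaNet is an efficient, lightweight, scale-aware few-shot learning framework for remote sensing that achieves strong performance without pre-training by modeling correlations, and suits resource-constrained real-world applications. Abstract: Few-shot learning in remote sensing remains challenging due to three factors: the scarcity of labeled data, substantial domain shifts, and the multi-scale nature of geospatial objects. To address these issues, we introduce Adaptive Multi-Scale Correlation Meta-Network (AMC-MetaNet), a lightweight yet powerful framework with three key innovations: (i) correlation-guided feature pyramids for capturing scale-invariant patterns, (ii) an adaptive channel correlation module (ACCM) for learning dynamic cross-scale relationships, and (iii) correlation-guided meta-learning that leverages correlation patterns instead of conventional prototype averaging. Unlike prior approaches that rely on heavy pre-trained models or transformers, AMC-MetaNet is trained from scratch with only $\sim600K$ parameters, offering $20\times$ fewer parameters than ResNet-18 while maintaining high efficiency ($<50$ms per image inference). AMC-MetaNet achieves up to 86.65% accuracy in 5-way 5-shot classification on various remote sensing datasets, including EuroSAT, NWPU-RESISC45, UC Merced Land Use, and AID. Our results establish AMC-MetaNet as a computationally efficient, scale-aware framework for real-world few-shot remote sensing.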
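
For context, the "conventional prototype averaging" that this abstract contrasts itself with can be sketched as follows. This shows the baseline technique only (class prototypes as mean support features, nearest-prototype classification), not AMC-MetaNet's correlation-guided variant, whose details the abstract does not give. The function names and toy features are illustrative assumptions.

```python
def class_prototypes(support):
    """Average the support feature vectors of each class.

    support: dict mapping a class label to a list of feature vectors.
    """
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [
            sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)
        ]
    return prototypes

def classify(query, prototypes):
    """Assign the query to the class with the nearest prototype."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))

# Toy 2-way 2-shot episode with made-up 2-D features.
support = {
    "forest": [[0.0, 0.0], [0.0, 2.0]],
    "river": [[10.0, 10.0], [12.0, 10.0]],
}
protos = class_prototypes(support)
prediction = classify([1.0, 1.0], protos)
```

In a 5-way 5-shot episode like those reported above, `support` would hold five classes with five feature vectors each.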

[269] CurConMix+: A Unified Spatio-Temporal Framework for Hierarchical Surgical Workflow Understanding

Yongjun Jeon,Jongmin Shin,Kanggil Park,Seonmin Park,Soyoung Lim,Jung Yong Kim,Jinsoo Rhu,Jongman Kim,Gyu-Seong Choi,Namkee Oh,Kyu-Hwan Jung

Main category: cs.CV

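TL;DR: Proposes CurConMix+, a framework for surgical action triplet recognition that combines curriculum-guided contrastive learning, a multi-resolution temporal Transformer, and the newly built LLS48 dataset to achieve fine-grained understanding of surgical behavior, with superior performance on multiple tasks and cross-level generalization.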

Details Motivation: Surgical action triplet recognition matters for workflow analysis and skill assessment but faces class imbalance, subtle visual differences, and semantic dependence among triplet components. Existing methods do not address these problems jointly, which limits holistic understanding. Method: Building on the CurConMix spatial representation framework, introduces a curriculum-guided contrastive learning strategy with structured hard-pair sampling and feature-level mixup; its temporal extension, CurConMix+, uses a Multi-Resolution Temporal Transformer (MRTT) that adaptively fuses multi-scale temporal features and dynamically balances spatio-temporal cues. Also introduces LLS48, a new benchmark dataset with hierarchical annotations for laparoscopic left lateral sectionectomy. Result: Experiments on CholecT45 and LLS48 show that CurConMix+ outperforms existing state-of-the-art methods on triplet recognition and generalizes strongly across levels, with its fine-grained features transferring effectively to surgical phase and step recognition. Conclusion: The CurConMix+ framework and the LLS48 dataset together provide a unified foundation for hierarchy-aware, reproducible, and interpretable surgical workflow understanding and advance surgical action recognition. Abstract: Surgical action triplet recognition aims to understand fine-grained surgical behaviors by modeling the interactions among instruments, actions, and anatomical targets. Despite its clinical importance for workflow analysis and skill assessment, progress has been hindered by severe class imbalance, subtle visual variations, and the semantic interdependence among triplet components. Existing approaches often address only a subset of these challenges rather than tackling them jointly, which limits their ability to form a holistic understanding. This study builds upon CurConMix, a spatial representation framework. At its core, a curriculum-guided contrastive learning strategy learns discriminative and progressively correlated features, further enhanced by structured hard-pair sampling and feature-level mixup. Its temporal extension, CurConMix+, integrates a Multi-Resolution Temporal Transformer (MRTT) that achieves robust, context-aware understanding by adaptively fusing multi-scale temporal features and dynamically balancing spatio-temporal cues. Furthermore, we introduce LLS48, a new, hierarchically annotated benchmark for complex laparoscopic left lateral sectionectomy, providing step-, task-, and action-level annotations. Extensive experiments on CholecT45 and LLS48 demonstrate that CurConMix+ not only outperforms state-of-the-art approaches in triplet recognition, but also exhibits strong cross-level generalization, as its fine-grained features effectively transfer to higher-level phase and step recognition tasks. Together, the framework and dataset provide a unified foundation for hierarchy-aware, reproducible, and interpretable surgical workflow understanding. The code and dataset will be publicly released on GitHub to facilitate reproducibility and further research.

[270] S^2F-Net: A Robust Spatial-Spectral Fusion Framework for Cross-Model AIGC Detection

Xiangyu Hu,Yicheng Hong,Hongchuang Zheng,Wenjun Zeng,Bingyao Liu

Main category: cs.CV

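TL;DR: Proposes S^2F-Net, a cross-model framework for detecting AI-generated images based on frequency-domain features; a learnable frequency attention module enhances discriminative frequency bands and significantly improves generalization to unseen generative models.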

Details Motivation: Existing detectors tend to overfit specific generative models and degrade sharply on unknown architectures, so detection schemes with strong generalization are urgently needed. Method: Proposes the S^2F-Net framework, which focuses on the inherent spectral discrepancies between real and synthetic textures; designs a learnable frequency attention module that combines spatial texture analysis with spectral dependencies to adaptively weight and enhance discriminative frequency bands. Result: On the AIGCDetectBenchmark, which covers 17 categories of generative models, S^2F-Net reaches 90.49% detection accuracy and significantly outperforms a range of baselines in cross-domain detection. Conclusion: Detecting frequency-domain fingerprints, especially those introduced by upsampling, is an effective way to improve generalization, and S^2F-Net confirms the value of spectral analysis for identifying AI-generated content. Abstract: The rapid development of generative models has imposed an urgent demand for detection schemes with strong generalization capabilities. However, existing detection methods generally suffer from overfitting to specific source models, leading to significant performance degradation when confronted with unseen generative architectures. To address these challenges, this paper proposes a cross-model detection framework called S^2F-Net, whose core lies in exploring and leveraging the inherent spectral discrepancies between real and synthetic textures. Considering that upsampling operations leave unique and distinguishable frequency fingerprints in both texture-poor and texture-rich regions, we focus our research on the detection of frequency-domain artifacts, aiming to fundamentally improve the generalization performance of the model. Specifically, we introduce a learnable frequency attention module that adaptively weights and enhances discriminative frequency bands by synergizing spatial texture analysis and spectral dependencies. On the AIGCDetectBenchmark, which includes 17 categories of generative models, S^2F-Net achieves a detection accuracy of 90.49%, significantly outperforming various existing baseline methods in cross-domain detection scenarios.

[271] GazeFormer-MoE: Context-Aware Gaze Estimation via CLIP and MoE Transformer

Xinyuan Zhao,Xianrui Chen,Ahmad Chaddad

Main category: cs.CV

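TL;DR: Proposes a semantics-modulated, multi-scale Transformer for 3D gaze estimation that enriches CLIP features with learnable prototype banks and incorporates a Mixture of Experts, setting a new state of the art on several datasets.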

Details Motivation: Existing 3D gaze estimation methods perform poorly under complex environmental variation and lack effective modeling of semantic factors such as illumination and pose. Method: Conditions CLIP global features on learnable prototype banks (such as illumination and head pose), fuses the prototype-enriched global vectors with CLIP patch tokens and high-resolution CNN tokens in a unified attention space, and replaces some FFN blocks with routed/shared Mixture of Experts (MoE) to increase model capacity. Result: Achieves angular errors of 2.49°, 3.22°, 10.16°, and 1.44° on MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze, respectively, an improvement of up to 64% relative to the previous best results. Conclusion: The proposed semantics-modulated multi-scale Transformer significantly improves 3D gaze estimation accuracy and confirms the effectiveness of prototype conditioning, cross-scale fusion, and the MoE structure. Abstract: We present a semantics-modulated, multi-scale Transformer for 3D gaze estimation. Our model conditions CLIP global features with learnable prototype banks (illumination, head pose, background, direction), fuses these prototype-enriched global vectors with CLIP patch tokens and high-resolution CNN tokens in a unified attention space, and replaces several FFN blocks with routed/shared Mixture of Experts to increase conditional capacity. Evaluated on MPIIFaceGaze, EYEDIAP, Gaze360 and ETH-XGaze, our model achieves new state-of-the-art angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, demonstrating up to a 64% relative improvement over previously reported results. Ablations attribute gains to prototype conditioning, cross-scale fusion, MoE, and hyperparameter settings. Our code is publicly available at https://github.com/AIPMLab/Gazeformer.
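
The angular errors quoted above are angles, in degrees, between predicted and ground-truth 3D gaze directions. A minimal sketch of the standard computation follows; the function name and the example vectors are illustrative and are not taken from the GazeFormer-MoE code.

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(t * t for t in target)))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

# Made-up prediction that is slightly off a straight-ahead gaze.
error = angular_error_deg([0.05, 0.0, -1.0], [0.0, 0.0, -1.0])
```

A dataset-level score such as those in the abstract would typically be the mean of this per-sample error over the test set.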

[272] Multi-Sensor Matching with HyperNetworks

Eli Passov,Nathan S. Netanyahu,Yosi Keller

Main category: cs.CV

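TL;DR: Proposes a lightweight hypernetwork-based descriptor-learning architecture that uses adaptive channel scaling and shifting together with conditional instance normalization to make multimodal patch matching more robust to cross-modal appearance changes while keeping inference efficient, reaching state-of-the-art performance on multiple benchmarks.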

Details Motivation: Multimodal image matching (for example, visible versus infrared) faces large appearance differences and domain shift, and existing methods struggle to be robust while keeping inference efficient. Method: Designs a Siamese CNN architecture with hypernetwork modules that perform adaptive per-channel scaling and shifting, combined with conditional instance normalization in shallow layers for modality-specific adaptation; trains with a triplet loss and hard-negative mining. Result: Achieves state-of-the-art results on VIS-NIR and other visible-infrared benchmarks and matches or surpasses prior methods on other datasets at lower inference cost; also releases GAP-VIR, a new cross-platform multimodal dataset for evaluating domain generalization. Conclusion: Combining hypernetworks with conditional normalization yields an efficient, robust solution for multimodal matching with high performance and low inference overhead, and the new dataset helps advance domain adaptation research. Abstract: Hypernetworks are models that generate or modulate the weights of another network. They provide a flexible mechanism for injecting context and task conditioning and have proven broadly useful across diverse applications without significant increases in model size. We leverage hypernetworks to improve multimodal patch matching by introducing a lightweight descriptor-learning architecture that augments a Siamese CNN with (i) hypernetwork modules that compute adaptive, per-channel scaling and shifting and (ii) conditional instance normalization that provides modality-specific adaptation (e.g., visible vs. infrared, VIS-IR) in shallow layers. This combination preserves the efficiency of descriptor-based methods during inference while increasing robustness to appearance shifts. Trained with a triplet loss and hard-negative mining, our approach achieves state-of-the-art results on VIS-NIR and other VIS-IR benchmarks and matches or surpasses prior methods on additional datasets, despite their higher inference cost. To spur progress on domain shift, we also release GAP-VIR, a cross-platform (ground/aerial) VIS-IR patch dataset with 500K pairs, enabling rigorous evaluation of cross-domain generalization and adaptation.
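
The training objective named in this abstract, a triplet loss with hard-negative mining, can be sketched on plain descriptor vectors. The margin value, the use of Euclidean distance, and mining within a small candidate list are assumptions made for illustration; the abstract does not specify them.

```python
import math

def hardest_negative(anchor, negatives):
    """Pick the negative descriptor closest to the anchor (the hardest one)."""
    return min(negatives, key=lambda neg: math.dist(anchor, neg))

def triplet_loss(anchor, positive, negatives, margin=1.0):
    """Hinge triplet loss using the hardest negative among the candidates.

    The loss is zero once the positive is closer to the anchor than the
    hardest negative by at least `margin`.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, hardest_negative(anchor, negatives))
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D descriptors: e.g. a visible-light patch (anchor), its matching
# infrared patch (positive), and two non-matching patches (negatives).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
negatives = [[3.0, 0.0], [0.0, 1.5]]
loss = triplet_loss(anchor, positive, negatives)
```

Mining the closest negative focuses the gradient on the non-matching patches that are most easily confused with the anchor.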

[273] EmoKGEdit: Training-free Affective Injection via Visual Cue Transformation

Jing Zhang,Bingjie Fan

Main category: cs.CV

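TL;DR: This paper proposes EmoKGEdit, a training-free image emotion editing framework that builds a Multimodal Sentiment Association Knowledge Graph (MSA-KG) to disentangle emotion from content representations, enabling precise, structure-preserving emotion editing.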

Details Motivation: Existing methods struggle to separate emotional cues from content representations, which leads to weak emotional expression and structural distortion; this paper aims to address that problem. Method: A Multimodal Sentiment Association Knowledge Graph (MSA-KG) is built to explicitly model the causal chain among objects, attributes, and emotions and is used as external knowledge to guide a multimodal large model in inferring emotion-related visual cues; a disentangled structure-emotion editing module separates emotional attributes from layout features in the latent space. Result: Experiments show that EmoKGEdit performs strongly in both emotion fidelity and content preservation and outperforms existing state-of-the-art methods. Conclusion: By introducing a knowledge graph and a disentangled editing mechanism, EmoKGEdit achieves high-quality, structure-preserving image emotion editing and offers a new direction for training-free methods. Abstract: Existing image emotion editing methods struggle to disentangle emotional cues from latent content representations, often yielding weak emotional expression and distorted visual structures. To bridge this gap, we propose EmoKGEdit, a novel training-free framework for precise and structure-preserving image emotion editing. Specifically, we construct a Multimodal Sentiment Association Knowledge Graph (MSA-KG) to disentangle the intricate relationships among objects, scenes, attributes, visual clues and emotion. MSA-KG explicitly encodes the causal chain among object-attribute-emotion and serves as external knowledge to support chain-of-thought reasoning, guiding the multimodal large model to infer plausible emotion-related visual cues and generate coherent instructions. In addition, based on MSA-KG, we design a disentangled structure-emotion editing module that explicitly separates emotional attributes from layout features within the latent space, which ensures that the target emotion is effectively injected while strictly maintaining visual spatial coherence. Extensive experiments demonstrate that EmoKGEdit achieves excellent performance in both emotion fidelity and content preservation, and outperforms the state-of-the-art methods.

[274] FlowIID: Single-Step Intrinsic Image Decomposition via Latent Flow Matching

Mithlesh Singla,Seema Kumari,Shanmuganathan Raman

Main category: cs.CV

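TL;DR: This paper proposes FlowIID, a flow-matching-based intrinsic image decomposition method that combines a VAE-guided latent space with a flow matching module to separate albedo and shading in a parameter-efficient, single-step manner, with strong results on multiple benchmarks.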

Details Motivation: Existing intrinsic image decomposition models have large parameter counts, which makes them hard to combine with other models in practice and hard to deploy in resource-constrained settings. Method: A new architecture, FlowIID, is designed around latent flow matching; it combines a VAE-guided latent space with a flow matching module to achieve stable and efficient decomposition. Result: FlowIID has few parameters and fast inference, completes the decomposition in a single step, and reaches leading performance on multiple benchmarks. Conclusion: FlowIID is an efficient, compact intrinsic image decomposition model suited to resource-constrained and real-time vision applications. Abstract: Intrinsic Image Decomposition (IID) separates an image into albedo and shading components. It is a core step in many real-world applications, such as relighting and material editing. Existing IID models achieve good results, but often use a large number of parameters. This makes them costly to combine with other models in real-world settings. To address this problem, we propose a flow matching-based solution. For this, we design a novel architecture, FlowIID, based on latent flow matching. FlowIID combines a VAE-guided latent space with a flow matching module, enabling a stable decomposition of albedo and shading. FlowIID is not only parameter-efficient, but also produces results in a single inference step. Despite its compact design, FlowIID delivers competitive or superior results compared to existing models across various benchmarks. This makes it well-suited for deployment in resource-constrained and real-time vision applications.

[275] Turbo-GoDec: Exploiting the Cluster Sparsity Prior for Hyperspectral Anomaly Detection

Jiahui Sheng,Xiaorun Li,Shuhan Chen

Main category: cs.CV

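TL;DR: This paper proposes Turbo-GoDec, a new hyperspectral anomaly detection method based on a cluster sparsity prior, which models the prior with a Markov random field and uses message passing on a factor graph to improve the detection of small anomalies.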

Details Motivation: Most existing methods rely only on the assumptions of a low-rank background and sparse anomalies and rarely exploit the spatial distribution of anomalies; anomalies are observed to appear as small, clustered groups, so a cluster sparsity prior is introduced to strengthen detection. Method: The cluster sparsity prior is incorporated into the S-step of the classical GoDec algorithm; a Markov random field models the cluster-sparse structure of anomalies, message passing on a factor graph computes the marginal anomaly probabilities, and high-probability locations are taken as the sparse component. Result: Experiments on three real hyperspectral datasets show that Turbo-GoDec outperforms vanilla GoDec (LSMAD) and current state-of-the-art methods in detecting small anomalies. Conclusion: Introducing a cluster sparsity prior effectively improves hyperspectral anomaly detection, particularly for small, clustered anomalous targets, which confirms the value of spatial prior information in anomaly detection. Abstract: As a key task in hyperspectral image processing, hyperspectral anomaly detection has garnered significant attention and undergone extensive research. Existing methods primarily rely on two prior assumptions: low-rank background and sparse anomaly, along with additional spatial assumptions of the background. However, most methods only utilize the sparsity prior assumption for anomalies and rarely expand on this hypothesis. From observations of hyperspectral images, we find that anomalous pixels exhibit certain spatial distribution characteristics: they often manifest as small, clustered groups in space, which we refer to as cluster sparsity of anomalies. We then combine the cluster sparsity prior with the classical GoDec algorithm, incorporating the prior into the S-step of GoDec. This results in a new hyperspectral anomaly detection method, which we call Turbo-GoDec. In this approach, we model the cluster sparsity prior of anomalies using a Markov random field and compute the marginal probabilities of anomalies through message passing on a factor graph. Locations with high anomalous probabilities are treated as the sparse component in Turbo-GoDec. Experiments conducted on three real hyperspectral image (HSI) datasets demonstrate the superior performance of the proposed Turbo-GoDec method in detecting small-size anomalies compared with the vanilla GoDec (LSMAD) and state-of-the-art anomaly detection methods. The code is available at https://github.com/jiahuisheng/Turbo-GoDec.

[276] MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

Peizhou Huang,Zixuan Zhong,Zhongwei Wan,Donghao Zhou,Samiul Alam,Xin Wang,Zexin Li,Zhihao Dou,Li Zhu,Jing Xiong,Chaofan Tao,Yan Xu,Dimitrios Dimitriadis,Tuo Zhang,Mi Zhang

Main category: cs.CV

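TL;DR: This paper presents MMDeepResearch-Bench (MMDR-Bench), a multimodal deep research benchmark of 140 expert-designed tasks for evaluating report generation and evidence citation in combined image-text settings, together with a unified evaluation pipeline comprising FLAE, TRACE, and MOSAIC; experiments show that current models trade off generation quality, citation discipline, and multimodal consistency, and that multimodal integrity remains the main bottleneck.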

Details Motivation: Existing benchmarks for deep research agents (DRAs) mainly target text-only or short-form question answering and lack an effective evaluation of end-to-end multimodal evidence use, so they do not reflect a model's ability to reason over and cite images and text together in realistic, complex tasks. Method: A multimodal benchmark, MMDR-Bench, is built across 21 domains, with each task providing an image-text bundle as input; three evaluation modules are designed: FLAE for report quality, TRACE for citation-evidence alignment, and MOSAIC for text-visual consistency. Result: Experiments on 25 state-of-the-art models show systematic trade-offs between generation quality, citation accuracy, and multimodal consistency; high-quality text generation does not imply reliable evidence use, and multimodal integrity remains the main weakness of current models. Conclusion: Multimodal deep research requires a more comprehensive evaluation framework, since text generation ability alone does not guarantee a reliable research agent; MMDR-Bench and its evaluation methods expose shortcomings of current models in citation faithfulness and text-visual consistency and point to directions for future work. Abstract: Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, where models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments across 25 state-of-the-art models reveal systematic trade-offs between generation quality, citation discipline, and multimodal grounding, highlighting that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.

[277] SimpleMatch: A Simple and Strong Baseline for Semantic Correspondence

Hailing Jin,Huiying Li

Main category: cs.CV

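TL;DR: SimpleMatch is an efficient semantic correspondence framework that uses a lightweight upsample decoder and a multi-scale supervised loss to achieve strong performance with low-resolution inputs while substantially reducing computational overhead.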

Details Motivation: Existing methods depend on high-resolution images, and deep downsampling irreversibly fuses the features of adjacent keypoints, which hurts performance and increases computational cost. Method: The SimpleMatch framework comprises a lightweight upsample decoder that progressively recovers spatial detail, a multi-scale supervised loss that preserves discriminative features across scales, and sparse matching with window-based localization to reduce training memory consumption. Result: At a resolution of 252x252 (3.3x smaller than current SOTA methods), SimpleMatch reaches 84.1% PCK@0.1 on the SPair-71k dataset while reducing memory usage by 51%. Conclusion: SimpleMatch provides an efficient, practical baseline for future research on semantic correspondence. Abstract: Recent advances in semantic correspondence have been largely driven by the use of pre-trained large-scale models. However, a limitation of these approaches is their dependence on high-resolution input images to achieve optimal performance, which results in considerable computational overhead. In this work, we address a fundamental limitation in current methods: the irreversible fusion of adjacent keypoint features caused by deep downsampling operations. This issue is triggered when semantically distinct keypoints fall within the same downsampled receptive field (e.g., 16x16 patches). To address this issue, we present SimpleMatch, a simple yet effective framework for semantic correspondence that delivers strong performance even at low resolutions. We propose a lightweight upsample decoder that progressively recovers spatial detail by upsampling deep features to 1/4 resolution, and a multi-scale supervised loss that ensures the upsampled features retain discriminative features across different spatial scales. In addition, we introduce sparse matching and window-based localization to optimize training memory usage and reduce it by 51%. At a resolution of 252x252 (3.3x smaller than current SOTA methods), SimpleMatch achieves superior performance with 84.1% PCK@0.1 on the SPair-71k benchmark. We believe this framework provides a practical and efficient baseline for future research in semantic correspondence. Code is available at: https://github.com/hailong23-jin/SimpleMatch.
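The headline number in this entry is PCK@0.1, the percentage of correct keypoints at a tolerance of 0.1. The sketch below assumes the common convention that a predicted keypoint counts as correct when it lies within `alpha` times the longer side of the object bounding box from the ground truth. That convention, and the function name `pck`, are assumptions for illustration rather than details stated in the abstract.

```python
import math


def pck(pred_keypoints, gt_keypoints, bbox_size, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(bbox side) of the ground truth.

    pred_keypoints, gt_keypoints: lists of (x, y) pairs in the same order.
    bbox_size: (width, height) of the object bounding box in the target image.
    """
    threshold = alpha * max(bbox_size)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred_keypoints, gt_keypoints)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt_keypoints)
```

With a 100x60 box the tolerance is 10 pixels, so a prediction 2 pixels off counts as correct and one 30 pixels off does not.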

[278] From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles

Omar Y. Goba,Ahmed Y. Gado,Catherine M. Elias,Ahmed Hussein

Main category: cs.CV

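TL;DR: This paper proposes an agentic framework based on large language models (LLMs) and multi-modal vision models (LVMs) that generates and adapts behavior trees (BTs) at runtime to improve the adaptive decision-making of autonomous vehicles in complex, unpredictable environments.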

Details Motivation: Traditional behavior trees are static and depend on manual tuning, so they struggle to meet the real-time adaptability required for SAE Level 5 full autonomy. Method: Three specialized agents are designed: a Descriptor (assesses scene criticality), a Planner (generates high-level sub-goals via in-context learning), and a Generator (synthesizes executable BT sub-trees in XML format); they are integrated into a CARLA+Nav2 simulation and triggered only when the baseline BT fails. Result: The system navigates around unexpected obstacles (e.g., a street blockage) without human intervention, indicating its potential to extend to diverse driving scenarios. Conclusion: The framework is a first empirical exploration of using LLMs/LVMs for dynamic behavior-tree generation and offers a scalable, adaptive behavior-planning paradigm for high-level autonomous driving. Abstract: Autonomous vehicles (AVs) require adaptive behavior planners to navigate unpredictable, real-world environments safely. Traditional behavior trees (BTs) offer structured decision logic but are inherently static and demand labor-intensive manual tuning, limiting their applicability at SAE Level 5 autonomy. This paper presents an agentic framework that leverages large language models (LLMs) and multi-modal vision models (LVMs) to generate and adapt BTs on the fly. A specialized Descriptor agent applies chain-of-symbols prompting to assess scene criticality, a Planner agent constructs high-level sub-goals via in-context learning, and a Generator agent synthesizes executable BT sub-trees in XML format. Integrated into a CARLA+Nav2 simulation, our system triggers only upon baseline BT failure, demonstrating successful navigation around unexpected obstacles (e.g., street blockage) with no human intervention. Compared to a static BT baseline, this approach is a proof-of-concept that extends to diverse driving scenarios.

[279] DepthCropSeg++: Scaling a Crop Segmentation Foundation Model With Depth-Labeled Data

Jiafei Zhang,Songliang Cao,Binghui Xu,Yanan Li,Weiwei Jia,Tingting Wu,Hao Lu,Weijuan Hu,Zhiguo Han

Main category: cs.CV

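TL;DR: DepthCropSeg++ is a foundation model for in-field crop segmentation that uses a large-scale cross-species, cross-scene dataset and an improved ViT-Adapter architecture to achieve state-of-the-art segmentation performance across a range of challenging environments.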

Details Motivation: Existing crop segmentation models are limited by high annotation cost and small datasets, which leads to poor generalization across species and complex field environments. Method: Building on the earlier DepthCropSeg work, a large-scale dataset of 28,406 images covering more than 30 crop species and 15 environmental conditions is constructed; a ViT-Adapter architecture enhanced with dynamic upsampling is trained with a two-stage self-training pipeline. Result: The model reaches 93.11% mIoU on a comprehensive test set, outperforming fully supervised baselines (+0.36%) and general-purpose vision foundation models such as SAM (+48.57%); it also performs well in challenging scenarios including night-time (86.90% mIoU), high-density canopies (90.09% mIoU), and unseen crop varieties (90.09% mIoU). Conclusion: DepthCropSeg++ significantly improves the generalization and practicality of crop segmentation models and is an important step toward a general-purpose crop segmentation solution for real agricultural environments. Abstract: DepthCropSeg++ is a foundation model for crop segmentation, capable of segmenting different crop species in open in-field environments. Crop segmentation is a fundamental task for modern agriculture, which closely relates to many downstream tasks such as plant phenotyping, density estimation, and weed control. In the era of foundation models, a number of generic large language and vision models have been developed. These models have demonstrated remarkable real-world generalization due to significant model capacity and large-scale datasets. However, current crop segmentation models mostly learn from limited data due to expensive pixel-level labelling cost, often performing well only under specific crop types or controlled environments. In this work, we follow the vein of our previous work DepthCropSeg, an almost unsupervised approach to crop segmentation, to scale up a cross-species and cross-scene crop segmentation dataset, with 28,406 images across 30+ species and 15 environmental conditions. We also build upon a state-of-the-art semantic segmentation architecture, ViT-Adapter, enhance it with dynamic upsampling for improved detail awareness, and train the model with a two-stage self-training pipeline. To systematically validate model performance, we conduct comprehensive experiments to justify the effectiveness and generalization capabilities across multiple crop datasets. 
Results demonstrate that DepthCropSeg++ achieves 93.11% mIoU on a comprehensive testing set, outperforming both supervised baselines and general-purpose vision foundation models like the Segment Anything Model (SAM) by significant margins (+0.36% and +48.57%, respectively). The model particularly excels in challenging scenarios including night-time environments (86.90% mIoU), high-density canopies (90.09% mIoU), and unseen crop varieties (90.09% mIoU), indicating a new state of the art for crop segmentation.
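The results above are reported as mIoU (mean intersection over union). A minimal sketch of that metric follows, assuming predictions and labels are flattened lists of per-pixel class indices and that classes absent from both are skipped. The function name `mean_iou` is illustrative, and the paper's exact evaluation protocol may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target.

    pred, target: equal-length lists of integer class labels, one per pixel.
    """
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes that appear in neither prediction nor label
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For binary crop-versus-background segmentation, `num_classes` would be 2 and the score averages the IoU of the two classes.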

[280] CD-TWINSAFE: A ROS-enabled Digital Twin for Scene Understanding and Safety Emerging V2I Technology

Amro Khaled,Farah Khaled,Omar Riad,Catherine M. Elias

Main category: cs.CV

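TL;DR: This paper proposes CD-TWINSAFE, a digital twin system for autonomous vehicles based on vehicle-to-infrastructure (V2I) communication, which combines on-board perception and localization modules with an infrastructure-side digital twin environment to replicate the scene in real time and issue safety alerts.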

Details Motivation: Improving the safety and real-time monitoring of autonomous vehicles in complex traffic environments calls for a system architecture that fuses real-world data with a virtual simulation environment. Method: The architecture consists of two stacks running in parallel: an on-board driving stack (including a stereo camera for scene understanding) and a digital twin stack (which reconstructs the scene in Unreal Engine 5). The on-board part implements localization and perception; the perception module processes images at 20 fps to perform object detection, feature extraction, and computation of safety metrics such as time-to-collision and time-headway. The data are sent as ROS2 messages over a 4G network to the infrastructure side, where they drive updates of the digital twin environment. Result: Experiments across several driving scenarios confirm the validity and real-time response of the system; the digital twin environment accurately reflects the real scene and can return safety alerts to the cockpit. Conclusion: CD-TWINSAFE demonstrates the potential of combining V2I communication with digital twin technology to enhance autonomous driving safety, with good real-time behavior and application prospects. Abstract: In this paper, the CD-TWINSAFE is introduced, a V2I-based digital twin for Autonomous Vehicles. The proposed architecture is composed of two stacks running simultaneously: an on-board driving stack that includes a stereo camera for scene understanding, and a digital twin stack that runs an Unreal Engine 5 replica of the scene viewed by the camera and returns safety alerts to the cockpit. The on-board stack is implemented on the vehicle side and includes two main autonomous modules: localization and perception. The position and orientation of the ego vehicle are obtained using on-board sensors. Furthermore, the perception module is responsible for processing 20-fps images from the stereo camera and understands the scene through two complementary pipelines. The pipelines perform object detection and feature extraction, including object velocity, yaw, and the safety metrics time-to-collision and time-headway. The data collected from the driving stack are sent to the infrastructure side through the ROS-enabled architecture in the form of custom ROS2 messages, carried over UDP links that ride a 4G modem for V2I communication. The environment is monitored via the digital twin through the shared messages, which update the information of the spawned ego vehicle and detected objects based on the real-time localization and perception data. Several tests with different driving scenarios confirm the validity and real-time response of the proposed architecture.
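The abstract names two safety metrics, time-to-collision and time-headway, without giving formulas. The sketch below uses the standard textbook definitions for a lead vehicle directly ahead on the same path. The function names, argument names, and the choice to return infinity when there is no closing speed are assumptions for illustration, not the paper's implementation.

```python
import math


def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Seconds until the ego vehicle reaches the lead object at constant speeds.

    Returns infinity when the ego vehicle is not closing the gap.
    """
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed


def time_headway(gap_m, ego_speed_mps):
    """Seconds the ego vehicle needs to cover the current gap at its own speed."""
    if ego_speed_mps <= 0:
        return math.inf
    return gap_m / ego_speed_mps
```

For a 30 m gap with the ego vehicle at 20 m/s and the lead vehicle at 10 m/s, these give a time-to-collision of 3.0 s and a time-headway of 1.5 s.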

[281] Utilizing the Score of Data Distribution for Hyperspectral Anomaly Detection

Jiahui Sheng,Yidan Shi,Shu Xiang,Xiaorun Li,Shuhan Chen

Main category: cs.CV

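TL;DR: This paper proposes ScoreAD, a hyperspectral anomaly detection method built on a score-based generative model (SGM), which relies on the manifold hypothesis for hyperspectral data and detects anomalies by estimating the gradient field of the data distribution.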

Details Motivation: The spectra in hyperspectral images are high-dimensional but determined by only a few factors, so they plausibly satisfy the manifold hypothesis; exploiting this property can better separate background spectra from anomalous ones. Method: A score-based generative model (SGM) is first trained on the full set of spectra from the hyperspectral image; at test time, each spectrum is perturbed by a perturbation kernel and fed to the SGM to estimate its score, and differences in the scores are used to detect anomalies. Result: Experiments on four hyperspectral datasets show that the proposed method is effective. Conclusion: ScoreAD, based on the hyperspectral manifold hypothesis and an SGM, effectively detects anomalous spectra and outperforms traditional methods. Abstract: Hyperspectral images (HSIs) are a type of image that contains abundant spectral information. As a type of real-world data, the high-dimensional spectra in hyperspectral images are actually determined by only a few factors, such as chemical composition and illumination. Thus, spectra in hyperspectral images are highly likely to satisfy the manifold hypothesis. Based on the hyperspectral manifold hypothesis, we propose a novel hyperspectral anomaly detection method (named ScoreAD) that leverages the time-dependent gradient field of the data distribution (i.e., the score), as learned by a score-based generative model (SGM). Our method first trains the SGM on the entire set of spectra from the hyperspectral image. At test time, each spectrum is passed through a perturbation kernel, and the resulting perturbed spectrum is fed into the trained SGM to obtain the estimated score. The manifold hypothesis of HSIs posits that background spectra reside on one or more low-dimensional manifolds. Conversely, anomalous spectra, owing to their unique spectral signatures, are considered outliers that do not conform to the background manifold. Based on this fundamental discrepancy in their manifold distributions, we leverage a generative SGM to achieve hyperspectral anomaly detection. Experiments on four hyperspectral datasets demonstrate the effectiveness of the proposed method. The code is available at https://github.com/jiahuisheng/ScoreAD.
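The "score" in this entry is the gradient of the log-density with respect to the data, written ∇ₓ log p(x). As a toy illustration only, the sketch below replaces the learned SGM with an isotropic Gaussian background model, for which the score has the closed form -(x - mean) / sigma². Using the score norm as an anomaly indicator is an assumption made for this example and is not the detection rule stated in the abstract.

```python
import math


def gaussian_score(x, mean, sigma):
    """Score (gradient of log-density) of an isotropic Gaussian N(mean, sigma^2 I) at x."""
    return [-(xi - mi) / sigma ** 2 for xi, mi in zip(x, mean)]


def score_norm(x, mean, sigma):
    """Euclidean norm of the score; larger values mean x lies farther from the mode."""
    return math.sqrt(sum(s * s for s in gaussian_score(x, mean, sigma)))
```

Under this toy model, a spectrum far from the background mean has a larger score norm than one close to it, which loosely mirrors the idea that anomalous spectra do not conform to the background manifold.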

[282] A Hierarchical Benchmark of Foundation Models for Dermatology

Furkan Yuceyalcin,Abdurrahim Yilmaz,Burak Temelkuran

Main category: cs.CV

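TL;DR: This study evaluates ten foundation models on hierarchical skin lesion classification and reveals a "granularity gap" in performance across diagnostic levels: general medical models excel at high-level screening, whereas dermatology-specific models are better suited to fine-grained discrimination.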

Details Motivation: Existing dermatology benchmarks often reduce a complex diagnostic taxonomy to binary classification tasks, which overlooks a model's ability to make fine-grained differential diagnoses and limits its use in clinical workflows; a more detailed evaluation framework is therefore needed to measure performance across diagnostic levels. Method: Using the DERM12345 dataset (40 subclasses), frozen embeddings are extracted from ten foundation models and lightweight adapter models are trained with five-fold cross-validation; a four-level hierarchical evaluation framework (40 subclasses, 15 main classes, 2 and 4 superclasses, and binary malignancy) assesses performance at different levels of clinical granularity. Result: MedImageInsights performs best on binary malignancy detection (97.52% weighted F1-score) but drops to 65.50% on fine-grained 40-class classification, whereas MedSigLip and dermatology-specific models (such as Derm Foundation and MONET) perform better on the 40-class task (up to 69.79%) but worse on the broader classification tasks. Conclusion: General medical foundation models are suitable for high-level screening, but the fine-grained diagnostic support needed in clinical practice requires domain-specific or specialized modeling strategies. Abstract: Foundation models have transformed medical image analysis by providing robust feature representations that reduce the need for large-scale task-specific training. However, current benchmarks in dermatology often reduce the complex diagnostic taxonomy to flat, binary classification tasks, such as distinguishing melanoma from benign nevi. This oversimplification obscures a model's ability to perform fine-grained differential diagnoses, which is critical for clinical workflow integration. This study evaluates the utility of embeddings derived from ten foundation models, spanning general computer vision, general medical imaging, and dermatology-specific domains, for hierarchical skin lesion classification. Using the DERM12345 dataset, which comprises 40 lesion subclasses, we calculated frozen embeddings and trained lightweight adapter models using five-fold cross-validation. We introduce a hierarchical evaluation framework that assesses performance across four levels of clinical granularity: 40 Subclasses, 15 Main Classes, 2 and 4 Superclasses, and Binary Malignancy. Our results reveal a "granularity gap" in model capabilities: MedImageInsights achieved the strongest overall performance (97.52% weighted F1-Score on Binary Malignancy detection) but declined to 65.50% on fine-grained 40-class subtype classification. 
Conversely, MedSigLip (69.79%) and dermatology-specific models (Derm Foundation and MONET) excelled at fine-grained 40-class subtype discrimination while achieving lower overall performance than MedImageInsights on broader classification tasks. Our findings suggest that while general medical foundation models are highly effective for high-level screening, specialized modeling strategies are necessary for the granular distinctions required in diagnostic support systems.
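The comparisons in this entry are reported as weighted F1-scores. A minimal sketch of that metric follows, assuming the usual definition in which each class's F1 is weighted by its share of the true labels. The function name `weighted_f1` is illustrative, and the study's exact tooling is not stated in the abstract.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = count - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because count > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += f1 * count / total
    return score
```

Because the weights follow class frequency, a model can score highly on a binary task while still doing poorly on rare subclasses in the 40-class setting, which is one reason the hierarchical view is informative.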

[283] Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation

Dasith de Silva Edirimuni,Ajmal Saeed Mian

Main category: cs.CV

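TL;DR: This paper proposes a class-partitioned vector-quantized variational autoencoder (CPVQ-VAE) that, combined with a class-aware update mechanism and a latent-space flow matching model, enables pure point cloud scene generation without retrieval from an external database and substantially reduces Chamfer and Point2Mesh errors on complex indoor scenes.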

Details Motivation: For complex scenes with many object categories, existing 3D scene generation methods cannot effectively decode diffusion-generated latent features into point cloud objects that match the target classes, and they depend on retrieval from an external database. Method: A class-partitioned vector-quantized variational autoencoder (CPVQ-VAE) is proposed, with a class-labeled codebook and a class-aware running-average update strategy to avoid codebook collapse; a Latent-space Flow Matching Model (LFMM) generates object features with class labels, which the CPVQ-VAE maps through a class-aware inverse look-up and decodes into class-specific point cloud shapes. Result: Pure point cloud generation is achieved without an external database; on complex living room scenes, Chamfer distance and Point2Mesh error are reduced by up to 70.4% and 72.3%, respectively. Conclusion: CPVQ-VAE effectively decodes class-labeled latent features into point cloud objects that agree with their semantic classes, significantly improving the quality and accuracy of complex 3D scene generation. Abstract: Most 3D scene generation methods are limited to only generating object bounding box parameters while newer diffusion methods also generate class labels and latent features. Using object size or latent feature, they then retrieve objects from a predefined database. For complex scenes of varied, multi-categorical objects, diffusion-based latents cannot be effectively decoded by current autoencoders into the correct point cloud objects which agree with target classes. We introduce a Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) that is trained to effectively decode object latent features, by employing a pioneering $\textit{class-partitioned codebook}$ where codevectors are labeled by class. To address the problem of $\textit{codebook collapse}$, we propose a $\textit{class-aware}$ running average update which reinitializes dead codevectors within each partition. During inference, object features and class labels, both generated by a Latent-space Flow Matching Model (LFMM) designed specifically for scene generation, are consumed by the CPVQ-VAE. The CPVQ-VAE's class-aware inverse look-up then maps generated latents to codebook entries that are decoded to class-specific point cloud shapes. Thereby, we achieve pure point cloud generation without relying on an external object database for retrieval. Extensive experiments reveal that our method reliably recovers plausible point cloud scenes, with up to 70.4% and 72.3% reduction in Chamfer and Point2Mesh errors on complex living room scenes.
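The class-aware inverse look-up described above restricts nearest-codevector search to the partition belonging to the object's class label. The sketch below illustrates that idea under simple assumptions: the codebook is a dictionary from class label to a list of codevectors, and squared Euclidean distance is the matching criterion. The function name, data layout, and distance choice are illustrative and not taken from the paper.

```python
def class_aware_lookup(latent, class_label, codebook):
    """Return (index, codevector) of the nearest entry within the class's partition.

    codebook: dict mapping a class label to a list of codevectors (hypothetical layout).
    """
    partition = codebook[class_label]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best_index = min(
        range(len(partition)),
        key=lambda i: squared_distance(latent, partition[i]),
    )
    return best_index, partition[best_index]
```

Restricting the search this way means a latent labeled as one class can never be decoded through another class's codevector, even if that codevector happens to be closer in latent space.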

[284] Weaknesses of Facial Emotion Recognition Systems

Aleksandra Jamróz,Patrycja Wysocka,Piotr Garbat

Main category: cs.CV

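TL;DR: This paper examines machine learning methods for emotion detection from facial expressions; by comparing three strong neural network models on three diverse datasets, it exposes limitations of existing methods in cross-dataset performance, in unequal difficulty across emotions, and in distinguishing similar emotions.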

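Details Motivation: Emotion detection matters for human-computer interaction, but the many existing methods vary widely in performance, so a systematic evaluation is needed to reveal their strengths and weaknesses. Method: Three advanced neural network models and three representative datasets are selected for training and cross-testing, and several groups of experiments compare the models on the same and on different datasets. Result: Model performance drops significantly in cross-dataset testing, recognition difficulty differs between emotions, and similar emotions (such as anger and disgust) are hard to tell apart. Conclusion: Current emotion detection models generalize poorly and are sensitive to dataset bias; future work should focus on improving robustness and the ability to resolve subtle differences between emotions. Abstract: Emotion detection from faces is one of the machine learning problems needed for human-computer interaction. The variety of methods used is enormous, which motivated an in-depth review of articles and scientific studies. Three of the most interesting and best solutions are selected, followed by the selection of three datasets that stood out for the diversity and number of images in them. The selected neural networks are trained, and then a series of experiments are performed to compare their performance, including testing on datasets other than the one a model was trained on. This reveals weaknesses in existing solutions, including differences between datasets, unequal levels of difficulty in recognizing certain emotions, and the challenges in differentiating between closely related emotions.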

[285] HOT-POT: Optimal Transport for Sparse Stereo Matching

Antonin Clerc,Michael Quellmalz,Moritz Piening,Philipp Flotho,Gregor Kornhardt,Gabriele Steidl

Main category: cs.CV

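TL;DR: This paper proposes an unsupervised sparse feature matching method built on an optimal transport (OT) framework, which uses line constraints from the camera geometry to address matching problems in stereo vision and is particularly suited to facial landmark matching.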

Details Motivation: Occlusions, motion, and camera distortions make sparse feature matching in stereo vision (e.g., facial keypoints) challenging, and traditional methods are parameter-sensitive and handle unsupervised settings poorly. Method: Camera-projected points are modeled as (half-)lines; the classical epipolar distance and a 3D ray distance serve as matching costs and are embedded in a (partial) optimal transport problem, yielding efficiently solvable assignment problems; the approach is further extended to hierarchical OT for unsupervised object matching. Result: In numerical experiments the method shows efficient feature and object matching, performing well in particular on facial landmark matching across different landmarking conventions. Conclusion: An OT-based framework combined with geometric distances effectively addresses sparse, unsupervised matching and provides a robust, extensible approach for applications such as stereo vision and facial analysis. Abstract: Stereo vision between images faces a range of challenges, including occlusions, motion, and camera distortions, across applications in autonomous driving, robotics, and face analysis. Due to parameter sensitivity, further complications arise for stereo matching with sparse features, such as facial landmarks. To overcome this ill-posedness and enable unsupervised sparse matching, we consider line constraints of the camera geometry from an optimal transport (OT) viewpoint. Formulating camera-projected points as (half)lines, we propose the use of the classical epipolar distance as well as a 3D ray distance to quantify matching quality. Employing these distances as a cost function of a (partial) OT problem, we arrive at efficiently solvable assignment problems. Moreover, we extend our approach to unsupervised object matching by formulating it as a hierarchical OT problem. The resulting algorithms allow for efficient feature and object matching, as demonstrated in our numerical experiments. Here, we focus on applications in facial analysis, where we aim to match distinct landmarking conventions.
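One of the matching costs named in this entry is the classical epipolar distance. The sketch below computes a one-sided version: the distance from a point in the second image to the epipolar line induced by a point in the first image, given a 3x3 fundamental matrix `F`. The one-sided form, the function name, and the pixel-coordinate convention are assumptions for illustration, and the paper may use a symmetric variant.

```python
import math


def epipolar_distance(F, x1, x2):
    """Distance from x2 (image 2) to the epipolar line F @ x1 induced by x1 (image 1).

    F: 3x3 fundamental matrix as nested lists; x1, x2: (x, y) pixel coordinates.
    """
    p = (x1[0], x1[1], 1.0)  # homogeneous coordinates of x1
    line = [sum(F[r][c] * p[c] for c in range(3)) for r in range(3)]
    numerator = abs(line[0] * x2[0] + line[1] * x2[1] + line[2])
    return numerator / math.hypot(line[0], line[1])
```

For a rectified stereo pair with a purely horizontal baseline, one valid fundamental matrix is `[[0, 0, 0], [0, 0, -1], [0, 1, 0]]`, and the distance reduces to the vertical offset between the two points. A matrix of such distances over all candidate pairs could then serve as the cost of an assignment or OT problem.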

[286] SkeFi: Cross-Modal Knowledge Transfer for Wireless Skeleton-Based Action Recognition

Shunyu Huang,Yunjiao Zhou,Jianfei Yang

Main category: cs.CV

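TL;DR: This paper proposes SkeFi, a skeleton-based action recognition framework built on cross-modal knowledge transfer, which uses RGB data to support action recognition with wireless sensors (such as LiDAR and mmWave) in low-light and privacy-sensitive settings and improves performance through an enhanced temporal correlation adaptive graph convolution and dual temporal convolution.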

Details Motivation: Skeleton-based action recognition with RGB cameras degrades in dark environments and raises privacy concerns, which limits its use in smart homes and hospitals, so non-invasive wireless sensors need to be explored as an alternative. Method: The SkeFi framework transfers knowledge from the RGB modality to the data-scarce wireless sensor modality; an enhanced Temporal Correlation Adaptive Graph Convolution (TC-AGC) with frame interactive enhancement handles the noise and missing frames produced by wireless sensors; dual temporal convolution strengthens multiscale temporal modeling. Result: Experiments show that SkeFi achieves state-of-the-art action recognition performance on both mmWave and LiDAR data and effectively overcomes data scarcity and keypoint noise. Conclusion: Through cross-modal transfer and enhanced graph convolution and temporal modeling, SkeFi accurately extracts poses and actions from non-invasive wireless sensors, providing a feasible solution for action recognition under privacy constraints and difficult lighting. Abstract: Skeleton-based action recognition leverages human pose keypoints to categorize human actions, which shows superior generalization and interoperability compared to regular end-to-end action recognition. Existing solutions use RGB cameras to annotate skeletal keypoints, but their performance declines in dark environments and raises privacy concerns, limiting their use in smart homes and hospitals. This paper explores non-invasive wireless sensors, i.e., LiDAR and mmWave, to mitigate these challenges as a feasible alternative. Two problems are addressed: (1) insufficient data on wireless sensor modality to train an accurate skeleton estimation model, and (2) skeletal keypoints derived from wireless sensors are noisier than RGB, causing great difficulties for subsequent action recognition models. Our work, SkeFi, overcomes these gaps through a novel cross-modal knowledge transfer method acquired from the data-rich RGB modality. We propose the enhanced Temporal Correlation Adaptive Graph Convolution (TC-AGC) with frame interactive enhancement to overcome the noise from missing or inconsecutive frames. Additionally, our research underscores the effectiveness of enhancing multiscale temporal modeling through dual temporal convolution. By integrating TC-AGC with temporal modeling for cross-modal transfer, our framework can extract accurate poses and actions from noisy wireless sensors. Experiments demonstrate that SkeFi achieves state-of-the-art performance on mmWave and LiDAR. The code is available at https://github.com/Huang0035/Skefi.

[287] Adversarial Defense in Vision-Language Models: An Overview

Xiaowei Fu,Lei Zhang

Main category: cs.CV

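TL;DR: This paper surveys defense strategies for vision-language models (VLMs) against adversarial attacks, covering training-time defenses, test-time adaptation defenses, and training-free defenses, and analyzes the strengths, weaknesses, and open challenges of each.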

Details Motivation: With vision-language models such as CLIP in wide use, their vulnerability to imperceptible adversarial attacks in cross-modal tasks is increasingly prominent, and a systematic summary of existing defenses is needed to improve model robustness and security. Method: Existing adversarial defense methods for VLMs are categorized into training-time defense, test-time adaptation defense, and training-free defense, and their mechanisms, advantages, and limitations are compared. Result: Three main defense paradigms are summarized: training-time defense improves robustness through adversarial fine-tuning but is computationally expensive; test-time adaptation adjusts parameters at inference time, which is flexible but complex; training-free defense counters attacks by correcting inputs or features, which is lightweight but depends on design priors. Conclusion: Adversarial defense for VLMs still faces trade-offs among generalization, efficiency, and practical deployment, and future work should develop more efficient, general defenses that require no additional training. Abstract: The widespread use of Vision Language Models (VLMs, e.g., CLIP) has raised concerns about their vulnerability to sophisticated and imperceptible adversarial attacks. These attacks could compromise model performance and system security in cross-modal tasks. To address this challenge, three main defense paradigms have been proposed: Training-time Defense, Test-time Adaptation Defense, and Training-free Defense. Training-time Defense involves modifying the training process, typically through adversarial fine-tuning to improve the robustness to adversarial examples. While effective, this approach requires substantial computational resources and may not generalize across all adversarial attacks. Test-time Adaptation Defense focuses on adapting the model at inference time by updating its parameters to handle unlabeled adversarial examples, offering flexibility but often at the cost of increased complexity and computational overhead. Training-free Defense avoids modifying the model itself, instead focusing on altering the adversarial inputs or their feature embeddings, which enforces input perturbations to mitigate the impact of attacks without additional training. This survey reviews the latest advancements in adversarial defense strategies for VLMs, highlighting the strengths and limitations of such approaches and discussing ongoing challenges in enhancing the robustness of VLMs.

[288] Large-scale EM Benchmark for Multi-Organelle Instance Segmentation in the Wild

Yanrui Lu,Danyang Chen,Haowen Xiao,Jiarui Zhu,Fukang Ge,Binqian Zou,Jiali Guan,Jiayin Liang,Yuting Wang,Ziqian Guan,Xiangcheng Bao,Jinhao Bi,Lin Gu,Jun He,Yingying Zhu

Main category: cs.CV

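TL;DR: This paper presents a large-scale, multi-source benchmark for organelle instance segmentation that addresses the failure of existing datasets to capture the heterogeneity and spatial context of real electron microscopy data.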

Details Motivation: Existing benchmarks built on small datasets do not reflect the heterogeneity and large spatial context of real-world electron microscopy data, which limits how current methods perform in practice. Method: A multi-source benchmark dataset of over 100,000 2D EM images is built, covering multiple cell types and five organelle classes; annotations are produced with a connectivity-aware 3D label propagation algorithm (3D LPA) and refined by experts; state-of-the-art models including U-Net, SAM variants, and Mask2Former are systematically evaluated. Result: Existing models generalize poorly across heterogeneous data and perform especially badly on organelles with globally distributed morphologies (such as the endoplasmic reticulum), exposing a fundamental mismatch between local-context models and the modeling of long-range structural continuity. Conclusion: New methods that model long-range dependencies and adapt to real-world variability are needed to improve organelle instance segmentation on complex EM data. Abstract: Accurate instance-level segmentation of organelles in electron microscopy (EM) is critical for quantitative analysis of subcellular morphology and inter-organelle interactions. However, current benchmarks, based on small, curated datasets, fail to capture the inherent heterogeneity and large spatial context of in-the-wild EM data, imposing fundamental limitations on current patch-based methods. To address these limitations, we developed a large-scale, multi-source benchmark for multi-organelle instance segmentation, comprising over 100,000 2D EM images across a variety of cell types and five organelle classes that capture real-world variability. Dataset annotations were generated by our designed connectivity-aware Label Propagation Algorithm (3D LPA) with expert refinement. We further benchmarked several state-of-the-art models, including U-Net, SAM variants, and Mask2Former. Our results show several limitations: current models struggle to generalize across heterogeneous EM data and perform poorly on organelles with global, distributed morphologies (e.g., Endoplasmic Reticulum). These findings underscore the fundamental mismatch between local-context models and the challenge of modeling long-range structural continuity in the presence of real-world variability. The benchmark dataset and labeling tool will be publicly released soon.

[289] DCAC: Dynamic Class-Aware Cache Creates Stronger Out-of-Distribution Detectors

Yanqi Wu,Qichao Chen,Runhe Lai,Xinhua Lu,Jia-Xin Zhuang,Zhilin Zhao,Wei-Shi Zheng,Ruixuan Wang

Main category: cs.CV

TL;DR: Proposes DCAC, a training-free test-time calibration module that improves out-of-distribution detection in deep neural networks.

Details Motivation: OOD samples predicted as the same class are observed to be visually more similar to one another, which motivates a new method targeting overconfidence on OOD samples. Method: Designs a Dynamic Class-Aware Cache (DCAC) that maintains a separate cache for each ID class to collect high-entropy samples, and uses the cached visual features and predicted probabilities through a lightweight module to calibrate the raw predictions of input samples. Result: Experiments on multiple OOD benchmarks show that DCAC significantly improves existing methods; for example, FPR95 drops by 6.55% when combined with ASH-S on the ImageNet OOD benchmark. Conclusion: DCAC is a general, efficient, training-free calibration module that mitigates overconfidence on OOD samples and applies to a range of unimodal and vision-language models. Abstract: Out-of-distribution (OOD) detection remains a fundamental challenge for deep neural networks, particularly due to overconfident predictions on unseen OOD samples during testing. We reveal a key insight: OOD samples predicted as the same class, or given high probabilities for it, are visually more similar to each other than to the true in-distribution (ID) samples. Motivated by this class-specific observation, we propose DCAC (Dynamic Class-Aware Cache), a training-free, test-time calibration module that maintains separate caches for each ID class to collect high-entropy samples and calibrate the raw predictions of input samples. DCAC leverages cached visual features and predicted probabilities through a lightweight two-layer module to mitigate overconfident predictions on OOD samples. This module can be seamlessly integrated with various existing OOD detection methods across both unimodal and vision-language models while introducing minimal computational overhead. Extensive experiments on multiple OOD benchmarks demonstrate that DCAC significantly enhances existing methods, achieving substantial improvements, e.g., reducing FPR95 by 6.55% when integrated with ASH-S on the ImageNet OOD benchmark.
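
The FPR95 figure quoted in this entry is the false positive rate on OOD samples at the score threshold that still accepts 95% of ID samples. A minimal sketch of that metric follows, assuming a higher score means "more in-distribution"; the function name and the toy scores are illustrative and do not come from the paper.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """False positive rate on OOD data at the threshold that accepts ~95% of ID data.

    Assumption: a higher score means "more likely in-distribution".
    """
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of ID samples that reaches a 95% true positive rate
    # (integer ceiling of 0.95 * n, avoiding floating-point rounding).
    keep = (95 * len(ranked) + 99) // 100
    threshold = ranked[keep - 1]
    # An OOD sample is a false positive when it scores at or above the threshold.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


# Toy data (hypothetical): 100 ID scores and 4 OOD scores.
id_scores = [i / 100 for i in range(1, 101)]
ood_scores = [0.02, 0.04, 0.50, 0.90]
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
print(fpr95)
```

A lower value is better: a calibration module such as the one described here is judged by how far it pushes this number down for a given base detector.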

[290] NeuralFur: Animal Fur Reconstruction From Multi-View Images

Vanessa Sklyarova,Berna Kabadayi,Anastasios Yiannakidis,Giorgio Becherini,Michael J. Black,Justus Thies

Main category: cs.CV

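TL;DR: Proposes a high-fidelity 3D animal fur modeling method based on multi-view images and a vision language model (VLM); the VLM supplies priors on fur structure and guides strand growth direction, enabling reconstruction that generalizes across many animals.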

Details Motivation: Reconstructing realistic animal fur geometry from images is very hard because of fine-scale detail, heavy self-occlusion, and view-dependent appearance; in addition, no datasets exist from which to learn fur priors for different animals. Method: First reconstructs a coarse surface geometry with traditional multi-view stereo, then uses a VLM system to retrieve realistic fur length structure for each body part, builds a furless geometry, and grows strands on top of it. Supervision combines geometric and photometric losses, plus an extra loss in which the VLM guides the relation between strand growth direction and the gravity vector, to reduce the orientation ambiguity introduced by Gabor filters. Result: The method achieves high-fidelity, strand-level fur reconstruction from multi-view RGB images and generalizes well across animals with different fur types. Conclusion: By using a vision language model to guide 3D fur reconstruction, the framework addresses detail, orientation, and generalization in animal fur modeling and suggests a direction for future prior-guided reconstruction methods. Abstract: Reconstructing realistic animal fur geometry from images is a challenging task due to the fine-scale details, self-occlusion, and view-dependent appearance of fur. In contrast to human hairstyle reconstruction, there are also no datasets that can be leveraged to learn a fur prior for different animals. In this work, we present a first multi-view-based method for high-fidelity 3D fur modeling of animals using a strand-based representation, leveraging the general knowledge of a vision language model. Given multi-view RGB images, we first reconstruct a coarse surface geometry using traditional multi-view stereo techniques. We then use a vision language model (VLM) system to retrieve information about the realistic length structure of the fur for each part of the body. We use this knowledge to construct the animal's furless geometry and grow strands atop it. The fur reconstruction is supervised with both geometric and photometric losses computed from multi-view images. To mitigate orientation ambiguities stemming from the Gabor filters that are applied to the input images, we additionally utilize the VLM to guide the strands' growth direction and their relation to the gravity vector that we incorporate as a loss. With this new schema of using a VLM to guide 3D reconstruction from multi-view inputs, we show generalization across a variety of animals with different fur types. For additional results and code, please refer to https://neuralfur.is.tue.mpg.de.

[291] Histopath-C: Towards Realistic Domain Shifts for Histopathology Vision-Language Adaptation

Mehrdad Noori,Gustavo Adolfo Vargas Hakim,David Osowiechi,Fereshteh Shakeri,Ali Bahri,Moslem Yazdanpanah,Sahar Dastani,Ismail Ben Ayed,Christian Desrosiers

Main category: cs.CV

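TL;DR: Introduces Histopath-C, a new benchmark of realistic synthetic corruptions for histopathology images, and LATTE, a low-rank adaptation strategy that makes vision-language models more robust to varied text inputs.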

Details Motivation: Histopathology images can exhibit severe domain shifts such as staining, contamination, blurring, and noise, which can substantially degrade the downstream performance of medical vision-language models. A new benchmark that simulates these real-world distribution shifts, and more effective test-time adaptation mechanisms, are therefore needed. Method: Introduces the Histopath-C benchmark, which dynamically applies synthetic corruptions to any available dataset and evaluates test-time adaptation (TTA) mechanisms on the fly; also proposes LATTE, a transductive low-rank adaptation strategy that uses multiple text templates to reduce the model's sensitivity to diverse text inputs. Result: LATTE outperforms state-of-the-art TTA methods designed for natural images across multiple histopathology datasets, showing that it adapts robustly to histopathology images. Conclusion: The Histopath-C benchmark and the LATTE method improve the robustness and adaptability of histopathology vision-language models under real-world distribution shifts. Abstract: Medical Vision-language models (VLMs) have shown remarkable performance in various medical imaging domains such as histopathology by leveraging pre-trained, contrastive models that exploit visual and textual information. However, histopathology images may exhibit severe domain shifts, such as staining, contamination, blurring, and noise, which may severely degrade the VLM's downstream performance. In this work, we introduce Histopath-C, a new benchmark with realistic synthetic corruptions designed to mimic real-world distribution shifts observed in digital histopathology. Our framework dynamically applies corruptions to any available dataset and evaluates Test-Time Adaptation (TTA) mechanisms on the fly. We then propose LATTE, a transductive, low-rank adaptation strategy that exploits multiple text templates, mitigating the sensitivity of histopathology VLMs to diverse text inputs. Our approach outperforms state-of-the-art TTA methods originally designed for natural images across a breadth of histopathology datasets, demonstrating the effectiveness of our proposed design for robust adaptation in histopathology images. Code and data are available at https://github.com/Mehrdad-Noori/Histopath-C.

[292] Video Individual Counting and Tracking from Moving Drones: A Benchmark and Methods

Yaowu Fan,Jia Wan,Tao Han,Andy J. Ma,Antoni B. Chan

Main category: cs.CV

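TL;DR: Proposes dense crowd counting and tracking from moving drones, builds the large-scale MovingDroneCrowd++ dataset, and introduces the GD3A and DVTrack methods, which significantly outperform existing methods in complex scenes.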

Details Motivation: Existing methods rely on fixed-camera datasets with limited spatial coverage and struggle with large-scale dense crowd analysis; moving drones can increase coverage and flexibility. Method: Proposes GD3A, which establishes pixel-level correspondences between descriptors across frames via optimal transport and decomposes the global density map; and DVTrack, which uses a descriptor voting mechanism to obtain instance-level associations for tracking. Result: On the MovingDroneCrowd++ dataset, counting error is reduced by 47.4% and tracking performance is improved by 39.2%. Conclusion: The proposed methods significantly improve dense crowd counting and tracking in complex dynamic scenes, supporting the feasibility and advantages of the moving-drone approach. Abstract: Counting and tracking dense crowds in large-scale scenes is highly challenging, yet existing methods mainly rely on datasets captured by fixed cameras, which provide limited spatial coverage and are inadequate for large-scale dense crowd analysis. To address this limitation, we propose a flexible solution using moving drones to capture videos and perform video-level crowd counting and tracking of unique pedestrians across entire scenes. We introduce MovingDroneCrowd++, the largest video-level dataset for dense crowd counting and tracking captured by moving drones, covering diverse and complex conditions with varying flight altitudes, camera angles, and illumination. Existing methods fail to achieve satisfactory performance on this dataset. To this end, we propose GD3A (Global Density Map Decomposition via Descriptor Association), a density map-based video individual counting method that avoids explicit localization. GD3A establishes pixel-level correspondences between pedestrian descriptors across consecutive frames via optimal transport with an adaptive dustbin score, enabling the decomposition of global density maps into shared, inflow, and outflow components. Building on this framework, we further introduce DVTrack, which converts descriptor-level matching into instance-level associations through a descriptor voting mechanism for pedestrian tracking. Experimental results show that our methods significantly outperform existing approaches under dense crowds and complex motion, reducing counting error by 47.4 percent and improving tracking performance by 39.2 percent.
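
The shared/inflow/outflow decomposition described in this entry can be turned into a video-level count of unique pedestrians. The sketch below is only a bookkeeping illustration under an assumption that is not stated in the abstract: that the unique count is the first-frame count plus the inflow of every later frame pair. The `FramePair` type and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FramePair:
    """Per-pair counts obtained by integrating each density component (hypothetical values)."""
    shared: float   # pedestrians visible in both frames
    inflow: float   # pedestrians that appear only in the later frame
    outflow: float  # pedestrians that appear only in the earlier frame


def count_unique(first_frame_count, pairs):
    """Unique pedestrians in the video: the first frame plus every newcomer."""
    return first_frame_count + sum(p.inflow for p in pairs)


def per_frame_counts(first_frame_count, pairs):
    """Per-frame counts implied by the decomposition: previous - outflow + inflow."""
    counts = [first_frame_count]
    for p in pairs:
        counts.append(counts[-1] - p.outflow + p.inflow)
    return counts


pairs = [FramePair(shared=8.0, inflow=3.0, outflow=2.0),
         FramePair(shared=9.0, inflow=1.0, outflow=2.0)]
unique_total = count_unique(10.0, pairs)
frame_totals = per_frame_counts(10.0, pairs)
print(unique_total, frame_totals)
```

Counting only the inflow avoids double-counting people who stay in view across frames, which is why a naive sum of per-frame counts overestimates the number of individuals.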

[293] SDCoNet: Saliency-Driven Multi-Task Collaborative Network for Remote Sensing Object Detection

Ruo Qi,Linhui Dai,Yusong Qin,Chaolei Yang,Yanshan Li

Main category: cs.CV

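TL;DR: Proposes SDCoNet, a saliency-driven multi-task collaborative network that improves small object detection in low-quality remote sensing images by coupling super-resolution and detection through a shared encoder, saliency guidance, and a gradient routing strategy.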

Details Motivation: Existing serial pipelines of super-resolution followed by object detection suffer from inconsistent optimization objectives, feature redundancy, and insufficient interaction between tasks, and they perform poorly on low-quality remote sensing images with complex backgrounds, weak signals, and small objects. Method: Designs a Swin Transformer-based shared encoder for cross-task feature collaboration, introduces a multi-scale saliency prediction module that selects key tokens and focuses on weak object regions, and applies a gradient routing strategy that mitigates optimization conflicts and steers super-resolution toward high-frequency details that help detection. Result: On several public datasets, including NWPU VHR-10-Split, DOTAv1.5-Split, and HRSSD-Split, the method significantly outperforms mainstream algorithms in small object detection while remaining computationally efficient. Conclusion: By combining implicit feature sharing with task-specific mechanisms, SDCoNet improves small object detection accuracy in low-quality remote sensing images and offers a new approach to multi-task collaboration. Abstract: In remote sensing images, complex backgrounds, weak object signals, and small object scales make accurate detection particularly challenging, especially under low-quality imaging conditions. A common strategy is to integrate single-image super-resolution (SR) before detection; however, such serial pipelines often suffer from misaligned optimization objectives, feature redundancy, and a lack of effective interaction between SR and detection. To address these issues, we propose a Saliency-Driven multi-task Collaborative Network (SDCoNet) that couples SR and detection through implicit feature sharing while preserving task specificity. SDCoNet employs a Swin Transformer-based shared encoder, where hierarchical window-shifted self-attention supports cross-task feature collaboration and adaptively balances the trade-off between texture refinement and semantic representation. In addition, a multi-scale saliency prediction module produces importance scores to select key tokens, enabling focused attention on weak object regions, suppression of background clutter, and suppression of adverse features introduced by multi-task coupling. Furthermore, a gradient routing strategy is introduced to mitigate optimization conflicts. It first stabilizes detection semantics and subsequently routes SR gradients along a detection-oriented direction, enabling the framework to guide the SR branch to generate high-frequency details that are explicitly beneficial for detection.
Experiments on public datasets, including NWPU VHR-10-Split, DOTAv1.5-Split, and HRSSD-Split, demonstrate that the proposed method, while maintaining competitive computational efficiency, significantly outperforms existing mainstream algorithms in small object detection on low-quality remote sensing images. Our code is available at https://github.com/qiruo-ya/SDCoNet.

[294] Fine-Tuning Cycle-GAN for Domain Adaptation of MRI Images

Mohd Usama,Belal Ahmad,Faleh Menawer R Althiyabi

Main category: cs.CV

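TL;DR: Proposes a Cycle-GAN-based unsupervised domain adaptation model for medical images that performs bidirectional adaptation of MRI images across scanners, with the aim of improving diagnostic accuracy.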

Details Motivation: MRI images acquired on different scanners or at different institutions exhibit domain shift due to hardware and protocol differences, which degrades the generalization of deep learning models. Method: Uses the Cycle-GAN framework with content and disparity losses to learn bidirectional mappings between the source and target domains, without paired data and while preserving anatomical structure. Result: The model is validated on multiple MRI datasets, where it significantly improves model performance, reduces inter-domain variability, and achieves bidirectional domain adaptation without labels. Conclusion: The method helps make medical image analysis more precise and consistent and offers an effective solution for cross-domain medical image adaptation. Abstract: Magnetic Resonance Imaging (MRI) scans acquired from different scanners or institutions often suffer from domain shifts owing to variations in hardware, protocols, and acquisition parameters. This discrepancy degrades the performance of deep learning models trained on source domain data when applied to target domain images. In this study, we propose a Cycle-GAN-based model for unsupervised medical-image domain adaptation. Leveraging CycleGANs, our model learns bidirectional mappings between the source and target domains without paired training data, preserving the anatomical content of the images. By leveraging Cycle-GAN capabilities with content and disparity loss for adaptation tasks, we ensured image-domain adaptation while maintaining image integrity. Several experiments on MRI datasets demonstrated the efficacy of our model in bidirectional domain adaptation without labelled data. Furthermore, this research offers promising avenues for improving diagnostic accuracy in healthcare. The statistical results confirm that our approach improves model performance and reduces domain-related variability, thus contributing to more precise and consistent medical image analysis.

[295] Deep Feature Deformation Weights

Richard Liu,Itai Lang,Rana Hanocka

Main category: cs.CV

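TL;DR: Proposes a method that combines data priors with the strengths of traditional handle-based deformation, using deep feature proximity to produce smooth, semantics-aware deformation weights in real time, with fast computation on high-resolution meshes and symmetry preservation.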

Details Motivation: Traditional handle-based deformation is not semantics-aware and depends on a preset handle distribution, while data-driven methods are semantic but slow and imprecise; the strengths of both need to be combined. Method: Generates deformation weights from deep feature proximity, introduces a barycentric feature distillation pipeline for efficiency, and applies constraints in feature space to retain the locality of classical methods and semantic symmetry. Result: Weights for meshes with a million faces can be computed in under a minute, enabling real-time semantics-aware deformation of high-resolution models with better speed and precision than traditional and neural methods. Conclusion: The method combines the precision and speed of traditional deformation with the semantic capability of data-driven approaches, making handle-based deformation more intuitive and practical. Abstract: Handle-based mesh deformation has been a long-standing paradigm in computer graphics, enabling intuitive shape edits from sparse controls. Classic techniques offer precise and rapid deformation control. However, they solve an optimization problem with constraints defined by control handle placement, requiring a user to know a priori the ideal distribution of handles on the shape to accomplish the desired edit. The mapping from handle set to deformation behavior is often unintuitive and, importantly, non-semantic. Modern data-driven methods, on the other hand, leverage a data prior to obtain semantic edits, but are slow and imprecise. We propose a technique that fuses the semantic prior of data with the precise control and speed of traditional frameworks. Our approach is surprisingly simple yet effective: deep feature proximity makes for smooth and semantic deformation weights, with no need for additional regularization. The weights can be computed in real-time for any surface point, whereas prior methods require optimization for new handles. Moreover, the semantic prior from deep features enables co-deformation of semantic parts. We introduce an improved feature distillation pipeline, barycentric feature distillation, which efficiently uses the visual signal from shape renders to minimize distillation cost. This allows our weights to be computed for high resolution meshes in under a minute, in contrast to potentially hours for both classical and neural methods. We preserve and extend properties of classical methods through feature space constraints and locality weighting.
Our field representation allows for automatic detection of semantic symmetries, which we use to produce symmetry-preserving deformations. We show a proof-of-concept application which can produce deformations for meshes up to 1 million faces in real-time on a consumer-grade machine.

[296] XRefine: Attention-Guided Keypoint Match Refinement

Jan Fabian Schmid,Annika Hagemann

Main category: cs.CV

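TL;DR: Proposes XRefine, a detector-agnostic sub-pixel keypoint refinement method that refines image patches around matched keypoints with a cross-attention architecture, improving geometric estimation accuracy in 3D vision tasks and performing strongly on several datasets.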

Details Motivation: Existing keypoint refinement methods are usually tied to a specific detector and need retraining, so they generalize poorly; meanwhile, current keypoint matches are spatially inaccurate. Method: Proposes XRefine, a cross-attention-based network that refines matches using only the image patches around matched keypoints, does not depend on internal detector representations, and supports multiple detectors and multi-view feature tracks. Result: Experiments on MegaDepth, KITTI, and ScanNet show that XRefine improves geometric estimation accuracy beyond existing refinement methods while remaining efficient at runtime. Conclusion: XRefine is a general, efficient, detector-agnostic keypoint refinement framework that markedly improves the accuracy of sparse keypoint matching and suits a variety of 3D vision tasks. Abstract: Sparse keypoint matching is crucial for 3D vision tasks, yet current keypoint detectors often produce spatially inaccurate matches. Existing refinement methods mitigate this issue through alignment of matched keypoint locations, but they are typically detector-specific, requiring retraining for each keypoint detector. We introduce XRefine, a novel, detector-agnostic approach for sub-pixel keypoint refinement that operates solely on image patches centered at matched keypoints. Our cross-attention-based architecture learns to predict refined keypoint coordinates without relying on internal detector representations, enabling generalization across detectors. Furthermore, XRefine can be extended to handle multi-view feature tracks. Experiments on MegaDepth, KITTI, and ScanNet demonstrate that the approach consistently improves geometric estimation accuracy, achieving superior performance compared to existing refinement methods while maintaining runtime efficiency. Our code and trained models can be found at https://github.com/boschresearch/xrefine.

[297] BirdsEye-RU: A Dataset For Detecting Faces from Overhead Images

Md. Ahanaf Arif Khan,Ariful Islam,Sangeeta Biswas,Md. Iqbal Aziz Khan,Subrata Pramanik,Sanjoy Kumar Chakrabarty,Bimal Kumar Pramanik

Main category: cs.CV

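TL;DR: Introduces BirdsEye-RU, a dataset of 2,978 images with over eight thousand annotated faces, built to address the extreme scale variation and environmental clutter that make face detection in overhead images difficult.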

Details Motivation: Faces in overhead images show extreme scale variation and complex environmental clutter that existing face detectors handle poorly, so a dataset dedicated to small and distant faces is needed to advance research. Method: Builds the BirdsEye-RU dataset of 2,978 images captured by drones and by smartphones at high altitude, with over eight thousand annotated faces across a range of real-world scenes. The dataset is publicly available on Kaggle. Result: Provides a new face detection dataset focused on small and distant faces, suitable for evaluating and improving existing detectors in complex overhead environments. Conclusion: BirdsEye-RU is a useful resource for face detection research on overhead images and should help advance detection under large scale variation and environmental clutter. Abstract: Detecting faces in overhead images remains a significant challenge due to extreme scale variations and environmental clutter. To address this, we created the BirdsEye-RU dataset, a comprehensive collection of 2,978 images containing over eight thousand annotated faces. This dataset is specifically designed to capture small and distant faces across diverse environments, containing both drone images and smartphone-captured images from high altitude. We present a detailed description of the BirdsEye-RU dataset in this paper. We made our dataset freely available to the public, and it can be accessed at https://www.kaggle.com/datasets/mdahanafarifkhan/birdseye-ru.

[298] Encoding Emotion Through Self-Supervised Eye Movement Reconstruction

Marcus Ma,Jordan Prescott,Emily Zhou,Tiantian Feng,Kleanthis Avramidis,Gabor Mihaly Toth,Shrikanth Narayanan

Main category: cs.CV

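TL;DR: Proposes a gaze detection model based on self-supervised eye movement reconstruction that uses eye movement in low-resolution video to predict multimodal markers of emotional expression, validated on emotion alignment and emotional behavior prediction in interview videos with Holocaust survivors.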

Details Motivation: Most research on emotional expression depends on high-precision eye-tracking equipment, which limits practical reach. This work explores how to extract eye movement information from naturalistic, low-resolution video for emotion recognition, widening the settings in which findings apply. Method: Inspired by language model pretraining, proposes a self-supervised eye movement reconstruction model pretrained on unlabeled video, and uses its encoder embeddings to fine-tune two downstream tasks: aligning eye movement with directional emotion estimates from speech, and predicting three momentary emotional behaviors (laughing, crying/sobbing, and sighing). Result: The model is predictive of emotion-related outcomes, and pretraining performance correlates positively with emotion processing performance on both downstream tasks. Conclusion: Self-supervised eye movement reconstruction is an effective way to encode the affective signal in eye movement and captures emotion-related visual features in low-resolution, naturalistic video, suggesting broad applicability. Abstract: The relationship between emotional expression and eye movement is well-documented, with literature establishing that gaze patterns are reliable indicators of emotion. However, most studies utilize specialized, high-resolution eye-tracking equipment, limiting the potential reach of findings. We investigate how eye movement can be used to predict multimodal markers of emotional expression from naturalistic, low-resolution videos. We utilize a collection of video interviews from the USC Shoah Foundation's Visual History Archive with Holocaust survivors as they recount their experiences in the Auschwitz concentration camp. Inspired by pretraining methods on language models, we develop a novel gaze detection model that uses self-supervised eye movement reconstruction and can effectively leverage unlabeled video. We use this model's encoder embeddings to fine-tune models on two downstream tasks related to emotional expression. The first is aligning eye movement with directional emotion estimates from speech. The second task is using eye gaze as a predictor of three momentary manifestations of emotional behaviors: laughing, crying/sobbing, and sighing. We find our new model is predictive of emotion outcomes and observe a positive correlation between pretraining performance and emotion processing performance for both experiments. We conclude that self-supervised eye movement reconstruction is an effective method for encoding the affective signal that eye movements carry.

[299] PISE: Physics-Anchored Semantically-Enhanced Deep Computational Ghost Imaging for Robust Low-Bandwidth Machine Perception

Tong Wu

Main category: cs.CV

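TL;DR: PISE is a physics-informed deep ghost imaging framework for low-bandwidth edge perception; by combining adjoint operator initialization with semantic guidance, it improves classification accuracy by 2.57% and reduces variance by 9x at 5% sampling.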

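Details Motivation: Efficient edge perception under low bandwidth requires better reconstruction quality and stability from existing ghost imaging methods. Method: Proposes the PISE framework, which combines adjoint operator initialization with semantic guidance and uses physical information to strengthen the imaging performance of the deep learning model. Result: At a 5% sampling rate, classification accuracy improves by 2.57% and variance is reduced by 9x. Conclusion: PISE improves imaging quality and classification performance at low sampling rates and suits resource-constrained edge perception applications. Abstract: We propose PISE, a physics-informed deep ghost imaging framework for low-bandwidth edge perception. By combining adjoint operator initialization with semantic guidance, PISE improves classification accuracy by 2.57% and reduces variance by 9x at 5% sampling.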

[300] Camera Pose Revisited

Władysław Skarbek,Michał Salomonowicz,Michał Król

Main category: cs.CV

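TL;DR: Proposes PnP-ProCay78, a new PnP algorithm for camera pose estimation that combines a Cayley parameterization of rotations with least-squares optimization and offers a simple structure, efficient computation, and intuitive convergence behavior.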

Details Motivation: To balance efficiency and accuracy in initial pose estimation for the planar PnP problem, particularly for calibration in multi-sensor systems. Method: Represents rotations with a Cayley parameterization, combines it with a quadratic reconstruction error model, and determines deterministic starting points by analyzing the reconstruction error for two canonical vectors, avoiding a costly search. Result: On data from RGB cameras and low-resolution thermal cameras, the method reaches projection accuracy comparable to SQPnP and slightly better than IPPE, with a simpler algorithmic structure and more interpretable optimization trajectories. Conclusion: PnP-ProCay78 keeps accuracy high while simplifying the algorithm, and its hybrid cost function combines geometric transparency with computational efficiency, making it suitable for both teaching and practical use. Abstract: Estimating the position and orientation of a camera with respect to an observed scene is one of the central problems in computer vision, particularly in the context of camera calibration and multi-sensor systems. This paper addresses the planar Perspective-n-Point problem, with special emphasis on the initial estimation of the pose of a calibration object. As a solution, we propose the PnP-ProCay78 algorithm, which combines the classical quadratic formulation of the reconstruction error with a Cayley parameterization of rotations and least-squares optimization. The key component of the method is a deterministic selection of starting points based on an analysis of the reconstruction error for two canonical vectors, allowing costly solution-space search procedures to be avoided. Experimental validation is performed using data acquired from both high-resolution RGB cameras and very low-resolution thermal cameras in an integrated RGB-IR setup. The results demonstrate that the proposed algorithm achieves practically the same projection accuracy as optimal SQPnP and slightly higher than IPPE, both prominent PnP-OpenCV procedures. However, PnP-ProCay78 maintains a significantly simpler algorithmic structure. Moreover, the analysis of optimization trajectories in Cayley space provides an intuitive insight into the convergence process, making the method attractive also from a didactic perspective.
Unlike existing PnP solvers, the proposed PnP-ProCay78 algorithm combines projection error minimization with an analytically eliminated reconstruction-error surrogate for translation, yielding a hybrid cost formulation that is both geometrically transparent and computationally efficient.
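
For readers unfamiliar with the Cayley parameterization mentioned in this entry, the sketch below maps a 3-vector s = tan(θ/2)·axis to a rotation matrix using the standard closed form R = ((1 − |s|²)I + 2ssᵀ + 2[s]×) / (1 + |s|²). This is the textbook map; whether PnP-ProCay78 uses exactly this convention is an assumption, and the function name is illustrative.

```python
def cayley_to_rotation(s):
    """Map a Cayley vector s = tan(theta / 2) * axis to a 3x3 rotation matrix.

    Rotations by exactly 180 degrees correspond to |s| -> infinity and are
    therefore not representable with a finite Cayley vector.
    """
    x, y, z = s
    n2 = x * x + y * y + z * z                  # squared norm |s|^2
    scale = 1.0 / (1.0 + n2)
    outer = [[x * x, x * y, x * z],
             [y * x, y * y, y * z],
             [z * x, z * y, z * z]]             # s s^T
    skew = [[0.0, -z, y],
            [z, 0.0, -x],
            [-y, x, 0.0]]                       # cross-product matrix [s]x
    rotation = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            identity = 1.0 if i == j else 0.0
            rotation[i][j] = scale * ((1.0 - n2) * identity
                                      + 2.0 * outer[i][j]
                                      + 2.0 * skew[i][j])
    return rotation


# s = (0, 0, tan(45 degrees)) encodes a 90-degree rotation about the z axis.
R = cayley_to_rotation((0.0, 0.0, 1.0))
for row in R:
    print(row)
```

Because the entries of R are rational functions of s with no trigonometric calls, a least-squares cost over s stays polynomial-like, which is one common reason to prefer this parameterization in pose solvers.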

[301] Linear Mechanisms for Spatiotemporal Reasoning in Vision Language Models

Raphi Kang,Hongqiao Chen,Georgia Gkioxari,Pietro Perona

Main category: cs.CV

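TL;DR: Reveals the internal mechanism of spatial and temporal reasoning in vision language models (VLMs): models encode object locations by linearly binding "spatial IDs" to textual activations and reason via language tokens; the mechanism causally influences model beliefs at intermediate layers, and an analogous "temporal ID" mechanism exists in video VLMs.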

Details Motivation: VLMs can reason about space and time, but the underlying mechanism is unclear; the authors investigate how visual and textual representations are combined and whether the combined representation causally explains model behavior. Method: Searches empirically for the point where visual and textual representations meet, proposes and validates a linear binding mechanism for spatial IDs, tests its effect on intermediate-layer model beliefs with rigorous causal interventions, and extends the analysis to temporal IDs in video VLMs. Result: VLMs widely use linearly bound spatial IDs to encode location and reason via language tokens; spatial IDs systematically mediate intermediate-layer beliefs; the ID mechanism can diagnose model limitations and serve as a learning signal; an analogous temporal ID mechanism exists in video VLMs. Conclusion: The paper uncovers a previously underexplored spatiotemporal ID reasoning mechanism inside VLMs, improving interpretability and offering a principled basis for designing more aligned and capable VLMs. Abstract: Spatio-temporal reasoning is a remarkable capability of Vision Language Models (VLMs), but the underlying mechanisms of such abilities remain largely opaque. We postulate that visual/geometrical and textual representations of spatial structure must be combined at some point in VLM computations. We search for such confluence, and ask whether the identified representation can causally explain aspects of input-output model behavior through a linear model. We show empirically that VLMs encode object locations by linearly binding spatial IDs to textual activations, then perform reasoning via language tokens. Through rigorous causal interventions we demonstrate that these IDs, which are ubiquitous across the model, can systematically mediate model beliefs at intermediate VLM layers. Additionally, we find that spatial IDs serve as a diagnostic tool for identifying limitations in existing VLMs, and as a valuable learning signal. We extend our analysis to video VLMs and identify an analogous linear temporal ID mechanism. By characterizing our proposed spatiotemporal ID mechanism, we elucidate a previously underexplored internal reasoning process in VLMs, toward improved interpretability and the principled design of more aligned and capable models. We release our code for reproducibility: https://github.com/Raphoo/linear-mech-vlms.

[302] From Bands to Depth: Understanding Bathymetry Decisions on Sentinel-2

Satyaki Roy Chowdhury,Aswathnarayan Radhakrishnan,Hsiao Jou Hsu,Hari Subramoni,Joachim Moortgat

Main category: cs.CV

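TL;DR: Analyzes a Swin-Transformer-based U-Net model (Swin-BathyUNet) for bathymetry from Sentinel-2 satellite imagery, uses explainability methods to check what its predictions rely on, and gives practical guidance for cross-region use.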

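Details Motivation: Robustly deriving water depth from Sentinel-2 imagery across different sites remains challenging, and it requires understanding how the model reasons and when its predictions can be trusted. Method: Uses a Swin-Transformer U-Net (Swin-BathyUNet) for depth estimation; ranks spectral importance with a leave-one-band-out study; adapts ablation-based class activation mapping to regression (A-CAM-R) for explainability; validates explanation reliability with a performance retention test; and runs cross-region inference experiments to assess generalization. Result: The spectral importance ranking is consistent with shallow-water optics; A-CAM-R shows that the model attends to the pixels it relies on, so the explanations are reliable; decoder-conditioned cross attention on skip connections improves robustness to glint and foam; cross-region inference shows error growing nearly linearly with depth, and bimodal depth distributions worsen mid/deep errors. Conclusion: Maintaining wide receptive fields, preserving radiometric fidelity in the green/blue bands, pre-filtering bright high-variance near-shore regions, and pairing light target-site fine-tuning with depth-aware calibration improve cross-region deployment. Abstract: Deploying Sentinel-2 satellite derived bathymetry (SDB) robustly across sites remains challenging. We analyze a Swin-Transformer based U-Net model (Swin-BathyUNet) to understand how it infers depth and when its predictions are trustworthy. A leave-one-band-out study ranks the spectral importance of the different bands, consistent with shallow-water optics. We adapt ablation-based CAM to regression (A-CAM-R) and validate the reliability via a performance retention test: keeping only the top-p% salient pixels while neutralizing the rest causes a large, monotonic RMSE increase, indicating explanations localize on evidence the model relies on. Attention ablations show that decoder-conditioned cross attention on skips is an effective upgrade, improving robustness to glint/foam. Cross-region inference (train on one site, test on another) reveals depth-dependent degradation: MAE rises nearly linearly with depth, and bimodal depth distributions exacerbate mid/deep errors. Practical guidance follows: maintain wide receptive fields, preserve radiometric fidelity in green/blue channels, pre-filter bright high-variance near-shore regions, and pair light target-site fine-tuning with depth-aware calibration to transfer across regions.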
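
The performance retention test in this entry can be pictured as a small loop: keep the top-p% most salient pixels, neutralize the rest, and record the RMSE at each p. The sketch below works on flat lists and uses a stand-in "model" that simply echoes its input; the helper names, the neutral value of 0.0, and the toy data are all assumptions made for illustration, not details from the paper.

```python
import math


def retain_top_salient(pixels, saliency, percent, neutral=0.0):
    """Keep the top-`percent`% most salient pixels; replace the others with `neutral`."""
    n_keep = max(1, (len(pixels) * percent + 99) // 100)  # integer ceiling
    order = sorted(range(len(pixels)), key=lambda i: saliency[i], reverse=True)
    keep = set(order[:n_keep])
    return [v if i in keep else neutral for i, v in enumerate(pixels)]


def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target))


def retention_curve(model, pixels, saliency, target, percents):
    """RMSE of the model output when only the top-p% salient pixels are kept."""
    return [(p, rmse(model(retain_top_salient(pixels, saliency, p)), target))
            for p in percents]


# Toy setup (hypothetical): the "model" returns its input as the depth estimate.
pixels = [1.0, 2.0, 3.0, 4.0]
saliency = [0.1, 0.9, 0.5, 0.3]
curve = retention_curve(lambda x: x, pixels, saliency, pixels, [100, 50, 25])
print(curve)
```

With a real network, `model` would be the trained depth estimator and `saliency` the per-pixel attribution map; the shape of the resulting curve is what the test inspects.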

[303] Mixed Precision PointPillars for Efficient 3D Object Detection with TensorRT

Ninnart Fuengfusin,Keisuke Yoneda,Naoki Suganuma

Main category: cs.CV

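TL;DR: Proposes a mixed precision framework for PointPillars that identifies sensitive layers with post-training quantization and uses a small calibration set to cope with the LIDAR data distribution, substantially reducing latency and model size without sacrificing accuracy.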

Details Motivation: LIDAR 3D object detection is essential for autonomous driving, but applying model quantization directly degrades performance because of the wide numerical distribution and extreme outliers in LIDAR data. Method: Proposes a mixed precision framework: first quantizes and evaluates one layer at a time with post-training quantization (PTQ), keeps the k most sensitive layers in floating point (FP), and greedily searches combinations to produce candidate models; a small amount of calibration data is used to reduce the influence of outliers. The final model is produced with PTQ or quantization-aware training (QAT). Result: With TensorRT deployment, latency and model size are reduced by up to 2.35x and 2.26x; the PTQ pipeline yields a mixed precision model without training, and the QAT pipeline performs on par with the full-precision model. Conclusion: The method addresses sensitive layers and outliers in quantizing LIDAR models and delivers efficient, low-latency 3D detection suitable for real-time autonomous driving systems. Abstract: LIDAR 3D object detection is one of the important tasks for autonomous vehicles. Ensuring that this task operates in real-time is crucial. Toward this, model quantization can be used to accelerate the runtime. However, directly applying model quantization often leads to performance degradation due to LIDAR's wide numerical distributions and extreme outliers. To address the wide numerical distribution, we propose a mixed precision framework designed for PointPillars. Our framework first searches for sensitive layers with post-training quantization (PTQ) by quantizing one layer at a time to 8-bit integer (INT8) and evaluating each model for average precision (AP). The top-k most sensitive layers are assigned as floating point (FP). Combinations of these layers are greedily searched to produce candidate mixed precision models, which are finalized with either PTQ or quantization-aware training (QAT). Furthermore, to handle outliers, we observe that using a very small number of calibration data reduces the likelihood of encountering outliers, thereby improving PTQ performance. Our method provides mixed precision models without training in the PTQ pipeline, while our QAT pipeline achieves performance competitive with FP models. With TensorRT deployment, our models reduce latency and model size by up to 2.35 and 2.26 times, respectively.
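
The layer-sensitivity search in this entry can be sketched without any quantization library: quantize one layer at a time, rank layers by the AP they lose, then build candidates that keep the top-i layers in floating point. Everything below is a hypothetical illustration; `evaluate_ap` stands in for a real "quantize these layers and measure AP" routine, the layer names and penalties are invented, and the prefix-style candidate generation is one possible reading of "greedily searched".

```python
def rank_sensitive_layers(layers, evaluate_ap, baseline_ap):
    """Quantize one layer at a time and rank layers by the AP they lose."""
    drops = {layer: baseline_ap - evaluate_ap({layer}) for layer in layers}
    return sorted(layers, key=lambda layer: drops[layer], reverse=True)


def mixed_precision_candidates(layers, evaluate_ap, baseline_ap, k):
    """Build candidates that keep the top-i most sensitive layers in FP, for i = 0..k."""
    ranked = rank_sensitive_layers(layers, evaluate_ap, baseline_ap)
    candidates = []
    for i in range(k + 1):
        fp_layers = ranked[:i]
        int8_layers = set(layers) - set(fp_layers)
        candidates.append((fp_layers, evaluate_ap(int8_layers)))
    return candidates


# Toy stand-in for "quantize these layers to INT8 and measure AP" (hypothetical).
BASELINE_AP = 70.0
PENALTY = {"pillar_encoder": 5.0, "backbone_1": 0.5, "backbone_2": 0.25, "head": 2.0}


def toy_evaluate_ap(int8_layers):
    return BASELINE_AP - sum(PENALTY[name] for name in int8_layers)


layers = list(PENALTY)
ranking = rank_sensitive_layers(layers, toy_evaluate_ap, BASELINE_AP)
candidates = mixed_precision_candidates(layers, toy_evaluate_ap, BASELINE_AP, k=2)
print(ranking)
print(candidates)
```

In this toy setting, keeping only the two most sensitive layers in floating point recovers most of the lost AP while the remaining layers stay in INT8, which is the trade-off the framework is searching for.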

[304] Generalizable Hyperparameter Optimization for Federated Learning on Non-IID Cancer Images

Elisa Gonçalves Ribeiro,Rodrigo Moreira,Larissa Ferreira Rodrigues Moreira,André Ricardo Backes

Main category: cs.CV

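TL;DR: Studies whether hyperparameters found by Bayesian optimization transfer across cancer image datasets in non-IID (non-independent and identically distributed) federated learning, and proposes a simple cross-dataset aggregation heuristic that achieves competitive performance on ovarian and colorectal cancer histopathology classification.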

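Details Motivation: Deep learning for cancer histopathology is constrained by privacy requirements in clinical settings. Federated learning keeps data local, but its performance depends on hyperparameter choices under non-IID data, so transferable and robust hyperparameter configurations are needed. Method: Runs centralized Bayesian hyperparameter optimization to obtain the optimum on each dataset and transfers it to the non-IID federated setting; proposes a cross-dataset aggregation heuristic that combines configurations by averaging the learning rates and taking the modal optimizer and batch size. Result: The transferred hyperparameters perform well in the federated setting, and the aggregation heuristic reaches competitive classification performance on binary tasks for ovarian and colorectal cancer. Conclusion: Optimized hyperparameters generalize across datasets to some degree; combined with the simple aggregation strategy, they support cancer histopathology analysis in non-IID federated learning and help advance privacy-preserving distributed medical imaging research. Abstract: Training deep learning models for cancer histopathology conflicts with privacy constraints in clinical settings. Federated Learning (FL) mitigates this by keeping data local; however, its performance depends on hyperparameter choices under non-independent and identically distributed (non-IID) client datasets. This paper examines whether hyperparameters optimized on one cancer imaging dataset generalize across non-IID federated scenarios. We consider binary histopathology tasks for ovarian and colorectal cancers. We perform centralized Bayesian hyperparameter optimization and transfer dataset-specific optima to the non-IID FL setup. The main contribution of this study is a simple cross-dataset aggregation heuristic that combines configurations by averaging the learning rates and taking the modal optimizers and batch sizes. This combined configuration achieves competitive classification performance.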
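
The aggregation heuristic in this entry (mean learning rate, modal optimizer, modal batch size) is simple enough to write down directly. The sketch below assumes each per-dataset optimum is a dictionary with these three keys; the key names and the example configurations are hypothetical. Note that with only two datasets a tie is likely, and `statistics.mode` then returns the first value encountered, which is a tie-breaking choice of this sketch rather than something the abstract specifies.

```python
from statistics import mean, mode


def aggregate_configs(configs):
    """Combine per-dataset optima: average the learning rates, take modal optimizer and batch size."""
    return {
        "learning_rate": mean(c["learning_rate"] for c in configs),
        "optimizer": mode(c["optimizer"] for c in configs),
        "batch_size": mode(c["batch_size"] for c in configs),
    }


# Hypothetical per-dataset optima from separate Bayesian optimization runs.
per_dataset_optima = [
    {"learning_rate": 0.001, "optimizer": "adam", "batch_size": 32},
    {"learning_rate": 0.003, "optimizer": "sgd", "batch_size": 64},
    {"learning_rate": 0.002, "optimizer": "adam", "batch_size": 32},
]
combined = aggregate_configs(per_dataset_optima)
print(combined)
```

The appeal of such a rule is that it needs no further search in the federated setting: the combined configuration is computed once from centralized runs and then shared with every client.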

[305] Near-Light Color Photometric Stereo for mono-Chromaticity non-lambertian surface

Zonglin Li,Jieji Ren,Shuangfan Zhou,Heng Guo,Jinnuo Zhang,Jiang Zhou,Boxin Shi,Zhanyu Ma,Guoying Gu

Main category: cs.CV

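TL;DR: Proposes a single-shot color photometric stereo framework based on neural implicit representations that models depth and BRDF under near-light conditions on non-Lambertian surfaces, validated with a compact optical tactile sensor.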

Details Motivation: Most existing color photometric stereo methods assume ideal distant lighting and Lambertian reflectance, so they do not carry over to more practical near-light conditions and non-Lambertian surfaces, which limits their use in dynamic scenes. Method: Uses neural implicit representations to model depth and BRDF under a mono-chromaticity assumption (uniform chromaticity and homogeneous material), which eases the ill-posedness of color photometric stereo, and validates the approach with a compact optical tactile sensor. Result: Experiments on synthetic and real-world datasets show that the method achieves accurate and robust surface reconstruction. Conclusion: The framework addresses single-shot surface reconstruction of non-Lambertian surfaces under near-light conditions and extends color photometric stereo to dynamic and practical scenes. Abstract: Color photometric stereo enables single-shot surface reconstruction, extending conventional photometric stereo that requires multiple images of a static scene under varying illumination to dynamic scenarios. However, most existing approaches assume ideal distant lighting and Lambertian reflectance, leaving more practical near-light conditions and non-Lambertian surfaces underexplored. To overcome this limitation, we propose a framework that leverages neural implicit representations for depth and BRDF modeling under the assumption of mono-chromaticity (uniform chromaticity and homogeneous material), which alleviates the inherent ill-posedness of color photometric stereo and allows for detailed surface recovery from just one image. Furthermore, we design a compact optical tactile sensor to validate our approach. Experiments on both synthetic and real-world datasets demonstrate that our method achieves accurate and robust surface reconstruction.

[306] Exploiting Test-Time Augmentation in Federated Learning for Brain Tumor MRI Classification

Thamara Leandra de Deus Melo,Rodrigo Moreira,Larissa Ferreira Rodrigues Moreira,André Ricardo Backes

Main category: cs.CV

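TL;DR: In a federated learning setting, combining test-time augmentation (TTA) with lightweight preprocessing significantly improves CNN classification of brain tumor MRI images.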

Details Motivation: Brain tumor diagnosis is challenging because of lesion variability and image complexity, and efficient, accurate automated methods are needed. Method: Evaluates convolutional neural networks (CNNs) trained within a federated learning framework on original versus preprocessed MRI images, and compares results with and without test-time augmentation (TTA). Result: Preprocessing alone brings little benefit, but combined with TTA it yields consistent, statistically significant gains in federated MRI classification (p<0.001). Conclusion: TTA should be the default inference strategy in federated medical image analysis; when the computational budget allows, adding lightweight preprocessing provides further reliable gains. Abstract: Efficient brain tumor diagnosis is crucial for early treatment; however, it is challenging because of lesion variability and image complexity. We evaluated convolutional neural networks (CNNs) in a federated learning (FL) setting, comparing models trained on original versus preprocessed MRI images (resizing, grayscale conversion, normalization, filtering, and histogram equalization). Preprocessing alone yielded negligible gains; combined with test-time augmentation (TTA), it delivered consistent, statistically significant improvements in federated MRI classification (p<0.001). In practice, TTA should be the default inference strategy in FL-based medical imaging; when the computational budget permits, pairing TTA with light preprocessing provides additional reliable gains.
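
Test-time augmentation, as recommended in this entry, usually means running the trained model on several augmented views of the same input and averaging the predicted class probabilities. The sketch below shows that averaging step with two flips as the augmentations; the choice of augmentations, the function names, and the toy model are assumptions for illustration, since the abstract does not list which augmentations were used.

```python
def hflip(image):
    """Mirror each row of a 2D image (list of rows)."""
    return [row[::-1] for row in image]


def vflip(image):
    """Reverse the row order of a 2D image."""
    return image[::-1]


def tta_predict(model, image, augmentations):
    """Average class probabilities over the original image and its augmented views."""
    views = [image] + [augment(image) for augment in augmentations]
    probabilities = [model(view) for view in views]
    n_classes = len(probabilities[0])
    return [sum(p[c] for p in probabilities) / len(probabilities)
            for c in range(n_classes)]


def toy_model(image):
    """Hypothetical, deliberately orientation-sensitive classifier with two classes."""
    score = image[0][0]
    return [score, 1.0 - score]


image = [[0.25, 0.75],
         [0.50, 1.00]]
averaged = tta_predict(toy_model, image, [hflip, vflip])
print(averaged)
```

The toy model gives a different answer for each view, and the average smooths that variation out; with a real CNN the same loop runs at inference time only, so it adds compute per prediction but requires no retraining.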

[307] VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness

Qimao Chen,Fang Li,Shaoqing Xu,Zhiyi Lai,Zixun Xie,Yuechen Luo,Shengyin Jiang,Hanbing Li,Long Chen,Bing Wang,Yi Zhang,Zhi-Xin Yang

Main category: cs.CV

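TL;DR: This paper proposes VILTA, a framework that integrates a vision-language model (VLM) directly into the closed-loop training of autonomous driving (AD) systems to actively generate diverse and challenging driving scenarios, improving safety and robustness on long-tail cases.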

Details Motivation: Deployed autonomous driving systems face the long-tail problem: rare but critical driving scenarios are severely underrepresented in real data. Existing methods rely on rule-based heuristics, resampling, or generative models trained on offline data, and struggle to produce novel and diverse hazardous scenarios, which limits safety training. Method: VILTA embeds a VLM in the closed-loop training of AD agents. The VLM interprets the dynamic driving environment and directly makes fine-grained edits to the future trajectories of surrounding vehicles, producing challenging scenarios that form a continually evolving training curriculum. Result: Experiments show that VILTA generates a richer variety of plausible yet hazardous driving scenarios and substantially improves the safety and robustness of the AD policy, outperforming traditional methods particularly on critical long-tail events. Conclusion: By making the VLM an active participant in the loop, VILTA gains direct control over the generated scenarios and fully exploits the VLM's generalization ability, offering a new approach to long-tail safety in autonomous driving. Abstract: The safe deployment of autonomous driving (AD) systems is fundamentally hindered by the long-tail problem, where rare yet critical driving scenarios are severely underrepresented in real-world data. Existing solutions including safety-critical scenario generation and closed-loop learning often rely on rule-based heuristics, resampling methods and generative models learned from offline datasets, limiting their ability to produce diverse and novel challenges. While recent works leverage Vision Language Models (VLMs) to produce scene descriptions that guide a separate, downstream model in generating hazardous trajectories for agents, such a two-stage framework constrains the generative potential of VLMs, as the diversity of the final trajectories is ultimately limited by the generalization ceiling of the downstream algorithm. To overcome these limitations, we introduce VILTA (VLM-In-the-Loop Trajectory Adversary), a novel framework that integrates a VLM into the closed-loop training of AD agents. Unlike prior works, VILTA actively participates in the training loop by comprehending the dynamic driving environment and strategically generating challenging scenarios through direct, fine-grained editing of surrounding agents' future trajectories. This direct-editing approach fully leverages the VLM's powerful generalization capabilities to create a diverse curriculum of plausible yet challenging scenarios that extend beyond the scope of traditional methods.
We demonstrate that our approach substantially enhances the safety and robustness of the resulting AD policy, particularly in its ability to navigate critical long-tail events.

[308] Fusion-Restoration Image Processing Algorithm to Improve the High-Temperature Deformation Measurement

Banglei Guan,Dongcai Tan,Jing Tao,Ang Su,Yang Shang,Qifeng Yu

Main category: cs.CV

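TL;DR: A fusion-restoration image processing method is proposed to suppress the image degradation caused by thermal radiation and heat haze in high-temperature deformation measurement, significantly improving the accuracy and effective computation area of digital image correlation (DIC).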

Details Motivation: In deformation measurement of high-temperature structures, image degradation from thermal radiation and random errors from heat haze limit the accuracy and effectiveness of DIC, so image processing methods that suppress these disturbances are needed. Method: For thermal radiation, the image is decomposed into positive and negative channels based on a layered image representation, and image quality is optimized by multi-exposure image fusion. For the high-frequency random errors caused by heat haze, model parameters are iteratively optimized with FSIM as the objective, and a grayscale average algorithm equalizes anomalous gray values to reduce measurement error. Result: Multi-exposure image fusion raises the effective computation area from 26% to 50% for under-exposed images and from 32% to 40% for over-exposed images without loss of measurement accuracy; image restoration combined with grayscale averaging reduces the ε_xx measurement error by 85.3% and the ε_yy and γ_xy errors by 36.0% and 36.4%, respectively. Conclusion: The proposed method suppresses thermal radiation and heat haze interference, improves image quality, and reduces DIC deformation measurement errors at high temperature, with potential application in thermal deformation measurement. Abstract: In the deformation measurement of high-temperature structures, image degradation caused by thermal radiation and random errors introduced by heat haze restrict the accuracy and effectiveness of deformation measurement. This work aims to suppress thermal radiation and heat haze using fusion-restoration image processing methods, thereby improving the accuracy and effectiveness of DIC in the measurement of high-temperature deformation. For image degradation caused by thermal radiation, based on the image layered representation, the image is decomposed into positive and negative channels for parallel processing, and then optimized for quality by multi-exposure image fusion. To counteract the high-frequency, random errors introduced by heat haze, we adopt the FSIM as the objective function to guide the iterative optimization of model parameters, and the grayscale average algorithm is applied to equalize anomalous gray values, thereby reducing measurement error. The proposed multi-exposure image fusion algorithm effectively suppresses image degradation caused by complex illumination conditions, boosting the effective computation area from 26% to 50% for under-exposed images and from 32% to 40% for over-exposed images without degrading measurement accuracy in the experiment. Meanwhile, the image restoration combined with the grayscale average algorithm reduces static thermal deformation measurement errors. The error in ε_xx is reduced by 85.3%, while the errors in ε_yy and γ_xy are reduced by 36.0% and 36.4%, respectively.
We present image processing methods to suppress the interference of thermal radiation and heat haze in high-temperature deformation measurement using DIC. The experimental results verify that the proposed method can effectively improve image quality, reduce deformation measurement errors, and has potential application value in thermal deformation measurement.

[309] GaussianTrimmer: Online Trimming Boundaries for 3DGS Segmentation

Liwei Liao,Ronggang Wang

Main category: cs.CV

TL;DR: GaussianTrimmer, an online boundary trimming method, is proposed to improve the boundary quality of 3D Gaussian-based segmentation methods.

Details Motivation: Because Gaussian primitives vary widely in size, existing 3D Gaussian segmentation methods produce heavily jagged boundaries, especially when large Gaussians span both foreground and background. Method: GaussianTrimmer has two core steps: 1) generating uniformly distributed virtual cameras with good coverage; 2) trimming at the level of individual Gaussian primitives based on 2D segmentation results from the virtual cameras. Result: Extensive experiments show that the method improves the segmentation quality of existing 3D Gaussian segmentation methods and is general and plug-and-play. Conclusion: GaussianTrimmer is an efficient, extensible post-processing method that markedly improves segmentation boundary accuracy for scenes represented with 3D Gaussians. Abstract: With the widespread application of 3D Gaussians in 3D scene representation, 3D scene segmentation methods based on 3D Gaussians have also gradually emerged. However, existing 3D Gaussian segmentation methods basically segment on the basis of Gaussian primitives. Due to the large variation range of the scale of 3D Gaussians, large-sized Gaussians that often span the foreground and background lead to jagged boundaries of segmented objects. To this end, we propose an online boundary trimming method, GaussianTrimmer, which is an efficient and plug-and-play post-processing method capable of trimming coarse boundaries for existing 3D Gaussian segmentation methods. Our method consists of two core steps: 1. Generating uniformly distributed and well-covered virtual cameras; 2. Trimming Gaussians at the primitive level based on 2D segmentation results on virtual cameras. Extensive quantitative and qualitative experiments demonstrate that our method can improve the segmentation quality of existing 3D Gaussian segmentation methods as a plug-and-play method.

[310] Fusing in 3D: Free-Viewpoint Fusion Rendering with a 3D Infrared-Visible Scene Representation

Chao Yang,Deshui Miao,Chao Tian,Guoqing Zhu,Yameng Gu,Zhenyu He

Main category: cs.CV

TL;DR: A new Infrared-Visible Gaussian Fusion (IVGF) framework is proposed that reconstructs scene geometry to render fused multimodal images directly.

Details Motivation: Existing 2D fusion methods are restricted to fixed viewpoints, which loses critical information in complex scenes and prevents a comprehensive understanding of the scene. Method: The IVGF framework includes a cross-modal adjustment (CMA) module that modulates Gaussian opacity, and a fusion loss that guides the optimization of CMA and preserves features from both modalities. Result: Experiments show that the method is effective in both qualitative and quantitative evaluations and better preserves the key features of infrared and visible images. Conclusion: Through geometry reconstruction and cross-modal optimization, IVGF achieves better infrared-visible image fusion. Abstract: Infrared-visible image fusion aims to integrate infrared and visible information into a single fused image. Existing 2D fusion methods focus on fusing images from fixed camera viewpoints, neglecting a comprehensive understanding of complex scenarios, which results in the loss of critical information about the scene. To address this limitation, we propose a novel Infrared-Visible Gaussian Fusion (IVGF) framework, which reconstructs scene geometry from multimodal 2D inputs and enables direct rendering of fused images. Specifically, we propose a cross-modal adjustment (CMA) module that modulates the opacity of Gaussians to solve the problem of cross-modal conflicts. Moreover, to preserve the distinctive features from both modalities, we introduce a fusion loss that guides the optimization of CMA, thus ensuring that the fused image retains the critical characteristics of each modality. Comprehensive qualitative and quantitative experiments demonstrate the effectiveness of the proposed method.

[311] P2L-CA: An Effective Parameter Tuning Framework for Rehearsal-Free Multi-Label Class-Incremental Learning

Songlin Dong,Jiangyang Li,Chenhao Ding,Zhiheng Ma,Haoyu Luo,Yuhang He,Yihong Gong

Main category: cs.CV

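TL;DR: This paper proposes P2L-CA, a parameter-efficient framework that addresses the computation and storage bottlenecks of multi-label class-incremental learning. By combining a Prompt-to-Label module with a Continuous Adapter module, it significantly improves performance and generalization without a memory buffer.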

Details Motivation: Existing multi-label class-incremental learning methods suffer from high computational cost, large storage overhead, feature confusion, and domain discrepancies, so an efficient and robust solution is needed. Method: The P2L-CA framework consists of a Prompt-to-Label (P2L) module and a Continuous Adapter (CA) module. P2L uses class-specific prompts to disentangle multi-label representations and introduces linguistic priors to stabilize semantic-visual alignment; CA uses lightweight adapters to reduce the domain gap between pre-trained models and downstream tasks, improving model plasticity. Result: Under standard and challenging MLCIL settings on MS-COCO and PASCAL VOC, P2L-CA significantly outperforms state-of-the-art methods with very few trainable parameters and no memory buffer, and generalizes well. Conclusion: P2L-CA is an efficient, scalable framework for multi-label class-incremental learning that balances model efficiency, storage overhead, and performance, offering a practical path for real applications. Abstract: Multi-label Class-Incremental Learning aims to continuously recognize novel categories in complex scenes where multiple objects co-occur. However, existing approaches often incur high computational costs due to full-parameter fine-tuning and substantial storage overhead from memory buffers, or they struggle to address feature confusion and domain discrepancies adequately. To overcome these limitations, we introduce P2L-CA, a parameter-efficient framework that integrates a Prompt-to-Label module with a Continuous Adapter module. The P2L module leverages class-specific prompts to disentangle multi-label representations while incorporating linguistic priors to enforce stable semantic-visual alignment. Meanwhile, the CA module employs lightweight adapters to mitigate domain gaps between pre-trained models and downstream tasks, thereby enhancing model plasticity. Extensive experiments across standard and challenging MLCIL settings on MS-COCO and PASCAL VOC show that P2L-CA not only achieves substantial improvements over state-of-the-art methods but also demonstrates strong generalization in CIL scenarios, all while requiring minimal trainable parameters and eliminating the need for memory buffers.

[312] RSOD: Reliability-Guided Sonar Image Object Detection with Extremely Limited Labels

Chengzhou Li,Ping Guo,Guanchen Meng,Qi Jia,Jinyuan Liu,Zhu Liu,Xiaokang Liu,Yu Liu,Zhongxuan Luo,Xin Fan

Main category: cs.CV

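TL;DR: This paper proposes RSOD, a teacher-student framework for sonar image object detection that, with only 5% of the labels, is competitive with a baseline trained on 100% labeled data, and introduces a pseudo-label strategy suited to sonar images to mitigate label scarcity.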

Details Motivation: Sonar images have few texture details and heavy noise, which makes precise annotation difficult; with extremely limited labels, conventional object detectors perform poorly, so detection methods designed for scarce labels are needed. Method: The RSOD teacher-student framework computes a reliability score from the consistency of the teacher's predictions across different views, introduces an object mixed pseudo-label method based on that score, and optimizes the student with a reliability-guided adaptive constraint. Result: On the UATD dataset, RSOD with only 5% labeled data approaches the baseline trained on 100% labeled data; a new sonar image dataset is also collected to support research in the field. Conclusion: RSOD makes effective use of unlabeled data and achieves strong sonar image object detection with very few labels, offering a practical solution for underwater detection in low-resource settings. Abstract: Object detection in sonar images is a key technology in underwater detection systems. Compared to natural images, sonar images contain fewer texture details and are more susceptible to noise, making it difficult for non-experts to distinguish subtle differences between classes. This leads to their inability to provide precise annotation data for sonar images. Therefore, designing effective object detection methods for sonar images with extremely limited labels is particularly important. To address this, we propose a teacher-student framework called RSOD, which aims to fully learn the characteristics of sonar images and develop a pseudo-label strategy suitable for these images to mitigate the impact of limited labels. First, RSOD calculates a reliability score by assessing the consistency of the teacher's predictions across different views. To leverage this score, we introduce an object mixed pseudo-label method to tackle the shortage of labeled data in sonar images. Finally, we optimize the performance of the student by implementing a reliability-guided adaptive constraint. By taking full advantage of unlabeled data, the student can perform well even in situations with extremely limited labels. Notably, on the UATD dataset, our method, using only 5% of labeled data, achieves results that can compete against those of our baseline algorithm trained on 100% labeled data. We also collected a new dataset to provide more valuable data for research in the field of sonar.

[313] S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

Lin Zhao,Yushu Wu,Aleksei Lebedev,Dishani Lahiri,Meng Dong,Arpit Sahni,Michael Vasilkovsky,Hao Chen,Ju Hu,Aliaksandr Siarohin,Sergey Tulyakov,Yanzhi Wang,Anil Kag,Yanyu Li

Main category: cs.CV

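TL;DR: S2DiT is an efficient streaming diffusion Transformer that uses novel attention mechanisms and a sandwich architecture to achieve high-quality, real-time video generation on mobile devices.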

Details Motivation: Diffusion Transformers (DiTs) have improved video generation quality, but their heavy computational cost makes real-time generation on mobile devices difficult. Method: S2DiT uses a streaming sandwich structure and a mixture of efficient attention mechanisms (LCHA and SSA), found with a budget-aware dynamic programming search, together with a 2-in-1 distillation framework that transfers the capacity of large models to a lightweight one. Result: S2DiT maintains high-fidelity generation while streaming video at over 10 FPS on an iPhone, with quality on par with server-grade models. Conclusion: S2DiT enables efficient, high-quality video generation on mobile devices, advancing on-device generative AI. Abstract: Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, and streaming video generation on mobile hardware. S2DiT generates more tokens but maintains efficiency with novel efficient attentions: a mixture of LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Based on this, we uncover the sandwich design via a budget-aware dynamic programming search, achieving superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, S2DiT achieves quality on par with state-of-the-art server video models, while streaming at over 10 FPS on an iPhone.

[314] DC-VLAQ: Query-Residual Aggregation for Robust Visual Place Recognition

Hanyu Zhu,Zhihao Zhan,Yuhang Ming,Liang Li,Dibo Hou,Javier Civera,Wanzeng Kong

Main category: cs.CV

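TL;DR: This paper proposes DC-VLAQ, a visual place recognition framework that fuses complementary features from visual foundation models such as DINOv2 and CLIP and introduces a query-residual aggregation scheme (VLAQ), improving the stability and discriminability of global representations under large viewpoint changes, illumination differences, and severe domain shifts.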

Details Motivation: Most existing visual place recognition methods rely on a single visual foundation model and ignore the complementary information across models; moreover, feature fusion changes the token distribution, which affects the stability of existing query-based aggregation. Method: 1) A lightweight residual-guided complementary fusion that uses DINOv2 as the anchor feature space and injects CLIP semantics through a learned residual correction; 2) VLAQ (Vector of Local Aggregated Queries), which encodes the residual responses of local tokens to learnable queries for more stable global aggregation. Result: Experiments on standard VPR datasets including Pitts30k, Tokyo24/7, and MSLS show that DC-VLAQ clearly outperforms strong baselines, especially under domain shift and long-term appearance change, reaching state-of-the-art performance. Conclusion: DC-VLAQ integrates complementary information from multiple visual foundation models and strengthens representation stability and discriminability through its query-residual aggregation, providing a new efficient solution for visual place recognition. Abstract: One of the central challenges in visual place recognition (VPR) is learning a robust global representation that remains discriminative under large viewpoint changes, illumination variations, and severe domain shifts. While visual foundation models (VFMs) provide strong local features, most existing methods rely on a single model, overlooking the complementary cues offered by different VFMs. However, exploiting such complementary information inevitably alters token distributions, which challenges the stability of existing query-based global aggregation schemes. To address these challenges, we propose DC-VLAQ, a representation-centric framework that integrates the fusion of complementary VFMs and robust global aggregation. Specifically, we first introduce a lightweight residual-guided complementary fusion that anchors representations in the DINOv2 feature space while injecting complementary semantics from CLIP through a learned residual correction. In addition, we propose the Vector of Local Aggregated Queries (VLAQ), a query-residual global aggregation scheme that encodes local tokens by their residual responses to learnable queries, resulting in improved stability and the preservation of fine-grained discriminative cues. Extensive experiments on standard VPR benchmarks, including Pitts30k, Tokyo24/7, MSLS, Nordland, SPED, and AmsterTime, demonstrate that DC-VLAQ consistently outperforms strong baselines and achieves state-of-the-art performance, particularly under challenging domain shifts and long-term appearance changes.
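
The idea of encoding local tokens by their residuals to a set of queries can be sketched in the style of NetVLAD soft-assignment. This is an assumption-laden illustration, not the paper's VLAQ formulation: the softmax assignment, the fixed `queries`, and the final L2 normalization are choices made here, and in a real model the queries would be learned.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def aggregate_residuals(tokens: List[Vector], queries: List[Vector]) -> Vector:
    """Soft-assign each token to the queries and accumulate the weighted
    residuals (token - query) per query, then flatten and L2-normalize."""
    dim = len(queries[0])
    acc = [[0.0] * dim for _ in queries]
    for t in tokens:
        weights = softmax([dot(t, q) for q in queries])
        for k, (w, q) in enumerate(zip(weights, queries)):
            for d in range(dim):
                acc[k][d] += w * (t[d] - q[d])
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]


# Hypothetical toy inputs: three 2-D local tokens and two queries.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
descriptor = aggregate_residuals(tokens, queries)
```

The descriptor has one residual block per query, so its length is the number of queries times the token dimension.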

[315] KaoLRM: Repurposing Pre-trained Large Reconstruction Models for Parametric 3D Face Reconstruction

Qingtian Zhu,Xu Cao,Zhixiang Wang,Yinqiang Zheng,Takafumi Taketomi

Main category: cs.CV

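TL;DR: This paper proposes KaoLRM, which re-targets the pre-trained 3D prior of the Large Reconstruction Model (LRM) to parametric 3D face reconstruction, combining the FLAME model with 2D Gaussian Splatting to improve the accuracy and cross-view consistency of single-image reconstruction.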

Details Motivation: Existing 3DMM-based face reconstruction methods show poor consistency across viewpoints, so a stronger 3D structural prior is needed to improve robustness and accuracy. Method: LRM's pre-trained triplane features are projected into the FLAME parameter space to recover geometry, and appearance is modeled with 2D Gaussian primitives tightly coupled to the FLAME mesh; FLAME-driven 2D Gaussian Splatting is incorporated into LRM's rendering pipeline. Result: On both controlled and in-the-wild datasets, the method achieves better reconstruction accuracy and cross-view consistency than existing methods, and remains robust under self-occlusion and varied viewpoints. Conclusion: KaoLRM combines LRM's strong 3D prior with FLAME's interpretable parameterization, offering a new paradigm for single-image parametric face reconstruction that balances accuracy, robustness, and consistency. Abstract: We propose KaoLRM to re-target the learned prior of the Large Reconstruction Model (LRM) for parametric 3D face reconstruction from single-view images. Parametric 3D Morphable Models (3DMMs) have been widely used for facial reconstruction due to their compact and interpretable parameterization, yet existing 3DMM regressors often exhibit poor consistency across varying viewpoints. To address this, we harness the pre-trained 3D prior of LRM and incorporate FLAME-based 2D Gaussian Splatting into LRM's rendering pipeline. Specifically, KaoLRM projects LRM's pre-trained triplane features into the FLAME parameter space to recover geometry, and models appearance via 2D Gaussian primitives that are tightly coupled to the FLAME mesh. The rich prior enables the FLAME regressor to be aware of the 3D structure, leading to accurate and robust reconstructions under self-occlusions and diverse viewpoints. Experiments on both controlled and in-the-wild benchmarks demonstrate that KaoLRM achieves superior reconstruction accuracy and cross-view consistency, while existing methods remain sensitive to viewpoint variations. The code is released at https://github.com/CyberAgentAILab/KaoLRM.

[316] SSPFormer: Self-Supervised Pretrained Transformer for MRI Images

Jingkai Li,Xiaoze Tian,Yuhang Shen,Jia Wang,Dianjie Lu,Guijuan Zhang,Zhuoran Zheng

Main category: cs.CV

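TL;DR: This paper proposes SSPFormer, a self-supervised pretrained Transformer for MRI images that uses inverse frequency projection masking and frequency-domain noise enhancement to learn domain-specific representations from unlabeled raw data, reaching state-of-the-art performance on segmentation, super-resolution, and denoising.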

Details Motivation: Existing pre-trained Transformers adapt poorly to the anatomical structure of medical images and to the privacy and scarcity constraints of medical data, so they transfer poorly to MRI images. Method: SSPFormer introduces inverse frequency projection masking, which prioritizes reconstruction of high-frequency anatomical regions, and frequency-weighted FFT noise enhancement, which injects physiologically realistic noise in the Fourier domain, so that domain-invariant and artifact-robust features are learned from raw scans. Result: Experiments on several downstream tasks (segmentation, super-resolution, denoising) show that SSPFormer achieves state-of-the-art performance, clearly outperforming existing methods. Conclusion: SSPFormer overcomes medical data scarcity and the domain gap, learns fine-grained and robust MRI feature representations, and adapts well to clinical applications. Abstract: The pre-trained transformer demonstrates remarkable generalization ability in natural image processing. However, directly transferring it to magnetic resonance images faces two key challenges: the inability to adapt to the specificity of medical anatomical structures and the limitations brought about by the privacy and scarcity of medical data. To address these issues, this paper proposes a Self-Supervised Pretrained Transformer (SSPFormer) for MRI images, which effectively learns domain-specific feature representations of medical images by leveraging unlabeled raw imaging data. To tackle the domain gap and data scarcity, we introduce inverse frequency projection masking, which prioritizes the reconstruction of high-frequency anatomical regions to enforce structure-aware representation learning. Simultaneously, to enhance robustness against real-world MRI artifacts, we employ frequency-weighted FFT noise enhancement that injects physiologically realistic noise into the Fourier domain. Together, these strategies enable the model to learn domain-invariant and artifact-robust features directly from raw scans. Through extensive experiments on segmentation, super-resolution, and denoising tasks, the proposed SSPFormer achieves state-of-the-art performance, fully verifying its ability to capture fine-grained MRI image fidelity and adapt to clinical application requirements.
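
Injecting noise in the Fourier domain can be illustrated on a 1D signal with a naive discrete Fourier transform. The linear weighting toward high frequencies, the Gaussian noise model, and the function names below are assumptions for illustration only; the abstract does not specify the weighting or the noise distribution, and real MRI data would be 2D or 3D.

```python
import cmath
import random
from typing import List


def dft(signal: List[float]) -> List[complex]:
    # Naive O(n^2) discrete Fourier transform, enough for a small demo.
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum: List[complex]) -> List[float]:
    # Inverse DFT; the real part is kept because the added noise is not
    # constrained to be Hermitian-symmetric.
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def add_frequency_weighted_noise(signal: List[float], sigma: float,
                                 rng: random.Random) -> List[float]:
    """Add complex Gaussian noise to each Fourier coefficient, scaled by its
    normalized distance from the zero-frequency (DC) term."""
    n = len(signal)
    noisy = []
    for k, coeff in enumerate(dft(signal)):
        freq = min(k, n - k) / (n / 2)  # 0 at DC, 1 at the highest frequency
        noise = complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        noisy.append(coeff + sigma * freq * noise)
    return idft_real(noisy)


signal = [0.0, 1.0, 0.5, -0.5, 0.25, 0.75, -1.0, 0.1]
unchanged = add_frequency_weighted_noise(signal, 0.0, random.Random(0))
perturbed = add_frequency_weighted_noise(signal, 0.5, random.Random(0))
```

Because the weight is zero at DC, the mean of the perturbed signal matches the mean of the input; only the higher-frequency content is disturbed.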

[317] Moaw: Unleashing Motion Awareness for Video Diffusion Models

Tianqi Zhang,Ziyi Wang,Wenzhao Zheng,Weiliang Chen,Yuanhui Huang,Zhengyang Huang,Jie Zhou,Jiwen Lu

Main category: cs.CV

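TL;DR: This paper proposes Moaw, a framework that strengthens the motion awareness of video diffusion models through supervised training and applies it to motion transfer. It converts a diffusion model from image-to-video generation to video-to-dense-tracking and enables zero-shot motion transfer without additional adapters.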

Details Motivation: Inspired by recent work that uses video diffusion models for optical flow prediction and tracking in a zero-shot setting, this paper asks whether supervised training can exploit their tracking ability more fully. Method: Moaw first trains a diffusion model for motion perception, changing its modality to video-to-dense-tracking; it then builds a motion-labeled dataset to identify the features that encode the strongest motion information and injects them into a structurally identical video generation model. Result: The motion features adapt naturally in a zero-shot manner, enabling motion transfer without additional adapters and linking motion understanding with generative modeling. Conclusion: The work offers a new paradigm connecting generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks. Abstract: Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness for video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between the two networks, these features can be naturally adapted in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.

[318] Towards Unbiased Source-Free Object Detection via Vision Foundation Models

Zhi Cai,Yingjie Gao,Yanan Zhang,Xinzhu Ma,Di Huang

Main category: cs.CV

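TL;DR: This paper proposes DSOD, a new source-free object detection framework that uses vision foundation models (VFMs) to mitigate source-domain bias, improves cross-domain detection through unified feature injection and semantic-aware regularization, and achieves leading results on several benchmarks.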

Details Motivation: Existing source-free object detection methods suffer from source bias, which leads to poor generalization and error accumulation during self-training. Method: The Debiased Source-free Object Detection (DSOD) framework comprises a Unified Feature Injection (UFI) module (combining SSE and DAAW) and Semantic-aware Feature Regularization (SAFR), plus a VFM-free distilled variant, DSOD-distill. Result: DSOD surpasses existing methods on multiple benchmarks, reaching 48.1% AP on Normal-to-Foggy, 39.3% AP on Cross-scene, and 61.4% AP on Synthetic-to-Real. Conclusion: DSOD mitigates source bias and significantly improves source-free object detection, and it suits scenarios with different computational budgets. Abstract: Source-Free Object Detection (SFOD) has garnered much attention in recent years by eliminating the need for source-domain data in cross-domain tasks, but existing SFOD methods suffer from the Source Bias problem, i.e. the adapted model remains skewed towards the source domain, leading to poor generalization and error accumulation during self-training. To overcome this challenge, we propose Debiased Source-free Object Detection (DSOD), a novel VFM-assisted SFOD framework that can effectively mitigate source bias with the help of powerful VFMs. Specifically, we propose a Unified Feature Injection (UFI) module that integrates VFM features into the CNN backbone through Simple-Scale Extension (SSE) and Domain-aware Adaptive Weighting (DAAW). Then, we propose Semantic-aware Feature Regularization (SAFR) that constrains feature learning to prevent overfitting to source domain characteristics. Furthermore, we propose a VFM-free variant, termed DSOD-distill, for computation-restricted scenarios through a novel Dual-Teacher distillation scheme. Extensive experiments on multiple benchmarks demonstrate that DSOD outperforms state-of-the-art SFOD methods, achieving 48.1% AP on Normal-to-Foggy weather adaptation, 39.3% AP on Cross-scene adaptation, and 61.4% AP on Synthetic-to-Real adaptation.

[319] Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration

Lu Yue,Yue Fan,Shiwei Lian,Yu Zhao,Jiaxin Yu,Liang Xie,Feitian Zhang

Main category: cs.CV

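TL;DR: This paper proposes Spatial-VLN, a framework whose Spatial Perception Enhancement (SPE) and Explored Multi-expert Reasoning (EMR) modules address spatial-perception bottlenecks in zero-shot vision-and-language navigation (door interaction, multi-room navigation, and ambiguous instructions), achieving SOTA performance on VLN-CE and in real-world scenes.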

Details Motivation: Existing LLM-based zero-shot vision-and-language navigation (VLN) methods have insufficient spatial perception in complex continuous environments, with high failure rates on three spatial challenges: door interaction, multi-room navigation, and ambiguous instruction execution. Method: The Spatial-VLN framework has two modules: 1) the Spatial Perception Enhancement (SPE) module fuses panoramic filtering with specialized door and region experts to produce spatial representations that are consistent across views; 2) the Explored Multi-expert Reasoning (EMR) module uses parallel LLM experts to handle waypoint-level semantics and region-level spatial transitions, and triggers a query-and-explore mechanism to actively probe critical areas when their predictions disagree. A value-based waypoint sampling strategy is also introduced to narrow the Sim2Real gap. Result: The method reaches SOTA performance on the VLN-CE benchmark using only low-cost LLMs; real-world experiments confirm strong generalization and robustness in complex environments. Conclusion: Through a perception-guided exploration paradigm, Spatial-VLN alleviates the core spatial bottlenecks of zero-shot VLN and substantially improves navigation performance and real-world deployability. Abstract: Zero-shot Vision-and-Language Navigation (VLN) agents leveraging Large Language Models (LLMs) excel in generalization but suffer from insufficient spatial perception. Focusing on complex continuous environments, we categorize key perceptual bottlenecks into three spatial challenges: door interaction, multi-room navigation, and ambiguous instruction execution, where existing methods consistently suffer high failure rates. We present Spatial-VLN, a perception-guided exploration framework designed to overcome these challenges. The framework consists of two main modules. The Spatial Perception Enhancement (SPE) module integrates panoramic filtering with specialized door and region experts to produce spatially coherent, cross-view consistent perceptual representations. Building on this foundation, our Explored Multi-expert Reasoning (EMR) module uses parallel LLM experts to address waypoint-level semantics and region-level spatial transitions. When discrepancies arise between expert predictions, a query-and-explore mechanism is activated, prompting the agent to actively probe critical areas and resolve perceptual ambiguities. Experiments on VLN-CE demonstrate that Spatial-VLN achieves state-of-the-art performance using only low-cost LLMs. Furthermore, to validate real-world applicability, we introduce a value-based waypoint sampling strategy that effectively bridges the Sim2Real gap. Extensive real-world evaluations confirm that our framework delivers superior generalization and robustness in complex environments.
Our codes and videos are available at https://yueluhhxx.github.io/Spatial-VLN-web/.

[320] Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval

Zequn Xie,Boyun Zhang,Yuxiao Lin,Tao Jin

Main category: cs.CV

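TL;DR: The HVP-Net framework is proposed to improve video-text retrieval by mining features from multiple intermediate layers of the vision encoder, reducing video redundancy and strengthening semantic alignment, and reaching SOTA on several benchmarks.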

Details Motivation: Existing video-text retrieval methods built on pre-trained models such as CLIP are limited by video redundancy and by using only coarse final-layer features, which limits matching accuracy. Method: HVP-Net extracts and refines features from multiple intermediate layers of the vision encoder, progressively distilling salient visual concepts at different semantic levels from raw patch tokens to build a more robust video representation. Result: New state-of-the-art performance is achieved on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Conclusion: Hierarchical features improve video-text retrieval, confirming the value of mining intermediate-layer features for cross-modal alignment. Abstract: Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features, limiting matching accuracy. To address this, we introduce the HVP-Net (Hierarchical Visual Perception Network), a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder. Our approach progressively distills salient visual concepts from raw patch-tokens at different semantic levels, mitigating redundancy while preserving crucial details for alignment. This results in a more robust video representation, leading to new state-of-the-art performance on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Our work validates the effectiveness of exploiting hierarchical features for advancing video-text retrieval. Our codes are available at https://github.com/boyun-zhang/HVP-Net.

[321] Generalizable and Animatable 3D Full-Head Gaussian Avatar from a Single Image

Shuling Zhao,Dan Xu

Main category: cs.CV

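TL;DR: This paper proposes a new framework for reconstructing an animatable 3D head avatar from a single image, enabling real-time animation and 360-degree rendering in a single feed-forward pass.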

Details Motivation: Existing methods degrade under large viewpoint changes and struggle to keep 3D avatars realistic, particularly for full-head modeling and real-time animation. Method: The 3D avatar is represented with Gaussian primitives embedded on the surface of a parametric face model in UV space; a pretrained 3D generative adversarial network (GAN) provides global full-head features and multi-view supervision, and UV-space symmetry is used to fuse local and global texture features. Result: The method achieves high-quality 3D full-head modeling and real-time animation, with strong reconstruction fidelity and visual realism across viewpoints. Conclusion: From a single input image, the method reconstructs an animatable 3D full-head avatar efficiently and realistically, improving the realism and practicality of virtual talking avatars. Abstract: Building 3D animatable head avatars from a single image is an important yet challenging problem. Existing methods generally collapse under large camera pose variations, compromising the realism of 3D avatars. In this work, we propose a new framework to tackle the novel setting of one-shot 3D full-head animatable avatar reconstruction in a single feed-forward pass, enabling real-time animation and simultaneous 360° rendering views. To facilitate efficient animation control, we model 3D head avatars with Gaussian primitives embedded on the surface of a parametric face model within the UV space. To obtain knowledge of full-head geometry and textures, we leverage rich 3D full-head priors within a pretrained 3D generative adversarial network (GAN) for global full-head feature extraction and multi-view supervision. To increase the fidelity of the 3D reconstruction of the input image, we take advantage of the symmetric nature of the UV space and human faces to fuse local fine-grained input image features with the global full-head textures. Extensive experiments demonstrate the effectiveness of our method, achieving high-quality 3D full-head modeling as well as real-time animation, thereby improving the realism of 3D talking avatars.

[322] Open Vocabulary Panoptic Segmentation With Retrieval Augmentation

Nafis Sadeq,Qingfeng Liu,Mostafa El-Khamy

Main category: cs.CV

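TL;DR: This paper proposes RetCLIP, a retrieval-augmented open-vocabulary panoptic segmentation method that builds a masked segment feature database and combines it with CLIP to improve segmentation of unseen classes.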

Details Motivation: Conventional panoptic segmentation models generalize poorly to unseen classes outside the training data, making accurate segmentation in the open-vocabulary setting difficult. Method: A masked segment feature database is built from paired image-text data; at inference, masked segment features from the input image serve as query keys to retrieve similar features and their class labels from the database, and retrieval scores are combined with CLIP scores for classification. Result: Trained on COCO and tested on ADE20k, the method reaches 30.9 PQ, 19.3 mAP, and 44.0 mIoU, improving over the baseline by +4.5 PQ, +2.5 mAP, and +10.0 mIoU. Conclusion: RetCLIP improves recognition and segmentation of unseen classes in open-vocabulary panoptic segmentation, confirming the benefit of combining retrieval augmentation with CLIP. Abstract: Given an input image and a set of class names, panoptic segmentation aims to label each pixel in an image with class labels and instance labels. In comparison, Open Vocabulary Panoptic Segmentation aims to facilitate the segmentation of arbitrary classes according to user input. The challenge is that a panoptic segmentation system trained on a particular dataset typically does not generalize well to unseen classes beyond the training data. In this work, we propose RetCLIP, a retrieval-augmented panoptic segmentation method that improves the performance of unseen classes. In particular, we construct a masked segment feature database using paired image-text data. At inference time, we use masked segment features from the input image as query keys to retrieve similar features and associated class labels from the database. Classification scores for the masked segment are assigned based on the similarity between query features and retrieved features. The retrieval-based classification scores are combined with CLIP-based scores to produce the final output. We incorporate our solution with a previous SOTA method (FC-CLIP). When trained on COCO, the proposed method demonstrates 30.9 PQ, 19.3 mAP, 44.0 mIoU on the ADE20k dataset, achieving +4.5 PQ, +2.5 mAP, +10.0 mIoU absolute improvement over the baseline.
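
The retrieve-then-combine step can be sketched as follows. This is a minimal illustration under stated assumptions: the top-k cosine retrieval, the max-similarity class score, the linear mixing weight `alpha`, and all feature values are hypothetical, and the abstract does not say how RetCLIP actually fuses the two scores.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def retrieval_scores(query: Vector, db_feats: List[Vector], db_labels: List[str],
                     classes: List[str], k: int = 2) -> Dict[str, float]:
    """Score each class by the best similarity among the top-k retrieved
    database entries carrying that label (0 if none was retrieved)."""
    ranked = sorted(((cosine(query, f), lab) for f, lab in zip(db_feats, db_labels)),
                    reverse=True)[:k]
    scores = {c: 0.0 for c in classes}
    for sim, lab in ranked:
        scores[lab] = max(scores[lab], sim)
    return scores


def combine(retrieval: Dict[str, float], clip: Dict[str, float],
            alpha: float = 0.5) -> Dict[str, float]:
    # Assumed linear mix of retrieval-based and CLIP-based class scores.
    return {c: alpha * retrieval[c] + (1.0 - alpha) * clip[c] for c in retrieval}


# Hypothetical masked-segment features and labels.
db_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
db_labels = ["cat", "dog", "cat"]
classes = ["cat", "dog"]
retr = retrieval_scores([1.0, 0.0], db_feats, db_labels, classes, k=2)
final = combine(retr, {"cat": 0.4, "dog": 0.6}, alpha=0.5)
predicted = max(final, key=final.get)
```

In this toy case the CLIP-style scores alone would favor "dog", but the retrieved neighbors shift the combined decision to "cat".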

[323] SKANet: A Cognitive Dual-Stream Framework with Adaptive Modality Fusion for Robust Compound GNSS Interference Classification

Zhihan Zeng,Yang Zhao,Kaihe Wang,Dusit Niyato,Hongyuan Shu,Junchu Zhao,Yanjun Huang,Yue Xiu,Zhongpei Zhang,Ning Wei

Main category: cs.CV

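TL;DR: This paper proposes SKANet, a cognitive deep learning framework for classifying compound interference signals in GNSS under complex electromagnetic environments. It fuses time-frequency images and power spectral density in a dual-stream structure with dynamic receptive fields and feature recalibration, achieving 96.99% accuracy on a dataset of 405,000 samples.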

Details Motivation: As the electromagnetic environment grows more complex, GNSS faces multiple compound interference threats. Conventional single-domain deep learning methods, with static receptive fields, cannot capture transient burst signals and continuous global signals at the same time, which degrades compound interference classification. Method: SKANet uses a dual-stream architecture that fuses time-frequency images (TFIs) and power spectral density (PSD), introduces a multi-branch Selective Kernel (SK) module with Asymmetric Convolution Blocks (ACBs) to adjust receptive fields dynamically, and applies a Squeeze-and-Excitation (SE) mechanism at the fusion stage to recalibrate multimodal features adaptively. Result: On a dataset of 405,000 samples, SKANet reaches an overall classification accuracy of 96.99% and remains strong at low Jamming-to-Noise Ratio (JNR), clearly outperforming existing methods. Conclusion: With dynamic receptive fields and adaptive multimodal fusion, SKANet addresses multi-scale feature extraction in compound interference signals and improves the robustness and accuracy of GNSS interference recognition. Abstract: As the electromagnetic environment becomes increasingly complex, Global Navigation Satellite Systems (GNSS) face growing threats from sophisticated jamming interference. Although Deep Learning (DL) effectively identifies basic interference, classifying compound interference remains difficult due to the superposition of diverse jamming sources. Existing single-domain approaches often suffer from performance degradation because transient burst signals and continuous global signals require conflicting feature extraction scales. We propose the Selective Kernel and Asymmetric convolution Network (SKANet), a cognitive deep learning framework built upon a dual-stream architecture that integrates Time-Frequency Images (TFIs) and Power Spectral Density (PSD). Distinct from conventional fusion methods that rely on static receptive fields, the proposed architecture incorporates a Multi-Branch Selective Kernel (SK) module combined with Asymmetric Convolution Blocks (ACBs). This mechanism enables the network to dynamically adjust its receptive fields, acting as an adaptive filter that simultaneously captures micro-scale transient features and macro-scale spectral trends within entangled compound signals. To complement this spatial-temporal adaptation, a Squeeze-and-Excitation (SE) mechanism is integrated at the fusion stage to adaptively recalibrate the contribution of heterogeneous features from each modality.
Evaluations on a dataset of 405,000 samples demonstrate that SKANet achieves an overall accuracy of 96.99%, exhibiting superior robustness for compound jamming classification, particularly under low Jamming-to-Noise Ratio (JNR) regimes.
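The Squeeze-and-Excitation recalibration mentioned in the abstract can be illustrated with a minimal sketch. This is the generic SE pattern (global average, bottleneck MLP, sigmoid gate), not SKANet's actual fusion layer; the function name `se_recalibrate` and the weight matrices are hypothetical.

```python
import math

def se_recalibrate(features, w_reduce, w_expand):
    """Generic SE gate. features: list of channels, each a flat list of floats.
    w_reduce / w_expand are hypothetical bottleneck weights (row-major)."""
    # Squeeze: one global average per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: reduce with ReLU, then expand with a sigmoid gate.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: reweight each channel by its gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, features)]
    return gates, scaled

# Two channels standing in for features from two modalities.
gates, scaled = se_recalibrate([[1.0, 1.0], [3.0, 3.0]],
                               w_reduce=[[1.0, 1.0]],
                               w_expand=[[0.0], [1.0]])
```

In this toy setup the second channel receives the larger gate, so it contributes more after fusion.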

[324] Combating Noisy Labels through Fostering Self- and Neighbor-Consistency

Zeren Sun,Yazhou Yao,Tongliang Liu,Zechao Li,Fumin Shen,Jinhui Tang

Main category: cs.CV

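TL;DR: Proposes Jo-SNC, which combines sample selection with model regularization based on self- and neighbor-consistency to combat label noise, and is particularly effective at handling both in-distribution and out-of-distribution noisy samples.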

Details Motivation: Deep networks are vulnerable to label noise because of the memorization effect. Existing methods often overlook imbalances in noise across mini-batches and pay too little attention to out-of-distribution noisy data. Method: Jensen-Shannon divergence measures how likely a sample is to be clean or out-of-distribution, and nearest-neighbor information makes that judgment more reliable. A self-adaptive, data-driven scheme adjusts the selection thresholds. Clean samples are trained conventionally, in-distribution noisy samples with partial label learning, and out-of-distribution noisy samples with negative learning. A triplet consistency regularization (self-prediction, neighbor-prediction, and feature consistency) further improves the model. Result: Experiments on multiple benchmark datasets show better noise robustness than existing state-of-the-art methods, and ablation studies confirm the contribution of each component. Conclusion: By joining sample selection with consistency regularization, Jo-SNC improves robustness under label noise and performs especially well on in-distribution and out-of-distribution noise. Abstract: Label noise is pervasive in various real-world scenarios, posing challenges in supervised deep learning. Deep networks are vulnerable to such label-corrupted samples due to the memorization effect. One major stream of previous methods concentrates on identifying clean data for training. However, these methods often neglect imbalances in label noise across different mini-batches and devote insufficient attention to out-of-distribution noisy data. To this end, we propose a noise-robust method named Jo-SNC (Joint sample selection and model regularization based on Self- and Neighbor-Consistency). Specifically, we propose to employ the Jensen-Shannon divergence to measure the "likelihood" of a sample being clean or out-of-distribution. This process factors in the nearest neighbors of each sample to reinforce the reliability of clean sample identification. We design a self-adaptive, data-driven thresholding scheme to adjust per-class selection thresholds. While clean samples undergo conventional training, detected in-distribution and out-of-distribution noisy samples are trained following partial label learning and negative learning, respectively. Finally, we advance the model performance further by proposing a triplet consistency regularization that promotes self-prediction consistency, neighbor-prediction consistency, and feature consistency. Extensive experiments on various benchmark datasets and comprehensive ablation studies demonstrate the effectiveness and superiority of our approach over existing state-of-the-art methods.
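The Jensen-Shannon divergence at the core of the selection step can be sketched as follows. This shows only the divergence itself, computed between a predicted class distribution and a one-hot label; how Jo-SNC turns it into a score, adds neighbors, and sets thresholds is not specified here, so `clean_score` is an assumed convention (one minus the base-2 divergence).

```python
import math

def kl(p, q):
    # KL divergence in bits; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Symmetric, bounded in [0, 1] when using log base 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def clean_score(prediction, one_hot_label):
    # Assumed convention: high when the prediction agrees with the given label.
    return 1.0 - js_divergence(prediction, one_hot_label)

agree = clean_score([0.9, 0.05, 0.05], [1.0, 0.0, 0.0])
disagree = clean_score([0.05, 0.9, 0.05], [1.0, 0.0, 0.0])
```

A sample whose prediction disagrees with its label gets a lower score, which is the signal a selection threshold would act on.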

[325] PhyG-MoE: A Physics-Guided Mixture-of-Experts Framework for Energy-Efficient GNSS Interference Recognition

Zhihan Zeng,Yang Zhao,Kaihe Wang,Dusit Niyato,Yue Xiu,Lu Chen,Zhongpei Zhang,Ning Wei

Main category: cs.CV

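TL;DR: This paper proposes PhyG-MoE, a physics-guided Mixture-of-Experts framework that dynamically matches model capacity to signal complexity, substantially improving computational efficiency and accuracy in GNSS interference recognition.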

Details Motivation: Existing deep learning models apply a fixed computational structure to complex electromagnetic interference, so simple and complex signals receive the same resources, and the models adapt poorly to a dynamically changing electromagnetic environment. Method: A gating mechanism based on the degree of spectral feature entanglement routes each signal to a different expert network: a high-capacity TransNeXt expert handles complex, saturated interference, while lightweight experts handle basic signals, so computation is allocated on demand. Result: Experiments on 21 jamming categories show that PhyG-MoE reaches an overall accuracy of 97.58% and reduces computational overhead without loss of performance. Conclusion: By adjusting model capacity dynamically, PhyG-MoE resolves the fundamental conflict between static models and dynamic electromagnetic environments, and offers an efficient, flexible interference recognition scheme for resource-constrained cognitive receivers. Abstract: Complex electromagnetic interference increasingly compromises Global Navigation Satellite Systems (GNSS), threatening the reliability of Space-Air-Ground Integrated Networks (SAGIN). Although deep learning has advanced interference recognition, current static models suffer from a fundamental limitation: they impose a fixed computational topology regardless of the input's physical entropy. This rigidity leads to severe resource mismatch, where simple primitives consume the same processing cost as chaotic, saturated mixtures. To resolve this, this paper introduces PhyG-MoE (Physics-Guided Mixture-of-Experts), a framework designed to dynamically align model capacity with signal complexity. Unlike static architectures, the proposed system employs a spectrum-based gating mechanism that routes signals based on their spectral feature entanglement. A high-capacity TransNeXt expert is activated on-demand to disentangle complex features in saturated scenarios, while lightweight experts handle fundamental signals to minimize latency. Evaluations on 21 jamming categories demonstrate that PhyG-MoE achieves an overall accuracy of 97.58%. By resolving the intrinsic conflict between static computing and dynamic electromagnetic environments, the proposed framework significantly reduces computational overhead without performance degradation, offering a viable solution for resource-constrained cognitive receivers.

[326] Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

Takaki Yamamoto,Chihiro Noguchi,Toshihiro Tanizawa

Main category: cs.CV

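TL;DR: This paper studies how CLIP-style vision-language models acquire left-right spatial-relation understanding. Using a controllable 1D image-text testbed, it finds that label diversity is the main driver of generalization, and an attention decomposition shows that interactions between positional and token embeddings produce a horizontal attention gradient.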

Details Motivation: To determine whether vision-language models truly acquire spatial understanding and through what mechanism, specifically for Transformer models trained with a CLIP-style contrastive objective. Method: Build a controllable 1D image-text testbed, train lightweight Transformer vision and text encoders end-to-end with a CLIP-style contrastive objective, systematically vary label and layout diversity, and evaluate generalization to unseen object pairs; an attention decomposition is used to analyze the mechanism. Result: Contrastive training learns left-right relations, and label diversity promotes generalization more than layout diversity. The attention decomposition shows that the interaction between positional and token embeddings produces a horizontal attention gradient that breaks left-right symmetry; ablating this contribution substantially reduces left-right discrimination. Conclusion: Given sufficient label diversity, CLIP-style models can acquire left-right relational understanding through the interaction of positional and token embeddings, and the paper offers an interpretable mechanistic account of how such models learn spatial relations. Abstract: Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain mechanistic understanding, we perform an attention decomposition and show that interactions between positional and token embeddings induce a horizontal attention gradient that breaks left-right symmetry in the encoders; ablating this contribution substantially reduces left-right discrimination. Our results provide mechanistic insight into when and how CLIP-style models acquire relational competence.
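For readers unfamiliar with the CLIP-style contrastive objective referenced above, here is a minimal sketch of the symmetric InfoNCE loss on a small batch. It assumes the embeddings are already L2-normalized and that row `k` of `img` is paired with row `k` of `txt`; the temperature value and function names are illustrative assumptions, not the paper's settings.

```python
import math

def log_softmax_row(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def clip_loss(img, txt, temperature=0.07):
    n = len(img)
    # Pairwise similarity logits between every image and every text.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    # Image-to-text: the matching text sits on the diagonal.
    loss_i2t = -sum(log_softmax_row(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same computation on the transposed logits.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax_row(cols[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings give a near-zero loss, while swapped pairs are heavily penalized.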

[327] CSGaussian: Progressive Rate-Distortion Compression and Segmentation for 3D Gaussian Splatting

Yu-Jen Tseng,Chia-Hao Kao,Jing-Zhong Chen,Alessandro Gnutti,Shao-Yuan Lo,Yen-Yu Lin,Wen-Hsiao Peng

Main category: cs.CV

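TL;DR: This paper proposes a unified framework for rate-distortion-optimized compression and segmentation of 3D Gaussian Splatting (3DGS), the first to integrate semantic learning into the compression pipeline, supporting decoder-side applications such as scene editing.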

Details Motivation: Prior work usually treats 3DGS compression and semantic understanding as separate tasks and has not explored optimizing them jointly. This paper aims to achieve efficient compression while preserving high-quality rendering and semantic segmentation through joint optimization. Method: A lightweight hyperprior based on implicit neural representations enables efficient entropy coding of color and semantic attributes. A compression-guided segmentation learning scheme, comprising quantization-aware training and a quality-aware weighting mechanism, improves feature separability and suppresses unreliable Gaussian primitives. Result: Experiments on the LERF and 3D-OVS datasets show that the method significantly reduces transmission cost while maintaining high rendering quality and strong segmentation performance. Conclusion: The unified framework achieves a co-design of rate-distortion-optimized compression and semantic segmentation for 3DGS, providing technical support for decoder-side applications such as scene editing. Abstract: We present the first unified framework for rate-distortion-optimized compression and segmentation of 3D Gaussian Splatting (3DGS). While 3DGS has proven effective for both real-time rendering and semantic scene understanding, prior works have largely treated these tasks independently, leaving their joint consideration unexplored. Inspired by recent advances in rate-distortion-optimized 3DGS compression, this work integrates semantic learning into the compression pipeline to support decoder-side applications--such as scene editing and manipulation--that extend beyond traditional scene reconstruction and view synthesis. Our scheme features a lightweight implicit neural representation-based hyperprior, enabling efficient entropy coding of both color and semantic attributes while avoiding the costly grid-based hyperpriors used in many prior works. To facilitate compression and segmentation, we further develop compression-guided segmentation learning, consisting of quantization-aware training to enhance feature separability and a quality-aware weighting mechanism to suppress unreliable Gaussian primitives. Extensive experiments on the LERF and 3D-OVS datasets demonstrate that our approach significantly reduces transmission cost while preserving high rendering quality and strong segmentation performance.

[328] A Generalist Foundation Model for Total-body PET/CT Enables Diagnostic Reporting and System-wide Metabolic Profiling

Wei Chen,Liang Wu,Shuyi Lu,Yuanyuan Sun,Wenkai Bi,Zilong Yuan,Yaoyao He,Feng Wang,Junchi Ma,Shuyong Liu,Zhaoping Cheng,Xiaoyan Hu,Jianfeng Qiu

Main category: cs.CV

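TL;DR: SDF-HOLO is a dual-stream fusion holistic model for total-body PET/CT. By decoupling CT and PET representation learning and coupling them through cross-modal interaction, it outperforms existing methods on tumor segmentation, low-dose lesion detection, and multilingual diagnostic report generation.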

Details Motivation: Existing medical AI models struggle with the multimodality, long-range dependencies, and fine-grained semantic alignment of total-body PET/CT, so a foundation model capable of system-level molecular imaging analysis is needed. Method: SDF-HOLO processes CT and PET signals with separate dual-stream encoders and fuses them through a cross-modal interaction module. Hierarchical context modeling captures long-range dependencies across the body, and anatomical segmentation masks serve as semantic anchors for voxel-mask-text alignment during pre-training. Result: The model surpasses strong baselines on multiple tasks, reduces localization errors and hallucinated findings, supports system-wide metabolic profiling, and reveals tumor-associated fingerprints of inter-organ metabolic network interactions. Conclusion: SDF-HOLO provides a scalable computational foundation for total-body PET/CT and advances system-level precision oncology. Abstract: Total-body PET/CT enables system-wide molecular imaging, but heterogeneous anatomical and metabolic signals, approximately 2 m axial coverage, and structured radiology semantics challenge existing medical AI models that assume single-modality inputs, localized fields of view, and coarse image-text alignment. We introduce SDF-HOLO (Systemic Dual-stream Fusion Holo Model), a multimodal foundation model for holistic total-body PET/CT, pre-trained on more than 10,000 patients. SDF-HOLO decouples CT and PET representation learning with dual-stream encoders and couples them through a cross-modal interaction module, allowing anatomical context to refine PET aggregation while metabolic saliency guides subtle morphological reasoning. To model long-range dependencies across the body, hierarchical context modeling combines efficient local windows with global attention. To bridge voxels and clinical language, we use anatomical segmentation masks as explicit semantic anchors and perform voxel-mask-text alignment during pre-training. Across tumor segmentation, low-dose lesion detection, and multilingual diagnostic report generation, SDF-HOLO outperforms strong task-specific and clinical-reference baselines while reducing localization errors and hallucinated findings. Beyond focal interpretation, the model enables system-wide metabolic profiling and reveals tumor-associated fingerprints of inter-organ metabolic network interactions, providing a scalable computational foundation for total-body PET/CT diagnostics and system-level precision oncology.

[329] TreeDGS: Aerial Gaussian Splatting for Distant DBH Measurement

Belal Shaheen,Minh-Hieu Nguyen,Bach-Thuan Bui,Shubham,Tim Wu,Michael Fairley,Matthew David Zane,Michael Wu,James Tompkin

Main category: cs.CV

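TL;DR: Proposes TreeDGS, which uses 3D Gaussian Splatting to measure tree diameter at breast height (DBH) accurately from aerial imagery, outperforming a state-of-the-art LiDAR baseline.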

Details Motivation: In complex natural scenes, aerial remote sensing struggles to measure object-level attributes such as tree diameter at breast height (DBH) directly and accurately, especially because trunks cover few pixels and are sparsely observed in aerial views. Method: Build a continuous, densifiable scene representation with 3D Gaussian Splatting. After SfM-MVS initialization and Gaussian optimization, extract a dense point set using RaDe-GS's depth-aware cumulative-opacity integration, assign each sample a multi-view opacity reliability score, and estimate DBH with opacity-weighted solid-circle fitting. Result: On 10 test plots, TreeDGS reaches a DBH RMSE of 4.79 cm (about 2.6 pixels), better than a state-of-the-art LiDAR baseline (7.91 cm RMSE). Conclusion: Geometry reconstructed from densified splats enables accurate, low-cost aerial DBH measurement and broadens the use of aerial imagery in ecological monitoring. Abstract: Aerial remote sensing enables efficient large-area surveying, but accurate direct object-level measurement remains difficult in complex natural scenes. Recent advancements in 3D vision, particularly learned radiance-field representations such as NeRF and 3D Gaussian Splatting, have begun to raise the ceiling on reconstruction fidelity and densifiable geometry from posed imagery. Nevertheless, direct aerial measurement of important natural attributes such as tree diameter at breast height (DBH) remains challenging. Trunks in aerial forest scans are distant and sparsely observed in image views: at typical operating altitudes, stems may span only a few pixels. With these constraints, conventional reconstruction methods leave breast-height trunk geometry weakly constrained. We present TreeDGS, an aerial image reconstruction method that leverages 3D Gaussian Splatting as a continuous, densifiable scene representation for trunk measurement. After SfM-MVS initialization and Gaussian optimization, we extract a dense point set from the Gaussian field using RaDe-GS's depth-aware cumulative-opacity integration and associate each sample with a multi-view opacity reliability score. We then estimate DBH from trunk-isolated points using opacity-weighted solid-circle fitting. Evaluated on 10 plots with field-measured DBH, TreeDGS reaches 4.79 cm RMSE (about 2.6 pixels at this GSD) and outperforms a state-of-the-art LiDAR baseline (7.91 cm RMSE), demonstrating that densified splat-based geometry can enable accurate, low-cost aerial DBH measurement.
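To make the final step concrete, the sketch below fits a circle to 2D trunk cross-section points with per-point weights and returns its diameter. It uses a weighted algebraic (Kåsa-style) least-squares fit as a stand-in: the paper's "solid-circle fitting" is not described in the abstract and may differ. The function name and the use of opacity scores as weights are assumptions for illustration.

```python
import math

def det3(m):
    # Determinant of a 3x3 matrix given as nested lists.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def weighted_circle_fit(points, weights):
    """Minimize sum_i w_i * (x^2 + y^2 + D*x + E*y + F)^2 over D, E, F."""
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for (x, y), w in zip(points, weights):
        row = (x, y, 1.0)
        rhs = -(x * x + y * y)
        for i in range(3):
            atb[i] += w * row[i] * rhs
            for j in range(3):
                ata[i][j] += w * row[i] * row[j]
    # Solve the 3x3 normal equations with Cramer's rule.
    d = det3(ata)
    sol = []
    for k in range(3):
        mk = [[atb[i] if j == k else ata[i][j] for j in range(3)]
              for i in range(3)]
        sol.append(det3(mk) / d)
    D, E, F = sol
    cx, cy = -D / 2, -E / 2
    radius = math.sqrt(cx * cx + cy * cy - F)
    return cx, cy, 2 * radius  # center and diameter

# Points on a 0.30 m diameter trunk centered at (1, 2); weights mimic opacity scores.
pts = [(1.15, 2.0), (1.0, 2.15), (0.85, 2.0), (1.0, 1.85)]
cx, cy, dbh = weighted_circle_fit(pts, [0.9, 0.5, 0.8, 0.3])
```

With noisy points, low-weight samples pull the fit less, which is the intuition behind weighting by an opacity-based reliability score.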

[330] Seeing Isn't Always Believing: Analysis of Grad-CAM Faithfulness and Localization Reliability in Lung Cancer CT Classification

Teerapong Panboonyuen

Main category: cs.CV

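TL;DR: This study evaluates how reliable Grad-CAM explanations are for lung cancer image classification. Grad-CAM performs well on convolutional networks but degrades markedly on Vision Transformers because of their non-local attention behavior, exposing limitations of current saliency-map-based XAI methods in medical imaging.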

Details Motivation: Although XAI techniques such as Grad-CAM are widely used in medical image analysis, the faithfulness and reliability of their explanations remain in doubt, and differences across network architectures have not been evaluated systematically. Method: Using the IQ-OTH/NCCD dataset, evaluate five mainstream models, including ResNet, DenseNet, EfficientNet, and ViT, with a quantitative framework that combines localization accuracy, perturbation-based faithfulness, and explanation consistency to analyze Grad-CAM reliability across architectures. Result: Grad-CAM highlights tumor regions effectively in most convolutional networks, but its faithfulness drops significantly for Vision Transformers. Cross-model comparison shows substantial variation in saliency localization, indicating that Grad-CAM does not necessarily reflect the evidence a model actually uses. Conclusion: Current saliency-map-based XAI methods have critical limitations in medical imaging. Interpretability methods that are model-aware, computationally sound, and clinically meaningful are needed, along with more cautious and rigorous use of visual explanation tools. Abstract: Explainable Artificial Intelligence (XAI) techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM), have become indispensable for visualizing the reasoning process of deep neural networks in medical image analysis. Despite their popularity, the faithfulness and reliability of these heatmap-based explanations remain under scrutiny. This study critically investigates whether Grad-CAM truly represents the internal decision-making of deep models trained for lung cancer image classification. Using the publicly available IQ-OTH/NCCD dataset, we evaluate five representative architectures: ResNet-50, ResNet-101, DenseNet-161, EfficientNet-B0, and ViT-Base-Patch16-224, to explore model-dependent variations in Grad-CAM interpretability. We introduce a quantitative evaluation framework that combines localization accuracy, perturbation-based faithfulness, and explanation consistency to assess Grad-CAM reliability across architectures. Experimental findings reveal that while Grad-CAM effectively highlights salient tumor regions in most convolutional networks, its interpretive fidelity significantly degrades for Vision Transformer models due to non-local attention behavior. Furthermore, cross-model comparisons indicate substantial variability in saliency localization, implying that Grad-CAM explanations may not always correspond to the true diagnostic evidence used by the networks. 
This work exposes critical limitations of current saliency-based XAI approaches in medical imaging and emphasizes the need for model-aware interpretability methods that are both computationally sound and clinically meaningful. Our findings aim to inspire a more cautious and rigorous adoption of visual explanation tools in medical AI, urging the community to rethink what it truly means to "trust" a model's explanation.
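As background for the evaluation above, the standard Grad-CAM computation can be sketched in a few lines: each feature map is weighted by the spatial mean of its gradient with respect to the class score, the weighted maps are summed, and a ReLU keeps only positive evidence. The inputs here are hand-made nested lists standing in for a real network's activations and gradients; nothing in this sketch comes from the study's code.

```python
def grad_cam(activations, gradients):
    """activations, gradients: nested lists shaped [channels][height][width]."""
    # Channel weights: global average of the gradient over spatial positions.
    alphas = [sum(sum(row) for row in g) / (len(g) * len(g[0]))
              for g in gradients]
    h, w = len(activations[0]), len(activations[0][0])
    # Weighted sum over channels followed by ReLU.
    return [[max(0.0, sum(a * act[i][j] for a, act in zip(alphas, activations)))
             for j in range(w)]
            for i in range(h)]

acts = [[[1, 0], [0, 0]],      # channel 0 fires top-left
        [[0, 0], [0, 1]]]      # channel 1 fires bottom-right
grads = [[[1, 1], [1, 1]],     # channel 0 supports the class
         [[-1, -1], [-1, -1]]] # channel 1 opposes it
cam = grad_cam(acts, grads)
```

Only the top-left cell survives the ReLU, since the channel that opposes the class is zeroed out. The heatmap depends entirely on one layer's spatial feature maps, which is one reason its meaning can shift between convolutional and attention-based architectures.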

[331] FGTBT: Frequency-Guided Task-Balancing Transformer for Unified Facial Landmark Detection

Jun Wan,Xinyu Xiong,Ning Chen,Zhihui Lai,Jie Zhou,Wenwen Min

Main category: cs.CV

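TL;DR: This paper proposes the Frequency-Guided Task-Balancing Transformer (FGTBT), which improves the robustness and accuracy of facial landmark detection in challenging scenarios through frequency-domain modeling and joint training on multiple datasets.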

Details Motivation: Existing deep-learning-based facial landmark detection methods struggle to model facial geometry accurately under large pose, illumination, and expression variations, and the small size and limited diversity of datasets restrict generalization. Method: The FGTBT framework has two core components: (1) a Fine-Grained Multi-Task Balancing loss (FMB-loss) that weights each landmark dynamically according to how often it occurs across datasets, and (2) a Frequency-Guided Structure-Aware (FGSA) model that uses frequency-domain information for structure injection and regularization. Result: Experiments on mainstream benchmark datasets show that FGTBT reaches state-of-the-art-level performance and significantly improves landmark localization under difficult conditions such as large poses and illumination changes. Conclusion: Frequency-domain modeling and fine-grained task balancing strengthen facial structure perception, and the unified multi-dataset training paradigm helps mitigate data bias and inconsistent gradients, offering a new direction for FLD. Abstract: Recently, deep learning based facial landmark detection (FLD) methods have achieved considerable success. However, in challenging scenarios such as large pose variations, illumination changes, and facial expression variations, they still struggle to accurately capture the geometric structure of the face, resulting in performance degradation. Moreover, the limited size and diversity of existing FLD datasets hinder robust model training, leading to reduced detection accuracy. To address these challenges, we propose a Frequency-Guided Task-Balancing Transformer (FGTBT), which enhances facial structure perception through frequency-domain modeling and multi-dataset unified training. Specifically, we propose a novel Fine-Grained Multi-Task Balancing loss (FMB-loss), which moves beyond coarse task-level balancing by assigning weights to individual landmarks based on their occurrence across datasets. This enables more effective unified training and mitigates the issue of inconsistent gradient magnitudes. Additionally, a Frequency-Guided Structure-Aware (FGSA) model is designed to utilize frequency-guided structure injection and regularization to help learn facial structure constraints. Extensive experimental results on popular benchmark datasets demonstrate that the integration of the proposed FMB-loss and FGSA model into our FGTBT framework achieves performance comparable to state-of-the-art methods. The code is available at https://github.com/Xi0ngxinyu/FGTBT.

[332] Proxy Robustness in Vision Language Models is Effortlessly Transferable

Xiaowei Fu,Fuxiang Huang,Lei Zhang

Main category: cs.CV

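TL;DR: This paper proposes Heterogeneous Proxy Transfer (HPT), a new paradigm for transferring adversarial robustness between vision-language models such as CLIP without adversarial training, and finds that vanilla CLIP models exhibit "proxy adversarial robustness" toward one another. To mitigate overfitting during transfer, it further designs Generalization-Pivot Decoupling (GPD), which improves adversarial robustness while preserving zero-shot natural generalization.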

Details Motivation: Conventional adversarial distillation relies on computationally expensive adversarial training to build a robust teacher, which is hard to apply to large-scale multimodal vision-language models (VLMs). This paper explores a way to transfer robustness between VLMs without adversarial training. Method: The Heterogeneous Proxy Transfer (HPT) framework uses CLIP models with different architectures and no adversarial training as proxy teachers for one another, establishing cross-architecture robustness distillation channels. Generalization-Pivot Decoupling (GPD) then uses different learning rate schedules to split training into a warm-up phase that maintains generalization and a transfer phase that strengthens robustness, balancing natural accuracy and adversarial robustness. Result: Validated on 15 zero-shot datasets, HPT-GPD significantly improves the target model's adversarial robustness while barely affecting its original zero-shot natural generalization, outperforming existing distillation and robustness transfer methods. Conclusion: The paper reveals the proxy adversarial robustness phenomenon among CLIP models and proposes HPT-GPD, a framework that needs no adversarial training, offering an efficient route to robustness transfer for large-scale VLMs with a good balance between natural generalization and adversarial robustness. Abstract: As a pivotal technique for improving the defense of deep models, adversarial robustness transfer via distillation has demonstrated remarkable success in conventional image classification tasks. However, this paradigm encounters critical challenges when applied to vision-language models (VLMs) such as CLIP: constructing an adversarially robust teacher for large-scale multi-modal models demands prohibitively high computational resources. We bridge this gap by revealing an interesting phenomenon: vanilla CLIP (without adversarial training) exhibits intrinsic defensive capabilities against adversarial examples generated by another CLIP with a different architecture. We formally define this as proxy adversarial robustness, and naturally propose a Heterogeneous Proxy Transfer (HPT) framework that establishes cross-architectural robustness distillation channels between CLIP variants, effortlessly enabling the VLM robustness transfer from proxy to target models. Yet, such a proxy transfer paradigm easily induces severe overfitting, leading to a sharp degradation in zero-shot natural generalization. To resolve that, we design Generalization-Pivot Decoupling (GPD) by leveraging the difference in learning rate scheduling. This decouples the proxy transfer process into a generalization-anchored warm-up that maintains generalization and a generalization-pulled HPT that promotes adversarial robustness, to achieve an equilibrium between natural generalization and adversarial robustness. 
Extensive experiments on 15 zero-shot datasets demonstrate the effectiveness of our HPT-GPD method. The code is available at github.com/fxw13/HPT-GPD.

[333] Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation

Zhenxuan Lu,Zhihua Xu,Zhijing Yang,Feng Gao,Yongyi Lu,Keze Wang,Tianshui Chen

Main category: cs.CV

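TL;DR: This paper proposes THFEM, a framework that combines audio-driven talking head generation (AD-THG) models with speech-preserving facial expression manipulation (SPFEM) and uses an adjacent frame learning strategy to improve lip synchronization accuracy and image quality.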

Details Motivation: Existing SPFEM methods have difficulty keeping lip movements synchronized, whereas AD-THG models are good at modeling mouth shapes precisely, so the paper explores combining the two to improve speech consistency during expression manipulation. Method: The THFEM framework uses an AD-THG model to generate lip-synchronized frames from audio together with SPFEM-altered images. An adjacent frame learning strategy fine-tunes the AD-THG model to predict consecutive frames from neighboring-frame information, improving generation quality. Result: Experiments show that THFEM significantly improves lip-shape fidelity and image realism during expression manipulation, outperforming existing SPFEM methods. Conclusion: Modeling AD-THG and SPFEM together is an effective route to better speech-preserving expression editing, and the adjacent frame learning strategy further mitigates the quality loss caused by generating long sequences. Abstract: Speech-Preserving Facial Expression Manipulation (SPFEM) is an innovative technique aimed at altering facial expressions in images and videos while retaining the original mouth movements. Despite advancements, SPFEM still struggles with accurate lip synchronization due to the complex interplay between facial expressions and mouth shapes. Capitalizing on the advanced capabilities of audio-driven talking head generation (AD-THG) models in synthesizing precise lip movements, our research introduces a novel integration of these models with SPFEM. We present a new framework, Talking Head Facial Expression Manipulation (THFEM), which utilizes AD-THG models to generate frames with accurately synchronized lip movements from audio inputs and SPFEM-altered images. However, increasing the number of frames generated by AD-THG models tends to compromise the realism and expression fidelity of the images. To counter this, we develop an adjacent frame learning strategy that finetunes AD-THG models to predict sequences of consecutive frames. This strategy enables the models to incorporate information from neighboring frames, significantly improving image quality during testing. Our extensive experimental evaluations demonstrate that this framework effectively preserves mouth shapes during expression manipulations, highlighting the substantial benefits of integrating AD-THG with SPFEM.

[334] YOLO26: An Analysis of NMS-Free End to End Framework for Real-Time Object Detection

Sudip Chakrabarty

Main category: cs.CV

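TL;DR: YOLO26 is an NMS-free, end-to-end object detection model. With key techniques such as MuSGD, STAL, and ProgLoss, it is reported to surpass earlier YOLO versions and other state-of-the-art methods in both accuracy and speed.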

Details Motivation: Traditional YOLO models depend on NMS post-processing, which adds latency and is sensitive to hyperparameters, limiting real-time performance and robustness. Method: YOLO26 is trained end-to-end without NMS. It introduces MuSGD to stabilize training of lightweight backbones, STAL for small-target-aware assignment, and ProgLoss for dynamic supervision. Result: On official benchmarks, YOLO26 outperforms YOLOv1 through YOLO11, RTMDet, and DAMO-YOLO in both inference speed and detection accuracy, establishing a new Pareto front. Conclusion: By decoupling representation learning from heuristic post-processing, YOLO26 resolves the trade-off between latency and accuracy and marks a new step for edge vision systems. Abstract: The "You Only Look Once" (YOLO) framework has long served as the benchmark for real-time object detection, yet traditional iterations (YOLOv1 through YOLO11) remain constrained by the latency and hyperparameter sensitivity of Non-Maximum Suppression (NMS) post-processing. This paper presents a comprehensive analysis of YOLO26, an architecture that fundamentally redefines this paradigm by eliminating NMS in favor of a native end-to-end learning strategy. This study examines the critical innovations that enable this transition, specifically the introduction of the MuSGD optimizer for stabilizing lightweight backbones, STAL for small-target-aware assignment, and ProgLoss for dynamic supervision. Through a systematic review of official performance benchmarks, the results demonstrate that YOLO26 establishes a new Pareto front, outperforming a comprehensive suite of predecessors and state-of-the-art competitors (including RTMDet and DAMO-YOLO) in both inference speed and detection accuracy. The analysis confirms that by decoupling representation learning from heuristic post-processing, YOLO26 successfully resolves the historical trade-off between latency and precision, signaling the next evolutionary step in edge-based computer vision.
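To show what an NMS-free design removes, here is a minimal sketch of the classic greedy Non-Maximum Suppression step. It is the textbook algorithm, not code from any YOLO release; boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the threshold values are illustrative.

```python
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Visit boxes from highest to lowest score; keep a box only if it does
    # not overlap an already kept box by more than the threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
strict = nms(boxes, scores, iou_threshold=0.5)
loose = nms(boxes, scores, iou_threshold=0.7)
```

The same detections yield two or three boxes depending on the threshold, which illustrates the hyperparameter sensitivity the abstract refers to; the sequential loop is also a source of post-processing latency.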

[335] Simultaneous Detection of LSD and FMD in Cattle Using Ensemble Deep Learning

Nazibul Basar Ayon,Abdul Hasib,Md. Faishal Ahmed,Md. Sadiqur Rahman,Kamrul Islam,T. M. Mehrab Hasan,A. S. M. Ahsanul Sarkar Akib

Main category: cs.CV

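TL;DR: Proposes an ensemble deep learning framework based on VGG16, ResNet50, and InceptionV3 for simultaneous detection of Lumpy Skin Disease (LSD) and Foot-and-Mouth Disease (FMD) in cattle, reaching 98.2% accuracy despite overlapping symptoms across diseases.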

Details Motivation: LSD and FMD have similar clinical symptoms and are easily confused with benign lesions, which makes diagnosis difficult. An automated diagnostic tool that can distinguish multiple diseases accurately is needed for early control. Method: Build an ensemble deep learning model that combines VGG16, ResNet50, and InceptionV3 and refines the classification with a weighted averaging strategy, trained and validated on 10,516 expert-annotated images from 18 farms in India, Brazil, and the USA. Result: The model reaches 98.2% accuracy, 98.2% macro-averaged precision, 98.1% recall, 98.1% F1-score, and 99.5% AUC-ROC, significantly better than single-model approaches. Conclusion: The ensemble framework addresses the diagnostic challenge posed by overlapping LSD and FMD symptoms, offers high accuracy and automation, suits early disease control in resource-limited regions, and can support sustainable livestock farming. Abstract: Lumpy Skin Disease (LSD) and Foot-and-Mouth Disease (FMD) are highly contagious viral diseases affecting cattle, causing significant economic losses and welfare challenges. Their visual diagnosis is complicated by significant symptom overlap with each other and with benign conditions like insect bites or chemical burns, hindering timely control measures. Leveraging a comprehensive dataset of 10,516 expert-annotated images from 18 farms across India, Brazil, and the USA, this study presents a novel Ensemble Deep Learning framework integrating VGG16, ResNet50, and InceptionV3 with optimized weighted averaging for simultaneous LSD and FMD detection. The model achieves a state-of-the-art accuracy of 98.2%, with macro-averaged precision of 98.2%, recall of 98.1%, F1-score of 98.1%, and an AUC-ROC of 99.5%. This approach uniquely addresses the critical challenge of symptom overlap in multi-disease detection, enabling early, precise, and automated diagnosis. This tool has the potential to enhance disease management, support global agricultural sustainability, and is designed for future deployment in resource-limited settings.
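The weighted-averaging step of such an ensemble can be sketched as below. The per-model probability vectors, the class order, and the weights are made-up illustrations; the abstract does not state how the weights were optimized, so they are simply passed in here.

```python
def weighted_ensemble(prob_lists, weights):
    """Combine per-model class probabilities with a normalized weighted mean."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    avg = [sum(w * p[c] for w, p in zip(weights, prob_lists)) / total
           for c in range(n_classes)]
    predicted = max(range(n_classes), key=avg.__getitem__)
    return avg, predicted

# Hypothetical outputs of three backbones; assumed class order: LSD, FMD, healthy.
model_probs = [[0.6, 0.3, 0.1],
               [0.2, 0.7, 0.1],
               [0.3, 0.6, 0.1]]
_, equal_vote = weighted_ensemble(model_probs, [1.0, 1.0, 1.0])
_, skewed_vote = weighted_ensemble(model_probs, [5.0, 1.0, 1.0])
```

With equal weights the two agreeing models win; giving the first model a much larger weight flips the decision, which is why the weights themselves need to be tuned on held-out data.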

[336] TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents

Chan Naseeb,Adeel Ashraf Cheema,Hassan Sami,Tayyab Afzal,Muhammad Omair,Usman Habib

Main category: cs.CV

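TL;DR: This paper proposes TwoHead-SwinFPN, a unified deep learning architecture that jointly performs binary classification and precise localization of manipulated regions in identity documents, combining a Swin Transformer backbone with attention mechanisms and performing well on both tasks.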

Details Motivation: As generative AI models advance, identity documents face growing threats from synthetic attacks such as face swapping and text tampering, so efficient and precise detection methods are needed. Method: Use a Swin Transformer backbone with a Feature Pyramid Network (FPN) and a UNet-style decoder, add CBAM modules to strengthen feature representation, and jointly optimize classification and segmentation with a dual-head architecture and uncertainty-weighted multi-task learning. Result: On the FantasyIDiap dataset the model reaches 84.31% accuracy, 90.78% AUC, 57.24% mean Dice score, and 88.61% F1-score, with good cross-device generalization and multilingual adaptability. Conclusion: TwoHead-SwinFPN balances strong performance with computational efficiency for identity document manipulation detection and is suitable for practical deployment. Abstract: The proliferation of sophisticated generative AI models has significantly escalated the threat of synthetic manipulations in identity documents, particularly through face swapping and text inpainting attacks. This paper presents TwoHead-SwinFPN, a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with a Feature Pyramid Network (FPN) and a UNet-style decoder, enhanced with a Convolutional Block Attention Module (CBAM) for improved feature representation. The model employs a dual-head architecture for joint optimization of detection and segmentation tasks, utilizing uncertainty-weighted multi-task learning. Extensive experiments on the FantasyIDiap dataset demonstrate superior performance with 84.31% accuracy, 90.78% AUC for classification, and 57.24% mean Dice score for localization. The proposed method achieves an F1-score of 88.61% for binary classification while maintaining computational efficiency suitable for real-world deployment through FastAPI implementation. Our comprehensive evaluation includes ablation studies, cross-device generalization analysis, and detailed performance assessment across 10 languages and 3 acquisition devices.
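Two quantities named in this entry can be sketched briefly: a Dice score for the localization mask and an uncertainty-weighted sum of the two task losses. The weighting below follows the commonly used homoscedastic-uncertainty form, with one learnable log-variance per task; the abstract does not give the paper's exact formulation, so this is an assumption, and all names are illustrative.

```python
import math

def dice_score(pred_mask, true_mask, eps=1e-7):
    # Dice = 2 * |A ∩ B| / (|A| + |B|) for flat binary masks.
    inter = sum(p * t for p, t in zip(pred_mask, true_mask))
    return (2 * inter + eps) / (sum(pred_mask) + sum(true_mask) + eps)

def uncertainty_weighted_loss(losses, log_vars):
    # Each task loss is scaled by exp(-s) and regularized by +s, where s is a
    # learnable log-variance (assumed formulation, not confirmed by the paper).
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

d = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
# Hypothetical classification and segmentation losses with neutral weights.
total = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])
```

Raising a task's log-variance shrinks that task's share of the total loss, so training can balance the two heads without a hand-tuned weight.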

[337] Supervision-by-Hallucination-and-Transfer: A Weakly-Supervised Approach for Robust and Precise Facial Landmark Detection

Jun Wan,Yuanzhi Yao,Zhihui Lai,Jie Zhou,Xianxu Hou,Wenwen Min

Main category: cs.CV

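TL;DR: Proposes SHT, a weakly-supervised framework that uses a Dual Hallucination Learning Network (DHLN) and a Facial Pose Transfer Network (FPTN) to improve facial landmark detection accuracy at low resolution.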

Details Motivation: Low-resolution images, feature compression, and imprecise annotations limit progress in high-precision facial landmark detection. Method: The weakly-supervised SHT framework contains DHLN and FPTN. DHLN learns high-resolution features by combining face super-resolution with landmark detection, and FPTN further refines the detections through pose transfer. Result: The method outperforms existing state-of-the-art approaches on both face super-resolution and landmark detection. Conclusion: SHT is the first to bring face hallucination and pose transfer into weakly-supervised facial landmark detection, and it improves detection on low-quality images. Abstract: High-precision facial landmark detection (FLD) relies on high-resolution deep feature representations. However, low-resolution face images or the compression (via pooling or strided convolution) of originally high-resolution images hinder the learning of such features, thereby reducing FLD accuracy. Moreover, insufficient training data and imprecise annotations further degrade performance. To address these challenges, we propose a weakly-supervised framework called Supervision-by-Hallucination-and-Transfer (SHT) for more robust and precise FLD. SHT contains two novel mutually enhanced modules: Dual Hallucination Learning Network (DHLN) and Facial Pose Transfer Network (FPTN). By incorporating FLD and face hallucination tasks, DHLN is able to learn high-resolution representations with low-resolution inputs for recovering both facial structures and local details and generating more effective landmark heatmaps. Then, by transforming faces from one pose to another, FPTN can further improve landmark heatmaps and faces hallucinated by DHLN for detecting more accurate landmarks. To the best of our knowledge, this is the first study to explore weakly-supervised FLD by integrating face hallucination and facial pose transfer tasks. Experimental results of both face hallucination and FLD demonstrate that our method surpasses state-of-the-art techniques.

[338] Dual-Stream Collaborative Transformer for Image Captioning

Jun Wan,Jun Liu,Zhihui lai,Jie Zhou

Main category: cs.CV

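TL;DR: This paper proposes the Dual-Stream Collaborative Transformer (DSCT), which fuses region features and segmentation features to make generated image captions more accurate and descriptive.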

Details Motivation: Existing region-feature-based image captioning methods tend to generate irrelevant descriptions, mainly because they lack contextual information and rely too heavily on the partial caption already generated. Method: DSCT contains Pattern-Specific Mutual Attention Encoders (PSMAEs) and Dynamic Nomination Decoders (DNDs). The PSMAE strengthens the private information of the two feature types, and the DND dynamically selects the most relevant learning blocks and uses the fused features to generate the caption. Result: Experiments on several mainstream benchmark datasets show that the method outperforms existing state-of-the-art image captioning models. Conclusion: DSCT is the first to fuse different pattern-specific features dynamically, which alleviates semantic inconsistency and spatial misalignment and significantly improves captioning performance. Abstract: Current region feature-based image captioning methods have progressed rapidly and achieved remarkable performance. However, they are still prone to generating irrelevant descriptions due to the lack of contextual information and the over-reliance on generated partial descriptions for predicting the remaining words. In this paper, we propose a Dual-Stream Collaborative Transformer (DSCT) to address this issue by introducing the segmentation feature. The proposed DSCT consolidates and then fuses the region and segmentation features to guide the generation of caption sentences. It contains multiple Pattern-Specific Mutual Attention Encoders (PSMAEs) and Dynamic Nomination Decoders (DNDs). The PSMAE effectively highlights and consolidates the private information of two representations by querying each other. The DND dynamically searches for the most relevant learning blocks to the input textual representations and exploits the homogeneous features between the consolidated region and segmentation features to generate more accurate and descriptive caption sentences. To the best of our knowledge, this is the first study to explore how to fuse different pattern-specific features in a dynamic way to bypass their semantic inconsistencies and spatial misalignment issues for image captioning. The experimental results from popular benchmark datasets demonstrate that our DSCT outperforms the state-of-the-art image captioning models in the literature.

[339] Membership Inference Test: Auditing Training Data in Object Classification Models

Gonzalo Mancera,Daniel DeAlcala,Aythami Morales,Ruben Tolosana,Julian Fierrez

Main category: cs.CV

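TL;DR: This study proposes and develops Membership Inference Test (MINT) model architectures for object recognition. Using convolutional layers to analyze activation patterns from the training process, they determine whether data was used for training with 70% to 80% precision on several public datasets.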

Details Motivation: To improve the ability to determine whether data took part in training for object recognition tasks, making a model's use of training data more explainable and transparent. Method: Design a dedicated MINT architecture that combines an object detection model, an embedding extractor, and a MINT module, use convolutional layers to capture activation patterns from the training process, and validate experimentally on three public image databases. Result: On datasets totaling more than 174K images, the approach reaches 70% to 80% precision, depending on the depth of the detection-module layer used as input; the key factors affecting MINT performance are also analyzed. Conclusion: The proposed MINT architecture identifies training data usage effectively, makes training in object recognition more transparent, and provides a practical tool for data privacy and model auditing. Abstract: In this research, we analyze the performance of Membership Inference Tests (MINT), focusing on determining whether given data were utilized during the training phase, specifically in the domain of object recognition. Within the area of object recognition, we propose and develop architectures tailored for MINT models. These architectures aim to optimize performance and efficiency in data utilization, offering a tailored solution to tackle the complexities inherent in the object recognition domain. We conducted experiments involving an object detection model, an embedding extractor, and a MINT module. These experiments were performed on three public databases, totaling over 174K images. The proposed architecture leverages convolutional layers to capture and model the activation patterns present in the data during the training process. Through our analysis, we are able to identify given data used for testing and training, achieving precision rates ranging between 70% and 80%, contingent upon the depth of the detection module layer chosen for input to the MINT module. Additionally, our studies entail an analysis of the factors influencing the MINT Module, delving into the contributing elements behind more transparent training processes.

[340] QASA: Quality-Guided K-Adaptive Slot Attention for Unsupervised Object-Centric Learning

Tianran Ouyang,Xingping Dong,Jing Zhang,Mang Ye,Jun Chen,Bo Du

Main category: cs.CV

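TL;DR: This paper proposes Quality-Guided K-Adaptive Slot Attention (QASA), which decouples slot selection from reconstruction and introduces an unsupervised slot-quality metric to achieve high-quality object binding with a dynamic number of slots, significantly outperforming existing K-adaptive Slot Attention methods.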

Details Motivation: Existing K-adaptive Slot Attention methods have two problems: they place no explicit constraint on slot-binding quality, and the slot-count penalty conflicts with the reconstruction objective during optimization, so they lag behind K-fixed methods. Method: Proposes QASA, which decouples slot selection from reconstruction; designs an unsupervised Slot-Quality metric to assess each slot; adopts a quality-guided slot selection scheme that reconstructs from high-quality slots only; and uses a gated decoder during training and token-wise attention competition at inference to produce a K-adaptive output. Result: QASA substantially outperforms existing K-adaptive methods on both synthetic and real datasets, and surpasses strong K-fixed baselines on real-world datasets. Conclusion: By introducing slot-quality assessment and a decoupled design, QASA resolves the optimization conflict and binding ambiguity of K-adaptive Slot Attention and yields better object-centric representations. Abstract: Slot Attention, an approach that binds different objects in a scene to a set of "slots", has become a leading method in unsupervised object-centric learning. Most methods assume a fixed slot count K, and to better accommodate the dynamic nature of object cardinality, a few works have explored K-adaptive variants. However, existing K-adaptive methods still suffer from two limitations. First, they do not explicitly constrain slot-binding quality, so low-quality slots lead to ambiguous feature attribution. Second, adding a slot-count penalty to the reconstruction objective creates conflicting optimization goals between reducing the number of active slots and maintaining reconstruction fidelity. As a result, they still lag significantly behind strong K-fixed baselines. To address these challenges, we propose Quality-Guided K-Adaptive Slot Attention (QASA). First, we decouple slot selection from reconstruction, eliminating the mutual constraints between the two objectives. Then, we propose an unsupervised Slot-Quality metric to assess per-slot quality, providing a principled signal for fine-grained slot-object binding. Based on this metric, we design a Quality-Guided Slot Selection scheme that dynamically selects a subset of high-quality slots and feeds them into our newly designed gated decoder for reconstruction during training. At inference, token-wise competition on slot attention yields a K-adaptive outcome. Experiments show that QASA substantially outperforms existing K-adaptive methods on both real and synthetic datasets. Moreover, on real-world datasets QASA surpasses K-fixed methods.

[341] GazeD: Context-Aware Diffusion for Accurate 3D Gaze Estimation

Riccardo Catalini,Davide Di Nucci,Guido Borghi,Davide Davoli,Lorenzo Garattoni,Giampiero Francesca,Yuki Kawana,Roberto Vezzani

Main category: cs.CV

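TL;DR: Proposes GazeD, a new 3D gaze estimation method that works from a single RGB image, uses a diffusion model to handle uncertainty, and represents gaze as an additional body joint at a fixed distance from the eyes, achieving state-of-the-art performance.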

Details Motivation: Existing 3D gaze estimation methods struggle with the depth ambiguity and uncertainty of monocular images and often ignore the link between gaze and body pose. Method: Proposes GazeD, which uses a diffusion model to generate multiple plausible 3D gaze and pose hypotheses; 3D gaze is modeled as an additional joint at a fixed distance from the eyes, and the denoising process is jointly conditioned on the 2D pose, the surroundings of the subject, and the scene context. Result: Evaluations on three benchmark datasets show that GazeD reaches state-of-the-art 3D gaze estimation, even surpassing methods that rely on temporal information. Conclusion: Treating gaze as an additional joint and denoising it jointly with the pose improves the accuracy and robustness of single-image 3D gaze estimation. Abstract: We introduce GazeD, a new 3D gaze estimation method that jointly provides 3D gaze and human pose from a single RGB image. Leveraging the ability of diffusion models to deal with uncertainty, it generates multiple plausible 3D gaze and pose hypotheses based on the 2D context information extracted from the input image. Specifically, we condition the denoising process on the 2D pose, the surroundings of the subject, and the context of the scene. With GazeD we also introduce a novel way of representing the 3D gaze by positioning it as an additional body joint at a fixed distance from the eyes. The rationale is that the gaze is usually closely related to the pose, and thus it can benefit from being jointly denoised during the diffusion process. Evaluations across three benchmark datasets demonstrate that GazeD achieves state-of-the-art performance in 3D gaze estimation, even surpassing methods that rely on temporal information. Project details will be available at https://aimagelab.ing.unimore.it/go/gazed.
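The "gaze as an additional joint" representation in the GazeD abstract can be illustrated with a little vector arithmetic. The sketch below is only an assumption-laden illustration: the function names, the eye-center input, and the distance value `GAZE_JOINT_DISTANCE` are hypothetical, since the abstract does not state them. It shows how a gaze direction maps to a joint position and back.

```python
import math

# Assumed fixed eye-to-joint distance (arbitrary units); the abstract does not give the value.
GAZE_JOINT_DISTANCE = 1.0


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("a zero-length vector has no direction")
    return tuple(c / norm for c in v)


def gaze_to_joint(eye_center, gaze_dir, distance=GAZE_JOINT_DISTANCE):
    """Place a pseudo-joint `distance` away from the eyes along the gaze direction."""
    d = normalize(gaze_dir)
    return tuple(e + distance * c for e, c in zip(eye_center, d))


def joint_to_gaze(eye_center, gaze_joint):
    """Recover the unit gaze direction from the pseudo-joint position."""
    return normalize(tuple(j - e for j, e in zip(gaze_joint, eye_center)))
```

Because the joint lives in the same coordinate space as the other body joints, a pose model could in principle treat it like any other keypoint, which seems to be the rationale the authors give for denoising gaze and pose jointly.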

[342] StyMam: A Mamba-Based Generator for Artistic Style Transfer

Zhou Hong,Rongsheng Hu,Yicheng Di,Xiaolong Xu,Ning Dong,Yihua Shao,Run Ling,Yun Wang,Juqin Wang,Zhanjie Zhang,Ao Ma

Main category: cs.CV

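TL;DR: This paper proposes StyMam, a Mamba-based generator for image style transfer that combines local texture extraction with global dependency modeling through a residual dual-path strip scanning mechanism and a channel-reweighted spatial attention module, outperforming existing methods in both quality and speed.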

Details Motivation: Existing style transfer methods based on GANs or Stable Diffusion (SD) fall short in modeling local and global features, preserving content structure, and inference speed, so a method is needed that avoids artifacts while preserving content structure efficiently. Method: Proposes StyMam, a generator built on the Mamba architecture, with a residual dual-path strip scanning mechanism to capture local texture features and a channel-reweighted spatial attention module to model global dependencies. Result: Experiments show that the method outperforms current state-of-the-art algorithms in both qualitative and quantitative evaluation, producing high-quality stylized images without artifacts and with faster inference. Conclusion: By combining Mamba's long-range modeling with purpose-built modules, StyMam unifies content preservation, visual harmony, and efficiency in image style transfer, and suggests a direction for non-attention sequence models in visual generation. Abstract: Image style transfer aims to integrate the visual patterns of a specific artistic style into a content image while preserving its content structure. Existing methods mainly rely on the generative adversarial network (GAN) or stable diffusion (SD). GAN-based approaches using CNNs or Transformers struggle to jointly capture local and global dependencies, leading to artifacts and disharmonious patterns. SD-based methods reduce such issues but often fail to preserve content structures and suffer from slow inference. To address these issues, we revisit GAN and propose a Mamba-based generator, termed StyMam, to produce high-quality stylized images without introducing artifacts and disharmonious patterns. Specifically, we introduce a Mamba-based generator with a residual dual-path strip scanning mechanism and a channel-reweighted spatial attention module. The former efficiently captures local texture features, while the latter models global dependencies. Finally, extensive qualitative and quantitative experiments demonstrate that the proposed method outperforms state-of-the-art algorithms in both quality and speed.

[343] Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation

John Waithaka,Gustave Bwirayesu,Moise Busogi

Main category: cs.CV

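TL;DR: Proposes a spatial affinity component that uses high-resolution imagery to improve self-supervised pretraining on mid-resolution imagery.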

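Details Motivation: To explore how newly released high-resolution remote sensing datasets can strengthen representation learning and downstream segmentation performance for mid-resolution imagery. Method: Designs a spatial affinity component that can be plugged into existing self-supervised learning frameworks and uses high-resolution imagery to support learning on mid-resolution imagery. Result: Validated on two self-supervised learning frameworks, the method outperforms models pretrained on high-resolution or mid-resolution images alone. Conclusion: Bringing in high-resolution data and training jointly through the spatial affinity component effectively improves the representations learned for mid-resolution remote sensing imagery. Abstract: Self-supervised pretraining in remote sensing is mostly done using mid-spatial resolution (MR) image datasets due to their high availability. Given the release of high-resolution (HR) datasets, we ask how HR datasets can be included in self-supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self-supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self-supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.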

[344] Early Prediction of Type 2 Diabetes Using Multimodal data and Tabular Transformers

Sulaiman Khan,Md. Rafiul Biswas,Zubair Shah

Main category: cs.CV

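TL;DR: This study proposes a new model based on a tabular transformer (TabTrans) for early risk prediction of Type 2 Diabetes Mellitus (T2DM) from longitudinal health data and bone density data, and validates on a Qatar BioBank cohort that it outperforms conventional machine learning and generative AI models.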

Details Motivation: Existing methods are limited in capturing the complex long-range dependencies of disease progression and struggle to use multimodal tabular medical data for accurate prediction. Method: Uses the TabTrans architecture to process patients' longitudinal electronic health records (EHR) together with dual-energy X-ray absorptiometry (DXA) data, applies SMOTE and SMOTE-ENN to address class imbalance, and compares against several conventional machine learning and generative AI models. Result: The TabTrans model reaches ROC AUC ≥ 79.7%, outperforming the compared models; feature analysis identifies visceral adipose tissue (VAT) mass and volume, ward BMD and BMC, T and Z-scores, and L1-L4 scores as key predictors. Conclusion: TabTrans can uncover deep patterns in complex tabular medical data and shows clinical potential for early T2DM warning and personalized intervention in the Qatari population. Abstract: This study introduces a novel approach for early Type 2 Diabetes Mellitus (T2DM) risk prediction using a tabular transformer (TabTrans) architecture to analyze longitudinal patient data. By processing patients' longitudinal health records and bone-related tabular data, our model captures complex, long-range dependencies in disease progression that conventional methods often overlook. We validated our TabTrans model on a retrospective Qatar BioBank (QBB) cohort of 1,382 subjects, comprising 725 men (146 diabetic, 579 healthy) and 657 women (133 diabetic, 524 healthy). The study integrated electronic health records (EHR) with dual-energy X-ray absorptiometry (DXA) data. To address class imbalance, we employed SMOTE and SMOTE-ENN resampling techniques. The proposed model's performance is evaluated against conventional machine learning (ML) and generative AI models, including Claude 3.5 Sonnet (Anthropic's constitutional AI), GPT-4 (OpenAI's generative pre-trained transformer), and Gemini Pro (Google's multimodal language model). Our TabTrans model demonstrated superior predictive performance, achieving ROC AUC ≥ 79.7% for T2DM prediction compared to both generative AI models and conventional ML approaches. Feature interpretation analysis identified key risk indicators, with visceral adipose tissue (VAT) mass and volume, ward bone mineral density (BMD) and bone mineral content (BMC), T and Z-scores, and L1-L4 scores emerging as the most important predictors associated with diabetes development in Qatari adults.
These findings demonstrate the significant potential of TabTrans for analyzing complex tabular healthcare data, providing a powerful tool for proactive T2DM management and personalized clinical interventions in the Qatari population. Index Terms: tabular transformers, multimodal data, DXA data, diabetes, T2DM, feature interpretation, tabular data
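The abstract mentions SMOTE resampling for class imbalance without describing it. As a rough reminder of the idea, the sketch below interpolates between a minority-class sample and one of its nearest minority neighbors. It is a simplified, stdlib-only illustration rather than the authors' pipeline: the function names, `k`, and the toy data are assumptions, and in practice a library such as imbalanced-learn (which provides `SMOTE` and `SMOTEENN` classes) would normally be used.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x, excluding x itself
        others = sorted((m for m in minority if m is not x),
                        key=lambda m: squared_distance(x, m))[:k]
        neighbor = rng.choice(others)
        lam = rng.random()  # random position on the segment from x to neighbor
        synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbor)])
    return synthetic
```

Each synthetic point lies on a line segment between two real minority samples, so the minority class is enlarged without duplicating records. SMOTE-ENN additionally cleans the result with edited nearest neighbours, which is not shown here.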

[345] AsyncBEV: Cross-modal Flow Alignment in Asynchronous 3D Object Detection

Shiming Wang,Holger Caesar,Liangliang Nan,Julian F. P. Kooij

Main category: cs.CV

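TL;DR: Proposes AsyncBEV, a lightweight trainable module that estimates and compensates for the feature flow caused by time offsets between sensors, making 3D BEV object detection models more robust to sensor asynchrony.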

Details Motivation: Sensors are hard to synchronize perfectly in practice, and time offsets markedly degrade the perception of dynamic objects, especially in multi-modal fusion. Method: Inspired by scene flow estimation, AsyncBEV first estimates the 2D flow between the BEV features of two modalities, taking the known time offset into account, and then warps the feature maps with this flow to align them spatially; it can be integrated into various BEV detectors. Result: Validated on CMT and UniBEV, AsyncBEV improves NDS on dynamic objects by 16.6% and 11.9% respectively at the worst-case time offset of 0.5 s, significantly outperforming the baselines. Conclusion: AsyncBEV makes multi-modal BEV detection models more robust to sensor asynchrony, especially for dynamic objects, and is generic and practical. Abstract: In autonomous driving, multi-modal perception tasks like 3D object detection typically rely on well-synchronized sensors, both at training and inference. However, despite the use of hardware- or software-based synchronization algorithms, perfect synchrony is rarely guaranteed: Sensors may operate at different frequencies, and real-world factors such as network latency, hardware failures, or processing bottlenecks often introduce time offsets between sensors. Such asynchrony degrades perception performance, especially for dynamic objects. To address this challenge, we propose AsyncBEV, a trainable lightweight and generic module to improve the robustness of 3D Bird's Eye View (BEV) object detection models against sensor asynchrony. Inspired by scene flow estimation, AsyncBEV first estimates the 2D flow from the BEV features of two different sensor modalities, taking into account the known time offset between these sensor measurements. The predicted feature flow is then used to warp and spatially align the feature maps, which we show can easily be integrated into different current BEV detector architectures (e.g., BEV grid-based and token-based). Extensive experiments demonstrate AsyncBEV improves robustness against both small and large asynchrony between LiDAR or camera sensors in both the token-based CMT and grid-based UniBEV, especially for dynamic objects. We significantly outperform the ego motion compensated CMT and UniBEV baselines, notably by 16.6% and 11.9% NDS on dynamic objects in the worst-case scenario of a 0.5 s time offset. Code will be released upon acceptance.
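The warp step described in the AsyncBEV abstract can be pictured with a toy example. The sketch below backward-warps a single-channel BEV grid with a per-cell flow using nearest-neighbor sampling. It is an illustration under stated assumptions, not the authors' module: `flow_from_velocity` is a hypothetical stand-in for the learned flow estimator, and real implementations would typically use bilinear sampling on multi-channel tensors.

```python
def flow_from_velocity(velocity, dt):
    """Hypothetical stand-in for the learned estimator: flow = velocity * time offset."""
    return [[(vx * dt, vy * dt) for (vx, vy) in row] for row in velocity]


def warp_bev(feature, flow):
    """Backward warp: out[y][x] samples feature at (x + dx, y + dy), zero outside the grid."""
    height, width = len(feature), len(feature[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            src_x, src_y = round(x + dx), round(y + dy)  # nearest-neighbor sampling
            if 0 <= src_x < width and 0 <= src_y < height:
                out[y][x] = feature[src_y][src_x]
    return out
```

With a flow of (-1, 0) everywhere, each output cell reads from its left neighbor, so an object in the stale feature map shifts one cell to the right, toward where it would be at the reference timestamp.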

[346] Think3D: Thinking with Space for Spatial Reasoning

Zaibin Zhang,Yuhan Wu,Lianjie Jia,Yifan Wang,Zhongbo Zhang,Yijiang Li,Binghao Ran,Fuxi Zhang,Zhuohan Sun,Zhenfei Yin,Lijun Wang,Huchuan Lu

Main category: cs.CV

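TL;DR: This paper proposes the Think3D framework, which uses 3D reconstruction models to give vision large models spatial intelligence, improving 3D spatial reasoning without any training.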

Details Motivation: Existing vision large models rely mainly on 2D perception, which limits them on physical-world tasks that require understanding geometry, perspective, and spatial relations, so they struggle with genuine 3D reasoning. Method: Proposes the Think3D framework, which uses 3D reconstruction models to recover point clouds and camera poses from images or videos and, through camera-based operations and ego/global-view switching, turns spatial reasoning into an interactive 3D chain-of-thought process. Result: Without additional training, it yields clear gains on advanced models such as GPT-4.1 and Gemini 2.5 Pro, with average improvements of +7.8% on BLINK Multi-view and MindCube and +4.7% on VSI-Bench; with a reinforcement learning policy, the benefit of tool usage for smaller models rises from +0.7% to +6.8%. Conclusion: Training-free, tool-augmented spatial exploration is a viable path toward more flexible, human-like 3D reasoning and opens a new dimension of multimodal intelligence. Abstract: Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle with genuine 3D reasoning. We introduce Think3D, a framework that enables VLM agents to think with 3D space. By leveraging 3D reconstruction models that recover point clouds and camera poses from images or videos, Think3D allows the agent to actively manipulate space through camera-based operations and ego/global-view switching, transforming spatial reasoning into an interactive 3D chain-of-thought process. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models such as GPT-4.1 and Gemini 2.5 Pro, yielding average gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. We further show that smaller models, which struggle with spatial exploration, benefit significantly from a reinforcement learning policy that enables the model to select informative viewpoints and operations. With RL, the benefit from tool usage increases from +0.7% to +6.8%. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents, establishing a new dimension of multimodal intelligence. Code and weights are released at https://github.com/zhangzaibin/spagent.

[347] GridNet-HD: A High-Resolution Multi-Modal Dataset for LiDAR-Image Fusion on Power Line Infrastructure

Antoine Carreaud,Shanci Li,Malo De Lacour,Digre Frinde,Jan Skaloud,Adrien Gressin

Main category: cs.CV

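TL;DR: This paper presents GridNet-HD, a multi-modal dataset for 3D semantic segmentation of overhead electrical infrastructure that pairs high-density LiDAR with high-resolution oblique imagery. It contains 7,694 images and 2.5 billion annotated points in 11 classes, and comes with unimodal and multi-modal baselines. Fusion models beat the best unimodal baseline by +5.55 mIoU, confirming that geometry and appearance are complementary. According to the authors, it is the first public dataset to combine high-density LiDAR, high-resolution oblique imagery, and 3D semantic labels for power-line assets.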

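Details Motivation: No existing public dataset offers both high-density LiDAR and high-resolution oblique imagery with 3D semantic labels for power-line assets, which limits research on multi-modal 3D semantic segmentation and calls for a large, high-quality multi-modal dataset. Method: Builds the GridNet-HD multi-modal dataset, pairing high-density LiDAR point clouds with high-resolution oblique imagery, annotating 7,694 images and 2.5 billion points into 11 semantic classes, and providing predefined splits and mIoU metrics; unimodal (LiDAR-only, image-only) and multi-modal fusion baselines are evaluated. Result: On GridNet-HD, the multi-modal fusion models improve on the best unimodal baseline by +5.55 mIoU, confirming the complementarity of geometric (LiDAR) and appearance (imagery) features. Conclusion: GridNet-HD fills the gap in high-quality multi-modal 3D annotated data for overhead electrical infrastructure, provides a resource for follow-up research, and shows experimentally that multi-modal fusion is effective for this task. Abstract: This paper presents GridNet-HD, a multi-modal dataset for 3D semantic segmentation of overhead electrical infrastructures, pairing high-density LiDAR with high-resolution oblique imagery. The dataset comprises 7,694 images and 2.5 billion points annotated into 11 classes, with predefined splits and mIoU metrics. Unimodal (LiDAR-only, image-only) and multi-modal fusion baselines are provided. On GridNet-HD, fusion models outperform the best unimodal baseline by +5.55 mIoU, highlighting the complementarity of geometry and appearance. As reviewed in Sec. 2, no public dataset jointly provides high-density LiDAR and high-resolution oblique imagery with 3D semantic labels for power-line assets. Dataset, baselines, and codes are available: https://huggingface.co/collections/heig-vd-geo/gridnet-hd.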
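Since the headline result is reported in mIoU, a small worked example of the metric may help. The sketch below computes mean Intersection-over-Union over per-point class labels. It is a generic illustration with hypothetical function and variable names, not the benchmark's official evaluation script, and it assumes classes absent from both prediction and ground truth are skipped.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over classes that occur in the labels or the predictions."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            intersection[t] += 1
            union[t] += 1
        else:
            # a mismatch enlarges the union of both classes involved
            union[t] += 1
            union[p] += 1
    ious = [intersection[c] / union[c] for c in range(num_classes) if union[c] > 0]
    if not ious:
        raise ValueError("no class present in labels or predictions")
    return sum(ious) / len(ious)
```

For labels `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, giving an mIoU of 7/12. The "+5.55 mIoU" in the abstract presumably refers to percentage points on this scale.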

[348] Prototype Learning-Based Few-Shot Segmentation for Low-Light Crack on Concrete Structures

Yulun Guo

Main category: cs.CV

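TL;DR: Proposes a dual-branch prototype network that combines Retinex theory with few-shot learning for concrete crack segmentation in low light; with a cross-similarity prior mask generation module and a multi-scale feature enhancement module, it achieves state-of-the-art performance while reducing the dependence on large annotated datasets.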

Details Motivation: Cracks are hard to detect in low-light environments and pixel-level annotation is time-consuming, yet existing deep learning methods depend on large annotated datasets and good illumination. Method: Uses Retinex-based reflectance components to learn illumination-invariant representations and combines them with metric learning for few-shot crack segmentation; designs a cross-similarity prior mask generation module and a multi-scale feature enhancement module to improve localization and consistency. Result: Achieves consistently leading performance on multiple low-light crack segmentation benchmarks. Conclusion: The method eases the difficulties that low light and scarce annotations pose for crack detection and has practical value. Abstract: Crack detection is critical for concrete infrastructure safety, but real-world cracks often appear in low-light environments like tunnels and bridge undersides, degrading computer vision segmentation accuracy. Pixel-level annotation of low-light crack images is extremely time-consuming, yet most deep learning methods require large, well-illuminated datasets. We propose a dual-branch prototype learning network integrating Retinex theory with few-shot learning for low-light crack segmentation. Retinex-based reflectance components guide illumination-invariant global representation learning, while metric learning reduces dependence on large annotated datasets. We introduce a cross-similarity prior mask generation module that computes high-dimensional similarities between query and support features to capture crack location and structure, and a multi-scale feature enhancement module that fuses multi-scale features with the prior mask to alleviate spatial inconsistency. Extensive experiments on multiple benchmarks demonstrate consistent state-of-the-art performance under low-light conditions. Code: https://github.com/YulunGuo/CrackFSS.
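Prototype-based few-shot segmentation is often explained through masked average pooling followed by a similarity comparison, and a minimal version is sketched below. This is the generic textbook recipe, assumed here for illustration; it is not the paper's dual-branch network or its cross-similarity module, and all names and toy features are hypothetical.

```python
import math


def masked_average_pooling(features, mask):
    """Average the feature vectors of the pixels selected by a binary mask."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(f[i] for f in selected) / len(selected) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_query(query_features, fg_proto, bg_proto):
    """Label a query pixel 1 (crack) if it is closer to the foreground prototype."""
    return [1 if cosine(q, fg_proto) > cosine(q, bg_proto) else 0
            for q in query_features]
```

The foreground prototype comes from support pixels inside the annotated crack mask and the background prototype from the inverted mask, so only a handful of labeled support images are needed at test time.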

[349] Patient-Conditioned Adaptive Offsets for Reliable Diagnosis across Subgroups

Gelei Xu,Yuying Duan,Jun Xia,Ruining Deng,Wei Jin,Yiyu Shi

Main category: cs.CV

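TL;DR: This paper proposes HyperAdapt, a patient-conditioned adaptation framework that makes medical diagnosis models more reliable across subgroups while preserving the overall performance of a shared diagnostic model.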

Details Motivation: Existing algorithmic fairness methods reduce performance gaps between populations by suppressing sensitive attributes, but in medicine these attributes carry important diagnostic value and removing them hurts accuracy, so a method is needed that uses this information while still improving fairness. Method: Proposes the HyperAdapt framework, which encodes clinically relevant attributes such as age and sex into a compact embedding that conditions a hypernetwork module; the module generates small residual modulation parameters for selected layers of the backbone, and low-rank and bottleneck structures limit adaptation complexity for efficient, robust subgroup adaptation. Result: Experiments on several public medical imaging benchmarks show clear subgroup gains without sacrificing overall accuracy; on PAD-UFES-20, recall and F1 exceed the strongest baseline by 4.1% and 4.4% respectively, with larger gains for underrepresented populations. Conclusion: HyperAdapt keeps the general medical knowledge of the backbone while using patient-specific adjustments to improve subgroup reliability and fairness, making it suitable for high-stakes medical applications. Abstract: AI models for medical diagnosis often exhibit uneven performance across patient populations due to heterogeneity in disease prevalence, imaging appearance, and clinical risk profiles. Existing algorithmic fairness approaches typically seek to reduce such disparities by suppressing sensitive attributes. However, in medical settings these attributes often carry essential diagnostic information, and removing them can degrade accuracy and reliability, particularly in high-stakes applications. In contrast, clinical decision making explicitly incorporates patient context when interpreting diagnostic evidence, suggesting a different design direction for subgroup-aware models. In this paper, we introduce HyperAdapt, a patient-conditioned adaptation framework that improves subgroup reliability while maintaining a shared diagnostic model. Clinically relevant attributes such as age and sex are encoded into a compact embedding and used to condition a hypernetwork-style module, which generates small residual modulation parameters for selected layers of a shared backbone. This design preserves the general medical knowledge learned by the backbone while enabling targeted adjustments that reflect patient-specific variability. To ensure efficiency and robustness, adaptations are constrained through low-rank and bottlenecked parameterizations, limiting both model complexity and computational overhead.
Experiments across multiple public medical imaging benchmarks demonstrate that the proposed approach consistently improves subgroup-level performance without sacrificing overall accuracy. On the PAD-UFES-20 dataset, our method outperforms the strongest competing baseline by 4.1% in recall and 4.4% in F1 score, with larger gains observed for underrepresented patient populations.
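One way to picture "small residual modulation parameters" under a low-rank constraint is a shared linear layer plus a rank-r correction whose gains depend on the patient embedding. The sketch below is a guess at that general shape, not the HyperAdapt parameterization: the linear hypernetwork `H`, the factors `U` and `V`, and every function name are assumptions made for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def hyper_gains(embedding, H):
    """Hypothetical linear hypernetwork: patient embedding -> r modulation gains."""
    return matvec(H, embedding)


def adapted_linear(x, W, U, V, gains):
    """Compute y = W x + U (gains * (V x)); the second term is a rank-r residual."""
    base = matvec(W, x)                                   # shared backbone layer
    bottleneck = [g * z for g, z in zip(gains, matvec(V, x))]
    delta = matvec(U, bottleneck)                         # patient-specific offset
    return [b + d for b, d in zip(base, delta)]
```

When the gains are all zero the layer reduces to the shared backbone, which matches the stated goal of preserving general knowledge while allowing targeted adjustments; the rank r bounds how many extra parameters the adaptation adds.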

[350] A Streamlined Attention-Based Network for Descriptor Extraction

Mattia D'Urso,Emanuele Santellani,Christian Sormann,Mattia Rossi,Andreas Kuhn,Friedrich Fraundorfer

Main category: cs.CV

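TL;DR: SANDesc is a lightweight attention-based descriptor extraction network that, through a revised U-Net structure with attention modules, significantly improves matching performance without changing the keypoint detector.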

Details Motivation: Existing keypoint descriptors trade matching performance against computational efficiency and train unstably, so a more efficient descriptor extraction method is needed. Method: Proposes the SANDesc network, built from residual U-Net blocks with attention modules and trained with a modified triplet loss combined with a hard negative mining strategy. Result: Outperforms existing methods on HPatches, MegaDepth-1500, and the Image Matching Challenge 2021 with only 2.4 million parameters, and is also validated on a newly introduced 4K urban dataset. Conclusion: SANDesc significantly improves keypoint descriptor matching while keeping computational complexity low, making it suitable for resource-constrained settings. Abstract: We introduce SANDesc, a Streamlined Attention-Based Network for Descriptor extraction that aims to improve on existing architectures for keypoint description. Our descriptor network learns to compute descriptors that improve matching without modifying the underlying keypoint detector. We employ a revised U-Net-like architecture enhanced with Convolutional Block Attention Modules and residual paths, enabling effective local representation while maintaining computational efficiency. We refer to the building blocks of our model as Residual U-Net Blocks with Attention. The model is trained using a modified triplet loss in combination with a curriculum learning-inspired hard negative mining strategy, which improves training stability. Extensive experiments on HPatches, MegaDepth-1500, and the Image Matching Challenge 2021 show that training SANDesc on top of existing keypoint detectors leads to improved results on multiple matching tasks compared to the original keypoint descriptors. At the same time, SANDesc has a model complexity of just 2.4 million parameters. As a further contribution, we introduce a new urban dataset featuring 4K images and pre-calibrated intrinsics, designed to evaluate feature extractors. On this benchmark, SANDesc achieves substantial performance gains over the existing descriptors while operating with limited computational resources.
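The abstract does not spell out its modified triplet loss or its curriculum, so the sketch below falls back on the standard margin-based triplet loss and adds one plausible, purely hypothetical curriculum knob: a `difficulty` value in [0, 1] that moves the chosen negative from the easiest (farthest) to the hardest (closest). It illustrates the general idea only.

```python
import math


def l2(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_negative(anchor, negatives, difficulty):
    """Hypothetical curriculum: difficulty 0 picks the easiest negative, 1 the hardest."""
    easiest_first = sorted(negatives, key=lambda n: l2(anchor, n), reverse=True)
    index = round(difficulty * (len(easiest_first) - 1))
    return easiest_first[index]


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)
```

Raising `difficulty` gradually over training would feed the loss progressively harder negatives, which is one common way to keep early training stable before the descriptors become discriminative.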

[351] PhaseMark: A Post-hoc, Optimization-Free Watermarking of AI-generated Images in the Latent Frequency Domain

Sung Ju Lee,Nam Ik Cho

Main category: cs.CV

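TL;DR: PhaseMark proposes a single-shot, optimization-free watermarking framework that directly modulates the phase in the VAE latent frequency domain, giving very fast image watermarking that is highly robust to severe attacks without degrading image quality.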

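Details Motivation: Existing post-hoc watermarking methods for diffusion-generated images are slow because of iterative optimization or inversion, which makes them impractical, so an efficient and robust watermarking scheme is needed. Method: PhaseMark directly modulates phase information in the latent frequency domain of the VAE without any optimization, embeds the watermark through four modulation variants, and exploits intrinsic properties of the latent space for efficiency. Result: PhaseMark is thousands of times faster than existing optimization-based methods, retains state-of-the-art robustness under severe attacks such as regeneration, and does not degrade the quality of generated images. Conclusion: PhaseMark demonstrates a new paradigm in which exploiting intrinsic latent properties yields efficient, robust, high-quality watermarking, offering a practical approach to copyright protection for diffusion models. Abstract: The proliferation of hyper-realistic images from Latent Diffusion Models (LDMs) demands robust watermarking, yet existing post-hoc methods are prohibitively slow due to iterative optimization or inversion processes. We introduce PhaseMark, a single-shot, optimization-free framework that directly modulates the phase in the VAE latent frequency domain. This approach makes PhaseMark thousands of times faster than optimization-based techniques while achieving state-of-the-art resilience against severe attacks, including regeneration, without degrading image quality. We analyze four modulation variants, revealing a clear performance-quality trade-off. PhaseMark demonstrates a new paradigm where efficient, resilient watermarking is achieved by exploiting intrinsic latent properties.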
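To make "modulating the phase in a frequency domain" concrete, here is a deliberately tiny 1D toy. It hides one bit by setting the phase of a single DFT coefficient to 0 or π while keeping its magnitude, then reads the bit back from the sign of that coefficient's real part. This is not PhaseMark: it operates on a plain real-valued list rather than VAE latents, and the encoding scheme, function names, and choice of `k` are all assumptions.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def embed_bit(signal, bit, k=1):
    """Set the phase of coefficient k to pi (bit 1) or 0 (bit 0); requires 0 < k < n/2."""
    X = dft(signal)
    n = len(X)
    X[k] = abs(X[k]) * cmath.exp(1j * (math.pi if bit else 0.0))
    X[n - k] = X[k].conjugate()  # keep conjugate symmetry so the output stays real
    return [value.real for value in idft(X)]


def detect_bit(signal, k=1):
    """Read the bit back: a phase near pi gives a negative real part."""
    return 1 if dft(signal)[k].real < 0 else 0
```

Embedding is a single forward and inverse transform with no iterative optimization, which is the property the abstract emphasizes; robustness to attacks such as regeneration is outside the scope of this toy.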

[352] GaussExplorer: 3D Gaussian Splatting for Embodied Exploration and Reasoning

Kim Yu-Ji,Dahye Lee,Kim Jun-Seong,GeonU Kim,Nam Hyeon-Woo,Yongjin Kwon,Yu-Chiang Frank Wang,Jaesung Choe,Tae-Hyun Oh

Main category: cs.CV

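TL;DR: GaussExplorer is a framework for embodied exploration and reasoning built on 3D Gaussian Splatting (3DGS) that brings in Vision-Language Models (VLMs) to understand 3D scenes and actively refine viewpoints under complex language queries.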

Details Motivation: Existing language-embedded 3DGS methods struggle with complex compositional language queries, while methods based on object-centric RGB-D memories are restricted to fixed viewpoints and lack flexibility. Method: Adds Vision-Language Models (VLMs) on top of 3DGS, identifies the pre-captured images most relevant to the query, and adjusts them to novel viewpoints that capture better visual information, supporting question-driven exploration and reasoning. Result: Experiments show that the method outperforms existing approaches on several benchmarks and clearly improves reasoning under complex queries. Conclusion: Combining VLM-driven reasoning with 3DGS effectively supports embodied exploration guided by complex language and gives a more flexible and accurate approach to 3D scene understanding. Abstract: We present GaussExplorer, a framework for embodied exploration and reasoning built on 3D Gaussian Splatting (3DGS). While prior approaches to language-embedded 3DGS have made meaningful progress in aligning simple text queries with Gaussian embeddings, they are generally optimized for relatively simple queries and struggle to interpret more complex, compositional language queries. Alternative studies based on object-centric RGB-D structured memories provide spatial grounding but are constrained by pre-fixed viewpoints. To address these issues, GaussExplorer introduces Vision-Language Models (VLMs) on top of 3DGS to enable question-driven exploration and reasoning within 3D scenes. We first identify pre-captured images that are most correlated with the query question, and subsequently adjust them into novel viewpoints to more accurately capture visual information for better reasoning by VLMs. Experiments show that our method outperforms existing methods on several benchmarks, demonstrating the effectiveness of integrating VLM-based reasoning with 3DGS for embodied tasks.

[353] CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks

Mingshuang Luo,Ruibing Hou,Bo Chao,Hong Chang,Zimo Liu,Yaowei Wang,Shiguang Shan

Main category: cs.CV

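TL;DR: This paper proposes CLASP, an unsupervised pre-training framework for human-centric visual tasks that uses CLIP to generate multi-level semantic pseudo-labels and a prompt-controlled MoE module to adapt dynamically to different downstream tasks, markedly improving the expressiveness and transferability of the learned representations.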

Details Motivation: Existing unsupervised pre-training methods for human-centric visual tasks do not model multi-level semantic information well and adapt poorly to downstream tasks of different granularity, so a more general and adaptable pre-training framework is needed. Method: Proposes the CLASP framework, which uses CLIP to generate pseudo-labels for body parts (low level) and attributes (high level), introduces a Prompt-Controlled MoE module that adjusts feature extraction dynamically according to task prompts, and adopts a multi-task pre-training strategy that jointly optimizes part-level and attribute-level semantics. Result: Experiments on multiple benchmarks show that CLASP consistently outperforms existing unsupervised pre-training methods across a range of human-centric downstream tasks, with stronger representation learning and transfer. Conclusion: By combining CLIP-guided multi-level pseudo-supervision with a prompt-controlled MoE mechanism, CLASP improves the generality and adaptability of unsupervised human-centric visual representations and advances human-centric visual analysis. Abstract: Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To achieve this goal, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework designed for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module. MoE dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy, where part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods, advancing the field of human-centric visual analysis.

[354] TVWorld: Foundations for Remote-Control TV Agents

Zhantao Ma,Quanfeng Lu,Shuai Zhong,Dahai Yu,Ping Luo,Michael K. Ng

Main category: cs.CV

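TL;DR: This paper presents TVWorld, an offline graph-based abstraction for evaluating TV navigation, and builds two benchmarks, TVWorld-N and TVWorld-G, which reveal that existing models lack topology awareness. The authors therefore propose a topology-aware training framework and develop TVTheseus, a foundation model specialized for TV navigation, which reaches a 68.3% success rate on TVWorld-N, surpassing strong closed-source models and setting state-of-the-art performance.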

Details Motivation: Research on vision-language models for device control has focused on point-and-click interaction, while the remote-control interaction common in everyday TV use remains largely unexplored; filling this gap requires benchmarks and methods that can evaluate long-horizon, focus-based navigation. Method: Proposes TVWorld, a graph-based abstraction of TV navigation, builds two benchmarks, TVWorld-N (topology-aware navigation) and TVWorld-G (focus-aware grounding), and designs a topology-aware training framework that strengthens the structural understanding of LVLMs in long-horizon navigation. Result: TVTheseus achieves a 68.3% success rate on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash; additional analyses offer further insights into building effective TV-use agents. Conclusion: Topology awareness is key to improving LVLMs on long-horizon, focus-driven navigation in remote-control settings, and TVTheseus together with the TVWorld framework provides a path and an evaluation standard for future smart TV interaction systems. Abstract: Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction commonly encountered in everyday TV usage remains largely underexplored. To fill this gap, we introduce TVWorld, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: TVWorld-N for topology-aware navigation and TVWorld-G for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a Topology-Aware Training framework that injects topology awareness into LVLMs. Using this framework, we develop TVTheseus, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of 68.3% on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Additional analyses further provide valuable insights into the development of effective TV-use agents.

[355] ICo3D: An Interactive Conversational 3D Virtual Human

Richard Shaw,Youngkyoon Jang,Athanasios Papaioannou,Arthur Moreau,Helisa Dhamo,Zhensong Zhang,Eduardo Pérez-Pellitero

Main category: cs.CV

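TL;DR: This paper proposes ICo3D, a method for generating interactive, conversational, photorealistic 3D virtual humans that combines multi-view capture with Gaussian splatting and integrates a large language model for spoken and written interaction.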

Details Motivation: To build highly realistic 3D virtual humans that can interact in real time for applications such as gaming, virtual assistants, and education. Method: Builds an animatable 3D face model and a dynamic 3D body model from multi-view images, both rendered with Gaussian splatting; improves SWinGS++ (body) and HeadGaS++ (face) and merges the face and body models; adds an LLM for conversation and uses the avatar's speech audio to drive facial animation for lip synchronization. Result: Achieves high-quality, artifact-free reconstruction of 3D virtual humans with real-time spoken and written interaction, and presents a demo system covering several use cases. Conclusion: ICo3D offers a complete virtual human solution with high realism and good interactivity that is applicable to many practical domains. Abstract: This work presents Interactive Conversational 3D Virtual Human (ICo3D), a method for generating an interactive, conversational, and photorealistic 3D human avatar. Based on multi-view captures of a subject, we create an animatable 3D face model and a dynamic 3D body model, both rendered by splatting Gaussian primitives. Once merged together, they represent a lifelike virtual human avatar suitable for real-time user interactions. We equip our avatar with an LLM for conversational ability. During conversation, the audio speech of the avatar is used as a driving signal to animate the face model, enabling precise synchronization. We describe improvements to our dynamic Gaussian models that enhance photorealism: SWinGS++ for body reconstruction and HeadGaS++ for face reconstruction, and also provide a solution to merge the separate face and body models without artifacts. We also present a demo of the complete system, showcasing several use cases of real-time conversation with the 3D avatar. Our approach offers a fully integrated virtual avatar experience, supporting both oral and written form interactions in immersive environments. ICo3D is applicable to a wide range of fields, including gaming, virtual assistance, and personalized education, among others. Project page: https://ico3d.github.io/

[356] From 100,000+ images to winning the first brain MRI foundation model challenges: Sharing lessons and models

Pedro M. Gordaliza,Jaume Banus,Benoît Gérin,Maxence Wynen,Nataliia Molchanova,Jonas Richiardi,Meritxell Bach Cuadra

Main category: cs.CV

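TL;DR: This paper presents a medical image analysis approach based on a U-Net CNN architecture that ranked first in the SSL3D and FOMO25 challenges at MICCAI 2025 while being smaller and faster than transformer-based models.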

Details Motivation: Medical image analysis poses unique challenges, especially for 3D brain MRI, and foundation models are needed to improve performance and efficiency. Method: Uses a U-Net CNN architecture optimized with anatomical priors and neuroimaging domain knowledge. Result: Ranked first in tracks of both the SSL3D and FOMO25 challenges; the models trained 1-2 orders of magnitude faster than transformer-based approaches and were 10 times smaller. Conclusion: A CNN architecture informed by domain knowledge outperforms large transformer models for 3D brain MRI analysis and is more efficient and practical. Abstract: Developing Foundation Models for medical image analysis is essential to overcome the unique challenges of radiological tasks. The first challenges of this kind for 3D brain MRI, SSL3D and FOMO25, were held at MICCAI 2025. Our solution ranked first in tracks of both contests. It relies on a U-Net CNN architecture combined with strategies leveraging anatomical priors and neuroimaging domain knowledge. Notably, our models trained 1-2 orders of magnitude faster and were 10 times smaller than competing transformer-based approaches. Models are available here: https://github.com/jbanusco/BrainFM4Challenges.

[357] GTPred: Benchmarking MLLMs for Interpretable Geo-localization and Time-of-capture Prediction

Jinnao Li,Zijian Chen,Tingzhu Chen,Changbo Wang

Main category: cs.CV

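TL;DR: This paper introduces GTPred, a new benchmark for geo-temporal prediction comprising 370 globally distributed images spanning more than 120 years. It evaluates multimodal large language models on joint year and location reasoning and reveals their limitations in world knowledge and spatio-temporal reasoning.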

Details Motivation: Existing geo-localization benchmarks mostly ignore the temporal information in images, even though temporal cues can constrain the location more precisely, so a benchmark that combines spatial and temporal information is needed. Method: A new benchmark, GTPred, is built from 370 globally distributed images with temporal annotations; MLLMs are evaluated by matching the year and the hierarchical location sequence, and their intermediate reasoning chains are assessed against human-annotated reasoning processes. Result: Experiments on 8 proprietary and 7 open-source MLLMs show that, despite strong visual perception, the models remain limited in world knowledge and geo-temporal reasoning; incorporating temporal information significantly improves localization performance. Conclusion: GTPred provides an effective benchmark for evaluating the geo-temporal reasoning of multimodal large models, shows that incorporating temporal information is important for localization accuracy, and identifies the shortcomings of current models in knowledge coverage and reasoning. Abstract: Geo-localization aims to infer the geographic location where an image was captured using observable visual evidence. Traditional methods achieve impressive results through large-scale training on massive image corpora. With the emergence of multi-modal large language models (MLLMs), recent studies have explored their applications in geo-localization, benefiting from improved accuracy and interpretability. However, existing benchmarks largely ignore the temporal information inherent in images, which can further constrain the location. To bridge this gap, we introduce GTPred, a novel benchmark for geo-temporal prediction. GTPred comprises 370 globally distributed images spanning over 120 years. We evaluate MLLM predictions by jointly considering year and hierarchical location sequence matching, and further assess intermediate reasoning chains using meticulously annotated ground-truth reasoning processes. Experiments on 8 proprietary and 7 open-source MLLMs show that, despite strong visual perception, current models remain limited in world knowledge and geo-temporal reasoning. Results also demonstrate that incorporating temporal information significantly enhances location inference performance.
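
The abstract says predictions are scored by jointly considering the year and a hierarchical location sequence, but it does not give the formula. The sketch below is one plausible reading, not the benchmark's actual metric: the location is assumed to be a coarse-to-fine list (for example continent, country, city) scored by matched prefix length, and the year score is assumed to decay linearly with the error. The function names, the `tolerance` of 10 years, and the `weight` of 0.5 are all hypothetical.

```python
def location_prefix_score(pred, truth):
    """Fraction of the ground-truth hierarchy matched from the coarsest level down."""
    matched = 0
    for p, t in zip(pred, truth):
        if p.strip().lower() != t.strip().lower():
            break  # a mismatch at a coarse level invalidates all finer levels
        matched += 1
    return matched / len(truth) if truth else 0.0


def year_score(pred_year, true_year, tolerance=10):
    """Assumed scoring: 1.0 for an exact year, falling linearly to 0.0 at `tolerance` years."""
    return max(0.0, 1.0 - abs(pred_year - true_year) / tolerance)


def joint_score(pred_loc, true_loc, pred_year, true_year, weight=0.5):
    """Hypothetical weighted combination of the location and year scores."""
    loc = location_prefix_score(pred_loc, true_loc)
    yr = year_score(pred_year, true_year, tolerance=10)
    return weight * loc + (1.0 - weight) * yr


if __name__ == "__main__":
    pred = ["Europe", "France", "Lyon"]
    truth = ["Europe", "France", "Paris"]
    print(joint_score(pred, truth, 1995, 2000))
```

Prefix matching is used here because a correct city inside a wrong country should not earn credit; other hierarchical matching rules are equally possible.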

[358] Rethinking Skip Connections: Additive U-Net for Robust and Interpretable Denoising

Vikram R Lakkavalli

Main category: cs.CV

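TL;DR: Proposes Additive U-Net, which replaces the standard concatenative skip connections of U-Net with learnable, gated additive connections. This avoids channel inflation, gives explicit control over the encoder's information flow, and yields competitive performance with better interpretability on image denoising.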

Details Motivation: Concatenative skip connections in the standard U-Net double the channel dimension, let noise propagate uncontrolled, and make the information flow opaque, which limits model efficiency and interpretability. Method: Skip connections are changed from concatenation to additive fusion, with each skip pathway scaled by a learnable non-negative scalar that acts as a gate; no forced down-sampling or hierarchy is required. Result: On the Kodak-17 denoising benchmark, Additive U-Net achieves competitive PSNR/SSIM at σ = 15, 25, 50, is robust across kernel schedules and network depths, and naturally learns a feature progression from high to low frequency. Conclusion: Additive skip connections are a lightweight, interpretable alternative to concatenation that helps clarify multi-scale information transfer and simplifies U-Net architecture design. Abstract: Skip connections are central to U-Net architectures for image denoising, but standard concatenation doubles channel dimensionality and obscures information flow, allowing uncontrolled noise transfer. We propose the Additive U-Net, which replaces concatenative skips with gated additive connections. Each skip pathway is scaled by a learnable non-negative scalar, offering explicit and interpretable control over encoder contributions while avoiding channel inflation. Evaluations on the Kodak-17 denoising benchmark show that Additive U-Net achieves competitive PSNR/SSIM at noise levels σ = 15, 25, 50, with robustness across kernel schedules and depths. Notably, effective denoising is achieved even without explicit down/up-sampling or forced hierarchies, as the model naturally learns a progression from high-frequency to band-pass to low-frequency features. These results position additive skips as a lightweight and interpretable alternative to concatenation, enabling both efficient design and a clearer understanding of multi-scale information transfer in reconstruction networks.
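
To make the difference between the two skip types concrete, here is a minimal pure-Python sketch, not the authors' implementation. Feature maps are represented as lists of channels, each channel a flat list of values. The abstract only states that the gate is a learnable non-negative scalar; enforcing non-negativity with a softplus over a raw parameter is an assumption made here for illustration.

```python
import math


def softplus(x):
    """Smooth map from any real number to a strictly positive one."""
    return math.log1p(math.exp(x))


def concat_skip(decoder, encoder):
    """Standard U-Net skip: stack channels, so the channel count doubles."""
    return decoder + encoder


def additive_skip(decoder, encoder, raw_gate):
    """Gated additive skip: decoder + alpha * encoder, channel count unchanged.

    `raw_gate` stands in for the learnable parameter; alpha = softplus(raw_gate)
    keeps the gate non-negative (an assumed parameterization).
    """
    alpha = softplus(raw_gate)
    return [
        [d + alpha * e for d, e in zip(dec_ch, enc_ch)]
        for dec_ch, enc_ch in zip(decoder, encoder)
    ]


if __name__ == "__main__":
    dec = [[1.0, 2.0], [0.0, 0.0]]      # 2 channels, 2 values each
    enc = [[10.0, 20.0], [5.0, 5.0]]
    print(len(concat_skip(dec, enc)))          # 4 channels
    print(len(additive_skip(dec, enc, 0.0)))   # still 2 channels
```

Because each pathway has a single scalar, reading the learned gate values after training gives a direct indication of how much each encoder level contributes.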

[359] ObjectVisA-120: Object-based Visual Attention Prediction in Interactive Street-crossing Environments

Igor Vozniak,Philipp Mueller,Nils Lipp,Janis Sprenger,Konstantin Poddubnyy,Davit Hovhannisyan,Christian Mueller,Andreas Bulling,Philipp Slusallek

Main category: cs.CV

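TL;DR: Proposes ObjectVisA-120, a new dataset for evaluating object-based visual attention that records the street-crossing navigation behavior of 120 participants in virtual reality, with accurate gaze data, a state-space representation of objects, and rich scene annotations. Also proposes a new evaluation metric, oSIM, and SUMGraph, a Mamba U-Net-based model, which significantly improve object-level attention modeling.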

Details Motivation: Existing computational visual attention models rarely account for the object-based nature of human visual attention, and suitable datasets and evaluation metrics for object-level attention are lacking. Method: A virtual reality street-crossing dataset with 120 participants, ObjectVisA-120, is built, providing gaze data, panoptic segmentation, depth information, vehicle keypoints, and other rich annotations; oSIM is proposed as a new metric for object-level attention; the SUMGraph model is designed to explicitly encode critical objects (such as vehicles) in a graph structure to improve attention prediction. Result: Experiments show that models optimized for object-level attention perform better not only on oSIM but also on common metrics; SUMGraph outperforms several state-of-the-art visual attention prediction methods. Conclusion: Object-level attention matters for visual modeling, and the proposed ObjectVisA-120 dataset, oSIM metric, and SUMGraph model provide effective tools and directions for future research on object-level attention. Abstract: The object-based nature of human visual attention is well-known in cognitive science, but has only played a minor role in computational visual attention models so far. This is mainly due to a lack of suitable datasets and evaluation metrics for object-based attention. To address these limitations, we present ObjectVisA-120, a novel 120-participant dataset of spatial street-crossing navigation in virtual reality specifically geared to object-based attention evaluations. The uniqueness of the presented dataset lies in the ethical and safety-related challenges that make collecting comparable data in real-world environments highly difficult. ObjectVisA-120 not only features accurate gaze data and a complete state-space representation of objects in the virtual environment, but it also offers variable scenario complexities and rich annotations, including panoptic segmentation, depth information, and vehicle keypoints. We further propose object-based similarity (oSIM) as a novel metric to evaluate the performance of object-based visual attention models, a previously unexplored performance characteristic. Our evaluations show that explicitly optimising for object-based attention not only improves oSIM performance but also leads to an improved model performance on common metrics. In addition, we present SUMGraph, a Mamba U-Net-based model, which explicitly encodes critical scene objects (vehicles) in a graph representation, leading to further performance improvements over several state-of-the-art visual attention prediction methods. The dataset, code and models will be publicly released.

[360] Not all Blends are Equal: The BLEMORE Dataset of Blended Emotion Expressions with Relative Salience Annotations

Tim Lachmann,Alexandra Israelsson,Christina Tornberg,Teimuraz Saghinadze,Michal Balazia,Philipp Müller,Petri Laukka

Main category: cs.CV

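TL;DR: This paper introduces BLEMORE, a new dataset for multimodal (video, audio) blended emotion recognition that annotates emotion blends and their relative salience, filling a gap in existing research, and evaluates several state-of-the-art methods on emotion presence and salience prediction.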

Details Motivation: Most existing video-based emotion recognition methods target single emotions only and cannot handle the blended emotions that humans commonly experience, and the lack of datasets annotated with emotion salience has held the field back. Method: A multimodal dataset, BLEMORE, is built with over 3,000 clips covering 6 basic emotions and 10 blends, each blend in three salience configurations (50/50, 70/30, 30/70); several state-of-the-art classification models, both unimodal and multimodal, are evaluated on it. Result: On the validation set, multimodal methods outperform unimodal ones, with ImageBind + WavLM reaching 35% presence accuracy and HiCMAE 18% salience accuracy; on the test set, VideoMAEv2 + HuBERT reaches 33% presence accuracy and HiCMAE again reaches 18% salience accuracy. Conclusion: BLEMORE is an important resource for blended emotion recognition and helps emotion recognition systems cope with the complex emotional expressions found in real-world settings. Abstract: Humans often experience not just a single basic emotion at a time, but rather a blend of several emotions with varying salience. Despite the importance of such blended emotions, most video-based emotion recognition approaches are designed to recognize single emotions only. The few approaches that have attempted to recognize blended emotions typically cannot assess the relative salience of the emotions within a blend. This limitation largely stems from the lack of datasets containing a substantial number of blended emotion samples annotated with relative salience. To address this shortcoming, we introduce BLEMORE, a novel dataset for multimodal (video, audio) blended emotion recognition that includes information on the relative salience of each emotion within a blend. BLEMORE comprises over 3,000 clips from 58 actors, performing 6 basic emotions and 10 distinct blends, where each blend has 3 different salience configurations (50/50, 70/30, and 30/70). Using this dataset, we conduct extensive evaluations of state-of-the-art video classification approaches on two blended emotion prediction tasks: (1) predicting the presence of emotions in a given sample, and (2) predicting the relative salience of emotions in a blend. Our results show that unimodal classifiers achieve up to 29% presence accuracy and 13% salience accuracy on the validation set, while multimodal methods yield clear improvements, with ImageBind + WavLM reaching 35% presence accuracy and HiCMAE 18% salience accuracy.
On the held-out test set, the best models achieve 33% presence accuracy (VideoMAEv2 + HuBERT) and 18% salience accuracy (HiCMAE). In sum, the BLEMORE dataset provides a valuable resource to advancing research on emotion recognition systems that account for the complexity and significance of blended emotion expressions.

[361] ConvMambaNet: A Hybrid CNN-Mamba State Space Architecture for Accurate and Real-Time EEG Seizure Detection

Md. Nishan Khan,Kazi Shahriar Sanjid,Md. Tanzim Hossain,Asib Mostakim Fony,Istiak Ahmed,M. Monir Uddin

Main category: cs.CV

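TL;DR: This paper proposes ConvMambaNet, a hybrid deep learning model that combines a CNN with a Mamba-SSM block to improve spatio-temporal feature extraction from EEG signals. It reaches 99% accuracy on the CHB-MIT dataset, is robust to class imbalance, and shows potential for real-time automated seizure monitoring.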

Details Motivation: Because of the temporal complexity of electroencephalography (EEG) signals, existing automated seizure detection methods still struggle to capture long-range dependencies and to cope with class imbalance, so more effective models are needed to improve detection accuracy and practicality. Method: ConvMambaNet embeds a Mamba structured state space model (Mamba-SSM) in a convolutional neural network (CNN) framework, using the CNN to extract spatial features and the Mamba-SSM to capture long-range temporal dynamics; the model is trained and evaluated on the CHB-MIT scalp EEG dataset. Result: ConvMambaNet reaches 99% classification accuracy on the CHB-MIT dataset, remains robust under severe class imbalance, and outperforms or matches existing mainstream models. Conclusion: ConvMambaNet effectively fuses spatial features with long-range temporal dependencies, improves the accuracy of automated seizure detection, and has potential for use in real-time clinical monitoring systems, offering a new approach to EEG-based intelligent diagnosis. Abstract: Epilepsy is a chronic neurological disorder marked by recurrent seizures that can severely impact quality of life. Electroencephalography (EEG) remains the primary tool for monitoring neural activity and detecting seizures, yet automated analysis remains challenging due to the temporal complexity of EEG signals. This study introduces ConvMambaNet, a hybrid deep learning model that integrates Convolutional Neural Networks (CNNs) with the Mamba Structured State Space Model (SSM) to enhance temporal feature extraction. By embedding the Mamba-SSM block within a CNN framework, the model effectively captures both spatial and long-range temporal dynamics. Evaluated on the CHB-MIT Scalp EEG dataset, ConvMambaNet achieved a 99% accuracy and demonstrated robust performance under severe class imbalance. These results underscore the model's potential for precise and efficient seizure detection, offering a viable path toward real-time, automated epilepsy monitoring in clinical environments.

[362] A Semantic Decoupling-Based Two-Stage Rainy-Day Attack for Revealing Weather Robustness Deficiencies in Vision-Language Models

Chengyin Hu,Xiang Chen,Zhe Jia,Weiwen Shi,Fengyu Zhang,Jiujiang Guo,Yiwei Wei

Main category: cs.CV

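TL;DR: This paper proposes an adversarial attack framework against vision-language models (VLMs) in rainy scenes, using physically interpretable, non-pixel-level weather perturbations to analyze the degradation of their semantic alignment.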

Details Motivation: To study the insufficient robustness of existing VLMs under real weather conditions (especially rain) and the stability of cross-modal semantic alignment under structured perturbations. Method: A two-stage parameterized perturbation model: Stage 1 weakens the semantic decision boundaries through global modulation; Stage 2 explicitly models multi-scale raindrop appearance and rainfall-induced illumination changes, and optimizes the non-differentiable weather space to induce stable semantic shifts. Result: Experiments show that even physically plausible, highly constrained weather perturbations cause substantial semantic misalignment in mainstream VLMs, with illumination modeling and multi-scale raindrop structures as the key drivers. Conclusion: The framework exposes the safety and reliability risks that VLMs face under real weather perturbations and underlines the need to make models more robust to structured perturbations in natural scenes. Abstract: Vision-Language Models (VLMs) are trained on image-text pairs collected under canonical visual conditions and achieve strong performance on multimodal tasks. However, their robustness to real-world weather conditions, and the stability of cross-modal semantic alignment under such structured perturbations, remain insufficiently studied. In this paper, we focus on rainy scenarios and introduce the first adversarial framework that exploits realistic weather to attack VLMs, using a two-stage, parameterized perturbation model based on semantic decoupling to analyze rain-induced shifts in decision-making. In Stage 1, we model the global effects of rainfall by applying a low-dimensional global modulation to condition the embedding space and gradually weaken the original semantic decision boundaries. In Stage 2, we introduce structured rain variations by explicitly modeling multi-scale raindrop appearance and rainfall-induced illumination changes, and optimize the resulting non-differentiable weather space to induce stable semantic shifts. Operating in a non-pixel parameter space, our framework generates perturbations that are both physically grounded and interpretable. Experiments across multiple tasks show that even physically plausible, highly constrained weather perturbations can induce substantial semantic misalignment in mainstream VLMs, posing potential safety and reliability risks in real-world deployment. Ablations further confirm that illumination modeling and multi-scale raindrop structures are key drivers of these semantic shifts.

[363] Deep Learning for Semantic Segmentation of 3D Ultrasound Data

Chenyu Liu,Marco Cecotti,Harikrishnan Vijayakumar,Patrick Robinson,James Barson,Mihai Caleap

Main category: cs.CV

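TL;DR: This paper proposes a 3D semantic segmentation framework built on Calyo Pulse, a novel solid-state 3D ultrasound sensor, using a 3D U-Net architecture. It shows robust segmentation in harsh and cluttered environments and supports 3D ultrasound as a reliable complementary sensing modality for automated driving.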

Details Motivation: Developing low-cost, highly reliable perception systems is a central challenge for automated driving; existing LiDAR and camera solutions trade off cost, robustness, and performance under adverse conditions. Method: A 3D semantic segmentation framework based on Calyo Pulse (a modular, solid-state 3D ultrasound sensor), using a 3D U-Net architecture for voxel-level segmentation of the spatial ultrasound data. Result: Experiments show robust 3D semantic segmentation on Calyo Pulse data; further gains are expected from larger datasets, refined ground-truth annotations, and weighted loss functions. Conclusion: 3D ultrasound sensing is a promising complementary perception modality that can make automated driving systems more reliable in harsh and cluttered environments. Abstract: Developing cost-efficient and reliable perception systems remains a central challenge for automated vehicles. LiDAR and camera-based systems dominate, yet they present trade-offs in cost, robustness and performance under adverse conditions. This work introduces a novel framework for learning-based 3D semantic segmentation using Calyo Pulse, a modular, solid-state 3D ultrasound sensor system for use in harsh and cluttered environments. A 3D U-Net architecture is introduced and trained on the spatial ultrasound data for volumetric segmentation. Results demonstrate robust segmentation performance from Calyo Pulse sensors, with potential for further improvement through larger datasets, refined ground truth, and weighted loss functions. Importantly, this study highlights 3D ultrasound sensing as a promising complementary modality for reliable autonomy.

[364] Enginuity: Building an Open Multi-Domain Dataset of Complex Engineering Diagrams

Ethan Seefried,Prahitha Movva,Naga Harshita Marupaka,Tilak Kasturi,Tirthankar Ghosal

Main category: cs.CV

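TL;DR: Enginuity is the first open, large-scale, multi-domain engineering diagram dataset with comprehensive structural annotations, intended to advance automated diagram parsing and the use of AI in scientific discovery.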

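Details Motivation: Current AI systems struggle to understand the visual-structural information in engineering drawings, which keeps them from participating fully in scientific workflows, especially in tasks that require diagram interpretation and technical drawing analysis. Method: Build a multi-domain engineering diagram dataset that captures hierarchical component relationships, connections, and semantic elements, with fine-grained structural annotations, to support diagram parsing and cross-modal retrieval by multimodal large language models. Result: The dataset would support downstream tasks such as structured diagram parsing, cross-modal information retrieval, and AI-assisted engineering simulation, improving AI's ability to understand and manipulate engineering diagrams. Conclusion: Enginuity could become key infrastructure for AI in scientific discovery, removing a core barrier caused by AI's limited diagram understanding. Abstract: We propose Enginuity - the first open, large-scale, multi-domain engineering diagram dataset with comprehensive structural annotations designed for automated diagram parsing. By capturing hierarchical component relationships, connections, and semantic elements across diverse engineering domains, our proposed dataset would enable multimodal large language models to address critical downstream tasks including structured diagram parsing, cross-modal information retrieval, and AI-assisted engineering simulation. Enginuity would be transformative for AI for Scientific Discovery by enabling artificial intelligence systems to comprehend and manipulate the visual-structural knowledge embedded in engineering diagrams, breaking down a fundamental barrier that currently prevents AI from fully participating in scientific workflows where diagram interpretation, technical drawing analysis, and visual reasoning are essential for hypothesis generation, experimental design, and discovery.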

[365] CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning

Wenxin Ma,Chenlong Wang,Ruisheng Yuan,Hao Chen,Nanru Dai,S. Kevin Zhou,Yijun Yang,Alan Yuille,Jieneng Chen

Main category: cs.CV

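TL;DR: This paper proposes CausalSpatial, a diagnostic benchmark for evaluating how well multimodal large models perform causal spatial reasoning in 3D scenes, and finds that existing models rely heavily on textual reasoning detached from visual evidence, which produces spatial hallucinations. It then proposes the COW framework, which externalizes the simulation process by generating videos of hypothetical dynamics so that model reasoning is grounded in physical reality rather than linguistic priors.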

Details Motivation: Current multimodal large language models (MLLMs) are limited to static spatial perception and struggle to answer "what-if" questions in 3D scenes; they lack the causal spatial reasoning that humans have. Method: The CausalSpatial benchmark is built with four tasks (Collision, Compatibility, Occlusion, and Trajectory) to evaluate how well models predict the consequences of object motion; the Causal Object World (COW) model is proposed, which generates videos of hypothetical dynamics to give models explicit visual cues of causality. Result: Humans reach 84% accuracy on the benchmark while GPT-5 reaches only 54%; analysis shows that existing MLLMs produce spatial hallucinations because they over-rely on textual chain-of-thought reasoning; COW substantially improves causal reasoning through visual simulation. Conclusion: For genuine causal spatial reasoning, models need to rely less on linguistic priors and instead support their reasoning with visualized physical simulation, for which COW offers an effective path. Abstract: Humans can look at a static scene and instantly predict what happens next -- will moving this object cause a collision? We call this ability Causal Spatial Reasoning. However, current multimodal large language models (MLLMs) cannot do this, as they remain largely restricted to static spatial perception, struggling to answer "what-if" questions in a 3D scene. We introduce CausalSpatial, a diagnostic benchmark evaluating whether models can anticipate consequences of object motions across four tasks: Collision, Compatibility, Occlusion, and Trajectory. Results expose a severe gap: humans score 84% while GPT-5 achieves only 54%. Why do MLLMs fail? Our analysis uncovers a fundamental deficiency: models over-rely on textual chain-of-thought reasoning that drifts from visual evidence, producing fluent but spatially ungrounded hallucinations. To address this, we propose the Causal Object World model (COW), a framework that externalizes the simulation process by generating videos of hypothetical dynamics. With explicit visual cues of causality, COW enables models to ground their reasoning in physical reality rather than linguistic priors. We make the dataset and code publicly available here: https://github.com/CausalSpatial/CausalSpatial

[366] MultiST: A Cross-Attention-Based Multimodal Model for Spatial Transcriptomic

Wei Wang,Quoc-Toan Ly,Chong Yu,Jun Bai

Main category: cs.CV

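TL;DR: This paper proposes MultiST, a unified multimodal framework that fuses spatial topology, gene expression, and tissue morphology through cross-attention to better resolve spatial domain boundaries.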

Details Motivation: Existing spatial transcriptomics methods integrate tissue morphology with molecular profiles poorly, which leaves spatial domain boundaries ambiguous. Method: MultiST uses graph-based gene encoders with adversarial alignment to learn robust spatial representations, combines them with color-normalized histological features, and fuses the modalities through a cross-attention mechanism. Result: Validated on 13 spatial transcriptomics datasets covering human brain cortex and breast cancer tissue, MultiST produces clearer and more coherent spatial domain boundaries, more stable pseudotime trajectories, and more biologically interpretable cell-cell interaction patterns. Conclusion: MultiST improves both the accuracy of spatial domain identification and biological interpretability, providing a new tool for studying tissue microenvironments. Abstract: Spatial transcriptomics (ST) enables transcriptome-wide profiling while preserving the spatial context of tissues, offering unprecedented opportunities to study tissue organization and cell-cell interactions in situ. Despite recent advances, existing methods often lack effective integration of histological morphology with molecular profiles, relying on shallow fusion strategies or omitting tissue images altogether, which limits their ability to resolve ambiguous spatial domain boundaries. To address this challenge, we propose MultiST, a unified multimodal framework that jointly models spatial topology, gene expression, and tissue morphology through cross-attention-based fusion. MultiST employs graph-based gene encoders with adversarial alignment to learn robust spatial representations, while integrating color-normalized histological features to capture molecular-morphological dependencies and refine domain boundaries. We evaluated the proposed method on 13 diverse ST datasets spanning two organs, including human brain cortex and breast cancer tissue. MultiST yields spatial domains with clearer and more coherent boundaries than existing methods, leading to more stable pseudotime trajectories and more biologically interpretable cell-cell interaction patterns. The MultiST framework and source code are available at https://github.com/LabJunBMI/MultiST.git.
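
Cross-attention fusion, as named in the abstract, can be illustrated with plain scaled dot-product attention. The sketch below is a generic version and not MultiST's architecture: it assumes gene-expression embeddings act as queries and histology features as keys and values, and it omits the learned projection matrices a real model would use. All names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: one vector per spot from modality A (assumed: gene embeddings)
    keys, values: one vector per patch from modality B (assumed: histology)
    Returns, for each query, a weighted average of the value vectors.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused


if __name__ == "__main__":
    gene = [[1.0, 0.0]]
    hist_keys = [[1.0, 0.0], [0.0, 1.0]]
    hist_vals = [[2.0, 4.0], [4.0, 8.0]]
    print(cross_attention(gene, hist_keys, hist_vals))
```

Each output row mixes histology features according to how well they match the gene embedding, which is the sense in which one modality conditions the other.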

[367] Real-Time 4D Radar Perception for Robust Human Detection in Harsh Enclosed Environments

Zhenan Liu,Yaodong Cui,Amir Khajepour,George Shaker

Main category: cs.CV

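TL;DR: This paper proposes a new method for generating controlled, multi-level dust concentrations in a highly cluttered, enclosed environment and releases a 4D dataset with mmWave radar, camera, and LiDAR for studying how dust and reflective surfaces affect sensing. A threshold-based noise filtering framework and a classification pipeline based on radar semantics deliver robust pedestrian detection in dusty conditions.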

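Details Motivation: In harsh enclosed environments such as underground mines and tunnels, dust and strong electromagnetic interference severely degrade sensor performance, and the lack of repeatable experimental conditions and real datasets has limited research on perception systems. Method: A method for generating controlled dust concentrations is developed and a 4D mmWave radar fusion dataset is built; a threshold filtering framework based on RCS, velocity, azimuth, and elevation suppresses noise and multipath effects; a cluster-level, rule-based classifier using velocity, RCS, and volumetric spread achieves real-time pedestrian detection without extensive training. Result: Experiments show that the method markedly improves point cloud quality in dusty environments, suppresses ghost targets and multipath reflections, and achieves stable, reliable pedestrian detection with clearly improved robustness and accuracy. Conclusion: The integrated approach is an effective solution for mmWave radar perception in extremely dusty environments, improves adaptability and reliability in complex enclosed spaces, and has potential applications in mine safety and search-and-rescue robotics. Abstract: This paper introduces a novel methodology for generating controlled, multi-level dust concentrations in a highly cluttered environment representative of harsh, enclosed environments, such as underground mines, road tunnels, or collapsed buildings, enabling repeatable mm-wave propagation studies under severe electromagnetic constraints. We also present a new 4D mmWave radar dataset, augmented by camera and LiDAR, illustrating how dust particles and reflective surfaces jointly impact the sensing functionality. To address these challenges, we develop a threshold-based noise filtering framework leveraging key radar parameters (RCS, velocity, azimuth, elevation) to suppress ghost targets and mitigate strong multipath reflections at the raw data level. Building on the filtered point clouds, a cluster-level, rule-based classification pipeline exploits radar semantics (velocity, RCS, and volumetric spread) to achieve reliable, real-time pedestrian detection without extensive domain-specific training. Experimental results confirm that this integrated approach significantly enhances clutter mitigation, detection robustness, and overall system resilience in dust-laden mining environments.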
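
The threshold-based filtering step can be sketched as a per-point range check over the four radar parameters the abstract names (RCS, velocity, azimuth, elevation). This is a minimal illustration, not the paper's filter: the `RadarPoint` type and every numeric limit below are hypothetical placeholders, since the abstract does not report the thresholds used.

```python
from dataclasses import dataclass


@dataclass
class RadarPoint:
    rcs_dbsm: float        # radar cross-section
    velocity_mps: float    # radial (Doppler) velocity
    azimuth_deg: float
    elevation_deg: float


# Hypothetical limits; real values would be tuned to the sensor and the scene.
LIMITS = {
    "rcs_dbsm": (-20.0, 20.0),
    "velocity_mps": (-5.0, 5.0),
    "azimuth_deg": (-60.0, 60.0),
    "elevation_deg": (-15.0, 15.0),
}


def keep(point, limits=LIMITS):
    """A point survives only if every parameter lies inside its allowed range."""
    return all(lo <= getattr(point, name) <= hi for name, (lo, hi) in limits.items())


def filter_points(points, limits=LIMITS):
    """Drop points that look like ghost targets or multipath returns under the limits."""
    return [p for p in points if keep(p, limits)]


if __name__ == "__main__":
    cloud = [
        RadarPoint(-3.0, 1.2, 10.0, 2.0),    # plausible walking person
        RadarPoint(35.0, 0.0, 5.0, 1.0),     # very strong reflector, rejected on RCS
        RadarPoint(-5.0, 0.8, 80.0, 0.0),    # outside the azimuth window
    ]
    print(len(filter_points(cloud)))
```

Keeping the limits in one table makes the filter easy to inspect and retune, which fits the rule-based, training-free design the abstract describes.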

[368] Spherical Geometry Diffusion: Generating High-quality 3D Face Geometry via Sphere-anchored Representations

Junyi Zhang,Yiming Wang,Yunhong Lu,Qichao Wang,Wenzhe Qian,Xiaoyin Xu,David Gu,Min Zhang

Main category: cs.CV

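TL;DR: Proposes a new text-to-3D face generation method based on a spherical geometry representation: face geometry is constrained to a spherical manifold, mapped to a 2D image, and combined with a diffusion model to jointly generate high-quality, controllable geometry and texture.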

Details Motivation: Existing text-to-3D face generation methods struggle to model the complex, unordered distribution of 3D vertices, which leads to poor geometric quality and connectivity. Method: The Spherical Geometry Representation anchors face geometry to uniform spherical coordinates, guaranteeing a regular point distribution and robust mesh reconstruction; the sphere is then unwrapped into a 2D map, on which a conditional generation framework based on 2D diffusion models (Spherical Geometry Diffusion) jointly models geometry and texture, with geometry guiding texture synthesis. Result: The method performs strongly on text-to-3D generation, face reconstruction, and text-driven 3D editing, substantially outperforming existing methods, particularly in geometric quality, text alignment, and inference efficiency. Conclusion: By introducing a spherical manifold prior and a 2D diffusion architecture, the method addresses the geometric modeling difficulty in 3D face generation and achieves high-quality, high-fidelity text-to-3D face synthesis. Abstract: A fundamental challenge in text-to-3D face generation is achieving high-quality geometry. The core difficulty lies in the arbitrary and intricate distribution of vertices in 3D space, making it challenging for existing models to establish clean connectivity and resulting in suboptimal geometry. To address this, our core insight is to simplify the underlying geometric structure by constraining the distribution onto a simple and regular manifold, a topological sphere. Building on this, we first propose the Spherical Geometry Representation, a novel face representation that anchors geometric signals to uniform spherical coordinates. This guarantees a regular point distribution, from which the mesh connectivity can be robustly reconstructed. Critically, this canonical sphere can be seamlessly unwrapped into a 2D map, creating a perfect synergy with powerful 2D generative models. We then introduce Spherical Geometry Diffusion, a conditional diffusion framework built upon this 2D map. It enables diverse and controllable generation by jointly modeling geometry and texture, where the geometry explicitly conditions the texture synthesis process. Our method's effectiveness is demonstrated through its success in a wide range of tasks: text-to-3D generation, face reconstruction, and text-based 3D editing. Extensive experiments show that our approach substantially outperforms existing methods in geometric quality, textual fidelity, and inference efficiency.
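
The idea of unwrapping a sphere-anchored representation into a 2D map can be illustrated with an ordinary longitude/colatitude (equirectangular) parameterization. This is only a sketch under that assumption; the abstract does not say which unwrapping the authors use, and storing the radius as the geometric signal is likewise a hypothetical choice made here.

```python
import math


def sphere_to_uv(x, y, z):
    """Map a 3D point to normalized 2D map coordinates plus its radius.

    u in [0, 1] comes from the longitude, v in [0, 1] from the colatitude.
    The radius r is what a 2D "geometry map" could store at pixel (u, v).
    """
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.atan2(y, x)      # longitude in (-pi, pi]
    phi = math.acos(z / r)        # colatitude in [0, pi]
    u = (theta + math.pi) / (2.0 * math.pi)
    v = phi / math.pi
    return u, v, r


def uv_to_point(u, v, r):
    """Inverse mapping: recover the 3D point from map coordinates and radius."""
    theta = u * 2.0 * math.pi - math.pi
    phi = v * math.pi
    return (
        r * math.sin(phi) * math.cos(theta),
        r * math.sin(phi) * math.sin(theta),
        r * math.cos(phi),
    )


if __name__ == "__main__":
    u, v, r = sphere_to_uv(1.0, 0.0, 0.0)
    print(u, v, r)
    print(uv_to_point(u, v, r))
```

Because every pixel of the map corresponds to a fixed direction on the sphere, neighboring pixels are neighboring surface points, which is what lets mesh connectivity be defined once on the grid.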

[369] A Lightweight Model-Driven 4D Radar Framework for Pervasive Human Detection in Harsh Conditions

Zhenan Liu,Amir Khajepour,George Shaker

Main category: cs.CV

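TL;DR: Proposes a fully model-driven 4D mmWave radar perception framework for robust human detection in harsh industrial and underground environments with dust, smoke, and similar conditions.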

Details Motivation: Optical and LiDAR sensing degrade severely in environments with dust, smoke, and metallic structures and cannot provide reliable perception, so a sensing method that is insensitive to such conditions is needed. Method: mmWave radar is used as the sole sensor, combined with domain-aware multi-threshold filtering, ego-motion-compensated temporal accumulation, KD-tree Euclidean clustering with Doppler-aware refinement, and a rule-based 3D classifier for efficient real-time perception. Result: In tests in an enclosed trailer and in real underground mines, the method detects pedestrians reliably under very low visibility, where cameras and LiDAR fail. Conclusion: The proposed model-driven approach provides robust, interpretable, and computationally efficient perception for safety-critical applications in harsh industrial and underground environments. Abstract: Pervasive sensing in industrial and underground environments is severely constrained by airborne dust, smoke, confined geometry, and metallic structures, which rapidly degrade optical and LiDAR-based perception. Elevation-resolved 4D mmWave radar offers strong resilience to such conditions, yet there remains a limited understanding of how to process its sparse and anisotropic point clouds for reliable human detection in enclosed, visibility-degraded spaces. This paper presents a fully model-driven 4D radar perception framework designed for real-time execution on embedded edge hardware. The system uses radar as its sole perception modality and integrates domain-aware multi-threshold filtering, ego-motion-compensated temporal accumulation, KD-tree Euclidean clustering with Doppler-aware refinement, and a rule-based 3D classifier. The framework is evaluated in a dust-filled enclosed trailer and in real underground mining tunnels, and in the tested scenarios the radar-based detector maintains stable pedestrian identification as camera and LiDAR modalities fail under severe visibility degradation. These results suggest that the proposed model-driven approach provides robust, interpretable, and computationally efficient perception for safety-critical applications in harsh industrial and subterranean environments.
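
The clustering stage can be sketched as Euclidean region growing with an optional Doppler consistency check. This is a simplified illustration, not the paper's pipeline: a brute-force neighbor search stands in for the KD-tree (same result, slower on large clouds), and treating "Doppler-aware refinement" as a maximum velocity difference between linked points is an assumption. The `eps` and `max_dv` values are hypothetical.

```python
import math
from collections import deque


def _linked(a, b, eps, max_dv):
    """Two points link if they are close in space and, optionally, similar in Doppler."""
    close = math.dist(a[:3], b[:3]) <= eps
    similar = max_dv is None or abs(a[3] - b[3]) <= max_dv
    return close and similar


def cluster(points, eps=0.5, max_dv=None):
    """Group (x, y, z, doppler) points by breadth-first region growing.

    Returns a list of clusters, each a sorted list of point indices.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        queue = deque([seed])
        members = [seed]
        while queue:
            i = queue.popleft()
            neighbours = [j for j in unvisited if _linked(points[i], points[j], eps, max_dv)]
            for j in neighbours:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        clusters.append(sorted(members))
    return clusters


if __name__ == "__main__":
    cloud = [
        (0.0, 0.0, 0.0, 1.0),
        (0.3, 0.0, 0.0, 1.1),
        (0.6, 0.0, 0.0, -2.0),   # spatially adjacent but moving the other way
        (5.0, 5.0, 0.0, 0.0),
    ]
    print(cluster(cloud, eps=0.5))
    print(cluster(cloud, eps=0.5, max_dv=0.5))
```

With the Doppler check enabled, the third point is split from its spatial neighbors, which is the kind of separation that helps distinguish a moving person from nearby static clutter.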

[370] Practical Insights into Semi-Supervised Object Detection Approaches

Chaoxin Wang,Bharaneeshwar Balasubramaniyam,Anurag Sangem,Nicolais Guevara,Doina Caragea

Main category: cs.CV

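TL;DR: This paper gives a comprehensive comparison of three state-of-the-art semi-supervised object detection methods (MixPL, Semi-DETR, and Consistent-Teacher), focusing on how performance varies with the number of labeled samples, with experiments on MS-COCO, Pascal VOC, and a custom Beetle dataset that reveal the trade-offs between accuracy, model size, and latency.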

Details Motivation: To improve object detection in data-scarce settings and examine how effective semi-supervised methods are when only a few labeled images are available. Method: Three state-of-the-art semi-supervised object detection methods (MixPL, Semi-DETR, Consistent-Teacher) are compared experimentally under different amounts of labeled data. Result: The experiments reveal the trade-offs between accuracy, model size, and inference latency across the methods and show how their performance differs in low-data settings. Conclusion: The study offers practical guidance for choosing a semi-supervised object detection method for low-data scenarios and stresses that performance and efficiency must be balanced against task requirements. Abstract: Learning in data-scarce settings has recently gained significant attention in the research community. Semi-supervised object detection (SSOD) aims to improve detection performance by leveraging a large number of unlabeled images alongside a limited number of labeled images (a.k.a. few-shot learning). In this paper, we present a comprehensive comparison of three state-of-the-art SSOD approaches, including MixPL, Semi-DETR and Consistent-Teacher, with the goal of understanding how performance varies with the number of labeled images. We conduct experiments using the MS-COCO and Pascal VOC datasets, two popular object detection benchmarks which allow for standardized evaluation. In addition, we evaluate the SSOD approaches on a custom Beetle dataset which enables us to gain insights into their performance on specialized datasets with a smaller number of object categories. Our findings highlight the trade-offs between accuracy, model size, and latency, providing insights into which methods are best suited for low-data regimes.

[371] Organ-Aware Attention Improves CT Triage and Classification

Lavsen Dahal,Yubraj Bhandari,Geoffrey D. Rubin,Joseph Y. Lo

Main category: cs.CV

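TL;DR: Proposes ORACLE-CT, a model for improving classification of chest and abdomen CT that combines organ-aware attention with scalar fusion and achieves state-of-the-art supervised performance on multiple datasets.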

Details Motivation: To address the poor CT classification performance of existing vision-language models in the face of 3D anatomy, protocol shifts, and noisy report supervision, and to meet the clinical need for rapid triage of high-volume medical imaging. Method: Starting from a Global Average Pooling baseline, ORACLE-CT adds Organ-Masked Attention (which provides spatial evidence) and Organ-Scalar Fusion (which combines normalized volume and mean HU values), and works with different encoders. Result: AUROC reaches 0.86 for chest classification on CT-RATE and 0.85 on the MERLIN abdomen dataset, both better than existing VLM methods, setting a new supervised performance benchmark. Conclusion: By introducing an organ-aware mechanism, ORACLE-CT achieves state-of-the-art classification of chest and abdomen CT under a unified evaluation framework, with good scalability and potential for clinical use. Abstract: There is an urgent need for triage and classification of high-volume medical imaging modalities such as computed tomography (CT), which can improve patient care and mitigate radiologist burnout. Study-level CT triage requires calibrated predictions with localized evidence; however, off-the-shelf Vision Language Models (VLM) struggle with 3D anatomy, protocol shifts, and noisy report supervision. This study used the two largest publicly available chest CT datasets: CT-RATE and RADCHEST-CT (held-out external test set). Our carefully tuned supervised baseline (instantiated as a simple Global Average Pooling head) establishes a new supervised state of the art, surpassing all reported linear-probe VLMs. Building on this baseline, we present ORACLE-CT, an encoder-agnostic, organ-aware head that pairs Organ-Masked Attention (mask-restricted, per-organ pooling that yields spatial evidence) with Organ-Scalar Fusion (lightweight fusion of normalized volume and mean-HU cues). In the chest setting, the ORACLE-CT masked-attention model achieves AUROC 0.86 on CT-RATE; in the abdomen setting, on MERLIN (30 findings), our supervised baseline exceeds a reproduced zero-shot VLM baseline obtained by running publicly released weights through our pipeline, and adding masked attention plus scalar fusion further improves performance to AUROC 0.85. Together, these results deliver state-of-the-art supervised classification performance across both chest and abdomen CT under a unified evaluation protocol. The source code is available at https://github.com/lavsendahal/oracle-ct.
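
The two ingredients described in the abstract, mask-restricted per-organ pooling and fusion of volume and mean-HU scalars, can be sketched in a few lines. This is an illustrative reading, not the released ORACLE-CT code: voxels are flattened into lists, attention logits are taken as given rather than learned, and normalizing volume by a `reference_volume_ml` is an assumed choice. All function and parameter names are hypothetical.

```python
import math


def masked_attention_pool(features, scores, mask):
    """Softmax-weighted pooling restricted to voxels inside one organ mask.

    features: per-voxel feature vectors; scores: per-voxel attention logits;
    mask: 1 inside the organ, 0 outside.
    Returns the pooled vector and (voxel index, weight) pairs as spatial evidence.
    """
    idx = [i for i, m in enumerate(mask) if m]
    if not idx:
        return [0.0] * len(features[0]), []
    top = max(scores[i] for i in idx)
    exps = [math.exp(scores[i] - top) for i in idx]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * features[i][c] for w, i in zip(weights, idx))
        for c in range(len(features[0]))
    ]
    return pooled, list(zip(idx, weights))


def organ_scalars(hu_values, mask, voxel_volume_ml, reference_volume_ml):
    """Two scalar cues per organ: normalized volume and mean HU (assumed normalization)."""
    idx = [i for i, m in enumerate(mask) if m]
    volume_ml = len(idx) * voxel_volume_ml
    mean_hu = sum(hu_values[i] for i in idx) / len(idx) if idx else 0.0
    return [volume_ml / reference_volume_ml, mean_hu]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]
    pooled, evidence = masked_attention_pool(feats, [0.0, 0.0, 5.0], [1, 1, 0])
    fused = pooled + organ_scalars([40.0, 60.0, -1000.0], [1, 1, 0], 0.5, 2.0)
    print(fused)
```

The third voxel has the highest logit but lies outside the mask, so it contributes nothing; the returned weights show which in-organ voxels drove the prediction.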

[372] Leveraging Transformer Decoder for Automotive Radar Object Detection

Changxu Zhang,Zhaoze Wang,Tai Fei,Christopher Grimm,Yi Jin,Claas Tebruegge,Ernst Warsitz,Markus Gardill

Main category: cs.CV

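TL;DR: Proposes a Transformer-based architecture for 3D radar object detection that uses a novel decoder to directly regress 3D bounding boxes and class scores, and introduces a lightweight Pyramid Token Fusion module to fuse multi-scale features.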

Details Motivation: Conventional radar detection methods depend on dense proposal generation and complex post-processing (such as NMS), which limits performance and efficiency; a simpler, end-to-end approach is needed to model long-range spatial-temporal correlations. Method: A Transformer decoder serves as the prediction head, combining learnable object queries and positional encodings for set prediction; the Pyramid Token Fusion (PTF) module converts a feature pyramid into a unified, scale-aware token sequence. Result: On the RADDet dataset, the method significantly outperforms existing radar-only baselines without dense proposal generation or laborious non-maximum suppression tuning. Conclusion: The proposed method achieves efficient, end-to-end 3D radar object detection and confirms the effectiveness of Transformers for radar perception. Abstract: In this paper, we present a Transformer-based architecture for 3D radar object detection that uses a novel Transformer Decoder as the prediction head to directly regress 3D bounding boxes and class scores from radar feature representations. To bridge multi-scale radar features and the decoder, we propose Pyramid Token Fusion (PTF), a lightweight module that converts a feature pyramid into a unified, scale-aware token sequence. By formulating detection as a set prediction problem with learnable object queries and positional encodings, our design models long-range spatial-temporal correlations and cross-feature interactions. This approach eliminates dense proposal generation and heuristic post-processing such as extensive non-maximum suppression (NMS) tuning. We evaluate the proposed framework on the RADDet dataset, where it achieves significant improvements over state-of-the-art radar-only baselines.

[373] Deep Image Prior with L0 Gradient Regularizer for Image Smoothing

Nhat Thanh Tran,Kevin Bui,Jack Xin

Main category: cs.CV

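TL;DR: Proposes the DIP-ℓ0 framework, which incorporates an ℓ0 gradient regularizer and achieves high-quality image smoothing without any training data.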

Details Motivation: Constructing a suitable training dataset for image smoothing is challenging, and existing deep learning methods depend on carefully curated datasets, which limits their applicability. Method: The DIP-ℓ0 framework combines the deep image prior with an ℓ0 gradient regularizer and uses an ADMM algorithm to minimize the loss function with its nonconvex, nonsmooth ℓ0 "norm". Result: Experiments show that DIP-ℓ0 outperforms many existing algorithms in edge-preserving image smoothing and JPEG artifact removal. Conclusion: DIP-ℓ0 delivers high-performance image smoothing without training data and offers a new direction for unsupervised image processing. Abstract: Image smoothing is a fundamental image processing operation that preserves the underlying structure, such as strong edges and contours, and removes minor details and textures in an image. Many image smoothing algorithms rely on computing local window statistics or solving an optimization problem. Recent state-of-the-art methods leverage deep learning, but they require a carefully curated training dataset. Because constructing a proper training dataset for image smoothing is challenging, we propose DIP-$\ell_0$, a deep image prior framework that incorporates the $\ell_0$ gradient regularizer. This framework can perform high-quality image smoothing without any training data. To properly minimize the associated loss function that has the nonconvex, nonsmooth $\ell_0$ "norm", we develop an alternating direction method of multipliers algorithm that utilizes an off-the-shelf $\ell_0$ gradient minimization solver. Numerical experiments demonstrate that the proposed DIP-$\ell_0$ outperforms many image smoothing algorithms in edge-preserving image smoothing and JPEG artifact removal.
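
The quantity an ℓ0 gradient regularizer penalizes is simply the number of pixels with a non-zero gradient. The sketch below computes it for a small grayscale image using forward differences; it illustrates the regularizer only, not the DIP-ℓ0 network or its ADMM solver. The zero-difference boundary handling and the `tol` parameter are choices made here for the example.

```python
def l0_gradient(image, tol=0.0):
    """Count pixels whose forward-difference gradient magnitude exceeds `tol`.

    image: list of rows of floats. At the last column or row the missing
    difference is treated as zero (an assumed boundary convention).
    """
    height, width = len(image), len(image[0])
    count = 0
    for i in range(height):
        for j in range(width):
            dx = image[i][j + 1] - image[i][j] if j + 1 < width else 0.0
            dy = image[i + 1][j] - image[i][j] if i + 1 < height else 0.0
            if abs(dx) + abs(dy) > tol:
                count += 1
    return count


if __name__ == "__main__":
    flat = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    step = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
    textured = [[0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
    print(l0_gradient(flat), l0_gradient(step), l0_gradient(textured))
```

A single sharp edge costs little while fine texture costs a lot, which is why minimizing this count removes texture yet keeps strong edges. The count is also piecewise constant in the pixel values, hence the nonconvex, nonsmooth optimization the abstract mentions.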

[374] Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics

Peter A. Massih,Eric Cosatto

Main category: cs.CV

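TL;DR: This paper proposes a new architecture, QVLM, and a dataset, SQuID, for evaluating quantitative spatial reasoning, addressing the shortcomings of existing vision-language models in handling pixel-level information.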

Details Motivation: Existing vision-language models compress images into patch embeddings and thereby lose the precise pixel-level information needed for counting and measurement, which makes quantitative spatial reasoning difficult. Method: First, the authors build SQuID, a dataset of 2,000 satellite image question-answer pairs spanning three difficulty tiers; second, they propose QVLM, which generates executable code that calls a segmentation model to obtain pixel-level masks and operates directly on them, preserving spatial indexing precision. Result: Experiments show that QVLM with GPT-5 as the code generator reaches 42.0% accuracy on SQuID, well above the 28.1% of a conventional VLM. Conclusion: For quantitative spatial reasoning, an architecture that decouples language understanding from visual analysis preserves pixel-level information more effectively and improves performance. Abstract: Current Vision-Language Models (VLMs) fail at quantitative spatial reasoning because their architectures destroy pixel-level information required for counting and measurements. Vision encoders compress images through patch embeddings, reducing spatial indexing and losing the precise pixel-level tracking required for accurate counting. We present two contributions to address this fundamental limitation. First, we introduce SQuID (Satellite Quantitative Intelligence Dataset), a benchmark of 2,000 satellite image Question-Answer pairs with both numerical range and categorical answers, designed to evaluate quantitative spatial reasoning. The dataset spans three difficulty tiers with annotations automatically generated from human labels and their learned variability. Second, we propose QVLM (Quantitative Vision-Language Model), a code-generation architecture that maintains pixel precision by decoupling language understanding from visual analysis. Instead of encoding images into embeddings, QVLM generates executable code that first calls a segmentation model to obtain pixel-level masks, then operates directly on these masks, preserving spatial indexing throughout the reasoning process. Our experiments show that QVLM using GPT-5 as coder achieves 42.0% accuracy on SQuID compared to 28.1% for a VLM prompted with image-question pairs. Our work reveals that, for quantitative spatial reasoning, architectural decoupling enables better accuracy on quantitative tasks.

[375] Local-to-Global Logical Explanations for Deep Vision Models

Bhavan Vasu,Giuseppe Raffa,Prasad Tadepalli

Main category: cs.CV

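TL;DR: This paper proposes local and global explanation methods based on monotone disjunctive normal form (MDNF) for explaining the decisions of black-box deep neural networks in image classification. The explanations are logical formulas over human-recognizable primitive concepts and improve interpretability while maintaining high fidelity and coverage.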

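Details Motivation: Deep neural networks perform very well at image classification, but their decision process is opaque and hard to interpret, and this lack of interpretability limits their use in critical domains. Method: Proposes a logical-formula approach based on monotone disjunctive normal form (MDNF) that models local (single image) and global (image set) explanations as logical expressions whose satisfaction implies a high classification score, and designs an algorithm that produces a monotone explanation list for multiple classes. Result: The method is validated on challenging vision datasets; the generated explanations achieve high fidelity and high coverage while remaining concise and interpretable. Conclusion: The method explains the classification behavior of black-box models with logical formulas built on human-understandable primitive concepts, substantially improving transparency without sacrificing performance. Abstract: While deep neural networks are extremely effective at classifying images, they remain opaque and hard to interpret. We introduce local and global explanation methods for black-box models that generate explanations in terms of human-recognizable primitive concepts. Both the local explanations for a single image and the global explanations for a set of images are cast as logical formulas in monotone disjunctive-normal-form (MDNF), whose satisfaction guarantees that the model yields a high score on a given class. We also present an algorithm for explaining the classification of examples into multiple classes in the form of a monotone explanation list over primitive concepts. Despite their simplicity and interpretability, we show that the explanations maintain high fidelity and coverage with respect to the black-box models they seek to explain in challenging vision datasets.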
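
As a rough illustration of what an MDNF explanation looks like, the sketch below represents a formula as a disjunction of conjunctions over concept names with no negations and checks it against the concepts detected in an image. The concept names, the `zebra_formula` example, and the function `satisfies_mdnf` are hypothetical and are not taken from the paper.

```python
def satisfies_mdnf(formula, present_concepts):
    """Evaluate a monotone DNF formula.

    `formula` is a list of conjunctions; each conjunction is a set of concept
    names that must all be present. No negated concepts are allowed, which is
    what makes the formula monotone.
    """
    present = set(present_concepts)
    return any(set(term) <= present for term in formula)


# Hypothetical explanation: (stripes AND four_legs) OR (stripes AND mane)
zebra_formula = [{"stripes", "four_legs"}, {"stripes", "mane"}]

print(satisfies_mdnf(zebra_formula, {"stripes", "mane", "grass"}))
print(satisfies_mdnf(zebra_formula, {"four_legs", "mane"}))
```

Because no term contains a negation, detecting additional concepts in an image can never turn a satisfied formula into an unsatisfied one.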

[376] DermaBench: A Clinician-Annotated Benchmark Dataset for Dermatology Visual Question Answering and Reasoning

Abdurrahim Yilmaz,Ozan Erdem,Ece Gokyayla,Ayda Acar,Burc Bugra Dagtas,Dilara Ilhan Erdil,Gulsum Gencoglan,Burak Temelkuran

Main category: cs.CV

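TL;DR: DermaBench is a clinician-annotated dermatology visual question answering (VQA) benchmark built on the DDI dataset. It contains 656 clinical images from 570 patients covering Fitzpatrick skin types I-VI and provides about 14,474 VQA-style annotations, with the aim of evaluating the visual understanding, language grounding, and clinical reasoning of multimodal models in dermatology.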

Details Motivation: Existing evaluation datasets for medical vision-language models focus mainly on image-level classification and cannot fully assess fine-grained visual understanding, language grounding, and clinical reasoning in dermatology, so a more challenging VQA benchmark is needed. Method: Building on the DDI dataset and using a hierarchical annotation schema, expert dermatologists annotated each image for diagnosis, anatomic site, lesion morphology, distribution, surface features, color, and image quality, and wrote narrative descriptions and summaries, yielding the DermaBench VQA dataset with 22 main questions. Result: DermaBench contains 656 clinical images and about 14,474 VQA-style annotations covering a range of skin types and clinical features; it supports single-choice, multi-choice, and open-ended questions and is released as metadata on Harvard Dataverse. Conclusion: DermaBench provides a high-quality, clinically relevant VQA benchmark for evaluating the overall capabilities of medical vision-language models in dermatology and supports deeper model development and evaluation in the field. Abstract: Vision-language models (VLMs) are increasingly important in medical applications; however, their evaluation in dermatology remains limited by datasets that focus primarily on image-level classification tasks such as lesion recognition. While valuable for recognition, such datasets cannot assess the full visual understanding, language grounding, and clinical reasoning capabilities of multimodal models. Visual question answering (VQA) benchmarks are required to evaluate how models interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions. We introduce DermaBench, a clinician-annotated dermatology VQA benchmark built on the Diverse Dermatology Images (DDI) dataset. DermaBench comprises 656 clinical images from 570 unique patients spanning Fitzpatrick skin types I-VI. Using a hierarchical annotation schema with 22 main questions (single-choice, multi-choice, and open-ended), expert dermatologists annotated each image for diagnosis, anatomic site, lesion morphology, distribution, surface features, color, and image quality, together with open-ended narrative descriptions and summaries, yielding approximately 14,474 VQA-style annotations. DermaBench is released as a metadata-only dataset to respect upstream licensing and is publicly available at Harvard Dataverse.

[377] Using deep learning for predicting cleansing quality of colon capsule endoscopy images

Puneet Sharma,Kristian Dalsbø Hindberg,Benedicte Schelde-Olesen,Ulrik Deding,Esmaeil S. Nadimi,Jan-Matthias Braun

Main category: cs.CV

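TL;DR: This study uses a ResNet-18 model with structured pruning and K-fold cross-validation to classify the cleansing quality of colon capsule endoscopy images and applies Grad-CAM and related methods for explainability. It reaches 88% accuracy at 79% sparsity, indicating that pruning can improve efficiency while maintaining performance.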

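Details Motivation: Assessing the cleansing quality of colon capsule endoscopy (CCE) images is essential for clinical diagnosis, but manual labeling is time-consuming and subjective, so automated deep learning methods are needed to assist. In practice, models must also be efficient and explainable to earn clinical trust. Method: A ResNet-18 model performs multi-class classification (Poor, Fair, Good, Excellent) with stratified K-fold cross-validation for robust evaluation; iterative structured pruning improves efficiency; Grad-CAM, Grad-CAM++, Eigen-CAM, Ablation-CAM, and Random-CAM generate visual heatmaps, with the ROAD method used for consistent evaluation of explainability; finally, adaptive temperature scaling calibrates predictions on an external dataset. Result: The pruned model reaches 79% sparsity while achieving 88% cross-validation accuracy (84% for the original model), showing that pruning improves efficiency without sacrificing performance; the Grad-CAM family reveals how the regions the model attends to relate to clinical features; the ROAD method, however, is difficult to adapt to this task; after calibration, the model behaves more stably on the external cohort. Conclusion: Deep learning can assess CCE cleansing quality efficiently and accurately; structured pruning substantially compresses the model without loss of performance, and combining it with explainability methods aids clinical adoption, although evaluation metrics still need to be adapted to the characteristics of medical images. Abstract: In this study, we explore the application of deep learning techniques for predicting cleansing quality in colon capsule endoscopy (CCE) images. Using a dataset of 500 images labeled by 14 clinicians on the Leighton-Rex scale (Poor, Fair, Good, and Excellent), a ResNet-18 model was trained for classification, leveraging stratified K-fold cross-validation to ensure robust performance. To optimize the model, structured pruning techniques were applied iteratively, achieving significant sparsity while maintaining high accuracy. Explainability of the pruned model was evaluated using Grad-CAM, Grad-CAM++, Eigen-CAM, Ablation-CAM, and Random-CAM, with the ROAD method employed for consistent evaluation. Our results indicate that for a pruned model, we can achieve a cross-validation accuracy of 88% with 79% sparsity, up from 84%, demonstrating the effectiveness of pruning in improving efficiency without compromising performance. We also highlight the challenges of evaluating cleansing quality of CCE images, emphasize the importance of explainability in clinical applications, and discuss the challenges associated with using the ROAD method for our task. Finally, we employ a variant of adaptive temperature scaling to calibrate the pruned models for an external dataset.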

[378] The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning

Renmiao Chen,Yida Lu,Shiyao Cui,Xuan Ouyang,Victor Shea-Jay Huang,Shumin Zhang,Chengwei Pan,Han Qiu,Minlie Huang

Main category: cs.CV

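TL;DR: This paper introduces MIR-SafetyBench, the first benchmark for multi-image reasoning safety. It finds that stronger multi-image reasoning can bring higher safety risk and identifies an internal signature of unsafe generations in attention entropy.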

Details Motivation: As multimodal large language models (MLLMs) show stronger reasoning on complex, multi-image instructions, they may introduce new safety hazards, so their safety in multi-image settings needs systematic evaluation. Method: The authors build MIR-SafetyBench, a multi-image reasoning safety benchmark with 2,676 samples covering 9 types of multi-image relations, evaluate 19 MLLMs extensively, and analyze attack success rate, response safety, and attention entropy. Result: Models with stronger reasoning are more prone to unsafe responses on MIR-SafetyBench; many replies labeled "safe" are in fact misunderstandings or evasive answers; unsafe generations generally have lower attention entropy, suggesting that models may over-focus on task solving and neglect safety constraints. Conclusion: Gains in multi-image reasoning may come with higher safety risk. Safety evaluation should include deeper behavioral analysis rather than relying only on surface response labels, and internal indicators such as attention entropy can provide early warning signals. Abstract: As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluations on 19 MLLMs reveal a troubling trend: models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench. Beyond attack success rates, we find that many responses labeled as safe are superficial, often driven by misunderstanding or evasive, non-committal replies. We further observe that unsafe generations exhibit lower attention entropy than safe ones on average. This internal signature suggests a possible risk that models may over-focus on task solving while neglecting safety constraints. Our code and data are available at https://github.com/thu-coai/MIR-SafetyBench.
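
The attention-entropy signal mentioned above can be illustrated with the Shannon entropy of a single attention distribution. This is a minimal sketch: the function name `attention_entropy` and the toy weights are assumptions, and the abstract does not say how the paper aggregates entropy over heads, layers, or tokens.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` are non-negative attention scores over the attended tokens;
    they are renormalized here so they sum to 1.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


spread = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly over 4 tokens
focused = [0.97, 0.01, 0.01, 0.01]  # attention concentrated on one token

print(attention_entropy(spread), attention_entropy(focused))
```

Lower entropy means attention is concentrated on fewer tokens, which matches the abstract's reading that unsafe generations may over-focus on task solving.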

[379] Diffusion Representations for Fine-Grained Image Classification: A Marine Plankton Case Study

A. Nieto Juscafresa,Á. Mazcuñán Herreros,J. Sullivan

Main category: cs.CV

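TL;DR: Diffusion models are state-of-the-art for image generation, but their potential as general-purpose feature encoders is underexplored. This paper shows that a frozen diffusion backbone, probed for intermediate denoising features, performs strongly on fine-grained recognition; in a plankton-monitoring setting it outperforms other self-supervised methods and remains robust on out-of-distribution data.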

Details Motivation: To explore the potential of diffusion models for tasks beyond generation, such as feature extraction and recognition, and in particular their generalization as self-supervised learners in real-world settings. Method: Using a frozen diffusion backbone, the authors extract intermediate denoising features across layers and timesteps, train a linear classifier for each layer-timestep pair, and evaluate fine-grained recognition performance. Result: On balanced and long-tailed plankton datasets, frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods; they maintain high accuracy and Macro F1 on temporally and geographically shifted data. Conclusion: Even without fine-tuning, diffusion models can serve as strong general-purpose feature encoders, showing notable robustness and practicality in out-of-distribution scenarios. Abstract: Diffusion models have emerged as state-of-the-art generative methods for image synthesis, yet their potential as general-purpose feature encoders remains underexplored. Trained for denoising and generation without labels, they can be interpreted as self-supervised learners that capture both low- and high-level structure. We show that a frozen diffusion backbone enables strong fine-grained recognition by probing intermediate denoising features across layers and timesteps and training a linear classifier for each pair. We evaluate this in a real-world plankton-monitoring setting with practical impact, using controlled and comparable training setups against established supervised and self-supervised baselines. Frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods in both balanced and naturally long-tailed settings. Out-of-distribution evaluations on temporally and geographically shifted plankton datasets further show that frozen diffusion features maintain strong accuracy and Macro F1 under substantial distribution shift.

[380] SGW-GAN: Sliced Gromov-Wasserstein Guided GANs for Retinal Fundus Image Enhancement

Yujian Xiong,Xuanzhao Dong,Wenhui Zhu,Xin Li,Oana Dumitrascu,Yalin Wang

Main category: cs.CV

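TL;DR: Proposes SGW-GAN, the first framework to bring Sliced Gromov Wasserstein (SGW) into retinal image enhancement, achieving efficient and clinically faithful unpaired medical image enhancement by preserving intra-class structure.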

Details Motivation: Existing GAN- and diffusion-based methods distort intra-class geometry when improving the perceptual quality of retinal images, which hurts downstream clinical tasks. Method: Introduces Sliced Gromov Wasserstein (SGW), which approximates the Gromov Wasserstein distance via random projections, lowering computational cost while preserving the internal relational structure between distributions, and builds the SGW-GAN framework for image enhancement. Result: Experiments on public datasets show that SGW-GAN produces visually superior enhancements, improves diabetic retinopathy grading, and attains the lowest GW discrepancy across disease labels. Conclusion: SGW-GAN preserves clinically relevant structure while remaining efficient, offering a more clinically trustworthy solution for unpaired medical image enhancement. Abstract: Retinal fundus photography is indispensable for ophthalmic screening and diagnosis, yet image quality is often degraded by noise, artifacts, and uneven illumination. Recent GAN- and diffusion-based enhancement methods improve perceptual quality by aligning degraded images with high-quality distributions, but our analysis shows that this focus can distort intra-class geometry: clinically related samples become dispersed, disease-class boundaries blur, and downstream tasks such as grading or lesion detection are harmed. The Gromov Wasserstein (GW) discrepancy offers a principled solution by aligning distributions through internal pairwise distances, naturally preserving intra-class structure, but its high computational cost restricts practical use. To overcome this, we propose SGW-GAN, the first framework to incorporate Sliced GW (SGW) into retinal image enhancement. SGW approximates GW via random projections, retaining relational fidelity while greatly reducing cost. Experiments on public datasets show that SGW-GAN produces visually compelling enhancements, achieves superior diabetic retinopathy grading, and reports the lowest GW discrepancy across disease labels, demonstrating both efficiency and clinical fidelity for unpaired medical image enhancement.

[381] Analyzing VLM-Based Approaches for Anomaly Classification and Segmentation

Mohit Kakda,Mirudula Shri Muthukumaran,Uttapreksha Patel,Lawrence Swaminathan Xavier Prince

Main category: cs.CV

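TL;DR: This paper presents a comprehensive analysis of anomaly detection methods based on vision-language models (VLMs), focusing on the performance of architectures such as WinCLIP and AprilLab in anomaly classification and segmentation. It covers feature extraction, text-visual alignment, and prompt engineering, with experiments on benchmarks such as MVTec AD and VisA.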

Details Motivation: Traditional anomaly detection depends on large labeled datasets or defect examples, whereas VLMs such as CLIP enable zero-shot and few-shot detection through text-visual alignment; a systematic understanding of their applicability and limitations is needed. Method: The authors survey and compare several VLM-based methods, including sliding-window dense feature extraction (WinCLIP), multi-stage feature alignment (the AprilLab framework), and compositional prompt ensemble strategies, and evaluate them along several dimensions: feature extraction, text-visual alignment, prompt engineering, zero-shot versus few-shot trade-offs, computational efficiency, and cross-domain generalization. Result: On benchmarks such as MVTec AD and VisA, the study compares classification accuracy, segmentation precision, and inference efficiency, revealing the strengths and bottlenecks of the different architectures. Conclusion: VLMs show strong potential for anomaly detection, but their success depends on sound feature alignment and prompt design; the paper offers practical guidance for method selection in industrial quality control and points to future research directions. Abstract: Vision-Language Models (VLMs), particularly CLIP, have revolutionized anomaly detection by enabling zero-shot and few-shot defect identification without extensive labeled datasets. By learning aligned representations of images and text, VLMs facilitate anomaly classification and segmentation through natural language descriptions of normal and abnormal states, eliminating traditional requirements for task-specific training or defect examples. This project presents a comprehensive analysis of VLM-based approaches for anomaly classification (AC) and anomaly segmentation (AS). We systematically investigate key architectural paradigms including sliding window-based dense feature extraction (WinCLIP), multi-stage feature alignment with learnable projections (AprilLab framework), and compositional prompt ensemble strategies. Our analysis evaluates these methods across critical dimensions: feature extraction mechanisms, text-visual alignment strategies, prompt engineering techniques, zero-shot versus few-shot trade-offs, computational efficiency, and cross-domain generalization. Through rigorous experimentation on benchmarks such as MVTec AD and VisA, we compare classification accuracy, segmentation precision, and inference efficiency. The primary contribution is a foundational understanding of how and why VLMs succeed in anomaly detection, synthesizing practical insights for method selection and identifying current limitations. This work aims to facilitate informed adoption of VLM-based methods in industrial quality control and guide future research directions.

[382] Optical Linear Systems Framework for Event Sensing and Computational Neuromorphic Imaging

Nimrod Kruger,Nicholas Owen Ralph,Gregory Cohen,Paul Hurley

Main category: cs.CV

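TL;DR: This paper proposes a physics-grounded processing pipeline that maps the sparse, asynchronous event stream of an event camera to per-pixel estimates of log-intensity and its derivatives and embeds them in a dynamic linear systems model with a time-varying point spread function. This supports frequency-domain Wiener deconvolution directly from event data and enables model-based computational imaging for dynamic optical systems.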

Details Motivation: Event vision sensors (neuromorphic cameras) output a nonlinear, asynchronous, sparse event stream that does not fit the linear forward models used in conventional computational imaging, so a bridge between event data and physical imaging models is needed. Method: A physics-driven pipeline first estimates per-pixel log-intensity and its temporal and spatial derivatives from the event stream, then embeds these estimates in a dynamic linear systems model with a time-varying point spread function (PSF), and finally applies Wiener deconvolution in the frequency domain with a known or parameterised dynamic transfer function. Result: In simulation, the approach localises and separates single and overlapping point sources under modulated defocus; on real star-field event data from a tunable-focus telescope, it demonstrates source localisation and separability. Conclusion: The framework provides a practical, extensible interface between event sensing and model-based computational imaging for dynamic optical systems and advances the use of neuromorphic vision in high-precision dynamic imaging. Abstract: Event vision sensors (neuromorphic cameras) output sparse, asynchronous ON/OFF events triggered by log-intensity threshold crossings, enabling microsecond-scale sensing with high dynamic range and low data bandwidth. As a nonlinear system, this event representation does not readily integrate with the linear forward models that underpin most computational imaging and optical system design. We present a physics-grounded processing pipeline that maps event streams to estimates of per-pixel log-intensity and intensity derivatives, and embeds these measurements in a dynamic linear systems model with a time-varying point spread function. This enables inverse filtering directly from event data, using frequency-domain Wiener deconvolution with a known (or parameterised) dynamic transfer function. We validate the approach in simulation for single and overlapping point sources under modulated defocus, and on real event data from a tunable-focus telescope imaging a star field, demonstrating source localisation and separability. The proposed framework provides a practical bridge between event sensing and model-based computational imaging for dynamic optical systems.
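
For readers unfamiliar with frequency-domain Wiener deconvolution, the sketch below applies the textbook filter conj(H) / (|H|^2 + k) to a 1-D signal blurred by a known, static kernel. Everything here is an illustrative assumption: the paper works with event-derived measurements and a time-varying PSF, whereas this toy example uses a fixed kernel, a naive DFT, and a constant noise-to-signal term `noise_to_signal`.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circular_convolve(x, h):
    """Circular convolution, used here to simulate blurring by a known PSF."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]


def wiener_deconvolve(y, h, noise_to_signal):
    """Estimate x from y = h * x using the filter conj(H) / (|H|^2 + k)."""
    Y, H = dft(y), dft(h)
    X_hat = [
        Yk * Hk.conjugate() / (abs(Hk) ** 2 + noise_to_signal)
        for Yk, Hk in zip(Y, H)
    ]
    return [v.real for v in dft(X_hat, inverse=True)]


x = [0, 0, 1, 0, 0, 2, 0, 0]          # two point sources
h = [0.6, 0.3, 0, 0, 0, 0, 0, 0.1]    # small blur kernel (assumed known)
y = circular_convolve(x, h)           # noiseless blurred measurement
x_hat = wiener_deconvolve(y, h, noise_to_signal=1e-12)
print([round(v, 6) for v in x_hat])
```

With noisy data, a larger `noise_to_signal` trades sharpness for stability by damping frequencies where the transfer function is weak.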

[383] DIS2: Disentanglement Meets Distillation with Classwise Attention for Robust Remote Sensing Segmentation under Missing Modalities

Nhi Kieu,Kien Nguyen,Arnold Wiliem,Clinton Fookes,Sridha Sridharan

Main category: cs.CV

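TL;DR: This paper proposes DIS2, a new method for missing modalities in remote sensing multimodal learning. It reformulates the synergy between disentanglement learning and knowledge distillation (DLKD) and combines it with a Classwise Feature Learning Module (CFLM) and a multi-resolution fusion structure to provide actively guided feature compensation, significantly outperforming existing methods on multiple benchmarks.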

Details Motivation: The high heterogeneity and scale variation of remote sensing data cause conventional multimodal learning methods to degrade severely when modalities are missing; existing disentanglement learning and knowledge distillation methods struggle to compensate for missing information and close the semantic gap. Method: Proposes DIS2 with three core designs: (1) a reformulated synergy between disentanglement learning and knowledge distillation (DLKD) that explicitly captures compensatory features; (2) a Classwise Feature Learning Module (CFLM) that adaptively learns discriminative evidence for each class depending on signal availability; (3) a multi-resolution hierarchical fusion structure (HF) that integrates features across resolutions to strengthen prediction. Result: Extensive experiments on multiple remote sensing multimodal benchmarks show that DIS2 significantly outperforms state-of-the-art methods, with stronger robustness and accuracy in missing-modality scenarios in particular. Conclusion: Through principled information compensation, class-specific contribution modeling, and multi-resolution fusion, DIS2 offers a new paradigm for multimodal learning in remote sensing and addresses the challenges of modality heterogeneity and missing modalities. Abstract: The efficacy of multimodal learning in remote sensing (RS) is severely undermined by missing modalities. The challenge is exacerbated by the highly heterogeneous data and huge scale variation in RS. Consequently, paradigms proven effective in other domains often fail when confronted with these unique data characteristics. Conventional disentanglement learning, which relies on significant feature overlap between modalities (modality-invariant), is insufficient for this heterogeneity. Similarly, knowledge distillation becomes an ill-posed mimicry task where a student fails to focus on the necessary compensatory knowledge, leaving the semantic gap unaddressed. Our work is therefore built upon three pillars uniquely designed for RS: (1) principled missing information compensation, (2) class-specific modality contribution, and (3) multi-resolution feature importance. We propose a novel method DIS2, a new paradigm shifting from modality-shared feature dependence and untargeted imitation to active, guided missing-feature compensation. Its core novelty lies in a reformulated synergy between disentanglement learning and knowledge distillation, termed DLKD. Compensatory features are explicitly captured which, when fused with the features of the available modality, approximate the ideal fused representation of the full-modality case. To address the class-specific challenge, our Classwise Feature Learning Module (CFLM) adaptively learns discriminative evidence for each target depending on signal availability. Both DLKD and CFLM are supported by a hierarchical hybrid fusion (HF) structure using features across resolutions to strengthen prediction. Extensive experiments validate that our proposed approach significantly outperforms state-of-the-art methods across benchmarks.

[384] GO-MLVTON: Garment Occlusion-Aware Multi-Layer Virtual Try-On with Diffusion Models

Yang Yu,Yunze Deng,Yige Zhang,Yanjie Xiao,Youkun Ou,Wenhao Hu,Mingchao Li,Bin Feng,Wenyu Liu,Dandan Zheng,Jingdong Chen

Main category: cs.CV

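TL;DR: Proposes GO-MLVTON, a new multi-layer virtual try-on method comprising a Garment Occlusion Learning module and a StableDiffusion-based Garment Morphing & Fitting module, and introduces the MLG dataset and a new evaluation metric, LACD, achieving state-of-the-art multi-layer virtual try-on results.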

Details Motivation: Existing image-based virtual try-on methods focus on single-layer or multi-garment try-on and overlook the modeling of occlusion relationships between garments in multi-layer virtual try-on (ML-VTON), which leads to unrealistic results. Method: Proposes GO-MLVTON, which introduces a Garment Occlusion Learning module to capture occlusion relationships between inner and outer garments and a StableDiffusion-based Garment Morphing & Fitting module to deform garments and fit them naturally onto the body. Result: The method is validated on the authors' MLG dataset; experiments show that GO-MLVTON surpasses existing methods in generation quality and occlusion handling, and the LACD metric indicates better cross-layer appearance coherence. Conclusion: GO-MLVTON is the first method targeting multi-layer virtual try-on; it models garment occlusion effectively and generates realistic multi-layer outfits, extending virtual try-on to more complex scenarios. Abstract: Existing image-based virtual try-on (VTON) methods primarily focus on single-layer or multi-garment VTON, neglecting multi-layer VTON (ML-VTON), which involves dressing multiple layers of garments onto the human body with realistic deformation and layering to generate visually plausible outcomes. The main challenge lies in accurately modeling occlusion relationships between inner and outer garments to reduce interference from redundant inner garment features. To address this, we propose GO-MLVTON, the first multi-layer VTON method, introducing the Garment Occlusion Learning module to learn occlusion relationships and the StableDiffusion-based Garment Morphing & Fitting module to deform and fit garments onto the human body, producing high-quality multi-layer try-on results. Additionally, we present the MLG dataset for this task and propose a new metric named Layered Appearance Coherence Difference (LACD) for evaluation. Extensive experiments demonstrate the state-of-the-art performance of GO-MLVTON. Project page: https://upyuyang.github.io/go-mlvton/.

[385] DiffFace-Edit: A Diffusion-Based Facial Dataset for Forgery-Semantic Driven Deepfake Detection Analysis

Feng Ding,Wenhui Yi,Xinan He,Mengyao Xiao,Jianfeng Xu,Jianqiang Du

Main category: cs.CV

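TL;DR: This paper introduces the DiffFace-Edit dataset, which contains over two million AI-generated face images with fine-grained regional edits, for studying the robustness of detection models against splice-attack samples.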

Details Motivation: Existing AI-generated face datasets pay little attention to fine-grained local manipulation, and no prior work has examined how splice attacks between real and manipulated images affect detectors. Method: The authors build the DiffFace-Edit dataset with single-region and multi-region edits across eight facial regions, introduce detector-evasive samples and analyze their effect on detection models, and carry out a comprehensive analysis using a cross-domain evaluation that combines IMDL methods. Result: The dataset contains over two million AI-generated images, supports many editing combinations, and shows that splice attacks pose a significant challenge to current detection models. Conclusion: DiffFace-Edit provides more comprehensive data for detecting fine-grained face manipulation and is the first to show the practical impact of detector-evasive samples on detection models, supporting the development of more robust detectors. Abstract: Generative models now produce imperceptible, fine-grained manipulated faces, posing significant privacy risks. However, existing AI-generated face datasets generally lack focus on samples with fine-grained regional manipulations. Furthermore, no researchers have yet studied the real impact of splice attacks, which occur between real and manipulated samples, on detectors. We refer to these as detector-evasive samples. Based on this, we introduce the DiffFace-Edit dataset, which has the following advantages: 1) It contains over two million AI-generated fake images. 2) It features edits across eight facial regions (e.g., eyes, nose) and includes a richer variety of editing combinations, such as single-region and multi-region edits. Additionally, we specifically analyze the impact of detector-evasive samples on detection models. We conduct a comprehensive analysis of the dataset and propose a cross-domain evaluation that combines IMDL methods. Dataset will be available at https://github.com/ywh1093/DiffFace-Edit.

[386] Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation

Yu Qin,Shimeng Fan,Fan Yang,Zixuan Xue,Zijie Mai,Wenrui Chen,Kailun Yang,Zhiyong Li

Main category: cs.CV

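TL;DR: This paper proposes FiCoP, an open-vocabulary 6D object pose estimation framework that moves from global matching to fine-grained patch-level correspondence, suppressing background interference and improving robustness and generalization in open-world scenes.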

Details Motivation: Existing methods rely on global matching strategies that are easily disturbed by background clutter in open environments, causing feature confusion and degrading pose estimation accuracy. Method: Proposes the FiCoP framework, which includes: 1) an object-centric disentanglement preprocessing step; 2) a Cross-Perspective Global Perception (CPGP) module that fuses dual-view features; 3) a Patch Correlation Predictor (PCP) that generates a block-wise association map acting as a spatial filter. Result: On the REAL275 and Toyota-Light datasets, Average Recall improves by 8.0% and 6.1%, respectively, over the state-of-the-art method. Conclusion: By introducing a structural prior and a fine-grained matching mechanism, FiCoP significantly improves the robustness and generalization of open-world 6D pose estimation and suits robotic manipulation in complex real environments. Abstract: Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, trying to match anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. Our core innovation lies in leveraging a patch-to-patch correlation matrix as a structural prior to narrow the matching scope, effectively filtering out irrelevant clutter to prevent it from degrading pose estimation. Firstly, we introduce an object-centric disentanglement preprocessing to isolate the semantic target from environmental noise. Secondly, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning. Finally, we design a Patch Correlation Predictor (PCP) that generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP.

[387] ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch

Zheng Liu,Honglin Lin,Chonghan Qin,Xiaoyang Wang,Xin Gao,Yu Li,Mengzhang Cai,Yun Zhu,Zhanping Zhong,Qizhi Pei,Zhuoshi Pan,Xiaoran Shang,Bin Cui,Conghui He,Wentao Zhang,Lijun Wu

Main category: cs.CV

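TL;DR: This paper proposes ChartVerse, a scalable framework for generating complex charts and reliable reasoning data from scratch. By introducing RPE, a new metric that quantifies chart complexity, and a method that generates questions in reverse from ground-truth answers, it substantially improves the chart reasoning performance of vision-language models.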

Details Motivation: Existing chart reasoning datasets contain synthetic charts that are overly simple and repetitive, and QA pairs that are prone to hallucination and lack reasoning depth, which holds back open-source vision-language models. Method: Proposes Rollout Posterior Entropy (RPE) to quantify chart complexity and designs a complexity-aware chart generator; adopts an answer-first inverse QA generation method that extracts deterministic answers from the source code, then generates questions and verifies consistency; filters samples by model fail-rate and distills high-quality Chain-of-Thought (CoT) reasoning. Result: Two datasets, ChartVerse-SFT-600K and ChartVerse-RL-40K, are built with Qwen3-VL-30B-A3B-Thinking as the teacher model; experiments show that ChartVerse-8B reaches state-of-the-art results on several metrics, surpassing its teacher and approaching the stronger Qwen3-VL-32B-Thinking. Conclusion: Through complexity-aware generation and truth-anchored inverse QA synthesis, ChartVerse addresses the quality shortfall of existing chart reasoning data and provides high-quality data for training strong vision-language models. Abstract: Chart reasoning is a critical capability for Vision Language Models (VLMs). However, the development of open-source models is severely hindered by the lack of high-quality training data. Existing datasets suffer from a dual challenge: synthetic charts are often simplistic and repetitive, while the associated QA pairs are prone to hallucinations and lack the reasoning depth required for complex tasks. To bridge this gap, we propose ChartVerse, a scalable framework designed to synthesize complex charts and reliable reasoning data from scratch. (1) To address the bottleneck of simple patterns, we first introduce Rollout Posterior Entropy (RPE), a novel metric that quantifies chart complexity. Guided by RPE, we develop a complexity-aware chart coder to autonomously synthesize diverse, high-complexity charts via executable programs. (2) To guarantee reasoning rigor, we develop truth-anchored inverse QA synthesis. Diverging from standard generation, we adopt an answer-first paradigm: we extract deterministic answers directly from the source code, generate questions conditional on these anchors, and enforce strict consistency verification. To further elevate difficulty and reasoning depth, we filter samples based on model fail-rate and distill high-quality Chain-of-Thought (CoT) reasoning. We curate ChartVerse-SFT-600K and ChartVerse-RL-40K using Qwen3-VL-30B-A3B-Thinking as the teacher. Experimental results demonstrate that ChartVerse-8B achieves state-of-the-art performance, notably surpassing its teacher and rivaling the stronger Qwen3-VL-32B-Thinking.

[388] CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

Donghee Lee,Rui Cai,Zhe Zhao

Main category: cs.CV

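TL;DR: Proposes CARPE, a model-agnostic framework that introduces vision-integration layers and a context-aware ensemble strategy to improve the performance of large vision-language models on image classification and multimodal tasks.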

Details Motivation: Existing large vision-language models underperform their base vision encoders on vision-centric tasks such as image classification, which is a performance bottleneck. Method: Proposes the CARPE framework, which introduces vision-integration layers and a context-aware ensemble strategy that adaptively weighs image representations against the reasoning ability of the language model and prioritizes the more suitable modality. Result: Across multiple image classification and vision-language benchmarks, CARPE significantly improves the performance of existing LVLMs and shows stronger generalization. Conclusion: CARPE is a general and effective framework that can be integrated with a wide range of open-source LVLMs, improving their performance on vision tasks while retaining multimodal reasoning ability. Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have pushed them closer to becoming general-purpose assistants. Despite their strong performance, LVLMs still struggle with vision-centric tasks such as image classification, underperforming compared to their base vision encoders, which are often CLIP-based models. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a novel, model-agnostic framework which introduces vision-integration layers and a context-aware ensemble strategy to identify when to prioritize image representations or rely on the reasoning capabilities of the language model. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations, leading to consistent improvements in generalization across classification and vision-language benchmarks. Extensive experiments demonstrate that CARPE not only improves performance on image classification benchmarks but also enhances results across various vision-language benchmarks. Finally, CARPE is designed to be effectively integrated with most open-source LVLMs that consist of a vision encoder and a language model, ensuring its adaptability across diverse architectures.

[389] Scaling Test-time Inference for Visual Grounding

Guanqi Zhan,Changye Li,Zhijian Liu,Yao Lu,Yi Wu,Song Han,Ligeng Zhu

Main category: cs.CV

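TL;DR: This paper proposes Efficient visual Grounding language Models (EGM), which scale the test-time computation (number of generated tokens) of small VLMs to close their visual grounding gap with large models. Experiments show that, while keeping latency low, the method matches or outperforms large models on RefCOCO and on a new amodal grounding task.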

Details Motivation: Small vision-language models (VLMs) lag behind large ones in visual grounding mainly because of weaker language understanding rather than weaker visual encoding; large models are costly to deploy and slow at inference, so an efficient, scalable way to improve small-model grounding is needed. Method: Proposes EGM, which increases the number of tokens a small model generates at test time to scale its inference computation, strengthening language reasoning and grounding while keeping deployment lightweight and latency low. Result: On RefCOCO, EGM-Qwen3-VL-8B reaches 91.4 IoU at 737 ms (5.9x faster), compared with 4,320 ms and 90.5 IoU for Qwen3-VL-235B; on the newly proposed amodal grounding task, small models with EGM also improve significantly, approaching or exceeding large models. Conclusion: Scaling test-time computation effectively narrows the grounding gap between small and large models, and EGM offers a new scalable path to efficient, practical visual grounding. Abstract: Visual grounding is an essential capability of Visual Language Models (VLMs) to understand the real physical world. Previous state-of-the-art grounding visual language models usually have large model sizes, making them heavy for deployment and slow for inference. However, we notice that the sizes of visual encoders are nearly the same for small and large VLMs and the major difference is the sizes of the language models. Small VLMs fall behind larger VLMs in grounding because of the difference in language understanding capability rather than visual information handling. To mitigate the gap, we introduce 'Efficient visual Grounding language Models' (EGM): a method to scale the test-time computation (#generated tokens). Scaling the test-time computation of a small model is deployment-friendly, and yields better end-to-end latency as the cost of each token is much cheaper compared to directly running a large model. On the RefCOCO benchmark, our EGM-Qwen3-VL-8B demonstrates 91.4 IoU with an average of 737ms (5.9x faster) latency while Qwen3-VL-235B demands 4,320ms to achieve 90.5 IoU. To validate our approach's generality, we further set up a new amodal grounding setting that requires the model to predict both the visible and occluded parts of the objects. Experiments show our method can consistently and significantly improve the vanilla grounding and amodal grounding capabilities of small models to be on par with or outperform the larger models, thereby improving the efficiency for visual grounding.
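
The IoU figures quoted above build on the intersection-over-union of a predicted box and a ground-truth box. The sketch below computes it for a single pair of axis-aligned boxes; the `(x1, y1, x2, y2)` box format and the function name `box_iou` are assumptions, and the abstract does not state how per-sample scores are aggregated into the reported benchmark numbers.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed format).
    """
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
print(box_iou(predicted, ground_truth))  # overlap 1, union 7
```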

[390] Face-Voice Association with Inductive Bias for Maximum Class Separation

Marta Moscati,Oleksandr Kats,Mubashir Noman,Muhammad Zaigham Zaheer,Yufang Hou,Markus Schedl,Shah Nawaz

Main category: cs.CV

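TL;DR: This paper proposes a new multimodal learning method that imposes maximum class separation as an inductive bias to strengthen the embeddings used for face-voice association, achieving state-of-the-art performance and showing that the approach is most effective when combined with an orthogonality loss.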

Details Motivation: Prior work aligns face and voice embeddings mainly through loss functions and has not explored maximum class separation, a strongly discriminative inductive bias, for this task; this paper aims to fill that gap. Method: Proposes a face-voice association model that imposes maximum class separation among the multimodal representations of different speakers as an inductive bias, optimized together with losses that promote inter-class orthogonality. Result: Achieves SOTA performance on two face-voice association task formulations; ablations show the inductive bias works best when combined with the orthogonality losses. Conclusion: This work is the first to demonstrate the effectiveness of maximum class separation as an inductive bias in multimodal learning, and it points toward a new paradigm for face-voice association. Abstract: Face-voice association is widely studied in multimodal learning and is approached by representing faces and voices with embeddings that are close for the same person and well separated from those of others. Previous work achieved this with loss functions. Recent advancements in classification have shown that the discriminative ability of embeddings can be strengthened by imposing maximum class separation as inductive bias. This technique has never been used in the domain of face-voice association, and this work aims at filling this gap. More specifically, we develop a method for face-voice association that imposes maximum class separation among multimodal representations of different speakers as an inductive bias. Through quantitative experiments we demonstrate the effectiveness of our approach, showing that it achieves SOTA performance on two task formulations of face-voice association. Furthermore, we carry out an ablation study to show that imposing inductive bias is most effective when combined with losses for inter-class orthogonality. To the best of our knowledge, this work is the first that applies and demonstrates the effectiveness of maximum class separation as an inductive bias in multimodal learning; it hence paves the way to establish a new paradigm.

[391] VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement

Tiancheng Fang,Bowen Pan,Lingxi Chen,Jiangjing Lyu,Chengfei Lyu,Chaoyue Niu,Fan Wu

Main category: cs.CV

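TL;DR: This paper proposes VIAFormer, a transformer model for multi-view conditioned voxel refinement that achieves efficient cross-modal fusion through an Image Index, a Correctional Flow objective, and a Hybrid Stream Transformer, and performs well on both synthetic and real noise.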

Details Motivation: Improve the repair of incomplete, noisy voxel shapes obtained from vision foundation models, in particular for accurate 3D reconstruction guided by multi-view images. Method: Proposes VIAFormer, built from three core components: an Image Index that provides 3D spatial grounding, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Result: Experiments show that VIAFormer reaches state-of-the-art performance in correcting both severe synthetic corruptions and realistic artifacts, and that it integrates effectively into real 3D creation pipelines. Conclusion: VIAFormer offers a reliable bridge for voxel-based methods in the era of large models and big data, supporting their wider use in real-world applications. Abstract: We propose VIAFormer, a Voxel-Image Alignment Transformer model designed for Multi-view Conditioned Voxel Refinement--the task of repairing incomplete noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts on the voxel shape obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in the large-model, big-data wave.

[392] Transformer based Multi-task Fusion Network for Food Spoilage Detection and Shelf life Forecasting

Mounika Kanulla,Rajasree Dadigi,Sailaja Thota,Vivek Yelleti

Main category: cs.CV

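TL;DR: Proposes fusion architectures that combine a CNN with an LSTM and a DeiT Transformer for vegetable classification, food spoilage detection, and shelf-life forecasting; experiments show the fusion models outperform several deep learning baselines.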

Details Motivation: Reduce food waste in the agricultural supply chain through accurate and effective spoilage detection and shelf-life forecasting, which in turn supports longer-lived supply chain management. Method: Builds fusion architectures that combine a CNN with an LSTM and a DeiT Transformer, trained on a self-collected vegetable image dataset for three tasks: classification, spoilage detection, and shelf-life forecasting. Result: CNN+DeiT Transformer reaches an F1-score of 0.98 for vegetable classification and 0.61 for spoilage detection, with an MSE of 3.58 and a SMAPE of 41.66% for shelf-life forecasting; the models remain stable on noisy images, and LIME is used to visualize their decisions. Conclusion: The fusion models perform well on multi-task food spoilage analysis, are robust and interpretable, and can help reduce food waste. Abstract: Food wastage is one of the critical challenges in the agricultural supply chain, and accurate and effective spoilage detection can help to reduce it. Further, it is highly important to forecast the spoilage information. This aids the longevity of the supply chain management in the agriculture field. This motivated us to propose fusion based architectures by combining CNN with LSTM and DeiT transformer for the following tasks simultaneously: (i) vegetable classification, (ii) food spoilage detection, and (iii) shelf life forecasting. We developed a dataset by capturing images of vegetables from their fresh state until they were completely spoiled. From the experimental analysis it is concluded that the proposed fusion architectures CNN+CNN-LSTM and CNN+DeiT Transformer outperformed several deep learning models such as CNN, VGG16, ResNet50, Capsule Networks, and DeiT Transformers. Overall, CNN + DeiT Transformer yielded F1-scores of 0.98 and 0.61 in vegetable classification and spoilage detection respectively and mean squared error (MSE) and symmetric mean absolute percentage error (SMAPE) of 3.58 and 41.66% respectively in spoilage forecasting. Further, the reliability of the fusion models was validated on noisy images and integrated with LIME to visualize the model decisions.
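
The two forecasting metrics quoted in this entry can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the abstract does not say which SMAPE variant was used, so the common form with denominator (|a| + |p|) / 2 is assumed, and the function names are hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """Symmetric MAPE in percent (assumed variant: denominator (|a| + |p|) / 2)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # Assumed convention: a pair that is exactly (0, 0) contributes no error.
        total += 0.0 if denom == 0 else abs(a - p) / denom
    return 100.0 * total / len(actual)
```

Under this variant SMAPE is bounded between 0% and 200%, which is one way to read a value such as 41.66%; a different variant would change the scale.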

[393] Finally Outshining the Random Baseline: A Simple and Effective Solution for Active Learning in 3D Biomedical Imaging

Carsten T. Lüth,Jeremias Traub,Kim-Celine Kahl,Till J. Bungert,Lukas Klein,Lars Krämer,Paul F. Jäger,Klaus Maier-Hein,Fabian Isensee

Main category: cs.CV

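TL;DR: Proposes ClaSP PE, an active learning query strategy that addresses class imbalance and redundant early selections in 3D biomedical image segmentation; across 24 experimental settings it outperforms improved random sampling baselines, and it generalizes well in practice.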

Details Motivation: Existing active learning methods struggle to consistently beat improved random sampling baselines in 3D biomedical image segmentation, and they suffer from class imbalance and redundant early queries, leaving the field without a reliable solution. Method: Proposes Class-stratified Scheduled Power Predictive Entropy (ClaSP PE), which combines class-stratified querying to cover rare structures with log-scale power noising under a decaying schedule, promoting diversity early and exploitation later. Result: Across 24 experimental settings on four 3D datasets, ClaSP PE is the only method that generally outperforms the improved random baselines, with significantly better segmentation quality while remaining annotation efficient; it also generalizes to four unseen datasets without tuning. Conclusion: ClaSP PE is a simple, effective, and generalizable active learning strategy that, for the first time, consistently beats random baselines in a close-to-production scenario, moving active learning for 3D medical image segmentation closer to practical use. Abstract: Active learning (AL) has the potential to drastically reduce annotation costs in 3D biomedical image segmentation, where expert labeling of volumetric data is both time-consuming and expensive. Yet, existing AL methods are unable to consistently outperform improved random sampling baselines adapted to 3D data, leaving the field without a reliable solution. We introduce Class-stratified Scheduled Power Predictive Entropy (ClaSP PE), a simple and effective query strategy that addresses two key limitations of standard uncertainty-based AL methods: class imbalance and redundancy in early selections. ClaSP PE combines class-stratified querying to ensure coverage of underrepresented structures and log-scale power noising with a decaying schedule to enforce query diversity in early-stage AL and encourage exploitation later. In our evaluation on 24 experimental settings using four 3D biomedical datasets within the comprehensive nnActive benchmark, ClaSP PE is the only method that generally outperforms improved random baselines in terms of segmentation quality, with statistically significant gains, whilst remaining annotation efficient. Furthermore, we explicitly simulate the real-world application by testing our method on four previously unseen datasets without manual adaptation, where all experiment parameters are set according to predefined guidelines. The results confirm that ClaSP PE robustly generalizes to novel tasks without requiring dataset-specific tuning.
Within the nnActive framework, we present compelling evidence that an AL method can consistently outperform random baselines adapted to 3D segmentation, in terms of both performance and annotation efficiency in a realistic, close-to-production scenario. Our open-source implementation and clear deployment guidelines make it readily applicable in practice. Code is at https://github.com/MIC-DKFZ/nnActive.
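
The two ingredients named in the abstract, class-stratified querying and log-scale noising with a decaying schedule, can be sketched roughly as below. This is an illustrative sketch only and not the nnActive implementation: the Gumbel noise form, the linear decay, the round-robin stratification, and all function and parameter names are assumptions made for the example.

```python
import math
import random


def noise_scale(round_idx, num_rounds, start=1.0):
    """Assumed schedule: decay linearly from `start` (first round) to 0 (last round)."""
    if num_rounds <= 1:
        return 0.0
    return start * (1.0 - round_idx / (num_rounds - 1))


def stratified_noisy_query(entropy_by_class, budget, scale, rng):
    """Select up to `budget` candidates, taking turns across classes.

    entropy_by_class maps a class name to (candidate_id, predictive_entropy)
    pairs with entropy > 0. Within each class, candidates are ranked by
    log(entropy) plus scaled Gumbel noise; scale == 0 is plain top-k.
    """
    ranked = {}
    for cls, items in entropy_by_class.items():
        scored = []
        for cand_id, entropy in items:
            noise = 0.0
            if scale > 0:
                u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
                noise = -math.log(-math.log(u))  # standard Gumbel sample
            scored.append((math.log(entropy) + scale * noise, cand_id))
        ranked[cls] = sorted(scored, reverse=True)

    selected = []
    classes = sorted(ranked)
    position = 0
    # Round-robin over classes so rare classes are represented in every batch.
    while len(selected) < budget and any(position < len(ranked[c]) for c in classes):
        for c in classes:
            if position < len(ranked[c]) and len(selected) < budget:
                selected.append(ranked[c][position][1])
        position += 1
    return selected
```

With a large noise scale in early rounds the ranking is close to random within each class, which spreads queries out; as the scale decays to zero the selection reduces to the highest-entropy candidates per class.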

[394] Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation

Boyuan Cao,Xingbo Yao,Chenhui Wang,Jiaxin Ye,Yujie Wei,Hongming Shan

Main category: cs.CV

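TL;DR: This paper proposes DyDiLA, a new linear attention mechanism that improves diffusion transformers for image generation; a dynamic projection module, a dynamic measure kernel, and a token differential operator mitigate oversmoothing and markedly improve generation quality.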

Details Motivation: Existing linear attention mechanisms reduce computational cost but often produce over-smoothed attention weights, which limits expressiveness and hurts generative performance. Method: Proposes Dynamic Differential Linear Attention (DyDiLA) with three core designs, a dynamic projection module, a dynamic measure kernel, and a token differential operator, and builds a new linear diffusion transformer, DyDi-LiT, around it. Result: DyDi-LiT consistently outperforms current state-of-the-art models across multiple metrics on image generation. Conclusion: DyDiLA improves the expressiveness and generation quality of linear diffusion transformers and offers a new direction for efficient, high-performing image generation models. Abstract: Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been adopted to reduce computational cost; unfortunately, the resulting linear diffusion transformer (LiT) models often come at the expense of generative performance, frequently producing over-smoothed attention weights that limit expressiveness. In this work, we introduce Dynamic Differential Linear Attention (DyDiLA), a novel linear attention formulation that enhances the effectiveness of LiTs by mitigating the oversmoothing issue and improving generation quality. Specifically, the novelty of DyDiLA lies in three key designs: (i) dynamic projection module, which facilitates the decoupling of token representations by learning with dynamically assigned knowledge; (ii) dynamic measure kernel, which provides a better similarity measurement to capture fine-grained semantic distinctions between tokens by dynamically assigning kernel functions for token processing; and (iii) token differential operator, which enables more robust query-to-key retrieval by calculating the differences between the tokens and their corresponding information redundancy produced by dynamic measure kernel. To capitalize on DyDiLA, we introduce a refined LiT, termed DyDi-LiT, that systematically incorporates our advancements. Extensive experiments show that DyDi-LiT consistently outperforms current state-of-the-art (SOTA) models across multiple metrics, underscoring its strong practical potential.
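
For background, the generic linear attention that this line of work builds on can be sketched as below. This is plain kernelized attention, not DyDiLA: the elu(x) + 1 feature map is one common choice and is an assumption here, and none of the paper's three components are reproduced.

```python
import math


def feature_map(x):
    """elu(x) + 1 applied elementwise; keeps every feature strictly positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def quadratic_attention(Q, K, V):
    """Reference O(N^2) form: weights are phi(q_i) . phi(k_j), normalized per query."""
    out = []
    for q in Q:
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in K]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """O(N) form: accumulate S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) once."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = feature_map(k)
        for a in range(d):
            z[a] += fk[a]
            for c in range(dv):
                S[a][c] += fk[a] * v[c]
    out = []
    for q in Q:
        fq = feature_map(q)
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][c] for a in range(d)) / norm for c in range(dv)])
    return out
```

Both functions compute the same output; the second reorders the matrix products so the cost grows linearly with the number of tokens, which is the saving the abstract refers to.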

[395] Reasoning or Pattern Matching? Probing Large Vision-Language Models with Visual Puzzles

Maria Lymperaiou,Vasileios Karampinis,Giorgos Filandrianos,Angelos Vlachos,Chrysoula Zerva,Athanasios Voulodimos

Main category: cs.CV

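TL;DR: This survey reviews the use of visual puzzles to evaluate the reasoning abilities of Large Vision-Language Models (LVLMs), proposes a unified framework, organizes existing benchmarks by reasoning mechanism, and exposes current limitations in generalization, in separating perception from reasoning, and in the consistency between explanation and execution.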

Details Motivation: Use visual puzzles as a diagnostic tool to systematically evaluate abstraction, rule discovery, and systematic reasoning in LVLMs, addressing the weak controllability of open-ended multimodal benchmarks. Method: Builds a unified abstraction for visual puzzles, organizes existing benchmarks by reasoning mechanism (inductive, analogical, algorithmic, deductive, and geometric/spatial), and synthesizes the empirical evidence. Result: Current LVLMs show brittle generalization, tight coupling between perception and reasoning, and fluent explanations that do not match faithful execution. Conclusion: Visual puzzles should be treated as diagnostic instruments rather than mere task formats, and better-designed benchmarks are needed to drive multimodal systems with genuine reasoning ability. Abstract: Puzzles have long served as compact and revealing probes of human cognition, isolating abstraction, rule discovery, and systematic reasoning with minimal reliance on prior knowledge. Leveraging these properties, visual puzzles have recently emerged as a powerful diagnostic tool for evaluating the reasoning abilities of Large Vision-Language Models (LVLMs), offering controlled, verifiable alternatives to open-ended multimodal benchmarks. This survey provides a unified perspective of visual puzzle reasoning in LVLMs. We frame visual puzzles through a common abstraction and organize existing benchmarks by the reasoning mechanisms they target (inductive, analogical, algorithmic, deductive, and geometric/spatial), thereby linking puzzle design to the cognitive operations required for solving. Synthesizing empirical evidence across these categories, we identify consistent limitations in current models, including brittle generalization, tight entanglement between perception and reasoning, and a persistent gap between fluent explanations and faithful execution. By framing visual puzzles as diagnostic instruments rather than task formats, this survey elaborates on the state of LVLM reasoning and outlines key directions for future benchmarks and reasoning-aware multimodal systems.

[396] ParkingTwin: Training-Free Streaming 3D Reconstruction for Parking-Lot Digital Twins

Xinhao Liu,Yu Wang,Xiansheng Guo,Gordon Owusu Boateng,Yu Cao,Haonan Si,Xingchen Guo,Nirwan Ansari

Main category: cs.CV

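TL;DR: ParkingTwin is a training-free, lightweight online 3D reconstruction system that uses OSM priors and geometry-aware dynamic filtering to build real-time, high-fidelity parking-lot digital twins on edge devices, with substantial gains in efficiency and robustness.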

Details Motivation: Existing 3D reconstruction methods struggle in parking lots because of weak parallax, dynamic occlusions, and large lighting changes, and they depend on expensive offline optimization that cannot meet edge-side streaming requirements. Method: Proposes the ParkingTwin system: 1) generate a metric-consistent TSDF from OSM semantic topology; 2) filter dynamic objects in real time with a quad-modal geometric constraint; 3) perform illumination-robust texture fusion in CIELAB space. Result: The system runs at 30+ FPS on a GTX 1660; compared with 3DGS it improves SSIM by 16%, is about 15x faster end to end, and uses 83.3% less GPU memory, and it outputs triangle meshes compatible with mainstream engines. Conclusion: ParkingTwin delivers efficient, low-resource parking-lot digital twin reconstruction suitable for deployment in real AVP systems. Abstract: High-fidelity parking-lot digital twins provide essential priors for path planning, collision checking, and perception validation in Automated Valet Parking (AVP). Yet robot-oriented reconstruction faces a trilemma: sparse forward-facing views cause weak parallax and ill-posed geometry; dynamic occlusions and extreme lighting hinder stable texture fusion; and neural rendering typically needs expensive offline optimization, violating edge-side streaming constraints. We propose ParkingTwin, a training-free, lightweight system for online streaming 3D reconstruction. First, OSM-prior-driven geometric construction uses OpenStreetMap semantic topology to directly generate a metric-consistent TSDF, replacing blind geometric search with deterministic mapping and avoiding costly optimization. Second, geometry-aware dynamic filtering employs a quad-modal constraint field (normal/height/depth consistency) to reject moving vehicles and transient occlusions in real time. Third, illumination-robust fusion in CIELAB decouples luminance and chromaticity via adaptive L-channel weighting and depth-gradient suppression, reducing seams under abrupt lighting changes. ParkingTwin runs at 30+ FPS on an entry-level GTX 1660. On a 68,000 m^2 real-world dataset, it achieves SSIM 0.87 (+16.0%), delivers about 15x end-to-end speedup, and reduces GPU memory by 83.3% compared with state-of-the-art 3D Gaussian Splatting (3DGS) that typically requires high-end GPUs (RTX 4090D). The system outputs explicit triangle meshes compatible with Unity/Unreal digital-twin pipelines. Project page: https://mihoutao-liu.github.io/ParkingTwin/

[397] Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs

Yujin Jo,Sangyoon Bae,Taesup Kim

Main category: cs.CV

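TL;DR: Proposes Attention-space Contrastive Guidance (ACG), a single-pass mechanism that contrasts vision-language and language-only representation paths inside self-attention layers to suppress hallucinations in large vision-language models, improving the visual consistency and semantic faithfulness of generated text while lowering computational cost.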

Details Motivation: In large vision-language models (LVLMs), language priors often dominate over visual evidence, causing object misidentification and visually inconsistent descriptions, that is, hallucinations. Mitigating this requires reducing over-reliance on language priors and strengthening the role of visual information during generation. Method: Frames hallucination mitigation as contrastive guidance, building vision-language and language-only attention paths within self-attention space so that the contrast is obtained in a single forward pass; an orthogonalized correction removes approximation bias and amplifies visual contributions, improving visual grounding without extra forward passes. Result: Achieves state-of-the-art faithfulness and caption quality on the CHAIR and POPE benchmarks, while reducing latency by up to 2x relative to contrastive decoding methods that need multiple forward passes. Conclusion: ACG is a principled and efficient hallucination mitigation method that embeds contrastive guidance in the model's internal representation contextualization, keeping performance high at much lower computational cost and making it suitable for practical LVLM deployment. Abstract: Hallucinations in large vision-language models (LVLMs) often arise when language priors dominate over visual evidence, causing object misidentification and visually inconsistent descriptions. We address this issue by framing hallucination mitigation as contrastive guidance, steering generation toward visually grounded and semantically faithful text. This approach regulates the model's internal behavior by reducing over-dependence on language priors and contrasting visually grounded with language-only representations. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths in a single forward computation. This integration enables computationally efficient guidance directly embedded in the model's representation contextualization. To correct approximation bias introduced by the single-pass formulation, we further apply an orthogonalized correction that removes components aligned with the language-only path, selectively amplifying visual contributions. Experiments on the CHAIR and POPE benchmarks show that ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost. Our method establishes a principled and efficient alternative, reducing latency by up to 2x compared to prior contrastive decoding methods that require multiple forward passes.
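
The phrase "removes components aligned with the language-only path" describes, at its core, a vector projection. The sketch below shows only that generic operation; which tensors ACG applies it to, and at what granularity, is not stated in the abstract, so the function name and inputs here are hypothetical.

```python
def remove_aligned_component(guided, language_only):
    """Return `guided` minus its projection onto `language_only`.

    The result is orthogonal to `language_only`; if that vector is all
    zeros there is nothing to remove and `guided` is returned unchanged.
    """
    norm_sq = sum(u * u for u in language_only)
    if norm_sq == 0.0:
        return list(guided)
    coeff = sum(g * u for g, u in zip(guided, language_only)) / norm_sq
    return [g - coeff * u for g, u in zip(guided, language_only)]
```

For example, removing the direction [1, 0] from [3, 4] leaves [0, 4]: the part of the signal explained by the language-only direction is discarded and the remainder is kept.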

[398] MVGD-Net: A Novel Motion-aware Video Glass Surface Detection Network

Yiwei Lu,Hao Huang,Tao Yan

Main category: cs.CV

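TL;DR: This paper proposes MVGD-Net, a video glass surface detection method based on motion inconsistency cues; with three novel modules and a temporal-spatial decoder, it outperforms existing methods on a newly built large-scale dataset.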

Details Motivation: Glass surfaces are common in daily life and professional environments and pose a potential threat to vision systems. Existing video glass surface detection methods need more effective cues to identify glass regions accurately, so this paper exploits motion inconsistency to improve detection. Method: Proposes MVGD-Net, comprising a Cross-scale Multimodal Fusion Module (CMFM), a History Guided Attention Module (HGAM), and a Temporal Cross Attention Module (TCAM), plus a Temporal-Spatial Decoder (TSD) that fuses spatial and temporal features; optical flow and spatial features capture the motion inconsistency, and the network is trained and evaluated on a self-built large-scale dataset. Result: MVGD-Net clearly outperforms existing state-of-the-art methods on the proposed dataset of 312 diverse glass scenarios totaling 19,268 frames. Conclusion: By exploiting the motion inconsistency between objects reflected in or transmitted through glass and real objects in the video, MVGD-Net detects glass surfaces effectively, and the proposed modules and dataset help advance the field. Abstract: Glass surfaces, ubiquitous in both daily life and professional environments, present a potential threat to vision-based systems, such as robot and drone navigation. To solve this challenge, recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther from the glass surfaces. Consequently, in video motion scenarios, the notable reflected (or transmitted) objects on the glass surface move slower than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. Our MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM) that integrates extracted spatial features and estimated optical flow maps, the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhance temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, for training our network, we also propose a large-scale dataset, which comprises 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.

Xinlei Yin,Xiulian Peng,Xiao Li,Zhiwei Xiong,Yan Lu

Main category: cs.CV

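TL;DR: This paper proposes HAVEN, a framework that combines audiovisual entity cohesion with hierarchical indexing and agentic search to reach coherent, comprehensive understanding of long videos, improving temporal coherence, entity consistency, and retrieval efficiency and achieving 84.1% accuracy on LVBench.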

Details Motivation: Existing long-video understanding methods based on naive chunking and retrieval-augmented generation suffer from information fragmentation and a loss of global coherence. Method: Proposes HAVEN, which integrates entity-level audiovisual representations to preserve semantic consistency and builds a hierarchy spanning global summary, scene, segment, and entity levels; an agentic search mechanism performs dynamic retrieval and reasoning across these levels. Result: Reaches 84.1% overall accuracy on LVBench and 80.1% in the reasoning category, with strong temporal coherence, entity consistency, and retrieval efficiency. Conclusion: Structured, multimodal reasoning effectively supports context-consistent and comprehensive understanding of long videos, and HAVEN sets a new state of the art for the task. Abstract: Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions relying on naive chunking strategies with retrieval-augmented generation typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state-of-the-art with an overall accuracy of 84.1% on LVBench. Notably, it achieves outstanding performance in the challenging reasoning category, reaching 80.1%. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.

[400] Facial Spatiotemporal Graphs: Leveraging the 3D Facial Surface for Remote Physiological Measurement

Sam Cantrill,David Ahmedt-Aristizabal,Lars Petersson,Hanna Suominen,Mohammad Ali Armin

Main category: cs.CV

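TL;DR: Proposes STGraph, a new modeling approach for facial remote photoplethysmography (rPPG), together with the lightweight model MeshPhys; performing spatiotemporal graph convolution on the 3D facial surface yields more robust, interpretable, and generalizable physiological signal estimation.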

Details Motivation: Existing rPPG methods do not explicitly align their receptive fields with the 3D facial surface, the spatial support of the rPPG signal, which limits modeling accuracy. Method: Proposes the Facial Spatiotemporal Graph (STGraph), which encodes facial color and structure from 3D facial mesh sequences, and MeshPhys, a lightweight spatiotemporal graph convolutional network that operates on the STGraph to estimate physiological signals. Result: On four benchmark datasets, MeshPhys achieves state-of-the-art or competitive performance in both intra- and cross-dataset settings; ablations show that constraining the receptive field to the facial surface acts as a strong structural prior and that surface-aligned, 3D-aware node features are critical for robust encoding. Conclusion: STGraph and MeshPhys form a novel, principled modeling paradigm for facial rPPG that improves robustness, interpretability, and generalization. Abstract: Facial remote photoplethysmography (rPPG) methods estimate physiological signals by modeling subtle color changes on the 3D facial surface over time. However, existing methods fail to explicitly align their receptive fields with the 3D facial surface, the spatial support of the rPPG signal. To address this, we propose the Facial Spatiotemporal Graph (STGraph), a novel representation that encodes facial color and structure using 3D facial mesh sequences, enabling surface-aligned spatiotemporal processing. We introduce MeshPhys, a lightweight spatiotemporal graph convolutional network that operates on the STGraph to estimate physiological signals. Across four benchmark datasets, MeshPhys achieves state-of-the-art or competitive performance in both intra- and cross-dataset settings. Ablation studies show that constraining the model's receptive field to the facial surface acts as a strong structural prior, and that surface-aligned, 3D-aware node features are critical for robustly encoding facial surface color. Together, the STGraph and MeshPhys constitute a novel, principled modeling paradigm for facial rPPG, enabling robust, interpretable, and generalizable estimation. Code is available at https://samcantrill.github.io/facial-stgraph-rppg/ .

[401] HiT: History-Injection Transformers for Onboard Continuous Flood Change Detection

Daniel Kyselica,Jonáš Herec,Oliver Kutis,Rado Pitoňák

Main category: cs.CV

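TL;DR: This paper proposes History Injection for Transformers (HiT), a mechanism for flood detection onboard small satellites that enables efficient multi-temporal change detection with very little storage and compute.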

Details Motivation: Enable continuous, real-time monitoring of natural disasters such as floods within the limited memory and compute of small satellites, reducing dependence on ground-based processing. Method: Proposes the HiT mechanism, which injects information from past observations into a Transformer model so that context is retained while storing only a tiny amount of historical data, and builds an end-to-end change detection system on the Prithvi-tiny foundation model. Result: On the STTORM-CD flood dataset, the HiT-Prithvi model maintains detection accuracy comparable to the bitemporal baseline while cutting data storage by more than 99%, and it runs at 43 FPS on a Jetson Orin Nano. Conclusion: HiT provides an efficient, practical framework for continuous disaster monitoring on small satellites, supporting fully onboard real-time hazard assessment and moving remote sensing toward autonomy and edge computing. Abstract: Natural disaster monitoring through continuous satellite observation requires processing multi-temporal data under strict operational constraints. This paper addresses flood detection, a critical application for hazard management, by developing an onboard change detection system that operates within the memory and computational limits of small satellites. We propose a History Injection mechanism for Transformer models (HiT) that maintains historical context from previous observations while reducing data storage by over 99% of original image size. Moreover, testing on the STTORM-CD flood dataset confirms that the HiT mechanism within the Prithvi-tiny foundation model maintains detection accuracy compared to the bitemporal baseline. The proposed HiT-Prithvi model achieved 43 FPS on Jetson Orin Nano, a representative onboard hardware used in nanosats. This work establishes a practical framework for satellite-based continuous monitoring of natural disasters, supporting real-time hazard assessment without dependency on ground-based processing infrastructure. The architecture and model checkpoints are available at https://github.com/zaitra/HiT-change-detection

[402] PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval

Gabriele Serussi,David Vainshtein,Jonathan Kouchly,Dotan Di Castro,Chaim Baskin

Main category: cs.CV

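TL;DR: This paper proposes PREGEN, an efficient video retrieval framework that freezes a pre-trained vision-language model, extracts its hidden states, and pairs them with a lightweight encoder to perform composed video retrieval without fine-tuning, clearly surpassing existing methods on multiple benchmarks.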

Details Motivation: Existing composed video retrieval (CoVR) methods fail to fully exploit modern vision-language models (VLMs): they either use outdated architectures or depend on computationally expensive fine-tuning and slow caption generation, which limits both performance and efficiency. Method: Feeds the query video and modifying text into a frozen pre-trained VLM, extracts the hidden state of the final token from each layer, and trains a lightweight encoder on these representations to produce a semantically rich, compact embedding for retrieval. Result: On standard CoVR benchmarks, PREGEN improves Recall@1 over prior methods by +27.23 and +69.59, and shows robustness across VLM backbones and strong zero-shot generalization to more complex textual modifications. Conclusion: PREGEN is an efficient and powerful CoVR framework that reaches state-of-the-art performance without VLM fine-tuning, advancing VLM-based video retrieval. Abstract: Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates robustness across different VLM backbones and exhibits strong zero-shot generalization to more complex textual modifications, highlighting its effectiveness and semantic capabilities.

[403] Insight: Interpretable Semantic Hierarchies in Vision-Language Encoders

Kai Wittenmayer,Sukrut Rao,Amin Parchami-Araghi,Bernt Schiele,Jonas Fischer

Main category: cs.CV

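TL;DR: This paper proposes Insight, a language-aligned concept foundation model that uses a hierarchical sparse autoencoder to automatically extract multi-granularity, interpretable, spatially grounded concepts from images, and exploits relationships between concepts to improve naming and explanations, matching the performance of black-box models on classification and segmentation.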

Details Motivation: Representations in existing vision foundation models are hard to interpret; prior work decomposes them into human-understandable concepts but offers poor spatial grounding and is limited to classification. Method: Proposes Insight, which combines a hierarchical sparse autoencoder with a foundation model that has strong semantic representations to extract fine-grained, spatially aligned concepts from images; local co-occurrence dependencies between concepts define concept relationships, which are used to improve concept naming and produce richer explanations. Result: On benchmark data, Insight performs competitively with existing foundation models on image classification and segmentation while providing high-quality, fine-grained, spatially grounded concept-based explanations. Conclusion: Insight retains strong performance while enabling fine-grained, spatially interpretable analysis of language-aligned visual representations, advancing interpretable vision models. Abstract: Language-aligned vision foundation models perform strongly across diverse downstream tasks. Yet, their learned representations remain opaque, making interpreting their decision-making hard. Recent works decompose these representations into human-interpretable concepts, but provide poor spatial grounding and are limited to image classification tasks. In this work, we propose Insight, a language-aligned concept foundation model that provides fine-grained concepts, which are human-interpretable and spatially grounded in the input image. We leverage a hierarchical sparse autoencoder and a foundation model with strong semantic representations to automatically extract concepts at various granularities. Examining local co-occurrence dependencies of concepts allows us to define concept relationships. Through these relations we further improve concept naming and obtain richer explanations. On benchmark data, we show that Insight provides performance on classification and segmentation that is competitive with opaque foundation models while providing fine-grained, high quality concept-based explanations. Code is available at https://github.com/kawi19/Insight.

[404] Discriminant Learning-based Colorspace for Blade Segmentation

Raül Pérez-Gonzalo,Andreas Espersen,Antonio Agudo

Main category: cs.CV

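TL;DR: Proposes Colorspace Discriminant Analysis (CSDA), a new multidimensional nonlinear discriminant analysis algorithm that uses deep learning to optimize the color space representation and improve image segmentation accuracy.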

Details Motivation: Existing image segmentation algorithms often overlook how color representation affects segmentation, leading to suboptimal results. Method: Extends Linear Discriminant Analysis into a deep learning framework as CSDA, which learns a discriminative color space by maximizing inter-class separability and minimizing intra-class variability, with three alternative losses that enable end-to-end training. Result: Experiments on wind turbine blade data show that the method significantly improves segmentation accuracy. Conclusion: For domain-specific image segmentation, tailored color preprocessing can effectively improve performance, and CSDA offers a new approach to learnable color spaces. Abstract: Suboptimal color representation often hinders accurate image segmentation, yet many modern algorithms neglect this critical preprocessing step. This work presents a novel multidimensional nonlinear discriminant analysis algorithm, Colorspace Discriminant Analysis (CSDA), for improved segmentation. Extending Linear Discriminant Analysis into a deep learning context, CSDA customizes color representation by maximizing multidimensional signed inter-class separability while minimizing intra-class variability through a generalized discriminative loss. To ensure stable training, we introduce three alternative losses that enable end-to-end optimization of both the discriminative colorspace and segmentation process. Experiments on wind turbine blade data demonstrate significant accuracy gains, emphasizing the importance of tailored preprocessing in domain-specific segmentation.

[405] FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation

Xinya Ji,Sebastian Weiss,Manuel Kansy,Jacek Naruniec,Xun Cao,Barbara Solenthaler,Derek Bradley

Main category: cs.CV

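TL;DR: Proposes FastGHA, a feed-forward method that efficiently generates high-quality Gaussian head avatars from only a few input images and supports real-time animation.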

Details Motivation: Existing methods rely on complex multi-view capture or on monocular video with per-identity optimization, which limits scalability and ease of use on new subjects. Method: Learns a per-pixel Gaussian representation directly from the input images, using a transformer-based encoder that fuses image features from DINOv3 and the Stable Diffusion VAE; a lightweight MLP-based dynamic network predicts 3D Gaussian deformations, and point maps from a pre-trained large model provide geometry supervision. Result: Significantly outperforms existing methods in rendering quality and inference efficiency, and supports real-time dynamic avatar animation. Conclusion: FastGHA offers a new solution for efficient, high-quality head avatar generation with good practicality and scalability. Abstract: Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or monocular videos with per-identity optimization during inference, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose FastGHA, a feed-forward method to generate high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images, and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representations with per-Gaussian features and introduce a lightweight MLP-based dynamic network to predict 3D Gaussian deformations from expression codes. Furthermore, to enhance geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.

[406] DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes

Aisha Al-Mohannadi,Ayisha Firoz,Yin Yang,Muhammad Imran,Ferda Ofli

Main category: cs.CV

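TL;DR: This paper introduces DisasterVQA, a benchmark dataset for evaluating vision-language models in disaster response, containing 1,395 real images and 4,405 expert-annotated question-answer pairs across multiple disaster types and humanitarian tasks, and reveals the limits of current models in fine-grained reasoning and rare scenarios.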

Details Motivation: Visual Question Answering (VQA) models perform well in general-purpose domains, but their suitability for safety-critical settings that require complex reasoning, such as disaster response, is unclear, and there is no evaluation benchmark grounded in humanitarian frameworks and oriented toward real decision-making. Method: Builds the DisasterVQA dataset from real disaster images with expert-annotated questions of several types (binary, multiple-choice, open-ended) grounded in humanitarian frameworks such as FEMA ESF and OCHA MIRA, and evaluates seven state-of-the-art vision-language models on it. Result: Models are accurate on binary questions but struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, with marked performance drops on underrepresented disaster scenarios. Conclusion: DisasterVQA provides a challenging and practical benchmark for perception and reasoning in disaster response, helping to drive more robust and operationally meaningful vision-language models. Abstract: Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at https://zenodo.org/records/18267770.

[407] Probabilistic Deep Discriminant Analysis for Wind Blade Segmentation

Raül Pérez-Gonzalo,Andreas Espersen,Antonio Agudo

Main category: cs.CV

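TL;DR: Proposes Deep Discriminant Analysis (DDA) and refines it into the stably trainable Probabilistic DDA (PDDA), applies it to image segmentation for the first time, and reports strong results on wind blade segmentation.

Details Motivation: Linear discriminant analysis is limited on non-linearly separable data, and traditional approaches struggle to optimize the Fisher criterion directly, so a new discriminant analysis method is needed that improves class separation and suits deep networks. Method: Directly optimizes the Fisher criterion with a deep network, introducing signed between-class variance, sigmoid-bounded outputs, and the conversion of multiplicative relationships into additive ones; designs two stable DDA loss functions and combines them with a probability loss to obtain PDDA. Result: PDDA markedly reduces within-class variance and class overlap, raises prediction confidence, and delivers higher performance and consistency on wind blade image segmentation. Conclusion: PDDA successfully applies deep discriminant analysis to image segmentation, resolves the training-stability problem, and opens a new path for discriminant analysis in deep learning. Abstract: Linear discriminant analysis improves class separability but struggles with non-linearly separable data. To overcome this, we introduce Deep Discriminant Analysis (DDA), which directly optimizes the Fisher criterion utilizing deep networks. To ensure stable training and avoid computational instabilities, we incorporate signed between-class variance, bound outputs with a sigmoid function, and convert multiplicative relationships into additive ones. We present two stable DDA loss functions and augment them with a probability loss, resulting in Probabilistic DDA (PDDA). PDDA effectively minimizes class overlap in output distributions, producing highly confident predictions with reduced within-class variance. When applied to wind blade segmentation, PDDA showcases notable advances in performance and consistency, critical for wind energy maintenance. To our knowledge, this is the first application of DDA to image segmentation.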
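
For readers unfamiliar with the Fisher criterion that DDA optimizes, the following is a minimal sketch of the classical two-class version on scalar outputs: squared distance between class means divided by the sum of within-class variances. It is not the paper's loss; the signed between-class variance and the additive reformulation are not reproduced here, and the function names `fisher_criterion` and `sigmoid` are illustrative assumptions.

```python
import numpy as np

def fisher_criterion(scores, labels):
    """Classical two-class Fisher ratio on 1-D scores:
    (mean_1 - mean_0)^2 / (var_0 + var_1). Larger means better separation."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    s0, s1 = scores[labels == 0], scores[labels == 1]
    between = (s1.mean() - s0.mean()) ** 2   # between-class scatter
    within = s0.var() + s1.var()             # within-class scatter (population variance)
    return between / within

def sigmoid(z):
    """Bounds raw network outputs to (0, 1), as the abstract describes."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

# Hypothetical per-pixel logits for background (0) and blade (1) pixels.
logits = np.array([-2.0, -1.0, 1.0, 2.0])
labels = np.array([0, 0, 1, 1])
ratio = fisher_criterion(sigmoid(logits), labels)
```

A training loss would maximize this ratio (or minimize a stabilized surrogate of it); the ratio itself becomes numerically unstable as the within-class variance approaches zero, which is presumably the kind of instability the paper's modifications target.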

[408] OCCAM: Class-Agnostic, Training-Free, Prior-Free and Multi-Class Object Counting

Michail Spanakis,Iason Oikonomidis,Antonis Argyros

Main category: cs.CV

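TL;DR: This paper presents OCCAM, the first class-agnostic object counting method that needs no training and no supplementary information. It handles both single-class and multi-class scenes, uses SAM2 and a modified FINCH algorithm to reach competitive performance on standard datasets, and proposes a new synthetic dataset and an F1 evaluation metric.

Details Motivation: Existing class-agnostic counting methods usually rely on extensive training and extra inputs (such as visual exemplars or text prompts) and assume a single class per image, which makes them hard to generalize to complex real scenes. A general solution is needed that requires no training, no auxiliary information, and supports multiple classes. Method: Uses the foundation model Segment Anything Model 2 (SAM2) for object segmentation, combined with a custom threshold-based variant of the First Integer Neighbor Clustering Hierarchy (FINCH) clustering algorithm to group and count objects, giving training-free multi-class counting without extra inputs. Result: Achieves competitive results on the two common benchmarks FSC-147 and CARPK, and validates the method in multi-class settings with the proposed synthetic multi-class dataset and F1 score. Conclusion: OCCAM is the first training-free class-agnostic counting method that needs no auxiliary information; it counts objects of multiple arbitrary classes in an image and advances more general, practical counting techniques. Abstract: Class-Agnostic object Counting (CAC) involves counting instances of objects from arbitrary classes within an image. Due to its practical importance, CAC has received increasing attention in recent years. Most existing methods assume a single object class per image, rely on extensive training of large deep learning models and address the problem by incorporating additional information, such as visual exemplars or text prompts. In this paper, we present OCCAM, the first training-free approach to CAC that operates without the need of any supplementary information. Moreover, our approach addresses the multi-class variant of the problem, as it is capable of counting the object instances in each and every class among arbitrary object classes within an image. We leverage Segment Anything Model 2 (SAM2), a foundation model, and a custom threshold-based variant of the First Integer Neighbor Clustering Hierarchy (FINCH) algorithm to achieve competitive performance on widely used benchmark datasets, FSC-147 and CARPK. We propose a synthetic multi-class dataset and F1 score as a more suitable evaluation metric. The code for our method and the proposed synthetic dataset will be made publicly available at https://mikespanak.github.io/OCCAM_counter.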

[409] Revisiting Multi-Task Visual Representation Learning

Shangzhe Di,Zhonghua Zhai,Weidi Xie

Main category: cs.CV

TL;DR: MTV is a multi-task visual pretraining framework that combines vision-language contrastive, self-supervised, and dense spatial objectives, improving global semantic understanding and fine-grained spatial reasoning together.

Details Motivation: Visual representation learning is currently split between vision-language models, which lack spatial precision, and self-supervised methods, which lack high-level semantics, so a unified framework that merges the strengths of both is needed. Method: Proposes the MTV framework, which jointly optimizes vision-language contrastive, self-supervised, and dense spatial objectives over a shared backbone, and uses high-capacity expert models to generate pseudo-labels at scale as dense spatial supervision. Result: MTV performs well on multiple benchmarks, combining fine-grained spatial reasoning with global semantic understanding, and scales well across data and model sizes. Conclusion: Multi-task learning combined with high-quality pseudo-supervision is a scalable path toward more general visual encoders. Abstract: Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.

[410] OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3

Xu Zhang,Danyang Li,Yingjie Xia,Xiaohang Dong,Hualong Yu,Jianye Wang,Qicheng Li

Main category: cs.CV

TL;DR: This paper proposes OmniOVCD, a new open-vocabulary change detection framework. It builds the SFID strategy on the decoupled output heads of SAM 3, fusing and then decoupling semantic, instance, and presence information, which improves the accuracy and stability of change detection and reaches SOTA performance on four public datasets.

Details Motivation: Existing training-free open-vocabulary change detection methods depend on several models (such as CLIP and DINO), which makes feature matching difficult and the system unstable, and they rely heavily on predefined categories. A more stable, integrated solution is needed. Method: Proposes the OmniOVCD framework, which uses the decoupled output heads of SAM 3 in a Synergistic Fusion to Instance Decoupling (SFID) strategy: it first fuses the semantic, instance, and presence outputs into land-cover masks, then decomposes them into individual instance masks for change comparison. Result: Achieves leading performance on the four public benchmarks LEVIR-CD, WHU-CD, S2Looking, and SECOND, with class-average IoU of 67.2, 66.5, 24.5, and 27.1 respectively, surpassing all existing methods. Conclusion: By exploiting the multi-task outputs of SAM 3, OmniOVCD delivers efficient and stable open-vocabulary change detection without combining extra models, and has strong practical value and potential for wider use. Abstract: Change Detection (CD) is a fundamental task in remote sensing. It monitors the evolution of land cover over time. Based on this, Open-Vocabulary Change Detection (OVCD) introduces a new requirement. It aims to reduce the reliance on predefined categories. Existing training-free OVCD methods mostly use CLIP to identify categories. These methods also need extra models like DINO to extract features. However, combining different models often causes problems in matching features and makes the system unstable. Recently, the Segment Anything Model 3 (SAM 3) was introduced. It integrates segmentation and identification capabilities within one promptable model, which offers new possibilities for the OVCD task. In this paper, we propose OmniOVCD, a standalone framework designed for OVCD. By leveraging the decoupled output heads of SAM 3, we propose a Synergistic Fusion to Instance Decoupling (SFID) strategy. SFID first fuses the semantic, instance, and presence outputs of SAM 3 to construct land-cover masks, and then decomposes them into individual instance masks for change comparison. This design preserves high accuracy in category recognition and maintains instance-level consistency across images. As a result, the model can generate accurate change masks. Experiments on four public benchmarks (LEVIR-CD, WHU-CD, S2Looking, and SECOND) demonstrate SOTA performance, achieving IoU scores of 67.2, 66.5, 24.5, and 27.1 (class-average), respectively, surpassing all previous methods.
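
The IoU figures quoted above are class-averaged. As a reminder of what that metric computes, here is a minimal sketch of mask IoU and its class average. The helper names, the integer label-map input format, and the convention for empty masks are assumptions for illustration, not the benchmarks' official evaluation code.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection over union of two boolean masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return float(np.logical_and(pred, target).sum() / union)

def class_average_iou(pred_labels, target_labels, class_ids):
    """Mean of per-class IoU over the given class ids (integer label maps)."""
    pred_labels = np.asarray(pred_labels)
    target_labels = np.asarray(target_labels)
    return float(np.mean([mask_iou(pred_labels == c, target_labels == c)
                          for c in class_ids]))

# Toy 2x2 change maps with two change classes (1 and 2); 0 is "no change".
pred = [[1, 1], [2, 0]]
target = [[1, 0], [2, 2]]
score = class_average_iou(pred, target, class_ids=[1, 2])
```

Papers usually report this value multiplied by 100, which is the scale of the numbers in the abstract.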

[411] Towards Visually Explaining Statistical Tests with Applications in Biomedical Imaging

Masoumeh Javanbakhat,Piotr Komorowski,Dilyara Bareeva,Wei-Chang Lai,Wojciech Samek,Christoph Lippert

Main category: cs.CV

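TL;DR: Proposes an explainable deep statistical testing framework that adds sample-level and feature-level explanations to deep two-sample tests, revealing which samples and features drive group differences and providing spatial and instance-wise insight in biomedical image analysis.

Details Motivation: Deep neural two-sample tests have strong detection power, but their black-box nature limits interpretability and adoption in biomedical analysis; existing post-hoc explanation methods mostly depend on class labels and do not suit label-free statistical testing. Method: Builds an explainable deep statistical testing framework that introduces sample-level and feature-level explanation mechanisms to identify the individual samples and input features (such as image regions) that contribute most to the test result, giving spatial and instance-level explanations of the test decision. Result: On biomedical imaging data, the framework identifies influential samples and highlights anatomically meaningful regions associated with disease-related variation. Conclusion: The work bridges statistical inference and explainable AI, enabling interpretable, label-free population analysis in medical imaging and improving the transparency and practicality of deep statistical tests. Abstract: Deep neural two-sample tests have recently shown strong power for detecting distributional differences between groups, yet their black-box nature limits interpretability and practical adoption in biomedical analysis. Moreover, most existing post-hoc explainability methods rely on class labels, making them unsuitable for label-free statistical testing settings. We propose an explainable deep statistical testing framework that augments deep two-sample tests with sample-level and feature-level explanations, revealing which individual samples and which input features drive statistically significant group differences. Our method highlights which image regions and which individual samples contribute most to the detected group difference, providing spatial and instance-wise insight into the test's decision. Applied to biomedical imaging data, the proposed framework identifies influential samples and highlights anatomically meaningful regions associated with disease-related variation. This work bridges statistical inference and explainable AI, enabling interpretable, label-free population analysis in medical imaging.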

[412] On the Role of Rotation Equivariance in Monocular 3D Human Pose Estimation

Pavlo Melnyk,Cuong Le,Urs Waldmann,Per-Erik Forssén,Bastian Wandt

Main category: cs.CV

TL;DR: This paper proposes a monocular 3D human pose estimation approach based on 2D rotation equivariance. Equivariance is learned through data augmentation, which improves robustness and performance when the input image is rotated in-plane and outperforms existing equivariant-by-design methods.

Details Motivation: Existing 2D-to-3D lifting models perform poorly on rotated inputs, and directly learning a point-to-point mapping is not geometrically grounded. The authors aim to improve generalization and geometric consistency by introducing rotation equivariance. Method: Uses a data augmentation strategy to give the model equivariance to in-plane 2D rotations implicitly, instead of building equivariance into the model architecture; trains and evaluates on common HPE benchmarks. Result: Experiments show that the approach clearly improves performance under in-plane image rotations and outperforms state-of-the-art equivariant-by-design methods. Conclusion: Learning 2D rotation equivariance through data augmentation is a simpler, efficient, and effective strategy that improves the robustness and accuracy of monocular 3D human pose estimation models. Abstract: Estimating 3D from 2D is one of the central tasks in computer vision. In this work, we consider the monocular setting, i.e. single-view input, for 3D human pose estimation (HPE). Here, the task is to predict a 3D point set of human skeletal joints from a single 2D input image. While by definition this is an ill-posed problem, recent work has presented methods that solve it with up to several-centimetre error. Typically, these methods employ a two-step approach, where the first step is to detect the 2D skeletal joints in the input image, followed by the step of 2D-to-3D lifting. We find that common lifting models fail when encountering a rotated input. We argue that learning a single human pose along with its in-plane rotations is considerably easier and more geometrically grounded than directly learning a point-to-point mapping. Furthermore, our intuition is that endowing the model with the notion of rotation equivariance without explicitly constraining its parameter space should lead to a more straightforward learning process than one with equivariance by design. Utilising the common HPE benchmarks, we confirm that the 2D rotation equivariance per se improves the model performance on human poses akin to rotations in the image plane, and can be efficiently and straightforwardly learned by augmentation, outperforming state-of-the-art equivariant-by-design methods.
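
Learning in-plane rotation equivariance by augmentation can be pictured as rotating each training pair consistently: the 2D joints are rotated in the image plane and the 3D target is rotated by the same angle about the camera's optical axis. The sketch below illustrates that idea under stated assumptions (root-centred coordinates, z as the optical axis, a hypothetical helper name `rotate_pose_pair`); it is not the authors' training code.

```python
import numpy as np

def rotate_pose_pair(joints_2d, joints_3d, theta):
    """Rotate 2D joints (J, 2) in the image plane and 3D joints (J, 3)
    about the z (optical) axis by the same angle theta, in radians."""
    c, s = np.cos(theta), np.sin(theta)
    rot2 = np.array([[c, -s],
                     [s,  c]])
    rot3 = np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])
    # Joints are stored as row vectors, so multiply by the transpose.
    return np.asarray(joints_2d) @ rot2.T, np.asarray(joints_3d) @ rot3.T

# Augment one (2D input, 3D target) pair with a random in-plane rotation.
rng = np.random.default_rng(0)
joints_2d = rng.normal(size=(17, 2))   # 17 joints is an assumed skeleton size
joints_3d = rng.normal(size=(17, 3))
aug_2d, aug_3d = rotate_pose_pair(joints_2d, joints_3d, rng.uniform(0, 2 * np.pi))
```

A lifting model `f` trained on such pairs is pushed toward satisfying `f(R2 x) ≈ Rz f(x)` without any constraint on its parameter space, which is the contrast with equivariant-by-design architectures drawn in the abstract.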

[413] TrackletGPT: A Language-like GPT Framework for White Matter Tract Segmentation

Anoushkrit Goel,Simroop Singh,Ankita Joshi,Ranjeet Ranjan Jha,Chirag Ahuja,Aditya Nigam,Arnav Bhavsar

Main category: cs.CV

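TL;DR: This paper proposes TrackletGPT, a new framework for white matter tract segmentation that reintroduces language-model-style sequential information to improve segmentation accuracy and outperforms existing methods on multiple datasets.

Details Motivation: White matter tracts vary across individuals and conditions yet share similar 3D structure, and traditional segmentation methods struggle to be both general and fine-grained, so a more robust, automated segmentation method is needed. Method: Proposes TrackletGPT, which feeds tract segments (tracklets) into a GPT architecture as language-like tokens and uses its sequence-modelling ability for tract segmentation, generalizing across datasets and encoding fine-grained sub-streamline segments. Result: On the TractoInferno and HCP datasets, TrackletGPT outperforms state-of-the-art methods on average DICE, Overlap, and Overreach, and also performs well in cross-dataset experiments. Conclusion: By introducing a sequential tracklet representation, TrackletGPT improves the accuracy and generalization of white matter tract segmentation and offers a strong automated tool for brain connectivity research. Abstract: White Matter Tract Segmentation is imperative for studying brain structural connectivity, neurological disorders and neurosurgery. This task remains complex, as tracts differ among themselves, across subjects and conditions, yet have similar 3D structure across hemispheres and subjects. To address these challenges, we propose TrackletGPT, a language-like GPT framework which reintroduces sequential information in tokens using tracklets. TrackletGPT generalises seamlessly across datasets, is fully automatic, and encodes granular sub-streamline segments, Tracklets, scaling and refining GPT models in Tractography Segmentation. Based on our experiments, TrackletGPT outperforms state-of-the-art methods on average DICE, Overlap and Overreach scores on TractoInferno and HCP datasets, even on inter-dataset experiments.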

[414] Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning

Hongbo Bai,Yujin Zhou,Yile Wu,Chi-Min Chan,Pengcheng Wen,Kunhao Pan,Sirui Han,Yike Guo

Main category: cs.CV

TL;DR: Proposes the Glance-or-Gaze (GoG) framework, which uses a selective gaze mechanism and complexity-adaptive reinforcement learning to improve large multimodal models on knowledge-intensive visual queries.

Details Motivation: Existing search-augmented methods rely on indiscriminate whole-image retrieval, which introduces excessive redundancy and noise, and they lack deep iterative reflection, so they struggle with complex visual queries involving long-tail entities or evolving information. Method: Proposes the Selective Gaze mechanism, which dynamically chooses between glancing at the global context and gazing into local regions to filter irrelevant information; trains in two stages, reflective behaviour alignment via supervised fine-tuning followed by complexity-adaptive reinforcement learning, to achieve active visual planning and iterative reasoning. Result: Reaches SOTA performance on six benchmarks; ablations confirm that both Selective Gaze and complexity-adaptive RL are necessary. Conclusion: GoG shifts from passive perception to active visual planning and improves the accuracy and efficiency of multimodal models on complex, knowledge-intensive visual tasks. Abstract: Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries involving long-tail entities or evolving information due to static parametric knowledge. Recent search-augmented approaches attempt to address this limitation, but existing methods rely on indiscriminate whole-image retrieval that introduces substantial visual redundancy and noise, and lack deep iterative reflection, limiting their effectiveness on complex visual queries. To overcome these challenges, we propose Glance-or-Gaze (GoG), a fully autonomous framework that shifts from passive perception to active visual planning. GoG introduces a Selective Gaze mechanism that dynamically chooses whether to glance at global context or gaze into high-value regions, filtering irrelevant information before retrieval. We design a dual-stage training strategy: Reflective GoG Behavior Alignment via supervised fine-tuning instills the fundamental GoG paradigm, while Complexity-Adaptive Reinforcement Learning further enhances the model's capability to handle complex queries through iterative reasoning. Experiments across six benchmarks demonstrate state-of-the-art performance. Ablation studies confirm that both Selective Gaze and complexity-adaptive RL are essential for effective visual search. We will release our data and models for further exploration soon.

[415] VTONGuard: Automatic Detection and Authentication of AI-Generated Virtual Try-On Content

Shengyi Wu,Yan Hong,Shengyao Chen,Zheng Wang,Xianbing Sun,Jiahui Zhan,Jun Lan,Jianfu Zhang

Main category: cs.CV

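TL;DR: This paper presents VTONGuard, a large-scale benchmark of more than 775,000 real and synthetic virtual try-on images for evaluating authenticity detection of AI-generated try-on content, and designs a multi-task framework with auxiliary segmentation to improve detection performance.

Details Motivation: With advances in generative AI, virtual try-on is widely used in e-commerce and digital entertainment, but the authenticity and potential misuse of generated content raise concerns, and reliable detection is urgently needed. Method: Builds the large, diverse VTONGuard dataset, covering different poses, backgrounds, and garment styles; on this dataset, systematically evaluates multiple detection paradigms under a unified protocol and proposes a multi-task detection framework with auxiliary segmentation to strengthen boundary-aware feature learning. Result: Experiments reveal the strengths and limits of existing detection methods, in particular weak cross-paradigm generalization; the proposed multi-task framework achieves the best overall performance on VTONGuard. Conclusion: VTONGuard offers a fair, comprehensive benchmark for detecting fake virtual try-on images, advances robust detection models, and supports the safe and responsible use of VTON technology. Abstract: With the rapid advancement of generative AI, virtual try-on (VTON) systems are becoming increasingly common in e-commerce and digital entertainment. However, the growing realism of AI-generated try-on content raises pressing concerns about authenticity and responsible use. To address this, we present VTONGuard, a large-scale benchmark dataset containing over 775,000 real and synthetic try-on images. The dataset covers diverse real-world conditions, including variations in pose, background, and garment styles, and provides both authentic and manipulated examples. Based on this benchmark, we conduct a systematic evaluation of multiple detection paradigms under unified training and testing protocols. Our results reveal each method's strengths and weaknesses and highlight the persistent challenge of cross-paradigm generalization. To further advance detection, we design a multi-task framework that integrates auxiliary segmentation to enhance boundary-aware feature learning, achieving the best overall performance on VTONGuard. We expect this benchmark to enable fair comparisons, facilitate the development of more robust detection models, and promote the safe and responsible deployment of VTON technologies in practice.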

[416] DExTeR: Weakly Semi-Supervised Object Detection with Class and Instance Experts for Medical Imaging

Adrien Meyer,Didier Mutter,Nicolas Padoy

Main category: cs.CV

TL;DR: This paper proposes DExTeR, a transformer-based point-to-box regression model for anatomical landmark detection in medical images. Point annotations cut annotation cost substantially, and the model reaches state-of-the-art performance on several medical datasets.

Details Motivation: Bounding-box annotation of medical images is expensive, and anatomical structures overlap, vary in size, and are hard to identify; existing weakly supervised methods struggle to infer accurate boxes, which limits the scalability of object detection. Method: Builds on the Point-DETR framework, introduces class-guided deformable attention, uses the CLICK-MoE module to decouple class and instance representations, and adopts a multi-point training strategy for robustness to annotation variability, yielding a Point-to-Box teacher model that generates pseudo-box labels. Result: DExTeR achieves the best detection performance on three datasets from different medical domains (endoscopy, chest X-rays, and endoscopic ultrasound), clearly ahead of existing weakly semi-supervised detection methods. Conclusion: DExTeR addresses the detection challenges caused by costly annotation and complex structure in medical images, lowers annotation cost through point annotations while keeping high localization accuracy, and shows good cross-domain generalization and clinical promise. Abstract: Detecting anatomical landmarks in medical imaging is essential for diagnosis and intervention guidance. However, object detection models rely on costly bounding box annotations, limiting scalability. Weakly Semi-Supervised Object Detection (WSSOD) with point annotations proposes annotating each instance with a single point, minimizing annotation time while preserving localization signals. A Point-to-Box teacher model, trained on a small box-labeled subset, converts these point annotations into pseudo-box labels to train a student detector. Yet, medical imagery presents unique challenges, including overlapping anatomy, variable object sizes, and elusive structures, which hinder accurate bounding box inference. To overcome these challenges, we introduce DExTeR (DETR with Experts), a transformer-based Point-to-Box regressor tailored for medical imaging. Built upon Point-DETR, DExTeR encodes single-point annotations as object queries, refining feature extraction with the proposed class-guided deformable attention, which guides attention sampling using point coordinates and class labels to capture class-specific characteristics. To improve discrimination in complex structures, it introduces CLICK-MoE (CLass, Instance, and Common Knowledge Mixture of Experts), decoupling class and instance representations to reduce confusion among adjacent or overlapping instances. Finally, we implement a multi-point training strategy which promotes prediction consistency across different point placements, improving robustness to annotation variability. DExTeR achieves state-of-the-art performance across three datasets spanning different medical domains (endoscopy, chest X-rays, and endoscopic ultrasound), highlighting its potential to reduce annotation costs while maintaining high detection accuracy.

[417] STEC: A Reference-Free Spatio-Temporal Entropy Coverage Metric for Evaluating Sampled Video Frames

Shih-Yao Lin

Main category: cs.CV

TL;DR: Proposes Spatio-Temporal Entropy Coverage (STEC), a reference-free metric for evaluating video frame sampling quality that combines spatial information, temporal dispersion, and non-redundancy.

Details Motivation: Existing evaluation metrics focus on perceptual quality or reconstruction fidelity and cannot tell whether sampled frames adequately represent the video content, so a task-agnostic diagnostic tool is needed to assess frame sampling. Method: Builds on Spatio-Temporal Frame Entropy (STFE), which measures per-frame spatial information via entropy-based structural complexity, and combines it with temporal coverage and redundancy analysis to form the STEC metric. Result: Experiments on MSR-VTT test-1k show that STEC clearly separates random, uniform, and content-aware sampling strategies and reveals per-video robustness patterns that average performance alone does not capture. Conclusion: STEC is a lightweight, principled, general-purpose evaluation tool for analysing frame sampling under constrained budgets; it is not intended to predict downstream task performance. Abstract: Frame sampling is a fundamental component in video understanding and video--language model pipelines, yet evaluating the quality of sampled frames remains challenging. Existing evaluation metrics primarily focus on perceptual quality or reconstruction fidelity, and are not designed to assess whether a set of sampled frames adequately captures informative and representative video content. We propose Spatio-Temporal Entropy Coverage (STEC), a simple and non-reference metric for evaluating the effectiveness of video frame sampling. STEC builds upon Spatio-Temporal Frame Entropy (STFE), which measures per-frame spatial information via entropy-based structural complexity, and evaluates sampled frames based on their temporal coverage and redundancy. By jointly modeling spatial information strength, temporal dispersion, and non-redundancy, STEC provides a principled and lightweight measure of sampling quality. Experiments on the MSR-VTT test-1k benchmark demonstrate that STEC clearly differentiates common sampling strategies, including random, uniform, and content-aware methods. We further show that STEC reveals robustness patterns across individual videos that are not captured by average performance alone, highlighting its practical value as a general-purpose evaluation tool for efficient video understanding. We emphasize that STEC is not designed to predict downstream task accuracy, but to provide a task-agnostic diagnostic signal for analyzing frame sampling behavior under constrained budgets.

[418] Harmonizing the Deep: A Unified Information Pipeline for Robust Marine Biodiversity Assessment Across Heterogeneous Domains

Marco Piccolo,Qiwei Han,Astrid van Toor,Joachim Vanneste

Main category: cs.CV

TL;DR: This study proposes a Unified Information Pipeline to make cross-domain underwater invasive-species detection more reliable, finds that scene-structure factors affect performance more than visual degradation, and validates real-time feasibility on low-cost edge devices.

Details Motivation: Existing underwater detection methods lose considerable performance when deployed in new environments and lack the cross-region stability and scalability needed for long-term marine monitoring. Method: Builds a unified information pipeline that standardizes heterogeneous datasets, evaluates a fixed detector under controlled cross-domain protocols, and benchmarks inference on low-power edge hardware. Result: Scene structure (such as sparsity and object density) affects cross-domain performance more than visual degradation such as turbidity, and sparse scenes trigger a "Context Collapse" failure mode; edge devices reach practical sampling rates. Conclusion: The emphasis should shift from image enhancement to structure-aware reliability; the proposed approach provides a generalizable, low-cost monitoring tool for marine ecosystems. Abstract: Marine biodiversity monitoring requires scalability and reliability across complex underwater environments to support conservation and invasive-species management. Yet existing detection solutions often exhibit a pronounced deployment gap, with performance degrading sharply when transferred to new sites. This work establishes the foundational detection layer for a multi-year invasive species monitoring initiative targeting Arctic and Atlantic marine ecosystems. We address this challenge by developing a Unified Information Pipeline that standardises heterogeneous datasets into a comparable information flow and evaluates a fixed, deployment-relevant detector under controlled cross-domain protocols. Across multiple domains, we find that structural factors, such as scene composition, object density, and contextual redundancy, explain cross-domain performance loss more strongly than visual degradation such as turbidity, with sparse scenes inducing a characteristic "Context Collapse" failure mode. We further validate operational feasibility by benchmarking inference on low-cost edge hardware, showing that runtime optimisation enables practical sampling rates for remote monitoring. The results shift emphasis from image enhancement toward structure-aware reliability, providing a democratised tool for consistent marine ecosystem assessment.

[419] FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Jing Zuo,Lingzhou Mu,Fan Jiang,Chengcheng Ma,Mu Xu,Yonggang Qi

Main category: cs.CV

TL;DR: This paper proposes FantasyVLN, an implicit chain-of-thought reasoning framework that generates no explicit reasoning tokens. By compressing imagined visual tokens into a compact latent space, it keeps the benefits of CoT reasoning while sharply reducing inference latency, giving real-time vision-language navigation with high success rates.

Details Motivation: Existing chain-of-thought (CoT) methods for vision-language navigation have two major flaws: purely textual CoT lacks spatial grounding and overfits easily to sparse annotations, while multimodal CoT suffers severe token inflation from generating imagined visual observations, which makes real-time navigation impractical. Method: Proposes the FantasyVLN framework: a pretrained Visual AutoRegressor (VAR) encodes imagined visual tokens into a compact latent representation; training jointly learns textual, visual, and multimodal CoT modes under a unified multi-CoT strategy; at inference the model maps instructions directly to actions while retaining reasoning-aware internal representations. Result: On the LH-VLN benchmark, FantasyVLN improves success rate and navigation efficiency while reducing inference latency by an order of magnitude compared with explicit CoT methods, combining reasoning awareness with real-time operation. Conclusion: Implicit latent-space modelling overcomes the token-overhead bottleneck of explicit multimodal CoT and offers a new paradigm for efficient, interpretable, human-like navigation agents. Abstract: Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.

[420] Equivariant Learning for Unsupervised Image Dehazing

Zhang Wen,Jiangwei Xie,Dongdong Chen

Main category: cs.CV

TL;DR: Proposes EID, a new unsupervised image dehazing framework that exploits the symmetry of image signals and enforces haze consistency and equivariance to recover clear images directly from hazy ones.

Details Motivation: Existing dehazing methods depend on carefully crafted priors or large amounts of haze-free ground truth, which are costly to obtain and especially impractical in scientific imaging. Method: Proposes the Equivariant Image Dehazing (EID) framework, which combines haze consistency with system equivariance and uses an adversarial learning strategy to model unknown haze physics. Result: Significantly outperforms state-of-the-art methods on cell microscopy, medical endoscopy, and natural image dehazing benchmarks. Conclusion: By unifying equivariant learning with haze-physics modelling, EID is expected to enable more versatile and effective dehazing in scientific imaging. Abstract: Image Dehazing (ID) aims to produce a clear image from an observation contaminated by haze. Current ID methods typically rely on carefully crafted priors or extensive haze-free ground truth, both of which are expensive or impractical to acquire, particularly in the context of scientific imaging. We propose a new unsupervised learning framework called Equivariant Image Dehazing (EID) that exploits the symmetry of image signals to restore clarity to hazy observations. By enforcing haze consistency and systematic equivariance, EID can recover clear patterns directly from raw, hazy images. Additionally, we propose an adversarial learning strategy to model unknown haze physics and facilitate EID learning. Experiments on two scientific image dehazing benchmarks (including cell microscopy and medical endoscopy) and on natural image dehazing have demonstrated that EID significantly outperforms state-of-the-art approaches. By unifying equivariant learning with modelling haze physics, we hope that EID will enable more versatile and effective haze removal in scientific imaging. Code and datasets will be published.

[421] Likelihood-Separable Diffusion Inference for Multi-Image MRI Super-Resolution

Samuel W. Remedios,Zhangxing Bian,Shuwen Wei,Aaron Carass,Jerry L. Prince,Blake E. Dewey

Main category: cs.CV

TL;DR: This paper proposes a diffusion-based multi-image super-resolution (MISR) method for anisotropic MRI reconstruction. The DPS likelihood correction gives a gradient that decomposes across independent measurements, with no model changes or extra compute; the method clearly outperforms single-image super-resolution under 4×/8×/16× degradations and recovers near-isotropic anatomy from routine 2D multi-slice scans.

Details Motivation: Existing diffusion models are mainly used for single-image inverse problems and do not handle cases such as MRI, where several complementary low-resolution measurements are acquired, so an extension to multi-image super-resolution is needed. Method: Generalizes DPS, DMAP, DPPS, and diffusion-based PnP/ADMM, using the DPS likelihood correction to obtain a separable gradient decomposition across independent measurements, which enables MISR without building a joint operator or modifying the diffusion model. Result: Under 4×/8×/16× anisotropic degradations, the method clearly outperforms single-image super-resolution, achieves state-of-the-art super-resolution of anisotropic MRI volumes, and reconstructs near-isotropic anatomy from routine 2D multi-slice scans. Conclusion: The method extends diffusion models to multi-image inverse problems and offers a practical route to high-quality isotropic reconstruction from the 2D MRI sequences common in clinical practice. Abstract: Diffusion models are the current state-of-the-art for solving inverse problems in imaging. Their impressive generative capability allows them to approximate sampling from a prior distribution, which alongside a known likelihood function permits posterior sampling without retraining the model. While recent methods have made strides in advancing the accuracy of posterior sampling, the majority focuses on single-image inverse problems. However, for modalities such as magnetic resonance imaging (MRI), it is common to acquire multiple complementary measurements, each low-resolution along a different axis. In this work, we generalize common diffusion-based inverse single-image problem solvers for multi-image super-resolution (MISR) MRI. We show that the DPS likelihood correction allows an exactly-separable gradient decomposition across independently acquired measurements, enabling MISR without constructing a joint operator, modifying the diffusion model, or increasing network function evaluations. We derive MISR versions of DPS, DMAP, DPPS, and diffusion-based PnP/ADMM, and demonstrate substantial gains over SISR across $4\times/8\times/16\times$ anisotropic degradations. Our results achieve state-of-the-art super-resolution of anisotropic MRI volumes and, critically, enable reconstruction of near-isotropic anatomy from routine 2D multi-slice acquisitions, which are otherwise highly degraded in orthogonal views.
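
The separability claim can be illustrated in the simplest setting: with independently acquired linear measurements and Gaussian noise, the negative log-likelihood is a sum over measurements, so its gradient is the sum of per-measurement gradients and no stacked joint operator is needed. The sketch below checks this numerically on small random matrices. It is an assumption-laden toy (linear operators `A_k`, known noise levels, gradient taken with respect to the image estimate directly), not the paper's DPS implementation, where the gradient also passes through the denoiser.

```python
import numpy as np

def grad_single(A, y, x, sigma):
    """Gradient of ||y - A x||^2 / (2 sigma^2) with respect to x."""
    return A.T @ (A @ x - y) / sigma**2

def grad_separable(ops, ys, x, sigmas):
    """Sum of per-measurement gradients; each measurement is handled on its own."""
    return sum(grad_single(A, y, x, s) for A, y, s in zip(ops, ys, sigmas))

def grad_joint(ops, ys, x, sigmas):
    """Same gradient computed from an explicitly stacked, noise-whitened operator."""
    A = np.vstack([A_k / s for A_k, s in zip(ops, sigmas)])
    y = np.concatenate([y_k / s for y_k, s in zip(ys, sigmas)])
    return A.T @ (A @ x - y)

# Two hypothetical low-resolution acquisitions of the same 5-element signal.
rng = np.random.default_rng(0)
ops = [rng.normal(size=(3, 5)), rng.normal(size=(4, 5))]
x_true = rng.normal(size=5)
sigmas = [0.5, 2.0]
ys = [A @ x_true + s * rng.normal(size=A.shape[0]) for A, s in zip(ops, sigmas)]
x = rng.normal(size=5)  # current estimate
g_sep = grad_separable(ops, ys, x, sigmas)
g_joint = grad_joint(ops, ys, x, sigmas)
```

Here `g_sep` and `g_joint` agree up to floating-point error, which is the property that lets each acquisition contribute its own likelihood term.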

[422] Human detectors are surprisingly powerful reward models

Kumar Ashutosh,XuDong Wang,Xi Yin,Kristen Grauman,Adam Polyak,Ishan Misra,Rohit Girdhar

Main category: cs.CV

TL;DR: Proposes HuDA, a simple but effective reward model for improving the realism of human motion in video generation. It needs no additional training, outperforms specially fine-tuned models, and clearly improves the generation of complex actions.

Details Motivation: Existing video generation models handle complex, non-rigid human motion poorly, often producing missing limbs and distorted poses, and there is no effective way to quantify and improve the realism of generated motion. Method: Proposes the HuDA reward model, which combines off-the-shelf human detection confidence (appearance quality) with a temporal prompt alignment score (motion realism); it requires no training and is used directly for GRPO post-training of video generation models. Result: Without any fine-tuning, HuDA outperforms specialized models trained on annotated data; GRPO post-training with HuDA markedly improves dynamic human motion generation, achieving a 73% win rate against SOTA models such as Wan 2.1, and generalizes to animal videos and human-object interaction. Conclusion: HuDA is a simple, plug-and-play reward function that improves the realism and quality of video generation for complex human motion and related scenarios, with broad potential for use. Abstract: Video generation models have recently achieved impressive visual fidelity and temporal coherence. Yet, they continue to struggle with complex, non-rigid motions, especially when synthesizing humans performing dynamic actions such as sports, dance, etc. Generated videos often exhibit missing or extra limbs, distorted poses, or physically implausible actions. In this work, we propose a remarkably simple reward model, HuDA, to quantify and improve the human motion in generated videos. HuDA integrates human detection confidence for appearance quality, and a temporal prompt alignment score to capture motion realism. We show this simple reward function that leverages off-the-shelf models without any additional training, outperforms specialized models finetuned with manually annotated data. Using HuDA for Group Reward Policy Optimization (GRPO) post-training of video models, we significantly enhance video generation, especially when generating complex human motions, outperforming state-of-the-art models like Wan 2.1, with win-rate of 73%. Finally, we demonstrate that HuDA improves generation quality beyond just humans, for instance, significantly improving generation of animal videos and human-object interactions.
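
To show where a scalar reward such as HuDA enters GRPO-style post-training, here is a minimal sketch of the group-normalized advantage commonly used in that family of methods: several samples are generated for one prompt, each is scored, and every reward is standardized against its own group. The function name, the `eps` constant, and the example reward values are assumptions; how HuDA combines its two scores is not specified in the abstract and is not modelled here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical reward-model scores for four videos generated from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group mean receive positive advantages and are reinforced, while those below are discouraged, so only the relative ordering and spread within a group matter.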

[423] Correcting and Quantifying Systematic Errors in 3D Box Annotations for Autonomous Driving

Alexandre Justo Miro,Ludvig af Klinteberg,Bogdan Timus,Aron Asefaw,Ajinkya Khoche,Thomas Gustafsson,Sina Sharif Mansouri,Masoud Daneshtalab

Main category: cs.CV

TL;DR: This paper is the first to identify and correct systematic errors in the 3D box annotations of widely used autonomous driving datasets. It proposes an offline estimation method that achieves spatio-temporal consistency and quantifies how annotation errors affect performance evaluation.

Details Motivation: In dynamic scenes, 3D box annotations in existing autonomous driving datasets contain systematic errors caused by sensor scan timing, which harms the accuracy of model training and performance evaluation. Method: Proposes a novel offline estimation method that optimizes annotated boxes to follow physically feasible trajectories and remain spatially and temporally consistent with the sensor data; also defines new metrics for this problem. Result: Validated on Argoverse 2, MAN TruckScenes, and a proprietary dataset, with annotation quality improved by more than 17%; original annotations are misplaced by up to 2.5 m, with highly dynamic objects most affected; the effect of these errors on benchmarking exceeds the typical gains of SOTA methods. Conclusion: Accurate 3D annotations are essential for training and evaluating autonomous driving systems; the method substantially improves annotation quality and exposes annotation bias that has been overlooked in existing datasets. Abstract: Accurate ground truth annotations are critical to supervised learning and evaluating the performance of autonomous vehicle systems. These vehicles are typically equipped with active sensors, such as LiDAR, which scan the environment in predefined patterns. 3D box annotation based on data from such sensors is challenging in dynamic scenarios, where objects are observed at different timestamps, hence different positions. Without proper handling of this phenomenon, systematic errors are prone to being introduced in the box annotations. Our work is the first to discover such annotation errors in widely used, publicly available datasets. Through our novel offline estimation method, we correct the annotations so that they follow physically feasible trajectories and achieve spatial and temporal consistency with the sensor data. For the first time, we define metrics for this problem; and we evaluate our method on the Argoverse 2, MAN TruckScenes, and our proprietary datasets. Our approach increases the quality of box annotations by more than 17% in these datasets. Furthermore, we quantify the annotation errors in them and find that the original annotations are misplaced by up to 2.5 m, with highly dynamic objects being the most affected. Finally, we test the impact of the errors in benchmarking and find that the impact is larger than the improvements that state-of-the-art methods typically achieve with respect to the previous state-of-the-art methods; showing that accurate annotations are essential for correct interpretation of performance. Our code is available at https://github.com/alexandre-justo-miro/annotation-correction-3D-boxes.

[424] Generalizing Abstention for Noise-Robust Learning in Medical Image Segmentation

Wesam Moustafa,Hossam Elsafty,Helen Schneider,Lorenz Sparrenberg,Rafet Sifa

Main category: cs.CV

TL;DR: This paper proposes a universal, modular abstention framework that adds an informed regularization term and a power-law-based auto-tuning algorithm to make segmentation loss functions more robust to label noise, and validates it on multiple medical image datasets.

Details Motivation: Label noise is a critical problem in medical image segmentation. Methods for handling noisy labels remain under-explored, and the abstention mechanism, although effective in classification, has not been verified for segmentation. Method: A universal abstention framework with an informed regularization term and a power-law-based auto-tuning algorithm is combined with three different loss functions to build three new noise-robust variants: GAC, SAC, and ADS. Result: Experiments on the CaDIS and DSAD medical datasets show that the proposed methods significantly outperform their non-abstaining baselines at high noise levels. Conclusion: Letting a model selectively ignore corrupted samples is a powerful and generalizable strategy for building more reliable medical image segmentation models. Abstract: Label noise is a critical problem in medical image segmentation, often arising from the inherent difficulty of manual annotation. Models trained on noisy data are prone to overfitting, which degrades their generalization performance. While a number of methods and strategies have been proposed to mitigate noisy labels in the segmentation domain, this area remains largely under-explored. The abstention mechanism has proven effective in classification tasks by enhancing the capabilities of Cross Entropy, yet its potential in segmentation remains unverified. In this paper, we address this gap by introducing a universal and modular abstention framework capable of enhancing the noise-robustness of a diverse range of loss functions. Our framework improves upon prior work with two key components: an informed regularization term to guide abstention behaviour, and a more flexible power-law-based auto-tuning algorithm for the abstention penalty. We demonstrate the framework's versatility by systematically integrating it with three distinct loss functions to create three novel, noise-robust variants: GAC, SAC, and ADS. Experiments on the CaDIS and DSAD medical datasets show our methods consistently and significantly outperform their non-abstaining baselines, especially under high noise levels. This work establishes that enabling models to selectively ignore corrupted samples is a powerful and generalizable strategy for building more reliable segmentation models. Our code is publicly available at https://github.com/wemous/abstention-for-segmentation.
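To make the abstention idea concrete, the sketch below computes a per-pixel abstention loss in the style used for classification in the noisy-label literature: the network gets one extra "abstain" output, the cross entropy over the real classes is renormalized and scaled by the non-abstained probability mass, and a penalty weighted by `alpha` discourages abstaining everywhere. This is an assumed, generic formulation for illustration only. The abstract does not give the exact form of GAC, SAC, or ADS, of the informed regularizer, or of the power-law schedule, and the names `abstention_loss` and `alpha` are hypothetical.

```python
import math


def abstention_loss(probs, target, alpha):
    """Per-pixel abstention loss (illustrative, assumed formulation).

    probs  : softmax output over k real classes plus one final abstain slot
    target : index of the (possibly noisy) label among the k real classes
    alpha  : abstention penalty; larger values make abstaining more costly
    """
    p_abstain = probs[-1]
    keep = 1.0 - p_abstain  # probability mass left on the real classes
    # Cross entropy on class probabilities renormalized over the real classes.
    renormalized_ce = -math.log(probs[target] / keep)
    # Penalty term: zero when the model never abstains, grows with abstention.
    penalty = alpha * math.log(1.0 / keep)
    return keep * renormalized_ce + penalty


# A pixel that does not abstain: the loss reduces to ordinary cross entropy.
confident = abstention_loss([0.7, 0.2, 0.1, 0.0], target=0, alpha=1.0)

# A pixel whose label disagrees with the prediction (a likely noisy label).
no_abstain = abstention_loss([0.05, 0.90, 0.05, 0.0], target=0, alpha=0.5)
abstain = abstention_loss([0.02, 0.36, 0.02, 0.60], target=0, alpha=0.5)
abstain_costly = abstention_loss([0.02, 0.36, 0.02, 0.60], target=0, alpha=5.0)
```

With a small penalty, abstaining on the suspicious pixel lowers its loss relative to not abstaining; with a large penalty it raises it. That trade-off is why the penalty needs tuning over the course of training, which is the role the abstract assigns to its auto-tuning algorithm.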

[425] Federated Balanced Learning

Jiaze Li,Haoran Xu,Wanyi Wu,Changwei Wang,Shuaiguang Li,Jianzhong Ju,Zhenbo Luo,Jian Luan,Youyang Qu,Longxiang Gao,Xudong Yang,Lumin Xing

Main category: cs.CV

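TL;DR: This paper proposes Federated Balanced Learning (FBL), which mitigates client drift in federated learning under non-iid data by balancing samples on the client side, using edge-side generative models for knowledge filling and knowledge sampling. Knowledge Alignment and Knowledge Drop strategies improve generalization, and FBL outperforms existing methods across multiple experiments.

Details Motivation: Under non-iid data distributions, the global model in federated learning is prone to client drift. Previous methods mostly correct an already-deviated model through the loss or gradient and overlook the root cause, which is the imbalanced sample distribution on each client. Method: The FBL framework (1) uses edge-side generative models on the client for knowledge filling and knowledge sampling, reaching sample balance under a fixed number of samples; (2) applies a Knowledge Alignment Strategy to bridge the gap between synthetic and real data; (3) adds a Knowledge Drop Strategy for regularization; and (4) lets heterogeneous clients adopt different methods and supports extensions of the framework. Result: On multiple benchmark datasets and in complex scenarios, FBL significantly outperforms current state-of-the-art federated learning methods, supporting its effectiveness and scalability. Conclusion: Preventing client drift by balancing client samples addresses the problem at its source; through edge-side generative modeling and complementary strategies, FBL improves robustness and performance under non-iid data while preserving privacy and communication efficiency. Abstract: Federated learning is a paradigm of joint learning in which clients collaborate by sharing model parameters instead of data. However, in the non-iid setting, the global model experiences client drift, which can seriously affect the final performance of the model. Previous methods tend to correct the global model that has already deviated based on the loss function or gradient, overlooking the impact of the client samples. In this paper, we rethink the role of the client side and propose Federated Balanced Learning, i.e., FBL, to prevent this issue from the beginning through sample balance on the client side. Technically, FBL allows unbalanced data on the client side to achieve sample balance through knowledge filling and knowledge sampling using edge-side generation models, under the limitation of a fixed number of data samples on clients. Furthermore, we design a Knowledge Alignment Strategy to bridge the gap between synthetic and real data, and a Knowledge Drop Strategy to regularize our method. Meanwhile, we scale our method to real and complex scenarios, allowing different clients to adopt various methods, and extend our framework to further improve performance. Numerous experiments show that our method outperforms state-of-the-art baselines. The code will be released upon acceptance.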

[426] Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology

Kaiyu Wu,Pucheng Han,Hualong Zhang,Naigeng Wu,Keze Wang

Main category: cs.CV

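TL;DR: This paper introduces WeatherQA, a multimodal reasoning benchmark for meteorology, and LoCo-RFT, a logically consistent reinforcement fine-tuning method, and uses them to build Weather-R1, the first meteorological reasoning vision-language model with logical faithfulness, which substantially improves reasoning accuracy.

Details Motivation: Mainstream reinforcement fine-tuning in meteorology can induce Self-Contradictory Reasoning (Self-Contra), whereas this high-stakes domain requires the reasoning to agree strictly with the conclusion. Two challenges result: a domain gap and a reasoning faithfulness gap. Method: The authors build WeatherQA, a multimodal reasoning benchmark for meteorology; propose Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT), which adds a logical consistency reward to resolve Self-Contra; and train Weather-R1, a meteorology-specific reasoning VLM, on this basis. Result: Weather-R1 improves on the baseline by 9.8 percentage points on WeatherQA, outperforms Supervised Fine-Tuning and standard RFT, and even surpasses the original Qwen2.5-VL-32B. Conclusion: A logical consistency reward effectively improves the reasoning faithfulness of VLMs in high-stakes specialist domains, and WeatherQA and Weather-R1 provide a new benchmark and a new approach for meteorological AI reasoning. Abstract: While Vision Language Models (VLMs) show advancing reasoning capabilities, their application in meteorology is constrained by a domain gap and a reasoning faithfulness gap. Specifically, mainstream Reinforcement Fine-Tuning (RFT) can induce Self-Contradictory Reasoning (Self-Contra), where the model's reasoning contradicts its final answer, which is unacceptable in such a high-stakes domain. To address these challenges, we construct WeatherQA, a novel multimodal reasoning benchmark in meteorology. We also propose Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT), which resolves Self-Contra by introducing a logical consistency reward. Furthermore, we introduce Weather-R1, the first reasoning VLM with logical faithfulness in meteorology, to the best of our knowledge. Experiments demonstrate that Weather-R1 improves performance on WeatherQA by 9.8 percentage points over the baseline, outperforming Supervised Fine-Tuning and RFT, and even surpassing the original Qwen2.5-VL-32B. These results highlight the effectiveness of our LoCo-RFT and the superiority of Weather-R1. Our benchmark and code are available at https://github.com/Marcowky/Weather-R1.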

[427] Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model

Haoran Xu,Yanlin Liu,Zizhao Tong,Jiaze Li,Kexue Fu,Yuyang Zhang,Longxiang Gao,Shuaiguang Li,Xingyu Li,Yanran Xu,Changwei Wang

Main category: cs.CV

TL;DR: This paper proposes MM-OOD, a new multimodal OOD detection method that uses the multi-round conversation and multimodal reasoning abilities of MLLMs and achieves significant improvements on both near OOD and far OOD detection tasks.

Details Motivation: Existing zero-shot OOD detection methods over-rely on knowledge in the text space and neglect the inherent difficulty of detecting outliers in the image space. Method: The MM-OOD framework handles near OOD tasks by feeding ID images and text prompts directly into an MLLM to identify outliers. For far OOD tasks it uses a sketch-generate-elaborate framework: sketch outlier exposure with text prompts, generate the corresponding visual OOD samples, and elaborate with multimodal prompts. Result: Performance improves significantly on widely used multimodal datasets such as Food-101, and scalability is validated on ImageNet-1K. Conclusion: MM-OOD combines the multimodal reasoning and multi-round conversation abilities of MLLMs to strengthen OOD detection in the image space, for both near OOD and far OOD scenarios. Abstract: Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.

[428] Decoder-Free Supervoxel GNN for Accurate Brain-Tumor Localization in Multi-Modal MRI

Andrea Protani,Marc Molina Van Den Bosch,Lorenzo Giusti,Heloisa Barbosa Da Silva,Paolo Cacace,Albert Sund Aillet,Miguel Angel Gonzalez Ballester,Friedhelm Hummel,Luigi Serio

Main category: cs.CV

TL;DR: This paper proposes SVGFormer, a decoder-free framework for 3D medical image analysis that partitions a volume into a semantic graph of supervoxels and learns hierarchical features with a Transformer and a Graph Attention Network, achieving strong performance and dual-scale explainability.

Details Motivation: Conventional 3D medical imaging models rely on heavy encoder-decoder structures for spatial reconstruction, which use parameters inefficiently and lack interpretability. The paper aims for a more efficient paradigm that focuses on feature learning and is inherently interpretable. Method: SVGFormer first partitions the 3D volume into a semantic graph of supervoxels through content-aware grouping. It then encodes hierarchically with a patch-level Transformer and a supervoxel-level Graph Attention Network, jointly modeling intra-region and inter-region dependencies. Result: Two specialized models trained on the BraTS dataset perform well: the classification model reaches an F1-score of 0.875 and the regression model an MAE of 0.028, supporting the effectiveness of the learned features and their localization ability. Conclusion: A graph-based, encoder-only paradigm is an accurate and inherently interpretable alternative for 3D medical image representation. Abstract: Modern vision backbones for 3D medical imaging typically process dense voxel grids through parameter-heavy encoder-decoder structures, a design that allocates a significant portion of its parameters to spatial reconstruction rather than feature learning. Our approach introduces SVGFormer, a decoder-free pipeline built upon a content-aware grouping stage that partitions the volume into a semantic graph of supervoxels. Its hierarchical encoder learns rich node representations by combining a patch-level Transformer with a supervoxel-level Graph Attention Network, jointly modeling fine-grained intra-region features and broader inter-regional dependencies. This design concentrates all learnable capacity on feature encoding and provides inherent, dual-scale explainability from the patch to the region level. To validate the framework's flexibility, we trained two specialized models on the BraTS dataset: one for node-level classification and one for tumor proportion regression. Both models achieved strong performance, with the classification model achieving an F1-score of 0.875 and the regression model an MAE of 0.028, confirming the encoder's ability to learn discriminative and localized features. Our results establish that a graph-based, encoder-only paradigm offers an accurate and inherently interpretable alternative for 3D medical image representation.

[429] POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion

Andrea Rigo,Luca Stornaiuolo,Weijie Wang,Mauro Martino,Bruno Lepri,Nicu Sebe

Main category: cs.CV

TL;DR: This paper proposes POCI-Diff, a diffusion framework that combines 3D geometric constraints with instance-level semantic binding to give precise, interactive 3D layout control and editing in text-to-image generation. It avoids the distortion caused by conventional warping, supports object insertion, removal, and transformation, and uses IP-Adapter to keep objects consistent across edits.

Details Motivation: Existing methods rely on 2D cues or iterative copy-warp-paste strategies, which tend to distort object geometry and struggle to keep edits consistent. Method: POCI-Diff (1) binds text descriptions to 3D bounding boxes through Blended Latent Diffusion for per-object semantic control; (2) uses a warping-free, regeneration-based editing pipeline; and (3) conditions the diffusion process on reference images via IP-Adapter to preserve object identity and global consistency. Result: Experiments show that POCI-Diff outperforms state-of-the-art methods in visual quality and 3D layout fidelity, eliminates warping-induced geometric artifacts, and supports high-quality, consistent interactive 3D editing. Conclusion: POCI-Diff is presented as the first diffusion framework for T2I generation that jointly models 3D geometric constraints and instance-level semantic binding, and it markedly improves layout controllability, edit consistency, and generation quality. Abstract: We propose a diffusion-based approach for Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing. While prior methods improve spatial adherence using 2D cues or iterative copy-warp-paste strategies, they often distort object geometry and fail to preserve consistency across edits. To address these limitations, we introduce a framework for Positioning Objects Consistently and Interactively (POCI-Diff), a novel formulation for jointly enforcing 3D geometric constraints and instance-level semantic binding within a unified diffusion process. Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes through Blended Latent Diffusion, allowing one-shot synthesis of complex multi-object scenes. We further propose a warping-free generative editing pipeline that supports object insertion, removal, and transformation via regeneration rather than pixel deformation. To preserve object identity and consistency across edits, we condition the diffusion process on reference images using IP-Adapter, enabling coherent object appearance throughout interactive 3D editing while maintaining global scene coherence. Experimental results demonstrate that POCI-Diff produces high-quality images consistent with the specified 3D layouts and edits, outperforming state-of-the-art methods in both visual fidelity and layout adherence while eliminating warping-induced geometric artifacts.

[430] Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration

Yongcong Ye,Kai Zhang,Yanghai Zhang,Enhong Chen,Longfei Li,Jun Zhou

Main category: cs.CV

TL;DR: This paper proposes CVSI, a new fine-grained zero-shot composed image retrieval method that integrates multimodal information through complementary visual-semantic fusion and significantly outperforms existing methods on several public datasets.

Details Motivation: Existing ZS-CIR methods struggle to capture fine-grained changes and to fuse visual and semantic information effectively. They depend on image-to-text conversion or LLM-generated descriptions, which tend to lose visual detail and complete semantic context. Method: CVSI has three key components: (1) Visual Information Extraction, which extracts global features and generates a pseudo token that is combined with the modification text; (2) Semantic Information Extraction, which uses a pre-trained captioning model and an LLM to generate the original and modified captions; and (3) Complementary Information Retrieval, which integrates information from the query and the database images to retrieve the target image. Result: Experiments on three public datasets, CIRR, CIRCO, and FashionIQ, show that CVSI significantly outperforms existing state-of-the-art methods. Conclusion: By introducing a complementary visual-semantic integration mechanism, CVSI improves zero-shot composed image retrieval and shows good practicality and generalization. Abstract: Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications, allowing users to retrieve a target image by providing a reference image and a relative caption describing the desired modifications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. They primarily rely on either transforming the multimodal query into a single text using image-to-text models or employing large language models for target image description generation, approaches that often fail to capture complementary visual information and complete semantic context. To address these limitations, we propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration (CVSI). Specifically, CVSI leverages three key components: (1) Visual Information Extraction, which not only extracts global image features but also uses a pre-trained mapping network to convert the image into a pseudo token, combining it with the modification text and the objects most likely to be added. (2) Semantic Information Extraction, which involves using a pre-trained captioning model to generate multiple captions for the reference image, followed by leveraging an LLM to generate the modified captions and the objects most likely to be added. (3) Complementary Information Retrieval, which integrates information extracted from both the query and database images to retrieve the target image, enabling the system to efficiently handle retrieval queries in a variety of situations.
Extensive experiments on three public datasets (CIRR, CIRCO, and FashionIQ) demonstrate that CVSI significantly outperforms existing state-of-the-art methods. Our code is available at https://github.com/yyc6631/CVSI.

[431] VERIDAH: Solving Enumeration Anomaly Aware Vertebra Labeling across Imaging Sequences

Hendrik Möller,Hanna Schoen,Robert Graf,Matan Atad,Nathan Molinier,Anjany Sekuboyina,Bettina K. Budai,Fabian Bamberg,Steffen Ringhof,Christopher Schlett,Tobias Pischon,Thoralf Niendorf,Josua A. Decker,Marc-André Weber,Bjoern Menze,Daniel Rueckert,Jan S. Kirschke

Main category: cs.CV

TL;DR: This paper proposes VERIDAH, a new vertebra labeling algorithm that automatically identifies and handles enumeration anomalies in the thoracic and lumbar spine. It significantly outperforms existing methods on both T2w and CT images and works with arbitrary fields of view.

Details Motivation: Identifying vertebral enumeration anomalies matters clinically for chronic back pain and operation planning, yet clinical reports often lack an accurate assessment of them, and no deep learning method labels such anomalies automatically. Method: "Vertebra Identification with Anomaly Handling" (VERIDAH) combines multiple classification heads with a weighted vertebra sequence prediction algorithm to label both normal vertebrae and enumeration anomalies automatically. Result: On T2w TSE sagittal images, all vertebrae were correctly labeled in 98.30% of subjects, versus 94.24% for existing models (p < 0.001); on CT images the figures are 99.18% versus 77.26%. VERIDAH correctly identified thoracic anomalies in 87.80% (T2w) and 96.30% (CT) of images, and lumbar anomalies in 94.48% (T2w) and 97.22% (CT). Conclusion: VERIDAH identifies vertebral enumeration anomalies effectively, performs well across imaging modalities and arbitrary fields of view, fills the gap in automatic labeling methods, and shows promise for clinical use. Abstract: The human spine commonly consists of seven cervical, twelve thoracic, and five lumbar vertebrae. However, enumeration anomalies may result in individuals having eleven or thirteen thoracic vertebrae and four or six lumbar vertebrae. Although the identification of enumeration anomalies has potential clinical implications for chronic back pain and operation planning, the thoracolumbar junction is often poorly assessed and rarely described in clinical reports. Additionally, even though multiple deep-learning-based vertebra labeling algorithms exist, there is a lack of methods to automatically label enumeration anomalies. Our work closes that gap by introducing "Vertebra Identification with Anomaly Handling" (VERIDAH), a novel vertebra labeling algorithm based on multiple classification heads combined with a weighted vertebra sequence prediction algorithm. We show that our approach surpasses existing models on T2w TSE sagittal (98.30% vs. 94.24% of subjects with all vertebrae correctly labeled, p < 0.001) and CT imaging (99.18% vs. 77.26% of subjects with all vertebrae correctly labeled, p < 0.001) and works in arbitrary field-of-view images. VERIDAH correctly labeled the presence of thoracic enumeration anomalies in 87.80% and 96.30% of T2w and CT images, respectively, and lumbar enumeration anomalies in 94.48% and 97.22% for T2w and CT, respectively. Our code and models are available at: https://github.com/Hendrik-code/spineps.

[432] Unsupervised Video Class-Incremental Learning via Deep Embedded Clustering Management

Nattapong Kurpukdee,Adrian G. Bors

Main category: cs.CV

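TL;DR: This paper proposes a simple yet effective approach to unsupervised video class-incremental learning (uVCIL). Using a deep feature extractor and progressively built deep clusters, it transfers knowledge without labels or task-boundary information and significantly outperforms baselines on several standard video action recognition datasets.

Details Motivation: Most existing class-incremental learning methods depend on labels and prior knowledge of task boundaries, which is costly and unrealistic, so a label-free, unsupervised video class-incremental learning method is needed. Method: A deep feature extractor network provides representative video features for each task, and deep clusters are built progressively from them. During successive task learning, the model updated on the previous task initializes the current task in order to transfer knowledge. Result: In-depth evaluation on three standard datasets, UCF101, HMDB51, and Something-to-Something V2, with the labels of the supervised setting ignored, shows that the approach significantly outperforms other baselines on all datasets. Conclusion: The method performs well in unsupervised video class-incremental learning, learning effectively and avoiding forgetting without labels or task-boundary information, and is practical and extensible. Abstract: Unsupervised video class incremental learning (uVCIL) represents an important learning paradigm for learning video information without forgetting, and without considering any data labels. Prior approaches have focused on supervised class-incremental learning, relying on using the knowledge of labels and task boundaries, which is costly, requires human annotation, or is simply not a realistic option. In this paper, we propose a simple yet effective approach to address uVCIL. We first consider a deep feature extractor network, providing a set of representative video features during each task without assuming any class or task information. We then progressively build a series of deep clusters from the extracted features. During the successive task learning, the model updated from the previous task is used as an initial state in order to transfer knowledge to the current learning task. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-to-Something V2, by ignoring the labels from the supervised setting. Our approach significantly outperforms other baselines on all datasets.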

[433] VENI: Variational Encoder for Natural Illumination

Paul Walker,James A. D. Gardner,Andreea Ardelean,William A. P. Smith,Bernhard Egger

Main category: cs.CV

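TL;DR: Proposes a rotation-equivariant variational autoencoder that models natural illumination on the sphere, preserves the SO(2)-equivariance of environment maps, and provides smoother latent-space interpolation and a better-structured latent space.

Details Motivation: Existing inverse rendering methods either disregard the spherical, rotation-equivariant nature of illumination environments or fail to provide a well-behaved latent space. Method: A rotation-equivariant variational autoencoder uses a novel Vector Neuron Vision Transformer (VN-ViT) as encoder and a rotation-equivariant conditional neural field as decoder, and introduces an SO(2)-equivariant fully connected layer to preserve equivariance. Result: The proposed SO(2)-equivariant fully connected layer outperforms standard Vector Neurons in the model, and compared with previous methods the variational autoencoder gives smoother latent-space interpolation and better latent-space behavior. Conclusion: The method preserves the rotation equivariance of illumination environments and improves latent-space quality and interpolation for inverse rendering. Abstract: Inverse rendering is an ill-posed problem, but priors, such as illumination priors, can simplify it. Existing work either disregards the spherical and rotation-equivariant nature of illumination environments or does not provide a well-behaved latent space. We propose a rotation-equivariant variational autoencoder that models natural illumination on the sphere without relying on 2D projections. To preserve the SO(2)-equivariance of environment maps, we use a novel Vector Neuron Vision Transformer (VN-ViT) as encoder and a rotation-equivariant conditional neural field as decoder. In the encoder, we reduce the equivariance from SO(3) to SO(2) using a novel SO(2)-equivariant fully connected layer, an extension of Vector Neurons. We show that our SO(2)-equivariant fully connected layer outperforms standard Vector Neurons when used in our SO(2)-equivariant model. Compared to previous methods, our variational autoencoder enables smoother interpolation in latent space and offers a more well-behaved latent space.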
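The equivariance property mentioned in the abstract can be checked numerically for the simplest building block. The sketch below uses a standard Vector Neuron style linear layer (a channel-mixing matrix applied to a stack of 3D vectors) and verifies that rotating about a fixed axis before or after the layer gives the same result. It is not the paper's SO(2)-equivariant fully connected layer or VN-ViT; the choice of z as the "up" axis, the helper names, and the toy weights are assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rot_z(theta):
    """Rotation about the z axis, i.e. an element of the SO(2) subgroup of SO(3)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def vn_linear(weights, features):
    """Vector Neuron linear layer: mixes channels, never touches the xyz axis."""
    return matmul(weights, features)


features = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]    # 2 channels, each a 3D vector
weights = [[0.3, -0.7], [1.2, 0.4], [-0.5, 0.9]]  # maps 2 channels to 3 channels
rotation = rot_z(0.8)

rotate_then_map = vn_linear(weights, matmul(features, rotation))
map_then_rotate = matmul(vn_linear(weights, features), rotation)

# Largest elementwise difference between the two orders of operation.
equivariance_gap = max(
    abs(x - y)
    for row_a, row_b in zip(rotate_then_map, map_then_rotate)
    for x, y in zip(row_a, row_b)
)
```

The gap is zero up to floating-point rounding because the layer multiplies on the channel side and the rotation multiplies on the coordinate side, so the two operations commute.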

[434] Two-Stream temporal transformer for video action classification

Nattapong Kurpukdee,Adrian G. Bors

Main category: cs.CV

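TL;DR: Proposes a new two-stream Transformer video classifier that extracts spatio-temporal information through self-attention over the joint optical flow and temporal frame domain, and reports excellent classification results on several action recognition datasets.

Details Motivation: Capturing motion information in video and improving video understanding tasks such as action recognition requires an effective motion representation. Method: A two-stream Transformer model processes content frames and optical flow separately, and models the self-attention relationships between optical flow and temporal frames within the Transformer encoder to capture spatio-temporal features. Result: Experiments on three well-known video datasets of human activities show excellent classification performance. Conclusion: The two-stream Transformer architecture fuses motion and content information effectively and improves video classification, supporting the potential of self-attention for spatio-temporal feature learning. Abstract: Motion representation plays an important role in video understanding and has many applications, including action recognition and robot and autonomous guidance, among others. Lately, transformer networks, through their self-attention mechanism capabilities, have proved their efficiency in many applications. In this study, we introduce a new two-stream transformer video classifier, which extracts spatio-temporal information from content and optical flow representing movement information. The proposed model identifies self-attention features across the joint optical flow and temporal frame domain and represents their relationships within the transformer encoder mechanism. The experimental results show that our proposed methodology provides excellent classification results on three well-known video datasets of human activities.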

[435] Curriculum-Based Strategies for Efficient Cross-Domain Action Recognition

Emily Kim,Allen Wu,Jessica Hodgins

Main category: cs.CV

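TL;DR: This paper studies curriculum-based training strategies for cross-view action recognition, using synthetic aerial-view data and real ground-view data to improve generalization to unseen aerial-view data. Experiments show that combining the two out-of-domain sources through a curriculum markedly improves training efficiency while matching the performance of simple dataset combination.

Details Motivation: Existing action recognition models do well on ground-view data but generalize poorly to different viewpoints such as aerial views, and real aerial training data is scarce. The paper explores training methods that improve aerial-view generalization without any real aerial data. Method: Curriculum learning combines synthetic aerial-view data and real ground-view data. Two strategies are compared: a two-stage curriculum that fine-tunes directly (synthetic first, then real), and a progressive curriculum that expands the dataset in multiple stages before fine-tuning. Both are evaluated on the REMAG dataset with SlowFast and MViTv2. Result: Combining the two out-of-domain sources beats single-domain training. Both curricula match the top-1 accuracy of simple dataset combination with far fewer training iterations: two-step fine-tuning cuts iterations by up to 37% for SlowFast and 30% for MViTv2, and the progressive approach cuts a further 9% (SlowFast) and 30% (MViTv2) relative to the two-step method. The accuracy gap stays within 3%. Conclusion: Curriculum strategies improve generalization to aerial views without real aerial training data and substantially improve training efficiency, offering an efficient and practical solution for cross-view action recognition. Abstract: Despite significant progress in human action recognition, generalizing to diverse viewpoints remains a challenge. Most existing datasets are captured from ground-level perspectives, and models trained on them often struggle to transfer to drastically different domains such as aerial views. This paper examines how curriculum-based training strategies can improve generalization to unseen real aerial-view data without using any real aerial data during training. We explore curriculum learning for cross-view action recognition using two out-of-domain sources: synthetic aerial-view data and real ground-view data. We evaluate the order of training (fine-tuning on synthetic aerial data vs. real ground data) and consider two curriculum strategies that both fine-tune on real ground data but differ in how they transition from synthetic to real. The first uses a two-stage curriculum with direct fine-tuning, while the second applies a progressive curriculum that expands the dataset in multiple stages before fine-tuning. We evaluate both methods on the REMAG dataset using SlowFast (CNN-based) and MViTv2 (Transformer-based) architectures. Results show that combining the two out-of-domain datasets clearly outperforms training on a single domain, whether real ground-view or synthetic aerial-view. Both curriculum strategies match the top-1 accuracy of simple dataset combination while offering efficiency gains.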
With the two-step fine-tuning method, SlowFast achieves up to a 37% reduction in iterations and MViTv2 up to a 30% reduction compared to simple combination. The multi-step progressive approach further reduces iterations, by up to 9% for SlowFast and 30% for MViTv2, relative to the two-step method. These findings demonstrate that curriculum-based training can maintain comparable performance (top-1 accuracy within 3% range) while improving training efficiency in cross-view action recognition.
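Because the second set of reductions is stated relative to the two-step method rather than to simple combination, the two percentages compound multiplicatively instead of adding. The short calculation below works this out. It assumes the "up to" figures for each architecture come from the same setting, which the abstract does not confirm, so the combined numbers are illustrative upper bounds rather than reported results; `combined_reduction` is a hypothetical helper name.

```python
def combined_reduction(two_step, progressive_vs_two_step):
    """Overall iteration reduction relative to simple dataset combination.

    two_step                : reduction of the two-step curriculum vs. combination
    progressive_vs_two_step : further reduction of the progressive curriculum,
                              measured relative to the two-step curriculum
    """
    remaining = (1.0 - two_step) * (1.0 - progressive_vs_two_step)
    return 1.0 - remaining


# SlowFast: up to 37% from two-step, then up to a further 9%.
slowfast_total = combined_reduction(0.37, 0.09)

# MViTv2: up to 30% from two-step, then up to a further 30%.
mvitv2_total = combined_reduction(0.30, 0.30)

print(f"SlowFast: up to {slowfast_total:.1%} fewer iterations")
print(f"MViTv2:   up to {mvitv2_total:.1%} fewer iterations")
```

Under that assumption the progressive curriculum needs at most about 42.7% fewer iterations than simple combination for SlowFast and about 51% fewer for MViTv2, not 46% and 60% as naive addition would suggest.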

[436] Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing

Xiaolu Liu,Yicong Li,Qiyuan He,Jiayin Zhu,Wei Ji,Angela Yao,Jianke Zhu

Main category: cs.CV

TL;DR: This paper proposes Interp3D, a training-free framework for textured 3D morphing. A progressive alignment strategy jointly preserves geometric consistency, texture alignment, and transition robustness, giving structurally coherent transitions that retain appearance detail.

Details Motivation: Existing methods either operate on geometry alone and ignore textures, or extend 2D interpolation to 3D and suffer semantic ambiguity, structural misalignment, and texture blurring. A method that handles geometry, texture, and robustness together is needed. Method: Interp3D uses generative priors and a progressive alignment principle: semantically aligned interpolation in condition space, then SLAT (Structured Latent)-guided structure interpolation for geometric consistency, and finally fine-grained texture fusion to transfer appearance details. Result: Quantitative evaluation and human studies on the purpose-built, graded-difficulty dataset Interp3DData show significant advantages over previous methods in fidelity, transition smoothness, and plausibility. Conclusion: Interp3D is an efficient, training-free approach to textured 3D morphing that addresses joint geometry-texture modeling and supports practical 3D content generation and editing. Abstract: Textured 3D morphing seeks to generate smooth and plausible transitions between two 3D assets, preserving both structural coherence and fine-grained appearance. This ability is crucial not only for advancing 3D generation research but also for practical applications in animation, editing, and digital content creation. Existing approaches either operate directly on geometry, limiting them to shape-only morphing while neglecting textures, or extend 2D interpolation strategies into 3D, which often causes semantic ambiguity, structural misalignment, and texture blurring. These challenges underscore the necessity to jointly preserve geometric consistency, texture alignment, and robustness throughout the transition process. To address this, we propose Interp3D, a novel training-free framework for textured 3D morphing. It harnesses generative priors and adopts a progressive alignment principle to ensure both geometric fidelity and texture coherence. Starting from semantically aligned interpolation in condition space, Interp3D enforces structural consistency via SLAT (Structured Latent)-guided structure interpolation, and finally transfers appearance details through fine-grained texture fusion. For comprehensive evaluations, we construct a dedicated dataset, Interp3DData, with graded difficulty levels and assess generation results from fidelity, transition smoothness, and plausibility. Both quantitative metrics and human studies demonstrate the significant advantages of our proposed approach over previous methods. Source code is available at https://github.com/xiaolul2/Interp3D.

[437] PMCE: Probabilistic Multi-Granularity Semantics with Caption-Guided Enhancement for Few-Shot Learning

Jiaying Wu,Can Gao,Jinglu Hu,Hui Li,Xiaofeng Cao,Jingcai Guo

Main category: cs.CV

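TL;DR: This paper proposes PMCE, a few-shot learning framework based on multi-granularity semantics with caption-guided enhancement. It builds a nonparametric knowledge bank, fuses prior information derived from CLIP-encoded class names, and uses BLIP-generated image captions to refine support prototypes and query features, significantly improving few-shot classification.

Details Motivation: Prototypes estimated from scarce data in few-shot learning are biased and generalize poorly. Semantic-based methods are mostly applied on the support side, make little use of query samples, and do not exploit fine-grained instance-level semantics. Method: PMCE (1) builds a nonparametric knowledge bank storing visual statistics of the base classes and CLIP-encoded class names; (2) at meta-test time retrieves the most relevant base classes by class-name similarity, aggregates them into a prior, and fuses it with the support prototypes via a MAP update; and (3) uses a frozen BLIP captioner to produce label-free image descriptions, with a lightweight enhancer trained on base classes that refines support prototypes and query features under consistency regularization. Result: On four benchmarks PMCE consistently outperforms strong baselines, with up to 7.71% absolute gain over the strongest semantic competitor on MiniImageNet in the 1-shot setting. Conclusion: By fusing multi-granularity semantics (class-level priors and instance-level captions), PMCE mitigates the prototype bias caused by data scarcity in few-shot learning and gives more robust classification. Abstract: Few-shot learning aims to identify novel categories from only a handful of labeled samples, where prototypes estimated from scarce data are often biased and generalize poorly. Semantic-based methods alleviate this by introducing coarse class-level information, but they are mostly applied on the support side, leaving query representations unchanged. In this paper, we present PMCE, a Probabilistic few-shot framework that leverages Multi-granularity semantics with Caption-guided Enhancement. PMCE constructs a nonparametric knowledge bank that stores visual statistics for each category as well as CLIP-encoded class name embeddings of the base classes. At meta-test time, the most relevant base classes are retrieved based on the similarities of class name embeddings for each novel category. These statistics are then aggregated into category-specific prior information and fused with the support set prototypes via a simple MAP update. Simultaneously, a frozen BLIP captioner provides label-free instance-level image descriptions, and a lightweight enhancer trained on base classes optimizes both support prototypes and query features under an inductive protocol with a consistency regularization to stabilize noisy captions. Experiments on four benchmarks show that PMCE consistently improves over strong baselines, achieving up to 7.71% absolute gain over the strongest semantic competitor on MiniImageNet in the 1-shot setting. Our code is available at https://anonymous.4open.science/r/PMCE-275D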
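The "simple MAP update" of a prototype can be illustrated with the textbook conjugate-Gaussian case: if the class mean has a Gaussian prior centred on statistics borrowed from related base classes, the MAP estimate is a weighted average of the prior mean and the support-set mean. The sketch below assumes exactly that setting. The abstract does not state PMCE's actual update rule, and `map_prototype` and `prior_strength` (a pseudo-count expressing how much the prior is trusted) are hypothetical names.

```python
def map_prototype(support_features, prior_mean, prior_strength):
    """MAP estimate of a class prototype under a Gaussian prior (illustrative).

    support_features : list of feature vectors from the few labeled shots
    prior_mean       : prior prototype aggregated from related base classes
    prior_strength   : pseudo-count; acts like this many extra prior samples
    """
    n = len(support_features)
    dim = len(prior_mean)
    support_mean = [sum(f[d] for f in support_features) / n for d in range(dim)]
    return [
        (prior_strength * prior_mean[d] + n * support_mean[d]) / (prior_strength + n)
        for d in range(dim)
    ]


prior = [0.0, 1.0]

# 1-shot: one sample and a prior worth one sample meet halfway.
one_shot = map_prototype([[1.0, 0.0]], prior_mean=prior, prior_strength=1.0)

# 5-shot: the same prior now has far less influence on the prototype.
five_shot = map_prototype([[1.0, 0.0]] * 5, prior_mean=prior, prior_strength=1.0)
```

The prior dominates when shots are scarce and fades as more support samples arrive, which matches the motivation that 1-shot prototypes are the most biased and benefit most from borrowed statistics.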

[438] GIC-DLC: Differentiable Logic Circuits for Hardware-Friendly Grayscale Image Compression

Till Aczel,David F. Jenny,Simon Bührer,Andreas Plesner,Antonio Di Maio,Roger Wattenhofer

Main category: cs.CV

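TL;DR: Proposes GIC-DLC, a hardware-aware grayscale image compression method that combines the flexibility of neural networks with the efficiency of Boolean operations, outperforming traditional codecs in compression efficiency while allowing substantial reductions in energy consumption and latency.

Details Motivation: Neural image codecs achieve high compression ratios but carry heavy computational overhead, which makes them hard to deploy on resource-constrained devices, so a method that balances compression efficiency with hardware friendliness is needed. Method: GIC-DLC trains lookup tables to combine the flexibility of neural networks with the efficiency of Boolean logic operations, yielding hardware-friendly learned compression. Result: On grayscale benchmark datasets, GIC-DLC outperforms traditional codecs in compression efficiency while substantially reducing energy consumption and latency. Conclusion: Learned compression can be hardware-friendly, and GIC-DLC offers a promising direction for low-power image compression on edge devices. Abstract: Neural image codecs achieve higher compression ratios than traditional hand-crafted methods such as PNG or JPEG-XL, but often incur substantial computational overhead, limiting their deployment on energy-constrained devices such as smartphones, cameras, and drones. We propose Grayscale Image Compression with Differentiable Logic Circuits (GIC-DLC), a hardware-aware codec where we train lookup tables to combine the flexibility of neural networks with the efficiency of Boolean operations. Experiments on grayscale benchmark datasets show that GIC-DLC outperforms traditional codecs in compression efficiency while allowing substantial reductions in energy consumption and latency. These results demonstrate that learned compression can be hardware-friendly, offering a promising direction for low-power image compression on edge devices.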

[439] LLM Augmented Intervenable Multimodal Adaptor for Post-operative Complication Prediction in Lung Cancer Surgery

Shubham Pandey,Bhavin Jawade,Srirangaraj Setlur,Venu Govindaraju,Kenneth Seastedt

Main category: cs.CV

TL;DR: Proposes MIRACLE, a deep learning model that predicts the risk of postoperative complications after lung cancer surgery by fusing preoperative clinical and radiological data, with an intervenable module that improves prediction transparency and clinical utility.

Details Motivation: Postoperative complications harm patient outcomes and raise healthcare costs, and existing models fall short in handling heterogeneous data and in interpretability. Method: MIRACLE fuses structured clinical data and high-dimensional radiological images in a hyperspherical embedding space, and adds an intervenable deep learning module to improve interpretability and clinical interaction. Result: Validated on POC-L, a real-world dataset of 3,094 patients, MIRACLE outperforms traditional machine learning models and contemporary large language model variants in personalized, explainable postoperative risk prediction. Conclusion: MIRACLE integrates multimodal data effectively and provides accurate, interpretable predictions of postoperative complication risk, with good prospects for clinical use. Abstract: Postoperative complications remain a critical concern in clinical practice, adversely affecting patient outcomes and contributing to rising healthcare costs. We present MIRACLE, a deep learning architecture for prediction of risk of postoperative complications in lung cancer surgery by integrating preoperative clinical and radiological data. MIRACLE employs a hyperspherical embedding space fusion of heterogeneous inputs, enabling the extraction of robust, discriminative features from both structured clinical records and high-dimensional radiological images. To enhance transparency of prediction and clinical utility, we incorporate an interventional deep learning module in MIRACLE that not only refines predictions but also provides interpretable and actionable insights, allowing domain experts to interactively adjust recommendations based on clinical expertise. We validate our approach on POC-L, a real-world dataset comprising 3,094 lung cancer patients who underwent surgery at Roswell Park Comprehensive Cancer Center. Our results demonstrate that MIRACLE outperforms various traditional machine learning models and contemporary large language model (LLM) variants alone, for personalized and explainable postoperative risk management.

[440] One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion

Yitong Dong,Qi Zhang,Minchao Jiang,Zhiqiang Wu,Qingnan Fan,Ying Feng,Huaqi Zhang,Hujun Bao,Guofeng Zhang

Main category: cs.CV

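TL;DR: Proposes a new framework for high-fidelity novel view synthesis from sparse images, using a dual-domain detail perception module and a feature-guided diffusion network to address the resolution and view-consistency limits of existing methods.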

Details Motivation: Existing 3D Gaussian Splatting methods built on Vision Transformers are limited to low-resolution inputs, and generative enhancement methods lack 3D awareness, which leads to structural inconsistency across views. Method: Designs a Dual-Domain Detail Perception Module that handles high-resolution images and gives Gaussians additional features for high-frequency detail, develops a feature-guided diffusion network that preserves high-frequency detail during restoration, and proposes a unified training strategy that jointly optimizes the ViT geometric backbone and the diffusion refinement module. Result: Experiments show that the method maintains superior generation quality across multiple datasets. Conclusion: The proposed method overcomes the resolution and multi-view consistency limits of existing novel view synthesis techniques and achieves high-fidelity novel view synthesis. Abstract: We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations in recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, they are often constrained by low-resolution inputs due to computational costs. Moreover, existing generative enhancement methods tend to be 3D-agnostic, resulting in inconsistent structures across views, especially in unseen regions. To overcome these challenges, we design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone, and endows Gaussians with additional features to store high-frequency details. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process. We introduce a unified training strategy that enables joint optimization of the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method can maintain superior generation quality across multiple datasets.

[441] ASBA: A-line State Space Model and B-line Attention for Sparse Optical Doppler Tomography Reconstruction

Zhenghong Li,Wensheng Cheng,Congwu Du,Yingtian Pan,Zhaozheng Yin,Haibin Ling

Main category: cs.CV

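TL;DR: Proposes ASBA, a new blood flow-aware network that reconstructs high-quality images from highly sparsely sampled raw Optical Doppler Tomography data by combining an A-line ROI state space model with a B-line phase attention mechanism and a flow-aware weighted loss, clearly outperforming existing methods.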

Details Motivation: Current ODT imaging relies on dense sampling to ensure image quality, which prolongs scanning, increases storage pressure, and makes rapid blood flow dynamics hard to capture. Sparse sampling can ease these problems, but reconstruction quality is held back by conservative sampling rates and by modeling flow and background signals uniformly. Method: Proposes the ASBA network, which uses an A-line ROI state space model to extract flow features that are sparsely distributed along the A-line, a B-line phase attention mechanism that uses phase differences to capture long-range flow signals along the B-line, and a flow-aware weighted loss that makes the network prioritize accurate reconstruction of flow signals. Result: Experiments on real animal data show that the method still reconstructs high-quality ODT images under highly sparse sampling and clearly outperforms existing state-of-the-art reconstruction methods. Conclusion: By modeling flow characteristics and phase information specifically, ASBA achieves high-fidelity ODT reconstruction under very sparse sampling and offers an effective solution for fast, low-burden blood flow imaging. Abstract: Optical Doppler Tomography (ODT) is an emerging blood flow analysis technique. A 2D ODT image (B-scan) is generated by sequentially acquiring 1D depth-resolved raw A-scans (A-line) along the lateral axis (B-line), followed by Doppler phase-subtraction analysis. To ensure high-fidelity B-scan images, current practices rely on dense sampling, which prolongs scanning time, increases storage demands, and limits the capture of rapid blood flow dynamics. Recent studies have explored sparse sampling of raw A-scans to alleviate these limitations, but their effectiveness is hindered by the conservative sampling rates and the uniform modeling of flow and background signals. In this study, we introduce a novel blood flow-aware network, named ASBA (A-line ROI State space model and B-line phase Attention), to reconstruct ODT images from highly sparsely sampled raw A-scans. Specifically, we propose an A-line ROI state space model to extract sparsely distributed flow features along the A-line, and a B-line phase attention to capture long-range flow signals along each B-line based on phase difference. Moreover, we introduce a flow-aware weighted loss function that encourages the network to prioritize the accurate reconstruction of flow signals. Extensive experiments on real animal data demonstrate that the proposed approach clearly outperforms existing state-of-the-art reconstruction methods.
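The abstract mentions that a B-scan is obtained from adjacent A-lines via Doppler phase-subtraction analysis. As a rough illustration only, the sketch below computes the wrapped per-depth phase difference between two complex-valued A-lines, which is the textbook form of phase-resolved Doppler processing. The function name `doppler_phase_shift` and the list-of-complex-samples input layout are assumptions made here for illustration; this is not the ASBA network or the authors' pipeline.

```python
import cmath


def doppler_phase_shift(a_line_prev, a_line_next):
    """Wrapped phase difference per depth sample between two adjacent A-lines.

    Both inputs are sequences of complex depth samples (hypothetical layout).
    The result lies in (-pi, pi] for every depth position.
    """
    if len(a_line_prev) != len(a_line_next):
        raise ValueError("A-lines must have the same number of depth samples")
    # phase(conj(prev) * next) equals phase(next) - phase(prev), wrapped.
    return [
        cmath.phase(prev.conjugate() * nxt)
        for prev, nxt in zip(a_line_prev, a_line_next)
    ]
```

Multiplying by the complex conjugate before taking the phase avoids explicit unwrapping of the two individual phases, which is why this form is commonly preferred over subtracting two `phase` values directly.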

[442] Progressive self-supervised blind-spot denoising method for LDCT denoising

Yichao Liu,Yueyang Teng,Junwen Guo

Main category: cs.CV

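TL;DR: Proposes a new self-supervised training strategy that relies only on low-dose CT (LDCT) images; using a step-wise blind-spot denoising mechanism and Gaussian noise regularization, it outperforms existing self-supervised methods and matches or exceeds some supervised methods.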

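Details Motivation: Reduce the dependence of low-dose CT image denoising on paired normal-dose CT (NDCT) data, which are hard to obtain in clinical practice. Method: Proposes a self-supervised training strategy based only on LDCT images, comprising (1) a step-wise blind-spot denoising mechanism that enforces conditional independence progressively and enables finer-grained denoising learning, and (2) Gaussian noise added to the LDCT images as regularization to mitigate overfitting. Result: Extensive experiments on the Mayo LDCT dataset show that the method consistently outperforms existing self-supervised methods and matches or exceeds several representative supervised denoising methods. Conclusion: Self-supervised denoising from LDCT images alone is feasible and effective, and the proposed step-wise blind-spot denoising and noise regularization strategies markedly improve model performance. Abstract: Self-supervised learning is increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to acquire in clinical practice. In this paper, we propose a novel self-supervised training strategy that relies exclusively on LDCT images. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained denoising learning. In addition, we add Gaussian noise to LDCT images, which acts as a regularization and mitigates overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.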
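To make the blind-spot idea concrete, here is a minimal sketch of how a training pair could be prepared from a single noisy image: Gaussian noise is added first (the regularization the abstract describes), then a random subset of pixels is hidden by overwriting each with a neighbouring value, so a network would have to predict those pixels from context alone. This is generic Noise2Void-style masking, swapped in as an assumption; the paper's step-wise, progressive mechanism is not specified in the abstract and is not reproduced here. The name `make_blind_spot_pair` and the parameters `mask_ratio`, `noise_sigma`, and `seed` are hypothetical.

```python
import random


def make_blind_spot_pair(image, mask_ratio=0.1, noise_sigma=0.05, seed=0):
    """Return (masked_input, target, blind_spots) for one 2D image.

    image: list of rows of floats. The loss would be evaluated only at
    the coordinates listed in blind_spots (an assumption of this sketch).
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    if height * width < 2:
        raise ValueError("image needs at least two pixels")
    # Regularization: perturb the input with additive Gaussian noise.
    target = [[px + rng.gauss(0.0, noise_sigma) for px in row] for row in image]
    masked = [row[:] for row in target]
    coords = [(r, c) for r in range(height) for c in range(width)]
    blind_spots = rng.sample(coords, max(1, int(mask_ratio * height * width)))
    for r, c in blind_spots:
        # Hide the pixel by copying a random in-bounds 8-neighbour.
        neighbours = [
            (r + dr, c + dc)
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < height and 0 <= c + dc < width
        ]
        nr, nc = rng.choice(neighbours)
        masked[r][c] = target[nr][nc]
    return masked, target, blind_spots
```

Restricting the loss to the blind-spot coordinates is what prevents the network from learning the identity mapping, since the true value at each of those positions never appears in its input.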

[443] IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models

Liang Shi,Wei Li,Kevin M Beussman,Lin Chen,Yun Fu

Main category: cs.CV

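TL;DR: Proposes IIR-VLM, which adds pre-trained instance-level recognition expert models as auxiliary visual encoders so that a vision-language model (VLM) can learn new instances in-context in a one-shot manner, substantially improving VLM performance on instance-level recognition tasks.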

Details Motivation: Modern vision-language models (VLMs) perform poorly on instance-level recognition (ILR), which limits their use in practical applications that require recognizing familiar people and objects. Existing methods usually collect data and train separately for each instance, which is costly and struggles with fine-grained discrimination. Method: Proposes IIR-VLM, which uses pre-trained ILR expert models as auxiliary visual encoders to extract specialized features for learning diverse instances, enables the VLM to learn new instances in-context in a one-shot manner, and supports instance-aware visual understanding. Result: IIR-VLM is validated on existing instance personalization benchmarks and shows superior ILR performance on a new, more challenging benchmark that covers varying difficulty and diverse categories (person, face, pet, and general objects). Conclusion: By fusing expert-model features, IIR-VLM substantially improves the instance-level recognition ability of VLMs without adding large training costs, and it generalizes well and is practical. Abstract: Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, which enables VLMs to learn new instances in-context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark, which assesses ILR capabilities across varying difficulty and diverse categories, with person, face, pet and general objects as the instances at task.

[444] Rig-Aware 3D Reconstruction of Vehicle Undercarriages using Gaussian Splatting

Nitin Kulkarni,Akhil Devarashetti,Charlie Cluss,Livio Forte,Dan Buckmaster,Philip Schneider,Chunming Qiao,Alina Vereshchaka

Main category: cs.CV

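TL;DR: Proposes an end-to-end pipeline based on a three-camera rig that produces interactive 3D models of vehicle undercarriages, improving inspection efficiency and buyer confidence.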

Details Motivation: Traditional undercarriage inspection is slow, labor-intensive, and unsafe, and online car buyers rarely get to see the undercarriage, so an automated, visual solution is needed. Method: Designs a dedicated Structure-from-Motion (SfM) pipeline that combines precise camera calibration, synchronized video streams, and geometric priors from the camera rig, and uses the DISK feature extractor and the LightGlue matcher for constrained matching to produce high-quality sparse point clouds that seed Gaussian splatting reconstruction. Result: The pipeline produces photorealistic undercarriage 3D models that render in real time, and experiments show that it outperforms standard SfM pipelines in low-parallax, wide-angle-distortion scenes and reaches state-of-the-art quality. Conclusion: The proposed rig-aware SfM pipeline handles the challenges of wide-angle distortion and low parallax, markedly improves the quality of vehicle undercarriage 3D reconstruction, and has practical value. Abstract: Inspecting the undercarriage of used vehicles is a labor-intensive task that requires inspectors to crouch or crawl underneath each vehicle to thoroughly examine it. Additionally, online buyers rarely see undercarriage photos. We present an end-to-end pipeline that utilizes a three-camera rig to capture videos of the undercarriage as the vehicle drives over it, and produces an interactive 3D model of the undercarriage. The 3D model enables inspectors and customers to rotate, zoom, and slice through the undercarriage, allowing them to detect rust, leaks, or impact damage in seconds, thereby improving both workplace safety and buyer confidence. Our primary contribution is a rig-aware Structure-from-Motion (SfM) pipeline specifically designed to overcome the challenges of wide-angle lens distortion and low-parallax scenes. It does so by integrating precise camera calibration, synchronized video streams, and strong geometric priors from the camera rig. We use a constrained matching strategy with learned components, the DISK feature extractor, and the attention-based LightGlue matcher to generate high-quality sparse point clouds that are often unattainable with standard SfM pipelines. These point clouds seed the Gaussian splatting process to generate photorealistic undercarriage models that render in real-time. Our experiments and ablation studies demonstrate that our design choices are essential to achieve state-of-the-art quality.

[445] Soft Tail-dropping for Adaptive Visual Tokenization

Zeyuan Chen,Kai Zhang,Zhuowen Tu,Yuanjun Xiong

Main category: cs.CV

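TL;DR: Proposes the Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adapts the number of output tokens to an image's structural complexity and level of detail and is compatible with causal 1D autoregressive generative models.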

Details Motivation: Conventional visual tokenizers usually produce fixed-length token sequences, which cannot accommodate differences in image complexity, and existing autoregressive visual generation models scale poorly. A tokenizer is therefore needed that adapts its output length and remains compatible with causal AR models. Method: Designs a discrete encoding framework with per-token keep probabilities, constrains those probabilities to decrease monotonically along the sequence, and aligns their distribution with an image-level complexity measure, yielding a soft tail-dropping mechanism. Result: On ImageNet-1k, vanilla causal AR models equipped with STAT match or exceed other probabilistic model families in generation quality and show more favorable scaling behavior. Conclusion: STAT provides an effective adaptive tokenization scheme for 1D causal autoregressive visual generation models and addresses the earlier limits of such models in generation quality and scalability. Abstract: We present Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adaptively chooses the number of output tokens per image according to its structural complexity and level of detail. STAT encodes an image into a sequence of discrete codes together with per-token keep probabilities. Beyond standard autoencoder objectives, we regularize these keep probabilities to be monotonically decreasing along the sequence and explicitly align their distribution with an image-level complexity measure. As a result, STAT produces length-adaptive 1D visual tokens that are naturally compatible with causal 1D autoregressive (AR) visual generative models. On ImageNet-1k, equipping vanilla causal AR models with STAT yields competitive or superior visual generation quality compared to other probabilistic model families, while also exhibiting favorable scaling behavior that has been elusive in prior vanilla AR visual generation attempts.
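The two ingredients named in the abstract, a monotonicity regularizer on keep probabilities and length-adaptive output, can be illustrated with a small sketch. The hinge-style penalty and the hard threshold rule below are assumptions chosen for clarity; the abstract does not give STAT's actual loss or its inference-time dropping rule, and the names `monotonicity_penalty`, `tail_drop`, and `threshold` are hypothetical.

```python
def monotonicity_penalty(keep_probs):
    """Sum of upward steps in the sequence; zero when it never increases."""
    return sum(
        max(0.0, nxt - cur) for cur, nxt in zip(keep_probs, keep_probs[1:])
    )


def tail_drop(tokens, keep_probs, threshold=0.5):
    """Keep the leading tokens up to the first keep probability below threshold."""
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    length = 0
    for prob in keep_probs:
        if prob < threshold:
            break
        length += 1
    return tokens[:length]
```

Because only a prefix is ever kept, the retained tokens stay a valid left-to-right sequence, which is what makes this kind of truncation compatible with a causal autoregressive model.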

[446] OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Pengze Zhang,Yanze Wu,Mengtian Li,Xu Bai,Songtao Zhao,Fulong Ye,Chong Mou,Xinghui Li,Zhuowei Chen,Qian He,Mingyuan Gao

Main category: cs.CV

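TL;DR: Proposes OmniTransfer, a unified framework for spatio-temporal video transfer that uses multi-view information and temporal cues to achieve appearance consistency and fine-grained temporal control, with superior performance across a range of video transfer tasks.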

Details Motivation: Existing video customization methods rely on reference images or task-specific temporal priors and fail to exploit the rich spatio-temporal information in videos, which limits flexibility and generalization. Method: OmniTransfer introduces three key designs: Task-aware Positional Bias, Reference-decoupled Causal Learning, and Task-adaptive Multimodal Alignment, the last of which uses multimodal semantic guidance to distinguish and handle different tasks dynamically. Result: Experiments show that OmniTransfer outperforms existing methods in appearance transfer (identity and style) and temporal transfer (camera movement and video effects), and matches pose-guided methods in motion transfer without pose input. Conclusion: OmniTransfer enables flexible, high-fidelity video generation and offers a new paradigm for unified video transfer. Abstract: Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos, thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias that adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning separating reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment using multimodal semantic guidance to dynamically distinguish and tackle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance (ID and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.

[447] LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Said Taghadouini,Adrien Cavaillès,Baptiste Aubertin

Main category: cs.CV

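TL;DR: LightOnOCR-2-1B is a 1B-parameter end-to-end multilingual vision-language model that converts document images directly into structured text and predicts normalized bounding boxes for embedded images, with leading performance at a smaller size and higher speed.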

Details Motivation: Avoid the brittleness of traditional OCR pipelines, convert document images into naturally ordered text efficiently and accurately, and handle complex documents such as scans and scientific PDFs better. Method: Trains on a large-scale, high-quality distillation mix, introduces localization during pretraining via a resume strategy, refines it with RLVR using IoU-based rewards, and improves robustness with checkpoint averaging and task-arithmetic merging. Result: Achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and faster than the previous best models; supports image localization output; and releases the model, the dataset, and a new evaluation benchmark, LightOnOCR-bbox-bench. Conclusion: LightOnOCR-2-1B improves markedly on existing methods in efficiency, accuracy, and extensibility, advancing end-to-end document understanding. Abstract: We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
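Since the abstract refers to IoU-based rewards over normalized bounding boxes, a short sketch of the underlying quantity may help. The function below computes standard intersection-over-union for two axis-aligned boxes; the `(x1, y1, x2, y2)` corner layout in the [0, 1] range is an assumption made here, and how the authors turn IoU into an RLVR reward (thresholds, matching of multiple boxes, scaling) is not stated in the abstract.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - intersection
    # Degenerate (zero-area) inputs yield 0.0 instead of dividing by zero.
    return intersection / union if union > 0 else 0.0
```

For example, `iou((0.0, 0.0, 0.5, 1.0), (0.25, 0.0, 0.75, 1.0))` gives one third: the boxes overlap on an area of 0.25 out of a combined area of 0.75.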

[448] Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis

Hongyuan Chen,Xingyu Chen,Youjia Zhang,Zexiang Xu,Anpei Chen

Main category: cs.CV

TL;DR: Motion 3-to-4 is a feed-forward framework that generates high-quality 4D dynamic objects from a monocular video and an optional 3D reference mesh.

Details Motivation: 4D synthesis is challenging because training data are limited and recovering geometry and motion from a monocular viewpoint is inherently ambiguous. Method: Decomposes 4D synthesis into static 3D shape generation and motion reconstruction, uses a canonical reference mesh to learn a compact motion latent representation, and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer adds robustness to varying sequence lengths. Result: Evaluations on standard benchmarks and on a new dataset with accurate ground-truth geometry show that Motion 3-to-4 outperforms prior methods in fidelity and spatial consistency. Conclusion: Motion 3-to-4 addresses the key challenges of 4D dynamic object synthesis and achieves high-quality, temporally coherent 4D reconstruction. Abstract: We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work. Project page is available at https://motion3-to-4.github.io/.

[449] VideoMaMa: Mask-Guided Video Matting via Generative Prior

Sangbeom Lim,Seoung Wug Oh,Jiahui Huang,Heeji Yoon,Seungryong Kim,Joon-Young Lee

Main category: cs.CV

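TL;DR: Proposes VideoMaMa, which uses pretrained video diffusion models to convert coarse segmentation masks into accurate alpha mattes, and builds MA-V, a large-scale real-world video matting dataset, to advance video matting research.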

Details Motivation: Because labeled data are scarce, existing video matting models generalize poorly to real-world videos, so solutions that do not need large amounts of real annotations are required. Method: Proposes VideoMaMa, which learns from synthetic data using pretrained video diffusion models to convert coarse masks into fine mattes, and builds a pseudo-labeling pipeline that produces the MA-V dataset, used to train models such as SAM2-Matte. Result: VideoMaMa shows strong zero-shot generalization to real videos despite being trained only on synthetic data, and SAM2-Matte, fine-tuned on MA-V, is more robust on in-the-wild videos than the same model trained on existing matting datasets. Conclusion: Large-scale pseudo-labeling combined with generative priors and readily available segmentation cues can drive progress in video matting. Abstract: Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present Video Mask-to-Matte Model (VideoMaMa) that converts coarse segmentation masks into pixel-accurate alpha mattes, by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.

[450] Implicit Neural Representation Facilitates Unified Universal Vision Encoding

Matthew Gwilliam,Xiao Wang,Xuefeng Hu,Zhenheng Yang

Main category: cs.CV

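TL;DR: Proposes a new model that unifies image recognition and generation, using an implicit neural representation (INR) hyper-network to learn compact, high-performing image embeddings that support both downstream recognition tasks and high-quality image generation.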

Details Motivation: Existing image representation learning models are usually designed for either recognition or generation alone, and a unified framework that serves both is lacking. Method: Proposes a hyper-network architecture based on implicit neural representation (INR) that maps images to the weights of a reconstruction network, combined with knowledge distillation to improve generalization; the training objective balances reconstruction quality and embedding discriminability. Result: The model is competitive with state-of-the-art results for image representation learning across visual tasks while also supporting high-fidelity image generation from very small ("tiny") embeddings. Conclusion: The work presents a first-of-its-kind unification of recognition and generation, shows that a compact embedding space can support both discriminative and generative tasks, and offers a new paradigm for general-purpose visual representation learning. Abstract: Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at https://github.com/tiktok/huvr.