cs.CL

[1] Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches

Margarita Bugueño,Gerard de Melo

Main category: cs.CL

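TL;DR: This paper proposes a data-driven method for learning graph structures for document classification that requires no manual design or domain knowledge. A self-attention model learns dependencies between sentences, a statistical filtering strategy improves graph quality, and experiments show that the learned graphs outperform traditional heuristic approaches.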

Details Motivation: Existing graph-based document representations rely on heuristics, domain-specific rules, or expert knowledge, which makes them domain-dependent and requires manual design. Method: A self-attention model learns dependencies between sentences to construct homogeneous weighted graphs, and a statistical filtering strategy refines the graph structure. Result: Experiments on three document classification datasets show that the proposed method outperforms heuristic-based graph methods in both accuracy and F1 score, and confirm that statistical filtering improves classification robustness. Conclusion: Automatic graph generation outperforms traditional heuristic approaches and opens new research directions in NLP. Abstract: In document classification, graph-based models effectively capture document structure, overcoming sequence length limitations and enhancing contextual understanding. However, most existing graph document representations rely on heuristics, domain-specific rules, or expert knowledge. Unlike previous approaches, we propose a method to learn data-driven graph structures, eliminating the need for manual design and reducing domain dependence. Our approach constructs homogeneous weighted graphs with sentences as nodes, while edges are learned via a self-attention model that identifies dependencies between sentence pairs. A statistical filtering strategy aims to retain only strongly correlated sentences, improving graph quality while reducing the graph size. Experiments on three document classification datasets demonstrate that learned graphs consistently outperform heuristic-based graphs, achieving higher accuracy and $F_1$ score. Furthermore, our study demonstrates the effectiveness of the statistical filtering in improving classification robustness. These results highlight the potential of automatic graph generation over traditional heuristic approaches and open new directions for broader applications in NLP.
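
To make the graph-construction idea concrete, here is a minimal sketch of turning pairwise sentence attention scores into a filtered, weighted sentence graph. The abstract does not state the exact filtering rule, so the mean-plus-standard-deviation threshold, the symmetrisation step, and the names `build_sentence_graph` and `num_std` are assumptions made purely for illustration.

```python
import statistics

def build_sentence_graph(attention, num_std=1.0):
    """Return undirected weighted edges (i, j, weight) between sentences.

    `attention` is a square matrix of pairwise sentence scores, e.g. taken
    from a self-attention model. The threshold used here (mean + num_std *
    standard deviation of all pair weights) is an assumed filtering rule,
    not necessarily the one used in the paper.
    """
    n = len(attention)
    # Symmetrise the scores so each sentence pair has a single edge weight.
    weights = {
        (i, j): (attention[i][j] + attention[j][i]) / 2.0
        for i in range(n)
        for j in range(i + 1, n)
    }
    if not weights:
        return []
    values = list(weights.values())
    threshold = statistics.mean(values) + num_std * statistics.pstdev(values)
    # Keep only strongly connected sentence pairs.
    return [(i, j, w) for (i, j), w in weights.items() if w > threshold]

# Toy 4-sentence document: only sentences 0 and 1 attend strongly to each other.
toy_attention = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.2],
    [0.1, 0.1, 0.2, 0.0],
]
edges = build_sentence_graph(toy_attention)
```

With this toy matrix, only the edge between sentences 0 and 1 survives the filter, which illustrates how filtering can shrink the graph while keeping the strongest dependencies.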

[2] FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts

Hagyeong Shin,Binoy Robin Dalal,Iwona Bialynicka-Birula,Navjot Matharu,Ryan Muir,Xingwei Yang,Samuel W. K. Wong

Main category: cs.CL

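TL;DR: This paper proposes a new approach, based on the 3D paradigm (Decompose, Decouple, Detach) and the FECT dataset, to improve factuality evaluation of AI-generated content in contact center conversation analysis.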

Details Motivation: Large language models (LLMs) are prone to hallucination in enterprise applications. In contact center conversation analysis and summarization in particular, factuality evaluation faces unique challenges because ground-truth labels are lacking, which motivates a new approach to evaluating the factuality of AI-generated content. Method: The 3D paradigm is built into the human annotation guideline and the LLM-judge prompt, a new factuality evaluation dataset (FECT) is constructed, and alignment experiments with LLM judges are conducted. Result: The authors develop the 3D paradigm and the FECT dataset and validate experimentally that the approach improves the consistency of LLM judgments. Conclusion: The paper proposes a human annotation guideline and an LLM-judge prompt based on the 3D paradigm (Decompose, Decouple, Detach) to address factuality evaluation of interpretive AI-generated claims in contact center conversation analysis. It also introduces a new benchmark dataset, FECT, and reports results from aligning LLM judges on the 3D paradigm. Abstract: Large language models (LLMs) are known to hallucinate, producing natural language outputs that are not grounded in the input, reference materials, or real-world knowledge. In enterprise applications where AI features support business decisions, such hallucinations can be particularly detrimental. LLMs that analyze and summarize contact center conversations introduce a unique set of challenges for factuality evaluation, because ground-truth labels often do not exist for analytical interpretations about sentiments captured in the conversation and root causes of the business problems. To remedy this, we first introduce a 3D -- Decompose, Decouple, Detach -- paradigm in the human annotation guideline and the LLM-judges' prompt to ground the factuality labels in linguistically-informed evaluation criteria. We then introduce FECT, a novel benchmark dataset for Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts, labeled under our 3D paradigm. Lastly, we report our findings from aligning LLM-judges on the 3D paradigm. Overall, our findings contribute a new approach for automatically evaluating the factuality of outputs generated by an AI system for analyzing contact center conversations.

[3] XAutoLM: Efficient Fine-Tuning of Language Models via Meta-Learning and AutoML

Ernesto L. Estevanell-Valladares,Suilan Estevez-Velarde,Yoan Gutiérrez,Andrés Montoyo,Ruslan Mitkov

Main category: cs.CL

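TL;DR: XAutoLM is a meta-learning-augmented AutoML framework that efficiently optimizes language model fine-tuning, substantially improving resource utilization and model performance.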

Details Motivation: Repeated trials during language model fine-tuning incur substantial computational overhead and environmental impact, calling for an automated framework that handles model selection and hyperparameter optimization together. Method: XAutoLM uses meta-learning, extracting task- and system-level meta-features from the successes and failures of past tasks to guide model selection and hyperparameter optimization. Result: Across four text classification and two question-answering benchmarks, XAutoLM surpasses the zero-shot optimiser's peak F1 on five of six tasks, cuts mean evaluation time by up to 4.5x, reduces error ratios by up to sevenfold, and uncovers up to 50% more pipelines above the zero-shot Pareto front. Conclusion: XAutoLM is an effective automated framework that reuses past experience to optimize language model fine-tuning, achieving higher efficiency and lower resource consumption. Abstract: Experts in machine learning leverage domain knowledge to navigate decisions in model selection, hyperparameter optimisation, and resource allocation. This is particularly critical for fine-tuning language models (LMs), where repeated trials incur substantial computational overhead and environmental impact. However, no existing automated framework simultaneously tackles the entire model selection and HPO task for resource-efficient LM fine-tuning. We introduce XAutoLM, a meta-learning-augmented AutoML framework that reuses past experiences to optimise discriminative and generative LM fine-tuning pipelines efficiently. XAutoLM learns from stored successes and failures by extracting task- and system-level meta-features to bias its sampling toward fruitful configurations and away from costly dead ends. On four text classification and two question-answering benchmarks, XAutoLM surpasses the zero-shot optimiser's peak F1 on five of six tasks, cuts mean evaluation time by up to 4.5x, reduces error ratios by up to sevenfold, and uncovers up to 50% more pipelines above the zero-shot Pareto front. In contrast, simpler memory-based baselines suffer negative transfer. We release XAutoLM and our experience store to catalyse resource-efficient, Green AI fine-tuning in the NLP community.

[4] MAO-ARAG: Multi-Agent Orchestration for Adaptive Retrieval-Augmented Generation

Yiqun Chen,Erhan Zhang,Lingyong Yan,Shuaiqiang Wang,Jizhou Huang,Dawei Yin,Jiaxin Mao

Main category: cs.CL

TL;DR: This paper introduces MAO-ARAG, an adaptive RAG framework using multi-agent orchestration to dynamically plan workflows for question-answering tasks, balancing answer quality, cost, and efficiency.

Details Motivation: The motivation stems from the limitations of fixed RAG pipelines in handling varying query complexities, which often struggle to balance performance and cost efficiency. The need for a more adaptive solution inspired this research. Method: The method involves an adaptive RAG framework using multi-agent orchestration, with a planner agent selecting and integrating appropriate executor agents (e.g., query reformulation, document selection, generation) into a tailored workflow for each query. The planner agent is trained using reinforcement learning guided by F1 score rewards and cost-based penalties. Result: Experiments on multiple QA datasets show that the MAO-ARAG approach effectively delivers high-quality answers while keeping costs and latency within acceptable limits. Conclusion: The proposed MAO-ARAG framework dynamically plans workflows for each query, achieving high answer quality while maintaining cost and latency within acceptable limits. Abstract: In question-answering (QA) systems, Retrieval-Augmented Generation (RAG) has become pivotal in enhancing response accuracy and reducing hallucination issues. The architecture of RAG systems varies significantly, encompassing single-round RAG, iterative RAG, and reasoning RAG, each tailored to address different types of queries. Due to the varying complexity of real-world queries, a fixed RAG pipeline often struggles to balance performance and cost efficiency across different queries. To address this challenge, we propose an adaptive RAG framework called MAO-ARAG, which leverages multi-agent orchestration. Our adaptive RAG is conceived as a multi-turn framework. Specifically, we define multiple executor agents, representing typical RAG modules such as query reformulation agents, document selection agent, and generation agents. 
A planner agent intelligently selects and integrates the appropriate agents from these executors into a suitable workflow tailored for each query, striving for high-quality answers while maintaining reasonable costs. During each turn, the planner agent is trained using reinforcement learning, guided by an outcome-based reward (F1 score) and a cost-based penalty, continuously improving answer quality while keeping costs within a reasonable range. Experiments conducted on multiple QA datasets demonstrate that our approach, which dynamically plans workflows for each query, not only achieves high answer quality but also maintains both cost and latency within acceptable limits. The code of MAO-ARAG is available at https://github.com/chenyiqun/Agentic-RAG.
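
The reward signal described above (an F1-based outcome reward combined with a cost-based penalty) can be sketched as follows. The token-level F1 is the usual QA metric; the linear form of the penalty, the `cost_weight` parameter, and the function names are assumptions for illustration and are not taken from the paper or its repository.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def planner_reward(prediction, reference, cost, cost_weight=0.1):
    """Outcome reward minus a cost penalty (assumed linear in `cost`)."""
    return token_f1(prediction, reference) - cost_weight * cost

# A correct answer reached through a workflow with 2 units of (hypothetical) cost.
reward = planner_reward("Paris", "paris", cost=2.0)
```

Under this assumed form, a longer workflow that reaches the same answer earns a lower reward, which is one way to push a planner toward cheaper plans when quality is equal.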

[5] UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu

Farah Adeeba,Brian Dillon,Hassan Sajjad,Rajesh Bhatt

Main category: cs.CL

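TL;DR: This paper builds UrBLiMP, an Urdu benchmark of linguistic minimal pairs, and evaluates twenty multilingual LLMs on their knowledge of Urdu syntax. LLaMA-3-70B performs best but is statistically comparable to other top models.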

Details Motivation: Multilingual large language models (LLMs) perform well across many languages, but low-resource languages such as Urdu are represented by far less data than high-resource languages like English. Method: The authors build UrBLiMP, an Urdu benchmark of 5,696 linguistic minimal pairs, and evaluate twenty multilingual LLMs on it. Result: LLaMA-3-70B achieves the highest average accuracy on UrBLiMP (94.73%), and its performance is statistically comparable to other top models such as Gemma-3-27B-PT. Conclusion: The creation of UrBLiMP and the evaluation results reveal both the potential and the limitations of multilingual LLMs in capturing fine-grained syntactic knowledge in low-resource languages. Abstract: Multilingual Large Language Models (LLMs) have shown remarkable performance across various languages; however, they often include significantly less data for low-resource languages such as Urdu compared to high-resource languages like English. To assess the linguistic knowledge of LLMs in Urdu, we present the Urdu Benchmark of Linguistic Minimal Pairs (UrBLiMP), i.e., pairs of minimally different sentences that contrast in grammatical acceptability. UrBLiMP comprises 5,696 minimal pairs targeting ten core syntactic phenomena, carefully curated using the Urdu Treebank and diverse Urdu text corpora. A human evaluation of UrBLiMP annotations yielded a 96.10% inter-annotator agreement, confirming the reliability of the dataset. We evaluate twenty multilingual LLMs on UrBLiMP, revealing significant variation in performance across linguistic phenomena. While LLaMA-3-70B achieves the highest average accuracy (94.73%), its performance is statistically comparable to other top models such as Gemma-3-27B-PT. These findings highlight both the potential and the limitations of current multilingual LLMs in capturing fine-grained syntactic knowledge in low-resource languages.
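
The abstract does not spell out the scoring procedure, but a common convention for minimal-pair benchmarks is to count a pair as correct when the model assigns a higher score (for example, log-probability) to the acceptable sentence than to the unacceptable one. The sketch below assumes that convention; `minimal_pair_accuracy`, the toy English pairs, and the hard-coded scores are hypothetical stand-ins for a real model and the Urdu data.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of pairs where the acceptable sentence outscores the other.

    `pairs` holds (acceptable, unacceptable) sentence tuples and `score`
    maps a sentence to a number where higher means more probable.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)

# Toy stand-in for a language model: fixed scores for four sentences.
toy_scores = {
    "The cats sleep.": -10.0,
    "The cats sleeps.": -14.0,
    "She has left.": -12.0,
    "She have left.": -11.0,  # the toy "model" gets this pair wrong
}
toy_pairs = [
    ("The cats sleep.", "The cats sleeps."),
    ("She has left.", "She have left."),
]
accuracy = minimal_pair_accuracy(toy_pairs, toy_scores.__getitem__)
```

In practice `score` would wrap a real model's sentence log-probability, and accuracy would typically be reported per syntactic phenomenon as well as on average.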

[6] Cross-Domain Web Information Extraction at Pinterest

Michael Farag,Patrick Halina,Andrey Zaytsev,Alekhya Munagala,Imtihan Ahmed,Junhao Wang

Main category: cs.CL

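TL;DR: This paper presents an efficient attribute extraction system that combines structural, visual, and textual information from webpages, achieving better cost-effectiveness and processing speed than large language models.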

Details Motivation: The internet holds a vast amount of unstructured information, and converting it into a structured format is a major challenge. For Pinterest, accurately extracting structured product data from e-commerce websites is essential for enhancing user experiences and improving content distribution. Method: The approach uses a novel webpage representation that combines the text, style, and layout information of each visible HTML node, and applies simple models such as XGBoost for attribute extraction. Result: Experiments show that the system is highly scalable, processing over 1,000 URLs per second, while being 1000 times more cost-effective than the cheapest GPT alternatives. In addition, simple models such as XGBoost extract attributes more accurately than complex LLMs. Conclusion: Pinterest's attribute extraction system achieves efficient and accurate extraction by combining the structural, visual, and text modalities of webpages. The approach is more cost-effective than large language models and strikes a good balance between processing speed and accuracy. Abstract: The internet offers a massive repository of unstructured information, but it's a significant challenge to convert this into a structured format. At Pinterest, the ability to accurately extract structured product data from e-commerce websites is essential to enhance user experiences and improve content distribution. In this paper, we present Pinterest's system for attribute extraction, which achieves remarkable accuracy and scalability at a manageable cost. Our approach leverages a novel webpage representation that combines structural, visual, and text modalities into a compact form, optimizing it for small model learning. This representation captures each visible HTML node with its text, style and layout information. We show how this allows simple models such as eXtreme Gradient Boosting (XGBoost) to extract attributes more accurately than much more complex Large Language Models (LLMs) such as Generative Pre-trained Transformer (GPT). Our results demonstrate a system that is highly scalable, processing over 1,000 URLs per second, while being 1000 times more cost-effective than the cheapest GPT alternatives.

[7] Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates

Liam G. McCoy,Fateme Nateghi Haredasht,Kanav Chopra,David Wu,David JH Wu,Abass Conteh,Sarita Khemani,Saloni Kumar Maharaj,Vishnu Ravi,Arth Pahwa,Yingjie Weng,Leah Rosengaus,Lena Giang,Kelvin Zhenghao Li,Olivia Jee,Daniel Shirvani,Ethan Goh,Jonathan H. Chen

Main category: cs.CL

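TL;DR: Large language models can generate structured clinical consultation templates, but they still fall short at prioritizing the most important information under length constraints.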

Details Motivation: To evaluate the ability of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Method: Frontier models are evaluated with a multi-agent pipeline that combines prompt optimization, semantic autograding, and prioritization analysis. Result: Although models like o3 reach high comprehensiveness (up to 92.2%), they tend to generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Conclusion: LLMs can enhance structured clinical information exchange between physicians, but more robust evaluation methods are needed to measure a model's ability to prioritize clinically relevant information within the time constraints of real-world physician communication. Abstract: This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford's eConsult team, we assess frontier models -- including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro -- for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model's ability to prioritize clinically salient information within the time constraints of real-world physician communication.

[8] CSIRO-LT at SemEval-2025 Task 11: Adapting LLMs for Emotion Recognition for Multiple Languages

Jiyu Chen,Necva Bölücü,Sarvnaz Karimi,Diego Mollá,Cécile L. Paris

Main category: cs.CL

TL;DR: This paper investigates emotion recognition across languages, showing that fine-tuning multilingual LLMs with LoRA per language is the most effective approach.

Details Motivation: Detecting emotions across different languages is challenging due to varied and culturally nuanced ways of emotional expressions. Method: Investigation of various task-adaptation strategies for LLMs in emotion recognition. Result: The most effective method for emotion recognition is to fine-tune a pre-trained multilingual LLM with LoRA setting separately for each language. Conclusion: Fine-tuning a pre-trained multilingual LLM with LoRA setting separately for each language is the most effective method for emotion recognition across different languages. Abstract: Detecting emotions across different languages is challenging due to the varied and culturally nuanced ways of emotional expressions. The "Semeval 2025 Task 11: Bridging the Gap in Text-Based emotion" shared task was organised to investigate emotion recognition across different languages. The goal of the task is to implement an emotion recogniser that can identify the basic emotional states that general third-party observers would attribute to an author based on their written text snippet, along with the intensity of those emotions. We report our investigation of various task-adaptation strategies for LLMs in emotion recognition. We show that the most effective method for this task is to fine-tune a pre-trained multilingual LLM with LoRA setting separately for each language.

[9] Adaptive Content Restriction for Large Language Models via Suffix Optimization

Yige Li,Peihai Jiang,Jun Sun,Peng Shu,Tianming Liu,Zhen Xiang

Main category: cs.CL

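TL;DR: This paper proposes Suffix Optimization (SOP), a lightweight content restriction method that appends an optimized suffix to the prompt to prevent large language models from generating restricted terms. It requires no model fine-tuning and performs well across multiple models and an online platform.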

Details Motivation: Enforcing content restrictions on large language models (LLMs) is challenging because of their expansive output space. The need for content restriction varies widely across user groups, changes quickly, and is hard to capture with a single standard, so a lightweight method that requires no fine-tuning is needed. Method: The paper proposes a new method called Suffix Optimization (SOP), which appends a short, optimized suffix to the prompt to prevent the model from generating specific restricted terms while preserving output quality. Result: On the newly built content restriction benchmark CoReBench, SOP outperforms system-level baselines by 15%, 17%, 10%, 9%, and 6% in average restriction rate for Gemma2-2B, Mistral-7B, Vicuna-7B, Llama3-8B, and Llama3.1-8B, respectively. SOP is also shown to be effective on the online platform POE. Conclusion: Suffix Optimization (SOP) is an effective adaptive content restriction method that prevents LLMs from generating restricted terms without model fine-tuning and proves practical in real-world scenarios. Abstract: Large Language Models (LLMs) have demonstrated significant success across diverse applications. However, enforcing content restrictions remains a significant challenge due to their expansive output space. One aspect of content restriction is preventing LLMs from generating harmful content via model alignment approaches such as supervised fine-tuning (SFT). Yet, the need for content restriction may vary significantly across user groups, change rapidly over time, and not always align with general definitions of harmfulness. Applying SFT to each of these specific use cases is impractical due to the high computational, data, and storage demands. Motivated by this need, we propose a new task called Adaptive Content Restriction (AdaCoRe), which focuses on lightweight strategies -- methods without model fine-tuning -- to prevent deployed LLMs from generating restricted terms for specific use cases. We propose the first method for AdaCoRe, named Suffix Optimization (SOP), which appends a short, optimized suffix to any prompt to a) prevent a target LLM from generating a set of restricted terms, while b) preserving the output quality. To evaluate AdaCoRe approaches, including our SOP, we create a new Content Restriction Benchmark (CoReBench), which contains 400 prompts for 80 restricted terms across 8 carefully selected categories. 
We demonstrate the effectiveness of SOP on CoReBench, which outperforms the system-level baselines such as system suffix by 15%, 17%, 10%, 9%, and 6% on average restriction rates for Gemma2-2B, Mistral-7B, Vicuna-7B, Llama3-8B, and Llama3.1-8B, respectively. We also demonstrate that SOP is effective on POE, an online platform hosting various commercial LLMs, highlighting its practicality in real-world scenarios.
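
As a rough illustration of how a restriction rate could be measured, the sketch below counts the share of model responses that contain none of the restricted terms. The abstract does not define the metric precisely, so this case-insensitive substring definition and the name `restriction_rate` are assumptions for illustration only.

```python
def restriction_rate(responses, restricted_terms):
    """Share of responses that contain none of the restricted terms.

    Matching is a case-insensitive substring check, which is an assumed
    simplification of whatever matching a real benchmark would use.
    """
    responses = list(responses)
    if not responses:
        raise ValueError("at least one response is required")
    terms = [term.lower() for term in restricted_terms]
    clean = sum(
        not any(term in response.lower() for term in terms)
        for response in responses
    )
    return clean / len(responses)

# Toy responses checked against a single hypothetical restricted term.
toy_responses = ["I like tea.", "Coffee is great.", "Water, please."]
rate = restriction_rate(toy_responses, ["coffee"])
```

Here two of the three toy responses avoid the restricted term, so the rate is two thirds.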

[10] Show or Tell? Modeling the evolution of request-making in Human-LLM conversations

Shengqi Zhu,Jeffrey M. Rzeszotarski,David Mimno

Main category: cs.CL

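TL;DR: An analysis of LLM chat logs shows that users' request patterns converge with experience and that model upgrades significantly affect user behavior at the community level.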

Details Motivation: Chat logs contain rich information about user behavior, but behavior patterns are often hard to see because queries vary so much. Method: Chat queries are segmented into request content, roles, query-specific context, and additional expressions, followed by a diachronic analysis. Result: Request-making in LLM interactions differs significantly from human-human interaction. Early queries emphasize requests, and users explore patterns before gradually converging. Model upgrades affect user behavior at the community level. Conclusion: Users show distinctive request patterns when interacting with large language models (LLMs). These patterns converge as users gain experience, and changes in model capabilities affect user behavior at the community level. Abstract: Chat logs provide a rich source of information about LLM users, but patterns of user behavior are often masked by the variability of queries. We present a new task, segmenting chat queries into contents of requests, roles, query-specific context, and additional expressions. We find that, despite the familiarity of chat-based interaction, request-making in LLM queries remains significantly different from comparable human-human interactions. With the data resource, we introduce an important perspective of diachronic analyses with user expressions. We find that query patterns vary between early ones emphasizing requests, and individual users explore patterns but tend to converge with experience. Finally, we show that model capabilities affect user behavior, particularly with the introduction of new models, which are traceable at the community level.

[11] WebDS: An End-to-End Benchmark for Web-based Data Science

Ethan Hsu,Hong Meng Yam,Ines Bouissou,Aaron Murali John,Raj Thota,Josh Koe,Vivek Sarath Putta,G K Dharesan,Alexander Spangher,Shikhar Murty,Tenghao Huang,Christopher D. Manning

Main category: cs.CL

TL;DR: WebDS is a new benchmark for real-world web-based data science tasks, revealing performance gaps in current LLM agents and offering a path for improvement.

Details Motivation: To better reflect real-world data science tasks that involve complex, multi-hop web interactions and diverse tool usage, which current benchmarks fail to assess adequately. Method: Development of an end-to-end web-based data science benchmark consisting of 870 tasks across 29 diverse websites, including structured and unstructured data sources. Result: Evaluations show significant performance gaps in current state-of-the-art LLM agents when handling WebDS tasks, with issues like poor information grounding and shortcut-taking identified as key failure modes. Conclusion: WebDS provides a more realistic and robust benchmark for evaluating LLM-based data science tools, highlighting current shortcomings and paving the way for practical advancements. Abstract: A large portion of real-world data science tasks are complex and require multi-hop web-based interactions: finding appropriate data available on the internet, synthesizing real-time data of various modalities from different locations, and producing summarized analyses. Existing web benchmarks often focus on simplistic interactions, such as form submissions or e-commerce transactions, and often do not require diverse tool-using capabilities required for web based data science. Conversely, traditional data science benchmarks typically concentrate on static, often textually bound datasets and do not assess end-to-end workflows that encompass data acquisition, cleaning, analysis, and insight generation. In response, we introduce WebDS, the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites from structured government data portals to unstructured news media, challenging agents to perform complex, multi-step operations requiring the use of tools and heterogeneous data formats that better reflect the realities of modern data analytics. 
Evaluations of current SOTA LLM agents indicate significant performance gaps in accomplishing these tasks. For instance, Browser Use, which accomplishes 80% of tasks on Web Voyager, successfully completes only 15% of tasks in WebDS, which our analysis suggests is due to new failure modes like poor information grounding, repetitive behavior and shortcut-taking that agents performing WebDS' tasks display. By providing a more robust and realistic testing ground, WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.

[12] WarriorMath: Enhancing the Mathematical Ability of Large Language Models with a Defect-aware Framework

Yue Chen,Minghua He,Fangkai Yang,Pu Zhao,Lu Wang,Yu Kang,Yifei Dong,Yuefeng Zhan,Hao Sun,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang

Main category: cs.CL

TL;DR: WarriorMath improves the mathematical ability of large language models through defect-aware data synthesis and progressive training.

Details Motivation: Existing methods overlook the specific failure modes of LLMs, so the synthetic questions they produce yield only limited improvement. Method: Multiple expert LLMs collaborate to generate, critique, and refine problems, combined with a progressive training framework. Result: Across six mathematical benchmarks, WarriorMath outperforms strong baselines by 12.57% on average. Conclusion: WarriorMath effectively improves the mathematical problem-solving ability of large language models through a defect-aware, multi-expert framework. Abstract: Large Language Models (LLMs) excel in solving mathematical problems, yet their performance is often limited by the availability of high-quality, diverse training data. Existing methods focus on augmenting datasets through rephrasing or difficulty progression but overlook the specific failure modes of LLMs. This results in synthetic questions that the model can already solve, providing minimal performance gains. To address this, we propose WarriorMath, a defect-aware framework for mathematical problem solving that integrates both targeted data synthesis and progressive training. In the synthesis stage, we employ multiple expert LLMs in a collaborative process to generate, critique, and refine problems. Questions that base LLMs fail to solve are identified and iteratively improved through expert-level feedback, producing high-quality, defect-aware training data. In the training stage, we introduce a progressive learning framework that iteratively fine-tunes the model using increasingly challenging data tailored to its weaknesses. Experiments on six mathematical benchmarks show that WarriorMath outperforms strong baselines by 12.57% on average, setting a new state-of-the-art. Our results demonstrate the effectiveness of a defect-aware, multi-expert framework for improving mathematical ability.

[13] Bridging LLMs and Symbolic Reasoning in Educational QA Systems: Insights from the XAI Challenge at IJCNN 2025

Long S. T. Nguyen,Khang H. N. Vo,Thu H. A. Nguyen,Tuan C. Bui,Duc Q. Nguyen,Thanh-Tung Tran,Anh D. Nguyen,Minh L. Nguyen,Fabien Baldacci,Thang H. Bui,Emanuel Di Nardo,Angelo Ciaramella,Son H. Le,Ihsan Ullah,Lorenzo Di Rocco,Tho T. Quan

Main category: cs.CL

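TL;DR: This paper analyzes the XAI Challenge 2025 at IJCNN 2025, a hackathon-style competition on educational question-answering systems that must produce logic-based explanations using lightweight LLMs or hybrid LLM-symbolic systems.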

Details Motivation: The growing integration of AI into education has intensified the need for transparency and interpretability, yet few hackathons have directly addressed explainable AI in real-world educational contexts. Method: The paper describes the challenge's motivation, structure, dataset construction (logic-based templates with Z3 validation, refined through expert student review), and evaluation protocol. Result: The analysis offers actionable insights for future XAI-centered educational systems and competitive research initiatives. Conclusion: The authors argue that the challenge represents a novel effort to bridge LLMs and symbolic reasoning in service of explainability. Abstract: The growing integration of Artificial Intelligence (AI) into education has intensified the need for transparency and interpretability. While hackathons have long served as agile environments for rapid AI prototyping, few have directly addressed eXplainable AI (XAI) in real-world educational contexts. This paper presents a comprehensive analysis of the XAI Challenge 2025, a hackathon-style competition jointly organized by Ho Chi Minh City University of Technology (HCMUT) and the International Workshop on Trustworthiness and Reliability in Neurosymbolic AI (TRNS-AI), held as part of the International Joint Conference on Neural Networks (IJCNN 2025). The challenge tasked participants with building Question-Answering (QA) systems capable of answering student queries about university policies while generating clear, logic-based natural language explanations. To promote transparency and trustworthiness, solutions were required to use lightweight Large Language Models (LLMs) or hybrid LLM-symbolic systems. A high-quality dataset was provided, constructed via logic-based templates with Z3 validation and refined through expert student review to ensure alignment with real-world academic scenarios. We describe the challenge's motivation, structure, dataset construction, and evaluation protocol. Situating the competition within the broader evolution of AI hackathons, we argue that it represents a novel effort to bridge LLMs and symbolic reasoning in service of explainability. Our findings offer actionable insights for future XAI-centered educational systems and competitive research initiatives.

[14] Prompting Large Language Models with Partial Knowledge for Answering Questions with Unseen Entities

Zhichao Yan,Jiapu Wang,Jiaoyan Chen,Yanyan Wang,Hongye Tan,Jiye Liang,Xiaoli Li,Ru Li,Jeff Z. Pan

Main category: cs.CL

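TL;DR: This paper studies a new way to use partially relevant knowledge in retrieval-augmented generation systems, proposes a new perspective based on an awakening effect, and shows better results than traditional methods in practical applications.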

Details Motivation: Effectively using partially relevant knowledge remains a key challenge for RAG systems, especially when knowledge base retrieval is incomplete. Method: Partially relevant knowledge is constructed from the triplets on the gold reasoning path and their variants, with the path containing the answer removed. Experiments on two knowledge graph question answering datasets test the awakening-effect hypothesis. Result: The paper provides a theoretical analysis of the awakening effect in LLMs, supports the hypothesis with experiments, and introduces a new task, Unseen Entity KGQA, which simulates the real-world challenge of entity linking failing because the knowledge graph is incomplete. Conclusion: The proposed awakening-based approach is more effective in practical applications and performs better in settings where traditional methods tend to return noisy information. Abstract: Retrieval-Augmented Generation (RAG) shows impressive performance by supplementing and substituting parametric knowledge in Large Language Models (LLMs). Retrieved knowledge can be divided into three types: explicit answer evidence, implicit answer clue, and insufficient answer context which can be further categorized into totally irrelevant and partially relevant information. Effectively utilizing partially relevant knowledge remains a key challenge for RAG systems, especially in incomplete knowledge base retrieval. Contrary to the conventional view, we propose a new perspective: LLMs can be awakened via partially relevant knowledge already embedded in LLMs. To comprehensively investigate this phenomenon, the triplets located in the gold reasoning path and their variants are used to construct partially relevant knowledge by removing the path that contains the answer. We provide theoretical analysis of the awakening effect in LLMs and support our hypothesis with experiments on two Knowledge Graphs (KGs) Question Answering (QA) datasets. Furthermore, we present a new task, Unseen Entity KGQA, simulating real-world challenges where entity linking fails due to KG incompleteness. Our awakening-based approach demonstrates greater efficacy in practical applications, outperforming traditional methods that rely on embedding-based similarity, which are prone to returning noisy information.

[15] KEDAS: Knowledge Editing Alignment with Diverse Augmentation and Self-adaptive Inference

Chenming Tang,Yutong Yang,Yunfang Wu

Main category: cs.CL

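TL;DR: KEDAS is an efficient knowledge editing alignment method that substantially improves knowledge editing performance in large language models through low-rank adaptation and a self-adaptive inference mechanism.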

Details Motivation: Knowledge editing aims to efficiently modify outdated knowledge in large language models while retaining their powerful capabilities. Existing methods fall short in edit recall and in choosing the inference route, so a better knowledge editing alignment method is needed. Method: KEDAS uses low-rank adaptation to learn to apply in-context edited knowledge, and adds a diverse edit augmentation technique and a self-adaptive post-alignment inference mechanism in which a filter-based smart retriever dynamically selects the inference route. Result: In experiments on four datasets with three LLMs, KEDAS obtains the highest overall performance score in 35 of 36 cases. Its harmonic mean score (over edit success, locality, and portability) exceeds that of its strong knowledge editing alignment counterpart by about 19.8 points, and it significantly outperforms both parameter editing and retrieval-based baselines. Conclusion: KEDAS presents an ideal paradigm of knowledge editing alignment, efficiently modifying outdated knowledge while retaining the powerful capabilities of LLMs. Abstract: Knowledge editing aims to modify outdated knowledge in large language models (LLMs) efficiently while retaining their powerful capabilities. Most existing methods rely on either parameter-level editing or retrieval-based approaches. In this work, we propose Knowledge Editing alignment with Diverse Augmentation and Self-adaptive inference (KEDAS) to better align LLMs with knowledge editing. In the alignment phase, LLMs learn to apply in-context edited knowledge via low-rank adaptation. During editing, we design a diverse edit augmentation technique to improve the recall of edits. After that, a self-adaptive post-alignment inference mechanism is proposed, in which a filter-based smart retriever is employed to perform a dynamic selection of inference routing. Specifically, irrelevant queries will go through the original pre-alignment model directly, while relevant ones, together with their related edits, go through the model with aligned adapters activated. In experiments, KEDAS secures the highest overall performance scores in 35 out of 36 cases across four datasets with three LLMs on three settings, surpassing its strong knowledge editing alignment counterpart by about 19.8 harmonic mean scores of edit success, locality and portability and outperforming both parameter editing and retrieval-based baselines significantly. Analysis of computational cost and performance on general tasks further validates the robustness and efficiency of KEDAS, indicating that it presents an ideal paradigm of knowledge editing alignment.
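
Two pieces of the description above lend themselves to a short sketch: the overall score as a harmonic mean of edit success, locality, and portability, and the routing of queries to either the original model or the adapter-equipped model. The function names, the route labels, and the keyword-overlap relevance check are hypothetical placeholders; the paper's actual retriever and filter are not described in the abstract.

```python
from statistics import harmonic_mean

def overall_score(edit_success, locality, portability):
    """Harmonic mean of the three knowledge-editing metrics."""
    return harmonic_mean([edit_success, locality, portability])

def route(query, edits, is_relevant):
    """Pick an inference path for a query.

    Queries with at least one relevant edit go to the model with aligned
    adapters, together with those edits; all other queries go to the
    original model. `is_relevant` is a caller-supplied predicate standing
    in for the filter-based retriever.
    """
    related = [edit for edit in edits if is_relevant(query, edit)]
    if related:
        return "aligned_adapter", related
    return "base_model", []

def keyword_overlap(query, edit):
    """Toy relevance check: the query and edit share at least one word."""
    return bool(set(query.lower().split()) & set(edit.lower().split()))

toy_edits = ["The capital of Atlantis is Poseidonia"]
score = overall_score(90.0, 60.0, 45.0)  # approximately 60
path, used = route("What is the capital of Atlantis", toy_edits, keyword_overlap)
```

The harmonic mean penalizes a low value on any single metric more than an arithmetic mean would, which is presumably why it is used as an overall score: here the arithmetic mean would be 65, while the harmonic mean is about 60.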

[16] D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation

Weibo Zhou,Lingbo Li,Shangsong Liang

Main category: cs.CL

TL;DR: D-SCoRE is a fast, scalable, training-free method for generating domain-specific QA datasets using LLMs and prompt engineering, offering superior performance over traditional annotated datasets.

Details Motivation: The scarcity and high cost of domain-specific QA datasets for supervised fine-tuning of LLMs motivated the development of D-SCoRE, aiming to automate and improve QA dataset generation. Method: D-SCoRE employs a training-free pipeline using LLMs and prompt engineering, incorporating Document-centric processing, Segmentation, CoT Reasoning, and structured Export to generate QA datasets. Result: D-SCoRE successfully generates diverse, high-quality QA-CoT pairs with counterfactual materials, outperforming human-annotated datasets in most domains on benchmark tests. Conclusion: D-SCoRE offers a scalable and efficient solution for generating high-quality, domain-specific QA datasets without requiring manual annotation, outperforming existing methods. Abstract: The scarcity and high cost of high-quality question-answering (QA) datasets hinder supervised fine-tuning (SFT) for domain-specific large language models (LLMs). To address this, we introduce D-SCoRE, a training-free pipeline that utilizes LLMs and prompt engineering to produce diverse, high-quality QA datasets from arbitrary textual sources. D-SCoRE integrates Document-centric processing, Segmentation, CoT Reasoning, and structured Export to generate QA-CoT datasets tailored for domain-aware SFT. Multi-dimensional control mechanisms, such as semantic role transformation, question type balancing, and counterfactual materials, enhance diversity and relevance, overcoming limitations of existing QA generation. LLMs fine-tuned on D-SCoRE-generated QA datasets and on human-annotated QA datasets (SQuAD, Covid-QA) are evaluated on SQuADShifts and Covid-QA test sets, with D-SCoRE outperforming across most domains. D-SCoRE generates six QA-CoT pairs with four-option counterfactual materials per 100-200-word text in 90 seconds using an 8B LLM on consumer-grade hardware. Its simplicity and scalability enable efficient QA generation and high-performance fine-tuning across domains.

[17] LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points

Xuemiao Zhang,Can Ren,Chengying Tu,Rongxiang Weng,Hongfei Yan,Jingang Wang,Xunliang Cai

Main category: cs.CL

TL;DR: LinkSyn is a novel framework for synthesizing diverse QA data that improves LLM performance on key benchmarks.

Details Motivation: The scarcity of high-quality, diverse training data hinders the advancement of large language models (LLMs), necessitating a framework like LinkSyn to address this limitation. Method: LinkSyn, a KP graph-based synthesis framework, was used to construct a knowledge point graph, adjust path sampling probability, perform diffusion-based synthesis, and enhance high-difficulty QA within disciplines. Result: Continual pre-training with LinkQA leads to an average improvement of 11.51% on MMLU and CMMLU benchmarks, consistently enhancing performance across model sizes and FLOPs scales. Conclusion: LinkQA, a diverse multi-disciplinary QA dataset synthesized via LinkSyn, significantly improves the performance of Llama-3 8B on MMLU and CMMLU, achieving new SOTA results. Abstract: The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines by flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. 
Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of $\mathbf{11.51\%}$ on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model size and initial FLOPs scales.

[18] Large-Scale Diverse Synthesis for Mid-Training

Xuemiao Zhang,Chengying Tu,Can Ren,Rongxiang Weng,Hongfei Yan,Jingang Wang,Xunliang Cai

Main category: cs.CL

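TL;DR: This paper proposes a new method for synthesizing BoostQA, a large-scale QA dataset; using it in mid-training substantially improves LLM performance on multiple benchmarks, especially in STEM disciplines and on high-difficulty data.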

Details Motivation: Traditional corpora provide limited information, and the lack of high-quality, knowledge-intensive training data constrains the development of large language models. Existing synthetic QA data also faces scalability and knowledge-diversity problems, particularly in cross-domain settings. Method: A new diversified synthesis pipeline generates the BoostQA dataset: it curates seed data from heterogeneous sources, uses DeepSeek-R1 for STEM-focused multi-grade synthesis and high-difficulty synthesis, refines answer quality via DeepSeek-V3, and applies BoostQA in mid-training to optimize domain-specific knowledge acquisition. Result: After mid-training with BoostQA, Llama-3 8B improves by an average of 12.74% on MMLU and CMMLU and achieves SOTA average performance across 12 benchmarks. Conclusion: By synthesizing large-scale, diverse QA data, BoostQA significantly improves LLM performance on multiple benchmarks, especially in STEM disciplines and on high-difficulty data. Abstract: The scarcity of high-quality, knowledge-intensive training data hinders the development of large language models (LLMs), as traditional corpora provide limited information. Previous studies have synthesized and integrated corpora-dependent question-answering (QA) data to improve model performance but face challenges in QA data scalability and knowledge diversity, particularly in cross-domain contexts. Furthermore, leveraging our designed discipline and difficulty annotation system, we probe model deficiencies in STEM disciplines and high-difficulty data. To overcome these limitations, we propose a novel diversified pipeline to synthesize BoostQA, a 100B-token large-scale QA dataset. Our synthesis framework: (1) curates seed data from heterogeneous sources; (2) utilizes DeepSeek-R1 to implement STEM-focused multi-grade synthesis to boost data diversity and high-difficulty synthesis to mitigate difficulty degradation; (3) refines answers via DeepSeek-V3 to improve output quality. We utilize BoostQA in mid-training, a mid-stage between pre-training and post-training, to optimize domain-specific knowledge acquisition and enhance data quality. Our method enables Llama-3 8B, mid-trained on a 40B-token dataset, to achieve an average improvement of $\mathbf{12.74\%}$ on MMLU and CMMLU and establish SOTA average performance across 12 benchmarks. BoostQA also demonstrates robust scalability, with performance consistently improving as model size, data volume, and initial FLOPs scale.

[19] MaRGen: Multi-Agent LLM Approach for Self-Directed Market Research and Analysis

Roman Koshkin,Pengyu Dai,Nozomi Fujikawa,Masahito Togami,Marco Visentini-Scarzanella

Main category: cs.CL

TL;DR: This paper introduces an autonomous framework using LLMs to automate business analysis and market report generation, employing specialized agents and in-context learning from consultants. It achieves rapid, cost-effective report creation with high quality, showing potential for affordable market insights.

Details Motivation: The motivation behind this research is to automate the end-to-end process of business analysis and market report generation, making it more efficient and cost-effective. Method: The framework uses specialized agents (Researcher, Reviewer, Writer, and Retriever) that collaborate to analyze data and generate reports. In-context learning from professional consultants' materials is employed. The system follows a multi-step process: querying databases, data analysis, insight generation, visualization creation, and report composition. An LLM-based evaluation system and iterative improvement mechanism enhance report quality. Result: Experimental results indicate that the framework can generate detailed 6-page reports in 7 minutes at a cost of approximately $1. The study also found that report quality can be enhanced by both automated review cycles and the incorporation of consultants' unstructured knowledge. Conclusion: The paper concludes that their autonomous framework, leveraging LLMs and specialized agents, can efficiently generate high-quality market reports, offering affordable and rapid business insights. Abstract: We present an autonomous framework that leverages Large Language Models (LLMs) to automate end-to-end business analysis and market report generation. At its core, the system employs specialized agents - Researcher, Reviewer, Writer, and Retriever - that collaborate to analyze data and produce comprehensive reports. These agents learn from real professional consultants' presentation materials at Amazon through in-context learning to replicate professional analytical methodologies. The framework executes a multi-step process: querying databases, analyzing data, generating insights, creating visualizations, and composing market reports. We also introduce a novel LLM-based evaluation system for assessing report quality, which shows alignment with expert human evaluations. 
Building on these evaluations, we implement an iterative improvement mechanism that optimizes report quality through automated review cycles. Experimental results show that report quality can be improved by both automated review cycles and consultants' unstructured knowledge. In experimental validation, our framework generates detailed 6-page reports in 7 minutes at a cost of approximately \$1. Our work could be an important step to automatically create affordable market insights.

[20] MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs

Ahmad Rezaie Mianroodi,Amirali Rezaie,Niko Grisel Todorov,Cyril Rakovski,Frank Rudzicz

Main category: cs.CL

TL;DR: MedSynth is a new dataset for synthetic medical dialogues and notes that enhances automation in medical documentation tasks, addressing the lack of open-access and diverse training data.

Details Motivation: Physicians spend significant time documenting clinical encounters, contributing to professional burnout, which necessitates robust automation tools for medical documentation. Method: MedSynth introduces a novel dataset of synthetic medical dialogues and notes covering over 2000 ICD-10 codes, designed to improve the Dialogue-to-Note (Dial-2-Note) and Note-to-Dialogue (Note-2-Dial) tasks. Result: The MedSynth dataset demonstrates improved performance in generating medical notes from dialogues and vice versa. Conclusion: MedSynth provides a valuable resource for advancing automation tools in medical documentation where open-access, privacy-compliant, and diverse training data are scarce. Abstract: Physicians spend significant time documenting clinical encounters, a burden that contributes to professional burnout. To address this, robust automation tools for medical documentation are crucial. We introduce MedSynth -- a novel dataset of synthetic medical dialogues and notes designed to advance the Dialogue-to-Note (Dial-2-Note) and Note-to-Dialogue (Note-2-Dial) tasks. Informed by an extensive analysis of disease distributions, this dataset includes over 10,000 dialogue-note pairs covering over 2000 ICD-10 codes. We demonstrate that our dataset markedly enhances the performance of models in generating medical notes from dialogues, and dialogues from medical notes. The dataset provides a valuable resource in a field where open-access, privacy-compliant, and diverse training data are scarce. Code is available at https://github.com/ahmadrezarm/MedSynth/tree/main and the dataset is available at https://huggingface.co/datasets/Ahmad0067/MedSynth.

[21] ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations

Rania Al-Sabbagh

Main category: cs.CL

TL;DR: ArzEn-MultiGenre is a manually aligned parallel dataset of Egyptian Arabic and English texts across diverse genres, useful for translation models, linguistic research, and translator education.

Details Motivation: The motivation behind this work is to provide a new gold-standard dataset for Egyptian Arabic and English that covers underrepresented genres and supports various applications such as machine translation, linguistic studies, and translator training. Method: The paper introduces ArzEn-MultiGenre, a manually translated and aligned parallel dataset consisting of 25,557 segment pairs across multiple genres like song lyrics, novels, and TV show subtitles. Result: The dataset enables benchmarking of machine translation models, fine-tuning of language models in few-shot settings, cross-linguistic research, and pedagogical applications. Conclusion: ArzEn-MultiGenre serves as a valuable and versatile resource for machine translation, linguistic research, and educational purposes due to its unique composition and human expert alignment. Abstract: ArzEn-MultiGenre is a parallel dataset of Egyptian Arabic song lyrics, novels, and TV show subtitles that are manually translated and aligned with their English counterparts. The dataset contains 25,557 segment pairs that can be used to benchmark new machine translation models, fine-tune large language models in few-shot settings, and adapt commercial machine translation applications such as Google Translate. Additionally, the dataset is a valuable resource for research in various disciplines, including translation studies, cross-linguistic analysis, and lexical semantics. The dataset can also serve pedagogical purposes by training translation students and aid professional translators as a translation memory. The contributions are twofold: first, the dataset features textual genres not found in existing parallel Egyptian Arabic and English datasets, and second, it is a gold-standard dataset that has been translated and aligned by human experts.

[22] Discovering Bias Associations through Open-Ended LLM Generations

Jinhao Pan,Chahat Raj,Ziwei Zhu

Main category: cs.CL

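TL;DR: This study proposes BADF, a new framework for systematically discovering latent bias associations in open-ended LLM generations, offering a more comprehensive and flexible approach to bias analysis.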

Details Motivation: Existing evaluation methods rely on predefined identity-concept associations and struggle to uncover new or unexpected forms of bias, so a more flexible and systematic method is needed to reveal bias in open-ended generation. Method: The Bias Association Discovery Framework (BADF) maps and analyzes bias associations through comprehensive experiments spanning multiple models and diverse real-world contexts. Result: BADF robustly maps and analyzes the varied concepts that characterize demographic identities, advancing the understanding of bias in open-ended generation and providing a scalable analysis tool. Conclusion: The study presents a systematic approach (BADF) for extracting both known and previously unrecognized bias associations from open-ended LLM generations, providing a scalable tool for identifying and analyzing bias in LLMs. Abstract: Social biases embedded in Large Language Models (LLMs) raise critical concerns, resulting in representational harms -- unfair or distorted portrayals of demographic groups -- that may be expressed in subtle ways through generated language. Existing evaluation methods often depend on predefined identity-concept associations, limiting their ability to surface new or unexpected forms of bias. In this work, we present the Bias Association Discovery Framework (BADF), a systematic approach for extracting both known and previously unrecognized associations between demographic identities and descriptive concepts from open-ended LLM outputs. Through comprehensive experiments spanning multiple models and diverse real-world contexts, BADF enables robust mapping and analysis of the varied concepts that characterize demographic identities. Our findings advance the understanding of biases in open-ended generation and provide a scalable tool for identifying and analyzing bias associations in LLMs. Data, code, and results are available at https://github.com/JP-25/Discover-Open-Ended-Generation

[23] From Query to Logic: Ontology-Driven Multi-Hop Reasoning in LLMs

Haonan Bian,Yutao Qi,Rui Yang,Yuanxi Che,Jiaqian Wang,Heming Xia,Ranran Zhen

Main category: cs.CL

TL;DR: ORACLE improves complex question answering by combining Large Language Models with knowledge graphs to generate structured, logical reasoning chains.

Details Motivation: Large Language Models have limitations in complex multi-hop question answering tasks that require structured reasoning due to their inability to capture deep conceptual relationships between entities. Method: ORACLE operates in three stages: (1) dynamic construction of question-specific knowledge ontologies using LLMs, (2) transformation of these ontologies into First-Order Logic reasoning chains, and (3) systematic decomposition of the original query into logically coherent sub-questions. Result: Experimental results on standard MQA benchmarks show that ORACLE achieves highly competitive performance, rivaling state-of-the-art models like DeepSeek-R1, with more logical and interpretable reasoning chains. Conclusion: The ORACLE framework enhances complex multi-hop question answering by integrating the generative capabilities of Large Language Models with the structural benefits of knowledge graphs, leading to logical and interpretable reasoning chains. Abstract: Large Language Models (LLMs), despite their success in question answering, exhibit limitations in complex multi-hop question answering (MQA) tasks that necessitate non-linear, structured reasoning. This limitation stems from their inability to adequately capture deep conceptual relationships between entities. To overcome this challenge, we present **ORACLE** (**O**ntology-driven **R**easoning **A**nd **C**hain for **L**ogical **E**lucidation), a training-free framework that combines LLMs' generative capabilities with the structural benefits of knowledge graphs. Our approach operates through three stages: (1) dynamic construction of question-specific knowledge ontologies using LLMs, (2) transformation of these ontologies into First-Order Logic reasoning chains, and (3) systematic decomposition of the original query into logically coherent sub-questions.
Experimental results on several standard MQA benchmarks show that our framework achieves highly competitive performance, rivaling current state-of-the-art models like DeepSeek-R1. Detailed analyses further confirm the effectiveness of each component, while demonstrating that our method generates more logical and interpretable reasoning chains than existing approaches.

[24] Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data

Xinlin Zhuang,Feilong Tang,Haolin Yang,Ming Hu,Huifa Li,Haochen Xue,Yichen Li,Junjun He,Zongyuan Ge,Ying Qian,Imran Razzak

Main category: cs.CL

TL;DR: DIQ is a data selection strategy that improves medical reasoning efficiency in LLMs by prioritizing samples with high difficulty and influence, achieving better performance with less data.

Details Motivation: Existing SFT practices use unfiltered datasets leading to high computational costs and suboptimal performance. Current methods fail to balance sample difficulty and optimization utility, prompting the need for a more principled data selection approach. Method: DIQ selects data samples based on a combination of difficulty and influence, prioritizing those in the high-difficulty-high-influence quadrant. The approach is evaluated through experiments on medical reasoning benchmarks and human/LLM evaluations. Result: Models fine-tuned using DIQ on only 1% of selected data matched full-dataset performance, and using 10% of data consistently outperformed the baseline. DIQ-selected subsets showed higher data quality and better alignment with expert clinical reasoning. Conclusion: The proposed DIQ data selection strategy effectively balances complex clinical reasoning with gradient influence, leading to efficient medical reasoning with minimal fine-tuning data and outperforming baseline methods. Abstract: Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample's optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. 
Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, Human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms the baseline, highlighting the superiority of principled data selection over brute-force scaling. The code and data are available at https://github.com/mihara-bot/DIQ.
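
The quadrant idea described in this abstract can be illustrated with a minimal sketch. It assumes each sample already has a scalar difficulty score and a scalar influence score, and it uses the median of each score as the quadrant boundary; the function name `select_high_high`, the median cut-offs, and the toy scores are all illustrative assumptions, not the authors' scoring or thresholds.

```python
from statistics import median


def select_high_high(samples, difficulty, influence):
    """Keep samples whose difficulty AND influence both exceed the medians.

    The median cut-offs are an assumption made for this sketch; the abstract
    does not say how the quadrant boundaries are chosen.
    """
    d_cut = median(difficulty)
    i_cut = median(influence)
    return [
        s
        for s, d, i in zip(samples, difficulty, influence)
        if d > d_cut and i > i_cut
    ]


# Toy scores, invented for illustration only.
samples = ["a", "b", "c", "d"]
difficulty = [0.1, 0.9, 0.8, 0.2]
influence = [0.2, 0.7, 0.1, 0.9]
selected = select_high_high(samples, difficulty, influence)
```

With these toy scores only sample "b" lies in the high-difficulty, high-influence quadrant: "c" is difficult but has low influence, and "d" is influential but easy.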

[25] TreeDiff: AST-Guided Code Generation with Diffusion LLMs

Yiming Zeng,Jinghan Cao,Zexin Li,Yiming Chen,Tao Ren,Dawei Xiang,Xidong Wu,Shangqian Gao,Tingting Yu

Main category: cs.CL

TL;DR: The paper proposes a syntax-aware diffusion framework for code generation that incorporates structural priors from Abstract Syntax Trees (ASTs), leading to significant improvements in syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns.

Details Motivation: Standard token-level corruption techniques used in training diffusion models often ignore the strict syntactic and semantic rules of programming languages, which can hinder the model's ability to learn meaningful representations of code. Method: Syntax-aware diffusion framework that uses structural priors from Abstract Syntax Trees (ASTs) to selectively corrupt syntactically meaningful code spans during training. Result: Experimental results show that syntax-aware corruption significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns. Conclusion: Incorporating structural information into diffusion-based training improves syntactic correctness and generalization in code generation tasks. Abstract: Recent advances in diffusion-based language models have opened new possibilities for controllable and bidirectional sequence generation. These models provide an alternative to traditional autoregressive approaches by framing text generation as an iterative denoising process. However, applying diffusion models to structured domains such as source code remains a significant challenge. Programming languages differ from natural language in that they follow strict syntactic and semantic rules, with hierarchical organization that must be preserved for correctness. Standard token-level corruption techniques used during training often ignore this structure, which may hinder the model's ability to learn meaningful representations of code. To address this limitation, we propose a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the denoising process. Instead of masking individual tokens at random, we selectively corrupt syntactically meaningful code spans derived from AST subtrees. This enables the model to reconstruct programs in a way that respects grammatical boundaries and captures long-range dependencies. 
Experimental results demonstrate that syntax-aware corruption significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns. These findings highlight the potential of incorporating structural information into diffusion-based training and suggest that syntax-guided denoising is a promising direction for advancing diffusion-based language models in code generation tasks.
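
To make the contrast with random token masking concrete, here is a minimal sketch of span corruption driven by AST subtrees, using Python's standard `ast` module. It assumes Python source, ASCII text (so that `col_offset` byte offsets match character offsets), single-line expression spans, and a placeholder token `<mask>`; none of these choices come from the paper, and the function `mask_ast_span` is hypothetical.

```python
import ast
import random

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def mask_ast_span(source, rng):
    """Replace the text of one randomly chosen expression subtree with MASK.

    Assumes ASCII source so that AST column offsets (UTF-8 byte offsets)
    coincide with character offsets.
    """
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    # Candidate spans: expression nodes that start and end on the same line.
    spans = [
        (node.lineno, node.col_offset, node.end_col_offset)
        for node in ast.walk(tree)
        if isinstance(node, ast.expr) and node.lineno == node.end_lineno
    ]
    if not spans:
        return source
    lineno, start, end = rng.choice(spans)
    line = lines[lineno - 1]
    lines[lineno - 1] = line[:start] + MASK + line[end:]
    return "".join(lines)


masked = mask_ast_span("x = a + b\n", random.Random(0))
```

Because every candidate span is a whole subtree, the mask always covers a grammatical unit such as `a + b` or `a`, never a fragment that straddles an operator boundary.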

[26] Harnessing Collective Intelligence of LLMs for Robust Biomedical QA: A Multi-Model Approach

Dimitra Panou,Alexandros C. Dimopoulos,Manolis Koubarakis,Martin Reczko

Main category: cs.CL

TL;DR: This paper presents a biomedical question-answering system using ensemble open-source LLMs, achieving top performance in the BioASQ challenge by tailoring model combinations to question types.

Details Motivation: The exponential growth of biomedical literature necessitates advanced text mining and question-answering systems to efficiently retrieve and process information. Method: The authors employed retrieval-augmented generators based on open-source LLMs and used majority voting for Yes/No questions and the union of answers for list and factoid questions. They evaluated 13 LLMs and tailored pipelines for each question type. Result: The system achieved strong results in the 2025 BioASQ challenge, including 1st place for ideal answers and 2nd place for exact answers in the Synergy task Round 2, and shared 1st places in Rounds 3 and 4. Conclusion: The study concludes that combining multiple open-source LLMs using a majority voting system and union of answers enhances biomedical question-answering performance, particularly for specific question types. Abstract: Biomedical text mining and question-answering are essential yet highly demanding tasks, particularly in the face of the exponential growth of biomedical literature. In this work, we present our participation in the 13th edition of the BioASQ challenge, which involves biomedical semantic question-answering for Task 13b and biomedical question-answering for developing topics for the Synergy task. We deploy a selection of open-source large language models (LLMs) as retrieval-augmented generators to answer biomedical questions. Various models are used to process the questions. A majority voting system combines their output to determine the final answer for Yes/No questions, while for list and factoid type questions, the union of their answers is used. We evaluated 13 state-of-the-art open source LLMs, exploring all possible model combinations to contribute to the final answer, resulting in tailored LLM pipelines for each question type. Our findings provide valuable insight into which combinations of LLMs consistently produce superior results for specific question types.
In the four rounds of the 2025 BioASQ challenge, our system achieved notable results: in the Synergy task, we secured 1st place for ideal answers and 2nd place for exact answers in round 2, as well as two shared 1st places for exact answers in rounds 3 and 4.
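
The two aggregation rules named in this abstract, majority voting for Yes/No questions and a union of answers for list and factoid questions, can be sketched as follows. The function names, the case-insensitive normalization, and the tie-breaking rule are assumptions made for illustration; the abstract does not specify how the authors handle ties or near-duplicate answers.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent yes/no answer across models.

    Ties resolve to the answer seen first, which is an assumption of this
    sketch rather than a rule stated in the abstract.
    """
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][0]


def union_answers(answer_lists):
    """Merge list/factoid answers from several models.

    Duplicates are dropped case-insensitively; first-seen order is kept.
    """
    seen = set()
    merged = []
    for answers in answer_lists:
        for answer in answers:
            key = answer.strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(answer.strip())
    return merged


vote = majority_vote(["yes", "No", "YES"])
merged = union_answers([["BRCA1", "TP53"], ["tp53", "EGFR"]])
```

Here `vote` is "yes" (two of three models agree) and `merged` keeps each entity once, in the order the models first produced it.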

[27] TeSent: A Benchmark Dataset for Fairness-aware Explainable Sentiment Classification in Telugu

Vallabhaneni Raj Kumar,Ashwin S,Supriya Manna,Niladri Sett,Cheedella V S N M S Hema Harshitha,Kurakula Harshitha,Anand Kumar Sharma,Basina Deepakraj,Tanuj Sarkar,Bondada Navaneeth Krishna,Samanthapudi Shakeer

Main category: cs.CL

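TL;DR: This work introduces TeSent, a benchmark dataset for Telugu sentiment classification containing 26,150 sentences with human-annotated rationales, and shows that training with rationales can improve performance and explainability.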

Details Motivation: Telugu, one of India's major classical languages, lacks high-quality annotated resources in the global NLP and machine learning landscape, so a dedicated benchmark dataset is needed to advance NLP research for the language. Method: The authors build TeSent, a comprehensive benchmark of 26,150 Telugu sentences, and develop a custom annotation platform and protocol to collect ground-truth labels and human-annotated rationales. They fine-tune several pre-trained models and provide a detailed evaluation suite covering explainability and fairness. Result: Experiments indicate that training with rationales can improve model accuracy, reduce bias, and bring explainer outputs closer to human reasoning. Conclusion: TeSent provides a comprehensive benchmark for Telugu sentiment classification and shows that training with rationales can improve accuracy, reduce bias, and align explainer outputs more closely with human reasoning. Abstract: In the Indian subcontinent, Telugu, one of India's six classical languages, is the most widely spoken Dravidian Language. Despite its 96 million speaker base worldwide, Telugu remains underrepresented in the global NLP and Machine Learning landscape, mainly due to lack of high-quality annotated resources. This work introduces TeSent, a comprehensive benchmark dataset for sentiment classification, a key text classification problem, in Telugu. TeSent not only provides ground truth labels for the sentences, but also supplements with provisions for evaluating explainability and fairness, two critical requirements in modern-day machine learning tasks. We scraped Telugu texts covering multiple domains from various social media platforms, news websites and web-blogs to preprocess and generate 26,150 sentences, and developed a custom-built annotation platform and a carefully crafted annotation protocol for collecting the ground truth labels along with their human-annotated rationales. We then fine-tuned several SOTA pre-trained models in two ways: with rationales, and without rationales. Further, we provide a detailed plausibility and faithfulness evaluation suite, which exploits the rationales, for six widely used post-hoc explainers applied on the trained models. Lastly, we curate TeEEC, Equity Evaluation Corpus in Telugu, a corpus to evaluate fairness of Telugu sentiment and emotion related NLP tasks, and provide a fairness evaluation suite for the trained classifier models.
Our experimental results suggest that training with rationales may improve model accuracy, reduce bias in models, and make the explainers' output more aligned to human reasoning.

[28] The Homogenizing Effect of Large Language Models on Human Expression and Thought

Zhivar Sourati,Alireza S. Ziabari,Morteza Dehghani

Main category: cs.CL

TL;DR: Large language models may erode cognitive diversity, posing a threat to collective intelligence.

Details Motivation: To examine the potential impact of large language models on cognitive diversity. Method: A review that synthesizes evidence from linguistics, cognitive science, and computer science. Result: LLMs reflect and reinforce dominant language and reasoning styles while marginalizing alternative voices and strategies. Conclusion: Cognitive diversity is essential to creativity and collective intelligence, but the spread of LLMs may homogenize language and reasoning, weakening that diversity. Abstract: Cognitive diversity, reflected in variations of language, perspective, and reasoning, is essential to creativity and collective intelligence. This diversity is rich and grounded in culture, history, and individual experience. Yet as large language models (LLMs) become deeply embedded in people's lives, they risk standardizing language and reasoning. This Review synthesizes evidence across linguistics, cognitive, and computer science to show how LLMs reflect and reinforce dominant styles while marginalizing alternative voices and reasoning strategies. We examine how their design and widespread use contribute to this effect by mirroring patterns in their training data and amplifying convergence as all people increasingly rely on the same models across contexts. Unchecked, this homogenization risks flattening the cognitive landscapes that drive collective intelligence and adaptability.

[29] A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents

Clayton Cohn,Surya Rayala,Namrata Srivastava,Joyce Horn Fonteles,Shruti Jain,Xinying Luo,Divya Mereddy,Naveeduddin Mohammed,Gautam Biswas

Main category: cs.CL

TL;DR: The paper proposes a framework combining Evidence-Centered Design and Social Cognitive Theory to enhance the pedagogical effectiveness of LLM-based agents like Inquizzitor in STEM+C education, showing that theory-driven integration can provide adaptive and principled instruction.

Details Motivation: The motivation behind the study is to address the lack of theoretical foundation in current LLM systems like ChatGPT when used in educational settings, aiming to enhance their pedagogical effectiveness by integrating established learning theories. Method: The authors proposed a framework that merges Evidence-Centered Design with Social Cognitive Theory to guide the development of an LLM-based pedagogical agent named Inquizzitor. This agent was designed to provide formative assessment and feedback grounded in cognitive science principles. Result: The result shows that Inquizzitor successfully delivers high-quality, theory-aligned assessment and interaction, providing effective guidance to teachers and valuable feedback to students, thus demonstrating the potential of theory-driven LLM integration in education. Conclusion: The research concludes that combining Evidence-Centered Design with Social Cognitive Theory can effectively guide the integration of LLMs in education, particularly for STEM+C learning, resulting in pedagogically sound and adaptive systems like Inquizzitor. Abstract: Large language models (LLMs) present new opportunities for creating pedagogical agents that engage in meaningful dialogue to support student learning. However, the current use of LLM systems like ChatGPT in classrooms often lacks the solid theoretical foundation found in earlier intelligent tutoring systems. To bridge this gap, we propose a framework that combines Evidence-Centered Design with Social Cognitive Theory for adaptive scaffolding in LLM-based agents focused on STEM+C learning. We illustrate this framework with Inquizzitor, an LLM-based formative assessment agent that integrates human-AI hybrid intelligence and provides feedback grounded in cognitive science principles. Our findings show that Inquizzitor delivers high-quality assessment and interaction aligned with core learning theories, offering teachers effective guidance that students value. 
This research underscores the potential for theory-driven LLM integration in education, highlighting the ability of these systems to provide adaptive and principled instruction.

[30] MOPrompt: Multi-objective Semantic Evolution for Prompt Optimization

Sara Câmara,Eduardo Luz,Valéria Carvalho,Ivan Meneghini,Gladston Moreira

Main category: cs.CL

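TL;DR: This paper proposes MOPrompt, a multi-objective optimization framework that substantially reduces prompt context size while maintaining high accuracy.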

Details Motivation: Automatic prompt optimization has failed to explore the critical spectrum between efficiency and effectiveness, and manual prompt design is complex and time-consuming. Method: MOPrompt, a framework based on multi-objective evolutionary optimization (EMO). Result: MOPrompt substantially reduces token length while preserving accuracy, for example a 31% reduction on the Sabiazinho model. Conclusion: MOPrompt provides an effective multi-objective optimization framework for deploying LLMs in real-world applications. Abstract: Prompt engineering is crucial for unlocking the potential of Large Language Models (LLMs). Still, since manual prompt design is often complex, non-intuitive, and time-consuming, automatic prompt optimization has emerged as a research area. However, a significant challenge in prompt optimization is managing the inherent trade-off between task performance, such as accuracy, and context size. Most existing automated methods focus on a single objective, typically performance, thereby failing to explore the critical spectrum of efficiency and effectiveness. This paper introduces the MOPrompt, a novel Multi-objective Evolutionary Optimization (EMO) framework designed to optimize prompts for both accuracy and context size (measured in tokens) simultaneously. Our framework maps the Pareto front of prompt solutions, presenting practitioners with a set of trade-offs between context size and performance, a crucial tool for deploying Large Language Models (LLMs) in real-world applications. We evaluate MOPrompt on a sentiment analysis task in Portuguese, using Gemma-2B and Sabiazinho-3 as evaluation models. Our findings show that MOPrompt substantially outperforms the baseline framework. For the Sabiazinho model, MOPrompt identifies a prompt that achieves the same peak accuracy (0.97) as the best baseline solution, but with a 31% reduction in token length.
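
The Pareto front mentioned in this abstract can be illustrated with a small filter over candidate prompts. This is a sketch of the non-dominance test only, not of the evolutionary search; the function name `pareto_front`, the tuple layout, and the candidate values are invented for illustration.

```python
def pareto_front(candidates):
    """Return the non-dominated candidates.

    Each candidate is (prompt, accuracy, tokens). A candidate is dominated if
    another one is at least as accurate and at most as long, and strictly
    better on at least one of the two objectives.
    """
    front = []
    for prompt, acc, tok in candidates:
        dominated = any(
            (acc2 >= acc and tok2 <= tok) and (acc2 > acc or tok2 < tok)
            for _, acc2, tok2 in candidates
        )
        if not dominated:
            front.append((prompt, acc, tok))
    return front


# Invented values for illustration; not results from the paper.
candidates = [
    ("long", 0.97, 100),
    ("short", 0.97, 69),
    ("tiny", 0.90, 20),
    ("bad", 0.85, 50),
]
front = pareto_front(candidates)
```

In this toy set "long" is dominated by "short" (same accuracy, fewer tokens) and "bad" is dominated by "tiny", so the front holds the two remaining trade-offs between context size and accuracy.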

[31] Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models

Yujia Zheng,Tianhao Li,Haotian Huang,Tianyu Zeng,Jingyu Lu,Chuangxin Chu,Yuekai Huang,Ziyou Jiang,Qian Xiong,Yuyao Ge,Mingyang Li

Main category: cs.CL

TL;DR: This paper introduces PromptAnatomy and ComPerturb, which exploit prompt structure to generate adversarial examples, achieving strong performance in attacking LLMs and emphasizing the importance of structural awareness in robustness evaluation.

Details Motivation: Existing adversarial attack methods treat prompts as monolithic text, ignoring their structural heterogeneity. This paper aims to address this by dissecting prompts into components for more effective and interpretable attacks. Method: The paper introduces PromptAnatomy, which uses ComPerturb to selectively perturb prompt components and incorporates a perplexity-based filtering mechanism to ensure linguistic plausibility. It also annotates instruction-tuning datasets for validation. Result: Extensive experiments show that ComPerturb achieves state-of-the-art attack success rates. Ablation studies confirm the effectiveness of prompt dissection and PPL filtering. Conclusion: PromptAnatomy, a framework that dissects prompts into functional components and uses ComPerturb to generate adversarial examples, achieves state-of-the-art attack success rates and highlights the importance of prompt structure awareness in evaluating adversarial robustness in LLMs. Abstract: Prompt-based adversarial attacks have become an effective means to assess the robustness of large language models (LLMs). However, existing approaches often treat prompts as monolithic text, overlooking their structural heterogeneity: different prompt components contribute unequally to adversarial robustness. Prior works like PromptRobust assume prompts are value-neutral, but our analysis reveals that complex, domain-specific prompts with rich structures have components with differing vulnerabilities. To address this gap, we introduce PromptAnatomy, an automated framework that dissects prompts into functional components and generates diverse, interpretable adversarial examples by selectively perturbing each component using our proposed method, ComPerturb. To ensure linguistic plausibility and mitigate distribution shifts, we further incorporate a perplexity (PPL)-based filtering mechanism.
As a complementary resource, we annotate four public instruction-tuning datasets using the PromptAnatomy framework, verified through human review. Extensive experiments across these datasets and five advanced LLMs demonstrate that ComPerturb achieves state-of-the-art attack success rates. Ablation studies validate the complementary benefits of prompt dissection and PPL filtering. Our results underscore the importance of prompt structure awareness and controlled perturbation for reliable adversarial robustness evaluation in LLMs. Code and data are available at https://github.com/Yujiaaaaa/PACP.

[32] OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets

Maziyar Panahi

Main category: cs.CL

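TL;DR: This paper introduces OpenMed NER, an efficient approach to biomedical named-entity recognition that combines domain-adaptive pre-training with low-rank adaptation and achieves new state-of-the-art results on multiple benchmarks.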

Details Motivation: Despite advances in large language models, achieving state-of-the-art performance across diverse entity types while maintaining computational efficiency remains a significant challenge. Method: OpenMed NER combines lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA), training on DeBERTa-v3, PubMedBERT, and BioELECTRA backbones. Result: OpenMed NER achieves new state-of-the-art micro-F1 scores on 10 of 12 biomedical NER benchmarks, with notable gains on disease and chemical benchmarks. Conclusion: By combining lightweight domain-adaptive pre-training with parameter-efficient low-rank adaptation, OpenMed NER surpasses closed-source solutions and provides an efficient open-source solution for biomedical NER. Abstract: Named-entity recognition (NER) is fundamental to extracting structured information from the >80% of healthcare data that resides in unstructured clinical notes and biomedical literature. Despite recent advances with large language models, achieving state-of-the-art performance across diverse entity types while maintaining computational efficiency remains a significant challenge. We introduce OpenMed NER, a suite of open-source, domain-adapted transformer models that combine lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA). Our approach performs cost-effective DAPT on a 350k-passage corpus compiled from ethically sourced, publicly available research repositories and de-identified clinical notes (PubMed, arXiv, and MIMIC-III) using DeBERTa-v3, PubMedBERT, and BioELECTRA backbones. This is followed by task-specific fine-tuning with LoRA, which updates less than 1.5% of model parameters. We evaluate our models on 12 established biomedical NER benchmarks spanning chemicals, diseases, genes, and species. OpenMed NER achieves new state-of-the-art micro-F1 scores on 10 of these 12 datasets, with substantial gains across diverse entity types. Our models advance the state-of-the-art on foundational disease and chemical benchmarks (e.g., BC5CDR-Disease, +2.70 pp), while delivering even larger improvements of over 5.3 and 9.7 percentage points on more specialized gene and clinical cell line corpora. This work demonstrates that strategically adapted open-source models can surpass closed-source solutions.
This performance is achieved with remarkable efficiency: training completes in under 12 hours on a single GPU with a low carbon footprint (< 1.2 kg CO2e), producing permissively licensed, open-source checkpoints designed to help practitioners facilitate compliance with emerging data protection and AI regulations, such as the EU AI Act.
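The abstract notes that LoRA fine-tuning updates only a small fraction of model parameters. The sketch below illustrates why, using plain NumPy: a frozen weight matrix is augmented with two small trainable factors. The layer size, rank, and scaling are illustrative assumptions, not the configuration used for OpenMed NER.

```python
import numpy as np

def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Ratio of adapter parameters to the frozen matrix's parameters."""
    full = d_in * d_out            # parameters in the frozen weight W
    lora = rank * (d_in + d_out)   # parameters in A (rank x d_in) and B (d_out x rank)
    return lora / full

class LoRALinear:
    """Frozen linear layer plus a low-rank update: y = x W^T + scale * x A^T B^T."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

# Hypothetical layer size and rank, chosen only for illustration.
fraction = lora_trainable_fraction(d_in=1024, d_out=1024, rank=4)
print(f"trainable fraction for one matrix: {fraction:.4%}")

W = np.random.default_rng(1).normal(size=(16, 32))
layer = LoRALinear(W, rank=4)
x = np.ones((2, 32))
y = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the low-rank correction.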

[33] Authorship Attribution in Multilingual Machine-Generated Texts

Lucio La Cava,Dominik Macko,Róbert Móro,Ivan Srba,Andrea Tagarelli

Main category: cs.CL

TL;DR: This paper explores the challenges of multilingual authorship attribution, finding that current methods have limitations in cross-lingual transferability and robustness.

Details Motivation: The increasing fluency of LLMs necessitates distinguishing machine-generated text from human-written content, especially in multilingual contexts. Method: The study analyzed multilingual authorship attribution across 18 languages and 8 generators, including 7 LLMs and human-authored texts. Result: Certain monolingual AA methods can be adapted to multilingual settings, but challenges remain, particularly across diverse language families. Conclusion: Multilingual authorship attribution is a complex task, and current methods have limitations in cross-lingual transferability, highlighting the need for more robust approaches. Abstract: As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content becomes increasingly difficult. While early efforts in MGT detection have focused on binary classification, the growing landscape and diversity of LLMs require a more fine-grained yet challenging authorship attribution (AA), i.e., being able to identify the precise generator (LLM or human) behind a text. However, AA remains nowadays confined to a monolingual setting, with English being the most investigated one, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages -- covering multiple families and writing scripts -- and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods, their cross-lingual transferability, and the impact of generators on attribution performance. 
Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.

[34] CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions

Tae Soo Kim,Yoonjoo Lee,Yoonah Park,Jiho Kim,Young-Ho Kim,Juho Kim

Main category: cs.CL

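TL;DR: CUPID is a new benchmark for evaluating whether LLMs can infer user preferences from multi-turn interactions and generate responses that satisfy those preferences.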

Details Motivation: Personalization of large language models (LLMs) usually assumes static user preferences, whereas in reality human preferences are dynamic and vary with context. Models need to infer and apply these contextual preferences to ensure alignment. Method: The paper introduces CUPID, a benchmark of 756 human-curated interaction session histories between users and LLM-based chat assistants. Result: An evaluation of 10 open and proprietary LLMs finds that state-of-the-art LLMs struggle to infer preferences from multi-turn interactions and to identify which prior context is relevant to a new request, with precision under 50% and recall under 65%. Conclusion: CUPID highlights the shortcomings of state-of-the-art LLMs in inferring user preferences and identifying relevant prior context, and underscores the need to improve contextually personalized interactions. Abstract: Personalization of Large Language Models (LLMs) often assumes users hold static preferences that reflect globally in all tasks. In reality, humans hold dynamic preferences that change depending on the context. As users interact with an LLM in various contexts, they naturally reveal their contextual preferences, which a model must infer and apply in future contexts to ensure alignment. To assess this, we introduce CUPID, a benchmark of 756 human-curated interaction session histories between users and LLM-based chat assistants. In each interaction session, the user provides a request in a specific context and expresses their preference through multi-turn feedback. Given a new user request and prior interaction sessions, our benchmark assesses whether LLMs can infer the preference relevant to this request and generate a response that satisfies this preference. With CUPID, we evaluated 10 open and proprietary LLMs, revealing that state-of-the-art LLMs struggle to infer preferences from multi-turn interactions and fail to discern what previous context is relevant to a new request -- under 50% precision and 65% recall. Our work highlights the need to advance LLM capabilities for more contextually personalized interactions and proposes CUPID as a resource to drive these improvements.

[35] The Bidirectional Process Reward Model

Lingyin Zhang,Jun Gao,Xiaoxue Ren,Ziqiang Cao

Main category: cs.CL

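TL;DR: This paper proposes BiPRM, a bidirectional evaluation paradigm that addresses the limited use of global context in existing PRMs and performs strongly on multiple benchmarks.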

Details Motivation: Existing PRMs mainly adopt a unidirectional left-to-right evaluation paradigm, which limits their ability to leverage global context and makes it hard to verify the consistency of earlier steps based on later ones. Method: The paper proposes BiPRM, a novel bidirectional evaluation paradigm that combines the conventional left-to-right (L2R) flow with an added right-to-left (R2L) evaluation stream. Result: Extensive experiments on two mathematical reasoning benchmarks show that BiPRM outperforms unidirectional baselines in all settings, with up to a 31.9% improvement in stepwise reward evaluation. Conclusion: BiPRM effectively improves the reasoning quality of reward models and offers a new direction for process-based reward modeling. Abstract: Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning quality of Large Language Models (LLMs) by assigning fine-grained scores to intermediate reasoning steps within a solution trajectory. However, existing PRMs predominantly adopt a unidirectional left-to-right (L2R) evaluation paradigm, which limits their ability to leverage global context, making it challenging to verify the consistency of earlier steps based on later ones. In light of these challenges, we propose a novel bidirectional evaluation paradigm, named Bidirectional Process Reward Model (BiPRM). BiPRM seamlessly incorporates a parallel right-to-left (R2L) evaluation stream alongside the conventional L2R flow, enabling later reasoning steps to help assess earlier ones in real time. Notably, the built-in R2L evaluation is implemented solely through prompt modifications that reverse the original reasoning trajectory, without any additional parameters or inference latency introduced. This ensures BiPRM remains both efficient and broadly compatible with existing PRM studies. We conduct extensive experiments on two mathematical reasoning benchmarks using samples generated by three different policy models. Our method, BiPRM, is evaluated across three backbones and three distinct PRM objectives. Across all settings, BiPRM consistently outperforms unidirectional baselines, achieving up to a 31.9% improvement in stepwise reward evaluation. Generally, our results highlight BiPRM's effectiveness, robustness, and general applicability, offering a promising new direction for process-based reward modeling.
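The abstract says the R2L stream is obtained purely by reversing the reasoning trajectory in the prompt. A rough sketch of that idea is given below under stated assumptions: `score_steps` stands in for any step-level PRM call, `toy_scorer` is a hypothetical placeholder, and averaging the two streams is an assumption, since the abstract does not specify how they are combined.

```python
from typing import Callable, List

# A step scorer takes the question and an ordered list of steps and returns
# one score per step, in the order presented. This interface is hypothetical.
StepScorer = Callable[[str, List[str]], List[float]]

def bidirectional_step_scores(
    question: str, steps: List[str], score_steps: StepScorer
) -> List[float]:
    """Score steps left-to-right and right-to-left, then average per step."""
    l2r = score_steps(question, steps)
    # R2L: present the trajectory in reverse, then realign scores to the
    # original step order.
    r2l = score_steps(question, steps[::-1])[::-1]
    # Assumption: a simple mean of both streams.
    return [(a + b) / 2 for a, b in zip(l2r, r2l)]

def toy_scorer(question: str, steps: List[str]) -> List[float]:
    """Placeholder scorer whose confidence decays with position in the prompt."""
    return [1.0 / (i + 1) for i in range(len(steps))]

steps = ["Let x = 3.", "Then 2x = 6.", "So 2x + 1 = 7."]
scores = bidirectional_step_scores("What is 2x + 1 when x = 3?", steps, toy_scorer)
print(scores)
```

With a position-sensitive scorer, the first and last steps each receive one high and one low score, so neither direction alone determines a step's final score.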

[36] Collaborative Chain-of-Agents for Parametric-Retrieved Knowledge Synergy

Yi Jiang,Sendong Zhao,Jianbo Li,Haochun Wang,Lizhe Zhang,Yan Liu,Bin Qin

Main category: cs.CL

TL;DR: This paper proposes Collaborative Chain-of-Agents (CoCoA-zero and CoCoA) to enhance the synergy between parametric and retrieved knowledge in Retrieval-Augmented Generation, achieving better performance in knowledge-intensive tasks like QA.

Details Motivation: Current Retrieval-Augmented Generation methods struggle to fully exploit knowledge during generation, with limited synergy between internal parametric knowledge and external retrieved knowledge. Method: Collaborative Chain-of-Agents (CoCoA-zero and CoCoA) framework was developed, utilizing multi-agent reasoning and long-chain training to integrate parametric and retrieved knowledge. Result: The proposed framework achieves superior performance on open-domain and multi-hop QA tasks. Conclusion: CoCoA-zero and CoCoA demonstrate superior performance on open-domain and multi-hop question-answering tasks, highlighting the potential of enhancing synergy between parametric and retrieved knowledge. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising framework for enhancing the capabilities of Large Language Models (LLMs), especially in knowledge-intensive tasks. Despite its advantages, current RAG methods often struggle to *fully exploit knowledge during generation*. In particular, the synergy between the model's internal parametric knowledge and external retrieved knowledge remains limited. Retrieved contents may sometimes mislead generation, while certain generated content can guide the model toward more accurate outputs. In this work, we propose Collaborative Chain-of-Agents, a framework designed to explicitly enhance the synergy between parametric and retrieved knowledge. Specifically, we first introduce CoCoA-zero, a multi-agent RAG framework that first performs conditional knowledge induction and then reasons answers. Building on this, we develop CoCoA, a long-chain training strategy that synthesizes extended multi-agent reasoning trajectories from CoCoA-zero to fine-tune the LLM. This strategy enhances the model's capability to explicitly integrate and jointly leverage parametric and retrieved knowledge. Experimental results show that CoCoA-zero and CoCoA achieve superior performance on open-domain and multi-hop QA tasks.

[37] Am I Blue or Is My Hobby Counting Teardrops? Expression Leakage in Large Language Models as a Symptom of Irrelevancy Disruption

Berkay Köprü,Mehrzad Mashal,Yigit Gurses,Akos Kadar,Maximilian Schmitt,Ditty Mathew,Felix Burkhardt,Florian Eyben,Björn W. Schuller

Main category: cs.CL

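TL;DR: This paper studies expression leakage in large language models, introducing the phenomenon and an evaluation method, and examines ways to mitigate it.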

Details Motivation: Although large language models have advanced natural language processing, they are prone to introducing irrelevant information, in particular the newly identified problem of expression leakage. Method: The authors collect a benchmark dataset, propose a scheme to automatically generate a dataset from free-form text, and design an automatic evaluation pipeline that correlates well with human judgment. Result: Experiments show that expression leakage decreases as model parameter count grows; that mitigating it requires specific care during model building and cannot be achieved through prompting; and that negative-sentiment prompts induce expression leakage more readily than positive ones. Conclusion: Expression leakage is a problem that warrants attention; it must be addressed specifically during model design, and its mitigation cannot rely on prompting. Abstract: Large language models (LLMs) have advanced natural language processing (NLP) skills through mechanisms such as next-token prediction and self-attention, but their ability to integrate broad context also makes them prone to incorporating irrelevant information. Prior work has focused on semantic leakage, bias introduced by semantically irrelevant context. In this paper, we introduce expression leakage, a novel phenomenon where LLMs systematically generate sentimentally charged expressions that are semantically unrelated to the input context. To analyse the expression leakage, we collect a benchmark dataset along with a scheme to automatically generate a dataset from free-form text from common-crawl. In addition, we propose an automatic evaluation pipeline that correlates well with human judgment, which accelerates the benchmarking by decoupling from the need of annotation for each analysed model. Our experiments show that, as the model scales in the parameter space, the expression leakage reduces within the same LLM family. On the other hand, we demonstrate that expression leakage mitigation requires specific care during the model building process, and cannot be mitigated by prompting. In addition, our experiments indicate that, when negative sentiment is injected in the prompt, it disrupts the generation process more than the positive sentiment, causing a higher expression leakage rate.

[38] CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications

Raviraj Joshi,Rakesh Paul,Kanishk Singla,Anusha Kamath,Michael Evans,Katherine Luna,Shaona Ghosh,Utkarsh Vaidya,Eileen Long,Sanjay Singh Chauhan,Niranjan Wartikar

Main category: cs.CL

TL;DR: CultureGuard introduces a four-stage pipeline for curating culturally aligned, high-quality safety datasets across multiple languages, enabling training of a multilingual safety guard model that achieves state-of-the-art performance on multilingual content safety benchmarks.

Details Motivation: The increasing use of Large Language Models in agentic applications highlights the need for robust safety guard models, particularly in non-English languages where advancements are lacking due to high costs of collecting culturally aligned labeled datasets. Method: The approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. Result: The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. Conclusion: The work represents a significant step toward closing the safety gap in multilingual LLMs by enabling the development of culturally aware safety guard models. Abstract: The increasing use of Large Language Models (LLMs) in agentic applications highlights the need for robust safety guard models. While content safety in English is well-studied, non-English languages lack similar advancements due to the high cost of collecting culturally aligned labeled datasets. We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. This pipeline enables the conversion and expansion of the Nemotron-Content-Safety-Dataset-V2 English safety dataset into eight distinct languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting dataset, Nemotron-Content-Safety-Dataset-Multilingual-v1, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-Multilingual-8B-v1 via LoRA-based fine-tuning. The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. 
We also benchmark the latest open LLMs on multilingual safety and observe that these LLMs are more prone to give unsafe responses when prompted in non-English languages. This work represents a significant step toward closing the safety gap in multilingual LLMs by enabling the development of culturally aware safety guard models.

[39] Enhancing the Preference Extractor in Multi-turn Dialogues: From Annotating Disasters to Accurate Preference Extraction

Cheng Wang,ziru Liu,Pengcheng Tang,Mingyu Zhang,Quanyu Dai,Yue Zhu

Main category: cs.CL

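TL;DR: IterChat improves annotation efficiency and model performance for user preference extraction by decomposing multi-turn dialogues into one-turn dialogues.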

Details Motivation: Accurately tracking user preferences across multi-turn dialogues demands extensive domain knowledge and contextual consistency, making annotation complex and error-prone; a more efficient data generation method is therefore needed. Method: The paper proposes IterChat, a new dialogue data generation framework that decomposes multi-turn dialogue data into one-turn dialogues with historical preferences and uses GPT4 to pre-define preference slots for generating a diverse dataset. Result: Experiments show that fine-tuning or few-shot prompting with the IterChat data format outperforms the original multi-turn dialogues, and annotation efficiency improves by 28.4%. Conclusion: By decomposing multi-turn preference extraction into one-turn processes, the IterChat framework improves annotation efficiency and model performance, addressing the challenge of obtaining high-quality labeled multi-turn dialogue data. Abstract: Identifying user preferences in dialogue systems is a pivotal aspect of providing satisfying services. Current research shows that using large language models (LLMs) to fine-tune a task-specific preference extractor yields excellent results in terms of accuracy and generalization. However, the primary challenge stems from the inherent difficulty in obtaining high-quality labeled multi-turn dialogue data. Accurately tracking user preference transitions across turns not only demands intensive domain expertise and contextual consistency maintenance for annotators (termed **"Annotating Disaster"**) but also complicates model training due to error propagation in sequential dependency learning. Inspired by the observation that multi-turn preference extraction can be decomposed into iterative executions of one-turn extraction processes, we propose a novel dialogue data generation framework named **IterChat**. First, we construct a new data format that categorizes the dialogue data into attributed historical preferences and one-turn dialogues. This reduces the probability of annotation errors and improves annotation efficiency. Then, to generate a high-quality and diverse dialogue dataset, we adopt GPT4 to pre-define the preference slots in the target preference extractor task and then randomly sample the subset of the slots and their corresponding schema values to create the dialogue datasets. Experimental results indicate that fine-tuning or only few-shot prompting with the new dialogue format yields superior performance compared to the original multi-turn dialogues.
Additionally, the new data format improves annotator efficiency, with a win rate 28.4% higher than that of the original multi-turn dialogues.

[40] AI-Generated Text is Non-Stationary: Detection via Temporal Tomography

Alva West,Yixuan Weng,Minjun Zhu,Luodan Zhang,Zhen Lin,Guangsheng Bao,Yue Zhang

Main category: cs.CL

TL;DR: The paper introduces Temporal Discrepancy Tomography (TDT), which leverages signal processing techniques to detect AI-generated text by preserving positional information, achieving better performance and robustness against adversarial attacks.

Details Motivation: Current approaches in AI-generated text detection aggregate token-level measurements into scalar scores, discarding positional information about where anomalies occur. This fundamental limitation leads to failure against localized adversarial perturbations. Method: The authors introduce Temporal Discrepancy Tomography (TDT), a novel detection paradigm that treats token-level discrepancies as a time-series signal and applies Continuous Wavelet Transform to generate a two-dimensional time-scale representation. Result: On the RAID benchmark, TDT achieves 0.855 AUROC (7.1% improvement over the best baseline). It demonstrates robust performance on adversarial tasks, with a 14.1% AUROC improvement on HART Level 2 paraphrasing attacks, while maintaining practical efficiency with only 13% computational overhead. Conclusion: The work establishes non-stationarity as a fundamental characteristic of AI-generated text and demonstrates that preserving temporal dynamics is essential for robust detection. Abstract: The field of AI-generated text detection has evolved from supervised classification to zero-shot statistical analysis. However, current approaches share a fundamental limitation: they aggregate token-level measurements into scalar scores, discarding positional information about where anomalies occur. Our empirical analysis reveals that AI-generated text exhibits significant non-stationarity: statistical properties vary by 73.8% more between text segments compared to human writing. This discovery explains why existing detectors fail against localized adversarial perturbations that exploit this overlooked characteristic. We introduce Temporal Discrepancy Tomography (TDT), a novel detection paradigm that preserves positional information by reformulating detection as a signal processing task.
TDT treats token-level discrepancies as a time-series signal and applies Continuous Wavelet Transform to generate a two-dimensional time-scale representation, capturing both the location and linguistic scale of statistical anomalies. On the RAID benchmark, TDT achieves 0.855 AUROC (7.1% improvement over the best baseline). More importantly, TDT demonstrates robust performance on adversarial tasks, with 14.1% AUROC improvement on HART Level 2 paraphrasing attacks. Despite its sophisticated analysis, TDT maintains practical efficiency with only 13% computational overhead. Our work establishes non-stationarity as a fundamental characteristic of AI-generated text and demonstrates that preserving temporal dynamics is essential for robust detection.
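To make the time-scale idea concrete, the sketch below applies a continuous wavelet transform to a synthetic per-token discrepancy signal using only NumPy. The Ricker (Mexican hat) wavelet, the scales, and the synthetic signal are all assumptions for illustration; the abstract does not state which wavelet or discrepancy measure TDT uses.

```python
import numpy as np

def ricker(points: int, a: float) -> np.ndarray:
    """Ricker (Mexican hat) wavelet of width `a`, sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))

def cwt(signal: np.ndarray, scales) -> np.ndarray:
    """Return a (num_scales, num_tokens) map by convolving with scaled wavelets."""
    out = np.empty((len(scales), len(signal)))
    for i, a in enumerate(scales):
        n = min(10 * int(a), len(signal))  # truncate the wavelet support
        out[i] = np.convolve(signal, ricker(n, a), mode="same")
    return out

# Synthetic per-token discrepancy signal (hypothetical): flat, with a short
# localized anomaly at token positions 40..43.
signal = np.zeros(64)
signal[40:44] = 1.0

scales = [1, 2, 4, 8]
coeffs = cwt(signal, scales)

# The strongest response at scale 2 falls inside the anomalous region.
peak_position = int(np.argmax(np.abs(coeffs[1])))
print(coeffs.shape, peak_position)
```

A single scalar score averaged over all 64 tokens would dilute this four-token anomaly, whereas the two-dimensional map retains both where it occurs and at what scale it is most visible.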

[41] A comprehensive taxonomy of hallucinations in Large Language Models

Manuel Cossio

Main category: cs.CL

TL;DR: This report provides a comprehensive analysis of hallucinations in large language models, offering a taxonomy, theoretical framework, and strategies for detection and mitigation.

Details Motivation: Hallucinations in LLMs pose a critical challenge due to their plausible yet factually incorrect outputs, necessitating a deeper understanding and systematic approach to address the issue. Method: A comprehensive taxonomy and theoretical framework were developed, including definitions, classifications, and analyses of causes and mitigation strategies. Result: A formal taxonomy of hallucinations was established, distinguishing between intrinsic/extrinsic and factuality/faithfulness dimensions, and identifying specific manifestations, causes, evaluation metrics, and mitigation strategies. Conclusion: LLM hallucinations are an inherent and inevitable aspect of computable models, necessitating robust detection, mitigation, and human oversight for their responsible use. Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their propensity for hallucination, generating plausible but factually incorrect or fabricated content, remains a critical challenge. This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a formal definition and a theoretical framework that posits its inherent inevitability in computable LLMs, irrespective of architecture or training. It explores core distinctions, differentiating between intrinsic (contradicting input context) and extrinsic (inconsistent with training data or reality), as well as factuality (absolute correctness) and faithfulness (adherence to input). The report then details specific manifestations, including factual errors, contextual and logical inconsistencies, temporal disorientation, ethical violations, and task-specific hallucinations across domains like code generation and multimodal applications. It analyzes the underlying causes, categorizing them into data-related issues, model-related factors, and prompt-related influences. 
Furthermore, the report examines cognitive and human factors influencing hallucination perception, surveys evaluation benchmarks and metrics for detection, and outlines architectural and systemic mitigation strategies. Finally, it introduces web-based resources for monitoring LLM releases and performance. This report underscores the complex, multifaceted nature of LLM hallucinations and emphasizes that, given their theoretical inevitability, future efforts must focus on robust detection, mitigation, and continuous human oversight for responsible and reliable deployment in critical applications.

[42] HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark

Amir DN Cohen,Hilla Merhav,Yoav Goldberg,Reut Tsarfaty

Main category: cs.CL

TL;DR: This paper introduces HeQ, a Hebrew Machine Reading Comprehension dataset, addressing challenges in Hebrew NLP by creating new guidelines, a crowdsourcing protocol, and improved evaluation metrics.

Details Motivation: Current Hebrew NLP benchmarks neglect semantic language understanding, prompting the need for a Hebrew MRC dataset. Method: Constructed HeQ dataset with novel guidelines, crowdsourcing protocol, and revised evaluation metrics for Hebrew MRC tasks. Result: HeQ dataset contains 30,147 question-answer pairs; standard metrics like F1 and EM are unsuitable for Hebrew, and morpho-syntactic model performance poorly correlates with MRC performance. Conclusion: Hebrew Machine Reading Comprehension dataset presents challenges for Hebrew NLP, showing the need for improved evaluation metrics and models for Hebrew and other morphologically rich languages. Abstract: Current benchmarks for Hebrew Natural Language Processing (NLP) focus mainly on morpho-syntactic tasks, neglecting the semantic dimension of language understanding. To bridge this gap, we set out to deliver a Hebrew Machine Reading Comprehension (MRC) dataset, where MRC is to be realized as extractive Question Answering. The morphologically rich nature of Hebrew poses a challenge to this endeavor: the indeterminacy and non-transparency of span boundaries in morphologically complex forms lead to annotation inconsistencies, disagreements, and flaws in standard evaluation metrics. To remedy this, we devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics that are suitable for the morphologically rich nature of the language. Our resulting benchmark, HeQ (Hebrew QA), features 30,147 diverse question-answer pairs derived from both Hebrew Wikipedia articles and Israeli tech news. Our empirical investigation reveals that standard evaluation metrics such as F1 scores and Exact Match (EM) are not appropriate for Hebrew (and other MRLs), and we propose a relevant enhancement. 
In addition, our experiments show low correlation between models' performance on morpho-syntactic tasks and on MRC, which suggests that models designed for the former might underperform on semantics-heavy tasks. The development and exploration of HeQ illustrate some of the challenges MRLs pose in natural language understanding (NLU), fostering progression towards more and better NLU models for Hebrew and other MRLs.
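For reference, the standard extractive-QA metrics that the abstract calls unsuitable for Hebrew can be sketched as follows. This is a simplified whitespace-token version without the usual normalization steps, and the Hebrew example is an illustration chosen here, not one taken from the paper; the paper's revised metric is not reproduced.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the predicted span equals the gold span exactly, else 0.0."""
    return float(pred.strip() == gold.strip())

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall (whitespace tokens)."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# English: an extra word still earns partial credit.
english_f1 = token_f1("the red house", "red house")

# Hebrew: the definite article attaches to the noun as a prefix, so
# "הבית" ("the house") and "בית" ("house") share no whitespace token.
hebrew_f1 = token_f1("הבית", "בית")
hebrew_em = exact_match("הבית", "בית")
print(english_f1, hebrew_f1, hebrew_em)
```

The near-identical Hebrew answer scores zero on both metrics, which illustrates the kind of span-boundary problem in morphologically rich languages that motivates the revised metrics.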

[43] AGENTICT$^2$S: Robust Text-to-SPARQL via Agentic Collaborative Reasoning over Heterogeneous Knowledge Graphs for the Circular Economy

Yang Zhao,Chengxiao Dai,Wei Zhuo,Tan Chuan Fu,Yue Xiu,Dusit Niyato,Jonathan Z. Low,Eugene Ho Hong Zhuang,Daren Zong Loong Tan

Main category: cs.CL

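TL;DR: AgenticT$^2$S is a modular framework that decomposes KGQA and uses specialized agents and a verifier to query across multiple knowledge graphs, substantially improving question-answering accuracy and efficiency in low-resource domains.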

Details Motivation: Existing text-to-SPARQL approaches rely on large-scale domain-specific fine-tuning or operate in single-graph settings, limiting their generalizability in low-resource domains and their ability to handle queries spanning multiple graphs, while in domains such as the circular economy information is distributed across multiple independently maintained knowledge graphs. Method: The AgenticT$^2$S framework decomposes KGQA into subtasks managed by specialized agents for retrieval, query generation, and verification; a scheduler assigns subgoals to different graphs using weak-to-strong alignment strategies, and a verifier detects invalid queries through symbolic validation and counterfactual consistency checks. Result: Experiments on real-world circular economy KGs show that AgenticT$^2$S improves execution accuracy by 17.3% and triple-level F$_1$ by 25.4% over the best baseline, while reducing average prompt length by 46.4%. Conclusion: Through agent-based schema-aware reasoning, the AgenticT$^2$S framework substantially improves KGQA execution accuracy and triple-level F$_1$ in the circular economy domain while reducing average prompt length, supporting scalable KGQA. Abstract: Question answering over heterogeneous knowledge graphs (KGQA) involves reasoning across diverse schemas, incomplete alignments, and distributed data sources. Existing text-to-SPARQL approaches rely on large-scale domain-specific fine-tuning or operate within single-graph settings, limiting their generalizability in low-resource domains and their ability to handle queries spanning multiple graphs. These challenges are particularly relevant in domains such as the circular economy, where information about classifications, processes, and emissions is distributed across independently curated knowledge graphs (KGs). We present AgenticT$^2$S, a modular framework that decomposes KGQA into subtasks managed by specialized agents responsible for retrieval, query generation, and verification. A scheduler assigns subgoals to different graphs using weak-to-strong alignment strategies. A two-stage verifier detects structurally invalid and semantically underspecified queries through symbolic validation and counterfactual consistency checks. Experiments on real-world circular economy KGs demonstrate that AgenticT$^2$S improves execution accuracy by 17.3% and triple-level F$_1$ by 25.4% over the best baseline, while reducing the average prompt length by 46.4%. These results demonstrate the benefits of agent-based schema-aware reasoning for scalable KGQA and support decision-making in sustainability domains through robust cross-graph reasoning.

[44] MLP Memory: Language Modeling with Retriever-pretrained External Memory

Rubin Wei,Jiaqi Cao,Jiarui Wang,Jushi Kai,Qipeng Guo,Bowen Zhou,Zhouhan Lin

Main category: cs.CL

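TL;DR: This paper proposes an architecture that pairs a Transformer decoder with a pretrained external MLP memory module, mitigating hallucinations of large language models on knowledge-intensive tasks while delivering stronger performance across multiple tasks and faster inference.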

Details Motivation: Modern decoder-only large language models (LLMs) perform well across many domains, but hallucinations in their generated text hinder their use in knowledge-intensive tasks. Retrieval-augmented generation (RAG) offers a remedy, but the non-parametric nature of its retriever limits deep interaction with the LLM. The paper therefore aims for an architecture that enables deep interaction and mitigates hallucinations. Method: The authors use a pretrained MLP as an external memory module that imitates the behavior of a retriever over the entire pretraining dataset. Unlike a conventional non-parametric retriever, this memory can interact deeply with the LLM decoder. Result: The architecture exhibits steeper power-law scaling as model size grows, with improvements of 17.5% and 24.1% on the WikiText-103 and Web datasets, respectively. It performs strongly on three hallucination benchmarks and nine memory-intensive tasks, with inference 80 times faster than kNN-LM and 1.3 times faster than decoder-only models. Conclusion: The paper proposes an architecture that decouples memorization from the LLM decoder through a pretrained, differentiable external memory, mitigating hallucinations in knowledge-intensive tasks and showing strong performance and inference speed across multiple tasks. Abstract: While modern decoder-only LLMs achieve superior performance across various domains, hallucinations have risen to be a common problem in their generated text, hindering their application in knowledge-intensive tasks. Retriever-augmented generation (RAG) offers a solution, but the non-parametric nature of the retriever hinders its deep interaction with the LLM. In this work, we propose to decouple memorization from the LLM decoder using a pretrained, differentiable external memory. The external memory is an MLP pretrained by imitating the behavior of a retriever on the entire pretraining dataset. Our resulting architecture, which comprises a transformer decoder and an external MLP memory pretrained on language modeling and retriever imitation respectively, demonstrates strong perplexity and performance on downstream tasks. Experiments show our architecture exhibits steeper power-law scaling with model size, achieving 17.5% and 24.1% improvement on WikiText-103 and Web datasets compared to decoder-only models while benefiting from added training without overfitting. We demonstrate superior performance on three hallucination benchmarks and nine memory-intensive tasks. Additionally, our approach delivers $80\times$ speedup over $k$NN-LM (500M tokens) and $1.3\times$ faster inference than decoder-only models. Unlike $k$NN-LM, which impairs reasoning, our MLP memory improves StrategyQA performance. We will open-source our code and models in the future.

[45] Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents

Yuhan Guo,Cong Guo,Aiwen Sun,Hongliang He,Xinyu Yang,Yue Lu,Yingji Zhang,Xuntao Guo,Dong Zhang,Jianzhuang Liu,Jiang Duan,Yijia Xiao,Liangjian Wen,Hai-Ming Xu,Yong Dai

Main category: cs.CL

TL;DR: This paper proposes the Web-CogKnowledge framework and the Web-CogDataset, develops the Web-CogReasoner agent, and reports significant gains on cognitive reasoning tasks.

Details Motivation: Although multimodal large-scale models have advanced web agents, an agent must first acquire sufficient knowledge before it can reason cognitively in an effective way. Method: The authors build the Web-CogKnowledge framework and the Web-CogDataset, and develop Web-CogReasoner on top of a knowledge-driven Chain-of-Thought (CoT) reasoning framework. Result: Web-CogReasoner performs strongly on the Web-CogBench evaluation, with a marked advantage on unseen tasks that require structured knowledge. Conclusion: With the Web-CogKnowledge framework and the Web-CogDataset, Web-CogReasoner significantly outperforms existing models on web tasks, especially unseen tasks that require structured knowledge. Abstract: Multimodal large-scale models have significantly advanced the development of web agents, enabling perception and interaction with digital environments akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose Web-CogKnowledge Framework, categorizing knowledge as Factual, Conceptual, and Procedural. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on the first two knowledge types, representing the "what" of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the "how" of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill core knowledge necessary for web agents. This dataset serves as the agent's conceptual grounding (the "nouns" upon which comprehension is built) as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, especially in generalizing to unseen tasks where structured knowledge is decisive.
To enable rigorous evaluation, we introduce the Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Our code and data are open-sourced at https://github.com/Gnonymous/Web-CogReasoner

[46] Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models

Yijun Feng

Main category: cs.CL

TL;DR: This paper introduces Counterfactual Probing, a method to detect and reduce LLM hallucinations by testing model sensitivity to subtle factual perturbations, achieving strong performance without retraining.

Details Motivation: LLMs often produce hallucinations, that is, fluent but factually incorrect or unsupported outputs. This poses challenges in applications requiring factual accuracy, motivating the need for effective detection and mitigation strategies. Method: Counterfactual Probing dynamically generates counterfactual statements with subtle factual errors and evaluates the model's sensitivity to these perturbations, leveraging the hypothesis that genuine knowledge is robust to such variations while hallucinated content is not. Result: Counterfactual Probing outperformed baseline methods in hallucination detection on datasets like TruthfulQA and reduced hallucination scores by an average of 24.5% through adaptive mitigation strategies. Conclusion: Counterfactual Probing effectively detects and mitigates hallucinations in LLM outputs without requiring model retraining, offering a practical solution for improving the reliability of LLM-generated content. Abstract: Large Language Models have demonstrated remarkable capabilities across diverse tasks, yet they frequently generate hallucinations: outputs that are fluent but factually incorrect or unsupported. We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in LLM outputs. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations. We hypothesize that genuine knowledge exhibits robustness to counterfactual variations, while hallucinated content shows inconsistent confidence patterns when confronted with plausible alternatives. Our comprehensive evaluation on TruthfulQA, factual statement datasets, and curated hallucination examples demonstrates that counterfactual probing achieves superior detection performance compared to baseline methods, while our adaptive mitigation strategies reduce hallucination scores by an average of 24.5%.
The approach requires no model retraining and can be integrated into existing LLM pipelines as a real-time verification mechanism.

[47] Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language

Jaskaranjeet Singh,Rakesh Thakur

Main category: cs.CL

TL;DR: This paper introduces PunGPT2, Pun-RAG, and Pun-Instruct to advance NLP for low-resource languages like Punjabi, with Quantum-RAG pioneering quantum-aware retrieval methods.

Details Motivation: Low-resource languages are excluded from NLP advancements despite progress in large language models. This work aims to bridge the gap by introducing open-source Punjabi language models and novel retrieval techniques. Method: PunGPT2 was trained on a 35GB Punjabi corpus with a custom tokenizer. Pun-RAG integrates retrieval-augmented generation with a FAISS retriever, while Pun-Instruct uses QLoRA for instruction tuning. Quantum-RAG combines sparse and dense retrieval with quantum-inspired semantic matching. Result: The models significantly outperformed strong multilingual baselines (mBERT, mT5, MuRIL) in perplexity, factuality, and fluency. Quantum-RAG demonstrated improved contextual relevance with minimal memory overhead. Conclusion: PunGPT2, Pun-RAG, and Pun-Instruct provide a scalable and reproducible approach to extend LLM capabilities to underrepresented languages like Punjabi, outperforming multilingual baselines in perplexity, factuality, and fluency. Abstract: Despite the rapid advancement of large language models (LLMs), low-resource languages remain largely excluded from the NLP landscape. We present PunGPT2, the first fully open-source suite of Punjabi large language models, trained from scratch on a 35GB domain-diverse corpus encompassing literature, religious texts, news, and social discourse. Unlike prior multilingual approaches, PunGPT2 captures rich syntactic and morphological features unique to Punjabi through a tokenizer optimised with byte pair encoding and linguistically aligned pretraining objectives. To improve factual grounding and domain recall, we introduce Pun-RAG, a retrieval-augmented generation framework combining PunGPT2 with a dense FAISS retriever over a curated Punjabi knowledge base. We further develop Pun-Instruct, a parameter-efficient, instruction-tuned variant using QLoRA, enabling robust zero-shot and instruction-following performance with significantly reduced compute needs. 
As a key innovation, we propose Quantum-RAG, a novel hybrid retrieval system that fuses sparse (BM25) and dense methods with quantum-inspired semantic matching. By encoding queries using amplitude-based embeddings and retrieving via quantum kernel similarity, Quantum-RAG achieves improved contextual relevance with minimal memory overhead, marking the first practical integration of quantum representations in low-resource language generation. Our models significantly outperform strong multilingual baselines (mBERT, mT5, MuRIL) in perplexity, factuality, and fluency. This work provides a scalable, reproducible blueprint for extending LLM capabilities to underrepresented languages and pioneers quantum-aware retrieval in low-resource NLP.

[48] Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback

Tom S. Juzek,Zina B. Ward

Main category: cs.CL

TL;DR: This paper investigates how Learning from Human Feedback (LHF) influences lexical overuse in Large Language Models (LLMs), showing that certain word preferences may stem from human feedback, highlighting a potential misalignment between LHF workers and LLM users.

Details Motivation: The motivation behind this study is to understand the reasons for LLMs' lexical choices, particularly the overuse of certain terms, and to explore the role of human feedback in this phenomenon. Method: Using Meta's Llama model, the study investigates LHF-induced lexical preferences by presenting a procedure to detect such preferences and experimentally emulating the LHF procedure to demonstrate participants' preference for text variants with certain words. Result: The study experimentally links LHF to lexical overuse, showing that participants systematically prefer text variants that include certain words, thus identifying a potential misalignment in lexical expectations. Conclusion: This paper concludes that lexical overuse in Large Language Models (LLMs) can result from Learning from Human Feedback (LHF), highlighting a potential misalignment and divergence in lexical expectations between LHF workers and LLM users. Abstract: Large Language Models (LLMs) are known to overuse certain terms like "delve" and "intricate." The exact reasons for these lexical choices, however, have been unclear. Using Meta's Llama model, this study investigates the contribution of Learning from Human Feedback (LHF), under which we subsume Reinforcement Learning from Human Feedback and Direct Preference Optimization. We present a straightforward procedure for detecting the lexical preferences of LLMs that are potentially LHF-induced. Next, we more conclusively link LHF to lexical overuse by experimentally emulating the LHF procedure and demonstrating that participants systematically prefer text variants that include certain words. This lexical overuse can be seen as a sort of misalignment, though our study highlights the potential divergence between the lexical expectations of different populations -- namely LHF workers versus LLM users. 
Our work contributes to the growing body of research on explainable artificial intelligence and emphasizes the importance of both data and procedural transparency in alignment research.

[49] ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks

Philip Schroeder,Ondrej Biza,Thomas Weng,Hongyin Luo,James Glass

Main category: cs.CL

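TL;DR: The ROVER framework improves the reasoning of vision-language models by recursively decomposing video trajectories, reducing hallucinations at low time complexity and suiting long-horizon video reasoning tasks.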

Details Motivation: Vision-language models remain limited in embodied settings that require reasoning over sequences of consecutive camera frames, so ROVER is proposed to improve their reasoning at each moment of a task attempt over a long stream of visual input. Method: ROVER is implemented with in-context learning. It recursively decomposes a long-horizon video trajectory into shorter subtasks and uses a subtask-specific sliding context window for more focused reasoning. Result: ROVER outperforms strong baselines on several video reasoning tasks, including task progress estimation, frame-level natural language reasoning, and video question answering. Conclusion: ROVER is a framework that lets a model recursively decompose long-horizon video trajectories. It performs strongly on video reasoning tasks, reduces hallucinations during reasoning, and has time complexity that scales linearly with video length, an asymptotic improvement over baselines. Abstract: Vision-language models (VLMs) have exhibited impressive capabilities across diverse image understanding tasks, but still struggle in settings that require reasoning over extended sequences of camera frames from a video. This limits their utility in embodied settings, which require reasoning over long frame sequences from a continuous stream of visual input at each moment of a task attempt. To address this limitation, we propose ROVER (Reasoning Over VidEo Recursively), a framework that enables the model to recursively decompose long-horizon video trajectories into segments corresponding to shorter subtasks within the trajectory. In doing so, ROVER facilitates more focused and accurate reasoning over temporally localized frame sequences without losing global context. We evaluate ROVER, implemented using an in-context learning approach, on diverse OpenX Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. ROVER outperforms strong baselines across three video reasoning tasks: task progress estimation, frame-level natural language reasoning, and video question answering. We observe that, by reducing the number of frames the model reasons over at each timestep, ROVER mitigates hallucinations, especially during unexpected or non-optimal moments of a trajectory. In addition, by enabling the implementation of a subtask-specific sliding context window, ROVER's time complexity scales linearly with video length, an asymptotic improvement over baselines. Demos, code, and data available at: https://rover-vlm.github.io
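The linear-scaling claim for ROVER can be illustrated with a rough frame count. This is only a sketch under stated assumptions, not the authors' analysis: it assumes a hypothetical baseline that re-reads every frame seen so far at each timestep, and it replaces ROVER's subtask-specific window with a fixed-size window (`window` is an illustrative parameter, not a value from the paper).

```python
def frames_full_history(num_frames: int) -> int:
    """Total frames read if each timestep t re-reads frames 1..t (assumed baseline)."""
    return sum(t for t in range(1, num_frames + 1))  # equals T * (T + 1) / 2


def frames_sliding_window(num_frames: int, window: int) -> int:
    """Total frames read if each timestep reads at most `window` recent frames."""
    return sum(min(t, window) for t in range(1, num_frames + 1))  # at most window * T


if __name__ == "__main__":
    for length in (100, 200, 400):
        print(length, frames_full_history(length), frames_sliding_window(length, window=8))
```

Doubling the video length roughly quadruples the first count but only doubles the second, which is the quadratic-versus-linear gap this sketch is meant to show.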

[50] SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

Junjie Wu,Jiangnan Li,Yuqing Li,Lemao Liu,Liyan Xu,Jiwei Li,Dit-Yan Yeung,Jie Zhou,Mo Yu

Main category: cs.CL

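TL;DR: This paper proposes SitEmb, an approach that improves retrieval by representing short text chunks conditioned on a broader context, and develops corresponding embedding models that substantially outperform existing ones.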

Details Motivation: Because dependencies and contextual information within a document matter, the usual retrieval-augmented generation (RAG) practice of splitting documents into small chunks has limits. Existing embedding models struggle to encode contextualized information effectively, which constrains retrieval performance. Method: The authors propose a new training paradigm and develop the SitEmb models, which better encode short chunks together with their context, and evaluate them on a purpose-built book-plot retrieval dataset. Result: The BGE-M3-based SitEmb-v1 model, with only 1B parameters, substantially outperforms several state-of-the-art models with 7-8B parameters. The 8B SitEmb-v1.5 model improves performance by a further 10% or more and performs well across languages and downstream applications. Conclusion: SitEmb addresses the difficulty of accurately encoding contextual information in long documents, substantially improving retrieval performance and downstream task results. Abstract: Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters.
Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.

[51] TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models

Fan Gao,Cheng Huang,Nyima Tashi,Yutong Liu,Xiangxiang Wang,Thupten Tsering,Ban Ma-bao,Renzeg Duojie,Gadeng Luosang,Rinchen Dongrub,Dorje Tashi,Xiao Feng,Hao Wang,Yongbin Yu

Main category: cs.CL

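TL;DR: By constructing the large-scale Tibetan dataset TIBSTC-CoT and developing the corresponding Sunshine-thinking language models, this paper improves AI capabilities for processing Tibetan.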

Details Motivation: The work addresses the severe data scarcity of Tibetan, a low-resource language, in order to advance Tibetan language understanding and generation. Method: The authors construct TIBSTC-CoT, a large-scale, multi-domain Tibetan dataset, and build on it the Sunshine-thinking LLM family, a series of Tibetan-centric models equipped with chain-of-thought capabilities. Result: The work delivers the TIBSTC-CoT dataset and the Sunshine-thinking LLM family. After training, Sunshine-thinking shows reasoning and generation performance comparable to state-of-the-art multilingual LLMs. Conclusion: Through resource creation and model innovation, this work marks an important step toward inclusive AI and raises the quality of Tibetan language processing. Abstract: To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, a large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking has demonstrated strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available: https://github.com/Vicentvankor/sun-shine.

[52] Contextually Aware E-Commerce Product Question Answering using RAG

Praveen Tangarajan,Anand A. Rajasekar,Manish Rathi,Vinay Rao Dandin,Ozan Ersoy

Main category: cs.CL

TL;DR: This paper proposes a contextual RAG-based framework for e-commerce PQA that enhances user experience by delivering personalized, accurate answers and identifying catalog content gaps.

Details Motivation: E-commerce product pages often overwhelm users with excessive and diverse information, leading to cognitive overload, while existing PQA systems struggle to effectively utilize available data. Method: The study introduces a scalable, end-to-end Retrieval Augmented Generation (RAG) framework for Product Question Answering (PQA) that integrates user context, conversational history, and product attributes. Result: The system delivers personalized answers, manages various query types across heterogeneous sources, identifies content gaps, and introduces broadly applicable metrics for RAG evaluation. Conclusion: The proposed RAG-based framework effectively handles diverse queries in e-commerce by integrating user context and product information, offering personalized responses and aiding content enhancement. Abstract: E-commerce product pages contain a mix of structured specifications, unstructured reviews, and contextual elements like personalized offers or regional variants. Although informative, this volume can lead to cognitive overload, making it difficult for users to quickly and accurately find the information they need. Existing Product Question Answering (PQA) systems often fail to utilize rich user context and diverse product information effectively. We propose a scalable, end-to-end framework for e-commerce PQA using Retrieval Augmented Generation (RAG) that deeply integrates contextual understanding. Our system leverages conversational history, user profiles, and product attributes to deliver relevant and personalized answers. It adeptly handles objective, subjective, and multi-intent queries across heterogeneous sources, while also identifying information gaps in the catalog to support ongoing content improvement. We also introduce novel metrics to measure the framework's performance which are broadly applicable for RAG system evaluations.

[53] Prompting Large Language Models to Detect Dementia Family Caregivers

Md Badsha Biswas,Özlem Uzuner

Main category: cs.CL

TL;DR: This paper presents a highly effective system using fine-tuned language models to detect tweets from caregivers of dementia patients, achieving a macro F1-score of 0.95.

Details Motivation: The motivation for this research is to identify tweets from caregivers of dementia patients, enabling the development of internet-based support interventions. Method: The paper uses large language models (LLMs) with various prompting methods to address a binary classification problem aimed at detecting tweets from individuals who have a family member with dementia. Result: The system achieved a macro F1-score of 0.95 on both the validation and test sets, indicating high accuracy in classifying tweets related to dementia caregivers. Conclusion: The study concludes that using a fine-tuned model with a zero-shot prompt is highly effective for identifying tweets from caregivers of dementia patients. Abstract: Social media, such as Twitter, provides opportunities for caregivers of dementia patients to share their experiences and seek support for a variety of reasons. Availability of this information online also paves the way for the development of internet-based interventions in their support. However, for this purpose, tweets written by caregivers of dementia patients must first be identified. This paper demonstrates our system for the SMM4H 2025 shared task 3, which focuses on detecting tweets posted by individuals who have a family member with dementia. The task is outlined as a binary classification problem, differentiating between tweets that mention dementia in the context of a family member and those that do not. Our solution to this problem explores large language models (LLMs) with various prompting methods. Our results show that a simple zero-shot prompt on a fine-tuned model yielded the best results. Our final system achieved a macro F1-score of 0.95 on the validation set and the test set. Our full code is available on GitHub.
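The macro F1-score reported above is the unweighted mean of the per-class F1 scores, so both classes count equally regardless of how many tweets fall in each. A minimal sketch of the computation follows; the label encoding (1 for a tweet mentioning a family member with dementia, 0 otherwise) and the example predictions are illustrative assumptions, not data from the shared task.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1, with F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that never appears and is never predicted scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = [1, 1, 0, 0]  # hypothetical gold labels
    pred = [1, 0, 0, 0]  # hypothetical system output
    print(round(macro_f1(gold, pred), 4))
```

In the hypothetical example, class 1 has F1 = 2/3 and class 0 has F1 = 0.8, so the macro average is about 0.7333.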

[54] SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents

Changhao Jiang,Jiajun Sun,Yifei Cao,Jiabao Zhuang,Hui Li,Xiaoran Fan,Ming Zhang,Junjie Ye,Shihan Dou,Zhiheng Xi,Jingqi Tong,Yilong Wu,Baoyu Fan,Zhen Wang,Tao Liang,Zhihui Fei,Mingyang Wan,Guojun Ma,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

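TL;DR: This study opens the research direction of speech role-playing agents and builds a dataset and an evaluation benchmark for it, laying a foundation for future research.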

Details Motivation: Existing research on role-playing agents focuses mainly on the textual modality, neglecting the speech dimension that is critical in realistic interactive scenarios, and lacks systematic evaluation of speech role-playing agents. Method: The authors construct SpeechRole-Data, a large-scale, high-quality dataset, and SpeechRole-Eval, a multidimensional evaluation benchmark. Result: Experiments reveal the advantages and challenges of cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. Conclusion: By providing a large-scale, high-quality dataset and a multidimensional evaluation benchmark, the work lays a solid foundation for speech-driven multimodal role-playing research and fosters further development of the field. Abstract: Recently, role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. In particular, there is a lack of systematic evaluation for Speech Role-Playing Agents (SRPAs). To address this gap, we construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations. Each role demonstrates distinct vocal characteristics, including timbre and prosody, thereby enabling more sophisticated speech role-playing. Furthermore, we propose SpeechRole-Eval, a multidimensional evaluation benchmark that systematically assesses SRPA performance in key aspects such as fundamental interaction ability, speech expressiveness, and role-playing fidelity. Experimental results reveal the advantages and challenges of both cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. We release all data, code, and baseline models to provide a solid foundation for speech-driven multimodal role-playing research and to foster further developments in this field.

[55] SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models

Wanqi Yang,Yanda Li,Yunchao Wei,Meng Fang,Ling Chen

Main category: cs.CL

TL;DR: SpeechR evaluates the speech reasoning ability of large audio-language models and reveals that current models fall short on reasoning.

Details Motivation: Existing evaluations focus mainly on surface-level perception and do not adequately examine model reasoning in speech-based scenarios. Method: Models are evaluated along three key dimensions (factual retrieval, procedural inference, and normative judgment) using three distinct evaluation formats. Result: An evaluation of eleven state-of-the-art LALMs shows that high transcription accuracy does not imply strong reasoning ability. Conclusion: SpeechR provides a unified benchmark for evaluating how large audio-language models reason over speech and exposes the gap between transcription accuracy and reasoning ability. Abstract: Large audio-language models (LALMs) have achieved near-human performance in sentence-level transcription and emotion recognition. However, existing evaluations focus mainly on surface-level perception, leaving the capacity of models for contextual and inference-driven reasoning in speech-based scenarios insufficiently examined. To address this gap, we introduce SpeechR, a unified benchmark for evaluating reasoning over speech in large audio-language models. SpeechR evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. It includes three distinct evaluation formats. The multiple-choice version measures answer selection accuracy. The generative version assesses the coherence and logical consistency of reasoning chains. The acoustic-feature version investigates whether variations in stress and emotion affect reasoning performance. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities. SpeechR establishes a structured benchmark for evaluating reasoning in spoken language, enabling more targeted analysis of model capabilities across diverse dialogue-based tasks.

[56] Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time

Huihan Li,You Chen,Siyuan Wang,Yixin He,Ninareh Mehrabi,Rahul Gupta,Xiang Ren

Main category: cs.CL

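TL;DR: STIM is a new analysis framework that identifies the source of each token in a large language model's reasoning chain in order to detect over-reliance on memorization.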

Details Motivation: Large language models perform well on reasoning benchmarks but often fail when inputs change slightly, raising concerns that their success depends on memorization. The issue is especially acute in chain-of-thought reasoning, where spurious memorized patterns can cause intermediate errors that lead to wrong final answers. Method: The authors introduce the STIM framework, which attributes each token in a reasoning chain to one of several memorization sources (local, mid-range, or long-range) based on statistical co-occurrence in the pretraining corpus. Result: Token-level analysis across tasks and distributional settings shows that models rely more on memorization in complex or long-tail cases, and that local memorization is often the main driver of errors, accounting for up to 67% of wrong tokens. STIM's memorization scores are also effective at predicting the wrong tokens within a wrong reasoning step. Conclusion: STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured, step-wise generation tasks. Abstract: Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs are altered slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources - local, mid-range, or long-range - based on their statistical co-occurrence with the token in the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, leading to up to 67% of wrong tokens. We also show that memorization scores from STIM can be effective in predicting the wrong tokens in the wrong reasoning step. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks.

[57] Marco-Voice Technical Report

Fengping Tian,Chenyang Lyu,Xuanfan Ni,Haoqin Sun,Qingjuan Li,Zhiqiang Qian,Haijun Li,Longyue Wang,Zhao Xu,Weihua Luo,Kaifu Zhang

Main category: cs.CL

TL;DR: This paper proposes Marco-Voice, a unified system for voice cloning and emotion-controlled speech synthesis, achieving significant improvements in expressiveness and controllability.

Details Motivation: The motivation of this work is to overcome longstanding challenges in generating highly expressive, controllable, and natural speech while preserving speaker identity across different linguistic and emotional contexts. Method: The paper introduces an effective speaker-emotion disentanglement mechanism using in-batch contrastive learning and a rotational emotional embedding integration method for smooth emotion control. Additionally, the authors constructed a high-quality emotional speech dataset called CSEMOTIONS for training and evaluation. Result: Extensive experiments showed that the proposed system, Marco-Voice, achieves significant improvements in both objective and subjective metrics, demonstrating its effectiveness in voice cloning and emotion-controlled speech synthesis. Conclusion: Marco-Voice, the proposed system in this paper, represents a significant advancement in expressive neural speech synthesis by achieving competitive performance in emotional richness and speech clarity. Abstract: This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion-controlled speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotional embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories.
Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluations and analyses show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis.

[58] Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models

Soyeon Kim,Jindong Wang,Xing Xie,Steven Euijong Whang

Main category: cs.CL

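TL;DR: TDBench is a new benchmark that uses temporal database techniques to construct time-sensitive question-answer pairs, enabling more reliable evaluation of how large language models handle factual knowledge that changes over time.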

Details Motivation: Facts evolve over time, so large language models need to handle time-sensitive factual knowledge accurately and reliably. Existing benchmarks often rely on manual curation or a small set of predefined templates, which limits scalability and coverage. Method: Using temporal databases and database techniques such as temporal SQL and functional dependencies, the authors build a new benchmark, TDBench, and introduce time accuracy, a fine-grained evaluation metric. Result: Experiments show that TDBench enables scalable and comprehensive time-sensitive QA evaluation while reducing reliance on human labor, and that it complements existing Wikipedia/Wikidata-based approaches. Conclusion: TDBench provides an automated way to generate time-sensitive question-answer pairs, allowing effective evaluation of how large language models handle factual knowledge that changes over time. Abstract: Facts evolve over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. While factual Time-Sensitive Question-Answering (TSQA) tasks have been widely studied, existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive TSQA evaluation. To address these challenges, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques such as temporal SQL and functional dependencies. We also introduce a fine-grained evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy to enable a more reliable TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing existing Wikipedia/Wikidata-based TSQA evaluation approaches by enabling LLM evaluation on application-specific data and seamless multi-hop question generation. Code and data are publicly available at: https://github.com/ssoy0701/tdbench.git.
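The idea of deriving time-sensitive QA pairs from a temporal table can be sketched with Python's built-in `sqlite3` module. This is an illustration only, not TDBench's actual pipeline: the `leader` table, its columns, the fictional rows, and the question template are all hypothetical, and plain SQL range predicates stand in for dedicated temporal SQL syntax.

```python
import sqlite3

# Hypothetical temporal table: each row is valid on the half-open
# interval [start_year, end_year). All data below is fictional.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leader (country TEXT, name TEXT, start_year INTEGER, end_year INTEGER)"
)
conn.executemany(
    "INSERT INTO leader VALUES (?, ?, ?, ?)",
    [
        ("Atlantis", "Person A", 2000, 2008),
        ("Atlantis", "Person B", 2008, 2016),
        ("Atlantis", "Person C", 2016, 2024),
    ],
)


def make_qa(country: str, year: int) -> dict:
    """Build one QA pair whose gold answer is the row valid in `year`."""
    name, start, end = conn.execute(
        "SELECT name, start_year, end_year FROM leader "
        "WHERE country = ? AND start_year <= ? AND ? < end_year",
        (country, year, year),
    ).fetchone()
    return {
        "question": f"Who led {country} in {year}?",
        "answer": name,
        "valid_from": start,
        "valid_to": end,
    }


if __name__ == "__main__":
    print(make_qa("Atlantis", 2008))
```

Keeping the validity interval next to the gold answer gives a checker something to compare against when a model's explanation cites a start or end date, which is the kind of signal a time-reference metric could use.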

[59] ProCut: LLM Prompt Compression via Attribution Estimation

Zhentao Xu,Fengyi Li,Albert Chen,Xiaofeng Wang

Main category: cs.CL

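TL;DR: ProCut is an efficient prompt compression framework that substantially reduces prompt size while maintaining or improving task performance.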

Details Motivation: Large prompt templates are hard to maintain and incur inference latency and high serving costs. Method: ProCut uses attribution analysis to segment prompt templates into meaningful units, quantify each unit's impact on task performance, and prune low-utility components. Result: Experiments show that ProCut cuts production prompts by 78% in tokens while maintaining or slightly improving task performance (up to 62% better than alternative methods). Conclusion: ProCut can effectively compress prompt templates in large industrial LLM systems, substantially reducing prompt size while maintaining or improving task performance. Abstract: In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in production) while maintaining or even slightly improving task performance (up to 62% better than alternative methods). We further introduce an LLM-driven attribution estimator that reduces compression latency by over 50%, and demonstrate that ProCut integrates seamlessly with existing prompt-optimization frameworks to produce concise, high-performing prompts.
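
The segment, attribute, prune loop described above can be sketched with the simplest attribution method available, leave-one-out ablation. This is an illustrative stand-in and not necessarily the estimator ProCut uses; `toy_score`, the segment names, and the `min_gain` threshold are hypothetical, and in practice the scoring function would run the LLM on an evaluation set.

```python
def leave_one_out_attribution(segments, score_fn):
    """Score each segment by how much task performance drops when it is removed."""
    full_score = score_fn(segments)
    attributions = []
    for i in range(len(segments)):
        ablated = segments[:i] + segments[i + 1:]
        attributions.append(full_score - score_fn(ablated))
    return attributions


def prune(segments, attributions, min_gain):
    """Keep only the segments whose removal costs at least min_gain."""
    return [s for s, a in zip(segments, attributions) if a >= min_gain]


# Toy stand-in for a real evaluation: only two segments affect the score.
def toy_score(segments):
    score = 0.0
    if "task instructions" in segments:
        score += 0.6
    if "few-shot examples" in segments:
        score += 0.3
    return score


segments = ["task instructions", "few-shot examples", "legacy rule A", "legacy rule B"]
attributions = leave_one_out_attribution(segments, toy_score)
compressed = prune(segments, attributions, min_gain=0.05)
print(compressed)
```

Leave-one-out needs one evaluation per segment, which is one reason a cheaper attribution estimator, such as the LLM-driven one mentioned in the abstract, can be attractive for long templates.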

[60] The SMeL Test: A simple benchmark for media literacy in language models

Gustaf Ahdritz,Anat Kleiman

Main category: cs.CL

TL;DR: The paper introduces the SMeL Test to evaluate how well language models filter unreliable online content, revealing that even advanced models frequently hallucinate and larger models don't always perform better.

Details Motivation: The motivation stems from the prevalence of misleading content online and the unclear extent to which language models can apply human-like heuristics to discern reliable information. Method: The authors introduced the Synthetic Media Literacy Test (SMeL Test), a benchmark designed to evaluate language models' ability to filter out unreliable information. They tested various instruction-tuned LLMs, including reasoning models. Result: No model consistently trusted more reliable sources, reasoning models scored higher but still hallucinated up to 70% of the time, and larger models did not necessarily perform better. Conclusion: The paper concludes that larger and more capable language models do not necessarily outperform smaller models in filtering untrustworthy information, and even advanced models hallucinate frequently. Abstract: The internet is rife with unattributed, deliberately misleading, or otherwise untrustworthy content. Though large language models (LLMs) are often tasked with autonomous web browsing, the extent to which they have learned the simple heuristics human researchers use to navigate this noisy environment is not currently known. In this paper, we introduce the Synthetic Media Literacy Test (SMeL Test), a minimal benchmark that tests the ability of language models to actively filter out untrustworthy information in context. We benchmark a variety of commonly used instruction-tuned LLMs, including reasoning models, and find that no model consistently trusts more reliable sources; while reasoning in particular is associated with higher scores, even the best API model we test hallucinates up to 70% of the time. Remarkably, larger and more capable models do not necessarily outperform their smaller counterparts. We hope our work sheds more light on this important form of hallucination and guides the development of new methods to combat it.

[61] When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models

Jin Li,Keyu Wang,Shu Yang,Zhuoran Zhang,Di Wang

Main category: cs.CL

TL;DR: The paper explores the internal mechanisms behind sycophantic behavior in Large Language Models, identifying a two-stage emergence of sycophancy and examining the impact of user opinions and grammatical perspective on this behavior.

Details Motivation: The internal mechanisms that enable sycophantic behavior in Large Language Models remain poorly understood. Method: The authors combine logit-lens analysis with causal activation patching. Result: First-person prompts consistently induce higher sycophancy rates than third-person framings. Conclusion: Sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers. Abstract: Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts ("I believe...") consistently induce higher sycophancy rates than third-person framings ("They believe...") by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
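
As a rough illustration of what a logit-lens analysis looks at, the sketch below decodes each layer's hidden state with the output (unembedding) matrix and reports the top token per layer. The two-token vocabulary, the matrix, and the hidden states are made-up toy values, and the final layer normalization that real implementations usually apply before unembedding is omitted for brevity.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def logit_lens(hidden_states, unembedding, vocab):
    """Decode every layer's hidden state and return the top token per layer."""
    top_tokens = []
    for hidden in hidden_states:
        logits = matvec(unembedding, hidden)
        best = max(range(len(vocab)), key=lambda i: logits[i])
        top_tokens.append(vocab[best])
    return top_tokens


# Toy values only: one unembedding row per vocabulary item,
# one hidden state per layer (earliest layer first).
vocab = ["true_answer", "user_opinion"]
unembedding = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = [[0.9, 0.1], [0.8, 0.3], [0.4, 0.7], [0.2, 0.9]]

per_layer = logit_lens(hidden_states, unembedding, vocab)
print(per_layer)
```

In this toy run the preferred token flips from the factual answer to the user's opinion between the second and third layers; locating such a flip in a real model is the kind of evidence a late-layer preference shift would rest on.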

[62] "Harmless to You, Hurtful to Me!": Investigating the Detection of Toxic Languages Grounded in the Perspective of Youth

Yaqiong Li,Peng Zhang,Lin Wang,Hansu Gu,Siyuan Qiao,Ning Gu,Tun Lu

Main category: cs.CL

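TL;DR: The study examines how youth uniquely perceive toxic language on social media, builds the first Chinese dataset on the topic, and finds that adding meta information improves toxicity detection accuracy.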

Details Motivation: Youth perceive toxic content on social media differently from adults, yet existing research has overlooked language that is toxic specifically from the youth perspective. Method: The authors take Chinese youth as the research target, construct the first Chinese "youth-toxicity" dataset, and analyze it extensively. Result: Youth's perception of toxicity is associated with several contextual factors, such as the source of an utterance and text-related features. Conclusion: Incorporating meta information (such as source and text-related features) into current toxicity detection methods significantly improves detection accuracy; the authors also propose directions for future research on youth-centered toxicity detection. Abstract: Risk perception is subjective, and youth's understanding of toxic content differs from that of adults. Although previous research has conducted extensive studies on toxicity detection in social media, the investigation of youth's unique toxicity, i.e., languages perceived as nontoxic by adults but toxic by youth, is ignored. To address this gap, we aim to explore: 1) What are the features of "youth-toxicity" languages in social media (RQ1); 2) Can existing toxicity detection techniques accurately detect these languages (RQ2). For these questions, we took Chinese youth as the research target, constructed the first Chinese "youth-toxicity" dataset, and then conducted extensive analysis. Our results suggest that youth's perception of these languages is associated with several contextual factors, like the source of an utterance and text-related features. Incorporating this meta information into current toxicity detection methods significantly improves accuracy overall. Finally, we propose several insights into future research on youth-centered toxicity detection.

[63] Learning Dynamics of Meta-Learning in Small Model Pretraining

David Demitri Africa,Yuval Weiss,Paula Buttery,Richard Diehl Martinez

Main category: cs.CL

TL;DR: This paper introduces a meta-learning-enhanced training method for small language models that improves efficiency, performance, and interpretability by revealing a two-stage training dynamic.

Details Motivation: The motivation stems from the high cost and complexity of training large language models. The study aims to explore whether meta-learning can enhance the training of smaller language models, making them both more efficient and interpretable. Method: The research integrates first-order MAML (Model-Agnostic Meta-Learning) with subset-masked language model pretraining to train four LLaMA-style decoder-only models of varying sizes. The models are evaluated across multiple settings and real-world applications, particularly focusing on a multilingual Universal NER task. Result: Compared to vanilla training, the proposed method (i) achieves the same loss up to 1.6x faster, (ii) improves F1 scores on multilingual Universal NER under equal computational resources, and (iii) reveals interpretable training dynamics involving a two-stage process of diversification and compression in network representations. Conclusion: The study concludes that integrating first-order MAML with subset-masked LM pretraining improves both the efficiency and interpretability of training small language models, with observable two-stage training dynamics and enhanced performance on NLP tasks. Abstract: Large language models are powerful but costly. We ask whether meta-learning can make the pretraining of small language models not only better but also more interpretable. We integrate first-order MAML with subset-masked LM pretraining, producing four LLaMA-style decoder-only models (11M-570M params), and evaluate them on a fundamental NLP task with many settings and real-world applications. Compared with vanilla training, our model (i) reaches the same loss up to 1.6x sooner, (ii) improves F1 on multilingual Universal NER under equal compute, and (iii) makes the training dynamics easy to read: first the network's representations fan out ("diversify") and later they collapse into a smaller, shared subspace ("compress").
This two-stage shift shows up as a rise-and-fall in both effective-rank curves and attention-head entropy. The same curves pinpoint which layers specialise earliest and which later reconverge, giving a compact, interpretable signature of meta-adaptation. Code, checkpoints and WandB logs are released.
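
The two diagnostics named above can be computed in a few lines. The sketch below assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) and averages per-query entropy for an attention head; the paper may define either quantity differently, and the singular values are taken as given rather than computed from a real representation matrix.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    return math.exp(shannon_entropy([s / total for s in singular_values]))


def attention_head_entropy(attention_rows):
    """Mean entropy over the attention distributions of one head (one row per query)."""
    return sum(shannon_entropy(row) for row in attention_rows) / len(attention_rows)


spread_out = effective_rank([1.0, 1.0, 1.0, 1.0])   # variance shared by four directions
collapsed = effective_rank([1.0, 0.0, 0.0, 0.0])    # everything in one direction
head = attention_head_entropy([[0.5, 0.5], [1.0, 0.0]])
print(spread_out, collapsed, head)
```

Under this definition, a "diversify" phase would show up as the effective rank climbing toward the number of dimensions, and a "compress" phase as it falling back toward a smaller value.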

[64] Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Yuxuan Song,Zheng Zhang,Cheng Luo,Pengyang Gao,Fan Xia,Hao Luo,Zheng Li,Yuehang Yang,Hongli Yu,Xingwei Qu,Yuwei Fu,Jing Su,Ge Zhang,Wenhao Huang,Mingxuan Wang,Lin Yan,Xiaoying Jia,Jingjing Liu,Wei-Ying Ma,Ya-Qin Zhang,Yonghui Wu,Hao Zhou

Main category: cs.CL

TL;DR: Seed Diffusion Preview is a fast large-scale language model based on discrete-state diffusion that offers competitive performance and sets a new speed-quality standard for code models.

Details Motivation: The motivation is to mitigate the inherent latency of token-by-token decoding in large-scale language models. Method: Seed Diffusion Preview uses discrete-state diffusion with non-sequential, parallel generation for faster inference speed. Result: Seed Diffusion Preview achieves an inference speed of 2,146 token/s over H20 GPUs while maintaining competitive performance across standard code evaluation benchmarks. Conclusion: Seed Diffusion Preview establishes a new state of the art on the speed-quality Pareto frontier for code models. Abstract: We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup to mitigate the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 token/s over H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks, significantly faster than contemporary Mercury and Gemini Diffusion, establishing new state of the art on the speed-quality Pareto frontier for code models.

[65] Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems

Yebo Peng,Zixiang Liu,Yaoming Li,Zhizhuo Yang,Xinye Xu,Bowen Ye,Weijun Yuan,Zihan Wang,Tong Yang

Main category: cs.CL

TL;DR: The paper introduces Proof2Hybrid, a framework for automatically generating proof-centric benchmarks, and presents AlgGeoTest, which reveals significant shortcomings in LLMs' understanding of algebraic geometry.

Details Motivation: Existing benchmarks for evaluating the mathematical capability of Large Language Models (LLMs) are inadequate, particularly for proof-centric problems, due to the manual and costly nature of their creation. Method: The authors propose Proof2Hybrid, an automated framework that converts mathematical proofs into hybrid-formatted questions called ``$m$-out-of-$n$ multiple judge questions'' and introduce AlgGeoTest, a benchmark for algebraic geometry. Result: The proposed framework enables robust, automatic evaluation of LLMs while being resilient to guessing and superficial pattern matching, as demonstrated by the benchmark AlgGeoTest, which comprises 456 challenging items. Conclusion: The paper concludes that their framework and benchmark provide a new direction for in-depth research into the mathematical intelligence of AI systems, revealing profound deficits in state-of-the-art LLMs' understanding of algebraic geometry. Abstract: Evaluating the mathematical capability of Large Language Models (LLMs) is a critical yet challenging frontier. Existing benchmarks fall short, particularly for proof-centric problems, as manual creation is unscalable and costly, leaving the true mathematical abilities of LLMs largely unassessed. To overcome these barriers, we propose Proof2Hybrid, the first fully automated framework that synthesizes high-quality, proof-centric benchmarks from natural language mathematical corpora. The key novelty of our solution is Proof2X, a roadmap of converting mathematical proofs into various kinds of questions that are easy to verify. Instructed by this roadmap, we propose a new type of hybrid-formatted questions, named ``$m$-out-of-$n$ multiple judge questions'', specifically designed to enable robust, automatic evaluation while being resilient to guessing and superficial pattern matching inherent in traditional formats. 
As a demonstration of our framework, we introduce AlgGeoTest, a benchmark for algebraic geometry--a frontier domain of modern mathematics--comprising 456 challenging items. Our extensive evaluations on state-of-the-art LLMs using AlgGeoTest reveal profound deficits in their comprehension of algebraic geometry, providing a more precise measure of their true mathematical capabilities. Our framework and benchmark pave the way for a new wave of in-depth research into the mathematical intelligence of AI systems.
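
To see why a multi-statement judging format resists guessing, a short calculation helps. The abstract does not spell out the exact rules of an "$m$-out-of-$n$ multiple judge question", so the sketch below assumes one plausible reading: a question lists n statements, exactly m of them are correct, and credit is given only if every statement is judged correctly. Both function names are hypothetical.

```python
from math import comb


def blind_guess_success(n):
    """Chance of judging all n statements correctly with independent coin flips."""
    return 0.5 ** n


def informed_guess_success(n, m):
    """Chance of picking the right subset when the guesser knows that exactly m of n are correct."""
    return 1 / comb(n, m)


blind = blind_guess_success(6)
informed = informed_guess_success(6, 2)
print(blind, informed)
```

Under these assumptions, a six-statement question with two correct statements is answered by blind guessing about 1.6% of the time, and about 6.7% of the time even when the guesser knows m, compared with 25% for a conventional four-option multiple-choice question.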

[66] Isolating Culture Neurons in Multilingual Large Language Models

Danial Namazifard,Lukas Galke

Main category: cs.CL

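TL;DR: By introducing the MUREL dataset and extending an existing method, the study sheds light on how multilingual large language models encode cultural information and shows that culture neurons can be modulated independently.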

Details Motivation: Language and culture are closely linked, but it is still unclear how multilingual large language models encode cultural information. Method: The authors extend an existing method for identifying language-specific neurons to localize and isolate culture-specific neurons, and run their experiments on the MUREL dataset. Result: Experiments show that different cultures are encoded in distinct neuron populations, concentrated mainly in upper layers, and that these culture neurons can be modulated independently of language-specific neurons and of neurons specific to other cultures. Conclusion: Cultural knowledge in multilingual large language models can be selectively isolated and edited, which helps promote fairness, inclusivity, and alignment. Abstract: Language and culture are deeply intertwined, yet it is so far unclear how and where multilingual large language models encode culture. Here, we build upon an established methodology for identifying language-specific neurons and extend it to localize and isolate culture-specific neurons, carefully disentangling their overlap and interaction with language-specific neurons. To facilitate our experiments, we introduce MUREL, a curated dataset of 85.2 million tokens spanning six different cultures. Our localization and intervention experiments show that LLMs encode different cultures in distinct neuron populations, predominantly in upper layers, and that these culture neurons can be modulated independently from language-specific neurons or those specific to other cultures. These findings suggest that cultural knowledge and propensities in multilingual language models can be selectively isolated and edited, promoting fairness, inclusivity, and alignment. Code and data are available at https://github.com/namazifard/Culture_Neurons .

[67] Interference Matrix: Quantifying Cross-Lingual Interference in Transformer Encoders

Belen Alastruey,João Maria Janeiro,Alexandre Allauzen,Maha Elbayad,Loïc Barrault,Marta R. Costa-jussà

Main category: cs.CL

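TL;DR: The paper studies language interference in encoder-only Transformer models across 83 languages and finds that interference is asymmetrical and that its patterns do not follow traditional linguistic characteristics but relate more closely to script.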

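Details Motivation: Language interference can hurt the performance of multilingual models, and understanding its nature helps in designing better multilingual models. Method: The authors train and evaluate small BERT-like models on all possible language pairs to build an interference matrix, and carry out a large-scale quantitative analysis. Result: Interference is asymmetrical, and its patterns do not align with traditional linguistic characteristics or with embedding similarity, but relate more closely to script. Conclusion: Interference patterns are related to script rather than to traditional linguistic characteristics. The interference matrix effectively predicts downstream task performance and helps in designing multilingual models. Abstract: In this paper, we present a comprehensive study of language interference in encoder-only Transformer models across 83 languages. We construct an interference matrix by training and evaluating small BERT-like models on all possible language pairs, providing a large-scale quantification of cross-lingual interference. Our analysis reveals that interference between languages is asymmetrical and that its patterns do not align with traditional linguistic characteristics, such as language family, nor with proxies like embedding similarity, but instead better relate to script. Finally, we demonstrate that the interference matrix effectively predicts performance on downstream tasks, serving as a tool to better design multilingual models to obtain optimal performance.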

[68] Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning

Jia Deng,Jie Chen,Zhipeng Chen,Wayne Xin Zhao,Ji-Rong Wen

Main category: cs.CL

TL;DR: This paper analyzes how entropy and performance interact during RLVR training of LLMs, identifying key patterns across different training stages and proposing reward adjustment strategies that enhance learning efficiency by focusing on high-potential tokens.

Details Motivation: The motivation of the paper is to better understand the entropy-performance trade-off in reinforcement learning with verifiable rewards (RLVR), particularly how this balance affects the reasoning capabilities of large language models (LLMs) during training. Method: The researchers conducted a systematic empirical analysis of the entropy-performance exchange mechanism in RLVR across stage-level, instance-level, and token-level granularities. They divided the training process into two stages—rising and plateau—and analyzed how entropy and performance interact at each level of granularity. Result: The analysis revealed that entropy reduction in negative samples during the rising stage promotes performance gains. In the plateau stage, high-entropy tokens in low-perplexity samples and at the end of sequences are most influential for learning efficiency. Based on these findings, the authors proposed reward adjustment methods that improve upon baseline approaches. Conclusion: The study concludes that managing the entropy-performance trade-off in RLVR can significantly enhance learning efficiency in LLMs, particularly by focusing on tokens with high learning potential using perplexity and positional information. Abstract: Recently, reinforcement learning with verifiable rewards (RLVR) has been widely used for enhancing the reasoning abilities of large language models (LLMs). A core challenge in RLVR involves managing the exchange between entropy and performance of policies. Despite the importance of this exchange, a fine-grained understanding of when and how this exchange operates most effectively remains limited. To bridge this gap, we conduct a systematic empirical analysis of the entropy-performance exchange mechanism of RLVR across different levels of granularity. 
Specifically, we first divide the training process into two distinct stages based on entropy dynamics, i.e., rising stage and plateau stage, and then systematically investigate how this mechanism varies across stage-level, instance-level, and token-level granularities. Our analysis reveals that, in the rising stage, entropy reduction in negative samples facilitates the learning of effective reasoning patterns, which in turn drives rapid performance gains. Moreover, in the plateau stage, learning efficiency strongly correlates with high-entropy tokens present in low-perplexity samples and those located at the end of sequences. Motivated by these findings, we propose two methods that dynamically adjust the reward signal using perplexity and positional information to focus RL updates on tokens that exhibit high learning potential, achieving improvements compared to the baseline methods on various LLMs.
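
The phrase "high-entropy tokens present in low-perplexity samples" can be made concrete with a small filter. The sketch below only illustrates that selection criterion; it is not the paper's reward-adjustment method, it ignores the positional (end-of-sequence) signal, and the thresholds `max_ppl` and `min_entropy` as well as the sample values are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Sample perplexity: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def high_potential_tokens(samples, max_ppl, min_entropy):
    """Return (sample_index, token_index) pairs for high-entropy tokens in low-perplexity samples."""
    selected = []
    for s_idx, sample in enumerate(samples):
        if perplexity(sample["logprobs"]) > max_ppl:
            continue  # skip samples the policy is unsure about as a whole
        for t_idx, entropy in enumerate(sample["entropies"]):
            if entropy >= min_entropy:
                selected.append((s_idx, t_idx))
    return selected


samples = [
    {"logprobs": [-0.1, -0.2, -0.1], "entropies": [0.2, 1.5, 0.3]},  # confident sample
    {"logprobs": [-2.0, -3.0, -2.5], "entropies": [2.0, 2.2, 1.9]},  # uncertain sample
]
picked = high_potential_tokens(samples, max_ppl=2.0, min_entropy=1.0)
print(picked)
```

Here only the second token of the first sample is selected: the sample as a whole is confidently generated, but the policy is undecided at that one position.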

[69] SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System

Serry Sibaee,Omer Nacar,Yasser Al-Habashi,Adel Ammar,Wadii Boulila

Main category: cs.CL

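TL;DR: The paper introduces SHAMI-MT, a machine translation system for high-quality bidirectional translation between Modern Standard Arabic and the Syrian dialect.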

Details Motivation: The linguistic landscape of the Arab world shows a significant gap between Modern Standard Arabic and regional dialects, which poses a challenge for natural language processing and machine translation in particular. Method: Two specialized models were built on the AraT5v2-base-1024 architecture, fine-tuned on the Nabra dataset, and evaluated on the MADAR corpus. Result: The MSA-to-Shami model obtained an average quality score of 4.01 out of 5.0 when judged by GPT-4.1, indicating translations that are accurate and dialectally authentic. Conclusion: The paper delivers SHAMI-MT, a high-quality bidirectional machine translation system that bridges the communication gap between Modern Standard Arabic and the Syrian dialect. Abstract: The rich linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces SHAMI-MT, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the comprehensive Nabra dataset and rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an outstanding average quality score of 4.01 out of 5.0 when judged by OpenAI's GPT-4.1 model, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a crucial, high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering significant applications in content localization, cultural heritage, and intercultural communication.

[70] Dynaword: From One-shot to Continuously Developed Datasets

Kenneth Enevoldsen,Kristian Nørgaard Jensen,Jan Kostkan,Balázs Szabó,Márton Kardos,Kirten Vad,Andrea Blasi Núñez,Gianluca Barmina,Jacob Nielsen,Rasmus Larsen,Peter Vahlstrup,Per Møldrup Dalum,Desmond Elliott,Lukas Galke,Peter Schneider-Kamp,Kristoffer Nielbo

Main category: cs.CL

TL;DR: The Dynaword approach and Danish Dynaword provide a sustainable framework for creating and maintaining large-scale, open datasets through community collaboration.

Details Motivation: Current approaches to large-scale dataset creation in natural language processing face challenges such as reliance on ambiguously licensed sources, static dataset releases, and limited community involvement in quality assurance. Method: The Dynaword approach framework was developed to create large-scale, open datasets with continuous updates through community contributions. This framework was validated through the creation of the Danish Dynaword dataset. Result: Danish Dynaword, a concrete implementation of the Dynaword approach, contains over four times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions across industry and research. Conclusion: Dynaword approach and Danish Dynaword provide a sustainable framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Abstract: Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over four times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions across industry and research. 
The repository includes light-weight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.

[71] A French Version of the OLDI Seed Corpus

Malik Marmonier,Benoît Sagot,Rachel Bawden

Main category: cs.CL

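TL;DR: The paper presents the first French partition of the OLDI Seed Corpus, submitted to the WMT 2025 OLDI shared task and intended to support the creation of parallel corpora for the regional languages of France.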

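Details Motivation: The work addresses the translation challenges of user-generated content from Wikipedia, which combines technical, encyclopedic terminology with stylistic irregularities, and aims to support the creation of parallel corpora for the regional languages of France. Method: The corpus was created using multiple machine translation systems and a custom-built post-editing interface, with post-editing carried out by qualified native speakers. Result: The paper presents the first French partition of the OLDI Seed Corpus and describes its creation process in detail. Conclusion: The French corpus is not an end in itself; it is intended as a key pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France. Abstract: We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which combines highly technical, encyclopedic terminology with the stylistic irregularities characteristic of user-generated content taken from Wikipedia. This French corpus is not an end in itself, but is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.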

[72] Simple Methods Defend RAG Systems Well Against Real-World Attacks

Ilias Triantafyllopoulos,Renyi Qu,Salvatore Giorgi,Brenda Curtis,Lyle H. Ungar,João Sedoc

Main category: cs.CL

TL;DR: This paper explores methods for detecting Out-Of-Domain queries in RAG systems to ensure safety and relevance, evaluating techniques like GPT-4o, PCA-based, and Neural Collapse-based approaches, confirming the importance of external OOD detection.

Details Motivation: Ensuring safety and in-domain responses in RAG systems is crucial, especially in safety-critical applications. Detecting Out-Of-Domain (OOD) queries remains a significant challenge, prompting the need to explore effective methodologies for maintaining system reliability and response quality. Method: The paper evaluates four methodologies for OOD query detection in RAG systems: GPT-4o, regression-based, PCA-based, and Neural Collapse (NC). It introduces two novel strategies for dimensionality reduction and feature separation using PCA and Neural Collapse Feature Separation. The evaluation is conducted on standard datasets (StackExchange, MSMARCO) and real-world applications (Substance Use, COVID-19), including tests against LLM-simulated and actual attacks on a chatbot. Result: The paper demonstrates that external OOD detection significantly improves RAG system performance by ensuring responses remain relevant and confined to the knowledge base. The PCA-based and Neural Collapse-based methods show effectiveness in feature separation for OOD detection, validated through human and LLM-based evaluations across datasets and real-world applications. Conclusion: The study concludes that an external OOD detector is essential for maintaining response relevance in RAG systems, with PCA-based and Neural Collapse-based methods showing promise in feature separation for OOD detection. Abstract: Ensuring safety and in-domain responses for Retrieval-Augmented Generation (RAG) systems is paramount in safety-critical applications, yet remains a significant challenge. To address this, we evaluate four methodologies for Out-Of-Domain (OOD) query detection: GPT-4o, regression-based, Principal Component Analysis (PCA)-based, and Neural Collapse (NC), to ensure the RAG system only responds to queries confined to the system's knowledge base. 
Specifically, our evaluation explores two novel dimensionality reduction and feature separation strategies: PCA, where top components are selected using explained variance or OOD separability, and an adaptation of Neural Collapse Feature Separation. We validate our approach on standard datasets (StackExchange and MSMARCO) and real-world applications (Substance Use and COVID-19), including tests against LLM-simulated and actual attacks on a COVID-19 vaccine chatbot. Through human and LLM-based evaluations of response correctness and relevance, we confirm that an external OOD detector is crucial for maintaining response relevance.
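
Selecting top PCA components by explained variance amounts to a cumulative-sum rule over the eigenvalues of the embedding covariance matrix. The sketch below shows only that rule; it assumes the eigenvalues have already been computed, and the function name and the 75% target are hypothetical choices, not values from the paper.

```python
def components_for_explained_variance(eigenvalues, target_ratio):
    """Smallest number of leading components whose share of total variance reaches target_ratio."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target_ratio:
            return k
    return len(ordered)


# Toy eigenvalues: the first two directions carry 80% of the variance.
k = components_for_explained_variance([5.0, 3.0, 1.0, 1.0], 0.75)
print(k)
```

The alternative criterion mentioned in the abstract, OOD separability, would instead rank components by how well they distinguish in-domain from out-of-domain queries, which requires labeled examples of both.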

[73] LaMPE: Length-aware Multi-grained Position Encoding for Adaptive Long-context Scaling Without Training

Sikui Zhang,Guangze Gao,Ziyun Gan,Chunfeng Yuan,Zefeng Lin,Houwen Peng,Bing Li,Weiming Hu

Main category: cs.CL

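TL;DR: The paper proposes LaMPE, a new position encoding method that addresses the performance drop of large language models on inputs longer than the pretraining context length; by dynamically adjusting the position mapping and refining the attention mechanism, it achieves significant performance gains.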

Details Motivation: Existing models degrade significantly when the input exceeds the pretraining context window, and existing remedies do not account for the dynamic relationship between input length and the model's effective context window. Method: The authors propose Length-aware Multi-grained Positional Encoding (LaMPE), which models the dynamic relationship between input length and the model's effective context window with a parametric scaled sigmoid function, and design a multi-grained attention mechanism to allocate positional resolution. Result: Experiments show that LaMPE significantly outperforms existing methods on three representative RoPE-based LLMs across five mainstream long-context benchmarks, and that it can be applied without training. Conclusion: LaMPE is a training-free method that improves the long-text performance of RoPE-based LLMs; through dynamic allocation of positional capacity and multi-grained attention, it significantly outperforms existing length extrapolation methods. Abstract: Large language models (LLMs) experience significant performance degradation when the input exceeds the pretraining context window, primarily due to the out-of-distribution (OOD) behavior of Rotary Position Embedding (RoPE). Recent studies mitigate this problem by remapping OOD positions into the in-distribution range with fixed mapping strategies, ignoring the dynamic relationship between input length and the model's effective context window. To this end, we propose Length-aware Multi-grained Positional Encoding (LaMPE), a training-free method that fully utilizes the model's effective context window for adaptive long-context scaling in LLMs. Motivated by the left-skewed frequency distribution of relative positions, LaMPE establishes a dynamic relationship between mapping length and input length through a parametric scaled sigmoid function to adaptively allocate positional capacity across varying input lengths. Meanwhile, LaMPE devises a novel multi-grained attention mechanism that strategically allocates positional resolution across different sequence regions to capture both fine-grained locality and long-range dependencies. Our method can be seamlessly applied to a wide range of RoPE-based LLMs without training. Extensive experiments on three representative LLMs across five mainstream long-context benchmarks demonstrate that LaMPE achieves significant performance improvements compared to existing length extrapolation methods. The code will be released at https://github.com/scar-on/LaMPE.
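
A scaled sigmoid is one simple way to let the mapping length grow smoothly with the input length while staying inside fixed bounds. The abstract does not give LaMPE's actual formula, so the functional form below and every parameter (`min_len`, `max_len`, `midpoint`, `steepness`) are assumptions chosen purely for illustration.

```python
import math


def mapping_length(input_length, min_len, max_len, midpoint, steepness):
    """Scaled sigmoid: rises smoothly from min_len toward max_len as the input gets longer."""
    gate = 1.0 / (1.0 + math.exp(-steepness * (input_length - midpoint)))
    return min_len + (max_len - min_len) * gate


# Hypothetical setting: map into between 4096 and 8192 positions,
# with the transition centered on an input length of 8192 tokens.
short_input = mapping_length(2048, 4096, 8192, 8192, 0.001)
at_midpoint = mapping_length(8192, 4096, 8192, 8192, 0.001)
long_input = mapping_length(32768, 4096, 8192, 8192, 0.001)
print(short_input, at_midpoint, long_input)
```

Compared with a fixed mapping, a curve like this uses little extra positional range for short inputs and approaches the upper bound only for long ones; for extreme arguments a production version would need to guard `math.exp` against overflow.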

[74] VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Qianli Ma,Yaowei Zheng,Zhelun Shi,Zhongkai Zhao,Bin Jia,Ziyue Huang,Zhiqi Lin,Youjie Li,Jiacheng Yang,Yanghua Peng,Zhi Zhang,Xin Liu

Main category: cs.CL

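TL;DR: This paper presents VeOmni, an efficient training framework for omni-modal large language models that decouples communication from computation and supports flexible configuration, substantially improving training efficiency and scalability.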

Details Motivation: Training omni-modal LLMs is challenging because of the heterogeneous model architectures required, and existing frameworks are limited in scalability and incur substantial engineering overhead. Method: VeOmni introduces model-centric distributed recipes that decouple communication from computation, and provides a flexible configuration interface for seamless integration of new modalities. Result: With VeOmni, a 30B-parameter omni-modal mixture-of-experts (MoE) model can be trained at over 2,800 tokens/sec/GPU and scaled to 160K context lengths via 3D parallelism on 128 GPUs. Conclusion: VeOmni is a modular and efficient training framework that demonstrates strong efficiency and scalability for training large omni-modal LLMs. Abstract: Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, incurring limited scalability and substantial engineering overhead for end-to-end omni-modal training. We present VeOmni, a modular and efficient training framework to accelerate the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism on omni-modal LLMs. VeOmni also features a flexible configuration interface supporting seamless integration of new modalities with minimal code change. Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained with over 2,800 tokens/sec/GPU throughput and scale to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.

[75] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

Yuzhuang Xu,Xu Han,Yuanchi Zhang,Yixuan Wang,Yijun Liu,Shiyu Ji,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

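TL;DR: This paper proposes CAMERA, a framework for compression at the micro-expert level (CAMERA-P for pruning, CAMERA-Q for quantization) that addresses the computational and storage overhead of MoE models; experiments show it outperforms existing methods on multiple tasks.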

Details Motivation: MoE-based large language models perform strongly but carry heavy computational and storage overhead, and their performance gains do not scale proportionally with parameter growth. Existing parameter-reduction methods still face problems in both performance and efficiency. Method: The authors introduce the micro-expert as a finer-grained compression unit, propose the CAMERA framework, and build on it CAMERA-P (a micro-expert pruning framework) and CAMERA-Q (a mixed-precision quantization idea for micro-experts). Result: Experiments show that CAMERA-P performs well across multiple tasks and pruning ratios; CAMERA-Q also outperforms existing methods under 2-bit quantization, and the method can analyze Qwen2-57B-A14B quickly on a single GPU. Conclusion: CAMERA-P and CAMERA-Q are effective at reducing micro-expert redundancy: CAMERA-P outperforms strong baselines at pruning ratios from 20% to 60%, and CAMERA-Q surpasses existing matrix- and channel-level approaches under aggressive 2-bit quantization. Abstract: Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.

[76] Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models

Jiayi Zhang,Shu Yang,Junchao Wu,Derek F. Wong,Di Wang

Main category: cs.CL

TL;DR: This paper identifies political neurons in language models and proposes InhibitFT, a method to reduce unintended political bias spread across topics during model fine-tuning.

Details Motivation: Fine-tuning large language models on political topics can unintentionally affect their stance on unrelated topics. This study aims to understand the internal mechanisms of this unintended cross-topic generalization and how to mitigate it. Method: The authors introduced InhibitFT, an inhibition-based fine-tuning method, and used activation patching experiments to identify two types of political neurons: general political neurons and topic-specific neurons. Result: The study found two distinct types of political neurons across four models and datasets. InhibitFT was shown to significantly reduce cross-topic stance generalization while maintaining performance on specific topics. Conclusion: InhibitFT effectively mitigates cross-topic stance generalization during political fine-tuning, reducing it by 20% on average while preserving topic-specific performance. Selectively inhibiting just 5% of neurons is sufficient for this mitigation. Abstract: Fine-tuning Large Language Models on a political topic will significantly manipulate their political stance on various issues and unintentionally affect their stance on unrelated topics. While previous studies have raised this issue, there is still a lack of understanding regarding the internal representations of these stances and the mechanisms that lead to unintended cross-topic generalization. In this paper, we systematically explore the internal mechanisms underlying this phenomenon from a neuron-level perspective and how to mitigate the cross-topic generalization of political fine-tuning. Firstly, we propose Political Neuron Localization through Activation Contrasting (PNLAC) to identify two distinct types of political neurons: general political neurons, which govern stance across multiple political topics, and topic-specific neurons that affect the model's political stance on individual topics. We find the existence of these political neuron types across four models and datasets through activation patching experiments. Leveraging these insights, we introduce InhibitFT, an inhibition-based fine-tuning method, effectively mitigating the cross-topic stance generalization. Experimental results demonstrate the robustness of identified neuron types across various models and datasets, and show that InhibitFT significantly reduces the cross-topic stance generalization by 20% on average, while preserving topic-specific performance. Moreover, we demonstrate that selectively inhibiting only 5% of neurons is sufficient to effectively mitigate the cross-topic stance generalization.

[77] CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation

Xiaolin Lin,Jingcun Wang,Olga Kondrateva,Yiyu Shi,Bing Li,Grace Li Zhang

Main category: cs.CL

TL;DR: CompressKV improves KV cache compression in LLMs by selectively using attention heads to retain important tokens, resulting in better performance and efficiency compared to existing methods.

Details Motivation: KV cache compression in GQA-based LLMs often leads to the eviction of critical tokens due to the use of all attention heads, which degrades performance. A more selective approach is needed. Method: CompressKV identifies attention heads that can retrieve important tokens and uses them to determine which KV cache pairs to retain. It also employs a layer-adaptive cache allocation strategy. Result: CompressKV consistently outperforms state-of-the-art methods on LongBench and Needle-in-a-Haystack benchmarks under various memory budgets. Conclusion: CompressKV is an effective KV cache compression method that outperforms existing approaches in terms of performance and efficiency. Abstract: Recent advances in large language models (LLMs) have significantly boosted long-context processing. However, the increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency. Most KV cache compression methods rely on heuristic token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs. This method ignores the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrading the performance of LLMs. To address the issue above, instead of using all the attention heads in GQA-based LLMs to determine important tokens as in the previous work, we first identify the attention heads in each layer that are not only capable of retrieving the initial and final tokens of a prompt, but also capable of retrieving important tokens within the text and attending to their surrounding semantic context. Afterwards, we exploit such heads to determine the important tokens and retain their corresponding KV cache pairs. Furthermore, we analyze the cache eviction error of each layer individually and introduce a layer-adaptive KV cache allocation strategy. Experimental results demonstrate that the proposed CompressKV consistently outperforms state-of-the-art approaches under various memory budgets on LongBench and Needle-in-a-Haystack benchmarks. Our code is publicly available at: https://github.com/TUDa-HWAI/CompressKV.git.
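
The head-selective retention idea can be sketched in plain Python. This is a toy illustration and not the authors' implementation: the function name `select_tokens_to_keep`, the averaging rule, the hand-picked head indices, and the scores are all assumptions made for the example.

```python
from typing import Dict, List


def select_tokens_to_keep(
    head_scores: Dict[int, List[float]],  # head index -> attention score per token
    selected_heads: List[int],            # hypothetical "retrieval head" indices
    budget: int,                          # number of KV pairs to retain
) -> List[int]:
    """Return indices of tokens whose KV pairs are retained, in original order."""
    num_tokens = len(next(iter(head_scores.values())))
    # Average attention over the selected heads only (assumed aggregation rule).
    importance = [
        sum(head_scores[h][t] for h in selected_heads) / len(selected_heads)
        for t in range(num_tokens)
    ]
    # Keep the `budget` highest-scoring tokens; ties go to the earlier token.
    ranked = sorted(range(num_tokens), key=lambda t: (-importance[t], t))
    return sorted(ranked[:budget])


# Toy scores: heads 0 and 2 focus on the first and last tokens,
# head 1 focuses on tokens in the middle of the prompt.
scores = {
    0: [0.45, 0.05, 0.05, 0.05, 0.40],
    1: [0.10, 0.05, 0.55, 0.25, 0.05],
    2: [0.50, 0.05, 0.05, 0.05, 0.35],
}
kept_selected = select_tokens_to_keep(scores, selected_heads=[1], budget=2)
kept_all = select_tokens_to_keep(scores, selected_heads=[0, 1, 2], budget=2)
print(kept_selected, kept_all)
```

In this toy setup, scoring with all heads keeps only the first and last tokens and evicts the mid-prompt token that head 1 attends to, whereas scoring with the selected head alone retains it.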

[78] Learning to Evolve: Bayesian-Guided Continual Knowledge Graph Embedding

Linyu Li,Zhi Jin,Yuanpeng He,Dongming Jin,Yichi Zhang,Haoran Duan,Nyima Tash

Main category: cs.CL

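TL;DR: This paper proposes BAKE, a new continual knowledge graph embedding model that mitigates catastrophic forgetting through Bayesian posterior updates and a continual clustering method; experiments show it outperforms existing methods.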

Details Motivation: Traditional knowledge graph embedding (KGE) models are only suited to static knowledge graphs, whereas real-world knowledge graphs keep evolving, so catastrophic forgetting in continual knowledge graph embedding (CKGE) must be addressed. Method: BAKE uses the Bayesian posterior update principle and a continual clustering method to preserve the model's memory of earlier knowledge and to constrain how much knowledge changes between snapshots. Result: Experiments on multiple datasets show that BAKE significantly outperforms existing baseline models. Conclusion: BAKE effectively mitigates catastrophic forgetting in continual knowledge graph embedding and significantly outperforms existing baselines on multiple datasets. Abstract: Knowledge graphs (KGs) continue to evolve in real scenarios, but traditional knowledge graph embedding (KGE) models are only suitable for static knowledge graphs. Therefore, continual knowledge graph embedding (CKGE) has attracted the attention of researchers. Currently, a key challenge facing CKGE is that the model is prone to "catastrophic forgetting", resulting in the loss of previously learned knowledge. In order to effectively alleviate this problem, we propose a new CKGE model BAKE. First, we note that the Bayesian posterior update principle provides a natural continual learning strategy that is insensitive to data order and can theoretically effectively resist the forgetting of previous knowledge during data evolution. Different from the existing CKGE method, BAKE regards each batch of new data as a Bayesian update of the model prior. Under this framework, as long as the posterior distribution of the model is maintained, the model can better preserve the knowledge of early snapshots even after evolving through multiple time snapshots. Secondly, we propose a continual clustering method for CKGE, which further directly combats knowledge forgetting by constraining the evolution difference (or change amplitude) between new and old knowledge across different snapshots. We conduct extensive experiments with BAKE on multiple datasets, and the results show that BAKE significantly outperforms existing baseline models.
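
The Bayesian update mentioned in the abstract can be written in its standard textbook form. The notation here is assumed for illustration and is not taken from the paper: $\theta$ denotes the model parameters and $\mathcal{D}_t$ the data arriving in snapshot $t$, and the factorization assumes the snapshots are conditionally independent given $\theta$.

```latex
p(\theta \mid \mathcal{D}_{1:t}) \;\propto\; p(\mathcal{D}_t \mid \theta)\, p(\theta \mid \mathcal{D}_{1:t-1})
```

Under this recursion the posterior after snapshot $t-1$ serves as the prior for snapshot $t$. Unrolling it gives $p(\theta) \prod_{i=1}^{t} p(\mathcal{D}_i \mid \theta)$ up to normalization, which does not depend on the order in which the snapshots are processed.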

[79] AI-Based Measurement of Innovation: Mapping Expert Insight into Large Language Model Applications

Robin Nowak,Patrick Figge,Carolin Haeussler

Main category: cs.CL

TL;DR: This paper proposes an LLM-based framework for measuring innovation that is more reliable and effective than existing methods, demonstrated through studies on software updates and product reviews.

Details Motivation: Measuring innovation is often limited by context-specific proxies and reliance on expert evaluation, which creates constraints in empirical research. Method: An LLM framework was designed to approximate domain experts' assessment of innovation from unstructured text data, evaluated through two studies and compared to other machine learning and deep learning models. Result: The LLM framework outperformed other approaches in F1-scores and demonstrated high consistency in results across runs. Conclusion: The LLM framework is a reliable and effective tool for measuring innovation, offering broad applicability and better performance than existing methods. Abstract: Measuring innovation often relies on context-specific proxies and on expert evaluation. Hence, empirical innovation research is often limited to settings where such data is available. We investigate how large language models (LLMs) can be leveraged to overcome the constraints of manual expert evaluations and assist researchers in measuring innovation. We design an LLM framework that reliably approximates domain experts' assessment of innovation from unstructured text data. We demonstrate the performance and broad applicability of this framework through two studies in different contexts: (1) the innovativeness of software application updates and (2) the originality of user-generated feedback and improvement ideas in product reviews. We compared the performance (F1-score) and reliability (consistency rate) of our LLM framework against alternative measures used in prior innovation studies, and to state-of-the-art machine learning- and deep learning-based models. The LLM framework achieved higher F1-scores than the other approaches, and its results are highly consistent (i.e., results do not change across runs). 
This article equips R&D personnel in firms, as well as researchers, reviewers, and editors, with the knowledge and tools to effectively use LLMs for measuring innovation and evaluating the performance of LLM-based innovation measures. In doing so, we discuss the impact of important design decisions (including model selection, prompt engineering, training data size, training data distribution, and parameter settings) on performance and reliability. Given the challenges inherent in using human expert evaluation and existing text-based measures, our framework has important implications for harnessing LLMs as reliable, increasingly accessible, and broadly applicable research tools for measuring innovation.

[80] LatentPrompt: Optimizing Promts in Latent Space

Mateusz Bystroński,Grzegorz Piotrowski,Nitesh V. Chawla,Tomasz Kajdanowicz

Main category: cs.CL

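TL;DR: This paper proposes LatentPrompt, a model-agnostic prompt optimization framework that uses a latent semantic space to automatically generate, evaluate, and refine candidate prompts to improve task performance.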

Details Motivation: Recent work shows that optimizing prompts for large language models can significantly improve task performance, but many optimization techniques rely on heuristics or manual exploration. Method: Starting from a set of seed prompts, the method embeds them in a continuous latent space and systematically explores that space to identify prompts that maximize task-specific performance. Result: In a proof-of-concept study on the Financial PhraseBank sentiment classification benchmark, LatentPrompt improved classification accuracy by approximately 3 percent after a single optimization cycle. Conclusion: LatentPrompt is a broadly applicable prompt optimization framework that requires only black-box access to an LLM and an automatic evaluation metric, making it suitable for diverse domains and tasks. Abstract: Recent advances have shown that optimizing prompts for Large Language Models (LLMs) can significantly improve task performance, yet many optimization techniques rely on heuristics or manual exploration. We present LatentPrompt, a model-agnostic framework for prompt optimization that leverages latent semantic space to automatically generate, evaluate, and refine candidate prompts without requiring hand-crafted rules. Beginning with a set of seed prompts, our method embeds them in a continuous latent space and systematically explores this space to identify prompts that maximize task-specific performance. In a proof-of-concept study on the Financial PhraseBank sentiment classification benchmark, LatentPrompt increased classification accuracy by approximately 3 percent after a single optimization cycle. The framework is broadly applicable, requiring only black-box access to an LLM and an automatic evaluation metric, making it suitable for diverse domains and tasks.

[81] Monsoon Uprising in Bangladesh: How Facebook Shaped Collective Identity

Md Tasin Abir,Arpita Chowdhury,Ashfia Rahman

Main category: cs.CL

TL;DR: This study explores how Facebook helped shape collective identity during the 2024 Bangladesh pro-democracy uprising by using visual and verbal strategies to build solidarity and challenge authoritarian narratives.

Details Motivation: The study aimed to understand how online platforms like Facebook contribute to identity construction and political mobilization during social uprisings, especially under government repression. Method: A qualitative approach was used to analyze visual rhetoric, verbal discourse, and digital irony on Facebook. Result: The research found that Facebook served as a central space for resistance, where multimodal expressions such as images, memes, videos, hashtags, and satirical posts helped unify participants and build collective identity. Conclusion: Facebook played a crucial role in shaping collective identity during the Monsoon Uprising in Bangladesh by combining visual and verbal strategies, which built a strong sense of solidarity and challenged authoritarian narratives. Abstract: This study investigates how Facebook shaped collective identity during the July 2024 pro-democracy uprising in Bangladesh, known as the Monsoon Uprising. During government repression, protesters turned to Facebook as a central space for resistance, where multimodal expressions, images, memes, videos, hashtags, and satirical posts played an important role in unifying participants. Using a qualitative approach, this research analyzes visual rhetoric, verbal discourse, and digital irony to reveal how shared symbols, protest art, and slogans built a sense of solidarity. Key elements included the symbolic use of red, the ironic metaphorical use of the term "Razakar", and the widespread sharing of visuals representing courage, injustice, and resistance. The findings show that the combination of visual and verbal strategies on Facebook not only mobilized public sentiment, but also built a strong collective identity that challenged authoritarian narratives. This study tries to demonstrate how online platforms can serve as powerful tools for identity construction and political mobilization in the digital age.

[82] From Monolingual to Bilingual: Investigating Language Conditioning in Large Language Models for Psycholinguistic Tasks

Shuzhou Yuan,Zhan Qu,Mario Tawfelis,Michael Färber

Main category: cs.CL

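TL;DR: The study finds that large language models exhibit human-like psycholinguistic responses under different linguistic identities, and that language identity affects both their output behavior and internal representations.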

Details Motivation: Large language models (LLMs) show strong linguistic capabilities, but little is known about how they encode psycholinguistic knowledge across languages. Method: Two models, Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct, are evaluated on two tasks, sound symbolism and word valence, under monolingual and bilingual prompting in English, Dutch, and Chinese. Result: Behaviorally, both models adjust their outputs based on the prompted language identity, with Qwen showing greater sensitivity and sharper distinctions between Dutch and Chinese. Probing analysis shows that psycholinguistic signals become more decodable in deeper layers, and that Chinese prompts yield stronger and more stable valence representations than Dutch. Conclusion: Language identity conditions both the output behavior and the internal representations of LLMs, offering new insights into their use as models of cross-linguistic cognition. Abstract: Large Language Models (LLMs) exhibit strong linguistic capabilities, but little is known about how they encode psycholinguistic knowledge across languages. We investigate whether and how LLMs exhibit human-like psycholinguistic responses under different linguistic identities using two tasks: sound symbolism and word valence. We evaluate two models, Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct, under monolingual and bilingual prompting in English, Dutch, and Chinese. Behaviorally, both models adjust their outputs based on prompted language identity, with Qwen showing greater sensitivity and sharper distinctions between Dutch and Chinese. Probing analysis reveals that psycholinguistic signals become more decodable in deeper layers, with Chinese prompts yielding stronger and more stable valence representations than Dutch. Our results demonstrate that language identity conditions both output behavior and internal representations in LLMs, providing new insights into their application as models of cross-linguistic cognition.

[83] Modular Arithmetic: Language Models Solve Math Digit by Digit

Tanja Baeumel,Daniil Gurgurov,Yusser al Ghussin,Josef van Genabith,Simon Ostermann

Main category: cs.CL

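TL;DR: The study reveals a compositional and interpretable structure in how LLMs solve arithmetic problems: digit-position-specific circuits exist and play a causal role.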

Details Motivation: Although recent work has begun to uncover the internal strategies large language models (LLMs) use for simple arithmetic tasks, a unified understanding of the underlying mechanisms is still lacking. Method: Feature Importance and Causal Interventions are used to identify and validate digit-position-specific circuits in LLMs. Result: The study extends recent findings showing that LLMs represent numbers digit-wise and presents evidence that LLMs use digit-position-specific circuits to perform simple arithmetic tasks. Conclusion: When LLMs solve arithmetic problems, digit-position-specific circuits exist independently of model size and tokenization strategy, and these circuits play a causal role in solving arithmetic tasks. Abstract: While recent work has begun to uncover the internal strategies that Large Language Models (LLMs) employ for simple arithmetic tasks, a unified understanding of their underlying mechanisms is still lacking. We extend recent findings showing that LLMs represent numbers in a digit-wise manner and present evidence for the existence of digit-position-specific circuits that LLMs use to perform simple arithmetic tasks, i.e. modular subgroups of MLP neurons that operate independently on different digit positions (units, tens, hundreds). Notably, such circuits exist independently of model size and of tokenization strategy, i.e. both for models that encode longer numbers digit-by-digit and as one token. Using Feature Importance and Causal Interventions, we identify and validate the digit-position-specific circuits, revealing a compositional and interpretable structure underlying the solving of arithmetic problems in LLMs. Our interventions selectively alter the model's prediction at targeted digit positions, demonstrating the causal role of digit-position circuits in solving arithmetic tasks.

[84] PoeTone: A Framework for Constrained Generation of Structured Chinese Songci with LLMs

Zhan Qu,Shuzhou Yuan,Michael Färber

Main category: cs.CL

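TL;DR: This study systematically evaluates the ability of large language models to generate Songci, a classical Chinese poetry form, and proposes a Generate-Critic architecture in which fine-tuning improves formal conformity.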

Details Motivation: To study the ability of large language models (LLMs) to generate Songci, a classical Chinese poetry form with strict structural, tonal, and rhyme constraints defined by Cipai templates. Method: The authors develop a comprehensive evaluation framework comprising a formal conformity score, automated quality assessment, human evaluation, and classification-based probing tasks, and use it to evaluate 18 LLMs under five prompting strategies. They also fine-tune three lightweight open-source LLMs via supervised fine-tuning (SFT). Result: Using the evaluation framework as an automated critic and the critic's feedback as a reward signal for fine-tuning improves formal conformity by up to 5.88%. Conclusion: The findings offer new insights into the strengths and limitations of LLMs in generating culturally significant and formally constrained text. Abstract: This paper presents a systematic investigation into the constrained generation capabilities of large language models (LLMs) in producing Songci, a classical Chinese poetry form characterized by strict structural, tonal, and rhyme constraints defined by Cipai templates. We first develop a comprehensive, multi-faceted evaluation framework that includes: (i) a formal conformity score, (ii) automated quality assessment using LLMs, (iii) human evaluation, and (iv) classification-based probing tasks. Using this framework, we evaluate the generative performance of 18 LLMs, including 3 proprietary models and 15 open-source models across four families, under five prompting strategies: zero-shot, one-shot, completion-based, instruction-tuned, and chain-of-thought. Finally, we propose a Generate-Critic architecture in which the evaluation framework functions as an automated critic. Leveraging the critic's feedback as a reward signal, we fine-tune three lightweight open-source LLMs via supervised fine-tuning (SFT), resulting in improvements of up to 5.88% in formal conformity. Our findings offer new insights into the generative strengths and limitations of LLMs in producing culturally significant and formally constrained literary texts.

[85] I Have No Mouth, and I Must Rhyme: Uncovering Internal Phonetic Representations in LLaMA 3.2

Jack Merullo,Arjun Khurana,Oliver McLaughlin

Main category: cs.CL

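TL;DR: Llama-3.2-1B-Instruct learns human-like phoneme representations without direct supervision and uses a specific mechanism to process phonetic information in rhyming tasks.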

Details Motivation: To study how large language models complete phonetic tasks such as rhyming without explicit phonetic or auditory grounding. Method: The authors study how Llama-3.2-1B-Instruct represents phoneme information and use visualization to analyze a "phoneme mover head" involved in rhyming tasks. Result: Llama uses a rich internal model of phonemes to complete phonetic tasks; the authors identify a "phoneme mover head" that promotes phonetic information, and its representation of vowels resembles the human IPA vowel chart. Conclusion: Llama-3.2-1B-Instruct has a rich internal phoneme model with high-level organization of phoneme representations in its latent space, and it learns a vowel model similar to the standard human IPA vowel chart even without direct supervision. Abstract: Large language models demonstrate proficiency on phonetic tasks, such as rhyming, without explicit phonetic or auditory grounding. In this work, we investigate how Llama-3.2-1B-Instruct represents token-level phonetic information. Our results suggest that Llama uses a rich internal model of phonemes to complete phonetic tasks. We provide evidence for high-level organization of phoneme representations in its latent space. In doing so, we also identify a "phoneme mover head" which promotes phonetic information during rhyming tasks. We visualize the output space of this head and find that, while notable differences exist, Llama learns a model of vowels similar to the standard IPA vowel chart for humans, despite receiving no direct supervision to do so.

[86] Contextual Graph Transformer: A Small Language Model for Enhanced Engineering Document Information Extraction

Karan Reddy,Mayukha Pal

Main category: cs.CL

TL;DR: The Contextual Graph Transformer (CGT) is a hybrid model that combines Graph Neural Networks and Transformers to improve domain-specific question answering in technical documents. It outperforms existing models in accuracy while being more parameter-efficient.

Details Motivation: Standard transformer-based language models often struggle with the fine-grained syntax and entity relationships in complex technical, engineering documents. The need for a model that can handle domain-specific question answering more effectively led to the development of the Contextual Graph Transformer (CGT). Method: The Contextual Graph Transformer (CGT) constructs a dynamic graph over input tokens using sequential, skip-gram, and semantic similarity edges, which are processed by GATv2Conv layers. The enriched embeddings are then passed to a Transformer encoder. The model is trained in two phases: pretraining on general text and fine-tuning on domain-specific manuals. Result: Integrated into a Retrieval-Augmented Generation (RAG) pipeline, CGT outperforms baselines like GPT-2 and BERT, achieving 24.7% higher accuracy than GPT-2 with 62.4% fewer parameters. Conclusion: The Contextual Graph Transformer (CGT) is a hybrid neural architecture that effectively combines Graph Neural Networks and Transformers to enhance domain-specific question answering. It proves to be parameter-efficient and adaptable to technical language, making it suitable for complex technical and engineering documents. Abstract: Standard transformer-based language models, while powerful for general text, often struggle with the fine-grained syntax and entity relationships in complex technical, engineering documents. To address this, we propose the Contextual Graph Transformer (CGT), a hybrid neural architecture that combines Graph Neural Networks (GNNs) and Transformers for domain-specific question answering. CGT constructs a dynamic graph over input tokens using sequential, skip-gram, and semantic similarity edges, which is processed by GATv2Conv layers for local structure learning. These enriched embeddings are then passed to a Transformer encoder to capture global dependencies. 
Unlike generic large models, technical domains often require specialized language models with stronger contextualization and structure awareness. CGT offers a parameter-efficient solution for such use cases. Integrated into a Retrieval-Augmented Generation (RAG) pipeline, CGT outperforms baselines like GPT-2 and BERT, achieving 24.7% higher accuracy than GPT-2 with 62.4% fewer parameters. This gain stems from CGT's ability to jointly model structural token interactions and long-range semantic coherence. The model is trained from scratch using a two-phase approach: pretraining on general text followed by fine-tuning on domain-specific manuals. This highlights CGT's adaptability to technical language, enabling better grounding, entity tracking, and retrieval-augmented responses in real-world applications.
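
The three edge types named in the abstract (sequential, skip-gram, and semantic similarity) can be illustrated with a small graph builder in plain Python. This is only a sketch under stated assumptions: the function name `build_token_graph`, the `skip_window` and `sim_threshold` parameters, and the toy two-dimensional embeddings are hypothetical, and the abstract does not specify how the authors define or weight these edges.

```python
import math
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_token_graph(
    embeddings: List[List[float]],
    skip_window: int = 2,        # assumed: maximum distance for skip-gram edges
    sim_threshold: float = 0.9,  # assumed: cosine cutoff for semantic edges
) -> Set[Edge]:
    """Return undirected edges (i, j) with i < j over token positions."""
    n = len(embeddings)
    edges: Set[Edge] = set()
    for i in range(n):
        # Sequential edge to the next token.
        if i + 1 < n:
            edges.add((i, i + 1))
        # Skip-gram edges to tokens 2..skip_window positions ahead.
        for d in range(2, skip_window + 1):
            if i + d < n:
                edges.add((i, i + d))
        # Semantic edges to any later token with a similar embedding.
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= sim_threshold:
                edges.add((i, j))
    return edges


toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.1]]
graph = build_token_graph(toy)
print(sorted(graph))
```

In the toy input, tokens 0 and 4 are far apart in the sequence but have nearly parallel embeddings, so they are linked only by a semantic edge. An edge set like this is what a graph attention layer would then consume.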

[87] What's in the News? Towards Identification of Bias by Commission, Omission, and Source Selection (COSS)

Anastasia Zhukova,Terry Ruas,Felix Hamborg,Karsten Donnay,Bela Gipp

Main category: cs.CL

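TL;DR: This paper introduces a new approach to automatically identifying news bias by jointly identifying three types of bias, and provides a visualization example.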

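Details Motivation: In a world flooded with news, readers struggle to judge how reliable information is and how neutral its reporting is. Method: A new methodology is proposed for automatically identifying three types of news bias: bias by omission, bias by commission, and bias by source selection. Result: A pipeline concept is designed, with the goals and tasks of its steps described, along with an example visualization that leverages the extracted features and patterns of text reuse. Conclusion: The paper proposes a methodology for identifying news bias that improves on prior work by jointly identifying bias by omission, commission, and source selection. Abstract: In a world overwhelmed with news, determining which information comes from reliable sources or how neutral the reported information in news articles is poses a challenge to news readers. In this paper, we propose a methodology for automatically identifying bias by commission, omission, and source selection (COSS) as a joint three-fold objective, as opposed to the previous work separately addressing these types of bias. In a pipeline concept, we describe the goals and tasks of its steps toward bias identification and provide an example of a visualization that leverages the extracted features and patterns of text reuse.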

[88] Building and Aligning Comparable Corpora

Motaz Saad,David Langlois,Kamel Smaili

Main category: cs.CL

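TL;DR: This paper presents a method based on Cross-Lingual Latent Semantic Indexing (CL-LSI) for building and aligning comparable corpora in English, French, and Arabic, and shows that it is effective at both the topic and event levels.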

Details Motivation: Comparable corpora are valuable for multilingual NLP when parallel texts are unavailable. Method: Built comparable corpora from Wikipedia, EURONEWS, BBC, and JSC; used bilingual dictionary and CL-LSI for cross-lingual alignment. Result: CL-LSI outperformed dictionary-based alignment on multiple corpora. Conclusion: The CL-LSI similarity measure is effective for aligning cross-lingual documents at both topic and event levels. Abstract: A comparable corpus is a set of topic-aligned documents in multiple languages, which are not necessarily translations of each other. These documents are useful for multilingual natural language processing when there is no parallel text available in some domains or languages. In addition, comparable documents are informative because they can tell what is being said about a topic in different languages. In this paper, we present a method to build comparable corpora from the Wikipedia encyclopedia and the EURONEWS website in English, French, and Arabic. We further experiment with a method to automatically align comparable documents using cross-lingual similarity measures. We investigate two cross-lingual similarity measures to align comparable documents. The first measure is based on a bilingual dictionary, and the second measure is based on Latent Semantic Indexing (LSI). Experiments on several corpora show that the Cross-Lingual LSI (CL-LSI) measure outperforms the dictionary-based measure. Finally, we collect English and Arabic news documents from the British Broadcasting Corporation (BBC) and from the ALJAZEERA (JSC) news website respectively. Then we use the CL-LSI similarity measure to automatically align comparable documents of BBC and JSC. The evaluation of the alignment shows that CL-LSI is not only able to align cross-lingual documents at the topic level, but is also able to do this at the event level.
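
As a rough illustration of the first, dictionary-based similarity measure, the sketch below maps source-language tokens into the target vocabulary through a bilingual dictionary and aligns each source document to its most similar target document by cosine similarity. The toy French-English dictionary, the documents, and the function names are hypothetical, and the abstract does not describe the authors' exact measure. The CL-LSI measure that the paper finds stronger is not shown here.

```python
import math
from collections import Counter
from typing import Dict, List


def translate_bow(tokens: List[str], dictionary: Dict[str, str]) -> Counter:
    """Map source tokens into the target vocabulary, dropping unknown words."""
    return Counter(dictionary[t] for t in tokens if t in dictionary)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def align(
    source_docs: List[List[str]],
    target_docs: List[List[str]],
    dictionary: Dict[str, str],
) -> List[int]:
    """For each source document, return the index of the most similar target."""
    targets = [Counter(doc) for doc in target_docs]
    result = []
    for doc in source_docs:
        bow = translate_bow(doc, dictionary)
        sims = [cosine(bow, t) for t in targets]
        result.append(max(range(len(targets)), key=lambda j: sims[j]))
    return result


# Hypothetical toy dictionary and documents (already tokenized).
fr_en = {"gouvernement": "government", "ministre": "minister",
         "joueur": "player", "match": "match"}
french = [["gouvernement", "ministre", "annonce"], ["joueur", "match", "match"]]
english = [["player", "wins", "match"], ["minister", "government", "statement"]]
alignment = align(french, english, fr_en)
print(alignment)
```

Words missing from the dictionary (here "annonce") are simply dropped, which is one reason a dictionary-based measure can lose information on real corpora.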

[89] Automated SNOMED CT Concept Annotation in Clinical Text Using Bi-GRU Neural Networks

Ali Noori,Pratik Devkota,Somya Mohanty,Prashanti Manda

Main category: cs.CL

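TL;DR: This study proposes a sequence labeling approach based on a bidirectional GRU for efficient automated annotation of SNOMED CT clinical concepts, achieving strong performance at low computational cost.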

Details Motivation: Manual annotation of clinical text is labor-intensive and hard to scale, while SNOMED CT offers a rich ontology for labeling clinical entities, so automated methods are needed to improve efficiency. Method: A Bidirectional GRU model is used for SNOMED CT concept recognition; text is preprocessed with domain-adapted SpaCy and SciBERT-based tokenization, sentences are segmented into overlapping 19-token chunks, and contextual, syntactic, and morphological features are combined for sequence labeling. Result: The Bi-GRU model achieves a 90 percent F1-score on the validation set, surpassing traditional rule-based systems and matching or exceeding existing neural models. Qualitative analysis shows effective handling of ambiguous terms and misspellings. Conclusion: Lightweight RNN-based architectures can deliver high-quality clinical concept annotation at significantly lower computational cost than transformer-based models, making them well suited to real-world deployment. Abstract: Automated annotation of clinical text with standardized medical concepts is critical for enabling structured data extraction and decision support. SNOMED CT provides a rich ontology for labeling clinical entities, but manual annotation is labor-intensive and impractical at scale. This study introduces a neural sequence labeling approach for SNOMED CT concept recognition using a Bidirectional GRU model. Leveraging a subset of MIMIC-IV, we preprocess text with domain-adapted SpaCy and SciBERT-based tokenization, segmenting sentences into overlapping 19-token chunks enriched with contextual, syntactic, and morphological features. The Bi-GRU model assigns IOB tags to identify concept spans and achieves strong performance with a 90 percent F1-score on the validation set. These results surpass traditional rule-based systems and match or exceed existing neural models. Qualitative analysis shows effective handling of ambiguous terms and misspellings. Our findings highlight that lightweight RNN-based architectures can deliver high-quality clinical concept annotation with significantly lower computational cost than transformer-based models, making them well-suited for real-world deployment.
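
Two preprocessing and decoding steps named in the abstract, overlapping 19-token chunks and IOB tags, can be sketched in plain Python. The chunk size of 19 comes from the abstract; the stride of 10, the bare `B`/`I`/`O` tag set without concept types, and both function names are assumptions made for illustration.

```python
from typing import List, Tuple


def overlapping_chunks(
    tokens: List[str], size: int = 19, stride: int = 10
) -> List[List[str]]:
    """Split tokens into windows of `size`; `stride` (assumed) sets the overlap."""
    if len(tokens) <= size:
        return [tokens]
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the sentence
        start += stride
    return chunks


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert IOB tags to (start, end) token spans, with end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new span begins; close any span that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


tokens = [f"t{i}" for i in range(40)]
chunks = overlapping_chunks(tokens)
spans = iob_to_spans(["O", "B", "I", "O", "B", "B", "I"])
print([len(c) for c in chunks], spans)
```

With this stride, consecutive windows share nine tokens, so a concept that straddles a chunk boundary still appears whole in at least one window; the final window may be shorter than 19 tokens.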

[90] Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

Yuerong Song,Xiaoran Liu,Ruixiao Li,Zhigeng Liu,Zengfeng Huang,Qipeng Guo,Ziwei He,Xipeng Qiu

Main category: cs.CL

TL;DR: Sparse-dLLM exploits the stability of token saliency to manage the cache dynamically and efficiently, improving the inference efficiency of dLLMs.

Details Motivation: dLLMs suffer from prohibitive quadratic computational complexity and memory overhead during inference. Existing caching techniques accelerate decoding but consume too much memory, which limits their use in long-context scenarios. Method: An analysis of attention patterns in dLLMs reveals cross-layer sparsity; an attention-guided strategy then retains critical tokens and dynamically evicts unimportant ones. Result: Experiments on the LLaDA and Dream series show that Sparse-dLLM achieves up to 10$\times$ higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, and outperforms previous methods in efficiency and effectiveness. Conclusion: Sparse-dLLM is a training-free framework that improves inference efficiency and throughput through dynamic cache eviction and sparse attention, while keeping performance and memory consumption comparable to vanilla dLLMs. Abstract: Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on LLaDA and Dream series demonstrate Sparse-dLLM achieves up to 10$\times$ higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.
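
The general idea of attention-guided eviction, keeping the cache entries with the highest saliency and dropping the rest, can be sketched as below. This is a generic top-k retention toy, not Sparse-dLLM's delayed bidirectional sparse caching, whose details the abstract does not give. The function `evict_cache`, the `keep_ratio` parameter, and the dictionary-based cache are all hypothetical.

```python
from typing import Dict


def evict_cache(cache: Dict[int, str], saliency: Dict[int, float],
                keep_ratio: float = 0.5) -> Dict[int, str]:
    """Keep the most salient cache entries and drop the rest.

    `cache` maps a token position to its cached state (a string stands in for
    key/value tensors here); `saliency` maps a position to an attention-derived
    score. Both structures are illustrative assumptions.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(cache) * keep_ratio))
    # Rank positions by saliency, highest first; missing scores count as 0.
    ranked = sorted(cache, key=lambda pos: saliency.get(pos, 0.0), reverse=True)
    kept = set(ranked[:n_keep])
    # Preserve the original position order of the surviving entries.
    return {pos: state for pos, state in cache.items() if pos in kept}
```

Because the abstract reports that token saliency is stable across decoding steps, a ranking computed at one step can plausibly be reused for later steps, which is what makes this kind of eviction cheap.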

[91] Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs

Jérémie Dentan,Davide Buscaldi,Sonia Vanier

Main category: cs.CL

TL;DR: A new method analyzes memorization in LLMs, revealing that current taxonomy is inadequate and proposing a new classification system.

Details Motivation: The motivation was to understand the distinct mechanisms behind verbatim memorization in LLMs and assess the alignment of existing taxonomy with attention mechanisms. Method: The researchers trained CNNs on the attention weights of LLMs to analyze memorization mechanisms and developed a visual interpretability technique. Result: The findings show that the existing taxonomy poorly reflects memorization mechanisms, leading to the proposal of a new taxonomy with three categories. Additionally, it was found that few-shot memorization doesn't have a distinct attention mechanism, and many extractable samples are guessed by the model. Conclusion: The study concludes that verbatim memorization in LLMs isn't linked to a unique attention mechanism and proposes a new taxonomy for categorizing memorization types. Abstract: Verbatim memorization in Large Language Models (LLMs) is a multifaceted phenomenon involving distinct underlying mechanisms. We introduce a novel method to analyze the different forms of memorization described by the existing taxonomy. Specifically, we train Convolutional Neural Networks (CNNs) on the attention weights of the LLM and evaluate the alignment between this taxonomy and the attention weights involved in decoding. We find that the existing taxonomy performs poorly and fails to reflect distinct mechanisms within the attention blocks. We propose a new taxonomy that maximizes alignment with the attention weights, consisting of three categories: memorized samples that are guessed using language modeling abilities, memorized samples that are recalled due to high duplication in the training set, and non-memorized samples. Our results reveal that few-shot verbatim memorization does not correspond to a distinct attention mechanism. We also show that a significant proportion of extractable samples are in fact guessed by the model and should therefore be studied separately. 
Finally, we develop a custom visual interpretability technique to localize the regions of the attention weights involved in each form of memorization.

[92] EHSAN: Leveraging ChatGPT in a Hybrid Framework for Arabic Aspect-Based Sentiment Analysis in Healthcare

Eman Alamoudi,Ellis Solaiman

Main category: cs.CL

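TL;DR: This study builds a high-quality Arabic healthcare sentiment dataset by combining ChatGPT with human review, and validates its effectiveness and scalability for aspect-based sentiment analysis.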

Details Motivation: Arabic patient feedback is hard to assess automatically because of dialect diversity and a lack of fine-grained sentiment labels; this study aims to fill that gap. Method: The study introduces EHSAN, a data-centric hybrid pipeline that combines ChatGPT pseudo-labelling with targeted human review to build the first explainable Arabic aspect-based sentiment dataset for healthcare. Result: Experiments show that the Arabic-specific model achieves high accuracy even with little human supervision, and that classification improves notably when the number of aspect classes is reduced. Conclusion: Combining large language model annotation with human expertise enables effective and scalable Arabic aspect-based sentiment analysis in healthcare. Abstract: Arabic-language patient feedback remains under-analysed because dialect diversity and scarce aspect-level sentiment labels hinder automated assessment. To address this gap, we introduce EHSAN, a data-centric hybrid pipeline that merges ChatGPT pseudo-labelling with targeted human review to build the first explainable Arabic aspect-based sentiment dataset for healthcare. Each sentence is annotated with an aspect and sentiment label (positive, negative, or neutral), forming a pioneering Arabic dataset aligned with healthcare themes, with ChatGPT-generated rationales provided for each label to enhance transparency. To evaluate the impact of annotation quality on model performance, we created three versions of the training data: a fully supervised set with all labels reviewed by humans, a semi-supervised set with 50% human review, and an unsupervised set with only machine-generated labels. We fine-tuned two transformer models on these datasets for both aspect and sentiment classification. Experimental results show that our Arabic-specific model achieved high accuracy even with minimal human supervision, reflecting only a minor performance drop when using ChatGPT-only labels. Reducing the number of aspect classes notably improved classification metrics across the board. These findings demonstrate an effective, scalable approach to Arabic aspect-based sentiment analysis (SA) in healthcare, combining large language model annotation with human expertise to produce a robust and explainable dataset. Future directions include generalisation across hospitals, prompt refinement, and interpretable data-driven modelling.

[93] MArgE: Meshing Argumentative Evidence from Multiple Large Language Models for Justifiable Claim Verification

Ming Pok Ng,Junqi Jiang,Gabriel Freedman,Antonio Rago,Francesca Toni

Main category: cs.CL

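TL;DR: MArgE is a new framework that improves the accuracy and interpretability of claim verification by extracting structured argument trees from multiple large language models (LLMs).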

Details Motivation: Current approaches to combining multiple LLMs typically lack structure, so the generated content is hard to verify faithfully. A more reliable way is needed to combine the strengths of multiple LLMs while reducing errors such as hallucinations. Method: Introduces the MArgE framework, which uses Argumentative LLMs (ArgLLMs) to build structured argument trees, providing a traceable, interpretable reasoning path for claim verification. Result: Experiments show that MArgE significantly outperforms single LLMs (including GPT-4o-mini and several open-source models) as well as existing unstructured multi-LLM debate methods on claim verification. Conclusion: Bringing formal argumentative reasoning mechanisms into multi-LLM combination effectively improves task performance and the interpretability of the outputs. Abstract: Leveraging outputs from multiple large language models (LLMs) is emerging as a method for harnessing their power across a wide range of tasks while mitigating their capacity for making errors, e.g., hallucinations. However, current approaches to combining insights from multiple LLMs often involve unstructured interactions (e.g., free debate), resulting in model generations that are not faithfully justifiable. In this work, we introduce MArgE, a novel framework to provide formal structure to the evidence from each LLM, in the form of a tree of extracted arguments, for the task of claim verification. We use a variant of Argumentative LLMs (ArgLLMs), i.e. LLMs driven by frameworks and semantics from the field of computational argumentation, to construct structured argument trees for given claims. This process creates an inspectable pathway from the initial arguments to the final claim verification decisions, providing a faithful justification thereof. We show experimentally that MArgE can significantly outperform single LLMs, including three open-source models (4B to 8B parameters), GPT-4o-mini and existing ArgLLMs, as well as prior methods for unstructured multi-LLM debates. We thus demonstrate the advantages of incorporating formal, argumentative reasoning mechanisms when combining multiple LLM outputs.

[94] CharBench: Evaluating the Role of Tokenization in Character-Level Tasks

Omri Uzan,Yuval Pinter

Main category: cs.CL

TL;DR: The paper introduces CharBench, a large benchmark for character-level tasks, revealing that modern LLMs struggle with these tasks and showing how word and token properties affect performance.

Details Motivation: Character-level reasoning tasks are challenging for language models, and the impact of tokenization on these tasks remains unclear. This motivated the creation of a more extensive benchmark and evaluation methodology. Method: The paper introduces CharBench, a comprehensive benchmark for character-level tasks, and evaluates various models on this benchmark to analyze the impact of word properties and tokenization on performance. Result: CharBench poses a significant challenge to modern LLMs, with average accuracies of 43.6% and 32.3% on certain tasks. Performance on counting tasks is more influenced by word length and character count, while intra-word positional understanding is impacted by token length. Conclusion: The paper concludes that tokenization properties have a limited impact on character-level counting tasks, while word length and actual character count have a greater influence. For intra-word positional understanding tasks, performance is negatively correlated with token length. Abstract: Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models' reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with an average accuracy of 43.6% and 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. 
For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part. In contrast, for tasks requiring intra-word positional understanding, performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs. We encourage future work to build on the benchmark and evaluation methodology introduced here as tools for improving model performance on such tasks.
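
To make the analysed quantities concrete, the sketch below computes ground truth for the two task types named in the abstract (counting a character, locating a character) and the length of the token that contains a queried character. The helper names and the example segmentation `["str", "aw", "berry"]` are hypothetical; a real tokenizer may split the word differently, and this is not CharBench's evaluation code.

```python
from typing import List


def count_char(word: str, ch: str) -> int:
    """Ground truth for a counting task: occurrences of `ch` in `word`."""
    return word.count(ch)


def find_char(word: str, ch: str) -> int:
    """Ground truth for a positional task: 0-based index of the first `ch`, or -1."""
    return word.find(ch)


def token_length_at(segments: List[str], char_index: int) -> int:
    """Length of the token (segment) that contains the character at `char_index`.

    `segments` is a word's split into subword tokens, concatenating back to the word.
    """
    offset = 0
    for seg in segments:
        if offset <= char_index < offset + len(seg):
            return len(seg)
        offset += len(seg)
    raise IndexError("char_index is outside the word")


word = "strawberry"
segments = ["str", "aw", "berry"]  # assumed segmentation, for illustration only
r_count = count_char(word, "r")
w_index = find_char(word, "w")
w_token_len = token_length_at(segments, w_index)
```

Under this assumed segmentation the queried `w` sits in a two-character token, while a character in `berry` sits in a five-character token; the abstract's finding is that the longer the containing token, the lower the accuracy on positional tasks.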

[95] Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation

Jianxiang Zang,Meiling Ning,Shihan Dou,Jiazheng Zhang,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

TL;DR: Proposes "Interaction Distillation", a new framework for optimizing reward models for large language models that addresses attention hacking in preference modeling.

Details Motivation: Mainstream reward models are inadequate in token-level interaction for preference modeling, which leaves their judgment signals vulnerable to misallocated attention. Method: Introduces an interaction-based natural language understanding model as a teacher that provides sophisticated token interaction patterns through comprehensive attention, and guides preference modeling to simulate the teacher's interaction pattern through an attentional alignment objective. Result: Extensive experiments show that Interaction Distillation provides more stable and generalizable reward signals than state-of-the-art reward model optimization methods. Conclusion: Attention hacking is a more fundamental limitation in reward models, and Interaction Distillation offers a new and effective way to optimize preference modeling. Abstract: The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals to generated responses. However, mainstream preference modeling in RM is inadequate in terms of token-level interaction, making its judgment signals vulnerable to being hacked by misallocated attention to context. This stems from two fundamental limitations: (1) Current preference modeling employs decoder-only architectures, where the unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence. (2) The independent Siamese-encoding paradigm induces the absence of token-level inter-sequence attention between chosen and rejected sequences. To address this "attention hacking", we propose "Interaction Distillation", a novel training framework for more adequate preference modeling through attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher to provide sophisticated token interaction patterns via comprehensive attention, and guides the preference modeling to simulate the teacher model's interaction pattern through an attentional alignment objective. Through extensive experiments, interaction distillation has demonstrated its ability to provide more stable and generalizable reward signals compared to state-of-the-art RM optimization methods that target data noise, highlighting that attention hacking constitutes a more fundamental limitation in RM.

[96] Pointer: Linear-Complexity Long-Range Modeling without Pre-training

Zixi Li

Main category: cs.CL

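TL;DR: Pointer is a new architecture that models long sequences efficiently through a pointer-chaining mechanism and maintains high performance without pre-training.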

Details Motivation: Pointer is proposed to overcome the high computational complexity (O(N^2)) of standard attention while maintaining model performance. Method: Pointer models long sequences at linear complexity through layer-wise pointer chaining. Result: Pointer achieves a 2-10x speedup on long sequences and maintains >95% accuracy on copy tasks at distances up to 2048 tokens. Conclusion: Pointer offers an effective alternative to attention mechanisms for applications that need efficient long-range modeling without pre-training. Abstract: We introduce Pointer, a novel architecture that achieves linear $O(NK)$ complexity for long-range sequence modeling while maintaining superior performance without requiring pre-training. Unlike standard attention mechanisms that compute $O(N^2)$ pairwise interactions, our approach uses layer-wise pointer chaining where each layer's pointer selection depends on previous layer's pointer positions, creating explicit long-distance connections through pointer chains. We demonstrate that this architecture achieves $2$--$10\times$ speedup on long sequences compared to standard transformers, maintains $>95\%$ accuracy on copy tasks at distances up to 2048 tokens, and learns interpretable pointer patterns that reveal structured dependency modeling. Our experiments on efficiency benchmarks, long-range dependency tasks, and interpretability analysis show that Pointer offers a compelling alternative to attention mechanisms for scenarios requiring efficient long-range modeling without pre-training dependencies.

[97] Test Set Quality in Multilingual LLM Evaluation

Kranti Chalamalasetti,Gabriel Bernier-Colborne,Yvan Gauthier,Sowmya Vajjala

Main category: cs.CL

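TL;DR: This paper examines the quality of multilingual benchmark datasets. A manual analysis of French and Telugu evaluation sets uncovers many errors, and these errors significantly affect the measured performance of large language models.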

Details Motivation: Although earlier work has identified errors even in fully human-annotated test sets, little attention has been paid to the quality of the datasets themselves. Method: The authors manually analyze recent multilingual evaluation sets (French and Telugu) and compare the performance of several LLMs on the original and revised versions. Result: The performance of several LLMs differs substantially between the original and revised datasets (almost 10% in some cases). Conclusion: Test sets should not be treated as immutable; they should be rechecked for correctness and potentially versioned. Abstract: Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state-of-the-art in the multilingual capabilities of Large Language Models. However, there is not a lot of attention paid to the quality of the datasets themselves, despite the existence of previous work in identifying errors in even fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages - French and Telugu, identifying several errors in the process. We compare the performance difference across several LLMs with the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We end with some recommendations for both the dataset creators as well as consumers on addressing the dataset quality issues.

cs.CV [Back]

[98] Team PA-VCG's Solution for Competition on Understanding Chinese College Entrance Exam Papers in ICDAR'25

Wei Wu,Wenjie Wang,Yang Tan,Ying Liu,Liang Diao,Lin Huang,Kaihe Xu,Wenfeng Xie,Ziling Lin

Main category: cs.CV

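TL;DR: This report presents a new method for understanding Chinese college entrance exam (Gaokao) papers that combines high-resolution image processing with domain-specific post-training strategies, reaching 89.6% accuracy and first place in the ICDAR'25 competition.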

Details Motivation: To address the challenges of dense OCR extraction and complex document layouts in Gaokao papers. Method: Uses high-resolution image processing and a multi-image end-to-end input strategy, and introduces domain-specific post-training strategies. Result: Experiments show that the post-training approach performs best, reaching 89.6% accuracy and securing first place. Conclusion: The method achieves the best performance on understanding Gaokao papers, with 89.6% accuracy and first place. Abstract: This report presents Team PA-VGG's solution for the ICDAR'25 Competition on Understanding Chinese College Entrance Exam Papers. In addition to leveraging high-resolution image processing and a multi-image end-to-end input strategy to address the challenges of dense OCR extraction and complex document layouts in Gaokao papers, our approach introduces domain-specific post-training strategies. Experimental results demonstrate that our post-training approach achieves the most outstanding performance, securing first place with an accuracy rate of 89.6%.

[99] Inclusive Review on Advances in Masked Human Face Recognition Technologies

Ali Haitham Abdul Amir,Zainab N. Nemer

Main category: cs.CV

TL;DR: This paper reviews advancements in masked face recognition, focusing on deep learning methods like CNNs and Siamese networks. It discusses challenges such as partial facial visibility due to masks and proposes solutions like data augmentation and multimedia integration. The study emphasizes the importance of improving algorithm efficiency and expanding real-world applications in security and healthcare.

Details Motivation: The motivation for this paper stems from the increased use of face masks due to the COVID-19 pandemic, which has introduced new challenges for traditional facial recognition systems. This has created a need to adapt and improve recognition technologies to maintain accuracy under partial facial visibility. Method: The paper provides a comprehensive review of recent developments in masked face recognition, focusing on deep learning techniques like convolutional neural networks (CNNs) and Siamese networks. It analyzes challenges such as lighting variations, facial positions, partial concealment, and mask types, and explores advanced solutions like data augmentation and multimedia methods. Result: The paper identifies key challenges in masked face recognition and reviews the advanced techniques developed to overcome them. It highlights progress in deep network design, feature extraction, evaluation metrics, and datasets. It also outlines current applications in security and medicine and forecasts future research directions. Conclusion: The paper concludes that masked face recognition is a crucial area of biometric recognition, especially in light of ongoing health crises and security concerns. It highlights the importance of developing more efficient algorithms and integrating multimedia technologies to enhance system performance and broaden applications. Abstract: Masked Face Recognition (MFR) is an increasingly important area in biometric recognition technologies, especially with the widespread use of masks as a result of the COVID-19 pandemic. This development has created new challenges for facial recognition systems due to the partial concealment of basic facial features. 
This paper aims to provide a comprehensive review of the latest developments in the field, with a focus on deep learning techniques, especially convolutional neural networks (CNNs) and twin networks (Siamese networks), which have played a pivotal role in improving the accuracy of masked face recognition. The paper discusses the most prominent challenges, which include changes in lighting, different facial positions, partial concealment, and the impact of mask types on the performance of systems. It also reviews advanced technologies developed to overcome these challenges, including data enhancement using artificial databases and multimedia methods to improve the ability of systems to generalize. In addition, the paper highlights advances in deep network design, feature extraction techniques, evaluation criteria, and data sets used in this area. Moreover, it reviews the various applications of masked face recognition in the fields of security and medicine, highlighting the growing importance of these systems in light of recurrent health crises and increasing security threats. Finally, the paper focuses on future research trends such as developing more efficient algorithms and integrating multimedia technologies to improve the performance of recognition systems in real-world environments and expand their applications.

[100] HoneyImage: Verifiable, Harmless, and Stealthy Dataset Ownership Verification for Image Models

Zhihao Zhu,Jiale Han,Yi Yang

Main category: cs.CV

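TL;DR: HoneyImage is a new method for dataset ownership verification in image recognition models. It modifies a small number of hard samples to embed imperceptible yet verifiable traces, enabling reliable verification.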

Details Motivation: Many image datasets carry sensitive or proprietary content, raising serious concerns about unauthorized data usage. Data owners therefore need reliable mechanisms to verify whether their proprietary data has been misused to train third-party models. Method: HoneyImage selectively modifies a small number of hard samples to embed imperceptible yet verifiable traces, enabling reliable ownership verification while maintaining dataset integrity. Result: Extensive experiments across four benchmark datasets and multiple model architectures show that HoneyImage consistently achieves strong verification accuracy with minimal impact on downstream performance while remaining imperceptible. Conclusion: HoneyImage is a practical mechanism that gives data owners a way to protect ownership of valuable image datasets, encouraging safe sharing and unlocking the full transformative potential of data-driven AI. Abstract: Image-based AI models are increasingly deployed across a wide range of domains, including healthcare, security, and consumer applications. However, many image datasets carry sensitive or proprietary content, raising critical concerns about unauthorized data usage. Data owners therefore need reliable mechanisms to verify whether their proprietary data has been misused to train third-party models. Existing solutions, such as backdoor watermarking and membership inference, face inherent trade-offs between verification effectiveness and preservation of data integrity. In this work, we propose HoneyImage, a novel method for dataset ownership verification in image recognition models. HoneyImage selectively modifies a small number of hard samples to embed imperceptible yet verifiable traces, enabling reliable ownership verification while maintaining dataset integrity. Extensive experiments across four benchmark datasets and multiple model architectures show that HoneyImage consistently achieves strong verification accuracy with minimal impact on downstream performance while remaining imperceptible. The proposed HoneyImage method could provide data owners with a practical mechanism to protect ownership over valuable image datasets, encouraging safe sharing and unlocking the full transformative potential of data-driven AI.

[101] Phase-fraction guided denoising diffusion model for augmenting multiphase steel microstructure segmentation via micrograph image-mask pair synthesis

Hoang Hai Nam Nguyen,Minh Tien Tran,Hoheok Kim,Ho Won Lee

Main category: cs.CV

TL;DR: This paper proposes PF-DiffSeg, a framework for improving the accuracy of microstructure segmentation in metal alloys.

Details Motivation: The effectiveness of machine learning in metallographic microstructure segmentation is often limited by the lack of human-annotated phase masks, particularly for rare or compositionally complex morphologies within the alloy. Method: Introduces a phase-fraction controlled, one-stage denoising diffusion framework that jointly synthesizes microstructure images and their corresponding segmentation masks. Result: Evaluated on the MetalDAM benchmark, PF-DiffSeg yields notable improvements in segmentation accuracy over standard augmentation strategies, especially on minority classes. Conclusion: PF-DiffSeg offers a scalable data augmentation solution for metallographic applications while reducing inference time compared with the conventional approach. Abstract: The effectiveness of machine learning in metallographic microstructure segmentation is often constrained by the lack of human-annotated phase masks, particularly for rare or compositionally complex morphologies within the metal alloy. We introduce PF-DiffSeg, a phase-fraction controlled, one-stage denoising diffusion framework that jointly synthesizes microstructure images and their corresponding segmentation masks in a single generative trajectory to further improve segmentation accuracy. By conditioning on global phase-fraction vectors, augmented to represent real data distribution and emphasize minority classes, our model generates compositionally valid and structurally coherent microstructure image and mask samples that improve both data diversity and training efficiency. Evaluated on the MetalDAM benchmark for additively manufactured multiphase steel, our synthetic augmentation method yields notable improvements in segmentation accuracy compared to standard augmentation strategies, especially in minority classes, and further outperforms two-stage mask-guided diffusion and generative adversarial network (GAN) baselines, while also reducing inference time compared to the conventional approach. The method integrates generation and conditioning into a unified framework, offering a scalable solution for data augmentation in metallographic applications.

[102] Benefits of Feature Extraction and Temporal Sequence Analysis for Video Frame Prediction: An Evaluation of Hybrid Deep Learning Models

Jose M. Sánchez Velázquez,Mingbo Cai,Andrew Coney,Álvaro J. García-Tejedor,Alberto Nogales

Main category: cs.CV

TL;DR: This paper explores hybrid deep learning models for video frame prediction, showing that combining autoencoders with 3DCNNs and ConvLSTMs improves performance, particularly for grayscale real-world videos.

Details Motivation: Video frame prediction is crucial for applications like weather forecasting and autonomous systems, and hybrid deep learning models offer potential improvements. Method: Evaluated hybrid deep learning approaches combining autoencoders with RNNs, 3D CNNs, and related architectures on three distinct datasets. Result: SSIM metrics improved from 0.69 to 0.82, with hybrid models using 3DCNNs and ConvLSTMs performing best, especially for grayscale real-world videos. Conclusion: Hybrid deep learning models combining autoencoders with 3DCNNs and ConvLSTMs are highly effective for video frame prediction, particularly for grayscale real-world videos. Abstract: In recent years, advances in Artificial Intelligence have significantly impacted computer science, particularly in the field of computer vision, enabling solutions to complex problems such as video frame prediction. Video frame prediction has critical applications in weather forecasting or autonomous systems and can provide technical improvements, such as video compression and streaming. Among Artificial Intelligence methods, Deep Learning has emerged as highly effective for solving vision-related tasks, although current frame prediction models still have room for enhancement. This paper evaluates several hybrid deep learning approaches that combine the feature extraction capabilities of autoencoders with temporal sequence modelling using Recurrent Neural Networks (RNNs), 3D Convolutional Neural Networks (3D CNNs), and related architectures. The proposed solutions were rigorously evaluated on three datasets that differ in terms of synthetic versus real-world scenarios and grayscale versus color imagery. Results demonstrate that the approaches perform well, with SSIM metrics increasing from 0.69 to 0.82, indicating that hybrid models utilizing 3DCNNs and ConvLSTMs are the most effective, and greyscale videos with real data are the easiest to predict.
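
The abstract reports quality as SSIM (structural similarity), which compares the means, variances, and covariance of two images. As a rough illustration, the sketch below evaluates the standard SSIM formula once over a whole flattened frame. This single-window simplification is an assumption made to keep the example dependency-free; the usual definition averages the same expression over local sliding windows, and the paper's exact SSIM configuration is not stated in the abstract.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 255.0) -> float:
    """Single-window SSIM between two flattened grayscale frames.

    Uses the standard constants C1 = (0.01 * L)^2 and C2 = (0.03 * L)^2,
    where L is the pixel value range.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("frames must be non-empty and the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


frame = [0.0, 50.0, 100.0, 150.0]
same_score = global_ssim(frame, frame)             # identical frames
flipped_score = global_ssim(frame, frame[::-1])    # same pixels, reversed structure
```

Identical frames score 1, while the reversed frame keeps the same mean and variance but has negative covariance and therefore a much lower score, which is why SSIM is sensitive to structure and not only to average brightness.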

[103] TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras

Mohammad Mohammadi,Ziyi Wu,Igor Gilitschenski

Main category: cs.CV

TL;DR: TESPEC is a self-supervised pre-training framework tailored for learning spatio-temporal information from long event sequences, particularly suitable for recurrent models, achieving state-of-the-art results in various downstream tasks.

Details Motivation: Current self-supervised learning methods for event-based pre-training largely ignore the temporal information of events, and recurrent models can benefit from leveraging long-term temporal information. Method: TESPEC employs the masked image modeling paradigm with a new reconstruction target based on accumulating events into pseudo grayscale videos containing high-level semantic information. Result: Extensive experiments demonstrate state-of-the-art results in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Conclusion: TESPEC is a self-supervised pre-training framework that is well-suited for recurrent models and leverages long event sequences during pre-training, leading to state-of-the-art results in downstream tasks. Abstract: Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pre-training largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pre-training framework tailored for learning spatio-temporal information. TESPEC is well-suited for recurrent models, as it is the first framework to leverage long event sequences during pre-training. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high-level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. 
Reconstructing this target thus requires the model to reason about long-term history of events. Extensive experiments demonstrate our state-of-the-art results in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Project webpage: https://mhdmohammadi.github.io/TESPEC_webpage.

[104] Latent Diffusion Based Face Enhancement under Degraded Conditions for Forensic Face Recognition

Hassan Ugail,Hamad Mansour Alawar,AbdulNasser Abbas Zehi,Ahmed Mohammad Alkendi,Ismail Lujain Jaleel

Main category: cs.CV

TL;DR: This paper demonstrates that using latent diffusion-based enhancement with the Flux.1 Kontext Dev pipeline and Facezoom LoRA significantly improves face recognition accuracy on low-quality forensic images, increasing recognition rates from 29.1% to 84.5%.

Details Motivation: Face recognition systems perform poorly on low-quality forensic imagery, necessitating advanced enhancement techniques to improve recognition accuracy for forensic applications. Method: The researchers used the LFW dataset comprising 3,000 individuals with 24,000 recognition attempts. They applied the Flux.1 Kontext Dev diffusion pipeline with Facezoom LoRA adaptation to enhance images degraded by seven categories, including compression artefacts, blur effects, and noise contamination, and evaluated the improvement in recognition accuracy. Result: The approach improved overall recognition accuracy from 29.1% to 84.5%, a substantial 55.4 percentage point increase (95% CI: [54.1, 56.7]). Statistically significant improvements were observed across all degradation types, with effect sizes indicating practical significance. Conclusion: The study concludes that sophisticated diffusion-based enhancement techniques, particularly the Flux.1 Kontext Dev pipeline with Facezoom LoRA adaptation, hold significant potential for improving forensic face recognition performance when dealing with low-quality imagery. Abstract: Face recognition systems experience severe performance degradation when processing low-quality forensic evidence imagery. This paper presents an evaluation of latent diffusion-based enhancement for improving face recognition under forensically relevant degradations. Using a dataset of 3,000 individuals from LFW with 24,000 recognition attempts, we implement the Flux.1 Kontext Dev pipeline with Facezoom LoRA adaptation to test against seven degradation categories, including compression artefacts, blur effects, and noise contamination. Our approach demonstrates substantial improvements, increasing overall recognition accuracy from 29.1% to 84.5% (55.4 percentage point improvement, 95% CI: [54.1, 56.7]). 
Statistical analysis reveals significant performance gains across all degradation types, with effect sizes exceeding conventional thresholds for practical significance. These findings establish the potential of sophisticated diffusion-based enhancement in forensic face recognition applications.
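
As a rough illustration of how a percentage-point gain and its confidence interval relate, the sketch below computes an unpaired Wald interval for the difference between the two reported accuracies. The function name `wald_diff_ci` and the choice of an unpaired normal approximation are assumptions made here; the abstract does not say how its interval was derived, and a paired or bootstrap procedure would generally give a different width, so this is not expected to reproduce the reported bounds exactly.

```python
import math

def wald_diff_ci(p_before, p_after, n, z=1.96):
    """Unpaired Wald interval for p_after - p_before, each measured on n attempts.

    Assumption: the two accuracies are treated as independent proportions.
    """
    diff = p_after - p_before
    se = math.sqrt(p_before * (1 - p_before) / n + p_after * (1 - p_after) / n)
    return diff, diff - z * se, diff + z * se

# Accuracies and attempt count taken from the abstract.
diff, lo, hi = wald_diff_ci(0.291, 0.845, 24_000)
print(f"gain = {100 * diff:.1f} pp, approx. 95% CI = [{100 * lo:.1f}, {100 * hi:.1f}]")
```

With 24,000 attempts the standard error is small, which is why even a simple approximation yields an interval only a point or two wide.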

[105] Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment

Yifan Wang,Hongfeng Ai,Quangao Liu,Maowei Jiang,Ruiyuan Kang,Ruiqi Li,Jiahua Dong,Mengting Xiao,Cheng Jiang,Chenzhong Li

Main category: cs.CV

TL;DR: This paper proposes a new cross-layer regional alignment method (CCRA) that addresses the coordination of attention mechanisms in vision-language models; by introducing a new attention mechanism and an integration strategy, it achieves state-of-the-art performance on multiple benchmarks.

Details Motivation: Vision-language models (VLMs) struggle to coordinate the diverse attention mechanisms used for cross-modal embedding learning, which leads to mismatched attention and suboptimal performance. Method: The paper proposes Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations and coordinates the different attention mechanisms through a Progressive Attention Integration (PAI) strategy. Result: Experiments show that the CCRA-enhanced LLaVA-v1.5-7B model achieves state-of-the-art performance on ten diverse vision-language benchmarks while adding only 3.55M parameters. Conclusion: By introducing LPWCA and PAI, CCRA effectively addresses attention coordination in VLMs, improving both model performance and interpretability. Abstract: Vision Language Models (VLMs) face challenges in effectively coordinating diverse attention mechanisms for cross-modal embedding learning, leading to mismatched attention and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations by jointly weighting patch and layer-wise embedding, and Progressive Attention Integration (PAI) that systematically coordinates LPWCA, layer-wise, and patch-wise attention mechanisms in sequence. This progressive design ensures consistency from semantic to regional levels while preventing attention drift and maximizing individual attention benefits. Experimental results on ten diverse vision-language benchmarks demonstrate that our CCRA-enhanced LLaVA-v1.5-7B model achieves state-of-the-art performance, outperforming all baseline methods with only 3.55M additional parameters, while providing enhanced interpretability through more regionally focused and semantically aligned attention patterns.

[106] ThermoCycleNet: Stereo-based Thermogram Labeling for Model Transition to Cycling

Daniel Andrés López,Vincent Weber,Severin Zentgraf,Barlo Hillen,Perikles Simon,Elmar Schömer

Main category: cs.CV

TL;DR: This paper studies the transfer of an automatic annotation method from treadmill running to ergometer cycling and finds that combining automatic labels with a small amount of manually annotated data effectively improves deep neural network performance.

Details Motivation: To transfer the automatic annotation method developed for treadmill running to ergometer cycling. Method: A semantic segmentation network is trained on automatic labels and fine-tuned using different data set combinations. Result: Fine-tuning with a small amount of manual data is enough to improve overall performance. Conclusion: Combining automatic labels with a small amount of manually annotated data accelerates the adaptation of deep neural networks to new use cases. Abstract: Infrared thermography is emerging as a powerful tool in sports medicine, allowing assessment of thermal radiation during exercise and analysis of anatomical regions of interest, such as the well-exposed calves. Building on our previous advanced automatic annotation method, we aimed to transfer the stereo- and multimodal-based labeling approach from treadmill running to ergometer cycling. Therefore, the training of the semantic segmentation network with automatic labels and fine-tuning on high-quality manually annotated images has been examined and compared in different data set combinations. The results indicate that fine-tuning with a small fraction of manual data is sufficient to improve the overall performance of the deep neural network. Finally, combining automatically generated labels with small manually annotated data sets accelerates the adaptation of deep neural networks to new use cases, such as the transition from treadmill to bicycle.

[107] ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation

Cihang Peng,Qiming Hou,Zhong Ren,Kun Zhou

Main category: cs.CV

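TL;DR: This paper presents ROVI, a high-quality synthetic dataset built with a novel re-captioning approach, to improve instance-grounded text-to-image generation.

Details Motivation: To overcome the limited image quality and category diversity of existing datasets, ROVI aims to provide a richer, higher-quality dataset for instance-grounded text-to-image generation. Method: The paper proposes a strategy called re-captioning: at the pre-detection stage, a vision-language model generates comprehensive visual descriptions, from which a large language model extracts a list of potential categories for open-vocabulary detectors to detect. Result: ROVI exceeds existing detection datasets in image quality and resolution and contains two orders of magnitude more categories, with an open-vocabulary nature. A GLIGEN model trained on ROVI significantly outperforms state-of-the-art alternatives on several evaluation metrics. Conclusion: ROVI is a high-quality synthetic dataset for instance-grounded text-to-image generation. Its re-captioning pipeline yields better image quality and resolution than existing detection datasets, and a model trained on it significantly outperforms existing alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. Abstract: We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a text-to-image model GLIGEN trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. Our dataset and reproducible pipeline are available at https://github.com/CihangPeng/ROVI.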

[108] AutoSIGHT: Automatic Eye Tracking-based System for Immediate Grading of Human experTise

Byron Dowling,Jozef Probcin,Adam Czajka

Main category: cs.CV

TL;DR: AutoSIGHT uses eye tracking data to automatically assess human expertise in visual tasks and performs well across different evaluation windows.

Details Motivation: To investigate whether machines can be taught to automatically assess human expertise in visual tasks from eye tracking features. Method: The paper proposes AutoSIGHT, an automatic assessment system based on eye tracking data that classifies expert and non-expert performers. Result: With an evaluation window of just 5 seconds, AutoSIGHT achieves an average area under the ROC curve of 0.751 in a subject-disjoint train-test regime; when the window grows to 30 seconds, the area under the ROC curve rises to 0.8306. Conclusion: AutoSIGHT is a viable way to automatically assess human expertise in visual tasks and can be applied in human-AI pairing setups that must react dynamically to a nonstationary distribution of expertise between the human and the AI. Abstract: Can we teach machines to assess the expertise of humans solving visual tasks automatically based on eye tracking features? This paper proposes AutoSIGHT, Automatic System for Immediate Grading of Human experTise, that classifies expert and non-expert performers, and builds upon an ensemble of features extracted from eye tracking data while the performers were solving a visual task. Results on the task of iris Presentation Attack Detection (PAD) used for this study show that with a small evaluation window of just 5 seconds, AutoSIGHT achieves an average Area Under the ROC curve performance of 0.751 in a subject-disjoint train-test regime, indicating that such detection is viable. Furthermore, when a larger evaluation window of up to 30 seconds is available, the Area Under the ROC curve (AUROC) increases to 0.8306, indicating the model is effectively leveraging more information at a cost of slightly delayed decisions. This work opens new areas of research on how to incorporate the automatic weighing of human and machine expertise into human-AI pairing setups, which need to react dynamically to nonstationary expertise distribution between the human and AI players (e.g. when the experts need to be replaced, or the task at hand changes rapidly). Along with this paper, we offer the eye tracking data used in this study collected from 6 experts and 53 non-experts solving the iris PAD visual task.
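
For readers unfamiliar with the AUROC figures quoted above, here is a minimal sketch of the metric using its pairwise-ranking definition. The function name, the toy scores, and the labels (1 = expert, 0 = non-expert) are illustrative assumptions and are not taken from the AutoSIGHT data.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for three experts and three non-experts.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auroc(scores, labels))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts the reported 0.751 and 0.8306 in context.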

[109] 3D Reconstruction via Incremental Structure From Motion

Muhammad Zeeshan,Umer Zaki,Syed Ahmed Pasha,Zaar Khizar

Main category: cs.CV

TL;DR: This paper presents an incremental Structure from Motion (SfM) approach for accurate 3D reconstruction from unstructured image collections and demonstrates its effectiveness on sparse or partially overlapping data using a real dataset.

Details Motivation: Global SfM techniques rely on full image connectivity and are sensitive to noise or missing data, so a more flexible approach is needed for sparse or partially overlapping datasets. Method: An incremental SfM pipeline progressively incorporates new views into the reconstruction and iteratively refines the geometric estimates through bundle adjustment. Result: Experiments show that the approach achieves reliable sparse 3D reconstruction in visually structured environments, with reconstruction quality assessed through reprojection error and camera trajectory coherence. Conclusion: Incremental SfM is a practical and reliable method, particularly suited to sparse or partially overlapping image datasets. Abstract: Accurate 3D reconstruction from unstructured image collections is a key requirement in applications such as robotics, mapping, and scene understanding. While global Structure from Motion (SfM) techniques rely on full image connectivity and can be sensitive to noise or missing data, incremental SfM offers a more flexible alternative. By progressively incorporating new views into the reconstruction, it enables the system to recover scene structure and camera motion even in sparse or partially overlapping datasets. In this paper, we present a detailed implementation of the incremental SfM pipeline, focusing on the consistency of geometric estimation and the effect of iterative refinement through bundle adjustment. We demonstrate the approach using a real dataset and assess reconstruction quality through reprojection error and camera trajectory coherence. The results support the practical utility of incremental SfM as a reliable method for sparse 3D reconstruction in visually structured environments.
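
Reprojection error, the quality measure named above and the quantity bundle adjustment minimizes, can be sketched as follows. This is a generic pinhole-camera illustration under stated assumptions (no skew, no lens distortion); the function names and the toy camera are hypothetical and do not come from the paper's implementation.

```python
import math

def project(K, R, t, X):
    """Project 3D point X to pixels with intrinsics K and pose (R, t).

    Assumes a pinhole camera without skew or lens distortion.
    """
    # Camera-frame coordinates: Xc = R @ X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy

def mean_reprojection_error(K, R, t, points3d, observations):
    """Mean Euclidean pixel distance between projected and observed points."""
    errors = [math.dist(project(K, R, t, X), uv)
              for X, uv in zip(points3d, observations)]
    return sum(errors) / len(errors)

# Toy setup: identity pose, 500 px focal length, principal point (320, 240).
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
points = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0)]
observations = [(320.0, 240.0), (423.0, 244.0)]
print(mean_reprojection_error(K, R, t, points, observations))
```

In an incremental pipeline this error would typically be re-evaluated after each new view is registered, and bundle adjustment would adjust poses and points to reduce it.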

[110] Structured Spectral Graph Learning for Anomaly Classification in 3D Chest CT Scans

Theo Di Piazza,Carole Lazarus,Olivier Nempont,Loic Boussel

Main category: cs.CV

TL;DR: This paper proposes a graph-based method for multi-label anomaly classification in 3D CT scans, offering competitive performance, robustness, and generalization without the need for large-scale pre-training.

Details Motivation: The motivation stems from the increasing workload of radiologists due to the growing number of CT scans and the limitations of existing methods, such as 3D convolutional networks' inability to model long-range dependencies and Vision Transformers' high computational costs and need for large-scale pre-training. Method: The authors model CT scans as structured graphs using axial slice triplet nodes processed through spectral domain convolution, aiming to improve multi-label anomaly classification performance without relying on extensive pre-training or computationally expensive Vision Transformers. Result: The proposed method achieves competitive performance in multi-label classification of 3D CT scans, exhibits robustness to z-axis translation, and shows strong cross-dataset generalization, as demonstrated by an ablation study evaluating the contribution of each component. Conclusion: The proposed graph-based approach for multi-label anomaly classification in 3D CT scans demonstrates strong cross-dataset generalization, competitive performance, and robustness to z-axis translation. Abstract: With the increasing number of CT scan examinations, there is a need for automated methods such as organ segmentation, anomaly detection and report generation to assist radiologists in managing their increasing workload. Multi-label classification of 3D CT scans remains a critical yet challenging task due to the complex spatial relationships within volumetric data and the variety of observed anomalies. Existing approaches based on 3D convolutional networks have limited abilities to model long-range dependencies while Vision Transformers suffer from high computational costs and often require extensive pre-training on large-scale datasets from the same domain to achieve competitive performance. 
In this work, we propose an alternative by introducing a new graph-based approach that models CT scans as structured graphs, leveraging axial slice triplet nodes processed through spectral domain convolution to enhance multi-label anomaly classification performance. Our method exhibits strong cross-dataset generalization and competitive performance while achieving robustness to z-axis translation. An ablation study evaluates the contribution of each proposed component.

[111] Evading Data Provenance in Deep Neural Networks

Hongyu Zhu,Sichu Liang,Wenwen Wang,Zhuomeng Zhang,Fangqi Li,Shi-Lin Wang

Main category: cs.CV

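TL;DR: This paper proposes a new unified evasion framework that uses a teacher-student setup and an OOD dataset to bypass Dataset Ownership Verification (DOV) methods; experiments show that it outperforms existing evasion attacks and reveal key vulnerabilities in current DOV methods.

Details Motivation: Modern over-parameterized deep models are highly data-dependent, yet many datasets are proprietary or contain sensitive information, which makes unrestricted model training problematic. Dataset Ownership Verification (DOV) has been proposed to protect copyright, but existing studies evaluate it against oversimplified evasion attacks, leading to a false sense of security. The paper therefore aims to propose a more effective evasion method and expose the vulnerabilities of existing DOV methods. Method: The paper introduces a unified evasion framework in which a teacher model first learns from the copyright dataset and then, using an out-of-distribution (OOD) dataset as the intermediary, transfers task-relevant yet identifier-independent domain knowledge to a surrogate student model. Vision-language models and large language models are used to select the most informative and reliable subsets of the OOD gallery set for knowledge transfer. Result: Experiments show that the method eliminates all copyright identifiers while significantly outperforming nine state-of-the-art evasion attacks in generalization and evasion effectiveness, with moderate computational overhead. The study also reveals key vulnerabilities in current DOV methods, indicating that their practicality needs further improvement. Conclusion: Current DOV methods have key vulnerabilities, and long-term development is needed to enhance their practicality. The proposed unified evasion framework eliminates all copyright identifiers and significantly outperforms existing methods in generalization and evasion effectiveness. Abstract: Modern over-parameterized deep models are highly data-dependent, with large scale general-purpose and domain-specific datasets serving as the bedrock for rapid advancements. However, many datasets are proprietary or contain sensitive information, making unrestricted model training problematic. In the open world where data thefts cannot be fully prevented, Dataset Ownership Verification (DOV) has emerged as a promising method to protect copyright by detecting unauthorized model training and tracing illicit activities. Due to its diversity and superior stealth, evading DOV is considered extremely challenging. However, this paper identifies that previous studies have relied on oversimplistic evasion attacks for evaluation, leading to a false sense of security. We introduce a unified evasion framework, in which a teacher model first learns from the copyright dataset and then transfers task-relevant yet identifier-independent domain knowledge to a surrogate student using an out-of-distribution (OOD) dataset as the intermediary. Leveraging Vision-Language Models and Large Language Models, we curate the most informative and reliable subsets from the OOD gallery set as the final transfer set, and propose selectively transferring task-oriented knowledge to achieve a better trade-off between generalization and evasion effectiveness.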
Experiments across diverse datasets covering eleven DOV methods demonstrate our approach simultaneously eliminates all copyright identifiers and significantly outperforms nine state-of-the-art evasion attacks in both generalization and effectiveness, with moderate computational overhead. As a proof of concept, we reveal key vulnerabilities in current DOV methods, highlighting the need for long-term development to enhance practicality.
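
The teacher-to-student transfer described above builds on knowledge distillation. The sketch below shows the standard temperature-scaled distillation loss for a single transfer sample; it is a generic illustration, not the paper's selective transfer objective, and the function names and toy logits are assumptions introduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic distillation loss; the paper's selective transfer is not modeled.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits for one OOD transfer image.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
print(distillation_loss(teacher, student))
```

Because the student only ever sees teacher outputs on transfer images, it never trains on the protected samples directly, which is the property the framework above relies on.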

[112] DreamSat-2.0: Towards a General Single-View Asteroid 3D Reconstruction

Santiago Diaz,Xinghui Hu,Josiane Uwumukiza,Giovanni Lavezzi,Victor Rodriguez-Fernandez,Richard Linares

Main category: cs.CV

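TL;DR: DreamSat-2.0 systematically analyzes three 3D reconstruction models on spacecraft and asteroid datasets and finds that Hunyuan-3D has clear advantages in image quality and geometric reconstruction.

Details Motivation: To enhance asteroid exploration and autonomous spacecraft navigation, the team developed DreamSat-2.0 to evaluate existing 3D reconstruction models. Method: DreamSat-2.0 introduces a benchmarking pipeline that systematically analyzes three state-of-the-art 3D reconstruction models (Hunyuan-3D, Trellis-3D, and Ouroboros-3D) on custom spacecraft and asteroid datasets, using 2D perceptual (image quality) and 3D geometric (shape accuracy) metrics. Result: Model performance is domain-dependent: the models produce higher-quality images of complex spacecraft but achieve better geometric reconstructions for the simpler forms of asteroids. Hunyuan-3D achieves top perceptual scores on spacecraft but its best geometric accuracy on asteroids, a significant advance over the authors' prior work. Conclusion: DreamSat-2.0 provides a valuable benchmark for asteroid exploration and autonomous spacecraft navigation, underscores the domain dependence of model performance, and shows Hunyuan-3D's marked improvements in perceptual quality and geometric reconstruction. Abstract: To enhance asteroid exploration and autonomous spacecraft navigation, we introduce DreamSat-2.0, a pipeline that benchmarks three state-of-the-art 3D reconstruction models (Hunyuan-3D, Trellis-3D, and Ouroboros-3D) on custom spacecraft and asteroid datasets. Our systematic analysis, using 2D perceptual (image quality) and 3D geometric (shape accuracy) metrics, reveals that model performance is domain-dependent. While models produce higher-quality images of complex spacecraft, they achieve better geometric reconstructions for the simpler forms of asteroids. New benchmarks are established, with Hunyuan-3D achieving top perceptual scores on spacecraft but its best geometric accuracy on asteroids, marking a significant advance over our prior work.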

[113] COSTARR: Consolidated Open Set Technique with Attenuation for Robust Recognition

Ryan Rabinowitz,Steve Cruz,Walter Scheirer,Terrance E. Boult

Main category: cs.CV

TL;DR: COSTARR enhances open-set recognition by leveraging overlooked attenuation information from both pre- and post-attenuated features, outperforming existing methods across diverse architectures.

Details Motivation: Handling novelty in visual recognition systems remains challenging. Existing OSR methods rely on detecting novelty through the absence of familiar features, but overlook the potential of utilizing attenuation information learned during training. Method: COSTARR combines both the requirement of familiar features and the lack of unfamiliar ones, leveraging pre- and post-attenuated features for improved OSR. A probabilistic interpretation of the COSTARR score is provided, and ablation studies assess feature contributions. Result: COSTARR demonstrates significant improvements in open-set recognition performance across multiple modern pre-trained architectures (ViTs, ConvNeXts, ResNet) and large-scale datasets (ImageNet2012-1K as knowns and NINCO, iNaturalist, OpenImage-O as unknowns). Conclusion: COSTARR improves open-set recognition by utilizing previously discarded attenuation information, showing effective generalization across various architectures and significantly outperforming prior methods. Abstract: Handling novelty remains a key challenge in visual recognition systems. Existing open-set recognition (OSR) methods rely on the familiarity hypothesis, detecting novelty by the absence of familiar features. We propose a novel attenuation hypothesis: small weights learned during training attenuate features and serve a dual role: differentiating known classes while discarding information useful for distinguishing known from unknown classes. To leverage this overlooked information, we present COSTARR, a novel approach that combines both the requirement of familiar features and the lack of unfamiliar ones. We provide a probabilistic interpretation of the COSTARR score, linking it to the likelihood of correct classification and belonging in a known class.
To determine the individual contributions of the pre- and post-attenuated features to COSTARR's performance, we conduct ablation studies that show both pre-attenuated deep features and the underutilized post-attenuated Hadamard product features are essential for improving OSR. Also, we evaluate COSTARR in a large-scale setting using ImageNet2012-1K as known data and NINCO, iNaturalist, OpenImage-O, and other datasets as unknowns, across multiple modern pre-trained architectures (ViTs, ConvNeXts, and ResNet). The experiments demonstrate that COSTARR generalizes effectively across various architectures and significantly outperforms prior state-of-the-art methods by incorporating previously discarded attenuation information, advancing open-set recognition capabilities.

[114] AURA: A Hybrid Spatiotemporal-Chromatic Framework for Robust, Real-Time Detection of Industrial Smoke Emissions

Mikhail Bychkov,Matey Yordanov,Andrei Kuchma

Main category: cs.CV

TL;DR: AURA is a hybrid framework that effectively detects and classifies industrial smoke emissions in real-time by analyzing their movement and color, leading to improved environmental and safety outcomes.

Details Motivation: Current industrial smoke monitoring systems lack specificity in distinguishing smoke types and struggle with environmental variability, which AURA aims to address. Method: The framework combines spatiotemporal and chromatic analysis to capture both the dynamic movement patterns and distinct color characteristics of industrial smoke. Result: AURA provides enhanced accuracy, reduces false positives, and enables precise, automated real-time monitoring of industrial emissions. Conclusion: AURA is able to improve the accuracy of identifying and categorizing industrial smoke emissions, which contributes to better environmental compliance, operational safety, and public health. Abstract: This paper introduces AURA, a novel hybrid spatiotemporal-chromatic framework designed for robust, real-time detection and classification of industrial smoke emissions. The framework addresses critical limitations of current monitoring systems, which often lack the specificity to distinguish smoke types and struggle with environmental variability. AURA leverages both the dynamic movement patterns and the distinct color characteristics of industrial smoke to provide enhanced accuracy and reduced false positives. This framework aims to significantly improve environmental compliance, operational safety, and public health outcomes by enabling precise, automated monitoring of industrial emissions.

[115] Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting

Yuekun Dai,Haitian Li,Shangchen Zhou,Chen Change Loy

Main category: cs.CV

TL;DR: This paper proposes Trans-Adapter, a plug-and-play adapter for transparent image inpainting that addresses conventional methods' difficulty in keeping transparent regions consistent.

Details Motivation: With their additional alpha channel, RGBA images are important for applications that need blending, masking, or transparency effects. However, existing image inpainting methods handle only RGB images. Conventional transparent image inpainting typically uses a two-stage process that struggles to preserve transparency consistency in edited regions and can produce jagged edges along transparency boundaries. Method: The paper proposes Trans-Adapter, a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly. It also introduces LayerBench and a novel non-reference alpha edge quality evaluation metric. Result: Extensive experiments on LayerBench demonstrate the effectiveness of Trans-Adapter. Conclusion: Trans-Adapter integrates seamlessly into various community models, supports controllable editing via ControlNet, and offers an effective solution for transparent image inpainting. Abstract: RGBA images, with the additional alpha channel, are crucial for any application that needs blending, masking, or transparency effects, making them more versatile than standard RGB images. Nevertheless, existing image inpainting methods are designed exclusively for RGB images. Conventional approaches to transparent image inpainting typically involve placing a background underneath RGBA images and employing a two-stage process: image inpainting followed by image matting. This pipeline, however, struggles to preserve transparency consistency in edited regions, and matting can introduce jagged edges along transparency boundaries. To address these challenges, we propose Trans-Adapter, a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly. Trans-Adapter also supports controllable editing via ControlNet and can be seamlessly integrated into various community models. To evaluate our method, we introduce LayerBench, along with a novel non-reference alpha edge quality evaluation metric for assessing transparency edge quality. We conduct extensive experiments on LayerBench to demonstrate the effectiveness of our approach.
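
The conventional first step mentioned above, placing a background underneath an RGBA image, is the standard "over" compositing operation. The sketch below applies it to a single pixel, assuming straight (non-premultiplied) alpha and float channels in [0, 1]; the function name and sample values are illustrative only.

```python
def composite_over(pixel_rgba, background_rgb):
    """Composite one straight-alpha RGBA pixel over an opaque RGB background.

    All channels are floats in [0, 1]; the result is an opaque RGB pixel.
    """
    r, g, b, a = pixel_rgba
    return tuple(a * fg + (1.0 - a) * bg
                 for fg, bg in zip((r, g, b), background_rgb))

# A half-transparent red pixel over a blue background.
print(composite_over((1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 1.0)))
```

The output has no alpha channel, so transparency has to be re-estimated by matting after inpainting, which is the stage the abstract identifies as a source of jagged edges.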

[116] MASIV: Toward Material-Agnostic System Identification from Videos

Yizhou Zhao,Haoyu Chen,Chunjiang Liu,Zhenyang Li,Charles Herrmann,Junhwa Hur,Yinxiao Li,Ming-Hsuan Yang,Bhiksha Raj,Min Xu

Main category: cs.CV

TL;DR: MASIV is a new vision-based framework for material-agnostic system identification that addresses existing methods' limited ability to handle unknown materials.

Details Motivation: Existing methods rely on predefined material priors, which limits their ability to handle unknown materials, so a material-agnostic system identification method is needed. Method: MASIV uses learnable neural constitutive models and reconstructs continuum particle trajectories to provide dense geometric guidance that counters unstable optimization. Result: MASIV achieves state-of-the-art performance in geometric accuracy, rendering quality, and generalization ability. Conclusion: MASIV is a material-agnostic system identification framework that achieves state-of-the-art geometric accuracy, rendering quality, and generalization from visual input. Abstract: System identification from videos aims to recover object geometry and governing physical laws. Existing methods integrate differentiable rendering with simulation but rely on predefined material priors, limiting their ability to handle unknown ones. We introduce MASIV, the first vision-based framework for material-agnostic system identification. Unlike existing approaches that depend on hand-crafted constitutive laws, MASIV employs learnable neural constitutive models, inferring object dynamics without assuming a scene-specific material prior. However, the absence of full particle state information imposes unique challenges, leading to unstable optimization and physically implausible behaviors. To address this, we introduce dense geometric guidance by reconstructing continuum particle trajectories, providing temporally rich motion constraints beyond sparse visual cues. Comprehensive experiments show that MASIV achieves state-of-the-art performance in geometric accuracy, rendering quality, and generalization ability.

[117] The Promise of RL for Autoregressive Image Editing

Saba Ahmadi,Rabiul Awal,Ankur Sikarwar,Amirhossein Kazemnejad,Ge Ya Luo,Juan A. Rodriguez,Sai Rajeswar,Siva Reddy,Christopher Pal,Benno Krojer,Aishwarya Agrawal

Main category: cs.CV

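TL;DR: This paper presents EARL, an image editing model based on reinforcement learning and an autoregressive multimodal framework that performs well across a variety of editing tasks while requiring little training data.

Details Motivation: To improve performance on image editing tasks, the authors explore three strategies (supervised fine-tuning, reinforcement learning, and Chain-of-Thought reasoning) and seek to study them in one consistent framework. Method: The approach combines reinforcement learning with a large multimodal LLM verifier and uses an autoregressive multimodal model to process textual and visual tokens. Result: Reinforcement learning combined with a multimodal LLM verifier proves the most effective strategy, and the resulting EARL model performs well despite using less training data. Conclusion: EARL is an RL-based image editing model that performs well across diverse editing tasks with little training data, advancing autoregressive multimodal models for image editing. Abstract: We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.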

[118] UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

Chaitanya Patel,Hiroki Nakamura,Yuta Kyuragi,Kazuki Kozuka,Juan Carlos Niebles,Ehsan Adeli

Main category: cs.CV

TL;DR: UniEgoMotion is a new unified model for egocentric motion generation and forecasting that uses first-person images for scene-aware motion synthesis without relying on an explicit 3D scene.

Details Motivation: Egocentric human motion generation and forecasting is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions. However, existing methods focus mainly on third-person motion synthesis with structured 3D scene context, which limits their effectiveness in real-world egocentric settings. Method: UniEgoMotion is a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. Result: UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Conclusion: UniEgoMotion enables egocentric motion reconstruction and generation, sets a new benchmark for egocentric motion modeling, and opens up new possibilities for egocentric applications. Abstract: Egocentric human motion generation and forecasting with scene-context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on an explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion's simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image.
Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.

[119] Semi-Supervised Anomaly Detection in Brain MRI Using a Domain-Agnostic Deep Reinforcement Learning Approach

Zeduo Zhang,Yalda Mohsenzadeh

Main category: cs.CV

TL;DR: A domain-agnostic, semi-supervised anomaly detection framework using DRL was developed to address challenges like large-scale data, overfitting, and class imbalance, particularly applied to brain MRI volumes. It showed promising results on both medical and industrial datasets.

Details Motivation: The motivation is to address challenges such as large-scale data, overfitting, and class imbalance in anomaly detection, particularly focusing on brain MRI volumes. Method: The method integrates deep reinforcement learning (DRL) with feature representations to handle label scarcity, large-scale data, and overfitting. It was tested on brain MRI datasets (IXI and BraTS 2021) and industrial surface datasets (MVTec AD). Result: The proposed method achieved an AUROC of 88.7% (pixel-level) and 96.7% (image-level) on brain MRI datasets, and 99.8% pixel-level, 99.3% image-level AUROC on MVTec AD dataset, outperforming SOTA methods. Conclusion: The proposed domain-agnostic semi-supervised anomaly detection framework using DRL shows significant promise, achieving strong performance on both medical and industrial datasets, with potential for real-world clinical applications due to its robustness, generalizability, and efficiency. Abstract: To develop a domain-agnostic, semi-supervised anomaly detection framework that integrates deep reinforcement learning (DRL) to address challenges such as large-scale data, overfitting, and class imbalance, focusing on brain MRI volumes. This retrospective study used publicly available brain MRI datasets collected between 2005 and 2021. The IXI dataset provided 581 T1-weighted and 578 T2-weighted MRI volumes (from healthy subjects) for training, while the BraTS 2021 dataset provided 251 volumes for validation and 1000 for testing (unhealthy subjects with Glioblastomas). Preprocessing included normalization, skull-stripping, and co-registering to a uniform voxel size. Experiments were conducted on both T1- and T2-weighted modalities. Additional experiments and ablation analyses were also carried out on the industrial datasets. The proposed method integrates DRL with feature representations to handle label scarcity, large-scale data and overfitting. 
Statistical analysis was based on several detection and segmentation metrics including AUROC and Dice score. The proposed method achieved an AUROC of 88.7% (pixel-level) and 96.7% (image-level) on brain MRI datasets, outperforming state-of-the-art (SOTA) methods. On industrial surface datasets, the model also showed competitive performance (AUROC = 99.8% pixel-level, 99.3% image-level) on the MVTec AD dataset, indicating strong cross-domain generalization. Studies on anomaly sample size showed a monotonic increase in AUROC as more anomalies were seen, without evidence of overfitting or additional computational cost. The domain-agnostic semi-supervised approach using DRL shows significant promise for MRI anomaly detection, achieving strong performance on both medical and industrial datasets. Its robustness, generalizability and efficiency highlight its potential for real-world clinical applications.
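
The Dice score mentioned among the segmentation metrics can be sketched for binary masks as below. The flat-list mask representation, the function name, and the toy masks are assumptions for illustration; treating two empty masks as a perfect match is one common convention, not something the abstract specifies.

```python
def dice_score(pred, target):
    """Dice = 2 * |P and T| / (|P| + |T|) for flat binary masks.

    Convention assumed here: two empty masks score 1.0.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical predicted and reference anomaly masks (1 = anomalous voxel).
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
print(dice_score(pred, target))
```

Unlike AUROC, Dice depends on a binarization threshold, so the two metrics capture complementary aspects of detection quality.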

[120] Dataset Condensation with Color Compensation

Huyu Wu,Duo Su,Junjie Hou,Guang Li

Main category: cs.CV

TL;DR: This paper proposes DC3, a dataset condensation framework with color compensation that uses a diffusion model to enhance image color diversity, addressing the twin bottlenecks of inefficiency and semantic distortion.

Details Motivation: Existing methods are inefficient at image-level selection, while pixel-level optimization introduces semantic distortion, and color's role as an information carrier and semantic unit has been overlooked. Method: The paper proposes the DC3 framework, which combines a calibrated selection strategy with a latent diffusion model to enhance the color diversity of images rather than generating brand-new ones. Result: Experiments show that DC3 outperforms SOTA methods across multiple benchmarks, and the FID results demonstrate that training with its high-quality datasets is feasible. Conclusion: Through color compensation, DC3 effectively addresses semantic distortion and inefficiency in dataset condensation, shows superior performance and generalization, and is, to the authors' knowledge, the first work to fine-tune pre-trained diffusion models with condensed datasets. Abstract: Dataset condensation always faces a constitutive trade-off: balancing performance and fidelity under extreme compression. Existing methods struggle with two bottlenecks: image-level selection methods (Coreset Selection, Dataset Quantization) suffer from inefficient condensation, while pixel-level optimization (Dataset Distillation) introduces semantic distortion due to over-parameterization. With empirical observations, we find that a critical problem in dataset condensation is the oversight of color's dual role as an information carrier and a basic semantic representation unit. We argue that improving the colorfulness of condensed images is beneficial for representation learning. Motivated by this, we propose DC3: a Dataset Condensation framework with Color Compensation. After a calibrated selection strategy, DC3 utilizes the latent diffusion model to enhance the color diversity of an image rather than creating a brand-new one. Extensive experiments demonstrate the superior performance and generalization of DC3 that outperforms SOTA methods across multiple benchmarks. To the best of our knowledge, besides focusing on downstream tasks, DC3 is the first research to fine-tune pre-trained diffusion models with condensed datasets. The FID results prove that training networks with our high-quality datasets is feasible without model collapse or other degradation issues. Code and generated data will be released soon.

[121] OpenGS-Fusion: Open-Vocabulary Dense Mapping with Hybrid 3D Gaussian Splatting for Refined Object-Level Understanding

Dianyi Yang,Xihan Wang,Yu Gao,Shiyang Liu,Bohan Ren,Yufeng Yue,Yi Yang

Main category: cs.CV

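TL;DR: This paper introduces OpenGS-Fusion, a framework that combines a 3D Gaussian representation with a truncated signed distance field and a multimodal language-guided method to improve 3D scene understanding and interaction.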

Details Motivation: Existing methods are limited by rigid offline pipelines and cannot provide precise 3D object-level understanding for open-ended queries. Method: The framework combines a 3D Gaussian representation with a Truncated Signed Distance Field and introduces a multimodal language-guided method, MLLM-Assisted Adaptive Thresholding. Result: It achieves a 17\% improvement in 3D mIoU over a fixed-threshold strategy and demonstrates effectiveness in 3D object understanding, scene reconstruction quality, and language-guided scene interaction. Conclusion: OpenGS-Fusion is an effective open-vocabulary dense mapping framework that improves semantic modeling and refines object-level understanding. Abstract: Recent advancements in 3D scene understanding have made significant strides in enabling interaction with scenes using open-vocabulary queries, particularly for VR/AR and robotic applications. Nevertheless, existing methods are hindered by rigid offline pipelines and the inability to provide precise 3D object-level understanding given open-ended queries. In this paper, we present OpenGS-Fusion, an innovative open-vocabulary dense mapping framework that improves semantic modeling and refines object-level understanding. OpenGS-Fusion combines 3D Gaussian representation with a Truncated Signed Distance Field to facilitate lossless fusion of semantic features on-the-fly. Furthermore, we introduce a novel multimodal language-guided approach named MLLM-Assisted Adaptive Thresholding, which refines the segmentation of 3D objects by adaptively adjusting similarity thresholds, achieving a 17\% improvement in 3D mIoU compared to the fixed threshold strategy. Extensive experiments demonstrate that our method outperforms existing methods in 3D object understanding and scene reconstruction quality, as well as showcasing its effectiveness in language-guided scene interaction. The code is available at https://young-bit.github.io/opengs-fusion.github.io/ .
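To see why the choice of similarity threshold matters for segmentation quality, the sketch below thresholds per-point similarity scores and measures IoU against a ground-truth mask. It only illustrates the fixed-threshold baseline and the IoU metric; how the MLLM selects the adaptive threshold is not described in the abstract and is not modelled here. The function names `segment` and `iou` are hypothetical.

```python
def segment(similarities, threshold):
    """Mark a point as part of the queried object when its similarity reaches the threshold."""
    return [s >= threshold for s in similarities]


def iou(pred, target):
    """Intersection over union of two boolean masks of equal length."""
    inter = sum(p and t for p, t in zip(pred, target))
    union = sum(p or t for p, t in zip(pred, target))
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union
```

Averaging `iou` over object classes would give mIoU; on the same scores, two different thresholds can yield very different IoU values, which is the gap an adaptive threshold aims to close.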

[122] Personalized Safety Alignment for Text-to-Image Diffusion Models

Yu Lei,Jinbin Bai,Qingyu Shi,Aosong Feng,Kaidong Yu

Main category: cs.CV

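TL;DR: This study proposes Personalized Safety Alignment (PSA), a framework that integrates user-specific profiles to improve safety control in generative models; experiments show clear gains in aligning with user constraints and suppressing harmful content.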

Details Motivation: Current text-to-image diffusion models apply uniform standards in their safety mechanisms and often overlook the diverse safety boundaries shaped by factors such as age, mental health, and personal beliefs. Method: The authors propose the Personalized Safety Alignment (PSA) framework and introduce a new dataset, Sage; user profiles are integrated through a cross-attention mechanism. Result: Experiments show that PSA outperforms existing methods in suppressing harmful content and aligning with user constraints, achieving higher Win Rate and Pass Rate scores. Conclusion: By integrating personalized user profiles into the diffusion process, PSA gives generative models user-specific safety control and outperforms existing methods in harmful content suppression and alignment with user constraints. Abstract: Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model's behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences and incorporates these profiles through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content better with user constraints, achieving higher Win Rate and Pass Rate scores. Our code, data, and models are publicly available at https://torpedo2648.github.io/PSAlign/.

[123] LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation

Xinyu Yan,Meijun Sun,Ge-Peng Ji,Fahad Shahbaz Khan,Salman Khan,Deng-Ping Fan

Main category: cs.CV

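TL;DR: LawDIS combines language and window controls in a diffusion model to achieve efficient, controllable image segmentation with leading performance.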

Details Motivation: Current image segmentation methods lack flexible integration of user controls; LawDIS aims to provide a personalised, high-accuracy segmentation solution. Method: LawDIS recasts DIS as an image-conditioned mask generation task and introduces a language-controlled segmentation strategy (LS) and a window-controlled refinement strategy (WR), coordinated by a mode switcher. Result: On the DIS5K benchmark, LawDIS significantly outperforms 11 state-of-the-art methods across all metrics; compared with MVANet, it gains 4.6\% in $F_\beta^\omega$ with both the LS and WR strategies and 3.6\% with the LS strategy alone. Conclusion: LawDIS is a language-window-based controllable dichotomous image segmentation framework that generates high-quality object masks through macro-to-micro control modes and significantly outperforms existing methods on the DIS5K benchmark. Abstract: We present LawDIS, a language-window-based controllable dichotomous image segmentation (DIS) framework that produces high-quality object masks. Our framework recasts DIS as an image-conditioned mask generation task within a latent diffusion model, enabling seamless integration of user controls. LawDIS is enhanced with macro-to-micro control modes. Specifically, in macro mode, we introduce a language-controlled segmentation strategy (LS) to generate an initial mask based on user-provided language prompts. In micro mode, a window-controlled refinement strategy (WR) allows flexible refinement of user-defined regions (i.e., size-adjustable windows) within the initial mask. Coordinated by a mode switcher, these modes can operate independently or jointly, making the framework well-suited for high-accuracy, personalised applications. Extensive experiments on the DIS5K benchmark reveal that our LawDIS significantly outperforms 11 cutting-edge methods across all metrics. Notably, compared to the second-best model MVANet, we achieve $F_\beta^\omega$ gains of 4.6\% with both the LS and WR strategies and 3.6\% gains with only the LS strategy on DIS-TE. Code will be made available at https://github.com/XinyuYanTJU/LawDIS.

[124] TEACH: Text Encoding as Curriculum Hints for Scene Text Recognition

Xiahan Yang,Hui Zheng

Main category: cs.CV

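TL;DR: TEACH is a new training paradigm for scene text recognition that improves accuracy by feeding ground-truth text as auxiliary input and gradually reducing its influence, with no external pretraining and no inference overhead.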

Details Motivation: Scene text recognition (STR) remains challenging because of complex visual appearances and limited semantic priors. Method: The authors propose TEACH, a training paradigm that feeds ground-truth text to the model as auxiliary input, by encoding target labels into the embedding space and applying loss-aware masking, and progressively reduces its influence during training. Result: Experiments show that models trained with TEACH achieve consistently improved accuracy across multiple public benchmarks, especially under challenging conditions. Conclusion: TEACH simulates a curriculum learning process that moves from label-dependent learning to fully visual recognition; it is model-agnostic and integrates seamlessly into existing encoder-decoder frameworks. Abstract: Scene Text Recognition (STR) remains a challenging task due to complex visual appearances and limited semantic priors. We propose TEACH, a novel training paradigm that injects ground-truth text into the model as auxiliary input and progressively reduces its influence during training. By encoding target labels into the embedding space and applying loss-aware masking, TEACH simulates a curriculum learning process that guides the model from label-dependent learning to fully visual recognition. Unlike language model-based approaches, TEACH requires no external pretraining and introduces no inference overhead. It is model-agnostic and can be seamlessly integrated into existing encoder-decoder frameworks. Extensive experiments across multiple public benchmarks show that models trained with TEACH achieve consistently improved accuracy, especially under challenging conditions, validating its robustness and general applicability.
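The idea of revealing ground-truth text early in training and withdrawing it as the model improves can be sketched as follows. The abstract does not give the masking rule, so everything here is an assumption: the `MASK` token, the rule that ties the reveal probability to the ratio of current to initial loss, and both function names are hypothetical stand-ins for the paper's loss-aware masking.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def hint_keep_prob(current_loss, initial_loss):
    """Fraction of ground-truth characters to reveal; it shrinks as the loss falls."""
    if initial_loss <= 0:
        return 0.0
    return max(0.0, min(1.0, current_loss / initial_loss))


def mask_hints(label, keep_prob, rng):
    """Keep each label character with probability keep_prob, otherwise replace it with MASK."""
    return [ch if rng.random() < keep_prob else MASK for ch in label]
```

With `keep_prob` at 0 every hint is masked, which mirrors the inference-time setting where the model must rely on the image alone.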

[125] DELTAv2: Accelerating Dense 3D Tracking

Tuan Duc Ngo,Ashkan Mirzaei,Guocheng Qian,Hanwen Liang,Chuang Gan,Evangelos Kalogerakis,Peter Wonka,Chaoyang Wang

Main category: cs.CV

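TL;DR: The paper proposes a new algorithm for accelerating dense long-term 3D point tracking in videos; a coarse-to-fine strategy and a learnable interpolation module yield a 5-100x speedup over existing methods while maintaining state-of-the-art tracking accuracy.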

Details Motivation: Existing state-of-the-art methods are computationally expensive when handling many trajectories, particularly in transformer-based iterative tracking and correlation feature computation. Method: A coarse-to-fine strategy starts tracking from a small set of points and progressively expands the set of trajectories; a learnable interpolation module initializes the newly added trajectories, and the cost of correlation feature computation is reduced. Result: The method achieves a 5-100x speedup while maintaining state-of-the-art tracking accuracy. Conclusion: The method markedly improves the efficiency of dense long-term 3D point tracking and opens the door to real-time use in future video analysis. Abstract: We propose a novel algorithm for accelerating dense long-term 3D point tracking in videos. Through analysis of existing state-of-the-art methods, we identify two major computational bottlenecks. First, transformer-based iterative tracking becomes expensive when handling a large number of trajectories. To address this, we introduce a coarse-to-fine strategy that begins tracking with a small subset of points and progressively expands the set of tracked trajectories. The newly added trajectories are initialized using a learnable interpolation module, which is trained end-to-end alongside the tracking network. Second, we propose an optimization that significantly reduces the cost of correlation feature computation, another key bottleneck in prior methods. Together, these improvements lead to a 5-100x speedup over existing approaches while maintaining state-of-the-art tracking accuracy.
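The coarse-to-fine step, in which new trajectories are initialized from ones already tracked, can be illustrated with a fixed interpolation rule. Note the substitution: the paper uses a learnable interpolation module trained end-to-end, whereas this sketch swaps in plain inverse-distance weighting so that it runs without a network. The function name and data layout are hypothetical.

```python
import math


def init_new_track(query_start, sparse_tracks, eps=1e-6):
    """Initialise a new trajectory from coarse ones via inverse-distance weighting.

    query_start:   starting position of the new point, e.g. (x, y, z).
    sparse_tracks: already-tracked trajectories, each a list of positions per frame.
    """
    # Weight each coarse track by its closeness to the query in the first frame.
    weights = [1.0 / (math.dist(query_start, track[0]) + eps) for track in sparse_tracks]
    norm = sum(weights)
    new_track = []
    for t in range(len(sparse_tracks[0])):
        # Interpolate the displacement since frame 0, then apply it to the query point.
        pos = tuple(
            q + sum(w * (track[t][d] - track[0][d])
                    for w, track in zip(weights, sparse_tracks)) / norm
            for d, q in enumerate(query_start)
        )
        new_track.append(pos)
    return new_track
```

In a full pipeline the interpolated track would only be a starting guess that the tracker then refines; interpolating displacements rather than positions keeps the new point anchored at its own starting location.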

[126] No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

Ranran Huang,Krystian Mikolajczyk

Main category: cs.CV

TL;DR: SPFSplat is an efficient, pose-free framework for 3D Gaussian splatting that achieves excellent performance in novel view synthesis and pose estimation without relying on ground-truth data.

Details Motivation: The motivation is to develop an efficient 3D Gaussian splatting framework that works with sparse multi-view images and does not require ground-truth poses during training or inference. Method: SPFSplat uses a shared feature extraction backbone to simultaneously predict 3D Gaussian primitives and camera poses in a canonical space. It integrates a rendering loss and a reprojection loss for enhanced geometric constraints. Result: SPFSplat achieves state-of-the-art performance in novel view synthesis and surpasses recent methods trained with geometry priors in relative pose estimation, even under significant viewpoint changes and limited image overlap. Conclusion: SPFSplat is a pose-free and efficient framework that achieves state-of-the-art performance in novel view synthesis and relative pose estimation without relying on ground-truth poses. Abstract: We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs within a single feed-forward step. Alongside the rendering loss based on estimated novel-view poses, a reprojection loss is integrated to enforce the learning of pixel-aligned Gaussian primitives for enhanced geometric constraints. This pose-free training paradigm and efficient one-step feed-forward design make SPFSplat well-suited for practical applications. Remarkably, despite the absence of pose supervision, SPFSplat achieves state-of-the-art performance in novel view synthesis even under significant viewpoint changes and limited image overlap. It also surpasses recent methods trained with geometry priors in relative pose estimation. Code and trained models are available on our project page: https://ranrhuang.github.io/spfsplat/.
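A reprojection loss of the kind mentioned above can be sketched with a standard pinhole camera: each predicted 3D Gaussian center is projected into the image and compared with the pixel it is supposed to align with. The exact loss used by SPFSplat is not given in the abstract, so the mean Euclidean pixel error, the function names, and the argument layout below are all assumptions.

```python
import math


def project(point, rotation, translation, fx, fy, cx, cy):
    """Project a 3D point into pixel coordinates with a pinhole camera."""
    # Camera-frame coordinates: X_cam = R @ X + t, with R given as three rows.
    x, y, z = (
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    return fx * x / z + cx, fy * y / z + cy


def reprojection_loss(centers, pixels, rotation, translation, fx, fy, cx, cy):
    """Mean pixel distance between projected Gaussian centers and their source pixels."""
    errors = []
    for center, (u, v) in zip(centers, pixels):
        pu, pv = project(center, rotation, translation, fx, fy, cx, cy)
        errors.append(math.hypot(pu - u, pv - v))
    return sum(errors) / len(errors)
```

Because both the centers and the pose enter the projection, minimizing such a loss constrains geometry and camera estimates jointly, which is consistent with the pose-free training described in the abstract.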

[127] Object Affordance Recognition and Grounding via Multi-scale Cross-modal Representation Learning

Xinhang Wan,Dongqiang Gou,Xinwang Liu,En Zhu,Xuming He

Main category: cs.CV

TL;DR: This paper proposes a novel approach for Embodied AI that learns an affordance-aware 3D representation and uses a stage-wise inference strategy to enhance object manipulation understanding, resulting in improved performance in both affordance grounding and classification.

Details Motivation: The motivation is to improve the learning of object manipulation in Embodied AI by overcoming the limitations of previous approaches, which handle affordance grounding and classification separately, leading to inconsistent predictions and failure to predict full potential affordance areas. Method: The method involves developing a cross-modal 3D representation through efficient fusion and multi-scale geometric feature propagation, followed by a two-stage prediction mechanism that couples grounding and classification tasks. Result: Experiments show that the proposed method improves performance in both affordance grounding and classification, addressing the issues of incomplete predictions and fixed-scale operation in prior methods. Conclusion: The proposed approach effectively addresses the issues of previous methods by learning an affordance-aware 3D representation and employing a stage-wise inference strategy, which enhances the understanding of object manipulation through improved performance in both affordance grounding and classification. Abstract: A core problem of Embodied AI is to learn object manipulation from observation, as humans do. To achieve this, it is important to localize 3D object affordance areas through observation such as images (3D affordance grounding) and understand their functionalities (affordance classification). Previous attempts usually tackle these two tasks separately, leading to inconsistent predictions due to lacking proper modeling of their dependency. In addition, these methods typically only ground the incomplete affordance areas depicted in images, failing to predict the full potential affordance areas, and operate at a fixed scale, resulting in difficulty in coping with affordances significantly varying in scale with respect to the whole object. 
To address these issues, we propose a novel approach that learns an affordance-aware 3D representation and employs a stage-wise inference strategy leveraging the dependency between grounding and classification tasks. Specifically, we first develop a cross-modal 3D representation through efficient fusion and multi-scale geometric feature propagation, enabling inference of full potential affordance areas at a suitable regional scale. Moreover, we adopt a simple two-stage prediction mechanism, effectively coupling grounding and classification for better affordance understanding. Experiments demonstrate the effectiveness of our method, showing improved performance in both affordance grounding and classification.

[128] A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding

Zhan Shi,Song Wang,Junbo Chen,Jianke Zhu

Main category: cs.CV

TL;DR: The paper introduces a new 3D occupancy grounding benchmark and proposes the GroundingOcc model, which uses multi-modal learning to improve object perception in autonomous driving scenarios.

Details Motivation: The motivation of the paper is to improve object representation in visual grounding tasks for autonomous driving by addressing the limitations of bounding boxes that fail to capture fine-grained details, as not all voxels within a bounding box are occupied. Method: The paper proposes GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning, combining visual, textual, and point cloud features to predict object location and occupancy information. It also includes a multimodal encoder, an occupancy head, a grounding head, a 2D grounding module, and a depth estimation module. Result: The result of the paper is the creation of a new benchmark for 3D occupancy grounding in outdoor scenes based on the nuScenes dataset, and the development of the GroundingOcc model, which demonstrates superior performance on this benchmark. Conclusion: The paper concludes that GroundingOcc outperforms existing baselines on 3D occupancy grounding, offering more precise object perception by integrating natural language with voxel-level occupancy annotations. Abstract: Visual grounding aims to identify objects or regions in a scene based on natural language descriptions, essential for spatially aware perception in autonomous driving. However, existing visual grounding tasks typically depend on bounding boxes that often fail to capture fine-grained details. Not all voxels within a bounding box are occupied, resulting in inaccurate object representations. To address this, we introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it integrates natural language with voxel-level occupancy annotations, offering more precise object perception compared to the traditional grounding task. Moreover, we propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning. 
It combines visual, textual, and point cloud features to predict object location and occupancy information from coarse to fine. Specifically, GroundingOcc comprises a multimodal encoder for feature extraction, an occupancy head for voxel-wise predictions, and a grounding head to refine localization. Additionally, a 2D grounding module and a depth estimation module enhance geometric understanding, thereby boosting model performance. Extensive experiments on the benchmark demonstrate that our method outperforms existing baselines on 3D occupancy grounding. The dataset is available at https://github.com/RONINGOD/GroundingOcc.

[129] Deep Learning for Pavement Condition Evaluation Using Satellite Imagery

Prathyush Kumar Reddy Lebaku,Lu Gao,Pan Lu,Jingran Sun

Main category: cs.CV

TL;DR: This research uses deep learning and satellite imagery to efficiently and accurately assess pavement conditions, offering a promising solution for more cost-effective infrastructure monitoring.

Details Motivation: Civil infrastructure systems require frequent inspections, but traditional methods like manual surveys or vehicle-based surveys are labor-intensive and time-consuming. Advancements in satellite systems and image processing algorithms offer a more efficient alternative for infrastructure monitoring. Method: The study used over 3,000 satellite images of pavement sections combined with pavement evaluation ratings from TxDOT's PMIS database to train and test deep learning models. Result: The deep learning models achieved an accuracy rate exceeding 90% in evaluating pavement conditions from satellite images. Conclusion: The research demonstrates that using deep learning models to analyze satellite images can provide a rapid and cost-effective method for evaluating pavement network conditions. Abstract: Civil infrastructure systems cover large land areas and need frequent inspections to maintain their public service capabilities. The conventional approaches of manual surveys or vehicle-based automated surveys to assess infrastructure conditions are often labor-intensive and time-consuming. For this reason, it is worthwhile to explore more cost-effective methods for monitoring and maintaining these infrastructures. Fortunately, recent advancements in satellite systems and image processing algorithms have opened up new possibilities. Numerous satellite systems have been employed to monitor infrastructure conditions and identify damages. Due to the improvement in ground sample distance (GSD), the level of detail that can be captured has significantly increased. Taking advantage of these technological advancements, this research investigated the evaluation of pavement conditions using deep learning models to analyze satellite images. We gathered over 3,000 satellite images of pavement sections, together with pavement evaluation ratings from TxDOT's PMIS database. The results of our study show an accuracy rate exceeding 90%. 
This research paves the way for a rapid and cost-effective approach to evaluating the pavement network in the future.

[130] RoadMamba: A Dual Branch Visual State Space Model for Road Surface Classification

Tianze Wang,Zhang Zhang,Chao Yue,Nuoran Li,Chao Sun

Main category: cs.CV

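TL;DR: This paper proposes RoadMamba, the first exploration of the Mamba architecture for road surface classification; it combines local texture and global semantics through DualSSM and DAF, optimizes the model with a dual auxiliary loss, and achieves state-of-the-art classification results on a large-scale dataset.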

Details Motivation: Acquiring road surface conditions in advance with vision-based techniques can improve the safety and driving comfort of autonomous vehicles. Although the Mamba architecture, which is based on state-space models, performs well on visual processing tasks, its weakness in extracting the local texture of road surfaces keeps it from reaching state-of-the-art performance on road surface classification. Method: The paper proposes RoadMamba, which uses a Dual State Space Model (DualSSM) to extract the global semantics and local texture of the road surface, and decodes and fuses the two feature streams through Dual Attention Fusion (DAF). A dual auxiliary loss explicitly constrains both branches so that the network does not rely too heavily on global semantic information. Result: RoadMamba achieves state-of-the-art performance in experiments on a large-scale road surface classification dataset containing 1 million samples. Conclusion: By combining local and global perception, RoadMamba achieves state-of-the-art performance on road surface classification and addresses the difficulty existing Mamba architectures have in extracting local road-surface texture. Abstract: Acquiring the road surface conditions in advance based on visual technologies provides effective information for the planning and control system of autonomous vehicles, thus improving the safety and driving comfort of the vehicles. Recently, the Mamba architecture based on state-space models has shown remarkable performance in visual processing tasks, benefiting from the efficient global receptive field. However, existing Mamba architectures struggle to achieve state-of-the-art visual road surface classification due to their lack of effective extraction of the local texture of the road surface. In this paper, we explore for the first time the potential of visual Mamba architectures for the road surface classification task and propose a method that effectively combines local and global perception, called RoadMamba. Specifically, we utilize the Dual State Space Model (DualSSM) to effectively extract the global semantics and local texture of the road surface and decode and fuse the dual features through the Dual Attention Fusion (DAF). In addition, we propose a dual auxiliary loss to explicitly constrain dual branches, preventing the network from relying only on global semantic information from the deep large receptive field and ignoring the local texture. The proposed RoadMamba achieves state-of-the-art performance in experiments on a large-scale road surface classification dataset containing 1 million samples.

[131] StyDeco: Unsupervised Style Transfer with Distilling Priors and Semantic Decoupling

Yuanlin Yang,Quanjian Song,Zhexian Gao,Ge Wang,Shanshan Li,Xiaoyan Zhang

Main category: cs.CV

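TL;DR: This paper proposes StyDeco, an unsupervised text-driven style transfer framework that addresses the loss of semantic structure through Prior-Guided Data Distillation and Contrastive Semantic Decoupling, achieving strong results on style transfer tasks.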

Details Motivation: Text-driven style transfer methods overlook the semantic gap between the non-spatial nature of textual descriptions and the spatial attributes of visual style, which leads to the loss of semantic structure and detail. Method: The authors propose StyDeco, an unsupervised framework built on two key strategies: Prior-Guided Data Distillation (PGD) and Contrastive Semantic Decoupling (CSD). Result: Extensive experiments on three classic benchmark datasets show that StyDeco outperforms existing methods in stylistic fidelity and structural preservation. Conclusion: StyDeco effectively addresses the loss of semantic structure in text-driven style transfer and shows superior stylization and structure preservation. Abstract: Diffusion models have emerged as the dominant paradigm for style transfer, but their text-driven mechanism is hindered by a core limitation: it treats textual descriptions as uniform, monolithic guidance. This limitation overlooks the semantic gap between the non-spatial nature of textual descriptions and the spatially-aware attributes of visual style, often leading to the loss of semantic structure and fine-grained details during stylization. In this paper, we propose StyDeco, an unsupervised framework that resolves this limitation by learning text representations specifically tailored for the style transfer task. Our framework first employs Prior-Guided Data Distillation (PGD), a strategy designed to distill stylistic knowledge without human supervision. It leverages a powerful frozen generative model to automatically synthesize pseudo-paired data. Subsequently, we introduce Contrastive Semantic Decoupling (CSD), a task-specific objective that adapts a text encoder using domain-specific weights. CSD performs a two-class clustering in the semantic space, encouraging source and target representations to form distinct clusters. Extensive experiments on three classic benchmarks demonstrate that our framework outperforms several existing approaches in both stylistic fidelity and structural preservation, highlighting its effectiveness in style transfer with semantic preservation. In addition, our framework supports a unique de-stylization process, further demonstrating its extensibility. Our code is available at https://github.com/QuanjianSong/StyDeco.

[132] Perspective from a Broader Context: Can Room Style Knowledge Help Visual Floorplan Localization?

Bolei Chen,Shengsheng Yan,Yongzheng Cui,Jiaxu Kang,Ping Zhong,Jianxin Wang

Main category: cs.CV

TL;DR: This paper introduces an unsupervised learning method to enhance visual Floorplan Localization by leveraging broader visual scene context, outperforming existing approaches in accuracy and robustness.

Details Motivation: The motivation is to overcome the limitations of existing FLoc methods that rely solely on 2D structural cues or 3D geometry-constrained pre-trainings, which often lead to ambiguous localization due to repetitive structures in floorplans. Method: The paper proposes an unsupervised learning technique with clustering constraints to pre-train a room discriminator on unlabeled room images. This discriminator extracts hidden room types, which are then used to improve FLoc algorithms. Result: The experiments showed that the proposed method outperforms state-of-the-art approaches and achieves significant improvements in both robustness and accuracy on two standard visual FLoc benchmarks. Conclusion: The paper concludes that incorporating scene context information through an unsupervised learning technique enhances the robustness and accuracy of visual Floorplan Localization (FLoc). Abstract: Since a building's floorplan remains consistent over time and is inherently robust to changes in visual appearance, visual Floorplan Localization (FLoc) has received increasing attention from researchers. However, as a compact and minimalist representation of the building's layout, floorplans contain many repetitive structures (e.g., hallways and corners), and thus easily lead to ambiguous localization. Existing methods either pin their hopes on matching 2D structural cues in floorplans or rely on 3D geometry-constrained visual pre-trainings, ignoring the richer contextual information provided by visual images. In this paper, we suggest using broader visual scene context to empower FLoc algorithms with scene layout priors to eliminate localization uncertainty. In particular, we propose an unsupervised learning technique with clustering constraints to pre-train a room discriminator on self-collected unlabeled room images. Such a discriminator can empirically extract the hidden room type of the observed image and distinguish it from other room types. 
By injecting the scene context information summarized by the discriminator into an FLoc algorithm, the room style knowledge is effectively exploited to guide definite visual FLoc. We conducted extensive comparative studies on two standard visual FLoc benchmarks. Our experiments show that our approach outperforms state-of-the-art methods and achieves significant improvements in robustness and accuracy.

[133] MoGaFace: Momentum-Guided and Texture-Aware Gaussian Avatars for Consistent Facial Geometry

Yujian Liu,Linlang Cao,Chuang Chen,Fanyu Geng,Dongxu Shen,Peng Cao,Shidang Xu,Xiaoli Liu

Main category: cs.CV

TL;DR: MoGaFace improves 3D head avatar reconstruction by refining geometry and texture during rendering, outperforming existing methods even with inaccurate mesh inputs.

Details Motivation: Existing methods suffer from misalignment between estimated meshes and target images, leading to suboptimal rendering and loss of detail. MoGaFace aims to overcome these limitations. Method: MoGaFace introduces a Momentum-Guided Consistent Geometry module and Latent Texture Attention to refine facial geometry and texture during Gaussian rendering. Result: MoGaFace demonstrates superior performance in high-fidelity head avatar reconstruction and novel-view synthesis, even in unconstrained real-world settings. Conclusion: MoGaFace achieves high-fidelity head avatar reconstruction and improves novel-view synthesis quality, especially under challenging conditions like inaccurate mesh initialization. Abstract: Existing 3D head avatar reconstruction methods adopt a two-stage process, relying on tracked FLAME meshes derived from facial landmarks, followed by Gaussian-based rendering. However, misalignment between the estimated mesh and target images often leads to suboptimal rendering quality and loss of fine visual details. In this paper, we present MoGaFace, a novel 3D head avatar modeling framework that continuously refines facial geometry and texture attributes throughout the Gaussian rendering process. To address the misalignment between estimated FLAME meshes and target images, we introduce the Momentum-Guided Consistent Geometry module, which incorporates a momentum-updated expression bank and an expression-aware correction mechanism to ensure temporal and multi-view consistency. Additionally, we propose Latent Texture Attention, which encodes compact multi-view features into head-aware representations, enabling geometry-aware texture refinement via integration into Gaussians. Extensive experiments show that MoGaFace achieves high-fidelity head avatar reconstruction and significantly improves novel-view synthesis quality, even under inaccurate mesh initialization and unconstrained real-world settings.

[134] Eigen Neural Network: Unlocking Generalizable Vision with Eigenbasis

Anzhe Cheng,Chenzhong Yin,Mingxi Cheng,Shukai Duan,Shahin Nazarian,Paul Bogdan

Main category: cs.CV

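TL;DR: This paper proposes the Eigen Neural Network (ENN), a new neural network architecture that remedies the disordered weight structures of deep neural networks and outperforms existing techniques on multiple benchmarks.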

Details Motivation: The success of deep neural networks is driven by gradient-based optimization, but this process is often undermined by its tendency to produce disordered weight structures, which harms feature clarity and degrades learning dynamics. Method: The paper introduces the Eigen Neural Network (ENN), which reparameterizes each layer's weights in a layer-shared, learned orthonormal eigenbasis. Result: ENN consistently outperforms state-of-the-art methods on large-scale image classification benchmarks, including ImageNet, and its representations generalize to cross-modal image-text retrieval. Conclusion: Beyond resolving the procedural bottlenecks of backpropagation, the BP-free variant ENN-$\ell$ achieves more than a 2$\times$ training speedup through parallelism and surpasses the accuracy of end-to-end backpropagation. Abstract: The remarkable success of Deep Neural Networks (DNNs) is driven by gradient-based optimization, yet this process is often undermined by its tendency to produce disordered weight structures, which harms feature clarity and degrades learning dynamics. To address this fundamental representational flaw, we introduce the Eigen Neural Network (ENN), a novel architecture that reparameterizes each layer's weights in a layer-shared, learned orthonormal eigenbasis. This design enforces decorrelated, well-aligned weight dynamics axiomatically, rather than through regularization, leading to more structured and discriminative feature representations. When integrated with standard BP, ENN consistently outperforms state-of-the-art methods on large-scale image classification benchmarks, including ImageNet, and its superior representations generalize to set a new benchmark in cross-modal image-text retrieval. Furthermore, ENN's principled structure enables a highly efficient, backpropagation-free (BP-free) local learning variant, ENN-$\ell$. This variant not only resolves BP's procedural bottlenecks to achieve over 2$\times$ training speedup via parallelism, but also, remarkably, surpasses the accuracy of end-to-end backpropagation. ENN thus presents a new architectural paradigm that directly remedies the representational deficiencies of BP, leading to enhanced performance and enabling a more efficient, parallelizable training regime.

[135] ParaRevSNN: A Parallel Reversible Spiking Neural Network for Efficient Training and Inference

Changqing Xu,Guoqing Sun,Yi Liu,Xinfang Liao,Yintang Yang

Main category: cs.CV

TL;DR: ParaRevSNN improves the training and inference speed of reversible spiking neural networks while maintaining accuracy and memory efficiency.

Details Motivation: Reversible Spiking Neural Networks (RevSNNs) are memory-efficient but suffer from high latency due to sequential computation. The authors aim to address this limitation by introducing parallelism. Method: The authors propose ParaRevSNN, which decouples sequential dependencies between reversible blocks to enable inter-block parallelism, thereby accelerating both training and inference. Result: Experiments show that ParaRevSNN matches or exceeds the accuracy of standard RevSNNs while reducing training time by up to 35.2% and inference time to 18.15%. Conclusion: ParaRevSNN, a parallel reversible SNN architecture, successfully reduces training and inference time while maintaining accuracy and memory-saving benefits, making it suitable for resource-constrained environments. Abstract: Reversible Spiking Neural Networks (RevSNNs) enable memory-efficient training by reconstructing forward activations during backpropagation, but suffer from high latency due to strictly sequential computation. To overcome this limitation, we propose ParaRevSNN, a parallel reversible SNN architecture that decouples sequential dependencies between reversible blocks while preserving reversibility. This design enables inter-block parallelism, significantly accelerating training and inference while retaining the memory-saving benefits of reversibility. Experiments on CIFAR10, CIFAR100, CIFAR10-DVS, and DVS128 Gesture demonstrate that ParaRevSNN matches or exceeds the accuracy of standard RevSNNs, while reducing training time by up to 35.2\% and inference time to 18.15\%, making it well-suited for deployment in resource-constrained scenarios.
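The memory saving of reversible networks comes from being able to recompute a block's inputs from its outputs instead of storing them. The sketch below shows the standard additive coupling used in reversible residual networks; whether RevSNN or ParaRevSNN uses exactly this coupling is an assumption, the functions `f` and `g` are hypothetical stand-ins for the block's sub-networks (no spiking dynamics are modelled), and the inter-block parallelism that ParaRevSNN adds is not shown.

```python
def rev_forward(x1, x2, f, g):
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2, f, g):
    """Recover the inputs from the outputs, so activations need not be stored."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2
```

The sketch also makes the latency problem visible: within a block, `y2` depends on `y1`, and in a stack each block needs the previous block's output, which is the strictly sequential dependency the abstract says ParaRevSNN decouples.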

[136] Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models

Xinyu Chen,Haotian Zhai,Can Zhang,Xiupeng Shi,Ruirui Li

Main category: cs.CV

TL;DR: This paper introduces MCP and MCP++, novel test-time adaptation frameworks that enhance model generalization by leveraging multi-cache strategies and cross-modal alignment, achieving state-of-the-art results.

Details Motivation: Existing cache-enhanced TTA methods rely on low-entropy samples which may be unreliable under distribution shifts, leading to suboptimal prototypes. Method: Proposed Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) with entropy, align, and negative caches; further developed MCP++ with cross-modal prototype alignment and residual learning. Result: MCP and MCP++ demonstrated superior generalization performance across 15 downstream tasks, validated through comparative and ablation experiments. Conclusion: MCP and MCP++ methods achieve state-of-the-art generalization performance in zero-shot test-time adaptation by improving intra-class compactness and leveraging multi-cache strategies. Abstract: In zero-shot setting, test-time adaptation adjusts pre-trained models using unlabeled data from the test phase to enhance performance on unknown test distributions. Existing cache-enhanced TTA methods rely on a low-entropy criterion to select samples for prototype construction, assuming intra-class compactness. However, low-entropy samples may be unreliable under distribution shifts, and the resulting prototypes may not ensure compact intra-class distributions. This study identifies a positive correlation between cache-enhanced performance and intra-class compactness. Based on this observation, we propose a Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) featuring three caches: an entropy cache for initializing prototype representations with low-entropy samples, an align cache for integrating visual and textual information to achieve compact intra-class distributions, and a negative cache for prediction calibration using high-entropy samples. We further developed MCP++, a framework incorporating cross-modal prototype alignment and residual learning, introducing prototype residual fine-tuning. 
Comparative and ablation experiments across 15 downstream tasks demonstrate that the proposed method and framework achieve state-of-the-art generalization performance.
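The low-entropy selection that the entropy cache relies on can be sketched as follows. This is an illustrative sketch only: the per-class capacity, the mean-feature prototype, and the class name `EntropyCache` are assumptions, and the align and negative caches are not modeled.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EntropyCache:
    """Keeps, per predicted class, the `capacity` lowest-entropy features."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.slots: Dict[int, List[Tuple[float, List[float]]]] = defaultdict(list)

    def update(self, feature: List[float], logits: List[float]) -> Tuple[int, float]:
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        h = entropy(probs)
        slot = self.slots[label]
        slot.append((h, feature))
        slot.sort(key=lambda item: item[0])  # most confident first
        del slot[self.capacity:]             # evict high-entropy entries
        return label, h

    def prototype(self, label: int) -> Optional[List[float]]:
        feats = [f for _, f in self.slots.get(label, [])]
        if not feats:
            return None
        dim = len(feats[0])
        return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]


cache = EntropyCache(capacity=2)
cache.update([1.0, 0.0], [5.0, 0.0])  # confident prediction of class 0
cache.update([0.0, 1.0], [0.2, 0.0])  # uncertain prediction of class 0
cache.update([0.8, 0.2], [3.0, 0.0])  # fairly confident prediction of class 0
proto = cache.prototype(0)
```

In this toy run the uncertain sample is evicted, so the class prototype is built from the two confident features. The abstract's point is that such low-entropy samples may still be unreliable under distribution shift, which is what the additional caches are meant to address.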

[137] Enhancing Multi-view Open-set Learning via Ambiguity Uncertainty Calibration and View-wise Debiasing

Zihan Fang,Zhiyong Xu,Lan Du,Shide Du,Zhiling Cai,Shiping Wang

Main category: cs.CV

TL;DR: This paper proposes a framework to improve open-set recognition in multi-view learning by addressing class completeness assumptions and view-induced biases.

Details Motivation: Existing multi-view learning models struggle in open-set scenarios and face degradation due to static view-induced biases. Method: A multi-view open-set learning framework using ambiguity uncertainty calibration and view-wise debiasing. Result: Extensive experiments on diverse multi-view benchmarks demonstrate the framework's effectiveness. Conclusion: The proposed framework enhances unknown-class recognition while maintaining strong closed-set performance. Abstract: Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance.

[138] Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models

Mingyu Fu,Wei Suo,Ji Ma,Lin Yuanbo Wu,Peng Wang,Yanning Zhang

Main category: cs.CV

TL;DR: ACCM reduces computational cost in LVLMs by compensating for visual information loss through a caption model and selector, achieving superior performance with fewer FLOPs.

Details Motivation: LVLMs incur high computational cost because of their long visual token inputs, and current pruning methods degrade performance at high pruning rates. Method: ACCM uses a lightweight caption model and a selector to compensate for visual information loss, trained via self-supervised learning. Result: Across seven benchmarks, ACCM outperforms existing methods at lower FLOPs (e.g., surpassing SOTA by 20.6% with 6.5% fewer FLOPs). Conclusion: ACCM is an effective method to reduce computational cost in LVLMs by mitigating visual information loss, showing superior performance over existing methods. Abstract: Despite the great success of Large Vision Language Models (LVLMs), their high computational cost severely limits their broad applications. The computational cost of LVLMs mainly stems from the visual sequence of the input, which consists of hundreds or even thousands of tokens. Although existing methods have made progress by removing redundant tokens, they suffer from severe performance degradation with high pruning rates due to the loss of visual information. In this paper, we propose an Adaptive Content Compensation Method (ACCM), which can effectively mitigate the visual information loss via an image caption. Specifically, ACCM comprises two key components: a lightweight caption model and a selector. First, the caption model generates question-related descriptions under the guidance of the user instruction. Then the selector identifies a contextually appropriate caption from multiple candidates. Leveraging self-supervised learning, our modules can be learned efficiently without any human or automated labeling. We conduct extensive experiments across seven benchmarks and the results show that ACCM significantly outperforms existing methods with lower FLOPs (e.g., surpassing SOTA by 20.6% with 6.5% fewer FLOPs).

[139] OCSplats: Observation Completeness Quantification and Label Noise Separation in 3DGS

Han Ling,Xian Xu,Yinghui Sun,Quansen Sun

Main category: cs.CV

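TL;DR: This work proposes OCSplats, a new anti-noise 3D reconstruction framework that uses hybrid noise assessment and observation-based cognitive correction to address reconstruction errors caused by label noise in real-world scenes.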

Details Motivation: Label noise in real-world scenes (such as moving objects, non-Lambertian surfaces, and shadows) causes reconstruction errors in 3D Gaussian Splatting (3DGS), and existing methods either fail to separate the noise effectively or require scene-specific hyperparameter tuning. Method: The authors re-examine anti-noise reconstruction from the perspective of epistemic uncertainty and propose the OCSplats framework, which combines hybrid noise assessment with observation-based cognitive correction, together with a label noise classification pipeline based on dynamic anchor points. Result: Noise classification accuracy in areas with cognitive differences improves significantly, and the method applies to scenes with different noise proportions without parameter adjustment. Conclusion: OCSplats achieves leading reconstruction performance and precise label noise classification across scenes of different complexity levels. Abstract: 3D Gaussian Splatting (3DGS) has become one of the most promising 3D reconstruction technologies. However, label noise in real-world scenarios (such as moving objects, non-Lambertian surfaces, and shadows) often leads to reconstruction errors. Existing 3DGS-based anti-noise reconstruction methods either fail to separate noise effectively or require scene-specific fine-tuning of hyperparameters, making them difficult to apply in practice. This paper re-examines the problem of anti-noise reconstruction from the perspective of epistemic uncertainty, proposing a novel framework, OCSplats. By combining key technologies such as hybrid noise assessment and observation-based cognitive correction, the accuracy of noise classification in areas with cognitive differences has been significantly improved. Moreover, to address the issue of varying noise proportions in different scenarios, we have designed a label noise classification pipeline based on dynamic anchor points. This pipeline enables OCSplats to be applied simultaneously to scenarios with vastly different noise proportions without adjusting parameters. Extensive experiments demonstrate that OCSplats consistently achieves leading reconstruction performance and precise label noise classification in scenes of different complexity levels.

[140] NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection

Jiazhen Yan,Fan Wang,Weiwei Jiang,Ziqiang Li,Zhangjie Fu

Main category: cs.CV

TL;DR: NS-Net improves the detection accuracy of AI-generated images through NULL-Space projection and contrastive learning.

Details Motivation: Existing detectors perform poorly on unknown generative models, so a detection method with better generalization is needed. Method: The approach builds on CLIP features, introduces NULL-Space projection and contrastive learning, and designs a Patch Selection strategy to reduce semantic bias. Result: In experiments on a dataset covering 40 different generative models, NS-Net improves detection accuracy by 7.4% over existing methods. Conclusion: NS-Net is a new detection framework that uses NULL-Space projection and contrastive learning to improve the detection of generated images and generalizes well. Abstract: The rapid progress of generative models, such as GANs and diffusion models, has facilitated the creation of highly realistic images, raising growing concerns over their misuse in security-sensitive domains. While existing detectors perform well under known generative settings, they often fail to generalize to unknown generative models, especially when semantic content between real and fake images is closely aligned. In this paper, we revisit the use of CLIP features for AI-generated image detection and uncover a critical limitation: the high-level semantic information embedded in CLIP's visual features hinders effective discrimination. To address this, we propose NS-Net, a novel detection framework that leverages NULL-Space projection to decouple semantic information from CLIP's visual features, followed by contrastive learning to capture intrinsic distributional differences between real and generated images. Furthermore, we design a Patch Selection strategy to preserve fine-grained artifacts by mitigating semantic bias caused by global image structures. Extensive experiments on an open-world benchmark comprising images generated by 40 diverse generative models show that NS-Net outperforms existing state-of-the-art methods, achieving a 7.4\% improvement in detection accuracy, thereby demonstrating strong generalization across both GAN- and diffusion-based image generation techniques.
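As a rough illustration of removing a subspace from a feature vector, the sketch below projects a feature onto the orthogonal complement of a set of "semantic" directions. How NS-Net actually obtains its semantic subspace is not stated in the abstract; the directions, the feature values, and the function names here are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: build an orthonormal basis spanning the given directions."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(x: Vector, basis: List[Vector]) -> Vector:
    """Remove every component of x that lies in span(basis)."""
    out = list(x)
    for u in basis:
        c = dot(out, u)
        out = [oi - c * ui for oi, ui in zip(out, u)]
    return out


# Hypothetical semantic directions and feature vector (assumptions).
semantic_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
feature = [3.0, 4.0, 5.0]

basis = orthonormalize(semantic_dirs)
residual = project_to_null_space(feature, basis)
```

After projection, `residual` is orthogonal to every semantic direction, so a downstream classifier trained on it cannot rely on that subspace.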

[141] DisFaceRep: Representation Disentanglement for Co-occurring Facial Components in Weakly Supervised Face Parsing

Xiaoqin Wang,Xianxu Hou,Meidan Ding,Junliang Chen,Kaijun Deng,Jinheng Xie,Linlin Shen

Main category: cs.CV

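TL;DR: This paper introduces the Weakly Supervised Face Parsing (WSFP) task setting and designs DisFaceRep, a representation disentanglement framework, to address the challenges posed by the co-occurrence and visual similarity of facial components.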

Details Motivation: Existing face parsing methods rely on dense pixel-level annotations, which are expensive and labor-intensive; the Weakly Supervised Face Parsing (WSFP) task setting is introduced to reduce annotation cost. Method: The authors propose DisFaceRep, a representation disentanglement framework that includes a co-occurring component disentanglement strategy and a text-guided component disentanglement loss. Result: Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep. Conclusion: DisFaceRep effectively addresses the co-occurrence and visual similarity of facial components in weakly supervised face parsing and significantly outperforms existing weakly supervised semantic segmentation methods. Abstract: Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to guide component separation using language supervision implicitly. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.

[142] ODOV: Towards Open-Domain Open-Vocabulary Object Detection

Yupeng Zhang,Ruize Han,Fangnan Zhou,Song Wang,Wei Feng,Liang Wan

Main category: cs.CV

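TL;DR: This paper proposes a new open-domain open-vocabulary object detection method; by building the new benchmark OD-LVIS and using large language models to generate domain-agnostic text prompts, it enables detection that adapts to real-world domain and category shifts.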

Details Motivation: To address real-world object detection, which requires the detection model to adapt to both domain and category shifts. Method: Large language models generate domain-agnostic text prompts for category embedding, and a domain embedding is learned from the given image so that, at test time, a customized domain-specific category embedding is formed for each test image. Result: The authors build a new benchmark, OD-LVIS, containing 46,949 images that cover 18 complex real-world domains and 1,203 categories, and report extensive benchmark evaluations. Conclusion: The work formulates the new Open-Domain Open-Vocabulary (ODOV) object detection problem and develops a novel baseline method to handle real-world domain and category shifts. Abstract: In this work, we handle a new problem of Open-Domain Open-Vocabulary (ODOV) object detection, which considers the detection model's adaptability to the real world including both domain and category shifts. For this problem, we first construct a new benchmark OD-LVIS, which includes 46,949 images, covers 18 complex real-world domains and 1,203 categories, and provides a comprehensive dataset for evaluating real-world object detection. Besides, we develop a novel baseline method for ODOV detection. The proposed method first leverages large language models to generate the domain-agnostic text prompts for category embedding. It further learns the domain embedding from the given image, which, during testing, can be integrated into the category embedding to form the customized domain-specific category embedding for each test image. We provide sufficient benchmark evaluations for the proposed ODOV detection task and report the results, which verify the rationale of ODOV detection, the usefulness of our benchmark, and the superiority of the proposed method.

[143] Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency

Zihan Li,Wei Sun,Jing Hu,Jianhua Yin,Jianlong Wu,Liqiang Nie

Main category: cs.CV

TL;DR: A self-enhanced cross-modal framework improves image clustering by aligning and fine-tuning pre-trained models, achieving state-of-the-art results with smaller architectures.

Details Motivation: Existing methods freeze the encoder of large pre-trained models like CLIP, leading to a mismatch between task-agnostic representations and specific clustering task demands, limiting performance. Method: The framework operates in two stages: (1) Cross-Modal Semantic Consistency to align clustering heads with pre-trained model semantics, and (2) Self-Enhanced fine-tuning using pseudo-labels for joint optimization of encoder and clustering heads. Result: The method outperforms existing deep clustering techniques by significant margins on six mainstream datasets, with the ViT-B/32 model matching or surpassing the accuracy of state-of-the-art methods using the larger ViT-L/14. Conclusion: The proposed self-enhanced framework based on cross-modal semantic consistency significantly improves image clustering performance, outperforming existing methods and achieving results comparable to or better than state-of-the-art approaches using larger models. Abstract: While large language-image pre-trained models like CLIP offer powerful generic features for image clustering, existing methods typically freeze the encoder. This creates a fundamental mismatch between the model's task-agnostic representations and the demands of a specific clustering task, imposing a ceiling on performance. To break this ceiling, we propose a self-enhanced framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency and then specializes the encoder through Self-Enhancement. In the first stage, we focus on Cross-Modal Semantic Consistency. By mining consistency between generated image-text pairs at the instance, cluster assignment, and cluster center levels, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. 
This alignment process is bolstered by a novel method for generating higher-quality cluster centers and a dynamic balancing regularizer to ensure well-distributed assignments. In the second stage, we introduce a Self-Enhanced fine-tuning strategy. The well-aligned model from the first stage acts as a reliable pseudo-label generator. These self-generated supervisory signals are then used to feed back the efficient, joint optimization of the vision encoder and clustering heads, unlocking their full potential. Extensive experiments on six mainstream datasets show that our method outperforms existing deep clustering methods by significant margins. Notably, our ViT-B/32 model already matches or even surpasses the accuracy of state-of-the-art methods built upon the far larger ViT-L/14.

[144] SpatioTemporal Difference Network for Video Depth Super-Resolution

Zhengxue Wang,Yuan Wu,Xiang Li,Zhiqiang Yan,Jian Yang

Main category: cs.CV

TL;DR: STDNet addresses long-tailed distribution issues in video depth super-resolution by leveraging spatial and temporal difference mechanisms, achieving state-of-the-art results.

Details Motivation: Despite advances in depth super-resolution, video depth super-resolution is still affected by long-tailed distributions, particularly in spatial non-smooth regions and temporal variation zones. This work aims to address these challenges by designing a network that explicitly models spatial and temporal differences. Method: The authors propose a novel SpatioTemporal Difference Network (STDNet) with two core branches: a spatial difference branch that dynamically aligns RGB features for intra-frame RGB-D aggregation, and a temporal difference branch that propagates temporal variation information for precise motion compensation. Result: Extensive experiments on multiple datasets demonstrate the effectiveness of the proposed STDNet, showing superior performance compared to existing approaches in video depth super-resolution. Conclusion: The proposed STDNet effectively addresses the long-tailed distribution issues in video depth super-resolution by incorporating a spatial difference branch and a temporal difference branch, outperforming existing methods. Abstract: Depth super-resolution has achieved impressive performance, and the incorporation of multi-frame information further enhances reconstruction quality. Nevertheless, statistical analyses reveal that video depth super-resolution remains affected by pronounced long-tailed distributions, with the long-tailed effects primarily manifesting in spatial non-smooth regions and temporal variation zones. To address these challenges, we propose a novel SpatioTemporal Difference Network (STDNet) comprising two core branches: a spatial difference branch and a temporal difference branch. In the spatial difference branch, we introduce a spatial difference mechanism to mitigate the long-tailed issues in spatial non-smooth regions. This mechanism dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. 
In the temporal difference branch, we further design a temporal difference strategy that preferentially propagates temporal variation information from adjacent RGB and depth frames to the current depth frame, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. Extensive experimental results across multiple datasets demonstrate the effectiveness of our STDNet, outperforming existing approaches.

[145] Enhancing Diffusion-based Dataset Distillation via Adversary-Guided Curriculum Sampling

Lexiao Zou,Gongwei Chen,Yanda Chen,Miao Zhang

Main category: cs.CV

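TL;DR: This paper proposes Adversary-guided Curriculum Sampling (ACS), a new method that addresses insufficient image diversity in dataset distillation and achieves significant performance gains on multiple datasets.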

Details Motivation: Existing dataset distillation methods suffer performance degradation as the images-per-class (IPC) setting or image resolution grows, and while diffusion generative models are effective, the images they generate lack diversity, leading to information redundancy. Method: The authors propose Adversary-guided Curriculum Sampling (ACS), which partitions the distilled dataset into multiple curricula and, for each curriculum, uses an adversarial loss to guide diffusion sampling, reducing information overlap and increasing data diversity. Result: ACS achieves improvements of 4.1% on Imagewoof and 2.1% on ImageNet-1k. Conclusion: Experiments show that ACS outperforms the state of the art on multiple datasets. Abstract: Dataset distillation aims to encapsulate the rich information contained in a dataset into a compact distilled dataset, but it faces performance degradation as the image-per-class (IPC) setting or image resolution grows larger. Recent advancements demonstrate that integrating diffusion generative models can effectively facilitate the compression of large-scale datasets while maintaining efficiency due to their superiority in matching data distribution and summarizing representative patterns. However, images sampled from diffusion models are often criticized for lacking diversity, which may lead to information redundancy when multiple independently sampled images are aggregated as a distilled dataset. To address this issue, we propose Adversary-guided Curriculum Sampling (ACS), which partitions the distilled dataset into multiple curricula. For generating each curriculum, ACS guides the diffusion sampling process by an adversarial loss to challenge a discriminator trained on sampled images, thus mitigating information overlap between curricula and fostering a more diverse distilled dataset. Additionally, as the discriminator evolves with the progression of curricula, ACS generates images from simpler to more complex, ensuring efficient and systematic coverage of the target data's informational spectrum. Extensive experiments demonstrate the effectiveness of ACS, which achieves substantial improvements of 4.1\% on Imagewoof and 2.1\% on ImageNet-1k over the state-of-the-art.

[146] ModelNet40-E: An Uncertainty-Aware Benchmark for Point Cloud Classification

Pedro Alonso,Tianrui Li,Chongshou Li

Main category: cs.CV

TL;DR: ModelNet40-E is a new benchmark for evaluating point cloud classification models under noise; Point Transformer v3 performs well on this benchmark.

Details Motivation: Existing benchmarks cannot fully evaluate model behavior under synthetic LiDAR-like noise, so ModelNet40-E is introduced to improve evaluation. Method: The ModelNet40-E benchmark provides noise-corrupted point clouds and point-wise uncertainty annotations via Gaussian noise parameters ($\sigma$, $\mu$). Result: PointNet, DGCNN, and Point Transformer v3 are evaluated across multiple noise levels; all models degrade as noise increases, but Point Transformer v3 shows better calibration. Conclusion: ModelNet40-E is a new benchmark for assessing the robustness and calibration of point cloud classification models under synthetic LiDAR-like noise. Point Transformer v3 demonstrates superior calibration, with predicted uncertainties more closely aligned with the underlying measurement uncertainty. Abstract: We introduce ModelNet40-E, a new benchmark designed to assess the robustness and calibration of point cloud classification models under synthetic LiDAR-like noise. Unlike existing benchmarks, ModelNet40-E provides both noise-corrupted point clouds and point-wise uncertainty annotations via Gaussian noise parameters ($\sigma$, $\mu$), enabling fine-grained evaluation of uncertainty modeling. We evaluate three popular models (PointNet, DGCNN, and Point Transformer v3) across multiple noise levels using classification accuracy, calibration metrics, and uncertainty-awareness. While all models degrade under increasing noise, Point Transformer v3 demonstrates superior calibration, with predicted uncertainties more closely aligned with the underlying measurement uncertainty.
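The idea of pairing each corrupted point with its noise parameters can be sketched as below. This is not the benchmark's generation procedure, which the abstract does not detail; sampling a per-point standard deviation uniformly from `sigma_range`, the fixed mean, and the function name are all assumptions.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float, float]


def corrupt_point_cloud(
    points: List[Point],
    sigma_range: Tuple[float, float] = (0.0, 0.05),  # assumed noise range
    mu: float = 0.0,                                 # assumed noise mean
    seed: int = 0,
) -> Tuple[List[Point], List[Tuple[float, float]]]:
    """Add per-point Gaussian noise and return (noisy points, (sigma, mu) labels)."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    noisy: List[Point] = []
    annotations: List[Tuple[float, float]] = []
    for x, y, z in points:
        sigma = rng.uniform(*sigma_range)
        noisy.append((
            x + rng.gauss(mu, sigma),
            y + rng.gauss(mu, sigma),
            z + rng.gauss(mu, sigma),
        ))
        annotations.append((sigma, mu))
    return noisy, annotations


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
noisy_cloud, labels = corrupt_point_cloud(cloud)
```

With such annotations available, a model's predicted per-point uncertainty can be compared directly against the `sigma` used to corrupt that point, which is the kind of fine-grained evaluation the abstract describes.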

[147] SGCap: Decoding Semantic Group for Zero-shot Video Captioning

Zeyu Pan,Ping Li,Wenxiao Wang

Main category: cs.CV

TL;DR: This paper proposes SGCap, a novel zero-shot video captioning method that captures temporal dynamics through multi-frame decoding and diverse sentence supervision, achieving strong performance without video-text training pairs.

Details Motivation: Zero-shot video captioning is underexplored, and existing image captioning methods that rely on single-sentence embeddings and average pooling neglect temporal dynamics when extended to video. Method: The proposed Semantic Group Captioning (SGCap) method uses Semantic Group Decoding (SGD) to model multi-frame information and temporal relationships. It also introduces Key Sentences Selection (KSS) and Probability Sampling Supervision (PSS) modules for better supervision. Result: SGCap achieves state-of-the-art performance on zero-shot video captioning benchmarks, outperforming previous methods and even competing with fully supervised models. Conclusion: SGCap significantly outperforms previous zero-shot video captioning methods and performs competitively with fully supervised approaches. Abstract: Zero-shot video captioning aims to generate sentences for describing videos without training the model on video-text pairs, which remains underexplored. Existing zero-shot image captioning methods typically adopt a text-only training paradigm, where a language decoder reconstructs single-sentence embeddings obtained from CLIP. However, directly extending them to the video domain is suboptimal, as applying average pooling over all frames neglects temporal dynamics. To address this challenge, we propose a Semantic Group Captioning (SGCap) method for zero-shot video captioning. In particular, it develops the Semantic Group Decoding (SGD) strategy to employ multi-frame information while explicitly modeling inter-frame temporal relationships. Furthermore, existing zero-shot captioning methods that rely on cosine similarity for sentence retrieval and reconstruct the description supervised by a single frame-level caption, fail to provide sufficient video-level supervision. To alleviate this, we introduce two key components, including the Key Sentences Selection (KSS) module and the Probability Sampling Supervision (PSS) module. 
The two modules construct semantically diverse sentence groups that model temporal dynamics and guide the model to capture inter-sentence causal relationships, thereby enhancing its generalization ability to video captioning. Experimental results on several benchmarks demonstrate that SGCap significantly outperforms previous state-of-the-art zero-shot alternatives and even achieves performance competitive with fully supervised ones. Code is available at https://github.com/mlvccn/SGCap_Video.

[148] PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation

Zonglei Jing,Xiao Yang,Xiaoqian Li,Siyuan Liang,Aishan Liu,Mingchuan Zhang,Xianglong Liu

Main category: cs.CV

TL;DR: PromptSafe is a novel framework that efficiently prevents NSFW content generation in T2I models, offering superior performance and adaptability while maintaining image quality.

Details Motivation: To address the challenges of high computational cost, degraded benign image quality, and limited adaptability in existing NSFW content prevention methods for T2I models. Method: Proposing PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. The method involves rewriting unsafe prompts into safe alternatives using an LLM to construct a text-only training corpus, optimizing a universal soft prompt to manage embeddings during diffusion denoising, and applying a gated mechanism to adjust defensive strength based on prompt toxicity. Result: Extensive experiments show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), preserves high benign fidelity, and demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks. Conclusion: PromptSafe demonstrates practical value for safe and scalable deployment by achieving a SOTA unsafe generation rate while preserving high benign fidelity, strong generalization, robust transferability, and resilience under adaptive adversarial attacks. Abstract: Text-to-image (T2I) models have demonstrated remarkable generative capabilities but remain vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. While recent moderation efforts have introduced soft prompt-guided tuning by appending defensive tokens to the input, these approaches often rely on large-scale curated image-text datasets and apply static, one-size-fits-all defenses at inference time. However, this results not only in high computational cost and degraded benign image quality, but also in limited adaptability to the diverse and nuanced safety requirements of real-world prompts. 
To address these challenges, we propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. Instead of training on expensive image-text datasets, we first rewrite unsafe prompts into semantically aligned but safe alternatives using an LLM, constructing an efficient text-only training corpus. Based on this, we optimize a universal soft prompt that repels unsafe and attracts safe embeddings during the diffusion denoising process. To avoid over-suppressing benign prompts, we introduce a gated mechanism that adaptively adjusts the defensive strength based on estimated prompt toxicity, thereby aligning defense intensity with prompt risk and ensuring strong protection for harmful inputs while preserving benign generation quality. Extensive experiments across multiple benchmarks and T2I models show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), while preserving high benign fidelity. Furthermore, PromptSafe demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks, highlighting its practical value for safe and scalable deployment.
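One simple way to picture a gate that scales defensive strength with estimated toxicity is sketched below. The abstract does not describe the gate's form, so the sigmoid shape, the `threshold` and `sharpness` values, and the scaling of the soft prompt by a scalar are all assumptions made for illustration.

```python
import math
from typing import List


def gate(toxicity: float, threshold: float = 0.5, sharpness: float = 10.0) -> float:
    """Map an estimated toxicity score in [0, 1] to a defense strength in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-sharpness * (toxicity - threshold)))


def gated_soft_prompt(soft_prompt: List[float], toxicity: float) -> List[float]:
    """Scale the (hypothetical) defensive soft prompt by the gate value."""
    g = gate(toxicity)
    return [g * v for v in soft_prompt]


soft_prompt = [0.4, -0.2, 0.1]  # hypothetical learned defensive embedding
benign = gated_soft_prompt(soft_prompt, toxicity=0.0)
harmful = gated_soft_prompt(soft_prompt, toxicity=1.0)
```

With these settings, a benign prompt receives almost none of the defensive embedding while a clearly harmful prompt receives nearly all of it, which mirrors the stated goal of avoiding over-suppression of benign prompts.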

[149] Integrating Disparity Confidence Estimation into Relative Depth Prior-Guided Unsupervised Stereo Matching

Chuang-Wei Liu,Mingjian Sun,Cairong Zhao,Hanli Wang,Alexander Dvorkovich,Rui Fan

Main category: cs.CV

TL;DR: This paper introduces a novel unsupervised learning framework for stereo matching that improves upon traditional methods by efficiently utilizing 3D geometric knowledge through disparity confidence estimation and depth prior-guided loss functions, leading to state-of-the-art performance on the KITTI Stereo benchmarks.

Details Motivation: The motivation is to overcome the limitations of typical unsupervised stereo matching methods that rely on multi-view consistency assumptions and suffer from stereo matching ambiguities, such as repetitive patterns and texture-less regions, by efficiently utilizing 3D geometric knowledge and reducing noise from mistaken disparity estimates. Method: The method involves three key steps: checking the local coherence consistency between neighboring disparities and their corresponding relative depths to obtain disparity confidence, building quasi-dense correspondences using only confident disparity estimates for efficient depth ranking learning, and proposing a dual disparity smoothness loss to enhance stereo matching performance at disparity discontinuities. Result: Experimental results show that the proposed method achieves the highest stereo matching accuracy among all unsupervised stereo matching methods on the KITTI Stereo benchmarks. Conclusion: The proposed unsupervised learning framework, which includes a plug-and-play disparity confidence estimation algorithm and two depth prior-guided loss functions, achieves state-of-the-art stereo matching accuracy on the KITTI Stereo benchmarks among all unsupervised stereo matching methods. Abstract: Unsupervised stereo matching has garnered significant attention for its independence from costly disparity annotations. Typical unsupervised methods rely on the multi-view consistency assumption for training networks, which suffer considerably from stereo matching ambiguities, such as repetitive patterns and texture-less regions. A feasible solution lies in transferring 3D geometric knowledge from a relative depth map to the stereo matching networks. However, existing knowledge transfer methods learn depth ranking information from randomly built sparse correspondences, which makes inefficient utilization of 3D geometric knowledge and introduces noise from mistaken disparity estimates. 
This work proposes a novel unsupervised learning framework to address these challenges, which comprises a plug-and-play disparity confidence estimation algorithm and two depth prior-guided loss functions. Specifically, the local coherence consistency between neighboring disparities and their corresponding relative depths is first checked to obtain disparity confidence. Afterwards, quasi-dense correspondences are built using only confident disparity estimates to facilitate efficient depth ranking learning. Finally, a dual disparity smoothness loss is proposed to boost stereo matching performance at disparity discontinuities. Experimental results demonstrate that our method achieves state-of-the-art stereo matching accuracy on the KITTI Stereo benchmarks among all unsupervised stereo matching methods.
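The confidence check described above can be illustrated with a toy sketch on a single scanline. Everything here is an assumption made for illustration: the function names, the use of only left and right neighbors, and the convention that a larger relative-depth value means a farther point (so a nearer pixel should have the larger disparity). It is not the authors' algorithm.

```python
from typing import List, Tuple


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def disparity_confidence(disparity: List[float], rel_depth: List[float]) -> List[float]:
    """Fraction of neighbors whose disparity ordering matches the depth ordering.

    Assumption: larger rel_depth means farther away, so the disparity ordering
    between two neighbors should be the reverse of their depth ordering.
    """
    n = len(disparity)
    conf = []
    for i in range(n):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n]
        agree = sum(
            1
            for j in neighbors
            if sign(disparity[i] - disparity[j]) == -sign(rel_depth[i] - rel_depth[j])
        )
        conf.append(agree / len(neighbors) if neighbors else 0.0)
    return conf


def confident_pairs(conf: List[float], threshold: float = 1.0) -> List[Tuple[int, int]]:
    """Adjacent pixel pairs in which both pixels meet the confidence threshold."""
    return [
        (i, i + 1)
        for i in range(len(conf) - 1)
        if conf[i] >= threshold and conf[i + 1] >= threshold
    ]
```

Only the pairs returned by `confident_pairs` would then feed a depth ranking loss, which is the sense in which unreliable disparity estimates are kept out of the supervision signal.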

[150] GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification

Ngoc Bui Lam Quang,Nam Le Nguyen Binh,Thanh-Huy Nguyen,Le Thien Phuc Nguyen,Quan Nguyen,Ulas Bagci

Main category: cs.CV

TL;DR: This study proposes a vision-language multiple instance learning framework that combines multi-agent clinical description generation with multi-description text encoding, improving whole slide image classification performance.

Details Motivation: Existing multiple instance learning (MIL) methods based on vision-language models (VLMs) are limited, when generating clinical descriptions, by the domain accuracy and expressiveness of large language models (LLMs). In addition, the fixed-length prompts of VLMs restrict the expression of complex pathology concepts, leading to poor alignment between visual and textual information. Method: A new vision-language MIL framework is proposed with two core contributions: (1) a multi-agent description generation system grounded in pathology textbooks and agent specialization; (2) a text encoding strategy that uses multiple descriptions rather than a single prompt to align better with visual features. Result: The method outperforms single-prompt baselines on renal and lung cancer datasets and achieves performance comparable to current state-of-the-art models. Conclusion: By introducing a clinical description generation system based on pathology textbooks and multi-agent collaboration, together with a multi-description text encoding strategy, the study improves VLM performance on whole slide image classification, particularly on renal and lung cancer datasets. Abstract: Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.

[151] Domain Generalized Stereo Matching with Uncertainty-guided Data Augmentation

Shuangli Du,Jing Wang,Minghua Zhao,Zhenyu Xu,Jie Li

Main category: cs.CV

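TL;DR: This paper proposes a new uncertainty-guided data augmentation method (UgDA) to improve the generalization of stereo matching models to cross-domain data.

Details Motivation: State-of-the-art stereo matching models trained on synthetic data often fail to generalize to real data domains, mainly because of domain differences in color, illumination, contrast, and texture. A new data augmentation method is therefore needed so that models acquire robust cross-domain feature representations instead of relying on domain-specific shortcuts. Method: The paper generates samples from different domains by perturbing image statistics (mean and standard deviation) in RGB space, and uses Gaussian distributions based on batch-level statistics to model the uncertainty of perturbation direction and intensity, thereby expanding the training domain. It also encourages feature consistency between original and augmented data, so that the model learns feature representations that are sensitive to structure and unaffected by domain characteristics. Result: Extensive experiments on several challenging benchmarks show that the method significantly improves the generalization performance of existing stereo matching networks. Conclusion: The paper proposes an uncertainty-guided data augmentation method (UgDA) to improve the generalization of stereo matching models across domains. The method is simple and applicable to any stereo matching network, and experiments show that it significantly improves the generalization performance of existing stereo matching networks. Abstract: State-of-the-art stereo matching (SM) models trained on synthetic data often fail to generalize to real data domains due to domain differences, such as color, illumination, contrast, and texture. To address this challenge, we leverage data augmentation to expand the training domain, encouraging the model to acquire robust cross-domain feature representations instead of domain-dependent shortcuts. This paper proposes an uncertainty-guided data augmentation (UgDA) method, which argues that the image statistics in RGB space (mean and standard deviation) carry the domain characteristics. Thus, samples in unseen domains can be generated by properly perturbing these statistics. Furthermore, to simulate more potential domains, Gaussian distributions founded on batch-level statistics are proposed to model the uncertainty of perturbation direction and intensity. Additionally, we enforce feature consistency between original and augmented data for the same scene, encouraging the model to learn structure-aware, shortcut-invariant feature representations. Our approach is simple, architecture-agnostic, and can be integrated into any SM network. Extensive experiments on several challenging benchmarks have demonstrated that our method can significantly improve the generalization performance of existing SM networks.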
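A minimal sketch of the statistics perturbation may help make the idea concrete. It assumes one flattened color channel per sample and uses the batch-level standard deviation of the per-sample means and standard deviations as the scale of the Gaussian noise. The function name `perturb_statistics` and this particular choice of scale are assumptions for illustration, not details taken from the paper.

```python
import random
import statistics
from typing import List, Tuple

EPS = 1e-8  # guards against division by zero for constant inputs


def perturb_statistics(
    batch: List[List[float]], rng: random.Random
) -> Tuple[List[List[float]], List[Tuple[float, float]]]:
    """Re-style each sample by perturbing its mean and standard deviation.

    batch holds one flattened color channel per sample (a real pipeline would
    repeat this per RGB channel). Returns the augmented samples and the
    (mean, std) targets drawn for them.
    """
    means = [statistics.fmean(x) for x in batch]
    stds = [statistics.pstdev(x) for x in batch]

    # Assumption: the batch-level spread of each statistic sets the noise scale.
    spread_mean = statistics.pstdev(means)
    spread_std = statistics.pstdev(stds)

    augmented, targets = [], []
    for x, mu, sigma in zip(batch, means, stds):
        # Gaussian draw: the sign gives the direction, the magnitude the intensity.
        new_mu = mu + rng.gauss(0.0, 1.0) * spread_mean
        new_sigma = max(sigma + rng.gauss(0.0, 1.0) * spread_std, 0.0)
        normalized = [(v - mu) / (sigma + EPS) for v in x]
        augmented.append([new_sigma * v + new_mu for v in normalized])
        targets.append((new_mu, new_sigma))
    return augmented, targets
```

Because the normalization removes the original mean and standard deviation before the new ones are applied, the scene structure is preserved while the appearance statistics shift. In a stereo setting, one would presumably apply the same draw to both views of a pair.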

[152] C3D-AD: Toward Continual 3D Anomaly Detection via Kernel Attention with Learnable Advisor

Haoquan Lu,Hanzhe Liang,Jie Zhang,Chenxi Hu,Jinbao Wang,Can Gao

Main category: cs.CV

TL;DR: This paper introduces C3D-AD, a continual learning framework for 3D anomaly detection that adapts to new product classes over time while maintaining performance on previous ones.

Details Motivation: Existing 3D anomaly detection methods are class-specific and unable to learn from emerging classes over time, limiting their applicability in dynamic industrial environments. Method: The C3D-AD framework employs three key components: Kernel Attention with random feature Layer (KAL) for generalized feature extraction, Kernel Attention with learnable Advisor (KAA) for efficient reconstruction, and Reconstruction with Parameter Perturbation (RPP) to maintain representation consistency across tasks. Result: The proposed method achieved average AUROC scores of 66.4%, 83.1%, and 63.4% on the Real3D-AD, Anomaly-ShapeNet, and MulSen-AD datasets, respectively, demonstrating its effectiveness. Conclusion: The proposed C3D-AD framework effectively addresses the challenges of 3D anomaly detection by enabling continual learning, preserving representation consistency, and adapting to new classes while discarding redundant old information. Abstract: 3D Anomaly Detection (AD) has shown great potential in detecting anomalies or defects of high-precision industrial products. However, existing methods are typically trained in a class-specific manner and also lack the capability of learning from emerging classes. In this study, we propose a continual learning framework named Continual 3D Anomaly Detection (C3D-AD), which can not only learn generalized representations for multi-class point clouds but also handle new classes emerging over time. Specifically, in the feature extraction module, to extract generalized local features from diverse product types of different tasks efficiently, Kernel Attention with random feature Layer (KAL) is introduced, which normalizes the feature space. Then, to reconstruct data correctly and continually, an efficient Kernel Attention with learnable Advisor (KAA) mechanism is proposed, which learns the information from new categories while discarding redundant old information within both the encoder and decoder.
Finally, to keep the representation consistency over tasks, a Reconstruction with Parameter Perturbation (RPP) module is proposed by designing a representation rehearsal loss function, which ensures that the model remembers previous category information and returns category-adaptive representation. Extensive experiments on three public datasets demonstrate the effectiveness of the proposed method, achieving an average performance of 66.4%, 83.1%, and 63.4% AUROC on Real3D-AD, Anomaly-ShapeNet, and MulSen-AD, respectively.

[153] P3P Made Easy

Seong Hun Lee,Patrick Vandewalle,Javier Civera

Main category: cs.CV

TL;DR: A novel algebraic solution to the Perspective-Three-Point (P3P) problem is presented, achieving both accuracy and efficiency while being suitable for real-time systems and educational purposes.

Details Motivation: The motivation is to recover the absolute pose of a calibrated camera from three 2D-3D correspondences in a way that is interpretable, reliable, and efficient. Method: The method reformulates the P3P problem into a quartic polynomial with analytically simple and computationally efficient coefficients. Result: The proposed solver achieves accuracy and runtime performance comparable to state-of-the-art methods, as validated by extensive experiments on synthetic datasets. Conclusion: The proposed solver is appealing for both real-time systems and educational contexts due to its simplicity, performance, robustness, and efficiency. Abstract: We present a novel algebraic solution to the Perspective-Three-Point (P3P) problem, which aims to recover the absolute pose of a calibrated camera from three 2D-3D correspondences. Our method reformulates the problem into a quartic polynomial with coefficients that are analytically simple and computationally efficient. Despite its simplicity, the proposed solver achieves accuracy and runtime performance comparable to state-of-the-art methods. Extensive experiments on synthetic datasets validate its robustness and efficiency. This combination of simplicity and performance makes our solver appealing for both real-time systems and educational contexts, where interpretability and reliability are critical.
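For context, the classical formulation of P3P (not the specific reformulation of this paper, whose coefficients the abstract does not give) can be written with the law of cosines. The symbols are introduced here for illustration: $X_i$ are the three known 3D points, $d_i$ the unknown distances from the camera center to them, and $\theta_{ij}$ the angle between the bearing vectors of points $i$ and $j$, which is known from the calibrated image measurements.

```latex
\begin{aligned}
d_1^2 + d_2^2 - 2\, d_1 d_2 \cos\theta_{12} &= \lVert X_1 - X_2 \rVert^2, \\
d_1^2 + d_3^2 - 2\, d_1 d_3 \cos\theta_{13} &= \lVert X_1 - X_3 \rVert^2, \\
d_2^2 + d_3^2 - 2\, d_2 d_3 \cos\theta_{23} &= \lVert X_2 - X_3 \rVert^2.
\end{aligned}
```

Eliminating two of the three unknowns from this system leads to a quartic polynomial in a single variable, which is why P3P admits up to four real solutions and why solvers differ mainly in how simple and numerically stable the quartic coefficients are.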

[154] Multimodal Attention-Aware Fusion for Diagnosing Distal Myopathy: Evaluating Model Interpretability and Clinician Trust

Mohsen Abbaspour Onari,Lucie Charlotte Magister,Yaoxin Wu,Amalia Lupi,Dario Creazzo,Mattia Tordin,Luigi Di Donatantonio,Emilio Quaia,Chao Zhang,Isel Grau,Marco S. Nobile,Yingqian Zhang,Pietro Liò

Main category: cs.CV

TL;DR: This paper proposes a new multimodal attention-aware fusion architecture for diagnosing distal myopathy. It combines global and local features, improving classification accuracy and interpretability, though room for improvement remains.

Details Motivation: Distal myopathy is a genetically heterogeneous group of skeletal muscle disorders with broad clinical manifestations, which makes radiological diagnosis challenging. An effective diagnostic method is therefore needed. Method: A new multimodal attention-aware fusion architecture is proposed that combines features extracted by two deep learning models, one capturing global contextual information and the other focusing on local details, and integrates them through an attention gate mechanism to enhance predictive performance and interpretability. Result: The method achieves high classification accuracy on the BUSI benchmark and a proprietary distal myopathy dataset and generates clinically relevant saliency maps, and its interpretability is assessed through functionally grounded and application-grounded evaluation. Conclusion: Although the proposed fusion strategy improves predictive performance, gaps remain in anatomical specificity and clinical usefulness for medical diagnosis, underscoring the need for richer, context-aware interpretability methods and clinical feedback mechanisms to meet real-world needs. Abstract: Distal myopathy represents a genetically heterogeneous group of skeletal muscle disorders with broad clinical manifestations, posing diagnostic challenges in radiology. To address this, we propose a novel multimodal attention-aware fusion architecture that combines features extracted from two distinct deep learning models, one capturing global contextual information and the other focusing on local details, representing complementary aspects of the input data. Uniquely, our approach integrates these features through an attention gate mechanism, enhancing both predictive performance and interpretability. Our method achieves a high classification accuracy on the BUSI benchmark and a proprietary distal myopathy dataset, while also generating clinically relevant saliency maps that support transparent decision-making in medical diagnosis. We rigorously evaluated interpretability through (1) functionally grounded metrics, coherence scoring against reference masks and incremental deletion analysis, and (2) application-grounded validation with seven expert radiologists. While our fusion strategy boosts predictive performance relative to single-stream and alternative fusion strategies, both quantitative and qualitative evaluations reveal persistent gaps in anatomical specificity and clinical usefulness of the interpretability. These findings highlight the need for richer, context-aware interpretability methods and human-in-the-loop feedback to meet clinicians' expectations in real-world diagnostic settings.

[155] Referring Remote Sensing Image Segmentation with Cross-view Semantics Interaction Network

Jiaxing Yang,Lihe Zhang,Huchuan Lu

Main category: cs.CV

TL;DR: This paper proposes CSINet, a new framework for remote sensing image segmentation that integrates information from remote and close views with enhanced attention mechanisms, improving segmentation of targets at different scales, especially tiny and ambiguous ones.

Details Motivation: Existing methods struggle with tiny, ambiguous targets in remote sensing images, so CSINet is proposed to address these problems. Method: The Cross-view Semantics Interaction Network (CSINet) is proposed, comprising a Cross-View Window-attention module (CVWin) and a Collaboratively Dilated Attention enhanced Decoder (CDAD). Result: CSINet achieves significant improvements in segmenting remote sensing targets and handles variation in target scale effectively. Conclusion: CSINet provides a paralleled yet unified framework for RRSIS that enhances the exploitation of global and local semantics, achieving significant performance gains while maintaining satisfactory speed. Abstract: Recently, Referring Remote Sensing Image Segmentation (RRSIS) has aroused wide attention. To handle drastic scale variation of remote targets, existing methods only use the full image as input and nest the saliency-preferring techniques of cross-scale information interaction into traditional single-view structure. Although effective for visually salient targets, they still struggle in handling tiny, ambiguous ones in lots of real scenarios. In this work, we instead propose a paralleled yet unified segmentation framework Cross-view Semantics Interaction Network (CSINet) to solve the limitations. Motivated by human behavior in observing targets of interest, the network orchestrates visual cues from remote and close distances to conduct synergistic prediction. In its every encoding stage, a Cross-View Window-attention module (CVWin) is utilized to supplement global and local semantics into close-view and remote-view branch features, finally promoting the unified representation of feature in every encoding stage. In addition, we develop a Collaboratively Dilated Attention enhanced Decoder (CDAD) to mine the orientation property of target and meanwhile integrate cross-view multiscale features. The proposed network seamlessly enhances the exploitation of global and local semantics, achieving significant improvements over others while maintaining satisfactory speed.

[156] Zero-shot Segmentation of Skin Conditions: Erythema with Edit-Friendly Inversion

Konstantinos Moutselos,Ilias Maglogiannis

Main category: cs.CV

TL;DR: This paper introduces a zero-shot image segmentation approach for detecting skin redness (erythema) using diffusion models and color analysis, eliminating the need for labeled training data and improving upon traditional threshold-based methods.

Details Motivation: The motivation is to reduce reliance on labeled dermatological datasets and provide a scalable, flexible diagnostic tool for erythema detection that doesn't require annotated training masks. Method: The method involves a zero-shot image segmentation framework using edit-friendly inversion in diffusion models to synthesize erythema-free reference images, which are then aligned with original images for color-space analysis with minimal user input. Result: The framework successfully isolated facial erythema across diverse cases, showing improved performance over baseline threshold-based techniques in initial qualitative experiments. Conclusion: The study concludes that the proposed zero-shot image segmentation framework effectively detects erythema without requiring labeled datasets, highlighting the potential of combining generative diffusion models and statistical color segmentation in computer-aided dermatology. Abstract: This study proposes a zero-shot image segmentation framework for detecting erythema (redness of the skin) using edit-friendly inversion in diffusion models. The method synthesizes reference images of the same patient that are free from erythema via generative editing and then accurately aligns these references with the original images. Color-space analysis is performed with minimal user intervention to identify erythematous regions. This approach significantly reduces the reliance on labeled dermatological datasets while providing a scalable and flexible diagnostic support tool by avoiding the need for any annotated training masks. In our initial qualitative experiments, the pipeline successfully isolated facial erythema in diverse cases, demonstrating performance improvements over baseline threshold-based techniques. These results highlight the potential of combining generative diffusion models and statistical color segmentation for computer-aided dermatology, enabling efficient erythema detection without prior training data.

Lingxiao Chen,Liqin Wang,Wei Lu

Main category: cs.CV

TL;DR: StyleSentinel is a new method based on artistic style fingerprints for protecting the copyright of artists' work, achieving efficient verification through semantic self-reconstruction and feature fusion.

Details Motivation: Existing copyright protection methods that rely on embedding additional information have limited defensive capability and struggle to protect artwork published online, so a more effective verification method based on artistic style fingerprints is needed. Method: A semantic self-reconstruction process enhances stylistic expressiveness, multi-layer image features are fused into a compact stylistic fingerprint, and the artist's style is modeled as a minimal enclosing hypersphere boundary in the feature space. Result: StyleSentinel outperforms existing methods in experiments and was successfully applied to copyright verification on online platforms. Conclusion: StyleSentinel proposes a copyright protection method based on artistic style fingerprints that performs well on the one-sample verification task, and its effectiveness is demonstrated through online platforms. Abstract: The versatility of diffusion models in generating customized images has led to unauthorized usage of personal artwork, which poses a significant threat to the intellectual property of artists. Existing approaches relying on embedding additional information, such as perturbations, watermarks, and backdoors, suffer from limited defensive capabilities and fail to protect artwork published online. In this paper, we propose StyleSentinel, an approach for copyright protection of artwork by verifying an inherent stylistic fingerprint in the artist's artwork. Specifically, we employ a semantic self-reconstruction process to enhance stylistic expressiveness within the artwork, which establishes a dense and style-consistent manifold foundation for feature learning. Subsequently, we adaptively fuse multi-layer image features to encode abstract artistic style into a compact stylistic fingerprint. Finally, we model the target artist's style as a minimal enclosing hypersphere boundary in the feature space, transforming complex copyright verification into a robust one-class learning task. Extensive experiments demonstrate that compared with the state-of-the-art, StyleSentinel achieves superior performance on the one-sample verification task. We also demonstrate the effectiveness through online platforms.
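The one-class decision rule can be sketched as follows. This is a deliberate simplification: it uses the centroid of the known fingerprints as the center and the largest distance to it as the radius, which encloses all training points but is not guaranteed to be the minimal enclosing hypersphere that the abstract describes. The function names and the `margin` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fit_hypersphere(fingerprints: Sequence[Sequence[float]]) -> Tuple[List[float], float]:
    """Centroid-based enclosing hypersphere (a simplification, not the minimal one)."""
    dim = len(fingerprints[0])
    count = len(fingerprints)
    center = [sum(f[k] for f in fingerprints) / count for k in range(dim)]
    radius = max(math.dist(f, center) for f in fingerprints)
    return center, radius


def is_same_style(
    fingerprint: Sequence[float],
    center: Sequence[float],
    radius: float,
    margin: float = 0.0,
) -> bool:
    """Accept a fingerprint if it falls inside the (optionally enlarged) sphere."""
    return math.dist(fingerprint, center) <= radius + margin
```

Verification then reduces to a single distance comparison per query image, which is what turns copyright verification into a one-class problem: only the target artist's fingerprints are needed to fit the boundary.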

[158] Weakly-Supervised Image Forgery Localization via Vision-Language Collaborative Reasoning Framework

Ziqi Sheng,Junyan Wu,Wei Lu,Jiantao Zhou

Main category: cs.CV

TL;DR: This paper proposes ViLaCo, a weakly supervised image forgery localization framework based on vision-language collaborative reasoning. By introducing semantic supervision from pre-trained vision-language models and a contrastive patch consistency module, it achieves accurate pixel-level localization using only image-level labels.

Details Motivation: Existing weakly supervised image forgery localization methods rely mainly on intra-image consistency clues and lack external semantic guidance, which limits localization performance. Method: A vision-language collaborative reasoning framework named ViLaCo is proposed, comprising a vision-language feature modeling network, an adaptive vision-language reasoning network, and a contrastive patch consistency module. Result: Extensive experiments on multiple public datasets show that ViLaCo substantially outperforms existing weakly supervised image forgery localization methods in both detection and localization accuracy. Conclusion: ViLaCo achieves state-of-the-art performance in weakly supervised image forgery localization, substantially outperforming existing methods by introducing external semantic supervision and a contrastive patch consistency module. Abstract: Image forgery localization aims to precisely identify tampered regions within images, but it commonly depends on costly pixel-level annotations. To alleviate this annotation burden, weakly supervised image forgery localization (WSIFL) has emerged, yet existing methods still achieve limited localization performance as they mainly exploit intra-image consistency clues and lack external semantic guidance to compensate for weak supervision. In this paper, we propose ViLaCo, a vision-language collaborative reasoning framework that introduces auxiliary semantic supervision distilled from pre-trained vision-language models (VLMs), enabling accurate pixel-level localization using only image-level labels. Specifically, ViLaCo first incorporates semantic knowledge through a vision-language feature modeling network, which jointly extracts textual and visual priors using pre-trained VLMs. Next, an adaptive vision-language reasoning network aligns textual semantics and visual features through mutual interactions, producing semantically aligned representations. Subsequently, these representations are passed into dual prediction heads, where the coarse head performs image-level classification and the fine head generates pixel-level localization masks, thereby bridging the gap between weak supervision and fine-grained localization. Moreover, a contrastive patch consistency module is introduced to cluster tampered features while separating authentic ones, facilitating more reliable forgery discrimination. Extensive experiments on multiple public datasets demonstrate that ViLaCo substantially outperforms existing WSIFL methods, achieving state-of-the-art performance in both detection and localization accuracy.

[159] SBP-YOLO: A Lightweight Real-Time Model for Detecting Speed Bumps and Potholes

Chuanqi Liang,Jie Fu,Lei Luo,Miao Yu

Main category: cs.CV

TL;DR: This paper proposes SBP-YOLO, an optimized lightweight detection framework based on YOLOv11, for real-time speed bump and pothole detection in new energy vehicles, achieving high accuracy and performance on embedded systems.

Details Motivation: The increasing demand for ride comfort in new energy vehicles necessitates accurate real-time detection of road anomalies like speed bumps and potholes for predictive suspension control. Method: The paper introduces SBP-YOLO, a lightweight detection framework based on YOLOv11. It incorporates GhostConv for efficient computation, VoVGSCSPC for multi-scale feature enhancement, and a Lightweight Efficiency Detection Head (LEDH) to reduce feature processing costs. A hybrid training strategy is also used. Result: SBP-YOLO achieves 87.0% mAP, outperforming YOLOv11n by 5.8%, and runs at 139.5 FPS on a Jetson AGX Xavier with TensorRT FP16 quantization. Conclusion: The proposed SBP-YOLO framework effectively improves the accuracy and efficiency of speed bump and pothole detection for real-time road condition perception in intelligent suspension systems. Abstract: With increasing demand for ride comfort in new energy vehicles, accurate real-time detection of speed bumps and potholes is critical for predictive suspension control. This paper proposes SBP-YOLO, a lightweight detection framework based on YOLOv11, optimized for embedded deployment. The model integrates GhostConv for efficient computation, VoVGSCSPC for multi-scale feature enhancement, and a Lightweight Efficiency Detection Head (LEDH) to reduce early-stage feature processing costs. A hybrid training strategy combining NWD loss, knowledge distillation, and Albumentations-based weather augmentation improves detection robustness, especially for small and distant targets. Experiments show SBP-YOLO achieves 87.0% mAP (outperforming YOLOv11n by 5.8%) and runs at 139.5 FPS on a Jetson AGX Xavier with TensorRT FP16 quantization. The results validate its effectiveness for real-time road condition perception in intelligent suspension systems.

[160] Predicting Video Slot Attention Queries from Random Slot-Feature Pairs

Rongzhen Zhao,Jian Li,Juho Kannala,Joni Pajarinen

Main category: cs.CV

TL;DR: RandSF.Q improves video object-centric learning by incorporating next frame features and learning transition dynamics, leading to better performance in scene representation and dynamics modeling.

Details Motivation: Existing video OCL methods neglect next frame features and fail to learn transition dynamics, which are crucial for query prediction. Method: RandSF.Q introduces a new transitioner that incorporates both slots and features, and learns transition dynamics by predicting queries from randomly sampled slot-feature pairs. Result: Experiments show that RandSF.Q achieves up to 10 points improvement on object discovery, setting a new state-of-the-art. Conclusion: The proposed RandSF.Q method significantly outperforms existing video OCL methods in scene representation and downstream tasks like dynamics modeling. Abstract: Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling as we humans do. Mainstream video OCL methods adopt a recurrent architecture: an aggregator aggregates the current video frame into object features, termed slots, under some queries; a transitioner transits current slots to queries for the next frame. This is an effective architecture, but all existing implementations both (i1) neglect to incorporate next frame features, the most informative source for query prediction, and (i2) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (t1) We design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (t2) We train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting a new state of the art. Such superiority also benefits downstream tasks like dynamics modeling.
Our core source code and training logs are available as the supplement.

[161] Effective Damage Data Generation by Fusing Imagery with Human Knowledge Using Vision-Language Models

Jie Wei,Erika Ardiles-Cruz,Aleksey Panasyuk,Erik Blasch

Main category: cs.CV

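TL;DR: This study proposes a vision-language model-based method that fuses imagery with human knowledge understanding to improve the accuracy of disaster damage assessment, especially when data are imbalanced and labels are inaccurate.

Details Motivation: Current deep learning methods in humanitarian assistance and disaster response (HADR) struggle to generalize effectively because of class imbalance, scarcity of moderate damage examples, and human error in pixel labeling. A more effective approach to damage assessment is therefore needed. Method: State-of-the-art vision-language models (VLMs) are used to fuse imagery with human knowledge understanding and generate a diversified set of image-based damage data. Result: Initial experimental results suggest that the proposed data generation method helps improve damage classification quality. Conclusion: Using VLMs to fuse imagery with human knowledge understanding to generate diversified image-based damage data can improve the accuracy of classifying different levels of structural damage to buildings, roads, and infrastructure. Abstract: It is of crucial importance to assess damages promptly and accurately in humanitarian assistance and disaster response (HADR). Current deep learning approaches struggle to generalize effectively due to the imbalance of data classes, scarcity of moderate damage examples, and human inaccuracy in pixel labeling during HADR situations. To address these limitations, state-of-the-art vision-language models (VLMs) can be used to fuse imagery with human knowledge understanding, which offers an opportunity to generate a diversified set of image-based damage data effectively. Our initial experimental results suggest encouraging data generation quality, which demonstrates an improvement in classifying scenes with different levels of structural damage to buildings, roads, and infrastructure.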

[162] A Full-Stage Refined Proposal Algorithm for Suppressing False Positives in Two-Stage CNN-Based Detection Methods

Qiang Guo,Rubo Zhang,Bingbing Zhang,Junjie Liu,Jianqing Liu

Main category: cs.CV

TL;DR: This paper introduces the FRP algorithm to reduce false positives in pedestrian detection by refining proposals during both training and inference, achieving better performance on multiple datasets and resource-constrained edge devices.

Details Motivation: False positives in pedestrian detection remain a significant challenge, particularly for real-world applications where accuracy and resource efficiency are critical. Method: The paper proposes the Full-stage Refined Proposal (FRP) algorithm, which includes three strategies: TFRP for training, CFRP for integrating a pedestrian classifier during inference, and SFRP for vertically splitting proposals to evaluate sub-region confidence. Result: Experiments on multiple benchmarks and the SY-Metro dataset show that the FRP algorithm significantly reduces false positives. Embedded platform tests confirm its effectiveness in improving small pedestrian detection under resource constraints. Conclusion: The FRP algorithm effectively suppresses false positives in pedestrian detection, enhancing detection capabilities, especially on resource-constrained devices. Abstract: False positives in pedestrian detection remain a challenge that has yet to be effectively resolved. To address this issue, this paper proposes a Full-stage Refined Proposal (FRP) algorithm aimed at eliminating these false positives within a two-stage CNN-based pedestrian detection framework. The main innovation of this work lies in employing various pedestrian feature re-evaluation strategies to filter out low-quality pedestrian proposals during both the training and testing stages. Specifically, in the training phase, the Training mode FRP algorithm (TFRP) introduces a novel approach for validating pedestrian proposals to effectively guide the model training process, thereby constructing a model with strong capabilities for false positive suppression. 
During the inference phase, two innovative strategies are implemented: the Classifier-guided FRP (CFRP) algorithm integrates a pedestrian classifier into the proposal generation pipeline to yield high-quality proposals through pedestrian feature evaluation, and the Split-proposal FRP (SFRP) algorithm vertically divides all proposals, sending both the original and the sub-region proposals to the subsequent subnetwork to evaluate their confidence scores, filtering out those with lower sub-region pedestrian confidence scores. As a result, the proposed algorithm enhances the model's ability to suppress pedestrian false positives across all stages. Various experiments conducted on multiple benchmarks and the SY-Metro datasets demonstrate that the model, supported by different combinations of the FRP algorithm, can effectively eliminate false positives to varying extents. Furthermore, experiments conducted on embedded platforms underscore the algorithm's effectiveness in enhancing the comprehensive pedestrian detection capabilities of the small pedestrian detector in resource-constrained edge devices.

[163] Lightweight Backbone Networks Only Require Adaptive Lightweight Self-Attention Mechanisms

Fengyun Li,Chao Zheng,Yangyang Fang,Jialiang Lan,Jianhua Liang,Luhao Zhang,Fa Si

Main category: cs.CV

TL;DR: This paper proposes LOLViT, a lightweight hybrid backbone network combining Fast Window Attention (FWA) with GhostNet, achieving improved efficiency and accuracy in visual tasks.

Details Motivation: The imbalance in computational efficiency between CNNs and attention mechanisms in hybrid models necessitates a lightweight and efficient solution for long-sequence modeling. Method: This paper introduces Fast Window Attention (FWA), which adaptively adjusts feature map sizes for efficient SoftMax attention computation, and combines it with GhostNet alongside a global-local feature fusion mechanism. Result: LOLViT demonstrated superior performance in visual tasks (e.g., ImageNet 1K classification, COCO 2017 detection, and BDD100K segmentation) and showed that LOLViT-X achieves 5x faster inference speed compared to MobileViT-X. Conclusion: The proposed LOLViT network, integrating Fast Window Attention with GhostNet, outperforms CNN-based models in both inference speed and accuracy across tasks like classification, detection, and segmentation. Abstract: Currently, lightweight hybrid backbone networks have partially alleviated the issue of computational saturation, but the imbalance in computational efficiency between convolutional neural networks (CNNs) and attention mechanisms is becoming increasingly apparent. Specifically, although linear attention mechanisms and their variants have made progress in lightweight design, they still fail to meet the demands of hybrid models for long-sequence modeling. On the other hand, existing lightweight SoftMax attention computations typically reduce the feature map to a fixed size to decrease the number of sequences, thereby compressing the computational scale. However, the process of determining the feature map reduction ratio is cumbersome, and computational saturation issues still persist. To address this issue, this paper proposes a lightweight SoftMax attention mechanism with adaptive feature map sizes, named Fast Window Attention (FWA), which generates a small number of key sequences (Key and Value) through window aggregation for attention computation.
Additionally, it explains the rationality of using ReLU to simulate SoftMax operations in lightweight global attention mechanisms. Finally, the paper designs a global-local feature fusion mechanism and combines it with GhostNet to propose a lightweight hybrid backbone network, LOLViT. Through visual tasks such as classification (ImageNet 1K), detection (COCO 2017), and segmentation (BDD100K), along with extensive ablation studies, it is demonstrated that LOLViT outperforms CNN models of the same level in both inference speed and model accuracy. Notably, the inference speed of LOLViT-X is 5x that of MobileViT-X.
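The general idea of attending over window-aggregated keys and values can be sketched in a few lines. The sketch assumes mean pooling over fixed, non-overlapping 1D windows and omits learned projections entirely. The function names are hypothetical, and the adaptive sizing that FWA itself uses is not reproduced here.

```python
import math
from typing import List

Vector = List[float]


def window_aggregate(tokens: List[Vector], window: int) -> List[Vector]:
    """Average each non-overlapping window of tokens into a single token."""
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_attention(queries: List[Vector], tokens: List[Vector], window: int) -> List[Vector]:
    """Softmax attention of every query over the pooled keys and values."""
    keys = values = window_aggregate(tokens, window)  # no learned projections here
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs
```

With N queries and a window of size w, the score matrix has N × ⌈N/w⌉ entries instead of N × N, which is where the savings of this family of methods come from.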

[164] Construction of Digital Terrain Maps from Multi-view Satellite Imagery using Neural Volume Rendering

Josef X. Biberstein,Guilherme Cavalheiro,Juyeop Han,Sertac Karaman

Main category: cs.CV

TL;DR: This paper introduces Neural Terrain Maps (NTM), a method using neural volume rendering to generate high-quality DTMs directly from satellite imagery without relying on depth or structural priors.

Details Motivation: High-quality digital terrain maps (DTMs) are crucial for planetary exploration, but existing multi-view stereo pipelines for satellite imagery can be cumbersome and require significant manual preprocessing. Method: The method adapts neural volume rendering techniques to learn textured digital terrain maps directly from satellite imagery, using only the locus for each image pixel. Result: The method was evaluated on synthetic and real satellite data from Earth and Mars, showing terrain prediction precision almost equal to the resolution of satellite images, even with imperfect camera parameters. Conclusion: The proposed Neural Terrain Maps (NTM) method demonstrates promising results in generating high-quality DTMs directly from satellite imagery, without requiring depth or structural priors. Abstract: Digital terrain maps (DTMs) are an important part of planetary exploration, enabling operations such as terrain relative navigation during entry, descent, and landing for spacecraft and aiding in navigation on the ground. As robotic exploration missions become more ambitious, the need for high quality DTMs will only increase. However, producing DTMs via multi-view stereo pipelines for satellite imagery, the current state-of-the-art, can be cumbersome and require significant manual image preprocessing to produce satisfactory results. In this work, we seek to address these shortcomings by adapting neural volume rendering techniques to learn textured digital terrain maps directly from satellite imagery. Our method, neural terrain maps (NTM), only requires the locus for each image pixel and does not rely on depth or any other structural priors. We demonstrate our method on both synthetic and real satellite data from Earth and Mars encompassing scenes on the order of $100 \textrm{km}^2$. We evaluate the accuracy of our output terrain maps by comparing with existing high-quality DTMs produced using traditional multi-view stereo pipelines. 
Our method shows promising results, with the precision of terrain prediction almost equal to the resolution of the satellite images even in the presence of imperfect camera intrinsics and extrinsics.

[165] Video-based Vehicle Surveillance in the Wild: License Plate, Make, and Model Recognition with Self Reflective Vision-Language Models

Pouya Parsa,Keya Li,Kara M. Kockelman,Seongjin Choi

Main category: cs.CV

TL;DR: This study evaluates the potential of VLMs for ALPR and make and model recognition using monocular videos captured with handheld smartphones and non-static mounted cameras.

Details Motivation: Applying ALPR methods to videos captured by handheld smartphones or non-static vehicle-mounted cameras presents unique challenges compared to fixed installations. Traditional ALPR solutions often degrade under these conditions. Method: The proposed license plate recognition pipeline filters to sharp frames, then sends a multimodal prompt to a VLM using several prompt strategies. The make and model recognition pipeline runs the same VLM with a revised prompt and an optional self-reflection module. In the self-reflection module, the model contrasts the query image with a reference from a 134-class dataset, correcting mismatches. Result: Experiments on a smartphone dataset collected on the campus of the University of Texas at Austin achieve top-1 accuracies of 91.67% for ALPR and 66.67% for make and model recognition. On the public UFPR-ALPR dataset, the approach attains 83.05% and 61.07%, respectively. The self-reflection module further improves results by 5.72% on average for make and model recognition. Conclusion: VLMs provide a cost-effective solution for scalable, in-motion traffic video analysis. Abstract: Automatic license plate recognition (ALPR) and vehicle make and model recognition underpin intelligent transportation systems, supporting law enforcement, toll collection, and post-incident investigation. Applying these methods to videos captured by handheld smartphones or non-static vehicle-mounted cameras presents unique challenges compared to fixed installations, including frequent camera motion, varying viewpoints, occlusions, and unknown road geometry. Traditional ALPR solutions, dependent on specialized hardware and handcrafted OCR pipelines, often degrade under these conditions. Recent advances in large vision-language models (VLMs) enable direct recognition of textual and semantic attributes from arbitrary imagery. 
This study evaluates the potential of VLMs for ALPR and make and model recognition using monocular videos captured with handheld smartphones and non-static mounted cameras. The proposed license plate recognition pipeline filters to sharp frames, then sends a multimodal prompt to a VLM using several prompt strategies. The make and model recognition pipeline runs the same VLM with a revised prompt and an optional self-reflection module. In the self-reflection module, the model contrasts the query image with a reference from a 134-class dataset, correcting mismatches. Experiments on a smartphone dataset collected on the campus of the University of Texas at Austin achieve top-1 accuracies of 91.67% for ALPR and 66.67% for make and model recognition. On the public UFPR-ALPR dataset, the approach attains 83.05% and 61.07%, respectively. The self-reflection module further improves results by 5.72% on average for make and model recognition. These findings demonstrate that VLMs provide a cost-effective solution for scalable, in-motion traffic video analysis.
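
The abstract does not say how sharp frames are chosen. One common heuristic, used here purely as an assumption, is to score each grayscale frame by the variance of its Laplacian and keep the highest-scoring frames; `laplacian_variance` and `select_sharp_frames` are hypothetical names for this sketch.

```python
import numpy as np

def laplacian_variance(gray):
    """Sharpness score: variance of a 4-neighbour Laplacian over interior pixels."""
    g = np.asarray(gray, dtype=float)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def select_sharp_frames(frames, keep):
    """Return the indices of the `keep` sharpest frames, in temporal order."""
    scores = [laplacian_variance(f) for f in frames]
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Blurred frames have weak second derivatives and therefore a low score, so ranking by this value is a cheap way to discard motion-blurred frames before any VLM call.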

[166] Open-Attribute Recognition for Person Retrieval: Finding People Through Distinctive and Novel Attributes

Minjeong Park,Hongbeen Park,Sangwon Lee,Yoonha Jang,Jinkyu Kim

Main category: cs.CV

TL;DR: This paper proposes an open-attribute recognition framework for person retrieval, addressing limitations in existing methods by supporting novel attributes and improving real-world applicability.

Details Motivation: Existing methods rely on a closed-set assumption and use generic attributes, which limits their effectiveness in real-world scenarios where new or more specific attributes may appear. Method: The paper proposes a novel framework that learns generalizable body part representations to support retrieving individuals based on attribute cues, even if those attributes were not seen during training. Result: The proposed framework demonstrates effectiveness in handling open-attribute recognition, as shown by comprehensive experiments conducted on four reconstructed datasets. Conclusion: The paper introduces a new framework for pedestrian attribute recognition that supports open-attribute recognition for person retrieval, showing its effectiveness through experiments. Abstract: Pedestrian Attribute Recognition (PAR) plays a crucial role in various vision tasks such as person retrieval and identification. Most existing attribute-based retrieval methods operate under the closed-set assumption that all attribute classes are consistently available during both training and inference. However, this assumption limits their applicability in real-world scenarios where novel attributes may emerge. Moreover, predefined attributes in benchmark datasets are often generic and shared across individuals, making them less discriminative for retrieving the target person. To address these challenges, we propose the Open-Attribute Recognition for Person Retrieval (OAPR) task, which aims to retrieve individuals based on attribute cues, regardless of whether those attributes were seen during training. To support this task, we introduce a novel framework designed to learn generalizable body part representations that cover a broad range of attribute categories. Furthermore, we reconstruct four widely used datasets for open-attribute recognition. Comprehensive experiments on these datasets demonstrate the necessity of the OAPR task and the effectiveness of our framework. 
The source code and pre-trained models will be publicly available upon publication.

[167] Spatial-Frequency Aware for Object Detection in RAW Image

Zhuohua Ye,Liming Zhang,Hongru Han

Main category: cs.CV

TL;DR: This paper introduces SFAE, a new framework for RAW image object detection that combines spatial and frequency domain methods to better recover suppressed object details.

Details Motivation: The motivation stems from the challenges faced by existing RAW image enhancement methods that operate solely in the spatial domain, making it difficult to recover crucial object details due to the wide dynamic range and linear response of RAW data. Method: The method involves three main components: spatialization of frequency bands, cross-domain fusion attention module, and adaptive nonlinear adjustments through gamma parameter prediction. Result: The result is a novel framework called SFAE that effectively combines spatial and frequency domain techniques to enhance object detection in RAW images by preserving physical intuition and enabling multimodal feature interactions. Conclusion: The paper concludes that by utilizing the SFAE framework, it is possible to enhance object detection from RAW images by synergizing spatial and frequency domain approaches, leading to better recovery of suppressed object details. Abstract: Direct RAW-based object detection offers great promise by utilizing RAW data (unprocessed sensor data), but faces inherent challenges due to its wide dynamic range and linear response, which tends to suppress crucial object details. In particular, existing enhancement methods are almost all performed in the spatial domain, making it difficult to effectively recover these suppressed details from the skewed pixel distribution of RAW images. To address this limitation, we turn to the frequency domain, where features, such as object contours and textures, can be naturally separated based on frequency. In this paper, we propose Space-Frequency Aware RAW Image Object Detection Enhancer (SFAE), a novel framework that synergizes spatial and frequency representations. Our contribution is threefold. The first lies in the "spatialization" of frequency bands. 
Different from the traditional paradigm of directly manipulating abstract spectra in deep networks, our method inversely transforms individual frequency bands back into tangible spatial maps, thus preserving direct physical intuition. Then the cross-domain fusion attention module is developed to enable deep multimodal interactions between these maps and the original spatial features. Finally, the framework performs adaptive nonlinear adjustments by predicting and applying different gamma parameters for the two domains.
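
The "spatialization" step, transforming individual frequency bands back into spatial maps, can be sketched with NumPy's FFT. The radial band edges, the hard masks, and the function name `spatialize_bands` are assumptions for illustration; the abstract does not specify how SFAE defines its bands.

```python
import numpy as np

def spatialize_bands(image, edges):
    """Split a 2D image into radial frequency bands, one spatial map per band.

    edges: increasing band limits in cycles per pixel, e.g. [0.0, 0.1, 0.25, 1.0].
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    maps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (radius >= lo) & (radius < hi)  # keep one ring of frequencies
        band = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
        maps.append(band)
    return maps
```

Because the masks partition the spectrum, the band maps sum back to the input image, so each map can be treated as an ordinary spatial feature and passed to a fusion module alongside the original image.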

[168] ForenX: Towards Explainable AI-Generated Image Detection with Multimodal Large Language Models

Chuangchuang Tan,Jinglu Wang,Xiang Ming,Renshuai Tao,Yunchao Wei,Yao Zhao,Yan Lu

Main category: cs.CV

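TL;DR: ForenX is a new method based on multimodal large language models (MLLMs) that, combined with a specialized forensic prompt and the ForgReason dataset, identifies the authenticity of AI-generated images and provides explanations consistent with human reasoning.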

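Details Motivation: Although many studies detect AI-generated images with classifiers, a gap remains between these methods and human cognitive forensic analysis. A new method is therefore needed that not only identifies image authenticity but also provides explanations consistent with human reasoning. Method: ForenX uses multimodal large language models (MLLMs) to analyze and interpret forensic cues, and introduces a specialized forensic prompt that directs the MLLMs' attention to forgery-indicative attributes. It also introduces the ForgReason dataset, in which descriptions of forgery evidence are produced through collaboration between an LLM-based agent and a team of human annotators, to improve model performance. Result: ForenX demonstrates its effectiveness on two major benchmarks, and its explainability is verified through comprehensive subjective evaluations. In addition, limited manual annotation significantly improves explanation quality. Conclusion: ForenX is a novel method that uses multimodal large language models (MLLMs) to analyze and interpret forensic cues, so that it not only identifies image authenticity but also provides explanations that resonate with human thinking. By introducing a specialized forensic prompt, it overcomes the limitations of standard MLLMs in forgery detection, improves the generalization of forgery detection, and enables MLLMs to provide accurate, relevant, and comprehensive explanations. Abstract: Advances in generative models have led to AI-generated images visually indistinguishable from authentic ones. Despite numerous studies on detecting AI-generated images with classifiers, a gap persists between such methods and human cognitive forensic analysis. We present ForenX, a novel method that not only identifies the authenticity of images but also provides explanations that resonate with human thoughts. ForenX employs the powerful multimodal large language models (MLLMs) to analyze and interpret forensic cues. Furthermore, we overcome the limitations of standard MLLMs in detecting forgeries by incorporating a specialized forensic prompt that directs the MLLMs' attention to forgery-indicative attributes. This approach not only enhances the generalization of forgery detection but also empowers the MLLMs to provide explanations that are accurate, relevant, and comprehensive. Additionally, we introduce ForgReason, a dataset dedicated to descriptions of forgery evidence in AI-generated images. Curated through collaboration between an LLM-based agent and a team of human annotators, this process provides refined data that further enhances our model's performance. We demonstrate that even limited manual annotations significantly improve explanation quality. We evaluate the effectiveness of ForenX on two major benchmarks. The model's explainability is verified by comprehensive subjective evaluations.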

[169] 3DRot: 3D Rotation Augmentation for RGB-Based 3D Tasks

Shitian Yang,Deyu Li,Xiaoke Jiang,Lei Zhang

Main category: cs.CV

TL;DR: The paper proposes 3DRot, a new augmentation method for RGB-based 3D tasks that enhances performance by preserving geometric consistency through camera-space transforms.

Details Motivation: RGB-based 3D tasks face challenges due to scarce annotations and limited augmentation options. This paper aims to address these issues by proposing a new augmentation technique. Method: The paper introduces 3DRot, a method that rotates and mirrors images about the camera's optical center while updating RGB images, camera intrinsics, object poses, and 3D annotations to maintain geometric consistency. Result: On the SUN RGB-D dataset, 3DRot improved $IoU_{3D}$ from 43.21 to 44.51, reduced rotation error from 22.91$^\circ$ to 20.93$^\circ$, and increased $mAP_{0.5}$ from 35.70 to 38.11. It also showed better performance than Cube R-CNN in similar testing conditions. Conclusion: 3DRot proves to be an effective augmentation method for RGB-based 3D tasks, showing significant improvements in performance metrics and demonstrating potential for wide applicability. Abstract: RGB-based 3D tasks, e.g., 3D detection, depth estimation, 3D keypoint estimation, still suffer from scarce, expensive annotations and a thin augmentation toolbox, since most image transforms, including resize and rotation, disrupt geometric consistency. In this paper, we introduce 3DRot, a plug-and-play augmentation that rotates and mirrors images about the camera's optical center while synchronously updating RGB images, camera intrinsics, object poses, and 3D annotations to preserve projective geometry, achieving geometry-consistent rotations and reflections without relying on any scene depth. We validate 3DRot with a classical 3D task, monocular 3D detection. On the SUN RGB-D dataset, 3DRot raises $IoU_{3D}$ from 43.21 to 44.51, cuts rotation error (ROT) from 22.91$^\circ$ to 20.93$^\circ$, and boosts $mAP_{0.5}$ from 35.70 to 38.11. For comparison, Cube R-CNN, which adds three other datasets to SUN RGB-D for monocular 3D estimation and uses a similar mechanism and test dataset, raises $IoU_{3D}$ from 36.2 to 37.8 and $mAP_{0.5}$ from 34.7 to 35.4. 
Because it operates purely through camera-space transforms, 3DRot is readily transferable to other 3D tasks.
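
Why a rotation about the optical center needs no scene depth follows from a standard projective geometry fact: a pure camera rotation $R$ moves pixels by the homography $H = K R K^{-1}$, where $K$ is the intrinsic matrix. The sketch below checks that warping a pixel with $H$ agrees with rotating the 3D point and re-projecting it. It keeps $K$ fixed, whereas the paper also updates intrinsics, and the helper names and example values are hypothetical.

```python
import numpy as np

def rotation_y(angle_rad):
    """Rotation matrix about the camera's y axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rotation_homography(K, R):
    """Pixel homography induced by a pure rotation R about the optical center."""
    return K @ R @ np.linalg.inv(K)

def project(K, point_cam):
    """Pinhole projection of a 3D point given in camera coordinates."""
    p = K @ point_cam
    return p[:2] / p[2]

def warp_pixel(H, pixel):
    """Apply a homography to a 2D pixel coordinate."""
    q = H @ np.array([pixel[0], pixel[1], 1.0])
    return q[:2] / q[2]
```

In an augmentation pipeline built on this identity, the image would be warped with `H` while 3D boxes and object poses are rotated by `R`, so the 2D and 3D annotations stay consistent with each other.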

[170] Capturing More: Learning Multi-Domain Representations for Robust Online Handwriting Verification

Peirong Zhang,Kai Ding,Lianwen Jin

Main category: cs.CV

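TL;DR: This paper proposes SPECTRUM, a model that combines temporal and frequency features for online handwriting verification and significantly improves verification performance.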

Details Motivation: Existing online handwriting verification methods rely mainly on temporal features alone and do not fully exploit the potential of multi-domain representation learning. The study therefore proposes SPECTRUM to exploit the synergy between temporal and frequency features. Method: SPECTRUM comprises three core components: a multi-scale interactor, a self-gated fusion module, and a multi-domain distance-based verifier, which respectively integrate temporal and frequency features, dynamically fuse global features, and improve discrimination between genuine and forged handwriting. Result: Experiments show that SPECTRUM outperforms existing methods on online handwriting verification, demonstrating the effectiveness of temporal-frequency multi-domain learning, and reveal that incorporating multiple handwritten biometrics enhances the discriminative power of handwriting representations. Conclusion: Through temporal-frequency multi-domain learning, SPECTRUM effectively improves online handwriting verification performance, validates the effectiveness of multi-domain learning in this field, and points to directions for future research. Abstract: In this paper, we propose SPECTRUM, a temporal-frequency synergistic model that unlocks the untapped potential of multi-domain representation learning for online handwriting verification (OHV). SPECTRUM comprises three core components: (1) a multi-scale interactor that finely combines temporal and frequency features through dual-modal sequence interaction and multi-scale aggregation, (2) a self-gated fusion module that dynamically integrates global temporal and frequency features via self-driven balancing. These two components work synergistically to achieve micro-to-macro spectral-temporal integration. (3) A multi-domain distance-based verifier then utilizes both temporal and frequency representations to improve discrimination between genuine and forged handwriting, surpassing conventional temporal-only approaches. Extensive experiments demonstrate SPECTRUM's superior performance over existing OHV methods, underscoring the effectiveness of temporal-frequency multi-domain learning. Furthermore, we reveal that incorporating multiple handwritten biometrics fundamentally enhances the discriminative power of handwriting representations and facilitates verification. These findings not only validate the efficacy of multi-domain learning in OHV but also pave the way for future research in multi-domain approaches across both feature and biometric domains. Code is publicly available at https://github.com/NiceRingNode/SPECTRUM.

[171] Hyperspectral Image Recovery Constrained by Multi-Granularity Non-Local Self-Similarity Priors

Zhuoran Peng,Yiqing Shen

Main category: cs.CV

TL;DR: This paper proposes a hyperspectral image recovery model using multi-granularity tensor decomposition, improving adaptability and performance in diverse missing data scenarios.

Details Motivation: Current HSI recovery methods use fixed-format factors, limiting adaptability to diverse missing scenarios. This work aims to enhance flexibility and performance by introducing multi-granularity priors. Method: The method introduces granularity in tensor decomposition, combining coarse-grained Tucker decomposition for global structure extraction and fine-grained FCTN decomposition for local detail capture. Result: Experimental results show that the proposed model achieves outstanding recovery effects in various missing scenarios, including pixels and stripes, outperforming existing methods. Conclusion: The proposed HSI recovery model effectively integrates multi-granularity non-local self-similarity priors, demonstrating strong applicability and superior recovery performance in diverse missing scenarios. Abstract: Hyperspectral image (HSI) recovery, as an upstream image processing task, holds significant importance for downstream tasks such as classification, segmentation, and detection. In recent years, HSI recovery methods based on non-local prior representations have demonstrated outstanding performance. However, these methods employ a fixed-format factor to represent the non-local self-similarity tensor groups, making them unable to adapt to diverse missing scenarios. To address this issue, we introduce the concept of granularity in tensor decomposition for the first time and propose an HSI recovery model constrained by multi-granularity non-local self-similarity priors. Specifically, the proposed model alternately performs coarse-grained decomposition and fine-grained decomposition on the non-local self-similarity tensor groups. Among them, the coarse-grained decomposition builds upon Tucker tensor decomposition, which extracts global structural information of the image by performing singular value shrinkage on the mode-unfolded matrices. 
The fine-grained decomposition employs the FCTN decomposition, capturing local detail information through modeling pairwise correlations among factor tensors. This architectural approach achieves a unified representation of global, local, and non-local priors for HSIs. Experimental results demonstrate that the model has strong applicability and exhibits outstanding recovery effects in various types of missing scenes such as pixels and stripes.
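
The coarse-grained step, singular value shrinkage on mode-unfolded matrices, can be sketched as soft-thresholding of singular values. The helpers `unfold`, `fold`, and `singular_value_shrinkage` and the threshold `tau` are illustrative names; the alternation with FCTN decomposition and the grouping of non-local similar patches are not shown.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor of the given shape."""
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(full), 0, mode)

def singular_value_shrinkage(matrix, tau):
    """Soft-threshold the singular values of `matrix` by `tau`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt
```

Applying `singular_value_shrinkage` to `unfold(t, mode)` and folding the result back pushes each unfolding toward low rank, which is how global structure is typically favored in this family of models.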

[172] Uncertainty-Aware Segmentation Quality Prediction via Deep Learning Bayesian Modeling: Comprehensive Evaluation and Interpretation on Skin Cancer and Liver Segmentation

Sikha O K,Meritxell Riera-Marín,Adrian Galdran,Javier García Lopez,Julia Rodríguez-Comas,Gemma Piella,Miguel A. González Ballester

Main category: cs.CV

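TL;DR: This paper proposes a framework for assessing medical image segmentation quality without ground-truth annotations, combining uncertainty estimation with model interpretation techniques, and significantly outperforms previous methods.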

Details Motivation: In clinical settings, assessing segmentation quality is challenging because manual annotations are unavailable, and models lacking reliability indicators are difficult to adopt. This paper aims to fill that gap. Method: Two complementary frameworks are introduced: one uses predicted segmentation and uncertainty maps, and the other combines the original input image, uncertainty maps, and predicted segmentation maps. Uncertainty is quantified for SwinUNet and a ResNet50-based Feature Pyramid Network using Bayesian methods such as Monte Carlo Dropout, Ensemble, and Test Time Augmentation, and Grad-CAM and UMAP embeddings are used to analyze model behavior. Result: The framework achieves an $R^2$ score of 93.25 and a Pearson correlation of 96.58 on the HAM10000 dataset; for 3D liver segmentation, Test Time Augmentation with entropy achieves an $R^2$ score of 85.03 and a Pearson correlation of 65.02. Conclusion: The paper proposes a novel framework for predicting image segmentation quality without ground-truth annotations at test time, and achieves reliable assessment of segmentation quality through multiple uncertainty estimation methods and model interpretation techniques. Abstract: Image segmentation is a critical step in computational biomedical image analysis, typically evaluated using metrics like the Dice coefficient during training and validation. However, in clinical settings without manual annotations, assessing segmentation quality becomes challenging, and models lacking reliability indicators face adoption barriers. To address this gap, we propose a novel framework for predicting segmentation quality without requiring ground truth annotations during test time. Our approach introduces two complementary frameworks: one leveraging predicted segmentation and uncertainty maps, and another integrating the original input image, uncertainty maps, and predicted segmentation maps. We present Bayesian adaptations of two benchmark segmentation models (SwinUNet and Feature Pyramid Network with ResNet50) using Monte Carlo Dropout, Ensemble, and Test Time Augmentation to quantify uncertainty. We evaluate four uncertainty estimates: confidence map, entropy, mutual information, and expected pairwise Kullback-Leibler divergence on 2D skin lesion and 3D liver segmentation datasets, analyzing their correlation with segmentation quality metrics. Our framework achieves an $R^2$ score of 93.25 and Pearson correlation of 96.58 on the HAM10000 dataset, outperforming previous segmentation quality assessment methods. For 3D liver segmentation, Test Time Augmentation with entropy achieves an $R^2$ score of 85.03 and a Pearson correlation of 65.02, demonstrating cross-modality robustness. 
Additionally, we propose an aggregation strategy that combines multiple uncertainty estimates into a single score per image, offering a more robust and comprehensive assessment of segmentation quality. Finally, we use Grad-CAM and UMAP-based embedding analysis to interpret the model's behavior and reliability, highlighting the impact of uncertainty integration.
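
Two of the four uncertainty estimates named above, entropy and mutual information, have standard definitions over a set of stochastic predictions (for example from Monte Carlo Dropout passes). The sketch below computes both per pixel; the function name `uncertainty_maps`, the array layout, and the `eps` smoothing term are assumptions, not details taken from the paper.

```python
import numpy as np

def uncertainty_maps(probs, eps=1e-12):
    """Per-pixel predictive entropy and mutual information.

    probs: array of shape (T, H, W, C) holding softmax outputs from
    T stochastic forward passes (layout assumed for this sketch).
    """
    mean = probs.mean(axis=0)
    # Entropy of the averaged prediction: total uncertainty.
    entropy = -(mean * np.log(mean + eps)).sum(axis=-1)
    # Average entropy of the individual predictions: data uncertainty.
    expected_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    # Their difference isolates disagreement between passes.
    mutual_info = entropy - expected_entropy
    return entropy, mutual_info
```

When all passes agree, mutual information is near zero even if entropy is high; when passes are individually confident but contradict each other, both values approach $\log C$.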

[173] Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians

Quankai Gao,Iliyan Georgiev,Tuanfeng Y. Wang,Krishna Kumar Singh,Ulrich Neumann,Jae Shin Yoon

Main category: cs.CV

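TL;DR: This paper proposes Can3Tok, a 3D scene-level variational autoencoder (VAE) that addresses latent representation learning for large-scale 3D scene generation, and demonstrates its effectiveness in image-to-3DGS and text-to-3DGS generation applications.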

Details Motivation: 3D scene-level generation is limited by the lack of models capable of scalable latent representation learning; unlike object-level generation, scene-level generation faces the challenges of unbounded scenes and inconsistent scale across scenes. Method: The paper introduces Can3Tok, a 3D scene-level variational autoencoder that encodes a large number of Gaussian primitives into a low-dimensional latent embedding, together with a general 3D scene data processing pipeline that addresses scale inconsistency. Result: Experiments on the DL3DV-10K dataset show that Can3Tok successfully generalizes to novel 3D scenes, whereas the compared methods fail to converge when trained on even a small number of scenes and show no generalization at inference. Conclusion: Can3Tok is the first model that effectively learns latent representations of 3D scenes, and it demonstrates its applicability to downstream generation tasks. Abstract: 3D generation has made significant progress; however, it still largely remains at the object level. Feedforward 3D scene-level generation has been rarely explored due to the lack of models capable of scaling up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generations with 3D scenes represented by 3D Gaussian Splatting (3DGS) are unbounded and exhibit scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs. Beyond model design, we propose a general pipeline for 3D scene data processing to address the scale inconsistency issue. We validate our method on the recent scene-level 3D dataset DL3DV-10K, where we found that only Can3Tok successfully generalizes to novel 3D scenes, while compared methods fail to converge on even a few hundred scene inputs during training and exhibit zero generalization ability during inference. Finally, we present image-to-3DGS and text-to-3DGS generation as applications to demonstrate its ability to facilitate downstream generation tasks.

[174] EfficientGFormer: Multimodal Brain Tumor Segmentation via Pruned Graph-Augmented Transformer

Fatemeh Ziaeetabar

Main category: cs.CV

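TL;DR: EfficientGFormer is a new architecture that combines pretrained foundation models, graph-based reasoning, and lightweight efficiency mechanisms for robust 3D brain tumor segmentation.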

Details Motivation: Brain tumor segmentation remains a critical challenge in neuroimaging because of the heterogeneity of tumor subregions and the high computational cost of volumetric inference. Method: The EfficientGFormer framework uses nnFormer as a modality-aware encoder that transforms multi-modal MRI volumes into patch-level embeddings, performs relational reasoning with a pruned, edge-type-aware Graph Attention Network (GAT), and uses a lightweight distillation module to transfer knowledge from a teacher model to a student model. Result: Experiments on the MSD Task01 and BraTS 2021 datasets show that EfficientGFormer achieves state-of-the-art accuracy with significantly reduced memory and inference time, outperforming recent transformer-based and graph-based baselines. Conclusion: EfficientGFormer offers a clinically viable solution for fast and accurate volumetric tumor delineation, combining scalability, interpretability, and generalization. Abstract: Accurate and efficient brain tumor segmentation remains a critical challenge in neuroimaging due to the heterogeneous nature of tumor subregions and the high computational cost of volumetric inference. In this paper, we propose EfficientGFormer, a novel architecture that integrates pretrained foundation models with graph-based reasoning and lightweight efficiency mechanisms for robust 3D brain tumor segmentation. Our framework leverages nnFormer as a modality-aware encoder, transforming multi-modal MRI volumes into patch-level embeddings. These features are structured into a dual-edge graph that captures both spatial adjacency and semantic similarity. A pruned, edge-type-aware Graph Attention Network (GAT) enables efficient relational reasoning across tumor subregions, while a distillation module transfers knowledge from a full-capacity teacher to a compact student model for real-time deployment. Experiments on the MSD Task01 and BraTS 2021 datasets demonstrate that EfficientGFormer achieves state-of-the-art accuracy with significantly reduced memory and inference time, outperforming recent transformer-based and graph-based baselines. This work offers a clinically viable solution for fast and accurate volumetric tumor delineation, combining scalability, interpretability, and generalization.

[175] MiraGe: Multimodal Discriminative Representation Learning for Generalizable AI-Generated Image Detection

Kuo Shi,Jie Lu,Shanshan Ye,Guangquan Zhang,Zhen Fang

Main category: cs.CV

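TL;DR: This paper proposes MiraGe, a new method for detecting AI-generated images that learns generator-invariant features to improve the generalization and robustness of detection.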

Details Motivation: Existing methods perform well on known generators, but their performance declines on newly emerging or unseen generative models. Method: Multimodal prompt learning is applied, using text embeddings as semantic anchors to refine discriminative representation learning in CLIP. Result: Experiments show that MiraGe achieves state-of-the-art performance on multiple benchmarks and remains robust against unseen generators such as Sora. Conclusion: By learning generator-invariant features, MiraGe improves the generalization of AI-generated image detection. Abstract: Recent advances in generative models have highlighted the need for robust detectors capable of distinguishing real images from AI-generated images. While existing methods perform well on known generators, their performance often declines when tested with newly emerging or unseen generative models due to overlapping feature embeddings that hinder accurate cross-generator classification. In this paper, we propose Multimodal Discriminative Representation Learning for Generalizable AI-generated Image Detection (MiraGe), a method designed to learn generator-invariant features. Motivated by theoretical insights on intra-class variation minimization and inter-class separation, MiraGe tightly aligns features within the same class while maximizing separation between classes, enhancing feature discriminability. Moreover, we apply multimodal prompt learning to further refine these principles into CLIP, leveraging text embeddings as semantic anchors for effective discriminative representation learning, thereby improving generalizability. Comprehensive experiments across multiple benchmarks show that MiraGe achieves state-of-the-art performance, maintaining robustness even against unseen generators like Sora.

[176] ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models

Jiaxin Liu,Zhaolu Kang

Main category: cs.CV

TL;DR: ReasonAct improves video reasoning in small models through a three-stage training approach, achieving significant performance gains while staying computationally efficient.

Details Motivation: Multimodal models, especially smaller ones, struggle with the fine-grained temporal reasoning required for video understanding, necessitating a more effective approach. Method: ReasonAct uses a three-stage training process: text-only reasoning foundation, video fine-tuning, and temporal-aware reinforcement learning refinement. It builds upon T-GRPO with temporal consistency modeling and introduces a biomechanically-motivated sub-action decomposition mechanism. Result: The 3B-parameter ReasonAct model achieved 67.2% accuracy on HMDB51, 94.1% on UCF-101, and 78.9% on Kinetics-400, improvements of 17.9, 15.8, and 12.3 points respectively over baselines. Conclusion: The proposed ReasonAct method effectively enhances video reasoning performance in smaller models, achieving significant improvements over baselines while maintaining computational efficiency. Abstract: While recent multimodal models have shown progress in vision-language tasks, small-scale variants still struggle with the fine-grained temporal reasoning required for video understanding. We introduce ReasonAct, a method that enhances video reasoning in smaller models through a three-stage training process: first building a foundation with text-only reasoning, then fine-tuning on video, and finally refining with temporal-aware reinforcement learning. We build upon Temporal Group Relative Policy Optimization (T-GRPO) by incorporating temporal consistency modeling into policy optimization. We also propose a biomechanically-motivated sub-action decomposition mechanism that provides graduated rewards for constituent action phases. Through experiments on HMDB51, UCF-101, and Kinetics-400, our 3B-parameter model achieves 67.2%, 94.1%, and 78.9% accuracy respectively, demonstrating improvements of 17.9, 15.8, and 12.3 points over baselines. 
Ablation studies validate that our progressive training methodology enables smaller models to achieve competitive video reasoning performance while maintaining computational efficiency.

[177] MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning

Yi Liu,Xiao Xu,Zeyu Xu,Meng Zhang,Yibo Li,Haoyu Chen,Junkang Zhang,Qiang Wang,Jifa Sun,Siling Lin,Shengxun Cheng,Lingshu Zhang,Kang Wang

Main category: cs.CV

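TL;DR: MagicVL-2B is a vision-language model optimized for smartphones, with a lightweight visual encoder and a multimodal curriculum learning strategy, that maintains high performance while reducing power consumption.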

Details Motivation: Vision-language models (VLMs) have achieved remarkable breakthroughs in recent years, but their computational and storage demands pose significant challenges for efficient deployment on mobile devices. Developing optimized models for smartphones is therefore important. Method: MagicVL-2B uses a lightweight visual encoder with fewer than 100M parameters and a dynamic resolution scheme that adaptively generates image tokens. It also proposes a multimodal curriculum learning strategy that gradually increases task difficulty and data information density during training. Result: MagicVL-2B performs well on standard VLM benchmarks, matching the accuracy of current state-of-the-art models while reducing on-device power consumption by 41.1%. Conclusion: MagicVL-2B is a practical and robust vision-language model for smartphones that enables advanced multimodal intelligence on mobile devices. Abstract: Vision-Language Models (VLMs) have achieved remarkable breakthroughs in recent years, enabling a diverse array of applications in everyday life. However, the substantial computational and storage demands of VLMs pose significant challenges for their efficient deployment on mobile devices, which represent the most ubiquitous and accessible computing platforms today. In this work, we introduce MagicVL-2B, a novel VLM meticulously optimized for flagship smartphones. MagicVL-2B leverages a lightweight visual encoder with fewer than 100M parameters and features a redesigned dynamic resolution scheme that adaptively generates image tokens without excessive modification of image dimensions. To further enhance the performance of this compact encoder within VLMs, we propose a multimodal curriculum learning strategy that incrementally increases task difficulty and data information density throughout training. This approach substantially improves the model's performance across a variety of sub-tasks. Extensive evaluations on standard VLM benchmarks demonstrate that MagicVL-2B matches the accuracy of current state-of-the-art models while reducing on-device power consumption by 41.1%. These results establish MagicVL-2B as a practical and robust solution for real-world mobile vision-language applications, enabling advanced multimodal intelligence to run directly on smartphones.

[178] E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation

Zeyu Xu,Junkang Zhang,Qiang Wang,Yi Liu

Main category: cs.CV

TL;DR: E-VRAG is an efficient video retrieval-augmented generation framework that balances efficiency and accuracy through multi-level optimization and performs well on video understanding tasks.

Details Motivation: Existing video retrieval-augmented generation methods struggle to balance retrieval efficiency and accuracy, particularly when handling diverse and complex video content. Method: E-VRAG combines frame pre-filtering based on hierarchical query decomposition, frame scoring with a lightweight vision-language model (VLM), a frame retrieval strategy based on the global statistical distribution of inter-frame scores, and a multi-view question answering scheme. Result: Experiments show that E-VRAG reduces computational cost by about 70% while improving accuracy, without additional training. Conclusion: E-VRAG balances efficiency and accuracy in video RAG and outperforms baseline methods on public benchmarks. Abstract: Vision-Language Models (VLMs) have enabled substantial progress in video understanding by leveraging cross-modal reasoning capabilities. However, their effectiveness is limited by the restricted context window and the high computational cost required to process long videos with thousands of frames. Retrieval-augmented generation (RAG) addresses this challenge by selecting only the most relevant frames as input, thereby reducing the computational burden. Nevertheless, existing video RAG methods struggle to balance retrieval efficiency and accuracy, particularly when handling diverse and complex video content. To address these limitations, we propose E-VRAG, a novel and efficient video RAG framework for video understanding. We first apply a frame pre-filtering method based on hierarchical query decomposition to eliminate irrelevant frames, reducing computational costs at the data level. We then employ a lightweight VLM for frame scoring, further reducing computational costs at the model level. Additionally, we propose a frame retrieval strategy that leverages the global statistical distribution of inter-frame scores to mitigate the potential performance degradation from using a lightweight VLM. Finally, we introduce a multi-view question answering scheme for the retrieved frames, enhancing the VLM's capability to extract and comprehend information from long video contexts. Experiments on four public benchmarks show that E-VRAG achieves about 70% reduction in computational cost and higher accuracy compared to baseline methods, all without additional training. These results demonstrate the effectiveness of E-VRAG in improving both efficiency and accuracy for video RAG tasks.
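
The abstract does not spell out how the global statistical distribution of frame scores drives retrieval. As a rough illustration only, the sketch below assumes one simple rule: keep frames whose score exceeds the mean by `k` standard deviations, with a top-scoring fallback. The function name `select_frames`, the parameters `k` and `min_frames`, and the example scores are all hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def select_frames(scores, k=1.0, min_frames=1):
    """Return indices of frames whose score stands out from the global distribution.

    Assumed rule (not from the paper): keep frames scoring at least
    mean + k * std over all frames; fall back to the top-scoring frames
    if fewer than `min_frames` pass the threshold.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation over all frames
    threshold = mu + k * sigma
    chosen = [i for i, s in enumerate(scores) if s >= threshold]
    if len(chosen) < min_frames:
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = sorted(ranked[:min_frames])
    return chosen


# Hypothetical per-frame relevance scores from a lightweight scorer.
scores = [0.10, 0.12, 0.11, 0.90, 0.13, 0.85, 0.09, 0.10]
selected = select_frames(scores, k=1.0)
print(selected)
```

Because the threshold depends on the whole score distribution rather than a fixed top-k, the number of retrieved frames adapts to how peaked the scores are for a given video and query.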

[179] A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models

Quan-Sheng Zeng,Yunheng Li,Qilong Wang,Peng-Tao Jiang,Zuxuan Wu,Ming-Ming Cheng,Qibin Hou

Main category: cs.CV

TL;DR: This work presents GlimpsePrune, a dynamic visual token pruning framework that lets large vision-language models process high-resolution inputs efficiently.

Details Motivation: Existing methods typically adopt fixed compression ratios and cannot adapt to scenes of varying complexity, which causes imprecise pruning that discards informative visual tokens and degrades model performance. Method: GlimpsePrune, a dynamic pruning framework inspired by human cognition, prunes irrelevant visual tokens in a single forward pass. Result: GlimpsePrune prunes 92.6% of visual tokens while, on average, fully retaining baseline performance. Conclusion: GlimpsePrune paves the way for building more powerful and efficient LVLMs. Abstract: Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while on average fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves a new way for building more powerful and efficient LVLMs.

[180] EvoVLMA: Evolutionary Vision-Language Model Adaptation

Kun Ding,Ying Wang,Shiming Xiang

Main category: cs.CV

TL;DR: This paper proposes EvoVLMA, a method that automatically searches for efficient training-free VLM adaptation algorithms. It treats feature selection and logits computation as the key functions and uses a two-stage LLM-assisted evolutionary algorithm whose divide-and-conquer strategy handles the large search space.

Details Motivation: Existing VLM adaptation methods are designed by hand, which takes considerable time and expertise, motivating an automatic search for efficient training-free VLM adaptation algorithms. Method: With feature selection and logits computation identified as the key functions, a two-stage LLM-assisted evolutionary algorithm optimizes them sequentially, a divide-and-conquer strategy that addresses the large search space; low-precision code conversion, web-based code execution, and process monitoring improve the stability and efficiency of the search. Result: The algorithms found by EvoVLMA outperform previous manually designed ones; in the 8-shot image classification setting, the classical APE algorithm improves by 1.91 points in recognition accuracy. Conclusion: EvoVLMA automatically searches for training-free VLM adaptation algorithms that outperform previous manually designed ones, opening new possibilities for automating the optimization of adaptation algorithms for pre-trained multimodal models. Abstract: Pre-trained Vision-Language Models (VLMs) have been exploited in various Computer Vision tasks (e.g., few-shot recognition) via model adaptation, such as prompt tuning and adapters. However, existing adaptation methods are designed by human experts, requiring significant time cost and experience. Inspired by recent advances in Large Language Models (LLMs) based code generation, we propose an Evolutionary Vision-Language Model Adaptation (EvoVLMA) method to automatically search training-free efficient adaptation algorithms for VLMs. We recognize feature selection and logits computation as the key functions in training-free VLM adaptation, and propose a two-stage LLM-assisted evolutionary algorithm for optimizing these parts in a sequential manner, effectively addressing the challenge posed by the expansive search space through a divide-and-conquer strategy. Besides, to enhance the stability and efficiency of searching process, we propose low-precision code conversion, web based code execution and process monitoring, leading to a highly effective automatic algorithm design system. Extensive experiments demonstrate that the algorithms found by EvoVLMA can obtain promising results compared to previous manually-designed ones. More specifically, in the 8-shot image classification setting, the classical APE algorithm can be improved by 1.91 points in recognition accuracy. This research opens new possibilities for automating the optimization of adaptation algorithms of pre-trained multimodal models. Code is available at: https://github.com/kding1225/EvoVLMA

[181] Adaptive LiDAR Scanning: Harnessing Temporal Cues for Efficient 3D Object Detection via Multi-Modal Fusion

Sara Shoouri,Morteza Tavakoli Taba,Hun-Seok Kim

Main category: cs.CV

TL;DR: This paper proposes an adaptive LiDAR scanning framework that uses historical data to reduce energy consumption while maintaining 3D object detection accuracy.

Details Motivation: Conventional LiDAR sensors perform dense, stateless scans that lead to redundancy and excessive power consumption, limiting their practicality on resource-constrained platforms. Method: A lightweight predictor network and a differentiable Mask Generator network are introduced to identify critical regions of interest using historical context and Gumbel-Softmax sampling. Result: Experiments on nuScenes and Lyft benchmarks show over 65% reduction in LiDAR energy consumption without compromising detection performance. Conclusion: The proposed predictive, history-aware adaptive scanning framework significantly reduces LiDAR energy consumption while maintaining competitive 3D object detection performance. Abstract: Multi-sensor fusion using LiDAR and RGB cameras significantly enhances the 3D object detection task. However, conventional LiDAR sensors perform dense, stateless scans, ignoring the strong temporal continuity in real-world scenes. This leads to substantial sensing redundancy and excessive power consumption, limiting their practicality on resource-constrained platforms. To address this inefficiency, we propose a predictive, history-aware adaptive scanning framework that anticipates informative regions of interest (ROI) based on past observations. Our approach introduces a lightweight predictor network that distills historical spatial and temporal contexts into refined query embeddings. These embeddings guide a differentiable Mask Generator network, which leverages Gumbel-Softmax sampling to produce binary masks identifying critical ROIs for the upcoming frame. Our method significantly reduces unnecessary data acquisition by concentrating dense LiDAR scanning only within these ROIs and sparsely sampling elsewhere.
Experiments on nuScenes and Lyft benchmarks demonstrate that our adaptive scanning strategy reduces LiDAR energy consumption by over 65% while maintaining competitive or even superior 3D object detection performance compared to traditional LiDAR-camera fusion methods with dense LiDAR scanning.
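
To make the Gumbel-Softmax step concrete, here is a minimal standard-library sketch of how per-region logits could be turned into a binary scan mask. It shows only the forward sampling; in a real training setup the hard mask would be paired with a straight-through gradient in an autograd framework. The function names, the two-logit (drop, keep) layout, and the example values are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one Gumbel(0, 1) sample via the inverse-CDF trick."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u away from 0 and 1
    return -math.log(-math.log(u))


def gumbel_softmax_mask(logits, tau=1.0, rng=None):
    """Sample a binary mask from per-region (drop_logit, keep_logit) pairs.

    Returns (soft_keep_probs, hard_mask). The soft values are the relaxed,
    differentiable sample; the hard mask is its 0/1 discretization.
    """
    rng = rng or random.Random()
    soft, hard = [], []
    for drop_logit, keep_logit in logits:
        d = (drop_logit + gumbel_noise(rng)) / tau
        k = (keep_logit + gumbel_noise(rng)) / tau
        m = max(d, k)  # subtract the max for a numerically stable softmax
        e_d, e_k = math.exp(d - m), math.exp(k - m)
        p_keep = e_k / (e_d + e_k)
        soft.append(p_keep)
        hard.append(1 if p_keep >= 0.5 else 0)
    return soft, hard


# Hypothetical logits for three regions: clearly keep, clearly drop, undecided.
region_logits = [(-50.0, 50.0), (50.0, -50.0), (0.0, 0.0)]
soft_probs, binary_mask = gumbel_softmax_mask(region_logits, tau=0.5, rng=random.Random(0))
print(binary_mask)
```

Lower values of `tau` push the soft probabilities toward 0 or 1, so the relaxed sample approaches the discrete mask at the cost of noisier gradients.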

[182] LetheViT: Selective Machine Unlearning for Vision Transformers via Attention-Guided Contrastive Learning

Yujia Tong,Tian Zhang,Jingling Yuan,Yuze Wang,Chuang Hu

Main category: cs.CV

TL;DR: This work introduces LetheViT, a contrastive unlearning method for Vision Transformers that effectively balances privacy compliance with model efficacy by guiding the model to forget specific details while retaining general category outlines.

Details Motivation: The introduction of privacy regulations such as GDPR and CCPA necessitates the complete removal of user data's influence from trained models, posing new challenges to Vision Transformers. Machine unlearning, particularly approximate methods, offers a practical solution. Method: LetheViT proposes a contrastive unlearning method for Vision Transformers, using masked image inputs to generate positive logits and original image inputs to generate negative logits, guiding the model to forget specific details while retaining general category outlines. Result: LetheViT demonstrates superior performance in the random data forgetting scenario, effectively enabling Vision Transformers to forget specific samples while retaining others, even within the same class. Conclusion: LetheViT effectively balances privacy compliance with model efficacy, achieving state-of-the-art performance in the random data forgetting scenario for Vision Transformers. Abstract: Vision Transformers (ViTs) have revolutionized computer vision tasks with their exceptional performance. However, the introduction of privacy regulations such as GDPR and CCPA has brought new challenges to them. These laws grant users the right to withdraw their data, necessitating not only the deletion of data but also the complete removal of its influence from trained models. Machine unlearning emerges as a critical solution, with exact unlearning being computationally prohibitive and approximate methods offering a more practical approach. This work addresses the particularly challenging scenario of random data forgetting in ViTs, where the model must forget specific samples while retaining others, even within the same class. We first reveal the core characteristics of ViTs through selective masking experiments: when high-attention areas are masked, the model retains its recognition capability but significantly weakens its memorization ability. 
Based on the above insights, we propose LetheViT, a contrastive unlearning method tailored for ViTs. LetheViT uses masked image inputs to generate positive logits and original image inputs to generate negative logits, guiding the model to forget specific details while retaining the general category outlines. Experimental results demonstrate that LetheViT achieves state-of-the-art performance, effectively balancing privacy compliance with model efficacy.

[183] TopoImages: Incorporating Local Topology Encoding into Deep Learning Models for Medical Image Classification

Pengfei Gu,Hongxiao Wang,Yejia Zhang,Huimin Li,Chaoli Wang,Danny Chen

Main category: cs.CV

TL;DR: This paper proposes TopoImages, which uses persistent homology to encode the local topology of images, generates multi-view topological images, and fuses them into deep learning models, noticeably improving medical image classification accuracy.

Details Motivation: Although existing image processing methods succeed with appearance information, they lack sensitivity to topological structures (such as connected components and loops) when used in deep learning frameworks, even though these structures play a key role in understanding image content, especially in biomedical image analysis. Method: Persistent homology (PH) is used to analyze the topology of image patches: persistence diagrams (PDs) are computed, vectorized, and arranged into a multi-channel image-form representation (TopoImages). Multi-view TopoImages are generated with various filtration functions and fused with the input image for deep learning classification. Result: TopoImages noticeably improves classification accuracy on three public medical image classification datasets, and the approach is highly versatile and integrates seamlessly into common deep learning frameworks. Conclusion: TopoImages is a new general approach that encodes the topological information of an image in vectorized form and, through multi-view fusion, enriches the topological feature representation and improves deep learning classification accuracy. Abstract: Topological structures in image data, such as connected components and loops, play a crucial role in understanding image content (e.g., biomedical objects). Despite remarkable successes of numerous image processing methods that rely on appearance information, these methods often lack sensitivity to topological structures when used in general deep learning (DL) frameworks. In this paper, we introduce a new general approach, called TopoImages (for Topology Images), which computes a new representation of input images by encoding local topology of patches. In TopoImages, we leverage persistent homology (PH) to encode geometric and topological features inherent in image patches. Our main objective is to capture topological information in local patches of an input image into a vectorized form. Specifically, we first compute persistence diagrams (PDs) of the patches, and then vectorize and arrange these PDs into long vectors for pixels of the patches. The resulting multi-channel image-form representation is called a TopoImage. TopoImages offers a new perspective for data analysis. To garner diverse and significant topological features in image data and ensure a more comprehensive and enriched representation, we further generate multiple TopoImages of the input image using various filtration functions, which we call multi-view TopoImages. The multi-view TopoImages are fused with the input image for DL-based classification, with considerable improvement.
Our TopoImages approach is highly versatile and can be seamlessly integrated into common DL frameworks. Experiments on three public medical image classification datasets demonstrate noticeably improved accuracy over state-of-the-art methods.

[184] Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning

Lingfeng He,De Cheng,Huaijie Wang,Nannan Wang

Main category: cs.CV

TL;DR: This paper proposes SECA, a continual learning framework that improves knowledge transfer and visual classification by leveraging semantic-rich textual priors from CLIP.

Details Motivation: Existing continual learning methods do not effectively utilize the rich semantic priors in CLIP, leading to a poor balance between stability and plasticity. Text-based classifiers lack plasticity due to modality gaps, while visual classifiers lack semantic richness. Method: SECA includes two modules: SG-AKT for semantic-guided knowledge transfer using textual cues, and SE-VPR for refining visual prototypes with semantic relations from textual embeddings. Result: Extensive experiments show that SECA outperforms previous methods in continual learning scenarios, demonstrating its effectiveness in addressing the stability-plasticity dilemma. Conclusion: The proposed SECA framework enhances continual learning by leveraging textual priors to improve knowledge transfer and visual prototype refinement, achieving superior performance on multiple benchmarks. Abstract: Continual learning (CL) aims to equip models with the ability to learn from a stream of tasks without forgetting previous knowledge. With the progress of vision-language models like Contrastive Language-Image Pre-training (CLIP), their promise for CL has attracted increasing attention due to their strong generalizability. However, the potential of rich textual semantic priors in CLIP in addressing the stability-plasticity dilemma remains underexplored. During backbone training, most approaches transfer past knowledge without considering semantic relevance, leading to interference from unrelated tasks that disrupt the balance between stability and plasticity. Besides, while text-based classifiers provide strong generalization, they suffer from limited plasticity due to the inherent modality gap in CLIP. Visual classifiers help bridge this gap, but their prototypes lack rich and precise semantics. 
To address these challenges, we propose Semantic-Enriched Continual Adaptation (SECA), a unified framework that harnesses the anti-forgetting and structured nature of textual priors to guide semantic-aware knowledge transfer in the backbone and reinforce the semantic structure of the visual classifier. Specifically, a Semantic-Guided Adaptive Knowledge Transfer (SG-AKT) module is proposed to assess new images' relevance to diverse historical visual knowledge via textual cues, and aggregate relevant knowledge in an instance-adaptive manner as distillation signals. Moreover, a Semantic-Enhanced Visual Prototype Refinement (SE-VPR) module is introduced to refine visual prototypes using inter-class semantic relations captured in class-wise textual embeddings. Extensive experiments on multiple benchmarks validate the effectiveness of our approach.

[185] Set Pivot Learning: Redefining Generalized Segmentation with Vision Foundation Models

Xinhui Li,Xinyu He,Qiming Hu,Xiaojie Guo

Main category: cs.CV

TL;DR: This paper introduces Set Pivot Learning (SPL), a new paradigm for domain generalization leveraging Vision Foundation Models. It proposes a Dynamic Prompt Fine-Tuning method and demonstrates its superior performance in generalized segmentation tasks compared to existing state-of-the-art approaches.

Details Motivation: The emergence of Vision Foundation Models trained on vast and diverse datasets challenges the traditional assumption in domain generalization that the target domain is inaccessible during training. This necessitates a new approach that aligns with current research and application needs. Method: The paper introduces Set Pivot Learning (SPL), a new paradigm for domain generalization based on Vision Foundation Models. SPL focuses on dynamic adaptation and VFM-centric tuning. The proposed method includes a Dynamic Prompt Fine-Tuning approach combining a Dynamic Class-aware Prompter and a Prompt-guided Feature Focuser. Result: Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed SPL framework and Dynamic Prompt Fine-Tuning method, showing superiority over state-of-the-art methods in domain generalization tasks, especially in generalized segmentation. Conclusion: The paper concludes that SPL, along with the proposed Dynamic Prompt Fine-Tuning method, provides a more effective and adaptable approach to domain generalization in the context of Vision Foundation Models, particularly showing superior performance in generalized segmentation. Abstract: In this paper, we introduce, for the first time, the concept of Set Pivot Learning, a paradigm shift that redefines domain generalization (DG) based on Vision Foundation Models (VFMs). Traditional DG assumes that the target domain is inaccessible during training, but the emergence of VFMs, which are trained on vast and diverse datasets, renders this assumption unclear and obsolete. To address this challenge, we propose Set Pivot Learning (SPL), a new definition of domain migration task based on VFMs, which is more suitable for current research and application requirements.
Unlike conventional DG methods, SPL prioritizes adaptive refinement over rigid domain transfer, ensuring continuous alignment with evolving real-world conditions. Specifically, SPL features two key attributes: (i) Dynamic adaptation, transitioning from static domain alignment to flexible, task-driven feature optimization, enabling models to evolve with downstream scenarios; (ii) VFM-centric tuning, leveraging pretrained knowledge as a pivot to hone task-specific representations while preserving cross-domain robustness. Building on SPL, we propose a Dynamic Prompt Fine-Tuning method, which combines a Dynamic Class-aware Prompter with a Prompt-guided Feature Focuser, to elevate VFM performance in targeted scenarios. Extensive experiments on benchmark datasets show the effectiveness of our method, highlighting its superiority over state-of-the-art methods, particularly in generalized segmentation.

[186] A Spatio-temporal Continuous Network for Stochastic 3D Human Motion Prediction

Hua Yu,Yaqing Hou,Xu Gui,Shanshan Feng,Dongsheng Zhou,Qiang Zhang

Main category: cs.CV

TL;DR: STCN improves stochastic human motion prediction by incorporating a spatio-temporal continuous network and anchor-based strategy, effectively enhancing motion flexibility and diversity while reducing mode collapse.

Details Motivation: Existing methods struggle with modeling continuous temporal dynamics and stochastic human motion sequences, often leading to mode collapse and insufficient flexibility in complex motion prediction. Method: STCN utilizes a two-stage approach: (1) generating smoother motion sequences with a spatio-temporal continuous network, and (2) modeling Gaussian mixture distributions using an anchor set to capture motion diversity and intra-class differences. Result: STCN achieves competitive performance in terms of both motion prediction accuracy and diversity on the Human3.6M and HumanEva-I datasets. Conclusion: The proposed STCN method effectively addresses challenges in stochastic human motion prediction by introducing a spatio-temporal continuous network and an anchor set to enhance motion flexibility and reduce mode collapse. Abstract: Stochastic Human Motion Prediction (HMP) has received increasing attention due to its wide applications. Despite the rapid progress in generative fields, existing methods often face challenges in learning continuous temporal dynamics and predicting stochastic motion sequences. They tend to overlook the flexibility inherent in complex human motions and are prone to mode collapse. To alleviate these issues, we propose a novel method called STCN, for stochastic and continuous human motion prediction, which consists of two stages. Specifically, in the first stage, we propose a spatio-temporal continuous network to generate smoother human motion sequences. In addition, the anchor set is innovatively introduced into the stochastic HMP task to prevent mode collapse, which refers to the potential human motion patterns. In the second stage, STCN endeavors to acquire the Gaussian mixture distribution (GMM) of observed motion sequences with the aid of the anchor set. 
It also focuses on the probability associated with each anchor, and employs the strategy of sampling multiple sequences from each anchor to alleviate intra-class differences in human motions. Experimental results on two widely-used datasets (Human3.6M and HumanEva-I) demonstrate that our model obtains competitive performance on both diversity and accuracy.
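
The per-anchor sampling idea can be sketched as follows. This toy version treats each anchor as the mean of an isotropic Gaussian component with a fixed `sigma` and draws the same number of samples from every anchor; in the paper the mixture is learned, so the function name, the fixed `sigma`, and the flattened example motion vectors are all illustrative assumptions.

```python
import random


def sample_from_anchors(anchors, probs, samples_per_anchor=3, sigma=0.1, rng=None):
    """Draw several motion samples around each anchor.

    anchors: list of flattened motion vectors (assumed component means)
    probs:   probability associated with each anchor
    Sampling every anchor, rather than only the most probable one, keeps
    low-probability motion modes represented in the output.
    """
    rng = rng or random.Random()
    samples = []
    for idx, (anchor, prob) in enumerate(zip(anchors, probs)):
        for _ in range(samples_per_anchor):
            motion = [rng.gauss(x, sigma) for x in anchor]
            samples.append({"anchor": idx, "prob": prob, "motion": motion})
    return samples


# Two hypothetical anchors, each a tiny flattened "motion" vector.
anchors = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
probs = [0.7, 0.3]
predictions = sample_from_anchors(anchors, probs, samples_per_anchor=3, rng=random.Random(0))
print(len(predictions))
```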

[187] Lifelong Person Re-identification via Privacy-Preserving Data Replay

Mingyu Wang,Haojie Liu,Zhiyong Li,Wei Jiang

Main category: cs.CV

TL;DR: This paper proposes Pr^2R, a privacy-preserving replay method for lifelong person re-identification that improves performance on sequential tasks while addressing data privacy concerns.

Details Motivation: The motivation is to address data privacy concerns in LReID tasks while avoiding performance degradation caused by forgetting past knowledge representations. Method: The method condenses information from sequential data into pixel space in the replay memory, using a distillation process to create representative and privacy-preserving samples. It also employs a dual-alignment strategy during style replay to align domains and adapt samples. Result: Pr^2R achieves 4% and 6% higher accuracy on sequential tasks compared to the current state-of-the-art and other replay-based methods, respectively. Conclusion: The proposed Pr^2R method significantly improves replay effectiveness while preserving data privacy in LReID tasks. Abstract: Lifelong person re-identification (LReID) aims to incrementally accumulate knowledge across a sequence of tasks under domain shifts. Recently, replay-based methods have demonstrated strong effectiveness in LReID by rehearsing past samples stored in an auxiliary memory. However, storing historical exemplars raises concerns over data privacy. To avoid this, exemplar-free approaches attempt to match the distribution of past data without storing raw samples. Despite being privacy-friendly, these methods often suffer from performance degradation due to the forgetting of specific past knowledge representations. To this end, we propose to condense information from sequential data into the pixel space in the replay memory, enabling Privacy-Preserving Replay (Pr^2R). More specifically, by distilling the training characteristics of multiple real images into a single image, the condensed samples undergo pixel-level changes. This not only protects the privacy of the original data but also makes the replay samples more representative for sequential tasks. During the style replay phase, we align the current domain to the previous one while simultaneously adapting the replay samples to match the style of the current domain. 
This dual-alignment strategy effectively mitigates both class-incremental challenges and forgetting caused by domain shifts. Extensive experiments on multiple benchmarks show that the proposed method significantly improves replay effectiveness while preserving data privacy. Specifically, Pr^2R achieves 4% and 6% higher accuracy on sequential tasks compared to the current state-of-the-art and other replay-based methods, respectively.

[188] Self-Navigated Residual Mamba for Universal Industrial Anomaly Detection

Hanxi Li,Jingqi Wu,Lin Yuanbo Wu,Mingliang Li,Deyin Liu,Jialie Shen,Chunhua Shen

Main category: cs.CV

TL;DR: This paper proposes SNARM, a new framework for industrial anomaly detection using self-referential learning in test images, achieving state-of-the-art results on multiple benchmarks.

Details Motivation: The motivation is to enhance anomaly discrimination in industrial settings by leveraging self-referential learning within test images. Method: The method involves computing inter-residuals and intra-residuals features from test image patches and feeding them into a Mamba module with multiple heads for anomaly detection. Result: Extensive experiments on MVTec AD, MVTec 3D, and VisA benchmarks demonstrate that SNARM achieves state-of-the-art performance, with notable improvements in all metrics, including Image-AUROC, Pixel-AURC, PRO, and AP. Conclusion: SNARM achieves state-of-the-art performance on MVTec AD, MVTec 3D, and VisA benchmarks, showing notable improvements in all metrics. Abstract: In this paper, we propose Self-Navigated Residual Mamba (SNARM), a novel framework for universal industrial anomaly detection that leverages "self-referential learning" within test images to enhance anomaly discrimination. Unlike conventional methods that depend solely on pre-trained features from normal training data, SNARM dynamically refines anomaly detection by iteratively comparing test patches against adaptively selected in-image references. Specifically, we first compute the "inter-residuals" features by contrasting test image patches with the training feature bank. Patches exhibiting small-norm residuals (indicating high normality) are then utilized as self-generated reference patches to compute "intra-residuals", amplifying discriminative signals. These inter- and intra-residual features are concatenated and fed into a novel Mamba module with multiple heads, which are dynamically navigated by residual properties to focus on anomalous regions. Finally, AD results are obtained by aggregating the outputs of a self-navigated Mamba in an ensemble learning paradigm.
Extensive experiments on MVTec AD, MVTec 3D, and VisA benchmarks demonstrate that SNARM achieves state-of-the-art (SOTA) performance, with notable improvements in all metrics, including Image-AUROC, Pixel-AURC, PRO, and AP.
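
The inter- and intra-residual computation described above can be illustrated with a small sketch. It assumes nearest-neighbour matching by Euclidean norm and a fixed number of in-image references; these choices, the helper names, and the toy 2-D features are assumptions for illustration, and the Mamba module that consumes the concatenated features is not modelled.

```python
import math


def sub(a, b):
    """Element-wise difference of two feature vectors."""
    return [x - y for x, y in zip(a, b)]


def norm(v):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(x * x for x in v))


def nearest_residual(feature, bank):
    """Residual between a feature and its nearest neighbour in `bank`."""
    return min((sub(feature, ref) for ref in bank), key=norm)


def residual_features(test_patches, train_bank, num_refs=2):
    """Concatenate inter-residuals (vs. training bank) and intra-residuals
    (vs. the test image's own most normal-looking patches)."""
    inter = [nearest_residual(p, train_bank) for p in test_patches]
    # Patches with the smallest inter-residual norm act as in-image references.
    order = sorted(range(len(test_patches)), key=lambda i: norm(inter[i]))
    refs = [test_patches[i] for i in order[:num_refs]]
    intra = [nearest_residual(p, refs) for p in test_patches]
    return [a + b for a, b in zip(inter, intra)]


# Toy 2-D patch features: two normal-looking patches and one outlier.
train_bank = [[0.0, 0.0], [1.0, 1.0]]
test_patches = [[0.1, 0.0], [1.0, 0.9], [5.0, 5.0]]
features = residual_features(test_patches, train_bank, num_refs=2)
print([round(norm(f), 3) for f in features])
```

In this toy example the outlier patch ends up with a large residual in both halves of its feature vector, while the reference patches have a zero intra-residual because each one matches itself.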

[189] DMTrack: Spatio-Temporal Multimodal Tracking via Dual-Adapter

Weihong Li,Shaohua Dong,Haonan Lu,Yanhao Zhang,Heng Fan,Libo Zhang

Main category: cs.CV

TL;DR: DMTrack introduces a lightweight dual-adapter architecture that enhances spatio-temporal multimodal tracking through novel fusion techniques and achieves top performance with minimal parameters.

Details Motivation: The motivation is to improve spatio-temporal multimodal tracking by bridging the gap between different modalities and enabling effective cross-modality fusion with a lightweight architecture. Method: The method involves a dual-adapter framework comprising a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA), which enhance cross-modality fusion through self-prompting and pixel-wise attention mechanisms. Result: DMTrack achieves state-of-the-art results on five benchmarks with only 0.93M trainable parameters. Conclusion: DMTrack introduces a dual-adapter architecture for spatio-temporal multimodal tracking, achieving state-of-the-art results with minimal trainable parameters. Abstract: In this paper, we explore adapter tuning and introduce a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack. The key of our DMTrack lies in two simple yet effective modules, including a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA) module. The former, applied to each modality alone, aims to adjust spatio-temporal features extracted from a frozen backbone by self-prompting, which to some extent can bridge the gap between different modalities and thus allows better cross-modality fusion. The latter seeks to facilitate cross-modality prompting progressively with two specially designed pixel-wise shallow and deep adapters. The shallow adapter employs shared parameters between the two modalities, aiming to bridge the information flow between the two modality branches, thereby laying the foundation for following modality fusion, while the deep adapter modulates the preliminarily fused information flow with pixel-wise inner-modal attention and further generates modality-aware prompts through pixel-wise inter-modal attention. 
With such designs, DMTrack achieves promising spatio-temporal multimodal tracking performance with merely 0.93M trainable parameters. Extensive experiments on five benchmarks show that DMTrack achieves state-of-the-art results. Code will be available.

[190] CLIMD: A Curriculum Learning Framework for Imbalanced Multimodal Diagnosis

Kai Han,Chongwen Lyu,Lele Ma,Chengxuan Qian,Siqi Ma,Zheng Pang,Jun Chen,Zhe Liu

Main category: cs.CV

TL;DR: This paper proposes CLIMD, a curriculum learning framework for imbalanced multimodal diagnosis that combines intra-modal confidence and inter-modal complementarity to focus on key samples and gradually adapt to complex class distributions, improving the accuracy of multimodal disease diagnosis.

Details Motivation: In real clinical scenarios, differences in incidence rates mean that multimodal medical data often suffer from class imbalance, which makes it difficult to adequately learn minority-class features. Existing methods are prone to overfitting or underfitting and fail to capture cross-modal interactions. Method: A multimodal curriculum measurer combines intra-modal confidence and inter-modal complementarity so that the model focuses on key samples and gradually adapts to complex class distributions. In addition, a class distribution-guided training scheduler lets the model progressively adapt to the imbalanced class distribution during training. Result: Experiments on multiple multimodal medical datasets show that the proposed method outperforms existing approaches across various metrics and handles imbalanced multimodal medical data well. As a plug-and-play curriculum learning framework, CLIMD can also be easily integrated into other models. Conclusion: CLIMD offers a promising path toward more accurate multimodal disease diagnosis, and the code is publicly available. Abstract: Clinicians usually combine information from multiple sources to achieve the most accurate diagnosis, and this has sparked increasing interest in leveraging multimodal deep learning for diagnosis. However, in real clinical scenarios, due to differences in incidence rates, multimodal medical data commonly face the issue of class imbalance, which makes it difficult to adequately learn the features of minority classes. Most existing methods tackle this issue with resampling or loss reweighting, but they are prone to overfitting or underfitting and fail to capture cross-modal interactions. Therefore, we propose a Curriculum Learning framework for Imbalanced Multimodal Diagnosis (CLIMD). Specifically, we first design a multimodal curriculum measurer that combines two indicators, intra-modal confidence and inter-modal complementarity, to enable the model to focus on key samples and gradually adapt to complex category distributions. Additionally, a class distribution-guided training scheduler is introduced, which enables the model to progressively adapt to the imbalanced class distribution during training. Extensive experiments on multiple multimodal medical datasets demonstrate that the proposed method outperforms state-of-the-art approaches across various metrics and excels in handling imbalanced multimodal medical data. Furthermore, as a plug-and-play CL framework, CLIMD can be easily integrated into other models, offering a promising path for improving multimodal disease diagnosis accuracy. Code is publicly available at https://github.com/KHan-UJS/CLIMD.

[191] Enhancing Zero-Shot Brain Tumor Subtype Classification via Fine-Grained Patch-Text Alignment

Lubin Gan,Jing Zhang,Linhao Qu,Yijun Wang,Siying Wu,Xiaoyan Sun

Main category: cs.CV

TL;DR: This paper proposes FG-PAN, a novel framework for zero-shot brain tumor subtype classification that improves performance by aligning enhanced visual features with detailed text descriptions.

Details Motivation: Fine-grained classification of brain tumor subtypes is challenging due to subtle variations and limited annotated data, and existing vision-language models struggle with capturing detailed pathological features. Method: The Fine-Grained Patch Alignment Network (FG-PAN) includes a local feature refinement module and a text description generation module using large language models. Result: FG-PAN achieves state-of-the-art performance and robust generalization on multiple public pathology datasets, including EBRAINS and TCGA. Conclusion: FG-PAN effectively improves zero-shot classification of brain tumor subtypes by aligning refined visual features with fine-grained text descriptions. Abstract: The fine-grained classification of brain tumor subtypes from histopathological whole slide images is highly challenging due to subtle morphological variations and the scarcity of annotated data. Although vision-language models have enabled promising zero-shot classification, their ability to capture fine-grained pathological features remains limited, resulting in suboptimal subtype discrimination. To address these challenges, we propose the Fine-Grained Patch Alignment Network (FG-PAN), a novel zero-shot framework tailored for digital pathology. FG-PAN consists of two key modules: (1) a local feature refinement module that enhances patch-level visual features by modeling spatial relationships among representative patches, and (2) a fine-grained text description generation module that leverages large language models to produce pathology-aware, class-specific semantic prototypes. By aligning refined visual features with LLM-generated fine-grained descriptions, FG-PAN effectively increases class separability in both visual and semantic spaces. 
Extensive experiments on multiple public pathology datasets, including EBRAINS and TCGA, demonstrate that FG-PAN achieves state-of-the-art performance and robust generalization in zero-shot brain tumor subtype classification.

[192] Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning

Yiheng Li,Zichang Tan,Zhen Lei,Xu Zhou,Yang Yang

Main category: cs.CV

TL;DR: This paper introduces IAPL, a novel adaptive prompt learning framework for AI-generated image detection, achieving top performance by dynamically adjusting prompts based on input images.

Details Motivation: Existing AI-generated image detection methods struggle to generalize to unseen generators, as they rely on fixed prompts trained on limited data sources. Method: The Image-Adaptive Prompt Learning (IAPL) framework uses two adaptive modules: Conditional Information Learner and Confidence-Driven Adaptive Prediction, which dynamically adjust prompts based on input images. Result: IAPL achieved 95.61% and 96.7% mean accuracy on the UniversalFakeDetect and GenImage datasets, respectively. Conclusion: The proposed IAPL framework enhances the adaptability of AI-generated image detection models, achieving state-of-the-art performance on two major datasets. Abstract: A major struggle for AI-generated image detection is identifying fake images from unseen generators. Existing cutting-edge methods typically customize pre-trained foundation models to this task via partial-parameter fine-tuning. However, these parameters trained on a narrow range of generators may fail to generalize to unknown sources. In light of this, we propose a novel framework named Image-Adaptive Prompt Learning (IAPL), which enhances flexibility in processing diverse testing images. It consists of two adaptive modules, i.e., the Conditional Information Learner and the Confidence-Driven Adaptive Prediction. The former employs CNN-based feature extractors to learn forgery-specific and image-specific conditions, which are then propagated to learnable tokens via a gated mechanism. The latter optimizes the shallowest learnable tokens based on a single test sample and selects the cropped view with the highest prediction confidence for final detection. These two modules enable the prompts fed into the foundation model to be automatically adjusted based on the input image, rather than being fixed after training, thereby enhancing the model's adaptability to various forged images. 
Extensive experiments show that IAPL achieves state-of-the-art performance, with 95.61% and 96.7% mean accuracy on two widely used datasets, UniversalFakeDetect and GenImage, respectively.
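The final step of Confidence-Driven Adaptive Prediction, picking the cropped view with the highest prediction confidence, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each view has already produced two-class logits, the function names and example numbers are hypothetical, and the test-time token optimization described in the abstract is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_most_confident_view(view_logits):
    """Return (view_index, predicted_class, confidence) of the most confident view."""
    best = None
    for i, logits in enumerate(view_logits):
        probs = softmax(logits)
        conf = max(probs)            # confidence = highest class probability
        cls = probs.index(conf)
        if best is None or conf > best[2]:
            best = (i, cls, conf)
    return best

# Hypothetical [real, fake] logits for three cropped views of one test image
views = [[0.2, 0.1], [-1.5, 2.0], [0.5, 0.9]]
idx, cls, conf = select_most_confident_view(views)
```

Here the second view dominates because its logits are the most separated, so its prediction is used for the final decision.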

[193] From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models

Lingyao Li,Runlong Yu,Qikai Hu,Bowei Li,Min Deng,Yang Zhou,Xiaowei Jia

Main category: cs.CV

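TL;DR: This study introduces IMAGEO-Bench, a benchmark for evaluating the image geolocalization ability of large language models, and finds that they perform better in high-resource regions, revealing geospatial bias.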

Details Motivation: Image geolocalization is important for applications such as crisis response, digital forensics, and location-based intelligence, yet the ability of large language models to perform it remains underexplored. Method: The study introduces a benchmark called IMAGEO-Bench and systematically evaluates the accuracy, distance error, geospatial bias, and reasoning process of 10 state-of-the-art large language models. Result: Closed-source models generally show stronger reasoning, and the models exhibit geospatial bias: they perform better in high-resource regions and worse in underrepresented areas. Conclusion: IMAGEO-Bench provides a rigorous lens into the spatial reasoning capabilities of large language models and offers implications for building geolocation-aware AI systems. Abstract: Image geolocalization, the task of identifying the geographic location depicted in an image, is important for applications in crisis response, digital forensics, and location-based intelligence. While recent advances in large language models (LLMs) offer new opportunities for visual reasoning, their ability to perform image geolocalization remains underexplored. In this study, we introduce a benchmark called IMAGEO-Bench that systematically evaluates accuracy, distance error, geospatial bias, and reasoning process. Our benchmark includes three diverse datasets covering global street scenes, points of interest (POIs) in the United States, and a private collection of unseen images. Through experiments on 10 state-of-the-art LLMs, including both open- and closed-source models, we reveal clear performance disparities, with closed-source models generally showing stronger reasoning. Importantly, we uncover geospatial biases as LLMs tend to perform better in high-resource regions (e.g., North America, Western Europe, and California) while exhibiting degraded performance in underrepresented areas. Regression diagnostics demonstrate that successful geolocalization is primarily dependent on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks. Overall, IMAGEO-Bench provides a rigorous lens into the spatial reasoning capabilities of LLMs and offers implications for building geolocation-aware AI systems.
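Distance error for a geolocalization benchmark is usually measured as the great-circle distance between predicted and true coordinates. The sketch below uses the haversine formula. The abstract does not say which distance the benchmark uses, so the formula choice, the function names, and the thresholded accuracy helper are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def accuracy_within(errors_km, threshold_km):
    """Fraction of predictions whose distance error is at most threshold_km."""
    return sum(e <= threshold_km for e in errors_km) / len(errors_km)

# Hypothetical example: the model answers London for an image taken in Paris
error = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)
```

Reporting `accuracy_within` at several thresholds (for example city, region, and country scale) is a common way to summarize such errors, though the thresholds used by IMAGEO-Bench are not given here.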

[194] LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding

Xuanzhao Dong,Wenhui Zhu,Xiwen Chen,Zhipeng Wang,Peijie Qiu,Shao Tang,Xin Li,Yalin Wang

Main category: cs.CV

TL;DR: This paper introduces LLaDA-MedV, a novel vision-language model for biomedical tasks that outperforms existing approaches and achieves top results on multiple benchmarks.

Details Motivation: While autoregressive models dominate biomedical vision-language tasks, masked diffusion models like LLaDA show promise, yet remain underexplored in this domain. Method: Introducing LLaDA-MedV, a vision instruction-tuned language diffusion model tailored for biomedical image understanding. Result: LLaDA-MedV achieves significant performance gains, setting state-of-the-art accuracy on three biomedical VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Conclusion: LLaDA-MedV is the first large language diffusion model for biomedical image understanding, outperforming existing methods and setting new benchmarks in biomedical visual tasks. Abstract: Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. The code and model weights are released at https://github.com/LLM-VLM-GSL/LLaDA-MedV.

[195] Rate-distortion Optimized Point Cloud Preprocessing for Geometry-based Point Cloud Compression

Wanhao Ma,Wei Zhang,Shuai Wan,Fuzheng Yang

Main category: cs.CV

TL;DR: This paper proposes a novel preprocessing framework that combines a voxelization network with a differentiable G-PCC surrogate model to improve the efficiency of G-PCC while preserving its interoperability and computational flexibility.

Details Motivation: The motivation is to improve the performance of G-PCC, which underperforms compared to deep learning-based methods, without sacrificing its computational efficiency or interoperability. Method: The method integrates a compression-oriented voxelization network with a differentiable G-PCC surrogate model, optimized in an end-to-end training phase. Result: The experiments show a 38.84% average BD-rate reduction over G-PCC, demonstrating the effectiveness of the proposed framework. Conclusion: The proposed preprocessing framework enhances the efficiency of G-PCC while maintaining interoperability and computational flexibility, offering a practical method to improve legacy compression standards with deep learning. Abstract: Geometry-based point cloud compression (G-PCC), an international standard designed by MPEG, provides a generic framework for compressing diverse types of point clouds while ensuring interoperability across applications and devices. However, G-PCC underperforms compared to recent deep learning-based PCC methods despite its lower computational power consumption. To enhance the efficiency of G-PCC without sacrificing its interoperability or computational flexibility, we propose a novel preprocessing framework that integrates a compression-oriented voxelization network with a differentiable G-PCC surrogate model, jointly optimized in the training phase. The surrogate model mimics the rate-distortion behaviour of the non-differentiable G-PCC codec, enabling end-to-end gradient propagation. The versatile voxelization network adaptively transforms input point clouds using learning-based voxelization and effectively manipulates point clouds via global scaling, fine-grained pruning, and point-level editing for rate-distortion trade-offs. During inference, only the lightweight voxelization network is appended to the G-PCC encoder, requiring no modifications to the decoder, thus introducing no computational overhead for end users. 
Extensive experiments demonstrate a 38.84% average BD-rate reduction over G-PCC. By bridging classical codecs with deep learning, this work offers a practical pathway to enhance legacy compression standards while preserving their backward compatibility, making it ideal for real-world deployment.

[196] Glass Surface Segmentation with an RGB-D Camera via Weighted Feature Fusion for Service Robots

Henghong Lin,Zihan Zhu,Tao Wang,Anastasia Ioannou,Yuanshui Huang

Main category: cs.CV

TL;DR: This paper proposes a Weighted Feature Fusion module and a new RGB-D dataset, MJU-Glass, to enhance glass surface segmentation for robotics, achieving notable performance improvements.

Details Motivation: Glass surface segmentation is challenging due to transparency, reflections, and occlusions, and effective fusion of RGB and depth information is crucial for accurate segmentation in robotic applications. Method: A Weighted Feature Fusion (WFF) module was developed to dynamically combine RGB and depth features, and the MJU-Glass dataset was introduced for benchmarking segmentation models. Result: Experimental results showed significant improvements in segmentation accuracy and robustness, including a 7.49% improvement in boundary IoU (bIoU) when integrated with PSPNet. Conclusion: The proposed WFF module and MJU-Glass dataset offer a robust framework for improving glass surface segmentation in robotics, significantly enhancing segmentation accuracy and reducing collision risks with glass objects. Abstract: We address the problem of glass surface segmentation with an RGB-D camera, with a focus on effectively fusing RGB and depth information. To this end, we propose a Weighted Feature Fusion (WFF) module that dynamically and adaptively combines RGB and depth features to tackle issues such as transparency, reflections, and occlusions. This module can be seamlessly integrated with various deep neural network backbones as a plug-and-play solution. Additionally, we introduce the MJU-Glass dataset, a comprehensive RGB-D dataset collected by a service robot navigating real-world environments, providing a valuable benchmark for evaluating segmentation models. Experimental results show significant improvements in segmentation accuracy and robustness, with the WFF module enhancing performance in both mean Intersection over Union (mIoU) and boundary IoU (bIoU), achieving a 7.49% improvement in bIoU when integrated with PSPNet. The proposed module and dataset provide a robust framework for advancing glass surface segmentation in robotics and reducing the risk of collisions with glass objects.
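For reference, the Intersection over Union metric behind the mIoU figure can be sketched as follows. This is a minimal illustration on binary masks stored as nested lists. Averaging over the glass and background classes is an assumption about how mIoU is computed here, and boundary IoU (which restricts the masks to a band around object contours) is not implemented.

```python
def iou(pred, target):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention
    return inter / union if union else 1.0

def invert(mask):
    """Swap foreground and background in a binary mask."""
    return [[1 - v for v in row] for row in mask]

def mean_iou(pred, target):
    """Mean of the glass-class IoU and the background-class IoU (assumed definition)."""
    return 0.5 * (iou(pred, target) + iou(invert(pred), invert(target)))

# Hypothetical 2x3 masks: the prediction marks one extra glass pixel
pred = [[1, 1, 0], [0, 0, 0]]
target = [[1, 0, 0], [0, 0, 0]]
glass_iou = iou(pred, target)
miou = mean_iou(pred, target)
```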

[197] CSLRConformer: A Data-Centric Conformer Approach for Continuous Arabic Sign Language Recognition on the Isharah Datase

Fatimah Mohamed Emad Elden

Main category: cs.CV

TL;DR: This paper introduces a data-centric approach and the CSLRConformer architecture to improve Continuous Sign Language Recognition (CSLR), achieving strong results by adapting the Conformer model originally designed for speech recognition.

Details Motivation: The paper addresses the challenge of signer-independent recognition in Continuous Sign Language Recognition (CSLR), aiming to improve generalization across diverse signers due to fluid inter-sign transitions, lack of temporal boundaries, and co-articulation effects. Method: A data-centric methodology was proposed, involving feature engineering, a preprocessing pipeline (including DBSCAN-based outlier filtering and spatial normalization), and the novel CSLRConformer architecture, which adapts the Conformer model to handle the spatio-temporal dynamics of sign language. Result: The proposed methodology achieved a Word Error Rate (WER) of 5.60% on the development set and 12.01% on the test set, securing a 3rd place ranking on the official competition platform. Conclusion: The research successfully adapts the Conformer model, originally designed for speech recognition, to achieve state-of-the-art performance in keypoint-based CSLR, as evidenced by competitive results and a 3rd place ranking in the MSLR 2025 Workshop Challenge. Abstract: The field of Continuous Sign Language Recognition (CSLR) poses substantial technical challenges, including fluid inter-sign transitions, the absence of temporal boundaries, and co-articulation effects. This paper, developed for the MSLR 2025 Workshop Challenge at ICCV 2025, addresses the critical challenge of signer-independent recognition to advance the generalization capabilities of CSLR systems across diverse signers. A data-centric methodology is proposed, centered on systematic feature engineering, a robust preprocessing pipeline, and an optimized model architecture. Key contributions include a principled feature selection process guided by Exploratory Data Analysis (EDA) to isolate communicative keypoints, a rigorous preprocessing pipeline incorporating DBSCAN-based outlier filtering and spatial normalization, and the novel CSLRConformer architecture. 
This architecture adapts the hybrid CNN-Transformer design of the Conformer model, leveraging its capacity to model local temporal dependencies and global sequence context, a characteristic well suited to the spatio-temporal dynamics of sign language. The proposed methodology achieved competitive performance, with a Word Error Rate (WER) of 5.60% on the development set and 12.01% on the test set, a result that secured a 3rd place ranking on the official competition platform. This research validates the efficacy of cross-domain architectural adaptation, demonstrating that the Conformer model, originally conceived for speech recognition, can be successfully repurposed to establish new state-of-the-art performance in keypoint-based CSLR.
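The Word Error Rate reported above is conventionally the word-level edit distance between reference and hypothesis divided by the reference length. A minimal sketch of that standard definition follows. The competition's exact scoring script may differ (for example in tokenization), and the function name and example glosses are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical gloss sequences: one substitution and one deletion over four words
wer = word_error_rate("a b c d", "a x c")
```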

[198] Minimal High-Resolution Patches Are Sufficient for Whole Slide Image Representation via Cascaded Dual-Scale Reconstruction

Yujian Liu,Yuechuan Lin,Dongxu Shen,Haoran Li,Yutong Wang,Xiaoli Liu,Shidang Xu

Main category: cs.CV

TL;DR: CDSR improves whole-slide image analysis efficiency and accuracy by selecting and reconstructing key patches, outperforming existing methods with far less data.

Details Motivation: To address the inefficiency and domain gap in whole-slide image analysis by improving representation without dense sampling or generic SSL pipelines. Method: Cascaded Dual-Scale Reconstruction (CDSR) with two-stage selective sampling and a Local-to-Global Network. Result: Achieved 6.3% improvement in accuracy and 5.5% increase in AUC with only 9 high-resolution patches per WSI on average. Conclusion: CDSR is optimized for efficiency and morphological fidelity, outperforming state-of-the-art methods with significantly fewer patches. Abstract: Whole-slide image (WSI) analysis remains challenging due to the gigapixel scale and sparsely distributed diagnostic regions. Multiple Instance Learning (MIL) mitigates this by modeling the WSI as bags of patches for slide-level prediction. However, most MIL approaches emphasize aggregator design while overlooking the impact of the feature extraction stage, whose feature extractor is often pretrained on natural images. This leads to a domain gap and suboptimal representations. Self-supervised learning (SSL) has shown promise in bridging the domain gap via pretext tasks, but it still primarily builds upon generic backbones, thus requiring WSIs to be split into small patches. This inevitably splits histological structures and generates both redundant and interdependent patches, which in turn degrades aggregator performance and drastically increases training costs. To address this challenge, we propose a Cascaded Dual-Scale Reconstruction (CDSR) framework, demonstrating that only an average of 9 high-resolution patches per WSI are sufficient for robust slide-level representation. CDSR employs a two-stage selective sampling strategy that identifies the most informative representative regions from both model-based and semantic perspectives. 
These patches are then fed into a Local-to-Global Network, which reconstructs spatially coherent high-resolution WSI representations by integrating fine-grained local detail with global contextual information. Unlike existing dense-sampling or SSL pipelines, CDSR is optimized for efficiency and morphological fidelity. Experiments on Camelyon16, TCGA-NSCLC, and TCGA-RCC demonstrate that CDSR achieves improvements of 6.3% in accuracy and 5.5% in area under ROC curve on downstream classification tasks with only 7,070 (4.5% of total) high-resolution patches per dataset on average, outperforming state-of-the-art methods trained on over 10,000,000 patches.

[199] Subject or Style: Adaptive and Training-Free Mixture of LoRAs

Jia-Chen Zhang,Yu-Jie Xiong

Main category: cs.CV

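TL;DR: EST-LoRA is a training-free adaptive LoRA fusion method that dynamically selects between subject and style LoRAs in each attention layer based on matrix energy, style discrepancy scores, and time steps, improving generation quality and efficiency.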

Details Motivation: Existing LoRA fusion methods struggle to balance subject and style and usually require additional training, while training-free methods such as K-LoRA involve too many hyperparameters to adapt to all styles and subjects. Method: Drawing on the Mixture of Experts (MoE) architecture, EST-LoRA adaptively selects between the subject LoRA and the style LoRA in each attention layer, basing the decision on three factors: matrix energy, style discrepancy scores, and time steps. Result: Experiments show that EST-LoRA outperforms existing methods in both qualitative and quantitative evaluations while generating faster. Conclusion: EST-LoRA provides training-free adaptive LoRA fusion that, by jointly considering matrix energy, style discrepancy scores, and time steps, achieves better generation quality and speed than existing methods. Abstract: Fine-tuning models via Low-Rank Adaptation (LoRA) demonstrates remarkable performance in subject-driven or style-driven generation tasks. Studies have explored combinations of different LoRAs to jointly generate learned styles and content. However, current methods struggle to balance the original subject and style, and often require additional training. Recently, K-LoRA proposed a training-free LoRA fusion method, but it involves multiple hyperparameters, making it difficult to adapt to all styles and subjects. In this paper, we propose EST-LoRA, a training-free adaptive LoRA fusion method. It comprehensively considers three critical factors: Energy of matrix, Style discrepancy scores, and Time steps. Analogous to the Mixture of Experts (MoE) architecture, the model adaptively selects between subject LoRA and style LoRA within each attention layer. This integrated selection mechanism ensures balanced contributions from both components during the generation process. Experimental results show that EST-LoRA outperforms state-of-the-art methods in both qualitative and quantitative evaluations and achieves faster generation speed compared to other efficient fusion approaches. Our code is publicly available at: https://anonymous.4open.science/r/EST-LoRA-F318.

[200] StrandDesigner: Towards Practical Strand Generation with Sketch Guidance

Na Zhang,Moran Li,Chengming Xu,Han Feng,Xiaobin Hu,Jiangning Zhang,Weijian Cao,Chengjie Wang,Yanwei Fu

Main category: cs.CV

TL;DR: This paper introduces a sketch-based strand generation model that improves realism and precision in hair strand generation by using a learnable upsampling strategy and a multi-scale conditioning mechanism.

Details Motivation: Realistic hair strand generation is crucial for applications like computer graphics and virtual reality, but existing methods using text or images lack precision and user-friendliness. Method: The model uses a learnable strand upsampling strategy and a multi-scale adaptive conditioning mechanism with a transformer and diffusion heads to ensure consistency across granularity levels. Result: Experiments on benchmark datasets demonstrated the method's superiority in realism and precision, with qualitative results confirming its effectiveness. Conclusion: The proposed sketch-based strand generation model outperforms existing approaches in realism and precision, offering finer control and user-friendliness for realistic hair strand generation. Abstract: Realistic hair strand generation is crucial for applications like computer graphics and virtual reality. While diffusion models can generate hairstyles from text or images, these inputs lack precision and user-friendliness. Instead, we propose the first sketch-based strand generation model, which offers finer control while remaining user-friendly. Our framework tackles key challenges, such as modeling complex strand interactions and diverse sketch patterns, through two main innovations: a learnable strand upsampling strategy that encodes 3D strands into multi-scale latent spaces, and a multi-scale adaptive conditioning mechanism using a transformer with diffusion heads to ensure consistency across granularity levels. Experiments on several benchmark datasets show our method outperforms existing approaches in realism and precision. Qualitative results further confirm its effectiveness. Code will be released at [GitHub](https://github.com/fighting-Zhang/StrandDesigner).

[201] Modality Bias in LVLMs: Analyzing and Mitigating Object Hallucination via Attention Lens

Haohan Zheng,Zhenguo Zhang

Main category: cs.CV

TL;DR: This paper addresses object hallucination in LVLMs by proposing a training-free method that adjusts attention weights and uses contrastive decoding to improve cross-modal alignment and reduce hallucination.

Details Motivation: LVLMs suffer from object hallucination due to modality bias, where they may ignore either visual or textual information, leading to fragmented understanding of user-provided instructions. Method: A training-free method is proposed, involving attention weight adjustment for balancing cross-modal compatibility and a contrastive decoding strategy to reduce overreliance on parametric knowledge. Result: Extensive experiments confirm the widespread presence of modality bias in LVLMs, and the proposed method effectively mitigates hallucination across multiple LVLMs and benchmarks. Conclusion: The proposed method effectively mitigates object hallucination in LVLMs by addressing modality bias, showing generalizability and efficacy across multiple open-source LVLMs and benchmarks. Abstract: Large vision-language models (LVLMs) have demonstrated remarkable multimodal comprehension and reasoning capabilities, but they still suffer from severe object hallucination. Previous studies primarily attribute the flaw to linguistic prior caused by the scale mismatch between visual encoders and large language models (LLMs) in LVLMs. Specifically, as current LVLMs are built upon LLMs, they tend to over-rely on textual prompts and internal knowledge of LLMs, generating descriptions inconsistent with visual cues. However, through an in-depth investigation of the hallucinated mechanisms, we empirically reveal a previously overlooked phenomenon: LVLMs may ignore not only visual information but also textual modality during hallucination, a behavior termed as modality bias, which indicates that LVLMs struggle to simultaneously attend to both visual and textual modalities, leading to fragmented understanding of user-provided instructions. Based on this observation, we propose a simple yet effective training-free method to mitigate object hallucination. 
Concretely, we intervene and adjust the attention weights of textual and visual tokens, balancing cross-modal compatibility for better alignment with user intentions. Furthermore, we adopt a contrastive decoding strategy to reduce the LVLM's overreliance on its parametric knowledge, synergistically enhancing our attention manipulation. Extensive experiments confirm the widespread presence of modality bias in LVLMs. Notably, our method effectively mitigates hallucination across multiple open-source LVLMs and benchmarks, highlighting its generalizability and efficacy.
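The contrastive decoding idea can be made concrete with a small sketch. It uses the widely used form in which logits from a degraded or prior-dominated forward pass are subtracted from the logits of the full multimodal pass. Whether this paper uses exactly this formula is not stated in the abstract, so the formula, the `alpha` parameter, and the example numbers are all assumptions.

```python
def contrastive_logits(full_logits, prior_logits, alpha=1.0):
    """Amplify what the full input supports and penalize what the prior alone predicts.

    adjusted = (1 + alpha) * full - alpha * prior   (assumed formulation)
    """
    return [(1 + alpha) * f - alpha * p for f, p in zip(full_logits, prior_logits)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical next-token logits over a three-word vocabulary
full = [2.0, 2.2, 0.1]   # pass with image and instruction
prior = [0.5, 2.5, 0.0]  # pass dominated by the model's parametric knowledge
adjusted = contrastive_logits(full, prior)
```

In this toy case token 1 wins under the plain logits only because the prior favors it. After the contrastive adjustment token 0, which the full input supports more than the prior does, becomes the top choice.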

[202] DAG: Unleash the Potential of Diffusion Model for Open-Vocabulary 3D Affordance Grounding

Hanqing Wang,Zhenhao Zhang,Kaiyang Ji,Mingyu Liu,Wenti Yin,Yuchao Chen,Zhirui Liu,Xiangyu Zeng,Tianxiang Gui,Hangxing Zhang

Main category: cs.CV

TL;DR: This paper proposes DAG, a novel framework that uses text-to-image diffusion models to improve 3D object affordance grounding, achieving better performance and generalization than existing methods.

Details Motivation: Current methods for 3D object affordance grounding struggle to capture general affordance knowledge, leading to poor generalization. The authors aim to leverage the semantic understanding of diffusion models to overcome this limitation. Method: The authors introduced DAG, a diffusion-based 3D affordance grounding framework, which utilizes frozen internal representations from text-to-image diffusion models. They also designed an affordance block and a multi-source affordance decoder for 3D dense affordance prediction. Result: The proposed DAG framework outperforms established methods in extensive experimental evaluations and demonstrates strong open-world generalization capabilities. Conclusion: The proposed DAG framework effectively extracts general affordance knowledge from text-to-image diffusion models, leading to superior performance in 3D object affordance grounding with open-world generalization. Abstract: 3D object affordance grounding aims to predict the touchable regions on a 3D object, which is crucial for human-object interaction, human-robot interaction, embodied perception, and robot learning. Recent advances tackle this problem via learning from demonstration images. However, these methods fail to capture the general affordance knowledge within the image, leading to poor generalization. To address this issue, we propose to use text-to-image diffusion models to extract the general affordance knowledge because we find that such models can generate semantically valid HOI images, which demonstrates that their internal representation space is highly correlated with real-world affordance concepts. Specifically, we introduce DAG, a diffusion-based 3D affordance grounding framework, which leverages the frozen internal representations of the text-to-image diffusion model and unlocks affordance knowledge within the diffusion model to perform 3D affordance grounding. 
We further introduce an affordance block and a multi-source affordance decoder to enable dense 3D affordance prediction. Extensive experimental evaluations show that our model outperforms well-established methods and exhibits open-world generalization.

[203] MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing

Chenxi Li,Yichen Guo,Benfang Qian,Jinhao You,Kai Tang,Yaosong Du,Zonghao Zhang,Xiande Huang

Main category: cs.CV

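TL;DR: This paper proposes MAP, a training-free decoding method that treats the hidden states of an LVLM as a 2D semantic map and applies attention-based map-level operations to mitigate hallucinations.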

Details Motivation: LVLMs still hallucinate content that is inconsistent with visual inputs, and most existing methods target only localized inter- or intra-layer regions even though factual information is widely distributed across the hidden states. Method: Map-Level Attention Processing (MAP) uses Layer-Wise Criss-Cross Attention to refine token representations at each decoding layer and a Global-Local Logit Fusion mechanism that combines logits obtained before and after global attention. Result: The method consistently improves the truthfulness and performance of LVLMs across benchmarks such as POPE, MME, and MMHal-Bench. Conclusion: The results demonstrate the potential of the map-level decoding strategy. Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance in multimodal tasks, but they still suffer from hallucinations, i.e., generating content that is grammatically accurate but inconsistent with visual inputs. In this work, we introduce a novel map-level perspective to mitigate hallucinations in LVLMs, interpreting the hidden states of the model as a 2D semantic map. We observe that factual information is widely distributed across this map, extending beyond the localized inter- or intra-layer regions targeted by most existing methods (e.g., contrastive decoding and layer-wise consistency). Building on this insight, we propose Map-Level Attention Processing (MAP), a training-free decoding method that effectively leverages factual information through attention-based map-level operations to improve factual consistency. Specifically, we employ Layer-Wise Criss-Cross Attention to progressively refine token representations at each decoding layer by aggregating tokens from both inter- and intra-layer dimensions. Additionally, a Global-Local Logit Fusion mechanism combines logits obtained before and after global attention to further refine predictions and improve accuracy. Our method consistently improves the truthfulness and performance of LVLMs across benchmarks, such as POPE, MME, and MMHal-Bench, demonstrating the potential of the map-level decoding strategy.

[204] Single Point, Full Mask: Velocity-Guided Level Set Evolution for End-to-End Amodal Segmentation

Zhixuan Li,Yujia Liu,Chen Hui,Weisi Lin

Main category: cs.CV

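TL;DR: VELA is a new amodal segmentation method that progressively evolves a level set from a point prompt into the final amodal mask, outperforming existing methods while requiring only a single point prompt.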

Details Motivation: Existing methods typically rely on strong prompts such as visible masks or bounding boxes, which are costly or impractical to obtain in real-world settings. Although recent approaches such as the Segment Anything Model (SAM) support point-based prompts, they usually regress masks directly without explicitly modeling shape evolution, which limits generalization in complex occlusion scenarios. In addition, most existing methods are black boxes that lack geometric interpretability and offer limited insight into how occluded shapes are inferred. Method: VELA first constructs an initial level set function from image features and the point input, then progressively evolves it into the final amodal mask under the guidance of a shape-specific motion field predicted by a fully differentiable network. Result: Extensive experiments on the COCOA-cls, D2SA, and KINS benchmarks show that VELA outperforms existing strongly prompted methods while requiring only a single-point prompt, validating the effectiveness of interpretable geometric modeling under weak guidance. Conclusion: VELA is an end-to-end VElocity-driven Level-set Amodal segmentation method based on point prompts that explicitly models contour evolution by progressively evolving toward the final amodal mask. Abstract: Amodal segmentation aims to recover complete object shapes, including occluded regions with no visual appearance, whereas conventional segmentation focuses solely on visible areas. Existing methods typically rely on strong prompts, such as visible masks or bounding boxes, which are costly or impractical to obtain in real-world settings. While recent approaches such as the Segment Anything Model (SAM) support point-based prompts for guidance, they often perform direct mask regression without explicitly modeling shape evolution, limiting generalization in complex occlusion scenarios. Moreover, most existing methods suffer from a black-box nature, lacking geometric interpretability and offering limited insight into how occluded shapes are inferred. To deal with these limitations, we propose VELA, an end-to-end VElocity-driven Level-set Amodal segmentation method that performs explicit contour evolution from point-based prompts. VELA first constructs an initial level set function from image features and the point input, which then progressively evolves into the final amodal mask under the guidance of a shape-specific motion field predicted by a fully differentiable network. This network learns to generate evolution dynamics at each step, enabling geometrically grounded and topologically flexible contour modeling. 
Extensive experiments on COCOA-cls, D2SA, and KINS benchmarks demonstrate that VELA outperforms existing strongly prompted methods while requiring only a single-point prompt, validating the effectiveness of interpretable geometric modeling under weak guidance. The code will be publicly released.
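
The evolution step described in this abstract can be illustrated with a minimal sketch. This is not VELA's implementation: the function names, the circular initialization, the explicit Euler update, and the constant velocity used in the usage note are all assumptions made for illustration. In VELA the initial level set comes from image features and the motion field is predicted by a network.

```python
import math

def init_level_set(h, w, point, radius):
    """Signed distance to a small circle around the clicked point (negative inside).
    Hypothetical initialization; the paper builds it from image features as well."""
    py, px = point
    return [[math.hypot(y - py, x - px) - radius for x in range(w)] for y in range(h)]

def grad_norm(phi, y, x):
    """Central-difference gradient magnitude, with indices clamped at the borders."""
    h, w = len(phi), len(phi[0])
    dy = (phi[min(y + 1, h - 1)][x] - phi[max(y - 1, 0)][x]) / 2.0
    dx = (phi[y][min(x + 1, w - 1)] - phi[y][max(x - 1, 0)]) / 2.0
    return math.hypot(dy, dx)

def evolve(phi, velocity, steps, dt=0.5):
    """Explicit Euler steps of phi_t = -V * |grad phi|.
    `velocity(y, x)` stands in for the learned motion field (an assumption)."""
    h, w = len(phi), len(phi[0])
    for _ in range(steps):
        phi = [[phi[y][x] - dt * velocity(y, x) * grad_norm(phi, y, x)
                for x in range(w)] for y in range(h)]
    return phi

def mask_area(phi):
    """Number of pixels inside the contour (phi < 0)."""
    return sum(v < 0 for row in phi for v in row)
```

With a constant positive velocity, for example `evolve(init_level_set(32, 32, (16, 16), 2.0), lambda y, x: 1.0, 10)`, the region where `phi < 0` grows outward from the clicked point; a spatially varying velocity would let the contour expand in some directions and shrink in others.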

[205] Shape Distribution Matters: Shape-specific Mixture-of-Experts for Amodal Segmentation under Diverse Occlusions

Zhixuan Li,Yujia Liu,Chen Hui,Jeonghaeng Lee,Sanghoon Lee,Weisi Lin

Main category: cs.CV

TL;DR: ShapeMoE dynamically routes each object to a shape-specific expert, effectively addressing amodal segmentation.

Details Motivation: Amodal segmentation faces complex occlusions and extreme shape variation, and a conventional single model struggles to handle diverse shapes. Method: ShapeMoE adopts a Mixture-of-Experts (MoE) framework: it encodes each object into a compact Gaussian embedding and uses a shape-aware sparse router to select a suitable expert for prediction. Result: ShapeMoE outperforms existing methods on COCOA-cls, KINS, and D2SA, especially in occluded region segmentation. Conclusion: ShapeMoE is a shape-specific sparse Mixture-of-Experts framework for amodal segmentation; by learning a latent shape distribution space and dynamically routing each object to the most suitable expert, it achieves precise and efficient shape-aware expert routing while maintaining good interpretability, high capacity, and efficiency. Abstract: Amodal segmentation aims to predict complete object masks, covering both visible and occluded regions. This task poses significant challenges due to complex occlusions and extreme shape variation, from rigid furniture to highly deformable clothing. Existing one-size-fits-all approaches rely on a single model to handle all shape types, struggling to capture and reason about diverse amodal shapes due to limited representation capacity. A natural solution is to adopt a Mixture-of-Experts (MoE) framework, assigning experts to different shape patterns. However, naively applying MoE without considering the object's underlying shape distribution can lead to mismatched expert routing and insufficient expert specialization, resulting in redundant or underutilized experts. To deal with these issues, we introduce ShapeMoE, a shape-specific sparse Mixture-of-Experts framework for amodal segmentation. The key idea is to learn a latent shape distribution space and dynamically route each object to a lightweight expert tailored to its shape characteristics. Specifically, ShapeMoE encodes each object into a compact Gaussian embedding that captures key shape characteristics. A Shape-Aware Sparse Router then maps the object to the most suitable expert, enabling precise and efficient shape-aware expert routing. Each expert is designed to be lightweight and specialized in predicting occluded regions for specific shape patterns. ShapeMoE offers good interpretability via clear shape-to-expert correspondence, while maintaining high capacity and efficiency. 
Experiments on COCOA-cls, KINS, and D2SA show that ShapeMoE consistently outperforms state-of-the-art methods, especially in occluded region segmentation. The code will be released.
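
Sparse routing of the kind described here can be sketched as top-1 selection over expert scores. The sketch below is illustrative only: `route_top1`, the dot-product scoring, and the `expert_keys` vectors are assumptions, and the abstract does not specify how the Shape-Aware Sparse Router scores experts or how the Gaussian embedding is used.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(embedding, expert_keys):
    """Return (index of the selected expert, routing probabilities).
    Scoring by dot product with one key per expert is a hypothetical choice."""
    scores = [sum(e * k for e, k in zip(embedding, key)) for key in expert_keys]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

Because only the selected expert runs for each object, compute per object stays roughly constant as experts are added, and the chosen index gives a direct shape-to-expert correspondence that can be inspected.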

[206] Rein++: Efficient Generalization and Adaptation for Semantic Segmentation with Vision Foundation Models

Zhixiang Wei,Xiaoxiao Ma,Ruishen Yan,Tao Tu,Huaian Chen,Jinjin Zheng,Yi Jin,Enhong Chen

Main category: cs.CV

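TL;DR: Rein++ is an efficient segmentation framework built on vision foundation models; through domain generalization and domain adaptation it addresses small segmentation datasets and domain distribution shifts, achieving strong generalization and adaptation.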

Details Motivation: Applying vision foundation models (VFMs) to semantic segmentation is limited by two challenges: (1) segmentation datasets are far smaller than VFM pre-training data, and (2) real-world segmentation scenarios are diverse and often underrepresented during pre-training. Method: Rein++ has two core components, Rein-G and Rein-A. Rein-G refines VFM features with trainable, instance-aware tokens while fine-tuning less than 1% of the backbone's parameters; Rein-A performs unsupervised domain adaptation at the instance and logit levels and adds a semantic transfer module that improves boundary details in the target domain. Result: Experiments show that Rein++ significantly outperforms state-of-the-art methods, with efficient training and good adaptability even for large models with billions of parameters. Conclusion: Rein++ is an efficient VFM-based segmentation framework that addresses domain generalization with Rein-G and domain adaptation with Rein-A, generalizing well from limited data and adapting effectively to diverse unlabeled scenarios. Abstract: Vision Foundation Models (VFMs) have achieved remarkable success in various computer vision tasks. However, their application to semantic segmentation is hindered by two significant challenges: (1) the disparity in data scale, as segmentation datasets are typically much smaller than those used for VFM pre-training, and (2) domain distribution shifts, where real-world segmentation scenarios are diverse and often underrepresented during pre-training. To overcome these limitations, we present Rein++, an efficient VFM-based segmentation framework that demonstrates superior generalization from limited data and enables effective adaptation to diverse unlabeled scenarios. Specifically, Rein++ comprises a domain generalization solution Rein-G and a domain adaptation solution Rein-A. Rein-G introduces a set of trainable, instance-aware tokens that effectively refine the VFM's features for the segmentation task. This parameter-efficient approach fine-tunes less than 1% of the backbone's parameters, enabling robust generalization. Building on Rein-G, Rein-A performs unsupervised domain adaptation at both the instance and logit levels to mitigate domain shifts. In addition, it incorporates a semantic transfer module that leverages the class-agnostic capabilities of the segment anything model to enhance boundary details in the target domain. The integrated Rein++ pipeline first learns a generalizable model on a source domain (e.g., daytime scenes) and subsequently adapts it to diverse target domains (e.g., nighttime scenes) without any target labels. 
Comprehensive experiments demonstrate that Rein++ significantly outperforms state-of-the-art methods with efficient training, underscoring its role as an efficient, generalizable, and adaptive segmentation solution for VFMs, even for large models with billions of parameters. The code is available at https://github.com/wloves/Rein.

[207] Benchmarking Adversarial Patch Selection and Location

Shai Kimhi,Avi Mendlson,Moshe Kimhi

Main category: cs.CV

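TL;DR: This paper introduces PatchMap, a new benchmark for adversarial patch attacks validated through large-scale experiments, and proposes a new attack placement method that significantly raises attack success rates.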

Details Motivation: Adversarial patch attacks threaten the reliability of modern vision models and call for systematic study. Method: PatchMap is built from 150 million forward passes on ImageNet validation images, and a segmentation-based placement heuristic is proposed. Result: PatchMap reveals systematic hot-spots where small patches cause misclassification and large drops in model confidence, and the proposed heuristic raises attack success rates by 8 to 13 percentage points. Conclusion: PatchMap is a comprehensive benchmark for adversarial patch attacks, and location-aware placement can significantly raise attack success rates. Abstract: Adversarial patch attacks threaten the reliability of modern vision models. We present PatchMap, the first spatially exhaustive benchmark of patch placement, built by evaluating over 1.5e8 forward passes on ImageNet validation images. PatchMap reveals systematic hot-spots where small patches (as little as 2% of the image) induce confident misclassifications and large drops in model confidence. To demonstrate its utility, we propose a simple segmentation-guided placement heuristic that leverages off-the-shelf masks to identify vulnerable regions without any gradient queries. Across five architectures, including an adversarially trained ResNet50, our method boosts attack success rates by 8 to 13 percentage points compared to random or fixed placements. We publicly release PatchMap and the code implementation. The full PatchMap bench (6.5B predictions, multiple backbones) will be released soon to further accelerate research on location-aware defenses and adaptive attacks.

[208] Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models

Zhaochen Wang,Yiwei Wang,Yujun Cai

Main category: cs.CV

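TL;DR: Embedding text instructions into images improves Qwen2.5-VL's performance but degrades other models, indicating that models differ in how well they handle embedded text.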

Details Motivation: Address the challenge of aligning multimodal information in vision-language models and reduce hallucination. Method: Embed text instructions directly into the image, forcing the model to process all content through the visual channel. Result: Prompt-in-Image improves Qwen2.5-VL's POPE accuracy and reduces hallucination, but causes large performance drops for LLaVA-1.5 and InstructBLIP. Conclusion: Prompt-in-Image has markedly different effects across vision-language models: Qwen2.5-VL improves, whereas LLaVA-1.5 and InstructBLIP degrade severely. Abstract: Vision-Language Models (VLMs) often suffer from hallucination, partly due to challenges in aligning multimodal information. We propose Prompt-in-Image, a simple method that embeds textual instructions directly into images. This removes the need for separate text inputs and forces the model to process all content through the visual channel. We evaluate this method on three popular open-source VLMs: Qwen2.5-VL, LLaVA-1.5, and InstructBLIP. The results reveal sharp differences. Prompt-in-Image improves Qwen2.5-VL's performance, increasing POPE accuracy by 4.1 percentage points (from 80.2 percent to 84.3 percent) and also reducing hallucination rates on MS-COCO. In contrast, LLaVA-1.5 and InstructBLIP experience a severe performance drop, with accuracy falling from around 84 percent to near-random levels. Through detailed analysis, we found that CLIP-based encoders in LLaVA and InstructBLIP exhibit excessive attention bias toward embedded text regions, disrupting visual understanding. In contrast, Qwen's vision encoder handles text-embedded images robustly. Crucially, Prompt-in-Image reduces Qwen's modality gap, enhancing cross-modal alignment by unifying information processing through a single modality.

[209] DisCo3D: Distilling Multi-View Consistency for 3D Scene Editing

Yufeng Chi,Huimin Ma,Kafeng Wang,Jianmin Li

Main category: cs.CV

TL;DR: DisCo3D is a novel framework that improves 3D editing by incorporating 3D consistency priors into a 2D editor, delivering better editing quality and multi-view consistency.

Details Motivation: The motivation is to overcome the limitations of existing 3D editing approaches, which suffer from slow convergence, blurry artifacts, and inconsistencies, especially in complex scenes. Method: The method involves fine-tuning a 3D generator using multi-view inputs, training a 2D editor through consistency distillation, and optimizing edited outputs into 3D representations using Gaussian Splatting. Result: Experimental results demonstrate that DisCo3D achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality. Conclusion: DisCo3D provides a new framework that integrates 3D consistency priors into a 2D editor, achieving superior editing quality and multi-view consistency compared to existing methods. Abstract: While diffusion models have demonstrated remarkable progress in 2D image generation and editing, extending these capabilities to 3D editing remains challenging, particularly in maintaining multi-view consistency. Classical approaches typically update 3D representations through iterative refinement based on a single editing view. However, these methods often suffer from slow convergence and blurry artifacts caused by cross-view inconsistencies. Recent methods improve efficiency by propagating 2D editing attention features, yet still exhibit fine-grained inconsistencies and failure modes in complex scenes due to insufficient constraints. To address this, we propose DisCo3D, a novel framework that distills 3D consistency priors into a 2D editor. Our method first fine-tunes a 3D generator using multi-view inputs for scene adaptation, then trains a 2D editor through consistency distillation. The edited multi-view outputs are finally optimized into 3D representations via Gaussian Splatting. Experimental results show DisCo3D achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality.

[210] Register Anything: Estimating "Corresponding Prompts" for Segment Anything Model

Shiqi Huang,Tingfa Xu,Wen Yan,Dean Barratt,Yipeng Hu

Main category: cs.CV

TL;DR: PromptReg simplifies image registration by directly searching for corresponding prompts using pre-trained segmentation models, eliminating the need for a two-step ROI segmentation and matching process.

Details Motivation: The motivation stems from the challenge of establishing pixel/voxel-level or region-level correspondences in image registration. Traditional methods require two steps: segmenting ROIs in each image and then matching them, which can be complex and time-consuming. Simplifying this into a single step can improve efficiency and effectiveness. Method: PromptReg introduces a one-step approach for image registration by using pre-trained segmentation models (e.g., SAM) to identify corresponding prompts. This includes defining a corresponding prompt problem, an inverse prompt solution to generate prompts in the target image's space, and a novel registration algorithm that identifies multiple paired ROIs by marginalizing across prompt and spatial dimensions. Result: PromptReg was tested on five applications involving various types of medical and non-medical images (3D prostate MR, 3D abdomen MR, 3D lung CT, 2D histopathology, and 2D aerial images). The approach outperformed both intensity-based iterative algorithms and learning-based DDF-predicting networks, even competing with weakly-supervised methods that require fully segmented training data. Conclusion: The proposed PromptReg approach simplifies the traditional two-step process of region-based correspondence representation into one step by directly searching for corresponding prompts using pre-trained segmentation models. It demonstrates superior performance over intensity-based iterative algorithms and learning-based networks in various applications. Abstract: Establishing pixel/voxel-level or region-level correspondences is the core challenge in image registration. The latter, also known as region-based correspondence representation, leverages paired regions of interest (ROIs) to enable regional matching while preserving fine-grained capability at pixel/voxel level. 
Traditionally, this representation is implemented via two steps: segmenting ROIs in each image then matching them between the two images. In this paper, we simplify this into one step by directly "searching for corresponding prompts", using extensively pre-trained segmentation models (e.g., SAM) for a training-free registration approach, PromptReg. Firstly, we introduce the "corresponding prompt problem", which aims to identify a corresponding Prompt Y in Image Y for any given visual Prompt X in Image X, such that the two respectively prompt-conditioned segmentations are a pair of corresponding ROIs from the two images. Secondly, we present an "inverse prompt" solution that generates primary and optionally auxiliary prompts, inverting Prompt X into the prompt space of Image Y. Thirdly, we propose a novel registration algorithm that identifies multiple paired corresponding ROIs by marginalizing the inverted Prompt X across both prompt and spatial dimensions. Comprehensive experiments are conducted on five applications of registering 3D prostate MR, 3D abdomen MR, 3D lung CT, 2D histopathology and, as a non-medical example, 2D aerial images. Based on metrics including Dice and target registration errors on anatomical structures, the proposed registration outperforms both intensity-based iterative algorithms and learning-based DDF-predicting networks, even yielding competitive performance with weakly-supervised approaches that require fully-segmented training data.

[211] Versatile Transition Generation with Image-to-Video Diffusion

Zuhao Yang,Jiahui Zhang,Yingchen Yu,Shijian Lu,Song Bai

Main category: cs.CV

TL;DR: VTG is a framework that generates smooth and high-quality video transitions using advanced techniques, outperforming existing methods.

Details Motivation: Generating smooth and rational video transitions from the first and last frames with descriptive prompts is underexplored, motivating the need for a better solution. Method: VTG uses interpolation-based initialization, dual-directional motion fine-tuning, and representation alignment regularization to enhance transition video generation. Result: VTG achieves superior performance in generating smooth, high-fidelity, and semantically coherent video transitions across all four tasks. Conclusion: VTG effectively generates smooth and high-fidelity video transitions, showing superior performance on transition tasks. Abstract: Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated and high-quality video generation. However, generating smooth and rational transition videos given the first and last video frames as well as descriptive text prompts is far underexplored. We present VTG, a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine-tuning and representation alignment regularization to mitigate the limitations of pre-trained image-to-video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collected TransitBench, a comprehensive benchmark for transition generation covering two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG achieves superior transition performance consistently across all four tasks.

[212] TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding

Zuhao Yang,Yingchen Yu,Yunqing Zhao,Shijian Lu,Song Bai

Main category: cs.CV

TL;DR: TimeExpert is an MoE-based Video-LLM that dynamically routes task tokens to improve the efficiency and accuracy of video temporal grounding.

Details Motivation: Existing Video Large Language Models (Video-LLMs) process all task tokens through identical static pathways and cannot distinguish temporal localization, saliency assessment, and text generation, which limits performance. Method: The paper introduces a Mixture-of-Experts (MoE)-based architecture that decomposes video temporal grounding into multiple specialized subtasks. Result: TimeExpert achieves state-of-the-art performance on a range of VTG tasks, such as dense video captioning, moment retrieval, and video highlight detection. Conclusion: By dynamically routing task-specific tokens to specialized experts, TimeExpert effectively decomposes VTG tasks and improves the accuracy and computational efficiency of event modeling. Abstract: Video Temporal Grounding (VTG) aims to precisely identify video event segments in response to textual queries. The outputs of VTG tasks manifest as sequences of events, each defined by precise timestamps, saliency scores, and textual descriptions. Despite recent advances, a fundamental limitation persists in existing Video Large Language Models (Video-LLMs): they process all task tokens through identical and static pathways, failing to recognize that temporal localization, saliency assessment, and textual generation represent fundamentally distinct tasks requiring specialized processing. To address this, we introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, with increased computational efficiency. Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications. Extensive experiments demonstrate that TimeExpert consistently achieves state-of-the-art performance on various VTG tasks such as Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.

[213] LT-Gaussian: Long-Term Map Update Using 3D Gaussian Splatting for Autonomous Driving

Luqi Cheng,Zhangshuo Qi,Zijie Zhou,Chao Lu,Guangming Xiong

Main category: cs.CV

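TL;DR: This paper proposes LT-Gaussian, a map update method for 3D Gaussian Splatting-based maps that handles environmental changes efficiently and produces high-quality map reconstructions.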

Details Motivation: Because generating Gaussian scenes is costly in time and computation, updating the map is a major challenge; the paper therefore proposes LT-Gaussian, a new map update method. Method: LT-Gaussian has three main components: Multimodal Gaussian Splatting, a Structural Change Detection Module, and a Gaussian-Map Update Module. It first generates the Gaussian map of the old scene with Multimodal Gaussian Splatting; during the update, it compares the outdated Gaussian map with the current LiDAR data stream to identify structural changes; finally, it applies targeted updates to the Gaussian map to produce an up-to-date map. Result: Experiments show that LT-Gaussian updates Gaussian maps effectively and efficiently and handles the environmental changes common in autonomous driving scenarios. Conclusion: LT-Gaussian is a map update method for 3D Gaussian Splatting-based maps that handles common environmental changes in autonomous driving effectively and efficiently and, by exploiting information from both the new and old scenes, yields higher-quality reconstructions than strategies that rebuild the map from scratch. Abstract: Maps play an important role in autonomous driving systems. The recently proposed 3D Gaussian Splatting (3D-GS) produces rendering-quality explicit scene reconstruction results, demonstrating the potential for map construction in autonomous driving scenarios. However, because of the time and computational costs involved in generating Gaussian scenes, how to update the map becomes a significant challenge. In this paper, we propose LT-Gaussian, a map update method for 3D-GS-based maps. LT-Gaussian consists of three main components: Multimodal Gaussian Splatting, Structural Change Detection Module, and Gaussian-Map Update Module. Firstly, the Gaussian map of the old scene is generated using our proposed Multimodal Gaussian Splatting. Subsequently, during the map update process, we compare the outdated Gaussian map with the current LiDAR data stream to identify structural changes. Finally, we perform targeted updates to the Gaussian-map to generate an up-to-date map. We establish a benchmark for map updating on the nuScenes dataset to quantitatively evaluate our method. The experimental results show that LT-Gaussian can effectively and efficiently update the Gaussian-map, handling common environmental changes in autonomous driving scenarios. Furthermore, by taking full advantage of information from both new and old scenes, LT-Gaussian is able to produce higher quality reconstruction results compared to map update strategies that reconstruct maps from scratch. Our open-source code is available at https://github.com/ChengLuqi/LT-gaussian.

[214] GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval

Bowen Yang,Yun Cao,Chen He,Xiaosu Su

Main category: cs.CV

TL;DR: GAID improves text-to-video retrieval by adaptively integrating audio and visual features and injecting structure-aware perturbations into text embeddings.

Details Motivation: The motivation is to address the limitations of existing text-to-video retrieval methods that often overlook complementary audio semantics or adopt coarse fusion strategies, leading to suboptimal multimodal representations. Method: The paper proposes GAID, which incorporates Frame-level Gated Fusion (FGF) for fine-grained temporal alignment and Directional Adaptive Semantic Perturbation (DASP) for enhanced robustness and discrimination. Result: The results show that GAID achieves consistent state-of-the-art performance on MSR-VTT, DiDeMo, LSMDC, and VATEX datasets across all retrieval metrics with notable efficiency gains. Conclusion: The paper concludes that GAID, through its innovative components, achieves state-of-the-art results in text-to-video retrieval with notable efficiency gains. Abstract: Text-to-video retrieval requires precise alignment between language and temporally rich video signals. Existing methods predominantly exploit visual cues and often overlook complementary audio semantics or adopt coarse fusion strategies, leading to suboptimal multimodal representations. We present GAID, a framework that jointly addresses this gap via two key components: (i) a Frame-level Gated Fusion (FGF) that adaptively integrates audio and visual features under textual guidance, enabling fine-grained temporal alignment; and (ii) a Directional Adaptive Semantic Perturbation (DASP) that injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. These modules complement each other -- fusion reduces modality gaps while perturbation regularizes cross-modal matching -- yielding more stable and expressive representations. Extensive experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results across all retrieval metrics with notable efficiency gains. Our code is available at https://github.com/YangBowenn/GAID.
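
As a rough illustration of frame-level gated fusion, the sketch below blends audio and visual features per frame with a scalar gate driven by the text embedding. The gate formula (a sigmoid of the difference between text-audio and text-visual similarity) and all function names are assumptions; the abstract does not give GAID's actual gating function.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gated_fusion(audio_frames, visual_frames, text):
    """Blend audio and visual features frame by frame.
    The gate g_t is a hypothetical choice: it rises when the text embedding
    is more similar to the audio feature than to the visual feature."""
    fused, gates = [], []
    for a, v in zip(audio_frames, visual_frames):
        g = sigmoid(dot(text, a) - dot(text, v))
        gates.append(g)
        fused.append([g * ai + (1.0 - g) * vi for ai, vi in zip(a, v)])
    return fused, gates
```

Each frame receives its own gate, so a frame whose audio matches the query can lean on audio while a neighboring frame leans on vision. That per-frame choice is the difference between frame-level fusion and a single clip-level mixing weight.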

[215] HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection

Han Wang,Zhuoran Wang,Roy Ka-Wei Lee

Main category: cs.CV

TL;DR: The researchers developed HateClipSeg, a new large-scale dataset with annotations for detecting hate speech in videos, showing that current models need improvement in handling multimodal and temporal aspects of videos.

Details Motivation: The motivation behind the study is the challenge in detecting hate speech in videos due to the complexity of multimodal content and the lack of detailed annotations in existing datasets. Method: The researchers created a large-scale multimodal dataset called HateClipSeg, which includes video-level and segment-level annotations. They proposed three tasks to benchmark performance in hate speech detection. Result: Results from the study highlighted significant gaps in current models' performance on hate speech detection tasks, showing the necessity for improved approaches. Conclusion: The study concludes that there is a need for more sophisticated multimodal and temporally aware approaches in detecting hate speech in videos, as current models show substantial gaps. Abstract: Detecting hate speech in videos remains challenging due to the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. We present HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled as Normal or across five Offensive categories: Hateful, Insulting, Sexual, Violence, Self-Harm, along with explicit target victim labels. Our three-stage annotation process yields high inter-annotator agreement (Krippendorff's alpha = 0.817). We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. Results highlight substantial gaps in current models, emphasizing the need for more sophisticated multimodal and temporally aware approaches. The HateClipSeg dataset is publicly available at https://github.com/Social-AI-Studio/HateClipSeg.git.

[216] Dynamic Robot-Assisted Surgery with Hierarchical Class-Incremental Semantic Segmentation

Julia Hindel,Ema Mekic,Enamundram Naga Karthik,Rohit Mohan,Daniele Cattaneo,Maria Kalweit,Abhinav Valada

Main category: cs.CV

TL;DR: This paper introduces TOPICS+, an improved method for class-incremental semantic segmentation tailored to robotic surgery environments, addressing challenges like class imbalance and continuous learning through Dice loss, hierarchical pseudo-labeling, and new benchmark datasets.

Details Motivation: The motivation stems from the limitations of current segmentation models trained on static datasets when applied to dynamic surgical environments. There is a need for models that can continuously adapt to new classes without forgetting previous ones, which is critical for safe and accurate robot-assisted surgeries. Method: The researchers enhanced the existing TOPICS method by incorporating Dice loss to manage class imbalance, introducing hierarchical pseudo-labeling, and creating tailored label taxonomies for surgical settings. They also developed six new CISS benchmarks and expanded the Syn-Mediverse dataset to include over 144 classes. Result: The proposed TOPICS+ approach outperforms existing methods in handling class imbalance and incremental learning. The creation of new benchmarks and a refined dataset with over 144 classes provides a robust framework for evaluating surgical scene segmentation models. Conclusion: The study concludes that TOPICS+ is a promising approach for robust segmentation in robotic surgery environments, offering adaptability to new classes without forgetting prior knowledge, and significantly enhancing performance through improvements like Dice loss and hierarchical pseudo-labeling. Abstract: Robot-assisted surgeries rely on accurate and real-time scene understanding to safely guide surgical instruments. However, segmentation models trained on static datasets face key limitations when deployed in these dynamic and evolving surgical environments. Class-incremental semantic segmentation (CISS) allows models to continually adapt to new classes while avoiding catastrophic forgetting of prior knowledge, without training on previous data. In this work, we build upon the recently introduced Taxonomy-Oriented Poincaré-regularized Incremental Class Segmentation (TOPICS) approach and propose an enhanced variant, termed TOPICS+, specifically tailored for robust segmentation of surgical scenes. 
Concretely, we incorporate the Dice loss into the hierarchical loss formulation to handle strong class imbalances, introduce hierarchical pseudo-labeling, and design tailored label taxonomies for robotic surgery environments. We also propose six novel CISS benchmarks designed for robotic surgery environments including multiple incremental steps and several semantic categories to emulate realistic class-incremental settings in surgical environments. In addition, we introduce a refined set of labels with more than 144 classes on the Syn-Mediverse synthetic dataset, hosted online as an evaluation benchmark. We make the code and trained models publicly available at http://topics.cs.uni-freiburg.de.
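
To show why a Dice term helps with class imbalance, here is a minimal sketch of the standard soft Dice loss for one class. It is the generic formulation, not the hierarchical loss used in TOPICS+; the function name and the smoothing constant `eps` are assumptions.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for one class.
    pred:   predicted probabilities in [0, 1], one per pixel
    target: binary ground-truth labels, one per pixel
    eps:    smoothing constant (assumed value) that avoids division by zero"""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)
```

For a rare class that covers 1 pixel in 100, predicting all background is 99% accurate per pixel, yet `soft_dice_loss([0.0] * 100, [1.0] + [0.0] * 99)` stays close to 1. The loss is normalized by the size of the class region rather than the image, so small structures are not drowned out by the background.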

[217] Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations

Dahee Kwon,Sehyun Lee,Jaesik Choi

Main category: cs.CV

TL;DR: The paper introduces Granular Concept Circuit (GCC), a method to interpret deep vision models by discovering circuits tied to specific visual concepts through iterative assessment of inter-neuron connectivity.

Details Motivation: The motivation is to pinpoint where specific visual concepts are encoded within deep vision models given their distributed representation nature. Method: The method involves an iterative assessment of inter-neuron connectivity, focusing on functional dependencies and semantic alignment, to construct circuits representing specific visual concepts. Result: The result is the discovery of multiple circuits capturing specific concepts relevant to a given query, validated across various deep image classification models. Conclusion: The paper concludes that the Granular Concept Circuit (GCC) method effectively identifies circuits tied to specific visual concepts, offering a profound interpretation of deep vision models. Abstract: Deep vision models have achieved remarkable classification performance by leveraging a hierarchical architecture in which human-interpretable concepts emerge through the composition of individual neurons across layers. Given the distributed nature of representations, pinpointing where specific visual concepts are encoded within a model remains a crucial yet challenging task. In this paper, we introduce an effective circuit discovery method, called Granular Concept Circuit (GCC), in which each circuit represents a concept relevant to a given query. To construct each circuit, our method iteratively assesses inter-neuron connectivity, focusing on both functional dependencies and semantic alignment. By automatically discovering multiple circuits, each capturing specific concepts within that query, our approach offers a profound, concept-wise interpretation of models and is the first to identify circuits tied to specific visual concepts at a fine-grained level. We validate the versatility and effectiveness of GCCs across various deep image classification models.

[218] Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos

Jianbo Ma,Hui Luo,Qi Chen,Yuankai Qi,Yumei Sun,Amin Beheshti,Jianlin Zhang,Ming-Hsuan Yang

Main category: cs.CV

TL;DR: AMOT is a multi-object tracking method that jointly exploits appearance and motion cues to improve tracking in UAV videos.

Details Motivation: Frequent viewpoint changes and complex motion dynamics in UAV videos lead to unstable affinity measurement and ambiguous association; existing methods usually model motion and appearance cues separately, ignoring their spatio-temporal interplay, which results in suboptimal tracking. Method: The paper proposes AMOT, comprising an Appearance-Motion Consistency matrix and a Motion-aware Track Continuation module. Result: AMOT performs strongly on three UAV benchmarks and generalizes well in a plug-and-play, training-free manner. Conclusion: By combining appearance and motion cues, AMOT improves multi-object tracking in UAV videos. Abstract: Multi-object tracking (MOT) aims to track multiple objects while maintaining consistent identities across frames of a given video. In unmanned aerial vehicle (UAV) recorded videos, frequent viewpoint changes and complex UAV-ground relative motion dynamics pose significant challenges, which often lead to unstable affinity measurement and ambiguous association. Existing methods typically model motion and appearance cues separately, overlooking their spatio-temporal interplay and resulting in suboptimal tracking performance. In this work, we propose AMOT, which jointly exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module. Specifically, the AMC matrix computes bi-directional spatial consistency under the guidance of appearance features, enabling more reliable and context-aware identity association. The MTC module complements AMC by reactivating unmatched tracks through appearance-guided predictions that align with Kalman-based predictions, thereby reducing broken trajectories caused by missed detections. Extensive experiments on three UAV benchmarks, including VisDrone2019, UAVDT, and VT-MOT-UAV, demonstrate that our AMOT outperforms current state-of-the-art methods and generalizes well in a plug-and-play and training-free manner.

[219] SpectralX: Parameter-efficient Domain Generalization for Spectral Remote Sensing Foundation Models

Yuxiang Zhang,Wei Li,Mengmeng Zhang,Jiawei Han,Ran Tao,Shunlin Liang

Main category: cs.CV

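TL;DR: SpectralX is a parameter-efficient fine-tuning framework for remote sensing foundation models that uses a two-stage training approach to adapt to various spectral modalities, significantly improving domain generalization performance.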

Details Motivation: Most existing Remote Sensing Foundation Models (RSFMs) are pretrained on optical imagery, while multispectral/hyperspectral data lack corresponding foundation models. SpectralX aims to adapt existing RSFMs to various spectral modalities without extensive spectral pretraining. Method: SpectralX uses a two-stage training approach. The first stage employs a masked-reconstruction task and a Hyper Tokenizer to extract attribute tokens, together with an Attribute-oriented Mixture of Adapter for layer-wise fine-tuning; the second stage inserts an Attribute-refined Adapter for the semantic segmentation task. Result: The two-stage training approach significantly improves domain generalization performance, allowing the model to handle various spectral inputs and interpret spectral imagery from new regions or seasons. Conclusion: SpectralX is an innovative parameter-efficient fine-tuning framework that significantly improves domain generalization performance and enables RSFMs to interpret spectral imagery from new regions or seasons. Abstract: Recent advances in Remote Sensing Foundation Models (RSFMs) have led to significant breakthroughs in the field. While many RSFMs have been pretrained with massive optical imagery, much multispectral/hyperspectral data still lack corresponding foundation models. To leverage the advantages of spectral imagery in earth observation, we explore whether existing RSFMs can be effectively adapted to process diverse spectral modalities without requiring extensive spectral pretraining. In response to this challenge, we propose SpectralX, an innovative parameter-efficient fine-tuning framework that adapts existing RSFMs as the backbone while introducing a two-stage training approach to handle various spectral inputs, thereby significantly improving domain generalization performance. In the first stage, we employ a masked-reconstruction task and design a specialized Hyper Tokenizer (HyperT) to extract attribute tokens from both spatial and spectral dimensions. Simultaneously, we develop an Attribute-oriented Mixture of Adapter (AoMoA) that dynamically aggregates multi-attribute expert knowledge while performing layer-wise fine-tuning. With semantic segmentation as the downstream task in the second stage, we insert an Attribute-refined Adapter (Are-adapter) into the first-stage framework. By iteratively querying low-level semantic features with high-level representations, the model learns to focus on task-beneficial attributes, enabling customized adjustment of RSFMs. Following this two-phase adaptation process, SpectralX is capable of interpreting spectral imagery from new regions or seasons. The code will be available at: https://github.com/YuxiangZhang-BIT.

[220] AG$^2$aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing

Zhaonan Wang,Manyi Li,Changhe Tu

Main category: cs.CV

TL;DR: AG$^2$aussian introduces an anchor-graph structure to improve semantic-aware 3D Gaussian representations, enabling accurate instance-level selection and benefiting applications like editing and simulation.

Details Motivation: The motivation stems from the increasing demand for semantic-aware 3D Gaussian representations due to the widespread adoption of 3D Gaussian Splatting (3DGS). Existing methods lead to noisy segmentation and messy selection of Gaussians, necessitating a better solution. Method: AG$^2$aussian introduces an anchor-graph structure to organize semantic features and regulate Gaussian primitives. This structure enables graph-based propagation for achieving compact and instance-aware Gaussian distributions. Result: Extensive validation across four applications shows the advantages of the AG$^2$aussian approach. Experimental and ablation studies confirm the effectiveness of the key design choices in the proposed framework. Conclusion: The proposed AG$^2$aussian framework effectively organizes semantic features and regulates Gaussian primitives, enabling clean and accurate instance-level Gaussian selection, which benefits various applications like interactive query, text-driven query, object removal editing, and physics simulation. Abstract: 3D Gaussian Splatting (3DGS) has witnessed exponential adoption across diverse applications, driving a critical need for semantic-aware 3D Gaussian representations to enable scene understanding and editing tasks. Existing approaches typically attach semantic features to a collection of free Gaussians and distill the features via differentiable rendering, leading to noisy segmentation and a messy selection of Gaussians. In this paper, we introduce AG$^2$aussian, a novel framework that leverages an anchor-graph structure to organize semantic features and regulate Gaussian primitives. Our anchor-graph structure not only promotes compact and instance-aware Gaussian distributions, but also facilitates graph-based propagation, achieving a clean and accurate instance-level Gaussian selection. Extensive validation across four applications, i.e. interactive click-based query, open-vocabulary text-driven query, object removal editing, and physics simulation, demonstrates the advantages of our approach and its benefits to various applications. The experiments and ablation studies further evaluate the effectiveness of the key designs of our approach.
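One way to picture graph-based propagation for instance selection is a breadth-first traversal that starts at a clicked anchor and spreads to neighbors whose semantic affinity is high enough. The sketch below is only an illustration of that general idea: the abstract does not specify the propagation rule, so the adjacency format, the `min_affinity` threshold, and the function name are assumptions.

```python
from collections import deque


def propagate_selection(edges, seed, min_affinity=0.5):
    """Select all anchors reachable from `seed` through high-affinity edges.

    edges: dict mapping an anchor id to a list of (neighbor_id, affinity) pairs.
    The affinity threshold is a hypothetical stand-in for whatever similarity
    test an actual anchor graph would use.
    """
    selected = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbor, affinity in edges.get(node, []):
            if affinity >= min_affinity and neighbor not in selected:
                selected.add(neighbor)
                queue.append(neighbor)
    return selected
```

Because whole anchors are selected rather than individual free Gaussians, a weakly connected neighbor (low affinity) is excluded together with everything attached to it.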

[221] Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

Ruofan Wang,Xin Wang,Yang Yao,Xuan Tong,Xingjun Ma

Main category: cs.CV

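TL;DR: This paper introduces Simulated Ensemble Attack (SEA), a new grey-box jailbreak attack on fine-tuned vision-language models (VLMs). SEA exploits vulnerabilities in the base model to generate transferable adversarial examples, combining fine-tuning trajectory simulation with targeted prompt guidance. Experiments show high attack success and toxicity rates across different fine-tuned variants, underscoring the urgent need to protect fine-tuned models from vulnerabilities inherited from their base models.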

Details Motivation: Fine-tuning open-source vision-language models (VLMs) creates a new attack surface, because vulnerabilities in the base model may be inherited by fine-tuned models, exposing them to transferable jailbreak attacks. This risk needs to be studied and understood in order to develop effective defenses. Method: The paper proposes Simulated Ensemble Attack (SEA), which combines Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). FTS generates transferable adversarial images by simulating parameter shifts in the vision encoder, while TPG is a textual strategy that steers the language decoder toward adversarially optimized outputs. Result: Experiments on the Qwen2-VL family (2B and 7B) show that SEA achieves attack success rates exceeding 86.5% and toxicity rates near 49.5%, with stable results across diverse fine-tuned variants, including models fine-tuned specifically to improve safety. Compared with direct PGD-based attacks, SEA transfers across models significantly better. Conclusion: The study reveals the risk that fine-tuned models inherit base-model vulnerabilities and uses SEA to demonstrate the severity of this risk. The results indicate that comprehensive defenses across the entire model lifecycle are needed to protect fine-tuned models from transferable attacks. Abstract: Fine-tuning open-source Vision-Language Models (VLMs) creates a critical yet underexplored attack surface: vulnerabilities in the base VLM could be retained in fine-tuned variants, rendering them susceptible to transferable jailbreak attacks. To demonstrate this risk, we introduce the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method in which the adversary has full access to the base VLM but no knowledge of the fine-tuned target's weights or training configuration. To improve jailbreak transferability across fine-tuned VLMs, SEA combines two key techniques: Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). FTS generates transferable adversarial images by simulating the vision encoder's parameter shifts, while TPG is a textual strategy that steers the language decoder toward adversarially optimized outputs. Experiments on the Qwen2-VL family (2B and 7B) demonstrate that SEA achieves high transfer attack success rates exceeding 86.5% and toxicity rates near 49.5% across diverse fine-tuned variants, even those specifically fine-tuned to improve safety behaviors. Notably, while direct PGD-based image jailbreaks rarely transfer across fine-tuned VLMs, SEA reliably exploits inherited vulnerabilities from the base model, significantly enhancing transferability. These findings highlight an urgent need to safeguard fine-tuned proprietary VLMs against transferable vulnerabilities inherited from open-source foundations, motivating the development of holistic defenses across the entire model lifecycle.

[222] Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation

Qiaohui Chu,Haoyu Zhang,Meng Liu,Yisen Feng,Haoxiang Shi,Liqiang Nie

Main category: cs.CV

TL;DR: The paper proposes INSIGHT, a two-stage framework for long-term action anticipation that improves feature extraction and cognitive reasoning, achieving state-of-the-art results on multiple benchmarks.

Details Motivation: The study aims to address three key limitations in existing action anticipation approaches: underutilization of fine-grained visual cues, neglect of semantic dependencies between verbs and nouns, and lack of explicit cognitive reasoning. Method: INSIGHT proposes a two-stage framework: the first stage extracts semantically rich features using hand-object interaction regions and a verb-noun co-occurrence matrix, while the second stage introduces a reinforcement learning-based module to simulate cognitive reasoning for action anticipation. Result: INSIGHT achieves state-of-the-art performance on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks, demonstrating its effectiveness and strong generalization ability. Conclusion: INSIGHT provides a strong framework for long-term action anticipation, showing excellent performance and generalization capabilities on multiple benchmarks. Abstract: Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) -> intention inference (reason) -> action anticipation (answer). 
Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.
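A verb-noun co-occurrence matrix can be pictured as a table counting how often each verb appears with each noun in annotated actions. The sketch below builds such a table and row-normalizes it into per-verb noun frequencies. How INSIGHT actually constructs and uses its matrix is not stated in the abstract, so the counting scheme, the normalization, and the function names here are assumptions.

```python
from collections import Counter


def cooccurrence(actions):
    """Count (verb, noun) pairs.

    actions: list of (verb, noun) tuples, e.g. taken from action labels.
    Returns (verbs, nouns, matrix) where matrix[i][j] is the number of times
    verbs[i] occurs together with nouns[j].
    """
    counts = Counter(actions)
    verbs = sorted({verb for verb, _ in actions})
    nouns = sorted({noun for _, noun in actions})
    matrix = [[counts[(verb, noun)] for noun in nouns] for verb in verbs]
    return verbs, nouns, matrix


def row_normalize(matrix):
    """Turn each row of counts into relative frequencies (rows sum to 1)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized
```

A normalized row can then serve as a prior over plausible nouns once a verb has been predicted, which is one simple way semantic dependencies between verbs and nouns could be injected into an action representation.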

[223] Improving Noise Efficiency in Privacy-preserving Dataset Distillation

Runkai Zheng,Vishnu Asutosh Dasu,Yinong Oliver Wang,Haohan Wang,Fernando De la Torre

Main category: cs.CV

TL;DR: This paper proposes a new framework for differentially private dataset distillation that decouples sampling from optimization, significantly improving performance while preserving privacy with less data.

Details Motivation: The motivation behind this paper is to address the inefficiency and excessive noise in current private dataset distillation methods, aiming to improve the performance of differentially private data generation with less data. Method: The paper introduces a framework that decouples sampling from optimization and reduces the impact of DP noise through matching in an informative subspace. Result: On CIFAR-10, the method achieves a 10.0% improvement with 50 images per class and an 8.3% increase with just one-fifth the distilled set size of previous state-of-the-art methods. Conclusion: The paper concludes that their novel framework significantly improves the performance of privacy-preserving dataset distillation by decoupling sampling from optimization and mitigating the impact of DP noise through matching in an informative subspace. Abstract: Modern machine learning models heavily rely on large datasets that often include sensitive and private information, raising serious privacy concerns. Differentially private (DP) data generation offers a solution by creating synthetic datasets that limit the leakage of private information within a predefined privacy budget; however, it requires a substantial amount of data to achieve performance comparable to models trained on the original data. To mitigate the significant expense incurred with synthetic data generation, Dataset Distillation (DD) stands out for its remarkable training and storage efficiency. This efficiency is particularly advantageous when integrated with DP mechanisms, curating compact yet informative synthetic datasets without compromising privacy. However, current state-of-the-art private DD methods suffer from a synchronized sampling-optimization process and the dependency on noisy training signals from randomly initialized networks. This results in the inefficient utilization of private information due to the addition of excessive noise. 
To address these issues, we introduce a novel framework that decouples sampling from optimization for better convergence and improves signal quality by mitigating the impact of DP noise through matching in an informative subspace. On CIFAR-10, our method achieves a 10.0% improvement with 50 images per class and an 8.3% increase with just one-fifth the distilled set size of previous state-of-the-art methods, demonstrating significant potential to advance privacy-preserving DD.
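For readers unfamiliar with where "DP noise" comes from, the sketch below shows the standard Gaussian mechanism applied to a sum of per-sample vectors: each vector is clipped to a maximum L2 norm, the clipped vectors are summed, and Gaussian noise scaled to the clipping bound is added. The abstract does not say which mechanism the paper uses, and the subspace matching it proposes is not reproduced here; `max_norm`, `noise_multiplier`, and the function names are illustrative assumptions, and no privacy accounting is performed.

```python
import math
import random


def clip_to_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def private_sum(vectors, max_norm, noise_multiplier, rng):
    """Clipped sum plus Gaussian noise with std = noise_multiplier * max_norm.

    Clipping bounds each sample's contribution, so the noise scale can be
    tied to max_norm instead of to the raw data range.
    """
    dim = len(vectors[0])
    total = [0.0] * dim
    for vec in vectors:
        clipped = clip_to_norm(vec, max_norm)
        for i in range(dim):
            total[i] += clipped[i]
    sigma = noise_multiplier * max_norm
    return [t + rng.gauss(0.0, sigma) for t in total]
```

The noise is added per coordinate, so its total magnitude grows with the dimension of the released vector; this is one intuition for why matching in a lower-dimensional, informative subspace could waste less of the privacy budget.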

[224] Vision transformer-based multi-camera multi-object tracking framework for dairy cow monitoring

Kumail Abbas,Zeeshan Afzal,Aqeel Raza,Taha Mansouri,Andrew W. Dowsey,Chaidate Inchaisri,Ali Alameer

Main category: cs.CV

TL;DR: This study presents a real-time multi-camera tracking system for indoor dairy cows, combining advanced detection and segmentation models with motion-aware tracking algorithms to achieve high accuracy and reliability, significantly improving early disease detection and behavior monitoring.

Details Motivation: Continual and accurate monitoring of dairy cow activity and behavior is essential for health assessment and farm productivity. Manual observation is labor-intensive and inconsistent, necessitating an automated, reliable tracking system. Method: A multi-camera real-time tracking system was developed using advanced computer vision techniques, including instance segmentation and tracking algorithms. The system integrates six camera feeds into a top-down barn panorama using homographic transformations. A refined YOLO11-m model was used for detection, while SAMURAI (an upgraded Segment Anything Model 2.1) provided pixel-precise cow masks. Tracking was enhanced using a motion-aware Linear Kalman filter and IoU-based data association. Result: The system achieved high detection and tracking accuracy: mAP@0.50 = 0.97, F1 = 0.95, MOTA = 98.7% and 99.3% in two benchmark videos, IDF1 scores above 99%, and near-zero identity switches. It effectively handles occlusion and posture changes while minimizing redundant detections across overlapping camera views. Conclusion: The unified multi-camera system successfully tracks dairy cows in real-time, even in complex indoor environments. It outperforms existing methods like Deep SORT Realtime, achieving high accuracy and minimal identity switches, thus improving early sickness prediction and behavioral classification. Abstract: Activity and behaviour correlate with dairy cow health and welfare, making continual and accurate monitoring crucial for disease identification and farm productivity. Manual observation and frequent assessments are laborious and inconsistent for activity monitoring. In this study, we developed a unique multi-camera, real-time tracking system for indoor-housed Holstein Friesian dairy cows. This technology uses cutting-edge computer vision techniques, including instance segmentation and tracking algorithms to monitor cow activity seamlessly and accurately. 
An integrated top-down barn panorama was created by geometrically aligning six camera feeds using homographic transformations. The detection phase used a refined YOLO11-m model trained on an overhead cow dataset, obtaining high accuracy (mAP@0.50 = 0.97, F1 = 0.95). SAMURAI, an upgraded Segment Anything Model 2.1, generated pixel-precise cow masks for instance segmentation utilizing zero-shot learning and motion-aware memory. Even with occlusion and fluctuating posture, a motion-aware Linear Kalman filter and IoU-based data association reliably identified cows over time for object tracking. The proposed system significantly outperformed Deep SORT Realtime. Multi-Object Tracking Accuracy (MOTA) was 98.7% and 99.3% in two benchmark video sequences, with IDF1 scores above 99% and near-zero identity switches. According to our data, this unified multi-camera system can track dairy cows in complex indoor environments in real time. The system reduces redundant detections across overlapping cameras and maintains continuity as cows move between viewpoints, with the aim of improving early sickness prediction through activity quantification and behavioural classification.
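Two pieces of this pipeline are easy to state in code: mapping a pixel from one camera into the shared top-down view with a 3x3 homography, and computing MOTA from error counts using the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT. The sketch below is illustrative only; the matrices and counts in any usage are made up and are not values from the paper.

```python
def apply_homography(h, x, y):
    """Map point (x, y) through a 3x3 homography given as nested lists."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return u / w, v / w  # divide by w to leave homogeneous coordinates


def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multi-Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt
```

For example, with hypothetical counts of 10 misses, 3 false positives, and no identity switches over 1000 ground-truth boxes, `mota(10, 3, 0, 1000)` evaluates to 0.987. Note that MOTA can be negative when the number of errors exceeds the number of ground-truth objects.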

[225] VPN: Visual Prompt Navigation

Shuo Feng,Zihan Wang,Yuchen Li,Rui Kong,Hengyi Cai,Shuaiqiang Wang,Gim Hee Lee,Piji Li,Shuqiang Jiang

Main category: cs.CV

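TL;DR: This paper proposes Visual Prompt Navigation (VPN), a new navigation guidance paradigm in which user-provided visual prompts on 2D top-view maps guide the agent, avoiding the ambiguity and verbosity of natural language and making navigation more effective.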

Details Motivation: Natural language is commonly used to guide embodied agents, but its ambiguity and verbosity often hinder language-guided navigation in complex environments. The paper therefore proposes a more intuitive, spatially grounded form of navigation guidance that does not rely on language instructions. Method: The paper proposes a new paradigm called Visual Prompt Navigation (VPN) and constructs two new datasets, R2R-VP and R2R-CE-VP. It also introduces a dedicated baseline network, VPNet, and adopts two data augmentation strategies to improve navigation performance. Result: Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect visual prompt navigation performance, validating the proposed approach. Conclusion: The proposed visual prompt navigation paradigm avoids the ambiguity of language-guided navigation and achieves good navigation performance in complex environments. Abstract: While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is friendlier to non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.

[226] DiffSemanticFusion: Semantic Raster BEV Fusion for Autonomous Driving via Online HD Map Diffusion

Zhigang Sun,Yiru Wang,Anqing Jiang,Shuo Wang,Yu Gao,Yuwen Heng,Shouyi Zhang,An He,Hao Jiang,Jinhao Chai,Zichong Gu,Wang Jijun,Shichen Tang,Lavdim Halilaj,Juergen Luettin,Hao Sun

Main category: cs.CV

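TL;DR: DiffSemanticFusion is a fusion framework for multimodal trajectory prediction and planning that combines the strengths of raster-based and graph-based online HD map representations and shows superior performance on real-world autonomous driving benchmarks.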

Details Motivation: In online HD map generation scenarios, raster-based representations suit vision models but lack geometric precision, while graph-based representations retain structural detail but become unstable without precise maps. A method that combines the strengths of both is therefore needed. Method: The paper proposes DiffSemanticFusion, a fusion framework for multimodal trajectory prediction and planning that reasons over a semantic raster-fused BEV space and uses a map diffusion module to improve the stability and expressiveness of online HD map representations. Result: Experiments on two real-world autonomous driving benchmarks, nuScenes and NAVSIM, show improved performance over several state-of-the-art methods. On the nuScenes prediction task, integration with the online HD map informed QCNet yields a 5.1% performance improvement. For end-to-end autonomous driving on NAVSIM, DiffSemanticFusion achieves state-of-the-art results, with a 15% performance gain in NavHard scenarios. Conclusion: DiffSemanticFusion can be seamlessly integrated into other vector-based approaches to enhance performance, and it shows improved performance on real-world autonomous driving benchmarks. Abstract: Autonomous driving requires accurate scene understanding, including road geometry, traffic agents, and their semantic relationships. In online HD map generation scenarios, raster-based representations are well-suited to vision models but lack geometric precision, while graph-based representations retain structural detail but become unstable without precise maps. To harness the complementary strengths of both, we propose DiffSemanticFusion -- a fusion framework for multimodal trajectory prediction and planning. Our approach reasons over a semantic raster-fused BEV space, enhanced by a map diffusion module that improves both the stability and expressiveness of online HD map representations. We validate our framework on two downstream tasks: trajectory prediction and planning-oriented end-to-end autonomous driving. Experiments on real-world autonomous driving benchmarks, nuScenes and NAVSIM, demonstrate improved performance over several state-of-the-art methods. For the prediction task on nuScenes, we integrate DiffSemanticFusion with the online HD map informed QCNet, achieving a 5.1% performance improvement. For end-to-end autonomous driving in NAVSIM, DiffSemanticFusion achieves state-of-the-art results, with a 15% performance gain in NavHard scenarios. In addition, extensive ablation and sensitivity studies show that our map diffusion module can be seamlessly integrated into other vector-based approaches to enhance performance. All artifacts are available at https://github.com/SunZhigang7/DiffSemanticFusion.

[227] Skip priors and add graph-based anatomical information, for point-based Couinaud segmentation

Xiaotong Zhang,Alexander Broersen,Gonnie CM van Erp,Silvia L. Pintea,Jouke Dijkstra

Main category: cs.CV

TL;DR: A new point-based method for Couinaud segmentation in liver surgery planning uses a graph reasoning module to implicitly learn liver vessel structures, achieving strong results on public datasets without requiring explicit prior knowledge.

Details Motivation: To improve preoperative planning in liver surgery by preserving CT image resolution through point-based representations without requiring time-consuming prior knowledge of liver vessel structures. Method: A point-based method incorporating a graph reasoning module to learn anatomical liver vessel structures implicitly from point features, avoiding the need for explicit prior knowledge of liver vessel structures. Result: The method showed competitive results on MSD and LiTS datasets, achieving good performance in Dice coefficient and average surface distance scores compared to existing point-based methods. Conclusion: The proposed point-based method for Couinaud segmentation effectively learns anatomical liver vessel structures without explicit prior knowledge, demonstrating competitive performance on public datasets. Abstract: The preoperative planning of liver surgery relies on Couinaud segmentation from computed tomography (CT) images, to reduce the risk of bleeding and guide the resection procedure. Using 3D point-based representations, rather than voxelizing the CT volume, has the benefit of preserving the physical resolution of the CT. However, point-based representations need prior knowledge of the liver vessel structure, which is time consuming to acquire. Here, we propose a point-based method for Couinaud segmentation, without explicitly providing the prior liver vessel structure. To allow the model to learn this anatomical liver vessel structure, we add a graph reasoning module on top of the point features. This adds implicit anatomical information to the model, by learning affinities across point neighborhoods. Our method is competitive on the MSD and LiTS public datasets in Dice coefficient and average surface distance scores compared to four pioneering point-based methods. Our code is available at https://github.com/ZhangXiaotong015/GrPn.
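The Dice coefficient used for evaluation is the standard overlap measure, Dice = 2|A ∩ B| / (|A| + |B|), computed per segment label. The sketch below evaluates it over per-point labels; the flat list representation, the convention of returning 1.0 when a label is absent from both inputs, and the function name are assumptions made for illustration, not details from the paper.

```python
def dice(pred, target, label):
    """Dice coefficient for one label over two equal-length label sequences.

    pred and target hold one segment label per point.
    Returns 1.0 when the label is absent from both (assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    denominator = pred_count + target_count
    return 2.0 * intersection / denominator if denominator else 1.0
```

Averaging this value over the Couinaud segment labels gives a single overlap score; surface-distance metrics complement it because Dice alone is insensitive to where along a boundary the errors occur.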

[228] SoccerTrack v2: A Full-Pitch Multi-View Soccer Dataset for Game State Reconstruction

Atom Scott,Ikuma Uchida,Kento Kuroda,Yufi Kim,Keisuke Fujii

Main category: cs.CV

TL;DR: SoccerTrack v2 is an advanced public dataset for multi-object tracking, game state reconstruction, and ball action spotting in soccer analytics, featuring annotated 4K recordings of matches.

Details Motivation: Unlike prior datasets that use broadcast views or limited scenarios, SoccerTrack v2 aims to advance multi-object tracking, game state reconstruction, and ball action spotting in soccer analytics. Method: SoccerTrack v2 provides 10 full-length, panoramic 4K recordings of university-level matches, captured with BePro cameras for complete player visibility. Each video is annotated with GSR labels and BAS labels for 12 action classes. Result: SoccerTrack v2 is a new public dataset of annotated 4K soccer match recordings; the technical report outlines the dataset's structure, collection pipeline, and annotation process. Conclusion: SoccerTrack v2 is designed to advance research in computer vision and soccer analytics, enabling new benchmarks and practical applications in tactical analysis and automated tools. Abstract: SoccerTrack v2 is a new public dataset for advancing multi-object tracking (MOT), game state reconstruction (GSR), and ball action spotting (BAS) in soccer analytics. Unlike prior datasets that use broadcast views or limited scenarios, SoccerTrack v2 provides 10 full-length, panoramic 4K recordings of university-level matches, captured with BePro cameras for complete player visibility. Each video is annotated with GSR labels (2D pitch coordinates, jersey-based player IDs, roles, teams) and BAS labels for 12 action classes (e.g., Pass, Drive, Shot). This technical report outlines the dataset's structure, collection pipeline, and annotation process. SoccerTrack v2 is designed to advance research in computer vision and soccer analytics, enabling new benchmarks and practical applications in tactical analysis and automated tools.

[229] Diffusion-based 3D Hand Motion Recovery with Intuitive Physics

Yufei Zhang,Zijun Cui,Jeffrey O. Kephart,Qiang Ji

Main category: cs.CV

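TL;DR: This paper proposes a new 3D hand motion recovery method that combines a diffusion model with physics knowledge and significantly improves reconstruction accuracy without using annotated video data.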

Details Motivation: Although 3D hand reconstruction from monocular images has made significant progress, generating accurate and temporally coherent motion estimates from videos remains challenging, particularly during hand-object interactions. Method: A diffusion model combined with physics knowledge refines motion estimates, generating improved motion sequences through an iterative denoising process; the model is trained on motion capture data. Result: Experiments show that the method significantly improves various frame-wise reconstruction methods and achieves state-of-the-art (SOTA) performance on existing benchmarks. Conclusion: The paper presents a new diffusion-based, physics-augmented 3D hand motion recovery framework that significantly improves existing methods and achieves SOTA performance on existing benchmarks. Abstract: While 3D hand reconstruction from monocular images has made significant progress, generating accurate and temporally coherent motion estimates from videos remains challenging, particularly during hand-object interactions. In this paper, we present a novel 3D hand motion recovery framework that enhances image-based reconstructions through a diffusion-based and physics-augmented motion refinement model. Our model captures the distribution of refined motion estimates conditioned on initial ones, generating improved sequences through an iterative denoising process. Instead of relying on scarce annotated video data, we train our model only using motion capture data without images. We identify valuable intuitive physics knowledge during hand-object interactions, including key motion states and their associated motion constraints. We effectively integrate these physical insights into our diffusion model to improve its performance. Extensive experiments demonstrate that our approach significantly improves various frame-wise reconstruction methods, achieving state-of-the-art (SOTA) performance on existing benchmarks.

[230] A Simple Algebraic Solution for Estimating the Pose of a Camera from Planar Point Features

Tarek Bouazza,Tarek Hamel,Claude Samson

Main category: cs.CV

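TL;DR: The paper proposes a simple and robust hierarchical algebraic method for estimating the pose of a camera relative to a planar target, covering normal vector estimation, position and distance computation, and orientation determination, and validates its performance experimentally.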

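Details Motivation: There is a need for a simple and robust way to estimate the pose of a camera relative to a planar target from at least four reference points and their corresponding bearing measurements in the camera frame. Method: The method follows a hierarchical structure: it first determines the unit vector normal to the target plane, then estimates the camera's position vector and its distance to the target plane, and finally determines the full orientation. An averaging methodology is introduced to make the normal vector estimation more robust. Result: The averaging strategy improves robustness, and extensive experiments validate the method's accuracy and reliability. Conclusion: The paper proposes a robust hierarchical algebraic method for estimating the pose of a camera relative to a planar target and validates its accuracy and robustness through extensive experiments. Abstract: This paper presents a simple algebraic method to estimate the pose of a camera relative to a planar target from $n \geq 4$ reference points with known coordinates in the target frame and their corresponding bearing measurements in the camera frame. The proposed approach follows a hierarchical structure; first, the unit vector normal to the target plane is determined, followed by the camera's position vector, its distance to the target plane, and finally, the full orientation. To improve the method's robustness to measurement noise, an averaging methodology is introduced to refine the estimation of the target's normal direction. The accuracy and robustness of the approach are validated through extensive experiments.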

[231] OmniEvent: Unified Event Representation Learning

Weiqi Yan,Chenlu Lin,Youbiao Wang,Zhipeng Cai,Xiuhong Lin,Yangyang Shi,Weiquan Liu,Yu Zang

Main category: cs.CV

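TL;DR: OmniEvent is a unified event representation learning framework. Using a decouple-enhance-fuse paradigm and space-filling curves, it addresses the unstructured distribution and spatial-temporal inhomogeneity of event data, achieving state-of-the-art performance across tasks while improving memory and compute efficiency.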

Details Motivation: Event cameras are increasingly popular in computer vision because of their ultra-high dynamic range and temporal resolution. However, because event data are unstructured and spatially and temporally inhomogeneous, event networks rely heavily on task-specific designs, which makes existing architectures hard to reuse for new tasks. Method: OmniEvent proposes a decouple-enhance-fuse paradigm in which local feature aggregation and enhancement are performed independently in the spatial and temporal domains to avoid inhomogeneity issues. Space-filling curves are applied to enable large receptive fields while improving memory and compute efficiency. Features from the individual domains are fused by attention to learn spatial-temporal interactions. Result: OmniEvent outperforms (task-specific) state-of-the-art methods by up to 68.2% across 3 representative tasks and 10 datasets. Conclusion: OmniEvent is a unified event representation learning framework that achieves state-of-the-art performance across diverse tasks and fully removes the need for task-specific designs. Abstract: Event cameras have gained increasing popularity in computer vision due to their ultra-high dynamic range and temporal resolution. However, event networks heavily rely on task-specific designs due to the unstructured data distribution and spatial-temporal (S-T) inhomogeneity, making it hard to reuse existing architectures for new tasks. We propose OmniEvent, the first unified event representation learning framework that achieves SOTA performance across diverse tasks, fully removing the need for task-specific designs. Unlike previous methods that treat event data as 3D point clouds with manually tuned S-T scaling weights, OmniEvent proposes a decouple-enhance-fuse paradigm, where the local feature aggregation and enhancement is done independently on the spatial and temporal domains to avoid inhomogeneity issues. Space-filling curves are applied to enable large receptive fields while improving memory and compute efficiency. The features from individual domains are then fused by attention to learn S-T interactions. The output of OmniEvent is a grid-shaped tensor, which enables standard vision models to process event data without architecture change. With a unified framework and similar hyper-parameters, OmniEvent outperforms (task-specific) SOTA by up to 68.2% across 3 representative tasks and 10 datasets (Fig. 1). Code will be available at https://github.com/Wickyan/OmniEvent .
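A space-filling curve maps multi-dimensional coordinates to a one-dimensional order in which nearby points tend to stay close, so unstructured events can be serialized and processed in contiguous chunks. The sketch below uses the Z-order (Morton) curve as one common example; the abstract does not say which curve OmniEvent uses, so the choice of curve, the 16-bit coordinate limit, and the function names are assumptions.

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order (Morton) code.

    Bit i of x goes to position 2*i, bit i of y to position 2*i + 1.
    Coordinates are assumed to be non-negative integers below 2**bits.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def sort_events_by_curve(events):
    """Order (x, y) event coordinates along the Z-order curve."""
    return sorted(events, key=lambda e: morton2d(e[0], e[1]))
```

Once events are in curve order, a fixed-size window over the 1D sequence covers a compact 2D neighborhood, which is one way such serialization can widen the receptive field without an explicit neighbor search.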

[232] Beyond Vulnerabilities: A Survey of Adversarial Attacks as Both Threats and Defenses in Computer Vision Systems

Zhongliang Guo,Yifei Qian,Yanli Li,Weiye Li,Chun Tong Lei,Shuai Zhao,Lei Fang,Ognjen Arandjelović,Chun Pong Lau

Main category: cs.CV

TL;DR: This survey explores adversarial attacks in computer vision, analyzing their evolution and dual role as threats and tools, while identifying future research directions.

Details Motivation: Adversarial attacks against computer vision systems have emerged as a critical research area that challenges the fundamental assumptions about neural network robustness and security. Method: We provide a systematic analysis of adversarial attack methodologies across three primary domains: pixel-space attacks, physically realizable attacks, and latent-space attacks. Result: Our analysis reveals critical research gaps, particularly in neural style transfer protection and computational efficiency requirements. We examine how physically realizable attacks have successfully bridged the gap between digital vulnerabilities and real-world threats through adversarial patches, 3D textures, and dynamic optical perturbations. Conclusion: This survey contributes a comprehensive taxonomy, evolution analysis, and identification of future research directions, aiming to advance understanding of adversarial vulnerabilities and inform the development of more robust and trustworthy computer vision systems. Abstract: Adversarial attacks against computer vision systems have emerged as a critical research area that challenges the fundamental assumptions about neural network robustness and security. This comprehensive survey examines the evolving landscape of adversarial techniques, revealing their dual nature as both sophisticated security threats and valuable defensive tools. We provide a systematic analysis of adversarial attack methodologies across three primary domains: pixel-space attacks, physically realizable attacks, and latent-space attacks. Our investigation traces the technical evolution from early gradient-based methods such as FGSM and PGD to sophisticated optimization techniques incorporating momentum, adaptive step sizes, and advanced transferability mechanisms. 
We examine how physically realizable attacks have successfully bridged the gap between digital vulnerabilities and real-world threats through adversarial patches, 3D textures, and dynamic optical perturbations. Additionally, we explore the emergence of latent-space attacks that leverage semantic structure in internal representations to create more transferable and meaningful adversarial examples. Beyond traditional offensive applications, we investigate the constructive use of adversarial techniques for vulnerability assessment in biometric authentication systems and protection against malicious generative models. Our analysis reveals critical research gaps, particularly in neural style transfer protection and computational efficiency requirements. This survey contributes a comprehensive taxonomy, evolution analysis, and identification of future research directions, aiming to advance understanding of adversarial vulnerabilities and inform the development of more robust and trustworthy computer vision systems.
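
To make the gradient-based starting point of this taxonomy concrete, here is a minimal sketch of a single FGSM step. It assumes a toy logistic-regression classifier in NumPy; the weights, input, and `epsilon` are illustrative values, not taken from the survey.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, epsilon):
    """One FGSM step against a logistic-regression classifier.

    x: input vector, y: true label in {0, 1}, (w, b): model parameters.
    """
    p = sigmoid(w @ x + b)
    # Gradient of the binary cross-entropy loss with respect to the input x.
    grad_x = (p - y) * w
    # Shift every coordinate by epsilon in the direction that increases the loss.
    return x + epsilon * np.sign(grad_x)

# Illustrative (assumed) model and input.
w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.5, -0.5, 1.0])
y = 1

x_adv = fgsm_perturb(x, y, w, b, epsilon=0.1)
p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)  # confidence in the true class drops
```

PGD can be read as repeating this step with a smaller step size and projecting back into the `epsilon`-ball around `x` after each iteration.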

[233] Context Guided Transformer Entropy Modeling for Video Compression

Junlong Tong,Wei Zhang,Yaohui Jin,Xiaoyu Shen

Main category: cs.CV

TL;DR: The paper proposes a Context Guided Transformer (CGT) entropy model that improves video compression by efficiently modeling temporal and spatial dependencies, resulting in faster computation and better compression performance.

Details Motivation: The motivation is to address the issues of high computational cost from temporal context incorporation and the lack of explicit spatial dependency modeling in existing entropy models, which can limit decoding performance. Method: The CGT model uses a temporal context resampler to extract critical temporal information and a teacher-student network to model spatial dependencies. The temporal context resampler reduces computational overhead, while the teacher-student network assigns dependency weights to spatial tokens, allowing the student model to focus on high-dependency context during inference. Result: The CGT model reduces entropy modeling time by approximately 65% and achieves an 11% BD-Rate reduction compared to the previous state-of-the-art model. Conclusion: The proposed Context Guided Transformer (CGT) entropy model effectively reduces video redundancy by efficiently leveraging spatio-temporal contexts, achieving better performance and computational efficiency compared to previous models. Abstract: Conditional entropy models effectively leverage spatio-temporal contexts to reduce video redundancy. However, incorporating temporal context often introduces additional model complexity and increases computational cost. In parallel, many existing spatial context models lack explicit modeling of the ordering of spatial dependencies, which may limit the availability of relevant context during decoding. To address these issues, we propose the Context Guided Transformer (CGT) entropy model, which estimates probability mass functions of the current frame conditioned on resampled temporal context and dependency-weighted spatial context. A temporal context resampler learns predefined latent queries to extract critical temporal information using transformer encoders, reducing downstream computational overhead. 
Meanwhile, a teacher-student network is designed as a dependency-weighted spatial context assigner to explicitly model the dependency of spatial context order. The teacher generates an attention map to represent token importance and an entropy map to reflect prediction certainty from randomly masked inputs, guiding the student to select the weighted top-k tokens with the highest spatial dependency. During inference, only the student is used to predict undecoded tokens based on high-dependency context. Experimental results demonstrate that our CGT model reduces entropy modeling time by approximately 65% and achieves an 11% BD-Rate reduction compared to the previous state-of-the-art conditional entropy model.

[234] Distinguishing Target and Non-Target Fixations with EEG and Eye Tracking in Realistic Visual Scenes

Mansi Sharma,Camilo Andrés Martínez Martínez,Benedikt Emanuel Wirth,Antonio Krüger,Philipp Müller

Main category: cs.CV

TL;DR: This paper introduces a new method using gaze and EEG features to classify target fixations during free visual search in realistic scenes, achieving significantly higher accuracy than previous approaches.

Details Motivation: The motivation is to improve the understanding of users' intended actions by distinguishing target from non-target fixations in more realistic visual search scenarios, as previous studies used abstract stimuli and explicit instructions. Method: The paper conducted a user study with 36 participants using 140 realistic visual search scenes across two application scenarios: searching for icons on desktop backgrounds and finding tools in a cluttered workshop. They used gaze and EEG features to classify fixations. Result: Their approach achieved 83.6% accuracy in cross-user evaluations, significantly outperforming previous methods based on saccade-related potentials, which achieved only 56.9% accuracy. Conclusion: The paper concludes that their new approach based on gaze and EEG features significantly improves the classification of target versus non-target fixations in realistic scenarios, achieving much higher accuracy compared to previous methods. Abstract: Distinguishing target from non-target fixations during visual search is a fundamental building block to understand users' intended actions and to build effective assistance systems. While prior research indicated the feasibility of classifying target vs. non-target fixations based on eye tracking and electroencephalography (EEG) data, these studies were conducted with explicitly instructed search trajectories, abstract visual stimuli, and disregarded any scene context. This is in stark contrast with the fact that human visual search is largely driven by scene characteristics and raises questions regarding generalizability to more realistic scenarios. To close this gap, we, for the first time, investigate the classification of target vs. non-target fixations during free visual search in realistic scenes. 
In particular, we conducted a 36-participant user study using a large variety of 140 realistic visual search scenes in two highly relevant application scenarios: searching for icons on desktop backgrounds and finding tools in a cluttered workshop. Our approach based on gaze and EEG features outperforms the previous state-of-the-art approach based on a combination of fixation duration and saccade-related potentials. We perform extensive evaluations to assess the generalizability of our approach across scene types. Our approach significantly advances the ability to distinguish between target and non-target fixations in realistic scenarios, achieving 83.6% accuracy in cross-user evaluations. This substantially outperforms previous methods based on saccade-related potentials, which reached only 56.9% accuracy.

[235] DiffusionFF: Face Forgery Detection via Diffusion-based Artifact Localization

Siran Peng,Haoyuan Zhang,Li Gao,Tianshuo Zhang,Bao Li,Zhen Lei

Main category: cs.CV

TL;DR: DiffusionFF is a novel framework for face forgery detection that combines diffusion-based artifact localization with semantic features, offering improved accuracy and detailed manipulation tracing.

Details Motivation: The motivation is to improve the robustness and accuracy of face forgery detection while also enabling precise localization of forgery artifacts to enhance model explainability and user trust. Method: The paper proposes DiffusionFF, which uses a denoising diffusion model to generate Structural Dissimilarity (DSSIM) maps. These maps are fused with high-level semantic features from a pretrained forgery detector to enhance detection accuracy. Result: Extensive experiments show that DiffusionFF achieves superior detection performance and provides precise, fine-grained localization of forgery artifacts. Conclusion: DiffusionFF provides a robust and accurate solution for face forgery detection with precise artifact localization, demonstrating superior performance and effectiveness. Abstract: The rapid evolution of deepfake generation techniques demands robust and accurate face forgery detection algorithms. While determining whether an image has been manipulated remains essential, the ability to precisely localize forgery artifacts has become increasingly important for improving model explainability and fostering user trust. To address this challenge, we propose DiffusionFF, a novel framework that enhances face forgery detection through diffusion-based artifact localization. Our method utilizes a denoising diffusion model to generate high-quality Structural Dissimilarity (DSSIM) maps, which effectively capture subtle traces of manipulation. These DSSIM maps are then fused with high-level semantic features extracted by a pretrained forgery detector, leading to significant improvements in detection accuracy. Extensive experiments on both cross-dataset and intra-dataset benchmarks demonstrate that DiffusionFF not only achieves superior detection performance but also offers precise and fine-grained artifact localization, highlighting its overall effectiveness.
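
As a rough illustration of the quantity behind a DSSIM map, the sketch below computes a single scalar DSSIM between two images in NumPy. It assumes the common convention DSSIM = (1 - SSIM) / 2 and a global (non-windowed) SSIM; the abstract does not give the exact formula, and DiffusionFF generates per-pixel maps with a diffusion model, which this example does not attempt to reproduce. The images are synthetic.

```python
import numpy as np

def global_ssim(a, b, data_range=1.0):
    """SSIM computed once over the whole image (no sliding window)."""
    c1 = (0.01 * data_range) ** 2  # standard stabilising constants
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

def dssim(a, b, data_range=1.0):
    """Structural dissimilarity under the assumed (1 - SSIM) / 2 convention."""
    return (1.0 - global_ssim(a, b, data_range)) / 2.0

# Synthetic example: a "real" image and a copy with one manipulated patch.
rng = np.random.default_rng(0)
real = rng.random((32, 32))
forged = real.copy()
forged[8:16, 8:16] = 1.0 - forged[8:16, 8:16]

d_same = dssim(real, real)      # identical images: no dissimilarity
d_forged = dssim(real, forged)  # manipulated patch: positive dissimilarity
```

A per-pixel map would apply the same statistics inside a sliding window around each pixel, so that the manipulated region stands out locally.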

[236] StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding

Haolin Yang,Feilong Tang,Linxiao Zhao,Xiang An,Ming Hu,Huifa Li,Xinlin Zhuang,Boqian Wang,Yifan Lu,Xiaofeng Zhang,Abdalla Swikir,Junjun He,Zongyuan Ge,Imran Razzak

Main category: cs.CV

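TL;DR: StreamAgent achieves efficient real-time video understanding by anticipating the temporal progression of key events and perceiving selectively, targeting scenarios such as autonomous driving and intelligent surveillance.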

Details Motivation: Existing methods rely on alternating perception-reaction or asynchronous triggers and lack task-driven planning and future anticipation, which limits real-time responsiveness and proactive decision making. Method: An anticipatory agent is prompted with question semantics and historical observations to anticipate key events and adjust perception, and a streaming KV-cache memory mechanism enables efficient inference through selective recall of relevant tokens. Result: On streaming and long video understanding tasks, the method outperforms existing methods in response accuracy and real-time efficiency. Conclusion: By anticipating the temporal intervals and spatial regions expected to contain future task-relevant information, StreamAgent improves proactive response and goal-driven decision making in real-time video understanding. Abstract: Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance poses challenges beyond conventional offline video processing, requiring continuous perception, proactive decision making, and responsive interaction based on dynamically evolving visual content. However, existing methods rely on alternating perception-reaction or asynchronous triggers, lacking task-driven planning and future anticipation, which limits their real-time responsiveness and proactive decision making in evolving video streams. To this end, we propose a StreamAgent that anticipates the temporal intervals and spatial regions expected to contain future task-relevant information to enable proactive and goal-driven responses. Specifically, we integrate question semantics and historical observations through prompting the anticipatory agent to anticipate the temporal progression of key events, align current observations with the expected future evidence, and subsequently adjust the perception action (e.g., attending to task-relevant regions or continuously tracking in subsequent frames). To enable efficient inference, we design a streaming KV-cache memory mechanism that constructs a hierarchical memory structure for selective recall of relevant tokens, enabling efficient semantic retrieval while reducing the overhead of storing all tokens in the traditional KV-cache. Extensive experiments on streaming and long video understanding tasks demonstrate that our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.

[237] Medical Image De-Identification Resources: Synthetic DICOM Data and Tools for Validation

Michael W. Rutherford,Tracy Nolan,Linmin Pei,Ulrike Wagner,Qinyan Pan,Phillip Farmer,Kirk Smith,Benjamin Kopchick,Laura Opsahl-Ong,Granger Sutton,David Clunie,Keyvan Farahani,Fred Prior

Main category: cs.CV

TL;DR: This paper introduces the MIDI dataset and evaluation framework for benchmarking DICOM image de-identification, enhancing privacy protection and regulatory confidence in medical data sharing.

Details Motivation: Medical imaging research increasingly relies on data sharing, but ensuring patient privacy while maintaining scientific utility remains challenging. Existing de-identification tools lack comprehensive evaluation methods, limiting reproducibility and regulatory confidence. Method: The researchers created the MIDI dataset using publicly available, de-identified data from TCIA, into which they embedded synthetic PHI and PII across various data types. They developed evaluation tools, including Python scripts and answer keys, to automate the comparison of de-identification results against expected outcomes. The framework aligns with HIPAA, DICOM confidentiality profiles, and TCIA best practices. Result: The MIDI dataset contains 538 subjects, 605 studies, 708 series, and 53,581 DICOM image instances, with embedded synthetic PHI/PII. It includes evaluation tools that enable automated, objective assessment of de-identification workflows, aligned with key privacy and medical data standards. Conclusion: The MIDI dataset and evaluation framework provide a reliable and objective method for assessing the effectiveness of DICOM de-identification workflows, thereby enhancing patient privacy and promoting safer, more consistent medical image sharing. Abstract: Medical imaging research increasingly depends on large-scale data sharing to promote reproducibility and train Artificial Intelligence (AI) models. Ensuring patient privacy remains a significant challenge for open-access data sharing. Digital Imaging and Communications in Medicine (DICOM), the global standard data format for medical imaging, encodes both essential clinical metadata and extensive protected health information (PHI) and personally identifiable information (PII). Effective de-identification must remove identifiers, preserve scientific utility, and maintain DICOM validity. 
Tools exist to perform de-identification, but few assess its effectiveness, and most rely on subjective reviews, limiting reproducibility and regulatory confidence. To address this gap, we developed an openly accessible DICOM dataset infused with synthetic PHI/PII and an evaluation framework for benchmarking image de-identification workflows. The Medical Image de-identification (MIDI) dataset was built using publicly available de-identified data from The Cancer Imaging Archive (TCIA). It includes 538 subjects (216 for validation, 322 for testing), 605 studies, 708 series, and 53,581 DICOM image instances. These span multiple vendors, imaging modalities, and cancer types. Synthetic PHI and PII were embedded into structured data elements, plain text data elements, and pixel data to simulate real-world identity leaks encountered by TCIA curation teams. Accompanying evaluation tools include a Python script, answer keys (known truth), and mapping files that enable automated comparison of curated data against expected transformations. The framework is aligned with the HIPAA Privacy Rule "Safe Harbor" method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices. It supports objective, standards-driven evaluation of de-identification workflows, promoting safer and more consistent medical image sharing.
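
The idea of scoring a de-identification run against an answer key can be sketched as follows. This is a hypothetical illustration only: the dictionary layout, the `"removed"`/`"retained"` actions, and the function name are assumptions made for the example and do not reflect the actual format of the MIDI evaluation script, answer keys, or mapping files. All values are synthetic.

```python
def score_deidentification(original, curated, answer_key):
    """Compare curated data elements against the expected action per element.

    answer_key maps an element name to "removed" or "retained" (assumed schema).
    Returns (number of correctly handled elements, list of failing element names).
    """
    failures = []
    for name, expected in answer_key.items():
        present = name in curated and curated[name] not in ("", None)
        if expected == "removed":
            ok = not present
        else:  # "retained": the value must survive unchanged
            ok = present and curated[name] == original.get(name)
        if not ok:
            failures.append(name)
    return len(answer_key) - len(failures), failures

# Synthetic header values standing in for DICOM data elements.
original = {"PatientName": "DOE^JANE", "PatientID": "12345", "Modality": "CT"}
curated = {"PatientName": "", "PatientID": "12345", "Modality": "CT"}
answer_key = {
    "PatientName": "removed",
    "PatientID": "removed",
    "Modality": "retained",
}

correct, failures = score_deidentification(original, curated, answer_key)
```

Here the tool under test blanked the patient name but left the patient ID in place, so the comparison flags `PatientID` as a failure. A real evaluation would also need to cover expected value transformations and identifiers burned into pixel data, which this sketch ignores.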

[238] EgoTrigger: Toward Audio-Driven Image Capture for Human Memory Enhancement in All-Day Energy-Efficient Smart Glasses

Akshay Paruchuri,Sinan Hersek,Lavisha Aggarwal,Qiao Yang,Xin Liu,Achin Kulshrestha,Andrea Colaco,Henry Fuchs,Ishan Chatterjee

Main category: cs.CV

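TL;DR: EgoTrigger is a new method that triggers image capture from audio cues, aiming to reduce the energy cost of continuous sensing on smart glasses while remaining useful for human memory enhancement.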

Details Motivation: All-day smart glasses must balance continuous contextual sensing against energy consumption to support human memory enhancement in daily life. Conventional multi-modal AI agents consume too much energy during continuous sensing, which limits all-day use. Method: EgoTrigger uses the lightweight audio model YAMNet with a custom classification head to detect cues of hand-object interaction (HOI) events, such as a drawer or medication bottle being opened, in microphone audio and uses them to trigger image capture. This reduces use of power-intensive components such as the camera. The approach is evaluated on the QA-Ego4D and HME-QA datasets. Result: EgoTrigger uses 54% fewer frames on average, significantly saving energy in power-hungry components (e.g., cameras) and downstream operations (e.g., wireless transmission), while achieving performance comparable to conventional approaches on memory task datasets. The authors also introduce the new HME-QA dataset, which contains 340 human-annotated first-person QA pairs focused on audio cues at HOI moments. Conclusion: By using audio cues to selectively activate the camera, EgoTrigger substantially reduces the energy cost of continuous sensing while preserving utility for human memory enhancement, offering a feasible path toward all-day smart glasses. Abstract: All-day smart glasses are likely to emerge as platforms capable of continuous contextual sensing, uniquely positioning them for unprecedented assistance in our daily lives. Integrating the multi-modal AI agents required for human memory enhancement while performing continuous sensing, however, presents a major energy efficiency challenge for all-day usage. Achieving this balance requires intelligent, context-aware sensor management. Our approach, EgoTrigger, leverages audio cues from the microphone to selectively activate power-intensive cameras, enabling efficient sensing while preserving substantial utility for human memory enhancement. EgoTrigger uses a lightweight audio model (YAMNet) and a custom classification head to trigger image capture from hand-object interaction (HOI) audio cues, such as the sound of a drawer opening or a medication bottle being opened. In addition to evaluating on the QA-Ego4D dataset, we introduce and evaluate on the Human Memory Enhancement Question-Answer (HME-QA) dataset. Our dataset contains 340 human-annotated first-person QA pairs from full-length Ego4D videos that were curated to ensure that they contained audio, focusing on HOI moments critical for contextual understanding and memory. 
Our results show EgoTrigger can use 54% fewer frames on average, significantly saving energy in both power-hungry sensing components (e.g., cameras) and downstream operations (e.g., wireless transmission), while achieving comparable performance on datasets for an episodic memory task. We believe this context-aware triggering strategy represents a promising direction for enabling energy-efficient, functional smart glasses capable of all-day use -- supporting applications like helping users recall where they placed their keys or information about their routine activities (e.g., taking medications).
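
The gating idea, where the camera is active only around audio windows that look like hand-object interaction, can be sketched as below. This is a minimal illustration under stated assumptions: the per-window scores are made-up numbers standing in for the output of an audio classifier, and the `threshold` and `hold` parameters are hypothetical, not settings reported by the authors.

```python
def select_capture_windows(hoi_scores, threshold=0.5, hold=1):
    """Return indices of audio windows in which the camera would be active.

    hoi_scores: per-window HOI probabilities from an audio classifier
                (hypothetical values in this example).
    hold: number of extra windows to keep capturing after a trigger (assumed).
    """
    active = set()
    for i, score in enumerate(hoi_scores):
        if score >= threshold:
            # Keep the camera on for the triggering window plus `hold` more.
            for j in range(i, min(i + hold + 1, len(hoi_scores))):
                active.add(j)
    return sorted(active)

# Made-up classifier scores for ten consecutive audio windows.
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.05, 0.7, 0.6, 0.2, 0.1]

active = select_capture_windows(scores, threshold=0.5, hold=1)
saved_fraction = 1 - len(active) / len(scores)  # share of windows with camera off
```

Raising `threshold` or lowering `hold` saves more frames at the risk of missing relevant moments, which is the trade-off the reported frame reduction and QA performance quantify.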

[239] InspectVLM: Unified in Theory, Unreliable in Practice

Conor Wallace,Isaac Corley,Jonathan Lwowski

Main category: cs.CV

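TL;DR: This paper studies a language-driven unified vision-language model (VLM) for industrial inspection and finds that, although it performs well on image classification and keypoint detection tasks, it still falls short of traditional models on core inspection metrics and robustness.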

Details Motivation: Unified vision-language models (VLMs) promise to simplify computer vision pipelines by reframing multiple visual tasks within a single language-driven interface. This is particularly appealing in industrial inspection, where managing separate task-specific models introduces complexity, inefficiency, and maintenance overhead. Method: The authors critically evaluate the viability of the unified paradigm using InspectVLM, a Florence-2-based VLM, and InspectMM, a newly built large-scale multimodal, multitask inspection dataset. Result: InspectVLM performs well on image-level classification and structured keypoint tasks but fails to match traditional ResNet-based models on core inspection metrics. Specific problems include brittle behavior under low prompt variability, degenerate outputs for fine-grained object detection, and frequent defaulting to memorized language responses regardless of visual input. Conclusion: Language-driven unification is conceptually elegant, but current VLMs lack the visual grounding and robustness required for precision-critical industrial inspection. Abstract: Unified vision-language models (VLMs) promise to streamline computer vision pipelines by reframing multiple visual tasks such as classification, detection, and keypoint localization within a single language-driven interface. This architecture is particularly appealing in industrial inspection, where managing disjoint task-specific models introduces complexity, inefficiency, and maintenance overhead. In this paper, we critically evaluate the viability of this unified paradigm using InspectVLM, a Florence-2-based VLM trained on InspectMM, our new large-scale multimodal, multitask inspection dataset. While InspectVLM performs competitively on image-level classification and structured keypoint tasks, we find that it fails to match traditional ResNet-based models in core inspection metrics. Notably, the model exhibits brittle behavior under low prompt variability, produces degenerate outputs for fine-grained object detection, and frequently defaults to memorized language responses regardless of visual input. Our findings suggest that while language-driven unification offers conceptual elegance, current VLMs lack the visual grounding and robustness necessary for deployment in precision-critical industrial inspections.

[240] IAUNet: Instance-Aware U-Net

Yaroslav Prytula,Illia Tsiporenko,Ali Zeynalli,Dmytro Fishman

Main category: cs.CV

TL;DR: This paper introduces IAUNet, a novel query-based U-Net architecture for biomedical instance segmentation, achieving superior performance and introducing a new benchmark dataset.

Details Motivation: To explore the potential of query-based methods in U-Net for biomedical instance segmentation and improve model efficiency and performance. Method: IAUNet incorporates a full U-Net architecture with a lightweight convolutional pixel decoder and a Transformer decoder for refining object-specific features across scales. Result: IAUNet outperforms state-of-the-art models on multiple public datasets and introduces a new benchmark dataset with detailed annotations of overlapping cells. Conclusion: IAUNet demonstrates superior performance over existing models in biomedical instance segmentation, establishing a strong baseline and opening new possibilities for query-based U-Net architectures. Abstract: Instance segmentation is critical in biomedical imaging to accurately distinguish individual objects like cells, which often overlap and vary in size. Recent query-based methods, where object queries guide segmentation, have shown strong performance. While U-Net has been a go-to architecture in medical image segmentation, its potential in query-based approaches remains largely unexplored. In this work, we present IAUNet, a novel query-based U-Net architecture. The core design features a full U-Net architecture, enhanced by a novel lightweight convolutional Pixel decoder, making the model more efficient and reducing the number of parameters. Additionally, we propose a Transformer decoder that refines object-specific features across multiple scales. Finally, we introduce the 2025 Revvity Full Cell Segmentation Dataset, a unique resource with detailed annotations of overlapping cell cytoplasm in brightfield images, setting a new benchmark for biomedical instance segmentation. Experiments on multiple public datasets and our own show that IAUNet outperforms most state-of-the-art fully convolutional, transformer-based, and query-based models and cell segmentation-specific models, setting a strong baseline for cell instance segmentation tasks. 
Code is available at https://github.com/SlavkoPrytula/IAUNet

[241] Proactive Disentangled Modeling of Trigger-Object Pairings for Backdoor Defense

Kyle Stein,Andrew A. Mahyari,Guillermo Francia III,Eman El-Sheikh

Main category: cs.CV

TL;DR: This paper introduces DBOM, a proactive framework using Vision-Language Models to detect and neutralize both seen and unseen backdoor attacks in deep neural networks by disentangling trigger and object representations in the embedding space.

Details Motivation: The increasing vulnerability of deep neural networks (DNNs) and generative AI (GenAI) to backdoor attacks, particularly those involving multiple triggers across various object classes, necessitates a proactive framework that can detect and neutralize such threats before training. Method: DBOM utilizes Vision-Language Models (VLMs) to disentangle trigger and object representations in the embedding space through a learnable visual prompt repository and prompt prefix tuning, along with trigger-object separation and diversity losses. It aligns features in a shared multimodal space for zero-shot generalization to unseen trigger-object pairings. Result: Experimental results on CIFAR-10 and GTSRB datasets demonstrate that DBOM robustly detects poisoned images prior to downstream training, significantly improving the security of DNN training pipelines. Conclusion: DBOM is an effective proactive framework for detecting and neutralizing both seen and unseen backdoor threats in deep neural networks, enhancing the security of DNN training pipelines. Abstract: Deep neural networks (DNNs) and generative AI (GenAI) are increasingly vulnerable to backdoor attacks, where adversaries embed triggers into inputs to cause models to misclassify or misinterpret target labels. Beyond traditional single-trigger scenarios, attackers may inject multiple triggers across various object classes, forming unseen backdoor-object configurations that evade standard detection pipelines. In this paper, we introduce DBOM (Disentangled Backdoor-Object Modeling), a proactive framework that leverages structured disentanglement to identify and neutralize both seen and unseen backdoor threats at the dataset level. Specifically, DBOM factorizes input image representations by modeling triggers and objects as independent primitives in the embedding space through the use of Vision-Language Models (VLMs). 
By leveraging the frozen, pre-trained encoders of VLMs, our approach decomposes the latent representations into distinct components through a learnable visual prompt repository and prompt prefix tuning, ensuring that the relationships between triggers and objects are explicitly captured. To separate trigger and object representations in the visual prompt repository, we introduce trigger-object separation and diversity losses that aid in disentangling trigger and object visual features. Next, by aligning image features with feature decomposition and fusion, as well as learned contextual prompt tokens in a shared multimodal space, DBOM enables zero-shot generalization to novel trigger-object pairings that were unseen during training, thereby offering deeper insights into adversarial attack patterns. Experimental results on CIFAR-10 and GTSRB demonstrate that DBOM robustly detects poisoned images prior to downstream training, significantly enhancing the security of DNN training pipelines.

[242] CVD-SfM: A Cross-View Deep Front-end Structure-from-Motion System for Sparse Localization in Multi-Altitude Scenes

Yaxuan Li,Yewei Huang,Bijay Gaudel,Hamidreza Jafarnejadsani,Brendan Englot

Main category: cs.CV

TL;DR: This paper introduces a novel framework for multi-altitude camera pose estimation, combining cross-view transformer, deep features, and structure-from-motion. It also presents two new datasets and demonstrates improved accuracy and robustness over existing methods.

Details Motivation: To address the challenge of robust and accurate localization across varied altitudes using sparse image input, especially given the scarcity of relevant datasets. Method: Integration of cross-view transformer, deep features, and structure-from-motion into a unified framework for multi-altitude camera pose estimation. Result: Extensive comparative analyses on two newly collected datasets show that the proposed framework outperforms existing methods in multi-altitude sparse pose estimation tasks. Conclusion: The proposed multi-altitude camera pose estimation system demonstrates superior accuracy and robustness compared to existing methods, making it suitable for real-world robotic applications. Abstract: We present a novel multi-altitude camera pose estimation system, addressing the challenges of robust and accurate localization across varied altitudes when only considering sparse image input. The system effectively handles diverse environmental conditions and viewpoint variations by integrating the cross-view transformer, deep features, and structure-from-motion into a unified framework. To benchmark our method and foster further research, we introduce two newly collected datasets specifically tailored for multi-altitude camera pose estimation; datasets of this nature remain rare in the current literature. The proposed framework has been validated through extensive comparative analyses on these datasets, demonstrating that our system achieves superior performance in both accuracy and robustness for multi-altitude sparse pose estimation tasks compared to existing solutions, making it well suited for real-world robotic applications such as aerial navigation, search and rescue, and automated inspection.

[243] Self-Supervised YOLO: Leveraging Contrastive Learning for Label-Efficient Object Detection

Manikanta Kotthapalli,Reshma Bhatia,Nainsi Jain

Main category: cs.CV

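TL;DR: This paper studies contrastive self-supervised learning (SSL) for pretraining YOLOv5 and YOLOv8 backbones to reduce reliance on large-scale labeled datasets.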

Details Motivation: One-stage object detectors in the YOLO family perform well in real-time vision applications but rely heavily on large-scale labeled datasets for training. Method: Using the SimCLR framework, the authors pretrain YOLOv5 and YOLOv8 backbones on unlabeled images. They introduce a simple yet effective pipeline that adapts YOLO's convolutional backbones as encoders, employs global pooling and projection heads, and optimizes a contrastive loss using augmentations of the COCO unlabeled dataset. Result: Experiments show that SSL pretraining yields higher mAP, faster convergence, and improved precision-recall performance, especially in low-label regimes. Conclusion: The findings establish a strong baseline for applying contrastive SSL to one-stage detectors and highlight the potential of unlabeled data as a scalable resource for label-efficient object detection. Abstract: One-stage object detectors such as the YOLO family achieve state-of-the-art performance in real-time vision applications but remain heavily reliant on large-scale labeled datasets for training. In this work, we present a systematic study of contrastive self-supervised learning (SSL) as a means to reduce this dependency by pretraining YOLOv5 and YOLOv8 backbones on unlabeled images using the SimCLR framework. Our approach introduces a simple yet effective pipeline that adapts YOLO's convolutional backbones as encoders, employs global pooling and projection heads, and optimizes a contrastive loss using augmentations of the COCO unlabeled dataset (120k images). The pretrained backbones are then fine-tuned on a cyclist detection task with limited labeled data. Experimental results show that SSL pretraining leads to consistently higher mAP, faster convergence, and improved precision-recall performance, especially in low-label regimes. For example, our SimCLR-pretrained YOLOv8 achieves a mAP@50:95 of 0.7663, outperforming its supervised counterpart despite using no annotations during pretraining. These findings establish a strong baseline for applying contrastive SSL to one-stage detectors and highlight the potential of unlabeled data as a scalable resource for label-efficient object detection.
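
The contrastive objective used by SimCLR is the NT-Xent loss. The NumPy sketch below shows that loss on its own, as a rough illustration: the random vectors stand in for projection-head outputs of two augmented views, and the temperature of 0.5 is an assumed value, not one reported in the abstract. The backbone, pooling, and augmentation pipeline are not modeled.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for a batch of N positive pairs (z1[i], z2[i])."""
    z = np.concatenate([z1, z2], axis=0)
    # Unit-normalise so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own candidate
    n = z1.shape[0]
    # Row i's positive is its other view: i + n in the first half, i - n in the second.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

# Stand-in embeddings (assumed): 8 samples, 16-dimensional projections.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
view_b_aligned = view_a + 0.01 * rng.normal(size=(8, 16))  # views that agree
view_b_unrelated = rng.normal(size=(8, 16))                # views that do not

loss_aligned = nt_xent_loss(view_a, view_b_aligned)
loss_unrelated = nt_xent_loss(view_a, view_b_unrelated)
```

The loss is lower when the two views of each image map to nearby embeddings, which is the signal that pretraining uses in place of labels.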

[244] On-the-Fly Object-aware Representative Point Selection in Point Cloud

Xiaoyu Zhang,Ziwei Wang,Hai Dong,Zhifeng Bao,Jiajun Liu

Main category: cs.CV

TL;DR: This paper proposes a two-step point cloud downsampling framework for autonomous vehicles that improves efficiency and effectiveness while preserving critical object-related information.

Details Motivation: Point clouds generate a large volume of data for autonomous vehicles (AVs), which creates challenges for storage, bandwidth, and processing cost. This paper aims to develop a downsampling framework that preserves critical object-related information while filtering out irrelevant background points. Method: The method involves two steps: (1) Object Presence Detection using an unsupervised density peak-based classifier and a supervised Naive Bayes classifier, and (2) Sampling Budget Allocation to select object-relevant points while maintaining a high retention rate of object information. Result: Extensive experiments on the KITTI and nuScenes datasets show that the proposed method consistently outperforms state-of-the-art baselines in both efficiency and effectiveness across varying sampling rates. Additionally, the method integrates seamlessly with diverse downstream models. Conclusion: The proposed method for point cloud downsampling effectively preserves critical object-related information while improving efficiency and effectiveness across varying sampling rates, making it a valuable and scalable solution for AV applications. Abstract: Point clouds are essential for object modeling and play a critical role in assisting driving tasks for autonomous vehicles (AVs). However, the significant volume of data generated by AVs creates challenges for storage, bandwidth, and processing cost. To tackle these challenges, we propose a representative point selection framework for point cloud downsampling, which preserves critical object-related information while effectively filtering out irrelevant background points. 
Our method involves two steps: (1) Object Presence Detection, where we introduce an unsupervised density peak-based classifier and a supervised Naïve Bayes classifier to handle diverse scenarios, and (2) Sampling Budget Allocation, where we propose a strategy that selects object-relevant points while maintaining a high retention rate of object information. Extensive experiments on the KITTI and nuScenes datasets demonstrate that our method consistently outperforms state-of-the-art baselines in both efficiency and effectiveness across varying sampling rates. As a model-agnostic solution, our approach integrates seamlessly with diverse downstream models, making it a valuable and scalable addition to the 3D point cloud downsampling toolkit for AV applications.

[245] IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A

Chen Li,Chinthani Sugandhika,Yeo Keat Ee,Eric Peh,Hao Zhang,Hong Yang,Deepu Rajan,Basura Fernando

Main category: cs.CV

TL;DR: This paper proposes IMoRe, an implicit program-guided motion reasoning framework for human motion QA that eliminates the need for manually designed modules, achieving strong performance on existing and new datasets.

Details Motivation: Existing human motion QA methods rely on explicit program execution with manually defined functional modules, which limits scalability and adaptability. The work aims to overcome this limitation by proposing an implicit reasoning framework that does not require manual module design. Method: The paper introduces an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. It uses a program-guided reading mechanism to dynamically select multi-level motion representations from a pretrained motion Vision Transformer (ViT), and the reasoning module iteratively refines memory representations using structured program functions. Result: The model achieves state-of-the-art performance on Babel-QA and demonstrates generalization on a newly constructed motion QA dataset based on HuMMan. Conclusion: The proposed IMoRe framework demonstrates state-of-the-art performance on Babel-QA and generalizes to a new motion QA dataset based on HuMMan, showing adaptability across different motion reasoning datasets. Abstract: Existing human motion Q&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring a more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. 
The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets. Code and dataset are available at: https://github.com/LUNAProject22/IMoRe.

[246] Deeply Dual Supervised learning for melanoma recognition

Rujosh Polma,Krishnan Menon Iyer

Main category: cs.CV

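TL;DR: This paper introduces a novel Deeply Dual Supervised Learning framework for melanoma recognition that improves diagnostic accuracy by integrating local and global feature extraction.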

Details Motivation: Despite advances in image classification techniques, existing models still struggle to identify the subtle visual cues that differentiate melanoma from benign lesions. Method: Using a dual-pathway structure, the model attends to both fine-grained local features and broader contextual information, and it uses a dual attention mechanism and a multi-scale feature aggregation strategy to improve performance. Result: Experiments show that the proposed framework significantly outperforms state-of-the-art methods in melanoma detection, with higher accuracy and better resilience against false positives. Conclusion: The paper proposes a new Deeply Dual Supervised Learning framework to enhance melanoma recognition, lays a foundation for future research in automated skin cancer recognition, and highlights the effectiveness of dual supervised learning in medical image analysis. Abstract: As the application of deep learning in dermatology continues to grow, the recognition of melanoma has garnered significant attention, demonstrating potential for improving diagnostic accuracy. Despite advancements in image classification techniques, existing models still face challenges in identifying subtle visual cues that differentiate melanoma from benign lesions. This paper presents a novel Deeply Dual Supervised Learning framework that integrates local and global feature extraction to enhance melanoma recognition. By employing a dual-pathway structure, the model focuses on both fine-grained local features and broader contextual information, ensuring a comprehensive understanding of the image content. The framework utilizes a dual attention mechanism that dynamically emphasizes critical features, thereby reducing the risk of overlooking subtle characteristics of melanoma. Additionally, we introduce a multi-scale feature aggregation strategy to ensure robust performance across varying image resolutions. Extensive experiments on benchmark datasets demonstrate that our framework significantly outperforms state-of-the-art methods in melanoma detection, achieving higher accuracy and better resilience against false positives. This work lays the foundation for future research in automated skin cancer recognition and highlights the effectiveness of dual supervised learning in medical image analysis.

[247] Fast and Memory-efficient Non-line-of-sight Imaging with Quasi-Fresnel Transform

Yijun Wei,Jianyu Wang,Leping Xiao,Zuoqiang Shi,Xing Fu,Lingyun Qiu

Main category: cs.CV

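TL;DR: This paper introduces a new non-line-of-sight (NLOS) imaging method that uses a Quasi-Fresnel transform and represents the hidden scene with two-dimensional functions, significantly reducing computational complexity and memory requirements and enabling efficient NLOS imaging on lightweight devices.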

Details Motivation: Existing NLOS imaging methods typically model both the measurement data and the hidden scene in three dimensions, overlooking the fact that most hidden objects are inherently two-dimensional. This leads to high computational cost and heavy memory consumption, which limits practical applications. Method: The paper uses a Quasi-Fresnel transform to establish a direct inversion formula between the measurement data and the hidden scene, and represents the hidden scene as two-dimensional functions. Result: The method reduces runtime and memory requirements by several orders of magnitude while maintaining imaging quality, significantly improving computational efficiency and making NLOS imaging feasible on lightweight devices. Conclusion: The paper proposes a novel NLOS imaging method that exploits the two-dimensional nature of the problem to significantly reduce computational complexity and memory requirements, opening the way to real-time, high-resolution NLOS imaging and broadening its applicability across platforms. Abstract: Non-line-of-sight (NLOS) imaging seeks to reconstruct hidden objects by analyzing reflections from intermediary surfaces. Existing methods typically model both the measurement data and the hidden scene in three dimensions, overlooking the inherently two-dimensional nature of most hidden objects. This oversight leads to high computational costs and substantial memory consumption, limiting practical applications and making real-time, high-resolution NLOS imaging on lightweight devices challenging. In this paper, we introduce a novel approach that represents the hidden scene using two-dimensional functions and employs a Quasi-Fresnel transform to establish a direct inversion formula between the measurement data and the hidden scene. This transformation leverages the two-dimensional characteristics of the problem to significantly reduce computational complexity and memory requirements. Our algorithm efficiently performs fast transformations between these two-dimensional aggregated data, enabling rapid reconstruction of hidden objects with minimal memory usage. Compared to existing methods, our approach reduces runtime and memory demands by several orders of magnitude while maintaining imaging quality. The substantial reduction in memory usage not only enhances computational efficiency but also enables NLOS imaging on lightweight devices such as mobile and embedded systems. We anticipate that this method will facilitate real-time, high-resolution NLOS imaging and broaden its applicability across a wider range of platforms.

[248] Devil is in the Detail: Towards Injecting Fine Details of Image Prompt in Image Generation via Conflict-free Guidance and Stratified Attention

Kyungmin Jo,Jooyeol Yun,Jaegul Choo

Main category: cs.CV

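TL;DR: This paper proposes a new image generation method that modifies the self-attention mechanism and classifier-free guidance so that image-prompting models better reflect the details of the image prompt.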

Details Motivation: Existing text-to-image diffusion models struggle to capture intricate details such as textures, which has led researchers to generate images conditioned on user-provided image prompts. However, existing methods neglect the importance of image prompts in classifier-free guidance, and their self-attention modifications involve a trade-off. Method: The paper analyzes the problems in how existing methods modify self-attention and apply classifier-free guidance, and proposes conflict-free guidance and Stratified Attention to address them. Result: In extensive experiments across three image generation tasks, the proposed method outperforms existing image-prompting models in reflecting the image prompt. Conclusion: The paper proposes a new image generation method that uses conflict-free guidance and Stratified Attention to improve how well image-prompting models reflect the details of the image prompt. Abstract: While large-scale text-to-image diffusion models enable the generation of high-quality, diverse images from text prompts, these prompts struggle to capture intricate details, such as textures, preventing the user intent from being reflected. This limitation has led to efforts to generate images conditioned on user-provided images, referred to as image prompts. Recent work modifies the self-attention mechanism to impose image conditions in generated images by replacing or concatenating the keys and values from the image prompt. This enables the self-attention layer to work like a cross-attention layer, generally used to incorporate text prompts. In this paper, we identify two common issues in existing methods of modifying self-attention to generate images that reflect the details of image prompts. First, existing approaches neglect the importance of image prompts in classifier-free guidance. Specifically, current methods use image prompts as both desired and undesired conditions in classifier-free guidance, causing conflicting signals. To resolve this, we propose conflict-free guidance by using image prompts only as desired conditions, ensuring that the generated image faithfully reflects the image prompt. In addition, we observe that the two most common self-attention modifications involve a trade-off between the realism of the generated image and alignment with the image prompt. Specifically, selecting more keys and values from the image prompt improves alignment, while selecting more from the generated image enhances realism. 
To balance both, we propose a new self-attention modification method, Stratified Attention, to jointly use keys and values from both images rather than selecting between them. Through extensive experiments across three image generation tasks, we show that the proposed method outperforms existing image-prompting models in faithfully reflecting the image prompt.
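
The conflict described in this abstract can be illustrated with the standard classifier-free guidance formula, which extrapolates from an undesired prediction toward a desired one. The sketch below is only a toy illustration, not the authors' implementation: `toy_denoiser` is a hypothetical linear stand-in for a diffusion model, and `text_effect`, `image_effect`, and `scale` are made-up values. It shows that when the image prompt appears in both branches its contribution is not amplified by guidance, whereas using it only in the desired branch amplifies it.

```python
import numpy as np

def classifier_free_guidance(eps_desired, eps_undesired, scale):
    """Standard classifier-free guidance: start at the undesired prediction
    and extrapolate toward the desired one by `scale`."""
    return eps_undesired + scale * (eps_desired - eps_undesired)

def toy_denoiser(use_text, use_image, text_effect, image_effect):
    """Hypothetical linear stand-in for a noise prediction (assumption)."""
    return use_text * text_effect + use_image * image_effect

rng = np.random.default_rng(0)
text_effect = rng.normal(size=(4, 4))   # made-up contribution of the text prompt
image_effect = rng.normal(size=(4, 4))  # made-up contribution of the image prompt
scale = 7.5                             # made-up guidance scale

# Image prompt in BOTH branches: it cancels in the guidance difference,
# so only the text contribution is scaled.
conflicting = classifier_free_guidance(
    toy_denoiser(1, 1, text_effect, image_effect),
    toy_denoiser(0, 1, text_effect, image_effect),
    scale,
)

# Image prompt ONLY in the desired branch: its contribution is scaled too.
conflict_free = classifier_free_guidance(
    toy_denoiser(1, 1, text_effect, image_effect),
    toy_denoiser(0, 0, text_effect, image_effect),
    scale,
)
```

In this toy model, `conflicting` equals `image_effect + scale * text_effect`, while `conflict_free` equals `scale * (text_effect + image_effect)`. Real diffusion models are not linear, so this only conveys the intuition behind using image prompts solely as desired conditions.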

[249] Bench2ADVLM: A Closed-Loop Benchmark for Vision-language Models in Autonomous Driving

Tianyuan Zhang,Ting Jin,Lu Wang,Jiangfan Liu,Siyuan Liang,Mingchuan Zhang,Aishan Liu,Xianglong Liu

Main category: cs.CV

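TL;DR: This paper proposes Bench2ADVLM, a closed-loop evaluation framework for autonomous driving systems based on vision-language models, and finds that existing systems perform poorly under closed-loop conditions.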

Details Motivation: Current evaluation of VLM-based autonomous driving systems is largely confined to open-loop settings, whereas closed-loop settings reflect system behavior more realistically. Method: The paper proposes Bench2ADVLM, a unified hierarchical closed-loop evaluation framework that includes a dual-system adaptation architecture, a physical control abstraction layer, and a self-reflective scenario generation module. Result: Experiments validate the diagnostic strength of Bench2ADVLM across diverse scenarios and physical platforms, showing that existing systems exhibit limited performance under closed-loop conditions. Conclusion: Bench2ADVLM demonstrates diagnostic strength in evaluating VLM-based autonomous driving systems and reveals the performance limits of existing systems under closed-loop conditions. Abstract: Vision-Language Models (VLMs) have recently emerged as a promising paradigm in autonomous driving (AD). However, current performance evaluation protocols for VLM-based AD systems (ADVLMs) are predominantly confined to open-loop settings with static inputs, neglecting the more realistic and informative closed-loop setting that captures interactive behavior, feedback resilience, and real-world safety. To address this, we introduce Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs across both simulation and physical platforms. Inspired by dual-process theories of cognition, we first adapt diverse ADVLMs to simulation environments via a dual-system adaptation architecture. In this design, heterogeneous high-level driving commands generated by target ADVLMs (fast system) are interpreted by a general-purpose VLM (slow system) into standardized mid-level control actions suitable for execution in simulation. To bridge the gap between simulation and reality, we design a physical control abstraction layer that translates these mid-level actions into low-level actuation signals, enabling, for the first time, closed-loop testing of ADVLMs on physical vehicles. To enable more comprehensive evaluation, Bench2ADVLM introduces a self-reflective scenario generation module that automatically explores model behavior and uncovers potential failure modes for safety-critical scenario generation. Overall, Bench2ADVLM establishes a hierarchical evaluation pipeline that seamlessly integrates high-level abstract reasoning, mid-level simulation actions, and low-level real-world execution. 
Experiments on diverse scenarios across multiple state-of-the-art ADVLMs and physical platforms validate the diagnostic strength of our framework, revealing that existing ADVLMs still exhibit limited performance under closed-loop conditions.

[250] Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure

Ziling Wang,Shuya Yang,Jialin Lu,Ka-Ho Chow

Main category: cs.CV

TL;DR: Protego is a user-centric privacy protection method that safeguards facial images from retrieval-based privacy intrusions by dynamically generating a natural-looking 3D mask tailored to the pose and expression of any facial image of the user, significantly reducing retrieval accuracy across FR models and offering improved visual coherence.

Details Motivation: The increasing use of face recognition technologies in large-scale image retrieval systems raises serious privacy concerns. Protego addresses this issue by safeguarding facial images from such retrieval-based privacy intrusions. Method: Protego encapsulates a user's 3D facial signatures into a pose-invariant 2D representation, which is dynamically deformed into a natural-looking 3D mask tailored to the pose and expression of any facial image of the user before online sharing. Result: Experiments show that Protego significantly reduces retrieval accuracy across a wide range of black-box FR models and performs at least 2x better than existing methods. It also offers unprecedented visual coherence, particularly in video settings. Conclusion: Protego is an effective method to protect facial images from retrieval-based privacy intrusions, significantly reducing retrieval accuracy across a wide range of FR models and offering improved visual coherence, especially in video settings. Abstract: Face recognition (FR) technologies are increasingly used to power large-scale image retrieval systems, raising serious privacy concerns. Services like Clearview AI and PimEyes allow anyone to upload a facial photo and retrieve a large amount of online content associated with that person. This not only enables identity inference but also exposes their digital footprint, such as social media activity, private photos, and news reports, often without their consent. In response to this emerging threat, we propose Protego, a user-centric privacy protection method that safeguards facial images from such retrieval-based privacy intrusions. Protego encapsulates a user's 3D facial signatures into a pose-invariant 2D representation, which is dynamically deformed into a natural-looking 3D mask tailored to the pose and expression of any facial image of the user, and applied prior to online sharing. 
Motivated by a critical limitation of existing methods, Protego amplifies the sensitivity of FR models so that protected images cannot be matched even among themselves. Experiments show that Protego significantly reduces retrieval accuracy across a wide range of black-box FR models and performs at least 2x better than existing methods. It also offers unprecedented visual coherence, particularly in video settings where consistency and natural appearance are essential. Overall, Protego contributes to the fight against the misuse of FR for mass surveillance and unsolicited identity tracing.

[251] Conditional Diffusion Model with Anatomical-Dose Dual Constraints for End-to-End Multi-Tumor Dose Prediction

Hui Xie,Haiqin Hu,Lijuan Ding,Qing Li,Yue Sun,Tao Tan

Main category: cs.CV

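TL;DR: The paper proposes ADDiff-Dose, a new deep-learning model for radiotherapy dose prediction that performs multi-tumor dose prediction efficiently and accurately, significantly outperforms existing methods, and could make radiotherapy planning more efficient.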

Details Motivation: Radiotherapy treatment planning typically relies on time-consuming trial-and-error adjustments that depend on expert experience, while existing deep learning methods are limited in generalization, prediction accuracy, and clinical applicability. Method: The paper proposes ADDiff-Dose, a conditional diffusion model with anatomical-dose dual constraints for end-to-end multi-tumor dose prediction. The model uses LightweightVAE3D to compress high-dimensional CT data and integrates multimodal inputs within a progressive noise addition and denoising framework. Result: Evaluation on a large-scale public dataset (2,877 cases) and three external institutional cohorts (450 cases in total) shows that ADDiff-Dose significantly outperforms traditional baselines, achieving an MAE of 0.101-0.154 (compared to 0.316 for UNet and 0.169 for GAN models) and a DICE coefficient of 0.927 (a 6.8% improvement), and limiting spinal cord maximum dose error to within 0.1 Gy. The average plan generation time per case is reduced to 22 seconds. Conclusion: ADDiff-Dose is presented as the first study to introduce a conditional diffusion model framework for radiotherapy dose prediction. It offers a generalizable and efficient solution for automated treatment planning across different tumor sites, with the potential to substantially reduce planning time and improve clinical workflow efficiency. Abstract: Radiotherapy treatment planning often relies on time-consuming, trial-and-error adjustments that heavily depend on the expertise of specialists, while existing deep learning methods face limitations in generalization, prediction accuracy, and clinical applicability. To tackle these challenges, we propose ADDiff-Dose, an Anatomical-Dose Dual Constraints Conditional Diffusion Model for end-to-end multi-tumor dose prediction. The model employs LightweightVAE3D to compress high-dimensional CT data and integrates multimodal inputs, including target and organ-at-risk (OAR) masks and beam parameters, within a progressive noise addition and denoising framework. It incorporates conditional features via a multi-head attention mechanism and utilizes a composite loss function combining MSE, conditional terms, and KL divergence to ensure both dosimetric accuracy and compliance with clinical constraints. Evaluation on a large-scale public dataset (2,877 cases) and three external institutional cohorts (450 cases in total) demonstrates that ADDiff-Dose significantly outperforms traditional baselines, achieving an MAE of 0.101-0.154 (compared to 0.316 for UNet and 0.169 for GAN models), a DICE coefficient of 0.927 (a 6.8% improvement), and limiting spinal cord maximum dose error to within 0.1 Gy. The average plan generation time per case is reduced to 22 seconds. Ablation studies confirm that the structural encoder enhances compliance with clinical dose constraints by 28.5%. 
To our knowledge, this is the first study to introduce a conditional diffusion model framework for radiotherapy dose prediction, offering a generalizable and efficient solution for automated treatment planning across diverse tumor sites, with the potential to substantially reduce planning time and improve clinical workflow efficiency.
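
For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how a mean absolute error and a Dice coefficient can be computed on dose grids. It is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not say how masks are derived for the Dice score, so thresholding the dose at a hypothetical `level` is an assumption, and the tiny arrays are made-up values.

```python
import numpy as np

def mean_absolute_error(pred, target):
    """Average absolute difference between predicted and reference dose."""
    return float(np.mean(np.abs(pred - target)))

def dice_coefficient(mask_a, mask_b, eps=1e-8):
    """Dice overlap of two binary masks: 2|A and B| / (|A| + |B|)."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    return float(2.0 * intersection / (a.sum() + b.sum() + eps))

# Made-up 2x2 "dose grids" purely for illustration.
pred_dose = np.array([[1.0, 2.0], [3.0, 4.0]])
true_dose = np.array([[1.0, 2.0], [4.0, 5.0]])

mae = mean_absolute_error(pred_dose, true_dose)

# Assumption: compare the regions receiving at least `level` dose units.
level = 4.0
dice = dice_coefficient(pred_dose >= level, true_dose >= level)
```

Here the two grids differ by 1.0 in two of four voxels, so the MAE is 0.5; the thresholded masks share one voxel out of 1 and 2 selected voxels respectively, giving a Dice score of about 0.667.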

[252] Mapillary Vistas Validation for Fine-Grained Traffic Signs: A Benchmark Revealing Vision-Language Model Limitations

Sparsh Garg,Abhishek Aich

Main category: cs.CV

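TL;DR: This paper presents MVV, a new validation dataset for traffic signs, and shows that the DINOv2 model outperforms existing vision-language models on fine-grained visual understanding tasks.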

Details Motivation: Existing datasets such as Mapillary provide labels that are too coarse and do not distinguish semantically important categories such as stop signs or speed limit signs. Method: Composite traffic signs in the Mapillary dataset are decomposed into fine-grained categories, and pixel-level instance masks are produced through expert annotation. Result: DINOv2 outperforms the vision-language model baselines on traffic sign recognition as well as on other common categories such as vehicles and humans. Conclusion: The MVV dataset provides a new validation benchmark for traffic sign recognition, and DINOv2 outperforms existing vision-language models. Abstract: Obtaining high-quality fine-grained annotations for traffic signs is critical for accurate and safe decision-making in autonomous driving. Widely used datasets, such as Mapillary, often provide only coarse-grained labels, without distinguishing semantically important types such as stop signs or speed limit signs. To this end, we present a new validation set for traffic signs derived from the Mapillary dataset called Mapillary Vistas Validation for Traffic Signs (MVV), where we decompose composite traffic signs into granular, semantically meaningful categories. The dataset includes pixel-level instance masks and has been manually annotated by expert annotators to ensure label fidelity. Further, we benchmark several state-of-the-art VLMs against the self-supervised DINOv2 model on this dataset and show that DINOv2 consistently outperforms all VLM baselines, not only on traffic sign recognition, but also on heavily represented categories like vehicles and humans. Our analysis reveals significant limitations in current vision-language models for fine-grained visual understanding and establishes DINOv2 as a strong baseline for dense semantic matching in autonomous driving scenarios. This dataset and evaluation framework pave the way for more reliable, interpretable, and scalable perception systems. Code and data are available at: https://github.com/nec-labs-ma/relabeling

[253] HCF: Hierarchical Cascade Framework for Distributed Multi-Stage Image Compression

Junhao Cai,Taegun An,Chengjun Jin,Sung Il Choi,JuHyun Park,Changhee Joo

Main category: cs.CV

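TL;DR: This study proposes the HCF framework, which uses latent-space transformations and quantization control to improve the rate-distortion performance and computational efficiency of multi-stage image compression.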

Details Motivation: Traditional approaches such as progressive compression, successive compression, and fixed-parameter models suffer from low efficiency, large quality loss, or poor flexibility in distributed multi-stage image compression. Method: Multi-stage image compression is achieved through direct latent-space transformations across network nodes, together with policy-driven quantization control and an edge quantization principle. Result: HCF performs strongly on the Kodak, CLIC, and CLIC2020-mobile datasets, with PSNR gains of up to 0.6dB and BD-Rate reductions of up to 12.64%, while saving substantial compute and execution time. Conclusion: HCF achieves high rate-distortion performance and better computational efficiency, outperforming existing progressive and successive compression methods. Abstract: Distributed multi-stage image compression -- where visual content traverses multiple processing nodes under varying quality requirements -- poses challenges. Progressive methods enable bitstream truncation but underutilize available compute resources; successive compression repeats costly pixel-domain operations and suffers cumulative quality loss and inefficiency; fixed-parameter models lack post-encoding flexibility. In this work, we developed the Hierarchical Cascade Framework (HCF) that achieves high rate-distortion performance and better computational efficiency through direct latent-space transformations across network nodes in a distributed multi-stage image compression system. Under HCF, we introduced policy-driven quantization control to optimize rate-distortion trade-offs, and established the edge quantization principle through differential entropy analysis. The configuration based on this principle demonstrates up to 0.6dB PSNR gains over other configurations. When comprehensively evaluated on the Kodak, CLIC, and CLIC2020-mobile datasets, HCF outperforms successive-compression methods by up to 5.56% BD-Rate in PSNR on CLIC, while saving up to 97.8% FLOPs, 96.5% GPU memory, and 90.0% execution time. It also outperforms state-of-the-art progressive compression methods by up to 12.64% BD-Rate on Kodak and enables retraining-free cross-quality adaptation with 7.13-10.87% BD-Rate reductions on CLIC2020-mobile.
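
To put the reported PSNR gain in context, here is a small sketch of the standard PSNR definition for 8-bit images, plus the arithmetic that converts a dB gain into a ratio of mean squared errors. This is the textbook formula, not code from the paper; the test images and the `max_value=255.0` default are assumptions made for illustration.

```python
import numpy as np

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    diff = reference.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Made-up example: every pixel is off by exactly 1, so MSE = 1.
reference = np.zeros((8, 8), dtype=np.uint8)
reconstructed = np.ones((8, 8), dtype=np.uint8)
example_psnr = psnr(reference, reconstructed)

# A gain of g dB at a fixed peak value multiplies the MSE by 10^(-g/10).
gain_db = 0.6
mse_ratio = 10.0 ** (-gain_db / 10.0)
```

With these inputs `example_psnr` is about 48.13 dB, and `mse_ratio` is about 0.871, meaning a 0.6dB PSNR gain corresponds to roughly 13% lower mean squared error.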

[254] StarPose: 3D Human Pose Estimation via Spatial-Temporal Autoregressive Diffusion

Haoxin Yang,Weihong Chen,Xuemiao Xu,Cheng Xu,Peng Xiao,Cuifeng Sun,Shaoyu Huang,Shengfeng He

Main category: cs.CV

TL;DR: This paper proposes StarPose, an autoregressive diffusion framework for 3D human pose estimation that enhances accuracy and temporal coherence by integrating historical pose predictions and spatial-temporal guidance.

Details Motivation: Traditional methods based on Transformers or CNNs and recent diffusion-based approaches have limited temporal consistency and accuracy in 3D pose predictions. Method: StarPose uses an autoregressive diffusion framework that integrates historical 3D pose predictions and spatial-temporal physical guidance. Result: Experiments show that StarPose achieves superior accuracy and temporal consistency in 3D human pose estimation. Conclusion: StarPose outperforms state-of-the-art methods in 3D human pose estimation, achieving superior accuracy and temporal consistency. Abstract: Monocular 3D human pose estimation remains a challenging task due to inherent depth ambiguities and occlusions. Compared to traditional methods based on Transformers or Convolutional Neural Networks (CNNs), recent diffusion-based approaches have shown superior performance, leveraging their probabilistic nature and high-fidelity generation capabilities. However, these methods often fail to account for the spatial and temporal correlations across predicted frames, resulting in limited temporal consistency and inferior accuracy in predicted 3D pose sequences. To address these shortcomings, this paper proposes StarPose, an autoregressive diffusion framework that effectively incorporates historical 3D pose predictions and spatial-temporal physical guidance to significantly enhance both the accuracy and temporal coherence of pose predictions. Unlike existing approaches, StarPose models the 2D-to-3D pose mapping as an autoregressive diffusion process. By synergically integrating previously predicted 3D poses with 2D pose inputs via a Historical Pose Integration Module (HPIM), the framework generates rich and informative historical pose embeddings that guide subsequent denoising steps, ensuring temporally consistent predictions. 
In addition, a fully plug-and-play Spatial-Temporal Physical Guidance (STPG) mechanism is tailored to refine the denoising process in an iterative manner, which further enforces spatial anatomical plausibility and temporal motion dynamics, rendering robust and realistic pose estimates. Extensive experiments on benchmark datasets demonstrate that StarPose outperforms state-of-the-art methods, achieving superior accuracy and temporal consistency in 3D human pose estimation. Code is available at https://github.com/wileychan/StarPose.

[255] YOLOv1 to YOLOv11: A Comprehensive Survey of Real-Time Object Detection Innovations and Challenges

Manikanta Kotthapalli,Deepika Ravipati,Reshma Bhatia

Main category: cs.CV

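TL;DR: This paper surveys the YOLO family of object detection models, covering their development, technical advances, multi-task extensions, and real-world applications across many domains, and looks ahead to future research directions.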

Details Motivation: Over the past decade the YOLO family has greatly advanced real-time vision applications, and studying its evolution helps in understanding technical progress and future trends in object detection. Method: The paper is a survey that systematically analyzes the YOLO family's architectural innovations, performance benchmarks, extended capabilities, and real-world use cases. Result: The paper traces the architectural and algorithmic improvements from YOLOv1 to YOLOv9, shows the continued gains in speed, accuracy, and deployment efficiency, and notes YOLO's wide use in areas such as instance segmentation, pose estimation, and medical imaging. Conclusion: The paper summarizes the development of the YOLO family, discusses its extension to different computer vision tasks, and outlines future research directions. Abstract: Over the past decade, object detection has advanced significantly, with the YOLO (You Only Look Once) family of models transforming the landscape of real-time vision applications through unified, end-to-end detection frameworks. From YOLOv1's pioneering regression-based detection to the latest YOLOv9, each version has systematically enhanced the balance between speed, accuracy, and deployment efficiency through continuous architectural and algorithmic advancements. Beyond core object detection, modern YOLO architectures have expanded to support tasks such as instance segmentation, pose estimation, object tracking, and domain-specific applications including medical imaging and industrial automation. This paper offers a comprehensive review of the YOLO family, highlighting architectural innovations, performance benchmarks, extended capabilities, and real-world use cases. We critically analyze the evolution of YOLO models and discuss emerging research directions that extend their impact across diverse computer vision domains.

[256] S-RRG-Bench: Structured Radiology Report Generation with Fine-Grained Evaluation Framework

Yingshu Li,Yunyi Liu,Zhanyu Wang,Xinyu Liang,Lingqiao Liu,Lei Wang,Luping Zhou

Main category: cs.CV

TL;DR: This paper introduces a structured approach to radiology report generation with a new dataset and evaluation metric, improving report quality and clinical relevance.

Details Motivation: Traditional free-text and template-based radiology reports have issues with inconsistency, fragmentation, and lack of clinically important details, necessitating a more structured and accurate approach. Method: Dataset construction (MIMIC-STRUC), training an LLM-based model for report generation, and introducing a new evaluation metric (S-Score). Result: A robust dataset with detailed clinical information and a new evaluation metric that aligns better with human assessments and clinical decision-making was developed. Conclusion: Structured radiology reports and a tailored evaluation metric improve the quality and clinical relevance of radiology report generation. Abstract: Radiology report generation (RRG) for diagnostic images, such as chest X-rays, plays a pivotal role in both clinical practice and AI. Traditional free-text reports suffer from redundancy and inconsistent language, complicating the extraction of critical clinical details. Structured radiology report generation (S-RRG) offers a promising solution by organizing information into standardized, concise formats. However, existing approaches often rely on classification or visual question answering (VQA) pipelines that require predefined label sets and produce only fragmented outputs. Template-based approaches, which generate reports by replacing keywords within fixed sentence patterns, further compromise expressiveness and often omit clinically important details. In this work, we present a novel approach to S-RRG that includes dataset construction, model training, and the introduction of a new evaluation framework. We first create a robust chest X-ray dataset (MIMIC-STRUC) that includes disease names, severity levels, probabilities, and anatomical locations, ensuring that the dataset is both clinically relevant and well-structured. We train an LLM-based model to generate standardized, high-quality reports. 
To assess the generated reports, we propose a specialized evaluation metric (S-Score) that not only measures disease prediction accuracy but also evaluates the precision of disease-specific details, thus offering a clinically meaningful metric for report quality that focuses on elements critical to clinical decision-making and demonstrates a stronger alignment with human assessments. Our approach highlights the effectiveness of structured reports and the importance of a tailored evaluation metric for S-RRG, providing a more clinically relevant measure of report quality.

[257] VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Shijie Zhou,Alexander Vilesov,Xuehai He,Ziyu Wan,Shuwang Zhang,Aditya Nagachandra,Di Chang,Dongdong Chen,Xin Eric Wang,Achuta Kadambi

Main category: cs.CV

TL;DR: This paper introduces VLM4D, a benchmark for evaluating spatiotemporal reasoning in VLMs, revealing their current limitations and exploring ways to enhance their dynamic understanding.

Details Motivation: Current VLMs lack the ability to understand dynamic spatiotemporal interactions, which is crucial for real-world applications. Method: VLM4D benchmark was created with real-world and synthetic videos and question-answer pairs. Evaluated existing VLMs and tested enhancement methods like 4D feature field reconstruction and spatiotemporal supervised fine-tuning. Result: State-of-the-art VLMs showed significant performance gaps compared to human baselines, especially in integrating multiple visual cues and maintaining temporal coherence. Conclusion: VLM4D benchmark highlights the deficiencies of current VLMs in spatiotemporal reasoning and explores promising methods for improvement. Abstract: Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts-abilities essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. 
We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs' spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.

[258] Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis

Kaiyang Ji,Ye Shi,Zichen Jin,Kangyi Chen,Lan Xu,Yuexin Ma,Jingyi Yu,Jingya Wang

Main category: cs.CV

TL;DR: Human-X is a new framework for real-time, physically plausible human interactions in systems like VR/AR and robotics, combining action prediction and motion tracking for improved realism and safety.

Details Motivation: The challenge of achieving real-time responsiveness, physical feasibility, and safety in dynamic human-machine interactions prompted the development of a more comprehensive framework. Method: Human-X uses an auto-regressive reaction diffusion planner for real-time prediction of actions and reactions, and integrates an actor-aware motion tracking policy trained with reinforcement learning to enhance physical realism and safety. Result: Experiments on the Inter-X and InterHuman datasets showed significant improvements in motion quality, interaction continuity, and physical plausibility compared to existing methods. Conclusion: Human-X provides a robust and immersive solution for real-time, physically plausible human interactions across various systems like VR/AR and robotics, showing potential in advancing human-robot collaboration. Abstract: Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners' movements while avoiding artifacts like foot sliding and penetration. 
Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.

[259] AutoLoRA: Automatic LoRA Retrieval and Fine-Grained Gated Fusion for Text-to-Image Generation

Zhiwen Li,Zhongjie Duan,Die Chen,Cen Chen,Daoyuan Chen,Yaliang Li,Yingda Chen

Main category: cs.CV

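TL;DR: This paper proposes a new framework that uses semantic-driven LoRA retrieval and dynamic aggregation to address the difficulty of fine-tuning large image generation models in practical deployment.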

Details Motivation: Despite advances in large-scale image generation models, the difficulty of fine-tuning their parameters limits practical deployment. Existing LoRA modules face sparse metadata annotation, the need for zero-shot adaptation, and suboptimal multi-LoRA fusion strategies. Method: The framework has two key components: (1) a weight-encoding-based LoRA retriever that establishes a shared semantic space between LoRA parameter matrices and text prompts, and (2) a fine-grained gated fusion mechanism that computes context-specific fusion weights to integrate multiple LoRA modules. Result: The method significantly improves image generation performance, facilitating scalable and data-efficient enhancement of foundation models. Conclusion: The work builds an important bridge between community-developed LoRA modules and practical deployment requirements, enabling collaborative model evolution through standardized adapter integration. Abstract: Despite recent advances in photorealistic image generation through large-scale models like FLUX and Stable Diffusion v3, the practical deployment of these architectures remains constrained by their inherent intractability to parameter fine-tuning. While low-rank adaptation (LoRA) has demonstrated efficacy in enabling model customization with minimal parameter overhead, the effective utilization of distributed open-source LoRA modules faces three critical challenges: sparse metadata annotation, the requirement for zero-shot adaptation capabilities, and suboptimal strategies for multi-LoRA fusion. To address these limitations, we introduce a novel framework that enables semantic-driven LoRA retrieval and dynamic aggregation through two key components: (1) a weight-encoding-based LoRA retriever that establishes a shared semantic space between LoRA parameter matrices and text prompts, eliminating dependence on original training data, and (2) a fine-grained gated fusion mechanism that computes context-specific fusion weights across network layers and diffusion timesteps to optimally integrate multiple LoRA modules during generation. Our approach achieves significant improvement in image generation performance, thereby facilitating scalable and data-efficient enhancement of foundational models. This work establishes a critical bridge between the fragmented landscape of community-developed LoRAs and practical deployment requirements, enabling collaborative model evolution through standardized adapter integration.
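
As a rough illustration of what gated fusion of several LoRA modules means, the sketch below adds gate-weighted low-rank updates to a frozen weight matrix, W = W0 + sum_i g_i * (B_i A_i). It is not the authors' implementation: the softmax normalisation of the gates, the function names, and the use of a single gate per module are all assumptions made for illustration. The abstract says only that the gates are computed per network layer and diffusion timestep.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_loras(w0, loras, gate_scores):
    """Return W0 + sum_i g_i * (B_i @ A_i).

    loras is a list of (B, A) low-rank factor pairs; gate_scores holds one
    hypothetical relevance score per module (assumed to come from some
    gating network that is not modelled here).
    """
    gates = softmax(gate_scores)  # assumption: gates are softmax-normalised
    fused = [row[:] for row in w0]
    for g, (b, a) in zip(gates, loras):
        delta = matmul(b, a)  # rank-r update with the same shape as W0
        for i in range(len(fused)):
            for j in range(len(fused[0])):
                fused[i][j] += g * delta[i][j]
    return fused
```

In a real diffusion model the same computation would be repeated for every adapted layer, with different gate scores at each layer and timestep.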

[260] DeflareMamba: Hierarchical Vision Mamba for Contextually Consistent Lens Flare Removal

Yihang Huang,Yuanfei Huang,Junhui Lin,Hua Huang

Main category: cs.CV

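TL;DR: This paper proposes DeflareMamba, which uses state space models for efficient lens flare removal while preserving the natural appearance of the image and improving visual recognition and semantic understanding.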

Details Motivation: To address the information confusion problem in lens flare removal and the poor contextual consistency of existing methods. Method: A hierarchical framework establishes long-range pixel correlations through varied stride sampling patterns, and local-enhanced state space models preserve local details at the same time. Result: Experiments show that the method effectively removes various types of flare artifacts, including scattering and reflective flares. Conclusion: DeflareMamba successfully introduces state space models to flare removal, preserves the natural appearance of non-flare regions, and improves visual object recognition and cross-modal semantic understanding. Abstract: Lens flare removal remains challenging because the underlying image background and the optical flares are easily confused, owing to the complex optical interactions between light sources and the camera lens. While recent solutions have shown promise in decoupling the flare corruption from the image, they often fail to maintain contextual consistency, leading to incomplete and inconsistent flare removal. To eliminate this limitation, we propose DeflareMamba, which leverages the efficient sequence modeling capabilities of state space models while maintaining the ability to capture local-global dependencies. In particular, we design a hierarchical framework that establishes long-range pixel correlations through varied stride sampling patterns, and utilize local-enhanced state space models that simultaneously preserve local details. To the best of our knowledge, this is the first work that introduces state space models to the flare removal task. Extensive experiments demonstrate that our method effectively removes various types of flare artifacts, including scattering and reflective flares, while maintaining the natural appearance of non-flare regions. Further downstream applications demonstrate the capacity of our method to improve visual object recognition and cross-modal semantic understanding. Code is available at https://github.com/BNU-ERC-ITEA/DeflareMamba.

[261] Beyond RGB and Events: Enhancing Object Detection under Adverse Lighting with Monocular Normal Maps

Mingjie Liu,Hanqing Liu,Chuang Zhu

Main category: cs.CV

TL;DR: This paper proposes NRE-Net, a multi-modal object detection framework that combines RGB images, surface normal maps, and event streams to improve detection accuracy in challenging lighting conditions, achieving superior performance over existing methods.

Details Motivation: Accurate object detection in adverse lighting conditions is crucial for real-world applications like autonomous driving, but existing sensors like RGB cameras and event cameras struggle due to distracting reflections and insufficient robustness. Method: NRE-Net integrates three modalities—monocular RGB images, predicted surface normal maps, and event streams—using two key modules: the Adaptive Dual-stream Fusion Module (ADFM) and the Event-modality Aware Fusion Module (EAFM). Result: Extensive evaluations on the DSEC-Det-sub and PKU-DAVIS-SOD datasets show that NRE-Net outperforms frame-based approaches by 7.9% and 6.1% in mAP50, and surpasses fusion-based methods SFNet and SODFormer by 2.7% and 7.1%, respectively. Conclusion: The proposed NRE-Net framework significantly improves object detection accuracy under adverse lighting conditions by effectively fusing monocular RGB images, surface normal maps, and event streams, outperforming state-of-the-art methods. Abstract: Accurate object detection under adverse lighting conditions is critical for real-world applications such as autonomous driving. Although neuromorphic event cameras have been introduced to handle these scenarios, adverse lighting often induces distracting reflections from tunnel walls or road surfaces, which frequently lead to false obstacle detections. However, neither RGB nor event data alone is robust enough to address these complexities, and mitigating these issues without additional sensors remains underexplored. To overcome these challenges, we propose leveraging normal maps, directly predicted from monocular RGB images, as robust geometric cues to suppress false positives and enhance detection accuracy. We introduce NRE-Net, a novel multi-modal detection framework that effectively fuses three complementary modalities: monocularly predicted surface normal maps, RGB images, and event streams. 
To optimize the fusion process, our framework incorporates two key modules: the Adaptive Dual-stream Fusion Module (ADFM), which integrates RGB and normal map features, and the Event-modality Aware Fusion Module (EAFM), which adapts to the high dynamic range characteristics of event data. Extensive evaluations on the DSEC-Det-sub and PKU-DAVIS-SOD datasets demonstrate that NRE-Net significantly outperforms state-of-the-art methods. Our approach achieves mAP50 improvements of 7.9% and 6.1% over frame-based approaches (e.g., YOLOX), while surpassing the fusion-based SFNet by 2.7% on the DSEC-Det-sub dataset and SODFormer by 7.1% on the PKU-DAVIS-SOD dataset.
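
The mAP50 figures above rest on an intersection-over-union (IoU) test: under the usual convention, a predicted box counts as a match for a ground-truth box only if their IoU is at least 0.5. A minimal sketch of that check follows. The function names and the (x1, y1, x2, y2) box layout are assumptions for illustration and are not taken from the paper's evaluation code.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def matches_at_50(pred_box, gt_box):
    """True if the prediction would count as a hit at the 0.5 IoU threshold."""
    return iou(pred_box, gt_box) >= 0.5
```

Average precision is then computed per class from the ranked list of hits and misses and averaged over classes; that step is omitted here.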

[262] VDEGaussian: Video Diffusion Enhanced 4D Gaussian Splatting for Dynamic Urban Scenes Modeling

Yuru Xiao,Zihan Lin,Chao Lu,Deming Zhai,Kui Jiang,Wenbo Zhao,Wei Zhang,Junjun Jiang,Huanran Wang,Xianming Liu

Main category: cs.CV

TL;DR: This paper proposes a new framework combining video diffusion and Gaussian Splatting to improve modeling of dynamic urban scenes, particularly for fast-moving objects.

Details Motivation: Current methods relying on neural radiance fields or Gaussian Splatting face challenges in modeling fast-moving objects due to temporal discontinuities and dependence on pre-calibrated object tracks. Method: The method involves distilling temporally consistent priors from a video diffusion model and includes innovations like joint timestamp optimization and uncertainty distillation for better pose alignment and content integration. Result: The experiments show a 2 dB PSNR gain in novel view synthesis for fast-moving objects compared to baseline approaches. Conclusion: The proposed video diffusion-enhanced 4D Gaussian Splatting framework significantly improves dynamic urban scene modeling, especially for fast-moving objects. Abstract: Dynamic urban scene modeling is a rapidly evolving area with broad applications. While current approaches leveraging neural radiance fields or Gaussian Splatting have achieved fine-grained reconstruction and high-fidelity novel view synthesis, they still face significant limitations. These often stem from a dependence on pre-calibrated object tracks or difficulties in accurately modeling fast-moving objects from undersampled capture, particularly due to challenges in handling temporal discontinuities. To overcome these issues, we propose a novel video diffusion-enhanced 4D Gaussian Splatting framework. Our key insight is to distill robust, temporally consistent priors from a test-time adapted video diffusion model. To ensure precise pose alignment and effective integration of this denoised content, we introduce two core innovations: a joint timestamp optimization strategy that refines interpolated frame poses, and an uncertainty distillation method that adaptively extracts target content while preserving well-reconstructed regions. 
Extensive experiments demonstrate that our method significantly enhances dynamic modeling, especially for fast-moving objects, achieving an approximate PSNR gain of 2 dB for novel view synthesis over baseline approaches.
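
To put the reported gain of roughly 2 dB in context, PSNR is defined as 10 * log10(MAX^2 / MSE), so a fixed gain in dB corresponds to a fixed multiplicative reduction in mean squared error. The sketch below works through that arithmetic; the helper names are illustrative only.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by gain_db."""
    return 10.0 ** (-gain_db / 10.0)

baseline_mse = 0.01                                   # arbitrary example value
improved_mse = baseline_mse * mse_ratio_for_gain(2.0)
gain = psnr(improved_mse) - psnr(baseline_mse)        # 2 dB by construction
```

Under this definition, a 2 dB gain corresponds to the mean squared error falling to about 63% of its baseline value.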

[263] A Neural Quality Metric for BRDF Models

Behnaz Kavoosighafi,Rafal K. Mantiuk,Saghi Hajisharif,Ehsan Miandji,Jonas Unger

Main category: cs.CV

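TL;DR: This paper proposes a perceptual neural quality metric for evaluating bidirectional reflectance distribution function (BRDF) models. It operates directly in BRDF space, avoiding the need to render images as traditional methods do.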

Details Motivation: Traditional BRDF-space metrics rely on numerical error measures that fail to capture perceptual differences in rendered images, so an evaluation method better aligned with human perception is needed. Method: The paper proposes a compact multi-layer perceptron (MLP) trained on a dataset of measured BRDFs supplemented with synthetically generated data and labelled using a perceptually validated image-space metric. It takes paired samples of reference and approximated BRDFs as input and predicts perceptual quality as JOD scores. Result: The neural metric correlates significantly better with human judgments than existing BRDF-space metrics, although its performance as a loss function for BRDF fitting remains limited. Conclusion: The paper offers a perceptually grounded alternative for evaluating BRDF models, providing a new approach to BRDF quality assessment in photo-realistic rendering. Abstract: Accurately evaluating the quality of bidirectional reflectance distribution function (BRDF) models is essential for photo-realistic rendering. Traditional BRDF-space metrics often employ numerical error measures that fail to capture perceptual differences evident in rendered images. In this paper, we introduce the first perceptually informed neural quality metric for BRDF evaluation that operates directly in BRDF space, eliminating the need for rendering during quality assessment. Our metric is implemented as a compact multi-layer perceptron (MLP), trained on a dataset of measured BRDFs supplemented with synthetically generated data and labelled using a perceptually validated image-space metric. The network takes as input paired samples of reference and approximated BRDFs and predicts their perceptual quality in terms of just-objectionable-difference (JOD) scores. We show that our neural metric achieves significantly higher correlation with human judgments than existing BRDF-space metrics. While its performance as a loss function for BRDF fitting remains limited, the proposed metric offers a perceptually grounded alternative for evaluating BRDF models.

[264] Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference

Kuo Wang,Quanlong Zheng,Junlin Xie,Yanhao Zhang,Jinguo Luo,Haonan Lu,Liang Lin,Fan Zhou,Guanbin Li

Main category: cs.CV

TL;DR: Free-MoRef is a training-free method that enhances Video-MLLMs' ability to understand long videos by efficiently multiplexing context perception, achieving better performance with lower computational cost.

Details Motivation: Existing Video-MLLMs struggle with long video understanding due to context length limitations, leading to compromises in feature granularity or inference efficiency. Free-MoRef aims to overcome these limitations without training. Method: Free-MoRef reconstructs vision tokens into short sequences (multi-references) and uses MoRef-attention to gather clues in parallel. It then applies reference fusion to compose a final mixed reasoning sequence, compensating for neglected cross-reference interactions. Result: Experiments on VideoMME, MLVU, and LongVideoBench show that Free-MoRef enables Video-MLLMs to process 2× to 8× longer input frames without compression on a single A100 GPU, achieving significant performance gains and even surpassing dedicated long-video-MLLMs. Conclusion: Free-MoRef is a training-free approach that improves the performance of Video-MLLMs on long video scenarios by multiplexing context perception capabilities within one inference pass, outperforming existing methods in terms of efficiency and effectiveness. Abstract: Video Multimodal Large Language Models (Video-MLLMs) have achieved remarkable advancements in video understanding tasks. However, constrained by the context length limitation in the underlying LLMs, existing Video-MLLMs typically exhibit suboptimal performance on long video scenarios. To understand extended input frames, common solutions span token compression and streaming inference techniques, which sacrifice feature granularity or inference efficiency. In contrast, to efficiently achieve comprehensive understanding of longer frame inputs, we draw ideas from MoE and propose a training-free approach, Free-MoRef, which instantly multiplexes the context perception capabilities of Video-MLLMs within one inference pass. Specifically, Free-MoRef reconstructs the vision tokens into several short sequences as multi-references. 
Subsequently, we introduce MoRef-attention, which gathers clues from the multi-reference chunks in parallel to summarize unified query activations. After the shadow layers in LLMs, a reference fusion step is derived to compose a final mixed reasoning sequence with key tokens from parallel chunks, which compensates for the cross-reference vision interactions that are neglected in MoRef-attention. By splitting and fusing the long vision token sequences, Free-MoRef achieves improved performance at much lower computing cost when reasoning over the multiplexed context length, demonstrating strong efficiency and effectiveness. Experiments on VideoMME, MLVU, and LongVideoBench show that Free-MoRef achieves full perception of 2$\times$ to 8$\times$ longer input frames without compression on a single A100 GPU while keeping instant responses, thereby bringing significant performance gains, even surpassing dedicatedly trained long-video-MLLMs. Code is available at https://github.com/wkfdb/Free-MoRef

[265] AID4AD: Aerial Image Data for Automated Driving Perception

Daniel Lengerer,Mathias Pechinger,Klaus Bogenberger,Carsten Markgraf

Main category: cs.CV

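TL;DR: This paper presents AID4AD, a high-resolution aerial image dataset aligned with the nuScenes dataset. With an improved alignment and quality control workflow, it supports the use of aerial imagery in automated vehicle perception tasks and shows clear benefits in map construction and motion prediction.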

Details Motivation: To integrate spatially aligned aerial imagery into perception tasks for automated vehicles (AVs). Method: The paper proposes an alignment workflow that uses SLAM-based point cloud maps to align aerial data with the nuScenes local coordinate system, and refines the dataset through a manual quality control process to provide high-quality alignments as a reference for future research. Result: In online map construction, aerial imagery as a complementary input improves map construction accuracy by 15-23%; in motion prediction, used as a structured environmental representation in place of high-definition maps, it improves trajectory prediction performance by 2%. Conclusion: AID4AD demonstrates the potential of aerial imagery as a scalable and adaptable source of environmental context in automated vehicle systems, particularly where high-definition maps are unavailable, outdated, or costly to maintain. Abstract: This work investigates the integration of spatially aligned aerial imagery into perception tasks for automated vehicles (AVs). As a central contribution, we present AID4AD, a publicly available dataset that augments the nuScenes dataset with high-resolution aerial imagery precisely aligned to its local coordinate system. The alignment is performed using SLAM-based point cloud maps provided by nuScenes, establishing a direct link between aerial data and the nuScenes local coordinate system. To ensure spatial fidelity, we propose an alignment workflow that corrects for localization and projection distortions. A manual quality control process further refines the dataset by identifying a set of high-quality alignments, which we publish as ground truth to support future research on automated registration. We demonstrate the practical value of AID4AD in two representative tasks: in online map construction, aerial imagery serves as a complementary input that improves the mapping process; in motion prediction, it functions as a structured environmental representation that replaces high-definition maps. Experiments show that aerial imagery leads to a 15-23% improvement in map construction accuracy and a 2% gain in trajectory prediction performance. These results highlight the potential of aerial imagery as a scalable and adaptable source of environmental context in automated vehicle systems, particularly in scenarios where high-definition maps are unavailable, outdated, or costly to maintain. 
AID4AD, along with evaluation code and pretrained models, is publicly released to foster further research in this direction: https://github.com/DriverlessMobility/AID4AD.

[266] TrackletGait: A Robust Framework for Gait Recognition in the Wild

Shaoxiong Zhang,Jinkai Zheng,Shangdong Zhu,Chenggang Yan

Main category: cs.CV

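TL;DR: This paper proposes TrackletGait, which improves the accuracy and robustness of gait recognition in real-world scenarios through Random Tracklet Sampling, Haar Wavelet-based Downsampling, and a Hardness Exclusion Triplet Loss.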

Details Motivation: Existing gait recognition methods struggle with non-periodic and occluded sequences in real-world surveillance scenarios, calling for a more robust and representative solution. Method: The paper proposes a new framework, TrackletGait, comprising Random Tracklet Sampling, Haar Wavelet-based Downsampling, and a Hardness Exclusion Triplet Loss. Result: TrackletGait achieves 77.8 and 80.4 rank-1 accuracy on the Gait3D and GREW datasets, respectively, while using only 10.3M backbone parameters. Conclusion: Through its framework design and method improvements, TrackletGait addresses non-periodicity and occlusion and achieves state-of-the-art gait recognition results in real-world scenarios. Abstract: Gait recognition aims to identify individuals based on their body shape and walking patterns. Though much progress has been achieved driven by deep learning, gait recognition in real-world surveillance scenarios remains quite challenging to current methods. Conventional approaches, which rely on periodic gait cycles and controlled environments, struggle with the non-periodic and occluded silhouette sequences encountered in the wild. In this paper, we propose a novel framework, TrackletGait, designed to address these challenges in the wild. We propose Random Tracklet Sampling, a generalization of existing sampling methods, which strikes a balance between robustness and representation in capturing diverse walking patterns. Next, we introduce Haar Wavelet-based Downsampling to preserve information during spatial downsampling. Finally, we present a Hardness Exclusion Triplet Loss, designed to exclude low-quality silhouettes by discarding hard triplet samples. TrackletGait achieves state-of-the-art results, with 77.8 and 80.4 rank-1 accuracy on the Gait3D and GREW datasets, respectively, while using only 10.3M backbone parameters. Extensive experiments are also conducted to further investigate the factors affecting gait recognition in the wild.
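
The claim that Haar wavelet downsampling preserves information can be illustrated with a one-level 2D Haar transform: each 2x2 block becomes four coefficients (one low-pass, three detail), so spatial resolution is halved but nothing is discarded. The sketch below is a generic orthonormal Haar step in plain Python. How TrackletGait actually consumes the sub-bands is not stated in the abstract, and the function name and sub-band labels are assumptions.

```python
def haar_downsample(img):
    """One-level 2D Haar transform of an image with even height and width.

    img is a list of rows of floats. Returns four half-resolution sub-bands
    (ll, lh, hl, hh); together they hold the same energy as the input.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            row_ll.append((a + b + d + e) / 2.0)  # local average (low-pass)
            row_lh.append((a - b + d - e) / 2.0)  # left-minus-right detail
            row_hl.append((a + b - d - e) / 2.0)  # top-minus-bottom detail
            row_hh.append((a - b - d + e) / 2.0)  # diagonal detail
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh
```

By contrast, strided convolution or pooling keeps only one value per block; here the three detail bands can be stacked along the channel axis so a network still sees the high-frequency content.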

[267] AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation

Ziyang Luo,Nian Liu,Fahad Shahbaz Khan,Junwei Han

Main category: cs.CV

TL;DR: AURORA improves reference audio-visual segmentation by enhancing genuine reasoning and language comprehension, achieving state-of-the-art results through a structured CoT mechanism, feature distillation loss, and a two-stage training approach.

Details Motivation: Existing Ref-AVS methods lack genuine semantic understanding and often rely on memorized reasoning patterns, which can compromise pixel-level precision during segmentation. Method: AURORA uses a structured Chain-of-Thought (CoT) prompting mechanism, a segmentation feature distillation loss, and a two-stage training strategy involving corrective reflective-style training and Group Reward Policy Optimization (GRPO). Result: AURORA demonstrates superior performance on Ref-AVS benchmarks and shows robust generalization to unreferenced segmentation scenarios. Conclusion: AURORA is an effective framework that achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes well to unreferenced segmentation tasks. Abstract: Reference Audio-Visual Segmentation (Ref-AVS) tasks challenge models to precisely locate sounding objects by integrating visual, auditory, and textual cues. Existing methods often lack genuine semantic understanding, tending to memorize fixed reasoning patterns. Furthermore, jointly training for reasoning and segmentation can compromise pixel-level precision. To address these issues, we introduce AURORA, a novel framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. We employ a structured Chain-of-Thought (CoT) prompting mechanism to guide the model through a step-by-step reasoning process and introduce a novel segmentation feature distillation loss to effectively integrate these reasoning abilities without sacrificing segmentation performance. To further cultivate the model's genuine reasoning capabilities, we devise a two-stage training strategy: first, a "corrective reflective-style training" stage utilizes self-correction to enhance the quality of reasoning paths, followed by reinforcement learning via Group Reward Policy Optimization (GRPO) to bolster robustness in challenging scenarios. 
Experiments demonstrate that AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.

[268] AttriCtrl: Fine-Grained Control of Aesthetic Attribute Intensity in Diffusion Models

Die Chen,Zhongjie Duan,Zhiwen Li,Cen Chen,Daoyuan Chen,Yaliang Li,Yinda Chen

Main category: cs.CV

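TL;DR: AttriCtrl is a plug-and-play framework for precise, continuous control of image aesthetic attributes. It achieves accurate control over single attributes and multi-attribute compositions with minimal training overhead.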

Details Motivation: Existing text-to-image generation methods fall short on fine-grained aesthetic control: they struggle to express specific aesthetic semantics and intensity, which limits their scalability and practicality. Method: Abstract aesthetics are quantified using semantic similarity from pre-trained vision-language models, and a lightweight value encoder maps scalar intensities to learnable embeddings within diffusion-based generation. Result: Experiments show that AttriCtrl accurately controls individual attributes and supports flexible multi-attribute composition, while remaining compatible with mainstream controllable generation frameworks and showing strong practical utility. Conclusion: AttriCtrl is an efficient, easily integrated solution that enables intuitive, continuous, multi-attribute control of aesthetic attributes across diverse generation scenarios. Abstract: Recent breakthroughs in text-to-image diffusion models have significantly enhanced both the visual fidelity and semantic controllability of generated images. However, fine-grained control over aesthetic attributes remains challenging, especially when users require continuous and intensity-specific adjustments. Existing approaches often rely on vague textual prompts, which are inherently ambiguous in expressing both the aesthetic semantics and the desired intensity, or depend on costly human preference data for alignment, limiting their scalability and practicality. To address these limitations, we propose AttriCtrl, a plug-and-play framework for precise and continuous control of aesthetic attributes. Specifically, we quantify abstract aesthetics by leveraging semantic similarity from pre-trained vision-language models, and employ a lightweight value encoder that maps scalar intensities in $[0,1]$ to learnable embeddings within diffusion-based generation. This design enables intuitive and customizable aesthetic manipulation, with minimal training overhead and seamless integration into existing generation pipelines. Extensive experiments demonstrate that AttriCtrl achieves accurate control over individual attributes as well as flexible multi-attribute composition. Moreover, it is fully compatible with popular open-source controllable generation frameworks, showcasing strong integration capability and practical utility across diverse generation scenarios.

[269] Efficient Chambolle-Pock based algorithms for Convolutional sparse representation

Yi Liu,Junjing Li,Yang Chen,Haowei Tang,Pengcheng Zhang,Tianling Lyu,Zhiguo Gui

Main category: cs.CV

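TL;DR: This paper proposes a new convolutional sparse coding method based on the Chambolle-Pock framework that requires no manual parameter selection, converges quickly, and performs well on image denoising.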

Details Motivation: Convolutional sparse representation has attracted wide attention in image processing thanks to its translation invariance. However, existing ADMM-based convolutional sparse coding methods require a carefully chosen penalty parameter, and a poor choice can lead to non-convergence or slow convergence. Method: The paper proposes a convolutional sparse coding method based on the Chambolle-Pock framework that avoids extra manually selected parameters and converges faster. It also introduces an anisotropic total variation penalty and applies the CP algorithm to the convolutional dictionary learning problem. Result: The proposed method needs no manual parameter selection, converges faster, and shows better denoising performance than the ADMM approach in experiments. Conclusion: Experiments show that the method matches the latest ADMM-based approach on noise-free images and outperforms it at removing Gaussian noise. Abstract: Recently, convolutional sparse representation (CSR), as a sparse representation technique, has attracted increasing attention in the field of image processing, owing to its translation invariance. CSR usually consists of convolutional sparse coding (CSC) and convolutional dictionary learning (CDL), and many studies focus on how to solve the corresponding optimization problems. At present, the most efficient optimization scheme for CSC is based on the alternating direction method of multipliers (ADMM). However, the ADMM-based approach involves a penalty parameter that needs to be carefully selected, and improper parameter selection may result in either no convergence or very slow convergence. In this paper, a novel fast and efficient method using the Chambolle-Pock (CP) framework is proposed, which does not require extra manually selected parameters during solving and converges faster. Furthermore, we propose an anisotropic total variation penalty on the coefficient maps for CSC and apply the CP algorithm to solve it. In addition, we also apply the CP framework to solve the corresponding CDL problem. Experiments show that for noise-free images the proposed CSC algorithms achieve results that rival the latest ADMM-based approach, while outperforming it in removing noise from images corrupted by Gaussian noise.
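
For readers unfamiliar with the Chambolle-Pock primal-dual iteration, the sketch below applies it to a much simpler problem than the paper's convolutional sparse coding: 1D anisotropic total variation denoising, min_x 0.5 * ||x - b||^2 + lam * ||Dx||_1, where D is the forward difference operator. It is a standard textbook instance, not the authors' algorithm. The function name and the step sizes tau = sigma = 0.45 are assumptions, chosen so that tau * sigma * ||D||^2 < 1 given ||D||^2 <= 4.

```python
def chambolle_pock_tv1d(b, lam, n_iter=500, tau=0.45, sigma=0.45):
    """Chambolle-Pock iterations for 1D TV denoising of the signal b."""
    n = len(b)
    x = list(b)        # primal variable
    x_bar = list(b)    # extrapolated primal variable
    y = [0.0] * (n - 1)  # dual variable, one entry per finite difference
    for _ in range(n_iter):
        # Dual step: ascent along D x_bar, then projection onto [-lam, lam]
        # (the proximal map of the conjugate of lam * ||.||_1).
        y = [max(-lam, min(lam, y[i] + sigma * (x_bar[i + 1] - x_bar[i])))
             for i in range(n - 1)]
        # Adjoint D^T y of the forward difference operator.
        dty = [0.0] * n
        for i in range(n - 1):
            dty[i] -= y[i]
            dty[i + 1] += y[i]
        # Primal step: proximal map of tau * 0.5 * ||x - b||^2.
        x_new = [(x[i] - tau * dty[i] + tau * b[i]) / (1.0 + tau)
                 for i in range(n)]
        # Over-relaxation with theta = 1.
        x_bar = [2.0 * x_new[i] - x[i] for i in range(n)]
        x = x_new
    return x

def tv_objective(x, b, lam):
    """Value of 0.5 * ||x - b||^2 + lam * ||Dx||_1."""
    fidelity = 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
    tv = sum(abs(x[i + 1] - x[i]) for i in range(len(x) - 1))
    return fidelity + lam * tv
```

The step sizes here follow from a bound on the operator norm, not from a hand-tuned penalty parameter, which is the property the abstract contrasts with ADMM.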

[270] DreamPainter: Image Background Inpainting for E-commerce Scenarios

Sijie Zhao,Jing Cheng,Yaoyao Wu,Hao Xu,Shaohui Jiao

Main category: cs.CV

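TL;DR: This paper proposes DreamPainter, which combines text and reference image information to address consistency and control problems in image generation for e-commerce scenarios.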

Details Motivation: Existing inpainting methods suffer from poor product consistency and a lack of domain-specific data in e-commerce scenarios, and control through text prompts alone is limited. Method: The paper introduces DreamEcom-400K, a high-quality e-commerce dataset, and builds the DreamPainter inpainting framework on it. Result: Experiments show that the method significantly outperforms existing techniques while maintaining product consistency. Conclusion: DreamPainter effectively combines text prompt and reference image information and significantly outperforms existing methods while maintaining product consistency. Abstract: Although diffusion-based image generation has been widely explored and applied, background generation tasks in e-commerce scenarios still face significant challenges. The first challenge is to ensure that the generated products are consistent with the given product inputs while maintaining a reasonable spatial arrangement, harmonious shadows, and reflections between foreground products and backgrounds. Existing inpainting methods fail to address this due to the lack of domain-specific data. The second challenge involves the limitation of relying solely on text prompts for image control, as effectively integrating visual information to achieve precise control in inpainting tasks remains underexplored. To address these challenges, we introduce DreamEcom-400K, a high-quality e-commerce dataset containing accurate product instance masks, background reference images, text prompts, and aesthetically pleasing product images. Based on this dataset, we propose DreamPainter, a novel framework that not only utilizes text prompts for control but also flexibly incorporates reference image information as an additional control signal. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, maintaining high product consistency while effectively integrating both text prompt and reference image information.

[271] Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes

Tom Fischer,Xiaojie Zhang,Eddy Ilg

Main category: cs.CV

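TL;DR: The paper proposes a unified detection and pose estimation model for RGB images that integrates both tasks using neural mesh models and multi-model RANSAC, achieving state-of-the-art results on the REAL275 dataset.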

Details Motivation: Traditional category-level methods rely on RGB-D inputs or use two-stage approaches that handle detection and pose estimation separately, so a unified and efficient single-stage method is needed. Method: Neural mesh models with learned features are combined with multi-model RANSAC to integrate detection and pose estimation into a single framework. Result: On the REAL275 dataset, the method improves by 22.9% on average across all scale-agnostic metrics and is more robust than single-stage baselines. Conclusion: The study presents the first unified model for detection and pose estimation from RGB images and reaches state-of-the-art performance. Abstract: Recognizing objects in images is a fundamental problem in computer vision. Although detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results for RGB category-level pose estimation on REAL275, improving on the current state-of-the-art by 22.9% averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits greater robustness compared to single-stage baselines. Our code and models are available at https://github.com/Fischer-Tom/unified-detection-and-pose-estimation.

[272] After the Party: Navigating the Mapping From Color to Ambient Lighting

Florin-Alexandru Vasluianu,Tim Seizinger,Zongwei Wu,Radu Timofte

Main category: cs.CV

TL;DR: This paper introduces CL3AN, a large-scale dataset for image restoration under complex colored lighting, and proposes a novel framework inspired by the Retinex model to better separate illumination from reflectance, achieving improved results.

Details Motivation: The authors are motivated by the limitations of existing methods, which oversimplify illumination challenges by assuming single or uniform lighting, leaving complex effects unaddressed. Method: The paper proposes a novel learning framework inspired by the Retinex model, leveraging explicit chromaticity and luminance components guidance to disentangle illumination from reflectance. Result: Extensive evaluations show that leading approaches produce artifacts due to their inability to accurately separate illumination from reflectance. The proposed method outperforms these approaches, particularly under non-homogeneous lighting and material-specific variations. Conclusion: The paper concludes that the proposed CL3AN dataset and the new learning framework effectively address the challenges of complex illumination in practical scenarios, showing enhanced robustness and competitive computational costs. Abstract: Illumination in practical scenarios is inherently complex, involving colored light sources, occlusions, and diverse material interactions that produce intricate reflectance and shading effects. However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts. Through benchmarking, we find that leading approaches often produce artifacts, such as illumination inconsistencies, texture leakage, and color distortion, primarily due to their limited ability to precisely disentangle illumination from reflectance. 
Motivated by this insight, we achieve such a desired decomposition through a novel learning framework that leverages explicit chromaticity and luminance components guidance, drawing inspiration from the principles of the Retinex model. Extensive evaluations on existing benchmarks and our dataset demonstrate the effectiveness of our approach, showcasing enhanced robustness under non-homogeneous color lighting and material-specific reflectance variations, all while maintaining a highly competitive computational cost. The benchmark, codes, and models are available at www.github.com/fvasluianu97/RLN2.
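
The chromaticity and luminance components mentioned above can be illustrated per pixel: luminance captures overall brightness, while chromaticity is the brightness-normalised colour. The sketch below uses one common definition (mean of the channels, and channels divided by their sum). The paper's exact formulation is not given in the abstract, so the definition and the function name are assumptions.

```python
def split_luminance_chromaticity(rgb, eps=1e-8):
    """Split one RGB pixel into (luminance, chromaticity).

    luminance is the channel mean; chromaticity is each channel divided by
    the channel sum, so it is unchanged when the pixel is uniformly scaled.
    """
    r, g, b = rgb
    total = r + g + b
    luminance = total / 3.0
    denom = total if total > eps else eps  # guard against black pixels
    chromaticity = (r / denom, g / denom, b / denom)
    return luminance, chromaticity
```

Doubling the intensity of a pixel doubles its luminance and leaves its chromaticity unchanged, whereas a coloured light source shifts the chromaticity. This separation is what makes the two components useful as explicit guidance signals.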

[273] GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting

Lei Yao,Yi Wang,Yi Zhang,Moyun Liu,Lap-Pui Chau

Main category: cs.CV

TL;DR: This paper presents GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture that improves generalization capabilities and achieves superior performance with high parameter and data efficiency compared to state-of-the-art methods.

Details Motivation: The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. Method: GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. Result: GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Conclusion: GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP$_{50}$ on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. Abstract: The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. 
Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP$_{50}$ on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at https://rayyoh.github.io/GaussianCross/.

[274] Deep classification algorithm for De-identification of DICOM medical images

Bufano Michele,Kotter Elmar

Main category: cs.CV

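TL;DR: This paper presents a Python algorithm for de-identifying PII and PHI in DICOM files, based on HIPAA's safe harbor method, with customizable input parameters to suit different use cases.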

Details Motivation: Personally identifiable information (PII) and/or personal health identifying information (PHI) in DICOM files must be hidden or removed for legal reasons. In addition, under HIPAA and privacy rules, full-face photographic images and comparable images are also direct identifiers and must be de-identified. Method: An algorithm based on the safe harbor method defined by HIPAA is implemented; it uses customizable input parameters to classify and then possibly de-identify DICOM tags. Result: The most sensitive information, such as names, history, personal data, and institution, was successfully recognized. Conclusion: The authors developed a Python-based algorithm that classifies the information in DICOM files; its customizable input parameters provide flexibility for both everyday use and research purposes. The code is publicly available on GitHub. Abstract: Background: De-identification of DICOM (Digital Imaging and Communications in Medicine) files is an essential component of medical image research. Personal Identifiable Information (PII) and/or Personal Health Identifying Information (PHI) need to be hidden or removed due to legal reasons. According to the Health Insurance Portability and Accountability Act (HIPAA) and privacy rules, full-face photographic images and any comparable images are also direct identifiers and are considered protected health information that needs to be de-identified. Objective: The study aimed to implement a method that permits de-identification of the PII and PHI present in the header and burned into the pixel data of DICOM files. Methods: To execute the de-identification, we implemented an algorithm based on the safe harbor method, defined by HIPAA. Our algorithm uses customizable input parameters to classify and then possibly de-identify individual DICOM tags. Results: The most sensitive information, like names, history, personal data and institution, was successfully recognized. Conclusions: We developed a Python algorithm that is able to classify information present in a DICOM file. The flexibility provided by the use of customizable input parameters, which allow the user to customize the entire process depending on the case (e.g., the language), makes the entire program very promising for both everyday use and research purposes. Our code is available at https://github.com/rtdicomexplorer/deep_deidentification.

[275] Weakly Supervised Multimodal Temporal Forgery Localization via Multitask Learning

Wenbo Xu,Wei Lu,Xiangyang Luo

Main category: cs.CV

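TL;DR: This paper proposes a new weakly supervised multimodal temporal forgery localization method (WMMT) that addresses the WS-MTFL problem through multitask learning, achieving fine-grained Deepfake detection and temporal forgery localization using only video-level annotations.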

Details Motivation: The spread of Deepfake videos has caused a trust crisis and impaired social stability. Although many methods address Deepfake detection and localization, research on weakly supervised multimodal fine-grained temporal forgery localization remains insufficient. Method: The paper proposes a new weakly supervised multimodal temporal forgery localization method (WMMT) that addresses WS-MTFL through multitask learning. WMMT uses a Mixture-of-Experts structure to adaptively select appropriate features and localization heads, and introduces a feature enhancement module with a temporal property preserving attention mechanism. An extensible deviation perceiving loss is also designed to further exploit temporal information in weakly supervised learning. Result: Experiments demonstrate the effectiveness of multitask learning for WS-MTFL; WMMT achieves results comparable to fully supervised approaches on several evaluation metrics. Conclusion: WMMT provides an effective solution for weakly supervised multimodal fine-grained temporal forgery localization, with high flexibility and localization precision. Abstract: The spread of Deepfake videos has caused a trust crisis and impaired social stability. Although numerous approaches have been proposed to address the challenges of Deepfake detection and localization, there is still a lack of systematic research on the weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL). In this paper, we propose a novel weakly supervised multimodal temporal forgery localization via multitask learning (WMMT), which addresses the WS-MTFL under the multitask learning paradigm. WMMT achieves multimodal fine-grained Deepfake detection and temporal partial forgery localization using merely video-level annotations. Specifically, visual and audio modality detection are formulated as two binary classification tasks. The multitask learning paradigm is introduced to integrate these tasks into a multimodal task. Furthermore, WMMT utilizes a Mixture-of-Experts structure to adaptively select appropriate features and localization head, achieving excellent flexibility and localization precision in WS-MTFL. A feature enhancement module with temporal property preserving attention mechanism is proposed to identify the intra- and inter-modality feature deviation and construct comprehensive video features. To further explore the temporal information for weakly supervised learning, an extensible deviation perceiving loss has been proposed, which aims to enlarge the deviation of adjacent segments of the forged samples and reduce the deviation of genuine samples. 
Extensive experiments demonstrate the effectiveness of multitask learning for WS-MTFL, and the WMMT achieves comparable results to fully supervised approaches in several evaluation metrics.

[276] Test-Time Model Adaptation for Quantized Neural Networks

Zeshuai Deng,Guohao Chen,Shuaicheng Niu,Hui Luo,Shuhai Zhang,Yifan Yang,Renjie Chen,Wei Luo,Mingkui Tan

Main category: cs.CV

TL;DR: This paper proposes ZOA, a gradient-free test-time adaptation method for quantized models that efficiently adapts models using two forward passes and manages domain knowledge for long-term adaptation, significantly improving performance on ImageNet-C.

Details Motivation: Quantized models suffer from performance degradation in dynamic environments with domain shifts, and existing test-time adaptation (TTA) methods are impractical due to reliance on gradient backpropagation, which is unsupported on quantized models. Method: A continual zeroth-order adaptation (ZOA) framework is proposed, which enables model adaptation using only two forward passes, combined with a domain knowledge management scheme to store and reuse domain knowledge efficiently. Result: The ZOA framework achieves a 5.0% improvement over the state-of-the-art FOA on the ImageNet-C dataset using the quantized W6A6 ViT-B model, with efficient adaptation and negligible memory consumption. Conclusion: The proposed continual zeroth-order adaptation (ZOA) framework effectively improves the robustness and generalization ability of quantized models, outperforming existing methods like FOA on datasets such as ImageNet-C. Abstract: Quantizing deep models prior to deployment is a widely adopted technique to speed up inference for various real-time applications, such as autonomous driving. However, quantized models often suffer from severe performance degradation in dynamic environments with potential domain shifts and this degradation is significantly more pronounced compared with their full-precision counterparts, as shown by our theoretical and empirical illustrations. To address the domain shift problem, test-time adaptation (TTA) has emerged as an effective solution by enabling models to learn adaptively from test data. Unfortunately, existing TTA methods are often impractical for quantized models as they typically rely on gradient backpropagation--an operation that is unsupported on quantized models due to vanishing gradients, as well as memory and latency constraints. In this paper, we focus on TTA for quantized models to improve their robustness and generalization ability efficiently. 
We propose a continual zeroth-order adaptation (ZOA) framework that enables efficient model adaptation using only two forward passes, eliminating the computational burden of existing methods. Moreover, we propose a domain knowledge management scheme to store and reuse different domain knowledge with negligible memory consumption, reducing the interference of different domain knowledge and fostering the knowledge accumulation during long-term adaptation. Experimental results on three classical architectures, including quantized transformer-based and CNN-based models, demonstrate the superiority of our methods for quantized model adaptation. On the quantized W6A6 ViT-B model, our ZOA is able to achieve a 5.0% improvement over the state-of-the-art FOA on the ImageNet-C dataset. The source code is available at https://github.com/DengZeshuai/ZOA.
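To make the "two forward passes" idea concrete, here is a minimal sketch of a generic two-point zeroth-order update (SPSA-style) in pure Python. It is not the ZOA algorithm itself: the abstract does not specify the estimator, and the names `spsa_step` and `toy_loss`, the step size, and the perturbation scale are all assumptions made for illustration. The loss is evaluated at two perturbed parameter settings, and the difference drives the update without any backpropagation.

```python
import random


def spsa_step(loss_fn, params, lr=0.05, eps=1e-3, rng=None):
    """One generic two-point zeroth-order update (illustrative, not ZOA)."""
    rng = rng or random.Random(0)
    # Random +/-1 perturbation direction, one entry per parameter.
    delta = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + eps * d for p, d in zip(params, delta)]
    minus = [p - eps * d for p, d in zip(params, delta)]
    # The only two "forward passes": the loss at the two perturbed points.
    scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    # For +/-1 entries, 1/d == d, so the gradient estimate is scale * d.
    return [p - lr * scale * d for p, d in zip(params, delta)]


def toy_loss(w):
    # Hypothetical stand-in for a model's test-time objective.
    return sum((x - 1.0) ** 2 for x in w)


rng = random.Random(0)
w = [0.0, 0.0, 0.0]
start = toy_loss(w)
for _ in range(200):
    w = spsa_step(toy_loss, w, rng=rng)
final = toy_loss(w)
```

Because only loss values are needed, an update of this kind can in principle be applied where gradients are unavailable, which is the setting the abstract describes for quantized models.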

[277] Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training

Yanyun Wang,Li Liu

Main category: cs.CV

TL;DR: The paper identifies a new perspective on the adversarial training trade-off and proposes a novel method (RPAT) to improve adversarial robustness while maintaining clean accuracy, showing promising experimental results.

Details Motivation: The motivation is to address the inherent trade-off between clean accuracy and adversarial robustness in Adversarial Training, which is traditionally attributed to insufficient learning of hard adversarial samples. Method: The paper introduces a new Adversarial Training objective called Robust Perception, which encourages smooth changes in model perception with input perturbations. This method aims to improve adversarial robustness without sacrificing clean accuracy. Result: Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet using various network architectures demonstrate that the proposed RPAT method outperforms four common baselines and 12 state-of-the-art works in mitigating the accuracy-robustness trade-off. Conclusion: The paper proposes a new Adversarial Training method named Robust Perception Adversarial Training (RPAT), which effectively mitigates the accuracy-robustness trade-off in Deep Neural Networks. Abstract: Adversarial Training (AT) is one of the most effective methods to train robust Deep Neural Networks (DNNs). However, AT creates an inherent trade-off between clean accuracy and adversarial robustness, which is commonly attributed to the more complicated decision boundary caused by the insufficient learning of hard adversarial samples. In this work, we reveal a counterintuitive fact for the first time: From the perspective of perception consistency, hard adversarial samples that can still attack the robust model after AT are already learned better than those successfully defended. Thus, different from previous views, we argue that it is rather the over-sufficient learning of hard adversarial samples that degrades the decision boundary and contributes to the trade-off problem. 
Specifically, the excessive pursuit of perception consistency would force the model to view the perturbations as noise and ignore the information within them, which should have been utilized to induce a smoother perception transition towards the decision boundary to support its establishment to an appropriate location. In response, we define a new AT objective named Robust Perception, encouraging the model perception to change smoothly with input perturbations, based on which we propose a novel Robust Perception Adversarial Training (RPAT) method, effectively mitigating the current accuracy-robustness trade-off. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-34-10 demonstrate the effectiveness of our method beyond four common baselines and 12 state-of-the-art (SOTA) works. The code is available at https://github.com/FlaAI/RPAT.

[278] CMIC: Content-Adaptive Mamba for Learned Image Compression

Yunuo Chen,Zezheng Lyu,Bing He,Hongwei Hu,Qi Wang,Yuan Tian,Li Song,Wenjun Zhang,Guo Lu

Main category: cs.CV

TL;DR: This paper introduces Content-Adaptive Mamba (CAM), a dynamic state-space model for learned image compression that improves global dependency capture through content-aware token reorganization and global priors, achieving state-of-the-art performance on image compression benchmarks.

Details Motivation: Vanilla Mamba-style state-space models are content-agnostic and rely on fixed, predefined selective scans, limiting their ability to dynamically exploit content dependencies. This necessitates a more adaptive approach to improve learned image compression. Method: CAM employs content-aware token reorganization and integrates global priors into the state-space model through a prompt dictionary. The method is applied in the Content-Adaptive Mamba-based LIC model (CMIC) for improved image compression performance. Result: CMIC achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by -15.91%, -21.34%, and -17.58% BD-rate on Kodak, Tecnick, and CLIC benchmarks, respectively. Conclusion: The proposed Content-Adaptive Mamba (CAM) addresses the limitations of vanilla Mamba by introducing content-aware token reorganization and integrating global priors into the state-space model, enabling better capture of global dependencies while preserving computational efficiency. Abstract: Recent Learned image compression (LIC) leverages Mamba-style state-space models (SSMs) for global receptive fields with linear complexity. However, vanilla Mamba is content-agnostic, relying on fixed and predefined selective scans, which restricts its ability to dynamically and fully exploit content dependencies. We introduce Content-Adaptive Mamba (CAM), a dynamic SSM that addresses two critical limitations. First, it employs content-aware token reorganization, clustering and reordering tokens based on content similarity to prioritize proximity in feature space over Euclidean space. Second, it integrates global priors into SSM via a prompt dictionary, effectively mitigating the strict causality and long-range decay in the token interactions of Mamba. These innovations enable CAM to better capture global dependencies while preserving computational efficiency. 
Leveraging CAM, our Content-Adaptive Mamba-based LIC model (CMIC) achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by -15.91%, -21.34%, and -17.58% BD-rate on Kodak, Tecnick, and CLIC benchmarks, respectively.

[279] Welcome New Doctor: Continual Learning with Expert Consultation and Autoregressive Inference for Whole Slide Image Analysis

Doanh Cao Bui,Jin Tae Kwak

Main category: cs.CV

TL;DR: The paper introduces COSFormer, a Transformer-based continual learning framework tailored for multi-task Whole Slide Image analysis, which efficiently adapts to new tasks without retraining on previous data, offering superior performance in clinical applications.

Details Motivation: Whole Slide Image (WSI) analysis is crucial for cancer diagnosis and prognosis, but the giga-sized nature of WSIs demands substantial storage and computational resources. With the rapid increase in WSIs used in clinics and hospitals, there is a growing need for a continual learning system that can efficiently process and adapt existing models to new tasks without retraining or fine-tuning on previous tasks. Method: COSFormer, a Transformer-based continual learning framework, is introduced for multi-task WSI analysis. It learns sequentially from new tasks without revisiting full historical datasets. Result: COSFormer was evaluated on a sequence of seven WSI datasets covering seven organs and six WSI-related tasks under both class-incremental and task-incremental settings. The results demonstrated its superior generalizability and effectiveness compared to existing continual learning frameworks. Conclusion: COSFormer is established as a robust solution for continual WSI analysis in clinical applications due to its superior generalizability and effectiveness compared to existing continual learning frameworks. Abstract: Whole Slide Image (WSI) analysis, with its ability to reveal detailed tissue structures in magnified views, plays a crucial role in cancer diagnosis and prognosis. Due to their giga-sized nature, WSIs require substantial storage and computational resources for processing and training predictive models. With the rapid increase in WSIs used in clinics and hospitals, there is a growing need for a continual learning system that can efficiently process and adapt existing models to new tasks without retraining or fine-tuning on previous tasks. Such a system must balance resource efficiency with high performance. In this study, we introduce COSFormer, a Transformer-based continual learning framework tailored for multi-task WSI analysis. 
COSFormer is designed to learn sequentially from new tasks while avoiding the need to revisit full historical datasets. We evaluate COSFormer on a sequence of seven WSI datasets covering seven organs and six WSI-related tasks under both class-incremental and task-incremental settings. The results demonstrate COSFormer's superior generalizability and effectiveness compared to existing continual learning frameworks, establishing it as a robust solution for continual WSI analysis in clinical applications.

[280] An Event-based Fast Intensity Reconstruction Scheme for UAV Real-time Perception

Xin Dong,Yiwei Zhang,Yangjie Cui,Jinwu Xiang,Daochun Li,Zhan Tu

Main category: cs.CV

TL;DR: This paper proposes an efficient event-based intensity reconstruction method called ESI, which enables real-time high-frame-rate image reconstruction with low computational load, making it ideal for UAV onboard visual tracking, especially under low illumination conditions.

Details Motivation: The motivation is to address the implementation challenges of utilizing event cameras by extracting and utilizing effective information from asynchronous event streams, thereby enabling the portability of conventional frame-based vision methods to event-based scenarios. Method: The proposed method is a streamlined event-based intensity reconstruction scheme called event-based single integration (ESI), which reconstructs intensity images by performing a single integration of the event streams combined with an enhanced decay algorithm. Result: ESI achieves real-time intensity reconstruction at a high frame rate (typically 100 FPS) with remarkable runtime efficiency improvements, superior reconstruction quality, and suitability for onboard implementation, such as in UAV-based visual tracking scenarios. Conclusion: ESI enhances UAV onboard perception significantly under visual adversary surroundings and demonstrates effective performance for UAV onboard visual tracking under extremely low illumination conditions, where other algorithms fail. Abstract: Event cameras offer significant advantages, including a wide dynamic range, high temporal resolution, and immunity to motion blur, making them highly promising for addressing challenging visual conditions. Extracting and utilizing effective information from asynchronous event streams is essential for the onboard implementation of event cameras. In this paper, we propose a streamlined event-based intensity reconstruction scheme, event-based single integration (ESI), to address such implementation challenges. This method guarantees the portability of conventional frame-based vision methods to event-based scenarios and maintains the intrinsic advantages of event cameras. The ESI approach reconstructs intensity images by performing a single integration of the event streams combined with an enhanced decay algorithm. 
Such a method enables real-time intensity reconstruction at a high frame rate, typically 100 FPS. Furthermore, the relatively low computation load of ESI fits onboard implementation suitably, such as in UAV-based visual tracking scenarios. Extensive experiments have been conducted to evaluate the performance comparison of ESI and state-of-the-art algorithms. Compared to state-of-the-art algorithms, ESI demonstrates remarkable runtime efficiency improvements, superior reconstruction quality, and a high frame rate. As a result, ESI enhances UAV onboard perception significantly under visual adversary surroundings. In in-flight tests, ESI demonstrates effective performance for UAV onboard visual tracking under extremely low illumination conditions (2-10 lux), whereas other comparative algorithms fail due to insufficient frame rate, poor image quality, or limited real-time performance.
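The abstract describes ESI as a single integration of the event stream combined with a decay step, but gives no formulas. The sketch below therefore shows only a generic leaky per-pixel integrator, a common baseline for event-based intensity reconstruction, and should not be read as the ESI algorithm. The function name `integrate_events`, the event tuple layout `(t, x, y, polarity)`, and the `contrast` and `tau` values are all hypothetical.

```python
import math


def integrate_events(events, width, height, contrast=0.1, tau=0.05):
    """Leaky per-pixel integration of events (generic baseline, not ESI).

    events: iterable of (t_seconds, x, y, polarity) with polarity in {+1, -1},
    assumed to be sorted by time.
    """
    state = [[0.0] * width for _ in range(height)]
    last_t = [[0.0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        # Exponentially decay the old value over the time since the last event.
        decay = math.exp(-(t - last_t[y][x]) / tau)
        # Single integration: add one contrast step in the event's direction.
        state[y][x] = state[y][x] * decay + contrast * polarity
        last_t[y][x] = t
    return state


events = [(0.0, 0, 0, 1), (0.0, 1, 0, -1), (0.05, 0, 0, 1)]
frame = integrate_events(events, width=2, height=1)
```

Each event costs a constant amount of work per pixel, which is why integrators of this family suit high-frame-rate onboard use; the decay term keeps accumulated noise from persisting indefinitely.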

[281] Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor

Xiaoliu Guan,Lielin Jiang,Hanqi Chen,Xu Zhang,Jiaxing Yan,Guanzhong Wang,Yi Liu,Zetao Zhang,Yu Wu

Main category: cs.CV

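TL;DR: This paper proposes a new dynamic caching mechanism based on Taylor expansion that reduces the number of cached features and dynamically chooses between prediction and full computation according to prediction error, achieving efficient acceleration in visual generation tasks.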

Details Motivation: Existing acceleration methods based on caching and reusing features incur large memory and computation overhead, and fixed caching schedules can degrade generation quality, so a more efficient acceleration method is needed. Method: The Taylor prediction target is shifted from the module level to the last Transformer block level, and the prediction error of the first block is used as an indicator of prediction reliability, enabling a dynamic caching mechanism. Result: The method achieves 3.17x, 2.36x, and 4.14x acceleration on FLUX, DiT, and Wan Video, respectively, with negligible quality drop. Conclusion: The paper proposes a new Taylor-expansion-based inference acceleration method that, through a dynamic caching mechanism and an adjusted prediction target, significantly improves inference speed while preserving generation quality. Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable performance in visual generation tasks. However, their low inference speed limits their deployment in low-resource applications. Recent training-free approaches exploit the redundancy of features across timesteps by caching and reusing past representations to accelerate inference. Building on this idea, TaylorSeer instead uses cached features to predict future ones via Taylor expansion. However, its module-level prediction across all transformer blocks (e.g., attention or feedforward modules) requires storing fine-grained intermediate features, leading to notable memory and computation overhead. Moreover, it adopts a fixed caching schedule without considering the varying accuracy of predictions across timesteps, which can lead to degraded outputs when prediction fails. To address these limitations, we propose a novel approach to better leverage Taylor-based acceleration. First, we shift the Taylor prediction target from the module level to the last block level, significantly reducing the number of cached features. Furthermore, observing strong sequential dependencies among Transformer blocks, we propose to use the error between the Taylor-estimated and actual outputs of the first block as an indicator of prediction reliability. If the error is small, we trust the Taylor prediction for the last block; otherwise, we fall back to full computation, thereby enabling a dynamic caching mechanism. Empirical results show that our method achieves a better balance between speed and quality, achieving a 3.17x acceleration on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible quality drop. 
The project page is at https://cg-taylor-acce.github.io/CG-Taylor/.
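The gating rule in the abstract (trust the Taylor prediction for the last block only when the first block's Taylor estimate is accurate) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: features are plain lists of floats, the expansion is first order with unit timestep spacing, and the names `taylor_forecast`, `relative_error`, `gated_step`, and the tolerance `tol` are hypothetical.

```python
def taylor_forecast(history):
    """First-order Taylor extrapolation from the last two cached values."""
    prev, last = history[-2], history[-1]
    return [b + (b - a) for a, b in zip(prev, last)]


def relative_error(pred, actual):
    num = sum((p - a) ** 2 for p, a in zip(pred, actual)) ** 0.5
    den = sum(a ** 2 for a in actual) ** 0.5 + 1e-12
    return num / den


def gated_step(first_hist, last_hist, run_first_block, run_remaining_blocks, x, tol=0.05):
    """Return (output, used_cache) for one timestep."""
    first_actual = run_first_block(x)  # the first block is always computed
    err = relative_error(taylor_forecast(first_hist), first_actual)
    first_hist.append(first_actual)
    if err < tol:
        # Forecast was reliable: skip the remaining blocks entirely.
        out, used_cache = taylor_forecast(last_hist), True
    else:
        # Forecast was unreliable: fall back to full computation.
        out, used_cache = run_remaining_blocks(first_actual), False
    last_hist.append(out)  # assumption: forecasts are also kept in the history
    return out, used_cache


# Toy "blocks" standing in for a transformer (purely illustrative).
first_block = lambda x: [2.0 * v for v in x]
rest_blocks = lambda h: [v + 1.0 for v in h]

first_hist = [[0.0], [2.0]]  # first-block outputs at two earlier timesteps
last_hist = [[1.0], [3.0]]   # last-block outputs at the same timesteps
out1, used1 = gated_step(first_hist, last_hist, first_block, rest_blocks, [2.0])
out2, used2 = gated_step(first_hist, last_hist, first_block, rest_blocks, [10.0])
```

In the toy run, the first call follows the existing trend, so the forecast is trusted; the second input jumps away from the trend, so the step falls back to full computation.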

[282] I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking

Ziyan Liu,Junwen Li,Kaiwen Li,Tong Ruan,Chao Wang,Xinyan He,Zongyu Wang,Xuezhi Cao,Jingping Liu

Main category: cs.CV

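TL;DR: The paper proposes a new LLM-based multimodal entity linking framework, Intra- and Inter-modal Collaborative Reflections, which prioritizes text information and uses a multi-round iterative strategy to integrate key visual clues, improving matching accuracy and outperforming current state-of-the-art methods on three public datasets.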

Details Motivation: To address two problems of existing LLM-based methods for multimodal entity linking: unnecessary incorporation of image data and reliance on a one-time extraction of visual features. Method: A new LLM-based framework is proposed that prioritizes text information and, when text alone is insufficient to link the correct entity, employs a multi-round iterative strategy to integrate key visual clues from the image. Result: Extensive experiments on three widely used public datasets show that the framework consistently outperforms current state-of-the-art methods, with improvements of 3.2%, 5.1%, and 1.6%, respectively. Conclusion: The proposed framework effectively addresses these problems in multimodal entity linking, improving effectiveness and accuracy. Abstract: Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances in large language model-based methods have become the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges, including unnecessary incorporation of image data in certain scenarios and the reliance only on a one-time extraction of visual features, which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections. This framework prioritizes leveraging text information to address the task. When text alone is insufficient to link the correct entity through intra- and inter-modality evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and enhance matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods in the task, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.

[283] Semi-Supervised Semantic Segmentation via Derivative Label Propagation

Yuanbin Fu,Xiaojie Guo

Main category: cs.CV

TL;DR: This paper introduces DerProp, a semi-supervised semantic segmentation framework that improves pseudo-label reliability through derivative label propagation, enhancing performance by generating strictly regularized similarity metrics.

Details Motivation: The motivation is to reduce the annotation burden in semantic segmentation by improving the reliability of pseudo-labels used in semi-supervised frameworks. Method: A semi-supervised framework named DerProp is introduced, which utilizes a novel derivative label propagation technique. This method applies discrete derivative operations to pixel-wise feature vectors to generate strictly regularized similarity metrics, aiming to rectify imperfect pseudo-labels. Result: Extensive experiments demonstrate the effectiveness of the DerProp framework, showing superiority over other methods in improving the reliability of pseudo-labels and achieving promising results. Conclusion: The proposed DerProp framework improves the reliability of pseudo-labels in semi-supervised semantic segmentation by leveraging derivative label propagation, which imposes discrete derivative operations on pixel-wise feature vectors as additional regularization. Abstract: Semi-supervised semantic segmentation, which leverages a limited set of labeled images, helps to relieve the heavy annotation burden. While pseudo-labeling strategies yield promising results, there is still room for enhancing the reliability of pseudo-labels. Hence, we develop a semi-supervised framework, namely DerProp, equipped with a novel derivative label propagation to rectify imperfect pseudo-labels. Our label propagation method imposes discrete derivative operations on pixel-wise feature vectors as additional regularization, thereby generating strictly regularized similarity metrics. Doing so effectively alleviates the ill-posed problem that identical similarities correspond to different features, through constraining the solution space. Extensive experiments are conducted to verify the rationality of our design, and demonstrate our superiority over other methods. Codes are available at https://github.com/ForawardStar/DerProp/.

[284] Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning

Wenchuan Zhang,Jingru Guo,Hengzhe Zhang,Penghao Zhang,Jie Chen,Shuwan Zhang,Zhang Zhang,Yuhao Yi,Hong Bu

Main category: cs.CV

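TL;DR: Patho-AgenticRAG is a multimodal RAG framework built on page-level embeddings from authoritative pathology textbooks; by supporting joint text-image retrieval, it performs strongly on complex pathology diagnosis tasks.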

Details Motivation: Vision language models (VLMs) in pathology are prone to hallucinations because of pathology's ultra-high resolution, complex tissue structures, and nuanced clinical semantics. Traditional RAG methods rely mainly on text-based knowledge bases, limiting their ability to use diagnostic visual cues. Method: The Patho-AgenticRAG framework uses a database built on page-level embeddings from authoritative pathology textbooks and supports joint text-image search, reasoning, task decomposition, and multi-turn search interactions. Result: Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models on several complex pathology tasks, such as multiple-choice diagnosis and visual question answering. Conclusion: Patho-AgenticRAG is an improved multimodal RAG framework that, by incorporating page-level embeddings from authoritative pathology textbooks and supporting joint text-image retrieval, significantly outperforms existing multimodal models on complex pathology diagnosis tasks. Abstract: Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: https://github.com/Wenchuan-Zhang/Patho-AgenticRAG.

[285] SplatSSC: Decoupled Depth-Guided Gaussian Splatting for Semantic Scene Completion

Rui Qian,Haozhi Cao,Tianchen Deng,Shenghai Yuan,Lihua Xie

Main category: cs.CV

TL;DR: This paper introduces SplatSSC, an improved framework for 3D semantic scene completion that enhances initialization and noise reduction, leading to better performance and efficiency.

Details Motivation: The authors aim to address the inefficiency and artifact issues in current object-centric methods for 3D semantic scene completion that rely on random initialization of Gaussian primitives. Method: The paper proposes SplatSSC, which uses a Group-wise Multi-scale Fusion module for depth-guided initialization of Gaussian primitives and a Decoupled Gaussian Aggregator to reduce noise from outliers, along with a Probability Scale Loss to optimize performance. Result: The method achieves state-of-the-art results on the Occ-ScanNet dataset, with over 6.3% improvement in IoU and 4.1% in mIoU, while also reducing latency and memory usage by more than 9.3%. Conclusion: SplatSSC improves the efficiency and accuracy of 3D semantic scene completion by using a depth-guided initialization strategy and a principled Gaussian aggregator, outperforming previous methods in performance metrics while reducing resource consumption. Abstract: Monocular 3D Semantic Scene Completion (SSC) is a challenging yet promising task that aims to infer dense geometric and semantic descriptions of a scene from a single image. While recent object-centric paradigms significantly improve efficiency by leveraging flexible 3D Gaussian primitives, they still rely heavily on a large number of randomly initialized primitives, which inevitably leads to 1) inefficient primitive initialization and 2) outlier primitives that introduce erroneous artifacts. In this paper, we propose SplatSSC, a novel framework that resolves these limitations with a depth-guided initialization strategy and a principled Gaussian aggregator. Instead of random initialization, SplatSSC utilizes a dedicated depth branch composed of a Group-wise Multi-scale Fusion (GMF) module, which integrates multi-scale image and depth features to generate a sparse yet representative set of initial Gaussian primitives. 
To mitigate noise from outlier primitives, we develop the Decoupled Gaussian Aggregator (DGA), which enhances robustness by decomposing geometric and semantic predictions during the Gaussian-to-voxel splatting process. Complemented with a specialized Probability Scale Loss, our method achieves state-of-the-art performance on the Occ-ScanNet dataset, outperforming prior approaches by over 6.3% in IoU and 4.1% in mIoU, while reducing both latency and memory consumption by more than 9.3%. The code will be released upon acceptance.
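The IoU and mIoU figures quoted above can be illustrated with a minimal sketch. It assumes the common semantic scene completion convention, which the abstract does not spell out: IoU scores occupied-versus-empty voxels, and mIoU averages per-class IoU over the non-empty semantic classes. The function names and the `empty_label=0` choice are hypothetical.

```python
import numpy as np

def completion_iou(pred, target, empty_label=0):
    """IoU of occupied vs. empty voxels (assumed convention, not taken from the paper)."""
    pred_occ = np.asarray(pred) != empty_label
    tgt_occ = np.asarray(target) != empty_label
    union = np.logical_or(pred_occ, tgt_occ).sum()
    inter = np.logical_and(pred_occ, tgt_occ).sum()
    return float(inter / union) if union > 0 else float("nan")

def mean_iou(pred, target, num_classes, empty_label=0):
    """Average per-class IoU over semantic classes, skipping the empty class."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        if c == empty_label:
            continue
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # ignore classes absent from both prediction and target
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")
```

Both functions accept label arrays of any shape, so a flattened or a 3D voxel grid works equally well.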

[286] Semi-Supervised Dual-Threshold Contrastive Learning for Ultrasound Image Classification and Segmentation

Peng Zhang,Zhihui Lai,Heng Kong

Main category: cs.CV

TL;DR: Hermes improves semi-supervised ultrasound image classification and segmentation by addressing pseudo-label inaccuracies and integrating inter-task consistency.

Details Motivation: Confidence-based pseudo-label selection leads to incorrect predictions and overfitting, while segmentation and classification tasks are not fully integrated. Method: Hermes combines contrastive learning with semi-supervised learning, using pseudo-labels and an inter-task attention and consistency strategy. Result: Hermes outperforms state-of-the-art methods on multiple ultrasound datasets under various semi-supervised settings. Conclusion: The proposed Hermes strategy effectively improves semi-supervised contrastive learning performance by addressing inaccurate pseudo-labels and enhancing inter-task consistency. Abstract: Confidence-based pseudo-label selection usually generates overly confident yet incorrect predictions, because the model is misleading early in training and overfits to inaccurate pseudo-labels during learning, which heavily degrades the performance of semi-supervised contrastive learning. Moreover, segmentation and classification tasks are treated independently, and the affinity between them is not fully explored. To address these issues, we propose a novel semi-supervised dual-threshold contrastive learning strategy for ultrasound image classification and segmentation, named Hermes. This strategy combines the strengths of contrastive learning with semi-supervised learning, where the pseudo-labels assist contrastive learning by providing additional guidance. Specifically, an inter-task attention and saliency module is developed to facilitate information sharing between the segmentation and classification tasks. Furthermore, an inter-task consistency learning strategy is designed to align tumor features across both tasks, avoiding negative transfer and reducing feature discrepancy. To address the lack of publicly available ultrasound datasets, we have collected the SZ-TUS dataset, a thyroid ultrasound image dataset. 
Extensive experiments on two public ultrasound datasets and one private dataset demonstrate that Hermes consistently outperforms several state-of-the-art methods across various semi-supervised settings.

[287] SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching

Xiangzeng Liu,Chi Wang,Guanglu Shi,Xiaodong Zhang,Qiguang Miao,Miao Fan

Main category: cs.CV

TL;DR: This paper proposes SGAD, a new area matching method that generates highly discriminative area descriptors for efficient and accurate area matching, and introduces a new supervision strategy and a filtering method that significantly improve performance.

Details Motivation: Local feature matching remains challenging in computer vision. Existing methods based on Area to Point Matching (A2PM) improve matching accuracy but rely on inefficient pixel-level comparisons and complex graph matching, which limits scalability. Method: The paper introduces the Semantic and Geometric-aware Descriptor Network (SGAD), which generates highly discriminative area descriptors that enable direct matching. It also introduces a new supervision strategy that decomposes area matching into classification and ranking subtasks, and proposes the Hierarchical Containment Redundancy Filter (HCRF) to eliminate overlapping areas. Result: SGAD delivers significant performance gains. In outdoor pose estimation, SGAD+LoFTR runs faster than DKM while achieving higher accuracy; in indoor pose estimation, SGAD+ROMA improves AUC@5° by +7.39%, setting a new state of the art. Conclusion: By generating highly discriminative area descriptors and using a new supervision strategy, SGAD achieves efficient and accurate area matching and offers a new solution for local feature matching. Abstract: Local feature matching remains a fundamental challenge in computer vision. Recent Area to Point Matching (A2PM) methods have improved matching accuracy. However, existing research based on this framework relies on inefficient pixel-level comparisons and complex graph matching that limit scalability. In this work, we introduce the Semantic and Geometric-aware Descriptor Network (SGAD), which fundamentally rethinks area-based matching by generating highly discriminative area descriptors that enable direct matching without complex graph optimization. This approach significantly improves both accuracy and efficiency of area matching. We further improve the performance of area matching through a novel supervision strategy that decomposes the area matching task into classification and ranking subtasks. Finally, we introduce the Hierarchical Containment Redundancy Filter (HCRF) to eliminate overlapping areas by analyzing containment graphs. SGAD demonstrates remarkable performance gains, reducing runtime by 60x (0.82s vs. 60.23s) compared to MESA. Extensive evaluations show consistent improvements across multiple point matchers: SGAD+LoFTR reduces runtime compared to DKM, while achieving higher accuracy (0.82s vs. 1.51s, 65.98 vs. 61.11) in outdoor pose estimation, and SGAD+ROMA delivers +7.39% AUC@5° in indoor pose estimation, establishing a new state-of-the-art.
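The idea of removing overlapping areas through containment can be sketched as follows. This is not the authors' HCRF: the abstract gives no algorithmic detail, so the axis-aligned box representation, the `threshold=0.9` value, and all function names are assumptions chosen only to illustrate containment-based redundancy filtering.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def containment(inner, outer):
    """Fraction of `inner` that is covered by `outer`."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    a = area(inner)
    return area((ix1, iy1, ix2, iy2)) / a if a > 0 else 0.0

def filter_contained(boxes, threshold=0.9):
    """Return indices of boxes that are not (mostly) contained in a larger box.

    Ties between equal-area boxes are broken by index so one copy of a duplicate survives.
    """
    keep = []
    for i, bi in enumerate(boxes):
        contained = any(
            j != i
            and containment(bi, bj) >= threshold
            and (area(bj) > area(bi) or (area(bj) == area(bi) and j < i))
            for j, bj in enumerate(boxes)
        )
        if not contained:
            keep.append(i)
    return keep
```

Each "contained in" relation here corresponds to an edge in a containment graph; the sketch keeps only the nodes with no incoming edge.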

[288] Do Edges Matter? Investigating Edge-Enhanced Pre-Training for Medical Image Segmentation

Paul Zaha,Lars Böcking,Simeon Allmendinger,Leopold Müller,Niklas Kühl

Main category: cs.CV

TL;DR: This paper investigates how edge-enhanced pre-training affects medical image segmentation across multiple modalities and proposes a meta-learning strategy to optimize model selection for better performance.

Details Motivation: Edge information is vital for object boundary detection in medical images, yet its role in pre-training foundation models has not been systematically studied. This research aims to fill that gap and improve segmentation performance across multiple modalities. Method: Two versions of a foundation model were trained on either raw or edge-enhanced data using edge kernels like Kirsch, followed by fine-tuning on specific modalities. A meta-learning strategy based on image entropy and standard deviation was proposed to select the optimal pre-training method. Result: Edge-focused pre-training showed mixed results: some modalities saw improved performance, while others experienced a reduction. The proposed meta-learning strategy improved overall segmentation performance by 16.42% compared to edge-enhanced pre-training alone and by 19.30% compared to raw data pre-training alone. Conclusion: The study concludes that edge-focused pre-training can have varying effects on segmentation performance across different medical imaging modalities, and the proposed meta-learning strategy helps selectively apply the most suitable pre-training approach, improving overall performance. Abstract: Medical image segmentation is crucial for disease diagnosis and treatment planning, yet developing robust segmentation models often requires substantial computational resources and large datasets. Existing research shows that pre-trained and finetuned foundation models can boost segmentation performance. However, questions remain about how particular image preprocessing steps may influence segmentation performance across different medical imaging modalities. In particular, edges (abrupt transitions in pixel intensity) are widely acknowledged as vital cues for object boundaries but have not been systematically examined in the pre-training of foundation models. 
We address this gap by investigating to what extent pre-training with data processed using computationally efficient edge kernels, such as Kirsch, can improve cross-modality segmentation capabilities of a foundation model. Two versions of a foundation model are first trained on either raw or edge-enhanced data across multiple medical imaging modalities, then finetuned on selected raw subsets tailored to specific medical modalities. After systematic investigation using the medical domains Dermoscopy, Fundus, Mammography, Microscopy, OCT, US, and XRay, we discover both increased and reduced segmentation performance across modalities using edge-focused pre-training, indicating the need for a selective application of this approach. To guide such selective applications, we propose a meta-learning strategy. It uses standard deviation and image entropy of the raw image to choose between a model pre-trained on edge-enhanced or on raw data for optimal performance. Our experiments show that integrating this meta-learning layer yields an overall segmentation performance improvement across diverse medical imaging tasks by 16.42% compared to models pre-trained on edge-enhanced data only and 19.30% compared to models pre-trained on raw data only.
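As a rough illustration of the two ingredients named in this abstract, the sketch below applies the eight Kirsch compass kernels to a grayscale image and computes the standard deviation and Shannon entropy that a selector could consume. It assumes 8-bit intensities in [0, 255] and a 256-bin histogram; the function names are hypothetical, and the paper's actual preprocessing and meta-learner are not specified in the abstract.

```python
import numpy as np

# Base (north) Kirsch kernel; the other seven are rotations of its outer ring.
KIRSCH_BASE = np.array([[5, 5, 5],
                        [-3, 0, -3],
                        [-3, -3, -3]], dtype=np.float64)

def kirsch_kernels():
    """Return the 8 compass kernels by rotating the outer ring of the base kernel."""
    ring = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    values = [KIRSCH_BASE[r, c] for r, c in ring]
    kernels = []
    for shift in range(8):
        k = np.zeros((3, 3))
        for i, (r, c) in enumerate(ring):
            k[r, c] = values[(i - shift) % 8]
        kernels.append(k)
    return kernels

def kirsch_edges(image):
    """Maximum response over the 8 Kirsch kernels, with edge-replicated borders."""
    img = np.asarray(image, dtype=np.float64)
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    responses = []
    for k in kirsch_kernels():
        out = np.zeros((h, w))
        for dr in range(3):
            for dc in range(3):
                out += k[dr, dc] * padded[dr:dr + h, dc:dc + w]
        responses.append(out)
    return np.max(responses, axis=0)

def image_stats(image, bins=256):
    """Return (standard deviation, Shannon entropy in bits) of an 8-bit image."""
    img = np.asarray(image, dtype=np.float64)
    hist, _ = np.histogram(img, bins=bins, range=(0, 255))
    p = hist[hist > 0] / hist.sum()
    entropy = float(-(p * np.log2(p)).sum())
    return float(img.std()), entropy
```

Because every Kirsch kernel sums to zero, flat regions produce a zero response and only intensity transitions stand out. How the two statistics map to a choice between the raw and edge-enhanced models is left open here, since the abstract does not state the decision rule.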

[289] Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection

Jae-Young Kang,Hoonhee Cho,Kuk-Jin Yoon

Main category: cs.CV

TL;DR: This paper proposes a 3D object detection method that relies solely on event cameras, addressing the perception gaps of conventional sensors in high-speed scenarios.

Details Motivation: The fixed frame rates of conventional sensors create perception gaps in high-speed scenarios, whereas event cameras, with their asynchronous nature and high temporal resolution, can address this problem. Method: A dual filter mechanism extracts semantic and geometric information, and regression is enhanced by aligning bounding boxes with object-centric information. Result: Experiments show that the method outperforms prior approaches in dynamic environments without requiring conventional 3D sensors. Conclusion: The paper proposes an event-camera-based stereo 3D object detection framework and demonstrates the potential of event cameras for robust continuous-time 3D perception in high-speed scenarios. Abstract: 3D object detection is essential for autonomous systems, enabling precise localization and dimension estimation. While LiDAR and RGB cameras are widely used, their fixed frame rates create perception gaps in high-speed scenarios. Event cameras, with their asynchronous nature and high temporal resolution, offer a solution by capturing motion continuously. The recent approach, which integrates event cameras with conventional sensors for continuous-time detection, struggles in fast-motion scenarios due to its dependency on synchronized sensors. We propose a novel stereo 3D object detection framework that relies solely on event cameras, eliminating the need for conventional 3D sensors. To compensate for the lack of semantic and geometric information in event data, we introduce a dual filter mechanism that extracts both. Additionally, we enhance regression by aligning bounding boxes with object-centric information. Experiments show that our method outperforms prior approaches in dynamic environments, demonstrating the potential of event cameras for robust, continuous-time 3D perception. The code is available at https://github.com/mickeykang16/Ev-Stereo3D.

[290] Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning

Muhammad Aqeel,Shakiba Sharifi,Marco Cristani,Francesco Setti

Main category: cs.CV

TL;DR: This paper proposes Confident Meta-learning (CoMet), a training strategy that removes the need for manual data curation in anomaly detection. It combines Soft Confident Learning and Meta-Learning, applies to any model trainable via gradient descent, and achieves state-of-the-art results on multiple datasets.

Details Motivation: So-called unsupervised anomaly detection is in practice semi-supervised: it assumes all training data are nominal, an assumption that requires manual data curation, introduces bias, and limits adaptability. Method: The paper proposes Confident Meta-learning (CoMet), a novel training strategy that combines Soft Confident Learning, which assigns lower weights to low-confidence samples, with Meta-Learning, which stabilizes training by regularizing updates based on the covariance of training and validation losses. Result: Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models show that CoMet consistently improves over the baselines, is robust to anomalies in the training set, and sets a new state of the art on all datasets. Conclusion: CoMet is a model-agnostic training strategy applicable to any anomaly detection method trainable via gradient descent. It enables effective training on uncurated datasets that contain anomalous samples, removing the need for manual data filtering. Abstract: So-called unsupervised anomaly detection is better described as semi-supervised, as it assumes all training data are nominal. This assumption simplifies training but requires manual data curation, introducing bias and limiting adaptability. We propose Confident Meta-learning (CoMet), a novel training strategy that enables deep anomaly detection models to learn from uncurated datasets where nominal and anomalous samples coexist, eliminating the need for explicit filtering. Our approach integrates Soft Confident Learning, which assigns lower weights to low-confidence samples, and Meta-Learning, which stabilizes training by regularizing updates based on training validation loss covariance. This prevents overfitting and enhances robustness to noisy data. CoMet is model-agnostic and can be applied to any anomaly detection method trainable via gradient descent. Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models demonstrate the effectiveness of our approach, consistently improving over the baseline methods, remaining insensitive to anomalies in the training set, and setting a new state-of-the-art across all datasets.

[291] Whole-body Representation Learning For Competing Preclinical Disease Risk Assessment

Dmitrii Seletkov,Sophie Starck,Ayhan Can Erdur,Yundi Zhang,Daniel Rueckert,Rickmer Braren

Main category: cs.CV

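TL;DR: The paper proposes a whole-body self-supervised representation learning method for preclinical disease risk assessment that outperforms conventional whole-body radiomics and performs well in predicting cardiovascular disease subgroups.

Details Motivation: Moving public healthcare from reactive treatment to proactive identification and prevention requires reliable preclinical disease risk assessment. Existing image-based risk prediction algorithms usually consider one condition at a time and depend on hand-crafted features obtained through segmentation tools. Method: A whole-body self-supervised representation learning method for preclinical disease risk modeling, combined with cardiac MRI in a simulated preclinical screening scenario for multi-disease risk prediction. Result: The method achieves superior preclinical risk prediction for multiple diseases, including cardiovascular disease, type 2 diabetes, chronic obstructive pulmonary disease, and chronic kidney disease. Combining it with cardiac MRI further improves prediction for cardiovascular disease subgroups such as ischemic heart disease, hypertensive diseases, and stroke. Conclusion: Whole-body representation learning shows high translational potential for preclinical disease risk assessment, both as a standalone screening modality and as part of a multi-modal clinical workflow for early personalized risk stratification. Abstract: Reliable preclinical disease risk assessment is essential to move public healthcare from reactive treatment to proactive identification and prevention. However, image-based risk prediction algorithms often consider one condition at a time and depend on hand-crafted features obtained through segmentation tools. We propose a whole-body self-supervised representation learning method for the preclinical disease risk assessment under a competing risk modeling. This approach outperforms whole-body radiomics in multiple diseases, including cardiovascular disease (CVD), type 2 diabetes (T2D), chronic obstructive pulmonary disease (COPD), and chronic kidney disease (CKD). Simulating a preclinical screening scenario and subsequently combining with cardiac MRI, it sharpens further the prediction for CVD subgroups: ischemic heart disease (IHD), hypertensive diseases (HD), and stroke. The results indicate the translational potential of whole-body representations as a standalone screening modality and as part of a multi-modal framework within clinical workflows for early personalized risk stratification. The code is available at https://github.com/yayapa/WBRLforCR/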

[292] Is Uncertainty Quantification a Viable Alternative to Learned Deferral?

Anna M. Wundram,Christian F. Baumgartner

Main category: cs.CV

TL;DR: This paper explores the use of uncertainty quantification as a safer, more robust method for AI to defer decisions to human experts, particularly under out-of-distribution conditions in glaucoma diagnosis from fundus images.

Details Motivation: AI models need to safely defer decisions to human experts when likely to misclassify. Learned deferral models face challenges during clinical translation due to data shifts, while uncertainty quantification methods may offer a more robust alternative. Method: An extensive evaluation study on a large ophthalmology dataset comparing learned deferral models and uncertainty quantification methods in their ability to classify glaucoma from fundus images and defer high-error cases. Result: Uncertainty quantification methods show promise for AI deferral, demonstrating better robustness to out-of-distribution inputs compared to learned deferral models. Conclusion: Uncertainty quantification methods are more robust to out-of-distribution input than learned deferral models and may be a promising choice for AI deferral in clinical settings. Abstract: Artificial Intelligence (AI) holds the potential to dramatically improve patient care. However, it is not infallible, necessitating human-AI-collaboration to ensure safe implementation. One aspect of AI safety is the models' ability to defer decisions to a human expert when they are likely to misclassify autonomously. Recent research has focused on methods that learn to defer by optimising a surrogate loss function that finds the optimal trade-off between predicting a class label or deferring. However, during clinical translation, models often face challenges such as data shift. Uncertainty quantification methods aim to estimate a model's confidence in its predictions. However, they may also be used as a deferral strategy which does not rely on learning from specific training distribution. We hypothesise that models developed to quantify uncertainty are more robust to out-of-distribution (OOD) input than learned deferral models that have been trained in a supervised fashion. 
To investigate this hypothesis, we constructed an extensive evaluation study on a large ophthalmology dataset, examining both learned deferral models and established uncertainty quantification methods, assessing their performance in- and out-of-distribution. Specifically, we evaluate their ability to accurately classify glaucoma from fundus images while deferring cases with a high likelihood of error. We find that uncertainty quantification methods may be a promising choice for AI deferral.
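Using uncertainty as a deferral strategy can be sketched in a few lines. The example assumes softmax class probabilities as input and predictive entropy as the uncertainty score, which is one common choice and not necessarily among the methods the study evaluates. The function names and the `defer_fraction` parameter are hypothetical.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy (in nats) of each row of class probabilities."""
    p = np.clip(np.asarray(probs, dtype=np.float64), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def defer_by_uncertainty(probs, defer_fraction=0.2):
    """Return (predictions, deferred_mask), deferring the most uncertain cases.

    The deferred cases would be handed to a human expert; the rest keep the
    model's argmax prediction.
    """
    probs = np.asarray(probs, dtype=np.float64)
    entropy = predictive_entropy(probs)
    n_defer = int(round(defer_fraction * len(probs)))
    deferred = np.zeros(len(probs), dtype=bool)
    if n_defer > 0:
        deferred[np.argsort(-entropy)[:n_defer]] = True
    return probs.argmax(axis=1), deferred
```

Unlike a learned deferral head, this rule needs no deferral-specific training, which is the property the abstract's robustness hypothesis rests on.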

[293] Zero-shot Compositional Action Recognition with Neural Logic Constraints

Gefan Ye,Lin Li,Kexin Li,Jun Xiao,Long chen

Main category: cs.CV

TL;DR: This paper proposes the LogicCAR framework, which introduces compositional logic and hierarchical primitive logic to address structural and semantic problems in zero-shot compositional action recognition. Experiments show that it outperforms existing methods.

Details Motivation: Zero-shot compositional action recognition faces two key challenges: missing compositional structure constraints lead to spurious correlations, and neglected semantic hierarchy constraints lead to semantic ambiguity that impairs training. Method: A logic-driven zero-shot compositional action recognition framework, LogicCAR, that includes Explicit Compositional Logic and Hierarchical Primitive Logic constraints and embeds them into neural network architectures. Result: Experiments on the Sth-com dataset show that LogicCAR outperforms existing baselines, demonstrating the effectiveness of the logic-driven constraints. Conclusion: By integrating symbolic logic constraints, LogicCAR addresses the missing compositional structure and neglected semantic hierarchy in zero-shot compositional action recognition, improving the model's reasoning ability. Abstract: Zero-shot compositional action recognition (ZS-CAR) aims to identify unseen verb-object compositions in the videos by exploiting the learned knowledge of verb and object primitives during training. Despite compositional learning's progress in ZS-CAR, two critical challenges persist: 1) Missing compositional structure constraint, leading to spurious correlations between primitives; 2) Neglecting semantic hierarchy constraint, leading to semantic ambiguity and impairing the training process. In this paper, we argue that human-like symbolic reasoning offers a principled solution to these challenges by explicitly modeling compositional and hierarchical structured abstraction. To this end, we propose a logic-driven ZS-CAR framework LogicCAR that integrates dual symbolic constraints: Explicit Compositional Logic and Hierarchical Primitive Logic. Specifically, the former models the restrictions within the compositions, enhancing the compositional reasoning ability of our model. The latter investigates the semantical dependencies among different primitives, empowering the models with fine-to-coarse reasoning capacity. By formalizing these constraints in first-order logic and embedding them into neural network architectures, LogicCAR systematically bridges the gap between symbolic abstraction and existing models. Extensive experiments on the Sth-com dataset demonstrate that our LogicCAR outperforms existing baseline methods, proving the effectiveness of our logic-driven constraints.

[294] Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images

Philipp Wulff,Felix Wimbauer,Dominik Muhle,Daniel Cremers

Main category: cs.CV

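TL;DR: This paper proposes a new method that uses pre-trained models to generate scene geometry from monocular images, achieving efficient volumetric scene reconstruction without multi-view supervision.

Details Motivation: Existing volumetric reconstruction methods usually require expensive 3D ground truth or multi-view supervision, which may be infeasible in practice. Method: Pre-trained 2D diffusion models and depth prediction models generate synthetic scene geometry, which is then used to distill a feed-forward scene reconstruction model. Result: Experiments on the KITTI-360 and Waymo datasets show that the method matches or outperforms existing methods and offers unique advantages for dynamic scenes. Conclusion: The proposed method performs well at scene reconstruction from monocular images, matching or surpassing existing methods that use multi-view supervision. Abstract: Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision. We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes.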

[295] Qwen-Image Technical Report

Chenfei Wu,Jiahao Li,Jingren Zhou,Junyang Lin,Kaiyuan Gao,Kun Yan,Sheng-ming Yin,Shuai Bai,Xiao Xu,Yilei Chen,Yuxiang Chen,Zecheng Tang,Zekai Zhang,Zhengyi Wang,An Yang,Bowen Yu,Chen Cheng,Dayiheng Liu,Deqing Li,Hang Zhang,Hao Meng,Hu Wei,Jingyuan Ni,Kai Chen,Kuan Cao,Liang Peng,Lin Qu,Minggang Wu,Peng Wang,Shuting Yu,Tingkun Wen,Wensen Feng,Xiaoxiao Xu,Yi Wang,Yichang Zhang,Yongqiang Zhu,Yujia Wu,Yuxuan Cai,Zenan Liu

Main category: cs.CV

TL;DR: Through novel data processing, progressive training, and multi-task learning, Qwen-Image achieves significant improvements in complex text rendering and image editing tasks.

Details Motivation: To address the challenges of complex text rendering and precise image editing, and to improve model performance across languages and complex scenarios. Method: A comprehensive data pipeline, a progressive training strategy, a multi-task training paradigm, and a dual-encoding mechanism. Result: Qwen-Image performs strongly on multiple benchmarks. It does well not only in alphabetic languages such as English but also makes notable progress on logographic languages such as Chinese, and it improves image editing consistency and visual fidelity. Conclusion: With its improved multi-task training paradigm and dual-encoding mechanism, Qwen-Image advances image generation and editing, stands out in complex text rendering and multilingual support, and ranks among the current state-of-the-art image generation models. Abstract: We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.

[296] CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions

Ziteng Wang,Siqi Yang,Limeng Qiao,Lin Ma

Main category: cs.CV

TL;DR: CLIP-IN enhances CLIP's fine-grained visual comprehension through targeted contrastive learning and descriptive context integration, achieving significant performance improvements on detailed visual tasks.

Details Motivation: Despite the success of Vision-Language Models like CLIP, their proficiency in detailed, fine-grained visual comprehension remains a challenge, prompting the need for enhanced methods. Method: CLIP-IN introduces two innovations: using instruction-editing datasets for hard negative image-text pairs combined with a symmetric hard negative contrastive loss, and incorporating long descriptive captions with rotary positional encodings to capture rich semantic context. Result: CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, maintains robust zero-shot performance on broader tasks, and significantly reduces visual hallucinations when integrated into Multimodal Large Language Models. Conclusion: CLIP-IN improves CLIP's fine-grained visual comprehension by leveraging instruction-editing datasets and long descriptive captions, effectively enhancing visual representations for detailed understanding. Abstract: Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. 
Critically, integrating CLIP-IN's visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.

[297] Correspondence-Free Fast and Robust Spherical Point Pattern Registration

Anik Sarker,Alan T. Asbeck

Main category: cs.CV

TL;DR: The paper proposes efficient rotation estimation methods for spherical patterns (SPMC, FRS, and SPMC+FRS) with linear time complexity and greater robustness. They substantially improve on existing methods in both speed and accuracy and are applied to point cloud registration and rotation estimation for spherical images.

Details Motivation: Existing rotation estimation methods for spherical patterns rely on maximizing spherical cross-correlation, have computational complexity greater than cubic, and lack extensive evaluation under significant outlier contamination. The authors therefore aim to develop a more efficient and robust algorithm. Method: The paper represents spherical patterns as discrete 3D point sets on the unit sphere and reformulates rotation estimation as spherical point-set alignment (the Wahba problem for 3D unit vectors). It proposes three algorithms: SPMC (Spherical Pattern Matching by Correlation), FRS (Fast Rotation Search), and a hybrid SPMC+FRS that combines the strengths of both. Result: Experiments show that, in correspondence-free settings on the sphere, the proposed methods are over 10x faster and over 10x more accurate than state-of-the-art methods. The authors also build a new Robust Vector Alignment Dataset to validate the methods. Conclusion: The paper proposes a novel rotation estimation approach for spherical patterns that has linear time complexity and achieves higher accuracy and efficiency in the presence of outliers. The approach is also applied to point cloud registration and rotation estimation for spherical images. Abstract: Existing methods for rotation estimation between two spherical ($\mathbb{S}^2$) patterns typically rely on spherical cross-correlation maximization between two spherical functions. However, these approaches exhibit computational complexities greater than cubic $O(n^3)$ with respect to rotation space discretization and lack extensive evaluation under significant outlier contamination. To this end, we propose a rotation estimation algorithm between two spherical patterns with linear time complexity $O(n)$. Unlike existing spherical-function-based methods, we explicitly represent spherical patterns as discrete 3D point sets on the unit sphere, reformulating rotation estimation as a spherical point-set alignment (i.e., Wahba problem for 3D unit vectors). Given the geometric nature of our formulation, our spherical pattern alignment algorithm naturally aligns with the Wahba problem framework for 3D unit vectors. Specifically, we introduce three novel algorithms: (1) SPMC (Spherical Pattern Matching by Correlation), (2) FRS (Fast Rotation Search), and (3) a hybrid approach (SPMC+FRS) that combines the advantages of the previous two methods. Our experiments demonstrate that in the $\mathbb{S}^2$ domain and in correspondence-free settings, our algorithms are over 10x faster and over 10x more accurate than current state-of-the-art methods for the Wahba problem with outliers. We validate our approach through extensive simulations on a new dataset of spherical patterns, the "Robust Vector Alignment Dataset." Furthermore, we adapt our methods to two real-world tasks: (i) Point Cloud Registration (PCR) and (ii) rotation estimation for spherical images.

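For readers unfamiliar with the Wahba problem referenced in the spherical registration abstract above, here is a minimal sketch of its classical SVD solution. Note the key difference from the paper: this version assumes the correspondences between the two sets of unit vectors are already known, whereas SPMC and FRS target the correspondence-free case and are not reproduced here. The function name `solve_wahba` is hypothetical.

```python
import numpy as np

def solve_wahba(a, b, weights=None):
    """Rotation R minimizing sum_i w_i * ||b_i - R a_i||^2 for matched vectors.

    `a` and `b` are (N, 3) arrays whose rows correspond one to one.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    w = np.ones(len(a)) if weights is None else np.asarray(weights, dtype=np.float64)
    # Attitude profile matrix B = sum_i w_i * b_i a_i^T
    B = (w[:, None] * b).T @ a
    U, _, Vt = np.linalg.svd(B)
    # Force det(R) = +1 so the result is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt
```

With noise-free matched vectors, the estimate recovers the true rotation up to floating-point error; with outliers or unknown correspondences, this closed form alone is not sufficient, which is the setting the paper addresses.
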
Fan Hu,Zijie Xin,Xirong Li

Main category: cs.CV

TL;DR: This study proposes LPD, a new video search method that enhances the diversity of search results and performs well on multiple benchmarks.

Details Motivation: Existing AVS solutions mainly fuse multiple features into one or more common spaces but overlook the importance of diverse spaces. Method: The paper proposes LPD, which comprises feature-specific common space construction and a de-correlation loss, and designs an entropy-based fair multi-space triplet ranking loss. Result: Extensive experiments on the TRECVID AVS benchmarks validate the effectiveness of LPD, and visualizations of space diversity highlight its ability to enhance result diversity. Conclusion: By creating feature-specific common spaces and using a de-correlation loss, LPD enhances result diversity and improves search performance. Abstract: Ad-hoc Video Search (AVS) involves using a textual query to search for multiple relevant videos in a large collection of unlabeled short videos. The main challenge of AVS is the visual diversity of relevant videos. A simple query such as "Find shots of a man and a woman dancing together indoors" can span a multitude of environments, from brightly lit halls and shadowy bars to dance scenes in black-and-white animations. It is therefore essential to retrieve relevant videos as comprehensively as possible. Current solutions for the AVS task primarily fuse multiple features into one or more common spaces, yet overlook the need for diverse spaces. To fully exploit the expressive capability of individual features, we propose LPD, short for Learning Partially Decorrelated common spaces. LPD incorporates two key innovations: feature-specific common space construction and the de-correlation loss. Specifically, LPD learns a separate common space for each video and text feature, and employs de-correlation loss to diversify the ordering of negative samples across different spaces. To enhance the consistency of multi-space convergence, we designed an entropy-based fair multi-space triplet ranking loss. Extensive experiments on the TRECVID AVS benchmarks (2016-2023) justify the effectiveness of LPD. Moreover, diversity visualizations of LPD's spaces highlight its ability to enhance result diversity.

[299] mmWave Radar-Based Non-Line-of-Sight Pedestrian Localization at T-Junctions Utilizing Road Layout Extraction via Camera

Byeonggyu Park,Hee-Yeun Kim,Byonghyok Choi,Hansang Cho,Byungkwan Kim,Soomok Lee,Mingu Jeon,Seong-Woo Kim

Main category: cs.CV

TL;DR: This paper proposes a novel radar-camera fusion framework for accurate pedestrian localization in non-line-of-sight urban environments, leveraging visual information to interpret radar data and validate the approach through real-world experiments.

Details Motivation: Pedestrian localization in non-line-of-sight (NLoS) urban environments is a major challenge for autonomous driving systems, as current mmWave radar and camera technologies have limitations in accurately detecting and perceiving depth for objects in such regions. Method: The method uses visual information from a camera to interpret 2D radar point cloud data, enabling spatial scene reconstruction for improved localization of pedestrians in non-line-of-sight conditions. Result: The proposed framework demonstrated effective localization performance in outdoor NLoS driving environments, showing practical applicability in real-world scenarios. Conclusion: The proposed method effectively improves the localization of pedestrians in NLoS regions by combining radar and camera data, validated through real-world experiments. Abstract: Pedestrian localization in Non-Line-of-Sight (NLoS) regions within urban environments poses a significant challenge for autonomous driving systems. While mmWave radar has demonstrated potential for detecting objects in such scenarios, the 2D radar point cloud (PCD) data is susceptible to distortions caused by multipath reflections, making accurate spatial inference difficult. Additionally, although camera images provide high-resolution visual information, they lack depth perception and cannot directly observe objects in NLoS regions. In this paper, we propose a novel framework that interprets radar PCD through the road layout inferred from a camera to localize NLoS pedestrians. The proposed method leverages visual information from the camera to interpret 2D radar PCD, enabling spatial scene reconstruction. The effectiveness of the proposed approach is validated through experiments conducted using a radar-camera system mounted on a real vehicle. The localization performance is evaluated using a dataset collected in outdoor NLoS driving environments, demonstrating the practical applicability of the method.

[300] Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering

Xu Wang,Shengeng Tang,Fei Wang,Lechao Cheng,Dan Guo,Feng Xue,Richang Hong

Main category: cs.CV

TL;DR: This paper proposes Text2Lip, a framework that embeds textual input into structured viseme sequences to generate semantically coherent and visually accurate talking faces.

Details Motivation: Generating semantically coherent and visually accurate talking faces requires bridging the gap between linguistic meaning and facial articulation. Audio-driven methods remain prevalent, but they rely on high-quality paired audio-visual data, and the mapping from acoustics to lip motion is inherently ambiguous, which poses significant challenges for scalability and robustness. Method: The paper designs a progressive viseme-audio replacement strategy based on curriculum learning, which lets the model gradually transition from real audio to pseudo-audio reconstructed from enhanced viseme features via cross-modal attention. Viseme sequences also provide a linguistically grounded prior for lip motion prediction. Finally, a landmark-guided renderer synthesizes photorealistic facial videos with accurate lip synchronization. Result: Extensive evaluations show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness. Conclusion: Text2Lip establishes a new paradigm for controllable and flexible talking face generation. Abstract: Generating semantically coherent and visually accurate talking faces requires bridging the gap between linguistic meaning and facial articulation. Although audio-driven methods remain prevalent, their reliance on high-quality paired audio-visual data and the inherent ambiguity in mapping acoustics to lip motion pose significant challenges in terms of scalability and robustness. To address these issues, we propose Text2Lip, a viseme-centric framework that constructs an interpretable phonetic-visual bridge by embedding textual input into structured viseme sequences. These mid-level units serve as a linguistically grounded prior for lip motion prediction. Furthermore, we design a progressive viseme-audio replacement strategy based on curriculum learning, enabling the model to gradually transition from real audio to pseudo-audio reconstructed from enhanced viseme features via cross-modal attention. This allows for robust generation in both audio-present and audio-free scenarios. Finally, a landmark-guided renderer synthesizes photorealistic facial videos with accurate lip synchronization. Extensive evaluations show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness, establishing a new paradigm for controllable and flexible talking face generation. Our project homepage is https://plyon1.github.io/Text2Lip/.

[301] Transport-Guided Rectified Flow Inversion: Improved Image Editing Using Optimal Transport Theory

Marian Lupascu,Mihai-Sorin Stupariu

Main category: cs.CV

TL;DR: OTIP is a zero-shot framework for image inversion in rectified flow models that uses optimal transport theory to balance reconstruction accuracy and editing flexibility, achieving state-of-the-art performance in image editing tasks.

Details Motivation: Achieving an optimal balance between reconstruction fidelity and editing flexibility in rectified flow models for practical image editing applications remains a fundamental challenge. Method: OTIP computes optimal transport paths between image and noise distributions during the inversion process in rectified flow models, incorporating transport-based guidance to optimize trajectories for balancing accuracy and controllability. Result: OTIP achieves high-fidelity reconstruction with LPIPS of 0.001 and SSIM of 0.992 on face editing benchmarks, showing improvements of 7.8% to 12.9% in reconstruction loss over RF-Inversion on LSUN datasets, and 11.2% improvement in identity preservation for semantic face editing. Conclusion: The proposed OTIP framework successfully balances reconstruction fidelity and editing flexibility in rectified flow models by leveraging optimal transport theory, demonstrating superior performance in image inversion tasks. Abstract: Effective image inversion in rectified flow models - mapping real images to editable latent representations - is crucial for practical image editing applications; however, achieving optimal balance between reconstruction fidelity and editing flexibility remains a fundamental challenge. In this work, we introduce the Optimal Transport Inversion Pipeline (OTIP), a zero-shot framework that leverages optimal transport theory to guide the inversion process in rectified flow models. Our underlying hypothesis is that incorporating transport-based guidance during the reverse diffusion process can effectively balance reconstruction accuracy and editing controllability through principled trajectory optimization. The method computes optimal transport paths between image and noise distributions while maintaining computational efficiency. 
Our approach achieves high-fidelity reconstruction with LPIPS scores of 0.001 and SSIM of 0.992 on face editing benchmarks, demonstrating superior preservation of fine-grained details compared to existing methods. We evaluate the framework across multiple editing tasks, observing 7.8% to 12.9% improvements in reconstruction loss over RF-Inversion on the LSUN-Bedroom and LSUN-Church datasets, respectively. For semantic face editing, our method achieves an 11.2% improvement in identity preservation and a 1.6% enhancement in perceptual quality, while maintaining computational efficiency comparable to baseline approaches. Qualitatively, our method produces visually compelling edits with superior semantic consistency and fine-grained detail preservation across diverse editing scenarios. Code is available at: https://github.com/marianlupascu/OT-Inversion

[302] TRUDI and TITUS: A Multi-Perspective Dataset and A Three-Stage Recognition System for Transportation Unit Identification

Emre Gülsoylu,André Kelm,Lennart Bengtson,Matthias Hirsch,Christian Wilms,Tim Rolff,Janick Edinger,Simone Frintrop

Main category: cs.CV

TL;DR: This paper presents the TRUDI dataset and TITUS pipeline to improve the identification of transportation units in port environments, enabling better logistics efficiency through digital transformation.

Details Motivation: The lack of publicly available benchmark datasets capturing the diversity of real-world port environments has hindered progress in improving port logistics efficiency. This work aims to address that gap. Method: The TRUDI dataset was created with 35,034 annotated instances across five categories captured under varying conditions. TITUS, a three-stage pipeline, was developed for TU identification: (1) segmenting TUs, (2) detecting ID text location, and (3) recognizing and validating the ID. Result: TITUS reliably identifies transportation units from various camera perspectives and under different lighting and weather conditions, outperforming alternative systems that often require specific scenes or setups. Conclusion: TRUDI dataset and TITUS pipeline offer a reliable solution for identifying transportation units in diverse port environments, advancing digital transformation and logistics efficiency. Abstract: Identifying transportation units (TUs) is essential for improving the efficiency of port logistics. However, progress in this field has been hindered by the lack of publicly available benchmark datasets that capture the diversity and dynamics of real-world port environments. To address this gap, we present the TRUDI dataset, a comprehensive collection comprising 35,034 annotated instances across five categories: container, tank container, trailer, ID text, and logo. The images were captured at operational ports using both ground-based and aerial cameras, under a wide variety of lighting and weather conditions. For the identification of TUs, which involves reading the 11-digit alphanumeric ID typically painted on each unit, we introduce TITUS, a dedicated pipeline that operates in three stages: (1) segmenting the TU instances, (2) detecting the location of the ID text, and (3) recognising and validating the extracted ID. 
Unlike alternative systems, which often require similar scenes, specific camera angles or gate setups, our evaluation demonstrates that TITUS reliably identifies TUs from a range of camera perspectives and in varying lighting and weather conditions. By making the TRUDI dataset publicly available, we provide a robust benchmark that enables the development and comparison of new approaches. This contribution supports digital transformation efforts in multipurpose ports and helps to increase the efficiency of entire logistics chains.
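
The third TITUS stage validates the extracted ID, but the abstract does not say how. One common way to validate an 11-character container ID is the ISO 6346 check digit, so the sketch below assumes that scheme; whether TITUS uses it is not stated in the source. The function names and the permissive four-letter prefix pattern are illustrative choices.

```python
import re
import string


def _letter_values():
    """ISO 6346 letter values: A=10 upward, skipping multiples of 11."""
    values, v = {}, 10
    for ch in string.ascii_uppercase:
        if v % 11 == 0:
            v += 1
        values[ch] = v
        v += 1
    return values


LETTER_VALUES = _letter_values()
# Four letters, six serial digits, one check digit (assumed layout).
ID_PATTERN = re.compile(r"[A-Z]{4}\d{7}")


def check_digit(first_ten):
    """Compute the check digit from the first ten characters of the ID."""
    total = 0
    for position, ch in enumerate(first_ten):
        value = LETTER_VALUES[ch] if ch.isalpha() else int(ch)
        total += value * (2 ** position)
    # A remainder of 10 maps to check digit 0.
    return total % 11 % 10


def is_valid_unit_id(text):
    """Normalise OCR output and test it against the check digit."""
    candidate = re.sub(r"[^A-Za-z0-9]", "", text).upper()
    if not ID_PATTERN.fullmatch(candidate):
        return False
    return check_digit(candidate[:10]) == int(candidate[10])


if __name__ == "__main__":
    print(is_valid_unit_id("CSQU 305438 3"))  # well-formed ID with matching check digit
    print(is_valid_unit_id("CSQU3054384"))    # same ID with a wrong check digit
```

A check of this kind lets a recognition pipeline reject most single-character OCR errors without any external lookup.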

[303] Uni-Layout: Integrating Human Feedback in Unified Layout Generation and Evaluation

Shuo Lu,Yanyin Chen,Wei Feng,Jiahao Fan,Fengheng Li,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Jian Liang

Main category: cs.CV

TL;DR: This paper proposes Uni-Layout, a unified framework for layout generation and evaluation, incorporating natural language prompts and human feedback for improved performance.

Details Motivation: Current layout generation approaches have limited applicability and ineffective evaluation metrics, necessitating a unified framework that aligns generation with human perception. Method: Uni-Layout incorporates various layout tasks into a single taxonomy with a unified generator and develops a human-mimicking evaluator based on the Layout-HF100k dataset. Result: Uni-Layout achieves superior performance over task-specific and general-purpose methods through unified generation and human-aligned evaluation. Conclusion: Uni-Layout provides a unified framework for layout generation and evaluation, significantly outperforming existing methods. Abstract: Layout generation plays a crucial role in enhancing both user experience and design efficiency. However, current approaches suffer from task-specific generation capabilities and perceptually misaligned evaluation metrics, leading to limited applicability and ineffective measurement. In this paper, we propose \textit{Uni-Layout}, a novel framework that achieves unified generation, human-mimicking evaluation and alignment between the two. For universal generation, we incorporate various layout tasks into a single taxonomy and develop a unified generator that handles background or element contents constrained tasks via natural language prompts. To introduce human feedback for the effective evaluation of layouts, we build \textit{Layout-HF100k}, the first large-scale human feedback dataset with 100,000 expertly annotated layouts. Based on \textit{Layout-HF100k}, we introduce a human-mimicking evaluator that integrates visual and geometric information, employing a Chain-of-Thought mechanism to conduct qualitative assessments alongside a confidence estimation module to yield quantitative measurements. 
For better alignment between the generator and the evaluator, we integrate them into a cohesive system by adopting Dynamic-Margin Preference Optimization (DMPO), which dynamically adjusts margins based on preference strength to better align with human judgments. Extensive experiments show that \textit{Uni-Layout} significantly outperforms both task-specific and general-purpose methods. Our code is publicly available at https://github.com/JD-GenX/Uni-Layout.
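
The abstract says DMPO adjusts margins according to preference strength but gives no formula. The sketch below shows one way such a loss could look, assuming a DPO-style objective with a margin that grows linearly with a preference-strength score in [0, 1]. The function names, `beta`, `max_margin`, and the linear margin rule are assumptions for illustration and should not be read as the Uni-Layout definition.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dynamic_margin_preference_loss(
    logp_win, logp_lose, ref_logp_win, ref_logp_lose,
    strength, beta=0.1, max_margin=1.0,
):
    """Hypothetical margin-based preference loss for one layout pair.

    logp_* are log-probabilities of the preferred and rejected layouts under
    the generator; ref_logp_* are the same under a frozen reference model.
    `strength` in [0, 1] says how strongly annotators prefer the winner.
    """
    reward_gap = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    margin = max_margin * strength  # stronger preference demands a larger gap
    return -log_sigmoid(reward_gap - margin)


if __name__ == "__main__":
    weak = dynamic_margin_preference_loss(-10.0, -10.0, -10.0, -10.0, strength=0.1)
    strong = dynamic_margin_preference_loss(-10.0, -10.0, -10.0, -10.0, strength=0.9)
    print(weak, strong)
```

With identical log-probabilities, the strongly preferred pair yields the larger loss, so the generator is pushed harder to separate pairs that humans judge as clearly different.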

[304] SMART-Ship: A Comprehensive Synchronized Multi-modal Aligned Remote Sensing Targets Dataset and Benchmark for Berthed Ships Analysis

Chen-Chen Fan,Peiyao Guo,Linping Zhang,Kehan Qi,Haolin Huang,Yong-Qiang Mao,Yuxi Suo,Zhizhuo Jiang,Yu Liu,You He

Main category: cs.CV

TL;DR: The SMART-Ship dataset enhances maritime surveillance by providing synchronized, multi-modal remote sensing images with detailed annotations, enabling diverse RS tasks and guiding future research.

Details Motivation: Maritime surveillance is challenging due to the complexity of multi-scale targets and dynamic environments, and multi-modal remote sensing data is essential for long-term Earth observation, especially given the limitations of satellite orbits and imaging conditions. Method: The authors constructed the SMART-Ship dataset using spatiotemporally registered images from five modalities (visible-light, SAR, panchromatic, multi-spectral, and near-infrared) with detailed annotations, including polygonal location information, fine-grained categories, instance-level identifiers, and change region masks. They also defined standardized benchmarks for five fundamental tasks and compared representative methods. Result: The SMART-Ship dataset includes 1092 multi-modal image sets covering 38,838 ships, with each set acquired within one week and annotated for hierarchical support of diverse remote sensing tasks. Experimental evaluations demonstrate the dataset's effectiveness in supporting multi-modal RS interpretation tasks. Conclusion: The SMART-Ship dataset provides a comprehensive and synchronized multi-modal remote sensing dataset for maritime surveillance, supporting various RS interpretation tasks and highlighting future research directions. Abstract: Given the limitations of satellite orbits and imaging conditions, multi-modal remote sensing (RS) data is crucial in enabling long-term earth observation. However, maritime surveillance remains challenging due to the complexity of multi-scale targets and the dynamic environments. To bridge this critical gap, we propose a Synchronized Multi-modal Aligned Remote sensing Targets dataset for berthed ships analysis (SMART-Ship), containing spatiotemporal registered images with fine-grained annotation for maritime targets from five modalities: visible-light, synthetic aperture radar (SAR), panchromatic, multi-spectral, and near-infrared. 
Specifically, our dataset consists of 1092 multi-modal image sets, covering 38,838 ships. Each image set is acquired within one week and registered to ensure spatiotemporal consistency. Ship instances in each set are annotated with polygonal location information, fine-grained categories, instance-level identifiers, and change region masks, organized hierarchically to support diverse multi-modal RS tasks. Furthermore, we define standardized benchmarks on five fundamental tasks and comprehensively compare representative methods across the dataset. Thorough experiment evaluations validate that the proposed SMART-Ship dataset could support various multi-modal RS interpretation tasks and reveal the promising directions for further exploration.

[305] Enhancing Object Discovery for Unsupervised Instance Segmentation and Object Detection

Xingyu Feng,Hebei Gao,Hong Li

Main category: cs.CV

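TL;DR: COLER is a zero-shot unsupervised model that generates pseudo labels with CutOnce and learns from them, achieving strong instance segmentation and object detection performance.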

Details Motivation: To improve performance in unsupervised object localization with a simple and effective method. Method: COLER first uses CutOnce to generate coarse pseudo labels, then has a detector learn from these masks. CutOnce applies Normalized Cut only once and generates multiple object masks without relying on clustering methods; new modules are designed to fully exploit the object discovery capabilities of self-supervised models. Result: COLER outperforms previous state-of-the-art methods on multiple benchmarks and achieves strong performance without specially designed loss functions. Conclusion: COLER is an effective method for unsupervised instance segmentation and object detection that helps advance the field of unsupervised object localization. Abstract: We propose Cut-Once-and-LEaRn (COLER), a simple approach for unsupervised instance segmentation and object detection. COLER first uses our developed CutOnce to generate coarse pseudo labels, then enables the detector to learn from these masks. CutOnce applies Normalized Cut only once and does not rely on any clustering methods, but it can generate multiple object masks in an image. We have designed several novel yet simple modules that not only allow CutOnce to fully leverage the object discovery capabilities of self-supervised models, but also free it from reliance on mask post-processing. During training, COLER achieves strong performance without requiring specially designed loss functions for pseudo labels, and its performance is further improved through self-training. COLER is a zero-shot unsupervised model that outperforms previous state-of-the-art methods on multiple benchmarks. We believe our method can help advance the field of unsupervised object localization.

[306] Hydra: Accurate Multi-Modal Leaf Wetness Sensing with mm-Wave and Camera Fusion

Yimeng Liu,Maolin Gan,Huaili Zeng,Li Liu,Younsuk Dong,Zhichao Cao

Main category: cs.CV

TL;DR: Hydra combines mm-Wave radar with camera technology, using CNN and Transformer models to detect leaf wetness and enable high-accuracy LWD measurement.

Details Motivation: LWD measurement lacks standardized techniques, and existing methods fall short both in measuring wetness on natural leaves directly and in adapting to environmental changes. Method: A CNN fuses mm-Wave depth images with an RGB image to generate feature images, and a transformer-based encoder produces a feature map that is then classified. Result: Implemented with an FMCW radar in the 76 to 81 GHz band, Hydra reaches up to 96% accuracy across varying scenarios and around 90% accuracy when deployed on a farm. Conclusion: Hydra provides a high-accuracy, robust method for LWD measurement that suits a variety of agricultural environments. Abstract: Leaf Wetness Duration (LWD), the time that water remains on leaf surfaces, is crucial in the development of plant diseases. Existing LWD detection lacks standardized measurement techniques, and variations across different plant characteristics limit its effectiveness. Prior research proposes diverse approaches, but they fail to measure real natural leaves directly and lack resilience in various environmental conditions. This reduces the precision and robustness, revealing a notable practical application and effectiveness gap in real-world agricultural settings. This paper presents Hydra, an innovative approach that integrates millimeter-wave (mm-Wave) radar with camera technology to detect leaf wetness by determining if there is water on the leaf. We can measure the time to determine the LWD based on this detection. Firstly, we design a Convolutional Neural Network (CNN) to selectively fuse multiple mm-Wave depth images with an RGB image to generate multiple feature images. Then, we develop a transformer-based encoder to capture the inherent connection among the multiple feature images to generate a feature map, which is further fed to a classifier for detection. Moreover, we augment the dataset during training to generalize our model. Implemented using a frequency-modulated continuous-wave (FMCW) radar within the 76 to 81 GHz band, Hydra's performance is meticulously evaluated on plants, demonstrating the potential to classify leaf wetness with up to 96% accuracy across varying scenarios. Deployed on a farm under conditions including rain, dawn, and poorly lit nights, Hydra still achieves an accuracy of around 90%.
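
For a sense of scale, the 76 to 81 GHz band bounds how finely an FMCW radar can separate targets in range. The sketch below applies the standard relation, range resolution = c / (2B), under the assumption that a chirp sweeps the full 5 GHz; the abstract does not state the chirp bandwidth Hydra actually uses, so the result is a best case rather than a reported figure.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def range_resolution_m(bandwidth_hz):
    """Theoretical FMCW range resolution for a given swept bandwidth."""
    return SPEED_OF_LIGHT_M_PER_S / (2.0 * bandwidth_hz)


if __name__ == "__main__":
    # Assumption: the chirp covers the whole 76-81 GHz band.
    sweep_hz = 81e9 - 76e9
    print(f"{range_resolution_m(sweep_hz) * 100:.2f} cm")
```

Under that assumption the resolution is about 3 cm.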

[307] HGTS-Former: Hierarchical HyperGraph Transformer for Multivariate Time Series Analysis

Xiao Wang,Hao Si,Fan Zhang,Xiaoya Zhou,Dengdi Sun,Wanli Lyu,Qingquan Yang,Jin Tang

Main category: cs.CV

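TL;DR: This paper proposes HGTS-Former, a hypergraph-based model for multivariate time series analysis that uses the structural modeling capability of hypergraphs to better capture complex relationships among variables.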

Details Motivation: Multivariate time series analysis is challenging because of high dimensionality, dynamic behavior, and complex interactions among variables, and it calls for more effective models. Method: HGTS-Former normalizes the multivariate time series signal and embeds it into tokens, then uses multi-head self-attention to enhance the temporal representation. Hierarchical hypergraphs aggregate the temporal patterns within each channel and the fine-grained relations between variables. Finally, an EdgeToNode module converts hyperedges into node features, and a feed-forward network further enhances the output features. Result: Extensive experiments on two multivariate time series tasks and eight datasets validate the effectiveness of HGTS-Former. Conclusion: By modeling with hypergraphs, HGTS-Former addresses the complex coupling in multivariate time series and performs well. Abstract: Multivariate time series analysis has long been one of the key research topics in the field of artificial intelligence. However, analyzing complex time series data remains a challenging and unresolved problem due to its high dimensionality, dynamic nature, and complex interactions among variables. Inspired by the strong structural modeling capability of hypergraphs, this paper proposes a novel hypergraph-based time series transformer backbone network, termed HGTS-Former, to address the multivariate coupling in time series data. Specifically, given the multivariate time series signal, we first normalize and embed each patch into tokens. Then, we adopt the multi-head self-attention to enhance the temporal representation of each patch. The hierarchical hypergraphs are constructed to aggregate the temporal patterns within each channel and fine-grained relations between different variables. After that, we convert the hyperedge into node features through the EdgeToNode module and adopt the feed-forward network to further enhance the output features. Extensive experiments conducted on two multivariate time series tasks and eight datasets fully validated the effectiveness of our proposed HGTS-Former. The source code will be released on https://github.com/Event-AHU/Time_Series_Analysis.

[308] Glioblastoma Overall Survival Prediction With Vision Transformers

Yin Lin,Riccardo Barbieri,Domenico Aquino,Giuseppe Lauria,Marina Grisoli,Elena De Momi,Alberto Redaelli,Simona Ferrante

Main category: cs.CV

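TL;DR: This study proposes a new AI approach based on Vision Transformers (ViTs) for predicting the overall survival (OS) of glioblastoma patients without requiring tumor segmentation.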

Details Motivation: Glioblastoma is one of the most aggressive and common brain tumors, with a median survival of only 10-15 months. Predicting overall survival is critical for personalizing treatment strategies and aligning clinical decisions with patient outcomes. Method: Vision Transformers (ViTs) extract hidden features directly from Magnetic Resonance Imaging (MRI) images to predict patient overall survival (OS). Result: Evaluated on the BRATS dataset, the model reaches 62.5% accuracy, comparable to the top-performing methods. It also shows balanced performance across precision, recall, and F1 score, surpassing the best model on these metrics. Conclusion: The study proposes a ViT-based AI approach for predicting the OS of glioblastoma patients without tumor segmentation, which simplifies the workflow and reduces computational resource requirements. Abstract: Glioblastoma is one of the most aggressive and common brain tumors, with a median survival of 10-15 months. Predicting Overall Survival (OS) is critical for personalizing treatment strategies and aligning clinical decisions with patient outcomes. In this study, we propose a novel Artificial Intelligence (AI) approach for OS prediction using Magnetic Resonance Imaging (MRI) images, exploiting Vision Transformers (ViTs) to extract hidden features directly from MRI images, eliminating the need for tumor segmentation. Unlike traditional approaches, our method simplifies the workflow and reduces computational resource requirements. The proposed model was evaluated on the BRATS dataset, reaching an accuracy of 62.5% on the test set, comparable to the top-performing methods. Additionally, it demonstrated balanced performance across precision, recall, and F1 score, surpassing the best model in these metrics. The dataset size limits the generalization of the ViT, which typically requires larger datasets compared to convolutional neural networks. This limitation in generalization is observed across all the cited studies. This work highlights the applicability of ViTs for downsampled medical imaging tasks and establishes a foundation for OS prediction models that are computationally efficient and do not rely on segmentation.

[309] InfoSyncNet: Information Synchronization Temporal Convolutional Network for Visual Speech Recognition

Junxiao Xue,Xiaozhen Liu,Xuecheng Wu,Fei Yu,Jun Wang

Main category: cs.CV

TL;DR: This paper introduces InfoSyncNet, a non-uniform sequence modeling network enhanced with data augmentation techniques, for estimating spoken content from silent videos. It achieves state-of-the-art results on two datasets.

Details Motivation: Accurately mapping lip movement sequences in videos to words poses significant challenges due to variability across sequences and uneven distribution of information. Method: InfoSyncNet uses a non-uniform quantization module and tailored data augmentation techniques to map lip movements to words. Result: InfoSyncNet achieves new state-of-the-art accuracies of 92.0% and 60.7% Top-1 ACC on LRW and LRW1000 datasets. Conclusion: InfoSyncNet proves to be superior in estimating spoken content from silent videos, achieving state-of-the-art results on LRW and LRW1000 datasets. Abstract: Estimating spoken content from silent videos is crucial for applications in Assistive Technology (AT) and Augmented Reality (AR). However, accurately mapping lip movement sequences in videos to words poses significant challenges due to variability across sequences and the uneven distribution of information within each sequence. To tackle this, we introduce InfoSyncNet, a non-uniform sequence modeling network enhanced by tailored data augmentation techniques. Central to InfoSyncNet is a non-uniform quantization module positioned between the encoder and decoder, enabling dynamic adjustment to the network's focus and effectively handling the natural inconsistencies in visual speech data. Additionally, multiple training strategies are incorporated to enhance the model's capability to handle variations in lighting and the speaker's orientation. Comprehensive experiments on the LRW and LRW1000 datasets confirm the superiority of InfoSyncNet, achieving new state-of-the-art accuracies of 92.0% and 60.7% Top-1 ACC. The code is available for download (see comments).

[310] SAMPO: Visual Preference Optimization for Intent-Aware Segmentation with Vision Foundation Models

Yonghuang Wu,Wenwen Zeng,Xuan Xie,Chengqian Zhao,Guoqing Wu,Jinhua Yu

Main category: cs.CV

TL;DR: SAMPO introduces preference optimization to bridge the intent gap in visual foundation models, enabling accurate segmentation from sparse prompts without relying on language models, and achieving superior performance on medical segmentation tasks with limited data.

Details Motivation: Foundation models like SAM face an intent gap, segmenting only explicitly prompted objects and failing to generalize to semantically related instances. This limitation is especially critical in domains with dense homogeneous objects, such as biomedical nuclei segmentation, where sparse visual prompts lead to incomplete results. Method: SAMPO employs preference optimization to teach visual foundation models to capture target-class characteristics implicitly, differing from traditional pixel-level fine-tuning and enabling robust multi-object segmentation even under sparse prompting. Result: On medical segmentation tasks like PanNuke-T2, SAMPO achieves state-of-the-art performance, significantly outperforming existing methods trained on the full dataset when fine-tuned with only 10% of the training data, with an improvement of over 9 percentage points compared to the best baseline. Conclusion: SAMPO successfully bridges the intent gap in foundation models like SAM by enabling them to infer high-level categorical intent from sparse visual interactions, establishing a new paradigm for intent-aware alignment without reliance on auxiliary prompt generators or language models. Abstract: Foundation models like Segment Anything Model (SAM) excel in promptable segmentation but suffer from an intent gap: they segment only explicitly prompted objects, failing to generalize to semantically related instances implicitly desired by users. This limitation is critical in domains with dense homogeneous objects (e.g., biomedical nuclei segmentation), where sparse visual prompts typically yield incomplete results, rendering dense annotations impractical due to prohibitive cost. To bridge this gap, we introduce SAMPO (Segment Anything Model with Preference Optimization), a novel framework that teaches visual foundation models to infer high-level categorical intent from sparse visual interactions. 
Unlike conventional pixel-level fine-tuning, SAMPO optimizes models to implicitly capture target-class characteristics through preference optimization. This approach, which operates without dependency on language models, enables robust multi-object segmentation even under sparse prompting and demonstrates superior data efficiency during fine-tuning. Validated on three medical segmentation tasks, SAMPO achieves state-of-the-art performance: on challenging tasks like PanNuke-T2, our method, when fine-tuned with only 10% of the training data, significantly outperforms all existing methods trained on the full 100% dataset, achieving an improvement of over 9 percentage points compared to the best baseline. Our work establishes a new paradigm for intent-aware alignment in visual foundation models, removing dependencies on auxiliary prompt generators or language-model-assisted preference learning.

[311] Multi-class Image Anomaly Detection for Practical Applications: Requirements and Robust Solutions

Jaehyuk Heo,Pilsung Kang

Main category: cs.CV

TL;DR: This paper introduces HierCore, a new framework for multi-class image anomaly detection that performs effectively even without class labels, addressing key challenges and outperforming existing methods in robustness and stability.

Details Motivation: The motivation stems from the underperformance of multi-class models compared to class-specific ones and the lack of exploration regarding how class information affects detection thresholds. This study aims to formalize requirements for multi-class anomaly detection models and evaluate existing methods accordingly. Method: The authors proposed a novel framework called Hierarchical Coreset (HierCore), which utilizes a hierarchical memory bank to estimate class-wise decision criteria for anomaly detection. They validated its applicability and robustness across four distinct scenarios involving the presence or absence of class labels during training and evaluation. Result: HierCore was shown to consistently meet all defined requirements and maintain strong, stable performance across all tested scenarios, proving its practical potential for real-world applications in multi-class anomaly detection. Conclusion: The study concludes that HierCore successfully addresses multi-class image anomaly detection challenges, especially in scenarios without class labels, and demonstrates robust and stable performance across various settings. Abstract: Recent advances in image anomaly detection have extended unsupervised learning-based models from single-class settings to multi-class frameworks, aiming to improve efficiency in training time and model storage. When a single model is trained to handle multiple classes, it often underperforms compared to class-specific models in terms of per-class detection accuracy. Accordingly, previous studies have primarily focused on narrowing this performance gap. However, the way class information is used, or not used, remains a relatively understudied factor that could influence how detection thresholds are defined in multi-class image anomaly detection. These thresholds, whether class-specific or class-agnostic, significantly affect detection outcomes. 
In this study, we identify and formalize the requirements that a multi-class image anomaly detection model must satisfy under different conditions, depending on whether class labels are available during training and evaluation. We then re-examine existing methods under these criteria. To meet these challenges, we propose Hierarchical Coreset (HierCore), a novel framework designed to satisfy all defined requirements. HierCore operates effectively even without class labels, leveraging a hierarchical memory bank to estimate class-wise decision criteria for anomaly detection. We empirically validate the applicability and robustness of existing methods and HierCore under four distinct scenarios, determined by the presence or absence of class labels in the training and evaluation phases. The experimental results demonstrate that HierCore consistently meets all requirements and maintains strong, stable performance across all settings, highlighting its practical potential for real-world multi-class anomaly detection tasks.

[312] Fine-grained Multiple Supervisory Network for Multi-modal Manipulation Detecting and Grounding

Xinquan Yu,Wei Lu,Xiangyang Luo

Main category: cs.CV

TL;DR: This paper proposes the FMS network with three supervisory modules to enhance the detection of multi-modal media manipulation, showing better performance than current state-of-the-art methods.

Details Motivation: The authors aim to address the limitations of existing methods that suffer from performance issues due to unreliable unimodal data and lack of fine-grained forgery supervision. Method: The paper introduces three supervisory modules: Multimodal Decision Supervised Correction (MDSC), Unimodal Forgery Mining Reinforcement (UFMR), and Multimodal Forgery Alignment Reasoning (MFAR), which provide comprehensive guidance for multi-modal media manipulation detection. Result: Extensive experiments demonstrate the superior performance of the FMS network compared to existing approaches in multi-modal media manipulation detection. Conclusion: The paper concludes that the proposed Fine-grained Multiple Supervisory (FMS) network outperforms state-of-the-art methods in detecting and grounding multi-modal media manipulation. Abstract: The task of Detecting and Grounding Multi-Modal Media Manipulation (DGM$^4$) is a branch of misinformation detection. Unlike traditional binary classification, it includes complex subtasks such as forgery content localization and forgery method classification. Existing methods are often limited in performance because they neglect the erroneous interference caused by unreliable unimodal data and fail to establish comprehensive forgery supervision for mining fine-grained tampering traces. In this paper, we present a Fine-grained Multiple Supervisory (FMS) network, which incorporates modality reliability supervision, unimodal internal supervision and cross-modal supervision to provide comprehensive guidance for DGM$^4$ detection. For modality reliability supervision, we propose the Multimodal Decision Supervised Correction (MDSC) module. It leverages unimodal weak supervision to correct the multi-modal decision-making process. For unimodal internal supervision, we propose the Unimodal Forgery Mining Reinforcement (UFMR) module. 
It amplifies the disparity between real and fake information within unimodal modality from both feature-level and sample-level perspectives. For cross-modal supervision, we propose the Multimodal Forgery Alignment Reasoning (MFAR) module. It utilizes soft-attention interactions to achieve cross-modal feature perception from both consistency and inconsistency perspectives, where we also design the interaction constraints to ensure the interaction quality. Extensive experiments demonstrate the superior performance of our FMS compared to state-of-the-art methods.

[313] MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding

Wenwen Zeng,Yonghuang Wu,Yifan Chen,Xuan Xie,Chengqian Zhao,Feiyu Yin,Guoqing Wu,Jinhua Yu

Main category: cs.CV

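TL;DR: This paper proposes a divide-and-decode framework for multi-shot fMRI video reconstruction that addresses the limitations of current methods in signal mixing, temporal resolution mismatch, and missing datasets, enabling more accurate recovery of visual narratives.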

Details Motivation: Reconstructing dynamic videos is important for understanding visual cognition and enabling vivid brain-computer interfaces, but existing methods are limited to single-shot clips and cannot handle the multi-shot nature of real-world experiences. Method: The paper adopts a divide-and-decode strategy with three core components: (1) a shot boundary predictor module that decomposes mixed fMRI signals into shot-specific segments; (2) generative keyframe captioning with LLMs, which resolves temporal blur through high-level semantics; and (3) large-scale data synthesis (20k samples) from existing datasets. Result: Experiments show that the framework outperforms state-of-the-art methods in multi-shot reconstruction fidelity. Ablation studies confirm the critical role of fMRI decomposition and semantic captioning, with decomposition improving decoded caption CLIP similarity by 71.8%. Conclusion: The paper proposes a new framework for multi-shot fMRI video reconstruction that accurately recovers complex visual narratives through explicit decomposition and semantic prompting, establishing a new paradigm for multi-shot fMRI reconstruction. Abstract: Reconstructing dynamic videos from fMRI is important for understanding visual cognition and enabling vivid brain-computer interfaces. However, current methods are critically limited to single-shot clips, failing to address the multi-shot nature of real-world experiences. Multi-shot reconstruction faces fundamental challenges: fMRI signal mixing across shots, the temporal resolution mismatch between fMRI and video obscuring rapid scene changes, and the lack of dedicated multi-shot fMRI-video datasets. To overcome these limitations, we propose a novel divide-and-decode framework for multi-shot fMRI video reconstruction. Our core innovations are: (1) A shot boundary predictor module explicitly decomposing mixed fMRI signals into shot-specific segments. (2) Generative keyframe captioning using LLMs, which decodes robust textual descriptions from each segment, overcoming temporal blur by leveraging high-level semantics. (3) Novel large-scale data synthesis (20k samples) from existing datasets. Experimental results demonstrate our framework outperforms state-of-the-art methods in multi-shot reconstruction fidelity. Ablation studies confirm the critical role of fMRI decomposition and semantic captioning, with decomposition significantly improving decoded caption CLIP similarity by 71.8%. This work establishes a new paradigm for multi-shot fMRI reconstruction, enabling accurate recovery of complex visual narratives through explicit decomposition and semantic prompting.

[314] Low-Frequency First: Eliminating Floating Artifacts in 3D Gaussian Splatting

Jianchao Wang,Peng Zhou,Cen Li,Rong Quan,Jie Qin

Main category: cs.CV

TL;DR: This work proposes EFA-GS, which effectively addresses floating artifacts in 3D Gaussian Splatting and improves reconstruction quality.

Details Motivation: Although 3D Gaussian Splatting is efficient, it suffers from floating artifacts, especially with low-quality initialization, which degrade reconstruction quality. Method: The paper analyzes the origins of floating artifacts from a frequency-domain perspective and proposes EFA-GS, which selectively expands under-optimized Gaussians, together with depth-based and scale-based strategies that dynamically refine the expansion. Result: Experiments show that EFA-GS substantially reduces floating artifacts on both synthetic and real-world datasets, improves PSNR by 1.68 dB over the baseline on the RWLQ dataset, and performs well in downstream 3D editing tasks. Conclusion: EFA-GS effectively reduces floating artifacts in 3D Gaussian Splatting while preserving high-frequency details, improving the visual fidelity of 3D reconstruction. Abstract: 3D Gaussian Splatting (3DGS) is a powerful and computationally efficient representation for 3D reconstruction. Despite its strengths, 3DGS often produces floating artifacts, which are erroneous structures detached from the actual geometry and significantly degrade visual fidelity. The underlying mechanisms causing these artifacts, particularly in low-quality initialization scenarios, have not been fully explored. In this paper, we investigate the origins of floating artifacts from a frequency-domain perspective and identify under-optimized Gaussians as the primary source. Based on our analysis, we propose Eliminating-Floating-Artifacts Gaussian Splatting (EFA-GS), which selectively expands under-optimized Gaussians to prioritize accurate low-frequency learning. Additionally, we introduce complementary depth-based and scale-based strategies to dynamically refine Gaussian expansion, effectively mitigating detail erosion. Extensive experiments on both synthetic and real-world datasets demonstrate that EFA-GS substantially reduces floating artifacts while preserving high-frequency details, achieving an improvement of 1.68 dB in PSNR over the baseline method on our RWLQ dataset. Furthermore, we validate the effectiveness of our approach in downstream 3D editing tasks. Our implementation will be released on GitHub.

[315] Rethinking Transparent Object Grasping: Depth Completion with Monocular Depth Estimation and Instance Mask

Yaofeng Cheng,Xinkai Gao,Sen Zhang,Chao Zeng,Fusheng Zha,Lining Sun,Chenguang Yang

Main category: cs.CV

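TL;DR: This paper proposes ReMake, a novel depth completion framework that uses an instance mask and monocular depth estimation to improve depth data for transparent objects, thereby improving the accuracy and generalization of robotic grasping.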

Details Motivation: The optical properties of transparent objects cause depth cameras to produce incomplete or invalid depth data, which reduces the accuracy and reliability of robotic grasping. Existing methods often fail in real-world scenes, where complex light interactions lead to highly variable distributions of valid and invalid depth data. Method: The paper proposes ReMake, a depth completion framework guided by an instance mask and monocular depth estimation. The instance mask explicitly distinguishes transparent regions from non-transparent ones, so that the model concentrates on learning accurate depth estimation in those regions during training. Monocular depth estimation provides depth context between the transparent object and its surroundings, improving depth prediction accuracy. Result: Experiments show that the proposed method outperforms existing approaches on both benchmark datasets and real-world scenarios, with higher accuracy and better generalization. Conclusion: The paper proposes ReMake, a new depth completion framework that uses an instance mask and monocular depth estimation to substantially improve depth data for transparent objects, increasing the accuracy and reliability of robotic grasping. Abstract: Due to their optical properties, transparent objects often lead depth cameras to generate incomplete or invalid depth data, which in turn reduces the accuracy and reliability of robotic grasping. Existing approaches typically input the RGB-D image directly into the network to output the complete depth, expecting the model to implicitly infer the reliability of depth values. However, while effective in training datasets, such methods often fail to generalize to real-world scenarios, where complex light interactions lead to highly variable distributions of valid and invalid depth data. To address this, we propose ReMake, a novel depth completion framework guided by an instance mask and monocular depth estimation. By explicitly distinguishing transparent regions from non-transparent ones, the mask enables the model to concentrate on learning accurate depth estimation in these areas from RGB-D input during training. This targeted supervision reduces reliance on implicit reasoning and improves generalization to real-world scenarios. Additionally, monocular depth estimation provides depth context between the transparent object and its surroundings, enhancing depth prediction accuracy. Extensive experiments show that our method outperforms existing approaches on both benchmark datasets and real-world scenarios, demonstrating superior accuracy and generalization capability. Code and videos are available at https://chengyaofeng.github.io/ReMake.github.io/.

[316] Engagement Prediction of Short Videos with Large Multimodal Models

Wei Sun,Linhan Cao,Yuqin Cao,Weixia Zhang,Wen Wen,Kaiwei Zhang,Zijian Chen,Fangfang Lu,Xiongkuo Min,Guangtao Zhai

Main category: cs.CV

TL;DR: This paper explores using large multimodal models for video engagement prediction, showing that models incorporating audio, visual, and language modalities perform best.

Details Motivation: Video engagement prediction is crucial for optimizing recommendation systems and content creation on short-form video platforms, but existing methods struggle to model cross-feature and cross-modality interactions effectively. Method: Two LMMs, VideoLLaMA2 (audio, visual, and language modalities) and Qwen2.5-VL (visual and language modalities), were trained on the SnapUGC dataset and evaluated against state-of-the-art baselines. Model ensembling was also explored. Result: Both VideoLLaMA2 and Qwen2.5-VL showed competitive performance in engagement prediction, with VideoLLaMA2 consistently outperforming Qwen2.5-VL. The ensemble method achieved first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge. Conclusion: Large multimodal models (LMMs) are effective for video engagement prediction, with VideoLLaMA2 outperforming Qwen2.5-VL and audio features playing a significant role; model ensembling leads to superior performance. Abstract: The rapid proliferation of user-generated content (UGC) on short-form video platforms has made video engagement prediction increasingly important for optimizing recommendation systems and guiding content creation. However, this task remains challenging due to the complex interplay of factors such as semantic content, visual quality, audio characteristics, and user background. Prior studies have leveraged various types of features from different modalities, such as visual quality, semantic content, background sound, etc., but often struggle to effectively model their cross-feature and cross-modality interactions. In this work, we empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. We adopt two representative LMMs: VideoLLaMA2, which integrates audio, visual, and language modalities, and Qwen2.5-VL, which models only visual and language modalities. 
Specifically, VideoLLaMA2 jointly processes key video frames, text-based metadata, and background sound, while Qwen2.5-VL utilizes only key video frames and text-based metadata. Trained on the SnapUGC dataset, both models demonstrate competitive performance against state-of-the-art baselines, showcasing the effectiveness of LMMs in engagement prediction. Notably, VideoLLaMA2 consistently outperforms Qwen2.5-VL, highlighting the importance of audio features in engagement prediction. By ensembling two types of models, our method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction. The code is available at https://github.com/sunwei925/LMM-EVQA.git.

[317] Understanding the Risks of Asphalt Art on the Reliability of Surveillance Perception Systems

Jin Ma,Abyad Enan,Long Cheng,Mashrur Chowdhury

Main category: cs.CV

TL;DR: Artistic asphalt patterns can degrade the performance of vision-based pedestrian detection systems, with adversarially designed art capable of deliberately hiding pedestrians or creating false detections.

Details Motivation: Artistic crosswalks introduced to enhance pedestrian visibility and safety may interfere with vision-based surveillance systems. This study aims to understand the impact of such asphalt art on pedestrian detection performance. Method: The researchers constructed realistic crosswalk scenarios by compositing various asphalt art patterns into a fixed surveillance scene. They evaluated the performance of a pretrained vision-based object detection model under both benign and adversarial conditions. Result: Simple, color-based asphalt art designs had minimal impact on detection performance. However, complex artistic patterns significantly degraded the model's ability to detect pedestrians. Adversarial asphalt art was found capable of either concealing real pedestrians or generating false detections. Conclusion: This study concludes that complex artistic asphalt patterns, especially those with high visual salience, can significantly degrade the performance of vision-based pedestrian detection models. Adversarially designed asphalt art could even be exploited to hide real pedestrians or create false detections, highlighting a potential vulnerability in urban surveillance systems. Abstract: Artistic crosswalks featuring asphalt art, introduced by different organizations in recent years, aim to enhance the visibility and safety of pedestrians. However, their visual complexity may interfere with surveillance systems that rely on vision-based object detection models. In this study, we investigate the impact of asphalt art on pedestrian detection performance of a pretrained vision-based object detection model. We construct realistic crosswalk scenarios by compositing various street art patterns into a fixed surveillance scene and evaluate the model's performance in detecting pedestrians on asphalt-arted crosswalks under both benign and adversarial conditions. 
A benign case refers to pedestrian crosswalks painted with existing normal asphalt art, whereas an adversarial case involves digitally crafted or altered asphalt art perpetrated by an attacker. Our results show that while simple, color-based designs have minimal effect, complex artistic patterns, particularly those with high visual salience, can significantly degrade pedestrian detection performance. Furthermore, we demonstrate that adversarially crafted asphalt art can be exploited to deliberately obscure real pedestrians or generate non-existent pedestrian detections. These findings highlight a potential vulnerability in urban vision-based pedestrian surveillance systems and underscore the importance of accounting for environmental visual variations when designing robust pedestrian perception models.

[318] Precision-Aware Video Compression for Reducing Bandwidth Requirements in Video Communication for Vehicle Detection-Based Applications

Abyad Enan,Jon C Calhoun,Mashrur Chowdhury

Main category: cs.CV

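TL;DR: The PAVC framework dynamically adjusts the video compression level, improving vehicle detection performance and reducing communication requirements under bandwidth constraints.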

Details Motivation: Limited bandwidth can bottleneck real-time applications in intelligent transportation systems, and conventional lossy video compression reduces bandwidth requirements at the cost of video quality and vehicle detection accuracy. A method that dynamically adjusts the compression level is therefore needed to balance bandwidth and detection accuracy. Method: The study uses a framework called PAVC that dynamically adjusts the compression level of transmitted video based on current weather and lighting conditions, and evaluates its effect on vehicle detection accuracy experimentally. Result: PAVC improves vehicle detection accuracy by up to 13%, reduces bandwidth requirements by up to 8.23x in areas with moderate bandwidth availability, and reduces them by up to 72x in areas with severely limited bandwidth. Conclusion: By dynamically adjusting the video compression level, PAVC substantially reduces bandwidth requirements while preserving vehicle detection performance, demonstrating its effectiveness for bandwidth-constrained intelligent transportation systems. Abstract: Computer vision has become a popular tool in intelligent transportation systems (ITS), enabling various applications through roadside traffic cameras that capture video and transmit it in real time to computing devices within the same network. The efficiency of this video transmission largely depends on the available bandwidth of the communication system. However, limited bandwidth can lead to communication bottlenecks, hindering the real-time performance of ITS applications. To mitigate this issue, lossy video compression techniques can be used to reduce bandwidth requirements, at the cost of degrading video quality. This degradation can negatively impact the accuracy of applications that rely on real-time vehicle detection. Additionally, vehicle detection accuracy is influenced by environmental factors such as weather and lighting conditions, suggesting that compression levels should be dynamically adjusted in response to these variations. In this work, we utilize a framework called Precision-Aware Video Compression (PAVC), where a roadside video camera captures footage of vehicles on roadways, compresses videos, and then transmits them to a processing unit, running a vehicle detection algorithm for safety-critical applications, such as real-time collision risk assessment. The system dynamically adjusts the video compression level based on current weather and lighting conditions to maintain vehicle detection accuracy while minimizing bandwidth usage.
Our results demonstrate that PAVC improves vehicle detection accuracy by up to 13% and reduces communication bandwidth requirements by up to 8.23x in areas with moderate bandwidth availability. Moreover, in locations with severely limited bandwidth, PAVC reduces bandwidth requirements by up to 72x while preserving vehicle detection performance.
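The condition-dependent choice of compression level described in this abstract can be illustrated with a small lookup. This is only a sketch under stated assumptions: the `PROFILE` table, its numbers, and the function name `select_compression_level` are hypothetical and are not taken from the PAVC paper, which does not specify its selection rule here.

```python
from typing import Dict, List, Tuple

# Hypothetical offline profile: for each (weather, lighting) condition, a list of
# (compression_level, measured_detection_accuracy) pairs. A higher level means
# stronger compression and lower bandwidth. All numbers are made up.
PROFILE: Dict[Tuple[str, str], List[Tuple[int, float]]] = {
    ("clear", "day"): [(10, 0.95), (20, 0.94), (30, 0.92), (40, 0.85)],
    ("rain", "night"): [(10, 0.90), (20, 0.84), (30, 0.70), (40, 0.55)],
}


def select_compression_level(weather: str, lighting: str, min_accuracy: float) -> int:
    """Return the strongest compression level whose profiled accuracy meets the target.

    Falls back to the weakest profiled level when no level meets the target.
    """
    candidates = PROFILE[(weather, lighting)]
    acceptable = [level for level, acc in candidates if acc >= min_accuracy]
    if acceptable:
        return max(acceptable)
    return min(level for level, _ in candidates)
```

With this made-up profile, a clear daytime scene tolerates much stronger compression than a rainy night scene at the same accuracy target, which is the intuition behind adjusting compression to current conditions.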

[319] MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming

Shuo Wang,Yongcai Wang,Wanting Li,Yucheng Wang,Maiyue Chen,Kaihui Wang,Zhizhong Su,Xudong Cai,Yeying Jin,Deying Li,Zhaoxin Fan

Main category: cs.CV

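TL;DR: MonoDream learns a unified navigation representation from monocular input, supervised by latent panoramic dreaming tasks, and effectively improves navigation performance.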

Details Motivation: In real-world deployments, panoramic RGB and depth sensors can be costly or less accessible, so an efficient navigation method based on monocular input is needed. Method: MonoDream introduces a Unified Navigation Representation (UNR) and Latent Panoramic Dreaming (LPD) tasks to enable more reliable action prediction. Result: MonoDream performs well on multiple VLN benchmarks, consistently improving monocular navigation performance and narrowing the gap with methods that use panoramic RGB-D information. Conclusion: MonoDream is a lightweight VLA framework that learns a unified navigation representation from monocular input, supervised by latent panoramic dreaming tasks, which improves monocular navigation performance and narrows the gap with panoramic agents. Abstract: Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.

[320] ReMoMask: Retrieval-Augmented Masked Motion Generation

Zhengdao Li,Siheng Wang,Zeyu Zhang,Hao Tang

Main category: cs.CV

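TL;DR: ReMoMask is a new text-to-motion generation framework that addresses several problems of current methods through three key innovations and achieves state-of-the-art performance on standard benchmarks.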

Details Motivation: Current text-to-motion generation methods face limited diversity, error accumulation, and physical implausibility in generative models, as well as diffusion inertia, partial-mode collapse, and asynchronous artifacts in retrieval-augmented generation methods, so a new approach is needed. Method: The ReMoMask framework comprises three innovations: 1) a Bidirectional Momentum Text-Motion Model that decouples negative sample scale from batch size via momentum queues; 2) a Semantic Spatio-temporal Attention mechanism that enforces biomechanical constraints during part-level fusion; 3) RAG-Classifier-Free Guidance that incorporates minor unconditional generation. Result: ReMoMask achieves state-of-the-art performance on standard benchmarks, improving FID scores on HumanML3D and KIT-ML by 3.88% and 10.97%, respectively, over the previous state-of-the-art method RAG-T2M. Conclusion: ReMoMask effectively addresses limited diversity, error accumulation, and physical implausibility in current text-to-motion generation, as well as diffusion inertia, partial-mode collapse, and asynchronous artifacts in retrieval-augmented methods, and achieves state-of-the-art performance. Abstract: Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: Generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.
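The abstract does not detail how the momentum queues work, so the following is a generic sketch of the idea in the style of momentum-contrast training, not ReMoMask's implementation. The class `MomentumQueue`, the function `momentum_update`, and the default coefficient `m = 0.99` are illustrative assumptions. The point it shows is that the number of stored negatives is set by the queue size, independent of the batch size.

```python
from collections import deque
from typing import Deque, List

Vector = List[float]


class MomentumQueue:
    """FIFO store of past key embeddings, reused as negatives for contrastive retrieval."""

    def __init__(self, max_size: int) -> None:
        # deque(maxlen=...) silently drops the oldest entries once it is full.
        self.keys: Deque[Vector] = deque(maxlen=max_size)

    def enqueue(self, batch_keys: List[Vector]) -> None:
        """Append the newest batch of key embeddings."""
        self.keys.extend(batch_keys)

    def negatives(self) -> List[Vector]:
        """Return all stored keys, oldest first."""
        return list(self.keys)


def momentum_update(key_params: Vector, query_params: Vector, m: float = 0.99) -> Vector:
    """Exponential moving average of parameters: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

Because the key encoder changes slowly under the moving-average update, embeddings that were enqueued several steps ago remain roughly comparable to fresh ones, which is what makes a long queue of negatives usable.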

[321] Evaluating Variance in Visual Question Answering Benchmarks

Nikitha SR

Main category: cs.CV

TL;DR: This paper identifies limitations in current evaluation practices for MLLMs in visual question answering tasks and proposes variance-aware methodologies for more robust development.

Details Motivation: The motivation stems from the observation that the evaluation of MLLMs on VQA benchmarks often relies on point estimates, overlooking significant performance variance due to various factors. Method: The authors systematically analyze the impact of factors such as training seed, framework non-determinism, model scale, and extended instruction finetuning on performance variability across 14 VQA benchmarks. They also explore Cloze-style evaluation as an alternative assessment strategy. Result: The study highlights the limitations of current evaluation practices and proposes variance-aware methodologies for more reliable MLLM development. Conclusion: The paper concludes that current evaluation practices for MLLMs in VQA tasks are limited and advocates for variance-aware methodologies to enhance robustness and reliability. Abstract: Multimodal large language models (MLLMs) have emerged as powerful tools for visual question answering (VQA), enabling reasoning and contextual understanding across visual and textual modalities. Despite their advancements, the evaluation of MLLMs on VQA benchmarks often relies on point estimates, overlooking the significant variance in performance caused by factors such as stochastic model outputs, training seed sensitivity, and hyperparameter configurations. This paper critically examines these issues by analyzing variance across 14 widely used VQA benchmarks, covering diverse tasks such as visual reasoning, text understanding, and commonsense reasoning. We systematically study the impact of training seed, framework non-determinism, model scale, and extended instruction finetuning on performance variability. Additionally, we explore Cloze-style evaluation as an alternate assessment strategy, studying its effectiveness in reducing stochasticity and improving reliability across benchmarks. 
Our findings highlight the limitations of current evaluation practices and advocate for variance-aware methodologies to foster more robust and reliable development of MLLMs.
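To make the contrast between point estimates and variance-aware reporting concrete, here is a minimal sketch that summarizes one benchmark score across several training seeds. The helper names and the overlap rule in `gap_is_within_noise` are assumptions chosen for illustration; the paper does not prescribe this particular test.

```python
import statistics
from typing import Dict


def summarize_runs(scores_by_seed: Dict[int, float]) -> Dict[str, float]:
    """Summarize one benchmark's scores across seeds (needs at least two runs)."""
    scores = list(scores_by_seed.values())
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


def gap_is_within_noise(mean_a: float, std_a: float, mean_b: float, std_b: float) -> bool:
    """Crude heuristic: the gap between two models is no larger than their combined spread."""
    return abs(mean_a - mean_b) <= std_a + std_b
```

Reporting the mean together with the spread, instead of a single run, makes it easier to see when a claimed improvement between two models is smaller than the seed-to-seed variation.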

[322] PMGS: Reconstruction of Projectile Motion across Large Spatiotemporal Spans via 3D Gaussian Splatting

Yijun Xu,Jingrui Zhang,Yuhan Chen,Dingwen Wang,Lei Yu,Chu He

Main category: cs.CV

TL;DR: This study proposes PMGS for reconstructing Projectile Motion via 3D Gaussian Splatting, demonstrating superior performance in high-speed nonlinear rigid motion reconstruction.

Details Motivation: Modeling complex rigid motion across large spatiotemporal spans is an unresolved challenge in dynamic reconstruction, with existing paradigms confined to short-term, small-scale deformation and limited physical consistency. Method: PMGS uses a two-stage workflow: Target Modeling through dynamic scene decomposition and improved point density control, and Motion Recovery by learning per-frame SE(3) poses. An acceleration consistency constraint, a dynamic simulated annealing strategy, and a Kalman fusion scheme are introduced. Result: The PMGS method achieves object-centralized reconstruction and restores full motion sequences, showing superior performance in reconstructing high-speed nonlinear rigid motion. Conclusion: PMGS demonstrates superior performance in reconstructing high-speed nonlinear rigid motion compared to mainstream dynamic methods. Abstract: Modeling complex rigid motion across large spatiotemporal spans remains an unresolved challenge in dynamic reconstruction. Existing paradigms are mainly confined to short-term, small-scale deformation and offer limited consideration for physical consistency. This study proposes PMGS, focusing on reconstructing Projectile Motion via 3D Gaussian Splatting. The workflow comprises two stages: 1) Target Modeling: achieving object-centralized reconstruction through dynamic scene decomposition and an improved point density control; 2) Motion Recovery: restoring full motion sequences by learning per-frame SE(3) poses. We introduce an acceleration consistency constraint to bridge Newtonian mechanics and pose estimation, and design a dynamic simulated annealing strategy that adaptively schedules learning rates based on motion states. Furthermore, we devise a Kalman fusion scheme to optimize error accumulation from multi-source observations to mitigate disturbances. Experiments show PMGS's superior performance in reconstructing high-speed nonlinear rigid motion compared to mainstream dynamic methods.
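One way to read the acceleration consistency constraint: under ideal projectile motion the acceleration is constant, so second finite differences of uniformly sampled positions should agree across frames. The sketch below penalizes their spread around the mean. It is an assumed formulation for illustration only; the function names are hypothetical and the actual loss used by PMGS may differ.

```python
from typing import List, Sequence

Vec3 = Sequence[float]


def finite_difference_accelerations(positions: List[Vec3], dt: float) -> List[List[float]]:
    """Central second differences (p[t+1] - 2*p[t] + p[t-1]) / dt^2 for uniformly sampled positions."""
    accs = []
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        accs.append([(n - 2.0 * c + p) / (dt * dt) for p, c, n in zip(prev, cur, nxt)])
    return accs


def acceleration_consistency_penalty(positions: List[Vec3], dt: float) -> float:
    """Mean squared deviation of per-frame accelerations from their average.

    Returns 0.0 when fewer than three positions are given, since no
    acceleration can be estimated.
    """
    accs = finite_difference_accelerations(positions, dt)
    if not accs:
        return 0.0
    mean = [sum(a[i] for a in accs) / len(accs) for i in range(3)]
    total = sum(sum((a[i] - mean[i]) ** 2 for i in range(3)) for a in accs)
    return total / len(accs)
```

For a trajectory that follows a parabola exactly, the penalty is zero up to floating-point error, while a single misplaced position raises it sharply, which is the behaviour one would want from a physics-based regularizer on per-frame poses.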

[323] MedVLThinker: Simple Baselines for Multimodal Medical Reasoning

Xiaoke Huang,Juncheng Wu,Hui Liu,Xianfeng Tang,Yuyin Zhou

Main category: cs.CV

TL;DR: MedVLThinker introduces open recipes for medical reasoning models, showing that RLVR and text-only training can achieve superior results.

Details Motivation: The absence of open and reproducible recipes for building reasoning-centric medical models hinders research and comparison, prompting the need for MedVLThinker. Method: MedVLThinker uses systematic data curation and two training paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). Result: RLVR outperforms SFT, and training on text-only data provides a greater boost than image-text data. The 7B model sets a new state-of-the-art, and the 32B model matches GPT-4o performance. Conclusion: MedVLThinker provides an open foundation for multimodal medical reasoning research, demonstrating that training on text-only data can outperform multimodal data under the RLVR framework. Abstract: Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to "think before responding" via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data.
Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
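A reward "based on final answer correctness" can be as simple as checking the model's final multiple-choice letter against the gold label. The sketch below assumes, purely for illustration, that the model wraps its final choice in `<answer>...</answer>` tags; that output format, the regular expression, and the function names are hypothetical and are not taken from the MedVLThinker release.

```python
import re
from typing import Optional

# Assumed output convention: the final choice appears as <answer>X</answer>.
ANSWER_PATTERN = re.compile(r"<answer>\s*([A-Za-z])\s*</answer>")


def extract_choice(response: str) -> Optional[str]:
    """Return the last tagged answer letter in upper case, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def verifiable_reward(response: str, gold_choice: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches the gold label, else 0.0."""
    predicted = extract_choice(response)
    if predicted is not None and predicted == gold_choice.upper():
        return 1.0
    return 0.0
```

Because the reward depends only on the final answer, it can be computed automatically for any question with a known label, without judging the intermediate reasoning text.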

[324] Raw Data Matters: Enhancing Prompt Tuning by Internal Augmentation on Vision-Language Models

Haoyang Li,Liang Wang,Chao Wang,Siyu Zhou,Jing Jiang,Yan Peng,Guodong Long

Main category: cs.CV

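TL;DR: This paper proposes AugPT, a prompt tuning method that improves model performance through internal augmentation and a gating mechanism, without relying on external knowledge.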

Details Motivation: Existing CLIP-based prompt tuning methods incur high data collection and processing costs and make limited use of features in the image modality; this work aims to address both issues. Method: The paper proposes a new prompt tuning method based on self-supervised augmentation and a gating mechanism built on a consensus test. Result: Extensive experiments validate that AugPT improves both model performance and generalization capability. Conclusion: AugPT is a prompt tuning method that does not rely on external knowledge and improves model performance and generalization through internal augmentation. Abstract: For CLIP-based prompt tuning, introducing more data as additional knowledge to enhance the fine-tuning process has proven to be an effective approach. Existing data amplification strategies for prompt tuning typically rely on external knowledge (e.g., large language models or pre-structured knowledge bases), resulting in higher costs for data collection and processing, while generally ignoring further utilization of features in the image modality. To address this, we propose Augmentation-driven Prompt Tuning (AugPT), a self-contained distillation-based prompt tuning approach using only internal augmentation on the raw dataset to better exploit known features. Specifically, AugPT employs self-supervised augmentation on unlabeled images in the training set, and introduces a novel gating mechanism based on a consensus test, reusing the pre-trained prompt tuning backbone model to spontaneously filter noisy samples, further enhancing the quality of augmented views. Extensive experiments validate that AugPT simultaneously enhances model performance and generalization capability without using appended external knowledge. The code of AugPT is available at: https://github.com/JREion/AugPT.

eess.IV [Back]

[325] ReCoSeg++: Extended Residual-Guided Cross-Modal Diffusion for Brain Tumor Segmentation

Sara Yavari,Rahul Nitin Pandya,Jacob Furst

Main category: eess.IV

TL;DR: A semi-supervised, two-stage framework improves brain tumor segmentation on the BraTS 2021 dataset, achieving high Dice and IoU scores while enhancing scalability for multi-center MRI data.

Details Motivation: Accurate segmentation of brain tumors in MRI scans is crucial for clinical diagnosis and treatment planning. However, existing methods may struggle with scalability and accuracy on larger, more heterogeneous datasets like BraTS 2021. This work aims to address these challenges by extending the ReCoSeg approach without requiring ground-truth masks for segmentation. Method: The method involves a two-stage framework: first, a residual-guided DDPM synthesizes T1ce images from other modalities and generates residual maps as spatial priors. In the second stage, a lightweight U-Net uses the residual maps along with T1, T2, and FLAIR modalities for segmentation. Slice-level filtering and thresholding strategies are also employed to handle dataset variability. Result: The proposed method achieves a Dice score of 93.02% and an IoU of 86.7% for whole tumor segmentation on BraTS 2021, outperforming the ReCoSeg baseline on BraTS 2020 (Dice: 91.7%, IoU: 85.3%). Conclusion: The proposed semi-supervised framework enhances brain tumor segmentation performance on the BraTS 2021 dataset, achieving improved Dice and IoU scores compared to the ReCoSeg baseline on BraTS 2020, and demonstrates better accuracy and scalability for real-world, multi-center MRI datasets. Abstract: Accurate segmentation of brain tumors in MRI scans is critical for clinical diagnosis and treatment planning. We propose a semi-supervised, two-stage framework that extends the ReCoSeg approach to the larger and more heterogeneous BraTS 2021 dataset, while eliminating the need for ground-truth masks for the segmentation objective. In the first stage, a residual-guided denoising diffusion probabilistic model (DDPM) performs cross-modal synthesis by reconstructing the T1ce modality from FLAIR, T1, and T2 scans. The residual maps, capturing differences between predicted and actual T1ce images, serve as spatial priors to enhance downstream segmentation. 
In the second stage, a lightweight U-Net takes as input the concatenation of residual maps, computed as the difference between real T1ce and synthesized T1ce, with T1, T2, and FLAIR modalities to improve whole tumor segmentation. To address the increased scale and variability of BraTS 2021, we apply slice-level filtering to exclude non-informative samples and optimize thresholding strategies to balance precision and recall. Our method achieves a Dice score of $93.02\%$ and an IoU of $86.7\%$ for whole tumor segmentation on the BraTS 2021 dataset, outperforming the ReCoSeg baseline on BraTS 2020 (Dice: $91.7\%$, IoU: $85.3\%$), and demonstrating improved accuracy and scalability for real-world, multi-center MRI datasets.
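The residual map and the two reported metrics can be illustrated on a toy example. This sketch works on flattened lists of pixel values and uses the absolute difference as the residual, which is an assumption: the abstract only says the residual is the difference between real and synthesized T1ce. The function names and the threshold `tau` are likewise illustrative.

```python
from typing import List, Tuple


def residual_map(real: List[float], synthesized: List[float]) -> List[float]:
    """Per-pixel absolute difference between the real and the synthesized image."""
    return [abs(r - s) for r, s in zip(real, synthesized)]


def binarize(values: List[float], tau: float) -> List[int]:
    """Mark pixels whose value exceeds the threshold tau."""
    return [1 if v > tau else 0 for v in values]


def dice_and_iou(pred: List[int], target: List[int]) -> Tuple[float, float]:
    """Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B| for binary masks.

    Two empty masks are treated as a perfect match.
    """
    inter = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    dice = 2.0 * inter / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU). Scores averaged over many cases, like those quoted in the abstract, need not satisfy this identity exactly.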

[326] Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation

Fenghe Tang,Bingkun Nian,Jianrui Ding,Wenxin Ma,Quan Quan,Chengqi Dong,Jie Yang,Wei Liu,S. Kevin Zhou

Main category: eess.IV

TL;DR: The paper introduces Mobile U-ViT, a lightweight and efficient model for mobile medical image segmentation that achieves high performance across various datasets while maintaining computational efficiency.

Details Motivation: The motivation is to address the performance gap of existing mobile models on medical image tasks due to the difference in information density between natural and medical images, while also combining computational efficiency with domain-specific architectural advantages. Method: The paper proposes Mobile U-ViT, a mobile model tailored for medical image segmentation. It uses ConvUtr for hierarchical patch embedding, a Large-kernel Local-Global-Local (LGL) block for information exchange, and a lightweight transformer bottleneck with a cascaded decoder for dense prediction. Result: The proposed architecture achieves state-of-the-art performance on eight public 2D and 3D medical image datasets, with efficient execution and strong generalization, including zero-shot testing on four unseen datasets. Conclusion: The paper concludes that the proposed Mobile U-ViT architecture is an efficient and powerful solution for mobile medical image analysis, achieving state-of-the-art performance while maintaining low computational demands. Abstract: In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster.
To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient yet powerful and generalizable solution for mobile medical image analysis. Code is available at https://github.com/FengheTan9/Mobile-U-ViT.